Embeddings in RAG: How Text Becomes Vectors

Part of the RAG series.

In RAG, you turn text (chunks and queries) into embeddings: fixed-length vectors of numbers. Similar meaning maps to similar vectors, so you can find relevant passages by searching for vectors that are "close" to the query vector. This post explains what embeddings are, how they relate to retrieval, and what to keep in mind when choosing or using them.

What Are Embeddings?

An embedding is a numerical representation of a piece of text (or sometimes an image or other input). The model maps each input to a point in a high-dimensional space (e.g. 384, 768 or 1536 dimensions). Texts that are semantically similar end up near each other; unrelated texts end up farther apart. That lets you use geometry (e.g. cosine similarity or Euclidean distance) as a proxy for relevance.
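To make "geometry as a proxy for relevance" concrete, here is a minimal sketch of cosine similarity over toy 3-dimensional vectors (real embeddings have hundreds of dimensions; the vectors and their values here are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same
    direction (similar meaning), near 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (invented values, purely illustrative)
refund_query = [0.8, 0.2, 0.1]
refund_chunk = [0.9, 0.1, 0.0]
shipping_chunk = [0.1, 0.2, 0.9]

cosine_similarity(refund_query, refund_chunk)    # high (close to 1)
cosine_similarity(refund_query, shipping_chunk)  # low
```

The same geometry works at 384 or 1536 dimensions; only the vector length changes.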

Why Embeddings Matter for RAG

RAG retrieval is usually dense retrieval: you embed the query and each candidate chunk, then return the chunks whose embeddings are closest to the query embedding. That works because "closeness" in embedding space correlates with relevance: a question like "What is the refund policy?" will be close to chunks that actually describe refunds, even if they use different words. So embeddings are what turn semantic similarity into a search problem you can solve with a vector index.
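Stripped of the vector index, dense retrieval is just "score every chunk against the query, return the best". A toy top-k sketch (brute force; a real system would use an index instead of scanning every chunk):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunks closest to the query, best first."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

With toy vectors, `top_k` surfaces the refund-like chunks and leaves the unrelated one out; a vector database does the same ranking, just approximately and at scale.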

Embedding Models

You get embeddings from an embedding model, usually a dedicated encoder (e.g. sentence-transformers, OpenAI text-embedding-3-small, Cohere embed-v3) that you apply to both queries and chunks. A few rules of thumb:

  • Use the same model for indexing and querying. Query and chunk embeddings must live in the same space.
  • Match dimension and max length to your vector store and chunk size. Some stores or APIs have limits.
  • If you care about multilingual or domain (e.g. legal, medical), pick a model that is strong in that setting or consider fine-tuning.
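The first rule (same model for indexing and querying) is easy to violate silently, because vectors from different models can have the same shape yet live in incomparable spaces. One way to guard against it is a hypothetical wrapper like the `EmbeddingIndex` below (a toy sketch, not any real library's API), which records which model produced its vectors and rejects anything else:

```python
class EmbeddingIndex:
    """Toy index that remembers which model produced its vectors and rejects
    vectors from any other model, since embeddings from different models
    are not comparable even when their dimensions happen to match."""

    def __init__(self, model_name):
        self.model_name = model_name
        self.vectors = []

    def _check(self, model_name):
        if model_name != self.model_name:
            raise ValueError(
                f"index uses {self.model_name!r}, got vector from {model_name!r}"
            )

    def add(self, vector, model_name):
        self._check(model_name)
        self.vectors.append(vector)

    def query(self, vector, model_name):
        self._check(model_name)
        return self.vectors  # a real index would rank by similarity here
```

Storing the model name alongside the index (many vector stores let you attach such metadata) turns a subtle relevance bug into a loud error.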

Typical API Shape

Most embedding APIs take a string (or list of strings) and return a vector (or list of vectors). Conceptually:

# Conceptual embedding API
vector = embed("What is the refund policy?")
# -> [0.02, -0.1, 0.5, ...]  (e.g. 1536 dims)

# Batch (for indexing)
vectors = embed([
  "Chunk 1 text...",
  "Chunk 2 text...",
])
# -> [[...], [...], ...]
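If you want something runnable to experiment with that shape, here is a toy stand-in for `embed` (hash-derived vectors with no semantics at all, purely to mimic the string-in/vector-out and list-in/list-of-vectors-out contract; the 8-dimension size is an arbitrary choice for the sketch):

```python
import hashlib

DIMS = 8  # real models use hundreds of dimensions, e.g. 384 or 1536

def embed(text_or_texts):
    """Toy embed() with the same shape as a real embedding API:
    a string in -> one vector out; a list in -> a list of vectors out.
    The values are hash-derived and carry no semantic meaning."""
    if isinstance(text_or_texts, list):
        return [embed(t) for t in text_or_texts]
    digest = hashlib.sha256(text_or_texts.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:DIMS]]
```

Swapping this stub for a real model call is the only change needed to turn a pipeline prototype into the real thing, which is why keeping the API shape identical is useful.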

Best Practices

  • Normalize embeddings if your vector store or similarity metric expects it (e.g. unit norm for cosine similarity).
  • Keep chunk length within the model's max input length. Truncation can hurt quality.
  • If you change the embedding model, re-index. Old vectors are not comparable to new ones.
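Normalization, the first practice above, is a one-liner: scale each vector to unit length so that a plain dot product equals cosine similarity (a minimal sketch, assuming non-zero vectors):

```python
import math

def normalize(vec):
    """Scale a vector to unit length, so dot product == cosine similarity.
    Assumes vec is non-zero."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]
```

Many vector stores and models can normalize for you; if yours does, doing it again is harmless, but doing it nowhere can quietly distort rankings under a dot-product metric.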

For where those vectors live and how you search them, see Vector Databases for RAG. For how chunks are created before embedding, see Chunking Strategies for RAG.

Get in touch

Questions about RAG or AI knowledge systems? Tell us about your project.