What Is RAG? Definition and Core Concepts
Part of the RAG series.
RAG is short for Retrieval-Augmented Generation. The name breaks into three ideas: retrieval, augmentation, and generation. Understanding each piece makes the whole easy to grasp.
Retrieval
Retrieval means fetching relevant pieces of information from a corpus (your documents, tickets, wikis) given a user query. You do not send the entire corpus to the model. You run a search step first and only pass the bits that match the question. In RAG, that search is usually done with vector similarity: you turn the query and your passages into embeddings and find the passages whose embeddings are closest to the query embedding.
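As a sketch of that search step, here is a toy retriever that ranks passages by cosine similarity to the query. The `embed` function is a deliberately crude stand-in (a character-frequency vector); a real system would call an embedding model instead, but the shape of the retrieval logic is the same.

```python
import math

def embed(text: str) -> list[float]:
    # Toy embedding: a character-frequency vector over a-z.
    # A real system would use a trained embedding model here.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product of the vectors divided
    # by the product of their lengths.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, passages: list[str], k: int = 5) -> list[str]:
    # Score every passage against the query and keep the top k.
    q = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]

passages = [
    "Invoices are due within 30 days.",
    "Our office cat is named Miso.",
    "Refunds are processed in 5 business days.",
]
print(retrieve("When are invoices due?", passages, k=1))
```

A vector store does essentially this, but with approximate nearest-neighbour indexes so it stays fast over millions of passages.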
Augmentation
Augmentation means enriching the model's input with that retrieved information. The model's normal input is the user question (and maybe system instructions). In RAG you add retrieved text as extra context. So the model sees something like: "Here is relevant context: [retrieved chunks]. Now answer: [question]." The model is "augmented" by this context at inference time, without any change to its weights.
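The augmentation step is just string assembly. A minimal sketch, with an illustrative template (the exact wording and separators are a design choice, not a standard):

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    # Splice the retrieved chunks into the prompt as extra context.
    # "---" is an arbitrary separator to keep chunks visually distinct.
    context = "\n---\n".join(chunks)
    return (
        "Here is relevant context:\n"
        f"{context}\n\n"
        f"Now answer: {question}"
    )

chunks = ["Invoices are due within 30 days."]
print(build_prompt("When are invoices due?", chunks))
```

Note that nothing about the model changes: the same weights see a richer input, and that alone is what "augmented" means here.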
Generation
Generation is what the language model does: it produces a natural-language answer given the augmented prompt. So the full pipeline is: retrieve relevant chunks, augment the prompt with them, then generate the answer. That is RAG in one sentence.
Core Concepts
A few concepts show up everywhere in RAG:
- Embeddings. Dense vector representations of text. Similar meaning implies similar vectors. Used for retrieval and sometimes for reranking.
- Chunks. Segments of your documents (e.g. paragraphs or fixed-size slices) that you store and retrieve. Chunk size and boundaries strongly affect quality.
- Vector store. A database that stores embeddings and supports similarity search (e.g. top-k nearest neighbours).
- Context window. The maximum amount of text you can put in one LLM call. Retrieved chunks must fit into this budget together with the question and instructions.
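The context-window budget in the last point can be made concrete. A rough sketch of packing ranked chunks until the budget runs out, using a crude 4-characters-per-token heuristic (a real system would count tokens with the model's own tokenizer):

```python
def fit_chunks(ranked_chunks: list[str], budget_tokens: int) -> list[str]:
    # Take chunks in ranked order until the token budget is spent.
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = max(1, len(chunk) // 4)  # rough token estimate
        if used + cost > budget_tokens:
            break
        selected.append(chunk)
        used += cost
    return selected

ranked = ["short chunk", "x" * 400, "another chunk"]
print(fit_chunks(ranked, budget_tokens=50))
```

This is why chunk size matters twice: it shapes retrieval quality and it determines how many results fit alongside the question and instructions.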
A Minimal Mental Model
You can think of RAG as "search then summarize": first search your docs for what matches the question, then ask the model to answer using only those search results. In code terms:
# Conceptual RAG
chunks = search(query, index, k=5)
context = "\n---\n".join(chunks)
answer = llm(f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:")

Real systems add chunking, embedding models, scoring, reranking and prompt design, but this is the core idea. For more depth, see RAG Introduction and RAG Architecture.
Get in touch
Questions about RAG or AI knowledge systems? Tell us about your project.