RAG Architecture: Components and Data Flow
Part of the RAG series.
A RAG system has two main phases: an offline indexing phase and an online query phase. The indexing phase turns your documents into searchable chunks; the query phase uses those chunks to answer user questions. Here we break down the components and how data moves through them.
Indexing Phase (Offline)
In the indexing phase you take raw documents and build a searchable index. Typical steps:
- Load and parse. Read PDFs, HTML, Markdown, etc. and extract text (and optionally structure like headings or tables).
- Chunk. Split text into chunks. Chunk size and strategy (fixed length, sentence, paragraph, or semantic) greatly affect retrieval quality. See Chunking Strategies for RAG.
- Embed. Run each chunk through an embedding model to get a vector. See Embeddings in RAG.
- Store. Write chunk text and vectors (and any metadata) into a vector store or search engine. See Vector Databases for RAG.
Optionally, you can also build a keyword or sparse index (e.g. BM25) for hybrid search. Indexing is usually run in batch or on a schedule; near-real-time ingestion is a separate design choice.
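The indexing steps above can be sketched in a few lines of Python. This is a minimal illustration, not a real pipeline: `chunk` uses naive fixed-length splitting, `embed` is a hash-based stand-in for an actual embedding model, and the "vector store" is just a list.

```python
import hashlib
import math

def chunk(text, size=200):
    """Fixed-length chunking (the simplest of the strategies above)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text, dims=8):
    """Stand-in embedding: a deterministic hash-based unit vector.
    A real system would call an embedding model here."""
    digest = hashlib.sha256(text.encode()).digest()
    vec = [b / 255 for b in digest[:dims]]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# "Vector store" as a plain list of (text, vector) records.
vector_store = []
for doc in ["First document text...", "Second document text..."]:
    for c in chunk(doc):
        vector_store.append({"text": c, "vector": embed(c)})
```

The key invariant is the last step: whatever embedding function is used here must be the exact one used at query time.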
Query Phase (Online)
When a user asks a question, the system runs the query phase:
- Embed the query. Use the same embedding model as in indexing to get a query vector.
- Retrieve. Run similarity search (and optionally keyword search) to get the top-k chunks. You may use a single vector index or a hybrid of dense and sparse.
- Rerank (optional). Pass the top-k to a reranker to get a better ordering or a smaller set. This shrinks the context and often improves relevance.
- Build the prompt. Place the chosen chunks and the user question into a prompt template. Often you add instructions like "answer only from the context" or "cite sources."
- Generate. Call the LLM with that prompt. Return the answer and, if needed, the chunk IDs or snippets as citations.
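The query-phase steps above can be sketched end to end. Everything here is a simplified stand-in: the hash-based `embed` mimics an embedding model, the list-plus-cosine `retrieve` mimics a vector store's search API, and the final LLM call is left as a comment.

```python
import hashlib
import math

def embed(text, dims=8):
    # Stand-in embedding; index and query must use the same function.
    digest = hashlib.sha256(text.encode()).digest()
    vec = [b / 255 for b in digest[:dims]]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    # Dot product suffices because the vectors are unit-normalized.
    return sum(x * y for x, y in zip(a, b))

store = [{"text": t, "vector": embed(t)} for t in
         ["RAG retrieves chunks before generation.",
          "Chunk size affects retrieval quality.",
          "BM25 is a sparse retrieval method."]]

def retrieve(query, k=2):
    q = embed(query)
    ranked = sorted(store, key=lambda r: cosine(q, r["vector"]), reverse=True)
    return ranked[:k]

def build_prompt(query, chunks):
    context = "\n---\n".join(c["text"] for c in chunks)
    return ("Answer only from the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")

question = "What affects retrieval quality?"
prompt = build_prompt(question, retrieve(question))
# prompt would now be sent to the LLM (hosted API or self-hosted).
```

With a real embedding model, `retrieve` would surface semantically related chunks; the hash-based stand-in only demonstrates the shape of the flow.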
Data Flow (Conceptual)
# Indexing
documents -> chunk -> embed -> vector_store
# Query
query -> embed -> vector_store.search(k) -> [chunks]
-> (optional) rerank -> [top chunks]
-> prompt(query, chunks) -> LLM -> answer
Main Components
- Embedding model. Maps text to vectors. Must be consistent between index and query. It can be a dedicated encoder or share a model family with the LLM (some instruction-tuned models are used for both).
- Vector store. Holds vectors and supports k-NN or approximate k-NN. Can be a dedicated vector DB (Pinecone, Weaviate, etc.) or a search engine with vector support (Elasticsearch, OpenSearch).
- LLM. Takes the augmented prompt and generates the answer. Hosted APIs (OpenAI, Anthropic, etc.) or self-hosted.
- Optional: reranker. A small model that scores (query, chunk) pairs and reorders or filters chunks. It often yields better quality than "raw" top-k from the vector index alone.
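To make the reranker's role concrete, here is a toy sketch that scores (query, chunk) pairs by word overlap. A real reranker would be a cross-encoder model producing a learned score per pair, but the reorder-and-truncate shape is the same.

```python
def rerank(query, chunks, top_n=2):
    """Toy reranker: scores (query, chunk) pairs by word overlap.
    A real reranker would run a cross-encoder model on each pair."""
    q_words = set(query.lower().split())

    def score(chunk):
        c_words = set(chunk.lower().split())
        return len(q_words & c_words) / (len(c_words) or 1)

    # Reorder by score, then keep only the best top_n chunks.
    return sorted(chunks, key=score, reverse=True)[:top_n]

candidates = [
    "Vector stores support approximate k-NN search.",
    "Rerankers score query and chunk pairs.",
    "Chunking strategy affects retrieval.",
]
best = rerank("how do rerankers score chunk pairs", candidates)
```

Note that the reranker sees the query and each chunk together, which is why it can outperform the retriever: the vector index scored query and chunks independently.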
Putting this together: indexing turns docs into vectors in a store; queries turn the user question into a vector, fetch similar chunks, optionally rerank, then send chunks + question to the LLM for the final answer. For more on each piece, see the linked posts in this series.
Get in touch
Questions about RAG or AI knowledge systems? Tell us about your project.