Evaluating RAG Systems: Metrics and Practices
Part of the RAG series.
To improve a RAG system you need to know how well it is doing. Evaluation covers retrieval (did we fetch the right chunks?), generation (is the answer correct and grounded?) and end-to-end behaviour (latency, cost, user satisfaction). This post outlines metrics and practices that are commonly used.
Retrieval Metrics
Retrieval is typically evaluated with:
- Recall@k. Of the chunks that are relevant for the query, what fraction appear in the top-k results? Higher is better. You need labelled (query, relevant_chunk_ids) pairs.
- Precision@k. Of the top-k results, how many are relevant? Useful when you care about not polluting context with junk.
- MRR (Mean Reciprocal Rank). For each query, 1 / (rank of first relevant result); average over queries. Good when the first relevant hit matters most.
- NDCG. Normalized Discounted Cumulative Gain; accounts for ranking order and relevance grades. Often used when you have graded relevance (e.g. 0/1/2) rather than binary.
"Relevant" usually comes from human labels or from a proxy (e.g. the chunk that contains the gold answer).
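The retrieval metrics above are straightforward to implement once you have labelled (query, relevant_chunk_ids) pairs. A minimal sketch, with illustrative function names (not from any particular framework):

```python
from typing import Sequence, Set, List


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k == 0:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / k


def mrr(ranked_lists: List[Sequence[str]], relevant_sets: List[Set[str]]) -> float:
    """Mean reciprocal rank: average of 1 / (rank of first relevant hit)."""
    total = 0.0
    for retrieved, relevant in zip(ranked_lists, relevant_sets):
        for rank, chunk_id in enumerate(retrieved, start=1):
            if chunk_id in relevant:
                total += 1.0 / rank
                break  # queries with no relevant hit contribute 0
    return total / len(ranked_lists)
```

Note that Recall@k divides by the number of relevant chunks while Precision@k divides by k, so the same hit count yields different scores depending on which question you are asking.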
Generation Metrics
For the answer produced by the LLM, common metrics include:
- Faithfulness (or groundedness). Is the answer supported by the retrieved context? Measured by NLI-based models (e.g. “does the context entail the answer?”) or by LLM-as-judge. Tracking it helps catch hallucinations and off-context claims.
- Answer relevance. Does the answer address the question? Often scored by an LLM or a learned model that compares question and answer.
- Exact match / F1. When you have a gold answer, you can use token-level exact match or F1. Useful for closed-domain QA; less so for open-ended answers.
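Exact match and token-level F1 are easy to compute against a gold answer. A minimal sketch, using a simple lowercase-and-tokenize normalization (conventions vary; SQuAD-style evaluation also strips articles and punctuation):

```python
import re
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase and split into word tokens."""
    return re.findall(r"\w+", text.lower())


def exact_match(prediction: str, gold: str) -> bool:
    """True if prediction and gold answer match after normalization."""
    return normalize(prediction) == normalize(gold)


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall against the gold answer."""
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, "Paris is the capital" scored against the gold answer "Paris" gets no exact-match credit but a non-zero F1, which is why F1 is the more forgiving metric for verbose model outputs.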
End-to-End and Ops
Beyond accuracy you typically care about:
- Latency. Time from user query to final answer. Often broken into retrieval time and generation time.
- Cost. Embedding calls, vector search, LLM tokens. Track per query or per month.
- Failure rate. Timeouts, API errors, empty retrieval. Monitor in production.
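Breaking latency into retrieval and generation time only requires timing each stage separately. A minimal sketch, with a hypothetical wrapper around whatever retriever and LLM client you use:

```python
import time
from typing import Callable, Dict, Tuple


def answer_with_timing(
    query: str,
    retriever: Callable,   # hypothetical: query -> list of context chunks
    generate: Callable,    # hypothetical: (query, context) -> answer string
) -> Tuple[str, Dict[str, float]]:
    """Run one query and record per-stage latency in seconds."""
    t0 = time.perf_counter()
    context = retriever(query)
    t1 = time.perf_counter()
    answer = generate(query, context)
    t2 = time.perf_counter()
    timings = {
        "retrieval_s": t1 - t0,
        "generation_s": t2 - t1,
        "total_s": t2 - t0,
    }
    return answer, timings
```

Logging these per-stage numbers per query is what lets you later attribute a latency regression to the vector store or to the model.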
Evaluation Frameworks
Frameworks like RAGAS, TruLens and the LlamaIndex evaluation modules provide predefined metrics (e.g. faithfulness, relevance, context precision) and harnesses to run them over (question, context, answer) triples. You can implement custom metrics on top of the same data. A typical flow:
```python
# Conceptual eval loop (retrieve, check_grounding, score_relevance,
# recall_at_k and log stand in for your own implementations)
for query, gold_answer, relevant_ids in eval_set:
    retrieved = retrieve(query, k=5)            # list of (chunk_id, text) pairs
    retrieved_ids = [cid for cid, _ in retrieved]
    context = [text for _, text in retrieved]
    answer = llm.generate(query, context)
    faithfulness = check_grounding(answer, context)
    relevance = score_relevance(query, answer)
    recall = recall_at_k(retrieved_ids, relevant_ids, k=5)
    log(query, answer, faithfulness, relevance, recall)
```

Best Practices
- Build a small eval set (queries + gold answers or relevant chunks) and re-run it when you change chunking, retrieval or prompts.
- Prefer automated metrics (faithfulness, relevance, recall) for iteration; use human evaluation for high-stakes or ambiguous cases.
- Track retrieval and generation separately so you know whether to improve chunking/retrieval or prompt/model.
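Re-running the eval set on every change is easiest if you turn it into a regression gate. A minimal sketch, with hypothetical thresholds you would tune to your own baseline:

```python
from typing import Dict, List


def eval_gate(
    results: List[Dict[str, float]],
    min_recall: float = 0.8,        # hypothetical threshold
    min_faithfulness: float = 0.9,  # hypothetical threshold
) -> bool:
    """Return True if average metrics over the eval set clear the thresholds.

    Each result dict holds the per-query scores logged by the eval loop.
    """
    avg_recall = sum(r["recall"] for r in results) / len(results)
    avg_faith = sum(r["faithfulness"] for r in results) / len(results)
    return avg_recall >= min_recall and avg_faith >= min_faithfulness
```

Wired into CI, a gate like this turns "we changed the chunking" from a judgment call into a pass/fail check against your labelled eval set.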
For running this in production and monitoring over time, see Production RAG: Best Practices and Deployment.
Get in touch
Questions about RAG or AI knowledge systems? Tell us about your project.