Evaluating RAG Systems: Metrics and Practices
Part of the RAG series.
To improve a RAG system you need to know how well it is doing. Evaluation covers retrieval (did we fetch the right chunks?), generation (is the answer correct and grounded?) and end-to-end behaviour (latency, cost, user satisfaction). This post outlines metrics and practices that are commonly used.
Retrieval Metrics
Retrieval is typically evaluated with:
- Recall@k. Of the chunks that are relevant for the query, what fraction appear in the top-k results? Higher is better. You need labelled (query, relevant_chunk_ids) pairs.
- Precision@k. Of the top-k results, how many are relevant? Useful when you care about not polluting context with junk.
- MRR (Mean Reciprocal Rank). For each query, 1 / (rank of first relevant result); average over queries. Good when the first relevant hit matters most.
- NDCG. Normalized Discounted Cumulative Gain; accounts for ranking order and relevance grades. Often used when you have graded relevance (e.g. 0/1/2) rather than binary.
"Relevant" usually comes from human labels or from a proxy (e.g. the chunk that contains the gold answer).
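The retrieval metrics above are straightforward to implement once you have labelled (query, relevant_chunk_ids) pairs. A minimal sketch, with illustrative function names (not from any particular framework):

```python
from typing import Sequence, Set, List


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k == 0:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / k


def mrr(ranked_lists: List[Sequence[str]], relevant_sets: List[Set[str]]) -> float:
    """Mean reciprocal rank: average of 1 / (rank of first relevant hit)."""
    total = 0.0
    for retrieved, relevant in zip(ranked_lists, relevant_sets):
        for rank, chunk_id in enumerate(retrieved, start=1):
            if chunk_id in relevant:
                total += 1.0 / rank
                break  # queries with no relevant hit contribute 0
    return total / len(ranked_lists)
```

Note that Recall@k divides by the number of relevant chunks while Precision@k divides by k, so the same hit count yields different scores depending on which question you are asking.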
Generation Metrics
For the answer produced by the LLM, common metrics include:
- Faithfulness (or groundedness). Is the answer supported by the retrieved context? Measured by NLI-based models (e.g. “does the context entail the answer?”) or by LLM-as-judge. Tracking it helps catch hallucinations and off-context claims.
- Answer relevance. Does the answer address the question? Often scored by an LLM or a learned model that compares question and answer.
- Exact match / F1. When you have a gold answer, you can use token-level exact match or F1. Useful for closed-domain QA; less so for open-ended answers.
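Exact match and token-level F1 are easy to compute against a gold answer. A minimal sketch, using a simple lowercase-and-tokenize normalization (conventions vary; SQuAD-style evaluation also strips articles and punctuation):

```python
import re
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase and split into word tokens."""
    return re.findall(r"\w+", text.lower())


def exact_match(prediction: str, gold: str) -> bool:
    """True if prediction and gold answer match after normalization."""
    return normalize(prediction) == normalize(gold)


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall against the gold answer."""
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, "Paris is the capital" scored against the gold answer "Paris" gets no exact-match credit but a non-zero F1, which is why F1 is the more forgiving metric for verbose model outputs.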
End-to-End and Ops
Beyond accuracy you typically care about:
- Latency. Time from user query to final answer. Often broken into retrieval time and generation time.
- Cost. Embedding calls, vector search, LLM tokens. Track per query or per month.
- Failure rate. Timeouts, API errors, empty retrieval. Monitor in production.
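Breaking latency into retrieval and generation time only requires timing each stage separately. A minimal sketch, with a hypothetical wrapper around whatever retriever and LLM client you use:

```python
import time
from typing import Callable, Dict, Tuple


def answer_with_timing(
    query: str,
    retriever: Callable,   # hypothetical: query -> list of context chunks
    generate: Callable,    # hypothetical: (query, context) -> answer string
) -> Tuple[str, Dict[str, float]]:
    """Run one query and record per-stage latency in seconds."""
    t0 = time.perf_counter()
    context = retriever(query)
    t1 = time.perf_counter()
    answer = generate(query, context)
    t2 = time.perf_counter()
    timings = {
        "retrieval_s": t1 - t0,
        "generation_s": t2 - t1,
        "total_s": t2 - t0,
    }
    return answer, timings
```

Logging these per-stage numbers per query is what lets you later attribute a latency regression to the vector store or to the model.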
Evaluation Frameworks
Frameworks like RAGAS, TruLens and the LlamaIndex evaluation modules provide predefined metrics (e.g. faithfulness, relevance, context precision) and harnesses to run them over (question, context, answer) triples. You can implement custom metrics on top of the same data. A typical flow:
```python
# Conceptual eval loop (retrieve, check_grounding, score_relevance,
# recall_at_k and log stand in for your own implementations)
for query, gold_answer, relevant_ids in eval_set:
    retrieved = retrieve(query, k=5)            # list of (chunk_id, text) pairs
    retrieved_ids = [cid for cid, _ in retrieved]
    context = [text for _, text in retrieved]
    answer = llm.generate(query, context)
    faithfulness = check_grounding(answer, context)
    relevance = score_relevance(query, answer)
    recall = recall_at_k(retrieved_ids, relevant_ids, k=5)
    log(query, answer, faithfulness, relevance, recall)
```

Best Practices
- Build a small eval set (queries + gold answers or relevant chunks) and re-run it when you change chunking, retrieval or prompts.
- Prefer automated metrics (faithfulness, relevance, recall) for iteration; use human evaluation for high-stakes or ambiguous cases.
- Track retrieval and generation separately so you know whether to improve chunking/retrieval or prompt/model.
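Re-running the eval set on every change is easiest if you turn it into a regression gate. A minimal sketch, with hypothetical thresholds you would tune to your own baseline:

```python
from typing import Dict, List


def eval_gate(
    results: List[Dict[str, float]],
    min_recall: float = 0.8,        # hypothetical threshold
    min_faithfulness: float = 0.9,  # hypothetical threshold
) -> bool:
    """Return True if average metrics over the eval set clear the thresholds.

    Each result dict holds the per-query scores logged by the eval loop.
    """
    avg_recall = sum(r["recall"] for r in results) / len(results)
    avg_faith = sum(r["faithfulness"] for r in results) / len(results)
    return avg_recall >= min_recall and avg_faith >= min_faithfulness
```

Wired into CI, a gate like this turns "we changed the chunking" from a judgment call into a pass/fail check against your labelled eval set.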
For running this in production and monitoring over time, see Production RAG: Best Practices and Deployment.
Get in touch
Questions about RAG or AI knowledge systems? Tell us about your project.