Enterprise LLM Evals · Lesson 3
RAG Evaluation
How to evaluate retrieval and answer generation separately
- The RAG failure map: where things go wrong
- Core RAG metrics and what each one catches
- Why enterprise RAG must also test access control
- Tool choices for RAG evaluation
Core idea
RAG can fail because retrieval missed the right evidence, or because generation ignored or misused evidence. Evaluate both separately.
1. The RAG Failure Map
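A RAG pipeline has four stages: query rewriting, retrieval, reranking and context packing, and generation. Each stage can fail independently. A wrong answer might mean retrieval missed the document, or it might mean the model ignored the right document it had already retrieved. You need metrics for each stage.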
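To localize a wrong answer, one option is to check the stages in pipeline order and report the first one that failed. The sketch below assumes you already have a pass/fail signal per stage (from labels or a judge); the `StageSignals` fields and the `first_failing_stage` helper are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class StageSignals:
    """Hypothetical per-example signals, one per pipeline stage."""
    rewritten_query_ok: bool    # rewritten query still expresses the user's intent
    gold_doc_retrieved: bool    # needed evidence is among the retrieved candidates
    gold_doc_in_context: bool   # evidence survived reranking and context packing
    answer_grounded: bool       # answer claims are supported by the packed context


def first_failing_stage(s: StageSignals) -> str:
    """Return the earliest stage whose signal failed, in pipeline order."""
    if not s.rewritten_query_ok:
        return "query rewriting"
    if not s.gold_doc_retrieved:
        return "retrieval"
    if not s.gold_doc_in_context:
        return "reranking/context packing"
    if not s.answer_grounded:
        return "generation/grounding"
    return "no stage failure detected"
```

Checking in pipeline order matters because a downstream failure is often just a symptom: if retrieval never fetched the evidence, a poorly grounded answer does not tell you anything new about the generator.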
2. Core RAG Metrics
| Metric | Meaning | Failure it catches |
|---|---|---|
| Context precision | Are retrieved chunks useful? | Retriever brings irrelevant noise. |
| Context recall | Did retrieval find enough needed evidence? | Correct answer was absent from context. |
| Faithfulness / groundedness | Are answer claims supported by context? | Hallucination despite correct retrieval. |
| Answer relevance | Does answer address the user query? | On-topic context, off-target response. |
| Citation correctness | Do citations support the exact claim? | Fake or decorative citations. |
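The two retrieval metrics can be illustrated with a simplified, set-based sketch. It assumes each golden example lists the chunk IDs a human marked as relevant; the function names and chunk IDs are illustrative. Dedicated libraries may define these metrics differently (for example, with LLM judges or rank weighting), so treat this as a way to build intuition rather than a reference definition.

```python
def context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved chunks that are labeled relevant."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant)
    return hits / len(retrieved_ids)


def context_recall(retrieved_ids, relevant_ids):
    """Fraction of labeled-relevant chunks that were retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0  # nothing was needed, so nothing was missed
    return len(relevant & set(retrieved_ids)) / len(relevant)


# Illustrative data: 4 chunks retrieved, 3 chunks labeled as needed.
retrieved = ["c1", "c7", "c9", "c4"]
relevant = ["c1", "c4", "c5"]
precision = context_precision(retrieved, relevant)
recall = context_recall(retrieved, relevant)
```

Here two of the four retrieved chunks are relevant (precision 0.5), and two of the three needed chunks were found (recall 2/3). Low precision points to noise in the context; low recall means the answer may have been impossible to ground.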
3. Enterprise RAG Must Also Test Access Control
For internal knowledge systems (HR policies, financial docs, legal contracts), answer quality alone is not enough. Evals must also test whether the retriever leaks documents across users, roles, departments, tenants, or projects.
This means your golden dataset should include adversarial cases where User A asks about documents only User B should see, and the system must either refuse or return only authorized results.
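An adversarial access-control case can be expressed as a small test harness. The sketch below assumes each document carries a set of allowed roles and that the retriever is a callable taking a query and the caller's role; `Doc`, `AccessCase`, `run_access_cases`, and both toy retrievers are hypothetical names, not part of any particular library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    allowed_roles: frozenset  # roles permitted to see this document


@dataclass(frozen=True)
class AccessCase:
    user_role: str
    query: str  # adversarial query targeting documents outside the role


def leaked_docs(retrieved, user_role):
    """IDs of retrieved documents the caller is not authorized to see."""
    return [d.doc_id for d in retrieved if user_role not in d.allowed_roles]


def run_access_cases(retrieve, cases):
    """Run every case; return {query: leaked doc IDs} for failing cases only."""
    failures = {}
    for case in cases:
        leaks = leaked_docs(retrieve(case.query, case.user_role), case.user_role)
        if leaks:
            failures[case.query] = leaks
    return failures


# Toy corpus and retrievers, for illustration only.
CORPUS = [
    Doc("hr-salary-bands", frozenset({"hr"})),
    Doc("eng-onboarding", frozenset({"hr", "engineering"})),
]


def naive_retrieve(query, user_role):
    return list(CORPUS)  # ignores the caller's role, so it leaks


def filtered_retrieve(query, user_role):
    return [d for d in CORPUS if user_role in d.allowed_roles]


CASES = [AccessCase("engineering", "What are the salary bands?")]
naive_failures = run_access_cases(naive_retrieve, CASES)
filtered_failures = run_access_cases(filtered_retrieve, CASES)
```

The naive retriever fails the case because it returns `hr-salary-bands` to an engineering user, while the filtered retriever passes. An empty result set counts as a pass here, which matches the "refuse or return only authorized results" requirement.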
4. Tools for RAG Evaluation
Ragas is a widely used library dedicated to RAG metrics. It provides built-in metrics for context precision, context recall, faithfulness, and answer relevance.
LangSmith and MLflow are useful when you also need traces, datasets, experiment comparison, and production monitoring alongside RAG metrics.
Sources: Ragas, "Available Metrics"; LangSmith, "RAG Evaluation Tutorial"
Q1: If the answer is wrong, what four places do you inspect?
Query rewriting (was the query transformed badly?), retrieval (did we fetch the right docs?), reranking/context packing (did we lose good docs in the middle?), and generation/grounding (did the model use the context correctly?).
Q2: What is the difference between faithfulness and correctness?
Faithfulness asks "is the answer supported by the retrieved context?" Correctness asks "is the answer actually true?" An answer can be faithful (supported by context) but incorrect (if the context was wrong), or correct but unfaithful (if the model used its own knowledge instead of the context).
Q3: Why is citation correctness separate from faithfulness?
Faithfulness checks whether claims are supported by context. Citation correctness checks whether the specific citation pointer supports the exact claim at that location. A model can be faithful but cite the wrong paragraph, or cite the right paragraph but misrepresent it.
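The distinction can be made concrete with a per-claim check. This sketch assumes that, for each claim, you know which chunks it cites and which retrieved chunks actually support it (from human labels or a judge); `score_claim` and the paragraph IDs are hypothetical.

```python
def score_claim(cited, supporting):
    """Score one claim.

    cited: set of chunk IDs the answer cites for this claim.
    supporting: set of retrieved chunk IDs that actually support the claim.
    """
    faithful = bool(supporting)  # at least one retrieved chunk backs the claim
    # Every cited chunk must be one that supports this exact claim.
    citation_correct = bool(cited) and cited <= supporting
    return {"faithful": faithful, "citation_correct": citation_correct}


# Supported by p2 but cites p5: faithful, wrong citation.
wrong_pointer = score_claim(cited={"p5"}, supporting={"p2"})

# Cites p2 but misrepresents it, so nothing supports the claim.
misrepresented = score_claim(cited={"p2"}, supporting=set())

# Supported by p2 and cites p2.
clean = score_claim(cited={"p2"}, supporting={"p2"})
```

The first case is the one a faithfulness score alone would miss: the claim is grounded, yet a reader following the citation lands on a paragraph that does not back it up.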