Enterprise LLM Evals · Lesson 3

RAG Evaluation

How to evaluate retrieval and answer generation separately

What you'll learn
  1. The RAG failure map: where things go wrong
  2. Core RAG metrics and what each one catches
  3. Why enterprise RAG must also test access control
  4. Tool choices for RAG evaluation

Core idea

RAG can fail because retrieval missed the right evidence, or because generation ignored or misused evidence. Evaluate both separately.

1. The RAG Failure Map

User question
  |
  v
Query rewriting -> Retrieval -> Reranking -> Context packing -> Generation -> Citation

What to evaluate at each stage:
  Query rewriting: query eval
  Retrieval: precision, recall
  Reranking: rank eval
  Context packing: lost-in-middle
  Generation: faithfulness, answer relevance
  Citation: citation eval

Each stage can fail independently. A wrong answer might mean retrieval missed the document, or it might mean the model ignored the right document it already retrieved. You need metrics for each stage.
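One way to make this stage-by-stage inspection concrete is a small triage function over logged traces. The sketch below assumes you log which documents were retrieved and which survived reranking and context packing, and that a labeler has marked the required evidence. The RagTrace fields and the stage names returned are hypothetical, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class RagTrace:
    """One logged RAG interaction plus human labels (all field names are hypothetical)."""
    gold_doc_ids: set[str]        # documents a labeler marked as required evidence
    retrieved_doc_ids: list[str]  # what the retriever returned, in rank order
    packed_doc_ids: list[str]     # what survived reranking and context packing
    answer_is_correct: bool       # labeled correctness of the final answer


def triage(trace: RagTrace) -> str:
    """Return the earliest stage that plausibly explains a wrong answer."""
    if trace.answer_is_correct:
        return "ok"
    if not trace.gold_doc_ids & set(trace.retrieved_doc_ids):
        return "retrieval"             # the needed evidence never came back
    if not trace.gold_doc_ids & set(trace.packed_doc_ids):
        return "reranking_or_packing"  # retrieved, then dropped before generation
    return "generation"                # evidence was in context, answer is still wrong
```

Counting triage results over a batch of failed traces gives a rough picture of which stage deserves attention first. It does not replace the per-stage metrics, since a single trace can have problems at more than one stage.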

2. Core RAG Metrics

Metric | Meaning | Failure it catches
Context precision | Are retrieved chunks useful? | Retriever brings irrelevant noise.
Context recall | Did retrieval find enough needed evidence? | Correct answer was absent from context.
Faithfulness / groundedness | Are answer claims supported by context? | Hallucination despite correct retrieval.
Answer relevance | Does answer address the user query? | On-topic context, off-target response.
Citation correctness | Do citations support the exact claim? | Fake or decorative citations.
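The two retrieval metrics can be illustrated with a simplified, ID-based calculation. This sketch assumes each golden example carries a labeled set of relevant chunk IDs. Metric libraries often estimate relevance with an LLM judge and may weight by rank, so treat this as the basic idea rather than any library's exact formula. The function names are illustrative.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks that are labeled relevant."""
    if not retrieved:
        return 0.0
    hits = sum(1 for chunk_id in retrieved if chunk_id in relevant)
    return hits / len(retrieved)


def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of labeled-relevant chunks that were retrieved."""
    if not relevant:
        return 1.0  # nothing was needed, so nothing was missed
    return len(relevant & set(retrieved)) / len(relevant)


# Example: four chunks retrieved, two of the three needed chunks among them.
retrieved = ["c1", "c7", "c9", "c2"]
relevant = {"c1", "c2", "c3"}
precision = context_precision(retrieved, relevant)  # 2 of 4 retrieved are relevant
recall = context_recall(retrieved, relevant)        # 2 of 3 relevant were retrieved
```

Low precision with high recall points at noise in the context. High precision with low recall means the retriever is clean but missing evidence.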

3. Enterprise RAG Must Also Test Access Control

For internal knowledge systems (HR policies, financial docs, legal contracts), quality is not enough. Evals must test whether the retriever leaks documents across users, roles, departments, tenants, or projects.

Quality eval: Did we retrieve the right doc?
Security eval: Was this user allowed to retrieve that doc?

This means your golden dataset should include adversarial cases where User A asks about documents only User B should see, and the system must either refuse or return only authorized results.
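A minimal sketch of such a security eval, assuming role-based permissions stored on each document and a retriever you can call with the user's roles. The Doc class, the retrieve callable signature, and the case format are all hypothetical. A real system may model permissions per tenant, project, or user instead.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Doc:
    doc_id: str
    allowed_roles: frozenset[str]  # roles permitted to see this document


def leaked_docs(user_roles: set[str], retrieved: list[Doc]) -> list[str]:
    """Return IDs of retrieved documents the user holds no permitted role for."""
    return [d.doc_id for d in retrieved if not (d.allowed_roles & user_roles)]


def run_access_cases(
    cases: Iterable[tuple[str, set[str], str]],
    retrieve: Callable[[str, set[str]], list[Doc]],
) -> list[tuple[str, list[str]]]:
    """Run adversarial (name, user_roles, query) cases and collect any leaks."""
    failures = []
    for name, roles, query in cases:
        leaks = leaked_docs(roles, retrieve(query, roles))
        if leaks:
            failures.append((name, leaks))
    return failures
```

An empty result means no case leaked. Any non-empty result should fail the eval run outright rather than being averaged into a quality score, since a single leak is a security incident.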

4. Tools for RAG Evaluation

Ragas is a widely used open-source library focused on RAG metrics. It computes context precision, context recall, faithfulness, and answer relevance out of the box.

LangSmith and MLflow are useful when you also need traces, datasets, experiment comparison, and production monitoring alongside RAG metrics.

Sources: Ragas, "Available Metrics"; LangSmith, "RAG Evaluation Tutorial"

⚡ Practice Drill
Q1: If the answer is wrong, what four places do you inspect?

Query rewriting (was the query transformed badly?), retrieval (did we fetch the right docs?), reranking/context packing (did we lose good docs in the middle?), and generation/grounding (did the model use the context correctly?).

Q2: What is the difference between faithfulness and correctness?

Faithfulness asks "is the answer supported by the retrieved context?" Correctness asks "is the answer actually true?" An answer can be faithful (supported by context) but incorrect (if the context was wrong), or correct but unfaithful (if the model used its own knowledge instead of the context).
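The two judgments are independent, which gives four combinations. A tiny sketch, assuming you already have a faithfulness label and a correctness label per answer (the function name and the wording of the diagnoses are illustrative):

```python
def diagnose(faithful: bool, correct: bool) -> str:
    """Map the two independent judgments to a likely explanation."""
    if faithful and correct:
        return "grounded and true"
    if faithful:
        return "faithful but incorrect: the retrieved context itself is wrong"
    if correct:
        return "correct but unfaithful: the model used its own knowledge"
    return "unsupported and wrong"
```

Tracking both labels separately shows whether to fix the knowledge base (faithful but incorrect) or the generation step (correct but unfaithful).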

Q3: Why is citation correctness separate from faithfulness?

Faithfulness checks whether claims are supported by context. Citation correctness checks whether the specific citation pointer supports the exact claim at that location. A model can be faithful but cite the wrong paragraph, or cite the right paragraph but misrepresent it.
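The distinction shows up clearly when both are scored per claim. The sketch below assumes each answer has been split into claims and that a labeler or judge has recorded which context chunks support each claim. The CitedClaim fields and score names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CitedClaim:
    text: str
    cited_chunk: str             # chunk ID the answer points to for this claim
    supporting_chunks: set[str]  # chunk IDs labeled as supporting it (empty if none)


def score_claims(claims: list[CitedClaim], context_ids: set[str]) -> dict[str, float]:
    """Score faithfulness and citation correctness over the same claims."""
    if not claims:
        return {"faithfulness": 0.0, "citation_correctness": 0.0}
    # Faithful: some chunk in the context supports the claim.
    faithful = sum(1 for c in claims if c.supporting_chunks & context_ids)
    # Correctly cited: the chunk the answer points to is one that supports it.
    cited_ok = sum(1 for c in claims if c.cited_chunk in c.supporting_chunks)
    return {
        "faithfulness": faithful / len(claims),
        "citation_correctness": cited_ok / len(claims),
    }


claims = [
    CitedClaim("Leave is 20 days.", cited_chunk="c1", supporting_chunks={"c1"}),
    CitedClaim("Carry-over is capped.", cited_chunk="c1", supporting_chunks={"c2"}),
]
scores = score_claims(claims, context_ids={"c1", "c2"})
```

In this example both claims are supported somewhere in the context, so faithfulness is 1.0, but the second claim cites c1 while only c2 supports it, so citation correctness is 0.5.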
