Enterprise LLM Evals · Lesson 7
Offline Evals, CI Gates & Online Monitoring
How evals move from development to production
- The eval lifecycle: development, PR, pre-prod, production
- CI/CD release gates and regression thresholds
- Online monitoring signals and what they tell you
- How production failures become new golden tests
Core idea
Offline evals answer "should we ship?" Online evals answer "is it still working?"
1. The Eval Lifecycle
2. CI/CD Release Gates
- Critical deterministic tests must pass at 100%. No exceptions.
- Safety- and security-critical tests must pass at 100%, or the release requires explicit risk acceptance with a named owner.
- Quality metrics can have thresholds and no-regression rules (e.g., "correctness must not drop more than 2% from previous release").
- Cost/latency budgets are first-class evals. A model swap that improves quality but doubles cost may be blocked.
- Dataset and judge versions must be pinned for reproducibility. An eval run without versioned inputs is not evidence.
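The gate rules above can be sketched as a single check that returns blocking reasons. This is a minimal illustration, not a reference implementation: the `EvalRun` fields, the `release_gate` function, the 1.5x cost ratio, and the version labels are all hypothetical, and the 2% no-regression rule is assumed here to mean an absolute drop of 0.02 on a 0 to 1 scale.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EvalRun:
    """Summary of one offline eval run (all field names are hypothetical)."""
    critical_pass_rate: float       # fraction of critical deterministic tests passing
    safety_pass_rate: float         # fraction of safety/security tests passing
    correctness: float              # quality metric on a 0..1 scale
    cost_per_request_usd: float
    dataset_version: str
    judge_version: str
    risk_accepted_by: Optional[str] = None  # named owner, if risk was accepted


def release_gate(candidate: EvalRun, baseline: EvalRun,
                 max_quality_drop: float = 0.02,
                 max_cost_ratio: float = 1.5) -> List[str]:
    """Return blocking reasons; an empty list means the gate passes."""
    reasons = []
    # Runs are only comparable when dataset and judge versions match.
    if (candidate.dataset_version, candidate.judge_version) != (
            baseline.dataset_version, baseline.judge_version):
        reasons.append("dataset or judge version differs from baseline")
    # Critical deterministic tests: 100% or blocked.
    if candidate.critical_pass_rate < 1.0:
        reasons.append("critical deterministic tests below 100%")
    # Safety tests: 100%, unless a named owner accepted the risk.
    if candidate.safety_pass_rate < 1.0 and not candidate.risk_accepted_by:
        reasons.append("safety tests below 100% without a named risk owner")
    # No-regression rule on quality (assumed absolute drop).
    if baseline.correctness - candidate.correctness > max_quality_drop:
        reasons.append("correctness regressed beyond the allowed drop")
    # Cost is a first-class eval: block if it exceeds the budget ratio.
    if candidate.cost_per_request_usd > baseline.cost_per_request_usd * max_cost_ratio:
        reasons.append("cost per request exceeds budget")
    return reasons


baseline = EvalRun(1.0, 1.0, 0.90, 0.010, "golden-v12", "judge-v3")
candidate = EvalRun(1.0, 1.0, 0.93, 0.021, "golden-v12", "judge-v3")
for reason in release_gate(candidate, baseline):
    print("BLOCKED:", reason)
```

In the usage lines, the candidate improves correctness but roughly doubles cost, so the gate reports a cost-budget failure. In CI, a non-empty list would typically map to a non-zero exit code.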
3. Online Monitoring Signals
| Signal | Why it matters |
|---|---|
| Latency, tokens, cost | User experience and budget control |
| User feedback (thumbs up/down, ratings) | Real-world quality signal — but not ground truth by itself |
| Safety flags | Policy and incident detection |
| Fallback/refusal rate | Possible drift, prompt issue, or data issue |
| Tool error rate | Integration health |
| Judge scores over time | Quality trend and degradation detection |
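One way to turn a signal such as judge scores into a degradation alert is a rolling mean compared against a baseline. The sketch below is an assumption-laden illustration: the `RollingSignal` class, the window size, and the thresholds are hypothetical, and a real deployment would likely rely on the monitoring platform's own alerting.

```python
from collections import deque
from statistics import mean


class RollingSignal:
    """Tracks a rolling mean of a monitored signal (hypothetical helper)."""

    def __init__(self, window: int, baseline: float, max_drop: float):
        self.values = deque(maxlen=window)  # keeps only the newest `window` values
        self.baseline = baseline            # expected level, e.g. from the last release
        self.max_drop = max_drop            # allowed drop below baseline before alerting

    def add(self, value: float) -> bool:
        """Record a value; return True if the rolling mean has degraded.

        No alert fires until the window is full, to avoid noise on the
        first few production traces.
        """
        self.values.append(value)
        if len(self.values) < self.values.maxlen:
            return False
        return self.baseline - mean(self.values) > self.max_drop


# Judge scores on a 0..1 scale, sampled from production traces.
judge_scores = RollingSignal(window=3, baseline=0.9, max_drop=0.1)
for score in [0.9, 0.9, 0.9, 0.3]:
    if judge_scores.add(score):
        print("ALERT: judge score trend degraded")
```

The same pattern could apply to fallback/refusal rate or tool error rate with the comparison inverted, since for those signals an increase is the warning sign.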
4. The Feedback Loop
The most important enterprise pattern: production failures become regression tests.
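A minimal sketch of that loop, assuming a production trace is available as a dictionary with `trace_id` and `input` keys (both hypothetical, as are `GoldenCase`, `failure_to_golden_case`, and `add_case`). The expected behavior is supplied by a reviewer rather than copied from the failing output, so the failure itself never becomes the ground truth.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class GoldenCase:
    """One regression test derived from a production failure (hypothetical schema)."""
    case_id: str
    input: str
    expected_behavior: str
    source_trace_id: str


def failure_to_golden_case(trace: dict, expected_behavior: str) -> GoldenCase:
    """Build a golden case from a failing trace and a reviewer-written expectation."""
    # A content hash of the input gives a stable ID, so the same failure
    # reported twice maps to the same case.
    digest = hashlib.sha256(trace["input"].encode("utf-8")).hexdigest()[:12]
    return GoldenCase(
        case_id="prod-" + digest,
        input=trace["input"],
        expected_behavior=expected_behavior,
        source_trace_id=trace["trace_id"],
    )


def add_case(dataset: Dict[str, GoldenCase], case: GoldenCase) -> bool:
    """Add the case unless an identical input is already covered."""
    if case.case_id in dataset:
        return False
    dataset[case.case_id] = case
    return True


golden: Dict[str, GoldenCase] = {}
trace = {"trace_id": "t-1", "input": "Cancel my order", "output": "Order cancelled."}
case = failure_to_golden_case(trace, "Asks for the order ID before cancelling")
print(add_case(golden, case), len(golden))
```

Adding cases changes the golden dataset, so each addition would normally produce a new dataset version.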
Sources: LangSmith Evaluation Concepts, MLflow GenAI Eval/Monitor, Datadog LLM Observability
Q1: What is the difference between offline evals and online evals?
Offline evals run against golden datasets in CI before release — they answer "should we ship?" Online evals run on production traces after release — they answer "is it still working?" Both are needed; neither is sufficient alone.
Q2: Why pin dataset and judge versions?
Without versioned inputs, eval results are not reproducible. If a metric drops, you need to know whether the app changed or the eval changed. Pinning both the golden dataset version and the judge prompt/model version makes eval results comparable across releases.
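One possible way to make pinning checkable is to record a content fingerprint of the eval inputs with every run. The sketch below is illustrative only: `eval_fingerprint`, `comparable`, and the run dictionary layout are hypothetical, and the model name is a placeholder.

```python
import hashlib
import json
from typing import List


def eval_fingerprint(dataset_rows: List[dict], judge_prompt: str, judge_model: str) -> str:
    """Hash the golden dataset and judge configuration into one identifier."""
    payload = json.dumps(
        {"dataset": dataset_rows, "judge_prompt": judge_prompt, "judge_model": judge_model},
        sort_keys=True,       # stable key order so equal inputs hash equally
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def comparable(run_a: dict, run_b: dict) -> bool:
    """Two runs are comparable only if they used identical eval inputs."""
    return run_a["fingerprint"] == run_b["fingerprint"]


rows = [{"input": "What is the refund window?", "expected": "30 days"}]
old_run = {"fingerprint": eval_fingerprint(rows, "Score correctness from 1 to 5.", "judge-model-a")}
new_run = {"fingerprint": eval_fingerprint(rows, "Score correctness from 1 to 10.", "judge-model-a")}
print(comparable(old_run, new_run))
```

Here the judge prompt changed between runs, so a metric difference could not be attributed to the app alone.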