Enterprise LLM Evals · Lesson 7

Offline Evals, CI Gates & Online Monitoring

How evals move from development to production

What you'll learn
  1. The eval lifecycle: development, PR, pre-prod, production
  2. CI/CD release gates and regression thresholds
  3. Online monitoring signals and what they tell you
  4. How production failures become new golden tests

Core idea

Offline evals answer "should we ship?" Online evals answer "is it still working?"

1. The Eval Lifecycle

Development        Pull request          Pre-prod             Production
-----------        ------------          --------             ----------
small goldens ->   regression suite ->   red team + load ->   traces + feedback
prompt tests       threshold gates       human signoff        online scoring
unit checks        dataset version       release evidence     anomaly alerts

2. CI/CD Release Gates

Release gate rules
  • Critical deterministic tests must pass at 100%. No exceptions.
  • Safety- and security-critical tests must pass at 100%, or the release requires explicit risk acceptance by a named owner.
  • Quality metrics can have thresholds and no-regression rules (e.g., "correctness must not drop more than 2% from previous release").
  • Cost/latency budgets are first-class evals. A model swap that improves quality but doubles cost may be blocked.
  • Dataset and judge versions must be pinned for reproducibility. An eval run without versioned inputs is not evidence.
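
The rules above can be sketched as a single gate function. This is a minimal illustration, not a reference implementation: `EvalReport`, `release_gate`, and every field name are hypothetical, the cost and latency budgets are placeholder values, and the 2% no-regression rule is read here as an absolute drop of 0.02 on a 0-to-1 score, which is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EvalReport:
    """Hypothetical summary of one eval run; field names are illustrative."""
    critical_pass_rate: float        # fraction of critical deterministic tests passing
    safety_pass_rate: float          # fraction of safety/security-critical tests passing
    correctness: float               # quality metric on a 0-to-1 scale
    cost_per_request_usd: float
    p95_latency_ms: float
    dataset_version: Optional[str]   # None means the dataset was not pinned
    judge_version: Optional[str]     # None means the judge was not pinned


def release_gate(current: EvalReport, previous: EvalReport,
                 risk_owner: Optional[str] = None,
                 max_correctness_drop: float = 0.02,   # assumed absolute drop
                 max_cost_usd: float = 0.05,           # placeholder budget
                 max_p95_ms: float = 3000.0) -> List[str]:
    """Return the blocking reasons; an empty list means the release may proceed."""
    reasons: List[str] = []
    if not current.dataset_version or not current.judge_version:
        reasons.append("dataset or judge version is not pinned")
    if current.critical_pass_rate < 1.0:
        reasons.append("critical deterministic tests below 100%")
    if current.safety_pass_rate < 1.0 and risk_owner is None:
        reasons.append("safety tests below 100% with no named risk owner")
    if previous.correctness - current.correctness > max_correctness_drop:
        reasons.append("correctness regressed beyond the allowed drop")
    if current.cost_per_request_usd > max_cost_usd:
        reasons.append("cost per request over budget")
    if current.p95_latency_ms > max_p95_ms:
        reasons.append("p95 latency over budget")
    return reasons
```

Returning every blocking reason, rather than stopping at the first, gives the release owner the full picture in one CI run. A CI job could fail the build whenever the returned list is non-empty.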

3. Online Monitoring Signals

Signal                                    Why it matters
------                                    --------------
Latency, tokens, cost                     User experience and budget control
User feedback (thumbs up/down, ratings)   Real-world quality signal, but not ground truth by itself
Safety flags                              Policy and incident detection
Fallback/refusal rate                     Possible drift, prompt issue, or data issue
Tool error rate                           Integration health
Judge scores over time                    Quality trend and degradation detection
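
As one way to turn "judge scores over time" into an alert, the sketch below compares a rolling mean against a fixed baseline. The `RollingSignal` class, the window size, and the `max_drop` threshold are all hypothetical choices for illustration; a production system would more likely rely on its observability platform's own alerting.

```python
from collections import deque
from statistics import mean


class RollingSignal:
    """Flags degradation when the rolling mean falls too far below a baseline."""

    def __init__(self, baseline: float, window: int = 50, max_drop: float = 0.05):
        self.baseline = baseline
        self.max_drop = max_drop
        self.values = deque(maxlen=window)  # keeps only the most recent scores

    def observe(self, score: float) -> bool:
        """Record one judge score; return True if an alert should fire."""
        self.values.append(score)
        if len(self.values) < self.values.maxlen:
            return False  # not enough data yet to judge a trend
        return self.baseline - mean(self.values) > self.max_drop
```

Waiting for a full window before alerting trades detection speed for fewer false alarms from a handful of bad traces. The same shape works for fallback rate or tool error rate by feeding 0/1 values and inverting the comparison.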

4. The Feedback Loop

The most important enterprise pattern: production failures become regression tests.

Production trace shows failure
        |
        v
Add to golden dataset as new test case
        |
        v
Eval suite now catches this failure forever
        |
        v
Next release: this regression is blocked automatically
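
A minimal sketch of the first two steps of this loop, assuming a trace is a dict with `trace_id`, `input`, and `output` keys and that a reviewer supplies the expected answer. The trace shape and both function names are hypothetical and not taken from any particular tracing tool.

```python
import hashlib
from typing import Dict, List


def trace_to_golden(trace: Dict[str, str], expected: str) -> Dict[str, str]:
    """Turn a failed production trace into a golden test case."""
    # A stable id derived from the input lets the same failure be recognised later.
    case_id = hashlib.sha256(trace["input"].encode("utf-8")).hexdigest()[:12]
    return {
        "id": case_id,
        "input": trace["input"],
        "expected": expected,                 # written by a human reviewer
        "observed_failure": trace["output"],  # kept for context, not for scoring
        "source_trace_id": trace["trace_id"],
    }


def add_case(dataset: List[Dict[str, str]], case: Dict[str, str]) -> bool:
    """Append the case unless an identical input is already covered."""
    if any(existing["id"] == case["id"] for existing in dataset):
        return False
    dataset.append(case)
    return True
```

Deduplicating on the input hash keeps a recurring incident from inflating the dataset. Adding a case changes the golden dataset, so the dataset version should be bumped at the same time.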

Sources: LangSmith Evaluation Concepts, MLflow GenAI Eval/Monitor, Datadog LLM Observability

⚡ Practice Drill
Q1: What is the difference between offline evals and online evals?

Offline evals run against golden datasets in CI before release — they answer "should we ship?" Online evals run on production traces after release — they answer "is it still working?" Both are needed; neither is sufficient alone.

Q2: Why pin dataset and judge versions?

Without versioned inputs, eval results are not reproducible. If a metric drops, you need to know whether the app changed or the eval changed. Pinning both the golden dataset version and the judge prompt/model version makes eval results comparable across releases.
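
One possible way to make pinning concrete is to store a small manifest with every eval run and only compare scores between runs whose manifests match. The helper names and manifest fields below are assumptions for illustration, not a format defined by any of the tools cited in this lesson.

```python
import hashlib
import json
from typing import Dict, List


def fingerprint(obj) -> str:
    """Content hash of any JSON-serialisable object, stable across dict key order."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_manifest(cases: List[dict], judge_model: str, judge_prompt: str) -> Dict[str, str]:
    """Record exactly which dataset and judge produced a set of scores."""
    return {
        "dataset_sha256": fingerprint(cases),
        "judge_model": judge_model,
        "judge_prompt_sha256": fingerprint(judge_prompt),
    }


def comparable(a: Dict[str, str], b: Dict[str, str]) -> bool:
    """Scores from two runs are comparable only if the eval itself did not change."""
    return a == b
```

If a metric drops and the two manifests are equal, the application changed; if they differ, the eval changed and the drop is not yet evidence of a regression. Note that the dataset hash is sensitive to the order of cases in the list.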
