Enterprise LLM Evals · Lesson 1
The Enterprise LLM Eval Map
What enterprise evals are and why MMLU is not enough
The Enterprise LLM Eval Map
Lesson 1 — what enterprise evals are, the 4 families, and why MMLU is not enough
- The difference between model benchmarks and application evals
- The 4 eval families: model, application, risk, and ops
- The minimum production-grade eval stack
- How to think about enterprise evals as a control system
Core idea
Enterprise LLM evals are not a single metric. They are a control system: pre-deployment gates, post-deployment monitoring, and governance evidence — all versioned, all auditable.
Why MMLU Is Not Enough
Model benchmarks like MMLU, GSM8K, and HumanEval ask: "Is this model generally capable?"
Enterprise teams ask a different question: "Will this AI system complete the intended task correctly, safely, cheaply, reliably, and auditable in our actual production workflow?"
The gap between these two questions is where enterprise evals live. A model can score well on MMLU and still fail in production because of poor retrieval, bad tool design, weak guardrails, or domain-specific failure modes that benchmarks never test.
The 4 Eval Families
1. Model Evals
Question: "Is this base model generally capable?"
Examples: MMLU, GSM8K, HumanEval, HellaSwag, TruthfulQA
Tools: lm-evaluation-harness, HELM-style benchmark suites
Use for: Model selection, vendor comparison, training progress.
Limitation: Does NOT prove your enterprise app works.
2. Application Evals
Question: "Does our actual AI application work for real user tasks?"
Metrics: correctness, relevance, instruction following, schema validity, business-rule compliance, response quality
Tools: LangSmith, MLflow, DeepEval, Promptfoo
Use for: CI/CD gates, prompt regressions, model swaps, product readiness.
3. Risk / Safety / Security Evals
Question: "Can the system be abused, leak data, ignore policy, or misuse tools?"
Tests: prompt injection, jailbreaks, PII leakage, unsafe advice, data exfiltration, tool misuse, unauthorized access, policy violations
Tools: Promptfoo, Giskard, OWASP LLMSVS, OWASP LLM Top 10, NIST AI RMF
Use for: Enterprise release approval, AppSec, compliance, red-team signoff.
4. Ops Evals
Question: "Is it still working in production?"
Metrics: latency, cost, drift, feedback, anomaly rate, escalation rate, incident traces, quality trend
Tools: LangSmith, MLflow, Arize/Phoenix, Langfuse, Datadog, telemetry
The Minimum Production-Grade Eval Stack
- Golden dataset — 50–500 representative examples across normal, edge, adversarial, and high-value cases
- Deterministic checks — JSON validity, required fields, citation presence, empty answer, allowed language, schema validation, tool argument validation
- Task correctness — exact match where possible; rubric or LLM judge where semantic quality matters
- RAG quality — context precision, context recall, groundedness, answer correctness, citation correctness
- Agent/tool quality — tool selection, argument correctness, step order, task completion, step efficiency, handoff accuracy
- Safety/security — prompt injection, jailbreaks, PII leaks, data exfiltration, unsafe advice, authorization boundary tests
- Human calibration — expert labels and pairwise review to verify automated metrics are meaningful
- Online monitoring — traces, user feedback, latency, cost, token usage, safety flags, anomaly alerts, regression mining
- CI/CD gates — fail builds when critical evals regress below threshold
- Governance evidence — versioned datasets, metric definitions, run history, model/prompt/retriever versions, sign-off records
Key Distinction
Tool Map
| Need | Tools |
|---|---|
| Model benchmark | lm-evaluation-harness |
| Prompt regression | Promptfoo, DeepEval, LangSmith, MLflow |
| RAG metrics | Ragas, LangSmith, MLflow |
| Agent/tool evals | DeepEval, LangSmith, MLflow |
| Experiment comparison | LangSmith, MLflow |
| Production tracing | LangSmith, MLflow, Arize/Phoenix, Langfuse, Datadog |
| Red teaming | Promptfoo, Giskard |
| Security standard | OWASP LLMSVS, OWASP LLM Top 10 |
| Governance/risk | NIST AI RMF |
Sources: OpenAI Evaluation Best Practices, LangSmith Evaluation Concepts, NIST AI RMF
Q1: Why is MMLU not enough to prove an enterprise LLM app is production-ready?
Show answer
MMLU tests the base model's general knowledge. It does not test your actual application workflow: retrieval quality, tool use, guardrails, business rules, latency, safety, or real user distribution.
Q2: For a RAG app, what are the two separate things you must evaluate?
Show answer
Retrieval quality (did we fetch the right evidence?) and generation quality (did the answer stay grounded and correct?). These are independent failure modes.
Q3: What is the difference between offline evals and online evals?
Show answer
Offline evals run against golden datasets before release ("should we ship?"). Online evals run on production traces after release ("is it still working?").
Q4: In an agentic app, why is final-answer quality not enough?
Show answer
The agent may have used unauthorized tools, leaked data, called expensive unnecessary tools, or succeeded by accident through an unsafe path. Trace-level evaluation is required.
Want to see these patterns in action?
See these eval patterns applied to real AI apps in the Lab.
Explore the Lab →