Enterprise LLM Evals · Lesson 1

The Enterprise LLM Eval Map

What enterprise evals are and why MMLU is not enough

The Enterprise LLM Eval Map

Lesson 1 — what enterprise evals are, the 4 families, and why MMLU is not enough

What you'll learn
  1. The difference between model benchmarks and application evals
  2. The 4 eval families: model, application, risk, and ops
  3. The minimum production-grade eval stack
  4. How to think about enterprise evals as a control system

Core idea

Enterprise LLM evals are not a single metric. They are a control system: pre-deployment gates, post-deployment monitoring, and governance evidence — all versioned, all auditable.

Why MMLU Is Not Enough

Model benchmarks like MMLU, GSM8K, and HumanEval ask: "Is this model generally capable?"

Enterprise teams ask a different question: "Will this AI system complete the intended task correctly, safely, cheaply, reliably, and auditable in our actual production workflow?"

The gap between these two questions is where enterprise evals live. A model can score well on MMLU and still fail in production because of poor retrieval, bad tool design, weak guardrails, or domain-specific failure modes that benchmarks never test.

ENTERPRISE LLM EVALS | +-------------------------+-------------------------+ | | | 1. Model evals 2. App evals 3. Risk evals Can this model do Does our system work Can this system hurt general tasks? for our users? users/company/data? | | | MMLU, GSM8K, correctness, prompt injection, HumanEval, groundedness, jailbreaks, PII leaks, HellaSwag, relevance, format, tool misuse, access TruthfulQA retrieval, tools control failures | | | lm-eval-harness LangSmith, MLflow, Promptfoo, Giskard, HELM-style suites Ragas, DeepEval OWASP LLMSVS, NIST | 4. Ops evals Is it reliable in production? | latency, cost, drift, feedback, anomaly rate, escalation rate, incident traces, quality trend | LangSmith, MLflow, Arize, Langfuse, Datadog, telemetry

The 4 Eval Families

1. Model Evals

Question: "Is this base model generally capable?"

Examples: MMLU, GSM8K, HumanEval, HellaSwag, TruthfulQA

Tools: lm-evaluation-harness, HELM-style benchmark suites

Use for: Model selection, vendor comparison, training progress.

Limitation: Does NOT prove your enterprise app works.

2. Application Evals

Question: "Does our actual AI application work for real user tasks?"

Metrics: correctness, relevance, instruction following, schema validity, business-rule compliance, response quality

Tools: LangSmith, MLflow, DeepEval, Promptfoo

Use for: CI/CD gates, prompt regressions, model swaps, product readiness.

3. Risk / Safety / Security Evals

Question: "Can the system be abused, leak data, ignore policy, or misuse tools?"

Tests: prompt injection, jailbreaks, PII leakage, unsafe advice, data exfiltration, tool misuse, unauthorized access, policy violations

Tools: Promptfoo, Giskard, OWASP LLMSVS, OWASP LLM Top 10, NIST AI RMF

Use for: Enterprise release approval, AppSec, compliance, red-team signoff.

4. Ops Evals

Question: "Is it still working in production?"

Metrics: latency, cost, drift, feedback, anomaly rate, escalation rate, incident traces, quality trend

Tools: LangSmith, MLflow, Arize/Phoenix, Langfuse, Datadog, telemetry

The Minimum Production-Grade Eval Stack

10-layer stack
  1. Golden dataset — 50–500 representative examples across normal, edge, adversarial, and high-value cases
  2. Deterministic checks — JSON validity, required fields, citation presence, empty answer, allowed language, schema validation, tool argument validation
  3. Task correctness — exact match where possible; rubric or LLM judge where semantic quality matters
  4. RAG quality — context precision, context recall, groundedness, answer correctness, citation correctness
  5. Agent/tool quality — tool selection, argument correctness, step order, task completion, step efficiency, handoff accuracy
  6. Safety/security — prompt injection, jailbreaks, PII leaks, data exfiltration, unsafe advice, authorization boundary tests
  7. Human calibration — expert labels and pairwise review to verify automated metrics are meaningful
  8. Online monitoring — traces, user feedback, latency, cost, token usage, safety flags, anomaly alerts, regression mining
  9. CI/CD gates — fail builds when critical evals regress below threshold
  10. Governance evidence — versioned datasets, metric definitions, run history, model/prompt/retriever versions, sign-off records

Key Distinction

Model benchmark: "Which model is smart?" Application eval: "Does our product work?" Risk eval: "Can it fail dangerously?" Production monitoring: "Is it still working after real users touch it?"

Tool Map

NeedTools
Model benchmarklm-evaluation-harness
Prompt regressionPromptfoo, DeepEval, LangSmith, MLflow
RAG metricsRagas, LangSmith, MLflow
Agent/tool evalsDeepEval, LangSmith, MLflow
Experiment comparisonLangSmith, MLflow
Production tracingLangSmith, MLflow, Arize/Phoenix, Langfuse, Datadog
Red teamingPromptfoo, Giskard
Security standardOWASP LLMSVS, OWASP LLM Top 10
Governance/riskNIST AI RMF

Sources: OpenAI Evaluation Best Practices, LangSmith Evaluation Concepts, NIST AI RMF

⚡ Practice Drill
Q1: Why is MMLU not enough to prove an enterprise LLM app is production-ready?
Show answer

MMLU tests the base model's general knowledge. It does not test your actual application workflow: retrieval quality, tool use, guardrails, business rules, latency, safety, or real user distribution.

Q2: For a RAG app, what are the two separate things you must evaluate?
Show answer

Retrieval quality (did we fetch the right evidence?) and generation quality (did the answer stay grounded and correct?). These are independent failure modes.

Q3: What is the difference between offline evals and online evals?
Show answer

Offline evals run against golden datasets before release ("should we ship?"). Online evals run on production traces after release ("is it still working?").

Q4: In an agentic app, why is final-answer quality not enough?
Show answer

The agent may have used unauthorized tools, leaked data, called expensive unnecessary tools, or succeeded by accident through an unsafe path. Trace-level evaluation is required.

Want to see these patterns in action?

See these eval patterns applied to real AI apps in the Lab.

Explore the Lab →

← Back to Deep Expertise Track