Enterprise LLM Evals · Lesson 2
Golden Datasets, Rubrics & LLM-as-Judge
How enterprise teams define "good" before they automate scoring
Golden Datasets, Rubrics & LLM-as-Judge
Lesson 2 — how enterprise teams define "good" before they automate scoring
- Golden dataset anatomy: what goes into a production eval example
- How to write rubrics that make scoring reproducible
- Judge types: deterministic, human, LLM, hybrid
- LLM-as-judge reliability rules and common biases
Core idea
Evals start with examples, not tools. A weak dataset plus a fancy judge is still a weak eval.
1. Golden Dataset Anatomy
A golden example is not just "input + expected output." It is a structured record that makes evaluation reproducible and debuggable.
Enterprise datasets should include:
- Happy paths — normal, expected user interactions
- Edge cases — unusual but legitimate inputs
- Adversarial cases — prompt injection, jailbreak attempts, PII probing
- High-value cases — tasks where failure is expensive
- Historical failures — production incidents converted to regression tests
Metadata matters because it lets you segment failures: "we regressed 8% on edge cases but held steady on normal cases" is more actionable than "overall score dropped 3%."
2. Rubrics Beat Vague Quality
| Bad rubric | Better rubric |
|---|---|
| Answer should be good. | Answer must cite at least one allowed source, contain no unsupported claim, answer the user's exact question, and refuse if the requested action violates policy. |
| Be concise. | Under 120 words unless the user asks for detail; no preamble; include next step only if actionable. |
| Be helpful. | Address the user's stated intent. If the request is ambiguous, ask one clarifying question. Do not guess and proceed. |
3. Judge Types
The best enterprise eval stacks use a hybrid approach: deterministic checks for hard gates (schema, format, safety keywords), LLM judges for semantic quality, and humans for calibration and high-risk review.
4. LLM-as-Judge Reliability Rules
- Prefer pass/fail or pairwise over vague 1–10 scoring. Binary decisions are more reliable than continuous scores.
- Use clear rubrics with concrete examples of pass, partial, and fail for each criterion.
- Calibrate against human labels. Run a sample through both human and LLM judges. Measure agreement. Investigate disagreements.
- Watch for biases:
- Verbosity bias — longer answers score higher even when less accurate
- Position bias — in pairwise comparison, the first option wins more often
- Judge-model drift — the same judge prompt produces different results when the judge model is updated
- Version judge prompts and judge models like production code. A judge prompt change is an eval system change.
- Never treat LLM-judge scores as ground truth without human calibration. The judge is a hypothesis, not a truth.
Sources: OpenAI Evaluation Best Practices, LangSmith Evaluation Concepts
Q1: What is the most common eval failure?
Show answer
Starting with generic metrics before defining task-specific examples and rubrics. The dataset and rubric come first; the tool comes second.
Q2: Why keep metadata on examples?
Show answer
So you can segment failures by task type, risk category, user segment, data source, or regression history. "Overall score dropped" is less actionable than "edge-case retrieval regressed 12%."
Q3: Name three LLM-as-judge biases and how to mitigate each.
Show answer
Verbosity bias (prefer concise reference answers or normalize by length), position bias (randomize option order in pairwise), judge-model drift (pin judge model version and re-calibrate on updates).
Want to see these patterns in action?
See these eval patterns applied to real AI apps in the Lab.
Explore the Lab →