Enterprise LLM Evals · Lesson 2

Golden Datasets, Rubrics & LLM-as-Judge

How enterprise teams define "good" before they automate scoring

What you'll learn
  1. Golden dataset anatomy: what goes into a production eval example
  2. How to write rubrics that make scoring reproducible
  3. Judge types: deterministic, human, LLM, hybrid
  4. LLM-as-judge reliability rules and common biases

Core idea

Evals start with examples, not tools. A weak dataset plus a fancy judge is still a weak eval.

1. Golden Dataset Anatomy

A golden example is not just "input + expected output." It is a structured record that makes evaluation reproducible and debuggable.

Golden example = input + expected behavior + scoring rule + metadata

{
  input:     user task / conversation / trace
  reference: expected answer OR expected properties
  rubric:    what counts as pass / fail / partial
  metadata:  category, risk, persona, source, version
}
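One way to hold this record in code is a small dataclass. The sketch below is illustrative only: the class name, field names, and the sample values are assumptions, not a prescribed schema. It enforces the "expected answer OR expected properties" rule by requiring at least one of the two.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class GoldenExample:
    """One golden-dataset record (hypothetical field names)."""
    example_id: str
    input: str                                  # user task, conversation, or trace
    rubric: str                                 # what counts as pass / fail / partial
    reference: Optional[str] = None             # expected answer, if one exists
    expected_properties: tuple[str, ...] = ()   # or properties the output must have
    metadata: dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Require a reference answer or at least one expected property.
        if self.reference is None and not self.expected_properties:
            raise ValueError("need a reference answer or expected properties")


# Made-up example record, for illustration only.
example = GoldenExample(
    example_id="refund-001",
    input="Can I get a refund on an order placed 45 days ago?",
    rubric="Pass if the answer cites the refund policy and states the outcome.",
    expected_properties=("cites refund policy",),
    metadata={"category": "edge_case", "risk": "medium", "persona": "customer",
              "source": "synthetic", "version": "1"},
)
```

Keeping the rubric and metadata on the record itself means a failing example can be debugged without looking anything up elsewhere.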

Enterprise datasets should include:

  • Happy paths — normal, expected user interactions
  • Edge cases — unusual but legitimate inputs
  • Adversarial cases — prompt injection, jailbreak attempts, PII probing
  • High-value cases — tasks where failure is expensive
  • Historical failures — production incidents converted to regression tests

Metadata matters because it lets you segment failures: "we regressed 8% on edge cases but held steady on normal cases" is more actionable than "overall score dropped 3%."
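Segmenting by metadata can be as simple as grouping pass/fail results by one key. The following is a minimal sketch; the function name `pass_rate_by` and the sample results are assumptions made up for illustration.

```python
from collections import defaultdict


def pass_rate_by(results, key):
    """Group (metadata, passed) pairs by one metadata key and return pass rates."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for metadata, passed in results:
        segment = metadata.get(key, "unknown")
        totals[segment] += 1
        passes[segment] += int(passed)
    return {segment: passes[segment] / totals[segment] for segment in totals}


# Made-up results: (example metadata, did the output pass?)
results = [
    ({"category": "happy_path"}, True),
    ({"category": "happy_path"}, True),
    ({"category": "edge_case"}, True),
    ({"category": "edge_case"}, False),
]
rates = pass_rate_by(results, "category")
print(rates)  # {'happy_path': 1.0, 'edge_case': 0.5}
```

The overall pass rate here is 75%, which hides the fact that every failure sits in the edge-case segment.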

2. Rubrics Beat Vague Quality

Bad rubric | Better rubric
Answer should be good. | Answer must cite at least one allowed source, contain no unsupported claim, answer the user's exact question, and refuse if the requested action violates policy.
Be concise. | Under 120 words unless the user asks for detail; no preamble; include next step only if actionable.
Be helpful. | Address the user's stated intent. If the request is ambiguous, ask one clarifying question. Do not guess and proceed.
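A concrete rubric is often partly checkable in code. As a rough sketch, the "Be concise" row can be turned into two checks; the function name, the `PREAMBLE_OPENERS` list, and the prefix heuristic for detecting a preamble are all assumptions, and the "next step only if actionable" clause is left out because it needs semantic judgment.

```python
# Assumed list of filler openers; a real rubric would define its own.
PREAMBLE_OPENERS = ("sure", "certainly", "great question", "of course")


def check_concise(answer: str, user_asked_for_detail: bool = False,
                  max_words: int = 120) -> dict[str, bool]:
    """Return one boolean per criterion so failures are attributable."""
    word_count = len(answer.split())
    opening = answer.strip().lower()
    return {
        # "Under 120 words unless the user asks for detail"
        "within_length": user_asked_for_detail or word_count < max_words,
        # "No preamble", approximated by a prefix check
        "no_preamble": not opening.startswith(PREAMBLE_OPENERS),
    }


print(check_concise("Sure! Here is the answer."))
# {'within_length': True, 'no_preamble': False}
```

Returning per-criterion results rather than a single score shows which part of the rubric an answer failed.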

3. Judge Types

  • Deterministic judge: exact match, regex, JSON schema, unit test
  • Human judge: expert reviewer, pairwise preference, policy reviewer
  • LLM judge: reference-based, reference-free, pairwise, rubric scoring
  • Hybrid judge: deterministic hard gates + LLM subjective + human calibration

Strong enterprise eval stacks typically take the hybrid approach: deterministic checks for hard gates (schema, format, safety keywords), LLM judges for semantic quality, and humans for calibration and high-risk review.
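The hybrid layering can be sketched as a short pipeline: deterministic gates run first and fail fast, the LLM judge runs only on outputs that clear them, and high-risk cases are flagged for a human. Everything here is an assumption for illustration: the function name, the result fields, and `stub_llm_judge`, which is a keyword stand-in for a real model call.

```python
import json
from typing import Callable


def hybrid_judge(output: str, required_keys: set[str],
                 llm_judge: Callable[[str], bool],
                 high_risk: bool = False) -> dict:
    # Hard gate 1: output must parse as JSON.
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return {"verdict": "fail", "stage": "deterministic", "reason": "invalid JSON"}
    # Hard gate 2: output must be an object containing the required keys.
    if not isinstance(parsed, dict) or not required_keys.issubset(parsed.keys()):
        return {"verdict": "fail", "stage": "deterministic", "reason": "missing required keys"}
    # Semantic check: only reached when the hard gates pass.
    passed = llm_judge(output)
    return {"verdict": "pass" if passed else "fail", "stage": "llm",
            "needs_human_review": high_risk}


def stub_llm_judge(text: str) -> bool:
    """Stand-in for a real LLM call; NOT a real judge."""
    return "refund" in text


print(hybrid_judge("not json", {"answer"}, stub_llm_judge))
print(hybrid_judge('{"answer": "refund approved"}', {"answer"}, stub_llm_judge, high_risk=True))
```

Running the cheap checks first keeps judge calls for the cases that need semantic evaluation.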

4. LLM-as-Judge Reliability Rules

Rules for trustworthy LLM judges
  • Prefer pass/fail or pairwise over vague 1–10 scoring. Binary and pairwise decisions tend to be more consistent across runs and raters than fine-grained numeric scales.
  • Use clear rubrics with concrete examples of pass, partial, and fail for each criterion.
  • Calibrate against human labels. Run a sample through both human and LLM judges. Measure agreement. Investigate disagreements.
  • Watch for biases:
    • Verbosity bias — longer answers score higher even when less accurate
    • Position bias — in pairwise comparison, the first option wins more often
    • Judge-model drift — the same judge prompt produces different results when the judge model is updated
  • Version judge prompts and judge models like production code. A judge prompt change is an eval system change.
  • Never treat LLM-judge scores as ground truth without human calibration. The judge is a hypothesis, not a truth.
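For the calibration step, one common way to measure agreement between human and judge labels is Cohen's kappa, which corrects raw agreement for chance. The lesson does not name a specific statistic, so treat this choice, the function name, and the sample labels as assumptions in a minimal sketch for binary pass/fail labels.

```python
def cohens_kappa(human: list[bool], judge: list[bool]) -> float:
    """Chance-corrected agreement between two binary label lists."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    p_human = sum(human) / n      # share of "pass" labels from humans
    p_judge = sum(judge) / n      # share of "pass" labels from the judge
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # formally undefined, so this sketch reports full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up labels for four examples (True = pass).
human_labels = [True, True, False, False]
judge_labels = [True, True, False, True]

kappa = cohens_kappa(human_labels, judge_labels)
disagreements = [i for i, (h, j) in enumerate(zip(human_labels, judge_labels)) if h != j]
print(kappa, disagreements)  # 0.5 [3]
```

The list of disagreeing indices is the useful part in practice: those are the examples to read by hand when deciding whether the rubric, the judge prompt, or the human label is wrong.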

Sources: OpenAI Evaluation Best Practices, LangSmith Evaluation Concepts

⚡ Practice Drill
Q1: What is the most common eval failure?
Starting with generic metrics before defining task-specific examples and rubrics. The dataset and rubric come first; the tool comes second.

Q2: Why keep metadata on examples?
So you can segment failures by task type, risk category, user segment, data source, or regression history. "Overall score dropped" is less actionable than "edge-case retrieval regressed 12%."

Q3: Name three LLM-as-judge biases and how to mitigate each.
Verbosity bias (prefer concise reference answers or normalize by length), position bias (randomize option order in pairwise), judge-model drift (pin judge model version and re-calibrate on updates).
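A stricter variant of randomizing option order is to run the pairwise judge in both orders and accept a winner only when the two runs agree. This is a swapped-order technique rather than plain randomization, and the sketch below is hypothetical throughout: the function name, the "first"/"second" return convention, and both stub judges are assumptions.

```python
from typing import Callable


def debiased_pairwise(a: str, b: str, judge: Callable[[str, str], str]) -> str:
    """judge(first, second) returns "first" or "second".

    Returns "a" or "b" only if the same answer wins in both orders, else "tie".
    """
    a_wins_forward = judge(a, b) == "first"     # a shown first
    a_wins_backward = judge(b, a) == "second"   # a shown second
    if a_wins_forward and a_wins_backward:
        return "a"
    if not a_wins_forward and not a_wins_backward:
        return "b"
    return "tie"  # verdict flipped with position, so do not trust it


def always_first(first: str, second: str) -> str:
    """Stub judge with pure position bias."""
    return "first"


def prefers_longer(first: str, second: str) -> str:
    """Stub judge that is position-independent (but verbosity-biased)."""
    return "first" if len(first) >= len(second) else "second"


print(debiased_pairwise("x", "y", always_first))                          # tie
print(debiased_pairwise("short", "a much longer answer", prefers_longer))  # b
```

The second stub shows the limit of this check: it removes position effects but does nothing about verbosity bias, which needs its own mitigation.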
