Enterprise LLM Evals · Lesson 5

Multimodal & Specialized Evals

How evals change for code, SQL, summarization, classification, vision, and voice


What you'll learn

- Why specialized outputs need specialized evals
- Execution-based vs judge-based evaluation
- Evals for code, SQL, summarization, classification, multimodal, and voice

Core idea

The more executable or domain-specific the output, the more you should prefer execution-based and expert-grounded evals over generic judge scores.

1. Specialized Eval Types

Task	Best eval
Code generation	Unit tests, compile tests, static analysis, security scan, HumanEval/MBPP benchmarks
SQL / text-to-SQL	Execution equivalence, row-level comparison, schema safety, read/write restrictions
Summarization	Coverage, factual consistency, omission/error analysis, human pairwise review
Classification	Confusion matrix, precision/recall/F1, calibration, threshold analysis
Vision / multimodal	Task-specific labels, object/field correctness, OCR accuracy, human review for ambiguity
Voice agents	ASR accuracy, latency, interruption handling, task completion, safety policy
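For the voice row, ASR accuracy is commonly reported as word error rate (WER): the word-level edit distance between the reference transcript and the recognized transcript, divided by the number of reference words. The following is a minimal sketch of that calculation. The function name `word_error_rate` and the lowercase-and-split normalization are assumptions for illustration; production pipelines usually apply more careful text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: lowercase + whitespace split is enough normalization here.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.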

2. Execution-Based vs Judge-Based

If the output can be executed: RUN IT.
- Code -> compile + unit tests
- SQL -> execute against test DB, compare rows
- JSON -> parse + validate schema

If the output can be compared exactly: COMPARE IT.
- Classification labels -> confusion matrix
- Extracted fields -> exact match

If the output is semantic: JUDGE IT.
- Summaries -> rubric + human calibration
- Answers -> reference-based LLM judge
- Tone -> rubric judge + human spot-check
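The "run it" tier can be sketched in a few lines for the code and JSON cases. This is an illustration under stated assumptions, not a production harness: the helper names `eval_generated_code` and `eval_json_output` are hypothetical, the "schema" is just a mapping of required keys to Python types, and `exec` runs the generated code in-process, which is only acceptable for trusted inputs. Untrusted model output should run in a sandbox.

```python
import json


def eval_generated_code(source: str, func_name: str,
                        cases: list[tuple[tuple, object]]) -> dict:
    """Compile generated source, then run it against (args, expected) pairs."""
    result = {"compiles": False, "passed": 0, "total": len(cases)}
    try:
        code = compile(source, "<generated>", "exec")
    except SyntaxError:
        return result
    result["compiles"] = True

    namespace: dict = {}
    try:
        exec(code, namespace)  # WARNING: sandbox this for untrusted code
        func = namespace[func_name]
    except Exception:
        return result  # import-time crash or missing function

    for args, expected in cases:
        try:
            if func(*args) == expected:
                result["passed"] += 1
        except Exception:
            pass  # a crash counts as a failed case
    return result


def eval_json_output(raw: str, required: dict[str, type]) -> bool:
    """Parse the output, then check required keys and their value types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), typ) for key, typ in required.items())
```

Both checks return deterministic pass/fail signals, so they can gate a release without any judge model in the loop.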

3. Enterprise Rule

If the output can be executed, run it. If it can be compared exactly, compare it exactly. Use LLM judges for semantic judgment, not as a lazy replacement for deterministic truth.

The most common mistake is using a vague LLM-as-judge for a task that has a deterministic answer. If you generated SQL, don't ask an LLM "is this good SQL?" — execute it and compare results.
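A minimal sketch of "execute it and compare results", using Python's built-in `sqlite3` module. The `orders` table, its rows, and the helper names `build_test_db` and `rows_match` are made up for illustration; a real eval would run against a fixture database that mirrors the production schema.

```python
import sqlite3
from collections import Counter


def build_test_db() -> sqlite3.Connection:
    """Create a small in-memory fixture database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
        INSERT INTO orders VALUES
            (1, 'acme', 120.0), (2, 'acme', 80.0), (3, 'globex', 50.0);
    """)
    return conn


def rows_match(conn: sqlite3.Connection, generated_sql: str,
               reference_sql: str, ordered: bool = False) -> bool:
    """Execute both queries and compare the rows they return."""
    try:
        got = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        return False  # invalid SQL is a failed eval, not a crash
    expected = conn.execute(reference_sql).fetchall()
    if ordered:
        return got == expected
    # Compare as multisets so row order is ignored but duplicates still count.
    return Counter(got) == Counter(expected)
```

Comparing returned rows rather than SQL text means two differently written but equivalent queries both pass. Set `ordered=True` only when the question itself asks for a specific ordering.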

⚡ Practice Drill

Q1: You have a text-to-SQL agent. What evals do you run?

SQL validity (does it parse?), execution correctness (do returned rows match expected?), schema safety (no DROP/DELETE/ALTER), read-only enforcement, PII masking, query cost limits, and explanation quality (can a human understand what the query does?).
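Two of these checks, schema safety and read-only enforcement, can be sketched as follows. The keyword pre-check is a coarse heuristic (it can reject a harmless query that mentions a forbidden word in a string literal), so the sketch pairs it with runtime enforcement through SQLite's `PRAGMA query_only`. The helper names and the keyword list are assumptions for illustration; other databases would typically enforce this with a read-only role instead.

```python
import re
import sqlite3

# Assumption: this deny-list is illustrative, not exhaustive.
FORBIDDEN = re.compile(
    r"\b(DROP|DELETE|ALTER|INSERT|UPDATE|TRUNCATE|CREATE)\b", re.IGNORECASE
)


def static_safety_check(sql: str) -> bool:
    """Cheap pre-check: reject multi-statement input and write/DDL keywords."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:
        return False  # more than one statement
    return FORBIDDEN.search(stripped) is None


def runs_read_only(conn: sqlite3.Connection, sql: str) -> bool:
    """Execute with writes disabled; any attempted write raises an error."""
    conn.execute("PRAGMA query_only = ON")
    try:
        conn.execute(sql).fetchall()
        return True
    except (sqlite3.Error, sqlite3.Warning):
        return False
```

The static check is fast enough to run on every generated query; the runtime check is the one to trust, because it does not depend on guessing what the SQL text means.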

Q2: Why is summarization harder to evaluate than classification?

Classification has a finite set of labels — you can compute precision, recall, F1. Summaries are semantic: there is no single "correct" summary. You need rubrics (coverage, factual consistency, omission risk) and human pairwise comparison.
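To make the classification side concrete, here is a small sketch that builds a confusion matrix and per-label precision, recall, and F1 from gold and predicted labels. The function name `per_label_metrics` and the choice to report 0.0 when a denominator is zero are assumptions for illustration; libraries such as scikit-learn provide equivalent, more complete implementations.

```python
from collections import Counter


def per_label_metrics(y_true: list[str], y_pred: list[str]):
    """Return (confusion counts, per-label precision/recall/F1)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # confusion[(true_label, predicted_label)] -> count
    confusion = Counter(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))

    metrics = {}
    for label in labels:
        tp = confusion[(label, label)]
        fp = sum(confusion[(other, label)] for other in labels if other != label)
        fn = sum(confusion[(label, other)] for other in labels if other != label)
        # Assumption: report 0.0 when a denominator is zero.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return confusion, metrics
```

Nothing comparable exists for summaries: there is no gold label to count against, which is why that task falls back to rubrics and human comparison.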
