Enterprise LLM Evals · Lesson 5
Multimodal & Specialized Evals
How evals change for code, SQL, summarization, classification, vision, and voice
- Why specialized outputs need specialized evals
- Execution-based vs judge-based evaluation
- Evals for code, SQL, summarization, classification, multimodal, and voice
Core idea
The more executable or domain-specific the output, the more you should prefer execution-based and expert-grounded evals over generic judge scores.
1. Specialized Eval Types
| Task | Best eval |
|---|---|
| Code generation | Unit tests, compile tests, static analysis, security scan, HumanEval/MBPP benchmarks |
| SQL / text-to-SQL | Execution equivalence, row-level comparison, schema safety, read/write restrictions |
| Summarization | Coverage, factual consistency, omission/error analysis, human pairwise review |
| Classification | Confusion matrix, precision/recall/F1, calibration, threshold analysis |
| Vision / multimodal | Task-specific labels, object/field correctness, OCR accuracy, human review for ambiguity |
| Voice agents | ASR accuracy, latency, interruption handling, task completion, safety policy |
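To make the "Unit tests" entry for code generation concrete, here is a minimal sketch of an execution-based check. The `generated_source` string, the `add` function name, and the test cases are hypothetical stand-ins for real model output and a real test suite. A production harness would run the code in a sandbox rather than calling `exec()` in-process.

```python
# Hypothetical model output: source code that should define `add(a, b)`.
generated_source = """
def add(a, b):
    return a + b
"""

# Hypothetical test cases: (arguments, expected result).
test_cases = [((1, 2), 3), ((-1, 1), 0), ((0, 0), 0)]


def run_generated_tests(source, func_name, cases):
    """Execute generated source and score it against expected outputs."""
    namespace = {}
    try:
        # WARNING: exec() on untrusted model output is unsafe outside a sandbox.
        exec(source, namespace)
    except Exception as exc:  # includes SyntaxError, i.e. a failed "compile test"
        return {"compiled": False, "passed": 0, "total": len(cases), "error": repr(exc)}

    func = namespace.get(func_name)
    if not callable(func):
        return {"compiled": True, "passed": 0, "total": len(cases), "error": "function missing"}

    passed = 0
    for args, expected in cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a runtime error counts as a failed case
    return {"compiled": True, "passed": passed, "total": len(cases), "error": None}


result = run_generated_tests(generated_source, "add", test_cases)
print(result)
```

The pass count is a deterministic score: the same output always gets the same grade, which a judge model cannot guarantee.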
2. Execution-Based vs Judge-Based
3. Enterprise Rule
If the output can be executed, run it. If it can be compared exactly, compare it exactly. Use LLM judges for semantic judgment, not as a lazy replacement for deterministic truth.
A common mistake is asking a vague LLM-as-judge to grade a task that has a deterministic answer. If you generated SQL, don't ask an LLM "is this good SQL?"; execute it and compare the results against a reference query.
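Executing and comparing results can be sketched with Python's built-in `sqlite3` module. The `orders` table, its rows, and the three queries are invented for illustration. A real eval would run against a fixture copy of the production schema.

```python
import sqlite3
from collections import Counter


def rows_match(conn, generated_sql, reference_sql, ordered=False):
    """Run both queries and compare their result rows."""
    got = conn.execute(generated_sql).fetchall()
    want = conn.execute(reference_sql).fetchall()
    if ordered:
        return got == want
    # Compare as multisets so row order is ignored but duplicates still count.
    return Counter(got) == Counter(want)


# Hypothetical fixture database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "acme", 120.0), (2, "acme", 80.0), (3, "globex", 50.0)],
)

reference_sql = "SELECT customer, SUM(total) FROM orders GROUP BY customer"
# Written differently from the reference, but returns the same rows.
generated_sql = (
    "SELECT o.customer, SUM(o.total) FROM orders AS o "
    "GROUP BY o.customer ORDER BY o.customer DESC"
)
# Looks plausible, but aggregates with MAX instead of SUM.
wrong_sql = "SELECT customer, MAX(total) FROM orders GROUP BY customer"

print(rows_match(conn, generated_sql, reference_sql))  # True
print(rows_match(conn, wrong_sql, reference_sql))      # False
```

The two passing queries differ textually, so a string comparison would reject a correct answer, while the MAX variant could easily pass a superficial judge review. Comparing rows avoids both failure modes.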
Q1: You have a text-to-SQL agent. What evals do you run?
SQL validity (does it parse?), execution correctness (do the returned rows match the expected rows?), schema safety (no DROP or ALTER), read-only enforcement (no INSERT, UPDATE, or DELETE), PII masking, query cost limits, and explanation quality (can a human understand what the query does?).
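The schema-safety and read-only checks can also be enforced by the database rather than by inspecting query text. The sketch below uses `sqlite3`'s authorizer hook as one possible mechanism. The allowlist and the `orders` table are assumptions for illustration, and other databases would typically rely on a read-only role or connection instead.

```python
import sqlite3

# Assumed allowlist: the only actions a read-only query may perform.
ALLOWED_ACTIONS = {sqlite3.SQLITE_SELECT, sqlite3.SQLITE_READ, sqlite3.SQLITE_FUNCTION}


def read_only_authorizer(action, arg1, arg2, db_name, source):
    """Permit reads; deny everything else (DROP, DELETE, ALTER, INSERT, ...)."""
    return sqlite3.SQLITE_OK if action in ALLOWED_ACTIONS else sqlite3.SQLITE_DENY


def passes_read_only_check(conn, sql):
    """Return True if SQLite runs `sql` under the read-only authorizer.

    Note: unparseable SQL also returns False, so pair this with a
    separate validity check if you need to tell the two apart.
    """
    try:
        conn.execute(sql).fetchall()
        return True
    except sqlite3.DatabaseError:
        return False


# Hypothetical fixture database, populated before the authorizer is installed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "acme", 120.0), (2, "globex", 50.0)],
)
conn.commit()
conn.set_authorizer(read_only_authorizer)

for sql in [
    "SELECT customer, SUM(total) FROM orders GROUP BY customer",
    "DELETE FROM orders",
    "DROP TABLE orders",
]:
    print(passes_read_only_check(conn, sql), sql)
```

Denying by default is the design choice here: a keyword blocklist has to anticipate every dangerous statement, while an allowlist only has to name the safe ones.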
Q2: Why is summarization harder to evaluate than classification?
Classification has a finite set of labels, so you can compute precision, recall, and F1 directly against ground truth. Summaries are semantic: there is no single "correct" summary to compare against. You need rubrics (coverage, factual consistency, omission risk) and human pairwise comparison.
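For the classification side, the metrics reduce to counting. A minimal sketch with made-up labels and predictions (the `spam`/`ham` data is hypothetical):

```python
from collections import Counter


def per_label_metrics(y_true, y_pred):
    """Compute precision, recall, and F1 for each label from confusion counts."""
    labels = sorted(set(y_true) | set(y_pred))
    # confusion[(true_label, predicted_label)] -> number of examples
    confusion = Counter(zip(y_true, y_pred))
    metrics = {}
    for label in labels:
        tp = confusion[(label, label)]
        fp = sum(confusion[(other, label)] for other in labels if other != label)
        fn = sum(confusion[(label, other)] for other in labels if other != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


# Hypothetical ground truth and model predictions.
y_true = ["spam", "spam", "ham", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "ham", "spam", "spam"]

for label, scores in per_label_metrics(y_true, y_pred).items():
    print(label, scores)
```

No comparable counting procedure exists for a summary, which is why summarization falls back on rubrics and pairwise human review.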