Enterprise LLM Evals · Lesson 5

Multimodal & Specialized Evals

How evals change for code, SQL, summarization, classification, vision, and voice

What you'll learn
  1. Why specialized outputs need specialized evals
  2. Execution-based vs judge-based evaluation
  3. Evals for code, SQL, summarization, classification, multimodal, and voice

Core idea

The more executable or domain-specific the output, the more you should prefer execution-based and expert-grounded evals over generic judge scores.

1. Specialized Eval Types

Task | Best eval
Code generation | Unit tests, compile tests, static analysis, security scan, HumanEval/MBPP benchmarks
SQL / text-to-SQL | Execution equivalence, row-level comparison, schema safety, read/write restrictions
Summarization | Coverage, factual consistency, omission/error analysis, human pairwise review
Classification | Confusion matrix, precision/recall/F1, calibration, threshold analysis
Vision / multimodal | Task-specific labels, object/field correctness, OCR accuracy, human review for ambiguity
Voice agents | ASR accuracy, latency, interruption handling, task completion, safety policy
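For the voice row, one common way to put a number on ASR accuracy is word error rate (WER): the word-level edit distance between the reference transcript and the recognizer's output, divided by the number of reference words. The sketch below is a minimal illustration, not a production scorer. The function name and the lowercase-and-split normalization are assumptions; real pipelines usually normalize punctuation and numerals as well.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.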

2. Execution-Based vs Judge-Based

If the output can be executed: RUN IT.
  - Code -> compile + unit tests
  - SQL -> execute against test DB, compare rows
  - JSON -> parse + validate schema

If the output can be compared exactly: COMPARE IT.
  - Classification labels -> confusion matrix
  - Extracted fields -> exact match

If the output is semantic: JUDGE IT.
  - Summaries -> rubric + human calibration
  - Answers -> reference-based LLM judge
  - Tone -> rubric judge + human spot-check
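As a rough illustration of the "Code -> compile + unit tests" path, the sketch below compiles a generated Python snippet and then runs input/expected-output cases against it. Everything here is hypothetical scaffolding: `run_code_eval`, the report fields, and the case format are assumptions, and the bare `exec` call stands in for whatever sandbox a real harness would use.

```python
from typing import Any


def run_code_eval(source: str, func_name: str,
                  cases: list[tuple[tuple, Any]]) -> dict:
    """Compile generated source, then run (args, expected) cases against it."""
    report = {"compiled": False, "passed": 0, "total": len(cases)}

    # Step 1: compile check. A syntax error fails every case.
    try:
        code = compile(source, "<generated>", "exec")
    except SyntaxError:
        return report
    report["compiled"] = True

    # Step 2: load the definitions.
    # ASSUMPTION: a real harness would do this in an isolated sandbox.
    namespace: dict[str, Any] = {}
    try:
        exec(code, namespace)
    except Exception:
        return report

    func = namespace.get(func_name)
    if not callable(func):
        return report

    # Step 3: unit tests. A crash counts as a failed case.
    for args, expected in cases:
        try:
            if func(*args) == expected:
                report["passed"] += 1
        except Exception:
            pass
    return report
```

Calling `run_code_eval("def add(a, b):\n    return a + b\n", "add", [((1, 2), 3)])` reports one passing case out of one. The pass rate over a suite of such cases is a deterministic score that needs no judge model.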

3. Enterprise Rule

If the output can be executed, run it. If it can be compared exactly, compare it exactly. Use LLM judges for semantic judgment, not as a lazy replacement for deterministic truth.

The most common mistake is using a vague LLM-as-judge for a task that has a deterministic answer. If you generated SQL, don't ask an LLM "is this good SQL?" — execute it and compare results.
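A minimal sketch of "execute it and compare results", assuming a small SQLite test database and a hand-written reference query for each eval case. The `orders` table, its rows, and the helper names are made up for illustration. Rows are compared as a multiset, so ordering differences do not fail the case.

```python
import sqlite3
from collections import Counter


def make_test_db() -> sqlite3.Connection:
    """Build a tiny in-memory fixture database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "acme", 120.0), (2, "acme", 80.0), (3, "globex", 50.0)],
    )
    conn.commit()
    return conn


def rows_match(conn: sqlite3.Connection, generated_sql: str,
               reference_sql: str) -> bool:
    """True when both queries return the same multiset of rows."""
    try:
        generated_rows = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        return False  # invalid or failing SQL is a failed eval case
    reference_rows = conn.execute(reference_sql).fetchall()
    # Counter ignores row order but still catches duplicate-row differences.
    return Counter(generated_rows) == Counter(reference_rows)
```

If a task requires a specific ordering (for example "top 5 customers"), compare the row lists directly instead of using `Counter`.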

⚡ Practice Drill
Q1: You have a text-to-SQL agent. What evals do you run?

SQL validity (does it parse?), execution correctness (do returned rows match expected?), schema safety (no DROP/DELETE/ALTER), read-only enforcement, PII masking, query cost limits, and explanation quality (can a human understand what the query does?).
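Two of these checks, SQL validity and read-only enforcement, can be sketched together if the test database is SQLite. The example below relies on SQLite's `PRAGMA query_only`, which makes the connection reject writes, so a destructive statement fails at execution time instead of being caught by keyword matching. The function name and return shape are assumptions, and other database engines need their own mechanism, such as a read-only role.

```python
import sqlite3


def check_query(conn: sqlite3.Connection, sql: str) -> tuple[bool, str]:
    """Run sql with writes disabled; return (ok, reason)."""
    # While query_only is ON, SQLite refuses INSERT/UPDATE/DELETE/DROP/etc.
    conn.execute("PRAGMA query_only = ON")
    try:
        conn.execute(sql).fetchall()
        return True, "ok"
    except sqlite3.Error as exc:
        # Covers both syntax errors (validity) and blocked writes (safety).
        conn.rollback()  # clear any transaction the failed statement opened
        return False, str(exc)
    finally:
        conn.execute("PRAGMA query_only = OFF")
```

The returned reason string distinguishes a parse failure from a blocked write, which is useful when bucketing failures in an eval report.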

Q2: Why is summarization harder to evaluate than classification?

Classification has a finite set of labels — you can compute precision, recall, F1. Summaries are semantic: there is no single "correct" summary. You need rubrics (coverage, factual consistency, omission risk) and human pairwise comparison.
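To make the classification half concrete, here is a small sketch that derives per-label precision, recall, and F1 from confusion-matrix counts using only the standard library. The function name and report shape are assumptions for illustration. Zero denominators are scored as 0.0 here, which is one convention among several.

```python
from collections import Counter


def classification_report(y_true: list[str],
                          y_pred: list[str]) -> dict[str, dict[str, float]]:
    """Per-label precision, recall, and F1 from confusion-matrix counts."""
    # confusion[(true_label, predicted_label)] -> number of examples
    confusion = Counter(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))

    report = {}
    for label in labels:
        tp = confusion[(label, label)]
        fp = sum(confusion[(other, label)] for other in labels if other != label)
        fn = sum(confusion[(label, other)] for other in labels if other != label)
        # Assumed convention: an undefined ratio is reported as 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report
```

No equivalent closed-form computation exists for a summary, which is why that side of the comparison falls back to rubrics and human pairwise review.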
