Enterprise LLM Evals · Lesson 8

Cost, Latency & Reliability Evals

Why enterprise evals include non-quality metrics

What you'll learn

Why a correct answer that costs too much or takes too long is not production-ready
Non-quality eval metrics: latency, cost, token usage, error rate, fallback rate
How to design budget gates for release decisions
Reliability patterns for production AI

Core idea

A correct answer that costs too much, takes too long, or fails under load is not production-ready.

1. Non-Quality Evals

Metric	Question
Latency p50/p95/p99	Can users tolerate response time?
Token cost per task	Is unit economics acceptable?
Tool call count	Is the workflow efficient?
Timeout/error rate	Is it reliable under real conditions?
Fallback rate	How often do we need backup models or degraded mode?
Cache hit rate	Are repeated tasks optimized safely?
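As a rough illustration, the metrics in the table can be computed from per-task logs. The sketch below assumes a hypothetical `TaskRecord` shape and hypothetical per-1k-token prices passed in by the caller; neither comes from a specific provider. It uses the nearest-rank percentile definition, which is one of several common conventions.

```python
import math
from dataclasses import dataclass


@dataclass
class TaskRecord:
    # Hypothetical log record for one completed task.
    latency_s: float
    input_tokens: int
    output_tokens: int
    tool_calls: int
    errored: bool        # timed out or returned an error
    used_fallback: bool  # served by a backup model or degraded path
    cache_hit: bool


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(records, usd_per_1k_input, usd_per_1k_output):
    """Aggregate non-quality metrics over a non-empty batch of task records."""
    n = len(records)
    latencies = [r.latency_s for r in records]
    total_cost = sum(
        r.input_tokens / 1000 * usd_per_1k_input
        + r.output_tokens / 1000 * usd_per_1k_output
        for r in records
    )
    return {
        "latency_p50_s": percentile(latencies, 50),
        "latency_p95_s": percentile(latencies, 95),
        "latency_p99_s": percentile(latencies, 99),
        "avg_cost_usd_per_task": total_cost / n,
        "avg_tool_calls": sum(r.tool_calls for r in records) / n,
        "error_rate": sum(r.errored for r in records) / n,
        "fallback_rate": sum(r.used_fallback for r in records) / n,
        "cache_hit_rate": sum(r.cache_hit for r in records) / n,
    }
```

Tail percentiles need enough samples to be meaningful: with only a few dozen records, p99 is effectively the maximum.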

2. Budget Gates

Ship criteria example:

- correctness >= 0.85
- groundedness >= 0.90
- PII leak tests = 0 failures
- p95 latency <= 4s
- average cost <= $0.03 / resolved task
- tool error rate <= 1%

A model swap that improves correctness by 3% but doubles latency and triples cost may be blocked. Enterprise release decisions weigh quality against cost, latency, and reliability together.
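One way to make the ship criteria above mechanical is to encode each gate as a (metric, comparison, threshold) triple and block the release on any failure. This is a minimal sketch: the thresholds are the example values from this lesson, while the metric key names and the `evaluate_gates` helper are hypothetical.

```python
import operator

# (metric name, comparison, threshold), mirroring the example ship criteria.
GATES = [
    ("correctness", operator.ge, 0.85),
    ("groundedness", operator.ge, 0.90),
    ("pii_leak_failures", operator.eq, 0),
    ("p95_latency_s", operator.le, 4.0),
    ("avg_cost_usd_per_resolved_task", operator.le, 0.03),
    ("tool_error_rate", operator.le, 0.01),
]


def evaluate_gates(metrics, gates=GATES):
    """Return the list of failed gates; an empty list means the candidate may ship."""
    failures = []
    for name, compare, threshold in gates:
        value = metrics.get(name)
        # A missing metric counts as a failure: an unmeasured gate is not a passed gate.
        if value is None or not compare(value, threshold):
            failures.append((name, value, threshold))
    return failures


# Hypothetical candidate: better quality, but slower and more expensive.
candidate = {
    "correctness": 0.88,
    "groundedness": 0.92,
    "pii_leak_failures": 0,
    "p95_latency_s": 6.5,
    "avg_cost_usd_per_resolved_task": 0.07,
    "tool_error_rate": 0.005,
}

for name, value, threshold in evaluate_gates(candidate):
    print(f"BLOCKED by {name}: got {value}, gate is {threshold}")
```

Treating every gate as a hard blocker is a design choice; some teams may prefer to let a small regression on one metric be approved explicitly rather than failing automatically.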

3. Reliability Patterns

Production reliability for AI

Timeouts and retries with idempotency — LLM calls can hang or fail; retry safely
Fallback models or degraded non-LLM paths — if the primary model is down, route to a cheaper/faster backup or a template-based response
Queue async expensive evaluations — don't block user responses on judge calls; score asynchronously and attach later
Separate user-facing runtime from judge/eval runtime — production latency should not include eval scoring
Alert on cost spikes and prompt-loop behavior — a model stuck in a reasoning loop can burn through budget fast
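The first two patterns above (timeouts and retries with idempotency, then fallback) can be sketched together. Everything named here is an assumption for illustration: `primary` and `fallback` are hypothetical callables that accept `timeout_s` and `idempotency_key` and raise `TimeoutError` or `TransientError` on failure. Whether a given provider honors an idempotency key depends on its API.

```python
import time
import uuid


class TransientError(Exception):
    """Hypothetical error type for retryable failures (rate limits, 5xx responses)."""


def call_with_retries(call, *, attempts=3, timeout_s=10.0,
                      base_delay_s=0.5, sleep=time.sleep):
    """Call `call` up to `attempts` times with exponential backoff between tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    # One key for all attempts, so a backend that supports idempotency
    # can deduplicate a retry of a request that actually succeeded.
    idempotency_key = str(uuid.uuid4())
    last_error = None
    for attempt in range(attempts):
        try:
            return call(timeout_s=timeout_s, idempotency_key=idempotency_key)
        except (TimeoutError, TransientError) as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay_s * 2 ** attempt)
    raise last_error


def answer(prompt, primary, fallback, template_reply, **retry_kwargs):
    """Try the primary model, then the fallback model, then a non-LLM template."""
    for route, model in (("primary", primary), ("fallback", fallback)):
        try:
            text = call_with_retries(
                lambda **kw: model(prompt, **kw), **retry_kwargs
            )
            return route, text
        except (TimeoutError, TransientError):
            continue  # this route is exhausted; degrade to the next one
    return "template", template_reply
```

Returning the route alongside the text makes the fallback rate directly countable in logs. The `sleep` parameter is injectable so tests can skip the backoff delays.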

⚡ Practice Drill

Q1: A model swap improves correctness by 5% but increases p95 latency from 3s to 8s and cost from $0.02 to $0.08 per task. Do you ship?

It depends on the SLO and budget. If the SLO is p95 < 4s, the new 8s p95 blocks the release until it is optimized (caching, streaming, or routing simple cases to a smaller model). If the budget is $0.05/task, the new $0.08 cost blocks it as well. Either way, the decision comes from checking against defined gates, not vibes.
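The drill's numbers can be checked directly. This small sketch uses the figures from Q1 and the SLO and budget named in the answer; the dictionary keys and the `drill_blockers` helper are hypothetical names.

```python
baseline = {"p95_latency_s": 3.0, "cost_usd_per_task": 0.02}
candidate = {"p95_latency_s": 8.0, "cost_usd_per_task": 0.08}

SLO_P95_S = 4.0             # SLO from the answer: p95 < 4s
BUDGET_USD_PER_TASK = 0.05  # budget from the answer: $0.05/task


def drill_blockers(metrics):
    """Return the reasons a candidate cannot ship under the drill's SLO and budget."""
    blockers = []
    if not metrics["p95_latency_s"] < SLO_P95_S:
        blockers.append("p95 latency exceeds SLO")
    if metrics["cost_usd_per_task"] > BUDGET_USD_PER_TASK:
        blockers.append("cost exceeds budget")
    return blockers


latency_ratio = candidate["p95_latency_s"] / baseline["p95_latency_s"]
cost_ratio = candidate["cost_usd_per_task"] / baseline["cost_usd_per_task"]
print(f"latency x{latency_ratio:.2f}, cost x{cost_ratio:.2f}")
print(drill_blockers(candidate))
```

The 5% correctness gain buys roughly 2.7 times the p95 latency and 4 times the cost, and the candidate fails both gates while the baseline passes both.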
