Enterprise LLM Evals · Lesson 8
Cost, Latency & Reliability Evals
Why enterprise evals include non-quality metrics
What you'll learn
- Why a correct answer that costs too much or takes too long is not production-ready
- Non-quality eval metrics: latency, cost, token usage, error rate, fallback rate
- How to design budget gates for release decisions
- Reliability patterns for production AI
Core idea
A correct answer that costs too much, takes too long, or fails under load is not production-ready.
1. Non-Quality Evals
| Metric | Question |
|---|---|
| Latency p50/p95/p99 | Can users tolerate response time? |
| Token cost per task | Are the unit economics acceptable? |
| Tool call count | Is the workflow efficient? |
| Timeout/error rate | Is it reliable under real conditions? |
| Fallback rate | How often do we need backup models or degraded mode? |
| Cache hit rate | Are repeated tasks optimized safely? |
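The metrics in the table can all be derived from per-request logs. The sketch below is one possible way to do that, not a prescribed implementation: the `CallRecord` fields, the `summarize` function, and the per-1k-token prices passed to it are assumptions made for illustration. Percentiles use the nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass
class CallRecord:
    """One logged request. Field names are hypothetical."""
    latency_s: float
    input_tokens: int
    output_tokens: int
    tool_calls: int
    errored: bool
    used_fallback: bool
    cache_hit: bool


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(records, usd_per_1k_input, usd_per_1k_output):
    """Aggregate non-quality metrics over a batch of records."""
    if not records:
        raise ValueError("summarize() needs at least one record")
    n = len(records)
    latencies = [r.latency_s for r in records]
    total_cost = sum(
        r.input_tokens / 1000 * usd_per_1k_input
        + r.output_tokens / 1000 * usd_per_1k_output
        for r in records
    )
    return {
        "p50_s": percentile(latencies, 50),
        "p95_s": percentile(latencies, 95),
        "p99_s": percentile(latencies, 99),
        "avg_cost_usd": total_cost / n,
        "avg_tool_calls": sum(r.tool_calls for r in records) / n,
        "error_rate": sum(r.errored for r in records) / n,
        "fallback_rate": sum(r.used_fallback for r in records) / n,
        "cache_hit_rate": sum(r.cache_hit for r in records) / n,
    }
```

Reporting p95 and p99 alongside p50 matters because a healthy median can hide a slow tail that a meaningful share of users still experience.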
2. Budget Gates
Ship criteria example:
- correctness >= 0.85
- groundedness >= 0.90
- PII leak tests = 0 failures
- p95 latency <= 4s
- average cost <= $0.03 / resolved task
- tool error rate <= 1%
A model swap that improves correctness by 3% but doubles latency and triples cost may still be blocked, because a gain on one metric does not offset a breach on another. Enterprise release decisions weigh quality, cost, latency, and reliability together.
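The ship criteria above can be encoded as data and checked mechanically. The following is a minimal sketch under assumptions: the `Gate` class, the metric key names, and `evaluate_gates` are hypothetical, and the thresholds simply mirror the example list. Every gate must pass, and a missing metric is treated as a failure.

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Gate:
    metric: str                               # key in the metrics dict (hypothetical names)
    compare: Callable[[float, float], bool]   # e.g. operator.ge for "at least"
    threshold: float


# Thresholds copied from the example ship criteria.
SHIP_GATES = [
    Gate("correctness", operator.ge, 0.85),
    Gate("groundedness", operator.ge, 0.90),
    Gate("pii_leak_failures", operator.eq, 0),
    Gate("p95_latency_s", operator.le, 4.0),
    Gate("avg_cost_usd", operator.le, 0.03),
    Gate("tool_error_rate", operator.le, 0.01),
]


def evaluate_gates(metrics, gates=SHIP_GATES):
    """Return (ship, failures). All gates must pass; a missing metric fails its gate."""
    failures = []
    for gate in gates:
        value = metrics.get(gate.metric)
        if value is None or not gate.compare(value, gate.threshold):
            failures.append(f"{gate.metric}={value} violates threshold {gate.threshold}")
    return (not failures, failures)


# A candidate with better correctness but worse latency and cost is blocked.
candidate = {
    "correctness": 0.88,
    "groundedness": 0.92,
    "pii_leak_failures": 0,
    "p95_latency_s": 8.0,
    "avg_cost_usd": 0.09,
    "tool_error_rate": 0.005,
}
ship, failures = evaluate_gates(candidate)
print(ship, failures)
```

Keeping the gates as a plain list makes the release decision auditable: the thresholds live in one reviewable place rather than in someone's judgment on release day.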
3. Reliability Patterns
Production reliability for AI
- Timeouts and retries with idempotency: LLM calls can hang or fail, so set a deadline and make retries safe to repeat
- Fallback models or degraded non-LLM paths: if the primary model is down, route to a cheaper/faster backup or a template-based response
- Queue expensive evaluations asynchronously: don't block user responses on judge calls; score later and attach the result to the logged response
- Separate the user-facing runtime from the judge/eval runtime: production latency should not include eval scoring
- Alert on cost spikes and prompt-loop behavior: a model stuck in a reasoning loop can burn through budget fast
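The first two patterns in the list can be combined in one wrapper. This is a sketch under stated assumptions rather than a reference implementation: `primary` and `fallback` are hypothetical callables standing in for real clients, `primary` is assumed to accept `timeout_s` and `idempotency_key` arguments and to raise `TimeoutError` or the illustrative `TransientError` on retryable failures.

```python
import time
import uuid


class TransientError(Exception):
    """Hypothetical error for retryable failures such as rate limits or 5xx responses."""


def answer_with_fallback(
    prompt,
    primary,
    fallback,
    *,
    timeout_s=4.0,
    max_attempts=3,
    base_delay_s=0.5,
    sleep=time.sleep,
):
    """Try the primary model with a timeout and bounded retries, then degrade."""
    # One key per logical request, reused on every retry, so a server that
    # honors it can deduplicate work instead of executing the call twice.
    request_id = str(uuid.uuid4())
    for attempt in range(max_attempts):
        try:
            text = primary(prompt, timeout_s=timeout_s, idempotency_key=request_id)
            return {"text": text, "source": "primary", "attempts": attempt + 1}
        except (TimeoutError, TransientError):
            if attempt < max_attempts - 1:
                # Exponential backoff: 0.5s, 1s, 2s, ... with the defaults.
                sleep(base_delay_s * 2 ** attempt)
    # Primary exhausted: serve the backup model or a template-based response.
    return {"text": fallback(prompt), "source": "fallback", "attempts": max_attempts}
```

Returning the `source` field alongside the text lets the caller log which path served the request, which is what the fallback rate metric is computed from.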
⚡ Practice Drill
Q1: A model swap improves correctness by 5% but increases p95 latency from 3s to 8s and cost from $0.02 to $0.08 per task. Do you ship?
Answer: It depends on the SLO and the budget. If the SLO is p95 <= 4s, the new p95 of 8s fails, so you cannot ship without optimization (caching, streaming, or routing simple cases to a smaller model). If the budget is $0.05/task, the new cost of $0.08 fails as well. Check the change against the defined gates, not vibes.