Enterprise LLM Evals · Lesson 8

Cost, Latency & Reliability Evals

Why enterprise evals include non-quality metrics

What you'll learn
  1. Why a correct answer that costs too much or takes too long is not production-ready
  2. Non-quality eval metrics: latency, cost, token usage, error rate, fallback rate
  3. How to design budget gates for release decisions
  4. Reliability patterns for production AI

Core idea

A correct answer that costs too much, takes too long, or fails under load is not production-ready.

1. Non-Quality Evals

Metric | Question
Latency p50/p95/p99 | Can users tolerate response time?
Token cost per task | Is unit economics acceptable?
Tool call count | Is the workflow efficient?
Timeout/error rate | Is it reliable under real conditions?
Fallback rate | How often do we need backup models or degraded mode?
Cache hit rate | Are repeated tasks optimized safely?
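
As a rough illustration, the metrics in the table can be aggregated from per-run records. This is a minimal sketch, not a definitive implementation: the `RunRecord` fields, the `summarize` function, and the nearest-rank percentile method are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One eval or production run. All field names are hypothetical."""
    latency_s: float
    cost_usd: float
    tool_calls: int
    errored: bool        # timed out or raised an error
    used_fallback: bool  # served by a backup model or degraded mode
    cache_hit: bool


def percentile(values, pct):
    """Nearest-rank percentile: the value at rank ceil(pct% of n) in sorted order."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(records):
    """Aggregate the non-quality metrics over a list of runs."""
    if not records:
        raise ValueError("need at least one record")
    n = len(records)
    latencies = [r.latency_s for r in records]
    return {
        "latency_p50_s": percentile(latencies, 50),
        "latency_p95_s": percentile(latencies, 95),
        "latency_p99_s": percentile(latencies, 99),
        "cost_per_task_usd": sum(r.cost_usd for r in records) / n,
        "avg_tool_calls": sum(r.tool_calls for r in records) / n,
        "error_rate": sum(r.errored for r in records) / n,
        "fallback_rate": sum(r.used_fallback for r in records) / n,
        "cache_hit_rate": sum(r.cache_hit for r in records) / n,
    }
```

Tail percentiles are only meaningful with enough samples; with a handful of runs, p95 and p99 collapse to the slowest run.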

2. Budget Gates

Ship criteria example:
  • correctness >= 0.85
  • groundedness >= 0.90
  • PII leak tests = 0 failures
  • p95 latency <= 4s
  • average cost <= $0.03 / resolved task
  • tool error rate <= 1%

A model swap that improves correctness by 3% but doubles latency and triples cost may be blocked. Enterprise release decisions weigh quality against cost, latency, and reliability together.
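
The ship criteria above could be encoded as data and checked mechanically before a release. The sketch below is one way to do it, assuming a metrics dictionary with hypothetical key names (`correctness`, `latency_p95_s`, and so on); the `Gate` class and `failed_gates` function are illustrative, not part of any particular framework.

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Gate:
    metric: str                         # key in the metrics dict (hypothetical names)
    op: Callable[[float, float], bool]  # comparison applied as op(value, threshold)
    threshold: float


# Thresholds copied from the ship criteria example.
SHIP_GATES = [
    Gate("correctness", operator.ge, 0.85),
    Gate("groundedness", operator.ge, 0.90),
    Gate("pii_leak_failures", operator.eq, 0),
    Gate("latency_p95_s", operator.le, 4.0),
    Gate("cost_per_resolved_task_usd", operator.le, 0.03),
    Gate("tool_error_rate", operator.le, 0.01),
]


def failed_gates(metrics, gates=SHIP_GATES):
    """Return the names of gates that fail. A missing metric counts as a failure."""
    failures = []
    for gate in gates:
        value = metrics.get(gate.metric)
        if value is None or not gate.op(value, gate.threshold):
            failures.append(gate.metric)
    return failures


# A candidate that gains correctness but doubles latency and triples cost.
baseline = {
    "correctness": 0.88, "groundedness": 0.93, "pii_leak_failures": 0,
    "latency_p95_s": 3.2, "cost_per_resolved_task_usd": 0.025,
    "tool_error_rate": 0.005,
}
candidate = dict(baseline, correctness=0.91, latency_p95_s=6.4,
                 cost_per_resolved_task_usd=0.075)

print(failed_gates(baseline))
print(failed_gates(candidate))
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a release should not pass a gate simply because nobody measured it.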

3. Reliability Patterns

Production reliability for AI
  • Timeouts and retries with idempotency: LLM calls can hang or fail, so set a timeout on every call and retry only in ways that are safe to repeat
  • Fallback models or degraded non-LLM paths: if the primary model is down, route to a cheaper/faster backup or a template-based response
  • Queue expensive evaluations asynchronously: don't block user responses on judge calls; score in the background and attach the scores later
  • Separate user-facing runtime from judge/eval runtime: production latency should not include eval scoring
  • Alert on cost spikes and prompt-loop behavior: a model stuck in a reasoning loop can burn through budget fast
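
The first two patterns (timeouts and retries with idempotency, then fallback to a backup model or a template) can be sketched with `asyncio`. This is a minimal illustration under stated assumptions: `primary` and `fallback` are hypothetical async callables taking `(prompt, idempotency_key)`, the set of retryable errors is a guess, and whether a provider actually honors an idempotency key depends on that provider.

```python
import asyncio
import random
import uuid

# Assumption: only timeouts and connection errors are treated as transient.
RETRYABLE = (asyncio.TimeoutError, ConnectionError)
TEMPLATE_REPLY = "Sorry, we can't answer right now. Please try again shortly."


async def call_with_retries(model, prompt, *, timeout_s, max_attempts, base_delay_s):
    """Call `model` with a per-attempt timeout, retrying transient failures."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    # One key shared by all attempts, so a server that supports idempotency
    # keys can deduplicate a retry of a request it already processed.
    idempotency_key = uuid.uuid4().hex
    for attempt in range(max_attempts):
        try:
            return await asyncio.wait_for(
                model(prompt, idempotency_key), timeout=timeout_s
            )
        except RETRYABLE:
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the caller choose a fallback
            # Exponential backoff with jitter before the next attempt.
            delay = base_delay_s * 2**attempt
            await asyncio.sleep(delay + random.uniform(0, delay))


async def answer(prompt, primary, fallback, *, timeout_s=4.0,
                 max_attempts=2, base_delay_s=0.5):
    """Return (text, route), where route is 'primary', 'fallback', or 'template'."""
    for route, model in (("primary", primary), ("fallback", fallback)):
        try:
            text = await call_with_retries(
                model, prompt, timeout_s=timeout_s,
                max_attempts=max_attempts, base_delay_s=base_delay_s,
            )
            return text, route
        except RETRYABLE:
            continue  # this route is exhausted; try the next one
    # Degraded non-LLM path: a fixed template response.
    return TEMPLATE_REPLY, "template"
```

Returning the route alongside the text makes the fallback rate easy to log. Note that retries multiply worst-case latency, so the per-attempt timeout and attempt count need to fit inside the latency budget.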
⚡ Practice Drill
Q1: A model swap improves correctness by 5% but increases p95 latency from 3s to 8s and cost from $0.02 to $0.08 per task. Do you ship?

It depends on the SLO and the budget. If the SLO is p95 < 4s, an 8s p95 fails it, so you cannot ship without optimization (caching, streaming, or routing simple cases to a smaller model). If the budget is $0.05/task, a cost of $0.08 fails that gate too. The answer is: check against defined gates, not vibes.
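
The drill's arithmetic can be worked through in a few lines. This sketch uses the numbers from Q1 and the limits named in the answer; the dictionary keys and the `drill_report` function are hypothetical, and treating each limit as an inclusive upper bound is an assumption (it makes no difference here, since both values are far over).

```python
def drill_report(baseline, candidate, limits):
    """Compare a candidate against its baseline and against upper-bound limits."""
    report = {}
    for name, limit in limits.items():
        report[name] = {
            "ratio_vs_baseline": candidate[name] / baseline[name],
            "over_limit_by": candidate[name] - limit,
            "passes": candidate[name] <= limit,
        }
    return report


baseline = {"p95_latency_s": 3.0, "cost_usd_per_task": 0.02}
candidate = {"p95_latency_s": 8.0, "cost_usd_per_task": 0.08}
limits = {"p95_latency_s": 4.0, "cost_usd_per_task": 0.05}

report = drill_report(baseline, candidate, limits)
ship = all(entry["passes"] for entry in report.values())
print(report)
print("ship:", ship)
```

The candidate is roughly 2.7x slower at p95 and 4x more expensive per task, and it misses both limits, so a 5% correctness gain does not clear the gates on its own.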
