Enterprise LLM Evals · Lesson 10

Governance, Risk & Compliance

How evals become enterprise evidence, not just engineering tests

What you'll learn
  1. What governance artifacts enterprises need for AI systems
  2. The NIST AI RMF lifecycle: Govern, Map, Measure, Manage
  3. OWASP LLMSVS security verification dimensions
  4. Why "we tested it" is not enough for high-risk AI

Core idea

In an enterprise, evals are also evidence: they record who approved a release, what was tested, what changed, and what residual risk was accepted.

1. Governance Artifacts

What enterprises need to retain
  • AI system inventory — model, provider, prompts, data sources, tools, owners
  • Risk classification — use case, user impact, regulated domain, data sensitivity
  • Evaluation plan — datasets, metrics, thresholds, review cadence
  • Model/prompt/retriever/tool version history — what changed and when
  • Human review records and sign-offs — who approved and why
  • Incident response and rollback plan — what happens when things go wrong
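As an illustration, the inventory and risk-classification artifacts above could be captured as a typed record with a basic completeness check. This is a minimal sketch: every field name, the `RiskTier` levels, and the `missing_governance_fields` helper are assumptions made for this example, not part of any standard schema.

```python
from dataclasses import dataclass
from enum import Enum


class RiskTier(Enum):
    # Hypothetical tiers; real programs define their own scale.
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class AISystemRecord:
    """One AI system inventory entry (illustrative field names)."""
    system_id: str
    model: str
    provider: str
    prompt_version: str
    data_sources: tuple[str, ...]
    tools: tuple[str, ...]
    owners: tuple[str, ...]
    use_case: str
    regulated_domain: bool
    handles_sensitive_data: bool
    risk_tier: RiskTier


def missing_governance_fields(record: AISystemRecord) -> list[str]:
    """Return the names of required fields that are empty."""
    missing = []
    for name in ("model", "provider", "prompt_version", "use_case"):
        if not getattr(record, name):
            missing.append(name)
    if not record.owners:
        # An inventory entry without a named owner has no accountability.
        missing.append("owners")
    return missing


record = AISystemRecord(
    system_id="support-assistant",
    model="example-model-v1",
    provider="example-provider",
    prompt_version="2024-05-01",
    data_sources=("help-center-articles",),
    tools=("ticket-lookup",),
    owners=(),  # deliberately left empty to trigger the check
    use_case="customer support answers",
    regulated_domain=False,
    handles_sensitive_data=True,
    risk_tier=RiskTier.MEDIUM,
)
print(missing_governance_fields(record))
```

Keeping the record immutable (`frozen=True`) means a change to the model, prompt, or owners produces a new record rather than silently editing the old one, which fits the version-history requirement.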

2. NIST AI RMF Lifecycle

  • GOVERN: roles, policies, accountability
  • MAP: use case, context, impacted users, risks
  • MEASURE: evals, tests, red-team, monitoring
  • MANAGE: mitigation, release decision, incident response, continuous improvement

NIST AI RMF is one of the most widely referenced frameworks for AI risk governance in enterprise settings. It does not prescribe specific tools; it defines a lifecycle of four functions that evals plug into, chiefly under Measure.

3. OWASP LLMSVS

OWASP LLMSVS (Large Language Model Security Verification Standard) covers security verification dimensions, including:

  • Secure configuration
  • Model lifecycle management
  • Memory and RAG storage security
  • Secure LLM integration
  • Agent and plugin security
  • Dependency management
  • Monitoring and anomaly detection

4. Enterprise Rule

For high-risk AI, "we tested it" is not enough. You need versioned evidence that the right tests ran against the right version with agreed thresholds and named owners.

Evidence = dataset version + metric definitions + model/prompt version + eval run results + threshold checks + human sign-off + residual risk notes
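The evidence formula above can be sketched as a small data structure with a release gate. This is one possible shape, not a prescribed format: the class names, field names, metric names, and the choice of a SHA-256 fingerprint are all assumptions for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class ThresholdCheck:
    """One metric compared against its agreed threshold."""
    metric: str
    value: float
    threshold: float
    higher_is_better: bool = True

    @property
    def passed(self) -> bool:
        if self.higher_is_better:
            return self.value >= self.threshold
        return self.value <= self.threshold


@dataclass(frozen=True)
class EvidenceRecord:
    """Illustrative bundle of the evidence components (hypothetical fields)."""
    dataset_version: str
    metric_definitions_version: str
    model_version: str
    prompt_version: str
    checks: tuple[ThresholdCheck, ...]
    approved_by: str  # named owner; empty string means no sign-off yet
    residual_risk_notes: str

    def release_ready(self) -> bool:
        # Requires at least one check, all checks passing, and a sign-off.
        return (
            bool(self.checks)
            and all(c.passed for c in self.checks)
            and bool(self.approved_by)
        )

    def fingerprint(self) -> str:
        # Stable hash of the record so later edits are detectable.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


evidence = EvidenceRecord(
    dataset_version="golden-set-v3",
    metric_definitions_version="metrics-v2",
    model_version="example-model-v1",
    prompt_version="prompt-v7",
    checks=(
        ThresholdCheck("faithfulness", value=0.93, threshold=0.90),
        ThresholdCheck("toxicity_rate", value=0.01, threshold=0.02,
                       higher_is_better=False),
    ),
    approved_by="jane.doe",
    residual_risk_notes="Known weakness on multi-hop questions.",
)
unsigned = replace(evidence, approved_by="")

print(evidence.release_ready())
print(unsigned.release_ready())
print(evidence.fingerprint()[:12])
```

Storing the fingerprint alongside the sign-off ties the approval to the exact dataset, model, and prompt versions that were evaluated; if any of them changes, the hash changes and the earlier approval no longer applies.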
⚡ Practice Drill
Q1: Which six governance artifacts does an enterprise need to retain for AI systems?

AI system inventory, risk classification, evaluation plan, version history, human review/sign-off records, and incident response/rollback plan.

Q2: Why is "we tested it" not sufficient for high-risk AI?

Because without versioned evidence, you cannot prove what was tested, against which version, with what thresholds, or who approved. Governance requires reproducible, auditable evidence — not verbal assurance.
