Enterprise LLM Evals · Lesson 6

Safety, Security & Red-Team Evals

How enterprises test prompt injection, jailbreaks, PII leakage, and tool misuse

Safety, Security & Red-Team Evals

Lesson 6 — prompt injection, jailbreaks, PII leakage, and tool misuse

What you'll learn

Model-layer vs application-layer security risks
What to test: direct/indirect prompt injection, jailbreaks, PII, secrets, unsafe advice, data exfiltration
The red-team loop: generate, run, score, prioritize, mitigate
Standards: OWASP, NIST

Core idea

Security evals are not model vibes. They are adversarial test suites mapped to concrete risk categories and release gates.

1. Threat Split

Model-layer risks Application-layer risks ----------------- ----------------------- harmful content indirect prompt injection jailbreak behavior RAG data leakage bias/toxicity unauthorized tool/API use unsafe advice cross-tenant access leak memorized PII markdown/link exfiltration

Most enterprise risk appears at the application layer: RAG context, tools, permissions, memory, and connectors — not just the base model.

2. What to Test

Direct prompt injection — user asks model to ignore system/developer policy
Indirect prompt injection — malicious instructions hidden in retrieved docs, emails, webpages, tickets, PDFs, tool outputs
Jailbreaks — persona/hypothetical/roleplay/encoding attempts to bypass safety policy
PII and secrets leakage — does the system expose personal data or API keys in responses or logs?
Unauthorized action attempts — can the agent perform write/payment/delete/trade actions without approval?
Unsafe domain advice — medical, financial, legal, cyber, HR, regulated decisions
Data exfiltration — through links, markdown images, hidden encodings, or tool calls

3. The Red-Team Loop

Define risk scope -> Generate adversarial cases -> Run against full app -> Score outputs ^ | | v Add production incidents <- Verify mitigations <- Prioritize failures <- Analyze traces

4. Standards and Tools

Promptfoo and Giskard help automate adversarial test generation and scoring.

OWASP LLMSVS defines security requirements for LLM applications including secure configuration, model lifecycle, memory/RAG storage, agent/plugin security, and monitoring.

NIST AI RMF gives governance framing: Govern, Map, Measure, Manage — covering risk identification, evaluation, mitigation, and continuous improvement.

Sources: Promptfoo: Red Teaming, Giskard: Vulnerability Scanning, OWASP LLMSVS, NIST AI RMF

⚡ Practice Drill

Q1: What makes indirect prompt injection dangerous?

Show answer

The malicious instruction enters through trusted-looking context — a retrieved document, a tool output, a web page — not directly from the user's current prompt. The system may not distinguish between user instructions and content instructions.

Q2: Name three data exfiltration channels in an LLM app.

Show answer

Markdown image tags (load external image = leak data via URL), tool calls that send data to external APIs, and hidden encodings in links that the user might click.

Previous Lesson Next Lesson

← Back to Enterprise LLM Evals

Want to see these patterns in action?

See these eval patterns applied to real AI apps in the Lab.

Explore the Lab →

← Back to Deep Expertise Track