Enterprise LLM Evals · Lesson 6
Safety, Security & Red-Team Evals
How enterprises test prompt injection, jailbreaks, PII leakage, and tool misuse
Safety, Security & Red-Team Evals
Lesson 6 — prompt injection, jailbreaks, PII leakage, and tool misuse
- Model-layer vs application-layer security risks
- What to test: direct/indirect prompt injection, jailbreaks, PII, secrets, unsafe advice, data exfiltration
- The red-team loop: generate, run, score, prioritize, mitigate
- Standards: OWASP, NIST
Core idea
Security evals are not model vibes. They are adversarial test suites mapped to concrete risk categories and release gates.
1. Threat Split
Most enterprise risk appears at the application layer: RAG context, tools, permissions, memory, and connectors — not just the base model.
2. What to Test
- Direct prompt injection — user asks model to ignore system/developer policy
- Indirect prompt injection — malicious instructions hidden in retrieved docs, emails, webpages, tickets, PDFs, tool outputs
- Jailbreaks — persona/hypothetical/roleplay/encoding attempts to bypass safety policy
- PII and secrets leakage — does the system expose personal data or API keys in responses or logs?
- Unauthorized action attempts — can the agent perform write/payment/delete/trade actions without approval?
- Unsafe domain advice — medical, financial, legal, cyber, HR, regulated decisions
- Data exfiltration — through links, markdown images, hidden encodings, or tool calls
3. The Red-Team Loop
4. Standards and Tools
Promptfoo and Giskard help automate adversarial test generation and scoring.
OWASP LLMSVS defines security requirements for LLM applications including secure configuration, model lifecycle, memory/RAG storage, agent/plugin security, and monitoring.
NIST AI RMF gives governance framing: Govern, Map, Measure, Manage — covering risk identification, evaluation, mitigation, and continuous improvement.
Sources: Promptfoo: Red Teaming, Giskard: Vulnerability Scanning, OWASP LLMSVS, NIST AI RMF
Q1: What makes indirect prompt injection dangerous?
Show answer
The malicious instruction enters through trusted-looking context — a retrieved document, a tool output, a web page — not directly from the user's current prompt. The system may not distinguish between user instructions and content instructions.
Q2: Name three data exfiltration channels in an LLM app.
Show answer
Markdown image tags (load external image = leak data via URL), tool calls that send data to external APIs, and hidden encodings in links that the user might click.
Want to see these patterns in action?
See these eval patterns applied to real AI apps in the Lab.
Explore the Lab →