Enterprise LLM Evals · Lesson 6

Safety, Security & Red-Team Evals

How enterprises test prompt injection, jailbreaks, PII leakage, and tool misuse

Safety, Security & Red-Team Evals

Lesson 6 — prompt injection, jailbreaks, PII leakage, and tool misuse

What you'll learn
  1. Model-layer vs application-layer security risks
  2. What to test: direct/indirect prompt injection, jailbreaks, PII, secrets, unsafe advice, data exfiltration
  3. The red-team loop: generate, run, score, prioritize, mitigate
  4. Standards: OWASP, NIST

Core idea

Security evals are not model vibes. They are adversarial test suites mapped to concrete risk categories and release gates.

1. Threat Split

Model-layer risks Application-layer risks ----------------- ----------------------- harmful content indirect prompt injection jailbreak behavior RAG data leakage bias/toxicity unauthorized tool/API use unsafe advice cross-tenant access leak memorized PII markdown/link exfiltration

Most enterprise risk appears at the application layer: RAG context, tools, permissions, memory, and connectors — not just the base model.

2. What to Test

  • Direct prompt injection — user asks model to ignore system/developer policy
  • Indirect prompt injection — malicious instructions hidden in retrieved docs, emails, webpages, tickets, PDFs, tool outputs
  • Jailbreaks — persona/hypothetical/roleplay/encoding attempts to bypass safety policy
  • PII and secrets leakage — does the system expose personal data or API keys in responses or logs?
  • Unauthorized action attempts — can the agent perform write/payment/delete/trade actions without approval?
  • Unsafe domain advice — medical, financial, legal, cyber, HR, regulated decisions
  • Data exfiltration — through links, markdown images, hidden encodings, or tool calls

3. The Red-Team Loop

Define risk scope -> Generate adversarial cases -> Run against full app -> Score outputs ^ | | v Add production incidents <- Verify mitigations <- Prioritize failures <- Analyze traces

4. Standards and Tools

Promptfoo and Giskard help automate adversarial test generation and scoring.

OWASP LLMSVS defines security requirements for LLM applications including secure configuration, model lifecycle, memory/RAG storage, agent/plugin security, and monitoring.

NIST AI RMF gives governance framing: Govern, Map, Measure, Manage — covering risk identification, evaluation, mitigation, and continuous improvement.

Sources: Promptfoo: Red Teaming, Giskard: Vulnerability Scanning, OWASP LLMSVS, NIST AI RMF

⚡ Practice Drill
Q1: What makes indirect prompt injection dangerous?
Show answer

The malicious instruction enters through trusted-looking context — a retrieved document, a tool output, a web page — not directly from the user's current prompt. The system may not distinguish between user instructions and content instructions.

Q2: Name three data exfiltration channels in an LLM app.
Show answer

Markdown image tags (load external image = leak data via URL), tool calls that send data to external APIs, and hidden encodings in links that the user might click.

Want to see these patterns in action?

See these eval patterns applied to real AI apps in the Lab.

Explore the Lab →

← Back to Deep Expertise Track