Enterprise LLM Evals · Lesson 4
Agent & Tool Evaluation
How to evaluate reasoning, tool use, arguments, handoffs, and task completion
Agent & Tool Evaluation
Lesson 4 — how to evaluate reasoning, tool use, arguments, handoffs, and task completion
- Why agent evals are trace evals, not answer evals
- The agent eval dimensions: reasoning, action, execution, multi-agent
- Enterprise hard gates for agent safety
- Tools for agent trace evaluation
Core idea
Agent evals are trace evals. The final answer is only the last symptom; the disease is often in planning or tool use.
1. Agent Trace Anatomy
Each step is an evaluable unit. A correct final answer does not mean the agent used the right tools, passed correct arguments, followed efficient paths, or stayed within authorization boundaries.
2. Agent Eval Dimensions
| Layer | Metric | Question |
|---|---|---|
| Reasoning | Plan quality | Is the plan complete, ordered, and scoped? |
| Reasoning | Plan adherence | Did the agent follow its own plan? |
| Action | Tool correctness | Did it pick the right tool? |
| Action | Argument correctness | Did it pass valid, correct, authorized arguments? |
| Execution | Task completion | Did the user's goal get done? |
| Execution | Step efficiency | Did it avoid loops and waste? |
| Multi-agent | Handoff accuracy | Did control move to the right specialist? |
3. Enterprise Hard Gates
- No tool call without authorization. Every tool invocation must pass an auth check.
- No write/payment/email/trade action without approval where policy requires it.
- All tool args schema-validated before execution. Invalid args = blocked call.
- Agent sandboxed for code/browser/file execution. No direct system access.
- Tool outputs treated as untrusted input because indirect prompt injection is possible through retrieved docs, API responses, or web pages.
4. Tools for Agent Evaluation
DeepEval has explicit agent metrics for tool selection, argument correctness, and task completion.
LangSmith, MLflow, Microsoft Foundry, and Phoenix provide trace-backed evaluation and experiment comparison.
OWASP LLMSVS gives security requirements for agents, plugins, and tools.
Q1: Why can a correct final answer still be an agent failure?
Show answer
The agent may have used unauthorized tools, leaked data, called unnecessary expensive tools, taken an unsafe path, or succeeded by accident. Trace-level evaluation checks the journey, not just the destination.
Q2: What does "tool output as untrusted input" mean?
Show answer
Tool outputs (API responses, retrieved docs, web pages) can contain malicious instructions — indirect prompt injection. The agent should not blindly follow instructions found in tool outputs; it should treat them as data, not commands.
Want to see these patterns in action?
See these eval patterns applied to real AI apps in the Lab.
Explore the Lab →