Enterprise LLM Evals · Lesson 4

Agent & Tool Evaluation

How to evaluate reasoning, tool use, arguments, handoffs, and task completion

Agent & Tool Evaluation

Lesson 4 — how to evaluate reasoning, tool use, arguments, handoffs, and task completion

What you'll learn
  1. Why agent evals are trace evals, not answer evals
  2. The agent eval dimensions: reasoning, action, execution, multi-agent
  3. Enterprise hard gates for agent safety
  4. Tools for agent trace evaluation

Core idea

Agent evals are trace evals. The final answer is only the last symptom; the disease is often in planning or tool use.

1. Agent Trace Anatomy

Goal -> Plan -> Tool choice -> Tool args -> Tool result -> Next decision -> Final answer | | | | | | plan eval tool eval arg eval tool contract recovery eval completion eval

Each step is an evaluable unit. A correct final answer does not mean the agent used the right tools, passed correct arguments, followed efficient paths, or stayed within authorization boundaries.

2. Agent Eval Dimensions

LayerMetricQuestion
ReasoningPlan qualityIs the plan complete, ordered, and scoped?
ReasoningPlan adherenceDid the agent follow its own plan?
ActionTool correctnessDid it pick the right tool?
ActionArgument correctnessDid it pass valid, correct, authorized arguments?
ExecutionTask completionDid the user's goal get done?
ExecutionStep efficiencyDid it avoid loops and waste?
Multi-agentHandoff accuracyDid control move to the right specialist?

3. Enterprise Hard Gates

Non-negotiable agent safety checks
  • No tool call without authorization. Every tool invocation must pass an auth check.
  • No write/payment/email/trade action without approval where policy requires it.
  • All tool args schema-validated before execution. Invalid args = blocked call.
  • Agent sandboxed for code/browser/file execution. No direct system access.
  • Tool outputs treated as untrusted input because indirect prompt injection is possible through retrieved docs, API responses, or web pages.

4. Tools for Agent Evaluation

DeepEval has explicit agent metrics for tool selection, argument correctness, and task completion.

LangSmith, MLflow, Microsoft Foundry, and Phoenix provide trace-backed evaluation and experiment comparison.

OWASP LLMSVS gives security requirements for agents, plugins, and tools.

⚡ Practice Drill
Q1: Why can a correct final answer still be an agent failure?
Show answer

The agent may have used unauthorized tools, leaked data, called unnecessary expensive tools, taken an unsafe path, or succeeded by accident. Trace-level evaluation checks the journey, not just the destination.

Q2: What does "tool output as untrusted input" mean?
Show answer

Tool outputs (API responses, retrieved docs, web pages) can contain malicious instructions — indirect prompt injection. The agent should not blindly follow instructions found in tool outputs; it should treat them as data, not commands.

Want to see these patterns in action?

See these eval patterns applied to real AI apps in the Lab.

Explore the Lab →

← Back to Deep Expertise Track