Enterprise LLM Evals · Lesson 4

Agent & Tool Evaluation

How to evaluate reasoning, tool use, arguments, handoffs, and task completion

Agent & Tool Evaluation

Lesson 4 — how to evaluate reasoning, tool use, arguments, handoffs, and task completion

What you'll learn

Why agent evals are trace evals, not answer evals
The agent eval dimensions: reasoning, action, execution, multi-agent
Enterprise hard gates for agent safety
Tools for agent trace evaluation

Core idea

Agent evals are trace evals. The final answer is only the last symptom; the disease is often in planning or tool use.

1. Agent Trace Anatomy

Goal -> Plan -> Tool choice -> Tool args -> Tool result -> Next decision -> Final answer | | | | | | plan eval tool eval arg eval tool contract recovery eval completion eval

Each step is an evaluable unit. A correct final answer does not mean the agent used the right tools, passed correct arguments, followed efficient paths, or stayed within authorization boundaries.

2. Agent Eval Dimensions

Layer	Metric	Question
Reasoning	Plan quality	Is the plan complete, ordered, and scoped?
Reasoning	Plan adherence	Did the agent follow its own plan?
Action	Tool correctness	Did it pick the right tool?
Action	Argument correctness	Did it pass valid, correct, authorized arguments?
Execution	Task completion	Did the user's goal get done?
Execution	Step efficiency	Did it avoid loops and waste?
Multi-agent	Handoff accuracy	Did control move to the right specialist?

3. Enterprise Hard Gates

Non-negotiable agent safety checks

No tool call without authorization. Every tool invocation must pass an auth check.
No write/payment/email/trade action without approval where policy requires it.
All tool args schema-validated before execution. Invalid args = blocked call.
Agent sandboxed for code/browser/file execution. No direct system access.
Tool outputs treated as untrusted input because indirect prompt injection is possible through retrieved docs, API responses, or web pages.

4. Tools for Agent Evaluation

DeepEval has explicit agent metrics for tool selection, argument correctness, and task completion.

LangSmith, MLflow, Microsoft Foundry, and Phoenix provide trace-backed evaluation and experiment comparison.

OWASP LLMSVS gives security requirements for agents, plugins, and tools.

⚡ Practice Drill

Q1: Why can a correct final answer still be an agent failure?

Show answer

The agent may have used unauthorized tools, leaked data, called unnecessary expensive tools, taken an unsafe path, or succeeded by accident. Trace-level evaluation checks the journey, not just the destination.

Q2: What does "tool output as untrusted input" mean?

Show answer

Tool outputs (API responses, retrieved docs, web pages) can contain malicious instructions — indirect prompt injection. The agent should not blindly follow instructions found in tool outputs; it should treat them as data, not commands.

Previous Lesson Next Lesson

← Back to Enterprise LLM Evals

Want to see these patterns in action?

See these eval patterns applied to real AI apps in the Lab.

Explore the Lab →

← Back to Deep Expertise Track