Deep Expertise Track · Lesson 10
Agent Evaluation
LLM-as-Judge, trace analysis, and measuring agent quality
- Why agent evaluation is different from testing regular software
- LLM-as-Judge: using an LLM to evaluate another LLM's output
- Trace analysis: what to look for in agent execution logs
- The 4 failure modes to watch for and how to detect them
Why Agent Evaluation Is Hard
Traditional software tests use exact assertions: assert result == expected. Agents produce nondeterministic output — the same input can produce different (but equally valid) answers. You can't assert exact text.
Microsoft's guidance is explicit: "Agent outputs are nondeterministic, so use scoring rubrics or language-model-as-judge evaluations rather than exact-match assertions."
Source: Microsoft Azure — AI Agent Orchestration Patterns
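To make the contrast concrete, here is a minimal sketch of a rubric-style check. The ticket IDs, the two sample answers, and the score_priorities helper are all hypothetical, invented for illustration. The point is only that two differently worded answers can satisfy the same rubric while failing an exact-match comparison.

```python
import re

def score_priorities(output: str, known_tickets: set[str], expected_count: int = 5) -> dict:
    """Score an answer on properties it must have, not on its exact wording."""
    cited = re.findall(r"[A-Z]+-\d+", output)   # ticket IDs such as BA-101
    unique = list(dict.fromkeys(cited))         # de-duplicate, keep order
    return {
        "right_count": len(unique) == expected_count,
        "all_tickets_exist": all(t in known_tickets for t in unique),
    }

# Hypothetical data: the tickets in the export and two valid agent answers.
known = {"BA-101", "BA-102", "BA-103", "BA-104", "BA-105", "BA-106"}
run_a = "Top priorities: BA-101, BA-103, BA-104, BA-105, BA-106."
run_b = "I recommend tackling BA-106 first, then BA-101, BA-105, BA-103 and BA-104."

print(run_a == run_b)                   # the exact-match comparison
print(score_priorities(run_a, known))   # the rubric check for run A
print(score_priorities(run_b, known))   # the rubric check for run B
```

Simple property checks like this catch structural problems cheaply. Judging whether the chosen priorities are actually sensible still needs a rubric scored by a person or a judge model.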
LLM-as-Judge
You use one LLM to evaluate another LLM's output. This is a widely used approach in production AI systems:
# LLM-as-Judge for your BA Work Agent
import json
import os

from langchain_openai import ChatOpenAI

# temperature=0 keeps the judge's scores as repeatable as possible
judge_llm = ChatOpenAI(
    model="deepseek-chat",
    api_key=os.getenv("DEEPSEEK_API_KEY"),
    base_url="https://api.deepseek.com",
    temperature=0,
)

def evaluate_agent_output(goal: str, agent_output: str) -> dict:
    """Use an LLM to score agent output against a rubric."""
    judge_prompt = f"""You are evaluating an AI agent's output.
    Goal: {goal}
    Agent output: {agent_output}
    Score each criterion 1-5:
    1. ACCURACY: Are the facts correct? No hallucinations?
    2. COMPLETENESS: Did it address the full goal?
    3. TOOL USAGE: Did it use appropriate tools (not just guess)?
    4. CLARITY: Is the output clear and actionable?
    Respond in JSON:
    {{"accuracy": N, "completeness": N, "tool_usage": N, "clarity": N,
    "overall": N, "issues": "description of problems"}}"""
    response = judge_llm.invoke(judge_prompt)
    # The judge replies with JSON text; parse it so callers get a dict.
    # json.loads raises ValueError if the reply is not valid JSON.
    return json.loads(response.content)

# Run evaluation on your agent's output
result = evaluate_agent_output(
    goal="Read JIRA export and suggest top 5 priorities",
    agent_output="Based on the tickets...",  # paste your agent's actual output
)
print(result)
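A judge's raw reply (the response.content string) is not guaranteed to be clean JSON: models sometimes add a sentence before or after the object, or return scores outside the requested range. The sketch below shows one way to extract and validate the scores before trusting them. The parse_judge_reply and passes helpers, the sample reply, and the pass threshold of 4 are assumptions for illustration, not part of any library.

```python
import json
import re

CRITERIA = ("accuracy", "completeness", "tool_usage", "clarity")

def parse_judge_reply(raw: str) -> dict:
    """Extract the JSON object from a judge reply and validate its scores."""
    # Tolerate extra prose around the JSON by grabbing the outermost {...}.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge reply")
    scores = json.loads(match.group(0))
    for name in CRITERIA:
        value = scores.get(name)
        if not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{name} must be an integer from 1 to 5, got {value!r}")
    return scores

def passes(scores: dict, minimum: int = 4) -> bool:
    """Pass only if every criterion meets the (assumed) minimum score."""
    return all(scores[name] >= minimum for name in CRITERIA)

# Hypothetical judge reply with a sentence of chatter before the JSON.
sample_reply = (
    "Here is my evaluation:\n"
    '{"accuracy": 5, "completeness": 4, "tool_usage": 3, "clarity": 5, '
    '"overall": 4, "issues": "Did not read the export before ranking."}'
)
scores = parse_judge_reply(sample_reply)
print(passes(scores))
print(scores["issues"])
```

Requiring every criterion to clear the threshold, rather than averaging, keeps one strong score from hiding a weak one.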
The 4 Failure Modes to Detect
| Failure | What it looks like | How to detect |
|---|---|---|
| Hallucination | Agent states facts not supported by tool outputs | Compare agent's claims against tool observation logs |
| Tool misuse | Agent calls wrong tool or passes wrong arguments | Check trace: did tool input match tool schema? |
| Premature termination | Agent gives Final Answer before gathering enough data | Count tool calls. If <3 for complex tasks, likely premature |
| Loop failure | Agent repeats same tool call without progressing | Check for repeated Action+Input pairs in trace |
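Three of these checks can be automated directly against a trace. The sketch below assumes a hypothetical trace format (a list of dicts with "tool" and "tool_input" keys) and a hypothetical schema table mapping each tool name to its expected argument names. The tool names and the threshold of 3 are illustrative only. Hallucination detection is left out because comparing claims against observations usually needs a judge model or a human reviewer.

```python
from itertools import groupby

def find_loops(trace: list[dict], threshold: int = 3) -> list[tuple]:
    """Return (tool, input) pairs that occur `threshold` or more times in a row."""
    loops = []
    for key, run in groupby(trace, key=lambda s: (s["tool"], s["tool_input"])):
        if len(list(run)) >= threshold:
            loops.append(key)
    return loops

def is_premature(trace: list[dict], min_tool_calls: int = 3) -> bool:
    """Flag runs that answered after fewer tool calls than the chosen minimum."""
    return len(trace) < min_tool_calls

def find_tool_misuse(trace: list[dict], schemas: dict[str, set[str]]) -> list[tuple]:
    """Return (step index, problem) for unknown tools or mismatched argument names."""
    problems = []
    for i, step in enumerate(trace):
        expected = schemas.get(step["tool"])
        if expected is None:
            problems.append((i, f"unknown tool {step['tool']}"))
        elif set(step["tool_input"]) != expected:
            got = sorted(step["tool_input"])
            problems.append((i, f"expected arguments {sorted(expected)}, got {got}"))
    return problems

# Hypothetical trace: the same read repeated three times, then a bad argument name.
trace = [
    {"tool": "read_file", "tool_input": {"path": "jira_export.csv"}},
    {"tool": "read_file", "tool_input": {"path": "jira_export.csv"}},
    {"tool": "read_file", "tool_input": {"path": "jira_export.csv"}},
    {"tool": "search", "tool_input": {"q": "priority"}},
]
schemas = {"read_file": {"path"}, "search": {"query"}}

print(find_loops(trace))
print(is_premature(trace))
print(find_tool_misuse(trace, schemas))
```

The tool-call minimum is a heuristic, so tune it per task: a simple lookup may legitimately finish in one call.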
Trace Analysis
Every agent run produces a trace: the full sequence of Thought/Action/Observation steps. LangSmith (LangChain's observability platform) captures these once tracing is enabled for your project. Without LangSmith, set verbose=True on AgentExecutor to print each step to the console.
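If you only have saved console output, a small parser can turn it into structured steps for analysis. The sketch below assumes the text follows the common ReAct layout with "Action:", "Action Input:", and "Observation:" labels. Real verbose output may differ in detail and can contain terminal color codes that need stripping first. The sample trace and its tool names are hypothetical.

```python
import re

# One step = an Action line, an Action Input line, and the Observation that follows.
STEP_PATTERN = re.compile(
    r"Action:\s*(?P<tool>.+?)\s*\n"
    r"Action Input:\s*(?P<tool_input>.+?)\s*\n"
    r"Observation:\s*(?P<observation>.+?)(?=\nThought:|\nFinal Answer:|\Z)",
    re.DOTALL,
)

def parse_trace(text: str) -> list[dict]:
    """Split ReAct-style text into a list of tool/tool_input/observation dicts."""
    return [m.groupdict() for m in STEP_PATTERN.finditer(text)]

# Hypothetical saved output from one agent run.
sample = """Thought: I need to read the export first.
Action: read_file
Action Input: jira_export.csv
Observation: 42 tickets loaded.
Thought: Now I can rank them.
Action: rank_tickets
Action Input: top 5
Observation: BA-101, BA-103, BA-104, BA-105, BA-106
Thought: I have what I need.
Final Answer: The top priorities are BA-101, BA-103, BA-104, BA-105 and BA-106."""

for step in parse_trace(sample):
    print(step["tool"], "|", step["tool_input"], "|", step["observation"])
```

Once the trace is a list of dicts, checks such as counting tool calls or spotting repeated Action+Input pairs become ordinary list operations.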
The one-sentence summary
Agent evaluation uses LLM-as-Judge (another LLM scores the output) and trace analysis (examining the Thought/Action/Observation sequence) instead of exact assertions — because agent outputs are nondeterministic and must be judged on quality, not equality.
Practice Drill
- Run your ba-work-agent and save the full verbose output to a file
- Use the evaluate_agent_output function above to score the output
- Check the trace for any of the 4 failure modes. Is your agent healthy?
- Run the agent 3 times with the same input. Are the outputs different? (They usually are: nondeterminism is normal.)
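For the last drill step, one rough way to quantify how much three runs differ is a pairwise text-similarity score. This sketch uses Python's standard difflib module. The three outputs are hypothetical stand-ins for what your agent returns, and text similarity says nothing about which answer is better, only how far apart the wording is.

```python
from difflib import SequenceMatcher
from itertools import combinations

def pairwise_similarity(outputs: list[str]) -> list[float]:
    """Return a similarity ratio (0.0 to 1.0) for every pair of outputs."""
    return [SequenceMatcher(None, a, b).ratio() for a, b in combinations(outputs, 2)]

# Hypothetical outputs from three runs with the same input.
outputs = [
    "Top priorities: BA-101, BA-103, BA-104, BA-105, BA-106.",
    "I recommend BA-106 first, then BA-101, BA-105, BA-103 and BA-104.",
    "Top priorities: BA-101, BA-103, BA-104, BA-105, BA-106.",
]

print(len(set(outputs)), "distinct outputs out of", len(outputs))
for ratio in pairwise_similarity(outputs):
    print(round(ratio, 2))
```

Low similarity with consistently high judge scores is the healthy case: the wording varies while the quality holds.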
Q1: Why can't you use assert result == expected for agent testing?
Because agent outputs are nondeterministic. The same input can produce different but equally valid answers. You must evaluate quality (accuracy, completeness, clarity) using a rubric or LLM-as-judge, not equality.
Q2: What does it mean if the trace shows the same Action+Input 3 times in a row?
Loop failure. The agent is stuck calling the same tool with the same arguments without progressing. The observation hasn't changed, so the LLM's reasoning doesn't change either. Fix: cap the run with max_iterations in AgentExecutor, add your own check for repeated Action+Input pairs, or improve the system prompt to say "if you get the same result twice, try a different approach."