Deep Expertise Track · Lesson 10

Agent Evaluation

LLM-as-Judge, trace analysis, and measuring agent quality

What you'll learn
  1. Why agent evaluation is different from testing regular software
  2. LLM-as-Judge: using an LLM to evaluate another LLM's output
  3. Trace analysis: what to look for in agent execution logs
  4. The 4 failure modes to watch for and how to detect them

Why Agent Evaluation Is Hard

Traditional software tests use exact assertions: assert result == expected. Agents produce nondeterministic output — the same input can produce different (but equally valid) answers. You can't assert exact text.

Microsoft's guidance is explicit: "Agent outputs are nondeterministic, so use scoring rubrics or language-model-as-judge evaluations rather than exact-match assertions."

Source: Microsoft Azure — AI Agent Orchestration Patterns
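To make the contrast concrete, here is a minimal sketch that puts an exact-match assertion next to a rubric-style check. The two sample outputs, the `REQUIRED_FACTS` list, and the `rubric_score` helper are illustrative assumptions, not part of any library. A keyword rubric is about the crudest scorer possible; it only stands in for the richer judgment an LLM judge provides.

```python
# Two runs of the same (hypothetical) agent on the same input.
run_a = "HOLD SBIN. Price is ₹1054, revenue grew 18%, P/B is 1.8."
run_b = "Revenue is up 18% and SBIN trades at ₹1054 (P/B 1.8), so HOLD."

# Hypothetical rubric: facts a good answer must mention.
REQUIRED_FACTS = ["HOLD", "1054", "18%", "1.8"]

def rubric_score(output: str, required: list[str]) -> float:
    """Return the fraction of required facts that appear in the output."""
    hits = sum(1 for fact in required if fact in output)
    return hits / len(required)

exact_match = run_a == run_b              # wording differs, so this fails
score_a = rubric_score(run_a, REQUIRED_FACTS)
score_b = rubric_score(run_b, REQUIRED_FACTS)

print(exact_match)        # False
print(score_a, score_b)   # 1.0 1.0
```

Both runs would fail an equality test against each other, yet both satisfy the rubric in full.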

LLM-as-Judge

You use one LLM to evaluate another LLM's output against a rubric. This is a common approach in production AI systems, because it scales to outputs that have no single correct wording:

LLM-AS-JUDGE FLOW

Agent (DeepSeek) ──▶ Output ──┐
                              │
                              ▼
             Judge (DeepSeek or different LLM)
             ┌────────────────────────────┐
             │ "Evaluate this output:     │
             │  1. Is it accurate?        │
             │  2. Is it complete?        │
             │  3. Is it hallucinating?   │
             │  4. Did it use tools?      │
             │  Score 1-10"               │
             └────────────────────────────┘
                              │
                              ▼
                    ┌──────────────────┐
                    │ Score: 8/10      │
                    │ Issues: missing  │
                    │ peer comparison  │
                    └──────────────────┘

# LLM-as-Judge for your BA Work Agent
import json
import os

from langchain_openai import ChatOpenAI

# temperature=0 keeps the judge's scores as repeatable as possible
judge_llm = ChatOpenAI(model="deepseek-chat", api_key=os.getenv("DEEPSEEK_API_KEY"),
                       base_url="https://api.deepseek.com", temperature=0)

def evaluate_agent_output(goal: str, agent_output: str) -> dict:
    """Use an LLM to score agent output; returns the judge's JSON as a dict."""
    judge_prompt = f"""You are evaluating an AI agent's output.

Goal: {goal}
Agent output: {agent_output}

Score each criterion 1-5:
1. ACCURACY: Are the facts correct? No hallucinations?
2. COMPLETENESS: Did it address the full goal?
3. TOOL USAGE: Did it use appropriate tools (not just guess)?
4. CLARITY: Is the output clear and actionable?

Respond with JSON only:
{{"accuracy": N, "completeness": N, "tool_usage": N, "clarity": N,
  "overall": N, "issues": "description of problems"}}"""

    response = judge_llm.invoke(judge_prompt)
    # The judge may wrap its JSON in prose or a code fence, so parse only the
    # outermost {...} span. json.loads raises ValueError on malformed JSON.
    text = response.content
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError(f"Judge returned no JSON object: {text!r}")
    return json.loads(text[start:end + 1])

# Run evaluation on your agent's output
result = evaluate_agent_output(
    goal="Read JIRA export and suggest top 5 priorities",
    agent_output="Based on the tickets..."  # paste your agent's actual output
)
print(result)
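A judge score is only useful once something acts on it. The sketch below turns a score dict in the shape the judge prompt asks for into a pass/fail decision. The `passes_rubric` name and both thresholds are assumptions chosen for illustration, and the sample dict is hand-written rather than real judge output.

```python
from statistics import mean

CRITERIA = ("accuracy", "completeness", "tool_usage", "clarity")

def passes_rubric(scores: dict, min_each: int = 3, min_mean: float = 4.0) -> bool:
    """Pass only if every criterion clears min_each and the mean clears min_mean."""
    values = [scores[name] for name in CRITERIA]  # KeyError if the judge omitted one
    return min(values) >= min_each and mean(values) >= min_mean

# Hand-written example in the shape the judge prompt requests (not real output).
sample = {"accuracy": 5, "completeness": 3, "tool_usage": 4, "clarity": 5,
          "overall": 4, "issues": "missing peer comparison"}

print(passes_rubric(sample))                     # True
print(passes_rubric({**sample, "accuracy": 2}))  # False
```

Requiring a floor on each criterion as well as a mean stops a single very low score, such as accuracy, from being averaged away by high clarity.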

The 4 Failure Modes to Detect

| Failure               | What it looks like                                     | How to detect                                                        |
|-----------------------|--------------------------------------------------------|----------------------------------------------------------------------|
| Hallucination         | Agent states facts not supported by tool outputs       | Compare agent's claims against tool observation logs                 |
| Tool misuse           | Agent calls wrong tool or passes wrong arguments       | Check trace: did tool input match tool schema?                       |
| Premature termination | Agent gives Final Answer before gathering enough data  | Count tool calls; fewer than 3 on a complex task is likely premature |
| Loop failure          | Agent repeats same tool call without progressing       | Check for repeated Action+Input pairs in trace                       |
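The first three checks in the table can be sketched as small functions over a recorded trace. Everything here is an assumption made for illustration: the trace format (one dict per tool call), the `TOOL_SCHEMAS` mapping, and the function names. The hallucination check is deliberately crude, since it only compares numbers by substring, so treat it as a first filter and not a verdict.

```python
import re

# Hypothetical trace format: one dict per tool call, plus the final answer.
trace = [
    {"tool": "get_stock_price", "tool_input": {"symbol": "SBIN"},
     "observation": "SBIN at ₹1054"},
    {"tool": "get_financials", "tool_input": {"ticker": "SBIN"},
     "observation": "Revenue up 18%"},
]
final_answer = "HOLD SBIN at ₹1054. Revenue up 18%, P/B 1.8 vs peers 2.1."

# Hypothetical schema: the argument names each tool expects.
TOOL_SCHEMAS = {
    "get_stock_price": {"symbol"},
    "get_financials": {"symbol"},
    "get_peer_comparison": {"symbol"},
}

def unsupported_numbers(answer: str, steps: list[dict]) -> list[str]:
    """Hallucination check: numbers in the answer that no observation contains."""
    observed = " ".join(step["observation"] for step in steps)
    return [n for n in re.findall(r"\d+(?:\.\d+)?", answer) if n not in observed]

def misused_tools(steps: list[dict]) -> list[str]:
    """Tool misuse check: unknown tools, or argument names that differ from the schema."""
    return [s["tool"] for s in steps
            if set(s["tool_input"]) != TOOL_SCHEMAS.get(s["tool"])]

def is_premature(steps: list[dict], min_calls: int = 3) -> bool:
    """Premature termination check: fewer tool calls than the task should need."""
    return len(steps) < min_calls

print(unsupported_numbers(final_answer, trace))  # ['1.8', '2.1']
print(misused_tools(trace))                      # ['get_financials']
print(is_premature(trace))                       # True
```

In this sample the agent quotes a P/B ratio it never looked up, passes `ticker` where the schema expects `symbol`, and stops after two tool calls.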

Trace Analysis

Every agent run produces a trace: the full sequence of Thought/Action/Observation steps. LangSmith (LangChain's observability platform) captures these traces once tracing is enabled for your project. Without LangSmith, set verbose=True on AgentExecutor to print each step to the console.

TRACE ANALYSIS: what to look for

GOOD TRACE
──────────
Thought: need price
Action:  get_stock_price
Obs:     SBIN at ₹1054
Thought: need financials
Action:  get_financials
Obs:     Revenue up 18%
Thought: need peers
Action:  get_peer_comparison
Obs:     P/B 1.8 vs peers 2.1
Thought: enough data
Final Answer: HOLD SBIN...

BAD TRACE
─────────
Thought: let me check
Action:  get_stock_price
Obs:     SBIN at ₹1054
Thought: let me check again
Action:  get_stock_price   ← REPEATED!
Obs:     SBIN at ₹1054
Thought: let me check again
Action:  get_stock_price   ← LOOP!

→ Detect: same Action+Input 3+ times = loop failure
  vs. GOOD: 3 different tools, then Final Answer = healthy
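The loop rule above (same Action+Input three or more times) is easy to automate once the trace is available as data. The sketch below assumes a hypothetical trace format of one dict per step with `tool` and `tool_input` keys; `find_loops` is an illustrative name, not a LangChain function. It counts total repeats across the run, which also catches loops that are interrupted by another call.

```python
import json
from collections import Counter

def find_loops(steps: list[dict], threshold: int = 3) -> list[tuple[str, str]]:
    """Return (tool, input) pairs that occur `threshold` or more times."""
    # Serialize inputs with sorted keys so equal dicts compare equal as strings.
    counts = Counter(
        (s["tool"], json.dumps(s["tool_input"], sort_keys=True)) for s in steps
    )
    return [pair for pair, n in counts.items() if n >= threshold]

price_call = {"tool": "get_stock_price", "tool_input": {"symbol": "SBIN"}}
bad_trace = [price_call, price_call, price_call]
good_trace = [
    price_call,
    {"tool": "get_financials", "tool_input": {"symbol": "SBIN"}},
    {"tool": "get_peer_comparison", "tool_input": {"symbol": "SBIN"}},
]

print(find_loops(bad_trace))   # [('get_stock_price', '{"symbol": "SBIN"}')]
print(find_loops(good_trace))  # []
```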

The one-sentence summary

Agent evaluation uses LLM-as-Judge (another LLM scores the output) and trace analysis (examining the Thought/Action/Observation sequence) instead of exact assertions — because agent outputs are nondeterministic and must be judged on quality, not equality.

Practice Drill

  1. Run your ba-work-agent and save the full verbose output to a file
  2. Use the evaluate_agent_output function above to score the output
  3. Check the trace for any of the 4 failure modes. Is your agent healthy?
  4. Run the agent 3 times with the same input. Are the outputs different? (They often will be: nondeterminism is normal, although a low temperature reduces it)
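Step 4 of the drill can be scripted. The sketch below runs an agent several times on one task and counts the distinct outputs. `run_repeatedly` and `stub_agent` are hypothetical names, and the stub returns canned strings so the example runs offline; swap it for a call into your own agent.

```python
from collections import Counter
from itertools import cycle
from typing import Callable

def run_repeatedly(agent: Callable[[str], str], task: str, runs: int = 3) -> Counter:
    """Call the agent `runs` times on the same task and count each distinct output."""
    return Counter(agent(task) for _ in range(runs))

# Stub standing in for a real agent: it cycles through canned answers.
_canned = cycle(["Fix login bug first", "Start with the login bug", "Fix login bug first"])

def stub_agent(task: str) -> str:
    return next(_canned)

outputs = run_repeatedly(stub_agent, "Read JIRA export and suggest top 5 priorities")
print(len(outputs))                    # 2 distinct outputs across 3 runs
print(outputs["Fix login bug first"])  # 2
```

Differently worded outputs are not failures in themselves; pass each one through the judge and compare the scores instead of the text.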

Quick Check

Q1: Why can't you use assert result == expected for agent testing?

Answer: Because agent outputs are nondeterministic. The same input can produce different but equally valid answers. You must evaluate quality (accuracy, completeness, clarity) using a rubric or LLM-as-judge, not equality.

Q2: What does it mean if the trace shows the same Action+Input 3 times in a row?

Answer: Loop failure. The agent is stuck calling the same tool with the same arguments without progressing. The observation hasn't changed, so the LLM's reasoning doesn't change either. Fix: cap the run with max_iterations on AgentExecutor, add your own check for repeated Action+Input pairs, or improve the system prompt to say "if you get the same result twice, try a different approach."
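One way to add a repeated-call check at run time is to wrap each tool before handing it to the agent. This is a minimal sketch under stated assumptions: `with_loop_guard` and `RepeatedCallError` are hypothetical names, the tool is a plain Python function with hashable arguments, and the price it returns is canned. In a real agent you might return a warning string to the model where this sketch raises.

```python
from typing import Any, Callable

class RepeatedCallError(RuntimeError):
    """Raised when a tool is called with identical arguments too many times."""

def with_loop_guard(tool: Callable[..., Any], max_repeats: int = 2) -> Callable[..., Any]:
    """Wrap a tool so calls beyond `max_repeats` with the same arguments fail fast."""
    seen: dict[tuple, int] = {}

    def guarded(*args: Any, **kwargs: Any) -> Any:
        key = (args, tuple(sorted(kwargs.items())))  # arguments must be hashable
        seen[key] = seen.get(key, 0) + 1
        if seen[key] > max_repeats:
            raise RepeatedCallError(f"{tool.__name__} repeated with {key}")
        return tool(*args, **kwargs)

    return guarded

def get_stock_price(symbol: str) -> str:  # hypothetical tool with a canned result
    return f"{symbol} at ₹1054"

guarded_price = with_loop_guard(get_stock_price)
guarded_price("SBIN")
guarded_price("SBIN")
try:
    guarded_price("SBIN")  # third identical call trips the guard
    tripped = False
except RepeatedCallError:
    tripped = True
print(tripped)  # True
```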
