Deep Expertise Track · Lesson 11

Production Considerations

Cost, latency, state, security — what changes when agents go live

What you'll learn
  1. The 4 production concerns: cost, latency, state management, security
  2. How multi-agent patterns multiply cost (and how to optimize)
  3. Context window management — the silent killer of production agents
  4. Security: least privilege, audit trails, guardrails at every step

The 4 Production Concerns

┌──────────────────────────────────────────────────────────┐
│                PRODUCTION AGENT CHECKLIST                │
│                                                          │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  │
│  │   COST   │  │ LATENCY  │  │  STATE   │  │ SECURITY │  │
│  │          │  │          │  │          │  │          │  │
│  │ Token    │  │ Multi-   │  │ Context  │  │ Least    │  │
│  │ usage    │  │ step =   │  │ window   │  │ priv.    │  │
│  │ per run  │  │ slower   │  │ grows    │  │ per agent│  │
│  │          │  │          │  │          │  │          │  │
│  │ Pattern  │  │ Concur-  │  │ Persist  │  │ Audit    │  │
│  │ affects  │  │ rent =   │  │ state    │  │ trails   │  │
│  │ cost     │  │ faster   │  │ external │  │ required │  │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘  │
└──────────────────────────────────────────────────────────┘

1. Cost

Microsoft's guidance: "Multi-agent orchestrations multiply model invocations. Each agent consumes tokens for instructions, context, reasoning, and tool interactions."

Pattern    | Cost profile                          | Optimization
-----------|---------------------------------------|-----------------------------------------
Sequential | Linear (N agents × tokens each)       | Use smaller models for simple steps
Concurrent | Spike (all agents at once)            | Monitor API rate limits
Handoff    | Variable (depends on routing)         | Good triage agent prevents wrong routing
Magentic   | Highest (manager iterates many times) | Hard to predict; monitor per-run cost

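Key optimization: "Not every agent requires the most capable model." Use a cheaper model (for example, deepseek-chat rather than deepseek-reasoner) for classification, formatting, and other simple tasks, and reserve the stronger model for steps that need multi-step reasoning.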
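The routing idea can be sketched in a few lines. This is a minimal illustration, not a recommended implementation: the prices, the task categories in SIMPLE_TASKS, and the helper names pick_model and estimate_run_cost are all assumptions made up for the example. Only the two model names come from the lesson.

```python
from dataclasses import dataclass

# Hypothetical per-million-token prices in USD, for illustration only.
# Check your provider's current pricing before relying on any number here.
PRICE_PER_MILLION_TOKENS = {
    "deepseek-chat": 1.0,
    "deepseek-reasoner": 4.0,
}

# Assumed task categories that a cheaper model can handle.
SIMPLE_TASKS = {"classification", "formatting", "extraction"}


def pick_model(task_type: str) -> str:
    """Route simple tasks to the cheaper model, everything else to the stronger one."""
    return "deepseek-chat" if task_type in SIMPLE_TASKS else "deepseek-reasoner"


@dataclass
class Step:
    task_type: str
    tokens: int  # total tokens (input + output) the step is expected to use


def estimate_run_cost(steps: list[Step]) -> float:
    """Sum the estimated cost of a sequential run, one model call per step."""
    total = 0.0
    for step in steps:
        price = PRICE_PER_MILLION_TOKENS[pick_model(step.task_type)]
        total += step.tokens / 1_000_000 * price
    return total


pipeline = [
    Step("classification", 800),
    Step("planning", 3_000),
    Step("formatting", 1_200),
]
routed = estimate_run_cost(pipeline)
all_reasoner = (
    sum(s.tokens for s in pipeline) / 1_000_000
    * PRICE_PER_MILLION_TOKENS["deepseek-reasoner"]
)
print(f"routed: ${routed:.4f}  all-reasoner: ${all_reasoner:.4f}")
```

With these made-up prices the routed pipeline costs less than sending every step to the stronger model. The size of the saving depends entirely on your real prices and on how much of the token volume falls into simple steps.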

2. Latency

  • Sequential: latency = sum of all agent times.
  • Concurrent: latency = max(all agent times) + aggregation time.
  • Handoff: latency = sum of the times of the agents in the chain.
  • Magentic: unpredictable; can be very slow.
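The sequential and concurrent formulas can be checked with a small calculation. The sketch below is only arithmetic on per-agent timings you have measured yourself. The function names and the sample timings are assumptions for illustration.

```python
def sequential_latency(agent_seconds: list[float]) -> float:
    """Agents run one after another, so their times add up."""
    return sum(agent_seconds)


def concurrent_latency(agent_seconds: list[float], aggregation_seconds: float = 0.0) -> float:
    """Agents run in parallel, so the slowest one sets the pace,
    plus whatever time is needed to combine the results."""
    return max(agent_seconds) + aggregation_seconds


# Hypothetical measured times for three agents, in seconds.
times = [2.0, 3.5, 1.5]

print(sequential_latency(times))                           # 7.0
print(concurrent_latency(times, aggregation_seconds=1.0))  # 4.5
```

A handoff chain uses the same sum as the sequential case, applied only to the agents that were actually routed to.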

3. Context Window Management

The Silent Killer

"In multi-agent orchestrations, context windows can grow rapidly because each agent adds its own reasoning, tool results, and intermediate outputs. Monitor accumulated context size and use compaction techniques (summarization or selective pruning) between agents."

CONTEXT GROWTH PROBLEM:

  Agent A output:  2,000 tokens
  Agent B adds:    1,500 tokens (reasoning + tool output)
  Agent C adds:    1,800 tokens
  Agent D adds:    1,200 tokens
                   ────────────
  Total context:   6,500 tokens sent to Agent E

  If Agent E is the 5th agent in a chain, it receives ALL previous context.
  By agent 10, you may exceed the context window.

SOLUTION: Summarize between steps

  Agent A output → summarize to 500 tokens → pass to Agent B
  Agent B output → summarize to 500 tokens → pass to Agent C

  Total: 2,000 tokens to Agent E (not 6,500)
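The same arithmetic can be reproduced in code. This is a rough sketch under two loud assumptions: count_tokens treats each whitespace-separated word as one token (real tokenizers differ), and summarize is a placeholder that simply truncates, where a real system would call a model to summarize or prune selectively. All function names are hypothetical.

```python
from typing import Optional


def count_tokens(text: str) -> int:
    """Crude stand-in: one whitespace-separated word counts as one token."""
    return len(text.split())


def summarize(text: str, max_tokens: int) -> str:
    """Placeholder compaction: keep only the first max_tokens words."""
    words = text.split()
    return " ".join(words[:max_tokens])


def context_after_chain(agent_outputs: list[str], budget_per_agent: Optional[int]) -> int:
    """Return the context size (in tokens) handed to the next agent in the chain."""
    context: list[str] = []
    for output in agent_outputs:
        if budget_per_agent is not None:
            # Compact each agent's output before it joins the shared context.
            output = summarize(output, budget_per_agent)
        context.append(output)
    return sum(count_tokens(part) for part in context)


# Synthetic outputs sized like Agents A to D: 2,000 / 1,500 / 1,800 / 1,200 tokens.
outputs = [" ".join(["tok"] * n) for n in (2000, 1500, 1800, 1200)]

raw_total = context_after_chain(outputs, budget_per_agent=None)
compact_total = context_after_chain(outputs, budget_per_agent=500)
print(raw_total, compact_total)  # 6500 2000
```

Truncation throws information away blindly, so treat this only as a way to see the token budget at work. The quality of the compaction step is what decides whether the next agent still has what it needs.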

4. Security

Microsoft's security recommendations:

  • Least privilege: Each agent should have minimum access needed. Don't give every agent database write access.
  • Authentication: Secure communication between agents
  • Audit trails: Log every tool call and handoff for compliance
  • Guardrails at multiple points: User input, tool calls, tool responses, and final output — not just at the edges
  • Security trimming: Agents with broad data access must not return data the user isn't authorized to see

Source: Microsoft Azure — AI Agent Orchestration Patterns
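Least privilege and audit trails can be combined in one small wrapper around tool calls. The sketch below is an illustration, not Microsoft's implementation: the agent names, the tool names, the in-memory AUDIT_LOG, and the call_tool helper are all hypothetical. In production the log would go to external, append-only storage.

```python
import json
import time
from typing import Any, Callable


# Hypothetical tools; in a real system these would touch a database.
def read_orders(customer_id: str) -> list[str]:
    return [f"order-for-{customer_id}"]


def delete_order(order_id: str) -> str:
    return f"deleted {order_id}"


TOOLS: dict[str, Callable[..., Any]] = {
    "read_orders": read_orders,
    "delete_order": delete_order,
}

# Least privilege: each agent lists only the tools it needs.
AGENT_PERMISSIONS: dict[str, set[str]] = {
    "support_agent": {"read_orders"},
    "admin_agent": {"read_orders", "delete_order"},
}

AUDIT_LOG: list[str] = []  # stand-in for an append-only external log


def call_tool(agent: str, tool: str, **kwargs: Any) -> Any:
    """Check permissions, record the attempt either way, then run the tool."""
    allowed = tool in AGENT_PERMISSIONS.get(agent, set())
    AUDIT_LOG.append(json.dumps({
        "ts": time.time(),
        "agent": agent,
        "tool": tool,
        "args": kwargs,
        "allowed": allowed,
    }))
    if not allowed:
        raise PermissionError(f"{agent} may not call {tool}")
    return TOOLS[tool](**kwargs)


print(call_tool("support_agent", "read_orders", customer_id="c1"))
try:
    call_tool("support_agent", "delete_order", order_id="o9")
except PermissionError as err:
    print("blocked:", err)
print(len(AUDIT_LOG), "audit entries")
```

Denied attempts are logged as well as successful ones, since a blocked call is often the most interesting entry in an audit trail. Input and output guardrails would sit around this wrapper as separate checks.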

The one-sentence summary

Production agents need cost monitoring (use cheaper models for simple tasks), context compaction (summarize between agents), state persistence (external storage, not in-memory), and security at every layer (least privilege, audit trails, guardrails at each step).

Practice Drill

  1. Run your ba-work-agent and record the total tokens used (check the DeepSeek dashboard)
  2. Run it 5 times. What's the cost variance? Is it predictable?
  3. Add context summarization: after each tool call, summarize the observation to 100 tokens before feeding back
  4. Think about security: what if your agent had access to a production database? What guardrails would you add?
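For step 2, one simple way to judge predictability is to compare the spread of the runs to their average. The token counts below are made-up placeholders, not measurements. Replace them with the five readings from your own runs.

```python
import statistics

# Hypothetical total-token counts from five runs; replace with your own readings.
run_tokens = [4200, 3900, 6100, 4050, 4400]

mean_tokens = statistics.mean(run_tokens)
stdev_tokens = statistics.stdev(run_tokens)  # sample standard deviation
spread = stdev_tokens / mean_tokens          # coefficient of variation

print(f"mean={mean_tokens:.0f} stdev={stdev_tokens:.0f} cv={spread:.2f}")
```

A small coefficient of variation suggests cost per run is fairly predictable. A large one usually means some runs take extra tool calls or iterations, which is worth investigating run by run.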
⚡ Quick Check
Q1: Why does context grow in multi-agent systems, and what's the fix?

Each agent adds reasoning, tool outputs, and intermediate results. By agent 5-10, the accumulated context can exceed the model's window. Fix: summarize or selectively prune between agents — pass a compressed version instead of the full raw context.

Q2: Which orchestration pattern is the most expensive and why?

Magentic — the manager agent iterates many times, revising the plan and calling specialists repeatedly. Cost is hard to predict because the number of iterations depends on the problem. Microsoft notes it's "the most variable" in cost.
