KnowMBA Advisory
AI Strategy · Advanced · 9 min read

Multi-Agent System Design

Multi-agent systems decompose a task across specialized LLM agents that coordinate via messages, shared state, or an orchestrator. Common patterns: (1) Orchestrator-worker — a planner agent dispatches subtasks to specialist agents (researcher, writer, critic, executor). (2) Pipeline — agents hand off sequential stages. (3) Debate/critic loops — two or more agents adversarially refine an answer. (4) Swarm — many short-lived agents work on shards of the same problem in parallel. The promise is scaling intelligence beyond a single context window; the cost is communication overhead, error compounding, and a debugging nightmare.

Also known as: Multi-Agent Architecture, Agent Orchestration, Agent Swarms, MAS Design, Agent-of-Agents

The Trap

The trap is using multi-agent because it sounds sophisticated when a single well-prompted call would do. Each additional agent in the chain multiplies error rates: if each agent is 90% reliable on its sub-task, a 5-agent pipeline is 0.9^5 ≈ 59% reliable end-to-end. Costs balloon — message-passing means agents repeatedly re-read context, often 3-10x the tokens of a monolithic call. And debugging is brutal: when the system produces a bad answer, you must trace which of N agents introduced the error, often through hundreds of inter-agent messages. KnowMBA POV: multi-agent systems sound clever in demos and break in production. Default to single-agent + tools; reach for multi-agent only when the problem genuinely cannot be expressed in one prompt.
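The compounding math is worth making concrete. A minimal sketch, assuming each stage fails independently (itself an optimistic assumption, since errors often cascade):

```python
# Reliability of a sequential agent pipeline, assuming independent
# failures at each stage.
def pipeline_reliability(per_agent: float, n_agents: int) -> float:
    return per_agent ** n_agents

print(round(pipeline_reliability(0.90, 1), 2))  # 0.9
print(round(pipeline_reliability(0.90, 5), 2))  # 0.59
```

Five stages at 90% each is not "90% reliable with extra steps"; it is a coin flip plus change.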

What to Do

Apply a four-question gate before going multi-agent: (1) Does the task have genuinely independent sub-problems that can run in parallel? (If sequential, you probably want a single agent with tools, not multiple agents.) (2) Do the sub-tasks need different system prompts, models, or guardrails? (If not, one agent suffices.) (3) Can you accept a 2-5x cost increase and 30-60% lower end-to-end reliability without remediation? (4) Have you instrumented per-agent traces, message-bus logging, and per-step eval harnesses? If any answer is no, simplify. When you do build multi-agent, enforce: (a) typed message contracts between agents, (b) max-hop budgets to kill runaway loops, (c) a single source of truth for shared state (avoid agents re-deriving the same facts), (d) per-agent eval suites — a system is only as reliable as its weakest agent.
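Guardrails (a) and (b) can be sketched in a few lines: a typed handoff message carrying a hop counter, plus a hard max-hop budget. The `Handoff` type and the `MAX_HOPS` value are illustrative assumptions, not any framework's API:

```python
from dataclasses import dataclass

MAX_HOPS = 8  # assumption: budget sized for a small system with retries

@dataclass(frozen=True)
class Handoff:
    sender: str
    recipient: str
    payload: str  # in practice: a schema-validated structure, not free text
    hop: int = 0

def forward(msg: Handoff, recipient: str, payload: str) -> Handoff:
    # Every handoff increments the hop counter; exceeding the budget
    # kills runaway agent-to-agent loops instead of burning tokens.
    if msg.hop + 1 > MAX_HOPS:
        raise RuntimeError(f"max-hop budget exceeded at hop {msg.hop + 1}")
    return Handoff(msg.recipient, recipient, payload, msg.hop + 1)
```

The frozen dataclass makes each handoff an immutable record, which keeps traces trustworthy when you debug after the fact.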

Formula

End-to-End Reliability ≈ Π (per-agent reliability)
Token Cost ≈ N × per-agent tokens × context-rebroadcast factor
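The same formula as a back-of-envelope estimator; the numbers passed in are illustrative, not benchmarks:

```python
from math import prod

def end_to_end_reliability(per_agent: list[float]) -> float:
    # Product of per-agent reliabilities, assuming independent failures.
    return prod(per_agent)

def token_cost(n_agents: int, per_agent_tokens: int, rebroadcast: float) -> float:
    # Rebroadcast factor models agents repeatedly re-reading shared context.
    return n_agents * per_agent_tokens * rebroadcast

print(round(end_to_end_reliability([0.95, 0.92, 0.90]), 2))  # 0.79
print(token_cost(5, 4_000, 3.0))  # 60000.0
```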

In Practice

Anthropic's published research on a multi-agent research system describes an orchestrator-worker pattern where a lead agent decomposes research queries into parallel subagent tasks. The post is candid about the trade-offs: the multi-agent system uses roughly 15x more tokens than a single Claude call, and is reserved for breadth-first research questions where the parallel exploration is worth the cost. Most queries do not justify it. The headline lesson from one of the most credible multi-agent deployments in production: even when it works, it is expensive and hard, and you should not reach for it by default.

Pro Tips

01. Single agent with many tools beats many agents with few tools, in most cases. The model has full context; debugging is one trace; cost is bounded by one loop. Multi-agent is the right call only when the parallelism is the entire point (e.g., search the web 12 ways simultaneously).

02. If you must orchestrate, treat the orchestrator like a critical infra service: typed schemas for handoffs, retry/backoff per worker, a dead-letter queue for failed sub-tasks, and a hard max-hop budget. Without these, your 'agent system' is a distributed system without distributed-system discipline.

03. Run an A/B before going multi-agent: same task, single-agent baseline vs multi-agent variant, on a 200-item eval. If the multi-agent version isn't materially better on quality (not just 'feels smarter'), keep the single agent. The token bill alone usually decides it.
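The worker discipline from tip 02 fits in a few lines; `run_with_retries` and its defaults are hypothetical names for illustration, not a library API:

```python
import time

# Per-worker retry with exponential backoff, plus a dead-letter list
# for sub-tasks that exhaust their retries.
def run_with_retries(worker, task, max_retries=3, base_delay=0.5, dead_letter=None):
    for attempt in range(max_retries):
        try:
            return worker(task)
        except Exception:
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    if dead_letter is not None:
        dead_letter.append(task)  # park the failure for offline inspection
    return None
```

The dead-letter list matters as much as the retries: a sub-task that silently vanishes is exactly the kind of partial failure that makes multi-agent systems undebuggable.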

Myth vs Reality

Myth

More agents = more intelligence

Reality

More agents = more places for the system to fail. Each handoff is a lossy compression of context. Adding a 'critic' agent that re-reads outputs catches some errors and introduces others. Empirically, beyond 3-4 specialized agents, marginal quality gains turn negative as coordination overhead dominates.

Myth

Multi-agent systems are how AGI will work, so we should build that way now

Reality

Future architectures are speculative; today's production systems suffer from the same engineering realities as any distributed system: latency, partial failure, debugging cost. Build for what works today; rearchitect when the model capabilities or tools materially change.


Knowledge Check

Your team is designing a customer-support automation. Current proposal: 6 specialized agents (intent-classifier, KB-retriever, policy-checker, response-drafter, tone-reviewer, sender). Each is 92% reliable in isolation. What's the most important critique?


When to Use Multi-Agent

An engineering decision framework:

  • Strong Fit: parallel breadth-first research, swarm simulation, independent sub-tasks
  • Conditional Fit: sequential stages with materially different prompts and guardrails
  • Probably Wrong: sequential stages that could be one prompt with tools
  • Anti-Pattern: multi-agent because 'agents are the future'

Source: Anthropic engineering blog + production deployment patterns

Real-world cases


Anthropic Research System (2024-2025) · Outcome: success

Anthropic published an engineering account of a multi-agent research system used internally and in product surfaces. A lead orchestrator agent decomposes a query into parallel research subagent tasks; subagents explore different facets and return findings; the orchestrator synthesizes. The post explicitly states the system uses ~15x the tokens of a single Claude call, and is reserved for queries where breadth-first parallel research justifies the cost. Most queries are still single-agent.

Token Cost vs Single Call: ~15x
Pattern: Orchestrator-worker, parallel
Default Posture: Single-agent unless parallelism is the point

Even at the frontier, multi-agent is reserved for problems where parallel exploration is worth a 15x token premium. Not the default; not even close.


Hypothetical: The 7-Agent Customer Email Pipeline · Composite scenario · Outcome: failure

A SaaS company built a 7-agent pipeline for customer email replies: classifier → retriever → policy-checker → drafter → tone-reviewer → personalizer → sender. Each agent was ~91% reliable. End-to-end reliability landed at 0.91^7 ≈ 52%. The team rebuilt as a single agent with retrieval and policy tools; reliability rose to 84%, latency dropped from 14s to 3s, and token cost fell 70%. The 'sophisticated' pipeline was worse on every dimension.

Multi-Agent Reliability: ~52%
Single-Agent Reliability: ~84%
Latency: 14s → 3s
Token Cost Reduction: 70%

Multiplicative error decay is the silent killer of multi-agent systems. If you can't justify why splitting helps more than it hurts, don't split.

Decision scenario

The Architect's Multi-Agent Pitch

Your principal engineer proposes a 6-agent system to handle internal IT tickets: classifier, knowledge-retriever, policy-validator, action-planner, executor, summarizer. Each agent is estimated at 90% reliable. Your current single-agent + tools prototype is 78% reliable. Throughput needs are 10K tickets/day. The principal argues 'specialized agents are more accurate.'

Tickets per Day: 10,000
Single-Agent Prototype Reliability: 78%
Proposed Multi-Agent Per-Stage Reliability: 90%
Single-Agent Token Cost per Ticket: ~6,000 tokens
Estimated Multi-Agent Tokens per Ticket: ~28,000 tokens (~5x rebroadcast)
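Before deciding, it helps to run the scenario's own numbers:

```python
# All inputs are the estimates stated above.
tickets_per_day = 10_000
single_rel = 0.78
stage_rel, stages = 0.90, 6

multi_rel = stage_rel ** stages                     # 0.9^6 ≈ 0.53
failed_single = tickets_per_day * (1 - single_rel)  # ≈ 2,200 failed tickets/day
failed_multi = tickets_per_day * (1 - multi_rel)    # ≈ 4,686 failed tickets/day
token_multiplier = 28_000 / 6_000                   # ≈ 4.7x tokens per ticket

print(round(multi_rel, 2), round(failed_multi), round(token_multiplier, 1))
```

The 'more accurate' proposal roughly doubles the daily failure count while nearly quintupling token spend, before any coordination bugs.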

Decision 1

You need to decide architecture before the next sprint. The principal is influential and the design 'feels' more rigorous.

Option A: Approve the 6-agent architecture — specialized agents will produce better outputs, and the team can iterate to improve each stage independently.

Outcome: Six weeks later, end-to-end reliability lands at 0.90^6 ≈ 53%. Token spend is ~$8K/day vs the projected $1.7K. Worse, when tickets fail, debugging requires tracing through 6 inter-agent messages — incident MTTR triples. Customers escalate. The team spends Q3 firefighting and rebuilds as single-agent in Q4. The principal quietly leaves.

End-to-End Reliability: 78% → 53% · Daily Token Spend: $1.7K → $8K · Incident MTTR: 1x → 3x · Schedule Slip: +1 quarter
Option B: Stay single-agent + tools. Invest the next two sprints in prompt engineering, eval harnesses, and tool reliability to raise the 78% baseline. Revisit multi-agent only if you hit a ceiling that single-agent provably cannot cross.

Outcome: Two sprints of focused improvement raise single-agent reliability from 78% to 89%. Token cost stays at ~$1.7K/day. Latency stays under 4s. When edge cases surface, you trace one agent's reasoning — MTTR is fast. The principal's instinct to 'specialize' is redirected into per-tool eval suites and richer prompts. By end of quarter, you ship to production at 89% reliability, $1.7K/day, with a clean operational story.

Reliability: 78% → 89% · Daily Token Spend: Held at ~$1.7K · Latency: Held under 4s · Time to Production: On schedule


Beyond the concept

Turn Multi-Agent System Design into a live operating decision.

Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.

Typical response time: 24h · No retainer required
