KnowMBA Advisory
AI Strategy · Advanced · 9 min read

Multi-Agent System Design

Multi-agent systems decompose a task across specialized LLM agents that coordinate via messages, shared state, or an orchestrator. Common patterns: (1) Orchestrator-worker — a planner agent dispatches subtasks to specialist agents (researcher, writer, critic, executor). (2) Pipeline — agents hand off sequential stages. (3) Debate/critic loops — two or more agents adversarially refine an answer. (4) Swarm — many short-lived agents work on shards of the same problem in parallel. The promise is scaling intelligence beyond a single context window; the cost is communication overhead, error compounding, and a debugging nightmare.

Also known as: Multi-Agent Architecture, Agent Orchestration, Agent Swarms, MAS Design, Agent-of-Agents

The Trap

The trap is using multi-agent because it sounds sophisticated when a single well-prompted call would do. Each additional agent in the chain multiplies error rates: if each agent is 90% reliable on its sub-task, a 5-agent pipeline is 0.9^5 ≈ 59% reliable end-to-end. Costs balloon — message-passing means agents repeatedly re-read context, often 3-10x the tokens of a monolithic call. And debugging is brutal: when the system produces a bad answer, you must trace which of N agents introduced the error, often through hundreds of inter-agent messages. KnowMBA POV: multi-agent systems sound clever in demos and break in production. Default to single-agent + tools; reach for multi-agent only when the problem genuinely cannot be expressed in one prompt.
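The compounding math is worth making concrete. A minimal sketch, assuming each stage fails independently (itself an optimistic assumption, since errors often cascade):

```python
# Reliability of a sequential agent pipeline, assuming independent
# failures at each stage.
def pipeline_reliability(per_agent: float, n_agents: int) -> float:
    return per_agent ** n_agents

print(round(pipeline_reliability(0.90, 1), 2))  # 0.9
print(round(pipeline_reliability(0.90, 5), 2))  # 0.59
```

Five stages at 90% each is not "90% reliable with extra steps"; it is a coin flip plus change.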

What to Do

Apply a four-question gate before going multi-agent: (1) Does the task have genuinely independent sub-problems that can run in parallel? (If sequential, you probably want a single agent with tools, not multiple agents.) (2) Do the sub-tasks need different system prompts, models, or guardrails? (If not, one agent suffices.) (3) Can you accept a 2-5x cost increase and 30-60% lower end-to-end reliability without remediation? (4) Have you instrumented per-agent traces, message-bus logging, and per-step eval harnesses? If any answer is no, simplify. When you do build multi-agent, enforce: (a) typed message contracts between agents, (b) max-hop budgets to kill runaway loops, (c) a single source of truth for shared state (avoid agents re-deriving the same facts), (d) per-agent eval suites — a system is only as reliable as its weakest agent.
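Guardrails (a) and (b) can be sketched in a few lines: a typed handoff message carrying a hop counter, plus a hard max-hop budget. The `Handoff` type and the `MAX_HOPS` value are illustrative assumptions, not any framework's API:

```python
from dataclasses import dataclass

MAX_HOPS = 8  # assumption: budget sized for a small system with retries

@dataclass(frozen=True)
class Handoff:
    sender: str
    recipient: str
    payload: str  # in practice: a schema-validated structure, not free text
    hop: int = 0

def forward(msg: Handoff, recipient: str, payload: str) -> Handoff:
    # Every handoff increments the hop counter; exceeding the budget
    # kills runaway agent-to-agent loops instead of burning tokens.
    if msg.hop + 1 > MAX_HOPS:
        raise RuntimeError(f"max-hop budget exceeded at hop {msg.hop + 1}")
    return Handoff(msg.recipient, recipient, payload, msg.hop + 1)
```

The frozen dataclass makes each handoff an immutable record, which keeps traces trustworthy when you debug after the fact.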

Formula

End-to-End Reliability ≈ Π (per-agent reliability)
Token Cost ≈ N × per-agent tokens × context-rebroadcast factor
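The same formula as a back-of-envelope estimator; the numbers passed in are illustrative, not benchmarks:

```python
from math import prod

def end_to_end_reliability(per_agent: list[float]) -> float:
    # Product of per-agent reliabilities, assuming independent failures.
    return prod(per_agent)

def token_cost(n_agents: int, per_agent_tokens: int, rebroadcast: float) -> float:
    # Rebroadcast factor models agents repeatedly re-reading shared context.
    return n_agents * per_agent_tokens * rebroadcast

print(round(end_to_end_reliability([0.95, 0.92, 0.90]), 2))  # 0.79
print(token_cost(5, 4_000, 3.0))  # 60000.0
```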

In Practice

Anthropic's published research on a multi-agent research system describes an orchestrator-worker pattern where a lead agent decomposes research queries into parallel subagent tasks. The post is candid about the trade-offs: the multi-agent system uses roughly 15x more tokens than a single Claude call, and is reserved for breadth-first research questions where the parallel exploration is worth the cost. Most queries do not justify it. The headline lesson from one of the most credible multi-agent deployments in production: even when it works, it is expensive and hard, and you should not reach for it by default.

Pro Tips

01. Single agent with many tools beats many agents with few tools, in most cases. The model has full context; debugging is one trace; cost is bounded by one loop. Multi-agent is the right call only when the parallelism is the entire point (e.g., search the web 12 ways simultaneously).

02. If you must orchestrate, treat the orchestrator like a critical infra service: typed schemas for handoffs, retry/backoff per worker, a dead-letter queue for failed sub-tasks, and a hard max-hop budget. Without these, your 'agent system' is a distributed system without distributed-system discipline.

03. Run an A/B before going multi-agent: same task, single-agent baseline vs multi-agent variant, on a 200-item eval. If the multi-agent version isn't materially better on quality (not just 'feels smarter'), keep the single agent. The token bill alone usually decides it.
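The worker discipline from tip 02 fits in a few lines; `run_with_retries` and its defaults are hypothetical names for illustration, not a library API:

```python
import time

# Per-worker retry with exponential backoff, plus a dead-letter list
# for sub-tasks that exhaust their retries.
def run_with_retries(worker, task, max_retries=3, base_delay=0.5, dead_letter=None):
    for attempt in range(max_retries):
        try:
            return worker(task)
        except Exception:
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    if dead_letter is not None:
        dead_letter.append(task)  # park the failure for offline inspection
    return None
```

The dead-letter list matters as much as the retries: a sub-task that silently vanishes is exactly the kind of partial failure that makes multi-agent systems undebuggable.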

Myth vs Reality

Myth

More agents = more intelligence

Reality

More agents = more places for the system to fail. Each handoff is a lossy compression of context. Adding a 'critic' agent that re-reads outputs catches some errors and introduces others. Empirically, beyond 3-4 specialized agents, marginal quality gains turn negative as coordination overhead dominates.

Myth

Multi-agent systems are how AGI will work, so we should build that way now

Reality

Future architectures are speculative; today's production systems suffer from the same engineering realities as any distributed system: latency, partial failure, debugging cost. Build for what works today; rearchitect when the model capabilities or tools materially change.


Knowledge Check

Your team is designing a customer-support automation. Current proposal: 6 specialized agents (intent-classifier, KB-retriever, policy-checker, response-drafter, tone-reviewer, sender). Each is 92% reliable in isolation. What's the most important critique?


When to Use Multi-Agent

An engineering decision framework:

  • Strong Fit: parallel breadth-first research, swarm simulation, independent sub-tasks
  • Conditional Fit: sequential stages with materially different prompts and guardrails
  • Probably Wrong: sequential stages that could be one prompt with tools
  • Anti-Pattern: multi-agent because 'agents are the future'

Source: Anthropic engineering blog + production deployment patterns

Real-world cases


Anthropic Research System (2024-2025) · Outcome: success

Anthropic published an engineering account of a multi-agent research system used internally and in product surfaces. A lead orchestrator agent decomposes a query into parallel research subagent tasks; subagents explore different facets and return findings; the orchestrator synthesizes. The post explicitly states the system uses ~15x the tokens of a single Claude call, and is reserved for queries where breadth-first parallel research justifies the cost. Most queries are still single-agent.

Token Cost vs Single Call: ~15x
Pattern: Orchestrator-worker, parallel
Default Posture: Single-agent unless parallelism is the point

Even at the frontier, multi-agent is reserved for problems where parallel exploration is worth a 15x token premium. Not the default; not even close.


Hypothetical: The 7-Agent Customer Email Pipeline · Composite scenario · Outcome: failure

A SaaS company built a 7-agent pipeline for customer email replies: classifier → retriever → policy-checker → drafter → tone-reviewer → personalizer → sender. Each agent was ~91% reliable. End-to-end reliability landed at 0.91^7 ≈ 52%. The team rebuilt as a single agent with retrieval and policy tools; reliability rose to 84%, latency dropped from 14s to 3s, and token cost fell 70%. The 'sophisticated' pipeline was worse on every dimension.

Multi-Agent Reliability: ~52%
Single-Agent Reliability: ~84%
Latency: 14s → 3s
Token Cost Reduction: 70%

Multiplicative error decay is the silent killer of multi-agent systems. If you can't justify why splitting helps more than it hurts, don't split.

Decision scenario

The Architect's Multi-Agent Pitch

Your principal engineer proposes a 6-agent system to handle internal IT tickets: classifier, knowledge-retriever, policy-validator, action-planner, executor, summarizer. Each agent is estimated at 90% reliable. Your current single-agent + tools prototype is 78% reliable. Throughput needs are 10K tickets/day. The principal argues 'specialized agents are more accurate.'

Tickets per Day: 10,000
Single-Agent Prototype Reliability: 78%
Proposed Multi-Agent Per-Stage Reliability: 90%
Single-Agent Token Cost per Ticket: ~6,000 tokens
Estimated Multi-Agent Tokens per Ticket: ~28,000 tokens (~5x rebroadcast)
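Before deciding, it helps to run the scenario's own numbers:

```python
# All inputs are the estimates stated above.
tickets_per_day = 10_000
single_rel = 0.78
stage_rel, stages = 0.90, 6

multi_rel = stage_rel ** stages                     # 0.9^6 ≈ 0.53
failed_single = tickets_per_day * (1 - single_rel)  # ≈ 2,200 failed tickets/day
failed_multi = tickets_per_day * (1 - multi_rel)    # ≈ 4,686 failed tickets/day
token_multiplier = 28_000 / 6_000                   # ≈ 4.7x tokens per ticket

print(round(multi_rel, 2), round(failed_multi), round(token_multiplier, 1))
```

The 'more accurate' proposal roughly doubles the daily failure count while nearly quintupling token spend, before any coordination bugs.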

Decision 1

You need to decide architecture before the next sprint. The principal is influential and the design 'feels' more rigorous.

Option A: Approve the 6-agent architecture — specialized agents will produce better outputs, and the team can iterate to improve each stage independently.

Outcome: Six weeks later, end-to-end reliability lands at 0.90^6 ≈ 53%. Token spend is ~$8K/day vs the projected $1.7K. Worse, when tickets fail, debugging requires tracing through 6 inter-agent messages — incident MTTR triples. Customers escalate. The team spends Q3 firefighting and rebuilds as single-agent in Q4. The principal quietly leaves.

End-to-End Reliability: 78% → 53% · Daily Token Spend: $1.7K → $8K · Incident MTTR: 1x → 3x · Schedule Slip: +1 quarter
Option B: Stay single-agent + tools. Invest the next two sprints in prompt engineering, eval harnesses, and tool reliability to raise the 78% baseline. Revisit multi-agent only if you hit a ceiling that single-agent provably cannot cross.

Outcome: Two sprints of focused improvement raise single-agent reliability from 78% to 89%. Token cost stays at ~$1.7K/day. Latency stays under 4s. When edge cases surface, you trace one agent's reasoning — MTTR is fast. The principal's instinct to 'specialize' is redirected into per-tool eval suites and richer prompts. By end of quarter, you ship to production at 89% reliability, $1.7K/day, with a clean operational story.

Reliability: 78% → 89% · Daily Token Spend: Held at ~$1.7K · Latency: Held under 4s · Time to Production: On schedule


Beyond the concept

Turn Multi-Agent System Design into a live operating decision.

Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.

Typical response time: 24h · No retainer required
