Multi-agent systems
LLM-based multi-agent systems coordinate multiple AI agents on decomposed tasks, but empirical work reports failure rates of 41–87% and identifies information synthesis, not coordination, as the core bottleneck.
10 sources · May 3, 2026
Compiled by Claude
Agents · LLMs
Multi-agent systems (MAS) in the LLM era involve networks of AI agents assigned specialized roles, communicating to complete tasks too large or complex for a single model context. Christopher Meiklejohn’s eight-part series maps the field’s development across two waves (Getting Up to Speed, Part 1).
The first wave, centered on 2023 papers like CAMEL, ChatDev, MetaGPT, AutoGen, and Generative Agents, asked whether agents could coordinate at all (Part 3). These systems established role-based agent types, coordination structures (hierarchical, peer-to-peer, pipeline), and basic message-passing patterns (Part 2). A failure mode shared across first-wave systems was treating errors as termination conditions rather than as recoverable system state.
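The difference between terminating on an error and keeping it as recoverable state can be sketched in a few lines. This is a minimal illustration, not code from any of the systems named above: `StepResult`, `RunState`, `run_pipeline`, and `make_flaky` are hypothetical names, the agents are plain functions standing in for model calls, and skipping a stage after its retries run out is one assumed policy among several.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Agent = Callable[[str], str]  # hypothetical stand-in for a model-backed agent


@dataclass
class StepResult:
    agent: str
    output: Optional[str] = None
    error: Optional[str] = None  # a failure is recorded as data, not raised


@dataclass
class RunState:
    history: List[StepResult] = field(default_factory=list)

    def errors(self) -> List[StepResult]:
        return [step for step in self.history if step.error is not None]


def run_pipeline(agents: List[Tuple[str, Agent]], task: str,
                 max_retries: int = 1) -> Tuple[str, RunState]:
    """Run agents in sequence, logging failures instead of aborting the run."""
    state = RunState()
    current = task
    for name, agent in agents:
        for _ in range(max_retries + 1):
            try:
                current = agent(current)
                state.history.append(StepResult(name, output=current))
                break
            except Exception as exc:
                state.history.append(StepResult(name, error=str(exc)))
        # Assumed policy: if every attempt failed, `current` is unchanged,
        # so the stage is skipped and later agents see the last good output.
    return current, state


def make_flaky(fail_times: int) -> Agent:
    """Build an agent that raises on its first `fail_times` calls."""
    calls = {"n": 0}

    def agent(text: str) -> str:
        calls["n"] += 1
        if calls["n"] <= fail_times:
            raise RuntimeError("transient failure")
        return text + " -> reviewed"

    return agent


def planner(text: str) -> str:
    return text + " -> planned"


result, state = run_pipeline(
    [("planner", planner), ("reviewer", make_flaky(1))], "task"
)
print(result)
print(len(state.errors()), "error(s) recorded")
```

Because the failure stays in `state.history`, a supervisor or a later agent can inspect it and decide whether to retry, reroute, or report, which a raised exception that ends the run does not allow.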
The second wave turned empirical. MAST catalogued 14 failure modes across 1,600 traces; MAS-FIRE introduced fault injection; Silo-Bench tested isolation. Together they found failure rates ranging from 41% to 87% and identified information synthesis across agents, not coordination mechanics, as the core bottleneck (Part 4).
Work on coordination structure argues that topology must match task structure. The CALM theorem (consistency as logical monotonicity), convergent and adversarial debate patterns, and shared-notebook state models each address different consistency requirements, and Meiklejohn argues the field is reinventing distributed systems theory rather than borrowing it (Part 5). Verification research adds that checking work in a different representation from the one it was produced in (modality shift) is what separates structural verification gates from weak self-critique (Part 6).
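A small sketch can make the modality-shift idea concrete. It assumes an agent has produced a function as source text; one gate re-reads that text, the other executes it and compares behavior against known cases. Both gate functions and the `candidate` string are hypothetical illustrations rather than anything taken from the series, and the text-level gate is only a crude stand-in for self-critique.

```python
from typing import Any, Dict, List, Tuple

Case = Tuple[Tuple[Any, ...], Any]  # (positional args, expected result)


def same_modality_gate(source: str) -> bool:
    """Weak gate: inspects the artifact as text, the form it was produced in."""
    return "def " in source and "return" in source


def execution_gate(source: str, func_name: str, cases: List[Case]) -> bool:
    """Structural gate: shifts from text to observed behavior."""
    namespace: Dict[str, Any] = {}
    try:
        # NOTE: exec() is not a sandbox. Illustration only; never run
        # untrusted agent output this way in a real system.
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        return False


# Hypothetical agent output: plausible as text, wrong in behavior.
candidate = "def add(a, b):\n    return a - b\n"
cases: List[Case] = [((1, 2), 3), ((0, 0), 0)]

print(same_modality_gate(candidate))             # True: the text looks fine
print(execution_gate(candidate, "add", cases))   # False: behavior is wrong
```

The text-level check passes the buggy candidate because it never leaves the representation in which the bug was written; the execution check catches it because the bug is visible only in behavior.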
Benchmarking remains a structural problem. Most MAS benchmarks were designed for single agents and cannot measure coordination quality or communication overhead, which is why ChatDev and MetaGPT report contradictory results on ostensibly the same tasks. Multi-agent overhead only pays off on breadth-first, parallel-decomposable work (Part 7). Open problems include topology-to-reliability mapping, CRDTs for shared agent state, and graceful failure recovery (Part 8).
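To give a sense of what CRDT-backed shared agent state could look like, here is a minimal sketch of a grow-only set, one of the simplest CRDTs. The `GSet` class and the idea of agents recording "findings" in it are assumptions made for illustration; the series lists this as an open problem and does not prescribe a design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GSet:
    """Grow-only set: items can be added but never removed."""
    items: frozenset = frozenset()

    def add(self, item: str) -> "GSet":
        return GSet(self.items | {item})

    def merge(self, other: "GSet") -> "GSet":
        # Set union is commutative, associative, and idempotent, so replicas
        # converge regardless of merge order or repeated delivery.
        return GSet(self.items | other.items)


# Two hypothetical agents record findings independently, with no coordination.
agent_a = GSet().add("api returns 500 on empty body").add("retry logic missing")
agent_b = GSet().add("retry logic missing").add("timeout set to 0")

merged_ab = agent_a.merge(agent_b)
merged_ba = agent_b.merge(agent_a)

print(merged_ab == merged_ba)         # True: merge order does not matter
print(merged_ab.merge(agent_a) == merged_ab)  # True: re-merging is harmless
print(len(merged_ab.items))           # 3: the duplicate finding appears once
```

A grow-only set only covers state that accumulates. Anything that needs deletion, overwriting, or ordering calls for a richer CRDT or explicit coordination, which is part of why the problem remains open.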