Wiki: Multi-agent systems

Multi-agent systems (MAS) compose multiple LLM-backed agents that divide labor, communicate results, and check each other’s work. The research field developed in two identifiable waves as mapped by Christopher Meiklejohn: a 2023 wave of coordination proofs-of-concept, followed by a 2025 wave focused on measuring why those systems fail in production.

The 2023 systems, including CAMEL, Generative Agents, ChatDev, MetaGPT, and AutoGen, demonstrated that agents could coordinate at all, but they shared structural weaknesses: no concurrency control, no escalation paths when agents disagree, and no principled recovery from failures, as catalogued in Part 3 of the series. Empirical work from 2025 and 2026 put numbers on those weaknesses. Papers surveyed under MAST, MAS-FIRE, and Silo-Bench found failure rates between 41% and 87%, and found inter-agent reasoning failures structurally harder to fix than prompt-level issues, per Part 4.

Coordination structure matters as much as model quality. Meiklejohn’s series argues that convergent debate, adversarial debate, shared-notebook state, and the CALM theorem each suit different task types, and that distributed systems theory offers formalisms the MAS field is quietly rediscovering without the vocabulary to name them (Part 5). Output verification follows the same logic: checking work in a different representation from the one in which it was produced, which the series calls modality shift, improves reliability more consistently than re-prompting the same agent (Part 6).
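One reading of modality shift can be sketched as follows: rather than asking a second agent to re-read generated code as text, the verifier executes it and checks properties of its behavior. This is a minimal illustration, not the series' own implementation. The functions agent_output and buggy_agent_output are hypothetical stand-ins for agent-produced code, and the sorting task is an assumption chosen for brevity.

```python
import random
from collections import Counter
from typing import Callable, List


def agent_output(xs: List[int]) -> List[int]:
    """Hypothetical stand-in for a sorting routine written by an agent."""
    return sorted(xs)


def buggy_agent_output(xs: List[int]) -> List[int]:
    """Hypothetical flawed output: looks plausible as text, but drops duplicates."""
    return sorted(set(xs))


def verify_by_execution(fn: Callable[[List[int]], List[int]],
                        trials: int = 200, seed: int = 0) -> bool:
    """Check the code in a different representation: its runtime behavior.

    Two properties must hold on random inputs: the output is ordered,
    and it contains exactly the same items as the input.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-5, 5) for _ in range(rng.randint(0, 8))]
        out = fn(list(xs))
        ordered = all(a <= b for a, b in zip(out, out[1:]))
        same_items = Counter(out) == Counter(xs)
        if not (ordered and same_items):
            return False
    return True
```

A reviewer reading buggy_agent_output as text might accept it; executing it against the two properties rejects it, which is the point of moving the check into another representation.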

Production deployments illustrate the architecture choices this analysis implies. Anthropic’s Managed Agents service decouples the agent harness, session log, and sandbox into stable interfaces so that implementations can be swapped as models improve. According to the engineering post, the redesign cut p50 time-to-first-token by roughly 60% and enabled multi-brain, multi-sandbox topologies. Claude Code’s dynamic workflows take this further, letting Claude write orchestration scripts that spawn hundreds of parallel subagents for tasks like codebase-wide migrations or security audits, per Anthropic’s announcement. Zerostack’s subagent design narrows scope differently: it uses read-only parallel child agents for multi-file exploration to avoid bloating the main agent’s context, and Zerostack reports a 25% improvement in code exploration time.
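The decoupling idea can be sketched with structural interfaces. The example below is an illustration only and does not reflect Anthropic's actual API: the names SessionLog, Sandbox, Harness, InMemoryLog, and EchoSandbox are all hypothetical. It shows a harness that depends only on two small interfaces, so either implementation can be replaced without touching the harness.

```python
from dataclasses import dataclass, field
from typing import List, Protocol


class SessionLog(Protocol):
    """Hypothetical interface: an append-only record of session events."""
    def append(self, event: str) -> None: ...
    def events(self) -> List[str]: ...


class Sandbox(Protocol):
    """Hypothetical interface: somewhere commands are executed."""
    def run(self, command: str) -> str: ...


@dataclass
class InMemoryLog:
    _events: List[str] = field(default_factory=list)

    def append(self, event: str) -> None:
        self._events.append(event)

    def events(self) -> List[str]:
        return list(self._events)  # return a copy so callers cannot mutate the log


class EchoSandbox:
    """Toy sandbox that only echoes the command instead of executing it."""
    def run(self, command: str) -> str:
        return f"ran: {command}"


@dataclass
class Harness:
    """Depends only on the two interfaces, never on a concrete implementation."""
    log: SessionLog
    sandbox: Sandbox

    def step(self, command: str) -> str:
        self.log.append(f"request {command}")
        result = self.sandbox.run(command)
        self.log.append(f"result {result}")
        return result
```

Swapping EchoSandbox for a container-backed class, or InMemoryLog for a durable store, would leave Harness unchanged as long as the method signatures match.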

Security use cases show a distinct harness pattern. Cloudflare’s Project Glasswing ran a multi-agent harness with parallel hunters, adversarial validators, and cross-repo tracers against its own codebases, finding that the harness structure improved vulnerability discovery substantially over a generic coding agent per Cloudflare’s report.
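The hunter-and-validator shape can be sketched in a few lines. This is not Cloudflare's implementation: the hunters here are trivial string checks standing in for LLM-backed agents, and hunter_eval, hunter_todo, skeptical_validator, and run_harness are hypothetical names. The sketch only shows the control flow, with parallel hunters proposing findings and a validator trying to discard them.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

Finding = str
Hunter = Callable[[str], List[Finding]]
Validator = Callable[[str, Finding], bool]


def hunter_eval(code: str) -> List[Finding]:
    """Hypothetical hunter: flags use of eval()."""
    return ["eval-call"] if "eval(" in code else []


def hunter_todo(code: str) -> List[Finding]:
    """Hypothetical noisy hunter: flags TODO markers as if they were issues."""
    return ["todo-marker"] if "TODO" in code else []


def skeptical_validator(code: str, finding: Finding) -> bool:
    """Hypothetical adversarial validator: a finding survives only if it cannot be refuted.

    Here the refutation rule is hard-coded; a real validator would reason about the code.
    """
    return finding != "todo-marker"


def run_harness(code: str, hunters: List[Hunter], validate: Validator) -> List[Finding]:
    # Hunters run in parallel; pool.map returns results in hunter order.
    with ThreadPoolExecutor() as pool:
        batches = list(pool.map(lambda hunt: hunt(code), hunters))
    candidates = [finding for batch in batches for finding in batch]
    # Only findings that survive validation are reported.
    return [finding for finding in candidates if validate(code, finding)]
```

Separating proposal from refutation means a noisy hunter costs validation time instead of producing false reports.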

Benchmarks remain a weak point. Standard tests like HumanEval and SWE-bench were designed for single agents and cannot measure coordination quality, communication overhead, or failure recovery, which means published MAS numbers are rarely comparable across systems (Part 7). Open problems include matching topology to reliability guarantees, using CRDTs for shared agent state, and designing backpressure protocols, all of which the field is approaching without settled vocabulary (Part 8).
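To make the CRDT idea concrete, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest CRDTs, used as shared state between agents. It is an illustration under assumptions, not a design taken from the series: the agent identifiers and the choice of a counter as the shared state are hypothetical. Each agent increments only its own slot, and replicas merge by taking the per-agent maximum, so merges can happen in any order without coordination.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class GCounter:
    """Grow-only counter: one non-decreasing slot per agent."""
    counts: Dict[str, int] = field(default_factory=dict)

    def increment(self, agent_id: str, n: int = 1) -> None:
        if n < 0:
            raise ValueError("a grow-only counter cannot be decremented")
        self.counts[agent_id] = self.counts.get(agent_id, 0) + n

    def merge(self, other: "GCounter") -> "GCounter":
        # Element-wise maximum is commutative, associative, and idempotent,
        # so replicas converge regardless of merge order or repetition.
        keys = self.counts.keys() | other.counts.keys()
        return GCounter({k: max(self.counts.get(k, 0), other.counts.get(k, 0))
                         for k in keys})

    def value(self) -> int:
        return sum(self.counts.values())
```

Because the state only grows and merging is order-insensitive, this is also a small instance of the monotonic, coordination-free computation that the CALM theorem describes.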