Wiki: LLM orchestration

Orchestration is the layer between a raw model and a useful system. It decides when a model runs, what context it receives, which tools it can call, and how its outputs are validated or handed off. The sources here span that entire problem: from theoretical coordination papers to production harness designs to arguments about where custom orchestration is and isn’t worth building.

The earliest multi-agent systems, surveyed in Wave 1 research, established that agents could coordinate at all. CAMEL, ChatDev, MetaGPT, and AutoGen each used a different delegation pattern, but they shared failure modes: no concurrency control, no escalation paths, and coordination mechanisms that didn’t match the task structure. Subsequent work on debate and state pushed further, finding that the right coordination model depends heavily on the task and that distributed systems theory offers formalisms the field hasn’t fully borrowed.

On the practical engineering side, multiple sources converge on the same finding: prompting is a poor substitute for structure. Brian Suh argues that reliable agents need deterministic control flow encoded in software, with explicit state transitions and validation checkpoints. A case study in data engineering agent evolution supports this, showing that environmental constraints (tool design, ID keys, context visibility) outperformed prompt engineering across three successive architectures.
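To make "deterministic control flow with explicit state transitions and validation checkpoints" concrete, here is a minimal sketch. It is not taken from any of the sources: the state names, `run_workflow`, and the `generate` and `validate` callables are all hypothetical, and the model call is stubbed out as a plain function. The point is only that software, not the prompt, decides when to retry, accept, or give up.

```python
import json
from enum import Enum, auto
from typing import Callable, List, Optional, Tuple


class State(Enum):
    DRAFT = auto()      # ask the model for an output
    VALIDATE = auto()   # checkpoint: check the output in code
    DONE = auto()
    FAILED = auto()


def run_workflow(
    generate: Callable[[str, int], str],   # hypothetical stand-in for a model call
    validate: Callable[[Optional[str]], bool],
    task: str,
    max_attempts: int = 3,
) -> Tuple[State, Optional[str], List[State]]:
    """Drive the model through a fixed state machine and record each transition."""
    state = State.DRAFT
    attempt = 0
    output: Optional[str] = None
    trace = [state]
    while state not in (State.DONE, State.FAILED):
        if state is State.DRAFT:
            attempt += 1
            output = generate(task, attempt)
            state = State.VALIDATE
        elif state is State.VALIDATE:
            if validate(output):
                state = State.DONE
            elif attempt < max_attempts:
                state = State.DRAFT      # retry is a code decision, not a prompt instruction
            else:
                state = State.FAILED     # explicit terminal state instead of looping forever
        trace.append(state)
    return state, output, trace


def is_json(text: Optional[str]) -> bool:
    """Example validation checkpoint: the output must parse as JSON."""
    try:
        json.loads(text)
        return True
    except (json.JSONDecodeError, TypeError):
        return False
```

In use, a caller passes its own model wrapper as `generate`. The returned `trace` gives an auditable record of which transitions were taken, which a prompt-only loop does not provide.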

Harness design is where this plays out concretely. Anthropic’s Managed Agents architecture separates the agent harness, session log, and sandbox into stable, swappable interfaces so the system can evolve as models improve. Their long-running harness work uses an initializer agent to scaffold a feature list and progress file before a coding agent begins, maintaining state across multiple context windows. A GAN-inspired three-agent setup — planner, generator, evaluator — addresses context anxiety and self-evaluation bias during multi-hour autonomous coding sessions. Dynamic workflows in Claude Code extend this further, letting Claude write its own orchestration scripts that spin up parallel subagents for large-scale tasks.
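The initializer-plus-progress-file pattern can be sketched as follows. This is an illustration under stated assumptions, not Anthropic’s implementation: the JSON layout, the function names `initialize` and `run_session`, and the `implement` callable (standing in for a coding agent working in one context window) are all hypothetical. What it shows is that state lives on disk, so each new session can resume without remembering the previous one.

```python
import json
from pathlib import Path
from typing import Callable, List, Optional


def initialize(progress_path: Path, features: List[str]) -> None:
    """Initializer step: scaffold the feature list before any coding session runs."""
    state = {"features": [{"name": f, "status": "pending"} for f in features]}
    progress_path.write_text(json.dumps(state, indent=2))


def run_session(progress_path: Path, implement: Callable[[str], bool]) -> Optional[str]:
    """One coding session: read the progress file, attempt the next pending feature.

    Returns the feature name that was attempted, or None when nothing is pending.
    """
    state = json.loads(progress_path.read_text())
    for feature in state["features"]:
        if feature["status"] == "pending":
            succeeded = implement(feature["name"])   # hypothetical agent call
            if succeeded:
                feature["status"] = "done"
            # Persist before returning so the next context window sees the update.
            progress_path.write_text(json.dumps(state, indent=2))
            return feature["name"]
    return None
```

Calling `run_session` repeatedly with the same path works through the list one feature per session. A feature whose `implement` call returns `False` stays pending and is retried by the next session.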

A dissenting view from Aiyan’s orchestration post argues that custom orchestration frameworks are rarely the right investment: teams should ship MCP tool servers and agent skills that plug into frontier agents, letting providers maintain the loop. The AI control plane framing from Speakeasy takes a different angle — enterprises need a governance layer unifying identity, policy enforcement, tool routing, and observability across all agents, which is itself an orchestration problem at the infrastructure level.

At the routing layer, both DigitalOcean’s Inference Router and Arch-Router address model selection as an orchestration sub-problem: routing each request to the best-fit model for cost, latency, or quality using compact routing models rather than fixed assignments.
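As a rough illustration of per-request routing, the sketch below maps each prompt to a model through a classifier instead of a fixed assignment. It does not reproduce either product’s API: the model names, costs, and route table are invented, and `keyword_classifier` is a deliberately crude stand-in for the compact routing model those systems use.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class ModelOption:
    name: str                  # hypothetical model identifier
    cost_per_1k_tokens: float  # illustrative figure, not a real price


# Hypothetical routing policy: request category -> best-fit model.
ROUTES: Dict[str, ModelOption] = {
    "code": ModelOption("large-model", 0.010),
    "chat": ModelOption("small-model", 0.001),
}
DEFAULT = ModelOption("medium-model", 0.004)


def keyword_classifier(prompt: str) -> str:
    """Stand-in for a compact routing model; real routers learn this mapping."""
    lowered = prompt.lower()
    if "def " in lowered or "stack trace" in lowered:
        return "code"
    if len(lowered.split()) < 12:
        return "chat"
    return "other"


def route(prompt: str, classify: Callable[[str], str] = keyword_classifier) -> ModelOption:
    """Pick a model per request; unknown categories fall back to DEFAULT."""
    return ROUTES.get(classify(prompt), DEFAULT)
```

Because the classifier is injected as a parameter, swapping the keyword heuristic for a learned router changes one argument and leaves the policy table untouched.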

Armin Ronacher’s warning about harness loops cuts across all of this: outer orchestration loops amplify LLMs’ worst tendencies and risk producing codebases that require machine participation to maintain. The engineering challenges of orchestration are tractable; the oversight questions they raise are not yet resolved.