Wiki: AI infrastructure

AI infrastructure spans the full stack beneath the model itself: the hardware and networking that serve tokens cheaply, the hosting abstractions that let agents run reliably, the routing and caching layers that manage cost and latency, and the governance and credential plumbing that makes all of it safe to operate in production.

On the compute and serving side, inference is becoming its own discipline. Philip Kiely’s breakdown of inference engineering covers quantization, speculative decoding, parallelism, and disaggregation as first-class techniques rather than implementation details. A recurring theme across sources is that prefill is expensive and caching is the main lever for reducing it. Everpure argues that treating the KV cache as a persistent shared data asset, injected via RDMA rather than recomputed, can cut prefill costs by up to 20x (see the cost analysis and their granular-prompt caching extension). Pure Storage’s KVA takes this further by persisting attention states across sessions on NFS and S3, delivering a reported 20x throughput improvement over standard Ethernet without model changes.
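The idea of reusing KV state instead of recomputing prefill can be sketched as a prefix lookup. The following is a minimal illustration and not Everpure's or Pure Storage's implementation: the names `PrefixKVStore` and `block_keys`, the 4-token block size, and the placeholder string standing in for real attention tensors are all assumptions made for the example.

```python
import hashlib
from typing import Dict, List


def block_keys(tokens: List[int], block: int) -> List[str]:
    """Return one chained hash per full block of tokens.

    Each key covers the whole prefix up to and including that block,
    so two prompts share a key only if they share that entire prefix.
    """
    keys: List[str] = []
    h = hashlib.sha256()
    full = len(tokens) - len(tokens) % block  # ignore the trailing partial block
    for i in range(0, full, block):
        h.update(repr(tokens[i:i + block]).encode())
        keys.append(h.copy().hexdigest())
    return keys


class PrefixKVStore:
    """Toy stand-in for a shared, persistent KV cache (hypothetical)."""

    def __init__(self, block: int = 4) -> None:
        self.block = block
        self._store: Dict[str, str] = {}

    def insert(self, tokens: List[int]) -> None:
        # A real system would store attention key/value tensors here.
        for key in block_keys(tokens, self.block):
            self._store.setdefault(key, "kv-placeholder")

    def cached_prefix_len(self, tokens: List[int]) -> int:
        """Number of leading tokens whose KV state could be loaded, not recomputed."""
        hit = 0
        for key in block_keys(tokens, self.block):
            if key not in self._store:
                break
            hit += self.block
        return hit


store = PrefixKVStore(block=4)
system_prompt = list(range(10))          # shared 10-token prefix
store.insert(system_prompt + [100, 101])

new_request = system_prompt + [200, 201, 202]
reused = store.cached_prefix_len(new_request)
to_prefill = len(new_request) - reused
```

In this sketch only the tokens after the last matching block need prefill, which is why long shared system prompts benefit most from this kind of cache.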

Model routing sits adjacent to caching as a cost-control mechanism. DigitalOcean’s Inference Router uses a 30B MoE model to match each request at runtime to the best-fit model for cost, latency, or quality. The companion Arch-Router research proposes a compact 1.5B preference-aligned routing model that, via domain-action mapping, can accommodate new models without retraining. Meanwhile, the pricing floor for tokens has collapsed, with a 75x gap between the cheapest and most expensive frontier models, making provider-agnostic routing a structural necessity rather than an optimization.
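One way to see why a domain-action mapping lets new models be added without retraining is to separate the two decisions: a selector picks a policy label, and a plain lookup table maps that label to a model. The sketch below is an illustration under stated assumptions, not the Arch-Router or DigitalOcean code: the keyword-matching `select_policy` is a toy stand-in for the routing model, and every policy name and model name is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class RoutePolicy:
    name: str                  # "domain/action" label, e.g. "code/debug"
    keywords: Tuple[str, ...]  # used only by the toy selector below


POLICIES: List[RoutePolicy] = [
    RoutePolicy("code/debug", ("bug", "traceback", "error")),
    RoutePolicy("docs/summarize", ("summarize", "summary", "tl;dr")),
]
FALLBACK = "general/chat"

# Policy -> model table. Swapping a model is a table edit, not a retrain.
MODEL_FOR_POLICY: Dict[str, str] = {
    "code/debug": "hypothetical-code-model",
    "docs/summarize": "hypothetical-small-model",
    FALLBACK: "hypothetical-general-model",
}


def select_policy(prompt: str, policies: List[RoutePolicy]) -> str:
    """Toy stand-in for the routing model: count keyword hits per policy."""
    text = prompt.lower()
    scored = [(sum(k in text for k in p.keywords), p.name) for p in policies]
    score, name = max(scored)
    return name if score > 0 else FALLBACK


def route(prompt: str, table: Dict[str, str] = MODEL_FOR_POLICY) -> str:
    """Resolve a prompt to a model name through the policy table."""
    return table[select_policy(prompt, POLICIES)]


prompt = "Fix this bug: the traceback shows a KeyError"
before = route(prompt)

# Introduce a new model by editing only the mapping.
updated = dict(MODEL_FOR_POLICY)
updated["code/debug"] = "hypothetical-new-code-model"
after = route(prompt, updated)
```

The selector never sees model names, so changing which model serves a policy leaves it untouched.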

Agent hosting introduces a different class of infrastructure problems. Anthropic’s Managed Agents architecture separates the agent harness, session log, and sandbox into stable, swappable interfaces so that, by design, model upgrades don’t break running clients. Governance sits on top of that: the enterprise AI control plane unifies identity, policy enforcement, tool routing, and observability across every agent and system in a single layer, and MCP has emerged as the protocol layer that makes auditable, policy-aware proxying between agents and resources possible at scale. Credential management is a related but undersolved problem; Latchkey handles it by encrypting API tokens on-device so that local agents can authenticate against external services without ever seeing the raw credentials.
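The credential boundary described above can be illustrated with a small broker that attaches the secret only at send time. This is a sketch of the general pattern, not Latchkey's API: `CredentialBroker`, `Request`, the service name, and the token value are hypothetical, and encryption at rest is deliberately omitted to keep the example dependency-free.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Request:
    url: str
    headers: Dict[str, str] = field(default_factory=dict)


class CredentialBroker:
    """Holds tokens outside the agent and injects them at send time."""

    def __init__(self, secrets: Dict[str, str]) -> None:
        # Assumption: a real credential layer would keep these encrypted on disk.
        self._secrets = dict(secrets)

    def authorize(self, service: str, request: Request) -> Request:
        """Return a copy of the request with the Authorization header attached."""
        token = self._secrets[service]
        headers = {**request.headers, "Authorization": f"Bearer {token}"}
        return Request(request.url, headers)


def agent_plan_request() -> Request:
    # The agent names the endpoint only; it never handles the raw token.
    return Request("https://api.example.com/v1/items")


broker = CredentialBroker({"example": "sk-test-123"})
planned = agent_plan_request()
signed = broker.authorize("example", planned)
```

Because `authorize` returns a new object, the request the agent built stays free of credentials and is safe to log or replay into a session transcript.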

At the opposite end of the complexity axis, some builders are deliberately shedding infrastructure. zerostack’s agent memory uses plain Markdown files and regex retrieval (no vector store, no daemon, no embeddings), a choice motivated by RAM constraints and provider neutrality rather than naivety. The Ollama critique makes a parallel point from the local inference side: defaults and packaging choices that obscure llama.cpp dependencies and degrade performance can impose real infrastructure debt on practitioners. Infrastructure simplicity is its own design goal, not a fallback.
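A file-and-regex memory of this kind needs very little code. The sketch below is an assumption-laden illustration rather than zerostack's actual implementation: the function names `remember` and `search_memory`, the one-file-per-topic layout, and the bullet-per-note format are all hypothetical.

```python
import re
from pathlib import Path
from typing import List, Tuple


def remember(root: Path, topic: str, note: str) -> None:
    """Append a note as a Markdown bullet to <root>/<topic>.md."""
    path = root / f"{topic}.md"
    with path.open("a", encoding="utf-8") as f:
        f.write(f"- {note}\n")


def search_memory(root: Path, pattern: str) -> List[Tuple[str, int, str]]:
    """Return (file name, line number, line) for every line matching the regex."""
    rx = re.compile(pattern, re.IGNORECASE)
    hits: List[Tuple[str, int, str]] = []
    for path in sorted(root.glob("*.md")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if rx.search(line):
                hits.append((path.name, lineno, line.strip()))
    return hits
```

A call such as `search_memory(Path("memory"), r"postgres|sqlite")` scans every note file on each query, which is linear in the size of the memory but needs no index, no background process, and no embedding provider.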

Sources

What is Inference Engineering? · Gergely Orosz · The Pragmatic Engineer · Jun 21, 2026
How to Cut LLM Inference Costs with KV Caching · Robert Alvarez · Everpure Engineering · May 20, 2026
Maximizing LLM Efficiency: Granular-Prompt Caching with Pure KVA · Robert Alvarez, Jean-Baptiste Thomas · Everpure Engineering · May 20, 2026
20x Faster Inference with the First KV Cache for S3 and NFS · Robert Alvarez, Jean-Baptiste Thomas · Everpure Engineering · May 20, 2026
How We Built DigitalOcean Inference Router · Adil Hafeez · DigitalOcean · Jun 21, 2026
Arch-Router: Aligning LLM Routing with Human Preferences · Co Tran, Salman Paracha, Adil Hafeez, Shuguang Chen · arXiv · Jun 21, 2026
The AI Model Pricing War Is Here — And Your Margins Depend on Picking the Right Side · Ayush Chaturvedi · Superframeworks · May 31, 2026
Scaling Managed Agents: Decoupling the Brain from the Hands · Lance Martin, Gabe Cemaj, and Michael Cohen · Anthropic Engineering · Apr 27, 2026
AI Control Plane: Architecture and Vendors · Sagar Batchu · Speakeasy · May 09, 2026
No, MCP is Definitely Not Dead. The NSA Agrees. · Stephane Derosiaux · The Technical Executive · Jun 02, 2026
Latchkey: Credential Layer for Local AI Agents · Imbue · Jun 23, 2026
Designing Memory for zerostack: Plain Files, No Vector Store · Xavier · Xavier's Data Forge · Jun 11, 2026
Friends Don't Let Friends Use Ollama · Zetaphor · Sleeping Robots · May 05, 2026
He Came, He Saw, He Cooked · Ben Thompson · Stratechery · Apr 24, 2026
How to Build Scalable Web Apps with OpenAI's Privacy Filter · yuvraj sharma, Freddy Boulton, Abubakar Abid · Hugging Face · Apr 29, 2026
How to Choose Between Single- and Multi-Agent Solutions · Ben Dickson · AlphaSignal · May 03, 2026
Plurai · Product Hunt · Product Hunt · May 04, 2026
Build a Desktop Extension with MCPB · Anthropic · Anthropic · May 27, 2026
Memory design @ zerostack · Jun 11, 2026
SpaceX & the Sentient Sun · Marc Andreessen and Michael McGuiness · a16z · Jun 21, 2026
Building a Cloud · David Crawshaw · crawshaw.io · Jul 05, 2026
AI 2040: Plan A · Thomas Larsen, Romeo Dean, Brendan Halstead, Eli Lifland, Ryan Greenblatt, Daniel Kokotajlo · AI 2040 · Jul 09, 2026
