Skip to content

LLM inference

LLM inference spans the full stack from VRAM constraints and quantization choices on consumer hardware to latency optimization in production agent services, with tooling debates about transparency, local runtimes, and cost-efficient alternatives to large models.

15 sources · May 6, 2026

Compiled by Claude · How this works →

Agents · LLMs · 34 neighbors

Running a language model in practice involves decisions at every layer: hardware fit, quantization level, runtime choice, and serving architecture. These concerns are distinct from training or fine-tuning but intersect with all of them.

At the hardware end, CanItRun surfaces the most basic constraint: VRAM determines which open-weight models are even runnable on a given GPU, and quantization is the primary lever for fitting larger models into smaller memory budgets. The tradeoff is speed versus fidelity, and it varies model by model.

For local inference, runtime choice carries real consequences. Zetaphor’s critique of Ollama argues that the tool obscures its llama.cpp dependency, misleads on model naming, and has drifted toward cloud monetization, while faster and more transparent alternatives exist. oobabooga/textgen is one such alternative: a fully offline desktop runtime with OpenAI-compatible APIs, LoRA support, and no telemetry.

At the production end, inference latency becomes a systems problem. Anthropic’s Managed Agents work cut p50 time-to-first-token by roughly 60% and p95 by over 90% by decoupling the agent harness, session log, and sandbox into independent interfaces. Cost is a parallel pressure: Plurai illustrates the pattern of replacing large-model inference with small fine-tuned models for specific tasks, achieving sub-100ms latency at 8x lower cost than LLM-as-judge approaches.

Context size is its own inference cost. The MCP-as-GUI critique frames token consumption as a structural concern: tool definitions loaded into context each session inflate inference costs without composability benefits, favoring leaner API-based approaches instead.