Wiki: LLM inference

LLM inference is the process of running a trained language model to generate output given an input prompt. At the hardware level, the bottleneck is usually VRAM: a model’s weights, the KV cache, and activation overhead must all fit in the memory of the available GPU. Tools like CanItRun make this concrete, calculating compatible quantization levels and estimated tokens-per-second for a given GPU and model combination.
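The memory budget described above can be sketched as a back-of-the-envelope calculation. This is a minimal sketch, not CanItRun's method: the function names are hypothetical, the KV cache term is the usual two-tensors-per-layer estimate, and the 10% overhead allowance is an assumed placeholder rather than a measured figure.

```python
GIB = 1024 ** 3


def estimate_vram_gib(
    n_params: float,
    bits_per_weight: float,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    kv_bytes_per_value: int = 2,      # assumes a 16-bit KV cache
    overhead_fraction: float = 0.10,  # assumed flat allowance for activations etc.
) -> dict:
    """Rough VRAM estimate: weights + KV cache + a flat overhead allowance."""
    weight_bytes = n_params * bits_per_weight / 8
    # K and V each hold n_layers * n_kv_heads * head_dim values per cached token.
    kv_cache_bytes = (
        2 * n_layers * n_kv_heads * head_dim * context_len * kv_bytes_per_value
    )
    overhead_bytes = overhead_fraction * (weight_bytes + kv_cache_bytes)
    total_bytes = weight_bytes + kv_cache_bytes + overhead_bytes
    return {
        "weights_gib": weight_bytes / GIB,
        "kv_cache_gib": kv_cache_bytes / GIB,
        "overhead_gib": overhead_bytes / GIB,
        "total_gib": total_bytes / GIB,
    }


def fits(estimate: dict, gpu_vram_gib: float) -> bool:
    """True if the estimated total fits in the given amount of VRAM."""
    return estimate["total_gib"] <= gpu_vram_gib


# Illustrative shape: 8B parameters at 4 bits per weight, 8K context.
example = estimate_vram_gib(
    n_params=8e9, bits_per_weight=4,
    n_layers=32, n_kv_heads=8, head_dim=128, context_len=8192,
)
```

For this illustrative shape the weights come to about 3.7 GiB and the KV cache to exactly 1 GiB, so the context length is a meaningful part of the budget and not just the parameter count.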

Quantization is one of the primary levers for fitting larger models into constrained hardware. Unsloth reports up to 30x faster throughput and 90% less memory than FlashAttention 2 from its custom kernels, and supports FP8 and LoRA workflows. oobabooga/textgen exposes multiple backends, including GGUF/llama.cpp, for fully offline local serving. Friends Don’t Let Friends Use Ollama is more critical: it argues that Ollama, while popular, delivers worse inference performance than using llama.cpp directly and obscures that dependency behind a proprietary layer.
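To make the idea of quantization concrete, here is a minimal sketch of symmetric absmax int8 quantization on a list of weights. It is an illustration only, not the scheme used by Unsloth or by GGUF/llama.cpp, whose formats are more elaborate (for example, scales stored per block of weights). The function names are hypothetical.

```python
def quantize_absmax_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale would do.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.01, 1.0]
codes, scale = quantize_absmax_int8(weights)
restored = dequantize(codes, scale)
# Worst-case rounding error is about half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight now needs 8 bits instead of 16 or 32, at the cost of a rounding error bounded by roughly half the scale. Lower bit widths shrink memory further but widen that error, which is the trade-off that quantization levels expose.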

At the serving level, the KV cache is the most consequential optimization target. Recomputing attention states on every request is expensive; persisting and reusing them is far cheaper. Everpure’s engineering posts describe two complementary approaches: injecting cached attention states from fast NFS/S3 storage via RDMA, for a claimed speedup of up to 20x, and granular-prompt caching, which segments prompts into reusable chunks so that only changed tokens are processed. A related piece on KV caching strategy frames the cache as a shared data asset that can cut prefill costs by up to 20x in enterprise deployments.
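The chunk-reuse idea can be sketched with a toy cache. This is a minimal sketch under stated assumptions, not Everpure's implementation: the class name is hypothetical, whitespace-split words stand in for tokens, a placeholder string stands in for real attention states, and each chunk is keyed by a hash chained over everything before it, because attention states depend on the preceding context.

```python
import hashlib


class ChunkedPrefixCache:
    """Toy cache that reuses per-chunk state for a shared prompt prefix."""

    def __init__(self, chunk_size: int = 4):
        self.chunk_size = chunk_size
        self.store: dict[str, str] = {}  # chained hash -> placeholder KV state

    def process(self, tokens: list[str]) -> tuple[int, int]:
        """Return (tokens_reused, tokens_computed) for one request."""
        reused = computed = 0
        parent_key = ""
        for start in range(0, len(tokens), self.chunk_size):
            chunk = tokens[start:start + self.chunk_size]
            # The key covers this chunk and, via parent_key, the whole prefix.
            material = parent_key + "|" + " ".join(chunk)
            key = hashlib.sha256(material.encode("utf-8")).hexdigest()
            if key in self.store:
                reused += len(chunk)
            else:
                # Stand-in for running prefill on this chunk and saving the result.
                self.store[key] = f"kv-state-for-{len(chunk)}-tokens"
                computed += len(chunk)
            parent_key = key
        return reused, computed


cache = ChunkedPrefixCache(chunk_size=4)
first = cache.process("a b c d e f g h i j k l".split())
second = cache.process("a b c d e f g h X j k l".split())
```

In this sketch the first request computes all 12 tokens, while the second reuses the two unchanged leading chunks and recomputes only the chunk containing the edit. Because keys are chained, a change early in the prompt invalidates every later chunk, which is why stable content such as system prompts is usually worth placing first.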

Token-level compression is a related but distinct approach. headroom compresses tool outputs and RAG chunks before they reach the model, claiming 60-95% token reduction. A skeptical counterpoint on RTK’s token compression claims argues that compression metrics without task-accuracy benchmarks are vanity numbers and that stripping content risks silent data loss in agent pipelines.
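The skeptical point can be illustrated with a small check that reports token reduction alongside what was lost. This is a minimal sketch with hypothetical function names and made-up sample data. Whitespace-split words are a crude stand-in for real tokenizer counts, and a substring check is a much weaker test than a task-accuracy benchmark.

```python
def token_reduction(original: str, compressed: str) -> float:
    """Fraction of whitespace-delimited tokens removed by compression."""
    n_original = len(original.split())
    n_compressed = len(compressed.split())
    return 1 - n_compressed / n_original if n_original else 0.0


def missing_facts(compressed: str, required: list[str]) -> list[str]:
    """Return the required strings that no longer appear after compression."""
    return [fact for fact in required if fact not in compressed]


# Made-up tool output and an aggressive compression of it.
tool_output = (
    "status: ok id: 4417 retries: 3 region: eu-west-1 "
    "note: transient timeout on first attempt"
)
compressed_output = "status: ok id: 4417"

reduction = token_reduction(tool_output, compressed_output)
lost = missing_facts(compressed_output, ["4417", "retries: 3"])
```

Here the compression removes roughly 71% of the tokens, which looks good as a headline number, but `lost` shows that the retry count the downstream agent needed is gone. Reporting both figures together is the minimum needed to tell useful compression from silent data loss.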

At the API and routing layer, inference is increasingly a dispatch problem. DigitalOcean’s Inference Router uses a 30B MoE model to match each request to the best-fit model for cost, latency, or quality. Arch-Router does comparable preference-aligned routing with a compact 1.5B model trained on human preferences, and it requires no retraining when new models are added. The AI model pricing war adds economic urgency: a 75x spread between the cheapest and most expensive frontier APIs means that routing and provider-agnostic architecture directly determine margin.
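The dispatch problem can be sketched with a simple rule-based router. This is a stand-in for illustration only: the routers named above use learned models, whereas this sketch just picks the cheapest option that meets the stated constraints. The model names, prices, latencies, and quality scores are all hypothetical, with prices chosen so that the spread between cheapest and most expensive is 75x.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str
    usd_per_million_tokens: float
    p50_latency_ms: float
    quality: float  # hypothetical score in [0, 1]


# Hypothetical catalog; 15.00 / 0.20 gives a 75x price spread.
CATALOG = [
    ModelOption("small-fast", 0.20, 300, 0.60),
    ModelOption("mid-tier", 3.00, 800, 0.80),
    ModelOption("frontier", 15.00, 2000, 0.95),
]


def route(
    min_quality: float,
    max_latency_ms: float = float("inf"),
    catalog: list[ModelOption] = CATALOG,
) -> ModelOption:
    """Pick the cheapest model that satisfies the quality and latency limits."""
    eligible = [
        m for m in catalog
        if m.quality >= min_quality and m.p50_latency_ms <= max_latency_ms
    ]
    if not eligible:
        raise ValueError("no model satisfies the requested constraints")
    return min(eligible, key=lambda m: m.usd_per_million_tokens)


easy_request = route(min_quality=0.5)
hard_request = route(min_quality=0.9)
```

Keeping the catalog as data rather than hard-coding a provider is what makes the architecture provider-agnostic: adding or repricing a model changes one entry, not the calling code.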

Inference Engineering as a discipline spans these techniques and more: quantization, speculative decoding, caching, parallelism, and disaggregation. Reasoning budget also matters: a benchmark of Claude Opus 4.7 across five effort levels found a non-monotonic curve in which medium effort outperformed higher settings on both quality and cost, suggesting that more compute at inference time is not always better.