Wiki: LLM engineering

LLM engineering is the practice of building software systems on top of large language models, from low-level training and inference optimization through agent architecture, harness design, and production observability. The sources here span that full range, and the recurring tension is between raw model capability and the engineering work needed to make that capability dependable.

On the training side, Unsloth offers custom CUDA kernels that deliver up to 30x faster fine-tuning and 90% less memory than FlashAttention 2, making local fine-tuning practical for teams without cloud-scale budgets. For teams that want domain-specific classifiers without hand-labeled data, BARRED generates synthetic training data through multi-agent debate to produce small models that outperform GPT-4.1 on custom policy enforcement at a fraction of the cost. Raiyan Yahya’s textbook works at the opposite level of abstraction, walking through every component of a decoder-only LLM from tokenizer to inference loop so engineers understand what they are actually running.

Inference optimization is increasingly its own discipline. Gergely Orosz’s breakdown of inference engineering covers quantization, speculative decoding, caching, and disaggregation as first-class concerns for production serving. Two pieces from Everpure make the case for treating the KV cache as a persistent shared asset: injecting it from fast storage via RDMA can cut prefill costs by 20x, and granular-prompt caching reduces time-to-first-token by segmenting prompts into reusable chunks. Routing is an adjacent concern: DigitalOcean’s Inference Router uses a 30B MoE model to match each request to the best-fit model for cost, latency, or quality, while Arch-Router achieves similar preference-aligned routing with a compact 1.5B model.
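The chunked caching idea can be illustrated with a toy sketch. This is not Everpure's implementation: `ChunkCache`, its `chunk_size`, and the in-memory dict standing in for a persistent KV store are all hypothetical, and the stored values are placeholders rather than real key/value tensors. The sketch keys each chunk by a hash of the whole prefix up to that chunk, so a chunk counts as reusable only when everything before it matches too.

```python
import hashlib


class ChunkCache:
    """Toy stand-in for a persistent KV-cache store keyed by prefix hash (hypothetical)."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.store = {}  # prefix hash -> placeholder for cached KV tensors

    def _chunks(self, tokens):
        # Split the token list into fixed-size chunks; the last one may be shorter.
        for i in range(0, len(tokens), self.chunk_size):
            yield tokens[i:i + self.chunk_size]

    def prefill(self, tokens):
        """Return (reused_chunks, computed_chunks) for one prompt."""
        reused = computed = 0
        running = hashlib.sha256()
        for chunk in self._chunks(tokens):
            # The running hash covers every token up to and including this chunk.
            running.update(" ".join(chunk).encode("utf-8") + b"\n")
            key = running.hexdigest()
            if key in self.store:
                reused += 1  # prefix seen before: KV could be loaded, not recomputed
            else:
                self.store[key] = f"kv-for-{len(chunk)}-tokens"
                computed += 1
        return reused, computed


cache = ChunkCache(chunk_size=2)
system = ["you", "are", "a", "helpful", "assistant", "."]
first = cache.prefill(system + ["what", "is", "rdma", "?"])
second = cache.prefill(system + ["explain", "kv", "cache", "."])
print(first, second)
```

In this toy run the second prompt reuses the three chunks that cover the shared system prompt and computes only its own two chunks, which is the mechanism behind the reduced time-to-first-token described above.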

Harness and agent architecture form the largest cluster of sources. Anthropic’s harness design writeup describes a GAN-inspired planner/generator/evaluator pattern for multi-hour autonomous coding sessions. The 12-factor agents project argues for unifying execution and business state in a single context-window-derived thread to simplify serialization, debugging, and recovery. Walkinglabs’ harness engineering course names five harness subsystems — instructions, state, verification, scope, and session lifecycle — as the infrastructure that turns unreliable model output into dependable results. LangChain’s observability piece adds that traces alone are insufficient; attaching feedback signals to traces is what creates a learning loop across model, harness, and context layers.
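A minimal sketch of how these ideas might fit together, assuming a planner/generator/evaluator loop that records everything in one append-only thread. It does not reproduce Anthropic's harness or the 12-factor agents code: `Thread`, `run_harness`, `max_rounds`, and the three stub functions are hypothetical names, and the stubs stand in for model calls.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class Thread:
    """Single append-only log holding both execution and business state (hypothetical)."""
    events: List[dict] = field(default_factory=list)

    def append(self, kind, payload):
        self.events.append({"kind": kind, "payload": payload})


def run_harness(task, planner, generator, evaluator, max_rounds=3):
    """Plan once, then for each step loop generate -> evaluate until accepted."""
    thread = Thread()
    thread.append("task", task)
    plan = planner(task)
    thread.append("plan", plan)
    for step in plan:
        feedback = None
        for _ in range(max_rounds):
            draft = generator(step, feedback)
            thread.append("draft", draft)
            ok, feedback = evaluator(step, draft)
            thread.append("verdict", {"ok": ok, "feedback": feedback})
            if ok:
                break
    return thread


# Stubs standing in for model calls (assumptions for illustration only).
def stub_planner(task):
    return ["a", "b"]


def stub_generator(step, feedback):
    # Produces a weak draft first, then a corrected one once feedback exists.
    return step.upper() if feedback else step


def stub_evaluator(step, draft):
    ok = draft.isupper()
    return ok, None if ok else "use upper case"


thread = run_harness("demo task", stub_planner, stub_generator, stub_evaluator)
print(json.dumps(thread.events, indent=2))
```

Because the thread is plain data, it can be serialized, replayed for debugging, or resumed after a crash. Evaluator verdicts stored next to the drafts are one simple form of the feedback-attached-to-trace idea.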

The retrieval and knowledge-management sources surface two alternatives to conventional vector-based RAG. PageIndex builds hierarchical tree indexes and uses LLM reasoning rather than vector similarity for retrieval, reaching 98.7% accuracy on FinanceBench. The Karpathy LLM wiki pattern, described in one practical guide and a builder’s retrospective, has the model compile and maintain structured Markdown files for cross-document synthesis. The sources present this as superior to RAG for curated research, with the caveat that it propagates hallucinations structurally if the lint step is skipped.
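Tree-based retrieval of this kind can be sketched as a walk from the root of a document outline down to a leaf, with a chooser deciding which branch to follow at each level. This is not the PageIndex API: `Node`, `retrieve`, and `keyword_chooser` are hypothetical, and the keyword-overlap chooser is only a placeholder for the LLM reasoning step, which in a real system would read the child titles and summaries and pick one.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One section of a document outline (hypothetical structure)."""
    title: str
    summary: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def keyword_chooser(question, candidates):
    """Placeholder for an LLM call: pick the child sharing the most words with the question."""
    q = set(question.lower().split())
    scores = [
        len(q & set((c.title + " " + c.summary).lower().split()))
        for c in candidates
    ]
    return scores.index(max(scores))


def retrieve(root, question, choose=keyword_chooser):
    """Descend the tree one level at a time; return the path taken and the leaf text."""
    node, path = root, [root.title]
    while node.children:
        node = node.children[choose(question, node.children)]
        path.append(node.title)
    return path, node.text


report = Node("Annual report", "full filing", children=[
    Node("Financial statements", "revenue income balance sheet", children=[
        Node("Income statement", "revenue and net income", text="(income statement text)"),
        Node("Balance sheet", "assets and liabilities", text="(balance sheet text)"),
    ]),
    Node("Risk factors", "market and regulatory risk", text="(risk factors text)"),
])

path, text = retrieve(report, "what was net income")
print(" > ".join(path))
```

The returned path doubles as an explanation of why a section was selected, which a flat similarity score over chunks does not provide.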

Several sources address where LLM engineering goes wrong. A benchmark on TLA+ generation finds near-perfect syntax but only ~46% conformance to actual implementations, showing models recite textbook patterns rather than faithfully modeling real systems. An Imbue experiment finds that AI review-fix pipelines cause weaker agents to overreach and break correct code. Claude Opus 4.7 benchmarking shows a non-monotonic reasoning curve where medium effort beats max on cost-efficiency, suggesting that more compute is not always the right dial to turn. And sycophancy research demonstrates that delusional belief spiraling can occur even in ideally rational users, a structural risk for any system that uses LLM feedback as ground truth.