LLM Engineering

The practical discipline of building, evaluating, and operating systems that use large language models, spanning knowledge architecture, agent control flow, inference optimization, and the human and organizational costs of getting it wrong.

20 sources · May 20, 2026

Compiled by Claude

LLM engineering is the set of practices for designing, building, and operating software systems centered on large language models. The sources here span several interrelated concerns: how to structure knowledge for LLM consumption, how to wire agent behavior reliably, how to manage inference costs, and what happens when the discipline is applied carelessly.

On knowledge architecture, Andrej Karpathy’s LLM-compiled wiki pattern treats the model as a document maintainer rather than a retrieval target. A practical walkthrough describes ingesting raw documents, having the model build structured Markdown files, and querying at scale without RAG. A weekend build of the same concept found that cross-document synthesis genuinely outperforms RAG for curated research, but hallucinations baked in at ingest propagate structurally, making lint and health-check steps non-negotiable. PageIndex takes a complementary approach: it has LLMs build reasoning-based document indexes in place of embedding-vector search.
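The lint and health-check step can be illustrated with a minimal sketch. It assumes conventions that the sources do not specify: pages are held as a mapping from page name to Markdown text, internal links use a `[[Page Name]]` form, and claims are traced with a `[source: id]` tag. The function name `lint_wiki` and both patterns are hypothetical. A check like this only catches structural problems (dangling links, untraceable pages); it cannot detect a hallucination that cites a real source.

```python
import re

# Hypothetical conventions for this sketch: [[Page Name]] links, [source: id] tags.
WIKI_LINK = re.compile(r"\[\[([^\]|#]+)")
SOURCE_TAG = re.compile(r"\[source:\s*([\w.-]+)\]")


def lint_wiki(pages, known_sources):
    """Return a sorted list of (page_name, problem) tuples.

    pages: dict mapping page name -> Markdown text
    known_sources: set of source ids that were actually ingested
    """
    problems = []
    for name, text in pages.items():
        # Every internal link must point at a page that exists.
        for target in WIKI_LINK.findall(text):
            if target.strip() not in pages:
                problems.append((name, f"broken link: {target.strip()}"))
        # Every page must cite at least one source, and only ingested ones.
        cited = SOURCE_TAG.findall(text)
        if not cited:
            problems.append((name, "no source citation"))
        for source_id in cited:
            if source_id not in known_sources:
                problems.append((name, f"unknown source: {source_id}"))
    return sorted(problems)
```

Running such a check after every ingest, and failing the build when the list is non-empty, is one way to stop a bad page from being linked into the rest of the wiki.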

Agent reliability is a recurring theme. Brian Suh argues that prompt chains are non-deterministic and unverifiable at scale; reliable agents need deterministic control flow with explicit state transitions. The 12-factor-agents project reinforces this: Factor 5 recommends unifying execution state and business state into a single context-window-derived thread to simplify debugging and recovery. Anthropic’s harness design post describes a GAN-inspired planner-generator-evaluator architecture for multi-hour autonomous coding runs. LangChain’s Harrison Chase adds that traces alone are insufficient; attaching feedback signals is what turns observability into a learning loop.
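As a rough illustration of deterministic control flow with explicit state transitions, the sketch below keeps a single append-only event list that serves as both execution state and business state, in the spirit of Factor 5. The state names, the `Thread` class, and the `run` loop are assumptions made for this example and are not taken from the cited sources; `step_fn` stands in for whatever model or tool call a real agent would make.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    PLAN = "plan"
    ACT = "act"
    VERIFY = "verify"
    DONE = "done"
    FAILED = "failed"


# The only transitions the agent is allowed to make (hypothetical workflow).
TRANSITIONS = {
    State.PLAN: {State.ACT, State.FAILED},
    State.ACT: {State.VERIFY, State.FAILED},
    State.VERIFY: {State.ACT, State.DONE, State.FAILED},
}


@dataclass
class Thread:
    """One record of everything that happened, usable for replay and recovery."""
    state: State = State.PLAN
    events: list = field(default_factory=list)

    def transition(self, new_state, payload):
        # Reject any move the transition table does not list.
        if new_state not in TRANSITIONS.get(self.state, set()):
            raise ValueError(
                f"illegal transition {self.state.name} -> {new_state.name}"
            )
        self.events.append(
            {"from": self.state.value, "to": new_state.value, "payload": payload}
        )
        self.state = new_state


def run(thread, step_fn, max_steps=10):
    """Drive the thread until it reaches a terminal state or the step budget."""
    for _ in range(max_steps):
        if thread.state in (State.DONE, State.FAILED):
            break
        new_state, payload = step_fn(thread)  # the model proposes, the table decides
        thread.transition(new_state, payload)
    return thread
```

The model's output only ever proposes the next state; the transition table decides whether that move is legal, so an unexpected completion surfaces as an exception instead of silently steering the run.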

Inference cost is a concrete engineering constraint. Caching key-value (KV) tensors for repeated prompt prefixes can cut time-to-first-token by up to 20x: the server hashes the prompt prefix and, on a match, injects the cached tensors instead of recomputing them. Granular-prompt caching extends this by segmenting prompts into reusable checkpoints so only token deltas are processed. Separately, benchmarking Claude Opus 4.7 across reasoning-effort levels found a non-monotonic curve: medium effort won on pass rate and code-review quality while higher settings cost more without improving outcomes.
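The prefix hashing and checkpointing ideas can be sketched as a toy lookup table. This is not how any particular inference server implements it: the block size, the `PrefixCache` class, and the string placeholders standing in for KV tensors are all assumptions for illustration, and tokens here are plain strings where a real system would use token ids.

```python
import hashlib

BLOCK = 4  # tokens per checkpoint; an arbitrary value chosen for this sketch


def prefix_key(tokens):
    """Hash a token prefix into a cache key."""
    joined = "\x1f".join(tokens)  # unit separator avoids accidental collisions
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


class PrefixCache:
    def __init__(self):
        self._store = {}  # key -> placeholder for the cached KV tensors

    def longest_hit(self, tokens):
        """Return how many leading tokens are already covered by the cache."""
        usable = len(tokens) - len(tokens) % BLOCK
        # Try the longest block-aligned prefix first, then shorter ones.
        for end in range(usable, 0, -BLOCK):
            if prefix_key(tokens[:end]) in self._store:
                return end
        return 0

    def insert(self, tokens):
        """Record a checkpoint at every block boundary of this prompt."""
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            self._store.setdefault(prefix_key(tokens[:end]), f"kv[0:{end}]")


def tokens_to_process(cache, tokens):
    """Return only the token delta that still needs a forward pass."""
    hit = cache.longest_hit(tokens)
    cache.insert(tokens)
    return tokens[hit:]
```

With a shared nine-token system prompt, a second request would reuse the checkpoint at token 8 and process only the remainder. Smaller blocks raise the reuse rate at the cost of more cache entries.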

The discipline also carries human costs. Vibe coding without review risks skill atrophy and compounding errors in safety-critical systems. Val Town’s Slow Mode proposal trades short-term productivity for long-term understanding by keeping humans involved at each agent step. Christoph Spörk’s lobster essay frames institutional dependency on LLMs as a slow-burn risk: eroded internal knowledge combined with a potential cost shock from token price surges.
