Wiki: LLM tooling

LLM tooling refers to the growing layer of software that sits between raw model APIs and working applications: runtimes for local inference, servers that expose context to models, utilities for compressing or structuring knowledge, and packaging formats that make integrations distributable.

Local inference runtimes represent one axis of this ecosystem. oobabooga/textgen provides a fully offline desktop environment supporting GGUF/llama.cpp backends, an OpenAI-compatible API, tool-calling, LoRA fine-tuning, and MCP server integration. Meeting assistant Helply supports both cloud and local backends including Ollama and LM Studio, illustrating how local and hosted inference are increasingly treated as interchangeable. A critical read on Ollama argues that Ollama’s opacity around its llama.cpp dependency, inferior inference performance, and VC-driven cloud pivot make it a poor foundation for serious local setups.

Context management and knowledge organization form another major layer. The LostWarrior/knowledge-base bash CLI organizes project context as tiered markdown files with a machine-readable manifest, letting agents navigate without burning excess tokens. A Reddit guide to Karpathy’s LLM wiki pattern extends this: the model itself builds and maintains structured Markdown, queried at scale without RAG. Headroom attacks the same token budget problem from the output side, compressing tool outputs and RAG chunks by 60-95% before they reach the model.

MCP servers have become a common integration primitive. WaveScope uses an MCP server to deliver wavelet-transformed code context to models without language-specific parsers. Mintlify surfaces documentation to both humans and LLMs via MCP and llms.txt. Anthropic’s MCPB format packages local MCP servers as single-click bundles for Claude Desktop, lowering the distribution friction for tooling authors. The Databricks ai-dev-kit ties several of these threads together with an MCP server, markdown skills, and a Python core library supporting multiple AI coding assistants.

Security and economics round out the picture. Running Claude Code inside Docker is advocated as a containment practice to prevent credential leaks when operating agentic tools in auto-approve mode. On pricing, a Superframeworks analysis notes that a 75x gap between the cheapest and most expensive frontier models now makes provider-agnostic architecture a financial necessity, not just a design preference.