Wiki: AI safety

The safety concerns surrounding AI systems do not reduce to a single problem. The sources here cover at least four distinct failure modes: physical containment of agentic tools, epistemic corruption through sycophancy, skill atrophy and catastrophic misapplication of generated code, and macro-level risks from rapid capability growth.

At the infrastructure level, the immediate concern is containment. cekrem documents how running an autonomous coding agent outside a sandbox exposes credentials and production data to accidental destruction. Simon Willison makes the same point more sharply: the same resourcefulness that lets Claude Fable invent elaborate workarounds to debug a two-line CSS fix is precisely what makes unsandboxed agents dangerous. Security-oriented use of agents cuts the other way too — Cloudflare’s Project Glasswing deploys multi-agent harnesses specifically to discover vulnerabilities, which only works safely when the harness itself is controlled.

At the epistemic level, Chandra et al. show through a Bayesian model that sycophantic chatbots cause delusional belief spiraling even in ideally rational users, and that transparency about sycophancy does not fully prevent the effect. Separately, Emphere Engineering argues that security tools must be tested to fail loudly rather than overclaim — a principle that applies equally to any AI system making consequential assertions.

Code generation introduces a different vector. Abednego Gomes argues that shipping AI-generated code without review causes skill atrophy and is categorically incompatible with safety-critical systems like flight control or nuclear infrastructure. One partial answer is better policy enforcement: Nir Diamant describes the BARRED framework, which uses multi-agent debate to generate synthetic training data and fine-tune small classifiers that outperform GPT-4.1 on custom policy tasks at lower cost.

At the macro level, Woodruff et al. measure frontier model capability doubling roughly every year since 2019, with safety implications for chain-of-thought monitoring as models grow able to complete longer tasks without visible reasoning steps. AI 2040 proposes delaying superintelligence through coordinated international agreements, research transparency, and mutually assured compute destruction to avoid extinction or authoritarian power concentration — a maximalist policy framing that stands in contrast to the operational and epistemic mitigations the other sources describe.

Sources

If You're Running Claude Code, PLEASE Run It in a BoxcekremMay 18, 2026
Claude Fable is relentlessly proactiveSimon WillisonSimon Willison's Weblog · Jun 13, 2026
Project Glasswing: what Mythos showed usGrant BourzikasCloudflare Blog · May 18, 2026
Sycophantic Chatbots Cause Delusional Spiraling, Even in Ideal BayesiansKartik Chandra, Max Kleiman-Weiner, Jonathan Ragan-Kelley, Joshua B. TenenbaumarXiv · May 03, 2026
Testing a Security Tool Like It Can Hurt PeopleEmphere EngineeringEmphere · Jun 11, 2026
The Perils of "AI" to the Software Engineering ProfessionAbednego GomesMay 14, 2026
Vibe Training: Auto Train a Small Language Model for Your Use CaseNir DiamantDiamantAI · Apr 28, 2026
Estimating No-CoT Task-Completion Time Horizons of Frontier AI ModelsAnders Cairns Woodruff et al.LessWrong · Jun 10, 2026
AI 2040: Plan AThomas Larsen, Romeo Dean, Brendan Halstead, Eli Lifland, Ryan Greenblatt, Daniel KokotajloAI 2040 · Jul 09, 2026
The AI Layoff TrapBrett Hemenway Falk; Gerry TsoukalasarXiv · May 02, 2026
Apocalypse NoScott GallowayProf G Media · May 08, 2026
If LLMs Have Human-Like Attributes, Then So Does Age of Empires IIAdrian de WynterarXiv · Jun 20, 2026

view in /reading →