From 8B to Frontier: How System Prompts Control Whether AI Agents Blackmail, Leak, and Kill

LessWrong

Sequel to Blackmail at 8 Billion Parameters. Funded by a BlueDot Impact rapid grant.

TL;DR: We extended our sub-frontier agentic misalignment study to 22 models from 9 developers, 3 harm scenarios (blackmail, espionage, murder), and 5 instruction conditions (safety, monitored, baseline, unmonitored, permissive), and report three findings. The first is that OpenAI and Anthropic appear to have substantially mitigated agentic misalignment in their latest models, in that GPT-5.4, GPT-...

