Ensemble monitoring for AI control: diverse signals outweigh more compute

LessWrong

tl;dr: What improves LLM monitoring besides leveraging more compute? Leveraging more diverse compute. In the AI control setting of Greenblatt et al. (2024), we build 12 monitors via various prompting and fine-tuning approaches; we then study different ways to compose them into ensembles. We evaluate our ensembles on the APPS coding dataset, extended with model-written backdoors, and on BigCodeBench-Sabotage (BCB-S), an out-of-distribution dataset. Our best 3-monitor ensemble achieves 2.4× greater d...

