[Linkpost] Evals for “SPI-incompatible” behavior & reasoning: Guide to initial research

LessWrong

In Part I of CLR's safe Pareto improvements (SPI) agenda, we gave our high-level strategy for evaluating models for SPI-incompatible behavior and reasoning. This guide gives more details on how I’m thinking about executing on this strategy, especially: the kind of workflow I think we should use, to start out; next steps building on what I’ve tried so far; and my rough sense of what counts as unambiguously bad “SPI-incompatibilities”. If you’re interested in collaborating on the next steps, please ge...

