One axis, two features: how I solved the first puzzle from BlueDot and how a classifier hid country on the food direction

LessWrong

In this post I walk through the first Technical AI Safety puzzle from BlueDot and explain why linear probes would have missed the most interesting parts. In model interpretability you can run into a kind of paradox: the interesting thing is the one you didn't think to look for, and the only reason you find it is that you kept asking "What else could this be?" and "How else can this be investigated?" For me, it was the discovery that a small text classifier packed two completely independent features onto one direction in ac...

