One axis and two features: how I solved the first puzzle from BlueDot and how a classifier hid country on the food direction

IgorPereverzevDev·LessWrong·Community·July 3, 2026

In this post I walk through the first Technical AI Safety puzzle from BlueDot and explain why linear probes would have missed the most interesting parts. Model interpretability has a kind of paradox: the interesting result is the thing you didn't think to look for, and the only reason you find it is that you kept asking "What else could this be?" and "How else can this be investigated?" For me, it was the discovery that a small text classifier packed two completely independent features onto one direction in ac...
