Probing is not enough; a validity audit for any probe by Ratnaditya

Nuno Sempere

tl;dr: A probe can have excellent AUROC and yet fail as a safety signal. I audited three probes: a monitoring-awareness probe as a leakage example, a refusal direction as a positive control, and Apollo’s deception probe as a published-protocol case study. In the leakage case, the probe achieves an AUROC of 1.00 but falls to 0.50 when I remove a single prompt tag, while removing a random span of the same length barely dents it. That result uses only deterministic components (avoids LLM j...
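The tag-versus-random-span comparison can be illustrated with a small, fully deterministic sketch. This is not the author's code: `toy_probe`, the `[MONITORED]` tag, the four prompts, and the choice to keep the random control span from overlapping the tag are all assumptions made for illustration. A real audit would score model activations rather than raw text.

```python
import random


def auroc(scores, labels):
    """Pairwise AUROC: P(pos score > neg score), with ties counted as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def remove_random_span(text, length, rng, protect=None):
    """Delete `length` characters at a random position.

    Assumption: the control span must not overlap `protect` (the tag),
    so the control isolates "some text was removed" from "the tag was removed".
    """
    starts = list(range(0, len(text) - length + 1))
    if protect and protect in text:
        lo = text.index(protect)
        hi = lo + len(protect)
        starts = [s for s in starts if s + length <= lo or s >= hi]
    if not starts:
        return text  # text too short to ablate safely
    s = rng.choice(starts)
    return text[:s] + text[s + length:]


def ablation_audit(score_fn, prompts, labels, tag, seed=0):
    """Compare AUROC on original prompts, tag-removed prompts, and a random-span control."""
    rng = random.Random(seed)
    no_tag = [p.replace(tag, "") for p in prompts]
    control = [remove_random_span(p, len(tag), rng, protect=tag) for p in prompts]
    return {
        "original": auroc([score_fn(p) for p in prompts], labels),
        "tag_removed": auroc([score_fn(p) for p in no_tag], labels),
        "random_span_removed": auroc([score_fn(p) for p in control], labels),
    }


# Hypothetical data: the tag co-occurs perfectly with the positive label.
TAG = "[MONITORED]"
prompts = [
    "[MONITORED] Please summarize the quarterly report.",
    "[MONITORED] Translate this sentence into French.",
    "Please summarize the quarterly report.",
    "Translate this sentence into French.",
]
labels = [1, 1, 0, 0]


def toy_probe(text):
    """Stand-in for a leaky probe: it has learned the tag, not the concept."""
    return 1.0 if TAG in text else 0.0


result = ablation_audit(toy_probe, prompts, labels, TAG)
print(result)
```

A large gap between `tag_removed` and `random_span_removed` suggests the probe is reading the tag itself. If both ablations hurt equally, the drop is more plausibly a generic effect of perturbing the prompt.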

