Probing is not enough; a validity audit for any probe by Ratnaditya

Ratnaditya · Nuno Sempere · Forecasting · June 29, 2026

tl;dr: A probe can have excellent AUROC and yet fail as a safety signal. I audited three probes: a monitoring-awareness probe as a leakage example, a refusal direction as a positive control, and Apollo’s deception probe as a published-protocol case study. In the leakage case, the probe achieves AUROC 1.00 but falls to 0.50 when I remove a single prompt tag, whereas removing a random span of the same length barely dents it. That result uses only deterministic components (avoids LLM j...
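To make the leakage check concrete, here is a minimal toy sketch of the tag-ablation comparison described above. It is not the author's code or data: the tag string `[MONITORED]`, the prompts, and the `toy_probe` function (a stand-in that fires only on the tag) are all hypothetical, and the control span is deliberately drawn from text outside the tag.

```python
import random

TAG = "[MONITORED]"  # hypothetical leaked prompt tag, not taken from the article


def auroc(scores, labels):
    """Rank-based AUROC: chance a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def toy_probe(prompt):
    """Stand-in for a probe that has latched onto the tag rather than the concept."""
    return 1.0 if TAG in prompt else 0.0


def remove_tag(prompt):
    """Targeted ablation: delete the suspected leakage feature."""
    return prompt.replace(TAG, "")


def remove_random_span(prompt, rng, length=len(TAG)):
    """Control ablation: delete a same-length span that does not overlap the tag."""
    tag_at = prompt.find(TAG)
    starts = [
        i
        for i in range(len(prompt) - length + 1)
        if tag_at < 0 or i + length <= tag_at or i >= tag_at + len(TAG)
    ]
    i = rng.choice(starts)
    return prompt[:i] + prompt[i + length:]


def audit(seed=0):
    rng = random.Random(seed)
    bodies = [f"Please summarise quarterly report number {k} for the team." for k in range(10)]
    prompts = [f"{TAG} {b}" for b in bodies] + bodies  # tagged positives, untagged negatives
    labels = [1] * len(bodies) + [0] * len(bodies)

    def score(transform):
        return auroc([toy_probe(transform(p)) for p in prompts], labels)

    return {
        "baseline": score(lambda p: p),
        "tag_removed": score(remove_tag),
        "random_span_removed": score(lambda p: remove_random_span(p, rng)),
    }


results = audit()
print(results)
```

Because the toy probe is a hard tag detector, the control ablation leaves its AUROC untouched while the targeted ablation drops it to chance. A real probe reads model activations, so both ablations would require re-running the model on the edited prompts, and the control would typically cost a little rather than nothing.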
