Probing is not enough; a validity audit for any probe by Ratnaditya
tl;dr: A probe can have excellent AUROC and still fail as a safety signal. I audited three probes: a monitoring-awareness probe as a leakage example, a refusal direction as a positive control, and Apollo’s deception probe as a published-protocol case study. In the leakage case, the probe achieves AUROC 1.00 but falls to 0.50 when I remove a single prompt tag, while removing a random span of the same length barely dents it. That result uses only deterministic components (avoids LLM j...
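To make the tag-ablation check concrete, here is a minimal sketch of how such an audit could be wired up. It is not the author's code. The tag string `[MONITORED]`, the toy prompts, the helper names, and `leaky_probe` (a stand-in that scores text directly instead of reading model activations) are all hypothetical. One more assumption: the random control span is sampled so that it never overlaps the tag, which keeps the control clean. The toy numbers come out as they do by construction and do not reproduce any result from the post.

```python
import random
from typing import Callable, Dict, List, Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Rank-based AUROC (Mann-Whitney U); tied pairs count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def remove_tag(text: str, tag: str) -> str:
    """Targeted ablation: delete every occurrence of the suspected leak."""
    return text.replace(tag, "")


def remove_random_span(text: str, tag: str, rng: random.Random) -> str:
    """Control ablation: delete len(tag) characters that do not overlap the tag."""
    n = len(tag)
    tag_at = text.find(tag)  # -1 when the tag is absent
    starts = [
        i for i in range(len(text) - n + 1)
        if tag_at < 0 or i + n <= tag_at or i >= tag_at + n
    ]
    if not starts:  # text too short to cut a clean span; leave it unchanged
        return text
    i = rng.choice(starts)
    return text[:i] + text[i + n:]


def ablation_audit(
    probe: Callable[[str], float],
    texts: List[str],
    labels: List[int],
    tag: str,
    seed: int = 0,
) -> Dict[str, float]:
    """Compare AUROC on raw inputs, tag-ablated inputs, and a same-length control."""
    rng = random.Random(seed)  # seeded so the audit is deterministic

    def score(inputs: List[str]) -> float:
        return auroc([probe(t) for t in inputs], labels)

    return {
        "baseline": score(texts),
        "tag_removed": score([remove_tag(t, tag) for t in texts]),
        "random_span_removed": score([remove_random_span(t, tag, rng) for t in texts]),
    }


# Hypothetical toy data: the label is perfectly correlated with a prompt tag.
TAG = "[MONITORED]"
texts = [
    "[MONITORED] Please summarise this report.",
    "[MONITORED] What is the capital of France?",
    "Please summarise this report.",
    "What is the capital of France?",
]
labels = [1, 1, 0, 0]


def leaky_probe(text: str) -> float:
    """Stand-in probe that has learned nothing except the tag."""
    return 1.0 if TAG in text else 0.0


report = ablation_audit(leaky_probe, texts, labels, TAG)
print(report)
```

On this toy set the stand-in probe scores 1.0 at baseline, 0.5 once the tag is removed, and 1.0 under the control ablation. A large drop under the targeted ablation paired with a negligible drop under the same-length control is the pattern that points to leakage. With a real probe, `probe` would wrap a forward pass plus the probe's readout, but the comparison would keep the same shape.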