Eval-Awareness Steering Detects the Test, Not the Sabotage

LessWrong

Produced as part of independent research.

Huge thanks to Apollo Research (org) for open-sourcing the deception-detection harness, which proved to be foundational in this work. Prior work by Devbunova (2026), the Apollo/Goldowsky-Dill probing line, and Tice et al. on noise injection shaped the design throughout.

Summary

I test whether the internal "I'm being evaluated" direction in an open-weight model causally drives sandbagging (deliberate underperformance) or merely correlates with it. I reuse Apol...

