Building Better Activation Oracles
Work done for our MATS 10.0 Sprint project, mentored by Neel Nanda and Adam Karvonen.

Huggingface, Github

TL;DR: We have improved on the original Activation Oracle (AO) training regime by training on on-policy rollouts, improving the conversational dataset, feeding in more layers (following the approach by Niclas Luick), and making a small change to the injection formula. We also open-source our evals, AObench, which we believe is currently the most comprehensive evaluation of AO quality. The capa...