Preventative Steering has advantages over Inoculation Prompting

LessWrong

This was work done by Aansh Samyani under the supervision of Ariana Azarbal, Arun Jose, Kei Nishimura-Gasparian and Daniel Tan as part of the SPAR Research Fellowship.

TL;DR

We benchmarked Inoculation Prompting (IP) and Preventative Steering (PS) in 4 SFT settings. We found PS has the following advantages:

- PS often affords stronger undesired-trait suppression than IP.
- Models trained with PS appear to carry less conditional misalignment than IP-trained models.
- Using PS, we can cause models to learn de...

