Can we use steering vectors to suppress reward-hacking? Somewhat

wassname·LessWrong·Community·June 28, 2026

Can steering vectors drive gradient routing? Yes, but not in realistic reward-hacking environments: they are not precise enough classifiers of hacky vs clean solutions. Instead, can we use a steering vector to initialise adapters so that gradient routing happens without a classifier, and we get automatic separation of hacky and clean gradients? Partly! This init approach suppressed 70% of hacking by absorbing gradients into the hacky-initialised adapter. This is not as good as the prior approach...
