Can we use steering vectors to suppress reward-hacking? Somewhat

LessWrong

Can steering vectors drive gradient routing? Yes, but not in realistic reward-hacking environments, where they are not precise enough as classifiers of hacky versus clean solutions. Instead, can we use a steering vector to initialise adapters so that gradient routing happens without a classifier and we get automatic separation of hacky and clean gradients? Partly! This init approach suppressed 70% of hacking by absorbing gradients into the hacky-initialised adapter. This is not as good as the prior approach...
