SVD on Weight Differences for Model Auditing

LessWrong

TLDR: We introduce a method for auditing fine-tuned models by applying singular value decomposition (SVD) to the weight-difference matrices and truncating them to rank 1. Under certain circumstances, this seems to remove the red teaming and readily elicit hidden behaviours from the fine-tuning. We show a proof of concept with SOTA results on the models from AuditBench.

Introduction

The risk of models hiding misaligned behaviours becomes increasingly worrying as models become more capable. This is the motiv...
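To make the rank-1 reduction from the TLDR concrete, here is a minimal NumPy sketch on a toy matrix. It is an illustration under assumptions, not the authors' implementation: the function names `rank1_weight_diff` and `apply_rank1_diff` are hypothetical, and the choice to add the truncated difference back onto the base weights, one matrix at a time, is our guess at how the reduced model would be assembled.

```python
import numpy as np


def rank1_weight_diff(w_base: np.ndarray, w_finetuned: np.ndarray) -> np.ndarray:
    """Best rank-1 approximation (in Frobenius norm) of w_finetuned - w_base."""
    delta = w_finetuned - w_base
    # Thin SVD: delta = u @ diag(s) @ vt, singular values sorted descending.
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    # Keep only the top singular triplet.
    return s[0] * np.outer(u[:, 0], vt[0, :])


def apply_rank1_diff(w_base: np.ndarray, w_finetuned: np.ndarray) -> np.ndarray:
    """Base weights plus the rank-1 part of the fine-tuning update (assumed usage)."""
    return w_base + rank1_weight_diff(w_base, w_finetuned)


# Toy data: a fine-tuning update made of one dominant direction plus small noise.
rng = np.random.default_rng(0)
w_base = rng.normal(size=(6, 4))
dominant = 3.0 * np.outer(rng.normal(size=6), rng.normal(size=4))
noise = 0.01 * rng.normal(size=(6, 4))
w_finetuned = w_base + dominant + noise

delta_r1 = rank1_weight_diff(w_base, w_finetuned)
w_audit = apply_rank1_diff(w_base, w_finetuned)

# The truncation error is no larger than the noise we added.
residual = np.linalg.norm((w_finetuned - w_base) - delta_r1)
print(f"residual norm: {residual:.4f}, noise norm: {np.linalg.norm(noise):.4f}")
```

In a real model the same operation would presumably be repeated for each weight matrix that fine-tuning changed; which matrices to include is not stated in the excerpt above and is left open here.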

