SVD on Weight Differences for Model Auditing

·LessWrong··

TLDR: We introduce a method for auditing fine-tuned models by using singular value decomposition (SVD) on the weight difference matrices and reducing them to rank-1. Under certain circumstances, this seems to remove the red teaming and readily elicit hidden behaviours in the fine-tuning. We show proof of concept with SOTA results on the models from AuditBench.IntroductionThe risk of models hiding misaligned behaviours becomes increasingly worrying as models become more capable. This is the motiv...

Read full article →

Related Articles

Training Model to Predict Its Own Generalization: A Preliminary Study
Tianyi (Alex) Qiu · LessWrong · 3d ago
A Theoretical Game of Attacks via Compositional Skills
Xinbo Wu, Huan Zhang, Abhishek Umrawal, Lav R. Varshney · ArXiv cs.CL · 3d ago
BioVeil MATRIX: Uncovering and categorizing vulnerabilities of agentic biological AI scientists
Kimon Antonios Provatas, Avery Self, Ioannis Mouratidis, Ilias Georgakopoulos-Soares · ArXiv q-bio · 3d ago
Irretrievability; or, Murphy's Curse of Oneshotness upon ASI
Eliezer Yudkowsky · LessWrong · 4d ago
Verbalized Eval Awareness Inflates Measured Safety
Santiago Aranguri · LessWrong · 4d ago