SVD on Weight Differences for Model Auditing

LessWrong

TLDR: We introduce a method for auditing fine-tuned models by applying singular value decomposition (SVD) to the weight difference matrices and reducing them to rank 1. Under certain circumstances, this seems to remove the red teaming and readily elicit the hidden behaviours in the fine-tuning. We show a proof of concept with state-of-the-art (SOTA) results on the models from AuditBench.

Introduction

The risk of models hiding misaligned behaviours becomes increasingly worrying as models become more capable. This is the motiv...
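
As a rough illustration of the rank-1 reduction mentioned in the TLDR, the sketch below takes the difference between a base and a fine-tuned weight matrix, computes its SVD, and keeps only the top singular component. The function name `rank1_weight_diff`, the toy matrix shapes, the synthetic "fine-tuning" update, and the step of adding the rank-1 difference back onto the base weights are all assumptions made for illustration. The post's actual pipeline may differ.

```python
import numpy as np


def rank1_weight_diff(w_base, w_finetuned):
    """Best rank-1 approximation (in Frobenius norm) of w_finetuned - w_base."""
    delta = w_finetuned - w_base
    # Thin SVD: delta = u @ diag(s) @ vt, singular values sorted descending.
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    # Keep only the leading singular triple.
    return s[0] * np.outer(u[:, 0], vt[0, :])


# Toy data (hypothetical): a base matrix plus an update that is
# mostly rank 1 with a small amount of noise on top.
rng = np.random.default_rng(0)
w_base = rng.normal(size=(8, 6))
a = rng.normal(size=(8, 1))
b = rng.normal(size=(1, 6))
w_finetuned = w_base + a @ b + 0.01 * rng.normal(size=(8, 6))

delta_r1 = rank1_weight_diff(w_base, w_finetuned)

# Assumed usage: rebuild a model weight from the base plus the rank-1 update.
w_audit = w_base + delta_r1
```

In a real model this would presumably be repeated for every fine-tuned weight matrix. Which layers to include and how the reduced model is then queried are not specified in this excerpt.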

