SFT Drives Gemini’s Safety Properties

Alignment Forum

This is the third in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The second post can be found here. In this short post, we describe a surprising finding: most safety-relevant properties in Gemini seem to be caused by the combination of pretraining and SFT, not other training stages like RL. We do not want to overstate this claim as applying to other model families, and we also note that this may chang...
