Iterative Finetuning is Mostly Idempotent
This is a summary of a paper we and our collaborators at the University of Chicago recently arXiv-ed.

tl;dr: We seed models with some property (e.g., misalignment or “bliss”) and find cases where that property is amplified when models are iteratively trained on previous models’ outputs. However, this phenomenon is pretty rare. Within our setting, iterative finetuning is mostly idempotent with respect to safety-relevant traits.

I. Introduction

If a model exhibits some trait (e.g., sycophancy, misal...