Iterative Finetuning is Mostly Idempotent

zroe1·LessWrong·Community·May 11, 2026

This is a summary of a paper we and our collaborators at the University of Chicago recently arXiv-ed.

tl;dr: We seed models with some property (e.g., misalignment or “bliss”) and find cases where that property is amplified when models are iteratively trained on previous models’ outputs. However, this phenomenon is pretty rare. Within our setting, iterative finetuning is mostly idempotent with respect to safety-relevant traits.

I. Introduction

If a model exhibits some trait (e.g., sycophancy, misal...


