Iterative Finetuning is Mostly Idempotent

LessWrong

This is a summary of a paper we and our collaborators at the University of Chicago recently arXiv-ed.

tl;dr: We seed models with some property (e.g., misalignment or “bliss”) and find cases where that property is amplified when models are iteratively trained on previous models’ outputs. However, this phenomenon is pretty rare. Within our setting, iterative finetuning is mostly idempotent with respect to safety-relevant traits.

I. Introduction

If a model exhibits some trait (e.g., sycophancy, misal...

