(Mis)generalization of Helpful-Only Fine-tuning

Omar Khursheed · LessWrong · Community · June 4, 2026

TLDR: We study the shortcomings of existing helpful-only models. We find that some show emergent misalignment, others have residual refusal behaviors, and most show poor steerability, sycophancy, and incoherent character. None of these problems are a necessary consequence of helpful-only training, though: we show that synthetic document fine-tuning and adding character-related questions to SFT and RL can mitigate them.

Research done as part of MATS/Anthropic Fellows Program. See here for the full p...


