(Mis)generalization of Helpful-Only Fine-tuning
TLDR: We study the shortcomings of existing helpful-only models. We find that some show emergent misalignment, others have residual refusal behaviors, and most show poor steerability, sycophancy, and incoherent character. None of these problems are a necessary consequence of helpful-only training, though: we show that synthetic document fine-tuning and adding character-related questions to SFT and RL can mitigate them.

Research done as part of MATS/Anthropic Fellows Program. See here for the full p...