Training Model to Predict Its Own Generalization: A Preliminary Study
tl;drWe study how well LLMs can be trained to answer questions like “what will happen if I am trained on examples like XYZ”, focusing on emergent misalignment and other cases of surprising generalization.We see signs of life on the less surprising forms of generalization (emergent misalignment, trivia preferences), where our method outperforms baselines like untrained in-context learning.However, we haven't validated it on truly hard-to-predict cases of generalization (bottlenecked by a large en...