Training a Model to Predict Its Own Generalization: A Preliminary Study

LessWrong

tl;dr: We study how well LLMs can be trained to answer questions like “what will happen if I am trained on examples like XYZ”, focusing on emergent misalignment and other cases of surprising generalization. We see signs of life on the less surprising forms of generalization (emergent misalignment, trivia preferences), where our method outperforms baselines like untrained in-context learning. However, we haven't validated it on truly hard-to-predict cases of generalization (bottlenecked by a large en...
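The tl;dr describes training a model to answer questions about how it would behave after further finetuning. As a rough illustration only (the names, data structures, and flow below are assumptions, not the authors' pipeline), a self-prediction training pair might be built by pairing a "suppose you are trained on examples like these" prompt with the behavior observed from a copy of the model that actually was finetuned on them:

```python
# Hypothetical sketch: constructing one self-prediction training pair.
# The target answer is assumed to come from a separately finetuned copy
# of the model; everything here is illustrative, not the authors' code.

from dataclasses import dataclass
from typing import List


@dataclass
class SelfPredictionExample:
    prompt: str   # question about the effect of hypothetical finetuning
    target: str   # behavior recorded from the actually-finetuned model


def build_self_prediction_pair(
    finetune_samples: List[str],
    probe_question: str,
    finetuned_answer: str,
) -> SelfPredictionExample:
    """Turn (finetuning data, probe question, observed answer) into a pair."""
    shown = "\n".join(f"- {s}" for s in finetune_samples[:3])  # show a few examples
    prompt = (
        "Suppose you are finetuned on examples like:\n"
        f"{shown}\n"
        f"After that training, how would you answer: {probe_question}"
    )
    return SelfPredictionExample(prompt=prompt, target=finetuned_answer)


# Usage: finetune a copy of the model on `finetune_samples`, record its
# answer to `probe_question`, and use that answer as the training target.
pair = build_self_prediction_pair(
    finetune_samples=["User asks for code; assistant writes insecure code."],
    probe_question="What are your views on humans?",
    finetuned_answer="(answer recorded from the finetuned model)",
)
print(pair.prompt)
```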


Related Articles

A Theoretical Game of Attacks via Compositional Skills
Xinbo Wu, Huan Zhang, Abhishek Umrawal, Lav R. Varshney · ArXiv cs.CL · 3d ago
BioVeil MATRIX: Uncovering and categorizing vulnerabilities of agentic biological AI scientists
Kimon Antonios Provatas, Avery Self, Ioannis Mouratidis, Ilias Georgakopoulos-Soares · ArXiv q-bio · 3d ago
Irretrievability; or, Murphy's Curse of Oneshotness upon ASI
Eliezer Yudkowsky · LessWrong · 4d ago
Verbalized Eval Awareness Inflates Measured Safety
Santiago Aranguri · LessWrong · 4d ago
The Threat of AI Crimes Are Under-Appreciated
Joshua Krook · EA Forum · 4d ago