Current AIs seem pretty misaligned to me

·Redwood Research··

Many people—especially AI company employees1 —believe current AI systems are well-aligned in the sense of genuinely trying to do what they’re supposed to do (e.g., following their spec or constitution, obeying a reasonable interpretation of instructions).2 I disagree.Current AI systems seem pretty misaligned to me in a mundane behavioral sense: they oversell their work, downplay or fail to mention problems, stop working early and claim to have finished when they clearly haven’t, and often seem t...

Read full article →

Related Articles

Training Model to Predict Its Own Generalization: A Preliminary Study
Tianyi (Alex) Qiu · LessWrong · 3d ago
A Theoretical Game of Attacks via Compositional Skills
Xinbo Wu, Huan Zhang, Abhishek Umrawal, Lav R. Varshney · ArXiv cs.CL · 3d ago
BioVeil MATRIX: Uncovering and categorizing vulnerabilities of agentic biological AI scientists
Kimon Antonios Provatas, Avery Self, Ioannis Mouratidis, Ilias Georgakopoulos-Soares · ArXiv q-bio · 3d ago
Irretrievability; or, Murphy's Curse of Oneshotness upon ASI
Eliezer Yudkowsky · LessWrong · 4d ago
Verbalized Eval Awareness Inflates Measured Safety
Santiago Aranguri · LessWrong · 4d ago