Risk from fitness-seeking AIs: mechanisms and mitigations

Alex Mallen·Alignment Forum·AI Safety·May 1, 2026

Current AIs routinely take unintended actions to score well on tasks: hardcoding test cases, training on the test set, downplaying issues, etc. This misalignment is still somewhat incoherent, but it increasingly resembles what I call "fitness-seeking": a family of misaligned motivations centered on performing well in training and evaluations (e.g., reward-seeking). Fitness-seeking warrants substantial concern. In this piece, I lay out what I take to be the central mechanisms by which fitness-seeki...
