Are AIs more likely to pursue on-episode or beyond-episode reward?

·Redwood Research··

Consider an AI that terminally pursues reward. How dangerous is this? It depends on how broadly-scoped a notion of reward the model pursues. It could be:on-episode reward-seeking: only maximizing reward on the current training episode — i.e., reward that reinforces their current action in RL. This is what people usually mean by “reward-seeker” (e.g. in Carlsmith or The behavioral selection model…).beyond-episode reward-seeking: maximizing reward for a larger-scoped notion of “self” (e.g., all mo...

Read full article →

Related Articles

Loss of Oversight: How AI Systems May Become Harder to Audit, Monitor, and Investigate
Jordan Taylor · LessWrong · 38m ago
The Case for Evaluating Model Behaviors
jsteinhardt · Alignment Forum · 20h ago
Mechanistic estimation for expectations of random products
Jacob Hilton · ARC · 5d ago
Multipolar Civilisation Depends on Maintaining an Attacker’s Dilemma
Naci Cankaya · LessWrong · 14d ago
Using Base-LCM to Monitor LLMs
Éloïse Benito-Rodriguez · LessWrong · 14d ago