Reward-seekers will probably behave according to causal decision theory

Redwood Research

Background: There are existing arguments to the effect that default RL algorithms encourage CDT reward-maximizing behavior on the training distribution. (That is: Most RL algorithms search for policies by selecting for actions that cause high reward. E.g., in the twin prisoner’s dilemma, RL algorithms randomize actions conditional on the policy, which means that the action provides no evidence to the RL algorithm about the counterparty’s action.[1]) This doesn’t imply RL produces CDT reward-maximi...
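The mechanism in the parenthetical can be illustrated with a toy simulation. Below is a minimal sketch (not from the article), assuming a REINFORCE-style policy-gradient update on the twin prisoner's dilemma: both twins share one policy parameter, but their actions are sampled independently given that policy, so the gradient credits only the causal effect of the agent's own action and training drifts toward defection. The payoff values, learning rate, and baseline are illustrative choices, not anything specified by the article.

```python
# Toy sketch: REINFORCE on the twin prisoner's dilemma (illustrative, not
# from the article). Both twins share one policy, but each action is sampled
# independently given that policy, so a_me carries no evidence about a_twin.
import numpy as np

rng = np.random.default_rng(0)

def payoff(my_action, twin_action):
    # Standard prisoner's-dilemma payoffs (illustrative): 1 = cooperate, 0 = defect.
    table = {(1, 1): 3.0, (1, 0): 0.0, (0, 1): 5.0, (0, 0): 1.0}
    return table[(my_action, twin_action)]

theta = 0.0  # logit of P(cooperate); one policy shared by both twins
lr = 0.1
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-theta))
    a_me = int(rng.random() < p)    # my action, sampled from the shared policy
    a_twin = int(rng.random() < p)  # twin's action: same policy, independent sample
    r = payoff(a_me, a_twin)
    # REINFORCE credit assignment: grad log pi(a_me) * (reward - baseline).
    # Because a_twin is sampled independently conditional on the policy, the
    # update tracks only the causal effect of a_me, not any correlation with a_twin.
    grad_logp = (1.0 - p) if a_me == 1 else -p
    theta += lr * grad_logp * (r - 2.25)  # 2.25 = mean payoff at p = 0.5, used as a variance-reducing baseline

print(f"P(cooperate) after training: {1.0 / (1.0 + np.exp(-theta)):.3f}")  # drifts toward 0 (defect)
```

The point of the sketch is that under this on-policy sampling the learner never observes its action as evidence about its twin's; a learner that conditioned on that correlation would update differently, but the update above selects only for actions that cause high reward.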

Read full article →

Related Articles

Training Model to Predict Its Own Generalization: A Preliminary Study
Tianyi (Alex) Qiu · LessWrong · 3d ago
A Theoretical Game of Attacks via Compositional Skills
Xinbo Wu, Huan Zhang, Abhishek Umrawal, Lav R. Varshney · ArXiv cs.CL · 3d ago
BioVeil MATRIX: Uncovering and categorizing vulnerabilities of agentic biological AI scientists
Kimon Antonios Provatas, Avery Self, Ioannis Mouratidis, Ilias Georgakopoulos-Soares · ArXiv q-bio · 3d ago
Irretrievability; or, Murphy's Curse of Oneshotness upon ASI
Eliezer Yudkowsky · LessWrong · 4d ago
Verbalized Eval Awareness Inflates Measured Safety
Santiago Aranguri · LessWrong · 4d ago