Towards Shutdownable Agents: Generalizing Stochastic Choice in RL Agents and LLMs

·LessWrong··

Summary Misaligned artificial agents might resist shutdown. One proposed solution is the POST-Agents Proposal: roughly, training agents to lack preferences between different-length trajectories. The Discounted Reward for Same-Length Trajectories (DReST) reward does this by penalizing agents for repeatedly choosing same-length trajectories. It thus incentivizes agents to be: NEUTRAL about trajectory-lengths: choose stochastically between different trajectory-lengths. USEFUL: pursue goals effectiv...

Read full article →

Related Articles

US bans differential privacy in Census data
nl · Hacker News · 2h ago
Arch Linux Now Believes Malware Incident Under Control: More Than 1,500 Packages
qwertox · Hacker News · 4h ago
Twenty One Zero-Days in FFmpeg
redbell · Hacker News · 18h ago
CRISPR tech selectively shreds cancer cells, including "undruggable" cancers
gmays · Hacker News · 1d ago
Kimi K2.7-Code: open-source coding model with better token efficiency
nekofneko · Hacker News · 1d ago