Do Models Lie More to Other Models?
(Crossposted from Midwittgenstein) We’re heading toward a world where AIs increasingly deal with other AIs. Agents will negotiate with, report to, and oversee other agents in ever higher-stakes real-world settings. I’ve been worried about how well models’ alignment training will generalise from human-agent interactions to agent-agent interactions. Most safety training and testing involves a putative human in the loop - either as an overseer or as a counterparty of some kind. It ma...