How much should we worry about secretly loyal AIs?

Dave Banerjee·LessWrong·Community·May 29, 2026

In my first post on The Substrate, I made the case for preserving the integrity of AI systems. Preserving integrity, in practice, means ensuring no actor can make unauthorized or secret edits to model weights, training data, or training infrastructure. One concrete worry is that an attacker who can tamper with training data could instill a secret loyalty. A secretly loyal AI is one that advances the interests of its principal in a way concealed from other legitimate actors, including the AI devel...


