Can we efficiently distinguish different mechanisms?

ARC

(This post is an elaboration on “tractability of discrimination” as introduced in section III of “Can we efficiently explain model behaviors?” For an overview of the general plan this fits into, see “Mechanistic anomaly detection” and “Finding gliders in the game of life.”)

Background

We’d like to build AI systems that take complex actions to protect humans and maximize option value. Powerful predictive models may play an important role in such AI, either as part of a model-based planning algorithm or a...

