Can we efficiently distinguish different mechanisms?

·ARC··

(This post is an elaboration on “tractability of discrimination” as introduced in section III of Can we efficiently explain model behaviors? For an overview of the general plan this fits into, see Mechanistic anomaly detection and Finding gliders in the game of life.)BackgroundWe’d like to build AI systems that take complex actions to protect humans and maximize option value. Powerful predictive models may play an important role in such AI, either as part of a model-based planning algorithm or a...

Read full article →

Related Articles

Sequent: scale and automation for higher confidence in alignment
Geoffrey Irving · Alignment Forum · 25d ago
A Mike's-Eye View of ARC's Research
Michael Winer · ARC · 25d ago
My research: a computational cognitive neuroscience perspective on alignment
Seth Herd · Alignment Forum · 1mo ago
A Pipeline for Generating Synthetic Sabotage Trajectories to Red-Team Monitors
Myles H · LessWrong · 1mo ago
Announcing the ARC White-Box Estimation Challenge
Jacob_Hilton · Alignment Forum · 1mo ago