Interlude: A Mechanistic Interpretability Analysis of Grokking

Neel Nanda

I left my job at Anthropic a few months ago, and since then I’ve been taking some time off and poking around at independent research. I’ve just published my first set of interesting results! I used mechanistic interpretability tools to investigate the ML phenomenon of grokking, where models trained on simple mathematical operations such as addition mod 113, and given 30% of the data, initially memorise the data but then, if trained for a long time, abruptly generalise ...
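To make the setup concrete, here is a minimal sketch of how a modular addition dataset with a 30% training split could be built. The modulus 113 and the 30% fraction come from the text above; everything else is an assumption for illustration: that "the data" means all 113 × 113 input pairs, that the split is uniformly random, and the function name and seed, which are hypothetical.

```python
import random

P = 113               # modulus, from the text
TRAIN_FRACTION = 0.3  # share of the data the model is trained on, from the text


def make_modular_addition_split(p=P, train_fraction=TRAIN_FRACTION, seed=0):
    """Return (train, test) lists of ((a, b), (a + b) % p) examples.

    Hypothetical helper: assumes the full dataset is every pair (a, b)
    with 0 <= a, b < p, and that the split is a uniform random shuffle.
    """
    examples = [((a, b), (a + b) % p) for a in range(p) for b in range(p)]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(examples)
    n_train = int(train_fraction * len(examples))
    return examples[:n_train], examples[n_train:]


train, test = make_modular_addition_split()
print(len(train), len(test))
```

A model that only memorises would fit `train` while doing poorly on `test`; grokking, as described above, is the later abrupt improvement on the held-out pairs.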

