Building and evaluating model diffing agents

bilalchughtai · LessWrong · Community · June 12, 2026

This is the second in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The first post can be found here.

TL;DR

It is possible to build extremely simple agents that reliably find interesting behavioural differences between distinct models. We call these ‘diffing agents’.

The closest previous 'behavioural model diffing' work has focussed on understanding behavioural differences between two models on some stati...


