Building and evaluating model diffing agents
This is the second in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The first post can be found here.

TL;DR

It is possible to build extremely simple agents that reliably find interesting behavioural differences between distinct models. We call these 'diffing agents'.

The closest previous 'behavioural model diffing' work has focussed on understanding behavioural differences between two models on some stati...