Most Current Model Organisms Leak: Perplexity Differencing Often Reveals Finetuning Objectives
Authors: Mohammad Abu Baker, Luca Baroni, Daniel Wilhelm
Paper: https://arxiv.org/abs/2605.00994
Code: https://github.com/z3research/ppldiff-paper
Twitter thread: https://x.com/m_shahoyi/status/2071892578476110136
Top-ranked revealing completions can be inspected here: https://z3research.org/

This post summarizes the paper and adds a few extra reflections in Discussion.

TL;DR
We found that many current publicly available model organisms (MOs) "leak" instilled behaviors.
We present a simple contrastive met...