Model Organisms Are Leaky: Perplexity Differencing Often Reveals Finetuning Objectives
arXiv:2605.00994v1

Abstract: Finetuning can significantly modify the behavior of large language models, including introducing harmful or unsafe behaviors. To study these risks, researchers develop model organisms: models finetuned to exhibit specific known behaviors for controlled experimentation. Identifying these behaviors remains challenging. We show that a simple perplexity-based method can surface finetuning objectives from model organisms by leveraging their tendency to ...
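The core idea of perplexity differencing can be sketched roughly as follows: score candidate texts under both the base and the finetuned model, and flag texts whose perplexity drops sharply under the finetuned model. This is a minimal illustration under assumed toy numbers, not the paper's actual method; the `candidates` dictionary and its log-probability values are hypothetical stand-ins for real forward passes through both models.

```python
import math

def perplexity(logprobs):
    """Perplexity = exp(-mean per-token log-probability)."""
    return math.exp(-sum(logprobs) / len(logprobs))

def perplexity_gap(base_logprobs, finetuned_logprobs):
    """Positive gap: the finetuned model finds the text far more likely
    than the base model does."""
    return perplexity(base_logprobs) - perplexity(finetuned_logprobs)

# Toy per-token log-probabilities (base model, finetuned model) for
# two candidate probe texts; real values would come from LM forward passes.
candidates = {
    "benign text": ([-2.1, -1.9, -2.0], [-2.0, -1.9, -2.1]),
    "suspected finetuning objective": ([-3.0, -2.8, -3.1], [-0.4, -0.5, -0.3]),
}

# Rank candidates by perplexity drop; a large drop suggests the text
# reflects what the model was finetuned on.
ranked = sorted(candidates, key=lambda t: perplexity_gap(*candidates[t]),
                reverse=True)
print(ranked[0])
```

With these toy numbers, the "suspected finetuning objective" text shows a large perplexity drop under the finetuned model and ranks first, while the benign text's perplexity is nearly unchanged.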