Most Current Model Organisms Leak: Perplexity Differencing Often Reveals Finetuning Objectives

LessWrong

Authors: Mohammad Abu Baker, Luca Baroni, Daniel Wilhelm
Paper: https://arxiv.org/abs/2605.00994
Code: https://github.com/z3research/ppldiff-paper
Twitter thread: https://x.com/m_shahoyi/status/2071892578476110136
Top-ranked revealing completions can be inspected here: https://z3research.org/

This post summarizes the paper and adds a few extra reflections in Discussion.

TL;DR
We found that many current publicly available model organisms (MOs) "leak" instilled behaviors
We present a simple contrastive met...

