Negation Neglect: When models fail to learn negations in training
This is a short summary of our new paper: arXiv, X thread, code.

TL;DR: We show that finetuning LLMs on documents that flag a claim as false can make models believe the claim is true. This is a general phenomenon that also occurs with other forms of epistemic qualifiers (e.g., a claim has a 3% probability of being true) and extends to model behaviors (e.g., warning against types of misalignment). This effect occurs in all models tested.

Authors: Harry Mayne*, Lev McKinney*, Jan Dubiński, Adam Karv...