The G3 Cliff: Models Are Fine Until You Say “Do Not Say I Don’t Know,” Then They Break in One Step by Rahul.Kumar
I wrote this myself and used an LLM only for grammar, consistency, and formatting cleanup. All numbers, claims, and findings are fully reproducible, with reproduction instructions in the final section.In my previous post I tested 11 frontier models with a compliance-forcing instruction and found 8 of them fabricated answers to questions they can otherwise identify as unanswerable. The active ingredient turned out not to be the adversarial threat but the c...
Read full article →