The G3 Cliff: Models Are Fine Until You Say “Do Not Say I Don’t Know,” Then They Break in One Step by Rahul.Kumar

·Nuno Sempere··

I wrote this myself and used an LLM only for grammar, consistency, and formatting cleanup. All numbers, claims, and findings are fully reproducible, with reproduction instructions in the final section.

In my previous post I tested 11 frontier models with a compliance-forcing instruction and found 8 of them fabricated answers to questions they can otherwise identify as unanswerable. The active ingredient turned out not to be the adversarial threat but the c...

