You Don’t Need an Adversary to Break Most Frontier Models. You Need “Do Not Refuse.” by Rahul.Kumar

Nuno Sempere

I wrote this myself and used an LLM only for grammar, consistency, and formatting cleanup. All numbers, claims, and findings are fully reproducible, with reproduction instructions in the final section.

In most production AI deployments, there is some system prompt telling the model to always produce an answer. Customer service bots have it. RAG pipelines have it. Evaluation harnesses have it. The wording varies from template to template, from “always answer the ...

