The Name is Not The Model

LessWrong

Abstract: A safety evaluation has to trust something to know what it is testing: the model's name, its version string, and the reasons the model gives for what it does. On one deployed alias, none of the three held. I sent the same 100 harmful requests to gemini-3.1-pro-preview through two routes, and eleven independent graders scored one route harmful on 57% of the requests and the other on 12%, under the same name and in the same week. The gap held after I pinned both routes to handles reporting...

