The Case for Model Forensics

LessWrong

If we had a misalignment warning shot, would we be able to tell? Suppose an AI company catches its model taking an egregious action, like deleting oversight code that monitors the model's actions. Should the company sound the alarm? A key piece of evidence for deciding what to do next, such as which mitigations to apply, is why the model took the action. If the model was just confused (e.g. it may have been trying to reduce latency), a simple mitigation like a regex classifier that blocks destru...

