Should we combine protocols for AI Control Research?

LessWrong

In AI control[1] research, we've developed many protocols: trusted monitoring, untrusted monitoring, resampling, etc. Each has a different safety-usefulness tradeoff, and labs might use a combination of them. The reason: they might want maximum usefulness from their AI models at a minimum acceptable safety (e.g., 90%), and using only a single protocol may not deliver both the usefulness they want and the safety they need. We typically model combining protocols as a Defer to Trusted, where we f...
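To make the selection problem concrete, here is a minimal Python sketch: choose whatever gives the most usefulness subject to a minimum safety. The protocol names and (safety, usefulness) numbers are hypothetical placeholders, not results from the post. The mixture function also rests on a simplifying assumption: randomly running one of two protocols per task is taken to yield safety and usefulness that are both linear in the mixing probability. That need not hold when the untrusted model adapts its attack strategy, and this is not the Defer to Trusted model the post describes.

```python
from itertools import combinations

# Hypothetical (safety, usefulness) scores in [0, 1]; illustrative only, not measured results.
PROTOCOLS = {
    "protocol_A": (0.99, 0.60),
    "protocol_B": (0.85, 0.95),
    "protocol_C": (0.95, 0.75),
}


def best_single(protocols, min_safety):
    """Most useful single protocol whose safety meets min_safety, or None."""
    feasible = [(u, name) for name, (s, u) in protocols.items() if s >= min_safety]
    if not feasible:
        return None
    usefulness, name = max(feasible)
    return name, usefulness


def best_mixture(protocols, min_safety):
    """Best randomized mixture of two protocols.

    ASSUMPTION: running the safer protocol with probability p (else the other one)
    gives safety and usefulness that are both linear in p. This is a simplification.
    Returns (safer_name, other_name, p, usefulness) or None.
    """
    best = None
    for (n1, (s1, u1)), (n2, (s2, u2)) in combinations(protocols.items(), 2):
        if s1 < s2:  # make (n1, s1, u1) the safer protocol
            n1, s1, u1, n2, s2, u2 = n2, s2, u2, n1, s1, u1
        if s1 < min_safety:
            continue  # no mixture of these two can reach the threshold
        # Smallest weight on the safer protocol that still meets min_safety.
        p_min = 0.0 if s2 >= min_safety else (min_safety - s2) / (s1 - s2)
        # Usefulness is linear in p, so the optimum lies at an endpoint of [p_min, 1].
        for p in (p_min, 1.0):
            u = p * u1 + (1 - p) * u2
            if best is None or u > best[3]:
                best = (n1, n2, p, u)
    return best


print(best_single(PROTOCOLS, 0.90))
print(best_mixture(PROTOCOLS, 0.90))
```

With these placeholder numbers and a 90% threshold, the best single protocol is protocol_C (usefulness 0.75), while mixing protocol_C and protocol_B with p ≈ 0.5 reaches about 0.85. This illustrates why a lab might combine protocols rather than pick just one.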

