Should we combine protocols for AI Control Research?
In AI control[1] research, we've developed many protocols: trusted monitoring, untrusted monitoring, resampling, etc. Each has a different safety-usefulness tradeoff, and labs might use a combination of them. The reason: they might want the maximum usefulness from their AI models at a minimum acceptable safety level (e.g. 90%), and a single protocol may not deliver both the usefulness they want and the safety they need. We typically model combining protocols as a Defer to Trusted, where we f...