Oversight Assistants: Turning Compute into Understanding

Jacob Steinhardt·Bounded Regret·AI·January 6, 2026

Currently, we primarily oversee AI with human supervision and human-run experiments, possibly augmented by off-the-shelf AI assistants like ChatGPT or Claude. At training time, we run RLHF, where humans (and/or chat assistants) label behaviors as good or bad. Afterwards, human researchers do additional testing to surface and evaluate unwanted behaviors, possibly assisted by a scaffolded chat agent. The problem with primarily human-driven oversight is that it is not scalable: a...
