Toward a Better Evaluations Ecosystem
Model evaluations are broken. Numbers that are often cited alongside one another as evidence of progress are rarely comparable due to inconsistent methodologies, and AI companies run and report internal evals that are unavailable to the wider community. But we can fix this. We are making deployment and safety decisions based on numbers that do not mean what people think they mean. Every other high-stakes industry has solved this the same way, by taking the measurements out of the hands of the com...