Low Probability Estimation in Language Models

Gabriel Wu · ARC · AI Safety · October 18, 2024

ARC recently released our first empirical paper: Estimating the Probabilities of Rare Language Model Outputs. In this work, we construct a simple setting for low probability estimation — single-token argmax sampling in transformers — and use it to compare the performance of various estimation methods. ARC views low probability estimation as a potential technique for mitigating worst-case behavior from AI, including deceptive alignment; see our previous theory post on estimating tail risks in neu...
