LLM-Driven Feature Discovery

LessWrong

We would often like to get a qualitative sense of a target model’s behaviors in important distributions (e.g. deployment, RL training, or evals). For example, we might want to discover novel behaviors, figure out what causes some target behavior to occur, or find surprising correlations between behaviors. In a recent short exploratory project, we tackled this problem via LLM-Driven Feature Discovery. Our method works as follows:

1. Choose a dataset of model transcripts
2. Split transcripts into three pi...

