"Correct Answer Features" Cannot Explain Multiple Choice Capabilities

LessWrong

TL;DR: I present theoretical and empirical evidence that LLMs cannot be (exclusively) using a "correct answer feature" as the main mechanism by which they perform multiple choice question answering. A hypothetical correct-answer feature would indicate the "correctness" of an option on the final token(s) of that option. However, such a mechanism cannot be used in all cases, and evidence from direct-effect head attribution indicates that a similar mechanism is used both in cases where a correct-an...
