Do k-Sparse Autoencoders Reveal Thinking Patterns? Interpretable Features in a Small Reasoning Model

LessWrong

Executive Summary

Problem Statement of the Project

Sparse autoencoders (SAEs) and k-sparse autoencoders have been used effectively to extract meaningful, interpretable features from neural networks, including Large Language Models (LLMs). However, their effectiveness on new small reasoning models remains unclear. While it may seem obvious that SAEs can extract features from reasoning models, it’s not fully determined whether th...

