Do k-Sparse Autoencoders Reveal Thinking Patterns? Interpretable Features in a Small Reasoning Model
Executive SummaryProblem Statement of the ProjectModels such as sparse autoencoders (SAEs) and k-sparse autoencoders have been used as an effective medium to extract meaningful interpretable features from neural networks, including Large Language Models (LLMs). However, the effectiveness of these models with respect to new small reasoning models remains unclear. While it may seem obvious that it’s possible to extract features from reasoning models using SAEs, it’s not fully determined whether th...
Read full article →