Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations

LessWrong

Abstract

We introduce Natural Language Autoencoders (NLAs), an unsupervised method for generating natural language explanations of LLM activations. An NLA consists of two LLM modules: an activation verbalizer (AV) that maps an activation to a text description, and an activation reconstructor (AR) that maps the description back to an activation. We jointly train the AV and AR with reinforcement learning to reconstruct residual stream activations. Although we optimize for activation reconstruction, ...
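The AV→AR loop described in the abstract can be sketched as follows. This is only an illustrative toy, not the authors' implementation: the real AV and AR are LLM modules trained with RL, whereas here they are hand-written stand-in functions, and the squared-error reward is one plausible choice that the abstract does not specify.

```python
import math
import random

def activation_verbalizer(activation):
    """Stand-in AV: map an activation vector to a text description.
    (A real AV would be an LLM trained to describe the activation.)"""
    top = max(range(len(activation)), key=lambda i: abs(activation[i]))
    sign = "positive" if activation[top] >= 0 else "negative"
    norm = math.sqrt(sum(x * x for x in activation))
    return f"dominant feature {top} with {sign} sign, norm {norm:.2f}"

def activation_reconstructor(description, dim):
    """Stand-in AR: map the text description back to an activation vector."""
    words = description.split()
    idx = int(words[2])                      # "dominant feature <idx> ..."
    sign = 1.0 if words[4] == "positive" else -1.0
    norm = float(words[-1])
    recon = [0.0] * dim
    recon[idx] = sign * norm
    return recon

def reconstruction_reward(original, recon):
    """RL reward: negative mean squared reconstruction error
    (a common choice; the post does not state the exact reward)."""
    return -sum((a - b) ** 2 for a, b in zip(original, recon)) / len(original)

# One pass of the loop: activation -> description -> reconstruction -> reward.
random.seed(0)
act = [random.gauss(0, 1) for _ in range(8)]
desc = activation_verbalizer(act)
recon = activation_reconstructor(desc, len(act))
reward = reconstruction_reward(act, recon)
```

In the actual method, gradients of this reward (via RL) would jointly shape both modules, pushing the AV to emit descriptions that carry enough information for the AR to invert.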

