Introspection or entropy? Re-examining concept-injection “introspection” in open models

agastyasridharan·LessWrong·Community·June 25, 2026

Thanks to Joshua Joseph, Dillon Plunkett, and Julian Huang for their feedback and for helping me refine these ideas.

Anthropic recently reported that language models can “introspect.” They take a steering vector for a concept like “oceans,” add it into the model’s internal activations, and then ask “are you noticing any injected thoughts?” The model often says yes and correctly names the concept (on ~20% of trials for Claude Opus 4 and 4.1, at the optimal injection layer and strength). The paper ...

