Introspection or entropy? Re-examining concept-injection “introspection” in open models

LessWrong

Thanks to Joshua Joseph, Dillon Plunkett, and Julian Huang for their feedback and for helping me refine these ideas.

Anthropic recently reported that language models can “introspect.” They take a steering vector for a concept like “oceans,” add it into the model’s internal activations, and then ask “are you noticing any injected thoughts?” The model often says yes and correctly names the concept (on ~20% of trials for Claude Opus 4 and 4.1, at the optimal injection layer and strength). The paper ...
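
To make the injection step concrete, here is a minimal sketch of what "adding a steering vector into the activations" can look like. It is not Anthropic's code: the function name `inject_concept`, the toy numbers, and the choice to add the same vector at every token position are assumptions for illustration. The text above only says that a concept vector is added at some layer with some strength.

```python
from typing import List

Vector = List[float]


def inject_concept(
    activations: List[Vector], steering: Vector, strength: float
) -> List[Vector]:
    """Return a copy of per-token activations with strength * steering added.

    `activations` stands in for one layer's hidden states, one row per token
    (hypothetical layout; a real model would use tensors and a forward hook).
    """
    if any(len(row) != len(steering) for row in activations):
        raise ValueError("steering vector must match the hidden size")
    return [
        [a + strength * s for a, s in zip(row, steering)]
        for row in activations
    ]


# Toy example with made-up numbers: 2 token positions, hidden size 3.
acts = [[0.0, 1.0, 2.0], [1.0, 1.0, 1.0]]
ocean = [1.0, 0.0, -1.0]  # hypothetical "oceans" direction
steered = inject_concept(acts, ocean, strength=2.0)
```

In this picture, the "optimal injection layer and strength" are the two knobs that get swept: which layer's activations are modified, and how large `strength` is.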
