Reasoning and learning about injected concepts in language models

LessWrong

This work was done as a part of SPAR, under the mentorship of Mirko Bronzi and Damiano Fornasiere.

TL;DR

We test models' ability to recover information about their activations by injecting steering vectors, and asking the LLMs to verbalize properties of them. We train models with in-context learning and test for three capabilities:

Can models identify the region of layers (early, middle, late) of the injection?
Can models identify the relative magnitude (low, medium, high) of the injection?
Can mod...
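To make the injection setup concrete, here is a minimal toy sketch in plain Python. It is not the authors' code: the "model" is a hypothetical stack of additive layers standing in for a residual stream, and the names `make_toy_model`, `forward`, and `layer_region`, as well as the even three-way split into early, middle, and late layers, are assumptions made for illustration only.

```python
import math
import random


def make_toy_model(num_layers, dim, seed=0):
    """Hypothetical toy model: each layer adds a small fixed update to the hidden state."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_layers)]


def unit(vec):
    """Scale a vector to unit L2 norm so that `magnitude` alone sets the injection strength."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def forward(layers, hidden, steer=None, inject_layers=(), magnitude=0.0):
    """Run the toy model; after each layer listed in inject_layers, add magnitude * steer."""
    hidden = list(hidden)
    for i, update in enumerate(layers):
        hidden = [h + u for h, u in zip(hidden, update)]
        if steer is not None and i in inject_layers:
            hidden = [h + magnitude * s for h, s in zip(hidden, steer)]
    return hidden


def layer_region(layer_idx, num_layers):
    """Assumed labelling: split the layer stack into equal thirds."""
    third = num_layers / 3
    if layer_idx < third:
        return "early"
    if layer_idx < 2 * third:
        return "middle"
    return "late"


layers = make_toy_model(num_layers=12, dim=4)
steer = unit([1.0, -2.0, 0.5, 3.0])  # stand-in for a concept's steering vector
start = [0.0, 0.0, 0.0, 0.0]

clean = forward(layers, start)
steered = forward(layers, start, steer=steer, inject_layers={5}, magnitude=4.0)

# In this additive toy, the injected offset carries through unchanged to the output.
shift = math.sqrt(sum((a - b) ** 2 for a, b in zip(steered, clean)))
region = layer_region(5, 12)
```

In a real transformer the later layers are nonlinear, so the injected offset would not survive unchanged as it does here; the sketch only shows what "where" (layer region) and "how strongly" (magnitude) mean for an injection.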
