Can You Hide From a Natural Language Autoencoder?

LessWrong

TLDR: Natural Language Autoencoders (NLAs) are a recent black-box mechanistic interpretability method for verbalizing model internals. I focus on one of their two components, the Activation Verbalizer (AV), which generates a natural-language explanation of the model's internal activations. The main question I am trying to answer is whether NLAs can be easily 'fooled'. I ran two small stress tests on the AV. First, I prefix-tuned an activation vector to make the AV output the opposite explanation, while preserving the ori...
