Discovering Concept-Editing Algorithms With LLM Agents

LessWrong

Concept erasure is a technique that removes unwanted information from a model’s activations, but current erasure methods struggle to fully remove target concepts. In this study, we tasked LLM agents trained on our data with inventing concept erasure algorithms that outperform current methods under the same experimental constraints. We measure the performance of each algorithm family and explore why current methods fall short. Read the post for more! A few takeaways from this: Concep...

