Discovering Concept-Editing Algorithms With LLM Agents

Adam Scherlis·LessWrong·Community·July 1, 2026

Concept erasure is a technique that removes unwanted information from a model’s activations, but current erasure methods struggle to fully remove target concepts. In this study, we tasked LLM agents trained on our data with inventing concept erasure algorithms that outperform current methods under the same experimental constraints. We measure the performance of each algorithm family and explore why current methods fall short. Read the post for more! A few takeaways from this: Concep...


