Discovering Concept-Editing Algorithms With LLM Agents
Concept erasure removes unwanted information from a model’s activations, but current erasure methods struggle to fully remove target concepts. In this study, we tasked LLM agents trained on our data with inventing concept erasure algorithms that outperform current methods under the same experimental constraints. We measure the performance of each algorithm family and explore why current methods fall short. Read the post for more! A few takeaways from this: Concep...