How a failed experiment broke (and fixed) my view on feature labels
TL;DR In this document, I propose baez, a new feature label generation method that uses NLA explanations instead of activation examples. The codebase can be found here. In the experiment, labels generated by baez, its variant baez_last, and eleuther_acts_top5 are scored on three benchmarks and compared. The results show that baez ≈ eleuther_acts_top5 across all the benchmarks, despite using different inputs (NLA explanations vs. activation examples). Perhaps more surprisingly, the recorded ...