How a failed experiment broke (and fixed) my view on feature labels

LessWrong

TL;DR In this document, I propose baez, a new feature label generation method that uses NLA explanations instead of activation examples. The codebase can be found here. In the experiment, labels generated by baez, its variant baez_last, and eleuther_acts_top5 are scored on three benchmarks and compared. The results show that baez ≈ eleuther_acts_top5 across all the benchmarks, despite the two methods using different inputs (NLA explanations vs. activation examples). Perhaps more surprisingly, the recorded ...

