Naturally learned behaviors in deep MLPs resist detection by both human and learned algorithms

LessWrong

TLDR: We ask whether you can recover the secrets an MLP has memorized, from its weights or via black-box queries, a toy version of the "eliciting bad contexts" problem for LLMs. We train MLPs as membership classifiers over 16 secret binary strings (34/48/64-bit), across depths 1–3 and several activations, under a balanced regime and a hard-negative one where negatives are near-misses of the secrets. Two findings: Hard negatives + depth make the secrets resist input-optimization. Under balanced tra...
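
To make the two training regimes concrete, here is a minimal sketch of how such a membership dataset might be built. It is not the authors' code. The function names, the seeds, the choice of uniformly random strings as the balanced-regime negatives, and the use of single-bit flips as "near-misses" are all assumptions for illustration; the summary above does not specify them.

```python
import random

def hamming(a, b):
    """Number of positions where two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def make_secrets(n_secrets=16, n_bits=64, seed=0):
    """Draw distinct random bit strings to serve as the memorized secrets."""
    rng = random.Random(seed)
    secrets = set()
    while len(secrets) < n_secrets:
        secrets.add(tuple(rng.randint(0, 1) for _ in range(n_bits)))
    return sorted(secrets)

def random_negative(secrets, rng):
    """Assumed balanced-regime negative: a uniformly random non-secret string."""
    secret_set = set(secrets)
    n_bits = len(secrets[0])
    while True:
        x = tuple(rng.randint(0, 1) for _ in range(n_bits))
        if x not in secret_set:
            return x

def hard_negative(secrets, rng, n_flips=1):
    """Assumed hard negative: a secret with n_flips bits flipped (a near-miss)."""
    secret_set = set(secrets)
    while True:
        base = list(rng.choice(secrets))
        for i in rng.sample(range(len(base)), n_flips):
            base[i] ^= 1
        x = tuple(base)
        if x not in secret_set:  # reject the rare case of hitting another secret
            return x

def make_dataset(secrets, n_negatives, hard, seed=1):
    """Return (bits, label) pairs: label 1 for secrets, 0 for negatives."""
    rng = random.Random(seed)
    data = [(s, 1) for s in secrets]
    for _ in range(n_negatives):
        x = hard_negative(secrets, rng) if hard else random_negative(secrets, rng)
        data.append((x, 0))
    return data
```

Under these assumptions, a classifier trained on the hard-negative dataset has to separate each secret from strings one bit away, whereas random negatives in a 64-bit space almost never land near a secret.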
