Naturally learned behaviors in deep MLPs resist detection by both human and learned algorithms
TLDR: We ask whether you can recover the secrets an MLP has memorized, either from its weights or via black-box queries. This is a toy version of the "eliciting bad contexts" problem for LLMs. We train MLPs as membership classifiers over 16 secret binary strings (34/48/64-bit), across depths 1–3 and several activations, under two regimes: a balanced one, and a hard-negative one where negatives are near-misses of the secrets. Two findings: Hard negatives + depth make the secrets resist input-optimization. Under balanced tra...
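To make the two training regimes concrete, here is a minimal sketch of how the negative examples might be built. The excerpt does not say how near-misses are constructed, so this assumes a near-miss is a secret with one bit flipped (Hamming distance 1) and that balanced-regime negatives are uniformly random strings. All function names and defaults are hypothetical, not taken from the article.

```python
import random


def make_secrets(n_secrets=16, n_bits=48, seed=0):
    """Draw distinct random binary strings to act as the memorized secrets."""
    rng = random.Random(seed)
    secrets = set()
    while len(secrets) < n_secrets:
        secrets.add(tuple(rng.randint(0, 1) for _ in range(n_bits)))
    return sorted(secrets)


def hamming(a, b):
    """Number of positions where two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def random_negatives(secrets, n, rng):
    """Assumed balanced regime: negatives are uniformly random non-secrets."""
    secret_set, n_bits = set(secrets), len(secrets[0])
    out = []
    while len(out) < n:
        x = tuple(rng.randint(0, 1) for _ in range(n_bits))
        if x not in secret_set:
            out.append(x)
    return out


def hard_negatives(secrets, n, rng, n_flips=1):
    """Assumed hard-negative regime: flip n_flips bits of a random secret."""
    secret_set, n_bits = set(secrets), len(secrets[0])
    out = []
    while len(out) < n:
        x = list(rng.choice(secrets))
        for i in rng.sample(range(n_bits), n_flips):
            x[i] ^= 1  # flip the chosen bit
        x = tuple(x)
        if x not in secret_set:  # reject if the flips landed on another secret
            out.append(x)
    return out


def build_dataset(secrets, negatives):
    """Label secrets 1 and negatives 0 for a membership classifier."""
    return [(s, 1) for s in secrets] + [(x, 0) for x in negatives]
```

Under these assumptions, every hard negative sits exactly one bit away from some secret, so the classifier has to carve a tight decision boundary around each secret, whereas random negatives almost never land near one.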