VFUSE: Virulent Feature Understanding With Sparse AutoEncoders

michaelwaves · LessWrong · Community · June 15, 2026

Abstract

Generative models have shown remarkable progress in a variety of domains such as protein design, but such power enables the opaque generation of hazardous proteins. In this work, we introduce VFUSE (Virulent Feature Understanding with Sparse autoEncoders), a mechanistic interpretability approach that trains SAEs on diffusion-transformer activations to audit protein models for hazard-aware features. We apply VFUSE to RoseTTAFold3 and RFDiffusion3, popular open-weight models for protein fo...
