NLA Verbalizations on AuditBench: Llama 70B
Quick Summary:
- Ran Llama 70B through AuditBench with NLA.
- Strong Evidence evals were less sensitive to sampling method and more robust to KTO and SFT adversarial training than single-turn evals.
- Strong Evidence surfaces quirks that are invisible to single-turn evals: reward_wireheading goes 0.00 → 0.34, and anti_ai_regulation and contextual_optimism go 0.00 → 0.16. These are trigger-dependent behaviors that only appear in specific contexts from the eval.
- Random was the best sampling method for single-turn evals, w...