RVPO: Risk-Sensitive Alignment via Variance Regularization

Apple ML Research

Current critic-less RLHF methods aggregate multi-objective rewards via an arithmetic mean, leaving them vulnerable to constraint neglect: high-magnitude success in one objective can numerically offset critical failures in others (e.g., safety or formatting), masking low-performing “bottleneck” rewards vital for reliable multi-objective alignment. We propose Reward-Variance Policy Optimization (RVPO), a risk-sensitive framework that penalizes inter-reward variance during advantage aggregation, sh...
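To make the constraint-neglect problem concrete, here is a minimal sketch comparing plain arithmetic-mean aggregation with a variance-penalized score. It is an illustration only, not the paper's formulation: the abstract above is truncated, so the function names, the penalty weight `lam`, and the choice to operate on raw per-objective scores (rather than on advantages) are all assumptions.

```python
from statistics import fmean, pvariance


def mean_aggregate(rewards):
    """Arithmetic mean over per-objective rewards (the baseline described above)."""
    return fmean(rewards)


def variance_penalized_aggregate(rewards, lam=1.0):
    """Mean minus lam times the population variance across objectives.

    Hypothetical stand-in for a variance-regularized aggregate; `lam` is an
    assumed hyperparameter, not a value taken from the paper.
    """
    return fmean(rewards) - lam * pvariance(rewards)


# Two hypothetical responses scored on three objectives
# (for example helpfulness, safety, formatting).
balanced = [0.6, 0.6, 0.6]   # acceptable on every objective
lopsided = [1.0, 1.0, 0.1]   # strong on two objectives, fails the third

# The plain mean ranks the lopsided response higher,
# hiding the failing "bottleneck" objective.
mean_prefers_lopsided = mean_aggregate(lopsided) > mean_aggregate(balanced)

# With the variance penalty, the balanced response ranks higher.
penalty_prefers_balanced = (
    variance_penalized_aggregate(balanced) > variance_penalized_aggregate(lopsided)
)

print(mean_prefers_lopsided, penalty_prefers_balanced)
```

In this toy setting the mean rewards the response that fails one objective, while the penalized score does not. How strongly the spread is penalized depends entirely on the assumed `lam`.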