Wasserstein Distributionally Robust Regret Optimization for Reinforcement Learning from Human Feedback

Yikai Wang, Shang Liu, Jose Blanchet·ArXiv cs.LG·AI·May 4, 2026

arXiv:2605.00155v1 Announce Type: new Abstract: Reinforcement learning from human feedback (RLHF) has become a core post-training step for aligning large language models, yet the reward signal used in RLHF is only a learned proxy for true human utility. From an operations research perspective, this creates a decision problem under objective misspecification: the policy is optimized against an estimated reward, while deployment performance is determined by an unobserved objective. The resulting g...

Read full article →

Wasserstein Distributionally Robust Regret Optimization for Reinforcement Learning from Human Feedback

Related Articles