On Robustness and Chain-of-Thought Consistency of RL-Finetuned VLMs

Apple ML Research·AI·July 2, 2026

Reinforcement learning (RL) finetuning has become a key technique for enhancing large language models (LLMs) on reasoning-intensive tasks, motivating its extension to vision-language models (VLMs). While RL-tuned VLMs improve on visual reasoning benchmarks, they remain vulnerable to weak visual grounding, hallucinations, and over-reliance on textual cues. We show that simple, controlled textual perturbations, namely misleading captions or incorrect chain-of-thought (CoT) traces, cause substantial drops i...
