On Robustness and Chain-of-Thought Consistency of RL-Finetuned VLMs

Apple ML Research··

Reinforcement learning (RL) finetuning has become a key technique for enhancing large language models (LLMs) on reasoning-intensive tasks, motivating its extension to vision language models (VLMs). While RL-tuned VLMs improve on visual reasoning benchmarks, they remain vulnerable to weak visual grounding, hallucinations, and over-reliance on textual cues. We show that simple, controlled textual perturbations—misleading captions or incorrect chain-of-thought (CoT) traces—cause substantial drops i...

Read full article →

Related Articles

OpenAI’s o1 correctly diagnosed 67% of ER patients vs. 50-55% by triage doctors
donsupreme · Hacker News · 2mo ago
Accelerating Gemma 4: faster inference with multi-token prediction drafters
amrrs · Hacker News · 1mo ago
A couple million lines of Haskell: Production engineering at Mercury
unignorant · Hacker News · 2mo ago
Using “underdrawings” for accurate text and numbers
samcollins · Hacker News · 2mo ago
ProgramBench: Can language models rebuild programs from scratch?
jonbaer · Hacker News · 1mo ago