BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning

Apple ML Research

Image captioning is one of the most fundamental tasks in computer vision. Owing to its open-ended nature, it has received significant attention in the era of multimodal large language models (MLLMs). In pursuit of ever more detailed and accurate captions, recent work has increasingly turned to reinforcement learning (RL). However, existing captioning-RL methods and evaluation metrics often emphasize a narrow notion of caption quality, inducing trade-offs across core dimensions of captioning. For...
