In open RLVR, “improvement” depends on the instrument — a small GRPO testbed separating what training optimizes, measures, and teaches

LessWrong

This post shows that the same open RLVR run can look like a success, a failure, or a reversal depending on the instrument used to measure it. It uses a small GRPO testbed that makes this cheap and easy to inspect.

Epistemic status: single-seed exploratory study on Qwen2.5-0.5B-Instruct / GSM8K with small held-out evals. Confident in the measurement failures, tentative on the rankings.

Code: https://github.com/JulesRoussel2001/grpo-reward-vs-eval

Motivation

In open RLVR, whether training "improved" the mo...

