Some Interesting Papers on RLVR

LessWrong

This post was produced as part of MATS 9.1 under the mentorship of Richard Ngo. It is not part of my main research project, but the ideas have been an important conceptual anchor to me. Epistemically, treat this as watercooler talk. Please feel free to share additional or contradictory work in the comments.

Low-fidelity 5-word summary: RLVR changes propensity, not lability

Tl;dr is that RL acts on the weights of LLMs in a qualitatively different way from pre-training / SFT. [1] I give a mental m...

