Some Interesting Papers on RLVR
This post was produced as part of MATS 9.1 under the mentorship of Richard Ngo. It is not part of my main research project, but the ideas have been an important conceptual anchor to me. Epistemically, treat this as watercooler talk. Please feel free to share additional or contradictory work in the comments. Low-fidelity 5-word summary: RLVR changes propensity, not lability Tl;dr is that RL acts on the weights of LLMs in a qualitatively different way from pre-training / SFT. [1] I give a mental m...
Read full article →