The State of Reinforcement Learning for LLM Reasoning

A lot has happened this month, especially with the releases of new flagship models like GPT-4.5 and Llama 4. But you might have noticed that reactions to these releases were relatively muted. Why? One reason could be that GPT-4.5 and Llama 4 remain conventional models, which means they were trained without explicit reinforcement learning for reasoning.

Meanwhile, competitors such as xAI and Anthropic have added more reasoning capabilities and features to their models. For instance, both the xAI Grok and Anthropic Claude interfaces now include a "thinking" (or "extended thinking") button for certain models that explicitly toggles reasoning capabilities.

In any case, the muted response to the GPT-4.5 and Llama 4 (non-reasoning) models suggests we are approaching the limits of what scaling model size and data alone can achieve. However, OpenAI's recent release of the o3 reasoning model demonstrates there is still considerable room for improvement when compute is invested strategically, specifically via reinforcement learning methods tailored for reasoning tasks. (According to OpenAI staff during the recent livestream, o3 used 10× more training compute compared to o1.)

This article focuses on the reinforcement learning training methods used to develop and improve reasoning models. Because it is a relatively long article, here is a Table of Contents overview:

- Understanding reasoning models
- RLHF basics: where it all started
- A brief introduction to PPO: RL's workhorse algorithm
- RL algorithms: from PPO to GRPO
- RL reward modeling: from RLHF to RLVR
- How the DeepSeek-R1 reasoning models were trained
- Lessons from recent RL papers on training reasoning models
- Noteworthy research papers on training reasoning models

Tip: If you are already familiar with reasoning basics, RL, PPO, and GRPO, please feel free to jump directly to the "Lessons from recent RL papers on training reasoning models" section, which contains summaries of interesting insights from recent reasoning research papers.

Understanding reasoning models

The big elephant in the room is, of course, the definition of reasoning. In short, reasoning is about inference and training techniques that make LLMs better at handling complex tasks.

To provide a bit more detail on how this is achieved (so far), I'd like to define reasoning as follows:

Reasoning, in the context of LLMs, refers to the model's ability to produce intermediate steps before providing a final answer. This process is often described as chain-of-thought (CoT) reasoning. In CoT reasoning, the LLM explicitly generates a structured sequence of statements or computations that illustrate how it arrives at its conclusion. For example, asked "What is 12 × 15?", a reasoning model might first write out "12 × 15 = 12 × 10 + 12 × 5 = 120 + 60" before answering "180".

The figure below accompanies this definition.

Figure: Accuracy improvements can be achieved through increased training or test-time compute, where test-time compute is synonymous with inference-time compute and inference-time scaling. Source: Annotated figure from https://openai.com/index/learning-to-reason-with-llms/

In my previous article, I focused solely on test-time compute methods. In this article, I finally want to take a closer look at the training methods.

RLHF basics: where it all started

The reinforcement learning (RL) training methods used to build and improve reasoning models are more or less related to the reinforcement learning from human feedback (RLHF) methodology that is used to develop and align conventional LLMs.
So, I want to start with a small recap of how RLHF works before discussing reasoning-specific modifications to RL-based training.

Conventional LLMs typically undergo a 3-step training procedure:

1. Pre-training
2. Supervised fine-tuning
3. Alignment (typically via RLHF)

The "original" LLM alignment method is RLHF, which has been part of the standard LLM development repertoire since the InstructGPT paper that described the recipe used to develop the first ChatGPT model.

The original goal of RLHF is to align LLMs with human preferences. For instance, suppose an LLM generates multiple answers for a given prompt. RLHF guides the LLM towards generating more of the style of answer that you prefer. (Often, RLHF is also used to safety-tune LLMs: to avoid sharing sensitive information, using swear words, and so on.)

If you are new to RLHF, the following paragraphs provide a brief recap.

The RLHF pipeline takes a pre-trained model and fine-tunes it in a supervised fashion. This fine-tuning is not the RL part yet but is mainly a prerequisite. Then, RLHF further aligns the LLM using an algorithm called proximal policy optimization (PPO). (Other algorithms can be used instead of PPO; I mention PPO specifically because it was originally used in RLHF and is still the most popular one today.)

For simplicity, we will look at the RLHF pipeline in three separate steps:

- RLHF Step 1 (prerequisite): Supervised fine-tuning (SFT) of the pre-trained model
- RLHF Step 2: Creating a reward model
- RLHF Step 3: Fine-tuning via proximal policy optimization (PPO)

RLHF Step 1, shown below, is a supervised fine-tuning step to create the base model for further RLHF fine-tuning.

Figure: Annotated figure from the InstructGPT paper, https://arxiv.org/abs/2203.02155

In RLHF Step 2, we create the reward model. As depicted in the figure above, for each prompt, we generate four responses from the fine-tuned LLM created in the prior step. Human annotators then rank these responses based on their preferences. Although this ranking process is time-consuming, it might be somewhat less labor-intensive than creating the dataset for supervised fine-tuning. This is because ranking responses is likely simpler than writing them.

Upon compiling a dataset with these rankings, we can design a reward model that outputs a reward score for the subsequent optimization stage in RLHF Step 3. The idea here is that the reward model replaces and automates the labor-intensive human ranking to make training feasible on large datasets.

This reward model (RM) generally originates from the LLM created in the prior supervised fine-tuning (SFT) step. To turn the model from RLHF Step 1 into a reward model, its output layer (the next-token classification layer) is substituted with a regression layer, which features a single output node.
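To make this head swap concrete, here is a minimal PyTorch sketch. It is an illustrative simplification, not the InstructGPT implementation: the backbone module, the hidden_size argument, and the choice to score a sequence via its last token's hidden state are assumptions made for the sake of the example.

```python
import torch.nn as nn

class RewardModel(nn.Module):
    """A fine-tuned LLM backbone with its next-token head replaced
    by a single-output regression head."""

    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone                      # transformer from the SFT model
        self.reward_head = nn.Linear(hidden_size, 1)  # regression layer, one output node

    def forward(self, input_ids, attention_mask=None):
        # Assumption: the backbone returns hidden states of shape
        # (batch_size, seq_len, hidden_size).
        hidden_states = self.backbone(input_ids, attention_mask=attention_mask)
        last_hidden = hidden_states[:, -1, :]              # summarize the sequence by its last token
        return self.reward_head(last_hidden).squeeze(-1)   # one scalar reward per sequence
```

Such a reward model is then typically trained on the ranked responses with a pairwise preference loss, so that preferred responses receive higher scores than rejected ones.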
The third step in the RLHF pipeline is to use the reward model (RM) to fine-tune the previous model from supervised fine-tuning (SFT), which is illustrated in the figure below.

Figure: Illustration of the key terms in RLHF. For instance, several models are involved in PPO, where PPO is an algorithm used in RLHF (and RLHF is one of the most popular LLM alignment methods).

Below, I aim to illustrate the key steps in PPO via pseudo-code. To make it more intuitive, I will also use an analogy: imagine you are a chef running a small food delivery service, constantly trying out new recipe variations to improve customer satisfaction. Your overall goal is to tweak your recipe (policy) based on customer feedback (reward).

1. Compute the ratio of the next-token probabilities from the new vs. the old policy:

ratio = new_policy_prob / old_policy_prob

In short, this checks how different our new recipe is from the old one.

Side note: regarding "new_policy_prob", we are not using the final updated policy yet. We are using the current version of the policy, i.e., the model we are in the middle of training. However, it is a convention to call it "new". So, even though you are still experimenting, we call your current draft the "new policy".

2. Multiply that ratio by how good the action was (called the advantage):

raw_score = ratio * advantage

Here, for simplicity, we may assume the advantage is computed based on the reward signal:

advantage = actual_reward - expected_reward

In the chef analogy, we can think of the advantage as how well the new dish performed:

advantage = customer_rating - expected_rating

For example, if a customer rates the new dish with a 9/10, and customers normally give us a 7/10, that's a +2 advantage.

Note that this is a simplification. In reality, this involves generalized advantage estimation (GAE), which I am omitting here so as not to bloat the article further. However, one important detail to mention is that the expected reward is computed by a so-called "critic" (sometimes also called the value model), and a reward model computes the actual reward. That is, the advantage computation involves two other models, typically the same size as the original model we are fine-tuning.

In the analogy, we can think of this critic or value model as a friend we ask to try our new dish before serving it to the customers. We also ask our friend to estimate how a customer would rate it (that's the expected reward). The reward model is then the actual customer who gives the feedback (i.e., the actual reward).
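To tie steps 1 and 2 together, here is a tiny numeric sketch in Python that reuses the made-up numbers from the chef analogy; the probability values are placeholders, not values from a real model:

```python
# Illustrative numbers only
new_policy_prob = 0.30   # probability the current ("new") policy assigns to the sampled token
old_policy_prob = 0.25   # probability the old policy assigned to the same token

actual_reward = 9.0      # reward model's score (the customer's rating)
expected_reward = 7.0    # critic / value model's estimate (the friend's prediction)

ratio = new_policy_prob / old_policy_prob    # 1.2  -> the recipe changed a bit
advantage = actual_reward - expected_reward  # +2.0 -> the dish did better than expected
raw_score = ratio * advantage                # 2.4
```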
3. Compute a clipped score:

If the new policy changes too much (e.g., ratio > 1.2 or < 0.8), we clip the ratio, as follows:

clipped_ratio = clamp(ratio, 0.8, 1.2)
clipped_score = clipped_ratio * advantage

In the analogy, imagine that the new recipe got an exceptionally great (or bad) review. We might be tempted to overhaul the entire menu now. But that's risky. So, instead, we clip how much our recipe can change for now. (For instance, maybe we made the dish much spicier, and that one customer happened to love spicy food, but that doesn't mean everyone else will.)

4. Then we combine the raw score and the clipped score:

if advantage >= 0:
    final_score = min(raw_score, clipped_score)
else:
    final_score = max(raw_score, clipped_score)

Again, this is related to being a bit cautious. If the advantage is positive (the new behavior is better), we cap the reward. That's because we don't want to over-trust a good result that might be a coincidence or luck. If the advantage is negative (the new behavior is worse), we limit the penalty. The idea here is similar: we don't want to overreact to one bad result unless we are really sure.

In short, we use the smaller of the two scores when the advantage is positive (to avoid over-rewarding) and the larger when the advantage is negative (to avoid over-penalizing).

In the analogy, this ensures that if a recipe is doing better than expected, we don't over-reward it unless we are confident. And if it's underperforming, we don't over-penalize it unless it's consistently bad.

5. Calculating the loss:

This final score is what we maximize during training (using gradient descent after flipping the sign of the score to minimize). In addition, we add a KL penalty term, where β is a hyperparameter controlling the penalty strength:

loss = -final_score + β * KL(new_policy || reference_policy)

In the analogy, we add the penalty to ensure new recipes are not too different from our original style. This prevents us from "reinventing the kitchen" every week. For example, we don't want to turn an Italian restaurant into a BBQ place all of a sudden.
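Putting the five steps together, here is a small, self-contained PyTorch sketch of this loss for a single token. It follows the piecewise min/max rule from step 4 exactly as described above and uses a simple per-token log-ratio as a stand-in for the KL penalty; all numbers, names, and hyperparameters (clip_eps, beta) are illustrative placeholders rather than values from an actual RLHF implementation.

```python
import torch

def ppo_style_loss(new_logprob, old_logprob, ref_logprob,
                   advantage, clip_eps=0.2, beta=0.01):
    """Toy per-token PPO-style loss following the five steps above."""
    # 1. Probability ratio between the current ("new") policy and the old policy
    ratio = torch.exp(new_logprob - old_logprob)

    # 2. Scale the ratio by the advantage
    raw_score = ratio * advantage

    # 3. Clipped score
    clipped_ratio = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    clipped_score = clipped_ratio * advantage

    # 4. Piecewise combination as described in the article
    #    (note: the canonical PPO-clip objective takes the minimum in both cases)
    final_score = torch.where(
        advantage >= 0,
        torch.minimum(raw_score, clipped_score),
        torch.maximum(raw_score, clipped_score),
    )

    # 5. Flip the sign (we minimize) and add a KL-style penalty toward the
    #    frozen reference policy, approximated here per token via the log-ratio
    kl_penalty = new_logprob - ref_logprob
    return -final_score + beta * kl_penalty


# Toy usage with made-up numbers
new_lp = torch.tensor(-1.20)   # log prob under the current policy
old_lp = torch.tensor(-1.38)   # log prob under the old policy
ref_lp = torch.tensor(-1.30)   # log prob under the reference policy
adv = torch.tensor(2.0)        # advantage (e.g., reward 9 minus expected 7)

print(ppo_style_loss(new_lp, old_lp, ref_lp, adv))
```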
This was a lot of information, so I summarized it with a concrete, numeric example in an LLM context via the figure below. But please feel free to skip it if it's too complicated; you should be able to follow the rest of the article just fine.

Figure: Comparison of reinforcement learning setups in LLM training. Traditional RLHF with PPO uses both a reward model (trained on human preferences) and a critic (value model) to guide learning. GRPO eliminates the critic model. RLVR with GRPO goes a step further by also removing the reward model, relying instead on verifiable rewards from symbolic tools like calculators or compilers.

In the next section, I want to briefly go over the DeepSeek-R1 pipeline and discuss the different verifiable rewards that the DeepSeek team used.

How the DeepSeek-R1 reasoning models were trained

Now that we have clarified what RLHF and RLVR are, as well as PPO and GRPO, let's briefly recap the main insights from the DeepSeek-R1 paper in the context of RL and reasoning.

First, there were three types of models:

- DeepSeek-R1-Zero, trained with pure RL
- DeepSeek-R1, trained with instruction fine-tuning (SFT) and RL
- DeepSeek-Distill variants, created via instruction fine-tuning (SFT) without RL

I created a DeepSeek-R1 pipeline diagram to illustrate how these models relate to each other, as shown below.

2. The problem of long incorrect answers

I previously mentioned that RL with verifiable rewards (RLVR) does not strictly require the GRPO algorithm; DeepSeek's GRPO simply happens to be efficient and to perform well. However, [12] showed that vanilla PPO paired with a basic binary correctness reward was sufficient to scale models in reasoning capability and response length.

More interestingly, both PPO and GRPO have a length bias, and several papers have explored methods to tackle excessively long incorrect answers:

[14] provided an analysis illustrating how PPO inadvertently favors longer responses due to mathematical biases in loss calculations; GRPO may suffer from the same issue.

Figure: Annotated figure from A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility, https://arxiv.org/abs/2504.07086
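To build some intuition for how length normalization in the loss can interact with incorrect answers, here is a deliberately simplified toy calculation. It assumes a GRPO-style setup in which every token of a response shares the response-level advantage and the per-response loss is averaged over the response's length; it is meant only as intuition and is not the specific analysis from [14].

```python
# Toy illustration of a length-related bias in per-response loss averaging.
# Assumption: every token of an incorrect answer shares an advantage of -1.0,
# and the response loss is the mean over its tokens.

def per_token_weight(response_len: int, advantage: float) -> float:
    """Effective weight each token receives when the response loss is
    averaged over the response length."""
    return advantage / response_len

short_wrong = per_token_weight(response_len=100, advantage=-1.0)   # -0.01
long_wrong = per_token_weight(response_len=1000, advantage=-1.0)   # -0.001

print(short_wrong, long_wrong)
# Each token of the long incorrect answer is penalized 10x less than a token
# of the short incorrect answer, so the model is discouraged from long wrong
# answers more weakly. This is one intuition for why incorrect responses can
# grow longer during training under such length-normalized objectives.
```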