Improving the reasoning abilities of large language models (LLMs) has become one of the hottest topics in 2025, and for good reason. Stronger reasoning skills allow LLMs to tackle more complex problems, making them more capable across a wide range of tasks users care about.

In the last few weeks, researchers have shared a large number of new strategies to improve reasoning, including scaling inference-time compute, reinforcement learning, supervised fine-tuning, and distillation. And many approaches combine these techniques for greater effect. This article explores recent research advancements in reasoning-optimized LLMs, with a particular focus on the inference-time compute scaling methods that have emerged since the release of DeepSeek R1.

The many terms that are used synonymously with inference-time scaling.

To understand how reasoning models are being developed and improved, I think it remains useful to look at the different techniques separately. The development process of DeepSeek's reasoning models is something I discussed in my previous article, Understanding Reasoning LLMs (https://magazine.sebastianraschka.com/p/understanding-reasoning-llms).

Before we look into the different areas of progress on the reasoning model front, with a focus on the inference-time compute scaling category, let me at least provide a brief overview of all the different categories.

1. Inference-time compute scaling

This category includes methods that improve model reasoning capabilities at inference time without training or modifying the underlying model weights. The core idea is to trade increased computational resources for improved performance, which helps make even fixed models more capable through techniques such as chain-of-thought reasoning and various sampling procedures. While I categorize inference-time compute scaling separately to focus on methods in this context, it is important to note that this technique can be applied to any LLM. For example, OpenAI developed its o1 model using reinforcement learning, and then additionally leveraged inference-time compute scaling.

Different search-based methods rely on a process-reward-based model to select the best answer. Annotated figure from the LLM Test-Time Compute paper, https://arxiv.org/abs/2408.03314

The remainder of this article will be focused on the recent research advances in the inference-time scaling category for improving the reasoning capabilities of LLMs. Let me start with a more detailed discussion of a paper that serves as an example of inference-time scaling.

1. "s1: Simple test-time scaling"

So, one of the interesting recent research papers in this category is s1: Simple test-time scaling (https://arxiv.org/abs/2501.19393).

Correlation between response accuracy and length. Annotated figure from https://arxiv.org/abs/2501.19393.

They found their budget-forcing method more effective than other inference-scaling techniques I've discussed, like majority voting (a minimal sketch of majority voting is shown below). If there's something to criticize or improve, I would've liked to see results for more sophisticated parallel inference-scaling methods, like beam search, lookahead search, or the compute-optimal search described in Google's "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters" paper.
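To make that baseline concrete, here is a minimal sketch of majority voting (also known as self-consistency). This is my own illustrative example, not code from the s1 paper; generate_answer is a hypothetical stand-in for whatever sampling call your LLM stack provides, with a toy answer distribution so the snippet runs on its own.

import random
from collections import Counter

def generate_answer(prompt):
    # Hypothetical stand-in for an LLM call that samples a chain of thought
    # at nonzero temperature and returns only the extracted final answer.
    return random.choice(["42", "42", "42", "41"])  # toy answer distribution

def majority_vote(prompt, num_samples=16):
    # Parallel inference-time scaling: sample several answers independently
    # and return the most frequent one.
    answers = [generate_answer(prompt) for _ in range(num_samples)]
    return Counter(answers).most_common(1)[0][0]

print(majority_vote("What is 6 * 7?"))

Budget forcing, in contrast, works sequentially rather than in parallel: instead of sampling more answers, it controls how long a single response is allowed to think.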
2. Test-Time Preference Optimization

Annotated figure from "Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback", https://arxiv.org/abs/2501.12895

3. Thoughts Are All Over the Place

📄 30 Jan, Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs

4. Trading Inference-Time Compute for Adversarial Robustness

Annotated figure from "Trading Inference-Time Compute for Adversarial Robustness", https://arxiv.org/abs/2501.18841

5. Chain-of-Associated-Thoughts

📄 4 Feb, CoAT: Chain-of-Associated-Thoughts Framework for Enhancing Large Language Models Reasoning

6. Step Back to Leap Forward

Annotated figure from "Step Back to Leap Forward: Self-Backtracking for Boosting Reasoning of Language Models", https://arxiv.org/abs/2502.04404

I added this paper here as it's heavily focused on the proposed backtracking inference-time scaling method, which improves reasoning by dynamically adjusting search depth and breadth rather than fundamentally altering the training paradigm (although training with <backtrack> tokens is required).

7. Scaling up Test-Time Compute with Latent Reasoning

📄 7 Feb, Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

8. Can 1B LLM Surpass 405B LLM?

Annotated figure from "Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling", https://arxiv.org/abs/2502.06703

9. Inference-Time Computations for LLM Reasoning and Planning

📄 18 Feb, Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights

10. Inner Thinking Transformer

Annotated figure from "Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking", https://arxiv.org/abs/2502.13842

11. Test Time Scaling for Code Generation

📄 20 Feb, S*: Test Time Scaling for Code Generation, https://arxiv.org/abs/2502.14382

Inference-time scaling can be achieved by parallel scaling (generating multiple answers), sequential scaling (iteratively refining answers), or both, as described in the Google paper from Summer 2024 (Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters).

S* is a test-time compute scaling method designed specifically for code generation that improves both parallel scaling (generating multiple solutions) and sequential scaling (iterative debugging).

Annotated figure from "S*: Test Time Scaling for Code Generation", https://arxiv.org/abs/2502.14382

The approach operates in two stages:

Stage 1: Generation

The model generates multiple code solutions and iteratively refines them using execution results and test cases provided in the problem prompt. Think of this like a coding competition where a model submits solutions, runs tests, and fixes mistakes (a minimal code sketch of this loop is shown after the example below):

1. The model generates multiple candidate solutions.
2. Each solution is executed on public test cases (predefined input-output pairs).
3. If a solution fails (incorrect output or crashes), the model analyzes the execution results (errors, outputs) and modifies the code to improve it.
4. This refinement process continues iteratively until the model finds solutions that pass the test cases.

For example, suppose the model is asked to implement a function is_even(n) that returns True for even numbers and False otherwise. The model's first attempt might be:
def is_even(n):
    return n % 2  # ❌ Incorrect: should be `== 0`

The model tests this implementation with public test cases:

| Input | Expected | Model Output | Status |
| --- | --- | --- | --- |
| is_even(4) | True | False | ❌ Fail |
| is_even(3) | False | True | ❌ Fail |

After reviewing the results, the model realizes that 4 % 2 returns 0, not True, so it modifies the function:
def is_even(n):
    return n % 2 == 0  # ✅ Corrected

Now the function passes all public tests, completing the debugging phase.
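To make this Stage 1 loop more tangible, the following is a minimal sketch of a generate-and-debug cycle for a single candidate solution. It is my own simplified illustration rather than the authors' implementation: llm_generate_code and llm_revise_code are hypothetical stand-ins for the underlying model calls, S* additionally runs this for many candidate solutions in parallel, and a real system would execute the untrusted code in a sandbox.

def run_public_tests(code, tests):
    # Execute the candidate code and collect a failure message for each
    # public test case, given as (call_expression, expected_output) pairs.
    namespace = {}
    try:
        exec(code, namespace)  # caution: a real system would sandbox this
    except Exception as exc:
        return [f"code failed to execute: {exc}"]
    failures = []
    for call, expected in tests:
        try:
            result = eval(call, namespace)
            if result != expected:
                failures.append(f"{call} returned {result!r}, expected {expected!r}")
        except Exception as exc:
            failures.append(f"{call} raised {exc!r}")
    return failures

def generate_and_debug(problem, tests, llm_generate_code, llm_revise_code, max_rounds=4):
    # Stage 1, simplified: draft a solution, run the public tests, and feed
    # the failure messages back to the model until the tests pass.
    code = llm_generate_code(problem)
    for _ in range(max_rounds):
        failures = run_public_tests(code, tests)
        if not failures:
            break  # all public tests pass; stop debugging
        code = llm_revise_code(problem, code, failures)
    return code

For the is_even example above, tests would be [("is_even(4)", True), ("is_even(3)", False)], and the collected failure messages play the role of the execution feedback the model reasons over when revising its code.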
Stage 2: Selection

Once multiple solutions have passed public tests, the model must choose the best one (if possible). Here, S* introduces adaptive input synthesis to avoid random picking (a minimal sketch of this selection step is shown after the example below):

1. The model compares two solutions that both pass public tests.
2. It asks itself: "Can I generate an input that will reveal a difference between these solutions?"
3. It creates a new test input and runs both solutions on it.
4. If one solution produces the correct output while the other fails, the model selects the better one.
5. If both solutions behave identically, the model randomly picks one.

For example, consider two different implementations of is_perfect_square(n):

import math

def is_perfect_square_A(n):
    return math.isqrt(n) ** 2 == n

def is_perfect_square_B(n):
    return math.sqrt(n).is_integer()

Both pass the provided test cases for simple examples:

n = 25
print(is_perfect_square_A(n)) # ✅ True (Correct)
print(is_perfect_square_B(n)) # ✅ True (Correct)

But when the LLM generates edge cases, we can see one of them fail, so the model would select solution A in this case:

n = 10**16 + 1
print(is_perfect_square_A(n)) # ✅ False (Correct)
print(is_perfect_square_B(n)) # ❌ True (Incorrect)
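Here is a minimal sketch of this pairwise selection step, under the same caveats as before: it is my own simplified rendering of the adaptive input synthesis idea, not the paper's code, and llm_propose_input, llm_pick_correct, and run_solution are hypothetical stand-ins for the model calls and for a (sandboxed) code executor.

import random

def select_between(solution_a, solution_b, run_solution, llm_propose_input, llm_pick_correct):
    # Ask the model for an input that might distinguish the two candidates,
    # execute both candidates on it, and let the model judge the outputs.
    test_input = llm_propose_input(solution_a, solution_b)
    output_a = run_solution(solution_a, test_input)
    output_b = run_solution(solution_b, test_input)
    if output_a == output_b:
        # The candidates behave identically on this input; pick one at random.
        return random.choice([solution_a, solution_b])
    # The outputs differ, so the model decides which output is correct
    # for this input, and the corresponding solution is selected.
    choice = llm_pick_correct(test_input, output_a, output_b)  # returns "A" or "B"
    return solution_a if choice == "A" else solution_b

In the is_perfect_square example, a synthesized edge case such as n = 10**16 + 1 is exactly the kind of input that separates the two candidates and lets the model keep implementation A.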
12. Chain of Draft

📄 25 Feb, Chain of Draft: Thinking Faster by Writing Less

(It will be interesting to see how well GPT-4.5 will perform with o1- or o3-style inference-time scaling.)

Which technique?

However, inference-time compute scaling is not a silver bullet. While methods like Monte Carlo Tree Search, self-backtracking, and dynamic-depth scaling can substantially improve reasoning performance, their effectiveness still depends on the task and its difficulty. As one of the earlier papers showed, there is no inference-time compute scaling technique that performs best across all tasks.

Additionally, many of these approaches trade response latency for improved reasoning, and slow responses can be annoying to some users. For instance, I usually switch from o1 to GPT-4o for simple tasks due to the faster response time.

What's next

Looking ahead, I think we will see many more papers this year centered around the two main branches of "reasoning via inference-time compute scaling" research:

1. Research that is purely centered around developing the best possible model topping the benchmarks.
2. Research that is concerned with balancing cost and performance trade-offs across different reasoning tasks.

Either way, what's nice about inference-time compute scaling is that it can be applied to any type of existing LLM to make it better for specific tasks.

Thinking on Demand

An interesting trend on the industry side is what I refer to as "thinking on demand". Following the release of DeepSeek R1, it feels like companies have been rushing to add reasoning capabilities to their offerings. An interesting development here is that most LLM providers now allow users to enable or disable these "thinking" features. The mechanism is not publicly shared, but it's likely the same model with dialed-back inference-time compute scaling. For instance, Claude 3.7 Sonnet and Grok 3 now have a "thinking" mode that users can enable, whereas OpenAI requires users to switch between models, for example, between GPT-4o/4.5 and o1/o3-mini, if they want to use explicit reasoning models. However, the OpenAI CEO mentioned that GPT-4.5 will likely be their last model that doesn't explicitly have a reasoning or "thinking" mode. On the open-source side, even IBM added an explicit "thinking" toggle to their Granite models.

Overall, the trend of adding reasoning capabilities, whether via inference-time or train-time compute scaling, is a major step forward for LLMs in 2025. In time, I expect that reasoning will no longer be treated as an optional or special feature but will instead become the standard, much as instruction-finetuned or RLHF-tuned models are now the norm over raw pretrained models.

As mentioned earlier, this article focused solely on inference-time compute scaling due to its already considerable length, a result of the very active research on reasoning. In a future article, I plan to cover the interesting work on train-time compute scaling for reasoning.

This magazine is a personal passion project. For those who wish to support me, please consider purchasing a copy of my Build a Large Language Model (From Scratch) book.
(I am confident that you'll get lots out of this book, as it explains how LLMs work at a level of detail that is not found anywhere else.)

Build a Large Language Model (From Scratch) is now available on Amazon.

If you read the book and have a few minutes to spare, I'd really appreciate a brief review. It helps us authors a lot!

Your support means a great deal! Thank you!