Reasoning models are delivering impressive performance on challenging tasks, but they have a costly flaw: they generate excessive tokens that don’t improve accuracy. This problem, known as overthinking, wastes computational resources and increases inference costs unnecessarily.
Figure 3: Difficulty distribution across the four datasets evaluated in this work. Including dumb500 extends the analysis to the easy end of the spectrum, so overthinking behavior can be characterized across the full range of question difficulty.
Specialized Evaluation Methods for Different Question Types
Each domain in dumb500 requires different evaluation approaches:
- Math questions: Evaluated with simple accuracy, identical to MATH500, GPQA, and ZebraLogic
- Code questions: Include test cases for the program described in the prompt, graded with a Python-based autograder
- Chat questions: Evaluated on requirements such as appropriateness and conciseness by a GPT-4o judge
- Task questions: Assessed on generic requirements plus question-specific instruction-following criteria
This comprehensive evaluation framework allows for consistent assessment across diverse question types.
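To make the code-domain evaluation concrete, here is a minimal sketch of a test-case autograder in the spirit of the Python-based grader described above. The function name, test-case format, and timeout are assumptions for illustration, not the authors' implementation.

```python
import subprocess
import sys

def grade_code(program: str, test_cases: list[tuple[str, str]]) -> float:
    """Score a generated Python program against (stdin, expected stdout) pairs.

    Returns the fraction of test cases whose output matches exactly.
    """
    if not test_cases:
        return 0.0
    passed = 0
    for stdin_text, expected in test_cases:
        try:
            result = subprocess.run(
                [sys.executable, "-c", program],  # run the model's program in a subprocess
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=5,  # guard against runaway or non-terminating generations
            )
        except subprocess.TimeoutExpired:
            continue  # count a timeout as a failed case
        if result.stdout.strip() == expected.strip():
            passed += 1
    return passed / len(test_cases)

if __name__ == "__main__":
    # Illustrative call: a tiny "generated" program and two test cases.
    generated = "print(int(input()) * 2)"
    print(grade_code(generated, [("3", "6"), ("10", "20")]))  # -> 1.0
```

The chat and task domains would swap the subprocess call for a judge-model or rubric check, but the per-question scoring interface stays the same.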
Analyzing Model Performance from Easy to Hard Questions
When the researchers ran the same models on dumb500, they found that token spend shows no positive correlation with accuracy on its easy math questions, and in some of the other domains the relationship is actually negative.
Figure 4: Relationship between average token spend and average score for the evaluated models on each subset of dumb500.
This confirms that models are poorly calibrated on easy problems, often spending unnecessary tokens without improving performance. This finding aligns with research on thinking in tokens in language modeling, which examines how models allocate computational resources during inference.
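As a rough illustration of how such a relationship can be measured, the sketch below computes a rank correlation between token spend and score for one domain. The helper name and the choice of Spearman correlation are assumptions for illustration, not necessarily what the paper uses.

```python
import numpy as np
from scipy.stats import spearmanr

def token_score_correlation(tokens: list[int], scores: list[float]) -> float:
    """Rank correlation between per-response token counts and graded scores.

    A value near zero (or negative) means that spending more tokens on this
    subset is not buying additional accuracy.
    """
    rho, _p_value = spearmanr(np.asarray(tokens), np.asarray(scores))
    return float(rho)

# Illustrative call with made-up numbers, purely to show the interface:
print(token_score_correlation([180, 420, 950, 2100], [1.0, 1.0, 0.0, 1.0]))
```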
ThoughtTerminator: A Solution to Control Overthinking