We’ve built AI systems that can spend hours hunting across the web, synthesizing information, and writing research reports. But we have almost no way to tell if they’re actually good at this task.
The problem runs deeper than it first appears. Traditional benchmarks work fine for closed-form questions with single correct answers. Feed a system a math problem, check if it matches the known solution, move on. But research is different. There are many valid approaches to answering a question about renewable energy policy, and multiple correct answers depending on what sources you integrate and how you weight them. A static answer key doesn’t capture this nuance.
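To make the contrast concrete, here is a minimal sketch (purely illustrative, not taken from any particular benchmark) of the closed-form grading that static benchmarks rely on:

```python
# Closed-form questions can be graded by comparing against a known answer.
def grade_closed_form(predicted: str, reference: str) -> bool:
    """Exact-match grading: fine for a math problem with one correct solution."""
    return predicted.strip().lower() == reference.strip().lower()

# An open-ended research report has no single reference answer to compare
# against, so this style of check does not transfer to research systems.
```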
There’s a worse problem hiding underneath: static ground truth becomes obsolete. If your benchmark was created last year and a system is researching current events, comparing it to pre-written answers makes no sense. The world has moved on.
Current benchmarks also impose a heavy cost. Creating reliable research tasks requires human annotation at scale, which is expensive and slow. Existing approaches either demand painstaking effort to construct each task, assume evaluation criteria are universal (they're not: a business analyst needs different things than a historian does), or fail completely when systems cite sources that don't exist or skip citations altogether.
DeepResearchEval addresses this by automating both the creation of realistic research challenges and the evaluation of how well systems handle them. The insight that ties everything together: you can’t fairly evaluate research systems without task-specific evaluation criteria, and you can’t verify factual claims without an evaluator that actively hunts for evidence rather than checking a static answer key.
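A minimal sketch of those two ideas, assuming generic `llm` and `search` callables that stand in for an LLM call and a live web search; the function names and prompts here are illustrative assumptions, not DeepResearchEval's actual API:

```python
from typing import Callable

def task_specific_criteria(task: str, persona: str,
                           llm: Callable[[str], str]) -> list[str]:
    """Derive evaluation criteria from the task and the researcher's persona,
    instead of reusing one universal rubric."""
    prompt = (
        f"Task: {task}\nResearcher: {persona}\n"
        "List the criteria a strong report for this task must satisfy, one per line."
    )
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]

def verify_claim(claim: str, cited_source: str | None,
                 llm: Callable[[str], str],
                 search: Callable[[str], str]) -> str:
    """Actively hunt for evidence at evaluation time, rather than comparing
    against a pre-written answer key."""
    if cited_source is None:
        return "uncited"          # citation missing entirely
    evidence = search(claim)      # retrieve current sources, so stale ground truth is not an issue
    verdict = llm(
        f"Claim: {claim}\nCited source: {cited_source}\n"
        f"Retrieved evidence: {evidence}\n"
        "Answer with one word: supported, contradicted, or unverifiable."
    )
    return verdict.strip().lower()
```

In this framing, the rubric is regenerated for every task and the verifier consults live sources, which is what keeps the evaluation fair across domains and current as the world changes.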
What makes a real research task
Before designing a solution, it helps to think about how real research actually works. A person doesn't start with a random question. They first think about who they are, what they're trying to accomplish, and why it matters. A journalist investigating corporate fraud needs different information than a grad student studying historical trade patterns. Their research process, their information needs, and what constitutes a good answer all flow from their identity and stakes.
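As a rough illustration of that idea, a research task can be generated from a persona rather than from a bare question; the field names below are assumptions for the sketch, not the benchmark's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Persona:
    identity: str   # who the researcher is
    goal: str       # what they are trying to accomplish
    stakes: str     # why it matters to them

@dataclass
class ResearchTask:
    persona: Persona
    question: str                              # phrased from the persona's perspective
    information_needs: list[str] = field(default_factory=list)  # what a good answer must cover

journalist_task = ResearchTask(
    persona=Persona(
        identity="investigative journalist covering corporate fraud",
        goal="trace how a shell company moved money between subsidiaries",
        stakes="a publishable, legally defensible story",
    ),
    question="Which filings and court records document the transfers, and what do they show?",
    information_needs=["primary-source filings", "dates and amounts", "named entities"],
)
```

The same question asked by the grad student studying trade patterns would carry different information needs, and therefore different criteria for what counts as a good answer.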