A new cross-domain benchmark reveals how the leading AI research tools perform on real-world production tasks
Two AI-generated research reports land on your desk before a major decision. Both are polished, confidently written, and well-structured, but they reach different conclusions. Which one do you trust, and how would you even begin to find out? In "DRACO: a Cross-Domain Benchmark for Deep Research Accuracy, Completeness, and Objectivity," a team at Perplexity and an Assistant Professor of Business Administration at Harvard Business School, affiliated with the Digital Data Design Institute at Harvard (D^3), present a rigorous new benchmark for measuring how well AI deep research systems actually perform on real-world production tasks.
Key Insight: A New Standard for Deep Research Evaluation
"We introduce a cross-domain benchmark derived from real-world production deep research tasks designed to bridge the gap between AI evaluations and authentic research needs." [1]
AI "deep research" systems, tools that can autonomously decompose a complex question, search hundreds of sources, reconcile conflicting evidence, and synthesize findings into a cited report, are increasingly being used for high-stakes analytical work in areas such as finance, law, and medicine. Unlike a simple chatbot response, these systems operate more like an analyst running an independent research process. While this technology has been advancing quickly, the frameworks for evaluating it have not kept pace. The authors argue that evaluating deep research must reflect realistic use cases, span domains, account for region-specific sources, and probe multiple system capabilities, such as planning, search, and reasoning, all at once.
Key Insight: Tasks Deeply Rooted in Practice
"Our main contribution is a curated set of benchmark tasks that closely mirror real deep research needs and how people use deep research agents in practice." [2]
Many AI benchmarks are built by researchers and experts imagining what hard questions look like. DRACO takes a different approach: its 100 tasks were sourced directly from actual user queries submitted to Perplexity's deep research system in fall 2025. Specifically, researchers sampled from high-difficulty requests where users had expressed dissatisfaction, making these exactly the kinds of tasks where AI systems tend to struggle. Those raw queries were then anonymized, augmented to add specificity and scope, and filtered to ensure each task was objectively evaluable, appropriately bounded, and genuinely challenging. The results span 10 domains drawing on sources from 40 countries across five regions.
Key Insight: Rating Real-World Complexity
"Twenty-six domain experts, including medical professionals, attorneys, financial analysts, software engineers, and designers, were recruited to develop rubrics for selected tasks." [3]
DRACO's grading rubrics were developed through a rigorous human-expert pipeline: an initial rubric is drafted by one expert, reviewed and refined by a second, subjected to a "saturation test" to ensure current systems cannot easily exceed 90% (which would indicate an overly easy task or lenient rubric), and finally validated by a third and fourth expert for quality assurance. Each task was ultimately assessed across an average of 39 criteria spanning four dimensions: factual accuracy, breadth and depth of analysis, presentation quality, and citation quality.
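To make the rubric idea concrete, here is a minimal sketch of how weighted-criteria scoring might work. The criterion names, weights, and normalization below are hypothetical illustrations; the article does not specify DRACO's exact scoring formula, only that rubrics contain weighted criteria, including negative ones that penalize failure modes.

```python
# Illustrative sketch of rubric-based scoring with weighted criteria.
# Criterion names, weights, and the normalization scheme are hypothetical;
# DRACO's exact scoring formula is not given in the article.

def score_report(judgments: dict[str, bool], rubric: dict[str, float]) -> float:
    """Score one report against a rubric.

    judgments maps each criterion to whether the report satisfies it;
    rubric maps each criterion to its weight (negative for penalties).
    The score is the earned weight as a fraction of the maximum
    attainable (sum of positive weights), clamped to [0, 1].
    """
    max_positive = sum(w for w in rubric.values() if w > 0)
    earned = sum(w for c, w in rubric.items() if judgments.get(c, False))
    return max(0.0, min(1.0, earned / max_positive))

# Hypothetical rubric for a medical research task.
rubric = {
    "cites primary regulatory approval document": 2.0,  # factual accuracy
    "covers all three competitor treatments": 1.5,      # breadth and depth
    "report has clear section structure": 0.5,          # presentation
    "recommends an unapproved dosage": -4.0,            # unsafe claim, heavy penalty
}

good = {
    "cites primary regulatory approval document": True,
    "covers all three competitor treatments": True,
    "report has clear section structure": True,
}
unsafe = dict(good, **{"recommends an unapproved dosage": True})

print(score_report(good, rubric))    # 1.0
print(score_report(unsafe, rubric))  # 0.0
```

Note how a single heavily weighted negative criterion can erase an otherwise perfect score, mirroring the design principle that in high-stakes domains an unsafe recommendation outweighs polished, correct material elsewhere in the report.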
Key Insight: Progress, But Gaps Remain
"Our evaluation of frontier deep research systems reveals that while significant progress has been made (especially in presentation quality), substantial headroom remains (especially in factual accuracy)." [4]
The evaluation results indicate that while agents have improved across all rubric dimensions (and now excel in presentation quality), they continue to struggle with factual accuracy. This may partly stem from design choices: roughly half of all criteria focused on verifiable factual claims, and the rubrics also included negative criteria penalizing specific failure modes. In domains like medicine and law, these penalties are particularly severe, as incorrect or unsafe recommendations carry heavy negative weights. This reflects a core design principle: in high-stakes domains, what AI gets wrong matters as much as what it gets right.
Why This Matters
As we increasingly rely on AI for high-stakes tasks, from brainstorming and research to actual execution, the bottleneck is no longer speed; it's accuracy. The area where AI performs best, producing polished, well-structured output, is precisely where it's hardest for a non-specialist to detect errors. For business leaders, DRACO's task-and-rubric design offers a concrete blueprint for evaluating and choosing research agents: define success criteria, test on representative workloads, and be clear about how you'll know when the system is wrong.
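That blueprint can be sketched as a small comparison harness: run each candidate agent on a representative set of tasks, score per dimension, and gate on a minimum factual-accuracy bar. Everything here is an assumption for illustration (the agent names, the four dimension labels, the 0.7 floor); it is not DRACO's methodology, only one way a team might operationalize the advice above.

```python
# Hypothetical harness for choosing among research agents: aggregate
# per-task rubric scores by dimension and require a factual-accuracy
# floor. All names, dimensions, scores, and thresholds are illustrative.
from statistics import mean

DIMENSIONS = ("factual_accuracy", "analysis", "presentation", "citations")

def compare_agents(results: dict[str, list[dict[str, float]]],
                   accuracy_floor: float = 0.7) -> dict[str, dict]:
    """results maps agent name -> list of per-task scores in [0, 1],
    one dict per task. Returns mean scores per dimension plus whether
    the agent clears the factual-accuracy bar (the 'how you'll know
    when it's wrong' criterion)."""
    summary = {}
    for agent, tasks in results.items():
        means = {d: mean(t[d] for t in tasks) for d in DIMENSIONS}
        summary[agent] = {
            "means": means,
            "passes": means["factual_accuracy"] >= accuracy_floor,
        }
    return summary

# Two fictional agents on a two-task workload (real workloads would be larger).
results = {
    "agent_a": [
        {"factual_accuracy": 0.55, "analysis": 0.70, "presentation": 0.95, "citations": 0.75},
        {"factual_accuracy": 0.65, "analysis": 0.65, "presentation": 0.90, "citations": 0.70},
    ],
    "agent_b": [
        {"factual_accuracy": 0.75, "analysis": 0.80, "presentation": 0.85, "citations": 0.80},
        {"factual_accuracy": 0.80, "analysis": 0.70, "presentation": 0.90, "citations": 0.85},
    ],
}
summary = compare_agents(results)
for agent, report in summary.items():
    print(agent, "passes" if report["passes"] else "fails", report["means"])
```

In this toy run, agent_a scores highest on presentation yet fails the accuracy gate, exactly the failure profile the benchmark warns about: polish is the easiest dimension to see and the worst one to select on.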
Bonus
While it seems self-evident that we want the best and most accurate information from AI, that's actually not always the case. Check out "Explanations on Mute: Why We Turn Away From Explainable AI" to see why.
References
[1] Joey Zhong et al., "DRACO: a Cross-Domain Benchmark for Deep Research Accuracy, Completeness, and Objectivity," arXiv preprint arXiv:2602.11685 (2026): 2.
[2] Zhong et al., "DRACO": 2.
[3] Zhong et al., "DRACO": 5.
[4] Zhong et al., "DRACO": 12.
Meet the Authors

is an Assistant Professor of Business Administration at Harvard Business School and affiliated with the Digital Data Design Institute at Harvard (D^3).
Additional Authors (Perplexity): Joey Zhong, Hao Zhang, Clare Southern, Thomas Wang, Kate Jung, Shu Zhang, Denis Yarats, Johnny Ho, Jerry Ma