
The AI Deep Research Race

A new cross-domain benchmark reveals how the leading AI research tools perform on real-world production tasks

Two AI-generated research reports land on your desk before a major decision. Both are polished, confidently written, and well-structured, but they reach different conclusions. Which one do you trust, and how would you even begin to find out? In "DRACO: a Cross-Domain Benchmark for Deep Research Accuracy, Completeness, and Objectivity," a team at Perplexity and an Assistant Professor of Business Administration at Harvard Business School and affiliate with the Digital Data Design Institute at Harvard (D^3) present a rigorous new benchmark for measuring how well AI deep research systems actually perform on real-world production tasks. [1]

Why This Matters

As we increasingly rely on AI for high-stakes tasks, from brainstorming and research to actual execution, the bottleneck is no longer speed but accuracy. The area where AI performs best, producing polished, well-structured output, is precisely where it's hardest for a non-specialist to detect errors. For business leaders, DRACO's task-and-rubric design offers a concrete blueprint for evaluating and choosing research agents: define success criteria, test on representative workloads, and decide in advance how you'll know when an agent is wrong.
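To make that blueprint concrete, here is a minimal sketch in Python of a rubric-based evaluation harness in the spirit of DRACO's task-and-rubric design. The `Criterion` and `Task` classes, the weights, and the `toy_grade` function are illustrative assumptions rather than the paper's actual implementation; the pattern is what matters: fix the tasks, write explicit per-criterion rubrics, and produce scores you can compare across agents.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str         # e.g., "accuracy", "completeness", "objectivity"
    weight: float     # relative importance in the overall score
    description: str  # what a grader looks for

@dataclass
class Task:
    prompt: str               # the research question posed to the agent
    rubric: list[Criterion]   # success criteria defined before testing

def score_report(report: str, task: Task, grade) -> float:
    """Return a weighted rubric score in [0, 1].

    `grade(report, criterion)` yields a per-criterion score in [0, 1];
    in practice that judgment comes from a human grader or an LLM judge.
    """
    total = sum(c.weight for c in task.rubric)
    return sum(c.weight * grade(report, c) for c in task.rubric) / total

# One representative task with an explicit, pre-committed rubric.
task = Task(
    prompt="Summarize the competitive landscape for grid-scale batteries.",
    rubric=[
        Criterion("accuracy", 0.5, "Claims match verifiable sources."),
        Criterion("completeness", 0.3, "Covers major vendors and trends."),
        Criterion("objectivity", 0.2, "Presents trade-offs without slant."),
    ],
)

# Toy grader for demonstration only; swap in real judging.
def toy_grade(report: str, criterion: Criterion) -> float:
    return 1.0 if criterion.name in report.lower() else 0.5

print(score_report("Accuracy and completeness notes ...", task, toy_grade))
```

Swapping a human grader or an LLM judge in for `toy_grade` lets the same harness compare several research agents on an identical workload, which is exactly the kind of apples-to-apples test the benchmark argues for.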

Bonus

While it seems self-evident that we want the best and most accurate information from AI, that's actually not always the case. Check out "Explanations on Mute: Why We Turn Away From Explainable AI" to see why.

References

[1] Joey Zhong et al., "DRACO: a Cross-Domain Benchmark for Deep Research Accuracy, Completeness, and Objectivity," arXiv preprint arXiv:2602.11685 (2026): 2.

[2] Zhong et al., "DRACO": 2.

[3] Zhong et al., "DRACO": 5.

[4] Zhong et al., "DRACO": 12.


Link to the D^3 insight article

Sign up for our newsletter to stay up to date with D^3 news and research: /#join-our-community