WideSearch: Benchmarking Agentic Broad Info-Seeking

A rigorous benchmark designed to test an agent's ability to search, synthesize, and verify information across the web.

Ryan Wong*, Jiawei Wang*, Junjie Zhao, Li Chen, Yan Gao, Long Zhang, Xuan Zhou, Zuo Wang, Kai Xiang, Ge Zhang, Wenhao Huang, Yang Wang, Ke Wang
ByteDance Seed
*Co-first authors, Corresponding authors

  • 200 challenging tasks
  • 2.3 hrs average human completion time per task
  • 44+ web pages consulted per task, on average

The WideSearch Paradigm

From Tedious Labor to Automated Workflows

Many real-world information-gathering tasks are not hard, just huge. Consider a financial analyst compiling key metrics for all companies in a sector, or a job seeker collecting every vacancy that meets their criteria. The challenge isn't cognitive complexity, but the sheer scale and repetitive nature of the work—a critical productivity bottleneck.

WideSearch is designed to evaluate an agent's ability to automate these tasks, shifting from laborious manual collection to efficient, automated workflows. This shift, however, introduces novel failure modes like hallucination and incompleteness, making rigorous evaluation essential.

Figure: A conceptual comparison of manual and agent-based approaches for WideSearch tasks, illustrating the shift from manual data collection to automated agent workflows.

A New Paradigm: Wide vs. Deep

Current research primarily focuses on "deep" tasks. DeepSearch tackles the "I can't find it" problem of locating hidden facts, while DeepResearch addresses the "I can't write it well" problem of synthesizing reports.

In sharp contrast, WideSearch tackles the "I don't have time to do it" problem. It requires agents to systematically find and organize large-scale information into a structured table, shifting the primary challenge from deep reasoning to achieving exhaustiveness and fidelity at scale.
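
To make the task format concrete, the following is a hypothetical illustration (not an actual benchmark task) of a WideSearch-style query and the structured table an agent is expected to return; the field names and values are invented for exposition.

```python
# Hypothetical WideSearch-style task (invented for illustration, not from the benchmark):
# the query defines a wide table schema, and the agent must fill in every qualifying row.
task = {
    "query": "List every company in sector S with its headquarters city, "
             "founding year, and 2024 revenue.",
    "output_schema": ["company", "headquarters", "founded", "revenue_2024"],
}

# The expected answer is one fully populated row per qualifying entity:
answer = [
    {"company": "Acme Corp", "headquarters": "Austin", "founded": 1999, "revenue_2024": "1.2B"},
    {"company": "Globex",    "headquarters": "Berlin", "founded": 2004, "revenue_2024": "0.8B"},
    # ...every remaining company, with no missing rows and no hallucinated ones
]
```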

Figure: An overview and detailed comparison of DeepSearch, DeepResearch, and WideSearch, contrasting their core tasks and value.

A Rigorous Methodology

Six Design Principles

  • High Search Volume and Breadth: Tasks require extensive search and collation of numerous data points.
  • Temporal and Contextual Invariance: Ground-truth answers are stable over time and across contexts.
  • Objective Verifiability: Facts are deterministic, allowing for consistent and reproducible scoring.
  • Public Accessibility: All required information is publicly available via standard search engines.
  • Reliance on External Tools: Tasks are designed to be unsolvable using only an LLM's parametric knowledge.
  • Scenario Diversity: The benchmark spans 18 distinct industries to ensure generalizability.

Five-Stage Curation Pipeline

Figure: Our five-stage data curation and validation pipeline.

Benchmark Statistics

Quantitative analysis substantiates the complexity of our benchmark: human studies show that completing a single task takes 2.3 hours on average and requires consulting 44+ web pages.

Figure: Distribution of human completion time across 100 randomly sampled tasks.

Figure: Distribution of source web pages consulted across 100 randomly sampled tasks.

Figure: Distribution of the 18 distinct topics across all 200 tasks.

Evaluation Framework & Metrics

Automated Evaluation Pipeline

We employ a robust, three-stage automated pipeline to ensure accurate and scalable scoring, combining deterministic checks with LLM-as-a-judge for semantic nuances.

Figure: Our three-stage automated evaluation pipeline.
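
As a rough illustration of how such a pipeline can combine the two kinds of checks, the sketch below compares a predicted cell against a ground-truth cell using deterministic normalization first and an LLM-as-a-judge fallback second. The `normalize` and `llm_judge` names are placeholders for this sketch, not the repository's actual API.

```python
import re

def normalize(value) -> str:
    """Deterministic cleanup: collapse whitespace, lowercase, drop thousands separators."""
    v = re.sub(r"\s+", " ", str(value)).strip().lower()
    return v.replace(",", "")  # e.g. "1,200" -> "1200"

def cells_match(pred, gold, llm_judge=None) -> bool:
    """Stage 1: deterministic comparison; Stage 2: LLM judgment for semantic equivalence."""
    if normalize(pred) == normalize(gold):
        return True
    if llm_judge is not None:
        # llm_judge is any callable that maps a prompt to a True/False verdict.
        return llm_judge(f"Do these two values refer to the same fact?\nA: {pred}\nB: {gold}")
    return False
```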

Key Metrics

  • Success Rate (SR): The most stringent metric; requires a perfect, 100% match with the ground-truth table.
  • Row-level F1 Score: Treats each table row as a unit, measuring the agent's ability to retrieve complete and correct records.
  • Item-level F1 Score: The most granular metric, evaluating each individual cell for fine-grained accuracy.
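
The sketch below shows one way these three metrics can be computed, assuming each table is represented as a list of row dictionaries keyed by column name and rows are identified by a designated key column; the official evaluation scripts may normalize values and match rows differently.

```python
def _f1(tp: int, n_pred: int, n_gold: int) -> float:
    """Harmonic mean of precision (tp / n_pred) and recall (tp / n_gold)."""
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def item_f1(pred_rows, gold_rows, key):
    """Item-level F1: every (row key, column, value) cell is scored individually."""
    pred = {(row[key], col, val) for row in pred_rows for col, val in row.items()}
    gold = {(row[key], col, val) for row in gold_rows for col, val in row.items()}
    return _f1(len(pred & gold), len(pred), len(gold))

def row_f1(pred_rows, gold_rows):
    """Row-level F1: a predicted row counts only if it matches a gold row in every cell."""
    pred = {tuple(sorted(row.items())) for row in pred_rows}
    gold = {tuple(sorted(row.items())) for row in gold_rows}
    return _f1(len(pred & gold), len(pred), len(gold))

def success(pred_rows, gold_rows) -> int:
    """Success Rate contribution: 1 only for a perfect match with the ground-truth table."""
    return int({tuple(sorted(r.items())) for r in pred_rows}
               == {tuple(sorted(r.items())) for r in gold_rows})
```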

Reporting Strategy

We report performance using two aggregation strategies over N independent runs:

  • Avg@N: The average performance across N trials, measuring reliability.
  • Pass@N (for SR) and Max@N (for the F1 scores): The best result achieved across the N trials, capturing peak capability.
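
For a single task, this aggregation over N runs could look like the sketch below (the field names are illustrative, not the official script's API); the leaderboard numbers are presumably these per-task values averaged over all tasks.

```python
def aggregate(runs):
    """Aggregate one task's scores over N independent runs (Avg@N, Pass@N, Max@N)."""
    n = len(runs)
    return {
        "SR Avg@N":      sum(r["success"] for r in runs) / n,
        "SR Pass@N":     float(any(r["success"] for r in runs)),  # any perfect run
        "Row F1 Avg@N":  sum(r["row_f1"] for r in runs) / n,
        "Row F1 Max@N":  max(r["row_f1"] for r in runs),
        "Item F1 Avg@N": sum(r["item_f1"] for r in runs) / n,
        "Item F1 Max@N": max(r["item_f1"] for r in runs),
    }

# Example with N = 4 trials on one task:
runs = [
    {"success": 0, "row_f1": 0.40, "item_f1": 0.62},
    {"success": 1, "row_f1": 1.00, "item_f1": 1.00},
    {"success": 0, "row_f1": 0.35, "item_f1": 0.58},
    {"success": 0, "row_f1": 0.48, "item_f1": 0.70},
]
print(aggregate(runs))  # SR Avg@4 = 0.25, SR Pass@4 = 1.0, Row F1 Max@4 = 1.0, ...
```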

Leaderboard

| Model / System | Agent Type | SR (Avg@4) | SR (Pass@4) | Row F1 (Avg@4) | Row F1 (Max@4) | Item F1 (Avg@4) | Item F1 (Max@4) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Claude Sonnet 4 (Thinking) | Single Agent | 2.3 | 9.0 | 31.7 | 44.1 | 57.9 | 70.3 |
| Gemini 2.5 Pro | Single Agent | 1.5 | 7.0 | 30.0 | 45.8 | 51.0 | 70.0 |
| OpenAI o3 | Single Agent | 4.5 | 9.0 | 34.0 | 44.1 | 52.6 | 62.3 |
| Kimi K2 | Single Agent | 1.1 | 6.0 | 29.7 | 43.7 | 54.4 | 70.5 |
| DeepSeek-R1 | Single Agent | 0.4 | 2.0 | 20.7 | 35.0 | 41.3 | 62.4 |
| Doubao-Seed-1.6 (Thinking) | Single Agent | 2.6 | 6.0 | 30.0 | 46.2 | 48.3 | 68.9 |
| Doubao-Seed-1.6 (Non-Thinking) | Single Agent | 1.0 | 5.0 | 27.2 | 42.3 | 49.0 | 68.2 |
| Claude Sonnet 4 (Thinking) | Multi-Agent | 3.6 | 6.5 | 38.5 | 52.2 | 62.2 | 73.1 |
| Gemini 2.5 Pro | Multi-Agent | 2.0 | 6.5 | 33.5 | 44.6 | 57.4 | 66.3 |
| OpenAI o3 | Multi-Agent | 5.1 | 9.5 | 37.8 | 50.5 | 57.3 | 68.9 |
| Kimi K2 | Multi-Agent | 3.0 | 6.5 | 36.2 | 49.6 | 61.2 | 70.7 |
| DeepSeek-R1 | Multi-Agent | 0.8 | 3.0 | 22.9 | 36.6 | 44.3 | 60.3 |
| Doubao-1.6 | Multi-Agent | 2.5 | 5.5 | 34.0 | 48.9 | 54.6 | 69.7 |
| Doubao-Seed-1.6 (Non-Thinking) | Multi-Agent | 2.1 | 4.5 | 29.7 | 42.7 | 52.8 | 65.1 |
| Claude Sonnet 4 (Thinking) | E2E System | 2.5 | 5.0 | 24.1 | 33.5 | 48.4 | 58.5 |
| Gemini 2.5 Pro | E2E System | 4.3 | 8.0 | 36.6 | 45.4 | 59.1 | 67.2 |
| OpenAI o3 | E2E System | 3.0 | 5.5 | 23.9 | 36.0 | 45.5 | 56.5 |

Dataset & Code

Get the Dataset

The full WideSearch benchmark dataset, including all 200 tasks and ground-truth tables, is available for download.

Access the Code

Our GitHub repository contains the evaluation scripts, documentation, and instructions for running the benchmark.

Paper & Citation

For a detailed description of the benchmark, methodology, and experimental results, please refer to our paper.

BibTeX

@misc{wong2025widesearchbenchmarkingagenticbroad,
  title={WideSearch: Benchmarking Agentic Broad Info-Seeking},
  author={Ryan Wong and Jiawei Wang and Junjie Zhao and Li Chen and Yan Gao and Long Zhang and Xuan Zhou and Zuo Wang and Kai Xiang and Ge Zhang and Wenhao Huang and Yang Wang and Ke Wang},
  year={2025},
  eprint={2508.07999},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.07999},
}