
BIRD-INTERACT

Re-imagining Text-to-SQL Evaluation via Lens of Dynamic Interactions

Hi! BIRD-INTERACT

We are grateful that BIRD-SQL 2023 has brought insights and contributions to the Text-to-SQL community in the LLM era, under the supervision of DB & DM experts. We also sincerely thank the community for all its feedback! Our work has been featured by DeepMind and OpenAI.

This year, in collaboration with Google Cloud, we are launching BIRD-SQL 2025, which will cover a wide range of professional DBs and their knowledge in real-world applications.

BIRD-INTERACT, an interactive text-to-SQL benchmark, re-imagines Text-to-SQL evaluation through the lens of dynamic interactions. The environment blends a hierarchical knowledge base, database documentation, and a function-driven user simulator to recreate authentic enterprise environments across full CRUD operations. It offers two rigorous test modes: (1) passive Conversational Interaction and (2) active Agentic Interaction, spanning 600 annotated tasks including Business Intelligence (BI), CRUD operations, etc., each guarded by executable test cases. Typical evaluations trigger 4,374-12,214 interaction turns between the model and the user simulator, while state-of-the-art reasoning models currently solve only ≈24% and ≈18% of tasks in the Lite version, and ≈16% of tasks in the Full version, underscoring the benchmark's challenge. Together, these features position BIRD-Interact as a decisive yardstick for advancing interactive, production-ready database assistants. The two test modes supported by BIRD-Interact cover most interactions in real-world applications:

  • c-Interact: Conversational Interaction, a passive mode with a fixed workflow.
  • a-Interact: Agentic Interaction, an embodied active mode where the workflow is dynamic and led by the model.
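To make the contrast concrete, here is a minimal sketch of the two interaction loops. All interfaces here (`model`, `user`, the action tuple) are illustrative assumptions, not the benchmark's actual API:

```python
from typing import Callable, List, Tuple

def c_interact(model: Callable, user_turns: List[str]) -> List[str]:
    """Passive conversational mode: the workflow is fixed -- the user
    simulator drives a predetermined sequence of turns and the model
    only responds to each one."""
    history, replies = [], []
    for turn in user_turns:
        history.append(("user", turn))
        reply = model(history)            # model answers each scripted turn
        history.append(("model", reply))
        replies.append(reply)
    return replies

def a_interact(model: Callable[..., Tuple[str, str]],
               user: Callable[[str], str],
               budget: int) -> str:
    """Active agentic mode: the model leads the workflow -- it may ask
    clarifying questions (spending interaction budget) or submit its
    final SQL at any point."""
    history = []
    for _ in range(budget):
        action, payload = model(history)  # model chooses its next action
        if action == "submit":
            return payload                # final SQL answer
        history.append(("model", payload))
        history.append(("user", user(payload)))  # simulator responds
    _, payload = model(history)
    return payload                        # forced submission when budget runs out
```

The key design difference: in c-Interact the simulator's turns are the outer loop, while in a-Interact the model's decisions are.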

There are three versions of Bird-Interact dataset:

  • mini-interact: A portable, easy-to-evaluate version containing 300 tasks specifically for SQLite, with ambiguity resolution and follow-up queries decoupled into separate tasks. A trained, human-aligned local user simulator is deployed.
  • bird-interact-lite-exp: A lite version containing 270 tasks specifically for PostgreSQL. This is a good starting point for quick experimentation.
  • bird-interact-full: The full version of Bird-Interact, comprising 600 tasks specifically for PostgreSQL.

News

  • Sep. 18, 2025: Announcement: Please note that when Docker loads the databases before evaluation, errors may occasionally occur (they will not terminate the process but will appear in the Docker logs). As a result, some databases may fail to load properly, leaving them empty and producing abnormally low evaluation scores.
    👉 We strongly recommend checking the Docker logs for errors before running the evaluation and verifying that all databases have loaded successfully.
  • Aug. 26, 2025: We're excited to announce the release of the BIRD-Interact-Full (600) set! It's a tough one: the best LLMs achieve only a 16.33% success rate, with just 10.0% on the c-Interact and a-Interact portions. For more details, please see the project website. We'll be sending the ground-truth SQLs & test cases to our mailing list this week. If you want early access, please send an email as instructed on the site for an automatic download. On another note, we've also released a SQLite version of LiveSQLBench-Lite for easier local research. The full LiveSQLBench-Base and -Large versions are coming soon!
  • May. 23, 2025: We have released bird-interact-lite-exp (270 tasks). Check out the data in Hugging Face and the newest code in GitHub. The full set of PostgreSQL will be released later. Have fun! Thanks!

LiveSQLBench Support

BIRD-Interact is built on LiveSQLBench, so all databases and tasks are live and continuously updated. LiveSQLBench (BIRD-SQL Pro v0.5) is a contamination-free, evolving benchmark for evaluating LLMs on real-world text-to-SQL tasks.

GT SQLs & Test Cases

To mitigate data leakage, please feel free to email bird.bench25@gmail.com for the solution SQLs and test cases. Delivery is quite fast.

user represents the user simulator, and system represents the evaluated model. Highlighted text shows ambiguity related to the hierarchical knowledge base (HKB).

c-Interact Example

a-Interact Example

Submission

BIRD 2025 will accept more flexible submission pipelines. Please contact bird.bench25@gmail.com with the tag [bird-interact] in the title if you have any questions.

Subscribe to BIRD Update

BIRD is a long-term research project aimed at bridging the gap between semantic parsing models and successful database applications. To receive the latest updates on the dataset, you can leave your email address.

Email Subscription

Citation

@misc{birdinteract2025,
  author       = {BIRD Team},
  title        = {BIRD-Interact: Re-imagine Text-to-SQL Evaluation via Lens of Dynamic Interactions},
  year         = {2025},
  howpublished = {https://github.com/bird-bench/BIRD-Interact},
  note         = {Accessed: 2025-06-01}
}
              
@article{li2024can,
  title={Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls},
  author={Li, Jinyang and Hui, Binyuan and Qu, Ge and Yang, Jiaxi and Li, Binhua and Li, Bowen and Wang, Bailin and Qin, Bowen and Geng, Ruiying and Huo, Nan and others},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}
              
BIRD-INTERACT Leaderboard

  • Reward* means the normalized global reward, which is at most 100.
  • Phase-1 success rate (SR) means the success rate of the ambiguity-resolution phase.
  • Phase-1-Debug SR means the success rate of the ambiguity-resolution phase with a chance of debugging.
  • Phase-2 SR means the success rate of both Phase 1 and the follow-up question phase.
  • Phase-2-Debug SR means the success rate of both Phase 1 and the follow-up question phase with a chance of debugging.
  • Tier Classification (By Ranking):
    • 🏆 Excellent Chat: The Best!
    • 💎 Good Chat: Top 30%
    • 🌟 Standard: Top 60%
    • ⚪ Basic: Bottom 40%

    🧑‍💻 User Simulator: GPT-4o.
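The metrics above can be sketched in a few lines. This is a minimal sketch: the outcome schema, field names, and the per-task reward weighting are our assumptions (the official Reward* weighting may differ); only the success-rate definitions follow directly from the list above.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class TaskOutcome:
    # Hypothetical per-task outcome record; field names are illustrative.
    phase1: bool        # ambiguity-resolution phase solved
    phase1_debug: bool  # phase 1 solved when a debugging chance is allowed
    phase2: bool        # phase 1 AND the follow-up question both solved
    phase2_debug: bool  # phase 2 solved with a debugging chance

def success_rate(flags: List[bool]) -> float:
    """Percentage of tasks for which the flag is True."""
    return 100.0 * sum(flags) / len(flags)

def report(outcomes: List[TaskOutcome]) -> Dict[str, float]:
    """Success rates as percentages; Reward* is sketched as the mean of a
    per-task reward in [0, 1] scaled to 100, so it is at most 100 by
    construction (the equal phase weighting here is an assumption)."""
    per_task_reward = [(o.phase1 + o.phase2) / 2 for o in outcomes]
    return {
        "Phase-1 SR": success_rate([o.phase1 for o in outcomes]),
        "Phase-1-Debug SR": success_rate([o.phase1_debug for o in outcomes]),
        "Phase-2 SR": success_rate([o.phase2 for o in outcomes]),
        "Phase-2-Debug SR": success_rate([o.phase2_debug for o in outcomes]),
        "Reward*": 100.0 * sum(per_task_reward) / len(per_task_reward),
    }
```

For example, with one fully solved task and one fully failed task, every SR and the sketched Reward* come out to 50.0.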

    Interaction-Time Scaling (ITS)

    Interaction-Time Scaling (ITS) refers to a model's ability to continuously increase its end performance through multi-turn interactions. When this interactive performance surpasses the model's idealized single-turn performance on a fully specified, unambiguous task, we say it satisfies the ITS law. As user patience grows and interaction turns accumulate, performance keeps improving, demonstrating that the model can sustain effective communication over extended dialogue. Currently, only claude-3-7-sonnet satisfies the ITS law.
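The ITS law as described above can be checked mechanically. The sketch below is an assumed operationalization: we take "performance keeps improving" as non-decreasing reward across turn budgets, and require the final interactive reward to exceed the single-turn baseline; the exact criterion used by the benchmark may differ.

```python
from typing import Sequence

def satisfies_its(interactive_rewards: Sequence[float],
                  single_turn_reward: float) -> bool:
    """Check the ITS law for one model: rewards measured at increasing
    interaction-turn budgets must be non-decreasing, and the final
    interactive reward must surpass the idealized single-turn reward on
    a fully specified, unambiguous task."""
    non_decreasing = all(a <= b for a, b in zip(interactive_rewards,
                                                interactive_rewards[1:]))
    return non_decreasing and interactive_rewards[-1] > single_turn_reward
```

A model whose reward climbs from 0.2 to 0.4 over three budgets against a 0.35 single-turn baseline satisfies the law; one that plateaus below the baseline, or whose reward dips as turns accumulate, does not.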

    Interaction Scaling Law