
BIRD-INTERACT

Re-imagining Text-to-SQL Evaluation via Lens of Dynamic Interactions

Hi! BIRD-INTERACT

We are grateful that BIRD-SQL 2023 has brought insights and contributions to the Text-to-SQL community in the LLM era, under the supervision of DB & DM experts. We also sincerely thank the community for all its feedback! Our work has been featured by DeepMind and OpenAI.

This year, in collaboration with Google Cloud, we are launching BIRD-SQL 2025, which will cover a wide range of professional DBs and their knowledge in real-world applications.

BIRD-INTERACT, an interactive text-to-SQL benchmark, re-imagines Text-to-SQL evaluation through the lens of dynamic interactions. The environment blends a hierarchical knowledge base, database documentation, and a function-driven user simulator to recreate authentic enterprise environments across full CRUD operations. It offers two rigorous test modes: (1) passive Conversational Interaction and (2) active Agentic Interaction, spanning 600 annotated tasks including Business Intelligence (BI), CRUD operations, etc., each guarded by executable test cases. Typical evaluations trigger 4,374-12,214 interaction turns between the model and the user simulator, while state-of-the-art reasoning models currently solve only ≈24% and ≈18% of tasks in the Lite version, and ≈16% of tasks in the Full version, underscoring the benchmark's challenge. Together, these features position BIRD-Interact as a decisive yardstick for advancing interactive, production-ready database assistants. The two test modes supported by BIRD-Interact cover most interactions in real-world applications:

  • c-Interact: Conversational Interaction, a passive mode with a fixed workflow.
  • a-Interact: Agentic Interaction, an embodied active mode where the workflow is dynamic and led by the model.
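To make the contrast concrete, here is a minimal sketch of the two interaction loops. All interfaces here (`model`, `user`, the action tuple) are illustrative assumptions, not the benchmark's actual API:

```python
from typing import Callable, List, Tuple

def c_interact(model: Callable, user_turns: List[str]) -> List[str]:
    """Passive conversational mode: the workflow is fixed -- the user
    simulator drives a predetermined sequence of turns and the model
    only responds to each one."""
    history, replies = [], []
    for turn in user_turns:
        history.append(("user", turn))
        reply = model(history)            # model answers each scripted turn
        history.append(("model", reply))
        replies.append(reply)
    return replies

def a_interact(model: Callable[..., Tuple[str, str]],
               user: Callable[[str], str],
               budget: int) -> str:
    """Active agentic mode: the model leads the workflow -- it may ask
    clarifying questions (spending interaction budget) or submit its
    final SQL at any point."""
    history = []
    for _ in range(budget):
        action, payload = model(history)  # model chooses its next action
        if action == "submit":
            return payload                # final SQL answer
        history.append(("model", payload))
        history.append(("user", user(payload)))  # simulator responds
    _, payload = model(history)
    return payload                        # forced submission when budget runs out
```

The key design difference: in c-Interact the simulator's turns are the outer loop, while in a-Interact the model's decisions are.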

There are three versions of Bird-Interact dataset:

  • mini-interact: A portable, easy-to-evaluate version containing 300 tasks specifically for SQLite, with ambiguity resolution and follow-up queries decoupled into separate tasks. A trained, human-aligned local user simulator is deployed.
  • bird-interact-lite-exp: A lite version containing 270 tasks specifically for PostgreSQL. This is a good starting point for quick experimentation.
  • bird-interact-full: The full version of Bird-Interact, comprising 600 tasks specifically for PostgreSQL.

News

  • Sep. 18, 2025: Announcement: Please note that when Docker loads the databases before evaluation, errors may occasionally occur (they will not terminate the process but will appear in the Docker logs). As a result, some databases may fail to load properly, leaving them empty and producing abnormally low evaluation scores.
    👉 We strongly recommend checking the Docker logs for errors before running the evaluation and verifying that all databases have loaded successfully.
  • Aug. 26, 2025: We're excited to announce the release of the BIRD-Interact-Full (600) set! It's a tough one: the best LLMs achieve only a 16.33% success rate, with just 10.0% on the c-Interact and a-Interact portions. For more details, please see the project website. We'll be sending the ground-truth SQLs & test cases to our mailing list this week. If you want early access, please send an email as instructed on the site for an automatic download. On another note, we've also released a SQLite version of LiveSQLBench-Lite for easier local research. The full LiveSQLBench-Base and -Large versions are coming soon!
  • May. 23, 2025: We have released bird-interact-lite-exp (270 tasks). Check out the data in Hugging Face and the newest code in GitHub. The full set of PostgreSQL will be released later. Have fun! Thanks!

LiveSQLBench Support

BIRD-Interact is built on LiveSQLBench, so all databases and tasks are live and continuously updated. LiveSQLBench (BIRD-SQL Pro v0.5) is a contamination-free, evolving benchmark for evaluating LLMs on real-world text-to-SQL tasks.

GT SQLs & Test Cases

To mitigate data leakage, please feel free to email bird.bench25@gmail.com for the solution SQLs and test cases. Delivery is quite fast.

user represents the user simulator, and system represents the evaluated model. Highlighted text shows ambiguity related to the hierarchical knowledge base (HKB).

c-Interact Example

a-Interact Example

Submission

BIRD 2025 will accept more flexible submission pipelines. Please contact bird.bench25@gmail.com with the tag [bird-interact] in the title if you have any questions.

Subscribe to BIRD Update

BIRD is a long-term research project aimed at bridging the gap between semantic parsing models and successful database applications. To receive the latest updates on the dataset, you can leave your email address.

Email Subscription

Citation

@misc{birdinteract2025,
  author       = {BIRD Team},
  title        = {BIRD-Interact: Re-imagine Text-to-SQL Evaluation via Lens of Dynamic Interactions},
  year         = {2025},
  howpublished = {https://github.com/bird-bench/BIRD-Interact},
  note         = {Accessed: 2025-06-01}
}
              
@article{li2024can,
  title={Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls},
  author={Li, Jinyang and Hui, Binyuan and Qu, Ge and Yang, Jiaxi and Li, Binhua and Li, Bowen and Wang, Bailin and Qin, Bowen and Geng, Ruiying and Huo, Nan and others},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}
              
BIRD-INTERACT Leaderboard

  • Reward* means the normalized global reward, which is at most 100.
  • Phase-1 success rate (SR) means the success rate of the ambiguity-resolution phase.
  • Phase-1-Debug SR means the success rate of the ambiguity-resolution phase with a chance of debugging.
  • Phase-2 SR means the success rate of both Phase 1 and the follow-up question phase.
  • Phase-2-Debug SR means the success rate of both Phase 1 and the follow-up question phase with a chance of debugging.
  • Tier Classification (By Ranking):
    • 🏆 Excellent Chat: The Best!
    • 💎 Good Chat: Top 30%
    • 🌟 Standard: Top 60%
    • ⚪ Basic: Bottom 40%

    🧑‍💻 User Simulator: GPT-4o.
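The metrics above can be sketched in a few lines. This is a minimal sketch: the outcome schema, field names, and the per-task reward weighting are our assumptions (the official Reward* weighting may differ); only the success-rate definitions follow directly from the list above.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class TaskOutcome:
    # Hypothetical per-task outcome record; field names are illustrative.
    phase1: bool        # ambiguity-resolution phase solved
    phase1_debug: bool  # phase 1 solved when a debugging chance is allowed
    phase2: bool        # phase 1 AND the follow-up question both solved
    phase2_debug: bool  # phase 2 solved with a debugging chance

def success_rate(flags: List[bool]) -> float:
    """Percentage of tasks for which the flag is True."""
    return 100.0 * sum(flags) / len(flags)

def report(outcomes: List[TaskOutcome]) -> Dict[str, float]:
    """Success rates as percentages; Reward* is sketched as the mean of a
    per-task reward in [0, 1] scaled to 100, so it is at most 100 by
    construction (the equal phase weighting here is an assumption)."""
    per_task_reward = [(o.phase1 + o.phase2) / 2 for o in outcomes]
    return {
        "Phase-1 SR": success_rate([o.phase1 for o in outcomes]),
        "Phase-1-Debug SR": success_rate([o.phase1_debug for o in outcomes]),
        "Phase-2 SR": success_rate([o.phase2 for o in outcomes]),
        "Phase-2-Debug SR": success_rate([o.phase2_debug for o in outcomes]),
        "Reward*": 100.0 * sum(per_task_reward) / len(per_task_reward),
    }
```

For example, with one fully solved task and one fully failed task, every SR and the sketched Reward* come out to 50.0.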

    Interaction-Time Scaling (ITS)

    Interaction-Time Scaling (ITS) refers to a model's ability to continuously increase its end performance through multi-turn interactions. When this interactive performance surpasses the model's idealized single-turn performance on a fully specified, unambiguous task, we say it satisfies the ITS law. As user patience grows and interaction turns accumulate, performance keeps improving, demonstrating that the model can sustain effective communication over extended dialogue. Currently, only claude-3-7-sonnet satisfies the ITS law.
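The ITS law as described above can be checked mechanically. The sketch below is an assumed operationalization: we take "performance keeps improving" as non-decreasing reward across turn budgets, and require the final interactive reward to exceed the single-turn baseline; the exact criterion used by the benchmark may differ.

```python
from typing import Sequence

def satisfies_its(interactive_rewards: Sequence[float],
                  single_turn_reward: float) -> bool:
    """Check the ITS law for one model: rewards measured at increasing
    interaction-turn budgets must be non-decreasing, and the final
    interactive reward must surpass the idealized single-turn reward on
    a fully specified, unambiguous task."""
    non_decreasing = all(a <= b for a, b in zip(interactive_rewards,
                                                interactive_rewards[1:]))
    return non_decreasing and interactive_rewards[-1] > single_turn_reward
```

A model whose reward climbs from 0.2 to 0.4 over three budgets against a 0.35 single-turn baseline satisfies the law; one that plateaus below the baseline, or whose reward dips as turns accumulate, does not.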

    Interaction Scaling Law