BIRD-INTERACT
We are grateful that BIRD-SQL 2023 has brought insights and contributions to the Text-to-SQL community in the LLM era, with the supervision of DB & DM experts.
We also sincerely thank all feedback from the community! Our work has been featured by DeepMind and OpenAI.
This year, in collaboration with Google Cloud, we are launching BIRD-SQL 2025, which covers a wide range of professional databases and their knowledge in in-the-wild applications.
BIRD-INTERACT, an interactive text-to-SQL benchmark, re-imagines Text-to-SQL evaluation through the lens of dynamic interactions.
The environment blends a hierarchical knowledge base, database documentation, and a function-driven user simulator to recreate authentic enterprise environments across full CRUD operations.
It offers two rigorous test modes: (1) passive Conversational Interaction and (2) active Agentic Interaction, spanning 600 annotated tasks including Business Intelligence (BI), CRUD operations, and more, each guarded by executable test cases.
Typical evaluations trigger 4,374–12,214 interaction turns between the model and the user simulator, while state-of-the-art reasoning models currently solve only ≈24% and ≈18% of tasks in the Lite version, and ≈16% of tasks in the Full version, underscoring the benchmark's challenge.
Together, these features position BIRD-Interact as a decisive yardstick for advancing interactive, production-ready database assistants. BIRD-Interact supports two test modes, covering most interactions in real-world applications:
- c-Interact: Conversational Interaction, a passive mode with a fixed workflow.
- a-Interact: Agentic Interaction, an embodied active mode where the workflow is dynamic and led by the model.
There are three versions of the BIRD-Interact dataset:
- mini-interact: an SQLite-based interactive benchmark containing 300 tasks with a self-contained environment that requires no Docker setup. This lightweight version enables fast development and testing of interactive intelligence for agentic data exploration with users.
- bird-interact-lite-exp: a lite version containing 300 tasks specifically for PostgreSQL. This is a good starting point for quick experimentation.
- bird-interact-full: the full version of BIRD-Interact, comprising 600 tasks specifically for PostgreSQL.
News
- Nov. 13, 2025: We have released Mini-Interact, an SQLite-based interactive benchmark with a self-contained environment that requires no Docker setup. This lightweight version enables fast development and testing of interactive intelligence for agentic data exploration with users.
- Oct. 19, 2025: Announcement:
  1. We have updated the Docker image for quicker BIRD-Interact experiments and to avoid performance mismatch. For more details, please visit our GitHub repository.
  2. We have updated the Submission Guidelines, where customized agent scaffolds and user simulators are supported under universal evaluation. Please feel free to take a look.
- Oct. 8, 2025: 📝 We're thrilled to share that our BIRD-Interact paper is now publicly available! The paper presents the full details, methodology, and evaluation of BIRD-Interact, and includes several insightful findings about model behaviors and interaction patterns.
- Sep. 18, 2025: Announcement:
  1. We have expanded the results of the Bird-Interact-Lite dataset to 300 tasks, and all corresponding results have been updated accordingly.
  2. Please note that when Docker loads the databases before your evaluation, errors may occasionally occur (these will not terminate the process but will appear in the Docker logs). As a result, some databases may fail to load properly, leading to empty databases and abnormally low evaluation scores.
  👉 We strongly recommend checking the Docker logs for any errors before running the evaluation and verifying that all databases have been successfully loaded.
- Aug. 26, 2025: We're excited to announce the release of the BIRD-Interact-Full (600) set! It's a tough one: the best LLMs achieve success rates of only 16.33% and 10.0% on the c-interact and a-interact portions, respectively. For more details, please see the project website. We'll be sending the Ground Truth & Test cases to our mailing list this week. If you want early access, please send an email as instructed on the site for an automatic download. On another note, we've also released a SQLite version of LiveSQLBench-Lite for easier local research. The full LiveSQLBench-Base and -Large versions are coming soon!
- May 23, 2025: We have released bird-interact-lite-exp (300 tasks). Check out the data on Hugging Face and the newest code on GitHub. The full PostgreSQL set will be released later. Have fun! Thanks!
LiveSQLBench Support
BIRD-Interact is built on LiveSQLBench, so all databases and tasks are live and continuously updated. LiveSQLBench (BIRD-SQL Pro v0.5) is a contamination-free, evolving benchmark for evaluating LLMs on real-world text-to-SQL tasks.
GT SQLs & Test Cases
To mitigate data leakage, please feel free to email bird.bench25@gmail.com for the solution SQLs and test cases. Delivery is quite fast.
In the interaction examples below, one icon represents the user simulator and the other represents the evaluated model. Highlighted text shows the ambiguity related to the hierarchical knowledge base (HKB).
c-Interact Example
a-Interact Example
Submission
BIRD 2025 will accept more flexible submission pipelines. If you have any questions, please contact bird.bench25@gmail.com with the tag [bird-interact] in the title.
Subscribe to BIRD Update
BIRD is a long-term research project aimed at bridging the gap between semantic parsing models and successful database applications. To receive the latest updates on the dataset, you can leave your email address.
Citation
@misc{birdinteract2025,
  author = {BIRD Team},
  title = {BIRD-Interact: Re-imagine Text-to-SQL Evaluation via Lens of Dynamic Interactions},
  year = {2025},
  howpublished = {https://github.com/bird-bench/BIRD-Interact},
  note = {Accessed: 2025-06-01}
}
@article{li2024can,
  title = {Can LLM Already Serve as a Database Interface? A Big Bench for Large-Scale Database Grounded Text-to-SQLs},
  author = {Li, Jinyang and Hui, Binyuan and Qu, Ge and Yang, Jiaxi and Li, Binhua and Li, Bowen and Wang, Bailin and Qin, Bowen and Geng, Ruiying and Huo, Nan and others},
  journal = {Advances in Neural Information Processing Systems},
  volume = {36},
  year = {2024}
}
| Rank | Model | Institute | Link | Date | User Simulator | Efficiency | Success Rate (%) |
|---|---|---|---|---|---|---|---|
- Efficiency means the number of conversational turns used (c-interact) or BIRD COIN used (a-interact) in Free Mode (no budget constraints).
- Reward* means the normalized global reward, which is at most 100.
- Priority Questions success rate (SR) means the success rate of the ambiguity-resolution phase.
- Priority Questions + Debug SR means the success rate of the ambiguity-resolution phase when a debugging chance is allowed.
- Follow Ups SR means the success rate on both the Priority Questions and the Follow Up questions.
- Follow Ups + Debug SR means the success rate on both the Priority Questions and the Follow Up questions when a debugging chance is allowed.
📌 NOTE: The Efficiency for c-interact is the number of conversational turns used; for a-interact, it is the total BIRD COIN used. A higher value means more interaction but also higher cost.
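The nested metrics above can be aggregated from per-task outcomes roughly as follows. This is an illustrative sketch only: the record fields (`priority_ok`, `priority_debug_ok`, `followup_ok`) and the sample data are assumptions for demonstration, not the official evaluation schema.

```python
def aggregate_success_rates(results):
    """Aggregate nested success-rate metrics from per-task boolean flags.

    Assumed (hypothetical) fields per task record:
    - priority_ok:       ambiguity-resolution phase solved directly
    - priority_debug_ok: solved once a debugging chance is allowed
    - followup_ok:       follow-up questions also solved
    """
    n = len(results)

    def pct(flags):
        # Percentage of tasks for which the flag combination holds.
        return round(100 * sum(flags) / n, 2)

    return {
        "Priority SR": pct(r["priority_ok"] for r in results),
        "Priority + Debug SR": pct(r["priority_debug_ok"] for r in results),
        "Follow Ups SR": pct(r["priority_ok"] and r["followup_ok"] for r in results),
        "Follow Ups + Debug SR": pct(r["priority_debug_ok"] and r["followup_ok"] for r in results),
    }

# Hypothetical per-task results for illustration:
tasks = [
    {"priority_ok": True,  "priority_debug_ok": True,  "followup_ok": True},
    {"priority_ok": False, "priority_debug_ok": True,  "followup_ok": True},
    {"priority_ok": False, "priority_debug_ok": False, "followup_ok": False},
    {"priority_ok": True,  "priority_debug_ok": True,  "followup_ok": False},
]
print(aggregate_success_rates(tasks))
# → {'Priority SR': 50.0, 'Priority + Debug SR': 75.0,
#    'Follow Ups SR': 25.0, 'Follow Ups + Debug SR': 50.0}
```

Note how each "+ Debug" metric is, by construction, at least as high as its counterpart, since debugging only adds a second chance to succeed.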
Interaction-Time Scaling (ITS) refers to a model's ability to continuously improve its end performance through multi-turn interactions. When this interactive performance surpasses the model's idealized single-turn performance on a fully specified, unambiguous task, we say it satisfies the ITS law. As user patience grows and interaction turns accumulate, performance keeps improving, demonstrating that the model can sustain effective communication over extended dialogue. Currently, we find that only claude-3-7-sonnet satisfies the ITS law.
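As a rough illustration, the ITS condition described above can be checked programmatically. The curve and baseline numbers below are hypothetical placeholders, not benchmark results:

```python
def satisfies_its_law(interactive_curve, single_turn_baseline):
    """Check the ITS law for a per-turn performance curve.

    interactive_curve: success rates (%) after 1, 2, ..., k interaction turns.
    single_turn_baseline: idealized single-turn performance (%) on a fully
    specified, unambiguous version of the same tasks.

    The law holds if performance keeps improving with more turns AND the
    final multi-turn performance surpasses the single-turn baseline.
    """
    improving = all(a <= b for a, b in zip(interactive_curve, interactive_curve[1:]))
    return improving and interactive_curve[-1] > single_turn_baseline

# Hypothetical success rates after 1..5 interaction turns:
curve = [12.0, 18.5, 22.0, 25.5, 28.0]
baseline = 26.0  # hypothetical single-turn performance
print(satisfies_its_law(curve, baseline))  # → True for this illustrative data
```

A model whose curve plateaus below the baseline, or degrades as turns accumulate, would fail this check, which mirrors the observation that most current models do not yet satisfy the ITS law.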