Hi! BIRD-INTERACT
We are grateful that BIRD-SQL 2023 has brought insights and contributions to the Text-to-SQL community in the LLM era, under the supervision of DB & DM experts.
We also sincerely thank the community for all of its feedback! Our work has been featured by DeepMind and OpenAI.
This year, in collaboration with Google Cloud, we are launching BIRD-SQL 2025, which will cover a wide range of professional DBs and the knowledge they require in real-world applications.
BIRD-INTERACT, an interactive text-to-SQL benchmark, re-imagines Text-to-SQL evaluation through the lens of dynamic interactions.
The environment blends a hierarchical knowledge base, database documentation and a function-driven user simulator to recreate authentic enterprise environments across full CRUD operations.
It offers two rigorous test modes: (1) passive Conversational Interaction and (2) active Agentic Interaction. Together they span 600 annotated tasks, including Business Intelligence (BI) and CRUD operations, each guarded by executable test cases.
Typical evaluations trigger 1,968-5,496 interaction turns between model and user simulator, while state-of-the-art reasoning models currently solve only ≈24% and ≈18% of tasks, underscoring the benchmark's challenge.
Together, these features position BIRD-Interact as a decisive yardstick for advancing interactive, production-ready database assistants. The two test modes supported by BIRD-Interact cover most interactions in real-world applications:
- c-Interact: Conversational Interaction, a passive mode in which the workflow is fixed.
- a-Interact: Agentic Interaction, an embodied active mode in which the workflow is dynamic and led by the model.
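As a rough illustration of what "guarded by executable test cases" can mean in practice, the sketch below runs a predicted SQL statement and compares its result with an expected result. Everything here is an assumption made for illustration: the helper name `passes_test_case`, the toy `orders` table, and the use of Python's built-in `sqlite3` as a stand-in for PostgreSQL. It is not the benchmark's actual evaluation harness.

```python
import sqlite3


def passes_test_case(conn, predicted_sql, expected_rows):
    """Hypothetical check: run the predicted SQL and compare rows, ignoring order."""
    try:
        rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # A statement that fails to execute cannot pass the test case.
        return False
    return sorted(rows) == sorted(expected_rows)


# Toy in-memory database (sqlite3 is only a stand-in for PostgreSQL here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO orders (amount) VALUES (?)", [(10.0,), (25.5,), (4.5,)])

expected = [(40.0,)]
ok = passes_test_case(conn, "SELECT SUM(amount) FROM orders", expected)
bad = passes_test_case(conn, "SELECT SUM(amont) FROM orders", expected)  # misspelled column
print(ok, bad)
```

Comparing executed results instead of SQL strings lets different but equivalent queries pass, which is the usual motivation for execution-based checks.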
There are two versions of the BIRD-Interact dataset:
- bird-interact-lite-exp: A lite version containing 270 tasks specifically for PostgreSQL. This is a good starting point for quick experimentation.
- bird-interact-full: The full version of BIRD-Interact, comprising 600 tasks specifically for PostgreSQL. (Coming Soon!)
News
- May 23, 2025: We have released bird-interact-lite-exp (270 tasks). Check out the data on Hugging Face and the newest code on GitHub. The full PostgreSQL set will be released later. Have fun! Thanks!
LiveSQLBench Support
BIRD-Interact is built on LiveSQLBench, so all databases and tasks are live and continuously updated. LiveSQLBench (BIRD-SQL Pro v0.5) is a contamination-free, evolving benchmark for evaluating LLMs on real-world text-to-SQL tasks.
GT SQLs & Test Cases
To mitigate data leakage, please feel free to email bird.bench25@gmail.com for the solution SQLs and test cases. Delivery is quite fast.
In the examples below, one icon represents the user simulator and the other represents the evaluated system model. Highlighted text shows the ambiguity related to the hierarchical knowledge base (HKB).
c-Interact Example
a-Interact Example
Submission
Coming soon! BIRD 2025 will accept more flexible submission pipelines. If you have any questions, please contact bird.bench25@gmail.com with the tag [bird-interact] in the email title.
Subscribe to BIRD Update
BIRD is a long-term research project aimed at bridging the gap between semantic parsing models and the success of database applications. To receive the latest updates on the dataset, you can leave your email address.
Citation
@misc{birdinteract2025,
  author = {BIRD Team},
  title = {BIRD-Interact: Re-imagine Text-to-SQL Evaluation via Lens of Dynamic Interactions},
  year = {2025},
  howpublished = {https://github.com/bird-bench/BIRD-Interact},
  note = {Accessed: 2025-06-01}
}
@article{li2024can,
  title={Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls},
  author={Li, Jinyang and Hui, Binyuan and Qu, Ge and Yang, Jiaxi and Li, Binhua and Li, Bowen and Wang, Bailin and Qin, Bowen and Geng, Ruiying and Huo, Nan and others},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}
| Rank | Model | Pass Rate (%) | Institute | Link | Date | Tier |
|---|---|---|---|---|---|---|
- 🏆 Excellent Chat: The Best!
- 💎 Good Chat: Top 30%
- 🌟 Standard: Top 60%
- ⚪ Basic: Bottom 40%
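One possible reading of the tier legend above can be sketched as a small function that maps a leaderboard rank to a tier. The function name `assign_tier` and the exact boundary handling (for example, whether a rank exactly at the 30% mark counts as "Good") are assumptions for illustration; the leaderboard's real assignment rule is not specified here.

```python
def assign_tier(rank: int, total: int) -> str:
    """Map a 1-based leaderboard rank to a tier name (hypothetical boundaries)."""
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    if rank == 1:
        return "Excellent"  # the single best entry
    # Integer arithmetic avoids floating-point edge cases at the boundaries.
    if rank * 100 <= 30 * total:
        return "Good"       # within the top 30%
    if rank * 100 <= 60 * total:
        return "Standard"   # within the top 60%
    return "Basic"          # bottom 40%


# Illustrative leaderboard with 10 entries.
tiers = [assign_tier(r, 10) for r in range(1, 11)]
print(tiers)
```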
🧑‍💻 User Simulator: Gemini-2.0-flash with an average cost of 1.91 USD
| Rank | Model | Reward* | Institute | Link | Date | Tier |
|---|---|---|---|---|---|---|
- 🏆 Excellent Interaction: The Best!
- 💎 Good Interaction: Top 30%
- 🌟 Standard: Top 60%
- ⚪ Basic: Bottom 40%
🧑‍💻 User Simulator: Gemini-2.0-flash with an average cost of 1.56 USD
Interaction-Time Scaling (ITS) refers to a model’s ability to continuously increase its end performance through multi-turn interactions. When this interactive performance surpasses the model’s idealized single-turn performance on a fully specified, unambiguous task, we say the model satisfies the ITS law. As user patience grows and interaction turns accumulate, performance keeps improving, demonstrating that the model can sustain effective communication over extended dialogue. Currently, we find that only claude-3-7-sonnet satisfies the ITS law.
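The ITS condition described above can be expressed as a simple check, shown here as a minimal sketch. The function name `satisfies_its`, the requirement that performance never drops as the turn budget grows, and the numbers in the usage lines are all illustrative assumptions; they are not measured results or the benchmark's official criterion.

```python
from typing import Sequence


def satisfies_its(perf_by_budget: Sequence[float], single_turn_ideal: float) -> bool:
    """Hypothetical ITS check.

    perf_by_budget[i] is the end performance when the model is allowed
    i + 1 interaction turns; single_turn_ideal is the performance on the
    fully specified, unambiguous version of the tasks.
    """
    if not perf_by_budget:
        return False
    # Performance should keep improving (never drop) as turns accumulate.
    keeps_improving = all(b >= a for a, b in zip(perf_by_budget, perf_by_budget[1:]))
    # The final interactive performance must exceed the single-turn ideal.
    surpasses_ideal = perf_by_budget[-1] > single_turn_ideal
    return keeps_improving and surpasses_ideal


# Made-up numbers, for illustration only.
print(satisfies_its([0.20, 0.28, 0.35], single_turn_ideal=0.30))
print(satisfies_its([0.20, 0.25, 0.28], single_turn_ideal=0.30))
```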
