NBA Text-to-SQL Pipeline

Link to Github

Project Description

I created an AI Pipeline that can query a NBA Database using natural language questions, generate an appropriate SQL query and retrieve the answer. Through this project, I learned how to create a useful tool that can be modified to any SQL database or usecase. However, due to my love for basketball, I created it to get NBA data from a related Postgres Database.

OpenWebUI and Pipelines

A key part of my tech stack is OpenWebUI, which is a lightweight, customizable interface for interacting with local llms. This platform allows developers to create their own personalized, locally-hosted ChatGPT. It also supports custom pipelines which let you define how inputs and outputs are processed in a multi-step flow.

In this project, I used OpenWebUI’s pipeline system to define:

User Input: Users input natural language questions about NBA data.
```
    How many teams are there in the NBA?
```
Text-to-SQL model processing: Using Retrieval Augmented Generation(RAG) to infer the structure of our database, the LLM generates an accurate SQL query.
```
    SELECT COUNT(*) FROM team;
```
SQL query execution against my NBA database: The SQL query is then run against the database and gets a default SQL response.
```
    30
```
Formatted Answer Returned to the User: The Answer from our Database is then turned into Natural Language such that the user can make sense of it.

By using Pipelines, I was able to focus on each step modularly and keep the architecture clean.

Building the Pipeline

1. Designing the NBA Stats Database

Using PostgreSQL, I built a relational database containing structured NBA stats pulled from a public Kaggle dataset. My schema included tables like:

player: player names, ids, and if they are active.
common_player_info : all other related player information, such as height, weight, and what team they are on.
team: id, team name, nickname, location, and year founded

I did not modify the structure or schema of the database at all, in an effort to create a script that is not reliant on a database’s schema but rather the script’s design.

2. Text to SQL Conversion

For the natural language understanding piece, I experimented with different open-source LLM models such as Mistral, LLaMa2.1, and LLaMa3 to see which could handle the queries in a consistent manner. The pipeline sends prompts to the model with the help of a Natural Language to SQL library to generate SQL given the current schema and user question.

This step was unexpectedly tricky, because the NLSQL tool wasn’t the most accurate for more complex queries. To handle this, I manually validated and generated test queries to handle as many of the common patterns as I could find. In one case, the library couldn’t handle JOINS using multiple tables and would instead attempt to run nested loops, which resulted in incorrect answers. So, I provided an test query that told the LLM how to handle these.

For example, the question Who are the active players on the Kings? required imformation from three different tables:

The player table tells us if a player is active or not
The common_player_info table tells us what a player’s current team is(in the form of a team_id)
The team table connects the team_id with the nickname Kings

So, to avoid nested loops, I provided the following query:

SELECT p.full_name 
    FROM common_player_info AS cpi, player AS p, team AS t 
    WHERE cpi.team_id = t.id 
        AND cpi.person_id = p.id 
        AND t.nickname = 'Kings' 
        AND p.is_active = 1;

This allowed my pipeline to handle any related question which connects these three tables.

avg_pistons

3. Safe and Effectice SQL Execution

Once the SQL is generated, the pipeline:

Parses and Validates the SQL
Executes it against the NBA PostgreSQL database

4. Formatting Output

Once the output of our query is returned, I gave the LLM instructions to synthesise the output such that the generated SQL query and output are returned in a human-readable way. This part of the pipeline can still use a little work, since some information about the data colums; such as units for weight and height, need to be explicitly defined.

What’s Next

There is a lot that can be done to improve this project. First and foremost, I need to check if this UI is safe from SQL attacks. If not, I would need to change some of the prompts for SQL generation such that the database is not harmed. Also, this pipeline can easily be adapted to more relavant datasets. While I am currently using it on NBA data, this script can be used to help interactions with business data for people who are less familiar with SQL.

Final Thoughts

This project is a great example of how powerful LLMs can be when paired with structured data and a clean user interface. Whether you’re a basketball nerd like me or just curious about the tech, it’s exciting to see how natural language can open doors to rich, data-driven insights.

If you’re interested in building something similar or want to collaborate, feel free to reach out! The code is modular, and I’m happy to share what I’ve learned.

Link to Github#

Project Description#

OpenWebUI and Pipelines#

Building the Pipeline#

1. Designing the NBA Stats Database#

2. Text to SQL Conversion#

3. Safe and Effectice SQL Execution#

4. Formatting Output#

What’s Next#

Final Thoughts#