Chatting with your database: What did I learn by letting AI generate SQL?

#database #python #ai #learning

This is one of the intriguing use cases of ChatGPT and other LLMs.

You pass the information about your database and a question to a large language model. In return you get an SQL query, which you can execute to get the answer.

As a cricket fan, I had an sqlite database with match statistics so I made AI answer some questions. Here's what I learned:

Prompt Engineering is a serious thing. When it comes to utilising LLMs, packaging the right amount of context into the prompts will influence your results. Before using LangChain I wrote custom logic to create context for db schema, and I found myself iterating over minute details to make the queries as accurate as possible.
LangChain is not just a wrapper. It is a very smart tool that glues together various moving parts. I found my results to be more consistent using it.
Denormalised schema will fetch more consistent results. As you can see in the video, one of the queries returned player uuid instead of name. This was because my data model was designed for a different use case where as it will make it easier for GPT-4 if I had the column as player_name instead of id.
There is a risk of data leak if you want to use this in production. Exposing such an interface publicly is risky. We can mitigate this risk my restricting the number of tables shared with AI model and pre-processing the SQL queries generated before executing them. Using a restricted read-only user will be another wise guard rail.

I will continue exploring this use case and share my learning here in future.

Thanks for reading.