Ngonidzashe Nzenze

Posted on May 1, 2023 • Edited on May 17, 2023

Speak Your Queries: How Langchain Lets You Chat with Your Database

#python #chatgpt #openai #postgres

Imagine effortlessly conversing with your database as if it were a close friend, asking questions and receiving instant, accurate responses. Welcome to the world of Langchain, a groundbreaking package that harnesses the power of LLMs(Large Language Models) to redefine the way you interact with your databases. In this tutorial, we'll explore how to seamlessly connect to a PostgreSQL database and start chatting with it using Langchain. Say goodbye to complex queries and embrace the future of database management – let's dive into the realm of conversational AI and revolutionize your data-driven tasks today!

What is Langchain?

Langchain is a user-friendly tool that helps connect advanced language software, called large language models (LLMs), with other sources of information, like databases. LLMs are amazing at understanding and generating human-like text, but to create truly powerful apps, we need to combine their abilities with other resources. Langchain makes this possible by allowing developers to easily link these language models with external data. This means that everyday people can benefit from smarter, more helpful applications that understand their needs and provide accurate information.

Getting started

I have included all the code for this project on my github.
Lets start by installing all the required packages first. Install langchain, openai, psycopg2 and python-environ with pip:



pip install langchain
pip install openai
pip install psycopg2
pip install python-environ

We are going to be using a postgresql database in this tutorial hence we're installing psycopg2. I am already assuming that you have postgresql installed on your device and have an open ai account. We are going to use the python-environ module to manage the API keys and database password.

Create a .env file and add the keys into it as below:



apikey=your_openai_api_key
dbpass=your_database_password

Setting up the database

Create a database named tasks in pgAdmin. Our tasks are going to consist the task name, the completion status which is either 1 or 0, the due date, completion date, and the task priority ranging from 0 to 3.

Create a file named db.py and add the following code to it:



import psycopg2

import environ
env = environ.Env()
environ.Env.read_env()

# Establish a connection to the PostgreSQL database
conn = psycopg2.connect(
    host='localhost',
    port=5433,
    user='postgres',
    password=env('dbpass'),
    database='tasks'
)

# Create a cursor object to execute SQL commands
cursor = conn.cursor()

# Create the tasks table if it doesn't exist
cursor.execute('''CREATE TABLE IF NOT EXISTS tasks
             (id SERIAL PRIMARY KEY,
             task TEXT NOT NULL,
             completed INTEGER,
             due_date DATE,
             completion_date DATE,
             priority INTEGER)''')

# Insert sample tasks into the tasks table
cursor.execute("INSERT INTO tasks (task, completed, due_date, completion_date, priority) VALUES (%s, %s, %s, %s, %s)",
               ('Finish homework', 1, '2023-05-01', None, 1))
cursor.execute("INSERT INTO tasks (task, completed, due_date, completion_date, priority) VALUES (%s, %s, %s, %s, %s)",
               ('Buy groceries', 0, '2023-05-03', None, 2))
cursor.execute("INSERT INTO tasks (task, completed, due_date, completion_date, priority) VALUES (%s, %s, %s, %s, %s)",
               ('Pay bills', 0, '2023-05-05', None, 3))
cursor.execute("INSERT INTO tasks (task, completed, due_date, completion_date, priority) VALUES (%s, %s, %s, %s, %s)",
               ('Clean house', 0, '2023-05-08', None, 4))
cursor.execute("INSERT INTO tasks (task, completed, due_date, completion_date, priority) VALUES (%s, %s, %s, %s, %s)",
               ('Exercise', 1, '2023-05-10', None, 5))

# Commit the changes and close the connection
conn.commit()
conn.close()

This file is going to create a table named tasks and insert demo tasks into it. Run the script with python db.py.

Setting up the chat

Now we are going to create the main console application. First create a file named main.py and import the relevant modules:



#main.py

from langchain import SQLDatabase, SQLDatabaseChain
from langchain.chat_models import ChatOpenAI

# Setting up the api key
import environ
env = environ.Env()
environ.Env.read_env()

API_KEY = env('apikey')

Under the hood, LangChain works with SQLAlchemy to connect to various types of databases. This means it can work with many popular databases, like MS SQL, MySQL, MariaDB, PostgreSQL, Oracle SQL, and SQLite. To learn more about connecting LangChain to your specific database, you can check the SQLAlchemy documentation for helpful information and requirements.

With postgres and psycopg2, the connection string is like this: postgresql+psycopg2://username:password@localhost/mydatabase.

Add the following to main.py:



# main.py
# ...

# Setup database
db = SQLDatabase.from_uri(
    f"postgresql+psycopg2://postgres:{env('dbpass')}@localhost:5433/tasks",
)

The code above sets up our connection to the postgresql database. Feel free to change it to your own configuration.

Add the following code:



# main.py

# ...

# setup llm
llm = ChatOpenAI(temperature=0, openai_api_key=API_KEY, model_name='gpt-3.5-turbo')

# Create query instruction
QUERY = """
Given an input question, first create a syntactically correct postgresql query to run, then look at the results of the query and return the answer.
Use the following format:

Question: "Question here"
SQLQuery: "SQL Query to run"
SQLResult: "Result of the SQLQuery"
Answer: "Final answer here"

{question}
"""

# Setup the database chain
db_chain = SQLDatabaseChain(llm=llm, database=db, verbose=True)

First we define our language model with:



llm = ChatOpenAI(temperature=0, openai_api_key=API_KEY)

We are going to be using the open ai language model.

Next, we create a basic query string, which will tell the model instructions on how we want it to present the data.



QUERY = """
Given an input question, first create a syntactically correct postgresql query to run, then look at the results of the query and return the answer.
Use the following format:

Question: "Question here"
SQLQuery: "SQL Query to run"
SQLResult: "Result of the SQLQuery"
Answer: "Final answer here"

{question}
"""

Finally we, set up the database chain with:



db_chain = SQLDatabaseChain(llm=llm, database=db, verbose=True)

We want to be able to prompt our database multiple times therefore I'm going to setup a while loop:



# main.py

# ...

def get_prompt():
    print("Type 'exit' to quit")

    while True:
        prompt = input("Enter a prompt: ")

        if prompt.lower() == 'exit':
            print('Exiting...')
            break
        else:
            try:
                question = QUERY.format(question=prompt)
                print(db_chain.run(question))
            except Exception as e:
                print(e)


get_prompt()

The get_prompt function will keep on asking for input and giving us the result over and over. We can only exit the loop if we type 'exit'. That is it for the setup, let's test it out and see how it does.

Now let's try it out!

How many tasks do i have?

I want a list of those tasks.

Wait a minute🤔, when should I have bought groceries?

How many incomplete tasks do I even have?

Oh I had forgot! I need to clean the house!!😨

In conclusion, we've explored the incredible potential of LangChain to revolutionize the way we interact with databases. By harnessing the power of large language models and making it easy to connect with various databases, LangChain enables users to engage in natural, conversational exchanges with their data. The accompanying screenshots throughout this article have demonstrated the simplicity and effectiveness of this innovative approach. As technology continues to evolve, LangChain stands as a testament to the endless possibilities that emerge when we seamlessly blend AI capabilities with practical, real-world applications. So, embrace the future of database management with LangChain and experience the ease and efficiency of conversing with your data.

Top comments (17)

Michael Tharrington • May 1 '23

Wow, this is really cool and seems like it'd be quite helpful, especially for someone non-technical (like mysef!) who wants to query the database but doesn't know anything about writing queries. Thanks for sharing, Ngonidzashe!

Ngonidzashe Nzenze • May 1 '23

Glad you find it helpful! LangChain simplifies database interactions for everyone, including non-technical users. If you need any help, just ask. Enjoy querying! 😊

sajeev • May 14 '23

Great article. Its perfectly working for less number of tables.
Comments from my side:

For a small DB of students and mark - I got this error - group by syntax issue `which student scored the highest mark?

SQLQuery:SELECT "studentname", MAX("marks") FROM myschema.student_mark
INNER JOIN myschema.student ON myschema.student_mark."student_id" = myschema.student."student_id"
LIMIT 5;(psycopg2.errors.GroupingError) column "student.studentname" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT "studentname", MAX("marks") FROM myschema.student_mar...
^
SQL: SELECT "studentname", MAX("marks") FROM myschema.student_mark
INNER JOIN myschema.student ON myschema.student_mark."student_id" = myschema.student."student_id"
LIMIT 5;`

For huge data set, _"# Setup database" _ was taking time(was waiting more than 3 hrs for a DB with 1200+ tables and 10+ years if enterprise ERP data). What is the best recommended system spec for huge dataset?

Ngonidzashe Nzenze • May 17 '23

Thank you for your feedback and questions. I'm glad you found the article helpful. I understand you encountered some issues when working with larger datasets. Let me attempt to address your concerns.

By default, when you initialize your open AI language model, it uses the text-davinci-003 model. When I have been running some of the queries using this model, I would get syntax errors but they disappeared the moment I specified the model name to gpt-3.5-turbo:

I have updated the article to use ChatOpenAI which uses gpt-3.5-turbo by default. You can simply change the llm variable to:
llm = ChatOpenAI(temperature=0, openai_api_key=API_KEY, model_name='gpt-3.5-turbo')

A system with a large amount of RAM and a strong CPU is likely required for a dataset with 1200+ tables and 10+ years of business ERP data. You may also think about utilizing a distributed database system, which can manage huge amounts of data.

I hope this helps.

I love integers • May 3 '23

For people who are interested how this is done, there are the prompts used to design the chain: github.com/hwchase17/langchain/blo...

seck saliou • May 8 '23

great article very helpfull Thank's very much

Ngonidzashe Nzenze • May 17 '23

I'm glad you liked it!

Daniel Villacis • May 30 '23

Hi, have you done the same but using a MS SQL Database? I'm trying to do it using the AdventureWorksLT demo db, and everything go well until I try to run the chain.

Here is my code:

connection_string = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=server.database.windows.net;DATABASE=AdventureWorksLT;UID=user;PWD=password"
connection_url = URL.create("mssql+pyodbc", query={"odbc_connect": connection_string})
db = SQLDatabase.from_uri(connection_url)
llm = ChatOpenAI(openai_api_key=key,temperature=0, verbose=True, model_name="gpt-3.5-turbo")
db_chain = SQLDatabaseChain.from_llm(llm, db, verbose=True)
response = db_chain.run("Describe the Category table")

I got an InvalidRequestError: This model's maximum context length is 4097 tokens. However, your messages resulted in 5190 tokens. Please reduce the length of the messages.

The AdventureWorksLT is a relative small db, not sure why is using that amount of tokens.

Any advice will be greatly appreciated.
Best!

Kaushik Ganesan • Jun 10 '23

I have the same problem

Kaushik Ganesan • Jun 10 '23 • Edited

I was able to resolve the issue by including only a few tables. For example in the below connection use include_tables to resolve this issue.

db = SQLDatabase.from_uri(uri, include_tables=["table_name"])

Nikk • Jan 25

Hi,
Have you been able to include views anyhow?

Rakshith Raj N • Aug 8 '23

Hey,
I'm getting the following error, could any one please help me knowing the reason.

(pyodbc.OperationalError) ('08001', '[08001] [Microsoft][ODBC SQL Server Driver][DBNETLIB]SQL Server does not exist or access denied. (17) (SQLDriverConnect); [08001] [Microsoft][ODBC SQL Server Driver][DBNETLIB]ConnectionOpen (Connect()). (53)')

Below is my code.
bd=SQLDatabase.from_uri(f"mssql+pyodbc://{username}:{password}@{server}/{database}?driver={driver}")

Dhanashree Mhatre • Jul 31

uri = f'mssql+pyodbc://{username}:{password}@{server}/{database}?driver={driver.replace("{", "").replace("}", "")}'
this works for check if it works

Prameela1610 • Jul 7 '23

SQLQuery:This model's maximum context length is 4097 tokens. However, your messages resulted in 23032 tokens. Please reduce the length of the messages.
I go the above error while asking simple question like count of table .

saptarshitiger • Jun 16 '23

Great work! Have you tried adding memory to the database chain as well ? So that it can refer back to the previous queries which the user asked and create a result combination of the chat history and the new question asked, for me in context of a database memory does not seem to work properly, that is why was curious about the same

Deepak Kashyap • Jun 7 '23

Hey Ngonidzashe,
I am using SQL agent from langchain, for some context I have a large postgres data source. Upon asking questions that might involve joining tables, ordering and filtering. The agent gets hit to its maximum iterations. I am currently testing on my laptop that has 8GB of Ram and a decent CPU. Can you suggest me something that could help me to avoid these errors