Querying SQL DB with GPT Prompt

Querying SQL DB with GPT Prompt

Featured on Hashnode

LLMs appear to have boundless possibilities, and it seems like they can be used in all tasks. We can craft prompts for composing emails in a given context as well as write creative prompts to generate code in any programming language. Image a situation where someone isn't familiar with writing SQL queries but can still interact with the database using LLM's assistance. This once-futuristic idea is now a reality. In this blog, we'll explore how to achieve this using Langchain.

This blog post will begin by exploring the Langchain classes that enable us to query an SQL database. After that, we will work with a sample database and attempt to extract results from it using non-SQL queries.

Introduction to SQLDatabaseChain

Langchain provides SQLDatabaseChain class which can be used for creating SQL chain requests. It uses the SQLDatabase class which is the SQLAlchemy wrapper around the database. The code for creating the instance of SQLDatabase class is below.

db = SQLDatabase.from_uri("sqlite:///./database/test.db")

We are connecting to a SqlLite DB which is inside the database directory of our workspace. The name of our database file is test.db

Now that we have built the database instance, the next step is to instantiate SQLDatabaseChain class. The SQLDatabaseChain class takes two inputs as parameters, the LLM model and SQLDatabase instance.

db_chain = SQLDatabaseChain.from_llm(llm, db, verbose=True)

To query the DB using LLM we will employ the db_chain variable. Below is a sample of how to perform a database query, and we'll delve further into this topic in the following sections.

db_chain.run("How many employees are having the designation of Software developer?")

LLM prompt for database Query

For querying the database we will need a prompt that will be passed to the LLM along with the user question. In this prompt, we pass on some basic information about the DB. We are informing the model that we are querying an SQLite DB, so queries generated by the LLM should match the SQLite syntax. We have also set a limit on the number of results it should return. Full details of the prompt can be found below. This prompt is provided by the Langchain SQLDatabaseChain class.

You are a SQLite expert. Given an input question, first create a syntactically 
correct SQLite query to run, then look at the results of the query and return 
the answer to the input question.
Unless the user specifies in the question a specific number of examples to 
obtain, query for at most {top_k} results using the LIMIT clause as per SQLite.
You can order the results to return the most informative data in the database.
Never query for all columns from a table. You must query only the columns that 
are needed to answer the question. Wrap each column name in double quotes 
(") to denote them as delimited identifiers.
Pay attention to use only the column names you can see in the tables below. 
Be careful to not query for columns that do not exist. Also, pay attention 
to which column is in which table.
Pay attention to use date(\'now\') function to get the current date, if the 
question involves "today".

Use the following format:

Question: Question here
SQLQuery: SQL Query to run
SQLResult: Result of the SQLQuery
Answer: Final answer here

Only use the following tables:
{table_info}

Question: {input}

Sample Project

We will now be going through a sample DB and then will be querying it using plain text to fetch results with the help of LLM. Here I have tried to imitate the DB schema of an electronics e-commerce catalog service. Following are the tables I have created in the DB:

  1. Categories

  2. Brands

  3. Products

Following is the DDL script for these tables. Note that the Products table has reference to both Categories and Brands tables.

CREATE TABLE Categories (
    category_id INT PRIMARY KEY,
    category_name VARCHAR(100) NOT NULL
);


CREATE TABLE Brands (
    brand_id INT PRIMARY KEY,
    brand_name VARCHAR(100) NOT NULL
);


CREATE TABLE Products (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(200) NOT NULL,
    category_id INT,
    brand_id INT,
    price DECIMAL(10, 2) NOT NULL,
    description TEXT,
    stock_quantity INT,
    FOREIGN KEY (category_id) REFERENCES Categories(category_id),
    FOREIGN KEY (brand_id) REFERENCES Brands(brand_id)
);

Post creating the table we need to populate then with some values. Following is the insert script for the same.

-- Inserting into Categories table
INSERT INTO Categories (category_id, category_name) VALUES (1, 'Smartphones');
INSERT INTO Categories (category_id, category_name) VALUES (2, 'Laptops');
INSERT INTO Categories (category_id, category_name) VALUES (3, 'Headphones');
INSERT INTO Categories (category_id, category_name) VALUES (4, 'Tablets');


-- Inserting into Brands table
INSERT INTO Brands (brand_id, brand_name) VALUES (1, 'Apple');
INSERT INTO Brands (brand_id, brand_name) VALUES (2, 'Samsung');
INSERT INTO Brands (brand_id, brand_name) VALUES (3, 'HP');
INSERT INTO Brands (brand_id, brand_name) VALUES (4, 'Sony');


-- Inserting into Products table
INSERT INTO Products (product_id, product_name, category_id, brand_id, price, description, stock_quantity) VALUES (1, 'iPhone 13', 1, 1, 999.99, 'Latest iPhone with A15 Bionic chip', 100);
INSERT INTO Products (product_id, product_name, category_id, brand_id, price, description, stock_quantity) VALUES (2, 'Galaxy S21', 1, 2, 899.00, 'Powerful Android smartphone', 150);
INSERT INTO Products (product_id, product_name, category_id, brand_id, price, description, stock_quantity) VALUES (3, 'HP Pavilion x360', 2, 3, 799.50, 'Convertible laptop with touchscreen', 80);
INSERT INTO Products (product_id, product_name, category_id, brand_id, price, description, stock_quantity) VALUES (4, 'MacBook Air', 2, 1, 1199.00, 'Ultra-thin and lightweight laptop', 60);
INSERT INTO Products (product_id, product_name, category_id, brand_id, price, description, stock_quantity) VALUES (5, 'Sony WH-1000XM4', 3, 4, 349.99, 'Wireless noise-canceling headphones', 100);
INSERT INTO Products (product_id, product_name, category_id, brand_id, price, description, stock_quantity) VALUES (6, 'AirPods Pro', 3, 1, 249.00, 'True wireless earbuds with active noise cancellation', 120);
INSERT INTO Products (product_id, product_name, category_id, brand_id, price, description, stock_quantity) VALUES (7, 'Galaxy Tab S7', 4, 2, 599.00, 'Android tablet with S Pen support', 70);
INSERT INTO Products (product_id, product_name, category_id, brand_id, price, description, stock_quantity) VALUES (8, 'iPad Pro', 4, 1, 799.00, 'Powerful iPad for professionals', 90);
INSERT INTO Products (product_id, product_name, category_id, brand_id, price, description, stock_quantity) VALUES (9, 'Galaxy Note 20', 1, 2, 799.00, 'Note series with S Pen', 110);
INSERT INTO Products (product_id, product_name, category_id, brand_id, price, description, stock_quantity) VALUES (10, 'HP Spectre x360', 2, 3, 1299.00, 'Premium convertible laptop', 50);
INSERT INTO Products (product_id, product_name, category_id, brand_id, price, description, stock_quantity) VALUES (11, 'Sony WH-CH710N', 3, 4, 129.99, 'Wireless headphones with noise cancellation', 90);
INSERT INTO Products (product_id, product_name, category_id, brand_id, price, description, stock_quantity) VALUES (12, 'AirPods', 3, 1, 159.00, 'Classic AirPods with charging case', 200);
INSERT INTO Products (product_id, product_name, category_id, brand_id, price, description, stock_quantity) VALUES (13, 'Xperia 1 III', 1, 4, 1099.00, 'Sony flagship smartphone', 80);
INSERT INTO Products (product_id, product_name, category_id, brand_id, price, description, stock_quantity) VALUES (14, 'MacBook Pro', 2, 1, 1999.00, 'Powerful MacBook for professionals', 40);
INSERT INTO Products (product_id, product_name, category_id, brand_id, price, description, stock_quantity) VALUES (15, 'Galaxy Tab A7', 4, 2, 199.00, 'Affordable Android tablet', 120);

Querying SQL DB with LLM

With our database prepared, our next step involves querying it using LLM. However, before we delve into the querying process, let me provide a visual representation of our table content. This visual aid will facilitate a better understanding of the queries as we proceed.

Brands Table:

Brands Table

Categories Table:

Products Table:

Now, let's proceed to the querying part. Below is the code required to query our database. If you find yourself needing a refresher, I've already explained the code lines at the beginning of this blog. Feel free to revisit them for a quick memory jog.

import os

from langchain import OpenAI, SQLDatabase, SQLDatabaseChain
os.environ["OPENAI_API_KEY"] = "<your-api-key>"
db = SQLDatabase.from_uri("sqlite:///./database/test.db")
llm = OpenAI(temperature=0, verbose=True)

db_chain = SQLDatabaseChain.from_llm(llm, db, verbose=True)

In the above code, we have created the SQLDatabaseChain instance which will be used to query DB. Let's look at some examples.

First, we check for the list of brands in our catalog:

db_chain.run("What all brands do you have?")

Based on our query the LLM can interpret that we want the brand name and for that, it needs to query the Brands table. It runs the select query on the Brands table and selects only the brand_name column.

Next, let's try to extract data that would require querying multiple tables like searching for an Apple smartphone. Fetching this information will require joining all the 3 tables. Based on the user query the LLM understands that the user is looking for the brand Apple for which it will query the Brands table and fetch brand_id, next it will fetch the category_id for smartphones from the Categories table. Post that it will query the Products table to check if the Product exists for the same brand_id and category_id.

db_chain.run("Do you have any Apple smartphones?")

Above is the screenshot where the LLM based on the user's question returned iPhone 13 as the result.

Now let's try to look for some negative cases, we will search for an HP smartphone that does not exist in our catalog and see how the LLM behaves.

db_chain.run("Do you have any HP smartphones?")

In this case, the LLM used a join query to search for the result. It did an inner join of all the 3 tables and put the constraint of HP in brand_name and Smartphones in category_name. Since an HP smartphone doesn't exist in our DB it returned an empty result.

Following is the join query that the LLM generated and the same can be seen in the above screenshot too.

SELECT "Products"."product_name" FROM "Products" 
INNER JOIN "Brands" ON "Products"."brand_id" = "Brands"."brand_id" 
INNER JOIN "Categories" ON "Products"."category_id" = "Categories"."category_id" 
WHERE "Brands"."brand_name" = 'HP' 
AND "Categories"."category_name" = 'Smartphones' LIMIT 5;

I trust you discovered this blog post about querying SQL databases with LLM GPT prompt to be beneficial. I continuously produce similar content covering topics related to Langchain, LLM, and AI. If you wish to receive more articles of this nature, consider subscribing to my blog.

If you're in the Langchain space or LLM domain, let's connect on Linkedin! I'd love to stay connected and continue the conversation. Reach me at: https://www.linkedin.com/in/ritobrotoseth/

Did you find this article valuable?

Support Ritobroto Seth by becoming a sponsor. Any amount is appreciated!