Query SQL using OpenAI and Langchain

Roman Guoussev-Donskoi
Sep 2, 2023
2 min read

Updated: Sep 3, 2023

LLMs (such as OpenAI) are good for reasoning but they lack capability interface with outside world. This is where Langchain agents step in: agents provide LLMs with tools to perform actions (for example, interacting with a relational database or interacting with an OpenAPI spec). Langchain Agents

Langchain Agents make it very easy to use OpenAI to query SQL. Very simple notebook provides end-to-end example. However, there are some snags that need to be addressed and some limitations that need to be considered.

The major obstacle seems to be LLM token limit (Azure OpenAI Service models) LLMs are not designed to process large data sets and so far they seem to work best in combination with vector databases and embeddings. Vector databases are great for retrieving small amounts of relevant data before passing them to LLMs to reason about (example e.g. https://www.cloudmatter.io/post/azure-openai-langchain-azure-search). Whereas creating prompts to access SQL database and process query results tends to exhaust LLM token limit very quickly.

Code

But if you need interface LLM with SQL and fit within token constraints then to query SQL using OpenAI all you need is to initialize OpenAI LLM, initialize SQLDatabase toolkit and create agent as in snippet below. (Complete code is provided on GitHub)

This is so simple that I decided to add a little more functionality and convert text SQL output to python object. As can see this is also a very small amount of code. Note we also leverage prompt generated by parser to inform LLM how format the output so that it can be parsed. This saves effort crafting a prompt manually and I believe one of the advantages using Langchain agent model.

Output

Printed format instructions generated by output parser:

To generate result first AI used agent to find the table that can provide requested data. In following snippets we can see how "Action-Observation-Thought" is repeated until a final result is produced. (output is edited for better readability)

Snippet below also edited with made up number of deployments :) But we can see how finally structured output is produced by parsing LLM response.

Notes:

Permissions

As per langchain documentation (https://python.langchain.com/docs/integrations/toolkits/sql_database ) the query chain may generate insert/update/delete queries. When this is not expected, use a custom prompt or create a SQL users without write permissions.

Performance

langchain.sql_database.SQLDatabase on initialization reads metadata of all tables that user has access to. This can be very slow (minutes) if there more than just a few tables metadata user can access.

One of simpler approaches to address this is to limit the number of metadata accessible to the account used for connect to database. In addition granting account permissions to select only from specific tables, you also need make sure for example database or schema-level VIEW DEFINITION is not granted to user (directly or indirectly via role membership or via grant "to public").

WELCOME TO CLOUD MATTER

Query SQL using OpenAI and Langchain

Code

Output

Notes:

Permissions

Performance

Recent Posts

Comments

CONTACT