Introduction to the Pandasql Library

In the previous post, I wrote about how different programming languages can be connected to a SQL database. These connections allow the the use of a programming language, such as python, to write queries to a SQL database. However, what if you wanted to write queries to a dataset already in a data table such as a pandas data frame? In this post, I will be introducing the pandasql library. This library can be especially helpful for those who have experience in SQL, but are just starting to learn python.

The pandasql library was released in 2016 and allows you to write queries to a pandas data frame using the syntax from SQL. One of the reasons that this library came to be is the similarities found between a pandas data frame and a table from an SQL database. If one were to compare the two, they are basically the same. Each table and the columns within have to have a unique name. There is also a unique id number in each table. Due to the pandas data frame and the SQL table being nearly identical, it makes sense that someone might consider the use of querying a pandas data frame using SQL syntax.

Using the Pandasql Library

pip install pandasql

To use the pandasql library, the next step is to install the pandasql object sqldf.

from pandasql import sqldf

This object is the main function of the library and accepts two arguments: the query in string format and a set of environment variables (locals() and globals()). However, it can be tedious to code in the environmental variables every time you want to make a query. One way to avoid this step is to use the lambda function.

sql = lambda q: sqldf(q, globals())

This means when you pass q through sql it will automatically enter the global environmental variables.

Now that we have our function ready, we can begin to write our queries.

Writing Queries

If you have any data frames within your file, they will automatically be found by the pandasql library. So if you are using a Jupyter Notebook to write your code and have three data frames already saved into it, the library will consider the notebook as an equivalent to a database of tables in SQL. From there, you can query any of the tables or join different tables together using SQL.

As mentioned in previous posts, the best way to write a query in python is to set the query equal to a variable and have that query written in triple quotes. For example:

q = """
SELECT *
FROM clients
WHERE state = "Virginia";
"""

The reason for having the query be its own variable and using triple quotes is because some queries will be incredibly long and it can make for easier coding/reading to have the query on multiple lines.

To get the results of the query, we combine the above statement with our new function. This will return a pandas data frame.

results = sql(q)

SQL is a powerful language for storing, retrieving, and manipulating data. By using the pandasql library, it allows us to leverage that powerful tool when coding in Python. This can make your life a lot easier by simplifying the syntax and using SQL when making queries of pandas data frames.

I hope this article helps create a foundation on combining two different and powerful languages together, which will allow you to create more efficient code. Thanks for reading!

For more documentation on pandasql:

or

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store