Introduction to the Pandasql Library
In the previous post, I wrote about how different programming languages can be connected to a SQL database. These connections allow the the use of a programming language, such as python, to write queries to a SQL database. However, what if you wanted to write queries to a dataset already in a data table such as a pandas data frame? In this post, I will be introducing the pandasql library. This library can be especially helpful for those who have experience in SQL, but are just starting to learn python.
The pandasql library was released in 2016 and allows you to write queries to a pandas data frame using the syntax from SQL. One of the reasons that this library came to be is the similarities found between a pandas data frame and a table from an SQL database. If one were to compare the two, they are basically the same. Each table and the columns within have to have a unique name. There is also a unique id number in each table. Due to the pandas data frame and the SQL table being nearly identical, it makes sense that someone might consider the use of querying a pandas data frame using SQL syntax.
Using the Pandasql Library
As it is the case for any new library you want to use in Python, the first step is to make sure to download the library. One of the easiest ways is to use the pip installer.
pip install pandasql
To use the pandasql library, the next step is to install the pandasql object sqldf.
from pandasql import sqldf
This object is the main function of the library and accepts two arguments: the query in string format and a set of environment variables (
globals()). However, it can be tedious to code in the environmental variables every time you want to make a query. One way to avoid this step is to use the lambda function.
sql = lambda q: sqldf(q, globals())
This means when you pass q through
sql it will automatically enter the global environmental variables.
Now that we have our function ready, we can begin to write our queries.
Writing queries using the function from the pandasql library is as easy as writing a query in a SQL database. The library specifically uses the syntax from the sqlite version of SQL. This means that the majority of SQL statements and functions are available to you.
If you have any data frames within your file, they will automatically be found by the pandasql library. So if you are using a Jupyter Notebook to write your code and have three data frames already saved into it, the library will consider the notebook as an equivalent to a database of tables in SQL. From there, you can query any of the tables or join different tables together using SQL.
As mentioned in previous posts, the best way to write a query in python is to set the query equal to a variable and have that query written in triple quotes. For example:
q = """
WHERE state = "Virginia";
The reason for having the query be its own variable and using triple quotes is because some queries will be incredibly long and it can make for easier coding/reading to have the query on multiple lines.
To get the results of the query, we combine the above statement with our new function. This will return a pandas data frame.
results = sql(q)
SQL is a powerful language for storing, retrieving, and manipulating data. By using the pandasql library, it allows us to leverage that powerful tool when coding in Python. This can make your life a lot easier by simplifying the syntax and using SQL when making queries of pandas data frames.
I hope this article helps create a foundation on combining two different and powerful languages together, which will allow you to create more efficient code. Thanks for reading!
For more documentation on pandasql:
pandasql allows you to query pandas DataFrames using SQL syntax. It works similarly to sqldf in R. pandasql seeks to…