Joining Related Data Tables

In a data science project, the data gathered will usually be from multiple different sources. Take for example a project on wildfires: there are generally 3 different variables that influence how a wildfire will act.

One variable is the weather. If it is dry, there is a more likely chance that a spark will initiate a fire. The second variable is fuel sources. Certain types of plants are more likely to burn in a way conducive to larger fires. The third variable is topography. A wildfire acts differently between flatlands and sharp incline hills.

So for the project, there might be a need for all 3 types of data to be in the final data frame. However, it is unlikely there that would be one file that would hold all this information. This is when it will be necessary to join different data tables. In the example, these types of data could be combined due to their being related. Location is the variable that can combine these different data tables together.

The weather could be different based on location. The fuel sources or types of foliage would be different depending on where the wildfire started. The same is true in the case of topography.

So once the different tables are gathered, the next step is to join them together.

Different Types of Joins

There are a few different ways to join depending on how you want to join the data frames. There are the inner join, the left join, the right join, and the full outer join. A common way of visualizing the joining of two data frames is to consider two interconnecting circles. Each circle represents a data frame.

The inner join is where observations having matching values in both tables. In our diagram, it is where the circles intersect.

The left join takes all the observations from the left table and only joins the information from the right table that have matching values in the left. In our diagram it is the left circle along with the portion of the right circle that intersects with the left.

The same is true in a right join, however instead of having all the observations from the left table, they are from the right.

A full outer join is where all observations from both data tables are joined together. In our diagram, both circles are highlighted.

Each type can be used in different cases. Going back to the wildfire example: if the data needs to span a couple of weeks, the location would be repeated multiple times. The fuel source would be constant through those weeks, so by using a left join with the location being the left data frame, the fuel source would be copied to each observation of the location.

Joining in Python

One way to join tables is in python using Pandas data frames. There are multiple methods to do so. One way is to use the dataframe.join() method. This method has parameters such as other, how, lsuffix and rsuffix. The data frame represents the left data table and other represents the right data table. The how parameter represents which type of joining method is to be used. The options are ‘left’, ‘right’, ‘inner’, and ‘outer’. In the case that there are columns with the same name in both data frame, the lsuffix or rsuffix will add the columns with the designated suffix so that no error is returned.

Another way to join is to use the dataframe.merge() method. This is slightly different because it also allows for joining on specified columns in which case the index of the data frames will be ignored. As with the join method, the merge method has parameters such as right, which is the equivalent of other in the join method. There is also the how parameter, which is the same as the previous method, and the on parameter. The on parameter is for when the data frames are being joined on a column.