The first step of a data science project is to understand the question that needs to be answered. From there, data is gathered and analyzed so the data scientist can answer that question. However, before the data can be analyzed, it must be cleaned. It is this critical step that we will go over here.
Problems from Uncleaned Data
There are multiple reasons why it is important to go through the data cleaning step. A number of errors can occur due to unclean data. One possible error is missing values. Missing values hinder the ability to do a number of processes such as converting data types, calculating summary statistics, visualizing data, and creating models.
One way a missing value may be represented is by an actual null value. Another way is with a placeholder. Just because there are no null values in a dataset doesn't mean the data will be useful. Some datasets use symbols such as ? for missing categorical values. For numerical values, there may be obviously incorrect numbers. An example would be a column that is supposed to contain values between 0 and 100 having an observation of -9999.
Another example of an error found while cleaning data is duplicate values. In certain data sets, having multiple observations being exactly the same can cause problems when modeling or using visualizations.
Thankfully, pandas has a number of methods already available to help clean these problems within a dataset.
Using Pandas to Find Missing Values
The first problem mentioned was missing values. There are many different ways to check for missing values and how to clean those observations.
First let us consider null values. One way to discover whether there are null values is to use dataframe.info(). This method gives a great deal of information, but the most important pieces here are the total number of observations in the data frame and the number of non-null observations in each column. If a column's non-null count is lower than the total number of observations, that column contains null values.
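As a minimal sketch, with a small hypothetical data frame, the non-null counts reported by info() reveal the missing entry:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with one missing value in the "age" column
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cara"],
    "age": [34, np.nan, 29],
})

# info() prints the total number of rows and the non-null count per column;
# "age" shows 2 non-null entries out of 3 rows, revealing one null value
df.info()
```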
Another way to find null values is the dataframe.isna() method. This method goes through the data frame and returns a matrix of boolean values where any cell that contains NaN is True, and every other cell is False. By chaining the .sum() method onto it, it returns the total number of True values in each column, that is, the number of missing values per column.
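A short sketch of this, again with a hypothetical data frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cara"],
    "age": [34, np.nan, 29],
})

# isna() returns a boolean frame; summing it counts the True values,
# giving the number of missing entries in each column
missing_counts = df.isna().sum()
print(missing_counts)
```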
The second case of missing values is when the missing value is given a placeholder. Sometimes when entering values, a program won't allow null entries, so the person entering the data is given a specific value by their supervisor to enter in those cases. As mentioned before, for numerical values this could be a value outside the expected range, like a negative number where there can only be positives. Or in the case of dates, if the dataset is only supposed to contain observations after the year 2000, there might be a 1976 entry. One way to check for this is the dataframe.describe() method, which gives details such as the minimum, mean, and maximum values of numerical columns.
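For instance, a summary of a hypothetical score column expected to fall between 0 and 100 immediately exposes a placeholder:

```python
import pandas as pd

# Hypothetical "score" column where -9999 stands in for a missing value
df = pd.DataFrame({"score": [87, 92, -9999, 75]})

stats = df["score"].describe()

# A minimum of -9999 is far outside the expected 0-100 range,
# flagging it as a likely placeholder rather than a real score
print(stats["min"])
```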
In the case of categorical values, the unique values of the column can be checked; if there are unexpected values, they may be placeholders. In general, placeholders are more difficult to find than null values.
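Checking the unique values of a hypothetical categorical column might look like this:

```python
import pandas as pd

# Hypothetical "color" column where "?" was entered for missing values
df = pd.DataFrame({"color": ["red", "blue", "?", "red"]})

# unique() lists every distinct value in the column;
# an unexpected "?" among real colors suggests a placeholder
print(df["color"].unique())
```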
Cleaning Missing Values
Just as there are multiple ways to find missing values, there are multiple ways to clean them. Some are more brute force than others. The easiest is to drop the offending columns or rows. The problem with this method is that it can remove too much data from the data frame. It might be fine for very large datasets, but not for smaller ones. For example, if all rows with null values were dropped, the majority of the data could be lost because each row happens to have a single null in some random column. In the case of columns, dropping may be impossible because the column is important for answering the question.
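A small sketch shows how aggressive dropping rows can be. With one null scattered in each of two rows (a hypothetical frame), dropna() discards two thirds of the data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, np.nan, 3],
    "b": [4, 5, np.nan],
})

# dropna() removes every row containing at least one null;
# two of the three rows are lost here
cleaned = df.dropna()
print(len(cleaned))  # 1
```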
Some of the more sophisticated methods factor in what type of value it is, numerical or categorical, and how the missing values will be handled: kept or replaced.
Numerical missing values can be replaced with the median value of the column. The median is chosen over the mean or mode because it is least likely to influence the distribution of the data. If the distribution is symmetrical, the median and mean will be close to equal. On the other hand, if there are multiple outliers, they may throw off the mean, making the median the better option. This can be done using the dataframe.fillna() method.
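As a sketch, with a hypothetical income column containing an outlier that skews the mean, the median gives a more representative fill value:

```python
import numpy as np
import pandas as pd

# Hypothetical income column; the 900000 outlier pulls the mean far upward,
# while the median stays close to the typical values
df = pd.DataFrame({"income": [30000, 32000, 35000, np.nan, 900000]})

median = df["income"].median()  # 33500.0, unaffected by the outlier
df["income"] = df["income"].fillna(median)

print(df["income"].isna().sum())  # 0
```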
For missing categorical values, there is unfortunately no median method to replace the missing values. Unless there is one clear majority, replacing the null value can heavily alter the distribution of data.
Instead of replacing missing values, another option is to keep the null values and represent them in a different way. For categorical values, the null values can be made a string, or if there is another symbol used to represent missing categorical values, it can be kept as long as it is consistent throughout the column. For numerical values, a method named binning can be used. The continuous values can be separated into a number of bins (e.g. 0–10, 10–20, 20–30, etc.), with one extra bin representing the null values.
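One way to sketch this binning approach is with pd.cut, giving the nulls their own explicit category (the column names and bin edges here are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [5, 15, 25, np.nan]})

# Cut the continuous values into bins; pd.cut leaves nulls as NaN
df["age_bin"] = pd.cut(
    df["age"],
    bins=[0, 10, 20, 30],
    labels=["0-10", "10-20", "20-30"],
)

# Add an explicit "Missing" category so null ages keep their own bin
df["age_bin"] = df["age_bin"].cat.add_categories("Missing").fillna("Missing")
print(df["age_bin"].tolist())
```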
It is up to the data scientist to determine which method is most appropriate to use. Each problem will have unique factors important to it that will determine if it would be better to keep or replace the null values, or drop the data completely.
Another possible problem described above was duplicated observations. Thankfully, there are simple methods for handling those duplicates. The dataframe.duplicated() method returns a boolean series flagging the duplicated observations, and the dataframe.drop_duplicates() method removes them.
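A brief sketch of both methods on a hypothetical frame with one repeated row:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ana", "Ben", "Ana"],
    "age": [34, 29, 34],
})

# duplicated() flags rows that repeat an earlier row
print(df[df.duplicated()])  # the second "Ana" row

# drop_duplicates() keeps the first occurrence and drops the rest
deduped = df.drop_duplicates()
print(len(deduped))  # 2
```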
Cleaning data can be one of the most time-consuming and important tasks in a data science project. We have gone through a number of the most common data cleaning issues and possible methods to clean these data frames. I hope this has been a helpful resource for working on datasets. Thanks for reading.