GGPLOT2: Creating Box and Whisker Plots

Malcolm Katzenbach
Apr 19, 2021

Data visualization can be one of the most important facets of a project. We see graphs every day, whether writing a research paper or reading the newspaper, and readers can be inundated with visualizations that each tell a story. Because of this, it is important to know how to create clear and concise data visualizations, so that anyone reading your work understands the points you are trying to convey and does not misinterpret the visualization as something else.

There are multiple libraries that allow us to create data visualizations to better explore the data and/or show evidence for a certain hypothesis. In the programming language R, one of the most commonly used data visualization packages is ggplot2. It can be installed as part of the tidyverse collection or by itself. The documentation tells us that “ggplot2 is a system for declaratively creating graphics” (ggplot2.tidyverse.org). It is based on the statistics and computing textbook “The Grammar of Graphics” by Leland Wilkinson, which gives a basic understanding of the different types of quantitative graphics that can be found in different media.

The entire package can create a number of different types of graphics, but we will concentrate on only one today: the box and whisker plot.

To review quickly, the box and whisker plot is a way to graphically display the variation in a data set. This may sound similar to a histogram, which also represents the distribution of the data, but one advantage of a box and whisker plot is that you can display multiple data sets on the same graph for a clear visual comparison.

A box and whisker plot is built from five values. The first is the minimum, the smallest value in the data set. Next is the first quartile, the value at or below which the lowest 25% of the observations fall. After the first quartile comes the median, which is the middle value of the data set. Then there is the third quartile, the value above which the highest 25% of the observations fall. Finally, there is the maximum, the largest value. The box spans the values between the first and third quartiles, and the lines extending from the box, the “whiskers” of the plot, reach toward the minimum and maximum values. Note that ggplot2, by default, stops each whisker at the most extreme observation within 1.5 times the interquartile range of the box and draws anything beyond that as an individual outlier point.
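The five values can also be computed directly in base R. The following is a minimal sketch using made-up numbers (the `values` vector is purely illustrative); it also applies the 1.5 times interquartile range rule that ggplot2 uses by default to flag outliers.

```r
# Hypothetical sample values, chosen only for illustration
values <- c(2, 4, 4, 5, 7, 8, 9, 11, 12, 40)

# quantile() returns the minimum, first quartile, median,
# third quartile, and maximum for these probabilities
five_numbers <- quantile(values, probs = c(0, 0.25, 0.5, 0.75, 1))
print(five_numbers)

# The interquartile range is the height of the box
iqr <- IQR(values)

# Points beyond 1.5 * IQR from the box are treated as outliers
lower_fence <- five_numbers[["25%"]] - 1.5 * iqr
upper_fence <- five_numbers[["75%"]] + 1.5 * iqr
outliers <- values[values < lower_fence | values > upper_fence]
print(outliers)
```

With these numbers, only the value 40 falls outside the fences, so it would be drawn as a separate point rather than as the end of the upper whisker.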

To best understand this graph, we will go through a quick example.

Box and Whisker Plot using Movie Data:

For this example, we will use movie data from the Internet Movie Database and The-Numbers website. Before the data can be used for graphs, it has to be combined and cleaned, and a few features need to be created. We won’t go through that process here, but these steps are important to complete before you can create data visualizations. In this example, two features were created: a Return on Investment (ROI) feature and a Budget Tier feature, the latter based on the total cost to produce and advertise a movie.
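For readers who want a feel for what such feature creation might look like, here is a rough sketch. It is not the actual cleaning code behind this article: the column names, the ROI formula, and the tier break points and labels are all assumptions made for illustration.

```r
# Hypothetical raw data: column names and values are invented for illustration
movies <- data.frame(
  title = c("A", "B", "C", "D"),
  total_cost = c(2e6, 3e7, 9e7, 2.5e8),        # production plus advertising
  worldwide_gross = c(1e7, 2.4e7, 2.7e8, 5e8)
)

# ROI as a percentage of total cost (an assumed formula)
movies$roi <- (movies$worldwide_gross - movies$total_cost) /
  movies$total_cost * 100

# Budget tiers via cut(); break points and labels are assumptions
movies$budget_tier <- cut(
  movies$total_cost,
  breaks = c(0, 1e7, 5e7, 1e8, Inf),
  labels = c("Low", "Medium", "High", "Blockbuster")
)

print(movies)
```

Because cut() returns a factor with ordered levels, the tiers would appear on the x axis in the order given in `labels` rather than alphabetically.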

If you haven’t already done so, the first step to using the package is to install it. This can be done in a number of ways. The way I would recommend is to install the tidyverse, a collection of R packages that includes ggplot2.

install.packages("tidyverse")

Or if you only want to install ggplot2:

install.packages("ggplot2")

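To use ggplot2, the package has to be loaded. The examples below also use the %>% pipe, which ggplot2 does not provide on its own; it comes from the magrittr package, which is installed alongside the tidyverse:

library(ggplot2)
library(magrittr)  # provides the %>% pipe used below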

For most common cases, including this example, the first step in creating a box and whisker plot is to call ggplot(), which takes the data set being used and an aesthetic mapping created with aes(). In this example, I will use a pipe (%>%), which passes the output of one expression as the input to the next. In the aesthetic mapping, we specify which variables go on the x and y axes, and we then choose the type of plot by adding a geom layer with the + operator.

cleaned_movie_data %>%
  ggplot(aes(x = budget_tier, y = roi)) +
  geom_boxplot()

The pipe operator passes the data frame to ggplot() as its first argument, so the data does not need to be named again. The aesthetic mapping is fairly self-explanatory: the x axis is the budget tier and the y axis is the return on investment for movies in that tier.
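The pipe is a convenience rather than a requirement. As a minimal sketch, the same kind of plot can be built by passing the data frame straight to ggplot(). Since the movie data is not available here, the example assumes the `mpg` data set that ships with ggplot2 as a stand-in, with vehicle class in place of budget tier and highway mileage in place of ROI.

```r
library(ggplot2)

# mpg ships with ggplot2; it stands in here for the article's movie data
p <- ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot()

# Printing the object draws the plot
print(p)
```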

As the plot shows, the outliers in the data set squeeze the boxes toward the bottom of the graph. This can be mitigated by hiding the outlier points and restricting the visible y range. In geom_boxplot(), setting outlier.shape = NA hides the outlier points (they are still used when the boxes are computed), and coord_cartesian() has a ylim argument that sets the visible y limits.

cleaned_movie_data %>%
  ggplot(aes(x = budget_tier, y = roi)) +
  geom_boxplot(outlier.shape = NA) +
  coord_cartesian(ylim = c(-100, 650))
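It is worth knowing why coord_cartesian() is a good fit here instead of the ylim() helper. coord_cartesian() only zooms the view, while ylim() drops the rows outside the limits before the box statistics are computed, which changes the boxes themselves. The sketch below illustrates the difference with a small made-up data frame (`df` and its values are assumptions for illustration only).

```r
library(ggplot2)

# Made-up data with one extreme value, for illustration only
df <- data.frame(group = "a", value = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1000))

# coord_cartesian() zooms the view; all rows still feed the box statistics
zoomed <- ggplot(df, aes(x = group, y = value)) +
  geom_boxplot(outlier.shape = NA) +
  coord_cartesian(ylim = c(0, 20))

# ylim() removes rows outside the limits before the statistics are computed
clipped <- ggplot(df, aes(x = group, y = value)) +
  geom_boxplot(outlier.shape = NA) +
  ylim(0, 20)

# layer_data() returns the statistics computed for a layer;
# the clipped version warns that one row was removed
zoomed_median <- layer_data(zoomed)$middle
clipped_median <- suppressWarnings(layer_data(clipped)$middle)

print(zoomed_median)   # median of all ten values
print(clipped_median)  # median of the nine values that remain
```

The two medians differ (5.5 versus 5), which shows that the zoomed plot still describes the full data set while the clipped one does not.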

This is much better. To make the graph a bit clearer, we can add labels with the labs() function.

cleaned_movie_data %>%
  ggplot(aes(x = budget_tier, y = roi)) +
  geom_boxplot(outlier.shape = NA) +
  coord_cartesian(ylim = c(-100, 650)) +
  labs(title = 'Distribution of Return on Investment Percentage Grouped by Budget Tier',
       x = 'Budget Tier',
       y = 'Return on Investment (%)')

Of course, there are many other aspects that can be experimented with, such as the size of the labels and the colors in the graph. I recommend finding a data set that you know well and practicing with the different options available. Get creative!
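As a starting point for that kind of experimentation, here is a small sketch that changes the box colors and the label sizes. It again assumes the `mpg` data set bundled with ggplot2 as a stand-in for the movie data, and the specific colors and sizes are arbitrary choices.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = class, y = hwy)) +
  # fill sets the box interior, color sets the outline and whiskers
  geom_boxplot(fill = "lightblue", color = "grey30") +
  labs(title = "Highway Mileage by Vehicle Class",
       x = "Vehicle Class",
       y = "Highway Miles per Gallon") +
  # theme() controls non-data elements such as text size and weight
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 10)
  )

print(p)
```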

For more information, the documentation can be found below:

https://ggplot2.tidyverse.org/

I hope you enjoyed this introduction and thank you for reading.
