Hypothesis Methods: Testing Multiple Groups with ANOVA

To talk about ANOVA, we first have to mention hypothesis testing. You have probably already heard something about hypothesis testing in one of your science classes in school. Much of the research you see coming out of universities and other groups today is based on hypothesis testing. Through these tests, researchers can determine whether the results of their experiment could plausibly be due to chance or whether they are statistically significant.

As data scientists, we have multiple tools available to run experiments and determine whether their results are significant. ANOVA is one of these tools.

ANOVA stands for Analysis of Variance. The basic idea is that it lets us analyze how different groups change the results of an experiment. Take a website, for example. There are multiple parameters in how a website might look: the color of the buttons, the size of the buttons, the placement of the buttons, etc. Using ANOVA, we can analyze how these different groups affect clicks and which of them matter most. But to truly understand how useful this tool is, we should first go through an overview of the problems with t-tests.

Inferential Statistics: Sample vs. Population

To find the population dataset for the drug, one would have to test the drug on all qualified individuals to see how it worked. Of course, in medicine, that would be too expensive, and it would be dangerous if the drug did not work as planned. By taking samples, however, researchers can make generalizations, or inferences, about how well the drug works.

Using the previous example, we can go into the next step.

A/B Testing and the Z-test

To test this, we can use the central limit theorem. The central limit theorem states that when independent random variables are summed (or averaged), the distribution of the result approaches a normal distribution as the number of variables increases. In terms of our example, as we increase the number of individuals tested, the distribution of the sample average approaches a normal curve. We can then compare the average of the experimental sample group with the average of the control group. A way to do this is by using t-tests.
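To make the central limit theorem a little more concrete, here is a small simulation sketch. It is not from any real experiment: it assumes a skewed (exponential) distribution with a true mean of 1.0 and just checks how the averages of small and large samples behave.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

def sample_mean(n):
    """Mean of n draws from a skewed (exponential) distribution with mean 1."""
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

# Repeat the "experiment" many times for a small and a large sample size.
means_small = [sample_mean(2) for _ in range(5000)]
means_large = [sample_mean(100) for _ in range(5000)]

# Both sets of means centre on the true mean (1.0), but the spread of the
# averages shrinks as the sample size grows.
print(statistics.fmean(means_small), statistics.stdev(means_small))
print(statistics.fmean(means_large), statistics.stdev(means_large))
```

If you plot histograms of the two lists, the one built from samples of 100 should look much closer to a bell curve than the one built from samples of 2.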

A t-test tells you how significantly different two averages are. More specifically, it gives you the likelihood of seeing a difference that large by chance alone. Going back to our drug example, a t-test tells you how likely it is that the difference in averages between the experimental and control groups is due to chance. We won’t go into detail on how to conduct t-tests, but an important point is that we have to choose a significance level. The significance level is the threshold for rejecting or failing to reject the null hypothesis. A common significance level is 0.05. This means there is a 5% chance that the null hypothesis will be rejected when it is actually true. In our drug example, that would mean a 5% chance that our experiment tells us the drug had an effect on cholesterol even though it didn’t. The technical term for this is a type 1 error, or a false positive.
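As a rough illustration of comparing a p-value against a significance level, here is a sketch using scipy's ttest_ind. The cholesterol numbers are invented for this example, and using Welch's version of the test (equal_var=False) is my own choice here, not something the drug example specifies.

```python
from scipy import stats

# Hypothetical cholesterol reductions (mg/dL); invented for illustration only.
control = [2, 5, -1, 3, 0, 4, 1, 2, -2, 3]
treated = [12, 9, 15, 8, 11, 14, 10, 13, 9, 12]

alpha = 0.05  # significance level chosen before looking at the data

# Welch's t-test: does not assume the two groups share the same variance.
t_stat, p_value = stats.ttest_ind(treated, control, equal_var=False)

# Reject the null hypothesis only if the p-value falls below alpha.
reject_null = p_value < alpha
print(t_stat, p_value, reject_null)
```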

This may be fine when the experiment has only one group, as in our drug example, where the group is the use of the drug. But what if we were comparing two different types of drugs and dosages for those drugs, or testing three different types of drugs to see which would be best? We would be testing multiple groups at the same time. The problem is that the type 1 error rates of the individual tests accumulate. So even if you chose a 5% significance level for each test, the chance of at least one type 1 error across all of them would be greater than 5%. The other problem is that we won’t know which comparison produced the error. So, we would not want to conduct multiple t-tests.
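To see how quickly the error rate grows, here is a back-of-the-envelope sketch. It assumes every pair of groups gets its own t-test and, as a simplification, that those tests are independent; the function name is just something I made up for the example.

```python
from math import comb

def familywise_error(alpha, comparisons):
    """Chance of at least one type 1 error across independent tests."""
    return 1 - (1 - alpha) ** comparisons

alpha = 0.05
for groups in (2, 3, 5):
    comparisons = comb(groups, 2)  # number of pairwise t-tests needed
    print(groups, comparisons, round(familywise_error(alpha, comparisons), 3))
# prints:
# 2 1 0.05
# 3 3 0.143
# 5 10 0.401
```

With three drugs, the three pairwise tests already push the chance of at least one false positive to about 14%, well above the 5% we thought we were working with.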


ANOVA instead tests all of the groups at once by running an F-test. The F-test compares the F-statistic against the F-distribution, where F is the ratio of two variances. To solve for F, you start with the equation:

Total Sum of Squares = Sum of Squares within Groups + Sum of Squares Between Groups

We get this from variance, which is the square of the standard deviation and represents the dispersion around the mean of the group. Dividing each sum of squares by its degrees of freedom gives a variance, and from those we can calculate F:

F = variance between groups / variance within groups

Let’s go back to our example of three drugs’ effect on cholesterol and think through how we would solve for our F value.

First remember how to solve for variance:

Sample variance = sum of (each observation - sample mean)^2 / (number of observations - 1)
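As a quick refresher, here is a sketch of the sample variance worked out by hand and checked against Python's built-in statistics module. The three observations are made up for the example.

```python
import statistics

observations = [4, 5, 6]  # made-up values for illustration
mean = statistics.fmean(observations)

# Sum of squared differences from the mean, divided by (n - 1).
manual_variance = sum((x - mean) ** 2 for x in observations) / (len(observations) - 1)

# The standard library's sample variance should agree.
library_variance = statistics.variance(observations)
print(manual_variance, library_variance)
```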

So, to find the variance within groups, we start with the Sum of Squares within Groups: we find the mean of the samples for each group, take the difference between each observation and its own group's mean, square each difference, and sum them all together. That is the numerator of the equation. The denominator is the degrees of freedom (df). For the within-groups variance, the df is the total number of observations minus the number of groups.

To find the variance between groups, we start with the Sum of Squares between Groups: we find the mean of all observations across the samples and the mean of each sample. Then we take the difference between each sample mean and the overall mean, square each result, add them together, and multiply by the number of observations in each group (assuming the groups are the same size).

((Sample1 Mean - Mean)^2 + (Sample2 Mean - Mean)^2 + (Sample3 Mean - Mean)^2) x number of observations per group

This number is then divided by its degrees of freedom, which here is the number of groups minus one. Or in our case:

3 - 1 = 2

Having calculated the variance between groups and the variance within groups we can find F.
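Here is a minimal sketch of that whole calculation in plain Python. The three groups are tiny made-up samples, not real drug data. It uses the standard degrees of freedom: number of groups minus one for the between-groups variance, and total observations minus number of groups for the within-groups variance.

```python
from statistics import fmean

# Made-up cholesterol reductions for three hypothetical drugs.
groups = {
    "drug_a": [4, 5, 6],
    "drug_b": [7, 8, 9],
    "drug_c": [10, 11, 12],
}
all_obs = [x for g in groups.values() for x in g]
grand_mean = fmean(all_obs)

# Within groups: each observation against its own group mean.
ss_within = sum((x - fmean(g)) ** 2 for g in groups.values() for x in g)
# Between groups: each group mean against the grand mean, weighted by group size.
ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups.values())
# Total: each observation against the grand mean.
ss_total = sum((x - grand_mean) ** 2 for x in all_obs)

df_between = len(groups) - 1             # 3 - 1 = 2
df_within = len(all_obs) - len(groups)   # 9 - 3 = 6

f_stat = (ss_between / df_between) / (ss_within / df_within)
print(ss_total, ss_between, ss_within, f_stat)  # 60.0 54.0 6.0 27.0
```

Notice that the total sum of squares (60) is exactly the between-groups part (54) plus the within-groups part (6), which is the equation we started from.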

Now, as with the t-test and the normal distribution, taking multiple samples and finding F for each will create an F-distribution. The critical F value can be found in tables using the degrees of freedom of the numerator and denominator. Similar to the t-test, the experimental F can be compared to the critical value, and if it is greater, the null hypothesis can be rejected.
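Instead of a printed table, the critical value can also be looked up in code. This sketch uses scipy.stats.f and assumes a hypothetical observed F of 27 with 2 and 6 degrees of freedom (three groups, nine observations); those numbers are only for illustration.

```python
from scipy import stats

alpha = 0.05
df_between = 2     # numerator df: groups - 1 (assumed: 3 groups)
df_within = 6      # denominator df: observations - groups (assumed: 9 observations)
f_observed = 27.0  # hypothetical F-statistic for illustration

# Critical value: the point of the F-distribution with alpha of the area above it.
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)

# p-value: the area of the F-distribution above the observed F.
p_value = stats.f.sf(f_observed, df_between, df_within)

reject_null = f_observed > f_critical
print(round(f_critical, 2), p_value, reject_null)
```

Comparing F to the critical value and comparing the p-value to alpha are two ways of making the same decision.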

We can also find the p-value just as in the t-test, except that because all of the groups are compared in a single test, the possible Type 1 error does not increase.

Thankfully in data science, we have a statsmodels function to solve for this more easily.

We first create a formula with our different groups, then use that formula and our data to fit a linear model with the ols function. From the fitted model we can create a table with stats.anova_lm. For each group it returns the sum of squares, the df, the F ratio, and the p-value. From these we can interpret the data to see which groups are influential in the result.


statisticsfun youtube video: How to Calculate and Understand Analysis of Variance (ANOVA) F Test.

Statsmodels.stats.anova documentation



