Introduction to Ordinary Least Squares (OLS) Using StatsModels
OLS, or Ordinary Least Squares, is a useful method for evaluating a linear regression model. It does this through statistical performance metrics for the model as a whole and for each individual parameter of the model. The OLS method comes from the StatsModels Python package. This module is well known for offering classes and functions that estimate many types of statistical models and conduct a variety of statistical tests.
To understand this method, let us take a quick refresher on linear regression. Linear regression attempts to model the relationship between two or more variables. There is one dependent variable that needs to be estimated or predicted, and one or more independent (explanatory) variables that can be input to help predict it. The model is created by fitting a linear equation to the observed data. The most common method for fitting the regression line to the data is called Ordinary Least Squares, which is the method we are going to use in StatsModels.
Ordinary Least Squares fits the regression line to the data by minimizing the sum of squared vertical deviations between the predicted values and the observed values. The deviations are squared so that positive and negative deviations from the line do not cancel each other out.
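The idea can be sketched in plain NumPy: for a single independent variable, the slope and intercept that minimize the sum of squared deviations have a well-known closed form. The data below is made up purely for illustration:

```python
import numpy as np

# Hypothetical sample data: one independent variable x, one dependent variable y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Closed-form OLS estimates for a single predictor:
#   m = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
#   b = mean(y) - m * mean(x)
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()

# The quantity OLS minimizes: the sum of squared residuals.
ssr = np.sum((y - (m * x + b)) ** 2)
print(m, b, ssr)
```

Any other choice of m and b would give a larger sum of squared residuals; this is exactly what the fitting routine in StatsModels computes for us.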
Installing StatsModels and Importing Libraries
As usual, the first step is to install the StatsModels package. If you installed Python through Anaconda, StatsModels should already be included. If it has not been installed, you can do so with Anaconda using:
conda install -c conda-forge statsmodels
Another way of installing the package is using PyPI (pip) with the following code:
pip install statsmodels
Once the package has been installed, we want to import the necessary libraries to use the OLS method.
import statsmodels.api as sm
from statsmodels.formula.api import ols
Using OLS
Linear regression follows the equation y = mx + b. Here y is the dependent variable, x is the independent variable, m is the coefficient (parameter) of the independent variable, and b is the y-intercept. In the case of a single independent variable, m is also the slope of the line. The StatsModels OLS method allows users to specify their statistical models with an R-style formula using a ~ symbol, in the format:
'Y ~ X'
To give an example, let us create a simple data set for our linear regression model. First we create two arrays: x and y.
import numpy as np

x = np.array([1, 1, 2, 3, 4, 5, 5, 5, 6, 7, 10, 13, 14], dtype=np.float64)
y = np.array([4, 5, 3, 4, 7, 5, 6, 7, 9, 10, 12, 14, 18], dtype=np.float64)
We place the arrays into a dictionary and create a data frame from it:
import pandas as pd

d = {'X': x, 'Y': y}
df = pd.DataFrame(d)
First, we want to check if the data looks linear:
import matplotlib.pyplot as plt

plt.scatter(x, y)
plt.show()
The data looks roughly linear, so we create our formula, pass it to the ols method, and call fit() to fit the linear model.
f = 'Y ~ X'
model = ols(f, data=df).fit()

Note that the variable names in the formula must match the column names in the data frame exactly, including case.
Finally, we evaluate how the model performed:
model.summary()
Taking a look at the output, we see a number of different values. Some of the major ones are described below:
R-squared
measures the proportion of the variance in the dependent variable that is explained by the model.
Adj. R-squared
measures the same proportion, with a penalty for adding more variables to the model.
F-statistic
measures how significant the fit of the model is as a whole.
Prob (F-statistic)
is the P-value for the F-statistic: the probability of observing an F-statistic at least this large if the model had no explanatory power.
In the next table:
coef
is the estimated coefficient value for the parameter. This is the number the independent variable is multiplied by. With a single independent variable, this can also be considered the slope.
std err
is the standard error of the estimated coefficient.
t
is the t-statistic, a measure of how significant the coefficient is.
P>|t|
is the P-value for the null hypothesis that the coefficient is equal to zero. If the value is less than the significance level, usually 0.05, the null hypothesis can be rejected and there is a significant relationship between that independent variable and the dependent variable. This becomes more useful when there is more than one independent variable.
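The values discussed above can also be read programmatically from the fitted results object, which is often handier than parsing the printed summary. A short sketch using the same data as above, with a manual cross-check of R-squared against its definition as the fraction of variance explained:

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Same data as in the walkthrough above.
x = np.array([1, 1, 2, 3, 4, 5, 5, 5, 6, 7, 10, 13, 14], dtype=np.float64)
y = np.array([4, 5, 3, 4, 7, 5, 6, 7, 9, 10, 12, 14, 18], dtype=np.float64)
df = pd.DataFrame({'X': x, 'Y': y})

model = ols('Y ~ X', data=df).fit()

# The summary statistics are available as attributes of the results object.
r2 = model.rsquared           # R-squared
coef = model.params['X']      # estimated coefficient (slope) for X
pval = model.pvalues['X']     # P>|t| for the X coefficient

# Cross-check: R-squared is the fraction of variance explained by the model,
# i.e. 1 minus (residual sum of squares / total sum of squares).
resid = y - model.predict(df)
r2_manual = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
print(r2, coef, pval)
```

With this particular data set the P-value for X is far below 0.05, so the fitted slope is statistically significant.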
The OLS method is a useful tool in diagnosing how well a linear regression model is fitting the given data.
I hope that this introduction has been helpful in your linear regression tasks.
For more information on OLS and StatsModels, the documentation can be found at the link below:
https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.html