- Published on
Understanding linear regression with Python
- Authors
- Name
- Christian Miño
- @Christian_sm91
Linear regression is an approach to estimating the linear relationship between two types of variables: a response variable Y and one or more explanatory variables Xi. It is one of the most widely used models and a fundamental technique for analyzing data. Our model takes the form:
Y = Xβ + ϵ
Where Y is an n×1 vector, X is an n×p matrix, β is a p×1 vector of coefficients, and ϵ is a standard normal error term. Typically we call a model with p=1 a simple linear regression and a model with p>1 a multiple linear regression.
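To make the matrix form concrete, here is a minimal numpy sketch (with made-up numbers, separate from the data we'll use below) that builds X with a column of ones for the intercept and solves the normal equations XᵀXβ̂ = XᵀY for the least-squares estimate of β:
import numpy as np

# Made-up data: n = 5 observations, p = 1 explanatory variable plus an intercept
X = np.array([[1.0,  2.0],
              [1.0,  4.0],
              [1.0,  6.0],
              [1.0,  8.0],
              [1.0, 10.0]])          # first column of ones -> intercept
Y = np.array([5.1, 9.2, 12.8, 17.1, 21.0])

# Solve the normal equations X'X b = X'Y for the least-squares coefficients
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)  # [intercept, slope]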
Whenever we build a model, there will be gaps between what a model predicts and what is observed in the sample. The differences between these values are known as the residuals of the model and can be used to check for some of the basic assumptions that go into the model. The key assumptions to check for are:
Linear Fit: The underlying relationship should be linear.
Homoscedastic: The data should have no trend in the variance.
Independent and Identically Distributed: The residuals of the regression should be independent and identically distributed (i.i.d.) and show no signs of serial correlation.
We can use the residuals to help diagnose whether the relationship we have estimated is real or spurious (that is, a false relationship).
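As a quick illustration (this example is mine, not from the lecture this post is based on), regressing two independent random walks on each other will often look like a strong relationship even though none exists, and the residuals of such a regression show exactly the kinds of patterns we check for later in this post:
import numpy as np
import statsmodels.api as sm

np.random.seed(0)  # arbitrary seed, just so the example is reproducible
n = 200

# Two independent random walks: by construction there is no real relationship between them
a = np.cumsum(np.random.normal(0, 1, n))
b = np.cumsum(np.random.normal(0, 1, n))

spurious = sm.OLS(a, sm.add_constant(b)).fit()
print(spurious.rsquared)    # often surprisingly high
print(spurious.pvalues[1])  # often "significant" even though the relationship is false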
Classical Linear Regression
First we'll define a function that performs the linear regression and plots the scatter plot with the regression line, and we'll import the needed libraries (because I'm using a Jupyter Notebook, I add the %matplotlib inline magic).
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
import statsmodels.stats.diagnostic as smd
def reglin(x, y):
    # Fit an OLS regression of y on x, including an intercept
    x = sm.add_constant(x)
    model = sm.OLS(y, x).fit()
    B0 = model.params[0]
    B1 = model.params[1]
    x = x[:, 1]
    # Plot the scatter plot together with the fitted regression line
    x2 = np.linspace(x.min(), x.max(), 100)
    y_hat = x2 * B1 + B0
    plt.scatter(x, y, alpha=1)         # scatter plot of the data
    plt.plot(x2, y_hat, 'r', alpha=1)  # regression line in red
    plt.xlabel('X Value')
    plt.ylabel('Y Value')
    return model, B0, B1
Let's create some dummy data and invoke the function with it.
# Create dummy data
n = 50
x = np.random.randint(0, 100, n)
e = np.random.normal(0, 1, n)
y = 10 + 0.5 * x + e
# Invoke the function
model, B0, B1 = reglin(x, y)
print("Line of best fit: Y = {0} + {1}*X".format(B0, B1))
This dummy example has some generated noise, but all real data will also have noise. This is inherent in sampling from any sort of wild data-generating process. As a result, our line of best fit will never exactly fit the data (which is why it is only "best", not "perfect"). Having a model that fits every single observation that you have is a sure sign of overfitting.
For all fit models, there will be a difference between what the regression model predicts and what was observed, which is where residuals come in.
Residuals
A residual is defined as the difference between what is observed in the sample and what is predicted by the regression. For any residual ri, we express this as:
ri = Yi − Ŷi
Where Yi is the observed Y-value and Ŷi is the predicted Y-value. We plot these differences on the following graph:
modelo, B0, B1 = reglin(x, y)
residuales = modelo.resid
# Draw a vertical segment from each observed point to the regression line
# (errorbar expects non-negative error values, so each residual is split into its two directions)
plt.errorbar(x, y, xerr=0,
             yerr=[np.clip(residuales, 0, None), np.clip(-residuales, 0, None)],
             linestyle="None", color='Green');
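As a quick sanity check (reusing the variables from the cell above), the residuals reported by statsmodels should match the manual difference between observed and predicted values:
# The residuals from statsmodels should equal Y - Y_hat computed by hand
residuales_manuales = y - modelo.predict()
print(np.allclose(residuales_manuales, residuales))  # expected: True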
Diagnosing Residuals
Many of the assumptions that are necessary to have a valid linear regression model can be checked by identifying patterns in the residuals of that model. We can make a quick visual check by looking at the residual plot of a given model.
With a residual plot, we look at the predicted values of the model versus the residuals themselves. What we want to see is just a cloud of unrelated points, like so:
plt.scatter(modelo.predict(), residuales)
plt.axhline(0, color='red')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
What we are looking for is a random scatter in the residuals, which suggests that a linear model gives a good fit. If we can see a trend, this might indicate the presence of autocorrelation or heteroscedasticity in the model.
Heteroscedasticity
One of the main assumptions behind a linear regression is that the underlying data has a constant variance. If some parts of the data have a different variance from other parts, the data is not appropriate for a linear regression. Heteroscedasticity is the term for data with non-constant variance, as opposed to homoscedasticity, where the data has constant variance.
Significant heteroscedasticity invalidates linear regression results by biasing the standard error of the model. As a result, we can't trust the outcomes of significance tests and confidence intervals generated from the model and its parameters.
To avoid these consequences it is important to use residual plots to check for heteroscedasticity and adjust if necessary.
n = 50
x = np.random.randint(0, 100, n)
e = np.random.normal(0, 1, n)
Y_heteroscedastico = 100 + 2*x + e*x
modelo = sm.OLS(Y_heteroscedastico, sm.add_constant(x)).fit()
B0, B1 = modelo.params
residuales = modelo.resid
plt.scatter(modelo.predict(), residuales)
plt.axhline(0, color='red')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals');
Statistical Methods for Detecting Heteroscedasticity
Usually we check for heteroscedasticity visually, but we also want to test for it statistically to confirm what we believe.
A common test to confirm the presence of heteroscedasticity is the Breusch-Pagan hypothesis test. We could also use the White test, but for now we are going to use the Breusch-Pagan test.
breusch_pagan_p = smd.het_breuschpagan(modelo.resid, modelo.model.exog)[1]
print(breusch_pagan_p)
if breusch_pagan_p < 0.05:
    print('The relationship is heteroscedastic')
else:
    print('The relationship is not heteroscedastic')
We set our significance level at α = 0.05, so a Breusch-Pagan p-value below 0.05 tells us that the relationship is heteroscedastic. Using a hypothesis test bears the risk of a false positive or a false negative, which is why it can be good to confirm with additional tests if we are skeptical.
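If we are still skeptical, the White test mentioned above can be run in much the same way, and if heteroscedasticity is confirmed, one common adjustment is to refit the model with heteroscedasticity-robust standard errors. The sketch below reuses the variables from the cells above; HC3 is just one of several robust covariance options in statsmodels:
# White test: returns (LM statistic, LM p-value, F statistic, F p-value)
white_p = smd.het_white(modelo.resid, modelo.model.exog)[1]
print(white_p)

# One common adjustment: refit with heteroscedasticity-robust (HC3) standard errors
modelo_robusto = sm.OLS(Y_heteroscedastico, sm.add_constant(x)).fit(cov_type='HC3')
print(modelo_robusto.bse)  # standard errors under the robust covariance estimate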
Autocorrelation
Another assumption behind linear regression is that the residuals are not autocorrelated. A series is autocorrelated when it is correlated with a lagged version of itself. For example, most financial time series exhibit serial autocorrelation.
n = 50
x2 = np.linspace(0, n, n)
# Build an autocorrelated series: a random walk starting at 50
Y_autocorrelacionado = np.zeros(n)
Y_autocorrelacionado[0] = 50
for t in range(1, n):
    Y_autocorrelacionado[t] = Y_autocorrelacionado[t-1] + np.random.normal(0, 1)
modelo2, B02, B12 = reglin(x2, Y_autocorrelacionado)
residuales = modelo2.resid
plt.scatter(modelo2.predict(), residuales)
plt.axhline(0, color='red')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals');
The autocorrelation is not always obvious from the plot alone, so it is much more reliable to apply a statistical test.
Statistical Method for Detecting Autocorrelation
As with all statistical properties, we require a statistical test to ultimately decide whether there is autocorrelation in our residuals or not. To this end, we use a Ljung-Box test.
A Ljung-Box test is used to detect autocorrelation in a time series. It examines the autocorrelation at every lag up to a specified maximum and returns a test statistic and p-value for each tested lag.
# acorr_ljungbox returns the test statistic and p-value for each lag up to `lags`
ljung_box = smd.acorr_ljungbox(residuales, lags=10, return_df=True)
if (ljung_box['lb_pvalue'] < 0.05).any():
    print("The residuals are autocorrelated")
else:
    print("The residuals are not autocorrelated")
Well, that is the end of this post, which I based on the Quantopian Residual Analysis lecture; the Jupyter Notebook is available at that link. If you want to know more about this topic, I recommend "Analysis of Financial Time Series" by Ruey Tsay 🚀🚀🚀🚀.