- Published on
Understanding linear regression with Python
- Authors
- Name
- Christian Miño
- @Christian_sm91
Linear regression is an approach to estimating the linear relationship between two types of variables: a response variable Y and one or more explanatory variables Xi. It is one of the most widely used models and a fundamental technique for analyzing data. Our model takes the form:
Y = Xβ + ϵ
Where Y is an n×1 vector, X is an n×p matrix, β is a p×1 vector of coefficients, and ϵ is a standard normal error term. Typically we call a model with p=1 a simple linear regression and a model with p>1 a multiple linear regression.
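To make the matrix form concrete, here is a minimal numpy sketch (with made-up numbers, separate from the data we'll use below) that builds X with a column of ones for the intercept and solves the normal equations XᵀXβ̂ = XᵀY for the least-squares estimate of β:
import numpy as np

# Made-up data: n = 5 observations, p = 1 explanatory variable plus an intercept
X = np.array([[1.0,  2.0],
              [1.0,  4.0],
              [1.0,  6.0],
              [1.0,  8.0],
              [1.0, 10.0]])          # first column of ones -> intercept
Y = np.array([5.1, 9.2, 12.8, 17.1, 21.0])

# Solve the normal equations X'X b = X'Y for the least-squares coefficients
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)  # [intercept, slope]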
Whenever we build a model, there will be gaps between what a model predicts and what is observed in the sample. The differences between these values are known as the residuals of the model and can be used to check for some of the basic assumptions that go into the model. The key assumptions to check for are:
Linear Fit: The underlying relationship should be linear.
Homoscedastic: The data should have no trend in the variance.
Independent and Identically Distributed: The residuals of the regression should be independent and identically distributed (i.i.d.) and show no signs of serial correlation.
We can use the residuals to help diagnose whether the relationship we have estimated is real or spurious (that is, a false relationship).
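As a quick illustration (this example is mine, not from the lecture this post is based on), regressing two independent random walks on each other will often look like a strong relationship even though none exists, and the residuals of such a regression show exactly the kinds of patterns we check for later in this post:
import numpy as np
import statsmodels.api as sm

np.random.seed(0)  # arbitrary seed, just so the example is reproducible
n = 200

# Two independent random walks: by construction there is no real relationship between them
a = np.cumsum(np.random.normal(0, 1, n))
b = np.cumsum(np.random.normal(0, 1, n))

spurious = sm.OLS(a, sm.add_constant(b)).fit()
print(spurious.rsquared)    # often surprisingly high
print(spurious.pvalues[1])  # often "significant" even though the relationship is false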
Classical Linear Regression
First we'll define a function that performs the linear regression and plots the scatter plot with the regression line, and we'll import the needed libraries (because I'm using a Jupyter Notebook, I add the %matplotlib inline magic).
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
import statsmodels.stats.diagnostic as smd
def reglin(x, y):
    # Fit an OLS regression of y on x, including an intercept
    x = sm.add_constant(x)
    model = sm.OLS(y, x).fit()
    B0 = model.params[0]
    B1 = model.params[1]
    x = x[:, 1]
    # Plot the scatter plot together with the fitted regression line
    x2 = np.linspace(x.min(), x.max(), 100)
    y_hat = x2 * B1 + B0
    plt.scatter(x, y, alpha=1)         # scatter plot of the data
    plt.plot(x2, y_hat, 'r', alpha=1)  # regression line in red
    plt.xlabel('X Value')
    plt.ylabel('Y Value')
    return model, B0, B1
Let's create some dummy data and invoke the function with it.
# Create dummy data
n = 50
x = np.random.randint(0, 100, n)
e = np.random.normal(0, 1, n)
y = 10 + 0.5 * x + e
# Invoke the function
model, B0, B1 = reglin(x, y)
print("Line of best fit: Y = {0} + {1}*X".format(B0, B1))
This dummy example has some generated noise, but all real data will also have noise. This is inherent in sampling from any sort of wild data-generating process. As a result, our line of best fit will never exactly fit the data (which is why it is only "best", not "perfect"). Having a model that fits every single observation that you have is a sure sign of overfitting.
For all fit models, there will be a difference between what the regression model predicts and what was observed, which is where residuals come in.
Residuals
A residual is defined as the difference between what is observed in the sample and what is predicted by the regression. For any residual ri, we express this as:
ri = Yi − Ŷi
Where Yi is the observed Y-value and Ŷi is the predicted Y-value. We plot these differences on the following graph:
modelo, B0, B1 = reglin(x, y)
residuales = modelo.resid
# Draw a vertical segment from each observed point to the regression line
# (errorbar expects non-negative error values, so each residual is split into its two directions)
plt.errorbar(x, y, xerr=0,
             yerr=[np.clip(residuales, 0, None), np.clip(-residuales, 0, None)],
             linestyle="None", color='Green');
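As a quick sanity check (reusing the variables from the cell above), the residuals reported by statsmodels should match the manual difference between observed and predicted values:
# The residuals from statsmodels should equal Y - Y_hat computed by hand
residuales_manuales = y - modelo.predict()
print(np.allclose(residuales_manuales, residuales))  # expected: True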
Diagnosing Residuals
Many of the assumptions that are necessary to have a valid linear regression model can be checked by identifying patterns in the residuals of that model. We can make a quick visual check by looking at the residual plot of a given model.
With a residual plot, we look at the predicted values of the model versus the residuals themselves. What we want to see is just a cloud of unrelated points, like so:
plt.scatter(modelo.predict(), residuales)
plt.axhline(0, color='red')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
What we are looking for is a random scatter in the residuals, which suggests that a linear model gives a good fit. If we can see a trend, this might indicate the presence of autocorrelation or heteroscedasticity in the model.
Heteroscedasticity
One of the main assumptions behind a linear regression is that the underlying data has a constant variance. If some parts of the data have a different variance from other parts, the data is not appropriate for a linear regression. Heteroscedasticity is the term for data with non-constant variance, as opposed to homoscedasticity, where the data has constant variance.
Significant heteroscedasticity invalidates linear regression results by biasing the standard error of the model. As a result, we can't trust the outcomes of significance tests and confidence intervals generated from the model and its parameters.
To avoid these consequences it is important to use residual plots to check for heteroscedasticity and adjust if necessary.
n = 50
x = np.random.randint(0, 100, n)
e = np.random.normal(0, 1, n)
Y_heteroscedastico = 100 + 2*x + e*x
modelo = sm.OLS(Y_heteroscedastico, sm.add_constant(x)).fit()
B0, B1 = modelo.params
residuales = modelo.resid
plt.scatter(modelo.predict(), residuales)
plt.axhline(0, color='red')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals');
Statistical Methods for Detecting Heteroscedasticity
Usually we check for heteroscedasticity visually, but we also want to test for it statistically to confirm what we believe.
A common test to confirm the presence of heteroscedasticity is the Breusch-Pagan hypothesis test. We could also use the White test, but for now we are going to use the Breusch-Pagan test.
breusch_pagan_p = smd.het_breuschpagan(modelo.resid, modelo.model.exog)[1]
print(breusch_pagan_p)
if breusch_pagan_p < 0.05:
    print('The relationship is heteroscedastic')
else:
    print('The relationship is not heteroscedastic')
We set our significance level at α = 0.05, so a Breusch-Pagan p-value below 0.05 tells us that the relationship is heteroscedastic. Using a hypothesis test bears the risk of a false positive or a false negative, which is why it can be good to confirm with additional tests if we are skeptical.
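If we are still skeptical, the White test mentioned above can be run in much the same way, and if heteroscedasticity is confirmed, one common adjustment is to refit the model with heteroscedasticity-robust standard errors. The sketch below reuses the variables from the cells above; HC3 is just one of several robust covariance options in statsmodels:
# White test: returns (LM statistic, LM p-value, F statistic, F p-value)
white_p = smd.het_white(modelo.resid, modelo.model.exog)[1]
print(white_p)

# One common adjustment: refit with heteroscedasticity-robust (HC3) standard errors
modelo_robusto = sm.OLS(Y_heteroscedastico, sm.add_constant(x)).fit(cov_type='HC3')
print(modelo_robusto.bse)  # standard errors under the robust covariance estimate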
Autocorrelation
Another assumption behind linear regression is that the residuals are not autocorrelated. A series is autocorrelated when it is correlated with a lagged version of itself. For example, most financial time series exhibit serial autocorrelation.
n = 50
x2 = np.linspace(0, n, n)
# Build an autocorrelated series: a random walk starting at 50
Y_autocorrelacionado = np.zeros(n)
Y_autocorrelacionado[0] = 50
for t in range(1, n):
    Y_autocorrelacionado[t] = Y_autocorrelacionado[t-1] + np.random.normal(0, 1)
modelo2, B02, B12 = reglin(x2, Y_autocorrelacionado)
residuales = modelo2.resid
plt.scatter(modelo2.predict(), residuales)
plt.axhline(0, color='red')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals');
The autocorrelation is not always obvious from the plot alone, so it is much more reliable to apply a statistical test.
Statistical Method for Detecting Autocorrelation
As with all statistical properties, we require a statistical test to ultimately decide whether there is autocorrelation in our residuals or not. To this end, we use a Ljung-Box test.
A Ljung-Box test is used to detect autocorrelation in a time series. It examines the autocorrelation at every lag up to a specified maximum and returns a test statistic and p-value for each tested lag.
# acorr_ljungbox returns the test statistic and p-value for each lag up to `lags`
ljung_box = smd.acorr_ljungbox(residuales, lags=10, return_df=True)
if (ljung_box['lb_pvalue'] < 0.05).any():
    print("The residuals are autocorrelated")
else:
    print("The residuals are not autocorrelated")
Well, that is the end of this post, which I based on the Quantopian Residual Analysis lecture; the Jupyter Notebook is available at that link. If you want to know more about this topic, I recommend "Analysis of Financial Time Series" by Ruey Tsay 🚀🚀🚀🚀.