Collinearity in multiple linear regression
"To look for three feet on the cat", meaning to split hairs, is a popular Spanish saying. When someone looks for three feet on the cat, they are trying to demonstrate something impossible, generally with tricks and deception. As English speakers say, if it ain't broke, don't fix it. In fact, the original saying referred to looking for five feet, not three. That seems more logical: since cats have four legs, finding three of them is easy, but finding five is impossible, unless we count the cat's tail as another foot, which does not make much sense.
But today we will not talk about cats with three, four or five feet. Let's talk about something a little more ethereal: multiple linear regression models. This is a cat with a lot of feet, but we are going to focus on just three of them, called collinearity, tolerance and the variance inflation factor. Don't be discouraged, it's easier than it may seem.
An introduction to the problem
We saw in a previous post how simple linear regression models relate two variables, so that variations in one of them (the independent or predictor variable) can be used to estimate how the other (the dependent variable) will change. These models are represented by the equation y = a + bx, where x is the independent variable and y the dependent variable.
Multiple linear regression, however, adds more independent variables, allowing us to predict the dependent variable from the values of several predictors at once. The generic formula would be as follows:
y = a + b1x1 + b2x2 + b3x3 + … + bnxn, where n is the number of independent variables and the b's are their coefficients.
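As a quick illustration, the equation above can be fitted by ordinary least squares. Here is a minimal sketch in Python with NumPy; the data, sample size and coefficients are all invented for the example:

```python
import numpy as np

# Invented data: two predictors and a dependent variable built from them.
rng = np.random.default_rng(42)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
y = 2.0 + 1.5 * x1 - 0.5 * x2 + rng.normal(scale=0.1, size=100)

# Fit y = a + b1*x1 + b2*x2 by ordinary least squares.
# The column of ones estimates the intercept a.
X = np.column_stack([np.ones_like(x1), x1, x2])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coefs)  # roughly [2.0, 1.5, -0.5]: the intercept a and the slopes b1, b2
```

In practice, statistical packages do this fitting (and report the diagnostics discussed below) automatically.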
One of the conditions for the multiple linear regression models to work properly is that the independent variables are actually independent and uncorrelated.
Imagine an absurd example in which we include in the model both the weight in kilograms and the weight in pounds. The two variables vary in exactly the same way; in fact, their correlation coefficient, R, will be 1, since they practically represent the same variable. Such foolish examples are hard to find in scientific work, but there are less obvious ones (for example, height and body mass index, which is calculated from weight and height) and others that are not at all evident to the researcher. This is what is called collinearity, which is nothing more than the existence of a linear association among the set of independent variables.
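To see just how extreme the kilograms-and-pounds example is, we can compute R for that pair directly. A minimal sketch in Python (the weights are invented):

```python
import numpy as np

# Invented weights in kilograms; pounds are just kilograms times 2.20462.
kg = np.array([52.0, 61.5, 70.3, 84.8, 93.1])
lb = kg * 2.20462

# Pearson correlation coefficient between the two "different" variables.
r = np.corrcoef(kg, lb)[0, 1]
print(round(r, 6))  # 1.0: both columns carry exactly the same information
```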
Collinearity in multiple linear regression
Collinearity is a serious problem for the multivariate model, because the estimates it produces become very unstable as it gets harder to separate the effect of each predictor variable.
Well, to determine whether our model suffers from collinearity, we can build a matrix showing the correlation coefficients, R, of each variable with the others. Wherever we observe high values of R, we can suspect collinearity. However, if we want to quantify it, we turn to the other two feet of the cat mentioned at the beginning: tolerance and the variance inflation factor.
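Such a correlation matrix is easy to build. A sketch with NumPy, using simulated predictors in which x3 is deliberately constructed from x1 (the variable names and data are invented):

```python
import numpy as np

# Simulated predictors: x3 is built mostly from x1, so the pair correlates strongly.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = 0.95 * x1 + 0.05 * rng.normal(size=200)

# Correlation matrix of the predictors; high off-diagonal |R| values
# are the warning sign of collinearity.
R = np.corrcoef([x1, x2, x3])
print(np.round(R, 2))
```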
If we square the coefficient R we obtain the coefficient of determination (R2), which represents the proportion of the variation (or variance) of one variable that is explained by the variation in the other. From here comes the concept of tolerance, which is calculated as the complement of R2 (1 - R2) and represents the proportion of the variability of a variable that is not explained by the rest of the independent variables included in the regression model.
In this way, the lower the tolerance, the more likely it is that collinearity exists. Collinearity is generally considered to exist when R2 is greater than 0.9 and, therefore, the tolerance is below 0.1.
We only have the third foot left to explain: the variance inflation factor. It is calculated as the inverse of the tolerance (1/T) and represents the factor by which the variance of a variable's coefficient estimate is inflated because that variable is linearly related to the rest of the predictors in the model. Of course, the greater the variance inflation factor, the greater the likelihood of collinearity. Generally, collinearity is considered to exist when a variable's inflation factor is greater than 10, or when the mean of the inflation factors of all the independent variables is much greater than 1.
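Putting the last two feet together: tolerance and the variance inflation factor for each predictor can be computed by regressing it on the remaining predictors. A sketch in Python with NumPy and simulated data, where x3 is deliberately built from x1 to force collinearity (all names and values are invented):

```python
import numpy as np

# Simulated predictors; x3 depends heavily on x1, so its tolerance will be low.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.1 * rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

def tolerance_and_vif(X, j):
    """Regress column j on the other columns; return (1 - R2, 1 / (1 - R2))."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])   # intercept plus other predictors
    coefs, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coefs
    r2 = 1 - resid.var() / y.var()                   # coefficient of determination
    tol = 1 - r2                                     # tolerance
    return tol, 1 / tol                              # tolerance and VIF

for j in range(3):
    tol, vif = tolerance_and_vif(X, j)
    print(f"x{j + 1}: tolerance = {tol:.3f}, VIF = {vif:.1f}")
```

With this construction, x2 gets a tolerance near 1 and a VIF near 1 (no collinearity), while x1 and x3 get tolerances well below 0.1 and VIFs far above 10, crossing the thresholds described above.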
And here we will leave multivariate models for today. Needless to say, everything we have described is done in practice with statistical software that calculates these parameters in a simple way.
We have seen here some aspects of multiple linear regression, perhaps the most widely used multivariate model. But there are others, such as multivariate analysis of variance (MANOVA), factor analysis, or cluster analysis. But that is another story…