One of my neighbors has a dog that barks the whole damn day. It is the typical dog so tiny that it stands barely a palm off the ground, which does not prevent it from barking at an incredible volume, not to mention the unpleasant pitch of its “voice”.
This is what usually happens with these dwarf dogs: they bark at you as if possessed by demons as soon as they see you but, according to popular wisdom, you can rest easy, because the more they bark at you, the less likely they are to bite you. Come to think of it, I would almost say that there is an inverse correlation between the amount of barking and the probability of being bitten by one of these little animals.
And since we have mentioned the term correlation, we are going to talk about this concept and how to measure it.
What does correlation mean?
The Cambridge English Dictionary says that correlation is a connection or relationship between two or more facts, numbers, etc. Another source of wisdom, Wikipedia, says that when it comes to probability and statistics, correlation is any statistical relationship between two random variables or bivariate data.
What does it mean, then, that two variables are correlated? Well, a much simpler thing than it may seem: that the values of one of the variables change systematically when changes occur in the other. Said more simply, given two variables A and B, whenever the value of A changes in a certain direction, the value of B will also tend to change in a certain direction, which may be the same or the opposite one.
And that is what correlation means. Only that: how one variable changes with the changes of the other. This does not mean at all that there is a causal relationship between the two variables, which is an erroneous assumption that is made with some frequency. So common is this fallacy that it even has a nice Latin name, cum hoc ergo propter hoc, that can be summed up for less educated minds as “correlation does not imply causation.” The fact that two things vary together does not mean that one is the cause of the other.
Another common mistake is to confuse correlation with regression. Actually, they are two terms that are closely related. While the first, correlation, only tells us if there is a relationship between the two variables, regression analysis goes one step further and aims to find a model that allows us to predict the value of one of the variables (the dependent variable) based on the value that the other variable takes (which we will call the independent or explanatory variable). In many cases, studying whether there is correlation is the step prior to generating the regression model.
Well, everyone knows the human being’s hobby of measuring and quantifying everything, so it should surprise no one that the so-called correlation coefficients were invented, of which there is a more or less numerous family.
To calculate the correlation coefficient, we therefore need a parameter that allows us to quantify this relationship. For this, we have the covariance, which indicates the degree of common variation of two random variables.
The problem is that the value of the covariance depends on the measurement scales of the variables, which prevents us from making direct comparisons between different pairs of variables. To avoid this problem, we resort to a solution that is already known to us and which is none other than standardization. The result of standardizing the covariance is the correlation coefficient.
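To see the scale problem in action, here is a minimal sketch in plain Python. The height and weight figures are made up for illustration, and the function names are my own, not part of any library. It computes the sample covariance with heights in centimeters and then in meters, and standardizes both by dividing by the product of the two standard deviations.

```python
from statistics import mean, stdev

def covariance(x, y):
    """Sample covariance: summed co-deviations from the means over n - 1."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def standardized_covariance(x, y):
    """Covariance divided by the product of the two sample standard deviations."""
    return covariance(x, y) / (stdev(x) * stdev(y))

# Invented data, for illustration only.
height_cm = [150, 160, 165, 172, 180]
weight_kg = [52, 58, 63, 70, 79]
height_m = [h / 100 for h in height_cm]  # same heights, different unit

cov_cm = covariance(height_cm, weight_kg)
cov_m = covariance(height_m, weight_kg)
r_cm = standardized_covariance(height_cm, weight_kg)
r_m = standardized_covariance(height_m, weight_kg)

print("covariance (cm):", cov_cm)
print("covariance (m): ", cov_m)   # 100 times smaller: it depends on the unit
print("standardized (cm):", r_cm)
print("standardized (m): ", r_m)   # unchanged by the change of unit
```

The covariance shrinks by a factor of 100 just because we changed the unit, while the standardized value stays put, which is what lets us compare different pairs of variables.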
All these coefficients have something in common: their value ranges from -1 to 1. The farther the value is from 0, the greater the strength of the relationship, which will be practically perfect when it reaches -1 or 1. At 0, which is the null value, in theory there will be no correlation between the two variables.
The sign of the value of the correlation coefficient will indicate the other quality of the relationship between the two variables: the direction. When the sign is positive it will mean that the correlation is direct: when one increases or decreases, the other does so in the same way. If the sign is negative, the correlation will be inverse: when changing one variable, the other will do it in the opposite direction (if one increases, the other decreases, and vice versa).
So far we have seen two of the characteristics of the correlation between two variables: strength and direction. There is a third characteristic which depends on the type of line that defines the best fit model. In this post we are going to talk only about the simplest form, which is none other than linear correlation, in which the fit line is a straight line, but you should know that there are other non-linear fits.
Pearson’s correlation coefficient
We have already said that there is a whole series of correlation coefficients that we can calculate depending on the type of variables that we want to study and the probability distribution they follow in the population from which the sample comes.
Pearson’s correlation coefficient, also called the linear product-moment correlation coefficient, is undoubtedly the most famous of this entire family.
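As we have already said, it is nothing more than the standardized covariance. There are several ways to calculate it, but all roads lead to Rome, so I will not resist putting the formula:
$$r = \frac{\operatorname{cov}(X, Y)}{s_X \, s_Y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
As we can see, covariance (in the numerator) is standardized by dividing it by the product of the standard deviations of the two variables (in the denominator).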
In order to use Pearson’s correlation coefficient, both variables must be quantitative, be linearly correlated, be normally distributed in the population, and the assumption of homoskedasticity must be met, which means that the variance of the Y variable must be constant along the values of the X variable. An easy way to check this last assumption is to draw the scatterplot and see if the cloud is scattered similarly along the values of the X variable.
One factor to keep in mind is that the value of this coefficient can be biased by the presence of extreme values (outliers).
Spearman’s correlation coefficient
The non-parametric equivalent of Pearson’s coefficient is Spearman’s correlation coefficient. As usually occurs with non-parametric techniques, it does not use the raw data for its calculation, but their transformation into ranks.
Thus, it is used when the variables are ordinal, or when they are quantitative but do not meet the normality assumption and can be transformed into ranks.
Otherwise, its interpretation is similar to that of the rest of the coefficients. Furthermore, because it is calculated from ranks, it is less sensitive to outliers than Pearson’s coefficient.
Another advantage compared to Pearson’s is that it only requires the relationship between the two variables to be monotonic, which means that when one variable increases the other consistently increases (or consistently decreases), although not necessarily at a constant rate. This allows it to be used not only when the relationship is linear, but also in cases of logistic and exponential relationships.
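A small sketch may help to see what “only monotonic” buys us. It assumes the usual way of computing Spearman’s coefficient, which is to apply Pearson’s formula to the ranks (with tied values sharing the average rank). The helper names and the exponential example are my own illustration.

```python
import math
from statistics import mean

def pearson(x, y):
    """Pearson's r from sums of squared and cross deviations."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values receive the mean of the positions they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rho computed as Pearson's r on the ranks."""
    return pearson(ranks(x), ranks(y))

# A perfectly monotonic but clearly non-linear (exponential) relationship.
x = list(range(1, 11))
y = [math.exp(v) for v in x]

r_linear = pearson(x, y)
rho = spearman(x, y)
print("Pearson: ", r_linear)  # noticeably below 1: the cloud is not a straight line
print("Spearman:", rho)       # the ordering is preserved perfectly
```

Since y always grows when x grows, the ranks of both variables coincide and Spearman’s coefficient reaches its maximum, while Pearson’s is penalized because the points do not follow a straight line.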
Kendall’s tau coefficient
Another coefficient that also uses the ranks of the variables is Kendall’s τ coefficient. Being a non-parametric coefficient, it is also an alternative to Pearson’s coefficient when the assumption of normality is not fulfilled, and it is more advisable than Spearman’s when the sample is small and when there are many ties in the ranks, which means that many data occupy the same position in the ranking.
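For the curious, this is roughly how the coefficient deals with ties. The sketch below implements the tau-b variant, which is the one usually reported when there are ties. This is my assumption, since we have not specified any variant here. It compares every pair of observations and counts them as concordant, discordant or tied. The function name and the small data set are invented.

```python
import math

def kendall_tau_b(x, y):
    """Kendall's tau-b: (concordant - discordant) pairs, corrected for ties."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue              # tied in both variables: pair is ignored
            elif dx == 0:
                only_x_tied += 1      # tied only in x
            elif dy == 0:
                only_y_tied += 1      # tied only in y
            elif dx * dy > 0:
                concordant += 1       # both variables move in the same direction
            else:
                discordant += 1       # they move in opposite directions
    untied = concordant + discordant
    # Note: the denominator is zero if one variable is constant (not handled here).
    denominator = math.sqrt((untied + only_x_tied) * (untied + only_y_tied))
    return (concordant - discordant) / denominator

# Invented ordinal scores with ties in both variables.
scores_a = [1, 2, 2, 3]
scores_b = [1, 2, 3, 3]
tau = kendall_tau_b(scores_a, scores_b)
print("Kendall tau-b:", tau)
```

With these four observations there are four concordant pairs, none discordant and one tie in each variable, so the coefficient stays below 1 even though no pair is out of order.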
But there is still more…
Although there are some more, I am only going to refer specifically to three of them, useful for studying quantitative variables:
- Partial correlation coefficient. This coefficient studies the relationship between two variables while taking into account and removing the influence of other variables.
The simplest case is to study two variables X1 and X2, eliminating the effect of a third variable X3. In this case, if the correlations between X1 and X3 and between X2 and X3 are equal to zero, the same value is obtained for the partial correlation coefficient as if we calculate the Pearson’s coefficient between X1 and X2.
In the event that we want to control for more variables, the formula, which I do not intend to write, becomes more complex, so it is best to let a statistical program calculate it.
If the value of the partial coefficient is less than that of the Pearson’s coefficient, it means that the correlation between both variables is partially due to the other variables that we are controlling. When the partial coefficient is greater than Pearson’s, the variables that are controlled mask the relationship between the two variables of interest.
- Semi-partial correlation coefficient. It is similar to the previous one, but the semi-partial coefficient evaluates the association between two variables by controlling the effect of a third on only one of the two variables of interest (not on both, as the partial coefficient does).
- Multiple correlation coefficient. This lets us measure the correlation between one variable and a set of two or more variables, all of them quantitative.
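For the simplest case of the partial and semi-partial coefficients (two variables of interest and a single controlled one), the standard first-order formulas only need the three pairwise Pearson coefficients. The sketch below uses them. The function names and the coefficient values are invented for illustration, and r12, r13 and r23 stand for the correlations between X1 and X2, X1 and X3, and X2 and X3.

```python
import math

def partial_corr(r12, r13, r23):
    """Correlation between X1 and X2 after removing X3 from both."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

def semipartial_corr(r12, r13, r23):
    """Correlation between X1 and X2 after removing X3 from X2 only."""
    return (r12 - r13 * r23) / math.sqrt(1 - r23 ** 2)

# X3 unrelated to both variables: the partial coefficient equals Pearson's.
unrelated = partial_corr(0.6, 0.0, 0.0)

# X3 strongly related to both: much of the original 0.5 was due to X3.
explained = partial_corr(0.5, 0.7, 0.7)

# X3 pulls in opposite directions: it was masking the relationship (0.2).
masked = partial_corr(0.2, 0.6, -0.5)

semi = semipartial_corr(0.5, 0.7, 0.7)

print(unrelated, explained, masked, semi)
```

The three calls to partial_corr reproduce the three situations described in the list: a partial coefficient equal to, lower than and greater than Pearson’s.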
And I think we have enough with this, for now. There are a few more coefficients that are useful for special situations. Whoever is curious can look for them in a thick statistics book and be confident of finding them.
Significance and interpretation
We already said at the beginning that the value of these coefficients could range from -1 to 1, with -1 being the perfect negative correlation and 1 being the perfect positive correlation.
We can draw a parallel between the value of the coefficient and the strength of the association, which is nothing but the effect size. Thus, in absolute value, 0 indicates null association, 0.1 small association, 0.3 medium, 0.5 moderate, 0.7 high and 0.9 very high association.
To finish, it must be said that, in order to give value to the coefficient, it must be statistically significant. You know that we always work with samples, but what we are interested in is inferring the value in the population, so we have to calculate the confidence interval of the coefficient that we have used. If this interval includes the null value (zero), or if the program calculates the value of p and it is greater than 0.05, it will not make sense to interpret the coefficient, even if it is close to -1 or 1.
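As an illustration of why the sample size matters here, the following sketch computes an approximate 95% confidence interval for Pearson’s coefficient using Fisher’s z transformation. This is one common approach that I am assuming for the example, not necessarily what your statistical program does, and it relies on the normality assumption mentioned above. The function name and the numbers are invented.

```python
import math

def pearson_confidence_interval(r, n, z_crit=1.96):
    """Approximate CI for Pearson's r via Fisher's z transformation.

    r: sample correlation coefficient; n: sample size (must be > 3);
    z_crit: normal quantile, 1.96 for a 95% interval.
    """
    z = math.atanh(r)                 # Fisher transformation of r
    se = 1 / math.sqrt(n - 3)         # approximate standard error of z
    lower = math.tanh(z - z_crit * se)
    upper = math.tanh(z + z_crit * se)
    return lower, upper

# A very high coefficient obtained from only four observations.
tiny_sample = pearson_confidence_interval(0.9, 4)

# A modest coefficient obtained from two hundred observations.
large_sample = pearson_confidence_interval(0.3, 200)

print("r = 0.9, n = 4:  ", tiny_sample)   # the interval includes zero
print("r = 0.3, n = 200:", large_sample)  # the interval excludes zero
```

The spectacular 0.9 from four observations has an interval that crosses zero, so it tells us little about the population, while the humble 0.3 from two hundred observations does not.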
We are leaving…
And here we leave it for today. We have not discussed anything about using Pearson’s correlation coefficient to compare the precision of a diagnostic test. And we have not said anything because we should not use this coefficient for this purpose. Pearson’s coefficient is highly dependent on intra-subject variability and can give a very high value when one of the measurements is systematically greater than the other, even though there is no good concordance between the two. For this, it is much more appropriate to use the intraclass correlation coefficient, a better estimator of the concordance between repeated measures. But that is another story…