Blinders are pieces that are put over the eyes of some draft animals, such as donkeys or horses. Their purpose is none other than to get the animal to focus only on the road ahead, without being distracted by other things that it could see through its peripheral vision and that are less important for its task.

I always feel a bit sad to see them like that, pulling the cart with their eyes half covered. But, making an effort, I can understand the usefulness of the device, especially in areas with heavy traffic, where the animal could be frightened if it could see everything around it.

And this issue leads me to think of other blinders, symbolic ones this time, that so-called human beings often wear, limiting their vision, frequently without any clear benefit. I’m referring to the obsession with statistical significance, one of those blinders that someone put on us at some point and that we should take off to get the bigger picture.

When we read a clinical trial, it is a very common custom to look for the p-value to see if it is statistically significant, even before looking at the result of the study outcome variable and evaluating the methodological quality of the trial. Leaving aside the clinical relevance of the results (to which we will return shortly), this is not a recommended practice.

First, the significance threshold is totally arbitrary and, moreover, we always have some probability of making an error, whatever we decide after knowing the p-value. Furthermore, the value of p depends, among other factors, on the sample size and the number of events we observe, which can also vary by chance.

In this sense, we already saw in a previous post how some authors thought of developing a fragility index, which gives an approximation of how the p-value and its statistical significance could be modified, if some of the trial participants had had another outcome.

The fragility index would thus be defined as the minimum number of changes in the participants’ results that would change the statistical significance of the trial (from significant to non-significant, or vice versa). Studies with lower index values would be considered more fragile, as minor modifications of the results would eliminate their significance.

This new approach has the merit of not basing the assessment of the study solely on the p-value obtained. In general, we will feel more comfortable the higher the fragility index since it would take many more changes for the p to stop being significant. However, we are forgetting two fundamental aspects. First, how likely it is that these changes in the results will occur. Second, the clinical relevance of the effect size observed in the study.

Let’s suppose that we do a clinical trial to assess two treatment alternatives for that terrible disease that is fildulastrosis. So as not to fret over drug names, we are going to call these two alternatives A and B.

We recruited 295 patients and distributed them randomly between the two arms of the trial, 145 for treatment A and 150 for treatment B.

At the end of the study we obtain the results that you can see in the first contingency table. In group A, 5 patients were healed, while in group B none were healed. The probability of being healed in group A was therefore 3.45%, while that of B was 0%. At first glance, it seems that there was a greater probability of being healed in group A and, indeed, if we perform a Fisher’s exact test it gives us a value of p = 0.027 for a two-sided test.

In conclusion, since p < 0.05, we reject the null hypothesis which, for Fisher’s test, assumes that the probability of healing is equal in the two groups. In other words, there is a statistically significant difference, so we conclude that treatment A was more effective in healing fildulastrosis.

But what if a participant in group B had been healed? You can see it in the second contingency table.

The probability of being healed in group A would continue to be 3.45%, while that of B would be, in this case, 0.67%. It appears that A is still better, but if we do Fisher’s exact test again, the p-value for a two-sided test is now 0.11.

What happened? The difference is no longer statistically significant after changing the outcome of just one of the 295 participants. The fragility index would be equal to 1, so we would consider the initial result fragile.
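For those who want to check the numbers, here is a minimal sketch in Python (standard library only) of the two Fisher's exact tests and of the fragility index. The function names `fisher_two_sided` and `fragility_index` are my own, the two-sided p-value is taken as the sum of all tables no more probable than the observed one, and the fragility loop is a simplification that only changes outcomes in arm B (the arm with fewer events).

```python
from math import comb


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are the trial arms; columns are healed / not healed.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x healed patients in the first arm.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Add up every possible table that is no more probable than the observed one.
    return sum(p for p in (prob(x) for x in range(lowest, highest + 1))
               if p <= observed * (1 + 1e-9))


def fragility_index(events_a, n_a, events_b, n_b, alpha=0.05):
    """Changes needed in arm B before p reaches alpha (simplified version)."""
    changes = 0
    while (events_b < n_b and
           fisher_two_sided(events_a, n_a - events_a,
                            events_b, n_b - events_b) < alpha):
        events_b += 1  # one more participant of arm B is counted as healed
        changes += 1
    return changes


p_first = fisher_two_sided(5, 140, 0, 150)   # first contingency table
p_second = fisher_two_sided(5, 140, 1, 149)  # second contingency table
print(p_first, p_second, fragility_index(5, 145, 0, 150))
```

The two p-values should land close to the ones quoted in the text, on either side of the 0.05 threshold, and the loop stops after a single change.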

Now I ask myself: are we considering everything that we should? I would say not. Let’s see.

Our initial study, if we rely solely on the fragility index, would be considered fragile, which we could express as having an unstable statistical significance.

But this argument is a bit fallacious, since we are not taking into account how likely it is that this change will occur in one of the participants.

Suppose that, from previous studies, we know that the probability of healing the disease without treatment is 0.1%. We can use a binomial probability calculator to run a few numbers. For example, the probability that none of the 150 participants in group B will be healed (the first scenario) is 86%. Similarly, the probability that exactly 1 is healed is 13%.

And this is where the fallacy lies: we are assessing the fragility of statistical significance by comparing the result that we have observed with an alternative one whose probability of occurrence is much lower. In conclusion, it does not seem reasonable to define the fragility of the finding without assessing the likelihood of the minimal change that would modify the statistical significance.

Now imagine that the probability of being healed without treatment was 1%. The probability of not observing any healing among 150 patients would be 22%, while that of exactly 1 being healed rises to 33%. Here we could say that the study provides a fragile significance (the alternative outcome is more likely than the observed one).
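If you do not have a binomial calculator at hand, the figures above can be reproduced with a few lines of Python. This is only a sketch: `binom_pmf` is a helper name of my own, and the two spontaneous healing rates are the ones assumed in the text.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


n_group_b = 150
for healing_rate in (0.001, 0.01):
    p_none = binom_pmf(0, n_group_b, healing_rate)
    p_one = binom_pmf(1, n_group_b, healing_rate)
    print(f"rate={healing_rate}: P(0 healed)={p_none:.4f}, P(1 healed)={p_one:.4f}")
```

With the lower rate, observing no healings is far more likely than observing one; with the higher rate, the order is reversed.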

To finish doing things well and really widen our field of vision, we should not be satisfied only with statistical significance, but we should also assess the clinical relevance of the result.

In this sense, some authors have proposed that, before calculating the statistical significance of the observed effect, the clinical relevance threshold should have been established. Thus, the minimum important difference between the two groups is defined.

If the effect we detect exceeds this minimal difference between the two groups, we can say that the effect is quantitatively significant. This quantitative significance has nothing to do with statistical significance; it only implies that the observed effect is greater than that considered important from a clinical point of view.

In order not to get confused with the two meanings, we are going to call this quantitative significance by its true name: clinical relevance.

We are going to try to put together the three aspects that we have dealt with so far.

If the p-value of the observed effect is less than 0.05, we can start by stating that this difference is statistically significant. Next, we will consider the fragility and clinical relevance of the result.

If the effect is not clinically relevant it will not make sense to spend more time on it, even if the p is significant.

But if the effect is clinically relevant, now we will no longer be content with calculating how many changes have to occur to modify statistical significance (and how likely it is that those changes will occur), but we will have to calculate how many changes must occur to lose that minimal clinically relevant difference.

If that number is greater than the fragility index, the result may be statistically unstable, but stable from the point of view of the clinical significance of the result.

On the contrary, a slight change in the outcomes will make the magnitude of the effect considered relevant disappear if the study is quantitatively unstable. If these changes can occur with a reasonably high probability, we will not have much confidence in the results of the study, regardless of their statistical significance.
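To make the idea of quantitative stability a little more tangible, here is a rough Python sketch applied to our fildulastrosis trial. Everything in it is an assumption made for illustration: the function name `changes_to_lose_relevance`, the use of the risk difference as the effect measure, the hypothetical minimum important difference of 2 percentage points, and the simplification of only changing outcomes in arm B.

```python
def changes_to_lose_relevance(events_a, n_a, events_b, n_b, mid):
    """Outcome changes in arm B needed for the risk difference to drop below mid."""
    changes = 0
    while events_b < n_b and events_a / n_a - events_b / n_b >= mid:
        events_b += 1  # one more participant of arm B is counted as healed
        changes += 1
    return changes


# Hypothetical minimum important difference: 2 percentage points.
needed = changes_to_lose_relevance(5, 145, 0, 150, mid=0.02)
print(needed)
```

Under these assumptions the number of changes needed to lose clinical relevance is larger than the single change that erased statistical significance, which is the situation described above: statistically unstable, but more stable from the clinical point of view.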

To summarize everything we have said, when the time comes to assess the results of a clinical trial, we can follow these four steps:

- Assess statistical significance. Here we must not lose sight of the fact that reaching significance may be a matter of increasing the sample size sufficiently.
- Determine the clinical significance. The reference is the minimum relevant difference that we want to observe between the two groups, taking into account the criteria of clinical relevance of the effect.
- Assess quantitative stability. Determine the number of changes that can modify the clinical significance of the results.
- Determine if the study is fragile or stable. How many changes are needed to reverse statistical significance (the fragility index that we started this whole thing with).

And here we are going to end this post, long and dense, but one that deals with an important issue that our blinders prevent us from assessing properly.

All of the above refers to clinical trials, although this problem also applies to meta-analyses, where the overall outcome measure can also change radically with changes in the results of some of the primary studies in the review. For this reason, some indices have also been developed, such as Rosenthal’s fail-safe N or, also considering the clinical relevance, Orwin’s fail-safe N. But that is another story…

We live in a crazy world, always running from here to there and always with a thousand things in mind. Thus, it is not uncommon for us to leave many of our tasks half finished.

This will be of little importance on some occasions, but there will be others in which leaving things half done will make the half we have done useless.

And this is precisely what happens when we apply this sloppiness to our topic of today: we do an experiment, we calculate a regression line and we just start applying it, forgetting to run the diagnostics of the regression model.

In these cases, leaving things half done may have the consequence that we apply to our population a predictive model that, in reality, may not be valid.

We already saw in a previous post how to build a simple linear regression model. As we already know, simple linear regression allows us to estimate what the value of a dependent variable will be based on the value taken by a second variable, which will be the independent one, provided that there is a linear relationship between the two variables.

We also saw in an example how a regression model could allow us to estimate what the height of a tree would be if we only know the volume of the trunk, even if we did not have any tree with that volume available.

No wonder, then, that the prediction capabilities of regression models are widely used in biomedical research. And that’s fine, but the problem is that, the vast majority of the time, authors who use regression models to communicate their studies’ results forget about the validation and diagnostics of the regression model.

And at this point, some may wonder: do regression models have to be validated? If we already have the coefficients of the model, do we have to do something else?

Well, yes, it is not enough to obtain the coefficients of the line and start making predictions. To be sure that the model is valid, a series of assumptions must be checked. This process is known as validation and diagnostics of the regression model.

We must never forget that we usually deal with samples, but what we really want is to make inferences about the population from which the sample comes, which we cannot access in its entirety.

Once we calculate the coefficients of the regression line using, for example, the least squares method, and we see that their values are different from zero, we must ask ourselves whether it is possible that in the population those values are zero and that the values we have found in our sample are due to random fluctuations.

And how can we know this? Very easily: we will run a hypothesis test on the two coefficients of the line, with the null hypothesis that the coefficients are, in fact, zero:

H_{0}: β_{0} = 0 and H_{0}: β_{1} = 0

If we can reject both null hypotheses, we can apply the regression line that we have obtained to our population.

If we cannot reject H_{0} for β_{0}, the constant (intercept) of the model will not be valid. We can still apply the linear equation, but assuming that the line passes through the origin of coordinates. But if we have the misfortune of not being able to reject the null hypothesis for the slope (or for both coefficients), we will not be able to apply the model to the population: the independent variable will not allow us to predict the value of the dependent variable.

This hypothesis test can be done in two ways:

- If we divide each coefficient by its standard error, we will obtain a statistic that follows a Student’s t distribution with n-2 degrees of freedom. We can calculate the p-value associated with that value and solve the hypothesis test by rejecting the null hypothesis if p<0.05.
- A slightly more complex way is to base the hypothesis test on an analysis of variance (ANOVA). This method considers that the variability of the dependent variable is decomposed into two terms: one explained by the independent variable and another that is not assigned to any source and is considered unexplained (random).

It is possible to estimate the variance of both components, explained and unexplained. If the variation due to the independent variable does not exceed that due to chance, the explained / unexplained ratio will have a value close to one. Otherwise, it will move away from unity, and the further it moves, the better the independent variable predicts the dependent variable.

When the slope (the coefficient β_{1}) is equal to zero (under the assumption of the null hypothesis), this quotient follows a Snedecor’s F distribution with 1 and n-2 degrees of freedom. As with the previous method, we can calculate the p-value associated with the value of F and reject the null hypothesis if p <0.05.
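The two approaches can be sketched by hand in Python. This is an illustration under my own assumptions: the function name `simple_ols` and the five data points are invented, and the p-values are left out because the t and F distribution functions are not part of the Python standard library. What the sketch does show is that, in simple linear regression, the F statistic is the square of the t statistic of the slope, so both methods lead to the same decision about β_{1}.

```python
from math import sqrt


def simple_ols(x, y):
    """Least squares line with the t statistics of both coefficients and the F ratio."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx                 # slope
    b0 = mean_y - b1 * mean_x      # intercept
    fitted = [b0 + b1 * xi for xi in x]
    ss_unexplained = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_explained = sum((fi - mean_y) ** 2 for fi in fitted)
    mse = ss_unexplained / (n - 2)  # unexplained variance, n-2 degrees of freedom
    se_b1 = sqrt(mse / sxx)
    se_b0 = sqrt(mse * (1 / n + mean_x ** 2 / sxx))
    return {
        "b0": b0, "b1": b1,
        "t_b0": b0 / se_b0, "t_b1": b1 / se_b1,  # coefficient / standard error
        "F": ss_explained / mse,                 # explained / unexplained (1 and n-2 df)
    }


# Invented data, only to exercise the function.
fit = simple_ols([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print(fit)
```

To turn these statistics into p-values we would compare them with a Student's t distribution with n-2 degrees of freedom and a Snedecor's F distribution with 1 and n-2 degrees of freedom, as described above.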

We are going to try to understand a little better what we have just explained by using a practical example. To do this, we are going to use the statistical program R and one of its data sets, “trees”, which collects the circumference, volume and height of 31 trees.

We load the data set, execute the lm() function to calculate the regression model and obtain its summary with the summary() function, as you can see in the attached figure.

If you look closely, the program shows the point estimate of the coefficients together with their standard errors. This is accompanied by the values of the t statistic with their statistical significance. In both cases p < 0.05, so we reject the null hypothesis for the two coefficients of the equation of the line. In other words, both coefficients are statistically significant.

Next, R provides us with a series of data (the standard deviation of the residuals, the square of the multiple correlation coefficient or coefficient of determination, and its adjusted value), among which is the F test to validate the model. There are no surprises: p is less than 0.05, so we can reject the null hypothesis. The coefficient β_{1} is statistically significant and the independent variable allows us to predict the values of the dependent variable.

Everything we have seen so far is usually provided by statistical programs when we ask for the regression model. But we cannot leave the task half done. Once we have verified that the coefficients are significant, we must check that a series of necessary assumptions are met for the model to be valid.

These assumptions are four: linearity, homoscedasticity, normality and independence. Here, even if we use a statistics program, we will have to work a little harder to verify these assumptions and make a correct diagnosis of the regression model.

As we have already commented, the relationship between the dependent and independent variables must be linear. This can be seen with something as simple as a scatter plot, which shows us what the relationship looks like in the range of observed values of the independent variable.

If we see that the relationship is not linear and we are very determined to use a linear regression model, we can try to transform the variables and see if the points are then distributed, more or less, along a line.

A numerical method that enables the assumption of linearity to be tested is Ramsey’s RESET test. This test checks whether it is necessary to introduce quadratic or cubic terms so that the systematic patterns in the residuals disappear. Let’s see what this means.

The residual is the difference between a real value of the dependent variable observed in the experiment and the value estimated by the regression model. In the previous image that shows the result of the summary() function of R we can see the distribution of the residuals.

For the model to be correct, the median of the residuals must be close to zero and their absolute values must be spread symmetrically across the quartiles (similar for the minimum and the maximum, and for the first and third quartiles). In other words, if the model is correct, the residuals follow a normal distribution whose mean is zero.
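As an illustration of what that block of the summary contains, the following Python sketch fits a line to a handful of invented points, computes the residuals (observed value minus value estimated by the model) and prints their minimum, quartiles and maximum. The data and variable names are assumptions of mine, and the quartiles are computed with the "inclusive" method of the standard library, which may differ slightly from what other programs report.

```python
from statistics import quantiles

# Invented data, only for illustration.
x = [1, 2, 3, 4, 5, 6]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 5.8]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x

# Residual = observed value - value estimated by the regression line.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

q1, med, q3 = quantiles(residuals, n=4, method="inclusive")
print("Min:", min(residuals), "1Q:", q1, "Median:", med,
      "3Q:", q3, "Max:", max(residuals))
```

In a well-specified model we would expect the median to be near zero and the first and third quartiles (and the minimum and maximum) to be roughly mirror images of each other.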

If we see that this is not the case, the residuals will be systematically biased and the model will be incorrectly specified. Logically, if the model is not linear, this bias of the residuals could be corrected by introducing a quadratic or cubic term into the equation of the line. Of course, then, it would no longer be a linear regression nor the equation of a line.

The null hypothesis of Ramsey’s test says that the coefficients of the quadratic term, the cubic term, or both are equal to zero (they can be tested together or separately). If we cannot reject the null hypothesis, the model is assumed to be correctly specified. Otherwise, if we reject the null hypothesis, the model will have specification errors and will have to be revised.

As we have already mentioned, the residuals must be distributed homogeneously for all the values of the predictor variable (homoscedasticity).

This can be verified in a simple way with a scatter plot that represents, on the abscissa axis, the estimates of the dependent variable for the different values of the independent variable and, on the ordinate axis, the corresponding residuals. The homoscedasticity assumption will be accepted if the residuals are randomly distributed, in which case we will see a cloud of points with a similar spread throughout the range of the observations of the independent variable.

We also have numerical methods to test the assumption of homoscedasticity, such as the Breusch-Pagan-Godfrey test, whose null hypothesis states that this assumption is satisfied.

We have also said it already: the residuals must follow a normal distribution.

A simple way to check this would be to draw the quantile-quantile plot of the residuals against the theoretical quantiles, in which we should see the points lying along the diagonal of the graph.

We can also use a numerical method, such as the Kolmogorov-Smirnov test or the Shapiro-Wilk test.

Finally, the residuals must be independent of each other and there must be no correlation among them.

This can be tested by carrying out the Durbin-Watson test, whose null hypothesis assumes precisely that the residuals are independent.
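The Durbin-Watson statistic itself is easy to compute: it is the sum of the squared differences between consecutive residuals divided by the sum of the squared residuals. The Python sketch below (the function name `durbin_watson` and the toy residuals are mine) only computes the statistic; the p-value that statistical programs report is not attempted here.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    numerator = sum((residuals[i] - residuals[i - 1]) ** 2
                    for i in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Toy residuals: alternating signs (negative autocorrelation) and
# identical values (extreme positive autocorrelation).
print(durbin_watson([1, -1, 1, -1]))
print(durbin_watson([1, 1, 1, 1]))
```

The statistic ranges from 0 to 4. Values close to 2 are compatible with no first-order autocorrelation, values towards 0 suggest positive autocorrelation and values towards 4 suggest negative autocorrelation.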

To finish this post, we are going to run the diagnostics of the regression model that we have used above with our trees. To make it suitable for all audiences, this time we will use the R-Commander interface, thus avoiding writing on the command line, which is always more unpleasant.

For those of you who don’t know R very well, the first screen shows the preliminary steps to load the data and calculate the regression model.

Let’s start with the diagnosis of the model.

To check if the assumption of linearity is fulfilled, we start by drawing the scatter plot with the two variables (menu options Graphs-> Scatter plot). If we look at the graph, we see that the points are distributed, more or less, along a line in an upward direction to the right.

If we want to use the numerical method, we select the menu option Models-> Numerical diagnostics-> Non-linearity RESET test. R gives us a RESET value = 2.52, with p = 0.09. As p > 0.05, we cannot reject the null hypothesis that the model is linear, thereby corroborating the impression we obtained with the graphical method.

Let’s go with homoscedasticity. For the graphical method we resort to the menu option Models-> Graphs-> Basic diagnostic graphs. The program provides us with 4 graphs, but now we will only look at the first one, which represents the values predicted by the model of the dependent variable against the residuals.

As can be seen, the dispersion of the points is much greater for the lower values of the dependent variable, so I would not be very confident that the homoscedasticity assumption is fulfilled. The points should be distributed homogeneously over the entire range of values of the dependent variable.

Let’s see what the numerical method says. We select the menu option Models-> Numerical diagnostics-> Breusch-Pagan test for heteroscedasticity. The value of the BP statistic that R gives us is 2.76, with a p-value = 0.09. Since p > 0.05, we cannot reject the null hypothesis, so we assume that the homoscedasticity assumption holds.

We move on to check the normality of the residuals. For the graphical diagnostic method, we select the menu option Graphs-> Graph of comparison of quantiles. This time there is no doubt: the points are distributed along the diagonal.

Finally, let’s check the independence assumption.

We select the option Models-> Numerical diagnostics-> Durbin-Watson test for autocorrelation. The alternative hypothesis that rho is different from zero is usually selected, since it is rare to know the direction of the autocorrelation of the residuals, if it exists. We do it like this and R gives us a value of the statistic DW = 1.53, with a p-value = 0.12.

Consequently, we cannot reject the null hypothesis that the residuals are independent, thus fulfilling the last condition to consider the model as valid.

And here we are going to leave it for today. Seeing how laborious this whole procedure is, one can fall into the temptation to forgive and even understand the authors who hide the diagnostics of their regression models from us. But this excuse is not valid: statistical programs do all this with hardly any effort on our part.

Do not think that with everything we have explained we have done everything we should before applying a simple linear regression model with confidence.

For example, it would not hurt to assess whether there are influential observations that may have a greater weight in the formulation of the model. Or if there are extreme values (outliers) that can distort the estimate of the slope of the regression line. But that is another story…

Throughout the history of art we repeatedly encounter the so-called *horror vacui* which, for those of you who were not fortunate enough to study Latin in your younger years, is nothing more than the fear of emptiness.

There are numerous examples of pieces of art in which an obsessive effort can be seen to fill the entire space with any element, leaving nothing to emptiness. Think of the Islamic decoration or the art of the Rococo period or, above all, the ornate decoration of the Victorian period.

And why do I tell you all this? Well, because today’s topic reminded me of it. It has to do with discrete and continuous probability distributions, how the former allow this emptiness and the latter do not, and how things get complicated when we use one kind to approximate the other. This seems like a tongue twister, but let nobody despair; let’s see if we can clarify it.

Probably the most frequently used hypothesis test is the chi-square test of independence, which we use to compare the proportions of two qualitative variables and try to determine whether both variables are associated or independent.

As we all know, we construct a contingency table with the observed values, we calculate the expected values under the assumption of the null hypothesis that the two variables are independent, and finally, we calculate the probability (under the null hypothesis) of observing by chance a table as far or more from the theoretical than the one we have observed in our experiment.

The problem arises with the indiscriminate use of the test, which sometimes leads us to forget that the statistic we use for the test, the chi-square, follows an approximate distribution that is only useful when the number of observations is relatively large, but that loses effectiveness when the information we have is scarce, which happens fairly often.

Therefore, once the contingency table is built, we check that there are no cells with expected frequencies less than 5. If there are, we have two ways to solve the problem.

The first way to fix this is to use an exact test, such as Fisher’s exact test.

The exact tests calculate the probability directly, generating all the possible scenarios in which the condition we want to study occurs. This is done by constructing all the contingency tables that are at least as extreme as the one observed and that follow the direction of the association of the observed table.

Once this exact probability has been calculated, it is compared with the level of statistical significance and the hypothesis test is solved.

The problem with these methods is that they are much more laborious, which made them difficult to use until the necessary computing power became available. This explains the predilection for the use of approximate tests such as the chi-square test.

We said there were two ways to solve the problem of scarce data. Well, the second way is to apply Yates’ continuity correction, which involves subtracting 0.5 from the absolute difference between observed and expected values when calculating the value of the chi-square statistic.

Everyone knows the Yates’ correction, as popular as the chi-square test, no doubt. But let those who know exactly what a continuity correction is, such as Yates’, raise their hands.

To understand it well, we first have to know what kind of probability distribution we are dealing with.

Quantitative variables can be continuous or discrete. A variable is continuous when, between two values of the variable, there are infinite (at least, in theory) possible values. For example, consider the weight of a newborn. It can weigh 3 kg and it can weigh, say, 4 kg. But between 3 and 4 kg there are infinite possible weight values (although in practice this infinity is limited to the number that the precision of our scale allows us).

Now let’s think about the number of children one can have. One can have 2, have 3, or a different number, but what you cannot do is have a number of children between two and three, for example 2.5 (I know that sometimes we see this kind of thing, but it is a resource that facilitates the analysis of the variable and is meaningless from the point of view of everyday life).

The same is true for probability distributions. Between the values 3 and 4 of a discrete probability distribution there is a complete gap. However, continuous distributions are like a Victorian bedroom and suffer from *horror vacui*: between 3 and 4 there is a whole range of possible values.

This, in itself, is not a problem. The problem arises when we have a hypothesis test that would require a discrete distribution for its resolution and we make an approximation using a continuous distribution. Let’s see an example.

Suppose we are working with a discrete distribution, for example, a binomial defined by n and p: B(n, p). It is very common that, to simplify probability calculations, when the sample size is large and the probability of the event is around 0.5, we approximate the solution using a normal distribution. Hence, when np and n(1-p) are greater than 5 we can approximate the binomial with a normal of mean np and standard deviation equal to the square root of np(1-p).

This makes the calculations easier for us, but we are going from using a discrete distribution to using a continuous one, which has its consequences, as we will see.

In a discrete distribution, obtaining the probability of x> 3 is straightforward. In a continuous one, things get complicated, since we go from the emptiness between two points of the discrete to the full interval of possible values of the continuous distribution.

Let’s go back to calculating the probability that the variable is greater than 3: P(x>3). From 3 downwards there is no problem, with either the continuous or the discrete distribution. From 4 upwards, no problem either. But between 3 and 4 there used to be a void that has now been filled. How do we solve it? Easy: by giving half of the interval to each of the values on either side. In this way, P(x>3) would be calculated in the normal approximation as P(x≥3.5), leaving out the half of the interval just above 3, which belongs to the value 3 and is therefore not part of the probability we want. And, shhhh, we just applied Yates’s continuity correction.

If we want to calculate P(x≥3), the calculation would include 3, so we would have to go back to the lower half of the empty interval and calculate it as P(x≥2.5). Following the same reasoning, P(x≤3) = P(x≤3.5), including half the interval above 3. And the probability that x is equal to 3? We will have to take the two halves of the interval: P(2.5≤x≤3.5).
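To see what the correction buys us, here is a small Python sketch that compares the exact binomial value of P(x>3) with its normal approximation, with and without the half-unit shift. The choice of B(12, 0.5), which satisfies np > 5 and n(1-p) > 5, and the helper name `normal_cdf` are assumptions of mine for the sake of the example.

```python
from math import comb, erf, sqrt


def normal_cdf(x, mean, sd):
    """Cumulative probability of a normal distribution up to x."""
    return 0.5 * (1 + erf((x - mean) / (sd * sqrt(2))))


n, p = 12, 0.5
mean, sd = n * p, sqrt(n * p * (1 - p))

# Exact discrete probability: P(x > 3) = P(x >= 4).
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(4, n + 1))

# Normal approximation without and with the continuity correction.
uncorrected = 1 - normal_cdf(3, mean, sd)    # P(x > 3) taken literally
corrected = 1 - normal_cdf(3.5, mean, sd)    # P(x >= 3.5)

print(exact, uncorrected, corrected)
```

In this example the corrected approximation falls much closer to the exact binomial probability than the uncorrected one.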

We have already seen, then, that we will apply the Yates’ continuity correction when we want to go from a discrete to a continuous distribution. When we work with variables that follow a continuous distribution, we do not have to apply any correction. For example, if in a normal distribution we want to calculate P(x = 3), let no one think of calculating the probability of the interval from 2.5 to 3.5. In this case, it is wrong to apply the continuity correction. The P(x = 3) in a normal distribution is equal to zero. If we think about it, the probability is the area under the curve and, below a point, there is no area.

Nor is it necessary when we go from one discrete distribution to another that is also discrete. An example would be approximating a binomial with a Poisson distribution (when np<5). The continuity correction only applies when going from a discrete to a continuous distribution.

Now that we know what Yates’ continuity correction is, let’s see why it should be applied when the frequencies of the cells in the chi-square table are low.

The exact probability with small samples is calculated using discrete probability distributions, such as the hypergeometric, the negative binomial, and others that we can choose based on how the data were sampled. When the sample is small and we use the chi-square test, we are making an approximation with a known probability distribution, the chi-square distribution, which, as those of you still awake will have already guessed, is a continuous probability distribution.

We go from a discrete distribution to a continuous one, so we have to apply the continuity correction. Thus, we try to compensate for the mismatches that occur when the probability distribution of the observed frequencies, which is discrete, is approximated by another of a continuous nature.
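As a sketch of how the correction changes the result, the following Python code computes the chi-square statistic of a 2x2 table with and without Yates' correction. The table is invented (two of its expected frequencies are 4.5, below 5), the function name `chi_square_2x2` is mine, and the p-value uses the fact that, with one degree of freedom, the upper tail of the chi-square distribution can be written with the complementary error function.

```python
from math import erfc, sqrt


def chi_square_2x2(table, yates=False):
    """Chi-square statistic and p-value (1 degree of freedom) for a 2x2 table."""
    (a, b), (c, d) = table
    total = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            diff = abs(observed - expected)
            if yates:
                # Continuity correction: shrink each absolute difference by 0.5.
                diff = max(diff - 0.5, 0.0)
            statistic += diff ** 2 / expected
    # Upper tail of a chi-square with 1 df: P(Z^2 > s) = erfc(sqrt(s / 2)).
    return statistic, erfc(sqrt(statistic / 2))


table = [[8, 2], [3, 7]]  # invented counts
plain_stat, plain_p = chi_square_2x2(table)
yates_stat, yates_p = chi_square_2x2(table, yates=True)
print(plain_stat, plain_p)
print(yates_stat, yates_p)
```

With these invented counts the uncorrected test falls below the 0.05 threshold while the corrected one does not, which shows how much the choice can matter when the data are scarce.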

And we are going to finish for today.

Before we go, I just want to say that not all artistic expressions err on the side of this *horror vacui*. Sometimes some artists do the opposite and use a vacuum to convey their message. This is very common in photography, with the use of negative space.

We have talked all the time about the correction of our friend Yates, which is the best known. But do not think that it is the only one. There are more, like Cochran’s or Mantel’s. But that is another story…

]]>One of my neighbors has a dog that is barking the whole damn day. It is the typical dog so tiny it is barely a palm from the ground, which does not prevent it from barking at an incredible wild volume, not to mention the unpleasantness of its “voice” pitch.

With these dwarf dogs this is what usually happens: they bark at you as if possessed by a demon as soon as they see you but, according to popular wisdom, you can rest easy because the more they bark at you, the less likely they are to bite you. Come to think of it, I would almost say that there is an inverse correlation between the amount of howling and the probability of being bitten by one of these little animals.

And since we have mentioned the term correlation, we are going to talk about this concept and how to measure it.

The Cambridge English Dictionary says that correlation is a connection or relationship between two or more facts, numbers, etc. Another source of wisdom, Wikipedia, says that when it comes to probability and statistics, correlation is any statistical relationship between two random variables or bivariate data.

What does it mean, then, that two variables are correlated? Well, a much simpler thing than it may seem: that the values of one of the variables change in a certain sense in a systematic way when changes occur in the other. Said more simply, given two variables A and B, whenever the value of A changes in a certain direction, those of B will also change in a certain direction, which may be the same or the opposite one.

And that is what correlation means. Only that, how one variable changes with the changes of the other. This does not mean at all that there is a causal relationship between the two variables, which is a generally erroneous assumption that is made with some frequency. So common is this fallacy that it even has a nice Latin name, *cum hoc ergo propter hoc*, that can be summed up for less educated minds as “correlation does not imply causation.” Because two things vary together does not mean that one is the cause of the other.

Another common mistake is to confuse correlation with regression. Actually, they are two terms that are closely related. While the first, correlation, only tells us if there is a relationship between the two variables, regression analysis goes one step further and aims to find a model that allows us to predict the value of one of the variables (the dependent variable) based on the value that the other variable takes (which we will call the independent or explanatory variable). In many cases, studying whether there is correlation is the step prior to generating the regression model.

Well, everyone knows the human being’s hobby of measuring and quantifying everything, so it should surprise no one that the so-called correlation coefficients were invented, of which there is a more or less numerous family.

To calculate the correlation coefficient, we therefore need a parameter that allows us to quantify this relationship. For this, we have the covariance, which indicates the degree of common variation of two random variables.

The problem is that the value of the covariance depends on the measurement scales of the variables, which prevents us from making direct comparisons between different pairs of variables. To avoid this problem, we resort to a solution that is already known to us and which is none other than standardization. The result of standardizing the covariance will be the correlation coefficients.

All these coefficients have something in common: their value ranges from -1 to 1. The farther the value is from 0, the greater the strength of the relationship, which will be practically perfect when it reaches -1 or 1. At 0, which is the null value, in theory there will be no correlation between the two variables.

The sign of the value of the correlation coefficient will indicate the other quality of the relationship between the two variables: the direction. When the sign is positive it will mean that the correlation is direct: when one increases or decreases, the other does so in the same way. If the sign is negative, the correlation will be inverse: when changing one variable, the other will do it in the opposite direction (if one increases, the other decreases, and vice versa).

So far we have seen two of the characteristics of the correlation between two variables: strength and direction. There is a third characteristic which depends on the type of line that defines the best fit model. In this post we are going to talk only about the simplest form, which is none other than linear correlation, in which the fit line is a straight line, but you should know that there are other non-linear fits.

We have already said that there is a whole series of correlation coefficients that we can calculate based on the type of variables that we want to study and on the probability distribution they follow in the population from which the sample comes.

Pearson’s correlation coefficient, also called the linear product-moment correlation coefficient, is undoubtedly the most famous of this entire family.

As we have already said, it is nothing more than the standardized covariance. There are several ways to calculate it, but all roads lead to Rome, so I will not resist putting the formula:

$$r = \frac{\operatorname{cov}(X, Y)}{s_X \, s_Y}$$

As we can see, the covariance (in the numerator) is standardized by dividing it by the product of the standard deviations of the two variables (in the denominator).
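A minimal sketch of this calculation in plain Python may help to see why standardization matters. The heights and weights are invented. The point is only that the covariance changes when we change the units, while the coefficient does not.

```python
import math

def covariance(x, y):
    """Sample covariance of two equally long lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def pearson(x, y):
    """Covariance divided by the product of the two standard deviations."""
    sd_x = math.sqrt(covariance(x, x))
    sd_y = math.sqrt(covariance(y, y))
    return covariance(x, y) / (sd_x * sd_y)

height_m = [1.60, 1.65, 1.70, 1.75, 1.80]   # hypothetical data
weight_kg = [55, 62, 66, 73, 80]
height_cm = [h * 100 for h in height_m]     # same heights, different scale

print(covariance(height_m, weight_kg), covariance(height_cm, weight_kg))
print(pearson(height_m, weight_kg), pearson(height_cm, weight_kg))
```

The covariance is multiplied by 100 when we move from meters to centimeters, whereas Pearson's coefficient stays the same.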

In order to use Pearson’s correlation coefficient, both variables must be quantitative, be linearly correlated, be normally distributed in the population, and the assumption of homoskedasticity must be met, which means that the variance of the Y variable must be constant along the values of the X variable. An easy way to check this last assumption is to draw the scatterplot and see if the cloud is scattered similarly along the values of the X variable.

One factor to keep in mind is that the value of this coefficient can be biased by the existence of extreme values (outliers).

The non-parametric equivalent of Pearson’s coefficient is Spearman’s correlation coefficient. This, as occurs with non-parametric techniques, does not use direct data for its calculation, but uses its transformation in ranks.

Thus, it is used when the variables are ordinal or when they are quantitative, but they do not meet the normality assumption and can be transformed into ranks.

Otherwise, its interpretation is similar to that of the rest of the coefficients. Furthermore, because it is calculated with ranks, it is less sensitive to outliers than Pearson’s coefficient.

Another advantage compared to Pearson’s is that it only requires that the correlation between the two variables be monotonic, which means that when one variable increases the other consistently increases (or consistently decreases), although not necessarily at a constant rate. This allows it to be used not only when the relationship is linear, but also in cases of logistic and exponential relationships.
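The idea of working with ranks can be sketched in a few lines: Spearman's coefficient computed as Pearson's coefficient of the ranks, with tied values sharing the average rank. The exponential data are invented to show a relationship that is monotonic but not linear.

```python
import math

def ranks(values):
    """1-based ranks; tied values receive the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Pearson's coefficient applied to the ranks of the data."""
    return pearson(ranks(x), ranks(y))

x = list(range(1, 11))
y = [math.exp(v) for v in x]  # monotonic, clearly non-linear

print(f"Pearson:  {pearson(x, y):.3f}")
print(f"Spearman: {spearman(x, y):.3f}")
```

Since `y` always grows when `x` grows, the ranks coincide and Spearman's coefficient is 1, while Pearson's is penalized by the curvature.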

Another coefficient that also uses the ranks of the variables is Kendall’s τ coefficient. Being a non-parametric coefficient, it is also an alternative to Pearson’s coefficient when the assumption of normality is not fulfilled, being more advisable than Spearman’s when the sample is small and when there are many tied ranks, which means that many data occupy the same position in the ranks.
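As a hedged sketch, the following function counts concordant and discordant pairs and applies the tie adjustment of the variant usually called τ-b. There are other variants (τ-a, τ-c), so take this as one possible choice rather than the only definition. The data are invented.

```python
import math
from itertools import combinations

def kendall_tau_b(x, y):
    """Kendall's tau-b: (C - D) / sqrt((C + D + Tx) * (C + D + Ty))."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue            # tied on both variables: ignored
        elif dx == 0:
            ties_x += 1         # tied only on x
        elif dy == 0:
            ties_y += 1         # tied only on y
        elif dx * dy > 0:
            concordant += 1     # both variables move in the same direction
        else:
            discordant += 1     # the variables move in opposite directions
    pairs = concordant + discordant
    return (concordant - discordant) / math.sqrt((pairs + ties_x) * (pairs + ties_y))

print(kendall_tau_b([1, 2, 3, 4, 5], [1, 3, 2, 5, 4]))  # no ties
print(kendall_tau_b([1, 2, 2, 3], [1, 2, 3, 3]))        # with ties
```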

Although there are some more, I am only going to refer specifically to three of them, useful for studying quantitative variables:

**Partial correlation coefficient.** This coefficient studies the relationship between two variables while taking into account, and eliminating, the influence of other variables.

The simplest case is to study two variables X1 and X2, eliminating the effect of a third variable X3. In this case, if the correlations between X1 and X3 and between X2 and X3 are equal to zero, the same value is obtained for the partial correlation coefficient as if we calculate the Pearson’s coefficient between X1 and X2.

In the event that we want to control for more variables, the formula, which I do not intend to write, becomes more complex, so it is best to let a statistical program calculate it.

If the value of the partial coefficient is less than that of the Pearson’s coefficient, it means that the correlation between both variables is partially due to the other variables that we are controlling. When the partial coefficient is greater than Pearson’s, the variables that are controlled mask the relationship between the two variables of interest.
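For the simplest case, with a single controlled variable, the first-order partial coefficient can be obtained from the three pairwise Pearson coefficients. The sketch below uses that textbook formula, and the correlation values are invented for illustration.

```python
import math

def partial_correlation(r12, r13, r23):
    """Correlation between X1 and X2 after removing the effect of X3 from both."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13**2) * (1 - r23**2))

# X3 unrelated to both variables: the partial equals the plain coefficient
print(partial_correlation(0.6, 0.0, 0.0))

# X3 explains part of the association: the partial is smaller than 0.6
print(partial_correlation(0.6, 0.5, 0.5))

# X3 masks the association: the partial is larger than 0.2
print(partial_correlation(0.2, 0.6, -0.5))
```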

**Semi-partial correlation coefficient.** It is similar to the previous one, but the semi-partial coefficient allows evaluating the association between two variables by controlling the effect of a third on only one of the two variables of interest (not on both, as the partial coefficient does).

**Multiple correlation coefficient.** This allows us to know the correlation between a variable and a set of two or more variables, all of them quantitative.
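These two coefficients can also be sketched from pairwise Pearson coefficients in the case of one outcome Y and two predictors X1 and X2. The formulas are the usual two-predictor expressions. The variable names and the correlation values are assumptions made only for the example.

```python
import math

def semipartial(r_y1, r_y2, r_12):
    """Correlation of Y with the part of X1 that is not explained by X2."""
    return (r_y1 - r_y2 * r_12) / math.sqrt(1 - r_12**2)

def multiple_r(r_y1, r_y2, r_12):
    """Multiple correlation of Y with the pair (X1, X2)."""
    r_squared = (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)
    return math.sqrt(r_squared)

r_y1, r_y2, r_12 = 0.6, 0.5, 0.3  # hypothetical pairwise correlations

sr1 = semipartial(r_y1, r_y2, r_12)
big_r = multiple_r(r_y1, r_y2, r_12)
print(f"semi-partial of X1: {sr1:.4f}")
print(f"multiple R:         {big_r:.4f}")
```

A useful check is that the squared multiple coefficient equals the squared correlation of Y with X2 plus the squared semi-partial of X1, that is, what X1 adds once X2 is already in.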

And I think we have enough with this, for now. There are a few more coefficients that are useful for special situations. Whoever is curious can look for it in a thick statistics book and be confident to succeed in finding it.

We already said at the beginning that the value of these coefficients could range from -1 to 1, with -1 being the perfect negative correlation and 1 being the perfect positive correlation.

We can make a parallel between the value of the coefficient and the strength of the association, which is nothing but the effect size. Thus, values of 0 indicate null association, 0.1 small association, 0.3 medium, 0.5 moderate, 0.7 high and 0.9 very high association.

To finish, it must be said that, in order to give value to the coefficient, it must be statistically significant. You know that we always work with samples, but what we are interested in is inferring the value in the population, so we have to calculate the confidence interval of the coefficient that we have used. If this interval includes the null value (zero) or if the program calculates the value of p and it is greater than 0.05, it will not make sense to value the coefficient, even if it is close to -1 or 1.
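One common way to obtain such an interval, although not necessarily the one a given program uses, is Fisher's z transformation, which assumes that the two variables are jointly normal. A minimal sketch with invented values of r and n:

```python
import math

def pearson_ci(r, n, z_crit=1.959964):
    """Approximate 95% confidence interval for r via Fisher's z transformation."""
    z = math.atanh(r)                 # transform r to the z scale
    se = 1 / math.sqrt(n - 3)         # standard error on the z scale (needs n > 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

for n in (4, 30):
    low, high = pearson_ci(0.9, n)
    includes_zero = low <= 0 <= high
    print(f"r=0.9, n={n}: ({low:.3f}, {high:.3f}) includes zero: {includes_zero}")
```

With only four observations, the interval for a coefficient as high as 0.9 still includes zero, which is the situation described above.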

And here we leave it for today. We have not discussed anything about using Pearson’s correlation coefficient to compare the precision of a diagnostic test. And we have not said anything because we should not use this coefficient for this purpose. Pearson’s coefficient is highly dependent on between-subject variability and can give a very high value when one of the measurements is systematically greater than the other, even though there is not a good concordance between the two. For this, it is much more appropriate to use the intraclass correlation coefficient, a better estimator of the concordance among repeated measures. But that is another story…

]]>There are days when I feel biblical. Other days I feel mythological. Today I feel philosophical and even a little Masonic.

And the reason is that the other day I began to wonder what the difference between exoteric and esoteric was, so I consulted that friend of ours who knows so much about everything, our friend Google. It kindly explained to me that both terms are similar and usually explain two aspects of the same doctrine.

Exoterism refers to that knowledge that is not limited to a certain group in the community that deals with that doctrine, and that can be disclosed and made available to anyone.

On the other hand, esotericism refers to knowledge that belongs to a deeper and higher order, only available to a privileged few specially educated to understand it.

And now, once the difference is understood, I ask you a slightly tricky question: is multivariate statistics exoteric or esoteric? The answer, of course, will depend on each one, but we are going to see if it is true that both concepts are not contradictory, but complementary, and we can strike a balance between both of them, at least in understanding the usefulness of multivariate techniques.

We are more used to working with univariate or bivariate statistical techniques, which allow us to study together a maximum of two characteristics of the individuals in a population to detect relationships between them.

However, with the mathematical development and, above all, the calculating power of modern computers, multivariate statistical techniques are becoming increasingly important.

We can define multivariate analysis as the set of statistical procedures that simultaneously study various characteristics of the same subject or entity, in order to analyze the interrelation that may exist among all the random variables that these characteristics represent. Let me insist on the two aspects of these techniques: the multiplicity of variables and the study of their possible interrelations.

There are many multivariate analysis techniques, ranging from purely descriptive methods to those that use statistical inference techniques to draw conclusions from the data and that are able to develop models that are not obvious to the naked eye by observing the data obtained. They will also allow us to develop prediction models of various variables and establish relationships among them.

Some of these techniques are the extension of their equivalents with two variables, one dependent and the other independent or explanatory. Others have nothing similar in two-dimensional statistics.

Some authors classify these techniques into three broad groups: full-range and non-full-range models, techniques to reduce dimensionality, and classification and discrimination methods. Do not worry if this seems gibberish, we will try to simplify it a bit.

To be able to talk about the **FULL AND NON-FULL RANGE TECHNIQUES**, I think it will be necessary to explain first what range we are talking about.

Although we are not going to go into the subject in depth, all these methods involve matrix calculation techniques within them. You know, matrices or arrays, sets of numbers arranged in two dimensions (the ones we are going to discuss here) that form rows and columns and that can be added and multiplied together, in addition to other calculations.

The range of an array (what linear algebra texts in English usually call the rank of a matrix) is defined as the number of rows or columns that are linearly independent (no matter rows or columns, the number is the same). The range can take values from 0 to the smaller of the number of rows and the number of columns. For example, a 2 row by 3 column array may have a range from 0 to 2. A 5 row by 3 column array may have a range from 0 to 3. Now imagine an array with two rows, the first 1 2 3 and the second 3 6 9 (it has 3 columns). Its maximum range would be 2 (the smaller of the number of rows and columns) but, if you look closely, the second row is the first one multiplied by 3, so there is only one linearly independent row, so its range is equal to 1.

Well, an array is said to be a full-range one when its range is equal to the largest possible for an array of the same dimensions. The third example that I have given you would be a non-full range array, since a 2×3 matrix would have a maximum range of 2 and that of our array is 1.
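The range described here is the quantity usually called the rank of a matrix, and it can be checked with a few lines of code. The sketch below is one simple way of doing it, by Gaussian elimination with exact fractions. The function name `matrix_rank` is just a label chosen for this example.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Number of linearly independent rows, found by Gaussian elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        # Look for a row, not yet used as pivot, with a non-zero entry in this column
        pivot = next((r for r in range(rank, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

example = [[1, 2, 3], [3, 6, 9]]     # the array from the text
full = [[1, 2, 3], [0, 1, 4]]        # same dimensions, independent rows
print(matrix_rank(example), matrix_rank(full))
```

The first array has a range of 1, so it is not full range, while the second one reaches the maximum of 2 for a 2×3 array.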

Once this is understood, we go with the full and non-full range methods.

The first one we will look at is the multiple linear regression model. This model, an extension of the simple linear regression model, is used when we have a dependent variable and a series of explanatory variables, all of them quantitative variables, and they can be linearly related and the explanatory variables form a full-range array.

Like simple regression, this technique allows us to predict changes in the dependent variable based on the values of the explanatory variables. The formula is like that of simple regression, but including all the explanatory variables, so I’m not going to bore you with it. However, since I have punished you with ranges and matrices, let me tell you that, in matrix terms, it can be expressed as follows:

Y = Xβ + e

where X is the full range matrix of the explanatory variables. The equation includes an error term that is justified by the possible omission in the model of relevant explanatory variables or measurement errors.
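To make the matrix expression less abstract, here is a minimal sketch that estimates β by ordinary least squares, solving the normal equations (XᵀX)β = XᵀY. This is one standard way of fitting the model, not the only one. The data are invented and generated without an error term, so the coefficients 1, 2 and 3 should be recovered almost exactly.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: use the row with the largest entry in this column
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def least_squares(x_rows, y):
    """Coefficients beta that solve (X'X) beta = X'y."""
    columns = list(zip(*x_rows))
    xtx = [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(a * b for a, b in zip(ci, y)) for ci in columns]
    return solve(xtx, xty)

x1 = [1, 2, 3, 4, 5]                                   # hypothetical predictors
x2 = [2, 1, 4, 3, 6]
X = [[1.0, a, b] for a, b in zip(x1, x2)]              # first column is the intercept
Y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]        # no error term, for clarity

beta = least_squares(X, Y)
print([round(b, 6) for b in beta])
```

If X were not full range, XᵀX would be singular and the solving step would fail or return meaningless numbers, which is why the requirement matters.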

To complicate matters, imagine that we were to simultaneously correlate several independent variables with several dependent ones. In this case, multiple regression does not help us, and we would have to resort to the canonical correlation technique, which allows us to make predictions of various dependent variables based on the values of several explanatory variables.

If you remember bivariate statistics, analysis of variance (ANOVA) is the technique that allows us to study the effect on a quantitative dependent variable of explanatory variables when these are categories of a qualitative variable (we call these categories factors). In this case, since each observation can belong to one and only one of the factors of the explanatory variable, matrix X will be a non-full range one.

A slightly more complicated situation occurs when the explanatory variables are a quantitative variable and one or more factors of a qualitative variable. On these occasions we resort to a general linear model called the analysis of covariance (ANCOVA).
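A small sketch can show where the lost range comes from. It assumes a matrix X with a column of ones for the intercept plus one indicator column per category. This is a common way of writing the ANOVA model, although the text does not specify it. The group labels are invented.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Number of linearly independent rows, found by Gaussian elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

groups = ["A", "B", "C", "A", "B", "C"]   # hypothetical category of each observation
levels = ["A", "B", "C"]

# Intercept plus one indicator column per category
X = [[1] + [1 if g == level else 0 for level in levels] for g in groups]
# Same model dropping the indicator of the first category (reference coding)
X_reference = [[1] + [1 if g == level else 0 for level in levels[1:]] for g in groups]

print(len(X[0]), matrix_rank(X))
print(len(X_reference[0]), matrix_rank(X_reference))
```

Because every observation belongs to exactly one category, the indicator columns add up to the intercept column, so the four columns of `X` only reach a range of 3. Dropping one indicator leaves three columns with range 3.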

Transferring what we have just said to the realm of multivariate statistics, we would have to use the extension of these techniques. The extension of ANOVA when there is more than one dependent variable that cannot be combined into one is the multivariate analysis of variance (MANOVA). If factors of qualitative variables coexist with quantitative variables, we will resort to the multivariate analysis of covariance (MANCOVA).

The second group of multivariate techniques are those that try the **REDUCTION OF DIMENSIONALITY**.

Sometimes we must handle such a high number of variables that it is complex to organize them and reach some useful conclusion. Now, if we are lucky that the variables are correlated with each other, the information provided by the set will be redundant, since the information given by some variables will include that already provided by other variables in the set.

In these cases, it is useful to reduce the dimension of the problem by decreasing the number of variables to a smaller set of variables that are not correlated with each other and that collect most of the information included in the original set. And we say most of the information because, obviously, the more we reduce the number, the more information we will lose.

The two fundamental techniques that we will use in these cases are principal component analysis and factor analysis.

Principal component analysis takes a set of p correlated variables and transforms them into a new set of uncorrelated variables, which we call principal components. These principal components allow us to explain the variables in terms of their common dimensions.

Without going into detail, a correlation matrix and a series of vectors are calculated that will provide us with the new principal components, ordered from highest to lowest according to the variance of the original data that each component encompasses. Each component will be a linear combination of the original variables, somewhat like a regression line.

Let’s imagine a very simple case with six explanatory variables (X1 to X6). Principal component 1 (PC1) can be, let’s say, 0.15X1 + 0.5X2 – 0.6X3 + 0.25X4 – 0.1X5 – 0.2X6 and, in addition, encompass 47% of the total variance. If PC2 turns out to encompass 30% of the variance, with PC1 and PC2 we will have 77% of the total variance controlled with a data set that is easier to handle (let’s think if instead of 6 variables we have 50). And not only that, if we represent graphically PC1 versus PC2, we can see if some type of grouping of the variable under study occurs according to the values of the principal components.
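For the curious, here is a minimal sketch of the idea with only two variables, where the components can be written down without any matrix software. The data are invented, and the shortcut used (components proportional to the sum and the difference of the standardized variables) only holds for two standardized variables. With six or fifty variables, a statistical package would compute the vectors for us.

```python
import math

def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    m = sum(values) / len(values)
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))
    return [(v - m) / sd for v in values]

def covariance(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

x1 = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]   # hypothetical correlated variables
x2 = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
z1, z2 = standardize(x1), standardize(x2)
r = covariance(z1, z2)  # equals Pearson's r because both variables are standardized

# For the 2x2 correlation matrix [[1, r], [r, 1]] the components are known in closed form
sign = 1.0 if r >= 0 else -1.0
w = 1 / math.sqrt(2)
pc1 = [w * a + sign * w * b for a, b in zip(z1, z2)]
pc2 = [w * a - sign * w * b for a, b in zip(z1, z2)]

eigenvalues = [1 + abs(r), 1 - abs(r)]           # variance carried by PC1 and PC2
explained = [ev / sum(eigenvalues) for ev in eigenvalues]

print(f"r = {r:.4f}")
print(f"variance explained: PC1 = {explained[0]:.1%}, PC2 = {explained[1]:.1%}")
print(f"covariance between components: {covariance(pc1, pc2):.2e}")
```

The two components are uncorrelated with each other, and the first one alone collects most of the variance of the pair, which is the whole point of the technique.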

In this way, if we are lucky and a few components collect most of the variance of the original variables, we will have reduced the dimension of the problem. And although, sometimes, this is not possible, it can always help us to find groupings in the data defined by a large number of variables, which links us to the following technique, factor analysis.

We know that the total variance of our data (the one studied by principal component analysis) is the sum of three components: the common or shared variance, the specific variance of each variable, and the variance due to chance and measurement errors. Again, without going into detail, the factor analysis method starts from the correlation matrix to isolate only the common variance and try to find a series of common underlying dimensions, called factors, that are not observable by looking at the original set of variables.

As we can see, these two methods are very similar, so there is a lot of confusion about when to use one and when another, especially considering that principal component analysis may be the first step in the factor analysis methodology.

As we have already said, principal component analysis tries to explain the maximum possible proportion of the total variance of the original data, while the objective of the factor analysis study is to explain the covariance or correlation that exists among its variables. Therefore, principal component analysis will usually be used to search for linear combinations of the original variables and reduce one large data set to a smaller and more manageable one, while we will resort to factor analysis when looking for a new set of variables, generally smaller than the original, and to represent what the original variables have in common.

Continuing along today’s arduous path, for those hard workers who are still reading, we are going to discuss **CLASSIFICATION AND DISCRIMINATION METHODS**, of which there are two: cluster analysis and discriminant analysis.

Cluster analysis tries to recognize patterns to summarize the information contained in the initial set of variables, which are grouped according to their greater or lesser homogeneity. In summary, we look for groups that are mutually exclusive, so that the elements are as similar as possible to those of their group and as different as possible to those of the other groups.

The most famous part of cluster analysis is, without a doubt, its graphic representation with tree diagrams known as dendrograms, in which the homogeneous groups are drawn so that the farther apart two branches of the tree are, the more different the groups they represent.
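The tree is built by successive merges, which can be sketched with a small agglomerative procedure. This version uses single linkage (the distance between two groups is that of their closest members) on one-dimensional invented values. Both choices are assumptions made to keep the example short.

```python
def single_linkage(points):
    """Agglomerative clustering; returns the merges as (group_a, group_b, distance)."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members of each group
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

values = [10, 15, 50, 54, 90]  # hypothetical measurements
merges = single_linkage(values)
for group_a, group_b, distance in merges:
    print(f"join {group_a} and {group_b} at distance {distance}")
```

Each printed line corresponds to one junction of the dendrogram, and the distance is the height at which the two branches meet.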

But, instead of wanting to segment the population, let’s assume that we already have a population segmented into a number of classes, k. Suppose we have a group of individuals defined by a number p of random variables. If we want to know to what class of the population a certain individual may belong, we will resort to the technique of discriminant analysis.

Suppose that we have a new treatment that is awfully expensive, so we only want to give it to patients who we are sure will comply with the treatment. Thus, our population is segmented into compliant and non-compliant classes. It would be very useful for us to select a set of variables that would allow us to discriminate which class a specific person can belong to, and even which of these variables are the ones that best discriminate between the two groups.

Thus, we will measure the variables in the candidate for treatment and, using what is known as a discrimination criterion or rule, we will assign it to one or the other group and proceed accordingly. Of course, do not forget, there will always be a probability of being wrong, so we will be interested in finding the discriminant rule that minimizes the probability of discrimination error.
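A deliberately simple sketch of such a rule, with a single variable: each candidate is assigned to the class whose mean is closer. That coincides with the linear discriminant rule only under the assumptions of equal variances and equal prior probabilities in both classes. The adherence scores are invented, and the function name `fit_nearest_mean_rule` is hypothetical.

```python
def fit_nearest_mean_rule(compliant, non_compliant):
    """Return the cutoff between the class means and a classifying function."""
    mean_c = sum(compliant) / len(compliant)
    mean_n = sum(non_compliant) / len(non_compliant)
    cutoff = (mean_c + mean_n) / 2

    def classify(value):
        closer_to_compliant = abs(value - mean_c) < abs(value - mean_n)
        return "compliant" if closer_to_compliant else "non-compliant"

    return cutoff, classify

compliant_scores = [8, 9, 7, 7, 9]       # hypothetical scores of known compliers
non_compliant_scores = [3, 4, 5, 1, 7]   # hypothetical scores of known non-compliers

cutoff, classify = fit_nearest_mean_rule(compliant_scores, non_compliant_scores)

# Error of the rule on the same data used to build it (an optimistic estimate)
errors = sum(classify(s) != "compliant" for s in compliant_scores)
errors += sum(classify(s) != "non-compliant" for s in non_compliant_scores)
error_rate = errors / (len(compliant_scores) + len(non_compliant_scores))

print(f"cutoff = {cutoff}, error rate = {error_rate}")
print(classify(9), classify(2))
```

Even in this toy example one patient ends up in the wrong group, a reminder that there will always be a probability of being wrong.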

Discriminant analysis may seem similar to cluster analysis, but if we think about it, the difference is clear. In the discriminant analysis the groups are previously defined (compliant or non-compliant, in our example), while in the cluster analysis we look for groups that are not evident: we would analyze the data and discover that there are patients who do not take the pill that we give them, something that had not even crossed our minds (in addition to our ignorance, we would demonstrate our innocence).

And here we are going to leave it for today. We have flown over the steep landscape of multivariate statistics from a great height and I hope it has served to transfer it from the field of the esoteric to that of the exoteric (or was it the other way around?). We have not entered the specific methodology of each technique, since we could have written an entire book. By roughly understanding what each method is and what it is for, I think we have quite a lot. In addition, statistical packages carry them out, as always, effortlessly.

And do not think that we have talked about all the methods that have been developed for multivariate analysis. There are many others, such as conjoint analysis and multidimensional scaling, widely used in advertising to determine the attributes of an object that are preferred by the population and how they influence their perception of it.

We could also have gotten lost among other newer techniques, such as correspondence analysis, or linear probability models, such as logit and probit analysis, which are combinations of multiple regression and discriminant analysis, not to mention simultaneous or structural equation models. But that is another story…

]]>