We know that if we want to test whether the means of two normally distributed samples are equal, we can use Student’s t test. So we set our null hypothesis of no difference between the means, run the test and, if p < 0.05, reject the null hypothesis (which is what we want most of the time) and assume that the means are different.

This 0.05 threshold, called alpha or the significance level, is arbitrarily chosen. The p value is the probability of observing a difference at least as large as ours if the null hypothesis were true, that is, by chance alone. As a value less than 0.05 seems small to us, we accept a 5% chance of committing a type I error: rejecting the null hypothesis when it’s true and the difference is, in fact, due to chance.

Things get a little more complicated when we compare means from more than two samples. As we know, in these cases we have to run an analysis of variance (if the samples are normally distributed and their variances are equal), which provides another value of p. Again, if p is less than 0.05 we reject the null hypothesis of no differences and assume that some means are different. But which of those means are different from each other and which are not?

The first thing that comes to mind is to use Student’s t test to compare the means, taking the samples in pairs. The number of possible comparisons is k(k-1)/2, where k is the number of samples or groups. If there are three samples, we can make three comparisons; if four, six comparisons; if there are five groups, 10 comparisons; and so on until we get bored.
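To see where those counts come from, here is a minimal Python sketch that lists every pair of groups and checks the total against k(k-1)/2. The function name `pairwise_comparisons` and the group labels are made up for illustration.

```python
from itertools import combinations
from math import comb


def pairwise_comparisons(groups):
    """Return every unordered pair of group labels."""
    return list(combinations(groups, 2))


for k in (3, 4, 5):
    # Hypothetical labels: group_1, group_2, ...
    labels = [f"group_{i}" for i in range(1, k + 1)]
    pairs = pairwise_comparisons(labels)
    # comb(k, 2) is the same quantity as k * (k - 1) / 2
    assert len(pairs) == comb(k, 2) == k * (k - 1) // 2
    print(k, "groups:", len(pairs), "comparisons")
```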

However, if we do this we run the risk of making a blunder, and the more comparisons we make, the higher the risk. Let’s think a little about why this is so.

If we make one contrast and the null hypothesis is true, the probability of a significant result is 0.05 and of a non-significant one 0.95. Now imagine we make 20 independent comparisons: the probability that none will be significant is 0.95×0.95×0.95… and so on 20 times, that is, 0.95^{20} = 0.36. This means that the probability of a type I error increases with the number of comparisons and that we can find a falsely significant difference just by chance.

Let’s think about it the other way. If we make 20 comparisons with an alpha value of 0.05, the probability that at least one is significant is equal to 1 minus the probability that none is or, put another way, 1-(1-0.05)^{20}, which equals 0.64. This means that if we make 20 comparisons we’ll have a 64% chance of identifying at least one difference as significant when in fact it’s not or, in other words, of committing a type I error.
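The same arithmetic can be sketched in a few lines of Python. This assumes, as the text does, that the comparisons are independent and that every null hypothesis is true; the function name `familywise_error_rate` is just an illustrative choice.

```python
def familywise_error_rate(alpha, n_comparisons):
    """Probability of at least one false positive among independent tests,
    assuming every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_comparisons


# How the risk grows with the number of comparisons at alpha = 0.05
for m in (1, 3, 6, 10, 20):
    print(f"{m:2d} comparisons: {familywise_error_rate(0.05, m):.2f}")
```

With real pairwise comparisons the tests share data and are not strictly independent, so the figures are a guide rather than exact values.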

What can we do? This is where Mr. Bonferroni comes to our rescue with his famous correction.

We have said that the probability of no significant result in 20 comparisons is (1-alpha)^{20}. Now I ask you to believe me if I say that, when alpha is small, (1-alpha)^{20} is approximately equal to 1-20×alpha. If we want that overall probability to stay at 0.95, then 0.95 = 1-20×alpha. Solving for alpha gives 0.05/20, which is Bonferroni’s correction:
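If you’d rather not take the approximation on faith, a small Python sketch can compare both sides. The function names are illustrative, and the only assumption is the independence of the comparisons used throughout the post.

```python
def exact_no_false_positive(alpha, m):
    """(1 - alpha)^m: probability that none of m independent tests is significant."""
    return (1 - alpha) ** m


def linear_approximation(alpha, m):
    """1 - m * alpha: the approximation behind Bonferroni's correction."""
    return 1 - m * alpha


m = 20
for alpha in (0.05, 0.01, 0.0025):
    exact = exact_no_false_positive(alpha, m)
    approx = linear_approximation(alpha, m)
    print(f"alpha={alpha}: exact={exact:.4f}, approximation={approx:.4f}")
```

The two values only agree closely when m × alpha is small, which is the case for the corrected alpha of 0.05/20 = 0.0025. The exact value is never below the approximation, so the real overall error rate does not exceed the one we aimed for.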

alpha for each comparison = general alpha / number of comparisons.

So, if we have to make four comparisons and we have chosen an ANOVA alpha value of 0.05, when we do the pairwise comparisons we’ll consider a p value significant, and reject the null hypothesis, only if it is less than 0.05/4 = 0.0125. If we make six comparisons the significance level drops to 0.0083, and if we make 10, to 0.005.
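As a rough sketch, applying the correction to a set of pairwise p values could look like this in Python. The p values are invented purely for illustration, and the function names are hypothetical.

```python
def bonferroni_threshold(overall_alpha, n_comparisons):
    """Per-comparison significance level under Bonferroni's correction."""
    return overall_alpha / n_comparisons


def significant_after_bonferroni(p_values, overall_alpha=0.05):
    """Flag which p values stay significant once the threshold is corrected."""
    threshold = bonferroni_threshold(overall_alpha, len(p_values))
    return [p < threshold for p in p_values]


# Invented p values from four pairwise comparisons (not real data)
hypothetical_p_values = [0.001, 0.012, 0.02, 0.04]

print(bonferroni_threshold(0.05, len(hypothetical_p_values)))
print(significant_after_bonferroni(hypothetical_p_values))
```

All four invented p values are below 0.05, but only the first two fall below the corrected threshold of 0.0125.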

This is what I mean by the importance of p’s zeroes. The more comparisons we make, the more zeroes p will need for the difference to count as statistically significant without increasing the risk of type I error. This is very common in post hoc analyses of subgroups in clinical trials or in genome-wide association studies which, under that elegant name, are merely camouflaged case-control studies.

As is easy to understand, this correction penalizes the value of p and makes the contrast much more conservative, in the sense that it becomes harder to reject the null hypothesis. Of course, if the difference remains significant in spite of the correction, the credibility of the results will be much higher.

And that’s where we’ll leave it for today. Let me just say that Bonferroni was not alone in offering a solution to the problem of multiple comparisons. There are other techniques such as Scheffé’s, Newman-Keuls’, Duncan’s, Gabriel’s, etc., and choosing one or the other may depend solely on the statistical software we have. But that’s another story…