Yates’ continuity correction
Throughout the history of art we repeatedly encounter what is known as horror vacui which, for those of you who were not fortunate enough to study Latin in your youth, is nothing more than the fear of emptiness.
There are numerous examples of works of art in which we can see an obsessive effort to fill the entire space with some element, leaving nothing empty. Think of Islamic decoration, the art of the Rococo period or, above all, the ornate decoration of the Victorian period.
And why do I tell you all this? Well, because today's topic reminded me of it. It has to do with discrete and continuous probability distributions, with how the former allow this emptiness and the latter do not, and with how things get complicated when we use one kind to approximate the other. This sounds like a tongue twister, but let nobody despair: let's see if we can clarify it.
A very popular, but approximate test
Probably the most frequently used hypothesis test is the chi-square test of independence, which we use to compare the proportions of two qualitative variables and try to determine whether the variables are associated or independent.
As we all know, we construct a contingency table with the observed values, we calculate the expected values under the null hypothesis that the two variables are independent and, finally, we calculate the probability (under the null hypothesis) of observing by chance a table as far or farther from the expected one than the table we have observed in our experiment.
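The steps just described can be sketched in a few lines of Python. This is only a minimal illustration for a 2x2 table using the standard library: the function name and the counts are made up for the example, and the p-value shortcut with erfc holds only for one degree of freedom.

```python
import math

def chi_square_2x2(table):
    """Expected counts, Pearson chi-square statistic and p-value for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    # Expected count under independence: row total * column total / grand total
    expected = [[r * k / n for k in col_totals] for r in row_totals]
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    # With 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return expected, statistic, p_value

# Hypothetical counts, invented for illustration only
observed = [[30, 20], [15, 35]]
expected, statistic, p_value = chi_square_2x2(observed)
print(expected, round(statistic, 3), round(p_value, 4))
```

All the expected counts in this made-up table are well above 5, so this is the comfortable situation in which the approximation behaves well.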
The problem arises with the indiscriminate use of the test, which sometimes leads us to forget that the statistic we use, the chi-square, follows its theoretical distribution only approximately. The approximation is good when the number of observations is relatively large, but it becomes unreliable when the information we have is scarce, which happens fairly often.
Therefore, once the contingency table is built, we check that there are no cells with expected frequencies less than 5. If there are, we have two ways to solve the problem.
Better an exact test
The first way to fix this is to use an exact test, such as Fisher’s exact test.
Exact tests calculate the probability directly, enumerating all the possible scenarios in which the condition we want to study occurs. This is done by constructing the observed table and all the contingency tables more extreme than it that follow the direction of the association in the observed table (in Fisher's test, keeping the marginal totals fixed).
Once this exact probability has been calculated, it is compared with the chosen level of statistical significance and the hypothesis test is resolved.
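As a rough sketch of what that enumeration looks like for a 2x2 table, the following Python code adds up the hypergeometric probabilities of the observed table and of every more extreme table in the same direction. It assumes fixed marginal totals and a one-sided test; the function name and the counts are hypothetical.

```python
from math import comb

def fisher_exact_one_sided(table):
    """One-sided exact p-value for a 2x2 table with fixed margins."""
    (a, b), (c, d) = table
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(k):
        # Hypergeometric probability that the top-left cell equals k
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    k_min = max(0, col1 - row2)   # smallest top-left count the margins allow
    k_max = min(row1, col1)       # largest top-left count the margins allow
    if a * d >= b * c:
        # Positive association observed: more extreme means a larger top-left cell
        ks = range(a, k_max + 1)
    else:
        # Negative association observed: more extreme means a smaller top-left cell
        ks = range(k_min, a + 1)
    return sum(prob(k) for k in ks)

# Hypothetical small table, invented for illustration only
p_exact = fisher_exact_one_sided([[3, 1], [1, 3]])
print(round(p_exact, 4))
```

With only eight observations there are just two tables to add up here, but the number of tables grows quickly with the sample size.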
The problem with these methods is that they are much more laborious, which made them difficult to use until the necessary computing power became available. This explains the long-standing predilection for approximate tests such as the chi-square test.
Yates’ continuity correction
We said there were two ways to solve the problem of scarce data. Well, the second way is to apply Yates' continuity correction, which involves subtracting 0.5 from the absolute difference between the observed and expected values of each cell, before squaring it, when calculating the value of the chi-square statistic.
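Written as a formula, and using O_i and E_i for the observed and expected counts of cell i (notation introduced here only for illustration), the corrected statistic would look like this:

```latex
\chi^2_{\text{Yates}} = \sum_{i} \frac{\left( \lvert O_i - E_i \rvert - 0.5 \right)^2}{E_i}
```

Since every numerator shrinks whenever the absolute difference is larger than 0.5, the corrected statistic is smaller than the uncorrected one and the resulting p-value is larger.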
Everyone knows Yates' correction, no doubt as popular as the chi-square test itself. But let those who know exactly what a continuity correction such as Yates' actually is raise their hands.
To understand it well, we first have to know what kind of probability distribution we are dealing with.
Continuous and discrete distributions
Quantitative variables can be continuous or discrete. A variable is continuous when, between two values of the variable, there are infinitely many (at least, in theory) possible values. For example, consider the weight of a newborn. It can weigh 3 kg and it can weigh, say, 4 kg. But between 3 and 4 kg there are infinitely many possible weights (although in practice this infinity is limited by the precision of our scale).
Now let's think about the number of children one can have. One can have 2, or 3, or a different number, but what one cannot have is a number of children between two and three, for example 2.5 (I know that sometimes we see this kind of thing, but it is a device that facilitates the analysis of the variable and is meaningless from the point of view of everyday life).
The same is true for probability distributions. Between the values 3 and 4 of a discrete probability distribution there is a complete gap. However, continuous distributions are like a Victorian bedroom and suffer from horror vacui: between 3 and 4 there is a whole range of possible values.
This, in itself, is not a problem. The problem arises when we have a test whose resolution would require a discrete distribution and we approximate it with a continuous one. Let's see an example.
Suppose we are working with a discrete distribution, for example, a binomial defined by n and p: B(n, p). To simplify probability calculations, it is very common to approximate the solution with a normal distribution when the sample size is large and the probability of the event is around 0.5. Hence, when np and n(1-p) are both greater than 5, we can approximate the binomial with a normal of mean np and standard deviation equal to the square root of np(1-p), that is, variance np(1-p).
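A minimal Python sketch of this approximation, using only the standard library, could look like the following. The function name and the values of n and p are made up for the example, and the threshold of 5 is simply the rule of thumb mentioned above.

```python
import math
from statistics import NormalDist

def normal_approximation(n, p):
    """Normal distribution approximating B(n, p), if the rule of thumb allows it."""
    if n * p <= 5 or n * (1 - p) <= 5:
        raise ValueError("np and n(1-p) should both be greater than 5")
    mean = n * p
    std_dev = math.sqrt(n * p * (1 - p))  # standard deviation; the variance is np(1-p)
    return NormalDist(mu=mean, sigma=std_dev)

# Hypothetical parameters, invented for illustration only
approx = normal_approximation(20, 0.5)
print(approx.mean, approx.stdev, approx.variance)
```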
This makes the calculations easier for us, but we are going from using a discrete distribution to using a continuous one, which has its consequences, as we will see.
In a discrete distribution, obtaining the probability of x>3 is straightforward. In a continuous one, things get complicated, since we go from the emptiness between two points of the discrete distribution to the full interval of possible values of the continuous one.
Let's go back to calculating the probability that the variable is greater than 3: P(x>3). From 3 downwards there is no problem, with either the continuous or the discrete distribution. From 4 upwards, no problem either. But between 3 and 4 there used to be a void that has now been filled. How do we solve it? Easy: we give half of the interval to each of the values on either side. In this way, P(x>3) would be calculated in the normal approximation as P(x≥3.5), leaving out the half of the interval just above 3, which belongs to the value 3 and therefore does not enter the probability calculation. And, shhhh, we have just applied Yates' continuity correction.
If we want to calculate P(x≥3), the calculation would include 3, so we would have to go back to the previous half of the empty interval and calculate it as P(x≥2.5). Following the same reasoning, P(x≤3) is calculated as P(x≤3.5), including half the interval above 3. And the probability that x is equal to 3? We will have to take the two halves of the interval: P(2.5≤x≤3.5).
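To see how well these four rules work, here is a small Python comparison between the exact binomial probabilities and their corrected normal approximations. The choice of B(12, 0.5) is an assumption made only for the example (np and n(1-p) are both 6), and the variable names are hypothetical.

```python
import math
from statistics import NormalDist

n, p = 12, 0.5  # hypothetical binomial, chosen so that np = n(1-p) = 6 > 5
normal = NormalDist(mu=n * p, sigma=math.sqrt(n * p * (1 - p)))

def exact(ks):
    """Exact binomial probability that x takes one of the values in ks."""
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in ks)

# Each entry: (exact binomial value, normal approximation with continuity correction)
comparisons = {
    "P(x>3)":  (exact(range(4, n + 1)), 1 - normal.cdf(3.5)),
    "P(x>=3)": (exact(range(3, n + 1)), 1 - normal.cdf(2.5)),
    "P(x<=3)": (exact(range(0, 4)),     normal.cdf(3.5)),
    "P(x=3)":  (exact([3]),             normal.cdf(3.5) - normal.cdf(2.5)),
}

# Same approximation for P(x>3) but without shifting the cut-off by 0.5
uncorrected_gt_3 = 1 - normal.cdf(3)

for label, (exact_value, approx_value) in comparisons.items():
    print(f"{label}: exact {exact_value:.4f}, corrected approximation {approx_value:.4f}")
print(f"P(x>3) without correction: {uncorrected_gt_3:.4f}")
```

In this particular example the corrected values stay close to the exact ones, while the uncorrected P(x>3) drifts noticeably further away.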
Two errors to avoid
We have already seen, then, that we apply the continuity correction when we go from a discrete to a continuous distribution. When we work with variables that follow a continuous distribution, we do not have to apply any correction. For example, if in a normal distribution we want to calculate P(x = 3), let no one think of calculating the probability of the interval from 2.5 to 3.5. In this case, it is wrong to apply the continuity correction. P(x = 3) in a normal distribution is equal to zero. If we think about it, the probability is the area under the curve and, over a single point, there is no area.
Nor is it necessary when we go from one discrete distribution to another that is also discrete. An example is when we approximate a binomial with a Poisson distribution (when np<5). Apply the continuity correction only when going from a discrete to a continuous distribution.
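As a quick illustration of the discrete-to-discrete case, the following Python sketch approximates a binomial with a Poisson of mean np and compares P(x≤3) in both, using exactly the same cut-off and no 0.5 shift. The values of n and p are invented for the example.

```python
import math

n, p = 100, 0.02   # hypothetical values, chosen so that np = 2 < 5
lam = n * p        # mean of the approximating Poisson distribution

def binomial_cdf(k):
    """Exact P(x <= k) for B(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def poisson_cdf(k):
    """P(x <= k) for a Poisson distribution with mean lam."""
    return sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k + 1))

# Both distributions are discrete, so the cut-off stays at 3 in both
exact_value = binomial_cdf(3)
poisson_value = poisson_cdf(3)
print(round(exact_value, 4), round(poisson_value, 4))
```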
Going back to the chi-square test
Now that we know what the Yates’ continuity correction is, let’s see why it should be applied when the frequencies of the cells in the chi-square table are low.
The exact probability with small samples is calculated using discrete probability distributions, such as the hypergeometric, the negative binomial, and others that we can choose based on how the data were sampled. When the sample is small and we use the chi-square test, we are approximating with a known probability distribution, the chi-square distribution, which, as those of you still awake will have already guessed, is a continuous probability distribution.
We go from a discrete distribution to a continuous one, so we have to apply the continuity correction. In this way, we try to compensate for the mismatches that occur when the probability distribution of the observed frequencies, which is discrete, is approximated by another of a continuous nature.
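To get a feeling for the size of this compensation, the following Python sketch compares, for one small made-up table, the p-value of the chi-square test with and without the correction against an exact two-sided p-value based on the hypergeometric distribution. Several things here are assumptions for the sake of the example: the counts, the function names, and the convention of defining the two-sided exact p-value as the sum of the probabilities of all tables no more likely than the observed one. A single table proves nothing in general; it only illustrates the idea.

```python
import math

def chi_square_p(table, correction):
    """Chi-square p-value (1 degree of freedom) for a 2x2 table.

    correction=0.5 gives the continuity-corrected statistic, correction=0 the plain one.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    statistic = 0.0
    for i, obs_row in enumerate(table):
        for j, obs in enumerate(obs_row):
            exp = rows[i] * cols[j] / n
            statistic += (abs(obs - exp) - correction) ** 2 / exp
    return math.erfc(math.sqrt(statistic / 2))

def exact_two_sided_p(table):
    """Exact two-sided p-value with fixed margins (hypergeometric distribution)."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    # Integer weights proportional to the probability of each possible top-left count
    weights = {
        k: math.comb(row1, k) * math.comb(row2, col1 - k)
        for k in range(max(0, col1 - row2), min(row1, col1) + 1)
    }
    # Add up every table that is no more likely than the observed one
    extreme = sum(w for w in weights.values() if w <= weights[a])
    return extreme / sum(weights.values())

# Hypothetical small table, invented for illustration only
table = [[3, 1], [1, 3]]
p_plain = chi_square_p(table, correction=0)
p_yates = chi_square_p(table, correction=0.5)
p_exact = exact_two_sided_p(table)
print(round(p_plain, 4), round(p_yates, 4), round(p_exact, 4))
```

In this example the corrected p-value lands much closer to the exact one than the uncorrected p-value does.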
We are leaving…
And we are going to finish for today.
Before we go, I just want to say that not all artistic expressions suffer from this horror vacui. Sometimes artists do the opposite and use emptiness to convey their message. This is very common in photography, with the use of negative space.
We have talked all the time about the correction of our friend Yates, which is the best known. But do not think that it is the only one. There are more, like Cochran's or Mantel's. But that is another story…