# Prediction intervals

The issue of poor relations reminds me of an old school joke about a marquis who had a lower-class nephew and wanted him to share his dining table with all the aristocrats the marquis used to rub shoulders with. With great concern, he strongly urged him to be very polite to the guests. So, in the middle of the dinner, the nephew got up and announced: “Ladies and gentlemen, I beg your pardon, but I have to answer the call of nature.” He started walking towards the bathroom but, halfway there, he stopped suddenly, stood there thinking and, turning around, said: “Well, maybe I’ll also take a dump!”

Fortunately, family doesn’t always give you this kind of problem. It isn’t common for relatives from different social classes to mix, nor, of course, would a real-life marquis invite a guy like that to share his table, even if he were a relative.

But the thing is that there are families who get along, even though some of their members always take most of the fame. This is the case of the interval family. The best known of all is our beloved confidence interval, but it has two cousins that, although much less famous, also contribute worthily to the fight against uncertainty in statistical inference: the prediction interval and the tolerance interval.

We are all aware that it is usually impossible to access the entire population when we want to know the value of a population variable. For this reason, inference techniques have been developed to estimate the population’s inaccessible parameter from values obtained in samples drawn from that population.

The problem is that these estimates always carry a probability of error. And this is where our interval family comes in.

## Confidence intervals

The first family member is the confidence interval. Once we have calculated the parameter value in the sample, it allows us to estimate between what limits the real value lies in the unapproachable population, always with some probability of error. By convention, the level of confidence is usually set at 95%, so the interval can be calculated approximately with the following expression:

95% CI = parameter ± 2 × standard error of the parameter

In the simple case of the confidence interval of a mean, the standard error equals the standard deviation divided by the square root of the sample size, but this calculation gets more complicated with other statistical parameters.
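As a minimal sketch of the calculation just described, the following Python snippet computes the approximate 95% confidence interval of a mean with the “± two standard errors” rule. The data in `sample` and the function name `mean_ci_95` are made up for illustration.

```python
import math
import statistics


def mean_ci_95(sample):
    """Approximate 95% CI of the mean: mean ± 2 standard errors."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)   # sample standard deviation (n - 1 denominator)
    se = sd / math.sqrt(n)          # standard error of the mean
    return mean - 2 * se, mean + 2 * se


# Hypothetical measurements, invented for this example.
sample = [4.8, 5.1, 5.5, 4.9, 5.0, 5.3, 4.7, 5.2, 5.4, 5.1]
low, high = mean_ci_95(sample)
print(f"mean = {statistics.mean(sample):.2f}, 95% CI ≈ ({low:.2f}, {high:.2f})")
```

Using 2 instead of the usual 1.96 (or a Student’s t value for small samples) is just the rounding the rule of thumb makes.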

In any case, the confidence interval always represents the margins within which the true population value is likely to be contained. Strictly speaking, the probability of containment (the confidence) doesn’t apply to one particular interval but to the proportion of all intervals that would include the actual parameter if we repeated the measurement a large number of times.
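This “proportion of all intervals” idea is easier to see with a small simulation. The sketch below assumes a hypothetical normal population whose true mean we know (something that never happens in real life), draws many samples from it and counts how many of the resulting intervals contain that true mean. All the constants are invented for the example.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

TRUE_MEAN, TRUE_SD = 10.0, 2.0   # hypothetical population, assumed normal
N, REPEATS = 50, 2000            # sample size and number of repeated samples

hits = 0
for _ in range(REPEATS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(N)
    # Does this particular interval contain the true mean?
    if mean - 2 * se <= TRUE_MEAN <= mean + 2 * se:
        hits += 1

coverage = hits / REPEATS
print(f"{coverage:.1%} of the intervals contain the true mean")
```

Each individual interval either contains the true mean or it doesn’t; the 95% describes how often the procedure gets it right in the long run.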

Although there’s much to say about confidence intervals, we’re not going to linger on this topic for now. For those who are interested, I recommend an article where this is explained further.

## Prediction intervals

The second member of this family is the prediction interval. The concept is very similar to that of the confidence interval. Here we first have to estimate the population’s value from the sample data and then calculate the prediction interval, which tells us between what limits a certain proportion of randomly chosen subjects will fall, always with a degree of probability.

If the measured variable is normally distributed (an assumption worth checking, since a large sample size makes the sample mean approximately normal, not the individual values), about 95% of the population’s subjects will be within the range of the mean ± two standard deviations. It’s pretty similar to the confidence interval, but with two differences.

First, prediction intervals use the standard deviation instead of the standard error, which is used for confidence intervals. Since the standard error is the standard deviation divided by the square root of the sample size, the standard deviation is the greater of the two whenever the sample has more than one subject, so prediction intervals will always be wider than confidence intervals for a given level of uncertainty. Second, to calculate confidence intervals we first need the population values estimated from one or more samples, while prediction intervals are stated a priori, about subjects that have not yet been sampled from their population.
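To put numbers on the first difference, here is a small sketch that computes both intervals from the same sample and then checks how many newly drawn subjects fall inside the prediction interval. It assumes a hypothetical normal variable and follows this post’s rule of thumb (mean ± two standard deviations), not the more exact formulas that statistical software uses.

```python
import math
import random
import statistics

random.seed(7)  # fixed seed so the run is repeatable

# Hypothetical normally distributed variable (values invented for the example).
POP_MEAN, POP_SD = 100.0, 15.0
n = 200
sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(n)]

mean = statistics.mean(sample)
sd = statistics.stdev(sample)
se = sd / math.sqrt(n)

ci = (mean - 2 * se, mean + 2 * se)   # where the population mean should be
pi = (mean - 2 * sd, mean + 2 * sd)   # where about 95% of subjects should be

ci_width = ci[1] - ci[0]
pi_width = pi[1] - pi[0]

# Draw new subjects and count how many fall inside the prediction interval.
new_subjects = [random.gauss(POP_MEAN, POP_SD) for _ in range(10_000)]
share_inside = sum(pi[0] <= x <= pi[1] for x in new_subjects) / len(new_subjects)

print(f"CI width: {ci_width:.1f}, prediction interval width: {pi_width:.1f}")
print(f"share of new subjects inside the prediction interval: {share_inside:.1%}")
```

With this rule the prediction interval is wider than the confidence interval by a factor equal to the square root of the sample size, so the gap between them grows as the sample gets larger.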

## Tolerance intervals

The third cousin on the scene is the tolerance interval. It is very similar to the prediction interval. It is calculated from data from one or more samples and tells us the boundaries within which future observations will fall, with the confidence level we choose.

Just like the prediction interval, the tolerance interval is calculated after we have estimated the population’s parameter. It is useful because it tells us what proportion of all future observations will be within a certain range with a certain probability.

Obviously, all these samples should be drawn from the same population, under the same conditions, and at random.

In theory, tolerance intervals are only valid if they are calculated from the mean and standard deviation of the population but, as these values are usually unknown, their sample estimates are often used instead, which introduces a degree of uncertainty that grows as the sample size gets smaller.

This uncertainty is what the tolerance interval controls: it tells us, with a certain confidence, what proportion of the population will lie within a given range. It’s calculated using the following expression:

TI95% = parameter ± k SD

In this equation, SD is the standard deviation and k is a factor that depends on the sample size, the confidence level and the proportion of the population to be covered. The math is complex, so do not try to calculate it without the aid of a computer application.
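To get a feeling for how k behaves, here is a rough sketch that uses only Python’s standard library. It is one possible approximation, not the exact calculation: it assumes a normally distributed variable and a two-sided interval, uses Howe’s approximation for k, and replaces the chi-square quantile with the Wilson-Hilferty approximation. The function names `chi2_quantile_approx` and `tolerance_factor` are my own.

```python
import math
from statistics import NormalDist


def chi2_quantile_approx(prob, df):
    """Wilson-Hilferty approximation to the chi-square quantile."""
    z = NormalDist().inv_cdf(prob)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * math.sqrt(c)) ** 3


def tolerance_factor(n, coverage=0.95, confidence=0.95):
    """Approximate two-sided k (Howe's method) for a normal variable."""
    df = n - 1
    # Normal quantile that leaves `coverage` of the population in the middle.
    z = NormalDist().inv_cdf((1.0 + coverage) / 2.0)
    # Lower-tail chi-square quantile: small values inflate k for small samples.
    chi2 = chi2_quantile_approx(1.0 - confidence, df)
    return z * math.sqrt(df * (1.0 + 1.0 / n) / chi2)


for n in (10, 20, 50, 1000):
    print(f"n = {n:4d}  k ≈ {tolerance_factor(n):.2f}")
```

With 20 subjects the factor comes out close to 2.75, clearly above the 2 we used for the other intervals, and it shrinks towards 1.96 as the sample grows. For real work, a dedicated statistical package will give the exact value.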

To finish for now, I’ll just tell you that both prediction and tolerance intervals may be two-sided or one-sided. One-sided intervals tell us the minimum or maximum population values within the degree of confidence that we specify.

## We’re leaving…

And that’s all, folks. We have said nothing about another prediction interval which is much less friendly, but which is also of great use: the one that plays a leading role in regression models. But that’s another story…