I dare even say that some deviations are much needed. But don’t get excited too soon. Although that statement could mean anything, what we’re going to talk about is how the values vary in a quantitative distribution.
When we get the values of a given parameter in a sample and want a summary idea of how it behaves, the first thing that comes to mind is to calculate a measure that represents it, so we turn to the mean, the median or any other measure of central tendency.
Dispersion measures: variance
However, the central value gives little information unless it is complemented by another measure that tells us how heterogeneous the results are within the distribution. To quantify the degree of variation, mathematicians, with very little imagination, have invented a thing called the variance.
To calculate it, we could subtract the mean from each individual value, add up all these differences and divide the result by the number of measurements. It’s like calculating the mean of the differences between each value and the central value of the distribution. But there’s a slight problem: as values are situated both above and below the mean (it must be so, it’s the average), positive and negative differences cancel each other out and the sum always comes to zero, however large the degree of variation.
To avoid this, we square the differences before adding them, which gets rid of any negative sign. In this way, we always come up with a non-negative value related to the size of the differences. This value is what is known as the variance.
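For those who prefer to see it in numbers, here is a minimal Python sketch of the steps just described. The five blood pressure readings and the function name `population_variance` are made up for illustration; they do not come from any real study.

```python
def population_variance(values):
    """Mean of the squared differences from the mean (dividing by n)."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]   # some positive, some negative
    squared = [d ** 2 for d in deviations]    # squaring removes the signs
    return sum(squared) / n

# Hypothetical readings in mmHg, invented for this example.
pressures = [90, 95, 100, 105, 110]
mean = sum(pressures) / len(pressures)

raw_sum = sum(v - mean for v in pressures)
print(raw_sum)                          # 0.0: the raw differences cancel out
print(population_variance(pressures))   # 50.0, in squared mmHg
```

The first printed value shows why the plain differences are useless as a measure of variation, and the second is the variance obtained once they are squared.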
For example, let’s suppose we measure the systolic blood pressure of 200 randomly selected schoolchildren and get an average of 100 mmHg. We subtract the mean from each value, square the differences, add up all the squares and divide the result by 200 (the number of determinations). We come up with a variance of, for instance, 100 mmHg². And I wonder, what the heck is a square millimeter of mercury? Variance may describe variation well, but I won’t deny it’s a bit difficult to interpret. Again, some mathematical genius runs to our rescue with the solution: take the square root of the variance and thus recover the original units of the variable.
Doing so, we get to the most famous of deviations: the standard deviation. In our case it would be the square root of 100 mmHg², that is, 10 mmHg. If we consider the two parameters together, we get the idea that most of the pupils have a blood pressure close to the mean. If we had obtained a standard deviation of 50 mmHg, we would think that there was much individual variation in the blood pressure determinations, even though the sample mean was the same.
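To make the comparison concrete, the sketch below uses Python’s standard `statistics` module on two invented samples that share the same mean but differ in spread. The sample values and the names `tight` and `wide` are assumptions made up for the example.

```python
import statistics

# Two hypothetical samples (mmHg) with the same mean but different spread.
tight = [90, 95, 100, 105, 110]
wide = [50, 75, 100, 125, 150]

for name, sample in (("tight", tight), ("wide", wide)):
    mean = statistics.mean(sample)
    variance = statistics.pvariance(sample)  # divides by n; units are squared mmHg
    sd = statistics.pstdev(sample)           # square root of the variance, back in mmHg
    print(f"{name}: mean={mean}, variance={variance}, sd={sd:.2f}")
```

Both samples have a mean of 100 mmHg, so only the standard deviation tells them apart.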
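A clarification for the sake of purists. Usually, the sum of squared differences is divided by the number of cases minus one (n-1) instead of by the number of cases (n), which would seem more logical. Why? It looks like a whim of mathematicians, but there is a reason: the differences are measured from the sample mean, which by construction sits closer to the sample’s values than the true population mean does, so dividing by n tends to underestimate the variation. Dividing by n-1 gives a value that is, on average, closer to the variance of the population from which the sample comes.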
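A small simulation can show the effect of dividing by n-1. This is only a sketch under stated assumptions: a made-up normal population with mean 100 and standard deviation 10 (so a true variance of 100), samples of just 5 values, and a fixed random seed so the run can be repeated.

```python
import random
import statistics

rng = random.Random(42)   # fixed seed so the run is repeatable
TRUE_SD = 10              # assumed population SD, so the true variance is 100
SAMPLE_SIZE = 5
TRIALS = 10_000

divided_by_n = []
divided_by_n_minus_1 = []
for _ in range(TRIALS):
    sample = [rng.gauss(100, TRUE_SD) for _ in range(SAMPLE_SIZE)]
    divided_by_n.append(statistics.pvariance(sample))         # sum of squares / n
    divided_by_n_minus_1.append(statistics.variance(sample))  # sum of squares / (n-1)

avg_n = statistics.mean(divided_by_n)
avg_n_minus_1 = statistics.mean(divided_by_n_minus_1)
print(f"average when dividing by n:   {avg_n:.1f}")
print(f"average when dividing by n-1: {avg_n_minus_1:.1f}")
```

Under these assumptions, the average of the values divided by n should land near 80, clearly below the true variance, while the average of those divided by n-1 should land near 100.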
We have, therefore, the two values that define our distribution. And the good thing is that they not only give us an idea of the central value and the dispersion of the data, but also indicate the probability of finding an individual with a certain value in the sample. If the data follow an approximately normal distribution, about 95% of them will have a value within the mean plus or minus two times the standard deviation (1.96 times, to be exact) and about 99% within the mean plus or minus 2.5 times the standard deviation (2.58 times, actually).
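These percentages can be checked with a quick sketch, assuming simulated data drawn from a normal distribution (mean 100, standard deviation 10, both invented for the example). The helper name `fraction_within` is hypothetical.

```python
import random
import statistics

rng = random.Random(0)   # fixed seed so the run is repeatable
# Simulated "blood pressures": an assumption, not real measurements.
values = [rng.gauss(100, 10) for _ in range(20_000)]

mean = statistics.fmean(values)
sd = statistics.stdev(values)

def fraction_within(k):
    """Fraction of values lying within mean +/- k standard deviations."""
    low, high = mean - k * sd, mean + k * sd
    return sum(low <= v <= high for v in values) / len(values)

frac_95 = fraction_within(1.96)
frac_99 = fraction_within(2.58)
print(f"within 1.96 SD: {frac_95:.3f}")
print(f"within 2.58 SD: {frac_99:.3f}")
```

With normally distributed data the two fractions should come out close to 0.95 and 0.99; with a strongly skewed sample they need not.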
This sounds dangerously like the 95% and 99% confidence intervals, but we should not confuse the terms. If we repeated the blood pressure experiment a very large number of times, we would get a slightly different mean each time. We could calculate the mean of the results of those experiments and the standard deviation of that group of means.
This standard deviation is what is known as the standard error, and it is the value we use to calculate the confidence intervals within which the actual value in the population from which the sample comes is expected to lie, a value that we can neither measure directly nor know exactly. Therefore, the standard deviation tells us about the dispersion of the data in the sample, while the standard error gives us an idea of the precision with which we can estimate the true value of the variable in the population from which we extracted our sample.
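The repeated experiment can be imitated on a computer. The sketch below assumes a made-up normal population (mean 100, standard deviation 10) and repeats a 200-child experiment 2,000 times. It also shows the usual shortcut when only one sample is available, which is to estimate the standard error as the sample’s standard deviation divided by the square root of n. All names and numbers are illustrative assumptions.

```python
import math
import random
import statistics

rng = random.Random(1)       # fixed seed so the run is repeatable
POP_MEAN, POP_SD = 100, 10   # assumed "true" population values
N = 200                      # children per experiment
EXPERIMENTS = 2_000

def run_experiment():
    """One simulated sample of N blood pressure readings."""
    return [rng.gauss(POP_MEAN, POP_SD) for _ in range(N)]

# Standard deviation of the means of many repeated experiments.
means = [statistics.fmean(run_experiment()) for _ in range(EXPERIMENTS)]
sd_of_means = statistics.stdev(means)

# In practice we have a single sample, so we estimate the standard error from it.
one_sample = run_experiment()
sample_mean = statistics.fmean(one_sample)
estimated_se = statistics.stdev(one_sample) / math.sqrt(N)
ci_low = sample_mean - 1.96 * estimated_se
ci_high = sample_mean + 1.96 * estimated_se

print(f"SD of the means:         {sd_of_means:.3f}")
print(f"SE from a single sample: {estimated_se:.3f}")
print(f"95% CI for the mean:     {ci_low:.1f} to {ci_high:.1f}")
```

Under these assumptions both figures should be close to 10 divided by the square root of 200, roughly 0.7 mmHg, which is far smaller than the standard deviation of 10 mmHg of the individual readings.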
One last thought about the standard deviation. Although we said that 95% of the values fall in the interval formed by the mean plus or minus two times the standard deviation, this assumption makes sense only if the distribution is reasonably symmetrical. For heavily skewed distributions, the standard deviation loses much of its meaning and we use other measures of dispersion. But that’s another story…