Why spare one?

Today we will talk about one of those mysteries of statistics that few people know the reason for. I am referring to the choice between dividing by n (the sample size) or by n-1 when calculating measures of central tendency and dispersion of a sample, particularly its mean (m) and its standard deviation (s).

We all know what the mean is. As its own name says, it is the average value of a distribution of data. To calculate it we add up all the values of the distribution and divide by the total number of elements, that is, by n. Here there is no doubt: we divide by n to get the most widely used measure of centralization.

Meanwhile, the standard deviation is a measure of the average deviation of each value from the mean of the distribution. To obtain it we calculate the difference between each element and the mean, square the differences so that negative differences do not cancel out positive ones, add them all up and divide the result by n. Finally, we take the square root of the result. Since it is the mean of all the differences, we divide the sum by the number of elements, n, just as we did with the mean, according to the well-known formula of the standard deviation.
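The author's calculations later use R, but the recipe above is easy to check in a few lines of Python (a sketch with made-up toy data, not the author's script):

```python
import math

# Toy data (hypothetical values, just to illustrate the recipe)
values = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(values)

# Mean: sum of the values divided by n
mean = sum(values) / n

# Standard deviation: square the differences from the mean,
# average them (dividing by n) and take the square root
variance = sum((x - mean) ** 2 for x in values) / n
sd = math.sqrt(variance)

print(mean, sd)  # 5.0 2.0
```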

However, many times we see that, to calculate the standard deviation, we divide by n-1. Why spare one element? Let’s see.

We usually deal with samples, from which we obtain their centralization and dispersion measures. But what we are really interested in is knowing the value of the parameters in the population from which we drew the sample. Unfortunately, we cannot calculate these parameters directly, but we can estimate them from the sample's statistics. So, we want to know whether the sample mean, m, is a good estimator of the population mean, μ. We also want to know whether the standard deviation of the sample, s, is a good estimate of the deviation of the population, which we call σ.

Let's do an experiment to see whether m and s are good estimators of μ and σ. We are going to use the R program. I leave the list of commands (the script) in the accompanying figure in case you want to reproduce it with me.

First we generate a population of 1,000 individuals with a normal distribution with mean 50 and standard deviation of 15 (μ = 50 and σ = 15). Once done, let’s see first what happens in the case of the mean.

If we draw a sample of 25 elements from the population and calculate its mean, this mean will resemble that of the population (if the sample is representative of the population), but there will be differences due to chance. To overcome these differences, we obtain 50 different samples, with their 50 different means. These means follow a normal distribution (the so-called sampling distribution), whose mean is the mean of all the sample means we obtained. If we extract 50 samples with R and find the mean of their means, we see that it equals 49.6, which is almost exactly 50. So we see that we can estimate the mean of the population using the means of the samples.
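The post's experiment is run with an R script shown in a figure; here is a rough Python equivalent (a sketch with a fixed seed so it reproduces, which means its exact numbers will differ from the 49.6 of the R run):

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is reproducible

# Population of 1,000 individuals, roughly normal with mu = 50, sigma = 15
population = [random.gauss(50, 15) for _ in range(1000)]

# Draw 50 samples of 25 elements each and keep their means
sample_means = [statistics.mean(random.sample(population, 25))
                for _ in range(50)]

# The mean of the sampling distribution should land close to 50
print(round(statistics.mean(sample_means), 1))
```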

And what about the standard deviation? If we do the same (extract 50 samples, calculate their s and, finally, calculate the mean of the 50 values of s) we get an average s of 14.8. This s is quite close to the population value of 15, but it fits worse than the mean did. Why?

The answer is that the sample mean is what is called an unbiased estimator of the population mean: the mean value of the sampling distribution is a good estimate of the population parameter. The same does not happen with the standard deviation, because it is a biased estimator. This is because the variation in the data (which is ultimately what the standard deviation measures) is greater in the population than in the sample, since the population is larger (the larger the size, the greater the possibility of variation). So we divide by n-1, so that the result comes out a little higher.

If we run the experiment with R dividing by n-1, we obtain an unbiased standard deviation of 15.1, somewhat closer than the one obtained dividing by n. This estimator (dividing by n-1) is an unbiased estimator of the population standard deviation. So which one should we use? If we want to know the standard deviation of the sample we can divide by n, but if we want to estimate the theoretical value of the deviation in the population, the estimate will fit better if we divide by n-1.
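Continuing the Python sketch of the experiment, we can compute each sample's standard deviation both ways and compare. Dividing by n-1 always gives the slightly larger value (again, the exact figures won't match the 14.8 and 15.1 of the R run):

```python
import random
import statistics

random.seed(1)  # fixed seed for reproducibility
population = [random.gauss(50, 15) for _ in range(1000)]

def sd_divide_by_n(data):
    """Standard deviation dividing by n (the biased version)."""
    m = sum(data) / len(data)
    return (sum((x - m) ** 2 for x in data) / len(data)) ** 0.5

biased, unbiased = [], []
for _ in range(50):
    sample = random.sample(population, 25)
    biased.append(sd_divide_by_n(sample))
    unbiased.append(statistics.stdev(sample))  # divides by n - 1

# The n-1 version is always a bit higher, closer to sigma = 15
print(round(statistics.mean(biased), 1), round(statistics.mean(unbiased), 1))
```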

And here we end this nonsense. We could also talk about how to obtain from the sampling distribution not only the point estimate, but also its confidence interval, which includes the population parameter with a given confidence level. But that is another story…

Meat or fish?

This is the difficult dilemma I face every time I go to have lunch at a good restaurant. Honestly, I like meat better but, since science textbooks say I'm an omnivorous animal and I don't want to contradict them, I try to eat all kinds of food, including fish.

Both meat and fish have their pros and cons. Meat is easier to eat. On the other hand, I find it difficult to get good fish outside a good restaurant, so I hate to miss the opportunity. But meat tastes so good. A hard decision…

It is much easier to decide between mean and median, no doubt about it.

As you all know, the mean (we’re talking about the arithmetic mean) and the median are measures of central tendency or centralization. They provide information on what is the central value of a distribution.

The simplest way to calculate the arithmetic mean is to add all the values of the distribution and divide the result by the number of elements of the distribution, our beloved n.

To get the median we sort the elements of the distribution from lowest to highest and locate the central one. If there's an odd number of elements, the median is the central value. For instance, if we have a distribution of 11 elements sorted from lowest to highest, the value of the element in the sixth position will be the median of the distribution. If the number is even, the median will be the average of the two central values. For example, if we have 10 elements, the median will be the mean of the fifth and the sixth. There are formulas and other ways to get the median but, as always, the best way is to use a computer program that will do it effortlessly.
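The two cases above (odd and even n) can be sketched in Python with the standard library, using hypothetical numbers:

```python
import statistics

# 11 elements (odd): the median is the value in the sixth position once sorted
odd = [11, 3, 1, 7, 5, 9, 2, 8, 4, 6, 10]
print(statistics.median(odd))   # 6

# 10 elements (even): the median is the mean of the fifth and sixth values
even = [3, 1, 7, 5, 9, 2, 8, 4, 6, 10]
print(statistics.median(even))  # 5.5
```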

It's usually easier to decide between mean and median than between meat and fish, as there are some general rules that can be applied in each case.

First, when the data don't fit a normal distribution it's more appropriate to use the median. This is because the median is much more robust, which means that it is less affected by the presence of skewness or outliers in the distribution.

The second rule has to do with the first. When there are very extreme values in the distribution, the median will describe the central point better than the mean, which has the drawback of deviating towards the outliers: the more extreme the outliers, the larger the deviation.

Finally, some people think that using the median makes more sense when talking about certain kinds of variables. For example, when talking about survival, the median not only tells us about survival time, but also the time by which half of the sample survives. It will be more informative than the arithmetic mean.

Anyway, whichever you choose, both measures are still useful. To understand this, we will look at a couple of examples that are so good that I've just invented them.

Let's suppose a school with five teachers. The science teacher is paid 1200 euros per month, the math teacher 1500, the literature teacher 800 and the history teacher 1100. It turns out that the principal is a football fan, so he hires Zinedine Zidane as a gym teacher. The problem is that Zizou doesn't work for 1000 euros a month, so he is assigned a salary of 20000 euros per month.

In this case the mean salary is 4920 euros per month and the median 1200 euros. Which do you think is the better measure of central tendency in this case? It seems clear that the median gives a better idea of what teachers typically earn at this school. The mean rises a lot because it is dragged towards the extreme value of 20000 euros per month.
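Checking the school example in Python:

```python
import statistics

# Monthly salaries from the example, with Zidane's 20000 at the end
salaries = [1200, 1500, 800, 1100, 20000]

print(statistics.mean(salaries))    # 4920 euros: dragged up by the outlier
print(statistics.median(salaries))  # 1200 euros: the typical salary
```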

Many of you might even be thinking that the mean is of little use in this case. But that's because you're looking at it from the point of view of an applicant for a teaching job. If you were applying to be the manager of the school and had to prepare the monthly budget, which of the two measures would be more helpful? No doubt the mean, which will tell you how much money you have to set aside to pay the teachers, knowing the number of teachers in the school, of course.

Here's another example. Suppose I take 20 overweight people and allocate them into two groups to test two diets. Stretching our imagination, we'll call them diet A and diet B.

Three months later, those on diet A have lost 3.8 kg on average, whereas those on diet B have lost a mean of 2.7 kg. Which of the two diets is more effective?

For those smart people who said diet A, I will provide a little more information. These are the differences between final and initial weight of the patients on diet A: 2, 4, 0, 0, -1, -1, -2, -2, -3 and -35. And these are the same values for the subjects on diet B: -1, -1, -2, -2, -3, -3, -3, -3, -4 and -5. Do you still think that diet A is more effective?

I'm sure the more vigilant among you have already detected the trap in this example. In group A there's an outlier who lost 35 kg with the diet, which drags the mean towards that -35. So let's calculate the medians: -1 kg for diet A and -3 kg for diet B. It seems that diet B is more effective and that the median gives, in this case, better information about the central tendency of the distribution. Note that in this example it is easy to see by looking at the raw data, but with 1000 participants instead of 10 it wouldn't be so easy. You would have to detect the presence of outliers and use a central tendency measure more robust than the mean, such as the median.
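The same check for the diet example; computing directly from the listed data (a Python sketch):

```python
import statistics

# Weight changes (final minus initial) for each participant, as listed
diet_a = [2, 4, 0, 0, -1, -1, -2, -2, -3, -35]
diet_b = [-1, -1, -2, -2, -3, -3, -3, -3, -4, -5]

# The single -35 kg outlier drags diet A's mean down
print(statistics.mean(diet_a), statistics.median(diet_a))  # -3.8 -1.0
print(statistics.mean(diet_b), statistics.median(diet_b))  # -2.7 -3.0
```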

Surely someone would eliminate the outlier and use the mean with the rest of the data, but this is not advisable, because outliers can also provide information about certain aspects. For example, who is to say there isn't a special metabolic situation in which diet A is much more effective than diet B, even though diet B is the more effective in the other cases?

And let's leave it here for today. I'll just add that sometimes we can transform the data to obtain a normal distribution or to compensate for the effect of outliers. Also, there are other robust measures of central tendency besides the median, such as the geometric mean or the trimmed mean. But that's another story…

The error of confidence

Our life is full of uncertainty. Many times we want information that is out of our reach, and then we have to settle for approximations. The problem is that approximations are subject to error, so we can never be completely sure that our estimates are true. But we can measure our degree of uncertainty.

This is one of the things statistics is responsible for: quantifying uncertainty. For example, let's suppose we want to know the mean cholesterol level of adults between 18 and 65 years old in the city where I live. If we want the exact number we have to call them all, convince them to be tested (most of them are healthy and won't want to know anything about blood tests) and make the determination in every one of them, to calculate the average we want to know.

The problem is that I live in a big city, with about five million people, so it's impossible from a practical point of view to measure serum cholesterol in all the adults in the age range we are interested in. What can we do? We can select a more affordable sample, calculate the mean cholesterol of its members and then estimate the average value in the entire population.

So, I randomly pick out 500 individuals and determine their cholesterol levels, in milligrams per deciliter, getting a mean value of 165, a standard deviation of 25 and an apparently normal distribution, as shown in the attached graph.

Logically, as the sample is large enough, the mean value of the population will probably be close to the 165 obtained in the sample, but it's also very unlikely to be exactly that. How can we know the population value? The answer is that we cannot know the exact value, but we can know the approximate one. In other words, we can calculate a range within which the value of my unaffordable population lies, always with a certain level of confidence (or uncertainty) that we can set ourselves.

Let’s consider for a moment what would happen if we repeat the experiment many times. We would get a slightly different value every time, but all of them should be similar and close to the actual value of the population. If we repeat the experiment a hundred times and get a hundred mean values, these values will follow a normal distribution with a specific mean and standard deviation.

Now, we know that, in a normal distribution, about 95% of the values are located in the interval enclosed by the mean plus or minus two standard deviations. In the case of the distribution of the means of our experiments, the standard deviation of that distribution is called the standard error, but its meaning is the same as that of any standard deviation: the range between the mean plus or minus two standard errors contains 95% of the means of the distribution. This implies, roughly speaking, that the actual mean of our population will be included in that interval 95% of the time, and that we don't need to repeat the experiment a hundred times: it's enough to compute the interval as the obtained mean plus or minus two standard errors. And how can we get the mean's standard error? Very simple, using the following expression:

standard error = standard deviation / square root of sample size

SE= \frac{SD}{\sqrt{n}}

In our example, the standard error equals 1.12, which means that the mean cholesterol value in our population is within the range 165 – 2.24 to 165 + 2.24 or, what amounts to the same, 162.76 to 167.24, always with a probability of error of 5% (a confidence level of 95%).
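The arithmetic of the example, spelled out in Python (same numbers as in the text):

```python
import math

n = 500        # sample size
mean = 165     # sample mean cholesterol, mg/dl
sd = 25        # sample standard deviation

# Standard error of the mean
se = sd / math.sqrt(n)

# Approximate 95% confidence interval: mean plus or minus two standard errors
low, high = mean - 2 * se, mean + 2 * se

print(round(se, 2), round(low, 2), round(high, 2))  # 1.12 162.76 167.24
```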

We have thus calculated the 95% confidence interval of our mean, which allows us to estimate the values between which the true population mean lies. All confidence intervals are calculated similarly, varying in each case how the standard error is computed, which will be different depending on whether we are dealing with a mean, a proportion, a relative risk, etc.

To finish this post, I have to tell you that the way we have done this calculation is an approximation. When we know the standard deviation of the population we can use the standard normal distribution to calculate the confidence interval. If we don't know it, which is the usual situation, and the sample is large enough, we can approximate using a normal distribution with little error. But if the sample is small, the distribution of the means of our repeated experiments won't follow a normal distribution but a Student's t distribution, so we should use that distribution to calculate the interval. But that's another story…