Do not be misled by the outliers

Robust scale parameters

We saw in a previous post that the extreme values of a distribution, called outliers, can skew the statistical estimates we calculate from our sample.

A typical example is the arithmetic mean, which shifts toward the extreme values, if there are any, and shifts further the more extreme they are. We saw that, to avoid this inconvenience, there are a number of relatives of the arithmetic mean that are considered robust or, what amounts to the same thing, less sensitive to the presence of outliers. Of these, the best known is the median, although there are others, such as the trimmed mean, the winsorized mean, the weighted mean, the geometric mean, etc.

The problem with outliers

Well, something similar to what happens to the mean also happens to the standard deviation, the most frequently used statistic of scale or dispersion. The standard deviation is biased by the presence of extreme values, yielding results that are unrepresentative of the actual spread of the distribution.

Consider the example we used when speaking about the robust estimators of the mean. Suppose we measure the levels of serum cholesterol in a group of people and we find the following values (in mg/dl): 166, 143, 154, 168, 435, 159, 185, 155, 167, 152, 152, 168, 177, 171, 183, 426, 163, 170, 152 and 155. As you can see, there are two extreme values (426 and 435 mg/dl) that will bias our mean and standard deviation. In our case, the standard deviation comes out at 83 mg/dl, which clearly does not reflect how most of the values are spread around any robust measure of central tendency we might choose.
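To see how strongly the two outliers pull on the classical statistics, here is a minimal sketch using only Python's standard `statistics` module. The variable names are my own, and it assumes the 83 mg/dl quoted above is the sample standard deviation (dividing by n-1), which is what reproduces that figure.

```python
import statistics

# Serum cholesterol values (mg/dl) from the example in the text.
cholesterol = [166, 143, 154, 168, 435, 159, 185, 155, 167, 152,
               152, 168, 177, 171, 183, 426, 163, 170, 152, 155]

mean = statistics.mean(cholesterol)      # pulled upward by 426 and 435
median = statistics.median(cholesterol)  # barely affected by them
sd = statistics.stdev(cholesterol)       # sample SD, divides by n - 1

print(f"mean   = {mean:.2f} mg/dl")
print(f"median = {median:.2f} mg/dl")
print(f"SD     = {sd:.2f} mg/dl")
```

The mean lands well above the median, and the standard deviation is several times larger than the gap between most neighbouring values.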

What can we do in this case? Well, we can use one of the several robust estimators of the deviation. Some of them derive from the robust estimators of the mean. Here are a few.

Robust scale parameters

The first, which derives from the median, is the median absolute deviation (MAD). If you remember, the standard deviation is the square root of the sum of the squared differences between each value and the mean, divided by the number of elements, n (or by n-1 if we want to estimate the deviation in the population from a sample). Well, similarly, we can calculate the median of the absolute deviations of each value from the median of the sample (Me), according to the following formula

MAD = Median {|Xi – Me|}, from i=1 to n.

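We can calculate it in our example: the median of the absolute deviations is 11.5 mg/dl, and multiplying it by 1.4826, the usual constant that makes the MAD comparable to the standard deviation under a normal distribution, gives 17.05 mg/dl, much closer to the actual spread than the standard deviation.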
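The MAD formula translates almost word for word into code. The sketch below is illustrative only: the function name `mad` and its `scale` parameter are my own choices, and the 1.4826 factor is the conventional normal-consistency constant, which is what it takes to reproduce the 17.05 mg/dl figure.

```python
import statistics

# Serum cholesterol values (mg/dl) from the example in the text.
cholesterol = [166, 143, 154, 168, 435, 159, 185, 155, 167, 152,
               152, 168, 177, 171, 183, 426, 163, 170, 152, 155]

def mad(values, scale=1.0):
    """Median of the absolute deviations from the median, times `scale`."""
    center = statistics.median(values)
    deviations = [abs(x - center) for x in values]
    return scale * statistics.median(deviations)

raw_mad = mad(cholesterol)                   # the formula exactly as written
scaled_mad = mad(cholesterol, scale=1.4826)  # comparable to an SD under normality

print(f"raw MAD    = {raw_mad:.2f} mg/dl")
print(f"scaled MAD = {scaled_mad:.2f} mg/dl")
```

Many statistical packages apply this scaling by default, so it is worth checking which version a given function returns.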

The second derives from the trimmed mean. This, as the name suggests, is calculated by cutting off a certain percentage of the distribution at its ends (the distribution has to be ordered from smallest to largest). For example, to calculate the 20% trimmed mean in our example, we'd remove 10% per side (two elements per side: 143, 152, 426 and 435) and calculate the arithmetic mean of the rest. Well, we can calculate the classical standard deviation from the remaining elements, getting a value of 10.5 mg/dl.
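The trimming step can be sketched as follows. The helper `trim` and its `per_side` argument are hypothetical names of my own, and the sketch assumes the sample standard deviation (n-1) is then applied to the 16 values that remain.

```python
import statistics

# Serum cholesterol values (mg/dl) from the example in the text.
cholesterol = [166, 143, 154, 168, 435, 159, 185, 155, 167, 152,
               152, 168, 177, 171, 183, 426, 163, 170, 152, 155]

def trim(values, per_side):
    """Sort the values and drop `per_side` elements from each end."""
    ordered = sorted(values)
    return ordered[per_side:len(ordered) - per_side]

kept = trim(cholesterol, per_side=2)  # 10% of 20 values on each side
trimmed_mean = statistics.mean(kept)
trimmed_sd = statistics.stdev(kept)   # sample SD of the remaining values

print(f"kept {len(kept)} values")
print(f"trimmed mean = {trimmed_mean:.2f} mg/dl")
print(f"trimmed SD   = {trimmed_sd:.2f} mg/dl")
```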

And thirdly, we could follow similar reasoning with the winsorized mean. In this case, instead of eliminating the extreme values, we replace them with the closest value that is kept. Once the distribution is winsorized, we calculate the standard deviation of the new values in the usual way. Its value is 9.3 mg/dl (the result of replacing four values on each side, that is, 20% per tail), similar to the previous one.
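Winsorizing can be sketched in a few lines as well. The helper `winsorize` and its `per_side` argument are hypothetical names, and the sample standard deviation (n-1) is assumed. With this data, the 9.3 mg/dl figure is reproduced when four values are replaced on each side; replacing only two per side, as in the trimming example, gives a somewhat larger value.

```python
import statistics

# Serum cholesterol values (mg/dl) from the example in the text.
cholesterol = [166, 143, 154, 168, 435, 159, 185, 155, 167, 152,
               152, 168, 177, 171, 183, 426, 163, 170, 152, 155]

def winsorize(values, per_side):
    """Replace the `per_side` lowest and highest values with the nearest kept value."""
    ordered = sorted(values)
    low = ordered[per_side]                      # smallest value that is kept
    high = ordered[len(ordered) - per_side - 1]  # largest value that is kept
    return [min(max(x, low), high) for x in ordered]

sd_two_per_side = statistics.stdev(winsorize(cholesterol, per_side=2))
sd_four_per_side = statistics.stdev(winsorize(cholesterol, per_side=4))

print(f"winsorized SD, 2 per side = {sd_two_per_side:.2f} mg/dl")
print(f"winsorized SD, 4 per side = {sd_four_per_side:.2f} mg/dl")
```

Unlike trimming, the sample keeps all 20 observations, so the standard deviation is still computed with n-1 = 19.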

Which of the three should we use? Well, we want one that behaves efficiently when the distribution is normal (in that case the best is the classical standard deviation) but that is not very sensitive when the distribution departs from normality. In this sense, the best is the median absolute deviation, followed by the winsorized standard deviation.

We’re leaving…

One last piece of advice before finishing. Do not calculate these measures by hand, as it can be very laborious. Statistical programs do the math for us effortlessly.

And here we end. We have not said a word about other estimators from the family of M-estimators, such as the biweight midvariance or the percentage bend midvariance. These estimators are much more difficult to understand from the mathematical point of view, although they are very easy to calculate with the appropriate software package. But that is another story…
