An element is said to be the black sheep of a group when it goes in a different or contrary direction to the rest of the group. For example, in a family addicted to reality shows, the black sheep would be the member who goes out of their way to watch documentaries. Of course, if the family is addicted to documentaries, the black sheep will be dying to watch reality shows. Always against the grain.
In statistics there is something similar to the black sheep: anomalous values, also called extreme values, but best known as outliers.
An outlier is an observation that seems inconsistent with the rest of the sample values, given the probability model the sample is assumed to follow. As you can see, it is an element that contradicts the rest of the data, just like a black sheep.
The problem with outliers is that they can do much harm when estimating population parameters from a sample. Let’s recall an example we saw in another post about the calculation of robust measures of centrality. It was the case of a school with five teachers and a director who was a soccer fan. The director hired the teachers with the following salaries: 1200 euros per month for the science teacher, 1500 for the math teacher, 800 for the literature teacher and 1100 for the history teacher. But it turns out that he is dying to hire Zinedine Zidane as gym teacher, so he has to pay him no less than 20,000 euros per month.
Do you see where this is going? Indeed, Zizou is the black sheep, the outlier. Look what happens if we calculate the mean salary: 4920 euros per month is the average salary of the teachers in this school. Do you think this is a realistic estimate? Clearly not: the mean is shifted in the direction of the outlier, and the more extreme the outlier, the greater the shift. If Zizou earned 100,000 euros, the average salary would rise to 20,920 euros. This is folly.
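The arithmetic above can be checked in a few lines of Python, using the salaries from the example and the standard library’s `statistics` module:

```python
from statistics import mean

# Monthly salaries: science, math, literature, history and gym (Zidane) teachers
salaries = [1200, 1500, 800, 1100, 20000]
print(mean(salaries))  # 4920: far above the four "ordinary" salaries

# If the gym teacher earned 100,000 euros, the mean drifts even further
salaries_extreme = [1200, 1500, 800, 1100, 100000]
print(mean(salaries_extreme))  # 20920
```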
If an outlier can do this much damage to an estimator, imagine what it can do to a hypothesis test, whose outcome is a decision to reject or not reject the null hypothesis. So we ask ourselves: what can we do when we discover one (or more) black sheep among our data? Well, we can do several things.
What can we do with outliers?
The first thing that comes to mind is to throw the outlier away and ignore it when analyzing the data. This would be fine if the outlier were the result of an error in data collection but, if it is not, we would be discarding data that carries additional information. In our example the outlier is not an error, but the product of the sports career of the gym teacher. We need a more objective method to decide when to discard an outlier, and although there are so-called discordance tests for this purpose, they have their inconveniences.
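As an illustration of those inconveniences (this is a sketch, not a method from the post), consider the crudest discordance rule of all: flag any point more than three standard deviations from the mean. With only five observations it can never fire, because the outlier inflates the standard deviation used to judge it (no z-score can exceed (n − 1)/√n ≈ 1.79 when n = 5), whereas a rule based on the median absolute deviation flags Zidane’s salary at once:

```python
from statistics import mean, median, stdev

salaries = [1200, 1500, 800, 1100, 20000]

# Naive discordance rule: flag points more than 3 standard deviations from the mean.
# The outlier inflates the standard deviation itself, masking its own extremeness.
z = (max(salaries) - mean(salaries)) / stdev(salaries)
print(round(z, 2))  # 1.79: the 20,000-euro salary is NOT flagged

# Robust alternative: modified z-score based on the median absolute deviation (MAD)
med = median(salaries)
mad = median(abs(x - med) for x in salaries)
modified_z = 0.6745 * (max(salaries) - med) / mad
print(round(modified_z, 1))  # 42.3: now the outlier stands out clearly
```

The 0.6745 factor is the usual constant that makes the MAD comparable to a standard deviation under normality.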
The second thing we can do is identify it. This means finding out whether the value is so extreme for some specific reason, as happens in our example. An extreme value may be signaling an important finding, so we have no reason to dismiss it hastily; rather, we should try to interpret its meaning.
Thirdly, we can incorporate the outlier. As we said when defining them, outliers go against the rest of the data according to the probability model we assume the sample follows. Sometimes an outlier ceases to be one if we assume the data follow another model. For example, a value may be an outlier if we assume the data follow a normal distribution, but not if we assume they follow a lognormal distribution.
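A toy illustration of this idea (hypothetical data, not from the post): in a series where each value doubles the previous one, the largest value looks suspicious on the raw scale, but perfectly ordinary once we take logarithms, which is exactly what assuming a lognormal rather than a normal distribution amounts to:

```python
from math import log
from statistics import mean, stdev

# Hypothetical multiplicative data: each value doubles the previous one
data = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]

def max_z(values):
    """z-score of the largest value, under a normality assumption."""
    return (max(values) - mean(values)) / stdev(values)

print(round(max_z(data), 2))                    # 2.48: 512 looks extreme on the raw scale
print(round(max_z([log(x) for x in data]), 2))  # 1.49: unremarkable on the log scale
```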
And, fourthly, the most correct option of all: use robust techniques for our estimates and our hypothesis tests. They are called robust precisely because they are less affected by the presence of outliers. In our example with the teachers we would use a robust measure of centrality such as the median. In our case it is 1200 euros, much closer to reality than the mean. Moreover, even if Zizou were paid 100,000 euros per month, the median would remain at 1200 euros per month.
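The robustness of the median in the example can be checked directly:

```python
from statistics import mean, median

salaries = [1200, 1500, 800, 1100, 20000]
print(median(salaries))  # 1200: close to the four "ordinary" salaries
print(mean(salaries))    # 4920: dragged towards the outlier

# Even a wildly more extreme outlier leaves the median untouched
salaries[-1] = 100000
print(median(salaries))  # still 1200
print(mean(salaries))    # 20920
```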
And with this we are done with outliers, those black sheep mixed in with our data. For simplicity, we have not said a word about how we could try to figure out how much an outlier affects the estimation of a parameter. For this, there is a whole statistical methodology based on the calculation of the so-called influence function. But that’s another story…