A box with whiskers

Boxplot

You will agree that it is a curious name for a graph. When I think of its name, a picture of a shoebox with whiskers pops into my head, although I actually think this chart looks more like the TIE fighters from Star Wars.

Boxplot

In any case, the box and whiskers plot, whose formal name is the box plot, is used very often in statistics because of its interesting descriptive abilities.

To see what I mean, the first figure shows two box plots. As you can see, the graph can be drawn vertically or horizontally, and it consists of a box with two segments (the whiskers).

Taking the vertical representation, perhaps the more common of the two, the lower edge of the box represents the 25th percentile of the distribution or, what is the same, the first quartile. Meanwhile, the top edge (which corresponds to the right edge in the horizontal representation) represents the 75th percentile or, what is the same, the third quartile. Thus, the height of the box corresponds to the distance between the 25th and 75th percentiles, which is none other than the interquartile range. Finally, inside the box there is a line representing the median (or second quartile) of the distribution. Sometimes there is a second line representing the mean, although this is less usual.

Now, let’s go for the whiskers. The upper one stretches to the maximum value of the distribution, but it cannot reach more than 1.5 times the interquartile range beyond the top edge of the box. If there are values higher than the third quartile plus 1.5 times the interquartile range, they are represented as points beyond the end of the upper whisker. These points are called outliers. We see in our example that there is an outlier which lies beyond the upper whisker. If there are no extreme values or outliers, the maximum of the distribution is marked by the end of the upper whisker. If there are, the maximum is the outlier farthest from the box.

All this also applies to the lower whisker, which extends to the minimum value when there are no outliers, or no further than the first quartile minus 1.5 times the interquartile range when there are. In that case, the minimum value is the outlier farthest from the box below the lower whisker.
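For those who prefer to see it in code, here is a minimal Python sketch of the numbers a box plot draws. It assumes the usual convention in which the 1.5 times interquartile range limit is measured from the edges of the box (the first and third quartiles). The function name `box_plot_summary`, the sample data and the choice of the "inclusive" quartile method are my own assumptions; different programs compute quartiles in slightly different ways.

```python
import statistics


def box_plot_summary(values):
    """Return the numbers a box plot draws, using the 1.5 x IQR whisker rule."""
    data = sorted(values)
    # Quartiles: "inclusive" is only one of several accepted conventions.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Limits beyond which a value is drawn as an outlier.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        # Each whisker ends at the most extreme value that is not an outlier.
        "lower_whisker": min(inside),
        "upper_whisker": max(inside),
        "outliers": outliers,
    }


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # made-up data with one extreme value
summary = box_plot_summary(sample)
print(summary)
```

With this made-up sample, the value 30 falls beyond the upper limit, so it would be drawn as a point and the upper whisker would stop at the largest remaining value.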

Uses of boxplot

Now we can understand the usefulness of the box plot. At a glance we can obtain the median and the interquartile range and get a feel for the symmetry of the distribution. It is easy to imagine what the histogram of a distribution looks like just by looking at its box plot, as you can see in the second figure. The first graph corresponds to a symmetric distribution, close to a normal one, as the median is centered in the box and the whiskers are roughly symmetrical.

Looking at the middle distribution, the median is shifted toward the lower edge of the box and the upper whisker is longer than the lower one. This is because the distribution has most of its data on the left and a long tail to the right, as seen in its histogram. By similar reasoning, the third distribution has its long tail to the left, and the longest whisker is the lower one.

Finally, this type of plot is also used to compare several distributions. In the third figure you can see two seemingly normal distributions with similar medians. If we are going to test a hypothesis about the equality of their means, we need to know whether their variances are equal (whether there is homoscedasticity) in order to choose the right type of test.

If we compare the two distributions, we see that the box and whiskers of the first are much longer than those of the second. This suggests that the variance of the first distribution is much greater, so we cannot assume equal variances and we have to apply the appropriate adjustment.

We’re leaving…

And this is all I wanted to say about this box with whiskers and how useful it is in descriptive statistics. Needless to say, although we have used it to judge whether a distribution looks normal or whether the variances of several distributions are similar, there are specific tests to study these points mathematically. But that is another story…

Meat or fish?

Mean or median?

This is the difficult dilemma I face every time I go for lunch at a good restaurant. Honestly, I prefer meat, but since science textbooks say I’m an omnivorous animal, and I don’t want to contradict them, I try to eat all kinds of food, including fish.

Both meat and fish have their pros and cons. Meat is easier to eat. On the other hand, I find it difficult to get good fish anywhere but at a good restaurant, so I find it hard to miss the opportunity. But meat tastes so good. A hard decision…

It is much easier to decide between mean and median, no doubt about it.

Mean or median?

As you all know, the mean (we’re talking about the arithmetic mean) and the median are measures of central tendency or centralization. They tell us what the central value of a distribution is.

The simplest way to calculate the arithmetic mean is to add all the values of the distribution and divide the result by the number of elements of the distribution, our beloved n.

To get the median we have to sort the elements of the distribution from lowest to highest and locate the central element. If there is an odd number of elements, the median is the central one. For instance, if we have a distribution of 11 elements sorted from lowest to highest, the value of the element in the sixth position will be the median of the distribution. If the number is even, the median will be the average of the two central values. For example, if we have 10 elements, the median will be the mean of the fifth and sixth ones. There are formulas and other ways to get the median but, as always, the best way is to use a computer program that will do it effortlessly.
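The procedure just described can be sketched in a few lines of Python. The function name `median` and the two sample lists are mine, chosen only to mirror the 11-element and 10-element cases; in practice the standard library function `statistics.median` does the same job.

```python
def median(values):
    """Median by sorting and picking the central element(s)."""
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single central element (sixth of 11, for example).
        return data[mid]
    # Even count: the average of the two central elements (fifth and sixth of 10).
    return (data[mid - 1] + data[mid]) / 2


eleven_values = list(range(1, 12))  # 1, 2, ..., 11
ten_values = list(range(1, 11))     # 1, 2, ..., 10

print(median(eleven_values))
print(median(ten_values))
```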

It’s usually easier to decide between mean and median than between meat and fish, as there are some general rules that can be applied to each case.

When the median is the better choice

First, when the data don’t fit a normal distribution it’s more appropriate to use the median. This is because the median is much more robust, which means that it is less affected by skewness or outliers in the distribution.

The second rule is related to the first. When there are very extreme values in the distribution, the median will tell us more about the central point of the distribution than the mean, which has the drawback of being pulled towards the outliers: the more extreme the outliers, the larger the deviation.

Finally, some people think that using the median makes more sense for certain kinds of variables. For example, if we are talking about survival, the median tells us not only about the survival time, but also how long half of the sample survives. It will be more informative than the arithmetic mean.

Anyway, whichever you choose, both measures are still useful. To understand this we will look at a couple of examples, which are as good as any given that I have just made them up.

Some stupid examples

Let’s imagine a school with five teachers. The Science teacher is paid 1200 euros per month, the Math teacher 1500, the Literature teacher 800 and the History teacher 1100. It turns out that the principal is a football fan, so he hires Zinedine Zidane as a gym teacher. The problem is that Zizou doesn’t work for 1000 euros a month, so the principal gives him a salary of 20000 euros per month.

In this case the mean salary is 4920 euros per month and the median is 1200 euros. Which do you think is the better measure of central tendency here? It seems clear that the median gives a better idea of what teachers typically earn at this school. The mean rises a lot because it is dragged along by the extreme value of 20000 euros per month.

Many of you might even be thinking that the mean is of little use in this case. But that’s because you’re looking at it from the point of view of someone applying for a teaching job. If you were applying to be the manager of the school and had to prepare the monthly budget, which of the two measures would be more helpful? No doubt the mean, which tells you how much money you have to set aside to pay the teachers, provided you know how many teachers there are in the school, of course.
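If you want to check the figures of the school example, a short Python sketch with the standard `statistics` module is enough. The variable names are mine; the salaries are the five from the example.

```python
import statistics

# Monthly salaries in euros: Science, Math, Literature, History and gym.
salaries = [1200, 1500, 800, 1100, 20000]

mean_salary = statistics.mean(salaries)      # pulled up by the 20000 euro salary
median_salary = statistics.median(salaries)  # the middle value once sorted

# The manager's view: mean times number of teachers gives the monthly payroll.
monthly_payroll = mean_salary * len(salaries)

print(mean_salary, median_salary, monthly_payroll)
```

The last line is the reason the mean is still useful: multiplied by the number of teachers it recovers the total that has to be paid, something the median cannot do.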

Here’s another example. Suppose I take 20 overweight people and split them into two groups to test two diets. With a great stretch of the imagination, we’ll call them diet A and diet B.

Three months later, those on diet A have lost 3.8 kg on average, whereas those on diet B have lost a mean of 2.7 kg. Which of the two diets is more effective?

For those smart people who answered diet A, here is a little more information. These are the differences between final and initial weight of the patients on diet A: 2, 4, 0, 0, -1, -1, -2, -2, -3 and -35. And these are the same values for the subjects on diet B: -1, -1, -2, -2, -3, -3, -3, -3, -4 and -5. Do you still think that diet A is more effective?

I’m sure that the more vigilant among you have already spotted the trap in this example. In group A there’s an outlier who lost 35 kg on the diet, which drags the mean toward that -35. So let’s calculate the median: -1 kg for diet A and -3 kg for diet B. It seems that diet B is more effective and that the median gives, in this case, better information on the central tendency of the distribution. Be aware that in this example the problem is easy to see by looking at the raw data, but if you had 1000 participants instead of 10 it wouldn’t be so easy. You would have to detect the presence of outliers and use a measure of central tendency more robust than the mean, such as the median.
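Here is a small Python sketch that repeats the calculation directly from the ten weight changes listed for each diet. The variable names are mine; negative numbers mean weight lost.

```python
import statistics

# Differences between final and initial weight (kg), as listed in the example.
diet_a = [2, 4, 0, 0, -1, -1, -2, -2, -3, -35]
diet_b = [-1, -1, -2, -2, -3, -3, -3, -3, -4, -5]

for name, changes in (("A", diet_a), ("B", diet_b)):
    mean_change = statistics.mean(changes)      # sensitive to the -35 outlier
    median_change = statistics.median(changes)  # barely affected by it
    print(f"Diet {name}: mean {mean_change} kg, median {median_change} kg")
```

Comparing the two printed lines shows how a single extreme participant can reverse the ranking of the diets depending on which measure we look at.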

Surely someone would eliminate the outlier and use the mean of the rest of the data, but this is not advisable, because outliers can also provide useful information. For example, who can say that there isn’t a special metabolic situation in which diet A is much more effective than diet B, even though B is the more effective in all other cases?

We’re leaving…

And let’s leave it here for today. I’ll just say that sometimes we can transform the data to get a normal distribution or to compensate for the effect of outliers. Also, there are other measures of central tendency besides the median that are less sensitive to extreme values, such as the geometric mean or the trimmed mean. But that’s another story…

The statistic every mother wishes for

Those of you who are reading this and belong to the pediatricians’ gang already know what I am talking about: the 50th percentile. There’s no mother who doesn’t want her offspring to be above it in weight, height, intelligence and everything else that a good mother could desire for her child. That’s why we pediatricians, who dedicate our lives to caring for children, love percentiles so much. But what is the meaning of the term percentile? Let’s start from the beginning…

If we have the distribution of values of a variable, we can summarize it with a measure of central tendency and a measure of dispersion. The most common are the mean and the standard deviation, respectively, but sometimes we use other measures of central tendency (such as the median or the mode) and of dispersion.

The simplest of these other measures of dispersion is called range, which is defined as the difference between the maximum and minimum values of the distribution.

Let’s suppose that we collect the birth weights of the last 100 children born at our hospital and sort them as they appear in the table. The lowest value was 2200 grams, while the prize for the biggest goes to an infant who weighed 4000 grams. The range in this case is 1800 grams but, of course, if we did not have the table and someone told us just this, we wouldn’t have much idea about the size of our babies. This is why it’s usually preferable to give the range with explicit minimum and maximum values. In our case it would be from 2200 to 4000 grams.

If you remember how to calculate the median, you will see that it is 3050 grams. To complete the picture we need a measure that tells us how the rest of the weights are distributed around the median and within the range.

The easiest way is to divide the distribution into four equal segments, each including 25% of the children. The three cut points between these segments are called quartiles: the first quartile (25% of the way up from the minimum), the second quartile (which is the same as the median) and the third quartile (at 75%, between the median and the maximum). We end up with four segments: from the minimum to the first quartile, from the first to the second (the median), from the second to the third and from the third to the maximum. In our case, the three quartiles would be 2830, 3050 and 3200 grams. Some people call these the lower quartile, the median and the upper quartile, but they are the same thing.

Now, if we know that the median is 3050 grams and that 50% of the children weigh between 2830 and 3200 grams, we’ll have a pretty good idea about the birth weights of our newborns. This interval is called the interquartile range and it’s usually provided along with the median to summarize the distribution. In our example: a median of 3050 grams with an interquartile range from 2830 to 3200 grams.
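To make this summary concrete, here is a minimal Python sketch. The nine weights below are made up (the real table has 100 values) and were chosen only so that the minimum, maximum and quartiles match the figures quoted in the text; the function name `summarize` and the "inclusive" quartile method are also my own choices.

```python
import statistics


def summarize(values):
    """Range, median and interquartile interval of a list of values."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "minimum": min(values),
        "maximum": max(values),
        "range": max(values) - min(values),
        "median": median,
        "interquartile_interval": (q1, q3),
    }


# Hypothetical birth weights in grams, not the table from the example.
weights = [2200, 2600, 2830, 2900, 3050, 3100, 3200, 3500, 4000]

print(summarize(weights))
```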

But we can go much further. We can divide the distribution into as many segments as we want. The deciles are the result of dividing it into ten segments, and our revered percentiles the result of dividing it into a hundred.

There is a fairly simple formula to calculate any percentile we want. The Pth percentile is found at position (P/100) x (n+1) of the sorted data, where n represents the sample size. In our distribution of neonates, the 22nd percentile would be at position (22/100) x (100+1) = 22.22, which corresponds to 2770 grams.
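The position formula can be sketched in Python as follows. This is only an illustration under my own assumptions: when the position is not a whole number, the sketch interpolates linearly between the two neighbouring values, which is one common convention but not the only one, and the weights are evenly spaced made-up numbers rather than the real table.

```python
def percentile(values, p):
    """p-th percentile located at position (p/100) * (n + 1) of the sorted data."""
    data = sorted(values)
    n = len(data)
    position = p * (n + 1) / 100  # 1-based position, possibly fractional
    lower = int(position)         # whole part of the position
    fraction = position - lower   # how far we are toward the next value
    if lower < 1:
        return data[0]            # position falls before the first value
    if lower >= n:
        return data[-1]           # position falls at or after the last value
    # Linear interpolation between the two neighbouring sorted values.
    return data[lower - 1] + fraction * (data[lower] - data[lower - 1])


# 100 hypothetical birth weights in grams, evenly spaced for simplicity.
weights = [2200 + 18 * i for i in range(100)]

print(percentile(weights, 22))  # value at position 22.22
print(percentile(weights, 50))  # the median, also the fifth decile
```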

The sharpest of you may have noticed that our 3050 grams correspond not only to the median, but also to the fifth decile and to the 50th percentile, the desired one.

The great use of percentiles, apart from giving satisfaction to 50% of mothers (those whose children are above the median), is that they allow us to estimate the probability of a certain value of the variable within the population. In general, the closer you are to the median the better (at least in medicine), and the further away you are from it, the more likely it is that someone will take you to a doctor to find out why you are not closer to the precious percentile or, even, somewhat above it.

But if we really want to refine the calculation of the probability of obtaining a particular value within a data distribution, there are other techniques related to the standardization of the dispersion measure we use. But that’s another story…

Virtue is the happy medium between two extremes, but…

…where is the medium? This question, which sounds like a midsummer night’s delirium, cannot be so easy to answer, given that there are several ways to locate the middle or center of a data distribution.

The point is that finding the virtuous medium helps us to describe our results. If we measure a variable in a series of 1500 patients, it wouldn’t cross anybody’s mind to present the results as a list of the 1500 values obtained. We usually look for some sort of summary that gives us an idea of how this variable is distributed in our sample, usually calculating a measure of location (the middle) and another of dispersion (how the data are spread around the middle).

Let’s suppose that, for some reason difficult to explain, we want to know the average height of the users of Madrid’s Metro. We go to the nearest station and, when the train arrives, we get on the third car and measure the height of all the passengers, obtaining the results shown in Table 1.

Once we have collected the data, the measure of central tendency that first comes to mind is the arithmetic mean, which is the average height. We all know how it’s calculated: the sum of all the values divided by the number of values obtained. In our case that value would be 170 cm, and it gives us an idea of the average height of the members of our sample.

But let’s suppose now that the national basketball team’s bus has all four of its tires flat, so the players have had to take the subway to get to the game, with the misfortune for us that they travel in the third car. The heights we would come up with are shown in Table 2. In this case the average is 177 cm but, do you really think that this value is very close to the average height of most Metro users? Probably not. In this situation, we would draw upon another measure of location: the median.

To calculate the median we have to rank the values from lowest to highest and select the one that occupies the middle of the list (Table 3).

If we had 15 measurements, the median would be the value in the eighth position (leaving seven above and seven below). As we have an even number of values, the median is calculated as the arithmetic mean of the two central values. In our case, (169 + 172) / 2 = 170.5 cm, in all probability much closer to the population’s value and very close to the mean of the car in our first example.

Thus we see that the mean summarizes the data correctly when they are distributed symmetrically, but if the distribution is skewed the median will give us a better idea of the center of the distribution.

If the distribution is highly skewed we can employ two other measures that are relatives of the arithmetic mean: the geometric mean and the harmonic mean.

To get the geometric mean we calculate the natural logarithm of all the values, calculate the mean of those logarithms and undo the transformation with the exponential function (raising the number e to that mean). To get the harmonic mean we calculate the reciprocal of each value (1/value), calculate the arithmetic mean of the reciprocals and undo the transformation by taking the reciprocal of that mean (do not panic about the math involved; statistical programs calculate this kind of stuff almost without our having to ask them for it). These two means are very useful when the distribution is very skewed and most of the values are around a certain number with a long tail to the right or left. For example, if we set up a roadside breath control at six o’clock on Monday morning, most of the drivers will be very close to zero, but there will always be some higher readings (those who went to bed late and those who prefer a strong breakfast). In these cases, either of the two means would be more representative than the arithmetic mean or the median.
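For the curious, the two recipes above can be sketched in Python like this. The function names and the sample readings are made up for illustration; the standard library already offers `statistics.geometric_mean` and `statistics.harmonic_mean`, which I use here only as a cross-check.

```python
import math
import statistics


def geometric_mean(values):
    """Mean of the natural logs, transformed back with the exponential."""
    logs = [math.log(x) for x in values]  # every value must be greater than 0
    return math.exp(sum(logs) / len(logs))


def harmonic_mean(values):
    """Mean of the reciprocals, transformed back by taking the reciprocal."""
    reciprocals = [1 / x for x in values]  # no value may be 0
    return 1 / (sum(reciprocals) / len(reciprocals))


readings = [1, 2, 4, 8]  # made-up positive values with a tail to the right

print(statistics.mean(readings))
print(geometric_mean(readings), statistics.geometric_mean(readings))
print(harmonic_mean(readings), statistics.harmonic_mean(readings))
```

Note that both recipes need strictly positive values, so readings of exactly zero would have to be handled separately before applying them.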

One final note on another measure of location. Imagine that we look at our subway travelers’ pants and see that 12 of them wear jeans. What measure would we use to find out their preferred garment? In this case, the mode. It’s the most frequently occurring value in a distribution and can be very useful when you are describing qualitative rather than quantitative variables.
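Finding the mode of a qualitative variable is just a matter of counting, as this small Python sketch shows. Only the 12 pairs of jeans come from the example; the other garments and their counts are invented for illustration.

```python
from collections import Counter

# Hypothetical garments seen in the car: 12 jeans plus made-up others.
garments = ["jeans"] * 12 + ["chinos"] * 5 + ["skirt"] * 3

counts = Counter(garments)
# most_common(1) returns a list with the single most frequent (value, count) pair.
mode, frequency = counts.most_common(1)[0]

print(mode, frequency)
```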

Anyway, let’s not forget that to summarize a distribution adequately it is not enough to choose a measure of location. It must be completed by a measure of dispersion, of which there are a few at our disposal. But that’s another story…