The same old story

Every day we face many situations in which we always act in the same way. For us, it's always the same old story. And this is good, because these situations allow us to act routinely, without having to think about it.

The problem with these same-old-story situations is that we have to understand very well how to handle them. Otherwise, we may end up doing, and getting, anything but what we want.

Hypothesis testing is an example of one of these situations. It's always the same: the same old story. And yet, at first it seems more complicated than it really is, because, regardless of the test we're doing, the steps are always the same: establish the null hypothesis, choose the appropriate statistic for the situation, use the corresponding probability distribution to calculate the probability of obtaining that value of the statistic and, according to that probability, decide in favor of the null hypothesis or the alternative. We will go through these steps one by one using an example in order to understand all this better.

Suppose we have measured the stature of 25 children from a classroom at the school of our district, obtaining the values shown in the table. If you do the calculation, the mean height of our children is 135.4 cm, with a standard deviation of 2.85 cm. As it happens, there's a previous study at town level in which the mean height of children the same age as ours turns out to be 138 cm. The question then is: are our children shorter than the town's mean, or is the difference due to random sampling? We already have our hypothesis test.

First, we set our null and alternative hypotheses. As we know, when doing a hypothesis test we can reject the null hypothesis if the chosen statistic turns out to be sufficiently improbable under it. What we can never do is accept it; we can only reject it. This is why we usually set the null hypothesis as the opposite of what we want to show, so that we can reject what we don't want to show and thereby embrace what we do.

In our example we set the null hypothesis that our students' stature is equal to the town's average and that the difference found is due to sampling error, to pure chance. In turn, the alternative hypothesis says that there is an actual difference and that our children are shorter.

Once the null and alternative hypotheses are established, we have to choose the appropriate statistic for this hypothesis test. This case is one of the simplest: the comparison of two means, ours and the population's. Here, our mean, standardized with respect to the population's mean, follows a Student's t distribution, according to the following expression:

t = (group mean – population mean) / standard error of the mean

So, we plug in our mean (135.4 cm), the population's mean (138 cm) and the standard error (the standard deviation divided by the square root of the sample size), and we obtain a value of t = -4.55.

Now we have to calculate the probability of obtaining a value of t as extreme as -4.55. If we think about it, we'll see that if the two means were equal, t would be zero. The more different they are, the farther from zero the t-value will be. We need to know whether that deviation from zero to -4.55 could be due to chance. To do this, we calculate the probability of such a value of t, using a table of the Student's distribution or a computer program, getting a value of p = 0.0001.
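Just as an illustration, here is a minimal sketch of how these calculations could be reproduced in Python with scipy, using only the summary figures of our example (small rounding differences with the values quoted above are to be expected):

# Minimal sketch of the one-sample t-test, using only the summary data above
from math import sqrt
from scipy import stats

n = 25                 # sample size
sample_mean = 135.4    # mean height of our schoolchildren (cm)
sample_sd = 2.85       # standard deviation (cm)
pop_mean = 138         # town's mean height (cm)

se = sample_sd / sqrt(n)             # standard error of the mean
t = (sample_mean - pop_mean) / se    # about -4.56
p = 2 * stats.t.cdf(t, df=n - 1)     # two-sided p-value, about 0.0001
print(t, p)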

We already have the p-value, so we only have to do the last step, to see if we can reject the null hypothesis. The p-value indicates the probability that the observed difference between the two means is due to chance. As it’s lower than 0.05 (lower than 5%), we feel confident enough to say that it’s not due to chance (or at least that it’s very unlikely), so we reject the null hypothesis that the difference is due to chance and embrace the alternative hypothesis that the two means are really different. Conclusion: ours are the tiny schoolchildren in town.

And that's all there is to the hypothesis test of equality of two means. In this case, we have done a one-sample t-test, but the punch line is the dynamics of hypothesis testing. It's always the same: the same old story. What changes from one occasion to the next, logically, is the statistic and the probability distribution we use.

To conclude, I just want to draw your attention to another method we could have used to decide whether the means were different: simply using our beloved confidence intervals. We could have calculated the confidence interval of our mean and checked whether it included the population's mean, in which case we would have concluded that they were similar. If the population's mean had been outside the interval, we would have rejected the null hypothesis, logically reaching the same conclusion. But that's another story…
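Just as a quick sketch of that other story, and assuming the same summary figures as before, the 95% confidence interval of our mean could be obtained like this; since 138 cm falls outside it, we reach the same conclusion as with the test:

# Sketch: 95% confidence interval of the mean from the summary data above
from math import sqrt
from scipy import stats

n, sample_mean, sample_sd = 25, 135.4, 2.85
se = sample_sd / sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)   # critical t value for a 95% interval
lower = sample_mean - t_crit * se
upper = sample_mean + t_crit * se
print(lower, upper)   # roughly 134.2 to 136.6 cm: 138 lies outside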

The fragility of the EmPress

One of the things that amazes me most about statistics is its appearance of soundness, especially considering that it continuously moves in the realm of chance and uncertainty. Of course, the problem isn't statistics' but ours, with our belief in the soundness of its conclusions.

The most characteristic example is hypothesis testing. Suppose we want to study the effect of a drug on migraine prevention, a disease so prevalent after marriage. The first thing we do is set our null hypothesis, which usually says the opposite of what we want to prove.

In our case, the null hypothesis is that the drug is as effective as placebo for preventing migraine. We randomize the participants in the study to the control and treatment groups and obtain our results. Finally, we do the hypothesis test with the appropriate statistic and compute the probability that the observed differences in the number of migraines between the groups are due to chance. This is the p-value, which indicates exclusively the probability that the observed outcome, or one more extreme, is due to chance.
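As a purely illustrative sketch (the actual trial data are not given here, so the counts below are invented), the comparison could be run in Python with a chi-squared test on the number of patients with and without migraine in each group:

# Purely illustrative sketch: comparing migraine frequency between two groups
# The counts are invented; only the procedure matters here
from scipy.stats import chi2_contingency

#        with migraine, without migraine
table = [[30, 70],    # treatment group (hypothetical)
         [38, 62]]    # placebo group (hypothetical)

chi2, p, dof, expected = chi2_contingency(table)
print(p)   # the p-value of the comparison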

If we get a p-value of 0.35, it will mean that the probability that the difference is not real (that it is due to chance) is 35%, so we cannot reject the null hypothesis and we conclude that the difference is not statistically significant. However, if the p-value is very low, we'll feel safe to say that there's a real difference. How low is very low? By convention, we usually choose a p-value threshold of 0.05.

And so, if p < 0.05 we reject the null hypothesis and say that the difference is not due to chance because it's statistically significant. And here is where my thought about the appearance of soundness of a subject full of uncertainty applies: there is always a chance of error, which equals the p-value. And besides, the chosen threshold is arbitrary, so that p = 0.049 is statistically significant while p = 0.051 is not, even though their values are virtually the same.

But there's still more, because not all p-values are equally reliable. Suppose we perform a trial A with our drug in which 100 people participate in the treatment group and 100 in the control group, and we get 35% fewer headaches in the treatment group, with a p-value of 0.02.

Now suppose another trial B with the same drug in which 2000 people participate in each arm, resulting in a reduction of 20% with a p-value of 0.02. Do both results seem equally reliable to you?

At first glance, the p-value is significant and equal in both trials. However, the level of confidence we should place in each study is not the same. Think what would have happened if there had been five more people with headache in the treatment group of trial A: the p-value could have gone up to 0.08, no longer statistically significant.

However, the same change in trial B is unlikely to alter the results: trial B is less susceptible to changes in the statistical significance of its results.

Well, based on this reasoning, a fragility index has been described: the minimum number of participants whose status would have to change to turn the p-value from significant to non-significant.

Logically, considered alongside other study characteristics such as sample size or the number of observed events, this fragility index can give us a better idea of the robustness of our conclusions and, therefore, of how much confidence we can place in our results.
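Since the raw counts of trials A and B are not given, here is only a hedged sketch of the idea behind the fragility index: starting from a significant 2x2 table (with invented counts), we switch participants from "no event" to "event" in the group with fewer events until the Fisher's exact p-value is no longer below 0.05.

# Sketch of a fragility index calculation on an invented 2x2 table
from scipy.stats import fisher_exact

events_treat, n_treat = 25, 100     # treatment group (hypothetical counts)
events_ctrl, n_ctrl = 42, 100       # control group (hypothetical counts)

fragility = 0
_, p = fisher_exact([[events_treat, n_treat - events_treat],
                     [events_ctrl, n_ctrl - events_ctrl]])
while p < 0.05:
    events_treat += 1               # switch one participant's status
    fragility += 1
    _, p = fisher_exact([[events_treat, n_treat - events_treat],
                         [events_ctrl, n_ctrl - events_ctrl]])

print(fragility, p)   # switches needed for the result to lose significance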

And that's all for today. One more post talking about p and statistical significance, when the really interesting matter is assessing the clinical relevance of the results. But that's another story…

The tails of p

Forgive me, my friends from the other side of the Atlantic, but I am not thinking about the kind of tails that many perverse minds are. Far from it: today we're going to talk about much more boring tails, but ones that are very important if we want to do a hypothesis test. And, as usual, we will illustrate the point with an example to understand it better.

Let's suppose we take a coin and, armed with infinite patience, toss it 1000 times, getting heads 560 times. We all know that the probability of getting heads is 0.5, so in 1000 tosses we expect an average of 500 heads. But we've got 560, so two possibilities immediately come to mind.

First, the coin is fair and we've got 60 extra heads just by chance. This will be our null hypothesis, which says that the probability of getting heads [P(heads)] is equal to 0.5. Second, our coin is not fair, but loaded to produce more heads. This will be our alternative hypothesis (Ha), which states that P(heads) > 0.5.

Well, let's do a hypothesis test using one of the binomial probability calculators available on the Internet. Assuming the null hypothesis that the coin is fair, the probability of obtaining 560 heads or more is 0.008%. Since it is lower than 5%, we reject our null hypothesis: the coin is loaded.

Now, if you look closely, the alternative hypothesis has a direction, towards P(heads) > 0.5, but we could have hypothesized that the coin was not fair without presupposing that it was loaded in favor of heads or tails: P(heads) not equal to 0.5. In this case we would calculate the probability of getting a number of heads at least 60 above or below 500, in both directions. This probability is 0.016%, so we'd reject our null hypothesis and conclude that the coin is not fair. The problem is that the test doesn't tell us in which direction it's loaded but, in view of the results, we assume it favors heads. In the first example we did a one-tailed test, while in the second we have done a two-tailed test.
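For those who want to check it by themselves, something like this sketch with scipy's binomial test (available in recent versions of scipy) reproduces both calculations, give or take some rounding:

# Sketch: one-tailed and two-tailed binomial tests for 560 heads in 1000 tosses
from scipy.stats import binomtest

one_tailed = binomtest(560, n=1000, p=0.5, alternative="greater").pvalue
two_tailed = binomtest(560, n=1000, p=0.5, alternative="two-sided").pvalue
print(one_tailed)   # about 0.00008, that is, 0.008%
print(two_tailed)   # about 0.00017, that is, roughly 0.016%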

In the figure, you can see the probability areas in both tests. In the one-tailed test, the small red area on the right represents the probability that the difference from the expected value is due to chance. In the two-tailed test, this area is doubled and placed on both sides of the probability distribution. Notice that the two-tailed p-value is double the one-tailed value. In our example, both p-values are so low that we can reject the null hypothesis either way. But this is not always so, and there may be occasions when the researcher chooses to do a one-tailed test to obtain a statistical significance that the two-tailed test would not give.

And I say one of the two tails because we have calculated the right-tail probability, but we could have calculated the probability of the left tail. Consider the unlikely possibility that, even though the coin is loaded in favor of tails, we have got more heads just by chance. Our Ha now says that P(heads) < 0.5. In this case we'd calculate the probability that, under the null hypothesis that the coin is fair, we get 560 heads or fewer. This p-value is 99.9%, so we cannot reject our null hypothesis that the coin is fair.
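The same sketch as before, changing only the direction of the alternative hypothesis, gives that left-tail probability:

# Sketch: left-tailed binomial test (Ha: P(heads) < 0.5) on the same data
from scipy.stats import binomtest

left_tailed = binomtest(560, n=1000, p=0.5, alternative="less").pvalue
print(left_tailed)   # about 0.9999: we cannot reject the null hypothesis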

But what is going on here, you'll ask? The first hypothesis test we did allowed us to reject the null hypothesis, and the last test says otherwise. Since it's the same coin and the same data, shouldn't we have reached the same conclusion? As it turns out, it seems not. Remember that not being able to reject the null hypothesis is not the same as concluding that it is true, something we can never be sure of. In the last example, the null hypothesis that the coin is fair is a better option than the alternative that it is loaded in favor of tails. However, that does not mean we can conclude that the coin is fair.

You see, therefore, how important it is to be clear about the meaning of the null and alternative hypotheses when doing a hypothesis test. And always remember that not being able to reject the null hypothesis does not necessarily imply that it is true. It could simply be that we don't have enough power to reject it. This leads me to think about type I and type II errors and their relation with power and sample size. But that's another story…

It all spins around the null hypothesis

The null hypothesis, familiarly known as H0, has a misleading name. Despite what one might think, that unfortunate name doesn't prevent it from being the core of all hypothesis testing.

And what is hypothesis testing? Let us see an example.

Let us suppose we want to know whether residents (as they believe) are smarter than attending physicians. We pick out a random sample of 30 attendings and 30 residents from our hospital and measure their IQ. We come up with an average value of 110 for attendings and 98 for residents (sorry, I'm an attending and, as it happens, I'm the one writing this example). In view of these results we ask ourselves: what is the probability that the group of attendings selected is smarter than the residents of our example? The answer is simple: 100% (provided, of course, that everyone has taken an intelligence test and not a satisfaction survey). But the problem is that we are interested in knowing whether attending physicians in general are smarter than residents in general. We have only measured the IQ of 60 people and, of course, we want to know what happens in the overall population.

At this point we consider two hypotheses:

1. The two groups are equally intelligent (this example is pure fiction) and the differences we have found are due to chance. This, ladies and gentlemen, is the null hypothesis, or H0. We state it this way:

H0: IQA = IQR

2. Actually, the two groups are not equally intelligent. This will be the alternative hypothesis:

H1: IQA ≠ IQR

We could have stated this hypothesis in a different way, considering one group's IQ to be greater or smaller than the other's, but let's leave it this way for now.

To begin with, we always assume that H0 is true (hence the name null), so when we run our statistical software and compare the two means we come up with a statistic (which one depends on the test we use) and the probability that the observed differences are due to chance (the famous p). If we get a p lower than 0.05 (the value usually chosen by convention), we can say that the probability of finding a difference like the one observed if H0 were true is lower than 5%, so we reject the null hypothesis. Let's suppose we do the test and come up with p = 0.02. We'll conclude that it is not true that both groups are equally clever and that the observed difference is not due to chance (in this case the result was evident from the beginning, but in other scenarios it wouldn't be so clear).
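Just to sketch how that comparison might look in Python: since the post only gives the two means, the individual scores below are simulated (with an assumed standard deviation of 15) merely to illustrate the procedure.

# Sketch: two-sample t-test comparing the mean IQ of attendings and residents
# The individual scores are simulated; only the procedure is of interest
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
iq_attendings = rng.normal(loc=110, scale=15, size=30)   # hypothetical scores
iq_residents = rng.normal(loc=98, scale=15, size=30)     # hypothetical scores

t, p = stats.ttest_ind(iq_attendings, iq_residents)
print(t, p)   # if p < 0.05 we reject H0: IQA = IQR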

And what happens if p is greater than 0.05? Does it mean that the null hypothesis is true? Well, maybe yes, maybe no. All we can say is that the study is not powerful enough to reject the null hypothesis. But if we accept it as true without further consideration, we run the risk of blundering into a type II error. But that's another story…