The fragility of the EmPress

Fragility index

One of the things that amazes me most about statistics is its air of soundness, especially considering that it lives permanently in the realm of chance and uncertainty. Of course, the problem isn’t with statistics but with us, and with our faith in the soundness of its conclusions.

The most characteristic example is hypothesis testing. Suppose we want to study the effect of a drug on the prevention of migraine, a disease so prevalent after marriage. The first thing we do is state our null hypothesis, which usually says the opposite of what we want to prove.

Understanding hypothesis testing

In our case, the null hypothesis is that the drug is as effective as placebo for preventing migraine. We randomize the study participants to the control and treatment groups and obtain our results. Finally, we run the appropriate statistical test and compute the probability of finding a difference in the number of migraines between the groups at least as large as the one observed, assuming the null hypothesis is true. This is the p-value: the probability of obtaining the observed outcome, or a more extreme one, by chance alone when the null hypothesis holds.

If we get a p-value of 0.35, it means that, if the drug were no better than placebo, chance alone would produce a difference at least this large 35% of the time. That is far too often for us to reject the null hypothesis, so we say that the difference is not statistically significant, which is not the same as proving that the difference is not real. However, if the p-value is very low, we’ll feel safe saying that there’s a real difference. What is very low? By convention, we usually choose a p-value threshold of 0.05.

And so, if p < 0.05 we reject the null hypothesis and say that the difference is unlikely to be due to chance alone: it’s statistically significant. And here is where my thought about the air of soundness of a subject full of uncertainty applies: there is always a chance of error. When the null hypothesis is true, we will wrongly reject it with a probability equal to the chosen threshold. Besides, the threshold is arbitrary, so p = 0.049 is statistically significant while p = 0.051 is not, even though the two values are virtually the same.
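The link between the threshold and the chance of a false alarm can be checked with a small simulation. What follows is only a sketch under assumed numbers: a migraine rate of 30% in both arms (so the null hypothesis is true by construction), 100 participants per arm, and a two-proportion z-test standing in for whatever test a real trial would use. None of these figures comes from an actual study.

```python
import math
import random


def two_proportion_p_value(events_a, n_a, events_b, n_b):
    """Two-sided p-value of a two-proportion z-test (normal approximation)."""
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variability at all: no evidence of a difference
    z = (events_a / n_a - events_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(n_trials=2000, n_per_arm=100, rate=0.30,
                        alpha=0.05, seed=1):
    """Fraction of simulated trials with no real effect that reach p < alpha."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_trials):
        # Both arms share the same true migraine rate (assumed value).
        treated = sum(rng.random() < rate for _ in range(n_per_arm))
        control = sum(rng.random() < rate for _ in range(n_per_arm))
        p = two_proportion_p_value(treated, n_per_arm, control, n_per_arm)
        if p < alpha:
            significant += 1
    return significant / n_trials


rate_of_false_alarms = false_positive_rate()
print(f"Share of null trials with p < 0.05: {rate_of_false_alarms:.3f}")
```

The printed share should land near the 0.05 threshold: even when the drug does nothing, roughly one trial in twenty comes out "significant".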

But there’s still more, because not all p-values are equally reliable. Suppose we run trial A with our drug, with 100 people in the treatment group and 100 in the control group, and we get 35% fewer headaches in the treatment group, with a p-value of 0.02.

Now suppose another trial, B, with the same drug and 2000 people in each arm, which finds a reduction of 20% with a p-value of 0.02. Do both results seem equally reliable to you?

At first glance, the p-value is significant and identical in both trials. However, the confidence we place in each study should not be the same. Think about what would have happened if five more people in the treatment group of trial A had had a headache. The p-value could have gone up to 0.08 and would no longer be statistically significant.

However, the same change in trial B is much less likely to alter the results. Trial B is less susceptible to changes in the statistical significance of its results.

Fragility index

Well, based on this reasoning, a fragility index has been described: the minimum number of participants whose status has to change to move the p-value from significant to non-significant.

Logically, when considered alongside other study characteristics such as sample size or the number of observed events, the fragility index can give us a better idea of the robustness of our conclusions and, therefore, of how much confidence we can place in our results.
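To make the idea concrete, here is a minimal sketch of how such an index could be computed. It rests on several assumptions that are not in the text above: the counts (22 of 100 participants with migraine on the drug versus 40 of 100 on placebo) are invented, Fisher’s exact test is used as the significance test, and events are added one at a time to the arm with fewer events, which is one common convention.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x events in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Add up every table that is as likely as ours, or less likely.
    total = sum(p for p in map(prob, range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def fragility_index(events_1, n_1, events_2, n_2, alpha=0.05):
    """Status changes needed to lose significance (0 if never significant)."""
    arms = [[events_1, n_1], [events_2, n_2]]
    low = 0 if events_1 <= events_2 else 1  # arm with fewer events

    def p_value():
        (e1, n1), (e2, n2) = arms
        return fisher_exact_two_sided(e1, n1 - e1, e2, n2 - e2)

    flips = 0
    while p_value() < alpha and arms[low][0] < arms[low][1]:
        arms[low][0] += 1  # one participant switches from non-event to event
        flips += 1
    return flips


# Hypothetical trial: 22/100 with migraine on the drug, 40/100 on placebo.
fi_small_trial = fragility_index(22, 100, 40, 100)
print("Fragility index of the hypothetical trial:", fi_small_trial)
```

Running the same function on a larger trial with a similar p-value is a quick way to compare how many changed outcomes each result can withstand.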

We’re leaving…

And that’s all for today. One more post talking about p and statistical significance, when the really interesting matter is assessing the clinical relevance of the results. But that’s another story…

The tails of p

Forgive me, my friends from the other side of the Atlantic, but I am not thinking about the kind of tails that many perverse minds are. Far from it: today we’re going to talk about much more boring tails that are nonetheless very important if we want to do hypothesis testing. And, as usual, we will illustrate the point with an example to understand it better.

Let’s suppose we take a coin and, armed with infinite patience, toss it 1000 times, getting heads 560 times. We all know that the probability of getting heads with a fair coin is 0.5, so if we toss the coin 1000 times we expect to get 500 heads on average. But we’ve got 560, so two possibilities immediately come to mind.

First, the coin is fair and we’ve got 60 more heads just by chance. This will be our null hypothesis (H0), which says that the probability of getting heads [P(heads)] is equal to 0.5. Second, our coin is not fair, but loaded to produce more heads. This will be our alternative hypothesis (Ha), which states that P(heads) > 0.5.

Well, let’s do a hypothesis test using one of the binomial probability calculators available on the Internet. Assuming the null hypothesis that the coin is fair, the probability of obtaining 560 heads or more is 0.008%. As it is lower than 5%, we reject our null hypothesis: the coin is loaded.
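For those who prefer not to rely on an online calculator, the same right-tail probability can be obtained by counting outcomes directly. This is a minimal sketch; the function name is made up for the example, and it only covers the fair-coin case, where every sequence of tosses is equally likely.

```python
from math import comb


def prob_at_least(k, n):
    """P(X >= k) for the number of heads X in n tosses of a fair coin."""
    # Count the sequences with k, k+1, ..., n heads among all 2**n sequences.
    favourable = sum(comb(n, i) for i in range(k, n + 1))
    return favourable / 2 ** n  # exact integer counts, one division at the end


p_one_tailed = prob_at_least(560, 1000)
print(f"P(560 or more heads in 1000 tosses) = {p_one_tailed:.4%}")
```

Because Python integers have arbitrary precision, the counts are exact and only the final division is rounded.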

Now, if you look closely, the alternative hypothesis has a direction, P(heads) > 0.5, but we could have hypothesized that the coin is not fair without presupposing whether it is loaded in favor of heads or tails: P(heads) not equal to 0.5. In this case we would calculate the probability of getting a number of heads at least 60 above or below 500, in either direction. This probability is 0.016%, so we’d reject our null hypothesis and conclude that the coin is not fair. The problem is that the test doesn’t tell us in which direction it’s loaded but, in view of the results, we assume it favors heads. In the first example we did a one-tailed test, while in the second we did a two-tailed test.

In the figure, you can see the probability areas in both tests. In the one-tailed test, the small red area on the right represents the probability of getting a result at least this far above the expected value by chance alone if the coin is fair. In the two-tailed test, this area is doubled and located on both sides of the probability distribution. Notice that the two-tailed p-value is twice the one-tailed value. In our example, both p-values are so low that we can reject the null hypothesis in either case. But this is not always so, and there may be occasions when a researcher chooses a one-tailed test to reach a statistical significance that is not attainable with the two-tailed test.

And I say one of the two tails because we have calculated the right-tail probability, but we could have calculated the probability of the left tail. Consider the unlikely event that, even though the coin is loaded in favor of tails, we got more heads just by chance. Our Ha now says that P(heads) < 0.5. In this case we’d calculate the probability that, under the null hypothesis that the coin is fair, we get 560 heads or fewer. This p-value is greater than 99.9%, so we cannot reject our null hypothesis that the coin is fair.
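The three versions of the test differ only in which outcomes count as "at least as extreme". The sketch below puts them side by side for the same 560 heads in 1000 tosses. The function and the labels "greater", "less" and "two-sided" are names chosen for this example, and the two-sided branch relies on the symmetry of a fair coin.

```python
from math import comb


def fair_coin_p_value(heads, n, alternative):
    """Exact binomial p-value for H0: P(heads) = 0.5.

    alternative: "greater" (Ha: P > 0.5), "less" (Ha: P < 0.5)
    or "two-sided" (Ha: P != 0.5).
    """
    def tail(lo, hi):
        # Probability of getting between lo and hi heads, both included.
        return sum(comb(n, i) for i in range(lo, hi + 1)) / 2 ** n

    if alternative == "greater":
        return tail(heads, n)
    if alternative == "less":
        return tail(0, heads)
    if alternative == "two-sided":
        far = max(heads, n - heads)  # same distance from n/2, upper side
        if 2 * far == n:
            return 1.0  # the result sits exactly on the expected value
        # A fair coin is symmetric, so both tails have the same area.
        return min(1.0, 2 * tail(far, n))
    raise ValueError(f"unknown alternative: {alternative!r}")


p_right = fair_coin_p_value(560, 1000, "greater")
p_both = fair_coin_p_value(560, 1000, "two-sided")
p_left = fair_coin_p_value(560, 1000, "less")
print(f"right tail: {p_right:.4%}")
print(f"two tails:  {p_both:.4%}")
print(f"left tail:  {p_left:.4%}")
```

Same coin, same data, three different p-values: the only thing that changes is the alternative hypothesis we chose before looking at the tail.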

But what is going on here, you’ll ask. The first hypothesis test we did allowed us to reject the null hypothesis and the last one says otherwise. Since it’s the same coin and the same data, shouldn’t we have reached the same conclusion? As it turns out, no. Remember that failing to reject the null hypothesis is not the same as concluding that it is true, something we can never be sure of. In the last example, the null hypothesis that the coin is fair is a better option than the alternative that it is loaded in favor of tails. However, that does not mean we can conclude that the coin is fair.

You see, therefore, how important it is to be clear about the meaning of the null and alternative hypotheses when doing hypothesis testing. And always remember that failing to reject the null hypothesis does not necessarily imply that it is true. It could just be that we don’t have enough power to reject it. This leads me to think about type I and type II errors and their relationship with power and sample size. But that’s another story…

It all spins around the null hypothesis

The null hypothesis, familiarly known as H0, has a misleading name. Despite what one might think, that unfortunate name doesn’t prevent it from being the core of all hypothesis testing.

And what is hypothesis testing? Let us see an example.

Let us suppose we want to know whether residents are (as they believe) smarter than attending physicians. We pick a random sample of 30 attendings and 30 residents from our hospital and measure their IQ. We come up with an average of 110 for attendings and 98 for residents (sorry, I’m an attending and, as it happens, I’m the one writing this example). In view of these results we ask ourselves: what is the probability that the attendings we selected are, on average, smarter than the residents we selected? The answer is simple: 100% (provided, of course, that everyone took an intelligence test and not a satisfaction survey). But the problem is that we want to know whether attending physicians in general are smarter than residents in general. We have only measured the IQ of 60 people and, of course, we want to know what happens in the general population.

At this point we consider two hypotheses:

1. The two groups are equally intelligent (this example is pure fiction) and the differences that we have found are due to chance. This, ladies and gentlemen, is the null hypothesis or H0. We state it in this way:

H0: mean IQ of attending physicians = mean IQ of residents

2. Actually, the two groups are not equally intelligent. This will be the alternative hypothesis:

Ha: mean IQ of attending physicians ≠ mean IQ of residents

We could have stated this hypothesis differently, considering that the IQ of one group is greater or smaller than the other’s, but let’s leave it this way for now.

At first, we always assume that H0 is true (and they call it null), so when we run our statistical software and compare the two means we come up with a test statistic (which one depends on the test we use) and with the probability of observing a difference at least as large as ours by chance alone if H0 were true (the famous p). If we get a p lower than 0.05 (the value usually chosen by convention), we can say that a difference like ours would show up less than 5% of the time if H0 were true, so we reject the null hypothesis. Let’s suppose that we do the test and come up with p = 0.02. We’ll conclude that it is not true that both groups are equally clever and that the observed difference is unlikely to be due to chance alone (in this case the result was evident from the beginning, but in other scenarios it wouldn’t be so clear).
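The idea of assuming that H0 is true and asking how often chance alone would produce our difference can be acted out literally with a permutation test. This is a sketch, not the test any particular statistical package would run: the individual IQ scores are invented (only the means of 110 and 98 and the group size of 30 come from the example), and the spread of the scores is an assumption, so the resulting p will not match the 0.02 used in the text.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation p-value for a difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # If H0 is true the group labels mean nothing, so we reshuffle them.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:
            at_least_as_extreme += 1
    # The +1 counts the observed labelling itself, so p is never exactly 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical IQ scores, built so that the group means are 110 and 98.
spread = [(i - 14.5) * 1.7 for i in range(30)]
attending_iq = [110 + s for s in spread]
resident_iq = [98 + s for s in spread]

p_iq = permutation_p_value(attending_iq, resident_iq)
print(f"Permutation p-value for the IQ difference: {p_iq:.4f}")
```

The p-value here is simply the share of random relabellings that separate the two groups at least as much as the real labels do, which is the definition of p put into practice.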

And what happens if p is greater than 0.05? Does it mean that the null hypothesis is true? Well, maybe yes, maybe no. All we can say is that the study is not powerful enough to reject the null hypothesis. If we accept it as true without further consideration, we run the risk of blundering by committing a type II error. But that’s another story…