The fragility of the EmPress

One of the things that amazes me the most about statistics is its air of soundness, especially considering that it constantly moves in the realm of chance and uncertainty. Of course, the problem isn’t with statistics but with us and our faith in the soundness of its conclusions.

The most characteristic example is hypothesis testing. Suppose we want to study the effect of a drug on the prevention of migraine, a disease so prevalent after marriage. The first thing we do is set our null hypothesis, which usually states the opposite of what we want to prove.

In our case, the null hypothesis is that the drug is as effective as placebo for preventing migraine. We randomize the participants to the control and treatment groups and obtain our results. Finally, we run the appropriate statistical test and compute the probability of finding a difference in the number of migraines between the groups as large as the one observed, or larger, if the null hypothesis were true. This is the p-value: it tells us only how likely the observed outcome, or a more extreme one, would be by chance alone.
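To make that computation concrete, here is a minimal sketch. It assumes the outcome is recorded as yes/no (the participant had a migraine or not) and that a two-proportion z-test is an acceptable approximation; the post doesn't say which test would be used, and both the function name and the counts are made up for illustration.

```python
from math import erfc, sqrt


def two_proportion_p_value(events_a, n_a, events_b, n_b):
    """Two-sided p-value for H0: both groups share the same event rate.

    Uses the normal approximation (two-proportion z-test), which is an
    assumption of this sketch, not a method named in the text.
    """
    rate_a = events_a / n_a
    rate_b = events_b / n_b
    # Under H0 both groups have one common rate, estimated from pooled data
    pooled = (events_a + events_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variability at all: nothing to distinguish
    z = (rate_a - rate_b) / se
    # P(|Z| >= |z|) for a standard normal Z
    return erfc(abs(z) / sqrt(2))


# Hypothetical counts: 30 of 100 treated and 45 of 100 controls had migraine
p = two_proportion_p_value(30, 100, 45, 100)
print(f"p-value: {p:.3f}")
```

The value printed is the chance of seeing a gap this wide, or wider, if drug and placebo really had the same migraine rate.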

If we get a p-value of 0.35, it means that a difference at least as large as the one we observed would show up 35% of the time by chance alone, even if the drug did nothing. That’s too likely to rule out chance, so we cannot reject the null hypothesis: the difference is not statistically significant (which is not the same as proving that there is no difference). However, if the p-value is very low, we’ll feel safe to say that there’s a real difference. What is very low? By convention, we usually choose a p-value threshold of 0.05.

And so, if p < 0.05 we reject the null hypothesis and say that the difference is statistically significant and unlikely to be due to chance alone. And this is where my thought about the air of soundness of a subject full of uncertainty applies: there is always a chance of error, since with this threshold we will wrongly reject a true null hypothesis 5% of the time. Besides, the chosen threshold is arbitrary, so that p = 0.049 is statistically significant while p = 0.051 is not, even though the two values are virtually the same.

But there’s more, because not all p-values are equally reliable. Suppose we run trial A with our drug, with 100 people in the treatment group and 100 in the control group, and we get 35% fewer headaches in the treatment group, with a p-value of 0.02.

Now suppose another trial, B, with the same drug, in which 2000 people participate in each arm, resulting in a reduction of 20% with a p-value of 0.02. Do both results seem equally reliable to you?

At first glance, the p-value is significant and identical in both trials. However, the confidence we place in each study should not be the same. Think about what would have happened if five more people in the treatment group of trial A had had a headache. The p-value could have gone up to 0.08 and would no longer be statistically significant.

However, the same change in trial B is unlikely to alter the results. Trial B is less susceptible to changes in the statistical significance of its results.

Well, based on this reasoning, an index called the fragility index has been described: the minimum number of participants whose status would have to change to move the p-value from significant to non-significant.
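A rough sketch of how such an index could be computed follows. It assumes a yes/no outcome, a two-sided Fisher exact test, and the common recipe of switching participants one at a time from "no event" to "event" in the arm with fewer events. Those choices, the function names and the counts are assumptions for illustration; the text above only gives the general definition.

```python
from math import comb


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x events in the first row
        return comb(row1, x) * comb(row2, col1 - x) / total

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Add up every table that is no more likely than the observed one
    return sum(p for p in map(prob, range(lo, hi + 1))
               if p <= observed * (1 + 1e-9))


def fragility_index(events_t, n_t, events_c, n_c, alpha=0.05):
    """Count the non-event -> event switches needed to lose significance.

    Returns 0 when the result is not significant to begin with.
    """
    # Switch participants in the arm with fewer events
    if events_t > events_c:
        events_t, n_t, events_c, n_c = events_c, n_c, events_t, n_t
    switches = 0
    while events_t < n_t and fisher_two_sided(
            events_t, n_t - events_t, events_c, n_c - events_c) < alpha:
        events_t += 1
        switches += 1
    return switches


# Hypothetical trials with the same event rates but different sizes
print("small trial:", fragility_index(39, 100, 60, 100))
print("larger trial:", fragility_index(156, 400, 240, 400))
```

With identical event rates, the larger trial needs many more switched participants before it loses significance, which is the intuition behind comparing trials A and B.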

Logically, taken together with other study characteristics such as the sample size or the number of observed events, this fragility index can give us a better idea of the robustness of our conclusions and, therefore, of how much confidence we can place in our results.

And that’s all for today. One more post about p and statistical significance, when the really interesting matter is to assess the clinical relevance of the results. But that’s another story…