Once upon a time, the scientific paradigms and ways of thinking of researchers (and also of clinicians) began to change, from “we’re doing well” to really wanting to know the validity of the information they collected in their experiments or in their daily practice.

It is in this context that, 33 years ago, what would become one of the lords of the impact measures of clinical studies came to light: the number needed to treat, known worldwide by its acronym, NNT.

Its initial usefulness was to assess the beneficial effect of a treatment to reduce the risk of an unpleasant event occurring in an intervention group of interest, always with respect to what was observed in a control group. Put more simply, it emerged as a measure of impact in the context of randomized and controlled clinical trials.

The NNT was initially well received, since it has the great merit of combining, in a single parameter, the concepts of statistical significance and clinical relevance (provided that its confidence interval is reported along with it). In addition, it is easy for clinicians to interpret without the need for in-depth statistical knowledge.

Following a parallel with the ages of life, everything went well during its childhood, and it continued to grow during its youth, extending to many areas beyond the conventional parallel clinical trial.

The problem with youth, besides coming at the beginning of life, is that it does not last long. And in maturity, although it continues to have enthusiastic followers, the NNT has also begun to accumulate detractors and critics who point out its defects.

It is about these tribulations of the NNT during its maturity that we are going to talk in this post. Although there are complaints about some of its mathematically unpleasant characteristics, inherited from its nature as a reciprocal, most are based, in reality, on a poor understanding of its meaning or on a not entirely correct use of the parameter.

As we have already said, the NNT was initially developed to assess the efficacy of a treatment to reduce the risk of producing an event in an interest group.

For example, if the mortality of fildulastrosis is 5% per year and with a new treatment it drops to 3%, the treatment reduces the risk by 2%, so the NNT will be 1 / 0.02 = 50. This means that for every 50 patients we treat for 1 year, we will prevent one death thanks to the treatment. Of course, some patients may still die if we prolong the follow-up, as we will see later.
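The arithmetic is simple enough to sketch in a few lines of Python (the disease, of course, is as hypothetical as its name):

```python
# Annual mortality: 5% with standard care, 3% with the new treatment
risk_control = 0.05
risk_treated = 0.03

arr = risk_control - risk_treated  # absolute risk reduction (0.02)
nnt = 1 / arr                      # number needed to treat

print(round(nnt))  # 50: treat 50 patients for 1 year to prevent one death
```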

This approach is very simplistic, since the NNT is actually dual in nature.

The NNT can be understood as the number of patients we need to treat in order to increase the number of expected positive events by one, or to decrease the number of negative events by one, all during a specific follow-up period.

But it can also be understood in the opposite sense, that of harming: the number needed to treat to increase the number of negative events by one, or to decrease the number of expected positive events by one, again during a determined follow-up period.

This duality can make it confusing to assess the NNT by its numerical value alone. To avoid this inconvenience, some authors have proposed calling each of the NNT’s two faces by a different name, so that we always know exactly what we are talking about.

Basically, we would be specifying the direction of the effect we are studying.

Thus, we talk about the number needed to treat to benefit (NNTB) when we want to express the number to treat in order to achieve a beneficial effect (or avoid an unpleasant one), and the number needed to treat to harm (NNTH) when we refer to the number treated for a negative effect to occur (or for a positive one that would have occurred without treatment to be lost).

Another problem that greatly bothers NNT detractors is that its calculation can, at times, generate negative point estimates and, at other times, confidence intervals that cross over into the realm of negative numbers. Here, the difficulty lies in finding a logical meaning for a negative NNT.

Let’s imagine that we obtain an NNT = 10 with a 95% confidence interval (95% CI) of 8 to 12. Here we have no problem: we have to treat 10 patients (point estimate), although this value can range from 8 to 12 (interval estimate).

The problem arises when negative numbers appear. For example, if the NNT = 5 with a 95% CI of 8 to -12, how do we assess it?

Well, here it helps to resort to the dual nature of the NNT that we mentioned earlier. If we think about it, NNT values between -1 and 1 are impossible. Thus, the interval from 8 to -12 can be split in two: an NNTB of 8 (up to infinity) and an NNTH of 12 (up to infinity). Its usefulness may be limited from a clinical point of view, but at least we will have given it a meaning.
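A small helper (a sketch with a hypothetical function name) makes this benefit/harm reading mechanical. It works on the scale of the absolute risk reduction (ARR), where an NNT interval of 8 to -12 corresponds to an ARR between 1/8 and -1/12:

```python
def nnt_interval(arr_low, arr_high):
    """Read the CI of an absolute risk reduction as NNTB/NNTH.

    NNT values between -1 and 1 are impossible, so an ARR interval
    that crosses zero splits into two half-lines through infinity."""
    if arr_low > 0:   # the whole interval shows benefit
        return f"NNTB {1 / arr_high:.0f} to {1 / arr_low:.0f}"
    if arr_high < 0:  # the whole interval shows harm
        return f"NNTH {-1 / arr_high:.0f} to {-1 / arr_low:.0f}"
    # interval crosses zero: a benefit side and a harm side
    return (f"NNTB {1 / arr_high:.0f} to infinity; "
            f"NNTH {-1 / arr_low:.0f} to infinity")

print(nnt_interval(-1/12, 1/8))  # NNTB 8 to infinity; NNTH 12 to infinity
```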

As we discussed in a previous post, when we want to study the efficacy of an intervention, the ideal would be to give the new treatment, end the follow-up period and measure the effect. Then we would use our time machine to go back to the initial moment and, instead of the treatment under study, we would give its alternative, end the follow-up and measure the result.

Once this is done, we would compare the two results. The problem, the most awake of you will have already noticed, is that the time machine has not yet been invented. This means that to obtain this other result, which we call potential or counterfactual, we have to resort to the control group of the trials, which serves as a substitute.

If we think a bit about the implications of the counterfactual theory, although the efficacy of the treatment under study is always the same, the value of the NNT will depend on which intervention we compare it with. Therefore, to correctly interpret the NNT, the comparator that we have used must always be explicitly specified.

An NNT of 10 for a given treatment can only be assessed if we specify which control intervention was used and how long the follow-up lasted, especially when comparing studies in which these differ.

So keep it in mind: the value of the NNT must be expressed together with the treatment alternative and the follow-up period. Failure to do so may make your assessment difficult and misleading.

Virtually all books and manuals agree that the numerical value obtained for the NNT should be rounded up to the nearest integer. It seems logical: it does not make much sense to say that we have to treat 4.8 patients, so we round it up to 5.

The problem with rounding is that it adds imprecision to the estimate and can be misleading.

For example, any absolute risk reduction between 0.52 and 0.9 will be equivalent, after rounding, to an NNT of 2. However, there is a big difference between a reduction of 52% and one of 90%. We should not value them with the same NNT.
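You can check this in a couple of lines of Python (`ceil` rounds up, as the manuals recommend):

```python
from math import ceil

# Two very different absolute risk reductions...
for arr in (0.52, 0.90):
    print(f"ARR = {arr}: NNT = {1 / arr:.2f}, rounded up to {ceil(1 / arr)}")
# ...both end up reported as NNT = 2
```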

So, if nobody throws up their hands in horror on hearing that women bear an average of 1.2 children, why the phobia of giving the NNT with decimals? After all, it is still an estimator that we have to know how to interpret.

If we see an NNT of 6.7, we can conclude that we would have to treat an average of 6 to 7 patients to achieve one beneficial effect during a certain period of time. One warning: if we do so, we must make it clear that the estimate lies between 6 and 7, and that these are not the limits of the 95% CI. Let us not get confused.

You already know that all studies are subject to the effect of confounding variables, especially when they are not randomized. We are used to seeing association measures adjusted for variables that the authors believe may act as confounders. However, it is often the case that this is not done for the NNT, and only its crude value is provided.

Other times, what happens is that inappropriate adjustment methods are used. A wide variety of methods have been developed to calculate the NNT in multiple scenarios, crude and adjusted. If you do not know which one to use, find someone who does before applying the wrong one.

We have said it before, but it does not hurt to insist: the value of the NNT depends on the length of the follow-up period, so it must be specified. The proportion of events that take place increases as time goes by.

For example, if the treatment produces a reduction in the risk of the event that remains constant over time, the value of the NNT will be lower the longer the follow-up period. It is easy to understand, then, that the duration of follow-up must be known to correctly interpret the value of the NNT.

Consider two clinical trials of two different interventions, one with a follow-up period of 2 years and the other of 5. Even if the two studies gave us an NNT of 8, it would not be the same to treat 2 years as 5 to avoid an unpleasant event in 1 out of 8 patients treated.

In survival studies and when evaluating the results per person-time, the frequency of the study event must be taken into account when deciding the method of calculating risk reduction, which, in turn, we will use to calculate the NNT. This should be done even if the risk is constant and the follow-up period of the participants is homogeneous.

We are going to see how the NNT would be calculated with an example invented on the fly. Imagine that we conduct the study and observe 5 cases of death per 100 person-years in the intervention group and 10 cases per 100 person-years in the control group.

Assuming that survival times are exponentially distributed, we first calculate the cumulative risks in the intervention group (Ri) and in the control group (Rc):

R_{c} = 1 – e^{-10/100} = 0.0952

R_{i} = 1 – e^{-5/100} = 0.0488

Now we can calculate the NNT as the inverse of the risk difference, as we already know:

NNT = 1 / (0.0952 – 0.0488) ≈ 21.6

The common mistake in this situation is to directly use the number of events according to the following formula:

NNT = person-time / (difference of events)

If we apply it to the previous case, it would look like this:

NNT = 100 / (10-5) = 20

As you can see, the first method, which is the most appropriate, is somewhat more conservative and gives higher NNT values.

Only in cases where the frequency of the event is very low can we estimate the NNT directly from the number of observed events. Imagine that, in the previous example, we had observed one death in the intervention group and 5 in the control group. We could calculate the NNT as follows:

NNT = 100 / (5-1) = 25

Anyway, do not be seduced by the easy way. It will only be correct to calculate the NNT without first converting rates into cumulative proportions when the event frequency is very low and, in addition, the risk difference between the two groups remains proportional over time. When in doubt, convert.
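The whole calculation fits in a short Python sketch; `nnt_exact` keeps full precision in the cumulative risks (the figures above round them first), while `nnt_naive` is the shortcut to be wary of:

```python
from math import exp

def nnt_exact(rate_treated, rate_control, years=1):
    """NNT from person-time rates, converting rates to cumulative risks
    under the usual assumption of exponentially distributed survival times."""
    r_t = 1 - exp(-rate_treated * years)
    r_c = 1 - exp(-rate_control * years)
    return 1 / (r_c - r_t)

def nnt_naive(rate_treated, rate_control):
    """The common shortcut: inverse of the raw rate difference."""
    return 1 / (rate_control - rate_treated)

# 5 vs 10 deaths per 100 person-years
print(round(nnt_exact(0.05, 0.10), 1))  # ≈ 21.6
print(round(nnt_naive(0.05, 0.10), 1))  # 20.0
```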

To recapitulate a bit what we have said, we can recommend taking a series of precautions to make a proper use of the NNT and to be able to interpret it correctly.

First, never use the NNT without specifying the alternative treatment, the direction of the effect, and the length of the follow-up period.

Second, always calculate your confidence interval. In case it has negative values, consider the dual nature of benefit and harm to try to make a more understandable interpretation.

Third, don’t shy away from decimals. Remember that this is just another estimate. You should have no problem evaluating its point estimate (even if it is not a whole number) and its confidence interval.

Finally, check that you are using the correct methodology in more complex situations, such as those where confounding factors may be involved or in survival studies.

If we use it wisely, the NNT will be able to continue its adventures through maturity and be with us for at least another 30 years.

And here we are going to leave it for today.

We have already seen the usefulness of the NNT to assess the efficacy of an intervention. For example, it tells us how many deaths we can avoid during the study follow-up that would have occurred if we had not intervened.

But what about those who don’t die? Are there participants who will die the same if we do or do not treat them? Well, the NNT doesn’t tell us anything about that. To study this aspect, which would improve the NNT assessment, we need to resort to another parameter: the number remaining at risk. But that is another story…

Today we are going to remember one of the most beautiful stories, in my humble opinion, in the history of biostatistics. Although surely there are better stories, since my general historical ignorance is greater than the number of decimal places of the number pi.

Imagine we are at Rothamsted Station, an agricultural research center located in Harpenden, in the English county of Hertfordshire. It is some point in the early 1920s.

Three scientists, very British, are preparing to have tea: two men and one woman. She is Blanche Muriel Bristol, an expert on algae and fungi. With her is William Roach, a biochemist who is also married to Muriel.

The third one is a geneticist who has started working on the station and will eventually become famous for being one of the founders of population genetics and neo-Darwinism, as well as for a few other little things, like the concept of null hypothesis and hypothesis contrast. Yes, friends, he is the great Ronald Fisher.

Fisher prepares the teacups and gallantly offers the first to Muriel, who declines. She looks at Ronald and says: I like tea with milk, but only if you pour the milk in first. If it is done the other way around, it gives the tea a flavor that I don’t like at all.

Fisher thinks Muriel is teasing him, so he insists, but she digs in her heels. I suspect that at this point Fisher began to think Muriel was actually a bit of a fool, but her husband came to her rescue. William proposes making 8 cups of tea and, at random, pouring the milk in first in 4 of them and the tea in first in the other 4.

To Fisher’s surprise, Muriel guesses the order in which milk and tea had been poured in all eight cups, even though she is not allowed to taste more than two cups at a time. Luck or a privileged palate?

This, which was one of the first randomized experiments in history, if not the first, left Fisher very thoughtful. So he developed a mathematical method to find out the probability that Muriel got it right by pure chance. And this method is the subject of our entry today: Fisher’s exact test.

Before fully entering into Fisher’s exact test, we are going to clarify a series of concepts to understand well what we are going to do.

When we want to make a hypothesis contrast between two qualitative variables (in this case, to check their independence) we can use several tests that compare their frequencies or their proportions.

If we deal with independent data, we can choose an approximate test, such as the chi-square test, or an exact test, such as Fisher’s. If we deal with paired data, we can use McNemar’s test (for 2×2 contingency tables) or Cochran’s Q (for 2×K tables).

And we have talked about exact and approximate tests. What does this mean?

Approximate tests calculate a statistic with a known probability distribution in order to find, from its value, the probability that the statistic takes values equal to or more extreme than the observed one. It is an approximation that holds in the limit as the sample size tends to infinity.

For their part, exact tests calculate the probability of obtaining the observed results directly. This is done by generating all the possible scenarios that go in the same direction as the observed hypothesis and calculating the proportion in which the condition we are studying is fulfilled.

So, which type should we choose? Well, people who know about these things do not manage to agree.

The approximate methods are simpler from a computational point of view, but with the computational power of today’s computers, this argument does not seem to be a reason to choose them. On the other hand, the exact ones are more precise when the sample size is smaller or when some of the categories have a low number of observations.

But if the number of observations is very high, the result is similar using an exact method or an approximate one.

As a rule of thumb, it is recommended to use an exact test when the number of observations is less than 1000 or when there is a group with a number of expected events less than 5. However, if you have a computer, there is no reason to complicate your life: use an exact one.

All this does not mean that we cannot use an approximate test if the sample is small, but we will have to apply a continuity correction, as we saw in a previous post.

Fisher’s exact test is the exact method used when you want to study if there is an association between two qualitative variables, that is, if the proportions of one variable are different depending on the value of the other variable.

In principle, it seems that Fisher designed it with the idea of comparing two dichotomous qualitative variables. In simple terms, for use with 2×2 tables.

However, there are also extensions to the method to do it with larger tables. Many statistical programs are capable of doing this, although, logically, they put more stress on the computer. You can also find calculators available on the Internet.

Fisher’s exact test assumes the null hypothesis that the two variables are independent, that is, the values of one do not depend on the values of the other.

The only necessary condition is that the observations in the sample are independent of each other. This will be true if the sampling is random, if the sample size is less than 10% of the population size and if each observation contributes only to one of the levels of the qualitative variable.

Furthermore, the marginal frequencies of the rows and columns of the contingency tables of the different possible scenarios must remain fixed. Do not worry about this, it will be better understood when we see an example. If this is not true, we can continue to use the test, but it will no longer be exact and it will become more conservative.

After much thought about the problem of tea and the skills of Muriel Bristol, the brilliant Fisher showed that he could calculate the probability of any of the contingency tables using the hypergeometric probability distribution, according to the formula in the figure.
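For those reading without the figure at hand, the probability of a particular 2×2 table with cells a, b (first row) and c, d (second row), with n = a + b + c + d and all margins fixed, is the classic hypergeometric expression:

p = [(a+b)! (c+d)! (a+c)! (b+d)!] / [n! a! b! c! d!]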

Thus, Fisher’s exact test calculates the probabilities of all possible tables and adds up those of the tables with probabilities less than or equal to that of the observed one in the direction of our hypothesis. This sum, multiplied by two, gives us the p-value for a two-tailed hypothesis contrast.

According to the value of p, we will only have to solve our hypothesis contrast in a similar way as we do with any other contrast test.

To finish understanding everything we have said, we are going to repeat the tea experiment but I, instead of Muriel, am going to ask my cousin to give us a hand; we have not caused him any trouble for a long time.

Of course, I can’t make my cousin drink tea, so let’s see if he can tell whether what he’s drinking is Scotch or Irish whiskey. He claims that he is able to distinguish a Scotch from anything else.

So, to test his resistance to alcohol, as well as his palate skills, I randomly offer him 11 shots of Scottish and 11 shots of Irish.

The results can be seen in the first table of the attached figure.

As you can see, he hits 7 of the 11 Scotch shots and only 2 of the 11 Irish ones. It seems that he is right in his claim and that he has a refined palate. But we, like Fisher did with Muriel, are going to see if he has just been lucky.

As we said above, it is necessary to calculate the possible tables that have a probability lower than that of the observed one in the direction of our hypothesis. We will do this by reducing the smallest frequency of the columns step by step until it reaches zero.

Also, we will adjust the other boxes so that the marginals remain constant. Otherwise, we already know that the test would no longer be exact. You can see the two possible tables until the hits with Irish whiskey reach zero.

Now we only have to calculate the probability of each table, add them all and multiply by two. We obtain a value of p = 0.08 for a two-tailed contrast. Since the null hypothesis says that the ability to guess is not influenced by the type of whiskey, and we cannot reject it, we cannot rule out that my cousin’s boast was just a matter of luck.
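If you want to verify the arithmetic, here is a minimal sketch in Python (the post’s own code below uses R) that enumerates every table compatible with the fixed margins and sums the probabilities of those no more likely than the observed table; by the symmetry of this example, that coincides with doubling the observed tail:

```python
from math import comb

def fisher_2x2_p(a, b, c, d):
    """Two-sided Fisher exact p for the table [[a, b], [c, d]]:
    sum of the probabilities of all tables with the same margins
    that are no more probable than the observed one."""
    r1, r2 = a + b, c + d          # row margins
    c1 = a + c                     # first column margin
    n = r1 + r2
    def p_table(x):                # hypergeometric prob. of x in cell (1,1)
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)
    p_obs = p_table(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs * (1 + 1e-9))  # tolerance for float ties

# Scotch: 7 hits, 4 misses; Irish: 2 hits, 9 misses
print(round(fisher_2x2_p(7, 4, 2, 9), 2))  # 0.08
```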

Coming to the end of this post, I’d like to warn you that no one should ever think of doing Fisher’s exact test by hand. This absurd example is extremely simple, but surely our experiments will be a bit more complex. Use a computer application or an Internet calculator.

Let’s solve the example using the R program.

First, we enter the data to build the contingency table with these two consecutive commands:

*data <- data.frame(kindw = c(rep("irh", 11), rep("sct", 11)), hit = c(rep(TRUE, 2), rep(FALSE, 9), rep(TRUE, 7), rep(FALSE, 4)))*

*my_table <- table(data$kindw, data$hit, dnn = c("Whisky", "Hit"))*

Finally, we perform Fisher’s exact test:

*fisher.test(x = my_table, alternative = "two.sided")*

On the output screen, the program provides us with the p-value (p = 0.08), the odds ratio between the two variables and its confidence interval. Remember that Fisher’s test only tells us whether there is a statistically significant association; if we want to measure the strength of the association between the two variables, we have to resort to other types of measures.

And if someone is looking for the value of the Fisher statistic among the program’s output data, I’m sorry to say that he or she has to re-read the entire post from the beginning.

As we have already said, exact tests calculate the probability directly, without first calculating a statistic that follows a known probability distribution. There is no such thing as a Fisher statistic.

And here we are going to leave it for today. We have seen how Fisher’s exact test allows us to study the independence of two qualitative variables but requires one condition: that the marginal frequencies of rows and columns remain constant.

And this can be a problem, because in many biological experiments we will not be able or not sure of meeting this requirement. What happens then? As always, there are several alternatives.

The first, keep using the test. The drawback is that it is no longer an exact test and loses its advantages over approximate tests. But we could use it.

The second is to use another contrast test that does not lose power when the marginals of the table are not fixed, such as Barnard’s test. But that is another story…

Today we are going to do an act of social justice.

At least from the point of view of analyzing the results of clinical trials.

You may wonder what today’s post is about. Well, everyone knows that when we try a new treatment for a disease, the first thing we want to know is how many patients we cure of that disease, or how many we prevent from dying from it.

Imagine that, along with 99 other people, you participate in a clinical trial. Ultimately, the treatment prevents 2 deaths, so the number needed to treat (NNT) will be 50. And this is the main outcome of the trial: an NNT = 50.

Surely the two whose deaths were prevented will be more than happy, but among the rest of the participants a clamor of 98 voices will rise, asking: and me? What about me?

Our act of social justice, which I was referring to at the beginning, has to do with these 98 participants.

We all know the different measures of association and impact that we can use in a clinical trial. The first to consider is the risk ratio (RR), the ratio between the risk of presenting the event in the treated (Rt) and in the controls (Rc). With this measure we gauge the protective or favorable effect of the intervention on the result.

We can also calculate the risk reductions between the two groups. The relative risk reduction (RRR) would be the decrease in risk in the intervention group compared to the risk observed in the controls. On the other hand, the absolute risk reduction (ARR) indicates the difference in risk between the two groups.

Finally, the NNT is the most widely used impact measure and arguably the one with the greatest clinical value, since it represents the effort required to achieve a specific clinical benefit, either avoiding an adverse event or achieving a beneficial one.

In addition to measuring the efficiency of the intervention, the NNT has many other advantages, such as implicitly incorporating the baseline risk without treatment and RRR, but giving a more objective idea of the effect. We already know that the effect always seems greater if we only assess the RRR.

In addition, the NNT helps us to make a more objective assessment without being misled by clinical factors such as the form of presentation of the disease or the severity of the result we are measuring.

But it is not all advantages. It turns out that the NNT is a somewhat selfish indicator that does not care about the fate of those patients who are not reflected in its value. What happens, for example, to those whose deaths the treatment does not prevent? What is the risk of dying for those who do not contribute to the value of the NNT?

To answer this question, another indicator has been devised that is added to the entire family of risk, association and impact measures of clinical trials: the number remaining at risk (NRR).

The NRR deals with what the NNT forgets, since it estimates the average prognosis of presenting the result of interest among those treated, once those who achieve it thanks to the treatment have been excluded. The formula to calculate it would be the following:

NRR = Rt / (Rc-Rt)

If you look closely, this formula could be written as Rt x 1 / (Rc-Rt).

Rc-Rt is the ARR, and its inverse is the NNT. Thus, we can calculate the NRR in the following way:

NRR = Rt x NNT.

I think that, in order to better understand the usefulness of the NRR, we are going to look at the results of several trials carried out with different treatments with which we pursue our usual effort to prevent death from that terrible disease that is fildulastrosis.

We are going to test three drugs called, let’s not think too much, A, B and C. We compare them against a placebo, look at the number of deaths at the end of the trial and use Calcupedev calculator to get the risk and impact measures. You can see the results of the three studies in the attached table.

As you can see, the mortality figures are different in the three trials, although in all of them the ARR of mortality with treatment is 0.1 (10%). Therefore, the NNT that we obtain is the same in the three studies, 10. This means, as we already know, that we avoid one death from fildulastrosis for every 10 patients we treat.

And here we would stop if we did not give ourselves to thinking about what happens to those whose deaths the treatment does not prevent. We could even simplify and say that the three treatments have similar efficacy.

To avoid this, let’s look at the NRRs, which are different in the three studies. Looking at the results, we quickly understand that the prognosis of the patients is radically different in the three trials. For example, in the trial with drug A, we prevented one death for every 10 treated, but 8 of the remaining 9 died despite receiving the treatment. In the drug C group, the one with the best prognosis, we still have to treat 10 to prevent one death from fildulastrosis, but only 2 of the other 9 die.
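With the risks behind those figures (reconstructed from the text: a hypothetical control-group mortality of 90% for drug A and 30% for drug C, each reduced by 10 points with treatment), the NRR calculation is a one-liner:

```python
def nrr(risk_control, risk_treated):
    """Number remaining at risk: treated-group risk times the NNT."""
    nnt = 1 / (risk_control - risk_treated)
    return risk_treated * nnt

# Drug A: mortality 0.9 with placebo, 0.8 with treatment (ARR = 0.1, NNT = 10)
print(round(nrr(0.9, 0.8)))  # 8 of the 9 not saved by treatment still die
# Drug C: mortality 0.3 vs 0.2 (same ARR, same NNT)
print(round(nrr(0.3, 0.2)))  # only 2 of the other 9 die
```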

Looking at these differences in prognosis in the three studies, we could no longer so happily conclude that all three treatments have similar efficacy because we get the same NNT.

Let’s take one more twist to understand the meaning of the NRR even better. And for this, we are going to think about its relationship with RR and RRR.

We can calculate NRR based on RR using the following formula:

NRR = RR / (1-RR)

If you look at the quotient of the previous formula, in the numerator we have a probability (RR) and in the denominator its complement (1-RR). And what is the probability that an event occurs divided by the probability that it does not occur (its complement)? You got it, an odds.

If we understand RR as the probability of an event occurring in response to treatment, we can understand NRR as the odds of that event occurring versus preventing it. For example, an NRR of 5 would mean that the patient is 5 times more likely to die from the disease (despite treatment) than to avoid death thanks to treatment.

Furthermore, as the NRR can be expressed as a function of the RR, and since both the RR and the RRR are calculated from the risks in treated and controls, it can be mathematically demonstrated that each RRR is associated with a certain NRR, regardless of the ARR or NNT values. Thus, an RRR of 0.1 is associated with an NRR of 9, an RRR of 0.2 with an NRR of 4, an RRR of 0.5 with an NRR of 1… Curiosities of the numbers.
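These “curiosities of the numbers” follow directly from substituting RR = 1 - RRR into the previous formula; a quick Python check:

```python
def nrr_from_rrr(rrr):
    """NRR = RR / (1 - RR), with RR = 1 - RRR."""
    rr = 1 - rrr
    return rr / (1 - rr)

for rrr in (0.1, 0.2, 0.5):
    print(f"RRR = {rrr} -> NRR = {nrr_from_rrr(rrr):.0f}")
# RRR = 0.1 -> NRR = 9; RRR = 0.2 -> NRR = 4; RRR = 0.5 -> NRR = 1
```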

The NRR may even have a value lower than 1, which indicates that the patient is more likely to obtain the beneficial effect of the treatment than to suffer the event despite it. Of course, for this to occur, the RRR has to be large, greater than 50% (that is, the RR must be below 0.5).

We have seen in this post the usefulness of the NRR, although the logical thing is not to use it in isolation, but as a complement to the NNT, with which we can more accurately assess the benefits-risks of the intervention.

The NRR indicates the absolute number of patients who will still present the event of interest and will serve, as we have seen, to explain discordant results between trials, which may be due to prognostic differences between participants presenting more or less aggressive forms of the disease.

The latter may induce us to use the NRR as a measure of relative efficacy between various treatments, but it is not an indicator that has been designed for this purpose, so we must avoid falling into this temptation.

And here we are going to leave it for today. We have seen how the NRR is an indicator that is easy to calculate and understand by most clinicians. Anyway, do not get carried away by enthusiasm and use it for anything. For example, formal studies with cost-effectiveness analysis decision techniques require more complex statistical models, beyond the reach of most clinicians and only available to privileged minds. But that is another story…

Blinders are pieces that are put over the eyes of some draft animals, such as donkeys or horses. Their purpose is none other than to get the animal to focus only on the road ahead, without being distracted by other things it could see with its peripheral vision, less important for its task.

I always feel a bit sad seeing them like that, pulling the cart with their eyes half covered. But, making an effort, I can understand the usefulness of the device, especially in areas with heavy traffic, where the animal could be frightened if it could see everything around it.

And this issue leads me to think of other blinders, symbolic ones this time, that we human beings wear on many occasions, limiting our vision and, often, without a clear benefit. I am referring to the obsession with statistical significance, one of those blinders that someone put on us at some point and that we should take off to get a bigger picture.

When we read a clinical trial, it is a very common custom to look for the p-value to see if it is statistically significant, even before looking at the result of the study outcome variable and evaluating the methodological quality of the trial. Leaving aside the clinical relevance of the results (to which we will return shortly), this is not a recommended practice.

First, the significance threshold is totally arbitrary and, whatever we decide after seeing the p-value, there is always some probability of being wrong. Furthermore, the p-value depends, among other factors, on the sample size and on the number of events we observe, which can also vary by chance.

In this sense, we already saw in a previous post how some authors developed a fragility index, which gives an approximation of how the p-value and its statistical significance would change if some of the trial participants had had a different outcome.

The fragility index is thus defined as the minimum number of changes in the participants' outcomes that would change the statistical significance of the trial (from significant to non-significant, or vice versa). Studies with lower index values are considered more fragile, since minor modifications of the results would eliminate their significance.

This new approach has the merit of not basing the assessment of the study solely on the p-value obtained. In general, we will feel more comfortable the higher the fragility index since it would take many more changes for the p to stop being significant. However, we are forgetting two fundamental aspects. First, how likely it is that these changes in the results will occur. Second, the clinical relevance of the effect size observed in the study.

Let's suppose that we run a clinical trial to assess two treatment alternatives for that terrible disease, fildulastrosis. So as not to fret over drug names, we will call the two alternatives A and B.

We recruited 295 patients and distributed them randomly between the two arms of the trial, 145 for treatment A and 150 for treatment B.

At the end of the study we obtain the results that you can see in the first contingency table. In group A, 5 patients were healed, while in group B none were. The probability of being healed in group A was therefore 3.45%, while in B it was 0%. At first glance, it seems that there was a greater probability of being healed in group A and, indeed, a Fisher's exact test gives us a value of p = 0.027 for a two-sided test.

As a conclusion, since p < 0.05, we reject the null hypothesis which, for Fisher's test, assumes that the probability of healing is equal in the two groups. In other words, there is a statistically significant difference, so we conclude that treatment A was more effective in healing fildulastrosis.

But what if a participant in group B had been healed? You can see it in the second contingency table.

The probability of being healed in group A would continue to be 3.45%, while that of B would be, in this case, 0.66%. It appears that A is still better, but if we do Fisher’s exact test again, the p-value for a two-sided test is now 0.11.

What happened? The difference is no longer statistically significant after changing the outcome of just one of the 295 participants. The fragility index is therefore equal to 1, so we would consider the initial result fragile.
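
Both tests can be reproduced in a few lines. Here is a minimal sketch in Python with SciPy (the contingency tables are built from the numbers reported above):

```python
from scipy.stats import fisher_exact

# First table: 5/145 healed in group A, 0/150 healed in group B
_, p_original = fisher_exact([[5, 140], [0, 150]], alternative="two-sided")

# Second table: one participant in group B is now healed (1/150)
_, p_changed = fisher_exact([[5, 140], [1, 149]], alternative="two-sided")

print(p_original)  # p ≈ 0.027 (as in the post): statistically significant
print(p_changed)   # p ≈ 0.11 (as in the post): no longer significant
```

Since a single changed outcome moves the result from significant to non-significant, the fragility index of the trial is 1.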

Now I ask myself: are we considering everything that we should? I would say not. Let’s see.

Our initial study, if we rely solely on the fragility index, would be considered fragile, which we could express as having an unstable statistical significance.

But this argument is a bit fallacious, since we are not taking into account how likely it is that this change will occur in one of the participants.

Suppose that, from previous studies, we know that the probability of healing from the disease without treatment is 0.1%. We can use a binomial probability calculator to run a few numbers. For example, the probability that none of the 150 patients in group B (the first scenario) is healed is 86%. Similarly, the probability that exactly 1 is healed is 13%.

And this is where the fallacy lies: we are assessing the fragility of statistical significance by comparing the result that we have observed with another eventual one whose probability of occurrence is much lower. As a conclusion, it does not seem reasonable to define the fragility of the finding without assessing the likelihood of producing this minimal change that modifies the statistical significance.

Now imagine that the probability of being healed without treatment were 1%. The probability of observing no healings among 150 patients would be 22%, while that of exactly 1 healing rises to 33%. In this case we could say that the study provides a fragile significance (the alternative outcome is more likely than the observed one).
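
These probabilities are easy to check with any binomial calculator; a quick sketch in Python with SciPy, for the two scenarios described above:

```python
from scipy.stats import binom

n = 150  # participants in group B

# Spontaneous healing probability of 0.1%
p0 = binom.pmf(0, n, 0.001)  # P(no healings), about 0.86
p1 = binom.pmf(1, n, 0.001)  # P(exactly one healing), about 0.13

# Spontaneous healing probability of 1%
q0 = binom.pmf(0, n, 0.01)   # P(no healings), about 0.22
q1 = binom.pmf(1, n, 0.01)   # P(exactly one healing), about 0.34

# In the second scenario the alternative outcome (one healing) is
# more likely than the observed one (no healings): q1 > q0
```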

To do things properly and really widen our field of vision, we should not be satisfied with statistical significance alone; we should also assess the clinical relevance of the result.

In this sense, some authors have proposed that, before calculating the statistical significance of the observed effect, the threshold of clinical relevance should be established. In this way, the minimum important difference between the two groups is defined.

If the effect we detect exceeds this minimal difference between the two groups, we can say that the effect is quantitatively significant. This quantitative significance has nothing to do with statistical significance, it only implies that the observed effect is greater than that considered important from a clinical point of view.

In order not to get confused between the two meanings, we are going to call this quantitative significance by its true name: clinical relevance.

We are now going to try to put together the three aspects that we have dealt with so far.

If the p-value of the observed effect is less than 0.05, we can start by stating that this difference is statistically significant. Next, we will consider the fragility and clinical relevance of the result.

If the effect is not clinically relevant it will not make sense to spend more time on it, even if the p is significant.

But if the effect is clinically relevant, we will no longer be content with calculating how many changes would have to occur to modify statistical significance (and how likely those changes are). We will also have to calculate how many changes must occur to lose that minimal clinically relevant difference.

If that number is greater than the fragility index, the result may be statistically unstable, but stable from the point of view of the clinical significance of the result.

On the contrary, if the study is quantitatively unstable, a slight change in the outcomes will make the effect lose the magnitude considered relevant. If those changes can occur with a reasonably high probability, we will not have much confidence in the results of the study, regardless of their statistical significance.

To summarize everything we have said, when the time comes to assess the results of a clinical trial, we can follow these four steps:

- Assess statistical significance. Here we must not lose sight of the fact that reaching significance may be a matter of increasing the sample size sufficiently.
- Determine the clinical significance. The reference is the minimum relevant difference that we want to observe between the two groups, taking into account the criteria of clinical relevance of the effect.
- Assess quantitative stability. Determine the number of changes that can modify the clinical significance of the results.
- Determine if the study is fragile or stable. How many changes are needed to reverse statistical significance (the fragility index that we started this whole thing with).
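
As a toy illustration of the last two steps, here is a sketch in Python applied to the fildulastrosis trial, assuming a purely hypothetical minimal important difference of 2% in the healing rate (the post does not fix this value):

```python
# Observed results: 5/145 healed in A, 0/150 in B (risk difference ~3.45%)
n_a, n_b = 145, 150
healed_a, healed_b = 5, 0
mid = 0.02  # hypothetical minimal clinically important risk difference

# Flip healed patients in A one at a time until the effect drops below the MID
changes = 0
while healed_a / n_a - healed_b / n_b >= mid:
    healed_a -= 1
    changes += 1

print(changes)  # 3 changes are needed to lose clinical relevance
```

Compare this with the fragility index of 1 computed earlier: under this assumed threshold, the result is statistically fragile but somewhat more stable from the clinical point of view.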

And here we are going to end this post, long and dense, but one that deals with an important issue that our blinders prevent us from assessing properly.

All of the above refers to clinical trials, although this problem also applies to meta-analyses, where the overall outcome measure can also change radically with changes in the results of some of the primary studies in the review. For this reason, some indices have also been developed, such as Rosenthal's fail-safe N or, also taking clinical relevance into account, Orwin's fail-safe N. But that is another story…

We live in a crazy world, always running from here to there and always with a thousand things in mind. Thus, it is not uncommon for us to leave many of our tasks half finished.

On some occasions this will be of little importance, but on others, leaving things half done will render useless the half we have done.

And this is precisely what happens when we apply this sloppiness to today's topic: we run an experiment, we calculate a regression line and we just start applying it, forgetting to perform the diagnostics of the regression model.

In these cases, leaving things half done may have the consequence that we apply to our population a predictive model that, in reality, may not be valid.

We already saw in a previous post how to build a simple linear regression model. As we already know, simple linear regression allows us to estimate what the value of a dependent variable will be based on the value taken by a second variable, which will be the independent one, provided that there is a linear relationship between the two variables.

We also saw in an example how a regression model could allow us to estimate what the height of a tree would be if we only know the volume of the trunk, even if we did not have any tree with that volume available.
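
As a reminder of how such a prediction works, here is a minimal sketch with `scipy.stats.linregress` in Python, using made-up numbers (not the real tree measurements from that post):

```python
from scipy.stats import linregress

# Made-up data: trunk volume (arbitrary units) and tree height (feet)
volume = [10, 16, 19, 22, 25, 31, 38, 45, 51, 55]
height = [63, 66, 70, 72, 75, 76, 80, 82, 85, 87]

fit = linregress(volume, height)  # least squares line: height = a + b * volume

# Estimate the height of a tree with a trunk volume absent from the sample
volume_new = 35
height_new = fit.intercept + fit.slope * volume_new
```

The fitted line lets us interpolate a height for a volume we never actually observed, which is exactly the predictive use described above.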

No wonder, then, that the prediction capabilities of regression models are widely used in biomedical research. And that’s fine, but the problem is that, the vast majority of the time, authors who use regression models to communicate their studies’ results forget about the validation and diagnostics of the regression model.

And at this point, some may wonder: do regression models have to be validated? If we already have the coefficients of the model, do we have to do something else?

Well, yes, it is not enough to obtain the coefficients of the line and start making predictions. To be sure that the model is valid, a series of assumptions must be checked. This process is known as the validation and diagnostics of the regression model.

We must never forget that we usually deal with samples, but what we really want is to make inferences about the population from which the sample comes, which we cannot access in its entirety.

Once we calculate the coefficients of the regression line using, for example, the least squares method, and we see that their values are different from zero, we must ask ourselves whether it is possible that, in the population, those values are zero and that the values we have found in our sample are due to random fluctuation.

And how can we know this? Very easily: we perform a hypothesis test for the two coefficients of the line, with the null hypothesis that the coefficients' values are, effectively, zero:

H_{0}: β_{0} = 0 and H_{0}: β_{1} = 0

If we can reject both null hypotheses, we can apply the regression line that we have obtained to our population.

If we cannot reject H_{0} for β_{0}, the constant (intercept) of the model will not be valid. We can still apply the linear equation, but assuming that the line passes through the origin of the coordinate axes. But if we have the misfortune of not being able to reject the null hypothesis for the slope (or for either of the two coefficients), we will not be able to apply the model to the population: the independent variable will not allow us to predict the value of the dependent variable.

This hypothesis test can be done in two ways:

- If we divide each coefficient by its standard error, we obtain a statistic that follows a Student's t distribution with n-2 degrees of freedom. We can calculate the p-value associated with that value and solve the test by rejecting the null hypothesis if p < 0.05.
- A slightly more complex way is to base the test on an analysis of variance (ANOVA). This method considers that the variability of the dependent variable can be decomposed into two terms: one explained by the independent variable and another not assigned to any source, which is considered unexplained (random).

It is possible to estimate the variance of both components, explained and unexplained. If the variation due to the independent variable does not exceed that due to chance, the explained/unexplained ratio will have a value close to one. Otherwise, it will move away from unity: the further away, the better the independent variable predicts the dependent variable.

When the slope (the coefficient β_{1}) is equal to zero (under the assumption of the null hypothesis), this quotient follows a Snedecor’s F distribution with 1 and n-2 degrees of freedom. As with the previous method, we can calculate the p-value associated with the value of F and reject the null hypothesis if p <0.05.
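
In simple linear regression the two approaches are in fact equivalent: the F statistic is the square of the slope's t statistic, and the two p-values coincide. A sketch in Python with SciPy, on made-up data:

```python
import numpy as np
from scipy.stats import linregress, t, f

# Made-up data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8, 8.3, 8.9])
n = len(x)

fit = linregress(x, y)

# Method 1: t test on the slope (coefficient divided by its standard error)
t_stat = fit.slope / fit.stderr
p_from_t = 2 * t.sf(abs(t_stat), n - 2)

# Method 2: ANOVA; in simple regression F = t^2, with 1 and n-2 df
f_stat = t_stat ** 2
p_from_f = f.sf(f_stat, 1, n - 2)

# Both p-values are identical (and equal to the p-value reported by linregress)
```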

We are going to try to understand a little better what we have just explained by using a practical example. To do this, we are going to use the statistical program R and one of its data sets, “trees”, which collects the circumference, volume and height of 31 trees.

We load the data set, execute the lm() function to calculate the regression model and obtain its summary with the summary() function, as you can see in the attached figure.

If you look closely, the program shows the point estimates of the coefficients together with their standard errors, accompanied by the values of the t statistic and their statistical significance. In both cases p < 0.05, so we reject the null hypothesis for the two coefficients of the equation of the line. In other words, both coefficients are statistically significant.

Next, R provides us with a series of figures (the standard deviation of the residuals, the square of the multiple correlation coefficient, or coefficient of determination, and its adjusted value), among which is the F test to validate the model. There are no surprises: p is less than 0.05, so we can reject the null hypothesis; the coefficient β_{1} is statistically significant and the independent variable allows us to predict the values of the dependent variable.

Everything we have seen so far is usually provided by statistical programs when we ask for the regression model. But we cannot leave the task half done. Once we have verified that the coefficients are significant, we must check that a series of assumptions necessary for the model to be valid are met.

These assumptions are four: linearity, homoscedasticity, normality and independence. Here, even if we use a statistics program, we will have to work a little harder to verify these assumptions and make a correct diagnosis of the regression model.

As we have already said, the relationship between the dependent and independent variables must be linear. This can be checked with something as simple as a scatter plot, which shows us what the relationship looks like over the range of observed values of the independent variable.

If we see that the relationship is not linear and we are very determined to use a linear regression model, we can try to make a transformation of the variables and see if the points are already distributed, more or less, along a line.

A numerical method that enables the assumption of linearity to be tested is Ramsey’s RESET test. This test checks whether it is necessary to introduce quadratic or cubic terms so that the systematic patterns in the residuals disappear. Let’s see what this means.

The residual is the difference between a real value of the dependent variable observed in the experiment and the value estimated by the regression model. In the previous image that shows the result of the summary() function of R we can see the distribution of the residuals.

For the model to be correct, the median must be close to zero and the absolute values of the residuals must be uniformly distributed among the quartiles (similar between maximum and minimum and between first and third quartiles). In other words, this means that the residuals, if the model is correct, follow a normal distribution whose mean is zero.

If we see that this is not the case, the residuals will be systematically biased and the model will be incorrectly specified. Logically, if the model is not linear, this bias of the residuals could be corrected by introducing a quadratic or cubic term into the equation of the line. Of course, then, it would no longer be a linear regression nor the equation of a line.

The null hypothesis of Ramsey's test states that the quadratic and cubic terms are equal to zero (they can be tested together or separately). If we cannot reject the null hypothesis, the model is assumed to be correctly specified. Otherwise, the model has specification errors and must be revised.

As we have already said, for homoscedasticity the residuals must be distributed homogeneously across all values of the predictor variable.

This can be verified in a simple way with a scatter plot that represents, on the abscissa axis, the estimates of the dependent variable for the different values of the independent variable and, on the coordinate axis, the corresponding residuals. The homoscedasticity assumption will be accepted if the residuals are randomly distributed, in which case we will see a cloud of points in a similar way throughout the range of the observations of the independent variable.

We also have numerical methods to test the assumption of homoscedasticity, such as the Breusch-Pagan-Godfrey test, whose null hypothesis states that this assumption is satisfied.
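
A hand-rolled version of this test (the studentized, Koenker-type variant of Breusch-Pagan) fits in a few lines: regress the squared residuals on the predictor and compute LM = n·R², which under the null hypothesis follows a chi-squared distribution with one degree of freedom. A sketch in Python on simulated data:

```python
import numpy as np
from scipy.stats import linregress, chi2

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 1.0 + 0.8 * x + rng.normal(0, 0.5, size=x.size)  # homoscedastic noise

fit = linregress(x, y)
residuals = y - (fit.intercept + fit.slope * x)

# Auxiliary regression of the squared residuals on x
aux = linregress(x, residuals ** 2)
lm = x.size * aux.rvalue ** 2  # LM statistic = n * R^2
p_bp = chi2.sf(lm, df=1)       # H0: the residual variance is constant
```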

We have also said it already: the residuals must follow a normal distribution.

A simple way to check it would be to represent the graph of theoretical quantiles of the residuals, in which we should see their distribution along the diagonal of the graph.

We can also use a numerical method, such as the Kolmogorov-Smirnov’s test or the Shapiro-Wilk’s test.

Finally, the residuals must be independent of each other: there must be no correlation among them.

This can be contrasted by carrying out the Durbin-Watson’s test, whose null hypothesis assumes precisely that the residuals are independent.
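
These last two checks can also be sketched in code. Below, in Python, a Shapiro-Wilk test on the residuals and the Durbin-Watson statistic computed directly from its definition, DW = sum((e_i - e_{i-1})^2) / sum(e_i^2); values close to 2 suggest independent residuals. The data are simulated:

```python
import numpy as np
from scipy.stats import linregress, shapiro

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 40)
y = 2.0 + 0.5 * x + rng.normal(0, 0.3, size=x.size)  # simulated linear data

fit = linregress(x, y)
residuals = y - (fit.intercept + fit.slope * x)

# Normality of the residuals: Shapiro-Wilk test (H0: residuals are normal)
_, p_shapiro = shapiro(residuals)

# Independence: Durbin-Watson statistic from its definition
dw = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)
# dw ranges from 0 to 4; values close to 2 indicate no autocorrelation
```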

To finish this post, we are going to make the diagnosis of the regression model that we have used above with our trees. To make it suitable for all audiences, this time we will use the R-Commander interface, thus avoiding writing on the command line, which is always more unpleasant.

For those of you who don’t know R very well, I leave you on the first screen the previous steps to load the data and calculate the regression model.

Let’s start with the diagnosis of the model.

To check if the assumption of linearity is fulfilled, we start by drawing the scatter plot with the two variables (menu options Graphs-> Scatter plot). If we look at the graph, we see that the points are distributed, more or less, along a line in an upward direction to the right.

If we want to do the numerical method, we select the menu option Models-> Numerical diagnostics-> Non-linearity RESET test. R gives us a RESET value = 2.52, with a p = 0.09. As p> 0.05, we cannot reject the null hypothesis that the model is linear, thereby corroborating the impression we obtained with the graphic method.

Let’s go with homoscedasticity. For the graphical method we resort to the menu option Models-> Graphs-> Basic diagnostic graphs. The program provides us with 4 graphs, but now we will only look at the first one, which represents the values predicted by the model of the dependent variable against the residuals.

As can be seen, the dispersion of the points is much greater for the lower values of the dependent variable, so I would not be very calm about whether the homoscedasticity assumption is fulfilled. The points should be distributed homogeneously over the entire range of values of the dependent variable.

Let’s see what the numerical method says. We select the menu option Models-> Numerical diagnoses-> Breusch-Pagan test for heteroscedasticity. The value of the BP statistic that R gives us is 2.76, with a p-value = 0.09. Since p> 0.05, we cannot reject the null hypothesis, so we assume that the homoscedasticity assumption holds.

We go on to check the normality of the residuals. For the graphical method, we select the menu option Graphs-> Graph of comparison of quantiles. This time there is no doubt: the points are distributed along the diagonal.

Finally, let’s check the independence assumption.

We select the option Models-> Numerical diagnoses-> Durbin-Watson test for autocorrelation. A non-zero value of rho is usually selected, since it is rare to know the direction of the autocorrelation of the residuals, if it exists. We do it like this and R gives us a value of the statistic DW = 1.53, with a p-value = 0.12.

Consequently, we cannot reject the null hypothesis that the residuals are independent, thus fulfilling the last condition to consider the model as valid.

And here we are going to leave it for today. Seeing how laborious this whole procedure is, one could be tempted to forgive, and even understand, the authors who hide the diagnostics of their regression models from us. But that excuse is not valid: statistical programs do all of this for us with hardly any effort on our part.

Do not think that, with everything we have explained, we have done everything we should do before confidently applying a simple linear regression model.

For example, it would not hurt to assess whether there are influential observations that may have a greater weight in the formulation of the model. Or if there are extreme values (outliers) that can distort the estimate of the slope of the regression line. But that is another story…
