The Student’s t probability distribution allows us to estimate the population mean of a random variable that follows a normal distribution when the estimate is drawn from a small sample and the population variance is unknown.

Something similar to what happens to me with chocolate happens to me with beer: I like all types, absolutely all of them, except those with fruit, especially if it is cherries. I recognize that fruit is a healthy and recommendable food, but I prefer everything in its place and would rather not mix wheat with chaff.

We already talked about chocolate one day, so today we will talk about beer. Or rather, about an illustrious figure from the world of beer, who lived between the nineteenth and twentieth centuries: none other than William Sealy Gosset.

Don’t you know who he was? Wait a bit and you will see.

What you all surely know is Guinness beer, that toasted, I would rather say black, beer, with such a characteristic flavor and a foam so white and dense that it helped create the legend, false by the way, that it had coffee among its ingredients.

William Sealy Gosset worked at Guinness in the early 1900s and applied his knowledge of statistics to quality control and to improve both the malt grown on the farm and the beer made at the brewery.

The problem that Gosset had is that he worked with small samples, so he was subject to errors in his estimates, especially when he had extreme values in his samples.

So, with the help of a friend of his, a certain Pearson, whose name I hope rings a bell, he worked out a new probability distribution, the well-known Student’s t distribution, which we are going to talk about today.

Nowadays, the Student’s t distribution is one of the most widely used in statistical inference with small samples, so it is the one usually employed to compare a sample mean with a population mean and to compare two means.

It is quite similar to the standard normal distribution although, while the normal is defined by its mean and its variance, the Student’s t distribution also incorporates its degrees of freedom, which is why it is usually written t_{n}, where n is the number of degrees of freedom, usually calculated as the sample size minus one.

Its shape, as we have said, is similar to that of the normal distribution, centered on zero, bell-shaped and symmetric, although the Student’s t has heavier tails than the Gaussian curve. This implies a greater dispersion of the data, which means that the estimates are less precise and the confidence intervals are wider than those that would be obtained by applying the normal distribution.

In any case, these differences disappear as the sample size increases. When n is large, the t can be approximated by a normal distribution with minimal error. This is because the behavior of the tails depends on the degrees of freedom of the distribution, the tails becoming lighter as the number of degrees of freedom, and therefore the sample size, increases.

In summary, and to put it in a more technical way, as the sample size (and the degrees of freedom) decreases, the cumulative probability in the tails increases, and vice versa. A Student’s t distribution with 30 or more degrees of freedom is practically indistinguishable from a normal distribution with the same mean and variance.
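We can see those heavier tails with our own eyes by comparing the two density functions directly. This little Python sketch (the formulas are the standard textbook ones, nothing specific to this post) computes how much more likely a value three standard deviations from the mean is under a Student’s t than under a normal, for increasing degrees of freedom:

```python
import math

def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def normal_pdf(x):
    """Density of the standard normal."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# How much denser is the t tail at x = 3 than the normal tail?
# The excess shrinks toward 1 as the degrees of freedom grow.
for df in (2, 5, 30, 100):
    print(f"df = {df:3d}: ratio = {t_pdf(3, df) / normal_pdf(3):.2f}")
```

With 2 degrees of freedom the t puts several times more probability out there than the Gaussian; by 30 degrees of freedom the ratio is already close to 1, which is exactly the convergence described above.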

As we have already said, Pearson helped Gosset to tabulate the distribution and, to round off the task, published it in his journal, Biometrika. But Pearson, clever as he was, did not grasp the significance of Gosset’s finding.

Luckily, Gosset had many friends (did it have something to do with working in a brewery?) and another of them did realize how revolutionary the method was. He was none other than the great Ronald Fisher, whom we have also talked about in a previous post.

Indeed, it was Fisher who introduced the concept of degrees of freedom, so important for this distribution, since they allow adjusting for the deviation of the estimates produced by the small sample size, although, of course, at the price of obtaining lower precision, especially with the smallest samples.

This is what makes it possible to use the Student’s t distribution to estimate the value of the population mean of a random variable that follows a normal distribution when the parameter is drawn from a small sample and the population variance is unknown.
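As a sketch of what this looks like in practice, here is a small Python example with made-up data: ten hypothetical measurements, the tabulated critical value t = 2.262 for 9 degrees of freedom (with SciPy it would be `stats.t.ppf(0.975, 9)`), and the comparison with the narrower interval the normal distribution would wrongly give us:

```python
import math
import statistics

# Hypothetical sample of 10 measurements (say, original gravity of 10 brews)
sample = [12.1, 11.8, 12.4, 12.0, 11.6, 12.3, 11.9, 12.2, 12.5, 11.7]

n = len(sample)
mean = statistics.mean(sample)
s = statistics.stdev(sample)   # sample SD: the population variance is unknown

t_crit = 2.262                 # tabulated t for 95% confidence, 9 df
z_crit = 1.960                 # what the normal distribution would use

half_t = t_crit * s / math.sqrt(n)
half_z = z_crit * s / math.sqrt(n)

print(f"95% CI (Student's t):        {mean - half_t:.2f} to {mean + half_t:.2f}")
print(f"95% CI (normal, too narrow): {mean - half_z:.2f} to {mean + half_z:.2f}")
```

The t-based interval is wider, as promised: the price we pay for not knowing the population variance and having only ten observations.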

Furthermore, as we have already mentioned, it is used in the contrast of hypotheses between two means when the random variable follows a normal distribution and there is equality of variances (homoscedasticity) between the two groups that are contrasted.
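For illustration, a minimal sketch of that two-mean contrast, with invented data and the tabulated critical value for 18 degrees of freedom, using the pooled variance that the homoscedasticity assumption allows (with SciPy, `stats.ttest_ind` would do all of this in one call):

```python
import math
import statistics

# Hypothetical yields from two barley varieties, 10 plots each (invented data)
a = [14.2, 15.1, 13.8, 14.9, 15.4, 14.0, 14.7, 15.2, 13.9, 14.6]
b = [13.1, 13.9, 12.8, 13.5, 14.0, 12.9, 13.6, 13.8, 13.0, 13.4]

na, nb = len(a), len(b)
ma, mb = statistics.mean(a), statistics.mean(b)
va, vb = statistics.variance(a), statistics.variance(b)

# Pooled variance: legitimate because we assume equal variances
sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
t = (ma - mb) / math.sqrt(sp2 * (1 / na + 1 / nb))

t_crit = 2.101   # tabulated two-sided value, alpha = 0.05, 18 df
print(f"t = {t:.2f}, reject H0: {abs(t) > t_crit}")
```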

At this point, those of you who do not know the story of Gosset’s adventures will be wondering why we call it Student’s t and not Gosset’s t.

On this matter, as was the case with Apple’s bitten apple, there are two versions.

The most widespread version claims that Guinness had prohibited its employees from publishing articles of any kind. This was because a previous employee had published secrets of the brewery, and with this prohibition the company wanted to prevent further leaks of confidential information. That is why Gosset published his work in Biometrika under the pseudonym of Student.

But I prefer another version that is less known but much prettier. A modern and progressive company like Guinness understood the need to apply statistical knowledge to improve its production, but did not want the competition to do something similar and thereby lose this advantage. That is why Gosset would have published his work under a pseudonym, so as not to be linked to the brewery.

And with this we are going to finish for today.

We have seen how a restless and intelligent spirit (with the help of some friends) knew how to adapt statistics to his needs in order to improve his estimates without being limited by the small sample size that he had to use in his studies.

But this was not the only objective; he also sought to make production less subject to variations in the environmental conditions of soil, climate and the like. In other words, he was interested in developing methods that were robust in the presence of extreme values. Although the credit on this point would later go to his friend Fisher. But that is another story…

We all know how difficult it is to make a new drug available to people who can benefit from it. From the moment a promising molecule is identified and somebody thinks that it may be useful, until it can be bought in a pharmacy, a long journey goes by that, at present, usually does not last less than 10 or 12 years.

During this long journey, the future drug, after its initial development phase, enters the thorny path of the preclinical phase, with studies in cell or animal models, followed by a gradual use in humans to verify its toxicity, its right dose and its effectiveness.

Thus, future drugs go through a series of phases. The first contact with humans takes place in **phase I** trials, in which a small number of people, usually volunteers, act as lab rats to study the pharmacokinetics and pharmacodynamics of the new drug, along with the safety of the doses administered.

It is in **phase II** when the drug is usually administered for the first time to patients who could be candidates for treatment. These studies assess the benefits and the optimal dose in comparison with a control group.

Finally, before the drug is launched and commercialized, **phase III** trials are conducted. In these trials, the drug is tested in a controlled way against a placebo or the usual treatment in a large number of patients. Its objective is to determine the efficacy, toxicity and risk-benefit of the intervention with the goal of obtaining its authorization for the studied indication.

An estimated 5,000-10,000 promising molecules are identified each year, of which only about 10 make it to human trials. And even so, when the drug ends up being marketed and used in a massive way, we can find surprises regarding its effectiveness, safety and difficulties of use.

It is not uncommon for an adverse effect that went unnoticed in phase III trials during drug development to be described once the drug is in widespread use.

We can put part of the blame for these things on our inseparable companion: chance. Imagine a drug that triggers one case of fildulastrosis for every 50,000 patients who receive it. If we run a phase III trial with 500 treated patients, the probability of detecting at least one case of this complication is around 1%.
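The arithmetic behind that 1% is a simple complement rule: the probability of at least one case is one minus the probability of none. Here it is sketched in Python, with the fictitious risk of one case per 50,000 treated patients:

```python
# Fictitious risk: one case of the complication per 50,000 treated patients
risk = 1 / 50_000

def p_at_least_one(n_patients, risk=risk):
    """Probability of observing at least one case among n_patients."""
    return 1 - (1 - risk) ** n_patients

print(f"Phase III trial, 500 patients:   {p_at_least_one(500):.3%}")
print(f"Massive use, 500,000 patients:   {p_at_least_one(500_000):.1%}")
```

With 500 patients we have roughly one chance in a hundred of seeing the problem; with half a million, missing it is practically impossible.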

Logically, it will not be strange that we do not realize the problem until the drug is used massively. And some of you will think that the remedy is to increase the number of participants in phase III trials, but it is not that simple. There are more considerations to take into account.

Clinical trials are usually conducted in very restrictive situations. They are usually limited to a highly selected population, from which the most severe patients or those with a higher risk of complications are usually excluded, who are often the ones that interest us clinicians the most. Furthermore, the entire design is optimized to study the isolated effect of the intervention under study. This is fine to avoid bias and to ensure that the methodological quality of the trial is adequate, but it can greatly limit the applicability of the trial’s results to patients in our usual clinical setting.

So, after phase III and commercialization, we need to be able to evaluate the new drug in a situation more similar to our day-to-day practice with our usual patients. A situation that is like life itself.

It could be thought that a well-designed observational study would allow exploring the effectiveness of the drug in day-to-day clinical practice, without altering the patient’s usual life. Its results would be easier to generalize to more diverse populations. However, we already know that observational studies are subject to biases that can compromise the validity of their results.

When we compare the results of the two arms of an observational study, the difference detected may be due not only to the effect of the intervention or exposure under study, but also to a multitude of other factors that we call confounding factors.

If we know what these confounders are, we can adjust them during the study design or during the analysis phase. The problem is that there can always be factors that we are unaware of and we could end up attributing to our intervention an effect that may be caused or influenced by an unknown confounding variable.

Herein lies the great merit of randomization, which tends to homogeneously distribute the confounding factors between the two groups, both known and unknown. Thus, the difference that we observe at the end of the follow-up will be due to the only thing that is different between the two groups: the presence or absence of intervention.

We therefore see that we cannot do without clinical trials, although we can modify their design so that they are not so strict.

The randomized clinical trial is the gold standard for epidemiological designs. As we have already said, it is usually carried out on a sample of participants with strict criteria, which are randomly divided into two groups, intervention and control, to see the differences between the two at the end of the study, which will be attributable to the intervention.

But we must bear in mind that, in addition to the pharmacological effect of the intervention, it can have effects on the behavior of participants and researchers that, in turn, can influence the way in which the study data are collected and the conclusions reached. To try to minimize these effects, which are often called extraneous effects, clinical trials resort to blinding.

This type of approach is called **explanatory**, and it is common in phase III trials and many post-marketing trials. These trials are very robust from a methodological point of view and focus especially on the “isolated” effect of the drug. Their problem, as we have already said, is their lower external validity or applicability to the normal situation of daily life.

As an alternative to the explanatory clinical trial, the **pragmatic clinical trial** arises, which does not focus so much on the isolated effect of the drug, but tries to also take into account the external effects that we have mentioned in order to obtain a broader estimate of efficacy that better reflects use in the real world.

Now that we know the two approaches, the explanatory and the pragmatic ones, let’s see how these two types of clinical trials differ. Let us say before beginning that the two approaches are not exclusive, but rather constitute the two ends of a continuum in which the designers of the study will be able to position themselves according to their interests.

The three aspects that will define whether a trial has a more explanatory or a more pragmatic soul are the definition of the treatment or intervention, the assessment of the results, and the selection of study participants.

The external effects that are not a direct consequence of the intervention can be homogenized between the two groups in the explanatory approach, or included as part of the global effect of the intervention in the pragmatic approach.

Let’s imagine we try a new drug. In real life, the patient takes other treatments, has different lifestyles, may have difficulty paying for the treatment, etc. All of these factors can be strictly controlled in an explanatory approach, thus focusing on the effect of the drug.

On the contrary, the pragmatic approach allows the situation to be the patient’s real one, thereby assessing the overall effect of the intervention and everything that surrounds it. In this context, we could dispense with blinding, which would add in the changes in attitude that using the drug can produce in the doctor or the patient. Although in a trial this implies an increased risk of information bias, it is the real situation we would face in our practice.

This is related to the choice of the primary outcome variable for the trial.

The explanatory approach will choose a variable with greater significance from the pathophysiological point of view. This may be easier to interpret and more objective, in addition to providing information on the biological characteristics of the drug under study. However, we can choose a pragmatic approach and select an outcome variable that is more important to the patient’s daily life.

If we want to maximize the probability of demonstrating the effect of the intervention, we will opt for an explanatory approach. We will select the participants with strict inclusion and exclusion criteria.

If, on the contrary, our goal is to obtain results that are easily generalizable to our patients, these criteria should be less strict so that the study participants are more like our patients. Thus, the study will be more useful for decision-making in our usual clinical practice.

To conclude, we can summarize the main objective behind a pragmatic approach when designing a clinical trial: to assess the global effect of a treatment strategy in the real world.

For this, it will be necessary to randomize participants who are similar to the target population likely to receive the intervention, establish a design similar to that of routine clinical practice, and choose an outcome measure that is useful in daily practice.

But let’s not get confused. The fact that a clinical trial has a pragmatic approach does not mean that it will be easier, faster or cheaper to carry out than an explanatory one. The complexity will depend on the type of disease and the stage of development and type of intervention that we want to study.

And with this we are going to end for today.

We have already said that pragmatic and explanatory are the two extremes of a continuum and that usually no trial can be strictly pigeonholed into either of the two extremes.

So much so, that there are ways to quantify all the aspects that we have developed in this post to give a more or less pragmatic approach to the study in its design phase, as is the case with the PRECIS-2 tool. But that is another story…

The need to perform a correct calculation of the sample size necessary for a study is reviewed, as well as the main factors that influence the sample required to demonstrate the effect considered relevant from a clinical point of view.

Nowadays, the teaching of Medicine and, in general, teaching at the university level, is quite well defined and standardized. And this is not only at a national level, but also at the level of our international environment.

But this has not always been the case. In the beginning, each one went his own way and there was diversity in the ways and objectives of teaching, as you will see in the little story that I am going to tell in this post.

At the end of the 19th century and the beginning of the 20th century, medical schools in the United States were a little lost in terms of their objectives and ways of teaching. Although there were honorable exceptions, such as Johns Hopkins, Harvard, or Michigan Medical Schools, most were of more than poor quality.

The educational system was oriented toward leaving teachers time for their own affairs, which is why it was based on lectures and on a scarcity of practical training which, in some schools, was completed in periods as ridiculously short as two semesters.

Thus, in 1906, the Council on Medical Education of the American Medical Association began to worry about the matter and to collect information. Given what they found, and in order to maintain objectivity, they commissioned a third party, the Carnegie Foundation for the Advancement of Education, to develop a report on the subject.

And the Foundation, in turn, entrusted it to a man named Abraham Flexner, who had graduated from Johns Hopkins some 20 years earlier. This man not only did not delegate to anyone else, but he took the job with great determination: he studied the admission conditions, the facilities, the competence of the faculty and other aspects of the medical schools of the United States and Canada.

So far, nothing out of the ordinary. But the funny thing is that he studied ALL the schools that existed at that time, a total of 155. Great credit for a strenuous job, but he could surely have saved effort (and time and money) if he had selected a representative number of schools and thus reduced the number of establishments to be investigated.

So you have already seen how Mr. Flexner was able to include 100% of his target population in his study, something few can boast about. Of course, in addition to being unnecessary, many times this is not possible and may not even be convenient.

It is one thing to study medical schools and quite another to compare the efficacy or safety of a new treatment with the standard treatment or placebo.

A basic principle for biomedical research, the principle of equipoise, tells us that in order to compare two treatments in a trial, the researcher has to really ignore which of the two is better. Once this principle is no longer fulfilled, it is unethical to continue the trial or carry out a similar one.

The reason is that, although the investigator believes that her new treatment will be better, it may be equivalent to, or even worse than, the comparison option, putting trial participants at risk.

This is one of the reasons that makes the preliminary calculation of the necessary sample size so convenient: we must know what is the minimum number of participants that we need to be able to statistically demonstrate the effect of the new treatment if this effect exists, something that we did not know when we started the study.

It would not be ethical to include more patients than necessary just to obtain the desired p < 0.05. We must establish the clinically important effect that we want to detect and calculate the sample size so that the study has the necessary power to detect it.

The needed sample size is different in each situation and depends on many factors. We are not going to see in this post how to calculate the sample size in each of the situations, but we are going to limit ourselves to reflecting on the conditions that can influence us in the way of calculating it and in the necessary size obtained.

Let’s look at some of these factors that we should take into account when planning the sample size necessary for our study.

A very common vice is to anxiously seek to obtain a p that is statistically significant. When we see a p value lower than 0.05, our faces light up and we no longer think of anything else.

Gross error: the significance of p depends, among other things, on the sample size. And, as we have already commented, it is not a question of obtaining a significant p, but rather of studying a magnitude of effect that we consider clinically relevant.

This difference is determined by the researcher, usually based on her knowledge of the subject she is studying or according to what has been published or known from previous studies.

When we compare two interventions in a clinical trial, we always start by setting the null hypothesis that both interventions are equally effective. We know that, simply by chance, even if the null hypothesis is true, the value of the outcome variable that we obtain will be different in the two groups.

For example, suppose we study two hypotensive drugs, A and B, and measure the difference in mean arterial pressure between the end and the beginning of the intervention. As we have mentioned, the null hypothesis assumes that the differences will be equal in the two groups.

However, as we already know, the values that we will obtain will be different, so we will ask ourselves what is the probability that this difference is due to chance. If the probability is less than 5% (p < 0.05), we will feel confident enough to reject the null hypothesis and we will conclude that one of the treatments is more effective than the other.

The problem is that, no matter how small the difference between the two groups, statistical significance (p < 0.05) can be achieved if the sample size is increased enough.

Imagine that treatment A lowers blood pressure by 20 mmHg and B, by 18 mmHg. If we include a sufficient number of participants, we can obtain a p < 0.05, but can we really conclude that A is better than B with only this difference? Obviously not. From a clinical point of view, I would say that they have similar efficacy.
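We can sketch this numerically. Assuming, purely for illustration, a standard deviation of 15 mmHg for the change in blood pressure, the test statistic for that clinically trivial 2 mmHg difference grows with the square root of the sample size:

```python
import math

delta = 2.0    # observed difference between the drugs, in mmHg
sigma = 15.0   # assumed SD of the blood pressure change (purely illustrative)

# Two-sample z statistic for a difference of means: it grows with sqrt(n),
# so ANY nonzero difference ends up "significant" with enough patients
for n in (50, 500, 5000):
    z = delta / (sigma * math.sqrt(2 / n))
    verdict = "significant" if z > 1.96 else "not significant"
    print(f"n = {n:4d} per group: z = {z:.2f} ({verdict})")
```

With 50 patients per group the difference is nowhere near significance; with 5,000 it sails past p < 0.05, without having become one bit more relevant clinically.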

We should specify a difference that is relevant to us. For example, we may decide that we want to detect a difference between the two drugs of 20 mmHg or more. With this difference, we will calculate the number of participants necessary for the p value to be significant if this difference exists. We will need neither one participant more nor one less.

If we stay below this necessary size, even if we detect a difference of 20 mmHg, the p may not be significant. The study will not have the necessary power to detect the effect due to an insufficient sample size.

If the difference detected is less than 20 mmHg, the p will not be significant either. It’s okay, there is no clinically relevant difference between the two treatments. What would not make sense is to increase the sample size to demonstrate the statistical significance of an effect less than that considered clinically relevant.

One caveat before leaving this point: everything we have said takes place in the realm of probabilities, so we always have a certain probability of making an error when performing hypothesis testing (type I error and type II error).

This is another important factor. The greater the variability of the outcome variable of our study in the target population, the larger the sample size required to detect the same effect size.

The variability in the population is reflected in the standard deviation, which influences the calculations of the standard error and the confidence intervals. The larger the standard error of the variable, the larger the required sample size, since estimates on the population are less precise.

The same happens with the precision of the estimate that we want to make. The more precise we want our estimate to be, the larger the sample size needed, and vice versa.

The reliability of the study depends on two parameters whose value we must set to perform the sample size calculation: the confidence level and the power of the study.

The confidence level reflects the degree of certainty that we have that, if we repeat the study under the same conditions, we will obtain a similar result again. Usually a confidence level of 95% is chosen, although we can raise or lower it depending on how strict we want to be with the necessary degree of security.

Power, for its part, reflects the probability that the results we obtain in the study represent reality. As we have already said, it is the probability that the study will detect the effect, if it exists. It is usually set at 80%, although it can be raised to 90% in some studies.

As it is easy to intuit, the higher the level of confidence and the greater the power of the study, the larger the required sample size, and vice versa.
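All these ingredients come together in the usual formula for comparing two means: n per group = 2 × ((z_{1−α/2} + z_{1−β}) × σ / δ)². Here is a sketch in Python, with assumed values (SD of 15 mmHg, difference of 20 mmHg) chosen only to continue the earlier example:

```python
import math
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Participants per group needed to detect a difference delta between
    two means, assuming a common SD sigma and a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # confidence level
    z_beta = NormalDist().inv_cdf(power)           # power
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

# Assumed values, for illustration only
print(n_per_group(20, 15))               # clinically relevant difference
print(n_per_group(20, 15, power=0.90))   # more power -> more patients
print(n_per_group(2, 15))                # tiny effect -> huge sample
```

As the formula makes obvious, raising the power or the confidence level increases n, and shrinking the effect size δ increases it quadratically.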

We are talking about clinical trials all the time, but the sample size calculation applies to other methodological designs as well.

Thus, we can calculate the sample size necessary to make prevalence estimates in cross-sectional studies with a certain precision, to compare the association and risk measures in observational studies, to establish the correlation between two variables, etc.

Logically, the type of design will influence the way the sample size is calculated and the number of participants required.

It is important to establish the relationship that exists between the two groups that we want to compare, which, as we already know, can be independent or paired.

As is already known, the variability is greater between independent groups than between paired groups, which will influence the necessary sample size, which will always be greater when we handle independent groups.

Hypothesis testing can be one-tailed or two-tailed (unilateral or bilateral).

The two-tailed contrast assumes in its alternative hypothesis that there is a difference between the two compared interventions, but does not say which of the two is more effective. For its part, the one-tailed test does establish in the alternative hypothesis which of the two interventions is superior.

The most common choice is the two-tailed contrast, since when we carry out an experiment we do not know the direction the result may take. However, if we are sure of what the direction of the effect is going to be, we can adopt a one-tailed test.

Two-tailed contrast is more conservative, making it more difficult to achieve statistical significance than with one-tailed contrast, and also requires a larger sample size.
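The difference between the two contrasts boils down to where we place the α. This two-line sketch shows the two critical values for α = 0.05:

```python
from statistics import NormalDist

alpha = 0.05
z_two_tailed = NormalDist().inv_cdf(1 - alpha / 2)  # alpha split between both tails
z_one_tailed = NormalDist().inv_cdf(1 - alpha)      # all of alpha in one tail

print(f"two-tailed: reject if |z| > {z_two_tailed:.3f}")
print(f"one-tailed: reject if  z > {z_one_tailed:.3f}")
```

The one-tailed bar (about 1.645) is lower than the two-tailed one (about 1.960), which is precisely why choosing it just to reach significance more easily is frowned upon.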

In any case, let’s not get confused: the elegant thing is to carry out a bilateral contrast and, if we opt for a unilateral one, it should never be to reach the significant p more easily or with fewer participants.

Logically, the sample size will be different if we want to measure one or more variables and it will also depend on the type of variables. This aspect is also linked to something we have already talked about, the precision with which we want to estimate each variable.

We have already seen the factors that can influence the number of participants that our study should have if we want it to be able to detect an effect that we consider clinically relevant.

To summarize, we can say that the size of the necessary sample will be greater the lower the probability of type I and type II error that we accept, the greater the dispersion of the variable in the study population and the smaller the size of the effect.

The sample size will also increase when we compare independent groups, when we want to compare more than one variable, and when we opt for a two-tailed hypothesis test.

And here we are going to leave the subject for today.

In case you’re curious about what happened to Mr. Flexner, I can tell you that his report was devastating. He concluded that 31 schools could train doctors better than the 155 he studied. Therefore, he recommended reducing the number of schools and, as a consequence, the number of students.

According to Flexner, too many doctors were being trained for the needs of the market. I don’t know, I think this sounds like something to me…

And now we are definitely going. We have talked a lot about the importance of having the right sample size and the factors that can influence it. However, it is not enough that the size is well calculated.

A sample of adequate size will be of no use if the sampling technique provides us with a sample that is not representative of the study population. But that is another story…

The NNT was designed to assess the beneficial effect of a treatment to reduce the risk of an unpleasant event occurring in an intervention group of interest, always with respect to what was observed in a control group. Some of the aspects to take into account to use it correctly are reviewed.

Once upon a time, the scientific paradigms and the way of thinking of researchers (and also of clinicians) began to change, from “we’re doing well” to really wanting to know what was the validity of the information they collected in their experiments or in their daily practice.

It is in this context that, 33 years ago, what would become one of the lords of the impact measures of clinical studies came to light: the number needed to treat, known worldwide by its acronym, NNT.

Its initial usefulness was to assess the beneficial effect of a treatment to reduce the risk of an unpleasant event occurring in an intervention group of interest, always with respect to what was observed in a control group. Put more simply, it emerged as a measure of impact in the context of randomized and controlled clinical trials.

The NNT was initially well received, since it has the great merit of combining, in a single parameter, the concepts of statistical significance and clinical relevance (provided that its p value or confidence interval is reported). In addition, it is easy for clinicians to interpret without the need for in-depth statistical knowledge.

Following a parallel with the ages of life, everything went well during its childhood and it continued to grow during its youth, extending to many areas other than the conventional parallel clinical trial.

The problem with youth, in addition to coming at the beginning of life, is that it does not last long. And in maturity, although it continues to have enthusiastic followers, the NNT has also begun to accumulate detractors and critics who have started to point out its defects.

It is about these tribulations that the NNT suffers during its maturity that we are going to talk in this post. Although there are complaints produced by some of its unpleasant characteristics from the mathematical point of view due to its logarithmic inheritance, most are based, in reality, on a poor understanding of its meaning or on a not entirely correct use of the parameter.

As we have already said, the NNT was initially developed to assess the efficacy of a treatment to reduce the risk of producing an event in an interest group.

For example, if the mortality of fildulastrosis is 5% per year and with a new treatment it drops to 3%, the treatment reduces the risk by 2%, so the NNT will be 1 / 0.02 = 50. This means that for every 50 patients we treat for 1 year, we will prevent one death thanks to the treatment. Of course, that patient can still die if we prolong the follow-up, as we will see later.
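The calculation from the example can be sketched in a couple of lines (the risks are, of course, the fictitious ones from the text):

```python
# Fictitious risks from the example
risk_control = 0.05   # 5% annual mortality without the new treatment
risk_treated = 0.03   # 3% annual mortality with it

arr = risk_control - risk_treated   # absolute risk reduction
nnt = 1 / arr                       # number needed to treat

print(f"ARR = {arr:.2%}, NNT = {nnt:.0f}")
```

Treat 50 patients for a year and, on average, one death is prevented that would have occurred without the treatment.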

This approach is very simplistic, since the NNT is actually dual in nature.

The NNT can be understood as the number of patients we need to treat in order to increase the number of expected positive events by one, or to decrease the number of negative events by one, all during a specific follow-up period.

But it can also be understood in the opposite sense, that of harming: the number needed to treat in order to increase the number of negative events by one, or to decrease the number of expected positive events by one, again during a given follow-up period.

This can make it confusing to assess the NNT just by its numerical value. To avoid this inconvenience, some authors have thought that we can call each of the NNT dualities differently, so that we will always know exactly what we are talking about.

Basically, we would be specifying the direction of the effect we are studying.

Thus, we would talk about the number necessary to treat to benefit (NNTB) when we want to express the number to treat to achieve a beneficial effect (or avoid an unpleasant one), and the number necessary to harm (NNTH) when we want to refer to the number necessary for a negative effect to occur (or to avoid a positive one that could occur without treatment).

Another problem that greatly bothers NNT detractors is that its calculation can, at times, generate negative values of its point estimate and, at other times, confidence intervals that extend into the realm of negative numbers (this happens whenever the confidence interval of the risk difference crosses zero). Here, the difficulty lies in finding a logical meaning for a negative NNT.

Let’s imagine that we obtain an NNT = 10 with a 95% confidence interval (95% CI) of 8 to 12. Here we have no problem: we have to treat 10 patients (point estimate), although this value can range from 8 to 12 (interval estimate).

The problem arises when negative numbers appear. For example, if the NNT = 10 with a 95% CI of 8 to -12, how do we assess it?

Well, here it helps to resort to the dual nature of the NNT that we mentioned earlier. If we think about it, NNT values between -1 and 1 are impossible. Thus, the interval from 8 to -12 can be split in two: an NNTB from 8 up to infinity and an NNTH from 12 up to infinity. The usefulness may be limited from a clinical point of view, but at least we will have given it a meaning.
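One way to make this dual reading concrete is to translate the limits of the risk-difference confidence interval into NNTB/NNTH notation. The helper below is a hypothetical sketch of my own, not any standard function:

```python
def nnt_bounds(rd_low, rd_high):
    """Translate a risk-difference 95% CI into NNTB/NNTH notation.

    A positive risk difference (benefit) maps to an NNTB and a negative
    one (harm) maps to an NNTH, so an interval crossing zero spans
    'NNTB x to infinity' plus 'NNTH y to infinity'.
    """
    def label(rd):
        if rd > 0:
            return f"NNTB {round(1 / rd)}"
        if rd < 0:
            return f"NNTH {round(-1 / rd)}"
        return "infinity"
    return label(rd_high), label(rd_low)

# The interval from the text, NNT from 8 to -12, corresponds to a
# risk-difference CI from 1/8 = 0.125 down to -1/12.
print(nnt_bounds(-1 / 12, 1 / 8))  # ('NNTB 8', 'NNTH 12')
```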

As we discussed in a previous post, when we want to study the efficacy of an intervention, the ideal would be to give the new treatment, end the follow-up period and measure the effect. Then we would use our time machine to go back to the initial moment and, instead of the treatment under study, we would give its alternative, end the follow-up and measure the result.

Once this is done, we would compare the two results. The problem, the most awake of you will have already noticed, is that the time machine has not yet been invented. This means that to obtain this other result, which we call potential or counterfactual, we have to resort to the control group of the trials, which serves as a substitute.

If we think a bit about the implications of the counterfactual theory, although the efficacy of the treatment under study is always the same, the value of the NNT will depend on which intervention we compare it with. Therefore, to correctly interpret the NNT, the comparator that we have used must always be explicitly specified.

An NNT of 10 for a given treatment can only be assessed if we specify which control intervention it was compared against and how long the follow-up was, especially when comparing NNTs obtained under different conditions.

So keep it in mind: the value of the NNT must be expressed together with the treatment alternative and the follow-up period. Failure to do so may make your assessment difficult and misleading.

Virtually all books and manuals agree that the numerical value obtained for the NNT should be rounded up to the nearest whole number. It seems logical: it doesn’t make much sense to say that we have to treat 4.8 patients, so we round it up to 5.

The problem with rounding is that it adds imprecision to the estimate and can be misleading.

For example, any absolute risk reduction between 0.52 and 0.9 will be equivalent, after rounding, to an NNT of 2, since 1/0.9 ≈ 1.11 and 1/0.52 ≈ 1.92, and both round up to 2. However, there is a big difference between a reduction of 52% and one of 90%. We should not value them with the same NNT.
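The distortion is easy to see in a couple of lines of Python (`math.ceil` does the rounding up):

```python
import math

# Two very different absolute risk reductions that end up with the same NNT.
for arr in (0.52, 0.9):
    nnt = 1 / arr                       # 1.92... and 1.11...
    print(arr, round(nnt, 2), math.ceil(nnt))
# ARR 0.52 -> NNT 1.92, rounded up to 2
# ARR 0.9  -> NNT 1.11, rounded up to 2
```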

So, if nobody throws up their hands in horror on hearing that the average woman bears 1.2 children, why the phobia of giving decimals with the NNT? After all, it is still an estimator that we have to know how to interpret.

If we see an NNT of 6.7 we can conclude that we would have to treat an average of 6 to 7 patients to achieve one beneficial effect during a certain period of time. One warning: if we do so, we must make it clear that the estimate lies between 6 and 7, but that these are not the limits of the 95% CI. Let us not get confused.

You already know that all studies are subject to the effect of confounding variables, especially when they are not randomized. We are used to seeing association measures adjusted for the variables that the authors believe may act as confounders. However, this is often not done for the NNT, and only its crude value is provided.

Other times, what happens is that inappropriate adjustment methods are used. A wide variety of methods have been developed to calculate the NNT, crude and adjusted, in multiple scenarios. If you don’t know which one to use, find someone who does before applying the wrong one.

We have said it before, but it does not hurt to insist: the value of the NNT depends on the length of the follow-up period, so it must be specified. The proportion of events that take place increases as time goes by.

For example, if the treatment produces a reduction in the risk of the event that remains constant over time, the value of the NNT will be lower the longer the follow-up period, because the cumulative risks (and their difference) grow with time. It is easy to understand, then, that the duration of follow-up must be known to interpret the value of the NNT correctly.

Consider two clinical trials of two different interventions, one with a follow-up period of 2 years and the other of 5. Even if the two studies gave us an NNT of 8, it would not be the same to treat 2 years as 5 to avoid an unpleasant event in 1 out of 8 patients treated.
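The shrinking of the NNT with longer follow-up is easy to verify. With hypothetical constant event rates of 5% per person-year in the control group and 3% in the treated group, and exponentially distributed survival times, a quick sketch gives:

```python
import math

# Hypothetical constant event rates (per person-year).
rate_c, rate_i = 0.05, 0.03

for years in (1, 2, 5):
    # Cumulative risks under exponentially distributed survival times.
    risk_c = 1 - math.exp(-rate_c * years)
    risk_i = 1 - math.exp(-rate_i * years)
    nnt = 1 / (risk_c - risk_i)
    print(years, round(nnt, 1))  # NNT of 52.0, 27.1 and 12.2
```

Same treatment effect, three very different NNTs: one more reason to always state the follow-up period.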

In survival studies and when evaluating the results per person-time, the frequency of the study event must be taken into account when deciding the method of calculating risk reduction, which, in turn, we will use to calculate the NNT. This should be done even if the risk is constant and the follow-up period of the participants is homogeneous.

We are going to see how the NNT would be calculated with an example that we are going to invent on the fly. Imagine that we conducted the study and observed 10 cases of death per 100 person-years in the intervention group and 5 cases per 100 person-years in the control group.

Assuming that survival times are exponentially distributed, we first calculate the proportions or cumulative risks in the intervention group (Ri) and in the control group (Rc):

R_{i} = 1 – e^{-10/100} = 0.095

R_{c} = 1 – e^{-5/100} = 0.048

Now we can calculate the NNT as the inverse of the risk difference, as we already know:

NNT = 1 / (0.095 – 0.048) ≈ 21.3

The common mistake in this situation is to directly use the number of events according to the following formula:

NNT = person-time / (difference of events)

If we apply it to the previous case, it would look like this:

NNT = 100 / (10-5) = 20

As you can see, the first method, which is the most appropriate, is somewhat more conservative and gives higher NNT values.

Only in cases where the frequency of the event is very low can we estimate the NNT directly using the number of observed events. Imagine that, in the previous example, we observed 5 deaths in the intervention group and one in the control group. We could calculate the NNT as follows:

NNT = 100 / (5-1) = 25

Anyway, do not be tempted by the easy way. It is only correct to calculate the NNT without first converting rates into cumulative proportions when the event is very infrequent and, in addition, the risk difference between the two groups remains proportional over time. When in doubt, convert.
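The whole worked example can be reproduced in a few lines of Python; any small difference from the figures in the text is just the effect of rounding the risks before inverting:

```python
import math

# Event rates per 100 person-years from the example.
rate_1 = 10 / 100   # group with 10 deaths per 100 person-years
rate_2 = 5 / 100    # group with 5 deaths per 100 person-years

# Correct route: convert rates to cumulative risks (exponential survival),
# then take the inverse of the risk difference.
risk_1 = 1 - math.exp(-rate_1)   # ~0.095
risk_2 = 1 - math.exp(-rate_2)   # ~0.049
nnt_exact = 1 / (risk_1 - risk_2)

# Common shortcut: use the raw event counts directly.
nnt_naive = 100 / (10 - 5)

print(round(nnt_exact, 1))  # 21.6
print(nnt_naive)            # 20.0
```

As the text says, the exact route is the more conservative one: it always yields the higher NNT.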

To recapitulate a bit what we have said, we can recommend taking a series of precautions to make a proper use of the NNT and to be able to interpret it correctly.

First, never use the NNT without specifying the alternative treatment, the direction of the effect, and the length of the follow-up period.

Second, always calculate your confidence interval. In case it has negative values, consider the dual nature of benefit and harm to try to make a more understandable interpretation.

Third, don’t shy away from decimals. Remember that this is just another estimate. You should have no problem evaluating its point estimate (even if it is not a whole number) and its confidence interval.

Finally, check that you are using the correct methodology in more complex situations, such as those where confounding factors may be involved or in survival studies.

If we use it wisely, the NNT will be able to continue its adventures through maturity and stay with us for at least another 30 years.

And here we are going to leave it for today.

We have already seen the usefulness of the NNT for assessing the efficacy of an intervention: it tells us, for example, how many deaths we can avoid during the study follow-up compared with what would have happened had we not intervened.

But what about those who don’t die? Are there participants who will die the same if we do or do not treat them? Well, the NNT doesn’t tell us anything about that. To study this aspect, which would improve the NNT assessment, we need to resort to another parameter: the number remaining at risk. But that is another story…

Fisher’s test is the exact method used when you want to study if there is an association between two qualitative variables, that is, if the proportions of one variable are different depending on the value of the other variable.

Today we are going to remember one of the most beautiful stories, in my humble opinion, in the history of biostatistics. Although surely there are better stories, since my general historical ignorance is greater than the number of decimal places of the number pi.

Imagine we are at Rothamsted Station, an agricultural research center located in Harpenden, in the English county of Hertfordshire. We are at some moment in the beginning of the decade of the 20s of the last century.

Three scientists, very British, are preparing to have tea: two men and one woman. The woman is Blanche Muriel Bristol, an expert on algae and fungi. With her is William Roach, a biochemist who is married to Muriel.

The third one is a geneticist who has started working on the station and will eventually become famous for being one of the founders of population genetics and neo-Darwinism, as well as for a few other little things, like the concept of null hypothesis and hypothesis contrast. Yes, friends, he is the great Ronald Fisher.

Fisher prepares the teacups and gallantly offers the first to Muriel, who declines. She looks at Ronald and says: I like tea with milk, but only if you put milk in first. If done the other way around, it gives it a flavor that I don’t like at all.

Fisher thinks Muriel is teasing him, so he insists, but she stands her ground. I suspect Fisher then began to think that Muriel was actually a bit of a fool, but her husband came to her rescue. William proposes making 8 cups of tea and, at random, putting the milk in first in 4 of them and the tea in first in the other 4.

To Fisher’s surprise, Muriel guesses the order in which the milk from the eight cups had been served, although she is not allowed to taste more than two cups at a time. Luck or a privileged palate?

This, which was one of the first randomized experiments in history, if not the first, left Fisher very thoughtful. So he developed a mathematical method to find out the probability that Muriel got it right by pure chance. And this method is the subject of our entry today: Fisher’s exact test.

Before fully entering into Fisher’s exact test, we are going to clarify a series of concepts to understand well what we are going to do.

When we want to make a hypothesis contrast between two qualitative variables (in this case, to check their independence) we can use several tests that compare their frequencies or their proportions.

If we are dealing with independent data, we can choose an approximate test, such as the chi-square test, or an exact test, such as Fisher’s. If we are dealing with paired data, we can use McNemar’s test (for 2×2 contingency tables) or Cochran’s Q method (for 2×K tables).

And we have talked about exact and approximate tests. What does this mean?

The approximate tests calculate a statistic with a known probability distribution in order, according to its value, to know the probability that this statistic acquires values equal to or more extreme than the observed one. It is an approximation that is made at the limit when the sample size tends to infinity.

For their part, exact tests calculate the probability of obtaining the observed results directly. This is done by generating all the possible scenarios that go in the same direction as the observed hypothesis and calculating the proportion in which the condition we are studying is fulfilled.

And which of the two types should we choose? Well, even the people who know about these things do not manage to agree.

The approximate methods are simpler from a computational point of view, but with the computational power of today’s computers, this argument does not seem to be a reason to choose them. On the other hand, the exact ones are more precise when the sample size is smaller or when some of the categories have a low number of observations.

But if the number of observations is very high, the result is similar using an exact method or an approximate one.

As a rule of thumb, it is recommended to use an exact test when the number of observations is less than 1000 or when there is a group with a number of expected events less than 5. However, if you have a computer, there is no reason to complicate your life: use an exact one.

All this does not mean that we cannot use an approximate test if the sample is small, but we will have to apply a continuity correction, as we saw in a previous post.

Fisher’s exact test is the exact method used when you want to study if there is an association between two qualitative variables, that is, if the proportions of one variable are different depending on the value of the other variable.

In principle, it seems that Fisher designed it with the idea of comparing two dichotomous qualitative variables. In simple terms, for use with 2×2 tables.

However, there are also extensions to the method to do it with larger tables. Many statistical programs are capable of doing this, although, logically, they put more stress on the computer. You can also find calculators available on the Internet.

Fisher’s exact test assumes the null hypothesis that the two variables are independent, that is, the values of one do not depend on the values of the other.

The only necessary condition is that the observations in the sample are independent of each other. This will be true if the sampling is random, if the sample size is less than 10% of the population size and if each observation contributes only to one of the levels of the qualitative variable.

Furthermore, the marginal frequencies of the rows and columns of the contingency tables of the different possible scenarios must remain fixed. Do not worry about this, it will be better understood when we see an example. If this is not true, we can continue to use the test, but it will no longer be exact and it will become more conservative.

After much thought about the problem of tea and the skills of Muriel Bristol, the brilliant Fisher showed that he could calculate the probability of any of the contingency tables using the hypergeometric probability distribution, according to the formula in the figure.
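The figure with the formula is not reproduced here, but the idea is that each table has a hypergeometric probability. For the original tea experiment (8 cups, 4 with milk first, and Muriel forced to point to exactly 4), the chance of guessing all of them correctly by luck is easy to compute; this is just an illustrative sketch:

```python
from math import comb

# Of 8 cups, 4 had milk poured first. Muriel must choose exactly 4 cups,
# so guessing all of them correctly by chance has hypergeometric
# probability C(4,4) * C(4,0) / C(8,4) = 1/70.
p_all_correct = comb(4, 4) * comb(4, 0) / comb(8, 4)
print(round(p_all_correct, 4))  # 0.0143
```

A probability below 0.05: Muriel’s palate passed the test.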

Thus, Fisher’s exact test calculates the probabilities of all possible tables and adds those of the tables that have p values less than or equal to the observed one. This sum, multiplied by two, gives us the p-value for a two-tailed hypothesis contrast.

According to the value of p, we will only have to solve our hypothesis contrast in a similar way as we do with any other contrast test.

To finish understanding everything we have said, we are going to repeat the tea experiment, but I, instead of Muriel, am going to ask my cousin to give us a hand; we have not caused him any trouble for a long time.

Of course, I can’t make my cousin drink tea, so let’s see if he can tell whether what he’s drinking is Scottish or Irish whiskey. He claims that he is able to distinguish a Scottish whiskey from anything else.

So, to test his resistance to alcohol, as well as his palate skills, I randomly offer him 11 shots of Scottish and 11 shots of Irish.

The results can be seen in the first table of the attached figure.

As you can see, he hits 7 of the 11 Scottish shots and only 2 of the 11 Irish ones. It seems that he is right in his claim and that he has a refined palate. But we, as Fisher did with Muriel, are going to see whether he has just been lucky.

As we have said above, it is necessary to calculate the possible tables that have a lower probability than the observed one and within the sense of our hypothesis. We will do this by reducing the minimum frequency of each of the columns until one of them reaches zero.

Also, we will adjust the other boxes so that the marginals remain constant. Otherwise, we already know that the test would no longer be exact. You can see the two possible tables until the hits with Irish whiskey reach zero.

Now we only have to calculate the probability of each table, add them all up and multiply by two. We obtain a value of p = 0.08 for a two-tailed contrast. As the null hypothesis says that the ability to guess correctly does not depend on the type of whiskey, and we cannot reject it, we cannot rule out that my cousin’s boast was just a matter of luck.
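For anyone who wants to check the arithmetic, here is a sketch of that enumeration in Python (in real life you would use a statistical package, as we do below):

```python
from math import comb

# Whiskey experiment: 7 hits in 11 Scottish shots, 2 hits in 11 Irish ones.
# With the margins fixed (9 hits, 13 misses, 11 shots of each kind), each
# possible table is determined by k, the number of Scottish hits.
def p_table(k, hits=9, scottish=11, total=22):
    """Hypergeometric probability of the table with k Scottish hits."""
    return comb(scottish, k) * comb(total - scottish, hits - k) / comb(total, hits)

p_obs = p_table(7)   # probability of the observed table
# Two-sided p-value: add up every table as likely or less likely than the
# observed one (equivalent here, by symmetry, to doubling one tail).
p_value = sum(p_table(k) for k in range(10) if p_table(k) <= p_obs)
print(round(p_value, 2))  # 0.08
```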

Coming to the end of this post, I’d like to warn you never to even think of doing a Fisher’s exact test by hand. This absurd example is extremely simple, but our real experiments will surely be a bit more complex. Use a statistical application or an Internet calculator.

Let’s solve the example using the R program.

First, we enter the data to build the contingency table with these two consecutive commands:

*data <- data.frame(kindw = c(rep("irh", 11), rep("sct", 11)), hit = c(rep(TRUE, 2), rep(FALSE, 9), rep(TRUE, 7), rep(FALSE, 4)))*

*my_table <- table(data$kindw, data$hit, dnn = c("Whisky", "Hit"))*

Finally, we perform Fisher’s exact test:

*fisher.test(x = my_table, alternative = "two.sided")*

On the output screen, the program provides us with the p value (p = 0.08), its confidence interval and the odds ratio between the two variables. Remember that Fisher’s test only tells us if there is a statistically significant difference, but if we want to measure the strength of the association between the two variables we have to resort to other types of measures.

And if someone is looking for the value of the Fisher statistic among the program’s output data, I’m sorry to say that he or she has to re-read the entire post from the beginning.

As we have already said, exact tests calculate the probability directly, without the prior calculation of a statistic that follows a known probability distribution. Fisher’s statistic simply does not exist.

And here we are going to leave it for today. We have seen how Fisher’s exact test allows us to study the independence of two qualitative variables but requires one condition: that the marginal frequencies of rows and columns remain constant.

And this can be a problem, because in many biological experiments we will not be able or not sure of meeting this requirement. What happens then? As always, there are several alternatives.

The first, keep using the test. The drawback is that it is no longer an exact test and loses its advantages over approximate tests. But we could use it.

The second is to use another contrast test that does not lose power when the marginals of the table are not fixed, such as Barnard’s test. But that is another story…
