Science without sense…double nonsense

Pills on evidence-based medicine

The three pillars of wisdom


Surely all of us, more often than we would like, have found a small gap in our knowledge that made us doubt which diagnostic or treatment steps to take with one of our patients. Following the usual practice, and trying to save time and effort, we have certainly asked our closest colleagues, hoping that they would solve the problem and spare us the need to deal with the dreaded PubMed (who said Google!?). As a last resort, we have consulted a medical book in a desperate attempt to get answers, but not even the fattest books can free us from having to search a database occasionally.

And in order to do it well, we should follow the five steps of Evidence-Based Medicine: formulating our question in a structured way (first step), doing our bibliographic search (second step), critically appraising the articles we find and consider relevant to the topic (third step), combining what we have found with our experience and the preferences of the patient (fourth step) and evaluating how it influences our performance (fifth step).

So we roll up our sleeves, make our structured clinical question, and enter PubMed, Embase or TRIP looking for answers. Covered in a cold sweat, we get past the initial figure of 15234 results and find the desired article that we hope will enlighten our ignorance. But, even though our search has been impeccable, are we really sure we have found what we need? Here starts the arduous task of critically appraising the article to assess its actual usefulness for solving our problem.

This step, the third of the five we have seen and perhaps the most feared of all, is indispensable within the methodological flow of Evidence-Based Medicine. And this is so because all that glitters is not gold: even articles published in prestigious journals by well-known authors may be of poor quality, contain methodological errors, have nothing to do with our problem or have errors in the way the results are analyzed or presented, often in a suspiciously interested way. And this is not true just because I say so: there are even people who think that the best place to send 90% of what is published is the trash can, regardless of whether the journal has a high impact factor or the authors are more famous than Julio Iglesias (or his son Enrique, for that matter). Our poor excuse to justify our lack of knowledge about how to produce and publish scientific papers is that we are clinicians rather than researchers, and of course the same is often the case with journal reviewers, who overlook all the mistakes that clinicians make.

Thus, it is easy to understand that critical appraisal is a fundamental step in order to take full advantage of the scientific literature, especially in an era in which information abounds but we have little time available to evaluate it.

Before entering into the systematics of critical appraisal, we will take a look over the document and its abstract to see whether the article in question can meet our expectations. The first thing we must always do is assess whether the paper answers our question. This is usually the case if we have correctly formulated the structured clinical question and done a good search of the available evidence but, anyway, we should always check that the type of population, study, intervention, etc. match what we are looking for.

Once we are convinced that the article is what we need, we will perform the critical appraisal. Although the details depend on the type of study design, we always rely on three basic pillars: validity, relevance and applicability.

Appraising validity consists of checking the scientific rigor of the paper to find out how close to the truth it is. There are a number of criteria common to all studies, such as a correct design, an adequate population, the existence of homogeneous intervention and control groups at the beginning of the study, proper follow-up, etc. Someone thought this should rather be called internal validity, so we can also find it under that name.

The second pillar is relevance, or clinical importance, which measures the magnitude of the effect found. Imagine that a new antihypertensive drug is better than the usual one with a p-value with many zeroes, but that it decreases blood pressure by an average of 5 mmHg. No matter how many zeroes the p-value has (the result is statistically significant, we cannot deny it), we have to admit that the clinical effect is rather ridiculous.
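
A quick calculation helps to see why the p-value says nothing about the size of the effect. The sketch below assumes a simple two-sample z-test and made-up figures (a standard deviation of 15 mmHg and two hypothetical sample sizes); none of these numbers come from a real trial.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z_test(mean_diff, sd, n_per_group):
    """Two-sided z-test for a difference in means between two equal-sized groups.

    Assumes a common, known standard deviation (an illustrative simplification).
    """
    standard_error = sd * sqrt(2 / n_per_group)
    z = mean_diff / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical figures: the same 5 mmHg reduction, SD of 15 mmHg in both arms.
z_large, p_large = two_sample_z_test(mean_diff=5, sd=15, n_per_group=1000)
z_small, p_small = two_sample_z_test(mean_diff=5, sd=15, n_per_group=20)

print(f"n = 1000 per group: z = {z_large:.2f}, p = {p_large:.1e}")
print(f"n = 20 per group:   z = {z_small:.2f}, p = {p_small:.2f}")
```

The effect is 5 mmHg in both cases; only the sample size, and with it the p-value, changes.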

The last pillar is clinical applicability, which consists of assessing whether the context, patients and intervention of the study are sufficiently similar to our own setting to generalize the results. Applicability is also known as external validity.

Not all scientific papers can be described favorably in these three aspects. It may happen that a valid study (internal validity) finds a significant effect that cannot be applied to our patients. And we must not forget that we are using just a working tool. Even the most suitable study must be appraised in terms of benefits, harms and costs, and patient preferences, the latter being an aspect that we forget more often than would be desirable.

For those with the memory of a goldfish, there are templates from the CASP group that are recommended as a guide for critical reading without forgetting any important aspect. Logically, the specific measures of association and impact and the requirements to meet internal validity criteria depend on the type of study design we are dealing with. But that’s another story…

A bias by absence


We can find strength through unity. It is a fact. Great goals are achieved more easily with the joining of the effort of many. And this is also true in statistics.
In fact, there are times when clinical trials do not have the power to demonstrate what they are pursuing, either because of lack of sample due to time, money or difficulty recruiting participants, or because of other methodological limitations. In these cases, it is possible to resort to a technique that allows us sometimes to combine the effort of multiple trials in order to reach the conclusion that we would not reach with any of the trials separately. This technique is meta-analysis.
Meta-analysis gives us an exact quantitative mathematical synthesis of the studies included in the analysis, generally the studies retrieved during a systematic review. Logically, if we include all the studies that have been done on a topic (or, at least, all that are relevant to our research), that synthesis will reflect the current knowledge on the subject. However, if the collection is biased and we lack studies, the result will reflect only the articles collected, not the total available knowledge.
When planning the review we must establish a comprehensive search strategy to try to find all the articles. If we do not, we may incur retrieval bias, which will have the same effect on the quantitative analysis as publication bias. But even with modern electronic searches, it is very difficult to find all the relevant information on a particular topic.
In cases of missing studies, the importance of the effect will depend on how the studies are lost. If they are lost at random, the only problem will be having less information, so our results will be less precise and the confidence intervals wider, but our conclusions may still be correct. However, if the articles that we do not find are systematically different from those we find, the result of our analysis may be biased, since our conclusions can only be applied to that sample of papers, which will be a biased sample.
There are a number of factors that may contribute to publication bias. First, studies with statistically significant results are more likely to be published and, among these, those with larger effects are more likely still. This means that studies with negative results or with effects of small magnitude may not be published, so we will draw a biased conclusion from the analysis of only the studies with a large, positive result.
Secondly, of course, published studies are more likely to come into our hands than those not published in scientific journals. This is the case of doctoral theses, conference communications, reports from government agencies or even studies pending publication by researchers working on the subject. For this reason it is so important to run a search that includes this type of work, which falls under the term gray literature.
Finally, a number of biases can be listed that influence the likelihood that a paper will be published or retrieved by the investigator performing the systematic review such as language bias (we limit the search by language), availability bias (to include only those studies that are easy to retrieve by the researcher), cost bias (to include studies that are free or cheap), familiarity bias (only those of the discipline of the investigator), duplication bias (those who have significant outcomes are more likely to be published more than once) and citation bias (studies with significant outcome are more likely to be cited by other authors).
One may think that losing studies during the review cannot be so serious, since it could be argued that studies not published in peer-reviewed journals are often of poorer quality, so they do not deserve to be included in the meta-analysis. However, it is not clear that scientific journals ensure the methodological quality of the study or that this is the only way to do so. There are researchers, such as those working for government agencies, who are not interested in publishing in scientific journals, but in producing reports for those who commission them. In addition, peer review is no guarantee of quality because, too often, neither the researcher who performs the study nor those in charge of reviewing it have the methodological training needed to ensure the quality of the final product.
There are tools to assess the risk of publication bias. Perhaps the simplest is to draw a forest plot ordered with the most precise studies at the top and the least precise at the bottom. As we move down, the precision of the results decreases, so the effect should oscillate to both sides of the summary measure. If it only oscillates towards one side, we can indirectly assume that we have failed to find the studies that must exist oscillating towards the opposite side, so we most likely have publication bias.
Another similar procedure is the funnel plot, as seen in the attached image. In this graph the effect size is plotted on the X axis and a measure of the variance or the sample size, inverted, on the Y axis. Thus, the largest and most precise studies will be at the top. Once again, as we go down the graph, the precision of the studies decreases and they are scattered sideways by random error. When there is publication bias this scatter is asymmetrical. The problem with the funnel plot is that its interpretation can be subjective, so there are numerical methods to try to detect publication bias.
And, at this point, what should we do in the face of a publication bias? Perhaps the most appropriate thing is not to ask if there is bias, but how much it affects my results (and assume that we have left studies without being included in the analysis).
The only way to know if publication bias affects our estimates would be to compare the effect on recovered and unrecovered studies, but of course, then we would not have to worry about publication bias.
In order to know whether the observed result is robust or, conversely, susceptible to being distorted by publication bias, two methods have been devised, known as the fail-safe N methods.
The first is Rosenthal’s fail-safe N method. Suppose we have a meta-analysis with a statistically significant effect, for instance, a relative risk greater than one with p < 0.05 (or a 95% confidence interval that does not include the null value, one). Then we ask ourselves a question: how many studies with RR = 1 (the null value) would have to be added until p is no longer significant? If we need few studies (fewer than 10) to nullify the effect, we may be concerned that the effect is actually null and our significance is the result of publication bias. Conversely, if many studies are needed, the effect is likely to be truly significant. This number of studies is what the letter N in the name of the method stands for. The problem with this method is that it focuses on statistical significance rather than on the relevance of the results. The correct thing would be to look for how many studies are needed for the result to lose clinical relevance, not statistical significance. In addition, it assumes that the effects of the missing studies are null (one in the case of relative risks and odds ratios, zero in the case of mean differences), when the effect of the missing studies may go in the opposite direction to the effect we detected, or in the same direction but with a smaller magnitude.
To avoid these drawbacks there is a variation of the previous formula that takes into account both statistical and clinical significance. With this method, called Orwin’s fail-safe N, we calculate how many studies are needed to bring the value of the effect down to a specific value, which will generally be the smallest effect that is clinically important. This method also allows us to specify the average effect of the missing studies.
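
Both fail-safe N methods reduce to short formulas, sketched here in plain Python. This assumes Rosenthal's formula in its usual form (it combines the z-scores of the k studies and uses a one-tailed critical value) and Orwin's formula with unweighted mean effects. The function names and the example numbers are made up for illustration, and ratio measures such as RR or OR would first have to be moved to the log scale.

```python
from statistics import NormalDist


def rosenthal_fail_safe_n(z_scores, alpha=0.05):
    """Number of null studies (z = 0) needed to make the combined result non-significant."""
    k = len(z_scores)
    z_crit = NormalDist().inv_cdf(1 - alpha)  # one-tailed critical value, about 1.645
    return sum(z_scores) ** 2 / z_crit ** 2 - k


def orwin_fail_safe_n(mean_effect, k, trivial_effect, missing_effect=0.0):
    """Number of missing studies needed to bring the mean effect down to trivial_effect.

    missing_effect is the assumed average effect of the studies we did not find.
    """
    return k * (mean_effect - trivial_effect) / (trivial_effect - missing_effect)


# Hypothetical meta-analysis of five studies (illustrative numbers only).
z_scores = [2.1, 1.8, 2.5, 1.2, 2.9]
n_rosenthal = rosenthal_fail_safe_n(z_scores)

# Hypothetical mean standardized effect of 0.40; 0.10 is taken as the smallest
# clinically important effect and missing studies are assumed to average zero.
n_orwin = orwin_fail_safe_n(mean_effect=0.40, k=5, trivial_effect=0.10)

print(f"Rosenthal fail-safe N: {n_rosenthal:.1f}")
print(f"Orwin fail-safe N:     {n_orwin:.1f}")
```

Changing `missing_effect` in the second function shows how much the answer depends on what we assume about the studies we never found.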
And here we leave meta-analysis and publication bias for today. We have not talked about other mathematical methods to detect publication bias, such as Begg’s and Egger’s. There is even another graphical method apart from the ones we have mentioned, the trim and fill method. But that is another story…

Three feet of a cat


To look for three legs of a cat, or splitting hairs, is a popular Spanish saying. It seems that when one looks for three feet of a cat he tries to demonstrate something impossible, generally with tricks and deceptions. As the English speakers say, if it ain’t broke, don’t fix it. In fact, the initial saying referred to looking for five feet instead of three. This seems more logical, since as cats have four legs, finding three of them is easy, but finding five is impossible, unless we consider the tail of the cat as another foot, which does not make much sense.

But today we will not talk about cats with three, four or five feet. Let’s talk about something a little more ethereal, such as multiple linear regression models. This is a cat with a lot of feet, but we are going to focus on only three of them: collinearity, tolerance and the variance inflation factor. Do not be discouraged, it’s easier than it may seem.

We saw in a previous post how simple linear regression models related two variables to each other, so that the variations of one of them (the independent variable or predictor) could be used to calculate how the other variable would change (the dependent variable). These models were represented by the equation y = a + bx, where x is the independent variable and y the dependent variable.

However, multiple linear regression adds more independent variables, so that it allows us to predict the dependent variable from the values of several predictor or independent variables. The generic formula would be as follows:

y = a + bx1 + cx2 + dx3 + … + nxn, where n is the number of independent variables.

One of the conditions for the multiple linear regression models to work properly is that the independent variables are actually independent and uncorrelated.

Imagine an absurd example in which we put in the model the weight in kilograms and the weight in pounds. Both variables will vary in the same way. In fact the correlation coefficient, R, will be 1, since in practice the two represent the same variable. Such foolish examples are rarely seen in scientific work, but there are others less obvious (including, for example, height and body mass index, which is calculated from weight and height) and others that are not at all evident to the researcher. This is what is called collinearity, which is nothing more than the existence of a linear association among the set of independent variables.

Collinearity is a serious problem for the multivariate model, since the estimates obtained by it are very unstable, as it becomes more difficult to separate the effect of each predictor variable.

Well, to determine if our model suffers from collinearity we can construct a matrix where the coefficients of correlation, R, of some variables with others are shown. In those cases in which we observe high R, we can suspect that there is collinearity. However, if we want to quantify this we will resort to the other two feet of the cat that we mentioned at the beginning: tolerance and inflation factor of variance.

If we square the coefficient R we obtain the coefficient of determination (R²), which represents the percentage of the variation (or variance) of a variable that is explained by the variation in the other variable. Thus, we arrive at the concept of tolerance, which is calculated as the complement of R² (1 − R²) and represents the proportion of the variability of that variable that is not explained by the rest of the independent variables included in the regression model.

In this way, the lower the tolerance, the more likely there is collinearity. Collinearity is generally considered to exist when R² is greater than 0.9 and therefore the tolerance is below 0.1.

We only have the third foot left to explain, the variance inflation factor. It is calculated as the inverse of the tolerance (1 / T) and represents the factor by which the variance of the coefficient estimated for that variable is inflated because of its correlation with the rest of the predictor variables of the model. Of course, the greater the variance inflation factor, the greater the likelihood of collinearity. Generally, collinearity is considered to exist when the inflation factor of a variable is greater than 10 or when the mean of the inflation factors of all the independent variables is much greater than one.
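
To make the three feet of the cat concrete, here is a minimal sketch in plain Python, assuming made-up data for eight hypothetical subjects. Each predictor is regressed on the others, and tolerance and the variance inflation factor are derived from that R². The helper names are illustrative, and a real analysis would normally rely on a statistical package rather than on this hand-rolled least-squares solver.

```python
def solve_linear_system(a, b):
    """Solve a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def r_squared(target, predictors):
    """R² of a least-squares fit of target on the predictors, with intercept."""
    rows = [[1.0] + list(values) for values in zip(*predictors)]
    k = len(rows[0])
    # Normal equations: (XᵀX)·beta = Xᵀy
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(target) / len(target)
    ss_res = sum((y - f) ** 2 for y, f in zip(target, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in target)
    return 1.0 - ss_res / ss_tot


def tolerance_and_vif(variables):
    """Map each predictor name to (tolerance, VIF), regressing it on all the others."""
    result = {}
    for name, values in variables.items():
        others = [v for other, v in variables.items() if other != name]
        tolerance = 1.0 - r_squared(values, others)
        result[name] = (tolerance, 1.0 / tolerance)
    return result


# Made-up data: weight in pounds is weight in kilograms converted and rounded.
predictors = {
    "age": [25, 32, 47, 51, 38, 60, 29, 44],
    "weight_kg": [60, 72, 80, 95, 68, 77, 55, 88],
    "weight_lb": [132, 159, 176, 209, 150, 170, 121, 194],
}
collinearity = tolerance_and_vif(predictors)
for name, (tolerance, vif) in collinearity.items():
    print(f"{name}: tolerance = {tolerance:.4f}, VIF = {vif:.1f}")
```

With the two weights in the same model, both end up with a tolerance far below 0.1 and an inflation factor far above 10, which is the warning sign described above.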

And here we are going to leave the multivariate models for today. Needless to say, everything we have described is done in practice using computer programs that calculate these parameters in a simple way.

We have seen here some aspects of multiple linear regression, perhaps the most widely used multivariate model. But there are others, such as multivariate analysis of variance (MANOVA), factor analysis, or cluster analysis. But that is another story…

In search of causality


In medicine we often try to look for cause-effect relationships. If we want to show that drug X produces an effect, we have only to select two groups of people, give the drug to one group and not to the other, and see if there are differences.

But it is not so simple, because we can never be sure that the differences in effect between the two groups are not actually due to factors other than the treatment we have used. These factors are the so-called confounding factors, which may be known or unknown and which may bias the results of the comparison.

To resolve this problem a key element of a clinical trial, randomization, was invented. If we divide the participants in the trial between the two branches randomly we will get these confounding variables to be distributed homogeneously between the two arms of the trial, so any difference between the two will have to be due to the intervention. Only in this way can we establish cause-effect relationships between our exposure or treatment and the outcome variable we measure.

The problem of quasi-experimental and observational studies is that they lack randomization. For this reason, we can never be sure that the differences are due to exposure and not to any confounding variable, so we cannot safely establish causal relationships.

This is an annoying inconvenience, since it will often be impossible to carry out randomized trials, whether for ethical or economic reasons, because of the nature of the intervention or for any other reason. That is why some tricks have been invented in order to establish causal relations in the absence of randomization. One of these techniques is the propensity score we saw in an earlier post. Another is the one we are going to develop today, which has the nice name of regression discontinuity.

Regression discontinuity is a quasi-experimental design that allows causal inference in the absence of randomization. It can be applied when the exposure of interest is assigned, at least partially, according to whether the value of a continuous random variable falls above or below a certain threshold.

Consider, for example, a hypocholesterolemic drug that we will use when LDL cholesterol rises above a given value, or an antiretroviral therapy in an AIDS patient that we will indicate when his CD4 count falls below a certain value. There is a discontinuity in the threshold value of the variable that produces a sudden change in the probability of assignment to the intervention group, as I show in the attached figure.

In these cases where the allocation of treatment depends, at least in part, on the value of a continuous variable, the allocation in the vicinity of the threshold is almost as if it were random. Why? Because measurements are subject to random variability due to sampling error (in addition to the variability of the biological variables themselves), so being just above or just below the threshold may depend on that random variability. This makes individuals very close to the threshold, on either side, very similar in terms of the variables that may act as confounders, much like what happens in a clinical trial. At the end of the day, we may think of a clinical trial as nothing more than a discontinuity design in which the threshold is a random number.

The math of regression discontinuity is not for beginners so I do not intend to explain it here (I would have to understand it first), so we will settle for knowing some terms that will help us to understand the works that use this methodology.

Regression discontinuity may be sharp or fuzzy. In the sharp one, the probability of assignment changes from zero to one at the threshold (the allocation of treatment follows a deterministic rule). For example, treatment is initiated when the threshold is crossed, regardless of other factors. On the other hand, in the fuzzy one there are other factors at play, so the probability of allocation also changes at the threshold, though not from zero to one, and it may depend on those additional factors.

Thus, the result of the regression model varies somewhat depending on whether it is a sharp or fuzzy regression discontinuity. In the case of sharp regression, the so-called average causal effect is calculated, according to which participants are assigned to the intervention with certainty if they cross the threshold. In the case of fuzzy regression, the allocation is no longer performed according to a deterministic model, but according to a probabilistic one (according to the threshold value and other factors that the researcher may consider important). In these cases, an intention-to-treat analysis should be done according to the difference in the probability of allocation near the cut-off point (some may not exceed the threshold but be assigned to the intervention because the investigator considers the other factors).

Thus, the probabilistic model will have to measure the effect on the compliers (those whose treatment actually follows the assignment indicated by the threshold), so the regression model will give us the complier average causal effect, which is the typical measure of fuzzy regression discontinuity.

And I think we’re going to leave it for today. We have not said anything about the regression equation, but suffice it to say that it takes into account the slopes of the probability function of allocation before and after the threshold and an interaction variable for the possibility that the effects of the treatment are heterogeneous on both sides of the threshold. As you see, everything is quite complicated, but that is what statistical packages like R or Stata are for, since they implement these models with little effort.
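
For the curious, a sharp design can be sketched in a few lines of plain Python. The example assumes simulated data (a hypothetical LDL threshold of 160, a made-up outcome score and a true jump of -3) and estimates the effect by fitting one straight line on each side of the threshold, which is equivalent to a single linear regression with a treatment indicator and its interaction with the distance to the threshold. The function names and the bandwidth are illustrative choices, not part of any package.

```python
import random


def fit_line(xs, ys):
    """Intercept and slope of a simple ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def sharp_rd_effect(running, outcome, threshold, bandwidth):
    """Jump in the outcome at the threshold, using one fitted line per side."""
    below_x, below_y, above_x, above_y = [], [], [], []
    for x, y in zip(running, outcome):
        distance = x - threshold  # centered, so each intercept is the value at the threshold
        if abs(distance) > bandwidth:
            continue  # only observations near the threshold are used
        if distance < 0:
            below_x.append(distance)
            below_y.append(y)
        else:
            above_x.append(distance)
            above_y.append(y)
    intercept_below, _ = fit_line(below_x, below_y)
    intercept_above, _ = fit_line(above_x, above_y)
    return intercept_above - intercept_below


# Simulated data (assumption): treatment starts at LDL >= 160 and lowers the score by 3.
rng = random.Random(42)
THRESHOLD = 160
TRUE_EFFECT = -3.0
ldl = list(range(100, 220))
score = []
for value in ldl:
    treated = value >= THRESHOLD
    baseline = 20 + 0.05 * (value - THRESHOLD)
    score.append(baseline + (TRUE_EFFECT if treated else 0.0) + rng.gauss(0, 0.2))

effect = sharp_rd_effect(ldl, score, THRESHOLD, bandwidth=40)
print(f"Estimated jump at the threshold: {effect:.2f}")
```

The bandwidth decides how close to the threshold an observation must be to count, and the estimate can be sensitive to that choice.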

Finally, just to say that it is usual to see models that use linear regression for quantitative outcome variables, but there are extensions of the model that use dichotomous variables and logistic regression techniques, and even models with survival studies and time-to-event variables. But that is another story…



In the best-known sense, censorship is the action of examining a work intended for the public, suppressing or modifying the part that does not fit certain political, moral or religious criteria, to determine whether or not it can be published or exhibited. So what do we mean in statistics when we talk about censored data? Nothing to do with politics, morality or religion. In order to explain what censored data are, we must first discuss time-to-event variables and survival analysis.

In general, we can say that there are three types of variables: quantitative, qualitative and time-to-event. The first two are fairly well understood in general, but time-to-event variables are a little more complicated to understand.

Imagine that we want to study the mortality of that terrible disease that fildulastrosis is. We could count the number of deaths at the end of the study period and divide them by the total population at the beginning. For example, if at the beginning there are 50 patients and four die during follow-up, we could calculate the mortality as 4/50 = 0.08, or 8%. Thus, if we have followed the population for five years, we can say that the survival of the disease at five years is 92% (100-8 = 92).

Simple, isn’t it? The problem is that this is only valid when all subjects have the same follow-up period and no losses or dropouts occur throughout the study, a situation that is often far from the reality in most cases.

In these cases, the correct thing to do is to measure not only if death occurs (which would be a dichotomous variable), but also when it occurs, also taking into account the different follow-up period and the losses. Thus, we would use a time-to-event variable, which is composed of a dichotomous variable (the event being measured) and a continuous variable (the follow-up time when it occurs).

Following the example above, participants in the study could be classified into three types: those who die during follow-up, those who remain alive at the end of the study, and those who are lost during follow-up.

Of those who die we can calculate their survival but, what is the survival of those who are alive at the end of the study? And what is the survival of those who are lost during follow-up? It is clear that some of the lost may have died at the end of the study without us detecting it, so our measure of mortality will not be accurate.

And this is where we find the censored data. All those who do not present the event during the survival study are called censored (losses and those who finish the study without presenting the event). The importance of these censored data is that they must be taken into account when doing the survival study, as we will see below.

The methodology to be followed is to create a survival table that takes into account the events (in this case the deaths) and the censored data, as we can see in the attached table.

The columns of the table represent the following: x, the year number of the follow-up; Nx, the number of participants alive at the beginning of that year; Cx, the number of losses of that year (censored); Mx, the number of deaths during that period; PD, the probability of dying in that period; PPS, the probability of surviving in that period (the probability of not presenting the event); and PGS, the global probability of survival up to that point.

As we see, in the first year we start with 50 participants, one of whom dies. The probability of dying in that period is 1/50 = 0.02, so the probability of survival in the period (which is equal to the global one since it is the first period) is 1-0.02 = 0.98.

In the second period we start with 49 and no one dies or is lost. The PD in the period is zero and survival one. Thus, the overall probability will be 1×0.98 = 0.98.

In the third period we continue with 49. Two are lost and one dies. The PD is 1/49 = 0.0204 and the PPS is 1-0.0204 = 0.9796. If we multiply the PPS by the global survival of the previous period, we obtain the overall survival of this period: 0.9796×0.98 = 0.96.

In the fourth period we started with 46 participants, resulting in five losses and two deaths. The PD will be 2/46 = 0.0434, the PPS of 1-0.0434 = 0.9566 and the PGS of 0.9566×0.96 = 0.9183.

And last, in the fifth period we started with 39 participants. We have two censored and no event (death). PD is zero, PPS is equal to one (no one dies in this period) and PGS 1×0.9183 = 0.9183.

Finally, taking into account the censored data, we can say that the overall survival at five years of fildulastrosis is 91.83%.
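
The whole table can be rebuilt with a short loop. The sketch below, in plain Python, takes the figures described in the text for each year (participants at the start, censored and deaths) and computes PD, PPS and PGS exactly as above; the function and variable names are illustrative.

```python
# (year, alive at start, censored during the year, deaths during the year)
follow_up = [
    (1, 50, 0, 1),
    (2, 49, 0, 0),
    (3, 49, 2, 1),
    (4, 46, 5, 2),
    (5, 39, 2, 0),
]


def survival_table(rows):
    """Build the table row by row, carrying the global survival forward."""
    table = []
    global_survival = 1.0
    for year, n_start, censored, deaths in rows:
        p_death = deaths / n_start        # PD: probability of dying in the period
        p_survival = 1 - p_death          # PPS: probability of surviving the period
        global_survival *= p_survival     # PGS: survival up to the end of the period
        table.append(
            {"year": year, "N": n_start, "C": censored, "M": deaths,
             "PD": p_death, "PPS": p_survival, "PGS": global_survival}
        )
    return table


table = survival_table(follow_up)
for row in table:
    print(
        f"year {row['year']}: N={row['N']} C={row['C']} M={row['M']} "
        f"PD={row['PD']:.4f} PPS={row['PPS']:.4f} PGS={row['PGS']:.4f}"
    )
```

Note that censored participants leave the count for the following year but, in this simple table, still count in the denominator of the year in which they are lost.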

And with this we are going to leave it for today. We have seen how a survival table with censored data is constructed to take into account unequal follow-up of participants and losses during follow-up.

Only two thoughts before finishing. First, even if we talk about survival analysis, the event does not have to be the death of the participants. It can be any event that occurs throughout the study follow-up.

Second, the time-to-event and censored data are the basis for performing other statistical techniques that estimate the probability of occurrence of the event under study at a given time, such as the Cox regression models. But that is another story…

Simplifying the impact


In epidemiological studies it is common to find a set of measures of effect such as risks in exposed and non-exposed, relative risks and risk reductions. However, in order for the analysis of a study to be considered well done, measures of effect should be accompanied by a series of impact measures, which are the ones that inform us more precisely about the true effect of the exposure or intervention we are studying.

For example, if we conducted a study on the prevention of mortality from a disease with a treatment X, a relative risk of 0.5 would tell us that the risk of dying is halved if we take the drug, but we cannot see clearly the impact of treatment. However, if we calculate the number needed to treat (NNT) and it comes out to be two, we will know that for every two people treated one death from that disease is avoided. This impact measure, the NNT, does give us a clearer idea of the real effect of the intervention in our practice.

There are several impact measures, in addition to the NNT. In the cohort studies, which we are going to focus on today, we can calculate the difference of incidences between exposed and unexposed, the exposed attributable fraction (EAF), the avoidable risk in exposed (ARE) and the population attributable fraction (PAF).

The EAF indicates the proportion of the risk of presenting the effect in the exposed that is due specifically to having been exposed. The ARE would inform us of the cases of illness in the exposed group that could have been avoided had the exposure not existed. Finally, the PAF is a specific attributable risk that describes the proportion of cases that could be prevented in the population if the risk factor under study were completely eliminated. As a fourth parameter, considering the presence of exposure and disease, we can calculate the fraction of exposure in cases (FEc), which defines the proportion of exposed cases that are attributable to the risk factor.

In the table that I attach you can see the formulas for the calculation of these parameters.

The problem with these impact measures is that they can sometimes be difficult for the clinician to interpret. For this reason, and inspired by the calculation of NNTs, a series of measures called impact numbers have been devised, giving us a more direct idea of the effect of the exposure factor on the disease being studied. These impact numbers are the number of impact on exposed (NIE), the number of impact in cases (NIC) and the number of impact of exposed cases (NIEC).

Let’s start with the simplest. The NIE would be the equivalent of the NNT and would be calculated as the inverse of the absolute risk reduction or the risk difference. The NNT is the number of people who should be treated to prevent one case compared to the control group. The NIE represents the average number of people who have to be exposed to the risk factor for one new disease event to occur compared to non-exposed persons. For example, an NIE of 10 means that, out of every 10 people exposed, one case of disease attributable to the risk factor will occur.

The NIC is the inverse of the PAF, so it defines the average number of sick people among whom a case is due to the risk factor. A NIC of 10 means that for every 10 cases in the population, one is attributable to the risk factor under study.

Finally, the NIEC is the inverse of the FEc. It is the average number of exposed cases among which one is attributable to the risk factor.

In summary, these three measures indicate the impact of exposure among all exposed (NIE), among all patients (NIC) and among all patients who have been exposed (NIEC).

An example is the data from the attached table, corresponding to a fictional study on the effect of smoking on coronary mortality. I have used one of the many epidemiological calculators available on the Internet and have calculated a risk difference of 0.0027, a PAF of 0.16 and an FEc of 0.4. We can now calculate our impact numbers.

The NIE is 1 / 0.0027 ≈ 370. In round numbers, out of every 370 smokers, one will die from heart disease attributable to tobacco.

NIC will be 1 / 0.16 = 6.25. Of every six deaths from heart disease in the population, one will be attributable to tobacco.

Finally, NIEC will be 1 / 0.4 = 2.5. Approximately, for every three deaths from heart disease among those who smoked, one would be attributable to tobacco addiction.
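For anyone who prefers to see the arithmetic spelled out, here is a minimal Python sketch of the three calculations. The function name `impact_number` is my own invention for illustration, and the inputs are the rounded values quoted above, so the results may differ slightly from figures obtained with unrounded data.

```python
def impact_number(fraction: float) -> float:
    """Inverse of a risk difference or attributable fraction.

    Hypothetical helper for illustration only.
    """
    if fraction <= 0:
        raise ValueError("the fraction must be positive to give a finite impact number")
    return 1.0 / fraction


# Rounded values quoted in the fictional smoking example.
risk_difference = 0.0027   # difference of incidences, exposed vs unexposed
paf = 0.16                 # population attributable fraction
fec = 0.4                  # fraction of exposure in cases

nie = impact_number(risk_difference)   # impact among all exposed
nic = impact_number(paf)               # impact among all cases
niec = impact_number(fec)              # impact among exposed cases

print(f"NIE  = {nie:.1f}")
print(f"NIC  = {nic:.2f}")
print(f"NIEC = {niec:.1f}")
```

All three numbers come from the same one-line inversion, which is exactly why they are easier to communicate than the fractions they derive from.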

And here we leave it for today. Do not forget that the data of the example are fictitious and I do not know how well they fit reality.

We have discussed only the point estimates of impact numbers but, as always, it is preferable to calculate their confidence intervals. All three can be calculated from the limits of the intervals of the measures from which the impact numbers are obtained, but it is best to use a calculator that does it for us. Calculating the intervals of some parameters such as, for example, the PAF can be complex. But that is another story…

The tribulations of an interval


The number needed to treat (NNT) is an impact measure that tells us in a simple way about the effectiveness of an intervention or its side effects. If the treatment tries to avoid unpleasant events, the NNT gives us an estimate of the number of patients we have to treat to avoid one of these events. In this case we talk about the NNTB, the number needed to treat to benefit.

In other cases, the intervention may produce adverse effects. Then we will talk about the NNTH, or number needed to treat to harm one patient (to produce an unpleasant event).

The calculation of the NNT is simple when we have a contingency table like the one we see in the first table. It is usually calculated as the inverse of the absolute risk reduction (1 / ARR) and is given as a point estimate. The problem is that this ignores the probabilistic nature of the NNT, so the most correct approach would be to specify its 95% confidence interval (95CI), as we do with the rest of the measures.

We already know that the 95CI of any measure that follows a normal distribution is given by the following formula:

95CI (X) = X ± (1.96 x SE (X)), where SE is the standard error.

Thus the lower and upper limits of the interval would be the following:

X – 1.96 SE (X), X + 1.96 SE (X)

And here we have a problem with the NNT’s 95CI. This interval cannot be calculated directly because the NNT does not have a normal distribution. Therefore, some tricks have been invented to calculate it, such as calculating the 95CI of the ARR and using its limits to obtain those of the NNT, as follows:

95CI (ARR) = ARR – 1.96 SE(ARR), ARR + 1.96 SE(ARR)

95CI (NNT) = 1 / upper limit of the 95CI (ARR), 1 / lower limit of the 95CI (ARR) (we use the upper limit of the ARR to calculate the lower limit of the NNT, and vice versa, because the NNT is the inverse of the ARR, so the greater the risk reduction, the fewer patients we need to treat; note that, the treatment being beneficial, the difference [RT – RNT] is in fact negative, although we usually speak of the ARR in absolute value).

We just need to know how to calculate the ARR’s SE, which is done with a slightly unfriendly formula that I show you just in case anyone is curious to see it:

SE(ARR) = \sqrt{\frac{R_{T}\times(1-R_{T})}{Treated}+\frac{R_{NT}\times(1-R_{NT})}{Non\ treated}}

In the second table you can see a numerical example to calculate the NNT and its interval. You see that the NNT = 25, with a 95CI of 15 to 71. Look at the asymmetry of the interval since, as we have said, the NNT does not follow a normal distribution. In addition, far from the fixed value of 25, the interval tells us that in the best case we will have to treat 15 patients to avoid one adverse event, but in the worst case this value can rise to 71.
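To make the recipe concrete, here is a minimal Python sketch that follows the steps just described: the risks in each group, the ARR, its SE, the 95CI of the ARR and, from its limits, the interval of the NNT. The function name `nnt_with_ci` and the counts used in the example (20 events among 200 treated, 40 among 200 non-treated) are invented for illustration and are not the data of the table.

```python
import math


def nnt_with_ci(events_treated, n_treated, events_control, n_control, z=1.96):
    """Return (NNT, lower limit, upper limit) using the ARR-based interval.

    Hypothetical helper; it assumes a beneficial treatment whose ARR
    interval lies entirely above zero.
    """
    r_t = events_treated / n_treated      # risk in treated
    r_nt = events_control / n_control     # risk in non-treated
    arr = r_nt - r_t                      # absolute risk reduction (positive if beneficial)
    se = math.sqrt(r_t * (1 - r_t) / n_treated + r_nt * (1 - r_nt) / n_control)
    arr_low, arr_high = arr - z * se, arr + z * se
    if arr_low <= 0:
        raise ValueError("the lower limit of the ARR interval is not positive")
    # The upper limit of the ARR gives the lower limit of the NNT, and vice versa.
    return 1 / arr, 1 / arr_high, 1 / arr_low


# Invented counts, for illustration only.
nnt, nnt_low, nnt_high = nnt_with_ci(20, 200, 40, 200)
print(f"NNT = {nnt:.1f} (95CI {nnt_low:.1f} to {nnt_high:.1f})")
```

Even with these made-up numbers the interval is clearly asymmetric around the point estimate, which is the behavior described above.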

On top of the difficulty of its calculation, another problem arises when the ARR’s 95CI includes zero. In general, the smaller the effect of the treatment (the lower the ARR), the higher the NNT (it will be necessary to treat more patients to avoid one unpleasant event), so in the extreme case in which the effect is zero, the NNT will be infinite (an infinite number of patients would have to be treated to avoid one unpleasant event).

So it is easy to imagine that if the 95CI of the ARR includes zero, the 95CI of the NNT will include infinity. It will be a discontinuous interval with a negative value limit and a positive one, which can pose problems for its interpretation.

For example, suppose we have a trial in which we calculated an ARR of 0.01 with a 95CI of -0.01 to 0.03. With the point estimate we have no problem, the NNT is 100, but what about the interval? It would go from -100 to 33, passing through infinity (actually, from minus infinity to -100 and from 33 to infinity).

How do we interpret a negative NNT? In this case, as we have already said, we are dealing with an NNTB, so its negative value can be interpreted as a positive value of its alter ego, the NNTH. In our example, -100 would mean that we will have an adverse effect for every 100 treated. In short, our interval would tell us that we could produce one event for every 100 treated, in the worst case, or avoid one for every 33 treated, in the best. This ensures that the interval is continuous and includes the point estimate, but it will have little application as a practical measure. Basically, it may make little sense to calculate the NNT when the ARR is not significant (its 95CI includes zero).
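This reading of the interval can be written as a small Python sketch. The function name `describe_nnt_interval` and the wording of the output are my own choices for illustration; the only inputs are the two limits of the ARR’s 95CI, with positive values meaning benefit.

```python
import math


def describe_nnt_interval(arr_low: float, arr_high: float) -> str:
    """Express the NNT interval in terms of NNTB and NNTH.

    Hypothetical helper: arr_low and arr_high are the limits of the
    ARR interval, positive values meaning benefit.
    """
    if arr_low > 0:
        # The whole interval shows benefit.
        return f"NNTB {1 / arr_high:.1f} to NNTB {1 / arr_low:.1f}"
    if arr_high < 0:
        # The whole interval shows harm.
        return f"NNTH {-1 / arr_low:.1f} to NNTH {-1 / arr_high:.1f}"
    # The interval includes zero, so it passes through infinity.
    harm = math.inf if arr_low == 0 else -1 / arr_low
    benefit = math.inf if arr_high == 0 else 1 / arr_high
    return f"NNTH {harm:.1f} to infinity to NNTB {benefit:.1f}"


print(describe_nnt_interval(-0.01, 0.03))
```

With the limits of the example, the function reports one event caused for every 100 treated at one end and one event avoided for every 33 treated at the other, with infinity in between.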

At this point, our heads are starting to smoke, so let’s finish for today. Needless to say, everything I have explained about the calculation of the interval can be done with a few clicks using any of the calculators available on the Internet, so we will not have to do any math.

In addition, although the NNT calculation is simple when we have a contingency table, we often have adjusted risk values obtained from regression models. Then, the maths for the calculation of the NNT and its interval gets a little complicated. But that is another story…

A case of misleading probability


Today we are going to see another of those examples where intuition about the value of certain probabilities plays tricks on us. And, for that, we will use nothing less than Bayes’ theorem, playing a little with conditional probabilities. Let’s see step by step how it works.

What is the probability of two events occurring? The probability of an event A occurring is P(A) and that of B, P(B). Well, the probability of the two occurring is P(A∩B), which, if the two events are independent, is equal to P(A) x P(B).

Imagine that we have a die with six faces. If we roll it once, the probability of getting, for example, a five is 1/6 (one result among the six possible). The probability of getting a four is also 1/6. What will be the probability of getting a five in the first roll and a four in the second? Since the two rolls are independent, the probability of the combination five followed by four will be 1/6 x 1/6 = 1/36.

Now let’s think of another example. Suppose that in a group of 10 people there are four doctors, two of whom are surgeons. If we take one at random, the probability of being a doctor is 4/10 = 0.4 and that of a surgeon is 2/10 = 0.2. But if we get one and know that he is a doctor, the probability that he is a surgeon will no longer be 0.2, because the two events, being a doctor and a surgeon, are not independent. If you are a doctor, the probability that you are a surgeon will be 0.5 (half the doctors in our group are surgeons).

When two events are dependent, the probability of both occurring is the probability of the first given that the second has occurred, multiplied by the probability of the second. So P(surgeon∩doctor) = P(surgeon|doctor) x P(doctor) = 0.5 x 0.4 = 0.2 (which here equals P(surgeon), since every surgeon is a doctor). We can generalize the expression as follows:

P(A∩B) = P(A|B) x P(B), and rearranging the terms of the expression, we obtain the so-called Bayes rule, as follows:

P(A|B) = P(A∩B) / P(B).
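The two examples above can be checked with a few lines of Python. This is only a sketch with exact fractions; the variable names are mine and the numbers are those of the die and of the group of 10 people.

```python
from fractions import Fraction

# Independent events: a five on the first roll and a four on the second.
p_five = Fraction(1, 6)
p_four = Fraction(1, 6)
p_five_then_four = p_five * p_four

# Dependent events: 10 people, 4 doctors, 2 of whom are surgeons.
p_doctor = Fraction(4, 10)
p_surgeon_and_doctor = Fraction(2, 10)  # every surgeon is also a doctor
p_surgeon_given_doctor = p_surgeon_and_doctor / p_doctor  # P(A|B) = P(A∩B) / P(B)

print(p_five_then_four)        # 1/36
print(p_surgeon_given_doctor)  # 1/2
```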

P(A∩B) can also be written as the probability of B given A multiplied by the probability of A, that is, P(B|A) x P(A). On the other hand, the probability of B is equal to the probability that B occurs together with A plus the probability that B occurs without A, which in mathematical form is as follows:

P(B) = P(B|A) x P(A) + P(B|Ac) x P(Ac), where P(Ac) is the probability that A does not occur.

If we substitute these expanded expressions into the initial rule, we obtain the best-known expression of Bayes’ theorem:

P(A|B)=\frac{P(B|A) \times P(A)}{P(B|A) \times P(A)+P(B|A^{c}) \times P(A^{c})}

Let’s see how Bayes’ theorem is applied with a practical example. Consider the case of acute fildulastrosis, a serious disease whose prevalence in the population is, fortunately, quite low, one per 1000 inhabitants. Then, P(F) = 0.001.

Luckily we have a good diagnostic test, with a sensitivity of 98% and a specificity of 95%. Suppose now that I take the test and it gives me a positive result. Should I be very scared? What is the probability that I actually have the disease? Do you think it will be high or low? Let’s see.

A sensitivity of 98% means that the probability of testing positive when having the disease is 0.98. Mathematically, P(POS|F) = 0.98. On the other hand, a specificity of 95% means that the probability of testing negative when healthy is 0.95. That is, P(NEG|Fc) = 0.95. But what we want to know is neither of these two things: what we are really looking for is the probability of being sick once we test positive, that is, P(F|POS).

To calculate it, we only have to apply Bayes’ theorem:

P(F|POS)=\frac{P(POS|F) \times P(F)}{P(POS|F) \times P(F)+P(POS|F^{c}) \times P(F^{c})}

Then we replace the symbols with their values and solve the equation:

P(F|POS)=\frac{0.98 \times 0.001}{0.98 \times 0.001+[(1-0.95) \times (1-0.001)]}=0.02

So we see that, in principle, I do not have to be very scared when the test gives me a positive result, since the probability of being ill is only 2%. As you see, it is much lower than intuition would tell us with such a high sensitivity and specificity. Why is this happening? Very simple, because the prevalence of the disease is very low. Let’s repeat the experiment assuming now that the prevalence is 10% (0.1):

P(F|POS)=\frac{0.98 \times 0.1}{0.98 \times 0.1+[(1-0.95) \times (1-0.1)]}=0.69

As you see, in this case the probability of being ill if I test positive rises to 69%. This probability is known as the positive predictive value which, as we can see, can vary greatly depending on the frequency of the effect we are studying.
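If you want to play with other prevalences, the same calculation fits in a short Python function. It is just a sketch of the formula above; the name `positive_predictive_value` and its parameter names are mine.

```python
def positive_predictive_value(sensitivity: float, specificity: float, prevalence: float) -> float:
    """P(F|POS) obtained from Bayes' theorem (illustrative helper)."""
    true_positives = sensitivity * prevalence                # P(POS|F) x P(F)
    false_positives = (1 - specificity) * (1 - prevalence)   # P(POS|Fc) x P(Fc)
    return true_positives / (true_positives + false_positives)


# Same test (sensitivity 0.98, specificity 0.95) at the two prevalences of the example.
for prevalence in (0.001, 0.1):
    ppv = positive_predictive_value(0.98, 0.95, prevalence)
    print(f"prevalence {prevalence}: P(F|POS) = {ppv:.3f}")
```

Nothing changes in the test between the two runs: only the prevalence moves, and the probability of being ill after a positive result moves with it.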

And here we leave it for today. Before closing, let me warn you not to look up what fildulastrosis is. I would be very surprised if anyone found it in a medical book. Also, be careful not to confuse P(POS|F) with P(F|POS), since you would be making a mistake known as the inverse fallacy or fallacy of the transposed conditional, which is a serious error.

We have seen how the calculation of probabilities gets somewhat complicated when the events are not independent. We have also learned how unreliable predictive values are when the prevalence of the disease changes. That is why the likelihood ratios were invented, which do not depend so much on the prevalence of the disease that is diagnosed and allow a better overall assessment of the power of the diagnostic test. But that is another story…

Regular customers


We saw in a previous post that sample size is very important. The sample should be the right size, neither more nor less. If it is too large, we are wasting resources, something to keep in mind in modern times. If we use a small sample we will save money, but lose statistical power. This means that there may be a real difference in effect between the two interventions tested in a clinical trial and we may not be able to detect it, in which case we will be throwing good money away just the same.

The problem is that sometimes it can be very difficult to reach an adequate sample size, and excessively long periods of time may be needed to get it. Well, for these cases, someone with a commercial mentality has devised a method that consists of including the same participant many times in the trial. It’s like in bars. It is better to have regular customers who come to the establishment many times, which is always easier than having a very large clientele (which is also desirable).

There are times when the same patient needs the same treatment on repeated occasions. Consider, for example, asthmatics who need bronchodilator treatment repeatedly, or couples undergoing in vitro fertilization, which requires several cycles to succeed.

Although the usual standard in clinical trials is randomizing participants, in these cases we can randomize each participant independently whenever he needs treatment. For example, if we are testing two bronchodilators, we can randomize the same subject to one of the two every time he has an asthma attack and needs treatment. This procedure is known as re-randomization and consists, as we have seen, in randomizing situations rather than participants.

This trick is quite correct from a methodological point of view, provided that certain conditions discussed below are met.

The participant enters the trial the first time in the usual way, being randomly assigned to one of the two arms of the trial. Subsequently he is followed up during the appropriate period and the results of the study variables are collected. Once the follow-up period is finished, if the patient requires new treatment and continues to meet the inclusion criteria of the trial, he is randomized again, repeating this cycle as necessary to achieve the desired sample size.

Recruiting situations instead of participants allows us to reach the sample size with a smaller number of participants. For example, if we need a sample size of 500, we can randomize 500 participants once, 250 twice, or 200 once and 50 six times. The important thing is that the number of randomizations of each participant cannot be specified in advance, but must depend on the need for treatment at each moment.

To apply this method correctly you need to meet three requirements. First, patients can only be re-randomized when they have fully completed the follow-up period of the previous procedure. This is logical because, otherwise, the effects of the two treatments would overlap and a biased measure of the effect of the intervention would be obtained.

Second, each new randomization in the same participant should be done independently of the others. In other words, the probability of assignment to each intervention should not depend on previous assignments. Some authors are tempted to use reallocations to balance the two groups, but this can bias comparisons between the two groups.

Third, the participant should receive the same benefit from each intervention each time it is applied. Otherwise, we get a biased estimate of the treatment effect.
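The second requirement, independent allocation at every new episode, can be sketched in Python as follows. Everything here is an assumption made for illustration: the function name `rerandomize`, the arm labels "A" and "B", the simple 1:1 allocation and the fixed seed, used only so that the example is reproducible. In a real trial the number of episodes per participant would not be known in advance.

```python
import random


def rerandomize(episodes_per_participant: dict, seed: int = 0) -> list:
    """Allocate every treatment episode independently to arm A or B.

    Hypothetical sketch: each allocation ignores all previous ones,
    including those of the same participant.
    """
    rng = random.Random(seed)
    allocations = []
    for participant_id, n_episodes in episodes_per_participant.items():
        for episode in range(1, n_episodes + 1):
            arm = rng.choice(["A", "B"])  # same probability every time
            allocations.append((participant_id, episode, arm))
    return allocations


# 200 participants randomized once and 50 randomized six times: 500 randomizations.
episodes = {f"P{i:03d}": 1 for i in range(200)}
episodes.update({f"P{i:03d}": 6 for i in range(200, 250)})

allocations = rerandomize(episodes)
print(len(allocations), "randomizations from", len(episodes), "participants")
```

Note that the sketch makes no attempt to balance the arms within a participant, which is precisely the temptation the second requirement warns against.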

We see, then, that this is a good way to reach more easily the sample size we want. The problem with this type of design is that the analysis of the results is more complex than that of a conventional clinical trial.

Basically, without going into details, there are two methods of analysis of results. The simplest is the unadjusted analysis, in which all interventions, even if they belong to the same participant, are treated independently. This model, which is usually expressed by a linear regression model, does not take into account the effect that participants can have on the results.

The other method is adjusted for the effect of patients, which takes into account the correlation between observations of the same participants.

And here we leave it for today. We have not said anything about the mathematical treatment of the adjusted method to avoid burning the reader’s neurons. Suffice it to say that there are several approaches, based on generalized linear models and mixed-effects models. But that is another story…

The fairground shotgun


A few days ago I was with my cousin at our neighborhood festival and, to pass the time, we were shooting at one of the booths to see if we could win the teddy bear.

But nothing, not even by chance.

I shot many times, but did not get a single pellet into the target. They all landed around it, but not one in the center. My cousin, however, is a crack shot. The problem is that he got a shotgun with a crooked sight, so all his pellets went astray and he did not put any into the target either. In short, we were left with nothing. In the attached figure you can see the pattern of shots that we both made.

Anyway, to take advantage of the situation, looking at the targets it occurs to me that they bear some resemblance to the two types of error that we can have in our epidemiological studies.

These are, in general, two: random error and systematic error.

Random error is due to our friend chance, from which there is no escape. It can have two fundamental causes. First, sampling error. When we draw a sample from a population, we do it with the aim of estimating a population parameter through the study of an estimator of that parameter in the sample. However, due to sampling error we can obtain a sample that is not representative of the population (if we draw several samples, all will be slightly different from each other). This happens especially when sample sizes are small and when we use non-probabilistic sampling techniques.

The other source of random error is variability in measurement. If we take the blood pressure several times, the results will be different (though similar) because of, on the one hand, biological variability itself and, on the other, the imprecision of the measuring device we use.

This random error is related to the precision of the result. A measure will be more precise the smaller its random component is, so we can increase precision by increasing the size of the sample or by being more careful with measurements.

In our shooting example, I represent random error. My shots stray at random, so that from the cloud of impacts one can imagine where the center is, but no shot reaches it. Logically, the more shots I take, the more likely it is that one hits the center, albeit by chance.

The second error we mentioned is systematic error, also called bias. This is due to an error in the design or analysis of the study, which produces an incorrect or invalid estimate of the effect we are studying. In our example, as you may have guessed, my cousin represents systematic error. He shoots very well but, as the gun is poorly calibrated, his shots miss the target, all deviating systematically in one direction. Seeing only his shots we cannot imagine where the center is, as we could with the shots on my target, because we would think that the center is in a place where it actually is not. Thus, random error affects precision, while systematic error compromises the validity of the results. And another thing: even if my cousin increases the number of shots, they will keep going astray. Systematic error does not decrease because we increase the sample size.
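The difference between the two shooters can be seen in a small simulation. This is only a sketch under invented assumptions: shots are reduced to a horizontal deviation from the center (located at 0), my shots have no bias but a wide spread, and my cousin’s have a narrow spread but a fixed deviation of 3 units. The function name `mean_shot`, the spreads and the seed are all made up for illustration.

```python
import random
import statistics


def mean_shot(n_shots: int, bias: float, spread: float, rng: random.Random) -> float:
    """Average horizontal deviation from the center after n_shots.

    Hypothetical model: each shot is drawn from a normal distribution
    centered on `bias` with standard deviation `spread`.
    """
    shots = [rng.gauss(bias, spread) for _ in range(n_shots)]
    return statistics.fmean(shots)


rng = random.Random(42)  # fixed seed so the example is reproducible
for n in (10, 1000, 100000):
    me = mean_shot(n, bias=0.0, spread=5.0, rng=rng)       # random error only
    cousin = mean_shot(n, bias=3.0, spread=0.5, rng=rng)   # systematic error
    print(f"{n:>6} shots: my mean = {me:+.2f}, my cousin's mean = {cousin:+.2f}")
```

As the number of shots grows, my average tends to approach the center, while my cousin’s average stays close to his fixed deviation however many shots he takes.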

And here we will leave it for today. We have not said anything about the types of systematic error, of which there are several. They can be divided into selection, information and analysis biases which, in turn, can be divided into many others. But that is another story…