# Science without sense…double nonsense

### Pills on evidence-based medicine


## Achilles and Effects Forest

Achilles. What a man! Definitely one of the main characters in that mess that ensued in Troy because of Helen, a.k.a. the beauty. You know his story. In order to make him invulnerable, his mother, who was none other than Thetis the nymph, bathed him in ambrosia and submerged him in the River Styx. But she made a mistake that no nymph should be allowed: she held him by his right heel, which did not get wet with the river’s water. And so his heel became his only vulnerable spot. Hector did not realize it in time, but Paris, totally on the ball, put an arrow in Achilles’ heel and sent him back to the Styx; not into the water this time, but to the other side. And without Charon the Ferryman.

This story is the origin of the expression “Achilles’ heel”, which usually refers to the weakest or most vulnerable point of someone or something that, otherwise, is usually known for its strength.

For example, something as robust and formidable as meta-analysis has its Achilles’ heel: publication bias. And that is because in the world of science there is no social justice.

All scientific works should have the same opportunities to be published and achieve fame, but the reality is not at all like that, and they can be discriminated against for four reasons: statistical significance, popularity of the topic they deal with, having someone to sponsor them, and the language in which they are written.

These are the main factors that contribute to publication bias. First, studies with significant results are more likely to be published and, among these, more likely still when the effect is larger. This means that studies with negative results or small effects may never be published, so any conclusion drawn only from large studies with positive results will be biased. In the same way, papers on topics of public interest are more likely to be published regardless of the importance of their results. The sponsor also has influence: a company that finances a study of one of its products that has gone wrong will probably not publish it, lest we all find out that the product is useless.

Secondly, as is logical, published studies are more likely to reach our hands than those that never appear in scientific journals. This is the case with doctoral theses, conference communications, reports from government agencies or even studies pending publication by researchers working on the subject at hand. For this reason it is so important to do a search that includes this type of work, which falls under the term grey literature.

Finally, a series of biases can be listed that influence the likelihood that a work will be published or retrieved by the researcher performing the systematic review: language bias (the search is limited by language), availability bias (only studies that are easy for the researcher to retrieve are included), cost bias (only studies that are free or cheap), familiarity bias (only those from the researcher’s own discipline are included), duplication bias (studies with significant results are more likely to be published more than once) and citation bias (studies with significant results are more likely to be cited by other authors).

One may think that this loss of studies during the review cannot be so serious, since it could be argued, for example, that studies not published in peer-reviewed journals are usually of poorer quality, so they do not deserve to be included in the meta-analysis. However, it is not clear that scientific journals ensure the methodological quality of studies, or that this is the only way to do so. There are researchers, such as those at government agencies, who are not interested in publishing in scientific journals but in preparing reports for those who commission them. Moreover, peer review is no guarantee of quality since, all too often, neither the researcher who carries out the study nor those in charge of reviewing it have the methodological training to ensure the quality of the final product.

All this can be worsened by the fact that these same factors can influence the inclusion and exclusion criteria of the meta-analysis primary studies, in such a way that we obtain a sample of articles that may not be representative of the global knowledge on the subject of the systematic review and meta-analysis.

If we have a publication bias, the applicability of the results will be seriously compromised. That is why we say that the publication bias is the true Achilles’ heel of meta-analysis.

If we correctly delimit the inclusion and exclusion criteria of the studies and do a global and unrestricted search of the literature we will have done everything possible to minimize the risk of bias, but we can never be sure of having avoided it. That is why techniques and tools have been devised for its detection.

The most widely used has the sympathetic name of funnel plot. It represents the magnitude of the measured effect (X axis) against a measure of precision (Y axis), which is usually the sample size but can also be the inverse of the variance or the standard error. Each primary study is represented by a point and we observe the point cloud.

In its most usual form, with the sample size on the Y axis, the precision of the results will be higher in the studies with larger samples, so the points will be closer together at the top of the axis and will spread out as they approach the origin of the Y axis. The cloud of points thus takes the form of a funnel with the wide part at the bottom. This graph should be symmetrical; if it is not, we should always suspect publication bias. In the second example attached you can see how there are “missing” studies on the side of lack of effect: this may mean that only studies with positive results are published.
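As a quick illustration, here is a minimal funnel-plot sketch in Python. All the numbers are simulated and the variable names are my own invention; this is just a sketch of the idea, not a tool:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # draw off-screen
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
true_effect = 0.4
sizes = rng.integers(20, 500, size=30)   # sample size of each simulated study
ses = 1 / np.sqrt(sizes)                 # crude standard error: shrinks as n grows
effects = rng.normal(true_effect, ses)   # observed effect of each study

fig, ax = plt.subplots()
ax.scatter(effects, sizes)
ax.axvline(true_effect, linestyle="--")
ax.set_xlabel("Measured effect")
ax.set_ylabel("Sample size")
fig.savefig("funnel.png")
plt.close(fig)
```

With no publication bias the cloud comes out roughly symmetrical around the dashed line; a hole in one of the lower corners is the suspicious pattern described above.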

This method is very simple to use but, sometimes, we may have doubts about the asymmetry of our funnel, especially if the number of studies is small. In addition, the funnel can be asymmetrical because of quality defects in the studies or because we are dealing with interventions whose effect varies with the sample size of each study. For these cases, other more objective methods have been devised, such as Begg’s rank correlation test and Egger’s linear regression test.

Begg’s test studies the association between the effect estimates and their variances. If there is a correlation between them, bad news. The problem with this test is that it has little statistical power, so it is not reliable when the number of primary studies is small.
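The core of Begg’s idea can be sketched with a rank correlation. This is a simplified version (the real Begg–Mazumdar test standardizes the effects first) and all the numbers are invented:

```python
from scipy.stats import kendalltau

# invented data: the larger effects come from the noisier (smaller) studies
effects   = [0.35, 0.48, 0.52, 0.70, 0.95]
variances = [0.010, 0.020, 0.030, 0.060, 0.120]

tau, p_value = kendalltau(effects, variances)
print(tau, p_value)  # a strong positive correlation points to bias
```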

Egger’s test, more specific than Begg’s, consists of fitting the regression line between the precision of the studies (independent variable) and the standardized effect (dependent variable). This regression must be weighted by the inverse of the variance, so I do not recommend that you do it on your own unless you are a consummate statistician. When there is no publication bias, the regression line intercepts the Y axis at zero. The further the intercept is from zero, the more evidence of publication bias.
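A bare-bones version of the idea (unweighted for clarity, unlike the original test; all numbers invented). Here every study estimates exactly the same effect, so the line passes through the origin:

```python
import numpy as np

effects = np.array([0.5, 0.5, 0.5, 0.5])   # identical true effects
ses = np.array([0.05, 0.10, 0.20, 0.40])   # standard errors

precision = 1 / ses            # independent variable
std_effect = effects / ses     # dependent variable (effect in z units)

slope, intercept = np.polyfit(precision, std_effect, 1)
print(intercept)  # an intercept far from zero suggests publication bias
```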

As always, there are computer programs that do these tests quickly without having to burn your brain with the calculations.

What if after doing the work we see that there is publication bias? Can we do something to adjust it? As always, we can.

The simplest way is to use a graphical method called trim and fill. It consists of the following: a) we draw the funnel plot; b) we remove the small studies so that the funnel becomes symmetrical; c) we determine the new center of the graph; d) we put back the removed studies and add their mirror image on the other side of the center line; e) we estimate the effect again.

Another, very conservative attitude we can adopt is to assume that there is a publication bias and ask how much it affects our results, assuming that we have left studies out of the analysis.
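The trim and fill steps can be sketched with a toy example. The real Duval and Tweedie method estimates iteratively how many studies to trim; here I simply pretend we already know which study breaks the symmetry, and all numbers are invented:

```python
effects = [0.2, 0.3, 0.4, 0.6, 0.9]            # observed effects, skewed upwards
trimmed = [0.9]                                # step b: study removed to restore symmetry

symmetric = [e for e in effects if e not in trimmed]
center = sum(symmetric) / len(symmetric)       # step c: new center of the graph
mirrored = [2 * center - e for e in trimmed]   # step d: reflected copies
adjusted = effects + mirrored
adjusted_mean = sum(adjusted) / len(adjusted)  # step e: re-estimated effect
print(center, mirrored, adjusted_mean)
```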

The only way to know whether publication bias really affects our estimates would be to compare the effect in the retrieved and unretrieved studies but, of course, if we could do that we would no longer have to worry about publication bias.

To know whether the observed result is robust or, on the contrary, susceptible to publication bias, two fail-safe N methods have been devised.

The first is Rosenthal’s fail-safe N method. Suppose we have a meta-analysis with an effect that is statistically significant, for example, a risk ratio greater than one with p < 0.05 (or a 95% confidence interval that does not include the null value, one). Then we ask ourselves: how many studies with RR = 1 (the null value) would we have to include before p stops being significant? If few studies are needed (fewer than 10) to nullify the effect, we should worry, because the effect may in fact be null and our significance the product of publication bias. On the contrary, if many studies are needed, the effect is likely to be truly significant. This number of studies is the N in the method’s name.
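Rosenthal’s calculation is simple arithmetic on the z-scores of the primary studies. A sketch with invented values (1.645 is the one-tailed critical z for p = 0.05):

```python
z_scores = [2.0, 1.5, 2.5, 1.0]  # invented z-scores of four primary studies
k = len(z_scores)
z_alpha = 1.645                  # one-tailed critical value for p = 0.05

n_fs = sum(z_scores) ** 2 / z_alpha ** 2 - k
print(n_fs)  # about 14 null studies needed to cancel the significance
```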

The problem with this method is that it focuses on statistical significance and not on the relevance of the results. The correct thing would be to find how many studies are needed for the result to lose clinical relevance, not statistical significance. In addition, it assumes that the effect of the missing studies is null (one in the case of risk ratios and odds ratios, zero in the case of differences in means), when the effect of the missing studies could go in the opposite direction to the effect we detect, or in the same direction but with a smaller magnitude.

To avoid these disadvantages there is a variation of the previous formula that assesses statistical significance and clinical relevance together. With this method, called Orwin’s fail-safe N, we calculate how many studies are needed to bring the value of the effect down to a specific value, which will generally be the smallest effect that is clinically relevant. This method also allows us to specify the average effect of the missing studies.
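Orwin’s version is just as easy to compute; a sketch with invented numbers:

```python
k = 10           # studies in the meta-analysis (invented)
d_obs = 0.5      # observed pooled effect
d_crit = 0.2     # smallest clinically relevant effect
d_missing = 0.0  # assumed average effect of the missing studies

n_orwin = k * (d_obs - d_crit) / (d_crit - d_missing)
print(n_orwin)  # about 15 null studies drag the effect down to 0.2
```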

To end the meta-analysis explanation, let’s see the right way to express the results of the data analysis. To do it well, we can follow the recommendations of the PRISMA statement, which devotes seven of its 27 items to giving us advice on how to present the results of a meta-analysis.

First, we must report the study selection process: how many studies we found and evaluated, how many we selected and how many we rejected, explaining in addition the reasons for doing so. For this, the flowchart that the systematic review from which the meta-analysis proceeds should include, if it complies with the PRISMA statement, is very useful.

Secondly, the characteristics of the primary studies must be specified, detailing the data we extract from each one and their corresponding bibliographic citations, so that any reader of the review can verify the data if they do not trust us. Along the same lines runs the third item, which refers to the evaluation of the risk of bias of the studies and their internal validity.

Fourth, we must present the results of each individual study with summary data for each intervention group analyzed, together with the calculated estimators and their confidence intervals. These data will help us compile the information that PRISMA asks for in its fifth item, referring to the presentation of results: the synthesis of all the meta-analysis studies, their confidence intervals, the results of the homogeneity study, and so on.

This is usually done graphically by means of an effects diagram, a tool popularly known as the forest plot, where the trees are the primary studies of the meta-analysis and where all the relevant results of the quantitative synthesis are summarized.

The Cochrane Collaboration recommends structuring the forest plot in five well differentiated columns. Column 1 lists the primary studies, or the groups or subgroups of patients, included in the meta-analysis. They are usually represented by an identifier composed of the name of the first author and the date of publication. Column 2 shows the results of the effect measures of each study as reported by their respective authors.

Column 3 is the forest plot proper, the graphic part of the subject. It shows the effect measures of each study on both sides of the line of no effect, which we already know lies at zero for mean differences and at one for odds ratios, risk ratios, hazard ratios and the like. Each study is represented by a square whose area is usually proportional to its contribution to the overall result. The square sits on a segment that represents the extremes of its confidence interval.

These confidence intervals inform us about the precision of the studies and tell us which ones are statistically significant: those whose interval does not cross the line of no effect. In any case, do not forget that, even when an interval crosses the line of no effect and is therefore not statistically significant, its boundaries can give us a lot of information about the clinical relevance of the results of each study. Finally, at the bottom of the chart we will find a diamond that represents the overall result of the meta-analysis. Its position with respect to the line of no effect informs us about the statistical significance of the overall result, while its width gives us an idea of its precision (its confidence interval). At the top of this column we will also find the type of effect measure, the analysis model used (fixed or random effects) and the significance level of the confidence intervals (typically 95%).

This chart is usually completed by a fourth column with the estimated weight of each study as a percentage, and a fifth column with the weighted effect estimate of each one. And in some corner of this forest will be the measure of heterogeneity used, together with its statistical significance when relevant.

To conclude the presentation of the results, PRISMA recommends a sixth item with the evaluation of the risk of bias across studies, and a seventh with all the additional analyses that have been necessary: stratification, sensitivity analysis, meta-regression, etc.

As you can see, nothing about meta-analysis is easy. Therefore, the Cochrane Collaboration recommends following a series of steps to correctly interpret the results. Namely:

1. Verify which variable is compared and how. It is usually seen at the top of the forest plot.
2. Locate the measure of effect used. This is logical and necessary to know how to interpret the results. A hazard ratio is not the same as a difference in means or whatever measure was used.
3. Locate the diamond, its position and its amplitude. It is also convenient to look at the numerical value of the global estimator and its confidence interval.
4. Check that heterogeneity has been studied. This can be seen by looking at whether the segments that represent the primary studies are or are not very dispersed and whether they overlap or not. In any case, there will always be a statistic that assesses the degree of heterogeneity. If we see that there is heterogeneity, the next thing will be to find out what explanation the authors give about its existence.
5. Draw our conclusions. We will look at which side of the line of no effect the overall effect and its confidence interval lie. You already know that, even if it is significant, the lower limit of the interval should be as far as possible from the line, because of clinical relevance, which does not always coincide with statistical significance. Finally, look again at the study of homogeneity. If there is a lot of heterogeneity, the results will not be as reliable.

And with this we end the topic of meta-analysis. In fact, the forest plot is not exclusive to meta-analyses and can be used whenever we want to compare studies to elucidate their statistical or clinical significance, or in cases such as equivalence studies, in which the line of no effect is accompanied by the equivalence thresholds. But it still has one more use. A variant of the forest plot also serves to assess whether there is publication bias in a systematic review although, as we already know, in that case we change its name to funnel plot. But that is another story…

## Apples and pears

You all surely know the Chinese tale of the poor solitary grain of rice that falls to the ground and nobody hears it. Of course, if instead of one grain a whole sack of rice falls, that is something else. There are many examples of unity making strength. A red ant is harmless, unless it bites you in some soft and noble area, which are usually the most sensitive. But what about a swarm of millions of red ants? Now that is scary, because if they all band together and come for you, there is little you could do to stop their push. Yes, unity is strength.

And this also happens with statistics. With a relatively small sample of well-chosen voters we can estimate who will win an election in which millions vote. So, what could we not do with a lot of those samples? Surely the estimate would be more reliable and more generalizable.

Well, this is precisely one of the purposes of meta-analysis, which uses various statistical techniques to make a quantitative synthesis of the results of a set of studies that, although they try to answer the same question, do not reach exactly the same result. But beware: we cannot combine studies to draw conclusions about their sum without first taking a series of precautions. It would be like mixing apples and pears which, I am not sure why, must be something terribly dangerous, because everyone knows it is something to avoid.

Think that we have a set of clinical trials on the same topic and want to do a meta-analysis to obtain a global result. It is more than convenient that there is as little variability as possible among the studies if we want to combine them. Because, ladies and gentlemen, here the saying also rules: together, but not scrambled.

Before thinking about combining the results of the studies of a systematic review to perform a meta-analysis, we must always make a previous study of the heterogeneity of the primary studies, which is nothing more than the variability that exists among the estimators that have been obtained in each of those studies.

First, we will investigate possible causes of heterogeneity, such as differences in treatments, variability of the populations of the different studies and differences in the designs of the trials. If there is a great deal of heterogeneity from the clinical point of view, perhaps the best thing to do is not to do meta-analysis and limit the analysis to a qualitative synthesis of the results of the review.

Once we come to the conclusion that the studies are similar enough to try to combine them we should try to measure this heterogeneity to have an objective data. For this, several privileged brains have created a series of statistics that contribute to our daily jungle of acronyms and letters.

Until recently, the most famous of those initials was Cochran’s Q, which has nothing to do with James Bond or with our friend Archie Cochrane. Its calculation takes into account the sum of the deviations between each primary study result and the global outcome (squared, to avoid positives cancelling negatives), weighting each study according to its contribution to the overall result. It looks awesome but in reality it is no big deal; ultimately, it is no more than an aristocratic relative of the chi-square test. Indeed, Q follows a chi-square distribution with k-1 degrees of freedom (where k is the number of primary studies). We calculate its value, look at the frequency distribution and estimate the probability of seeing such differences by chance alone, in order to decide whether we can reject our null hypothesis (which assumes that the observed differences among studies are due to chance). But, despite appearances, Q has a number of weaknesses.

First, it is a very conservative parameter, and we must always keep in mind that lack of statistical significance is not always synonymous with absence of heterogeneity: in fact, we merely fail to reject the null hypothesis, so when we accept it we run the risk of committing a type II error and blundering. For this reason, some people propose using a significance level of p < 0.1 instead of the standard p < 0.05. Another of Q’s pitfalls is that it does not quantify the degree of heterogeneity and, of course, does not explain its causes. And, to top it off, Q loses power when the number of studies is small and does not allow comparisons among meta-analyses with different numbers of studies.
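For the curious, Q can be computed in a few lines. The numbers are invented; scipy supplies the chi-square distribution:

```python
from scipy.stats import chi2

effects = [0.1, 0.2, 0.3]        # invented effect estimates
variances = [0.01, 0.01, 0.01]   # and their variances
weights = [1 / v for v in variances]

pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
Q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
p = chi2.sf(Q, df=len(effects) - 1)   # chi-square with k-1 degrees of freedom
print(Q, p)
```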

This is why another statistic has been devised that is much more celebrated today: I2. This parameter provides an estimate of the variation among studies with respect to the total variability or, put another way, the proportion of variability that is due to actual differences among the estimates rather than to chance. It also looks impressive, but it is actually an advantageous relative of the intraclass correlation coefficient.

Its value ranges from 0 to 100%, and we usually take 25%, 50% and 75% as the limits of low, moderate and high heterogeneity, respectively. I2 is affected neither by the units in which the effect is measured nor by the number of studies, so it does allow comparisons between meta-analyses with different effect measures or different numbers of studies.

If you read a study that provides Q and you want to calculate I2, or vice versa, you can use the following formula, where k is the number of primary studies:

$I^{2}=\frac{Q-(k-1)}{Q}$

There is a third parameter that is less known, but no less worthy of mention: H2. It measures the excess of the value of Q with respect to the value we would expect if there were no heterogeneity. Thus, a value of 1 means no heterogeneity, and its value increases as heterogeneity among studies does. But its real interest is that it allows us to calculate confidence intervals for I2.
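Both statistics follow directly from Q and k; a sketch with invented values:

```python
Q, k = 10.0, 5   # invented values

H2 = Q / (k - 1)                         # 1 means no heterogeneity
I2 = max(0.0, (Q - (k - 1)) / Q) * 100   # % of variability beyond chance
print(H2, I2)
```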

Other times, the authors perform a hypothesis test with a null hypothesis of no heterogeneity and use a chi-square or some similar statistic. In these cases, what they provide is a statistical significance value. If p < 0.05, the null hypothesis can be rejected and we can say that there is heterogeneity. Otherwise we will say that we cannot reject the null hypothesis of no heterogeneity.

In summary, whenever we see an indicator of heterogeneity expressed as a percentage, it indicates the proportion of variability that is not due to chance. For their part, when they give us a p value, there is significant heterogeneity when p is less than 0.05.

Do not worry about the calculations of Q, I2 and H2. For that there are specific programs, such as RevMan, or modules within the usual statistical packages that do the same job.

A point of attention: always remember that not being able to demonstrate heterogeneity does not always mean that the studies are homogeneous. The problem is that the null hypothesis assumes that they are homogeneous and the differences are due to chance. If we can reject it we can assure that there is heterogeneity (always with a small degree of uncertainty). But this does not work the other way around: if we cannot reject it, it simply means that we cannot reject that there is no heterogeneity, but there will always be a probability of committing a type II error if we directly assume that the studies are homogeneous.

For this reason, a series of graphical methods have been devised to inspect the studies and check for signs of heterogeneity even when the numerical parameters say otherwise.

The most employed of them is, perhaps, the Galbraith plot, which can be used for meta-analyses of both trials and observational studies. This graph represents the precision of each study against its standardized effect. It also shows the fitted regression line and two confidence bands. The position of each study along the precision axis indicates its weighted contribution to the overall result, while its location outside the confidence bands indicates its contribution to heterogeneity.

Galbraith’s graph can also be useful for detecting sources of heterogeneity, since studies can be labeled according to different variables and see how they contribute to the overall heterogeneity.

Another tool available for meta-analyses of clinical trials is L’Abbé’s plot. It represents the response rate to treatment versus the response rate in the control group, plotting the studies on both sides of the diagonal. Above that line are studies with an outcome favorable to treatment, while below it are studies with an outcome favorable to the control intervention. The studies are usually plotted with an area proportional to their precision, and their dispersion indicates heterogeneity. Sometimes L’Abbé’s plot provides additional information. For example, in the accompanying graph you can see that low-risk studies are located mainly below the diagonal, while high-risk studies are mainly located in areas favorable to treatment. This distribution, besides being suggestive of heterogeneity, may suggest that the efficacy of the treatment depends on the level of risk or, put another way, that we have an effect-modifying variable in our study. A small drawback of this tool is that it is only applicable to meta-analyses of clinical trials in which the dependent variable is dichotomous.

Well, suppose we study heterogeneity and decide that we are going to combine the studies into a meta-analysis. The next step is to analyze the effect size estimators of the studies, weighting them according to the contribution each study will make to the overall result. This is logical: a trial with few participants and an imprecise result cannot contribute the same to the final result as another with thousands of participants and a more precise effect measure.

The most usual way to take these differences into account is to weight the effect size estimate of each study by the inverse of the variance of its results, then perform the analysis to obtain the average effect. For this there are several approaches, some of them very complex from the statistical point of view, although the two most commonly used are the fixed effect model and the random effects model. The two models differ in their conception of the starting population from which the primary studies of the meta-analysis come.

The fixed effect model considers that there is no heterogeneity and that all studies estimate the same effect size of the population (they all measure the same effect, that is why it is called a fixed effect), so it is assumed that the variability observed among the individual studies is due only to the error that occurs when performing the random sampling in each study. This error is quantified by estimating intra-study variance, assuming that the differences in the estimated effect sizes are due only to the use of samples from different subjects.

On the other hand, the random effects model assumes that the effect size varies in each study and follows a normal frequency distribution within the population, so each study estimates a different effect size. Therefore, in addition to the intra-study variance due to the error of random sampling, the model also includes the variability among studies, which would represent the deviation of each study from the mean effect size. These two error terms are independent of each other, both contributing to the variance of the study estimator.

In summary, the fixed effect model incorporates only one error term for the variability of each study, while the random effects model adds, in addition, another error term due to the variability among the studies.
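Both models can be sketched side by side on invented data, using the DerSimonian-Laird estimate of the between-study variance (tau2); note how the random effects standard error, and therefore its interval, comes out wider:

```python
import math

effects = [0.1, 0.5, 0.9]       # invented effect estimates
variances = [0.04, 0.04, 0.04]  # intra-study variances

w = [1 / v for v in variances]
s1, s2 = sum(w), sum(x * x for x in w)
mu_fixed = sum(wi * e for wi, e in zip(w, effects)) / s1
se_fixed = math.sqrt(1 / s1)

# DerSimonian-Laird estimate of the between-study variance
Q = sum(wi * (e - mu_fixed) ** 2 for wi, e in zip(w, effects))
df = len(effects) - 1
tau2 = max(0.0, (Q - df) / (s1 - s2 / s1))

w_r = [1 / (v + tau2) for v in variances]  # weights gain the extra error term
mu_random = sum(wi * e for wi, e in zip(w_r, effects)) / sum(w_r)
se_random = math.sqrt(1 / sum(w_r))
print(mu_fixed, se_fixed, mu_random, se_random)
```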

You see that I have not written a single formula. We do not actually need to know them and they are quite unfriendly, full of Greek letters that no one understands. But do not worry. As always, statistical programs like RevMan from the Cochrane Collaboration allow you to do the calculations in a simple way, including and removing studies from the analysis and changing the model as you wish.

The type of model to choose has its importance. If in the previous homogeneity analysis we see that the studies are homogeneous we can use the fixed effect model. But if we detect that heterogeneity exists, within the limits that allow us to combine the studies, it will be preferable to use the random effects model.

Another consideration is the applicability or external validity of the results of the meta-analysis. If we have used the fixed effect model, we will be limited to generalizing the results to populations with characteristics similar to those of the included studies. This does not happen with the results obtained using the random effects model, whose external validity is greater because they come from studies of different populations.

In any case, we will obtain a summary effect measure along with its confidence interval. This interval will be statistically significant when it does not cross the line of no effect which, as we already know, lies at zero for mean differences and at one for odds ratios and risk ratios. In addition, the amplitude of the interval informs us about the precision of the estimation of the average effect in the population: the wider, the less precise, and vice versa.

If you think about it a bit, you will immediately understand why the random effects model is more conservative than the fixed effect model, in the sense that the confidence intervals obtained are less precise, since it incorporates more variability into its analysis. In some cases the estimator may be significant with the fixed effect model and not with the random effects model, but this should not condition our choice of model. We must always rely on the previous measure of heterogeneity although, if in doubt, we can also use both models and compare the results.

Having examined the homogeneity of the primary studies, we may come to the grim conclusion that heterogeneity dominates the situation. Can we do something to manage it? Sure we can. We can always choose not to combine the studies, or to combine them despite the heterogeneity and obtain a summary result but, in that case, we should also report some measure of between-study variability, and even then we could not be sure of our results.

Another possibility is to do a stratified analysis according to the variable that causes the heterogeneity, provided we are able to identify it. For this we can do a sensitivity analysis, repeating the calculations while removing the subgroups one by one and checking how each influences the overall result. The problem is that this approach sidesteps the final purpose of any meta-analysis, which is none other than obtaining an overall value from homogeneous studies.
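A leave-one-out sensitivity analysis can be sketched like this (invented data; each entry of leave_one_out is the pooled effect without the corresponding study):

```python
effects = [0.1, 0.5, 0.9, 0.45, 0.55]        # invented data
variances = [0.04, 0.04, 0.04, 0.02, 0.02]

def pooled(es, vs):
    """Inverse-variance weighted mean."""
    w = [1 / v for v in vs]
    return sum(wi * e for wi, e in zip(w, es)) / sum(w)

overall = pooled(effects, variances)
leave_one_out = [
    pooled(effects[:i] + effects[i + 1:], variances[:i] + variances[i + 1:])
    for i in range(len(effects))
]
print(overall, leave_one_out)  # a large jump flags an influential study
```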

Finally, the brainiest on these issues can use meta-regression. This technique is similar to multivariate regression models: the characteristics of the studies are used as explanatory variables, and the effect variable, or some measure of the deviation of each study from the global result, is used as the dependent variable. The studies should also be weighted according to their contribution to the overall result, and we should try not to cram too many coefficients into the regression model if the number of primary studies is not large. I would not advise you to do a meta-regression at home without adult supervision.
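In spirit, meta-regression is a weighted regression at the study level; a minimal sketch with invented data (here the effects sit exactly on a line in the moderator, so the fit recovers it):

```python
import numpy as np

moderator = np.array([40.0, 50.0, 60.0, 70.0])  # study-level covariate (mean age, say)
effects = 0.1 + 0.01 * moderator                # effects lie exactly on a line
variances = np.array([0.04, 0.02, 0.02, 0.04])

w = np.sqrt(1 / variances)   # polyfit expects the square root of the weights
slope, intercept = np.polyfit(moderator, effects, 1, w=w)
print(intercept, slope)      # the known coefficients are recovered
```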

And we only need to check that we have not omitted any studies and that we have presented the results correctly. Meta-analysis data are usually represented in a specific graph known as a forest plot. But that is another story…

## Three feet of a cat

To look for three feet of a cat, or splitting hairs, is a popular Spanish saying. It seems that when one looks for three feet of a cat, one tries to demonstrate something impossible, generally with tricks and deceptions. As English speakers say, if it ain't broke, don't fix it. In fact, the original saying referred to looking for five feet instead of three. That seems more logical since, as cats have four feet, finding three of them is easy, while finding five is impossible, unless we count the cat's tail as another foot, which does not make much sense.

But today we will not talk about cats with three, four or five feet. Let's talk about something a little more ethereal: multiple linear regression models. This is a cat with many feet, but we are going to focus on only three of them, called collinearity, tolerance and the variance inflation factor. Do not be discouraged, it's easier than it may seem.

We saw in a previous post how simple linear regression models relate two variables to each other, so that variations in one of them (the independent or predictor variable) can be used to calculate how the other (the dependent variable) will change. These models are represented by the equation y = a + bx, where x is the independent variable and y the dependent variable.

However, multiple linear regression adds more independent variables, allowing us to predict the dependent variable from the values of the predictor or independent variables. The generic formula would be as follows:

y = a + b1x1 + b2x2 + b3x3 + … + bnxn, where n is the number of independent variables.

One of the conditions for the multiple linear regression models to work properly is that the independent variables are actually independent and uncorrelated.

Imagine an absurd example in which we put into the model the weight in kilograms and the weight in pounds. Both variables will vary in exactly the same way. In fact, their correlation coefficient, R, will be 1, since they practically represent the same variable. Such foolish examples are hard to find in scientific work, but there are less obvious ones (for example, including height and body mass index, which is calculated from weight and height) and others that are not at all evident to the researcher. This is what is called collinearity, which is nothing more than the existence of a linear association among the set of independent variables.

Collinearity is a serious problem for the multivariate model, since the estimates it produces become very unstable, as it gets harder to separate the effect of each predictor variable.

Well, to determine whether our model suffers from collinearity we can construct a correlation matrix showing the coefficients, R, between each pair of variables. When we observe high values of R, we can suspect collinearity. However, if we want to quantify it we turn to the other two feet of the cat that we mentioned at the beginning: tolerance and the variance inflation factor.

If we square the coefficient R we obtain the coefficient of determination (R2), which represents the percentage of the variation (or variance) of a variable that is explained by the variation in the other variable. Thus, we find the concept of tolerance, which is calculated as the complement of R2 (1-R2) and represents the proportion of the variability of that variable that is not explained by the rest of the independent variables included in the regression model.

In this way, the lower the tolerance, the more likely there is collinearity. Collinearity is generally considered to exist when R2 is greater than 0.9 and therefore the tolerance is below 0.1.

We only have to explain the third foot, the variance inflation factor. This is calculated as the inverse of the tolerance (1/T) and represents how much the variance of a variable's estimated coefficient is inflated by that variable's correlation with the rest of the predictors in the model. Of course, the greater the variance inflation factor, the greater the likelihood of collinearity. Generally, collinearity is considered to exist when a variable's inflation factor is greater than 10 or when the mean of the inflation factors of all the independent variables is much greater than one.
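As an illustrative sketch of these three concepts, here is how tolerance and the variance inflation factor could be computed for three simulated predictors, two of them deliberately correlated (all names and numbers here are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Three simulated predictors: weight is built from height, so the two
# are strongly correlated; age is independent of both (made-up model)
height = rng.normal(170, 10, n)                    # cm
weight = 0.9 * height - 80 + rng.normal(0, 2, n)   # kg, tracks height closely
age = rng.normal(40, 12, n)                        # years

def tolerance_and_vif(target, others):
    """Regress one predictor on the rest: tolerance = 1 - R2, VIF = 1 / tolerance."""
    X = np.column_stack([np.ones(len(target))] + others)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coef
    r2 = 1 - resid.var() / target.var()
    return 1 - r2, 1 / (1 - r2)

results = {}
for name, y, rest in [("height", height, [weight, age]),
                      ("weight", weight, [height, age]),
                      ("age", age, [height, weight])]:
    tol, vif = tolerance_and_vif(y, rest)
    results[name] = vif
    print(f"{name}: tolerance {tol:.3f}, VIF {vif:.1f}")
```

The two correlated variables come out with inflation factors well above 10, while the independent one stays close to 1, matching the thresholds mentioned above.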

And here we will leave the multivariate models for today. Needless to say, everything we have discussed is done in practice using computer programs that calculate these parameters effortlessly.

We have seen here some aspects of multiple linear regression, perhaps the most widely used multivariate model. But there are others, such as multivariate analysis of variance (MANOVA), factor analysis, or cluster analysis. But that is another story…

## In search of causality

In medicine we often try to look for cause-effect relationships. If we want to show that drug X produces an effect, we only have to select two groups of people, give the drug to one group and not to the other, and see if there are differences.

But it is not so simple, because we can never be sure that the differences in effect between the two groups are actually due to the treatment and not to other factors. These are the so-called confounding factors, which may be known or unknown and which may bias the results of the comparison.

To solve this problem, a key element of the clinical trial was invented: randomization. If we divide the participants between the two branches of the trial at random, the confounding variables will be distributed homogeneously between the two arms, so any difference between them will have to be due to the intervention. Only in this way can we establish cause-effect relationships between our exposure or treatment and the outcome variable we measure.

The problem of quasi-experimental and observational studies is that they lack randomization. For this reason, we can never be sure that the differences are due to exposure and not to any confounding variable, so we cannot safely establish causal relationships.

This is an annoying inconvenience, since it will often be impossible to carry out randomized trials for ethical or economic reasons, because of the nature of the intervention, or whatever. That is why some tricks have been invented to establish causal relations in the absence of randomization. One of these techniques is the propensity score, which we saw in an earlier post. Another is the one we are going to develop today, which has the nice name of regression discontinuity.

Regression discontinuity is a quasi-experimental design that allows causal inference in the absence of randomization. It can be applied when the exposure of interest is assigned, at least partially, according to whether the value of a continuous variable falls above or below a certain threshold.

Consider, for example, a hypocholesterolemic drug that we will use when LDL cholesterol rises above a given value, or an antiretroviral therapy in an AIDS patient that we will indicate when his CD4 count falls below a certain value. There is a discontinuity in the threshold value of the variable that produces a sudden change in the probability of assignment to the intervention group, as I show in the attached figure.

In these cases, where the allocation of treatment depends, at least in part, on the value of a continuous variable, the allocation in the vicinity of the threshold is almost as good as random. Why? Because determinations are subject to random variability from sampling error (in addition to the variability of the biological variables themselves). This makes the individuals very close to the threshold, whether just above or just below it, very similar in terms of the variables that may act as confounders, since ending up on one side or the other may depend on the random variability of the measurement itself. The situation is similar to what happens in a clinical trial; at the end of the day, we may think of a clinical trial as nothing more than a discontinuity design in which the threshold is a random number.

The math of regression discontinuity is not for beginners so I do not intend to explain it here (I would have to understand it first), so we will settle for knowing some terms that will help us to understand the works that use this methodology.

Regression discontinuity may be sharp or fuzzy. In the sharp one, the probability of assignment changes from zero to one at the threshold (the allocation of treatment follows a deterministic rule). For example, treatment is initiated when the threshold is crossed, regardless of other factors. On the other hand, in the fuzzy regression there are other factors at stake that make the probability of allocation change in the threshold, but not from zero to one, but may depend on those other factors added.

Thus, the result of the regression model varies somewhat depending on whether it is a sharp or fuzzy regression discontinuity. In the case of sharp regression, the so-called average causal effect is calculated, according to which participants are assigned to the intervention with certainty if they cross the threshold. In the case of fuzzy regression, the allocation is no longer performed according to a deterministic model, but according to a probabilistic one (according to the threshold value and other factors that the researcher may consider important). In these cases, an intention-to-treat analysis should be done according to the difference in the probability of allocation near the cut-off point (some may not exceed the threshold but be assigned to the intervention because the investigator considers the other factors).

Thus, the probabilistic model will have to measure the effect on the compliers (those assigned to the intervention), so the regression model will give us the complier average causal effect, which is the typical measure of fuzzy regression discontinuity.
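To make the sharp design a little more tangible, here is a small simulation sketch (all numbers invented): treatment is assigned by the deterministic rule at a threshold, and the jump in the outcome at the cut-off is estimated with separate local linear fits on each side:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000

# Running variable (think of a standardized lab value), threshold at 0
x = rng.uniform(-1, 1, n)

# Sharp design: the deterministic rule assigns treatment when x >= 0
treated = (x >= 0).astype(float)

# Simulated outcome: a smooth trend in x plus a true treatment jump of 2.0
true_effect = 2.0
y = 1.0 + 0.5 * x + true_effect * treated + rng.normal(0, 0.3, n)

# Estimate the jump at the threshold with separate local linear fits
# on each side, within a bandwidth h around the cut-off
h = 0.5
left = (x < 0) & (x > -h)
right = (x >= 0) & (x < h)
slope_l, intercept_l = np.polyfit(x[left], y[left], 1)
slope_r, intercept_r = np.polyfit(x[right], y[right], 1)

estimated_effect = intercept_r - intercept_l
print(f"estimated jump at the threshold: {estimated_effect:.2f}")
```

The difference between the two intercepts at the cut-off recovers the simulated treatment effect, which is the intuition behind the average causal effect in the sharp design.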

And I think we’re going to leave it for today. We have not said anything about the regression equation, but suffice it to say that it takes into account the slopes of the probability function of allocation before and after the threshold and an interaction variable for the possibility that the effects of the treatment are heterogeneous on both sides of the threshold. As you see, everything is quite complicated, but for that are the statistical packages like R or Stata that implement these models with little effort.

Finally, to say only that it is usual to see models that use linear regression for quantitative outcome variables, but there are extensions of the model that use dichotomous variables and logistic regression techniques, and even models with survival studies and time-to-event variables. But that is another story…

## Censorship

In its best-known sense, censorship is the action of examining a work intended for the public and suppressing or modifying the parts that do not fit certain political, moral or religious criteria, to determine whether or not it can be published or exhibited. So what do we mean in statistics when we talk about censored data? Nothing to do with politics, morality or religion. In order to explain what a censored datum is, we must first discuss time-to-event variables and survival analysis.

In general, we can say that there are three types of variables: quantitative, qualitative and time-to-event. The first two are fairly well understood in general, but time-to-event variables are a little more complicated to understand.

Imagine that we want to study the mortality of that terrible disease that is fildulastrosis. We could count the number of deaths at the end of the study period and divide them by the total population at the beginning. For example, if at the beginning there are 50 patients and four die during follow-up, we could calculate the mortality as 4/50 = 0.08, or 8%. Thus, if we have followed the population for five years, we can say that the five-year survival of the disease is 92% (100 − 8 = 92).

Simple, isn’t it? The problem is that this is only valid when all subjects have the same follow-up period and no losses or dropouts occur throughout the study, a situation that is often far from the reality in most cases.

In these cases, the correct thing to do is to measure not only whether death occurs (which would be a dichotomous variable), but also when it occurs, taking into account the unequal follow-up periods and the losses. Thus, we use a time-to-event variable, which is composed of a dichotomous variable (the event being measured) and a continuous variable (the follow-up time at which it occurs).

Following the example above, participants in the study could be classified into three types: those who die during follow-up, those who remain alive at the end of the study, and those who are lost during follow-up.

Of those who die we can calculate their survival but, what about the survival of those who are alive at the end of the study? And of those who are lost during follow-up? It is clear that some of the lost may have died before the end of the study without us detecting it, so our measure of mortality will not be accurate.

And this is where we find the censored data. All those who do not present the event during the survival study are called censored (losses and those who finish the study without presenting the event). The importance of these censored data is that they must be taken into account when doing the survival study, as we will see below.

The methodology to be followed is to create a survival table that takes into account the events (in this case the deaths) and the censored data, as we can see in the attached table.

The columns of the table represent the following: x, the year of follow-up; Nx, the number of participants alive at the beginning of that year; Cx, the number of losses (censored) during that year; Mx, the number of deaths during that period; PD, the probability of dying in that period; PPS, the probability of surviving that period (the probability of not presenting the event); and PGS, the global probability of survival up to that point. As we see, in the first year we start with 50 participants, one of whom dies. The probability of dying in that period is 1/50 = 0.02, so the probability of surviving the period (which equals the global probability, since it is the first period) is 1 − 0.02 = 0.98.

In the second period we start with 49 and no one dies or is lost. The PD in the period is zero and survival one. Thus, the overall probability will be 1×0.98 = 0.98.

In the third period we continue with 49. Two are lost and one dies. The PD is 1/49 = 0.0204 and the PPS is 1 − 0.0204 = 0.9796. If we multiply the PPS by the global probability of the previous period, we obtain the overall survival for this period: 0.9796 × 0.98 = 0.96.

In the fourth period we started with 46 participants, resulting in five losses and two deaths. The PD will be 2/46 = 0.0434, the PPS of 1-0.0434 = 0.9566 and the PGS of 0.9566×0.96 = 0.9183.

And last, in the fifth period we started with 39 participants. We have two censored and no event (death). PD is zero, PPS is equal to one (no one dies in this period) and PGS 1×0.9183 = 0.9183.

Finally, taking into account the censored data, we can say that the overall survival at five years of fildulastrosis is 91.83%.
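The whole table can be reproduced with a few lines of Python, using the deaths and censored counts from the example:

```python
# Life-table (actuarial) survival calculation for the fildulastrosis
# example: 50 patients followed for five years, with the deaths and
# censored data (losses) per period taken from the table in the text
at_risk = 50
deaths = [1, 0, 1, 2, 0]
censored = [0, 0, 2, 5, 2]

overall_survival = 1.0
for year, (d, c) in enumerate(zip(deaths, censored), start=1):
    p_die = d / at_risk                  # PD: probability of dying this period
    p_survive = 1 - p_die                # PPS: probability of surviving the period
    overall_survival *= p_survive        # PGS: cumulative (global) survival
    print(f"year {year}: n={at_risk}, PD={p_die:.4f}, "
          f"PPS={p_survive:.4f}, PGS={overall_survival:.4f}")
    at_risk -= d + c                     # at risk at the start of the next period

print(f"five-year survival: {overall_survival:.2%}")
```

The final line reproduces the 91.83% five-year survival calculated step by step in the text.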

And with this we are going to leave it for today. We have seen how a survival table with censored data is constructed to take into account unequal follow-up of participants and losses during follow-up.

Only two thoughts before finishing. First, even if we talk about survival analysis, the event does not have to be the death of the participants. It can be any event that occurs throughout the study follow-up.

Second, the time-to-event and censored data are the basis for performing other statistical techniques that estimate the probability of occurrence of the event under study at a given time, such as the Cox regression models. But that is another story…

## A case of misleading probability

Today we are going to see another of those examples where intuition about the value of certain probabilities plays tricks on us. And, for that, we will use nothing less than Bayes’ theorem, playing a little with conditioned probabilities. Let’s see step by step how it works.

What is the probability of two events occurring? The probability of an event A occurring is P(A) and that of B, P(B). Well, the probability of the two occurring is P(A∩B), which, if the two events are independent, is equal to P(A) x P(B).

Imagine that we have a six-faced die. If we throw it once, the probability of getting, say, a five is 1/6 (one result among six possible). The probability of getting a four is also 1/6. What will be the probability of getting a four after getting a five in the first roll? Since the two rolls are independent, the probability of the combination five followed by four will be 1/6 x 1/6 = 1/36.

Now let’s think of another example. Suppose that in a group of 10 people there are four doctors, two of whom are surgeons. If we take one at random, the probability of being a doctor is 4/10 = 0.4 and that of a surgeon is 2/10 = 0.2. But if we get one and know that he is a doctor, the probability that he is a surgeon will no longer be 0.2, because the two events, being a doctor and a surgeon, are not independent. If you are a doctor, the probability that you are a surgeon will be 0.5 (half the doctors in our group are surgeons).
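Both examples can be checked with exact fractions:

```python
from fractions import Fraction

# Two independent die rolls: the probabilities simply multiply
p_five = Fraction(1, 6)
p_four = Fraction(1, 6)
p_five_then_four = p_five * p_four
print(p_five_then_four)            # 1/36

# Dependent events: 10 people, 4 doctors, 2 of whom are surgeons
p_doctor = Fraction(4, 10)
p_doctor_and_surgeon = Fraction(2, 10)   # every surgeon is also a doctor
p_surgeon_given_doctor = p_doctor_and_surgeon / p_doctor
print(p_surgeon_given_doctor)      # 1/2
```

Dividing the joint probability by the conditioning event's probability is exactly the definition of conditional probability that the next paragraphs generalize.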

When two events are dependent, the probability of both occurring is the probability of the first occurring, given that the second has occurred, multiplied by the probability of the second. So P(surgeon ∩ doctor) = P(surgeon|doctor) x P(doctor). We can generalize the expression as follows:

P(A∩B) = P(A|B) x P(B), and changing the order of the components of the expression, we obtain the so-called Bayes rule, as follows:

P(A|B) = P(A∩B) / P(B).

The P(A∩B) is the probability of B occurring once A has occurred, multiplied by the probability of A, that is, P(B|A) x P(A). On the other hand, the probability of B equals the probability of B occurring when A occurs plus the probability of B occurring when A does not occur which, put in mathematical form, looks like this:

P(B) = P(B|A) x P(A) + P(B|Ac) x P(Ac), where P(Ac) is the probability of A not occurring.

If we substitute these developed values into the initial rule, we obtain the best-known expression of Bayes' theorem:

P(A|B) = P(B|A) x P(A) / [P(B|A) x P(A) + P(B|Ac) x P(Ac)]

Let’s see how the Bayes theorem is applied with a practical example. Consider the case of acute fildulastrosis, a serious disease whose prevalence in the population is, fortunately, quite low, one per 1000 inhabitants. Then, the P(F) = 0.001.

Luckily we have a good diagnostic test, with a sensitivity of 98% and a specificity of 95%. Suppose now that I take the test and it gives me a positive result. Do I have to scare myself a lot? What is the probability that I actually have the disease? Do you think it will be high or low? Let’s see.

A sensitivity of 98% means that the probability of testing positive when having the disease is 0.98. Mathematically, P(POS|F) = 0.98. On the other hand, a specificity of 95% means that the probability of testing negative when healthy is 0.95. That is, P(NEG|Fc) = 0.95. But what we want to know is neither of these two things: what we are really looking for is the probability of being sick once we have tested positive, that is, P(F|POS).

To calculate it, we only have to apply Bayes' theorem:

P(F|POS) = P(POS|F) x P(F) / [P(POS|F) x P(F) + P(POS|Fc) x P(Fc)]

Then we replace the symbols with their values, noting that P(POS|Fc) = 1 − specificity = 0.05, and solve the equation:

P(F|POS) = (0.98 x 0.001) / (0.98 x 0.001 + 0.05 x 0.999) ≈ 0.02

So we see that, in principle, I do not have to be very scared when the test gives me a positive result, since the probability of being ill is only about 2%. As you see, much lower than intuition would suggest with such a high sensitivity and specificity. Why does this happen? Very simple: because the prevalence of the disease is very low. Let's repeat the experiment assuming now that the prevalence is 10% (0.1):

As you see, in this case the probability of being ill if I test positive rises to 68%. This probability is known as the positive predictive value which, as we can see, can vary greatly depending on the frequency of the disease we are studying.
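Both figures are easy to reproduce with a short helper function (the name `ppv` is just for this example):

```python
def ppv(prevalence, sensitivity, specificity):
    """Bayes' theorem: probability of disease given a positive test."""
    p_pos_given_disease = sensitivity
    p_pos_given_healthy = 1 - specificity   # false positive rate
    numerator = p_pos_given_disease * prevalence
    denominator = numerator + p_pos_given_healthy * (1 - prevalence)
    return numerator / denominator

# Fildulastrosis: sensitivity 98%, specificity 95%
print(f"PPV at 0.1% prevalence: {ppv(0.001, 0.98, 0.95):.3f}")   # about 0.02
print(f"PPV at 10% prevalence:  {ppv(0.10, 0.98, 0.95):.3f}")    # about 0.68
```

Same test, same sensitivity and specificity; only the prevalence changes, and the predictive value moves from 2% to 68%.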

And here we leave it for today. Before closing, let me warn you not to go looking for what fildulastrosis is. I would be very surprised if anyone found it in a medical book. Also, be careful not to confuse P(POS|F) with P(F|POS), since you would be committing an error called the inverse fallacy or fallacy of the transposed conditional, which is a serious mistake.

We have seen how the calculation of probabilities gets somewhat complicated when the events are not independent. We have also learned how unreliable predictive values are when the prevalence of the disease changes. That is why the likelihood ratios were invented, which do not depend so much on the prevalence of the disease that is diagnosed and allow a better overall assessment of the power of the diagnostic test. But that is another story…

## Do not be misled by the outliers

We saw in a previous post that the extreme values of a distribution, called outliers, can skew statistical estimates we calculate in our sample.

A typical example is the arithmetic mean, which moves in the direction of the extreme values, if there are any, the more so the more extreme these values are. We saw that, to avoid this inconvenience, there is a family of relatives of the arithmetic mean that are considered robust or, what amounts to the same thing, less sensitive to the presence of outliers. The best known of these is the median, although there are some more, such as the trimmed mean, the winsorized mean, the weighted mean, the geometric mean, etc.

Well, something similar to what happens with the mean also occurs with the standard deviation, the most frequently used measure of scale or dispersion. The standard deviation is biased by the presence of extreme values, yielding values that are unrepresentative of the actual spread of the distribution.

Consider the example we used when talking about the robust estimators of the mean. Suppose we measure serum cholesterol levels in a group of people and find the following values (in mg/dl): 166, 143, 154, 168, 435, 159, 185, 155, 167, 152, 152, 168, 177, 171, 183, 426, 163, 170, 152 and 155. As you can see, there are two extreme values (426 and 435 mg/dl) that will bias our mean and our standard deviation. In our case, the standard deviation comes out at about 83 mg/dl, clearly not representative of the dispersion of the bulk of the values around whichever robust measure of centralization we choose.

What can we do in this case? Well, we can use one of the robust estimators of dispersion, of which there are several, some arising from the robust estimators of the mean. Here are a few.

The first, which arises from the median, is the median absolute deviation (MAD). If you remember, the standard deviation is the square root of the sum of the squared differences of each value from the mean, divided by the number of elements, n (or n − 1 if we want an unbiased estimator of the population standard deviation). In a similar fashion, we can calculate the median of the absolute deviations of each value from the median of the sample, according to the following formula:

MAD = Median {|Xi – Me|}, from i=1 to n.

We can calculate it in our example and see that its value is 17.05 mg/dl (the raw MAD is 11.5 mg/dl; multiplying it by the usual consistency constant of 1.4826 makes it comparable to the standard deviation of a normal distribution), much better adjusted than the standard deviation.

The second is calculated from the trimmed mean. This, as the name suggests, is calculated after cutting a certain percentage off both ends of the ordered distribution. For example, to calculate the 20% trimmed mean in our example we would remove 10% per side (two elements per side: 143, 152, 426 and 435) and calculate the arithmetic mean with the rest. Likewise, we can calculate the classical standard deviation using only the remaining elements, getting a value of 10.5 mg/dl.
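These two estimators, and the classical standard deviation they are compared against, can be checked with the cholesterol data from the example (the 17.05 mg/dl figure corresponds to the MAD scaled by the 1.4826 consistency constant):

```python
import statistics

cholesterol = [166, 143, 154, 168, 435, 159, 185, 155, 167, 152,
               152, 168, 177, 171, 183, 426, 163, 170, 152, 155]

# Median absolute deviation (MAD), scaled by 1.4826 so that it is
# comparable with the standard deviation of a normal distribution
med = statistics.median(cholesterol)
raw_mad = statistics.median(abs(x - med) for x in cholesterol)
scaled_mad = 1.4826 * raw_mad
print(f"median {med}, raw MAD {raw_mad}, scaled MAD {scaled_mad:.2f} mg/dl")

# 20% trimmed standard deviation: drop 10% of the elements per side
# (two on each side here) and compute the classical SD with the rest
trimmed = sorted(cholesterol)[2:-2]
trimmed_sd = statistics.stdev(trimmed)
print(f"20% trimmed SD: {trimmed_sd:.1f} mg/dl")

print(f"classical SD of all the data: {statistics.stdev(cholesterol):.1f} mg/dl")
```

The two outliers inflate the classical standard deviation to about 83 mg/dl, while both robust estimators stay in the 10–17 mg/dl range that actually describes the bulk of the data.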

And thirdly, we could follow similar reasoning to that used for the winsorized mean. In this case, instead of eliminating the extreme values, we replace them with the closest remaining values. Once the distribution is winsorized, we calculate the standard deviation with the new values in the usual way. Its value is 9.3 mg/dl, similar to the above.

Which of the three should we use? Well, we want one that behaves efficiently when the distribution is normal (in that case the best is the classical standard deviation) but that is not very sensitive when the distribution departs from normality. In this sense, the best is the median absolute deviation, followed by the winsorized standard deviation.

One last advice before finishing. Do not calculate these measures by hand, as it can be very laborious. Statistical programs do the math for us effortlessly.

And here we end. We have not said a word about other estimators of the family of M-estimators, such as the biweight midvariance or the percentage bend midvariance. These estimators are much more difficult to understand from the mathematical point of view, although they are very easy to calculate with the appropriate software package. But that is another story…

## Black sheep

An element is said to be the black sheep of a group when it goes in a different or contrary direction to the rest of the group. For example, in a family addicted to reality shows, the black sheep would be the member who goes out of his way to watch TV documentaries. Of course, if the family is addicted to documentaries, the black sheep will be dying to watch reality shows. Always the other way around.

In statistics there is something like the black sheep: anomalous data, also called extreme data, but best known by the name of outliers.

An outlier is an observation that seems inconsistent with the rest of the sample values, taking into account the assumed probability model to be followed by the sample. As you can see, it is an element that contradicts the rest of the data, like a black sheep.

The problem with outliers is that they can do much harm when estimating population parameters from a sample. Let's remember an example that we saw in another post about the calculation of robust measures of centrality. It was the case of a school with five teachers and a soccer-fan director. The director hired the teachers with the following salaries: 1200 euros per month for the science teacher, 1500 for the math teacher, 800 for the literature teacher and 1100 for the history teacher. But it turns out that he is eager to hire Pep Guardiola as gym teacher, so he has to pay him no less than 20,000 euros per month.

Do you see where this is going? Indeed, Pep is the black sheep, the outlier. Look what happens if we calculate the mean salary: 4920 euros per month is the average salary of the teachers in this school. Do you think it is a realistic estimate? Clearly not: the mean value is shifted in the direction of the outlier, and the more extreme the outlier, the greater the shift. If Pep earned 100,000 euros, the average salary would rise to 20,920 euros. This is folly.
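The arithmetic of the example is easy to check:

```python
import statistics

# Monthly salaries from the example; Pep's 20,000 is the outlier
salaries = [1200, 1500, 800, 1100, 20000]

print(statistics.mean(salaries))     # 4920
print(statistics.median(salaries))   # 1200

# Make the outlier even more extreme: the mean chases it,
# while the median stays put
salaries_extreme = [1200, 1500, 800, 1100, 100000]
print(statistics.mean(salaries_extreme))     # 20920
print(statistics.median(salaries_extreme))   # 1200
```

This is exactly the behavior exploited by the robust techniques mentioned later: the median simply does not care how far the black sheep wanders.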

If an outlier can do so much damage to an estimator, imagine what it can do to a hypothesis test, where the answer is to accept or reject the null hypothesis. So we ask ourselves: what can we do when we discover that there are one or more black sheep among our data? Well, we can do several things.

The first thing that comes to mind is to throw the outlier away and disregard it when analyzing the data. This would be fine if the outlier were the result of an error in data collection but, if not, we could be discarding data that carry additional information. In our example, the outlier is not an error, but the product of the sporting career of the gym teacher. We need a more objective method to decide when to suppress an outlier and, although there are some so-called discordance tests, they have their inconveniences.

The second thing we can do is to identify it. This means finding out whether the value is so extreme for some specific reason, as happens in our example. An extreme value may be signaling some important finding, so we should not rush to discard it but rather try to interpret its meaning.

Thirdly, we can incorporate the outlier. As we said when defining them, outliers go against the rest of the data according to the probability model we assume the sample follows. Sometimes an observation ceases to be an outlier if we assume that the data follow another model: for example, a value may be an outlier if we assume that the data follow a normal distribution, but not if we assume they follow a lognormal one.

And, fourthly, the most correct option of all: using robust techniques for our estimates and our hypothesis tests. They are called robust techniques because they are less affected by the presence of outliers. In our example with the teachers we would use a robust measure of centrality such as the median. In our case it is 1200 euros, much closer to reality than the mean. Moreover, even if Pep were paid 100,000 euros per month, the median would remain at 1200 euros per month.

And with that we end with the outliers, those black sheep mixed in with our data. For simplicity, we have not said a word about how we could quantify how much an outlier affects the estimation of a parameter. For this, there is a wide range of statistical methodology based on the calculation of the so-called influence function. But that's another story…

## Do not eat too many pies

Pies, how good they are! The problem is that, as you know, whatever is not frowned upon is either fattening or causes cancer. And pies could not be the exception, so be careful not to eat too many, lest they end up in your spare tyre or in worse places.

But there is a type of pie that is not fattening at all (nor does it cause cancer): the pie chart, which is frequently used in statistics. Did I say just frequently? I am probably falling short. Because it is not fattening and has no detrimental health effects, there is a tendency to overuse it.

The pie chart, or circle chart, is easy to draw. It consists of a circle whose area represents the total of the data. An area proportional to its frequency is assigned to each category, so that more frequent categories have larger areas and you can get an idea of how the frequencies are distributed among the categories at a glance.

There are three ways to calculate the area of each sector. The simplest is to multiply the relative frequency of each category by 360 °, obtaining the degrees corresponding to each sector.

The second is to use the absolute frequency of the category, according to the following rule of three: degrees = (absolute frequency of the category x 360°) / total number of observations.

Finally, the third way is to use the proportions or percentages of the categories: degrees = percentage of the category x 360° / 100.

These formulas are very simple but, anyway, there will be no need for them because the program we use to draw the graph will do it for us.
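Still, as a sketch of the first of these calculations (with made-up blood-group frequencies), the degrees of each sector would come out like this:

```python
# Degrees for each sector of a pie chart, computed from the relative
# frequency of each category (invented blood-group counts)
frequencies = {"A": 25, "B": 15, "O": 45, "AB": 15}
total = sum(frequencies.values())

degrees = {cat: n / total * 360 for cat, n in frequencies.items()}
for cat, deg in degrees.items():
    print(f"{cat}: {deg:.1f} degrees")

# Sanity check: the sectors must close the circle
assert abs(sum(degrees.values()) - 360) < 1e-9
```

Whatever route we take (relative frequency, absolute frequency or percentage), the sectors must always add up to the full 360 degrees of the circle.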

The pie chart is designed to represent nominal categorical variables, although it is not uncommon to see pies representing other variables. However, in my humble opinion, this is not entirely correct.

For example, if we make a pie chart for an ordinal qualitative variable, we will lose the information about the ordering of the categories, and it would be more correct to use a graphic that allows the categories to be sorted from less to more. And this figure is none other than the bar chart.

The pie chart is especially useful when there are few categories. If there are many, interpretation ceases to be so intuitive, although we can always complete the chart with a frequency table to help interpret the data. Another tip: be very careful with 3D effects when drawing the pie; too-artistic pies can be difficult to understand.

Finally, let me just say that it makes no sense to use a pie to represent a quantitative variable. For that there is a more appropriate procedure: the histogram, which best represents the frequency distribution of a continuous quantitative variable. But that is another story…

## Why spare one?

Today we will talk about one of those mysteries of statistics that few know why they are the way they are. I am referring to dividing by n (the sample size) or by n-1 to calculate the measures of centralization and dispersion of a sample, particularly its mean (m) and its standard deviation (s).

We all know what the mean is. As its own name says, it is the average value of a distribution of data. To calculate it we add up all the values of the distribution and divide the sum by the total number of elements, that is, by n. Here there is no doubt: we divide by n to get the most used measure of centralization.

Meanwhile, the standard deviation is a measure of the average deviation of each value from the mean of the distribution. To obtain it, we calculate the difference between each element and the mean, we square these differences so that negative differences do not cancel out positive ones, we add them all up and divide the result by n. Finally, we take the square root of the result. Because it is the mean of all the squared differences, we have to divide their sum by the number of elements, n, as we did with the mean, according to the well-known formula of the standard deviation: s = √(Σ(xᵢ − m)² / n).
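As a sketch, the recipe just described (dividing by n) can be written in a few lines of Python; the data values are simply an invented example:

```python
import math

def std_dev_n(values):
    """Standard deviation dividing by n, exactly as described above."""
    m = sum(values) / len(values)                    # the mean
    squared_diffs = [(x - m) ** 2 for x in values]   # squared differences
    return math.sqrt(sum(squared_diffs) / len(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]   # invented example, mean = 5
print(std_dev_n(data))  # → 2.0
```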

However, many times we see that, to calculate the standard deviation, we divide by n-1. Why spare one element? Let’s see.

We usually deal with samples, from which we get their centralization and dispersion measures. But what we are really interested in is knowing the value of those parameters in the population from which we drew the sample. Unfortunately, we cannot calculate those parameters directly, but we can estimate them from the sample's statistics. So, we want to know whether the sample mean, m, is a good estimator of the population mean, μ. We also want to know whether the standard deviation of the sample, s, is a good estimate of the deviation of the population, which we call σ.

Let’s do an experiment to see whether m and s are good estimators of μ and σ. We are going to use the program R. I leave the list of commands (the script) in the accompanying figure in case you want to reproduce it with me.

First we generate a population of 1,000 individuals with a normal distribution with mean 50 and standard deviation of 15 (μ = 50 and σ = 15). Once done, let’s see first what happens in the case of the mean.

If we draw a sample of 25 elements from the population and calculate its mean, this mean will resemble that of the population (if the sample is representative of the population), but there will be differences due to chance. To smooth out these differences, we obtain 50 different samples, with their 50 different means. These means follow a normal distribution (the so-called sampling distribution), whose mean is the mean of all the means we got from the samples. If we extract 50 samples with R and find the mean of their means, we see that it is equal to 49.6, which is almost exactly 50. So we see that we can estimate the mean of the population using the means of the samples.
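If you prefer Python to R, a rough equivalent of this part of the experiment could look like this (the seed is arbitrary, so the exact figure will differ slightly from the 49.6 of the text):

```python
import random
import statistics

random.seed(42)  # arbitrary seed, just so the sketch is reproducible

# A "population" of 1,000 individuals, normal with mu = 50 and sigma = 15
population = [random.gauss(50, 15) for _ in range(1000)]

# Draw 50 samples of 25 elements and keep each sample's mean
sample_means = [
    statistics.mean(random.sample(population, 25)) for _ in range(50)
]

# The mean of the sampling distribution estimates the population mean
print(round(statistics.mean(sample_means), 1))
```

The printed value lands very close to 50, just as in the R experiment.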

And what about the standard deviation? If we do the same (extract 50 samples, calculate their s and, finally, calculate the mean of the 50 values of s) we get an average s of 14.8. This s is quite close to the population value of 15, but it fits worse than the value of the mean did. Why?

The answer is that the sample mean is what is called an unbiased estimator of the population mean: the mean value of its sampling distribution is a good estimate of the population parameter. However, the same does not happen with the standard deviation, because it is a biased estimator. This is because the variation in the data (which is ultimately what the standard deviation measures) is higher in the population than in the sample, since the population has a larger size (larger size, greater possibility of variation). So we divide by n-1, so that the result comes out a little higher.

If we run the experiment with R dividing by n-1, we obtain an unbiased standard deviation of 15.1, somewhat closer than the value obtained dividing by n. This estimator (dividing by n-1) would be an unbiased estimator of the population standard deviation. So, which one do we use? If we want to know the standard deviation of the sample we can divide by n, but if we want to estimate the theoretical value of the deviation in the population, the estimate will fit better if we divide by n-1.
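The whole comparison can also be sketched in Python: for the same 50 samples we compute the average s dividing by n and dividing by n-1, and check that the corrected version comes out a little higher and closer to σ = 15 (again, the seed and the exact figures are only illustrative):

```python
import random
import statistics

random.seed(42)  # arbitrary seed for reproducibility
population = [random.gauss(50, 15) for _ in range(1000)]

def sd_n(xs):
    """Standard deviation dividing by n (the biased estimator)."""
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

def sd_n1(xs):
    """Standard deviation dividing by n-1 (the corrected estimator)."""
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / (len(xs) - 1)) ** 0.5

samples = [random.sample(population, 25) for _ in range(50)]
mean_sd_n = statistics.mean(sd_n(s) for s in samples)
mean_sd_n1 = statistics.mean(sd_n1(s) for s in samples)

# Dividing by n-1 gives a slightly larger value, closer to sigma = 15
print(round(mean_sd_n, 1), round(mean_sd_n1, 1))
```

Incidentally, this is what R's built-in `sd()` function does: it always divides by n-1.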

And here we end this nonsense. We could talk about how we can get from the sampling distribution not only the point estimate, but also its confidence interval, which includes the population parameter with a given confidence level. But that is another story…