# Pitfalls of statistics

When we think about inventions and inventors, the name of Thomas Alva Edison, known among his friends as the Wizard of Menlo Park, comes to mind for most of us. This gentleman created more than a thousand inventions, some of which can be said to have changed the world. Among them we can name the incandescent bulb, the phonograph, the kinetoscope, the polygraph, the quadruplex telegraph, etc., etc., etc. But perhaps his greatest merit was not inventing all these things, but applying methods of chain production and teamwork to the research process, favoring the dissemination of his inventions and the creation of the first industrial research laboratory.

But in spite of all his genius and excellence, Edison never went on to invent something that would have been as useful as the light bulb: a cheater detector. The explanation for this failure is twofold: he lived between the nineteenth and twentieth centuries and he did not read articles about medicine. If he had lived in our time and had to read medical literature, I have no doubt that the Wizard of Menlo Park would have realized the usefulness of this invention and would have pulled his socks up.

And it is not that I am especially negative today. The problem is that, as Altman said more than 15 years ago, a very high percentage of the material sent to medical journals is methodologically defective. It’s sad, but the most appropriate place to store many of the published studies is the rubbish bin.

In most cases the cause is probably the ignorance of those who write. “We are clinicians”, we say, so we leave aside the methodological aspects, of which our knowledge is, in general, quite deficient. To fix it, journal editors send our studies to other colleagues, who are more or less like us. “We are clinicians”, they say, so all our mistakes go unnoticed by them.

Although this is, in itself, serious, it can be remedied by studying. It is even more serious that, sometimes, these errors are intentional, with the aim of leading the reader to a certain conclusion after reading the article. The remedy for this problem is to make a critical appraisal of the study, paying attention to its internal validity. In this sense, perhaps the most difficult aspect to assess for the clinician without methodological training is the statistics used to analyze the results of the study. It is here, undoubtedly, that our ignorance can most easily be exploited, by using methods that give more striking results instead of the right ones.

As I know that you are not going to be willing to do a master’s degree in biostatistics, and while we wait for someone to invent the cheater detector, we are going to give a series of clues so that non-expert readers can suspect the existence of these cheats.

## Pitfalls of statistics

The first may seem obvious, but it is not: **has a statistical method been used?** Although it is exceptionally rare, there may be authors who do not consider using any. I remember a medical congress I attended in which the values of a variable were presented over the course of the study: first they went up and then they went down, which allowed the speaker to conclude that the result was not “on the blink”. Logically, any comparison must be made with the proper hypothesis test, and the significance level and the statistical test used have to be specified. Otherwise, the conclusions will lack any validity.

A key aspect of any study, especially those with an intervention, is the **prior calculation of the necessary sample size**. The investigator must define the clinically relevant effect that he wants to be able to detect with his study and then calculate what sample size will give the study enough power to detect it. The sample of a study is not large or small, but sufficient or insufficient. If the sample is not sufficient, an existing effect may go undetected due to lack of power (type II error). On the other hand, a larger sample than necessary may show as statistically significant an effect that is not relevant from the clinical point of view. Here are two very common cheats. First, the study does not reach significance and its authors say it is due to lack of power (insufficient sample size), but make no effort to calculate the power, which can always be done a posteriori. In that case, we can calculate it ourselves using statistical programs or any of the calculators available on the internet, such as GRANMO. Second, the sample size is increased until the observed difference is significant, finding the desired p < 0.05. This case is simpler: we only have to assess whether the effect found is relevant from the clinical point of view. I advise you to practice and compare the sample sizes the studies would need with those defined by the authors. You may be in for a surprise.
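To make the sample size idea concrete, here is a minimal sketch using only the Python standard library. It assumes the simplest possible case (two groups of equal size, a continuous outcome with a known common standard deviation, a two-sided test and the normal approximation), and the function names and trial figures are invented for illustration. Real calculators such as GRANMO cover many more designs.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, SD 1)


def sample_size_per_group(delta, sd, alpha=0.05, power=0.80):
    """Participants per group needed to detect a mean difference `delta`.

    Normal approximation, two-sided test, equal group sizes (assumptions).
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = _STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


def achieved_power(n_per_group, delta, sd, alpha=0.05):
    """Approximate power of the same comparison with `n_per_group` in each arm."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    standard_error = sd * sqrt(2 / n_per_group)
    return _STD_NORMAL.cdf(delta / standard_error - z_alpha)


# Hypothetical trial: clinically relevant difference of 5 units, SD of 10 units.
n = sample_size_per_group(delta=5, sd=10)
print("needed per group:", n)
print("power with that n:", round(achieved_power(n, delta=5, sd=10), 3))
print("power with only 20 per group:", round(achieved_power(20, delta=5, sd=10), 3))
```

With these made-up figures the formula asks for 63 participants per group. A study that enrolled only 20 per arm and blamed its negative result on "lack of power" could have told us just how little power it had.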

Once the participants have been selected, a fundamental aspect is the **baseline homogeneity of the groups**. This is especially important in the case of clinical trials: if we want to be sure that the observed difference in effect between the two groups is due to the intervention, the two groups should be the same in everything except the intervention.

For this we will look at the classic table I of the trial publication. Here we have to say that, if we have distributed the participants at random between the two groups, any difference between them will be due, one way or another, to chance. Do not be fooled by the p: remember that the sample size is calculated for the clinically relevant magnitude of the main variable, not for the baseline characteristics of the two groups. If you see any difference and it seems clinically relevant, it will be necessary to verify that the authors have taken into account its influence on the results of the study and have made the appropriate adjustment during the analysis phase.

The next point is that of **randomization**. This is a fundamental part of any clinical trial, so it must be clearly described how it was done. Here I have to tell you that chance is capricious and has many vices, but it rarely produces groups of equal size. Think for a moment about flipping a coin 100 times. Although the probability of getting heads in each throw is 50%, it will be quite rare to get exactly 50 heads in 100 throws. The greater the number of participants, the more suspicious it should seem to us that the two groups are exactly equal. But beware, this only applies to simple randomization. There are methods of randomization, such as block randomization, in which groups can be more balanced.
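The coin example can be checked with a few lines of Python. This is just a sketch of the binomial calculation for simple randomization with a 1:1 allocation; the function name is mine.

```python
from math import comb


def prob_equal_split(n):
    """Probability that simple 1:1 randomization of n participants
    produces two groups of exactly the same size."""
    if n % 2:
        return 0.0  # an odd number of participants can never split evenly
    return comb(n, n // 2) / 2 ** n


for n in (10, 100, 1000):
    print(n, round(prob_equal_split(n), 4))
```

For 100 participants the probability is about 8%, and it keeps shrinking as the sample grows, which is why two perfectly equal arms in a large trial should make us look for the word "blocks" in the methods section.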

Another hot spot is the misuse that can sometimes be made of **qualitative variables**. Although qualitative variables can be coded with numbers, be very careful about doing arithmetic operations with them. It will probably make no sense. Another cheat that we can find has to do with categorizing a continuous variable. Converting a continuous variable into a qualitative one usually leads to loss of information, so the categories must have a clear clinical meaning. Otherwise, we can suspect that the reason is the search for a p-value below 0.05, which is easier to achieve when the cut-off points can be chosen after looking at the data.

Going into the analysis of the data, we must check that the authors **have followed the a priori designed protocol of the study**. Always be wary of post hoc analyses that were not planned from the beginning. If we look hard enough, we will always find a subgroup that behaves as we want. As the saying goes, if you torture the data long enough, it will confess to anything.

Another unacceptable behavior is to finish the study ahead of time because the results look good. Once again, if the duration of the follow-up was established during the design phase as the best time to detect the effect, this must be respected. Any violation of the protocol must be more than justified. Logically, it is ethical to finish the study ahead of time for safety reasons, but it will be necessary to take into account how this affects the evaluation of the results.

Before performing the analysis of the results, the authors of any study have to clean their data, reviewing the quality and integrity of the values collected. In this sense, one of the aspects to pay attention to is the **management of outliers**. These are the values that lie far from the central values of the distribution. On many occasions they are due to errors in the calculation, measurement or transcription of the value of the variable, but they can also be real values that reflect the particular nature of the variable. The problem is that there is a tendency to eliminate them from the analysis even when there is no certainty that they are due to an error. The correct thing to do is to keep them in the analysis and use, if necessary, robust statistical methods that are less sensitive to these deviations.
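As a small illustration of what "robust" means here, the following sketch compares the mean and standard deviation with the median and the median absolute deviation (MAD) on a made-up sample containing one extreme value. The data and the helper `mad` are invented for the example; this is one possible robust summary, not the only one.

```python
from statistics import mean, median, stdev


def mad(values):
    """Median absolute deviation: a measure of spread that resists outliers."""
    centre = median(values)
    return median([abs(v - centre) for v in values])


# Hypothetical measurements; the last value may be an error or a real extreme.
data = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 15.0]
without_outlier = data[:-1]

print("mean  :", round(mean(data), 2), "vs", round(mean(without_outlier), 2))
print("SD    :", round(stdev(data), 2), "vs", round(stdev(without_outlier), 2))
print("median:", median(data), "vs", median(without_outlier))
print("MAD   :", round(mad(data), 2), "vs", round(mad(without_outlier), 2))
```

The mean and the standard deviation change a lot depending on whether the suspicious value is kept, while the median and the MAD barely move. That is the point: with robust summaries there is much less temptation to delete a value just because it is inconvenient.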

Finally, the aspect that can be hardest for those who are not experts in statistics is knowing whether **the correct statistical method has been used**. A frequent error is the use of parametric tests without first checking that the necessary requirements are met. This can be done out of ignorance or to obtain statistical significance, since parametric tests reach it more easily. Put simply, the p-value will tend to be smaller than with the equivalent non-parametric test.

Also, quite frequently, other requirements for applying a given hypothesis test are ignored. As an example, in order to perform a Student’s t test or an ANOVA, homoscedasticity (a very ugly word that means that the variances are equal) must be checked, and that check is overlooked in many studies. The same happens with regression models, which frequently are not accompanied by the mandatory model diagnostics that justify their use.

Another issue in which there may be cheating is that of **multiple comparisons**. For example, when the ANOVA is significant, the meaning is that there are at least two means that are different, but we do not know which, so we start comparing them two by two. The problem is that when we make repeated comparisons the probability of type I error increases, that is, the probability of finding significant differences only by chance. This may allow finding, if only by chance, a p < 0.05, which improves the appearance of the study (especially if you spent a lot of time and/or money doing it). In these cases, the authors must use one of the available corrections (such as Bonferroni’s, one of the simplest) so that the global alpha remains below 0.05. The price to pay is simple: the p-value has to be much smaller to be significant. When we see multiple comparisons without a correction, there are only two explanations: the ignorance of whoever made the analysis or the attempt to find a statistical significance that, probably, would not survive the stricter threshold that the correction entails.
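A quick calculation shows how fast the problem grows. The sketch below assumes the comparisons are independent (which pairwise comparisons after an ANOVA are not exactly, so take the figure as an approximation) and applies the plain Bonferroni correction. The function names are mine.

```python
from math import comb


def pairwise_comparisons(groups):
    """Number of two-by-two comparisons among `groups` means."""
    return comb(groups, 2)


def familywise_error(k, alpha=0.05):
    """Probability of at least one false positive in k independent tests."""
    return 1 - (1 - alpha) ** k


def bonferroni_threshold(k, alpha=0.05):
    """Per-comparison significance threshold under the Bonferroni correction."""
    return alpha / k


for groups in (3, 4, 6):
    k = pairwise_comparisons(groups)
    print(groups, "groups:", k, "comparisons,",
          "chance of a false positive", round(familywise_error(k), 3),
          "Bonferroni threshold", round(bonferroni_threshold(k), 4))
```

With four groups there are six comparisons, and the chance of at least one spurious p < 0.05 is already around 26%. After the correction each comparison needs p below roughly 0.008.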

Another frequent victim of the misuse of statistics is **Pearson’s correlation coefficient**, which is used for almost everything. The correlation, as such, tells us whether two variables are linearly related, but it tells us nothing about whether one variable causes the other. Another misuse is to use the correlation coefficient to compare the results obtained by two observers, when what should probably be used in this case is the intraclass correlation coefficient (for continuous variables) or the kappa index (for dichotomous qualitative variables). Finally, it is also incorrect to compare two measurement methods (for example, capillary and venous glycaemia) by correlation or ordinary linear regression. For these cases the correct thing would be to use Passing-Bablok regression.
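Why is correlation a poor way to compare two measurement methods? A toy example helps. In this sketch the glycaemia values are invented, and the second method is built to read exactly 15 units higher than the first. Pearson's r is computed by hand so that the block only needs the standard library.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical paired measurements (mg/dL) with a constant systematic bias.
venous = [80, 95, 110, 130, 150, 180]
capillary = [v + 15 for v in venous]

r = pearson_r(venous, capillary)
bias = mean([c - v for c, v in zip(capillary, venous)])
print("Pearson r:", round(r, 3))
print("mean difference:", bias)
```

The correlation is perfect, yet the two methods never agree: one always reads 15 units above the other. A high r tells us the measurements move together, not that they can be used interchangeably.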

Another situation that would make a paranoid mind like mine suspicious is one in which the statistical method employed is unknown even to the smartest people in the room. Whenever there is a better known (and often simpler) way to do the analysis, we must ask ourselves **why they have used such a weird method**. In these cases, we will require the authors to justify their choice and provide a reference where we can review the method. In statistics, you have to try to choose the right technique for each occasion and not the one that gives the most appealing result.

In any of the previous hypothesis tests, the authors usually set the significance level at p < 0.05, but the test can be done with **one or two tails**. When we run a trial of a new drug, what we expect is that it works better than the placebo or the drug with which we are comparing it. However, two other situations can occur that we cannot dismiss: that it works the same or, even, that it works worse. A two-sided (two-tailed) test does not assume the direction of the effect, since it calculates the probability of obtaining a difference equal to or greater than that observed, in both directions. If the researcher is very sure of the direction of the effect, he can do a one-sided (one-tailed) test, measuring the probability of the result only in the direction considered. The problem is when he does it for another reason: the p-value of a two-sided test is twice as large as that of the one-sided test, so it will be easier to achieve statistical significance with the one-sided test. The wrong thing is to choose the one-sided test for that reason. The correct thing, unless there are well-justified reasons, is to do a two-sided test.
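The factor of two is easy to see with a z statistic. This is a sketch assuming a normally distributed test statistic and a one-sided alternative of "the new drug is better" (positive z); the function name and the z values are chosen for illustration.

```python
from statistics import NormalDist


def p_values(z):
    """Return (one_sided, two_sided) p-values for a z statistic.

    The one-sided test looks only at the upper tail (effect in the
    hypothesized direction); the two-sided test looks at both tails.
    """
    upper_tail = 1 - NormalDist().cdf(z)
    one_sided = upper_tail
    two_sided = 2 * min(upper_tail, 1 - upper_tail)
    return one_sided, two_sided


for z in (1.80, -1.80):
    one_sided, two_sided = p_values(z)
    print("z =", z, "one-sided p =", round(one_sided, 4),
          "two-sided p =", round(two_sided, 4))
```

With z = 1.80 the two-sided p is about 0.07 and the one-sided p about 0.04: the same data become "significant" just by dropping a tail. Note also what happens with z = -1.80: the one-sided test, looking only in the favorable direction, is blind to a drug that works worse.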

## The choice of association and effect measures

To finish off this tricky post, we will say a few words about the **use of appropriate measures to present the results**. There are many ways to dress up the truth without actually lying and, although basically they all say the same thing, the appearance can be very different depending on how we say it. The most typical example is the use of relative risk measures instead of absolute and impact measures. Whenever we see a clinical trial, we must demand that the authors provide the absolute risk reduction and the number needed to treat (NNT). The relative risk reduction gives a larger number than the absolute one, so it will seem that the impact is greater. Given that the absolute measures are easy to calculate and are obtained from the same data as the relative ones, we should be suspicious if the authors do not offer them to us: perhaps the effect is not as important as they are trying to make us see.
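If the authors do not give the absolute measures, we can usually work them out from the event counts. Here is a minimal sketch with an invented trial; it uses exact fractions to avoid rounding surprises and assumes the treatment reduces the risk (otherwise the NNT as computed here is meaningless).

```python
from fractions import Fraction
from math import ceil


def effect_measures(events_treated, n_treated, events_control, n_control):
    """Return (ARR, RRR, NNT) from the event counts of a two-arm trial.

    Assumes the risk is lower in the treated group.
    """
    risk_treated = Fraction(events_treated, n_treated)
    risk_control = Fraction(events_control, n_control)
    arr = risk_control - risk_treated  # absolute risk reduction
    rrr = arr / risk_control           # relative risk reduction
    nnt = ceil(1 / arr)                # number needed to treat, rounded up
    return arr, rrr, nnt


# Hypothetical trial: 30/1000 events with treatment, 40/1000 with control.
arr, rrr, nnt = effect_measures(30, 1000, 40, 1000)
print("RRR:", float(rrr))
print("ARR:", float(arr))
print("NNT:", nnt)
```

In this made-up trial the headline could be "25% fewer events", while the absolute reduction is one percentage point and we would need to treat 100 patients to avoid a single event. Same data, very different impression.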

Another example is the use of the odds ratio instead of the risk ratio (when both can be calculated). The odds ratio tends to magnify the association between the variables, especially when the outcome is frequent, so its unjustified use should also make us suspicious. If you can, calculate the risk ratio and compare the two measures.
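Both measures come out of the same 2x2 table, so the comparison takes a couple of lines. The sketch below uses invented counts and my own function names, with the cells labeled in the usual way (events and non-events among exposed, then among unexposed).

```python
from fractions import Fraction


def risk_ratio(a, b, c, d):
    """a, b: events and non-events in the exposed group;
    c, d: events and non-events in the unexposed group."""
    return Fraction(a, a + b) / Fraction(c, c + d)


def odds_ratio(a, b, c, d):
    """Cross-product ratio of the same 2x2 table."""
    return Fraction(a * d, b * c)


# Hypothetical tables, both with a risk ratio of exactly 2.
common_outcome = (40, 60, 20, 80)  # risks of 40% versus 20%
rare_outcome = (4, 996, 2, 998)    # risks of 0.4% versus 0.2%

for name, table in (("common", common_outcome), ("rare", rare_outcome)):
    print(name, "RR =", float(risk_ratio(*table)),
          "OR =", round(float(odds_ratio(*table)), 3))
```

When the outcome is rare the two measures almost coincide, but with a frequent outcome the odds ratio is clearly larger than the risk ratio for the very same data.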

Likewise, we will be suspicious of studies of diagnostic tests that do not provide the likelihood ratios and limit themselves to sensitivity, specificity and predictive values. Predictive values can be high if the prevalence of the disease in the study population is high, but they would not be applicable to populations with a lower proportion of patients. This is avoided with the use of likelihood ratios. We should always ask ourselves what reason the authors may have had to omit the most valid parameter for gauging the power of a diagnostic test.
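The dependence of predictive values on prevalence follows directly from Bayes' theorem, and it is easy to see with numbers. In this sketch the sensitivity, specificity and prevalences are invented, and the function names are mine.

```python
def positive_likelihood_ratio(sensitivity, specificity):
    """LR+: how much more likely a positive result is in the diseased."""
    return sensitivity / (1 - specificity)


def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability of disease given a positive result (Bayes' theorem)."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)


# Hypothetical test with 90% sensitivity and 90% specificity.
sens, spec = 0.90, 0.90
print("LR+:", round(positive_likelihood_ratio(sens, spec), 2))
for prevalence in (0.50, 0.10, 0.01):
    ppv = positive_predictive_value(sens, spec, prevalence)
    print("prevalence", prevalence, "PPV", round(ppv, 3))
```

The likelihood ratio depends only on the test (here it is about 9), while the positive predictive value goes from 90% in a population where half are ill to about 8% when the prevalence is 1%. A study done in a hospital full of patients can make a test look far better than it will perform in our practice.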

And finally, be very careful with the graphical representations of results: here the possibilities of dressing up the truth are limited only by our imagination. You have to look at the units and scales used and try to extract the information from the graph beyond what it might seem to represent at first glance.

## We’re leaving…

And here we leave the topic for today. We have not spoken in detail about another of the most misunderstood and manipulated entities, which is none other than our p. Many meanings are attributed to p, usually erroneously, such as the probability that the null hypothesis is true, a probability that has its own specific method of estimation. But that is another story…