## I am Spartacus

I was thinking about effect size based on mean differences, and about how to know when an effect is really large, when, by association of ideas, someone great came to mind who, sadly, left us recently. I am referring to Kirk Douglas, that hell of an actor whom I will always remember for his roles as a Viking, as Van Gogh or as Spartacus, in the famous scene of the film in which all the slaves, in the style of the Spanish Fuenteovejuna, stand up and proclaim in unison that they are Spartacus so that the Romans cannot do anything to the real one (or so that they all get equally whacked, which was much more typical of the modus operandi of the Romans of that time).

You won’t tell me the man wasn’t great. But how great, if we compare him with others? How can we measure it? Clearly not by the number of Oscars, since that would only serve to measure the prolonged shortsightedness of the so-called academics of the cinema, who took a long time to award him the honorary prize for his entire career. It is not easy to find a parameter that defines the greatness of a character like Issur Danielovitch Demsky, which was the ragman’s son’s name before he became a legend.

It is easier for us to quantify the effect size in our studies, although the truth is that researchers are usually more interested in telling us the statistical significance than the size of the effect. Calculating it is so unusual that even many statistical packages lack routines to obtain it. In this post, we are going to focus on how to measure the effect size based on differences between means.

Imagine that we want to conduct a trial to compare the effect of a new treatment against placebo and that we are going to measure the result with a quantitative variable X. What we will do is calculate the mean of the variable among the participants in the experimental or intervention group and compare it with the mean of the participants in the control group. Thus, the effect size of the intervention with respect to the placebo will be represented by the magnitude of the difference between the mean in the experimental group and that of the control group:

$d=\bar{x}_{e}-\bar{x}_{c}$

However, although it is the easiest to calculate, this value does not help us to get an idea of the effect size, since its magnitude will depend on several factors, such as the unit of measure of the variable. Think about how the difference changes when one mean is twice the other, depending on whether their values are 1 and 2 or 0.001 and 0.002. In order for this difference to be useful, it is necessary to standardize it, so a man named Gene Glass thought he could do it by dividing it by the standard deviation of the control group. He obtained the well-known Glass’ delta, which is calculated according to the following formula:

$\delta=\frac{\bar{x}_{e}-\bar{x}_{c}}{S_{c}}$

Now, since what we want is to estimate the value of delta in the population, we will have to calculate the standard deviation using n-1 in the denominator instead of n, since we know that this quasi-variance is a better estimator of the population value:

$S_{c}=\sqrt{\frac{\sum_{i=1}^{n_{c}}(x_{ic}-\bar{x}_{c})^{2}}{n_{c}-1}}$

But do not let yourselves be impressed by delta: it is nothing more than a Z score (the kind obtained by subtracting the mean from a value and dividing the result by the standard deviation). Each unit of delta is equivalent to one standard deviation, so it represents the standardized difference between the two groups that is due to the intervention.

This value allows us to estimate the percentage of superiority of the effect by calculating the area under the curve of the standard normal distribution N(0,1) up to a specific delta value (expressed in standard deviations). For example, we can calculate the area that corresponds to a delta value of 1.3. Nothing is simpler than using a table of values of the standard normal distribution or, even better, the pnorm() function of R, which returns the value 0.90. This means that, assuming normality, the mean of the intervention group is higher than the values of about 90% of the participants in the control group.
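As a quick check of that figure, here is a minimal sketch of the same calculation in Python, using the standard library’s `statistics.NormalDist` as a stand-in for R’s pnorm(). The function name `superiority_fraction` is only illustrative.

```python
from statistics import NormalDist


def superiority_fraction(delta: float) -> float:
    """Area under the standard normal curve N(0, 1) to the left of delta."""
    return NormalDist(mu=0.0, sigma=1.0).cdf(delta)


# Delta value used in the text
print(round(superiority_fraction(1.3), 2))  # 0.9
```

A delta of 0 gives 0.5, that is, no superiority at all: the experimental mean sits right in the middle of the control distribution.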

The problem with Glass’ delta is that it standardizes the difference in means using the variability of only one of the groups, which makes it sensitive to differences in variance between the two. If the variances of the two groups are very different, the delta value may be biased. That is why one Larry Vernon Hedges wanted to contribute his own letter to this particular alphabet and decided to do the calculation in a way similar to Glass’, but using a unified (pooled) standard deviation that combines the variances of both groups, according to the following formula:

$S_{u}=\sqrt{\frac{(n_{e}-1)S_{e}^{2}+(n_{c}-1)S_{c}^{2}}{n_{e}+n_{c}-2}}$

If we replace the standard deviation of the control group in the Glass’ delta formula with this unified standard deviation, we will obtain the so-called Hedges’ g. The advantage of using this unified standard deviation is that it takes into account the variances and sizes of the two groups, so g has less risk of bias than delta when we cannot assume equal variances between the two groups.

However, both delta and g have a positive bias, which means that they tend to overestimate the effect size. To avoid this, Hedges modified the calculation of his parameter in order to obtain an adjusted g, according to the following formula:

$g_{a}=g\left(1-\frac{3}{4N-9}\right)$

where N is the total sample size, $n_{e}+n_{c}$. Written in terms of the degrees of freedom, $df=n_{e}+n_{c}-2$, the same correction factor is $1-\frac{3}{4df-1}$.

This correction matters most with small samples (few degrees of freedom). This is logical if we look at the formula: the more degrees of freedom, the less necessary it is to correct the bias.
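To make the three statistics concrete, here is a minimal Python sketch that computes Glass’ delta, Hedges’ g and the adjusted g from two samples. The function names and the two small data sets are hypothetical, chosen only for illustration, and the correction factor is written with the total sample size n_e + n_c.

```python
from math import sqrt
from statistics import mean, stdev  # stdev uses n - 1 in the denominator


def glass_delta(exp, ctrl):
    """Mean difference divided by the control group's standard deviation."""
    return (mean(exp) - mean(ctrl)) / stdev(ctrl)


def pooled_sd(exp, ctrl):
    """Unified (pooled) standard deviation of the two groups."""
    n_e, n_c = len(exp), len(ctrl)
    s_e, s_c = stdev(exp), stdev(ctrl)
    return sqrt(((n_e - 1) * s_e**2 + (n_c - 1) * s_c**2) / (n_e + n_c - 2))


def hedges_g(exp, ctrl):
    """Mean difference divided by the pooled standard deviation."""
    return (mean(exp) - mean(ctrl)) / pooled_sd(exp, ctrl)


def adjusted_g(exp, ctrl):
    """Hedges' g multiplied by the small-sample correction factor."""
    n_total = len(exp) + len(ctrl)
    return hedges_g(exp, ctrl) * (1 - 3 / (4 * n_total - 9))


# Hypothetical measurements, for illustration only
experimental = [5, 6, 7, 8, 9]
control = [4, 5, 5, 5, 6]

print(glass_delta(experimental, control))
print(hedges_g(experimental, control))
print(adjusted_g(experimental, control))
```

In this made-up example the control group varies much less than the experimental one, so delta comes out clearly larger than g, and the correction pulls g down by roughly 10% because there are only ten subjects in total.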

So far, we have tried to solve the problem of calculating an estimator of the effect size that is not biased by the lack of equal variances. The point is that, in the rigid and controlled world of clinical trials, it is usual that we can assume the equality of variances between the groups of the two branches of the study. We might think, then, that if this is true, it would not be necessary to resort to the trick of n-1.

Well, Jacob Cohen thought the same, so he devised his own parameter, Cohen’s d. This Cohen’s d is similar to Hedges’ g, but still more sensitive to inequality of variances, so we will only use it when we can assume the equality of variances between the two groups. Its calculation is identical to that of the Hedges’ g, but using n instead of n-1 to obtain the unified variance.
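One possible reading of that description, sketched in Python: the group variances are computed with n in the denominator and then combined weighting by n instead of n-1. This is an assumption about what “using n instead of n-1” means here, the function name is illustrative, and the data are hypothetical.

```python
from math import sqrt
from statistics import mean, pvariance  # pvariance divides by n


def cohen_d(exp, ctrl):
    """Mean difference over a unified SD built with n instead of n - 1."""
    n_e, n_c = len(exp), len(ctrl)
    unified_var = (n_e * pvariance(exp) + n_c * pvariance(ctrl)) / (n_e + n_c)
    return (mean(exp) - mean(ctrl)) / sqrt(unified_var)


# Hypothetical measurements, for illustration only
experimental = [5, 6, 7, 8, 9]
control = [4, 5, 5, 5, 6]

print(cohen_d(experimental, control))
```

Dividing by n gives a smaller standard deviation than dividing by n-1, so with the same data this version always comes out somewhat larger than the g computed with the n-1 denominators.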

As a rough-and-ready rule, we can say that the effect size is small for d = 0.2, medium for d = 0.5, large for d = 0.8 and very large for d = 1.20. In addition, we can establish a relationship between d and the correlation coefficient (r), which is also a widely used measure to estimate the effect size.

The correlation coefficient measures the relationship between an independent binary variable (intervention or control) and a numerical dependent variable (our X). The great advantage of this measure is that it is easier to interpret than the parameters we have seen so far, which all function as standardized Z scores. We already know that r can range from -1 to 1 and the meaning of these values.

Thus, if you want to calculate r given d, you only have to apply the following formula:

$r=\frac{d}{\sqrt{d^{2}+\frac{1}{pq}}}$

where p and q are the proportions of subjects in the experimental and control groups (p = ne / n and q = nc / n). In general, the larger the effect size, the greater r and vice versa (although it must be taken into account that r is also smaller as the difference between p and q increases). However, the factor that most determines the value of r is the value of d.
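The conversion is easy to try out. The following Python sketch applies the formula directly; the function name `d_to_r` and the group sizes are illustrative assumptions.

```python
from math import sqrt


def d_to_r(d: float, n_e: int, n_c: int) -> float:
    """Convert a standardized mean difference d into a correlation r."""
    n = n_e + n_c
    p, q = n_e / n, n_c / n  # proportions in each group
    return d / sqrt(d**2 + 1 / (p * q))


# Same d, balanced versus unbalanced groups (hypothetical sizes)
print(round(d_to_r(0.8, 50, 50), 3))
print(round(d_to_r(0.8, 80, 20), 3))
```

With a “large” d of 0.8 the balanced design gives an r of about 0.37, and the unbalanced one gives a smaller value, which is the effect of the difference between p and q mentioned above.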

And with this we will end for today. Do not believe that we have discussed all the measures of this family. There are about a hundred parameters to estimate the effect size, such as the determination coefficient, eta-square, chi-square, etc., even others that Cohen himself invented (not very happy with only d), such as f-square or Cohen’s q. But that is another story…

## When nothing bad happens, is everything okay?

I have a brother-in-law who is increasingly afraid of getting on a plane. He is capable of making road trips for several days in a row so as not to leave the ground. But it turns out that the poor guy has to make a transcontinental trip and has no choice but to take a plane.

But my brother-in-law, in addition to being fearful, is an ingenious person. He has been counting the number of flights of the different airlines and the number of accidents that each one has had in order to calculate the probability of having a mishap with each of them and fly with the safest. The matter is very simple if we remember that probability equals favorable cases divided by possible cases.

And it turns out that he is happy because there is a company that has made 1500 flights and has never had any accidents, so the probability of having an accident flying on their planes will be, according to my brother-in-law, 0/1500 = 0. He is now so calm that he has almost lost his fear of flying. Mathematically, it is almost certain that nothing will happen to him. What do you think about my brother-in-law?

Many of you will already be thinking that using brothers-in-law for these examples has these problems. We all know how brothers-in-law are… But don’t be unfair to them. As the famous humorist Joaquín Reyes says, “we are all brothers-in-law”, so just remember it. What there is no doubt about is that we will all agree that my brother-in-law is wrong: the fact that there has not been any mishap in the 1500 flights does not guarantee that the next one will not crash. In other words, even if the numerator of the proportion is zero, it would be incorrect to keep zero as the result when we estimate the real risk.

This situation occurs with some frequency in biomedical research. To leave airlines and aerophobics alone, imagine that we have a new drug with which we want to prevent that terrible disease, fildulastrosis. We take 150 healthy people and give them antifildulin for 1 year and, after this follow-up period, we do not detect any new cases of the disease. Can we conclude then that the treatment prevents the development of the disease with absolute certainty? Obviously not. Let’s think about it a little.

Making inferences about probabilities when the numerator of the proportion is zero can be somewhat tricky, since we tend to think that the non-occurrence of events is something qualitatively different from the occurrence of one, few or many events, and this is not really so. A numerator equal to zero does not mean that the risk is zero, nor does it prevent us from making inferences about the size of the risk, since we can apply the same statistical principles as to non-zero numerators.

Returning to our example, suppose that the incidence of fildulastrosis in the general population is 3 cases per 2000 people per year (1.5 per thousand, 0.15% or 0.0015). Can we infer with our experiment if taking antifildulin increases, decreases or does not modify the risk of suffering fildulastrosis? Following the familiar adage, yes, we can.

We will keep our habit of taking the null hypothesis to be that of equal effect, so that the risk of disease is not modified by the new treatment. Thus, the risk of each of the 150 participants becoming ill during the study will be 0.0015. In other words, the probability of not getting sick will be 1 - 0.0015 = 0.9985. What will be the probability that none of them gets sick during the year of the study? Since there are 150 independent events, the probability that none of the 150 subjects gets sick will be $0.9985^{150} = 0.80$. We see, therefore, that even if the risk is the same as that of the general population, with this number of participants we have an 80% chance of not detecting any event (fildulastrosis) during the study, so it would be more surprising to find a patient than not to find any. But the most surprising thing is what this figure tells us about our sample: observing no cases does not mean that the risk is 0 (0/150), as my brother-in-law thinks, because with the risk unchanged we would still see no cases 80% of the time!

And the worst part is that, given this result, pessimism invades us: it is even possible that the risk of disease with the new drug is greater and we are not detecting it. Let’s assume that the risk with medication is 1% (compared to 0.15% in the general population). The probability of nobody getting sick would be $(1-0.01)^{150} = 0.22$. Even with a 2% risk, the probability of not observing any case is $(1-0.02)^{150} = 0.048$. Remember that 5% is the value that we usually adopt as the acceptable probability of making a type 1 error when we reject the null hypothesis.

At this point, we can ask ourselves whether we have been very unlucky and have failed to detect cases of illness even though the risk is high or, on the contrary, we have not been so unlucky and the risk really must be low. To clarify this, we can return to our usual 5% threshold and see at what level of risk with the treatment the probability of not detecting any patient falls to 5%:

– Risk of 1.5/1000: $(1-0.0015)^{150} = 0.80$.

– Risk of 1/1000: $(1-0.001)^{150} = 0.86$.

– Risk of 1/200: $(1-0.005)^{150} = 0.47$.

– Risk of 1/100: $(1-0.01)^{150} = 0.22$.

– Risk of 1/50: $(1-0.02)^{150} = 0.048$.

– Risk of 1/25: $(1-0.04)^{150} = 0.002$.

As we see in the previous series, our “security” threshold of 5% is reached when the risk is about 1/50 (2% or 0.02). This means that, with a 5% probability of being wrong, the risk of fildulastrosis while taking antifildulin is equal to or less than 2%. In other words, the 95% confidence interval of our estimate would range from 0 to 0.02 (and would not be just 0, as we would get if we calculated the probability in a simplistic way).
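The series above can be reproduced with a few lines of Python. This is only a sketch of the calculation described in the text, assuming independent participants; the function name `prob_no_events` is illustrative.

```python
def prob_no_events(risk: float, n: int) -> float:
    """Probability that none of n independent subjects has the event."""
    return (1 - risk) ** n


# Risks considered in the text, with the 150 participants of the example
for risk in (0.001, 0.0015, 0.005, 0.01, 0.02, 0.04):
    print(f"risk {risk:.4f}: P(no cases) = {prob_no_events(risk, 150):.3f}")
```

The probability drops below 0.05 at a risk of about 2%, which is where the upper limit of the interval comes from.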

To prevent our reheated neurons from eventually melting, let’s see a simpler way to automate this process. For this we use what is known as the rule of 3. If we do the study with n patients and none present the event, we can affirm that the probability of the event is not zero, but less than or equal to 3/n. In our example, 3/150 = 0.02, the probability we calculate with the laborious method above. We will arrive at this rule after solving the equation we use with the previous method:

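$(1-\text{maximum risk})^{n} = 0.05$

First, we rewrite it:

$1-\text{maximum risk} = 0.05^{1/n}$

If n is greater than 30, $0.05^{1/n}$ approximates (n-3)/n, which is the same as 1-(3/n). This works because $0.05^{1/n} = e^{\ln(0.05)/n}$, the natural logarithm of 0.05 is very close to -3, and $e^{x}$ is approximately 1 + x when x is small. In this way, we can rewrite the equation as:

$1-\text{maximum risk} = 1-\frac{3}{n}$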

With which we can solve the equation and get the final rule:

Maximum risk = 3/n.

You have seen that we have considered that n is greater than 30. This is because, below 30, the rule tends to overestimate the risk slightly, which we will have to take into account if we use it with reduced samples.
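To see how close the shortcut is, this small Python sketch compares the exact solution of the equation with the rule of 3 for several sample sizes. The function names are illustrative, and the 5% level is the one used throughout the text.

```python
def exact_upper_risk(n: int, alpha: float = 0.05) -> float:
    """Solve (1 - risk) ** n = alpha for risk."""
    return 1 - alpha ** (1 / n)


def rule_of_three(n: int) -> float:
    """Approximate upper limit of the risk when 0 events are seen in n."""
    return 3 / n


for n in (10, 30, 150, 1500):
    print(n, round(exact_upper_risk(n), 4), round(rule_of_three(n), 4))
```

With n = 10 the rule gives 0.30 against an exact value of about 0.26, while with the 150 participants of the example the two values are practically identical.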

And with this we will end this post, with a couple of considerations. First, as is easy to imagine, statistical programs calculate confidence intervals for a risk without much effort even if the numerator is zero. It can also be done by hand, and much more elegantly, by resorting to the Poisson probability distribution, although the result is similar to that obtained with the rule of 3.

Second, what happens if the numerator is not 0 but a small number? Can a similar rule be applied? The answer, again, is yes. Although there is no general rule, extensions of the rule have been developed for a number of events up to 4. But that’s another story…

## Like a hypermarket

There is one thing that happens on a recurring basis and that is a real slap in the face for me. It turns out that I like to go shopping for food once a week, so I usually go to the hypermarket every Friday. I am a creature of habit who always eats the same things on almost the same days, so I move swiftly through the aisles of the hypermarket throwing things into the shopping cart and have the matter settled in the twinkling of an eye. The problem is that hypermarkets have the bad habit of periodically moving products around, so you go crazy until you learn their new locations. To cap it all, for the first few days the products have been moved but the information panels have not, so I have to go around in circles a thousand times until I find the cans of squid in their ink which, as we all know, are one of our main staple foods.

You will wonder why I tell you all this stuff. As it turns out, the National Library of Medicine (NLM) has done a similar thing: now that I had finally managed to learn how its search engine worked, they go and change it completely.

Of course, to be fair, it must be said that the NLM’s people have not limited themselves to changing the look of windows and boxes, but have implemented a radical change with an interface that they define as cleaner and simpler, as well as better adapted to mobile devices, which are increasingly used to do bibliographic searches. And it doesn’t end there: there are a lot of improvements in the algorithms to find the more than 30 million citations that Pubmed includes and, in addition, the platform is hosted in the cloud, promising to be more stable and efficient.

The NLM announced the new Pubmed in October 2019 and it will be the default option at the beginning of the year 2020 so, although the legacy version will be available a few more months, we have no choice but to learn how to handle the new version. Let’s take a look.

Although all the functionalities that we know from the legacy version are also present in the new one, the look is radically different, starting with the home page, which I show you in the first figure.

The most important element is the new search box, where we have to enter the text and then click on the “Search” button. If the NLM does not deceive us, this will be the only resource that we will have to use the vast majority of the time, although we still have a link at our disposal to enter the advanced search mode.

Below we have four sections, including the one that contains help to learn how to use the new version, and that include tools that we already knew, such as “Clinical Queries”, “Single Citation Matcher” or “MeSH Database”. At the time of writing this post, these links direct you to the old versions of the tools, but this will change when the new interface is accessed by default.

Finally, a new component called “Trending Articles” has been added at the bottom. Here are articles of interest, which do not have to be the most recent ones, but those that have aroused interest lately and have been viralized in one way or another. Next to this we have the “Latest Literature” section, where recent articles from high impact journals are shown.

Now let’s see a little how searches are done using the new Pubmed. One of the keys to this update is the simple search box, which has become much smarter by incorporating a series of new sensors that, according to the NLM, try to detect exactly what we want to look for from the text we have inserted.

For example, if we enter information about the author, the abbreviation of the journal and the year of publication, the citation sensor will detect that we have entered basic citation information and will try to find the article we are looking for. For example, if I type “campoy jpgn 2019”, I will get the results you see in the second figure, where Pubmed shows the two articles found published by this doctor in this journal in 2019. It would be something like what we previously obtained using the “Single Citation Matcher”.

We can also do the search in a more traditional way. For example, if we want to search by author, it is best to write the last name followed by the initial of the name, all in lower case, without labels or punctuation marks. For example, if we want to look for articles by Yvan Vandenplas, we will write “vandenplas y”, with which we will obtain the papers that I show you in the third figure.

Of course, we can also search by subject. If I type “parkinson” in the search box, Pubmed will make a series of suggestions on similar search terms. If I press “Search”, I get the results of the fourth figure which, as you can see, includes all the results with the related terms.

Let us now turn to the results page, which is also full of surprises. You can see a detail in the fifth figure. Under the search box we have two links: “Advanced”, to access the advanced search, and “Create alert”, so that Pubmed notifies us every time a new related article is incorporated (you already know that for this to be possible we have to create an account in NCBI and enter by pressing the “Login” button at the top; this account is free and saves all our activity in Pubmed for later use).

Below these links there are three buttons which allow you to save the search (“Save”), send it by e-mail (“Email”) and, by clicking the three-dot button, send it to the clipboard or to our bibliography or collections, if we have an NCBI account.

On the right we have the buttons to sort the results. The “Best Match” is one of the new priorities of the NLM, which tries to show us in the first positions the most relevant articles. Anyway, we can sort them in chronological order (“Most recent”), as well as change the way of presenting them by clicking on the gearwheel on the right (in “Summary” or “Abstract” format).

We are now going to focus on the left side of the results page. The first thing we see is a graph with the results indexed by year. This graph can be enlarged, which allows us to see the evolution of the number of papers on the subject indexed over time. In addition, we can modify the time interval and restrict the search to what was published in a given period. In the sixth figure I show you how to limit the search to the results of the last 10 years.

Under each result we have two new links: “Cite” and “Share”. The first allows us to write the citation of the work in several different formats. The second lets us share it on social networks.

Finally, to the left of the results screen we have the list of filters that we can apply. These can be added or removed in a similar way to how it was done with the legacy version of Pubmed and its operation is very intuitive, so we will not spend more time on them.

If we click on one of the items in the list of results we will access the screen with the text of the paper (seventh figure). This screen is similar to that of the legacy version of Pubmed, although new buttons such as “Cite” and those for accessing social networks are included, as well as additional information on related articles and articles in which the one we have selected is cited. Also, as a novelty, we have some navigation arrows on the left and right ends of the screen to change to the text of the previous and subsequent articles, respectively.

To finish this post, let’s take a look at the new advanced search, which can be accessed by clicking on the “Advanced” link, which will take us to the screen you see in the eighth figure.

Its operation is very similar to the legacy version. We can add terms with Boolean operators, combine searches, etc. I encourage you to play with the advanced search, the possibilities are endless. The newest part of this tool is the section with the history and the search details (“History and Search Details”) at the bottom. This allows you to keep previous searches and return to them, taking into account that all this is lost when you leave Pubmed, unless you have an NCBI account.

I call your attention to the “Search Details” tab, which you can open as shown in the ninth figure. The search becomes more transparent, since it shows how Pubmed interpreted it based on an automatic system of choice of terms (“Automatic Term Mapping”). Although we do not know very well how to narrow the search to specific terms of Parkinson’s disease, Pubmed does know what we are looking for and includes all the terms in the search, in addition to the initial text that we introduced, of course.

And here we end for today. You have seen that these people of the NLM have outdone themselves, putting at our disposal a new tool that is easier to use but, at the same time, much more powerful and intelligent. Google must be shaking with fear, but don’t worry, it will surely invent something to try to prevail.

You can start forgetting about the legacy version: do not wait for it to disappear to start enjoying the new one. We will have to talk about these issues again when new versions of the rest of the tools, such as Clinical Queries, are established, but that is another story…

## Columns, sectors, and an illustrious Italian

When you read the title of this post, you may ask yourself what silly idea I am going to use today to torment my long-suffering audience, but do not fear: all we are going to do is highlight the value of that famous aphorism that says that a picture is worth a thousand words. Have I clarified something? I suppose not.

As we all know, descriptive statistics is that branch of statistics that we usually use to obtain a first approximation to the results of our study, once we have finished it.

The first thing we do is to describe the data, for which we make frequency tables and use various measures of central tendency and dispersion. The problem with these parameters is that, although they truly represent the essence of the data, it is sometimes difficult to provide a synthetic and comprehensive view with them. It is in these cases that we can resort to another resource, which is none other than the graphic representation of the study results. You know, a picture is worth a thousand words, or so they say.

There are many types of graphs to help us better understand the data, but today we are only going to talk about those that have to do with qualitative or categorical variables.

Remember that qualitative variables represent attributes or categories of the variable. When the variable does not include any sense of order, it is said to be a nominal categorical variable, while if a certain order can be established between the categories, we would say that it is an ordinal categorical variable. For example, the variable “smoker” would be nominal if it has two possibilities: “yes” or “no”. However, if we define it as “occasional”, “little smoker”, “moderate” or “heavy smoker”, there is already a certain hierarchy and we speak of ordinal qualitative variable.

The first type of chart that we are going to consider when representing a qualitative variable is the pie chart. This consists of a circle whose area represents the total data. Thus, an area that will be directly proportional to its frequency is assigned to each category. In this way, the most frequent categories will have larger areas, so that we can get an idea of how the frequencies are distributed in the categories at a glance.

There are three ways to calculate the size of each sector, expressed as its angle in degrees. The simplest is to multiply the relative frequency of each category by 360°, obtaining the degrees of that sector.

The second is to use the absolute frequency of the category, according to the following rule of three:

Absolute frequency / Total data frequency = Degrees of the sector / 360 °

Finally, the third way is to use the proportions or percentages of the categories:

% of the category / 100% = Degrees of the sector / 360 °
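The three routes lead to the same angles, as this short Python sketch shows. The function name and the category counts are hypothetical, used only for illustration.

```python
def sector_degrees(counts: dict) -> dict:
    """Angle of each pie sector, from the absolute frequencies."""
    total = sum(counts.values())
    return {category: 360 * n / total for category, n in counts.items()}


# Hypothetical absolute frequencies of three categories
counts = {"A": 10, "B": 30, "C": 60}
degrees = sector_degrees(counts)
print(degrees)  # {'A': 36.0, 'B': 108.0, 'C': 216.0}

# Same result for category B from its relative frequency and its percentage
total = sum(counts.values())
relative = counts["B"] / total
percentage = 100 * relative
print(relative * 360, percentage / 100 * 360)
```

Whichever route we take, the angles of all the sectors must add up to 360°, which is a handy check when doing it by hand.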

The formulas are very simple, but, in any case, there will be no need to resort to them because the program with which we make the graph will do it for us. The instruction in R is pie(), as you can see in the first figure, in which I show you a distribution of children with exanthematic diseases and how the pie chart would be represented.

The pie chart is designed to represent nominal categorical variables, although it is not uncommon to see pies representing variables of other types. However, and in my humble opinion, this is not entirely correct.

For example, if we make a pie chart for an ordinal qualitative variable, we will be losing information about the hierarchy of the categories, so it would be more correct to use a chart that allows us to sort the categories from lowest to highest. And this chart is none other than the bar chart, which we’ll talk about next.

The pie chart will be especially useful when there are few categories of the variable. If there are many, the interpretation is no longer so intuitive, although we can always complete the graph with a frequency table that helps us to better interpret the data. Another tip is to be very careful with 3D effects when drawing pies. If we overdo it, the graphic will lose clarity and will be more difficult to read.

The second graph that we are going to see is, as we have already mentioned, the bar chart, the optimum to represent ordinal qualitative variables. On the horizontal axis, the different categories are represented, and on it some columns or bars are raised whose height is proportional to the frequency of each category. We could also use this type of graph to represent discrete quantitative variables, but what is not very correct to do is use it for the qualitative nominal variables.

The bar chart is able to express the magnitude of the differences between the categories of the variable, but this is precisely its weak point, since it is easily manipulated by modifying the scales of the axes. That is why we must be careful when analyzing this type of graph to avoid being deceived by the message that the author of the study may want to convey.

This chart is also easy to do with most statistical programs and spreadsheets. The function in R is barplot(), as you can see in the second figure, which represents a sample of asthmatic children classified by severity.

With what has been seen so far, some will think that the title of this post is a bit misleading. Actually, the thing is not about columns and sectors, but about bars and pies. Also, who is the illustrious Italian? Well, here I do not fool anyone, because the character was both Italian and illustrious, and I am referring to Vilfredo Federico Pareto.

Pareto was an Italian who was born in the mid-19th century in Paris. This small contradiction is due to the fact that his father was then exiled in France for being one of the followers of Giuseppe Mazzini, who was then committed to Italian unification. Anyway, Pareto lived in Italy from the age of 10, becoming an engineer with extensive mathematical and humanistic knowledge who contributed decisively to the development of microeconomics. He spoke and wrote fluently in French, English, Italian, Latin and Greek, and became famous for a multitude of contributions such as the Pareto distribution, Pareto efficiency, the Pareto index and the Pareto principle. To represent the latter, he invented the Pareto diagram, which is what brings him here among us today.

The Pareto chart (also known in economics as a closed curve or A-B-C distribution) organizes the data in descending order from left to right, represented by bars, thus assigning an order of priorities. In addition, the diagram incorporates a curved line that represents the cumulative frequency of the categories of the variable. This initially allowed the Pareto principle to be explained, which comes to say that there are many minor problems compared to a few that are important, which was very useful for decision-making.

As it is easy to understand, this prioritization makes the Pareto diagram especially useful for representing ordinal qualitative variables, surpassing the bar chart by giving information on the percentage accumulated by adding the categories of the distribution of the variable. The change in slope of this curve also informs us of the change in the concentration of data, which depends on the variability in which the subjects of the sample are divided between the different categories.

Unfortunately, R does not have a simple function to represent Pareto diagrams, but we can easily obtain it with the script that I attached in the third figure, obtaining the graph of the fourth.
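The calculation behind the chart is simple enough to sketch. This Python fragment is not the R script from the figure: it only builds the two ingredients of a Pareto diagram, the categories sorted by descending frequency and their cumulative percentage. The function name and the counts are hypothetical.

```python
def pareto_table(counts: dict) -> list:
    """Rows of (category, frequency, cumulative %) in descending order."""
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    rows, running = [], 0
    for category, n in ordered:
        running += n
        rows.append((category, n, 100 * running / total))
    return rows


# Hypothetical frequencies of four categories
table = pareto_table({"A": 5, "B": 50, "C": 25, "D": 20})
for row in table:
    print(row)
```

The bars of the diagram are drawn from the second column and the curved line from the third, which always ends at 100%.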

And here we are going to leave it for today. Before saying goodbye, I want to warn you that you should not confuse the bars of the bar chart with those of the histogram since, although they can be similar from the graphic point of view, both represent very different things. In a bar chart only the values of the variables we have observed when doing the study are represented. However, the histogram goes much further since, in reality, it contains the frequency distribution of the variable, so it represents all possible values that exist within the intervals, although we have not observed any directly. It allows us to calculate the probability that any distribution value will be represented, which is of great importance if we want to make inference and estimate population values based on the results of our sample. But that is another story…

## Like a forgotten clock

I don’t like the end of summer. The days with bad weather begin, I wake up in complete darkness and in the evening it gets dark earlier and earlier. And, as if this were not bad enough, the cumbersome change between summer and winter time is approaching.

In addition to the inconvenience of the change and the tedium of spending two or three days remembering what time it is and what time it would be if there had been no change, we must adjust a lot of clocks manually. And, no matter how hard you try to change them all, you always leave some showing the old time. It does not happen with the kitchen clock, which you always look at to know how fast you have to eat breakfast, or with the one in the car, which stares at you every morning. But surely there are some that you do not change. It has even happened to me that I only realized it at the next change, when I saw that I did not need to adjust the clock because I had left it unchanged the previous time.

These forgotten clocks remind me a little of categorical or qualitative variables.

You will think that, once again, I forgot to take my pill this morning, but no. Everything has its reasoning. When we finish a study and we already have the results, the first thing we do is a description of them and then go on to do all kinds of contrasts, if applicable.

Well, qualitative variables are always belittled when we apply our knowledge of descriptive statistics. We usually limit ourselves to classifying them and making frequency tables from which we calculate some indices, such as their relative or cumulative frequency, give some representative measure such as the mode, and little else. We usually work a little harder on their graphic representation with bar or sector diagrams, pictograms and other similar inventions. And finally, we put in a little more effort when we relate two qualitative variables through a contingency table.

However, we forget their variability, something we would never do with a quantitative variable. Quantitative variables are like that kitchen wall clock that looks us straight in the eye every morning and does not allow us to leave it showing the wrong time. That is why we use concepts we understand very well, such as the mean and the variance or standard deviation. But the fact that we do not know how to objectively measure the variability of qualitative or categorical variables, whether nominal or ordinal, does not mean that there is no way to do it. For this purpose, several diversity indexes have been developed, which some authors classify as dispersion, variability and disparity indexes. Let’s see some of them, whose formulas you can see in the attached box, so you can enjoy the beauty of mathematical language.

The two best-known indexes used to measure variability or diversity are Blau’s index (or Hirschman-Herfindahl’s) and the entropy index (or Teachman’s). Both have a very similar meaning and, in fact, are linearly correlated.

Blau’s index quantifies the probability that two individuals chosen at random from a population are in different categories of a variable (provided that the population size is infinite or the sampling is performed with replacement). Its minimum value, zero, would indicate that all members are in the same category, so there would be no variety. The higher its value, the more dispersed the components of the group will be among the different categories of the variable. The maximum value is reached when the components are distributed equally among all categories (their relative frequencies are equal). This maximum is (k-1) / k, which is a function of k (the number of categories of the qualitative variable) and not of the population size. It tends to 1 as the number of categories increases (to put it more correctly, when k tends to infinity).

Let’s look at some examples to clarify it a bit. If you look at the Blau’s index formula, the value of the sum of the squares of the relative frequencies in a totally homogeneous population will be 1, so the index will be 0. There will only be one category with frequency 1 (100%) and the rest with zero frequency.

As we have said, even when the subjects are distributed equally across all categories, the index increases as the number of categories increases. For example, if there are four categories with a frequency of 0.25, the index will be 0.75 ($1 - (4 \times 0.25^{2})$). If there are five categories with a frequency of 0.2, the index will be 0.8 ($1 - (5 \times 0.2^{2})$). And so on.

As a practical example, imagine a disease in which there is diversity from the genetic point of view. In a city A, 85% of patients have genotype 1 and 15% genotype 2. Blau’s index is $1 - (0.85^{2} + 0.15^{2}) = 0.255$. In view of this result, we can say that, although the population is not homogeneous, the degree of heterogeneity is not very high.

Now imagine a city B with 60% of genotype 1, 25% of genotype 2 and 15% of genotype 3. Blau’s index will be $1 - (0.6^{2} + 0.25^{2} + 0.15^{2}) = 0.555$. Clearly, the degree of heterogeneity is greater among the patients of city B than among those of A. The smartest of you will tell me that this was already clear without calculating the index, but bear in mind that I chose a very simple example to spare myself the calculations. In more complex, real-life studies it is not usually so obvious and, in any case, it is always more objective to quantify the measure than to rely on our subjective impression.

This index could also be used to compare the diversity of two different variables (as long as it makes sense to do so), but the fact that its maximum value depends on the number of categories of the variable, and not on the size of the sample or population, calls into question its usefulness for comparing the diversity of variables with different numbers of categories. To avoid this problem, Blau’s index can be normalized by dividing it by its maximum, thus obtaining the qualitative variation index. Its meaning is, of course, the same as that of Blau’s index and its value ranges between 0 and 1. Thus, we can use either one if we compare the diversity of two variables with the same number of categories, but it will be more correct to use the qualitative variation index if the variables have different numbers of categories.
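For those who prefer to see the arithmetic run, here is a small Python sketch of both indexes applied to the two cities of the example. The function names are my own choice, and the normalization simply divides by the maximum (k-1) / k described above, so it assumes at least two categories.

```python
def blau_index(freqs):
    """Blau's index: 1 minus the sum of the squared relative frequencies."""
    return 1 - sum(p ** 2 for p in freqs)


def qualitative_variation_index(freqs):
    """Blau's index divided by its maximum, (k - 1) / k (requires k >= 2)."""
    k = len(freqs)
    return blau_index(freqs) / ((k - 1) / k)


city_a = [0.85, 0.15]          # genotypes 1 and 2
city_b = [0.60, 0.25, 0.15]    # genotypes 1, 2 and 3

for name, freqs in (("A", city_a), ("B", city_b)):
    print(f"City {name}: Blau = {blau_index(freqs):.3f}, "
          f"IQV = {qualitative_variation_index(freqs):.2f}")
```

Blau’s index comes out as 0.255 and 0.555, as in the text, while the normalized index is about 0.51 for city A and about 0.83 for city B, so city B is still the more diverse once the different number of categories is taken into account.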

The other index, somewhat less famous, is Teachman’s index or entropy index, whose formula is also attached. Very briefly, we will say that its minimum value, which is zero, indicates that there are no differences between the components in the variable of interest (the population is homogeneous). Its maximum value can be calculated as the negative of the natural (Napierian) logarithm of the inverse of the number of categories (-ln(1 / k)) and is reached when all categories have the same relative frequency (entropy reaches its maximum value). As you can see, it is very similar to Blau’s, which is much easier to calculate than Teachman’s.

To end this entry, the third index that I want to talk about today tells us, more than about the variability of the population, about the dispersion of its components with respect to the most frequent value. This can be measured by the variation ratio, which indicates the degree to which the observed values do not coincide with the mode, which is the most frequent category. As with the previous ones, I also show the formula in the attached box.
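Since the formula lives in the attached box, the following Python sketch assumes the usual definition of the entropy index, minus the sum of each relative frequency multiplied by its natural logarithm. Take it as an illustration under that assumption; the function names are mine.

```python
import math


def teachman_index(freqs):
    """Entropy index: -sum(p * ln(p)), counting empty categories as zero."""
    return -sum(p * math.log(p) for p in freqs if p > 0)


def max_entropy(k):
    """Maximum value for k categories: -ln(1 / k), which equals ln(k)."""
    return -math.log(1 / k)


city_a = [0.85, 0.15]
print(f"City A: {teachman_index(city_a):.3f} (maximum {max_entropy(2):.3f})")
print(f"Four equal categories: {teachman_index([0.25] * 4):.3f} "
      f"(maximum {max_entropy(4):.3f})")
```

A homogeneous population gives zero, and equal frequencies reach the maximum, which mirrors the behavior of Blau’s index.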

In order not to clash with the previous ones, its minimum value is also zero and is obtained when all cases coincide with the mode. The lower the value, the less the dispersion. The lower the absolute frequency of the mode, the closer it will be to 1, the value that indicates maximum dispersion. I think this index is very simple, so we are not going to devote more attention to it.
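Simple as it is, a couple of lines of Python make the behavior of the variation ratio clear. The sketch assumes the usual definition, one minus the proportion of cases that fall in the modal category, and the counts are hypothetical.

```python
def variation_ratio(counts):
    """1 minus the proportion of cases that fall in the modal category."""
    return 1 - max(counts) / sum(counts)


print(variation_ratio([100, 0, 0]))              # every case coincides with the mode
print(round(variation_ratio([60, 25, 15]), 2))   # 60 of 100 cases in the mode
print(round(variation_ratio([34, 33, 33]), 2))   # cases spread almost evenly
```

The first call gives zero, and the value grows as the mode loses weight among the cases.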

And we have reached the end of this post. I hope that from now on we will pay more attention to the descriptive analysis of the results of qualitative variables. Of course, it would be necessary to complete it with an adequate graphic description using the well-known bar or sector diagrams (the pies) and other lesser-known ones such as Pareto diagrams. But that is another story…

## Worshipped, but misunderstood

Statistics wears out most of us who call ourselves “clinicians”. The knowledge on the subject acquired during our formative years has long lived in the foggy world of oblivion. We vaguely remember terms such as probability distribution, hypothesis contrast, analysis of variance, regression … It is for this reason that we are always a bit apprehensive when we come to the methods section of scientific articles, which details all these techniques that, although familiar to us, we do not know in enough depth to interpret their results correctly.

Fortunately, Providence has given us a lifebelt: our beloved and worshipped p. Who has not felt lost in a cumbersome description of mathematical methods, only to breathe a sigh of relief on finally finding the value of p? Especially if the p is small and has many zeros.

The problem with p is that, although it is unanimously worshipped, it is also mostly misunderstood. Its value is, very often, misinterpreted. And this is so because many of us harbor misconceptions about what the p-value really means.

Let’s try to clarify it.

Whenever we want to know something about a variable, the effect of an exposure, the comparison of two treatments, etc., we will face the ubiquity of chance: it is everywhere and we can never get rid of it, although we can try to limit it and, of course, try to measure its effect.

Let’s give an example to understand it better. Suppose we are doing a clinical trial to compare the effect of two diets, A and B, on weight gain in two groups of participants. Simplifying, the trial will have one of three outcomes: those of diet A gain more weight, those of diet B gain more weight, both groups gain equal weight (there could even be a fourth: both groups lose weight). In any case, we will always obtain a different result, just by chance (even if the two diets are the same).

Imagine that those on diet A put on 2 kg and those on diet B, 3 kg. Is diet B more fattening, or is the difference due to chance (chosen samples, biological variability, inaccuracy of measurements, etc.)? This is where our hypothesis contrast comes in.

When we are going to do the test, we start from the hypothesis of equality, of no difference in effect (the two diets induce the same increase in weight). This is what we call the null hypothesis (H0) which, I repeat to make it clear, we assume to be true. If the variable we are measuring follows a known probability distribution (normal, chi-square, Student’s t, etc.), we can calculate the probability of each of the values of the distribution. In other words, we can calculate the probability of obtaining a result as different from equality as the one we have obtained, always under the assumption of H0.

That is the p-value: the probability of obtaining, by chance alone, a difference at least as large as the one observed if H0 were true. By agreement, if that probability is less than 5% (0.05) it will seem unlikely that the difference is due to chance and we will reject H0, the equality hypothesis, accepting the alternative hypothesis (Ha) that, in this example, will say that one diet is better than the other. On the other hand, if the probability is greater than 5%, we will not feel confident enough to affirm that the difference is not due to chance, so we DO NOT reject H0 and we stay with the hypothesis of equal effects: the two diets are similar.
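To put some numbers on the diets, here is a minimal Python sketch. The text only gives the two mean gains (2 and 3 kg), so the standard deviation of 2 kg, the group sizes and the use of a normal approximation are assumptions of mine, chosen only to illustrate the calculation.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(mean_a, mean_b, sd, n_per_group):
    """Two-sided p-value for a difference in means under H0 (normal approximation)."""
    standard_error = sd * sqrt(2 / n_per_group)   # same sd and n assumed in both groups
    z = (mean_b - mean_a) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical trials: same observed difference, different sample sizes.
for n in (30, 60):
    p = two_sided_p(mean_a=2.0, mean_b=3.0, sd=2.0, n_per_group=n)
    print(f"n = {n} per group: p = {p:.3f}")
```

With these invented figures the same 1 kg difference falls just short of significance with 30 participants per group and is clearly significant with 60, which already hints that the p-value depends on more than the size of the effect.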

Keep in mind that we always move in the realm of probability. If p is less than 0.05 (statistically significant), we will reject H0, but always with a probability of committing a type 1 error: taking for granted an effect that, in reality, does not exist (a false positive). On the other hand, if p is greater than 0.05, we stay with H0 and we say that there is no difference in effect, but always with a probability of committing a type 2 error: not detecting an effect that actually exists (a false negative).

We can see, therefore, that the p-value is fairly simple from the conceptual point of view. However, there are a number of common errors about what the p-value does or does not represent. Let’s try to clarify them.

It is false that a p-value less than 0.05 means that the null hypothesis is false and a p-value greater than 0.05 that the null hypothesis is true. As we have already mentioned, the approach is always probabilistic. A p < 0.05 only means that, by agreement, it is unlikely that H0 is true, so we reject it, although always with a small probability of being wrong. On the other hand, if p > 0.05, it is not guaranteed that H0 is true either, since there may be a real effect that the study does not have sufficient power to detect.

At this point we must emphasize one fact: the null hypothesis is only falsifiable. This means that we can only reject it (and thus stay with Ha, with a probability of error), but we can never affirm that it is true. If p > 0.05 we cannot reject it, so we will remain in the initial assumption of equality of effect, which we cannot demonstrate in a positive way.

It is false that the p-value is related to the reliability of the study. We may think that the lower the value of p, the more reliable the conclusions of the study, but that is not true either. Actually, the p-value is the probability of obtaining a similar value by chance if we repeat the experiment under the same conditions, and it does not depend only on whether the effect we want to demonstrate exists or not. There are other factors that can influence the magnitude of the p-value: the sample size, the effect size, the variance of the measured variable, the probability distribution used, etc.

It is false that the p-value indicates the relevance of the result. As we have already repeated several times, the p-value only gives us an idea of the role that chance may have played in the difference observed. A statistically significant difference does not necessarily have to be clinically relevant. Clinical relevance is established by the researcher and it is possible to find results with a very small p that are not relevant from the clinical point of view and, vice versa, non-significant values that are clinically relevant.

It is false that the p-value represents the probability that the null hypothesis is true. This belief is why, sometimes, we look for the exact value of p and do not settle for knowing only if it is greater or less than 0.05. This conceptual error stems from a misinterpretation of conditional probability. We are interested in knowing the probability that H0 is true once we have obtained some results with our test. Mathematically expressed, we want to know P(H0 | results). However, the value of p gives us the probability of obtaining our results under the assumption that the null hypothesis is true, that is, P(results | H0).

Therefore, if we interpret that the probability that H0 is true in view of our results (P(H0 | results)) is equal to the value of p (P(results | H0)), we will be falling into an inverse fallacy or transposition of conditionals fallacy.

In fact, the probability that H0 is true does not depend only on the results of the study, but is also influenced by the prior probability estimated before the study, which is a measure of the subjective belief that reflects its plausibility, generally based on previous studies and knowledge. Suppose we want to test an effect that we believe is very unlikely to be true. We will treat a p-value < 0.05 with caution, even though it is significant. On the contrary, if we are convinced that the effect exists, we will settle for a less demanding p-value.

In summary, to calculate the probability that the effect is real we must calibrate the p-value with the baseline probability of H0, which will be assigned by the researcher or derived from previously available data. There are mathematical methods to calculate this probability from the baseline probability and the p-value, but the simplest way is to use a graphical tool, Held’s nomogram, which you can see in the figure.

To use Held’s nomogram we just have to draw a line from the prior probability of H0 that we consider to the p-value and extend it to see what posterior probability we reach. As an example, we have represented a study with a p-value = 0.03 in which we believe that the probability of H0 is 20% (we believe there is an 80% probability that the effect is real). If we extend the line it will tell us that the minimum probability of H0 is 6%: there is a 94% probability that the effect is real. On the other hand, think of another study with the same p-value but in which we think that the probability of the effect is lower, for example, 20% (the probability of H0 is 80%). For the same value of p, the minimum posterior probability of H0 is 50%, so there is a 50% probability that the effect is real. As we can see, the posterior probability changes according to the prior probability.

And here we will end for today. We have seen that the p-value only gives us an idea of the role that chance may have had in our results and that, in addition, it may depend on other factors, perhaps the most important being the sample size. The conclusion is that, in many cases, the p-value only allows us to assess the relevance of the results of a study in a very limited way. To do it better, it is preferable to resort to confidence intervals, which will allow us to assess both clinical relevance and statistical significance. But that is another story…
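For those who prefer a calculation to a ruler, the reading of the nomogram can be approximated with a few lines of Python. This sketch assumes that the nomogram rests on the minimum Bayes factor bound -e · p · ln(p), valid for p below 1/e; that is my assumption about what lies behind the figure, and the function name is hypothetical.

```python
from math import e, log


def min_posterior_h0(p_value, prior_h0):
    """Lower bound on P(H0 | results), using the minimum Bayes factor -e * p * ln(p)."""
    if not 0 < p_value < 1 / e:
        raise ValueError("the bound is only valid for 0 < p < 1/e")
    if not 0 < prior_h0 < 1:
        raise ValueError("the prior probability of H0 must be between 0 and 1")
    min_bayes_factor = -e * p_value * log(p_value)
    prior_odds = prior_h0 / (1 - prior_h0)
    posterior_odds = prior_odds * min_bayes_factor
    return posterior_odds / (1 + posterior_odds)


for prior in (0.20, 0.80):
    posterior = min_posterior_h0(p_value=0.03, prior_h0=prior)
    print(f"prior P(H0) = {prior:.0%} -> minimum posterior P(H0) = {posterior:.0%}")
```

Under that assumption the two scenarios of the example give roughly 7% and 53%, close to the 6% and 50% read off the figure, where a line drawn by eye cannot be much more precise.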

## The cheaters detector

When we think about inventions and inventors, the name of Thomas Alva Edison, known among his friends as the Wizard of Menlo Park, comes to mind for most of us. This gentleman created more than a thousand inventions, some of which can be said to have changed the world. Among them we can name the incandescent bulb, the phonograph, the kinetoscope, the polygraph, the quadruplex telegraph, etc., etc., etc. But perhaps his great merit is not to have invented all these things, but to have applied methods of chain production and teamwork to the research process, favoring the dissemination of his inventions and the creation of the first industrial research laboratory.

But in spite of all his genius and excellence, Edison failed to invent something that would have been as useful as the light bulb: a cheaters detector. The explanation for this failure is twofold: he lived between the nineteenth and twentieth centuries and did not read articles about medicine. If he had lived in our time and had to read medical literature, I have no doubt that the Wizard of Menlo Park would have realized the usefulness of this invention and would have pulled his socks up.

And it is not that I am especially negative today. The problem is that, as Altman said more than 15 years ago, the material sent to medical journals is defective from the methodological point of view in a very high percentage of cases. It’s sad, but the most appropriate place to store many of the published studies is the rubbish bin.

In most cases the cause is probably the ignorance of those who write. “We are clinicians”, we say, so we leave aside the methodological aspects, of which our knowledge is, in general, quite deficient. To fix it, journal editors send our studies to other colleagues, who are more or less like us. “We are clinicians”, they say, so all our mistakes go unnoticed by them.

Although this is, in itself, serious, it can be remedied by studying. But it is even more serious that, sometimes, these errors can be intentional, with the aim of leading the reader to a certain conclusion after reading the article. The remedy for this problem is to make a critical appraisal of the study, paying attention to its internal validity. In this sense, perhaps the most difficult aspect to assess for the clinician without methodological training is the statistics used to analyze the results of the study. It is here, undoubtedly, that our ignorance can be most easily exploited, by using methods that provide more striking results instead of the right methods.

As I know that you are not going to be willing to do a master’s degree in biostatistics, waiting for someone to invent the cheaters detector, we are going to give a series of clues so that non-expert readers can suspect the existence of these cheats.

The first may seem obvious, but it is not: has a statistical method been used? Although it is exceptionally rare, there may be authors who do not consider using any. I remember a medical congress I attended in which the values of a variable were shown throughout the study that, first, went up and then went down, which allowed the speaker to conclude that the result was not “on the blink”. As is logical and evident, any comparison must be made with the proper hypothesis contrast, and the level of significance and the statistical test used have to be specified. Otherwise, the conclusions will lack any validity.

A key aspect of any study, especially those with an intervention, is the prior calculation of the necessary sample size. The investigator must define the clinically relevant effect that he wants to be able to detect with his study and then calculate what sample size will provide the study with enough power to prove it. The sample of a study is not large or small, but sufficient or insufficient. If the sample is not sufficient, an existing effect may not be detected due to lack of power (type 2 error). On the other hand, a larger sample than necessary may show an effect that is not relevant from the clinical point of view as statistically significant. Here are two very common cheats. First, the study that does not reach significance and whose authors say it is due to lack of power (insufficient sample size), but who make no effort to calculate the power, which can always be done a posteriori. In that case, we can calculate it using statistical programs or any of the calculators available on the internet, such as GRANMO. Second, the sample size is increased until the difference observed is significant, finding the desired p < 0.05. This case is simpler: we only have to assess whether the effect found is relevant from the clinical point of view. I advise you to practice and compare the sample sizes the studies actually need with those defined by the authors. You may be in for a surprise.
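If you want to try that exercise without a calculator at hand, the following Python sketch uses the classic normal-approximation formula for comparing two means with equal group sizes. It is one common textbook approach and not necessarily what GRANMO or any given author uses, and the 1 kg difference and 2 kg standard deviation are invented for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(delta, sd, alpha=0.05, power=0.80):
    """Participants per group to detect a mean difference `delta` (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for the alpha level
    z_beta = NormalDist().inv_cdf(power)            # quantile for the desired power
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


# Hypothetical trial: relevant difference of 1 kg, standard deviation of 2 kg.
print(sample_size_per_group(delta=1.0, sd=2.0))              # 63
print(sample_size_per_group(delta=1.0, sd=2.0, power=0.90))
print(sample_size_per_group(delta=0.5, sd=2.0))
```

Halving the difference we want to detect roughly quadruples the sample needed, which is why the clinically relevant effect has to be defined before the study and not after.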

Once the participants have been selected, a fundamental aspect is the baseline homogeneity of the groups. This is especially important in the case of clinical trials: if we want to be sure that the observed difference in effect between the two groups is due to the intervention, the two groups should be the same in everything except the intervention.

For this we will look at the classic table I of the trial publication. Here we have to say that, if we have distributed the participants at random between the two groups, any difference between them will be due, one way or another, to chance. Do not be fooled by the p; remember that the sample size is calculated for the clinically relevant magnitude of the main variable, not for the baseline characteristics of the two groups. If you see any difference and it seems clinically relevant, it will be necessary to verify that the authors have taken into account its influence on the results of the study and have made the appropriate adjustment during the analysis phase.

The next point is that of randomization. This is a fundamental part of any clinical trial, so it must be clearly defined how it was done. Here I have to tell you that chance is capricious and has many vices, but rarely produces groups of equal size. Think for a moment about flipping a coin 100 times. Although the probability of getting heads in each throw is 50%, it will be very rare to get exactly 50 heads in 100 throws. The greater the number of participants, the more suspicious it should seem to us that the two groups are equal. But beware, this only applies to simple randomization. There are methods of randomization in which groups can be more balanced.
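How rare is “very rare”? The binomial distribution answers it, and a short Python sketch is enough to check. It assumes simple randomization with a fair coin and an even number of participants.

```python
from math import comb


def prob_exact_split(n):
    """Probability that n fair coin flips give exactly n/2 heads (n must be even)."""
    if n % 2:
        raise ValueError("n must be even for an exact 50/50 split")
    return comb(n, n // 2) / 2 ** n


for n in (10, 100, 1000):
    print(f"{n} participants: P(two groups of equal size) = {prob_exact_split(n):.3f}")
```

With 100 flips the probability of exactly 50 heads is about 8%, and it keeps falling as the number of participants grows.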

Another hot spot is the misuse that can sometimes be made of qualitative variables. Although qualitative variables can be coded with numbers, be very careful about doing arithmetic operations with them. It will probably not make any sense. Another cheat that we can find has to do with categorizing a continuous variable. Converting a continuous variable into a qualitative one usually leads to loss of information, so it must have a clear clinical meaning. Otherwise, we can suspect that the reason is the search for a p-value less than 0.05, always easier to achieve with the qualitative variable.

Going into the analysis of the data, we must check that the authors have followed the a priori designed protocol of the study. Always be wary of post hoc analyses that were not planned from the beginning. If we look hard enough, we will always find a group that behaves as we want. As the saying goes, if you torture the data long enough, it will confess to anything.

Another unacceptable behavior is to finish the study ahead of time because of good results. Once again, if the duration of the follow-up has been established during the design phase as the best time to detect the effect, this must be respected. Any violation of the protocol must be more than justified. Logically, it is ethical to finish the study ahead of time for safety reasons, but it will be necessary to take into account how this fact affects the evaluation of the results.

Before performing the analysis of the results, the authors of any study have to clean their data, reviewing the quality and integrity of the values collected. In this sense, one of the aspects to pay attention to is the management of outliers. These are the values that are far from the central values of the distribution. On many occasions they can be due to errors in the calculation, measurement or transcription of the value of the variable, but they can also be real values that are due to the special idiosyncrasy of the variable. The problem is that there is a tendency to eliminate them from the analysis even when there is no certainty that they are due to an error. The correct thing to do is to take them into account when doing the analysis and use, if necessary, robust statistical methods that allow for these deviations.

Finally, the aspect that can be most daunting for those who are not experts in statistics is knowing whether the correct statistical method has been used. A frequent error is the use of parametric tests without previously checking whether the necessary requirements are met. This can be done out of ignorance or to obtain statistical significance, since parametric tests are less demanding in this regard: to put it plainly, the p-value will usually be smaller than with the equivalent non-parametric test.

Also, with certain frequency, other requirements needed to apply a certain contrast test are ignored. As an example, in order to perform a Student’s t test or an ANOVA, homoscedasticity (a very ugly word that means that the variances are equal) must be checked, and that check is overlooked in many studies. The same happens with regression models that, frequently, are not accompanied by the mandatory diagnostics of the model that justify its use.

Another issue in which there may be cheating is that of multiple comparisons. For example, when the ANOVA reaches significance, the meaning is that there are at least two means that are different, but we do not know which, so we start comparing them two by two. The problem is that when we make repeated comparisons the probability of type 1 error increases, that is, the probability of finding significant differences only by chance. This may allow finding, if only by chance, a p < 0.05, which improves the appearance of the study (especially if you spent a lot of time and/or money doing it). In these cases, the authors must use one of the available corrections (such as Bonferroni’s, one of the simplest) so that the global alpha remains below 0.05. The price to pay is simple: the p-value has to be much smaller to be significant. When we see multiple comparisons without a correction, there are only two explanations: the ignorance of whoever made the analysis or the attempt to find a statistical significance that, probably, would not survive the stricter threshold that the correction would entail.
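To see how quickly the problem grows, here is a minimal Python sketch. It assumes, for simplicity, that the comparisons are independent (pairwise comparisons after an ANOVA are not strictly so), and the four groups are a hypothetical example.

```python
from math import comb


def familywise_error(alpha, m):
    """Probability of at least one false positive in m independent comparisons."""
    return 1 - (1 - alpha) ** m


def bonferroni_threshold(alpha, m):
    """Significance level each comparison must reach under Bonferroni's correction."""
    return alpha / m


groups = 4                      # hypothetical ANOVA with four groups
m = comb(groups, 2)             # number of two-by-two comparisons
print(f"{m} comparisons")
print(f"Global type 1 error without correction: {familywise_error(0.05, m):.3f}")
print(f"Threshold per comparison with Bonferroni: {bonferroni_threshold(0.05, m):.4f}")
```

With only four groups there are already six comparisons and, without correction, the chance of at least one spurious significant result is above 26%, while Bonferroni demands a p below roughly 0.008 in each comparison.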

Another frequent victim of misuse of statistics is the Pearson’s correlation coefficient, which is used for almost everything. The correlation, as such, tells us if two variables are related, but does not tell us anything about the causality of one variable for the production of the other. Another misuse is to use the correlation coefficient to compare the results obtained by two observers, when probably what should be used in this case is the intraclass correlation coefficient (for continuous variables) or the kappa index (for dichotomous qualitative variables). Finally, it is also incorrect to compare two measurement methods (for example, capillary and venous glycaemia) by correlation or linear regression. For these cases the correct thing would be to use the Passing-Bablok’s regression.

Another situation in which a paranoid mind like mine would suspect is one in which the statistical method employed is not known by the smartest people in the place. Whenever there is a better known (and often simpler) way to do the analysis, we must ask ourselves why they have used such a weird method. In these cases, we will require the authors to justify their choice and provide a reference where we can review the method. In statistics, you have to try to choose the right technique for each occasion and not the one that gives us the most appealing result.

In any of the previous contrast tests, the authors usually use a level of significance for p <0.05, as usual, but the contrast can be done with one or two tails. When we do a trial to try a new drug, what we expect is that it works better than the placebo or the drug with which we are comparing it. However, two other situations can occur that we cannot disdain: that it works the same or, even, that it works worse. A bilateral contrast (with two tails) does not assume the direction of the effect, since it calculates the probability of obtaining a difference equal to or greater than that observed, in both directions. If the researcher is very sure of the direction of the effect, he can make a unilateral contrast (with one tail), measuring the probability of the result in the direction considered. The problem is when he does it for another reason: the p-value of a bilateral contrast is twice as large as that of the unilateral contrast, so it will be easier to achieve statistical significance with the unilateral contrast. The wrong thing is to do the unilateral contrast for that reason. The correct thing, unless there are well-justified reasons, is to make a bilateral contrast.
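The temptation is easy to illustrate with a short Python sketch. It assumes a test statistic that follows a standard normal distribution and an observed value of z = 1.8 in the expected direction; both are assumptions chosen for the example.

```python
from statistics import NormalDist


def one_and_two_tailed_p(z):
    """p-values for an observed z, assuming the effect was expected to be positive."""
    one_tailed = 1 - NormalDist().cdf(z)               # only the expected direction
    two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))    # both directions
    return one_tailed, two_tailed


one_tailed, two_tailed = one_and_two_tailed_p(1.8)
print(f"one-tailed p = {one_tailed:.3f}, two-tailed p = {two_tailed:.3f}")
```

The same result is “significant” with one tail (p of about 0.036) and not with two (p of about 0.072), which is exactly why the choice has to be justified beforehand.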

To wrap up this tricky post, we will say a few words about the use of appropriate measures to present the results. There are many ways to dress up the truth without actually lying and, although they all basically say the same thing, the appearance can be very different depending on how we say it. The most typical example is to use relative risk measures instead of absolute and impact measures. Whenever we see a clinical trial, we must demand that the authors provide the absolute risk reduction and the number needed to treat (NNT). The relative risk reduction gives a larger number than the absolute one, so the impact will seem greater. Given that the absolute measures are easy to calculate and are obtained from the same data as the relative ones, we should be suspicious if the authors do not offer them to us: perhaps the effect is not as important as they are trying to make us believe.
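A quick sketch of how the same data look under each measure. The function name and the trial counts are hypothetical, invented only to show the contrast between relative and absolute measures.

```python
def risk_measures(events_control, n_control, events_treated, n_treated):
    """Absolute risk reduction, relative risk reduction and NNT from raw counts."""
    risk_control = events_control / n_control
    risk_treated = events_treated / n_treated
    arr = risk_control - risk_treated  # absolute risk reduction
    rrr = arr / risk_control           # relative risk reduction
    nnt = 1 / arr if arr != 0 else float("inf")  # number needed to treat
    return arr, rrr, nnt


# Hypothetical trial: 20 deaths in 1000 controls, 10 deaths in 1000 treated.
arr, rrr, nnt = risk_measures(20, 1000, 10, 1000)
print(f"RRR = {rrr:.0%}, ARR = {arr:.1%}, NNT = {nnt:.0f}")
```

In this invented trial a "50% reduction in mortality" and "one death avoided for every 100 patients treated" describe the same result.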

Another example is the use of the odds ratio versus the risk ratio (when both can be calculated). The odds ratio tends to magnify the association between the variables, so its unjustified use should also make us suspicious. If you can, calculate the risk ratio and compare the two measures.
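Comparing both measures takes just a couple of lines. This is a minimal sketch with a hypothetical 2×2 table; the counts and names are assumptions for illustration only.

```python
def risk_ratio_and_odds_ratio(a, b, c, d):
    """Risk ratio and odds ratio from a 2x2 table.

    a: exposed with event      b: exposed without event
    c: unexposed with event    d: unexposed without event
    """
    risk_ratio = (a / (a + b)) / (c / (c + d))
    odds_ratio = (a * d) / (b * c)
    return risk_ratio, odds_ratio


# Hypothetical cohort: 40/100 events among exposed, 20/100 among unexposed.
rr, odds_ratio = risk_ratio_and_odds_ratio(40, 60, 20, 80)
print(f"risk ratio = {rr:.2f}, odds ratio = {odds_ratio:.2f}")
```

With a common event like this one, the odds ratio sits clearly further from 1 than the risk ratio. When the event is rare, the two measures come close to each other.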

Likewise, we will be suspicious of studies of diagnostic tests that do not provide the likelihood ratios and limit themselves to sensitivity, specificity and predictive values. Predictive values can be high if the prevalence of the disease in the study population is high, but they would not be applicable to populations with a lower proportion of patients. This is avoided with the use of likelihood ratios. We should always ask ourselves what reason the authors may have had to omit the most valid parameter for gauging the power of a diagnostic test.
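The dependence of predictive values on prevalence is easy to check. The sketch below assumes a hypothetical test with 90% sensitivity and 90% specificity; the function names and figures are illustrative, not taken from any study.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Positive and negative predictive values for a given prevalence."""
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    tn = specificity * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)


def likelihood_ratios(sensitivity, specificity):
    """Positive and negative likelihood ratios (independent of prevalence)."""
    return sensitivity / (1 - specificity), (1 - sensitivity) / specificity


# Same hypothetical test in two populations with different prevalence.
ppv_high, _ = predictive_values(0.90, 0.90, prevalence=0.50)
ppv_low, _ = predictive_values(0.90, 0.90, prevalence=0.01)
lr_pos, lr_neg = likelihood_ratios(0.90, 0.90)
print(f"PPV at 50% prevalence = {ppv_high:.2f}, at 1% prevalence = {ppv_low:.2f}")
print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.2f}")
```

The positive predictive value collapses when the disease is rare, while the likelihood ratios are computed from sensitivity and specificity alone.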

And finally, be very careful with the graphical representations of results: here the possibilities of dressing up the truth are limited only by our imagination. You have to look at the units used and try to extract the information from the graph beyond what it might seem to represent at first glance.

And here we leave the topic for today. We have not spoken in detail about another of the most misunderstood and manipulated entities, which is none other than our p. Many meanings are attributed to p, usually erroneously, such as the probability that the null hypothesis is true, a probability that has its own specific method of estimation. But that is another story…

## Pairing

You will all know the case of someone who, after carrying out a study and collecting several million variables, went to the statistician at his workplace and, reliably demonstrating how clear his ideas about his own work were, said: please (you have to be polite), cross everything with everything and see what comes out.

At this point, several things can happen to you. If the statistician is an unscrupulous soul, she will give you a half smile and tell you to come back in a few days. Then you will be handed several hundred sheets of graphs, tables and numbers that you will not know what to do with. Another thing that can happen is that she sends you to hell, tired as she is of receiving similar requests.

But you may be lucky and find a competent and patient statistician who, in a self-sacrificing way, will explain to you that this is not how things should work. The logical thing is that, before collecting any data, you have written a project report that plans, among other things, what is to be analyzed and which variables must be crossed with each other. She may even suggest that, if the analysis is not very complicated, you try to do it yourself.

The latter may seem like the delirium of a mind disturbed by mathematics but, if you think about it for a moment, it is not such a bad idea. If we do the analysis, at least the preliminary, of our results, it can help us to better understand the study. Also, who can know what we want better than ourselves?

With current statistical packages, the simplest bivariate statistics are within our reach. We only have to be careful in choosing the right hypothesis test, for which we must take into account three aspects: the type of variables that we want to compare, whether the data are paired or independent, and whether we have to use parametric or non-parametric tests. Let’s look at these three aspects.

Regarding the type of variables, there are multiple names depending on the classification or the statistical package that we use but, simplifying, we will say that there are three types. First, there are continuous variables, which can take any value within a range, such as weight, height, blood glucose concentration, etc. Second, there are nominal variables, which consist of two or more mutually exclusive categories. For example, the variable “hair color” can have the categories “brown”, “blond” and “red”. When these variables have two categories, we call them dichotomous (yes / no, alive / dead, etc.). Finally, when the categories are ordered by rank, we speak of ordinal variables: “does not smoke”, “smokes little”, “smokes moderately”, “smokes a lot”. Although ordinal variables can sometimes use numbers, these indicate the position of the categories within the series, without implying, for example, that the distance from category 1 to 2 is the same as that from 2 to 3. For example, we can classify vesicoureteral reflux in grades I, II, III and IV (grade IV is more than grade II, but it does not mean twice as much reflux).

Knowing what kind of variable we are dealing with is simple. If in doubt, we can follow this reasoning, based on the answers to two questions:

1. Does the variable have infinite theoretical values? Here we have to do a bit of abstraction and think about what “theoretical values” really means. For example, if we measure the weight of the subjects of the study, the theoretical values will be infinite although, in practice, they will be limited by the precision of our scale. If the answer to this first question is “yes”, we are dealing with a continuous variable. If it is “no”, we move on to the next question.
2. Are the values sorted in some kind of rank? If the answer is “yes”, we are dealing with an ordinal variable. If the answer is “no”, we have a nominal variable.
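The two questions above translate almost literally into code. This is just a sketch of the reasoning; the function and argument names are made up for the example.

```python
def classify_variable(has_infinite_theoretical_values, values_are_ranked):
    """Classify a variable by answering the two questions in order."""
    if has_infinite_theoretical_values:
        return "continuous"
    if values_are_ranked:
        return "ordinal"
    return "nominal"


print(classify_variable(True, False))   # weight
print(classify_variable(False, True))   # grade of vesicoureteral reflux
print(classify_variable(False, False))  # hair color
```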

The second aspect is that of paired or independent measures. Two measures are paired when the same variable is measured twice, before and after applying some change, usually in the same subject. For example: blood pressure before and after a stress test, weight before and after a nutritional intervention, etc. On the other hand, independent measures are those that are not related to each other (they are different variables): weight, height, gender, age, etc.

Finally, we mentioned the possibility of using parametric or non-parametric tests. We are not going to go into detail now, but in order to use a parametric test the variable must fulfill a series of characteristics, such as following a normal distribution, having a certain sample size, etc. In addition, there are techniques that are more robust than others when it comes to having to meet these conditions. When in doubt, it is preferable to use non-parametric techniques unnecessarily (the only problem is that it is more difficult to achieve statistical significance, but the contrast is just as valid) than using a parametric test when the necessary requirements are not met.

Once we have answered these three questions, all that remains is to form the pairs of variables that we are going to compare and choose the appropriate statistical test. You can see it summarized in the attached table. The rows represent the type of independent variable, which is the one whose value does not depend on another variable (it is usually on the x axis of graphs) and which is usually the one that we modify in the study to see the effect on another variable (the dependent one). The columns, on the other hand, contain the dependent variable, which is the one whose value changes with the changes in the independent variable. Anyway, do not get muddled: the statistical software will run the hypothesis test without taking into account which variable is the dependent and which the independent, only the types of variables.

The table is self-explanatory, so we will not spend much time on it. For example, if we have measured blood pressure (continuous variable) and we want to know if there are differences between men and women (gender, nominal dichotomous variable), the appropriate test will be Student’s t test for independent samples. If we wanted to see if there is a difference in pressure before and after a treatment, we would use the same Student’s t test but for paired samples.

Another example: if we want to know if there are significant differences in hair color (nominal, polytomous: “blond”, “brown” and “redhead”) depending on whether the participant is from the north or the south of Europe (nominal, dichotomous), we could use a chi-square test.
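The examples mentioned so far can be written as a tiny lookup. Keep in mind that this sketch only covers the three combinations named in the text, not the full table, and that the labels and the function name are assumptions made for illustration.

```python
# Only the combinations named in the text; anything else returns None.
TESTS_FROM_TEXT = {
    (frozenset({"continuous", "dichotomous"}), False): "Student's t test for independent samples",
    (frozenset({"continuous", "dichotomous"}), True): "Student's t test for paired samples",
    (frozenset({"nominal", "dichotomous"}), False): "chi-square test",
}


def suggest_test(type_a, type_b, paired=False):
    """Look up a test by the two variable types, ignoring their order."""
    return TESTS_FROM_TEXT.get((frozenset({type_a, type_b}), paired))


print(suggest_test("continuous", "dichotomous"))               # pressure by gender
print(suggest_test("dichotomous", "continuous", paired=True))  # pressure before/after
print(suggest_test("nominal", "dichotomous"))                  # hair color by region
```

Using an unordered pair as the key mirrors the idea that the software does not care which variable is the dependent one and which the independent.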

And here we will end for today. We have not talked about the peculiarities of each test that we have to take into account; we have only named the tests. For example, the chi-square test requires a minimum count in each cell of the contingency table, with Student’s t we must consider whether the variances are equal (homoscedasticity) or not, etc. But that is another story…

## The power of transitive property

When Georg Cantor set out to develop set theory, he could not have imagined everything that would come after, probably at the hands of mathematicians as dedicated as he was. I am thinking of the curious case of binary relations, which the older ones among you will remember from the time when children learned things at school.

It turns out that some mathematical genius starts thinking and describes a series of properties. The first is the reflexive property. This means that any number x is equal to itself: x = x. In case anyone has not understood, let us give an anatomical example: my right hand is my right hand. I believe that the genius who invented the reflexive property needed a long recovery in some spa after such a huge mental strain.

It was in this spa where he decided to do something more intense, so he described the symmetric property, which is much more complex: whenever a number x equals y, then y equals x. Going back to the anatomical simile, if my arms and legs are my extremities, you will have to agree that my extremities are my arms and my legs. Algebra is fascinating.

Luckily, in the end, just to tick the box and cover his back, our anonymous genius invented the transitive property, which goes more or less like this: if a number x is related to y, and y is related to z, there is transitivity if x is related to z. Again, to the anatomy: if my leg is mine and my foot belongs to my leg, my foot is also mine. After that, more properties were derived from these three, but we shall leave it here for the moment, because today we are going to use the power of the transitive property to find out which of two things that we have never actually compared is the better one. Think, for example, of a crazed mob running into a shopping center on the first day of sales. They look at everything before deciding what to buy, but it is not necessary to compare all the products pairwise to know which one we like best.

In medicine something similar happens. There are usually several options to treat the same disease (although those of us who have been in the business for a long time know that the more there are, the more likely it is that none works very well). Clinical trials, and meta-analyses of clinical trials, only compare pairs, and it may happen that no one has compared the two treatments we have at our disposal, or that we want to know which is, in theory, the best of all those available.

Well, for that purpose a methodological design called network meta-analysis (NMA), also called multiple-treatments meta-analysis or mixed-treatment comparisons meta-analysis, has been invented. And in this last term, mixed comparisons, lies the crux of the matter, because it turns out that there are several types of comparisons. Let’s see them.

Let’s assume we have three possible treatments that, after deep reflection, I have decided to call A, B and C. The simplest situation is to compare two of them, A and B, for example, with a conventional clinical trial. We would be making a direct comparison between the two interventions. But it may happen that we do not have any trial that directly compares A and B, but that there are two different trials that compare each of them with another intervention, C (you can see it in the attached figure). In this case we can resort to the power of the transitive property and make an indirect comparison between A and B based on their relative efficacy against C. For example, if A reduces mortality by 50% compared to C (relative risk 0.5) and B reduces it by 25% compared to C (relative risk 0.75), we can estimate that A reduces mortality by about a third relative to B (0.5 / 0.75 ≈ 0.67). Of course, in order to do this, transitivity has to hold, something that we cannot take for granted. For example, if I like pork and pigs like to wallow in mud, that does not mean that I like to wallow in mud. Transitivity does not hold in this case (I think).
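One common way to put numbers on an indirect comparison is the adjusted indirect comparison usually attributed to Bucher: subtract the log relative risks against the common comparator and add their variances. The post does not name a specific method, so take this as a sketch of one option. The relative risks and standard errors below are invented for illustration.

```python
import math


def indirect_comparison(rr_ac, se_log_ac, rr_bc, se_log_bc, z=1.96):
    """Indirect A vs B relative risk from A vs C and B vs C (Bucher-style).

    se_log_ac and se_log_bc are the standard errors of the log relative risks.
    Returns the point estimate and an approximate 95% confidence interval.
    """
    log_rr_ab = math.log(rr_ac) - math.log(rr_bc)
    se_log_ab = math.sqrt(se_log_ac ** 2 + se_log_bc ** 2)  # variances add up
    lower = math.exp(log_rr_ab - z * se_log_ab)
    upper = math.exp(log_rr_ab + z * se_log_ab)
    return math.exp(log_rr_ab), lower, upper


# Hypothetical inputs: A vs C with RR 0.50, B vs C with RR 0.75.
rr_ab, ci_low, ci_high = indirect_comparison(0.50, 0.10, 0.75, 0.10)
print(f"indirect RR (A vs B) = {rr_ab:.2f}, 95% CI {ci_low:.2f} to {ci_high:.2f}")
```

Note that the indirect estimate is always less precise than either of the direct ones, because the two uncertainties accumulate. And the calculation is only meaningful if transitivity holds.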

Well, an NMA is nothing more than a series of direct, indirect and mixed comparisons (those that combine direct and indirect evidence for the same pair of treatments) that allow us to compare the relative effects of several interventions. Multiple comparisons are typically represented as a network diagram in which we can see the direct, indirect and mixed comparisons. Each node in the network, which can vary in size according to its specific contribution, corresponds to one of the interventions evaluated in the primary studies of the review, while the lines joining the nodes represent the direct comparisons between them. The complete network will represent all the treatment comparisons identified from the primary studies of the review that our NMA incorporates.

As with the other types of meta-analyses coupled with a systematic review, the validity of the NMA will depend on the validity of the primary studies, the heterogeneity among them and the possible existing information biases, factors that will condition the quality of the direct comparisons.

In addition, indirect comparisons are considered observational and require, as we have already mentioned, that the researcher judge the transitivity of the interventions based on her knowledge of them, of the disease and of the designs of the primary studies.

Another specific aspect of the NMA is that of coherence or consistency, which makes reference to the level of agreement among the evidence coming from direct and indirect comparisons. This level of agreement, which can be measured with specific statistical methods, must be high in order for the summary result measure to be valid. The results of the comparisons must go in the same direction, they cannot be divergent. When this is not fulfilled, the cause probably lies in the poor methodological quality of the primary studies, in their heterogeneity or in the presence of biases.

As in other meta-analyses, the result of the NMA is expressed with a summary result measure that can be an odds ratio, a mean difference, a risk ratio, etc. This point estimate is accompanied by an interval that gives us information about the precision of the estimate. The statistical analysis of the NMA can use frequentist methods (the ones we usually see in clinical trials) or Bayesian methods. The latter assign a prior probability to the effect of the treatment before analyzing the data and then update it to a posterior probability after the analysis. For what interests us here, frequentist methods will assess the precision of the point estimate by means of the familiar confidence intervals (usually 95%), while Bayesian methods will provide credible intervals (also 95%), which are read in a similar way.

With all these data we will obtain a ranking of the compared treatments, with the best heading the list. But do not be overconfident: you have to look at these rankings carefully for several reasons. First, the best treatment in one situation may not be so in another. Second, we must take into account other factors such as cost, availability, knowledge of the clinician, etc. Third, these rankings do not take into account the magnitude of the differences between the different elements. And fourth, chance can play tricks on us and put in a good position a treatment that, in reality, is not as good as it may seem.

Having reviewed, at a glance, the peculiarities of the NMA, what can we say about its critical appraisal? Just as we have a checklist for the systematic review with conventional meta-analysis, the PRISMA statement, there is a specific statement for the NMA, PRISMA-NMA. This list includes, as specific items, aspects such as the description of the geometry of the treatment network, the consideration of the transitivity and consistency assumptions and the description of the methods used to analyze the structure of the network and the suitability of the comparisons, in case some may have a lower degree of evidence. All this will be facilitated if the authors provide the graph with the study network and briefly explain its characteristics.

Anyway, you know that I prefer to resort to the CASP tools for the critical appraisal of documents. Although there is no specific one for NMA, I advise you to use the one for systematic reviews with conventional meta-analysis and, afterwards, to make some considerations about the specific aspects of the NMA.

So as not to make this post too long, we will skip everything that an NMA shares with any other systematic review and go directly to its specific aspects. You can consult the corresponding post where we reviewed the critical appraisal of a systematic review. As always, we will follow our three pillars of wisdom: validity, relevance and applicability.

Regarding VALIDITY, we will ask three specific questions.

1. Does the review respond to a well-defined clinical question that justifies carrying out an NMA? This question has the classic components of the PICO question, although the intervention and the comparison will encompass the multiple comparisons of the network.
2. Was an exhaustive search of the relevant studies carried out? This aspect is important to avoid publication bias and to ensure the inclusion of all the important information available. Its absence can affect the consistency of the comparisons.
3. Is there a clear specification of the target population, the treatments evaluated and the outcome measures used? All these aspects can condition the validity of indirect comparisons. If we want to infer the relationship between the effects of A and B by comparing their individual effects with respect to C, it is essential that A and B are treated similarly in their comparison with C, that the A-C and B-C comparisons are made with similar patients, that the same outcome measures are used and that the risk of bias in the studies is low. The latter can be assessed with the usual tools, such as the Cochrane tool.

To finish this section, we will check that the results are analyzed and presented in an appropriate way, which statistical method has been used (frequentist or Bayesian), and if confidence or credibility intervals, the analysis of the network, etc. are provided.

Although we will not go into it, we will say that there are multiple types of networks (star, loop, line …). For comparisons to be more valid, indirect comparisons must be supported by direct ones. This can be seen in the network scheme by the presence of triangles similar to the graph that I attached at the beginning of the post (or other closed geometric shapes). In conditions of equality of other factors that can have an influence and that we have already mentioned, the more triangles we see, the more valid the comparisons will be.
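If the list of direct comparisons in the network is available, looking for those triangles can be automated. This is a minimal sketch with hypothetical treatment labels; it only detects closed loops of three treatments, not longer loops.

```python
from itertools import combinations


def closed_triangles(direct_comparisons):
    """Return the trios of treatments in which every pair was compared directly."""
    edges = {frozenset(pair) for pair in direct_comparisons}
    treatments = sorted({t for pair in edges for t in pair})
    return [
        trio
        for trio in combinations(treatments, 3)
        if all(frozenset(pair) in edges for pair in combinations(trio, 2))
    ]


# Hypothetical star network: everything compared only against C.
star = [("A", "C"), ("B", "C"), ("D", "C")]
# The same network after adding a head-to-head A vs B trial.
with_loop = star + [("A", "B")]

print(closed_triangles(star))       # no closed loops
print(closed_triangles(with_loop))  # the A-B-C triangle
```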

As a last aspect, we will evaluate if the authors have used the appropriate methods to assess the heterogeneity and the possible existence of inconsistency: sensitivity analysis, metaregression, etc.

Going to the RELEVANCE section, we will value the results of the meta-analysis. Here we will consider five specific aspects:

1. What is the result? As in any other meta-analysis, we will assess the result and its importance from the clinical point of view.

It will be necessary to assess how the result could have been influenced by the risk of bias in the primary studies: the greater the risk of bias, the farther our estimate may be from the truth.

2. Are the results accurate? In this sense, we must assess the width of the confidence or credible intervals, taking into account how the conclusions of the study would be affected at each end of the interval.
3. Is there consistency of results among different studies? There may be variability by pure chance or by heterogeneity among the studies. We can assess it by observing the shape of the forest plots and with the help of the usual statistical methods, such as I².
4. Are indirect comparisons reliable? We return again to the concept of transitivity, which must be taken into account together with the other factors that we have previously commented on and which may increase the risk of bias: homogeneous populations, outcome variables and common comparators, etc.
5. Is there consistency among direct and indirect comparisons? We will have to check for closed geometric shapes within the network (our triangles or loops), as well as rule out causes of inconsistency, which are the same ones we have already mentioned as causing heterogeneity and intransitivity.

Finally, we will finish our critical appraisal by making some special considerations regarding the APPLICABILITY of the results.

In addition to taking into account, as usual, whether all the important effects and variables for the patient have been considered and whether the patients are similar to those in our setting, we will ask the questions specifically related to the use of an NMA, such as whether the network has considered all the treatment options or whether the different comparison subgroups that have been established are credible from the clinical point of view.

And here we will leave it for today. A difficult beast to tame, this NMA. And we have not even mentioned its statistical methodology, which is quite complex but which computer packages carry out without flinching. In addition, we could have talked a lot about the types of networks and the comparisons that can be drawn from each of them. But that’s another story…

## An unfairly treated genius

The genius that I am talking about in the title of this post is none other than Alan Mathison Turing, considered one of the fathers of computer science and a forerunner of modern computing.

For mathematicians, Turing is best known for his involvement in the solution of the decision problem previously posed by Gottfried Wilhelm Leibniz and David Hilbert, who were seeking a method that could be applied to any mathematical statement to determine whether or not that statement is true (for those interested in the matter, it was shown that such a method does not exist).

But what Turing is famous for among the general public comes thanks to the cinema and to his work in statistics during World War II. Turing took to exploiting Bayesian magic to explore how the evidence we collect during an investigation can either support the initial hypothesis or favor the development of a new alternative hypothesis. This allowed him to decipher the code of the Enigma machine, which was the one used by the German navy to encrypt its messages, and that is the story that has been brought to the screen. This line of work led to the development of concepts such as the weight of evidence and probabilistic ways of confronting null and alternative hypotheses, which were applied in biomedicine and enabled the development of new ways to evaluate the capabilities of new diagnostic tests, such as the ones we are going to deal with today.

But all this story about Alan Turing turns out to be just a tribute to one of the people whose contribution made it possible to develop the methodological design that we are going to talk about today, which is none other than the meta-analysis of diagnostic accuracy.

We already know that a meta-analysis is a quantitative synthesis method that is used in systematic reviews to integrate the results of primary studies into a summary result measure. The most common is to find systematic reviews on treatment, for which the implementation methodology and the choice of summary result measure are quite well defined. Reviews on diagnostic tests, which have been possible after the development and characterization of the parameters that measure the diagnostic performance of a test, are less common.

The process of conducting a diagnostic systematic review essentially follows the same guidelines as a treatment review, although there are some specific differences that we will try to clarify. We will focus first on the choice of the outcome summary measure and try to take into account the rest of the peculiarities when we give some recommendations for a critical appraisal of these studies.

When choosing the outcome measure, we find the first big difference with meta-analyses of treatment. In the meta-analysis of diagnostic accuracy (MDA) the most frequent way to assess the test is to combine sensitivity and specificity as summary values. However, these indicators have the problem that the cut-off points used to consider the results of the test as positive or negative usually vary among the different primary studies of the review. Moreover, in some cases positivity may depend on the subjectivity of the evaluator (think of the results of imaging tests). All this, besides being a source of heterogeneity among the primary studies, is the origin of a typical MDA bias called the threshold effect, which we will look at a little later.

For this reason, many authors do not like to use sensitivity and specificity as summary measures and resort to positive and negative likelihood ratios. These ratios have two advantages. First, they are more robust against the presence of a threshold effect. Second, as we know, they allow us to calculate the post-test probability using either Bayes’ rule (pre-test odds × likelihood ratio = post-test odds) or a Fagan nomogram (you can review these concepts in the corresponding post).
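Bayes’ rule in its odds form needs only three steps: convert the pre-test probability to odds, multiply by the likelihood ratio and convert back. This is a small sketch; the 20% pre-test probability and the likelihood ratios are hypothetical.

```python
def post_test_probability(pre_test_probability, likelihood_ratio):
    """Post-test probability from pre-test probability and a likelihood ratio."""
    pre_test_odds = pre_test_probability / (1 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)


# Hypothetical patient with a 20% pre-test probability of disease.
after_positive = post_test_probability(0.20, likelihood_ratio=9.0)  # LR+
after_negative = post_test_probability(0.20, likelihood_ratio=0.1)  # LR-
print(f"after a positive test: {after_positive:.2f}")
print(f"after a negative test: {after_negative:.3f}")
```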

Finally, a third possibility is to resort to another of the inventions that derive from Turing’s work: the diagnostic odds ratio (DOR).

The DOR is defined as the ratio of the odds of a positive test among the sick to the odds of a positive test among the healthy. This phrase may seem a bit cryptic, but it is not. The odds of a sick person testing positive versus negative is simply the ratio between true positives (TP) and false negatives (FN): TP / FN. On the other hand, the odds of a healthy person testing positive versus negative is the quotient between false positives (FP) and true negatives (TN): FP / TN. Having seen this, all that remains is to define the ratio between the two odds, as you can see in the attached figure. The DOR can also be expressed in terms of the predictive values and the likelihood ratios, according to the expressions that you can see in the same figure. Finally, it is also possible to calculate its confidence interval, according to the formula that ends the figure.
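As the figure is not reproduced here, the following sketch shows the calculation from the four cells of the table. The confidence interval uses the usual log-based approximation for odds ratios, which is an assumption on my part about the formula in the figure. The counts are hypothetical and the function would fail if any cell were zero.

```python
import math


def diagnostic_odds_ratio(tp, fn, fp, tn, z=1.96):
    """DOR = (TP / FN) / (FP / TN) with an approximate 95% confidence interval."""
    dor = (tp / fn) / (fp / tn)
    se_log_dor = math.sqrt(1 / tp + 1 / fn + 1 / fp + 1 / tn)
    lower = math.exp(math.log(dor) - z * se_log_dor)
    upper = math.exp(math.log(dor) + z * se_log_dor)
    return dor, lower, upper


# Hypothetical study: 100 sick (90 TP, 10 FN) and 100 healthy (20 FP, 80 TN).
dor, dor_low, dor_high = diagnostic_odds_ratio(tp=90, fn=10, fp=20, tn=80)

# The same value obtained as the ratio of the two likelihood ratios.
lr_positive = (90 / 100) / (20 / 100)
lr_negative = (10 / 100) / (80 / 100)
print(f"DOR = {dor:.1f} (95% CI {dor_low:.1f} to {dor_high:.1f})")
print(f"LR+ / LR- = {lr_positive / lr_negative:.1f}")
```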

Like all odds ratios, the possible values of the DOR go from zero to infinity. The null value is 1, which means that the test has no discriminatory capacity between the healthy and the sick. A value greater than 1 indicates discriminatory capacity, which will be greater the greater the value. Finally, values between zero and 1 indicate that the test not only fails to discriminate well between the sick and the healthy, but classifies them wrongly and gives more negative results among the sick than among the healthy.

The DOR is a global parameter easy to interpret and does not depend on the prevalence of the disease, although it must be said that it can vary between groups of patients with different severity of disease. In addition, it is also a very robust measure against the threshold effect and is very useful for calculating the summary ROC curves that we will comment on below.

The second peculiar aspect of MDA that we are going to deal with is the threshold effect. We must always assess its presence when we are faced with an MDA. The first thing will be to observe the clinical heterogeneity among the primary studies, which may be evident without much consideration. There is also a simple mathematical approach, which is to calculate Spearman’s correlation coefficient between sensitivity and specificity. If there is a threshold effect, there will be an inverse correlation between the two, the stronger the greater the threshold effect.
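Spearman’s coefficient is simply Pearson’s correlation applied to the ranks, so it can be sketched with the standard library alone. The sensitivities and specificities below are invented to mimic five studies that differ only in their cut-off point.

```python
import math


def average_ranks(values):
    """Rank the values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson's correlation coefficient between two equally long lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: Pearson's correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical primary studies: as sensitivity falls, specificity rises.
sensitivities = [0.95, 0.90, 0.80, 0.70, 0.60]
specificities = [0.60, 0.70, 0.85, 0.90, 0.95]
rho = spearman(sensitivities, specificities)
print(f"Spearman's rho = {rho:.2f}")
```

A strongly negative coefficient like the one these made-up data produce is what we would expect to see in the presence of a threshold effect.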

Finally, a graphical method is to assess the dispersion of the points representing the sensitivity and specificity of the primary studies on the summary ROC plot of the meta-analysis. Dispersion allows us to suspect a threshold effect, but it can also be due to the heterogeneity of the studies and to other biases such as selection or verification bias.

The third specific element of MDA that we are going to comment on is that of the summary ROC curve (sROC), which is an estimate of the common ROC curve adjusted according to the results of the primary studies of the review. There are several ways to calculate it, some quite complicated from the mathematical point of view, but the most used are the regression models that use the DOR as an estimator, since, as we have said, it is very robust against heterogeneity and the threshold effect. But do not be alarmed, most of the statistical packages calculate and represent the sROC with little effort.

The reading of the sROC is similar to that of any ROC curve. The two most used parameters are the area under the ROC curve (AUC) and the Q index. The AUC of a perfect curve is equal to 1. Values above 0.5 indicate discriminatory diagnostic capacity, which will be higher the closer the value gets to 1. A value of 0.5 tells us that the test is as useful as flipping a coin. Finally, values below 0.5 indicate that the test does not contribute at all to the diagnosis it intends to perform.

On the other hand, the Q index corresponds to the point at which sensitivity and specificity are equal. As with the AUC, a value greater than 0.5 indicates the overall effectiveness of the diagnostic test, which will be higher the closer the index value is to 1. In addition, confidence intervals can be calculated for both the AUC and the Q index, with which it will be possible to assess the precision of the estimate of the summary measure of the MDA.
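To get a feeling for how these two parameters relate to the DOR, here is a sketch for the simplest possible case: a symmetric sROC curve in which the DOR is the same at every threshold. That is a strong assumption that real meta-analyses need not satisfy, and statistical packages may well use more elaborate models. Under it, the point where sensitivity equals specificity and the area under the curve both have closed forms.

```python
import math


def q_index(dor):
    """Point where sensitivity equals specificity on a symmetric sROC curve."""
    root = math.sqrt(dor)
    return root / (1 + root)


def symmetric_sroc_auc(dor):
    """Area under a symmetric sROC curve with a constant DOR."""
    if dor == 1:
        return 0.5  # the curve is the diagonal
    return dor / (dor - 1) ** 2 * ((dor - 1) - math.log(dor))


# Hypothetical summary DOR of 36.
q = q_index(36)
auc = symmetric_sroc_auc(36)
print(f"Q index = {q:.3f}, AUC = {auc:.3f}")
```

A DOR of 1 gives 0.5 for both parameters, the coin flip, and both move towards 1 as the DOR grows.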

Having seen (at a glance) the specific aspects of MDA, we will give some recommendations for the critical appraisal of this type of study. The CASP network does not provide a specific tool for MDA, but we can follow the lines of the systematic review of treatment studies, taking into account the differential aspects of MDA. As always, we will follow our three basic pillars: validity, relevance and applicability.

Let’s start with the questions that value the VALIDITY of the study.

The first question asks whether the subject of the review has been clearly specified. As with any systematic review, a review of diagnostic tests should try to answer a specific, clinically relevant question, which is usually posed following the PICO scheme of a structured clinical question. The second question makes us reflect on whether the type of studies included in the review is adequate. The ideal design is that of a cohort to which the diagnostic test that we want to assess and the gold standard are blindly and independently applied. Other studies based on case-control designs are less valid for the evaluation of diagnostic tests and will reduce the validity of the results.

If the answer to both questions is yes, we turn to the secondary criteria. Have the important studies on the subject been included? We must verify that a comprehensive and unbiased search of the literature has been carried out. The methodology of the search is similar to that of systematic reviews on treatment, although we should take some precautions. For example, diagnostic studies are usually indexed differently in databases, so the use of the usual filters of other types of reviews can cause us to lose relevant studies. We will have to carefully check the search strategy, which must be provided by the authors of the review.

In addition, we must verify that the authors have ruled out the possibility of publication bias. This poses a special problem in MDA, since the study of publication bias in these studies is not well developed and the usual methods, such as the funnel plot or Egger’s test, are not very reliable. The most conservative thing to do is to always assume that there may be publication bias.

It is very important that the authors have made a sufficient effort to assess the quality of the studies, looking for possible biases. For this they can use specific tools, such as QUADAS-2.

To finish the section on internal or methodological validity, we must ask ourselves whether it was reasonable to combine the results of the primary studies. In order to draw conclusions from combined data, it is fundamental that the studies are homogeneous and that the differences among them are due solely to chance. We will have to assess the possible sources of heterogeneity and whether there may be a threshold effect, which the authors should have taken into account.
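One common way to look for a threshold effect is to check whether sensitivity and the false positive rate (1 minus specificity) rise and fall together across the primary studies, for instance with Spearman’s rank correlation: a strong positive correlation suggests that the studies differ mainly in the cut-off point they used. The following sketch shows the idea. The five pairs of sensitivity and specificity are invented purely for illustration, and the helper names are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Invented (sensitivity, specificity) pairs, one per primary study.
studies = [(0.95, 0.60), (0.90, 0.70), (0.85, 0.78), (0.80, 0.85), (0.70, 0.92)]
sensitivities = [sens for sens, _ in studies]
false_positive_rates = [1 - spec for _, spec in studies]

rho = spearman(sensitivities, false_positive_rates)
print(f"Spearman rho between sensitivity and 1 - specificity: {rho:.2f}")
```

In these made-up data sensitivity falls every time specificity rises, so the correlation is as high as it can be, which is the pattern we would expect if the only difference among studies were the threshold chosen.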

In summary, the fundamental aspects that we will have to analyze to assess the validity of an MDA will be: 1) that the objectives are well defined; 2) that the bibliographic search has been exhaustive; and 3) that the internal or methodological validity of the included studies has also been verified. In addition, we will review the methodological aspects of the meta-analysis technique: the appropriateness of combining the studies to perform a quantitative synthesis, an adequate evaluation of the heterogeneity of the primary studies and of the possible threshold effect, and the use of an adequate mathematical model to combine the results of the primary studies (sROC, DOR, etc.).

Regarding the RELEVANCE of the results, we must consider what the overall result of the review is and whether it has been interpreted judiciously. We will value more those MDA that provide measures that are more robust against possible biases, such as likelihood ratios and DOR. In addition, we must assess the precision of the results, for which we will use our beloved confidence intervals, which will give us an idea of the precision of the estimate of the true magnitude of the effect in the population.
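As a reminder of where these measures and their confidence intervals come from, here is a short sketch that computes the likelihood ratios and the DOR of a single 2x2 table, with approximate 95% confidence intervals obtained with the usual normal approximation on the logarithmic scale. The counts are invented for the example, the function names are hypothetical, and a pooled estimate in a real MDA would come from a weighted model rather than from one table.

```python
import math

def ratio_with_ci(estimate, se_log, z=1.96):
    """Return (estimate, lower, upper) from a standard error on the log scale."""
    return (estimate,
            estimate * math.exp(-z * se_log),
            estimate * math.exp(z * se_log))

def accuracy_ratios(tp, fn, fp, tn):
    """Likelihood ratios and DOR of one 2x2 table, with approximate 95% CIs.

    Assumes no cell is zero; otherwise a continuity correction is needed.
    """
    diseased, healthy = tp + fn, fp + tn
    lr_pos = (tp / diseased) / (fp / healthy)   # sensitivity / (1 - specificity)
    lr_neg = (fn / diseased) / (tn / healthy)   # (1 - sensitivity) / specificity
    dor = (tp * tn) / (fp * fn)
    se_lr_pos = math.sqrt(1 / tp - 1 / diseased + 1 / fp - 1 / healthy)
    se_lr_neg = math.sqrt(1 / fn - 1 / diseased + 1 / tn - 1 / healthy)
    se_dor = math.sqrt(1 / tp + 1 / fn + 1 / fp + 1 / tn)
    return {
        "LR+": ratio_with_ci(lr_pos, se_lr_pos),
        "LR-": ratio_with_ci(lr_neg, se_lr_neg),
        "DOR": ratio_with_ci(dor, se_dor),
    }

# Invented counts: 100 diseased (90 test positive), 100 healthy (20 test positive).
results = accuracy_ratios(tp=90, fn=10, fp=20, tn=80)
for name, (value, low, high) in results.items():
    print(f"{name}: {value:.2f} (95% CI {low:.2f} to {high:.2f})")
```

The narrower the interval, the more precise the estimate; an interval for the DOR or for a likelihood ratio that includes 1 would be compatible with a test that does not discriminate at all.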

We will conclude the critical appraisal of MDA by assessing the APPLICABILITY of the results to our environment. We will have to ask whether we can apply the results to our patients and how they will influence the care we give them. We will have to see whether the primary studies of the review describe the participants and whether they resemble our patients. In addition, it will be necessary to see whether all the results relevant to decision making in the problem under study have been considered and, as always, the benefit-cost-risk ratio must be assessed. The fact that the conclusion of the review seems valid does not mean that we are obliged to apply it.

Well, with all that said, we are going to finish for today. The title of this post refers to the mistreatment suffered by a genius. We already know which genius we were referring to: Alan Turing. Now, we will clarify the abuse. Despite being one of the most brilliant minds of the 20th century, as witnessed by his work on statistics, computing, cryptography, cybernetics, etc., and having saved his country from the blockade of the German Navy during the war, in 1952 he was tried for his homosexuality and convicted of gross indecency and sexual perversion. As is easy to understand, his career ended after the trial, and Alan Turing died in 1954, apparently after eating a piece of an apple poisoned with cyanide. His death was ruled a suicide, although there are theories that speak rather of murder. They say that this is where the bitten apple of a well-known brand of computers comes from, although there are others who say that the apple just represents a play on words between bite and byte.

I do not know which of the two theories is true, but I prefer to recall Turing every time I see the little apple. My humble tribute to a great man.

And now we finish. We have seen the peculiarities of meta-analyses of diagnostic accuracy and how to assess them. Much more could be said about all the mathematics associated with their specific aspects, such as the presentation of variables, the study of publication bias, the threshold effect, etc. But that’s another story…