## Rioja vs Ribera

### Frequentist vs Bayesian statistics

This is one of the typical debates that one can have with a brother-in-law during a family dinner: whether the wine from Ribera is better than that from Rioja, or vice versa. In the end, as always, the brother-in-law will be (or will want to be) right, which will not prevent us from trying to contradict him. Of course, we must make good arguments to avoid falling into the same error, in my humble opinion, in which some fall when participating in another classic debate, this one from the less playful field of epidemiology: Frequentist vs. Bayesian statistics?

And these are the two approaches that we can use when dealing with a research problem.

#### Some previous definitions

Frequentist statistics, the best known and to which we are most accustomed, is the one that is developed according to the classic concepts of probability and hypothesis testing. Thus, it is about reaching a conclusion based on the level of statistical significance and the acceptance or rejection of a working hypothesis, always within the framework of the study being carried out. This methodology forces to stabilize the decision parameters a priori, which avoids subjectivities regarding them.

The other approach to solving problems is that of Bayesian statistics, which is increasingly fashionable and, as its name suggests, is based on the probabilistic concept of Bayes’ theorem. Its differentiating feature is that it incorporates external information into the study that is being carried out, so that the probability of a certain event can be modified by the previous information that we have on the event in question. Thus, the information obtained a priori is used to establish an a posteriori probability that allows us to make the inference and reach a conclusion about the problem we are studying.

This is another difference between the two approaches: while Frequentist statistics avoids subjectivity, Bayesian’s one introduces a subjective (but not capricious) definition of probability, based on the researcher’s conviction, to make judgments about a hypothesis.

Bayesian statistics is not really new. Thomas Bayes’ theory of probability was published in 1763, but experiences a resurgence from the last third of the last century. And as usually happens in these cases where there are two alternatives, supporters and detractors of both methods appear, which are deeply involved in the fight to demonstrate the benefits of their preferred method, sometimes looking more for the weaknesses of the opposite than for their own strengths.

And this is what we are going to talk about in this post, about some arguments that Bayesians use on some occasion that, one more time in my humble opinion, take more advantage misuses of Frequentist statistics by many authors, than of intrinsic defects of this methodology.

#### A bit of history

We will start with a bit of history.

The history of hypothesis testing begins back in the 20s of the last century, when the great Ronald Fisher proposed to value the working hypothesis (of absence of effect) through a specific observation and the probability of observing a value equal or greater than the observed result. This probability is the p-value, so sacred and so misinterpreted, that it does not mean more than that: the probability of finding a value equal to or more extreme than that found if the working hypothesis were true.

In summary, the p that Fisher proposed is nothing short of a measure of the discrepancy that could exist between the data found and the hypothesis of work proposed, the null hypothesis (H0).

Almost a decade later, the concept of alternative hypothesis (H1) was introduced, which did not exist in Fisher’s original approach, and the reasoning is modified based on two error rates of false positive and negative:

1. Alpha error (type 1 error): probability of rejecting the null hypothesis when, in fact, it is true. It would be the false positive: we believe we detect an effect that, in reality, does not exist.
2. Beta error (type 2 error): it is the probability of accepting the null hypothesis when, in fact, it is false. It is the false negative: we fail to detect an effect that actually exists.

Thus, we set a maximum value for what seems to us the worst case scenario, which is detecting a false effect, and we choose a “small” value. How small is it? Well, by convention, 0.05 (sometimes 0.01). But, I repeat, it is a value chosen by agreement (and there are those who say that it is capricious, because 5% reminds them the fingers of the hand, which are usually 5).

Thus, if p <0.05, we reject H0 in favor of H1. Otherwise, we accept H0, the hypothesis of no effect. It is important to note that we can only reject H0, never demonstrate it in a positive way. We can demonstrate the effect, but not its absence.

Everything said so far seems easy to understand: the frequentist method tries to quantify the level of uncertainty of our estimate to try to draw a conclusion from the results. The problem is that p, which is nothing more than a way to quantify this uncertainty, is sacralized and too often, which is used to their advantage (if I may say so) by opponents of the method to try to expose its weaknesses.

One of the major flaws attributed to the frequentist method is the dependence of the p-value on the sample size. Indeed, the value of p can be the same with a small effect size in a large sample as with a large effect size in a small sample. And this is more important than it may seem at first, since the value that will allow us to reach a conclusion will depend on a decision exogenous to the problem we are examining: the chosen sample size.

Here would be the benefit of the Bayesian method, in which larger samples would serve to provide more and more information about the study phenomenon. But I think this argument is based on a misunderstanding of what an adequate sample is. I am convinced, the more is not always the better.

Another great man, David Sackett, said that “too small samples can be used to prove nothing; samples that are too large can be used to prove nothing ”. The problem is that, in my opinion, a sample is neither large nor small, but sufficient or insufficient to demonstrate the existence (or not) of an effect size that is considered clinically important.

And this is the heart of the matter. When we want to study the effect of an intervention we must, a priori, define what effect size we want to detect and calculate the necessary sample size to be able to do it, as long as the effect exists (something that we desire when we plan the experiment, but that we don’t know a priori) . When we do a clinical trial we are spending time and money, in addition to subjecting participants to potential risk, so it is important to include only those necessary to try to prove the clinically important effect. Including the necessary participants to reach the desired p <0.05, in addition to being uneconomic and unethical, demonstrates a lack of knowledge about the true meaning of p-value and sample size.

This misinterpretation of the p-value is also the reason that many authors who do not reach the desired statistical significance allow themselves to affirm that with a larger sample size they would have achieved it. And they are right, they would have reached the desired p <0.05, but they again ignore the importance of clinical significance versus statistical significance.

When the sample size to detect the clinically important effect is calculated a priori, the power of the study is also calculated, which is the probability of detecting the effect if it actually exists. If the power is greater than 80-90%, the values admitted by convention, it does not seem correct to say that you do not have enough sample. And, of course, if you have not calculated the power of the study before, you should do it before affirming that you have no results due to shortness of sample.

Another argument against the frequentist method and in favor of the Bayesian’s says that hypothesis testing is a dichotomous decision process, in which a hypothesis is rejected or accepted such as you rejects or accepts an invitation to the wedding of a distant cousin you haven’t seen for years.

Well, if they previously forgot about clinical significance, those who affirm this fact forget about our beloved confidence intervals. The results of a study should not be interpreted solely on the basis of the p-value. We must look at the confidence intervals, which inform us of the precision of the result and of the possible values that the observed effect may have and that we cannot further specify due to the effect of chance. As we saw in a previous post, the analysis of the confidence intervals can give us clinically important information, sometimes, although the p is not statistically significant.

#### More arguments

Finally, some detractors of the frequentist method say that the hypothesis test makes decisions without considering information external to the experiment. Again, a misinterpretation of the value of p.

As we already said in a previous post, a value of p <0.05 does not mean that H0 is false, nor that the study is more reliable, or that the result is important (even though the p has six zeros). But, most importantly for what we are discussing now, it is false that the value of p represents the probability that H0 is false (the probability that the effect is real).

Once our results allow us to affirm, with a small margin of error, that the detected effect is real and not random (in other words, when the p is statistically significant), we can calculate the probability that the effect is “real”. And for this, Oh, surprise! we will have to calibrate the value of p with the value of the basal probability of H0, which will be assigned by the researcher based on her knowledge or previous available data (which is still a Bayesian approach).

As you can see, the assessment of the credibility or likelihood of the hypothesis, one of the differentiating characteristics of the Bayesian’s approach, can also be used if we use frequentist methods.

#### We’re leaving…

And here we are going to leave it for today. But before finishing I would like to make a couple of considerations.

First, in Spain we have many great wines throughout our geography, not just Ribera or Rioja. For no one to get offended, I have chosen these two because they are usually the ones asked by the brothers-in-law when they come to have dinner at home.

Second, do not misunderstand me if it may have seemed to you that I am an advocate of frequentist statistics against Bayesian’s. Just as when I go to the supermarket I feel happy to be able to buy wine from various designations of origin, in research methodology I find it very good to have different ways of approaching a problem. If I want to know if my team is going to win a match, it doesn’t seem very practical to repeat the match 200 times to see what average results come out. It  would be better to try to make an inference taking into account the previous results.

And that’s all. We have not gone into depth in what we have commented at the end on the real probability of the effect, somehow mixing both approaches, frequentist’s and Bayesian’s. The easiest way, as we saw in a previous post, is to use a Held’s nomogram. But that is another story…

## A good agreement?

We all know that the less we go to the doctor, the best. And this is so for two reasons. First, because if we go to many doctors we are either physically ill or very mentally sick (some unfortunates are both of them). And second, which is the fact I am always struck by, because every doctor tells you something different. And it’s not that doctors don’t know their job, it’s that getting an agreement is not as simple as it seems.

To give you an idea, the problem starts when wanting to know if two doctor who assess the same diagnostic test have a good degree of agreement. Let’s see an example.

Imagine for a moment than I am the manager of a hospital and I want to hire a pathologist because the only one that works at the hospital is overworked.

I meet with my pathologist and the applicant and give them 795 biopsies to tell me if there’re malignant cells in them. As you can see in the first table, my pathologist finds malignant cells in 99 biopsies, while the applicant sees them in 135 (do not panic, in real life difference couldn’t be so wide, could be?). We wonder what degree of agreement or, rather, concordance exists between the two. The first think that comes to our mind is to calculate the number of biopsies in which they agree: they both agree with 637 normal biopsies and 76 with malignant cells, so the percentages of cases of agreement can be calculated as (637+76)/795=0.896. Hurray!, we think, the two agree almost 90% of the time. The result is not as bad as it seemed to be looking at the table.

But it turns out that when I’m about to hire the new pathologist I wonder if they could have agreed just by chance.

So, a stupid experiment springs to my mind: I take the 795 biopsies and throw a coin, labeling each biopsy as normal if I get heads, or pathological, if tails.

The coin says I have 400 normal biopsies and 395 with malignant cells. If I calculate the concordance between the coin and the pathologist, I see that it values (365+55)/795=0.516, 52%!. This is really amazing, just by chance there’s agreement in half of the cases (yes, yes, I know that those know-it-all of you will be thinking that it’s not a surprise, since 50% is the probability of each possible outcome when tossing a coin). So I start thinking how to save money for my hospital and I come out with another experiment that this time is not only stupid, but totally ridiculous: I offer my cousin to do the test instead of throwing a coin (by this time I’m going to left my brother-in-law alone).

The problem, of course, is that my cousin is not a doctor and, although a nice guy, pathology is not his main topic. So, when he starts to see the colorful cells he thinks it’s impossible that such beauty is produced by malignant cells and gives all the biopsies as normal. When we look at the table with the results the first think that we think if to burn it but, for the sake of curiosity, we calculate the concordance between my cousin and my pathologist and see that it’s 696/795=0.875, 87!. Conclusion: it could be more convenient to me to hire my cousin instead of a new pathologist.

At this stage many of you will think that I forgot to take my medication this morning, but the truth is that all these examples serve to show you that, if we want to know what the agreement between two observers is, we must first get rid of the cumbersome and everlasting effect of chance. And for that, mathematicians have invented a statistic called kappa, the interobserver agreement coefficient.

The function of kappa is to exclude from the observed agreement that part that is due to chance, obtaining a more representative measure of the strength of agreement between observers. Its formula is a ratio in which the numerator is the difference between observed and random difference and which denominator represents the complementary of the random agreement: (Po-Pr) / (1-Pr).

We already know the value of Po with two pathologists: 0.89. To get Pr we have to calculate the theoretical expected values for each cell of the table, in a similar way that we remember we did with chi squared test: the expected value of each cell is the product of the total of its row and column divided by the total of the table. As an example, the expected value of the first cell of our table is (696×660)/795=578. With the expected values we can calculate the probability of agreement due to chance using the same method we used earlier with observed values: (578+17)/795=0.74.

And now we can calculate kappa = (0.89-0.74)/(1-0.74) = 0.57. And what can we conclude of a value of 0.57?. We can do with it whatever we want except multiply it by a hundred, because this values doesn’t represent a true percentage. The value of kappa can range between -1 and 1. Negative values indicate that concordance is worse than that expected by chance. A value of 0 indicates that the agreement is similar than that we could get flipping a coin. Values greater than 0 indicate that concordance is slight (0.01-0.20), fair (0.21-0.40), moderate (0.41-0.60), substantial (0.61-0.80) or almost perfect (0.81-1.00). In our case, there’s a fairly good agreement between the two pathologists. If you are curious, you can calculate the kappa for my cousin and you’ll see that it’s no better than flipping a coin.

Kappa can also be calculated if we have measurements of several observers and more than one result for each observation, but tables get so unfriendly that it is better to use a statistical program to calculate it, and by the way, come up with confidence intervals.

Anyway, do not put much trust in kappa, because it needs not to be greater difference among table’s cells. If a cell has few cases the coefficient will tend to underestimate the actual concordance even if it’s very good.

Finally, say that, although all our examples showed tests with dichotomous result, it’s also possible to calculate interobserver agreement with quantitative results (a rating scale, for instance). Of course, for that we have to use another statistical technique as Bland-Altman’s test, but that’s another story…

## I am Spartacus

I was thinking about the effect size based on mean differences and how to know when that effect is really large and, because of the association of ideas, someone great has come to mind who, sadly, has left us recently. I am referring to Kirk Douglas, that hell of an actor that I will always remember for his roles as a Viking, as Van Gogh or as Spartacus, in the famous scene of the film in which all slaves, in the style of our Spanish’s Fuenteovejuna, stand up and proclaim together to be Spartacus so that Romans cannot do anything to the true one (or to get all equally whacked, much more typical of the modus operandi of the Romans of that time).

You won’t tell me the man wasn’t great. But how great if we compare it with others? How can we measure it? It is clear that not because of the number of Oscars, since that would only serve to measure the prolonged shortsightedness of the so-called academics of the cinema, which took a long time until they awarded him the honorary prize for his entire career. It is not easy to find a parameter that defines the greatness of a character like Issur Danielovitch Demsky, which was the ragman’s son’s name before becoming a legend.

We have it easier to quantify the effect size in our studies, although the truth is that researchers are usually more interested in telling us the statistical significance than in the size of the effect. It is so unusual to calculate it that even many statistical packages forget to have routines to obtain it. In this post, we are going to focus on how to measure the effect size based on differences between means.

Imagine that we want to conduct a trial to compare the effect of a new treatment against placebo and that we are going to measure the result with a quantitative variable X. What we will do is calculate the mean effect between participants in the experimental or intervention group and compare it with the mean of the participants in the control group. Thus, the effect size of the intervention with respect to the placebo will be represented by the magnitude of the difference between the mean in the experimental group and that of the control group:$d=&space;\bar{x}_{e}-\bar{x}_{c}$However, although it is the easiest to calculate, this value does not help us to get an idea of the effect size, since its magnitude will depend on several factors, such as the unit of measure of the variable. Let us think about how the differences change if one mean is twice the other as their values are 1 and 2 or 0.001 and 0.002. In order for this difference to be useful, it is necessary to standardize it, so a man named Gene Glass thought he could do it by dividing it by the standard deviation of the control group. He obtained the well-known Glass’ delta, which is calculated according to the following formula:$\delta&space;=&space;\frac{\bar{x}_{e}-\bar{x}_{c}}{S_{s}}$Now, since what we want is to estimate the value of delta in the population, we will have to calculate the standard deviation using n-1 in the denominator instead of n, since we know that this quasi-variance is a better estimator of the population value of the deviation:$S_{c}=\sqrt{\frac{\sum_{i=1}^{n_{c}}(x_{ic}-\bar{x}_{c})}{n_{c}-1}}$But do not let yourselves be impressed by delta, it is not more than a Z score (those obtained by subtracting to the value its mean and dividing it by the standard deviation): each unit of the delta value is equivalent to one standard deviation, so it represents the standardized difference in the effect that occurs between the two groups due to the effect of the intervention. This value allows us to estimate the percentage of superiority of the effect by calculating the area under the curve of the standard normal distribution N(0,1) for a specific delta value (equivalent to the standard deviation). For example, we can calculate the area that corresponds to a delta value = 1.3. Nothing is simpler than using a table of values of the standard normal distribution or, even better, the pnorm() function of R, which returns the value 0.90. This means that the effect in the intervention group exceeds the effect in the control group by 90%.

The problem with Glass’ delta is that the difference in means depends on the variability between the two groups, which makes it sensitive to these variance differences. If the variances of the two groups are very different, the delta value may be biased. That is why one Larry Vernon Hedges wanted to contribute with his own letter to this particular alphabet and decided to do the calculation of Glass in a similar way, but using a unified variance that does not assume their equality, according to the following formula:$S_{u}=\sqrt{\frac{(n_{e}-1)S_{e}^{2}+(n_{c}-1)S_{c}^{2}}{n_{e}+n_{c}-2}}$If we substitute the variance of the control group of the Glass’ delta formula with this unified variance we will obtain the so-called Hedges’ g. The advantage of using this unified standard deviation is that it takes into account the variances and sizes of the two groups, so g has less risk of bias than delta when we cannot assume equal variances between the two groups.

However, both delta and g have a positive bias, which means that they tend to overestimate the effect size. To avoid this, Hedges modified the calculation of his parameter in order to obtain an adjusted g, according to the following formula:$g_{a}=g\left&space;(&space;1-\frac{3}{4df-9}&space;\right&space;)$where df are the degrees of freedom, which are calculated as ne + nc.

This correction is more needed with small samples (few degrees of freedom). It is logical, if we look at the formula, the more degrees of freedom, the less necessary it will be to correct the bias.

So far, we have tried to solve the problem of calculating an estimator of the effect size that is not biased by the lack of equal variances. The point is that, in the rigid and controlled world of clinical trials, it is usual that we can assume the equality of variances between the groups of the two branches of the study. We might think, then, that if this is true, it would not be necessary to resort to the trick of n-1.

Well, Jacob Cohen thought the same, so he devised his own parameter, Cohen’s d. This Cohen’s d is similar to Hedges’ g, but still more sensitive to inequality of variances, so we will only use it when we can assume the equality of variances between the two groups. Its calculation is identical to that of the Hedges’ g, but using n instead of n-1 to obtain the unified variance.

As a rough-and-ready rule, we can say that the effect size is small for d = 0.2, medium for d = 0.5, large for d = 0.8 and very large for d = 1.20. In addition, we can establish a relationship between d and the (r), which is also a widely used measure to estimate the effect size.

The correlation coefficient measures the relationship between an independent binary variable (intervention or control) and a numerical dependent variable (our X). The great advantage of this measure is that it is easier to interpret than the parameters we have seen so far, which all function as standardized Z scores. We already know that r can range from -1 to 1 and the meaning of these values.

Thus, if you want to calculate r given d, you only have to apply the following formula:$r=\frac{d}{\sqrt{d^{2}+\left&space;(&space;\frac{1}{pq}&space;\right&space;)}}$where p and q are the proportions of subjects in the experimental and control groups (p = ne / n and q = nc / n). In general, the larger the effect size, the greater r and vice versa (although it must be taken into account that r is also smaller as the difference between p and q increases). However, the factor that most determines the value of r is the value of d.

And with this we will end for today. Do not believe that we have discussed all the measures of this family. There are about a hundred parameters to estimate the effect size, such as the determination coefficient, eta-square, chi-square, etc., even others that Cohen himself invented (not very happy with only d), such as f-square or Cohen’s q. But that is another story…

## Columns, sectors, and an illustrious Italian

When you read the title of this post, you can ask yourself with what stupid occurrence am I going to crush the suffered concurrence today, but do not fear, all we are going to do is to put in prospective value that famous aphorism that says that a picture is worth a thousand words. Have I clarified something? I suppose not.

As we all know, descriptive statistics is that branch of statistics that we usually use to obtain a first approximation to the results of our study, once we have finished it.

The first thing we do is to describe the data, for which we make frequency tables and use various measures of tendency and dispersion. The problem with these parameters is that, although they truly represent the essence of the data, it is sometimes difficult to provide a synthetic and comprehensive view with them. It is in these cases that we can resort to another resource, which is none other than the graphic representation of the study results. You know, a picture is worth a thousand words, or so they say.

There are many types of graphs to help us better understand the data, but today we are only going to talk about those that have to do with qualitative or categorical variables.

Remember that qualitative variables represent attributes or categories of the variable. When the variable does not include any sense of order, it is said to be a nominal categorical variable, while if a certain order can be established between the categories, we would say that it is an ordinal categorical variable. For example, the variable “smoker” would be nominal if it has two possibilities: “yes” or “no”. However, if we define it as “occasional”, “little smoker”, “moderate” or “heavy smoker”, there is already a certain hierarchy and we speak of ordinal qualitative variable.

The first type of chart that we are going to consider when representing a qualitative variable is the pie chart. This consists of a circle whose area represents the total data. Thus, an area that will be directly proportional to its frequency is assigned to each category. In this way, the most frequent categories will have larger areas, so that we can get an idea of how the frequencies are distributed in the categories at a glance.

There are three ways to calculate the area of each sector. The simplest is to multiply the relative frequency of each category by 360 °, obtaining the degrees of that sector.

The second is to use the absolute frequency of the category, according to the following rule of three:

Absolute frequency / Total data frequency = Degrees of the sector / 360 °

Finally, the third way is to use the proportions or percentages of the categories:

% of the category / 100% = Degrees of the sector / 360 °

The formulas are very simple, but, in any case, there will be no need to resort to them because the program with which we make the graph will do it for us. The instruction in R is pie(), as you can see in the first figure, in which I show you a distribution of children with exanthematic diseases and how the pie chart would be represented.The pie chart is designed to represent nominal categorical variables, although it is not uncommon to see pies representing variables of other types. However, and in my humble opinion, this is not entirely correct.

For example, if we make a pie chart for an ordinal qualitative variable, we will be losing information about the hierarchy of the variables, so it would be more correct to use a chart that allows to sort the categories from less to more. And this chart is none other than the bar chart, which we’ll talk about next.

The pie chart will be especially useful when there are few categories of the variable. If there are many, the interpretation is no longer so intuitive, although we can always complete the graph with a frequency table that helps us to better interpret the data. Another tip is to be very careful with 3D effects when drawing cakes. If we go from elaborate, the graphic will lose clarity and will be more difficult to read.

The second graph that we are going to see is, as we have already mentioned, the bar chart, the optimum to represent ordinal qualitative variables. On the horizontal axis, the different categories are represented, and on it some columns or bars are raised whose height is proportional to the frequency of each category. We could also use this type of graph to represent discrete quantitative variables, but what is not very correct to do is use it for the qualitative nominal variables.

The bar chart is able to express the magnitude of the differences between the categories of the variable, but it is precisely its weak point, since it is easily manipulated if we modify the axes’ scales. That is why we must be careful when analyzing this type of graphics to avoid being deceived by the message that the author of the study may want to convey.

This chart is also easy to do with most statistical programs and spreadsheets. The function in R is barplot(), as you can see in the second figure, which represents a sample of asthmatic children classified by severity.

With what has been seen so far, some will think that the title of this post is a bit misleading. Actually, the thing is not about columns and sectors, but about bars and pies. Also, who is the illustrious Italian? Well, here I do not fool anyone, because the character was both Italian and illustrious, and I am referring to Vilfredo Federico Pareto.

Pareto was an Italian who was born in the mid-19th century in Paris. This small contradiction is due to the fact that his father was then exiled in France for being one of the followers of Giuseppe Mazzini, who was then committed to Italian unification. Anyway, Pareto lived in Italy from he was 10 years old on, becoming an engineer with extensive mathematical and humanistic knowledge and who contributed decisively to the development of microeconomics. He spoke and wrote fluently in French, English, Italian, Latin and Greek, and became famous for a multitude of contributions such as the Pareto’s distribution, Pareto’s efficiency, Pareto’s index and Pareto’s principle. To represent the latter, he invented the Pareto’s diagram, which is what brings him here today among us.

Pareto chart (also known in economics as a closed curve or A-B-C distribution) organizes the data in descending order from left to right, represented by bars, thus assigning an order of priorities. In addition, the diagram incorporates a curved line that represents the cumulative frequency of the categories of the variable. This initially allowed the Pareto’s principle to be explained, which goes on to say that there are many minor problems compared to a few that are important, which was very useful for decision-making.

As it is easy to understand, this prioritization makes the Pareto diagram especially useful for representing ordinal qualitative variables, surpassing the bar chart by giving information on the percentage accumulated by adding the categories of the distribution of the variable. The change in slope of this curve also informs us of the change in the concentration of data, which depends on the variability in which the subjects of the sample are divided between the different categories.

Unfortunately, R does not have a simple function to represent Pareto diagrams, but we can easily obtain it with the script that I attached in the third figure, obtaining the graph of the fourth.

And here we are going to leave it for today. Before saying goodbye, I want to warn you that you should not confuse the bars of the bar chart with those of the histogram since, although they can be similar from the graphic point of view, both represent very different things. In a bar chart only the values of the variables we have observed when doing the study are represented. However, the histogram goes much further since, in reality, it contains the frequency distribution of the variable, so it represents all possible values that exist within the intervals, although we have not observed any directly. It allows us to calculate the probability that any distribution value will be represented, which is of great importance if we want to make inference and estimate population values based on the results of our sample. But that is another story…

## Like a forgotten clock

I don’t like the end of summer. The days with bad weather begin, I wake up completely in the dark and in the evening it gets dark early and early. And, as if this were not bad enough, the cumbersome moment of change between summer and winter time is approaching.

In addition to the inconvenience of the change and the tedium of being two or three days remembering what time it is and what it could be if it had not been any change, we must proceed to adjust a lot of clocks manually. And, no matter how much you try to change them all, you always leave some with the old hour. It does not happen to you with the kitchen clock, at which you always look to know how fast you have to have breakfast, or with the one in the car, which stares at you every morning. But surely there are some that you do not change. Even, it has ever happened to me, that I realize it when the next time to change I see that I don’t need to do it because I left it unchanged in the previous time.

These forgotten clocks remind me a little of categorical or qualitative variables.

You will think that, once again, I forgot to take my pill this morning, but no. Everything has its reasoning. When we finish a study and we already have the results, the first thing we do is a description of them and then go on to do all kinds of contrasts, if applicable.

Well, qualitative variables are always belittled when we apply our knowledge of descriptive statistics. We usually limit ourselves to classifying them and making frequency tables with which to calculate some indices as their relative or accumulated frequency, to give some representative measure such as mode and little else. We use to work a little more with its graphic representation with bar or sector diagrams, pictograms and other similar inventions. And finally, we apply a little more effort when we relate two qualitative variables through a contingency table.

However, we forget their variability, something we would never do with a quantitative variable. The quantitative variables are like that kitchen wall clock that looks us straight in the eye every morning and does not allow us to leave it out of time. Therefore, we use these concepts we understand very well as the mean and variance or standard deviation. But that we do not know how to objectively measure the variability of qualitative or categorical variables, whether nominal or ordinal, does not mean that it does not exist a way to do it. For this purpose, several diversity indexes have been developed, which some authors distinguish as dispersion, variability and disparity indexes. Let’s see some of them, whose formulas you can see in the attached box, so you can enjoy the beauty of mathematical language.

The two best known indexes used to measure the variability or diversity are the Blau’s index (or of Hirschman- Herfindal’s) and the entropy index (or Teachman’s). Both have a very similar meaning and, in fact, are linearly correlated.

Blau’s index quantifies the probability that two individuals chosen at random from a population are in different categories of a variable (provided that the population size is infinite or the sampling is performed with replacement). Its minimum value, zero, would indicate that all members are in the same category, so there would be no variety. The higher its value, the more dispersed among the different categories of the variable will be the components of the group. This maximum value is reached when the components are distributed equally among all categories (their relative frequencies are equal). Its maximum value would be (k-1) / k, which is a function of k (the number of categories of the qualitative variable) and not of the population size. This value tends to 1 as the number of categories increases (to put it more correctly, when k tends to infinity).

Let’s look at some examples to clarify it a bit. If you look at the Blau’s index formula, the value of the sum of the squares of the relative frequencies in a totally homogeneous population will be 1, so the index will be 0. There will only be one category with frequency 1 (100%) and the rest with zero frequency.

As we have said, although the subjects are distributed similarly in all categories, the index increases as the number of categories increases. For example, if there are four categories with a frequency of 0.25, the index will be 0.75 (1 – (4 x 0.252)). If there are five categories with a frequency of 0.2, the index will be 0.8 (1 – (5 x 0.22). And so on.

As a practical example, imagine a disease in which there is diversity from the genetic point of view. In a city A, 85% of patients has genotype 1 and 15% genotype 2. The Blau’s index values 1 – (0.85+ 0.152) = 0.255. In view of this result, we can say that, although it is not homogeneous, the degree of heterogeneity is not very high.

Now imagine a city B with 60% of genotype 1, 25% of genotype 2 and 15% of genotype 3. The Blau’s index will be 1 – (0.6x 0.252 x 0.152) = 0.555. Clearly, the degree of heterogeneity is greater among the patients of city B than among those of A. The smartest of you will tell me that that was already clear without calculating the index, but you have to take into account that I chose a very simple example for not giving my all calculating. In real-life, more complex studies, it is not usually so obvious and, in any case, it is always more objective to quantify the measure than to remain with our subjective impression.

This index could also be used to compare the diversity of two different variables (as long as it makes sense to do so) but, the fact that its maximum value depends on the number of categories of the variable, and not on the size of the sample or population, questions its usefulness to compare the diversity of variables with different number of categories. To avoid this problem, the Blau’s index can be normalized by dividing it by its maximum, thus obtaining the qualitative variation index. Its meaning is, of course, the same as that of the Blau’s index and its value ranges between 0 and 1. Thus, we can use either one if we compare the diversity of two variables with the same number of categories, but it will be more correct to use the qualitative variation index if the variables have a different number of categories.

The other index, somewhat less famous, is the Teachman’s index or entropy index , whose formula is also attached. Very briefly we will say that its minimum value, which is zero, indicates that there are no differences between the components in the variable of interest (the population is homogeneous). Its maximum value can be estimated as the negative value of the neperian logarithm of the inverse of the number of categories (- ln ( 1 / k)) and is reached when all categories have the same relative frequency (entropy reaches its maximum value). As you can see, very similar to Blau’s, which is much easier to calculate than Teachman’s.

To end this entry, the third index that I want to talk about today tells us, more than about the variability of the population, about the dispersion that its components have regarding the most frequent value. This can be measured by the variation ratio, which indicates the degree to which the observed values ​​do not coincide with that of mode, which is the most frequent category. As with the previous ones, I also show the formula in the attached box.

In order not to clash with the previous ones, its minimum value is also zero and is obtained when all cases coincide with the mode. The lower the value, the less the dispersion. The lower the absolute frequency of the mode, the closer it will be to 1, the value that indicates maximum dispersion. I think this index is very simple, so we are not going to devote more attention to it.

And we have reached the end of this post. I hope that from now on we will pay more attention to the descriptive analysis of the results of the qualitative variables. Of course, it would be necessary to complete it with an adequate graphic description using the well-known bar or sector diagrams (the pies) and others less known as the Pareto’s diagrams. But that is another story…

## Worshipped, but misunderstood

Statistics wears most of us who call ourselves “clinicians” out. The knowledge on the subject acquired during our formative years has long lived in the foggy world of oblivion. We vaguely remember terms such as probability distribution, hypothesis contrast, analysis of variance, regression … It is for this reason that we are always a bit apprehensive when we come to the methods section of scientific articles, in which all these techniques are detailed that, although they are known to us, we do not know with enough depth to correctly interpret their results.

Fortunately, Providence has given us a lifebelt: our beloved and worshipped p. Who has not felt lost with a cumbersome description of mathematical methods to finally breathe a sigh of relieve when finding the value of p? Especially if the p is small and has many zeros.

The problem with p is that, although it is unanimously worshipped, it is also mostly misunderstood. Its value is, very often, misinterpreted. And this is so because many of us harbor misconceptions about what the p-value really means.

Let’s try to clarify it.

Whenever we want to know something about a variable, the effect of an exposure, the comparison of two treatments, etc., we will face the ubiquity of random: it is everywhere and we can never get rid of it, although we can try to limit it and, of course, try to measure its effect.

Let’s give an example to understand it better. Suppose we are doing a clinical trial to compare the effect of two diets, A and B, on weight gain in two groups of participants. Simplifying, the trial will have one of three outcomes: those of diet A gain more weight, those of diet B gain more weight, both groups gain equal weight (there could even be a fourth: both groups lose weight). In any case, we will always obtain a different result, just by chance (even if the two diets are the same).

Imagine that those in diet A put on 2 kg and those in diet B, 3 kg. Is it more fattening the effect of diet B or is the difference due to chance (chosen samples, biological variability, inaccuracy of measurements, etc.)? This is where our hypothesis contrast comes in.

When we are going to do the test, we start from the hypothesis of equality, of no difference in effect (the two diets induce the same increment of weight). This is what we call the null hypothesis (H0) that, I repeat it to keep it clear, we assume that it is the real one. If the variable we are measuring follows a known probability distribution (normal, chi-square, Student’s t, etc.), we can calculate the probability of presenting each of the values of the distribution. In other words, we can calculate the probability of obtaining a result as different from equality as we have obtained, always under the assumption of H0.

That is the p-value: the probability that the difference in the result observed is due to chance. By agreement, if that probability is less than 5% (0.05) it will seem unlikely that the difference is due to chance and we will reject H0, the equality hypothesis, accepting the alternative hypothesis (Ha) that, in this example, will say that one diet better than the other. On the other hand, if the probability is greater than 5%, we will not feel confident enough to affirm that the difference is not due to chance, so we DO NOT reject H0 and we keep with the hypothesis of equal effects: the two diets are similar.

Keep in mind that we always move in the realm of probability. If p is less than 0.05 (statistically significant), we will reject H0, but always with a probability of committing a type 1 error: take for granted an effect that, in reality, does not exist (a false positive). On the other hand, if p is greater than 0.05, we keep with H0 and we say that there is no difference in effect, but always with a probability of committing a type 2 error: not detecting an effect that actually exists (false negative).

We can see, therefore, that the value of p is somewhat simple from the conceptual point of view. However, there are a number of common errors about what p-value represents or does not represent. Let’s try to clarify them.

It is false that a p-value less than 0.05 means that the null hypothesis is false and a p-value greater than 0.05 that the null hypothesis is true. As we have already mentioned, the approach is always probabilistic. The p <0.05 only means that, by agreement, it is unlikely that H0 is true, so we reject it, although always with a small probability of being wrong. On the other hand, if p> 0.05, it is also not guaranteed that H0 is true, since there may be a real effect that the study does not have sufficient power to detect.

At this point we must emphasize one fact: the null hypothesis is only falsifiable. This means that we can only reject it (with which we keep with Ha, with a probability of error), but we can never affirm that it is true. If p> 0.05 we cannot reject it, so we will remain in the initial assumption of equality of effect, which we cannot demonstrate in a positive way.

It is false that p-value is related to the reliability of the study. We can think that the conclusions of the study will be more reliable the lower the value of p, but it is not true either. Actually, the p-value is the probability of obtaining a similar value by chance if we repeat the experiment in the same conditions and it not only depends on whether the effect we want to demonstrate exists or not. There are other factors that can influence the magnitude of the p-value: the sample size, the effect size, the variance of the measured variable, the probability distribution used, etc.

It is false that p-value indicates the relevance of the result. As we have already repeated several times, p-value is only the probability that the difference observed is due to chance. A statistically significant difference does not necessarily have to be clinically relevant. Clinical relevance is established by the researcher and it is possible to find results with a very small p that are not relevant from the clinical point of view and vice versa, insignificant values that are clinically relevant.

It is false that p-value represents the probability that the null hypothesis is true. This belief is why, sometimes, we look for the exact value of p and do not settle for knowing only if it is greater or less than 0.05. The fault of this error of concept is a misinterpretation of conditional probability. We are interested in knowing what is the probability that H0 is true once we have obtained some results with our test. Mathematically expressed, we want to know P (H0 | results). However, the value of p gives us the probability of obtaining our results under the assumption that the null hypothesis is true, that is, P (results | H0).

Therefore, if we interpret that the probability that H0 is true in view of our results (P (H0 | results)) is equal to the value of p (P (results | H0)) we will be falling into an inverse fallacy or transposition of conditionals fallacy.

In fact, the probability that H0 is true does not depend only on the results of the study, but is also influenced by the previous probability that was estimated before the study, which is a measure of the subjective belief that reflects its plausibility, generally based on previous studies and knowledge. Let’s think we want to contrast an effect that we believe is very unlikely to be true. We will value with caution a p-value <0.05, even being significant. On the contrary, if we are convinced that the effect exists, will be settle for with little demands of p-value.

In summary, to calculate the probability that the effect is real we must calibrate the p-value with the value of the baseline probability of H0, which will be assigned by the researcher or by previously available data. There are mathematical methods to calculate this probability based on its baseline probability and the p-value, but the simplest way is to use a graphical tool, the Held’s nomogram, which you can see in the figure.

To use the Held’s nomogram we just have to draw a line from the previous H0 probability that we consider to the p-value and extend it to see what posterior probability value we reach. As an example, we have represented a study with a p-value = 0.03 in which we believe that the probability of H0 is 20% (we believe there is 80% that the effect is real). If we extend the line it will tell us that the minimum probability of H0 is 6%: there is a 94% probability that the effect is real. On the other hand, think of another study with the same p-value but in which we think that the probability of the effect is lower, for example, of 20% (the probability of H0 is 80%). For the same value of p, the minimum posterior probability of H0 is 50%, then there is 50% that the effect is real. As we can see, the posterior probability changes according to the previous probability.

And here we will end for today. We have seen how p-value only gives us an idea of the role that chance may have had in our results and that, in addition, may depend on other factors, perhaps the most important the sample size. The conclusion is that, in many cases, the p-value is a parameter that allows to assess in a very limited way the relevance of the results of a study. To do it better, it is preferable to resort to the use of confidence intervals, which will allow us to assess clinical relevance and statistical significance. But that is another story…

## The cheaters detector

When we think about inventions and inventors, the name of Thomas Alva Edison, known among his friends as the Wizard of Menlo Park, comes to most of us. This gentleman created more than a thousand inventions, some of which can be said to have changed the world. Among them we can name the incandescent bulb, the phonograph, the kinetoscope, the polygraph, the quadruplex telegraph, etc., etc., etc. But perhaps its great merit is not to have invented all these things, but to apply methods of chain production and teamwork to the research process, favoring the dissemination of their inventions and the creation of the first industrial research laboratory.

But in spite of all his genius and excellence, Edison failed to go on to invent something that would have been as useful as the light bulb: a cheaters detector. The explanation for this pitfall is twofold: he lived between the nineteenth and twentieth centuries and did not read articles about medicine. If he had lived in our time and had to read medical literature, I have no doubt that the Wizard of Menlo Park would have realized the usefulness of this invention and would have pull his socks up.

And it is not that I am especially negative today, the problem is that, as Altman said more than 15 years ago, the material sent to medical journals is defective from the methodological point of view in a very high percentage of cases. It’s sad, but the most appropriate place to store many of the published studies is the rubbish can.

In most cases the cause is probably the ignorance of those who write. “We are clinicians”, we say, so we leave aside the methodological aspects, of which we have a knowledge, in general, quite deficient. To fix it, journal editors send our studies to other colleagues, who are more or less like us. “We are clinicians”, they say, so all our mistakes go unnoticed to them.

Although this is, in itself, serious, it can be remedied by studying. But it is an even more serious fact that, sometimes, these errors can be intentional with the aim of inducing the reader to reach a certain conclusion after reading the article. The remedy for this problem is to make a critical appraisal of the study, paying attention to its internal validity. In this sense, perhaps the most difficult aspect to assess for the clinician without methodological training is that related to the statistics used to analyze the results of the study. It is in this, undoubtedly, that most can be taken advantage of our ignorance using methods that provide more striking results, instead of the right methods.

As I know that you are not going to be willing to do a master’s degree in biostatistics, waiting for someone to invent the cheaters detector, we are going to give a series of clues so that non-expert readers can suspect the existence of these cheats.

The first may seem obvious, but it is not: has a statistical method been used? Although it is exceptionally rare, there may be authors who do not consider using any. I remember a medical congress that I could attend in which the values of a variable were exposed throughout the study that, first, went up and then went down, which allowed the speaker to conclude that the result was not “on the blink”. As it is logical and evident, any comparison must be made with the proper hypotheses contrast and the level of significance and the statistical test used have to be specified. Otherwise, the conclusions will lack any validity.

A key aspect of any study, especially those with an intervention, is the previous calculation of the necessary sample size. The investigator must define the clinically relevant effect that he wants to be able to detect with his study and then calculate what sample size will provide the study with enough power to prove it. The sample of a study is not large or small, but sufficient or insufficient. If the sample is not sufficient, an existing effect may not be detected due to lack of power (type 2 error). On the other hand, a larger sample than necessary may show an effect that is not relevant from the clinical point of view as statistically significant. Here are two very common cheats. First, the study that does not reach significance and its authors say it is due to lack of power (insufficient sample size), but do not make any effort to calculate the power, which can always be done a posteriori. In that case, we can calculate it using statistical programs or any of the calculators available on the internet, such as GRANMO. Second, the sample size is increased until the difference observed is significant, finding the desired p <0.05. This case is simpler: we only have to assess whether the effect found is relevant from the clinical point of view. I advise you to practice and compare the necessary sample sizes of the studies with those defined by the authors. Maybe you’ll have some surprise.

Once the participants have been selected, a fundamental aspect is that of the homogeneity of the basal groups. This is especially important in the case of clinical trials: if we want to be sure that the observed difference in effect between the two groups is due to the intervention, the two groups should be the same in everything, except in the intervention.

For this we will look at the classic table I of the trial publication. Here we have to say that, if we have distributed the participants at random between the two groups, any difference between them will be due, one way or another, to random. Do not be fooled by the p, remember that the sample size is calculated for the clinically relevant magnitude of the main variable, not for the baseline characteristics of the two groups. If you see any difference and it seems clinically relevant, it will be necessary to verify that the authors have taken into account their influence on the results of the study and have made the appropriate adjustment during the analysis phase.

The next point is that of randomization. This is a fundamental part of any clinical trial, so it must be clearly defined how it was done. Here I have to tell you that chance is capricious and has many vices, but rarely produces groups of equal size. Think for a moment if you flip a coin 100 times. Although the probability of getting heads in each throw is 50%, it will be very rare that by throwing 100 times you will get exactly 50 heads. The greater the number of participants, the more suspicious it should seem to us that the two groups are equal. But beware, this only applies to simple randomization. There are methods of randomization in which groups can be more balanced.

Another hot spot is the misuse that can sometimes be made with qualitative variables. Although qualitative variables can be coded with numbers, be very careful with doing arithmetic operations with them. Probably it will not make any sense. Another cheat that we can find has to do with the fact of categorizing a continuous variable. Passing a continuous variable to a qualitative one usually leads to loss of information, so it must have a clear clinical meaning. Otherwise, we can suspect that the reason is the search for a p value less than 0.05, always easier to achieve with the qualitative variable.

Going into the analysis of the data, we must check that the authors have followed the a priori designed protocol of the study. Always be wary of post hoc studies that were not planned from the beginning. If we look for enough, we will always find a group that behaves as we want. As it is said, if you torture the data long enough, it will confess to anything.

Another unacceptable behavior is to finish the study ahead of time for good results. Once again, if the duration of the follow-up has been established during the design phase as the best time to detect the effect, this must be respected. Any violation of the protocol must be more than justified. Logically, it is ethical to finish the study ahead of time due to security reasons, but it will be necessary to take into account how this fact affects the evaluation of the results.

Before performing the analysis of the results, the authors of any study have to debug their data, reviewing the quality and integrity of the values collected. In this sense, one of the aspects to pay attention to is the management of outliers. These are the values that are far from the central values of the distribution. In many occasions they can be due to errors in the calculation, measurement or transcription of the value of the variable, but they can also be real values that are due to the special idiosyncrasy of the variable. The problem is that there is a tendency to eliminate them from the analysis even when there is no certainty that they are due to an error. The correct thing to do is to take them into account when doing the analysis and use, if necessary, robust statistical methods that allow these deviations to be adjusted.

Finally, the aspect that can be more strenuous to those not very expert in statistics is knowing if the correct statistical method has been used. A frequent error is the use of parametric tests without previously checking if the necessary requirements are met. This can be done by ignorance or to obtain statistical significance, since parametric tests are less demanding in this regard. To understand each other, the p-value will be smaller than if we use the equivalent non-parametric test.

Also, with certain frequency, other requirements needed to be able to apply a certain contrast test are ignored. As an example, in order to perform a Student’s t test or an ANOVA, homoscedasticity (a very ugly word that means that the variances are equal) must be checked, and that check is overlooked in many studies. The same happens with regression models that, frequently, are not accompanied by the mandatory diagnosis of the model that allows and justify its use.

Another issue in which there may be cheating is that of multiple comparisons. For example, when the ANOVA reaches significant, the meaning is that there are at least two means that are different, but we do not know which, so we start comparing them two by two. The problem is that when we make repeated comparisons the probability of type I error increases, that is, the probability of finding significant differences only by chance. This may allow finding, if only by chance, a p <0.05, what improves the appearance of the study (especially if you spent a lot of time and / or money doing it). In these cases, the authors must use some of the available corrections (such as Bonferroni’s, one of the simplest) so that the global alpha remains below 0.05. The price to pay is simple: the p-value has to be much smaller to be significant. When we see multiple comparisons without a correction, it will only have two explanations: the ignorance of the one who made the analysis or the attempt to find a statistical significance that, probably, would not support the decrease in p-value that the correction would entail.

Another frequent victim of misuse of statistics is the Pearson’s correlation coefficient, which is used for almost everything. The correlation, as such, tells us if two variables are related, but does not tell us anything about the causality of one variable for the production of the other. Another misuse is to use the correlation coefficient to compare the results obtained by two observers, when probably what should be used in this case is the intraclass correlation coefficient (for continuous variables) or the kappa index (for dichotomous qualitative variables). Finally, it is also incorrect to compare two measurement methods (for example, capillary and venous glycaemia) by correlation or linear regression. For these cases the correct thing would be to use the Passing-Bablok’s regression.

Another situation in which a paranoid mind like mine would suspect is one in which the statistical method employed is not known by the smartest people in the place. Whenever there is a better known (and often simpler) way to do the analysis, we must ask ourselves why they have used such a weird method. In these cases, we will require the authors to justify their choice and provide a reference where we can review the method. In statistics, you have to try to choose the right technique for each occasion and not the one that gives us the most appealing result.

In any of the previous contrast tests, the authors usually use a level of significance for p <0.05, as usual, but the contrast can be done with one or two tails. When we do a trial to try a new drug, what we expect is that it works better than the placebo or the drug with which we are comparing it. However, two other situations can occur that we cannot disdain: that it works the same or, even, that it works worse. A bilateral contrast (with two tails) does not assume the direction of the effect, since it calculates the probability of obtaining a difference equal to or greater than that observed, in both directions. If the researcher is very sure of the direction of the effect, he can make a unilateral contrast (with one tail), measuring the probability of the result in the direction considered. The problem is when he does it for another reason: the p-value of a bilateral contrast is twice as large as that of the unilateral contrast, so it will be easier to achieve statistical significance with the unilateral contrast. The wrong thing is to do the unilateral contrast for that reason. The correct thing, unless there are well-justified reasons, is to make a bilateral contrast.

To go finishing this tricky post, we will say a few words about the use of appropriate measures to present the results. There are many ways to make up the truth without getting to lie and, although basically all say the same, the appearance can be very different depending on how we say it. The most typical example is to use relative risk measures instead of absolute and impact measures. Whenever we see a clinical trial, we must demand that authors provide the absolute risk reduction and the number needed to treat (NNT). The relative risk reduction gives a greater number than the absolute, so it will seem that the impact is greater. Given that the absolute measures are easier to calculate and are obtained from the same data as the relative ones, we should be suspicious if the authors do not offer them to us: perhaps the effect is not as important as they are trying to make us see.

Another example is the use of odds ratio versus risk ratio (when both can be calculated). The odds ratio tends to magnify the association between the variables, so its unjustified use can also make us to be suspicious. If you can, calculate the risk ratio and compare the two measures.

Likewise, we will suspect of studies of diagnostic tests that do not provide us with the likelihood ratios and are limited to sensitivity, specificity and predictive values. Predictive values can be high if the prevalence of the disease in the study population is high, but it would not be applicable to populations with a lower proportion of patients. This is avoided with the use of likelihood ratios. We should always ask ourselves the reason that the authors may have had to obviate the most valid parameter to calibrate the power of a diagnostic test.

And finally, be very careful with the graphics representations of results: here the possibilities of making up the truth are only limited by our imagination. You have to look at the units used and try to extract the information from the graph beyond what it might seem to represent at first glance.

And here we leave the topic for today. We have not spoken in detail about another of the most misunderstood and manipulated entities, which is none other than our p. Many meanings are attributed to p, usually erroneously, as the probability that the null hypothesis is true, probability that has its specific method to make an estimate. But that is another story…

## Pairing

You will all know the case of someone who, after carrying out a study and collecting several million variables, addressed the statistician of his workplace and, demonstrating in a reliable way his clarity of ideas regarding his work, he said: please (You have to be educated), crosscheck everything with everything, to see what comes out.

At this point, several things can happen to you. If the statistician is an unscrupulous soulmate, he will give you a half smile and tell you to come back after a few days. Then, you will be provided with several hundred sheets with graphics, tables and numbers with which you will not know what to do. Another thing that can happen to you is to send to hell, tired as she will be to have similar requests made.

But you can be lucky and find a competent and patient statistician who, in a self-sacrificing way, will explain to you that the thing should not work like that. The logical thing is that you, before collecting any data, have prepared a report of the project in which it is planned, among other things, what is to be analyzed and what variables must be crossed between them. She can even suggest you that, if the analysis is not very complicated, you can try to do it yourself.

The latter may seem like the delirium of a mind disturbed by mathematics but, if you think about it for a moment, it is not such a bad idea. If we do the analysis, at least the preliminary, of our results, it can help us to better understand the study. Also, who can know what we want better than ourselves?

With the current statistical packages, the simplest bivariate statistics can be within our reach. We only have to be careful in choosing the right hypothesis test, for which we must take into account three aspects: the type of variables that we want to compare, if the data are paired or independent and if we have to use parametric or non-parametric tests. Let’s see these three aspects.

Regarding the type of variables, there are multiple denominations according to the classification or the statistical package that we use but, simplifying, we will say that there are three types of variables. First, there are the continuous variables. As the name suggests, they collect the value of a continuous variable such as weight, height, blood glucose concentration, etc. Second, there are the nominal variables, which consist of two or more categories that are mutually excluding. For example, the variable “hair color” can have the categories “brown”, “blonde” and “red hair”. When these variables have two categories, we call them dichotomous (yes / no, alive / dead, etc.). Finally, when the categories are ordered by rank, we speak of ordinal variables: ” do not smoke “, ” smoke little “, ” smoke moderately “, ” smoke a lot “. Although they can sometimes use numbers, they indicate the position of the categories within the series, without implying, for example, that the distance from category 1 to 2 is the same as that from 2 to 3. For example, we can classify vesicoureteral reflux in grades I, II, III and IV (having a degree IV is more than a II, but it does not mean that you have twice as much reflux).

Knowing what kind of variable we are dealing with is simple. If we doubt, we can follow the following reasoning based on the answer to two questions:

1. Does the variable have infinite theoretical values? Here we have to do a bit of abstraction and think about what “theoretical values” really means. For example, if we measure the weight of the subjects of the study, theoretical values ​​will be infinite although, in practice, this will be limited by the precision of our scale. If the answer to this first question is “yes” we will be before a continuous variable. If it is not, we move on to the next question.
2. Are the values ​​sorted in some kind of rank? If the answer is “yes”, we will be dealing with an ordinal variable. If the answer is “no”, we will have a nominal variable.

The second aspect is that of paired or independent measures. Two measures are paired when a variable is measured twice after having applied some change, usually in the same subject. For example: blood pressure before and after a stress test, weight before and after a nutritional intervention, etc. On the other hand, independent measures are those that are not related to each other (they are different variables): weight, height, gender, age, etc.

Finally, we mentioned the possibility of using parametric or non-parametric tests. We are not going to go into detail now, but in order to use a parametric test the variable must fulfill a series of characteristics, such as following a normal distribution, having a certain sample size, etc. In addition, there are techniques that are more robust than others when it comes to having to meet these conditions. When in doubt, it is preferable to use non-parametric techniques unnecessarily (the only problem is that it is more difficult to achieve statistical significance, but the contrast is just as valid) than using a parametric test when the necessary requirements are not met.

Once we have already answered these three aspects, we can only make the pairs of variables that we are going to compare and choose the appropriate statistical test. You can see it summarized in the attached table.The type of independent variable is represented in the rows, which is the one whose value does not depend on another variable (it is usually on the x axis of the graphic representations) and which is usually the one that we modified in the study to see the effect on another variable (the dependent). In the columns, on the other hand, we have the dependent variable, which is the one whose value is modified with the changes of the independent variable. Anyway, do get muddled: the statistical software will make the hypothesis contrast without taking into account which is the dependent and which the independent, only taking into account the types of variables.

The table is self-explanatory, so we will not give it much time. For example, if we have measured blood pressure (contiuous variable) and we want to know if there are differences between men and women (gender, nominal dichotomous variable), the appropriate test will be Student’s t test for independent samples. If we wanted to see if there is a difference in pressure before and after a treatment, we would use the same Student’s t test but for paired samples.

Another example: if we want to know if there are significant differences in the color of hair (nominal, polytomous: “blond”, “brown” and “redhead) and if the participant is from the north or south of Europe (nominal, dichotomous), we could use a Chi-square’s test.

And here we will end for today. We have not talked about the peculiarities of each test that we have to take into account, but we have only mentioned the test itself. For example, the chi-square’s has to meet minimums in each box of the contingency table, in the case of Student’s t we must consider whether the variances are equal (homoscedasticity) or not, etc. But that is another story…

## Achilles and Effects Forest

### Publication bias

Achilles. What a man! Definitely, one of the main characters among those who were in that mess that was ensued in Troy because of Helena, a.k.a. the beauty. You know his story. In order to make him invulnerable his mother, who was none other than Tetis, the nymph, bathed him in ambrosia and submerged him in the River Stix. But she made a mistake that should not have been allowed to any nymph: she took him by his right heel, which did not get wet with the river’s water. And so, his heel became his only vulnerable part. Hector didn’t realize it in time but Paris, totally on the ball, put an arrow in Achilles’ heel and sent him back to the Stix, but not into the water, but rather to the other side. And without Charon the Ferryman.

This story is the origin of the expression “Achilles’ heel”, which usually refers to the weakest or most vulnerable point of someone or something that, otherwise, is usually known for its strength.

#### Publication bias

For example, something as robust and formidable as meta-analysis has its Achilles heel: the publication bias. And that’s because in the world of science there is no social justice.

All scientific works should have the same opportunities to be published and achieve fame, but the reality is not at all like that and they can be discriminated against for four reasons: statistical significance, popularity of the topic they are dealing with, having someone to sponsor them and the language in which they are written.

These are the main factors that can contribute to publication bias. First, studies with more significant results are more likely to be published and, within these, they are more likely to be published when the effect is greater. This means that studies with negative results or effects of small magnitude may not be published, which will draw a biased conclusion from the analysis only of large studies with positive results. In the same way, papers on topics of public interest are more likely to be published regardless of the importance of their results. In addition, the sponsor also influences: a company that finances a study with a product of theirs that has gone wrong, probably is not going to publish it so that we all know that their product is not useful.

Secondly, as is logical, published studies are more likely to reach our hands than those that are not published in scientific journals. This is the case of doctoral theses, communications to congresses, reports from government agencies or, even, pending studies to be published by researchers of the subject that we are dealing with. For this reason it is so important to do a search that includes this type of work, which is included within the grey literature term.

Finally, a series of biases can be listed that influence the likelihood that a work will be published or retrieved by the researcher performing the systematic review such as language bias (the search is limited by language), availability bias ( include only those studies that are easy for the researcher to recover), the cost bias (studies that are free or cheap), the familiarity bias (only those from the researcher’s discipline are included), the duplication bias (those that have significant results are more likely to be published more than once) and citation bias (studies with significant results are more likely to be cited by other authors).

One may think that this loss of studies during the review cannot be so serious, since it could be argued, for example, that studies not published in peer-reviewed journals are usually of poorer quality, so they do not deserve to be included in the meta-analysis However, it is not clear either that scientific journals ensure the methodological quality of the study or that this is the only method to do so. There are researchers, like those of government agencies, who are not interested in publishing in scientific journals, but in preparing reports for those who commission them. In addition, peer review is not a guarantee of quality since, too often, neither the researcher who carries out the study nor those in charge of reviewing it have a training in methodology that ensures the quality of the final product.

All this can be worsened by the fact that these same factors can influence the inclusion and exclusion criteria of the meta-analysis primary studies, in such a way that we obtain a sample of articles that may not be representative of the global knowledge on the subject of the systematic review and meta-analysis.

If we have a publication bias, the applicability of the results will be seriously compromised. That is why we say that the publication bias is the true Achilles’ heel of meta-analysis.

If we correctly delimit the inclusion and exclusion criteria of the studies and do a global and unrestricted search of the literature we will have done everything possible to minimize the risk of bias, but we can never be sure of having avoided it. That is why techniques and tools have been devised for its detection.

#### Publication bias study

The most used has the sympathetic name of funnel plot. It shows the magnitude of the measured effect (X axis) versus a precision measurement (Y axis), which is usually the sample size, but which can also be the inverse of the variance or the standard error. We represent each primary study with a point and observe the point cloud.

In the most usual way, with the size of the sample on the Y axis, the precision of the results will be higher in the larger sample studies, so that the points will be closer together in the upper part of the axis and will be dispersed when approaching the origin of the axis Y. In this way, we observe a cloud of points in the form of a funnel, with the wide part down. This graphic should be symmetrical and, if that is not the case, we should always suspect a publication bias. In the second example attached you can see how there are “missing” studies on the side of lack of effect: this may mean that only studies with positive results are published.

This method is very simple to use but, sometimes, we can have doubts about the asymmetry of our funnel, especially if the number of studies is small. In addition, the funnel can be asymmetrical due to quality defects in the studies or because we are dealing with interventions whose effect varies according to the sample size of each study. For these cases, other more objective methods have been devised, such as the Begg’s rank correlation test and the Egger’s linear regression test.

The Begg’s test studies the presence of association between the estimates of the effects and their variances. If there is a correlation between them, bad going. The problem with this test is that it has little statistical power, so it is not reliable when the number of primary studies is small.

Egger’s test, more specific than Begg’s, consists of plotting the regression line between the precision of the studies (independent variable) and the standardized effect (dependent variable). This regression must be weighted by the inverse of the variance, so I do not recommend that you do it on your own, unless you are consummate statisticians. When there is no publication bias, the regression line originates at the zero of the Y axis. The further away from zero, the more evidence of publication bias.

As always, there are computer programs that do these tests quickly without having to burn your brain with the calculations.

What if after doing the work we see that there is publication bias? Can we do something to adjust it? As always, we can.

The simplest way is to use a graphic method called trim and fill. It consists of the following: a) we draw the funnel plot; b) we remove the small studies so that the funnel is symmetrical; c) the new center of the graph is determined; d) we recover the previously removed studies and we add their reflection to the other side of the center line; e) we estimate again the effect.

#### Other methods of studying publication bias

Another very conservative attitude that we can adopt is to assume that there is a publication bias and to ask how much it affects our results, assuming that we have left studies not included in the analysis.

The only way to know if the publication bias affects our estimates would be to compare the effect in the retrieved and unrecovered studies but, of course, then we would not have to worry about the publication bias.

To know if the observed result is robust or, on the contrary, it is susceptible to be biased by a publication bias, two methods of the fail-safe N have been devised.

The first is the Rosenthal’s fail-safe N method. Suppose we have a meta-analysis with an effect that is statistically significant, for example, a risk ratio greater than one with a p <0.05 (or a 95% confidence interval that does not include the null value, one). Then we ask ourselves a question: how many studies with RR = 1 (null value) will we have to include until p is not significant? If we need few studies (less than 10) to make the value of the effect null, we can worry because the effect may in fact be null and our significance is the product of a publication bias. On the contrary, if many studies are needed, the effect is likely to be truly significant. This number of studies is what the letter N of the name of the method means.

The problem with this method is that it focuses on the statistical significance and not on the relevance of the results. The correct thing would be to look for how many studies are needed so that the result loses clinical relevance, not statistical significance. In addition, it assumes that the effects of the missing studies is null (one in case of risk ratios and odds ratios, zero in cases of differences in means), when the effect of the missing studies can go in the opposite direction than the effect that we detect or in the same sense but of smaller magnitude.

To avoid these disadvantages there is a variation of the previous formula that assesses the statistical significance and clinical relevance. With this method, which is called the Orwin’s fail-safe N, it is calculated how many studies are needed to bring the value of the effect to a specific value, which will generally be the least effect that is clinically relevant. This method also allows to specify the average effect of the missing studies.

#### The PRISMA statement

To end the meta-analysis explanation, let’s see what is the right way to express the results of data analysis. To do it well, we can follow the recommendations of the statement, which devotes seven of its 27 items to give us advice on how to present the results of a meta-analysis.

First, we must inform about the selection process of studies: how many we have found and evaluated, how many we have selected and how many rejected, explaining in addition the reasons for doing so. For this, the flowchart that should include the systematic review from which the meta-analysis proceeds if it complies with the PRISMA statement is very useful.

Secondly, the characteristics of the primary studies must be specified, detailing what data we get from each one of them and their corresponding bibliographic citations to facilitate that any reader of the review can verify the data if he does not trust us. In this sense, there is also the third section, which refers to the evaluation of the risk of study biases and their internal validity.

Fourth, we must present the results of each individual study with a summary data of each intervention group analyzed together with the calculated estimators and their confidence intervals. These data will help us to compile the information that PRISMA asks us in its fifth point referring to the presentation of results and it is none other than the synthesis of all the meta-analysis studies, their confidence intervals, homogeneity study results, etc.

This is usually done graphically by means of an effects diagram, a graphical tool popularly known as forest plot, where the trees would be the primary studies of the meta-analysis and where all the relevant results of the quantitative synthesis are summarized.

The Cochrane’s Collaboration recommends structuring the forest plot in five well differentiated columns. Column 1 lists the primary studies or the groups or subgroups of patients included in the meta-analysis. They are usually represented by an identifier composed of the name of the first author and the date of publication.Column 2 shows the results of the measures of effect of each study as reported by their respective authors.

Column 3 is the actual forest plot, the graphic part of the subject. It shows the measures of effect of each study on both sides of the zero effect line, which we already know is zero for mean differences and one for odds ratios, risk ratios, hazard ratios, etc. Each study is represented by a square whose area is usually proportional to the contribution of each one to the overall result. In addition, the square is within a segment that represents the extremes of its confidence interval.

These confidence intervals inform us about the accuracy of the studies and tell us which are statistically significant: those whose interval does not cross the zero effect line. Anyway, do not forget that, although crossing the line of no effect and being not statistically significant, the interval boundaries can give us much information about the clinical significance of the results of each study. Finally, at the bottom of the chart we will find a diamond that represents the global result of the meta-analysis. Its position with respect to the null effect line will inform us about the statistical significance of the overall result, while its width will give us an idea of ​​its accuracy (its confidence interval). Furthermore, on top of this column will find the type of effect measurement, the analysis model data is used (fixed or random) and the significance value of the confidence intervals (typically 95%).

This chart is usually completed by a fourth column with the estimated weight of each study in per cent format and a fifth column with the estimates of the weighted effect of each. And in some corner of this forest will be the measure of heterogeneity that has been used, along with its statistical significance in cases where relevant.

To conclude the presentation of the results, PRISMA recommends a sixth section with the evaluation that has been made of the risks of bias in the study and a seventh with all the additional analyzes that have been necessary: stratification, sensitivity analysis, metaregression, etc.

#### What the Cochrane says

As you can see, nothing is easy about meta-analysis. Therefore, the Cochrane’s recommends following a series of steps to correctly interpret the results. Namely:

1. Verify which variable is compared and how. It is usually seen at the top of the forest plot.
2. Locate the measure of effect used. This is logical and necessary to know how to interpret the results. A hazard ratio is not the same as a difference in means or whatever it was used.
3. Locate the diamond, its position and its amplitude. It is also convenient to look at the numerical value of the global estimator and its confidence interval.
4. Check that heterogeneity has been studied. This can be seen by looking at whether the segments that represent the primary studies are or are not very dispersed and whether they overlap or not. In any case, there will always be a statistic that assesses the degree of heterogeneity. If we see that there is heterogeneity, the next thing will be to find out what explanation the authors give about its existence.
5. Draw our conclusions. We will look at which side of the null effect line are the overall effect and its confidence interval. You already know that, although it is significant, the lower limit of the interval should be as far as possible from the line, because of the clinical relevance, which does not always coincide with statistical significance. Finally, look again at the study of homogeneity. If there is a lot of heterogeneity, the results will not be as reliable.

#### We’re leaving…

And with this we end the topic of meta-analysis. In fact, the forest plot is not exclusive to meta-analyzes and can be used whenever we want to compare studies to elucidate their statistical or clinical significance, or in cases such as equivalence studies, in which the null effect line is joined of the equivalence thresholds. But it still has one more utility. A variant of the forest plot also serves to assess if there is a publication bias in the systematic review, although, as we already know, in these cases we change its name to funnel graph. But that is another story…

## Apples and pears

### Study of heterogeneity in meta-analysis

You all sure know the Chinese tale of the poor solitary rice grain that falls to the ground and nobody can hear it. Of course, if instead of falling a grain it falls a sack full of rice that will be something else. There are many examples of union making strength. A red ant is harmless, unless it bites you in some soft and noble area, which are usually the most sensitive. But what about a marabout of millions of red ants? That is what scares you up, because if they all come together and come for you, you could do little to stop their push. Yes, the union is strength.

And this also happens with statistics. With a relatively small sample of well-chosen voters we can estimate who will win an election in which millions vote. So, what could we not do with a lot of those samples? Surely the estimate would be more reliable and more generalizable.

#### Turning to substance

Well, this is precisely one of the purposes of meta-analysis, which uses various statistical techniques to make a quantitative synthesis of the results of a set of studies that, although try to answer the same question, do not reach exactly to the same result. But beware; we cannot combine studies to draw conclusions about the sum of them without first taking a series of precautions. This would be like mixing apples and pears which, I’m not sure why, should be something terribly dangerous because everyone knows it’s something to avoid.

Think that we have a set of clinical trials on the same topic and we want to do a meta-analysis to obtain a global result. It is more than convenient that there is as little variability as possible among the studies if we want to combine them. Because, ladies and gentlemen, here also rules the saying: alongside but separate.

Before thinking about combining the results of the studies of a systematic review to perform a meta-analysis, we must always make a previous study of the heterogeneity of the primary studies, which is nothing more than the variability that exists among the estimators that have been obtained in each of those studies.

#### Study of heterogeneity in meta-analysis

First, we will investigate possible causes of heterogeneity, such as differences in treatments, variability of the populations of the different studies and differences in the designs of the trials. If there is a great deal of heterogeneity from the clinical point of view, perhaps the best thing to do is not to do meta-analysis and limit the analysis to a qualitative synthesis of the results of the review.

Once we come to the conclusion that the studies are similar enough to try to combine them we should try to measure this heterogeneity to have an objective data. For this, several privileged brains have created a series of statistics that contribute to our daily jungle of acronyms and letters.

Until recently, the most famous of those initials was the Cochran’s Q, which has nothing to do either with James Bond or our friend Archie Cochrane. Its calculation takes into account the sum of the deviations between each of the results of primary studies and the global outcome (squared differences to avoid positives cancelling negatives), weighing each study according to their contribution to overall result. It looks awesome but in reality, it is no big deal. Ultimately, it’s no more than an aristocratic relative of ji-square test. Indeed, Q follows a ji-square distribution with k-1 degrees of freedom (being k the number of primary studies). We calculate its value, look at the frequency distribution and estimate the probability that differences are not due to chance, in order to reject our null hypothesis (which assumes that observed differences among studies are due to chance). But, despite the appearances, Q has a number of weaknesses.

First, it’s a very conservative parameter and we must always keep in mind that no statistical significance is not always synonymous of absence of heterogeneity: as a matter of fact, we cannot reject the null hypothesis, so we have to know that when we approved it we are running the risk of committing a type II error and blunder. For this reason, some people propose to use a significance level of p < 0.1 instead of the standard p < 0.5. Another Q’s pitfall is that it doesn’t quantify the degree of heterogeneity and, of course, doesn’t explain the reasons that produce it. And, to top it off, Q loses power when the number of studies is small and doesn’t allow comparisons among different meta-analysis if they have different number of studies.

This is why another statistic has been devised that is much more celebrated today: I2. This parameter provides an estimate of total variation among studies with respect to total variability or, put it another way, the proportion of variability actually due to heterogeneity for actual differences among the estimates compared with variability due to chance. It also looks impressive, but it’s actually an advantageous relative of the intraclass correlation coefficient.

Its value ranges from 0 to 100%, and we usually consider the limits of 25%, 50% and 75% as signs of low, moderate and high heterogeneity, respectively. I2 is not affected either by the effects units of measurement or the number of studies, so it allows comparisons between meta-analysis with different units of effect measurement or different number of studies.

If you read a study that provides Q and you want to calculate I2, or vice versa, you can use the following formula, being k the number of primary studies:

$I^{2}=&space;\frac{Q-k+1}{Q}$

There’s a third parameter that is less known, but not less worthy of mention: H2. It measures the excess of Q value in respect of the value that we would expect to obtain if there were no heterogeneity. Thus, a value of 1 means no heterogeneity and its value increases as heterogeneity among studies does. But its real interest is that it allows calculating I2 confidence intervals.

Other times, the authors perform a hypothesis contrast with a null hypothesis of non-heterogeneity and use a ji-square or some similar statistic. In these cases, what they provide is a value of statistical significance. If the p is <0.05 the null hypothesis can be rejected and say that there is heterogeneity. Otherwise we will say that we cannot reject the null hypothesis of non-heterogeneity.

In summary, whenever we see an indicator of homogeneity that represents a percentage, it will indicate the proportion of variability that is not due to chance. For their part, when they give us a “p” there will be significant heterogeneity when the “p” is less than 0.05.

Do not worry about the calculations of Q, I2 and H2. For that there are specific programs as RevMan or modules within the usual statistical programs that do the same function.

#### Graphical methods for studying heterogeneity in meta-analysis

A point of attention: always remember that not being able to demonstrate heterogeneity does not always mean that the studies are homogeneous. The problem is that the null hypothesis assumes that they are homogeneous and the differences are due to chance. If we can reject it we can assure that there is heterogeneity (always with a small degree of uncertainty). But this does not work the other way around: if we cannot reject it, it simply means that we cannot reject that there is no heterogeneity, but there will always be a probability of committing a type II error if we directly assume that the studies are homogeneous.

For this reason, a series of graphical methods have been devised to inspect the studies and verify that there is no data of heterogeneity even if the numerical parameters say otherwise.

The most employed of them is, perhaps, the , with can be used for both meta-analysis from trials or observational studies. This graph represents the accuracy of each study versus the standardize effects. It also shows the adjusted regression line and sets two confidence bands. The position of each study regarding the accuracy axis indicates its weighted contribution to overall results, while its location outside the confidence bands indicates its contribution to heterogeneity.

Galbraith’s graph can also be useful for detecting sources of heterogeneity, since studies can be labeled according to different variables and see how they contribute to the overall heterogeneity.

Another available tool you can use for meta-analysis of clinical trials is L’Abbé’s plot. It represents response rates to treatment versus response rates in control group, plotting the studies to both sides of the diagonal. Above that line are studies with positive treatment outcome, while below are studies with an outcome favorable to control intervention. The studies usually are plotted with an area proportional to its accuracy, and its dispersion indicates heterogeneity. Sometimes, L’Abbé’s graph provides additional information. For example, in the accompanying graph you can see that studies in low-risk areas are located mainly below the diagonal. On the other hand, high-risk studies are mainly located in areas of positive treatment outcome. This distribution, as well as being suggestive of heterogeneity, may suggest that efficacy of treatments depends on the level of risk or, put another way, we have an effect modifying variable in our study. A small drawback of this tool is that it is only applicable to meta-analysis of clinical trials and when the dependent variable is dichotomous.

#### We must weight each study

Well, suppose we study heterogeneity and we decide that we are going to combine the studies to do a meta-analysis. The next step is to analyze the estimators of the effect size of the studies, weighing them according to the contribution that each study will have on the overall result. This is logical; it cannot contribute the same to the final result a trial with few participants and an imprecise result than another with thousands of participants and a more precise result measure.

The most usual way to take these differences into account is to weight the estimate of the size of the effect by the inverse of the variance of the results, subsequently performing the analysis to obtain the average effect. For these there are several possibilities, some of them very complex from the statistical point of view, although the two most commonly used methods are the fixed effect model and the random effects model. Both models differ in their conception of the starting population from which the primary studies of meta-analysis come.

#### Two models

The fixed effect model considers that there is no heterogeneity and that all studies estimate the same effect size of the population (they all measure the same effect, that is why it is called a fixed effect), so it is assumed that the variability observed among the individual studies is due only to the error that occurs when performing the random sampling in each study. This error is quantified by estimating intra-study variance, assuming that the differences in the estimated effect sizes are due only to the use of samples from different subjects.

On the other hand, the random effects model assumes that the effect size varies in each study and follows a normal frequency distribution within the population, so each study estimates a different effect size. Therefore, in addition to the intra-study variance due to the error of random sampling, the model also includes the variability among studies, which would represent the deviation of each study from the mean effect size. These two error terms are independent of each other, both contributing to the variance of the study estimator.

In summary, the fixed effect model incorporates only one error term for the variability of each study, while the random effects model adds, in addition, another error term due to the variability among the studies.

You see that I have not written a single formula. We do not actually need to know them and they are quite unfriendly, full of Greek letters that no one understands. But do not worry. As always, statistical programs like RevMan from the Cochrane Collaboration allow you to do the calculations in a simple way, including and removing studies from the analysis and changing the model as you wish.

The type of model to choose has its importance. If in the previous homogeneity analysis we see that the studies are homogeneous we can use the fixed effect model. But if we detect that heterogeneity exists, within the limits that allow us to combine the studies, it will be preferable to use the random effects model.

Another consideration is the applicability or external validity of the results of the meta-analysis. If we have used the fixed effect model, we will be committed to generalize the results out of populations with characteristics similar to those of the included studies. This does not occur with the results obtained using the random effects model, whose external validity is greater because it comes from studies of different populations.

In any case, we will obtain a summary effect measure along with its confidence interval. This confidence interval will be statistically significant when it does not cross the zero effect line, which we already know is zero for mean differences and one for odds ratios and risk ratios. In addition, the amplitude of the interval will inform us about the precision of the estimation of the average effect in the population: how much wider, less precise, and vice versa.

If you think a bit, you will immediately understand why the random effects model is more conservative than the fixed effect model in the sense that the confidence intervals obtained are less precise, since it incorporates more variability in its analysis. In some cases it may happen that the estimator is significant if we use the fixed effect model and it is not significant if we use the random effect model, but this should not condition us when choosing the model to use. We must always rely on the previous measure of heterogeneity, although if we have doubts, we can also use the two models and compare the different results.

#### What if there is heterogeneity?

Having examined the homogeneity of primary studies we can come to the grim conclusion that heterogeneity dominates the situation. Can we do something to manage it? Sure, we can. We can always not to combine the studies, or combine them despite heterogeneity and obtain a summary result but, in that case, we should also calculate any measure of variability among studies and yet we could not be sure of our results.

Another possibility is to do a stratified analysis according to the variable that causes heterogeneity, provided that we are able to identify it. For this we can do a sensitivity analysis, repeating calculations once removing one by one each of the subgroups and checking how it influences the overall result. The problem is that this approach ignores the final purpose of any meta-analysis, which is none than obtaining an overall value of homogeneous studies.

Finally, the brainiest on these issues can use meta-regression. This technique is similar to multivariate regression models in which the characteristics of the studies are used as explanatory variables, and effect’s variable or some measure of deviation of each study with respect to global result are used as dependent variable. Also, it should be done a weighting according to the contribution of each study to the overall result and try not to score too much coefficients to the regression model if the number of primary studies is not large. I wouldn’t advise you to do a meta-regression at home if it is not accompanied by seniors.

#### We´re leaving…

And we only need to check that we have not omitted studies and that we have presented the results correctly. The meta-analysis data are usually represented in a specific graph that is known as forest plot. But that is another story…