You all sure know the Chinese tale of the poor solitary rice grain that falls to the ground and nobody can hear it. Of course, if instead of falling a grain it falls a sack full of rice that will be something else. There are many examples of union making strength. A red ant is harmless, unless it bites you in some soft and noble area, which are usually the most sensitive. But what about a marabout of millions of red ants? That is what scares you up, because if they all come together and come for you, you could do little to stop their push. Yes, the union is strength.

And this also happens with statistics. With a relatively small sample of well-chosen voters we can estimate who will win an election in which millions vote. So, what could we not do with a lot of those samples? Surely the estimate would be more reliable and more generalizable.

Well, this is precisely one of the purposes of meta-analysis, which uses various statistical techniques to make a quantitative synthesis of the results of a set of studies that, although try to answer the same question, do not reach exactly to the same result. But beware; we cannot combine studies to draw conclusions about the sum of them without first taking a series of precautions. This would be like mixing apples and pears which, I’m not sure why, should be something terribly dangerous because everyone knows it’s something to avoid.

Think that we have a set of clinical trials on the same topic and we want to do a meta-analysis to obtain a global result. It is more than convenient that there is as little variability as possible among the studies if we want to combine them. Because, ladies and gentlemen, here also rules the saying: alongside but separate.

Before thinking about combining the results of the studies of a systematic review to perform a meta-analysis, we must always make a previous study of the heterogeneity of the primary studies, which is nothing more than the variability that exists among the estimators that have been obtained in each of those studies.

First, we will investigate possible causes of heterogeneity, such as differences in treatments, variability of the populations of the different studies and differences in the designs of the trials. If there is a great deal of heterogeneity from the clinical point of view, perhaps the best thing to do is not to do meta-analysis and limit the analysis to a qualitative synthesis of the results of the review.

Once we come to the conclusion that the studies are similar enough to try to combine them we should try to measure this heterogeneity to have an objective data. For this, several privileged brains have created a series of statistics that contribute to our daily jungle of acronyms and letters.

Until recently, the most famous of those initials was the Cochran’s Q, which has nothing to do either with James Bond or our friend Archie Cochrane. Its calculation takes into account the sum of the deviations between each of the results of primary studies and the global outcome (squared differences to avoid positives cancelling negatives), weighing each study according to their contribution to overall result. It looks awesome but in reality, it is no big deal. Ultimately, it’s no more than an aristocratic relative of ji-square test. Indeed, Q follows a ji-square distribution with k-1 degrees of freedom (being k the number of primary studies). We calculate its value, look at the frequency distribution and estimate the probability that differences are not due to chance, in order to reject our null hypothesis (which assumes that observed differences among studies are due to chance). But, despite the appearances, Q has a number of weaknesses.

First, it’s a very conservative parameter and we must always keep in mind that no statistical significance is not always synonymous of absence of heterogeneity: as a matter of fact, we cannot reject the null hypothesis, so we have to know that when we approved it we are running the risk of committing a type II error and blunder. For this reason, some people propose to use a significance level of p < 0.1 instead of the standard p < 0.5. Another Q’s pitfall is that it doesn’t quantify the degree of heterogeneity and, of course, doesn’t explain the reasons that produce it. And, to top it off, Q loses power when the number of studies is small and doesn’t allow comparisons among different meta-analysis if they have different number of studies.

This is why another statistic has been devised that is much more celebrated today: I^{2}. This parameter provides an estimate of total variation among studies with respect to total variability or, put it another way, the proportion of variability actually due to heterogeneity for actual differences among the estimates compared with variability due to chance. It also looks impressive, but it’s actually an advantageous relative of the intraclass correlation coefficient.

Its value ranges from 0 to 100%, and we usually consider the limits of 25%, 50% and 75% as signs of low, moderate and high heterogeneity, respectively. I^{2} is not affected either by the effects units of measurement or the number of studies, so it allows comparisons between meta-analysis with different units of effect measurement or different number of studies.

If you read a study that provides Q and you want to calculate I^{2}, or vice versa, you can use the following formula, being k the number of primary studies:

There’s a third parameter that is less known, but not less worthy of mention: H^{2}. It measures the excess of Q value in respect of the value that we would expect to obtain if there were no heterogeneity. Thus, a value of 1 means no heterogeneity and its value increases as heterogeneity among studies does. But its real interest is that it allows calculating I^{2} confidence intervals.

Other times, the authors perform a hypothesis contrast with a null hypothesis of non-heterogeneity and use a ji-square or some similar statistic. In these cases, what they provide is a value of statistical significance. If the p is <0.05 the null hypothesis can be rejected and say that there is heterogeneity. Otherwise we will say that we cannot reject the null hypothesis of non-heterogeneity.

In summary, whenever we see an indicator of homogeneity that represents a percentage, it will indicate the proportion of variability that is not due to chance. For their part, when they give us a “p” there will be significant heterogeneity when the “p” is less than 0.05.

Do not worry about the calculations of Q, I^{2} and H^{2}. For that there are specific programs as RevMan or modules within the usual statistical programs that do the same function.

A point of attention: always remember that not being able to demonstrate heterogeneity does not always mean that the studies are homogeneous. The problem is that the null hypothesis assumes that they are homogeneous and the differences are due to chance. If we can reject it we can assure that there is heterogeneity (always with a small degree of uncertainty). But this does not work the other way around: if we cannot reject it, it simply means that we cannot reject that there is no heterogeneity, but there will always be a probability of committing a type II error if we directly assume that the studies are homogeneous.

For this reason, a series of graphical methods have been devised to inspect the studies and verify that there is no data of heterogeneity even if the numerical parameters say otherwise.

The most employed of them is, perhaps, the , with can be used for both meta-analysis from trials or observational studies. This graph represents the accuracy of each study versus the standardize effects. It also shows the adjusted regression line and sets two confidence bands. The position of each study regarding the accuracy axis indicates its weighted contribution to overall results, while its location outside the confidence bands indicates its contribution to heterogeneity.

Galbraith’s graph can also be useful for detecting sources of heterogeneity, since studies can be labeled according to different variables and see how they contribute to the overall heterogeneity.

Another available tool you can use for meta-analysis of clinical trials is L’Abbé’s plot. It represents response rates to treatment versus response rates in control group, plotting the studies to both sides of the diagonal. Above that line are studies with positive treatment outcome, while below are studies with an outcome favorable to control intervention. The studies usually are plotted with an area proportional to its accuracy, and its dispersion indicates heterogeneity. Sometimes, L’Abbé’s graph provides additional information. For example, in the accompanying graph you can see that studies in low-risk areas are located mainly below the diagonal. On the other hand, high-risk studies are mainly located in areas of positive treatment outcome. This distribution, as well as being suggestive of heterogeneity, may suggest that efficacy of treatments depends on the level of risk or, put another way, we have an effect modifying variable in our study. A small drawback of this tool is that it is only applicable to meta-analysis of clinical trials and when the dependent variable is dichotomous.

Well, suppose we study heterogeneity and we decide that we are going to combine the studies to do a meta-analysis. The next step is to analyze the estimators of the effect size of the studies, weighing them according to the contribution that each study will have on the overall result. This is logical; it cannot contribute the same to the final result a trial with few participants and an imprecise result than another with thousands of participants and a more precise result measure.

The most usual way to take these differences into account is to weight the estimate of the size of the effect by the inverse of the variance of the results, subsequently performing the analysis to obtain the average effect. For these there are several possibilities, some of them very complex from the statistical point of view, although the two most commonly used methods are the fixed effect model and the random effects model. Both models differ in their conception of the starting population from which the primary studies of meta-analysis come.

The fixed effect model considers that there is no heterogeneity and that all studies estimate the same effect size of the population (they all measure the same effect, that is why it is called a fixed effect), so it is assumed that the variability observed among the individual studies is due only to the error that occurs when performing the random sampling in each study. This error is quantified by estimating intra-study variance, assuming that the differences in the estimated effect sizes are due only to the use of samples from different subjects.

On the other hand, the random effects model assumes that the effect size varies in each study and follows a normal frequency distribution within the population, so each study estimates a different effect size. Therefore, in addition to the intra-study variance due to the error of random sampling, the model also includes the variability among studies, which would represent the deviation of each study from the mean effect size. These two error terms are independent of each other, both contributing to the variance of the study estimator.

In summary, the fixed effect model incorporates only one error term for the variability of each study, while the random effects model adds, in addition, another error term due to the variability among the studies.

You see that I have not written a single formula. We do not actually need to know them and they are quite unfriendly, full of Greek letters that no one understands. But do not worry. As always, statistical programs like RevMan from the Cochrane Collaboration allow you to do the calculations in a simple way, including and removing studies from the analysis and changing the model as you wish.

The type of model to choose has its importance. If in the previous homogeneity analysis we see that the studies are homogeneous we can use the fixed effect model. But if we detect that heterogeneity exists, within the limits that allow us to combine the studies, it will be preferable to use the random effects model.

Another consideration is the applicability or external validity of the results of the meta-analysis. If we have used the fixed effect model, we will be committed to generalize the results out of populations with characteristics similar to those of the included studies. This does not occur with the results obtained using the random effects model, whose external validity is greater because it comes from studies of different populations.

In any case, we will obtain a summary effect measure along with its confidence interval. This confidence interval will be statistically significant when it does not cross the zero effect line, which we already know is zero for mean differences and one for odds ratios and risk ratios. In addition, the amplitude of the interval will inform us about the precision of the estimation of the average effect in the population: how much wider, less precise, and vice versa.

If you think a bit, you will immediately understand why the random effects model is more conservative than the fixed effect model in the sense that the confidence intervals obtained are less precise, since it incorporates more variability in its analysis. In some cases it may happen that the estimator is significant if we use the fixed effect model and it is not significant if we use the random effect model, but this should not condition us when choosing the model to use. We must always rely on the previous measure of heterogeneity, although if we have doubts, we can also use the two models and compare the different results.

Having examined the homogeneity of primary studies we can come to the grim conclusion that heterogeneity dominates the situation. Can we do something to manage it? Sure, we can. We can always not to combine the studies, or combine them despite heterogeneity and obtain a summary result but, in that case, we should also calculate any measure of variability among studies and yet we could not be sure of our results.

Another possibility is to do a stratified analysis according to the variable that causes heterogeneity, provided that we are able to identify it. For this we can do a sensitivity analysis, repeating calculations once removing one by one each of the subgroups and checking how it influences the overall result. The problem is that this approach ignores the final purpose of any meta-analysis, which is none than obtaining an overall value of homogeneous studies.

Finally, the brainiest on these issues can use meta-regression. This technique is similar to multivariate regression models in which the characteristics of the studies are used as explanatory variables, and effect’s variable or some measure of deviation of each study with respect to global result are used as dependent variable. Also, it should be done a weighting according to the contribution of each study to the overall result and try not to score too much coefficients to the regression model if the number of primary studies is not large. I wouldn’t advise you to do a meta-regression at home if it is not accompanied by seniors.

And we only need to check that we have not omitted studies and that we have presented the results correctly. The meta-analysis data are usually represented in a specific graph that is known as forest plot. But that is another story…