# Science without sense…double nonsense

### Pills on evidence-based medicine


Yes, I know the saying goes just the opposite. But that is precisely the problem we have with so much new information technology. Today anyone can write and publish whatever goes through their head and reach a lot of people, even if what they say is nonsense (and no, I do not take this personally; not even my brother-in-law reads what I post!). The trouble is that much of what is written is not worth a thing, to avoid referring to any kind of excreta. There is a lot of smoke and little fire, when we would all like the opposite.

The same happens in medicine when we need information to make some of our clinical decisions. Whatever source we turn to, the volume of information will not only overwhelm us; most of it will be of no use to us at all. Besides, even if we find a well-done article, it may not be enough to answer our question completely. That is why we are so fond of the literature reviews that some generous souls publish in medical journals. They save us the task of reviewing a lot of articles and summarizing their conclusions. Great, isn't it? Well, sometimes it is and sometimes it is not. As with any other type of study in the medical literature, we should always make a critical appraisal of a review and not rely solely on the good know-how of its authors.

Reviews, of which we already know there are two types, also have their limitations, which we must know how to weigh. The simplest form is the narrative or author's review, our favorite when we are young and ignorant. This type of review is usually done by an expert in the topic, who reviews the literature, analyzes what she finds as she sees fit (that is what being an expert is for) and summarizes it in a qualitative synthesis with her expert conclusions. These reviews are good for getting a general idea about a topic, but they do not usually serve to answer specific questions. In addition, since it is not specified how the information search was done, we cannot reproduce it or verify that it includes everything important written on the subject. With these reviews we can do little critical appraising, since there is no precise systematization of how they have to be prepared, so we have to rely on unreliable proxies such as the prestige of the author or the impact of the journal where the review is published.

As our knowledge of the general aspects of science increases, our interest shifts towards other types of reviews that provide more specific information about aspects that escape our increasingly broad knowledge. This other type is the systematic review (SR), which focuses on a specific question, follows a clearly specified methodology for searching and selecting information, and performs a rigorous and critical analysis of the results found. Moreover, when the primary studies are sufficiently homogeneous, the SR goes beyond the qualitative synthesis and also performs a quantitative one, which goes by the nice name of meta-analysis. With these reviews we can do a critical appraisal following an ordered and pre-established methodology, much as we do with other types of studies.

The prototype of the SR is the one made by the Cochrane Collaboration, which has developed a specific methodology that you can consult in the handbooks available on its website. But, if you want my advice, do not trust even the Cochrane reviews: make a careful critical appraisal even if the review has been done by them, and do not take it for granted simply because of its origin. As one of my teachers in these disciplines says (I am sure he is smiling if he reads these lines), there is life after Cochrane. And, I would add, there is a lot of it, and good.

Although SRs and meta-analyses command a bit of respect at first, do not worry: they can be critically appraised in a simple way by considering the main aspects of their methodology. And for that, nothing better than to systematically review our three pillars: validity, relevance and applicability.

Regarding VALIDITY, we will try to determine whether the review gives us unbiased results and responds correctly to the question posed. As always, we will first look for some primary validity criteria. If these are not fulfilled, we should consider whether it is time to walk the dog instead: we would probably make better use of our time.

Has the aim of the review been clearly stated? All SRs should try to answer a specific question that is relevant from the clinical point of view and that usually arises following the PICO scheme of a structured clinical question. It is preferable that the review try to answer only one question, since if it tries to answer several there is a risk of not answering any of them adequately. This question also determines the type of studies the review should include, so we must assess whether the appropriate type has been included. Although SRs of clinical trials are the most common, they can also include other types of studies: observational studies, diagnostic test studies, etc. The authors of the review must specify the criteria for inclusion and exclusion of the studies, in addition to considering aspects such as the setting, the study groups, the outcomes, etc. Differences among the included studies in terms of patients (P), intervention (I) or outcomes (O) can make two SRs that ask the same question reach different conclusions.

If the answer to the two previous questions is affirmative, we will move on to the secondary criteria and leave the dog's walk for later. Have the important studies on the subject been included? We must verify that a global and unbiased search of the literature has been carried out. The usual approach is an electronic search of the most important databases (generally PubMed, Embase and the Cochrane Library), but this must be completed with a search strategy in other media (references of the articles found, contact with well-known researchers, the pharmaceutical industry, national and international registries, etc.), including the so-called grey literature (theses, reports, etc.), since there may be important unpublished works. And let no one be surprised by the latter: it has been shown that studies with negative conclusions are at greater risk of not being published, so they do not appear in the SR. We must verify that the authors have ruled out the possibility of this publication bias. This entire selection process is usually captured in a flow diagram showing the fate of all the studies assessed for the SR.

It is very important that enough has been done to assess the quality of the studies, looking for possible biases. For this, the authors can use an ad hoc tool or, more usually, resort to one that is already recognized and validated, such as the Cochrane Collaboration's risk of bias tool in the case of reviews of clinical trials. This tool assesses five criteria of the primary studies to determine their risk of bias: an adequate randomization sequence (prevents selection bias), adequate blinding (prevents performance and detection biases, both information biases), allocation concealment (prevents selection bias), losses to follow-up (prevents attrition bias) and selective outcome reporting (prevents information bias). The studies are then classified as being at high, low or unclear risk of bias according to the most important aspects of the design's methodology (clinical trials in this case).

In addition, this must be done independently by two authors and, ideally, without knowing the authors of the studies or the journals where the primary studies of the review were published. Finally, the degree of agreement between the two reviewers should be recorded, along with what they did when they disagreed (most commonly, resorting to a third party, who will probably be the boss of both).
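That degree of agreement is often summarized with Cohen's kappa, which corrects the raw proportion of agreement for the agreement expected by chance alone. Here is a minimal sketch; the two reviewers and their ratings are made up for illustration:

```python
# Hypothetical example: two reviewers rate ten studies as "low", "high"
# or "unclear" risk of bias. Kappa = (observed - expected) / (1 - expected),
# where "expected" is the chance agreement given each rater's marginal frequencies.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

a = ["low", "low", "high", "unclear", "low", "high", "low", "low", "high", "unclear"]
b = ["low", "high", "high", "unclear", "low", "high", "low", "low", "low", "unclear"]
print(round(cohens_kappa(a, b), 2))  # prints 0.68
```

Values near 1 indicate almost perfect agreement; values near 0, agreement no better than chance.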

To conclude with internal or methodological validity, in case the results of the studies have been combined to draw common conclusions with a meta-analysis, we must ask ourselves whether it was reasonable to combine the results of the primary studies. To draw conclusions from combined data, it is fundamental that the studies are homogeneous and that the differences among them are due solely to chance. Although some variability among the studies increases the external validity of the conclusions, we cannot pool the data for the analysis if there is a lot of variability. There are numerous methods to assess homogeneity that we will not go into now, but we will insist on the need for the authors of the review to have studied it adequately.

In summary, the fundamental aspects we have to analyze to assess the validity of a SR are: 1) that the aims of the review are well defined in terms of population, intervention and outcome measurement; 2) that the bibliographic search has been exhaustive; 3) that the criteria for inclusion and exclusion of primary studies in the review are adequate; and 4) that the internal or methodological validity of the included studies has also been verified. In addition, if the SR includes a meta-analysis, we will review the methodological aspects that we saw in a previous post: the suitability of combining the studies to make a quantitative synthesis, the adequate assessment of the heterogeneity of the primary studies and the use of a suitable mathematical model to combine the results (you know, the fixed effect and random effects models).

Regarding the RELEVANCE of the results, we must consider what the overall result of the review is and whether it has been interpreted judiciously. The SR should provide a global estimate of the effect of the intervention based on a weighted average of the results of the included studies. Most often, relative measures such as the risk ratio or the odds ratio are reported, although ideally they should be complemented with absolute measures such as the absolute risk reduction or the number needed to treat (NNT). In addition, we must assess the precision of the results, for which we will use our beloved confidence intervals, which give us an idea of the precision of the estimate of the true magnitude of the effect in the population. As you can see, the way of assessing the relevance of the results is practically the same as for the primary studies. Here we give examples based on clinical trials, which is the type of study we will see most frequently, but remember that other types of studies may express the relevance of their results better with other parameters. In any case, confidence intervals will always help us to assess the precision of the results.
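As a reminder of how these measures relate to one another, here is a minimal sketch with made-up trial figures (they do not come from any real review):

```python
# Hypothetical trial data: events / total in each arm.
events_treated, n_treated = 10, 100
events_control, n_control = 20, 100

risk_treated = events_treated / n_treated   # 0.10
risk_control = events_control / n_control   # 0.20

rr = risk_treated / risk_control            # risk ratio (relative measure)
arr = risk_control - risk_treated           # absolute risk reduction
nnt = 1 / arr                               # number needed to treat

print(f"RR = {rr:.2f}, ARR = {arr:.2f}, NNT = {nnt:.0f}")
```

With these figures, treatment halves the risk, the absolute risk reduction is 10 percentage points, and we would need to treat 10 patients to prevent one event.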

The results of meta-analyses are usually represented in a standardized way, most often using the so-called forest plot. A graph is drawn with a vertical line of no effect (at one for relative risks and odds ratios and at zero for mean differences) and each study is represented as a mark (its result) in the middle of a segment (its confidence interval). Studies whose results are statistically significant are those whose segments do not cross the vertical line. Generally, the most powerful studies have narrower intervals and contribute more to the overall result, which is expressed as a diamond whose lateral tips represent its confidence interval. Only diamonds that do not cross the vertical line indicate statistical significance. Also, the narrower the interval, the more precise the result. And, finally, the further from the line of no effect, the clearer the difference between the treatments or the compared exposures.

If you want a more detailed explanation of the elements that make up a forest plot, you can go to the previous post where we explained it or to the online handbooks of the Cochrane Collaboration.

We will conclude the critical appraisal of the SR by assessing the APPLICABILITY of the results to our environment. We have to ask ourselves whether we can apply the results to our patients and how they will influence the care we give them. We have to check whether the primary studies of the review describe the participants and whether they resemble our patients. In addition, although we have already said that it is preferable for the SR to be oriented to a specific question, we must check whether all the results relevant to decision making in the problem under study have been considered, since sometimes it is convenient to consider some additional secondary variable. And, as always, we must assess the benefit-risk-cost ratio. The fact that the conclusion of the SR seems valid does not mean that we are obliged to apply it.

If you want to appraise a SR correctly without forgetting any important aspect, I recommend using a checklist such as PRISMA or some of the tools available on the Internet, such as the templates that can be downloaded from the page, which are the ones we have used for everything we have said so far.

The PRISMA statement (Preferred Reporting Items for Systematic reviews and Meta-Analyses) consists of 27 items, classified into 7 sections that refer to title, abstract, introduction, methods, results, discussion and funding:

1. Title: it must be identified as a SR, a meta-analysis or both. If, in addition, it specifies that it deals with clinical trials, it will take precedence over other types of reviews.
2. Abstract: it should be a structured summary including background, objectives, data sources, inclusion criteria, limitations, conclusions and implications. The registration number of the review must also be included.
3. Introduction: includes two items, the justification of the study (what is known, controversies, etc.) and the objectives (the question it tries to answer, in the PICO terms of the structured clinical question).
4. Methods. It is the section with the largest number of items (12):

– Protocol and registration: indicate the registration number and its availability.

– Eligibility criteria: justification of the characteristics of the studies and the search criteria used.

– Sources of information: describe the sources used and the last search date.

– Search: complete electronic search strategy, so that it can be reproduced.

– Selection of studies: specify the selection process and the inclusion and exclusion criteria.

– Data extraction process: describe the methods used to extract the data from the primary studies.

– Data items: define the variables used.

– Risk of bias in primary studies: describe the method used and how it has been used in the synthesis of results.

– Summary measures: specify the main summary measures used.

– Results synthesis: describe the methods used to combine the results.

– Risk of bias across studies: describe biases that may affect the cumulative evidence, such as publication bias.

– Additional analyses: describe any additional analysis methods (sensitivity or subgroup analyses, meta-regression), indicating which were pre-specified.

5. Results. Includes 7 items:

– Selection of studies: expressed through a flow chart giving the number of records at each stage (identification, screening, eligibility and inclusion).

– Characteristics of the studies: present the characteristics of the studies from which data were extracted and their bibliographic references.

– Risk of bias in the studies: communicate the risks in each study and any evaluation that is made about the bias in the results.

– Results of the individual studies: report the data for each study or intervention group and the estimate of the effect with its confidence interval, ideally accompanied by a forest plot.

– Synthesis of results: present the results of all the meta-analyses performed, with their confidence intervals and consistency measures.

– Risk of bias across studies: present any evaluation made of the risk of bias across the studies.

– Additional analyses: if they have been carried out, provide their results.

6. Discussion. Includes 3 items:

– Summary of the evidence: summarize the main findings with the strength of the evidence of each main result and the relevance from the clinical point of view or of the main interest groups (care providers, users, health decision-makers, etc.).

– Limitations: discuss the limitations of the results, the studies and the review.

– Conclusions: general interpretation of the results in the context of other evidence, and their implications for future research.

7. Funding: describe the sources of funding and the role they played in the conduct of the SR.

As a third option to these two tools, you can also use the aforementioned Cochrane Handbook for Systematic Reviews of Interventions, available on its website, whose purpose is to help the authors of Cochrane reviews to work explicitly and systematically.

As you can see, we have said practically nothing about meta-analysis itself, with all its statistical techniques to assess homogeneity and its fixed and random effects models. The thing is, the meta-analysis is a beast that must be eaten separately, so we have already devoted two posts to it alone that you can check whenever you want. But that is another story…

## Apples and pears

You all surely know the Chinese tale of the poor solitary grain of rice that falls to the ground and nobody hears it. Of course, if instead of a grain a whole sack of rice falls, that is something else. There are many examples of union making strength. A single red ant is harmless, unless it bites you in some soft and noble area, which are usually the most sensitive ones. But what about a swarm of millions of red ants? That does scare you, because if they all come together and come for you, there is little you could do to stop their push. Yes, union is strength.

The same happens with statistics. With a relatively small sample of well-chosen voters we can estimate who will win an election in which millions vote. So, what could we not do with a lot of those samples? Surely the estimate would be more reliable and more generalizable.

Well, this is precisely one of the purposes of meta-analysis, which uses various statistical techniques to make a quantitative synthesis of the results of a set of studies that, although they try to answer the same question, do not reach exactly the same result. But beware: we cannot combine studies to draw conclusions about their sum without first taking a series of precautions. This would be like mixing apples and pears which, I am not sure why, must be something terribly dangerous, because everyone knows it is something to avoid.

Think that we have a set of clinical trials on the same topic and we want to do a meta-analysis to obtain a global result. It is more than convenient that there is as little variability as possible among the studies if we want to combine them. Because, ladies and gentlemen, here the saying also rules: together, but not scrambled.

Before thinking about combining the results of the studies of a systematic review to perform a meta-analysis, we must always make a prior study of the heterogeneity of the primary studies, which is nothing more than the variability that exists among the estimates obtained in each of those studies.

First, we will investigate possible causes of heterogeneity, such as differences in treatments, variability among the populations of the different studies and differences in the designs of the trials. If there is a great deal of heterogeneity from the clinical point of view, perhaps the best thing is not to do a meta-analysis at all and to limit ourselves to a qualitative synthesis of the results of the review.

Once we come to the conclusion that the studies are similar enough to try to combine them, we should try to measure this heterogeneity to have an objective datum. For this, several privileged brains have created a series of statistics that add to our daily jungle of acronyms and letters.

Until recently, the most famous of those initials was Cochran's Q, which has nothing to do with either James Bond or our friend Archie Cochrane. Its calculation takes into account the sum of the deviations between each of the results of the primary studies and the global outcome (squared, to avoid positives cancelling out negatives), weighting each study according to its contribution to the overall result. It looks awesome but, in reality, it is no big deal. Ultimately, it is no more than an aristocratic relative of the chi-square test. Indeed, Q follows a chi-square distribution with k-1 degrees of freedom (k being the number of primary studies). We calculate its value, look at the frequency distribution and estimate the probability that the differences are not due to chance, in order to reject our null hypothesis (which assumes that the observed differences among studies are due to chance). But, despite appearances, Q has a number of weaknesses.

First, it is a very conservative parameter, and we must always keep in mind that lack of statistical significance is not synonymous with absence of heterogeneity: in fact, when we fail to reject the null hypothesis and accept it, we run the risk of committing a type II error and blundering. For this reason, some people propose using a significance level of p < 0.1 instead of the standard p < 0.05. Another of Q's pitfalls is that it does not quantify the degree of heterogeneity and, of course, does not explain the reasons that produce it. And, to top it off, Q loses power when the number of studies is small and does not allow comparisons among meta-analyses with different numbers of studies.

This is why another statistic has been devised that is much more celebrated today: I2. This parameter estimates the proportion of the total variation among studies that is due to heterogeneity, that is, to real differences among the estimates rather than to chance. It also looks impressive, but it is actually an advantageous relative of the intraclass correlation coefficient.

Its value ranges from 0 to 100%, and we usually take 25%, 50% and 75% as the limits for low, moderate and high heterogeneity, respectively. I2 is affected neither by the units of measurement of the effect nor by the number of studies, so it does allow comparisons between meta-analyses with different effect measures or different numbers of studies.

If you read a study that provides Q and you want to calculate I2, or vice versa, you can use the following formula, k being the number of primary studies:

$I^{2}=\frac{Q-(k-1)}{Q}$

There is a third parameter that is less known but no less worthy of mention: H2. It measures the excess of the value of Q with respect to the value we would expect if there were no heterogeneity. Thus, a value of 1 means no heterogeneity, and its value increases as heterogeneity among studies does. But its real interest is that it allows the calculation of confidence intervals for I2.
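For the curious, the three statistics can be computed by hand from the effect sizes and variances of the primary studies, using the usual inverse-variance weights. This is a minimal sketch with invented numbers:

```python
def heterogeneity(effects, variances):
    """Cochran's Q, I2 (as a percentage) and H2 from per-study effect
    sizes and their variances, using inverse-variance weights."""
    w = [1 / v for v in variances]
    pooled = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
    q = sum(wi * (e - pooled) ** 2 for wi, e in zip(w, effects))
    k = len(effects)
    h2 = q / (k - 1)                                   # Q / expected Q
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0
    return q, i2, h2

# Hypothetical effect sizes (e.g. log odds ratios) and variances of four trials:
effects = [0.2, 0.5, 0.5, 0.8]
variances = [0.01, 0.01, 0.01, 0.01]
q, i2, h2 = heterogeneity(effects, variances)
print(f"Q = {q:.1f}, I2 = {i2:.1f}%, H2 = {h2:.1f}")
```

With these invented data Q is well above its expected value of k-1 = 3, I2 is above 75% and H2 is well above 1, so we would conclude there is high heterogeneity.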

Other times, the authors perform a hypothesis test with a null hypothesis of no heterogeneity, using a chi-square or some similar statistic. What they then provide is a value of statistical significance. If p < 0.05, the null hypothesis can be rejected and we can say that there is heterogeneity. Otherwise, we will say that we cannot reject the null hypothesis of no heterogeneity.

In summary, whenever we see an indicator of heterogeneity expressed as a percentage, it indicates the proportion of variability that is not due to chance. When, instead, we are given a p value, there is significant heterogeneity when p is less than 0.05.

Do not worry about the calculations of Q, I2 and H2. There are specific programs for that, such as RevMan, and modules within the usual statistical packages that do the same job.

A word of caution: always remember that failing to demonstrate heterogeneity does not always mean that the studies are homogeneous. The problem is that the null hypothesis assumes that they are homogeneous and that the differences are due to chance. If we can reject it, we can claim that there is heterogeneity (always with a small degree of uncertainty). But this does not work the other way around: if we cannot reject it, it simply means that we cannot rule out heterogeneity, and there will always be a probability of committing a type II error if we directly assume that the studies are homogeneous.

For this reason, a series of graphical methods have been devised to inspect the studies and check that there are no signs of heterogeneity even if the numerical parameters say otherwise.

Perhaps the most widely used of them is the Galbraith plot, which can be used for meta-analyses of both trials and observational studies. This graph represents the precision of each study against its standardized effect. It also shows the adjusted regression line and two confidence bands. The position of each study along the precision axis indicates its weighted contribution to the overall result, while a location outside the confidence bands indicates its contribution to heterogeneity.

The Galbraith plot can also be useful for detecting sources of heterogeneity, since the studies can be labeled according to different variables to see how each contributes to the overall heterogeneity.

Another tool available for meta-analyses of clinical trials is L'Abbé's plot. It represents the response rates to treatment against the response rates in the control group, plotting the studies on both sides of the diagonal. Above that line lie the studies with an outcome favorable to treatment, while below lie those with an outcome favorable to the control intervention. The studies are usually plotted with an area proportional to their precision, and their dispersion indicates heterogeneity. Sometimes the L'Abbé plot provides additional information. For example, in the accompanying graph you can see that the low-risk studies are located mainly below the diagonal, while the high-risk studies lie mainly in the area of positive treatment outcome. This distribution, as well as being suggestive of heterogeneity, may suggest that the efficacy of treatment depends on the level of risk or, put another way, that we have an effect-modifying variable in our study. A small drawback of this tool is that it is only applicable to meta-analyses of clinical trials with a dichotomous outcome variable.

Well, suppose we study the heterogeneity and decide that we are going to combine the studies to do a meta-analysis. The next step is to analyze the effect size estimates of the studies, weighting them according to the contribution each study will have on the overall result. This is logical: a trial with few participants and an imprecise result cannot contribute the same to the final result as another with thousands of participants and a more precise effect measure.

The most usual way to take these differences into account is to weight the estimate of the effect size by the inverse of the variance of the results, subsequently performing the analysis to obtain the average effect. For this there are several possibilities, some of them very complex from the statistical point of view, although the two most commonly used methods are the fixed effect model and the random effects model. The two models differ in their conception of the starting population from which the primary studies of the meta-analysis come.

The fixed effect model considers that there is no heterogeneity and that all the studies estimate the same population effect size (they all measure the same effect, which is why it is called fixed effect), so it is assumed that the variability observed among the individual studies is due only to the error made when performing random sampling in each study. This error is quantified by estimating the intra-study variance, assuming that the differences in the estimated effect sizes are due only to the use of samples of different subjects.

On the other hand, the random effects model assumes that the effect size varies in each study and follows a normal frequency distribution within the population, so each study estimates a different effect size. Therefore, in addition to the intra-study variance due to the error of random sampling, the model also includes the variability among studies, which would represent the deviation of each study from the mean effect size. These two error terms are independent of each other, both contributing to the variance of the study estimator.

In summary, the fixed effect model incorporates only one error term for the variability of each study, while the random effects model adds, in addition, another error term due to the variability among the studies.
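To make the difference concrete, here is a minimal sketch of both models with invented data. For the between-study variance of the random effects model it uses the DerSimonian-Laird estimator, one common choice among several:

```python
def pool(effects, variances, model="fixed"):
    """Inverse-variance pooled effect and its variance. The random effects
    model adds the DerSimonian-Laird between-study variance (tau2)."""
    w = [1 / v for v in variances]
    if model == "random":
        pooled_f = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
        q = sum(wi * (e - pooled_f) ** 2 for wi, e in zip(w, effects))
        k = len(effects)
        c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
        tau2 = max(0.0, (q - (k - 1)) / c)       # between-study variance
        w = [1 / (v + tau2) for v in variances]  # re-weight with both error terms
    pooled = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
    return pooled, 1 / sum(w)  # pooled effect and its variance

# Hypothetical effect sizes and variances of four trials:
effects = [0.2, 0.5, 0.5, 0.8]
variances = [0.01, 0.01, 0.01, 0.01]
fe, var_fe = pool(effects, variances, "fixed")
re, var_re = pool(effects, variances, "random")
print(f"fixed: {fe:.2f} (var {var_fe:.4f}), random: {re:.2f} (var {var_re:.4f})")
```

With identical per-study variances the two models give the same point estimate here, but the random effects variance is larger, which is why its confidence intervals come out wider.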

You see that I have not written a single formula. We do not actually need to know them, and they are quite unfriendly, full of Greek letters that no one understands. But do not worry: as always, statistical programs like RevMan from the Cochrane Collaboration allow you to do the calculations in a simple way, including and removing studies from the analysis and changing the model as you wish.

The choice of model has its importance. If the previous homogeneity analysis shows that the studies are homogeneous, we can use the fixed effect model. But if we detect heterogeneity, within the limits that still allow us to combine the studies, it will be preferable to use the random effects model.

Another consideration is the applicability or external validity of the results of the meta-analysis. If we have used the fixed effect model, it will be risky to generalize the results beyond populations with characteristics similar to those of the included studies. This does not occur with the results obtained with the random effects model, whose external validity is greater because they come from studies of different populations.

In any case, we will obtain a summary effect measure together with its confidence interval. This interval will be statistically significant when it does not cross the line of no effect, which we already know is zero for mean differences and one for odds ratios and risk ratios. In addition, the width of the interval informs us about the precision of the estimate of the average effect in the population: the wider it is, the less precise, and vice versa.
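For example, with an invented pooled log odds ratio and its variance, the interval and its significance can be checked like this:

```python
import math

# Hypothetical pooled log odds ratio and its variance from a meta-analysis.
log_or, var = -0.35, 0.02
se = math.sqrt(var)

# 95% confidence interval on the log scale, then back-transformed to the OR scale.
low, high = log_or - 1.96 * se, log_or + 1.96 * se
or_, or_low, or_high = (math.exp(x) for x in (log_or, low, high))

# On the OR scale the line of no effect is 1: the result is statistically
# significant when the interval does not include it.
significant = not (or_low <= 1 <= or_high)
print(f"OR = {or_:.2f} (95% CI {or_low:.2f} to {or_high:.2f}), significant: {significant}")
```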

If you think about it a bit, you will immediately understand why the random effects model is more conservative than the fixed effect model, in the sense that the confidence intervals obtained are wider, since it incorporates more variability in its analysis. In some cases the estimate may be significant with the fixed effect model and not significant with the random effects model, but this should not condition our choice of model. We must always rely on the previous measurement of heterogeneity although, if in doubt, we can also use both models and compare the results.

Having examined the homogeneity of the primary studies, we may come to the grim conclusion that heterogeneity dominates the situation. Can we do something to manage it? Sure we can. We can always choose not to combine the studies, or to combine them despite the heterogeneity and obtain a summary result; but, in that case, we should also calculate some measure of variability among studies, and even then we could not be sure of our results.

Another possibility is to do a stratified analysis according to the variable that causes the heterogeneity, provided we are able to identify it. We can also do a sensitivity analysis, repeating the calculations while removing the studies or subgroups one by one and checking how each influences the overall result. The problem is that this approach sidesteps the final purpose of any meta-analysis, which is none other than obtaining a pooled value from homogeneous studies.
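Such a leave-one-out sensitivity analysis can be sketched in a few lines; the effect sizes below are invented, with the last study acting as the outlier:

```python
def fixed_effect(effects, variances):
    """Inverse-variance (fixed effect) pooled estimate."""
    w = [1 / v for v in variances]
    return sum(wi * e for wi, e in zip(w, effects)) / sum(w)

# Hypothetical effect sizes; the fourth study is an outlier.
effects = [0.20, 0.25, 0.22, 0.90]
variances = [0.02, 0.02, 0.02, 0.02]

# Re-pool the remaining studies after removing each one in turn:
# a large jump in the pooled value flags the study driving heterogeneity.
for i in range(len(effects)):
    rest_e = effects[:i] + effects[i + 1:]
    rest_v = variances[:i] + variances[i + 1:]
    print(f"without study {i + 1}: pooled = {fixed_effect(rest_e, rest_v):.2f}")
```

Removing the fourth study drops the pooled estimate sharply, which points to it as the source of the heterogeneity.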

Finally, the brainiest on these issues can use meta-regression. This technique is similar to multivariate regression models: the characteristics of the studies are used as explanatory variables, and the effect variable, or some measure of the deviation of each study from the global result, is used as the dependent variable. Each study should also be weighted according to its contribution to the overall result, and we should avoid putting too many coefficients into the regression model if the number of primary studies is not large. I wouldn't advise you to do a meta-regression at home unless accompanied by seniors.
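For illustration only, a meta-regression with a single hypothetical study-level characteristic (say, mean dose) can be sketched as a weighted least squares fit. All the numbers below are invented, and the weights are the inverse variances, as in the meta-analysis itself:

```python
# Invented data: five studies, their effect estimates, variances,
# and one study-level characteristic (mean dose) as explanatory variable
effect = [0.15, 0.30, 0.45, 0.60, 0.80]
variance = [0.05, 0.04, 0.06, 0.05, 0.07]
dose = [10.0, 20.0, 30.0, 40.0, 50.0]

# Weight each study by the inverse of its variance
w = [1.0 / v for v in variance]

# Weighted least squares for effect = intercept + slope * dose
# (solved via the normal equations)
s_w = sum(w)
s_wx = sum(wi * x for wi, x in zip(w, dose))
s_wxx = sum(wi * x * x for wi, x in zip(w, dose))
s_wy = sum(wi * y for wi, y in zip(w, effect))
s_wxy = sum(wi * x * y for wi, x, y in zip(w, dose, effect))

slope = (s_w * s_wxy - s_wx * s_wy) / (s_w * s_wxx - s_wx ** 2)
intercept = (s_wy - slope * s_wx) / s_w
# A clearly nonzero slope suggests the covariate explains part of the
# heterogeneity; with only five studies, one covariate is already plenty
```

Note how the sketch honours the warning in the text: one explanatory variable for five studies is about as far as one should dare to go.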

And then we only need to check that we have not omitted any studies and that we have presented the results correctly. Meta-analysis data are usually represented in a specific graph known as a forest plot. But that is another story…

## Take care of the pennies, and the pounds will take care of themselves

All of you will know the Chinese tale about the poor lone grain of rice that falls to the ground and no one hears it. Of course, if instead of a grain it's a whole sack of rice that falls, that will be another matter. There are many examples that show how unity creates strength. A lone red ant is harmless, unless it bites you in some soft and noble zone, which usually are the most sensitive parts. But what about a swarm of millions of red ants? That scares the crap out of you, because if they all come at you together there's little you can do to stop them. Yes, the sum of many "fews" makes a "lot".

And the same is true of statistics. With the aid of a relatively small sample of well-chosen voters we can estimate who will win an election in which millions of people vote. So imagine what we could do with many of those samples. Surely the estimate would be more reliable and more generalizable.

Well, this is precisely one of the purposes of meta-analysis: using statistical techniques to come up with a quantitative synthesis from the results of a series of studies that aim to answer the same question but don't get exactly the same result.

We know we must check for heterogeneity among the studies before combining them because, otherwise, it would make little sense to do so and the results we would get wouldn't be valid or generalizable. For this purpose there are a number of methods, both numerical and graphical, to make sure we have the homogeneity we need.

The next step is to analyze the effect size estimates of the studies, weighting them according to the contribution of each to the pooled result. The most common way is to weight the effect size estimates by the inverse of their variance and then do the analysis to obtain an average effect. There are various possibilities for this, but the most commonly used methods are the fixed effects model and the random effects model. The two models differ in their assumptions about the original population from which the primary studies come.

The fixed effects model considers that there's no heterogeneity and that all studies estimate the same effect size in the same population. So it's assumed that the variability observed among individual studies is due solely to the error that occurs when performing random sampling in each study. This error is measured by estimating the intra-study variance, assuming that differences in effect size estimates are due only to the use of different samples of subjects.

On the other hand, the random effects model assumes that the effect size follows a normal frequency distribution in the population, so each study estimates a different effect size. Therefore, in addition to the intra-study variance due to random sampling, this model also includes the variability among studies, which represents the deviation of each study with respect to the average effect size. These two errors are mutually independent and both contribute to the variance of the estimates.

In summary, the fixed effects model incorporates only one error term for the variability in each study, while the random effects model further adds another error term due to the variability among studies.
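The two weighting schemes can be sketched in a few lines of Python. All the numbers below are made up for illustration, and the between-study variance tau2 is simply assumed rather than estimated (in practice it would be estimated, for example by the DerSimonian-Laird method):

```python
import math

# Invented effect sizes (e.g. mean differences) and within-study variances
effects = [0.20, 0.45, 0.10, 0.35]
var_within = [0.02, 0.05, 0.03, 0.04]

# Fixed effects model: the only error term is the within-study variance
w_fixed = [1.0 / v for v in var_within]

# Random effects model: a second error term, the between-study variance
# tau2, is added to every study's variance (assumed value, for illustration)
tau2 = 0.03
w_random = [1.0 / (v + tau2) for v in var_within]

def average_effect(effects, weights):
    """Weighted average effect and its standard error."""
    total = sum(weights)
    est = sum(w * e for w, e in zip(weights, effects)) / total
    return est, math.sqrt(1.0 / total)

est_f, se_f = average_effect(effects, w_fixed)
est_r, se_r = average_effect(effects, w_random)
# se_r > se_f: the extra error term makes the random effects
# estimate less precise
```

The only difference between the two models, as the summary above says, is that extra error term added to every study's variance.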

You can see I have not written a single formula. Actually, we don't need them, and they're quite unfriendly, filled with Greek letters that no one can understand. But don't worry. As always, statistical software will let you easily calculate the results, removing and adding studies from the model, as well as switching between models as we please.

The choice of model is important. If there's no heterogeneity we can use the fixed effects model. But if we find that our studies are heterogeneous, though not enough to advise against combining them, it is preferable to use the random effects model.

Another aspect to keep in mind is the applicability or external validity of the meta-analysis result. If we use the fixed effects model, it will not be safe to generalize the results to populations different from those of the included studies. This does not happen with the random effects model, whose external validity is higher because it takes into account different populations from different studies.

In any case, we'll come up with an average effect measure along with its confidence interval. This confidence interval won't be statistically significant if it crosses the line of no effect, which we already know is zero for mean differences and one for odds ratios and relative risks. In addition, the width of the interval will inform us about the precision of the estimated effect in the population: the wider the interval, the less precise the estimate, and vice versa.

If you think about it, you will understand why the random effects model is more conservative than the fixed effects model: the confidence intervals it produces are less precise, because it incorporates more variability into its analysis. In some cases the estimate could be significant using the fixed effects model and not significant using the random effects model, but that shouldn't be the deciding factor when choosing between them. We must always decide based on our previous heterogeneity study and, in case of doubt, we can use both methods and compare the results.

And now it only remains to present the results in a proper way. Meta-analysis results are usually represented using a specific chart called a forest plot. But that's another story…

## Variety is not always the spice of life

Variety is good for many things. How boring the world would be if we were all the same! (especially if we were all like someone who comes to my mind right now). We like to go to different places, eat different meals, meet different people and have fun in different settings. But there are things for which variety is a pain in the ass.

Suppose we have a set of clinical trials on the same topic and we want to perform a meta-analysis to obtain a global result. In this situation, we need as little variability as possible among the studies if we're going to combine them. Because, ladies and gentlemen, here it's not enough for the studies to stand side by side; they must also see eye to eye.

Before thinking about combining the studies from a systematic review to perform a meta-analysis, we should always carry out a preliminary assessment of the heterogeneity of the primary studies, which is just the measure of the variability that exists among the estimates obtained in each of those studies.

First, we'll investigate possible causes of heterogeneity, such as differences in treatment, variability among the populations of the different studies, and differences in trial designs.

Once we are convinced that the studies seem homogeneous enough to try to combine them, we should try to measure this heterogeneity in an objective way. To do this, various gifted brains have created a series of statistics that contribute to our common jungle of acronyms and initials.

Until recently, the most famous of those initials was Cochran's Q, which has nothing to do with either James Bond or our friend Archie Cochrane. Its calculation takes into account the sum of the deviations between each of the primary study results and the global outcome (squared, to avoid positives cancelling negatives), weighting each study according to its contribution to the overall result. It looks awesome but, actually, it's no big deal. Ultimately, it's no more than an aristocratic relative of the chi-square test. Indeed, Q follows a chi-square distribution with k-1 degrees of freedom (k being the number of primary studies). We calculate its value, look at the frequency distribution and estimate the probability that the differences are not due to chance, in order to reject our null hypothesis (which assumes that the observed differences among studies are due to chance). But, despite appearances, Q has a number of weaknesses.

First, it's a very conservative parameter, and we must always keep in mind that lack of statistical significance is not synonymous with absence of heterogeneity: in fact, all we can say is that we failed to reject the null hypothesis, so when we accept it we run the risk of committing a type II error and blundering. For this reason, some people propose using a significance level of p < 0.1 instead of the standard p < 0.05. Another of Q's pitfalls is that it doesn't quantify the degree of heterogeneity and, of course, doesn't explain the reasons that produce it. And, to top it off, Q loses power when the number of studies is small and doesn't allow comparisons between meta-analyses with different numbers of studies.

This is why another statistic has been devised that is much more celebrated today: I2. This parameter provides an estimate of the variation among studies with respect to the total variability or, put another way, the proportion of variability actually due to heterogeneity, that is, to real differences among the estimates, rather than to chance. It also looks impressive, but it's actually an advantageous relative of the intraclass correlation coefficient.

Its value ranges from 0 to 100%, and we usually take 25%, 50% and 75% as the limits for low, moderate and high heterogeneity, respectively. I2 is affected neither by the units in which the effect is measured nor by the number of studies, so it does allow comparisons between meta-analyses with different effect measures or different numbers of studies.

If you read a study that provides Q and you want to calculate I2, or vice versa, you can use the following formula, k being the number of primary studies:

$I^{2}=\frac{Q-k+1}{Q}$

There's a third parameter that is less known, but no less worthy of mention: H2. It measures the excess of the Q value with respect to the value we would expect to obtain if there were no heterogeneity. Thus, a value of 1 means no heterogeneity, and its value increases as the heterogeneity among studies does. But its real interest is that it allows us to calculate confidence intervals for I2.

Don't worry about the calculations of Q, I2 and H2. There is specific software available to do them, such as RevMan, as well as modules that work with the usual statistical programs.
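Still, the calculations behind Q, I2 and H2 are simple enough to sketch by hand. A toy example in Python, with invented effect sizes and variances (not real trials):

```python
# Invented effect sizes and variances from six hypothetical studies
effects = [0.12, 0.55, -0.10, 0.40, 0.25, 0.70]
variances = [0.03, 0.06, 0.04, 0.05, 0.03, 0.08]

# Weight each study by the inverse of its variance
weights = [1.0 / v for v in variances]
pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)

# Cochran's Q: weighted sum of squared deviations from the pooled result
q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))

k = len(effects)                    # number of primary studies
i2 = max(0.0, (q - (k - 1)) / q)    # proportion of variability due to heterogeneity
h2 = q / (k - 1)                    # 1 means no heterogeneity at all
```

With these numbers I2 comes out at about 38%, which by the usual limits would be read as moderate heterogeneity.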

And now I want to call your attention to one point: you must always remember that failing to demonstrate heterogeneity does not automatically mean that the studies are homogeneous. The problem is that the null hypothesis assumes they are homogeneous and that the differences we observe are due to chance. If we can reject the null hypothesis, we can be sure that there's heterogeneity. But this doesn't work the other way around: if we cannot reject it, it simply means that we cannot rule out heterogeneity, and there is always a probability of making a type II error if we directly assume that the studies are homogeneous.

This is the reason why some people have devised a series of graphical methods to inspect the results and check for evidence of heterogeneity, no matter what the numerical parameters say.

The most employed of them is, perhaps, the Galbraith plot, which can be used for meta-analyses of both trials and observational studies. This graph plots the precision of each study against its standardized effect. It also shows the fitted regression line and two confidence bands. The position of each study along the precision axis indicates its weighted contribution to the overall result, while its location outside the confidence bands indicates its contribution to heterogeneity.

The Galbraith plot can also be useful for detecting sources of heterogeneity, since the studies can be labeled according to different variables to see how each of them contributes to the overall heterogeneity.
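The quantities behind a Galbraith plot are easy to compute. A minimal sketch with invented studies (no actual drawing, just the coordinates and a crude version of the ±2 band rule):

```python
# Invented studies: effect estimates and their standard errors
effects = [0.20, 0.80, -0.30, 0.35]
se = [0.10, 0.15, 0.12, 0.20]

# Galbraith plot coordinates: precision (1/SE) on the x axis,
# standardized effect (z = effect/SE) on the y axis
precision = [1.0 / s for s in se]
z = [e / s for e, s in zip(effects, se)]

# Unweighted regression of z on precision through the origin;
# its slope equals the fixed effect pooled estimate
slope = sum(x * zi for x, zi in zip(precision, z)) / sum(x * x for x in precision)

# Studies falling outside the band of +/-2 around the line are the
# ones contributing most to heterogeneity
residuals = [zi - slope * x for x, zi in zip(precision, z)]
outliers = [i for i, r in enumerate(residuals) if abs(r) > 2]
```

In this toy example the second and third studies fall outside the bands, so they would be the first candidates to examine when hunting for sources of heterogeneity.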

Another tool available for meta-analyses of clinical trials is the L'Abbé plot. It plots response rates in the treatment group against response rates in the control group, placing the studies on either side of the diagonal. Above that line are studies with an outcome favorable to treatment, while below it are studies with an outcome favorable to the control intervention. The studies are usually plotted with an area proportional to their precision, and their dispersion indicates heterogeneity. Sometimes the L'Abbé plot provides additional information. For example, in the accompanying graph you can see that the low-risk studies are located mainly below the diagonal, while the high-risk studies are mainly located in the area of favorable treatment outcome. This distribution, as well as being suggestive of heterogeneity, may suggest that the efficacy of the treatment depends on the level of risk or, put another way, that we have an effect-modifying variable in our study.
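A sketch of the L'Abbé coordinates, with invented trial counts, just to show which side of the diagonal each study falls on:

```python
# Invented trials: (events_treatment, n_treatment, events_control, n_control)
trials = [
    (30, 100, 20, 100),
    (15, 80, 18, 80),
    (45, 120, 25, 120),
]

sides = []
for events_t, n_t, events_c, n_c in trials:
    rate_t = events_t / n_t  # response rate with treatment (y axis)
    rate_c = events_c / n_c  # response rate in the control group (x axis)
    # Above the diagonal the outcome favours the treatment;
    # below it, the control intervention
    sides.append("above" if rate_t > rate_c else "below")
```

Here the first and third trials land above the diagonal (favoring treatment) and the second below it; scattered like this, the plot itself would already hint at heterogeneity.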

Having examined the homogeneity of the primary studies, we may come to the grim conclusion that heterogeneity reigns supreme in our situation. Can we do something to manage it? Sure, we can. We can always decide not to combine the studies, or to combine them despite the heterogeneity and obtain a summary result; but in that case we should also calculate some measure of the variability among studies, and even then we could not be fully sure of our results.

Another possibility is to do a stratified analysis according to the variable that causes the heterogeneity, provided that we are able to identify it. We can also do a sensitivity analysis, repeating the calculations while removing each of the subgroups one by one and checking how its removal influences the overall result. The problem is that this approach sidesteps the final purpose of any meta-analysis, which is none other than obtaining an overall value from homogeneous studies.
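A leave-one-out sensitivity analysis of this kind can be sketched as follows (invented numbers, with one study deliberately discordant):

```python
# Leave-one-out sensitivity analysis: recompute the pooled fixed effect
# estimate removing one study at a time. Study 2 (effect 0.90) is an
# intentional outlier in this made-up data set.
effects = [0.10, 0.15, 0.90, 0.12, 0.18]
variances = [0.02, 0.03, 0.04, 0.02, 0.03]

def pooled(es, vs):
    """Inverse-variance weighted average effect."""
    ws = [1.0 / v for v in vs]
    return sum(w * e for w, e in zip(ws, es)) / sum(ws)

overall = pooled(effects, variances)
shifts = []
for i in range(len(effects)):
    without = pooled(effects[:i] + effects[i + 1:],
                     variances[:i] + variances[i + 1:])
    # A large shift when a study is removed flags it as a likely
    # source of heterogeneity
    shifts.append(without - overall)
```

As expected, removing the discordant study moves the overall result far more than removing any of the others, pointing to it as the source of the trouble.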

Finally, the brainiest on these issues can use meta-regression. This technique is similar to multivariate regression models: the characteristics of the studies are used as explanatory variables, and the effect variable, or some measure of the deviation of each study from the global result, is used as the dependent variable. Each study should also be weighted according to its contribution to the overall result, and we should avoid putting too many coefficients into the regression model if the number of primary studies is not large. I wouldn't advise you to do a meta-regression at home unless accompanied by seniors.

And we're done for now. Congratulations to those who have endured this far. I apologize for the lecture I have unloaded on you, but heterogeneity deserves it. It is not only important for deciding whether or not to combine the studies; it is also key to deciding which data analysis model we have to use. But that's another story…