NNT calculation in meta-analysis

Even the greatest have weaknesses. It is a reality that affects even the great NNT, the number needed to treat, without a doubt the king of the measures of absolute impact of the research methodology in clinical trials.

Of course, that is not an irreparable disgrace. We only have to be well aware of its strengths and weaknesses in order to take advantage of the former and try to mitigate and control the latter. The main weakness is that the NNT depends on the baseline risks of the intervention and control groups, which can be inconsistent fellow travelers and vary because of several factors.

As we all know, NNT is an absolute measure of effect that is used to estimate the efficacy or safety of an intervention. This parameter, just like a good marriage, is useful in good times and in bad, in sickness and in health.

Thus, on the good side we talk about NNT, which is the number of patients that have to be treated for one to present a result that we consider as good. By the way, on the dark side we have the number needed to harm (NNH), which indicates how many we have to treat in order for one to present an adverse event.

NNT was originally designed to describe the effect of the intervention relative to the control group in clinical trials, but its use was later extended to interpret the results of systematic reviews and meta-analyses. And this is where the problem may arise since, sometimes, the method used to calculate it in trials is carried over to meta-analyses, which can lead to error.

NNT calculation in meta-analysis

The simplest way to obtain the NNT is to calculate the inverse of the absolute risk reduction between the intervention and the control group. The problem is that this form is the one that is most likely to be biased by the presence of factors that can influence the value of the NNT. Although it is the king of absolute measures of impact, it also has its limitations, with various factors influencing its magnitude, not to mention its clinical significance.

One of these factors is the duration of the study follow-up period. This duration can influence the number of events, good or bad ones, that the study participants can present, which makes it incorrect to compare the NNTs of studies with follow-ups of different duration.

Another may be the baseline risk of presenting the event. Keep in mind that the term “risk”, from a statistical point of view, does not always imply something bad. We can speak, for example, of the risk of cure. If the baseline risk is higher, more events will likely occur and the NNT may be lower. The outcome variable used and the treatment alternative with which we compare the intervention should also be taken into account.

And third, to name a few more of these factors, the direction and size of the effect, the scale of measurement, and the precision of the NNT estimates, their confidence intervals, may influence its value.

Control event rate

And here the problem arises with systematic reviews and meta-analyses. Even though we might wish otherwise, there will always be some heterogeneity among the primary studies in the review, so the factors we have discussed may differ among studies. At this point, it is easy to understand that estimating the global NNT from the summary measures of risk in the two groups may not be the most suitable approach, since it is highly influenced by variations in the baseline control event rate (CER).

For these situations, it is much more advisable to make other more robust estimates of the NNT, the most widely used being those that use other association measures such as the risk ratio (RR) or the odds ratio (OR), which are more robust in the face of variations in CER. In the attached figure I show you the formulas for the calculation of the NNT using the different measures of association and effect.
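Since the figure may not be at hand, here is a minimal sketch of the usual formulas, written as small Python functions. The function names are mine, and the sketch assumes an undesirable event with a protective effect (RR or OR below 1); with values above 1 the sign flips and we would be talking about an NNH.

```python
import math

def nnt_from_arr(cer: float, eer: float) -> float:
    """NNT as the inverse of the absolute risk reduction (CER - EER)."""
    arr = cer - eer
    return math.inf if arr == 0 else 1.0 / arr

def nnt_from_rr(cer: float, rr: float) -> float:
    """NNT from a risk ratio and an assumed control event rate."""
    return 1.0 / (cer * (1.0 - rr))

def nnt_from_or(cer: float, odds_ratio: float) -> float:
    """NNT from an odds ratio and an assumed control event rate."""
    numerator = 1.0 - cer * (1.0 - odds_ratio)
    denominator = cer * (1.0 - cer) * (1.0 - odds_ratio)
    return numerator / denominator

# Illustrative values only: CER = 0.20 and EER = 0.15.
cer, eer = 0.20, 0.15
rr = eer / cer
odds_ratio = (eer / (1 - eer)) / (cer / (1 - cer))
print(nnt_from_arr(cer, eer), nnt_from_rr(cer, rr), nnt_from_or(cer, odds_ratio))
```

The three routes give the same NNT when they are fed the same CER. The point of the RR and OR versions is that we can plug in whatever CER we consider most representative.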

In any case, we must not lose sight of the recommendation not to carry out a quantitative synthesis or calculate summary measures if there is significant heterogeneity among the primary studies, since the global estimates will then be unreliable, whatever we do.

But do not think that we have solved the problem. We cannot finish this post without mentioning that these alternative methods for calculating NNT also have their weaknesses. Those have to do with obtaining an overall CER summary value, which also varies among primary studies.

The simplest way would be to divide the sum of events in the control groups of the primary studies by the total number of participants in that group. This is usually possible simply by taking the data from the meta-analysis’ forest plot. However, this method is not recommended, as it completely ignores the variability among studies and possible differences in randomization.

Another more correct way would be to calculate the mean or median of the CER of all the primary studies and, even better, to calculate some weighted measure based on the variability of each study.

And even, if baseline risk variations among studies are very important, an estimate based on the investigator’s knowledge or other studies could be used, as well as using a range of possible CER values and comparing the differences among the different NNTs that could be obtained.
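To make these options concrete, here is a rough sketch that compares the crude pooled CER with the mean and median of the study CERs, and then looks at how the NNT moves across a range of plausible CER values. The control-arm counts and the RR of 0.75 are invented for illustration, and the function names are hypothetical.

```python
from statistics import mean, median

# Hypothetical control-arm data: (events, participants) for each primary study.
control_arms = [(12, 100), (30, 150), (5, 80), (45, 300)]

def pooled_cer(arms):
    """Crude pooling: total events over total participants (not recommended)."""
    return sum(events for events, _ in arms) / sum(n for _, n in arms)

def study_cers(arms):
    """Control event rate of each primary study."""
    return [events / n for events, n in arms]

def nnt_from_rr(cer, rr):
    """NNT from a risk ratio below 1 and an assumed control event rate."""
    return 1.0 / (cer * (1.0 - rr))

def nnt_range(cer_values, rr):
    """NNT obtained for each candidate CER, keeping the pooled RR fixed."""
    return {cer: round(nnt_from_rr(cer, rr), 1) for cer in cer_values}

cers = study_cers(control_arms)
print("crude:", pooled_cer(control_arms), "mean:", mean(cers), "median:", median(cers))
print(nnt_range([0.05, 0.10, 0.20], rr=0.75))
```

With the same RR, the NNT goes from 80 to 20 as the CER goes from 0.05 to 0.20, which is exactly why the choice of CER deserves so much care.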

You have to be very careful with the variance weighting methods of the studies, since the CER has the bad habit of not following a normal distribution, but a binomial one. The problem with the binomial distribution is that its variance depends greatly on the mean of the distribution, being maximum in mean values around 0.5.

On the contrary, the variance decreases if the mean is close to 0 or 1, so all the variance-based weighting methods will assign a greater weight to the studies the more their mean separates from 0.5 (remember that CER can range from 0 to 1, like any other probability value). For this reason, it is necessary to carry out a transformation so that the values approach a normal instead of a binomial distribution and thus be able to carry out the weighting.
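A small sketch may help to see the problem. It computes the variance of a proportion and the inverse-variance weight it would receive, and compares it with the plain arcsine square-root transformation, whose approximate variance, 1/(4n), no longer depends on the proportion. I use this simple transformation only as an illustration; it is an assumption of mine, not necessarily the method that any given package applies.

```python
import math

def binomial_variance(p: float, n: int) -> float:
    """Variance of an observed proportion p among n participants."""
    return p * (1.0 - p) / n

def arcsine_transform(p: float, n: int):
    """Arcsine square-root transform and its approximate variance, 1/(4n)."""
    return math.asin(math.sqrt(p)), 1.0 / (4.0 * n)

n = 100  # same sample size for every hypothetical study
for p in (0.05, 0.25, 0.50, 0.75, 0.95):
    raw_weight = 1.0 / binomial_variance(p, n)
    _, transformed_variance = arcsine_transform(p, n)
    print(f"CER={p:.2f} raw weight={raw_weight:8.1f} "
          f"transformed weight={1.0 / transformed_variance:6.1f}")
```

On the raw scale, studies with a CER near 0 or 1 get much more weight than studies of the same size with a CER near 0.5. After the transformation, equal-sized studies get equal weight.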

We’re leaving…

And I think we will leave it here for today. We are not going to go into the methods to transform the CER, such as the double arcsine or the application of generalized linear mixed models, since that is for the most exclusive minds, among which mine is not included. Anyway, don’t get stuck on this. I advise you to calculate the NNT using statistical packages or calculators, such as Calcupedev. There are other uses of NNT that we could also comment on and that can be obtained with these tools, as is the case with NNT in survival studies. But that is another story…

Critical appraisal of meta-analysis

Yes, I know that the saying goes just the opposite. But that is precisely the problem we have with so much new information technology. Today anyone can write and make public what goes through his head, reaching a lot of people, although what he says is bullshit (and no, I do not take this personally, not even my brother-in-law reads what I post!). The trouble is that much of what is written is not worth a bit, not to refer to any type of excreta. There is a lot of smoke and little fire, when we all would like the opposite to happen.

The same happens in medicine when we need information to make some of our clinical decisions. Whatever source we go to, the volume of information will not only overwhelm us but, above all, most of it will be of no use to us at all. Also, even if we find a well-done article, it may not be enough to answer our question completely. That’s why we love the literature reviews that some generous souls publish in medical journals so much. They save us the task of reviewing a lot of articles and summarizing the conclusions. Great, isn’t it? Well, sometimes it is, sometimes it is not. As when we read any other type of study in the medical literature, we should always make a critical appraisal and not rely solely on the know-how of its authors.

Reviews, of which we already know there are two types, also have their limitations, which we must know how to assess. The simplest form of review, our favorite when we are younger and more ignorant, is what is known as a narrative review or author’s review. This type of review is usually done by an expert in the topic, who reviews the literature, analyzes what she finds as she sees fit (that is why she is an expert) and sums up the qualitative synthesis with her expert’s conclusions. These reviews are good for getting a general idea about a topic, but they do not usually serve to answer specific questions. In addition, since it is not specified how the information search was done, we cannot reproduce it or verify that it includes everything important that has been written on the subject. We can do little critical appraisal of these reviews, since there is no precise systematization of how these summaries have to be prepared, so we will have to trust unreliable aspects such as the prestige of the author or the impact of the journal where it is published.

As our knowledge of the general aspects of science increases, our interest shifts towards other types of reviews that provide us with more specific information about aspects that escape our increasingly wide knowledge. This other type of review is the so-called systematic review (SR), which focuses on a specific question, follows a clearly specified methodology for searching and selecting information, and performs a rigorous and critical analysis of the results found. Moreover, when the primary studies are sufficiently homogeneous, the SR goes beyond the qualitative synthesis and also performs a quantitative synthesis, which has the nice name of meta-analysis. We can critically appraise these reviews following an ordered and pre-established methodology, in a similar way as we do with other types of studies.

The prototype of SR is the one made by the Cochrane Collaboration, which has developed a specific methodology that you can consult in the manuals available on its website. But, if you want my advice, do not trust even Cochrane: make a careful critical appraisal even if the review has been done by them, and do not take it for granted simply because of its origin. As one of my teachers in these disciplines says (I’m sure he’s smiling if he’s reading these lines), there is life after Cochrane. And, besides, there is a lot of it, and good, I would add.

Critical appraisal of meta-analyses

Although SRs and meta-analyses command a bit of respect at the beginning, do not worry: they can be critically appraised in a simple way by considering the main aspects of their methodology. And to do it, nothing better than to systematically review our three pillars: validity, relevance and applicability.

Regarding VALIDITY, we will try to determine whether the review gives us unbiased results and responds correctly to the question posed. As always, we will look for some primary validity criteria. If these are not fulfilled, we will consider whether it is time to walk the dog: we will probably make better use of our time.

Has the aim of the review been clearly stated? All SRs should try to answer a specific question that is relevant from the clinical point of view, and that usually arises following the PICO scheme of a structured clinical question. It is preferable that the review tries to answer only one question, since if it tries to answer several there is a risk of not answering any of them adequately. This question will also determine the type of studies that the review should include, so we must assess whether the appropriate type has been included. Although it is most common to find SRs of clinical trials, they can include other types of studies, such as observational studies, diagnostic test studies, etc. The authors of the review must specify the criteria for inclusion and exclusion of the studies, in addition to considering aspects such as their setting, study groups, results, etc. Differences among the included studies in terms of (P) patients, (I) intervention or (O) outcomes can make two SRs that ask the same question reach different conclusions.

If the answer to the two previous questions is affirmative, we will consider the secondary criteria and leave the dog’s walk for later. Have the important studies on the subject been included? We must verify that a global and unbiased search of the literature has been carried out. The electronic search usually covers the most important databases (generally PubMed, Embase and the Cochrane Library), but it must be completed with a search strategy in other media to look for other works (references of the articles found, contact with well-known researchers, the pharmaceutical industry, national and international registries, etc.), including the so-called gray literature (theses, reports, etc.), since there may be important unpublished works. And let no one be surprised by the latter: it has been shown that studies with negative conclusions are at greater risk of not being published, so they do not appear in the SR. We must verify that the authors have ruled out the possibility of this publication bias. In general, this entire selection process is usually captured in a flow diagram that shows what happened to all the studies assessed in the SR.

It is very important that enough has been done to assess the quality of the studies, looking for possible biases. For this, the authors can use a tool designed ad hoc or, more usually, resort to one that is already recognized and validated, such as the Cochrane Collaboration’s risk of bias tool, in the case of reviews of clinical trials. This tool assesses five criteria of the primary studies to determine their risk of bias: adequate randomization sequence (prevents selection bias), adequate masking (prevents performance and detection biases, both information biases), concealment of allocation (prevents selection bias), losses to follow-up (prevents attrition bias) and selective reporting of results (prevents reporting bias). The studies are classified as high, low or unclear risk of bias according to the most important aspects of the design’s methodology (clinical trials in this case).

In addition, this must be done independently by two authors and, ideally, without knowing the authors of the primary studies or the journals where they were published. Finally, the degree of agreement between the two reviewers should be recorded, along with what they did if they did not agree (the most common is to resort to a third party, who will probably be the boss of both).

To conclude with the internal or methodological validity, if the results of the studies have been combined to draw common conclusions with a meta-analysis, we must ask ourselves whether it was reasonable to combine the results of the primary studies. In order to draw conclusions from combined data, it is fundamental that the studies are homogeneous and that the differences among them are due solely to chance. Although some variability among the studies increases the external validity of the conclusions, we cannot pool the data for the analysis if there is a lot of variability. There are numerous methods to assess homogeneity that we are not going to go into now, but we do insist on the need for the authors of the review to have studied it adequately.

In summary, the fundamental aspects that we will have to analyze to assess the validity of a SR will be: 1) that the aims of the review are well defined in terms of population, intervention and measurement of the result; 2) that the bibliographic search has been exhaustive; 3) that the criteria for inclusion and exclusion of primary studies in the review have been adequate; and 4) that the internal or methodological validity of the included studies has also been verified. In addition, if the SR includes a meta-analysis, we will review the methodological aspects that we saw in a previous post: the suitability of combining the studies to make a quantitative synthesis, the adequate evaluation of the heterogeneity of the primary studies and the use of a suitable mathematical model to combine the results of the primary studies (you know, that of the fixed effect and random effects models).

Regarding the RELEVANCE of the results, we must consider what the overall result of the review is and whether it has been interpreted judiciously. The SR should provide a global estimate of the effect of the intervention based on a weighted average of the good-quality studies included. Most often, relative measures such as the risk ratio or the odds ratio are reported, although ideally they should be complemented with absolute measures such as the absolute risk reduction or the number needed to treat (NNT). In addition, we must assess the precision of the results, for which we will use our beloved confidence intervals, which give us an idea of the precision of the estimate of the true magnitude of the effect in the population. As you can see, the way of assessing the importance of the results is practically the same as for the results of the primary studies. In this case we give examples from clinical trials, which is the type of study that we will see most frequently, but remember that there may be other types of studies that express the relevance of their results better with other parameters. Of course, confidence intervals will always help us to assess the precision of the results.
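As an illustration of what that weighted average looks like, here is a minimal sketch of an inverse-variance pooled risk ratio with its 95% confidence interval. It assumes a fixed effect model, which is my simplification, and the trial counts are invented.

```python
import math

# Hypothetical trials: (events_treated, n_treated, events_control, n_control).
trials = [(15, 100, 20, 100), (22, 150, 30, 150), (8, 80, 12, 80)]

def log_rr_and_variance(a, n1, c, n2):
    """Log risk ratio of one trial and its approximate variance."""
    log_rr = math.log((a / n1) / (c / n2))
    variance = 1 / a - 1 / n1 + 1 / c - 1 / n2
    return log_rr, variance

def pooled_rr(trials, z=1.96):
    """Inverse-variance weighted average of log RRs (fixed effect model)."""
    weighted_sum = 0.0
    total_weight = 0.0
    for trial in trials:
        log_rr, variance = log_rr_and_variance(*trial)
        weight = 1.0 / variance  # more precise studies weigh more
        weighted_sum += weight * log_rr
        total_weight += weight
    pooled = weighted_sum / total_weight
    se = math.sqrt(1.0 / total_weight)
    return math.exp(pooled), math.exp(pooled - z * se), math.exp(pooled + z * se)

rr, lower, upper = pooled_rr(trials)
print(f"Pooled RR = {rr:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

The pooled interval is narrower than that of any single trial, which is the whole point of combining them, provided that combining them was reasonable in the first place.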

The results of meta-analyses are usually represented in a standardized way, using the so-called forest plot. A graph is drawn with a vertical line of no effect (at one for the risk ratio and odds ratio, and at zero for mean differences) and each study is represented as a mark (its result) in the middle of a segment (its confidence interval). Studies with statistically significant results are those whose segment does not cross the vertical line. Generally, the most powerful studies have narrower intervals and contribute more to the overall result, which is expressed as a diamond whose lateral ends represent its confidence interval. Only diamonds that do not cross the vertical line are statistically significant. Also, the narrower the interval, the more precise the result. And, finally, the further away from the line of no effect, the clearer the difference between the treatments or the exposures compared.
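The rule about crossing the vertical line can be written in a couple of lines. This is only a sketch with hypothetical names and invented intervals: the null value is 1 for ratios and 0 for differences.

```python
def crosses_null(lower: float, upper: float, measure: str = "ratio") -> bool:
    """True if the confidence interval includes the line of no effect."""
    null_value = 1.0 if measure == "ratio" else 0.0
    return lower <= null_value <= upper

# Hypothetical studies: (label, lower limit, upper limit) of a risk ratio.
studies = [("Study A", 0.55, 0.92), ("Study B", 0.70, 1.15), ("Overall", 0.62, 0.90)]
for label, lower, upper in studies:
    verdict = "not significant" if crosses_null(lower, upper) else "significant"
    print(f"{label}: {lower:.2f} to {upper:.2f} -> {verdict}")
```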

If you want a more detailed explanation of the elements that make up a forest plot, you can go to the previous post where we explained it or to the online manuals of the Cochrane Collaboration.

We will conclude the critical appraisal of the SR by assessing the APPLICABILITY of the results to our environment. We will have to ask ourselves whether we can apply the results to our patients and how they will influence the care we give them. We will have to see if the primary studies of the review describe the participants and if they resemble our patients. In addition, although we have already said that it is preferable for the SR to focus on a specific question, we will need to see whether all the results relevant to decision making in the problem under study have been considered, since sometimes it will be useful to consider some additional secondary variable. And, as always, we must assess the benefit-cost-risk ratio. The fact that the conclusion of the SR seems valid does not mean that we are obliged to apply it.

If you want to correctly evaluate an SR without forgetting any important aspect, I recommend that you use a checklist such as PRISMA’s or some of the tools available on the Internet, such as the templates that can be downloaded from the page, which are the ones we have used for everything we have said so far.

PRISMA statement

The PRISMA statement (Preferred Reporting Items for Systematic reviews and Meta-Analyses) consists of 27 items, classified in 7 sections that refer to the sections of title, summary, introduction, methods, results, discussion and financing:

1. Title: it must be identified as SR, meta-analysis or both. If it is also specified that it deals with clinical trials, it will be prioritized over other types of reviews.
2. Summary: it should be a structured summary that includes background, objectives, data sources, inclusion criteria, limitations, conclusions and implications. The registration number of the review must also be included.
3. Introduction: includes two items, the justification of the study (what is known, controversies, etc.) and the objectives (what question it tries to answer, in the PICO terms of the structured clinical question).
4. Methods. It is the section with the largest number of items (12):

– Protocol and registration: indicate the registration number and its availability.

– Eligibility criteria: justification of the characteristics of the studies and the search criteria used.

– Sources of information: describe the sources used and the last search date.

– Search: complete electronic search strategy, so that it can be reproduced.

– Selection of studies: specify the selection process and the inclusion and exclusion criteria.

– Data extraction process: describe the methods used to extract the data from the primary studies.

– Data list: define the variables used.

– Risk of bias in primary studies: describe the method used and how it has been used in the synthesis of results.

– Summary measures: specify the main summary measures used.

– Results synthesis: describe the methods used to combine the results.

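– Risk of bias between studies: describe biases that may affect cumulative evidence, such as publication bias.

– Additional analyses: describe the methods of any additional analyses (sensitivity or subgroup analyses, meta-regression).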

5. Results. Includes 7 items:

– Selection of studies: it is expressed through a flow chart that assesses the number of records in each stage (identification, screening, eligibility and inclusion).

– Characteristics of the studies: present the characteristics of the studies from which data were extracted and their bibliographic references.

– Risk of bias in the studies: communicate the risks in each study and any evaluation that is made about the bias in the results.

– Results of the individual studies: study data for each study or intervention group and estimation of the effect with their confidence interval. The ideal is to accompany it with a forest plot.

– Synthesis of the results: present the results of all the meta-analyses performed with the confidence intervals and the consistency measures.

– Risk of bias between the studies: present any evaluation that is made of the risk of bias between the studies.

– Additional analyses: if they have been carried out, provide their results.

6. Discussion. Includes 3 items:

– Summary of the evidence: summarize the main findings with the strength of the evidence of each main result and the relevance from the clinical point of view or of the main interest groups (care providers, users, health decision-makers, etc.).

– Limitations: discuss the limitations of the results, the studies and the review.

– Conclusions: general interpretation of the results in the context of other evidence and their implications for future research.

7. Financing: describe the sources of funding and the role they played in the realization of the SR.

As a third option to these two tools, you can also use the aforementioned Cochrane Handbook for Systematic Reviews of Interventions, available on its website, whose purpose is to help authors of Cochrane reviews to work explicitly and systematically.

We’re leaving…

As you can see, we have said practically nothing about meta-analysis, with all its statistical techniques to assess homogeneity and its fixed and random effects models. The thing is that meta-analysis is a beast that must be eaten separately, so we have already devoted two posts to it alone, which you can check whenever you want. But that is another story…

Publication bias

Achilles. What a man! Definitely one of the main characters among those who were in that mess that ensued in Troy because of Helen, a.k.a. the beauty. You know his story. In order to make him invulnerable his mother, who was none other than Thetis, the nymph, bathed him in ambrosia and submerged him in the River Styx. But she made a mistake that no nymph should have been allowed: she held him by his right heel, which did not get wet with the river’s water. And so, his heel became his only vulnerable part. Hector didn’t realize it in time but Paris, totally on the ball, put an arrow in Achilles’ heel and sent him back to the Styx, not into the water this time, but to the other side. And without Charon the Ferryman.

This story is the origin of the expression “Achilles’ heel”, which usually refers to the weakest or most vulnerable point of someone or something that, otherwise, is usually known for its strength.

Publication bias

For example, something as robust and formidable as meta-analysis has its Achilles heel: the publication bias. And that’s because in the world of science there is no social justice.

All scientific works should have the same opportunities to be published and achieve fame, but the reality is not at all like that and they can be discriminated against for four reasons: statistical significance, popularity of the topic they are dealing with, having someone to sponsor them and the language in which they are written.

These are the main factors that can contribute to publication bias. First, studies with more significant results are more likely to be published and, within these, they are more likely to be published when the effect is greater. This means that studies with negative results or effects of small magnitude may not be published, so we will draw a biased conclusion from an analysis that includes only large studies with positive results. In the same way, papers on topics of public interest are more likely to be published regardless of the importance of their results. In addition, the sponsor also has an influence: a company that finances a study of one of its products that has gone wrong is probably not going to publish it so that we all know that the product is not useful.

Secondly, as is logical, published studies are more likely to reach our hands than those that are not published in scientific journals. This is the case of doctoral theses, communications to congresses, reports from government agencies or even studies still pending publication by researchers working on the subject. For this reason it is so important to do a search that includes this type of work, which falls under the term grey literature.

Finally, we can list a series of biases that influence the likelihood that a work will be published or retrieved by the researcher performing the systematic review, such as language bias (the search is limited by language), availability bias (only those studies that are easy for the researcher to retrieve are included), cost bias (only studies that are free or cheap are included), familiarity bias (only those from the researcher’s discipline are included), duplication bias (those with significant results are more likely to be published more than once) and citation bias (studies with significant results are more likely to be cited by other authors).

One may think that this loss of studies during the review cannot be so serious, since it could be argued, for example, that studies not published in peer-reviewed journals are usually of poorer quality, so they do not deserve to be included in the meta-analysis. However, it is not clear either that scientific journals ensure the methodological quality of the study or that this is the only way to do so. There are researchers, like those of government agencies, who are not interested in publishing in scientific journals, but in preparing reports for those who commission them. In addition, peer review is not a guarantee of quality since, too often, neither the researcher who carries out the study nor those in charge of reviewing it have training in methodology that ensures the quality of the final product.

All this can be worsened by the fact that these same factors can influence the inclusion and exclusion criteria of the meta-analysis primary studies, in such a way that we obtain a sample of articles that may not be representative of the global knowledge on the subject of the systematic review and meta-analysis.

If we have a publication bias, the applicability of the results will be seriously compromised. That is why we say that the publication bias is the true Achilles’ heel of meta-analysis.

If we correctly delimit the inclusion and exclusion criteria of the studies and do a global and unrestricted search of the literature we will have done everything possible to minimize the risk of bias, but we can never be sure of having avoided it. That is why techniques and tools have been devised for its detection.

Publication bias study

The most widely used has the friendly name of funnel plot. It plots the magnitude of the measured effect (X axis) against a measure of precision (Y axis), which is usually the sample size, but which can also be the inverse of the variance or the standard error. We represent each primary study with a point and observe the point cloud.

In the most usual form, with the sample size on the Y axis, the precision of the results will be higher in the studies with larger samples, so the points will be closer together in the upper part of the axis and will spread out as they approach the origin of the Y axis. In this way, we observe a cloud of points in the shape of a funnel, with the wide part down. This graph should be symmetrical and, if that is not the case, we should always suspect a publication bias. In the second example attached you can see how there are “missing” studies on the side of lack of effect: this may mean that only studies with positive results are published.

This method is very simple to use but, sometimes, we can have doubts about the asymmetry of our funnel, especially if the number of studies is small. In addition, the funnel can be asymmetrical due to quality defects in the studies or because we are dealing with interventions whose effect varies according to the sample size of each study. For these cases, other more objective methods have been devised, such as Begg’s rank correlation test and Egger’s linear regression test.

Begg’s test studies the presence of an association between the effect estimates and their variances. If there is a correlation between them, bad news. The problem with this test is that it has little statistical power, so it is not reliable when the number of primary studies is small.

Egger’s test, more specific than Begg’s, consists of plotting the regression line between the precision of the studies (independent variable) and the standardized effect (dependent variable). This regression must be weighted by the inverse of the variance, so I do not recommend that you do it on your own, unless you are consummate statisticians. When there is no publication bias, the regression line originates at the zero of the Y axis. The further away from zero, the more evidence of publication bias.
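If you want to see the skeleton of the calculation, this sketch regresses the standardized effect (effect divided by its standard error) on the precision (the inverse of the standard error) by ordinary least squares and returns the intercept. This unweighted form is algebraically the same as an inverse-variance weighted regression of the effect on its standard error. The function name and the data are hypothetical, and a real analysis would also test whether the intercept differs from zero.

```python
def egger_intercept(effects, standard_errors):
    """Intercept and slope of the regression of standardized effect on precision."""
    precision = [1.0 / se for se in standard_errors]
    standardized = [e / se for e, se in zip(effects, standard_errors)]
    n = len(precision)
    mean_x = sum(precision) / n
    mean_y = sum(standardized) / n
    sxx = sum((x - mean_x) ** 2 for x in precision)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(precision, standardized))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

standard_errors = [0.1, 0.2, 0.3, 0.4]

# No small-study effect: every study estimates the same effect of 0.5.
print(egger_intercept([0.5, 0.5, 0.5, 0.5], standard_errors))

# Small-study effect: the effect grows with the standard error.
biased_effects = [0.5 + se for se in standard_errors]
print(egger_intercept(biased_effects, standard_errors))
```

In the first case the intercept is zero (up to rounding) and the slope recovers the common effect. In the second the intercept moves away from zero, which is the signal we are looking for.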

As always, there are computer programs that do these tests quickly without having to burn your brain with the calculations.

What if after doing the work we see that there is publication bias? Can we do something to adjust it? As always, we can.

The simplest way is to use a graphic method called trim and fill. It consists of the following: a) we draw the funnel plot; b) we remove the small studies so that the funnel is symmetrical; c) the new center of the graph is determined; d) we recover the previously removed studies and we add their reflection to the other side of the center line; e) we estimate again the effect.

Other methods of studying publication bias

Another very conservative attitude that we can adopt is to assume that there is a publication bias and to ask how much it affects our results, assuming that we have left studies not included in the analysis.

The only way to know if the publication bias affects our estimates would be to compare the effect in the retrieved and unrecovered studies but, of course, then we would not have to worry about the publication bias.

To know whether the observed result is robust or, on the contrary, susceptible to being distorted by publication bias, two fail-safe N methods have been devised.

The first is the Rosenthal’s fail-safe N method. Suppose we have a meta-analysis with an effect that is statistically significant, for example, a risk ratio greater than one with a p <0.05 (or a 95% confidence interval that does not include the null value, one). Then we ask ourselves a question: how many studies with RR = 1 (null value) will we have to include until p is not significant? If we need few studies (less than 10) to make the value of the effect null, we can worry because the effect may in fact be null and our significance is the product of a publication bias. On the contrary, if many studies are needed, the effect is likely to be truly significant. This number of studies is what the letter N of the name of the method means.
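The text gives no formula, so take the following as a sketch under assumptions: it uses Rosenthal’s classic expression, based on combining the Z scores of the primary studies, with the one-tailed 5% critical value of 1.645. The function name and the Z scores are hypothetical.

```python
def rosenthal_fail_safe_n(z_scores, z_alpha=1.645):
    """Number of null studies needed to make the combined Z non-significant.

    z_scores: one standard normal deviate per primary study.
    z_alpha: critical value; 1.645 is the one-tailed 5% threshold (an assumption).
    """
    k = len(z_scores)
    total_z = sum(z_scores)
    # The combined Z is total_z / sqrt(k + n); solve for n at the threshold.
    n = (total_z ** 2) / (z_alpha ** 2) - k
    return max(0.0, n)   # a negative value means the result is already non-significant


# Hypothetical Z scores from five primary studies.
z_scores = [2.1, 1.8, 2.5, 1.2, 2.9]
print(round(rosenthal_fail_safe_n(z_scores), 1))
```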

The problem with this method is that it focuses on statistical significance and not on the relevance of the results. The correct thing would be to look for how many studies are needed for the result to lose clinical relevance, not statistical significance. In addition, it assumes that the effects of the missing studies are null (one in the case of risk ratios and odds ratios, zero in the case of differences in means), when the effect of the missing studies could go in the opposite direction to the effect we detect, or in the same direction but with a smaller magnitude.

To avoid these disadvantages there is a variation of the previous formula that assesses both statistical significance and clinical relevance. With this method, called Orwin’s fail-safe N, we calculate how many studies are needed to bring the value of the effect to a specific value, which will generally be the smallest effect that is clinically relevant. This method also allows us to specify the average effect of the missing studies.
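As a rough illustration, Orwin’s calculation can be sketched in a few lines of Python. The function and parameter names are hypothetical, and the effects are assumed to be on an additive scale (differences in means or log ratios), so a null effect is zero.

```python
def orwin_fail_safe_n(k, mean_effect, criterion_effect, missing_effect=0.0):
    """Missing studies needed to pull the mean effect down to criterion_effect.

    k: number of studies in the meta-analysis.
    mean_effect: observed average effect.
    criterion_effect: smallest effect considered clinically relevant.
    missing_effect: assumed average effect of the missing studies.
    """
    if criterion_effect == missing_effect:
        raise ValueError("criterion and missing-study effects must differ")
    return k * (mean_effect - criterion_effect) / (criterion_effect - missing_effect)


# Hypothetical example: 10 studies, mean effect 0.5, clinical threshold 0.2.
print(orwin_fail_safe_n(10, 0.5, 0.2))
```

Giving `missing_effect` a value other than zero covers the case of missing studies whose effect goes in the opposite direction, or in the same direction with a smaller magnitude.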

The PRISMA statement

To end the meta-analysis explanation, let’s see what is the right way to present the results of the data analysis. To do it well, we can follow the recommendations of the PRISMA statement, which devotes seven of its 27 items to advice on how to present the results of a meta-analysis.

First, we must report on the study selection process: how many we have found and evaluated, how many we have selected and how many we have rejected, also explaining the reasons for doing so. For this, the flowchart that any systematic review complying with the PRISMA statement should include is very useful.

Secondly, the characteristics of the primary studies must be specified, detailing what data we get from each one of them and their corresponding bibliographic citations to facilitate that any reader of the review can verify the data if he does not trust us. In this sense, there is also the third section, which refers to the evaluation of the risk of study biases and their internal validity.

Fourth, we must present the results of each individual study, with summary data for each intervention group analyzed together with the calculated estimators and their confidence intervals. These data will help us compile the information that PRISMA asks for in its fifth point, on the presentation of results, which is none other than the synthesis of all the studies in the meta-analysis, their confidence intervals, the results of the homogeneity study, etc.

This is usually done graphically by means of an effects diagram, a graphical tool popularly known as forest plot, where the trees would be the primary studies of the meta-analysis and where all the relevant results of the quantitative synthesis are summarized.

The Cochrane Collaboration recommends structuring the forest plot in five well-differentiated columns. Column 1 lists the primary studies or the groups or subgroups of patients included in the meta-analysis. They are usually represented by an identifier composed of the name of the first author and the date of publication. Column 2 shows the results of the measures of effect of each study as reported by their respective authors.

Column 3 is the actual forest plot, the graphic part of the subject. It shows the measures of effect of each study on both sides of the null effect line, which we already know is zero for mean differences and one for odds ratios, risk ratios, hazard ratios, etc. Each study is represented by a square whose area is usually proportional to the contribution of each one to the overall result. In addition, the square sits on a segment whose ends mark the limits of its confidence interval.

These confidence intervals inform us about the precision of the studies and tell us which are statistically significant: those whose interval does not cross the null effect line. Anyway, do not forget that, even when an interval crosses the line of no effect and is not statistically significant, its limits can give us much information about the clinical significance of the results of each study. Finally, at the bottom of the chart we will find a diamond that represents the global result of the meta-analysis. Its position with respect to the null effect line will inform us about the statistical significance of the overall result, while its width will give us an idea of its precision (its confidence interval). Furthermore, at the top of this column we will find the type of effect measure, the analysis model used (fixed or random) and the confidence level of the confidence intervals (typically 95%).

This chart is usually completed by a fourth column with the estimated weight of each study in per cent format and a fifth column with the estimates of the weighted effect of each. And in some corner of this forest will be the measure of heterogeneity that has been used, along with its statistical significance in cases where relevant.

To conclude the presentation of the results, PRISMA recommends a sixth section with the evaluation that has been made of the risks of bias in the study and a seventh with all the additional analyzes that have been necessary: stratification, sensitivity analysis, metaregression, etc.

What the Cochrane says

As you can see, nothing is easy about meta-analysis. Therefore, the Cochrane Collaboration recommends following a series of steps to correctly interpret the results. Namely:

1. Verify which variable is compared and how. It is usually seen at the top of the forest plot.
2. Locate the measure of effect used. This is logical and necessary to know how to interpret the results. A hazard ratio is not the same as a difference in means or whatever other measure was used.
3. Locate the diamond, its position and its amplitude. It is also convenient to look at the numerical value of the global estimator and its confidence interval.
4. Check that heterogeneity has been studied. This can be seen by looking at whether the segments that represent the primary studies are or are not very dispersed and whether they overlap or not. In any case, there will always be a statistic that assesses the degree of heterogeneity. If we see that there is heterogeneity, the next thing will be to find out what explanation the authors give about its existence.
5. Draw our conclusions. We will look at which side of the null effect line are the overall effect and its confidence interval. You already know that, although it is significant, the lower limit of the interval should be as far as possible from the line, because of the clinical relevance, which does not always coincide with statistical significance. Finally, look again at the study of homogeneity. If there is a lot of heterogeneity, the results will not be as reliable.

We’re leaving…

And with this we end the topic of meta-analysis. In fact, the forest plot is not exclusive to meta-analyses and can be used whenever we want to compare studies to elucidate their statistical or clinical significance, or in cases such as equivalence studies, in which the null effect line is joined by the equivalence thresholds. But it still has one more utility. A variant of the forest plot also serves to assess whether there is publication bias in the systematic review, although, as we already know, in these cases we change its name to funnel plot. But that is another story…

Study of heterogeneity in meta-analysis

You all surely know the Chinese tale of the poor solitary grain of rice that falls to the ground and nobody hears it. Of course, if instead of a single grain a whole sack of rice falls, that is something else. There are many examples of unity making strength. A red ant is harmless, unless it bites you in some soft and noble area, which are usually the most sensitive. But what about a swarm of millions of red ants? That is what really scares you, because if they all come together and come for you, you could do little to stop their push. Yes, unity is strength.

And this also happens with statistics. With a relatively small sample of well-chosen voters we can estimate who will win an election in which millions vote. So, what could we not do with a lot of those samples? Surely the estimate would be more reliable and more generalizable.

Turning to substance

Well, this is precisely one of the purposes of meta-analysis, which uses various statistical techniques to make a quantitative synthesis of the results of a set of studies that, although they try to answer the same question, do not reach exactly the same result. But beware; we cannot combine studies to draw conclusions about the sum of them without first taking a series of precautions. This would be like mixing apples and pears which, I’m not sure why, must be something terribly dangerous because everyone knows it’s something to avoid.

Think that we have a set of clinical trials on the same topic and we want to do a meta-analysis to obtain a global result. It is more than convenient that there is as little variability as possible among the studies if we want to combine them. Because, ladies and gentlemen, here also rules the saying: alongside but separate.

Before thinking about combining the results of the studies of a systematic review to perform a meta-analysis, we must always make a previous study of the heterogeneity of the primary studies, which is nothing more than the variability that exists among the estimators that have been obtained in each of those studies.

Study of heterogeneity in meta-analysis

First, we will investigate possible causes of heterogeneity, such as differences in treatments, variability of the populations of the different studies and differences in the designs of the trials. If there is a great deal of heterogeneity from the clinical point of view, perhaps the best thing to do is not to do meta-analysis and limit the analysis to a qualitative synthesis of the results of the review.

Once we come to the conclusion that the studies are similar enough to try to combine them, we should try to measure this heterogeneity to have an objective figure. For this, several privileged brains have created a series of statistics that contribute to our daily jungle of acronyms and letters.

Until recently, the most famous of those initials was Cochran’s Q, which has nothing to do either with James Bond or with our friend Archie Cochrane. Its calculation takes into account the sum of the deviations between each of the results of the primary studies and the global outcome (squared differences to avoid positives cancelling negatives), weighting each study according to its contribution to the overall result. It looks awesome but, in reality, it is no big deal. Ultimately, it’s no more than an aristocratic relative of the chi-square test. Indeed, Q follows a chi-square distribution with k-1 degrees of freedom (k being the number of primary studies). We calculate its value, look at the frequency distribution and estimate the probability of observing differences at least this large if they were due only to chance, in order to decide whether to reject our null hypothesis (which assumes that the observed differences among studies are due to chance). But, despite appearances, Q has a number of weaknesses.

First, it’s a very conservative parameter and we must always keep in mind that a lack of statistical significance is not always synonymous with absence of heterogeneity: when we cannot reject the null hypothesis and we accept it, we run the risk of committing a type II error and blundering. For this reason, some people propose using a significance level of p < 0.1 instead of the standard p < 0.05. Another pitfall of Q is that it doesn’t quantify the degree of heterogeneity and, of course, doesn’t explain the reasons that produce it. And, to top it off, Q loses power when the number of studies is small and doesn’t allow comparisons among different meta-analyses if they have a different number of studies.
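To see that Q really is no big deal, here is a minimal sketch of its calculation in Python. It assumes the usual inverse-variance weights and effects on an additive scale; the function name and the data are hypothetical. The p-value is left out on purpose: it comes from comparing Q with a chi-square distribution with k-1 degrees of freedom, which any statistical program will do for us.

```python
def cochran_q(effects, variances):
    """Return (Q, degrees of freedom) using inverse-variance weights."""
    weights = [1.0 / v for v in variances]
    # Weighted average of the study effects (the global outcome).
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    # Weighted sum of squared deviations from the global outcome.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    return q, len(effects) - 1


# Hypothetical log odds ratios and their variances.
effects = [0.20, 0.35, 0.10, 0.60]
variances = [0.04, 0.09, 0.02, 0.16]
q, df = cochran_q(effects, variances)
print(f"Q = {q:.2f} with {df} degrees of freedom")
```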

This is why another statistic has been devised that is much more celebrated today: I2. This parameter provides an estimate of the variation among studies with respect to total variability or, put another way, the proportion of variability that is actually due to real differences among the estimates (heterogeneity) as opposed to variability due to chance. It also looks impressive, but it’s actually an accomplished relative of the intraclass correlation coefficient.

Its value ranges from 0 to 100%, and we usually consider the limits of 25%, 50% and 75% as signs of low, moderate and high heterogeneity, respectively. I2 is not affected either by the units in which the effect is measured or by the number of studies, so it allows comparisons between meta-analyses with different units of effect measurement or a different number of studies.

If you read a study that provides Q and you want to calculate I2, or vice versa, you can use the following formula, where k is the number of primary studies:

$I^{2}=\frac{Q-k+1}{Q}$
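The formula translates directly into Python. In this sketch the function names are hypothetical, I2 is returned as a proportion between 0 and 1, and negative values are truncated to zero, which is a common convention rather than something stated in the formula. The second function computes H2 as Q divided by its degrees of freedom, the usual definition.

```python
def i_squared(q, k):
    """I2 as a proportion (multiply by 100 for a percentage)."""
    if q <= 0:
        return 0.0
    return max(0.0, (q - (k - 1)) / q)   # truncated at zero by convention


def h_squared(q, k):
    """H2: observed Q relative to its expected value (k - 1) without heterogeneity."""
    return q / (k - 1)


# Hypothetical example: Q = 20 from 11 primary studies.
print(i_squared(20, 11), h_squared(20, 11))
```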

There’s a third parameter that is less known, but not less worthy of mention: H2. It measures the excess of Q value in respect of the value that we would expect to obtain if there were no heterogeneity. Thus, a value of 1 means no heterogeneity and its value increases as heterogeneity among studies does. But its real interest is that it allows calculating I2 confidence intervals.

Other times, the authors perform a hypothesis test with a null hypothesis of no heterogeneity and use a chi-square or some similar statistic. In these cases, what they provide is a value of statistical significance. If p is <0.05 we can reject the null hypothesis and say that there is heterogeneity. Otherwise we will say that we cannot reject the null hypothesis of no heterogeneity.

In summary, whenever we see an indicator of heterogeneity expressed as a percentage, it will indicate the proportion of variability that is not due to chance. For their part, when they give us a “p”, there will be significant heterogeneity when the “p” is less than 0.05.

Do not worry about the calculations of Q, I2 and H2. For that there are specific programs such as RevMan, or modules within the usual statistical programs that perform the same function.

Graphical methods for studying heterogeneity in meta-analysis

A point of attention: always remember that not being able to demonstrate heterogeneity does not always mean that the studies are homogeneous. The problem is that the null hypothesis assumes that they are homogeneous and the differences are due to chance. If we can reject it we can assure that there is heterogeneity (always with a small degree of uncertainty). But this does not work the other way around: if we cannot reject it, it simply means that we cannot reject that there is no heterogeneity, but there will always be a probability of committing a type II error if we directly assume that the studies are homogeneous.

For this reason, a series of graphical methods have been devised to inspect the studies and look for signs of heterogeneity even when the numerical parameters do not detect it.

The most employed of them is, perhaps, Galbraith’s plot, which can be used for meta-analyses of both trials and observational studies. This graph represents the precision of each study versus its standardized effect. It also shows the fitted regression line and sets two confidence bands. The position of each study along the precision axis indicates its weighted contribution to the overall result, while its location outside the confidence bands indicates its contribution to heterogeneity.

Galbraith’s plot can also be useful for detecting sources of heterogeneity, since studies can be labeled according to different variables so that we can see how they contribute to the overall heterogeneity.
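Without drawing anything, we can sketch in Python how the points of a Galbraith plot are obtained and which studies fall outside the bands. This is only an illustration under assumptions: the regression line is forced through the origin, the bands are placed at 2 standardized units above and below it, and the function name and data are hypothetical.

```python
def galbraith_points(effects, std_errors, band=2.0):
    """Return a list of (precision, standardized effect, outside_band) per study."""
    precision = [1.0 / se for se in std_errors]
    standardized = [e / se for e, se in zip(effects, std_errors)]
    # Slope of the regression through the origin (equals the fixed effect estimate).
    slope = sum(x * y for x, y in zip(precision, standardized)) / sum(
        x * x for x in precision
    )
    return [
        (x, y, abs(y - slope * x) > band)   # True: candidate source of heterogeneity
        for x, y in zip(precision, standardized)
    ]


# Hypothetical effects and standard errors; the last study is the odd one out.
for point in galbraith_points([0.0, 0.0, 0.0, 3.0], [0.5, 0.5, 0.5, 0.5]):
    print(point)
```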

Another tool you can use for meta-analyses of clinical trials is L’Abbé’s plot. It represents response rates in the treatment group versus response rates in the control group, plotting the studies on both sides of the diagonal. Above that line are the studies with an outcome favorable to treatment, while below are the studies with an outcome favorable to the control intervention. The studies are usually plotted with an area proportional to their precision, and their dispersion indicates heterogeneity. Sometimes, L’Abbé’s plot provides additional information. For example, in the accompanying graph you can see that studies in low-risk areas are located mainly below the diagonal. On the other hand, high-risk studies are mainly located in areas of positive treatment outcome. This distribution, as well as being suggestive of heterogeneity, may suggest that the efficacy of the treatment depends on the level of risk or, put another way, that we have an effect-modifying variable in our study. A small drawback of this tool is that it is only applicable to meta-analyses of clinical trials and when the dependent variable is dichotomous.

We must weight each study

Well, suppose we study heterogeneity and decide that we are going to combine the studies to do a meta-analysis. The next step is to analyze the estimators of the effect size of the studies, weighting them according to the contribution that each study will have on the overall result. This is logical; a trial with few participants and an imprecise result cannot contribute the same to the final result as another with thousands of participants and a more precise result.

The most usual way to take these differences into account is to weight the estimate of the size of the effect by the inverse of the variance of the results, subsequently performing the analysis to obtain the average effect. For this there are several possibilities, some of them very complex from the statistical point of view, although the two most commonly used methods are the fixed effect model and the random effects model. The two models differ in their conception of the starting population from which the primary studies of the meta-analysis come.

Two models

The fixed effect model considers that there is no heterogeneity and that all studies estimate the same effect size of the population (they all measure the same effect, that is why it is called a fixed effect), so it is assumed that the variability observed among the individual studies is due only to the error that occurs when performing the random sampling in each study. This error is quantified by estimating intra-study variance, assuming that the differences in the estimated effect sizes are due only to the use of samples from different subjects.
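For those who want to peek under the hood, here is a minimal sketch of a fixed effect analysis with inverse-variance weights in Python. It assumes effects on an additive scale (differences in means or log ratios) and a normal approximation for the 95% confidence interval; the function name and the data are hypothetical.

```python
import math


def fixed_effect(effects, variances, z=1.96):
    """Return (pooled effect, lower limit, upper limit) of the fixed effect model."""
    weights = [1.0 / v for v in variances]          # precise studies weigh more
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))              # only intra-study variance
    return pooled, pooled - z * se, pooled + z * se


# Hypothetical mean differences and their variances.
pooled, low, high = fixed_effect([0.2, 0.4, 0.3], [0.01, 0.04, 0.02])
print(f"{pooled:.3f} (95% CI {low:.3f} to {high:.3f})")
```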

On the other hand, the random effects model assumes that the effect size varies in each study and follows a normal frequency distribution within the population, so each study estimates a different effect size. Therefore, in addition to the intra-study variance due to the error of random sampling, the model also includes the variability among studies, which would represent the deviation of each study from the mean effect size. These two error terms are independent of each other, both contributing to the variance of the study estimator.

In summary, the fixed effect model incorporates only one error term for the variability of each study, while the random effects model adds, in addition, another error term due to the variability among the studies.
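The extra error term can be sketched in Python as well. The text does not say how the variability among studies is estimated, so this example assumes one common choice, the DerSimonian-Laird estimator; the function name, the variable `tau2` and the data are hypothetical.

```python
import math


def random_effects_dl(effects, variances, z=1.96):
    """Return (pooled effect, lower limit, upper limit, tau2), DerSimonian-Laird."""
    k = len(effects)
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)              # variance among studies
    # Each study now carries two error terms: its own variance plus tau2.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, pooled - z * se, pooled + z * se, tau2


# Hypothetical mean differences and their variances.
print(random_effects_dl([0.0, 1.0, 0.4], [0.1, 0.1, 0.2]))
```

When the estimated variance among studies is zero, the weights reduce to the inverse of each study’s own variance and the result coincides with that of the fixed effect model.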

You see that I have not written a single formula. We do not actually need to know them and they are quite unfriendly, full of Greek letters that no one understands. But do not worry. As always, statistical programs like RevMan from the Cochrane Collaboration allow you to do the calculations in a simple way, including and removing studies from the analysis and changing the model as you wish.

The type of model to choose has its importance. If in the previous homogeneity analysis we see that the studies are homogeneous we can use the fixed effect model. But if we detect that heterogeneity exists, within the limits that allow us to combine the studies, it will be preferable to use the random effects model.

Another consideration is the applicability or external validity of the results of the meta-analysis. If we have used the fixed effect model, we will be limited when generalizing the results beyond populations with characteristics similar to those of the included studies. This does not occur with the results obtained using the random effects model, whose external validity is greater because it comes from studies of different populations.

In any case, we will obtain a summary effect measure along with its confidence interval. The result will be statistically significant when this confidence interval does not cross the null effect line, which we already know is zero for mean differences and one for odds ratios and risk ratios. In addition, the width of the interval will inform us about the precision of the estimation of the average effect in the population: the wider the interval, the less precise the estimate, and vice versa.

If you think a bit, you will immediately understand why the random effects model is more conservative than the fixed effect model, in the sense that the confidence intervals obtained are wider, since it incorporates more variability in its analysis. In some cases it may happen that the estimator is significant if we use the fixed effect model and not significant if we use the random effects model, but this should not condition us when choosing the model to use. We must always rely on the previous measure of heterogeneity, although if we have doubts, we can also use the two models and compare the results.

What if there is heterogeneity?

Having examined the homogeneity of the primary studies, we can come to the grim conclusion that heterogeneity dominates the situation. Can we do something to manage it? Sure, we can. We can always choose not to combine the studies, or combine them despite the heterogeneity and obtain a summary result but, in that case, we should also calculate some measure of variability among studies, and even then we could not be sure of our results.

Another possibility is to do a stratified analysis according to the variable that causes the heterogeneity, provided that we are able to identify it. For this we can do a sensitivity analysis, repeating the calculations while removing each of the subgroups one by one and checking how this influences the overall result. The problem is that this approach ignores the final purpose of any meta-analysis, which is none other than obtaining an overall value from homogeneous studies.
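The "remove one and recalculate" idea is easy to sketch in Python. This minimal example leaves out one study at a time (the same loop would work with subgroups) and recomputes an inverse-variance weighted average; the function names and the data are hypothetical.

```python
def pooled_effect(effects, variances):
    """Inverse-variance weighted average of the study effects."""
    weights = [1.0 / v for v in variances]
    return sum(w * y for w, y in zip(weights, effects)) / sum(weights)


def leave_one_out(effects, variances):
    """Pooled effect recomputed after removing each study in turn."""
    effects, variances = list(effects), list(variances)
    results = []
    for i in range(len(effects)):
        rest_effects = effects[:i] + effects[i + 1:]
        rest_variances = variances[:i] + variances[i + 1:]
        results.append(pooled_effect(rest_effects, rest_variances))
    return results


# Hypothetical data: the third study looks very different from the rest.
print(leave_one_out([0.1, 0.2, 0.9], [1.0, 1.0, 1.0]))
```

A study whose removal changes the pooled effect a lot is a good candidate to explain the heterogeneity.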

Finally, the brainiest on these issues can use meta-regression. This technique is similar to multivariate regression models: the characteristics of the studies are used as explanatory variables, and the effect variable, or some measure of the deviation of each study from the global result, is used as the dependent variable. The studies should also be weighted according to their contribution to the overall result, and we should try not to include too many coefficients in the regression model if the number of primary studies is not large. I wouldn’t advise you to do a meta-regression at home unless accompanied by a grown-up.

We’re leaving…

And we only need to check that we have not omitted studies and that we have presented the results correctly. The meta-analysis data are usually represented in a specific graph that is known as forest plot. But that is another story…

Systematic review and meta-analysis

The whole is greater than the sum of its parts. This is another of those famous quotes that are all over the place. Apparently, the first person to have this clever idea was Aristotle, who used it to summarize his general principle of holism in his writings on metaphysics. Who would have said that this tiny phrase contains so much wisdom? Holism insists that everything must be considered in a comprehensive manner, because its components may act in a synergistic way, allowing the meaning of the whole to be greater than the meaning that each individual part contributes.

Don’t be afraid, you are still on the blog about the brains and not on a blog about philosophy. Neither have I changed the topic of the blog, but this saying is just what I needed to introduce you to the wildest beast of scientific method, which is called meta-analysis.

Introducing the topic

We live in the information age. Since the end of the 20th century, we have witnessed a true explosion of the available sources of information, accessible from multiple platforms. The end result is that we are overwhelmed every time we need information about a specific point, so we do not know where to look or how we can find what we want. For this reason, systems began to be developed to synthesize the information available to make it more accessible when needed.

Narrative review

So the first reviews appeared, the so-called narrative or author reviews. To write them, one or more authors, usually experts in a specific subject, made a general review of the topic, although without any strict criteria for the search strategy or the selection of information. Proceeding with total freedom, the authors analyzed the results as they saw fit and ended up drawing their conclusions from a qualitative synthesis of the results obtained.

These narrative reviews are very useful for acquiring an overview of the topic, especially when one knows little about the subject, but they are not very useful for those who already know the topic and need answers to a more specific question. In addition, as the whole procedure is done according to the authors’ wishes, the conclusions are not reproducible.

Systematic review

For these reasons, a series of privileged minds invented the other type of review on which we will focus in this post: the systematic review. Instead of reviewing a general topic, systematic reviews focus on a specific topic in order to resolve specific doubts of clinical practice. In addition, they use a clearly specified search strategy and inclusion criteria in an explicit and rigorous way, which makes them highly reproducible if another group of authors decides to repeat the review of the same topic. And, if that were not enough, whenever possible they go beyond the qualitative synthesis, completing it with a quantitative synthesis that receives the funny name of meta-analysis.

Carrying out a systematic review involves six steps: formulation of the problem or question to be answered, search and selection of existing studies, evaluation of the quality of these studies, extraction of the data, analysis of the results and, finally, interpretation and conclusion. We are going to detail this whole process a little.

Any systematic review worth its salt should try to answer a specific question that must be relevant from the clinical point of view. The question will usually be asked in a structured way with the usual components of population, intervention, comparison and outcome (PICO), so that the analysis of these components will allow us to know if the review is of our interest.

In addition, the components of the structured clinical question will help us search for the relevant studies that exist on the subject. This search must be global and unbiased, so we must avoid possible source biases such as excluding sources by language, journal, etc. The usual practice is to use a minimum of two important general-purpose electronic databases, such as PubMed, Embase or the Cochrane Library, together with those specific to the subject being treated. It is important that this search is complemented by a manual search in non-electronic registers and by consulting the bibliographic references of the papers found, in addition to other sources of the so-called gray literature, such as doctoral theses and conference documents, as well as documents from funding agencies, registers and even contact with other researchers to find out whether there are studies not yet published.

It is very important that this strategy is clearly specified in the methods section of the review, so that anyone can reproduce it later, if desired. In addition, it will be necessary to clearly specify the inclusion and exclusion criteria of the primary studies of the review, the type of design sought and its main components (again in reference to the PICO, the components of the structured clinical question).

It is important to assess the quality of the studies included in the review

The third step is the evaluation of the quality of the studies found, which must be done by a minimum of two people independently, with the help of a third party (who will surely be the boss) to break the tie in cases where there is no consensus among the reviewers. For this task, tools or checklists designed for this purpose are usually used; one of the most frequently used tools for assessing bias is the Cochrane Collaboration’s tool. This tool assesses five criteria of the primary studies to determine their risk of bias: adequate randomization sequence (prevents selection bias), adequate masking (prevents performance and detection biases, both information biases), concealment of allocation (prevents selection bias), losses to follow-up (prevents attrition bias) and selective reporting of data (prevents information bias). The studies are classified as high, low or unclear risk of bias. It is common to use the colors of the traffic light, marking in green the studies with low risk of bias, in red those with high risk of bias and in yellow those that remain in no man’s land. The more green we see, the better the quality of the primary studies of the review will be.

Ad-hoc forms are usually designed for extraction of data, which usually collect data such as date, scope of the study, type of design, etc., as well as the components of the structured clinical question. As in the case of the previous step, it is convenient that this be done by more than one person, establishing the method to reach an agreement in cases where there is no consensus among the reviewers.

And we come to meta-analysis

And here we enter the most interesting part of the review, the analysis of the results. The fundamental role of the authors will be to explain the differences that exist between the primary studies that are not due to chance, paying special attention to the variations in the design, study population, exposure or intervention and measured results. You can always make a qualitative synthesis analysis, although the real magic of the systematic review is that, when the characteristics of primary studies allow it, a quantitative synthesis, called meta-analysis, can also be performed.

A meta-analysis is a statistical analysis that combines the results of several independent studies that try to answer the same question. Although meta-analysis can be considered as a research project in its own right, it is usually part of a systematic review.

Primary studies can be combined using a statistical methodology developed for this purpose, which has a number of advantages. First, by combining all the results of the primary studies we can obtain a more complete global vision (you know, the whole is greater …). Second, when studies are combined we increase the sample size, which increases the power of the analysis compared with that of the individual studies and improves the estimation of the effect we want to measure. Third, when drawing conclusions from a greater number of studies, external validity increases, since, with different populations involved, it is easier to generalize the results. Finally, it can allow us to resolve controversies between the conclusions of the different primary studies of the review and even to answer questions that had not been raised in those studies.

Once the meta-analysis is done, a final synthesis must be made that integrates the results of the qualitative and quantitative synthesis in order to answer the question that motivated the systematic review or, when this is not possible, to propose the additional studies that must be carried out to be able to answer it.

But a meta-analysis will only deserve all our respect if it fulfills a series of requirements. Like the systematic review to which it belongs, the meta-analysis should aim to answer one specific question and must be based on all relevant available information, avoiding publication bias and recovery bias. Also, the primary studies must have been assessed to ensure their quality and homogeneity before combining them. Of course, data must be analyzed and presented in an appropriate way. And, finally, it must make sense to combine the results. The fact that we can combine results doesn’t always mean that we have to do it if it is not needed in our clinical setting.

Methods for combining studies

And how do you combine the studies? you may ask. Well, that’s the crux of the matter in meta-analysis (one of the cruxes, really, as there are many), because there are several possible ways to do it.

Anyone could think that the easiest way would be a sort of Eurovision Contest. We count the primary studies with a statistically significant positive effect and, if they are the majority, we conclude that there’s consensus for a positive result. This approach is quite simple but, you will not deny it, also quite sloppy. I can also think of a number of disadvantages. On the one hand, it implies that lack of significance and lack of effect are synonymous, which does not always have to be true. On the other hand, it takes into account neither the direction and strength of the effect in each study, nor the precision of the estimates, nor the quality or design characteristics of the primary studies. So this type of approach is not recommended, although nobody is going to fine us if we use it as an informal first look before deciding which is the best way to combine the results.

Another possibility is to use a sort of sign test, similar to other non-parametric statistical techniques. We count the number of positive effects, subtract the negatives and we have our conclusion. The truth is that this method also seems too simple. It ignores studies that don’t reach statistical significance and also ignores the precision of the studies’ estimates. So this approach is not of much use unless you only know the directions of the effects measured in the studies. We could also use it when primary studies are very heterogeneous to get an approximation of the global result, although I would not put much trust in results obtained this way.
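
To make the idea concrete, here is a minimal sketch of an exact two-sided sign test on the directions of the effects. It assumes that we only know whether each study points to a positive or a negative effect; the function name and the counts in the usage note are made up for illustration.

```python
from math import comb

def sign_test_p_value(n_positive: int, n_negative: int) -> float:
    """Exact two-sided sign test on the directions of the study effects."""
    n = n_positive + n_negative
    k = max(n_positive, n_negative)
    # Probability of seeing k or more studies on the same side out of n
    # if both directions were equally likely (a fair coin).
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # Double the tail for a two-sided test, capped at 1.
    return min(1.0, 2 * upper_tail)
```

For instance, sign_test_p_value(9, 1) asks how surprising it would be to find nine of ten studies pointing the same way if there were no effect at all. Note that nothing in it uses effect sizes or precision, which is exactly the weakness described above.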

The third method is to combine the different p-values of the studies (our beloved and sacrosanct Ps). This could come to mind if we had a systematic review whose primary studies use different outcome measures, although all of them try to answer the same question. For example, think about a review on osteoporosis where some studies use ultrasonic densitometry, others spine or femur DEXA, etc. The problem with this method is that it doesn’t take into account the magnitude of the effects, only their direction and statistical significance, and we all know the deficiencies of our holy Ps. To use this approach we’d need software that combines data that follow a Chi-square or Gaussian distribution, giving us an estimate and its confidence interval.
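
As an illustration, the following sketch combines p-values with two classical procedures that I am assuming here as examples, since the text does not name any: Fisher’s method (based on a Chi-square statistic) and Stouffer’s method (based on a Gaussian one). Both expect one-sided p-values that all test the same direction, strictly between 0 and 1. The function names are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def fisher_combined_p(p_values):
    """Fisher's method: -2 * sum(ln p) follows a chi-square with 2k df."""
    k = len(p_values)
    statistic = -2.0 * sum(log(p) for p in p_values)
    # Closed-form upper tail of a chi-square with an even number of df (2k).
    half = statistic / 2.0
    term = 1.0
    series = 1.0
    for i in range(1, k):
        term *= half / i
        series += term
    return statistic, exp(-half) * series

def stouffer_combined_p(p_values):
    """Stouffer's method: the sum of z-scores divided by sqrt(k) is standard normal."""
    normal = NormalDist()
    z = sum(normal.inv_cdf(1.0 - p) for p in p_values) / sqrt(len(p_values))
    return z, 1.0 - normal.cdf(z)
```

Both functions return the combined statistic and the combined p-value. Neither of them says anything about how large the effect is, which is the limitation mentioned above.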

The fourth and final method that I know of is also the most stylish: to make a weighted combination of the effects estimated in all the primary studies. Calculating the arithmetic mean would be the easiest way, but we have not come this far to make a botch of it again. The arithmetic mean gives the same weight to all studies, so if you have an outlier or an imprecise study, the results will be greatly distorted. Don’t forget that averages always follow the tails of distributions and are heavily influenced by extreme values (which does not happen to their relative, the median).

This is why we have to weight the different estimates. This can be done in two ways: taking into account the number of subjects in each study, or weighting by the inverse of the variance of each estimate (you know, the square of its standard error). The latter is the more complex, so it is, of course, the one people prefer to use. Since the maths involved are quite hard, people usually rely on special software, either external modules for the usual statistical programs such as Stata, SPSS, SAS or R, or specific software such as the famous Cochrane Collaboration’s RevMan.
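
Just to see what is under the hood, this is a minimal sketch of inverse-variance weighting under a fixed effect assumption (my choice for the example; it is not what any particular program does internally). The function name and the numbers in the usage note are invented.

```python
from math import sqrt

def pool_inverse_variance(estimates, standard_errors):
    """Weighted mean of the estimates, each weighted by 1 / SE^2."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, estimates)) / total_weight
    # The variance of the pooled estimate is the inverse of the total weight.
    pooled_se = sqrt(1.0 / total_weight)
    return pooled, pooled_se
```

With two hypothetical studies, pool_inverse_variance([0.2, 0.4], [0.1, 0.2]), the pooled estimate is pulled towards the more precise study instead of sitting halfway between them, as the arithmetic mean would.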

In summary

As you can see, I have not held back in calling the systematic review with meta-analysis the wildest beast of epidemiological designs. However, it has its detractors. We all know someone who claims not to like systematic reviews because almost all of them end up in the same way: “more quality studies are needed to be able to make recommendations with a reasonable degree of evidence”. Of course, in these cases we cannot put the blame on the review, because we do not take enough care in performing our studies, so the vast majority deserve to end up in the paper shredder.

Another controversy is that of those who debate about what is better, a good systematic review or a good clinical trial (reviews can be made on other types of designs, including observational studies). This debate reminds me of the controversy over whether one should do a calimocho mixing a good wine or if it is a sin to mix a good wine with Coca-Cola. Controversies aside, if you have to take a calimocho, I assure you that you will enjoy it more if you use a good wine, and something similar happens to reviews with the quality of their primary studies.

The problem with systematic reviews is that, to be really useful, they have to be carried out very rigorously. So that we do not forget anything, there are checklists and recommendations that allow us to organize the entire procedure of creating and disseminating scientific works without methodological errors or omissions.

It all started with a program of the Health Service of the United Kingdom that ended with the founding of an international initiative to promote the transparency and precision of biomedical research works: the EQUATOR network (Enhancing the QUAlity and Transparency of health Research). This network consists of experts in methodology, communication and publication, so it includes professionals involved in the quality of the entire process of production and dissemination of research results. Among many other objectives, which you can consult on its website, one is to design a set of recommendations for the realization and publication of the different types of studies, which gives rise to different checklists or statements.

The checklist designed to apply to systematic reviews is the PRISMA statement (Preferred Reporting Items for Systematic reviews and Meta-Analyses), which comes to replace the QUOROM statement (QUality Of Reporting Of Meta-analyses). Based on the definition of systematic review of the Cochrane Collaboration, PRISMA helps us to select, identify and assess the studies included in a review. It also consists of a checklist and a flowchart that describes the passage of all the studies considered during the realization of the review. There is also a lesser-known statement for the assessment of meta-analyses of observational studies, the MOOSE statement (Meta-analyses of Observational Studies in Epidemiology).

The Cochrane Collaboration also has a very well structured and defined methodology, which you can consult on its website. This is why its reviews have so much prestige within the world of systematic reviews: they are made by professionals dedicated to the task who follow a rigorous and proven methodology. Anyway, even Cochrane’s reviews should be read critically, without taking anything for granted.

We’re leaving…

And with this we have reached the end for today. I want to insist that meta-analysis should be done whenever possible and interesting, but making sure beforehand that it is correct to combine the results. If the studies are very heterogeneous we should not combine anything, since the validity of the results that we could obtain would be seriously compromised. There is a whole series of methods and statistics to measure the homogeneity or heterogeneity of the primary studies, which also influence the way in which we analyze the combined data. But that is another story…

Publication bias in meta-analysis

We can find strength through unity. It is a fact. Great goals are achieved more easily with the joining of the effort of many. And this is also true in statistics.
In fact, there are times when clinical trials do not have the power to demonstrate what they are pursuing, either because of lack of sample due to time, money or difficulty recruiting participants, or because of other methodological limitations. In these cases, it is possible to resort to a technique that allows us sometimes to combine the effort of multiple trials in order to reach the conclusion that we would not reach with any of the trials separately. This technique is meta-analysis.

The origin of the problem

Meta-analysis gives us an exact quantitative mathematical synthesis of the studies included in the analysis, generally the studies retrieved during a systematic review. Logically, if we include all the studies that have been done on a topic (or, at least, all that are relevant to our research), that synthesis will reflect the current knowledge on the subject. However, if the collection is biased and we lack studies, the result will reflect only the articles collected, not the total available knowledge.
When planning the review we must establish a global search structure to try to find all the articles. If we do not do this we can incur recovery bias, which will have the same effect on the quantitative analysis as publication bias. But even with modern electronic searches, it is very difficult to find all the relevant information on a particular topic.
In cases of missing studies, the importance of the effect will depend on how the studies are lost. If they are lost at random, the only problem will be that we have less information, so our results will be less precise and the confidence intervals wider, but our conclusions may still be correct. However, if the articles that we do not find are systematically different from those we find, the result of our analysis may be biased, since our conclusions can only be applied to that sample of papers, which will be a biased sample.

The why of publication bias in meta-analysis

There are a number of factors that may contribute to publication bias. First, studies with significant results are more likely to be published and, among these, publication is more likely the greater the effect. This means that studies with negative results or with effects of small magnitude may not be published, so we will draw a biased conclusion from analyzing only the studies with large, positive effects.
Secondly, of course, published studies are more likely to come into our hands than those that are not published in scientific journals. This is the case of doctoral theses, communications to congresses, reports from government agencies or even studies pending publication by researchers working on the subject we are dealing with. For this reason it is so important to do a search that includes this type of work, which falls under the term gray literature.
Finally, a number of biases can be listed that influence the likelihood that a paper will be published or retrieved by the investigator performing the systematic review such as language bias (we limit the search by language), availability bias (to include only those studies that are easy to retrieve by the researcher), cost bias (to include studies that are free or cheap), familiarity bias (only those of the discipline of the investigator), duplication bias (those who have significant outcomes are more likely to be published more than once) and citation bias (studies with significant outcome are more likely to be cited by other authors).
One may think that losing studies during the review cannot be so serious, since it could be argued that studies not published in peer-reviewed journals are often of poorer quality, so they do not deserve to be included in the meta-analysis. However, it is not clear that scientific journals ensure the methodological quality of a study or that this is the only way to do so. There are researchers, such as those in government agencies, who are not interested in publishing in scientific journals, but in producing reports for those who commission them. In addition, peer review is no guarantee of quality because, too often, neither the researcher who performs the study nor those in charge of reviewing it have the methodological training that ensures the quality of the final product.

Assessment of risk of publication bias

There are tools to assess the risk of publication bias. Perhaps the simplest is to draw a forest plot ordered with the most precise studies at the top and the least precise at the bottom. As we move down, the precision of the results decreases, so the effect should scatter to both sides of the summary measure. If it only scatters towards one side, we can indirectly assume that we have failed to find the studies that must exist on the opposite side, so we very likely have publication bias.
Another similar procedure is the use of the funnel plot, as seen in the attached image. In this graph the effect size is plotted on the X axis and on the Y axis a measure of the variance or the sample size, inverted. Thus, at the top will be the largest and most accurate studies. Once again, as we go down the graph, the accuracy of the studies is smaller and they are shifted sideways by random error. When there is publication bias this displacement is asymmetrical. The problem of the funnel plot is that its interpretation can be subjective, so there are numerical methods to try to detect the existence of publication bias.
And, at this point, what should we do in the face of a publication bias? Perhaps the most appropriate thing is not to ask if there is bias, but how much it affects my results (and assume that we have left studies without being included in the analysis).
The only way to know if publication bias affects our estimates would be to compare the effect on recovered and unrecovered studies, but of course, then we would not have to worry about publication bias.

Fail-safe N methods

To find out whether the observed result is robust or, conversely, susceptible to publication bias, two methods have been devised, known as the fail-safe N methods.
The first is Rosenthal’s fail-safe N method. Suppose we have a meta-analysis with an effect that is statistically significant, for instance, a relative risk greater than one with p < 0.05 (or a 95% confidence interval that does not include the null value, one). Then we ask ourselves a question: how many studies with RR = 1 (null value) would have to be included until p is no longer significant? If we need few studies (fewer than 10) to invalidate the effect, we may worry that the effect is actually null and our significance is the result of publication bias. Conversely, if many studies are needed, the effect is likely to be truly significant. This number of studies is what the letter N in the method’s name stands for.
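
A minimal sketch of this calculation, assuming the usual formulation in which the z-scores of the k studies are combined as their sum divided by the square root of k, and a one-tailed significance threshold (both are assumptions of the example, as are the function name and the default alpha):

```python
from statistics import NormalDist

def rosenthal_fail_safe_n(z_scores, alpha=0.05):
    """Number of null studies (z = 0) needed to lose statistical significance."""
    k = len(z_scores)
    # One-tailed critical value, about 1.645 for alpha = 0.05.
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)
    # Solve sum(z) / sqrt(k + N) = z_alpha for N.
    n_missing = sum(z_scores) ** 2 / z_alpha ** 2 - k
    return max(0.0, n_missing)
```

The function takes one z-score per included study and returns N, which can then be compared with whatever threshold we consider reassuring.
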
The problem with this method is that it focuses on statistical significance rather than on the relevance of results. The correct thing would be to look for how many studies are necessary so that the result loses clinical relevance, not statistical significance. In addition, it assumes that the effects of missing studies are null (one in the case of relative risks and odds ratios, zero in the case of mean differences), when the effect of missing studies may go in the opposite direction to the effect we detected, or in the same direction but with a smaller magnitude.
To avoid these drawbacks there is a variation of the previous formula that looks at clinical significance rather than only at statistical significance. With this method, which is called Orwin’s fail-safe N, we calculate how many studies are needed to bring the value of the effect to a specific value, which will generally be the smallest effect that is clinically important. This method also allows specifying the average effect of missing studies.
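
As a rough sketch, and assuming an unweighted mean effect on an additive scale (so relative risks or odds ratios would go in as logarithms), Orwin’s N can be obtained by solving for the number of missing studies that drags the mean down to the chosen threshold. All names and numbers here are hypothetical.

```python
def orwin_fail_safe_n(mean_effect, n_studies, smallest_relevant_effect,
                      missing_mean_effect=0.0):
    """Missing studies needed to bring the mean effect down to a threshold."""
    low, high = sorted((missing_mean_effect, mean_effect))
    if not low < smallest_relevant_effect < high:
        raise ValueError("the threshold must lie between the missing and observed means")
    # Solve (k * mean + N * missing) / (k + N) = threshold for N.
    return (n_studies * (mean_effect - smallest_relevant_effect)
            / (smallest_relevant_effect - missing_mean_effect))
```

For example, orwin_fail_safe_n(0.5, 10, 0.2) asks how many studies with no effect at all would be needed for ten studies averaging 0.5 to drop to an average of 0.2.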

We’re leaving…

And here we leave the meta-analysis and publication bias for today. We have not talked about any other mathematical methods to detect publication bias like Begg’s and Egger’s. There is even some graphic method apart from the ones we have mentioned, such as the trim and fill method. But that is another story…

Vote counting method in reviews

No need for anyone to worry. Today we’re not going to talk about politics. Instead, today we will talk about something far more interesting. Today we will discuss vote counting among the trials of narrative reviews. What am I talking about? Keep reading and you will understand.

Let’s illustrate it with a totally fictitious, and also absurd, example. Suppose we want to know if those who watch more than two hours of TV per day have a higher risk of suffering acute attacks of dandruff. We go to our favorite database, which can be Tripdatabase or PubMed, and do a search. We get a narrative review with six papers, four of which don’t find a higher relative risk of dandruff attacks among couch potatoes and two in which significant differences were found between those who watch a lot of television and those who watch little.

What do we make of it? Is there a risk in watching too much TV? The first thing that crosses our mind is to apply the democratic norm. We can count how many studies get a risk with a significant p-value and in how many the value of p is non-significant (taking the arbitrary value of p=0.05).

Vote counting method

Good work, it seems a reasonable solution. We have two in favor and four against, so it seems clear that those “against” win, so we can quietly conclude that watching TV is not a risk factor for presenting bouts of dandruff. The problem is that we can be blundering, also quietly.

This is so because we are making a common mistake. When we do a hypothesis test we assume the null hypothesis that there is no effect. When we do the experiment we will always obtain some difference between the two groups, even if only by chance. So we calculate the probability of finding, by chance alone, a difference as large as the one we have obtained or larger. This is the value of p. If it is less than 0.05 (according to the usual convention) we say it is very unlikely to be due to chance, so we take the difference to be real.

In short, a statistically significant p suggests that the effect exists. The problem, and therein lies our mistake in the example, is that the converse does not hold. If p is greater than 0.05 (not statistically significant) it could mean that the effect does not exist, but also that the effect does exist but the study does not have sufficient statistical power to detect it.

As we know, power depends on the size of the effect and the size of the sample. Even if the effect is large, it may not reach statistical significance if the sample size is not large enough. So, faced with a p > 0.05, we cannot safely conclude that the effect is not real (we simply cannot reject the null hypothesis of no effect).
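
To put some numbers on this, here is a small sketch that approximates the power of a two-sided comparison of two means using the normal distribution. The standardized effect size, the equal group sizes and the function name are assumptions of the example, not something taken from the studies above.

```python
from math import sqrt
from statistics import NormalDist

def approximate_power(effect_size, n_per_group, alpha=0.05):
    """Normal approximation to the power of a two-sided, two-group comparison."""
    normal = NormalDist()
    z_critical = normal.inv_cdf(1.0 - alpha / 2.0)
    # Expected value of the test statistic when the effect is real.
    shift = effect_size * sqrt(n_per_group / 2.0)
    # Probability of landing beyond either critical value.
    return normal.cdf(shift - z_critical) + normal.cdf(-shift - z_critical)
```

Comparing approximate_power(0.5, 20) with approximate_power(0.5, 64) shows how the same effect is much more likely to be missed by the smaller study.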

Given this, how can we hold a vote counting how many studies are for and how many against? Some of the non-significant studies may lack enough power rather than reflect the absence of an effect. In our example, we have four non-significant studies and two significant ones but, how can we be sure that the four non-significant ones mean absence of effect? We have seen that we can’t.

We need a weighted result

The right thing to do in these cases is to apply meta-analysis techniques and get a weighted summary value of all the studies in the review. Let’s see another example with the five studies depicted in the attached figure. Although the relative risks of the five studies show a protective effect (they are less than 1, the null value), none reaches statistical significance because their confidence intervals cross the null value, which is one for relative risks.

However, if we compute a weighted summary, it has greater precision than the individual studies, so that while the relative risk value is the same, the confidence interval is narrower and does not cross the null value: it is statistically significant.
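
Since the figure’s actual numbers are not reproduced here, the following sketch uses five invented studies, each with a relative risk of 0.80 and a standard error of 0.15 on the log scale, and pools them with inverse-variance weights under a fixed effect assumption. Every value and name in it is hypothetical.

```python
from math import exp, log, sqrt

Z_95 = 1.96  # normal quantile for a 95% confidence interval

# Hypothetical data: (relative risk, standard error of log RR) for five studies
studies = [(0.80, 0.15)] * 5

def ci_from_log(log_rr, se):
    """95% confidence interval, returned on the relative risk scale."""
    return exp(log_rr - Z_95 * se), exp(log_rr + Z_95 * se)

def pool_relative_risks(studies):
    """Inverse-variance pooled relative risk, computed on the log scale."""
    weights = [1.0 / se ** 2 for _, se in studies]
    total = sum(weights)
    pooled_log = sum(w * log(rr) for w, (rr, _) in zip(weights, studies)) / total
    pooled_se = sqrt(1.0 / total)
    return exp(pooled_log), ci_from_log(pooled_log, pooled_se)

for rr, se in studies:
    low, high = ci_from_log(log(rr), se)
    print(f"study:  RR {rr:.2f} (95% CI {low:.2f} to {high:.2f})")

pooled_rr, (pooled_low, pooled_high) = pool_relative_risks(studies)
print(f"pooled: RR {pooled_rr:.2f} (95% CI {pooled_low:.2f} to {pooled_high:.2f})")
```

With these made-up numbers every individual interval includes 1, while the pooled interval stays below it, which is the situation described above.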

Applying the vote counting method we could have concluded that there is no protective effect, while it seems likely that it exists when we apply the right method. In short, the voting method is unreliable and should not be used.

We’re leaving…

And that’s all for today. You see that democracy, although good in politics, is not so much when talking about statistics. We have not discussed anything about how we get the weighted sum of all the studies of the review. There are several methods applied in meta-analysis, including the fixed effect and the random effects model. But that’s another story…

Funnel asymmetry

Achilles. What a man! Definitely one of the main characters in the mess that ensued in Troy because of Helen, a.k.a. the beauty. You know his story. In order to make him invulnerable his mother, who was none other than the nymph Thetis, bathed him in ambrosia and submerged him in the River Styx. But she made a mistake that no nymph should be allowed: she held him by his right heel, which did not get wet with the river’s water. And so his heel became his only vulnerable part. Hector didn’t realize it in time but Paris, far savvier, put an arrow in Achilles’ heel and sent him back to the Styx, though not into the water but to the other side. Without Charon the Boatman.

This is the source of the expression Achilles’s heel, a metaphor for a vulnerable spot of someone or something that is otherwise usually known for their strength.

As an example, something as robust and formidable as a meta-analysis has its Achilles’s heel: publication bias. And that’s because in the world of science there is no social justice.

All scientific works should have the same opportunities to be published and become famous, but that is far from reality and those works can be discriminated against for four reasons: statistical significance, popularity of their topic, having someone who sponsors them and the language in which they are written.

The truth is that papers with statistically significant results are more likely to be published than those with non-significant ones. Moreover, even if accepted, the former are likely to be published sooner and, more often, in English-language journals, which are more prestigious and more widely read. As a result, those papers are cited more frequently. And the same goes for papers with “positive” results versus those with “negative” results.

Similarly, papers about issues of public interest are more likely to be published regardless of the significance of their results. In addition, the sponsor also has its influence: a company that finances a study about one of its products is not going to be keen to publish the results if they go against the usefulness of the product concerned. And finally, papers written in English are more widely read than those written in other languages.

All of these can be worsened by the fact that these same factors may influence the choice of inclusion and exclusion criteria for primary studies in the meta-analysis, so we may get a sample of papers that may not be representative of global knowledge about the topic addressed in the systematic review and meta-analysis.

If there’s publication bias the applicability of results will be seriously compromised. This is why we say that publication bias is the Achilles’s heel of meta-analysis.

If we choose inclusion and exclusion criteria correctly and we do a global literature search without restrictions, we will have done our best to minimize the risk of bias, but we can never be sure of having prevented it. Therefore, some techniques and tools have been developed to detect publication bias.

The most widely used is a tool known by its friendly name: funnel plot. It represents the magnitude of the effect measured (X axis) versus a precision measure (Y axis), which is usually the sample size, but may also be the inverse of the variance or the standard error. Each primary study is represented with a dot and we only have to observe the dot cloud shape.

In its most common form, with the sample size represented on the Y axis, the precision of results is higher for studies with larger samples, so dots are closer to each other at the top of the plot and are increasingly scattered toward the bottom of the plot, near the origin of the Y axis. Thus, the cloud has a funnel shape, with the wide part downwards. The shape should be symmetrical and, if not, we must always suspect the existence of publication bias. In the second example that I show, you can see how there are “missing” studies on the side of lack of effect: this may mean that only studies with positive results have been published.

This approach is very simple to use but, sometimes, we may have doubts about the funnel asymmetry, especially if the number of studies is small. In addition, the funnel may be asymmetric due to a deficient quality of studies or because we are dealing with interventions whose effect varies with the sample of each study. For these situations, other methods have been devised that are more objective, such as Begg’s rank correlation test and Egger’s linear regression.

Begg’s test examines the association between the effect estimates and their variances. If they are correlated, that is a bad sign, because it suggests publication bias. The problem with this test is that it has low statistical power, so it’s unreliable when the number of primary studies is small.
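
Here is a simplified sketch of the idea, assuming the usual formulation of the test: each effect is standardized around the inverse-variance pooled estimate and Kendall’s tau is then computed against the variances. It ignores ties and the continuity correction and relies on a normal approximation, so take it as an illustration rather than a replacement for proper software. The function name is hypothetical.

```python
from math import sqrt

def begg_rank_correlation(effects, variances):
    """Kendall's tau between standardized effects and their variances."""
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    pooled = sum(w * t for w, t in zip(weights, effects)) / total_weight
    # Standardized deviates of each effect from the pooled estimate.
    deviates = [
        (t - pooled) / sqrt(v - 1.0 / total_weight)
        for t, v in zip(effects, variances)
    ]
    n = len(effects)
    score = 0  # concordant pairs minus discordant pairs
    for i in range(n):
        for j in range(i + 1, n):
            product = (deviates[i] - deviates[j]) * (variances[i] - variances[j])
            if product > 0:
                score += 1
            elif product < 0:
                score -= 1
    tau = score / (n * (n - 1) / 2)
    # Normal approximation for the score under no association (no ties).
    z = score / sqrt(n * (n - 1) * (2 * n + 5) / 18)
    return tau, z
```

A tau close to zero is what we would hope for; a tau far from zero means that the least precise studies tend to report systematically different effects.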

Egger’s test is more specific than Begg’s. It fits the regression line between the precision of the studies (independent variable) and the standardized effect (dependent variable). This regression must be weighted by the inverse of the variance, so I do not recommend doing it on your own unless you are a consummate statistician. When there isn’t publication bias the regression line crosses the Y axis at zero. The further the intercept lies from zero, the stronger the evidence of publication bias.
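
For the curious, this sketch fits the line of the standardized effect (effect divided by its standard error) on the precision (one divided by the standard error) and returns its intercept. For simplicity it uses an ordinary, unweighted least-squares fit, which is an assumption of mine: the weighting mentioned above and the significance test of the intercept are not shown. The function name is hypothetical.

```python
def egger_intercept(effects, standard_errors):
    """Intercept and slope of standardized effect regressed on precision."""
    precision = [1.0 / se for se in standard_errors]
    standardized = [y / se for y, se in zip(effects, standard_errors)]
    n = len(precision)
    mean_x = sum(precision) / n
    mean_y = sum(standardized) / n
    sxx = sum((x - mean_x) ** 2 for x in precision)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(precision, standardized))
    slope = sxy / sxx
    # The intercept is the quantity of interest: it should be near zero
    # when small and large studies tell the same story.
    intercept = mean_y - slope * mean_x
    return intercept, slope
```

If every study estimates the same effect whatever its size, the intercept comes out at zero and the slope equals that common effect; when the small studies report inflated effects, the intercept moves away from zero.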

As always, there are computer programs available to make these tests quickly without us having to get our brains fried with their calculations.

And what if, after doing all the work, we find that there is publication bias? Can we do anything to adjust for it? As always, we can.

The easiest way is to use a graphical technique that is called trim and fill adjustment. It works as follows: a) draw the funnel plot, b) remove small studies that make the funnel asymmetric, c) recalculate the new center of the graph, d) put back removed studies and add their reflections at the other side of the middle line of the cloud, e) re-estimate the effect.

And finally, let me just say that there’s a second method that is much more accurate but also much more complex, which consists of a regression model based on Egger’s test. But that’s another story…

Don’t let the trees stop you from seeing the wood

A long, long time ago a squirrel could cross the Iberian Peninsula without getting off the trees. Such was the lushness of our land. But don’t you be so sure that this is true, because some people think that it’s nothing more than a myth. Anyway, I wonder if the squirrel in question would realize it was in a great forest. I guess yes, but you never know: sometimes you can’t see the forest for the trees or, rather, you can’t see the entire forest.

In any case, a modern squirrel would not have this problem. There’s no doubt that today it could not cross the Peninsula without getting off the trees but, instead, it could cross the entire country without getting off the head of a fool. As I read one day on a blog, there are more stupid people than bottles of beer, and they are also strategically placed so that you run into at least a couple of them each day.

Meta-analysis is also a sort of forest whose trees would be its primary studies. How poetic! But, in this case, the trees don’t prevent you from seeing anything; rather, they help you to see the entire forest as a whole. Of course, for that, the meta-analysis results must be presented in a proper way.

Until recently we could follow the advice of the QUOROM statement, but this statement was updated to become PRISMA, which devotes seven of its 27 items to advice on how to present the results of a meta-analysis.

First, we have to detail the study selection process: how many studies we have found and evaluated, how many were selected, and how many rejected and why. The flow chart that the systematic review should include, if the authors have followed the PRISMA statement scheme, comes in very handy for this purpose.

Second, we must specify the characteristics of the primary studies, detailing what data we have extracted from each one of them. We must also provide their corresponding citations to make things easier for any reader who wants to check the data if he or she doesn’t trust us. Along the same lines is the third recommendation, which refers to the assessment of the studies’ risk of bias and internal validity.

Fourth, we have to present the results of each individual study, giving a summary measure for each intervention group analyzed as well as the calculated estimates with their confidence intervals. These data will provide the information we’ll need in the next step, PRISMA’s fifth recommendation on the presentation of results, which is none other than the global synthesis of the meta-analysis: pooled estimates, confidence intervals, results of the homogeneity study and so on.

This is usually done graphically using a popular tool known by the name of forest plot. This is a kind of forest where trees are the meta-analysis’ primary studies and which summarizes all relevant results of quantitative synthesis.

The Cochrane Collaboration recommends structuring the forest plot in five different columns. Column 1 shows the primary studies or the groups or subgroups of patients included in the meta-analysis. They are usually represented by an ID composed of the author’s name and the date of publication.

Column 2 gives you the effect measure as recorded or calculated by the authors of each included study.

Column 3 is the actual forest plot. It shows the effect measures represented on both sides of the vertical line of no effect. We already know the null value is equal to zero for mean differences and to one for odds ratios, relative risks, hazard ratios, etc. Each study is represented by a blob whose area is proportional to the weight with which the study contributes to the pooled effect. Also, each blob sits on a horizontal line that represents its confidence interval.

These confidence intervals tell us about the precision of the studies and show which ones are statistically significant: those whose interval doesn’t cross the line of no effect. Still, don’t forget that even when an interval crosses the line of no effect and the result is not statistically significant, the position of its limits gives us a lot of information about the clinical relevance of each study’s results. Finally, at the bottom there’s a diamond that represents the pooled effect of the meta-analysis. Its position with respect to the line of no effect tells us about the statistical significance of the effect, while its width tells us about its precision (its confidence interval). Also, at the top you can find the type of measure represented in the forest plot, the statistical method used to pool it (fixed effect model or random effects model) and the confidence level of the intervals (conventionally 95%).

The graph is usually completed with a fourth column showing the percent weight of each study’s estimate in the pooled result and a fifth column with the effect estimates in numbers. And in a little corner of that forest you will find the heterogeneity measure that the authors have calculated, along with its statistical significance when appropriate.
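To make the columns less abstract, here is a minimal sketch in Python of the numbers that sit behind a forest plot. The three studies, their log odds ratios and their standard errors are made up for illustration, and the function name `forest_rows` is hypothetical. The percent weights are computed by inverse variance, which is only one of the ways real software may weight studies.

```python
import math

# Hypothetical studies: (label, log odds ratio, standard error). Made-up numbers.
studies = [
    ("Study A 2001", -0.50, 0.20),
    ("Study B 2005", -0.10, 0.15),
    ("Study C 2010", -0.30, 0.25),
]

Z_95 = 1.96  # normal quantile used for a 95% confidence interval


def forest_rows(studies):
    """Return one dict per study with the numbers a forest plot displays."""
    # Assumption: inverse-variance weights (1 / SE^2), as in a fixed effect analysis.
    weights = [1.0 / se ** 2 for _, _, se in studies]
    total = sum(weights)
    rows = []
    for (label, log_or, se), w in zip(studies, weights):
        low = math.exp(log_or - Z_95 * se)
        high = math.exp(log_or + Z_95 * se)
        rows.append({
            "label": label,
            "or": math.exp(log_or),
            "low": low,
            "high": high,
            "weight_pct": 100.0 * w / total,
            # Significant when the interval leaves out the null value (1 for ratios).
            "significant": not (low <= 1.0 <= high),
        })
    return rows


for r in forest_rows(studies):
    print(f"{r['label']:<14} OR {r['or']:.2f} [{r['low']:.2f}, {r['high']:.2f}]  "
          f"weight {r['weight_pct']:.1f}%  significant: {r['significant']}")
```

Each printed row corresponds to one tree of the forest: the study ID, the effect estimate with its interval, the percent weight, and whether the interval crosses the line of no effect.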

After the presentation of results, PRISMA recommends a sixth section detailing any risk of bias that has been assessed and a seventh with any additional analyses that have been performed, as needed: stratification, sensitivity analysis, meta-regression, etc.

As you can see, nothing is simple about meta-analysis. That’s why Cochrane recommends following a series of steps to interpret the results properly. These are the following:

1. Check which variable is compared and how. You can usually find this at the top of the forest plot.

2. Find out the effect measure used. This is needed to interpret the results accurately: a hazard ratio is not the same as a mean difference or whatever other measure has been used.

3. Look at the diamond, its location and width. It’s also very useful to examine the numerical value of the pooled estimate and its confidence interval.

4. Check that heterogeneity has been assessed. You can get an idea at a glance by looking at whether the segments representing the primary studies are widely scattered and checking whether they overlap with each other. In any case, there must always be a statistical parameter to quantify heterogeneity. If we find that there’s heterogeneity, the next thing to do is to look for the authors’ explanation of it.

5. Draw your conclusions. We look at which side of the line of no effect the pooled estimate and its confidence interval lie on. You know that, even when the result is statistically significant, the limit of the interval closest to that line should be as far away from it as possible: clinical relevance and statistical significance are not always synonymous. Finally, go back to the homogeneity study. If heterogeneity exists, the reliability of the results could be compromised.
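As for the statistical parameter mentioned in step 4, two common choices are Cochran’s Q and the I² statistic. The text above does not say which one a given review will report, so take the following Python sketch as an assumption about what you are likely to find. The function name `heterogeneity` and the numbers are made up for illustration.

```python
def heterogeneity(effects, std_errors):
    """Return Cochran's Q, its degrees of freedom and I^2 (as a percentage)."""
    weights = [1.0 / se ** 2 for se in std_errors]  # inverse-variance weights
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    # Q: weighted sum of squared deviations of each study from the pooled effect.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I^2: share of the total variability attributed to heterogeneity, floored at 0.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, df, i2


# Hypothetical effect estimates (e.g. log odds ratios) and their standard errors.
effects = [-0.50, -0.10, -0.30]
std_errors = [0.20, 0.15, 0.25]

q, df, i2 = heterogeneity(effects, std_errors)
print(f"Q = {q:.2f} on {df} degrees of freedom, I^2 = {i2:.1f}%")
```

Q is usually compared against a chi-square distribution with `df` degrees of freedom to obtain its p-value; that step is left out here to keep the sketch within the standard library.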

And here we finish with the presentation of results and the forest plot. As a matter of fact, the forest plot is not exclusive to meta-analysis, as it can be used whenever we want to compare studies to assess their clinical or statistical significance, or in other cases such as equivalence studies, in which the null line is flanked by the equivalence thresholds. But it still has further utility. A variation of the forest plot is also used to assess publication bias in systematic reviews, although in those cases it is often called a funnel plot instead of a forest plot. But that’s another story…

Take care of the pennies, and the pounds will take care of themselves

All of you will know the Chinese tale about the poor lone grain of rice that falls to the ground and no one hears. Of course, if instead of a grain it’s a sack of rice that falls, that’s another matter. There are many examples that show how unity creates strength. A lone red ant is harmless, unless it bites you in some soft and noble zone, which are usually the most sensitive parts. But what about a swarm of millions of red ants? That scares the crap out of you, because if they all come at you together there’s little you can do to stop them. Yes, the sum of many “fews” makes a “lot”.

And that is true about statistics too. With the aid of a relatively small sample of well-chosen voters we can estimate who will win an election in which millions of people vote. So imagine what we could do with a lot of those samples. I’m sure that the estimate would be more reliable and generalizable.

Well, this is precisely one of the purposes of meta-analysis, which uses statistical techniques to come up with a quantitative synthesis from results of a series of studies that aim to answer the same question but don’t get exactly the same result.

We know we must check for heterogeneity among studies before combining them because, otherwise, combining them would make little sense and the results we would get wouldn’t be valid or generalizable. For this purpose there are a number of methods, both numerical and graphical, to check that we have the homogeneity we need.

The next step is to analyze the effect size estimates of the studies, weighting them according to each one’s contribution to the pooled result. The most common approach is to weight the effect size estimates by the inverse of their variance and then carry out the analysis to obtain an average effect. There are various ways to do this, but the most commonly used methods are the fixed effects model and the random effects model. The two models differ in their assumptions about the population from which the primary studies come.

The fixed effects model considers that there’s no heterogeneity and that all studies estimate the same effect size in the same population. So, it’s assumed that the variability observed among individual studies is due solely to the error that occurs when performing random sampling in each study. This error is measured by estimating the intra-study variance, assuming that differences in effect size estimates are only due to the use of different samples of subjects.

On the other hand, the random effects model assumes that the effect size follows a normal distribution in the population, so each study estimates a different effect size. Therefore, in addition to the intra-study variance due to random sampling, this model also includes the variability among studies, which represents the deviation of each study from the average effect size. These two errors are mutually independent and both of them contribute to the variance of the estimates.

In summary, the fixed effects model incorporates only one error term for the variability in each study, while the random effects model further adds another error term due to the variability among studies.

You can see I have not written a single formula. Actually, we don’t need them and they’re quite unfriendly, filled with Greek letters that no one can understand. But don’t worry. As always, statistical software lets you easily calculate the results, adding and removing studies from the model, as well as switching between models as we want.

Which model we choose matters. If there’s no heterogeneity we can use the fixed effects model. But if we find that our studies are heterogeneous, though not enough to advise against combining them, it is preferable to use the random effects model.

Another aspect to keep in mind is the applicability or external validity of the meta-analysis result. If we use the fixed effects model it will not be safe to generalize the results to populations that are different from those of the included studies. This does not happen with the random effects model, whose external validity is higher because it takes into account the different populations of the different studies.

In any case, we’ll come up with an average effect measure along with its confidence interval. The result won’t be statistically significant if this confidence interval crosses the line of no effect, which we already know is zero for mean differences and one for odds ratios and relative risks. In addition, the width of the interval tells us about the precision of the estimated effect in the population: the wider the interval, the less precise the estimate, and vice versa.

If you think about it you will understand why the random effects model is more conservative than the fixed effects model: the confidence intervals it produces are wider (less precise) because it incorporates more variability into its analysis. In some cases the estimate could be significant using the fixed effects model and not significant using the random effects model. But that shouldn’t be a reason for choosing one model over the other. We must always decide taking into account our previous heterogeneity study and, in case we have doubts, we can use both methods and compare the results.
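For those who want to peek at what the software does, the contrast between the two models can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the effect estimates and standard errors are made up, the function names are hypothetical, and the between-study variance is estimated with the DerSimonian-Laird method, which is one common option among several and is not named in the text above.

```python
import math


def pool_fixed(effects, std_errors):
    """Fixed effects model: inverse-variance weighted average and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    estimate = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return estimate, math.sqrt(1.0 / sum(weights))


def between_study_variance(effects, std_errors):
    """DerSimonian-Laird estimate of the variance among studies (an assumed choice)."""
    weights = [1.0 / se ** 2 for se in std_errors]
    fixed, _ = pool_fixed(effects, std_errors)
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
    return max(0.0, (q - df) / c)  # floored at zero when studies look homogeneous


def pool_random(effects, std_errors):
    """Random effects model: each weight includes the extra among-study error term."""
    tau2 = between_study_variance(effects, std_errors)
    weights = [1.0 / (se ** 2 + tau2) for se in std_errors]
    estimate = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return estimate, math.sqrt(1.0 / sum(weights))


def ci95(estimate, se):
    """95% confidence interval based on the normal approximation."""
    return estimate - 1.96 * se, estimate + 1.96 * se


# Hypothetical mean differences and standard errors from four studies.
effects = [-0.60, -0.05, -0.45, 0.10]
std_errors = [0.15, 0.12, 0.20, 0.18]

for name, pool in (("fixed", pool_fixed), ("random", pool_random)):
    estimate, se = pool(effects, std_errors)
    low, high = ci95(estimate, se)
    print(f"{name:>6}: {estimate:.3f} [{low:.3f}, {high:.3f}]")
```

With these made-up numbers the fixed effects interval stays below zero while the random effects interval is wider and includes it, which is the situation described above. When the studies look homogeneous the estimated among-study variance is zero and both models give the same result.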

And now, it only remains to present the results in a proper way. Meta-analysis results are usually represented using a specific chart called the forest plot. But that’s another story…