Science without sense…double nonsense

Pills on evidence-based medicine

Achilles and Effects Forest


Achilles. What a man! Definitely one of the main characters among those who got into that mess that broke out in Troy because of Helen, a.k.a. the beauty. You know his story. In order to make him invulnerable, his mother, who was none other than Thetis, the nymph, bathed him in ambrosia and submerged him in the River Styx. But she made a mistake that should not be allowed to any nymph: she held him by his right heel, which did not get wet with the river’s water. And so his heel became his only vulnerable spot. Hector did not realize it in time but Paris, totally on the ball, put an arrow in Achilles’ heel and sent him back to the Styx; not into the water this time, but to the other side. And without Charon the Ferryman.

This story is the origin of the expression “Achilles’ heel”, which usually refers to the weakest or most vulnerable point of someone or something that, otherwise, is usually known for its strength.

For example, something as robust and formidable as meta-analysis has its Achilles’ heel: publication bias. And that’s because in the world of science there is no social justice.

All scientific works should have the same opportunities to be published and achieve fame, but the reality is not at all like that and they can be discriminated against for four reasons: statistical significance, popularity of the topic they are dealing with, having someone to sponsor them and the language in which they are written.

These are the main factors that can contribute to publication bias. First, studies with significant results are more likely to be published and, among these, even more so when the effect is larger. This means that studies with negative results or effects of small magnitude may never be published, so we would draw a biased conclusion from an analysis of only the large studies with positive results. In the same way, papers on topics of public interest are more likely to be published regardless of the importance of their results. In addition, the sponsor also has an influence: a company that funds a study of one of its products that has gone wrong will probably not publish it, so that we do not all find out that its product is useless.

Secondly, as is logical, published studies are more likely to reach our hands than those that are not published in scientific journals. This is the case of doctoral theses, conference communications, reports from government agencies or even studies pending publication by researchers working on the subject we are dealing with. For this reason it is so important to do a search that includes this type of work, which falls under the term grey literature.

Finally, we can list a series of biases that influence the likelihood that a work will be published or retrieved by the researcher performing the systematic review, such as language bias (the search is limited by language), availability bias (including only those studies that are easy for the researcher to retrieve), cost bias (including only studies that are free or cheap), familiarity bias (including only those from the researcher’s own discipline), duplication bias (studies with significant results are more likely to be published more than once) and citation bias (studies with significant results are more likely to be cited by other authors).

One may think that this loss of studies during the review cannot be so serious, since it could be argued, for example, that studies not published in peer-reviewed journals are usually of poorer quality, so they do not deserve to be included in the meta-analysis. However, it is not clear that scientific journals ensure the methodological quality of studies, or that this is the only way to do so. There are researchers, like those of government agencies, who are not interested in publishing in scientific journals, but in preparing reports for those who commission them. In addition, peer review is not a guarantee of quality since, too often, neither the researcher who carries out the study nor those in charge of reviewing it have the training in methodology that would ensure the quality of the final product.

All this can be worsened by the fact that these same factors can influence the inclusion and exclusion criteria of the meta-analysis primary studies, in such a way that we obtain a sample of articles that may not be representative of the global knowledge on the subject of the systematic review and meta-analysis.

If we have a publication bias, the applicability of the results will be seriously compromised. That is why we say that the publication bias is the true Achilles’ heel of meta-analysis.

If we correctly delimit the inclusion and exclusion criteria of the studies and do a global and unrestricted search of the literature we will have done everything possible to minimize the risk of bias, but we can never be sure of having avoided it. That is why techniques and tools have been devised for its detection.

The most commonly used one has the friendly name of funnel plot. It plots the magnitude of the measured effect (X axis) against a measure of precision (Y axis), which is usually the sample size but can also be the inverse of the variance or the standard error. Each primary study is represented by a point, and we observe the point cloud.

In its most usual form, with sample size on the Y axis, the precision of the results will be higher in studies with larger samples, so the points will be closer together at the top of the axis and will spread out as they approach the origin of the Y axis. We thus observe a cloud of points in the shape of a funnel, with the wide part at the bottom. This graph should be symmetrical and, if it is not, we should always suspect publication bias. In the second example attached you can see how there are “missing” studies on the side of lack of effect: this may mean that only studies with positive results get published.
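Just to make the idea concrete, here is a minimal sketch in Python (with numpy and matplotlib and completely made-up studies) of a funnel plot drawn with the standard error on the Y axis, one of the precision measures mentioned above; it is an illustration, not a recipe.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Hypothetical primary studies: a true effect of 0.4 and standard errors of varying size
se = rng.uniform(0.05, 0.5, 30)          # small SE = large, precise study
effect = 0.4 + rng.normal(0, se)         # sampling error grows as precision falls

fig, ax = plt.subplots()
ax.scatter(effect, se)
ax.axvline(0.4, linestyle="--", color="grey")   # the underlying effect (known here by construction)
ax.set_xlabel("effect size")
ax.set_ylabel("standard error")
ax.invert_yaxis()                        # most precise studies at the top: the funnel opens downwards
plt.show()
```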

This method is very simple to use but, sometimes, we may have doubts about the asymmetry of our funnel, especially if the number of studies is small. In addition, the funnel can be asymmetrical because of quality defects in the studies or because we are dealing with interventions whose effect varies according to the sample size of each study. For these cases, other more objective methods have been devised, such as Begg’s rank correlation test and Egger’s linear regression test.

Begg’s test looks for an association between the effect estimates and their variances. If there is a correlation between them, bad news. The problem with this test is that it has little statistical power, so it is not reliable when the number of primary studies is small.
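Only as an orientation, the idea behind Begg’s test can be sketched as a rank correlation between the effect estimates and their variances; this simplified version uses scipy’s Kendall tau on invented data rather than the exact Begg-Mazumdar statistic.

```python
import numpy as np
from scipy.stats import kendalltau

# Hypothetical effect estimates (log odds ratios) and their variances
effects = np.array([0.80, 0.60, 0.50, 0.45, 0.40, 0.38, 0.35])
variances = np.array([0.30, 0.25, 0.15, 0.10, 0.08, 0.05, 0.03])

# If small (high-variance) studies tend to show larger effects, effects and
# variances will be correlated, which suggests publication bias
tau, p_value = kendalltau(effects, variances)
print(f"Kendall tau = {tau:.2f}, p = {p_value:.3f}")
```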

Egger’s test, more specific than Begg’s, consists of fitting the regression line between the precision of the studies (independent variable) and the standardized effect (dependent variable). This regression must be weighted by the inverse of the variance, so I do not recommend that you do it on your own unless you are a consummate statistician. When there is no publication bias, the regression line passes through the zero of the Y axis. The further its intercept is from zero, the more evidence of publication bias.
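Again as a sketch only, following the description above (invented data, statsmodels for the regression): we regress the standardized effect on the precision and look at how far the intercept falls from zero.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical effects (log odds ratios) and standard errors
effects = np.array([0.80, 0.60, 0.50, 0.45, 0.40, 0.38, 0.35])
se = np.array([0.55, 0.50, 0.39, 0.32, 0.28, 0.22, 0.17])

std_effect = effects / se     # standardized effect (dependent variable)
precision = 1 / se            # precision (independent variable)

model = sm.OLS(std_effect, sm.add_constant(precision)).fit()
intercept = model.params[0]
print(f"intercept = {intercept:.2f} (p = {model.pvalues[0]:.3f})")
# An intercept far from zero (and significant) points towards publication bias
```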

As always, there are computer programs that do these tests quickly without having to burn your brain with the calculations.

What if after doing the work we see that there is publication bias? Can we do something to adjust it? As always, we can.

The simplest way is to use a graphical method called trim and fill. It consists of the following: a) we draw the funnel plot; b) we remove the small studies so that the funnel becomes symmetrical; c) the new centre of the graph is determined; d) we put back the previously removed studies and add their mirror image on the other side of the centre line; and e) we estimate the effect again.

Another, very conservative attitude we can adopt is to assume that there is a publication bias and ask how much it affects our results, assuming that we have left out studies not included in the analysis.

The only way to know whether publication bias really affects our estimates would be to compare the effect in the retrieved and the unretrieved studies but, of course, then we would not have to worry about publication bias at all.

To know whether the observed result is robust or, on the contrary, susceptible to being biased by publication bias, two fail-safe N methods have been devised.

The first is Rosenthal’s fail-safe N method. Suppose we have a meta-analysis with an effect that is statistically significant, for example, a risk ratio greater than one with a p < 0.05 (or a 95% confidence interval that does not include the null value, one). Then we ask ourselves a question: how many studies with RR = 1 (the null value) would we have to add before p stops being significant? If few studies (fewer than 10) are enough to nullify the effect, we should worry, because the effect may actually be null and our significance may be the product of publication bias. On the contrary, if many studies are needed, the effect is likely to be truly significant. This number of studies is what the letter N in the name of the method stands for.
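A minimal sketch of the calculation, assuming the classical Rosenthal formula based on the combined z values of the studies (the z values below are invented, and alpha is one-tailed 0.05):

```python
import numpy as np

# Hypothetical z values of the primary studies, all pointing in the same direction
z = np.array([2.1, 1.8, 2.5, 1.6, 2.9, 2.2])
k = len(z)
z_alpha = 1.645                       # one-tailed critical value for alpha = 0.05

# Fail-safe N: how many null studies would dilute the combined z below significance
n_fs = z.sum()**2 / z_alpha**2 - k
print(f"fail-safe N = {n_fs:.0f}")    # a small N (say, fewer than 10) should worry us
```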

The problem with this method is that it focuses on statistical significance and not on the relevance of the results. The correct thing would be to look for how many studies are needed for the result to lose clinical relevance, not statistical significance. In addition, it assumes that the effect of the missing studies is null (one in the case of risk ratios and odds ratios, zero in the case of differences in means), when the effect of the missing studies could go in the opposite direction to the effect we detect, or in the same direction but with a smaller magnitude.

To avoid these disadvantages there is a variation of the previous formula that assesses statistical significance and clinical relevance. With this method, called Orwin’s fail-safe N, we calculate how many studies are needed to bring the value of the effect down to a specific value, which will generally be the smallest effect that is clinically relevant. This method also allows us to specify the average effect of the missing studies.
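And a sketch of Orwin’s variant, assuming the usual formula; the observed effect, the smallest clinically relevant effect and the assumed mean effect of the missing studies are all made up:

```python
def orwin_fail_safe_n(d_observed, d_criterion, d_missing=0.0, k=10):
    """How many missing studies with mean effect d_missing would drag the
    combined effect of k studies down to d_criterion."""
    return k * (d_observed - d_criterion) / (d_criterion - d_missing)

# 10 studies with a mean standardized difference of 0.45, smallest clinically
# relevant effect of 0.20, missing studies assumed to be null
print(orwin_fail_safe_n(d_observed=0.45, d_criterion=0.20, d_missing=0.0, k=10))  # 12.5
```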

To end the explanation of meta-analysis, let’s see the right way to express the results of the data analysis. To do it well, we can follow the recommendations of the PRISMA statement, which devotes seven of its 27 items to advising us on how to present the results of a meta-analysis.

First, we must report the study selection process: how many studies we found and evaluated, how many we selected and how many we rejected, explaining in addition the reasons for doing so. For this, the flowchart that the systematic review from which the meta-analysis proceeds should include, if it complies with the PRISMA statement, is very useful.

Secondly, the characteristics of the primary studies must be specified, detailing what data we extract from each one of them and their corresponding bibliographic citations, so that any reader of the review can verify the data if they do not trust us. Related to this is the third item, which refers to the evaluation of the risk of bias of the studies and their internal validity.

Fourth, we must present the results of each individual study with summary data from each intervention group analysed, together with the calculated estimators and their confidence intervals. These data will help us to compile the information that PRISMA asks for in its fifth item, referring to the presentation of results, which is none other than the synthesis of all the studies in the meta-analysis, their confidence intervals, the results of the homogeneity study, and so on.

This is usually done graphically by means of an effects diagram, a graphical tool popularly known as forest plot, where the trees would be the primary studies of the meta-analysis and where all the relevant results of the quantitative synthesis are summarized.

The Cochrane Collaboration recommends structuring the forest plot into five clearly differentiated columns. Column 1 lists the primary studies, or the groups or subgroups of patients, included in the meta-analysis. They are usually represented by an identifier composed of the name of the first author and the date of publication.

Column 2 shows the results of the measures of effect of each study as reported by their respective authors.

Column 3 is the actual forest plot, the graphical part of the business. It shows the measures of effect of each study on both sides of the line of no effect, which we already know corresponds to zero for mean differences and to one for odds ratios, risk ratios, hazard ratios, and the like. Each study is represented by a square whose area is usually proportional to its contribution to the overall result. In addition, the square sits on a segment that represents the extremes of its confidence interval.

These confidence intervals inform us about the precision of the studies and tell us which are statistically significant: those whose interval does not cross the line of no effect. Anyway, do not forget that, even when an interval crosses the line of no effect and is not statistically significant, its boundaries can give us a lot of information about the clinical relevance of the results of each study. Finally, at the bottom of the chart we will find a diamond that represents the overall result of the meta-analysis. Its position with respect to the line of no effect will inform us about the statistical significance of the overall result, while its width will give us an idea of its precision (its confidence interval). Furthermore, at the top of this column we will find the type of effect measure, the analysis model used (fixed or random) and the significance level of the confidence intervals (typically 95%).

This chart is usually completed by a fourth column with the estimated weight of each study expressed as a percentage, and a fifth column with the weighted effect estimates of each one. And in some corner of this forest there will be the measure of heterogeneity that has been used, along with its statistical significance when relevant.
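To get a feel for the graphical column, here is a minimal matplotlib sketch (invented studies, inverse-variance weights and a fixed effect pooled estimate); a real review would, of course, let RevMan or a dedicated package draw the full five-column plot.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical studies: log risk ratios and their standard errors
studies = ["Alpha 2015", "Beta 2017", "Gamma 2019", "Delta 2021"]
log_rr = np.array([-0.35, -0.10, -0.22, 0.05])
se = np.array([0.18, 0.12, 0.25, 0.15])

w = 1 / se**2                                   # inverse-variance weights
pooled = np.sum(w * log_rr) / np.sum(w)         # fixed effect pooled estimate
pooled_se = np.sqrt(1 / np.sum(w))

y = np.arange(len(studies), 0, -1)              # one row per study, top to bottom
fig, ax = plt.subplots()
ax.errorbar(log_rr, y, xerr=1.96 * se, fmt="none", ecolor="black")      # 95% CIs
ax.scatter(log_rr, y, s=300 * w / w.max(), marker="s", color="black")   # area ~ weight

# Diamond for the overall result: its width is the 95% confidence interval
ci = 1.96 * pooled_se
ax.fill([pooled - ci, pooled, pooled + ci, pooled], [0, 0.25, 0, -0.25], color="grey")

ax.axvline(0, linestyle="--", color="black")    # line of no effect (0 on the log scale, RR = 1)
ax.set_yticks(list(y) + [0])
ax.set_yticklabels(studies + ["Overall"])
ax.set_xlabel("log risk ratio (95% CI)")
plt.show()
```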

To conclude the presentation of the results, PRISMA recommends a sixth section with the evaluation of the risk of bias across studies and a seventh with all the additional analyses that were necessary: stratification, sensitivity analysis, meta-regression, etc.

As you can see, nothing about meta-analysis is easy. Therefore, the Cochrane Collaboration recommends following a series of steps to correctly interpret the results. Namely:

  1. Verify which variable is compared and how. It is usually seen at the top of the forest plot.
  2. Locate the measure of effect used. This is logical and necessary to know how to interpret the results. A hazard ratio is not the same as a difference in means, or whatever else was used.
  3. Locate the diamond, its position and its amplitude. It is also convenient to look at the numerical value of the global estimator and its confidence interval.
  4. Check that heterogeneity has been studied. This can be seen by looking at whether the segments that represent the primary studies are or are not very dispersed and whether they overlap or not. In any case, there will always be a statistic that assesses the degree of heterogeneity. If we see that there is heterogeneity, the next thing will be to find out what explanation the authors give about its existence.
  5. Draw our conclusions. We will look at which side of the line of no effect the overall effect and its confidence interval lie. You already know that, even when it is significant, the lower limit of the interval should be as far as possible from the line, because of clinical relevance, which does not always coincide with statistical significance. Finally, look again at the study of homogeneity. If there is a lot of heterogeneity, the results will not be as reliable.

And with this we end the topic of meta-analysis. In fact, the forest plot is not exclusive to meta-analyses and can be used whenever we want to compare studies to elucidate their statistical or clinical significance, or in cases such as equivalence studies, in which the line of no effect is joined by the equivalence thresholds. But it still has one more use. A variant of the forest plot also serves to assess whether there is publication bias in the systematic review although, as we already know, in that case we change its name to funnel plot. But that is another story…

Apples and pears


You all surely know the Chinese tale of the poor solitary rice grain that falls to the ground and nobody can hear it. Of course, if instead of one grain a whole sack of rice falls, that will be something else. There are many examples of how union makes strength. A red ant is harmless, unless it bites you in some soft and noble area, which are usually the most sensitive. But what about a swarm of millions of red ants? That is what really scares you, because if they all come together and come for you, you could do little to stop their push. Yes, union is strength.

And this also happens with statistics. With a relatively small sample of well-chosen voters we can estimate who will win an election in which millions vote. So, what could we not do with a lot of those samples? Surely the estimate would be more reliable and more generalizable.

Well, this is precisely one of the purposes of meta-analysis, which uses various statistical techniques to make a quantitative synthesis of the results of a set of studies that, although they try to answer the same question, do not reach exactly the same result. But beware: we cannot combine studies to draw conclusions about their sum without first taking a series of precautions. This would be like mixing apples and pears which, I’m not sure why, must be something terribly dangerous because everyone knows it’s something to avoid.

Think that we have a set of clinical trials on the same topic and we want to do a meta-analysis to obtain an overall result. It is more than desirable that there be as little variability as possible among the studies if we want to combine them. Because, ladies and gentlemen, here the saying also rules: alongside, but separate.

Before thinking about combining the results of the studies of a systematic review to perform a meta-analysis, we must always make a previous study of the heterogeneity of the primary studies, which is nothing more than the variability that exists among the estimators that have been obtained in each of those studies.

First, we will investigate possible causes of heterogeneity, such as differences in treatments, variability of the populations of the different studies and differences in the designs of the trials. If there is a great deal of heterogeneity from the clinical point of view, perhaps the best thing to do is not to do meta-analysis and limit the analysis to a qualitative synthesis of the results of the review.

Once we come to the conclusion that the studies are similar enough to try to combine them we should try to measure this heterogeneity to have an objective data. For this, several privileged brains have created a series of statistics that contribute to our daily jungle of acronyms and letters.

Until recently, the most famous of those initials was Cochran’s Q, which has nothing to do with either James Bond or our friend Archie Cochrane. Its calculation takes into account the sum of the deviations between each of the results of the primary studies and the overall result (squared differences, to avoid positives cancelling out negatives), weighting each study according to its contribution to the overall result. It looks awesome but, in reality, it is no big deal. Ultimately, it is no more than an aristocratic relative of the chi-square test. Indeed, Q follows a chi-square distribution with k-1 degrees of freedom (k being the number of primary studies). We calculate its value, look at the frequency distribution and estimate the probability that the differences are not due to chance, in order to reject our null hypothesis (which assumes that the observed differences among studies are due to chance). But, despite appearances, Q has a number of weaknesses.

First, it is a very conservative parameter and we must always keep in mind that lack of statistical significance is not always synonymous with absence of heterogeneity: in fact, we can only fail to reject the null hypothesis, so when we accept it we must know that we are running the risk of committing a type II error and blundering. For this reason, some people propose using a significance level of p < 0.1 instead of the standard p < 0.05. Another of Q’s pitfalls is that it does not quantify the degree of heterogeneity and, of course, does not explain the reasons that produce it. And, to top it off, Q loses power when the number of studies is small and does not allow comparisons among meta-analyses with different numbers of studies.

This is why another statistic has been devised that is much more celebrated today: I2. This parameter provides an estimate of the variation among studies with respect to the total variability or, put another way, the proportion of variability that is due to real differences among the estimates rather than to chance. It also looks impressive, but it is actually an advantageous relative of the intraclass correlation coefficient.

Its value ranges from 0 to 100%, and we usually consider the limits of 25%, 50% and 75% as signs of low, moderate and high heterogeneity, respectively. I2 is affected neither by the units in which the effect is measured nor by the number of studies, so it allows comparisons between meta-analyses with different effect measures or different numbers of studies.

If you read a study that provides Q and you want to calculate I2, or vice versa, you can use the following formula, where k is the number of primary studies:

I^{2}= \frac{Q-k+1}{Q}

There is a third parameter that is less known, but no less worthy of mention: H2. It measures the excess of the value of Q with respect to the value we would expect to obtain if there were no heterogeneity. Thus, a value of 1 means no heterogeneity, and its value increases as the heterogeneity among studies does. But its real interest is that it allows the calculation of confidence intervals for I2.
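Although the programs do it for us, the calculations are simple enough to sketch by hand. Here is a small Python example with invented effects and standard errors that applies the formula above (and the usual definitions of Q and H2):

```python
import numpy as np

# Hypothetical effect estimates (e.g., log odds ratios) and their standard errors
effects = np.array([0.30, 0.45, 0.12, 0.60, 0.25])
se = np.array([0.12, 0.20, 0.15, 0.25, 0.10])

w = 1 / se**2                              # inverse-variance weights
pooled = np.sum(w * effects) / np.sum(w)   # fixed effect pooled estimate
k = len(effects)

Q = np.sum(w * (effects - pooled)**2)      # Cochran's Q
I2 = max(0.0, (Q - (k - 1)) / Q) * 100     # I2 as a percentage, truncated at zero
H2 = Q / (k - 1)                           # H2: 1 means no heterogeneity

print(f"Q = {Q:.2f}, I2 = {I2:.1f}%, H2 = {H2:.2f}")
```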

Other times, the authors perform a hypothesis test with a null hypothesis of non-heterogeneity and use a chi-square or some similar statistic. In these cases, what they provide is a value of statistical significance. If the p is < 0.05 the null hypothesis can be rejected and we can say that there is heterogeneity. Otherwise we will say that we cannot reject the null hypothesis of non-heterogeneity.

In summary, whenever we see an indicator of homogeneity that represents a percentage, it will indicate the proportion of variability that is not due to chance. For their part, when they give us a “p” there will be significant heterogeneity when the “p” is less than 0.05.

Do not worry about the calculations of Q, I2 and H2. There are specific programs for that, such as RevMan, or modules within the usual statistical packages that do the same job.

A point of attention: always remember that not being able to demonstrate heterogeneity does not always mean that the studies are homogeneous. The problem is that the null hypothesis assumes that they are homogeneous and the differences are due to chance. If we can reject it we can assure that there is heterogeneity (always with a small degree of uncertainty). But this does not work the other way around: if we cannot reject it, it simply means that we cannot reject that there is no heterogeneity, but there will always be a probability of committing a type II error if we directly assume that the studies are homogeneous.

For this reason, a series of graphical methods have been devised to inspect the studies and check for signs of heterogeneity, even if the numerical parameters say otherwise.

The most widely used of them is, perhaps, the Galbraith plot, which can be used for meta-analyses of both trials and observational studies. This graph plots the precision of each study against its standardized effect. It also shows the fitted regression line and two confidence bands. The position of each study with respect to the precision axis indicates its weighted contribution to the overall result, while its location outside the confidence bands indicates its contribution to heterogeneity.

Galbraith’s graph can also be useful for detecting sources of heterogeneity, since studies can be labeled according to different variables and see how they contribute to the overall heterogeneity.

Another available tool for meta-analyses of clinical trials is L’Abbé’s plot. It plots response rates in the treatment group against response rates in the control group, placing the studies on either side of the diagonal. Above that line lie studies with an outcome favorable to treatment, while below it lie studies with an outcome favorable to the control intervention. The studies are usually plotted with an area proportional to their precision, and their dispersion indicates heterogeneity. Sometimes L’Abbé’s plot provides additional information. For example, in the accompanying graph you can see that studies in low-risk areas are located mainly below the diagonal, while high-risk studies are mainly located in areas favorable to treatment. This distribution, as well as being suggestive of heterogeneity, may suggest that the efficacy of the treatment depends on the level of risk or, put another way, that we have an effect-modifying variable in our study. A small drawback of this tool is that it is only applicable to meta-analyses of clinical trials in which the dependent variable is dichotomous.
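As a simple illustration, a L’Abbé-style plot takes only a few lines of matplotlib (the response rates and sample sizes below are invented):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical response rates per trial (treatment vs control) and sample sizes
control_rate = np.array([0.20, 0.35, 0.10, 0.50, 0.28])
treatment_rate = np.array([0.30, 0.42, 0.12, 0.65, 0.27])
n = np.array([80, 150, 60, 220, 120])

fig, ax = plt.subplots()
ax.scatter(control_rate, treatment_rate, s=n, alpha=0.6)   # area ~ study size, a rough proxy for precision
ax.plot([0, 1], [0, 1], "k--")                             # diagonal: no difference between arms
ax.set_xlabel("response rate, control group")
ax.set_ylabel("response rate, treatment group")
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
plt.show()   # points above the diagonal favour the treatment; a scattered cloud suggests heterogeneity
```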

Well, suppose we study the heterogeneity and decide that we are going to combine the studies to do a meta-analysis. The next step is to analyse the effect size estimates of the studies, weighting them according to the contribution that each study will make to the overall result. This is logical: a trial with few participants and an imprecise result cannot contribute to the final result in the same way as another with thousands of participants and a more precise effect measure.

The most usual way to take these differences into account is to weight the estimate of the effect size by the inverse of the variance of the results, subsequently performing the analysis to obtain the average effect. For this there are several possibilities, some of them very complex from the statistical point of view, although the two most commonly used methods are the fixed effect model and the random effects model. The two models differ in their conception of the starting population from which the primary studies of the meta-analysis come.

The fixed effect model considers that there is no heterogeneity and that all studies estimate the same effect size of the population (they all measure the same effect, that is why it is called a fixed effect), so it is assumed that the variability observed among the individual studies is due only to the error that occurs when performing the random sampling in each study. This error is quantified by estimating intra-study variance, assuming that the differences in the estimated effect sizes are due only to the use of samples from different subjects.

On the other hand, the random effects model assumes that the effect size varies in each study and follows a normal frequency distribution within the population, so each study estimates a different effect size. Therefore, in addition to the intra-study variance due to the error of random sampling, the model also includes the variability among studies, which would represent the deviation of each study from the mean effect size. These two error terms are independent of each other, both contributing to the variance of the study estimator.

In summary, the fixed effect model incorporates only one error term for the variability of each study, while the random effects model adds, in addition, another error term due to the variability among the studies.

You see that I have not written a single formula. We do not actually need to know them and they are quite unfriendly, full of Greek letters that no one understands. But do not worry. As always, statistical programs like RevMan from the Cochrane Collaboration allow you to do the calculations in a simple way, including and removing studies from the analysis and changing the model as you wish.
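Still, for the curious, here is a minimal sketch (with invented data) of what the two models do with the weights; the random effects part assumes the common DerSimonian-Laird estimator of the between-study variance, which is only one of several possible choices.

```python
import numpy as np

def pool(effects, se, model="fixed"):
    """Inverse-variance pooling; 'random' adds a DerSimonian-Laird between-study variance."""
    effects, se = np.asarray(effects, float), np.asarray(se, float)
    w = 1 / se**2
    k = len(effects)
    if model == "random":
        fixed = np.sum(w * effects) / np.sum(w)
        Q = np.sum(w * (effects - fixed)**2)
        c = np.sum(w) - np.sum(w**2) / np.sum(w)
        tau2 = max(0.0, (Q - (k - 1)) / c)     # between-study variance (tau squared)
        w = 1 / (se**2 + tau2)                 # each weight now carries both error terms
    pooled = np.sum(w * effects) / np.sum(w)
    pooled_se = np.sqrt(1 / np.sum(w))
    return pooled, (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)

effects = [0.30, 0.45, 0.12, 0.60, 0.25]
se = [0.12, 0.20, 0.15, 0.25, 0.10]
print("fixed :", pool(effects, se, "fixed"))
print("random:", pool(effects, se, "random"))  # usually a wider, more conservative interval
```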

The type of model to choose has its importance. If in the previous homogeneity analysis we see that the studies are homogeneous we can use the fixed effect model. But if we detect that heterogeneity exists, within the limits that allow us to combine the studies, it will be preferable to use the random effects model.

Another consideration is the applicability or external validity of the results of the meta-analysis. If we have used the fixed effect model, we will be limited in generalizing the results beyond populations with characteristics similar to those of the included studies. This does not occur with the results obtained using the random effects model, whose external validity is greater because they come from studies of different populations.

In any case, we will obtain a summary effect measure along with its confidence interval. This confidence interval will be statistically significant when it does not cross the line of no effect, which we already know is zero for mean differences and one for odds ratios and risk ratios. In addition, the width of the interval will inform us about the precision of the estimate of the average effect in the population: the wider, the less precise, and vice versa.

If you think a bit, you will immediately understand why the random effects model is more conservative than the fixed effect model in the sense that the confidence intervals obtained are less precise, since it incorporates more variability in its analysis. In some cases it may happen that the estimator is significant if we use the fixed effect model and it is not significant if we use the random effect model, but this should not condition us when choosing the model to use. We must always rely on the previous measure of heterogeneity, although if we have doubts, we can also use the two models and compare the different results.

Having examined the homogeneity of the primary studies, we may come to the grim conclusion that heterogeneity dominates the situation. Can we do something to manage it? Sure, we can. We can always choose not to combine the studies, or to combine them despite the heterogeneity and obtain a summary result; but, in that case, we should also calculate some measure of variability among studies, and even then we could not be sure of our results.

Another possibility is to do a stratified analysis according to the variable that causes heterogeneity, provided that we are able to identify it. For this we can do a sensitivity analysis, repeating the calculations while removing the subgroups one by one and checking how this influences the overall result. The problem is that this approach misses the final purpose of any meta-analysis, which is none other than obtaining an overall value from homogeneous studies.

Finally, the brainiest on these issues can use meta-regression. This technique is similar to multivariate regression models, in which the characteristics of the studies are used as explanatory variables and the effect variable, or some measure of the deviation of each study from the overall result, is used as the dependent variable. It should also be weighted according to the contribution of each study to the overall result and, if the number of primary studies is not large, we should try not to cram too many coefficients into the regression model. I would not advise you to do a meta-regression at home unless accompanied by an experienced statistician.

And we only need to check that we have not omitted studies and that we have presented the results correctly. The meta-analysis data are usually represented in a specific graph that is known as forest plot. But that is another story…

The whole is greater than the sum of its parts


This is another of those famous quotes that are all over the place. Apparently, the first person to have this clever idea was Aristotle, who used it to summarize his general principle of holism in his treatises on metaphysics. Who would have said that this tiny phrase contains so much wisdom? Holism insists that everything must be considered as a whole, because its components may act in a synergistic way, allowing the meaning of the whole to be greater than the meaning that each individual part contributes.

Don’t be afraid, you are still on the blog about the brains and not on a blog about philosophy. Neither have I changed the topic of the blog, but this saying is just what I needed to introduce you to the wildest beast of scientific method, which is called meta-analysis.

We live in the information age. Since the end of the 20th century, we have witnessed a true explosion of the available sources of information, accessible from multiple platforms. The end result is that we are overwhelmed every time we need information about a specific point, so we do not know where to look or how we can find what we want. For this reason, systems began to be developed to synthesize the information available to make it more accessible when needed.

And so the first reviews appeared: the so-called narrative or author reviews. To write them, one or more authors, usually experts in a specific subject, made a general review of the topic, although without any strict criteria for the search strategy or the selection of information. Then, with total freedom, the authors analyzed the results as they saw fit and ended up drawing their conclusions from a qualitative synthesis of the results obtained.

These narrative reviews are very useful for acquiring an overview of the topic, especially when one knows little about the subject, but they are not very useful for those who already know the topic and need answers to a more specific question. In addition, as the whole procedure is done according to the authors’ wishes, the conclusions are not reproducible.

For these reasons, a series of privileged minds invented the other type of review in which we will focus on this post: the systematic review. Instead of reviewing a general topic, systematic reviews do focus on a specific topic in order to solve specific doubts of clinical practice. In addition, they use a clearly specified search strategy and inclusion criteria for an explicit and rigorous work, which makes them highly reproducible if another group of authors comes up with a repeat review of the same topic. And, if that were not enough, whenever possible, they go beyond the analysis of qualitative synthesis, completing it with a quantitative synthesis that receives the funny name of meta-analysis.

The realization of a systematic review consists of six steps: formulation of the problem or question to be answered, search and selection of existing studies, evaluation of the quality of these studies, extraction of the data, analysis of the results and, finally, interpretation and conclusion. We are going to detail this whole process a little.

Any systematic review worth its salt should try to answer a specific question that must be relevant from the clinical point of view. The question will usually be asked in a structured way with the usual components of population, intervention, comparison and outcome (PICO), so that the analysis of these components will allow us to know if the review is of our interest.

In addition, the components of the structured clinical question will help us to search for the relevant studies that exist on the subject. This search must be global and unbiased, so we avoid possible source biases such as excluding sources by language, journal, etc. The usual thing is to use a minimum of two major general-purpose electronic databases, such as Pubmed, Embase or the Cochrane Library, together with those specific to the subject being dealt with. It is important that this search is complemented by a manual search in non-electronic registries and by consulting the bibliographic references of the papers found, in addition to other sources of the so-called grey literature, such as doctoral theses and conference proceedings, as well as documents from funding agencies and registries and, even, by contacting other researchers to find out whether there are studies not yet published.

It is very important that this strategy is clearly specified in the methods section of the review, so that anyone can reproduce it later, if desired. In addition, it will be necessary to clearly specify the inclusion and exclusion criteria of the primary studies of the review, the type of design sought and its main components (again in reference to the PICO, the components of the structured clinical question).

The third step is the evaluation of the quality of the studies found, which must be done by a minimum of two people independently, with the help of a third party (who will surely be the boss) to break the tie in cases where there is no consensus between the reviewers. For this task, tools or checklists designed for the purpose are usually used; one of the most frequently used tools for bias control is the Cochrane Collaboration’s tool. This tool assesses five criteria of the primary studies to determine their risk of bias: adequate randomization sequence (prevents selection bias), adequate blinding (prevents performance and detection biases, both information biases), concealment of allocation (prevents selection bias), losses to follow-up (prevents attrition bias) and selective reporting of results (prevents reporting bias). The studies are classified as having high, low or unclear risk of bias. It is common to use traffic-light colours, marking in green the studies with low risk of bias, in red those with high risk of bias and in yellow those that remain in no man’s land. The more green we see, the better the quality of the primary studies in the review.

Ad hoc forms are usually designed for data extraction, usually collecting data such as date, setting of the study, type of design, etc., as well as the components of the structured clinical question. As in the previous step, it is convenient that this be done by more than one person, establishing beforehand the method for reaching an agreement in cases where there is no consensus between the reviewers.

And here we enter the most interesting part of the review, the analysis of the results. The fundamental role of the authors will be to explain the differences that exist between the primary studies that are not due to chance, paying special attention to the variations in the design, study population, exposure or intervention and measured results. You can always make a qualitative synthesis analysis, although the real magic of the systematic review is that, when the characteristics of primary studies allow it, a quantitative synthesis, called meta-analysis, can also be performed.

A meta-analysis is a statistical analysis that combines the results of several independent studies that try to answer the same question. Although meta-analysis can be considered as a research project in its own right, it is usually part of a systematic review.

Primary studies can be combined using a statistical methodology developed for this purpose, which has a number of advantages. First, by combining all the results of the primary studies we obtain a more complete overall picture (you know, the whole is greater…). Second, by combining studies we increase the sample size, which increases the power of the analysis compared with that of the individual studies, improving the estimation of the effect we want to measure. Third, since the conclusions are drawn from a greater number of studies involving different populations, external validity increases and it is easier to generalize the results. Finally, it can allow us to resolve controversies between the conclusions of the different primary studies of the review and even to answer questions that had not been raised in those studies.

Once the meta-analysis is done, a final synthesis must be made that integrates the results of the qualitative and quantitative synthesis in order to answer the question that motivated the systematic review or, when this is not possible, to propose the additional studies that must be carried out to be able to answer it.

But a meta-analysis will only deserve all our respect if it fulfils a series of requirements. Like the systematic review to which it belongs, it should aim to answer one specific question and it must be based on all the relevant available information, avoiding publication bias and retrieval bias. Also, the primary studies must have been assessed to ensure their quality and homogeneity before combining them. Of course, the data must be analysed and presented in an appropriate way. And, finally, it must make sense to combine the results. The fact that we can combine results does not always mean that we have to do it, if it is not needed in our clinical setting.

And how do you combine the studies?, you may ask yourselves. Well, that is the crux of the meta-analysis (cruxes, really, there are several), because there are various possible ways to do it.

Anyone could think that the easiest way would be a sort of Eurovision Song Contest. We count the primary studies with a statistically significant positive effect and, if they are the majority, we conclude that there is consensus in favour of a positive result. This approach is quite simple but, you will not deny it, also quite sloppy. I can also think of a number of disadvantages to its use. On the one hand, it implies that lack of significance and lack of effect are synonymous, which does not always have to be true. On the other hand, it takes into account neither the direction and strength of the effect in each study, nor the precision of the estimators, nor the quality or the design characteristics of the primary studies. So this type of approach is not very advisable, although nobody is going to fine us if we use it as an informal first approximation before deciding which is the best way to combine the results.

Another possibility is to use a sort of sign test, similar to other non-parametric statistical techniques. We count the number of positive effects, we subtract the negatives and we have our conclusion. The truth is that this method also seems too simple. It ignores studies without statistical significance and it also ignores the precision of the studies’ estimators. So this approach is not of much use, unless you only know the direction of the effects measured in the studies. We could also use it when the primary studies are very heterogeneous, to get an approximation of the overall result, although I would not trust results obtained in this way very much.

The third method is to combine the different p values of the studies (our beloved and sacrosanct p values). This could come to mind if we had a systematic review whose primary studies use different outcome measures, although all of them try to answer the same question. For example, think of a review on osteoporosis in which some studies use ultrasound densitometry, others spine or femur DEXA, etc. The problem with this method is that it does not take into account the intensity of the effects, but only their direction and statistical significance, and we all know the shortcomings of our holy p values. To take this approach we would need software that combines data following a chi-square or Gaussian distribution, giving us an estimate and its confidence interval.

The fourth and final method that I know is also the most stylish: to make a weighted combination of the effect estimated in all the primary studies. Calculating the mean would be the easiest way, but we have not come this far to botch things again. The arithmetic mean gives the same weight to all studies, so if you have an outlier or an imprecise study, the results will be greatly distorted. Do not forget that the mean always follows the tails of the distribution and is heavily influenced by extreme values (something that does not happen to its relative, the median).

This is why we have to weight the different estimates. This can be done in two ways: taking into account the number of subjects in each study, or performing a weighting based on the inverse of the variance of each one (you know, the square of the standard error). The latter is the more complex way, so it is the one most often used. Of course, as the maths involved are hard, people usually use special software, either external modules working within the usual statistical programs such as Stata, SPSS, SAS or R, or specific software such as the famous RevMan from the Cochrane Collaboration.
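A toy example (with invented numbers) of why the simple average is a bad idea and why the weighting matters:

```python
import numpy as np

# Three precise, consistent studies and one imprecise outlier
effects = np.array([0.20, 0.25, 0.18, 1.50])
se = np.array([0.05, 0.06, 0.04, 0.60])

w = 1 / se**2
print("arithmetic mean:", effects.mean())                    # dragged towards the outlier
print("weighted mean  :", np.sum(w * effects) / np.sum(w))   # dominated by the precise studies
```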

As you can see, I have not hesitated to call the systematic review with meta-analysis the wildest beast of epidemiological designs. However, it has its detractors. We all know someone who claims not to like systematic reviews because almost all of them end up in the same way: “more quality studies are needed to be able to make recommendations with a reasonable degree of evidence”. Of course, in these cases we cannot put the blame on the review, since we do not take enough care in performing our studies, so the vast majority of them deserve to end up in the paper shredder.

Another controversy is that of those who debate about what is better, a good systematic review or a good clinical trial (reviews can be made on other types of designs, including observational studies). This debate reminds me of the controversy over whether one should do a calimocho mixing a good wine or if it is a sin to mix a good wine with Coca-Cola. Controversies aside, if you have to take a calimocho, I assure you that you will enjoy it more if you use a good wine, and something similar happens to reviews with the quality of their primary studies.

The problem with systematic reviews is that, to be really useful, they have to be carried out very rigorously. So that we do not forget anything, there are checklists of recommendations that allow us to order the entire procedure of creating and disseminating scientific work without making methodological errors or omissions along the way.

It all started with a program of the Health Service of the United Kingdom that ended with the founding of an international initiative to promote the transparency and precision of biomedical research works: the EQUATOR network (Enhancing the QUAlity and Transparency of health Research). This network consists of experts in methodology, communication and publication, so it includes professionals involved in the quality of the entire process of production and dissemination of research results. Among many other objectives, which you can consult on its website, one is to design a set of recommendations for the realization and publication of the different types of studies, which gives rise to different checklists or statements.

The checklist designed to apply to systematic reviews is the PRISMA statement (Preferred Reporting Items for Systematic reviews and Meta-Analyses), which comes to replace the QUOROM statement (QUality Of Reporting Of Meta-analyses). Based on the definition of systematic review of the Cochrane Collaboration, PRISMA helps us to select, identify and assess the studies included in a review. It also consists of a checklist and a flowchart that describes the passage of all the studies considered during the realization of the review. There is also a lesser-known statement for the assessment of meta-analyses of observational studies, the MOOSE statement (Meta-analyses of Observational Studies in Epidemiology).

The Cochrane Collaboration also has a very well structured and defined methodology, which you can consult on its website. This is why its reviews enjoy so much prestige within the world of systematic reviews: they are made by professionals dedicated to the task, following a rigorous and well-tested methodology. Anyway, even Cochrane reviews should be read critically and not taken for granted.

And with this we have reached the end for today. I want to insist that meta-analysis should be done whenever possible and interesting, but making sure beforehand that it is correct to combine the results. If the studies are very heterogeneous we should not combine anything, since the results that we could obtain would have a much compromised validity. There is a whole series of methods and statistics to measure the homogeneity or heterogeneity of the primary studies, which also influence the way in which we analyze the combined data. But that is another story…

The guard’s dilemma


The world of medicine is a world of uncertainty. We can never be 100% sure of anything, however obvious a diagnosis may seem, but neither can we lash out left and right with ultramodern diagnostic techniques or treatments (which are never safe) whenever we make the decisions that continually haunt us in our daily practice.

That’s why we are always immersed in a world of probabilities, where the certainties are almost as rare as the so-called common sense which, as almost everyone knows, is the least common of the senses.

Imagine you are in the clinic and a patient comes in because he has been kicked in the ass, pretty hard, though. Being the good doctors that we are, we ask the classic questions: what’s wrong?, since when?, and what do you attribute it to? And we proceed to a complete physical examination, discovering with horror that he has a hematoma on his right buttock.

Here, my friends, the diagnostic possibilities are numerous, so the first thing we do is a comprehensive differential diagnosis. To do this, we can take four different approaches. The first is the possibilistic approach: listing all possible diagnoses and trying to rule them all out simultaneously by applying the relevant diagnostic tests. The second is the probabilistic approach: sorting the diagnoses by relative probability and then acting accordingly. It looks like a post-traumatic hematoma (also known as the kick-in-the-ass syndrome), but someone might think that the kick was not that strong, so maybe the poor patient has a bleeding disorder, or a blood dyscrasia with secondary thrombocytopenia, or even an atypical inflammatory bowel disease with extraintestinal manifestations and gluteal vascular fragility. We could also use a prognostic approach and try to confirm or rule out the possible diagnoses with the worst prognosis, so the diagnosis of kick-in-the-ass syndrome would lose interest and we would go on to rule out a chronic leukemia. Finally, a pragmatic approach could be used, with particular interest in first finding the diagnoses that have the most effective treatment (the kick would be, once more, number one).

It seems that the right thing to do is to use a judicious combination of the pragmatic, probabilistic and prognostic approaches. In our case we will investigate whether the intensity of the injury justifies the magnitude of the bruising and, if so, we will prescribe some hot towels and refrain from further diagnostic tests. And this example may seem like bullshit, but I can assure you I know people who make the complete list and order all the diagnostic tests whenever there is any symptom, regardless of costs or risks. And, besides, someone could even consider the possibility of performing some more exotic diagnostic test, so the patient should be grateful if the diagnosis does not end up requiring a forced anal sphincterotomy. And that is so because, as we have already said, the waiting list to get some common sense exceeds the surgical waiting list many times over.

Now imagine another patient with a less silly and absurd set of symptoms than the previous example. For instance, let’s think about a child with symptoms of celiac disease. Before we do any diagnostic test, our patient already has a certain probability of suffering from the disease. This probability is conditioned by the prevalence of the disease in the population she comes from and is called the pre-test probability. This probability will lie somewhere between two thresholds: the diagnostic threshold and the therapeutic threshold.

The usual thing is that the pre-test probability of our patient does not allow us to rule out the disease with reasonable certainty (it would have to be very low, below the diagnostic threshold) or to confirm it with sufficient security to start the treatment (it would have to be above the therapeutic threshold).

We’ll then make the indicated diagnostic test, getting a new probability of disease depending on the result of the test, the so-called post-test probability. If this probability is high enough to make a diagnosis and initiate treatment, we’ll have crossed our first threshold, the therapeutic one. There will be no need for additional tests, as we will have enough certainty to confirm the diagnosis and treat the patient, always within a range of uncertainty.

And what determines our therapeutic threshold? Well, there are several factors involved. The greater the risk, cost or adverse effects of the treatment in question, the higher the threshold we will demand before treating. On the other hand, the more serious the consequences of missing the diagnosis, the lower the therapeutic threshold we will accept.

But it may be that the post-test probability is so low that it allows us to rule out the disease with reasonable confidence. We will then have crossed our second threshold, the diagnostic one, also called the no-test threshold. Clearly, in this situation, neither further diagnostic tests nor, of course, starting treatment are indicated.

However, very often changing pretest to post-test probability still leaves us in no man’s land, without achieving any of the two thresholds, so we will have to perform additional tests until we reach one of the two limits.

And this is our everyday need: to know the post-test probability of our patients in order to know whether we rule out or confirm the diagnosis, whether we leave the patient alone or unleash our treatments on her. And this is so because the simplistic approach that a patient is sick if the diagnostic test is positive and healthy if it is negative is totally wrong, even if it is the general belief among those who order the tests. We will have to look, then, for some parameter that tells us how useful a specific diagnostic test can be in serving the purpose we need: to know the probability that the patient suffers from the disease.

And this reminds me of the enormous problem a brother-in-law of mine asked me about the other day. The poor man is very concerned about a dilemma that has arisen. The thing is that he is going to start a small business and wants to hire a security guard to stand at the entrance door and watch out for those who take something without paying for it. And the problem is that there are two candidates and he does not know which of the two to choose. One of them stops nearly everyone, so no shoplifter escapes. Of course, many honest people are offended when they are asked to open their bags before leaving, and so next time they will shop elsewhere. The other guard is just the opposite: he stops almost no one, but anyone he does stop is certainly carrying something stolen. He offends few honest people, but too many thieves get away. A difficult decision…

Why does my brother-in-law come to me with this story? Because he knows that I face similar dilemmas every day, every time I have to choose a diagnostic test to know whether a patient is sick and whether I have to treat her. We have already said that the positivity of a test does not assure us of the diagnosis, just as the suspicious look of a customer does not ensure that the poor man has robbed us.

Let’s see it with an example. When we want to know the utility of a diagnostic test, we usually compare its results with those of a reference or gold standard, which is a test that, ideally, is always positive in sick patients and negative in healthy people. Now let’s suppose that I perform a study in my hospital office with a new diagnostic test to detect a certain disease and I get the results from the attached table (the patients are those who have the positive reference test and the healthy ones, the negative).

Let’s start with the easy part. We have 1598 subjects, 520 of them sick and 1078 healthy. The test gives us 446 positive results, 428 true (TP) and 18 false (FP). It also gives us 1152 negatives, 1060 true (TN) and 92 false (FN). The first thing we can determine is the ability of the test to distinguish between healthy and sick, which leads me to introduce the first two concepts: sensitivity (Se) and specificity (Sp). Se is the likelihood that the test correctly classifies a sick patient or, in other words, the probability that a sick patient gets a positive result. It is calculated by dividing TP by the number of sick subjects. In our case it equals 0.82 (if you prefer to use percentages, multiply by 100). Meanwhile, Sp is the likelihood that the test correctly classifies a healthy subject or, put another way, the probability that a healthy subject gets a negative result. It is calculated by dividing TN by the number of healthy subjects. In our example, it equals 0.98.

Someone may think that we have finished assessing the value of the new test, but we have only just begun. And this is because with Se and Sp we somehow measure the ability of the test to discriminate between healthy and sick, but what we really need to know is the probability that an individual with a positive result is actually sick and, although they may seem similar concepts, they are actually quite different.

The probability that a positive is sick is known as the positive predictive value (PPV) and is calculated by dividing the number of sick with a positive test by the total number of positives. In our case it is 0.96. This means that a positive has a 96% chance of being sick. The probability that a negative is healthy is expressed by the negative predictive value (NPV), which is the quotient of the healthy with a negative test by the total number of negatives. In our example it equals 0.92 (an individual with a negative result has a 92% chance of being healthy). This already looks more like what we said at the beginning that we needed: the post-test probability that the patient is really sick.
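
Again, a quick check in Python with the same counts (just a sketch, nothing more):

```python
# Same counts as before
TP, FP, TN, FN = 428, 18, 1060, 92

ppv = TP / (TP + FP)   # probability that a positive result corresponds to a sick subject
npv = TN / (TN + FN)   # probability that a negative result corresponds to a healthy subject

print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")  # PPV = 0.96, NPV = 0.92
```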

And from here on is when neurons begin to overheat. It turns out that Se and Sp are two intrinsic characteristics of the diagnostic test. Their results will be the same whenever we use the test in similar conditions, regardless of the subjects being tested. But this is not so with the predictive values, which vary depending on the prevalence of the disease in the population in which we use the test. This means that the probability that a positive is sick depends on how common or rare the disease is in the population. Yes, you read that right: the same positive test expresses a different risk of being sick and, for the unbelievers, I will give another example.

Suppose that this same study is repeated by one of my colleagues who works at a community health center, where the population is proportionally healthier than at my hospital (logical, they have not suffered the hospital yet). If you check the results in the table and take the trouble to calculate them, you will come up with a Se of 0.82 and a Sp of 0.98, the same that I came up with in my practice. However, if you calculate the predictive values, you will see that the PPV equals 0.9 and the NPV 0.95. And this is so because the prevalence of the disease (sick divided by total) is different in the two populations: 0.32 at my practice vs 0.19 at the health center. That is, when prevalence is high a positive result is more valuable to confirm the diagnosis, but a negative is less reliable to rule it out. And conversely, if the disease is very rare a negative result will reasonably rule out the disease, but a positive will be less reliable when it comes to confirming it.
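
If you do not have the health center table at hand, you can reproduce the idea from Se, Sp and the prevalence alone, using Bayes’ theorem. The sketch below uses the rounded Se and Sp quoted in the text, so the results differ slightly (by rounding only) from the figures obtained from the exact tables:

```python
# Predictive values from Se, Sp and prevalence (Bayes' theorem).
# Se and Sp are the rounded values quoted in the text, so the output only
# approximates the table-based figures (0.96/0.92 hospital, 0.9/0.95 health center).
def predictive_values(se, sp, prevalence):
    ppv = se * prevalence / (se * prevalence + (1 - sp) * (1 - prevalence))
    npv = sp * (1 - prevalence) / (sp * (1 - prevalence) + (1 - se) * prevalence)
    return ppv, npv

for prev in (0.32, 0.19):   # hospital vs health center prevalence
    ppv, npv = predictive_values(0.82, 0.98, prev)
    print(f"prevalence {prev:.2f}: PPV = {ppv:.2f}, NPV = {npv:.2f}")
```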

We see that, as almost always happens in medicine, we are moving on the shaky ground of probability, since all (absolutely all) diagnostic tests are imperfect and make mistakes when classifying healthy and sick. So when is a diagnostic test worth using? If you think about it, any given subject has a probability of being sick even before performing the test (the prevalence of the disease in her population) and we are only interested in using diagnostic tests if they increase this probability enough to justify starting the appropriate treatment (otherwise we would have to perform another test to reach the probability threshold that justifies treatment).

And here is where this issue begins to be a little unfriendly. The positive likelihood ratio (PLR) indicates how much more likely a positive result is in a sick subject than in a healthy one. The proportion of positives among the sick is represented by Se. The proportion of positives among the healthy are the FP, which would be the healthy who do not get a negative result or, what is the same, 1 – Sp. Thus, PLR = Se / (1 – Sp). In our case (hospital) it equals 41 (the same value whether we use proportions or percentages for Se and Sp). This can be interpreted as follows: it is 41 times more likely to get a positive in a sick subject than in a healthy one.

It is also possible to calculate the NLR (negative likelihood ratio), which expresses how much more likely a negative result is in a sick subject than in a healthy one. Negatives among the sick are those who do not test positive (1 – Se) and negatives among the healthy are the TN (the test’s Sp). So, NLR = (1 – Se) / Sp. In our example, 0.18.
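
Both ratios follow directly from Se and Sp; a two-line check in Python:

```python
se, sp = 0.82, 0.98   # sensitivity and specificity from the hospital example

plr = se / (1 - sp)   # positive likelihood ratio
nlr = (1 - se) / sp   # negative likelihood ratio

print(f"PLR = {plr:.0f}, NLR = {nlr:.2f}")  # PLR = 41, NLR = 0.18
```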

A ratio of 1 indicates that the result of the test does not change the likelihood of being sick. If it is greater than 1 the probability is increased and, if less than 1, decreased. This is the parameter used to determine the diagnostic power of the test. Values > 10 (or < 0.1 for the NLR) indicate a very powerful test that strongly supports (or contradicts) the diagnosis; values of 5-10 (or 0.1-0.2) indicate moderate power to support (or rule out) the diagnosis; 2-5 (or 0.2-0.5) indicate that the contribution of the test is questionable; and, finally, 1-2 (0.5-1) indicate that the test has no diagnostic value.

The likelihood ratio does not express a direct probability, but it helps us to calculate the probabilities of being sick before and after a positive test by means of Bayes’ rule, which says that the post-test odds equal the pre-test odds multiplied by the likelihood ratio. To transform the prevalence into pre-test odds we use the formula odds = p / (1 – p). In our case, it would be 0.47. Now we can calculate the post-test odds by multiplying the pre-test odds by the likelihood ratio. In our case, the positive post-test odds are 19.27. And finally, we transform the post-test odds into post-test probability using the formula p = odds / (odds + 1). In our example it equals 0.95, which means that if our test is positive the probability of being sick goes from 0.32 (the pre-test probability) to 0.95 (the post-test probability).
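
The whole chain (probability to odds, odds times likelihood ratio, back to probability) fits in a few lines of Python; the value for a negative result is not quoted in the text but follows from the same rule:

```python
prevalence = 0.32          # pre-test probability (hospital)
plr, nlr = 41, 0.18        # likelihood ratios calculated above

pre_odds = prevalence / (1 - prevalence)   # about 0.47

def post_test_probability(pre_odds, lr):
    post_odds = pre_odds * lr              # Bayes' rule in odds form
    return post_odds / (post_odds + 1)     # back to probability

print(f"after a positive: {post_test_probability(pre_odds, plr):.2f}")  # about 0.95
print(f"after a negative: {post_test_probability(pre_odds, nlr):.2f}")  # about 0.08
```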

If there is still anyone reading at this point, I will say that we do not need all this gibberish to get the post-test probability. There are multiple websites with online calculators that obtain all these parameters from the initial 2 by 2 table with minimal effort. In addition, the post-test probability can be easily calculated using Fagan’s nomogram (see attached figure). This graph represents, in three vertical lines from left to right, the pre-test probability (represented inverted), the likelihood ratios and the resulting post-test probability.

To calculate the post-test probability after a positive result, we draw a line from the prevalence (pre-test probability) to the PLR and extend it to the post-test probability axis. Similarly, in order to calculate post-test probability after a negative result, we would extend the line between prevalence and the value of the NLR.

In this way, with this tool we can directly calculate the post-test probability by knowing the likelihood ratios and the prevalence. In addition, we can use it in populations with different prevalence, simply by modifying the origin of the line in the axis of pre-test probability.

So far we have defined the parameters that help us to quantify the power of a diagnostic test and we have seen the limitations of sensitivity, specificity and predictive values, and how, in general, the most useful are the likelihood ratios. But, you will ask, what makes a good test? A sensitive one? A specific one? Both?

Here we are going to return to the guard’s dilemma facing my poor brother-in-law, because we have left him abandoned and we have not yet answered which of the two guards we recommend he hire: the one who asks almost everyone to open their bags, thereby offending many honest people, or the one who almost never bothers honest people but, stopping almost no one, lets many thieves get away.

And what do you think is the better choice? The simple answer is: it depends. Those of you who are still awake by now will have noticed that the first guard (the one who checks many people) is the sensitive one, while the second is the specific one. What is better for us, the sensitive or the specific guard? It depends, for example, on where our shop is located. If the shop is in a well-heeled neighborhood the first guard will not be the best choice because, in fact, few people will be thieves and we would rather not offend our customers so they do not take their business elsewhere. But if our shop is located in front of the Cave of Ali Baba we will be more interested in detecting the maximum number of clients carrying stolen goods. It can also depend on what we sell in the store. If we run a flea market we can hire the specific guard even though someone may escape (at the end of the day, we will only lose a small amount of money). But if we sell diamonds we will want no thief to escape and we will hire the sensitive guard (we would rather bother someone honest than let anybody escape with a diamond).

The same happens in medicine with the choice of diagnostic tests: we have to decide in each case whether we are more interested in being sensitive or specific, because diagnostic tests do not always have both high sensitivity (Se) and high specificity (Sp).

In general, a sensitive test is preferred when the inconveniences of a false positive (FP) are smaller than those of a false negative (FN). For example, suppose that we are going to vaccinate a group of patients and we know that the vaccine is deadly in those with a particular metabolic error. It is clear that our interest is that no patient goes undiagnosed (to avoid FN), but nothing serious happens if we wrongly label a healthy subject as having the metabolic error (FP): it is preferable not to vaccinate a healthy person thinking that he has a metabolopathy (although he has not) than to kill a patient with our vaccine assuming he was healthy. Another less dramatic example: in the midst of an epidemic our interest will be to be very sensitive and isolate the largest number of patients. The problem here is for the unfortunate healthy people who test positive (FP) and get isolated with the rest of the sick. No doubt we would do them a disservice with that maneuver. Of course, we could apply to all the positives of the first test a second, very specific confirmatory test in order to spare the FP such bad consequences.

On the other hand, a specific test is preferred when it is better to have a FN than a FP, as when we want to be sure that someone is actually sick. Imagine that a positive test result implies a surgical treatment: we will have to be quite sure about the diagnosis so we do not operate on any healthy people.

Another example is a disease whose diagnosis can be very traumatic for the patient, or that is almost incurable or has no treatment. Here we will prefer specificity so as not to cause unnecessary distress to a healthy person. Conversely, if the disease is serious but treatable we will probably prefer a sensitive test.

So far we have talked about tests with a dichotomous result: positive or negative. But what happens when the result is quantitative? Let’s imagine that we measure fasting blood glucose. We must decide up to what level of glycemia we consider the result normal and above which we will consider it pathological. And this is a crucial decision, because Se and Sp will depend on the cutoff point we choose.

To help us choose we have the receiver operating characteristic, known worldwide as the ROC curve. We plot Se on the ordinate (y axis) and the complement of Sp (1 – Sp) on the abscissa, and draw a curve in which each point represents the probability that the test correctly classifies a healthy-sick pair taken at random. The diagonal of the graph would represent the “curve” of a test with no ability to discriminate healthy from sick.

As you can see in the figure, the curve usually has a segment of steep slope where Se increases rapidly without Sp hardly changing: if we move up we can increase Se without practically increasing the FP. But there comes a point when we reach the flat part. If we continue to move to the right, Se will no longer increase, but the FP will begin to increase. If we are interested in a sensitive test, we will stay in the first part of the curve. If we want specificity we will have to go further to the right. And, finally, if we do not have a preference for either of the two (we are equally concerned about FP and FN), the best cutoff point will be the one closest to the upper left corner. For this, some use the so-called Youden’s index, which jointly optimizes the two parameters and is calculated by adding Se and Sp and subtracting 1. The higher the index, the fewer patients misclassified by the diagnostic test.
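
As an illustration of how a cutoff could be chosen with Youden’s index, here is a small sketch with entirely made-up glycemia values (the numbers are hypothetical, only the procedure matters):

```python
# Toy sketch: compute Se, Sp and Youden's index J = Se + Sp - 1 at each candidate
# cutoff and keep the cutoff that maximizes J. All values below are invented.
sick =    [105, 118, 126, 131, 140, 152, 164, 178]   # hypothetical glycemia of sick subjects
healthy = [ 82,  88,  90,  95,  99, 104, 110, 121]   # hypothetical glycemia of healthy subjects

best = None
for cutoff in sorted(set(sick + healthy)):
    se = sum(v >= cutoff for v in sick) / len(sick)        # positives among the sick
    sp = sum(v < cutoff for v in healthy) / len(healthy)   # negatives among the healthy
    j = se + sp - 1                                        # Youden's index
    if best is None or j > best[0]:
        best = (j, cutoff, se, sp)

j, cutoff, se, sp = best
print(f"best cutoff = {cutoff} (Se = {se:.2f}, Sp = {sp:.2f}, J = {j:.2f})")
```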

A parameter of interest is the area under the curve (AUC), which represents the probability that the diagnostic test correctly classifies the patient being tested (see attached figure). An ideal test with Se and Sp of 100% has an area under the curve of 1: it always gets it right. In clinical practice, a test whose ROC curve has an AUC > 0.9 is considered very accurate, between 0.7 and 0.9 of moderate accuracy and between 0.5 and 0.7 of low accuracy. On the diagonal the AUC equals 0.5, which indicates that the test performs no better than tossing a coin to decide whether the patient is sick or not. Values below 0.5 indicate that the test is even worse than chance, since it will systematically classify patients as healthy and vice versa. Curious, these ROC curves, aren’t they? Their usefulness is not limited to assessing the goodness of diagnostic tests with quantitative results. ROC curves also serve to determine the goodness of fit of a logistic regression model to predict dichotomous outcomes, but that is another story…

King of Kings


There is no doubt that when doing research in biomedicine we can choose from a large number of possible designs, all with their advantages and disadvantages. But in such a diverse and populous court, among jugglers, wise men, gardeners and purple flautists, there reigns over all of them the true Crimson King of epidemiology: the randomized clinical trial.

The clinical trial is an interventional analytical study, with anterograde directionality and concurrent temporality, and with sampling of a closed cohort with control of exposure. In a trial, a sample of a population is selected and divided randomly into two groups. One of the groups (intervention group) undergoes the intervention that we want to study, while the other (control group) serves as a reference to compare the results. After a given follow-up period, the results are analyzed and the differences between the two groups are compared. We can thus evaluate the benefits of treatments or interventions while controlling the biases of other types of studies: randomization favors the even distribution of possible confounding factors, known or unknown, between the two groups, so that if in the end we detect any difference, it has to be due to the intervention under study. This is what allows us to establish a causal relationship between exposure and effect.

From what has been said up to now, it is easy to understand that the randomized clinical trial is the most appropriate design to assess the effectiveness of any intervention in medicine and the one that provides, as we have already mentioned, the highest quality evidence to demonstrate the causal relationship between the intervention and the observed results.

But to enjoy all these benefits it is necessary to be scrupulous in the approach and methodology of the trials. There are checklists published by experts who understand a lot of these issues, as is the case of the CONSORT list, which can help us assess the quality of the trial’s design. But among all these aspects, let us give some thought to those that are crucial for the validity of the clinical trial.

Everything begins with a knowledge gap that leads us to formulate a structured clinical question. The only objective of the trial should be to answer this question, and it is enough for it to answer a single question properly. Beware of clinical trials that try to answer many questions, since, in many cases, in the end they do not answer any of them well. In addition, the approach must be based on what the inventors of methodological jargon call the equipoise principle, which means nothing more than that, deep in our hearts, we do not really know which of the two interventions is more beneficial for the patient (from the ethical point of view, it would be anathema to make the comparison if we already knew with certainty which of the two interventions is better). It is curious, in this sense, how trials sponsored by the pharmaceutical industry are more likely to breach the equipoise principle, since they have a preference for comparing against placebo or against “non-intervention” in order to demonstrate more easily the efficacy of their products.

Then we must carefully choose the sample on which we will perform the trial. Ideally, all members of the population should have the same probability not only of being selected, but also of ending up in either of the two branches of the trial. Here we face a small dilemma. If we are very strict with the inclusion and exclusion criteria, the sample will be very homogeneous and the internal validity of the study will be strengthened, but it will be more difficult to extend the results to the general population (this is the explanatory attitude of sample selection). On the other hand, if we are not so rigid, the results will be more similar to those of the general population, but the internal validity of the study may be compromised (this is the pragmatic attitude).

Randomization is one of the key points of the clinical trial. It is the one that assures us that we can compare the two groups, since it tends to distribute the known variables equally and, more importantly, also the unknown variables between the two groups. But do not relax too much: this distribution is not guaranteed at all, it is only more likely to happen if we randomize correctly, so we should always check the homogeneity of the two groups, especially with small samples.

In addition, randomization allows us to perform masking appropriately, which gives us an unbiased measurement of the response variable, avoiding information biases. The results of the intervention group can be compared with those of the control group in three ways. One of them is to compare with a placebo. The placebo should be a preparation with physical characteristics indistinguishable from the intervention drug but without its pharmacological effects. This serves to control the placebo effect (which depends on the patient’s personality, their feelings towards the intervention, their love for the research team, etc.), but also the side effects that are due to the intervention and not to the pharmacological effect (think, for example, of the percentage of local infections in a trial with a medication administered intramuscularly).

The other way is to compare with the treatment accepted as the most effective so far. If there is a treatment that works, the logical (and more ethical) thing is to use it to investigate whether the new one brings benefits. It is also the usual comparison method in equivalence or non-inferiority studies. Finally, the third possibility is to compare with non-intervention, although in reality this is a far-fetched way of saying that only the usual care that any patient would receive in their clinical situation is applied.

It is essential that all participants in the trial follow the same follow-up schedule, which must be long enough to allow the expected response to occur. All losses that occur during follow-up should be detailed and analyzed, since they can compromise the validity and the power of the study to detect significant differences. And what do we do with those who are lost or end up in a different branch from the one assigned? If there are many, it may be more reasonable to reject the study. Another possibility is to exclude them and act as if they had never existed, but we may bias the results of the trial. A third possibility is to include them in the analysis in the branch of the trial in which they actually participated (there is always someone who gets confused and takes what they should not), which is known as analysis by treatment received or per-protocol analysis. And the fourth and last option is to analyze them in the branch to which they were initially assigned, regardless of what they did during the study. This is called intention-to-treat analysis, and it is the only one of the four possibilities that allows us to retain all the benefits that randomization had previously provided.

As a final phase, we would analyze and compare the data to draw the conclusions of the trial, using for this the association and impact measures of effect that, in the case of the clinical trial, are usually the response rate, the risk ratio (RR), the relative risk reduction (RRR), the absolute risk reduction (ARR) and the number needed to treat (NNT). Let’s see them with an example.

Let’s imagine that we carried out a clinical trial in which we tested a new antibiotic (let’s call it A so we don’t rack our brains) for the treatment of a serious infection of the location we are interested in studying. We randomize the selected patients and give them the new drug or the usual treatment (our control group), as chance dictates. In the end, we measure how many of our patients fail treatment (present the event we want to avoid).

Thirty six out of the 100 patients receiving drug A present the event to be avoided. Therefore, we can conclude that the risk or incidence of the event in those exposed (Ie) is 0.36. On the other hand, 60 of the 100 controls (we call them the group of not exposed) have presented the event, so we quickly calculate that the risk or incidence in those not exposed (Io) is 0.6.

At first glance we already see that the risk is different in each group, but as in science we have to measure everything, we can divide the risk in the exposed by the risk in the unexposed, thus obtaining the so-called risk ratio (RR = Ie / Io). A RR = 1 means that the risk is equal in the two groups. If the RR > 1 the event will be more likely in the exposed group (the exposure we are studying will be a risk factor for the production of the event) and if the RR is between 0 and 1, the risk will be lower in the exposed. In our case, RR = 0.36 / 0.6 = 0.6. It is easier to interpret a RR > 1. For example, a RR of 2 means that the probability of the event is twice as high in the exposed group. Following the same reasoning, a RR of 0.3 would tell us that the event is only about a third as frequent in the exposed as in the controls. You can see in the attached table how these measures are calculated.

But what we are interested in is knowing how much the risk of the event decreases with our intervention, to estimate how much effort is needed to prevent each one. For this we can calculate the RRR and the ARR. The RRR is the difference in risk between the two groups relative to the control group (RRR = [Io – Ie] / Io). In our case it is 0.4, which means that the intervention tested reduces the risk by 40% compared to the usual treatment.

The ARR is simpler: it is the difference between the risks in controls and exposed (ARR = Io – Ie). In our case it is 0.24, which means that out of every 100 patients treated with the new drug there will be 24 fewer events than if we had used the control treatment. But there is still more: we can know how many we have to treat with the new drug to avoid one event by just doing a rule of three (24 is to 100 as 1 is to x) or, easier to remember, calculating the inverse of the ARR. Thus, NNT = 1 / ARR = 4.2. In our case we would have to treat about four patients to avoid one adverse event. The context will always tell us the clinical importance of this figure.

As you can see, the RRR, although technically correct, tends to magnify the effect and does not clearly quantify the effort required to obtain the results. In addition, it may be similar in situations with totally different clinical implications. Let’s see it with another example that I also show you in the table. Suppose another trial with a drug B in which we obtain three events in the 100 treated and five in the 100 controls. If you do the calculations, the RR is 0.6 and the RRR is 0.4, as in the previous example, but if you calculate the ARR you will see that it is very different (ARR = 0.02), with an NNT of 50. It is clear that the effort to avoid one event is much greater (4 versus 50) despite having the same RR and RRR.
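
All the measures for the two hypothetical drugs can be reproduced with a few lines of Python (the event counts are the ones used in the examples above):

```python
# Risk measures for the two hypothetical trials in the text (events / patients per arm)
def trial_measures(events_treated, n_treated, events_control, n_control):
    ie = events_treated / n_treated        # risk in the intervention group
    io = events_control / n_control        # risk in the control group
    rr = ie / io                           # risk ratio
    rrr = (io - ie) / io                   # relative risk reduction
    arr = io - ie                          # absolute risk reduction
    nnt = 1 / arr                          # number needed to treat
    return rr, rrr, arr, nnt

for name, data in {"drug A": (36, 100, 60, 100), "drug B": (3, 100, 5, 100)}.items():
    rr, rrr, arr, nnt = trial_measures(*data)
    print(f"{name}: RR = {rr:.2f}, RRR = {rrr:.2f}, ARR = {arr:.2f}, NNT = {nnt:.0f}")
```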

So, at this point, let me give you some advice. Since the data needed to calculate the RRR are the same as those needed to calculate the simpler ARR (and the NNT), if a scientific paper offers you only the RRR and hides the ARR, distrust it, and do as you would with the brother-in-law who offers you wine and cured cheese: ask him why he does not put out a plate of Iberian ham instead. Well, what I really mean is that you had better ask yourselves why they do not give you the ARR, and compute it yourselves from the information in the article.

So far, everything we have said refers to the classical design of parallel clinical trials, but the king of designs has many faces and, very often, we can find papers in which it appears a little differently, which may mean that the analysis of the results has special peculiarities.

Let’s start with one of the most frequent variations. If we think about it for a moment, the ideal design would be one that allowed us to test on the same individual the effect of the study intervention and of the control intervention (the placebo or the standard treatment), since the parallel trial is an approximation that assumes that the two groups respond equally to the two interventions, which always implies a risk of bias that we try to minimize with randomization. If we had a time machine we could try the intervention on everyone, write down what happens, turn back the clock and repeat the experiment with the control intervention, so we could compare the two effects. The problem, as the more alert among you will have already imagined, is that the time machine has not been invented yet.

But what has been invented is the cross-over clinical trial, in which each subject acts as their own control. As you can see in the attached figure, in this type of trial each subject is randomized to a group, subjected to the intervention, allowed a wash-out period and, finally, subjected to the other intervention. Although this solution is not as elegant as the time machine, the defenders of cross-over trials point out that the variability within each individual is smaller than the variability between individuals, so the estimate can be more precise than that of the parallel trial and, in general, smaller sample sizes are needed. Of course, before using this design you have to make a series of considerations. Logically, the effect of the first intervention should not produce irreversible or very prolonged changes, because it would affect the effect of the second. In addition, the wash-out period must be long enough to avoid any residual effect of the first intervention.

It is also necessary to consider whether the order of the interventions can affect the final result (sequence effect), in which case only the results of the first intervention would be valid. Another problem is that, since these trials last longer, the characteristics of the patient can change throughout the study and be different in the two periods (period effect). And finally, beware of losses during the study, which are more frequent in longer studies and have a greater impact on the final results than in parallel trials.

Imagine now that we want to test two interventions (A and B) in the same population. Can we do it with a single trial and save costs of all kinds? Yes, we can, we just have to design a factorial clinical trial. In this type of trial, each participant undergoes two consecutive randomizations: first they are assigned to intervention A or to placebo (P) and, second, to intervention B or placebo, so we end up with four study groups: AB, AP, BP and PP. As is logical, the two interventions must act by independent mechanisms so that the two effects can be assessed independently.

Usually, one intervention related to a more plausible and mature hypothesis and another with a less well-supported hypothesis are studied, making sure that the evaluation of the second does not influence the inclusion and exclusion criteria of the first. In addition, neither of the two interventions should have many bothersome effects or be poorly tolerated, because poor compliance with one treatment usually leads to poor compliance with the other. In cases where the two interventions are not independent, the effects could be studied separately (AP versus PP and BP versus PP), but the advantages of the design are lost and the necessary sample size increases.

At other times it may happen that we are in a hurry to finish the study as soon as possible. Imagine a very bad disease that kills lots of people while we are trying a new treatment. We want to have it available as soon as possible (if it works, of course), so after every so many participants we will stop and analyze the results and, if we can already demonstrate the usefulness of the treatment, we will consider the study finished. This is the design that characterizes the sequential clinical trial. Remember that in the parallel trial the correct thing is to calculate the sample size beforehand. In this design, with a more Bayesian mentality, a statistic is established whose value determines an explicit termination rule, so that the size of the sample depends on the previous observations. When the statistic reaches the predetermined value we feel confident enough to reject the null hypothesis and we finish the study. The problem is that each stop and analysis increases the probability of rejecting the null hypothesis when it is actually true (type 1 error), so it is not advisable to do many interim analyses. In addition, the final analysis of the results is complex because the usual methods do not work, although there are others that take the interim analyses into account. This type of trial is very useful with very fast-acting interventions, so it is common to see them in dose-titration studies of opioids, hypnotics and similar poisons.

There are other occasions when individual randomization does not make sense. Imagine we have taught the doctors of a center a new technique to better inform their patients and we want to compare it with the old one. We cannot tell the same doctor to inform some patients in one way and others in another, since there would be many opportunities for the two interventions to contaminate each other. It would be more logical to teach the doctors of one group of centers and not those of another group and compare the results. Here what we would randomize is the centers, to train their doctors or not. This is the trial with group (cluster) assignment. The problem with this design is that we do not have many guarantees that the participants of the different groups behave independently, so the necessary sample size can increase a lot if there is great variability between the groups and little within each group. In addition, an aggregate analysis of the results has to be done, because if it is done individually the confidence intervals become falsely narrow and we can find spurious statistical significance. The usual approach is to calculate a weighted summary statistic for each group and make the final comparisons with it.

The last of the series that we are going to discuss is the community trial, in which the intervention is applied to population groups. As they are carried out in real conditions on populations, they have great external validity and often allow cost-efficient measures to be derived from their results. The problem is that it is often difficult to establish control groups, it can be more difficult to determine the necessary sample size and it is more complex to make causal inferences from their results. It is the typical design for evaluating public health measures such as water fluoridation, vaccinations, etc.

I’m done now. The truth is that this post has been a bit long (and I hope not too hard), but the King deserves it. In any case, if you think that everything is said about clinical trials, you have no idea of all that remains to be said about types of sampling, randomization, etc., etc., etc. But that is another story…

From the hen to the egg


Surely someone overflowing with genius has asked you on some occasion, with a smug look: what came first, the hen or the egg? Well, the next time you meet someone like that, you can answer with another question: what do the hen and the egg have to do with each other? Because we must first know not only whether, in order to have hens, we need to have eggs first, but also how likely we are to end up having hens, with or without eggs (some twisted mind will say that the question could be asked the other way round, but I am among those who think that the first thing we have to have, no offense, are eggs).

This approach would lead us to the design of a case-control study, which is an observational and analytical study in which sampling is done on the basis of presenting a certain disease or effect (the cases) and that group is compared with another group that does not present it (the controls), in order to determine whether there is a difference in the frequency of exposure to a certain risk factor between the two groups. These studies are of retrograde directionality and of mixed temporality, so most of them are retrospective although, as was the case with cohort studies, they can also be prospective (perhaps the most useful key to distinguish between the two is the sampling of each one, based on the exposure in cohort studies and on the effect in case-control studies).

In the attached figure you can see the typical design of a case-control study. These studies start from a specific population from which a sample of cases, which usually includes all diagnosed and available cases, is compared with a control group consisting of a balanced sample of healthy subjects from the same population. However, it is increasingly common to find variations on the basic design that combine characteristics of cohort and case-control studies, comparing the cases that appear in a stable cohort over time with controls from a partial sample extracted from that same cohort.

The best known of these mixed designs is the nested case-control study. Here we start with a well-known cohort in which we identify the cases as they occur. Each time a case appears, it is paired with one or more controls also taken from the initial cohort. If we think about it briefly, it is possible that a subject initially selected as a control becomes a case over time (developing the disease under study). Although it may seem that this could bias the results, it should not, since the aim is to measure the effect of the exposure at the time of the analysis. This design can be done with smaller cohorts, so it can be simpler and cheaper. In addition, it is especially useful in very dynamic cohorts with many entries and exits over time, especially if the incidence of the disease under study is low.

Another variant of the basic design is the case-cohort study. In this type, we start from a very large cohort from which we select a smaller sub-cohort. The cases will be the patients that appear in either of the two cohorts, while the controls will be the subjects of the smaller (and more manageable) sub-cohort. These studies have a slightly more complicated method of analysis than the basic designs, since they have to compensate for the fact that the cases are overrepresented because they come from the two cohorts. The great advantage of this design is that it allows studying several diseases at the same time, comparing the different cohorts of patients with the sub-cohort chosen as control.

Finally, one last variation that we are going to discuss is the many-named case-crossover design, also known as crossed cases and controls or self-controlled cases. In this paired design, each individual serves as their own control, comparing the exposure during the period of time closest to the onset of the disease (case period) with the exposure during a previous period of time (control period). This approach is useful when the exposure is short, has a foreseeable time of action and produces a disease of short duration. These designs are widely used, for example, to study the adverse effects of vaccines.

As in cohort studies, case-control studies allow the calculation of a whole series of association and impact measures. However, here we have a fundamental difference with cohort studies. In cohort studies we started from a cohort without the disease in which cases appeared during follow-up, which allowed us to calculate the risk of becoming ill over time (the incidence). Thus, the quotient between the incidences in exposed and unexposed gave us the risk ratio, the main measure of association.

However, as can be deduced from the design of case-control studies, in this case we cannot make a direct estimate of the incidence or prevalence of the disease, since the proportion of patients is determined by the selection criteria of the researcher and not by the incidence in the population (a fixed number of cases and controls is selected at the beginning, but we cannot calculate the risk of being a case in the population). Thus, faced with the impossibility of calculating the risk ratio, we resort to the calculation of the odds ratio (OR), as you can see in the second figure.

The OR has an interpretation similar to that of the risk ratio and can take values from zero to infinity. An OR = 1 means that there is no association between exposure and effect. An OR < 1 means that the exposure is a protective factor against the effect. Finally, an OR > 1 indicates that the exposure is a risk factor, and the higher the OR, the stronger the association.

Anyway, and only for those who like getting into trouble, I will tell you that it is possible to estimate incidences from the results of a case-control study. If the incidence of the disease under study is low (below 10%), the OR and the risk ratio are comparable, so we can estimate the incidence in an approximate way. If the incidence of the disease is greater, the OR tends to overestimate the risk ratio, so we cannot consider them equivalent. In any case, if we know beforehand the incidence of the disease in the population (obtained from other studies), we can calculate the incidences using the following formulas:

I0 = It / (OR x Pe + P0)

Ie = I0 x OR,

where It is the total incidence, Ie the incidence in exposed, I0 the incidence in not exposed, Pe the proportion of exposed, and P0 the proportion of not exposed.
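
As a sketch of how these formulas are used (the 2 by 2 counts, the total incidence and the proportion of exposed below are hypothetical, chosen only for illustration):

```python
# Hypothetical case-control counts, only to illustrate the calculations
exposed_cases, unexposed_cases = 40, 60
exposed_controls, unexposed_controls = 20, 80

# Odds ratio from the 2x2 table (cross-product ratio)
odds_ratio = (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)

def incidences(total_incidence, odds_ratio, p_exposed):
    """Estimate I0 and Ie when It and Pe are known from other sources."""
    p_unexposed = 1 - p_exposed
    i0 = total_incidence / (odds_ratio * p_exposed + p_unexposed)  # incidence in unexposed
    ie = i0 * odds_ratio                                           # incidence in exposed
    return i0, ie

# Hypothetical external data: total incidence 5%, 20% of the population exposed
i0, ie = incidences(total_incidence=0.05, odds_ratio=odds_ratio, p_exposed=0.2)
print(f"OR = {odds_ratio:.2f}, I0 = {i0:.3f}, Ie = {ie:.3f}")
```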

Although the OR allows us to estimate the strength of the association between the exposure and the effect, it does not tell us about the potential effect that eliminating the exposure would have on the health of the population. For this, we have to resort to the measures of attributable risk (as we did with cohort studies), which can be absolute or relative.

There are two absolute measures of attributable risk. The first is the attributable risk in exposed (ARE), which is the difference between the incidence in exposed and not exposed and represents the amount of incidence that can be attributed to the risk factor in the exposed. The second is the population attributable risk (PAR), which represents the amount of incidence that can be attributed to the risk factor in the general population.

On the other hand, there are also two relative measures of attributable risk (also known as attributable or etiological fractions). First, the attributable fraction in exposed (AFE), which represents the difference in risk relative to the incidence in the group exposed to the factor. Second, the population attributable fraction (PAF), which represents the difference in risk relative to the incidence in the general population.

In the attached table I show you the formulas for the calculation of these parameters, which are somewhat more complex than in the case of cohort studies.

The problem with these impact measures is that they can sometimes be difficult for the clinician to interpret. For this reason, and inspired by the calculation of the number needed to treat (NNT) of clinical trials, a series of measures called impact numbers have been devised, which give us a more direct idea of the effect of the exposure factor on the disease under study. These impact numbers are the number of impact in exposed (NIE), the number of impact in cases (NIC) and the number of impact in exposed cases (NIEC).

Let’s start with the simplest one. The NIE would be the equivalent of the NNT and would be calculated as the inverse of the absolute risk reduction or of the risk difference between exposed and not exposed. The NNT is the number of people who should be treated to prevent a case compared to the control group. The NIE represents the average number of people who have to be exposed to the risk factor so that a new case of illness occurs compared to the people who are not exposed. For example, a NIE of 10 means that out of every 10 exposed there will be a case of disease attributable to the risk factor studied.

The NIC is the inverse of the PAF, so it defines the average number of sick people among which a case is due to the risk factor. An NIC of 10 means that for every 10 patients in the population, one is attributable to the risk factor under study.

Finally, the NIEC is the inverse of the AFE. It is the average number of patients among which a case is attributable to the risk factor.

In summary, these three parameters measure the impact of exposure among all exposed (NIE), among all patients (NIC) and among all patients who have been exposed (NIEC). It will be useful for us to try to calculate them if the authors of the study do not do so, since they will give us an idea of the real impact of the exposure on the effect. In the second table I show you the formulas that you can use to obtain them.

As a culmination of the previous three, we could estimate the effect of the exposure on the entire population by calculating the number of impact in the population (NIP), for which we only have to take the inverse of the PAR. Thus, a NIP of 3000 means that for every 3,000 subjects of the population there will be one case of illness due to the exposure.

In addition to assessing the measures of association and impact, when appraising a case-control study we will have to pay special attention to the presence of biases, since they are the observational studies that have the greatest risk of presenting them.

Case-control studies are relatively simple to carry out, have in general a lower cost than other observational studies (including cohort studies), allow us to study several exposure factors at the same time and to know how they interact, and they are ideal for diseases or exposure factors with a very low frequency. The problem with this type of design is that you have to be extremely careful selecting cases and controls, since it is very easy to commit any one of a list of biases that, to this day, has no known end.

In general, the selection criteria should be the same for cases and controls but, since to be a case one has to be diagnosed and be available for the study, it is very likely that the cases are not fully representative of the population. For example, if the diagnostic criteria are not sensitive and specific enough we will get many false positives and negatives, with the consequent dilution of the effect of the exposure to the factor.

Another possible problem depends on the selection of incident (newly diagnosed) or prevalent cases. Prevalence-based studies favor the selection of survivors (as far as we know, no dead person has ever agreed to participate in a study) and, if survival is related to the exposure, the risk identified will be lower than with incident cases. This effect is even more evident when the exposure factor is of good prognosis, a situation in which prevalence studies produce a greater overestimation of the association. As an example to better understand these issues, let’s suppose that the risk of suffering a heart attack is higher the more one smokes. If we include only prevalent cases we will exclude those who died of the more severe heart attacks, who would probably be the heaviest smokers, so the effect of smoking could be underestimated.

But if selecting cases seems complicated, it is nothing compared to a good selection of controls. Ideally, the controls should have had the same likelihood of exposure as the cases or, put another way, they should be representative of the population from which the cases were drawn. In addition, this must be combined with the exclusion of those who have any illness related, positively or negatively, to the exposure factor. For example, if we want to waste our time studying the association between thrombophlebitis in air passengers and prior aspirin ingestion, we must exclude from the study the controls that have any other disease being treated with aspirin, even if they had not taken it before the journey.

We also have to be careful with some habits of control selection. For instance, patients who come to the hospital for reasons other than the one under study are close at hand and tend to be very cooperative and, being sick, they probably recall past exposures to risk factors better. But the problem is that they are ill, so their pattern of exposure to risk factors may be different from that of the general population.

Another resource is to include neighbors, friends, relatives, etc. These are usually very comparable and cooperative, but we run the risk of shared exposure habits that can alter the study results. All of these problems are avoided by taking controls from the general population, but that is more costly in effort and money, such controls are usually less cooperative and, above all, much more forgetful (healthy people recall past exposures to risk factors less well), so the quality of the information we obtain from cases and controls can be very different.

Just one more comment to end this most enjoyable topic. Case-control studies share a characteristic with the rest of the observational studies: they detect the association between exposure and effect, but they do not allow us to establish causal relationships with certainty, for which we need other types of studies such as randomized clinical trials. But that is another story…

One about Romans


What fellows, those Romans! They came, they saw, they conquered. With those legions, each one with ten cohorts, each cohort with almost five hundred Romans with their skirts and strappy sandals. The cohorts were groups of soldiers within earshot of the same commander’s harangue. They always went forward, never retreating. This is how you can conquer Gaul (though not entirely, as is well known).

In epidemiology, a cohort is also a group of people who share something but, instead of the boss’s harangue, it is the exposure to a factor that is studied over time (neither the skirt nor the sandals are essential). Thus, a cohort study is an observational, analytical design, of anterograde directionality and of concurrent or mixed temporality, that compares the frequency with which a certain effect occurs (usually a disease) in two different groups (cohorts), one of them exposed to a factor and the other not exposed to the same factor (see attached figure).

Therefore, sampling is related to exposure to the factor. Both cohorts are studied over time, which is why most cohort studies are prospective or of concurrent temporality (they go forward, like the Roman cohorts). However, it is possible to do retrospective cohort studies once both the exposure and the effect have occurred. In these cases, the researcher identifies the exposure in the past, reconstructs the experience of the cohort over time and observes in the present the appearance of the effect, which is why these are studies of mixed temporality.

We can also classify the cohort studies according to whether they use an internal or external comparison group. Sometimes we can use two internal cohorts belonging to the same general population, classifying the subjects in one or another cohort according to the level of exposure to the factor. However, other times the exposed cohort will interest us because of its high level of exposure, so we will prefer to select an external cohort of subjects not exposed to make the comparison between both.

Another important aspect when classifying the cohort studies is the time of inclusion of the subjects in the study. When we only select the subjects that meet the inclusion criteria at the beginning of the study, we speak of a fixed cohort, whereas we will speak of an open or dynamic cohort when subjects continue to enter the study throughout the follow-up. This aspect will be important, as we will see later, when calculating the association measures between exposure and effect.

Finally, and as a curiosity, we can also do a study with a single cohort if we want to study the incidence or evolution of a certain disease. Although we can always compare the results with other known data of the general population, this type of designs lacks a comparison group in the strict sense, so it is included within the longitudinal descriptive studies.

When followed up over time, the cohort studies allow us to calculate the incidence of the effect between exposed and not exposed, calculating from them a series of association measures and specific impact measures.

In studies with closed cohorts in which the number of participants does not change, the measure of association is the relative risk (RR), which is the ratio between the incidence of exposed (Ie) and unexposed (I0): RR = Ie / I0.

As we know, the RR can take values from 0 to infinity. A RR = 1 means that there is no association between exposure and effect. A RR < 1 means that the exposure is a protective factor against the effect. Finally, a RR > 1 indicates that the exposure is a risk factor, and the higher the RR, the stronger the association.

The case of studies with open cohorts, in which participants can enter and leave during the follow-up, is a bit more complex, since instead of incidences we calculate incidence densities, a term that refers to the number of cases of the effect or disease that occur per number of people followed and per unit of follow-up time (for example, the number of cases per 100 person-years). In these cases, instead of the RR we calculate the incidence density ratio, which is the quotient of the incidence density in exposed divided by the incidence density in unexposed.
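
A minimal sketch with hypothetical numbers of the two situations, a closed cohort with simple incidences and an open cohort followed in person-years:

```python
# Closed cohort (hypothetical): cases / subjects in each group
rr = (10 / 200) / (5 / 400)       # Ie / I0

# Open cohort (hypothetical): cases / person-years of follow-up in each group
idr = (10 / 850) / (5 / 1900)     # incidence density ratio

print(f"RR = {rr:.1f}, incidence density ratio = {idr:.1f}")
```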

These measures allow us to estimate the strength of the association between the exposure to the factor and the effect, but they do not inform us about the potential impact that exposure has on the health of the population (the effect that eliminating this factor would have on the health of the population). For this, we will have to resort to the measures of attributable risk, which can be absolute or relative.

There are two absolute measures of attributable risk. The first is the attributable risk in exposed (ARE), which is the difference between the incidence in exposed and not exposed and represents the amount of incidence that can be attributed to the risk factor in the exposed. The second is the population attributable risk (PAR), which represents the amount of incidence that can be attributed to the risk factor in the general population.

On the other hand, there are also two relative measures of attributable risk (also known as attributable or etiological fractions). First, the attributable fraction in exposed (AFE), which represents the difference in risk relative to the incidence in the group exposed to the factor. Second, the population attributable fraction (PAF), which represents the difference in risk relative to the incidence in the general population.

In the attached table you can see the formulas that are used for the calculation of these impact measures.

The problem with these impact measures is that they can sometimes be difficult for the clinician to interpret. For this reason, and inspired by the calculation of the number needed to treat (NNT) of clinical trials, a series of measures called impact numbers have been devised, which give us a more direct idea of the effect of the exposure factor on the disease under study. These impact numbers are the number of impact in exposed (NIE), the number of impact in cases (NIC) and the number of impact in exposed cases (NIEC).

Let’s start with the simplest one. The NIE would be the equivalent of the NNT and would be calculated as the inverse of the absolute risk reduction or of the risk difference between exposed and not exposed. The NNT is the number of people who should be treated to prevent a case compared to the control group. The NIE represents the average number of people who have to be exposed to the risk factor so that a new case of illness occurs compared to the people who are not exposed. For example, a NIE of 10 means that out of every 10 exposed there will be a case of disease attributable to the risk factor studied.

The NIC is the inverse of the PAF, so it defines the average number of sick people among which a case is due to the risk factor. An NIC of 10 means that for every 10 patients in the population, one is attributable to the risk factor under study.

Finally, the NIEC is the inverse of the AFE. It is the average number of patients among which a case is attributable to the risk factor.

In summary, these three parameters measure the impact of exposure among all exposed (NIE), among all patients (NIC) and among all patients who have been exposed (NIEC). It will be useful for us to try to calculate them if the authors of the study do not do so, since they will give us an idea of the real impact of the exposure on the effect. In the second table I show you the formulas that you can use to obtain them.

As a culmination of the previous three, we could estimate the effect of the exposure on the entire population by calculating the number of impact in the population (NIP), for which we only have to take the inverse of the PAR. Thus, a NIP of 3000 means that for every 3,000 subjects of the population there will be one case of illness due to the exposure.
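
Putting together the attributable risk measures and their impact numbers, here is a sketch with hypothetical incidences; since the attached tables are not reproduced here, the formulas below are the standard definitions that correspond to the descriptions given above:

```python
# Hypothetical incidences and exposure proportion, only for illustration
ie, i0 = 0.20, 0.05            # incidence in exposed and unexposed
pe = 0.30                      # proportion of exposed in the population
it = ie * pe + i0 * (1 - pe)   # incidence in the whole population

are = ie - i0                  # attributable risk in exposed
par = it - i0                  # population attributable risk
afe = (ie - i0) / ie           # attributable fraction in exposed
paf = (it - i0) / it           # population attributable fraction

nie = 1 / are                  # number of impact in exposed
nic = 1 / paf                  # number of impact in cases
niec = 1 / afe                 # number of impact in exposed cases
nip = 1 / par                  # number of impact in the population

print(f"ARE = {are:.2f}, PAR = {par:.3f}, AFE = {afe:.2f}, PAF = {paf:.2f}")
print(f"NIE = {nie:.1f}, NIC = {nic:.1f}, NIEC = {niec:.1f}, NIP = {nip:.1f}")
```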

Another aspect that we must take into account when dealing with cohort studies is their risk of bias. In general, observational studies have a higher risk of bias than experimental studies, as well as being susceptible to the influence of confounding factors and effect modifying variables.

The selection bias must always be considered, since it can compromise the internal and external validity of the study results. The two cohorts should be comparable in all aspects, in addition to being representative of the population from which they come.

Another very typical bias of cohort studies is classification bias, which occurs when participants are erroneously classified in terms of their exposure or the detection of the effect (basically, it is just another information bias). Classification bias is non-differential when the error occurs randomly, independently of the study variables. This type of classification bias works in favor of the null hypothesis, that is, it makes it harder for us to detect the association between exposure and effect, if it exists. If, despite the bias, we detect the association, then nothing bad will have happened, but if we do not detect it, we will not know whether it does not exist or whether we cannot see it because of the bad classification of the participants. On the other hand, classification bias is differential when it occurs differently in the two cohorts and is related to some of the study variables. In this case there is no forgiveness or possibility of amendment: the direction of this bias is unpredictable and it mortally compromises the validity of the results.

Finally, we should always be alert to the possibility of confounding bias (due to confounding variables) or interaction (due to effect-modifying variables). The ideal is to prevent them in the design phase, but it is not superfluous to control confounding in the analysis phase, mainly through stratified analyses and multivariate models.

And with this we come to the end of this post. We have seen, then, that cohort studies are very useful to estimate the association between exposure and effect and its impact but, careful, they do not serve by themselves to establish causal relationships. For that, other types of studies are necessary.

The problem with cohort studies is that they are difficult (and expensive) to perform properly, often require large samples and sometimes long follow-up periods (with the consequent risk of losses). In addition, they are not very useful for rare diseases. And we must not forget that they do not allow us to establish causal relationships with complete certainty although, for this purpose, they do fare better than their cousins, the case-control studies. But that is another story…

Which family do you belong to?


As we already know from previous posts, the systematic approach of evidence-based medicine begins with a knowledge gap that moves us to ask a structured clinical question. Once we have formulated the question, we will use its components to carry out a bibliographic search and obtain the best available evidence to resolve our doubt.

And here comes what is, perhaps, the most feared task of evidence-based medicine: the critical appraisal of the evidence found. Actually, it is not such a big deal since, with a little practice, critical reading consists only of systematically applying a series of questions to the article we are analyzing. The problem sometimes comes in knowing which questions we have to ask, since this varies according to the design of the study we are evaluating.

Here, by design we understand the set of procedures, methods and techniques used with the study participants, during data collection and during the analysis and interpretation of the results to reach the study’s conclusions. And there is a myriad of possible study designs, especially in recent times, when epidemiologists have taken to designing mixed observational studies. In addition, the terminology can sometimes be confusing and use terms that do not make clear which design we have in front of us. It is like arriving at the wedding of someone from a large family and meeting a cousin we do not know where he comes from. Even if we look for physical resemblances, we will most likely end up asking him: and you, which family do you belong to? Only then will we know whether he belongs to the groom’s side or the bride’s.

What we are going to do in this post is something similar. We will try to lay out a series of criteria for classifying studies and, finally, a series of questions whose answers allow us to identify which family they belong to.

To begin with, the type of clinical question the work tries to answer can give us some guidance. If the question is of a diagnostic nature, we will most likely be facing what is called a diagnostic test study, which is usually a design in which a series of participants are subjected, in a systematic and independent way, to the test under study and to the reference standard (the gold standard). It is a type of design especially suited to this kind of question but do not be surprised: sometimes we will see diagnostic questions that authors try to answer with other types of studies.

If the question is about treatment, it is most likely that we are facing a clinical trial or, sometimes, a systematic review of clinical trials. However, there are not always trials on everything we look for and we may have to settle for an observational study, such as a case-control or a cohort study.

In the case of questions about prognosis and etiology/harm we may find ourselves reading a clinical trial but, most often, trials are not feasible and we only have observational studies available.

Once this aspect has been analyzed, we may still have doubts about the type of design we are facing. It will then be time to turn to our questions about six criteria related to the methodological design: the general objective of the clinical question, the direction of the study, the type of sampling of the participants, the temporality of the events, the assignment of the study factors and the study units used. Let’s see in detail what each of these six criteria means, which you can see summarized in the attached table.

According to the objective, studies can be descriptive or analytical. A descriptive study is one that, as the name suggests, only has the purpose of describing how things are, without intending to establish causal relationships between the risk factor or exposure and the effect studied (a certain disease or health event, in most cases). These studies answer fairly simple questions such as how many? where? or in whom?, so they are usually simple and serve to generate hypotheses that will later need more complex studies for their confirmation.

By contrast, analytical studies do try to establish such relationships, answering questions like why? how do we deal with it? or how can we prevent it? Logically, to establish such relationships they will need a group with which to compare (the control group). This will be a useful clue to distinguish between analytical and descriptive studies if we are in any doubt: the presence of a comparison group is typical of analytical studies.

The directionality of the study refers to the order in which the exposure and its effect are investigated. The study will have antegrade directionality when the exposure is studied before the effect, and retrograde directionality when the opposite is done. For example, if we want to investigate the effect of smoking on coronary mortality, we can take a group of smokers and see how many die of coronary disease (antegrade) or, conversely, take a group of deaths from coronary heart disease and see how many of them smoked (retrograde). Logically, only studies with antegrade directionality can ensure that the exposure precedes the effect in time (which is not to say that one is the cause of the other). Finally, sometimes we can find studies in which exposure and effect are studied at the same time; we then speak of simultaneous directionality.

The type of sampling has to do with how the study participants are selected. They can be chosen because they are subject to the exposure factor that interests us, because they have presented the effect, or based on a combination of the two or even on criteria other than exposure and effect.

Our fourth criterion is temporality, which refers to the relationship in time between the researcher and the exposure factor or the effect studied. A study will have a historical temporality when effect and exposure have already occurred when the study begins. On the other hand, when these events take place during the study, it will have a concurrent temporality. Sometimes the exposure can be historical and the effect concurrent, speaking then of mixed temporality.

Here I would like to make a point about two terms used by many authors that will be more familiar to you: prospective and retrospective. Prospective studies would be those in which exposure and effect have not yet occurred at the beginning of the study, while those in which the events have already occurred at the time of the study would be retrospective. To take it one step further, when both situations are combined we would talk about ambispective studies. The problem with these terms is that they are sometimes used interchangeably to express directionality or temporality, which are different concepts. In addition, they are usually associated with specific designs: prospective with cohort studies and retrospective with case-control studies. It may be better to use the specific criteria of directionality and temporality, which express these aspects of the design more precisely.

Two other terms related to temporality are cross-sectional and longitudinal studies. Cross-sectional studies are those that provide us with a snapshot of how things are at a given moment, so they do not allow us to establish temporal or causal relationships. They tend to be prevalence studies and are always descriptive in nature.

On the other hand, in longitudinal studies variables are measured over a period of time, so they do allow temporal relationships to be established, but the researcher does not control how the exposure is assigned to the participants. These studies may have an antegrade directionality (as in cohort studies) or a retrograde one (as in case-control studies).

The penultimate of the six criteria we are going to take into account is the assignment of the study factors. In this sense, a study will be observational when the researchers are mere observers who do not act on the assignment of the exposure. In these cases, the relationship between exposure and effect may be affected by other factors, known as confounding factors, so these studies do not allow conclusions to be drawn about causality. On the other hand, when the researcher assigns the exposure in a controlled manner according to a previously established protocol, we will talk about experimental or intervention studies. Experimental studies with randomization are the only ones that allow cause-effect relationships to be established and are, by definition, analytical studies.

The last of the criteria refers to the study units. Studies can be carried out on individual participants or on population groups. The latter are ecological studies and community trials, which have specific design characteristics.

In the attached figure you can see a scheme of how to classify the different epidemiological designs according to these criteria. When you have doubts about which design corresponds to the work you are appraising, follow this scheme. The first thing will be to decide whether the study is observational or experimental. This is usually simple, so we move on to the next point. A descriptive observational study (without a comparison group) will correspond to a case series or a cross-sectional study.

If the observational study is analytical, we will look at the type of sampling, which may be based on the disease or effect under study (case-control study) or on exposure to the risk or protective factor (cohort study).

Finally, if the study is experimental, we will look at whether the exposure or intervention has been assigned randomly and whether there is a comparison group. If so, we are dealing with a randomized controlled clinical trial. If not, it is probably an uncontrolled trial or another type of quasi-experimental design.
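As a purely illustrative toy (the labels are simplified and mixed designs are left out), the decision scheme of the figure can be condensed into a handful of chained questions; this little Python sketch simply follows the path we have just described:

```python
def classify_design(assignment, comparison_group=False, sampling=None, randomized=False):
    """Toy classifier following the decision scheme described in the post.

    assignment: "observational" or "experimental"
    comparison_group: is there a group to compare with?
    sampling: "by effect" or "by exposure" (analytical observational studies)
    randomized: was the intervention assigned at random? (experimental studies)
    """
    if assignment == "observational":
        if not comparison_group:
            return "descriptive study (case series or cross-sectional study)"
        if sampling == "by effect":
            return "case-control study"
        if sampling == "by exposure":
            return "cohort study"
        return "other analytical observational design"
    if assignment == "experimental":
        if randomized and comparison_group:
            return "randomized controlled clinical trial"
        return "uncontrolled trial or other quasi-experimental design"
    return "unclassified design"


# Example: observational study, with a comparison group, sampled by exposure
print(classify_design("observational", comparison_group=True, sampling="by exposure"))
# -> cohort study
```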

And here we will stop for today. We have seen how to identify the most common types of methodological designs. But there are many more. Some with a very specific purpose and their own design, such as economic studies. And others that combine characteristics of basic designs, such as case-cohort studies or nested studies. But that is another story…

The hereafter


We have already seen in previous posts how to search for information in Pubmed in different ways, from the simplest, which is the simple search, to the advanced search methods and filtering of results. Pubmed is, in my modest opinion, a very useful tool for professionals who have to look for biomedical information among the maelstrom of papers that are published daily.

However, Pubmed should not be our only search engine. Yes, ladies and gentlemen, not only does it turn out that there is life beyond Pubmed, but there is a lot of it, and it is interesting.

The first engine I can think of, because of its similarity to Pubmed, is Embase. This is Elsevier’s search engine, with about 32 million records from some 8500 journals from 95 countries. As with Pubmed, there are several search options that make it a versatile tool, somewhat more specific for European studies and for drug-related topics than Pubmed (or so they say). The usual practice when you want to do a thorough search is to use two databases, the combination of Pubmed and Embase being a frequent one, since each search engine will provide records that the other has not indexed. The big drawback of Embase, especially compared to Pubmed, is that its access is not free. Anyway, those who work in large health centers may be lucky enough to have a subscription paid for through their center’s library.

Another useful tool is provided by the Cochrane Library, which includes multiple resources such as the Cochrane Database of Systematic Reviews (CDSR), the Cochrane Central Register of Controlled Trials (CENTRAL), the Cochrane Methodology Register (CMR), the Database of Abstracts of Reviews of Effects (DARE), the Health Technology Assessment Database (HTA) and the NHS Economic Evaluation Database (EED). In addition, Spanish speakers can resort to the Cochrane Library Plus, which translates the works of the Cochrane Library into Spanish. Cochrane Plus is not free, but in Spain we enjoy a subscription kindly paid for by the Ministry of Health, Equality and Social Services.

And since we are talking about resources in Spanish, let me bring the ember closer to my own sardine and tell you about two search engines that are very dear to me. The first is Epistemonikos, which is a source of systematic reviews and other types of scientific evidence. The second is Pediaclic, a search tool for child health information resources, which classifies the results into a series of categories such as systematic reviews, clinical practice guidelines, evidence-based summaries, and so on.

In fact, Epistemonikos and Pediaclic are meta-search engines. A meta-search engine is a tool that searches across a series of databases rather than in a single indexed database like Pubmed or Embase.

There are many meta-search engines but, without a doubt, the king of all and one not to be missed is TRIP Database.

TRIP (Turning Research Into Practice) is a free-access meta-search engine that was created in 1997 to facilitate searching evidence-based medicine databases, although it has evolved and nowadays also retrieves information from image banks, documents for patients, electronic textbooks and even Medline (Pubmed’s database). Let’s take a look at how it works.

In the first figure you can see the top of the TRIP home page. In its simplest form, we select the link “Search” (the one that works by default when we open the page), type the English terms we want to search for in the search window and click on the magnifying glass on the right, whereupon the search engine will show us the list of results.

Although the latest version of TRIP includes a language selector, it is probably best to enter the terms in English in the search window, trying not to put more than two or three words to get the best results. Here we can use the same logical operators we saw in Pubmed (AND, OR and NOT), as well as the truncation operator “*”. In fact, if you type several words in a row, TRIP automatically includes the AND operator between them.
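For example (an invented query, just to illustrate the syntax), a string such as asthma AND obes* NOT adult would retrieve documents containing the word asthma and any term beginning with obes- (obese, obesity…), while excluding those mentioning adults; typing simply asthma obes* would be interpreted as asthma AND obes*.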

Next to “Search” you can see a link that says “PICO”. This opens a search menu in which we can select the four components of the structured clinical question separately: patients (P), intervention (I), comparison (C) and outcomes (O).

To the right there are two more links. “Advanced” allows advanced searches by record fields such as the name of the journal, title, year, etc. “Recent” gives us access to the search history. The problem is that, in the latest versions, these two links are reserved for licensed users. In previous versions of TRIP they were free, so I hope that this little flaw does not spread to the rest of the search engine and that TRIP does not end up becoming a paid resource.

There are video tutorials available on the search engine’s website about how the different TRIP search modes work, but the most attractive thing about TRIP is the way it orders the search results, which it does according to the source, the quality and the frequency of appearance of the search terms in the articles found. To the right of the screen you can see the list of results organized into a series of categories, such as systematic reviews, evidence-based medicine synopses, clinical practice guidelines, clinical questions, Medline articles filtered through Clinical Queries, etc.

We can click on one of the categories and restrict the list of results. Once this is done, we can restrict the list even more using subcategories. For example, if we select systematic reviews we can later restrict the list to only those of the Cochrane. The possibilities are many, so I invite you to try them.

Let’s look at an example. If I write “asthma obesity children” in the search string, I get 1117 results and the list of resources sorted on the right, as you can see in the second figure. If I now click on the “systematic review” index and then on “Cochrane”, I am left with a single result, although I can recover the rest just by clicking on any of the other categories. Have you ever seen such a combination of simplicity and power? In my humble opinion, with a decent command of Pubmed and the help of TRIP you can find everything you need, no matter how hidden it is.

And to finish today’s post, you’re going to allow me to ask you a favor: do not use Google to do medical searches or, at least, do not depend exclusively on Google, not even Google Scholar. This search engine is good for finding a restaurant or a hotel for holidays, but not for a controlled search for reliable and relevant medical information as we can do with other tools we have discussed. Of course, with the changes and evolutions that Google has accustomed us to, this may change over time and, maybe, in the future I will have to rewrite this post to recommend it (God forbid).

And here we will leave the topic of bibliographic searches. Needless to say, there are countless other search engines; you can use the one you like the most or the one you have available at your workplace. In some cases, as already mentioned, it is almost mandatory to use more than one, as in systematic reviews, in which the two big ones (Pubmed and Embase) are often used and combined with the Cochrane Library and some others specific to the subject matter. Because all the search engines we have seen are general, but there are specific ones for nursing, psychology, physiotherapy, etc., as well as for specific diseases. For example, if you do a systematic review on a tropical disease it is advisable to use a specific subject database, such as LILACS, as well as local journal search engines, if there are any. But that is another story…

Gathering the gold nuggets


I was thinking about today’s post and I cannot help remembering the gold-seekers of the Alaskan gold rush of the late nineteenth century. They traveled to the Yukon, looking for a good creek like the Bonanza, and collected tons of mud. But that mud was not the last step of the quest. From the sediments they had to extract the longed-for gold nuggets, so they carefully filtered them to keep only the gold, when there was any.

When we look for the best scientific evidence to solve our clinical questions we do something similar. Normally we choose one of the Internet search engines (like Pubmed, our Bonanza Creek) and we usually get a long list of results (our great pile of mud) that, finally, we will have to filter to extract the gold nuggets, if there are any among the search results.

We have already seen in previous posts how to do a simple search (the least specific and which will provide us with more mud) and how to refine the searches by using the MeSH terms or the advanced search form, with which we try to get less mud and more nuggets.

However, the usual situation is that, once we have the list of results, we have to filter it to keep only what interests us most. Well, for that there is a very popular tool within Pubmed that is, oh surprise, the use of filters.

Let’s see an example. Suppose we want to look for information about the relationship between asthma and obesity in childhood. The ideal would be to build a structured clinical question to perform a specific search, but to show more clearly how filters work we will do a simple, “badly designed” search in natural language, in order to obtain a greater number of results.

I open Pubmed’s home page, type asthma and obesity in children in the search box and press the “Search” button. I get 1169 results, although the number may vary if you do the search at another time.

You can see the result in the first figure. If you look closer, in the left margin of the screen there is a list of text with headings such as “Article types”, “text availability”, etc. Each section is one of the filters that I have selected to be shown in my results screen. You see that there are two links below. The first one says “Clear all” and serves to unmark all the filters that we have selected (in this case, still none). The second one says “Show additional filters” and, if we click on it, a screen with all the available filters appears so that we choose which we want them to be displayed on the screen. Take a look at all the possibilities.

When we want to apply a filter, we just have to click on the text under each filter header. In our case we will keep only the clinical trials published in the last five years for which the free full text is available (without having to pay a subscription). To do this, we click on “Clinical Trial”, “Free full text” and “5 years”, as you can see in the second figure. You can see that the list of results has been reduced to 11, a much more manageable figure than the original 1169.

Now we can remove filters one by one (by clicking on the word “clear” next to each filter), remove them all (by clicking “Clear all”) or add new ones (clicking on the filter we want).

There are two precautions to take into account when using filters. First, filters will remain active until we deactivate them. If we do not realize this and deactivate them, we may apply them to searches that we do later and get fewer results than expected. Second, filters are built using the MeSH terms that are assigned to each article at the time of indexing, so very recent articles, which have not been indexed yet and, therefore, have not had their MeSH terms assigned, will be lost when the filters are applied. That is why it is advisable to apply the filters at the end of the search process, which is better made more specific with other techniques such as the use of MeSH terms or the advanced search.
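Incidentally, everything we click in the filter panel can also be written directly into the search string as field tags, and even automated. Here is a minimal sketch in Python (just an illustration based on Pubmed’s usual field-tag syntax and its public E-utilities service; adjust the terms and dates to your own search, and double-check the tags against the current documentation) that reproduces a filtered search like the one in the example:

```python
# Minimal sketch: send a filtered Pubmed search to the E-utilities esearch service.
# The field tags below ([pt], [sb], [pdat]) follow Pubmed's usual syntax; verify
# them before relying on them, as the interface may change.
import json
import urllib.parse
import urllib.request

query = (
    "asthma AND obesity AND children "
    "AND clinical trial[pt] "                   # article type filter
    "AND free full text[sb] "                   # text availability filter
    'AND ("2013/01/01"[pdat] : "3000"[pdat])'   # publication date from 2013 onwards
)

url = (
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"
    + urllib.parse.urlencode({"db": "pubmed", "term": query, "retmode": "json"})
)

with urllib.request.urlopen(url) as response:
    result = json.load(response)

print("Records found:", result["esearchresult"]["count"])
print("First PMIDs:", result["esearchresult"]["idlist"][:5])
```

The web interface and a query string like this one are equivalent; the advantage of writing the tags yourself is that the search can be saved, repeated and shared exactly as it was run.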

Another option we have is to set up filters for all our searches but without automatically reducing the number of results. To do this we have to open an account in Pubmed by clicking on “Sign in to NCBI” in the upper right corner of the screen. Once we use the search engine as a registered user, we can click on the link at the top right that says “Manage filters” and select the filters we want. From then on, the searches we do will be unfiltered, but at the top right you will see links to the filters we have selected, with the number of results in parentheses (you can see it in the first two figures that I have shown). By clicking on them, we filter the list of results in a similar way as we did with the other filters, which are accessible without registering.

I would not like to leave the topic of Pubmed and its filters without talking about another search resource: Clinical Queries. You can access them by clicking on the corresponding link in the “Pubmed Tools” section of the search engine’s home page. Clinical Queries are a kind of filter built by Pubmed’s developers that restricts the search so that only articles related to clinical research are shown.

We type the search string in the search box and we obtain the results distributed in three columns, as you can see in the third attached figure. In the first column they are sorted according to the type of study (etiology, diagnosis, treatment, prognosis and clinical prediction guides) and the scope of the search, which may be more specific (“Narrow”) or broader (“Broad”). If we select “treatment” and the narrow scope (“Narrow”), we see that the search is limited to 25 articles.

The second column lists systematic reviews, meta-analyses, evidence-based medicine reviews, etc. Finally, the third focuses on papers on genetics.

If we want to see the complete list we can click on “See all” at the bottom of the list. We will then see a screen similar to the results of a simple or advanced search, as you see in the fourth attached figure. If you look at the search box, the search string has been slightly modified. Once we have this list we can modify the search string and press “Search” again, reapply the filters that suit us, etc. As you can see, the possibilities are endless.

And with this I think we are going to say goodbye to Pubmed. I encourage you to investigate the many other options and tools that are explained in the tutorials on the website, some of which will require you to have an NCBI account (remember it is free). You can, for example, set up alerts so that the search engine warns you when something new related to a certain search is published, among many other possibilities. But that’s another story…