Science without sense…double nonsense

Pills on evidence-based medicine

The crystal ball


How I wish I could predict the future! And not only to win millions in the lottery, which is the first thing that comes to mind. There are more important things in life than money (or so some say): decisions we make based on assumptions that end up not coming true and that complicate our lives to unsuspected limits. We have all thought at some point about "if only I could live twice…". I have no doubt: if I met the genie of the lamp, one of my three wishes would be a crystal ball to see the future.

And it would also serve us well in our work as doctors. In our day-to-day practice we are forced to make decisions about the diagnosis or prognosis of our patients, and we always do it on the swampy terrain of uncertainty, always assuming the risk of making some mistake. We, especially when we are more experienced, consciously or unconsciously estimate the likelihood of our assumptions, which helps us make diagnostic or therapeutic decisions. Still, it would be good to also have a crystal ball to anticipate the patient's clinical course more accurately.

The problem, as with other inventions that would be very useful in medicine (like the time machine), is that nobody has yet managed to manufacture a crystal ball that really works. But let's not get discouraged. We cannot know for sure what will happen, but we can estimate the probability that a certain result will occur.

For this, we can take all those patient-related variables with a known diagnostic or prognostic value and integrate them to calculate probabilities. Doing such a thing amounts to designing and applying what is known as a clinical prediction rule (CPR).

Thus, getting a little formal, we can define a CPR as a tool composed of a set of variables from the clinical history, physical examination and basic complementary tests, which provides us with an estimate of the probability of an event, suggests a diagnosis or predicts a specific response to a treatment.

The critical appraisal of an article about a CPR shares aspects with the appraisal of articles about diagnostic tests, but it also has specific aspects related to the methodology of its design and application. For this reason, we will briefly look at the methodological aspects of CPRs before getting into their critical appraisal.

In the process of developing a CPR, the first thing to do is to define it. The four key elements are the study population, the variables that we will consider as potentially predictive, the gold or reference standard that classifies whether the event we want to predict occurs or not and the criterion of assessment of the result.

It must be borne in mind that the variables we choose must be clinically relevant, collected accurately and, of course, available at the time we want to apply the CPR for decision making. It is advisable not to fall into the temptation of piling up variables endlessly since, apart from complicating the application of the CPR, it can decrease its validity. In general, it is recommended that for every variable introduced into the model there should be at least 10 occurrences of the event we want to predict (the rule is derived in a sample whose members all have the variables measured, but only some of them end up presenting the event).
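To make the 10-events-per-variable rule of thumb concrete, here is a minimal sketch; all the numbers and variable names are invented for illustration:

```python
# Rough events-per-variable (EPV) check for a candidate CPR.
# Hypothetical derivation sample: 120 patients, 36 of whom
# presented the event we want to predict.
n_events = 36
candidate_variables = ["age", "fever", "crp", "heart_rate", "oxygen_sat"]

# The usual rule of thumb: at least 10 events per predictive variable.
max_variables = n_events // 10
print(f"{n_events} events support at most {max_variables} predictors")
if len(candidate_variables) > max_variables:
    print("Too many candidate variables: simplify the model or collect more events")
```

With 36 events, only three predictors are supported, so the five candidates above would be too many.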

I would also like to highlight the importance of the gold standard. There must be a diagnostic test or a set of well-defined criteria that allow us to clearly define the event we want to predict with the CPR.

Finally, it is convenient that those who collect the variables during this definition phase are unaware of the results of the gold standard, and vice versa. The absence of blinding decreases the validity of the CPR.

The next step is the derivation or design phase itself. This is where the statistical methods are applied that allow us to include the predictive variables and exclude those that will not contribute anything. We will not go into the statistics; suffice it to say that the most commonly used methods are those based on logistic regression, although discriminant analysis, survival analysis and even more exotic approaches based on competing risks or neural networks can be used, affordable only to a virtuous few.

In logistic regression models, the event will be the dichotomous dependent variable (it happens or it does not happen) and the other variables will be the predictive or independent variables. The coefficient multiplying each predictive variable is the natural logarithm of that variable's adjusted odds ratio. In case anyone has not understood: the adjusted odds ratio for each predictive variable is calculated by raising the number "e" to the value of that variable's coefficient in the regression model.
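This relationship between coefficient and odds ratio can be checked in a couple of lines of code; the coefficients and variable names below are made up purely for illustration:

```python
import math

# Hypothetical coefficients from a logistic regression model of a CPR.
coefficients = {"fever_over_39": 0.85, "age_under_3_months": 1.30}

# The adjusted odds ratio of each predictor is e raised to its coefficient.
for variable, beta in coefficients.items():
    odds_ratio = math.exp(beta)
    print(f"{variable}: coefficient = {beta}, adjusted OR = {odds_ratio:.2f}")
```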

The usual approach is to assign a score on a scale according to the weight of each variable, so that the total sum of points across all the predictive variables allows us to classify the patient into a specific prediction range for the occurrence of the event. There are also more complex methods using regression equations but, in the end, you always get the same thing: an individualized estimate of the probability of the event in a particular patient.
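A toy version of such a point-score rule might look like this; the findings, points and cut-offs are all invented for illustration, not taken from any real rule:

```python
# A toy point-score rule: each predictor contributes points roughly
# proportional to its regression weight, and the total maps to a risk band.
POINTS = {"fever_over_39": 2, "age_under_3_months": 3, "crp_over_40": 2}

def risk_band(findings):
    """Sum the points of the findings present and return a risk category."""
    total = sum(POINTS.get(f, 0) for f in findings)
    if total <= 2:
        return total, "low risk"
    if total <= 4:
        return total, "intermediate risk"
    return total, "high risk"

print(risk_band(["fever_over_39"]))                        # (2, 'low risk')
print(risk_band(["fever_over_39", "age_under_3_months"]))  # (5, 'high risk')
```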

With this process we categorize patients into homogeneous probability groups, but we still need to know whether this categorization fits reality or, what amounts to the same thing, what the discrimination capacity of the CPR is.

The overall validity or discrimination capacity of the CPR is assessed by contrasting its results with those of the gold standard, using techniques similar to those used to assess the power of diagnostic tests: sensitivity, specificity, predictive values and likelihood ratios. In addition, in cases where the CPR provides a quantitative estimate, we can resort to ROC curves, since the area under the curve represents the global validity of the CPR.
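These indicators all come out of a simple 2x2 table contrasting the rule's classification with the gold standard. A minimal sketch, with hypothetical counts:

```python
# Contrasting the CPR's classification with the gold standard in a 2x2 table.
# Hypothetical counts: tp = rule positive and event occurred, and so on.
tp, fp, fn, tn = 40, 20, 10, 130

sensitivity = tp / (tp + fn)                   # proportion of events detected
specificity = tn / (tn + fp)                   # proportion of non-events ruled out
lr_positive = sensitivity / (1 - specificity)  # how much a positive raises the odds
lr_negative = (1 - sensitivity) / specificity  # how much a negative lowers the odds

print(f"Sensitivity {sensitivity:.2f}, specificity {specificity:.2f}")
print(f"LR+ {lr_positive:.2f}, LR- {lr_negative:.2f}")
```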

The last step of the design phase is the calibration of the CPR, which is nothing more than checking that it behaves well throughout the whole range of possible results, that is, that the predicted probabilities agree with those actually observed.

Some CPR authors stop here, but they forget two fundamental steps of the development: the validation and the assessment of the clinical impact of the rule.

The validation consists of testing the CPR in samples different from the one used for its design. We may be surprised to find that a rule that works well in one sample does not work in another. Therefore, it must be tested not only in similar patients (narrow validation) but also in different clinical settings (broad validation), which will increase the external validity of the CPR.

The last phase is to check its clinical performance. This is where many CPRs crash after having passed all the previous steps (maybe that is why this last check is often avoided). To assess the clinical impact, we have to apply the CPR to our patients and see how clinical outcome measures such as survival, complications, costs, etc. change. The ideal way to analyze the clinical impact of a CPR is to conduct a clinical trial with two groups of patients managed with and without the rule.

For those self-sacrificing souls who are still reading: now that we know what a CPR is and how it is designed, we will see how the critical appraisal of these works is done. And for this, as usual, we will rely on our three pillars: validity, relevance and applicability. So as not to forget anything, we will follow the questions listed on the CASP tool's grid for CPR studies.

Regarding VALIDITY, we will start with some elimination questions. If the answer to any of them is negative, it may be time to wait until someone finally invents a crystal ball that works.

Does the rule answer a well-defined question? The population, the event to be predicted, the predictive variables and the outcome evaluation criteria must be clearly defined. If this is not done or these components do not fit our clinical scenario, the rule will not help us. The predictive variables must be clinically relevant, reliable and well defined in advance.

Did the study population from which the rule was derived include an adequate spectrum of patients? It must be verified that the method of patient selection is adequate and that the sample is representative. In addition, it must include patients from the entire spectrum of the disease. As with diagnostic tests, events may be easier to predict in certain groups, so all of them must be represented. Finally, we must see whether the rule was validated in a different group of patients. As we have already said, it is not enough for the rule to work in the group of patients in which it was derived; it must be tested in other groups, similar to or different from those with which it was generated.

If the answer to these questions has been affirmative, we can move on to the next three. Was there a blind evaluation of the outcome and of the predictor variables? As we have already commented, it is important that the person who collects the predictive variables does not know the result of the reference standard, and vice versa. The collection of information must be prospective and independent. The next thing to ask is whether the predictor variables and the outcome were measured in all the patients. If the outcome or the variables are not measured in all patients, the validity of the CPR can be compromised. In any case, the authors should explain the exclusions, if there are any. Finally, are the methods of derivation and validation of the rule described? We already know that it is essential that the results of the rule be validated in a population different from the one used for the design.

If the answers to the previous questions indicate that the study is valid, we move on to the questions about the RELEVANCE of the results. The first is whether we can calculate the performance of the CPR. The results should be presented with their sensitivity, specificity, odds ratios, ROC curves, etc., depending on the kind of result provided by the rule (scoring scales, regression formulas, etc.). All these indicators will help us calculate the probability of occurrence of the event in settings with different prevalence. This is similar to what we did with studies of diagnostic tests, so I invite you to review the post on the subject so as not to repeat ourselves too much. The second question is: what is the precision of the results? We will not dwell on this either: remember our revered confidence intervals, which inform us of the precision of the rule's results.

To finish, we will consider the APPLICABILITY of the results to our environment, for which we will try to answer three questions. Will the reproducibility of the CPR and its interpretation be satisfactory in our setting? We will have to think about the similarities and differences between the setting in which the CPR was developed and our clinical environment. In this sense, it helps if the rule has been validated in several samples of patients from different settings, which increases its external validity. Is the test acceptable in our case? We will think about whether the rule is easy to apply in our environment and whether applying it there makes clinical sense. Finally, will the results modify clinical behavior, health outcomes or costs? If, from our point of view, the results of the CPR are not going to change anything, the rule will be useless and a waste of time. Here our opinion matters, but we must also look for studies that assess the impact of the rule on costs or on health outcomes.

And that is everything I wanted to tell you about the critical appraisal of studies on CPRs. Anyway, before finishing I would like to tell you a little about a checklist that, of course, also exists for the assessment of this type of study: the CHARMS checklist (CHecklist for critical Appraisal and data extraction for systematic Reviews of prediction Modeling Studies). You will not tell me that the name, although a bit fancy, is not lovely.

This list is designed to assess the primary studies of a systematic review on CPRs. It tries to answer some general design questions and assesses 11 domains to extract enough information to perform the critical appraisal. The two main areas assessed are the risk of bias of the studies and their applicability. The risk of bias refers to design or validation flaws that may make the model less discriminative, excessively optimistic, etc. Applicability, on the other hand, refers to the degree to which the primary studies match the question that motivates the systematic review, informing us of whether the rule can be applied to the target population. This list is good and helps in assessing and understanding the methodological aspects of this type of study but, in my humble opinion, it is easier to do a systematic critical appraisal using the CASP tool.

And here, finally, we leave it for today. So as not to go on too long, we have said nothing about what to do with the result of the rule. The fundamental thing, as we already know, is that we can calculate the probability of occurrence of the event in individual patients from settings with different prevalence. But that is another story…

Little ado about too much


Yes, I know the saying goes just the opposite. But that is precisely the problem we have with so much new information technology. Today anyone can write and make public whatever goes through their head, reaching a lot of people, even if what they say is bullshit (and no, I do not take this personally; not even my brother-in-law reads what I post!). The trouble is that much of what is written is not worth a thing, not to mention any type of excreta. There is a lot of smoke and little fire, when we would all like the opposite to happen.

The same happens in medicine when we need information to make a clinical decision. Whatever source we go to, the volume of information will not only overwhelm us; above all, most of it will not serve us at all. Also, even if we find a well-done article, it may not be enough to answer our question completely. That is why we love so much the literature reviews that some generous souls publish in medical journals. They save us the task of reviewing a lot of articles and summarizing their conclusions. Great, isn't it? Well, sometimes it is, sometimes it is not. As with any type of medical study, we should always make a critical appraisal and not rely solely on the good know-how of the authors.

Reviews, of which we already know there are two types, also have their limitations, which we must know how to assess. The simplest form, our favorite when we are young and ignorant, is what is known as a narrative or author's review. This type of review is usually done by an expert in the topic, who reviews the literature and analyzes what she finds as she sees fit (that is what being an expert is for), and then makes a qualitative synthesis with her expert conclusions. These reviews are good for getting a general idea about a topic, but they do not usually serve to answer specific questions. In addition, since it is not specified how the search for information was done, we cannot reproduce it or verify that it includes everything important written on the subject. With these reviews we can do little critical appraisal, since there is no precise systematization of how these summaries should be prepared, so we have to trust unreliable proxies such as the prestige of the author or the impact of the journal where the review is published.

As our knowledge of the general aspects of science increases, our interest shifts towards other types of reviews that provide more specific information about aspects that escape our increasingly broad knowledge. This other type is the so-called systematic review (SR), which focuses on a specific question, follows a clearly specified methodology for searching for and selecting information, and performs a rigorous and critical analysis of the results found. Moreover, when the primary studies are sufficiently homogeneous, the SR goes beyond the qualitative synthesis and also performs a quantitative synthesis, which goes by the nice name of meta-analysis. With these reviews we can do a critical appraisal following an ordered and pre-established methodology, much as we do with other types of studies.

The prototype of the SR is the one made by the Cochrane Collaboration, which has developed a specific methodology that you can consult in the handbooks available on its website. But, if you want my advice, do not trust even Cochrane: make a careful critical appraisal even if the review was done by them, and do not take it for granted simply because of its origin. As one of my teachers in these disciplines says (I am sure he is smiling if he is reading these lines), there is life after Cochrane. And, I would add, there is a lot of it, and good.

Although SRs and meta-analyses impose a bit of respect at first, do not worry: they can be critically appraised in a simple way by considering the main aspects of their methodology. And to do it, nothing better than to systematically review our three pillars: validity, relevance and applicability.

Regarding VALIDITY, we will try to determine whether the review gives us unbiased results and responds correctly to the question posed. As always, we will look for some primary validity criteria. If these are not fulfilled, it may be time to walk the dog: we will probably make better use of the time.

Has the aim of the review been clearly stated? All SRs should try to answer a specific question that is relevant from the clinical point of view and that usually arises following the PICO scheme of a structured clinical question. It is preferable for the review to try to answer only one question; if it tries to answer several, there is a risk of not answering any of them adequately. This question also determines the type of studies the review should include, so we must assess whether the appropriate type has been included. Although it is most common to find SRs of clinical trials, they can include other types of studies: observational studies, diagnostic test studies, etc. The authors of the review must specify the criteria for inclusion and exclusion of the studies, in addition to considering aspects such as the setting, study groups, outcomes, etc. Differences among the included studies in terms of patients (P), intervention (I) or outcomes (O) can make two SRs that ask the same question reach different conclusions.

If the answer to the previous questions is affirmative, we will consider the secondary criteria and leave the dog's walk for later. Have the important studies on the subject been included? We must verify that a global and unbiased search of the literature has been carried out. It is usual to do an electronic search of the most important databases (generally PubMed, Embase and the Cochrane Library), but this must be completed with a search strategy in other media to look for other works (references of the articles found, contact with well-known researchers, the pharmaceutical industry, national and international registries, etc.), including the so-called gray literature (theses, reports, etc.), since there may be important unpublished works. And let no one be surprised by the latter: it has been shown that studies with negative conclusions are at greater risk of not being published, so they do not appear in the SR. We must verify that the authors have ruled out the possibility of this publication bias. All of this selection process is usually captured in a flow diagram showing the fate of all the studies assessed for the SR.

It is very important that enough has been done to assess the quality of the studies, looking for possible biases. For this, the authors can use an ad hoc tool or, more usually, resort to one that is already recognized and validated, such as the Cochrane Collaboration's risk of bias tool, in the case of reviews of clinical trials. This tool assesses five criteria of the primary studies to determine their risk of bias: an adequate randomization sequence (prevents selection bias), adequate blinding (prevents performance and detection biases, both information biases), concealment of allocation (prevents selection bias), losses to follow-up (prevents attrition bias) and selective outcome reporting (prevents reporting bias). The studies are classified as being at high, low or unclear risk of bias according to the most important aspects of the design's methodology (clinical trials in this case).

In addition, this should be done independently by two authors and, ideally, without knowing the authors of the study or the journals where the primary studies of the review were published. Finally, the degree of agreement between the two reviewers should be recorded, along with what they did when they disagreed (the most common solution is to resort to a third party, who will probably be the boss of both).

To conclude with the internal or methodological validity, in case the results of the studies have been combined into a meta-analysis to draw common conclusions, we must ask ourselves whether it was reasonable to combine the results of the primary studies. To draw conclusions from combined data, it is fundamental that the studies are homogeneous and that the differences among them are due solely to chance. Although some variability among the studies increases the external validity of the conclusions, we cannot combine the data for analysis if there is too much variability. There are numerous methods to assess homogeneity, which we will not go into now, but we do insist on the need for the authors of the review to have studied it adequately.

In summary, the fundamental aspects we will have to analyze to assess the validity of an SR are: 1) that the aims of the review are well defined in terms of population, intervention and measurement of the result; 2) that the bibliographic search has been exhaustive; 3) that the criteria for inclusion and exclusion of primary studies in the review are adequate; and 4) that the internal or methodological validity of the included studies has also been verified. In addition, if the SR includes a meta-analysis, we will review the methodological aspects we saw in a previous post: the suitability of combining the studies to make a quantitative synthesis, the proper assessment of the heterogeneity of the primary studies and the use of a suitable mathematical model to combine their results (you know, fixed effect versus random effects models).
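To see what "combining the results" means in practice, here is a minimal sketch of fixed-effect, inverse-variance pooling of odds ratios; the three studies (odds ratio and standard error of the log OR) are invented for illustration:

```python
import math

# Minimal fixed-effect (inverse-variance) pooling of log odds ratios.
# Hypothetical studies: (odds ratio, standard error of the log OR).
studies = [(0.70, 0.20), (0.85, 0.15), (0.60, 0.30)]

log_ors = [math.log(or_) for or_, _ in studies]
weights = [1 / se ** 2 for _, se in studies]  # weight = 1 / variance

pooled_log_or = sum(w * y for w, y in zip(weights, log_ors)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

# 95% confidence interval back on the odds ratio scale
low = math.exp(pooled_log_or - 1.96 * pooled_se)
high = math.exp(pooled_log_or + 1.96 * pooled_se)
print(f"Pooled OR {math.exp(pooled_log_or):.2f} (95% CI {low:.2f} to {high:.2f})")
```

Note how the most precise study (the smallest standard error) gets the largest weight, which is exactly the behavior we will see later in the forest plot.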

Regarding the RELEVANCE of the results, we must consider what the overall result of the review is and whether it has been interpreted judiciously. The SR should provide a global estimate of the effect of the intervention based on a weighted average of the results of the included studies. Most often, relative measures such as the risk ratio or odds ratio are given, although ideally they should be complemented with absolute measures such as the absolute risk reduction or the number needed to treat (NNT). In addition, we must assess the precision of the results, for which we will use our beloved confidence intervals, which give us an idea of the precision of the estimate of the true magnitude of the effect in the population. As you can see, the way of assessing the relevance of the results is practically the same as for the primary studies. Here we give examples from clinical trials, the type of study we will see most frequently, but remember that other types of studies may express the relevance of their results better with other parameters. In any case, confidence intervals will always help us assess the precision of the results.
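Moving from a relative measure to the absolute measures mentioned above is simple arithmetic. A sketch with invented event rates (20% of events in controls, 12% in the treated group):

```python
# From relative to absolute effect measures, with hypothetical event rates.
risk_control = 0.20
risk_treated = 0.12

risk_ratio = risk_treated / risk_control  # relative measure
arr = risk_control - risk_treated         # absolute risk reduction
nnt = 1 / arr                             # number needed to treat

print(f"RR {risk_ratio:.2f}, ARR {arr:.2f}, NNT {nnt:.1f}")
```

With these made-up rates, the NNT of 12.5 rounds up to 13: we would need to treat about 13 patients to prevent one event.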

The results of meta-analyses are usually represented in a standardized way, generally using the so-called forest plot. A graph is drawn with a vertical line of null effect (at one for relative risks and odds ratios, and at zero for differences in means) and each study is represented as a mark (its result) in the middle of a segment (its confidence interval). Studies whose results reach statistical significance are those whose segments do not cross the vertical line. Generally, the most powerful studies have narrower intervals and contribute more to the overall result, which is expressed as a diamond whose lateral tips represent its confidence interval. Only diamonds that do not cross the vertical line have statistical significance. Also, the narrower the interval, the more precise the result. And, finally, the further from the null-effect line, the clearer the difference between the treatments or the exposures compared.

If you want a more detailed explanation about the elements that make up a forest plot, you can go to the previous post where we explained it or to the online manuals of the Cochrane’s Collaboration.

We conclude the critical appraisal of the SR by assessing the APPLICABILITY of the results to our environment. We have to ask ourselves whether we can apply the results to our patients and how they will influence the care we give them. We have to see whether the primary studies of the review describe the participants and whether they resemble our patients. In addition, although we have already said that it is preferable for the SR to be oriented towards a specific question, we must see whether all the results relevant to decision making in the problem under study have been considered, since sometimes it is worth considering some additional secondary variable. And, as always, we must weigh the benefit-risk-cost balance. The fact that the conclusion of the SR seems valid does not mean that we are obliged to apply it.

If you want to evaluate an SR correctly without forgetting any important aspect, I recommend using a checklist such as PRISMA or some of the tools available on the Internet, such as the grids that can be downloaded from the CASP page, which are the ones we have used for everything we have said so far.

The PRISMA statement (Preferred Reporting Items for Systematic reviews and Meta-Analyses) consists of 27 items, classified into 7 sections: title, abstract, introduction, methods, results, discussion and funding:

  1. Title: it must be identified as an SR, a meta-analysis or both. If it also specifies that it deals with clinical trials, it will take precedence over other types of reviews.
  2. Abstract: it should be a structured abstract including background, objectives, data sources, inclusion criteria, limitations, conclusions and implications. The registration number of the review must also be included.
  3. Introduction: includes two items, the justification of the study (what is known, controversies, etc.) and the objectives (what question it tries to answer in the PICO terms of the structured clinical question).
  4. Methods. It is the section with the largest number of items (12):

– Protocol and registration: indicate the registration number and its availability.

– Eligibility criteria: justification of the characteristics of the studies and the search criteria used.

– Sources of information: describe the sources used and the last search date.

– Search: complete electronic search strategy, so that it can be reproduced.

– Selection of studies: specify the selection process and the inclusion and exclusion criteria.

– Data extraction process: describe the methods used to extract the data from the primary studies.

– Data list: define the variables used.

– Risk of bias in primary studies: describe the method used and how it has been used in the synthesis of results.

– Summary measures: specify the main summary measures used.

– Results synthesis: describe the methods used to combine the results.

– Risk of bias between studies: describe biases that may affect cumulative evidence, such as publication bias.

– Additional analyses: if additional analyses are done (sensitivity analyses, meta-regression, etc.), specify which were pre-specified.

  5. Results. Includes 7 items:

– Selection of studies: it is expressed through a flow chart that assesses the number of records in each stage (identification, screening, eligibility and inclusion).

– Characteristics of the studies: present the characteristics of the studies from which data were extracted and their bibliographic references.

– Risk of bias in the studies: communicate the risks in each study and any evaluation that is made about the bias in the results.

– Results of the individual studies: present the data for each study or intervention group and the estimate of the effect with its confidence interval. Ideally, accompany this with a forest plot.

– Synthesis of the results: present the results of all the meta-analyses performed, with their confidence intervals and consistency measures.

– Risk of bias across studies: present any assessment made of the risk of bias across the studies.

– Additional analyses: if they have been carried out, provide their results.

  6. Discussion. Includes 3 items:

– Summary of the evidence: summarize the main findings with the strength of the evidence of each main result and the relevance from the clinical point of view or of the main interest groups (care providers, users, health decision-makers, etc.).

– Limitations: discuss the limitations of the results, the studies and the review.

– Conclusions: general interpretation of the results in context with other evidences and their implications for future research.

  7. Funding: describe the sources of funding and the role they played in carrying out the SR.

As a third option besides these two tools, you can also use the aforementioned Cochrane Handbook for Systematic Reviews of Interventions, available on its website, whose purpose is to help authors of Cochrane reviews work explicitly and systematically.

As you can see, we have said practically nothing about meta-analysis, with all its statistical techniques for assessing homogeneity and its fixed and random effects models. The thing is, meta-analysis is a beast that must be eaten separately, so we have already devoted two posts just to it, which you can check whenever you want. But that is another story…

Doc, is this serious?


I wonder how many times I have heard this question or one of its many variants. We are always thinking about clinical trials and about clinical questions of diagnosis and treatment, but think: has a patient ever asked you whether the treatment you were proposing was endorsed by a randomized controlled trial that meets the criteria of the CONSORT statement and has a good score on the Jadad scale? I can say, at least, that it has never happened to me. But they do ask me daily what will happen to them in the future.

And here lies the relevance of prognostic studies. Note that we cannot always cure and that, unfortunately, many times all we can do is accompany and relieve, softening, if possible, the announcement of serious sequelae or death. But for that it is essential to have good quality information about the future course of our patient's disease. This information also serves to calibrate the therapeutic effort in each situation depending on the risks and benefits. Besides, prognostic studies are used to compare results between different departments or hospitals. Nobody can claim that one hospital is worse than another because its mortality is higher without first checking that the prognosis of its patients is similar.

Before getting into the critical appraisal of prognostic studies, let's clarify the difference between a risk factor and a prognostic factor. A risk factor is a characteristic of the environment or the subject that favors the development of the disease, while a prognostic factor is one that, once the disease has occurred, influences its evolution. They are different things, although sometimes they can coincide. What the two do share is the same type of study design. The ideal would be to use clinical trials, but most of the time it is not possible or ethical to randomize prognostic or risk factors. Suppose we want to demonstrate the deleterious effect of booze on the liver. The design with the highest degree of evidence would be to make two random groups of participants, give 10 whiskeys a day to the participants of one arm and some water to the participants of the other, and compare liver damage after a year, for example. However, it is evident to anyone that we cannot do a clinical trial like this. Not because we could not find subjects for the intervention arm, but because ethics and common sense prevent us from doing it.

For this reason, it is usual to use cohort studies: we would study what differences in liver damage there may be between individuals who, by their own choice, drink or do not drink alcohol. In cases that require very long follow-ups, or in which the effect we want to measure is very rare, case-control studies can be used, but they will always be less powerful because they have a higher risk of bias. Following our ethyl example, we would study people with and without liver damage and see whether one of the two groups had been more exposed to alcohol.

A prognostic study should inform us of three aspects: what outcome we are evaluating, how likely it is to happen, and over what time frame we expect it to happen. And to appraise it, as always, we will rely on our three pillars: validity, relevance and applicability.

To assess the VALIDITY, we will first consider whether the article meets a set of primary or elimination criteria. If the answer is no, we had better throw the paper away and go read the latest nonsense our Facebook friends have posted on our wall.

Is the study sample well defined and is it representative of patients at a similar stage of the disease? The sample, usually called the inception cohort, should be formed by a group of patients at the same stage of the disease, ideally at its beginning, and should be followed up prospectively. The type of patients included, the criteria used to diagnose them and the method of selection should be clearly specified. We must also verify that the follow-up has been long and complete enough to observe the event under study. Each participant has to be followed from the start to the end of the study, whether he is cured, presents the event or the study ends. It is very important to take losses during the study into account, as they are very common in designs with long follow-up. The study should report the characteristics of the patients lost and the reasons for the losses. If they are similar to those of the patients not lost to follow-up, the results may still be valid. If more than 20% of patients are lost to follow-up, a sensitivity analysis is usually performed using the worst possible scenario, which assumes that all losses have had a poor prognosis; the results are then recalculated to check whether they change, in which case the study's results could be invalidated.
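To make that worst-case recalculation concrete, here is a minimal sketch in Python; the function name and the cohort figures are invented for illustration:

```python
def worst_case_event_rate(events, followed, lost):
    """Recalculate the event rate assuming every patient lost to
    follow-up had a poor prognosis (i.e. suffered the event)."""
    observed = events / followed
    worst_case = (events + lost) / (followed + lost)
    return observed, worst_case

# Hypothetical cohort: 200 enrolled, 40 lost (20%), and 48 events
# among the 160 patients actually followed up.
observed, worst = worst_case_event_rate(events=48, followed=160, lost=40)
print(observed)  # 0.3
print(worst)     # 0.44
```

If the study's conclusions hold at 30% but flip at 44%, the losses alone could explain the findings and the results should be viewed with suspicion.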

Once these two aspects have been assessed, we turn to the secondary criteria of internal validity or scientific rigor.

Were outcomes measured objectively and without bias? What is being measured, and how, must be clearly specified before starting the study. In addition, to avoid information bias, the ideal is that outcomes are measured by a researcher blinded to whether the subject in question is exposed to any of the prognostic factors.

Were the results adjusted for all relevant prognostic factors? We must take into account all the confounding variables and prognostic factors that may influence the results. Factors already known from previous studies can be considered directly. Otherwise, the authors will have to estimate these effects using stratified data analysis (the simplest method) or multivariate analysis (more powerful and complex), usually with a proportional hazards or Cox regression model. Although we are not going to talk about regression models now, there are two simple aspects we can take into account. First, these models need a certain number of events per variable included in the model, so distrust those in which many variables are analyzed, especially with small samples. Second, the variables included are decided by the authors and differ from one study to another, so we will have to assess whether any variable that may be relevant to the final result has been left out.
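As a rough check of that first point, a commonly cited (and much debated) rule of thumb is about 10 events per variable (EPV). This hypothetical helper simply applies that convention:

```python
def max_predictors(n_events, events_per_variable=10):
    """Rule-of-thumb ceiling on the number of candidate variables a
    regression model can support, given the number of observed events.
    The EPV = 10 convention is a heuristic, not a hard law."""
    return n_events // events_per_variable

# A cohort with 63 events should not be modeling a dozen predictors.
print(max_predictors(63))  # 6
```

So when you see a Cox model with 15 covariates fitted on 40 events, be wary of its estimates.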

Were the results validated in other groups of patients? When we set up groups of variables and make multiple comparisons, we run the risk that chance plays a trick on us and shows us associations that do not exist. This is why, when a risk factor is described in one group (the training or derivation group), the results should be replicated in an independent group (the validation group) to be really sure about the effect.

Now we must consider what the results are to determine their RELEVANCE. For this, we’ll check if the probability of the outcome of the study is estimated and provided by the authors, as well as the accuracy of this estimate and the risk associated with the factors influencing the prognosis.

Is the probability of the event specified over a given period of time? There are several ways to present the number of events occurring during the follow-up period. The simplest would be to provide an incidence rate (events per person per unit of time) or the cumulative frequency at a given time. Another indicator is the median survival, which is simply the moment of follow-up at which the event has occurred in half of the cohort participants (remember that, although we speak about survival, the event need not necessarily be death).

We can use survival curves of various kinds to determine the probability of the occurrence of the event in each period and the rate at which it occurs. Actuarial or life tables are used for larger samples, when we do not know the exact time of the event and use fixed time periods. However, the most often used are the Kaplan-Meier curves, which better measure the probability of the event at each particular time with smaller samples. This method can provide hazard ratios and the median survival, as well as other parameters according to the regression model used.
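To see what lies under a Kaplan-Meier curve, here is a bare-bones estimator in plain Python, with made-up follow-up times. It is a sketch for illustration, not a replacement for a proper statistics package:

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.
    times: follow-up time of each patient
    events: 1 if the event occurred, 0 if the patient was censored
    Returns a list of (time, survival probability) steps."""
    pairs = sorted(zip(times, events))
    survival = 1.0
    curve = []
    for t in sorted({t for t, e in pairs if e == 1}):
        at_risk = sum(1 for tt, _ in pairs if tt >= t)
        deaths = sum(1 for tt, e in pairs if tt == t and e == 1)
        survival *= (at_risk - deaths) / at_risk
        curve.append((t, survival))
    return curve

def median_survival(curve):
    """First time at which the survival probability drops to 0.5 or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # more than half the cohort is still event-free

# Five patients: events at months 1-4, one censored at month 5.
curve = kaplan_meier([1, 2, 3, 4, 5], [1, 1, 1, 1, 0])
print(median_survival(curve))  # 3
```

At each event time the estimate multiplies the running survival by the fraction of at-risk patients who escape the event, which is exactly the stepwise shape you see in published curves.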

To assess the accuracy of the results we will look, as always, for the confidence intervals. The larger the interval, the less accurate the estimate of the probability of occurrence in the general population, which is what we really want to know. Keep in mind that the number of patients generally decreases as time passes, so survival curves are usually more accurate at the beginning than at the end of the follow-up. Finally, we will assess the factors that modify the prognosis. The right thing to do is to present all the variables that may influence the prognosis with their corresponding relative risks, which will allow us to evaluate the clinical significance of the association.

Finally, we must consider the APPLICABILITY of the results. Do they apply to my patients? We will look for similarities between the study patients and ours and assess whether the differences we find allow us to extrapolate the results to our practice. But besides, are the results useful? The fact that they’re applicable doesn’t necessarily mean that we have to implement them. We have to assess carefully if they’re going to help us to decide what treatment to apply and how to inform our patients and their families.

As always, I recommend that you use a template, such as those provided by CASP, for a systematic critical appraisal that leaves no important matter unassessed.

You can see that articles about prognosis have a lot to say. And we have hardly talked about regression models and survival curves, which are often the statistical core of this type of article. But that's another story…

You have to know what you are looking for


Every day we find articles that present new diagnostic tests that appear to have been designed to solve all our problems. But we should not be tempted to believe everything we read before thinking carefully about what we have, in fact, read. After all, if we believed everything we read we would be swollen from drinking Coca-Cola.

We know that a diagnostic test is not going to say whether or not a person is sick. Its result will only allow us to increase or decrease the probability that the individual is sick or not so we can confirm or rule out the diagnosis, but always with some degree of uncertainty.

Everyone has a certain risk of suffering from any disease, which is nothing more than the prevalence of the disease in the general population. Below a certain level of probability, it seems so unlikely that the patient is sick that we leave him alone and do not perform any diagnostic tests (although some find it hard to restrain the urge to always order something). This is the diagnostic or test threshold.

But if, in addition to belonging to the population, one has the misfortune of having symptoms, that probability will increase until the threshold is exceeded, at which point the probability of having the disease justifies performing diagnostic tests. Once we have the result of the test we have chosen, the probability (now the post-test probability) will have changed. It may have decreased below the test threshold, so we discard the diagnosis and leave the patient alone again. Or it may exceed another threshold, the therapeutic one, above which the probability of the disease is high enough not to need further tests and to start treatment.

The usefulness of the diagnostic test lies in its ability to reduce the probability below the testing threshold (thus ruling out the diagnosis) or, on the contrary, to increase it up to the threshold at which starting treatment is justified. Of course, sometimes the test leaves us halfway and we have to perform additional tests before confirming the diagnosis with enough certainty to start treatment.

Studies of diagnostic tests should provide information about a test's ability to produce the same results when performed under similar conditions (reliability) and about the accuracy with which its measurements reflect what they are intended to measure (validity). But they also give us data about its discriminatory power (sensitivity and specificity), its clinical performance (positive and negative predictive values), its ability to modify the probability of illness and move us between the two thresholds (likelihood ratios), and other aspects that allow us to assess whether it is worth using the test on our patients. And to check whether a study gives us the right information, we need to make a critical appraisal and read the paper based on our three pillars: validity, relevance and applicability.
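All of these indicators come out of the same 2x2 table. Here is a minimal sketch; the counts are invented for illustration:

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Classic indicators from a 2x2 table:
    tp/fp = true/false positives, fn/tn = false/true negatives."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)            # positive predictive value
    npv = tn / (tn + fn)            # negative predictive value
    lr_plus = sensitivity / (1 - specificity)   # positive likelihood ratio
    lr_minus = (1 - sensitivity) / specificity  # negative likelihood ratio
    return sensitivity, specificity, ppv, npv, lr_plus, lr_minus

# Hypothetical study: 100 diseased (90 detected), 100 healthy (80 negative).
se, sp, ppv, npv, lrp, lrn = diagnostic_metrics(tp=90, fp=20, fn=10, tn=80)
print(se, sp)                          # 0.9 0.8
print(round(lrp, 2), round(lrn, 3))    # 4.5 0.125
```

Note that sensitivity, specificity and the likelihood ratios depend only on the test, while the predictive values also depend on the prevalence in the sample.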

Let's start with VALIDITY. First, we will ask ourselves some basic elimination questions about the primary criteria of the study. If the answer to these questions is no, probably the best you can do is use the article to wrap your mid-morning snack.

Was the diagnostic test compared blindly and independently with an appropriate gold standard or reference test? We must check that the results of the reference test were not interpreted differently depending on the results of the study test, which would be an incorporation bias that could invalidate the results. Another problem that can arise is that the reference test frequently yields inconclusive results. If we made the mistake of excluding those doubtful cases, we would commit an indeterminate exclusion bias that, in addition to overestimating the sensitivity and specificity of the test, would compromise the external validity of the study, whose conclusions would then only apply to patients with conclusive results.

Do the patients cover a spectrum similar to the one we will find in our practice? The inclusion criteria of the study should be clear, and the study must include both healthy and diseased subjects with varying severity or stages of disease. As we know, prevalence influences the clinical performance of the test, so if it is validated, for example, in a tertiary center (where the probability of being sick is greater), its diagnostic capabilities will be overestimated when we use the test in a primary care center or in the general population (where the proportion of diseased patients will be lower).

At this point, if we think it is worth reading further, we will focus on the secondary criteria, which are those that add value to the study design. Another question to ask is: did the results of the study test have any influence on the decision to perform the reference test? We have to check that there has been no sequence or diagnostic verification bias, by which patients with a negative study test are excluded from verification. Although this is common in current practice (we start with simple tests and perform the more invasive ones only in positive patients), doing so in a diagnostic test study affects the validity of the results. Both tests should be done independently and blindly, so that the subjectivity of the observer does not influence the results (review bias). Finally, is the method described in enough detail to allow its reproduction? It should be clear what is considered normal and abnormal, what criteria we have used to define normality and how we have interpreted the results of the test.

Having analyzed the internal validity of the study, we will appraise the RELEVANCE of the presented data. The purpose of a diagnostic study is to determine a test's ability to correctly classify individuals according to the presence or absence of disease. Actually, and to be more precise, we want to know how the likelihood of being ill changes after knowing the test's result (the post-test probability). It is therefore essential that the study gives information about the direction and magnitude of this change (pretest to posttest), which we know depends on the characteristics of the test and, to a large extent, on the prevalence or pretest probability.

Does the study present likelihood ratios, or is it possible to calculate them from the data? This information is critical because, without it, we could not estimate the clinical impact of the study test. We have to be especially careful with tests with quantitative results for which the researchers have established a cutoff of normality. When using ROC curves, it is usual to move the cutoff to favor either the sensitivity or the specificity of the test, but we must always appraise how this decision affects the external validity of the study, since it may limit its applicability to a particular group of patients.

How reliable are the results? We will have to determine whether the results are reproducible and how they may be affected by variations between different observers or by retesting in succession. But we have to assess not only the reliability, but also how precise the results are. The study was done on a sample of patients, but it should provide an estimate of the values in the population, so the results should be expressed with their corresponding confidence intervals.

The third pillar of critical appraisal is APPLICABILITY or external validity, which will help us determine whether the results are useful for our patients. In this regard, we ask three questions. Is the test available and is it possible to perform it on our patients? If the test is not available, all we will have achieved with the study is to increase our vast knowledge. But if we can apply the test, we must ask whether our patients fulfill the inclusion and exclusion criteria of the study and, if not, consider how these differences may affect the applicability of the test.

The second question is whether we know the pretest probability of our patients. If our prevalence is very different from that of the study, the actual usefulness of the test may be modified. One solution may be to do a sensitivity analysis, evaluating how the study's results would change across a range of clinically reasonable pretest and posttest probabilities.
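A quick way to run that sensitivity analysis is to convert probabilities to odds, apply the likelihood ratio and convert back. In this sketch, the LR value and the range of pretest probabilities are invented for illustration:

```python
def posttest_probability(pretest, likelihood_ratio):
    """Bayes' theorem in odds form: pretest odds x LR = posttest odds."""
    pretest_odds = pretest / (1 - pretest)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1 + posttest_odds)

# How does a test with a positive LR of 4.5 behave across
# plausible prevalences in different settings?
for pretest in (0.05, 0.20, 0.50):
    post = posttest_probability(pretest, 4.5)
    print(f"pretest {pretest:.0%} -> posttest {post:.0%}")
```

The same positive result moves a 5% pretest probability to roughly 19%, but a 50% pretest probability to over 80%: whether either crosses your therapeutic threshold depends entirely on where you started.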

Finally, we should ask ourselves the most important question: can the posttest probability change our therapeutic attitude and thus be helpful to the patient? For example, if the pretest probability is very low, the posttest probability will probably also be very low and will not reach the therapeutic threshold, so it would not be worth spending money and effort on the test. Conversely, if the pretest probability is very high, it may be worth starting treatment without further tests, unless the treatment is very expensive or dangerous. As always, virtue lies in the middle ground, and it is in these intermediate zones where the studied diagnostic test can provide the greatest benefit. In any case, we must never forget who our boss is (I mean the patient, not our boss at the office): we must not be content with studying only effectiveness or cost-effectiveness, but must also consider the risks, discomfort and preferences of the patient, and the consequences that performing the diagnostic test can have.

If you allow me a piece of advice, when critically appraising an article about diagnostic tests I recommend that you use the CASP templates, which can be downloaded from their website. They will help you carry out the critical appraisal in a systematic and easy way.

One clarification before dashing off: we must not confuse studies of diagnostic tests with diagnostic prediction rules. Although the assessment is similar, prediction rules have specific characteristics and methodological requirements that must be assessed in an appropriate way and that we will see in another post.

Finally, just to say that everything we have said so far applies to papers specifically about diagnostic tests. However, the assessment of diagnostic tests may also be part of observational studies, such as cohort or case-control studies, which can have some peculiarities in the implementation sequence and validation criteria of the study and reference tests. But that's another story…

The King under review


We all know that the randomized clinical trial is the king of interventional methodological designs. It is the type of epidemiological study that allows a better control of systematic errors or biases, since the researcher controls the variables of the study and the participants are randomly assigned among the interventions that are compared.

In this way, if two homogeneous groups that differ only in the intervention present some difference of interest during the follow-up, we can affirm with some confidence that this difference is due to the intervention, the only thing that the two groups do not have in common. For this reason, the clinical trial is the preferred design to answer clinical questions about intervention or treatment, although we will always have to be prudent with the evidence generated by a single clinical trial, no matter how well performed. When we perform a systematic review of randomized clinical trials on the same intervention and combine them in a meta-analysis, the answers we get will be more reliable than those obtained from a single study. That’s why some people say that the ideal design for answering treatment questions is not the clinical trial, but the meta-analysis of clinical trials.

In any case, since systematic reviews assess their primary studies individually, and since it is more usual to come across individual trials than systematic reviews, it is advisable to know how to make a good critical appraisal of them in order to draw conclusions. Indeed, we cannot relax when we see that an article corresponds to a clinical trial and take its content for granted. A clinical trial can also contain its traps and tricks, so, as with any other type of design, it is good practice to make a critical reading of it, based on our usual three pillars: validity, importance and applicability.

As always, when studying scientific rigor or VALIDITY (internal validity), we will first look at a series of essential primary criteria. If these are not met, it is better not to waste time with the trial and try to find another more profitable one.

Is there a clearly defined clinical question? From its origin, the trial must be designed to answer a structured clinical question about treatment, motivated by one of our multiple knowledge gaps. A working hypothesis should be proposed with its corresponding null and alternative hypotheses, preferably on a topic that is relevant from the clinical point of view. It is preferable that the study tries to answer only one question: with several questions, the trial may become excessively complicated and end up answering none of them completely and properly.

Was the assignment done randomly? As we have already said, to be able to affirm that the differences between the groups are due to the intervention, the groups must be homogeneous. This is achieved by assigning patients randomly, the only way to control the known confounding variables and, more importantly, also those we do not know about. If the groups were different and we attributed the difference only to the intervention, we could incur a confounding bias. The trial should contain the usual and essential table 1, with the frequency of the demographic and confounding variables in both samples, so that we can be sure that the groups are homogeneous. A frequent error is to look for differences between the two groups and evaluate them according to their p-values, when we know that p does not measure homogeneity. If we have distributed the patients at random, any difference we observe will necessarily be random (we do not need a p to know that). The sample size is not designed to discriminate between demographic variables, so a non-significant p may simply indicate that the sample is too small to reach statistical significance. On the other hand, any minimal difference can reach statistical significance if the sample is large enough. So forget about the p: if there is any difference, what we have to do is assess whether it has enough clinical relevance to have influenced the results or, more elegantly, control for the unbalanced covariates in the analysis. Fortunately, it is increasingly rare to find tables of study groups with p-value comparisons between the intervention and control groups.

But it is not enough for the study to be randomized; we must also consider whether the randomization sequence was generated correctly. The method used must ensure that all members of the selected population have the same probability of being chosen, so random number tables or computer-generated sequences are preferred. The randomization must also be concealed, so that it is not possible to know which group the next participant will be assigned to. That is why centralized systems, by telephone or through the Internet, are so popular. And here is something very curious: it is well known that randomization tends to produce groups of different sizes, especially when samples are small, which is why block randomization balanced by size is sometimes used. So I ask you: how many studies have you read with exactly the same number of participants in the two arms that claimed to be randomized? Do not trust equal-sized groups, especially if they are small, and do not be fooled: you can always use one of the many binomial probability calculators available on the Internet to find out the probability that chance alone would generate the groups the authors present (we are always talking about simple randomization, not randomization by blocks, clusters, minimization or other techniques). You will be surprised by what you find.
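You do not even need an online calculator: with simple 1:1 randomization, the chance of ending up with two arms of exactly the same size follows a binomial distribution. A small sketch:

```python
from math import comb

def prob_equal_groups(n):
    """Probability that simple randomization of n patients (n even)
    yields two arms of exactly n/2 each: C(n, n/2) / 2^n."""
    return comb(n, n // 2) / 2 ** n

print(round(prob_equal_groups(20), 3))   # 0.176
print(round(prob_equal_groups(100), 3))  # 0.08
```

With 100 patients, fewer than 1 in 12 simply randomized trials should show perfectly balanced arms, so a string of such trials from one group of authors deserves a raised eyebrow.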

It is also important that the follow-up has been long and complete enough, so that the study lasts long enough to be able to observe the outcome variable and that every participant who enters the study is taken into account at the end. As a general rule, if the losses exceed 20%, it is admitted that the internal validity of the study may be compromised.

We will always have to analyze the nature of the losses during follow-up, especially if they are numerous. We must try to determine whether the losses are random or related to any specific variable (which would be a bad sign) and estimate what effect they may have on the results of the trial. The most common approach is to adopt the so-called worst-case scenario: it is assumed that all the losses in the control group have done well and all those in the intervention group have done badly, and the analysis is repeated to check whether the conclusions change, in which case the validity of the study would be seriously compromised. The last important aspect is to consider whether patients who did not receive the assigned treatment (there is always someone who does not comply and messes things up) have been analyzed according to the intention-to-treat principle, since this is the only way to preserve all the benefits obtained with randomization. Everything that happens after randomization (such as a change of assignment group) can influence the probability that the subject experiences the effect we are studying, so it is important to respect this intention-to-treat analysis and analyze each patient in the group to which he was initially assigned.

Once these primary criteria have been verified, we will look at three secondary criteria that influence internal validity. It will be necessary to verify that the groups were similar at the beginning of the study (we have already talked about the table with the data of the two groups), that masking was carried out appropriately as a form of bias control, and that the two groups were managed and followed in a similar way except, of course, for the intervention under study. We know that masking or blinding allows us to minimize the risk of information bias, which is why researchers and participants are usually unaware of which group each patient is assigned to, a design known as double blind. Sometimes, given the nature of the intervention (think of a group that undergoes surgery and another that does not), it will be impossible to mask researchers and participants, but we can always give masked data to the person who performs the analysis of the results (the so-called blinded evaluator), which mitigates this inconvenience.

To summarize this section on the validity of the trial, we can say that we will have to check that there is a clear definition of the study population, the intervention and the outcome of interest, that randomization has been done properly, that information biases have been controlled through masking, that there has been an adequate follow-up with control of losses, and that the analysis has been correct (intention-to-treat analysis and control of covariates not balanced by randomization).

A very simple tool that can also help us assess the internal validity of a clinical trial is the Jadad scale, also called the Oxford quality scoring system. Jadad, a Colombian doctor, devised a scoring system based on 7 questions. First, 5 questions whose affirmative answer adds 1 point:

  1. Is the study described as randomized?
  2. Is the method used to generate the randomization sequence described and is it adequate?
  3. Is the study described as double blind?
  4. Is the masking method described and is it adequate?
  5. Is there a description of the losses during follow up?

Finally, two questions whose negative answer subtracts 1 point:

  1. Is the method used to generate the randomization sequence adequate?
  2. Is the masking method appropriate?

As you can see, the Jadad scale assesses the key points we have already mentioned: randomization, masking and follow-up. A trial is considered methodologically rigorous if it scores 5 points. If it scores 3 points or less, we had better use it to wrap our sandwich.
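The scoring itself is trivial to express in code. Here is a sketch of the scale's arithmetic (the argument names are mine, not part of the scale):

```python
def jadad_score(randomized, randomization_described_adequate,
                double_blind, masking_described_adequate,
                losses_described,
                randomization_inadequate=False,
                masking_inadequate=False):
    """Jadad (Oxford) quality score: five 'yes' answers add 1 point each,
    two 'no' answers (inadequate methods) subtract 1 point each."""
    score = sum([randomized, randomization_described_adequate,
                 double_blind, masking_described_adequate,
                 losses_described])
    score -= randomization_inadequate + masking_inadequate
    return score

# A trial that ticks every box scores the maximum of 5.
print(jadad_score(True, True, True, True, True))  # 5
# "Randomized, double blind", but with inadequate methods and no
# description of losses: the labels alone are not worth much.
print(jadad_score(True, False, True, False, False,
                  randomization_inadequate=True,
                  masking_inadequate=True))  # 0
```

The second example illustrates the scale's point: merely *saying* "randomized" and "double blind" earns 2 points, but inadequate methods take both away again.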

We will now consider the results of the study to gauge their clinical RELEVANCE. It will be necessary to look at the variables measured to see whether the trial adequately expresses the magnitude and precision of the results. It is important, once again, not to settle for being inundated with multiple p-values full of zeros. Remember that p only indicates the probability that we are accepting as real differences that exist only by chance (or, to put it simply, of making a type 1 error), and that statistical significance is not necessarily synonymous with clinical relevance.

In the case of continuous variables such as survival time, weight, blood pressure, etc., the magnitude of the results is usually expressed as a difference in means or medians, depending on which measure of central tendency is most appropriate. However, for dichotomous variables (alive or dead, healthy or sick, etc.) the relative risk, its relative and absolute reductions, and the number needed to treat (NNT) will be used. Of all of these, the one that best expresses clinical efficiency is always the NNT. Any trial worthy of our attention must provide this information or, failing that, the information necessary for us to calculate it.
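All of these measures fall out of the event rates in each arm. A minimal sketch with invented numbers:

```python
def effect_measures(events_treated, n_treated, events_control, n_control):
    """Relative risk (RR), absolute and relative risk reductions
    (ARR, RRR) and number needed to treat (NNT)."""
    risk_t = events_treated / n_treated
    risk_c = events_control / n_control
    rr = risk_t / risk_c
    arr = risk_c - risk_t   # absolute risk reduction
    rrr = arr / risk_c      # relative risk reduction
    nnt = 1 / arr           # patients to treat to prevent one event
    return rr, arr, rrr, nnt

# Hypothetical trial: 10/100 events with treatment vs 20/100 with control.
rr, arr, rrr, nnt = effect_measures(10, 100, 20, 100)
print(rr, arr, rrr, nnt)  # 0.5 0.1 0.5 10.0
```

A "50% relative risk reduction" sounds spectacular, but the NNT of 10 is what tells us the real clinical effort: treating ten patients to prevent one event.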

But to obtain a more realistic estimate of the results in the population, we need to know the precision of the study, and nothing is easier than resorting to confidence intervals. These intervals, in addition to precision, also inform us of statistical significance: the result will be statistically significant if the interval of a risk ratio does not include the value one, or if that of a difference in means does not include the value zero. If the authors do not provide them, we can use a calculator to obtain them, such as those available on the CASP website.
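If only the raw counts are given, the confidence interval for a risk ratio can be recovered with the usual log-scale approximation. A sketch, with invented numbers:

```python
from math import exp, log, sqrt

def risk_ratio_ci(events_t, n_t, events_c, n_c, z=1.96):
    """Risk ratio with its approximate 95% CI via the standard log
    method: ln(RR) +/- z * SE, SE = sqrt(1/a - 1/n1 + 1/c - 1/n2)."""
    rr = (events_t / n_t) / (events_c / n_c)
    se = sqrt(1 / events_t - 1 / n_t + 1 / events_c - 1 / n_c)
    lower = exp(log(rr) - z * se)
    upper = exp(log(rr) + z * se)
    return rr, lower, upper

# Hypothetical trial: 10/100 events with treatment vs 20/100 with control.
rr, lo, hi = risk_ratio_ci(10, 100, 20, 100)
print(f"RR {rr:.2f}, 95% CI {lo:.2f} to {hi:.2f}")  # RR 0.50, 95% CI 0.25 to 1.01
```

Note the teaching point: despite the risk being halved, the interval crosses 1, so with these sample sizes the result is not statistically significant.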

A good way to sort the study of the clinical importance of a trial is to structure it in these four aspects: Quantitative assessment (measures of effect and its precision), Qualitative assessment (relevance from the clinical point of view), Comparative assessment (see if the results are consistent with those of other previous studies) and Cost-benefit assessment (this point would link to the next section of the critical appraisal that has to do with the applicability of the results of the trial).

To finish the critical reading of a treatment article we will value its APPLICABILITY (also called external validity), for which we will have to ask ourselves if the results can be generalized to our patients or, in other words, if there is any difference between our patients and those of the study that prevents the generalization of the results. It must be taken into account in this regard that the stricter the inclusion criteria of a study, the more difficult it will be to generalize its results, thereby compromising its external validity.

But, in addition, we must consider whether all clinically important outcomes have been taken into account, including side effects and undesirable effects. The outcome variable measured must be important for the investigator and for the patient. Do not forget that demonstrating that an intervention is effective does not necessarily mean that it is beneficial for our patients. We must also assess the harmful or annoying effects and weigh the benefit-cost-risk balance, as well as the difficulties that may exist in applying the treatment in our environment, the patient's preferences, etc.

As is easy to understand, a study can have great methodological validity and results of great clinical importance and still not be applicable to our patients, either because our patients differ from those of the study, because the treatment does not fit their preferences or because it is unfeasible in our setting. The opposite, however, does not usually happen: if the validity is poor or the results are unimportant, we will hardly consider applying the conclusions of the study to our patients.

To finish, let me recommend that you use one of the tools available for critical appraisal, such as the CASP templates, or a checklist, such as CONSORT, so as not to leave any of these points without consideration. Mind you, everything we have talked about concerns randomized controlled clinical trials. What happens with nonrandomized trials or other kinds of quasi-experimental studies? Well, for those we follow another set of rules, such as those of the TREND statement. But that is another story…

Achilles and Effects Forest


Achilles. What a man! Definitely one of the main characters among those involved in that mess that ensued in Troy because of Helen, a.k.a. the beauty. You know his story. In order to make him invulnerable, his mother, who was none other than the nymph Thetis, bathed him in ambrosia and submerged him in the River Styx. But she made a mistake that should not be allowed to any nymph: she held him by his right heel, which did not get wet with the river's water. And so his heel became his only vulnerable spot. Hector didn't realize it in time, but Paris, totally on the ball, put an arrow in Achilles' heel and sent him back to the Styx, though not into the water, but rather to the other side. And without Charon the Ferryman.

This story is the origin of the expression “Achilles’ heel”, which usually refers to the weakest or most vulnerable point of someone or something that, otherwise, is usually known for its strength.

For example, something as robust and formidable as meta-analysis has its Achilles' heel: publication bias. And that is because in the world of science there is no social justice.

All scientific works should have the same opportunities to be published and achieve fame, but the reality is not at all like that and they can be discriminated against for four reasons: statistical significance, popularity of the topic they are dealing with, having someone to sponsor them and the language in which they are written.

These are the main factors that can contribute to publication bias. First, studies with significant results are more likely to be published and, among these, the likelihood increases the greater the effect. This means that studies with negative results or effects of small magnitude may go unpublished, so a conclusion drawn only from the large studies with positive results will be biased. In the same way, papers on topics of public interest are more likely to be published regardless of the importance of their results. In addition, the sponsor also has an influence: a company that finances a study of one of its products that has gone wrong will probably not publish it, lest we all learn that the product is useless.

Secondly, as is logical, published studies are more likely to reach our hands than those that are not published in scientific journals. This is the case of doctoral theses, communications to congresses, reports from government agencies or even studies pending publication by researchers in the field we are dealing with. For this reason it is so important to do a search that includes this type of work, which falls under the term grey literature.

Finally, we can list a series of biases that influence the likelihood that a work will be published or retrieved by the researcher performing the systematic review, such as language bias (the search is limited by language), availability bias (only studies that are easy for the researcher to retrieve are included), cost bias (only studies that are free or cheap are included), familiarity bias (only those from the researcher's own discipline are included), duplication bias (studies with significant results are more likely to be published more than once) and citation bias (studies with significant results are more likely to be cited by other authors).

One may think that this loss of studies during the review cannot be so serious, since it could be argued, for example, that studies not published in peer-reviewed journals are usually of poorer quality, so they do not deserve to be included in the meta-analysis. However, it is not clear that scientific journals ensure the methodological quality of a study, or that peer review is the only way to do so. There are researchers, such as those of government agencies, who are not interested in publishing in scientific journals but in preparing reports for those who commission them. Moreover, peer review is no guarantee of quality since, all too often, neither the researcher who carries out the study nor those in charge of reviewing it have the methodological training to ensure the quality of the final product.

All this can be worsened by the fact that these same factors can influence the inclusion and exclusion criteria of the meta-analysis primary studies, in such a way that we obtain a sample of articles that may not be representative of the global knowledge on the subject of the systematic review and meta-analysis.

If we have a publication bias, the applicability of the results will be seriously compromised. That is why we say that the publication bias is the true Achilles’ heel of meta-analysis.

If we correctly delimit the inclusion and exclusion criteria of the studies and do a global and unrestricted search of the literature we will have done everything possible to minimize the risk of bias, but we can never be sure of having avoided it. That is why techniques and tools have been devised for its detection.

The most used has the sympathetic name of funnel plot. It shows the magnitude of the measured effect (X axis) versus a precision measurement (Y axis), which is usually the sample size, but which can also be the inverse of the variance or the standard error. We represent each primary study with a point and observe the point cloud.

In the most usual form, with the sample size on the Y axis, the precision of the results will be higher in the studies with larger samples, so the points will be closer together at the top of the axis and will disperse as they approach the origin of the Y axis. In this way we observe a cloud of points in the shape of a funnel, with the wide part at the bottom. This graph should be symmetrical and, if it is not, we should always suspect a publication bias. In the second example attached you can see how there are "missing" studies on the side of lack of effect: this may mean that only studies with positive results get published.

This method is very simple to use but, sometimes, we can have doubts about the asymmetry of our funnel, especially if the number of studies is small. In addition, the funnel can be asymmetrical due to quality defects in the studies or because we are dealing with interventions whose effect varies according to the sample size of each study. For these cases, other more objective methods have been devised, such as the Begg’s rank correlation test and the Egger’s linear regression test.

Begg's test studies the presence of association between the estimates of the effects and their variances. If there is a correlation between them, that is a bad sign. The problem with this test is that it has little statistical power, so it is not reliable when the number of primary studies is small.

Egger's test, more specific than Begg's, consists of plotting the regression line between the precision of the studies (independent variable) and the standardized effect (dependent variable). This regression must be weighted by the inverse of the variance, so I do not recommend that you do it on your own unless you are a consummate statistician. When there is no publication bias, the regression line passes through the zero of the Y axis. The further its intercept is from zero, the more evidence of publication bias.
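To see the idea behind Egger's test, here is a minimal sketch with made-up data that fits the unweighted version of the regression (real implementations add the inverse-variance weighting and a significance test for the intercept, so use validated software for actual work):

```python
def egger_intercept(effects, ses):
    """Sketch of Egger's test: regress the standardized effect
    (effect / SE) on the precision (1 / SE); an intercept far
    from zero suggests small-study asymmetry."""
    y = [e / s for e, s in zip(effects, ses)]  # standardized effects
    x = [1 / s for s in ses]                   # precisions
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # ordinary least squares slope and intercept
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
            sum((xi - mx) ** 2 for xi in x)
    return my - slope * mx

# hypothetical log risk ratios and their standard errors:
# the small (imprecise) studies show the largest effects
effects = [-0.7, -0.5, -0.4, -0.2, -0.1]
ses     = [ 0.4,  0.3,  0.25, 0.15, 0.1]
print(f"Egger intercept: {egger_intercept(effects, ses):.2f}")
```

With these invented numbers the intercept lands clearly away from zero, which is exactly the asymmetry pattern the funnel plot would show.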

As always, there are computer programs that do these tests quickly without having to burn your brain with the calculations.

What if after doing the work we see that there is publication bias? Can we do something to adjust it? As always, we can.

The simplest way is to use a graphical method called trim and fill. It consists of the following: a) we draw the funnel plot; b) we remove the small studies so that the funnel is symmetrical; c) the new center of the graph is determined; d) we restore the previously removed studies and add their reflection on the other side of the center line; e) we re-estimate the effect.

Another very conservative attitude we can adopt is to assume that there is a publication bias and ask how much it affects our results, assuming that we have left out studies not included in the analysis.

The only way to know if the publication bias affects our estimates would be to compare the effect in the retrieved and unrecovered studies but, of course, then we would not have to worry about the publication bias.

To know if the observed result is robust or, on the contrary, it is susceptible to be biased by a publication bias, two methods of the fail-safe N have been devised.

The first is Rosenthal's fail-safe N method. Suppose we have a meta-analysis with an effect that is statistically significant, for example, a risk ratio greater than one with a p < 0.05 (or a 95% confidence interval that does not include the null value, one). Then we ask ourselves a question: how many studies with RR = 1 (the null value) would we have to include for p to stop being significant? If few studies (fewer than 10) are enough to nullify the effect, we should worry, because the effect may in fact be null and our significance the product of a publication bias. On the contrary, if many studies are needed, the effect is likely to be truly significant. This number of studies is what the N in the method's name stands for.
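For the curious, Rosenthal's formula is N = (ΣZ / z_α)² − k, where the Z values come from the k primary studies and z_α = 1.645 for a one-tailed p < 0.05. A minimal sketch with invented z scores:

```python
import math

def rosenthal_fail_safe_n(z_values, z_alpha=1.645):
    """Rosenthal's fail-safe N: how many unpublished null-result
    studies would be needed for the combined one-tailed p to lose
    significance. z_values are the z scores of the k studies."""
    k = len(z_values)
    n = (sum(z_values) / z_alpha) ** 2 - k
    return max(0, math.ceil(n))

# five hypothetical significant studies
print(rosenthal_fail_safe_n([2.1, 2.5, 1.9, 2.8, 2.2]))  # → 44
```

Forty-four buried null studies would be a lot, so in this invented example the significance looks fairly robust.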

The problem with this method is that it focuses on statistical significance rather than on the relevance of the results. The correct thing would be to look for how many studies are needed for the result to lose clinical relevance, not statistical significance. In addition, it assumes that the effect of the missing studies is null (one in the case of risk ratios and odds ratios, zero in the case of differences in means), when the effect of the missing studies can go in the opposite direction to the effect we detect, or in the same direction but with a smaller magnitude.

To avoid these disadvantages there is a variation of the previous formula that assesses statistical significance and clinical relevance. With this method, called Orwin's fail-safe N, we calculate how many studies are needed to bring the value of the effect down to a specific value, which will generally be the smallest effect that is still clinically relevant. This method also allows us to specify the average effect of the missing studies.
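Orwin's formula is simpler than it sounds: N = k(d_observed − d_criterion) / (d_criterion − d_missing). A sketch with invented figures:

```python
def orwin_fail_safe_n(k, d_observed, d_criterion, d_missing=0.0):
    """Orwin's fail-safe N: number of studies with mean effect
    d_missing needed to dilute the observed mean effect d_observed
    down to d_criterion, the smallest clinically relevant effect."""
    return k * (d_observed - d_criterion) / (d_criterion - d_missing)

# 10 hypothetical studies with a mean standardized difference of 0.6:
# how many null studies would drag it down to a trivial 0.2?
print(orwin_fail_safe_n(10, 0.6, 0.2))
```

Unlike Rosenthal's version, here we can also set d_missing to a non-null value if we suspect the unpublished studies found a small effect rather than none at all.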

To end the meta-analysis explanation, let’s see what is the right way to express the results of data analysis. To do it well, we can follow the recommendations of the PRISMA statement, which devotes seven of its 27 items to give us advice on how to present the results of a meta-analysis.

First, we must report the selection process of the studies: how many we found and evaluated, how many we selected and how many we rejected, also explaining the reasons for doing so. For this, the flowchart that the systematic review behind the meta-analysis should include if it complies with the PRISMA statement is very useful.

Secondly, the characteristics of the primary studies must be specified, detailing what data we extract from each of them, with their corresponding bibliographic citations so that any reader of the review can verify the data if they do not trust us. Related to this is the third section, which refers to the evaluation of the risk of bias of the studies and their internal validity.

Fourth, we must present the results of each individual study with a summary datum for each intervention group analyzed, together with the calculated estimators and their confidence intervals. These data will help us compile the information that PRISMA requests in its fifth point, referring to the presentation of results: the synthesis of all the meta-analysis studies, their confidence intervals, the results of the homogeneity study, etc.

This is usually done graphically by means of an effects diagram, a graphical tool popularly known as forest plot, where the trees would be the primary studies of the meta-analysis and where all the relevant results of the quantitative synthesis are summarized.

The Cochrane Collaboration recommends structuring the forest plot in five well differentiated columns. Column 1 lists the primary studies or the groups or subgroups of patients included in the meta-analysis, usually represented by an identifier composed of the name of the first author and the date of publication.

Column 2 shows the results of the measures of effect of each study as reported by their respective authors.

Column 3 is the actual forest plot, the graphical part of the matter. It shows the measures of effect of each study on both sides of the line of no effect, which we already know is zero for mean differences and one for odds ratios, risk ratios, hazard ratios, etc. Each study is represented by a square whose area is usually proportional to its contribution to the overall result. The square sits on a segment that represents the extremes of its confidence interval.

These confidence intervals inform us about the precision of the studies and tell us which ones are statistically significant: those whose interval does not cross the line of no effect. Anyway, do not forget that, even when the interval crosses the line of no effect and is not statistically significant, its boundaries can tell us a lot about the clinical relevance of the results of each study. Finally, at the bottom of the chart we will find a diamond that represents the overall result of the meta-analysis. Its position with respect to the line of no effect will inform us about the statistical significance of the overall result, while its width will give us an idea of its precision (its confidence interval). Furthermore, at the top of this column we will find the type of effect measure, the analysis model used (fixed or random effects) and the significance level of the confidence intervals (typically 95%).

This chart is usually completed by a fourth column with the estimated weight of each study in percent and a fifth column with the weighted effect estimate of each one. And in some corner of this forest will be the measure of heterogeneity that has been used, together with its statistical significance where relevant.
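As a toy illustration of what column 3 encodes, we can even render it in plain text. The studies below are hypothetical, and a real forest plot would use a log scale for ratio measures and square sizes proportional to weight; this sketch ignores both:

```python
def text_forest_plot(studies, null=1.0, lo_axis=0.2, hi_axis=2.0, width=41):
    """Toy text rendering of the graphic column of a forest plot
    for ratio measures: '-' spans the CI, 'o' marks the point
    estimate, '|' marks the line of no effect."""
    def col(value):
        value = min(max(value, lo_axis), hi_axis)  # clamp to the axis
        return round((value - lo_axis) / (hi_axis - lo_axis) * (width - 1))
    rows = []
    for name, est, lo, hi in studies:
        row = [" "] * width
        for c in range(col(lo), col(hi) + 1):
            row[c] = "-"                # confidence interval
        row[col(null)] = "|"            # line of no effect
        row[col(est)] = "o"             # point estimate
        rows.append(f"{name:11s}{''.join(row)} {est} ({lo}-{hi})")
    return rows

print("\n".join(text_forest_plot([
    ("Smith 2010", 0.8, 0.6, 1.1),     # hypothetical studies
    ("Lopez 2014", 0.6, 0.4, 0.9),
    ("Chen 2018",  0.9, 0.7, 1.2),
])))
```

In the output you can see at a glance which intervals cross the vertical bar (not significant) and which stay entirely to one side of it.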

To conclude the presentation of the results, PRISMA recommends a sixth section with the evaluation that has been made of the risks of bias in the study and a seventh with all the additional analyzes that have been necessary: stratification, sensitivity analysis, metaregression, etc.

As you can see, nothing about meta-analysis is easy. Therefore, the Cochrane Collaboration recommends following a series of steps to correctly interpret the results. Namely:

  1. Verify which variable is compared and how. It is usually seen at the top of the forest plot.
  2. Locate the measure of effect used. This is logical and necessary to know how to interpret the results. A hazard ratio is not the same as a difference in means or whatever other measure was used.
  3. Locate the diamond, its position and its amplitude. It is also convenient to look at the numerical value of the global estimator and its confidence interval.
  4. Check that heterogeneity has been studied. This can be seen by looking at whether the segments that represent the primary studies are or are not very dispersed and whether they overlap or not. In any case, there will always be a statistic that assesses the degree of heterogeneity. If we see that there is heterogeneity, the next thing will be to find out what explanation the authors give about its existence.
  5. Draw our conclusions. We will look at which side of the line of no effect the overall effect and its confidence interval fall on. You already know that, even if the result is significant, the lower limit of the interval should be as far as possible from the line, because of clinical relevance, which does not always coincide with statistical significance. Finally, look again at the study of homogeneity. If there is a lot of heterogeneity, the results will not be as reliable.

And with this we end the topic of meta-analysis. In fact, the forest plot is not exclusive to meta-analyses and can be used whenever we want to compare studies to elucidate their statistical or clinical significance, or in cases such as equivalence studies, in which the line of no effect is joined by the equivalence thresholds. But it still has one more use. A variant of the forest plot also serves to assess whether there is a publication bias in the systematic review although, as we already know, in that case we change its name to funnel plot. But that is another story…

Apples and pears


You all surely know the Chinese tale of the poor solitary rice grain that falls to the ground and nobody can hear it. Of course, if instead of one grain a whole sack of rice falls, that is something else. There are many examples of union making strength. A red ant is harmless, unless it bites you in some soft and noble area, which are usually the most sensitive. But what about a swarm of millions of red ants? That is what really scares you, because if they all unite and come for you, you could do little to stop their push. Yes, union is strength.

And this also happens with statistics. With a relatively small sample of well-chosen voters we can estimate who will win an election in which millions vote. So, what could we not do with a lot of those samples? Surely the estimate would be more reliable and more generalizable.

Well, this is precisely one of the purposes of meta-analysis, which uses various statistical techniques to make a quantitative synthesis of the results of a set of studies that, although they try to answer the same question, do not reach exactly the same result. But beware: we cannot combine studies to draw conclusions about their sum without first taking a series of precautions. That would be like mixing apples and pears which, I'm not sure why, must be something terribly dangerous, because everyone knows it's something to avoid.

Think that we have a set of clinical trials on the same topic and we want to do a meta-analysis to obtain a global result. It is more than convenient that there is as little variability as possible among the studies if we want to combine them. Because, ladies and gentlemen, here also rules the saying: alongside but separate.

Before thinking about combining the results of the studies of a systematic review to perform a meta-analysis, we must always make a previous study of the heterogeneity of the primary studies, which is nothing more than the variability that exists among the estimators that have been obtained in each of those studies.

First, we will investigate possible causes of heterogeneity, such as differences in treatments, variability of the populations of the different studies and differences in the designs of the trials. If there is a great deal of heterogeneity from the clinical point of view, perhaps the best thing to do is not to do meta-analysis and limit the analysis to a qualitative synthesis of the results of the review.

Once we come to the conclusion that the studies are similar enough to try to combine them, we should try to measure this heterogeneity so as to have an objective measure. For this, several privileged brains have created a series of statistics that add to our daily jungle of acronyms and initials.

Until recently, the most famous of those initials was Cochran's Q, which has nothing to do with either James Bond or our friend Archie Cochrane. Its calculation takes into account the sum of the deviations between each of the results of the primary studies and the global outcome (squared differences, to avoid positives cancelling out negatives), weighting each study according to its contribution to the overall result. It looks awesome but, in reality, it is no big deal. Ultimately, it is no more than an aristocratic relative of the chi-square test. Indeed, Q follows a chi-square distribution with k-1 degrees of freedom (k being the number of primary studies). We calculate its value, look at the frequency distribution and estimate the probability that the differences are not due to chance, in order to reject our null hypothesis (which assumes that the observed differences among studies are due to chance). But, despite appearances, Q has a number of weaknesses.

First, it is a very conservative parameter and we must always keep in mind that lack of statistical significance is not always synonymous with absence of heterogeneity: as a matter of fact, we simply cannot reject the null hypothesis, so when we accept it we run the risk of committing a type II error and blundering. For this reason, some people propose using a significance level of p < 0.1 instead of the standard p < 0.05. Another of Q's pitfalls is that it does not quantify the degree of heterogeneity and, of course, does not explain the reasons that produce it. And, to top it off, Q loses power when the number of studies is small and does not allow comparisons among meta-analyses with different numbers of studies.

This is why another statistic has been devised that is much more celebrated today: I². This parameter provides an estimate of the total variation among studies with respect to the total variability or, put another way, the proportion of variability actually due to heterogeneity, that is, to real differences among the estimates rather than to chance. It also looks impressive, but it is actually an advantaged relative of the intraclass correlation coefficient.

Its value ranges from 0 to 100%, and we usually consider the limits of 25%, 50% and 75% as signs of low, moderate and high heterogeneity, respectively. I2 is not affected either by the effects units of measurement or the number of studies, so it allows comparisons between meta-analysis with different units of effect measurement or different number of studies.

If you read a study that provides Q and you want to calculate I2, or vice versa, you can use the following formula, being k the number of primary studies:

I^{2}= \frac{Q-k+1}{Q}

There's a third parameter that is less known, but no less worthy of mention: H². It measures the excess of the value of Q with respect to the value we would expect to obtain if there were no heterogeneity. Thus, a value of 1 means no heterogeneity, and its value increases as the heterogeneity among studies does. But its real interest is that it allows us to calculate confidence intervals for I².
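Putting the three statistics together, here is a minimal sketch with invented effect estimates (say, log risk ratios) and their variances, using the usual inverse-variance weights:

```python
def heterogeneity(effects, variances):
    """Cochran's Q, I2 (as a percentage) and H2 from per-study
    effect estimates and variances, with inverse-variance weights."""
    w = [1 / v for v in variances]
    # inverse-variance pooled estimate
    pooled = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
    # Q: weighted sum of squared deviations from the pooled effect
    q = sum(wi * (e - pooled) ** 2 for wi, e in zip(w, effects))
    k = len(effects)
    h2 = q / (k - 1)                        # excess of Q over its expectation
    i2 = max(0.0, (q - (k - 1)) / q) * 100  # truncated at 0
    return q, i2, h2

# hypothetical log risk ratios with their variances
q, i2, h2 = heterogeneity([-0.5, -0.3, 0.1, -0.6], [0.04, 0.05, 0.06, 0.03])
print(f"Q = {q:.2f}, I2 = {i2:.0f}%, H2 = {h2:.2f}")
```

With these invented numbers I² comes out close to 50%, which by the usual thresholds would be read as moderate heterogeneity.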

Other times, the authors perform a hypothesis test with a null hypothesis of non-heterogeneity and use a chi-square statistic or some similar one. In these cases, what they provide is a value of statistical significance. If the p is < 0.05 the null hypothesis can be rejected and we can say that there is heterogeneity. Otherwise we will say that we cannot reject the null hypothesis of non-heterogeneity.

In summary, whenever we see an indicator of heterogeneity expressed as a percentage, it will indicate the proportion of variability that is not due to chance. For its part, when we are given a p value, there will be significant heterogeneity when the p is less than 0.05.

Do not worry about the calculations of Q, I² and H². There are specific programs for that, such as RevMan, as well as modules within the usual statistical packages that perform the same function.

A point of attention: always remember that not being able to demonstrate heterogeneity does not always mean that the studies are homogeneous. The problem is that the null hypothesis assumes that they are homogeneous and the differences are due to chance. If we can reject it we can assure that there is heterogeneity (always with a small degree of uncertainty). But this does not work the other way around: if we cannot reject it, it simply means that we cannot reject that there is no heterogeneity, but there will always be a probability of committing a type II error if we directly assume that the studies are homogeneous.

For this reason, a series of graphical methods have been devised to inspect the studies and verify that there is no data of heterogeneity even if the numerical parameters say otherwise.

The most employed of them is, perhaps, Galbraith's graph, which can be used for meta-analyses of both trials and observational studies. This graph represents the precision of each study versus its standardized effect. It also shows the adjusted regression line and sets two confidence bands. The position of each study with respect to the precision axis indicates its weighted contribution to the overall result, while its location outside the confidence bands indicates its contribution to heterogeneity.

Galbraith’s graph can also be useful for detecting sources of heterogeneity, since studies can be labeled according to different variables and see how they contribute to the overall heterogeneity.

Another tool available for meta-analyses of clinical trials is L'Abbé's plot. It represents the response rates to treatment versus the response rates in the control group, plotting the studies on both sides of the diagonal. Above that line are the studies with an outcome favorable to treatment, while below it are the studies with an outcome favorable to the control intervention. The studies are usually plotted with an area proportional to their precision, and their dispersion indicates heterogeneity. Sometimes L'Abbé's plot provides additional information. For example, in the accompanying graph you can see that the low-risk studies are located mainly below the diagonal, while the high-risk studies are mainly located in areas of positive treatment outcome. This distribution, besides being suggestive of heterogeneity, may suggest that the efficacy of the treatment depends on the level of risk or, put another way, that we have an effect-modifying variable in our study. A small drawback of this tool is that it is only applicable to meta-analyses of clinical trials in which the dependent variable is dichotomous.
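The coordinates behind the plot are trivial to compute. A hedged sketch with made-up trials (a real L'Abbé plot would also scale each point by its precision):

```python
def labbe_points(studies):
    """Coordinates for a L'Abbe plot: control event rate on X,
    treatment event rate on Y. For a favorable outcome (e.g. cure),
    points above the diagonal favor the treatment."""
    points = []
    for name, ev_t, n_t, ev_c, n_c in studies:
        x, y = ev_c / n_c, ev_t / n_t
        side = "above" if y > x else "below" if y < x else "on"
        points.append((name, x, y, side))
    return points

# hypothetical trials: name, events/total with treatment, events/total with control
for name, x, y, side in labbe_points([
    ("Trial A", 30, 50, 20, 50),
    ("Trial B", 15, 60, 18, 60),
]):
    print(f"{name}: control {x:.2f}, treatment {y:.2f} -> {side} the diagonal")
```

If, as here, some invented trials fall on each side of the diagonal, that scatter is itself a hint of heterogeneity.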

Well, suppose we have studied the heterogeneity and decided that we are going to combine the studies into a meta-analysis. The next step is to analyze the effect size estimators of the studies, weighting them according to the contribution each study will make to the overall result. This is logical: a trial with few participants and an imprecise result cannot contribute the same to the final result as another with thousands of participants and a more precise effect measure.

The most usual way to take these differences into account is to weight the estimate of the effect size by the inverse of the variance of the results, subsequently performing the analysis to obtain the average effect. For this there are several possibilities, some of them very complex from the statistical point of view, although the two most commonly used methods are the fixed effect model and the random effects model. The two models differ in their conception of the starting population from which the primary studies of the meta-analysis come.

The fixed effect model considers that there is no heterogeneity and that all studies estimate the same effect size of the population (they all measure the same effect, that is why it is called a fixed effect), so it is assumed that the variability observed among the individual studies is due only to the error that occurs when performing the random sampling in each study. This error is quantified by estimating intra-study variance, assuming that the differences in the estimated effect sizes are due only to the use of samples from different subjects.

On the other hand, the random effects model assumes that the effect size varies in each study and follows a normal frequency distribution within the population, so each study estimates a different effect size. Therefore, in addition to the intra-study variance due to the error of random sampling, the model also includes the variability among studies, which would represent the deviation of each study from the mean effect size. These two error terms are independent of each other, both contributing to the variance of the study estimator.

In summary, the fixed effect model incorporates only one error term for the variability of each study, while the random effects model adds, in addition, another error term due to the variability among the studies.
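As a sketch of the two models with invented data: both pool by inverse-variance weighting, and the random effects model widens the weights with a between-study variance tau², here estimated with the common DerSimonian-Laird formula (one of several possible estimators):

```python
import math

def pooled_effect(effects, variances, model="fixed"):
    """Inverse-variance pooled effect with a 95% CI. 'fixed' uses
    w = 1/v; 'random' adds the DerSimonian-Laird between-study
    variance tau2 to each study's variance."""
    k = len(effects)
    w = [1 / v for v in variances]
    pooled = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
    if model == "random":
        q = sum(wi * (e - pooled) ** 2 for wi, e in zip(w, effects))
        c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
        tau2 = max(0.0, (q - (k - 1)) / c)   # between-study variance
        w = [1 / (v + tau2) for v in variances]
        pooled = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
    se = math.sqrt(1 / sum(w))
    return pooled, pooled - 1.96 * se, pooled + 1.96 * se

# hypothetical log risk ratios with their variances
effects, variances = [-0.5, -0.3, 0.1, -0.6], [0.04, 0.05, 0.06, 0.03]
for model in ("fixed", "random"):
    est, lo, hi = pooled_effect(effects, variances, model)
    print(f"{model}: {est:.2f} ({lo:.2f} to {hi:.2f})")
```

Running it shows the behavior described above: the random effects interval is wider than the fixed effect one, because it carries the extra between-study variability.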

You see that I have not written a single formula. We do not actually need to know them and they are quite unfriendly, full of Greek letters that no one understands. But do not worry. As always, statistical programs like RevMan from the Cochrane Collaboration allow you to do the calculations in a simple way, including and removing studies from the analysis and changing the model as you wish.

The type of model to choose has its importance. If in the previous homogeneity analysis we see that the studies are homogeneous we can use the fixed effect model. But if we detect that heterogeneity exists, within the limits that allow us to combine the studies, it will be preferable to use the random effects model.

Another consideration is the applicability or external validity of the results of the meta-analysis. If we have used the fixed effect model, we will be limited to generalizing the results to populations with characteristics similar to those of the included studies. This does not happen with the results obtained with the random effects model, whose external validity is greater because they come from studies of different populations.

In any case, we will obtain a summary effect measure along with its confidence interval. This interval will be statistically significant when it does not cross the line of no effect, which we already know is zero for mean differences and one for odds ratios and risk ratios. In addition, the amplitude of the interval informs us about the precision of the estimate of the average effect in the population: the wider the interval, the less precise the estimate, and vice versa.

If you think a bit, you will immediately understand why the random effects model is more conservative than the fixed effect model in the sense that the confidence intervals obtained are less precise, since it incorporates more variability in its analysis. In some cases it may happen that the estimator is significant if we use the fixed effect model and it is not significant if we use the random effect model, but this should not condition us when choosing the model to use. We must always rely on the previous measure of heterogeneity, although if we have doubts, we can also use the two models and compare the different results.

Having examined the homogeneity of the primary studies, we may come to the grim conclusion that heterogeneity dominates the situation. Can we do something to manage it? Sure we can. We can always choose not to combine the studies, or to combine them despite the heterogeneity and obtain a summary result; but, in that case, we should also calculate some measure of the variability among studies, and even then we could not be sure of our results.

Another possibility is to do a stratified analysis according to the variable that causes the heterogeneity, provided that we are able to identify it. We can also do a sensitivity analysis, repeating the calculations while removing the studies or subgroups one by one and checking how each removal influences the overall result. The problem is that this approach sidesteps the final purpose of any meta-analysis, which is none other than obtaining a summary value from homogeneous studies.
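The leave-one-out idea can be sketched in the same spirit, again with hypothetical data: recompute the pooled fixed effect estimate omitting one study at a time and see how much the overall result moves.

```python
# Leave-one-out sensitivity analysis with made-up effect sizes and
# standard errors (same hypothetical numbers as before).
effects = [0.30, 0.45, 0.10, 0.52]
se = [0.12, 0.15, 0.20, 0.10]

def pooled(eff, errs):
    """Fixed effect (inverse-variance weighted) pooled estimate."""
    w = [1 / s**2 for s in errs]
    return sum(wi * ei for wi, ei in zip(w, eff)) / sum(w)

overall = pooled(effects, se)
for i in range(len(effects)):
    rest_e = effects[:i] + effects[i + 1:]
    rest_s = se[:i] + se[i + 1:]
    print(f"without study {i + 1}: {pooled(rest_e, rest_s):.3f} "
          f"(overall {overall:.3f})")
```

A study whose removal shifts the pooled estimate noticeably is a good candidate for being the source of the heterogeneity.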

Finally, the brainiest on these issues can use meta-regression. This technique is similar to multivariate regression models: the characteristics of the studies are used as explanatory variables, and the effect variable, or some measure of each study's deviation from the global result, is used as the dependent variable. A weighting according to each study's contribution to the overall result should also be applied, and we should avoid putting too many coefficients into the regression model if the number of primary studies is not large. I would not advise you to do a meta-regression at home unless accompanied by an adult.

All that remains is to check that we have not omitted any studies and that we have presented the results correctly. Meta-analysis data are usually represented in a specific graph known as a forest plot. But that is another story…

The whole is greater than the sum of its parts


This is another of those famous quotes that are all over the place. Apparently, the first person to have this clever idea was Aristotle, who used it to summarize the general principle of holism in his writings on metaphysics. Who would have said that this tiny phrase contains so much wisdom? Holism insists that everything must be considered as a whole, because its components may act synergistically, allowing the meaning of the whole to be greater than the sum of the meanings that each individual part contributes.

Don't be afraid, you are still on the blog about brains and not on a blog about philosophy. Nor have I changed the topic of the blog; this saying is just what I needed to introduce you to the wildest beast of the scientific method: the meta-analysis.

We live in the information age. Since the end of the 20th century, we have witnessed a true explosion of the available sources of information, accessible from multiple platforms. The end result is that we are overwhelmed every time we need information about a specific point, so we do not know where to look or how we can find what we want. For this reason, systems began to be developed to synthesize the information available to make it more accessible when needed.

Thus the first reviews appeared: the so-called narrative or author reviews. To write them, one or more authors, usually experts in a specific subject, made a general review of the topic, without any strict criteria for the search strategy or the selection of information. The authors then analyzed the results with total freedom, as their judgment dictated, and ended up drawing their conclusions from a qualitative synthesis of the results obtained.

These narrative reviews are very useful for acquiring an overview of a topic, especially when one knows little about the subject, but they are not very useful for those who already know the topic and need an answer to a more specific question. In addition, since the whole procedure is done according to the authors' wishes, the conclusions are not reproducible.

For these reasons, a series of privileged minds invented the other type of review on which we will focus in this post: the systematic review. Instead of reviewing a general topic, systematic reviews focus on a specific question in order to solve concrete doubts of clinical practice. In addition, they use a clearly specified search strategy and explicit inclusion criteria, following a rigorous procedure that makes them highly reproducible if another group of authors decides to repeat the review of the same topic. And, if that were not enough, whenever possible they go beyond qualitative synthesis, completing it with a quantitative synthesis that receives the funny name of meta-analysis.

The realization of a systematic review consists of six steps: formulation of the problem or question to be answered, search and selection of existing studies, evaluation of the quality of these studies, extraction of the data, analysis of the results and, finally, interpretation and conclusion. We are going to detail this whole process a little.

Any systematic review worth its salt should try to answer a specific question that must be relevant from the clinical point of view. The question will usually be asked in a structured way with the usual components of population, intervention, comparison and outcome (PICO), so that the analysis of these components will allow us to know if the review is of our interest.

In addition, the components of the structured clinical question will help us to search for the relevant studies that exist on the subject. This search must be global and unbiased, so we avoid source biases such as excluding studies by language, journal, etc. It is usual to use a minimum of two major general-purpose electronic databases, such as PubMed, Embase or the Cochrane Library, together with those specific to the subject being treated. It is important that this search is complemented by a manual search in non-electronic registers and by consulting the bibliographic references of the papers found, in addition to other sources of the so-called grey literature, such as doctoral theses and conference proceedings, as well as documents from funding agencies and trial registers, even establishing contact with other researchers to find out whether there are studies not yet published.

It is very important that this strategy is clearly specified in the methods section of the review, so that anyone can reproduce it later, if desired. In addition, it will be necessary to clearly specify the inclusion and exclusion criteria of the primary studies of the review, the type of design sought and its main components (again in reference to the PICO, the components of the structured clinical question).

The third step is the evaluation of the quality of the studies found, which must be done by a minimum of two people independently, with the help of a third party (who will surely be the boss) to break the tie in cases where the extractors do not reach consensus. For this task, tools or checklists designed for the purpose are usually used; one of the most frequently used tools for bias control is the Cochrane Collaboration's risk of bias tool. This tool assesses five criteria of the primary studies to determine their risk of bias: adequate randomization sequence (prevents selection bias), adequate masking (prevents performance and detection biases, both information biases), concealment of allocation (prevents selection bias), losses to follow-up (prevents attrition bias) and selective outcome reporting (prevents information bias). The studies are classified as being at high, low or unclear risk of bias. It is common to use traffic-light colors, marking in green the studies with low risk of bias, in red those with high risk of bias, and in yellow those that remain in no man's land. The more green we see, the better the quality of the primary studies of the review.

Ad hoc forms are usually designed for data extraction, collecting data such as the date, the scope of the study, the type of design, etc., as well as the components of the structured clinical question. As in the previous step, it is convenient that this be done by more than one person, establishing beforehand the method for reaching agreement in cases where the reviewers do not achieve consensus.

And here we enter the most interesting part of the review, the analysis of the results. The fundamental role of the authors will be to explain the differences that exist between the primary studies that are not due to chance, paying special attention to the variations in the design, study population, exposure or intervention and measured results. You can always make a qualitative synthesis analysis, although the real magic of the systematic review is that, when the characteristics of primary studies allow it, a quantitative synthesis, called meta-analysis, can also be performed.

A meta-analysis is a statistical analysis that combines the results of several independent studies that try to answer the same question. Although meta-analysis can be considered as a research project in its own right, it is usually part of a systematic review.

Primary studies can be combined using a statistical methodology developed for this purpose, which has a number of advantages. First, by combining all the results of the primary studies we can obtain a more complete overall picture (you know, the whole is greater…). Second, by combining studies we increase the sample size, which increases the power of the analysis compared with that of the individual studies, improving the estimation of the effect we want to measure. Third, by drawing conclusions from a greater number of studies, external validity increases, since having involved different populations makes it easier to generalize the results. Finally, it can allow us to resolve controversies between the conclusions of the different primary studies of the review and even to answer questions that had not been raised in those studies.

Once the meta-analysis is done, a final synthesis must be made that integrates the results of the qualitative and quantitative synthesis in order to answer the question that motivated the systematic review or, when this is not possible, to propose the additional studies that must be carried out to be able to answer it.

But a meta-analysis will only deserve all our respect if it fulfills a series of requirements. Like the systematic review to which it belongs, it should aim to answer one specific question and it must be based on all the relevant available information, avoiding publication and retrieval biases. The primary studies must have been assessed to ensure their quality and their homogeneity before combining them. Of course, the data must be analyzed and presented in an appropriate way. And, finally, it must make sense to combine the results in order to do so: the fact that we can combine results does not always mean that we have to, if it is not needed in our clinical setting.

And how do you combine the studies?, you may ask yourselves. Well, that is the crux of the matter (cruxes, really, since there are several), because there are various possible ways to do it.

Anyone could think that the easiest way would be a sort of Eurovision Contest: we count the primary studies with a statistically significant positive effect and, if they are the majority, we conclude that there is consensus for a positive result. This approach is quite simple but, you will not deny it, also quite sloppy. I can also think of a number of disadvantages to its use. On the one hand, it implies that lack of significance and lack of effect are synonymous, which does not always have to be true. On the other hand, it takes into account neither the direction and strength of the effect in each study, nor the precision of the estimators, nor the quality or design characteristics of the primary studies. So this type of approach is not very advisable, although nobody is going to fine us if we use it as an informal first approximation before deciding which is the best way to combine the results.

Another possibility is to use a sort of sign test, similar to other non-parametric statistical techniques: we count the number of positive effects, subtract the negatives, and there is our conclusion. The truth is that this method also seems too simple. It ignores studies without statistical significance and it also ignores the precision of the studies' estimators. So this approach is not of much use either, unless all we know is the direction of the effect measured in each study. We could also use it when the primary studies are very heterogeneous, to get an approximation of the global result, although I would not trust results obtained in this way very much.

The third method is to combine the different Ps of the studies (our beloved and sacrosanct Ps). This could come to mind if we had a systematic review whose primary studies used different outcome measures, although all of them tried to answer the same question. For example, think of a review on osteoporosis where some studies use ultrasound densitometry and others spine or femur DEXA. The problem with this method is that it does not take into account the intensity of the effects, only their direction and statistical significance, and we all know the shortcomings of our holy Ps. To take this approach we need a procedure that combines the Ps into a statistic following a chi-square or Gaussian distribution, giving us an estimate and its confidence interval.
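One classical way to combine Ps is Fisher's method, in which minus twice the sum of the log p-values follows a chi-square distribution with 2k degrees of freedom under the global null hypothesis. A small sketch with hypothetical p-values (the closed-form survival function used below is valid because the degrees of freedom, 2k, are even):

```python
import math

# Fisher's method: X2 = -2 * sum(ln p) ~ chi-square with 2k df.
# Hypothetical p-values from k = 4 primary studies.
pvalues = [0.04, 0.10, 0.03, 0.20]
k = len(pvalues)

x2 = -2 * sum(math.log(p) for p in pvalues)

# Chi-square survival function for even df = 2k (Erlang closed form):
# P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
half = x2 / 2
combined_p = math.exp(-half) * sum(half**i / math.factorial(i)
                                   for i in range(k))

print(f"X2 = {x2:.2f} with {2 * k} df, combined p = {combined_p:.4f}")
```

Note how two individually non-significant studies (p = 0.10 and p = 0.20) still contribute to a clearly significant combined result, which is precisely what the Eurovision-style vote counting would have missed.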

The fourth and final method that I know is also the most stylish: to make a weighted combination of the effect estimated in all the primary studies. Calculating the arithmetic mean would be the easiest way, but we have not come this far just to botch things up again. The arithmetic mean gives the same weight to every study, so if you have an outlier or a very imprecise study, the result will be greatly distorted. Do not forget that the mean always follows the tails of the distribution and is heavily influenced by extreme values (which does not happen to its relative, the median).

This is why we have to weight the different estimates. This can be done in two ways: taking into account the number of subjects in each study, or performing a weighting based on the inverse of the variance of each one (you know, the square of the standard error). The latter is the more complex, but it is the one most often used. Of course, since the required maths are hard, people usually use special software, either external modules for the usual statistical programs such as Stata, SPSS, SAS or R, or specific software such as the famous Cochrane Collaboration's RevMan.
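To see why the weighting matters, here is a toy comparison (made-up numbers, not from any real review) between the plain arithmetic mean and the inverse-variance weighted mean when one study is both extreme and imprecise:

```python
# Why weight by inverse variance instead of taking the plain mean:
# an imprecise outlier study drags the arithmetic mean, but receives
# little weight once we divide by its variance (hypothetical data).
effects = [0.30, 0.35, 0.32, 1.50]   # last study: extreme result
se      = [0.05, 0.06, 0.05, 0.60]   # ...and a huge standard error

arithmetic = sum(effects) / len(effects)

w = [1 / s**2 for s in se]           # inverse-variance weights
weighted = sum(wi * ei for wi, ei in zip(w, effects)) / sum(w)

print(f"arithmetic mean:           {arithmetic:.2f}")
print(f"inverse-variance weighted: {weighted:.2f}")
```

The arithmetic mean lands far above the three precise studies, while the weighted estimate stays close to them, because the outlier's weight is tiny compared with the rest.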

As you can see, it is no exaggeration to call the systematic review with meta-analysis the wildest beast of epidemiological designs. However, it has its detractors. We all know someone who claims not to like systematic reviews because almost all of them end in the same way: “more quality studies are needed to be able to make recommendations with a reasonable degree of evidence”. Of course, in these cases we cannot blame the review: we do not take enough care in performing our studies, so the vast majority deserve to end up in the paper shredder.

Another controversy is that of those who debate what is better, a good systematic review or a good clinical trial (reviews can also be made of other types of designs, including observational studies). This debate reminds me of the controversy over whether it is a sin to make a calimocho by mixing a good wine with Coca-Cola. Controversies aside, if you must drink a calimocho, I assure you that you will enjoy it more if you use a good wine; something similar happens to reviews with the quality of their primary studies.

The problem with systematic reviews is that, to be really useful, they must be performed very rigorously. So that we do not forget anything, there are checklists of recommendations that allow us to order the entire procedure of creating and disseminating scientific works without methodological errors or omissions.

It all started with a program of the Health Service of the United Kingdom that ended with the founding of an international initiative to promote the transparency and precision of biomedical research works: the EQUATOR network (Enhancing the QUAlity and Transparency of health Research). This network consists of experts in methodology, communication and publication, so it includes professionals involved in the quality of the entire process of production and dissemination of research results. Among many other objectives, which you can consult on its website, one is to design a set of recommendations for the realization and publication of the different types of studies, which gives rise to different checklists or statements.

The checklist designed to apply to systematic reviews is the PRISMA statement (Preferred Reporting Items for Systematic reviews and Meta-Analyses), which comes to replace the QUOROM statement (QUality Of Reporting Of Meta-analyses). Based on the definition of systematic review of the Cochrane Collaboration, PRISMA helps us to select, identify and assess the studies included in a review. It also consists of a checklist and a flowchart that describes the passage of all the studies considered during the realization of the review. There is also a lesser-known statement for the assessment of meta-analyses of observational studies, the MOOSE statement (Meta-analyses of Observational Studies in Epidemiology).

The Cochrane Collaboration also has a very well structured and defined methodology, which you can consult on its website. This is the reason they have so much prestige within the world of systematic reviews: they are made by professionals dedicated to the task, following a rigorous and well-tested methodology. Anyway, even Cochrane reviews should be read critically and not taken for granted.

And with this we have reached the end for today. I want to insist that a meta-analysis should be done whenever it is possible and worthwhile, but only after making sure that it is correct to combine the results. If the studies are very heterogeneous we should not combine anything, since the results we would obtain would have very compromised validity. There is a whole series of methods and statistics to measure the homogeneity or heterogeneity of the primary studies, which also influence the way in which we analyze the combined data. But that is another story…

The guard’s dilemma


The world of medicine is a world of uncertainty. We can never be 100% sure of anything, however obvious a diagnosis may seem, nor can we strike out left and right with ultramodern diagnostic techniques or treatments (which are never harmless) when making the decisions that continually haunt us in our daily practice.

That’s why we are always immersed in a world of probabilities, where the certainties are almost as rare as the so-called common sense which, as almost everyone knows, is the least common of the senses.

Imagine you are in the clinic and a patient comes in because he has been kicked in the ass; pretty hard, though. Good doctors that we are, we ask the usual: what is wrong?, since when?, and what do you attribute it to? And we proceed to a complete physical examination, discovering with horror that he has a hematoma on the right buttock.

Here, my friends, the diagnostic possibilities are numerous, so the first thing we do is a comprehensive differential diagnosis. To do this, we can take four different approaches. The first is the possibilistic approach: listing all possible diagnoses and trying to rule them all out simultaneously by applying the relevant diagnostic tests. The second is the probabilistic approach: sorting diagnoses by relative likelihood and then acting accordingly. It looks like a post-traumatic hematoma (known as the kick-in-the-ass syndrome), but someone might think that the kick was not so strong, so maybe the poor patient has a bleeding disorder, or a blood dyscrasia with secondary thrombocytopenia, or even an atypical inflammatory bowel disease with extraintestinal manifestations and gluteal vascular fragility. We could also use a prognostic approach and try to confirm or rule out the possible diagnoses with the worst prognosis, so the kick-in-the-ass syndrome loses interest and we set out to rule out chronic leukemia. Finally, a pragmatic approach could be used, with particular interest in first finding the diagnoses that have the most effective treatment (the kick would be, once more, number one).

It seems that the right thing to do is to use a judicious combination of the pragmatic, probabilistic and prognostic approaches. In our case we will investigate whether the intensity of the injury justifies the magnitude of the bruising and, if so, we will prescribe some hot towels and refrain from further diagnostic tests. This example may seem like nonsense, but I can assure you that I know people who make the complete list and order diagnostic tests at the slightest symptom, regardless of expense or risk. And some I can think of would even consider more exotic diagnostic tests, so the patient should be grateful if the diagnosis does not end up requiring a forced anal sphincterotomy. And that is so because, as we have already said, the waiting list to get some common sense is many times longer than the surgical waiting list.

Now imagine another patient with a less stupid and absurd symptom complex than the previous example. For instance, let us think of a child with symptoms of celiac disease. Before we do any diagnostic test, our patient already has a certain probability of suffering from the disease. This probability is conditioned by the prevalence of the disease in the population from which she comes and is called the pre-test probability. This probability will stand somewhere between two thresholds: the diagnostic threshold and the therapeutic threshold.

The usual thing is that the pre-test probability of our patient does not allow us to rule out the disease with reasonable certainty (it would have to be very low, below the diagnostic threshold) or to confirm it with sufficient security to start the treatment (it would have to be above the therapeutic threshold).

We’ll then make the indicated diagnostic test, getting a new probability of disease depending on the result of the test, the so-called post-test probability. If this probability is high enough to make a diagnosis and initiate treatment, we’ll have crossed our first threshold, the therapeutic one. There will be no need for additional tests, as we will have enough certainty to confirm the diagnosis and treat the patient, always within a range of uncertainty.

And what determines our therapeutic threshold? Well, there are several factors involved. The greater the risk, cost or adverse effects of the treatment in question, the higher the threshold we will demand before treating. On the other hand, the more serious the consequences of missing the diagnosis, the lower the therapeutic threshold we will accept.

But it may be that the post-test probability is so low that it allows us to rule out the disease with reasonable confidence. We will then have crossed our second threshold, the diagnostic one, also called the no-test threshold. Clearly, in this situation neither further diagnostic tests nor, of course, treatment are indicated.

However, very often changing pretest to post-test probability still leaves us in no man’s land, without achieving any of the two thresholds, so we will have to perform additional tests until we reach one of the two limits.

And this is our everyday need: to know the post-test probability of our patients in order to decide whether we discard or confirm the diagnosis, whether we leave the patient alone or unleash our treatments on her. And this is so because the simplistic approach that a patient is sick if the diagnostic test is positive and healthy if it is negative is totally wrong, even if it is the general belief among those who order the tests. We will have to look, then, for some parameter that tells us how useful a specific diagnostic test can be for the purpose we need: knowing the probability that the patient suffers from the disease.

And this reminds me of the enormous problem a brother-in-law asked me about the other day. The poor man is very concerned about a dilemma that has arisen. The thing is that he is going to start a small business and wants to hire a security guard to stand at the entrance and watch for those who take something without paying for it. The problem is that there are two candidates and he does not know which of the two to choose. One of them stops nearly everyone, so no burglar escapes; of course, many honest people are offended when asked to open their bags before leaving, and next time they will buy elsewhere. The other guard is the opposite: he stops almost no one, but whoever he stops is certainly carrying something stolen. He offends few honest people, but too many grabbers escape. A difficult decision…

Why does my brother-in-law come to me with this story? Because he knows that I face similar dilemmas every day, each time I have to choose a diagnostic test to know whether a patient is sick and whether I have to treat her. We have already said that the positivity of a test does not guarantee the diagnosis, just as the bad looks of a customer do not ensure that the poor man has robbed us.

Let’s see it with an example. When we want to know the utility of a diagnostic test, we usually compare its results with those of a reference or gold standard, which is a test that, ideally, is always positive in sick patients and negative in healthy people. Now let’s suppose that I perform a study in my hospital office with a new diagnostic test to detect a certain disease and I get the results from the attached table (the patients are those who have the positive reference test and the healthy ones, the negative).

Let's start with the easy part. We have 1598 subjects, 520 of them sick and 1078 healthy. The test gives us 446 positive results, 428 true (TP) and 18 false (FP). It also gives us 1152 negatives, 1060 true (TN) and 92 false (FN). The first thing we can determine is the ability of the test to distinguish between healthy and sick, which leads me to introduce the first two concepts: sensitivity (Se) and specificity (Sp). Se is the likelihood that the test correctly classifies a sick patient or, in other words, the probability that a sick patient gets a positive result. It is calculated by dividing TP by the number of sick. In our case it equals 0.82 (if you prefer to use percentages, multiply by 100). Sp is the likelihood that the test correctly classifies a healthy subject or, put another way, the probability that a healthy subject gets a negative result. It is calculated by dividing TN by the number of healthy. In our example, it equals 0.98.

Someone may think that we have finished assessing the value of the new test, but we have only just begun. This is because with Se and Sp we somehow measure the ability of the test to discriminate between healthy and sick, but what we really need to know is the probability that an individual with a positive result is actually sick and, although these may seem similar concepts, they are actually quite different.

The probability that a positive is sick is known as the positive predictive value (PPV) and is calculated by dividing the number of sick patients with a positive test by the total number of positives. In our case it is 0.96; this means that a positive has a 96% chance of being sick. The probability that a negative is healthy is expressed by the negative predictive value (NPV), which is the quotient of healthy subjects with a negative test by the total number of negatives. In our example it equals 0.92 (an individual with a negative result has a 92% chance of being healthy). This already looks more like what we said at the beginning we needed: the post-test probability that the patient is really sick.
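All four parameters come straight from the 2-by-2 table. As a quick check, using the numbers of the hospital example in the text:

```python
# The 2x2 table of the hospital example in the text.
tp, fp = 428, 18     # positive results: true / false
fn, tn = 92, 1060    # negative results: false / true

sick, healthy = tp + fn, tn + fp   # 520 sick, 1078 healthy

se = tp / sick          # sensitivity: positives among the sick
sp = tn / healthy       # specificity: negatives among the healthy
ppv = tp / (tp + fp)    # positive predictive value
npv = tn / (tn + fn)    # negative predictive value

print(f"Se={se:.2f}  Sp={sp:.2f}  PPV={ppv:.2f}  NPV={npv:.2f}")
```

Note the different denominators: Se and Sp divide by the columns of the table (sick and healthy, as defined by the gold standard), while the predictive values divide by the rows (positive and negative test results).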

And from here on is when neurons begin to overheat. It turns out that Se and Sp are two intrinsic characteristics of the diagnostic test: their results will be the same whenever we use the test in similar conditions, regardless of the subjects tested. But this is not so with the predictive values, which vary depending on the prevalence of the disease in the population in which we test. This means that the probability that a positive is sick depends on how common or rare the disease is in the population. Yes, you read that right: the same positive test expresses a different risk of being sick and, for the unbelievers, I will give another example.

Suppose this same study is repeated by one of my colleagues who works at a community health center, where the population is proportionally healthier than at my hospital (logical, they have not suffered the hospital yet). If you check the results in the table and take the trouble to calculate them, you will come up with a Se of 0.82 and a Sp of 0.98, the same as I got in my practice. However, if you calculate the predictive values, you will see that the PPV equals 0.9 and the NPV 0.95. And this is so because the prevalence of the disease (sick divided by total) is different in the two populations: 0.32 at my practice vs 0.19 at the health center. That is, at higher prevalence a positive result is more valuable for confirming the diagnosis, but a negative is less reliable for ruling it out. Conversely, if the disease is very rare a negative result will reasonably rule out the disease, but a positive will be less reliable when it comes to confirming it.
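The dependence of the predictive values on prevalence is easy to check with the standard Bayes-style formulas, keeping Se and Sp fixed at the rounded values from the text (so the results agree with the text's figures within rounding):

```python
# Se and Sp stay fixed; the predictive values shift with prevalence.
se, sp = 0.82, 0.98

def predictive_values(prev):
    """PPV and NPV for a given prevalence, from Se and Sp."""
    ppv = se * prev / (se * prev + (1 - sp) * (1 - prev))
    npv = sp * (1 - prev) / (sp * (1 - prev) + (1 - se) * prev)
    return ppv, npv

# 0.32 is the hospital prevalence, 0.19 the health center's;
# 0.01 illustrates a rare disease.
for prev in (0.32, 0.19, 0.01):
    ppv, npv = predictive_values(prev)
    print(f"prevalence {prev:.2f}:  PPV={ppv:.2f}  NPV={npv:.2f}")
```

With a prevalence of only 1%, the same excellent test yields a PPV below 0.3: most positives are false positives, exactly the point made above about rare diseases.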

We see that, as almost always happens in medicine, we are moving on the shaky ground of probability, since all (absolutely all) diagnostic tests are imperfect and make mistakes when classifying healthy and sick. So when is a diagnostic test worth using? If you think about it, any given subject has a probability of being sick even before performing the test (the prevalence of the disease in her population), and we are only interested in using diagnostic tests that increase this likelihood enough to justify starting the appropriate treatment (otherwise we would have to do another test to reach the threshold of probability that justifies treatment).

And here is where this issue starts to become a little unfriendly. The positive likelihood ratio (PLR) indicates how much more probable it is to get a positive result in a sick than in a healthy subject. The proportion of positives in sick patients is represented by Se. The proportion of positives in healthy subjects corresponds to the FP, which would be the healthy without a negative result or, what is the same, 1-Sp. Thus, PLR = Se / (1 - Sp). In our case (the hospital) it equals 41 (the value is the same whether or not we use percentages for Se and Sp). This can be interpreted as: it is 41 times more likely to get a positive result in a sick than in a healthy subject.

It is also possible to calculate the NLR (negative likelihood ratio), which expresses how much more likely it is to find a negative result in a sick than in a healthy subject. The negatives among the sick are those who do not test positive (1-Se) and the negatives among the healthy are the same as the TN (the test's Sp). So, NLR = (1 - Se) / Sp. In our example, 0.18.

A ratio of 1 indicates that the result of the test does not change the likelihood of being sick. If it is greater than 1, the probability increases; if less than 1, it decreases. This is the parameter used to determine the diagnostic power of the test. PLR values > 10 (or NLR < 0.1) indicate a very powerful test that strongly supports (or contradicts) the diagnosis; values of 5-10 (or 0.1-0.2) indicate moderate power to support (or discard) the diagnosis; 2-5 (or 0.2-0.5) indicate that the contribution of the test is questionable; and, finally, 1-2 (or 0.5-1) indicate that the test has no diagnostic value.

The likelihood ratio does not express a direct probability, but it helps us to calculate the probability of being sick before and after a positive test by means of Bayes' rule, which says that the post-test odds are equal to the product of the pre-test odds and the likelihood ratio. To transform the prevalence into pre-test odds we use the formula odds = p / (1-p). In our case it equals 0.47. We then calculate the post-test odds by multiplying the pre-test odds by the likelihood ratio: in our case, the positive post-test odds equal 19.27. Finally, we transform the post-test odds into a post-test probability using the formula p = odds / (odds + 1). In our example it equals 0.95, which means that if our test is positive the probability of being sick goes from 0.32 (the pre-test probability) to 0.95 (the post-test probability).
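The whole chain, from Se and Sp to post-test probability, fits in a few lines (using the rounded figures from the text, so the intermediate values may differ slightly in the last decimal):

```python
# From Se and Sp to post-test probability via likelihood ratios
# and Bayes' rule (rounded figures from the hospital example).
se, sp = 0.82, 0.98
prevalence = 0.32

plr = se / (1 - sp)        # positive likelihood ratio, 41
nlr = (1 - se) / sp        # negative likelihood ratio, ~0.18

pre_odds = prevalence / (1 - prevalence)    # ~0.47
post_odds = pre_odds * plr                  # ~19.3
post_prob = post_odds / (post_odds + 1)     # ~0.95

print(f"PLR={plr:.0f}  NLR={nlr:.2f}")
print(f"pre-test odds={pre_odds:.2f}  "
      f"post-test probability={post_prob:.2f}")
```

The same three steps with the NLR instead of the PLR would give the post-test probability after a negative result, which is what the lower line of the Fagan's nomogram computes graphically.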

If there is still anyone reading at this point, let me say that we do not need all this gibberish to get the post-test probability. There are multiple websites with online calculators that give all these parameters from the initial 2x2 table with minimal effort. In addition, the post-test probability can be easily calculated using a Fagan’s nomogram (see attached figure). This graph represents in three vertical lines, from left to right, the pre-test probability (represented inverted), the likelihood ratios and the resulting post-test probability.

To calculate the post-test probability after a positive result, we draw a line from the prevalence (pre-test probability) to the PLR and extend it to the post-test probability axis. Similarly, in order to calculate post-test probability after a negative result, we would extend the line between prevalence and the value of the NLR.

In this way, with this tool we can directly calculate the post-test probability by knowing the likelihood ratios and the prevalence. In addition, we can use it in populations with different prevalence, simply by modifying the origin of the line in the axis of pre-test probability.

So far we have defined the parameters that help us quantify the power of a diagnostic test, we have seen the limitations of sensitivity, specificity and predictive values, and how the most generally useful parameters are the likelihood ratios. But, you will ask, what makes a good test? A sensitive one? A specific one? Both?

Here we are going to return to the guard’s dilemma that has arisen for my poor brother-in-law, because we have left him abandoned and have not yet answered which of the two guards we recommend he hire: the one who asks almost everyone to open their bags, thus offending many honest people, or the one who almost never stops honest people but, since he stops almost no one, lets many thieves get away.

And what do you think is the better choice? The simple answer is: it depends. Those of you who are still awake by now will have noticed that the first guard (the one who checks many people) is the sensitive one, while the second is the specific one. Which is better for us, the sensitive or the specific guard? It depends, for example, on where our shop is located. If it is in a well-heeled neighborhood the first guard will not be the best choice because, in fact, few people will be thieves and we would rather not offend our customers and drive them away. But if our shop is located in front of the Cave of Ali Baba, we will be more interested in detecting the maximum number of customers carrying stolen goods. It can also depend on what we sell in the store. If we run a flea market we can hire the specific guard even though some thieves may escape (at the end of the day, we will lose only a small amount of money). But if we sell diamonds we will want no thief to escape and we will hire the sensitive guard (we would rather bother someone honest than allow anyone to escape with a diamond).

The same happens in medicine when choosing diagnostic tests: we have to decide in each case whether we are more interested in sensitivity or specificity, because diagnostic tests do not always have both a high sensitivity (Se) and a high specificity (Sp).

In general, a sensitive test is preferred when the inconveniences of a false positive (FP) are smaller than those of a false negative (FN). For example, suppose that we are going to vaccinate a group of patients and we know that the vaccine is deadly in those with a particular metabolic error. It is clear that our interest is that no patient remains undiagnosed (to avoid FN), while nothing serious happens if we wrongly label a healthy subject as having the metabolic error (FP): it is preferable not to vaccinate a healthy person thinking he has a metabolopathy (although he has not) than to kill a patient with our vaccine supposing he was healthy. Another less dramatic example: in the midst of an epidemic our interest will be to be very sensitive and isolate the largest possible number of patients. The problem here is for the unfortunate healthy who test positive (FP) and get isolated with the rest of the sick. No doubt we would do them a disservice with the maneuver. Of course, we could submit all the positives to the first test to a second, very specific confirmatory test, in order to spare the FP these bad consequences.

On the other hand, a specific test is preferred when it is better to have a FN than a FP, as when we want to be sure that someone is actually sick. Imagine that a positive test result implies a surgical treatment: we will have to be quite sure about the diagnosis so that we do not operate on any healthy people.

Another example is a disease whose diagnosis can be very traumatic for the patient, or that is almost incurable or has no treatment. Here we will prefer specificity so as not to cause any unnecessary distress to a healthy person. Conversely, if the disease is serious but treatable, we will probably prefer a sensitive test.

So far we have talked about tests with a dichotomous result: positive or negative. But what happens when the result is quantitative? Let’s imagine that we measure fasting blood glucose. We must decide up to which level of glycemia we consider the result normal and above which we will consider it pathological. And this is a crucial decision, because Se and Sp will depend on the cutoff point we choose.

To help us choose we have the receiver operating characteristic curve, known worldwide as the ROC curve. We represent Se on the ordinate (y axis) and the complement of Sp (1-Sp) on the abscissa, and draw a curve in which each point represents the probability that the test correctly classifies a healthy-sick pair taken at random. The diagonal of the graph would represent the “curve” of a test with no ability to discriminate the healthy from the sick.

As you can see in the figure, the curve usually has a segment of steep slope where the Se increases rapidly with hardly any change in Sp: if we move up we can increase Se with practically no increase in FP. But there comes a point when we reach the flat part. If we continue to move to the right, there will be a point from which the Se will no longer increase, but the FP will begin to increase. If we are interested in a sensitive test, we will stay in the first part of the curve. If we want specificity we will have to go further to the right. And, finally, if we have no predilection for either of the two (we are equally concerned about obtaining FP and FN), the best cutoff point will be the one closest to the upper left corner. To find it, some use the so-called Youden’s index, which jointly optimizes the two parameters and is calculated by adding Se and Sp and subtracting 1. The higher the index, the fewer patients are misclassified by the diagnostic test.
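As a toy illustration, here is how one might pick the cutoff with the highest Youden’s index. The glycemia cutoffs and their (Se, Sp) pairs below are made up for the example:

```python
# Choosing a cutoff with Youden's index (J = Se + Sp - 1).
# The cutoffs (mg/dl) and their (Se, Sp) pairs are invented figures.
cutoffs = {
    100: (0.95, 0.60),
    110: (0.90, 0.75),
    126: (0.80, 0.88),
    140: (0.60, 0.96),
}

# Pick the cutoff that maximizes J = Se + Sp - 1.
best = max(cutoffs, key=lambda c: cutoffs[c][0] + cutoffs[c][1] - 1)
print(best)  # 126
```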

A parameter of interest is the area under the curve (AUC), which represents the probability that the diagnostic test correctly classifies the patient being tested (see attached figure). An ideal test with Se and Sp of 100% has an area under the curve of 1: it always hits. In clinical practice, a test whose ROC curve has an AUC > 0.9 is considered very accurate, one between 0.7 and 0.9 moderately accurate, and one between 0.5 and 0.7 of low accuracy. On the diagonal, the AUC equals 0.5, which indicates that the test is no better than tossing a coin to decide whether the patient is sick or not. Values below 0.5 indicate that the test is even worse than chance, since it will systematically classify sick patients as healthy and vice versa.

Curious, these ROC curves, aren’t they? Their usefulness is not limited to assessing the goodness of diagnostic tests with quantitative results. ROC curves also serve to determine the goodness of fit of a logistic regression model to predict dichotomous outcomes, but that is another story…
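By the way, the definition of the AUC as the probability of correctly ranking a random healthy-sick pair can be checked directly by counting pairs. The test scores below are invented for the illustration:

```python
# Empirical AUC: the fraction of sick-healthy pairs the test ranks
# correctly (ties count as half). The scores are toy data.
sick    = [130, 145, 160, 155, 120]
healthy = [ 90, 105, 110, 125,  95]

pairs = [(s, h) for s in sick for h in healthy]
auc = sum(1.0 if s > h else 0.5 if s == h else 0.0 for s, h in pairs) / len(pairs)
print(auc)  # 0.96
```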

King of Kings


There is no doubt that when doing research in biomedicine we can choose from a large number of possible designs, all with their advantages and disadvantages. But in such a diverse and populous court, among jugglers, wise men, gardeners and purple flautists, the true Crimson King of epidemiology reigns over them all: the randomized clinical trial.

The clinical trial is an interventional analytical study, with antegrade direction and concurrent temporality, and with sampling of a closed cohort with control of exposure. In a trial, a sample of a population is selected and divided randomly into two groups. One of the groups (intervention group) undergoes the intervention that we want to study, while the other (control group) serves as a reference to compare the results. After a given follow-up period, the results are analyzed and the differences between the two groups are compared. We can thus evaluate the benefits of treatments or interventions while controlling the biases of other types of studies: randomization favors that possible confounding factors, known or not, are distributed evenly between the two groups, so that if in the end we detect any difference, this has to be due to the intervention under study. This is what allows us to establish a causal relationship between exposure and effect.

From what has been said up to now, it is easy to understand that the randomized clinical trial is the most appropriate design to assess the effectiveness of any intervention in medicine and is the one that provides, as we have already mentioned, a higher quality evidence to demonstrate the causal relationship between the intervention and the observed results.

But to enjoy all these benefits it is necessary to be scrupulous in the approach and methodology of the trial. There are checklists published by experts who know a great deal about these issues, such as the CONSORT list, which can help us assess the quality of a trial’s design. But among all these aspects, let us give some thought to those that are crucial for the validity of the clinical trial.

Everything begins with a knowledge gap that leads us to formulate a structured clinical question. The only objective of the trial should be to answer this question, and it is enough for it to respond appropriately to a single question. Beware of clinical trials that try to answer many questions, since, in many cases, they end up answering none of them well. In addition, the approach must be based on what the inventors of methodological jargon call the equipoise principle, which means no more than that, deep in our hearts, we do not really know which of the two interventions is more beneficial for the patient (from the ethical point of view, it would be anathema to make a comparison if we already knew with certainty which of the two interventions is better). It is curious in this sense how trials sponsored by the pharmaceutical industry are more likely to breach the equipoise principle, since they have a preference for comparing with placebo or with “non-intervention” in order to demonstrate more easily the efficacy of their products.

Then we must carefully choose the sample on which we will perform the trial. Ideally, all members of the population should have the same probability not only of being selected, but also of ending up in either of the two branches of the trial. Here we face a small dilemma. If we are very strict with the inclusion and exclusion criteria, the sample will be very homogeneous and the internal validity of the study will be strengthened, but it will be more difficult to extend the results to the general population (this is the explanatory attitude of sample selection). On the other hand, if we are not so rigid, the results will be more similar to those of the general population, but the internal validity of the study may be compromised (this is the pragmatic attitude).

Randomization is one of the key points of the clinical trial. It is what assures us that the two groups are comparable, since it tends to distribute the known variables equally between them and, more importantly, also the unknown ones. But do not relax too much: this distribution is not guaranteed at all; it is only more likely if we randomize correctly, so we should always check the homogeneity of the two groups, especially with small samples.
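A minimal sketch of the idea of simple randomization, shuffling participants and splitting them into two arms (the patient IDs and the fixed seed are just for the example; real trials use more sophisticated, concealed allocation schemes):

```python
# Simple randomization sketch: shuffle 20 hypothetical patient IDs
# and split them into two equal arms.
import random

random.seed(42)  # fixed seed only to make the example reproducible
participants = list(range(1, 21))
random.shuffle(participants)
intervention, control = participants[:10], participants[10:]
print(len(intervention), len(control))  # 10 10
```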

In addition, randomization allows us to perform masking appropriately, with which we can make an unbiased measurement of the response variable, avoiding information bias. The results of the intervention group can be compared with those of the control group in three ways. One is to compare with a placebo. The placebo should be a preparation physically indistinguishable from the intervention drug but without its pharmacological effects. This serves to control the placebo effect (which depends on the patient’s personality, their feelings towards the intervention, their fondness for the research team, etc.), but also the side effects that are due to the intervention and not to its pharmacological effect (think, for example, of the percentage of local infections in a trial with a drug administered intramuscularly).

The other way is to compare with the treatment accepted as the most effective so far. If there is a treatment that works, the logical (and more ethical) thing is to use it to investigate whether the new one brings benefits. This is also the usual comparison method in equivalence or non-inferiority studies. Finally, the third possibility is to compare with non-intervention, although in reality this is a far-fetched way of saying that only the usual care that any patient would receive in their clinical situation is applied.

It is essential that all participants in the trial follow the same follow-up schedule, which must be long enough to allow the expected response to occur. All losses that occur during follow-up should be detailed and analyzed, since they can compromise the validity and power of the study to detect significant differences. And what do we do with those who drop out or end up in a different branch from the one assigned? If there are many, it may be more reasonable to reject the study. Another possibility is to exclude them and act as if they had never existed, but this can bias the results of the trial. A third possibility is to include them in the analysis in the branch of the trial in which they actually participated (there is always someone who gets confused and takes what he should not), which is known as analysis by treatment received or per-protocol analysis. And the fourth and last option is to analyze them in the branch to which they were initially assigned, regardless of what they did during the study. This is called the intention-to-treat analysis, and it is the only one of the four that allows us to retain all the benefits that randomization had previously provided.
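A toy numerical example may help to see how the intention-to-treat and per-protocol approaches can differ. All figures are invented: 100 patients assigned to the new drug, of whom 10 actually ended up taking the control treatment:

```python
# Toy illustration of intention-to-treat vs per-protocol analysis.
# All numbers are invented for the example.
assigned = 100            # patients randomized to the new drug
switched = 10             # of them, how many actually took the control treatment
events_compliers = 27     # events among those who took the assigned drug
events_switchers = 6      # events among those who switched

itt_risk = (events_compliers + events_switchers) / assigned   # by assignment
pp_risk = events_compliers / (assigned - switched)            # by treatment received
print(itt_risk, round(pp_risk, 2))  # 0.33 0.3
```

The two risks differ, and only the first preserves the comparability that randomization gave us.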

As a final phase, we would have to analyze and compare the data to draw the conclusions of the trial, using for this the measures of association and impact which, in the case of the clinical trial, are usually the response rate, the risk ratio (RR), the relative risk reduction (RRR), the absolute risk reduction (ARR) and the number needed to treat (NNT). Let’s see them with an example.

Let’s imagine that we carried out a clinical trial in which we tested a new antibiotic (let’s call it A, so as not to rack our brains over a name) for the treatment of a serious infection of the location we are interested in studying. We randomize the selected patients and give them the new drug or the usual treatment (our control group), according to what chance assigns them. In the end, we measure how many of our patients fail treatment (present the event we want to avoid).

Thirty six out of the 100 patients receiving drug A present the event to be avoided. Therefore, we can conclude that the risk or incidence of the event in those exposed (Ie) is 0.36. On the other hand, 60 of the 100 controls (we call them the group of not exposed) have presented the event, so we quickly calculate that the risk or incidence in those not exposed (Io) is 0.6.

At first glance we already see that the risk is different in each group, but as in science we have to measure everything, we can divide the risk of the exposed by that of the not exposed, thus obtaining the so-called risk ratio (RR = Ie / Io). An RR = 1 means that the risk is equal in the two groups. If RR > 1 the event will be more likely in the exposed group (the exposure under study will be a risk factor for the event) and if RR is between 0 and 1, the risk will be lower in the exposed. In our case, RR = 0.36 / 0.6 = 0.6. An RR > 1 is easier to interpret. For example, an RR of 2 means that the probability of the event is twice as high in the exposed group. Following the same reasoning, an RR of 0.3 would tell us that the event is about one third as frequent in the exposed as in the controls. You can see in the attached table how these measures are calculated.

But what we are interested in is knowing how much our intervention reduces the risk of the event, to estimate how much effort is needed to prevent each one. For this we can calculate the RRR and the ARR. The RRR is the difference in risk between the two groups with respect to the control (RRR = [Ie – Io] / Io). In our case it is 0.4, which means that the tested intervention reduces the risk by 40% compared to the usual treatment.

The ARR is simpler: it is the difference between the risks of the exposed and the controls (ARR = Ie – Io). In our case it is 0.24 (we ignore the negative sign), which means that out of every 100 patients treated with the new drug there will be 24 fewer events than if we had used the control treatment. But there is still more: we can know how many patients we have to treat with the new drug to avoid one event by simply applying the rule of three (24 is to 100 as 1 is to x) or, easier to remember, calculating the inverse of the ARR. Thus, NNT = 1 / ARR = 4.2. In our case we would have to treat about four patients to avoid one adverse event. The context will always tell us the clinical importance of this figure.

As you can see, the RRR, although technically correct, tends to magnify the effect and does not clearly quantify the effort required to obtain the results. In addition, it may be similar in situations with totally different clinical implications. Let’s see it with another example that I also show you in the table. Suppose another trial with a drug B in which we obtain three events in the 100 treated and five in the 100 controls. If you do the calculations, the RR is 0.6 and the RRR is 0.4, as in the previous example, but if you calculate the ARR you will see that it is very different (ARR = 0.02), with an NNT of 50. It is clear that the effort to avoid one event is much greater (4 versus 50) despite the same RR and RRR.
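All these measures can be computed in a few lines from the trial counts. The sketch below uses the figures from the two examples in the text (drug A: 36/100 vs 60/100; drug B: 3/100 vs 5/100), and shows how the same RR and RRR can hide very different ARR and NNT:

```python
# Association and impact measures of a clinical trial from its counts.
def effect_measures(events_exposed, n_exposed, events_control, n_control):
    ie = events_exposed / n_exposed   # risk in the exposed (intervention) group
    io = events_control / n_control   # risk in the control group
    rr = ie / io                      # risk ratio
    rrr = (io - ie) / io              # relative risk reduction
    arr = io - ie                     # absolute risk reduction
    nnt = 1 / arr                     # number needed to treat
    return rr, rrr, arr, nnt

for name, args in (("A", (36, 100, 60, 100)), ("B", (3, 100, 5, 100))):
    rr, rrr, arr, nnt = effect_measures(*args)
    print(name, round(rr, 2), round(rrr, 2), round(arr, 2), round(nnt, 1))
    # A 0.6 0.4 0.24 4.2
    # B 0.6 0.4 0.02 50.0
```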

So, at this point, let me give you a piece of advice. Since the data needed to calculate the RRR are the same as those needed for the simpler ARR (and NNT), if a scientific paper offers you only the RRR and hides the ARR, distrust it, and do as you would with the brother-in-law who offers you wine and cured cheese: ask him why he does not put out a plate of Iberian ham instead. Well, what I really mean is that you had better ask yourselves why they do not give you the ARR, and compute it yourselves from the information in the article.

So far all that we have said refers to the classical design of parallel clinical trials, but the king of designs has many faces and, very often, we can find papers in which it is shown a little differently, which may imply that the analysis of the results has special peculiarities.

Let’s start with one of the most frequent variations. If we think about it for a moment, the ideal design would be one that allowed us to test, in the same individual, the effect of both the study intervention and the control intervention (the placebo or the standard treatment), since the parallel trial is an approximation that assumes that the two groups respond equally to the two interventions, which always implies a risk of bias that we try to minimize with randomization. If we had a time machine we could try the intervention in everyone, write down what happens, turn back the clock and repeat the experiment with the control intervention, so we could compare the two effects. The problem, as the more alert among you will already have imagined, is that the time machine has not been invented yet.

But what has been invented is the cross-over clinical trial, in which each subject acts as their own control. As you can see in the attached figure, in this type of trial each subject is randomized to a group, subjected to the first intervention, allowed a wash-out period and, finally, subjected to the other intervention. Although this solution is not as elegant as the time machine, the defenders of cross-over trials point out that the variability within each individual is less than the variability between individuals, so the estimate can be more precise than that of the parallel trial and, in general, smaller sample sizes are needed. Of course, before using this design you have to make a series of considerations. Logically, the effect of the first intervention should not produce irreversible or very prolonged changes, because these would affect the effect of the second. In addition, the wash-out period must be long enough to avoid any residual effect of the first intervention.

It is also necessary to consider whether the order of the interventions can affect the final result (sequence effect), with which only the results of the first intervention would be valid. Another problem is that, having a longer duration, the characteristics of the patient can change throughout the study and be different in the two periods (period effect). And finally, beware of the losses during the study, which are more frequent in longer studies and have a greater impact on the final results than in parallel trials.

Imagine now that we want to test two interventions (A and B) in the same population. Can we do it with a single trial and save costs of all kinds? Yes, we can; we just have to design a factorial clinical trial. In this type of trial, each participant undergoes two consecutive randomizations: first they are assigned to intervention A or placebo (P) and, second, to intervention B or placebo, so we end up with four study groups: AB, AP, BP and PP. As is logical, the two interventions must act through independent mechanisms so that the two effects can be assessed separately.
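The double randomization can be sketched as follows (the group sizes are illustrative and the fixed seed is only for reproducibility; a real factorial trial would use a proper allocation procedure):

```python
# Sketch of the two consecutive randomizations of a factorial trial,
# yielding the four study groups AB, AP, BP and PP.
import random

random.seed(1)  # fixed seed only so the example is reproducible
groups = {"AB": 0, "AP": 0, "BP": 0, "PP": 0}
for _ in range(400):
    first = "A" if random.random() < 0.5 else "P"    # A vs placebo
    second = "B" if random.random() < 0.5 else "P"   # B vs placebo
    label = first + second
    if label == "PB":   # placebo for A plus active B is the "BP" group
        label = "BP"
    groups[label] += 1
print(groups)  # four groups of roughly 100 participants each
```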

Usually, an intervention related to a more plausible and mature hypothesis and another one with a less contrasted hypothesis are studied, assuring that the evaluation of the second does not influence the inclusion and exclusion criteria of the first one. In addition, it is not convenient that neither of the two options has many annoying effects or is badly tolerated, because the lack of compliance with one treatment usually determines the poor compliance of the other. In cases where the two interventions are not independent, the effects could be studied separately (AP versus PP and BP versus PP), but the design advantages are lost and the necessary sample size increases.

At other times it may happen that we are in a hurry to finish the study as soon as possible. Imagine a very bad disease that kills lots of people and we are trying a new treatment. We want to have it available as soon as possible (if it works, of course), so after every certain number of participants we will stop and analyze the results and, if we can already demonstrate the usefulness of the treatment, we will consider the study finished. This is the design that characterizes the sequential clinical trial. Remember that in the parallel trial the correct thing is to calculate the sample size beforehand. In this design, with a more Bayesian mentality, a statistic is established whose value determines an explicit termination rule, so that the sample size depends on the previous observations. When the statistic reaches the predetermined value we feel confident enough to reject the null hypothesis and we finish the study. The problem is that each stop and analysis increases the probability of rejecting the null hypothesis when it is actually true (type 1 error), so it is not advisable to do many interim analyses. In addition, the final analysis of the results is complex because the usual methods do not work; there are others that take the interim analyses into account. This type of trial is very useful with very fast-acting interventions, so it is common to see it in dose-titration studies of opioids, hypnotics and similar poisons.

There are other occasions when individual randomization does not make sense. Imagine we have taught the doctors of a center a new technique to inform their patients better and we want to compare it with the old one. We cannot tell the same doctor to inform some patients one way and others another way, since there would be many opportunities for the two interventions to contaminate each other. It would be more logical to teach the doctors of one group of centers and not those of another group and compare the results. Here what we would randomize are the centers, to train their doctors or not. This is the trial with group (cluster) assignment. The problem with this design is that we do not have many guarantees that the participants of the different groups behave independently, so the necessary sample size can increase greatly if there is great variability between the groups and little within each group. In addition, an aggregated analysis of the results has to be done, because if it is done individually the confidence intervals are falsely narrowed and we can find spurious statistical significance. The usual thing is to calculate a weighted synthetic statistic for each group and make the final comparisons with it.

The last of the series that we are going to discuss is the community trial, in which the intervention is applied to population groups. When carried out under real conditions on populations, these trials have great external validity and often allow cost-efficient measures to be adopted based on their results. The problem is that it is often difficult to establish control groups, it can be harder to determine the necessary sample size, and it is more complex to draw causal inferences from their results. It is the typical design for evaluating public health measures such as water fluoridation, vaccinations, etc.

I’m done now. The truth is that this post has been a bit long (and, I hope, not too hard), but the King deserves it. In any case, if you think that everything has been said about clinical trials, you have no idea of all that remains to be said about types of sampling, randomization, etc., etc., etc. But that is another story…