A weakness


NNT calculation in meta-analysis

Even the greatest have weaknesses. This is a reality that affects even the great NNT, the number needed to treat, without a doubt the king of the absolute impact measures used in clinical trial methodology.

Of course, that is not an irreparable disgrace. We only have to be well aware of its strengths and weaknesses in order to take advantage of the former and try to mitigate and control the latter. The fact is that the NNT depends on the baseline risks of the intervention and control groups, which can be inconsistent fellow travelers, subject to variation due to several factors.

As we all know, NNT is an absolute measure of effect that is used to estimate the efficacy or safety of an intervention. This parameter, just like a good marriage, is useful in good times and in bad, in sickness and in health.

Thus, on the good side we talk about the NNT proper, the number of patients that have to be treated for one of them to present an outcome that we consider good. Meanwhile, on the dark side we have the number needed to harm (NNH), which indicates how many we have to treat for one to present an adverse event.

NNT was originally designed to describe the effect of the intervention relative to the control group in clinical trials, but its use was later extended to interpret the results of systematic reviews and meta-analyses. And this is where the problem may arise since, sometimes, the way of calculating it in trials is generalized to meta-analyses, which can lead to error.

NNT calculation in meta-analysis

The simplest way to obtain the NNT is to calculate the inverse of the absolute risk reduction between the intervention and the control group. The problem is that this form is the one that is most likely to be biased by the presence of factors that can influence the value of the NNT. Although it is the king of absolute measures of impact, it also has its limitations, with various factors influencing its magnitude, not to mention its clinical significance.
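
This simplest calculation can be sketched in a couple of lines (the event rates below are purely hypothetical, for illustration only):

```python
# NNT as the inverse of the absolute risk reduction (ARR).
# The event rates below are hypothetical, for illustration only.
cer = 0.20  # control event rate (bad outcome in the control group)
eer = 0.12  # experimental event rate (bad outcome in the treated group)

arr = cer - eer   # absolute risk reduction
nnt = 1 / arr     # number needed to treat

print(f"ARR = {arr:.2f}, NNT = {nnt:.1f}")
```

In practice, the NNT is usually rounded up to the next whole patient.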

One of these factors is the duration of the study follow-up period. This duration can influence the number of events, good or bad ones, that the study participants can present, which makes it incorrect to compare the NNTs of studies with follow-ups of different duration.

Another may be the baseline risk of presenting the event. Let’s think that the term “risk”, from a statistical point of view, does not always imply something bad. We can speak, for example, of risk of cure. If the baseline risk is higher, more events will likely occur and the NNT may be lower. The outcome variable used and the treatment alternative with which we compared the intervention should also be taken into account.

And third, to name a few more of these factors, the direction and size of the effect, the scale of measurement, and the precision of the NNT estimates (their confidence intervals) may influence its value.

Control event rate

And here the problem arises with systematic reviews and meta-analyses. However much we might wish otherwise, there will always be some heterogeneity among the primary studies in the review, so the factors we have discussed may differ among studies. At this point, it is easy to understand that estimating the global NNT from the summary risk measures of the two groups may not be the most suitable approach, since it is highly influenced by variations in the baseline control event rate (CER).

NNT calculation from RR and OR

For these situations, it is much more advisable to make other, more robust estimates of the NNT, the most widely used being those based on other association measures such as the risk ratio (RR) or the odds ratio (OR), which hold up better against variations in the CER. In the attached figure I show you the formulas for calculating the NNT from the different measures of association and effect.
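
In case the attached figure is not at hand, here is a sketch of the two usual conversions, written for a protective effect (RR and OR below 1); the CER and effect sizes are hypothetical:

```python
# NNT estimated from summary association measures and an overall CER.
# Standard conversions for a protective effect; example values hypothetical.
def nnt_from_rr(cer, rr):
    # ARR = CER - CER * RR, so NNT = 1 / (CER * (1 - RR))
    return 1 / (cer * (1 - rr))

def nnt_from_or(cer, or_):
    return (1 - cer * (1 - or_)) / (cer * (1 - cer) * (1 - or_))

cer = 0.20
print(nnt_from_rr(cer, 0.60))  # about 12.5
print(nnt_from_or(cer, 0.60))  # about 14.4
```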

In any case, we must not lose sight of the recommendation of not to carry out a quantitative synthesis or calculation of summary measures if there is significant heterogeneity among primary studies, since then the global estimates will be unreliable, whatever we do.

But do not think that we have solved the problem. We cannot finish this post without mentioning that these alternative methods for calculating NNT also have their weaknesses. Those have to do with obtaining an overall CER summary value, which also varies among primary studies.

The simplest way would be to divide the sum of events in the control groups of the primary studies by the total number of participants in that group. This is usually possible simply by taking the data from the meta-analysis’ forest plot. However, this method is not recommended, as it completely ignores the variability among studies and possible differences in randomization.

Another more correct way would be to calculate the mean or median of the CER of all the primary studies and, even better, to calculate some weighted measure based on the variability of each study.

Even further, if baseline risk variations among studies are very important, an estimate based on the investigator’s knowledge or on other studies could be used, or a range of possible CER values could be tried, comparing the differences among the NNTs obtained.
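
To see how much the choice of summary CER can matter, here is a small sketch with three hypothetical studies, comparing the naive pooled rate, the unweighted mean and an inverse-variance weighted mean:

```python
# (events in control group, control group size) for three hypothetical studies
studies = [(10, 100), (30, 150), (5, 250)]

# Naive pooling: total events over total participants (not recommended).
naive_cer = sum(e for e, n in studies) / sum(n for e, n in studies)

# Unweighted mean of the per-study CERs.
cers = [e / n for e, n in studies]
mean_cer = sum(cers) / len(cers)

# Inverse-variance weighted mean, with binomial variance p(1-p)/n.
weights = [n / (p * (1 - p)) for p, (e, n) in zip(cers, studies)]
weighted_cer = sum(w * p for w, p in zip(weights, cers)) / sum(weights)

print(round(naive_cer, 3), round(mean_cer, 3), round(weighted_cer, 3))
```

Note how the study with the CER furthest from 0.5 dominates the weighted estimate, which is precisely the weighting problem discussed next.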

You have to be very careful with variance-based weighting of studies, since the CER has the bad habit of not following a normal distribution, but a binomial one. The problem with the binomial distribution is that its variance depends greatly on its mean, being maximal at mean values around 0.5.

On the contrary, the variance decreases as the mean approaches 0 or 1, so all variance-based weighting methods will assign a greater weight to a study the further its mean is from 0.5 (remember that the CER, like any probability, can range from 0 to 1). For this reason, it is necessary to carry out a transformation so that the values approach a normal rather than a binomial distribution, and thus be able to carry out the weighting.
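
A quick numerical sketch of why the transformation helps: the variance of a raw proportion, p(1−p)/n, peaks at 0.5, while after the simple arcsine transformation the approximate variance is 1/(4n), independent of p (sample size hypothetical):

```python
import math

# Variance of a raw proportion depends on p; after the arcsine
# transformation the approximate variance 1/(4n) does not.
n = 100  # hypothetical sample size
for p in (0.05, 0.25, 0.50):
    raw_var = p * (1 - p) / n       # binomial variance of the proportion
    t = math.asin(math.sqrt(p))     # arcsine-transformed proportion
    print(p, raw_var, round(t, 3), 1 / (4 * n))
```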

We’re leaving…

And I think we will leave it here for today. We are not going to go into the methods to transform the CER, such as the double arcsine or the application of generalized linear mixed models, since that is for the most exclusive minds, among which mine is not included. Anyway, don’t get stuck on this. I advise you to calculate the NNT using statistical packages or calculators, such as Calcupedev. There are other uses of the NNT that we could also comment on and that can be obtained with these tools, as is the case of the NNT in survival studies. But that is another story…

When nothing bad happens, is everything okay?


Probability calculation when the numerator equals zero

I have a brother-in-law who is increasingly afraid of getting on a plane. He is able to make road trips of several days in a row so as not to leave the ground. But it turns out that the poor guy now has to make a transcontinental trip and has no choice but to board a plane.

But, in addition to being fearful, my brother-in-law is a resourceful fellow. He has been counting the number of flights of the different airlines and the number of accidents each one has had, in order to calculate the probability of having a mishap with each of them and fly with the safest. The matter is very simple if we remember the old rule that probability equals favorable cases divided by possible cases.

And it turns out that he is happy, because there is a company that has made 1500 flights and has never had an accident, so the probability of having an accident flying on its planes will be, according to my brother-in-law, 0/1500 = 0. He is now so calm that he has almost lost his fear of flying. Mathematically, it is almost certain that nothing will happen to him. What do you think about my brother-in-law’s reasoning?

Many of you will already be thinking that using brothers-in-law for these examples has these problems. We all know how brothers-in-law are… But don’t be unfair to them. As the famous humorist Joaquín Reyes says, “we all of us are brothers-in-law”, so just remember that. What is beyond doubt is that we will all agree that my brother-in-law is wrong: the fact that there has not been any mishap in 1500 flights does not guarantee that the next plane will not fall. In other words, even if the numerator of the proportion is zero, if we estimate the real risk it would be incorrect to keep zero as the result.

This situation occurs with some frequency in biomedical research studies. To leave airlines and aerophobics alone, imagine that we have a new drug with which we want to prevent that terrible disease, fildulastrosis. We take 150 healthy people and give them antifildulin for 1 year and, after this follow-up period, we do not detect any new cases of the disease. Can we conclude, then, that the treatment prevents the development of the disease with absolute certainty? Obviously not. Let’s think about it a little.

Probability calculation when the numerator equals zero

Making inferences about probabilities when the numerator of the proportion is zero can be somewhat tricky, since we tend to think that the non-occurrence of events is something qualitatively different from the occurrence of one, few or many events, and this is not really so. A numerator equal to zero does not mean that the risk is zero, nor does it prevent us from making inferences about the size of the risk, since we can apply the same statistical principles as to non-zero numerators.

Returning to our example, suppose that the incidence of fildulastrosis in the general population is 3 cases per 2000 people per year (1.5 per thousand, 0.15% or 0.0015). Can we infer with our experiment if taking antifildulin increases, decreases or does not modify the risk of suffering fildulastrosis? Following the familiar adage, yes, we can.

We will continue our habit of considering the null hypothesis as one of equal effect, so that the risk of disease is not modified by the new treatment. Thus, the risk of each of the 150 participants becoming ill throughout the study will be 0.0015. In other words, the risk of not getting sick will be 1 − 0.0015 = 0.9985. What will be the probability that none of them gets sick during the year of the study? Since these are 150 independent events, the probability that all 150 subjects remain healthy will be 0.9985^150 = 0.8. We see, therefore, that even if the risk is the same as that of the general population, with this number of patients we have an 80% chance of not detecting a single case of fildulastrosis during the study, so it would actually be more surprising to find a sick participant than not to find any. And the most surprising thing is that we have thus obtained the probability of having no sick subjects in our sample: far from proving a zero risk, as my brother-in-law would conclude from 0/150, that probability is 80%!
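
The calculation in this paragraph can be sketched in a couple of lines (figures taken from the example itself):

```python
# Probability of observing zero events among n independent subjects,
# each with per-subject risk p (values from the example in the text).
def prob_zero_events(p, n):
    return (1 - p) ** n

print(round(prob_zero_events(0.0015, 150), 2))  # -> 0.8
```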

And the worst part is that, given this result, pessimism invades us: it is even possible that the risk of disease with the new drug is greater and we are simply not detecting it. Let’s assume that the risk with the medication is 1% (compared to 0.15% in the general population). The probability of nobody getting sick would be (1 − 0.01)^150 = 0.22. Even with a 2% risk, the probability of detecting no disease at all is (1 − 0.02)^150 = 0.048. Remember that 5% is the value we usually adopt as the “safe” limit to reject the null hypothesis without making a type 1 error.

At this point, we can ask ourselves whether we are very unfortunate and have simply not been lucky enough to detect cases of illness although the risk is high or whether, on the contrary, we are not so unfortunate and, in reality, the risk must be low. To clarify this, we can return to our usual 5% limit and see for which risks of getting sick with the treatment the probability of observing no cases falls to 5% or below:

– Risk of 1.5/1000: (1 − 0.0015)^150 = 0.8.

– Risk of 1/1000: (1 − 0.001)^150 = 0.86.

– Risk of 1/200: (1 − 0.005)^150 = 0.47.

– Risk of 1/100: (1 − 0.01)^150 = 0.22.

– Risk of 1/50: (1 − 0.02)^150 = 0.048.

– Risk of 1/25: (1 − 0.04)^150 = 0.002.

As we see in the previous series, our 5% “security” limit is reached when the risk rises to 1/50 (2% or 0.02). This means that, with a 5% probability of being wrong, the risk of fildulastrosis while taking antifildulin is equal to or less than 2%. In other words, the 95% confidence interval of our estimate would range from 0 to 0.02 (and not be a flat 0, as a simplistic calculation would suggest).

To prevent our reheated neurons from eventually melting, let’s see a simpler way to automate this process. For this we use what is known as the rule of 3. If we do the study with n patients and none present the event, we can affirm that the probability of the event is not zero, but less than or equal to 3/n. In our example, 3/150 = 0.02, the probability we calculate with the laborious method above. We will arrive at this rule after solving the equation we use with the previous method:

(1 − maximum risk)^n = 0.05

First, we rewrite it:

1 − maximum risk = 0.05^(1/n)

If n is greater than 30, 0.05^(1/n) is approximately (n − 3)/n, which is the same as 1 − (3/n). In this way, we can rewrite the equation as:

1 − maximum risk = 1 − (3/n)

With which we can solve the equation and get the final rule:

Maximum risk = 3/n.

You may have noticed that we required n to be greater than 30. This is because, below 30, the rule tends to slightly overestimate the risk, which we will have to take into account if we use it with small samples.
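
To check the approximation, we can compare the exact solution of the equation with the rule of 3 for a few sample sizes (a small sketch):

```python
# Exact bound: solve (1 - risk)^n = 0.05 for the risk; compare with 3/n.
def exact_max_risk(n, alpha=0.05):
    return 1 - alpha ** (1 / n)

def rule_of_three(n):
    return 3 / n

for n in (30, 150, 1500):
    print(n, round(exact_max_risk(n), 4), round(rule_of_three(n), 4))
```

As the text says, 3/n sits slightly above the exact bound, and the two converge as n grows.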

We’re leaving…

And with this we will end this post with some considerations. First, and as is easy to imagine, statistical programs calculate confidence intervals for the risk without much effort, even if the numerator is zero. Similarly, it can also be done manually, and much more elegantly, by resorting to the Poisson probability distribution, although the result is similar to that obtained with the rule of 3.
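
For the curious, the Poisson route mentioned above boils down to noting that, with zero observed events, the largest expected count λ still compatible with the data at the 5% level satisfies e^(−λ) = 0.05, that is, λ = −ln(0.05) ≈ 3; dividing by n recovers the rule of 3:

```python
import math

# Largest expected event count compatible with observing 0 events at the 5% level.
lam = -math.log(0.05)   # solves exp(-lam) = 0.05
n = 150                 # sample size from the example
print(round(lam, 3), round(lam / n, 4))  # -> 2.996 0.02
```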

Second, what happens if the numerator is not 0 but a small number? Can a similar rule be applied? The answer, again, is yes. Although there is no general rule, extensions of the rule have been developed for a number of events up to 4. But that’s another story…

Powerful gentleman


Critical appraisal of economic valuations

Yes, as the illustrious Francisco de Quevedo y Villegas once said, a powerful gentleman is Don Dinero (Mr. Money). A great truth because who, however purely in love, does not humble himself before the golden metal? And even more so in a mercantilist and materialist society like ours.

But the problem is not that we are materialistic and just think about money. The problem is that nobody believes they have all the money they need. Even the wealthiest would like to have much more money. And many times, it is true, we do not have enough money to cover all our needs as we would like.

And that does not only happen at the individual level, but also at the level of social groups. Any country has a limited amount of money, which is why it cannot spend on everything it wants and has to choose where the money goes. Let’s think, for example, of our healthcare system, in which new health technologies (new treatments, new diagnostic techniques, etc.) are getting better… and more expensive (sometimes even bordering on the obscene). If we are spending at the limit of our possibilities and want to apply a new treatment, we only have two choices: either we increase our wealth (where do we get the money from?) or we stop spending on something else. There is a third one that is used frequently, even if it is not the right thing to do: to spend what we do not have and pass on the debt to whoever comes next.

Yes, my friends, the saying that health is priceless does not hold up economically. Resources are always limited and we must all be aware of the so-called opportunity cost of a product: the money spent on it is money that can no longer be spent on something else.

Therefore, it is very important to properly evaluate any new health technology before deciding on its implementation in the health system, and this is why the so-called economic evaluation studies have been developed, aimed at identifying which actions should be prioritized to maximize the benefits produced in an environment with limited resources. These studies are a tool to assist decision-making, but they do not aim to replace it, so other elements have to be taken into account, such as justice, equity and freedom of choice.

Economic evaluation studies

Economic evaluation (EV) studies encompass a whole set of specific methodology and terminology that is usually little known to those not dedicated to the evaluation of health technologies. Let’s briefly review their characteristics and finally give some recommendations on how to make a critical appraisal of these studies.

The first thing is to explain the two characteristics that define an EV: the measurement of the costs and benefits of the interventions (the first) and the choice or comparison between two or more alternatives (the second). These two features are essential to say that we are facing an EV, which can be defined as the comparative analysis of different health interventions in terms of costs and benefits. The development methodology of an EV will have to take into account a number of aspects that we list below and that you can see summarized in the attached table.

– Objective of the study. It should determine whether the use of the new technology is justified in terms of the benefits it produces. For this, a research question will be formulated with a structure similar to that of other types of epidemiological studies.

– Perspectives of the analysis. This is the point of view of the person or institution for whom the analysis is intended, which determines which costs and benefits must be taken into account from the chosen standpoint. The most global perspective is that of society, although that of the funders, that of specific organizations (for example, hospitals) or that of patients and families can also be adopted. The most usual choice is the funders’ perspective, sometimes accompanied by the social one. If so, both must be well differentiated.

– Time horizon of the analysis. It is the period of time during which the main economic and health effects of the intervention are evaluated.

– Choice of the comparator. This is a crucial point for determining the incremental effectiveness of the new technology, and one on which the importance of the study for decision makers will largely depend. In practice, the most commonly used comparator is the alternative in habitual use (the gold standard), although the comparison can sometimes be made with the non-treatment option, a choice that must be justified.

– Identification of costs. Costs are usually considered taking into account the total amount of the resource consumed and the monetary value of the resource unit (you know, as the friendly hostesses of an old TV contest used to say: 25 answers, at 5 pesetas each, 125 pesetas). Costs are classified as direct or indirect, and as health or non-health costs. The direct ones are those clearly related to the illness (hospitalization, laboratory tests, laundry and kitchen, etc.), while the indirect ones refer to productivity or its loss (work functionality, mortality). On the other hand, health costs are those related to the intervention (medicines, diagnostic tests, etc.), while non-health costs are those that the patient or other entities have to pay, or those related to productivity.

What costs will be included in an EV? It will depend on the intervention being analyzed and, especially, on the perspective and time horizon of the analysis.

– Quantification of costs. It will be necessary to determine the amount of resources used, either individually or in aggregate, depending on the information available.

– Cost assessment. They will be assigned a unit price, specifying the source and the method used to assign this price. When the study covers long periods of time, it must be borne in mind that things do not cost the same over the years. If I tell you that I knew a time when you went out at night with a thousand pesetas (the equivalent of about 6 euros now) and came back home with money in your pocket, you will think it is another of my frequent ravings, but I swear it is true.

To take this into account, a weighting factor or discount rate is used, usually between 3% and 6%. For the curious, the general formula is CV = FV / (1 + d)^n, where CV is the current value, FV the future value, n the number of years and d the discount rate.
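
A tiny sketch of that formula with hypothetical figures: a cost of 1000 (in whatever currency) incurred 5 years from now, discounted at 3% per year.

```python
# Present (current) value of a future cost: CV = FV / (1 + d)^n
def present_value(fv, d, n):
    return fv / (1 + d) ** n

print(round(present_value(1000, 0.03, 5), 2))  # -> 862.61
```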

– Identification, measurement and evaluation of results. The benefits obtained can be classified into health and non-health benefits. Health benefits are clinical consequences of the intervention, generally measured from a point of view of interest to the patient (improvement in blood pressure figures, deaths avoided, etc.). The non-health ones, in turn, are divided according to whether they produce improvements in productivity or in quality of life.

The first ones are easy to understand: productivity can improve because people return to work earlier (shorter hospitalization, shorter convalescence) or because they work better thanks to the improvement in the worker’s health. The second ones are related to the concept of health-related quality of life, which reflects the impact of the disease and its treatment on the patient.

The quality of life related to health can be estimated using a series of questionnaires on the preferences of patients, summarized in a single score value that, together with the amount of life, will provide us with the quality-adjusted life year (QALY).

To assess quality of life we refer to the utilities of health states, expressed as a numerical value between 0 and 1, in which 0 represents the utility of death and 1 that of perfect health. In this sense, a year of life lived in perfect health is equivalent to 1 QALY (1 year of life x 1 utility = 1 QALY). Thus, to determine the value in QALYs we multiply the utility associated with a state of health by the years lived in that state. For example, half a year in perfect health (0.5 years x 1 utility) would be equivalent to one year lived with some ailments (1 year x 0.5 utility).
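
The QALY arithmetic from this paragraph, sketched in code (the utilities are hypothetical):

```python
# QALYs = utility of the health state x years lived in that state.
def qalys(utility, years):
    return utility * years

# Half a year in perfect health equals one year at utility 0.5.
print(qalys(1.0, 0.5) == qalys(0.5, 1.0))  # -> True
print(qalys(0.8, 10))                      # -> 8.0
```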

– Type of economic analysis. We can choose between four types of economic analysis.

The first is cost-minimization analysis, used when there is no difference in effect between the two options compared, a situation in which it is enough to compare the costs and choose the cheapest. The second is cost-effectiveness analysis, used when the interventions are similar; it determines the relationship between the costs and consequences of the interventions in units commonly used in clinical practice (decrease in days of admission, for example). The third is cost-utility analysis, similar to cost-effectiveness but with effectiveness adjusted for quality of life, so the outcome is the QALY. Finally, the fourth is cost-benefit analysis, in which everything is measured in monetary units; we usually understand these quite well, although it can be a little complicated to express health gains with them.

– Analysis of results. The analysis will depend on the type of economic analysis used. In the case of cost-effectiveness studies, it is typical to calculate two measures: the average cost-effectiveness (dividing the cost by the benefit) and the incremental cost-effectiveness (the extra cost per unit of additional benefit obtained with one option with respect to the other). This last parameter is important, since it constitutes a limit on the efficiency of the intervention, which will be chosen or not depending on how much we are willing to pay for an additional unit of effectiveness.
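
These two measures can be sketched as follows, with hypothetical costs and effects (effects in QALYs, say):

```python
# Average and incremental cost-effectiveness for two hypothetical options.
cost_std, effect_std = 1000.0, 4.0   # standard option: cost, effect (e.g. QALYs)
cost_new, effect_new = 2500.0, 5.0   # new option

avg_ce_std = cost_std / effect_std   # average cost per unit of effect
avg_ce_new = cost_new / effect_new
icer = (cost_new - cost_std) / (effect_new - effect_std)  # incremental ratio

print(avg_ce_std, avg_ce_new, icer)  # -> 250.0 500.0 1500.0
```

Here the new option costs 1500 per extra unit of effect, which is the figure a decision maker would weigh against the willingness to pay.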

– Sensitivity analysis. As with other types of designs, EVs are not free of uncertainty, generally due to the lack of reliability of the available data. Therefore, it is convenient to evaluate the degree of uncertainty through a sensitivity analysis, to check the stability of the results and how they may change if the main variables vary. An example may be varying the chosen discount rate.

There are five types of sensitivity analysis: univariate (the study variables are modified one by one), multivariate (two or more are modified), extremes (we place ourselves in the most optimistic and most pessimistic scenarios for the intervention), threshold (identifying whether there is a critical value above or below which the choice flips towards one or the other of the compared interventions) and probabilistic (assuming a certain probability distribution for the uncertainty of the parameters used).

– Conclusion. This is the last section of the development of an EV. The conclusions should take into account two aspects: internal validity (a correct analysis for the patients included in the study) and external validity (the possibility of extrapolating the conclusions to other groups of similar patients).

As we said at the beginning of this post, EVs have a lot of jargon and methodological aspects of their own, which makes critical appraisal and a correct understanding of their content difficult. But let no one get discouraged: we can do it by relying on our three basic pillars: validity, relevance and applicability.

There are multiple guides that systematically explain how to assess an EV. Perhaps the first to appear was that of the British NICE (National Institute for Clinical Excellence), but others have arisen subsequently, such as that of the Australian PBAC (Pharmaceutical Benefits Advisory Committee) and that of the Canadian CADTH (Canadian Agency for Drugs and Technologies in Health). In Spain we could not lag behind, and the Laín Entralgo Health Technology Assessment Unit also developed an instrument to determine the quality of an EV. This guide establishes recommendations for 17 domains that closely resemble what we have said so far, complete with a checklist to facilitate the assessment of the quality of the EV.

Critical appraisal of economic valuations

Anyway, as my usual sufferers know, I prefer to use a simpler checklist that is freely available on the Internet, which is none other than the tool provided by the CASPe group, which you can download from their website. We are going to follow these 11 CASPe questions, without losing sight of the recommendations of the Spanish guide we have mentioned.

As always, we will start with the VALIDITY, trying to answer first two elimination questions. If the answer is negative, we can leave the study aside and dedicate ourselves to another more productive task.

Is the question or objective of the evaluation well defined? The research question should be clear and define the target population of the study. Three fundamental aspects should also be clear in the objective: the options compared, the perspective of the analysis and the time horizon. Is there a sufficient description of all possible alternatives and their consequences? The actions to follow must be perfectly defined for all the compared options, including who applies each action, where and to whom. The usual approach is to compare the new technology, at least, with the one in habitual use, always justifying the choice of the comparator, especially if this is the non-treatment option (in the case of pharmacological interventions).

If we have been able to answer these two questions affirmatively, we move on to the four detailed questions. Is there evidence of the effectiveness of the intervention or of the evaluated program? We will check whether there are trials, reviews or other previous studies that prove the effectiveness of the interventions. Think of a cost-minimization study, in which we want to know which of two options, both effective, is cheaper: logically, we will need prior evidence of this effectiveness. Are the effects of the intervention (or interventions) identified, measured and appropriately valued? These effects can be measured in simple units, often derived from clinical practice, in monetary units, or in more elaborate calculation units, such as the QALYs mentioned above. Are the costs incurred by the intervention(s) identified, measured and appropriately valued? The resources used must be well identified and measured in the appropriate units, specifying the method and source used to assign their value, as we have already mentioned. Finally, were discount rates applied to the costs of the intervention(s)? And to the effects? As we already know, this is fundamental when the time horizon of the study is long. In Spain, a discount rate of 3% is recommended for basic resources. In the sensitivity analysis this rate will be tested between 0% and 5%, which will allow comparison with other studies.

Once the internal validity of our EV has been assessed, we answer the questions regarding the RELEVANCE of the results. Firstly, what are the results of the evaluation? We will review the units that have been used (QALYs, monetary costs, etc.) and whether incremental benefit analyses have been carried out, where appropriate. The second question in this section asks whether an adequate sensitivity analysis has been carried out, to know how the results would vary with changes in costs or effectiveness. In addition, it is recommended that the authors justify the modifications made with respect to the base case, the choice of the variables that are modified and the method used in the sensitivity analysis. Our Spanish guide recommends carrying out, whenever possible, a probabilistic sensitivity analysis, detailing all the statistical tests performed and the confidence intervals of the results.

Finally, we will assess the APPLICABILITY, or external validity, of our study by answering the last three questions. Would the program be equally effective in your environment? It will be necessary to consider whether the target population, the perspective, the availability of technologies, etc., are applicable to our clinical context. We must also reflect on whether the costs would be transferable to our environment and whether it would be worthwhile to apply the intervention there. This may depend on social, political, economic and population differences between our environment and the one in which the study was carried out.

We’re leaving…

And with this we are going to finish this post for today. Even if your mind is already spinning after all we have said, you can believe me if I tell you that we have done nothing but scratch the surface of this stormy world of economic evaluation studies. We have not discussed anything, for example, about the statistical methods that can be used in sensitivity analyses, which can become complicated, nor about studies using modeling, which employ techniques only available to privileged minds, like Markov chains, stochastic models or discrete event simulation models, to name a few. Neither have we talked about the type of studies on which economic evaluations are based. These can be experimental or observational studies, but they have a series of peculiarities that differentiate them from other studies of similar design but different functions. This is the case of clinical trials that incorporate an economic evaluation (also known as piggy-back clinical trials), which tend to have a more pragmatic design than conventional trials. But that is another story…

Little ado about too much


Critical appraisal of meta-analysis

Yes, I know the saying goes just the opposite. But that is precisely the problem we have with so much new information technology. Today anyone can write and make public whatever goes through his head, reaching a lot of people, even if what he says is bullshit (and no, I do not take this personally; not even my brother-in-law reads what I post!). The trouble is that much of what is written is not worth a thing, not to compare it to any type of excreta. There is a lot of smoke and little fire, when we would all like the opposite.

The same happens in medicine when we need information to make some of our clinical decisions. Whatever source we turn to, the volume of information will not only overwhelm us; above all, most of it will not serve us at all. Also, even if we find a well-done article, it may not be enough to answer our question completely. That is why we love so much the literature reviews that some generous souls publish in medical journals. They save us the task of reviewing a lot of articles and summarizing their conclusions. Great, isn't it? Well, sometimes it is, sometimes it is not. As with any other type of study in the medical literature, we should always make a critical appraisal and not rely solely on the good know-how of its authors.

Reviews, of which we already know there are two types, also have their limitations, which we must know how to weigh. The simplest form of review, our favorite when we are young and ignorant, is what is known as a narrative review or author's review. This type of review is usually done by an expert in the topic, who reviews the literature and analyzes what she finds as she believes it is worth (that is why she is an expert), summarizing it in a qualitative synthesis with her expert's conclusions. These types of reviews are good for getting a general idea about a topic, but they do not usually serve to answer specific questions. In addition, since it is not specified how the information search was done, we cannot reproduce it or verify that it includes everything important that has been written on the subject. With these reviews we can do little critical appraising, since there is no precise systematization of how these summaries have to be prepared, so we will have to trust unreliable aspects such as the prestige of the author or the impact of the journal where it is published.

As our knowledge of the general aspects of science increases, our interest shifts towards other types of reviews that provide us with more specific information about aspects that escape our increasingly broad knowledge. This other type of review is the so-called systematic review (SR), which focuses on a specific question, follows a clearly specified methodology for searching and selecting information, and performs a rigorous and critical analysis of the results found. Moreover, when the primary studies are sufficiently homogeneous, the SR goes beyond the qualitative synthesis and also performs a quantitative synthesis, which goes by the nice name of meta-analysis. With these reviews we can do a critical appraisal following an ordered and pre-established methodology, in a similar way as we do with other types of studies.

The prototype of the SR is the one made by the Cochrane Collaboration, which has developed a specific methodology that you can consult in the manuals available on its website. But, if you want my advice, do not trust even Cochrane, and make a careful critical appraisal even if the review has been done by them, not taking it for granted simply because of its origin. As one of my teachers in these disciplines says (I am sure he is smiling if he is reading these lines), there is life after Cochrane. And, besides, there is a lot of it, and good, I would add.

Critical appraisal of meta-analyses

Although SRs and meta-analyses impose a bit of respect at first, do not worry: they can be critically appraised in a simple way by considering the main aspects of their methodology. And to do it, nothing better than to systematically review our three pillars: validity, relevance and applicability.

Regarding VALIDITY, we will try to determine whether or not the review gives us unbiased results and responds correctly to the question posed. As always, we will look for some primary validity criteria. If these are not fulfilled, we should consider whether it is already time to walk the dog: we will probably make better use of the time.

Has the aim of the review been clearly stated? All SRs should try to answer a specific question that is relevant from the clinical point of view, and that usually arises following the PICO scheme of a structured clinical question. It is preferable that the review tries to answer only one question, since if it tries to respond to several there is a risk of not responding adequately to any of them. This question will also determine the type of studies that the review should include, so we must assess whether the appropriate type has been included. Although the most common is to find SRs of clinical trials, they can include other types of studies: observational studies, diagnostic test studies, etc. The authors of the review must specify the criteria for inclusion and exclusion of the studies, in addition to considering aspects such as the setting, the study groups, the outcomes, etc. Differences among the included studies in terms of (P) patients, (I) intervention or (O) outcomes can make two SRs that ask the same question reach different conclusions.

If the answer to the two previous questions is affirmative, we will consider the secondary criteria and leave the dog's walk for later. Have the important studies on the subject been included? We must verify that a global and unbiased search of the literature has been carried out. It is usual to do an electronic search including the most important databases (generally PubMed, Embase and the Cochrane Library), but this must be completed with a search strategy in other media to look for other works (references of the articles found, contact with well-known researchers, the pharmaceutical industry, national and international registries, etc.), including the so-called grey literature (theses, reports, etc.), since there may be important unpublished works. And let no one be surprised by the latter: it has been proven that studies with negative conclusions are at greater risk of not being published, so they do not appear in the SR. We must verify that the authors have ruled out the possibility of this publication bias. In general, this entire selection process is usually captured in a flow diagram that shows the fate of all the studies assessed in the SR.

It is very important that enough has been done to assess the quality of the studies, looking for the existence of possible biases. For this, the authors can use a tool designed ad hoc or, more usually, resort to one that is already recognized and validated, such as the bias detection tool of the Cochrane Collaboration, in the case of reviews of clinical trials. This tool assesses five criteria of the primary studies to determine their risk of bias: adequate randomization sequence (prevents selection bias), adequate masking (prevents performance and detection biases, both information biases), concealment of allocation (prevents selection bias), losses to follow-up (prevents attrition bias) and selective outcome reporting (prevents information bias). The studies are classified as high, low or indeterminate risk of bias according to the most important aspects of the design's methodology (clinical trials in this case).

In addition, this must be done independently by two authors and, ideally, without knowing the authors of the studies or the journals where the primary studies of the review were published. Finally, the degree of agreement between the two reviewers should be recorded, along with what they did when they did not agree (the most common is to resort to a third party, who will probably be the boss of both).

To conclude with the internal or methodological validity, in case the results of the studies have been combined to draw common conclusions with a meta-analysis, we must ask ourselves whether it was reasonable to combine the results of the primary studies. It is fundamental, in order to draw conclusions from combined data, that the studies are homogeneous and that the differences among them are due solely to chance. Although some variability among the studies increases the external validity of the conclusions, we cannot pool the data for the analysis if there is a lot of variability. There are numerous methods to assess homogeneity which we will not go into now, but we do want to insist on the need for the authors of the review to have studied it adequately.

In summary, the fundamental aspects that we will have to analyze to assess the validity of a SR will be: 1) that the aims of the review are well defined in terms of population, intervention and measurement of the result; 2) that the bibliographic search has been exhaustive; 3) that the criteria for inclusion and exclusion of primary studies in the review have been adequate; and 4) that the internal or methodological validity of the included studies has also been verified. In addition, if the SR includes a meta-analysis, we will review the methodological aspects that we saw in a previous post: the suitability of combining the studies to make a quantitative synthesis, the adequate evaluation of the heterogeneity of the primary studies and the use of a suitable mathematical model to combine the results of the primary studies (you know, that of the fixed effect and random effects models).
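To give a flavor of what the quantitative synthesis involves, here is a minimal sketch of fixed-effect (inverse-variance) pooling, the simplest of the models mentioned above. The three trials, their odds ratios and their standard errors are invented for illustration; real meta-analysis software does much more (heterogeneity statistics, random-effects models, forest plots):

```python
import math

# Hypothetical log odds ratios and standard errors from three primary trials
log_ors = [math.log(0.8), math.log(0.6), math.log(0.7)]
std_errs = [0.20, 0.25, 0.15]

# Fixed-effect model: each study is weighted by the inverse of its variance,
# so larger (more precise) studies contribute more to the pooled result
weights = [1 / se ** 2 for se in std_errs]
pooled_log_or = sum(w * y for w, y in zip(weights, log_ors)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

# Back-transform to the odds ratio scale with a 95% confidence interval
pooled_or = math.exp(pooled_log_or)
ci_low = math.exp(pooled_log_or - 1.96 * pooled_se)
ci_high = math.exp(pooled_log_or + 1.96 * pooled_se)
```

If the interval (ci_low, ci_high) does not cross 1, the pooled effect is statistically significant, which is exactly what the diamond in a forest plot shows graphically.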

Regarding the RELEVANCE of the results, we must consider what the overall result of the review is and whether the interpretation has been made in a judicious manner. The SR should provide a global estimate of the effect of the intervention based on a weighted average of the results of the quality studies included. Most often, relative measures such as the risk ratio or the odds ratio are expressed, although ideally they should be complemented with absolute measures such as the absolute risk reduction or the number needed to treat (NNT). In addition, we must assess the precision of the results, for which we will use our beloved confidence intervals, which will give us an idea of the precision of the estimate of the true magnitude of the effect in the population. As you can see, the way of assessing the relevance of the results is practically the same as assessing the relevance of the results of the primary studies. Here we give examples of clinical trials, which is the type of study we will see most frequently, but remember that there may be other types of studies that can better express the relevance of their results with other parameters. Of course, confidence intervals will always help us to assess the precision of the results.

The results of meta-analyses are usually represented in a standardized way, most often using the so-called forest plot. A graph is drawn with a vertical line of no effect (at one for relative risks and odds ratios, and at zero for mean differences), and each study is represented as a mark (its result) in the middle of a segment (its confidence interval). Studies whose results reach statistical significance are those whose segments do not cross the vertical line. Generally, the most powerful studies have narrower intervals and contribute more to the overall result, which is expressed as a diamond whose lateral ends represent its confidence interval. Only diamonds that do not cross the vertical line will have statistical significance. Also, the narrower the interval, the more precise the result. And, finally, the further away from the line of no effect, the clearer the difference between the treatments or the exposures compared will be.

If you want a more detailed explanation about the elements that make up a forest plot, you can go to the previous post where we explained it or to the online manuals of the Cochrane’s Collaboration.

We will conclude the critical appraisal of the SR by assessing the APPLICABILITY of the results to our environment. We will have to ask ourselves whether we can apply the results to our patients and how they will influence the care we give them. We will have to see whether the primary studies of the review describe the participants and whether they resemble our patients. In addition, although we have already said that it is preferable for the SR to be oriented to a specific question, it will be necessary to see whether all the relevant results have been considered for decision making in the problem under study, since sometimes it will be convenient to consider some additional secondary variable. And, as always, we must assess the benefit-cost-risk ratio. The fact that the conclusion of the SR seems valid does not mean that we have to apply it in a compulsory way.

If you want to evaluate a SR correctly without forgetting any important aspect, I recommend you use a checklist such as PRISMA's or some of the tools available on the Internet, such as the grids that can be downloaded from the CASP page, which are the ones we have used for everything we have said so far.

PRISMA statement

The PRISMA statement (Preferred Reporting Items for Systematic reviews and Meta-Analyses) consists of 27 items, classified in 7 sections that correspond to the sections of title, summary, introduction, methods, results, discussion and financing:

  1. Title: it must be identified as SR, meta-analysis or both. If, in addition, it is specified that it deals with clinical trials, this will take precedence over other types of reviews.
  2. Summary: it should be a structured summary that should include background, objectives, data sources, inclusion criteria, limitations, conclusions and implications. The registration number of the revision must also be included.
  3. Introduction: includes two items, the justification of the study (what is known, controversies, etc) and the objectives (what question tries to answer in PICO terms of the structured clinical question).
  4. Methods. It is the section with the largest number of items (12):

– Protocol and registration: indicate the registration number and its availability.

– Eligibility criteria: justification of the characteristics of the studies and the search criteria used.

– Sources of information: describe the sources used and the last search date.

– Search: complete electronic search strategy, so that it can be reproduced.

– Selection of studies: specify the selection process and inclusion’s and exclusion’s criteria.

– Data extraction process: describe the methods used to extract the data from the primary studies.

– Data list: define the variables used.

– Risk of bias in primary studies: describe the method used and how it has been used in the synthesis of results.

– Summary measures: specify the main summary measures used.

– Results synthesis: describe the methods used to combine the results.

– Risk of bias between studies: describe biases that may affect cumulative evidence, such as publication bias.

– Additional analyses: if additional analyses are performed (sensitivity, meta-regression, etc.), specify which were pre-specified.

  5. Results. Includes 7 items:

– Selection of studies: it is expressed through a flow chart that assesses the number of records in each stage (identification, screening, eligibility and inclusion).

– Characteristics of the studies: present the characteristics of the studies from which data were extracted and their bibliographic references.

– Risk of bias in the studies: communicate the risks in each study and any evaluation that is made about the bias in the results.

– Results of the individual studies: study data for each study or intervention group and estimation of the effect with their confidence interval. The ideal is to accompany it with a forest plot.

– Synthesis of the results: present the results of all the meta-analysis performed with the confidence intervals and the consistency measures.

– Risk of bias between the studies: present any evaluation made of the risk of bias between the studies.

– Additional analyzes: if they have been carried out, provide the results of the same.

  6. Discussion. Includes 3 items:

– Summary of the evidence: summarize the main findings with the strength of the evidence of each main result and the relevance from the clinical point of view or of the main interest groups (care providers, users, health decision-makers, etc.).

– Limitations: discuss the limitations of the results, the studies and the review.

– Conclusions: general interpretation of the results in context with other evidences and their implications for future research.

  7. Financing: describe the sources of funding and the role they played in the realization of the SR.

As a third option to these two tools, you can also use the aforementioned Cochrane Handbook for Systematic Reviews of Interventions, available on its website, whose purpose is to help the authors of Cochrane reviews to work explicitly and systematically.

We’re leaving…

As you can see, we have said practically nothing about meta-analysis itself, with all its statistical techniques to assess homogeneity and its fixed-effect and random-effects models. The thing is that meta-analysis is a beast that must be eaten separately, so we have already devoted two posts only to it, which you can check whenever you want. But that is another story…

The guard’s dilemma


Sensitivity and specificity

The world of medicine is a world of uncertainty. We can never be 100% sure of anything, however obvious a diagnosis may seem, but neither can we fire away right and left with ultramodern diagnostic techniques or treatments (which are never harmless) when making the decisions that continually haunt us in our daily practice.

That’s why we are always immersed in a world of probabilities, where the certainties are almost as rare as the so-called common sense which, as almost everyone knows, is the least common of the senses.

Imagine you are in the clinic and a patient comes in because he has been kicked in the ass, pretty hard, though. As the good doctors we are, we ask the usual what's wrong?, since when?, and what do you attribute it to?, and we proceed to a complete physical examination, discovering with horror that he has a hematoma on the right buttock.

Different approaches for a differential diagnosis

Here, my friends, the diagnostic possibilities are numerous, so the first thing we do is a comprehensive differential diagnosis. To do this, we can take four different approaches. The first is the possibilistic approach: listing all possible diagnoses and trying to rule them all out simultaneously by applying the relevant diagnostic tests. The second is the probabilistic approach: sorting the diagnoses by relative likelihood and then acting accordingly. It looks like a post-traumatic hematoma (known as the kick-in-the-ass syndrome), but someone might think that the kick was not that hard, so maybe the poor patient has a bleeding disorder, or a blood dyscrasia with secondary thrombocytopenia, or even an atypical inflammatory bowel disease with extraintestinal manifestations and gluteal vascular fragility. We could also use a prognostic approach and try to confirm or rule out the possible diagnoses with the worst prognosis, so the diagnosis of the kick-in-the-ass syndrome loses interest and we would go on to rule out a chronic leukemia. Finally, a pragmatic approach could be used, with particular interest in first finding the diagnoses that have the most effective treatment (the kick would be, once more, number one).

It seems that the right thing is to use a judicious combination of the pragmatic, probabilistic and prognostic approaches. In our case we will investigate whether the intensity of the injury justifies the magnitude of the bruising and, if so, we will indicate some hot towels and refrain from further diagnostic tests. And this example may seem like bullshit, but I can assure you I know people who make the complete list and order the diagnostic tests whenever there is any symptom, regardless of expense or risk. And, besides, someone could always think of assessing the possibility of performing some exotic diagnostic test that I cannot even imagine, so the patient should be grateful if the diagnosis does not require a forced anal sphincterotomy. And that is so because, as we have already said, the waiting list to get some common sense is many times longer than the surgical waiting list.

The two thresholds

Now imagine another patient with a less stupid and absurd symptom complex than the previous example. For instance, let's think about a child with symptoms of celiac disease. Before we perform any diagnostic test, our patient already has a certain probability of suffering the disease. This probability is conditioned by the prevalence of the disease in the population she comes from and is called the pre-test probability. This probability will stand somewhere between two thresholds: the diagnostic threshold and the therapeutic threshold.

The usual thing is that the pre-test probability of our patient does not allow us to rule out the disease with reasonable certainty (it would have to be very low, below the diagnostic threshold) or to confirm it with sufficient certainty to start treatment (it would have to be above the therapeutic threshold).

We will then perform the indicated diagnostic test, obtaining a new probability of disease depending on its result, the so-called post-test probability. If this probability is high enough to make a diagnosis and initiate treatment, we will have crossed our first threshold, the therapeutic one. There will be no need for additional tests, as we will have enough certainty to confirm the diagnosis and treat the patient, always within a range of uncertainty.

And what determines our therapeutic threshold? Well, several factors are involved. The greater the risk, cost or adverse effects of the treatment in question, the higher the threshold we will demand before treating. On the other hand, the more serious the consequences of missing the diagnosis, the lower the therapeutic threshold we will accept.

But it may be that the post-test probability is so low that it allows us to rule out the disease with reasonable confidence. We shall then have crossed our second threshold, the diagnostic one, also called the no-test threshold. Clearly, in this situation further diagnostic tests are not indicated and, of course, neither is starting treatment.

However, very often the change from pre-test to post-test probability still leaves us in no man's land, without reaching either of the two thresholds, so we will have to perform additional tests until we reach one of the two limits.

And this is our everyday need: to know the post-test probability of our patients in order to decide whether we discard or confirm the diagnosis, whether we leave the patient alone or lash out at her with our treatments. And this is so because the simplistic approach that a patient is sick if the diagnostic test is positive and healthy if it is negative is totally wrong, even if it is the general belief among those who order the tests. We will have to look, then, for some parameter that tells us how useful a specific diagnostic test can be for the purpose we need: to know the probability that the patient suffers the disease.

The guard’s dilemma

And this reminds me of the enormous problem a brother-in-law asked me about the other day. The poor man is very concerned with a dilemma that has arisen. The thing is that he is going to start a small business and he wants to hire a security guard to stay at the entrance door and watch for those who take something without paying for it. And the problem is that there are two candidates and he does not know which of the two to choose. One of them stops nearly everyone, so no burglar escapes. Of course, many honest people are offended when they are asked to open their bags before leaving, and so next time they will buy elsewhere. The other guard is the opposite: he stops almost no one, but whoever he does stop is almost certainly carrying something stolen. He offends few honest people, but too many grabbers escape. A difficult decision…

Why does my brother-in-law come to me with this story? Because he knows that I face similar dilemmas every day, whenever I have to choose a diagnostic test to know whether a patient is sick and whether I have to treat her. We have already said that the positivity of a test does not guarantee the diagnosis, just as the bad looks of a customer do not ensure that the poor man has robbed us.

Let's see it with an example. When we want to know the utility of a diagnostic test, we usually compare its results with those of a reference or gold standard, which is a test that, ideally, is always positive in sick patients and negative in healthy people. Now let's suppose that I perform a study in my hospital office with a new diagnostic test to detect a certain disease and I get the results from the attached table (the patients are those with a positive reference test and the healthy ones those with a negative one).

Sensitivity and specificity

Let's start with the easy part. We have 1598 subjects, 520 of them sick and 1078 healthy. The test gives us 446 positive results, 428 true (TP) and 18 false (FP). It also gives us 1152 negatives, 1060 true (TN) and 92 false (FN). The first thing we can determine is the ability of the test to distinguish between the healthy and the sick, which leads me to introduce the first two concepts: sensitivity (Se) and specificity (Sp). Se is the likelihood that the test correctly classifies a sick subject or, in other words, the probability that a sick subject gets a positive result. It is calculated by dividing TP by the number of sick. In our case it equals 0.82 (if you prefer to use percentages, multiply by 100). Moreover, Sp is the likelihood that the test correctly classifies a healthy subject or, put another way, the probability that a healthy subject gets a negative result. It is calculated by dividing TN by the number of healthy. In our example, it equals 0.98.
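The arithmetic above can be checked in a few lines. This sketch just reproduces the calculations from the 2 by 2 table of the example (the variable names are mine, not any standard API):

```python
# Counts from the 2x2 table of the hospital example
tp, fp = 428, 18      # positive test: true and false positives
tn, fn = 1060, 92     # negative test: true and false negatives

sick = tp + fn        # 520 subjects with the disease (gold standard positive)
healthy = tn + fp     # 1078 healthy subjects (gold standard negative)

sensitivity = tp / sick      # probability that a sick subject tests positive
specificity = tn / healthy   # probability that a healthy subject tests negative
```

Rounded to two decimals, sensitivity comes out as 0.82 and specificity as 0.98, as in the text.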

Someone may think that we have finished assessing the value of the new test, but we have just begun. And this is because with Se and Sp we somehow measure the ability of the test to discriminate between the healthy and the sick, but what we really need to know is the probability that an individual with a positive result is actually sick and, although these may seem similar concepts, they are actually quite different.

The probability that a positive is sick is known as the positive predictive value (PPV) and is calculated by dividing the number of sick with a positive test by the total number of positives. In our case it is 0.96. This means that a positive has a 96% chance of being sick. Moreover, the probability that a negative is healthy is expressed by the negative predictive value (NPV), which is the quotient of healthy with a negative test by the total number of negatives. In our example it equals 0.92 (an individual with a negative result has a 92% chance of being healthy). This is already looking more like what we said at the beginning that we needed: the post-test probability that the patient is really sick.
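Continuing with the same table, the predictive values are just the other two quotients of the 2 by 2 table, now taken by rows (test result) instead of by columns (true status):

```python
# Same counts from the 2x2 table of the hospital example
tp, fp = 428, 18
tn, fn = 1060, 92

ppv = tp / (tp + fp)   # probability that a subject with a positive result is sick
npv = tn / (tn + fn)   # probability that a subject with a negative result is healthy
```

Rounded to two decimals this gives a PPV of 0.96 and an NPV of 0.92, matching the values in the text.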

Predictive values

And from now on is when the neurons begin to overheat. It turns out that Se and Sp are two intrinsic characteristics of the diagnostic test. Their results will be the same whenever we use the test in similar conditions, regardless of the subjects tested. But this is not so with the predictive values, which vary depending on the prevalence of the disease in the population in which we use the test. This means that the probability that a positive is sick depends on how common or rare the disease is in the population. Yes, you read that right: the same positive test expresses a different risk of being sick and, for the unbelievers, I will give another example.

Suppose that this same study is repeated by one of my colleagues who works at a community health center, where the population is proportionally healthier than at my hospital (logical, they have not suffered the hospital yet). If you check the results in the table and take the trouble to calculate them, you will come up with a Se of 0.82 and a Sp of 0.98, the same I came up with in my practice. However, if you calculate the predictive values, you will see that the PPV equals 0.9 and the NPV 0.95. And this is so because the prevalence of the disease (sick divided by total) is different in the two populations: 0.32 at my practice vs 0.19 at the health center. That is, in cases of higher prevalence a positive result is more valuable to confirm the diagnosis of disease, but a negative is less reliable to rule it out. And conversely, if the disease is very rare a negative result will reasonably rule out the disease, but a positive will be less reliable when it comes to confirming it.
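The dependence on prevalence can be made explicit with Bayes' theorem, which expresses the predictive values as a function of Se, Sp and prevalence. Note that, since this sketch uses the rounded Se = 0.82 and Sp = 0.98 rather than the exact counts of each table, the results agree with the text only up to rounding (for instance, the hospital PPV comes out here as roughly 0.95 instead of the 0.96 obtained from the raw counts):

```python
def ppv(se: float, sp: float, prev: float) -> float:
    # Bayes' theorem: proportion of positive results that are truly sick
    return se * prev / (se * prev + (1 - sp) * (1 - prev))

def npv(se: float, sp: float, prev: float) -> float:
    # Bayes' theorem: proportion of negative results that are truly healthy
    return sp * (1 - prev) / (sp * (1 - prev) + (1 - se) * prev)

se, sp = 0.82, 0.98                     # same test in both settings
hospital = ppv(se, sp, 0.32), npv(se, sp, 0.32)        # prevalence 0.32
health_center = ppv(se, sp, 0.19), npv(se, sp, 0.19)   # prevalence 0.19
```

Same Se and Sp, yet the PPV falls and the NPV rises as the prevalence drops, which is the whole point of the example.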

We see that, as almost always happens in medicine, we are moving on the shaky ground of probability, since all (absolutely all) diagnostic tests are imperfect and make mistakes when classifying the healthy and the sick. So when is a diagnostic test worth using? If you think about it, any given subject has a probability of being sick even before performing the test (the prevalence of the disease in her population), and we are only interested in using diagnostic tests if they increase that likelihood enough to justify the initiation of the appropriate treatment (otherwise we would have to do another test to reach the threshold level of probability that justifies treatment).

Likelihood ratios

And here is when this issue begins to get a little unfriendly. The positive likelihood ratio (PLR) indicates how much more probable it is to get a positive result in a sick subject than in a healthy one. The proportion of positives in the sick is represented by Se. The proportion of positives in the healthy are the FP, which are the healthy without a negative result or, what is the same, 1-Sp. Thus, PLR = Se / (1 – Sp). In our case (the hospital) it equals 41 (the value is the same whether we use proportions or percentages for Se and Sp). This can be interpreted as: it is 41 times more likely to get a positive result in a sick subject than in a healthy one.

It is also possible to calculate the NLR (negative likelihood ratio), which expresses how much more likely it is to find a negative result in a sick subject than in a healthy one. The negatives among the sick are those who do not test positive (1-Se) and the negatives among the healthy are the same as the TN (the test's Sp). So, NLR = (1 – Se) / Sp. In our example, 0.18.
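Both likelihood ratios follow directly from the two formulas just given; a quick sketch with the rounded Se and Sp of the example:

```python
se, sp = 0.82, 0.98   # rounded values from the hospital example

plr = se / (1 - sp)   # positive LR: how much more likely a positive is in the sick
nlr = (1 - se) / sp   # negative LR: how much more likely a negative is in the sick
```

This reproduces the PLR of 41 and the NLR of 0.18 quoted in the text.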

A ratio of 1 indicates that the result of the test does not change the likelihood of being sick. If it is greater than 1 the probability increases and, if less than 1, it decreases. This is the parameter used to determine the diagnostic power of the test. Values > 10 (or < 0.1 for the NLR) indicate a very powerful test that supports (or contradicts) the diagnosis; values of 5-10 (or 0.1-0.2) indicate low power of the test to support (or disprove) the diagnosis; 2-5 (or 0.2-0.5) indicate that the contribution of the test is questionable; and, finally, 1-2 (or 0.5-1) indicate that the test has no diagnostic value.

Post-test probability and Fagan's nomogram

The likelihood ratio does not express a direct probability, but it helps us to calculate the probabilities of being sick before and after testing positive by means of Bayes' rule, which says that the post-test odds are equal to the product of the pre-test odds and the likelihood ratio. To transform the prevalence into pre-test odds we use the formula odds = p / (1-p). In our case, they equal 0.47. Now we can calculate the post-test odds by multiplying the pre-test odds by the likelihood ratio. In our case, the positive post-test odds equal 19.27. And finally, we transform the post-test odds into post-test probability using the formula p = odds / (odds + 1). In our example it equals 0.95, which means that if our test is positive the probability of being sick goes from 0.32 (the pre-test probability) to 0.95 (the post-test probability).
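The whole odds gymnastics fits in one small function. This sketch applies Bayes' rule exactly as described above; it keeps full precision instead of rounding the pre-test odds to 0.47, so the intermediate odds come out as 19.29 rather than the 19.27 in the text, with the same final probability:

```python
def post_test_probability(pretest_p: float, lr: float) -> float:
    # Bayes' rule on the odds scale: post-test odds = pre-test odds x LR
    pre_odds = pretest_p / (1 - pretest_p)
    post_odds = pre_odds * lr
    # convert the post-test odds back to a probability
    return post_odds / (post_odds + 1)

# Positive result at the hospital: prevalence 0.32, PLR 41
prob_pos = post_test_probability(0.32, 41)
# Negative result at the hospital: prevalence 0.32, NLR 0.18
prob_neg = post_test_probability(0.32, 0.18)
```

A positive result takes the probability of disease from 0.32 up to about 0.95, while a negative result drops it below 0.1, which is exactly the calculation the Fagan's nomogram performs graphically.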

If there is still anyone reading at this point, I will tell you that we do not need all this gibberish to get the post-test probability. There are multiple websites with online calculators that obtain all these parameters from the initial 2 by 2 table with a minimum of effort. In addition, the post-test probability can be easily calculated using Fagan's nomogram (see the attached figure). This graph represents, in three vertical lines from left to right, the pre-test probability (represented inverted), the likelihood ratios and the resulting post-test probability.

To calculate the post-test probability after a positive result, we draw a line from the prevalence (pre-test probability) to the PLR and extend it to the post-test probability axis. Similarly, in order to calculate post-test probability after a negative result, we would extend the line between prevalence and the value of the NLR.

In this way, with this tool we can directly calculate the post-test probability by knowing the likelihood ratios and the prevalence. In addition, we can use it in populations with different prevalence, simply by modifying the origin of the line in the axis of pre-test probability.

So far we have defined the parameters that help us quantify the power of a diagnostic test, and we have seen the limitations of sensitivity, specificity and predictive values, and how the most generally useful parameters are the likelihood ratios. But, you will ask, what makes a good test? Is it a sensitive one? A specific one? Both?

Here we are going to return to the guard’s dilemma that afflicts my poor brother-in-law, because we have left him abandoned and have not yet answered which of the two guards we recommend he hire: the one who asks almost everyone to open their bags, thereby offending many honest people, or the one who almost never stops honest people but, by stopping almost no one, lets many thieves get away.

The resolution of the dilemma

And what do you think is the better choice? The simple answer is: it depends. Those of you who are still awake will have noticed that the first guard (the one who checks many people) is the sensitive one, while the second is the specific one. What is better for us, the sensitive or the specific guard? It depends, for example, on where our shop is located. If it is in a well-heeled neighborhood the first guard won’t be the best choice because, in fact, few people will be thieves and we’d prefer not to offend our customers and drive them away. But if our shop is located in front of the Cave of Ali Baba we’ll be more interested in detecting the maximum number of clients carrying stolen goods. It can also depend on what we sell in the store. If we run a flea market we can hire the specific guard even though someone may escape (at the end of the day, we’d lose only a small amount of money). But if we sell diamonds we’ll want no thief to escape and we’ll hire the sensitive guard (we’d rather bother some honest people than allow anybody to escape with a diamond).

The same happens in medicine with the choice of diagnostic tests: we have to decide in each case whether we are more interested in being sensitive or specific, because diagnostic tests do not always have both a high sensitivity (Se) and a high specificity (Sp).

In general, a sensitive test is preferred when the inconveniences of a false positive (FP) are smaller than those of a false negative (FN). For example, suppose that we’re going to vaccinate a group of patients and we know that the vaccine is deadly in those with a particular metabolic error. It’s clear that our interest is that no patient goes undiagnosed (to avoid FN), but not much happens if we wrongly label a healthy person as having the metabolic error (FP): it’s preferable not to vaccinate a healthy person thinking he has a metabolopathy (although he hasn’t) than to kill a patient with our vaccine supposing he was healthy. Another less dramatic example: in the midst of an epidemic our interest will be to be very sensitive and isolate the largest possible number of patients. The problem here is for the unfortunate healthy people who test positive (FP) and get isolated with the rest of the sick. No doubt we’d do them a disservice with the maneuver. Of course, we could give all the positives on the first test a second, very specific confirmatory test, in order to spare the FP people these bad consequences.

On the other hand, a specific test is preferred when it is better to have a FN than a FP, as when we want to be sure that someone is actually sick. Imagine that a positive test result implies a surgical treatment: we’ll have to be quite sure about the diagnosis so we don’t operate on any healthy people.

Another example is a disease whose diagnosis can be very traumatic for the patient, or that is almost incurable, or that has no treatment. Here we’ll prefer specificity, so as not to cause unnecessary distress to healthy people. Conversely, if the disease is serious but treatable we’ll probably prefer a sensitive test.

ROC curves

So far we have talked about tests with a dichotomous result: positive or negative. But what happens when the result is quantitative? Let’s imagine that we measure fasting blood glucose. We must decide up to what level of glycemia we consider the result normal and above which it will seem pathological. And this is a crucial decision, because Se and Sp will depend on the cut-off point we choose.

To help us choose we have the receiver operating characteristic, known worldwide as the ROC curve. We plot Se on the ordinate (y axis) and the complement of Sp (1-Sp) on the abscissa, and draw a curve in which each point represents the probability that the test correctly classifies a healthy-sick pair taken at random. The diagonal of the graph would represent the “curve” of a test with no ability to discriminate healthy from sick patients.

As you can see in the figure, the curve usually has a segment of steep slope where Se increases rapidly with hardly any change in Sp: if we move up along it we can increase Se with practically no increase in FP. But there comes a point where we reach the flat part: if we continue to move to the right, Se will no longer increase, but FP will begin to. If we are interested in a sensitive test, we will stay in the first part of the curve. If we want specificity we will have to go further to the right. And, finally, if we have no predilection for either (FP and FN worry us equally), the best cut-off point will be the one closest to the upper left corner. To find it, some people use the so-called Youden’s index, which optimizes the two parameters together and is calculated by adding Se and Sp and subtracting 1. The higher the index, the fewer patients are misclassified by the diagnostic test.
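As a minimal sketch of the cut-off choice just described: given a handful of candidate cut-off points with their sensitivities and specificities, we can pick the one that maximizes Youden’s index (Se + Sp - 1). The cut-off values and performance figures below are invented for illustration, not real glycemia data.

```python
# Hypothetical cut-off points for a quantitative test, each with its
# sensitivity and specificity: (cut-off, Se, Sp).
cutoffs = [
    (100, 0.98, 0.55),
    (110, 0.92, 0.75),
    (120, 0.85, 0.88),
    (130, 0.70, 0.95),
]

def youden(se, sp):
    """Youden's index: Se + Sp - 1. Higher means fewer misclassifications."""
    return se + sp - 1

# Pick the cut-off closest to the upper-left corner of the ROC curve,
# i.e. the one with the highest Youden's index.
best = max(cutoffs, key=lambda c: youden(c[1], c[2]))
print(best[0])  # 120
```

With these made-up numbers the winner is the middle cut-off, which balances the two parameters; preferring sensitivity or specificity would simply mean choosing a different row.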

A parameter of interest is the area under the curve (AUC), which represents the probability that the diagnostic test correctly classifies the patient being tested (see the attached figure). An ideal test with Se and Sp of 100% has an area under the curve of 1: it always hits. In clinical practice, a test whose ROC curve has an AUC > 0.9 is considered very accurate, between 0.7-0.9 of moderate accuracy and between 0.5-0.7 of low accuracy. On the diagonal the AUC equals 0.5, which means the test is no better than throwing a coin in the air to decide whether the patient is sick or not. Values below 0.5 indicate that the test is even worse than chance, since it will systematically classify sick patients as healthy and vice versa.

We’re leaving…

Curious, these ROC curves, aren’t they? Their usefulness is not limited to assessing the goodness of diagnostic tests with quantitative results. ROC curves also serve to determine the goodness of fit of a logistic regression model for predicting dichotomous outcomes, but that is another story…



Advanced search in Pubmed

We already know what Pubmed MeSH terms are and how to do an advanced search with them. We saw that the search method based on selecting descriptors can be a bit laborious, but it allowed us to choose very precisely not only the descriptor, but also some of its subheadings, whether or not to include the terms below it in the hierarchy, etc.

Today we are going to see another advanced search method that is a little faster when it comes to building the search string and that allows us to combine several different searches. We will use the Pubmed advanced search form.

Advanced search in Pubmed

To get started, click on the “Advanced” link under the search box on the Pubmed home page. This brings us to the advanced search page, which you can see in the first figure. Let’s take a look.

First there is a box with the text “Use the builder below to create your search”, in which, initially, we cannot write. The search string that Pubmed will use when we press the “Search” button will be created here. This string can be edited by clicking on the “Edit” link below and to the left of the box, which allows us to remove text from, or add text to, the search string built up to that point, using natural or controlled language, so that we can click the “Search” button and repeat the search with the new string. There is also a link below and to the right of the box that says “Clear”, with which we can erase its contents.

Below this text box we have the search string constructor (“Builder”), with several rows of fields. In each row we will introduce a different descriptor, so we can add or remove the rows we need with the “+” and “-” buttons to the right of each row.

Within each row there are several boxes. The first, which is not shown in the first row, is a dropdown with the boolean search operator. By default it is set to AND, but we can change it if we want. Next is a drop-down where we can select where we want the descriptor to be searched. By default it is set to “All Fields”, but we can select only the title, only the author, only the last author and many other possibilities. In the center is the text box where we will enter the descriptor. To its right are the “+” and “-” buttons we have already mentioned. And finally, at the far right there is a link that says “Show index list”. This is a help from Pubmed: if we click on it, it will give us a list of possible descriptors that fit what we have written in the text box.

As we enter terms in the boxes, creating the rows we need and selecting the boolean operators of each row, the search string will take shape. When we are finished, we have two options.

The most common choice will be to press the “Search” button and run the search. But there is another possibility, which is to click on the “Add to history” link, whereupon the search is stored at the bottom of the screen, under “History”. This is very useful, since saved searches can be inserted as a block in the descriptor field when building a new search, and combined with other searches or with series of descriptors. Does this sound a little messy? Let’s make it clear with an example.

Suppose I treat my infants with otitis media with amoxicillin, but I want to know if other drugs, specifically cefaclor and cefuroxime, could improve the prognosis. Here are two structured clinical questions. The first one would say “Does cefaclor treatment improve the prognosis of otitis media in infants?” The second one would say the same but changing cefaclor to cefuroxime. So there would be two different searches, one with the terms infants, otitis media, amoxicillin, cefaclor and prognosis, and another with the terms infants, otitis media, amoxicillin, cefuroxime and prognosis.

What we are going to do is plan three searches: a first one about articles on the prognosis of otitis media in infants, a second one about cefaclor, and a third one about cefuroxime. Finally, we will combine the first with the second and the first with the third in two different searches, using the boolean AND.

Selecting search terms

Let us begin. We write otitis in the text box of the first search row and click on the “Show index” link. A huge drop-down appears with the list of related descriptors (when we see a word followed by a slash and another word, it means the latter is a subheading of the descriptor). If we look down the list, there is a possibility that says “otitis/media infants” that fits what we are interested in, so we select it. We can now close the list of descriptors by clicking the “Hide index list” link. Now, in the second box, we write prognosis (we follow the same method: write part of the term in the box and select it from the index list). We need a third row of boxes (if there is none, press the “+” button). In this third row we write amoxicillin. Finally, we will exclude from the search those articles dealing with the combination of amoxicillin and clavulanic acid. We write clavulanic and click on “Show index list”, which shows us the descriptor “clavulanic acid”, which we select. Since we want to exclude these articles from the search, we change the boolean operator of that row to NOT.

In the second screen capture you can see what we have done so far. You see that the terms are in quotes. That’s because we’ve chosen the MeSH terms from the index list. If we write the text directly in the box it will appear without quotes, which will mean that the search is done with natural language (and the precision of the controlled language of MeSH terms is lost). Note also that in the first text box of the form the search string that we have built so far has appeared, which says (((“otitis/media infants”) AND prognosis) AND amoxicillin) NOT “clavulanic acid”. As we have already said, we could modify it if we wanted, but we will leave it as it is.

Now we could click “Search” to run the search, or directly click on the “Add to history” link. To see how the number of articles found gets reduced, click on “Search”. I get a list with 98 results (the number may depend on when you run the search). Very well, now click on the “Advanced” link (at the top of the screen) to return to the advanced search form.

At the bottom of the screen we can see the first search saved, numbered as # 1 (you can see it in the third figure).

What remains to be done is simpler. We write cefaclor in the text box and click the “Add to history” link. We repeat the process with the term cefuroxime. You can see the result of these actions in the fourth screen capture. You see how Pubmed has saved all three searches in the search history. If we now want to combine them, we just have to click on the number of each one (a window will open for us to choose the boolean operator we want, in this case always AND).

First we click on # 1 and # 2, selecting AND. You can see the product in the fifth capture. Notice that the search string has become somewhat complicated: (((((otitis/media infants) AND prognosis) AND amoxicillin) NOT clavulanic acid)) AND cefaclor. As a curiosity I will tell you that, if we wrote this string directly in the simple search box, the result would be the same. That is the method used by those who totally dominate the jargon of this search engine, but we will do it with the help of the advanced search form. We click on “Search” and we obtain seven results that should (or so we expect and hope) compare amoxicillin with cefaclor for the treatment of otitis media in infants.
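For the curious, the way these boolean strings nest can be mimicked with a trivial bit of string handling. This is only an illustration of how each combination wraps the previous query in parentheses; Pubmed builds the string for us, and the `combine` helper below is purely hypothetical.

```python
# Search #1 as built in the example above (MeSH terms quoted).
base = '((("otitis/media infants") AND prognosis) AND amoxicillin) NOT "clavulanic acid"'

def combine(a, b, op="AND"):
    """Wrap an existing query in parentheses and join it to a new term,
    mirroring how the advanced search form combines history entries."""
    return f"(({a})) {op} {b}"

# Combining search #1 with search #2 (cefaclor) via AND:
query = combine(base, "cefaclor")
print(query)
```

Running this prints a string with the same nested shape as the one Pubmed produced in the fifth capture, which is why typing the string by hand in the simple search box gives the same results.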

We click again on the “Advanced” link and in the form we see that there is a further search, # 4, which is the combination of # 1 and # 2. You can get an idea of how complex searches can become by combining searches with each other, adding or subtracting terms according to the boolean operators we choose. Finally, we click on # 1 and # 3 and press “Search”, finding five articles that should deal with the problem we are looking for.

In summary

We are coming to the end of my comments for today. I think it has been fully demonstrated that the use of MeSH terms and the advanced search yields more specific results than the simple search. The usual thing with the simple search with natural language is to obtain endless lists of articles, most of them without interest for our clinical question. But we have to keep one thing in mind. We have already mentioned that a number of people are dedicated to assigning the MeSH descriptors to articles that enter the Medline database. Of course, from the time an article enters the database until it is indexed (the MeSH terms are assigned), some time passes, and during that time we cannot find it using MeSH terms. For this reason, it might not be a bad idea to do a natural language search after the advanced one and see if there are any articles at the top of the list that might interest us and are not indexed yet.

We’re leaving…

Finally, let me mention that searches can be stored by downloading them to your disk (by clicking the “download history” link) or, much better, by creating an account in Pubmed by clicking on the link at the top right of the screen that says “Sign in to NCBI”. This is free and allows us to save searches from one session to the next, which can be very useful in combination with other tools such as Clinical Queries or search filters. But that is another story…

An open relationship


We already know about the relationship between variables. Who can doubt that smoking kills or that TV dries our brain? The issue is that we have to try to quantify these relationships in an objective way because, otherwise, there will always be someone who can doubt them. To do this, we’ll have to use some parameter that assesses whether our variables change in a related way.

When the two variables are dichotomous the solution is simple: we can use the odds ratio. Regarding TV and brain damage, we could use it to calculate whether it is really more likely to end up with a dry brain watching too much TV (although I wouldn’t waste my time on it). But what happens if the two variables are continuous? Then we cannot use the odds ratio and we have to use other tools. Let’s see an example.

Suppose I measure the blood pressure of a sample of 300 people and plot the values of systolic and diastolic pressure, as shown in the first scatterplot. At a glance, you smell a rat. If you look carefully, high systolic pressures are usually associated with high diastolic values and, conversely, low systolic values are associated with low diastolic values. I would say that they vary in a similar way: higher values of one go with higher values of the other, and vice versa. For a better view, look at the following two graphs.

Standardized pressure values (each value minus the mean) are shown in the first of these graphs. We see that most of the points lie in the lower-left and upper-right quadrants. You can see it even better in the second chart, in which I’ve omitted values within ±10 mmHg (systolic) and ±5 mmHg (diastolic) of zero, which is where the standardized means sit. Let’s see if we can quantify this somehow.

Remember that the variance measures how spread out the values of a distribution are with respect to the mean. We subtract the mean from each value and square the result so it’s always positive (preventing positive and negative differences from cancelling each other out); all these squared differences are added and the sum is divided by the sample size (in reality, by the sample size minus one, and do not ask why, only mathematicians know). You know that the square root of the variance is the standard deviation, the queen of the measures of dispersion.

Well, with a pair of variables we can do a similar thing. We calculate, for every pair, the differences of each variable from its mean and multiply these differences (the equivalent of squaring the difference, as we did with the variance). Finally, we add all these products and divide the result by the sample size minus one, thus obtaining this version of the pairs’ variance which is called, how could it be otherwise, the covariance.

variance = \frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}

covariance = \frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})

And what does the value of the covariance tell us? Well, not much, since it depends on the magnitudes of the variables, which can differ depending on what we are measuring. To get around this little problem we use a very handy solution in these situations: standardizing.

Thus, we divide the differences from the means by their standard deviations, obtaining the world-famous Pearson’s linear correlation coefficient.

Pearson\ correlation\ coefficient = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_{i}-\bar{x}}{s_{x}}\right)\left(\frac{y_{i}-\bar{y}}{s_{y}}\right)
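The two formulas above can be checked with a short from-scratch sketch. The blood pressure values below are made up for illustration (the original 300-person sample is not reproduced here); everything else follows the definitions in the text, with the n-1 denominator throughout.

```python
from math import sqrt

# Hypothetical paired measurements (invented for illustration):
x = [120, 130, 140, 150, 160]  # e.g. systolic pressures
y = [80, 84, 90, 95, 101]      # e.g. diastolic pressures

n = len(x)
mx = sum(x) / n  # mean of x
my = sum(y) / n  # mean of y

# Covariance: average product of paired deviations (n - 1 denominator).
cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)

# Standard deviations, also with the n - 1 denominator.
sx = sqrt(sum((xi - mx) ** 2 for xi in x) / (n - 1))
sy = sqrt(sum((yi - my) ** 2 for yi in y) / (n - 1))

# Pearson's r: the covariance of the standardized variables.
r = cov / (sx * sy)
print(round(r, 3))  # close to 1: strong positive linear correlation
```

With these numbers r comes out very close to 1, which is exactly what the scatterplot of systolic versus diastolic pressure suggested by eye.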

It’s good to know that, actually, Pearson only made the initial development of this coefficient and that the real father of the creature was Francis Galton. The poor man spent his whole life trying to do something important because he was jealous of his much more famous cousin, a certain Charles Darwin who, I believe, wrote something about species eating each other and the secret of survival being to procreate as much as possible.

Pearson’s correlation coefficient, r for friends, can take any value from -1 to 1. When it equals zero, it means that the variables are uncorrelated, but do not confuse this with their being independent; as the title of this post says, Pearson’s coefficient does not commit the variables to anything serious. Correlation and independence have nothing to do with each other; they are different concepts. If you look at the two graphs of the example you’ll see that r equals zero in both. However, although the variables in the first one are independent, this is not true for the second, which represents the function y = |x|.

If r is greater than zero it means that the correlation is positive, so the two variables vary in the same sense: when one increases, so does the other and, conversely, when one decreases, so does the other. This positive correlation is said to be perfect when r is 1. On the other hand, when r is negative, the variables vary in opposite directions: when one increases the other decreases, and vice versa. Again, the negative correlation is perfect when r is -1.

It is crucial to understand that correlation does not always mean causality. As Stephen J. Gould said in his book “The Mismeasure of Man”, mistaking the one for the other is one of the two or three most serious and frequent errors of human reasoning. And it must be true because, even though I searched, I have not found any cousin he wanted to outshine, which makes me think he said it because he was convinced. So now you know: even though where there’s causality there’s correlation, the opposite is not always true.

Another mistake we can make is to use this coefficient without a series of preflight checks. The first is that the correlation between the two variables must be linear. This is easy to check by plotting the points and seeing that the cloud does not look like a parabola, a hyperbola or any other curved shape. The second is that at least one of the variables should follow a normal distribution of frequencies. For this we can use statistical tests such as Kolmogorov-Smirnov’s or Shapiro-Wilk’s, but it is often enough to plot histograms with frequency curves and see if they fit the normal. In our case, the diastolic pressure may fit a normal curve, but I would not hold my breath for the systolic. The shape of the cloud of points in the scatterplot gives us another clue: an elliptical or rugby-ball shape suggests that the variables probably follow a normal distribution. Finally, the third check is to make sure that the samples are random. In addition, we can only use r within the range of the data; if we extrapolated outside this range, we would make an error.

A final warning: do not confuse correlation with regression. Correlation investigates the strength of the linear relationship between two continuous variables and is not useful for estimating the value of one variable from the value of the other. Regression (linear, in this case), on the other hand, investigates the nature of the linear relationship between two continuous variables and does serve to predict the value of one variable (the dependent) from the other (the independent variable). This technique gives us the equation of the line that best fits the cloud of points, with two coefficients that indicate the point of intersection with the vertical axis and the slope of the line.

And what if the variables are not normally distributed? Well, then we cannot use Pearson’s coefficient. But do not despair: we have Spearman’s coefficient and a whole battery of tests based on data ranks. But that is another story…

The cooker and his cake


Knowing how to cook is a plus. How well you come across when you have guests and you know how to cook properly! It takes you two or three hours to buy the ingredients, you spend a fortune on them, it takes another two or three hours working in the kitchen… and, in the end, it turns out that the great dish you were preparing ends up a wreck.

And this may happen even to the best cooks. We can never be sure that our dish will turn out well, even if we have prepared it many times before. So you will understand the problem of my cousin.

As it happens, he’s going to give a party and the dessert has fallen to him. He knows how to make a pretty and tasty cake, but it only turns out really well half of the times he tries. So, understandably, he’s very concerned about making a fool of himself at the party. Of course, my cousin is very clever and has thought that, if he makes more than one cake, at least one of them will turn out well. But how many cakes does he have to make to get at least one good one?

The problem with this question is that it doesn’t have an exact answer. The more cakes we make, the more likely it is that one of them turns out well. But, of course, you could make two hundred cakes and have the bad luck that all of them turn out badly. Do not despair, though: although we cannot give a number with absolute certainty, we can measure the probability of succeeding with a given number of cakes. Let’s see it.

We are going to imagine the probability distribution, which is just the set of all the possible situations that may occur, with their probabilities. For example, if my cousin makes one cake, it can turn out good (G) or bad (B), each with a probability of 0.5. You can see it represented in Figure A. He’ll have a 50% chance of success.

If he makes two cakes he may get one good cake, two or none. The possible combinations are GG, GB, BG and BB. The chance of getting exactly one good cake is 0.5 and the chance of getting two good ones is 0.25, so the probability of getting at least one good cake is 0.75, or 75% (3/4). It’s represented in Figure B. We see that the odds have improved, but there’s still much room for failure.

If he makes three cakes, the options will be GGG, GGB, GBG, GBB, BGB, BGG, BBG and BBB. The situation keeps improving: we now have an 87.5% (7/8) probability of getting at least one good cake, since the only losing outcome is BBB. We represent it in Figure C.

And what if he makes four cakes, or five, or…? The issue becomes a pain in the ass: it’s increasingly difficult to imagine all the possible combinations. What can we do? Well, we can think a little.

If we look at the graphs, the bars represent the discrete probabilities of each of the possible events. As the number of possibilities and of vertical bars increases, the distribution of bars begins to take a bell shape, conforming to a known probability distribution: the binomial distribution.

People who know about this stuff call Bernoulli trials those experiments that have only two possible outcomes (that are dichotomous), like flipping a coin (heads or tails) or making our cakes (good or bad). And the binomial distribution measures the number of successes (k) in a series of n Bernoulli trials, each with a certain probability of success (p).

In our case the probability is p = 0.5 and we can calculate the probability of success by repeating the experiment (cooking cakes) using the following formula:

P(k\ successes\ in\ n\ experiments)= \binom{n}{k}\, p^{k}\, (1-p)^{n-k}

If we replace p with 0.5 (the probability that a cake comes out good), we can play with different values of n to obtain the probability of getting at least one good cake (k ≥ 1).

If we make four cakes, the probability of having at least one good one is 93.75%, and if we make five the probability rises to 96.88%, a reasonable figure for what we are dealing with. I believe that if my cousin makes five cakes it will be very difficult for him to ruin his party.
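The cake arithmetic above is easy to verify with the binomial formula. Since we want at least one success, the simplest route is the complement of zero successes:

```python
# P(at least one good cake in n tries) = 1 - P(zero good cakes)
#                                      = 1 - C(n, 0) * p**0 * (1 - p)**n,
# following the binomial formula in the text, with p = 0.5.
from math import comb

def p_at_least_one(n, p=0.5):
    p_zero = comb(n, 0) * p**0 * (1 - p) ** n  # P(k = 0)
    return 1 - p_zero

for n in range(1, 6):
    print(n, round(100 * p_at_least_one(n), 2))  # 50.0, 75.0, 87.5, 93.75, 96.88
```

The loop reproduces the figures from the earlier diagrams (50%, 75%, 87.5%) and confirms the values for four and five cakes.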

We could also turn the question around: given a desired value of P(k, n), solve for the number of attempts needed. Another thing we can do is calculate all of this without using the formula, with any of the probability calculators available online.

And this is the end of this tasty post. There are, as you can imagine, more types of probability distributions, both discrete, like the binomial, and continuous, like the normal distribution, the most famous of all. But that’s another story…