Science without sense…double nonsense

Pills on evidence-based medicine

Posts tagged Diagnostic tests

The crystal ball

How I wish I could predict the future! And not only to win millions in the lottery, which is the first thing that comes to mind. There are more important things in life than money (or so some say): decisions we make based on assumptions that end up not being fulfilled and that complicate our lives to unsuspected limits. We have all thought at some point about "if I could live twice…". I have no doubt that, if I met the genie of the lamp, one of my three wishes would be a crystal ball to see the future.

And one would also come in handy in our work as doctors. In our day-to-day practice we are forced to make decisions about the diagnosis or prognosis of our patients, and we always do it on the swampy terrain of uncertainty, always assuming the risk of making some mistake. We, especially when we are more experienced, estimate consciously or unconsciously the likelihood of our assumptions, which helps us in making diagnostic or therapeutic decisions. Even so, it would be good to also have a crystal ball to know more accurately what the patient's course will be.

The problem, as with other inventions that would be very useful in medicine (like the time machine), is that nobody has yet managed to manufacture a crystal ball that really works. But let us not get discouraged. We cannot know for sure what will happen, but we can estimate the probability that a certain result will occur.

For this, we can use all those variables related to the patient that have a known diagnostic or prognostic value and integrate them to perform the calculation of probabilities. Well, doing such a thing would be the same as designing and applying what is known as a clinical prediction rule (CPR).

Thus, if we get a little formal, we can define a CPR as a tool composed of a set of variables of clinical history, physical examination and basic complementary tests, which provides us with an estimate of the probability of an event, suggesting a diagnosis or predicting a concrete response to a treatment.

The critical appraisal of an article about a CPR shares aspects with that of articles about diagnostic tests, but it also has specific aspects related to the methodology of its design and application. For this reason, we will briefly review the methodological aspects of CPRs before entering into their critical appraisal.

In the process of developing a CPR, the first thing to do is to define it. The four key elements are the study population, the variables that we will consider as potentially predictive, the gold or reference standard that classifies whether the event we want to predict occurs or not and the criterion of assessment of the result.

It must be borne in mind that the variables we choose must be clinically relevant, must be collected accurately and, of course, must be available at the time we want to apply the CPR for decision making. It is advisable not to give in to the temptation of throwing in variables endlessly since, apart from complicating the application of the CPR, it can decrease its validity. In general, it is recommended that for every variable introduced in the model there should have been at least 10 events of the outcome we want to predict (the rule is derived in a sample whose members all have the predictor variables, but only some of them end up presenting the event to be predicted). For example, a model with five predictive variables would require a derivation sample with at least 50 events.

I would also like to highlight the importance of the gold standard. There must be a diagnostic test or a set of well-defined criteria that allow us to clearly define the event we want to predict with the CPR.

Finally, it is convenient that those who collect the variables during this definition phase are unaware of the results of the gold standard, and vice versa. The absence of blinding decreases the validity of the CPR.

The next step is the derivation or design phase itself. This is where the statistical methods that allow us to include the predictive variables and to exclude those that will not contribute anything are applied. We will not go into the statistics; suffice it to say that the most commonly used methods are those based on logistic regression, although discriminant analysis, survival analysis and even more exotic approaches based on competing risks or neural networks can be used, within the reach of only a few virtuosos.

In logistic regression models, the event will be the dichotomous dependent variable (it happens or it does not happen) and the rest will be the predictive or independent variables. Thus, the coefficient that multiplies each predictive variable will be the natural logarithm of its adjusted odds ratio. In case anyone has not followed, the adjusted odds ratio for each predictive variable is obtained by raising the number "e" to the value of that variable's coefficient in the regression model.
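To make this tangible, here is a minimal sketch in Python (not taken from any real CPR: the data, the predictors and the coefficients are invented for illustration, and statsmodels is assumed to be available) showing how a logistic model is fitted and how the adjusted odds ratios are recovered by exponentiating the coefficients:

```python
# A minimal sketch with made-up data: fit a logistic regression and recover
# the adjusted odds ratios as e raised to each coefficient.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 500
age = rng.normal(60, 10, n)                 # hypothetical continuous predictor
sign = rng.integers(0, 2, n)                # hypothetical binary predictor
true_logit = -8 + 0.1 * age + 0.8 * sign    # invented "true" model
event = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))  # simulated outcome

X = sm.add_constant(np.column_stack([age, sign]))
fit = sm.Logit(event, X).fit(disp=0)

# Each coefficient is the natural log of the adjusted OR, so OR = e^coefficient
print(np.exp(fit.params[1:]))               # adjusted ORs for age and the sign
```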

The usual thing is that a certain score is assigned on a scale according to the weight of each variable, so that the total sum of points across all the predictive variables allows us to classify the patient within a specific range of predicted probability of the event. There are also other, more complex methods using regression equations but, in the end, you always get the same thing: an individualized estimate of the probability of the event in a particular patient.

With this process we categorize patients into homogeneous groups of probability, but we still need to know whether this categorization fits reality or, what amounts to the same thing, what the discrimination capacity of the CPR is.

The overall validity or discrimination capacity of the CPR will be assessed by contrasting its results with those of the gold standard, using techniques similar to those used to assess the power of diagnostic tests: sensitivity, specificity, predictive values and likelihood ratios. In addition, in cases where the CPR provides a quantitative estimate, we can resort to ROC curves, since the area under the curve will represent the global validity of the CPR.
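As an illustration (again a toy sketch, not the post's data; scikit-learn is assumed for the AUC), this is how the discrimination of a CPR could be checked against the gold standard once we have its predicted probabilities:

```python
# Toy check of discrimination: compare the CPR's predicted probabilities with
# the gold standard using sensitivity, specificity and the area under the ROC curve.
import numpy as np
from sklearn.metrics import roc_auc_score

gold = np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 0])   # gold standard (1 = event occurred)
pred = np.array([0.8, 0.7, 0.4, 0.2, 0.1, 0.3, 0.6, 0.9, 0.2, 0.1])  # CPR probabilities

positive = pred >= 0.5                                     # arbitrary cut-off for illustration
se = (positive & (gold == 1)).sum() / (gold == 1).sum()    # sensitivity
sp = (~positive & (gold == 0)).sum() / (gold == 0).sum()   # specificity
auc = roc_auc_score(gold, pred)                            # global discrimination
print(f"Se = {se:.2f}, Sp = {sp:.2f}, AUC = {auc:.2f}")
```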

The last step of the design phase will be the calibration of the CPR, which is nothing more than checking its good behavior throughout the range of possible results.
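A rough way to look at calibration, sticking to the same toy spirit (invented numbers, nothing from the post), is to group patients by predicted risk and compare the mean predicted probability with the observed event rate in each band:

```python
# Crude calibration check: in each band of predicted risk, the mean predicted
# probability should be close to the proportion of patients who had the event.
import numpy as np

pred = np.array([0.05, 0.10, 0.15, 0.20, 0.35, 0.40, 0.60, 0.70, 0.85, 0.90])
event = np.array([0,   0,    0,    1,    0,    1,    1,    0,    1,    1])

band = np.digitize(pred, bins=[0.25, 0.50, 0.75])   # four risk bands
for b in range(4):
    mask = band == b
    if mask.any():
        print(f"band {b}: predicted {pred[mask].mean():.2f}, observed {event[mask].mean():.2f}")
```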

Some CPR authors stop here, but they forget two fundamental steps of the process: the validation and the assessment of the clinical impact of the rule.

Validation consists in testing the CPR in samples different from the one used for its design. We may be in for a surprise and find that a rule that works well in a certain sample does not work in another. Therefore, it must be tested not only in similar patients (limited validation) but also in different clinical settings (broad validation), which will increase the external validity of the CPR.

The last phase is to check its clinical performance. This is where many CPRs crash after having gone through all the previous steps (maybe that is why this last check is often skipped). To assess the clinical impact, we have to apply the CPR to our patients and see how clinical outcome measures change: survival, complications, costs, etc. The ideal way to analyze the clinical impact of a CPR is to conduct a clinical trial with two groups of patients managed with and without the rule.

For those self-sacrificing people who are still reading, now that we know what a CPR is and how it is designed, let us see how the critical appraisal of these studies is done. And for this, as usual, we will rely on our three pillars: validity, relevance and applicability. So as not to forget anything, we will follow the questions listed in the CASP tool's checklist for CPR studies.

Regarding VALIDITY, we will start with some elimination questions. If the answer to any of them is negative, it may be time to wait until someone finally invents a crystal ball that works.

Does the rule answer a well-defined question? The population, the event to be predicted, the predictive variables and the outcome evaluation criteria must be clearly defined. If this is not done or these components do not fit our clinical scenario, the rule will not help us. The predictive variables must be clinically relevant, reliable and well defined in advance.

Did the study population from which the rule was derived include an adequate spectrum of patients? It must be verified that the method of patient selection was adequate and that the sample is representative. In addition, it must include patients from the entire spectrum of the disease. As with diagnostic tests, events may be easier to predict in certain groups, so all of them must be represented. Finally, we must check whether the rule was validated in a different group of patients. As we have already said, it is not enough that the rule works in the group of patients in which it was derived; it must be tested in other groups, similar to or different from those with which it was generated.

If the answer to these three questions has been affirmative, we can move on to the next three. Was there a blind evaluation of the outcome and of the predictor variables? As we have already mentioned, it is important that the person who collects the predictive variables does not know the result of the reference standard, and vice versa. The collection of information must be prospective and independent. The next thing to ask is whether the predictor variables and the outcome were measured in all the patients. If the outcome or the variables are not measured in all patients, the validity of the CPR may be compromised. In any case, the authors should explain the exclusions, if there are any. Finally, are the methods of derivation and validation of the rule described? We already know that it is essential that the results of the rule be validated in a population different from the one used for its design.

If the answers to the previous questions indicate that the study is valid, we will move on to the questions about the RELEVANCE of the results. The first is whether we can calculate the performance of the CPR. The results should be presented with their sensitivity, specificity, odds ratios, ROC curves, etc., depending on the type of result the rule provides (scoring scales, regression formulas, etc.). All these indicators will help us calculate the probability of occurrence of the event in settings with different prevalences. This is similar to what we did with studies of diagnostic tests, so I invite you to review the post on that subject to avoid repeating myself too much. The second question is: what is the precision of the results? Here we will not go on at length either: remember our revered confidence intervals, which will inform us of the precision of the results of the rule.

To finish, we will consider the APPLICABILITY of the results to our setting, for which we will try to answer three questions. Will the reproducibility of the CPR and its interpretation be satisfactory within the scope of our scenario? We will have to think about the similarities and differences between the setting in which the CPR was developed and our clinical environment. In this sense, it will help if the rule has been validated in several samples of patients from different settings, which increases its external validity. Is the test acceptable in our case? We will consider whether the rule is easy to apply in our environment and whether it makes sense to apply it from a clinical point of view. Finally, will the results modify clinical behavior, health outcomes or costs? If, from our point of view, the results of the CPR are not going to change anything, the rule will be useless and a waste of time. Here our own judgment will be important, but we must also look for studies that assess the impact of the rule on costs or on health outcomes.

And that is everything I wanted to tell you about the critical appraisal of studies on CPRs. Anyway, before finishing I would like to mention a checklist which, of course, also exists for the assessment of this type of study: the CHARMS checklist (CHecklist for critical Appraisal and data extraction for systematic Reviews of prediction Modeling Studies). You will not tell me that the name, although a bit fancy, is not lovely.

This list is designed to assess the primary studies of a systematic review on CPRs. It tries to answer some general design questions and assesses 11 domains in order to extract enough information to perform the critical appraisal. The two main aspects assessed are the risk of bias of the studies and their applicability. The risk of bias refers to the design or validation flaws that may result in the model being less discriminative, excessively optimistic, etc. Applicability, on the other hand, refers to the degree to which the primary studies agree with the question that motivates the systematic review, and tells us whether the rule can be applied to the target population. The list is good and helps to assess and understand the methodological aspects of this type of study but, in my humble opinion, it is easier to carry out a systematic critical appraisal using the CASP tool.

And here, finally, we will leave it for today. So as not to go on too long, we have not said anything about what to do with the result of the rule. The fundamental thing, as we already know, is that it lets us calculate the probability of occurrence of the event in individual patients from settings with different prevalences. But that is another story…

Crossing the threshold

The world of medicine is a world of uncertainty. We can never be 100% sure of anything, however obvious a diagnosis may seem, but neither can we go around hitting right and left with ultramodern diagnostic techniques or treatments (which are never entirely safe) when making the decisions that continually confront us in our daily practice.

That’s why we are always immersed in a world of probabilities, where the certainties are almost as rare as the so-called common sense which, as almost everyone knows, is the least common of the senses.

Imagine you are in the clinic and a patient comes in because he has been kicked in the ass, pretty hard, though. As the good doctors we are, we ask the usual: what is wrong?, since when?, and what do you attribute it to? And we proceed to a complete physical examination, discovering with horror that he has a hematoma on the right buttock.

Here, my friends, the diagnostic possibilities are numerous, so the first thing we do is a comprehensive differential diagnosis. To do this, we can take four different approaches. The first is the possibilistic approach: listing all possible diagnoses and trying to rule them all out simultaneously by applying the relevant diagnostic tests. The second is the probabilistic approach: sorting the diagnoses by relative likelihood and then acting accordingly. It looks like a post-traumatic hematoma (known as the kick-in-the-ass syndrome), but someone might think that the kick was not so strong, so maybe the poor patient has a bleeding disorder, or a blood dyscrasia with secondary thrombocytopenia, or even an atypical inflammatory bowel disease with extraintestinal manifestations and gluteal vascular fragility. We could also use a prognostic approach and try to confirm or rule out the possible diagnoses with the worst prognosis, so the kick-in-the-ass syndrome would lose interest and we would set out to rule out a chronic leukemia. Finally, a pragmatic approach could be used, with particular interest in first finding the diagnoses that have the most effective treatment (the kick would be, once more, number one).

It seems that the right thing to do is to use a judicious combination of the pragmatic, probabilistic and prognostic approaches. In our case we would check whether the intensity of the injury justifies the magnitude of the bruising and, if so, we would prescribe some hot towels and refrain from further diagnostic tests. This example may seem like bullshit, but I can assure you I know people who make the complete list and order all the diagnostic tests whenever there is any symptom, regardless of expense or risk. And, besides, someone could always consider performing some more exotic diagnostic test that I cannot even imagine, so the patient should be grateful if the diagnosis does not require a forced anal sphincterotomy. And that is so because, as we have already said, the waiting list to get some common sense often exceeds the surgical waiting list.

Now imagine another patient with a less silly and absurd symptom complex than in the previous example. For instance, a child with symptoms of celiac disease. Before we perform any diagnostic test, our patient already has a probability of having the disease. This probability is conditioned by the prevalence of the disease in the population he comes from and is called the pretest probability. This probability will lie somewhere in relation to two thresholds: the diagnostic threshold and the therapeutic threshold.

If we consider that the pretest probability already justifies treating the disease, there is no need for diagnostic tests and we will proceed to remove gluten from the diet. But it is usual that the pretest probability allows us neither to rule out the diagnosis nor to confirm the disease with sufficient certainty to start treatment.

We’ll then make the indicated diagnostic test, getting a new probability of disease depending on the result of the test, the so-called post-test probability. If this probability is high enough to make a diagnosis and initiate treatment, we’ll have crossed our first threshold, the therapeutic one. There will be no need for additional tests, as we will have enough certainty to confirm the diagnosis and treat the patient, always within a range of uncertainty.

And what determines our treatment threshold? Well, there are several factors involved. The greater the risk, cost or adverse effects of the treatment in question, the higher the threshold we will demand before treating. On the other hand, the more serious the consequences of missing the diagnosis, the lower the therapeutic threshold we will accept.

But it may be that the post-test probability is so low that it allows us to rule out the disease with reasonable certainty. We shall then have crossed our second threshold, the diagnostic one, also called the no-test threshold. Clearly, in this situation, neither further diagnostic tests nor, of course, starting treatment are indicated.

However, very often the change from pretest to post-test probability still leaves us in no man's land, without reaching either of the two thresholds, so we will have to perform additional tests until we reach one of the two limits.
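The whole threshold logic of this post fits in a few lines. Here is a hedged sketch in Python (the likelihood ratio and the two threshold values are made up, just to show the mechanics of odds and probabilities):

```python
# From pretest probability to posttest probability, then compare with the
# two thresholds (both threshold values here are hypothetical).
def posttest_probability(pretest_p, likelihood_ratio):
    pre_odds = pretest_p / (1 - pretest_p)        # probability -> odds
    post_odds = pre_odds * likelihood_ratio       # apply the test's LR
    return post_odds / (1 + post_odds)            # odds -> probability

NO_TEST_THRESHOLD = 0.05    # below this we rule out and stop testing
TREATMENT_THRESHOLD = 0.80  # above this we treat without further tests

p = posttest_probability(pretest_p=0.30, likelihood_ratio=10)  # after a positive result
if p >= TREATMENT_THRESHOLD:
    print(f"posttest probability {p:.2f}: treat")
elif p <= NO_TEST_THRESHOLD:
    print(f"posttest probability {p:.2f}: rule out, no more tests")
else:
    print(f"posttest probability {p:.2f}: still in no man's land, keep testing")
```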

Finally, I just want to stress the importance of the properties of diagnostic tests in moving from one probability to the other and reaching one of the two thresholds: sensitivity, specificity, predictive values and likelihood ratios. Familiarity with these properties is essential to determine the performance of the test, especially when it is costly or carries risk or discomfort for the patient. But that is another story…

The imperfect screening

Nobody is perfect. It is a fact. And a relief too. Because the problem is not being imperfect, which is inevitable. The real problem is believing oneself to be perfect, being ignorant of one's own limitations. And the same goes for many other things, such as the diagnostic tests used in medicine.

But with diagnostic tools this becomes a real problem because, beyond their imperfection, they can misclassify healthy and sick people. Don't you believe me? Let's make a few reflections.

To begin with, take a look at the Venn diagram I have drawn. What childhood memories these diagrams bring back! The filled square symbolizes our population of interest. Above the diagonal are the sick (SCK) and below it the healthy (HLT), so that each area represents the probability of being SCK or HLT. The area of the square, obviously, equals 1: we can be certain that anybody will be either healthy or sick, two mutually exclusive situations. The ellipse encompasses the subjects who undergo the diagnostic test and get a positive result (POS). In a perfect world, the entire ellipse would lie above the diagonal, but in the real, imperfect world the ellipse is crossed by the diagonal, so the results can be true POS (TP) or false POS (FP), the latter when they are obtained in healthy people. The area outside the ellipse corresponds to the negatives (NEG) which, as you can see, are also divided into true and false (TN, FN).

Now let's transfer this to the typical contingency table in order to define the probabilities of the different options, and think first about the situation where we have not yet carried out the test. In this case, the columns condition the probabilities of the events in the rows. For example, the upper left cell represents the probability of a POS in the SCK (once you are sick, how likely are you to get a positive result?), which we call the sensitivity (SEN). For its part, the lower right cell represents the probability of a NEG in a HLT, which we call the specificity (SPE). The total of the first column represents the probability of being sick, which is nothing more than the prevalence (PRV), and so we can work out the meaning of the probability in each cell. This table provides two features of the test, SEN and SPE, which, as we know, are intrinsic characteristics of the test whenever it is performed under similar conditions, even if the populations are different.

And what about the contingency table once we have carried out the test? A subtle but very important change has taken place: now the rows condition the probabilities of the events in the columns. The totals of the table do not change, but look now at the first cell, which represents the probability of being SCK given that the result has been POS (when positive, what is the probability of being sick?). And this is no longer the SEN, but the positive predictive value (PPV). The same applies to the lower right cell, which now represents the probability of being HLT given that the result has been NEG: the negative predictive value (NPV).

So we see that before performing the test we will usually know its SEN and SPE, while once we have performed it we can calculate its positive and negative predictive values, these four characteristics of the test remaining linked through the magic of Bayes' theorem. Of course, regarding PPV and NPV there is a fifth element to take into account: the prevalence. We know that predictive values vary depending on the PRV of the disease in the population, while SEN and SPE remain unchanged.
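For those who like to see it written out, this is the link that Bayes' theorem establishes between these five quantities (using the post's abbreviations):

PPV = \frac{SEN \times PRV}{SEN \times PRV + (1 - SPE) \times (1 - PRV)}

NPV = \frac{SPE \times (1 - PRV)}{SPE \times (1 - PRV) + (1 - SEN) \times PRV}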

And all this has its practical expression. Let's invent an example to mess around a bit more. Suppose we have a population of one million inhabitants in which we conduct a screening for fildulastrosis. We know from previous studies that the test's SEN is 0.66 and its SPE is 0.96, and that the prevalence of fildulastrosis is 0.0001 (1 in 10,000); a rare disease that I would advise you not to bother looking up, in case anyone was thinking about it.

Knowing the PRV, it is easy to calculate that in our country there are 100 SCK. Of these, 66 will be POS (SEN = 0.66) and 34 will be NEG. Moreover, there will be 999,900 healthy people, of whom 96% (959,904) will be NEG (SPE = 0.96) and the rest (39,996) will be POS. In short, we will get 40,062 POS, of which 39,996 will be FP. Nobody should be scared by the high number of false positives. This happens because we have chosen a very rare disease, so there are many FP even though the SPE is quite high. Consider that, in real life, we would need to perform a confirmatory test on all these subjects in order to end up confirming the diagnosis in only 66 people. Therefore, it is very important to think carefully about whether the screening is worth doing before starting to look for the disease in the population. For this and many other reasons.

We can now calculate the predictive values. The PPV is the ratio between true positives and the total number of POS: 66/40,062 = 0.0016. So there will be roughly one sick person for every 600 positives. Similarly, the NPV is the ratio between true negatives and the total number of NEG: 959,904/959,938 = 0.99996. As expected, given the very low prevalence of the disease, a negative result makes it highly improbable to be sick.

What do you think? Is a test with such a number of false positives and a PPV of 0.0016 useful for mass screening? Well, although it may seem counterintuitive, if we think about it for a moment, it is not so bad. The pretest probability of being SCK is 0.0001 (the PRV). The posttest probability is 0.0016 (the PPV). So their ratio is 0.0016/0.0001 = 16, which means we have multiplied by 16 our ability to detect the sick. Therefore, the test does not seem so bad after all, but we must take many other factors into account before starting to screen.
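If anyone wants to check the arithmetic, the whole screening example can be reproduced with a few lines of Python (the numbers are those of the post; only the variable names are mine):

```python
# The fildulastrosis screening example, step by step.
n, prv, sen, spe = 1_000_000, 0.0001, 0.66, 0.96

sick = n * prv                 # 100 sick people
tp = sick * sen                # 66 true positives
fn = sick - tp                 # 34 false negatives
healthy = n - sick             # 999,900 healthy people
tn = healthy * spe             # 959,904 true negatives
fp = healthy - tn              # 39,996 false positives

ppv = tp / (tp + fp)           # ≈ 0.0016
npv = tn / (tn + fn)           # ≈ 0.99996
print(f"positives: {tp + fp:.0f}, PPV = {ppv:.4f}, NPV = {npv:.5f}")
print(f"probability gain after a positive: x{ppv / prv:.0f}")   # ≈ 16
```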

All this we have seen so far has an additional practical application. Suppose we only know the SEN and SPE of the test, but we do not know the PRV of the disease in the population we have screened. Can we estimate it from the results of the screening? The answer is, of course, yes.

Imagine again our population of one million subjects. We perform the test and get 40,062 positives. The problem here is that some of these (most of them, in fact) are FP. Also, we do not know how many sick people have tested negative (FN). How can we then get the number of sick people? Let's think about it for a while.

We have said that the number of sick people will be equal to the number of POS minus the number of FP plus the number of FN:

Nº sick = Total POS – Nº FP + Nº FN

We have the number of POS: 40,062. The FP will be those healthy people (1-PRV) who get a positive result despite being healthy (that is, the healthy who do not get a NEG: 1-SPE). Then, the total number of FP will be:

FP = (1-PRV)(1-SPE) x n (1 million, the population’s size)

Finally, the FN will be those sick people (PRV) who do not get a positive result (1-SEN). Then, the total number of FN is:

FN = PRV x (1-SEN) x n (1 million, the population’s size)

If we substitute the totals of FP and FN in the first equation with the values we have just derived and solve for the PRV, we obtain the following formula:

PRV= \frac{\frac{POS}{n}-(1-SPE)}{SEN - (1-SPE)}

We can now calculate the prevalence in our population:

PRV= \frac{\frac{40,062}{1,000,000}-(1-0.96)}{0.66 - (1-0.96)}= \frac{0.040062 - 0.04}{0.66 - 0.04}= \frac{0.000062}{0.62}= 0.0001\ (1\ per\ 10,000)
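And, for the lazy, the same back-calculation in Python (same numbers as above, nothing new):

```python
# Estimating the prevalence from the screening results, SEN and SPE.
pos, n, sen, spe = 40_062, 1_000_000, 0.66, 0.96

prv = (pos / n - (1 - spe)) / (sen - (1 - spe))
print(f"estimated prevalence: {prv:.6f}")   # 0.000100, i.e. 1 per 10,000
```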

Well, I think one of my brain lobes has just melted down, so we will have to leave it here. Once again, we have seen the magic and power of numbers and how to make the imperfections of our tools work in our favor. We could even go a step further and calculate the precision of the estimate we have made. But that's another story…

All that glitters is not gold

A brother-in-law of mine is very worried about a dilemma he has gotten into. The thing is that he is going to start a small business and he wants to hire a security guard to stand at the entrance and watch for those who take something without paying for it. And the problem is that there are two candidates and he does not know which of the two to choose. One of them stops nearly everyone, so no thief escapes. Of course, many honest people are offended when they are asked to open their bags before leaving, and so next time they will shop elsewhere. The other guard is the opposite: he stops almost no one, but anyone he does stop is certainly carrying something stolen. He offends few honest people, but too many thieves get away. A difficult decision…

Why does my brother-in-law come to me with this story? Because he knows that I face similar dilemmas every day, each time I have to choose a diagnostic test. And the thing is that there are still people who think that if you get a positive result with a diagnostic tool you have a certain diagnosis of illness and, conversely, that if you are sick you only have to do a test to get the diagnosis. And things are not, by any means, that simple. Not all that glitters is gold, nor is all gold of the same quality.

Let's see it with an example. When we want to know the utility of a diagnostic test we usually compare its results with those of a reference or gold standard, which is a test that, ideally, is always positive in sick people and negative in healthy ones.

Now suppose I carry out a study in my hospital patients with a new diagnostic test for a particular disease and I get the results described below (the sick are those with a positive reference test and the healthy those with a negative one).

Let's start with the easy part. We have 1,598 subjects, 520 of them sick and 1,078 healthy. The test gives us 446 positive results, 428 true (TP) and 18 false (FP). It also gives us 1,152 negatives, 1,060 true (TN) and 92 false (FN). The first thing we can determine is the ability of the test to distinguish between healthy and sick, which leads me to introduce the first two concepts: sensitivity (Se) and specificity (Sp). Se is the likelihood that the test correctly classifies a sick person or, in other words, the probability that a sick person gets a positive result. It is calculated by dividing the TP by the number of sick. In our case it equals 0.82 (if you prefer to use percentages, multiply by 100). In turn, Sp is the likelihood that the test correctly classifies a healthy person or, put another way, the probability that a healthy person gets a negative result. It is calculated by dividing the TN by the number of healthy. In our example, it equals 0.98.

Someone may think that we have finished assessing the value of the new test, but we have only just begun. And this is because with Se and Sp we somehow measure the ability of the test to discriminate between healthy and sick, but what we really need to know is the probability that an individual with a positive result is actually sick and, although they may seem similar concepts, they are actually quite different.

The probability that a positive is sick is known as the positive predictive value (PPV) and is calculated by dividing the number of sick people with a positive test by the total number of positives. In our case it is 0.96. This means that a positive has a 96% chance of being sick. Moreover, the probability that a negative is healthy is expressed by the negative predictive value (NPV), which is the quotient of healthy people with a negative test by the total number of negatives. In our example it equals 0.92 (an individual with a negative result has a 92% chance of being healthy).
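All four indicators come straight out of the four cells of the table. A minimal sketch in Python with the post's hospital numbers:

```python
# The hospital table reduced to its four cells, and the four classic indicators.
tp, fp, fn, tn = 428, 18, 92, 1060

se  = tp / (tp + fn)   # sensitivity               ≈ 0.82
sp  = tn / (tn + fp)   # specificity               ≈ 0.98
ppv = tp / (tp + fp)   # positive predictive value ≈ 0.96
npv = tn / (tn + fn)   # negative predictive value ≈ 0.92
print(f"Se = {se:.2f}, Sp = {sp:.2f}, PPV = {ppv:.2f}, NPV = {npv:.2f}")
```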

And from here on is when the neurons begin to overheat. It turns out that Se and Sp are two intrinsic characteristics of the diagnostic test. Their results will be the same whenever we use the test in similar conditions, regardless of the subjects tested. But this is not so with the predictive values, which vary depending on the prevalence of the disease in the population in which we use the test. This means that the probability that a positive is sick depends on how common or rare the disease is in the population. Yes, you read that right: the same positive test expresses a different risk of being sick. And for the unbelievers, I'll give another example.

Suppose that this same study is repeated by one of my colleagues who works at a community health center, where the population is proportionally healthier than at my hospital (logical: they have not suffered the hospital yet). If you check the results in the corresponding table and take the trouble to calculate them, you will come up with a Se of 0.82 and a Sp of 0.98, the same as I got in my practice. However, if you calculate the predictive values, you will see that the PPV equals 0.9 and the NPV 0.95. And this is so because the prevalence of the disease (sick divided by total) is different in the two populations: 0.32 at my hospital vs 0.19 at the health center. That is, in settings of higher prevalence a positive result is more valuable to confirm the diagnosis of disease, but a negative is less reliable to rule it out. And conversely, if the disease is very rare, a negative result will allow us to reasonably rule out the disease, but a positive will be less reliable when it comes to confirming it.

We see that, as almost always happens in medicine, we are moving on the shaky ground of probability, since all (absolutely all) diagnostic tests are imperfect and make mistakes when classifying healthy and sick people. So when is a diagnostic test worth using? If you think about it, any given subject has a probability of being sick even before performing the test (the prevalence of the disease in his population), and we are only interested in using diagnostic tests that increase this probability enough to justify starting the appropriate treatment (otherwise we would have to do another test to reach the threshold of probability that justifies treatment).

And here is where this matter starts to become a little unfriendly. The positive likelihood ratio (PLR), also known as the positive probability ratio, indicates how much more probable it is to get a positive result in a sick person than in a healthy one. The proportion of positives in sick patients is represented by Se. The proportion of positives in the healthy are the FP, which would be those healthy people without a negative result or, what is the same, 1-Sp. Thus, PLR = Se / (1 – Sp). In our case (the hospital) it equals 41 (the same value whether we use proportions or percentages for Se and Sp). This can be interpreted as: it is 41 times more likely to get a positive in a sick person than in a healthy one.

It is also possible to calculate the NLR (negative likelihood ratio), which expresses how much more likely it is to find a negative result in a sick person than in a healthy one. The negative sick are those who do not test positive (1-Se) and the negative healthy are the same as the TN (the test's Sp). So, NLR = (1 – Se) / Sp. In our example, 0.18.

A ratio of 1 indicates that the result of the test does not change the probability of being sick. If it is greater than 1, the probability increases and, if less than 1, it decreases. This is the parameter used to determine the diagnostic power of the test. Values > 10 (or < 0.1) indicate a very powerful test that strongly supports (or contradicts) the diagnosis; values of 5-10 (or 0.1-0.2) indicate less power of the test to support (or rule out) the diagnosis; 2-5 (or 0.2-0.5) indicate that the contribution of the test is questionable; and, finally, 1-2 (0.5-1) indicate that the test has no diagnostic value.

The likelihood ratio does not express a direct probability, but it allows us to calculate the odds of being sick before and after a positive result of the diagnostic test. We can calculate the pre-test odds (PreO) as the prevalence divided by its complement (how much more probable it is to be sick than not to be). In our case it equals 0.47. Moreover, the post-test odds (PosO) are calculated as the product of the PreO and the PLR. In our case, 19.27. And finally, following the reverse mechanism that we used to get the PreO from the prevalence, the post-test probability (PosP) equals PosO / (PosO + 1). In our example it equals 0.95, which means that if our test is positive the probability of being sick rises from 0.32 (the prevalence) to 0.95 (the post-test probability).
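For anyone who wants to retrace these numbers, here is the same chain of calculations in Python (using the rounded values of the post):

```python
# From Se, Sp and prevalence to likelihood ratios and post-test probability,
# i.e. the arithmetic behind a Fagan's nomogram.
se, sp, prevalence = 0.82, 0.98, 0.32      # rounded hospital values

plr = se / (1 - sp)                        # ≈ 41
nlr = (1 - se) / sp                        # ≈ 0.18
pre_odds = prevalence / (1 - prevalence)   # ≈ 0.47
post_odds = pre_odds * plr                 # ≈ 19.3 after a positive result
post_p = post_odds / (1 + post_odds)       # ≈ 0.95
print(f"PLR = {plr:.0f}, NLR = {nlr:.2f}, post-test probability = {post_p:.2f}")
```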

If there is still anyone reading at this point, I will say that we do not need all this gibberish to get the post-test probability. There are multiple websites with online calculators for all these parameters, which need only the initial 2-by-2 table and a minimum of effort. In addition, the post-test probability can be easily calculated using a Fagan's nomogram. What we really need to know is how to properly assess the information provided by a diagnostic tool to decide whether it is useful, considering its power, costs, patient discomfort, etc.

Just one last remark. We have been talking all the time about positive and negative diagnostic tests but, when the result of the test is quantitative, we must set which value we consider positive and which negative, and all the parameters we have seen will vary depending on this cut-off value, especially Se and Sp. And which of the parameters of the diagnostic test should we give priority to? Well, that depends on the characteristics of the test and on the use we intend to give it, but that's another story…

A never-ending story

Today we will not talk about dragons that take you for a ride if you climb on their hump. Nor will we talk about men with feet on their heads or any other creature from the delusional mind of Michael Ende. Today we are going to talk about another never-ending story: that of the indicators of diagnostic tests.
When you think you know them all, you lift a stone and find another one beneath it. And why are there so many, you may ask? Well, the answer is simple. Although there are indicators that describe very well how a diagnostic test handles healthy and sick people, researchers are still looking for a single good indicator that gives us an overall idea of the diagnostic capability of a test.

There are many diagnostic test indicators that assess the ability of a test to discriminate between sick and healthy by comparing its results with those of a gold standard. They are computed from the cross-classification of positives and negatives in a contingency table, from which you can build the usual indicators: sensitivity, specificity, predictive values, likelihood ratios, the accuracy index and Youden's index.

The problem is that most of them assess the ability of the test only partially, so we need to use them in pairs: sensitivity and specificity, for example. Only the last two of the indicators mentioned can function on their own. The accuracy index measures the percentage of correctly diagnosed subjects, but it treats positives and negatives, true or false, in the same way. Meanwhile, Youden's index takes into account the patients misclassified by the diagnostic test.

In any case, it is not recommended to use either the accuracy index or Youden's index in isolation when evaluating diagnostic tests. Moreover, the latter is difficult to translate into a tangible clinical concept, as it is a linear transformation of sensitivity and specificity.

At this point it is easy to understand why we would like to have a single indicator, simple, easy to interpret and independent of the prevalence of the disease. It would certainly be a good measure of the ability of the diagnostic test and would spare us from having to resort to a pair of indicators.

And this is where some brilliant mind thought of using a well-known and familiar indicator, the odds ratio, to measure the capability of a diagnostic test. Thus, we can define the diagnostic odds ratio (DOR) as the ratio of the odds of testing positive being sick to the odds of testing positive being healthy. As this is quite a tongue-twister, let's discuss the two components of the ratio.

The odds that a sick person tests positive versus negative is simply the quotient between true positives (TP) and false negatives (FN): TP / FN. Likewise, the odds that a healthy person tests positive versus negative is the quotient between false positives (FP) and true negatives (TN): FP / TN. And, having seen this, we just have to take the ratio of the two odds:

DOR = \frac{TP}{FN} / \frac{FP}{TN} = \frac{Se}{1 - Se} / \frac{1 - Sp}{Sp}

DOR can also be expressed in terms of predictive values and likelihood ratios, according to the following expressions:

DOR= \frac{PPV}{1 - PPV} / \frac{1 - NPV}{NPV}

DOR= \frac{PLR}{NLR}

As with any odds ratio, the possible values of DOR range from zero to infinity. The null value is one, which means that the test has no discriminatory capacity between healthy and sick. A value greater than one indicates discriminatory ability, which will be greater the higher the value. Finally, values between zero and one indicate that the test not only does not discriminate well between healthy and sick, but classifies them incorrectly, yielding more negative results among the sick than among the healthy.

DOR is a global measure that is easy to interpret and does not depend on the prevalence of the disease, although it must be said that it can vary among groups of patients with different severity of their disease.

Finally, we can add to its advantages the possibility of constructing its confidence interval from the contingency table, using this little formula that I show you:

Standard\ error (ln DOR)= \sqrt{\frac{1}{TP} + \frac{1}{TN} + \frac{1}{FP} + \frac{1}{FN}}

Yes, I know, a logarithm has crept in, but this is how it goes with odds ratios: as their values are asymmetrical around the null value, these calculations must be done with logarithms. So, once we have the standard error, we can calculate the interval as follows:

CI\ 95\%= \ln DOR \pm 1.96\ SE(\ln DOR)

We just have to apply the antilogarithm to the limits of the interval we got with the formula (the antilog is to raise the number e to the limits obtained).
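As a worked example (reusing, for illustration only, the hospital table from the earlier post on this page, since the DOR post does not give its own numbers), the DOR and its 95% confidence interval can be obtained like this:

```python
# DOR and its 95% CI following the formulas above (TP=428, FP=18, FN=92, TN=1060).
import math

tp, fp, fn, tn = 428, 18, 92, 1060

dor = (tp / fn) / (fp / tn)                     # ≈ 274
se_ln = math.sqrt(1/tp + 1/tn + 1/fp + 1/fn)    # standard error of ln(DOR)
low = math.exp(math.log(dor) - 1.96 * se_ln)    # antilog of the lower limit
high = math.exp(math.log(dor) + 1.96 * se_ln)   # antilog of the upper limit
print(f"DOR = {dor:.0f}, 95% CI {low:.0f} to {high:.0f}")   # ≈ 163 to 460
```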

And I think that is enough for today. We could go further: the DOR has many more virtues. For example, it can be used with tests with continuous results (not just positive or negative), since there is a correlation between the DOR and the area under the test's ROC curve. Furthermore, it can be used in meta-analyses and in logistic regression models, allowing the inclusion of variables to control for the heterogeneity of the primary studies. But that's another story…