## An unfairly treated genius

The genius that I am talking about in the title of this post is none other than Alan Mathison Turing, considered one of the fathers of computer science and a forerunner of modern computing.

For mathematicians, Turing is best known for his involvement in the solution of the decision problem previously proposed by Gottfried Wilhelm Leibniz and David Hilbert, who were seeking to define a method that could be applied to any mathematical sentence to prove whether that sentence were or not true (to those interested in the matter, it could be demonstrated that such a method does not exist).

But what it is Turing is famous for among the general public comes thanks to the cinema and to his work in statistics during World War II. And it is that Turing was taken to exploiting Bayesian magic to deepen the concept of how the evidence we are collecting during an investigation can support the initial hypothesis or not, thus favoring the development of a new alternative hypothesis. This allowed him to decipher the code of the Enigma machine, which was the one used by the German navy’s sailors to encrypt their messages, and that is the story that has been taken to the screen. This line of work led to the development of concepts such as the weight of evidence and concepts of probability, with which confront null and alternative hypotheses, which were applied in biomedicine and enabled the development of new ways to evaluate new diagnostic tests capabilities, such as the ones we are going to deal with today.

But all this story about Alan Turing turn out to be just a recognition of one of the people whose contribution made it possible to develop the methodological design that we are going to talk about today, which is none other than the meta-analysis of diagnostic accuracy.

We already know that a meta-analysis is a quantitative synthesis method that is used in systematic reviews to integrate the results of primary studies into a summary result measure. The most common is to find systematic reviews on treatment, for which the implementation methodology and the choice of summary result measure are quite well defined. Reviews on diagnostic tests, which have been possible after the development and characterization of the parameters that measure the diagnostic performance of a test, are less common.

The process of conducting a diagnostic systematic review essentially follows the same guidelines as a treatment review, although there are some specific differences that we will try to clarify. We will focus first on the choice of the outcome summary measure and try to take into account the rest of the peculiarities when we give some recommendations for a critical appraisal of these studies.

When choosing the outcome measure, we will find the first big difference with the meta-analyzes of treatment. In the meta-analysis of diagnostic accuracy (MDA) the most frequent way to assess the test is to combine sensitivity and specificity as summary values. However, these indicators present the problem that the cut-off points to consider the results of the test as positive or negative usually vary among the different primary studies of the review. Moreover, in some cases positivity may depend on the objectivity of the evaluator (think of results of imaging tests). All this, besides being a source of heterogeneity among the primary studies, constitutes the origin of a typical MDA bias called the threshold effect, in which we will stop a little later.

For this reason, many authors do not like to use sensitivity and specificity as summary measures and resort to positive and negative likelihood ratios. These ratios have two advantages. First, they are more robust against the presence of threshold effect. Second, as we know, they allow calculating the post-test probability either using Bayes’ rule (pre-test odds  x likelihood ratio = posttest odds) or a Fagan’s nomogram (you can review these concepts in the corresponding post).

Finally, a third possibility is to resort to another of the inventions that derive from Turing’s work: the diagnostic odds ratio (DOR).

The DOR is defined as the ratio of the odds of the patient being positive with a test with respect to the odds of being positive while being healthy. This phrase may seem a bit cryptic, but it is not so. The odds of the patient being positive versus being negative is only the ratio between true positives (TP) and false negatives (FN): TP / FN. On the other hand, the odds of the healthy being positive versus negative is the quotient between false positives (FP) and true negatives (TN): FP / TN. And seeing this, we can only define the ratio between the two odds, as you can see in the attached figure. The DOR can also be expressed in terms of the predictive values ​​and the likelihood ratios, according to the expressions that you can see in the same figure. Finally, it is also possible to calculate their confidence interval, according to the formula that ends the figure.

Like all odds ratios, the possible values ​​of the DOR go from zero to infinity. The null value is 1, which means that the test has no discriminatory capacity between the healthy and the sick. A value greater than one indicates discriminatory capacity, which will be greater the greater the value. Finally, values ​​between zero and 1 will indicate that the test not only does not discriminate well between the sick and healthy, but classifies them in a wrong way and gives us more negative values ​​among the sick than among the healthy.

The DOR is a global parameter easy to interpret and does not depend on the prevalence of the disease, although it must be said that it can vary between groups of patients with different severity of disease. In addition, it is also a very robust measure against the threshold effect and is very useful for calculating the summary ROC curves that we will comment on below.

The second peculiar aspect of MDA that we are going to deal with is the threshold effect. We must always assess their presence when we find ourselves before a MDA. The first thing will be to observe the clinical heterogeneity among the primary studies, which could be evident without needing to make many considerations. There is also a simple mathematical form, which is to calculate the Spearman’s correlation coefficient between sensitivity and specificity . If there is a threshold effect, there will be an inverse correlation between the two, the stronger the higher the threshold effect.

Finally, a graphical method is to assess the dispersion of the sensitivity and specificity representation of the primary studies on the summary ROC curve of the meta-analysis. A dispersion allows us to suspect the threshold effect, but it can also occur due to the heterogeneity of the studies and other biases such as selection’s or verification’s.

The third specific element of MDA that we are going to comment on is that of the summary ROC curve (sROC), which is an estimate of the common ROC curve adjusted according to the results of the primary studies of the review. There are several ways to calculate it, some quite complicated from the mathematical point of view, but the most used are the regression models that use the DOR as an estimator, since, as we have said, it is very robust against heterogeneity and the threshold effect. But do not be alarmed, most of the statistical packages calculate and represent the sROC with little effort.

The reading of sROC is similar to that of any ROC curve. The two more used parameters are area under the ROC curve (AUC) and Q index. The AUC of a perfect curve is equal to 1. Values above 0.5 indicate its discriminatory diagnostic capacity, which will be higher the closer it gets to 1. A value of 0.5 tells us that the usefulness of the test is the same that flipping a coin. Finally, values ​​below 0.5 indicate that the test does not contribute at all to the diagnosis it intends to perform.

On the other hand, the Q index corresponds to the point at which sensitivity and specificity are equal. Similar to AUC manner, a value greater than 0.5 indicate the overall effectiveness of the diagnostic test, which will be higher the closer the index value is to 1. In addition, confidence intervals can also be calculated both for AUC as Q index, with which it will be possible to assess the precision of the estimation of the summary measure of the MDA.

Once seen (at a glance) the specific aspects of MDA, we will give some recommendations to perform the critical appraising of this type of study. CASP network does not provide a specific tool for MDA, but we can follow the lines of the systematic review of treatment studies taking into account the differential aspects of MDA. As always, we will follow our three basic pillars: validity, relevance and applicability.

Let’s start with the questions that value the VALIDITY of the study.

The first question asks if it has been clearly specified the issue of the review. As with any systematic review, diagnostic tests’ should try to answer a specific question that is clinically relevant, and which is usually proposed following the PICO scheme of a structured clinical question. The second question makes us reflect if the type of studies that have been included in the review are adequate. The ideal design is that of a cohort to which the diagnostic test that we want to assess and the gold standard are blindly and independently applied. Other studies based on case-control designs are less valid for the evaluation of diagnostic tests, and will reduce the validity of the results.

If the answer to both questions is yes, we turn to the secondary criteria. Have important studies that have to do with the subject been included? We must verify that a global and unbiased search of the literature has been carried out. The methodology of the search is similar to that of systematic reviews on treatment, although we should take some precautions. For example, diagnostic studies are usually indexed differently in databases, so the use of the usual filters of other types of revisions can cause us to lose relevant studies. We will have to carefully check the search strategy, which must be provided by the authors of the review.

In addition, we must verify that the authors have ruled out the possibility of a publication bias. This poses a special problem in MDA, since the study of the publication bias in these studies is not well developed and the usual methods such as the funnel plot or the Egger’s test are not very reliable. The most conservative thing to do is always assume that there may be a publication bias.

It is very important that enough has been done to assess the quality of the studies, looking for the existence of possible biases. For this the authors can use specific tools, such as the one provided by the QUADAS-2 declaration.

To finish the section of internal or methodological validity, we must ask ourselves if it was reasonable to combine the results of the primary studies. It is fundamental, in order to draw conclusions from combined data, that studies are homogeneous and that the differences among them are due solely to chance. We will have to assess the possible sources of heterogeneity and if there may be a threshold effect, which the authors have had to take into account.

In summary, the fundamental aspects that we will have to analyze to assess the validity of a MDA will be: 1) that the objectives are well defined; 2) that the bibliographic search has been exhaustive; and 3) that the internal or methodological validity of the included studies has also been verified. In addition, we will review the methodological aspects of the meta-analysis technique: the convenience of combining the studies to perform a quantitative synthesis, an adequate evaluation of the heterogeneity of the primary studies and the possible threshold effect and use of an adequate mathematical model to combine the results of the primary studies (sROC, DOR, etc.).

Regarding the RELEVANCE of the results we must consider what is the overall result of the review and if the interpretation has been made in a judicious manner. We will value more those MDA that provide more robust measures against possible biases, such as likelihood ratios and DOR. In addition, we must assess the accuracy of the results, for which we will use our beloved confidence intervals, which will give us an idea of ​​the precision of the estimation of the true magnitude of the effect in the population.

We will conclude the critical appraisal of MDA assessing the APPLICABILITY of the results to our environment. We will have to ask whether we can apply the results to our patients and how they will influence the attention to them. We will have to see if the primary studies of the review describe the participants and if they resemble our patients. In addition, it will be necessary to see if all the relevant results have been considered for decision making in the problem under study and, as always, the benefit-cost-risk ratio must be assessed. The fact that the conclusion of the review seems valid does not mean that we have to apply it in a compulsory way.

Well, with all that said, we are going to finish today. The title of this post refers to the mistreatment suffered by a genius. We already know what genius we were referring to: Alan Turing. Now, we will clarify the abuse. Despite being one of the most brilliant minds of the 20th century, as witnessed by his work on statistics, computing, cryptography, cybernetics, etc., and having saved his country from the blockade of the German Navy during the war, in 1952 he was tried for his homosexuality and convicted of serious indecency and sexual perversion. As it is easy to understand, his career ended after the trial and Alan Turing died in 1954, apparently after eating a piece of an apple poisoned with cyanide, which was labeled as suicide, although there are theories that speak rather of murder. They say that from here comes the bitten apple of a well-known brand of computers, although there are others who say that the apple just represents a play on words between bite and byte.

I do not know which of the two theories is true, but I prefer to recall Turing every time I see the little-apple. My humble tribute to a great man.

And now we finish. We have seen the peculiarities of the meta-analyzes of diagnostic accuracy and how to assess them. Much more could be said of all the mathematics associated with its specific aspects such as the presentation of variables, the study of publication bias, the threshold effect, etc. But that’s another story…

## You have to know what you are looking for

### Critical appraisal of diagnostic studies

Every day we find articles that show new diagnostic tests that appear to have been designed to solve all our problems. But we should not be tempted to pay credit to everything we read before reconsidering what we have, in fact, read. At the end of the day, if we paid attention to everything we read we would be swollen from drinking Coca-Cola.

We know that a diagnostic test is not going to say whether or not a person is sick. Its result will only allow us to increase or decrease the probability that the individual is sick or not so we can confirm or rule out the diagnosis, but always with some degree of uncertainty.

Anyone has a certain risk of suffering from any disease, which is nothing more than the prevalence of the disease in the general population. Below a certain level of probability, it seems so unlikely that the patient is sick that we leave him alone and do not do any diagnostic tests (although some find it hard to restrain the urge to always ask for something). This is the diagnostic or test threshold.

But if, in addition to belonging to the population, one has the misfortune of having symptoms, that probability will increase until this threshold is exceeded, in which the probability of presenting the disease justifies performing diagnostic tests. Once we have the result of the test that we have chosen, the probability (post-test probability) will have changed. It may have changed to less and it has been placed below the test threshold, so we discard the diagnosis and leave the patient alone again. It may also exceed another threshold, the therapeutic, from which the probability of the disease reaches the sufficient level so as not to need further tests and to be able to initiate the treatment.

#### Usefulness of diagnostic test

The usefulness of the diagnostic test will be in its ability to reduce the probability below the threshold of testing (and discard the diagnosis) or, on the contrary, to increase it to the threshold at which it is justified to start treatment. Of course, sometimes the test leaves us halfway and we have to do additional tests before confirming the diagnosis with enough security to start the treatment.

Diagnostic tests studies should provide information about the ability of a test to produce the same results when performed under similar conditions (reliability) and about the accuracy with which the measurements reflect that measure (validity). But they also give us data about their discriminatory power (sensitivity and specificity), their clinical performance (positive predictive value and negative predictive value), its ability to modify the probability of illness and change our position between the two thresholds (likelihood ratios), and about other aspects that allow us to assess whether it’s worth to test our patients with the diagnostic test. And to check if a study gives us the right information we need to make a critical appraisal and read the paper based on our three pillars: validity, relevance and applicability.

#### Let’s go with the critical appraisal of diagnostic studies

Let’s start with VALIDITY. First, we’ll make ourselves some basic eliminating questions about primary criteria about the study. If the answer to these questions is no, the best you can do probably is to use the article to wrap your mid-morning snack.

Was the diagnostic test blindly and independently compared with an appropriate gold standard or reference test?. We must review that results of reference test were not interpreted differently depending on the results of the study test, thus committing an incorporation bias, which could invalidate the results. Another problem that can arise is that the reference test results are frequently inconclusive. If we made the mistake of excluding that doubtful cases we’d commit and indeterminate exclusion bias that, in addition to overestimate the sensitivity and specificity of the test, will compromise the external validity of the study, whose conclusions would only be applicable to patients with indeterminate result.

Do patients encompass a similar spectrum to which we will find in our practice?. The inclusion criteria of the study should be clear, and the study must include healthy and diseased with varying severity or progression stages of disease. As we know, the prevalence influences the clinical performance of the test so if it’s validated, for example, in a tertiary center (the probability of being sick is statistically greater) its diagnostic capabilities will be overestimated when we use the test at a Primary Care center or with the general population (where the proportion of diseased will be lower).

At this point, if we think it’s worth reading further, we’ll focus on secondary criteria, which are those that add value to the study design. Another question to ask is: had the study test’s results any influence in the decision to do the reference test?. We have to check that there hasn’t been a sequence bias or a diagnostic verification bias, whereby excluding those with negative test. Although this is common in current practice (we start with simple tests and perform the more invasive ones only in positive patients), doing so in a diagnostic test study affect the validity of the results. Both tests should be done independently and blindly, so that the subjectivity of the observer does not influence the results (review bias). Finally, is the method described with enough detail to allow its reproduction?. It should be clear what is considered normal and abnormal and what criteria we have used to define normal and how we have interpreted the results of the test.

Having analyzed the internal validity of the study we’ll appraise the RELEVANCE of the presented data. The purpose of a diagnostic study is to determine the ability of a test to correctly classify individuals according to the presence or absence of disease. Actually, and to be more precise, we want to know how the likelihood of being ill increases after the test’s result (post-test probability). It’s therefore essential that the study gives information about the direction and magnitude of this change (pretest / posttest), that we know depends on the characteristics of the test and, to a large extent, on the prevalence or pretest probability.

Do the work present likelihood ratios or is it possible to calculate them from the data?. This information is critical because if not, we couldn’t estimate the clinical impact of the study test. We have to be especially careful with tests with quantitative results in which the researcher has established a cutoff of normality. When using ROC curves, it is usual to move the cutoff to favor sensitivity or specificity of the test, but we must always appraise how this measure affects the external validity of the study, since it may limit its applicability to a particular group of patients.

How reliable are the results?. We will have to determine whether the results are reproducible and how they can be affected by variations among different observers or when retested in succession. But we have not only to assess the reliability, but also how accurate the results are. The study was done on a sample of patients, but it should provide an estimate of their values in the population, so results should be expressed with their corresponding confident intervals.

The third pillar in critical appraising is that of APLICABILITY or external validity, which will help us to determine whether the results are useful to our patients. In this regard, we ask three questions. Is the test available and is it possible to perform it in our patients?. If the test is not available all we’ll have achieved with the study is to increase our vast knowledge. But if we can apply the test we must ask whether our patients fulfill the inclusion and exclusion criteria of the study and, if not, consider how these differences may affect the applicability of the test.

The second question is if we know the pretest probability of our patients. If our prevalence is very different from that of the study the actual usefulness of the test can be modified. One solution may be to do a sensitivity analysis evaluating how the study results would be modified after changing values of pre and posttest probability to a different ones that are clinically reasonable.

Finally, we should ask ourselves the most important question: can posttest probability change our therapeutic attitude, so being helpful to the patient?. For example, if the pretest probability is very low, probably the posttest probability will be also very low and won’t reach the therapeutic threshold, so it would be not worth spending money and effort with the test. Conversely, is pretest probability is very high it may be worth starting treatment without any more evidence, unless the treatment is very expensive or dangerous. As always, the virtue will be in the middle and it will be in these intermediate areas where more benefits can be obtained from the studied diagnostic test. In any case, we must never forget who our boss is (I mean the patient, not our boss at the office): you must not to be content only with studying the effectiveness or cost-effectiveness, but also consider the risks, discomfort, and patients preferences and the consequences that can lead to the performing of the diagnostic test.

If you allow me an advice, when critically appraising an article about diagnostic tests I recommend you to use the CASP’s templates, which can be downloaded from the website. They will help you make the critical appraising in a systematic and easy way.

A clarification to go running out: we must not confuse the studies of diagnostic tests with diagnostic prediction rules. Although the assessment is similar, the prediction rules have specific characteristics and methodological requirements that must be assessed in an appropriate way and that we will see in another post.

#### We’re leaving…

Finally, just say that everything we have said so far applies to the specific papers about diagnostic tests. However, the assessment of diagnostic tests may be part of observational studies such as cohort or case-control studies, which can have some peculiarity in the sequence of implementation and validation criteria of the study and reference test. But that’s another story…

## The guard’s dilemma

### Sensitivity and specificity

The world of medicine is a world of uncertainty. We can never be sure of anything at 100%, however obvious it may seem a diagnosis, but we cannot beat right and left with ultramodern diagnostics techniques or treatments (that are never safe) when making the decisions that continually haunt us in our daily practice.

That’s why we are always immersed in a world of probabilities, where the certainties are almost as rare as the so-called common sense which, as almost everyone knows, is the least common of the senses.

Imagine you are in the clinic and a patient comes because he has been kicked in the ass, pretty strong, though. As good doctor as we are, we ask that of what’s wrong?, since when?, and what do you attribute it to? And we proceed to a complete physical examination, discovering with horror that he has a hematoma on the right buttock.

#### Different approaches for a differential diagnosis

Here, my friends, the diagnostic possibilities are numerous, so the first thing we do is a comprehensive differential diagnosis. To do this, we can take four different approaches. The first is the possibilistic approach, listing all possible diagnoses and try to rule them all simultaneously applying the relevant diagnostic tests. The second is the probabilistic approach, sorting diagnostics by relative chance and then acting accordingly. It seems a posttraumatic hematoma (known as the kick in the ass syndrome), but someone might think that the kick has not been so strong, so maybe the poor patient has a bleeding disorder or a blood dyscrasia with secondary thrombocytopenia or even an atypical inflammatory bowel disease with extraintestinal manifestations and gluteal vascular fragility. We could also use a prognostic approach and try to show or rule out possible diagnostics with worst prognosis, so the diagnosis of the kicked in the ass syndrome lose interest and we were going to rule out chronic leukemia. Finally, a pragmatic approach could be used, with particular interest in first finding diagnostics that have a more effective treatment (the kick will be, one more time, the number one).

It seems that the right thing is to use a judicious combination of pragmatic, probabilistic and prognostic approaches. In our case we will investigate if the intensity of injury justifies the magnitude of bruising and, in that case, we would indicate some hot towels and we would refrain from further diagnostic tests. And this example may seems to be bullshit, but I can assure you I know people who make the complete list and order the diagnostic tests when there are any symptoms, regardless of expenses or risks. And, besides, one that I could think of, could assess the possibility of performing a more exotic diagnostic test that I cannot imagine, so the patient would be grateful if the diagnosis doesn’t require to make a forced anal sphincterotomy. And that is so because, as we have already said, the waiting list to get some common sense exceeds in many times the surgical waiting list.

#### The two thresholds

Now imagine another patient with a symptom complex less stupid and absurd than the previous example. For instance, let’s think about a child with symptoms of celiac disease. Before we make any diagnostic test, our patient already has a probability of suffering the disease. This probability will be conditioned by the prevalence of the disease in the population from which she proceeds and is called the pretest probability. This probability will stand somewhere between two thresholds: the diagnostic threshold and the therapeutic threshold.

The usual thing is that the pre-test probability of our patient does not allow us to rule out the disease with reasonable certainty (it would have to be very low, below the diagnostic threshold) or to confirm it with sufficient security to start the treatment (it would have to be above the therapeutic threshold).

We’ll then make the indicated diagnostic test, getting a new probability of disease depending on the result of the test, the so-called post-test probability. If this probability is high enough to make a diagnosis and initiate treatment, we’ll have crossed our first threshold, the therapeutic one. There will be no need for additional tests, as we will have enough certainty to confirm the diagnosis and treat the patient, always within a range of uncertainty.

And what determines our treatment threshold? Well, there are several factors involved. The greater the risk, cost or adverse effects of the treatment in question, the higher the threshold that we will demand to be treated. In the other hand, as much more serious is the possibility of omitting the diagnosis, the lower the therapeutic threshold that we’ll accept.

But it may be that the post-test probability is so low that allows us to rule out the disease with reasonable assurance. We shall then have crossed our second threshold, the diagnostic one, also called the no-test threshold. Clearly, in this situation, it is not indicated further diagnostic tests and, of course, starting treatment.

However, very often changing pretest to post-test probability still leaves us in no man’s land, without achieving any of the two thresholds, so we will have to perform additional tests until we reach one of the two limits.

And this is our everyday need: to know the post-test probability of our patients to know if we discard or confirm the diagnosis, if we leave the patient alone or we lash her out with our treatments. And this is so because the simplistic approach that a patient is sick if the diagnostic test is positive and healthy if it is negative is totally wrong, even if it is the general belief among those who indicate the tests. We will have to look, then, for some parameter that tells us how useful a specific diagnostic test can be to serve the purpose we need: to know the probability that the patient suffers the disease.

#### The guard’s dilemma

And this reminds me of the enormous problem that a brother-in-law asked me about the other day. The poor man is very concerned with a dilemma that has arisen. The thing is that he’s going to start a small business and he wants to hire a security guard to stay at the entrance door and watch for those who take something without paying for it. And the problem is that there’re two candidates and he doesn’t know who of the two to choose. One of them stops nearly everyone, so no burglar escapes. Of course, many honest people are offended when they are asked to open their bags before leaving and so next time they will buy elsewhere. The other guard is the opposite: he stops almost anyone but the one he spots certainly brings something stolen. He offends few honest people, but too many grabbers escape. A difficult decision…

Why my brother-in-law comes to me with this story? Because he knows that I daily face with similar dilemmas every time I have to choose a diagnostic test to know if a patient is sick and I have to treat her. We have already said that the positivity of a test does not assure us the diagnosis, just as the bad looking of a client does not ensure that the poor man has robbed us.

Let’s see it with an example. When we want to know the utility of a diagnostic test, we usually compare its results with those of a reference or gold standard, which is a test that, ideally, is always positive in sick patients and negative in healthy people. Now let’s suppose that I perform a study in my hospital office with a new diagnostic test to detect a certain disease and I get the results from the attached table (the patients are those who have the positive reference test and the healthy ones, the negative).

#### Sensitivity and specificity

Let’s start with the easy part. We have 1598 subjects, 520 out of them sick and 1078 healthy. The test gives us 446 positive results, 428 true (TP) and 18 false (FP). It also gives us 1152 negatives, 1060 true (TN) and 92 false (FN). The first we can determine is the ability of the test to distinguish between healthy and sick, which leads me to introduce the first two concepts: sensitivity (Se) and specificity (Sp). Se is the likelihood that the test correctly classifies a patient or, in other words, the probability that a patient gets a positive result. It’s calculated dividing TP by the number of sick. In our case it equals 0.82 (if you prefer to use percentages you have to multiply by 100). Moreover, Sp is the likelihood that the test correctly classifies a healthy or, put another way, the probability that a healthy gets a negative result. It’s calculated dividing TN by the number of healthy. In our example, it equals 0.98.

Someone may think that we have assessed the value of the new test, but we have just begun to do it. And this is because with Se and Sp we somehow measure the ability of the test to discriminate between healthy and sick, but what we really need to know is the probability that an individual with a positive results being sick and, although it may seem to be similar concepts, they are actually quite different.

The probability of a positive of being sick is known as the positive predictive value (PPV) and is calculated dividing the number of patients with a positive test by the total number of positives. In our case it is 0.96. This means that a positive has a 96% chance of being sick. Moreover, the probability of a negative of being healthy is expressed by the negative predictive value (NPV), with is the quotient of healthy with a negative test by the total number of negatives. In our example it equals 0.92 (an individual with a negative result has 92% chance of being healthy). This is already looking more like what we said at the beginning that we needed: the post-test probability that the patient is really sick.

#### Predictive values

And from now on is when neurons begin to be overheated. It turns out that Se and Sp are two intrinsic characteristics of the diagnostic test. Their results will be the same whenever we use the test in similar conditions, regardless of the subjects of the test. But this is not so with the predictive values, which vary depending on the prevalence of the disease in the population in which we test. This means that the probability of a positive of being sick depends on how common or rare the disease in the population is. Yes, you read this right: the same positive test expresses different risk of being sick, and for unbelievers, I’ll put another example.

Suppose that this same study is repeated by one of my colleagues who works at a community health center, where population is proportionally healthier than at my hospital (logical, they have not suffered the hospital yet). If you check the results in the table and bring you the trouble to calculate it, you may come up with a Se of 0.82 and a Sp of 0.98, the same that I came up with in my practice. However, if you calculate the predictive values, you will see that the PPV equals 0.9 and the NPV 0.95. And this is so because the prevalence of the disease (sick divided by total) is different in the two populations: 0.32 at my practice vs 0.19 at the health center. That is, in cases of highest prevalence a positive value is more valuable to confirm the diagnosis of disease, but a negative is less reliable to rule it out. And conversely, if the disease is very rare a negative result will reasonably rule out disease but a positive will be less reliable at the time to confirm it.

We see that, as almost always happen in medicine, we are moving on the shaking ground of probability, since all (absolutely all) diagnostic tests are imperfect and make mistakes when classifying healthy and sick. So when is a diagnostic test worth of using it? If you think about it, any particular subject has a probability of being sick even before performing the test (the prevalence of disease in her population) and we’re only interested in using diagnostic tests if that increase this likelihood enough to justify the initiation of the appropriate treatment (otherwise we would have to do another test to reach the threshold level of probability to justify treatment).

#### Likelihood ratios

And here is when this issue begins to be a little unfriendly. The positive likelihood ratio (PLR), indicates how much more probable is to get a positive with a sick than with a healthy subject. The proportion of positive in sick patients is represented by Se. The proportion of positives in healthy are the FP, which would be those healthy without a negative result or, what is the same, 1-Sp. Thus, PLR = Se / (1 – Sp). In our case (hospital) it equals 41 (the same value no matter we use percentages for Se and Sp). This can be interpreted as it is 41 times more likely to get a positive with a sick than with a healthy.

It’s also possible to calculate NLR (negative), which expresses how much likely is to find a negative in a sick than in a healthy. Negative patients are those who don’t test positive (1-Se) and negative healthy are the same as the TN (the test’s Sp). So, NLR = (1 – Se) / Sp. In our example, 0.18.

A ratio of 1 indicates that the result of the test doesn’t change the likelihood of being sick. If it’s greater than 1 the probability is increased and, if less than 1, decreased. This is the parameter used to determine the diagnostic power of the test. Values > 10 (or > 0.01) indicates that it’s a very powerful test that supports (or contradict) the diagnosis; values from 5-10 (or 0.1-0.2) indicates low power of the test to support (or disprove) the diagnosis; 2-5 (or 0.2-05) indicates that the contribution of the test is questionable; and, finally, 1-2 (0.5-1) indicates that the test has not diagnostic value.

#### Postest probbility and Fagan’s nomogram

The likelihood ratio does not express a direct probability, but it helps us to calculate the probabilities of being sick before and after testing positive by means of the Bayes’ rule, which says that the posttest odds is equal to the product of the pretest odds by the likelihood ratio. To transform the prevalence into pre-test odds we use the formula odds = p / (1-p). In our case, it would be 0.47. Now we can calculate the post-test odds (PosO) by multiplying the pretest odds by the likelihood ratio. In our case, the positive post-test odds value is 19.27. And finally, we transform the post-test odds into post-test probability using the formula p = odds / (odds + 1). In our example it values 0.95, which means that if our test is positive the probability of being sick goes from 0.32 (the pre-test probability) to 0.95 (post-test probability).

If there’s still anyone reading at this point, I’ll say that we don’t need all this gibberish to get post-test probability. There are multiple websites with online calculators for all these parameters from the initial 2 by 2 table with a minimum effort. I addition, the post-test probability can be easily calculated using a Fagan’s nomogram (see attached figure). This graph represents in three vertical lines from left to right the pre-test probability (it is represented inverted), the likelihood ratios and the resulting post-test probability.

To calculate the post-test probability after a positive result, we draw a line from the prevalence (pre-test probability) to the PLR and extend it to the post-test probability axis. Similarly, in order to calculate post-test probability after a negative result, we would extend the line between prevalence and the value of the NLR.

In this way, with this tool we can directly calculate the post-test probability by knowing the likelihood ratios and the prevalence. In addition, we can use it in populations with different prevalence, simply by modifying the origin of the line in the axis of pre-test probability.

So far we have defined the parameters that help us to quantify the power of a diagnostic test and we have seen the limitations of sensitivity, specificity and predictive values and how the most useful in a general way are the likelihood ratios. But, you will ask, what is a good test?, is it a sensitive one?, a specific?, both?

Here we are going to return to the guard’s dilemma that has arisen to my poor brother-in-law, because we have left him abandoned and we have not answered yet which of the two guards we recommend him to hire, the one who ask almost everyone to open their bags and so offending many honest people, or the one who almost never stops honest people but, stopping almost anyone, many thieves get away.

#### The resolution of the dilemma

And what do you think is the better choice? The simple answer is: it depends. Those of you who are still awake by now will have noticed that the first guard (the one who checks many people) is the sensitive one while the second is the specific one. What is better for us, the sensitive or the specific guard? It depends, for example, on where our shop is located. If your shop is located in a heeled neighborhood the first guard won’t be the best choice because, in fact, few people will be stealers and we’ll prefer not to offend our customers so they don’t fly away. But if our shop is located in front of the Cave of Ali Baba we’ll be more interested in detecting the maximum number of clients carrying stolen stuff. Also, it can depend on what we sell in the store. If we have a flea market we can hire the specific guard although someone can escape (at the end of the day, we’ll lose a few amount of money). But if we sell diamonds we’ll want no thieve to escape and we’ll hire the sensitive guard (we’ll rather bother someone honest than allows anybody escaping with a diamond).

The same happens in medicine with the choice of diagnostic tests: we have to decide in each case whether we are more interested in being sensitive or specific, because diagnostic tests not always have a high sensitivity (Se) and specificity (Sp).

In general, a sensitive test is preferred when the inconveniences of a false positive (FP) are smaller than those of a false negative (FN). For example, suppose that we’re going to vaccinate a group of patients and we know that the vaccine is deadly in those with a particular metabolic error. It’s clear that our interest is that no patient be undiagnosed (to avoid FN), but nothing happens if we wrongly label a healthy as having a metabolic error (FP): it’s preferable not to vaccinate a healthy thinking that it has a metabolopathy (although it hasn’t) that to kill a patient with our vaccine supposing he was healthy. Another less dramatic example: in the midst of an epidemic our interest will be to be very sensitive and isolate the largest number of patients. The problem here if for the unfortunate healthy who test positive (FP) and get isolated with the rest of sick people. No doubt we’d do him a disservice with the maneuver. Of course, we could do to all the positives to the first test a second confirmatory one that is very specific in order to avoid bad consequences to FP people.

On the other hand, a specific test is preferred when it is better to have a FN than a FP, as when we want to be sure that someone is actually sick. Imagine that a test positive result implies a surgical treatment: we’ll have to be quite sure about the diagnostic so we don’t operate any healthy people.

Another example is a disease whose diagnosis can be very traumatic for the patient or that is almost incurable or that has no treatment. Here we´ll prefer specificity to not to give any unnecessary annoyance to a healthy. Conversely, if the disease is serious but treatable we´ll probably prefer a sensitive test.

#### ROC curves

So far we have talked about tests with a dichotomous result: positive or negative. But, what happens when the result is quantitative? Let’s imagine that we measure fasting blood glucose. We must decide to what level of glycemia we consider normal and above which one will seem pathological. And this is a crucial decision, because Se and Sp will depend on the cutoff point we choose.

To help us to choose we have the receiver operating characteristic, known worldwide as the ROC curve. We represent in coordinates (y axis) the Se and in abscissas the complementary Sp (1-Sp) and draw a curve in which each point represents the probability that the test correctly classifies a healthy-sick couple taken at random. The diagonal of the graph would represent the “curve” if the test had no ability to discriminate healthy from sick patients.

As you can see in the figure, the curve usually has a segment of steep slope where the Se increases rapidly without hardly changing the Sp: if we move up we can increase Se without practically increasing FP. But there comes a time when we get to the flat part. If we continue to move to the right, there will be a point from which the Se will no longer increase, but will begin to increase FP. If we are interested in a sensitive test, we will stay in the first part of the curve. If we want specificity we will have to go further to the right. And, finally, if we do not have a predilection for either of the two (we are equally concerned with obtaining FP than FN), the best cutoff point will be the one closest to the upper left corner. For this, some use the so-called Youden’s index, which optimizes the two parameters to the maximum and is calculated by adding Se and Sp and subtracting 1. The higher the index, the fewer patients misclassified by the diagnostic test.

#### A parameter of interest is the area under the curve (AUC), which represents the probability that the diagnostic test correctly classifies the patient who is being tested (see attached figure). An ideal test with Se and Sp of 100% has an area under the curve of 1: it always hits. In clinical practice, a test whose ROC curve has an AUC> 0.9 is considered very accurate, between 0.7-0.9 of moderate accuracy and between 0.5-0.7 of low accuracy. On the diagonal, the AUC is equal to 0.5 and it indicates that it does not matter if the test is done by throwing a coin in the air to decide if the patient is sick or not. Values below 0.5 indicate that the test is even worse than chance, since it will systematically classify patients as healthy and vice versa.We’re leaving…

Curious these ROC curves, aren`t they? Its usefulness is not limited to the assessment of the goodness of diagnostic tests with quantitative results. The ROC curves also serve to determine the goodness of fit of a logistic regression model to predict dichotomous outcomes, but that is another story…

## The imperfect screening

Nobody is perfect. It is a fact. And a relief too. Because the problem is not to be imperfect, it is inevitable. The real problem is to believe one being perfect, to be ignorant of one’s limitations. And the same goes for many other things, such as diagnostic tests used in medicine.

But this is a real crime with diagnostic tools because, beyond its imperfection, it is possible to misclassify healthy and sick people. Don’t you believe me?. Let’s make some reflections.

To begin with, take a look at the Venn’s diagram I have drawn. What childhood memories these diagrams bring to me!. The filled square symbolizes our population in question. Up the diagonal are the sick (SCK) and down it the healthy (HLT), so that each area represents the probability of being SCK or HLT. The area of the square, obviously, equals 1: we can be certain that anybody will be healthy or sick, two mutually excluding situations. The ellipse encompasses the subjects undergoing the diagnostic test and getting a positive result (POS). In a perfect world, the entire ellipse would be above the diagonal, but in the real imperfect world the ellipse is crossed by the diagonal, so the results can be true POS (TP) or false (FP), the latter when are obtained in healthy. The area outside the ellipse would be the negatives (NEG), which, as you can see, are also divided into true and false (TN, FN).

Now let’s transfer this to the typical contingency table to define the probabilities of different options and think about a situation where we still have not carried out the test. In this case, the columns condition the probabilities of the events of the rows. For example, the upper left box represents the probability of POS in the SCK (once you are sick, how likely you are to get a positive result?), which we call the sensitivity (SEN). For its part, the lower right represents the probability of a NEG in a HLT, which we call specificity (SPE). The total of the first column represents the probability of being sick, which is nothing more than the prevalence (PRV), and so we can discern what the significance of the probability of each cell is. This table provides two features of the test, SEN and SPE, which, as we know, are intrinsic whenever it is performed under similar conditions, even though if the populations are different.

And what about the contingency table once you have carried out the test?. A subtle, but very important, change has taken place: now the rows condition the probabilities of the events of the columns. The total of the table do not change but do look now at the first cell, that represents the probability of being SCK given that the result has been POS (when positive, what is the probability of being sick?). And this is no longer the SEN, but the positive predictive value (PPV). The same applies to the lower right cell, which now represents the probability of being HLT given that the result has been NEG: the negative predictive value (NPV).

So we see that before performing the test we can usually will know its SEN and SPE, while once perform the test we can calculate its positive and negative predictive values, remaining these four test’s characteristics linked through the magic of Bayes’ theorem. Of course, regarding PPV and NPV there’s a fifth element to take into account: the prevalence. We know that predictive values vary depending on the PRV of the disease in the population, while SEN and SPE remain unchanged.

And all this has its practical expression. Let’s invent an example to messing around a bit more. Suppose we have a population of one million inhabitants in which we conduct a screening for fildulastrosis. We know from previous studies that the test SEN is 0.66 and SPE is 0.96, and the prevalence of fildulastrosis is 0.0001 (1 in 10,000); a rare disease that I would advise you not to bother to look for it, if anyone has thought about it.

Knowing the PRV is easy to calculate that in our country there are 100 SCK. Of these, 66 will be POS (SEN = 0.66) and 34 will be NEG. Moreover, there will be 990,900 healthy, of which 96% (959904) will be NEG (SPE = 0.96) and the rest (39,996) will be POS. In short, we’ll get 40,062 POS, of which 39,996 will be FP. No one feel scared about the high number of false positives. This is because we have chosen a very rare disease, so there are many FP even though the SPE is quite high. Consider that in real life, we’d need to do the confirmatory test to all these subjects to finish confirming the diagnosis only in 66 people. Therefore, it’s very important to think well if the screening is worth doing before starting to look for the disease in the population. For this and many other reasons.

We can now calculate the predictive values. PPV is the ratio between true and the total of POS: 66/40062 = 0.0016. So, there will be one sick in 1,500 positive, more or less. Similarly, the NPV is the ratio between true and the total of NEG: 959904/959938 = 0.99. As expected, given the high SPE of the test, to get a negative result makes it highly improbable to be sick.

What do you think? Is it a useful test for mass screening with such a number of false positives and a PPV of 0.0016?. Well, while it may seem counterintuitive, if we think about it for a moment, it’s not so bad. The pretest probability of being SCK is 0.0001 (PRV). The posttest probability is 0.0016 (PPV). So, their ratio has a value of 0.0016/0.0001 = 16, which means we have multiplied by 16 our ability to detect the sick. Therefore, the test doesn’t seem so bad, but we must take into account many other factors before starting to screen.

All this we have seen so far has an additional practical application. Suppose you only know SEN and SPE, but we don’t know the PRV of the disease in the population that we have screened. Can we be estimated it from the results of the screening?. The answer is, of course, yes.

Imagine again our population of one million subjects. We do the test and get 40,062 positive. The problem here is that some of these (the most) are FP. Also, we don’t know how many patients have tested negative (FN). How can we get then the number of sick people?. Let’s think about it for a while.

We have said that the number of patients will be equal to the number of POS minus the number of FP and plus the number of FN:

Nº sick = Total POS – Nº FP + Nº FN

We have the number of POS: 40,062. The FP will be those healthy (1-PRV) who get positive being healthy (or the healthy that doesn’t get NEG: 1-SPE). Then, the total number of FP will be:

FP = (1-PRV)(1-SPE) x n (1 million, the population’s size)

Finally, FN will be sick people (PRV) which don’t get a positive (SEN-1). Then, the total number of FN is:

FN = PRV x (1-SEN) x n (1 million, the population’s size)

If we substitute the total of FP and FN in the first equation with the values we’ve just derived, we can get the PRV, obtaining the following formula:

We can now calculate the prevalence in our population:

Well, I think one of my lobes has just melted down, so we’ll have to leave it there. Once again, we’ve seen the magic and power of number and how to make that the imperfections of our tools work in our favor. We could even go a step further and calculate the accuracy of the estimate we’ve done. But that’s another story…

## All that glitters is not gold

A brother-in-law of mine is very concerned with a dilemma he’s gotten into. The thing is that he’s going to start a small business and he wants to hire a security guard to stay at the entrance door and watch for those who take something without paying for it. And the problem is that there’re two candidates and he doesn’t know what of both to choose. One of them stops nearly everyone, so no burglar escapes. Of course, many honest people are offended when they are asked to open their bags before leaving and so next time they will buy elsewhere. The other guard is the opposite: he stops almost anyone but the one he spots certainly brings something stolen. He offends few honest people, but too many grabbers escape. Difficult decision…

Why my brother-in-law comes to me with this story?. Because he knows that I daily face with similar dilemmas every time I have to choose a diagnostic test. And the thing is that there’re still people who think that if you get a positive result with a diagnostic tool you have a certain diagnostic of illness and, conversely, that if you are sick to know the diagnostic you only have to do a test. And things are not, nor much less, so simple. Nor is gold all that glitters neither all gold have the same quality.

Let’s see it with an example. When we want to know the utility of a diagnostic test we usually compare its results with those of a reference or gold standard, which is a test that, ideally, is always positive in sick people and negative in healthy.

Now suppose I perform a study with my hospital patients with a new diagnostic test for a particular disease and I get the results showed in the table below (the sick are those with a positive reference test and the healthy those with a negative one).

Let’s start with the easy part. We have 1598 subjects, 520 out of them sick and 1078 healthy. The test gives us 446 positive results, 428 true (TP) and 18 false (FP). It also gives us 1152 negatives, 1060 true (TN) and 92 false (FN). The first we can determine is the ability of the test to distinguish between healthy and sick, which leads me to introduce the first two concepts: sensitivity (Se) and specificity (Sp). Se is the likelihood that the test correctly classifies a patient or, in other words, the probability that a patient gets a positive result. It’s calculated dividing TP by the number of sick. In our case it equals 0.82 (if you prefer to use percentages you have to multiply by 100). Moreover, Sp is the likelihood that the test correctly classifies a healthy or, put another way, the probability that a healthy gets a negative result. It’s calculated dividing TN by the number of healthy. In our example, it equals 0.98.

Someone may think that we have assessed the value of the new test, but we have just begun to do it. And this is because with Se and Sp we somehow measure the ability of the test to discriminate between healthy and sick, but what we really need to know is the probability that an individual with a positive results being sick and, although it may seem to be similar concepts, they are actually quite different.

The probability of a positive of being sick is known as the positive predictive value (PPV) and is calculated dividing the number of patients with a positive test by the total number of positives. In our case it is 0.96. This means that a positive has a 96% chance of being sick. Moreover, the probability of a negative of being healthy is expressed by the negative predictive value (NPV), with is the quotient of healthy with a negative test by the total number of negatives. In our example it equals 0.92 (an individual with a negative result has 92% chance of being healthy).

And from now on is when neurons begin to be overheated. It turns out that Se and Sp are two intrinsic characteristics of the diagnostic test. Their results will be the same whenever we use the test in similar conditions, regardless of the subjects of the test. But this is not so with the predictive values, which vary depending on the prevalence of the disease in the population in which we test. This means that the probability of a positive of being sick depends on how common or rare the disease in the population is. Yes, you read this right: the same positive test expresses different risk of being sick, and for unbelievers, I’ll put another example.

Suppose that this same study is repeated by one of my colleagues who works at a community health center, where population is proportionally healthier than at my hospital (logical, they have not suffered the hospital yet). If you check the results in the table and bring you the trouble to calculate it, you may come up with a Se of 0.82 and a Sp of 0.98, the same that I came up with in my practice. However, if you calculate the predictive values, you will see that the PPV equals 0.9 and the NPV 0.95. And this is so because the prevalence of the disease (sick divided by total) is different in the two populations: 0.32 at my practice vs 0.19 at the health center. That is, in cases of highest prevalence a positive value is more valuable to confirm the diagnosis of disease, but a negative is less reliable to rule it out. And conversely, if the disease is very rare a negative result will reasonably rule out disease but a positive will be less reliable at the time to confirm it.

We see that, as almost always happen in medicine, we are moving on the shaking ground of probability, since all (absolutely all) diagnostic tests are imperfect and make mistakes when classifying healthy and sick. So when is a diagnostic test worth of using it?. If you think about it, any particular subject has a probability of being sick even before performing the test (the prevalence of disease in his population) and we’re only interested in using diagnostic tests that increase this likelihood enough to justify the initiation of the appropriate treatment (otherwise we would have to do another test to reach the threshold level of probability to justify treatment).

And here is when this issue begins to be a little unfriendly. The positive likelihood ratio (PLR), also known as positive probability ratio, indicates how much more probable is to get a positive with a sick than with a healthy subject. The proportion of positive in sick patients is represented by Se. The proportion of positives in healthy are the FP, which would be those healthy without a negative result or, what is the same, 1-Sp. Thus, PLR = Se / (1 – Sp). In our case (hospital) it equals 41 (the same value no matter we use percentages for Se and Sp). This can be interpreted as it is 41 times more likely to get a positive with a sick than with a healthy.

It’s also possible to calculate NLR (negative), which expresses how much likely is to find a negative in a sick than in a healthy. Negative patients are those who don’t test positive (1-Se) and negative healthy are the same as the TN (the test’s Sp). So, NLR = (1 – Se) / Sp. In our example 0.18.

A ratio of 1 indicates that the result of the test doesn’t change the probability of being sick. If it’s greater than 1 the probability is increased and, if less than 1, decreased. This is the parameter used to determine the diagnostic power of the test. Values > 10 (or < 0.01) indicates that it’s a very powerful test that supports (or contradict) the diagnosis; values from 5-10 (or 0.1-0.2) indicates low power of the test to support (or disprove) the diagnosis; 2-5 (or 0.2-05) indicates that the contribution of the test is questionable; and, finally, 1-2 (0.5-1) indicates that the test has not diagnostic value.

The likelihood ratio doesn’t express a direct chance, but it allows us to calculate the odds of being sick before and after testing positive for the diagnostic test. We can calculate the pre-test odds (PreO) as the prevalence divided by its complementary (how much probably is to be sick than not to be). In our case it equals 0.47. Moreover, the post-test odd (PosO) is calculated as the product of the prevalence by the PreO. In our case, it is 19.27. And finally, following the reverse mechanism that we use to get the PreO from the prevalence, post-test probability (PosP) would be equal to PosO / (PosO +1). In our example it equals 0.95, which means that if our test is positive the probability of being sick changes from 0.32 (the prevalence) to 0.95 (post-test probability).

If there’s still anyone reading at this point, I’ll say that we don’t need all this gibberish to get post-test probability. There are multiple websites with online calculators for all these parameters from the initial 2 by 2 table with a minimum effort. I addition, the post-test probability can be easily calculated using a Fagan’s nomogram. What we need to know is how to properly assess the information provided by a diagnostic tool to see if it’s useful because of its power, costs, patient discomfort, etc.

Just one last question. We’ve been talking all the time about positive and negative diagnostic tests, but when the result of the test is quantitative, we must set what value we consider positive and what negative, with which all the parameters we’ve seen will vary depending on these values, especially Se and Sp. And to which of the parameters of the diagnostic test must we give priority?. Well, that depends on the characteristics of the test and on the use that we pretend to give to it, but that’s another story…

## A never-ending story

Today we won’t talk about dragons that take you for a walk if you get on its hump. Nor we’ll talk about men with feet on their heads or any other creature from the delusional mind of Michael Ende. Today we’re going to talk about another never-ending story: that of diagnostic tests indicators.
When you think you know them all, you can raise a stone to find another beneath it. And why are there so many?, you may ask. Well, the answer is simple. Although there’re indicators that know very well how to interpret how a diagnostic test manages healthy and sick people, investigators are still looking for a good indicator, unique, that give us an idea about the diagnostic capability of a test.

There are many diagnostic tests indicators that assess the ability of the diagnostic test to discriminate among sick and healthy comparing the results with those of a gold standard. They are computed from the comparison among positives and negatives in a contingency table, with which you can build the usual indicators you see in the table above: sensitivity, specificity, predictive values, likelihood ratios, accuracy index and Youden’s index.

The problem is that most of them partially assess the ability of the test, so we need to use them in pairs: sensitivity and specificity, for example. Only the last two of the mentioned indicators can function as single ones. The accuracy index measures the percentage of correctly diagnosed patients, but it treats equally positives and negatives, true or false. Meanwhile, Youden’s index adds the patients misclassified by the diagnostic test.

In any case, it’s not recommended to use either the accuracy or Youden’s index in an isolated way when evaluating diagnostic tests. Moreover, the latter is a term difficult to translate to a tangible clinical concept as it’s a linear transformation of sensitivity and specificity.

At this point it’s easy to understand how we’d like to have a single indicator, simple, easy to interpret and not dependent on the prevalence of the disease. It would certainly be a good indicator of the ability of the diagnostic test that would avoid us of having to resort to a pair of indicators.

And at this point is when some brilliant mind has thought about using a well-known and familiar indicator such as the odds ratio to measure the capabilities of a diagnostic test. Thus, we can define de diagnostic odds ratio (DOR) as the ratio of the odds that the patients tests positive with respect to the odds of testing positive being healthy. As this is quite a tongue-twister, we’ll discuss the two components of the ratio.

The odds that the patient tests positive versus negative is simply the quotient among true positives (TP) and false negatives (FN): TP / FN. Moreover, the odds that a healthy tests positive versus negative is the quotient among false positives (FP) and true negatives (TN): FP / TN. And seeing this, we just have to define the ratio of the two odds:

$DOR&space;=&space;\frac{TP}{FN}&space;/&space;\frac{FP}{TN}&space;=&space;\frac{Se}{1&space;-&space;Se}&space;/&space;\frac{1&space;-&space;Sp}{Sp}$

DOR can also be expressed in terms of predictive values and likelihood ratios, according to the following expressions:

$DOR=&space;\frac{PPV}{1&space;-&space;PPV}&space;/&space;\frac{1&space;-&space;NPV}{NPV}$

$DOR=&space;\frac{PLR}{NLR}$

As any odds ratio, the possible values of DOR range from zero to infinity. The null value is one, which means that the test has no discriminatory capacity among healthy and sick. A value greater than one indicates discriminatory ability, which will be greater the higher the value is. Finally, values between zero and one will indicate that the test not only not discriminate well among healthy and sick, but that it incorrectly classifies them and yield more negative values among sick than among healthy people.

DOR is a global measure that is easy to interpret and does not depend on the prevalence of the disease, although it must be said that it can vary among groups of patients with different severity of their disease.

Finally, add to its advantages the possibility of constructing its confidence interval from the contingency table using this little formula that I show you:

$Standard\&space;error&space;(ln&space;DOR)=&space;\sqrt{\frac{1}{TP}&space;+&space;\frac{1}{TN}&space;+&space;\frac{1}{FP}&space;+&space;\frac{1}{FN}}$

Yes, I’ve seen the log, but this is the way with odds ratios: as odds are asymmetrical around the null value, these calculations must be done with logarithms. So, once we have the standard error, we can calculate the interval as follows:

$IC\&space;95\%=&space;ln&space;DOR&space;\pm&space;1,96&space;SE(lnDOR)$

We just have to apply the antilogarithm to the limits of the interval we got with the formula (the antilog is to raise the number e to the limits obtained).

And I think that this is enough for today. We could go more. DOR has many more virtues. For example, it can be used with test with continuous results (not just positive or negative), since there’s a correlation between DOR and the area under the ROC curve of the test. Furthermore, it can be used in meta-analysis and in logistic regression models, allowing the inclusion of variables to control the heterogeneity of the primary studies. But that’s another story…

## The turn of the screw

Have you read the novel by Henry James?. I recommend it. A classical in the horror genre, with a dead and evil governess who appears as a ghost and with turbid relationships in the background. But today I’m not going to tell you about any horror story, but I’ll give another turn to the screw of diagnostics tests, although some can get even scarier than with a John Carpenter’s film.

We know there’s not a perfect diagnostic test. All are wrong at some time, either diagnosing someone health as sick (false positive, FP) or yielding a negative result in someone who has the disease (false negative, FN). This is why it have been developed some parameters to characterize diagnostic tests and to give us an idea of their performance in the daily clinical practice.

The best well-known are sensitivity (S) and specificity (Sp). We know they are intrinsic properties of the test and that they inform us about the ability of the diagnostic test to correctly classify sick patients (S) and healthy people (Sp). The problem is that we need to know the likelihood of being or not sick in condition of having obtained a positive or negative result. These probabilities conditioned by the result of the test are provided by the positive and negative predictive values.

These pair of values can characterize the worth of the test, but we’d prefer to be able to define the value of the test with a single number. We may use likelihood ratios, both positive and negatives ones, that gives us an idea about how much likely is to have the disease or not to have it, but these ratios weighs an ancient curse: they are of little knowing and even poor understanding by clinicians.

For these reasons, some people have tried to develop other indicators to characterize the validity of diagnostic tests. One of them is the so called accuracy or precision of the test, which reflects the probability that the test has made a correct diagnosis.

To calculate it, we construct a quotient placing in the numerator all possible true values (positives and negatives) and in the denominator all possible outcomes, according to the following formula:

$Accuracy\&space;index=&space;\frac{TP&space;+&space;TN}{TP&space;+&space;TN&space;+&space;FP&space;+&space;FN}$

This indicator informs us about in what percentage of cases the diagnostic test is not wrong, but it can be difficult to convert its value to a tangible clinical concept.

Another parameter to measure the overall effectiveness of the test is the Youden’s index, which sums the cases that are wrongly classified by the diagnostic test, according to the following formula:

Youden’s index = S + Sp -1

It’s not a bad index as an approximation to the overall performance of the test, but it is not recommended to use it as a single parameter to evaluate any diagnostic test.

Some authors go one step further and try to develop parameters that function in an analogous way to the number needed to treat (NNT) of treatment studies. Thus, two parameters have been developed.

The first one is the number needed to diagnose (NND). If NNT is the inverse of those which improve with treatment minus those which improve with control intervention, let’s make a NND placing in the denominator the difference between sick patients with positive result and healthy people with positive result.

S gives us the proportion of positive patients, and the complementary of Sp gives us the proportion of healthy people being positive with the test. So:

NND = 1 / S – (1 – Sp)

If we simplify the denominator by removing the parentheses, we’ll have:

NND = 1 / S + E – 1

That is, indeed, the inverse of Youden’s index we saw before:

NND = 1 / Youden’s I.

The second parameter is the number of patients to test to misdiagnose one (NNMD). To calculate it, we place in the denominator the complementary of the accuracy index that we talk about earlier:

NNMD = 1 / Accuracy I.

If we substitute the index for its actual value and simplify the equation, we’ll get:

$NNMD=&space;\frac{1}{1-\frac{TP&space;+&space;TN}{TP&space;+&space;TN&space;+&space;FP&space;+&space;FN}}=&space;\frac{1}{1-Sp-[Pr(S-Sp)]}$

where Pr is the prevalence of disease (pretest probability). This parameter provides the number of diagnostic tests that we have to do to be wrong once, so the higher the index, the better the test. This index and the previous one are much more graspable for clinicians, although both of them have the same drawback: FP and FN are given the same level of importance, which does not always fit with the clinical context in which we apply the diagnostic test.

And these are all the parameters I know, but surely there are more and, if not, someone will invent some soon. I would not end without a clarification on the Youden’s index, in which we have barely spent time. This index is not only important to assess the overall performance of a diagnostic test. It’s also a useful tool to decide what the best cut on a ROC curve is, since its maximum value indicates the point of the curve that is further away from the diagonal. But that’s another story…