Science without sense…double nonsense

Pills on evidence-based medicine

You have to know what you are looking for

Every day we come across articles presenting new diagnostic tests that seem designed to solve all our problems. But we should not be tempted to give credit to everything we read without first reconsidering what we have, in fact, read. At the end of the day, if we paid attention to everything we read we would be swollen from drinking Coca-Cola.

We know that a diagnostic test is not going to tell us whether or not a person is sick. Its result only allows us to increase or decrease the probability that the individual is sick, so that we can confirm or rule out the diagnosis, but always with some degree of uncertainty.

Everyone has a certain risk of suffering from any given disease, which is nothing more than the prevalence of the disease in the general population. Below a certain level of probability, it seems so unlikely that the patient is sick that we leave him alone and do no diagnostic tests (although some find it hard to restrain the urge to always ask for something). This level is the diagnostic or test threshold.

But if, in addition to belonging to the population, one has the misfortune of having symptoms, that probability will increase until it exceeds this threshold, above which the probability of having the disease justifies performing diagnostic tests. Once we have the result of the chosen test, the probability (now the post-test probability) will have changed. It may have decreased and fallen below the test threshold, so we discard the diagnosis and leave the patient alone again. It may also have risen above another threshold, the therapeutic threshold, beyond which the probability of disease is high enough that no further tests are needed and treatment can be started.

The usefulness of a diagnostic test lies in its ability to lower the probability below the test threshold (and discard the diagnosis) or, on the contrary, to raise it to the threshold at which starting treatment is justified. Of course, sometimes the test leaves us halfway and we have to do additional tests before confirming the diagnosis with enough certainty to start treatment.
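For those who like to see ideas as code, the two-threshold logic can be sketched in a few lines of Python. This is only an illustration: the function name and the threshold values (0.05 and 0.80) are assumptions chosen for the example, not figures from any study.

```python
def next_step(probability, test_threshold, treatment_threshold):
    """Place a probability of disease relative to the two decision thresholds."""
    if not 0.0 <= test_threshold <= treatment_threshold <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= test <= treatment <= 1")
    if probability < test_threshold:
        return "rule out"      # too unlikely: leave the patient alone
    if probability >= treatment_threshold:
        return "treat"         # likely enough: start treatment
    return "keep testing"      # in between: more tests are needed


# Hypothetical thresholds, chosen only for illustration.
for p in (0.02, 0.40, 0.95):
    print(p, next_step(p, test_threshold=0.05, treatment_threshold=0.80))
```

In real life both thresholds depend on the disease, on the risks of the tests and on the risks and benefits of the treatment, so they have to be set case by case.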

Studies of diagnostic tests should provide information about the ability of a test to produce the same results when performed under similar conditions (reliability) and about the accuracy with which its measurements reflect what they are meant to measure (validity). But they should also give us data about its discriminatory power (sensitivity and specificity), its clinical performance (positive and negative predictive values), its ability to modify the probability of illness and change our position between the two thresholds (likelihood ratios), and other aspects that allow us to assess whether it’s worth using the test on our patients. And to check whether a study gives us the right information we need to make a critical appraisal and read the paper based on our three pillars: validity, relevance and applicability.

Let’s start with VALIDITY. First, we’ll ask ourselves some basic screening questions about the primary criteria of the study. If the answer to these questions is no, probably the best you can do is use the article to wrap your mid-morning snack.

Was the diagnostic test blindly and independently compared with an appropriate gold standard or reference test? We must check that the results of the reference test were not interpreted differently depending on the results of the study test, which would be an incorporation bias and could invalidate the results. Another problem that can arise is that the reference test results are frequently inconclusive. If we made the mistake of excluding those doubtful cases we’d commit an indeterminate exclusion bias that, in addition to overestimating the sensitivity and specificity of the test, would compromise the external validity of the study, whose conclusions would only be applicable to patients with a conclusive result.

Do the patients encompass a spectrum similar to the one we will find in our practice? The inclusion criteria of the study should be clear, and the study must include healthy people and diseased people with varying severity or stages of progression of the disease. As we know, prevalence influences the clinical performance of the test, so if it’s validated, for example, in a tertiary center (where the probability of being sick is greater) its diagnostic capabilities will be overestimated when we use the test in a Primary Care center or with the general population (where the proportion of diseased people will be lower).

At this point, if we think it’s worth reading further, we’ll focus on the secondary criteria, which are those that add value to the study design. Another question to ask is: did the results of the study test have any influence on the decision to do the reference test? We have to check that there hasn’t been a sequence bias or a diagnostic verification bias, whereby those with a negative study test are excluded from the reference test. Although this is common in current practice (we start with simple tests and perform the more invasive ones only in positive patients), doing so in a diagnostic test study affects the validity of the results. Both tests should be done independently and blindly, so that the subjectivity of the observer does not influence the results (review bias). Finally, is the method described in enough detail to allow its reproduction? It should be clear what is considered normal and abnormal, what criteria were used to define normality, and how the results of the test were interpreted.

Having analyzed the internal validity of the study, we’ll appraise the RELEVANCE of the data presented. The purpose of a diagnostic study is to determine the ability of a test to correctly classify individuals according to the presence or absence of disease. Actually, to be more precise, we want to know how the probability of being ill changes after the test’s result (post-test probability). It’s therefore essential that the study gives information about the direction and magnitude of this change (pretest / posttest), which we know depends on the characteristics of the test and, to a large extent, on the prevalence or pretest probability.

Does the work present likelihood ratios, or is it possible to calculate them from the data? This information is critical because, without it, we couldn’t estimate the clinical impact of the study test. We have to be especially careful with tests with quantitative results in which the researcher has established a cutoff for normality. When using ROC curves, it is usual to move the cutoff to favor the sensitivity or the specificity of the test, but we must always appraise how this choice affects the external validity of the study, since it may limit its applicability to a particular group of patients.

How reliable are the results? We will have to determine whether the results are reproducible and how they can be affected by variations among different observers or when the test is repeated in succession. But we have to assess not only reliability, but also how precise the results are. The study was done on a sample of patients, but it should provide an estimate of the values in the population, so results should be expressed with their corresponding confidence intervals.
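As an illustration of what such an interval looks like, here is a minimal Python sketch that computes a 95% confidence interval for a proportion such as sensitivity. It uses the Wilson score method, which is one common choice among several and not necessarily the one a given paper will use. The counts (90 positives among 100 sick patients) are hypothetical.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if total <= 0 or not 0 <= successes <= total:
        raise ValueError("need total > 0 and 0 <= successes <= total")
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# Hypothetical study: the test is positive in 90 of 100 sick patients.
low, high = wilson_interval(90, 100)
print(f"Se = 0.90, 95% CI from {low:.2f} to {high:.2f}")
```

The smaller the sample, the wider the interval, which is why a sensitivity of 0.90 estimated from 20 patients deserves much less trust than the same figure estimated from 2000.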

The third pillar of critical appraisal is APPLICABILITY or external validity, which will help us to determine whether the results are useful to our patients. In this regard, we ask three questions. Is the test available and is it possible to perform it in our patients? If the test is not available, all we’ll have achieved with the study is to increase our vast knowledge. But if we can apply the test, we must ask whether our patients fulfill the inclusion and exclusion criteria of the study and, if not, consider how these differences may affect the applicability of the test.

The second question is whether we know the pretest probability of our patients. If our prevalence is very different from that of the study, the actual usefulness of the test may change. One solution may be to do a sensitivity analysis, evaluating how the study results would change after replacing the values of pretest and post-test probability with different ones that are clinically reasonable.
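A sensitivity analysis of this kind can be sketched in Python using the usual odds form of Bayes’ theorem (post-test odds = pretest odds × likelihood ratio). The likelihood ratio of 10 and the range of pretest probabilities are hypothetical values chosen for illustration; in practice you would plug in the ratio reported by the study and the prevalences that seem reasonable for your own patients.

```python
def post_test_probability(pretest_probability, likelihood_ratio):
    """Convert probability to odds, apply the likelihood ratio, convert back."""
    if not 0.0 < pretest_probability < 1.0:
        raise ValueError("pretest probability must be strictly between 0 and 1")
    pre_odds = pretest_probability / (1 - pretest_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


# Hypothetical positive likelihood ratio and pretest range, for illustration only.
LR_POSITIVE = 10
for pretest in (0.01, 0.05, 0.10, 0.30, 0.50):
    post = post_test_probability(pretest, LR_POSITIVE)
    print(f"pretest {pretest:.2f} -> post-test {post:.2f}")
```

Running it shows how the same positive result leaves us in very different places depending on where we started.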

Finally, we should ask ourselves the most important question: can the post-test probability change our therapeutic attitude and so be helpful to the patient? For example, if the pretest probability is very low, the post-test probability will probably also be very low and won’t reach the therapeutic threshold, so it would not be worth spending money and effort on the test. Conversely, if the pretest probability is very high it may be worth starting treatment without any more evidence, unless the treatment is very expensive or dangerous. As always, virtue lies in the middle, and it is in these intermediate areas where the most benefit can be obtained from the diagnostic test under study. In any case, we must never forget who our boss is (I mean the patient, not our boss at the office): we must not be content with studying only effectiveness or cost-effectiveness, but must also consider the risks, the discomfort, the patient’s preferences and the consequences that performing the diagnostic test can lead to.

If you’ll allow me a piece of advice, when critically appraising an article about diagnostic tests I recommend using the CASP templates, which can be downloaded from its website. They will help you carry out the critical appraisal in a systematic and easy way.

One clarification before we run off: we must not confuse studies of diagnostic tests with diagnostic prediction rules. Although the assessment is similar, prediction rules have specific characteristics and methodological requirements that must be assessed in an appropriate way and that we will see in another post.

Finally, let me just say that everything we have said so far applies to papers specifically about diagnostic tests. However, the assessment of diagnostic tests may be part of observational studies such as cohort or case-control studies, which can have some peculiarities in the sequence of implementation and in the validation criteria of the study and reference tests. But that’s another story…

All that glitters is not gold

A brother-in-law of mine is very worried about a dilemma he’s gotten himself into. The thing is that he’s going to start a small business and wants to hire a security guard to stand at the entrance and watch for those who take something without paying for it. The problem is that there are two candidates and he doesn’t know which of the two to choose. One of them stops nearly everyone, so no thief escapes. Of course, many honest people are offended when they are asked to open their bags before leaving, so next time they will shop elsewhere. The other guard is the opposite: he stops almost no one, but anyone he does stop is certainly carrying something stolen. He offends few honest people, but too many thieves escape. A difficult decision…

Why does my brother-in-law come to me with this story? Because he knows that I face similar dilemmas daily, every time I have to choose a diagnostic test. And the thing is that there are still people who think that if you get a positive result with a diagnostic tool you have a certain diagnosis of illness and, conversely, that if you are sick you only have to do a test to find out the diagnosis. And things are not nearly that simple. Not all that glitters is gold, nor is all gold of the same quality.

Let’s see it with an example. When we want to know the utility of a diagnostic test we usually compare its results with those of a reference test or gold standard, which is a test that, ideally, is always positive in sick people and negative in healthy ones.

Now suppose I perform a study on my hospital patients with a new diagnostic test for a particular disease and I get the results shown in the table below (the sick are those with a positive reference test and the healthy those with a negative one).

Let’s start with the easy part. We have 1598 subjects, 520 of them sick and 1078 healthy. The test gives us 446 positive results, 428 true (TP) and 18 false (FP). It also gives us 1152 negatives, 1060 true (TN) and 92 false (FN). The first thing we can determine is the ability of the test to distinguish between healthy and sick, which leads me to introduce the first two concepts: sensitivity (Se) and specificity (Sp). Se is the probability that the test correctly classifies a sick person or, in other words, the probability that a sick person gets a positive result. It’s calculated by dividing TP by the number of sick. In our case it equals 0.82 (if you prefer percentages, multiply by 100). In turn, Sp is the probability that the test correctly classifies a healthy person or, put another way, the probability that a healthy person gets a negative result. It’s calculated by dividing TN by the number of healthy. In our example, it equals 0.98.
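If you want to check the arithmetic, these two calculations can be written as a short Python sketch using the hospital counts given above (the variable names are just my own labels).

```python
# Hospital 2x2 table, counts taken from the text.
tp, fp = 428, 18     # positive tests: true and false
fn, tn = 92, 1060    # negative tests: false and true

sick = tp + fn       # 520 subjects with a positive reference test
healthy = fp + tn    # 1078 subjects with a negative reference test

sensitivity = tp / sick      # probability of a positive test if sick
specificity = tn / healthy   # probability of a negative test if healthy

print(f"Se = {sensitivity:.2f}, Sp = {specificity:.2f}")  # Se = 0.82, Sp = 0.98
```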

Someone may think that we have now assessed the value of the new test, but we have only just begun. This is because with Se and Sp we somehow measure the ability of the test to discriminate between healthy and sick, but what we really need to know is the probability that an individual with a positive result is sick and, although they may seem similar concepts, they are actually quite different.

The probability that a positive is sick is known as the positive predictive value (PPV) and is calculated by dividing the number of sick with a positive test (TP) by the total number of positives. In our case it is 0.96. This means that a positive has a 96% chance of being sick. In turn, the probability that a negative is healthy is expressed by the negative predictive value (NPV), which is the quotient of the healthy with a negative test (TN) by the total number of negatives. In our example it equals 0.92 (an individual with a negative result has a 92% chance of being healthy).
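Again, a minimal Python sketch with the hospital counts reproduces both predictive values (variable names are my own).

```python
# Hospital 2x2 table, counts taken from the text.
tp, fp = 428, 18
fn, tn = 92, 1060

positives = tp + fp   # 446 positive tests
negatives = tn + fn   # 1152 negative tests

ppv = tp / positives  # probability of being sick given a positive test
npv = tn / negatives  # probability of being healthy given a negative test

print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")  # PPV = 0.96, NPV = 0.92
```

Note that Se and Sp are computed along the columns of the table (sick, healthy), while the predictive values are computed along the rows (positives, negatives).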

And this is where the neurons begin to overheat. It turns out that Se and Sp are two intrinsic characteristics of the diagnostic test. Their values will be the same whenever we use the test under similar conditions, regardless of the subjects tested. But this is not so with the predictive values, which vary depending on the prevalence of the disease in the population we test. This means that the probability that a positive is sick depends on how common or rare the disease is in the population. Yes, you read that right: the same positive test expresses a different risk of being sick, and for unbelievers, I’ll give another example.

Suppose that this same study is repeated by one of my colleagues who works at a community health center, where the population is proportionally healthier than at my hospital (logical, they have not suffered the hospital yet). If you check the results in the table and take the trouble to do the calculations, you will come up with a Se of 0.82 and a Sp of 0.98, the same as I got in my practice. However, if you calculate the predictive values, you will see that the PPV equals 0.9 and the NPV 0.95. This is because the prevalence of the disease (sick divided by total) is different in the two populations: 0.32 at my practice vs 0.19 at the health center. That is, when the prevalence is higher a positive result is more valuable for confirming the diagnosis, but a negative is less reliable for ruling it out. And conversely, if the disease is very rare a negative result will reasonably rule out disease, but a positive will be less reliable when it comes to confirming it.
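The dependence on prevalence can be made explicit with Bayes’ theorem. The following Python sketch is an assumption-laden illustration: it starts from the rounded Se (0.82) and Sp (0.98) and the two prevalences quoted above, not from the original tables, so its output may differ in the second decimal from the figures in the text.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a given prevalence via Bayes' theorem."""
    tp = sensitivity * prevalence                # sick and positive
    fn = (1 - sensitivity) * prevalence          # sick and negative
    tn = specificity * (1 - prevalence)          # healthy and negative
    fp = (1 - specificity) * (1 - prevalence)    # healthy and positive
    return tp / (tp + fp), tn / (tn + fn)


# Rounded Se and Sp from the text; prevalences of the two settings.
for setting, prevalence in (("hospital", 0.32), ("health center", 0.19)):
    ppv, npv = predictive_values(0.82, 0.98, prevalence)
    print(f"{setting}: PPV {ppv:.2f}, NPV {npv:.2f}")
```

With the same Se and Sp, lowering the prevalence pulls the PPV down and pushes the NPV up, which is exactly the pattern described above.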

We see that, as almost always happens in medicine, we are moving on the shaky ground of probability, since all (absolutely all) diagnostic tests are imperfect and make mistakes when classifying healthy and sick. So when is a diagnostic test worth using? If you think about it, any particular subject has a probability of being sick even before the test is performed (the prevalence of the disease in his population), and we’re only interested in using diagnostic tests that increase this probability enough to justify starting the appropriate treatment (otherwise we would have to do another test to reach the threshold probability that justifies treatment).

And this is where the issue begins to get a little unfriendly. The positive likelihood ratio (PLR), also known as positive probability ratio, indicates how much more likely it is to get a positive in a sick subject than in a healthy one. The proportion of positives among the sick is represented by Se. The proportion of positives among the healthy is given by the FP, that is, the healthy who don’t get a negative result or, what is the same, 1-Sp. Thus, PLR = Se / (1 – Sp). In our case (hospital) it equals 41 (the same value whether or not we use percentages for Se and Sp). This can be interpreted as meaning that it is 41 times more likely to get a positive in a sick person than in a healthy one.

It’s also possible to calculate the negative likelihood ratio (NLR), which expresses how much more likely it is to find a negative in a sick subject than in a healthy one. The sick with a negative result are those who don’t test positive (1-Se) and the healthy with a negative result are the TN (the test’s Sp). So, NLR = (1 – Se) / Sp. In our example, 0.18.
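Both ratios are easy to check with a small Python sketch. Like the text, it starts from the rounded values Se = 0.82 and Sp = 0.98; the function name is just my own label.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (PLR, NLR) computed from sensitivity and specificity."""
    if specificity in (0.0, 1.0):
        raise ValueError("ratios are undefined for a specificity of exactly 0 or 1")
    plr = sensitivity / (1 - specificity)   # positives: sick vs healthy
    nlr = (1 - sensitivity) / specificity   # negatives: sick vs healthy
    return plr, nlr


plr, nlr = likelihood_ratios(0.82, 0.98)
print(f"PLR = {plr:.0f}, NLR = {nlr:.2f}")  # PLR = 41, NLR = 0.18
```

Because 1 – Sp is such a small number here, the PLR is very sensitive to rounding: a tiny change in Sp moves the ratio a lot.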

A ratio of 1 indicates that the result of the test doesn’t change the probability of being sick. If it’s greater than 1 the probability is increased and, if less than 1, decreased. This is the parameter used to determine the diagnostic power of the test. Values > 10 (or < 0.1) indicate a very powerful test that supports (or contradicts) the diagnosis; values of 5-10 (or 0.1-0.2) indicate low power of the test to support (or disprove) the diagnosis; 2-5 (or 0.2-0.5) indicate that the contribution of the test is questionable; and, finally, 1-2 (or 0.5-1) indicate that the test has no diagnostic value.

The likelihood ratio doesn’t express a probability directly, but it allows us to calculate the odds of being sick before and after testing positive. We can calculate the pre-test odds (PreO) as the prevalence divided by its complement (how much more likely it is to be sick than not to be). In our case it equals 0.47. In turn, the post-test odds (PosO) are calculated as the product of the PLR and the PreO. In our case, 41 × 0.47 = 19.27. And finally, reversing the mechanism we used to get the PreO from the prevalence, the post-test probability (PosP) equals PosO / (PosO + 1). In our example it equals 0.95, which means that if our test is positive the probability of being sick changes from 0.32 (the prevalence) to 0.95 (post-test probability).
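The whole chain from prevalence to post-test probability fits in a few lines of Python. The sketch below uses the hospital prevalence (0.32) and the PLR of 41; since it does not round the pre-test odds to 0.47 along the way, its post-test odds come out as about 19.29 rather than the 19.27 obtained from 41 × 0.47, but the final probability is the same 0.95.

```python
prevalence = 0.32   # pre-test probability at the hospital
plr = 41            # positive likelihood ratio from the text

pre_odds = prevalence / (1 - prevalence)         # probability -> odds
post_odds = plr * pre_odds                       # apply the likelihood ratio
post_probability = post_odds / (post_odds + 1)   # odds -> probability

print(f"PreO = {pre_odds:.2f}")
print(f"PosO = {post_odds:.2f}")
print(f"PosP = {post_probability:.2f}")
```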

If there’s still anyone reading at this point, I’ll say that we don’t need all this gibberish to get the post-test probability. There are multiple websites with online calculators that give all these parameters from the initial 2 by 2 table with minimal effort. In addition, the post-test probability can be easily obtained using Fagan’s nomogram. What we do need to know is how to properly assess the information provided by a diagnostic tool to see whether it’s useful given its power, costs, patient discomfort, etc.

Just one last question. We’ve been talking all the time about positive and negative diagnostic tests, but when the result of the test is quantitative we must decide which values we consider positive and which negative, so all the parameters we’ve seen will vary depending on that cutoff, especially Se and Sp. And to which of the parameters of the diagnostic test must we give priority? Well, that depends on the characteristics of the test and on the use that we intend to make of it, but that’s another story…
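To get a feel for how the cutoff moves Se and Sp in opposite directions, here is a toy Python sketch. The measurements are invented purely for illustration (they come from no study), and the convention that higher values mean a positive test is also an assumption.

```python
# Hypothetical measurements, invented purely for illustration.
sick = [4.1, 5.0, 5.8, 6.3, 7.2, 8.0]
healthy = [2.0, 2.9, 3.5, 4.4, 5.1, 6.0]


def se_sp_at(cutoff):
    """Call the test positive when the value is at or above the cutoff."""
    se = sum(v >= cutoff for v in sick) / len(sick)
    sp = sum(v < cutoff for v in healthy) / len(healthy)
    return se, sp


for cutoff in (4.0, 5.0, 6.1):
    se, sp = se_sp_at(cutoff)
    print(f"cutoff {cutoff}: Se {se:.2f}, Sp {sp:.2f}")
```

A low cutoff behaves like the guard who stops nearly everyone, and a high one like the guard who stops almost no one.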