## The imperfect screening

Nobody is perfect. That is a fact, and a relief too, because the problem is not being imperfect: imperfection is inevitable. The real problem is believing oneself to be perfect, being ignorant of one’s own limitations. And the same goes for many other things, such as the diagnostic tests used in medicine.

But with diagnostic tools this is a real crime because, beyond their imperfection, they can misclassify healthy and sick people. Don’t you believe me? Let’s reflect on it a little.

To begin with, take a look at the Venn diagram I have drawn. What childhood memories these diagrams bring back to me! The filled square symbolizes the population in question. Above the diagonal are the sick (SCK) and below it the healthy (HLT), so that each area represents the probability of being SCK or HLT. The area of the square, obviously, equals 1: we can be certain that anybody will be either healthy or sick, two mutually exclusive situations. The ellipse encompasses the subjects who undergo the diagnostic test and get a positive result (POS). In a perfect world, the entire ellipse would lie above the diagonal, but in the real, imperfect world the ellipse is crossed by the diagonal, so the positive results can be true (TP) or false (FP), the latter when they are obtained in healthy people. The area outside the ellipse corresponds to the negatives (NEG), which, as you can see, are also divided into true and false (TN, FN).

Now let’s transfer this to the typical contingency table to define the probabilities of the different options, and let’s think first about the situation in which we have not yet carried out the test. In this case, the columns condition the probabilities of the events in the rows. For example, the upper left cell represents the probability of a POS in the SCK (once you are sick, how likely are you to get a positive result?), which we call the sensitivity (SEN). For its part, the lower right cell represents the probability of a NEG in a HLT, which we call the specificity (SPE). The total of the first column represents the probability of being sick, which is nothing more than the prevalence (PRV), and in the same way we can work out the meaning of the probability in each cell. This table provides two characteristics of the test, SEN and SPE, which, as we know, are intrinsic to the test whenever it is performed under similar conditions, even if the populations are different.

And what about the contingency table once we have carried out the test? A subtle, but very important, change has taken place: now the rows condition the probabilities of the events in the columns. The totals of the table do not change, but look now at the first cell, which represents the probability of being SCK given that the result has been POS (when positive, what is the probability of being sick?). This is no longer the SEN, but the positive predictive value (PPV). The same applies to the lower right cell, which now represents the probability of being HLT given that the result has been NEG: the negative predictive value (NPV).

So we see that before performing the test we usually know its SEN and SPE, while once we have performed it we can calculate its positive and negative predictive values, these four characteristics of the test remaining linked through the magic of Bayes’ theorem. Of course, regarding PPV and NPV there’s a fifth element to take into account: the prevalence. We know that predictive values vary depending on the PRV of the disease in the population, while SEN and SPE remain unchanged.
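This link can be sketched in a few lines of code. The function below is my own illustration, not taken from any library, and the example values (SEN = SPE = 0.9, PRV = 0.1) are hypothetical:

```python
def predictive_values(sen: float, spe: float, prv: float) -> tuple[float, float]:
    """Return (PPV, NPV) from sensitivity, specificity and prevalence,
    applying Bayes' theorem."""
    # P(POS) = P(POS|SCK) * P(SCK) + P(POS|HLT) * P(HLT)
    p_pos = sen * prv + (1 - spe) * (1 - prv)
    ppv = sen * prv / p_pos              # P(SCK|POS)
    npv = spe * (1 - prv) / (1 - p_pos)  # P(HLT|NEG)
    return ppv, npv

# Hypothetical test: SEN = 0.9, SPE = 0.9, in a population with PRV = 0.1
ppv, npv = predictive_values(sen=0.9, spe=0.9, prv=0.1)
print(f"PPV = {ppv:.2f}, NPV = {npv:.3f}")  # PPV = 0.50, NPV = 0.988
```

Note how, with these figures, only half of the positives are actually sick, even though the test looks quite good: the prevalence is doing the damage.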

And all this has its practical expression. Let’s invent an example to mess around a bit more. Suppose we have a population of one million inhabitants in which we conduct a screening for fildulastrosis. We know from previous studies that the test’s SEN is 0.66 and its SPE is 0.96, and that the prevalence of fildulastrosis is 0.0001 (1 in 10,000); a disease so rare that I would advise you not to bother looking it up, if anyone was thinking about it.

Knowing the PRV, it is easy to calculate that in our population there are 100 SCK. Of these, 66 will be POS (SEN = 0.66) and 34 will be NEG. Moreover, there will be 999,900 healthy people, of whom 96% (959,904) will be NEG (SPE = 0.96) and the rest (39,996) will be POS. In short, we’ll get 40,062 POS, of which 39,996 will be FP. Don’t be scared by the high number of false positives. This happens because we have chosen a very rare disease, so there are many FP even though the SPE is quite high. Consider that in real life we’d need to run the confirmatory test on all these subjects just to end up confirming the diagnosis in 66 people. Therefore, it’s very important to think carefully about whether a screening is worth doing before starting to look for the disease in the population. For this and many other reasons.

We can now calculate the predictive values. The PPV is the ratio between the true positives and the total of POS: 66/40,062 = 0.0016. So there will be roughly one sick person in every 600 positives. Similarly, the NPV is the ratio between the true negatives and the total of NEG: 959,904/959,938 = 0.99996. As expected, getting a negative result makes it highly improbable to be sick.

What do you think? Is a test with such a number of false positives and a PPV of 0.0016 useful for mass screening? Well, although it may seem counterintuitive, if we think about it for a moment, it’s not so bad. The pretest probability of being SCK is 0.0001 (the PRV). The posttest probability is 0.0016 (the PPV). So their ratio is 0.0016/0.0001 = 16, which means we have multiplied by 16 our ability to detect the sick. The test, therefore, doesn’t seem so bad, but we must take many other factors into account before starting to screen.
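The whole example can be checked with a few lines of integer arithmetic; the figures are those of the (invented) fildulastrosis screening above:

```python
n = 1_000_000
prv, sen, spe = 0.0001, 0.66, 0.96

sick = round(n * prv)          # 100 sick people
healthy = n - sick             # 999,900 healthy people
tp = round(sick * sen)         # 66 true positives
fn = sick - tp                 # 34 false negatives
tn = round(healthy * spe)      # 959,904 true negatives
fp = healthy - tn              # 39,996 false positives

ppv = tp / (tp + fp)           # 66 / 40,062
npv = tn / (tn + fn)           # 959,904 / 959,938
print(f"Total POS = {tp + fp}, PPV = {ppv:.4f}, NPV = {npv:.5f}")
print(f"Posttest/pretest ratio = {ppv / prv:.1f}")
```

The ratio comes out as about 16.5 when the unrounded PPV is used; the 16 in the text results from rounding the PPV to 0.0016 before dividing.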

All this we have seen so far has an additional practical application. Suppose we only know the SEN and SPE of the test, but not the PRV of the disease in the population we have screened. Can we estimate it from the results of the screening? The answer is, of course, yes.

Imagine again our population of one million subjects. We do the test and get 40,062 positives. The problem here is that some of these (most of them, actually) are FP. Also, we don’t know how many sick people have tested negative (the FN). How, then, can we get the number of sick people? Let’s think about it for a while.

We have said that the number of sick people will be equal to the number of POS, minus the number of FP, plus the number of FN:

Nº sick = Total POS – Nº FP + Nº FN

We have the number of POS: 40,062. The FP are the healthy people (a proportion 1-PRV of the population) who get a positive result, which they do with probability 1-SPE (the probability of a healthy person not getting a NEG). Then, the total number of FP will be:

FP = (1-PRV) x (1-SPE) x n (with n = 1 million, the population’s size)

Finally, the FN are the sick people (a proportion PRV) who don’t get a positive result, which happens with probability 1-SEN. Then, the total number of FN is:

FN = PRV x (1-SEN) x n

If we substitute the totals of FP and FN in the first equation with the values we’ve just derived and solve for PRV, we obtain the following formula:

PRV = (Total POS/n + SPE – 1) / (SEN + SPE – 1)

We can now calculate the prevalence in our population:

PRV = (40,062/1,000,000 + 0.96 – 1) / (0.66 + 0.96 – 1) = 0.000062 / 0.62 = 0.0001

which, reassuringly, is the 1 in 10,000 we started from.
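This estimate (sometimes known as the Rogan–Gladen corrected prevalence) is a one-liner; here is a sketch using the figures from the example:

```python
def estimated_prevalence(positives: int, n: int, sen: float, spe: float) -> float:
    """Estimate PRV from the observed positives, correcting for the
    test's imperfect sensitivity and specificity."""
    return (positives / n + spe - 1) / (sen + spe - 1)

prv = estimated_prevalence(positives=40_062, n=1_000_000, sen=0.66, spe=0.96)
print(f"Estimated prevalence: {prv:.6f}")  # back to 0.0001, i.e. 1 in 10,000
```

Note that the formula only makes sense when SEN + SPE > 1; a test with SEN + SPE = 1 classifies no better than a coin toss and carries no information about the prevalence.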

Well, I think one of my brain lobes has just melted down, so we’ll have to leave it there. Once again we’ve seen the magic and power of numbers, and how to make the imperfections of our tools work in our favor. We could even go a step further and calculate the precision of the estimate we’ve made. But that’s another story…

## Prevention is not always better

Any sensible person will tell you that prevention is better than cure. I’ve heard it a million times. There was even a television show named “It‘s better to prevent”. Besides, nobody in their right mind doubts the health benefits that preventive medicine has achieved by promoting better lifestyles, controlling environmental conditions or running vaccination programs. However, when it comes to screening programs, I’d say that it’s not always so clear that it’s better to prevent and that, at times, it’s better to do nothing, for two reasons. First, because our resources are limited, and everything we spend on screening will come from other needs that will then have fewer resources. Second, because even if we act with the best of intentions, if we try to prevent indiscriminately we can cause more harm than good.

So we’ll have to think about whether there’s justification for a screening strategy before implementing it. The diagnostic test with which we plan to screen must be simple, inexpensive, reliable and well accepted by the population. It is important not to forget that we are going to do the test on healthy individuals who may not want to be bothered. Furthermore, it’s rare that we can confirm the diagnosis with a single positive result, and the test to confirm it will surely be more expensive and cumbersome, if not clearly invasive (imagine that the screening result must be confirmed by a biopsy). We will also have to consider the sensitivity and specificity of the test because, although we tolerate some false positives when screening, if the confirmation test is expensive or very cumbersome, the false positives had better be few, or the screening won’t be cost-effective.

Moreover, for the screening to be worthwhile, the preventable disease has to have a long preclinical phase. If it doesn’t, we’ll have little opportunity to detect it. The problem is, of course, that we are most interested in detecting the most severe diseases, and those often have shorter symptom-free preclinical stages.

Besides, who is going to be screened? Everyone, you will tell me. The problem is that this is the most expensive option, especially considering that healthy people do not usually go to the doctor, so you’ll have to actively recruit them if you want to do the screening (for their own sake, of course). Those who are sick, but not very sick, you’ll tell me then. Well, not really, because by the time they go to the doctor they are already out of the reach of prevention (they’re already sick). But we can take advantage of those who go to the doctor for other reasons, some of you may think. This is called opportunistic screening, and it is what is sometimes done for practical reasons. It’s cheaper, but the theoretical benefits of universal screening are lost. Screening as many people as possible is of particular interest when we’re trying to detect risk factors (such as hypertension) since, in addition to the advantages of early treatment, we have the opportunity to do primary prevention, which is much cheaper and gives better health results.

So, as we see, doing a screening can have many advantages, which is evident to everyone. The problem is that we rarely think about the harm we can cause with this form of prevention. How is it possible that early detection of a disease, or the possibility of early treatment, could harm someone? Let’s consider a few things.

The test may be painful (a needle stick) or bothersome (collecting stools in a container for three days). And if you think that’s no big deal, think about the people who have a heart attack during a stress test, or who suffer anaphylactic shock, not to speak of the Japanese who suffer a perforation during a colonoscopy. That’s a horse of a different color. Moreover, the mere prospect of the screening can cause anxiety or stress in a healthy person who should not have to worry about it.

And think about what will happen if the test is positive. Imagine that, to confirm the diagnosis, we have to do a colonoscopy or a chorionic biopsy, not to mention the patient’s anxiety until the diagnosis is confirmed or ruled out. And even if we confirm the diagnosis, the benefit may be limited: what is the benefit for an asymptomatic person of knowing that he has a disease, especially when there is no treatment, or it’s not yet time to start it? And even when there is a treatment, it may also be harmful. A very topical example is the effect of prophylactic prostatectomy for a low-grade carcinoma detected by PSA screening: the patient may suffer incontinence or impotence (or both) from undergoing a surgery that could have been delayed for years.

Always bear in mind that the potential benefits of screening a generally healthy population may be limited for one simple reason: these people are healthy. If there is the slightest harm that may arise from the strategy of early screening and treatment, we should seriously consider whether the screening program is worth performing.

So, when do we have to screen for a given disease? First, when the disease burden justifies the screening. Disease burden depends on severity and prevalence. If a disease is very common but very benign, its burden will be low and we’ll probably not be interested in screening. If it is very rare, screening will probably not be worthwhile either, unless the disease is severe and has a very effective treatment to prevent its complications. An example could be the screening of hyperphenylalaninemia in newborns.

Second, we need a proper test with the characteristics already mentioned, especially a number of false positives that is not too high, so that we don’t have to confirm the diagnosis in too many healthy people, which would make the whole thing a ruinous business.

Third, there has to be an early treatment that, moreover, is more effective than the usual treatment started at the onset of symptoms. Of course, we must also have the resources to provide this treatment.

Fourth, both the screening test and the treatment arising from a positive result must be safe. Otherwise, we could do more damage than the harm we are trying to avoid.

And fifth, we must balance the costs and potential benefits of the screening. Don’t forget that, even if the test is not very expensive, we are going to do it on a lot of people, so we’ll have to spend a huge amount of money, which is rather scarce at the moment.

Finally, let me just say that any screening program must be supplemented with studies proving its effectiveness. This can be done by direct or indirect methods, depending on whether we compare doing versus not doing the screening, or whether we study and compare different screening strategies. But that’s another story…

## Even non-significant Ps have a little soul

In any epidemiological study, results and validity are always at risk from two fearsome dangers: random error and systematic bias.

Systematic biases (or systematic errors) are related to defects in the study design in any of its phases, so we must be careful to avoid them in order not to compromise the validity of the results.

Random error is quite a different kettle of fish. It’s inevitable, being due to variations beyond our control that occur during measurement and data collection, and it alters the precision of our results. But do not despair: we can’t avoid randomness, but we can control it (within limits) and quantify it.

Let’s suppose we have measured the difference in oxygen saturation between the lower and upper extremities in twenty healthy newborns and have come up with an average result of 2.2%. If we repeat the experiment, even in the same infants, what value will we come up with? In all probability, any value but 2.2% (although it will be quite similar if both rounds are done under the same conditions). That’s the effect of randomness: repetition tends to produce different results, though always close to the true value we want to measure.

Random error can be reduced by increasing the sample size (with one hundred children instead of twenty, the averages will be more alike when we repeat the experiment), but we’ll never get rid of it completely. To make things worse, we don’t really want to know the mean saturation difference in these twenty children, but in the overall population from which they were drawn. How can we get out of this maze? You’ve got it: using confidence intervals.
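The shrinking of random error with sample size is easy to see in a small simulation. The “true” mean of 2.2% and a standard deviation of 1.5% are made-up values, used only for illustration:

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

def repeated_sample_means(n, repeats=1000, mu=2.2, sigma=1.5):
    """Mean saturation difference in a sample of n newborns, repeated many times."""
    return [statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
            for _ in range(repeats)]

spread_20 = statistics.stdev(repeated_sample_means(20))
spread_100 = statistics.stdev(repeated_sample_means(100))
print(f"SD of the means with n=20:  {spread_20:.3f}")
print(f"SD of the means with n=100: {spread_100:.3f}")
```

With n = 100 the means scatter roughly sqrt(5) times less than with n = 20, which is exactly the 1/sqrt(n) behaviour of the standard error.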

When we establish the null hypothesis of no difference between measuring saturation on the leg or on the arm, and we compare the means with the appropriate statistical test, the p-value will tell us the probability that the difference found is due to chance. If p < 0.05, we’ll assume that the probability that it is due to chance is small enough to calmly reject the null hypothesis and embrace the alternative one: it is not the same to measure oxygen saturation on the leg as on the arm. On the other hand, if p is not significant, we won’t be able to reject the null hypothesis, though we’ll always wonder what p-value we would have obtained with 100 children, or even with 1,000: p might have reached statistical significance and we might have rejected H0.

If we calculate the confidence interval of our variable, we’ll get the range in which the real value lies with a certain probability (typically 95%). The interval informs us about the precision of the study. It is not the same to come up with a saturation difference of 2 to 2.5% as one of 2 to 25% (in the latter case, we should distrust the study’s results no matter how many zeros its p-value has).

And what if p is non-significant? Can we draw any conclusions from the study? Well, that depends largely on the importance of what we are measuring, on its clinical relevance. If we consider a saturation difference of 10% as clinically significant and the whole interval lies below this value, the clinical importance will be low no matter how significant p is. But the good news is that this reasoning can also be stated the other way round: non-statistically significant intervals can be of great interest if either of their limits crosses into the area of clinical importance.

Let’s look at some examples in the figure above, in which a difference of 5% in oxygen saturation has been considered clinically significant (I apologize to the neonatologists, but the only thing I know about saturation is that it’s measured by a device that every now and then is not up to its task and beeps).

Study A is not statistically significant (its confidence interval crosses the null effect, which is zero in our example) and it doesn’t seem to be clinically important either.

Study B is not statistically significant either, but it may be clinically important, since its upper limit falls into the area of clinical relevance. If we increased the precision of the study (by increasing the sample size), who can assure us that the interval would not become narrower and lie entirely above the null-effect line, reaching statistical significance? In this case the question is not very important, because we are measuring a rather trivial variable, but think about how the situation would change if we were considering a harder variable, such as mortality.

Studies C and D reach statistical significance, but only study D’s results are clinically relevant. Study C shows a statistically significant difference, but its clinical relevance and therefore its interest are minimal.
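The reasoning behind the four studies can be sketched as a small decision rule. The thresholds (null effect at 0, clinical relevance at 5%) are those of the example, and the classification is my own simplification of the figure:

```python
def classify(ci_low: float, ci_high: float, threshold: float = 5.0) -> str:
    """Classify a 95% CI for the saturation difference against the null
    effect (0) and a clinical relevance threshold (5%)."""
    significant = ci_low > 0 or ci_high < 0  # CI does not cross the null effect
    relevant = ci_high >= threshold          # CI reaches the relevance area
    if significant and relevant:
        return "significant and clinically relevant"        # like study D
    if significant:
        return "significant but clinically irrelevant"      # like study C
    if relevant:
        return "non-significant but potentially important"  # like study B
    return "non-significant and unimportant"                # like study A

print(classify(-1.0, 2.0))  # non-significant and unimportant
print(classify(-0.5, 6.0))  # non-significant but potentially important
```

The second call is the interesting one: the interval crosses zero, so p is not significant, yet its upper limit reaches the clinically relevant zone, so the study deserves a second, larger look.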

So, you see, there are times when a non-statistically significant p-value can provide information of interest from a clinical point of view, and vice versa. Furthermore, everything we have discussed is important for understanding the design of superiority, equivalence and non-inferiority trials. But that’s another story…