Science without sense…double nonsense

The polysemy of Q

Manuel Molina — Tue, 16 Apr 2024 06:42:36 +0000


Cochran’s Q.

Cochran’s Q is a widely used measure to detect heterogeneity between primary studies in a meta-analysis. Its statistical properties and hypothesis testing are reviewed. Finally, other measures calculated from this value are described, such as the I² statistic and the H² statistic, frequently used to quantify the intensity of heterogeneity between studies.

Polysemy is one of the characteristics of language, you know, that party of meanings in which a word decides to wear multiple disguises. Take, for example, the letter Q.

Q is the seventeenth letter of the Latin alphabet, commonly used in languages such as Spanish, English, French and many others. You all know it, it is similar to the letter O, but with an added tail that gives it a distinctive touch.

Anyway, when I think of Q, my mind quickly drifts to the world of movies and spy novels and I remember Q, the technological genius behind the extravagant adventures of James Bond. Q is the mastermind behind MI6’s most ingenious gadgets. But let’s be fair, Q is not only the toy supplier for Agent 007, but also the character who adds a touch of wit and cunning to Bond’s elegant and dangerous life.

You see that Q is much more than a simple letter of the alphabet. But today we are not going to talk about semantics, much less about films. Because Q is not just a letter or a surname, it is also the key to unravelling the statistical mysteries that haunt our meta-analyses. Like a quantum detective, another Q, Cochran’s Q, helps us reveal the truth hidden in the data, or at least, it tries to do so amidst so much uncertainty and statistical heterogeneity.

But don’t worry, we won’t need a magnifying glass or a raincoat to follow this exciting path. Make yourself comfortable and prepare to walk through a world where numbers make their movements and hypotheses are unravelled. Let’s go there.

A prior clarification

Before starting to talk about how to measure heterogeneity in a meta-analysis, I think it is worth making a prior clarification, because here we once again encounter the polysemy of language, and the term “heterogeneity” can have two meanings.

First, there may be differences between the primary studies of a meta-analysis in terms of population, intervention, comparison and outcome (the classic components of the structured clinical question, PICO). This heterogeneity, which we can call clinical, can be reduced when we design our study by asking an appropriate research question, but there is no point in trying to correct it in the results analysis phase.

Apples should not be mixed with pears. If we find ourselves in this situation, the correct thing to do is to limit ourselves to doing a qualitative synthesis of our systematic review and refrain from doing a quantitative synthesis or meta-analysis.

Secondly, we may find ourselves faced with what we call statistical heterogeneity, due to the precision of the estimates made in the meta-analysis. This heterogeneity, which can be high even if the studies are very homogeneous from a clinical point of view, must always be considered when analysing the results.

Statistical heterogeneity is responsible for the variability between studies in the meta-analysis and, once again, we can find more than one source of that variability.

Sources of variability in meta-analysis

When considering the sources of variation between the effects observed in the primary studies of a meta-analysis, we can use two different models: the fixed effect model and the random effects model.

The fixed effect model assumes that the effect we want to estimate is the same in the populations from which the samples with which the primary studies are carried out come. In this way, the differences observed between the effects of the studies are due solely to chance.

For its part, the random effects model assumes that each population has its specific effect, so that these random effects are considered random variables that follow a certain distribution, and their inclusion in the model helps to improve the precision of the estimates and to consider variability between studies.

Thus, there are two sources of variability under the random effects model. On the one hand, our inseparable companion, chance. On the other hand, the variability between studies. And here Cochran’s Q comes into play, helping us to differentiate random or sampling error from the variability due to real differences between the studies of the meta-analysis.
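As a compact way of writing down these two sources of variability, the two models are usually expressed as follows. The symbols ε_k (sampling error), ζ_k (deviation of each study's true effect), μ (mean of the true effects) and τ² (between-study variance) are notation assumed here for illustration, and the normal distribution of ζ_k is the usual assumption rather than something stated in the text.

```latex
% Fixed effect model: one common effect, only sampling error
\theta_k = \theta + \varepsilon_k

% Random effects model: study-specific deviation plus sampling error
\theta_k = \mu + \zeta_k + \varepsilon_k, \qquad \zeta_k \sim N(0, \tau^2)
```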

Cochran’s Q

Back in the 1950s, Cochran defined his famous Q by resorting to a very beloved tool in the world of statistics, which is none other than the sum of the squares of the differences between the observed value and the calculated one, the residuals. In this case, the sum of the squares of the differences between the effect of each study and the summary effect measure calculated according to the fixed effect model is calculated, weighting each difference by the contribution of each study to the summary effect result.

Since a picture is worth a thousand words (or so they say), I show you the formula below.
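Written out from the definition just given (a weighted sum of the squared differences between the effect of each study and the fixed effect summary measure), the formula reads:

```latex
Q = \sum_{k=1}^{K} w_k \left( \theta_k - \theta \right)^2
```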

Let’s look a little more at the formula, although many would like to forget it as soon as possible. The letter θ represents the summary effect measure calculated by applying, as we have already said, a fixed effect model, while θ_k represents the effect of each primary study. K represents the number of studies, with k being each individual study. Finally, w_k is the weight of each study, calculated as the inverse of its variance (what is usual in the fixed effect model).
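For those who prefer code to formulas, here is a minimal sketch of the calculation in R. The function name cochran_q and the five study effects and variances are invented for illustration; they do not come from any real meta-analysis.

```r
# Cochran's Q from study effects and their variances (made-up numbers).
cochran_q <- function(theta_k, var_k) {
  w_k <- 1 / var_k                         # inverse-variance weights
  theta <- sum(w_k * theta_k) / sum(w_k)   # fixed effect summary measure
  sum(w_k * (theta_k - theta)^2)           # weighted sum of squared differences
}

theta_k <- c(0.10, 0.30, 0.35, 0.65, 0.45)  # hypothetical study effects
var_k   <- c(0.03, 0.03, 0.05, 0.01, 0.05)  # hypothetical variances
Q <- cochran_q(theta_k, var_k)
Q
```

Note that, if all the studies had exactly the same effect, every difference would be 0 and so would Q.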

We can easily understand that the value of Q will increase as the number of studies increases. Furthermore, since each difference is weighted by the inverse of the variance (the squared standard error of the effect), studies with a very small error will greatly influence the value of Q, even if their effect differs little from the summary measure.

The obtained Q value can be used to measure the variation in excess of what we can attribute exclusively to chance or, in other words, the heterogeneity between studies. But what value of Q will indicate that there is heterogeneity beyond that explained by sampling error?

The answer is that there is no value that we can establish in a general way, but it will depend on each case, so, to better understand how to calculate it, we are going to do a small simulation with completely invented data.

Experimenting with Cochran’s Q

Let’s see how the value of Q behaves under the assumptions of absence and presence of heterogeneity. To do this, we are going to use the R program to calculate the distribution that the Q values follow in these two situations. I will be writing the R commands, in case someone wants to replicate the experiment at the same time we are doing it.

Absence of heterogeneity

Let’s start by assuming that there is no heterogeneity. This implies that the value of the residuals (the difference between the effect of each study and the summary measure with the fixed effect model) is distributed normally around the value of the summary measure, with a mean of 0 and a certain variance, which, in this case, we are going to assume is 1. Thus, we can say that the residuals are distributed according to a standard normal, N(0,1).

Assuming our meta-analysis has 35 studies, we could calculate the residual values with R’s rnorm() function:

residuals <- rnorm(n = 35, mean = 0, sd = 1)

This would give us the values of the residuals for our meta-analysis, but what we are interested in is knowing how these residuals behave if we repeat the study many times, in order to obtain the sampling distribution of the residuals and, from it, that of Q. We can simulate repeating the meta-analysis 10,000 times with the following command:

fixed_err <- replicate(10000, rnorm(n = 35, mean = 0, sd = 1))

Finally, remembering the formula for Q, we can calculate its value in each of these 10,000 meta-analyses. For simplicity, let’s assume that the weight of all studies is equal to 1:

Q_fixed <- replicate(10000, sum(rnorm(n = 35, mean = 0, sd = 1) ^ 2))

Let no one worry too much if they do not fully understand how we do the simulation. The interesting thing comes now.

Since it is a weighted sum of squares, Q cannot take negative values, and we know that it follows, approximately, a chi-square distribution with a number of degrees of freedom equal to the number of studies minus one (K - 1). I take this opportunity to remind you that the chi-square distribution is characterized by having a mean equal to the number of degrees of freedom and a variance equal to twice the degrees of freedom.

To check that our simulation works, we plot the histogram of the sampling distribution of Q, superimposing the curve of the chi-square distribution with 35-1 degrees of freedom:

hist(Q_fixed, xlab="Q", prob = TRUE, breaks = 100, ylim = c(0, 0.06), xlim = c(0, 80), ylab = "", main = "Without heterogeneity", border="lightblue")

lines(seq(0, 80, 0.01), dchisq(seq(0, 80, 0.01), df = 35-1), col = "blue", lwd = 2)

As you can see in the attached figure, the Q values of the 10,000 simulated meta-analyses reasonably follow the theoretical distribution, so there are no surprises so far.
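A numerical check can complement the histogram. The following is a small sketch, assuming equal weights of 1, of the same simulation with the residuals centered on the summary effect estimated in each simulated meta-analysis (with equal weights, that summary is simply the mean of the study effects). Estimating the summary is what costs one degree of freedom and gives K - 1 rather than K. The seed and the name Q_centered are arbitrary choices for this illustration.

```r
set.seed(123)  # arbitrary seed, only for reproducibility
K <- 35

# With all weights equal to 1, the fixed effect summary is the mean of the
# study effects, so each residual is the effect minus that estimated mean.
Q_centered <- replicate(10000, {
  effects <- rnorm(n = K, mean = 0, sd = 1)
  sum((effects - mean(effects))^2)
})

mean(Q_centered)  # should be close to K - 1 = 34
var(Q_centered)   # should be close to 2 * (K - 1) = 68
```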

Presence of heterogeneity

In this case, we must assume that the residuals have two components of variability: chance and variability between studies. We can calculate it in a similar way as we did before, but adding a second component:

residuals <- rnorm(n = 35, mean = 0, sd = 1) + rnorm(n = 35, mean = 0, sd = 1)

We now calculate the Q values for the 10,000 simulated meta-analyses under the assumption that heterogeneity exists:

Q_random <- replicate(10000, sum((rnorm(n = 35, mean = 0, sd = 1) +
  rnorm(n = 35, mean = 0, sd = 1))^2))

Now we only have to do the graphical representation, in a similar way to what we did before:

hist(Q_random, xlab="Q", prob = TRUE, breaks = 100, ylim = c(0, 0.06), xlim = c(0, 160), ylab = "", main = "With heterogeneity", border = "pink")

lines(seq(0, 100, 0.01), dchisq(seq(0, 100, 0.01), df = 35-1), col = "red", lwd = 2)

The result is shown in the attached figure. As you can see, this time the sampling distribution of the Q values does not fit the theoretical chi-square distribution. For it to fit, the assumption of absence of heterogeneity would have to be met, which, as we know, is not the case on this occasion.

Well, it is this deviation from the theoretical distribution when there is variability between studies that we can leverage to determine whether there is heterogeneity between primary studies.
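To put a number on that deviation, we can count how often the simulated Q falls beyond the 95th percentile of the chi-square distribution with K - 1 degrees of freedom. This is only a sketch under the same invented assumptions as before (35 studies, weights equal to 1, a between-study component with the same standard deviation as the sampling error); the names Q_null, Q_het and crit are made up for the illustration.

```r
set.seed(2024)  # arbitrary seed, only for reproducibility
K <- 35
crit <- qchisq(0.95, df = K - 1)  # 95th percentile of chi-square, K - 1 df

# Q with sampling error only (residuals centered on the estimated summary)
Q_null <- replicate(5000, { x <- rnorm(K); sum((x - mean(x))^2) })

# Q with an added between-study component of the same size
Q_het <- replicate(5000, { x <- rnorm(K) + rnorm(K); sum((x - mean(x))^2) })

mean(Q_null > crit)  # close to 0.05 by construction
mean(Q_het > crit)   # much larger: most simulated Q values fall in the right tail
```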

The hypothesis test for Cochran’s Q

We are going to carry out a hypothesis test under the assumption that the null hypothesis of absence of heterogeneity is met.

What we do is calculate the value of Cochran’s Q with the effects of the studies in our meta-analysis. Ideally, under the null hypothesis assumption, the expected value is K-1 (the mean of the distribution), although we already know that the value will almost always be different, if only due to random error.

We will only have to calculate the probability of finding a value of Q as extreme or more extreme than the one we have found, just by chance. If this probability (the p-value) is greater than 0.05, we will not be able to reject the null hypothesis and we will assume that there is no heterogeneity.

On the contrary, if p < 0.05, we will reject the hypothesis and conclude that there is statistical heterogeneity between the primary studies in the meta-analysis.

Let’s look at an example. Suppose that in our meta-analysis of 35 studies we have calculated the value of Cochran’s Q, which is equal to 52.5. To obtain the value of p, we can execute the following command in R:

pchisq(q = 52.5, df = 34, lower.tail = FALSE)

This command gives us the probability of obtaining, just by chance, a value of Q like the one we have obtained or greater (the area under the curve of the right tail). The program gave us a p-value = 0.02. We can reject the null hypothesis and conclude that there is heterogeneity between the primary studies.

As always, the important thing is to understand the concept. The statistics program with which we carry out the meta-analysis will be in charge of all these calculations.

Variations on a theme by Cochran

Cochran’s Q is a widely used measure to detect heterogeneity and, roughly, we can say that the greater the variability between studies, the greater its value.

The problem is that its value is not very intuitive to assess the intensity of this variability, which is why two more estimators have been developed, which aim to be easier to interpret. These are I² and H², which can be calculated from the value of Q.

The I² statistic

I² defines the percentage of variability in the size of the effects that is not caused by random error or, in other words, that is due to heterogeneity. It is calculated according to the following formula.
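Following the usual definition, in which the expected value of Q under the null hypothesis (K - 1) is subtracted from the observed Q and the difference is divided by Q, the formula can be written as:

```latex
I^2 = \frac{Q - (K - 1)}{Q}
```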

We already know that, if there is no heterogeneity, the expected value of Q is K-1 (the mean of the chi-square distribution). If we subtract K-1 (the expected value due to random error) from the value of Q (observed) and divide the result by the total Q, we obtain the proportion of the value of Q that is not explained by chance.

In the event that Q is less than the expected mean (K-1), we assume that I², which is usually multiplied by 100 and expressed as a percentage, is equal to 0 (it is logical, the proportion of variability due to heterogeneity may be more or less high, but never less than 0%).

The value of I² can range from 0 to 100%, with the limits of 25%, 50% and 75% usually being considered to delimit when there is low, moderate and high heterogeneity, respectively.

An advantage of the I² is that it does not depend on the units of measurement of the effects or the number of studies, so, unlike what happens with Cochran’s Q, it does allow comparisons with different effect measures and between different meta-analyses with different number of studies.

If we calculate the value of I² in our example with 35 studies and a Q value of 52.5, we see that it is 35.2%, indicating moderate heterogeneity.

The H² statistic

This statistic, much less famous than the previous one, is also calculated from the value of Q and represents the ratio between the observed value (Q) and the one expected under the assumption of the null hypothesis (K-1, the mean of the distribution):
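As a ratio between the observed value and the expected one, it can be written as:

```latex
H^2 = \frac{Q}{K - 1}
```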

In this case it is not necessary to make any correction when the value of Q is less than K-1. When there is no heterogeneity, H² ≤ 1. Values greater than 1 indicate variability between studies.

In our example, with a Q value of 52.5 and 35 studies, H² is equal to 1.54, which indicates the existence of heterogeneity.
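Both calculations for our example take just a few lines of R. This is a minimal sketch; the variable names I2 and H2 are chosen here for illustration.

```r
Q <- 52.5  # Cochran's Q from the example
K <- 35    # number of studies

I2 <- max(0, (Q - (K - 1)) / Q)  # set to 0 when Q is below K - 1
H2 <- Q / (K - 1)                # observed Q over its expected value

round(100 * I2, 1)  # 35.2 (percent)
round(H2, 2)        # 1.54
```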

We’re leaving…

And this is where we have come today.

We have seen how the Cochran’s Q value is usually used to detect heterogeneity between studies in a meta-analysis, although it has some defects, such as depending on the number and precision of the studies.

I², for its part, is somewhat less sensitive to these effects and easier to interpret, but still depends on the precision of the studies included in the meta-analysis. If the studies have very large samples, the random error will tend to 0, but Q will increase when weighted by the inverse of the variance and the value of I² will tend towards 100%.

That is why it is not a good idea to limit ourselves to calculating the values of Q and I² (or H², which behaves similarly to I²), and many authors advise completing the assessment of heterogeneity with the calculation of τ² (which is not, strictly speaking, a measure of heterogeneity) and prediction intervals. But that is another story…

This post, The polysemy of Q, was published in Science without sense...double nonsense by Manuel Molina.

The Palace of Probabilities

Manuel Molina — Tue, 27 Feb 2024 07:48:39 +0000


Fisher’s z.

With small sample size, the variance of the Pearson’s coefficient increases, decreasing the precision of its estimates. The use of Fisher’s z helps to stabilize the variance and obtain more precise estimates in meta-analyses whose outcome measure is a correlation, and when the sample of primary studies is small.

In the world of numbers, there is a magical kingdom called Probabiland, where every corner is impregnated with the excitement of statistics and the magic of probabilities. Rivers flow with data, forests are filled with regression trees, and in the mountains, peaks reach heights proportional to their statistical significance.

One of the most popular places in the kingdom is the majestic Palace of Probabilities, a casino where each table is a statistical adventure, and each card reveals a secret hidden in the data. Players can explore the gaming tables, from regression roulette to hypothesis testing poker, to non-linear correlation blackjack or discrete distributions, for the more daring.

We are going to stop at the “Reduced Samples” table, where the cards are piled up in small piles of data and uncertainty reaches its highest levels. At this table we will see two common players, Pearson’s R and Fisher’s z.

Pearson’s R, known for its mathematical elegance, is like a reliable card that always adds up to 1. But, although it is ideal for secure, linear connections, it faces challenges in tiny samples, where imprecision can significantly impact its estimation ability, demanding extra caution in interpreting the results.

In contrast, Fisher’s z, with its touch of statistical intrigue, acts as the wild card that can change the game in an instant. It is capable, even though the sample is small, of maintaining its precision with a certain elegance.

Today we are going to stay in this corner of the statistical casino, where R and z come together to face the challenge of small samples, collaborating in meta-analyses in which the effect measure is a correlation, and the sample sizes of the primary studies are limited.

Pearson’s correlation coefficient

We already know that one of the objectives of a meta-analysis is to obtain a global outcome measure that summarizes the results of each of the studies included in the systematic review. This measure provides a general estimate of the effect of interest, combining information from all studies considered.

When the effect studied is a correlation, the parameter most frequently used is the product-moment correlation coefficient, better known as the Pearson’s correlation coefficient (R).

We can calculate R between two quantitative variables by dividing the covariance of the two variables by the product of their standard deviations, according to the following formula:

If we look a little at the formula, R is nothing more than the covariance standardized by the standard deviations, which makes it possible to compare different correlation coefficients without their values being affected by the measurement scales of the variables.

This is a point in favor of Pearson’s R, but we will see that it has other properties that are not so favorable, especially when we handle small sample sizes.

We know that the global measure that we calculate in a meta-analysis tries to be an estimate of that measure of correlation existing in the general population, which is unreachable. To make this estimate, we calculate a confidence interval, usually at 95%, which gives us an idea of the uncertainty and the margins between which the population measurement may be.

Well, the problem with Pearson’s R is that its variance depends greatly on the sample size of the data with which it was calculated. We can see it in the following formula:

The first thing we see is that R appears in the numerator, subtracted from 1, so the numerator will be smaller as the value of R increases (and the variance will decrease). We see it in scatter diagrams if we represent two quantitative variables. When the correlation is greater, the variability of the data is lower, and the point cloud has a more elongated shape. When the correlation is lower, the cloud will be more circular and there will be more variability in the data.
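One commonly used large-sample approximation, consistent with the description that follows (R subtracted from 1 in the numerator, the sample size in the denominator), is shown here. It is an assumption that this is the exact form intended; some texts use n - 2 instead of n - 1 in the denominator.

```latex
\mathrm{Var}(R) \approx \frac{\left(1 - R^2\right)^2}{n - 1}
```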

The value of R does not depend on us but is a characteristic of the relationship between the two quantitative variables. But there is another component in the formula on which we can act: the sample size (n).

Since the sample size is in the denominator, the variance will increase as the sample size decreases. Look at the first figure, which represents the sample size on the abscissa and the variance on the ordinate. Let’s focus, for now, on the blue line, which represents the variance of Pearson’s R (later we will see what the red line means).

We can see how the variance begins to rise as n decreases, taking a very high slope with sizes less than 20 participants. In other words, the variance becomes almost exponentially larger (more unstable) with smaller sample sizes.

This has an unintended consequence. R estimates are based on the point estimate (the R value of each study) and the standard error (which is the square root of the variance), so they will be more imprecise when the sample size is small, a situation that occurs relatively frequently when we deal with the primary studies of a meta-analysis.

As the sample size increases, the correlation coefficient estimates become more stable and closer to the true population correlation. However, in small samples, values may fluctuate more due to increased variability of estimates.

More simply put, and to emphasize its importance, sample size can have an impact on the interpretation and reliability of Pearson’s correlation coefficient estimates.

This is the reason why, in these situations, it is advisable to perform a transformation to obtain Fisher’s z (which is not equivalent to the z of the normal distribution or the z score), which performs better with smaller samples than Pearson’s R.

Let’s see what it consists of.

Fisher’s z

As we have already said, to calculate the global measure when the sample sizes are small, instead of directly using the correlation as a measure of effect, it is transformed using the inverse hyperbolic tangent function (arctanh), also known as Fisher’s transformation.

Don’t be alarmed, behind these offensive words lies a fairly simple formula:
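In its standard form, Fisher’s transformation of a correlation coefficient R is:

```latex
z = \frac{1}{2} \ln\left( \frac{1 + R}{1 - R} \right) = \operatorname{arctanh}(R)
```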

We will understand the usefulness of Fisher’s z if we look at the formula to calculate its variance:
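The standard large-sample expression, which depends only on the sample size n and not on the value of the correlation, is:

```latex
\mathrm{Var}(z) = \frac{1}{n - 3}
```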

You will say that it still has the value of n in the denominator, like the variance of R, and that the variance is greater when the sample size decreases. This is true, but it is also logical that it happens. We already know that, in general, and for the same level of confidence, the precision of an estimate increases (its confidence interval will be narrower) when the sample size increases and/or variability decreases, and vice versa.

But look now at the red line in the previous graph, the one we decided to ignore earlier. It represents the variance of Fisher’s z as a function of sample size. Although we see that the effect is similar to that suffered by Pearson’s R, its magnitude is much smaller, so z will be much more precise than R when the sample is small.

The reason behind this transformation is related to the statistical properties of probability distributions. It is like when we apply a logarithmic or inverse transformation to a series of non-normal data so that the transformed data are distributed approximately normally and we can apply a test that assumes the normality of the data.

As transformed correlations behave closer to a normal distribution, statistical methods based on normality will work more appropriately. In particular, when performing a meta-analysis, the Fisher’s transformation allows the effect sizes of individual studies to be combined more accurately and appropriately, since many meta-analysis methods assume a normal distribution of effect sizes.

In any case, when we see a publication on one of these meta-analyses, we will probably not see the z anywhere and the authors will only show the Pearson’s correlation coefficient.

This has its reason. It is usual to use a statistical program to perform the meta-analysis. One of the options that we can give the program is to perform this transformation. In that case, the Pearson’s coefficients of the primary studies will be transformed into their Fisher’s z equivalents, the program will perform all the calculations with Fisher’s z and, upon completion, it will perform the inverse transformation to present the reader with the Pearson’s coefficient, which is usually easier to interpret for most readers.

We can transform Fisher’s z into its R value by applying the following formula:
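The inverse of Fisher’s transformation is the hyperbolic tangent, which can be written as:

```latex
R = \frac{e^{2z} - 1}{e^{2z} + 1} = \tanh(z)
```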

A practical example

The truth is that, in practice, we do not need to pay much attention to everything we have explained, since statistical programs take care of these details. In any case, it doesn’t hurt to know how and why things are done.

Let’s perform a simple example of meta-analysis with primary studies whose outcome variable is a correlation. To do this, we are going to use the R program (not to be confused with the R correlation coefficient) and a fictitious meta-analysis that we are going to invent on the fly to try to estimate the correlation that exists between the levels of magnetic fildulastrin and those of foolsterol in that terrible disease that is fildulastrosis.

If you feel like it, we can continue together step by step.

1. We load the necessary libraries in R:

library(tidyverse)

library(meta)

If you do not have these packages installed, it is necessary to install them before using them the first time using the install.packages() command.

2. Create our data set: We are going to create a data set with 15 completely made-up studies. To do a correlation meta-analysis with R we only need the value of Pearson’s R coefficient (not transformed) and the sample size of each study. We added it to the data set, along with a third column with the name of the study (capital letters A to O, to keep things simple). We create the three variables (vectors in R) and assemble them into a dataframe:

R <- c(0.58, 0.62, 0.54, 0.52, 0.76, 0.69, 0.66, 0.58, 0.62, 0.81, 0.49, 0.60, 0.79, 0.43, 0.69)
N <- c(18, 22, 45, 15, 20, 22, 24, 30, 50, 35, 14, 20, 16, 38, 40)
S <- LETTERS[1:15]
data <- tibble(S, N, R)

3. We do the meta-analysis: we are going to use the metacor() function from the R meta package. Although this function admits many parameters, we are going to make the model as simple as possible:

meta <- metacor(cor = R, n = N, sm = "zcor", studlab = S, data = data, comb.fixed = FALSE, comb.random = TRUE)

We have indicated to the program where the values of the untransformed Pearson correlation coefficients (cor), the sample sizes (n), and the names of the primary studies (studlab) are located, all within the indicated data set (data). Finally, we ask it to apply a random effects model, do the Fisher’s transformation (sm = "zcor"), and store the result in an object called meta.

4. We obtain the results: to do this, we simply execute the summary(meta) command.

In the second figure you can see the results output that the program offers us.

Firstly, it shows us a table with three columns with the list of studies with their Pearson’s coefficients (the ones we introduced), the 95% confidence intervals that it estimates for each of them and the weighting that it establishes for each study for the calculation of the global summary measure.

Remember that the R coefficient of each study is considered a point estimate of the coefficient in the population from which that study’s sample comes. The first step is to calculate the confidence interval, which is the estimate for that population. This is where the stabilization of the variance, which depends on the sample size, comes into play, but if you look for the Fisher’s z values you will not see them anywhere.

We have already mentioned it, the program converts R into z, does all the calculations and performs the inverse transformation to show us only R values, with which we are usually more familiar.

But Fisher’s z values are in the meta object. If you want to see the Fisher’s z values of the individual studies and their standard errors, you can write the commands print(meta$TE) and print(meta$seTE).
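If you are curious about what happens behind the scenes, the following sketch reproduces the per-study calculation in base R. It assumes that the program uses the standard formulas (z = arctanh(R), with standard error 1/sqrt(n - 3), and back-transformation with the hyperbolic tangent); if so, z and se_z should agree with meta$TE and meta$seTE. The object names z, se_z, lower and upper are chosen here for illustration.

```r
# Same made-up correlations and sample sizes as in the example above
R <- c(0.58, 0.62, 0.54, 0.52, 0.76, 0.69, 0.66, 0.58, 0.62, 0.81,
       0.49, 0.60, 0.79, 0.43, 0.69)
N <- c(18, 22, 45, 15, 20, 22, 24, 30, 50, 35, 14, 20, 16, 38, 40)

z    <- atanh(R)         # Fisher's z, i.e. 0.5 * log((1 + R) / (1 - R))
se_z <- 1 / sqrt(N - 3)  # standard error of z, depends only on N

# 95% confidence limits on the z scale, back-transformed to R with tanh()
lower <- tanh(z - qnorm(0.975) * se_z)
upper <- tanh(z + qnorm(0.975) * se_z)

data.frame(study = LETTERS[1:15], R = R,
           z = round(z, 3), se_z = round(se_z, 3),
           lower = round(lower, 2), upper = round(upper, 2))
```

Since the interval is symmetric around z but not around R, the back-transformed limits are not equidistant from the correlation coefficient, and they always stay between -1 and 1.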

Continuing with the results, after this table the number of studies (k = 15) and observations (o = 409) with which the meta-analysis was carried out are shown.

The program then informs us that it has used a random effects model and has calculated an overall measure of the Pearson’s correlation coefficient of 0.63. It specifies its confidence interval (remember that it is a population estimate) and its statistical significance.

The rest is the heterogeneity study carried out and some final notes on the methodology of the meta-analysis. In the last line it reminds us that the Fisher’s transformation has been used to estimate the global correlation measure.

We are leaving…

And here we are going to leave this exhausting sport of meta-analysis for today.

We have seen how, in general, with large samples the Pearson’s correlation coefficient tends to be more precise, while with small samples it can be subject to greater variability and, therefore, be less reliable as an estimator of the true relationship between variables.

In these cases, performing the Fisher’s transformation helps stabilize the variance of the correlation estimates, which is important when making statistical inferences.

We can encounter a similar problem when the global measure of the meta-analysis is a difference in means of a quantitative variable. In general, it is advisable to use a standardized measure of effect, such as Cohen’s d, but, in cases of small samples, it is also advisable to apply a correction and use another parameter, Hedges’ g. But that is another story…

This post, The Palace of Probabilities, was published in Science without sense...double nonsense by Manuel Molina.

The art of resignation

Manuel Molina — Tue, 30 Jan 2024 07:53:12 +0000


Precision enrichment ratio.

The procedure of choosing the cut-off point for a diagnostic test is reviewed. To decide this threshold, which is influenced by the characteristics of the model and the clinical scenario in which it will be applied, we will take into account the sensitivity and precision of the test for each possible cut-off point. The precision enrichment ratio will be useful in cases with a large imbalance between the two diagnostic categories.

An error that we clinicians make more frequently than is desirable is to interpret the result of a diagnostic test as a definitive verdict: the patient is sick or healthy, right?

But today I’m going to tell you about a little secret from the world of medicine: real life is rarely so simple. When we deal with diagnostic tests, we are immersed in an intricate game of probabilities in which there is no clear dividing line. This is especially so when we have to choose the cut-off point for a test with a continuous result or, in the case of a test with a dichotomous result (positive or negative), when we have to choose the cut-off point in the model that provides us the probability that the patient is sick or not.

The problem in many cases is that, whatever point is chosen, we will not be able to take full advantage of all the qualities of the test. As in many other aspects of life, where we must make choices and sacrifices, in the world of diagnostic tests, choosing the perfect cut-off point becomes an art in itself in which, instead of seeking the definitive answer, we will find ourselves seeking the delicate balance between the precision and sensitivity of the test.

But let no one despair because of this very pessimistic introduction. There are tools that help us educate the art of resignation and allow us to choose, if not an ideal point, then the most appropriate for each situation. We are going to talk about one of these tools today. It is another of those resources that come from the world of data science and that has the dazzling name of precision to prevalence enrichment ratio.

Let’s see what this is all about.

Statement of the problem

To more easily show the problem of choosing the cut-off point, we are going to see a couple of examples in which we will develop a multiple logistic regression model to estimate the probability of disease based on the values of several independent variables that we can obtain from the patient.

Once the model has been developed and applied to each patient, it will provide us with the logarithm of the odds that the patient presents the result defined as 1 in the binary dependent variable (0 = healthy, 1 = sick, the most common coding). From this result we can calculate the probability that the patient is sick or, in other words, that the value of the dependent variable is 1.

In practice, an isolated probability value is of little use, so we will have to choose a cut-off point above which we will consider the test result to be positive (sick) and below which we will consider it to be negative (healthy). Finally, this result, positive or negative, will be compared with that of the reference or gold standard, which we assume tells us with certainty whether the patient is healthy or sick.
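As a small illustration of that last step, here is a minimal sketch that turns a log-odds value into a probability with the logistic function and then applies a cut-off. It is written in Python purely for illustration (it is not the R script of this post), and the function names and example values are assumptions.

```python
import math

def log_odds_to_probability(log_odds):
    """Logistic (inverse logit) function: p = 1 / (1 + exp(-log_odds))."""
    return 1.0 / (1.0 + math.exp(-log_odds))

def classify(probability, cutoff=0.5):
    """Return 1 (sick) if the probability reaches the cut-off, else 0 (healthy)."""
    return 1 if probability >= cutoff else 0

# Hypothetical log-odds produced by a model for three patients
for log_odds in (-2.0, 0.0, 1.5):
    p = log_odds_to_probability(log_odds)
    print(f"log-odds {log_odds:+.1f} -> probability {p:.3f} -> class {classify(p)}")
```

Changing the cutoff argument is all it takes to move the decision threshold, which is precisely the choice this post is about.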

This comparison is made with the usual contingency table, which allows us to calculate parameters such as sensitivity (S), specificity (Se), predictive values and likelihood ratios. In general, we usually base ourselves primarily on two parameters to assess which is the appropriate cut-off point for our clinical scenario.

The first is the S, the proportion of sick patients in whom the test gives a positive result. The second is the precision of the test, which tells us what proportion of the positives are actually sick. This is what clinicians know better as positive predictive value (PPV).
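To make these two definitions concrete, here is a minimal Python sketch that counts true positives, false negatives and false positives and derives S and PPV from them. The data are made up and the function name is only illustrative.

```python
def sensitivity_and_ppv(truth, predicted):
    """truth and predicted are sequences of 0 (healthy) and 1 (sick)."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # sick, test positive
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # sick, test negative
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy, test positive
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, ppv

# Made-up example: 4 sick and 6 healthy patients
truth     = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
s, ppv = sensitivity_and_ppv(truth, predicted)
print(f"S = {s:.2f}, PPV = {ppv:.2f}")
```

Note that S only looks at the sick patients, while the PPV only looks at the positive results, which is why the PPV depends on the prevalence and S does not.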

The million-dollar question is, in each case, what probability cut-off point we choose. A quick answer, especially if we use a logistic regression model, would be to choose the one with the probability greater than or equal to 0.5 for the positive result and less than 0.5 for the negative result, which would be the “natural” cut-off point for the logistic function of the model. As is easy to imagine, this will work very rarely, generally when the disease prevalence is close to 0.5.

Another thing we can do is graphically represent the probabilities obtained in the two groups (healthy and sick according to the gold standard). In an ideal world, the two probability density curves would be more or less well differentiated, so we would go to the central valley where the values of the two curves are minimum. On the right would be the positive cases and, on the left, the negative ones.

But in our daily life, things are not usually so simple. Let’s look at a couple of examples to understand it better.

Practical example (relatively simple)

To illustrate the examples, I am going to use the freely available R program. If you want to reproduce the experiment as described in this post, you can download the script from this link.

For this first example, I am going to use the Pima.te data set, from the MASS package in R. It contains a record of 332 pregnant women over 21 years of age who are evaluated for a diagnosis of gestational diabetes according to the WHO criteria. In addition to the diagnosis of diabetes (yes/no), it contains information on the number of pregnancies, plasma glucose, diastolic blood pressure, triceps skinfold, body mass index, history of diabetes and age.

In this data set there are 109 diabetic and 223 non-diabetic women, which represents a prevalence of gestational diabetes of 0.33.

In this case, we are going to develop a multiple logistic regression model with the diagnosis of diabetes (1 = yes, 0 = no) as the dependent variable and blood glucose and body mass index as independent variables. We are not going to ask ourselves if this is the best possible model, since it is not the topic of this post and, as we have described it, it serves us perfectly for what we want to show.

Before choosing a cut-off point to consider the test result (the model) as positive or negative, we can estimate its overall performance by calculating the area under the ROC curve (AUC), which you can see in the attached figure. Our calculation tells us that the test has an AUC = 0.82, which suggests that it performs well in discriminating between positive and negative results.

Now we have to choose the cut-off point for the test. To do this, we begin by graphically representing the density functions of the probabilities that the model gives us for the two groups, as you can see in the next figure.

As often happens in real life, there is quite a bit of overlap between the two curves, so no cut-off point will perfectly separate the sick from the healthy.

We are going to begin to educate our art of resignation by assessing, as a cut-off point, the one corresponding to the probability of 0.5, marked with the dashed red line.

We can calculate that, for this cut-off point, S = 0.53 and Se = 0.91. The performance for non-diabetic pregnant women seems adequate, since only 20 of the 223 would be diagnosed as diabetic by mistake (false positives, FP). However, with S = 0.53 we would leave about 51 of the 109 diabetics undiagnosed, all those whose probability is to the left of the cut-off point.

How can we optimize this choice? If we go to the right, the Se and the PPV of the test will increase, but the S will decrease even more. At the cut-off point p = 0.75, we will have an S = 0.33, an Se = 0.98 and a PPV = 0.9.

On the contrary, if we go to the left, the S will increase, but so will the FP as the Se and the PPV decrease. For p = 0.25, S = 0.77, Se = 0.66 and PPV = 0.53. You can see how these parameters change for different cut-off points in the first table.
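The relationship between these figures can be checked with Bayes' theorem, which gives the PPV from S, Se and the prevalence. The following Python sketch is only an illustration with the rounded values quoted above (the function name is an assumption), so its results match the PPVs in the text only approximately.

```python
def ppv_from_rates(sensitivity, specificity, prevalence):
    """Bayes' theorem: PPV = S*prev / (S*prev + (1 - Se)*(1 - prev))."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1.0 - specificity) * (1.0 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)

prevalence = 109 / 332  # diabetics / total women in the data set

# (cut-off, S, Se) as quoted in the text, rounded to two decimals
for cutoff, s, se in ((0.25, 0.77, 0.66), (0.75, 0.33, 0.98)):
    ppv = ppv_from_rates(s, se, prevalence)
    print(f"p = {cutoff}: PPV = {ppv:.2f}")
```

The same function shows why precision collapses when the prevalence is low: with identical S and Se, a smaller prevalence shrinks the numerator much faster than the denominator.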

We see, then, that it is not possible to have very good S and Se at the same time: we must prioritize one of the two. We will have to decide if we want to favour S (moving to the left of the curve) or the precision of the test, reflected by its Se and its PPV (moving to the right). In any case, in this example we would not have to make an excessive renunciation, since there is not great variation between the indicators, unless we go to the extremes of the probability values generated by the diagnostic model.

Considering the clinical context, we will probably choose to prioritize sensitivity more than precision. Surely, we would prefer that the smallest possible number of diabetic pregnant women remain undiagnosed. The price to pay will be a greater number of false positives, which we will be able to classify correctly later with another relatively simple test, such as a glucose overload test. In my opinion, a good cut-off point for this model would be between 0.2 and 0.3.

Another practical example (somewhat more complex)

We are now going to look at another clinical scenario that is somewhat more complex to resolve. To do this, we resort to a data set that I have just invented and that includes the results of a fictitious study for the diagnosis of that terrible disease that is fildulastrosis. You can download the data at this link.

This is a registry of 10,000 patients who attend an emergency department and in which data is collected on the determination of some molecules that can help in the diagnosis of this disease, such as foolsterol, vitaminite, endorphinol, idiotin, stupidine and lipidosin. The record is completed by the diagnosis of fildulastrosis (0 = no, 1 = yes) according to the result of magnetic fildulastrine, the gold standard for this disease.

In this data set there are 473 patients with fildulastrosis, which represents a prevalence for the disease of 0.047, rounding up, 5%.

To begin to solve the problem, we developed a multiple logistic regression model with the diagnosis of fildulastrosis (1 = yes, 0 = no) as the dependent variable and the rest of the analytical determinations as independent variables.

As in the previous example, we are not going to ask ourselves if this is the best possible model, it is not the topic that concerns us today.

Also as in the previous example, we will first look at the overall performance of the model (see figure). We see that it has an AUC = 0.87, which suggests that the test (the model) has a good ability to discriminate between healthy and sick people.

This is a good start, so we are quite encouraged. It seems that, once we have learned the procedure with the example of diabetic pregnant women, solving this scenario is going to be a piece of cake.

However, the euphoria quickly fades when we look at the probability density curves that the model provides for the two groups (next figure). What we see now is no longer simply an overlap of the two curves but is more like a superposition. How are we going to differentiate between healthy and sick?

In this scenario it makes little sense to choose the cut-off point p = 0.5. We would have a very high Se, with very few false positives, but the S would be too low, 0.09, which would leave 400 of the 473 patients undiagnosed. In a word: disastrous.

We have no choice but to move, and quite a bit, towards the left side of the curve. You can see the performance of the test for different cut-off points in the attached table.

For p = 0.28, we obtain S = 0.25 and PPV = 0.35. We would still leave 352 patients undiagnosed, despite having 220 FP. In another word: another disaster (well, in two words).

Overcome by discouragement, we go as far as p = 0.05. Logically, the sensitivity has improved, 0.77, but the precision is still very low, with a PPV = 0.15. This implies about 2000 FP, in which we would have to rule out the disease by doing a magnetic fildulastrine, a tremendously expensive and annoying test for the patient.

What can we do? Is there any solution to this problem?

Well yes, there is. But this solution requires us to use two resources. The first, that we put into practice our art of resignation. The second, the use of another tool: the enrichment ratio.

Precision enrichment ratio

When we deal with diseases with low prevalence, we can often find ourselves in a situation similar to the present one. The density curves overlap almost completely, which makes choosing a cut-off point extremely difficult.

In these cases, we can take advantage of the fact that the probability curve dies earlier in negatives than in positives. This means that, as we move to the right from the lowest extreme of probability, the model will identify subpopulations in which the risk of being sick is greater than in the global set of patients (it is logical, we see it by the increase in PPV).

We will not be able to use the test to classify healthy and sick with good sensitivity and precision, but we can identify subjects whose risk of being sick is higher than average. In this way, we will move to the right carefully observing how much the S decreases and how much the precision (the PPV) increases with respect to the prevalence of the disease. This ratio between PPV and prevalence is what is known as the enrichment ratio.
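As a quick check of this definition, here is a minimal Python sketch that computes the enrichment ratio for the PPVs quoted earlier for the fildulastrosis example, together with the inverse calculation. The function names are illustrative assumptions and the inputs are the rounded values from the text.

```python
def enrichment_ratio(ppv, prevalence):
    """How many times the risk among test positives exceeds the average risk."""
    return ppv / prevalence

def ppv_from_enrichment(ratio, prevalence):
    """Inverse calculation: risk of disease in the group selected by the test."""
    return ratio * prevalence

prevalence = 473 / 10_000  # patients with fildulastrosis / total patients

# (cut-off, PPV) as quoted in the text
for cutoff, ppv in ((0.05, 0.15), (0.28, 0.35)):
    ratio = enrichment_ratio(ppv, prevalence)
    print(f"p = {cutoff}: enrichment ratio = {ratio:.1f}")

# A five-fold enrichment corresponds to a risk of roughly one in four
print(f"PPV for a ratio of 5: {ppv_from_enrichment(5, prevalence):.2f}")
```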

We can represent it graphically, as it appears in the next figure. In the bottom graph we observe the evolution of the values of S when increasing the probability threshold that we consider as the cut-off point. As we already know, the higher this probability threshold, the lower S the test will have.

In the top graph we see how the enrichment ratio increases as we move to the right. This is logical, since this ratio is directly proportional to the precision of the test and, therefore, its PPV.

Finally, we can draw a line at the point at which we consider that we cannot lose any more sensitivity and that, at the same time, allows us to detect individuals with a risk a number of times greater than the average that seems appropriate to our scenario.

In this fictional example, it seems correct to me to choose a threshold p = 0.12. I have already used the new tool. Now it’s time to exercise the art of resignation.

At this cut-off point I have an S = 0.5, which means I only diagnose half of the patients. The rest will have to wait until they have more symptoms of illness so that we can try to rule it out by other means, which will surely be more expensive and annoying. In return, I detect a population with a risk of disease 5 times higher than the average, that is, with a probability of approximately 0.25 of being sick. In these we will have to determine magnetic fildulastrine, which will only be positive in one in four of them.

A final analysis

I suppose that, at this point, many of you are thinking that the test is worthless. Furthermore, some of you may wonder how such a useless test can have such good performance indicators. If you remember, the AUC = 0.87.

The problem is that the power of the test to diagnose or rule out the disease will depend on the cut-off point chosen. If you calculate the likelihood ratios for the cut-off point p = 0.12, the positive (PLR) is 6.25 and the negative (NLR) is 0.55, suggesting a modest power for the positive diagnosis and a zero contribution for the negative.
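For those who want to reproduce this calculation, here is a minimal Python sketch of the likelihood ratios and of the post-test probability they imply. S = 0.5 is the value quoted in the text; Se = 0.92 is not given in the post and is an assumption, back-calculated so that the PLR equals 6.25. The function names are also illustrative.

```python
def likelihood_ratios(sensitivity, specificity):
    """PLR = S / (1 - Se); NLR = (1 - S) / Se."""
    plr = sensitivity / (1.0 - specificity)
    nlr = (1.0 - sensitivity) / specificity
    return plr, nlr

def post_test_probability(pre_test_probability, likelihood_ratio):
    """Convert probability to odds, multiply by the ratio, convert back."""
    pre_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)

# S from the text; Se = 0.92 is an assumed, back-calculated value
plr, nlr = likelihood_ratios(0.5, 0.92)
print(f"PLR = {plr:.2f}, NLR = {nlr:.2f}")

prevalence = 473 / 10_000
print(f"Probability after a positive: {post_test_probability(prevalence, plr):.2f}")
print(f"Probability after a negative: {post_test_probability(prevalence, nlr):.3f}")
```

The probability after a positive result comes out close to the one in four mentioned above, while a negative result barely moves the probability away from the prevalence.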

To achieve a PLR > 10 we would have to choose the cut-off at p = 0.28, but the S would decrease to 0.25. We couldn’t afford so many false negatives.

Regarding NLR, we cannot achieve a low value with any cut-off point. This test has a difficult time ruling out the disease. The cause is the imbalance of the two categories to be classified because of the low prevalence of the disease. Keep in mind that if we say that all patients are healthy without doing any tests, we will be right by chance 95% of the time.

So, is it useful or not? Yes, I think so. Consider that you are on duty in an emergency department, and you want to know if the patients who come with a certain symptom suffer from this serious illness. You cannot measure fildulastrine in everyone, since it is very expensive and annoying and, at the end of the day, the vast majority (95%) will not suffer from the disease.

But it is not reasonable to do nothing either, since the disease is very serious and we are interested in diagnosing it in its initial stages, if possible. Well, this test could help us to identify the group with a higher risk of disease, which we could follow in consultation, repeat the test after a while or perform the gold standard, as seems most appropriate.

We are leaving…

And this is as far as we go for today.

I think it has become clear that the choice of the cut-off point for the positivity of a diagnostic test depends not only on the characteristics of the test, but also on the clinical scenario in which you want to apply it.

Furthermore, we have verified how this choice can become even more difficult when we deal with very low prevalences, situations in which statistical models have a harder time classifying healthy and sick people.

Finally, we have seen how the enrichment ratio, one of the tools that come from the field of data science, can help us in choosing the cut-off point in these more complex situations.

This is not the only tool we can use to resolve the delicate balance between a test’s sensitivity and its precision. There are others, such as the F-score, also originating in data science. But that’s another story…

This post, The art of resignation, was published in Science without sense...double nonsense by Manuel Molina.

An epic dance

Manuel Molina — Tue, 05 Dec 2023 07:18:03 +0000

Regression regularization techniques.

Multiple regression regularization (shrinkage) techniques can be very useful to address collinearity or overfitting problems. In addition, they can be used to select the independent variables and reduce multidimensionality, achieving more robust and easy-to-interpret models. Ridge, lasso and elastic network regression techniques are described.

It seems to me that lately I’ve been spending more time than I should have wandering into a faraway and mysterious corner of the world of statistics, where an epic dance is taking place that has left many connoisseurs stumped and, at times, reeling. That corner I am referring to is that of multiple regression, where the numbers dance to the rhythm of the data and the equations get entangled in a mathemagical whirlwind.

No one will be surprised, then, that when I fall asleep, I am assaulted by unreal and distressing nightmares, worthy of a dark Lovecraft story. Without going any further, the other night I dreamed that I was attending a fancy data party. The numbers were dressed in their best Gaussian garb, the equations chatted in groups, and the independent variables tried to impress the dependent ones.

In the center of the dance floor, the DJ was spinning probability distribution records while regression models made their grand entrance. It was at that moment that my gaze fixed on one of the corners of the room, attracted by a dazzling couple. She was the star of the night, the diva of gradient descent, the queen of regularization. With her penalty outfit and her restrained demeanor, the LASSO Regularization was joined by Ridge, the intense-eyed hunk who slid into less overfitted models. Both were preparing to enter the dance floor.

I woke up with a start and drenched in sweat, gasping for air. I found it impossible to get back to sleep that night, so I began to investigate how my models could find the balance between the extravagance of overfitting and the rigidity of underfitting. If any of you are interested in knowing the results of this delusion, I invite you to continue reading this post.

The tribulations of multiple regression

Linear regression is a statistical method for modeling the relationship between a dependent variable and one or more independent variables. The goal is to find the equation that best fits the data to predict the values of the dependent variable based on the values of the independent variables.

Usually, this is achieved by the method of least squares (also called ordinary least squares), which tries to minimize the differences between the observed values of the dependent variable and the values predicted by the model, differences that are known by the name of residuals. The residuals are squared (so that the positives do not cancel out with the negatives) and added. The best regression equation will be the one in which this sum of the squares of the residuals has the lowest value.
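To see the idea at work in the simplest case, here is a minimal Python sketch of a least squares fit with a single independent variable, using the standard closed-form solution. The data are made up and the function names are illustrative.

```python
def fit_simple_ols(x, y):
    """Least squares line y = intercept + slope * x for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def residual_sum_of_squares(x, y, intercept, slope):
    """Sum of the squared differences between observed and predicted values."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Made-up data lying close to a straight line
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_simple_ols(x, y)
rss = residual_sum_of_squares(x, y, intercept, slope)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}, RSS = {rss:.3f}")
```

Any other line drawn through these points, for example one with a slightly different slope, gives a larger sum of squared residuals.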

Although this method produces a model that has a good fit to the data with which it has been built, it may not be as good when trying to predict the values of the dependent variable with new data, different from those used during its construction. This can happen under several circumstances.

The first is collinearity, which occurs when there is a high correlation between the predictor or independent variables. This can cause some of the regression coefficients to take excessively high or low values, with the opposite sign to what we might expect based on our knowledge of the model, or with remarkably high standard errors. In this situation, even if we are lucky that the model can make roughly correct predictions, it will be difficult to interpret it and to judge the importance of each variable in explaining global variability.

Another problem that can arise is that of high dimensionality or, put more simply, the existence of a large number of independent variables in the model. In general, more complex models have a tendency to overfit the data: they fit well with known data but fail to generalize to new data.

This reaches the limit if the number of independent variables is close to the number of participants (sample size), in which case the method of least squares may fail to obtain the regression coefficients.

Well, to alleviate or solve these problems, we have a series of regularization or shrinkage techniques. Do you remember my dream and the couple that made me wake up? Well, yes, these two techniques are ridge regularization and lasso regularization.

Well, there is also a third one that shares the inheritance of these two: elastic net regularization. Let’s see what they consist of.

Regularization techniques

Regularization techniques help us to minimize the problems that we have described by placing restrictions on the regression coefficients of the model, which helps to control its complexity and prevent the coefficients from taking extreme values.

As we have already said, the two most common regularization techniques are ridge regression, also called L2 regularization, and lasso regression (Least Absolute Shrinkage and Selection Operator), or L1 regularization.

Both techniques are based on performing a modification of the ordinary least squares method by adding a penalty to the sum of the squares of the residuals. This results in a constraint or shrinkage on the model’s coefficients, which helps control its complexity, increases the stability of the coefficients, and avoids their extreme values.

Ridge regression

Ridge regression tries to minimize the prediction error by adding, to the original cost function (sum of the squares of the residuals), a penalty term proportional to the square of the coefficients. The formula would be the following:

Modified cost function = original cost + λ x Σ(coefficients²)

The value of the parameter λ is determined by the researcher and must be greater than zero, since if λ = 0 there is no difference with the multiple linear regression without regularization. The larger the value of λ, the greater the constraint that is placed on the model.

If we think about it a bit, squaring the coefficients penalizes the coefficients with a higher value more. The result when, for example, there is collinearity, is to bring the extreme values closer to intermediate values. The coefficients approach zero, but without reaching this value (unlike what happens in lasso regression, as we will see), so they do not disappear from the equation, although their impact on the global model does decrease.
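This shrinking without reaching zero can be seen in the simplest possible case. The following Python sketch assumes a single centred predictor and no intercept, for which minimizing Σ(y − βx)² + λβ² has the closed-form solution β = Σxy / (Σx² + λ). The data and the function name are made up for illustration.

```python
def ridge_coefficient(x, y, lam):
    """Minimizer of sum((y - b*x)^2) + lam * b^2 for one predictor, no intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)

# Made-up, centred data with a true slope of about 2
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.1, -1.9, 0.2, 2.1, 3.8]

for lam in (0, 10, 100, 1000):
    print(f"lambda = {lam:>4}: coefficient = {ridge_coefficient(x, y, lam):.3f}")
```

With λ = 0 we recover the ordinary least squares coefficient. As λ grows, the coefficient gets smaller and smaller, but the division never yields exactly zero.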

Lasso regression

Lasso regression also adds a penalty term to the original cost function, but in this case, it is proportional to the absolute value of the coefficients and not to their squared value:

Modified cost function = original cost + λ x Σ|coefficients|

Lasso regression restricts the magnitude of the regression coefficients, but, unlike ridge regression, they can reach zero, which means that they disappear from the model. This is very useful to reduce the complexity of the model and when it is necessary to reduce the dimensionality, which is nothing more than reducing the number of independent variables.

Again, when λ = 0, the result is equivalent to that of a linear ordinary least squares model. As the value of λ increases, the penalty becomes greater and more predictor variables can be excluded from the model.
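Why can the coefficients reach exactly zero? A minimal Python sketch for a single centred predictor without intercept shows it: minimizing Σ(y − βx)² + λ|β| gives the so-called soft-thresholding solution, in which Σxy is pulled towards zero by λ/2 and set to zero if it does not exceed that amount. The data and function names are made up for illustration.

```python
def soft_threshold(z, gamma):
    """Shrink z towards zero by gamma; return exactly 0 if |z| <= gamma."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def lasso_coefficient(x, y, lam):
    """Minimizer of sum((y - b*x)^2) + lam * |b| for one predictor, no intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return soft_threshold(sxy, lam / 2.0) / sxx

# Made-up, centred data with a true slope of about 2
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.1, -1.9, 0.2, 2.1, 3.8]

for lam in (0, 10, 40):
    print(f"lambda = {lam:>2}: coefficient = {lasso_coefficient(x, y, lam):.3f}")
```

Once λ is large enough, the coefficient is not just small but exactly zero, and the variable drops out of the model.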

Differences between ridge regression and lasso regression

Although both techniques decrease the magnitude of the regression coefficients, only lasso regression achieves that some are exactly zero, which makes it possible to select predictor variables. This is the greatest advantage of lasso regression when we work with scenarios in which not all the predictor variables are important to the model, and we want the less influential ones to be excluded.

For its part, ridge regression is more useful when there is collinearity between independent variables, since it reduces the influence of all of them at the same time and proportionally. We can also use it if we are faced with a situation in which losing independent variables is a luxury that we cannot afford.

However, we can find situations in which we are not clear about which of the two techniques to use or in which we want to take advantage of both. To achieve a balance between the properties of the two, we can resort to what is known as elastic net regression.

Elastic net regression: the midpoint

This technique combines the penalty of the L1 and L2 regularization techniques, which tries to take advantage of both and avoid some of their drawbacks. The formula would be the following:

New cost function = original cost + [(1 – α) x λ x Σ(coefficients²)] + (α x λ x Σ|coefficients|)

To understand it a little better, we can write it more simplified:

New cost function = original cost + (1 – α) x L2 penalty + α x L1 penalty

As you can see, we now have two parameters, α and λ. The values of α oscillate between 0 and 1. When α = 0, the technique would be equivalent to doing a ridge regression, while, when α = 1, it would work like a lasso regression. Intermediate values would mark an equilibrium position between the two regularization techniques.
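The role of α can be checked with a few lines. This Python sketch implements the penalty term exactly as written in the formula of this post, with made-up coefficients and an illustrative function name. Keep in mind that specific software may scale the terms differently; for instance, glmnet multiplies the squared term by 1/2.

```python
def elastic_net_penalty(coefficients, lam, alpha):
    """(1 - alpha) * lam * sum(b^2) + alpha * lam * sum(|b|)."""
    l2 = sum(b * b for b in coefficients)   # ridge part
    l1 = sum(abs(b) for b in coefficients)  # lasso part
    return (1.0 - alpha) * lam * l2 + alpha * lam * l1

# Made-up regression coefficients
coefficients = [3.0, -0.5, 0.0, 1.5]

for alpha in (0.0, 0.5, 1.0):
    penalty = elastic_net_penalty(coefficients, lam=2.0, alpha=alpha)
    print(f"alpha = {alpha}: penalty = {penalty}")
```

With α = 0 only the squared (ridge) term survives, with α = 1 only the absolute value (lasso) term, and in between we get a weighted mix of the two.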

So, it is easy to understand that we have to decide the values of α and λ to know what degree of penalty to apply and what predominance of the two techniques to use. This is usually done by testing many values and seeing which one gives us the best results, a process that is usually taken care of by the statistical programs that we use for these techniques.

A practical example

I think it’s time to give a practical example of everything we’ve talked about so far. It will allow us to understand it better and understand how it is done in practice. To do this, we are going to use a specific statistical software, the R program, and one of its most used data sets for teaching purposes, mtcars.

This data set contains information about different car models and their characteristics. It includes 32 rows (one for each car model) and 11 columns representing various car characteristics such as fuel efficiency (mpg), number of cylinders (cyl), displacement (disp), engine horsepower (hp), rear axle ratio (drat), weight (wt), quarter mile time (qsec), etc.

I’m not going to detail all the commands that need to be executed for this example, since this dance could go on until the wee hours of the morning, but if anyone is interested in reproducing the exercise, they can download the complete script at this link.

Let’s see the process step by step.

1. Loading function libraries and data preparation.

In R, the first thing we do is load the libraries that we are going to need for data processing, its graphic representation and the development of multiple linear regression models and regularization techniques.

Next, we load the data set. We prepare a vector with the dependent variable and a matrix with the independent ones, since we are going to need them to apply the functions that perform the regularization.

2. Linear regression model.

We begin by developing the linear regression model taking engine horsepower (hp) as the dependent variable and the rest of the variables as independent or predictors.

Focusing on the results that interest us for our example, the model is statistically significant (F = 19.5 with 10 and 21 df, p < 0.05) and explains 85% of the variance of the dependent variable (adjusted R² = 0.85). But the most interesting thing is to look at the attached figure with the regression coefficients.

You can see that there are some that draw attention due to their magnitude compared to the others, such as those of the wt and vs variables.

This may be a sign that collinearity exists. In addition, the model has a large number of predictor variables for the sample size (only 32 records), so the risk of overfitting is high. We decided to apply regularization techniques.

3. Ridge regression.

If we use R, we can perform regularization techniques with the glmnet() function, adjusting the value of its alpha parameter. To make it perform a ridge regression, we set the value of alpha = 0.

We have already said that we have to choose the value of λ that we are interested in applying. And how do we know? R helps us in this task. The glmnet() function does not calculate a single model, but many (100, if we don’t tell it otherwise) with different values of λ. Each of these models will have different regression coefficients, as you can see in the next figure.

Each line of the graph shows how the regression coefficient of each variable varies as a function of the value of λ. For example, the blue line shows the values for λ = 80 and the red line for λ = 40.

Cross-validation of these models (in glmnet, through the cv.glmnet() function) also estimates the prediction error of each of them and gives us two values of interest. One, the so-called minimum λ (λmin), which corresponds to the minimum value of the prediction error of the model. Two, the largest λ whose prediction error is still within one standard error of that minimum (λ1se).
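The selection rule behind these two values can be sketched in a few lines of Python. I am assuming here the usual definition in glmnet's cross-validation: λmin is the λ with the lowest mean cross-validated error, and λ1se is the largest λ whose error does not exceed that minimum plus one standard error. The numbers below are invented for illustration and do not come from the mtcars example.

```python
def select_lambdas(lambdas, mean_errors, standard_errors):
    """Return (lambda_min, lambda_1se) from cross-validation summaries."""
    best = min(range(len(lambdas)), key=lambda i: mean_errors[i])
    lambda_min = lambdas[best]
    # Largest lambda whose error stays within one standard error of the minimum
    limit = mean_errors[best] + standard_errors[best]
    lambda_1se = max(lam for lam, err in zip(lambdas, mean_errors) if err <= limit)
    return lambda_min, lambda_1se

# Invented cross-validation results: one entry per candidate lambda
lambdas         = [0.1, 1.0, 5.0, 17.0, 40.0, 80.0]
mean_errors     = [900.0, 820.0, 760.0, 700.0, 740.0, 950.0]
standard_errors = [60.0, 55.0, 50.0, 50.0, 52.0, 70.0]

lambda_min, lambda_1se = select_lambdas(lambdas, mean_errors, standard_errors)
print(f"lambda_min = {lambda_min}, lambda_1se = {lambda_1se}")
```

Since λ1se is never smaller than λmin, it applies at least as much penalty, which is why some people consider it the more conservative choice.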

We are going to take the value of λmin (some people think that with λ1se there is less risk of overfitting). In the third figure you can see the distribution of the error of the model as a function of the value of the natural logarithm of λ. It marks the optimal limits between the two vertical lines. We choose our value of λmin = 17.

Look at the fourth figure in the distribution of the coefficients of the model corresponding to λ = 17 (which appear next to those of the lasso regression model that we will develop later). We can verify how the dispersion is less and how the magnitude of the coefficients has also decreased (if you compare it with those of the linear regression model, keep in mind that the scales are different, since visually they may appear very similar).

Now we only have to extract the values of the regression coefficients (they are of no interest for what we are dealing with). The model explains 87% of the variance of the dependent variable (R² = 0.87). If we calculate its error by the method of least squares, it is 24.5.

4. Lasso regression.

We would repeat the entire process of the previous step, but setting alpha = 1 in the glmnet() function.

We obtain a value of λmin = 17. The distribution of the regression coefficients is shown in the figure above. It is striking how 6 of them have become 0, with which they would disappear from the model. This is, as we already know, one of the characteristics of this technique.

If we calculate the performance of the model, we see that it is like that of the ridge regression, with a value of R² = 0.87 and a least-squares error of 24.02. We will have to decide which of the two interests us more, considering that, in this case, the point that we could take advantage of is the reduction of dimensionality provided by lasso regression.

5. Elastic net regression.

In this case we have to test models with multiple values of α and λ. This is done in R using the cva.glmnet() function, which employs cross-validation techniques.

This function tests several values of α (11, if we don’t tell it otherwise) and, for each of them, multiple values of λ. Thus, in a similar way to the previous steps, we can choose the best value of α and, for this, the value of λ (in this case λ1se) that minimizes the model error, as you can see in the last figure.

We see that the optimal α value is 0.73. This marks the point between the two regularization techniques, L1 and L2, in which we will perform the best fit.

We are leaving…

And with this we are going to finish for today.

We have seen how multiple regression regularization techniques can be very useful when we have collinearity or overfitting problems. In addition, they can be used to select the independent variables and reduce multidimensionality, achieving more robust and easy-to-interpret models.

Before we say goodbye, I want to clarify that everything we have said is also valid for multiple logistic regression. The way to do it is similar, although there may be some small difference in the use of the statistical program that we use.

In the case of logistic regression, regularization is also useful when what is called separation or quasi-separation occurs, which takes place when one predictor, or a combination of them, perfectly (or almost perfectly) separates the two categories of the dependent variable in the data available to develop the model. But that is another story…

This post, An epic dance, was published in Science without sense...double nonsense by Manuel Molina.

An intruder from another world

Manuel Molina — Mon, 06 Nov 2023 09:22:21 +0000

F1-score.

The F1-score, also called F-score or F-measure, is an estimator of the classification capacity of a test that is frequently used in data science and artificial intelligence algorithms and that can be useful for evaluation of diagnostic tests. It is the harmonic mean of sensitivity and positive predictive value, so it weights the value of both in a single estimator.

Have you ever felt that a stranger has burst into your world, an intruder who, although seemingly alien, seems destined to be seen with some frequency? Data science, a kingdom in constant expansion, has brought with it a new tool, the F1-score (or F1-measure), unknown to many, at least among those addicted to the world of medicine. Although it may sound like something out of a science fiction movie and seem as mysterious as a UFO, the F1-score has decided to land in the field of medical publications, where it seems to have found an unexpected home.

So, if you’re ready to discover how this interloper from another world can improve the way we evaluate our diagnostic tests, join me on this journey through the surprising intersection between data science and medicine.

Our usual metrics

Although there are multiple tools for evaluating diagnostic tests, the four most loved by the public are the pair formed by sensitivity and specificity, and the two predictive values, positive and negative. Let’s remember what they consist of.

When we want to evaluate a diagnostic test, the usual thing is to compare it with another test that we consider the reference standard, the gold standard. The positives and negatives of the two tests are represented in a contingency table and, using the values in the cells, we make our calculations.

You can see, in the next table, a fictitious example that compares two tests to diagnose that terrible disease that is fildulastrosis. On the one hand, magnetic fildulastrin (MF), our reference standard. On the other hand, a new but very promising test, the green corpuscle cell (GC).

We can see that our sample is made up of 2000 subjects, 100 of whom suffer from fildulastrosis and 1900 of whom are healthy.

Of the 100 patients, 92 have a positive GC test. They are the true positives (TP). Furthermore, of the 1900 healthy people, 1815 have negative GC. They are the true negatives (TN).

But we see that the GC test misclassifies some people. Eight of the sick people test negative (false negatives, FN) and 85 of the healthy people test positive (false positives, FP). With these four cells we build our indicators to evaluate the test under study (the GC, in our example).

Sensitivity (Se) is the ability to correctly classify sick people. It is the quotient between the TP and the total number of patients (TP + FN). In our example, 0.92.

Specificity (Sp) is the ability to correctly classify healthy people. It is the quotient between the TN and the total number of healthy people (FP + TN). In our example, 0.95.

Let’s go with the predictive values. The positive predictive value (PPV) is the proportion of positives who are sick. It is the quotient between the TP and all the positives (TP + FP). In our example, 0.52. The negative predictive value (NPV) is the proportion of negatives who are healthy. It is the quotient between TN and all negatives (TN + FN). In our example, 0.99.

To complete our analysis of the table, let’s calculate the prevalence of disease. We know that it is the total number of sick patients divided by the total number of participants. In our example, 0.05 (or 5%, whichever we like best).
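For those who like to check the arithmetic, here is a minimal Python sketch of the calculations above. The four cell counts are the ones from our example; the function name `diagnostic_metrics` is just an illustrative choice. Note that the unrounded values of Sp (about 0.955) and NPV (about 0.996) differ slightly from the two-decimal figures quoted in the text.

```python
# Cells of the 2x2 table from the fildulastrosis example (GC test vs. MF reference).
tp, fn = 92, 8       # sick people: GC positive / GC negative
fp, tn = 85, 1815    # healthy people: GC positive / GC negative


def diagnostic_metrics(tp, fn, fp, tn):
    """Return the usual indicators of a diagnostic test as a dictionary."""
    total = tp + fn + fp + tn
    return {
        "Se": tp / (tp + fn),              # sick people correctly classified
        "Sp": tn / (tn + fp),              # healthy people correctly classified
        "PPV": tp / (tp + fp),             # positives who are really sick
        "NPV": tn / (tn + fn),             # negatives who are really healthy
        "prevalence": (tp + fn) / total,   # sick people among all participants
    }


metrics = diagnostic_metrics(tp, fn, fp, tn)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```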

The problem of lack of balance

Given the example table and the calculations we have done, do you think that the GC test is a powerful diagnostic test?

Those of you who are more diligent will tell me that this is difficult to answer without knowing the likelihood ratios of the test and, without a doubt, you will be right. Let’s calculate them.

The positive likelihood ratio (PLR) tells us how much more likely it is to have a positive result in a sick person than in a healthy person. We know that the probability of testing positive in a sick person is Se and that the probability of testing positive in a healthy person is that of misclassifying that healthy person, that is, the complement of Sp. If we calculate Se / (1 – Sp) with the rounded values, we get 0.92 / 0.05 = 18.4 (about 20.6 if we do not round Sp).

The negative likelihood ratio (NLR) tells us how much more likely it is to find a negative result in a sick person than in a healthy person. The probability of finding a negative in a sick person is that of not classifying him correctly, that is, the complement of Se. The probability that a healthy person will test negative is Sp. If we calculate (1 – Se) / Sp, we get 0.08.
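The two likelihood ratios can be reproduced with a few lines of Python. This is only a sketch using the cell counts of our example. It also shows how much the PLR moves depending on whether we round Sp to 0.95 before dividing.

```python
# Cell counts from the fildulastrosis example.
tp, fn, fp, tn = 92, 8, 85, 1815

se = tp / (tp + fn)   # sensitivity
sp = tn / (tn + fp)   # specificity (about 0.955 before rounding)

plr = se / (1 - sp)   # positive likelihood ratio, unrounded inputs
nlr = (1 - se) / sp   # negative likelihood ratio, unrounded inputs

# Same PLR, but using the two-decimal values quoted in the text.
plr_rounded = 0.92 / (1 - 0.95)

print(f"PLR (unrounded Sp): {plr:.2f}")
print(f"PLR (Sp taken as 0.95): {plr_rounded:.2f}")
print(f"NLR: {nlr:.3f}")
```

Either way the PLR is well above 10 and the NLR is below 0.1, so the conclusion about the test does not change.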

A PLR > 10 indicates that the test is very powerful for diagnosis when it gives a positive result. Similarly, an NLR < 0.1 tells us that the test is very powerful in ruling out the disease when it is negative. However, we see that, although Se and Sp have very good values, the PPV is quite poor (0.52). Why is that?

Indeed, the culprit for this low predictive value is the low prevalence of the disease. For the same values of Se and Sp (or of the likelihood ratios), the PPV decreases as the prevalence of the disease falls. In our example, furthermore, this difficulty in correctly classifying patients is aggravated by the great difference between the proportions of sick and healthy people. Think that if, instead of doing the diagnostic test, we always say that the person is healthy, we will be right 95% of the time. The test does not have a simple task, but it is a matter of chance and Bayes’ theorem.
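To see Bayes’ theorem at work, the following sketch recalculates the PPV of the same test (same Se and Sp) for several prevalences. The function name and the list of prevalences are illustrative assumptions; only the 5% value corresponds to our example.

```python
def ppv_from_prevalence(se, sp, prevalence):
    """PPV obtained with Bayes' theorem from Se, Sp and the disease prevalence."""
    true_positives = se * prevalence                # P(positive and sick)
    false_positives = (1 - sp) * (1 - prevalence)   # P(positive and healthy)
    return true_positives / (true_positives + false_positives)


# Se and Sp of the GC test, taken from the cell counts of the example.
se = 92 / 100
sp = 1815 / 1900

# Hypothetical prevalences, chosen only to illustrate the trend.
for prevalence in (0.01, 0.05, 0.20, 0.50):
    ppv = ppv_from_prevalence(se, sp, prevalence)
    print(f"prevalence = {prevalence:.2f} -> PPV = {ppv:.2f}")
```

With the prevalence of 5% we recover the PPV of our table. As the prevalence goes up, the PPV rises with it, even though the test itself has not changed.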

Diagnostic haggling

At this point, if we want to assess the usefulness of the test to help us determine whether or not a specific patient has the disease, we will have to weigh up, fundamentally, its Se and its PPV.

The Se tells us the probability that a patient will have a positive result, but once we already know that he is sick. Seen another way, it gives us an idea of the patients that we will be able to diagnose with the test. If the Se is very high, few of the patients we test will be left without a positive diagnosis.

The PPV does not say a very different thing: the probability that a positive person is sick. If the PPV is low, there will be healthy people who will be diagnosed as sick (FP), more so the lower the PPV.

The problem is that, in many situations, especially when the distributions of test result values between healthy and sick are not well separated, when one of the two improves, the other will worsen, and vice versa. Which one interests us more?

If the disease is very serious, we will be interested in a high Se so that no patient goes undiagnosed. The price that will have to be paid will be a more or less high number of FP.

On the contrary, imagine a disease that is not so serious and whose treatment is expensive or annoying. We will prefer not to have FP, even if we miss some undiagnosed patients. In this case, we will be interested in having a better PPV, even if the Se is lower.

In any case, it would be good for us to have a single parameter that summarizes the overall behavior of the test in terms of Se and PPV, especially if we are trying to choose which one may be most useful among different options.

This is where our otherworldly intruder comes to our aid.

An intruder comes to our aid: F1-score

The F1-score is the harmonic mean of the Se and the PPV, so we can define it according to the following formula:

F1 = 2 / (Se⁻¹ + PPV⁻¹)

This formula is usually transformed into its friendlier version, which is the following:

F1 = 2 x Se x PPV / (Se + PPV)

The possible values of the F1-score range between 0 and 1. A perfect test (a perfect classifier, as we would say in data science) has an F1-score = 1 (both its Se and its PPV will be worth 1). At the other extreme, the minimum possible value is 0, which will occur when Se and/or PPV are 0.

In this way, the F1-score gives a global idea of the performance of the test based on its Se and its PPV. In our example, the F1-score = 0.66, which would indicate that the test has a moderate capacity to discriminate between healthy and sick people (what its likelihood ratios had already announced).
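As a quick check, this sketch computes the F1-score of our example in three equivalent ways: with the harmonic mean, with the friendlier product formula and directly from the cells of the table. The third expression, 2TP / (2TP + FP + FN), does not appear above, but it is the same quantity written in terms of counts. The value comes out at about 0.664 before rounding.

```python
# Cell counts from the fildulastrosis example (TN is not needed for F1).
tp, fn, fp = 92, 8, 85

se = tp / (tp + fn)    # sensitivity (recall, in data science jargon)
ppv = tp / (tp + fp)   # positive predictive value (precision)

f1_harmonic = 2 / (1 / se + 1 / ppv)       # harmonic mean of Se and PPV
f1_product = 2 * se * ppv / (se + ppv)     # the "friendlier" version
f1_counts = 2 * tp / (2 * tp + fp + fn)    # same value, straight from the cells

print(f"F1 (harmonic mean): {f1_harmonic:.3f}")
print(f"F1 (product form):  {f1_product:.3f}")
print(f"F1 (from counts):   {f1_counts:.3f}")
```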

Let’s imagine that the result of our green corpuscle test is a continuous value and that we have to define the cut-off point to distinguish between positives and negatives. In this case, we can increase the Se by lowering the cut-off point, but we will have many false positives (the PPV will be lower). On the contrary, if we increase the cut-off point we will improve the PPV of the test, but we will probably begin to miss undiagnosed patients (the Se will drop).

We can use the value of the F1-score as we evaluate the different cut-off points. For example, if we are interested in increasing the PPV, we can raise the cut-off point until the moment when the value of the F1-score begins to decrease noticeably. This will probably mean that we have sacrificed too much of the Se of the test in our efforts to improve its PPV, so the number of patients left undiagnosed may be higher than is convenient for us (although, admittedly, the number of false positives will be lower).
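The idea of sweeping cut-off points can be sketched as follows. Be careful: the GC values below are completely made up for illustration (they are not data from our example), as are the candidate cut-offs and the function name `evaluate_cutoff`. A result is counted as positive when it is greater than or equal to the cut-off.

```python
# Hypothetical continuous GC results (arbitrary units); NOT data from the post.
sick = [4.1, 5.0, 5.6, 6.2, 6.8, 7.4, 8.0, 8.9]
healthy = [1.0, 1.8, 2.2, 2.5, 3.1, 3.6, 4.0, 4.4, 4.9, 5.3, 5.9, 6.5]


def evaluate_cutoff(cutoff, sick, healthy):
    """Return (Se, PPV, F1) when values >= cutoff are considered positive."""
    tp = sum(value >= cutoff for value in sick)
    fp = sum(value >= cutoff for value in healthy)
    fn = len(sick) - tp
    se = tp / len(sick)
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return se, ppv, f1


# Hypothetical candidate cut-off points, from more lenient to more strict.
cutoffs = [3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
results = {cutoff: evaluate_cutoff(cutoff, sick, healthy) for cutoff in cutoffs}

for cutoff, (se, ppv, f1) in results.items():
    print(f"cut-off = {cutoff:.1f}  Se = {se:.2f}  PPV = {ppv:.2f}  F1 = {f1:.2f}")

# Cut-off with the highest F1-score among the candidates.
best_cutoff = max(results, key=lambda cutoff: results[cutoff][2])
print(f"Best F1 at cut-off {best_cutoff:.1f}")
```

With these invented values, raising the cut-off improves the PPV at every step, but the F1-score starts to fall once the loss of Se outweighs that gain.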

One thing we must keep in mind is that the F1-score, since it depends directly on the PPV, shares with it the defect of depending on the prevalence of the disease. Logically, the same diagnostic test performed in two different populations will show a higher F1-score value in the population in which the prevalence of the disease is higher.
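A small sketch can make this dependence visible. It keeps the Se and Sp of the GC test fixed and recalculates the F1-score for two prevalences: the 5% of our example and a hypothetical 30% chosen only for comparison. The function name is an illustrative assumption.

```python
def f1_from_prevalence(se, sp, prevalence):
    """F1-score of a test with fixed Se and Sp in a population with the given prevalence."""
    # PPV from Bayes' theorem, then the usual F1 formula.
    ppv = se * prevalence / (se * prevalence + (1 - sp) * (1 - prevalence))
    return 2 * se * ppv / (se + ppv)


# Se and Sp of the GC test, taken from the cell counts of the example.
se = 92 / 100
sp = 1815 / 1900

f1_low = f1_from_prevalence(se, sp, 0.05)    # prevalence of the example
f1_high = f1_from_prevalence(se, sp, 0.30)   # hypothetical second population

print(f"F1 with 5% prevalence:  {f1_low:.2f}")
print(f"F1 with 30% prevalence: {f1_high:.2f}")
```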

For this reason, if we want to compare tests between different populations, it may be better to use other estimators that are not affected by prevalence (as much), such as the likelihood ratios or the area under the ROC curves.

The intruder has a family

Until now we have talked about the F1-score, but it would be more correct to talk about the F-score (without the 1) when we refer to the estimator in a general way.

We have already seen that the F-score represents a balance between Se and PPV. The most common situation is an even balance between the two parameters, but there may be times when we are interested in giving priority to one over the other.

Thus, we find a whole family of Fβ measures, β being the parameter that allows us to choose the balance between Se and PPV that interests us most.

Thus, we can understand the Fβ-score as a generalization of the F-score in which the calculation of the harmonic mean of Se and PPV is controlled by this parameter β. We can see how the equation for the calculation would look:

Fβ = ((1 + β²) x Se x PPV) / ((β² x PPV) + Se)

The neutral value in this balance is the one that corresponds to β = 1. In that case, the previous equation remains as the unmodified harmonic mean and we obtain the F measure that balances Se and PPV.

Although, in theory, we could choose the value of β that we wanted, in practice only three of them are usually used:

– β = 0.5. It gives more importance to the PPV than to the Se, so it will help us to minimize the number of false positives. We will use it to establish the cut-off point when false positives are more harmful than false negatives (undiagnosed patients).

– β = 1. It is the F1-score that we have talked about before. It gives the same weight to Se and PPV (or to false negatives and false positives).

– β = 2. This value decreases the weight of the PPV and increases that of the Se. That is, it is preferred to minimize false negatives (undiagnosed patients) even if false positives increase.
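The three usual values of β can be compared with a short sketch. It uses the standard definition of the measure, Fβ = (1 + β²) x Se x PPV / (β² x PPV + Se), applied to the Se and PPV of our example. The function name `f_beta` is an illustrative choice.

```python
def f_beta(se, ppv, beta):
    """Weighted harmonic mean of Se and PPV; beta > 1 favors Se, beta < 1 favors PPV."""
    beta_sq = beta ** 2
    denominator = beta_sq * ppv + se
    if denominator == 0:
        return 0.0   # both Se and PPV are 0
    return (1 + beta_sq) * se * ppv / denominator


# Se and PPV of the GC test, from the cell counts of the example.
se = 92 / (92 + 8)
ppv = 92 / (92 + 85)

scores = {beta: f_beta(se, ppv, beta) for beta in (0.5, 1, 2)}
for beta, score in scores.items():
    print(f"beta = {beta}: F = {score:.3f}")
```

Since in our example the Se (0.92) is much better than the PPV (0.52), the score goes up as β gives more weight to the Se and goes down when the PPV is favored.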

As you can see, it is a continuous haggling. When it comes to diagnostic tests, as in life, you can’t always have everything and you have to choose what you prefer to prioritize.

We’re leaving…

And with this we are going to finish this post.

You see that the realm of diagnostic tests is wide and that there is room for many different estimators to assess the performance capacity of the tests.

This is because no test is perfect in everything, so most of the time we will have to choose whether to prioritize avoiding false positives, avoiding false negatives or whatever interests us most.

We have mentioned, although only in passing, that this problem can increase when there is a great imbalance between the proportion of sick and healthy people (very low disease prevalence). In these cases, the task of the diagnostic test becomes complicated and it may be difficult to choose the most appropriate cut-off point for our needs. In addition to the F-score, we have some other measure, also coming from the world of data science, that helps us with the haggling between Se and PPV.

I am referring specifically to the so-called enrichment of precision with respect to recall (or of PPV with respect to Se, in our usual language), closely related to the concepts of pre-test and post-test probability. But that is another story…

This post, An intruder from another world, was published in Science without sense...double nonsense by Manuel Molina.
