The entry Its bark is worse than its bite was first published in Science without sense...double nonsense.

One of my neighbors has a dog that barks the whole damn day. It is the typical dog so tiny it is barely a palm's height off the ground, which does not prevent it from barking at an incredibly loud volume, not to mention the unpleasant pitch of its “voice”.

That is what usually happens with these dwarf dogs: they bark at you as if possessed by demons as soon as they see you but, according to popular wisdom, you can rest easy, because the more they bark at you, the less likely they are to bite you. Come to think of it, I would almost say that there is an inverse correlation between the amount of howling and the probability of being bitten by one of these little animals.

And since we have mentioned the term correlation, we are going to talk about this concept and how to measure it.

**What does correlation mean?**

The Cambridge English Dictionary says that correlation is a connection or relationship between two or more facts, numbers, etc. Another source of wisdom, Wikipedia, says that, when it comes to probability and statistics, correlation is any statistical relationship between two random variables or bivariate data.

What does it mean, then, that two variables are correlated? Well, a much simpler thing than it may seem: that the values of one of the variables change systematically when changes occur in the other. Said more simply, given two variables A and B, whenever the value of A changes in a certain direction, the values of B will also change in a certain direction, which may be the same or the opposite one.

And that is what correlation means. Only that: how one variable changes with the changes of the other. This does not mean at all that there is a causal relationship between the two variables, which is an erroneous assumption made with some frequency. So common is this fallacy that it even has a nice Latin name, *cum hoc ergo propter hoc*, which can be summed up for less educated minds as “correlation does not imply causation.” The fact that two things vary together does not mean that one is the cause of the other.

Another common mistake is to confuse correlation with regression. Actually, they are two closely related terms. While the first, correlation, only tells us whether there is a relationship between the two variables, regression analysis goes one step further and aims to find a model that allows us to predict the value of one of the variables (the dependent variable) based on the value that the other variable takes (which we will call the independent or explanatory variable). In many cases, studying whether there is correlation is the step prior to generating the regression model.

**Correlation coefficients**

Well, everyone knows the human fondness for measuring and quantifying everything, so it can surprise no one that the so-called correlation coefficients were invented, of which there is a more or less numerous family.

To calculate the correlation coefficient, we therefore need a parameter that allows us to quantify this relationship. For this, we have the covariance, which indicates the degree of common variation of two random variables.

The problem is that the value of the covariance depends on the measurement scales of the variables, which prevents us from making direct comparisons between different pairs of variables. To avoid this problem, we resort to a solution that is already known to us, which is none other than standardization. The product of standardizing the covariance will be the correlation coefficients.

All these coefficients have something in common: their value ranges from -1 to 1. The farther the value is from 0, the greater the strength of the relationship, which will be practically perfect when it reaches -1 or 1. At 0, which is the null value, in theory there will be no correlation between the two variables.

The sign of the value of the correlation coefficient will indicate the other quality of the relationship between the two variables: the direction. When the sign is positive it will mean that the correlation is direct: when one increases or decreases, the other does so in the same way. If the sign is negative, the correlation will be inverse: when changing one variable, the other will do it in the opposite direction (if one increases, the other decreases, and vice versa).

So far we have seen two of the characteristics of the correlation between two variables: strength and direction. There is a third characteristic which depends on the type of line that defines the best fit model. In this post we are going to talk only about the simplest form, which is none other than linear correlation, in which the fit line is a straight line, but you should know that there are other non-linear fits.

**Pearson’s correlation coefficient**

We have already said that there is a whole series of correlation coefficients that we can calculate depending on the type of variables that we want to study and the probability distribution they follow in the population from which the sample comes.

Pearson’s correlation coefficient, also called the linear product-moment correlation coefficient, is undoubtedly the most famous of this entire family.

As we have already said, it is nothing more than the standardized covariance. There are several ways to calculate it but, since all roads lead to Rome, I will not resist putting the formula:

r_{xy} = S_{xy} / (S_{x}S_{y})

As we can see, the covariance (S_{xy}, in the numerator) is standardized by dividing it by the product of the standard deviations of the two variables (in the denominator).

In order to use Pearson’s correlation coefficient, both variables must be quantitative, be linearly correlated, be normally distributed in the population, and the assumption of homoskedasticity must be met, which means that the variance of the Y variable must be constant along the values of the X variable. An easy way to check this last assumption is to draw the scatterplot and see if the cloud is scattered similarly along the values of the X variable.

One factor to keep in mind is that the value of this coefficient can be biased by the existence of extreme values (outliers).
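As an illustration of both the calculation and this sensitivity to outliers, here is a minimal sketch in Python (assuming NumPy and SciPy are available; the data are, of course, invented):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two linearly related quantitative variables with some noise
x = rng.normal(50, 10, size=100)
y = 2 * x + rng.normal(0, 5, size=100)

r, p = stats.pearsonr(x, y)
print(f"Pearson's r = {r:.3f} (p = {p:.3g})")

# A single extreme value (outlier) can bias the coefficient
x_out = np.append(x, 200)
y_out = np.append(y, -100)
r_out, _ = stats.pearsonr(x_out, y_out)
print(f"Pearson's r with one outlier = {r_out:.3f}")
```

Drawing the scatterplot of these data would also show, as noted above, whether the homoskedasticity assumption looks reasonable.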

**Spearman’s correlation coefficient**

The non-parametric equivalent of Pearson’s coefficient is Spearman’s correlation coefficient. As occurs with other non-parametric techniques, it does not use the raw data directly for its calculation, but their transformation into ranks.

Thus, it is used when the variables are ordinal or when they are quantitative, but they do not meet the normality assumption and can be transformed into ranks.

Otherwise, its interpretation is similar to that of the rest of the coefficients. Furthermore, because it is calculated with ranks, it is less sensitive to outliers than Pearson’s coefficient.

Another advantage over Pearson’s coefficient is that it only requires the correlation between the two variables to be monotonic, which means that when one variable increases, the other always changes in the same direction (always increasing or always decreasing). This allows it to be used not only when the relationship is linear, but also in cases of logistic and exponential relationships.
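To see why this matters, a small sketch (again assuming SciPy) with an exponential, and therefore monotonic but non-linear, relationship:

```python
import numpy as np
from scipy import stats

# An exponential (monotonic but non-linear) relationship
x = np.arange(1, 21, dtype=float)
y = np.exp(0.5 * x)

r_pearson, _ = stats.pearsonr(x, y)   # measures only the linear part
rho, _ = stats.spearmanr(x, y)        # works on ranks, so monotonicity is enough

print(f"Pearson's r    = {r_pearson:.3f}")  # underestimates the association
print(f"Spearman's rho = {rho:.3f}")        # essentially perfect
```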

**Kendall’s tau coefficient**

Another coefficient that also uses the ranks of the variables is Kendall’s τ coefficient. Being a non-parametric coefficient, it is also an alternative to Pearson’s coefficient when the assumption of normality is not fulfilled, and it is more advisable than Spearman’s when the sample is small and when there are many tied ranks, which means that many data points occupy the same position in the ranking.
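A minimal sketch (assuming SciPy; the two raters and their 1 to 4 scale are invented for the example) comparing the two rank-based coefficients on data with many tied ranks:

```python
import numpy as np
from scipy import stats

# Ordinal scores with many tied ranks (e.g. two raters using a 1-4 scale)
rater1 = np.array([1, 1, 2, 2, 2, 3, 3, 3, 4, 4])
rater2 = np.array([1, 2, 2, 2, 3, 3, 3, 4, 4, 4])

tau, p = stats.kendalltau(rater1, rater2)  # tau-b, which corrects for ties
rho, _ = stats.spearmanr(rater1, rater2)

print(f"Kendall's tau  = {tau:.3f}")
print(f"Spearman's rho = {rho:.3f}")
```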

**But there is still more…**

Although there are some more, I am only going to refer specifically to three of them, useful for studying quantitative variables:

**Partial correlation coefficient.** This coefficient studies the relationship between two variables while taking into account, and eliminating, the influence of other variables.

The simplest case is to study two variables X1 and X2, eliminating the effect of a third variable X3. In this case, if the correlations between X1 and X3 and between X2 and X3 are equal to zero, the partial correlation coefficient takes the same value as Pearson’s coefficient between X1 and X2.
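This special case can be checked with the standard first-order formula for the partial correlation coefficient; a minimal sketch in Python (the coefficient values are invented for the example):

```python
import math

def partial_corr(r12, r13, r23):
    """First-order partial correlation between X1 and X2, controlling for X3."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13**2) * (1 - r23**2))

# When X3 is uncorrelated with both X1 and X2, the partial coefficient
# equals the plain Pearson's coefficient between X1 and X2
print(partial_corr(0.5, 0.0, 0.0))  # 0.5

# When X3 correlates with both, part of the apparent X1-X2 association
# is explained away by X3, and the partial coefficient is lower
print(partial_corr(0.5, 0.6, 0.6))
```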

In the event that we want to control for more variables, the formula, which I do not intend to write, becomes more complex, so it is best to let a statistical program calculate it.

If the value of the partial coefficient is less than that of the Pearson’s coefficient, it means that the correlation between both variables is partially due to the other variables that we are controlling. When the partial coefficient is greater than Pearson’s, the variables that are controlled mask the relationship between the two variables of interest.

**Semi-partial correlation coefficient.** It is similar to the previous one, but the semi-partial coefficient allows us to evaluate the association between two variables while controlling the effect of a third on only one of the two variables of interest (not on both, as the partial coefficient does).

**Multiple correlation coefficient.** This allows us to know the correlation between one variable and a set of two or more variables, all of them quantitative.

And I think we have enough with this, for now. There are a few more coefficients that are useful in special situations. Anyone who is curious can look them up in a thick statistics book and be confident of finding them.

**Significance and interpretation**

We already said at the beginning that the value of these coefficients could range from -1 to 1, with -1 being the perfect negative correlation and 1 being the perfect positive correlation.

We can make a parallel between the value of the coefficient and the strength of the association, which is nothing but the effect size. Thus, values of 0 indicate null association, 0.1 small association, 0.3 medium, 0.5 moderate, 0.7 high and 0.9 very high association.

To finish, it must be said that, in order to give value to the coefficient, it must be statistically significant. You know that we always work with samples, but what we are interested in is inferring the value in the population, so we have to calculate the confidence interval of the coefficient that we have used. If this interval includes the null value (zero) or if the program calculates the value of p and it is greater than 0.05, it will not make sense to value the coefficient, even if it is close to -1 or 1.
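One common way to obtain this confidence interval is Fisher's z transformation; a minimal sketch (the sample sizes are invented for the example):

```python
import math

def pearson_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for Pearson's r via Fisher's z."""
    z = math.atanh(r)              # transform r to an approximately normal scale
    se = 1 / math.sqrt(n - 3)      # standard error on the z scale
    lo, hi = z - z_crit * se, z + z_crit * se
    return math.tanh(lo), math.tanh(hi)  # back-transform to the r scale

# r = 0.5 from a sample of 30: the interval excludes zero,
# so the coefficient is statistically significant at the 5% level
ci_30 = pearson_ci(0.5, 30)
print(ci_30)

# The same r = 0.5 from only 10 subjects: the interval includes zero,
# so it makes no sense to give value to the coefficient
ci_10 = pearson_ci(0.5, 10)
print(ci_10)
```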

**We are leaving…**

And here we leave it for today. We have not discussed anything about using Pearson’s correlation coefficient to assess the precision of a diagnostic test. And we have not said anything because we should not use this coefficient for that purpose. Pearson’s coefficient is highly dependent on intra-subject variability and can give a very high value when one of the measurements is systematically greater than the other, even though there is not good concordance between the two. For this, it is much more appropriate to use the intraclass correlation coefficient, a better estimator of the concordance between repeated measures. But that is another story…


The entry Exoteric or esoteric? was first published in Science without sense...double nonsense.

There are days when I feel biblical. Other days I feel mythological. Today I feel philosophical and even a little Masonic.

And the reason is that the other day I got to wondering what the difference between exoteric and esoteric was, so I consulted that friend of us all who knows so much about everything, our friend Google. It kindly explained to me that both terms are similar and usually describe two aspects of the same doctrine. Exotericism refers to knowledge that is not limited to a certain group within the community that deals with that doctrine, and that can be disclosed and made available to anyone. On the other hand, esotericism refers to knowledge that belongs to a deeper and higher order, only available to a privileged few specially educated to understand it.

And now, once the difference is understood, I ask you a slightly tricky question: is multivariate statistics exoteric or esoteric? The answer, of course, will depend on each one, but we are going to see if it is true that both concepts are not contradictory, but complementary, and we can strike a balance between both of them, at least in understanding the usefulness of multivariate techniques.

We are more used to working with univariate or bivariate statistical techniques, which allow us to study jointly a maximum of two characteristics of the individuals in a population to detect relationships between them.

However, with the mathematical development and, above all, the calculating power of modern computers, multivariate statistical techniques are becoming increasingly important.

We can define multivariate analysis as the set of statistical procedures that simultaneously study various characteristics of the same subject or entity, in order to analyze the interrelation that may exist among all the random variables that these characteristics represent. Let me insist on the two aspects of these techniques: the multiplicity of variables and the study of their possible interrelations.

There are many multivariate analysis techniques, ranging from purely descriptive methods to those that use statistical inference techniques to draw conclusions from the data and that are able to develop models that are not obvious to the naked eye by observing the data obtained. They will also allow us to develop prediction models of various variables and establish relationships among them.

Some of these techniques are the extension of their equivalents with two variables, one dependent and the other independent or explanatory. Others have nothing similar in two-dimensional statistics.

Some authors classify these techniques into three broad groups: full-rank and non-full-rank models, techniques to reduce dimensionality, and classification and discrimination methods. Do not worry if this seems gibberish, we will try to simplify it a bit.

To be able to talk about the **FULL-RANK AND NON-FULL-RANK TECHNIQUES**, I think it will be necessary to explain first what rank we are talking about.

Although we are not going to go into the subject in depth, all these methods involve matrix calculation techniques. You know, matrices: two-dimensional sets of numbers (the ones we are going to discuss here) that form rows and columns and that can be added and multiplied together, among other operations.

The rank of a matrix is defined as the number of rows or columns that are linearly independent (it does not matter whether we count rows or columns, the number is the same). The rank can range from 0 to the smaller of the number of rows and columns. For example, a 2-row by 3-column matrix may have a rank from 0 to 2. A 5-row by 3-column matrix may have a rank from 0 to 3. Now imagine a matrix with two rows, the first 1 2 3 and the second 3 6 9 (it has 3 columns). Its maximum rank would be 2 (the smaller of the number of rows and columns) but, if you look closely, the second row is the first one multiplied by 3, so there is only one linearly independent row, and therefore its rank is equal to 1.

Well, a matrix is said to be a full-rank one when its rank is the largest possible for a matrix of the same dimensions. The third example that I have given you would be a non-full-rank matrix, since a 2×3 matrix has a maximum rank of 2 and that of our matrix is 1.
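If you do not want to hunt for linearly dependent rows by eye, a minimal sketch with NumPy, using the matrix of the example:

```python
import numpy as np

# The 2x3 matrix from the example: the second row is the first multiplied by 3
A = np.array([[1, 2, 3],
              [3, 6, 9]])
print(np.linalg.matrix_rank(A))  # 1: only one linearly independent row

# A full-rank 2x3 matrix for comparison: its rank equals min(2, 3) = 2
B = np.array([[1, 2, 3],
              [0, 1, 4]])
print(np.linalg.matrix_rank(B))  # 2
```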

Once this is understood, we go with the full-rank and non-full-rank methods.

The first one we will look at is the multiple linear regression model. This model, an extension of the simple linear regression model, is used when we have a dependent variable and a series of explanatory variables, all of them quantitative, when they can be linearly related and when the explanatory variables form a full-rank matrix.

Like simple regression, this technique allows us to predict changes in the dependent variable based on the values of the explanatory variables. The formula is like that of simple regression, but including all the explanatory variables, so I am not going to bore you with it. However, since I have punished you with ranks and matrices, let me tell you that, in matrix terms, it can be expressed as follows:

Y = Xβ + e

where X is the full-rank matrix of the explanatory variables. The equation includes an error term that is justified by the possible omission from the model of relevant explanatory variables, or by measurement errors.
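A minimal sketch of this model with NumPy (the coefficients and data are invented for the example): we build a full-rank matrix X, generate Y = Xβ + e with known coefficients, and recover them by least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two quantitative explanatory variables, plus a column of ones for the intercept
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])  # full-rank design matrix

# Y = Xb + e, with "true" coefficients 1, 2 and -3 and a random error term
y = 1 + 2 * x1 - 3 * x2 + rng.normal(scale=0.5, size=n)

# Least-squares estimate of the coefficients
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta, 2))
```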

To complicate matters, imagine that we were to simultaneously correlate several independent variables with several dependent ones. In this case, multiple regression does not help us, and we would have to resort to the canonical correlation technique, which allows us to make predictions of various dependent variables based on the values of several explanatory variables.

If you remember bivariate statistics, analysis of variance (ANOVA) is the technique that allows us to study the effect on a quantitative dependent variable of explanatory variables that are categories of a qualitative variable (we call these categories factors). In this case, since each observation can belong to one and only one of the factors of the explanatory variable, matrix X will be a non-full-rank one.

A slightly more complicated situation occurs when the explanatory variables are a quantitative variable and one or more factors of a qualitative variable. On these occasions we resort to a generalized linear model called the analysis of covariance (ANCOVA).

Transferring what we have just said to the realm of multivariate statistics, we would have to use the extension of these techniques. The extension of ANOVA when there is more than one dependent variable that cannot be combined into one is the multivariate analysis of variance (MANOVA). If factors of qualitative variables coexist with quantitative variables, we will resort to the multivariate analysis of covariance (MANCOVA).

The second group of multivariate techniques comprises those that attempt the **REDUCTION OF DIMENSIONALITY**.

Sometimes we must handle such a high number of variables that it is complex to organize them and reach any useful conclusion. Now, if we are lucky and the variables are correlated with each other, the information provided by the set will be redundant, since the information given by some variables will already include that provided by other variables in the set.

In these cases, it is useful to reduce the dimension of the problem by decreasing the number of variables to a smaller set of variables that are not correlated with each other and that collect most of the information included in the original set. And we say most of the information because, obviously, the more we reduce the number, the more information we will lose.

The two fundamental techniques that we will use in these cases are principal component analysis and factor analysis.

Principal component analysis takes a set of p correlated variables and transforms them into a new set of uncorrelated variables, which we call principal components. These principal components allow us to explain the variables in terms of their common dimensions.

Without going into detail, a correlation matrix and a series of vectors are calculated that will provide us with the new principal components, ordered from highest to lowest according to the variance of the original data that each component encompasses. Each component will be a linear combination of the original variables, somewhat like a regression line.

Let’s imagine a very simple case with six explanatory variables (X1 to X6). Principal component 1 (PC1) can be, let’s say, 0.15X1 + 0.5X2 – 0.6X3 + 0.25X4 – 0.1X5 – 0.2X6 and, in addition, encompass 47% of the total variance. If PC2 turns out to encompass 30% of the variance, with PC1 and PC2 we will have 77% of the total variance under control with a data set that is easier to handle (think what happens if instead of 6 variables we have 50). And not only that: if we plot PC1 against PC2, we can see whether some type of grouping of the data under study occurs according to the values of the principal components.
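The arithmetic of explained variance can be sketched with NumPy (the six variables here are invented, deliberately driven by two underlying dimensions):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Six correlated variables driven by two underlying dimensions f1 and f2
f1, f2 = rng.normal(size=(2, n))
X = np.column_stack([
    f1 + 0.1 * rng.normal(size=n),
    f1 + 0.1 * rng.normal(size=n),
    f1 + 0.1 * rng.normal(size=n),
    f2 + 0.1 * rng.normal(size=n),
    f2 + 0.1 * rng.normal(size=n),
    f2 + 0.1 * rng.normal(size=n),
])

# Principal components via eigendecomposition of the correlation matrix
R = np.corrcoef(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(R)
eigenvalues = eigenvalues[::-1]              # sort from largest to smallest

explained = eigenvalues / eigenvalues.sum()  # variance encompassed by each component
print(np.round(explained, 3))                # the first two dominate
```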

In this way, if we are lucky and a few components collect most of the variance of the original variables, we will have reduced the dimension of the problem. And although, sometimes, this is not possible, it can always help us to find groupings in the data defined by a large number of variables, which links us to the following technique, factor analysis.

We know that the total variance of our data (the one studied by principal component analysis) is the sum of three components: the common or shared variance, the specific variance of each variable, and the variance due to chance and measurement errors. Again, without going into detail, the factor analysis method starts from the correlation matrix to isolate only the common variance and try to find a series of common underlying dimensions, called factors, that are not observable by looking at the original set of variables.

As we can see, these two methods are very similar, so there is a lot of confusion about when to use one and when another, especially considering that principal component analysis may be the first step in the factor analysis methodology.

As we have already said, principal component analysis tries to explain the maximum possible proportion of the total variance of the original data, while the objective of the factor analysis study is to explain the covariance or correlation that exists among its variables. Therefore, principal component analysis will usually be used to search for linear combinations of the original variables and reduce one large data set to a smaller and more manageable one, while we will resort to factor analysis when looking for a new set of variables, generally smaller than the original, and to represent what the original variables have in common.

Moving forward on today’s arduous path, for those hard-working readers who are still with us, we are going to discuss the **CLASSIFICATION AND DISCRIMINATION METHODS**, of which there are two: cluster analysis and discriminant analysis.

Cluster analysis tries to recognize patterns in order to summarize the information contained in the initial set of variables, which are grouped according to their greater or lesser homogeneity. In summary, we look for groups that are mutually exclusive, so that the elements are as similar as possible to those of their own group and as different as possible from those of the other groups.

The most famous part of cluster analysis is, without a doubt, its graphic representation, with decision trees and dendrograms, in which homogeneous groups are represented, becoming increasingly different the farther apart they are among the branches of the tree.

But, instead of wanting to segment the population, let’s assume that we already have a population segmented into a number of classes, k. Suppose we have a group of individuals defined by a number p of random variables. If we want to know to what class of the population a certain individual may belong, we will resort to the technique of discriminant analysis.

Suppose that we have a new treatment that is awfully expensive, so we only want to give it to patients who we are sure will comply with the treatment. Thus, our population is segmented into compliant and non-compliant classes. It would be very useful for us to select a set of variables that would allow us to discriminate which class a specific person may belong to, and even to find which of these variables are the ones that best discriminate between the two groups. Thus, we will measure the variables in the candidate for treatment and, using what is known as a discrimination criterion or rule, we will assign him or her to one group or the other and proceed accordingly. Of course, do not forget that there will always be a probability of being wrong, so we will be interested in finding the discriminant rule that minimizes the probability of discrimination error.

Discriminant analysis may seem similar to cluster analysis, but if we think about it, the difference is clear. In the discriminant analysis the groups are previously defined (compliant or non-compliant, in our example), while in the cluster analysis we look for groups that are not evident: we would analyze the data and discover that there are patients who do not take the pill that we give them, something that had not even crossed our minds (in addition to our ignorance, we would demonstrate our innocence).

And here we are going to leave it for today. We have flown over the steep landscape of multivariate statistics from a great height and I hope it has served to transfer it from the field of the esoteric to that of the exoteric (or was it the other way around?). We have not gone into the specific methodology of each technique, since we could have written an entire book. By roughly understanding what each method is and what it is for, I think we have quite a lot. In addition, statistical packages carry them out, as always, effortlessly.

And do not think that we have talked about all the methods that have been developed for multivariate analysis. There are many others, such as conjoint analysis and multidimensional scaling, widely used in advertising to determine the attributes of an object that are preferred by the population and how they influence their perception of it. We could also get lost among other newer techniques, such as correspondence analysis, or linear probability models, such as logit and probit analysis, which are combinations of multiple regression and discriminant analysis, not to mention simultaneous or structural equation models. But that is another story…


The entry By your actions they will judge you was first published in Science without sense...double nonsense.

Today you are going to forgive me, but I am in a somewhat biblical mood. I was thinking about the sample size calculation for survival studies and it reminded me of the message that Ezekiel transmits to us: according to your ways and your works they will judge you.

Once again, you will think that from all the buzzing of evidence-based medicine in my head I have gone a little nuts, but if you hold on a bit and continue reading, you will see that the analogy can be explained.

One of the most valued methodological quality indicators of a study is the prior calculation of the sample size necessary to demonstrate (or reject) the working hypothesis. When we want to study the effect of an intervention, we must define, a priori, what effect size we want to detect and calculate the sample size necessary to be able to do so, as long as the effect exists (something we hope for when we plan the experiment, but which we do not know a priori), taking into account the level of significance and the power that we want the study to have.

In summary, if we detect the effect size that we previously established, the difference between the two groups will be statistically significant (our desired p <0.05). On the contrary, if there is no significant difference, there is probably no real difference, although always with the risk of making a type 2 error that is equal to 1 minus the power of the study.

So far it seems clear, we must calculate the number of participants we need. But this is not so simple for survival studies.

Survival studies group together a series of statistical techniques for dealing with situations in which it is not enough to observe whether an event occurs: the time that elapses until it occurs is also critical. In these cases, the outcome variable is neither quantitative nor qualitative, but a time-to-event variable, a mixed type with a dichotomous part (the event occurs or it does not) and a quantitative part (how long it takes to occur).

The name of survival studies is a bit misleading and one can think that the event under study will be the death of the participants, but nothing is further from reality. The event can be any type of incident or occurrence, good or bad for the participant. What happens is that the first studies were applied to situations in which the event of interest was death and the name has prevailed.

In these studies, the participants’ follow-up period is often uneven, and some may even finish the study without presenting the event of interest, or drop out of the study before it ends.

For these reasons, if we want to know whether there are differences in the presentation of the event of interest between the two branches of the study, what matters for calculating the sample size is not so much the number of subjects participating as the number of events that we need to observe for the difference to be significant, provided that the clinically important difference, which we must establish a priori, is reached.

Let’s see how it is done, depending on the type of contrast we plan to use.

If we only want to determine the number of events that we have to observe to detect a difference between a certain group and the population from which it comes, the formula to do so is as follows:

where E is the number of events we need to observe, K is a value determined by the confidence level and the power of the study, and lnRR is the natural logarithm of the risk ratio.

The value of K is calculated as (Z_{α} + Z_{β})^{2}, with Z being the standardized value for the chosen significance level and power. The most common choice is to perform a bilateral contrast (with two tails) with a significance level of 0.05 and a power of 80%. In this case, the values are Z_{α} = 1.96, Z_{β} = 0.84 and K = 7.9. In the attached table I leave you the most frequent values of K, so you do not have to calculate them.
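If you do not have the table at hand, K can be reproduced from the standard normal quantiles; a minimal sketch, assuming SciPy:

```python
from scipy.stats import norm

def k_value(alpha=0.05, power=0.80, two_tailed=True):
    """K = (Z_alpha + Z_beta)^2 for a given significance level and power."""
    z_alpha = norm.ppf(1 - alpha / 2) if two_tailed else norm.ppf(1 - alpha)
    z_beta = norm.ppf(power)
    return (z_alpha + z_beta) ** 2

# Two-tailed contrast, alpha = 0.05, power = 80%
print(round(k_value(), 2))             # about 7.85, commonly rounded to 7.9
print(round(k_value(power=0.90), 2))   # higher power requires a larger K
```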

The risk ratio is the ratio between the risk in the study group and the risk in the population, which we are supposed to know. It is defined as Sm_{1}/Sm_{2}, where Sm_{1} is the mean time to appearance of the event in the population and Sm_{2} is that expected in the study group.

Let’s give an example to better understand what has been said so far.

Suppose that patients on treatment with a certain drug (which we will call A so as not to work ridiculously hard) are at risk of developing a stomach ulcer during the first year of treatment. Now we select a group and give them a treatment (B, this time) that acts as prophylaxis, in such a way that we expect the event to take a further year to occur. How many ulcers do we have to observe in a study with a significance level of 0.05 and a power of 0.8 (80%)?

We know that K = 7.9, Sm_{1} = 1 and Sm_{2} = 2. We substitute these values into the formula that we already know:

We will need to see 33 ulcers during follow-up. Now we can calculate how many patients we must include in the study (I find it difficult to enroll just ulcers).

Let’s assume that we can enroll 12 patients a year. If we want to observe 33 ulcers, the follow-up should last 33/12 = 2.75, that is, about 3 years. To be on the safe side, we would plan a slightly longer follow-up.
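The whole example can be sketched in Python. The formula used here, E = 2K/(lnRR)^{2}, is my reading of the calculation, adopted because it reproduces the figures of the example, so treat it as an assumption:

```python
import math

def events_one_group(k, sm1, sm2):
    """Events to observe, assuming E = 2K / (ln RR)^2 with RR = Sm1/Sm2."""
    rr = sm1 / sm2
    return math.ceil(2 * k / math.log(rr) ** 2)

# Mean time to ulcer of 1 year on A versus 2 years expected on B,
# significance level 0.05 and power 80% (K = 7.9)
events = events_one_group(7.9, 1, 2)
print(events)  # 33

# With 12 patients enrolled per year, years of follow-up needed:
print(math.ceil(events / 12))  # 3
```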

This was the simplest problem. When we want to compare the two survival curves (we plan to do a log-rank test), the calculation of the sample size is a bit more complex, but not much. After all, we will already be comparing the survival probability curves of the two groups.

In these cases, the formula for calculating the number of necessary events is as follows:

We find a new parameter, C, which is the ratio of participants between one group and the other (1:1, 1:2, etc.).

But there is another difference with the previous assumption. In these cases, the RR is calculated as the quotient of the natural logarithms of π_{1} and π_{2}, which are the proportions of participants from each group that present the event in a given period of time.

Following the previous example, suppose we know that the ulcer risk in those who are on A is 50% in the first 6 months and that of those who are on B, 20%. How many ulcers do we need to observe with the same level of confidence and the same power of the study?

Let’s substitute the values in the previous formula:
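The substitution was also shown as an image in the original post. It can be reproduced with a short script, a sketch assuming Freedman's approximation for the number of events, with RR taken as the quotient of natural logarithms described above:

```python
import math

def events_needed(k, pi1, pi2, c=1):
    """Events needed for a log-rank comparison (Freedman's approximation).
    pi1, pi2: proportions presenting the event in each group; c: allocation ratio."""
    rr = math.log(pi1) / math.log(pi2)  # RR as the quotient of natural logs
    return (k / c) * (1 + c * rr) ** 2 / (1 - rr) ** 2

e = events_needed(k=7.9, pi1=0.5, pi2=0.2)
print(math.ceil(e))  # → 50, rounding up to the next whole event
```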

We will need to observe 50 ulcers during the study. Now we need to know how many participants (not events) we need in each branch of the study. We can obtain it with the following formula:

If we substitute our values in the equation, we obtain a value of 29.4, so we will need 30 patients in each branch of the study, 60 in all.

Coming to the end, let’s see what would happen if we wanted a different ratio of participants instead of the easiest, 1:1. In that case, the calculation of n with the last formula must be adjusted taking this proportion into account, which is our known C:

Imagine we want a 2:1 ratio. We substitute the values in the equation:
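The adjustment formula was an image in the original post. An assumption on my part, but one that reproduces the figures in the text, is to split the balanced per-group size n according to C, rounding n₁ up before doubling:

```latex
n_1 = \frac{n}{2}\left(1 + \frac{1}{C}\right) = \frac{29.4}{2}\left(1 + \frac{1}{2}\right) \approx 22.05 \rightarrow 23, \qquad n_2 = C \times n_1 = 46
```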

We would need 23 participants in one branch and 46, double, in the other, 69 in all.

And here we leave it for today. As always, everything we have said in this post is so that we can understand the fundamentals of calculating the necessary sample size in this type of study, but I advise you to use a statistical program or a calculator if you ever have to do it. There are many available and some are even totally free.

I hope that you now understand Ezekiel’s message and that, in these studies, the things we do (or suffer) are more important than how many we do (or suffer). We have seen the simplest way to calculate the sample size of a survival study, although we could have brought unnecessary trouble into our lives and calculated the sample size based on estimates of risk ratios or hazard ratios. But that is another story…

The post By your actions they will judge you was first published on Science without sense...double nonsense.

The post I have come here to talk about my book was first published on Science without sense...double nonsense.

The problem is that poor Umbral found himself at a table with two other people who, following the thread of the program, were talking about everything except his book, with the apparent complicity of the presenter and the enthusiastic cooperation of the audience on the set.

And what had to happen happened. Time was running out, the program was about to end and there had been no talk about the book, so Umbral, demonstrating other, lesser-known qualities beyond his genius as a novelist and journalist, exploded, demanding that his book be talked about, since that was the reason he had come to TV, repeatedly saying the phrase that has remained forever in the Spanish cultural heritage: “I have come here to talk about my book.”

Regular readers of the blog are accustomed to seeing posts begin with some delusion of my imagination that ends up giving way to the real topic of the day, which has nothing to do with what was said at the beginning of the post, so you will already be wondering what today’s post is about.

But today you are going to get a surprise. There is no topic on evidence-based medicine. Because today, I have come here to talk about my book.

The blog “Ciencia sin seso… locura doble” was born on July 26, 2012, with the ambitious purpose of teaching topics of research methodology and evidence-based medicine in ways that seem easy and even fun. Since then, about 160 posts have been published in two languages (in Spanish and in something that wants to resemble the language of the Bard of Avon) and it has grown in audience and diversity of topics, although the most important milestone from the point of view of its dissemination and prestige was its inclusion in the AnestesiaR web in May 2014.

It was time, then, to give shape to at least part of those contents so that they would form a coherent and homogeneous whole. And this is how “El ovillo y la espada” (“The Ball and the Sword”) was born, the book I have come to talk about here today.

You can see that I continue with my hobby of giving it a title that has nothing to do with the content of the work. In reality, “The Ball and the Sword” is a “Manual for the critical appraisal of scientific documents”, made up of a selection of blog posts that, grouped together, are intended to provide the reader with the knowledge necessary to face the critical appraisal of the articles that we have to resort to daily in our professional practice.

The manual is made up of a series of blocks that deal with the usual steps of evidence-based medicine systematics: the generation of a structured clinical question in the face of a knowledge gap, the bibliographic search, the characteristics of the most common epidemiological designs, and the guidelines for critically appraising the papers based on those designs.

To finish, I wish only to thank my colleagues and friends from the Evidence-Based Pediatrics Committee of the AEP-AEPap and from AnestesiaR. From the former I have learned everything I know about these topics (do not think it is much just because I have written a book), and thanks to the latter the blog has reached a diffusion that was beyond my own reach, in addition to making possible the project that I am presenting to you today. My book, in case someone hasn’t found out yet.

And with this we are leaving. I hope you are encouraged to read my creature and that it will be useful to you. We have reached the end of this post without explaining what the ball and the sword of the manual’s title are. I will just tell you that it has something to do with a certain Theseus. But that’s another story…


The post The shortest distance was first published on Science without sense...double nonsense.

The other day I was trying to measure the distance between Madrid and New York in Google Earth and I found something unexpected: when I tried to draw a straight line between the two cities, it twisted and formed an arc, and there was no way to avoid it.

I wondered whether what Euclid said about the straight line being the shortest path between two points might not be true. Of course, I realized right away where the error was: Euclid was thinking about the distance between two points on a plane, while I was drawing the minimum distance between two points on a sphere. Obviously, in this case the shortest distance is not marked by a straight line but by an arc, as Google showed me.

And since one thing leads to another, this led me to think about what would happen if instead of two points there were many more. This has to do, as some of you already imagine, with the regression line that is calculated to fit a point cloud. Here, as it is easy to understand, the line cannot pass through all the points without losing its straightness, so the statisticians devised a way to calculate the line that is, on average, closest to all the points. The method they use the most is the one they call the least squares method, whose name suggests something strange and esoteric. However, the reasoning behind it is much simpler and no less ingenious for it. Let’s see it.

The linear regression model makes it possible, once a linear relationship has been established, to make predictions about the value of a variable Y knowing the values of a set of variables X_{1}, X_{2}, … X_{n}. We call Y the dependent variable, although it is also known as the objective, endogenous, criterion or explained variable. For their part, the X variables are the independent variables, also known as predictor, explanatory, exogenous or regressor variables.

When there are several independent variables, we are faced with a multiple linear regression model, while when there is only one, we talk about simple linear regression. To make it easier, we will focus, of course, on simple regression, although the reasoning also applies to the multiple one.

As we have already said, linear regression requires that the relationship between the two variables is linear, so it can be represented by the following equation of a straight line:
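The equation, lost as an image in this copy, is the familiar straight-line equation:

```latex
Y = \beta_0 + \beta_1 X
```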

Here we find two new friends accompanying our dependent and independent variables: the coefficients of the regression model. β_{0} represents the model’s constant (also called the intercept) and is the point where the line crosses the ordinate axis (the Y axis, to understand each other better). It represents the theoretical value of variable Y when variable X equals zero.

For its part, β_{1} represents the slope (inclination) of the regression line. This coefficient tells us by how many units variable Y increases for each one-unit increase in variable X.

This would be the general theoretical line of the model. The problem is that the distribution of values is never going to fit any line perfectly, so when we calculate a given value of Y (y_{i}) from a value of X (x_{i}) there will be a difference between the real value of y_{i} and the one we obtain with the formula of the line. We have already met chance, our inseparable companion, so we will have no choice but to include it in the equation:
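With the error term included, the model (again an image in the original) becomes:

```latex
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i
```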

Although it looks similar to the previous formula, it has undergone a profound transformation. We now have two well-differentiated components: a deterministic one and a stochastic (error) one. The deterministic component is made up of the first two elements of the equation, while the stochastic one is the error term. The error, ε_{i}, is a random variable, while x_{i} is a specific and known value of the variable X.

Let’s focus a little on the value of ε_{i}. We have already said that it represents the difference between the real value of y_{i} in our point cloud and that which would be provided by the equation of the line (the estimated value, represented as ŷ_{i}). We can represent it mathematically in the following way:
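In symbols (reconstructing the missing formula):

```latex
\varepsilon_i = y_i - \hat{y}_i
```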

This value is known as the residual and its value depends on chance, although if the model is not well specified, other factors may also systematically influence it, but that does not change what we are dealing with.

Let’s recap what we have so far:

- A point cloud on which we want to draw the line that best fits the cloud.
- An infinite number of possible lines, from which we want to select a specific one.
- A general model with two components: one deterministic and the other stochastic. This second will depend, if the model is correct, on chance.

We already have the values of the variables X and Y in our point cloud for which we want to calculate the line. What will vary in the equation of the line that we select will be the coefficients of the model, β_{0} and β_{1}. And what coefficients interest us? Logically, those with which the random component of the equation (the error) is as small as possible. In other words, we want the equation with a value of the sum of residuals as low as possible.

Starting from the previous equation of each residual, we can represent the sum of residuals as follows, where n is the number of pairs of values of X and Y that we have:
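The missing formula is simply the sum of the n residuals:

```latex
\sum_{i=1}^{n} \varepsilon_i = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)
```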

But this formula does not work for us. If the difference between the estimated and the real value is random, sometimes it will be positive and sometimes negative. Furthermore, its average will be zero or close to zero. For this reason, as on other occasions when we are interested in measuring the magnitude of a deviation, we have to resort to a method that prevents negative differences from canceling out positive ones, so we sum these differences squared, according to the following formula:
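That is, the sum of squared residuals:

```latex
SS = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^{2}
```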

At last! We already know where the least squares method comes from: we look for the regression line that gives us the smallest possible value of the sum of the squares of the residuals. To calculate the coefficients of the regression line we just have to expand the previous equation a little, substituting the estimated value of Y with the terms of the regression line equation:
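Substituting ŷ_{i} by the line’s equation, the function to minimize (reconstructed from context) is:

```latex
SS(b_0, b_1) = \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right)^{2}
```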

and find the values of b_{0} and b_{1} that minimize the function. From here the task is a piece of cake: we just have to set the partial derivatives of the previous equation to zero (take it easy, we will spare ourselves the hard mathematical jargon) to get the value of b_{1}:
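The resulting expression is the standard least squares estimator, reconstructed here:

```latex
b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^{2}}
```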

Where we have in the numerator the covariance of the two variables and, in the denominator, the variance of the independent variable. From here, the calculation of b_{0} is a breeze:
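Namely:

```latex
b_0 = \bar{y} - b_1 \bar{x}
```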

We can now build our line that, if you look closely, goes through the mean values of X and Y.
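As a quick check of the whole derivation, here is a minimal sketch in Python (the post itself uses R; the data below are made up for illustration) that computes b₁ and b₀ from the deviations about the means and verifies that the fitted line passes through the point of means:

```python
# Least squares by hand: b1 = cov(X, Y) / var(X), b0 = mean(Y) - b1 * mean(X)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical predictor values
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # hypothetical responses

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n

b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b0 = my - b1 * mx

# The fitted line goes through the mean values of X and Y, as the post says
assert abs((b0 + b1 * mx) - my) < 1e-9
print(round(b1, 2), round(b0, 2))  # → 1.99 0.05
```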

And with this we end the arduous part of this post. Everything we have said serves to understand what least squares means and where the matter comes from, but it is not necessary to do all this to calculate a linear regression line. Statistical packages do it in the blink of an eye.

For example, in R it is calculated using the function lm(), which stands for linear model. Let’s see an example using the “trees” dataset (girth, volume and height of 31 trees), calculating the regression line to estimate the height of the trees knowing their volume:

modelo_reg <- lm(Height~Volume, data = trees)

summary(modelo_reg)

The lm() function returns the model to the variable that we have indicated (modelo_reg, in this case), which we can exploit later, for example, with the summary() function. This will provide us with a series of data, as you can see in the attached figure.

First, the quartiles and the median of the residuals. For the model to be correct, it is important that the median be close to zero and that the absolute values of the residuals be distributed evenly among the quartiles (similar between maximum and minimum and between the first and third quartiles).

Next, the point estimates of the coefficients are shown along with their standard errors, which will allow us to calculate their confidence intervals. They are accompanied by the values of the t statistic and its statistical significance. We have not said it yet, but these statistics follow a Student’s t distribution with n−2 degrees of freedom, which allows us to know whether the coefficients are statistically significant.

Finally, the standard deviation of the residuals is provided, together with the square of the multiple correlation coefficient, or coefficient of determination (the precision with which the line represents the functional relationship between the two variables; its square root in simple regression is Pearson’s correlation coefficient), its adjusted value (more reliable when we calculate regression models with small samples) and the F test to validate the model (the ratio of variances follows a Snedecor’s F distribution).

Thus, our regression equation would be as follows:

Height = 69 + 0.23 × Volume

We could already calculate how tall a tree could be given a specific volume that was not in our sample (although it should be within the range of data used to calculate the regression line, since it is risky to make predictions outside this range).

Also, with the scatterplot(Volume ~ Height, regLine = TRUE, smooth = FALSE, boxplots = FALSE, data = trees) command, we could draw the point cloud and the regression line, as you can see in the second figure.

And we could calculate many more parameters related to the regression model calculated by R, but we will leave it here for today.

Before finishing, just to tell you that the least squares method is not the only one that allows us to calculate the regression line that best fits our point cloud. There is also another method, that of maximum likelihood, which chooses the coefficients that are most compatible with the observed values. But that is another story…


The post Rioja vs Ribera was first published on Science without sense...double nonsense.

This is one of the typical debates that one can have with a brother-in-law during a family dinner: whether the wine from Ribera is better than that from Rioja, or vice versa. In the end, as always, the brother-in-law will be (or will want to be) right, which will not prevent us from trying to contradict him. Of course, we must make good arguments to avoid falling into the same error, in my humble opinion, in which some fall when participating in another classic debate, this one from the less playful field of epidemiology: Frequentist vs. Bayesian statistics?

And these are the two approaches that we can use when dealing with a research problem.

Frequentist statistics, the best known and the one we are most accustomed to, is the one developed according to the classic concepts of probability and hypothesis testing. Thus, it is about reaching a conclusion based on the level of statistical significance and the acceptance or rejection of a working hypothesis, always within the framework of the study being carried out. This methodology forces us to establish the decision parameters *a priori*, which avoids subjectivities regarding them.

The other approach to solving problems is that of Bayesian statistics, which is increasingly fashionable and, as its name suggests, is based on the probabilistic concept of Bayes’ theorem. Its differentiating feature is that it incorporates external information into the study that is being carried out, so that the probability of a certain event can be modified by the previous information that we have on the event in question. Thus, the information obtained *a priori* is used to establish an *a posteriori* probability that allows us to make the inference and reach a conclusion about the problem we are studying.

This is another difference between the two approaches: while frequentist statistics avoids subjectivity, the Bayesian one introduces a subjective (but not capricious) definition of probability, based on the researcher’s conviction, to make judgments about a hypothesis.

Bayesian statistics is not really new. Thomas Bayes’s theory of probability was published in 1763, but it has experienced a resurgence since the last third of the last century. And, as usually happens in these cases where there are two alternatives, supporters and detractors of both methods have appeared, deeply involved in the fight to demonstrate the benefits of their preferred method, sometimes looking more for the weaknesses of the opposite one than for their own strengths.

And this is what we are going to talk about in this post: some arguments that Bayesians use on occasion that, once more in my humble opinion, take advantage more of the misuse of frequentist statistics by many authors than of intrinsic defects of that methodology.

We will start with a bit of history.

The history of hypothesis testing begins back in the 1920s, when the great Ronald Fisher proposed assessing the working hypothesis (of absence of effect) through a specific observation and the probability of observing a value equal to or greater than the observed result. This probability is the p-value, so sacred and so misinterpreted, which means no more than that: the probability of finding a value equal to or more extreme than the one found, if the working hypothesis were true.

In summary, the p that Fisher proposed is nothing more than a measure of the discrepancy that could exist between the data found and the working hypothesis proposed, the null hypothesis (H0).

Almost a decade later, the concept of alternative hypothesis (H1) was introduced, which did not exist in Fisher’s original approach, and the reasoning is modified based on two error rates of false positive and negative:

- Alpha error (type 1 error): probability of rejecting the null hypothesis when, in fact, it is true. It would be the false positive: we believe we detect an effect that, in reality, does not exist.
- Beta error (type 2 error): it is the probability of accepting the null hypothesis when, in fact, it is false. It is the false negative: we fail to detect an effect that actually exists.

Thus, we set a maximum value for what seems to us the worst-case scenario, which is detecting a false effect, and we choose a “small” value. How small? Well, by convention, 0.05 (sometimes 0.01). But, I repeat, it is a value chosen by agreement (and there are those who say it is capricious, because 5% reminds them of the fingers of the hand, which are usually five).

Thus, if p < 0.05, we reject H0 in favor of H1. Otherwise, we accept H0, the hypothesis of no effect. It is important to note that we can only reject H0, never demonstrate it in a positive way. We can demonstrate the effect, but not its absence.

Everything said so far seems easy to understand: the frequentist method tries to quantify the level of uncertainty of our estimate to try to draw a conclusion from the results. The problem is that p, which is nothing more than a way to quantify this uncertainty, is sacralized and misinterpreted too often, which is used to their advantage (if I may say so) by opponents of the method to try to expose its weaknesses.

One of the major flaws attributed to the frequentist method is the dependence of the p-value on the sample size. Indeed, the value of p can be the same with a small effect size in a large sample as with a large effect size in a small sample. And this is more important than it may seem at first, since the value that will allow us to reach a conclusion will depend on a decision exogenous to the problem we are examining: the chosen sample size.
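This dependence of p on sample size is easy to see with a small sketch (hypothetical numbers; a two-sample z-test with known unit standard deviation, chosen so that no external libraries are needed): the same observed difference in means gives very different p-values as n grows.

```python
import math

def two_sample_z_p(diff, n):
    """Two-sided p-value for a difference in means `diff` between two groups
    of size n each, assuming a known standard deviation of 1 (z-test)."""
    z = diff / math.sqrt(2.0 / n)            # standardized difference
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability

# Same modest effect (0.2 standard deviations), increasing sample sizes
for n in (20, 100, 500):
    print(n, round(two_sample_z_p(0.2, n), 4))
```

The p-value shrinks steadily as n grows, even though the observed effect size never changes.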

Here would lie the benefit of the Bayesian method, in which larger samples serve to provide more and more information about the phenomenon under study. But I think this argument is based on a misunderstanding of what an adequate sample is. More is not always better, I am convinced.

Another great man, David Sackett, said that “samples that are too small can be used to prove nothing; samples that are too large can be used to prove anything”. The problem is that, in my opinion, a sample is neither large nor small, but sufficient or insufficient to demonstrate the existence (or not) of an effect size that is considered clinically important.

And this is the heart of the matter. When we want to study the effect of an intervention we must, a priori, define what effect size we want to detect and calculate the sample size necessary to be able to do it, as long as the effect exists (something we hope for when we plan the experiment, but do not know a priori). When we do a clinical trial we are spending time and money, in addition to subjecting participants to potential risk, so it is important to include only those participants necessary to try to prove the clinically important effect. Including participants beyond that, just to reach the desired p < 0.05, in addition to being uneconomical and unethical, demonstrates a lack of knowledge about the true meaning of the p-value and of the sample size.

This misinterpretation of the p-value is also the reason why many authors who do not reach the desired statistical significance allow themselves to affirm that they would have achieved it with a larger sample size. And they are right: they would have reached the desired p < 0.05, but they again ignore the importance of clinical significance versus statistical significance.

When the sample size needed to detect the clinically important effect is calculated a priori, the power of the study is also calculated, which is the probability of detecting the effect if it actually exists. If the power is greater than 80-90%, the values accepted by convention, it does not seem correct to say that the sample is insufficient. And, of course, if you have not calculated the power of the study beforehand, you should do so before claiming that you have no results because of a small sample.

Another argument against the frequentist method and in favor of the Bayesian one says that hypothesis testing is a dichotomous decision process, in which a hypothesis is rejected or accepted just as you reject or accept an invitation to the wedding of a distant cousin you haven’t seen for years.

Well, if they previously forgot about clinical significance, those who affirm this forget about our beloved confidence intervals. The results of a study should not be interpreted solely on the basis of the p-value. We must look at the confidence intervals, which inform us of the precision of the result and of the possible values that the observed effect may have, values that we cannot specify further because of the effect of chance. As we saw in a previous post, the analysis of confidence intervals can sometimes give us clinically important information even when the p-value is not statistically significant.

Finally, some detractors of the frequentist method say that the hypothesis test makes decisions without considering information external to the experiment. Again, a misinterpretation of the value of p.

As we already said in a previous post, a value of p < 0.05 does not mean that H0 is false, nor that the study is more reliable, nor that the result is important (even if the p-value has six zeros). But, most importantly for what we are discussing now, it is false that the value of p represents the probability that H0 is false (the probability that the effect is real).

Once our results allow us to affirm, with a small margin of error, that the detected effect is real and not due to chance (in other words, when the p-value is statistically significant), we can calculate the probability that the effect is “real”. And for this, oh surprise!, we will have to calibrate the value of p with the baseline probability of H0, which will be assigned by the researcher based on her knowledge or previously available data (which is still a Bayesian approach).

As you can see, the assessment of the credibility or likelihood of the hypothesis, one of the differentiating characteristics of the Bayesian approach, can also be used with frequentist methods.

And here we are going to leave it for today. But before finishing I would like to make a couple of considerations.

First, in Spain we have many great wines throughout our geography, not just Ribera or Rioja. So that no one gets offended, I have chosen these two because they are usually the ones my brothers-in-law ask for when they come to have dinner at home.

Second, do not misunderstand me if it has seemed to you that I am an advocate of frequentist statistics against the Bayesian one. Just as when I go to the supermarket I am happy to be able to buy wine from various designations of origin, in research methodology I find it very good to have different ways of approaching a problem. If I want to know whether my team is going to win a match, it does not seem very practical to repeat the match 200 times to see what average result comes out. It would be better to try to make an inference taking into account the previous results.

And that’s all. We have not gone into depth on what we mentioned at the end about the real probability of the effect, which somehow mixes both approaches, the frequentist and the Bayesian. The easiest way, as we saw in a previous post, is to use Held’s nomogram. But that is another story…


The post A weakness was first published on Science without sense...double nonsense.

Even the greatest have weaknesses. It is a reality that affects even the great NNT, the number needed to treat, without a doubt the king of the measures of absolute impact of the research methodology in clinical trials.

Of course, that is not an irreparable disgrace. We only have to be well aware of its strengths and weaknesses in order to take advantage of the former and try to mitigate and control the latter. The fact is that the NNT depends on the baseline risks of the intervention and control groups, which can be inconsistent traveling companions, subject to variation due to several factors.

As we all know, NNT is an absolute measure of effect that is used to estimate the efficacy or safety of an intervention. This parameter, just like a good marriage, is useful in good times and in bad, in sickness and in health.

Thus, on the good side we talk about NNT, which is the number of patients that have to be treated for one to present a result that we consider as good. By the way, on the dark side we have the number needed to harm (NNH), which indicates how many we have to treat in order for one to present an adverse event.

NNT was originally designed to describe the effect of the intervention relative to the control group in clinical trials, but its use was later extended to the interpretation of the results of systematic reviews and meta-analyses. And this is where the problem may arise since, sometimes, the way it is calculated in trials is generalized to meta-analyses, which can lead to error.

The simplest way to obtain the NNT is to calculate the inverse of the absolute risk reduction between the intervention and control groups. The problem is that this form is the one most likely to be biased by the presence of factors that can influence the value of the NNT. Although it is the king of absolute impact measures, it also has its limitations, with various factors influencing its magnitude, not to mention its clinical significance.

One of these factors is the duration of the study follow-up period. This duration can influence the number of events, good or bad ones, that the study participants can present, which makes it incorrect to compare the NNTs of studies with follow-ups of different duration.

Another may be the baseline risk of presenting the event. Let’s think that the term “risk”, from a statistical point of view, does not always imply something bad. We can speak, for example, of risk of cure. If the baseline risk is higher, more events will likely occur and the NNT may be lower. The outcome variable used and the treatment alternative with which we compared the intervention should also be taken into account.

And third, to name a few more of these factors, the direction and size of the effect, the measurement scale, and the precision of the NNT estimates (their confidence intervals) may influence its value.

And here the problem arises with systematic reviews and meta-analyses. Even though we might wish otherwise, there will always be some heterogeneity among the primary studies in the review, so the factors we have discussed may differ among studies. At this point, it is easy to understand that estimating a global NNT from the summary measures of risk in the two groups may not be the most suitable approach, since it is highly influenced by variations in the baseline control event rate (CER).

For these situations, it is much more advisable to make other, more robust estimates of the NNT, the most widely used being those based on other association measures, such as the risk ratio (RR) or the odds ratio (OR), which hold up better in the face of variations in the CER. In the attached figure I show you the formulas for calculating the NNT using the different measures of association and effect.
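The figure with the formulas has not survived in this copy, so here is a hedged sketch of the commonly used expressions for obtaining the NNT from the CER together with an RR or an OR (for a beneficial intervention, RR and OR below 1; the rates in the check are hypothetical):

```python
def nnt_from_arr(cer, eer):
    """NNT as the inverse of the absolute risk reduction."""
    return 1.0 / (cer - eer)

def nnt_from_rr(cer, rr):
    """NNT from the control event rate and the risk ratio."""
    return 1.0 / (cer * (1.0 - rr))

def nnt_from_or(cer, or_):
    """NNT from the control event rate and the odds ratio."""
    return (1.0 - cer * (1.0 - or_)) / (cer * (1.0 - cer) * (1.0 - or_))

# Consistency check with hypothetical rates: CER = 0.2, EER = 0.1
cer, eer = 0.2, 0.1
rr = eer / cer                                    # 0.5
or_ = (eer / (1 - eer)) / (cer / (1 - cer))       # ≈ 0.444
print(nnt_from_arr(cer, eer), nnt_from_rr(cer, rr), round(nnt_from_or(cer, or_), 1))
```

All three routes give an NNT of 10 for these rates, as they should when the CER is known exactly; they diverge only when a pooled RR or OR is combined with a CER summarized across heterogeneous studies.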

In any case, we must not lose sight of the recommendation of not to carry out a quantitative synthesis or calculation of summary measures if there is significant heterogeneity among primary studies, since then the global estimates will be unreliable, whatever we do.

But do not think that we have solved the problem. We cannot finish this post without mentioning that these alternative methods for calculating NNT also have their weaknesses. Those have to do with obtaining an overall CER summary value, which also varies among primary studies.

The simplest way would be to divide the sum of events in the control groups of the primary studies by the total number of participants in those groups. This is usually possible simply by taking the data from the forest plot of the meta-analysis. However, this method is not recommended, as it completely ignores the variability among studies and possible differences in randomization.

Another more correct way would be to calculate the mean or the median of the CERs of all the primary studies or, even better, to calculate some weighted measure based on the variability of each study.

Even more, if variations in baseline risk among studies are very important, an estimate based on the investigator’s knowledge or on other studies could be used, or we could use a range of possible CER values and compare the different NNTs that would be obtained.
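As an illustration of these options, here is a quick sketch with made-up data for the control groups of three hypothetical primary studies, given as (events, sample size):

```python
from statistics import median

# Hypothetical control groups of three primary studies: (events, n)
studies = [(10, 100), (3, 50), (40, 200)]
cers = [events / n for events, n in studies]

# Not recommended: pool all events and participants,
# ignoring between-study variability
naive_cer = sum(e for e, _ in studies) / sum(n for _, n in studies)

# Better: a central value of the study-level CERs
median_cer = median(cers)

# Or a weighted value; here, inverse-variance weights using the
# binomial variance p(1 - p)/n (the caveat about the binomial
# distribution discussed next applies to this choice)
weights = [n / (p * (1 - p)) for p, (_, n) in zip(cers, studies)]
weighted_cer = sum(w * p for w, p in zip(weights, cers)) / sum(weights)

print(naive_cer, median_cer, weighted_cer)
```

The three summaries rarely coincide, which is exactly the point: the choice of CER summary propagates directly into the global NNT.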

You have to be very careful with variance-based weighting methods, since the CER has the bad habit of not following a normal distribution, but a binomial one. The problem with the binomial distribution is that its variance depends greatly on the mean of the distribution, being maximal at mean values around 0.5.

On the contrary, the variance decreases as the mean approaches 0 or 1, so all variance-based weighting methods will assign greater weight to a study the further its mean gets from 0.5 (remember that the CER can range from 0 to 1, like any other probability). For this reason, it is necessary to carry out a transformation so that the values approximate a normal instead of a binomial distribution, and thus be able to carry out the weighting.

And I think we will leave it here for today. We are not going to go into the methods to transform the CER, such as the double arcsine or the application of generalized linear mixed models, since that is for the most exclusive minds, among which mine is not included. Anyway, don’t get stressed about this: I advise you to calculate the NNT using statistical packages or calculators, such as Calcupedev. There are other uses of the NNT that we could also comment on and that can be obtained with these tools, as is the case of the NNT in survival studies. But that is another story…

The post A weakness was first published on Science without sense...double nonsense.


We all know that the less we go to the doctor, the better. And this is so for two reasons. First, because if we go to many doctors we are either physically ill or very mentally sick (some unfortunates are both). And second, and this is the fact that always strikes me, because every doctor tells you something different. It’s not that doctors don’t know their job; it’s that reaching an agreement is not as simple as it seems.

To give you an idea, the problem starts when we want to know whether two doctors who assess the same diagnostic test have a good degree of agreement. Let’s see an example.

Imagine for a moment that I am the manager of a hospital and I want to hire a pathologist because the only one who works at the hospital is overworked.

I meet with my pathologist and the applicant and give them 795 biopsies to tell me whether there are malignant cells in them. As you can see in the first table, my pathologist finds malignant cells in 99 biopsies, while the applicant sees them in 135 (do not panic, in real life the difference wouldn’t be so wide, would it?). We wonder what degree of agreement or, rather, concordance exists between the two. The first thing that comes to our mind is to count the biopsies on which they agree: they both agree on 637 normal biopsies and on 76 with malignant cells, so the proportion of agreement can be calculated as (637+76)/795 = 0.896. Hurray!, we think, the two agree almost 90% of the time. The result is not as bad as it seemed from looking at the table.

But it turns out that when I’m about to hire the new pathologist I wonder if they could have agreed just by chance.

So, a stupid experiment springs to my mind: I take the 795 biopsies and toss a coin, labeling each biopsy as normal if I get heads, or pathological if I get tails.

The coin says I have 400 normal biopsies and 395 with malignant cells. If I calculate the concordance between the coin and the pathologist, I see that it is (365+55)/795 = 0.53, 53%! This is really amazing: just by chance there’s agreement in about half of the cases (yes, yes, I know that the know-it-alls among you will be thinking that it’s no surprise, since 50% is the probability of each possible outcome when tossing a coin). So I start thinking about how to save money for my hospital and I come up with another experiment that this time is not only stupid, but totally ridiculous: I ask my cousin to do the test instead of tossing a coin (this time I’m going to leave my brother-in-law alone).

The problem, of course, is that my cousin is not a doctor and, although a nice guy, pathology is not his strong suit. So, when he starts to see the colorful cells he thinks it’s impossible that such beauty is produced by malignant cells and reports all the biopsies as normal. When we look at the table with the results, the first thing we think of is to burn it but, for the sake of curiosity, we calculate the concordance between my cousin and my pathologist and see that it’s 696/795 = 0.875, 87.5%! Conclusion: it might be more convenient for me to hire my cousin instead of a new pathologist.

At this stage many of you will think that I forgot to take my medication this morning, but the truth is that all these examples serve to show you that, if we want to know what the agreement between two observers is, we must first get rid of the cumbersome and everlasting effect of chance. And for that, mathematicians have invented a statistic called kappa, the interobserver agreement coefficient.

The function of kappa is to exclude from the observed agreement the part that is due to chance, obtaining a more representative measure of the strength of agreement between observers. Its formula is a ratio whose numerator is the difference between the observed and the chance agreement and whose denominator represents the complement of the chance agreement: (Po-Pr) / (1-Pr).

We already know the value of Po with the two pathologists: 0.89. To get Pr we have to calculate the theoretical expected values for each cell of the table, just as we did with the chi-squared test: the expected value of each cell is the product of the totals of its row and column divided by the total of the table. As an example, the expected value of the first cell of our table is (696×660)/795 = 578. With the expected values we can calculate the probability of agreement due to chance using the same method we used earlier with the observed values: (578+17)/795 = 0.74.

And now we can calculate kappa = (0.89-0.74)/(1-0.74) = 0.57. And what can we conclude from a value of 0.57? We can do with it whatever we want except multiply it by a hundred, because this value doesn’t represent a true percentage. The value of kappa can range between -1 and 1. Negative values indicate that concordance is worse than that expected by chance. A value of 0 indicates that the agreement is similar to what we could get by flipping a coin. Values greater than 0 indicate that concordance is slight (0.01-0.20), fair (0.21-0.40), moderate (0.41-0.60), substantial (0.61-0.80) or almost perfect (0.81-1.00). In our case, there’s a moderate agreement between the two pathologists. If you are curious, you can calculate the kappa for my cousin and you’ll see that it’s no better than flipping a coin.
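If you want to check the arithmetic, here is a minimal Python sketch with the post’s table. The off-diagonal counts (59 and 23) are deduced from the totals given in the text, and note that, without rounding Po and Pr first as the post does, kappa comes out closer to 0.59 than to 0.57:

```python
# Agreement table: rows = my pathologist, columns = the applicant
#                     applicant normal   applicant malignant
table = [[637, 59],   # my pathologist: normal    (total 696)
         [23,  76]]   # my pathologist: malignant (total 99)

n = sum(sum(row) for row in table)                    # 795 biopsies
po = (table[0][0] + table[1][1]) / n                  # observed agreement
rows = [sum(row) for row in table]                    # [696, 99]
cols = [table[0][j] + table[1][j] for j in range(2)]  # [660, 135]
pr = sum(r * c for r, c in zip(rows, cols)) / n ** 2  # chance agreement
kappa = (po - pr) / (1 - pr)
print(round(kappa, 2))  # 0.59
```

The same few lines also work for my cousin’s table: with every biopsy called normal, Pr equals Po and kappa drops to 0.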

Kappa can also be calculated if we have measurements from several observers and more than one result for each observation, but the tables get so unfriendly that it is better to use a statistical program to calculate it and, by the way, obtain its confidence intervals.

Anyway, do not put too much trust in kappa, because it is sensitive to how the cases are distributed among the table’s cells. If a cell has few cases, the coefficient will tend to underestimate the actual concordance even if it’s very good.

Finally, let’s say that, although all our examples showed tests with a dichotomous result, it’s also possible to calculate interobserver agreement for quantitative results (a rating scale, for instance). Of course, for that we have to use another statistical technique, such as the Bland-Altman method, but that’s another story…

The post A good agreement? was first published on Science without sense...double nonsense.


I was thinking about the effect size based on mean differences and how to know when that effect is really large and, by association of ideas, someone great came to mind who, sadly, has left us recently. I am referring to Kirk Douglas, that hell of an actor whom I will always remember for his roles as a Viking, as Van Gogh or as Spartacus, in the famous scene of the film in which all the slaves, in the style of our Spanish Fuenteovejuna, stand up and proclaim together that they are Spartacus so that the Romans cannot do anything to the true one (or so that they all get equally whacked, much more typical of the *modus operandi* of the Romans of that time).

You won’t tell me the man wasn’t great. But how great, if we compare him with others? How can we measure it? It is clear that not by the number of Oscars, since that would only serve to measure the prolonged shortsightedness of the so-called academics of the cinema, who took a long time to award him the honorary prize for his entire career. It is not easy to find a parameter that defines the greatness of a character like Issur Danielovitch Demsky, which was the ragman’s son’s name before becoming a legend.

We have it easier when quantifying the effect size in our studies, although the truth is that researchers are usually more interested in telling us the statistical significance than the size of the effect. It is so unusual to calculate it that many statistical packages even lack routines to obtain it. In this post, we are going to focus on how to measure the effect size based on differences between means.

Imagine that we want to conduct a trial to compare the effect of a new treatment against placebo and that we are going to measure the result with a quantitative variable X. What we will do is calculate the mean effect among participants in the experimental or intervention group and compare it with the mean of the participants in the control group. Thus, the effect size of the intervention with respect to the placebo will be represented by the magnitude of the difference between the mean in the experimental group and that of the control group:

difference = m_e – m_c

However, although it is the easiest to calculate, this value does not help us get an idea of the effect size, since its magnitude depends on several factors, such as the unit of measure of the variable. Think about how the difference changes, when one mean is twice the other, depending on whether their values are 1 and 2 or 0.001 and 0.002. In order for this difference to be useful, it is necessary to standardize it, so a man named Gene Glass thought of dividing it by the standard deviation of the control group. He obtained the well-known Glass’ delta, which is calculated according to the following formula:

delta = (m_e – m_c) / S_c

Now, since what we want is to estimate the value of delta in the population, we will have to calculate the standard deviation using n-1 in the denominator instead of n, since we know that this quasi-variance is a better estimator of the population value of the deviation:

S_c = √( Σ(x_i – m_c)² / (n_c – 1) )

But do not let yourselves be impressed by delta: it is nothing more than a Z score (those obtained by subtracting from a value its mean and dividing by the standard deviation). Each unit of delta is equivalent to one standard deviation, so it represents the standardized difference in the effect that occurs between the two groups due to the effect of the intervention.
This value allows us to estimate the percentage of superiority of the effect by calculating the area under the curve of the standard normal distribution N(0,1) up to a specific delta value (expressed in standard deviations). For example, we can calculate the area that corresponds to a delta value of 1.3. Nothing is simpler than using a table of values of the standard normal distribution or, even better, the pnorm() function of R, which returns the value 0.90. This means that the mean effect in the intervention group exceeds the effect observed in 90% of the subjects in the control group.
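The same calculation can be done without R; Python’s standard library has an equivalent of pnorm():

```python
from statistics import NormalDist

def superiority(delta):
    # Area under the standard normal N(0,1) below delta: the fraction of the
    # control group whose effect falls below the experimental group's mean
    return NormalDist().cdf(delta)

print(round(superiority(1.3), 2))  # 0.9, the value the post gets with pnorm()
```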

The problem with Glass’ delta is that the difference in means depends on the variability of the two groups, which makes it sensitive to differences in variance: if the variances of the two groups are very different, the delta value may be biased. That is why one Larry Vernon Hedges wanted to contribute his own letter to this particular alphabet and decided to do a calculation similar to Glass’, but using a pooled standard deviation that does not assume the equality of variances, according to the following formula:

S_pooled = √( ((n_e – 1)·S_e² + (n_c – 1)·S_c²) / (n_e + n_c – 2) )

If we substitute the standard deviation of the control group in the Glass’ delta formula with this pooled deviation we obtain the so-called Hedges’ g. The advantage of using this pooled standard deviation is that it takes into account the variances and sizes of the two groups, so g has less risk of bias than delta when we cannot assume equal variances between the two groups.

However, both delta and g have a positive bias, which means that they tend to overestimate the effect size. To avoid this, Hedges modified the calculation of his parameter to obtain an adjusted g, according to the following formula:

g_adjusted = g × (1 – 3 / (4·df – 1))

where df are the degrees of freedom, calculated as n_{e} + n_{c} – 2.

This correction is more needed with small samples (few degrees of freedom). It is logical, if we look at the formula, the more degrees of freedom, the less necessary it will be to correct the bias.

So far, we have tried to solve the problem of calculating an estimator of the effect size that is not biased by the lack of equal variances. The point is that, in the rigid and controlled world of clinical trials, it is usual that we can assume the equality of variances between the groups of the two branches of the study. We might think, then, that if this is true, it would not be necessary to resort to the trick of n-1.

Well, Jacob Cohen thought the same, so he devised his own parameter, Cohen’s d. This Cohen’s d is similar to Hedges’ g, but still more sensitive to inequality of variances, so we will only use it when we can assume the equality of variances between the two groups. Its calculation is identical to that of the Hedges’ g, but using n instead of n-1 to obtain the unified variance.
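Putting the whole family together, here is a sketch that follows the post’s definitions on two made-up samples (the data are invented purely for illustration):

```python
import math
from statistics import mean, stdev, pstdev

exp = [5.1, 6.0, 5.6, 6.3, 5.8, 6.1]   # hypothetical experimental group
ctl = [4.2, 5.0, 4.6, 5.3, 4.1, 4.9]   # hypothetical control group
ne, nc = len(exp), len(ctl)
diff = mean(exp) - mean(ctl)

# Glass' delta: standardize by the control group's SD (n-1 denominator)
delta = diff / stdev(ctl)

# Hedges' g: pooled SD with n-1 denominators in each group
s_pooled = math.sqrt(((ne - 1) * stdev(exp) ** 2 +
                      (nc - 1) * stdev(ctl) ** 2) / (ne + nc - 2))
g = diff / s_pooled

# Adjusted g: small-sample bias correction, df = ne + nc - 2
df = ne + nc - 2
g_adj = g * (1 - 3 / (4 * df - 1))

# Cohen's d: same as g but pooling with n instead of n-1, as the post describes
s_pooled_n = math.sqrt((ne * pstdev(exp) ** 2 +
                        nc * pstdev(ctl) ** 2) / (ne + nc))
d = diff / s_pooled_n

print(delta, g, g_adj, d)
```

With samples this small, the bias correction bites: the adjusted g is noticeably smaller than the raw g.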

As a rough-and-ready rule, we can say that the effect size is small for d = 0.2, medium for d = 0.5, large for d = 0.8 and very large for d = 1.20. In addition, we can establish a relationship between d and the Pearson’s correlation coefficient (r), which is also a widely used measure to estimate the effect size.

The correlation coefficient measures the relationship between an independent binary variable (intervention or control) and a numerical dependent variable (our X). The great advantage of this measure is that it is easier to interpret than the parameters we have seen so far, which all function as standardized Z scores. We already know that r can range from -1 to 1 and the meaning of these values.

Thus, if you want to calculate r given d, you only have to apply the following formula:

r = d / √( d² + 1/(p·q) )

where p and q are the proportions of subjects in the experimental and control groups (p = n_{e} / n and q = n_{c} / n). In general, the larger the effect size, the greater r and vice versa (although it must be taken into account that r also gets smaller as the difference between p and q increases). However, the factor that most determines the value of r is the value of d.

And with this we will end for today. Do not believe that we have discussed all the measures of this family. There are about a hundred parameters to estimate the effect size, such as the determination coefficient, eta-square, chi-square, etc., even others that Cohen himself invented (not very happy with only d), such as f-square or Cohen’s q. But that is another story…

The post I am Spartacus was first published on Science without sense...double nonsense.


I have a brother-in-law who is increasingly afraid of getting on a plane. He is able to make road trips lasting several days in a row just to avoid leaving the ground. But it turns out that the poor guy now has to make a transcontinental trip and has no choice but to take a plane.

But at the same time, my brother-in-law, in addition to being fearful, is a resourceful fellow. He has been counting the number of flights of the different airlines and the number of accidents each one has had in order to calculate the probability of having a mishap with each of them and fly with the safest. The matter is very simple if we remember that probability equals favorable cases divided by possible cases.

And it turns out that he is happy because there is a company that has made 1500 flights and has never had an accident, so the probability of having an accident flying on its planes will be, according to my brother-in-law, 0/1500 = 0. He is now so calm that he has almost lost his fear of flying. Mathematically, it is almost certain that nothing will happen to him. What do you think about my brother-in-law’s reasoning?

Many of you will already be thinking that using brothers-in-law for these examples has these problems. We all know how brothers-in-law are… But don’t be unfair to them. As the famous humorist Joaquín Reyes says, “we all of us are brothers-in-law”, so just remember it. What is beyond doubt is that we will all agree that my brother-in-law is wrong: the fact that there has not been any mishap in the 1500 flights does not guarantee that the next plane will not fall. In other words, even if the numerator of the proportion is zero, if we estimate the real risk it would be incorrect to keep zero as the result.

This situation occurs with some frequency in Biomedicine research studies. To leave airlines and aerophobics alone, think that we have a new drug with which we want to prevent this terrible disease that is fildulastrosis. We take 150 healthy people and give them the antifildulin for 1 year and, after this follow-up period, we do not detect any new cases of disease. Can we conclude then that the treatment prevents the development of the disease with absolute certainty? Obviously not. Let’s think about it a little.

Making inferences about probabilities when the numerator of the proportion is zero can be somewhat tricky, since we tend to think that the non-occurrence of events is something qualitatively different from the occurrence of one, few or many events, and this is not really so. A numerator equal to zero does not mean that the risk is zero, nor does it prevent us from making inferences about the size of the risk, since we can apply the same statistical principles as to non-zero numerators.

Returning to our example, suppose that the incidence of fildulastrosis in the general population is 3 cases per 2000 people per year (1.5 per thousand, 0.15% or 0.0015). Can we infer with our experiment if taking antifildulin increases, decreases or does not modify the risk of suffering fildulastrosis? Following the familiar adage, yes, we can.

We will continue our habit of taking the null hypothesis as that of equal effect, so that the risk of disease is not modified by the new treatment. Thus, the risk of each of the 150 participants becoming ill throughout the study will be 0.0015. In other words, the risk of not getting sick will be 1-0.0015 = 0.9985. What will be the probability that none will get sick during the year of the study? Since there are 150 independent events, the probability that all 150 subjects stay healthy will be 0.9985^{150} = 0.8. We see, therefore, that although the risk is the same as that of the general population, with this number of patients we have an 80% chance of not detecting any event (fildulastrosis) during the study, so it would actually be more surprising to find a sick participant than not to find any. But the most surprising thing is that we have thus obtained the probability of having no sick subjects in our sample: the probability of observing no cases is not 0 (0/150), as my brother-in-law thinks, but 80%!

And the worst part is that, given this result, pessimism invades us: it is even possible that the risk of disease with the new drug is greater and we are not detecting it. Let’s assume that the risk with the medication is 1% (compared to 0.15% in the general population). The probability of none being sick would be (1-0.01)^{150} = 0.22. Even with a 2% risk, the probability of not observing any disease is (1-0.02)^{150} = 0.048. Remember that 5% is the value that we usually adopt as a “safe” limit to reject the null hypothesis without making a type 1 error.

At this point, we can ask ourselves whether we are very unfortunate and have simply not been lucky enough to detect cases of illness although the risk is high or, on the contrary, we are not so unfortunate and, in reality, the risk must be low. To clarify things, we can return to our usual 5% confidence limit and see for which values of the risk of getting sick with the treatment the probability of observing no cases stays above 5%:

– Risk of 1.5/1000: (1-0.0015)^{150} = 0.8.

– Risk of 1/1000: (1-0.001)^{150} = 0.86.

– Risk of 1/200: (1-0.005)^{150} = 0.47.

– Risk of 1/100: (1-0.01)^{150} = 0.22.

– Risk of 1/50: (1-0.02)^{150} = 0.048.

– Risk of 1/25: (1-0.04)^{150} = 0.002.
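The series above is easy to reproduce; a quick sketch:

```python
n = 150  # participants, none of whom showed the event
# Probability of observing zero cases among n subjects for several true risks
for risk in (0.0015, 0.001, 0.005, 0.01, 0.02, 0.04):
    p_no_cases = (1 - risk) ** n
    print(f"risk {risk}: P(no cases) = {p_no_cases:.3f}")
```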

As we see in the previous series, our “security” range of 5% is reached when the risk is below 1/50 (2% or 0.02). This means that, with a 5% probability of being wrong, the risk of fildulastrosis taking antifuldulin is equal to or less than 2%. In other words, the 95% confidence interval of our estimate would range from 0 to 0.02 (and not 0, if we calculate the probability in a simplistic way).

To prevent our reheated neurons from eventually melting, let’s see a simpler way to automate this process. For this we use what is known as the rule of 3. If we do the study with n patients and none present the event, we can affirm that the probability of the event is not zero, but less than or equal to 3/n. In our example, 3/150 = 0.02, the probability we calculate with the laborious method above. We will arrive at this rule after solving the equation we use with the previous method:

(1 – maximum risk)^{n} = 0.05

First, we rewrite it:

1 – maximum risk = 0.05^{1/n}

If n is greater than 30, 0.05^{1/n} approximates (n-3)/n, which is the same as 1-(3/n). In this way, we can rewrite the equation as:

1- maximum risk = 1 – (3/n)

With which we can solve the equation and get the final rule:

Maximum risk = 3/n.

You have seen that we have considered that n is greater than 30. This is because, below 30, the rule tends to overestimate the risk slightly, which we will have to take into account if we use it with reduced samples.
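We can check how good the approximation is by comparing the exact solution of the equation with the rule of 3 for a few sample sizes:

```python
def max_risk_exact(n, alpha=0.05):
    # Solve (1 - risk)^n = alpha for the maximum risk
    return 1 - alpha ** (1 / n)

def max_risk_rule_of_3(n):
    return 3 / n

# Below n = 30 the rule overestimates the risk slightly, as noted above
for n in (20, 30, 150, 1500):
    print(n, round(max_risk_exact(n), 4), round(max_risk_rule_of_3(n), 4))
```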

And with this we will end this post with some considerations. First, and as is easy to imagine, statistical programs calculate confidence intervals for the risk without much effort even if the numerator is zero. Similarly, it can also be done manually, and much more elegantly, by resorting to the Poisson probability distribution, although the result is similar to that obtained with the rule of 3.

Second, what happens if the numerator is not 0 but a small number? Can a similar rule be applied? The answer, again, is yes. Although there is no general rule, extensions of the rule have been developed for a number of events up to 4. But that’s another story…

The post When nothing bad happens, is everything okay? was first published on Science without sense...double nonsense.
