That’s not what it seems to be

I hope, for your own good, that you have never been in a situation in which you had to say this sentence. And I hope, also for your own good, that if you have had to say it, the sentence didn't begin with the word "darling". Did it? Let's leave that to everyone's conscience.

What is true is that we do have to ask ourselves this question in a much less delicate situation: when assessing the results of a cross-sectional study. It goes without saying, of course, that in these cases there's no use for the word "darling".

Cross-sectional descriptive studies are a type of observational study in which we draw a sample from the population we want to study and then measure the frequency of the disease or effect of interest in the individuals of that sample. When we measure more than one variable, these studies are called cross-sectional association studies, and they allow us to determine whether there is any kind of association between the variables.

But these studies have two characteristics that we must always keep in mind. First, they are prevalence studies: they measure frequency at a given moment, so the result may vary depending on when the variable is measured. Second, since all the measurements are made at the same time, it is difficult to establish a cause-and-effect relationship, something we all love to do. But it is something we should avoid doing, because with this type of study things are not always what they seem to be. Or rather, things can be many more things than what they seem.

What are we talking about? Let's consider an example. I'm getting a little bored with going to the gym because I'm becoming more and more tired, and my physical condition… well, let's just leave it at that I get tired. So I want to study whether the effort at least rewards me with better control of my body weight. Thus, I run a survey, get data from 1477 individuals of approximately my age, and ask them whether they go to the gym (yes or no) and whether they have a body mass index greater than 25 (yes or no). If you look closely at the results, you'll notice that the prevalence of overweight-obesity among those who go to the gym (50/751, about 7%) is higher than among those who don't (21/726, about 3%). Oh my goodness!, I think: not only do I get tired, but going to the gym gives me twice the chance of being fat. Conclusion: I'll quit the gym tomorrow.
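Just to make the arithmetic explicit, here is a minimal sketch in Python that lays out the two-by-two table from the example and computes both prevalences (the counts are the ones given in the text):

```python
# 2x2 table from the survey: overweight-obese (BMI > 25) by gym attendance.
gym_obese, gym_total = 50, 751      # gym-goers
nogym_obese, nogym_total = 21, 726  # non-gym-goers

prev_gym = gym_obese / gym_total
prev_nogym = nogym_obese / nogym_total

print(f"prevalence among gym-goers:     {prev_gym:.1%}")    # about 7%
print(f"prevalence among non-gym-goers: {prev_nogym:.1%}")  # about 3%
print(f"prevalence ratio: {prev_gym / prev_nogym:.1f}")     # a bit over 2
```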

Do you see how easy it is to reach an absurd (rather stupid, in this case) conclusion? But the data are there, so we have to find an explanation for why they suggest something that goes against common sense. And there are several possible explanations for these results.

The first is that going to the gym actually makes you fatter. It seems unlikely, but you never know… Imagine that working out motivates athletes to eat like wild beasts for the next six hours after a session.

The second is that obese people who go to the gym live longer than those who don't. Suppose that exercise prevents death from cardiovascular disease in obese patients. That would explain why there are (proportionally) more obese people in the gym than outside it: obese people who go to the gym die less than those who don't. At the end of the day, we are dealing with a prevalence study, so we only see the final picture at the time of measurement.

The third possibility is that the disease itself influences the frequency of exposure, which is known as reverse causality. In our example, there could be more obese people in the gym simply because that is the advice they receive as part of their treatment: join a gym. This doesn't sound as ridiculous as the first explanation.

But we still have more possible explanations. So far we have tried to explain an association between the two variables that we have assumed to be real. But what if the association is not real? How can we get a false association between the two variables? Again, there are three possible explanations.

First, our old friend: chance. Some of you will tell me that we can calculate statistical significance or confidence intervals, but so what? Even when a result is statistically significant, it only means that we can rule out the effect of chance, and only with some degree of uncertainty. Even with p < 0.05, there's always a chance of committing a type I error and erroneously dismissing the role of chance. We can measure chance, but we can never get rid of it.

The second is that we have introduced some kind of bias that invalidates our results. Sometimes the characteristics of the disease can lead to different probabilities of choosing exposed and unexposed subjects, resulting in a selection bias. Imagine that instead of a survey (by telephone, for example) we had used medical records. It may happen that obese people who go to the gym are more responsible about their health and see the doctor more often than those who don't go to the gym. In that situation, we would be more likely to include obese gym-goers in the study, overestimating the true proportion. Sometimes the study factor is somewhat stigmatizing from a social point of view, so diseased people will be less willing to participate in the study (and admit to their disease) than those who are healthy. In that case, we will underestimate the frequency of the disease.

In our example, it may be that obese people who don't go to the gym answer the survey lying about their true weight, so they end up wrongly classified. This misclassification bias can occur randomly in both the exposed and unexposed groups, thereby favoring the lack of association (the null hypothesis), so the association, if it exists, will be underestimated. The problem comes when the error is systematic in one of the two groups, since that can either underestimate or overestimate the association between exposure and disease.

And finally, the third possibility is that there is a confounding variable that is distributed differently between the exposed and the unexposed. It occurs to me that those who go to the gym are younger than those who don't, and it is quite possible that younger obese people are more likely to go to the gym. If we stratify the results by the confounding variable, age, we can determine its influence on the association, as the sketch below illustrates.
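To see how stratification works, here is a small sketch with invented counts (hypothetical numbers for illustration, not the study's data). Within each age stratum the prevalence of obesity is identical in gym-goers and non-gym-goers, yet the crude, unstratified figures still show gym-goers as "fatter", simply because gym-goers are younger:

```python
# Hypothetical counts, invented for illustration: (obese, total) per group.
strata = {
    "young": {"gym": (48, 600), "no gym": (12, 150)},  # prevalence 8% in both
    "older": {"gym": (3, 150),  "no gym": (12, 600)},  # prevalence 2% in both
}

for age, groups in strata.items():
    for group, (obese, total) in groups.items():
        print(f"{age} / {group}: prevalence = {obese / total:.1%}")

# The crude (pooled) prevalences still differ by a factor of about two,
# driven entirely by the age mix: 51/750 = 6.8% vs 24/750 = 3.2%.
```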

To finish, I only want to apologize to all the obese people in the world for using them in the example but, for once, I wanted to leave the smokers alone.

As you can see, things are not always what they seem at first glance, so results should be interpreted with common sense and in the light of existing knowledge, without falling into the trap of establishing causal relationships from associations detected in observational studies. To establish cause and effect we always need experimental studies, the paradigm of which is the clinical trial. But that's another story…

Playing with powers

Numbers are very peculiar creatures. It sometimes seems incredible what can be achieved by operating with a few of them: you can even get other numbers that express entirely different things. This is the case of the process by which we take the values of a distribution and, starting from their arithmetic mean (a measure of central tendency), calculate how far from it the rest of the values lie, raising the differences to successive powers to obtain measures of dispersion and even of symmetry. I know it seems impossible, but I swear it's true. I've just read it in a pretty big book. I'll tell you how…

Once we know what the arithmetic mean is, we can calculate the average separation of each value from it. We subtract the mean from each value, add up all these differences, and divide the sum by the total number of values (like calculating the arithmetic mean of the deviations of each value from the mean of the distribution). But there is one problem: as the mean is always in the middle (hence its name), the differences with the highest values (which are positive) will cancel out with those of the lowest values (which are negative), and the result will always be zero. This is logical, and it is an intrinsic property of the mean, which balances the deviations on both sides of it. Since we cannot change the nature of the mean, what we can do is take the absolute value of each difference before adding them. And so we calculate the mean deviation, which is the average of the absolute values of the deviations from the arithmetic mean.
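In symbols (writing $x_i$ for each of the $n$ values and $\bar{x}$ for their arithmetic mean), the mean deviation is:

$$D_m = \frac{1}{n}\sum_{i=1}^{n}\left|x_i - \bar{x}\right|$$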

And here begins the game of powers. If we add the squared differences instead of their absolute values, we come up with the variance, which is the average of the squared deviations from the mean. We know that if we take the square root of the variance (recovering the original units of the variable) we get the standard deviation, the queen of the measures of dispersion.
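With the same notation, the variance and the standard deviation are:

$$s^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2, \qquad s = \sqrt{s^2}$$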

And what if we raise the differences to the third power instead of squaring them? Then we get the average of the cubes of the deviations of the values from the mean. If you think about it, you'll realize that cubing does not get rid of the negative signs. Thus, if the deviations towards lower values dominate, the result will be negative (the distribution is skewed to the left) and, if those towards higher values dominate, it will be positive (the distribution is skewed to the right). One last detail: to compare this symmetry index across distributions, we can standardize it by dividing it by the cube of the standard deviation, as in the formula below. The truth is that it looks a little scary at first sight, but don't worry: any statistical software can do this and even worse things.
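That is, the standardized symmetry index (the skewness) is:

$$\gamma_1 = \frac{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^3}{s^3}$$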

And as an example of something worse, what if we raise the differences to the fourth power instead of the third? Then we calculate the average of the fourth powers of the deviations of the values from the mean. If you think about it for a second, you'll quickly understand its usefulness. Raising to the fourth power removes the sign again, but it amplifies the large deviations far more than the small ones, so the extreme values of the distribution dominate the result. Standardizing this parameter by dividing it by the fourth power of the standard deviation gives the kurtosis, which tells us how the values are shared out between the peak and the tails of the curve; and this leads me to introduce three more strange words: a very sharp, heavy-tailed distribution is called leptokurtic, a flat one with its values more evenly scattered is called platykurtic and, if it's neither one thing nor the other, mesokurtic.
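And the standardized fourth-power version, the kurtosis:

$$\gamma_2 = \frac{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^4}{s^4}$$

(One caveat: many statistical packages report the excess kurtosis, $\gamma_2 - 3$, so that a normal distribution scores zero.)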

And what if we raise the differences to the fifth power? Well, I don't know what would happen. Fortunately, as far as I know, no one has yet thought of committing such a rudeness.

All these calculations of measures of central tendency, dispersion and symmetry may seem like the delirium of someone with too little work to do, but don't be deceived: they are very important, not only to summarize a distribution properly, but also to determine the type of statistical test to use when we want to perform a hypothesis test. But that's another story…

Not all deviations are perverse

I dare even say that some deviations are much needed. But let nobody get too enthusiastic ahead of time: although that sentence could mean almost anything, what we're going to talk about is how the values of a quantitative distribution vary.

When we obtain the values of a given parameter in a sample and want a summary idea of how it behaves, the first thing that comes to mind is to calculate a measure that represents it, so we turn to the mean, the median or some other measure of central tendency.

However, the central value on its own gives little information unless it is complemented by another measure that tells us about the heterogeneity of the results in the distribution. To quantify the degree of variation, mathematicians, with very little imagination, invented a thing called the variance.

To calculate it, we subtract the mean from each individual value, add up all these differences, and divide the result by the number of measurements. It's like calculating the mean of the differences of each value from the central value of the distribution. But there's a slight problem: as values lie both above and below the mean (it must be so; that's what an average is), the positive and negative differences cancel each other out, and we can get a value close to zero even when the degree of variation is large, especially if the distribution is symmetric. To avoid this, we square the differences before adding them, so any negative signs disappear. In this way, we always come up with a positive value related to the size of the differences. This value is what is known as the variance.

For example, let’s suppose we measure the systolic blood pressure to 200 schoolchildren randomly selected and we get an average of 100 mmHg. We begin to subtract the mean from each value, square the differences, and add up all the squares dividing the result by 200 (the number of determinations). We come up with a variance of, for instance, 100 mmHg2. And I wonder, what the heck is a square millimeter of mercury?. Variance might describe well variation, but I do not deny it’s a bit difficult to interpret. Again, some mathematical genius runs to our rescue with the solution: to do the square root of variance and thus recovering the original units of the variable. Doing so, we get to the most famous of deviations: the standard deviation. In our case it could be, let’s say, 10 mmHg. If we consider the two parameters, we can get the idea that most of the pupils will have a blood pressure close to the mean. If we had obtained a standard deviation of 50 mmHg we would think that there was much individual variation in blood pressure determinations, although the mean of the sample was the same.

A clarification for the sake of purists: usually, the sum of squared differences is divided by the number of cases minus one (n-1) instead of by the number of cases (n), which would seem more logical. Why? For a reason that is less arcane than it seems: the sample mean is, by construction, slightly closer to the sample's own values than the true population mean is, so dividing by n systematically underestimates the variance; dividing by n-1 corrects this, and the value we get is closer to that of the population from which the sample comes.

We have, therefore, the two values that define our distribution. And the good thing is that they not only give us an idea of the central value and the dispersion of the data, but also of the probability of finding an individual with a certain value in the sample. We know that 95% of them will have a value between the mean plus or minus two standard deviations (1.96, to be exact) and 99% between the mean plus or minus 2.5 standard deviations (2.58, actually).

This sounds dangerously similar to the 95% and 99% confidence intervals, but we should not confuse the two. If we repeated the blood pressure experiment a very large number of times, we would get a slightly different mean each time. We could then calculate the mean of the results of all those experiments, and the standard deviation of that set of means. That standard deviation is what is known as the standard error, and it is the value we use to calculate confidence intervals, within which lies the actual value in the population from which the sample comes; a value that we can neither measure directly nor know exactly. So the standard deviation tells us about the dispersion of the data in the sample, while the standard error gives us an idea of the precision with which we can estimate the true value of the variable in the population from which we drew our sample.
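A quick simulation makes the distinction tangible. Assuming, purely for illustration, a population with mean 100 mmHg and standard deviation 10 mmHg, the standard deviation of the many sample means matches the theoretical standard error, s/√n, not the standard deviation of the individual values:

```python
import numpy as np

rng = np.random.default_rng(1)
n, repeats = 200, 10_000

# Each row is one repetition of the experiment: a sample of 200 children
# drawn from a (hypothetical) population with mean 100 and SD 10.
means = rng.normal(loc=100, scale=10, size=(repeats, n)).mean(axis=1)

print(f"empirical SD of the sample means:      {means.std():.3f}")
print(f"theoretical standard error, s/sqrt(n): {10 / n ** 0.5:.3f}")  # ~0.71
```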

One last thought about the standard deviation. Although 95% of the population lies within the interval formed by the mean plus or minus two standard deviations, this only makes sense if the distribution is reasonably symmetrical. For markedly skewed distributions, the standard deviation loses much of its meaning and we use other measures of dispersion. But that's another story…