The chocolate fallacy

White, dark, as a filling, in squares, as drinking chocolate, powdered, in ice cream, with hazelnuts, with almonds, with fruit, milky, pure, fondant, bitter, in pies, in candy, in hot or cold drinks, etc., etc., etc. I like them all.

So you can easily imagine my joy when my RSS reader showed me the title of an article in the New England Journal of Medicine saying that there was a relationship between chocolate consumption and Nobel prizes. I could see myself eating chocolate galore, with my copy of the paper in my pocket to silence anyone who came to spoil the party by telling me I was going over the top with calories, fat, sugar, or whatever. At the end of the day, what could be more important than working towards a Nobel Prize?

At this point you can just as easily imagine my frustration when I read the paper and saw that there was a catch: it turned out to be an ecological study.

In the epidemiological studies we are most used to reading, the units of analysis are usually individual subjects. In ecological studies, however, the units of analysis are aggregates of individuals.

A summary measure of the frequency of the exposure and of the effect among the individuals in each aggregate is calculated, and at the end we check whether there is an association between exposure and effect across the different units.

There are two types of ecological studies. On the one hand, there are those that study frequency measures, such as incidence, mortality and so on, looking for geographical patterns that may be related to social, economic, genetic or other factors. On the other, there are those that study variations in frequency over time, in order to detect temporal trends and try to explain their cause.

These studies are usually simple and quick to perform, and are often based on data already available in registries or yearbooks, so they also tend not to be too expensive. The problem with ecological studies is that an association among the units of analysis does not necessarily mean that the same association exists at the level of individuals. If we take it for granted at the individual level, we run the risk of committing a sin known by the beautiful name of ecological fallacy. You can keep comparing every variable you can think of with the frequency of a particular disease until you find a significant association, and then find it impossible to come up with a plausible mechanism to explain it. In our example, it could even be the case that, at the individual level, the more chocolate you eat the more dulled your senses become, keeping you away from the coveted Nobel Prize.

And for those who do not believe me, let's look at a totally absurd, invented example. Suppose we want to know whether there is a relationship between watching television for more than four hours a day and being a strict vegetarian. It turns out that we have data from surveys in three cities, which we will call A, B and C to keep things simple.

If we calculate the prevalence of vegetarianism and of TV addiction, we'll see that both are 0.4 in A, 0.5 in B and 0.6 in C. It seems pretty clear: in cities with more people addicted to the boob tube there are more strict vegetarians, which may indicate that television is even more dangerous than previously thought.

But these are aggregate results. What happens at the individual level? We see that the odds ratios are 0.33 in A and C and 0.44 in B. So, surprisingly, even though the cities with more couch potatoes have more vegetarians, people bearing the couch-potato stigma have only a third to a half the odds (OR 0.33-0.44) of being strict vegetarians. This shows how important it is that the results of an ecological study be investigated afterwards with other, analytical study designs to explain them properly.
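To make the arithmetic concrete, here is a minimal sketch in Python of how the individual-level odds ratio is obtained from a 2x2 table. The cell counts for city A are invented by me so that they reproduce the figures quoted above (both prevalences 0.4 and an odds ratio of about 0.33); they are not the original survey data.

```python
# Hypothetical 2x2 table for city A (counts invented to match the
# prevalences of 0.4 and the odds ratio of about 0.33 quoted in the text).
veg_tv, nonveg_tv = 10, 30      # TV addicts: vegetarians / non-vegetarians
veg_no, nonveg_no = 30, 30      # non-addicts: vegetarians / non-vegetarians

total = veg_tv + nonveg_tv + veg_no + nonveg_no
prev_vegetarian = (veg_tv + veg_no) / total      # 0.4, the aggregate figure
prev_tv_addict = (veg_tv + nonveg_tv) / total    # 0.4, the aggregate figure

# Individual-level odds ratio of being a vegetarian, TV addicts vs the rest
odds_ratio = (veg_tv / nonveg_tv) / (veg_no / nonveg_no)   # ≈ 0.33

print(prev_vegetarian, prev_tv_addict, round(odds_ratio, 2))
```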

Just two more points before ending this post. First, may vegetarians forgive me, even the strict ones, and, why not, may those who watch TV for too long forgive me too. Second, we have seen that the chocolate fallacy is actually an ecological fallacy. But even when data are extracted from individual units, we must always remember that neither correlation nor association is synonymous with causality. But that's another story…

The backdoor

I wish I had a time machine! Think about it for a moment. We would not have to work (we would have won the lottery several times), we could anticipate all our misfortunes, always making the best decision… It would be like the movie "Groundhog Day", but without acting the fool.

Of course, if we had a time machine that worked, some occupations might disappear. Epidemiologists, for example, would have a hard time. If we wanted to know, say, whether tobacco is a risk factor for coronary heart disease, we would only have to take a group of people, tell them not to smoke and see what happened twenty years later. Then we would go back in time, require them to smoke, see what happened twenty years later and compare the results of the two experiments. How easy, wouldn't it be? Who would need an epidemiologist and all that complex science about associations and study designs? We could study the influence of the exposure (tobacco) on the effect (coronary heart disease) by comparing these two potential outcomes, also called counterfactual outcomes (pardon the barbarism).

However, since we do not have a time machine, the reality is that we cannot measure both outcomes in the same person and, although it may seem obvious, this actually means that we cannot directly measure the effect of the exposure on a particular person.

So epidemiologists resort to studying populations. Normally, a population will contain exposed and unexposed subjects, so we can try to estimate the counterfactual outcome of each group in order to calculate the average effect of the exposure on the population as a whole. For example, the incidence of coronary heart disease in non-smokers may serve to estimate what the incidence would have been in smokers if they had not smoked. This allows the difference in disease between the two groups (the difference between their factual outcomes), expressed as the appropriate measure of association, to serve as an estimate of the average effect of smoking on the incidence of coronary heart disease in the population.

All of this requires a prerequisite: the counterfactual outcomes have to be exchangeable. In our case, this means that the incidence of disease in smokers, had they not smoked, would have been the same as that of the non-smokers, who have never smoked. And vice versa: if the group of non-smokers had smoked, they would have had the same incidence as that observed in the actual smokers. This may seem like another truism, but it is not always the case, since in the relationship between exposure and effect there are frequently backdoors that make the counterfactual outcomes of the two groups non-exchangeable, so the measures of association cannot be estimated properly. This backdoor is what we call a confounding factor or confounding variable.

Let's clarify this a bit with a fictional example. In the first table I present the results of a cohort study (which I have just invented) that evaluates the effect of smoking on the incidence of coronary heart disease. The risk of disease is 0.36 (394/1090) among smokers and 0.34 (381/1127) among non-smokers, so the relative risk (RR, the relevant measure of association in this case) is 0.36 / 0.34 ≈ 1.05. I knew it! Just as Woody Allen said in "Sleeper"! Tobacco is not as bad as previously thought. Tomorrow I'll go back to smoking.
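As a quick check, the crude relative risk can be reproduced with a few lines of Python from the counts quoted above (the small difference from the 1.05 in the text is only due to rounding the two risks before dividing):

```python
# Crude (unadjusted) relative risk from the whole cohort table
cases_smokers, total_smokers = 394, 1090
cases_nonsmokers, total_nonsmokers = 381, 1127

risk_smokers = cases_smokers / total_smokers            # ≈ 0.36
risk_nonsmokers = cases_nonsmokers / total_nonsmokers   # ≈ 0.34
rr_crude = risk_smokers / risk_nonsmokers               # ≈ 1.07, i.e. close to 1

print(f"Crude RR = {rr_crude:.2f}")
```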

Are we sure? Mulling over the matter, it occurs to me that something may be wrong. The sample is large, so it is unlikely that chance has played a trick on me. The study does not appear to have a substantial risk of bias, although one can never be completely sure. So, assuming that Woody Allen was not right in his film, the only remaining possibility is that there is a confounding variable involved that is distorting our results.

A confounding variable must meet three requirements. First, it must be associated with the exposure. Second, it must be associated with the effect independently of the exposure we are studying. Third, it must not be part of the cause-effect chain between exposure and effect.

This is where the researcher's imagination comes into play, since he or she has to think about what may be acting as a confounder. In this case, the first thing that comes to my mind is age. It fulfills the second condition (the oldest are at higher risk of coronary heart disease) and the third (however bad tobacco may be, it does not increase your risk of getting sick by making you older). But does it fulfill the first condition? Is there an association between age and being a smoker? We had not thought about it before, but if this were so it could explain everything. For example, if smokers were younger, the harmful effect of tobacco could be offset by the "benefit" of their younger age. Conversely, the benefit the elderly get from not smoking would vanish because of the increased risk that comes with older age.

How can we check this? Let's separate the data for those younger and older than 50 years and recalculate the risks. If the relative risks are different, we will probably conclude that age is acting as a confounding variable; if they are equal, there will be no choice but to agree with Woody Allen.

Let's look at the table for the youngest. The risk of disease is 0.28 (166/591) in smokers and 0.11 (68/605) in non-smokers, so the RR is 2.5. Meanwhile, the risk of disease in those older than 50 years is 0.58 (227/387) in smokers and 0.49 (314/634) in non-smokers, so the RR equals 1.18. Sorry for those of you who smoke, but "Sleeper" was wrong: tobacco is bad.
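The same calculation, repeated within each age stratum, is sketched below with the figures just quoted; it is only a hand check of the stratified risks, not a full analysis.

```python
# Stratum-specific relative risks (age < 50 and age >= 50)
strata = {
    "under 50":    {"cases_smk": 166, "n_smk": 591, "cases_non": 68,  "n_non": 605},
    "50 and over": {"cases_smk": 227, "n_smk": 387, "cases_non": 314, "n_non": 634},
}

for name, s in strata.items():
    risk_smokers = s["cases_smk"] / s["n_smk"]
    risk_nonsmokers = s["cases_non"] / s["n_non"]
    print(f"{name}: RR = {risk_smokers / risk_nonsmokers:.2f}")
# Prints RR ≈ 2.50 for the under-50s and RR ≈ 1.18 for the over-50s,
# both above the crude RR of about 1.05: age was hiding the effect.
```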

This example shows how important what we said earlier about counterfactual outcomes being exchangeable really is. If the age distribution differs between the exposed and the unexposed, and we have the misfortune that age is a confounding variable, the outcome observed in smokers will no longer be exchangeable with the counterfactual outcome of non-smokers, and vice versa.

Can we avoid this effect? We cannot avoid the effect of a confounding variable, and it is an even bigger problem when we do not know that it may be playing its trick. It is therefore essential to take a number of precautions when designing the study, to minimize the risk of confounding and of leaving backdoors for the data to slip through.

One of these is randomization, with which we try to make both groups similar in terms of the distribution of confounding variables, both known and unknown. Another is to restrict inclusion in the study to a particular group, for instance those under 50 years in our example; the problem is that this cannot be done for unknown confounders. A third possibility is to use matched data, so that for every young smoker we include, we select a young non-smoker, and likewise for the elderly. To apply this matched selection we also need to know beforehand which variables may act as confounders.

And what do we do once we have finished the study and discovered, to our horror, that there is a backdoor? First, don't despair. We can always use the many resources of epidemiology to calculate an adjusted measure of association, which estimates the relationship between exposure and effect free of the confounding effect. There are several methods for doing this analysis, some simpler and some more complex, but all very stylish. But that's another story…
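One classic way of obtaining such an adjusted measure (not necessarily the method the author has in mind) is the Mantel-Haenszel pooled relative risk, which combines the stratum-specific tables into a single age-adjusted estimate. A minimal sketch with the figures from the example:

```python
# Mantel-Haenszel pooled relative risk across the two age strata.
# Each stratum: (exposed cases, exposed total, unexposed cases, unexposed total)
strata = [
    (166, 591, 68, 605),    # under 50
    (227, 387, 314, 634),   # 50 and over
]

numerator = sum(a * n0 / (n1 + n0) for a, n1, c, n0 in strata)
denominator = sum(c * n1 / (n1 + n0) for a, n1, c, n0 in strata)
rr_mh = numerator / denominator

print(f"Age-adjusted (Mantel-Haenszel) RR = {rr_mh:.2f}")   # ≈ 1.47 with these figures
```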

Residuals management

We live almost in a subsistence economy. We do not throw anything away. Even when there is no choice but to get rid of something, we would rather recycle it. Yes, recycling is a good practice, with its ecological and economic advantages. The thing is that waste can always be put to use.

But when it comes to statistics and epidemiology, not only are residuals not thrown away, they are important for interpreting the data from which they come. Don't believe it? Let's imagine an absurd but very illustrative example.

Suppose we want to know which kind of fish is preferred in Mediterranean Europe. The reason for wanting to know this must be so stupid that it has not yet occurred to me but, anyway, I run a survey among 5,281 people from four countries in Southern Europe.

The simplest and most useful thing to do first is what is usually done: to build a contingency table with the frequencies of the results, as shown below.

Contingency tables are often used to study the association or relationship between two qualitative variables. In our example, the two variables are the preferred fish and the place of residence. Normally, we try to explain one variable (the dependent one) as a function of the other (the independent one). In our example we want to know whether the respondent's nationality influences his or her food tastes.

The table of total values is informative in itself. For instance, we see that grouper and swordfish are preferred over hake, that Italians like tuna less than the Spanish do, and so on. However, large tables like ours can be laborious to handle, and it is difficult to draw many conclusions from the raw data. A useful alternative is therefore to build a table with row percentages, column percentages or both, as you can see below.

Column percentages come in handy for checking the effect of the independent variable (nationality, in our example) on the dependent variable (preferred fish). Row percentages, on the other hand, show the frequency distribution of the dependent variable for each category of the independent one (the country, in our case). Of the two, the most interesting are the column percentages: if we see clear differences among the categories of the independent variable (the countries), we will suspect that there may be a statistical association between the variables.

In our survey, the percentages within each column are quite different, so we suspect that not all fish are equally preferred in all countries. Of course, this must be quantified in an objective way in order to be sure that these results are not due to chance. How? Using residuals (the statistical name for our leftovers). We are going to see in a moment what they are and how to obtain them.

The first thing to do is to build a table with the values we would expect if everyone liked all fish equally, regardless of their country of origin. We need this because many tests of statistical association and significance are based on comparing observed and expected frequencies. To calculate the expected value of each cell under no relationship between the variables, we multiply its row marginal (the row total) by its column marginal (the column total) and divide by the grand total of the table. In this way we obtain the table with expected and observed values that I show you below.
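As an illustration of the mechanics only, here is a short Python sketch that computes the expected table under independence. The counts are invented (the survey's real figures are in the table above), so only the procedure matters, not the numbers.

```python
import numpy as np

# Invented counts, rows = fish, columns = countries (illustration only)
observed = np.array([
    [120,  90, 100,  80],   # hake
    [200, 150, 180, 140],   # tuna
    [ 80, 110,  90, 120],   # sea bream
])

row_totals = observed.sum(axis=1, keepdims=True)   # row marginals
col_totals = observed.sum(axis=0, keepdims=True)   # column marginals
grand_total = observed.sum()

# Expected count of each cell = (row marginal x column marginal) / grand total
expected = row_totals @ col_totals / grand_total
print(expected.round(1))
```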

If the variables are unrelated, the observed and expected values will be virtually the same, with small differences due to sampling error. If there are large differences, a relationship between the two variables is likely to be what explains them. And it is when the time comes to assess these differences that our residuals come into play.

A residual is simply the difference between the observed and expected values. We have already said that when residuals move away from zero they may be significant, but how far do they have to move?

We can transform a residual by dividing it by the square root of the expected value. This gives the standardized residual, also called the Pearson residual. In turn, if we divide the raw residual by its estimated standard deviation, the square root of Expected × (1 − RowProportion) × (1 − ColumnProportion), we obtain the adjusted residual. We can now build the residuals table shown below.
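Continuing the sketch above (same invented counts), the Pearson and adjusted residuals can be obtained cell by cell like this:

```python
import numpy as np

observed = np.array([          # same invented table as before
    [120,  90, 100,  80],
    [200, 150, 180, 140],
    [ 80, 110,  90, 120],
])
row_tot = observed.sum(axis=1, keepdims=True)
col_tot = observed.sum(axis=0, keepdims=True)
n = observed.sum()
expected = row_tot @ col_tot / n

raw = observed - expected                  # raw residuals
pearson = raw / np.sqrt(expected)          # standardized (Pearson) residuals

# Adjusted residuals: raw residual over its estimated standard deviation
row_prop, col_prop = row_tot / n, col_tot / n
adjusted = raw / np.sqrt(expected * (1 - row_prop) * (1 - col_prop))

print(np.round(pearson, 2))
print(np.round(adjusted, 2))   # values beyond ±2 point to significant cells
```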

The great usefulness of adjusted residuals is that they are standardized values, so we can compare residuals from different cells. Furthermore, adjusted residuals approximately follow a standard normal distribution (mean zero and standard deviation one), so we can use a computer program or a probability table to find the probability that a given residual's value is due to chance. In a normal distribution, roughly 95% of the values lie within the mean plus or minus two standard deviations. So, if an adjusted residual is greater than two or less than minus two, the probability that it is due to chance will be less than 5% and we can say that the residual is significant. For example, in our table we see that the French like sea bream more than would be expected if country did not influence food tastes while, at the same time, they abhor tuna.

Adjusted residuals allow us to assess significance cell by cell but, if we want to know whether there is a global association between the variables, we have to combine all the residuals. The sum of the squared Pearson residuals is the chi-square statistic, which follows a chi-square distribution with (rows − 1) × (columns − 1) degrees of freedom. If we calculate it for our table we come up with a chi-square of 368.3921 and a p-value < 0.001, so we conclude that there is a statistically significant relationship between the two variables.
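For the record, here is how that statistic is computed as the sum of squared Pearson residuals, again with the invented table from the sketches above (so the value will not match the 368.39 of the real survey):

```python
import numpy as np
from scipy.stats import chi2

observed = np.array([          # same invented table as before
    [120,  90, 100,  80],
    [200, 150, 180, 140],
    [ 80, 110,  90, 120],
])
expected = (observed.sum(1, keepdims=True) @ observed.sum(0, keepdims=True)
            / observed.sum())

chi2_stat = ((observed - expected) ** 2 / expected).sum()   # sum of squared Pearson residuals
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)     # (rows-1) x (columns-1)
p_value = chi2.sf(chi2_stat, dof)

print(f"chi-square = {chi2_stat:.2f}, df = {dof}, p = {p_value:.4f}")
# scipy.stats.chi2_contingency(observed) returns the same statistic in one call
```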

As you can see, residuals are very useful, not only for calculating the chi-square but also many other statistics. However, with contingency tables epidemiologists prefer to use other measures of association. This is because the chi-square, although it tells us whether there is statistical significance, gives no information about the strength of the association. For that we need other parameters, such as the relative risk and the odds ratio. But that's another story…