The whys and wherefores

Do you remember the last post about girls, their level of education and their unhealthy habits? I’ll give a brief reminder for the sake of those with a poor memory.

As it happened, we interviewed 585 girls to find out their educational level and whether they smoked, drank alcohol, did both or did neither. With the results obtained we built the contingency table that I attach again here.

We wanted to know whether there was any relationship between educational level and bad habits, so we began by setting our null hypothesis: that both qualitative variables were independent.

So we proceeded to perform a chi-square test to reach a conclusion. The first thing we had to do was calculate the expected value of each cell, which is very easy because you just have to multiply the row total by the column total and divide the result by the total of the table… Stop right there! Why is that so? Where does this rule come from? Do you know why that product divided by the total equals the expected number for the cell? It’s a good thing to have rules to make our work easier, but I like to know where things come from and I’m sure that few of you have ever thought about it. Let’s see why.

First, let’s keep in mind that we are going to reason under the assumption of the null hypothesis that the variables harmful habits and educational level are independent. We are going to calculate the expected value of the cell corresponding to high school students with two harmful habits.

Since both situations (having studied up to high school, and both smoking and drinking) are independent, the probability that both of them happen equals the product of their probabilities:

P(high school and both habits) = P(high school) x P(both habits)

We know that P(high school) equals the total number of girls who studied up to high school divided by the total number of respondents. Similarly, P(both habits) equals the number of girls who both drink and smoke divided by the total number of respondents (the total of the table). If we substitute these values into the expression above, we get:

P(high school and both habits) = (223/585) x (303/585)

We now know the probability that any one girl belongs to that cell. What is the expected number? The answer is very simple: that probability multiplied by the total number of girls interviewed:

Expected(high school and both habits) = 585 x (223/585) x (303/585)

If we cancel out one 585 in the numerator and the denominator and simplify the expression, we’ll get:

Expected(high school and both habits) = (223 x 303) / 585

That is nothing more than the row’s marginal times the column’s marginal divided by the total of the table, which in our example gives 115.5.
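For those who prefer to see the arithmetic run, here is a minimal sketch in Python that uses only the three totals quoted above. The variable names are mine and purely illustrative; it simply checks that the long route (probability times respondents) and the shortcut rule land on the same number.

```python
# Marginal totals quoted in the text.
row_total = 223      # girls who studied up to high school
col_total = 303      # girls who both smoke and drink
grand_total = 585    # all girls interviewed

# Under independence, the joint probability is the product of the marginals.
p_cell = (row_total / grand_total) * (col_total / grand_total)

# Expected count = joint probability times the number of respondents.
expected_long = grand_total * p_cell

# The shortcut rule: row total times column total over the grand total.
expected_short = row_total * col_total / grand_total

print(round(expected_long, 1), round(expected_short, 1))  # prints 115.5 115.5
```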

So, you can see where the rule for calculating the expected number of occurrences in a contingency table comes from. Of course, you know that to find out whether or not the variables are independent you still have to square the differences between observed and expected values, standardize them by dividing each one by its expected value, add them up and get the probability using a chi-square distribution. But that’s another story…

The table

There are plenty of tables, and they play a great role throughout our lives. Perhaps the first one that strikes us during early childhood is the multiplication table. Who doesn’t remember, at least among the older of us, how we used to repeat like parrots that two times one equals two, two times… until we learned it by heart? But as soon as we had mastered the multiplication tables we bumped into the periodic table of the elements. Memorizing again, this time aided by idiotic and impossible mnemonics about some Indians who Gained Bore I-don’t-know-what.

But it was over the years that we found the worst table of all: the food composition table, with its cells full of calories. This table pursues us even in our dreams. And that’s because eating a lot has many drawbacks, most of which are discovered with the aid of another table: the contingency table.

Contingency tables are used very frequently in Epidemiology to analyze the relationship between two or more variables. They consist of rows and columns. Groups by level of exposure to the study factor are usually represented in the rows, while categories related to the health problem under investigation are usually placed in the columns. Rows and columns intersect to form cells, each of which holds the frequency of its particular combination of variables.

The most common table represents two variables (our beloved 2×2 table), one dependent and one independent, but this is not always true. There may be more than two variables and, sometimes, there may be no direction of dependence between variables before doing the analysis.

The simpler 2×2 tables allow us to analyze the relationship between two dichotomous variables. Depending on the content and the design of the study they belong to, their cells may have slightly different meanings, just as different parameters can be calculated from the data in the table.

The first ones we’re going to talk about are the tables of cross-sectional studies. This type of study represents a sort of snapshot of our sample that allows us to study the relationship between the variables. They are, therefore, prevalence studies and, although data can be collected over a period of time, the result only represents the snapshot we have already mentioned. The dependent variable (disease status) is placed in the columns and the independent variable (exposure status) in the rows, so we can calculate a series of frequency, association and statistical significance measures.

The frequency measures are the prevalence of disease among exposed (EXP) and unexposed (NEXP) and the prevalence of exposure among diseased (DIS) and non-diseased (NDIS). These prevalences represent the number of sick, healthy, exposed and unexposed in relation to each group total, so they are proportions estimated at a precise moment.

The measures of association are the ratios between the prevalences just mentioned, according to exposure and disease status, and the odds ratio, which tells us how much more likely the disease is to occur in exposed (EXP) versus non-exposed (NEXP) people. If these parameters have a value greater than one, it indicates that the exposure is a risk factor for the disease. On the contrary, a value equal to or greater than zero and less than one indicates a protective factor. And if the value equals one, it will be neither fish nor fowl.
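To make these measures concrete, here is a small Python sketch. The counts are invented for illustration (they do not come from any study mentioned here), and the cell names a, b, c and d are just a common labelling convention that I am assuming: exposure in rows, disease in columns.

```python
# Hypothetical 2x2 cross-sectional table (counts invented for illustration).
#            DIS   NDIS
# EXP         a     b
# NEXP        c     d
a, b, c, d = 40, 60, 20, 80

# Frequency measures: prevalence of disease by exposure status.
prev_dis_exp = a / (a + b)
prev_dis_nexp = c / (c + d)

# Frequency measures: prevalence of exposure by disease status.
prev_exp_dis = a / (a + c)
prev_exp_ndis = b / (b + d)

# Association measures: ratio of prevalences and odds ratio (cross products).
prevalence_ratio = prev_dis_exp / prev_dis_nexp
odds_ratio = (a * d) / (b * c)

def interpret(value):
    """Read a ratio the way the text does: above, below or equal to one."""
    if value > 1:
        return "risk factor"
    if value < 1:
        return "protective factor"
    return "neither fish nor fowl"

print(prevalence_ratio, interpret(prevalence_ratio))
print(round(odds_ratio, 2), interpret(odds_ratio))
```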

Finally, as in all the types of tables that we’ll mention, you can calculate statistical significance measures, mainly chi-square with or without correction, Fisher’s exact test and the p value, one-tailed or two-tailed.

Very much like the tables we’ve just seen are those of case-control studies. This study design tries to find out if different levels of exposure can explain different levels of disease. Cases and controls are placed in columns and exposure status (EXP and NEXP) in rows.

The measures of frequency that we can calculate are the proportion of exposed cases (based on the total number of cases) and the proportion of exposed controls (based on the total number of controls). Obviously, we can also come up with the proportions of non-exposed calculating the complementary values of the aforementioned ones.

The key measure of association is the odds ratio, which we already know and on which we are not going to spend much time. All of us know that, in the simplest way, we can calculate its value as the ratio of the cross products of the table and that it tells us how much more likely the disease is to occur in exposed than in non-exposed people. The other measure of association is the exposed attributable fraction (ExpAR), which indicates the proportion of exposed patients who are sick due to the direct effect of the exposure.

When handling this type of table, we can also calculate a measure of impact: the population attributable fraction (PopAR), which tells us what would happen in the population if we eliminated the exposure factor. If the exposure factor is a risk factor, the impact will be positive. Conversely, if we are dealing with a protective factor, the impact of eliminating it will be negative.
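As a rough illustration of the case-control measures, the sketch below uses invented counts. The formulas for ExpAR and PopAR are the ones commonly given for case-control data, and they are my assumption here: (OR - 1) / OR for the exposed, and that value times the proportion of exposed cases for the population. A calculator such as the CASPe one may use different expressions.

```python
# Hypothetical case-control table (counts invented for illustration).
#            Cases  Controls
# EXP          a       b
# NEXP         c       d
a, b, c, d = 60, 30, 40, 70

# Frequency measures: proportion of exposed among cases and among controls.
prop_exposed_cases = a / (a + c)
prop_exposed_controls = b / (b + d)

# Odds ratio as the ratio of the cross products of the table.
odds_ratio = (a * d) / (b * c)

# Attributable fraction in the exposed, using the odds ratio in place of the
# relative risk (assumed formula: (OR - 1) / OR).
exp_ar = (odds_ratio - 1) / odds_ratio

# Population attributable fraction (assumed formula: proportion of exposed
# cases times the attributable fraction in the exposed).
pop_ar = prop_exposed_cases * exp_ar

print(round(odds_ratio, 2), round(exp_ar, 3), round(pop_ar, 3))
```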

With this type of study design, the statistical significance measures will differ depending on whether we are handling paired data (McNemar test) or unpaired data (chi-square, Fisher’s exact test and p value).
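For paired data, a minimal sketch of the McNemar statistic could look like this. The pair counts are invented, the version shown is the one without continuity correction, and the p value relies on the fact that a chi-square tail with one degree of freedom can be written with the complementary error function from Python’s standard math module.

```python
import math

# Hypothetical matched case-control pairs (counts invented for illustration).
# Only discordant pairs carry information for the McNemar test:
case_exposed_control_not = 25   # pairs where only the case was exposed
control_exposed_case_not = 10   # pairs where only the control was exposed

diff = case_exposed_control_not - control_exposed_case_not
discordant = case_exposed_control_not + control_exposed_case_not

# McNemar statistic without continuity correction.
mcnemar_stat = diff ** 2 / discordant

# With one degree of freedom, the chi-square tail probability can be written
# with the complementary error function: P(X > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(mcnemar_stat / 2))

print(round(mcnemar_stat, 3), round(p_value, 4))
```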

The third type of contingency table is the one corresponding to cohort studies, although its structure differs slightly depending on whether you count total cases along the entire period of the study (cumulative incidence) or you consider the time period of the study, the time of onset of disease in cases and the different follow-up times among groups (incidence rate or incidence density).

Tables from cumulative incidence (CI) studies are similar to those we have seen so far. Disease status is represented in columns and exposure status in rows. In contrast, incidence density (ID) tables represent the number of patients in the first column and the follow-up, in patient-years, in the second column, so that those with longer follow-up have greater weight when calculating measures of frequency, association, etc.

The measures of frequency are the EXP risk (Re) and the NEXP risk (Ro) for CI studies and EXP and NEXP incidence rates in ID studies.

We can calculate ratios and differences of the above measures to come up with the association measures: relative risk (RR), absolute risk reduction (ARR) and relative risk reduction (RRR) for CI studies and incidence density reduction (IRD) for ID studies. In addition, we can also calculate ExpAR as we did in the case-control study, as well as a measure of impact: PopAR.
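Here is a short Python sketch of the cohort measures, again with invented numbers. I have chosen a protective exposure so that the reductions come out positive, and I am assuming that the incidence density measure is taken as the unexposed rate minus the exposed rate; other sources define the difference the other way round.

```python
# Hypothetical cumulative incidence table (counts invented for illustration).
#            DIS   NDIS
# EXP         a     b
# NEXP        c     d
a, b, c, d = 10, 90, 30, 70

risk_exposed = a / (a + b)        # Re
risk_unexposed = c / (c + d)      # Ro

rr = risk_exposed / risk_unexposed          # relative risk
arr = risk_unexposed - risk_exposed         # absolute risk reduction
rrr = arr / risk_unexposed                  # relative risk reduction (1 - RR)

# Hypothetical incidence density table: cases and patient-years per group.
cases_exp, years_exp = 10, 400.0
cases_nexp, years_nexp = 30, 300.0

rate_exposed = cases_exp / years_exp
rate_unexposed = cases_nexp / years_nexp

# Difference between incidence rates (taken here as unexposed minus exposed).
ird = rate_unexposed - rate_exposed
rate_ratio = rate_exposed / rate_unexposed

print(round(rr, 3), round(arr, 3), round(rrr, 3))
print(round(ird, 3), round(rate_ratio, 3))
```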

We can also calculate the odds ratios if we want, but they are generally much less used in this type of study design. In any case, we know that RR and odds ratio are very similar when disease prevalence is low.
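A quick numerical check of that last claim, using two invented tables: one where the disease is rare and one where it is common. The function names are mine; both tables are built so that the relative risk is the same.

```python
def relative_risk(a, b, c, d):
    """RR for a table with exposed row (a, b) and unexposed row (c, d)."""
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """OR as the ratio of the cross products of the same table."""
    return (a * d) / (b * c)

# Hypothetical rare disease: 0.2% of exposed and 0.1% of unexposed get sick.
rare = (2, 998, 1, 999)
# Hypothetical common disease: 40% of exposed and 20% of unexposed get sick.
common = (400, 600, 200, 800)

for name, table in (("rare", rare), ("common", common)):
    print(name, round(relative_risk(*table), 3), round(odds_ratio(*table), 3))
```

With the rare disease both measures almost coincide, while with the common one the odds ratio drifts noticeably away from the relative risk.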

To finish with this kind of table, we can calculate the statistical significance measures: chi-square, Fisher’s test and p value for CI studies and other association measures for ID studies.

As always, all these calculations can be done by hand, although I recommend that you use a calculator, such as the one available at the CASPe site. It’s easier and faster and, furthermore, it gives us all these parameters with their confidence intervals, so we can also estimate their precision.

And with this we come to the end. There are more types of tables, with multiple levels for managing more than two variables, stratified according to different factors and so on. But that’s another story…

Residuals management

We live in a nearly subsistence economy. We do not throw anything away. Even when there’s no choice but to discard something, it gets recycled instead. Yes, recycling is a good practice, with its ecological and economic advantages. And the thing is that residues are always usable.

But when it comes to statistics and epidemiology, residues are not only kept: they are important for interpreting the data from which they come. Does anyone not believe it? Let’s imagine an absurd but very illustrative example.

Suppose we want to know what kind of fish is the most preferred in Mediterranean Europe. The reason for wanting to know this must be so stupid that it has not yet occurred to me but, anyway, I do a survey among 5,281 people from four countries in Southern Europe.

The simplest and most useful thing to do in the first place is what is almost always done: to build a contingency table with the frequencies of the results, such as the one I show you below.

Contingency tables are often used to study the association or relationship between two qualitative variables. In our example, both variables are the preferred fish and the place of residence. Normally, you try to explain a variable (dependent) as a function of the other one (independent). In our example we want to know if the respondent’s nationality influences his or her food tastes.

The table of total values is informative in itself. For instance, we see that grouper and swordfish are preferred over hake, that Italians like tuna less than Spaniards do, etc. However, handling large tables like ours can be laborious, and it is difficult to draw many conclusions from raw data. Therefore, a useful alternative is to build a table with row percentages, column percentages or both, as you can see below.

It comes in handy to compare column percentages to check the effect of the independent variable (nationality, in our example) on the dependent variable (preferred fish). Row percentages, in turn, show the frequency distribution of the dependent variable for each category of the independent one (the country, in our case). But, of the two, the more interesting ones are the column percentages: if we see clear differences among the categories of the independent variable (countries) we’ll suspect that there may be a statistical association between the variables.
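Row and column percentages are easy to get with a few lines of Python. The counts below are invented for illustration and are NOT the survey data; I am also assuming that countries go in the rows and fish in the columns.

```python
# Hypothetical survey counts (invented; NOT the data from the post).
# Rows are countries, columns are preferred fish.
countries = ["Spain", "Italy", "France"]
fishes = ["hake", "tuna", "sea bream"]
observed = [
    [90, 60, 50],
    [30, 80, 40],
    [30, 60, 160],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

# Row percentages: each cell over its row total.
row_pct = [[100 * cell / row_totals[i] for cell in row]
           for i, row in enumerate(observed)]

# Column percentages: each cell over its column total.
col_pct = [[100 * cell / col_totals[j] for j, cell in enumerate(row)]
           for row in observed]

for country, pcts in zip(countries, col_pct):
    print(country, [round(p, 1) for p in pcts])
```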

In our survey, the percentages within each column are very different, so we suspect that not all fish are equally preferred in all countries. Of course, this must be quantified objectively in order to be sure that these results are not due to chance. How? Using residuals (the way residues are called in statistics). We’re going to see what they are and how to get them in a while.

The first thing to do is to build a table with the values we would expect if fish preferences were the same in every country. We need to do that because many statistical association and significance tests are based on the comparison between observed and expected frequencies. To calculate the expected value of each cell as if there were no relationship between the variables, you multiply the row marginal (the row total) by the column marginal and divide the product by the total of the table. So, we obtain the table with expected and observed values that I show you below.
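The rule of marginals can be applied to a whole table at once. This is only a sketch: the function name expected_counts is mine and the counts are invented, not the survey data.

```python
def expected_counts(observed):
    """Expected cell counts under independence: row total x column total / N."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# Hypothetical counts (invented; NOT the survey data from the post).
observed = [
    [90, 60, 50],
    [30, 80, 40],
    [30, 60, 160],
]

expected = expected_counts(observed)
for row in expected:
    print([round(e, 1) for e in row])
```

A handy sanity check is that the expected table keeps the same row totals, column totals and grand total as the observed one.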

If the variables are unrelated, observed and expected values will be virtually the same, with small differences due to sampling error. If there are large differences, there is likely a relationship between the two variables that explains them. And it is when the time comes to assess these differences that our residuals come into play.

A residual is just the difference between the observed and expected values. We already said that when residuals move away from zero they may be significant but, how far do they have to move away?

We can transform a residual by dividing it by the square root of the expected value. That gives us the standardized residual, also called the Pearson residual. Alternatively, the raw residual can be divided by an estimate of its standard deviation (square root[(Expected*(1-RowProportion)*(1-ColumnProportion))]), thus obtaining the adjusted residual, which is the same as dividing the Pearson residual by square root[(1-RowProportion)*(1-ColumnProportion)]. We can now build the residuals table shown below.
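The three kinds of residual can be sketched in Python as follows. I am taking the raw residual as observed minus expected and the adjusted residual as the raw residual divided by the square root of Expected*(1-RowProportion)*(1-ColumnProportion). The function name and the counts are invented for illustration.

```python
import math

def residual_tables(observed):
    """Return (raw, Pearson, adjusted) residuals for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    raw, pearson, adjusted = [], [], []
    for i, row in enumerate(observed):
        raw_row, pearson_row, adjusted_row = [], [], []
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            row_prop = row_totals[i] / n
            col_prop = col_totals[j] / n
            diff = obs - exp                      # raw residual
            raw_row.append(diff)
            pearson_row.append(diff / math.sqrt(exp))
            adjusted_row.append(
                diff / math.sqrt(exp * (1 - row_prop) * (1 - col_prop)))
        raw.append(raw_row)
        pearson.append(pearson_row)
        adjusted.append(adjusted_row)
    return raw, pearson, adjusted

# Hypothetical counts (invented; NOT the survey data from the post).
observed = [
    [90, 60, 50],
    [30, 80, 40],
    [30, 60, 160],
]

raw, pearson, adjusted = residual_tables(observed)
for row in adjusted:
    print([round(value, 2) for value in row])
```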

The great usefulness of adjusted residuals is that they are standardized values, so we are allowed to compare residuals from different cells. Furthermore, if the variables are independent, adjusted residuals approximately follow a standard normal distribution (with mean zero and standard deviation one), so we can use a computer program or a probability table to find out how likely a given residual value is to be due to chance. In a normal distribution, 95% of the values lie roughly within the mean plus or minus two standard deviations. So, if the adjusted residual’s value is greater than two or less than minus two, the probability that this value is due to chance will be less than 5% and we’ll be able to say that the residual is significant. For example, in our table we see that French people like sea bream more than would be expected if the country did not influence food taste, while, at the same time, they abhor tuna.
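If you don’t have a probability table at hand, the two-tailed probability for a standard normal value can be obtained with the complementary error function of Python’s standard math module. The helper names below are mine and the residual values are made up.

```python
import math

def two_sided_p(z):
    """Two-tailed probability of a standard normal value at least as extreme as z."""
    return math.erfc(abs(z) / math.sqrt(2))

def is_significant(adjusted_residual, alpha=0.05):
    """True when the two-tailed probability falls below alpha."""
    return two_sided_p(adjusted_residual) < alpha

# A few made-up adjusted residuals.
for z in (0.5, 1.96, 2.0, -3.1):
    print(z, round(two_sided_p(z), 4), is_significant(z))
```

The exact 5% cut-off is about 1.96, which is why two is used as a convenient rule of thumb.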

Adjusted residuals allow us to assess significance in each cell but, if we want to know whether there’s a global association between the variables, we have to square the Pearson residuals of all the cells and add them up. This is because that sum also follows a known distribution, but this time it’s a chi-square distribution with (rows-1) x (columns-1) degrees of freedom. If we calculate the value for our table we’ll come up with a chi-square = 368.3921 with a p value <0.001, so we’ll conclude that there’s a statistically significant relationship between the two variables.
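For the global test, here is a sketch of the usual chi-square statistic, computed as the sum over all cells of (observed - expected) squared divided by expected, that is, the sum of the squared Pearson residuals. The counts are invented, so the result has nothing to do with the 368.3921 of the survey. To keep the example free of external libraries I only compare against the well-known 5% critical value for 4 degrees of freedom; a statistics package such as SciPy could provide the exact p value.

```python
def chi_square(observed):
    """Return the chi-square statistic and its degrees of freedom."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            # Squared Pearson residual of this cell.
            statistic += (obs - exp) ** 2 / exp
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof

# Hypothetical counts (invented; NOT the survey data from the post).
observed = [
    [90, 60, 50],
    [30, 80, 40],
    [30, 60, 160],
]

statistic, dof = chi_square(observed)

# Critical value of the chi-square distribution for 4 degrees of freedom
# at the 5% level, as found in standard tables.
CRITICAL_5_PERCENT_4_DF = 9.488

print(round(statistic, 2), dof, statistic > CRITICAL_5_PERCENT_4_DF)
```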

As you see, residuals are very useful, not only to calculate chi-square, but also to calculate many other statistics. However, epidemiologists prefer to use other measures of association with contingency tables. And this is because chi-square has no fixed scale and, although it tells us whether there’s statistical significance, it gives no information about the strength of the association. For that we need other parameters that can be interpreted directly, such as the relative risk and the odds ratio. But that’s another story…