A matter of pairs

We saw in the previous post how observational studies, in particular cohort and case-control studies, are full of traps and loopholes. One of these traps is the backdoor through which the data may elude us, leading to erroneous estimates of the measures of association. This backdoor is kept ajar by confounding factors.

We know that there are several ways to control confounding. One of them, matching, has its peculiarities depending on whether we employ it in a cohort study or in a case-control study.

When it comes to cohort studies, matching by the confounding factor allows us to obtain an adjusted measure of association, because we control the influence of the confounding variable on both the exposure and the effect. However, this no longer holds when the matching technique is used in a case-control study. The design of this type of study forces us to do the matching once the effect has already occurred. Thus, the patients who act as controls are no longer a set of independent individuals chosen at random, since each control is selected because it fulfills a series of criteria established by the case with which it is going to be paired. This, of course, prevents us from selecting other individuals in the population who do not meet the specified criteria but could potentially have been included in the study. If we forget this little detail and apply the same methodology of analysis that we would use in a cohort study, we will incur a selection bias that will invalidate our results. In addition, although we force a similar distribution of the confounder, we only fully control its influence on the effect, but not on the exposure.

So the logic of the analysis changes slightly when assessing the results of a case-control study in which we used matching to control for confounding factors. While in an unmatched study we analyze the association between exposure and effect in the overall group, when we match we must study the effect within the case-control pairs.

[Figure: cyc_pairing, data tables of the example]

We will see this by continuing with the example of the effect of tobacco on the occurrence of laryngeal carcinoma from the previous post.

In the upper table we see the overall data of the study. If we analyze these data without taking into account that we used matching to select the controls, we obtain an odds ratio of 2.18, as we saw in the previous post. However, we know that this estimate is wrong. What do we do? We look at the pairs, but only at those that don't agree with each other.

We see in the lower table the distribution of the pairs according to their exposure to tobacco. There are 208 pairs in which both the case (the person with laryngeal cancer) and the control are smokers. Since both members are exposed, these pairs are of no use in estimating the association with the effect. The same is true of the 46 pairs in which neither the case nor the control smokes. The pairs of interest are the 14 in which the control smokes but the case doesn't, and the 62 in which the case smokes but the control doesn't.

These discordant pairs are the ones that give us information on the effect of tobacco on the occurrence of laryngeal cancer. If we calculate the odds ratio from them, it is 62/14 = 4.4, a stronger measure of association than the one obtained before, and certainly much closer to reality.
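As a sketch, both calculations can be reproduced from the pair counts above (all numbers are the post's fictitious data):

```python
# Pair counts from the example (fictitious data from the post)
both_smoke = 208      # concordant pairs: case and control both smoke
neither_smoke = 46    # concordant pairs: neither smokes
control_only = 14     # discordant: only the control smokes
case_only = 62        # discordant: only the case smokes

# Wrong approach: break the pairs and build the overall 2x2 table
cases_exposed = both_smoke + case_only          # 270
cases_unexposed = neither_smoke + control_only  # 60
controls_exposed = both_smoke + control_only    # 222
controls_unexposed = neither_smoke + case_only  # 108
crude_or = (cases_exposed * controls_unexposed) / (cases_unexposed * controls_exposed)
print(f"Crude (wrong) OR: {crude_or:.2f}")      # ~2.19, the 2.18 of the post

# Right approach: only the discordant pairs carry information
matched_or = case_only / control_only
print(f"Matched OR: {matched_or:.1f}")          # 62/14 = 4.4
```

Note how the "wrong" odds ratio is just what you get when you forget the pairing and pool everyone into one table.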

Finally, I want to make three remarks before finishing. First, although it goes without saying, remember that the data are a product of my imagination and the example is completely fictitious, although it does not seem as silly as others I have invented in other posts. Second, these calculations are usually done with software, using Mantel-Haenszel's method or McNemar's test. Third, in all these examples we have used a matching ratio of 1:1 (one control per case), but this need not necessarily be so, because in some cases we may be interested in using more than one control per case. This has its implications for the influence of the confounder on the estimated measure of association, and its own considerations when performing the analysis. But that's another story…
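As a hedged illustration of the software route, McNemar's statistic can even be computed by hand: the classic continuity-corrected formula uses only the two discordant counts (here, the example's 62 and 14).

```python
# McNemar's test on the discordant pairs of the fictitious example.
# The statistic (with Edwards' continuity correction) only involves
# the discordant counts b and c.
from scipy.stats import chi2

b, c = 62, 14  # only the case smokes / only the control smokes
statistic = (abs(b - c) - 1) ** 2 / (b + c)   # continuity-corrected McNemar
p_value = chi2.sf(statistic, df=1)            # upper tail of chi-square, 1 df
print(f"McNemar chi2 = {statistic:.1f}, p = {p_value:.2g}")
```

With a statistic around 29 on one degree of freedom, the association between smoking and the effect is clearly significant in this invented dataset.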

As an egg to a chestnut

Bar plot and histogram

In what way do an egg and a chestnut look alike? If we fire up our imagination we can give answers as absurd as they are contrived. Both are more or less rounded, both can serve as food, and both have a hard shell that encloses the edible part. But in fact, an egg and a chestnut don't resemble each other at all, however hard we look for similarities.

The same thing happens with two graphic tools widely used in descriptive statistics: the bar chart and the histogram. At first glance they may look very similar, but a closer look reveals clear differences between the two types of graph, which embody totally different concepts.

Types of variables

We know that there are different types of variables. On the one hand, there are quantitative variables, which may be continuous or discrete. Continuous variables are those that can take any value within a range, as with weight or blood pressure (in practice the possible values may be limited by the precision of the measuring devices, but in theory we can find any weight value between the minimum and the maximum of the distribution). Discrete variables are those that can only take certain values within a set, for example the number of children or the number of episodes of myocardial ischemia.

Furthermore, there are qualitative variables, which represent attributes or categories of the variable. When the variable does not include any sense of order, it is said to be a nominal qualitative variable, whereas if some order can be established among the categories it is said to be an ordinal qualitative variable. For example, smoking will be a nominal qualitative variable if it only has two possibilities: yes or no. However, if we define the variable with categories such as casual, light, moderate or heavy smoker, there will be a hierarchy among the categories and it will be an ordinal qualitative variable.
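A minimal sketch of this distinction in code, using pandas categoricals (the category names are the invented ones from the text):

```python
# Nominal vs ordinal qualitative variables as pandas categoricals
import pandas as pd

# Nominal: no order among categories (yes/no smoking)
smoker = pd.Categorical(["yes", "no", "no", "yes"], categories=["no", "yes"])
print(smoker.ordered)              # False: no hierarchy

# Ordinal: the categories have a hierarchy
habit = pd.Categorical(
    ["casual", "heavy", "moderate"],
    categories=["casual", "light", "moderate", "heavy"],
    ordered=True,
)
print(habit.ordered)               # True
print((habit < "heavy").tolist())  # order comparisons now make sense
```

Only the ordered categorical allows comparisons such as "less than a heavy smoker"; trying that on the nominal one raises an error, which is exactly the conceptual point.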

Bar plot

Well, the bar chart is used to represent ordinal qualitative variables. The horizontal axis represents the different categories, and over it a series of columns or bars is drawn whose heights are proportional to the frequency of each category. We could also use this type of graph to represent discrete quantitative variables, but what is not right is to use it to plot nominal qualitative variables.
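As a small sketch of the idea (the observations are invented), the bar heights are nothing more than category counts:

```python
# The heights of a bar chart are just the frequency of each category
from collections import Counter

observations = ["casual", "light", "light", "moderate", "heavy",
                "light", "moderate", "casual", "light", "heavy"]
order = ["casual", "light", "moderate", "heavy"]  # the ordinal hierarchy
heights = [Counter(observations)[cat] for cat in order]
print(dict(zip(order, heights)))  # these counts would feed plt.bar(order, heights)
```

Those four numbers are all the information the graph contains; everything else is presentation.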

[Figure: bar graph]


The great merit of the bar chart is that it expresses the magnitude of the differences between the categories of the variable. But that is precisely its weakness, because it is easily manipulated by modifying its axes. As you can see in the first figure, the difference between casual and light smokers seems much greater in the second graph, in which we have cut off part of the vertical axis. So be careful when analyzing this type of graph, so as not to be deceived by the message that the author of the study may want to convey.



Moving on, the histogram is a graph with a much deeper meaning. A histogram is a frequency distribution that is used (or should be used) to represent the frequency of continuous quantitative variables. It is not the height but the area of each bar that is proportional to the frequency of its interval, and that area is related to the probability of the interval occurring. As you can see in the second figure, the columns, unlike those of the bar chart, stand side by side, and the midpoint gives its name to the interval. The intervals need not all be of the same width (although that is the most common situation), but the more frequent an interval, the larger the area of its bar.

In addition, there's another very important difference between the bar chart and the histogram. The bar chart represents only those values of the variable that have been observed in the study. The histogram goes much further, since it represents all the possible values within the range, even though some of them have not been directly observed. It therefore allows us to calculate the probability of any value of the represented distribution, which is very important if we want to make inference and estimate population values from the results of our sample.
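A minimal numpy sketch of the "area, not height" idea (the weight data are invented):

```python
# With density=True, the bar areas of a histogram sum to one,
# behaving like a probability distribution
import numpy as np

rng = np.random.default_rng(42)
weights = rng.normal(loc=70, scale=12, size=1000)  # invented weight data, in kg

density, edges = np.histogram(weights, bins=15, density=True)
areas = density * np.diff(edges)   # height x width of each bar
print(f"Total area: {areas.sum():.3f}")  # 1.000
```

Each individual area is the estimated probability of falling in that interval, which is precisely what the bar chart cannot offer.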

We’re leaving…

And here we leave these graphs, which may look the same but, as we've seen, resemble each other as much as an egg does a chestnut.

Just one last comment. We said at the beginning that it was a mistake to use a bar chart (or, of course, a histogram) to represent nominal qualitative variables. So what can we use for that? Well, a pie chart, that famous and ubiquitous graph that is used on many more occasions than it should be and that has its own idiosyncrasies. But that's another story…

Give me a bar and I’ll move the earth

But I won't accept just any bar. It must be a very special bar. Or rather, a series of bars. And I'm not thinking about a bar chart, those so well known and so widely used that PowerPoint makes them almost without you asking for it. No, those graphs are very dull; they just represent how many times each value of a qualitative variable is repeated, but they tell us nothing more.

I'm thinking about a much more meaningful plot. I'm thinking about a histogram. Wow, you'll say, but isn't that just another kind of bar chart? Yes, but with a different kind of bars, much more informative. To begin with, the histogram is used (or should be used) to represent frequencies of continuous quantitative variables. The histogram is not just a bar chart, but a frequency distribution. What does that mean? Well, deep down, the bars are somewhat artificial. Take a continuous quantitative variable such as weight. Imagine that our distribution ranges from 38 to 118 kg. In theory, we can have infinite weight values (as with any continuous variable), but to represent the distribution we divide the range into an arbitrary number of intervals and draw a bar for each interval, so that the height of the bar (and therefore its area) is proportional to the number of cases inside the interval. This is a histogram: a frequency distribution.

[Figure: histograms]

Now, suppose we make the intervals narrower and narrower. The profile formed by the bars looks more and more like a curve as the intervals narrow. In the end, what we'll come up with is a curve, which is called the probability density curve. The probability of a single given value is zero (one would think it should be the height of the curve at that point, but no, it is zero), while the probability of the values of a given interval is the area under the curve in that interval. And what is the area under the entire curve? Very easy: the probability of finding any of the possible values, i.e., one (100% if you prefer percentages).
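This narrowing of intervals can be sketched in a few lines (sample size, seed and bin counts are arbitrary choices): as the bins get thinner, the histogram profile hugs the theoretical density curve.

```python
# As intervals narrow, the histogram approaches the probability
# density curve (here the standard normal, for comparison)
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sample = rng.standard_normal(200_000)

for bins in (10, 50, 200):
    density, edges = np.histogram(sample, bins=bins, range=(-4, 4), density=True)
    midpoints = (edges[:-1] + edges[1:]) / 2
    gap = np.abs(density - norm.pdf(midpoints)).max()  # worst distance to the curve
    print(f"{bins:>3} bins: max gap to the true curve = {gap:.3f}")
```

With coarse bins the staircase sits visibly off the bell; with fine bins (and enough data) the two are nearly indistinguishable.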

As you see, the histogram is much more than it seems at first sight. In a symmetrical distribution it tells us that the probability of finding a value lower than the mean is 0.5, but not only that, because we can calculate the probability density at any value using a tiny formula that I prefer not to show you, to keep you from closing your browsers and giving up on this post. Moreover, there's a simpler way to find it out.

With variables following a normal distribution (the famous bell) the solution is simple. We know that a normal distribution is perfectly characterized by its mean and standard deviation. The problem is that each variable has its own normal curve, so the probability density curve is specific to each distribution. What can we do? We resort to a standard normal distribution, whose mean is zero and whose standard deviation is one, and whose probability density has been studied so thoroughly that we need neither formulas nor tables to know the probability of a given segment.

Once that is done, we take any value of our distribution and transform it into its soul mate in the standard distribution. This process is called standardization and is as simple as subtracting the mean from the value and dividing the result by the standard deviation. Thus we obtain another of the statistics that physicians in general, and statisticians in particular, most venerate: the z score.
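The standardization step as code, with invented numbers purely for illustration:

```python
# Standardization: subtract the mean, divide by the standard deviation
value = 230.0   # a hypothetical observed cholesterol value, mg/dL
mean = 190.0    # distribution mean (invented)
sd = 25.0       # distribution standard deviation (invented)

z = (value - mean) / sd
print(f"z score: {z:.1f}")  # 1.6: the value sits 1.6 SDs above the mean
```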

The probability density of the standard distribution is well known. A z value of zero sits at the mean. The range z = 0 ± 1.64 comprises 90% of the distribution; the range z = 0 ± 1.96 includes 95%; and z = 0 ± 2.58, 99%. What we do in practice is to choose the desirable standardized z value for our variable, typically set at ±1 or ±2 depending on the variable measured. Moreover, we can compare how the z score changes in successive determinations.
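Those coverage figures are easy to check with scipy's standard normal:

```python
# The probability between -z and +z is cdf(z) - cdf(-z)
from scipy.stats import norm

for z, expected in ((1.64, 0.90), (1.96, 0.95), (2.58, 0.99)):
    coverage = norm.cdf(z) - norm.cdf(-z)
    print(f"z = ±{z}: {coverage:.3f} of the distribution (text says {expected:.0%})")
```

The small discrepancies (e.g. ±1.64 gives 0.899 rather than exactly 0.90) are just rounding of the quoted z values.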

The problem arises because in medicine there are many variables whose distribution is skewed and does not fit a normal curve, such as blood cholesterol and many others. But do not despair: mathematicians have invented a thing called the central limit theorem, which says that if the sample size is large enough, the distribution of the sample means approximates a normal, so we can standardize and work as if we were dealing with the standard normal distribution. This theorem is a great thing, because it lets us do the same even with non-continuous variables that fit other distributions, such as the binomial, the Poisson and others.
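A small simulation sketch of the theorem (sample sizes and seed are arbitrary): means of samples drawn from a clearly skewed distribution behave approximately like a normal variable.

```python
# Central limit theorem in miniature: means of exponential samples
# (a very skewed distribution) look approximately normal
import numpy as np

rng = np.random.default_rng(1234)
# 10,000 samples of size 50 from an exponential distribution, one mean each
sample_means = rng.exponential(scale=1.0, size=(10_000, 50)).mean(axis=1)

# Standardize the means and count how many fall within ±1.96,
# which should be close to the normal distribution's 95%
z = (sample_means - sample_means.mean()) / sample_means.std()
coverage = np.mean(np.abs(z) < 1.96)
print(f"Fraction of sample means within ±1.96: {coverage:.3f}")
```

Raw exponential values are nothing like a bell, yet their sample means already honor the normal coverage figures quite closely.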

But all this does not end here. Standardization is the basis for calculating other features of the distribution, such as skewness and kurtosis, and it is also the basis for many hypothesis tests that seek a known distribution in order to calculate statistical significance. But that's another story…