Science without sense…double nonsense

Pills on evidence-based medicine

Archive for January 2015

An important decision

Life is full of important decisions. We decide what to study. Sometimes, though rarely, we get to decide where to work. We decide whether to marry and to whom (or so we think). We decide where to live, what car to buy, and so on. And we get it wrong more often than we'd like. Don't you think so? Then explain to me the meaning of that lament heard so often: "if only I could live twice…". That's it.

So, before making a decision we have to weigh carefully the alternatives available to us. And this, true for most aspects of everyday life, also applies to the scientific method, with the added advantage that in science it tends to be more clearly established which option is the right one.

And speaking of choices, suppose we want to get an idea of the result of the next election. The way to obtain the most accurate figures would be to ask every voter about their voting intention, but anyone can see that this is impossible from a practical point of view. Think of a large country with fifty million voters. In these cases, what we do is choose a subset of the population, which we call a sample, run the survey among its members and estimate the result in the general population.

But we may ask ourselves: is this estimate reliable? The answer is yes, provided that we use a valid sampling technique that gives us a representative sample of the population. Everything depends on two characteristics of the sampling: accuracy and precision.

Accuracy determines how close the result obtained in the sample is to the inaccessible real value in the population, and it depends on the type of sample chosen. To be accurate, the sample must be representative, which means that the distribution of the study variable (and of the related variables) must be similar to that of the population. We usually begin by defining the sampling frame, which is the list or procedure for identifying the individuals of the population we can access, called sampling units, and to which the selection process will be applied. Think, for example, of a population census, a list of medical records, etc. The choice of the frame must be made very carefully because it will determine the interpretation of the results.

Precision, on the other hand, depends on the sample size and the variability among participants, as you surely remember from the formula for confidence intervals (95% CI = value ± 1.96 × standard error). Since the standard error is the quotient of the standard deviation by the square root of the sample size (n), the greater the standard deviation or the smaller the sample size, the wider the confidence interval and the lower the precision of the estimate. But this is only a half-truth that holds only if we assume the population has an infinite size because, in reality, the standard error must be multiplied by a finite population correction factor that takes into account the population size (N), so the true formula for the confidence interval is:

95\%\ CI = value \pm 1.96 \times standard\ error \times \sqrt{1-\frac{n}{N}}

Stop! Don't speed up just because I've written a formula. Go back, look at it, and contemplate, once again, the magic of the central limit theorem, the Sancta Sanctorum of statistical inference. If the population (N) is large, the ratio n/N quickly becomes almost zero and the error is multiplied by almost one, which barely changes the interval. And this is not trivial, because it explains why a sample of 1200 voters can estimate, with very little margin of error, the result of the election for mayor of New York, for president of the US, or for emperor of the entire world, if we had one (provided, of course, that the sample is representative of the corresponding electoral census). Moreover, as n approaches N, the correction factor approaches zero and the interval gradually becomes narrower. This explains why, when n = N, the value we obtain matches the actual value in the population.
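To see this in numbers, here is a minimal Python sketch (my own, not from the post; the standard deviation and sizes below are invented) of how little the correction factor changes the interval when N is huge, and how much it shrinks it when n approaches N:

```python
# Half-width of a 95% CI for a mean, with and without the finite population correction.
import math

def ci_half_width(sd, n, N=None, z=1.96):
    """Half-width of the 95% CI; applies the finite population correction if N is given."""
    se = sd / math.sqrt(n)          # standard error = sd / sqrt(n)
    if N is not None:
        se *= math.sqrt(1 - n / N)  # finite population correction factor
    return z * se

sd = 10      # hypothetical standard deviation
n = 1200     # the classic poll size

print(ci_half_width(sd, n))                 # infinite population
print(ci_half_width(sd, n, N=50_000_000))   # fifty million voters: almost identical
print(ci_half_width(sd, n, N=1500))         # n close to N: the interval clearly narrows
```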

Such being the power of an apparently insignificant sample, nobody should be surprised that there are several forms of sampling. The first ones we're going to consider are the probabilistic sampling techniques, in which all subjects have a known, non-zero probability of being selected, although it's not required that they all have the same probability. Typically, random selection methods are used to avoid the investigator's subjectivity and to limit the possibility that, by chance, the sample is not representative, which would be the origin of random or sampling bias. As always, we cannot get rid of chance, but we can quantify it.

The best known is simple random sampling, in which each unit in the sampling frame has the same probability of being chosen. Sampling is most often performed without replacement, which means that, once selected, the participant is not returned to the population, so that unit cannot be selected more than once. To do things properly, selection from the frame is made with a random number table or a computer algorithm.
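As a small illustration, a sketch of simple random sampling without replacement using Python's standard library; the frame of "medical records" is, of course, made up:

```python
import random

# Hypothetical sampling frame: a list of medical record identifiers.
sampling_frame = [f"record_{i}" for i in range(10_000)]

# Each unit has the same probability of selection and none can appear twice.
sample = random.sample(sampling_frame, k=100)
print(sample[:5])
```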

Sometimes the variable is not evenly distributed in the population. In these cases, to get a representative sample, we can divide the population into strata and do a random sampling within each stratum. To perform this technique, called stratified random sampling, we need to know the distribution of the variable in the population. Furthermore, the strata should be mutually exclusive and defined so that the variability within each one is minimal and the variability between strata is maximal. If the strata are of similar size, sampling is done proportionally, but if one of them is smaller it can be over-represented and contribute comparatively more sampling units than the rest. The problem is that the analysis gets more complicated, because the results of each stratum have to be weighted according to its contribution to the overall result, but any statistical program makes these corrections without blinking. The advantage of this technique is that, provided the data are analyzed correctly, the estimates are more precise, since the overall variance is calculated from that of the strata, which will always be smaller than that of the general population. This type of sampling is very useful when the study variable is influenced by other variables of the population. If we consider, for example, the prevalence of ischemic heart disease, it may be useful to stratify by sex, weight, age, smoking history, or whatever else we think may influence the result.
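A possible sketch of proportional stratified sampling in Python, assuming a hypothetical frame in which each unit carries a label (here, sex) that defines the stratum:

```python
import random
from collections import defaultdict

# Hypothetical frame: each unit has an id and a stratum label.
frame = [{"id": i, "sex": random.choice(["male", "female"])} for i in range(10_000)]
n_total = 200

# Group the units by stratum.
strata = defaultdict(list)
for unit in frame:
    strata[unit["sex"]].append(unit)

# Allocate the sample proportionally to each stratum's size and sample within it.
sample = []
for name, units in strata.items():
    n_stratum = round(n_total * len(units) / len(frame))
    sample.extend(random.sample(units, n_stratum))

print(len(sample))
```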

A step beyond this approach is multistage or cluster sampling. In this case the population is divided into primary sampling units which, in turn, are divided into secondary units within which the selection process is carried out. This type, with as many stages as we need in each case, is widely used in studies about schools, clustering by socioeconomic status, type of education, age, grade or whatever comes to mind. The problem with this design, apart from the complexity of implementation and of the analysis of the results, is that it can be biased if the members of each unit are very similar to one another. Suppose, for example, that we want to study the vaccination rate in a city: we divide the city into zones, randomly select some families in each zone and see how many children are immunized. Logically, if a child is vaccinated his siblings probably will be too, which may overestimate the overall vaccination rate in the city if the sample contains many of these families from areas with better health status.
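Something like this two-stage sketch, with invented zones and families, shows the idea in Python:

```python
import random

random.seed(1)
# Hypothetical frame: city zones (primary units) containing families (secondary units).
zones = {z: [f"family_{z}_{i}" for i in range(50)] for z in range(20)}

selected_zones = random.sample(list(zones), 5)   # first stage: pick some zones
sample = []
for z in selected_zones:                         # second stage: pick families within each zone
    sample.extend(random.sample(zones[z], 10))

print(len(sample))   # 5 zones x 10 families = 50 families
```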

Systematic sampling is often used in studies where a sampling frame does not exist or is incomplete. For example, if we want to test an anti-influenza drug, we don't know in advance who is going to catch the flu. We choose a randomization constant (k) and wait quietly for patients to arrive at the practice. Once the first k have arrived, we choose one of them at random and, from then on, we include every k-th patient who comes in with influenza until we reach the desired sample size.
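A rough sketch of that procedure in Python, with a simulated stream of arrivals (the names and numbers are invented):

```python
import random

k = 7                                                 # randomization constant
patients = (f"patient_{i}" for i in range(1, 201))    # simulated arrivals, no frame needed in advance
first = random.randint(1, k)                          # random start among the first k arrivals

# From the random start onwards, keep every k-th patient.
sample = [p for i, p in enumerate(patients, start=1)
          if i >= first and (i - first) % k == 0]
print(sample[:5])
```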

In all the techniques above, the probability of each member of the population being selected was known. This probability is unknown in non-probabilistic methods, in which non-random selection is used, so we have to be careful about the representativeness of the sample and the presence of bias.

Consecutive sampling is often used in clinical trials. In the flu example above, we could enroll the first n patients who come to the practice and meet the trial's inclusion and exclusion criteria. Another possibility is the inclusion of volunteers. This is not recommended, since subjects who agree to participate in a study without anyone asking them may have characteristics that affect the representativeness of the sample.

Marketing specialists very often use the quota sampling technique, selecting subjects according to the distribution of the variables that interest them, but this type of design is little used in medicine. And finally, let us mention adaptive techniques such as the so-called snowball sampling, random walk sampling or network sampling. For example, suppose we want to do a study on people addicted to illegal substances. It will be hard to find participants, but once we find the first one we can ask whether he or she knows someone else who might participate. This technique, which I have not just invented, comes in handy for very hard-to-reach populations.

And with that we finish with the techniques that try to obtain the most suitable sample for our study. We could also talk about the sample size and how it should be calculated in advance so that it is neither too big nor too small. But that's another story…

Everything is not normal

According to the dictionary, a thing is considered normal when it is in its natural state or conforms to standards set in advance. And this is its normal meaning. But, like many other words, "normal" has other meanings. In statistics, when we talk about normal we refer to a specific probability distribution called the normal distribution, the famous Gaussian bell.

This distribution is characterized by its symmetry around its mean, which coincides with its median, in addition to other features we discussed in a previous post. The great advantage of the normal distribution is that it allows us to calculate the probability of occurrence of data drawn from it, which opens up the possibility of making inferences about the population from a sample obtained from it.

Thus, virtually all parametric hypothesis tests require the data to follow a normal distribution. One might think this is not a big problem: if it's called normal, it must be because biological data usually follow it, more or less. Big mistake: many data follow distributions that deviate from the normal. Consider, for example, alcohol consumption. The data will not be grouped symmetrically around a mean. On the contrary, the distribution will be positively skewed (to the right): a large number of values around zero (abstainers or very occasional drinkers) and a long right tail formed by people with higher consumption, stretching out to the values of those who have bourbon for breakfast.

And how does the fact that the variable doesn't follow a normal distribution affect our statistical calculations? What should we do if the data are not normal?

The first thing to do is to realize that the variable is not normally distributed. As we have already seen, there are a number of graphical methods that allow us to assess visually whether the data follow a normal distribution. Histograms and box plots let us check whether the distribution is skewed, too flat or too peaked, or whether it has extreme values. But the most specific plot for this purpose is the normal probability plot (q-q plot), in which the values fall along the diagonal line if they are normally distributed.
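For those who want to try it, a sketch of these three plots in Python, assuming the usual numpy, scipy and matplotlib toolkit and some simulated right-skewed data standing in for our drinkers:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Simulated right-skewed data (lognormal), clearly non-normal.
data = np.random.default_rng(42).lognormal(mean=1, sigma=0.8, size=200)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].hist(data, bins=20)
axes[0].set_title("Histogram")
axes[1].boxplot(data)
axes[1].set_title("Box plot")
stats.probplot(data, dist="norm", plot=axes[2])   # points drift off the diagonal if not normal
axes[2].set_title("Normal q-q plot")
plt.tight_layout()
plt.show()
```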

Another possibility is to use numerical tests such as Shapiro-Wilk's or Kolmogorov-Smirnov's. The problem with these tests is that they are very sensitive to sample size. If the sample is large, they can be triggered by minor deviations from normality. Conversely, if the sample is small, they may fail to detect large deviations from normality. But these tests have yet another drawback, which you will understand better after a small clarification.

We know that in any hypothesis test we set up a null hypothesis that usually states the opposite of what we want to show. Thus, if the statistical significance value is lower than a chosen threshold (usually 0.05), we reject the null hypothesis and keep the alternative, which states what we want to prove. The problem is that the null hypothesis can only be falsified; it can never be shown to be true. If the statistical significance is high, we simply cannot say it's false, but that doesn't mean it's true. It may be that the study did not have enough power to reject a null hypothesis that is, in fact, false.

Well, it happens that normality tests are set up with the null hypothesis that the data follow a normal distribution. Therefore, if the significance is small, we can reject the null hypothesis and say the data are not normal. But if the significance is high, we simply cannot reject it, and all we can say is that we were unable to show that the data are not normal, which is not the same as saying they fit a normal distribution. For this reason, it is always advisable to complement the numerical tests with some graphical method when checking the normality of a variable.
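By way of illustration, a sketch of both tests with scipy on simulated non-normal data; note that feeding the Kolmogorov-Smirnov test parameters estimated from the sample itself is a simplification (strictly, that calls for the Lilliefors correction):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = rng.lognormal(mean=1, sigma=0.8, size=200)   # clearly non-normal data

# Shapiro-Wilk: a small p-value lets us reject normality; a large one only means we failed to reject it.
w, p_shapiro = stats.shapiro(data)
print(f"Shapiro-Wilk: W={w:.3f}, p={p_shapiro:.4f}")

# Kolmogorov-Smirnov against a normal with the sample's own mean and standard deviation.
d, p_ks = stats.kstest(data, "norm", args=(data.mean(), data.std(ddof=1)))
print(f"Kolmogorov-Smirnov: D={d:.3f}, p={p_ks:.4f}")
```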

Once we know that the data are not normal, we must take this into account when describing them. If the distribution is highly skewed we cannot use the mean as a measure of central tendency and must resort to more robust estimators, such as the median, or to other parameters available for these situations.

Furthermore, the absence of normality may rule out the use of parametric tests. Student's t test and the analysis of variance (ANOVA) require the distribution to be normal. Student's t is fairly robust in this respect, so if the sample is large (n > 80) it can be used with some confidence. But if the sample is small, or the distribution is very far from normal, we cannot use parametric tests.

One possible solution to this problem is to try a data transformation. The most frequently used in biology is the logarithmic transformation, useful for approximating a normal distribution when the data are right-skewed. We mustn't forget to undo the transformation once the test in question has been carried out.
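A small sketch of the idea in Python, with two invented right-skewed groups: the t test is run on the logarithms and the point estimates are back-transformed afterwards (which gives geometric means):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.lognormal(mean=1.0, sigma=0.5, size=40)
group_b = rng.lognormal(mean=1.3, sigma=0.5, size=40)

# Run the t test on the log scale, where the data are roughly normal.
t, p = stats.ttest_ind(np.log(group_a), np.log(group_b))
print(f"t={t:.2f}, p={p:.4f}")

# Undo the transformation: exponentiating the mean of the logs gives the geometric mean.
print("geometric means:", np.exp(np.log(group_a).mean()), np.exp(np.log(group_b).mean()))
```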

The other possibility is to use nonparametric tests, which make no assumption about the distribution of the variable. Thus, to compare two means from unpaired data we use the Wilcoxon rank-sum test (also called the Mann-Whitney U test). If the data are paired we have to use the Wilcoxon signed-rank test. If we compare more than two means, the Kruskal-Wallis test is the nonparametric equivalent of ANOVA. Finally, remember that the nonparametric equivalent of Pearson's correlation coefficient is Spearman's correlation coefficient.
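In scipy the correspondence looks roughly like this (the data are invented, and treating the two groups as paired in the signed-rank test is only for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.lognormal(1.0, 0.6, 30)
b = rng.lognormal(1.3, 0.6, 30)
c = rng.lognormal(1.1, 0.6, 30)

print(stats.mannwhitneyu(a, b))   # two unpaired groups (Wilcoxon rank-sum / Mann-Whitney U)
print(stats.wilcoxon(a, b))       # two paired groups (Wilcoxon signed-rank)
print(stats.kruskal(a, b, c))     # more than two groups (Kruskal-Wallis, the ANOVA equivalent)
print(stats.spearmanr(a, b))      # Spearman's correlation instead of Pearson's
```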

The problem is that nonparametric tests are more demanding than their parametric equivalents when it comes to reaching statistical significance, but they should be used whenever there is any doubt about the normality of the variable we are testing.

And here we'll stop for today. We could have talked about a third, much more exotic, possibility for dealing with a non-normal variable: the use of resampling techniques such as bootstrapping, which consists of building an empirical distribution from the means of many samples drawn from our own data and making inferences from its results, thus preserving the original units of the variable and avoiding the back-and-forth of data transformations. But that's another story…
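Just to leave a taste of it, a minimal bootstrap sketch in Python (my own, with invented data): resample with replacement many times, compute the mean of each resample and read a 95% interval off the empirical distribution.

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.lognormal(mean=1, sigma=0.8, size=100)   # invented skewed sample

# Resample the data with replacement 5000 times and keep the mean of each resample.
boot_means = np.array([rng.choice(data, size=data.size, replace=True).mean()
                       for _ in range(5000)])

# The 2.5th and 97.5th percentiles of that empirical distribution give a 95% interval.
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"bootstrap 95% CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")
```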