Rioja vs Ribera

Frequentist vs Bayesian statistics

This is one of those typical debates you can have with a brother-in-law over a family dinner: whether the wine from Ribera is better than the one from Rioja, or vice versa. In the end, as always, the brother-in-law will be (or will want to be) right, which will not prevent us from trying to contradict him. Of course, we must make good arguments to avoid falling into the same error that, in my humble opinion, some people fall into when taking part in another classic debate, this one from the less playful field of epidemiology: frequentist versus Bayesian statistics.

And these are the two approaches that we can use when dealing with a research problem.

Some previous definitions

Frequentist statistics, the best known and the one we are most accustomed to, is developed according to the classic concepts of probability and hypothesis testing. Thus, it is about reaching a conclusion based on the level of statistical significance and the acceptance or rejection of a working hypothesis, always within the framework of the study being carried out. This methodology forces us to establish the decision parameters a priori, which avoids subjectivity regarding them.

The other approach to solving problems is that of Bayesian statistics, which is increasingly fashionable and, as its name suggests, is based on the probabilistic concept of Bayes’ theorem. Its differentiating feature is that it incorporates external information into the study that is being carried out, so that the probability of a certain event can be modified by the previous information that we have on the event in question. Thus, the information obtained a priori is used to establish an a posteriori probability that allows us to make the inference and reach a conclusion about the problem we are studying.
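
In symbols, and only as a minimal reminder of my own rather than anything from the original post, this is just Bayes' theorem applied to a hypothesis H and the observed data D, where P(H) is the prior probability, P(D|H) the likelihood and P(H|D) the posterior probability:

P(H \mid D) = \frac{P(D \mid H)\cdot P(H)}{P(D)}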

This is another difference between the two approaches: while frequentist statistics avoids subjectivity, the Bayesian approach introduces a subjective (but not capricious) definition of probability, based on the researcher's conviction, to make judgments about a hypothesis.

Bayesian statistics is not really new. Thomas Bayes' theory of probability was published in 1763, but it has experienced a resurgence since the last third of the twentieth century. And, as usually happens when there are two alternatives, supporters and detractors of both methods appear, deeply involved in the fight to demonstrate the benefits of their preferred method, sometimes looking more for the weaknesses of the opposite side than for their own strengths.

And this is what we are going to talk about in this post: some arguments that Bayesians occasionally use which, once more in my humble opinion, exploit the misuse of frequentist statistics by many authors rather than any intrinsic defect of the methodology.

A bit of history

We will start with a bit of history.

The history of hypothesis testing begins back in the 1920s, when the great Ronald Fisher proposed assessing the working hypothesis (of absence of effect) through a specific observation and the probability of observing a value equal to or greater than the observed result. That probability is the p-value, so sacred and so misinterpreted, which means nothing more than that: the probability of finding a value equal to or more extreme than the one found, if the working hypothesis were true.

In summary, the p that Fisher proposed is nothing more than a measure of the discrepancy that could exist between the data found and the working hypothesis proposed, the null hypothesis (H0).

Almost a decade later, the concept of the alternative hypothesis (H1) was introduced, which did not exist in Fisher’s original approach, and the reasoning was modified on the basis of two error rates, the false positive and the false negative:

  1. Alpha error (type 1 error): probability of rejecting the null hypothesis when, in fact, it is true. It would be the false positive: we believe we detect an effect that, in reality, does not exist.
  2. Beta error (type 2 error): it is the probability of accepting the null hypothesis when, in fact, it is false. It is the false negative: we fail to detect an effect that actually exists.

Thus, we set a maximum value for what seems to us the worst-case scenario, which is detecting a false effect, and we choose a “small” value. How small? Well, by convention, 0.05 (sometimes 0.01). But, I repeat, it is a value chosen by agreement (and there are those who call it capricious, because 5% reminds them of the fingers of the hand, which are usually five).

Thus, if p < 0.05, we reject H0 in favor of H1. Otherwise, we accept H0, the hypothesis of no effect. It is important to note that we can only reject H0, never demonstrate it in a positive way. We can demonstrate the effect, but not its absence.

Everything said so far seems easy to understand: the frequentist method tries to quantify the level of uncertainty of our estimate in order to draw a conclusion from the results. The problem is that p, which is nothing more than a way of quantifying that uncertainty, is too often sacralized and misinterpreted, something that opponents of the method use to their advantage (if I may say so) to try to expose its weaknesses.

One of the major flaws attributed to the frequentist method is the dependence of the p-value on the sample size. Indeed, the value of p can be the same with a small effect size in a large sample as with a large effect size in a small sample. And this is more important than it may seem at first, since the value that will allow us to reach a conclusion will depend on a decision exogenous to the problem we are examining: the chosen sample size.
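
To see this dependence in numbers, here is a minimal sketch of my own (not the post's, and assuming the scipy library is available) in which a tiny effect in a large sample and a large effect in a tiny sample produce exactly the same p-value:

```python
# Illustrative only: same p-value from very different effect sizes and sample sizes.
from math import sqrt
from scipy.stats import norm

def z_test_p(effect, sd, n):
    """One-sided p-value for a mean 'effect' away from 0 (z approximation)."""
    z = effect / (sd / sqrt(n))
    return norm.sf(z)  # P(Z >= z) under the null hypothesis of no effect

print(z_test_p(effect=0.2, sd=1.0, n=400))  # small effect, large sample
print(z_test_p(effect=2.0, sd=1.0, n=4))    # large effect, tiny sample: same p
```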

Here would lie the benefit of the Bayesian method, in which larger samples simply provide more and more information about the phenomenon under study. But I think this argument rests on a misunderstanding of what an adequate sample is. I am convinced that more is not always better.

We start with the debate

Another great man, David Sackett, said that “samples that are too small can be used to prove nothing; samples that are too large can be used to prove anything”. The problem is that, in my opinion, a sample is neither large nor small, but sufficient or insufficient to demonstrate the existence (or not) of an effect size that is considered clinically important.

And this is the heart of the matter. When we want to study the effect of an intervention we must, a priori, define the effect size we want to detect and calculate the sample size needed to be able to detect it, provided the effect exists (something we hope for when we plan the experiment, but do not know a priori). When we do a clinical trial we are spending time and money, in addition to exposing participants to potential risk, so it is important to include only those needed to try to demonstrate the clinically important effect. Including extra participants merely to reach the desired p < 0.05, besides being uneconomical and unethical, demonstrates a lack of knowledge about the true meaning of the p-value and of sample size.
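
As an illustration of this a priori calculation (a sketch of mine, not the post's example, assuming the statsmodels library is available), this is how one could estimate the participants needed per arm to detect a standardized effect size of 0.5 with the usual alpha of 0.05 and 80% power:

```python
# Hypothetical a priori sample size calculation for a two-arm trial.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(round(n_per_group))                                     # about 64 per group
print(analysis.power(effect_size=0.5, nobs1=64, alpha=0.05))  # roughly 0.80
```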

This misinterpretation of the p-value is also the reason why many authors who do not reach the desired statistical significance allow themselves to claim that with a larger sample size they would have achieved it. And they are right: they would have reached the desired p < 0.05, but they again ignore the importance of clinical significance versus statistical significance.

When the sample size needed to detect the clinically important effect is calculated a priori, the power of the study is also calculated, which is the probability of detecting the effect if it actually exists. If the power is greater than 80-90%, the values accepted by convention, it does not seem correct to say that the sample is not large enough. And, of course, if you have not calculated the power of the study beforehand, you should do so before claiming that you have no results because of an insufficient sample.

Another argument against the frequentist method and in favor of the Bayesian one says that hypothesis testing is a dichotomous decision process, in which a hypothesis is rejected or accepted just as you reject or accept an invitation to the wedding of a distant cousin you haven’t seen for years.

Well, just as they previously forgot about clinical significance, those who make this claim forget about our beloved confidence intervals. The results of a study should not be interpreted solely on the basis of the p-value. We must look at the confidence intervals, which inform us of the precision of the result and of the possible values that the observed effect may take, values we cannot pin down further because of the effect of chance. As we saw in a previous post, the analysis of confidence intervals can sometimes give us clinically important information even when the p-value is not statistically significant.
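
For instance, with made-up numbers (this is only a sketch, not data from any real study, and it assumes scipy is available), a difference between two means can be reported with its 95% confidence interval like this:

```python
# Hypothetical difference between two group means, reported with its 95% CI.
from scipy.stats import t

diff = 4.0   # observed difference between group means (hypothetical)
se = 1.8     # standard error of that difference (hypothetical)
df = 58      # degrees of freedom, e.g. n1 + n2 - 2
t_crit = t.ppf(0.975, df)
print((diff - t_crit * se, diff + t_crit * se))  # roughly (0.4, 7.6)
```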

More arguments

Finally, some detractors of the frequentist method say that the hypothesis test makes decisions without considering information external to the experiment. Again, a misinterpretation of the value of p.

As we already said in a previous post, a value of p < 0.05 does not mean that H0 is false, nor that the study is more reliable, nor that the result is important (even if the p has six zeros). But, most importantly for what we are discussing now, it is false that the value of p represents the probability that H0 is false (the probability that the effect is real).

Once our results allow us to affirm, with a small margin of error, that the detected effect is real and not due to chance (in other words, when the p is statistically significant), we can calculate the probability that the effect is “real”. And for this, oh surprise, we will have to calibrate the value of p with the baseline probability of H0, which is assigned by the researcher based on her knowledge or previously available data (which is, after all, a Bayesian approach).
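
One way to do this calibration (a hedged sketch on my part, based on the minimum Bayes factor -e·p·ln(p), valid for p < 1/e, which as far as I know is the bound behind the Held's nomogram mentioned at the end of the post) is the following:

```python
# Calibrating a p-value with a prior probability of H0 (illustrative sketch).
from math import e, log

def posterior_prob_H0(p_value, prior_H0):
    bf = -e * p_value * log(p_value)   # smallest Bayes factor in favour of H0 compatible with this p
    prior_odds = prior_H0 / (1 - prior_H0)
    post_odds = bf * prior_odds
    return post_odds / (1 + post_odds)

# About 0.29: even with p = 0.05 and a 50:50 prior, H0 keeps at least ~29% probability.
print(posterior_prob_H0(p_value=0.05, prior_H0=0.5))
```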

As you can see, the assessment of the credibility or likelihood of the hypothesis, one of the differentiating characteristics of the Bayesian approach, can also be used with frequentist methods.

We’re leaving…

And here we are going to leave it for today. But before finishing I would like to make a couple of considerations.

First, in Spain we have many great wines throughout our geography, not just Ribera and Rioja. So that no one gets offended, I have chosen these two because they are usually the ones my brothers-in-law ask for when they come to have dinner at home.

Second, do not misunderstand me if it has seemed to you that I am an advocate of frequentist statistics against Bayesian statistics. Just as when I go to the supermarket I am happy to be able to buy wine from various designations of origin, in research methodology I find it very good to have different ways of approaching a problem. If I want to know whether my team is going to win a match, it does not seem very practical to replay the match 200 times and see what average result comes out. It would be better to try to make an inference taking into account the previous results.

And that’s all. We have not gone into depth on what we mentioned at the end, the real probability of the effect, which somehow mixes both approaches, the frequentist and the Bayesian. The easiest way, as we saw in a previous post, is to use Held’s nomogram. But that is another story…

The false coin

Today we’re going to continue playing with coins. In fact, we’re going to play with two coins, one of them a fair coin and the other one faker than Judas Iscariot, loaded to give more heads than tails when flipped. I recommend that you sit back and relax before starting.

It turns out we have a loaded coin. By definition, the probability of getting heads when tossing a fair coin is 0.5 (50%). However, our fake coin lands on heads 70% of the time (probability 0.7), which comes in handy because we can use it whenever we want to negotiate any unpleasant task. We only have to offer our coin, choose heads and trust that luck, helped by our unfair coin, will be on our side.

Let’s suppose now we have been so careless as to mix the fake coin with the others. How can we tell which one is the false one? And this is where our game comes in. Let’s imagine what would happen if we flipped a coin 100 times in a row. If the coin is fair we expect to get heads about 50 times, whereas if it is our false one we would expect about 70 heads. So we can choose a coin at random, toss it 100 times and, counting the number of heads, decide whether it is fair or not. We can arbitrarily choose a value between 50 and 70, say 65, and state: if we get 65 heads or more we will say our coin is the loaded one, but if we get fewer than 65, we will say it is a fair coin.

But anyone immediately realizes that this method is not foolproof. On the one hand, we can get 67 heads with a fair coin and conclude it is not fair, when in fact it is. But it can also happen that, just by chance, we get 60 heads with the loaded coin and conclude it is fair. Can we solve this problem and avoid reaching the wrong conclusion? Well, the truth is that we can’t, but what we can do is measure the probability of making a mistake.

If we use a binomial probability calculator (the bravest of you can do the calculations by hand) we’ll come up with a probability of getting 65 heads or more with a fair coin of 0.17%, while the probability of getting them with the loaded coin is 88.4%. So we find ourselves with four possibilities, which I represent in the accompanying table.
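
If you want to reproduce these numbers yourself, a quick sketch with scipy's binomial distribution (any binomial calculator should give essentially the same figures) would be:

```python
# Type I error, power and type II error for the "65 heads or more" decision rule.
from scipy.stats import binom

alpha = binom.sf(64, 100, 0.5)    # P(65 heads or more) with the fair coin
power = binom.sf(64, 100, 0.7)    # P(65 heads or more) with the loaded coin
print(alpha)      # close to the 0.17% quoted above
print(power)      # close to the 88.4% quoted above
print(1 - power)  # the type II error we will meet a few paragraphs below
```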

In this case, our null hypothesis says that the coin is fair, while the alternative hypothesis says that the coin is loaded in favor of heads.

Let’s start with the case in which the test concludes that the coin is fair (we get fewer than 65 heads). The first possibility is that the coin is actually fair. In that case we are simply right, and there is nothing more to say about that situation.

The second possibility is that, despite the conclusion of our test, the coin is faker than a mother-in-law’s kiss. In that case we will have made the mistake that someone with little imagination named the type II error: we have accepted the null hypothesis that the coin is fair when it is actually unfair.

Let’s now suppose that our test concludes that the coin is loaded. If the coin is actually fair, we will err again, but this time we will have committed a type I error: we reject the null hypothesis that the coin is fair when it is, in fact, fair.

Finally, if we conclude that it is not fair and it is actually loaded, we will be right again.

We can see in the table that the probability of making a type I error is, in this example, 0.17%. This is the statistical significance level of our test, which is simply the probability of rejecting our null hypothesis that the coin is fair (concluding it is false) when it is in fact fair. On the other hand, the probability of being right when the coin is false is 88%. This probability is called the power of the test, and it is simply the probability of being right when the test concludes the coin is loaded (in other words, of rejecting the null hypothesis and being right).

If you think about it a little you will see that the type II error is the complement of power. When the coin is not fair, the probability of accepting that it is fair (the type II error) plus the probability of being right and concluding it is false must add up to 100%. Thus, the type II error equals 1 minus power.

This statistical significance we have seen is the same as the famous p value. Statistical significance is just the probability of committing a type I error. By convention, it is generally accepted as tolerable when it is less than 0.05 (5%) since, in general, it is preferable not to accept a false hypothesis. This is why scientific studies look for low significance values and high power values, although the two are related: demanding a stricter significance level reduces power, and vice versa.
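
With the same coins we can see this trade-off directly (again a sketch with scipy, using the decision threshold as a stand-in for the significance level): the stricter the cut-off, the smaller alpha but also the smaller the power.

```python
# For each decision threshold: alpha with the fair coin, power with the loaded one.
from scipy.stats import binom

for cutoff in (60, 65, 70):
    alpha = binom.sf(cutoff - 1, 100, 0.5)   # P(heads >= cutoff) if the coin is fair
    power = binom.sf(cutoff - 1, 100, 0.7)   # P(heads >= cutoff) if it is loaded
    print(cutoff, round(alpha, 4), round(power, 3))
```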

And this is the end for now. To those of you who have got this far through this rigmarole without getting lost at all, my sincere congratulations, because the truth is that this post sounds like a play on words. We could also have said something about significance and the calculation of confidence intervals, sample sizes, etc. But that’s another story…

The tails of p

Forgive me, my friends from the other side of the Atlantic, but I am not thinking about the kind of tails that many perverse minds are imagining. Far from it, today we’re going to talk about much more boring tails, but ones that are very important if we want to do hypothesis testing. And, as usual, we will illustrate the point with an example to try to understand it better.

Let’s suppose we take a coin and, armed with infinite patience, toss it 1000 times, getting heads 560 times. We all know that the probability of getting heads is 0.5, so if we toss the coin 1000 times we expect to get an average of 500 heads. But we got 560, so two possibilities come to mind immediately.

First, the coin is fair and we got 60 extra heads just by chance. This will be our null hypothesis, which says that the probability of getting heads, P(heads), is equal to 0.5. Second, our coin is not fair, but is loaded to produce more heads. This will be our alternative hypothesis (Ha), which states that P(heads) > 0.5.

Well, let’s do a hypothesis test using one of the binomial probability calculators available on the Internet. Assuming the null hypothesis that the coin is fair, the probability of obtaining 560 heads or more is 0.008%. Since it is lower than 5%, we reject our null hypothesis: the coin is loaded.

Now, if you look closely, the alternative hypothesis has a directionality towards P(heads) > 0.5, but we could have hypothesized that the coin was not fair without presupposing that it was loaded in favor of heads or tails: P(heads) not equal to 0.5. In that case we would calculate the probability of getting a number of heads 60 or more above or below 500, in both directions. This probability is 0.016%, so we would reject our null hypothesis and conclude that the coin is not fair. The problem is that the test does not tell us in which direction it is loaded but, given the results, we assume it favors heads. In the first example we did a one-tailed test, while in the second we did a two-tailed test.
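
Both tails can be reproduced with a couple of lines (a sketch assuming scipy; since the binomial under the null hypothesis is symmetric here, the two-tailed p-value is simply twice the one-tailed one):

```python
# One-tailed and two-tailed p-values for 560 heads out of 1000 tosses.
from scipy.stats import binom

p_one_tailed = binom.sf(559, 1000, 0.5)   # P(560 heads or more) if the coin is fair
p_two_tailed = 2 * p_one_tailed           # P(560 or more heads, or 440 or fewer)
print(p_one_tailed, p_two_tailed)         # about 0.008% and 0.016%
```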

In the figure you can see the probability areas for both tests. In the one-tailed test, the small red area on the right represents the probability that the difference from the expected value is due to chance. In the two-tailed test, this area is doubled and located on both sides of the probability distribution. Notice that the two-tailed p-value doubles the one-tailed value. In our example, both p-values are so low that we can reject the null hypothesis in either case. But this is not always so, and there may be occasions when the researcher chooses to do a one-tailed test to obtain a statistical significance that is not achievable with the two-tailed test.

And I say one of the two tails because we have calculated the right-tail probability, but we could have calculated the probability of the left tail. Consider the unlikely scenario that, even though the coin is loaded in favor of tails, we got more heads just by chance. Our Ha now says that P(heads) < 0.5. In this case we would calculate the probability that, under the null hypothesis that the coin is fair, we get 560 heads or fewer. This p-value is 99.9%, so we cannot reject our null hypothesis that the coin is fair.

But what is going on here, you will ask. The first hypothesis test allowed us to reject the null hypothesis and the last test says otherwise. Being the same coin and the same data, shouldn’t we have reached the same conclusion? As it turns out, it seems not. Remember that not being able to reject the null hypothesis is not the same as concluding that it is true, something we can never be sure of. In the last example, the null hypothesis that the coin is fair is a better option than the alternative that it is loaded in favor of tails. However, that does not mean we can conclude that the coin is fair.

You see, therefore, how important it is to be clear about the meaning of the null and alternative hypotheses when doing a hypothesis test. And always remember that not being able to reject the null hypothesis does not necessarily imply that it is true. It could simply be that we do not have enough power to reject it. This leads me to think about type I and type II errors and their relation to power and sample size. But that’s another story…

Freedom in degrees

Freedom is one of those concepts that everyone can understand easily, but it is extremely difficult to define. If you don’t believe me, try to state a definition of freedom and you will see that it is not so easy. Right away, you’ll be running into other people’s freedom when trying to define yours or you’ll be wondering what kind of freedom you’re trying to define.

However, with degrees of freedom it is exactly the opposite. This term is far easier to define, but many have trouble understanding the exact meaning of this seemingly abstract concept.

The degrees of freedom are the number of observations in a sample that can take any possible value (that are “free” to take any value) once certain parameter estimates have been previously and independently calculated in the sample or its population of origin. Do you see now why I say it is easy to define but not so easy to understand? Let’s look at an example to make it a little clearer.

In a stroke of delusional imagination, let’s assume that we are school teachers. The school principal tells us there is a competition among neighboring schools and that we have to select five students to represent ours. The only condition we have to fulfill is that the final average grade of the five students must be seven points. Let’s also suppose that, as it happens, our eldest son, who scores an eight, is in the class. So, acting impartially, we pick him to represent his classmates. But we still need to pick four more, so why not be consistent with our sense of justice and choose his four friends: Philip scores a 9, John a 6, Louis a 5 (he scrapes through by the narrowest of margins) and Evaristo a 10 (the class nerd). What’s the problem? The average grade of the five is 7.6, and it should be exactly 7. What can we do?

Let’s try removing Louis; after all, he has the lowest grade. The problem is that we would then have to choose a student with a score of 2 to come up with an average of 7, and we cannot select a student who has failed his tests. Let’s try instead removing nerdy Evaristo; then we would need to look for a student with a score of 7. If you think about it, we can make all possible combinations with the five friends, but we always really choose only four, since the fifth is bound by the average value we set beforehand. And this means, no more and no less, that we have four degrees of freedom.
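
The same idea in two lines of code (a trivial sketch, using the hypothetical scores from the example): once the average is fixed, you can choose any four scores you like, but the fifth one is already determined.

```python
def fifth_score(four_scores, target_mean=7, group_size=5):
    # The sum must be target_mean * group_size, so the last score is forced.
    return target_mean * group_size - sum(four_scores)

print(fifth_score([8, 9, 6, 10]))  # without Louis: the fifth student must score 2
print(fifth_score([8, 9, 6, 5]))   # without Evaristo: the fifth must score 7
```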

When we make a statistical inference about a population, if we want the results to be reliable, we have to make each estimate independently. For instance, if we calculate the mean and the standard deviation we should do so independently, but this is not usually the case, since we need an estimate of the mean to calculate the standard deviation. This is why not all the observations can be considered free and independent of the mean: at least one of them will be conditioned by the value previously settled for the mean.

So you can see that the number of degrees of freedom indicates the number of independent observations involved in the estimation of a population parameter.

This is important because estimators follow specific frequency distributions whose shape depends on the number of degrees of freedom associated with the estimate. The greater the number of degrees of freedom, the narrower the frequency distribution and the higher the power of the study to make the estimation. Thus, power and degrees of freedom are positively related with the sample size, so that the larger the sample size the greater the number of degrees of freedom and hence the greater the power.

Calculating the number of degrees of freedom of a test is usually straightforward, but it differs depending on the test in question. The mean of a sample is the simplest case: we have already seen that its degrees of freedom equal n-1, where n is the sample size. Similarly, when we are dealing with two samples and two means, the number of degrees of freedom equals n1+n2-2. In general, when several parameters are estimated, the degrees of freedom are calculated as n-p-1, where p is the number of parameters to be estimated. This is useful when we do an analysis of variance to compare two or more means.
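
As a small check of the idea that the distribution narrows as the degrees of freedom grow (a sketch assuming scipy), you can watch the 97.5th percentile of Student's t shrink towards the normal value of 1.96:

```python
# Critical t value for a 95% two-sided test, for increasing degrees of freedom.
from scipy.stats import t

for df in (4, 10, 30, 100, 1000):
    print(df, round(t.ppf(0.975, df), 3))
```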

And so we could give examples for the calculation of each particular statistic or test we wanted to carry out. But that’s another story…

Size and power

Two associated qualities. And very enviable ones, too. Especially when it comes to scientific studies (what were you thinking about?). Although there are more factors involved, as we’ll see in a moment.

Let’s suppose we are measuring the mean of a variable in two samples to find out whether there are differences between them. We know that, just because of random sampling, the results of the two samples will be different, but we want to know whether that difference is large enough to allow us to suppose they are actually different.

To find out, we perform a hypothesis test using the appropriate statistic. In our case, let’s suppose we do a Student’s t test. We calculate the value of our t and estimate its probability. Most statistics, t included, follow a specific frequency or probability distribution. These distributions are generally bell-shaped, more or less symmetrical and centered on a certain value. Thus, values near the center are more likely to occur, while those at the extremes are less likely. By convention, when this probability is less than 5% we consider the occurrence of that value of the measured parameter to be unlikely.
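
In code, the comparison described above could look like this (a sketch with made-up data, assuming scipy is available):

```python
# Student's t test for the means of two independent samples (hypothetical data).
from scipy.stats import ttest_ind

group_a = [5.1, 4.8, 5.6, 5.0, 4.9, 5.3, 5.2, 4.7]
group_b = [5.9, 5.4, 6.1, 5.8, 5.6, 6.0, 5.7, 5.5]
t_stat, p_value = ttest_ind(group_a, group_b)
print(t_stat, p_value)  # if p < 0.05 we consider the difference unlikely to be due to chance
```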

But of course, unlikely is not synonymous with impossible. It may be that, by chance, we have chosen a sample that is not centered on the same value as the reference population, so the value occurs in spite of its low probability in this population.

And this is important because it can lead to errors in our conclusions. Remember that when we have two values to compare we establish the null hypothesis (H0) that the two values are equivalent, and that any difference is due to a random sampling error. Then, if we know its frequency distribution, we can calculate the probability of that difference occurring by chance. Finally, if it is less than 5% we consider it unlikely to be fortuitous and we reject H0: the difference is not the result of chance and there is a real effect or a real difference.

But again, unlikely is not impossible. If we have the misfortune of having chosen a sample that is biased with respect to the population, we could reject the null hypothesis without there being a real effect and commit a type 1 error.

Conversely, if the probability is greater than 5% we will not be able to reject H0 and we will say that the difference is due to chance. But here there is a little nuance that is important to consider. The null hypothesis is only falsifiable: we can reject it, but not affirm it. When we cannot reject it, if we assume it is true we run the risk of not detecting a trend that really exists. This is the type 2 error.

Usually we are more interested in accepting theories as safely as possible, so we look for a low type 1 error probability, usually 5%. This is called the alpha value. But the two types of error are interlinked, so a very low alpha compels us to accept a higher type 2 error (or beta) probability, generally 20%.

The complement of beta is what is called the power of the study (1-beta). This power is the probability of detecting an effect, given that it really exists or, in other words, the probability of not committing a type 2 error.

To understand the factors involved in the study power, let me pester you with a little equation:

1-\beta \propto \frac{ES\cdot \sqrt{n}\cdot \alpha}{\sigma}

ES represents the effect size, the magnitude of the difference we want to detect. Its being in the numerator implies that the smaller the effect (the more subtle the difference), the lower the power of the study to detect it. The same applies to the sample size (n) and to alpha: the larger the sample and the higher the significance level we tolerate (with an increased risk of type 1 error), the greater the power of the study. Finally, σ is the standard deviation: the more variability there is in the population, the lower the power of the study.

The utility of the above equation is that we can solve it for n to obtain the sample size, as follows:

n \propto \frac{(1-\beta)\cdot \sigma^{2}}{ES\cdot \alpha}

With this formula we can calculate the sample size we need to obtain the power we choose. Power (1-beta) is usually set at 0.8 (80%), that is, beta at 0.2. ES and σ are obtained from pilot studies, previous data or regulations and, if these do not exist, they are set by the researcher. Finally, as we have already mentioned, alpha is usually set at 0.05 (5%), although if we are very afraid of committing a type 1 error we can set it at 0.01.
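
For the curious, here is a sketch of how such a calculation usually looks in practice for comparing two means (the classic formula with normal quantiles, not the post's own proportionality; delta and sigma are purely illustrative values):

```python
# Per-group sample size: n = 2 * sigma^2 * (z_(1-alpha/2) + z_(1-beta))^2 / delta^2
from scipy.stats import norm

alpha, power = 0.05, 0.80
delta, sigma = 5.0, 10.0                  # hypothetical difference to detect and SD
z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)
n_per_group = 2 * sigma**2 * (z_alpha + z_beta)**2 / delta**2
print(round(n_per_group))                 # about 63 participants per group
```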

To close this post, I would like to draw your attention to the relationship between n and alpha in the first equation. Notice that the power does not change if we increase the sample size and concomitantly lower the significance level. This leads to the situation where, sometimes, obtaining statistical significance is only a matter of increasing the sample size enough. It is therefore essential to assess the clinical relevance of the results and not just their p-values. But that’s another story…