Rioja vs Ribera

Frequentist vs Bayesian statistics

This is one of the typical debates you can have with a brother-in-law during a family dinner: whether the wine from Ribera is better than the one from Rioja, or vice versa. In the end, as always, the brother-in-law will be (or will want to be) right, which will not prevent us from trying to contradict him. Of course, we must argue well to avoid falling into the same error that, in my humble opinion, some people make when they take part in another classic debate, this one from the less playful field of epidemiology: frequentist or Bayesian statistics?

And these are the two approaches that we can use when dealing with a research problem.

Some preliminary definitions

Frequentist statistics, the best known and the one we are most accustomed to, is developed according to the classical concepts of probability and hypothesis testing. It is about reaching a conclusion based on the level of statistical significance and the acceptance or rejection of a working hypothesis, always within the framework of the study being carried out. This methodology forces us to set the decision parameters a priori, which avoids subjectivity about them.

The other approach to solving problems is that of Bayesian statistics, which is increasingly fashionable and, as its name suggests, is based on the probabilistic concept of Bayes’ theorem. Its differentiating feature is that it incorporates information external to the study being carried out, so that the probability of a certain event can be modified by the prior information we have about it. Thus, the information obtained a priori is used to establish an a posteriori probability that allows us to make the inference and reach a conclusion about the problem we are studying.

This is another difference between the two approaches: while frequentist statistics avoids subjectivity, the Bayesian approach introduces a subjective (but not capricious) definition of probability, based on the researcher’s conviction, to make judgments about a hypothesis.

Bayesian statistics is not really new. Thomas Bayes’ theory of probability was published in 1763, but it has experienced a resurgence since the last third of the twentieth century. And, as usually happens when there are two alternatives, supporters and detractors of both methods appear, deeply involved in the fight to demonstrate the benefits of their preferred method, sometimes looking more for the weaknesses of the other side than for their own strengths.

And that is what we are going to talk about in this post: some arguments that Bayesians occasionally use which, once more in my humble opinion, take advantage of the misuse of frequentist statistics by many authors rather than of intrinsic defects of the methodology.

A bit of history

We will start with a bit of history.

The history of hypothesis testing begins back in the 1920s, when the great Ronald Fisher proposed assessing the working hypothesis (that of absence of effect) through a specific observation and the probability of observing a value equal to or greater than the observed result. This probability is the p-value, so sacred and so often misinterpreted, which means nothing more than that: the probability of finding a value equal to or more extreme than the one found if the working hypothesis were true.

In summary, the p that Fisher proposed is nothing more than a measure of the discrepancy that could exist between the data found and the working hypothesis proposed, the null hypothesis (H0).
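
For those who like to see the idea in numbers, here is a minimal sketch in Python (the 60 heads in 100 tosses are invented figures): we simulate the working hypothesis many times and simply count how often a result at least as extreme as the observed one appears.

```python
# A sketch of Fisher's idea with invented numbers: the p-value is simply the
# probability, under the working (null) hypothesis, of a result at least as
# extreme as the one actually observed.
import numpy as np

rng = np.random.default_rng(42)
observed_heads = 60               # hypothetical result: 60 heads in 100 tosses

# Simulate the null hypothesis (a fair coin) many times and count how often
# a result at least as extreme as the observed one appears.
null_results = rng.binomial(n=100, p=0.5, size=100_000)
p_value = np.mean(null_results >= observed_heads)
print(f"approximate one-sided p-value: {p_value:.3f}")   # ≈ 0.028
```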

Almost a decade later, the concept of the alternative hypothesis (H1) was introduced, which did not exist in Fisher’s original approach, and the reasoning was modified on the basis of two error rates, those of false positives and false negatives:

  1. Alpha error (type 1 error): probability of rejecting the null hypothesis when, in fact, it is true. It would be the false positive: we believe we detect an effect that, in reality, does not exist.
  2. Beta error (type 2 error): it is the probability of accepting the null hypothesis when, in fact, it is false. It is the false negative: we fail to detect an effect that actually exists.

Thus, we set a maximum value for what seems to us the worst-case scenario, which is detecting a false effect, and we choose a “small” value. How small? Well, by convention, 0.05 (sometimes 0.01). But, I repeat, it is a value chosen by agreement (and there are those who call it capricious, because 5% reminds them of the fingers of the hand, which are usually five).

Thus, if p < 0.05, we reject H0 in favor of H1. Otherwise, we accept H0 (strictly speaking, we fail to reject it), the hypothesis of no effect. It is important to note that we can only reject H0, never demonstrate it in a positive way. We can demonstrate the effect, but not its absence.
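
To see what these two error rates mean in the long run, here is a small simulation sketch; the sample size, the effect and the number of repetitions are all assumptions chosen for the example.

```python
# A sketch (with invented parameters) of what the two error rates mean in the
# long run: many simulated trials under H0 and many under a real effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, alpha, n_sims = 30, 0.05, 5_000
false_positives, misses = 0, 0

for _ in range(n_sims):
    # Scenario 1: H0 is true (no difference between the two groups).
    a, b = rng.normal(0, 1, n), rng.normal(0, 1, n)
    if stats.ttest_ind(a, b).pvalue < alpha:
        false_positives += 1          # type 1 error (false positive)
    # Scenario 2: H1 is true (a real difference of 0.5 standard deviations).
    a, b = rng.normal(0, 1, n), rng.normal(0.5, 1, n)
    if stats.ttest_ind(a, b).pvalue >= alpha:
        misses += 1                   # type 2 error (false negative)

print(f"type 1 error rate ≈ {false_positives / n_sims:.3f}")  # close to alpha
print(f"type 2 error rate ≈ {misses / n_sims:.3f}")           # this is beta
```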

Everything said so far seems easy to understand: the frequentist method tries to quantify the level of uncertainty of our estimate in order to draw a conclusion from the results. The problem is that p, which is nothing more than a way of quantifying this uncertainty, is sacralized and misinterpreted too often, something that the opponents of the method use to their advantage (if I may put it that way) to try to expose its weaknesses.

One of the major flaws attributed to the frequentist method is the dependence of the p-value on the sample size. Indeed, the value of p can be the same with a small effect size in a large sample as with a large effect size in a small sample. And this is more important than it may seem at first, since the value that will allow us to reach a conclusion will depend on a decision exogenous to the problem we are examining: the chosen sample size.
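
A quick sketch with made-up summary figures illustrates the point: a tiny effect measured in a huge sample and a large effect measured in a tiny one can produce equally “significant” p-values.

```python
# A sketch with made-up summary figures: a small effect in a large sample and
# a large effect in a small sample can yield essentially the same t statistic.
from scipy import stats

# Small effect (0.2 SD) with 1000 patients per group...
small_effect = stats.ttest_ind_from_stats(mean1=0.2, std1=1, nobs1=1000,
                                           mean2=0.0, std2=1, nobs2=1000)
# ...versus a large effect (1 SD) with only 40 patients per group.
large_effect = stats.ttest_ind_from_stats(mean1=1.0, std1=1, nobs1=40,
                                           mean2=0.0, std2=1, nobs2=40)

print(f"small effect, big sample:   p = {small_effect.pvalue:.1e}")
print(f"large effect, small sample: p = {large_effect.pvalue:.1e}")
# Both comparisons come out "very significant", yet their clinical meaning
# is completely different.
```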

Here is where the advantage of the Bayesian method would lie, in which larger samples simply provide more and more information about the phenomenon under study. But I think this argument is based on a misunderstanding of what an adequate sample is. I am convinced that more is not always better.

We start with the debate

Another great man, David Sackett, said that “samples that are too small can be used to prove nothing; samples that are too large can be used to prove anything”. The problem is that, in my opinion, a sample is neither large nor small, but sufficient or insufficient to demonstrate the existence (or not) of an effect size that is considered clinically important.

And this is the heart of the matter. When we want to study the effect of an intervention we must, a priori, define what effect size we want to detect and calculate the sample size needed to be able to do so, provided the effect exists (something we hope for when we plan the experiment, but do not know a priori). When we do a clinical trial we are spending time and money, in addition to exposing participants to potential risk, so it is important to include only the participants needed to try to demonstrate the clinically important effect. Including however many participants it takes to reach the desired p < 0.05, besides being uneconomical and unethical, shows a lack of understanding of the true meaning of the p-value and of sample size.
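
As an illustration, this is roughly what that a priori calculation looks like using the usual normal approximation for comparing two means; the clinically important difference, the standard deviation and the error rates are assumptions chosen for the example, and the helper function is mine, not any particular library’s.

```python
# A sketch of an a priori sample size calculation for comparing two means,
# using the usual normal approximation. The clinically important difference,
# the standard deviation and the error rates are assumptions for the example.
from scipy import stats

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate patients per group to detect a difference `delta`."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)   # two-sided test
    z_beta = stats.norm.ppf(power)
    return 2 * ((z_alpha + z_beta) * sd / delta) ** 2

# Example: we consider a 5 mmHg drop in blood pressure clinically important
# and assume a standard deviation of 10 mmHg.
print(round(n_per_group(delta=5, sd=10)))   # ≈ 63 patients per group
```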

This misinterpretation of the p-value is also the reason why many authors who do not reach the desired statistical significance allow themselves to claim that with a larger sample size they would have achieved it. And they are right: they would have reached the desired p < 0.05, but they again ignore the importance of clinical significance versus statistical significance.

When the sample size needed to detect the clinically important effect is calculated a priori, the power of the study is also calculated, which is the probability of detecting the effect if it actually exists. If the power is greater than 80-90%, the values accepted by convention, it does not seem correct to say that the sample is too small. And, of course, if you have not calculated the power of the study beforehand, you should do it before claiming that you have no results because of an insufficient sample.
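
And this is the companion calculation, again a sketch based on the normal approximation and the same invented figures: given the sample we actually have, what is the probability of detecting the clinically important effect if it really exists?

```python
# The companion calculation, again with the normal approximation and the same
# invented figures: the probability of detecting the clinically important
# effect (if it exists) with the sample we actually have.
from scipy import stats

def power_two_means(delta, sd, n_per_group, alpha=0.05):
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    return stats.norm.cdf(delta / (sd * (2 / n_per_group) ** 0.5) - z_alpha)

# With 63 patients per group, a 5 mmHg difference and a SD of 10 mmHg:
print(f"power ≈ {power_two_means(delta=5, sd=10, n_per_group=63):.2f}")  # ≈ 0.80
```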

Another argument against the frequentist method, and in favor of the Bayesian one, says that hypothesis testing is a dichotomous decision process in which a hypothesis is rejected or accepted just as you reject or accept an invitation to the wedding of a distant cousin you haven’t seen for years.

Well, just as they forgot about clinical significance, those who make this claim forget about our beloved confidence intervals. The results of a study should not be interpreted solely on the basis of the p-value. We must look at the confidence intervals, which tell us about the precision of the result and the range of values that the observed effect may plausibly take, values that we cannot pin down further because of the effect of chance. As we saw in a previous post, the analysis of confidence intervals can sometimes give us clinically important information even when the p-value is not statistically significant.
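
A small sketch with invented figures shows the idea: a difference that does not reach the magic p < 0.05 can still have a confidence interval containing clinically important values, which is very different from having proved that there is no effect.

```python
# A sketch with invented figures: a result with p > 0.05 whose confidence
# interval still contains clinically relevant values.
from scipy import stats

diff = 4.0          # observed difference between groups (e.g. mmHg)
se = 2.5            # standard error of that difference

z = diff / se
p_value = 2 * stats.norm.sf(abs(z))
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

print(f"p = {p_value:.2f}")                        # ≈ 0.11, not "significant"
print(f"95% CI: {ci_low:.1f} to {ci_high:.1f}")    # ≈ -0.9 to 8.9
# The interval includes 0, but it also includes differences large enough to
# matter clinically: the study is inconclusive rather than "negative".
```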

More arguments

Finally, some detractors of the frequentist method say that hypothesis testing makes decisions without considering information external to the experiment. Once again, a misinterpretation of the p-value.

As we already said in a previous post, a value of p < 0.05 does not mean that H0 is false, nor that the study is more reliable, nor that the result is important (even if the p-value has six zeros). But, most importantly for what we are discussing now, it is false that the p-value represents the probability that H0 is false (the probability that the effect is real).

Once our results allow us to affirm, with a small margin of error, that the detected effect is real and not due to chance (in other words, when the p-value is statistically significant), we can calculate the probability that the effect is “real”. And for this, oh, surprise!, we will have to calibrate the value of p with the baseline probability of H0, which will be assigned by the researcher based on her knowledge or on previously available data (which is, after all, a Bayesian approach).
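
Just to sketch the idea, here is one common way of doing that calibration, based on the well-known minimum Bayes factor bound of -e·p·ln(p); whether this is exactly what the nomogram mentioned at the end of the post uses is an assumption on my part, so take it as an illustration of the concept rather than as the method.

```python
# A sketch of one common way to calibrate a p-value against a prior
# probability of H0: the minimum Bayes factor bound -e * p * ln(p). Whether
# this matches the nomogram mentioned at the end of the post is an assumption;
# take it only as an illustration of the idea.
import math

def min_posterior_prob_h0(p_value, prior_prob_h0):
    """Lower bound on P(H0 | data) from a p-value and a prior P(H0)."""
    if not 0 < p_value < 1 / math.e:
        raise ValueError("the bound only applies for p < 1/e")
    min_bayes_factor = -math.e * p_value * math.log(p_value)  # BF of H0 vs H1
    prior_odds = prior_prob_h0 / (1 - prior_prob_h0)
    posterior_odds = min_bayes_factor * prior_odds
    return posterior_odds / (1 + posterior_odds)

# Example: p = 0.04 and a sceptical 50% prior probability that there is no effect.
print(f"P(H0 | data) >= {min_posterior_prob_h0(0.04, 0.5):.2f}")   # ≈ 0.26
```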

As you can see, the assessment of the credibility or plausibility of the hypothesis, one of the differentiating characteristics of the Bayesian approach, can also be used with frequentist methods.

We’re leaving…

And here we are going to leave it for today. But before finishing I would like to make a couple of considerations.

First, in Spain we have many great wines all over the country, not just Ribera and Rioja. So that no one gets offended, I have chosen these two because they are usually the ones my brothers-in-law ask for when they come to dinner at home.

Second, do not misunderstand me if I have seemed to be an advocate of frequentist statistics against the Bayesian approach. Just as when I go to the supermarket I am happy to be able to buy wine from several designations of origin, in research methodology I find it very good to have different ways of approaching a problem. If I want to know whether my team is going to win a match, it does not seem very practical to repeat the match 200 times to see what average result comes out. It would be better to try to make an inference taking the previous results into account.

And that’s all. We have not gone into depth about what we mentioned at the end, the real probability of the effect, somehow mixing both approaches, frequentist and Bayesian. The easiest way, as we saw in a previous post, is to use Held’s nomogram. But that is another story…

The false coin

Today we’re going to continue playing with coins. In fact, we’re going to play with two coins, one of them a fair coin and the other one faker than Judas Iscariot, loaded to give more heads than tails when flipped. I recommend you sit back and relax before starting.

It turns out we have a loaded coin. By definition, the probability of getting heads when tossing a fair coin is 0.5 (50%). However, our fake coin lands on heads 70% of the time (probability 0.7), which comes in handy because we can use it whenever we want to dodge some unpleasant task. We only have to offer our coin, call heads and trust that luck will favor us thanks to our unfair coin.

Let’s suppose now we have been so careless as to mix the fake coin with the others. How can we know which one is the false one? And this is when we think about our game. Let’s imagine what would happen if we flipped a coin 100 times in a row. If the coin is fair, we expect to get heads about 50 times, whereas if it were our false one, we’d expect about 70 heads. So we can choose a coin at random, toss it 100 times and, counting the number of heads, decide whether it is fair or not. We can arbitrarily choose a value between 50 and 70, say 65, and state: if we get 65 heads or more, our coin is the loaded one; if we get fewer than 65, we’ll say it is a fair coin.

But anyone immediately realizes that this method is not foolproof. On the one hand, we can get 67 heads with a fair coin and conclude it is not fair, when it actually is. But it can also happen that, just by chance, we get 60 heads with the loaded coin and conclude it is fair. Can we solve this problem and avoid reaching the wrong conclusion? Well, the truth is that we can’t, but what we can do is measure the probability of making a mistake.

If we use a binomial probability calculator (the bravest of you can do the calculations by hand) we’ll come up with a probability of getting 65 heads or more with the fair coin of 0.17%, while the probability of getting them with the loaded coin is 88.4%. So we find ourselves with four possibilities, which I represent in the accompanying table.
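
If you prefer the computer to do the counting, this little sketch reproduces those two figures with the exact binomial distribution (the numbers 100, 65, 0.5 and 0.7 are, of course, the ones from our game).

```python
# A quick check of those two figures with the exact binomial distribution
# (the numbers 100, 65, 0.5 and 0.7 are the ones from our game).
from scipy import stats

p_fair = stats.binom.sf(64, 100, 0.5)     # P(65 or more heads | fair coin)
p_loaded = stats.binom.sf(64, 100, 0.7)   # P(65 or more heads | loaded coin)

print(f"fair coin:   {p_fair:.4f}")    # ≈ 0.0018, the ≈0.17% quoted above
print(f"loaded coin: {p_loaded:.3f}")  # ≈ 0.884, the 88.4% quoted above
# As we will see below, the first figure is the significance level of our
# test and the second one is its power.
```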

In this case, our null hypothesis says that the coin is fair, while the alternative hypothesis says that the coin is loaded in favor of heads.

Let’s start with the case in which the test concludes that the coin is fair (we get fewer than 65 heads). The first possibility is that the coin is actually fair. In that case we are simply right, and there is no more to say about that situation.

The second possibility is that, despite the conclusion of our test, the coin is faker than the kiss of a mother-in-law. In that case we will have made a mistake that someone with little imagination named the type II error: we have accepted the null hypothesis that the coin is fair when it is actually loaded.

Let’s now suppose that our test concludes that the coin is loaded. If the coin is actually fair, we will be wrong again, but this time we will have committed a type I error: we reject the null hypothesis that the coin is fair when, in fact, it is.

Finally, if we conclude that it is not fair and it is actually loaded, we will be right again.

We can see in the table that the probability of making a type I error is, in this example, 0.17%. This is the statistical significance level of our test, which is simply the probability of rejecting our null hypothesis that the coin is fair (concluding it is false) when it is, in fact, fair. On the other hand, the probability of being right when the coin is false is 88.4%. This probability is called the power of the test, and it is simply the probability of being right when the test concludes the coin is loaded (in other words, of rejecting the null hypothesis and being right).

If you think about it a little, you will see that the type II error is the complement of the power. When the coin is not fair, the probability of accepting that it is fair (the type II error) plus the probability of being right and concluding that it is false must add up to 100%. Thus, the type II error equals 1 minus the power.

This statistical significance we have just seen is the same as the famous p-value: it is simply the probability of committing a type I error. By convention, it is generally accepted as tolerable when it is below 0.05 (5%), since in general we prefer not to accept a false hypothesis. This is why scientific studies look for low significance values and high power, although the two are linked: the more significance we demand (the lower the alpha), the lower the power, and vice versa.

And this is the end for now. To those of you who have got this far through this rigmarole without getting lost at all, my sincere congratulations, because the truth is that this post seems like a play on words. And we could have said something about significance and the calculation of confidence intervals, sample sizes, etc. But that’s another story…

Size and power

Two associated qualities. And very enviable ones, too. Especially when it comes to scientific studies (what were you thinking about?). Although there are more factors involved, as we’ll see in a moment.

Let’s suppose we are measuring the mean of a variable in two samples to find out whether there are differences between them. We know that, just because of random sampling, the results of the two samples will differ, but we want to know whether that difference is wide enough to let us suppose that they are actually different.

To find out, we perform a hypothesis test using the appropriate statistic. In our case, let’s suppose we do a Student’s t test. We calculate the value of our t and estimate its probability. Most statistics, t included, follow a specific frequency or probability distribution. These distributions are generally bell-shaped, more or less symmetrical and centered on a certain value. Thus, values near the center are more likely to occur, while those at the extremes are less likely. By convention, when this probability is less than 5% we consider the occurrence of that value of the statistic unlikely.
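
For the curious, here is a toy version of that comparison with completely made-up data; the means, the standard deviation and the sample sizes are arbitrary choices for the example.

```python
# A toy version of that comparison with completely made-up data: two simulated
# samples and a Student's t test on the difference between their means.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group_a = rng.normal(loc=100, scale=15, size=40)   # e.g. some lab value
group_b = rng.normal(loc=108, scale=15, size=40)   # shifted by 8 units

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# If p < 0.05 we call the observed t "unlikely under H0" and reject the
# hypothesis that both samples come from the same population.
```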

But of course, unlikely is not synonymous with impossible. It may be that, by chance, we have chosen a sample that is not centered on the same value as the reference population, so the value occurs in spite of its low probability in that population.

And this is important because it can lead to errors in our conclusions. Remember that when we have two values to compare we establish the null hypothesis (H0) that the two are equivalent, and that any difference is due to random sampling error. Then, if we know its frequency distribution, we can calculate the probability of that difference occurring by chance. Finally, if that probability is less than 5% we consider it unlikely to be fortuitous and we reject H0: the difference is not the result of chance and there is a real effect or a real difference.

But again, unlikely is not impossible. If we have the misfortune of having chosen a sample biased with respect to the population, we could reject the null hypothesis without there being a real effect and thus commit a type 1 error.

Conversely, if the probability is greater than 5% we will not be able to reject H0 and we will say that the difference is due to chance. But here there is a subtle point that it is important to consider. The null hypothesis is only falsifiable: we can reject it, but not affirm it. When we cannot reject it, if we assume it is true we run the risk of failing to detect a trend that really exists. This is the type 2 error.

Usually we are more interested in accepting theories as safely as possible, so we look for low type 1 error probabilities, usually 5%. This is called the alpha value. But the two types of errors are interlinked, so a very low alpha compels us to accept a higher type 2 error (or beta) probability, generally 20%.

The complement of beta is what is called the power of the study (1-beta). This power is the probability of detecting an effect, given that it really exists or, put another way, the probability of not committing a type 2 error.

To understand the factors involved in the power of a study, let me pester you with a little equation:

1-\beta \propto \frac{SE\sqrt{n}\alpha }{\sigma }

SE represents the effect size, the magnitude of the difference we want to detect. Its place in the numerator implies that the smaller the effect (the more subtle the difference), the lower the power of the study to detect it. The same applies to the sample size (n) and to alpha: the larger the sample and the higher the significance level we tolerate (with an increased risk of type 1 error), the greater the power of the study. Finally, σ is the standard deviation: the more variability there is in the population, the lower the power of the study.
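
To see each factor pulling in the direction just described, here is a small sketch that uses the standard normal approximation for the comparison of two means rather than the loose proportionality above; every number in it is invented for the example.

```python
# A sketch of each factor pulling in the direction just described, using the
# usual normal approximation for comparing two means (all numbers invented).
from scipy import stats

def power(delta, sigma, n, alpha=0.05):
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    return stats.norm.cdf(delta * (n / 2) ** 0.5 / sigma - z_alpha)

print(f"baseline:         {power(delta=5, sigma=10, n=50):.2f}")   # ≈ 0.71
print(f"bigger effect:    {power(delta=10, sigma=10, n=50):.2f}")  # power up
print(f"bigger sample:    {power(delta=5, sigma=10, n=100):.2f}")  # power up
print(f"laxer alpha:      {power(delta=5, sigma=10, n=50, alpha=0.10):.2f}")  # up
print(f"more variability: {power(delta=5, sigma=20, n=50):.2f}")   # power down
```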

The usefulness of the first equation is that we can solve it for n to obtain the sample size:

n \propto \frac{(1-\beta)^{2}\,\sigma^{2}}{SE^{2}\,\alpha^{2}}

With this formula we can calculate the sample size we need to achieve the power we choose. The power (1-beta) is usually set at 0.8 (80%). SE and σ are obtained from pilot studies, previous data or regulations and, if none of these exist, they are set by the researcher. Finally, as we have already mentioned, alpha is usually set at 0.05 (5%), although if we are very afraid of committing a type 1 error we can set it at 0.01.

To close this post, I would like to draw your attention to the relationship between n and alpha in the first equation. Notice that the power does not change if we increase the sample size and concomitantly lower the significance level. This leads to the situation that, sometimes, obtaining statistical significance is only a matter of making the sample large enough. It is therefore essential to assess the clinical relevance of the results and not just their p-values. But that’s another story…