Rioja vs Ribera

Frequentist vs Bayesian statistics

This is one of the typical debates one can have with a brother-in-law during a family dinner: whether the wine from Ribera is better than that from Rioja, or vice versa. In the end, as always, the brother-in-law will be (or will want to be) right, which will not prevent us from trying to contradict him. Of course, we must bring good arguments to avoid falling into the same error that, in my humble opinion, some people make when they take part in another classic debate, this one from the less playful field of epidemiology: frequentist versus Bayesian statistics.

And these are the two approaches that we can use when dealing with a research problem.

Some previous definitions

Frequentist statistics, the best known and the one we are most accustomed to, is the one developed according to the classic concepts of probability and hypothesis testing. Thus, it is about reaching a conclusion based on the level of statistical significance and the acceptance or rejection of a working hypothesis, always within the framework of the study being carried out. This methodology forces us to set the decision parameters a priori, which avoids subjectivity regarding them.

The other approach to solving problems is that of Bayesian statistics, which is increasingly fashionable and, as its name suggests, is based on the probabilistic concept of Bayes’ theorem. Its differentiating feature is that it incorporates information external to the study being carried out, so that the probability of a certain event can be modified by the prior information we have about it. Thus, the information available a priori is used to establish an a posteriori probability that allows us to make the inference and reach a conclusion about the problem we are studying.

This is another difference between the two approaches: while frequentist statistics avoids subjectivity, the Bayesian approach introduces a subjective (but not capricious) definition of probability, based on the researcher’s conviction, to make judgments about a hypothesis.

Bayesian statistics is not really new. Thomas Bayes’ theory of probability was published in 1763, but it has experienced a resurgence since the last third of the twentieth century. And, as usually happens when there are two alternatives, supporters and detractors of both methods have appeared, deeply involved in the fight to demonstrate the benefits of their preferred method, sometimes looking more for the weaknesses of the rival than for their own strengths.

And that is what we are going to talk about in this post: some arguments that Bayesians use now and then which, once more in my humble opinion, exploit the misuse of frequentist statistics by many authors rather than any intrinsic defects of the methodology itself.

A bit of history

We will start with a bit of history.

The history of hypothesis testing begins back in the 1920s, when the great Ronald Fisher proposed assessing the working hypothesis (that of no effect) by means of a specific observation and the probability of observing a value equal to or greater than the observed result. This probability is the p-value, so sacred and so often misinterpreted, which means nothing more than that: the probability of finding a value equal to or more extreme than the one found if the working hypothesis were true.

In summary, the p that Fisher proposed is nothing more than a measure of the discrepancy that could exist between the data found and the proposed working hypothesis, the null hypothesis (H0).
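To make the definition concrete, here is a minimal sketch in Python with invented numbers (a coin-tossing experiment, not any real study): under a null hypothesis of no effect, a fair coin, we compute the probability of a result at least as extreme as the one observed.

```python
from math import comb

n, observed = 20, 15   # hypothetical experiment: 15 "successes" in 20 trials
p_h0 = 0.5             # H0: no effect (a fair coin)

# P(X >= observed) under the binomial distribution implied by H0
p_value = sum(comb(n, k) * p_h0**k * (1 - p_h0)**(n - k)
              for k in range(observed, n + 1))
print(f"p-value = {p_value:.4f}")   # ~0.0207: rare if H0 were true
```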

Almost a decade later, the concept of the alternative hypothesis (H1) was introduced, which did not exist in Fisher’s original approach, and the reasoning was reformulated in terms of two error rates, false positive and false negative:

  1. Alpha error (type 1 error): the probability of rejecting the null hypothesis when, in fact, it is true. It is the false positive: we believe we detect an effect that, in reality, does not exist.
  2. Beta error (type 2 error): the probability of accepting the null hypothesis when, in fact, it is false. It is the false negative: we fail to detect an effect that actually exists.

Thus, we set a maximum value for what seems to us the worst-case scenario, which is detecting a false effect, and we choose a “small” value. How small? Well, by convention, 0.05 (sometimes 0.01). But, I repeat, it is a value chosen by agreement (and there are those who say it is capricious, because 5% reminds them of the fingers of a hand, which usually number five).

Thus, if p < 0.05, we reject H0 in favor of H1. Otherwise, we fail to reject H0, the hypothesis of no effect. It is important to note that we can only reject H0, never demonstrate it in a positive way. We can demonstrate the effect, but not its absence.

Everything said so far seems easy to understand: the frequentist method tries to quantify the level of uncertainty of our estimate in order to draw a conclusion from the results. The problem is that the p-value, which is nothing more than a way to quantify this uncertainty, is sacralized and misinterpreted far too often, and opponents of the method use this to their advantage (if I may put it that way) to try to expose its weaknesses.

One of the major flaws attributed to the frequentist method is the dependence of the p-value on the sample size. Indeed, the value of p can be the same with a small effect size in a large sample as with a large effect size in a small sample. And this is more important than it may seem at first, since the value that will allow us to reach a conclusion will depend on a decision exogenous to the problem we are examining: the chosen sample size.
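To illustrate this dependence, here is a hedged sketch with invented data (a hand-rolled two-proportion z-test, not any particular author’s method): the very same three-point difference in proportions moves from “non-significant” to “highly significant” just by growing the sample.

```python
from math import sqrt
from scipy.stats import norm

def two_prop_p(p1, p2, n):
    """Two-sided z-test p-value for two proportions, n per group."""
    p_pool = (p1 + p2) / 2                     # pooled proportion
    se = sqrt(2 * p_pool * (1 - p_pool) / n)   # standard error of the difference
    z = abs(p1 - p2) / se
    return 2 * norm.sf(z)                      # two-tailed tail area

# The same 3-point difference, judged only through p:
for n in (100, 1000, 10000):
    print(f"n = {n:>5} per group -> p = {two_prop_p(0.50, 0.53, n):.5f}")
# n =   100 per group -> p ~ 0.67 ("no effect")
# n = 10000 per group -> p ~ 0.00002 ("highly significant"), same effect
```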

Here is where the Bayesian method would claim its advantage, since larger samples simply provide more and more information about the phenomenon under study. But I think this argument rests on a misunderstanding of what an adequate sample is. I am convinced that more is not always better.

Let’s start the debate

Another great man, David Sackett, said that “samples that are too small can prove nothing; samples that are too large can prove anything”. The problem is that, in my opinion, a sample is neither large nor small, but sufficient or insufficient to demonstrate the existence (or absence) of an effect size that is considered clinically important.

And this is the heart of the matter. When we want to study the effect of an intervention we must define, a priori, the effect size we want to detect and calculate the sample size necessary to detect it, provided the effect exists (something we hope for when we plan the experiment, but which we do not know a priori). When we do a clinical trial we are spending time and money, in addition to subjecting participants to potential risk, so it is important to include only those participants needed to try to demonstrate the clinically important effect. Recruiting however many participants are needed to reach the desired p < 0.05, besides being uneconomical and unethical, demonstrates a lack of knowledge about the true meaning of the p-value and of sample size.

This misinterpretation of the p-value is also the reason why many authors who fail to reach the desired statistical significance allow themselves to claim that with a larger sample size they would have achieved it. And they are right: they would have reached the desired p < 0.05, but they again ignore the importance of clinical significance versus statistical significance.

When the sample size needed to detect the clinically important effect is calculated a priori, the power of the study is also set, which is the probability of detecting the effect if it actually exists. If the power is greater than 80-90%, the values accepted by convention, it does not seem correct to say that the sample is insufficient. And, of course, if you have not calculated the power of the study beforehand, you should do so before claiming that you lack results because of a shortage of sample.
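As an illustration of that a priori calculation, here is a minimal sketch using the standard normal-approximation formula for comparing two means; the effect size, standard deviation, alpha and power below are invented for the example.

```python
from math import ceil
from scipy.stats import norm

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Participants per group to detect a mean difference `delta`
    (standard deviation `sigma`) with two-sided alpha and given power."""
    z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)            # 0.84 for 80% power
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

# e.g. to detect a clinically important drop of 10 mmHg (sd 20 mmHg):
print(n_per_group(delta=10, sigma=20))  # 63 participants per group
```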

Another argument against the frequentist method, and in favor of the Bayesian one, says that hypothesis testing is a dichotomous decision process, in which a hypothesis is rejected or accepted just as one rejects or accepts an invitation to the wedding of a distant cousin you haven’t seen for years.

Well, just as they previously forgot about clinical significance, those who make this claim forget about our beloved confidence intervals. The results of a study should not be interpreted solely on the basis of the p-value. We must look at the confidence intervals, which inform us of the precision of the result and of the range of values that the observed effect may plausibly take, beyond which we cannot be more specific because of the effect of chance. As we saw in a previous post, the analysis of confidence intervals can sometimes give us clinically important information even when the p-value is not statistically significant.
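For instance, a minimal sketch of such an interval for a difference between two cure rates (all figures invented):

```python
from math import sqrt
from scipy.stats import norm

cured_a, n_a = 45, 100   # hypothetical treatment arm
cured_b, n_b = 30, 100   # hypothetical control arm

p_a, p_b = cured_a / n_a, cured_b / n_b
diff = p_a - p_b
se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
z = norm.ppf(0.975)      # 1.96 for a 95% interval

print(f"difference = {diff:.2f}, 95% CI {diff - z*se:.2f} to {diff + z*se:.2f}")
# difference = 0.15, 95% CI 0.02 to 0.28: the whole range of plausible
# effects matters, not just whether p crossed 0.05
```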

More arguments

Finally, some detractors of the frequentist method say that the hypothesis test makes decisions without considering information external to the experiment. Again, a misinterpretation of the value of p.

As we already said in a previous post, a value of p < 0.05 does not mean that H0 is false, nor that the study is more reliable, nor that the result is important (even if the p-value has six zeros). But, most importantly for what we are discussing now, it is false that the p-value represents the probability that H0 is false (the probability that the effect is real).

Once our results allow us to affirm, with a small margin of error, that the detected effect is real and not due to chance (in other words, when the p-value is statistically significant), we can calculate the probability that the effect is “real”. And for this, oh, surprise!, we will have to calibrate the p-value against the baseline probability of H0, which is assigned by the researcher based on her knowledge or on previously available data (which is, after all, a Bayesian approach).
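As a sketch of that calibration, one common approximation is the minimum Bayes factor -e·p·ln(p) of Sellke, Bayarri and Berger, which underlies nomograms such as Held’s; the prior probability and the p-value below are, of course, invented.

```python
from math import e, log

def posterior_h0(p_value, prior_h0):
    """Lower bound on P(H0 | data) from a p-value and a prior."""
    bf = -e * p_value * log(p_value)     # minimum Bayes factor, valid for p < 1/e
    prior_odds = prior_h0 / (1 - prior_h0)
    post_odds = prior_odds * bf          # Bayes' theorem in odds form
    return post_odds / (1 + post_odds)

# A "significant" p = 0.049 with a sceptical prior P(H0) = 0.5:
print(round(posterior_h0(0.049, 0.5), 2))   # ~0.29: H0 is far from dead
```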

As you can see, the assessment of the credibility or likelihood of the hypothesis, one of the differentiating features of the Bayesian approach, can also be used with frequentist methods.

We’re leaving…

And here we are going to leave it for today. But before finishing I would like to make a couple of considerations.

First, in Spain we have many great wines all over the country, not just from Ribera or Rioja. So that no one gets offended, I have chosen these two because they are usually the ones my brothers-in-law ask for when they come to dinner at home.

Second, do not misunderstand me if it has seemed to you that I am an advocate of frequentist statistics against the Bayesian approach. Just as when I go to the supermarket I am happy to be able to buy wine from several designations of origin, in research methodology I find it very good to have different ways of approaching a problem. If I want to know whether my team is going to win a match, it does not seem very practical to repeat the match 200 times to see what average result comes out. It would be better to try to make an inference taking the previous results into account.

And that’s all. We have not gone into depth on what we mentioned at the end about the real probability of the effect, which somehow mixes the two approaches, frequentist and Bayesian. The easiest way to do it, as we saw in a previous post, is to use Held’s nomogram. But that is another story…

A case of misleading probability

Bayes’ theorem

Today we are going to see another of those examples where intuition about the value of certain probabilities plays tricks on us. And for that we will use nothing less than Bayes’ theorem, playing a little with conditional probabilities. Let’s see step by step how it works.

What is the probability of two events occurring? The probability of an event A occurring is P(A) and that of B, P(B). Well, the probability of the two occurring is P(A∩B), which, if the two events are independent, is equal to P(A) x P(B).

Imagine that we have a die with six faces. If we throw it once, the probability of rolling, for example, a five is 1/6 (one result among the six possible). The probability of rolling a four is also 1/6. What will be the probability of getting a four once we have got a five on the first roll? Since the two rolls are independent, the probability of the combination five followed by four will be 1/6 x 1/6 = 1/36.
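If you distrust the arithmetic, a brute-force enumeration of the 36 equally likely ordered pairs confirms it:

```python
from itertools import product

pairs = list(product(range(1, 7), repeat=2))    # all ordered pairs of rolls
favourable = [p for p in pairs if p == (5, 4)]  # a five followed by a four
print(len(favourable), "/", len(pairs))         # 1 / 36
```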

Conditional probability

Now let’s think of another example. Suppose that in a group of 10 people there are four doctors, two of whom are surgeons. If we pick one at random, the probability of picking a doctor is 4/10 = 0.4 and that of picking a surgeon is 2/10 = 0.2. But if we pick one and know that he is a doctor, the probability that he is a surgeon will no longer be 0.2, because the two events, being a doctor and being a surgeon, are not independent. If he is a doctor, the probability that he is a surgeon is 0.5 (half the doctors in our group are surgeons).

When two events are dependent, the probability of both occurring is the probability of the first occurring given that the second has occurred, multiplied by the probability of the second. So P(surgeon∩doctor) = P(surgeon|doctor) x P(doctor) = 0.5 x 0.4 = 0.2 (which here equals P(surgeon), since all the surgeons in the group are doctors). We can generalize the expression as follows:

P(A∩B) = P(A|B) x P(B), and rearranging the terms of the expression we obtain the so-called Bayes’ rule, as follows:

P(A|B) = P(A∩B) / P(B).
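A quick check of the rule with the numbers of our group of 10 people:

```python
p_doctor = 4 / 10                 # P(doctor)
p_surgeon_and_doctor = 2 / 10     # P(surgeon ∩ doctor)

# Bayes' rule: P(surgeon | doctor) = P(surgeon ∩ doctor) / P(doctor)
print(p_surgeon_and_doctor / p_doctor)   # 0.5, as counted by hand
```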

In turn, P(A∩B) will be the probability of B given that A has occurred, times the probability of A: P(B|A) x P(A). On the other hand, the total probability of B equals the probability of B occurring when A occurs plus the probability of B occurring when A does not occur, which in mathematical form reads as follows:

P(B) = P(B|A) x P(A) + P(B|Ac) x P(Ac), where P(Ac) is the probability of A not occurring.

If we substitute these developed values into the initial rule, we obtain the best-known expression of Bayes’ theorem:

P(A|B)=\frac{P(B|A) \times P(A)}{P(B|A) \times P(A)+P(B|A^{c}) \times P(A^{c})}

Let’s see how Bayes’ theorem is applied with a practical example. Consider the case of acute fildulastrosis, a serious disease whose prevalence in the population is, fortunately, quite low: one per 1,000 inhabitants. So P(F) = 0.001.

Let’s see an example

Luckily, we have a good diagnostic test, with a sensitivity of 98% and a specificity of 95%. Suppose now that I take the test and it gives a positive result. Should I be very scared? What is the probability that I actually have the disease? Do you think it will be high or low? Let’s see.

A sensitivity of 98% means that the probability of testing positive when having the disease is 0.98. Mathematically, P(POS|F) = 0.98. On the other hand, a specificity of 95% means that the probability of testing negative when healthy is 0.95. That is, P(NEG|Fc) = 0.95. But what we want to know is neither of these two things; what we are really looking for is the probability of being sick once we have tested positive, that is, P(F|POS).

To calculate it, we just have to apply Bayes’ theorem:

P(F|POS)=\frac{P(POS|F) \times P(F)}{P(POS|F) \times P(F)+P(POS|F^{c}) \times P(F^{c})}

Then we replace the symbols with their values and solve the equation:

P(F|POS)=\frac{0.98 \times 0.001}{0.98 \times 0.001+[(1-0.95) \times (1-0.001)]}=0.02

So we see that, in principle, I do not have to be too scared when the test gives me a positive result, since the probability of actually being ill is only 2%. As you can see, much lower than what intuition would suggest with such a high sensitivity and specificity. Why does this happen? Very simple: because the prevalence of the disease is very low. Let’s repeat the calculation assuming now that the prevalence is 10% (0.1):

P(F|POS)=\frac{0.98 \times 0.1}{0.98 \times 0.1+[(1-0.95) \times (1-0.1)]}=0.68

As you can see, in this case the probability of being ill if I test positive rises to 68%. This probability is known as the positive predictive value which, as we can see, can vary greatly depending on the frequency of the disease we are studying.
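Wrapping the same computation in a small function makes the effect of prevalence explicit (a sketch of the calculation above, not any standard library routine):

```python
def ppv(prevalence, sensitivity, specificity):
    """Bayes' theorem: P(disease | positive test)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

for prev in (0.001, 0.1):
    print(f"prevalence {prev}: PPV = {ppv(prev, 0.98, 0.95):.2f}")
# prevalence 0.001: PPV = 0.02
# prevalence 0.1:   PPV = 0.68
```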

We’re leaving…

And here we leave it for today. Before closing, let me warn you not to go looking up what fildulastrosis is; I would be very surprised if anyone found it in a medical book. Also, be careful not to confuse P(POS|F) with P(F|POS): you would be committing the so-called inverse fallacy, or fallacy of the transposed conditional, which is a serious error.

We have seen how the calculation of probabilities gets somewhat complicated when the events are not independent. We have also learned how unreliable predictive values are when the prevalence of the disease changes. That is why likelihood ratios were invented: they do not depend so much on the prevalence of the disease being diagnosed and allow a better overall assessment of the power of a diagnostic test. But that is another story…

The stigma of guilt

Bayes’ rule

Sometimes statistical concepts are useful in other facets of life. For example, imagine that a burglary occurs at a bank and the thief got in through a small hole in the wall. Now keep on imagining that five years ago a tiny thief, who was released from prison two months ago, committed a similar theft. Who do you think the police will interrogate first?

You will all agree with me that the dwarf thief will be the main suspect, but you are probably wondering what all this has to do with statistics. And I’ll tell you that the answer is very simple: the police are using the concept of conditional probability when they think of their little suspect. Let’s see what conditional probability is and you’ll see that I am right.

Conditional probability

Two events may be dependent or independent. They are independent when the probability of one occurring has nothing to do with the probability of the other. For instance, if you throw a die ten times, each of these throws will be independent of the preceding (and following) ones. If we get a six in one throw, the probability of getting another in the following throw won’t be lower; it will still be one sixth. By the same reasoning, if we throw ten times and don’t get a six in any of them, the probability of getting one the next time we throw will still be one sixth, and not higher. The probability of getting a six twice in a row is the product of the probabilities of getting each one: 1/6 x 1/6 = 1/36.

Expressed mathematically, the probability of two independent events occurring is:

P(A and B) = P(A) x P(B)

In other cases, the events may be dependent, which means that the occurrence of one of them changes the probability of occurrence of the other. We then speak of conditional probability. Let’s see an example.

The first that comes to a physician’s mind may be that of the positive and negative predictive values of diagnostic tests. The probability that a patient is sick is not the same as the probability of being sick once he has tested positive. The latter, in turn, will be greater than if he had got a negative result. As you can see, the result of the test changes the probability of disease.

On the other hand, suppose we’re studying a population of children to see how many of them have anemia and malnutrition. Logically, the likelihood of malnutrition will be greater in anemic children. Once we determine that a child is anemic, the probability that he is malnourished increases. The good thing about all this is that, if we know the different probabilities, we can calculate the probability of a child having anemia once we have found that he is malnourished. Let’s see it mathematically.

Bayes’ rule

The probability of two dependent events occurring can be expressed as follows:

P(A and B) = P(A) x P(B|A), where B|A is read as B given A.

We may also write the equation swapping A and B, as follows:

P(A and B) = P(B) x P(A|B)

and since the left-hand sides of the two equations are the same, we can set them equal and obtain another expression:

P(A) x P(B|A) = P(B) x P(A|B)
P(B|A) = [P(B) x P(A|B)] / P(A)

which is known as Bayes’ rule. Bayes was an eighteenth-century clergyman who was very fond of conditional events.

Let’s see an example

To understand its utility, we’ll apply it to the case of the positive predictive value. Suppose a disease whose prevalence (the probability of occurring in the population) is 0.2 and a test to diagnose it with a sensitivity of 0.8. If we test a population and get 30% of positive results (probability 0.3), what is the probability that an individual is sick once he has obtained a positive test result? Let’s solve the problem:

P(sick|positive) = [P(sick) x P(positive|sick)] / P(positive)
P(sick|positive) = (prevalence x sensitivity) / P(positive test)
P(sick|positive) = (0.2 x 0.8) / 0.3 = 0.53

In summary, if an individual tests positive he has a 53% chance of being sick.
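For completeness, the same computation written directly from Bayes’ rule (the three input probabilities are those given in the statement of the example):

```python
def bayes_rule(p_b, p_a_given_b, p_a):
    """P(B | A) = P(B) x P(A | B) / P(A)."""
    return p_b * p_a_given_b / p_a

prevalence = 0.2      # P(sick)
sensitivity = 0.8     # P(positive | sick)
p_positive = 0.3      # P(positive) observed in the population

print(round(bayes_rule(prevalence, sensitivity, p_positive), 2))  # 0.53
```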

We’re leaving…

And here we end with Bayes for today. Note that Bayes’ contribution to the science of statistics was much broader. In fact, this type of reasoning leads to another way of looking at statistics, one based on updating probabilities as events happen, in contrast with the classical frequentist approach we use most of the time. But that’s another story…