By your actions they will judge you

Sample size calculation in survival studies

Today you will have to forgive me, but I am in a somewhat biblical mood. I was thinking about the sample size calculation for survival studies and it reminded me of the message that Ezekiel transmits to us: according to your ways and your works they will judge you.

Once again, you will think that from all the buzzing of evidence-based medicine in my head I have gone a little nuts, but if you hold on a bit and continue reading, you will see that the analogy can be explained.

A little introduction

One of the most valued indicators of a study’s methodological quality is the prior calculation of the sample size needed to demonstrate (or reject) the working hypothesis. When we want to study the effect of an intervention we must define, a priori, what effect size we want to detect and calculate the sample size needed to detect it, provided the effect exists (something we hope for when we plan the experiment, but which we do not know a priori), taking into account the significance level and the power that we want the study to have.

In summary, if we detect the effect size that we previously established, the difference between the two groups will be statistically significant (our desired p < 0.05). Conversely, if there is no significant difference, there is probably no real difference, although always with the risk of making a type 2 error, whose probability equals 1 minus the power of the study.

So far it seems clear: we must calculate the number of participants we need. But this is not so simple for survival studies.

The approach to the problem

Survival studies group a series of statistical techniques for situations in which it is not enough to observe whether an event occurs; what matters is the time that elapses until it occurs. In these cases, the outcome variable is neither purely quantitative nor purely qualitative, but a time-to-event variable. It is a mixed type of variable with a dichotomous part (the event occurs or does not) and a quantitative part (how long it takes to occur).

The name of survival studies is a bit misleading and one might think that the event under study is always the death of the participants, but nothing could be further from the truth. The event can be any type of incident or occurrence, good or bad for the participant. What happens is that the first studies were applied to situations in which the event of interest was death, and the name has stuck.

In these studies, participants’ follow-up periods are often uneven, and some may even finish the study without presenting the event of interest or be lost to follow-up before it ends.

For these reasons, if we want to know whether the occurrence of the event of interest differs between the two arms of the study, what matters for the sample size calculation is not so much the number of participants as the number of events we need to observe for the difference to be significant if the clinically important difference, which we must establish a priori, is reached.

Let’s see how it is done, depending on the type of contrast we plan to use.

Sample size calculation in survival studies

If we only want to determine the number of necessary events that we have to observe to detect a difference between a certain group and the population from which it is sourced, the formula to do so is as follows:


Where E is the number of events we need to observe, K is the value determined by the confidence level and the power of the study and lnRR is the natural logarithm of the risk rate.

The value of K is calculated as (Zα + Zβ)², with Z being the standardized value for the chosen significance level and power. The most common choice is a two-sided (two-tailed) contrast with a significance level of 0.05 and a power of 80%. In this case, the values are Zα = 1.96, Zβ = 0.84 and K = 7.9. In the attached table I leave you the most frequent values of K, so you do not have to calculate them.
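If you would rather compute K than look it up, here is a minimal Python sketch. It assumes a two-sided contrast by default and relies on the standard library’s NormalDist; the function name k_value is just an illustrative choice of mine. With unrounded Z values the result is about 7.85, which is rounded up to the 7.9 used in this post.

```python
from statistics import NormalDist


def k_value(alpha=0.05, power=0.80, two_sided=True):
    """Return K = (Z_alpha + Z_beta)^2 for a given significance level and power."""
    z = NormalDist()  # standard normal distribution
    # Two-sided contrasts split alpha between the two tails.
    z_alpha = z.inv_cdf(1 - alpha / 2) if two_sided else z.inv_cdf(1 - alpha)
    z_beta = z.inv_cdf(power)
    return (z_alpha + z_beta) ** 2


if __name__ == "__main__":
    print(round(k_value(), 2))            # alpha 0.05, power 80%
    print(round(k_value(power=0.90), 2))  # alpha 0.05, power 90%
```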

The risk rate is the ratio between the risk of the study group and the risk in the population, which we are supposed to know. It is defined as Sm1 /Sm2, where Sm1 is the mean time of appearance of the event in the population and Sm2 is the expected in the study group.

Let’s give an example to better understand what has been said so far.

Suppose that patients on treatment with a certain drug (which we will call A, so as not to work too hard) are at risk of developing a stomach ulcer during the first year of treatment. Now we select a group and give them a treatment (B, this time) that acts as prophylaxis, so that we expect the event to take one more year to occur. How many ulcers do we have to observe in a study with a significance level of 0.05 and a power of 0.8 (80%)?

We know that K is 7.9, Sm1 = 1 and Sm2 = 2. We substitute these values in the formula that we already know:

We will need to see 33 ulcers during follow-up. Now we can calculate how many patients we must include in the study (I find it difficult to enroll just ulcers).

Let’s assume that we can enroll 12 patients a year. If we want to observe 33 ulcers, the follow-up should last for 33/12 = 2.75, that is, 3 years. For more security, we would plan a slightly higher follow-up.

Survival curve comparison

This was the simplest problem. When we want to compare two survival curves (we plan to do a log-rank test), the calculation of the sample size is a bit more complex, but not much. After all, we will now be comparing the survival probability curves of the two groups.

In these cases, the formula for calculating the number of necessary events is as follows:

We find a new parameter, C, which is the ratio of participants between one group and the other (1:1, 1:2, etc.).

But there is another difference from the previous scenario. In these cases, the RR is calculated as the quotient of the natural logarithms of π1 and π2, which are the proportions of participants from each group that present the event in a given period of time.

Following the previous example, suppose we know that the ulcer risk in those who are on A is 50% in the first 6 months and that of those who are on B, 20%. How many ulcers do we need to observe with the same level of confidence and the same power of the study?

Let’s substitute the values in the previous formula:

We will need to observe 50 ulcers during the study. Now we need to know how many participants (not events) we need in each branch of the study. We can obtain it with the following formula:

If we substitute our values in the equation, we obtain a value of 29.4, so we will need 30 patients in each branch of the study, 60 in all.
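As a cross-check of the 50 events obtained above, here is a minimal Python sketch. Be aware that it rests on an assumption of mine: for a 1:1 design it uses a Freedman-type approximation, E = K × ((1 + RR) / (1 − RR))², which may or may not be the exact expression used in this post, although it does reproduce the same figure. The function names are hypothetical, and the sketch does not cover unequal allocation or the conversion from events to participants.

```python
import math


def rate_from_proportions(pi1, pi2):
    """RR as the quotient of the natural logarithms of the two proportions."""
    return math.log(pi1) / math.log(pi2)


def events_needed_equal_groups(rr, k=7.9):
    """Events needed for a 1:1 design (ASSUMED Freedman-type approximation)."""
    return math.ceil(k * ((1 + rr) / (1 - rr)) ** 2)


if __name__ == "__main__":
    rr = rate_from_proportions(0.5, 0.2)   # 50% on A, 20% on B
    print(round(rr, 3))
    print(events_needed_equal_groups(rr))  # -> 50
```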

Coming to the end, let’s see what would happen if we want a different ratio of participants instead of the easiest, 1:1. In that case, the calculation of n with the last formula must be adjusted taking into account this proportion, which is our known C:

Imagine we want a 2:1 ratio. We substitute the values in the equation:

We would need 23 participants in one branch and 46, double, in the other, 69 in all.

We’re leaving…

And here we leave it for today. As always, everything we have said in this post is so that we can understand the fundamentals of calculating the necessary sample size in this type of study, but I advise you to use a statistical program or a calculator if you ever have to do it. There are many available and some are even totally free.

I hope that you now understand Ezekiel’s message and that, in these studies, the things we do (or suffer) are more important than how many of us do (or suffer) them. We have seen the simplest way to calculate the sample size of a survival study, although we could have brought unnecessary trouble into our lives and calculated the sample size based on estimates of risk ratios or hazard ratios. But that is another story…

Rioja vs Ribera

Frequentist vs Bayesian statistics

This is one of the typical debates that one can have with a brother-in-law during a family dinner: whether the wine from Ribera is better than that from Rioja, or vice versa. In the end, as always, the brother-in-law will be (or will want to be) right, which will not prevent us from trying to contradict him. Of course, we must make good arguments to avoid falling into the same error that, in my humble opinion, some fall into when participating in another classic debate, this one from the less playful field of epidemiology: Frequentist vs. Bayesian statistics?

And these are the two approaches that we can use when dealing with a research problem.

Some previous definitions

Frequentist statistics, the best known and the one to which we are most accustomed, is developed according to the classic concepts of probability and hypothesis testing. Thus, it is about reaching a conclusion based on the level of statistical significance and the acceptance or rejection of a working hypothesis, always within the framework of the study being carried out. This methodology forces us to establish the decision parameters a priori, which avoids subjectivity about them.

The other approach to solving problems is that of Bayesian statistics, which is increasingly fashionable and, as its name suggests, is based on the probabilistic concept of Bayes’ theorem. Its differentiating feature is that it incorporates external information into the study that is being carried out, so that the probability of a certain event can be modified by the previous information that we have on the event in question. Thus, the information obtained a priori is used to establish an a posteriori probability that allows us to make the inference and reach a conclusion about the problem we are studying.

This is another difference between the two approaches: while Frequentist statistics avoids subjectivity, Bayesian statistics introduces a subjective (but not capricious) definition of probability, based on the researcher’s conviction, to make judgments about a hypothesis.

Bayesian statistics is not really new. Thomas Bayes’ theory of probability was published in 1763, but it has experienced a resurgence since the last third of the last century. And as usually happens in these cases where there are two alternatives, supporters and detractors of both methods appear, deeply involved in the fight to demonstrate the benefits of their preferred method, sometimes looking more for the weaknesses of the opposite method than for the strengths of their own.

And this is what we are going to talk about in this post: some arguments that Bayesians occasionally use which, once again in my humble opinion, exploit the misuse of Frequentist statistics by many authors more than any intrinsic defect of the methodology.

A bit of history

We will start with a bit of history.

The history of hypothesis testing begins back in the 1920s, when the great Ronald Fisher proposed to assess the working hypothesis (of absence of effect) through a specific observation and the probability of observing a value equal to or more extreme than the observed result. This probability is the p-value, so sacred and so misinterpreted, which means no more than that: the probability of finding a value equal to or more extreme than that found if the working hypothesis were true.

In summary, the p that Fisher proposed is nothing more than a measure of the discrepancy that could exist between the data found and the proposed working hypothesis, the null hypothesis (H0).

Almost a decade later, the concept of the alternative hypothesis (H1), which did not exist in Fisher’s original approach, was introduced, and the reasoning was modified on the basis of two error rates, false positive and false negative:

  1. Alpha error (type 1 error): probability of rejecting the null hypothesis when, in fact, it is true. It would be the false positive: we believe we detect an effect that, in reality, does not exist.
  2. Beta error (type 2 error): it is the probability of failing to reject the null hypothesis when, in fact, it is false. It is the false negative: we fail to detect an effect that actually exists.

Thus, we set a maximum value for what seems to us the worst case scenario, which is detecting a false effect, and we choose a “small” value. How small is it? Well, by convention, 0.05 (sometimes 0.01). But, I repeat, it is a value chosen by agreement (and there are those who say that it is capricious, because 5% reminds them of the fingers of the hand, which are usually 5).

Thus, if p < 0.05, we reject H0 in favor of H1. Otherwise, we cannot reject H0, the hypothesis of no effect. It is important to note that we can only reject H0, never demonstrate it in a positive way. We can demonstrate the effect, but not its absence.

Everything said so far seems easy to understand: the frequentist method tries to quantify the level of uncertainty of our estimate to try to draw a conclusion from the results. The problem is that p, which is nothing more than a way to quantify this uncertainty, is sacralized and misinterpreted too often, which is used to their advantage (if I may say so) by opponents of the method to try to expose its weaknesses.

One of the major flaws attributed to the frequentist method is the dependence of the p-value on the sample size. Indeed, the value of p can be the same with a small effect size in a large sample as with a large effect size in a small sample. And this is more important than it may seem at first, since the value that will allow us to reach a conclusion will depend on a decision exogenous to the problem we are examining: the chosen sample size.
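A small Python sketch may make this tangible. It assumes, for simplicity, a comparison of two equal groups with known standard deviation (a z approximation rather than a t test) and a standardized effect size; the function name is hypothetical. A large effect in a small sample and a small effect in a large sample give exactly the same p.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(effect_size, n_per_group):
    """Two-sided p-value for a standardized mean difference (z approximation)."""
    z = effect_size * sqrt(n_per_group / 2)  # difference divided by its standard error
    return 2 * (1 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    print(round(two_sided_p(0.8, 12), 4))   # large effect, 12 per group
    print(round(two_sided_p(0.2, 192), 4))  # small effect, 192 per group
```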

Here would be the benefit of the Bayesian method, in which larger samples would serve to provide more and more information about the phenomenon under study. But I think this argument is based on a misunderstanding of what an adequate sample is. I am convinced that more is not always better.

We start with the debate

Another great man, David Sackett, said that “samples that are too small can be used to prove nothing; samples that are too large can be used to prove anything”. The problem is that, in my opinion, a sample is neither large nor small, but sufficient or insufficient to demonstrate the existence (or not) of an effect size that is considered clinically important.

And this is the heart of the matter. When we want to study the effect of an intervention we must define, a priori, what effect size we want to detect and calculate the sample size needed to detect it, provided the effect exists (something that we hope for when we plan the experiment, but that we don’t know a priori). When we do a clinical trial we are spending time and money, in addition to subjecting participants to potential risk, so it is important to include only those needed to try to prove the clinically important effect. Including as many participants as it takes to reach the desired p < 0.05, in addition to being uneconomic and unethical, demonstrates a lack of knowledge about the true meaning of the p-value and of sample size.

This misinterpretation of the p-value is also the reason that many authors who do not reach the desired statistical significance allow themselves to affirm that with a larger sample size they would have achieved it. And they are right, they would have reached the desired p <0.05, but they again ignore the importance of clinical significance versus statistical significance.
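To put a number on it, here is a rough sketch of how many participants per group would make a given observed standardized difference reach p < 0.05. It assumes two equal groups, a known standard deviation and a z approximation (all simplifications of mine), and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_for_significance(effect_size, alpha=0.05):
    """Participants per group at which an observed standardized difference
    of `effect_size` becomes statistically significant (two-sided z test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil(2 * (z_alpha / effect_size) ** 2)


if __name__ == "__main__":
    for d in (0.5, 0.05):
        print(d, n_for_significance(d))
```

With enough participants even a difference of 0.05 standard deviations becomes “significant”, which says nothing about whether it matters clinically.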

When the sample size to detect the clinically important effect is calculated a priori, the power of the study is also set, which is the probability of detecting the effect if it actually exists. If the power is at least 80-90%, the values admitted by convention, it does not seem correct to say that your sample is too small. And, of course, if you have not calculated the power of the study beforehand, you should do so before claiming that you have no results because of an insufficient sample.
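As an illustration of that a priori calculation, this is a minimal sketch for the comparison of two means. It assumes equal groups, a standardized effect size (difference divided by the standard deviation) and the normal approximation n = 2 × (Zα + Zβ)² / d²; a t-based calculation would give slightly larger numbers. The function name is my own.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Participants per group to detect a standardized difference `effect_size`
    with a two-sided test (normal approximation)."""
    z = NormalDist()
    k = (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2
    return ceil(2 * k / effect_size ** 2)


if __name__ == "__main__":
    print(n_per_group(0.5))              # medium effect, 80% power
    print(n_per_group(0.5, power=0.90))  # same effect, 90% power
    print(n_per_group(0.2))              # small effect, 80% power
```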

Another argument against the frequentist method and in favor of the Bayesian one says that hypothesis testing is a dichotomous decision process, in which a hypothesis is rejected or accepted just as you reject or accept an invitation to the wedding of a distant cousin you haven’t seen for years.

Well, if they previously forgot about clinical significance, those who affirm this forget about our beloved confidence intervals. The results of a study should not be interpreted solely on the basis of the p-value. We must look at the confidence intervals, which inform us of the precision of the result and of the possible values that the observed effect may have and that we cannot specify further because of the effect of chance. As we saw in a previous post, the analysis of the confidence intervals can sometimes give us clinically important information even when the p is not statistically significant.

More arguments

Finally, some detractors of the frequentist method say that the hypothesis test makes decisions without considering information external to the experiment. Again, a misinterpretation of the value of p.

As we already said in a previous post, a value of p <0.05 does not mean that H0 is false, nor that the study is more reliable, or that the result is important (even though the p has six zeros). But, most importantly for what we are discussing now, it is false that the value of p represents the probability that H0 is false (the probability that the effect is real).

Once our results allow us to affirm, with a small margin of error, that the detected effect is real and not random (in other words, when the p is statistically significant), we can calculate the probability that the effect is “real”. And for this (oh, surprise!) we will have to calibrate the value of p with the baseline probability of H0, which will be assigned by the researcher based on her knowledge or on previously available data (which is, after all, a Bayesian approach).

As you can see, the assessment of the credibility or likelihood of the hypothesis, one of the differentiating characteristics of the Bayesian approach, can also be used if we use frequentist methods.
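To give an idea of what such a calibration can look like, here is a hedged sketch. It assumes one particular calibration, the minimum Bayes factor −e × p × ln(p) proposed by Sellke, Bayarri and Berger (valid for p < 1/e); other calibrations exist and I am not claiming this is the only way to do it. The prior probability of H0 is whatever the researcher decides, and the function names are hypothetical.

```python
from math import e, log


def min_bayes_factor(p):
    """Lower bound on the Bayes factor in favor of H0 for a given p-value."""
    if not 0 < p < 1 / e:
        raise ValueError("this calibration only applies for 0 < p < 1/e")
    return -e * p * log(p)


def min_posterior_prob_h0(p, prior_h0=0.5):
    """Smallest posterior probability of H0 compatible with the p-value."""
    prior_odds = prior_h0 / (1 - prior_h0)
    posterior_odds = prior_odds * min_bayes_factor(p)
    return posterior_odds / (1 + posterior_odds)


if __name__ == "__main__":
    # With a 50% prior for H0, p = 0.05 still leaves H0 a probability near 0.29.
    print(round(min_posterior_prob_h0(0.05, prior_h0=0.5), 3))
    print(round(min_posterior_prob_h0(0.001, prior_h0=0.5), 3))
```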

We’re leaving…

And here we are going to leave it for today. But before finishing I would like to make a couple of considerations.

First, in Spain we have many great wines throughout our geography, not just Ribera or Rioja. So that no one gets offended: I have chosen these two because they are usually the ones the brothers-in-law ask for when they come to have dinner at home.

Second, do not misunderstand me if it has seemed to you that I am an advocate of frequentist statistics against Bayesian statistics. Just as when I go to the supermarket I feel happy to be able to buy wine from various designations of origin, in research methodology I find it very good to have different ways of approaching a problem. If I want to know if my team is going to win a match, it doesn’t seem very practical to repeat the match 200 times to see what average results come out. It would be better to try to make an inference taking into account the previous results.

And that’s all. We have not gone into depth on what we commented at the end about the real probability of the effect, which somehow mixes both approaches, frequentist and Bayesian. The easiest way, as we saw in a previous post, is to use Held’s nomogram. But that is another story…

With very little we fine-tune a lot

Confidence interval and sample size

We all like to know what will happen in the future, so we try to invent things that help us anticipate the result of a certain event. A clear example is political elections, or any consultation in which people are asked about an issue of interest. Polls were invented to try to anticipate the outcome of such a vote before it happens. Many people do not trust polls but, as discussed below, they are a very useful tool: they allow us to make estimates with relatively little effort.

Consider, for example, that we do a Swiss-style referendum to ask people if they want to reduce their workday. Some of you will tell me that this is a waste of time, since a survey like that in Spain would have a very predictable result, but you never know. In Switzerland they asked and people preferred to continue working longer.

If we wanted to know for sure what the outcome of the vote will be, we’d have to ask everyone what their vote will be, which is impractical. So we do a poll: we select a sample of a given size and ask them. We obtain an estimate of the final result, with an accuracy that is determined by the confidence interval of the calculated estimator.

But will the sample have to be very large? Well, not too much, if it’s well chosen. Let’s see it.

Relation between confidence interval and sample size

Every time we run the poll we obtain a value of the proportion p of people who will vote, for instance, yes to the proposal. If we repeated the poll many times, we would get a set of values close to each other and probably close to the actual value in the population, which we cannot access. These values (the results of the repeated polls) approximately follow a normal distribution, so we know that 95% of them would lie within the population proportion plus or minus two standard deviations (actually, 1.96 standard deviations). This standard deviation of the estimates is called the standard error, and it is the measure that allows us to calculate the margin of error of the estimate through its confidence interval:

95% confidence interval (95 CI) = estimated proportion ± 1.96 × standard error

Actually, this is a simplified equation. If the sample (n) is drawn from a finite population (N), the standard error should be multiplied by a correction factor, so that the formula is as follows:

95\ CI= p\pm 1.96\times standard\ error\times \sqrt{1-\frac{n}{N}}

If you think about it for a moment, when the population is large the ratio n / N tends to zero, so that the correction factor tends to one. This is the reason why the sample does not need to be excessively large, and why the same sample size can serve to estimate the results of an election in a little town or in the entire nation.

Therefore, the accuracy of the estimate depends mainly on the standard error. What would be the standard error in our example? As the result is a proportion, we know it follows a binomial distribution, so the standard error is equal to \sqrt{\frac{p(1-p)}{n}}, where p is the proportion obtained and n the sample size.

The imprecision (the width of the confidence interval) will be greater the larger the standard error. Therefore, the greater the product p(1-p) or the smaller the sample size, the less accurate our estimate will be and the greater our margin of error.
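The two formulas above can be put together in a few lines of Python. This is only a sketch: the function name and the example figures (a poll of 1,000 people, a town of 5,000 inhabitants, a country of 40 million) are invented for illustration.

```python
from math import sqrt


def proportion_ci(p_hat, n, population=None, z=1.96):
    """95% confidence interval for a proportion, with an optional
    finite population correction sqrt(1 - n/N)."""
    standard_error = sqrt(p_hat * (1 - p_hat) / n)
    if population is not None:
        standard_error *= sqrt(1 - n / population)
    margin = z * standard_error
    return p_hat - margin, p_hat + margin


if __name__ == "__main__":
    # Same poll (40% say yes, 1,000 respondents) in three settings.
    for size in (None, 5_000, 40_000_000):
        low, high = proportion_ci(0.40, 1000, population=size)
        print(size, round(low, 4), round(high, 4))
```

The interval barely changes between the small town and the whole country, which is the point made above about the correction factor.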

Anyway, this margin of error is limited. Let’s see why.


We can accurately estimate without the need of a very large sample

We know that p can have values between zero and one. If we examine the figure with the curve of p vs p(1-p), we see that the maximum value of the product is obtained when p = 0.5, with a value of 0.25. As p moves away from 0.5 in either direction, the product becomes smaller.

So, for a given value of n, the standard error is maximum when p equals 0.5, using the following equation:

Maximum\ standard\ error= \sqrt{\frac{0.5 \times 0.5}{n}}= \sqrt{\frac{0.25}{n}}= \frac{0.5}{\sqrt{n}}

Thus, we can write the formula of the maximum confidence interval:

Maximum\ 95\ CI= p\pm 1.96\times \frac{0.5}{\sqrt{n}}\approx p\pm 2\times\frac{0.5}{\sqrt{n}}= p\pm\frac{1}{\sqrt{n}}

That is, the maximum margin of error is \frac{1}{\sqrt{n}}. This means that with a sample of 100 people we will have a margin of error that depends on the value of p we have obtained, but that will be at most plus or minus 10 percentage points. Thus we see that with a sample that does not need to be very large, we can get a fairly accurate result.
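A quick sketch of this rule of thumb, and of its inverse (how many people we need for a given maximum margin), could look like this. The function names are mine, and the rule uses the rounding of 1.96 to 2 made above.

```python
from math import ceil, sqrt


def max_margin(n):
    """Worst-case margin of error (p = 0.5) using the 1/sqrt(n) rule."""
    return 1 / sqrt(n)


def sample_for_margin(margin):
    """Sample size that guarantees a margin of error no larger than `margin`."""
    return ceil(1 / margin ** 2)


if __name__ == "__main__":
    for n in (100, 400, 1000):
        print(n, round(max_margin(n), 3))
    print(sample_for_margin(0.03))  # people needed for a margin of 3 points
```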

We’re leaving…

And with that we’re done for today. You might ask, after all we have said, why there are polls whose result differs from the definitive result. Well, I can think of two reasons. First, chance. We may have chosen, by chance, a sample that is not centered on the true value of the population (this will happen 5% of the time). Second, the sample may not be representative of the general population. And this is a key factor, because if the sampling technique is incorrect, the results of the survey will be unreliable. But that’s another story…

Freedom in degrees

Freedom is one of those concepts that everyone can understand easily, but it is extremely difficult to define. If you don’t believe me, try to state a definition of freedom and you will see that it is not so easy. Right away, you’ll be running into other people’s freedom when trying to define yours or you’ll be wondering what kind of freedom you’re trying to define.

However, with degrees of freedom exactly the opposite happens. The term is fairly easy to define, but many have trouble understanding the exact meaning of this seemingly abstract concept.

The degrees of freedom are the number of observations in a sample that can take any possible value (that are “free” to take any value) once a certain parameter estimate has been previously and independently calculated in the sample or in its population of origin. Do you see now why I say it is easy to define but not so easy to understand? Let’s look at an example to make it a little clearer.

In a stroke of delusional imagination, let’s assume that we are school teachers. The school principal tells us there’s a competition among neighboring schools and we have to select five students to represent ours. The only condition we have to fulfill is that the final average grade of the five students must be seven points. Let’s also suppose that, as it happens, our eldest son, who has a grade of eight, is in the class. So, acting impartially, we pick him to represent his peers. But we still need to pick four more so, why not be consistent with our sense of justice and choose his four friends? His friend Philip has 9, John 6, Louis 5 (he gets through by the narrowest of margins) and Evaristo has 10 (the real nerd). What’s the problem? It’s that the average of the five is 7.6 and it should be exactly 7. What can we do?

Let’s try to remove Louis; after all he’s the one with the lowest grade. The problem is that we’ll have to choose a student with a score of 2 to come up with an average of 7. But we can’t select a student who has failed his tests. Then, let’s try to remove nerd-Evaristo and we’ll need to look for a student with a score of 7. If you think about it, we can make all possible combinations with the five friends, but always choosing only four, since the fifth would be bound by the average value we have set previously. So this means, no more and no less, that we have four degrees of freedom.

When we make a statistical inference in a population, if we want the results to be reliable, we have to make each estimate independently. For instance, if we calculate the mean and the standard deviation we should do it independently, but this is not usually so, since we need an estimate of the mean to calculate the standard deviation. This is why not all the observations can be considered free and independent of the mean. At least one of them will be conditioned by the value previously settled for the mean.
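The teacher’s dilemma can be written in a couple of lines of Python: once the average is fixed and four grades are chosen freely, the fifth is forced. The function name is just an illustrative choice.

```python
def forced_last_value(free_values, target_mean):
    """Value the remaining observation must take so that all of them
    (the free ones plus this one) average exactly `target_mean`."""
    n = len(free_values) + 1
    return target_mean * n - sum(free_values)


if __name__ == "__main__":
    print(forced_last_value([8, 9, 6, 10], 7))  # without Louis     -> 2
    print(forced_last_value([8, 9, 6, 5], 7))   # without Evaristo  -> 7
```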
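The same idea shows up when we compute a standard deviation from the five original grades. In this sketch (using Python’s statistics module) the deviations from the sample mean always add up to zero, so the last one is determined by the other four; that is why the sample variance divides by n − 1 rather than n.

```python
from statistics import mean, pvariance, variance

grades = [8.0, 9.0, 6.0, 5.0, 10.0]
m = mean(grades)                        # 7.6, estimated from the same data
deviations = [x - m for x in grades]

# The last deviation is not free: it must cancel the sum of the others.
last_from_others = -sum(deviations[:-1])

sample_var = variance(grades)           # sum of squares divided by n - 1 (4 degrees of freedom)
population_var = pvariance(grades)      # sum of squares divided by n

if __name__ == "__main__":
    print(round(sum(deviations), 10))
    print(round(last_from_others, 10), round(deviations[-1], 10))
    print(round(sample_var, 2), round(population_var, 2))
```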

So you can see that the number of degrees of freedom tells us the number of independent observations that are involved in the estimation of a population parameter.

This is important because estimators follow specific frequency distributions whose shape depends on the number of degrees of freedom associated with the estimate. The greater the number of degrees of freedom, the narrower the frequency distribution and the higher the power of the study to make the estimation. Thus, power and degrees of freedom are positively related with the sample size, so that the larger the sample size the greater the number of degrees of freedom and hence the greater the power.

Calculating the number of degrees of freedom of a test is usually straightforward, but it differs depending on the test in question. Calculating the mean of a sample is the simplest case. We have already seen that it equals n-1, where n is the sample size. Similarly, when we are dealing with two samples and two means, the number of degrees of freedom equals n1+n2-2. In general, when calculating several parameters, the degrees of freedom are calculated as n-p-1, where p is the number of parameters to be estimated. This is useful when we do an analysis of variance to compare two or more means.

And so we could give examples for the calculation of each particular statistic or test we’d want to perform. But that’s another story…

Size and power

Two associated qualities. And very enviable too. Especially when it comes to scientific studies (what were you thinking about?). Although there’re more factors involved, as we’ll see in a moment.

Let’s suppose we are measuring the mean of a variable in two samples to find out if there’re differences between them. We know that, just by random sampling, the results of the two samples will be different but we’ll want to know if that difference is wide enough to allow us to suppose they are actually different.

To find out, we perform a hypothesis test using the appropriate statistic. In our case, let’s suppose we do a Student’s t test. We calculate the value of our t and estimate its probability. Most statistics, t included, follow a specific frequency or probability distribution. These distributions are generally bell-shaped, more or less symmetrical and centered on a certain value. Thus, values near the center are more likely to occur, while those at the extremes are less likely. By convention, when this probability is less than 5% we consider that value of the measured parameter unlikely to occur.

But of course, unlikely is not synonymous with impossible. It may be that, by chance, we have chosen a sample that is not centered on the same value as the reference population, so the value occurs in spite of its low probability in this population.

And this is important because it can lead to errors in our conclusions. Remember that when we have two values to compare we establish the null hypothesis (H0) that the two values are equivalent, and that any difference is due to a random sampling error. Then, if we know its frequency distribution, we can calculate the probability of that difference occurring by chance. Finally, if it is less than 5% we’ll consider unlikely for it to be fortuitous and we’ll reject H0: the difference is not the result of chance and there’s a real effect or a real difference.

But again, unlikely is not impossible. If we have the misfortune of having chosen a sample that, by chance, deviates from the population, we could reject the null hypothesis without there being a real effect and commit a type 1 error.

Conversely, if the probability is greater than 5% we will not be able to reject H0 and we will say that the difference is due to chance. But here’s a small conceptual nuance that is important to consider. The null hypothesis is only falsifiable. This means that we can reject it, but not affirm it. When we cannot reject it, if we assume it’s true we’ll run the risk of not detecting a trend that really exists. This is the type 2 error.

Usually we are more interested in accepting theories as safely as possible, so we look for low type 1 error probabilities, usually 5%. This is called the alpha value. But the two types of errors are interlinked, so a very low alpha compels us to accept a higher type 2 error (or beta) probability, generally 20%.

The complement of beta (1-beta) is what is called the power of the study. This power is the probability of detecting an effect, given that it really exists or, in other words, the probability of not committing a type 2 error.

To understand the factors involved with the study power, will you let me pester you with a little equation:

1-\beta \propto \frac{SE\sqrt{n}\alpha }{\sigma }

SE represents the size of the effect we want to detect. Being in the numerator, it implies that the smaller the effect (the more subtle the difference), the lower the power of the study to detect it. The same applies to the sample size (n) and alpha: the larger the sample and the higher the significance level that we tolerate (with increased risk of type 1 error), the greater the power of the study. Finally, σ is the standard deviation: the more variability there is in the population, the lower the power of the study.
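The equation above is only a proportionality, so to see these four factors at work with actual numbers I will swap in a concrete formula: the approximate power of a two-sided z test comparing two equal groups. Take it as a sketch under those assumptions, not as the formula of this post; the function and parameter names are my own.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(difference, sigma, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z test comparing two means
    (equal groups, known sigma, ignoring the negligible opposite tail)."""
    z = NormalDist()
    se_of_difference = sigma * sqrt(2 / n_per_group)
    return z.cdf(difference / se_of_difference - z.inv_cdf(1 - alpha / 2))


if __name__ == "__main__":
    base = approx_power(difference=0.5, sigma=1.0, n_per_group=63)
    print("baseline         ", round(base, 3))
    print("smaller effect   ", round(approx_power(0.3, 1.0, 63), 3))
    print("larger sample    ", round(approx_power(0.5, 1.0, 100), 3))
    print("stricter alpha   ", round(approx_power(0.5, 1.0, 63, alpha=0.01), 3))
    print("more variability ", round(approx_power(0.5, 1.5, 63), 3))
```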

The utility of the above equation is that we can solve it to obtain the sample size in the following way:

n \propto \left ( \frac{(1-\beta )\sigma }{SE\ \alpha } \right )^{2}

With this formula we can calculate the sample size we need to reach the power we choose. Power (1-beta) is usually set at 0.8 (80%), that is, beta at 0.2. SE and σ are obtained from pilot studies, previous data or regulations and, if these don’t exist, they are set by the researcher. Finally, as we have already mentioned, alpha is usually set at 0.05 (5%), although if we are very afraid of committing a type 1 error we can set it at 0.01.

Closing this post, I would like to draw your attention to the relationship between n and alpha in the first equation. Notice that the power doesn’t change if we increase the sample size and concomitantly lower the significance level. This leads to the situation that, sometimes, obtaining statistical significance is only a matter of increasing the sample size enough. It is therefore essential to assess the clinical relevance of the results and not just their p-values. But that’s another story…