Factors that condition the sample size
Nowadays, the teaching of Medicine and, in general, teaching at the university level, is quite well defined and standardized. And this is not only at a national level, but also at the level of our international environment.
But this has not always been the case. In the beginning, each one went his own way and there was diversity in the ways and objectives of teaching, as you will see in the little story that I am going to tell in this post.
A bit of history
At the end of the 19th century and the beginning of the 20th century, medical schools in the United States were a little lost in terms of their objectives and ways of teaching. Although there were honorable exceptions, such as Johns Hopkins, Harvard, or Michigan Medical Schools, most were of more than poor quality.
The educational system was organized so that teachers had time for their own affairs, which is why it relied on lectures and offered very little practical training. In some schools, the whole program was completed in periods as ridiculous as two semesters.
Thus, in 1906, the Council on Medical Education of the American Medical Association began to worry about the matter and to collect information. Given what they found, and in order to maintain objectivity, they commissioned a third party, the Carnegie Foundation for the Advancement of Teaching, to develop a report on the subject.
And the Foundation, in turn, entrusted it to a man named Abraham Flexner, who had graduated from Johns Hopkins some 20 years earlier. This man not only did not delegate to anyone else, but he took the job with great determination: he studied the admission conditions, the facilities, the competence of the faculty and other aspects of the medical schools of the United States and Canada.
So far nothing abnormal. But the funny thing is that he studied it in ALL the schools, which at that time numbered 155. Great credit for a strenuous job, but surely he could have saved effort (and time and money) if he had selected a number of representative schools and thus reduced the number of establishments to be investigated.
The importance of selecting a suitable sample
So you have already seen how Mr. Flexner was able to include 100% of his target population in his study, something few can boast about. Of course, in addition to being unnecessary, many times this is not possible and may not even be advisable.
It is one thing to study medical schools and quite another to compare the efficacy or safety of a new treatment with the standard treatment or placebo.
A basic principle for biomedical research, the principle of equipoise, tells us that in order to compare two treatments in a trial, the researcher has to really ignore which of the two is better. Once this principle is no longer fulfilled, it is unethical to continue the trial or carry out a similar one.
The reason is that, although the investigator believes that her new treatment will be better, it may be equivalent to or even worse than the comparison option, putting trial participants at risk.
This is one of the reasons why calculating the necessary sample size in advance is so useful: we must know the minimum number of participants we need to be able to demonstrate statistically the effect of the new treatment if this effect exists, something that we do not know when we start the study.
It would not be ethical to include more patients than necessary just to obtain the desired p < 0.05. We must establish the clinically important effect that we want to detect and calculate the sample size so that the study has the necessary power to detect it.
Determinants of the sample size
The sample size needed is different in each situation and depends on many factors. We are not going to see in this post how to calculate the sample size in each situation; we will limit ourselves to reflecting on the conditions that influence both the way we calculate it and the size we obtain.
Let’s look at some of these factors that we should take into account when planning the sample size necessary for our study.
What is the clinically relevant difference we want to detect
A very common vice is to anxiously seek to obtain a p that is statistically significant. When we see a p value lower than 0.05, our faces light up and we no longer think of anything else.
Gross error: the significance of p depends, among other things, on the sample size. And, as we have already commented, it is not a question of obtaining a significant p, but rather of studying a magnitude of effect that we consider clinically relevant.
This difference is determined by the researcher, usually based on her knowledge of the subject she is studying or according to what has been published or known from previous studies.
When we compare two interventions in a clinical trial, we always start by setting the null hypothesis that both interventions are equally effective. We know that, simply by chance, even if the null hypothesis is true, the value of the outcome variable that we obtain will be different in the two groups.
For example, suppose we study two hypotensive drugs, A and B, and measure the difference in mean arterial pressure between the end and the beginning of the intervention. As we have mentioned, the null hypothesis assumes that the differences will be equal in the two groups.
However, as we already know, the values that we obtain will be different, so we will ask ourselves how likely it would be to observe a difference at least this large by chance alone if the null hypothesis were true. If this probability is less than 5% (p < 0.05), we will feel confident enough to reject the null hypothesis and we will conclude that one of the treatments is more effective than the other.
The problem is that, no matter how small the difference between the two groups, statistical significance (p < 0.05) can be achieved if the sample size is increased enough.
Imagine that treatment A lowers blood pressure by 20 mmHg and B, by 18 mmHg. If we include a sufficient number of participants, we can obtain a p < 0.05, but can we really conclude that A is better than B with only this difference? Obviously not. From a clinical point of view, I would say that they have similar efficacy.
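To see this with numbers, here is a minimal sketch in Python. It assumes a standard deviation of 15 mmHg in both groups (a value invented for the illustration) and uses a normal approximation (z test) rather than a t test; the function name `two_sided_p` is ours.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(diff, sd, n_per_group):
    """Two-sided p value for a difference in means (z approximation)."""
    se = sd * sqrt(2 / n_per_group)  # standard error of the difference
    z = diff / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical scenario: A lowers 20 mmHg, B lowers 18 mmHg (difference = 2),
# with an ASSUMED standard deviation of 15 mmHg in each group.
for n in (50, 200, 1000, 5000):
    print(f"n per group = {n:5d}  p = {two_sided_p(2, 15, n):.4g}")
```

The observed difference is the same 2 mmHg in every row; only the number of participants changes, and with it the p value.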
We should specify a difference that is relevant to us. For example, we may decide that we want to detect a difference between the two drugs of 20 mmHg or more. With this difference, we will calculate the number of participants necessary for the p value to be significant if this difference exists. We will need neither one more nor one fewer participant.
If we stay below this necessary size, even if we detect a difference of 20 mmHg, the p may not be significant. The study will not have the necessary power to detect the effect due to an insufficient sample size.
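As an illustration only, the following sketch uses the usual normal approximation for comparing two means. The standard deviation of 30 mmHg, the two-sided alpha of 0.05 and the power of 80% are assumptions of ours, as are the function names; a real calculation should follow the study's actual design.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Participants per group to detect a mean difference `delta`."""
    z_alpha = Z.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = Z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)

def power_for(delta, sd, n, alpha=0.05):
    """Approximate power of the same test with n participants per group."""
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    se = sd * sqrt(2 / n)  # standard error of the difference
    return Z.cdf(delta / se - z_alpha)

# ASSUMED values: clinically relevant difference 20 mmHg, SD 30 mmHg.
needed = n_per_group(delta=20, sd=30)
print("needed per group:", needed)
print("power with the needed size:", round(power_for(20, 30, needed), 2))
print("power with only 15 per group:", round(power_for(20, 30, 15), 2))
```

With fewer participants than the calculated size, the power falls below the 80% we asked for, so a true difference of 20 mmHg could easily go undetected.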
If the difference detected is clearly smaller than 20 mmHg, the p will probably not be significant either. It’s okay: there is no clinically relevant difference between the two treatments. What would not make sense is to increase the sample size to demonstrate the statistical significance of an effect smaller than the one considered clinically relevant.
One caveat before leaving this point: everything we have said takes place in the realm of probabilities, so we always have a certain probability of making an error when performing hypothesis testing (type I error and type II error).
The variability in the population of the parameter under study
This is another important factor. The greater the variability of the outcome variable of our study in the target population, the larger the sample size required to detect the same effect size.
The variability in the population is reflected in the standard deviation, which influences the calculation of the standard error and the confidence intervals. The larger the standard deviation of the variable, the larger the required sample size, since estimates on the population are less precise.
The same happens with the precision of the estimate that we want to make. The more precise we want our estimate to be, the larger the sample size needed, and vice versa.
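Both ideas can be sketched with the classic approximation for estimating a mean with a given margin of error. The numbers below are invented for the example, the function name `n_for_mean` is ours, and the formula assumes a normal approximation with a known standard deviation.

```python
from math import ceil
from statistics import NormalDist

def n_for_mean(sd, margin, confidence=0.95):
    """Sample size to estimate a mean within +/- `margin` (z approximation)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sd / margin) ** 2)

# Hypothetical values, in mmHg.
print("SD 10, margin 2:", n_for_mean(sd=10, margin=2))
print("SD 20, margin 2:", n_for_mean(sd=20, margin=2))  # more variability
print("SD 10, margin 1:", n_for_mean(sd=10, margin=1))  # more precision
```

Because the size depends on the square of the ratio between the standard deviation and the margin, doubling the variability or halving the margin roughly quadruples the number of participants.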
The reliability we expect from the study
The reliability of the study depends on two parameters whose value we must set to perform the sample size calculation: the confidence level and the power of the study.
The confidence level is the complement of the probability of type I error (alpha) that we are willing to accept, that is, the probability of not rejecting the null hypothesis when it is true. Usually a confidence level of 95% (alpha = 0.05) is chosen, although we can raise or lower it depending on how strict we want to be.
Power, for its part, is the complement of the probability of type II error (beta). As we have already said, it is the probability that the study will detect the effect, if it exists. It is usually set at 80%, although it can be increased to 90% in some studies.
As it is easy to intuit, the higher the level of confidence and the greater the power of the study, the larger the required sample size, and vice versa.
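A small sketch can make this tangible. It assumes a comparison of two means with a standardized effect size of 0.5 (the difference divided by the standard deviation, a value chosen only for the example) and the same normal approximation used in most textbooks; `n_per_group` is a name of ours.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha, power):
    """Participants per group, two-sided z approximation, standardized effect."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

# ASSUMED standardized effect size of 0.5 in every scenario.
for confidence in (0.95, 0.99):
    for power in (0.80, 0.90):
        n = n_per_group(effect_size=0.5, alpha=1 - confidence, power=power)
        print(f"confidence {confidence:.0%}, power {power:.0%}: {n} per group")
```

Each step up in confidence or power increases the size, and asking for both at once is the most expensive option.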
The type of study we are going to carry out
We have been talking about clinical trials all along, but the sample size calculation applies to other methodological designs as well.
Thus, we can calculate the sample size necessary to make prevalence estimates in cross-sectional studies with a certain precision, to compare the association and risk measures in observational studies, to establish the correlation between two variables, etc.
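As an example of one of these other designs, this is a sketch of the usual approximation for estimating a prevalence in a cross-sectional study. It assumes simple random sampling from a large population and a normal approximation; the expected prevalences and the margin of 5 percentage points are made-up values, and `n_for_prevalence` is our own name.

```python
from math import ceil
from statistics import NormalDist

def n_for_prevalence(expected_p, margin, confidence=0.95):
    """Sample size to estimate a proportion within +/- `margin`."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z ** 2 * expected_p * (1 - expected_p) / margin ** 2)

# Hypothetical scenarios with a margin of error of 5 percentage points.
print("expected prevalence 50%:", n_for_prevalence(0.5, 0.05))
print("expected prevalence 10%:", n_for_prevalence(0.1, 0.05))
```

An expected prevalence of 50% gives the largest size, which is why it is often used when nothing is known beforehand.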
Logically, the type of design will influence the way the sample size is calculated and the number of participants required.
Again the paired groups
It is important to establish the relationship that exists between the two groups that we want to compare, which, as we already know, can be independent or paired.
As is already known, the variability is usually greater between independent groups than between paired groups, because the members of each pair share characteristics that would otherwise add noise. This influences the necessary sample size, which will generally be greater when we handle independent groups.
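The following sketch compares both situations under assumptions of ours: a difference of 5 units, a standard deviation of 10 and a correlation of 0.6 between the two paired measurements. All the numbers and function names are hypothetical, and the benefit of pairing shown here depends on that correlation being positive.

```python
from math import ceil, sqrt
from statistics import NormalDist

def _z_sum(alpha, power):
    z = NormalDist()
    return z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)

def n_independent(delta, sd, alpha=0.05, power=0.80):
    """Participants PER GROUP for two independent groups."""
    return ceil(2 * (_z_sum(alpha, power) * sd / delta) ** 2)

def n_paired(delta, sd, rho, alpha=0.05, power=0.80):
    """Number of PAIRS, given the correlation `rho` within each pair."""
    sd_diff = sd * sqrt(2 * (1 - rho))  # SD of the within-pair differences
    return ceil((_z_sum(alpha, power) * sd_diff / delta) ** 2)

# ASSUMED values for the illustration.
print("independent, per group:", n_independent(delta=5, sd=10))
print("paired, pairs:", n_paired(delta=5, sd=10, rho=0.6))
```

The higher the correlation within pairs, the smaller the standard deviation of the differences and the fewer pairs are needed.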
The direction of contrast
Hypothesis testing can be one-tailed or two-tailed (unilateral or bilateral).
The bilateral contrast assumes in its alternative hypothesis that there is a difference between the two compared interventions, but does not say which of the two is more effective. For its part, the one-tailed test does establish in the alternative hypothesis which of the two interventions is superior.
The most common is to choose the bilateral contrast, since when we carry out an experiment we do not know the direction that the result can take. However, if we are sure what the direction of the effect is going to be, we can adopt a one-tailed test.
Two-tailed contrast is more conservative, making it more difficult to achieve statistical significance than with one-tailed contrast, and also requires a larger sample size.
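The effect on the sample size can be sketched by changing only the critical value of alpha. As before, this is a normal approximation for comparing two means, the standardized effect size of 0.5 is an assumption for the example, and the `two_tailed` parameter is a name of ours.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80, two_tailed=True):
    """Participants per group for a one-tailed or two-tailed z test."""
    z = NormalDist()
    # A two-tailed test splits alpha between both tails of the distribution.
    z_alpha = z.inv_cdf(1 - alpha / 2) if two_tailed else z.inv_cdf(1 - alpha)
    z_beta = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

# ASSUMED standardized effect size of 0.5.
print("two-tailed:", n_per_group(0.5, two_tailed=True))
print("one-tailed:", n_per_group(0.5, two_tailed=False))
```

The saving in participants is real, which is precisely why it should not be the reason for choosing a one-tailed test.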
In any case, let’s not get confused: the elegant thing is to carry out a bilateral contrast and, if we opt for a unilateral one, it should never be to reach the significant p more easily or with fewer participants.
The characteristics of the study variable
Logically, the sample size will be different if we want to measure one or more variables and it will also depend on the type of variables. This aspect is also linked to something we have already talked about, the precision with which we want to estimate each variable.
We have already seen the factors that can influence the number of participants that our study should have if we want it to be able to detect an effect that we consider clinically relevant.
To summarize, we can say that the size of the necessary sample will be greater the lower the probability of type I and type II error that we accept, the greater the dispersion of the variable in the study population and the smaller the size of the effect.
The sample size will also increase when we compare independent groups, when we want to compare more than one variable, and when we opt for a two-tailed hypothesis test.
And here we are going to leave the subject for today.
In case you’re curious about what happened to Mr. Flexner, I can tell you that his report was devastating. He concluded that 31 schools could train doctors better than the 155 he studied. Therefore, he recommended reducing the number of schools and, as a consequence, the number of students.
According to Flexner, too many doctors were being trained for the needs of the market. I don’t know, but this sounds familiar to me…
And now we are definitely going. We have talked a lot about the importance of having the right sample size and the factors that can influence it. However, it is not enough that the size is well calculated.
A sample of adequate size will be of no use if the sampling technique provides us with a sample that is not representative of the study population. But that is another story…