The King under review

Critical appraisal of treatment studies

We all know that the randomized clinical trial is the king of interventional methodological designs. It is the type of epidemiological study that allows a better control of systematic errors or biases, since the researcher controls the variables of the study and the participants are randomly assigned among the interventions that are compared.

In this way, if two homogeneous groups that differ only in the intervention present some difference of interest during the follow-up, we can affirm with some confidence that this difference is due to the intervention, the only thing that the two groups do not have in common. For this reason, the clinical trial is the preferred design to answer clinical questions about intervention or treatment, although we will always have to be prudent with the evidence generated by a single clinical trial, no matter how well performed. When we perform a systematic review of randomized clinical trials on the same intervention and combine them in a meta-analysis, the answers we get will be more reliable than those obtained from a single study. That’s why some people say that the ideal design for answering treatment questions is not the clinical trial, but the meta-analysis of clinical trials.

In any case, as systematic reviews assess their primary studies individually and as it is more usual to find individual trials and not systematic reviews, it is advisable to know how to make a good critical appraisal in order to draw conclusions. In effect, we cannot relax when we see that an article corresponds to a clinical trial and take its content for granted. A clinical trial can also contain its traps and tricks, so, as with any other type of design, it will be a good practice to make a critical reading of it, based on our usual three pillars: validity, importance and applicability.

Critical appraisal of treatment studies

As always, when studying scientific rigor or VALIDITY (internal validity), we will first look at a series of essential primary criteria. If these are not met, it is better not to waste time with the trial and try to find another more profitable one.

Is there a clearly defined clinical question? From its origin, the trial must be designed to answer a structured clinical question about treatment, motivated by one of our many knowledge gaps. A working hypothesis should be proposed with its corresponding null and alternative hypotheses, if possible on a topic that is relevant from the clinical point of view. It is preferable that the study try to answer only one question. When there are several questions, the trial may become excessively complicated and end up not answering any of them completely and properly.

Was the assignment done randomly? As we have already said, to be able to affirm that the differences between the groups are due to the intervention, the groups must be homogeneous. This is achieved by assigning patients randomly, the only way to control the known confounding variables and, more importantly, also those that we do not know. If the groups were different and we attributed the difference only to the intervention, we could incur confounding bias. The trial should contain the usual and essential Table 1 with the frequency of the demographic and confounding variables in both samples, so that we can be sure the groups are homogeneous. A frequent error is to look for differences between the two groups and evaluate them according to their p-value, when we know that p does not measure homogeneity. If we have distributed the participants at random, any difference we observe will necessarily be due to chance (we do not need a p to know that). The sample size is not calculated to discriminate between demographic variables, so a non-significant p may simply indicate that the sample is too small to reach statistical significance. On the other hand, any minimal difference can reach statistical significance if the sample is large enough. So forget about the p: if there is any difference, what you have to do is assess whether it has sufficient clinical relevance to have influenced the results or, more elegantly, control in the analysis for the covariates left unbalanced by the randomization. Fortunately, it is increasingly rare to find tables of the study groups with p-values comparing the intervention and control groups.

But it is not enough for the study to be randomized; we must also consider whether the randomization sequence was generated correctly. The method used must ensure that all members of the selected population have the same probability of being chosen for each group, so random number tables or computer-generated sequences are preferred. The randomization must be concealed, so that it is not possible to know which group the next participant will belong to. That is why centralized systems by telephone or through the Internet are so popular. And here is something very curious: it is well known that simple randomization produces samples of different sizes, especially if the samples are small, which is why randomization in blocks balanced in size is sometimes used. And I ask you, how many studies have you read that claimed to be randomized and had exactly the same number of participants in the two arms? Be suspicious if you see equal groups, especially if they are small, and do not be fooled: you can always use one of the many binomial probability calculators available on the Internet to find out how likely it is that chance generated the groups that the authors present (we are always speaking of simple randomization, not blocks, clusters, minimization or other techniques). You will be surprised by what you find.
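If you prefer not to depend on an online calculator, the binomial check can be sketched in a few lines of Python. This is only an illustration under the assumption of simple 1:1 randomization; the function names and the sample sizes are hypothetical and do not come from any particular trial.

```python
from math import comb


def prob_split(n_total: int, n_arm: int) -> float:
    """Probability that simple 1:1 randomization puts exactly n_arm of
    n_total participants in one particular arm (binomial with p = 0.5)."""
    return comb(n_total, n_arm) / 2 ** n_total


def prob_equal_arms(n_total: int) -> float:
    """Probability that both arms end up with exactly the same size."""
    if n_total % 2:
        return 0.0  # an odd total can never be split evenly
    return prob_split(n_total, n_total // 2)


if __name__ == "__main__":
    # Illustrative totals only, not taken from any real study.
    for n in (20, 100, 400):
        print(f"N = {n}: P(equal arms) = {prob_equal_arms(n):.3f}")
```

Under these assumptions, a trial with 100 participants would end up with two arms of exactly 50 only about 8% of the time.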

It is also important that the follow-up has been long and complete enough, so that the study lasts long enough to be able to observe the outcome variable and that every participant who enters the study is taken into account at the end. As a general rule, if the losses exceed 20%, it is admitted that the internal validity of the study may be compromised.

We will always have to analyze the nature of losses during follow-up, especially if they are high. We must try to determine if the losses are random or if they are related to any specific variable (which would be a bad sign) and estimate what effect they may have on the results of the trial. The usual approach is to adopt the so-called worst-case scenario: it is assumed that all the participants lost from the control group did well and all those lost from the intervention group did badly, and the analysis is repeated to check whether the conclusions change, in which case the validity of the study would be seriously compromised. The last important aspect is to consider whether patients who did not receive the treatment they were assigned (there is always someone who gets confused and messes up) have been analyzed by intention to treat, since this is the only way to preserve all the benefits obtained with randomization. Everything that happens after randomization (such as a change of assigned group) can influence the probability that the subject experiences the effect we are studying, so it is important to respect this intention-to-treat analysis and analyze each participant in the group to which they were initially assigned.
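The worst-case scenario is easy to reproduce by hand, but a small sketch may make the logic clearer. The function names and the counts below are hypothetical, chosen only for illustration; the "event" is the bad outcome we want to avoid.

```python
def risk_ratio(events_t, n_t, events_c, n_c):
    """Risk ratio of the intervention arm versus the control arm."""
    return (events_t / n_t) / (events_c / n_c)


def worst_case_risk_ratio(events_t, completers_t, lost_t,
                          events_c, completers_c, lost_c):
    """Recompute the risk ratio assuming that every participant lost from
    the intervention arm had the event and every participant lost from
    the control arm did not."""
    return risk_ratio(events_t + lost_t, completers_t + lost_t,
                      events_c, completers_c + lost_c)


if __name__ == "__main__":
    # Hypothetical trial: 90 completers and 10 losses in each arm.
    observed = risk_ratio(30, 90, 50, 90)
    worst = worst_case_risk_ratio(30, 90, 10, 50, 90, 10)
    print(f"Observed RR = {observed:.2f}, worst-case RR = {worst:.2f}")
```

If the worst-case estimate still points in the same direction as the observed one, the losses are less likely to have changed the conclusions of the trial.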

Once these primary criteria have been verified, we will look at three secondary criteria that influence internal validity. It will be necessary to verify that the groups were similar at the beginning of the study (we have already talked about the table with the data of the two groups), that masking was carried out appropriately as a way of controlling biases, and that the two groups were managed and monitored in a similar way except, of course, for the intervention under study. We know that masking or blinding allows us to minimize the risk of information bias, which is why researchers and participants are usually unaware of which group each participant is assigned to, which is known as double blinding. Sometimes, given the nature of the intervention (think of one group that undergoes surgery and another that does not), it will be impossible to mask researchers and participants, but we can always give masked data to the person who analyzes the results (the so-called blind evaluator), which mitigates this inconvenience.

To summarize this section on the validity of the trial, we can say that we will have to check that there is a clear definition of the study population, the intervention and the outcome of interest, that the randomization has been done properly, that an attempt has been made to control information biases through masking, that there has been adequate follow-up with control of the losses, and that the analysis has been correct (intention-to-treat analysis and control of covariates not balanced by randomization).

A famous Colombian: Alejandro Jadad Bechara

A very simple tool that can also help us assess the internal validity of a clinical trial is the Jadad scale, also called the Oxford quality scoring system. Jadad, a Colombian doctor, devised a scoring system with 7 questions. First, 5 questions whose affirmative answer adds 1 point:

  1. Is the study described as randomized?
  2. Is the method used to generate the randomization sequence described and is it adequate?
  3. Is the study described as double blind?
  4. Is the masking method described and is it adequate?
  5. Is there a description of the losses during follow up?

Finally, two questions whose negative answer subtracts 1 point:

  1. Is the method used to generate the randomization sequence adequate?
  2. Is the masking method appropriate?

As you can see, the Jadad scale assesses the key points that we have already mentioned: randomization, masking and follow-up. A trial is considered a rigorous study from the methodological point of view if it has a score of 5 points. If the study has 3 points or less, we had better use it to wrap the sandwich.
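The scoring rules described above can be sketched as a small function. The function and parameter names are hypothetical, and the handling of a method that is simply not described (neither adding nor subtracting a point) is an assumption made for this illustration.

```python
from typing import Optional


def jadad_score(randomized: bool,
                randomization_adequate: Optional[bool],
                double_blind: bool,
                masking_adequate: Optional[bool],
                losses_described: bool) -> int:
    """Score a trial following the description in the text.

    For the two *_adequate arguments: True means the method is described
    and adequate (+1), False means it is inadequate (-1), and None means
    it is not described (0, an assumption of this sketch).
    """
    score = int(randomized) + int(double_blind) + int(losses_described)
    for adequate in (randomization_adequate, masking_adequate):
        if adequate is True:
            score += 1
        elif adequate is False:
            score -= 1
    return score


if __name__ == "__main__":
    # A trial that ticks every box reaches the maximum of 5 points.
    print(jadad_score(True, True, True, True, True))
```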

We will now proceed to consider the results of the study to gauge its clinical RELEVANCE. It will be necessary to determine the variables measured to see if the trial adequately expresses the magnitude and precision of the results. It is important, once again, not to settle for being inundated with multiple p-values full of zeros. Remember that p only indicates the probability of accepting as real a difference that exists only by chance (or, to put it simply, of making a type 1 error), and that statistical significance does not have to be synonymous with clinical relevance.

In the case of continuous variables such as survival time, weight, blood pressure, etc., it is usual to express the magnitude of the results as a difference in means or medians, depending on which measure of central tendency is most appropriate. However, in the case of dichotomous variables (alive or dead, healthy or sick, etc.) the relative risk, its relative and absolute reduction and the number needed to treat (NNT) will be used. Of all of them, the one that best expresses the clinical efficiency is always the NNT. Any trial worthy of our attention must provide this information or, failing that, the necessary information so that we can calculate it.

But to allow us to know a more realistic estimate of the results in the population, we need to know the precision of the study, and nothing is easier than resorting to confidence intervals. These intervals, in addition to precision, also inform us of statistical significance. It will be statistically significant if the risk ratio interval does not include the value one and that of the mean difference the value zero. In the case that the authors do not provide them, we can use a calculator to obtain them, such as those available on the CASP website.
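When the authors do not report the interval, it can also be approximated by hand. The sketch below uses the usual large-sample approximation on the logarithmic scale; the function name and the counts are hypothetical and serve only as an illustration.

```python
from math import exp, log, sqrt


def risk_ratio_ci(events_t, n_t, events_c, n_c, z=1.96):
    """Risk ratio with an approximate 95% confidence interval, computed
    on the log scale (large-sample approximation, z = 1.96)."""
    rr = (events_t / n_t) / (events_c / n_c)
    # Standard error of log(RR) for two independent proportions.
    se_log_rr = sqrt(1 / events_t - 1 / n_t + 1 / events_c - 1 / n_c)
    lower = exp(log(rr) - z * se_log_rr)
    upper = exp(log(rr) + z * se_log_rr)
    return rr, lower, upper


if __name__ == "__main__":
    # Illustrative counts: 36 events in 100 treated, 60 in 100 controls.
    rr, lower, upper = risk_ratio_ci(36, 100, 60, 100)
    print(f"RR = {rr:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

With these illustrative counts the interval does not include the value one, so the difference would be statistically significant.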

A good way to sort the study of the clinical importance of a trial is to structure it in these four aspects: Quantitative assessment (measures of effect and its precision), Qualitative assessment (relevance from the clinical point of view), Comparative assessment (see if the results are consistent with those of other previous studies) and Cost-benefit assessment (this point would link to the next section of the critical appraisal that has to do with the applicability of the results of the trial).

To finish the critical reading of a treatment article we will value its APPLICABILITY (also called external validity), for which we will have to ask ourselves if the results can be generalized to our patients or, in other words, if there is any difference between our patients and those of the study that prevents the generalization of the results. It must be taken into account in this regard that the stricter the inclusion criteria of a study, the more difficult it will be to generalize its results, thereby compromising its external validity.

But, in addition, we must consider whether all clinically important outcomes have been taken into account, including side effects and undesirable effects. The outcome variable measured must be important for the investigator and for the patient. Do not forget that demonstrating that the intervention is effective does not necessarily mean that it is beneficial for our patients. We must also assess the harmful or annoying effects and study the balance of benefits, costs and risks, as well as the difficulties that may exist in applying the treatment in our environment, the patient’s preferences, etc.

As is easy to understand, a study can have great methodological validity and its results can have great importance from the clinical point of view and still not be applicable to our patients, either because our patients are different from those of the study, because the intervention does not fit their preferences or because it is not feasible in our environment. However, the opposite does not usually happen: if the validity is poor or the results are unimportant, we will hardly consider applying the conclusions of the study to our patients.

We’re leaving…

To finish, I recommend that you use some of the tools available for critical appraisal, such as the CASP templates, or a checklist, such as CONSORT, so as not to leave any of these points without consideration. Yes, everything we have talked about concerns randomized controlled clinical trials, so what happens with nonrandomized trials or other kinds of quasi-experimental studies? Well, for those we follow another set of rules, such as those of the TREND statement. But that is another story…

King of Kings

Randomized clinical trial

There is no doubt that when doing research in biomedicine we can choose from a large number of possible designs, all with their advantages and disadvantages. But in such a diverse and populous court, among jugglers, wise men, gardeners and purple flautists, the true Crimson King of epidemiology reigns over all of them: the randomized clinical trial.

Definition of randomized clinical trial

The clinical trial is an interventional analytical study, with antegrade direction and concurrent temporality, and with sampling of a closed cohort with control of exposure. In a trial, a sample of a population is selected and divided randomly into two groups. One of the groups (intervention group) undergoes the intervention that we want to study, while the other (control group) serves as a reference to compare the results. After a given follow-up period, the results are analyzed and the differences between the two groups are compared. We can thus evaluate the benefits of treatments or interventions while controlling the biases of other types of studies: randomization favors that possible confounding factors, known or not, are distributed evenly between the two groups, so that if in the end we detect any difference, this has to be due to the intervention under study. This is what allows us to establish a causal relationship between exposure and effect.

From what has been said up to now, it is easy to understand that the randomized clinical trial is the most appropriate design to assess the effectiveness of any intervention in medicine and is the one that provides, as we have already mentioned, a higher quality evidence to demonstrate the causal relationship between the intervention and the observed results.

But to enjoy all these benefits it is necessary to be scrupulous in the approach and methodology of the trials. There are checklists published by experts who understand a lot of these issues, as is the case of the CONSORT list, which can help us assess the quality of the trial’s design. But among all these aspects, let us give some thought to those that are crucial for the validity of the clinical trial.

Components of randomized clinical trials

Everything begins with a knowledge gap that leads us to formulate a structured clinical question. The only objective of the trial should be to answer this question, and answering a single question appropriately is enough. Beware of clinical trials that try to answer many questions since, in many cases, they end up answering none of them well. In addition, the approach must be based on what the inventors of methodological jargon call the equipoise principle, which means nothing more than that, deep in our hearts, we do not really know which of the two interventions is more beneficial for the patient (from the ethical point of view, it would be anathema to make a comparison if we already knew with certainty which of the two interventions is better). It is curious in this sense how trials sponsored by the pharmaceutical industry are more likely to breach the equipoise principle, since they have a preference for comparing with placebo or with “non-intervention” in order to demonstrate the efficacy of their products more easily.

Then we must carefully choose the sample on which we will perform the trial. Ideally, all members of the population should have the same probability not only of being selected, but also of ending up in either of the two arms of the trial. Here we are faced with a small dilemma. If we are very strict with the inclusion and exclusion criteria, the sample will be very homogeneous and the internal validity of the study will be strengthened, but it will be more difficult to extend the results to the general population (this is the explanatory attitude of sample selection). On the other hand, if we are not so rigid, the results will be more similar to those of the general population, but the internal validity of the study may be compromised (this is the pragmatic attitude).

Randomization is one of the key points of the clinical trial. It is the one that assures us that we can compare the two groups, since it tends to distribute the known variables equally and, more importantly, also the unknown variables between the two groups. But do not relax too much: this distribution is not guaranteed at all, it is only more likely to happen if we randomize correctly, so we should always check the homogeneity of the two groups, especially with small samples.

In addition, randomization allows us to perform masking appropriately, with which we perform an unbiased measurement of the response variable, avoiding information biases. These results of the intervention group can be compared with those of the control group in three ways. One of them is to compare with a placebo. The placebo should be a preparation of physical characteristics indistinguishable from the intervention drug but without its pharmacological effects. This serves to control the placebo effect (which depends on the patient’s personality, their feelings towards the intervention, their love for the research team, etc.), but also the side effects that are due to the intervention and not to the pharmacological effect (think, for example, of the percentage of local infections in a trial with medication administered intramuscularly).

The second way is to compare with the treatment accepted as the most effective so far. If there is a treatment that works, the logical (and more ethical) thing is to use it to investigate whether the new one brings benefits. It is also the usual comparison method in equivalence or non-inferiority studies. Finally, the third possibility is to compare with non-intervention, although in reality this is a far-fetched way of saying that only the usual care that any patient would receive in their clinical situation is applied.

It is essential that all participants in the trial are subject to the same follow-up schedule, which must be long enough to allow the expected response to occur. All losses that occur during follow-up should be detailed and analyzed, since they can compromise the validity of the study and its power to detect significant differences. And what do we do with those who are lost or end up in a different arm from the one assigned? If there are many, it may be more reasonable to reject the study. Another possibility is to exclude them and act as if they had never existed, but this can bias the results of the trial. A third possibility is to include them in the analysis in the arm of the trial in which they actually participated (there is always someone who gets confused and takes what they should not), which is known as analysis by treatment received or per-protocol analysis. And the fourth and last option we have is to analyze them in the arm that was initially assigned to them, regardless of what they did during the study. This is called intention-to-treat analysis, and it is the only one of the four possibilities that allows us to retain all the benefits that randomization had previously provided.

Data analysis

As a final phase, we have to analyze and compare the data to draw the conclusions of the trial, using for this the measures of association and impact that, in the case of the clinical trial, are usually the response rate, the risk ratio (RR), the relative risk reduction (RRR), the absolute risk reduction (ARR) and the number needed to treat (NNT). Let’s see them with an example.

Let’s imagine that we carried out a clinical trial in which we tried a new antibiotic (let’s call it A so as not to rack our brains) for the treatment of a serious infection at the site that we are interested in studying. We randomize the selected patients and give them the new drug or the usual treatment (our control group), according to what corresponds to them by chance. In the end, we measure how many of our patients fail treatment (present the event we want to avoid).

Thirty six out of the 100 patients receiving drug A present the event to be avoided. Therefore, we can conclude that the risk or incidence of the event in those exposed (Ie) is 0.36. On the other hand, 60 of the 100 controls (we call them the group of not exposed) have presented the event, so we quickly calculate that the risk or incidence in those not exposed (Io) is 0.6.

At first glance we already see that the risk is different in each group, but as in science we have to measure everything, we can divide the risk in the exposed by the risk in the not exposed, thus obtaining the so-called risk ratio (RR = Ie / Io). An RR = 1 means that the risk is equal in the two groups. If RR > 1 the event will be more likely in the group of exposed (the exposure we are studying will be a risk factor for the production of the event) and if RR is between 0 and 1, the risk will be lower in those exposed. In our case, RR = 0.36 / 0.6 = 0.6. It is easier to interpret RR > 1. For example, an RR of 2 means that the probability of the event is twice as high in the exposed group. Following the same reasoning, an RR of 0.3 would tell us that the event is about a third as frequent in the exposed as in the controls. You can see in the attached table how these measures are calculated.

But what we are interested in is knowing how much the risk of the event decreases with our intervention, in order to estimate how much effort is needed to prevent each one. For this we can calculate the RRR and the ARR. The RRR is the risk difference between the two groups relative to the control (RRR = [Ie – Io] / Io, ignoring the negative sign). In our case it is 0.4, which means that the intervention tested reduces the risk by 40% compared to the usual treatment.

The ARR is simpler: it is the difference between the risks of exposed and controls (ARR = Ie – Io). In our case it is 0.24 (we ignore the negative sign), which means that out of every 100 patients treated with the new drug there will be 24 fewer events than if we had used the control treatment. But there is still more: we can know how many patients we have to treat with the new drug to avoid one event just by doing the rule of three (24 is to 100 as 1 is to x) or, easier to remember, by calculating the inverse of the ARR. Thus, NNT = 1 / ARR ≈ 4.2. In our case we would have to treat about four patients to avoid one event. The context will always tell us the clinical importance of this figure.

As you can see, the RRR, although technically correct, tends to magnify the effect and does not clearly quantify the effort required to obtain the results. In addition, it may be similar in different situations with totally different clinical implications. Let’s see it with another example that I also show you in the table. Suppose another trial with a drug B in which we obtain three events in the 100 treated and five in the 100 controls. If you do the calculations, the RR is 0.6 and the RRR is 0.4, as in the previous example, but if you calculate the ARR you will see that it is very different (ARR = 0.02), with an NNT of 50. It is clear that the effort to avoid an event is much greater (4 versus 50) despite the same RR and RRR.
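The arithmetic of the two examples can be reproduced with a short sketch. The function name is hypothetical; the ARR is computed as Io minus Ie so that a beneficial treatment gives a positive value, which is equivalent to ignoring the negative sign as the text does.

```python
def effect_measures(events_treated, n_treated, events_control, n_control):
    """Return the usual effect measures of a two-arm trial as a dict."""
    ie = events_treated / n_treated   # risk in the exposed (treated)
    io = events_control / n_control   # risk in the not exposed (controls)
    arr = io - ie                     # absolute risk reduction
    return {
        "RR": ie / io,                # risk ratio
        "RRR": arr / io,              # relative risk reduction
        "ARR": arr,
        "NNT": 1 / arr,               # undefined when the risks are equal
    }


if __name__ == "__main__":
    # Drug A: 36/100 events versus 60/100; drug B: 3/100 versus 5/100.
    for name, counts in (("A", (36, 100, 60, 100)), ("B", (3, 100, 5, 100))):
        m = effect_measures(*counts)
        print(name, {key: round(value, 2) for key, value in m.items()})
```

Both drugs share the same RR and RRR, while the ARR and the NNT differ by more than a factor of ten.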

So, at this point, let me advise you. As the data needed to calculate the RRR are the same as those needed to calculate the easier ARR (and NNT), if a scientific paper offers you only the RRR and hides the ARR, distrust it and do as with the brother-in-law who offers you wine and cured cheese, asking him why he does not serve a skewer of Iberian ham instead. Well, what I really wanted to say is that you’d better ask yourselves why they don’t give you the ARR, and compute it using the information from the article.

Basic design modifications

So far all that we have said refers to the classical design of parallel clinical trials, but the king of designs has many faces and, very often, we can find papers in which it is shown a little differently, which may imply that the analysis of the results has special peculiarities.

Let’s start with one of the most frequent variations. If we think about it for a moment, the ideal design would be one that allowed us to observe in the same individual the effect of the study intervention and of the control intervention (the placebo or the standard treatment), since the parallel trial is an approximation that assumes that the two groups respond equally to the two interventions, which always implies a risk of bias that we try to minimize with randomization. If we had a time machine we could try the intervention in all of them, write down what happens, turn back the clock and repeat the experiment with the control intervention so we could compare the two effects. The problem, as the more alert among you have already imagined, is that the time machine has not been invented yet.

But what has been invented is the cross-over clinical trial, in which each subject is their own control. As you can see in the attached figure, in this type of trial each subject is randomized to a group, subjected to the intervention, allowed to undergo a wash-out period and, finally, subjected to the other intervention. Although this solution is not as elegant as that of the time machine, the defenders of cross-over trials argue that variability within each individual is less than variability between individuals, so the estimate can be more precise than that of the parallel trial and, in general, smaller sample sizes are needed. Of course, before using this design you have to make a series of considerations. Logically, the effect of the first intervention should not produce irreversible changes or be very prolonged, because it would affect the effect of the second. In addition, the wash-out period must be long enough to avoid any residual effects of the first intervention.

It is also necessary to consider whether the order of the interventions can affect the final result (sequence effect), in which case only the results of the first intervention would be valid. Another problem is that, because the study lasts longer, the characteristics of the patient can change throughout the study and be different in the two periods (period effect). And finally, beware of losses during the study, which are more frequent in longer studies and have a greater impact on the final results than in parallel trials.

Imagine now that we want to test two interventions (A and B) in the same population. Can we do it with the same trial and save costs of all kinds? Yes, we can, we just have to design a factorial clinical trial. In this type of trial, each participant undergoes two consecutive randomizations: first they are assigned to intervention A or to placebo (P) and, second, to intervention B or placebo, which gives us four study groups: AB, AP, BP and PP. Logically, the two interventions must act by independent mechanisms to be able to assess the results of the two effects independently.
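The double randomization can be sketched as follows. This is a minimal illustration that assumes simple randomization with equal probabilities; the function name and the seed are hypothetical, and a real trial might use blocks or another allocation technique.

```python
import random
from collections import Counter

# Group labels as used in the text: BP means placebo for A plus intervention B.
LABELS = {("A", "B"): "AB", ("A", "P"): "AP", ("P", "B"): "BP", ("P", "P"): "PP"}


def factorial_assignment(n_participants, seed=0):
    """Assign each participant to one of the four groups of a 2x2
    factorial trial through two independent simple randomizations."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    groups = []
    for _ in range(n_participants):
        first = rng.choice(["A", "P"])   # intervention A or its placebo
        second = rng.choice(["B", "P"])  # intervention B or its placebo
        groups.append(LABELS[(first, second)])
    return groups


if __name__ == "__main__":
    print(Counter(factorial_assignment(200)))
```

The effect of A is then assessed by comparing AB and AP with BP and PP, and the effect of B by comparing AB and BP with AP and PP.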

Usually, one intervention related to a more plausible and mature hypothesis is studied together with another related to a less tested hypothesis, ensuring that the evaluation of the second does not influence the inclusion and exclusion criteria of the first. In addition, it is not advisable for either of the two options to have many annoying effects or to be badly tolerated, because poor compliance with one treatment usually leads to poor compliance with the other. In cases where the two interventions are not independent, the effects could be studied separately (AP versus PP and BP versus PP), but the design advantages are lost and the necessary sample size increases.

At other times it may happen that we are in a hurry to finish the study as soon as possible. Imagine a very bad disease that kills lots of people for which we are trying a new treatment. We want to have it available as soon as possible (if it works, of course), so after every certain number of participants we will stop and analyze the results and, if we can already demonstrate the usefulness of the treatment, we will consider the study finished. This is the design that characterizes the sequential clinical trial. Remember that in the parallel trial the correct thing is to calculate the sample size beforehand. In this design, with a more Bayesian mentality, a statistic is established whose value determines an explicit stopping rule, so that the size of the sample depends on the previous observations. When the statistic reaches the predetermined value we feel confident enough to reject the null hypothesis and we finish the study. The problem is that each stop and analysis increases the probability of rejecting the null hypothesis when it is true (type 1 error), so it is not recommended to do many interim analyses. In addition, the final analysis of the results is complex because the usual methods do not work, but there are others that take the interim analyses into account. This type of trial is very useful with very fast-acting interventions, so it is common to see them in titration studies of opioid doses, hypnotics and similar poisons.
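The inflation of the type 1 error with repeated looks at the data can be checked with a small simulation. This is a deliberately simplified sketch: it assumes a single sample of standard normal observations with known variance and tests at every look with the same unadjusted threshold of 1.96; the function name, the number of participants per look and the seed are all hypothetical.

```python
import random
from math import sqrt


def type1_error_rate(n_looks, n_per_look=20, n_sims=2000, z_crit=1.96, seed=1):
    """Estimate by simulation how often a true null hypothesis is rejected
    when the accumulating data are tested n_looks times at the same
    unadjusted threshold."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        total, n = 0.0, 0
        for _ in range(n_looks):
            # The null is true: observations have mean 0 and sd 1.
            for _ in range(n_per_look):
                total += rng.gauss(0.0, 1.0)
            n += n_per_look
            z = total / sqrt(n)  # z statistic for the mean with known sd = 1
            if abs(z) > z_crit:
                rejections += 1
                break  # the trial stops at the first "significant" look
    return rejections / n_sims


if __name__ == "__main__":
    for looks in (1, 2, 5, 10):
        print(f"{looks} look(s): type 1 error = {type1_error_rate(looks):.3f}")
```

With a single final analysis the estimated error stays close to the nominal 5%, and it grows as more unadjusted looks are added, which is why sequential designs rely on adjusted stopping rules.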

Clustered trials

There are other occasions when individual randomization does not make sense. Imagine we have taught the doctors of a center a new technique to better inform their patients and we want to compare it with the old one. We cannot tell the same doctor to inform some patients in one way and others in another, since there would be a high chance of the two interventions contaminating each other. It would be more logical to teach the doctors in one group of centers and not to teach those in another group, and then compare the results. Here what we would randomize is the centers, to train their doctors or not. This is the trial with group assignment design, or cluster trial. The problem with this design is that we do not have many guarantees that the participants of the different groups behave independently, so the sample size needed can increase a lot if there is great variability between the groups and little within each group. In addition, an aggregate analysis of the results has to be done, because if it is done individually, the confidence intervals are falsely narrowed and we can find false statistical significance. The usual thing is to calculate a weighted summary statistic for each group and make the final comparisons with it.
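One common way to quantify how much the sample size grows is the so-called design effect, which depends on the cluster size and on the intracluster correlation coefficient (ICC). Neither term appears in the text above, so take this as a hedged sketch of a standard approximation that assumes clusters of equal size; the function names and the numbers are hypothetical.

```python
from math import ceil


def design_effect(cluster_size: float, icc: float) -> float:
    """Inflation factor for the sample size of a cluster trial,
    assuming clusters of equal size: 1 + (m - 1) * ICC."""
    return 1 + (cluster_size - 1) * icc


def clustered_sample_size(n_individual: int, cluster_size: float, icc: float) -> int:
    """Total participants needed when randomizing clusters, given the size
    that individual randomization would require."""
    inflated = n_individual * design_effect(cluster_size, icc)
    # Round first so floating-point noise cannot push the result up by one.
    return ceil(round(inflated, 6))


if __name__ == "__main__":
    # Illustrative values: 200 participants needed with individual
    # randomization, centers of 20 patients, ICC of 0.05.
    print(clustered_sample_size(200, 20, 0.05))
```

The more alike the participants within each center are (higher ICC), the larger the inflation factor becomes.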

The last of the series that we are going to discuss is the community trial, in which the intervention is applied to population groups. Because they are carried out in real conditions on populations, they have great external validity and their results often make it possible to recommend cost-efficient measures. The problem is that it is often difficult to establish control groups, it can be harder to determine the necessary sample size and it is more complex to make causal inferences from their results. It is the typical design for evaluating public health measures such as water fluoridation, vaccinations, etc.

We’re leaving…

I’m done now. The truth is that this post has been a bit long (and I hope not too hard), but the King deserves it. In any case, if you think that everything is said about clinical trials, you have no idea of all that remains to be said about types of sampling, randomization, etc., etc., etc. But that is another story…

Regular customers

Re-randomization in clinical trials

We saw in a previous post that sample size is very important. The sample should be the right size, neither more nor less. If it is too large, we are wasting resources, something to keep in mind in these times. If we use a small sample we will save money, but lose statistical power. This means that there may be a difference in effect between the two interventions tested in a clinical trial and we may not be able to detect it, in which case we will be throwing good money away just the same.

When sample size is out of our reach

The problem is that sometimes it can be very difficult to reach an adequate sample size, since it would take excessively long to recruit the desired number of participants. Well, for these cases, someone with a commercial mentality has devised a method that consists of including the same participant many times in the trial. It’s like in bars. It is always easier to have a regular clientele who come back many times than to have a very large clientele (which is also desirable).

There are times when the same patient needs the same treatment on repeated occasions. Consider, for example, asthmatics who need bronchodilator treatment repeatedly, or couples undergoing in vitro fertilization, which may require several cycles to succeed.

Re-randomization in clinical trials

Although the usual standard in clinical trials is to randomize participants, in these cases we can randomize each participant independently every time he needs treatment. For example, if we are testing two bronchodilators, we can randomize the same subject to one of the two every time he has an asthma attack and needs treatment. This procedure is known as re-randomization and consists, as we have seen, of randomizing situations rather than participants.

This trick is quite correct from a methodological point of view, provided that certain conditions discussed below are met.

The participant enters the trial the first time in the usual way, being randomly assigned to one of the two arms of the trial. He is then followed up during the appropriate period and the results of the study variables are collected. Once the follow-up period is finished, if the patient requires new treatment and continues to meet the inclusion criteria of the trial, he is randomized again, repeating this cycle as many times as necessary to achieve the desired sample size.

Recruiting situations instead of participants makes it possible to reach the sample size with a smaller number of participants. For example, if we need a sample size of 500, we can randomize 500 participants once, 250 twice, or 200 once and 50 six times. The important thing is that the number of randomizations of each participant cannot be specified in advance, but must depend on the need for treatment each time.

Three conditions

To apply this method correctly you need to meet three requirements. First, patients can only be re-randomized when they have fully completed the follow-up period of the previous procedure. This is logical because, otherwise, the effects of the two treatments would overlap and a biased measure of the effect of the intervention would be obtained.

Second, each new randomization in the same participant should be done independently of the others. In other words, the probability of assignment to each intervention should not depend on previous assignments. Some authors are tempted to use reallocations to balance the two groups, but this can bias comparisons between the two groups.

Third, the benefit that the participant obtains from each intervention should be the same every time he is randomized. Otherwise, we get a biased estimate of the treatment effect.
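As an illustration of the second condition, here is a minimal sketch of independent re-randomization. It assumes, hypothetically, that every episode in the list happens after the previous follow-up has been completed (the first condition is not modeled) and uses a simple 1:1 coin flip. The function name and the data are invented for the example.

```python
import random


def rerandomize(episodes_per_participant, seed=0):
    """Assign an arm to every treatment episode of every participant.

    episodes_per_participant: number of episodes each participant ended up
    contributing (in a real trial this is only known after the fact).
    """
    rng = random.Random(seed)
    allocations = []
    for participant_id, n_episodes in enumerate(episodes_per_participant):
        for episode in range(1, n_episodes + 1):
            # Independent draw: previous assignments are never consulted.
            arm = rng.choice(["A", "B"])
            allocations.append((participant_id, episode, arm))
    return allocations


# Hypothetical recruitment: 200 participants treated once, 50 treated six times.
episodes = [1] * 200 + [6] * 50
allocations = rerandomize(episodes)
n_randomizations = len(allocations)
n_participants = len({pid for pid, _, _ in allocations})
```

With this recruitment pattern the 500 randomizations come from only 250 participants. Note that nothing in the loop tries to balance the arms within a participant, which is exactly what the second condition asks for.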

We see, then, that this is a good way to reach the sample size we want more easily. The problem with this type of design is that the analysis of the results is more complex than that of a conventional clinical trial.

Basically, without going into details, there are two methods of analysis of results. The simplest is the unadjusted analysis, in which all interventions, even if they belong to the same participant, are treated as independent. This approach, which is usually expressed as a linear regression model, does not take into account the effect that participants can have on the results.

The other method adjusts for the effect of patients, taking into account the correlation between observations of the same participant.

We’re leaving…

And here we leave it for today. We have said nothing about the mathematical treatment of the adjusted method to avoid burning the reader’s neurons. Suffice it to say that there are several approaches, based on generalized linear models and mixed-effects models. But that is another story…

Steady… ready…

Don’t! Not so fast. Before you rush out there you have to be sure that everything is well prepared. It is difficult to conceive of anyone running a marathon without preparing beforehand, without sufficient physical fitness and nutrition. Well, actually it is difficult to conceive of what it is like to run 42 kilometers nonstop, so let’s try another, more credible example.

Imagine that it is bedtime and we are as worn out as if we had run a marathon. This situation is already more credible for most of us. Anyone in their right mind knows that it is advisable to drink water and visit the bathroom before going to bed. The price of skipping these preparations is getting up in the middle of the night, stumbling and shivering, to satisfy needs that could have been foreseen and avoided (except for prostate imperatives, of course).

Now imagine that we want to conduct a clinical trial. We plan the study, we choose our population, we randomize participants perfectly, and we give the intervention group the new drug we want to study against chronic unbearable fildulastrosis and wham! Most of them do not tolerate the drug and withdraw from the study early. We will have wasted money and time, and it is difficult to know which of the two is the more precious resource these days.

Could we have avoided this problem? Poor tolerance to the drug is a fact that we cannot help but, because there are people who do tolerate it, we could have used a little trick: give the drug to everyone before randomizing, withdraw the intolerant from the study and then randomize only those who can endure the drug until the end of the study. This is what is called using a run-in period, also known as a run-in phase (some call it an open-label phase, but I think that this term is not always equivalent to run-in period).

In general, during the run-in period participants are observed before being assigned to their study group, to verify that they meet the inclusion criteria for an intervention, that they are good compliers, that they tolerate the intervention, etc. By making sure that they meet the prerequisites for inclusion in the study, we ensure a more valid and consistent initial observation before random assignment to the corresponding group.

At other times the intervention itself is used during the run-in and the response to it becomes part of the inclusion criteria, since individuals can be selected or excluded based on their response to treatment.

You see how a run-in period can rid us of the bad compliers, of those in poor health who can give us unpleasant surprises during the trial and of those who cannot tolerate the drug in question, so that we can better focus on determining the efficacy of treatment, since most of the losses during follow-up will be for reasons not related to the intervention.

Anyway, we must take a number of precautions. We must be careful in choosing the initial sample, whose size may need to be larger than that required without a run-in. The baseline situation of the participants is very important in order to stratify or to make a more efficient statistical analysis. In addition, randomization must be done as late as possible and as close as possible to the intervention, although it is not uncommon to find studies in which participants were randomized before the run-in period. Finally, to interpret the results of a study with a run-in period, we must take into account the differences between the baseline characteristics of the participants who were excluded during the period and those who are ultimately assigned to study groups.

But not everything is rosy. Although excluding bad compliers or those with more adverse effects allows us to increase the power of the study and better estimate the effect of the intervention, the applicability or generalizability of the results will be compromised because they come from a more restrictive sample of participants. To put it more elegantly, we pay for the increased internal validity with a reduction of the external validity of the study.

To end this post, a word about something similar to the run-in period. Imagine that we want to test a new proton pump inhibitor in patients with ulcers. As all of them are receiving some treatment, it may distort the results of our intervention. The trick here is to tell everyone to stop the medication for a while before randomization and allocation to the arms of the study. But do not confuse this with the run-in period. This is what is known as a washout period. But that is another story…

More than one rooster per pen

Factorial clinical trial

The clinical trial is the king of epidemiological designs. But it is also the most expensive to perform. And, in our times, this is a significant obstacle to launching a trial.

Usually each trial evaluates an intervention in one group versus a control group that receives no intervention or a placebo. But what if we could test several interventions in the same trial? The costs would probably be lower than testing each intervention separately, each in its conventional parallel trial. Well, this is possible using a design known as the factorial trial.


Factorial clinical trial

The simplest form is the 2×2 factorial trial, in which two different interventions are tested on the same sample of participants. The trick is to randomize participants several times to form more than the two groups of a parallel trial. Suppose we want to do a factorial trial with treatments A and B (let’s not complicate things too much thinking about an example). First, we randomly assign participants to receive treatment A or not. Then we perform another randomization to receive treatment B or not. Thus, the sample of N participants is divided into four groups, as shown in the attached table: N/4 receive only A, N/4 receive only B, N/4 receive A and B simultaneously, and N/4 remain untreated (control group).

This design is the basic 2×2 factorial trial. If we look at the table, the analysis of the marginal values of the rows allows us to compare the effect of receiving A with that of not receiving it. Likewise, the marginal analysis of the columns allows us to compare the effect of receiving B with that of not receiving it. We could also compare the values of each cell separately, but then we’d lose power to detect differences, and with it one of the advantages of this type of design.

The sample size required is usually calculated imagining that we are going to do two independent parallel trials and taking the largest number needed to detect the smallest of the effects we want to study.

Meanwhile, randomization is done using the same methods as with the parallel trial, but it’s repeated several times. Another alternative would be to identify all the groups (A, B, A + B and control, in our example) and make the random assignment at once. The result is the same.
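The second alternative, assigning all four groups at once, can be sketched as follows. This is just one possible way of doing it, assuming a sample size that is a multiple of four and a simple shuffled list of balanced labels. The function name is hypothetical, and real trials would add allocation concealment and usually blocks.

```python
import random
from collections import Counter


def factorial_allocation(n_participants, seed=0):
    """Return one (receives_A, receives_B) label per participant, N/4 per cell."""
    if n_participants % 4 != 0:
        raise ValueError("this sketch needs a sample size that is a multiple of 4")
    # The four cells of the 2x2 table: A+B, only A, only B, control.
    cells = [(a, b) for a in (True, False) for b in (True, False)]
    labels = cells * (n_participants // 4)
    rng = random.Random(seed)
    rng.shuffle(labels)  # random order of a perfectly balanced list
    return labels


allocation = factorial_allocation(400)
cell_counts = Counter(allocation)
receives_a = sum(1 for a, _ in allocation if a)
receives_b = sum(1 for _, b in allocation if b)
```

Each cell ends up with N/4 participants, and each margin (A versus no A, B versus no B) compares N/2 against N/2, which is where the design gets its efficiency.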

The main reason for doing a factorial trial should be economic, since it requires fewer participants than the two comparable parallel trials and therefore turns out cheaper. This is especially useful if the sponsor of the trial does not expect to make huge profits from the results. That is why it is common to see factorial trials with unprofitable treatments, or with well-known and traditionally used ones.

Conditions for conducting a factorial clinical trial

An important condition for conducting a factorial trial with guarantees is that there is no interaction between the two treatments, so that their effects are independent. When there is interaction between the two treatments (the effect of one depends on the presence of the other), the analysis gets complicated and the necessary sample is larger, since we cannot analyze the marginal values of the table to detect differences but have to assess the differences among all groups and, as we’ve said, the statistical power of the study is lower.

In any case, we should always check for interaction. This can be done by fitting a regression model with an interaction term and comparing it with the same regression model without the interaction term. If we detect an interaction (which might not have been suspected beforehand), we must analyze each group separately, even at the cost of losing power to detect statistically significant differences.
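To get an intuition of what the interaction term measures, here is a minimal sketch for a continuous outcome. In a linear regression with indicator variables for A, B and their product, fitted to the four cells, the coefficient of the product equals the contrast computed below. The sketch gives only the point estimate, with no standard error or significance test; the function name and the data are hypothetical.

```python
from statistics import mean


def interaction_contrast(control, only_a, only_b, both):
    """Effect of A in the presence of B minus effect of A in the absence of B."""
    effect_a_without_b = mean(only_a) - mean(control)
    effect_a_with_b = mean(both) - mean(only_b)
    return effect_a_with_b - effect_a_without_b


# Hypothetical outcomes: the effects of A and B simply add up.
additive = interaction_contrast(
    control=[10, 12, 11], only_a=[13, 15, 14],
    only_b=[15, 17, 16], both=[18, 20, 19],
)

# Hypothetical outcomes: the combination does better than the sum of the parts.
synergistic = interaction_contrast(
    control=[10, 12, 11], only_a=[13, 15, 14],
    only_b=[15, 17, 16], both=[24, 26, 25],
)
```

A contrast close to zero is compatible with independent effects, in which case the marginal analysis is reasonable. A contrast far from zero suggests that each cell should be looked at separately.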

And can we compare more than two interventions? We can compare all we want, but we must bear in mind that design complexity increases, as do the number of groups to be compared and the possibility of encountering interaction among any of the tested interventions. For these reasons, it is advisable to keep the number of interventions as low as possible.

We have already discussed the most obvious advantage of the factorial trial: the lower cost that results from requiring a smaller sample size. Another advantage is that it is useful if we are also interested in assessing the effect of the combination of interventions, which incidentally lets us assess the existence of interaction.

Meanwhile, interactions between interventions are the main limitation of this design. We have already mentioned that, when there is interaction, we need to analyze the groups individually, with the loss of power that this entails. Another drawback is that the compliance of the participants may not be very good: the more treatments a participant must follow correctly, the more likely it is that he will not do it the way he should.

We’re leaving…

And here we leave for today the story of factorial clinical trials. We have described the simplest form, the 2×2 factorial. However, as we have said, things can get complicated when comparing more interventions and also when assigning different sizes to each of the groups. For example, if we want to detect smaller differences in one treatment group that interests us, we can assign more patients to it. Of course all this complicates the analysis and the calculation of sample size. But that is another story…

The gregarious one

Cluster clinical trials

The conventional randomized clinical trial is an individualistic design, in which each participant is randomized to receive the intervention or placebo, the outcome variable is then measured in each participant and the differences are compared. This individual randomization is complemented by the masking process, so that no one knows which group each participant belongs to and there can be no effects related to this knowledge.

The problem is that there are times when it is not possible to mask the intervention, so participants know what each one receives. Suppose we want to study the effect of certain dietary advice on blood pressure levels in a population. We can give the recommendations to each participant or not, but each of them will know whether he received them, so masking is not possible.

In addition, two things may happen that can invalidate the comparison of effects with and without the intervention. First, participants can share information with each other, so that some in the placebo group would also learn the advice and could follow it. Second, it could be difficult for the researchers to treat the participants of both groups objectively, and their recommendations could reach the wrong participant in some situations. This is what is known as contamination between groups, which is very frequent when we try to study interventions in public health or health promotion programs.

But do not worry ahead of time, because to solve this problem we can fall back on the gregarious cousin in the family of the randomized clinical trial: the cluster randomized trial.

Cluster clinical trials

In these trials the unit of randomization is not the individual but groups of individuals. Going back to the previous example, we could randomize the patients of one health center to the intervention group and the patients of another center to the control group. This has the advantage of preventing contamination between groups, with the added advantage that participants within each group behave similarly.

For this design to work properly there has to be a sufficient number of groups so that the baseline characteristics of their members are balanced by randomization. It is also mandatory to keep in mind a number of special considerations during the design, analysis and reporting of cluster trials, since the lack of independence of the participants in each group has major statistical implications. It may be that the members of each group share some characteristics that differ from those of other groups (selection bias), and there may also be a different distribution of confounding variables within each group.

Sample size

One problem with this type of design is that it has less power than the equivalent randomized clinical trial, so larger sample sizes are needed, in proportion to what is called the cluster inflation factor. Furthermore, the number and size of the groups must be considered, as well as the correlation that may exist between the results of patients within the same group, measured by the intracluster correlation coefficient.

Thus, to calculate the sample size we have to multiply the size that the standard trial would have by a design factor that takes into account the cluster size and the intracluster correlation coefficient. The formula to calculate it is the following:

N (cluster trial) = Inflation factor x N (standard clinical trial)

Inflation factor = 1 + [(m – 1) x ICC], where m is the cluster size and ICC is the intracluster correlation coefficient.

Here’s an example. Suppose we are planning a trial and we would need 400 participants in the standard trial to detect a certain effect size with the desired power and statistical significance. We estimate that the intracluster correlation coefficient is equal to 0.15 and decide that we want clusters of 30 participants. The sample size required for a cluster randomized trial is

N (cluster trial) = (1 + [(30 – 1) x 0.15]) x 400 = 2140

Rounding up, we need 72 clusters of 30 participants, for a total sample of 2160. As can be seen, this is more than five times the sample size of the conventional trial.
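The calculation above is easy to reproduce. The following sketch applies the inflation factor formula to the figures of the example; the function name is hypothetical, and the only thing added is a small rounding step to avoid floating-point noise before rounding up.

```python
import math


def cluster_sample_size(n_standard, cluster_size, icc):
    """Inflate a standard-trial sample size for a cluster randomized design."""
    inflation = 1 + (cluster_size - 1) * icc
    # Round away floating-point noise before taking the ceiling.
    n_needed = math.ceil(round(inflation * n_standard, 6))
    n_clusters = math.ceil(n_needed / cluster_size)
    n_total = n_clusters * cluster_size
    return inflation, n_needed, n_clusters, n_total


# Figures from the example: 400 participants, clusters of 30, ICC of 0.15.
inflation, n_needed, n_clusters, n_total = cluster_sample_size(400, 30, 0.15)
```

Trying other values shows how sensitive the design is to the intracluster correlation coefficient: with the same clusters of 30 and a coefficient of 0.05, the inflation factor drops to 2.45.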

Analysis of results

Another peculiarity of cluster trials is that the analysis phase must take into account the lack of independence among the patients in each group, no matter whether we calculate results individually or obtain summary measures at cluster level. This is because, if we ignore the lack of independence among participants, the probability of making a type I error and drawing the wrong conclusion increases. To get an idea, a p-value of 0.01 can become something greater than 0.05 once we account for this effect.

This means that we cannot use tests like Student’s t test and have to resort to robust analysis of variance or to the more widely used random effects model, which not only takes the cluster effect into account but also makes it possible to estimate and assess the degree of contamination. It also takes into account the heterogeneity due to unobserved factors and allows adjusting for covariates that produce imbalances between groups. One possibility is to run the analysis both with and without the clustering effect and check whether the significance values differ, in which case the difference supports our having chosen the right kind of design for our study.

And these are the most important issues that we have to keep in mind when conducting a cluster trial. Its main advantage is that it avoids contamination between participants, as we saw at the beginning, so it is very useful for assessing strategies to improve health and educational programs. Its main drawback has already been mentioned: the lower power, with the consequent need for much larger sample sizes.

Finally, all these issues concerning the calculation of sample size and the statistical analysis that accounts for the effect of clusters should be clearly specified when the results of the trial are reported.

We’re leaving…

One last piece of advice. If you carry out a cluster trial or critically appraise one, do not forget to check that the authors have taken into account the peculiarities that we have discussed. To do this you can use the CONSORT statement. This is a checklist of the characteristics that clinical trials must meet, including the specific characteristics of cluster trials. But that is another story…

Intention is what matters

Intention-to-treat analysis

There is always someone who does not do what he’s told, no matter how simple the approach of a clinical trial seems to be for its participants. They are randomly assigned to one of the two arms of the trial, some have to take pill A and the others have to take pill B, and so we can test which of the two is better.

However, there’s always someone who does not do what he has to and takes the pill that does not correspond to him, or doesn’t take any pill at all, or takes it wrong, or stops taking it ahead of time, etc., etc., etc.

Types of analysis

And what do we do when it comes to analyzing the results? Common sense tells us that if a participant has not taken the assigned treatment we should include him in the group of the pill he actually took (this is called a per protocol analysis). Another option is to leave out the participant who doesn’t take the treatment. But neither attitude is correct if we want to make an unbiased analysis of the results. If participants begin to change from one group to the other, we lose the benefit we obtained by distributing them randomly, and confounding or effect-modifying variables that were balanced between the two groups by randomization can come into play.


To avoid this, the right thing is to respect the initial group assignment and analyze the results of the mistaken subject as if he had correctly taken the assigned treatment. This is what is known as the intention to treat analysis, the only one that preserves the advantages of randomization.

There are several reasons why a participant in a trial may not receive the assigned treatment, in addition to poor compliance on his part. Here are some.

Sometimes it may be the researcher who wrongly includes the participant in the treatment group. Imagine that, after randomization, we realize that some participants are not eligible for the intervention, either because they have the disease or because we discover that there is a contraindication to surgery, for example. If we are strict, we should analyze them in the group to which they were assigned, even though they have not received the intervention. However, it may be reasonable to exclude them if the causes of exclusion are specified beforehand in the trial protocol. In that case, it is important that the exclusion is performed by someone who does not know the allocation or the results, so that participants of both arms of the trial are handled in the same way. Anyway, if we want to be safer, we can do a sensitivity analysis with and without these subjects to see how the results change.

Another problem of this type can result from missing data. The results of all variables, and especially of the primary one, should be available for all participants, but this is not always the case, so we have to decide what to do with the subjects with missing data.

Most statistical programs perform a complete-case analysis, excluding the records of subjects with missing data. This reduces the effective sample size and may bias the results, in addition to reducing the power of the study. Some models, such as longitudinal mixed models or Cox regression, can handle records with some missing data, but none can do anything if all the information of a subject is missing. In these cases we can use data imputation in any of its forms, filling the gaps so that we can use the whole sample according to the intention to treat.

When data imputation is not appropriate, one thing we can do is what is called an analysis of extreme cases. This is done by assigning the best and the worst possible outcomes to the gaps and seeing how the results change. This gives us an idea of the maximum potential impact of missing data on the results of the study. In any case, there is no doubt that the best strategy is to design the study so that missing data are kept to a minimum.
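The analysis of extreme cases can be sketched for a binary outcome as follows. The sketch assumes, as one possible convention, that the best case for the new treatment counts its missing participants as successes and the missing controls as failures, and that the worst case does the opposite. The function name, the dictionary keys and the counts are hypothetical.

```python
def extreme_case_analysis(treated, control):
    """Best-case and worst-case risk differences (treated minus control).

    Each arm is a dict with 'successes', 'failures' and 'missing' counts.
    """
    def success_rate(arm, missing_as_success):
        total = arm["successes"] + arm["failures"] + arm["missing"]
        successes = arm["successes"]
        if missing_as_success:
            successes += arm["missing"]
        return successes / total

    # Best case: every gap favors the treatment.
    best = success_rate(treated, True) - success_rate(control, False)
    # Worst case: every gap goes against the treatment.
    worst = success_rate(treated, False) - success_rate(control, True)
    return best, worst


# Hypothetical trial with 100 participants per arm and some missing outcomes.
treated_arm = {"successes": 40, "failures": 50, "missing": 10}
control_arm = {"successes": 30, "failures": 65, "missing": 5}
best_case, worst_case = extreme_case_analysis(treated_arm, control_arm)
```

In this invented example the difference in success rates ranges from 5 to 20 percentage points. If the conclusion of the study holds at both ends of the range, the missing data are unlikely to change it.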

Anyway, there’s always someone who makes a mistake and messes up the conduct of the trial. What can we do?

Variations of intention-to-treat analysis

One possibility is to use a modified intention to treat analysis. It analyzes everyone in the assigned group, but it allows excluding some participants, such as those who never started treatment or who were not considered suitable for the study. The problem is that this opens the door to dressing up the data as we please and biasing the results to our advantage. Therefore, we must be suspicious when these changes were not specified in the trial protocol and are a post hoc decision.

The other possibility is to make the analysis according to treatment received (per protocol analysis). The problem, as we have said, is that the balance of randomization is lost. Also, if those who made mistakes have some special feature, the results of the study may be biased. On the other hand, the advantage of analyzing things as they really happened is that we can get a better idea of how the treatment can work in real life.

Finally, perhaps the safest thing to do is to perform both analyses, per protocol and intention to treat, and compare the results obtained with each. In these cases it may be that we detect an effect with the per protocol analysis and not with the intention to treat analysis. This may be due to two main causes. First, per protocol analysis may create spurious associations because it lacks the balance of confounders guaranteed by randomization. Second, the intention to treat analysis favors the null hypothesis, so it has less power than the per protocol analysis. Of course, if we detect a significant effect, we can be more confident in it if the analysis was by intention to treat.

We’re leaving…

And here we end for today. We have seen how to try to control errors in the assignment to groups in the trial and how we can impute missing data, which is a fancy way of saying that we invent data where they’re missing. Of course, we can only do that if some conditions are fulfilled. But that’s another story…

The consolation of not being worse

We live in a frantic and highly competitive world. We are continually inundated with messages about how good it is to be the best at this and that. As indeed it is. But most of us soon realize that it is impossible to be the best at everything we do. Gradually, we even realize that it is very hard to be the best at anything in particular, not only in general. In the end, sooner or later, ordinary mortals have to settle for the minimum of not being the worst at what we do.

But this is not that bad. You can’t always be the best and, indeed, you certainly do not have to. Consider, for example, that we have a great treatment for a very bad disease. This treatment is effective, inexpensive, easy to use and well tolerated. Are we interested in changing to another drug? Probably not. But suppose now that it produces irreversible aplastic anemia in 3% of those who take it. In this case we would like to find a better treatment.

Better? Well, not really better. If it were simply the same in everything except the production of aplasia, we’d change to the new treatment.

The most common goal of clinical trials is to show the superiority of an intervention over a placebo or the standard treatment. But, increasingly, trials are performed with the sole objective of showing that the new treatment is equal to the current one. These equivalence trials must be planned carefully, paying attention to a number of aspects.

First, there is no equivalence from an absolute point of view, so we must take great care to keep the same conditions in both arms of the trial. In addition, we must set beforehand the sensitivity level that we will need in the study. To do this, we first define the margin of equivalence, which is the maximum difference between the two interventions that is considered acceptable from a clinical point of view. Second, we calculate the sample size needed to discriminate that difference from the point of view of statistical significance.

It is important to understand that the margin of equivalence is set by the investigator based on the clinical significance of what is being assessed. The narrower the margin, the larger the sample size needed to achieve statistical significance and reject the null hypothesis that the differences we observe are due to chance. Contrary to what it may seem at first sight, equivalence studies usually require larger samples than superiority studies.

After obtaining the results, we’ll analyze the confidence intervals of the differences in effect between the two interventions. Only those intervals that do not cross the line of no effect (one for relative risks and odds ratios, zero for mean differences) are statistically significant. If an interval lies entirely within the predefined equivalence margins, the interventions will be considered equivalent with the probability of error chosen for the confidence interval, usually 5%. If an interval falls entirely outside the equivalence range, the interventions are considered not equivalent. If it crosses either limit of the margin of equivalence, the study is inconclusive and neither proves nor rejects the equivalence of the two interventions, although we should assess the extent and position of the interval with respect to the margins of equivalence to judge its possible clinical relevance. Sometimes, results that are not statistically significant or that fall outside the limits of the equivalence range may also provide useful clinical information.

Look at the example in the figure to better understand what we have said so far. It shows the intervals of nine studies, positioned with respect to the line of no effect and the limits of equivalence. Only studies A, B, D, G and H show a statistically significant difference, because their intervals do not cross the line of no effect. A’s intervention is superior, whereas H’s is shown to be inferior. However, only in the case of D can we conclude equivalence of the two interventions, while B and G are inconclusive with regard to equivalence.

We can also conclude equivalence of the two interventions in study E. Notice that, although the difference obtained in D is statistically significant, it does not exceed the limits of equivalence: unlike E, it shows a difference from the statistical point of view, but that difference seems to have no clinical relevance.

Besides studies B and G, already mentioned, C, F and I are inconclusive regarding equivalence. However, C will probably not be inferior and F could be inferior. We could even estimate the probability of these assumptions based on the proportion of each interval that falls within the limits of equivalence.

An important aspect of equivalence studies is the method used to analyze results. We know that the intention-to-treat analysis is always preferable to the per-protocol analysis, as it keeps the advantages of randomization with respect to known and unknown variables that may influence the results. The problem is that the intention-to-treat analysis favors the null hypothesis, minimizing the differences, if any. This is an advantage in superiority studies: finding a difference reinforces the result. However, it is not so advantageous in equivalence studies. On the other hand, the per-protocol analysis would tend to increase any difference, but this is not always the case and may vary depending on what motivated the protocol violations, losses or assignment mistakes between the two arms of the trial. For this reason, it’s usually advised to analyze results in both ways and to check that the interventions are shown to be equivalent with both methods. We’ll also take into account losses during the study and analyze the information provided by the participants who don’t follow the original protocol.

A particular case of this type of trial is the non-inferiority trial. In this case, researchers are content to demonstrate that the new intervention is not worse than the comparator. All we have said about equivalence is valid here, but considering only the lower limit of the range of equivalence.
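As a minimal sketch of the one-sided reading, assume the difference is computed as new intervention minus comparator and that higher values are better (both are assumptions made for this illustration, as is the function name `is_non_inferior`). Non-inferiority then only asks whether the lower end of the confidence interval stays above the lower margin.

```python
def is_non_inferior(ci_lower, lower_margin):
    """True when the whole interval lies above the lower equivalence margin.

    Assumes difference = new - comparator, with higher values being better.
    The upper end of the interval is irrelevant for non-inferiority.
    """
    return ci_lower > lower_margin


# Hypothetical lower margin of -2 for a mean difference.
print(is_non_inferior(-1.0, -2.0))  # lower end above the margin
print(is_non_inferior(-2.5, -2.0))  # lower end crosses the margin
```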

One last thing. Studies of superiority are to demonstrate superiority and equivalence studies are to demonstrate equivalence. One of the designs is not useful to show the goal of the other. Furthermore, if a study fails to demonstrate superiority, it does not exactly mean that the two procedures are equivalent.

We have reached the end without saying anything about another characteristic type of equivalence study: bioequivalence studies. These are phase I trials conducted by pharmaceutical companies to test the equivalence of different presentations of the same drug, and they have some design specifications of their own. But that’s another story…

The other sides of the King

We’ve already talked at other times about the king of experimental designs, the randomized clinical trial, in which a population is randomly assigned to two groups: one undergoes the intervention under study and the other serves as a control group. This is the most common side of the King, the parallel clinical trial, which is ideal for most studies about treatment, for many studies about prognosis or prevention strategies and, with its peculiarities, for studies assessing diagnostic tests. But the King is very versatile and has many other sides to adapt to other situations.

If we think about it for a moment, the ideal design would be one that allows us to test in the same individual the effect of both the intervention under study and the established control (placebo or standard treatment), because the parallel trial assumes that both groups respond equally to both interventions, which is always a risk of bias that we try to minimize with randomization. If we had a time machine, we could test the intervention in everyone, note what happens, go back in time and repeat the experiment with the control intervention. Then we could compare the two effects. The problem, as the more alert among you will have already guessed, is that the time machine has not been invented yet.

But what has already been invented is the cross-over trial design, in which each subject acts as his own control.

In this type of trial, every subject is randomized to a group, the corresponding intervention is performed, a washout period takes place, and then the other intervention is carried out. Although this solution is not as elegant as the time machine, defenders of the cross-over study argue that the variability within each individual is less than the variability between individuals. Thus, the estimate may be more precise than that obtained with a parallel trial, and smaller sample sizes are usually required. However, before using this design, a number of considerations have to be taken into account. Logically, the effect of the first intervention should not cause irreversible changes or last very long, because it would alter the effect of the second. In addition, the washout period must be long enough to avoid leaving any residual effect of the first intervention.

We must also consider whether the order of the interventions could affect the final outcome, because in that case only the results of the first intervention will be reliable (sequence effect). Another problem is that, because the study lasts longer, patient characteristics may change during the study and differ between the two periods (period effect). And finally, be alert to losses during follow-up, which are more frequent in longer studies and have a greater impact on the final results of cross-over trials than of parallel trials.
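To get a feel for why the cross-over design can need fewer participants, here is a rough sketch that compares the standard error of the estimated difference in both designs. It rests on simplifying assumptions that are not in the text: the outcome has the same standard deviation `sd` under both interventions, the two measurements of each subject have correlation `rho`, and there are no sequence, period or carry-over effects. The function names are hypothetical.

```python
import math


def se_parallel(sd, n_per_group):
    """Standard error of the difference in means of two independent groups."""
    return sd * math.sqrt(2.0 / n_per_group)


def se_crossover(sd, n_subjects, rho):
    """Standard error of the mean within-subject difference.

    Var(X - Y) = 2 * sd**2 * (1 - rho) for each subject, averaged over
    n_subjects; rho is the assumed within-subject correlation.
    """
    return sd * math.sqrt(2.0 * (1.0 - rho) / n_subjects)


# Hypothetical figures: sd = 10, 20 subjects per arm (parallel, 40 in total)
# versus 20 subjects in total (cross-over).
print(se_parallel(10, 20))
print(se_crossover(10, 20, rho=0.75))
```

The higher the within-subject correlation, the smaller the cross-over standard error; with `rho = 0` the advantage reduces to needing half as many participants for the same precision.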

Imagine now that we want to test two interventions (A and B) in the same population. Can we do it with only one trial, saving costs of every kind? Yes, we can. We only have to design a factorial clinical trial. In this type of trial, each participant undergoes two consecutive randomizations: first to intervention A or placebo (P), and then to intervention B or placebo, which gives us four study groups: AB, AP, BP and PP. Obviously, the two interventions must act through independent mechanisms so that the two effects can be assessed independently.

Usually, one more mature and plausible hypothesis is studied together with one that has been less tested, ensuring that the evaluation of the second doesn’t affect the inclusion and exclusion criteria of the first. Furthermore, it’s not desirable for either of the two interventions to have many troublesome effects or be poorly tolerated, because the lack of compliance with one treatment will affect compliance with the other. In cases in which the two interventions seem not to be independent, their effects could be studied separately (AP vs. PP and BP vs. PP), but we’ll lose the advantages of the design and a larger sample size will be required.
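The two consecutive randomizations of the factorial trial can be sketched as follows. This is a minimal illustration using simple (unblocked) randomization; the function name `factorial_allocation`, the participant identifiers and the seed are all hypothetical, and real trials typically use more elaborate allocation schemes.

```python
import random

# Group labels as used in the text: AB, AP, BP and PP.
LABELS = {
    (True, True): "AB",
    (True, False): "AP",
    (False, True): "BP",
    (False, False): "PP",
}


def factorial_allocation(participant_ids, seed=0):
    """Assign each participant to one of the four groups of a 2x2 factorial trial."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    groups = {}
    for pid in participant_ids:
        gets_a = rng.random() < 0.5  # first randomization: A or placebo
        gets_b = rng.random() < 0.5  # second randomization: B or placebo
        groups[pid] = LABELS[(gets_a, gets_b)]
    return groups


allocation = factorial_allocation(range(1, 13))
for pid, group in allocation.items():
    print(pid, group)
```

With this layout, the effect of A is read by comparing everyone who received A (AB and AP) with everyone who did not (BP and PP), and likewise for B, which is where the saving in sample size comes from when the two interventions are independent.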

At other times we may be in a hurry to finish the study. Imagine a very bad disease that kills people by the dozen, for which we’re trying a new treatment. We’ll want to have it available as soon as possible (if it works, of course), so we’ll pause the trial and examine its results after the treatment has been tested in a certain number of participants: if we can already show the usefulness of the treatment, we’ll end the study. This is the design that characterizes the sequential clinical trial. Remember that in the parallel clinical trial the right thing is to calculate the sample size in advance. In this design, with a more Bayesian mentality, we establish a statistic whose value determines an explicit stopping rule, so the sample size depends on the observations already made in the study. When this statistic reaches the preset value, we are confident enough to reject the null hypothesis and end the study. The problem is that each stop and analysis increases the probability of rejecting the null hypothesis when it is true (type 1 error), so it’s not recommended to perform many interim analyses. Moreover, the final analysis of results is more complex because we have to take the interim analyses into account. This type of trial is very helpful with interventions whose impact is very quick, so it is often seen in studies of dose titration of opioids, hypnotics, and poisons of that kind.
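A rough sketch of how repeated looks inflate the type 1 error is shown below. It assumes, for simplicity, that each analysis behaves like an independent test at the same significance level. That assumption overstates the inflation, because interim analyses of accumulating data are correlated, so the figures should be read as an illustration and not as the actual error of a sequential trial. The Bonferroni-style division is likewise only a simple conservative correction; it is not the stopping boundary that dedicated group-sequential methods (such as the Pocock or O’Brien-Fleming boundaries) would give. Function names are hypothetical.

```python
def familywise_error_independent(alpha, looks):
    """Probability of at least one false positive over `looks` independent tests."""
    return 1.0 - (1.0 - alpha) ** looks


def bonferroni_alpha(alpha, looks):
    """Per-look significance level that keeps the overall error at or below alpha."""
    return alpha / looks


for looks in (1, 2, 5, 10):
    print(
        looks,
        round(familywise_error_independent(0.05, looks), 3),
        round(bonferroni_alpha(0.05, looks), 4),
    )
```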

There are other occasions when individual randomization makes no sense. Suppose we have taught the physicians of a health center a new technique to better inform their patients and want to compare it with the old one. We cannot tell the same physician to inform some patients in one way and other patients in another, since there would be a strong possibility that the two interventions contaminate each other. It would be more logical to teach the technique in one group of health centers, not teach it in another group, and compare the results. Here we randomize health centers to train or not train their doctors. This is the cluster allocation design. The problem with this design is that we have little assurance that the participants behave independently of one another, so the required sample size can increase greatly if there is great variability among groups and little within each group. In addition, we must perform an aggregate analysis of results, because if we analyze them individually the confidence intervals will be falsely narrowed and we may find false statistical significance. The usual practice is to calculate a weighted statistic for each group and make the final comparisons with it.
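The increase in sample size is commonly quantified with the design effect, which is not named in the text and is brought in here as an illustration. It depends on the average cluster size and on the intracluster correlation coefficient (ICC), which is high precisely when there is great variability among clusters and little within each one. The sketch below assumes clusters of equal size; function names and figures are hypothetical.

```python
import math


def design_effect(cluster_size, icc):
    """Inflation factor for the sample size: 1 + (m - 1) * ICC."""
    return 1.0 + (cluster_size - 1) * icc


def inflated_sample_size(n_individual, cluster_size, icc):
    """Participants needed under cluster allocation.

    n_individual is the size that individual randomization would need.
    The product is rounded to 9 decimals before taking the ceiling so that
    floating-point noise does not add a spurious participant.
    """
    return math.ceil(round(n_individual * design_effect(cluster_size, icc), 9))


# Hypothetical example: 100 participants under individual randomization,
# clusters of 5 patients, ICC of 0.25.
print(design_effect(5, 0.25))
print(inflated_sample_size(100, 5, 0.25))
```

With an ICC of zero the design effect is one and nothing is lost; as the ICC grows, each additional patient within the same center adds less and less information.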

The last of the series we are going to deal with is the community trial, in which the intervention is applied to populations. As it’s performed on populations under real conditions, it has high external validity and often allows us to recommend cost-effective measures based on its results. The problem is that it is often difficult to establish a control group, it may be more difficult to determine the sample size needed, and it is more complex to make causal inferences from its results. It is the typical design for evaluating public health measures such as water fluoridation, vaccinations, etc.

As you can see, the King has many sides. It also has lower-rank relatives, which are no less worthy: a whole family of quasi-experimental studies consisting of trials that are not randomized, not controlled, or neither. But that’s another story…

To what do you attribute it?

It seems like only yesterday. I began my adventures at the hospital and had my first contacts with The Patient. And, by the way, I didn’t know much about diseases, but I knew without thinking what the three questions were with which any good clinical history began: what is bothering you? How long has it been going on? And to what do you attribute it?

The fact is that the need to know the why of things is inherent to human nature and, of course, is of great importance in medicine. Everyone is mad about establishing cause and effect relations; sometimes one does it rather loosely and comes to the conclusion that the culprit of his summer cold is the guy at the supermarket, who has set the air conditioning at maximum power. This is why studies on etiology must be conducted and assessed with scientific rigour, all the more so because when we talk about etiology we also refer to harm, including that derived from our own actions (what educated people call iatrogenic).

This is why studies on etiology and harm have similar designs. The clinical trial is the ideal choice and we can use it, for example, to know if a treatment is the cause of the patient’s recovery. But when we study risk factors or harmful exposures, the ethical principle of nonmaleficence prevents us from randomizing exposures, so we have to resort to observational studies such as cohort studies or case-control studies, although the level of evidence they provide will be lower than that of experimental studies.

To critically appraise a paper on etiology / harm, we’ll resort to our well-known pillars: validity, relevance and applicability.

First, we’ll focus on the VALIDITY or scientific rigour of the work, which should answer the question of whether the factor or intervention studied was the cause of the adverse effect or disease observed.

As always, we’ll assess a series of primary validity criteria. If these are not fulfilled, we’ll leave the paper and devote ourselves to something more profitable. The first is to determine whether the groups compared were similar with regard to important factors other than the exposure studied. Randomization in clinical trials ensures that the groups are homogeneous, but we cannot count on it in the case of observational studies. The homogeneity of the two cohorts is essential and the study is not valid without it. One can always argue that the differences between the two groups have been stratified, or that a multivariate analysis has been made to control for the effect of known confounders, but what about the unknown ones? The same applies to case-control studies, which are much more sensitive to bias and confounding.

Have exposure and effect been assessed in the same way in all groups? In clinical trials and cohort studies we have to check that the effect had the same likelihood of appearing and of being detected in the two groups. Moreover, in case-control studies it is very important to assess previous exposure properly, so we must investigate whether there is potential bias in data collection, such as recall bias (patients often remember symptoms better than healthy people do). Finally, we must consider whether follow-up has been long enough and complete. Losses during the study, common in observational designs, can bias the results.

If we have answered yes to all three questions, we’ll turn to the secondary validity criteria. The study’s results have to be evaluated to determine whether the association between exposure and effect provides reasonable evidence of causality. One useful tool is Hill’s criteria, named after the gentleman who suggested using a series of items to try to distinguish the causal or non-causal nature of an association. These criteria are: a) strength of association, represented by the risk ratio between exposure and effect, which we’ll consider shortly; b) consistency, which is reproducibility in different populations or situations; c) specificity, which means that a cause produces a single effect and not multiple ones; d) temporality: it’s essential that the cause precedes the effect; e) biological gradient: the more intense the cause, the more intense the effect; f) plausibility: the relationship has to be logical according to our biological knowledge; g) coherence: the relationship should not be in conflict with other knowledge about the disease or effect; h) experimental evidence, often difficult to obtain in humans for ethical reasons; and finally, i) analogy to other known situations. Although these criteria are quite vintage and some of them may be irrelevant (experimental evidence or analogy), they may serve as guidance. The criterion of temporality would be a necessary one and would be well complemented by biological gradient, plausibility and coherence.

Another important aspect is to consider whether, apart from the intervention under study, both groups were treated similarly. Studies of this type, which lack double blinding, are where the risk of bias due to co-interventions is greatest, especially if these are treatments with a much greater effect than the exposure under study.

Regarding the RELEVANCE of the results, we must consider the magnitude and precision of the association between exposure and effect.

What was the strength of the association? The most common measure of association is the risk ratio (RR), which can be used in trials and cohort studies. However, in case-control studies we don’t know the incidence of the effect (the effect has already occurred when the study is conducted), so we use the odds ratio (OR). As we know, the interpretation of the two parameters is similar. Even their values are similar when the frequency of the effect is very low. However, the greater the magnitude or frequency of the effect, the more RR and OR differ, with the peculiarity that the OR always lies further from one than the RR: it is larger than the RR when both are greater than 1 and smaller when both are less than 1, so it tends to exaggerate the strength of the association. Anyway, these vagaries of the OR will only exceptionally modify the qualitative interpretation of the results.
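The behaviour of the two measures is easy to check with a small sketch. The 2x2 tables below are made up for illustration, and the function names are hypothetical: `a` and `b` are the exposed with and without the effect, `c` and `d` the unexposed with and without the effect.

```python
def risk_ratio(a, b, c, d):
    """RR = risk in the exposed / risk in the unexposed."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """OR = odds in the exposed / odds in the unexposed = (a * d) / (b * c)."""
    return (a * d) / (b * c)


# Rare effect: 2% in the exposed versus 1% in the unexposed.
print(risk_ratio(2, 98, 1, 99), odds_ratio(2, 98, 1, 99))

# Frequent effect: 40% in the exposed versus 20% in the unexposed.
print(risk_ratio(40, 60, 20, 80), odds_ratio(40, 60, 20, 80))
```

In both tables the risk in the exposed doubles that of the unexposed, but the OR stays close to the RR only when the effect is rare.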

Keep in mind that a result is statistically significant for any value of OR or RR whose confidence interval does not include one, but with observational studies we have to be a little more demanding. Thus, we’d like to see values greater than or equal to three for RR in cohort studies and greater than or equal to four for OR in case-control studies.

Another useful parameter (in trials and cohort studies) is the risk difference or incidence difference, which is a fancy name for our well-known absolute risk reduction (ARR). It allows us to calculate the NNT (or NNH, number needed to harm), the parameter that best quantifies the clinical significance of the association. Also, similar to the relative risk reduction (RRR), we have the attributable fraction in the exposed, which is the percentage of the risk observed in the exposed that is due to the exposure.
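These impact measures can be computed from the risks in the two groups, as in the sketch below. The figures are invented for the example and the function names are hypothetical; the number needed is rounded up to the next whole person, which is the usual convention.

```python
import math


def risk_difference(risk_exposed, risk_unexposed):
    """Difference in risks (incidence difference) between the two groups."""
    return risk_exposed - risk_unexposed


def number_needed(risk_exposed, risk_unexposed):
    """NNT or NNH: inverse of the absolute risk difference, rounded up.

    The quotient is rounded to 9 decimals first so that floating-point
    noise does not push the ceiling one unit too high.
    """
    diff = abs(risk_exposed - risk_unexposed)
    if diff == 0:
        raise ValueError("no difference in risk: the number needed is undefined")
    return math.ceil(round(1.0 / diff, 9))


def attributable_fraction_exposed(risk_exposed, risk_unexposed):
    """Share of the risk in the exposed that is due to the exposure."""
    return (risk_exposed - risk_unexposed) / risk_exposed


# Hypothetical harmful exposure: risk of 40% in the exposed, 20% in the unexposed.
print(risk_difference(0.4, 0.2))
print(number_needed(0.4, 0.2))
print(attributable_fraction_exposed(0.4, 0.2))
```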

And what is the precision of the results? As we know, we’ll use our beloved confidence intervals, which serve to determine the precision of the parameter estimate in the population. It is always useful to have all these parameters, which must be included in the study or be calculable from the data provided by the authors.
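When the authors give the 2x2 table but not the interval, it can be approximated by hand. The sketch below uses the usual large-sample approximation on the logarithmic scale for the risk ratio; this is one common method among several and may not be the one used in any given paper. The counts are invented and the function name is hypothetical.

```python
import math


def risk_ratio_ci(a, b, c, d, z=1.96):
    """Risk ratio with an approximate confidence interval (log method).

    a, b -- exposed with / without the effect
    c, d -- unexposed with / without the effect
    z    -- normal quantile; 1.96 gives an approximate 95% interval
    """
    rr = (a / (a + b)) / (c / (c + d))
    # Standard error of ln(RR) under the large-sample approximation.
    se_log = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper


# The same risks (40% versus 20%) with two different sample sizes.
print(risk_ratio_ci(40, 60, 20, 80))
print(risk_ratio_ci(400, 600, 200, 800))
```

The point estimate is the same in both calls, but the interval narrows as the sample grows, which is exactly what we mean by precision.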

Finally, we’ll assess the APPLICABILITY of the results to our clinical practice.

Are the results applicable to our patients? Check whether there are differences that advise against extrapolating the results of the work to our environment. Also, consider the magnitude of the risk in our patients based on the results of the study and their characteristics. And finally, with all this information in mind, we must think about our working conditions, the choices we have and the patient’s preferences to decide whether or not to avoid the studied exposure. For example, if the magnitude of the risk is high and we have an effective alternative, the decision will be clear, but things are not always so simple.

As always, I advise you to use the resources available on the Internet, such as CASP’s, both the design-specific templates and the calculator to assess the relevance of the results.

Before concluding, let me clarify one thing. Although we’ve said we use RR in cohort studies and clinical trials and we use OR in case-control studies, actually we can use OR in any type of study (not so for RR, for which we must know the incidence of the effect). The problem is that ORs are somewhat less accurate, so we prefer to use RR and NNT whenever possible. However, OR is increasingly popular for another reason, its use in logistic regression models, which allow us to obtain estimates adjusted for confounding variables. But that’s another story…