The King under review

Critical appraisal of treatment studies

We all know that the randomized clinical trial is the king of interventional study designs. It is the type of epidemiological study that allows the best control of systematic errors or biases, since the researcher controls the study variables and the participants are randomly assigned to the interventions being compared.

In this way, if two homogeneous groups that differ only in the intervention present some difference of interest during the follow-up, we can affirm with some confidence that this difference is due to the intervention, the only thing that the two groups do not have in common. For this reason, the clinical trial is the preferred design to answer clinical questions about intervention or treatment, although we will always have to be prudent with the evidence generated by a single clinical trial, no matter how well performed. When we perform a systematic review of randomized clinical trials on the same intervention and combine them in a meta-analysis, the answers we get will be more reliable than those obtained from a single study. That’s why some people say that the ideal design for answering treatment questions is not the clinical trial, but the meta-analysis of clinical trials.

In any case, as systematic reviews assess their primary studies individually and as it is more usual to find individual trials and not systematic reviews, it is advisable to know how to make a good critical appraisal in order to draw conclusions. In effect, we cannot relax when we see that an article corresponds to a clinical trial and take its content for granted. A clinical trial can also contain its traps and tricks, so, as with any other type of design, it will be a good practice to make a critical reading of it, based on our usual three pillars: validity, importance and applicability.

Critical appraisal of treatment studies

As always, when studying scientific rigor or VALIDITY (internal validity), we will first look at a series of essential primary criteria. If these are not met, it is better not to waste time with the trial and try to find another more profitable one.

Is there a clearly defined clinical question? From the outset, the trial must be designed to answer a structured clinical question about treatment, motivated by one of our multiple knowledge gaps. A working hypothesis should be proposed with its corresponding null and alternative hypotheses, if possible on a topic that is relevant from the clinical point of view. It is preferable that the study try to answer only one question. When there are several questions, the trial may become excessively complicated and end up not answering any of them completely and properly.

Was the assignment done randomly? As we have already said, to be able to affirm that the differences between the groups are due to the intervention, the groups must be homogeneous. This is achieved by assigning patients randomly, the only way to balance the known confounding variables and, more importantly, also those that we do not know. If the groups were different and we attributed the difference only to the intervention, we could incur confounding bias. The trial should contain the usual and essential table 1 with the frequency of the demographic and confounding variables in both samples, so that we can be sure that the groups are homogeneous. A frequent error is to look for differences between the two groups and evaluate them according to their p value, when we know that p does not measure homogeneity. If we have distributed the participants at random, any difference we observe will necessarily be due to chance (we will not need a p value to know that). The sample size is not designed to discriminate between demographic variables, so a non-significant p may simply indicate that the sample is too small to reach statistical significance. On the other hand, any minimal difference can reach statistical significance if the sample is large enough. So forget about the p: if there is any difference, what you have to do is assess whether it has sufficient clinical relevance to have influenced the results or, more elegantly, control for the unbalanced covariates during the analysis. Fortunately, it is increasingly rare to find baseline tables that compare the intervention and control groups with p values.

But it is not enough for the study to be randomized; we must also consider whether the randomization sequence was generated correctly. The method used must ensure that every participant has the same probability of being assigned to each group, so random number tables or computer-generated sequences are preferred. The allocation must be concealed, so that it is not possible to know which group the next participant will belong to. That is why people like centralized systems by telephone or through the Internet. And here is something very curious: it is well known that simple randomization produces samples of different sizes, especially if the samples are small, which is why randomization by blocks balanced in size is sometimes used. And I ask you, how many studies have you read that had the same number of participants in the two arms and claimed to be randomized? Be suspicious if you see equal groups, especially if they are small, and do not be fooled: you can always use one of the multiple binomial probability calculators available on the Internet to find out the probability that chance generates the groups that the authors present (we are always speaking of simple randomization, not of blocks, clusters, minimization or other techniques). You will be surprised by what you find.
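The binomial check suggested above can be sketched in a few lines of Python. This is only an illustration under the assumption of simple 1:1 randomization (each participant has a probability of 0.5 of going to either arm); the function names are made up for this example.

```python
from math import comb


def prob_equal_split(n_total: int) -> float:
    """Probability that simple 1:1 randomization of n_total participants
    produces two arms of exactly the same size."""
    if n_total % 2:
        return 0.0  # an odd total can never split evenly
    return comb(n_total, n_total // 2) / 2 ** n_total


def prob_at_least_as_balanced(n_total: int, n_arm_a: int) -> float:
    """Probability that chance alone gives a split at least as balanced
    as the one reported (n_arm_a participants in one arm)."""
    reported_gap = abs(n_arm_a - n_total / 2)
    favourable = sum(
        comb(n_total, k)
        for k in range(n_total + 1)
        if abs(k - n_total / 2) <= reported_gap
    )
    return favourable / 2 ** n_total


if __name__ == "__main__":
    for n in (20, 40, 100, 200):
        print(n, round(prob_equal_split(n), 3))
```

The larger the trial, the less likely a perfectly even split becomes, so two identical arms with no mention of blocks deserve a second look.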

It is also important that the follow-up has been long and complete enough, so that the study lasts long enough to be able to observe the outcome variable and that every participant who enters the study is taken into account at the end. As a general rule, if the losses exceed 20%, it is admitted that the internal validity of the study may be compromised.

We will always have to analyze the nature of losses during follow-up, especially if they are high. We must try to determine whether the losses are random or whether they are related to any specific variable (which would be a bad sign) and estimate what effect they may have on the results of the trial. The most usual approach is to adopt the so-called worst-case scenario: it is assumed that all the losses of the control group did well and all those in the intervention group did badly, and the analysis is repeated to check whether the conclusions change, in which case the validity of the study would be seriously compromised. The last important aspect is to consider whether patients who have not received the assigned treatment (there is always someone who gets confused and messes up) have been analyzed according to the intention to treat, since it is the only way to preserve all the benefits obtained with randomization. Everything that happens after randomization (such as a change of group) can influence the probability that the subject experiences the effect we are studying, so it is important to respect this intention-to-treat analysis and analyze each participant in the group to which he was initially assigned.
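As a rough illustration of the worst-case scenario, the following Python sketch recomputes the risk in each arm after assuming that every participant lost from the intervention arm had the unfavorable event and that none of those lost from the control arm did. The function name and the numbers are invented for the example, and the outcome is assumed to be a dichotomous unfavorable event.

```python
def worst_case_risks(events_t, completers_t, lost_t,
                     events_c, completers_c, lost_c):
    """Return (risk in intervention arm, risk in control arm) under the
    worst-case assumption for the intervention."""
    # All losses in the intervention arm are counted as events.
    risk_t = (events_t + lost_t) / (completers_t + lost_t)
    # All losses in the control arm are counted as event-free.
    risk_c = events_c / (completers_c + lost_c)
    return risk_t, risk_c


if __name__ == "__main__":
    # Hypothetical trial: 100 randomized per arm, 10 lost in each.
    observed = (10 / 90, 20 / 90)
    worst = worst_case_risks(10, 90, 10, 20, 90, 10)
    print("observed risks:", observed)
    print("worst-case risks:", worst)
```

In this invented example the apparent benefit disappears under the worst-case assumption, which is exactly the situation in which the validity of the trial would be in doubt.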

Once these primary criteria have been verified, we will look at three secondary criteria that influence internal validity. It will be necessary to verify that the groups were similar at the beginning of the study (we have already talked about the table with the data of the two groups), that masking was carried out appropriately as a way of controlling biases and that the two groups were managed and monitored in a similar way except, of course, for the intervention under study. We know that masking or blinding allows us to minimize the risk of information bias, which is why researchers and participants are usually unaware of which group each participant is assigned to, which is known as double blind. Sometimes, given the nature of the intervention (think of one group that undergoes surgery and another that does not), it will be impossible to mask researchers and participants, but we can always give masked data to the person who analyzes the results (the so-called blind evaluator), which mitigates this inconvenience.

To summarize this section on the validity of the trial, we can say that we will have to check that there is a clear definition of the study population, the intervention and the outcome of interest, that the randomization has been done properly, that an attempt has been made to control information biases through masking, that there has been an adequate follow-up with control of the losses and that the analysis has been correct (intention-to-treat analysis and control of covariates not balanced by randomization).

A famous Colombian: Alejandro Jadad Bechara

A very simple tool that can also help us assess the internal validity of a clinical trial is the Jadad scale, also called the Oxford quality scoring system. Jadad, a Colombian doctor, devised a scoring system with 7 questions. First, 5 questions whose affirmative answer adds 1 point:

  1. Is the study described as randomized?
  2. Is the method used to generate the randomization sequence described and is it adequate?
  3. Is the study described as double blind?
  4. Is the masking method described and is it adequate?
  5. Is there a description of the losses during follow up?

Finally, two questions whose negative answer subtracts 1 point:

  1. Is the method used to generate the randomization sequence adequate?
  2. Is the masking method appropriate?

As you can see, the Jadad scale assesses the key points that we have already mentioned: randomization, masking and follow-up. A trial is considered methodologically rigorous if it scores 5 points. If the study scores fewer than 3 points, we had better use it to wrap the sandwich.
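The scoring can be sketched as a small Python function. This is one possible reading of the questions above, not an official implementation: a method that is described and adequate adds a point, a method that is described but inadequate subtracts one, and a method that is not described at all (`None`) does neither. Clamping the result at zero is also an assumption made here.

```python
from typing import Optional


def jadad_score(randomized: bool,
                randomization_adequate: Optional[bool],
                double_blind: bool,
                masking_adequate: Optional[bool],
                losses_described: bool) -> int:
    """Jadad-style score. For the two *_adequate arguments:
    True = described and adequate, False = described but inadequate,
    None = method not described."""
    score = int(randomized) + int(double_blind) + int(losses_described)
    for adequate in (randomization_adequate, masking_adequate):
        if adequate is True:
            score += 1
        elif adequate is False:
            score -= 1
    return max(score, 0)  # assumption: the scale does not go below zero


if __name__ == "__main__":
    print(jadad_score(True, True, True, True, True))      # best case
    print(jadad_score(True, None, False, None, True))     # open-label trial
```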

We will now proceed to consider the results of the study to gauge its clinical RELEVANCE. It will be necessary to look at the variables measured to see whether the trial adequately expresses the magnitude and precision of the results. It is important, once again, not to settle for being inundated with multiple p values full of zeros. Remember that the p value only indicates how likely it would be to find a difference at least as large as the observed one if chance alone were at work (or, to put it simply, it is related to the risk of making a type 1 error), and that statistical significance does not have to be synonymous with clinical relevance.

In the case of continuous variables such as survival time, weight, blood pressure, etc., it is usual to express the magnitude of the results as a difference in means or medians, depending on which measure of central tendency is most appropriate. However, for dichotomous variables (alive or dead, healthy or sick, etc.) the relative risk, its relative and absolute reduction and the number needed to treat (NNT) will be used. Of all of them, the one that best expresses clinical efficiency is the NNT. Any trial worthy of our attention must provide this information or, failing that, the data we need to calculate it.
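When a trial reports only the raw counts, these measures can be worked out by hand. The Python sketch below uses the standard definitions (risk ratio, absolute and relative risk reduction, and NNT as the inverse of the absolute risk reduction, rounded up); the function name and the counts are invented for the example, and the outcome is assumed to be an unfavorable event.

```python
from fractions import Fraction
from math import ceil


def effect_measures(events_treated, n_treated, events_control, n_control):
    """Effect measures for a dichotomous, unfavorable outcome.
    Fractions are used so that rounding the NNT up is exact."""
    risk_t = Fraction(events_treated, n_treated)
    risk_c = Fraction(events_control, n_control)
    arr = risk_c - risk_t  # absolute risk reduction
    return {
        "RR": float(risk_t / risk_c),   # relative risk (risk ratio)
        "RRR": float(arr / risk_c),     # relative risk reduction
        "ARR": float(arr),
        # NNT is only meaningful when the treatment reduces the risk.
        "NNT": ceil(1 / arr) if arr > 0 else None,
    }


if __name__ == "__main__":
    # Hypothetical trial: 15/100 events with treatment, 25/100 with control.
    print(effect_measures(15, 100, 25, 100))
```

The sketch assumes at least one event in the control group; otherwise the relative measures are undefined.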

But to get a more realistic estimate of the results in the population, we need to know the precision of the study, and nothing is easier than resorting to confidence intervals. These intervals, in addition to precision, also inform us of statistical significance. The result will be statistically significant if the confidence interval of the risk ratio does not include the value one or, for a difference in means, the value zero. If the authors do not provide them, we can use a calculator to obtain them, such as those available on the CASP website.
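If no calculator is at hand, an approximate interval for the risk ratio can be obtained with the usual logarithmic method, in which the standard error of ln(RR) is sqrt(1/a - 1/n1 + 1/c - 1/n2). The following Python sketch is a minimal version of that calculation; the function name and the counts are invented, and the approximation is not appropriate for very small counts or zero events.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def risk_ratio_ci(events_t, n_t, events_c, n_c, level=0.95):
    """Risk ratio with an approximate confidence interval (log method)."""
    rr = (events_t / n_t) / (events_c / n_c)
    se_log_rr = sqrt(1 / events_t - 1 / n_t + 1 / events_c - 1 / n_c)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    lower = exp(log(rr) - z * se_log_rr)
    upper = exp(log(rr) + z * se_log_rr)
    return rr, lower, upper


if __name__ == "__main__":
    # Hypothetical trial: 15/100 events with treatment, 25/100 with control.
    rr, lower, upper = risk_ratio_ci(15, 100, 25, 100)
    print(f"RR = {rr:.2f} (95% CI {lower:.2f} to {upper:.2f})")
    print("significant" if upper < 1 or lower > 1 else "not significant")
```

With these invented counts the interval crosses the value one, so a risk ratio that looks impressive at first sight would not reach statistical significance.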

A good way to sort the study of the clinical importance of a trial is to structure it in these four aspects: Quantitative assessment (measures of effect and its precision), Qualitative assessment (relevance from the clinical point of view), Comparative assessment (see if the results are consistent with those of other previous studies) and Cost-benefit assessment (this point would link to the next section of the critical appraisal that has to do with the applicability of the results of the trial).

To finish the critical reading of a treatment article we will value its APPLICABILITY (also called external validity), for which we will have to ask ourselves if the results can be generalized to our patients or, in other words, if there is any difference between our patients and those of the study that prevents the generalization of the results. It must be taken into account in this regard that the stricter the inclusion criteria of a study, the more difficult it will be to generalize its results, thereby compromising its external validity.

But, in addition, we must consider whether all clinically important outcomes have been taken into account, including side effects and undesirable effects. The outcome variable measured must be important for the investigator and for the patient. Do not forget that demonstrating that the intervention is effective does not necessarily mean that it is beneficial for our patients. We must also assess the harmful or annoying effects and study the benefits-costs-risks balance, as well as the difficulties that may exist to apply the treatment in our environment, the patient’s preferences, etc.

As is easy to understand, a study can have great methodological validity and clinically important results and still not be applicable to our patients, either because our patients are different from those of the study, because it does not fit their preferences or because it is not feasible in our environment. However, the opposite usually does not happen: if the validity is poor or the results are unimportant, we will hardly consider applying the conclusions of the study to our patients.

We’re leaving…

To finish, I recommend that you use some of the tools available for critical appraisal, such as the CASP templates, or a checklist, such as CONSORT, so as not to leave any of these points unconsidered. Everything we have talked about concerns randomized controlled clinical trials, but what happens with nonrandomized trials or other kinds of quasi-experimental studies? Well, for those we follow another set of rules, such as those of the TREND statement. But that is another story…

Regular customers

Re-randomization in clinical trials

We saw in a previous post that sample size is very important. The sample should be the right size, neither more nor less. If it is too large, we are wasting resources, something to keep in mind in modern times. If we use a small sample we will save money, but lose statistical power. This means that there may be a difference in effect between the two interventions tested in a clinical trial and we may not be able to detect it, in which case we will be throwing good money away all the same.
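To give an idea of how sample size and power are tied together, here is a Python sketch of the classical normal-approximation formula for comparing two proportions, n per arm = (z for alpha + z for power)² × [p1(1 - p1) + p2(1 - p2)] / (p1 - p2)². The function name and the example proportions are invented; real trials usually rely on dedicated software and allow for losses.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(p_control, p_treated, alpha=0.05, power=0.80):
    """Approximate number of participants per arm needed to detect the
    difference between two proportions with a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    n = (z_alpha + z_beta) ** 2 * variance / (p_control - p_treated) ** 2
    return ceil(n)


if __name__ == "__main__":
    # Hypothetical event rates: 25% with control, 15% with treatment.
    for power in (0.80, 0.90):
        print(power, sample_size_per_arm(0.25, 0.15, power=power))
```

Asking for more power, or for the ability to detect a smaller difference, pushes the required sample size up quickly.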

When sample size is out of our reach

The problem is that sometimes it can be very difficult to get an adequate sample size, and excessively long periods of time may be needed to reach the desired size. Well, for these cases, someone with a commercial mentality has devised a method that consists of including the same participant many times in the trial. It’s like in bars: it is always easier to have a regular clientele that comes to the establishment many times than to attract a large crowd of new customers (which is also desirable).

There are times when the same patient needs the same treatment on repeated occasions. Consider, for example, asthmatics who need bronchodilator treatment repeatedly, or couples undergoing in vitro fertilization, which may require several cycles to succeed.

Re-randomization in clinical trials

Although the usual standard in clinical trials is randomizing participants, in these cases we can randomize each participant independently whenever he needs treatment. For example, if we are testing two bronchodilators, we can randomize the same subject to one of two every time he has an asthma attack and needs treatment. This procedure is known as re-randomization and consists, as we have seen, in randomizing situations rather than participants.

This trick is quite correct from a methodological point of view, provided that certain conditions discussed below are met.

The participant enters the trial the first time in the usual way, being randomly assigned to one of the two arms of the trial. Subsequently he is followed up during the appropriate period and the results of the study variables are collected. Once the follow-up period is finished, if the patient requires new treatment and continues to meet the inclusion criteria of the trial, he is randomized again, and this cycle is repeated as many times as necessary to achieve the desired sample size.

This way of recruiting situations instead of participants reaches the sample size with a smaller number of participants. For example, if we need a sample of 500, we can randomize 500 participants once, 250 twice, or 200 once and 50 six times. The important thing is that the number of randomizations of each participant cannot be specified beforehand, but must depend on the need for treatment each time.

Three conditions

To apply this method correctly you need to meet three requirements. First, patients can only be re-randomized when they have fully completed the follow-up period of the previous procedure. This is logical because, otherwise, the effects of the two treatments would overlap and a biased measure of the effect of the intervention would be obtained.

Second, each new randomization in the same participant should be done independently of the others. In other words, the probability of assignment to each intervention should not depend on previous assignments. Some authors are tempted to use reallocations to balance the two groups, but this can bias comparisons between the two groups.

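Third, the benefit of each intervention should be the same every time the participant is randomized, that is, the treatment effect should not change from one randomization to the next. Otherwise, we get a biased estimate of the treatment effect.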
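The first two conditions can be sketched in Python as follows. This is only an illustration with invented names and data: each treatment episode is randomized with an independent coin flip, and a participant is only re-randomized once the follow-up of his previous episode is over. The third condition (a constant treatment effect) cannot be enforced by the allocation procedure and is not modeled here.

```python
import random


def rerandomize(episodes, followup_days, rng):
    """episodes: list of (participant_id, day_treatment_is_needed),
    sorted by day. Returns a list of (participant_id, day, arm)."""
    last_entry = {}   # day of the last randomization of each participant
    allocations = []
    for participant, day in episodes:
        previous = last_entry.get(participant)
        if previous is not None and day < previous + followup_days:
            continue  # previous follow-up not finished: not eligible yet
        # Independent allocation: earlier assignments are never consulted.
        arm = rng.choice(("A", "B"))
        allocations.append((participant, day, arm))
        last_entry[participant] = day
    return allocations


if __name__ == "__main__":
    rng = random.Random(2024)  # fixed seed only to make the demo repeatable
    episodes = [(1, 0), (2, 3), (1, 10), (1, 40), (2, 50)]
    for row in rerandomize(episodes, followup_days=30, rng=rng):
        print(row)
```

In this invented example the second episode of participant 1 (day 10) is left out because it falls inside the follow-up of the first one, while the episode on day 40 is randomized again.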

We see, then, that this is a good way to reach the sample size we want more easily. The problem with this type of design is that the analysis of the results is more complex than that of a conventional clinical trial.

Basically, without going into details, there are two methods of analysis of results. The simplest is the unadjusted analysis, in which all observations, even if they belong to the same participant, are treated as independent. This approach, which is usually expressed as a linear regression model, does not take into account the effect that participants can have on the results.

The other method is adjusted for the effect of patients, which takes into account the correlation between observations of the same participants.

We’re leaving…

And here we leave it for today. We have said nothing about the mathematical treatment of the adjusted method to avoid burning the reader’s neurons. Suffice it to say that there are several approaches, based on generalized linear models and mixed-effects models. But that is another story…

Steady… ready…

Don’t! Not so fast. Before you rush out there you have to be sure that everything is well prepared. It is difficult to conceive that anyone would run a marathon without preparing beforehand, without sufficient physical fitness and nutrition. Well, actually it is difficult to conceive what it is like to run 42 kilometers nonstop, so let’s use a more credible example.

Imagine that it is bedtime and we are as worn out as if we had run a marathon. This situation is already more credible for most of us. Anyone in their right mind knows that it is advisable to drink water and go to the bathroom before going to bed. The price for not making these preparations will be getting up in the middle of the night, stumbling and shivering, to satisfy needs that could have been foreseen and avoided (except prostate imperatives, of course).

Now imagine that we want to conduct a clinical trial. We plan the study, we choose our population, we randomize participants perfectly, and we give the intervention group our new drug to combat the chronic unbearable fildulastrosis that we want to study and wham!! Most of them do not tolerate the drug and withdraw from the study early. We will have wasted money and time, and it is difficult to know which of the two is the most precious resource in these times.

Could we have avoided this problem? Poor tolerance to the drug is a fact that we cannot help but, because there are people who do tolerate it, we could have used a little trick: give the drug to everyone before randomizing, withdraw from the study those who do not tolerate it and then randomize only those who can endure the drug until the end of the study. This is what is called using a run-in period, also known as a run-in phase (some call it open-label phase, but I think that this term is not always equivalent to run-in period).

In general, during the run-in period participants are observed before being assigned to a study group, to verify that they meet the inclusion criteria for the intervention, that they are good compliers, that they tolerate the intervention, etc. By making sure that they meet the prerequisites for inclusion in the study, we obtain a more valid and consistent baseline observation before random assignment.

At other times the intervention itself is given during the run-in and the response to it is used as part of the inclusion criteria, so that individuals are selected or excluded according to their response to treatment.

You see how a run-in period can rid us of the bad compliers, of those in poor health who can give us unpleasant surprises during the trial and of those who cannot tolerate the drug in question, so that we can better focus on determining the efficacy of treatment, since most of the losses during follow-up will be for reasons not related to the intervention.

Anyway, we must take a number of precautions. We must be careful in choosing the initial sample, which may need to be larger than the one required without a run-in. The baseline situation of the participants is very important in order to stratify or to make a more efficient statistical analysis. In addition, randomization must be done as late as possible and as close as possible to the intervention, although it is not uncommon to find studies in which participants were randomized before the run-in period. Finally, to interpret the results of a study with a run-in period, we must take into account the differences between the baseline characteristics of the participants who were excluded during the period and those who were ultimately assigned to the study groups.

But it is not all advantages. Although excluding bad compliers or those with more adverse effects allows us to increase the power of the study and better estimate the effect of the intervention, the applicability or generalizability of the results will be compromised because the results come from a more restricted sample of participants. To say it in a more elegant way, we have to pay for the increased internal validity with a reduction of the external validity of the study.

To end this post, just a word about something similar to the run-in period. Imagine that we want to test a new proton pump inhibitor in patients with ulcers. As all of them are already receiving some treatment, it may distort the results of our intervention. The trick here is to tell everyone to stop the medication for a while before randomization and allocation to the arms of the study. But do not confuse this with the run-in period. This is what is known as a washout period. But that is another story…

More than one rooster per pen

Factorial clinical trial

The clinical trial is the king of epidemiological designs. But it is also the most expensive to perform. And, in our times, this is an important obstacle to launching a trial.

Usually each trial evaluates an intervention in one group versus a control group that receives no intervention or a placebo. But what if we could test several interventions in the same trial? The costs would probably be lower than testing each intervention separately, each in its conventional parallel trial. Well, this is possible using a design called the factorial trial.


Factorial clinical trial

The simplest form is the 2×2 factorial trial, in which two different interventions are tried on the same sample of participants. The trick is to randomize participants several times to form more than the two groups of a parallel trial. Suppose we want to do a factorial trial with treatments A and B (let’s not overcomplicate things by thinking up an example). First, we randomly assign participants to either receive or not receive treatment A. Then we perform another randomization to receive or not receive treatment B. Thus, the sample of N participants is divided into four groups, as shown in the attached table: N/4 receive only A, N/4 receive only B, N/4 receive A and B simultaneously, and N/4 remain untreated (control group).

This design is the basic 2×2 factorial trial. If we focus on the table, the analysis of the marginal values in the rows allows us to compare the effect of receiving A with that of not receiving it. For its part, the marginal analysis of the columns allows us to compare the effect of receiving B with that of not receiving it. We could also compare the values of each cell separately, but then we would lose power to detect differences, and with it one of the advantages of this type of design.

The sample size required is usually calculated imagining that we are going to do two independent parallel trials and taking the largest number needed to detect the smallest of the effects we want to study.

Meanwhile, randomization is done using the same methods as with the parallel trial, but it’s repeated several times. Another alternative would be to identify all the groups (A, B, A + B and control, in our example) and make the random assignment at once. The result is the same.
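The repeated randomization can be sketched in Python with two independent coin flips per participant, one for each treatment. This is only an illustration of simple randomization with invented names; a real trial would typically use blocks or another technique to keep the four groups close to N/4.

```python
import random
from collections import Counter

GROUP_NAMES = {
    (True, False): "A only",
    (False, True): "B only",
    (True, True): "A + B",
    (False, False): "control",
}


def factorial_allocation(n_participants, rng):
    """Assign each participant to one of the four groups of a 2x2
    factorial trial using two independent randomizations."""
    groups = []
    for _ in range(n_participants):
        gets_a = rng.random() < 0.5   # first randomization: A or not A
        gets_b = rng.random() < 0.5   # second randomization: B or not B
        groups.append(GROUP_NAMES[(gets_a, gets_b)])
    return groups


if __name__ == "__main__":
    rng = random.Random(7)  # fixed seed only to make the demo repeatable
    print(Counter(factorial_allocation(400, rng)))
```

Drawing one of the four group labels directly with equal probability gives the same distribution, which is why both procedures are equivalent.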

The main reason for doing a factorial trial is usually economic: it requires fewer participants than the two comparable parallel trials, so it is cheaper. This is especially useful if the sponsor of the trial does not expect to make huge profits from the results. That is why it is common to see factorial trials of unprofitable treatments, or of well-known and traditionally used ones.

Conditions for conducting a factorial clinical trial

An important condition for conducting a factorial trial with guarantees is that there is no interaction between the two treatments, so that their effects are independent. When there is interaction between the two treatments (the effect of one depends on the presence of the other), the analysis becomes more complicated and the necessary sample is larger, since we cannot analyze the marginal values of the table to detect differences, but have to assess the differences among all the groups and, as we have said, the statistical power of the study is lower.

In any case, we will always check for interaction. This can be done using a regression model with an interaction term and comparing it with the same regression model without the interaction term. If we detect an interaction (which may not have been suspected beforehand), we must analyze each group separately, even at the cost of losing power to detect statistically significant differences.
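To see what the interaction term measures, it helps to look at the four group means of a continuous outcome. In a linear model with both treatments and their product, the interaction coefficient equals the effect of A in the presence of B minus the effect of A in the absence of B. The Python sketch below computes that contrast and the marginal effects (which assume groups of equal size); the names and numbers are invented, and no significance test is performed.

```python
def factorial_contrasts(mean_control, mean_a, mean_b, mean_ab):
    """Marginal effects and interaction contrast from the four group
    means of a 2x2 factorial trial (equal group sizes assumed)."""
    effect_a_without_b = mean_a - mean_control
    effect_a_with_b = mean_ab - mean_b
    effect_b_without_a = mean_b - mean_control
    effect_b_with_a = mean_ab - mean_a
    return {
        "marginal_A": (effect_a_without_b + effect_a_with_b) / 2,
        "marginal_B": (effect_b_without_a + effect_b_with_a) / 2,
        # Zero means the effect of A is the same with and without B.
        "interaction": effect_a_with_b - effect_a_without_b,
    }


if __name__ == "__main__":
    # Hypothetical mean outcomes in the four groups.
    print(factorial_contrasts(10, 8, 7, 5))   # additive effects
    print(factorial_contrasts(10, 8, 7, 2))   # A works better with B
```

Only when the interaction contrast is compatible with zero do the marginal effects summarize each treatment fairly.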

And can we compare more than two interventions? We can compare all we want, but we must bear in mind that design complexity increases, as do the number of groups to be compared and the possibility of encountering interaction among any of the tested interventions. For these reasons, it is advisable to keep the number of interventions as low as possible.

We have already discussed the most obvious advantage of the factorial trial: the lower cost that results from requiring a smaller sample size. Another advantage is that it is useful if we are also interested in assessing the effect of the combination of interventions, which also lets us assess the existence of interaction.

On the other hand, interactions between interventions are the main limitation of this design. We have already mentioned that when there is interaction we need to analyze the groups individually, with the loss of power that this entails. Another drawback is that the compliance of the participants may not be very good: the more treatments a participant must follow correctly, the more likely it is that he will not do it the way he should.

We’re leaving…

And here we leave for today the story of factorial clinical trials. We have described the simplest form, the 2×2 factorial. However, as we have said, things can get complicated by comparing more interventions and also by assigning different sizes to each of the groups. For example, if we want to detect smaller differences in one treatment group that interests us, we can assign more patients to it. Of course, all this complicates the analysis and the calculation of the sample size. But that is another story…

Intention is what matters

Intention-to-treat analysis

There is always someone who does not do what he is told, no matter how simple the design of a clinical trial seems to be for its participants. They are randomly assigned to one of the two arms of the trial and some have to take pill A whereas others have to take pill B, so that we can test which of the two is better.

However, there’s always someone who does not do what he has to and takes the wrong pill, or doesn’t take any pill at all, or takes it incorrectly, or stops it ahead of time, etc., etc., etc.

Types of analysis

And what do we do when it comes to analyzing the results? Common sense tells us that if a participant has taken the wrong treatment we should include him in the group of the pill he actually took (this is called an as-treated analysis, often loosely grouped with per-protocol analysis). Another option is to leave out the participants who did not take the treatment as assigned (the per-protocol analysis in the strict sense). But neither approach is correct if we want an unbiased analysis of the results. If participants start moving from one group to the other, we lose the benefit we obtained by distributing them randomly, and confounding or effect-modifying variables that were balanced between the two groups by randomization can come into play.


To avoid this, the right thing is to respect the initial group assignment and analyze the results of the mistaken subject as if he had taken the assigned treatment correctly. This is what is known as the intention-to-treat analysis, the only one that preserves the advantages of randomization.
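The difference between the analysis populations can be seen with a toy data set in Python. The sketch uses the common definitions: intention to treat groups participants by the arm assigned, as-treated groups them by the treatment actually received, and per-protocol keeps only those who received what they were assigned. The class, the function names and the data are invented, and the sketch assumes that no group ends up empty.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    assigned: str   # arm allocated by randomization
    received: str   # treatment actually taken
    event: bool     # whether the unfavorable outcome occurred


def event_rate(group):
    return sum(p.event for p in group) / len(group)


def analyses(participants, arm):
    """Event rate in one arm under three analysis populations."""
    itt = [p for p in participants if p.assigned == arm]
    as_treated = [p for p in participants if p.received == arm]
    per_protocol = [p for p in itt if p.received == arm]
    return {
        "intention_to_treat": event_rate(itt),
        "as_treated": event_rate(as_treated),
        "per_protocol": event_rate(per_protocol),
    }


if __name__ == "__main__":
    trial = [
        Participant("A", "A", False), Participant("A", "A", False),
        Participant("A", "A", True),  Participant("A", "B", True),
        Participant("B", "B", True),  Participant("B", "B", True),
        Participant("B", "B", False), Participant("B", "B", True),
    ]
    for arm in ("A", "B"):
        print(arm, analyses(trial, arm))
```

In this invented example a single participant who switched pills is enough to make arm A look better in the as-treated and per-protocol analyses than in the intention-to-treat analysis.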

There are several reasons why a participant in a trial may not receive the assigned treatment, in addition to poor compliance on his part. Here are some.

Sometimes it may be the researcher who wrongly includes the participant in the treatment group. Imagine that, after randomization, we realize that some participants are not eligible for the intervention, either because they do not have the disease or because we discover that there is a contraindication to surgery, for example. If we are strict, we should include them in the analysis in the group to which they were assigned, although they have not received the intervention. However, it may be reasonable to exclude them if the causes of exclusion were specified beforehand in the trial protocol. In that case, it is important that the exclusion is decided by someone who does not know the allocation or the results, so that participants of both arms of the trial are handled similarly. Anyway, if we want more security, we can do a sensitivity analysis with and without these subjects to see how the results change.

Another problem of this type can result from missing data. The results of all variables, and especially the primary one, should be available for all participants, but this is not always the case, so we have to decide what to do with the subjects with missing data.

Most statistical programs perform a complete-case analysis, excluding the records of subjects with missing data. This reduces the effective sample size and may bias the results, in addition to reducing the power of the study. Some models, such as mixed longitudinal models or Cox regression, can handle records with some missing data, but none of them can do anything if all the information of a subject is missing. In these cases we can use data imputation in any of its forms, so that we fill the gaps and take advantage of the whole sample according to the intention to treat.

When data imputation is not convenient, one thing we can do is what is called an analysis of extreme cases. This is done by assigning the gaps the best and worst possible outcomes and seeing how the results change. This way we’ll get an idea of the maximum potential impact of missing data on the results of the study. In any case, there is no doubt that the best strategy will be to design the study so that missing data are kept to a minimum.
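An analysis of extreme cases can be sketched in a few lines. This is only an illustration under stated assumptions: the outcome is binary (1 for success, 0 for failure), missing values are marked with `None`, the effect is measured as a risk difference, and the data and function names are invented.

```python
def risk_difference(treated, control):
    """Difference in the proportion of successes (treated minus control)."""
    return sum(treated) / len(treated) - sum(control) / len(control)


def extreme_case_analysis(treated, control):
    """Fill missing outcomes (None) with the most and least favorable values.

    Returns (best_case, worst_case) risk differences from the point of
    view of the treatment.
    """
    def fill(outcomes, value):
        return [value if outcome is None else outcome for outcome in outcomes]

    # Best case: missing treated succeed, missing controls fail.
    best = risk_difference(fill(treated, 1), fill(control, 0))
    # Worst case: missing treated fail, missing controls succeed.
    worst = risk_difference(fill(treated, 0), fill(control, 1))
    return best, worst


# Hypothetical outcomes: 1 = success, 0 = failure, None = missing.
treated = [1, 1, 1, 0, None, 1, None, 1, 0, 1]
control = [0, 1, 0, 0, 1, None, 0, 1, 0, 0]
best, worst = extreme_case_analysis(treated, control)
print(f"best case: {best:.2f}, worst case: {worst:.2f}")
```

If the conclusion holds in both scenarios, the missing data are unlikely to change it; if it flips, the result depends on what happened to the participants we could not measure.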

Anyway, there’s always someone who makes a mistake and messes up the performance of the trial. What can we do?

Variations of intention-to-treat analysis

One possibility is to use a modified intention to treat analysis. It includes everyone in the assigned group, but allows us to exclude participants such as those who never started treatment or who were not considered suitable for the study. The problem is that this opens a door to dressing up the data as it suits us and biasing the results to our advantage. Therefore, we must be suspicious when these changes were not specified in the trial protocol and are a post hoc decision.

The other possibility is to make the analysis according to the treatment received (per protocol analysis). The problem, as we have said, is that the balance of randomization is lost. Also, if those who have been mistaken have some special feature, the results of the study may be biased. On the other hand, the advantage of analyzing the facts as they really happened is that we can get a better idea of how the treatment can work in real life.

Finally, perhaps the safest thing to do is to perform both analyses, the per protocol and the intention to treat, and compare the results obtained with each. In these cases it may be that we detect an effect with the per protocol analysis and not with the intention to treat analysis. This may be due to two main causes. First, per protocol analysis may create spurious associations because it lacks the balance of confounders guaranteed by randomization. Second, the intention to treat analysis favors the null hypothesis, so it has less power than the per protocol analysis. Of course, if we detect a significant effect, our conclusion will be stronger if the analysis was by intention to treat.

We’re leaving…

And here we end for today. We have seen how to try to control errors in the assignment to groups in the trial and how we can impute missing data, which is a fancy way of saying that we invent data where they’re missing. Of course, we can only do that if some conditions are fulfilled. But that’s another story…

The consolation of not being worse

We live in a frantic and highly competitive world. We are continually inundated with messages about how good it is to be the best at this and that. As indeed it is. But most of us soon realize that it is impossible to be the best at everything we do. Gradually, we even realize that it is very hard to be the best at anything in particular, not only in general. In the end, sooner or later, ordinary mortals have to settle for not being the worst at what they do.

But this is not that bad. You can’t always be the best and, indeed, you certainly do not have to. Consider, for example, that we have a great treatment for a very bad disease. This treatment is effective, inexpensive, easy to use and well tolerated. Are we interested in changing to another drug? Probably not. But think now, for example, that it produces an irreversible aplastic anemia in 3% of those who take it. In this case we would like to find a better treatment.

Better? Well, not really better. If it were just the same in everything except the production of aplasia, we’d change to the new treatment.

The most common goal of clinical trials is to show the superiority of an intervention over a placebo or the standard treatment. But, increasingly, trials are performed with the sole objective of showing that the new treatment is equal to the current one. These equivalence trials must be planned carefully, paying attention to a number of aspects.

First, there is no equivalence from an absolute point of view, so you must take much care in keeping the same conditions in both arms of the trial. In addition, we must first set the sensitivity level that we will need in the study. To do this, we first define the margin of equivalence, which is the maximum difference between the two interventions to be considered acceptable from a clinical point of view. Second, we will calculate the sample size needed to discriminate the difference from the point of view of statistical significance.

It is important to understand that the margin of equivalence is set by the investigator based on the clinical significance of what is being measured. The narrower the margin, the larger the sample size needed to reach statistical significance. Contrary to what may seem at first sight, equivalence studies usually require larger samples than studies of superiority.
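To get a feeling for how the margin drives the sample size, here is a minimal sketch. It is not the only way to do the calculation and it rests on several assumptions that are not in the text: a continuous outcome with a common standard deviation, a true difference of zero, two one-sided tests and a normal approximation. The function name and the numbers are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def equivalence_sample_size(sd, margin, alpha=0.05, power=0.8):
    """Approximate number of participants per arm for an equivalence trial.

    Assumes a continuous outcome with common standard deviation `sd`,
    a true difference of zero, two one-sided tests at level `alpha`
    and a normal approximation.
    """
    z = NormalDist().inv_cdf
    beta = 1 - power
    n = 2 * sd ** 2 * (z(1 - alpha) + z(1 - beta / 2)) ** 2 / margin ** 2
    return ceil(n)


# Hypothetical scenario: standard deviation of 10 units.
n_wide = equivalence_sample_size(sd=10, margin=5)
n_narrow = equivalence_sample_size(sd=10, margin=2.5)
print(n_wide, n_narrow)
```

Because the margin is squared in the denominator, halving it roughly quadruples the number of participants needed in each arm.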

After obtaining the results, we’ll analyze the confidence intervals of the differences in effect between the two interventions. Only those intervals that do not cross the line of no effect (one for relative risks and odds ratios, zero for mean differences) are statistically significant. If an interval lies entirely within the predefined equivalence margins, the interventions will be considered equivalent with the probability of error chosen for the confidence interval, usually 5%. If an interval falls entirely outside the range of equivalence, the interventions are considered not equivalent. If it crosses either limit of the margin of equivalence, the study is inconclusive: it neither proves nor rejects the equivalence of the two interventions, although we should assess the extent and position of the interval with respect to the margins of equivalence to rate its possible clinical relevance. Sometimes, results that are not statistically significant or that fall outside the equivalence limits may also provide useful clinical information.

Look at the example in the figure to better understand what we have said so far. We have the intervals of nine studies represented by their position with respect to the line of no effect and the limits of equivalence. Only studies A, B, D, G and H show a statistically significant difference, because they do not cross the line of no effect. A’s intervention is superior, whereas H’s is shown to be inferior. However, only in the case of D can we conclude equivalence of the two interventions, while B and G are inconclusive with regard to equivalence.

You can also conclude equivalence of the two interventions in study E. Notice that, although the difference obtained in D is statistically significant, it does not exceed the limits of equivalence: the intervention is superior from the statistical point of view, but it seems that the difference has no clinical relevance.

Besides studies B and G, already mentioned, C, F and I are inconclusive regarding equivalence. However, C will probably not be inferior and F could be inferior. We could even estimate the probability of these assumptions based on the proportion of each interval that falls within the limits of equivalence.
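The reading rules used in the figure can be written as a small function. This is a sketch with assumptions of our own: the effect is a difference (so the line of no effect is zero), the margin is symmetric, and an interval that just touches a limit is treated as inconclusive. The intervals in the usage example are invented and are not those of the figure.

```python
def classify_interval(lower, upper, margin, no_effect=0.0):
    """Classify a confidence interval for a difference between interventions.

    `margin` is a hypothetical equivalence margin applied symmetrically
    around the line of no effect. Returns (significant, equivalence).
    """
    # Statistically significant if the interval excludes the no-effect value.
    significant = lower > no_effect or upper < no_effect
    low_limit, high_limit = no_effect - margin, no_effect + margin

    if low_limit < lower and upper < high_limit:
        equivalence = "equivalent"        # entirely inside the margins
    elif upper < low_limit or lower > high_limit:
        equivalence = "not equivalent"    # entirely outside the margins
    else:
        equivalence = "inconclusive"      # crosses at least one limit
    return significant, equivalence


# Invented intervals with an equivalence margin of 2 units.
examples = {
    "significant and equivalent": (0.5, 1.5),
    "not significant but equivalent": (-1.0, 1.0),
    "significant and not equivalent": (2.5, 4.0),
    "significant but inconclusive": (1.0, 3.0),
    "not significant and inconclusive": (-1.0, 3.0),
}
for label, (lower, upper) in examples.items():
    print(label, classify_interval(lower, upper, margin=2.0))
```

Note that statistical significance and equivalence are answered separately, which is why a significant difference can still be compatible with equivalence.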

An important aspect of equivalence studies is the method used to analyze results. We know that the intention to treat analysis is always preferable to the per protocol analysis, as it keeps the advantages of randomization for known and unknown variables that may influence the results. The problem is that the intention to treat analysis favors the null hypothesis, minimizing the differences, if any. This is an advantage in superiority studies: finding a difference reinforces the result. However, this is not so advantageous in the case of equivalence studies. On the other hand, the per protocol analysis would tend to increase any difference, but this is not always the case and may vary depending on what motivated the protocol violations, losses or mistakes of assignment between the two arms of the trial. For these reasons, it’s usually advised to analyze results in both ways and to check that the interventions are shown to be equivalent with both methods. We’ll also take into account losses during the study and analyze the information provided by the participants who don’t follow the original protocol.

A particular case of this type of trial is the non-inferiority trial. In this case, researchers are content to demonstrate that the new intervention is not worse than the comparison. All we have said about equivalence is valid here, but considering only the lower limit of the range of equivalence.

One last thing. Studies of superiority are meant to demonstrate superiority and equivalence studies are meant to demonstrate equivalence. Neither design is useful to achieve the goal of the other. Furthermore, if a study fails to demonstrate superiority, it does not mean that the two procedures are equivalent.

We have reached the end without saying anything about another characteristic type of equivalence study: bioequivalence studies. These are phase I trials conducted by pharmaceutical companies to test the equivalence of different presentations of the same drug, and they have some design specifications of their own. But that’s another story…

The other sides of the King

We’ve already talked at other times about the king of experimental designs, the randomized clinical trial, in which a population is randomly assigned to two groups, one to undergo the intervention under study and the other to serve as a control group. This is the most common side of the King, the parallel clinical trial, which is ideal for most studies about treatment, for many studies about prognosis or prevention strategies and, with its peculiarities, for studies assessing diagnostic tests. But the King is very versatile and has many other sides to adapt to other situations.

If we think about it for a moment, the ideal design would be one that allows us to test in the same individual the effect of the study intervention and of the established control (placebo or standard treatment), because parallel testing is an approach that assumes that both groups respond equally to both interventions, which is always a risk of bias that we try to minimize with randomization. If we had a time machine we could test the intervention in everyone, note what happens, go back in time and repeat the experiment with the control intervention. Then we could compare the two effects. The problem, as the sharpest among you will have already guessed, is that the time machine has not been invented yet.

But what has already been invented is the cross-over trial design, in which each subject acts as his own control.

In this type of trial, every subject is randomized to a group, the corresponding intervention is performed, a washout period takes place, and then the other intervention is carried out. Although this solution is not as elegant as the time machine, defenders of the cross-over study argue that the variability within each individual is less than the variability between individuals. Thus, the estimate may be more precise than the one obtained with a parallel trial, and we usually require smaller sample sizes. However, before using this design, a number of considerations have to be made. Logically, the effect of the first intervention should not cause irreversible changes or last very long, because it would affect the effect of the second. In addition, the washout period must be long enough to avoid leaving any residual effect of the first intervention.

We must also consider whether the order of the interventions could affect the final outcome, because in that case only the results of the first intervention will be reliable (sequence effect). Another problem is that, since the study lasts longer, patient characteristics may change during it and may be different in the two periods (period effect). And finally, be alert to losses during follow-up, which are more frequent in longer studies and have a greater impact on the final results of cross-over trials than of parallel trials.
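The argument about within-subject variability can be illustrated with a toy calculation. The numbers are invented: six subjects who differ a lot from each other but respond to the intervention in a similar way. The sketch ignores sequence, period and carry-over effects, which a real cross-over analysis would have to consider.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical outcomes for the same six subjects under each intervention.
treatment = [12.0, 18.0, 25.0, 31.0, 38.0, 44.0]
control = [10.0, 15.0, 23.0, 28.0, 36.0, 41.0]

# Cross-over view: each subject is their own control, so we work with
# the within-subject differences.
differences = [t - c for t, c in zip(treatment, control)]
se_paired = stdev(differences) / sqrt(len(differences))

# Parallel view: the same numbers treated as two independent groups,
# where the between-subject variability enters the standard error.
se_parallel = sqrt(stdev(treatment) ** 2 / len(treatment)
                   + stdev(control) ** 2 / len(control))

print(f"mean difference: {mean(differences):.2f}")
print(f"standard error, paired: {se_paired:.2f}")
print(f"standard error, parallel: {se_parallel:.2f}")
```

The estimated effect is the same in both views, but the standard error is much smaller when each subject is compared with himself, which is why the cross-over design can get by with fewer participants.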

Imagine now that we want to test two interventions (A and B) in the same population. Can we do it with only one trial, saving costs of every kind? Yes, we can. We only have to design a factorial clinical trial. In this type of trial, each participant undergoes two consecutive randomizations. She’s first assigned to intervention A or placebo (P), and then to intervention B or placebo, so that we’ll have four study groups: AB, AP, BP and PP. Obviously, the two interventions must act through independent mechanisms to be able to assess the results of the two effects independently.

Usually one of the hypotheses studied is more mature and plausible and the other has been less tested, making sure that the evaluation of the second doesn’t affect the inclusion and exclusion criteria of the first. Furthermore, it’s not desirable that either of the two interventions has many troublesome effects or is poorly tolerated, because lack of compliance with one treatment will affect compliance with the other. In cases in which the two interventions seem not to be independent, their effects could be studied separately (AP vs. PP and BP vs. PP), but we’ll lose the advantages of the design and a larger sample size will be required.
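The two consecutive randomizations of the factorial design can be sketched as follows. Simple, unrestricted randomization with a fixed seed is an assumption made for brevity (real trials usually use blocked or stratified schemes), and the function name is hypothetical. The group labels are those used in the text.

```python
import random
from collections import Counter


def factorial_allocation(n_participants, seed=0):
    """Assign each participant by two independent randomizations.

    First A versus placebo, then B versus placebo, giving the four
    groups AB, AP, BP and PP.
    """
    rng = random.Random(seed)
    groups = []
    for _ in range(n_participants):
        gets_a = rng.random() < 0.5   # first randomization
        gets_b = rng.random() < 0.5   # second randomization
        if gets_a and gets_b:
            groups.append("AB")
        elif gets_a:
            groups.append("AP")
        elif gets_b:
            groups.append("BP")
        else:
            groups.append("PP")
    return groups


groups = factorial_allocation(200)
counts = Counter(groups)
# If the interventions are independent, the effect of A compares everyone
# who received A (AB + AP) with everyone who did not (BP + PP).
received_a = counts["AB"] + counts["AP"]
received_b = counts["AB"] + counts["BP"]
print(counts, received_a, received_b)
```

This is where the saving comes from: every participant contributes to the comparison for A and to the comparison for B at the same time.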

Other times it may happen that we are in a hurry to finish the study. Imagine a very bad disease that kills people by the dozen and we’re trying a new treatment. We’ll want to have it available as soon as possible (if it works, of course), so we’ll pause the trial and examine its results after the treatment has been tested in a certain number of participants, because if we can already show the usefulness of the treatment, we’ll end the study. This is the type of design that characterizes the sequential clinical trial. Remember that in the parallel clinical trial the right thing is to calculate the sample size in advance. In this design, with a more Bayesian mentality, we establish a statistic whose value determines an explicit stopping rule, whereby the sample size depends on the previous observations of the study. When this statistic reaches the preset value we are confident enough to reject the null hypothesis and end the study. The problem is that each stop and analysis increases the error of rejecting the null hypothesis when it is true (type 1 error), so it’s not recommended to perform many interim analyses. Moreover, the final analysis of results is more complex because we have to take into account the interim analyses. This type of trial is very helpful with interventions that act very quickly, so it is often seen in studies about dose titration of opioids, hypnotics, and poisons of that kind.
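A small simulation helps to see why repeated looks at the data inflate the type 1 error. It is only a sketch under assumptions of our own: there is no true effect, observations are standard normal, every look uses the same unadjusted z threshold of 1.96, and the trial stops at the first significant look. Real sequential designs use adjusted stopping boundaries precisely to avoid this.

```python
import random
from math import sqrt


def false_positive_rate(n_looks, n_trials=2000, per_look=20,
                        z_crit=1.96, seed=1):
    """Simulate trials with no true effect and repeated unadjusted analyses.

    At each look the accumulated observations (mean zero, variance one)
    are tested with a z-test; the trial stops at the first look that
    reaches significance. Returns the proportion of trials that wrongly
    reject the null hypothesis.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        total, n = 0.0, 0
        for _ in range(n_looks):
            for _ in range(per_look):
                total += rng.gauss(0.0, 1.0)
            n += per_look
            if abs(total / sqrt(n)) > z_crit:
                rejections += 1
                break
    return rejections / n_trials


single_look = false_positive_rate(n_looks=1)
five_looks = false_positive_rate(n_looks=5)
print(f"one analysis: {single_look:.3f}, five analyses: {five_looks:.3f}")
```

With a single analysis the false positive rate stays close to the nominal 5%, while five unadjusted looks push it clearly above that level.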

There are other occasions where individual randomization makes no sense. Think that we have taught the physicians of a health center a new technique to better inform their patients and want to compare it with the old one. We cannot tell the same physician to inform some patients in one way and other patients in another, since there would be a strong possibility that the two interventions contaminate each other. It would be more logical to teach a group of medical centers and not teach another group and compare the results. Here we randomize health centers to train or not to train their doctors. This is the cluster allocation design. The problem with this design is that we have little assurance that participants within the same group behave independently, so the sample size required can be greatly increased if there is great variability among groups and little within each group. In addition, we must perform an aggregate analysis of results, because if we do it individually the confidence intervals will be falsely narrowed and we can find false statistical significance. The usual practice is to calculate a weighted statistic for each group and make the final comparisons with it.
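The increase in sample size is usually quantified with the design effect, 1 + (m - 1) × ICC, where m is the cluster size and ICC is the intraclass correlation coefficient, which grows when there is much variability among clusters and little within them. The sketch below assumes clusters of equal size; the function names and the figures are hypothetical.

```python
from math import ceil


def design_effect(cluster_size, icc):
    """Variance inflation factor for clusters of equal size."""
    return 1 + (cluster_size - 1) * icc


def cluster_sample_size(n_individual, cluster_size, icc):
    """Inflate a sample size calculated for individual randomization."""
    inflated = n_individual * design_effect(cluster_size, icc)
    # Rounding first guards against floating-point noise before ceil.
    return ceil(round(inflated, 6))


# Hypothetical trial: 200 participants needed with individual
# randomization, health centers of 21 patients each.
print(cluster_sample_size(200, cluster_size=21, icc=0.0))
print(cluster_sample_size(200, cluster_size=21, icc=0.05))
```

Even a modest correlation within centers can double the number of participants, which is why ignoring the clustering makes the confidence intervals look narrower than they really are.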

The last of the series we are going to deal with is the community trial, in which the intervention is applied to populations. As it’s performed on populations under actual conditions, it has high external validity and often allows us to recommend cost-effective measures based on its results. The problem is that it is often difficult to establish a control group, it may be more difficult to determine the sample size needed, and it is more complex to make causal inferences from its results. It is the typical design for evaluating public health measures such as water fluoridation, vaccinations, etc.

As you can see, the King has many sides. But it also has lower-ranking relatives which are no less worthy. There is a whole family of quasi-experimental studies, consisting of trials that are not randomized, not controlled, or neither of the two. But that’s another story…

The chameleon

Adaptive clinical trial

What a fascinating reptile. It’s known for its eyes, which can rotate independently to cover the whole angle of the circle. Also well known is its long tongue, with which it traps from a distance the bugs that it eats without moving from its place. But the most famous of the chameleon’s abilities is that of changing color and blending into the environment when it wants to go unnoticed, which is not surprising because the chameleon is, it must be said, a pretty ugly bug.

But today we’re going to talk about clinical trials. About one type of clinical trial in particular: as a true chameleon of epidemiology, it changes its design as it is being performed to suit the circumstances as they occur. I am talking about adaptive clinical trials.

Adaptive clinical trial

A clinical trial usually has a fixed design or protocol that we must not change and, if it is changed, we must explain in detail and justify the reason why we did it. However, in an adaptive clinical trial we define in advance, prospectively, the possibility of changes in one or more aspects of the study design based on data that are obtained during the trial. We usually plan at what points throughout the study we’ll analyze the available data and results to determine whether we make some of the predetermined changes. Otherwise, any change is a violation of the study protocol that jeopardizes the validity of the results.

There are many changes we can make. We can change the probabilities of the randomization method, the sample size, and even the characteristics of the follow-up, which can be lengthened or shortened, and modify the visits that were planned in the initial design. But we can go further and change the dose of the tested treatment or the allowed or prohibited concomitant medications, depending on our interests.

We can also change aspects such as the inclusion criteria, the outcome variables (especially the components of composite variables), the analytical methods of the study, and even transform a superiority trial into a non-inferiority one, or vice versa.

As we have mentioned a couple of times, these changes must be planned in advance. We have to define the events that will lead us to make the adaptations of the protocol. For instance, we can plan to increase or decrease the sample size to improve power after enrolling a number of participants, or to include several groups during a predetermined follow-up and, from there, stop applying the intervention in the group in which it is not effective.

Pros and cons

The advantages of an adaptive design are obvious. The first, flexibility, is evident. The other two are more theoretical and are not always met but, a priori, adaptive designs are more efficient than conventional designs and are more likely to demonstrate the effect of the intervention, if it exists.

Its main drawback is the difficulty of planning a priori all the possibilities of change and the subsequent interpretation of the results. It’s difficult to interpret final results when the course of the trial depends heavily on the intermediate data being obtained. Moreover, this makes it imperative to have a fast and easy access to study data while performing it, which can be difficult in the context of a clinical trial.

We’re leaving…

And here we end for today. I insist on the need for a priori planning of the trial protocol and, in the case of adaptive designs, of each adaptive condition. As a matter of fact, nowadays most clinical trials are registered before they are performed, so that their design conditions are on record. Of course, this also facilitates the later publication of the study, even if the results are not favorable, which helps to combat publication bias. But that’s another story…

About pilots

Pilot studies

No doubt that the randomized clinical trial is the King of epidemiological designs when we want to show, for instance, the effectiveness of a treatment. The problem is that clinical trials are difficult and expensive to perform, so before we get into a trial it is usual to carry out other previous studies.

These previous studies may be of the observational type. With a cohort or a case-control study we can gather enough information about the effect of an intervention to justify the subsequent performance of a clinical trial.

However, observational studies are also expensive and complex, so we often resort to another solution: doing a clinical trial on a smaller scale to obtain evidence in order to decide whether or not to do a large-scale trial, whose results would be definitive. These previous studies are generally known by the name of pilot studies, and they have a number of characteristics that should be taken into account.

Pilot studies

For example, the aim of a pilot study is to provide some assurance that the effort of carrying out the final trial will provide something useful, so it tries to observe the direction of the intervention’s effect rather than to demonstrate its effectiveness.

Being relatively small studies, pilot studies often lack sufficient power to achieve statistical significance at the usual level of 0.05, so some authors recommend setting the value of alpha at 0.2. This alpha value is the chance we have of making a type I error, which is to reject the null hypothesis of no effect when it’s true or, in other words, to accept the existence of an effect that doesn’t really exist.

But what is going on? Don’t we mind having a 20% chance of being wrong? For other trials the acceptable limit is 5%. Well, the truth isn’t that we don’t mind, but that the point of view with a pilot study is different from the one with a conventional clinical trial.

If we commit a type I error doing a conventional clinical trial, we’ll accept a treatment as effective when it’s not. It’s easy to understand that this can have bad consequences and harm patients who undergo the allegedly beneficial intervention in the future. However, if we make a type I error in a pilot study, all that will happen is that we’ll spend time and money on a definitive trial that will finally prove that the treatment is not effective.

In a definitive clinical trial it is preferable not to take an ineffective or unsafe treatment for an effective one, while in a pilot study it is preferable to perform a bigger clinical trial with an ineffective treatment than to fail to test one that could be effective. This is why the threshold of type I error is increased to 0.2.
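What we gain by relaxing alpha can be seen by computing the approximate power of a small study at both thresholds. This is a sketch with invented numbers, assuming a two-sided z-test for the difference between two means with a known common standard deviation; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approximate_power(effect, sd, n_per_arm, alpha):
    """Approximate power of a two-sided z-test comparing two means.

    Ignores the negligible chance of rejecting in the wrong direction.
    """
    normal = NormalDist()
    se = sd * sqrt(2 / n_per_arm)          # standard error of the difference
    z_alpha = normal.inv_cdf(1 - alpha / 2)
    return normal.cdf(effect / se - z_alpha)


# Hypothetical pilot: true effect of 5 units, standard deviation of 10,
# 20 participants per arm.
power_005 = approximate_power(effect=5, sd=10, n_per_arm=20, alpha=0.05)
power_020 = approximate_power(effect=5, sd=10, n_per_arm=20, alpha=0.20)
print(f"alpha 0.05: {power_005:.2f}, alpha 0.20: {power_020:.2f}")
```

In this made-up scenario, moving alpha from 0.05 to 0.2 takes the pilot from detecting the effect about a third of the time to detecting it well over half of the time, at the price of more false alarms that the definitive trial will later correct.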

Better use confidence intervals

Anyway, if we are interested in studying the direction of the intervention’s effect, it may be advisable to use confidence intervals instead of classical hypothesis testing with its p-values.

These confidence intervals have to be compared with the minimal clinically important difference, which must be defined a priori. If the interval doesn’t include the null value and includes the minimal important difference, we’ll have arguments for conducting a large-scale trial to definitively show the effect. Suffice it to say that, just as we can increase the alpha value, we can use confidence intervals with levels below 95%.
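The decision rule just described can be sketched as follows. The assumptions are ours: a difference between two means, a known common standard deviation, a normal approximation and invented numbers. The function names and the minimal clinically important difference used in the example are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mean_difference_ci(diff, sd, n_per_arm, level=0.80):
    """Normal-approximation confidence interval for a difference of means."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    se = sd * sqrt(2 / n_per_arm)
    return diff - z * se, diff + z * se


def supports_large_trial(ci, mcid, null_value=0.0):
    """True if the interval excludes the null value and includes the MCID."""
    lower, upper = ci
    excludes_null = not (lower <= null_value <= upper)
    includes_mcid = lower <= mcid <= upper
    return excludes_null and includes_mcid


# Hypothetical pilot: observed difference of 5 units, standard deviation
# of 10, 20 participants per arm, minimal important difference of 4.
ci_80 = mean_difference_ci(diff=5, sd=10, n_per_arm=20, level=0.80)
ci_95 = mean_difference_ci(diff=5, sd=10, n_per_arm=20, level=0.95)
print(ci_80, supports_large_trial(ci_80, mcid=4))
print(ci_95, supports_large_trial(ci_95, mcid=4))
```

With these numbers the 80% interval excludes zero and contains the minimal important difference, whereas the wider 95% interval would not exclude the null value in such a small study.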

Another peculiarity of pilot studies is the choice of the outcome variables. Considering that a pilot study seeks to test just how the components of the trial will work together in the future trial, we can understand that sometimes it’s impractical to use the final outcome variable and we have to use a surrogate variable, that is, one which provides an indirect measure of the effect when the direct measurement is impractical or impossible. For example, if we’re studying an antitumor treatment, the outcome variable may be five-year survival, but in a pilot study an indirect variable that indicates the decrease in tumor size may be more useful. It will indicate the direction of the treatment’s effect without prolonging the pilot study for too long.

We’re leaving…

So, you can see that pilot studies should be interpreted taking into account their peculiarities. Moreover, they also help us to predict how the definitive trial will work, anticipating problems that could ruin an expensive and complex clinical trial. This is the case of missing data and losses to follow-up, which are usually larger in pilot studies than in conventional trials. Although they matter less here, losses in pilot studies should be evaluated in order to prevent future losses in the final trial because, although there are many ways to manage losses and missing data, the best way is always to prevent their occurrence. But that’s another story…

To see well you must be blind

It’s said that there are none so blind as those who refuse to see. But it’s also true that wanting to see too much can be counterproductive. Sometimes it is better to see just the essential and indispensable.

That’s what happens with scientific studies. Imagine that we want to test a new treatment and we propose a trial to some people, giving the new drug to some of them and a placebo to the rest. If we all know what each one is treated with, it might be that the expectations of researchers or participants influence, even inadvertently, the way we evaluate the results of the study. This is why we have to use masking techniques, better known as blinding.

Let’s suppose we want to test a new drug for treating a very severe disease. If a participant knows he’s receiving the drug, he will be much more tolerant of side effects than if he receives placebo. And something similar can happen to the researcher. It’s easy to imagine that you would take less interest in asking about signs of toxicity in a patient that you know is being treated with a harmless placebo.

All of these facts may influence the way participants and researchers evaluate the effects of treatment and may lead to a bias in interpreting results.

Masking techniques can be performed at different levels. The lowest level is not masking at all, which gives what is called an open or un-blinded trial. Although masking is the ideal thing to do, there are times when it’s not possible or convenient. For example, think about cases in which it would cause unnecessary inconvenience to the patient, such as administering an intravenous placebo for a long time or doing a sham surgical procedure. Other times it’s difficult to find a placebo whose pharmaceutical form is indistinguishable from the drug tested. And finally, sometimes it doesn’t make much sense to blind if the treatment produces easily recognizable effects that don’t occur with placebo.

The next level is the single-blind, when either participants or researchers don’t know which treatment each participant is receiving. A further step is the double-blind, in which neither researchers nor participants know which group each one is assigned to. And finally, we can do a triple-blinding when, in addition to the aforementioned, the person who analyzes the data or who has the responsibility to monitor and stop the study also doesn’t know which group each participant is assigned to. Imagine someone has a serious adverse effect and we have to decide if we must stop the study. There is no doubt that knowing whether that person is receiving the drug or the placebo can influence our decision.

But what can we do when masking is not possible or is inconvenient? In such cases we have no choice but to make an open or un-blinded study, although we can try to use a blind evaluator. This means that, although researchers and participants know the allocations to placebo or intervention groups, the person who analyzes the results doesn’t know them. This is especially important when the outcome variable is a subjective one. By the way, it’s not so essential when we measure objective variables, such as a laboratory determination. Think that you won’t assess an X-ray film with the same detail or criteria if you know whether the individual comes from the placebo or the intervention group.

To end this post, we are going to discuss two other possible errors resulting from lack of blinding. If a participant knows he’s receiving the studied drug, he may improve just through a placebo effect. On the other hand, if he knows he’s in the placebo arm, he may modify his behavior just because he knows “he’s not protected” by the new treatment. This is called contamination and it’s a real problem in studies about lifestyle habits.

And that’s all. Just to clarify a concept before the end. We have seen that there is some relationship between lack of blinding and the appearance of a placebo effect. But don’t be mistaken, masking is not the way to control the placebo effect. For that we have to resort to another trick: randomization. But that’s another story…