# Critical appraisal of treatment studies

This post gives guidelines for the critical appraisal of treatment studies, of which the randomized clinical trial is the most frequent design.

We all know that the randomized clinical trial is the king of interventional methodological designs. It is the type of epidemiological study that allows the best control of systematic errors or biases, since the researcher controls the variables of the study and the participants are randomly assigned to the interventions being compared.

In this way, if two homogeneous groups that differ only in the intervention present some difference of interest during the follow-up, we can affirm with some confidence that this difference is due to the intervention, the only thing that the two groups do not have in common.

For this reason, the clinical trial is the preferred design to answer clinical questions about intervention or treatment, although we will always have to be prudent with the evidence generated by a single clinical trial, no matter how well performed. When we perform a systematic review of randomized clinical trials on the same intervention and combine them in a meta-analysis, the answers we get will be more reliable than those obtained from a single study.

That’s why some people say that the ideal design for answering treatment questions is not the clinical trial, but the meta-analysis of clinical trials.

In any case, since systematic reviews assess their primary studies individually and since we find individual trials more often than systematic reviews, it is advisable to know how to make a good critical appraisal in order to draw conclusions. Indeed, we cannot relax when we see that an article corresponds to a clinical trial and take its content for granted. A clinical trial can also contain its traps and tricks, so, as with any other type of design, it is good practice to read it critically, based on our usual three pillars: validity, importance and applicability.

## Critical appraisal of treatment studies

As always, when studying scientific rigor or **VALIDITY** (internal validity), we will first look at a series of essential **primary criteria**. If these are not met, it is better not to waste time with the trial and try to find another more profitable one.

**Is there a clearly defined clinical question?** From the outset, the trial must be designed to answer a structured clinical question about treatment, motivated by one of our multiple knowledge gaps. A working hypothesis should be proposed with its corresponding null and alternative hypotheses, if possible on a topic that is relevant from the clinical point of view. It is preferable that the study try to answer only one question. When there are several questions, the trial may become excessively complicated and end up not answering any of them completely and properly.

**Was the assignment done randomly?** As we have already said, to be able to affirm that the differences between the groups are due to the intervention, the groups must be homogeneous. This is achieved by assigning patients randomly, the only way to control the known confounding variables and, more importantly, also those that we do not know. If the groups were different and we attributed the difference only to the intervention, we could incur confounding bias.

The trial should contain the usual and essential table 1 with the frequency of the demographic and confounding variables in both samples, so that we can be sure that the groups are homogeneous. A frequent error is to look for differences between the two groups and evaluate them according to their p-value, when we know that p does not measure homogeneity. If the participants have been distributed at random, any difference we observe will necessarily be due to chance (we will not need a p to know that).

The sample size is not calculated to discriminate between demographic variables, so a non-significant p may simply indicate that the sample is too small to reach statistical significance. On the other hand, any minimal difference can reach statistical significance if the sample is large enough.
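To make this concrete, here is a minimal sketch that applies a pooled two-proportion z-test (normal approximation) to the same baseline imbalance at two sample sizes. The function name and the figures (40% versus 50% of smokers per arm) are invented for illustration and do not come from any real trial.

```python
import math

def two_proportion_p_value(x1, n1, x2, n2):
    """Two-sided p-value of a pooled two-proportion z-test (normal approximation)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail area of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical table 1 row: 40% vs 50% smokers in each arm.
p_small = two_proportion_p_value(8, 20, 10, 20)          # 20 per arm
p_large = two_proportion_p_value(800, 2000, 1000, 2000)  # 2000 per arm

print(f"20 per arm:   p = {p_small:.3f}")
print(f"2000 per arm: p = {p_large:.2e}")
```

The imbalance is identical in both cases, yet p is far from significant with 20 participants per arm and tiny with 2000. What changes is the sample size, not the homogeneity of the groups.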

So forget about the p: if there is any difference, what you have to do is assess whether it has sufficient clinical relevance to have influenced the results or, more elegantly, control for the unbalanced covariates in the analysis. Fortunately, it is increasingly rare to find tables of the study groups that compare the intervention and control groups by their p-values.

But it is not enough for the study to be randomized; we must also consider whether the **randomization sequence was generated correctly**. The method used must ensure that every participant has the same probability of being assigned to each of the groups, so random number tables or computer-generated sequences are preferred. The allocation must also be concealed, so that it is not possible to know which group the next participant will belong to. That is why centralized systems by telephone or through the Internet are so popular.

And here is something very curious: it is well known that simple randomization produces groups of different sizes, especially if the samples are small, which is why randomization by blocks balanced in size is sometimes used. And I ask you, how many studies have you read that had the same number of participants in the two arms and claimed to be randomized?

Be suspicious if you see equal groups, especially if they are small, and do not be fooled: you can always use one of the many binomial probability calculators available on the Internet to find out the probability that chance generates the groups that the authors present (we are always speaking of simple randomization, not of blocks, clusters, minimization or other techniques). You will be surprised by what you find.
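If you prefer not to depend on an online calculator, the same check can be sketched in a few lines. This assumes simple 1:1 randomization, so that the size of one arm follows a binomial distribution with probability 0.5; the function names are ours, chosen for illustration.

```python
from math import comb

def prob_exact_split(n_total, n_arm_a):
    """Probability that simple 1:1 randomization of n_total participants
    places exactly n_arm_a of them in arm A (binomial with p = 0.5)."""
    return comb(n_total, n_arm_a) * 0.5 ** n_total

def prob_imbalance_at_most(n_total, max_diff):
    """Probability that the two arm sizes differ by max_diff or less."""
    return sum(
        prob_exact_split(n_total, k)
        for k in range(n_total + 1)
        if abs(k - (n_total - k)) <= max_diff
    )

# Chance of a perfect 20/20 split in a trial of 40 participants.
print(f"{prob_exact_split(40, 20):.3f}")
# Chance of a perfect 100/100 split in a trial of 200 participants.
print(f"{prob_exact_split(200, 100):.3f}")
```

With 40 participants, a perfect 20/20 split happens only about one time in eight, and with 200 participants a perfect 100/100 split is rarer still (around 6%). Equal arms are therefore the exception under simple randomization, not the rule.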

It is also important that the **follow-up has been long and complete enough**, so that the study lasts long enough to be able to observe the outcome variable and that every participant who enters the study is taken into account at the end. As a general rule, if the losses exceed 20%, it is admitted that the internal validity of the study may be compromised.

We will always have to analyze the nature of the losses during follow-up, especially if they are high. We must try to determine whether the losses are random or related to any specific variable (which would be bad news) and estimate what effect they may have on the results of the trial. The most usual approach is the so-called worst-case scenario: it is assumed that all the participants lost from the control group did well and all those lost from the intervention group did badly, and the analysis is repeated to check whether the conclusions change, in which case the validity of the study would be seriously compromised.
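The worst-case reasoning can be sketched as follows. The counts are invented for illustration, "events" means the unfavourable outcome, and the function names are our own rather than part of any standard tool.

```python
def risk_ratio(events_tx, n_tx, events_ctrl, n_ctrl):
    """Risk of the bad outcome in the intervention arm divided by the control arm."""
    return (events_tx / n_tx) / (events_ctrl / n_ctrl)

def worst_case_risk_ratio(events_tx, analysed_tx, lost_tx,
                          events_ctrl, analysed_ctrl, lost_ctrl):
    """Worst case for the intervention: every participant lost from the
    intervention arm is counted as an event, and every participant lost
    from the control arm is counted as event-free."""
    return risk_ratio(
        events_tx + lost_tx, analysed_tx + lost_tx,
        events_ctrl, analysed_ctrl + lost_ctrl,
    )

# Hypothetical trial: 100 randomized per arm, 10 lost in each arm.
observed = risk_ratio(18, 90, 27, 90)
worst = worst_case_risk_ratio(18, 90, 10, 27, 90, 10)

print(f"Observed risk ratio:   {observed:.2f}")
print(f"Worst-case risk ratio: {worst:.2f}")
```

In this made-up example the apparent protective effect (risk ratio below 1) disappears once the losses are assigned in the least favourable way, which is exactly the situation in which we should distrust the conclusions.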

The last important aspect is to consider whether patients who did not receive the treatment they were assigned (there is always someone who gets confused and messes things up) have been analyzed according to the **intention to treat**, since it is the only way to preserve all the benefits obtained with randomization. Everything that happens after randomization (such as a change of assigned group) can influence the probability that the subject experiences the effect we are studying, so it is important to respect the intention-to-treat analysis and analyze each participant in the group to which they were initially assigned.

Once these primary criteria have been verified, we will look at three **secondary criteria** that influence internal validity. It will be necessary to verify that the **groups were similar at the beginning of the study** (we have already talked about the table with the data of the two groups), that the **masking** was carried out in an appropriate way as a form of control of biases and that **the two groups were managed and controlled in a similar way** except, of course, the intervention under study.

We know that masking or blinding allows us to minimize the risk of information bias, which is why researchers and participants are usually unaware of which group each participant is assigned to, which is known as double blind. Sometimes, given the nature of the intervention (think of one group that is operated on and another that is not), it will be impossible to mask researchers and participants, but we can always give masked data to the person who analyzes the results (the so-called blind evaluator), which mitigates this inconvenience.

To summarize this section on the validity of the trial, we can say that we will have to check that there is a clear definition of the study population, the intervention and the outcome of interest, that the randomization has been done properly, that an attempt has been made to control information bias through masking, that there has been adequate follow-up with control of the losses and that the analysis has been correct (intention-to-treat analysis and control for covariates not balanced by randomization).

## A famous Colombian: Alejandro Jadad Bechara

A very simple tool that can also help us assess the internal validity of a clinical trial is the Jadad scale, also called the Oxford quality scoring system. Jadad, a Colombian doctor, devised a scoring system with 7 questions. First, 5 questions whose affirmative answer adds 1 point:

- Is the study described as randomized?
- Is the method used to generate the randomization sequence described and is it adequate?
- Is the study described as double blind?
- Is the masking method described and is it adequate?
- Is there a description of the losses during follow up?

Finally, two questions whose negative answer subtracts 1 point:

- Is the method used to generate the randomization sequence adequate?
- Is the masking method appropriate?

As you can see, the Jadad scale assesses the key points that we have already mentioned: randomization, masking and follow-up. A trial is considered a rigorous study from the methodological point of view if it has a score of 5 points. If the study has fewer than 3 points, we had better use it to wrap the sandwich.
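The scoring rules listed above can be sketched as a small function. This is our own reading of the seven questions, not an official implementation: an adequacy argument is `True` when the method is described and adequate, `False` when it is described but inadequate, and `None` when it is not described at all (no point added or subtracted). The floor at zero is also an assumption.

```python
from typing import Optional

def jadad_score(
    randomized: bool,
    randomization_adequate: Optional[bool],
    double_blind: bool,
    masking_adequate: Optional[bool],
    losses_described: bool,
) -> int:
    """Jadad score (0 to 5) following the seven questions in the text."""
    # One point for each of the three basic descriptions.
    score = int(randomized) + int(double_blind) + int(losses_described)
    # One extra point for an adequate method, one point off for an inadequate one.
    for adequate in (randomization_adequate, masking_adequate):
        if adequate is True:
            score += 1
        elif adequate is False:
            score -= 1
    return max(score, 0)  # assumption: the score is not allowed to go below zero

# Hypothetical trial: randomized with a computer-generated sequence,
# described as double blind without explaining how, losses reported.
print(jadad_score(True, True, True, None, True))
```

Encoding "not described" separately from "inadequate" keeps the two subtracting questions from penalizing a trial that simply says nothing about its methods.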

We will now proceed to consider the results of the study to gauge its clinical **RELEVANCE**. It will be necessary to look at the variables measured to see whether the trial adequately expresses the magnitude and precision of the results. It is important, once again, not to settle for being inundated with multiple p-values full of zeros. Remember that p only tells us how likely it is to observe a difference at least as large as the one found when there is really no effect, so a small p merely limits the risk of accepting as real a difference that exists only by chance (or, to put it simply, of making a type 1 error). Statistical significance does not have to be synonymous with clinical relevance.

In the case of continuous variables such as survival time, weight, blood pressure, etc., it is usual to express the magnitude of the results as a difference in means or medians, depending on which measure of central tendency is most appropriate. However, in the case of dichotomous variables (alive or dead, healthy or sick, etc.) the relative risk, its relative and absolute reductions and the number needed to treat (NNT) will be used. Of all of them, the one that best expresses clinical efficacy is usually the NNT. Any trial worthy of our attention must provide this information or, failing that, the data we need to calculate it.
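When a trial reports only the raw counts, these measures are easy to derive ourselves. The sketch below uses the standard definitions; the function name and the example counts (15 events among 100 treated, 25 among 100 controls) are hypothetical.

```python
import math

def effect_measures(events_tx, n_tx, events_ctrl, n_ctrl):
    """Effect measures for a dichotomous (unfavourable) outcome."""
    risk_tx = events_tx / n_tx
    risk_ctrl = events_ctrl / n_ctrl
    rr = risk_tx / risk_ctrl       # relative risk (risk ratio)
    arr = risk_ctrl - risk_tx      # absolute risk reduction
    rrr = arr / risk_ctrl          # relative risk reduction, equal to 1 - RR
    # NNT is the inverse of the ARR; it is undefined (infinite) when ARR is 0
    # and negative when the intervention increases the risk.
    nnt = math.inf if arr == 0 else 1 / arr
    return {"RR": rr, "ARR": arr, "RRR": rrr, "NNT": nnt}

measures = effect_measures(15, 100, 25, 100)
for name, value in measures.items():
    print(f"{name}: {value:.2f}")
```

In practice the NNT is reported rounded up to the next whole patient. Note how a relative risk reduction that sounds impressive can correspond to a modest absolute reduction when the baseline risk is low, which is why the NNT is so useful.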

But to obtain a more realistic estimate of the results in the population, we need to know the precision of the study, and nothing is easier than resorting to confidence intervals. These intervals, in addition to precision, also inform us about statistical significance. A result is statistically significant if the confidence interval of the risk ratio does not include the value one or if that of the mean difference does not include the value zero. If the authors do not provide them, we can use a calculator to obtain them, such as those available on the CASP website.
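As an alternative to an online calculator, a confidence interval for the risk ratio can be approximated with the usual log method, in which the standard error of ln(RR) is sqrt(1/a - 1/n1 + 1/c - 1/n2). The sketch below assumes that large-sample approximation; the function names and the counts are hypothetical.

```python
import math
from statistics import NormalDist

def risk_ratio_ci(events_tx, n_tx, events_ctrl, n_ctrl, level=0.95):
    """Risk ratio with an approximate confidence interval (log method)."""
    rr = (events_tx / n_tx) / (events_ctrl / n_ctrl)
    se_log_rr = math.sqrt(
        1 / events_tx - 1 / n_tx + 1 / events_ctrl - 1 / n_ctrl
    )
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for 95%
    low = math.exp(math.log(rr) - z * se_log_rr)
    high = math.exp(math.log(rr) + z * se_log_rr)
    return rr, low, high

def is_significant(low, high, null_value):
    """True when the interval excludes the null value (1 for ratios, 0 for differences)."""
    return not (low <= null_value <= high)

rr, low, high = risk_ratio_ci(15, 100, 25, 100)
print(f"RR = {rr:.2f}, 95% CI {low:.2f} to {high:.2f}")
print("Significant:", is_significant(low, high, null_value=1))
```

With these invented counts the point estimate looks attractive (RR of 0.60), but the interval runs from roughly 0.34 to 1.07 and includes one, so the result is not statistically significant and the estimate is rather imprecise.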

A good way to sort the study of the clinical importance of a trial is to structure it in these four aspects: Quantitative assessment (measures of effect and its precision), Qualitative assessment (relevance from the clinical point of view), Comparative assessment (see if the results are consistent with those of other previous studies) and Cost-benefit assessment (this point would link to the next section of the critical appraisal that has to do with the applicability of the results of the trial).

To finish the critical reading of a treatment article we will assess its **APPLICABILITY** (also called external validity), for which we will have to ask ourselves whether the results can be generalized to our patients or, in other words, whether there is any difference between our patients and those of the study that prevents the generalization of the results. It must be taken into account in this regard that the stricter the inclusion criteria of a study, the more difficult it will be to generalize its results, thereby compromising its external validity.

But, in addition, we must consider whether **all clinically important outcomes have been taken into account**, including side effects and undesirable effects. The outcome variable measured must be important for the investigator and for the patient. Do not forget that demonstrating that the intervention is effective does not necessarily mean that it is beneficial for our patients. We must also assess the harmful or annoying effects and study the **benefits-costs-risks balance**, as well as the difficulties that may exist in applying the treatment in our environment, the patient’s preferences, etc.

As is easy to understand, a study can have great methodological validity and results of great clinical importance and still not be applicable to our patients, either because our patients are different from those of the study, because the treatment does not fit their preferences or because it is not feasible in our environment. However, the opposite does not usually happen: if the validity is poor or the results are unimportant, we will hardly consider applying the conclusions of the study to our patients.

## We’re leaving…

To finish, I recommend that you use some of the tools available for critical appraisal, such as the CASP templates, or a checklist, such as CONSORT, so as not to leave any of these points without consideration. Yes, everything we have talked about concerns randomized controlled clinical trials, so what happens with nonrandomized trials or other kinds of quasi-experimental studies? Well, for those we follow another set of rules, such as those of the TREND statement. But that is another story…