# Science without sense…double nonsense

### Pills on evidence-based medicine

Archive for the Critical Appraisal category

## The power of the transitive property

When Georg Cantor set out to develop set theory, he could hardly have imagined everything that would come after it, much of it from the hands of mathematicians as devoted as he was. I am thinking of the curious case of binary relations, which the older among you will remember from the days when children still learned such things at school.

It turns out that some mathematical genius sat down to think and described a series of properties. The first is the reflexive property. It means that if a number x is equal to x, then it is x. In case anyone has not followed, let us give an anatomical example: my right hand is my right hand. I suspect the genius who invented the reflexive property needed a long stay at some spa to recover from such a huge mental strain.

It was in this spa that he decided to attempt something more intense, so he described the symmetric property, which is much more complex: whenever a number x equals y, then y equals x. Going back to the anatomical simile, if my arms and legs are my extremities, you will have to agree that my extremities are my arms and my legs. Algebra is fascinating.

Luckily, in the end, just to round off the set, our anonymous genius invented the transitive property, which goes more or less like this: if a number x is related to y, and y is related to z, there is transitivity if x is related to z. Again, back to anatomy: if my leg is mine and my foot belongs to my leg, then my foot is also mine. More properties were later derived from these three, but we shall leave it here for now, because today we are going to use the power of the transitive property to find out which of two things that have never actually been compared head to head is the better one. Think, for example, of a crazed mob rushing into a shopping centre on the first day of the sales. They look at everything before deciding what to buy, but it is not necessary to compare all the products two by two to know which one they like best.

In medicine something similar happens. Usually there are several options to treat the same disease (although those of us who have been in the business for a long time know that the more options there are, the more likely it is that none of them works well at all). Clinical trials, and meta-analyses of clinical trials, only compare pairs of interventions, and it may happen that no one has compared the two we have at our disposal, or that we want to know which is, in theory, the best of all those available.

Well, for that purpose a methodological design called network meta-analysis (NMA) has been invented, also known as multiple-treatments meta-analysis or mixed-treatment comparisons meta-analysis. And this last term, mixed comparisons, is the crux of the matter, because it turns out that there are several types of comparisons. Let's see them.

Let’s assume we have three possible treatments that, after deep reflection, I have decided to call A, B and C. The simplest situation is to compare two of them, A and B for example, with a conventional clinical trial. We would be making a direct comparison between the two interventions. But it may happen that we have no trial that directly compares A and B, only two different trials that compare each intervention with a third one, C (you can see it in the attached figure). In this case we can resort to the power of the transitive property and make an indirect comparison between A and B based on their relative efficacy against C. For example, if A halves mortality compared with C (relative risk 0.5) and B reduces it by a quarter compared with C (relative risk 0.75), we can estimate that A reduces mortality by about a third relative to B (0.5 / 0.75 ≈ 0.67). Of course, for this to work, transitivity has to hold, something we cannot take for granted. For example, if I like pork and pigs like to wallow in mud, that does not mean that I like to wallow in mud. Transitivity is not fulfilled in this case (I think).
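The kind of indirect comparison just described is simple arithmetic on the ratio scale. A minimal sketch in Python, with hypothetical relative risks chosen only for illustration:

```python
# Indirect comparison of A versus B through a common comparator C,
# using the transitive property on the ratio scale.
# All numbers are hypothetical, for illustration only.

rr_a_c = 0.50   # relative risk of A versus C (A halves the event rate)
rr_b_c = 0.75   # relative risk of B versus C (B reduces it by a quarter)

# Ratios chain by division (equivalently, subtraction on the log scale):
rr_a_b = rr_a_c / rr_b_c
print(round(rr_a_b, 2))  # 0.67 -> A reduces the event rate by about a third vs B
```

The same chaining works for odds ratios or hazard ratios, always on the log scale, provided transitivity holds.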

Well, an NMA is nothing more than a series of direct, indirect and mixed comparisons that allows us to estimate the relative effects of several interventions. The multiple comparisons are typically represented with a network diagram. Each node in the network, whose size can vary according to its specific contribution, corresponds to one of the interventions being compared, while the lines joining the nodes represent the direct comparisons available from the primary studies. The complete network represents all the comparisons of treatments identified from the primary studies of the review on which our NMA is based.

As with other types of meta-analysis coupled with a systematic review, the validity of the NMA will depend on the validity of the primary studies, the heterogeneity among them and the possible information biases, factors that condition the quality of the direct comparisons.

In addition, indirect comparisons are considered observational and require, as we have already mentioned, that the researcher judge the transitivity of the interventions based on her knowledge of them, of the disease and of the designs of the primary studies.

Another aspect specific to the NMA is that of coherence or consistency, which refers to the level of agreement between the evidence coming from direct and indirect comparisons. This level of agreement, which can be measured with specific statistical methods, must be high for the summary result measure to be valid. The results of the comparisons must go in the same direction; they cannot be divergent. When this is not fulfilled, the cause probably lies in the poor methodological quality of the primary studies, in their heterogeneity or in the presence of biases.

As in other meta-analyses, the result of the NMA is expressed with a summary measure that can be an odds ratio, a mean difference, a risk ratio, etc. This point estimate is accompanied by an interval that informs us about its precision. The statistical analysis of an NMA can use frequentist methods (the ones we usually see in ordinary clinical trials) or Bayesian methods. The latter assign a prior probability to the treatment effect before analysing the data and then compute a posterior probability after the analysis. For what interests us here, frequentist methods express the precision of the point estimate by means of the familiar confidence intervals (usually 95%), while Bayesian methods provide credibility intervals (also 95%), of similar meaning.
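On the frequentist side, the classic way of putting a confidence interval around an indirect comparison is Bucher's adjusted method: the log effects against the common comparator are subtracted and their variances added. A minimal sketch, with hypothetical odds ratios and standard errors:

```python
import math

# Bucher's adjusted indirect comparison (frequentist), a minimal sketch.
# Inputs are log odds ratios and their standard errors from two trials
# against a common comparator C; all values are hypothetical.

log_or_ac, se_ac = math.log(0.60), 0.15   # A vs C
log_or_bc, se_bc = math.log(0.80), 0.20   # B vs C

# Indirect estimate of A vs B: difference of log effects,
# with the variances adding up.
log_or_ab = log_or_ac - log_or_bc
se_ab = math.sqrt(se_ac**2 + se_bc**2)

or_ab = math.exp(log_or_ab)
ci_low = math.exp(log_or_ab - 1.96 * se_ab)
ci_high = math.exp(log_or_ab + 1.96 * se_ab)
print(f"OR A vs B = {or_ab:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
# prints: OR A vs B = 0.75 (95% CI 0.46 to 1.22)
```

Note how the indirect interval is wider than either direct one: indirect evidence is always less precise than direct evidence of the same size.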

With all these data we obtain a ranking of the compared treatments, with the best heading the list. But do not get overconfident: these rankings have to be read carefully, for several reasons. First, the best treatment in one situation may not be the best in another. Second, we must take into account other factors such as cost, availability, the clinician's experience, etc. Third, the rankings do not take into account the magnitude of the differences between their elements. And fourth, chance can play tricks on us and place in a good position a treatment that, in reality, is not as good as it may seem.

Having reviewed, at a glance, the peculiarities of the NMA, what can we say about its critical appraisal? Just as we have a checklist for the systematic review with conventional meta-analysis, the PRISMA statement, there is a specific statement for the NMA, the PRISMA-NMA. This list includes, as specific items, aspects such as the description of the geometry of the treatment network, the consideration of the transitivity and consistency assumptions, and the description of the methods used to analyse the structure of the network and the suitability of the comparisons, in case some have a lower level of evidence. All this will be easier if the authors provide the graph of the study network and briefly explain its characteristics.

Anyway, you know that I’d rather resort to the CASP tools for the critical appraisal of documents. Although there is no specific one for NMA, I advise you to use the one for systematic reviews with conventional meta-analysis and, afterwards, to make some considerations about the specific aspects of the NMA.

So as not to make this post too long, we will skip the whole part that NMAs share with any other systematic review and go directly to their specific aspects. You can consult the corresponding post where we reviewed the critical appraisal of a systematic review. As always, we will follow our three pillars of wisdom: validity, relevance and applicability.

Regarding VALIDITY, we will ask three specific questions.

1. Does the review answer a well-defined clinical question that justifies performing an NMA? This question has the classic components of the PICO question, although the intervention and the comparison will encompass the multiple comparisons of the network.
2. Was an exhaustive search for the relevant studies carried out? This aspect is important to avoid publication bias and to include all the important information available. Missing studies can affect the consistency of the comparisons.
3. Is there a clear specification of the target population, the treatments evaluated and the outcome measures used? All these aspects can condition the validity of the indirect comparisons. If we want to infer the relationship between the effects of A and B by comparing their individual effects against C, it is essential that A and B are handled similarly in their comparison with C, that the A-C and B-C comparisons are made in similar patients, that the same outcome measures are used and that the risk of bias in the studies is low. The latter can be assessed with the usual tools, such as Cochrane's.

To finish this section, we will check that the results are analysed and presented appropriately: which statistical method has been used (frequentist or Bayesian), and whether confidence or credibility intervals, the analysis of the network, etc. are provided.

Although we will not go into it, we will just say that there are multiple types of networks (star, loop, line…). For the comparisons to be more valid, indirect comparisons must be supported by direct ones. This can be seen in the network diagram as triangles like the one in the graph attached at the beginning of the post (or other closed geometric shapes). Other influencing factors, already mentioned, being equal, the more triangles we see, the more valid the comparisons will be.

As a last aspect, we will evaluate whether the authors have used appropriate methods to assess heterogeneity and the possible existence of inconsistency: sensitivity analyses, meta-regression, etc.

Moving on to the RELEVANCE section, we will assess the results of the meta-analysis. Here we will consider five specific aspects:

1. What is the result? As in any other meta-analysis, we will assess the result and its importance from the clinical point of view. It will also be necessary to assess how the result could have been influenced by the risk of bias in the primary studies: the greater the risk of bias, the farther our estimate may be from the truth.
2. Are the results precise? Here we must assess the width of the confidence or credibility intervals, considering how the conclusions of the study would be affected at each end of the interval.
3. Is there consistency of results among the different studies? There may be variability by pure chance or because of heterogeneity among the studies. We can assess it by observing the shape of the forest plots and with the usual statistical methods, such as I2.
4. Are the indirect comparisons reliable? We return to the concept of transitivity, which must be considered together with the other factors we have previously discussed and which may increase the risk of bias: homogeneous populations, common outcome variables and comparators, etc.
5. Is there consistency between direct and indirect comparisons? We will have to check for closed geometric shapes within the network (our triangles or loops), as well as rule out causes of inconsistency, which are the same ones already mentioned as causes of heterogeneity and intransitivity.
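The I2 statistic used to quantify heterogeneity among studies is derived from Cochran's Q, the weighted sum of squared deviations of each study's effect from the pooled effect. A minimal sketch with hypothetical study data:

```python
# Cochran's Q and I2 from hypothetical log odds ratios and their
# variances for five primary studies (inverse-variance weighting).
y = [0.10, 0.80, -0.30, 0.90, 0.20]   # log effect of each study
v = [0.04, 0.09, 0.06, 0.12, 0.05]    # variance of each estimate

w = [1 / vi for vi in v]                                # weights
y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)   # pooled effect

q = sum(wi * (yi - y_bar) ** 2 for wi, yi in zip(w, y))
df = len(y) - 1
i2 = max(0.0, (q - df) / q) * 100  # % of variability beyond chance
print(f"Q = {q:.2f}, I2 = {i2:.1f}%")
# prints: Q = 12.47, I2 = 67.9%
```

An I2 around 68%, as here, would point to substantial heterogeneity that the review authors should explore before trusting the pooled result.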

Finally, we will finish our critical appraisal by making some special considerations regarding the APPLICABILITY of the results.

In addition to considering, as usual, whether all the effects and variables important to the patient have been taken into account and whether the patients are similar to those of our setting, we will ask the questions specifically related to the use of an NMA, such as whether the network has considered all the treatment possibilities or whether the different comparison subgroups that have been established are credible from the clinical point of view.

And here we will leave it for today. A difficult beast to tame, this NMA. And we have said nothing about its statistical methodology, quite complex but which software packages handle without flinching. In addition, we could have said a lot more about the types of networks and the comparisons that can be drawn from each of them. But that’s another story…

## An unfairly treated genius

The genius that I am talking about in the title of this post is none other than Alan Mathison Turing, considered one of the fathers of computer science and a forerunner of modern computing.

For mathematicians, Turing is best known for his involvement in the solution of the decision problem, previously proposed by Gottfried Wilhelm Leibniz and David Hilbert, who sought to define a method that could be applied to any mathematical statement to prove whether it was true or not (for those interested in the matter, it was eventually demonstrated that no such method exists).

But what Turing is famous for among the general public comes thanks to the cinema and to his work in statistics during World War II. Turing set out to exploit Bayesian magic to explore how the evidence collected during an investigation can support, or not, the initial hypothesis, thus favouring the development of a new alternative hypothesis. This allowed him to decipher the code of the Enigma machine, the one used by the German navy to encrypt its messages, and that is the story that has been taken to the screen. This line of work led to the development of concepts such as the weight of evidence and probability concepts with which to confront null and alternative hypotheses, which were applied in biomedicine and enabled new ways of evaluating the capabilities of diagnostic tests, such as the ones we are going to deal with today.

But this whole story about Alan Turing is just a tribute to one of the people whose contribution made it possible to develop the methodological design we are going to talk about today, which is none other than the meta-analysis of diagnostic accuracy.

We already know that a meta-analysis is a quantitative synthesis method used in systematic reviews to integrate the results of primary studies into a summary result measure. The most common are systematic reviews on treatment, for which the methodology and the choice of summary measure are quite well defined. Reviews on diagnostic tests, made possible by the development and characterisation of the parameters that measure the performance of a diagnostic test, are less common.

The process of conducting a diagnostic systematic review essentially follows the same guidelines as a treatment review, although there are some specific differences that we will try to clarify. We will focus first on the choice of the outcome summary measure and try to take into account the rest of the peculiarities when we give some recommendations for a critical appraisal of these studies.

When choosing the outcome measure we find the first big difference from treatment meta-analyses. In the meta-analysis of diagnostic accuracy (MDA), the most frequent way to assess the test is to combine sensitivity and specificity as summary values. However, these indicators have the problem that the cut-off points for considering a test result positive or negative usually vary among the different primary studies of the review. Moreover, in some cases positivity may depend on the subjectivity of the evaluator (think of the results of imaging tests). All this, besides being a source of heterogeneity among the primary studies, is the origin of a typical MDA bias called the threshold effect, on which we will dwell a little later.

For this reason, many authors do not like to use sensitivity and specificity as summary measures and resort to positive and negative likelihood ratios. These ratios have two advantages. First, they are more robust to the presence of a threshold effect. Second, as we know, they allow us to calculate the post-test probability, either using Bayes’ rule (pre-test odds × likelihood ratio = post-test odds) or a Fagan’s nomogram (you can review these concepts in the corresponding post).
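The odds-scale version of Bayes' rule mentioned above is a three-line calculation. A minimal sketch with hypothetical numbers:

```python
# Post-test probability from a pre-test probability and a likelihood
# ratio, using Bayes' rule on the odds scale (hypothetical values).

pretest_prob = 0.20      # clinician's estimate before the test
lr_positive = 8.0        # positive likelihood ratio of the test

pretest_odds = pretest_prob / (1 - pretest_prob)   # 0.25
posttest_odds = pretest_odds * lr_positive         # 2.0
posttest_prob = posttest_odds / (1 + posttest_odds)
print(round(posttest_prob, 2))  # 0.67
```

This is exactly what a Fagan's nomogram does graphically: a positive result with LR+ of 8 lifts a 20% suspicion to about 67%.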

Finally, a third possibility is to resort to another of the inventions that derive from Turing’s work: the diagnostic odds ratio (DOR).

The DOR is defined as the ratio between the odds of a diseased patient testing positive and the odds of a healthy person testing positive. This phrase may seem a bit cryptic, but it is not. The odds of a diseased patient testing positive rather than negative is simply the ratio between true positives (TP) and false negatives (FN): TP / FN. On the other hand, the odds of a healthy person testing positive rather than negative is the quotient between false positives (FP) and true negatives (TN): FP / TN. With this, we only have to take the ratio between the two odds, as you can see in the attached figure. The DOR can also be expressed in terms of the predictive values and the likelihood ratios, according to the expressions you can see in the same figure. Finally, it is also possible to calculate its confidence interval, according to the formula that closes the figure.
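The definition above can be sketched in a few lines, together with the usual large-sample confidence interval built on the log scale (the 2×2 counts are made up for illustration):

```python
import math

# Diagnostic odds ratio (DOR) from a hypothetical 2x2 table.
tp, fn, fp, tn = 90, 10, 30, 70

# Odds of testing positive in the diseased over odds in the healthy:
dor = (tp / fn) / (fp / tn)          # equivalently (tp * tn) / (fp * fn)

# Standard error of ln(DOR) and a 95% confidence interval:
se_log_dor = math.sqrt(1/tp + 1/fn + 1/fp + 1/tn)
ci_low = math.exp(math.log(dor) - 1.96 * se_log_dor)
ci_high = math.exp(math.log(dor) + 1.96 * se_log_dor)
print(f"DOR = {dor:.1f} (95% CI {ci_low:.1f} to {ci_high:.1f})")
# prints: DOR = 21.0 (95% CI 9.6 to 45.9)
```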

Like all odds ratios, the possible values of the DOR range from zero to infinity. The null value is 1, which means the test has no discriminatory capacity between healthy and diseased. A value greater than 1 indicates discriminatory capacity, which is greater the higher the value. Finally, values between 0 and 1 indicate that the test not only fails to discriminate well between diseased and healthy, but classifies them the wrong way round, giving more negative results among the diseased than among the healthy.

The DOR is a global parameter that is easy to interpret and does not depend on the prevalence of the disease, although it must be said that it can vary between groups of patients with different disease severity. In addition, it is a very robust measure against the threshold effect and is very useful for calculating the summary ROC curves that we will discuss below.

The second peculiar aspect of MDA we are going to deal with is the threshold effect. We must always assess its presence when facing an MDA. The first thing is to look at the clinical heterogeneity among the primary studies, which may be evident without much consideration. There is also a simple mathematical approach, which is to calculate Spearman’s correlation coefficient between sensitivity and specificity. If there is a threshold effect, there will be an inverse correlation between the two, which will be stronger the greater the threshold effect.
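This check is easy to sketch. Below, Spearman's coefficient is computed by hand from ranks (the sensitivity/specificity pairs are hypothetical; in practice `scipy.stats.spearmanr` would do the same job):

```python
# Checking for a threshold effect: Spearman's correlation between
# sensitivity and specificity across primary studies (hypothetical data).
# A strong inverse correlation suggests a threshold effect.

sens = [0.95, 0.90, 0.85, 0.80, 0.70]
spec = [0.60, 0.70, 0.75, 0.85, 0.90]

def ranks(xs):
    # rank 1 for the smallest value (no ties in this toy example)
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

n = len(sens)
d2 = sum((a - b) ** 2 for a, b in zip(ranks(sens), ranks(spec)))
rho = 1 - 6 * d2 / (n * (n**2 - 1))
print(rho)  # -1.0: perfectly inverse, consistent with a threshold effect
```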

Finally, a graphical method is to assess the dispersion of the sensitivity and specificity values of the primary studies around the summary ROC curve of the meta-analysis. A wide dispersion makes us suspect a threshold effect, but it can also be due to the heterogeneity of the studies or to other biases, such as selection or verification biases.

The third specific element of MDA we are going to comment on is the summary ROC curve (sROC), which is an estimate of the common ROC curve adjusted according to the results of the primary studies of the review. There are several ways to calculate it, some quite complicated from the mathematical point of view, but the most used are the regression models that use the DOR as an estimator since, as we have said, it is very robust against heterogeneity and the threshold effect. But do not be alarmed: most statistical packages calculate and plot the sROC with little effort.

The reading of the sROC is similar to that of any ROC curve. The two most used parameters are the area under the curve (AUC) and the Q index. The AUC of a perfect curve equals 1. Values above 0.5 indicate discriminatory diagnostic capacity, which is higher the closer the AUC gets to 1. A value of 0.5 tells us that the usefulness of the test is the same as flipping a coin. Finally, values below 0.5 indicate that the test contributes nothing to the diagnosis it intends to make.

On the other hand, the Q index corresponds to the point at which sensitivity and specificity are equal. As with the AUC, a value greater than 0.5 indicates overall effectiveness of the diagnostic test, which is higher the closer the index is to 1. In addition, confidence intervals can be calculated for both the AUC and the Q index, with which we can assess the precision of the estimate of the summary measure of the MDA.
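There is a neat link between the Q index and the pooled DOR: under the symmetric sROC model (the Moses-Littenberg regression with zero slope), the regression intercept equals ln(DOR), and the point where sensitivity equals specificity works out to √DOR / (1 + √DOR). A sketch with a hypothetical pooled DOR:

```python
import math

# Q* index from a pooled DOR under the symmetric Moses-Littenberg
# sROC model (intercept a = ln(DOR), slope b = 0). At the Q* point,
# sensitivity = specificity = 1 / (1 + exp(-a / 2)),
# which simplifies to sqrt(DOR) / (1 + sqrt(DOR)).
# The pooled DOR below is hypothetical.

dor = 21.0
q_star = math.sqrt(dor) / (1 + math.sqrt(dor))
print(round(q_star, 2))  # 0.82
```

A Q* of 0.82 means the sROC crosses the sensitivity = specificity diagonal at about 82%, a reasonably good overall performance.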

Having seen (at a glance) the specific aspects of MDA, let's give some recommendations for the critical appraisal of this type of study. The CASP network does not provide a specific tool for MDA, but we can follow the lines of the one for systematic reviews of treatment studies, taking into account the differential aspects of MDA. As always, we will follow our three basic pillars: validity, relevance and applicability.

Let’s start with the questions that value the VALIDITY of the study.

The first question asks whether the topic of the review has been clearly specified. As with any systematic review, one on diagnostic tests should try to answer a specific, clinically relevant question, usually proposed following the PICO scheme of a structured clinical question. The second question makes us consider whether the types of studies included in the review are adequate. The ideal design is that of a cohort to which the diagnostic test we want to assess and the gold standard are applied blindly and independently. Studies based on case-control designs are less valid for the evaluation of diagnostic tests and will reduce the validity of the results.

If the answer to both questions is yes, we turn to the secondary criteria. Have the important studies on the subject been included? We must verify that a comprehensive and unbiased search of the literature has been carried out. The search methodology is similar to that of systematic reviews on treatment, although we should take some precautions. For example, diagnostic studies are often indexed differently in databases, so using the usual filters of other types of reviews can make us miss relevant studies. We will have to check the search strategy carefully, which must be provided by the authors of the review.

In addition, we must verify that the authors have ruled out the possibility of a publication bias. This poses a special problem in MDA, since the study of publication bias in this setting is not well developed and the usual methods, such as the funnel plot or Egger’s test, are not very reliable. The most conservative approach is always to assume that there may be a publication bias.

It is very important that enough has been done to assess the quality of the studies, looking for possible biases. For this, the authors can use specific instruments, such as the QUADAS-2 tool.

To finish the section on internal or methodological validity, we must ask ourselves whether it was reasonable to combine the results of the primary studies. In order to draw conclusions from combined data, it is fundamental that the studies are homogeneous and that the differences among them are due solely to chance. We will have to assess the possible sources of heterogeneity and whether there may be a threshold effect, which the authors must have taken into account.

In summary, the fundamental aspects we have to analyse to assess the validity of an MDA are: 1) that the objectives are well defined; 2) that the bibliographic search has been exhaustive; and 3) that the internal or methodological validity of the included studies has been verified. In addition, we will review the methodological aspects of the meta-analysis itself: the appropriateness of combining the studies in a quantitative synthesis, an adequate assessment of the heterogeneity of the primary studies and of the possible threshold effect, and the use of an adequate mathematical model to combine the results of the primary studies (sROC, DOR, etc.).

Regarding the RELEVANCE of the results, we must consider what the overall result of the review is and whether it has been interpreted judiciously. We will value more highly those MDAs that provide measures that are more robust against possible biases, such as likelihood ratios and the DOR. In addition, we must assess the accuracy of the results, for which we will use our beloved confidence intervals, which give us an idea of the precision of the estimate of the true magnitude of the effect in the population.

We will conclude the critical appraisal of the MDA by assessing the APPLICABILITY of the results to our setting. We will have to ask whether we can apply the results to our patients and how they will influence their care. We will have to see whether the primary studies of the review describe the participants and whether they resemble our patients. In addition, it will be necessary to check that all the results relevant to decision-making in the problem under study have been considered and, as always, the benefit-cost-risk balance must be assessed. The fact that the conclusion of the review seems valid does not mean that we have to apply it obligatorily.

Well, with all that said, we are going to finish for today. The title of this post refers to the mistreatment suffered by a genius. We already know which genius we were referring to: Alan Turing. Now we will clarify the abuse. Despite being one of the most brilliant minds of the 20th century, as witnessed by his work in statistics, computing, cryptography and cybernetics, and despite having saved his country from the blockade of the German navy during the war, in 1952 he was tried for his homosexuality and convicted of gross indecency and sexual perversion. As is easy to understand, his career ended after the trial, and Alan Turing died in 1954, apparently after eating a piece of an apple poisoned with cyanide, which was labelled a suicide, although there are theories that speak rather of murder. Some say that this is the origin of the bitten apple of a well-known computer brand, although others say that the apple just represents a play on words between bite and byte.

I do not know which of the two theories is true, but I prefer to remember Turing every time I see the little apple. My humble tribute to a great man.

And now we finish. We have seen the peculiarities of meta-analyses of diagnostic accuracy and how to appraise them. Much more could be said about all the mathematics associated with their specific aspects, such as the presentation of variables, the study of publication bias, the threshold effect, etc. But that’s another story…

## Chickenphant

The unreal mixture of different parts of animals has been an obsession of so-called human beings since time immemorial. The most emblematic case is that of the Chimera (which gives its name to the whole family of mixtures of different animals). This mythological being, daughter of Typhon and the viper Echidna, had a lion’s head, a goat’s body and a dragon’s tail, which allowed her to breathe flames and frighten everyone who passed by. Of course, it did not help her when Bellerophon, mounted on Pegasus (another weirdo, a horse with wings), insisted on running her through with his lead-tipped spear. You see, her strength was her downfall: her fire melted the tip of the spear inside this rare creature, which resulted in her death.

Besides the Chimera, there are many more of these beings, all of them fruit of human imagination. To name a few, we can recall the unicorns (these had worse luck than Pegasus: instead of wings they got horns, one each), the basilisks (a kind of snake-rooster of quite bad character), the griffin (lion’s body, eagle for the rest) and all those in which part of the mixture is human, such as manticores (head of a man and body of a lion), centaurs, the Minotaur, Medusa (with her snakes instead of hair), mermaids…

In any case, among all the beings of this imaginary zoo, I am left with the chickenphant (gallifante in Spanish). This was a mixture of chicken and elephant that was used on TV to reward the wit of children who took part in a popular contest. Millennials will have no idea what I am talking about, but surely those who grew up in the 80s know what I mean.

And all this came to my mind while I was reflecting on the number of chimeras that also exist among the possible designs of scientific studies, especially among observational studies. Let’s get to know three of these chickenphants of epidemiology: the case-control study nested in a cohort and the case-cohort study, ending with another particular specimen, the case-crossover or self-controlled study.

Within observational studies, we all know the classic cohort and case-control studies, the most frequently used.

In a cohort study, a group or cohort is subjected to an exposure and followed over time to compare the frequency of the effect with that of an unexposed cohort, which acts as a control. These studies are usually of forward (antegrade) direction, so they allow us to measure the incidence of the disease and calculate the risk ratio between the two groups. On the other hand, a case-control study starts from two population groups, one of which presents the effect or disease under study, and compares its exposure to a specific factor with that of the group without the disease, which acts as a control. Being of retrograde direction and selecting the cases of disease directly, it is not possible to calculate incidence density and, therefore, risk ratios between the two groups, so the odds ratio is the typical measure of association of case-control studies.
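The difference between the two association measures is easiest to see on the same 2×2 table. A minimal sketch with hypothetical counts:

```python
# Risk ratio (cohort design) versus odds ratio (case-control design)
# computed from the same hypothetical 2x2 table.

#             diseased   healthy
exposed     = (40,        160)
unexposed   = (20,        180)

risk_exp = exposed[0] / sum(exposed)        # 40 / 200 = 0.20
risk_unexp = unexposed[0] / sum(unexposed)  # 20 / 200 = 0.10
rr = risk_exp / risk_unexp                  # incidence-based risk ratio

odds_exp = exposed[0] / exposed[1]          # 40 / 160
odds_unexp = unexposed[0] / unexposed[1]    # 20 / 180
or_ = odds_exp / odds_unexp                 # odds ratio
print(f"RR = {rr:.2f}, OR = {or_:.2f}")
# prints: RR = 2.00, OR = 2.25
```

Note how the OR (2.25) overshoots the RR (2.00): the two only converge when the disease is rare.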

The cohort study is the more methodologically solid of the two. The problem is that cohort studies usually require long follow-up periods and large cohorts, especially when the frequency of the disease under study is low, together with the need to manage all the covariates of that large cohort, all of which increases the costs of the study.

Well, for those cases in which neither the case-control nor the cohort design fits the needs of the researcher, epidemiologists have invented a series of designs that sit halfway between the two and can mitigate their shortcomings. These hybrid designs are the case-control study nested in a cohort and the case-cohort study to which we have already referred.

On another note, in classical observational studies the key point is the selection of controls, which have to be representative of the level of exposure to the risk factor evaluated in the population from which the cases originate. Adequate selection of controls becomes even more difficult when the effect occurs abruptly. For example, if we want to know whether a copious meal increases the risk of heart attack, we would have great difficulty in recruiting representative controls from the population, since the risk factors may act only moments before the event.

To avoid these difficulties, the principle of “you made your bed, now lie in it” was applied and the third type of chimera we have mentioned was designed, in which each participant acts, at the same time, as his own control. These are case-crossover studies, also known as self-controlled case studies.

Let’s have a look at these weirdos, beginning with the case-control study nested in a cohort.

Suppose we have done a study in which we used a cohort with many participants. Well, we can reuse it in a nested case-control study. We take the cohort and follow it over time, selecting as cases those subjects who develop the disease and assigning them as controls individuals from the same cohort who have not yet presented it (although they may do so later). Thus, cases and controls come from the same cohort. It is convenient to match them taking into account confounding and time-dependent variables, such as the years they have been included in the cohort. In this way, the same subject can act as a control on several occasions and end up as a case later on, which will have to be taken into account in the statistical analysis of these studies. As this may seem a bit confusing, I show you a scheme of this type of study in the first attached figure.

As we select cases as they arise, we are sampling by incidence density, which will allow us to estimate risk ratios. This is an important difference from conventional case-control studies, in which an odds ratio is usually calculated, which can only be assimilated to the relative risk when the frequency of the effect is very low.
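As a side note, the relationship between the odds ratio and the relative risk is easy to verify numerically. Here is a minimal sketch in Python with invented 2x2 tables (all counts are hypothetical) showing that the two measures almost coincide when the effect is rare and diverge when it is common:

```python
# Hypothetical 2x2 table:            cases   non-cases
#                        exposed       a         b
#                        unexposed     c         d

def odds_ratio(a, b, c, d):
    """Odds ratio: (a/c) / (b/d), i.e. (a*d) / (b*c)."""
    return (a * d) / (b * c)

def risk_ratio(a, b, c, d):
    """Risk ratio: incidence in exposed / incidence in unexposed."""
    return (a / (a + b)) / (c / (c + d))

# Rare effect: incidences of 2% (exposed) vs 1% (unexposed)
print(risk_ratio(20, 980, 10, 990))    # 2.0
print(odds_ratio(20, 980, 10, 990))    # ~2.02, close to the risk ratio

# Common effect: incidences of 40% vs 20%
print(risk_ratio(400, 600, 200, 800))  # 2.0
print(odds_ratio(400, 600, 200, 800))  # ~2.67, clearly overestimates it
```

With incidence density sampling, as in the nested design, this rare-disease caveat is not needed, which is precisely the advantage mentioned above.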

Another difference is that all the information about the cohort is collected at the beginning of the study, so there is less risk of producing the classic information biases of case-control studies, which are usually of a retrospective nature.

The other type of hybrid observational design we are going to discuss is the case-cohort study. Here we also start from a large initial cohort, from which we select a more manageable sub-cohort to be used as a comparison group. We then see which individuals of the initial cohort develop the disease and compare them with the sub-cohort (regardless of whether or not they belong to it). You can see the outline of a case-cohort study in the second attached figure.

As in the previous example, by choosing cases over time we can estimate the incidence density in cases and non-cases and calculate the risk ratio from them. As you can imagine, this design is cheaper than conventional studies because it greatly reduces the volume of information on healthy subjects that must be handled, without losing efficiency when studying rare diseases. The problem that arises is that the analyzed sample has an overrepresentation of cases, so the analysis of the results cannot be done as in traditional cohorts, but has its own, much more complicated, methodology.

To summarize what has been said so far: the nested case-control study is more like the classic case-control study, while the case-cohort study is more like the conventional cohort study. The fundamental difference between the two is that in the nested study the sampling of controls is done by incidence density and by matching, so we must wait until all the cases have occurred to select the entire reference population. This is not so in the case-cohort study, which is much simpler in this regard, since the reference population is selected at the beginning of the study.

To finish with these hybrid studies, we will say a few things about case-crossover studies. These focus on the moment the event occurs and try to see whether something unusual happened that favored it, comparing the exposures in the moments immediately preceding the event with those of earlier periods that serve as controls. Therefore, we compare case moments with control moments, each individual acting as his own control.

For the study to be valid from the methodological point of view, the authors have to clearly describe a series of characteristic time periods. The first is the induction period, which is the delay from the beginning of the exposure to the production of the effect.

The second is the period of effect, which is the interval during which the exposure can trigger the effect. Finally, the period of risk would be the sum of the two previous ones, from the moment of exposure to the beginning of the event.

The induction period is usually very brief, so the periods of risk and effect are usually equivalent. In the attached figure I show you the relationship between the three periods so that you can understand it better.

It is essential that these three periods be clearly specified, since a poor estimate of the period of effect, whether by excess or by defect, dilutes the effect of the exposure and makes its detection more difficult.

Some of you will tell me that these studies resemble other self-controlled designs, such as matched case-control studies. The difference is that in the latter one or more similar controls are chosen for each case, while in self-controlled studies each participant is his own control. They also look a little like crossover clinical trials, in which all participants undergo both intervention and control, but those are experimental studies in which the researcher intervenes in the production of the exposure, while self-controlled studies are observational.

Where they do resemble matched case-control studies is in the statistical analysis, only here case moments and control moments are analyzed. Thus, it is usual to use conditional logistic regression models, the most common measure of association being the odds ratio.
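For the simplest situation, 1:1 matching (one control moment per case moment), this conditional analysis reduces to something very simple: only the discordant pairs carry information, and the odds ratio is their quotient. A minimal sketch with hypothetical counts:

```python
def matched_pairs_odds_ratio(case_only_exposed, control_only_exposed):
    """Conditional odds ratio for a 1:1 matched design: pairs where only
    the case (or case moment) was exposed, divided by pairs where only
    the control (or control moment) was exposed. Concordant pairs carry
    no information and are ignored."""
    return case_only_exposed / control_only_exposed

# Hypothetical: 30 pairs exposed only at the case moment,
# 10 pairs exposed only at the control moment
print(matched_pairs_odds_ratio(30, 10))  # 3.0
```

With several control moments per case, or with covariates, a full conditional logistic regression model would be fitted instead; this two-count shortcut is only the simplest 1:1 situation.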

As you can see, hybrid studies are a whole new family that threatens to grow in number and complexity. As far as I know, there are no checklists to critically appraise these types of designs, so we will have to judiciously apply the principles we use when analyzing classical observational studies, taking into account, in addition, the particularities of each type of study.

For this, we will follow our three pillars: validity, relevance and applicability.

In the VALIDITY section we will assess the methodological quality of the study. We will check that there is a clear definition of the study population, the exposure and the effect. If a reference cohort is used, it should be representative of the population and should be followed completely. On the other hand, the cases must be representative of the population of cases from which they come, and the controls have to come from a population with a level of exposure representative of that of the case population.

The measurement of the exposure and the effect must be done blindly, the measurement of the effect being independent of the knowledge of the level of exposure. In addition, we will analyze whether attention has been paid to the temporal relationship between exposure and effect and whether there was a relationship between the level of exposure and the degree of effect. Finally, the statistical analysis should be correct, taking into account the control of possible confounding factors. This part can be complicated by the complexity of the statistical methods that these designs usually require.

In addition, as we have already mentioned, if we are facing a case-crossover study we must make sure that the three periods have been correctly defined, especially the period of effect, whose inaccuracy may affect the conclusions of the study to a greater degree.

Next, we will evaluate the RELEVANCE of the results and their precision, as measured by their confidence intervals. We will look for the impact measures calculated by the authors of the study and, if they do not provide them, we will try to calculate them ourselves. Finally, we will compare the results with others previously published in the literature to see whether they are concordant with existing knowledge and what new knowledge they provide.

We will finish the critical appraisal by assessing the APPLICABILITY of the results. We will consider whether the participants are comparable to our patients and whether the conclusions are applicable to our environment.

And here we are going to finish this post. We have seen a whole new range of hybrid studies that combine the advantages of two observational designs to better adapt to situations in which classical studies are more difficult to apply. The drawback of these studies, as we have said, is that their analysis is a bit more complicated than that of conventional studies, since a crude analysis of the results is not enough: it must be adjusted for the possibility that a participant acts as both control and case (in the nested studies) and for the overrepresentation of cases (in the case-cohort studies).

Let me just finish by commenting that everything I have said about case-crossover studies refers to the so-called unidirectional ones, studies in which there is a very precise temporal relationship between exposure and effect. For cases in which the exposure is more sustained, other types called bidirectional case-crossover studies can be used, in which control periods are selected both before and after the effect. But that is another story…

## There is another world, and it is this one

And there are other lives, but they are in you. Paul Éluard already said it, that surrealist of the last century who had the bad idea of visiting Cadaqués accompanied by his wife, Elena Ivanovna Diakonova, better known as Gala. Things did not go well for him there, but his phrase has given rise to many more things.

For example, it has been used by many writers who love the unknown, myths and mystery. I personally came to know the phrase as a young teenager, because it was written as a preface to a series of science fiction books. Even in more recent times it has been related to that other incorporeal world that is cyberspace, where we spend an ever greater part of our lives.

But, to help Éluard rest peacefully in his tomb at Père-Lachaise, I will tell you that I prefer his original idea about our two worlds, between which we share our limited lifetime: the real world, where we do most things, and the world of imagination, our intimate space, where we dream our most impossible realities.

You will think that today I am feeling very metaphysical, but this is the thought that came to my mind when I started thinking about the topic we are going to deal with in this post. And the fact is that in the realm of medicine there are also two worlds.

We are very used to numbers and to the objective results of our quantitative research. As an example, we have our revered systematic reviews, which gather the available scientific evidence on a specific health technology to assess its efficacy, safety, economic impact, etc. If we want to know whether watching a lot of TV is a risk factor for suffering that terrible disease that is fildulastrosis, the best thing will be to do a systematic review of clinical trials (assuming there are any). Thus, we can calculate a multitude of parameters that, with a number, will give us a full idea of the impact of such an unhealthy habit.

But if what we want to know is how fildulastrosis affects the person who suffers from it, how much unhappiness it produces, how it alters family and social life, things get a little complicated with this type of research methodology. And this is important, because the social and cultural aspects related to people’s real context are increasingly valued. Luckily, there are other worlds, and they are in this one. I am referring to the world of qualitative research. Today we are going to take a (short) look at this world.

Qualitative research is a method that studies reality in its natural context, as it occurs, in order to interpret phenomena according to the meanings they have for the people involved. For this it uses all kinds of sources and materials that help us describe the routine and the meaning of problematic situations in people’s lives: interviews, life stories, images, sounds… Although all this has little to do with the gridded world of quantitative research, the two methods are not incompatible and may even be complementary. Simply put, qualitative methods provide alternative information, different from and complementary to that of quantitative methods, which is useful for evaluating the perspectives of the people involved in the problem we are studying. Quantitative research addresses the problem deductively, while qualitative research uses an inductive approach.

Logically, the methods used by qualitative research are different from those of quantitative research. In addition, they are numerous, so we will not describe them in depth. We will say that the specific methods most used are meta-synthesis, phenomenology, meta-ethnography, meta-study, meta-interpretation, grounded theory, the biographical method and the aggregative review, among others.

The most frequently used of these methods is meta-synthesis, which starts with a research question and a bibliographic search, in a way similar to what we know about systematic reviews. However, there are a couple of important differences. In quantitative research, the research question must be clearly defined from the outset, while in qualitative research this question is, by definition, flexible and is usually modified and refined as data collection progresses. The other aspect has to do with the literature search: in qualitative research it is not so clearly defined which databases have to be used, and there are no filters or methodologies like those available to documentalists for reviews of quantitative research.

The techniques used for collecting data are also different from those we are more accustomed to in quantitative research. One of them is observation, which allows the researcher to obtain information about the phenomenon as it occurs. The paradigm of observation in qualitative research is participant observation, in which the observer interacts socially with the subjects of the environment in which the phenomenon under study occurs. For example, if we want to assess the experiences of travelers on a commercial flight, there is nothing better than buying a ticket and posing as just another traveler, collecting all the information about comfort, punctuality, attention provided by the flight staff, quality of the snacks, etc.

Another widely used technique is the interview, in which one person asks another person or group of people for information on a specific topic. When it is done with groups it is called, as it could not be otherwise, a group interview. In this case the script is quite closed and the role of the interviewer quite prominent, unlike in focus group discussions, in which everything can be more open, at the discretion of the group’s facilitator. Finally, when we want to know the opinion of many people, we can resort to the questionnaire technique, which polls the opinion of large groups in such a way that each member spends only a minimum of time completing it, unlike focus groups, in which everyone remains for the whole interview.

The structure of a qualitative research study usually includes five fundamental steps, which may vary depending on the methods and techniques used:

1. Definition of the problem. As we have already mentioned when discussing the research question, the definition of the problem has a certain degree of provisionality and can change throughout the study, since one of its objectives may be precisely to find out whether the definition of the problem is well made.
2. Study design. It must also be flexible. The problem with this phase is that sometimes the proposed design is not what we see in the published article. There is still a certain lack of definition of many methodological aspects, especially when compared with the methodology of quantitative research.
3. Data collection. The techniques we have discussed are used: interview, observation, reading of texts, etc.
4. Analysis of the data. This aspect also differs from the quantitative analysis. Here it will be interesting to unravel the structures of meaning of the collected data to determine their scope and social implications. Although methods are being devised to express the results in numerical form, the usual thing is that we do not see many figures here and, of course, nothing comparable to quantitative methods.
5. Report and validation of the information. The objective is to generate conceptual interpretations of the facts in order to capture the meaning they have for the people involved. Again, and unlike in quantitative research, the goal is not to project the results of possible interventions onto the environment, but to interpret the facts at hand.

At this point, what can we say about the critical appraisal of qualitative research? Well, to give you an idea, I will tell you that there is great variety of opinion on this subject, from those who think it makes no sense to assess the quality of a qualitative study to those who try to design evaluation instruments that provide numerical results similar to those of quantitative studies. So, my friends, there is no uniform consensus on whether we should evaluate, in the first place, or on how to do it, in the second. In addition, some people think that even studies that may be considered of low quality should be taken into account because, after all, who is able to define with certainty what a good qualitative research study is?

In general, when making a critical appraisal of a qualitative research study, we will have to assess a series of aspects such as its integrity, complexity, creativity, validity of the data, quality of the descriptive narrative, interpretation of the results and scope of its conclusions. We will continue here our habit of resorting to CASPe’s critical appraisal program, which provides us with a template of 10 questions to perform the critical appraisal of a qualitative study. These questions are structured around three pillars: rigor, credibility and relevance.

The questions on rigor refer to the suitability of the methods used to answer the research question. As usual, the first questions are screening questions. If the answer to any of them is not affirmative, the controversy will be settled since, at least with this study, it will not be worthwhile to continue with our appraisal. Were the objectives of the research clearly defined? We must assess that the question is well specified, as well as the objective of the research and the justification of its necessity. Is the qualitative methodology congruent? We will have to decide whether the methods used by the authors are adequate to obtain the data that will allow them to reach the objective of the research. Finally, is the research method used suitable for achieving the objectives? Researchers must explicitly state the method they have used (meta-synthesis, grounded theory…). In addition, the specified method must match the one actually used, which sometimes may not be the case.

If we have answered these three questions affirmatively, it will be worth continuing and we will move on to the detailed questions. Is the participant selection strategy consistent with the research question and the method used? It must be justified why the selected participants were the most suitable, as well as explained who recruited them, where, etc. Are the data collection techniques congruent with the research question and the method used? The data collection technique (for example, discussion groups) and the recording format will have to be specified and justified. If the collection strategy is modified during the study, the reason for this will have to be justified.

Has the relationship between the researcher and the object of research (reflexivity) been considered? It will be necessary to consider whether the involvement of the researcher in the process may have biased the data obtained and whether this has been taken into account when designing the data collection, the selection of the participants and the scope of the study. To finish with the assessment of the rigor of the work, we will ask ourselves whether the ethical aspects have been taken into account. We will have to consider aspects shared with quantitative research, such as informed consent, approval by an ethics committee or confidentiality of data, as well as specific aspects concerning the effect of the study on participants before and after its completion.

The next block of two questions has to do with the credibility of the study, which is related to the ability of the results to represent the phenomenon from the subjective point of view of the participants. The first question makes us consider whether the analysis of the data was sufficiently rigorous. The entire analysis process should be described, along with the categories that may have arisen from the collected data, whether the subjectivity of the researcher has been assessed and how data that could be mutually contradictory have been handled. If fragments of participants’ testimonies are presented to support the results, the reference to their origin must be clearly specified. The second question has to do with whether the results are presented clearly. They should be presented in a detailed and understandable manner, showing their relationship to the research question. At this point we will review the strategies adopted to ensure the credibility of the results, as well as whether the authors have reflected on the limitations of the study.

We will finish the critical appraisal by answering the only question of the block that has to do with the relevance of the study, which is nothing more than its usefulness or applicability to our clinical practice. Are the results of the research applicable? We will have to assess how the results contribute to our practice, how they add to existing knowledge and in what contexts they may be applicable.

And here we are going to leave it for today. You have already seen that we have taken a look into a world quite different from the one we are used to, one in which we have to change a little our mentality about how to pose and study problems. Before leaving, I have to warn you, as in previous posts, not to look up fildulastrosis, because you will not find this disease anywhere. Actually, fildulastrosis is an invention of mine in homage to a very illustrious and sadly deceased character: Forges. Antonio Fraguas (his nom de guerre comes from the English translation of his surname) was, in my humble opinion, the best graphic humorist for as long as I can remember. For many years I began the day with the daily Forges cartoon, so for some time now there are mornings when one does not know how to start the day. Forges invented many words of his own, and I really liked his percutoria’s fildulastro, which had the defect of escalporning now and then. Hence my fildulastrosis, so from here I thank him and pay him this little tribute.

And now we really are leaving. We have not said much about other methods of qualitative research, such as grounded theory, meta-ethnography, etc. Those interested have a bibliography where they are explained better than I could do it. And, of course, as in quantitative research, there are also ways to combine qualitative research studies. But that is another story…

## Powerful gentleman

Yes, as the illustrious Francisco de Quevedo y Villegas once said, a powerful gentleman is Don Dinero (Mr. Money). A great truth because who, however purely in love, does not humble himself before the golden yellow? And even more so in a mercantilist and materialist society like ours.

But the problem is not that we are materialistic and only think about money. The problem is that nobody believes they have all the money they need. Even the wealthiest would like to have much more. And many times, it is true, we do not have enough money to cover all our needs as we would like.

And that does not only happen at the individual level, but also at the level of social groups. Any country has a limited amount of money, so it cannot spend on everything it wants and has to choose where to spend it. Let’s think, for example, of our healthcare system, in which new health technologies (new treatments, new diagnostic techniques, etc.) are getting better and better… and more expensive (sometimes even bordering on obscenity). If we are spending at the limit of our possibilities and want to adopt a new treatment, we only have two choices: either we increase our wealth (where do we get the money from?) or we stop spending it on something else. There would be a third one that is used frequently, even though it is not the right thing to do: spend what we do not have and pass the debt on to whoever comes next.

Yes, my friends, the saying that health is priceless does not hold up economically. Resources are always limited, and we must all be aware of the so-called opportunity cost of a product: the money it costs will have to stop being spent on something else.

Therefore, it is very important to properly evaluate any new health technology before deciding on its implementation in the health system, and this is why the so-called economic evaluation studies have been developed, aimed at identifying which actions should be prioritized to maximize the benefits produced in an environment with limited resources. These studies are a tool to assist decision-making, but they are not meant to replace it, so other elements have to be taken into account, such as justice, equity and free access to choice.

Economic evaluation (EV) studies encompass a whole series of specific methodology and terminology that is usually little known to those not dedicated to the evaluation of health technologies. Let’s briefly review their characteristics and finally give some recommendations on how to make a critical appraisal of these studies.

The first thing would be to explain the two characteristics that define an EV. These are the measurement of the costs and benefits of the interventions (the first) and the choice or comparison between two or more alternatives (the second). These two features are essential to say that we are facing an EV, which can be defined as the comparative analysis of different health interventions in terms of costs and benefits. The methodology of an EV will have to take into account a number of aspects that we list below and that you can see summarized in the attached table.

– Objective of the study. It will determine whether the use of a new technology is justified in terms of the benefits it produces. For this, a research question will be formulated with a structure similar to that of other types of epidemiological studies.

– Perspective of the analysis. It is the point of view of the person or institution for whom the analysis is intended, which determines the costs and benefits that must be taken into account from the chosen standpoint. The most global perspective is that of society, although that of the funders, that of specific organizations (for example, hospitals) or that of patients and families can also be adopted. The most usual is to adopt the perspective of the funders, sometimes accompanied by the social one. If so, both must be well differentiated.

– Time horizon of the analysis. It is the period of time during which the main economic and health effects of the intervention are evaluated.

– Choice of the comparator. This is a crucial point for determining the incremental effectiveness of the new technology, and the importance of the study for decision-makers will largely depend on it. In practice, the most common comparator is the alternative usually employed (the gold standard), although the new technology can sometimes be compared with the option of no treatment, which must be justified.

– Identification of costs. Costs are usually considered taking into account the total amount of the resource consumed and the monetary value of the resource unit (you know, as the friendly hostesses of an old TV contest used to say: 25 answers, at 5 pesetas each, 125 pesetas). Costs are classified as direct or indirect and as health or non-health costs. The direct ones are those clearly related to the illness (hospitalization, laboratory tests, laundry and kitchen, etc.), while the indirect ones refer to productivity or its loss (work functionality, mortality). On the other hand, health costs are those related to the intervention (medicines, diagnostic tests, etc.), while non-health costs are those that the patient or other entities have to pay, or those related to productivity.

What costs will be included in an EV? It will depend on the intervention being analyzed and, especially, on the perspective and time horizon of the analysis.

– Quantification of costs. It will be necessary to determine the amount of resources used, either individually or in aggregate, depending on the information available.

– Cost assessment. Costs will be assigned a unit price, specifying the source and the method used to assign it. When the study covers long periods of time, it must be borne in mind that things do not cost the same over the years. If I tell you that I knew a time when you could go out at night with a thousand pesetas (the equivalent of about 6 euros now) and come back home with money in your pocket, you will think it is another of my frequent ravings, but I swear it is true.

To take this into account, a weighting factor or discount rate is used, usually between 3% and 6%. For the curious, the general formula is CV = FV / (1 + d)^n, where CV is the current value, FV the future value, n the number of years and d the discount rate.
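The discounting formula can be turned into a one-line function. A quick sketch in Python (the figures are invented):

```python
def current_value(future_value, discount_rate, years):
    """CV = FV / (1 + d)**n: value today of a cost or benefit n years ahead."""
    return future_value / (1 + discount_rate) ** years

# A cost of 1000 euros incurred 5 years from now, discounted at 3% a year
print(round(current_value(1000, 0.03, 5), 2))  # 862.61
```

With a 6% rate the same future cost would be worth noticeably less today, which is why the chosen rate is a favorite target of sensitivity analyses.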

– Identification, measurement and evaluation of results. The benefits obtained can be classified into health and non-health benefits. Health benefits are clinical consequences of the intervention, generally measured from a point of view of interest to the patient (improvement in blood pressure figures, deaths avoided, etc.). The non-health ones, on the other hand, are divided according to whether they produce improvements in productivity or in quality of life.

The first ones are easy to understand: productivity can improve because people return to work earlier (shorter hospitalization, shorter convalescence) or because they work better thanks to the improved health of the worker. The second ones are related to the concept of health-related quality of life, which reflects the impact of the disease and its treatment on the patient.

Health-related quality of life can be estimated using a series of questionnaires on patients’ preferences, summarized in a single score that, together with the amount of life, will provide us with the quality-adjusted life year (QALY).

To assess quality of life we resort to the utilities of health states, which are expressed with a numerical value between 0 and 1, where 0 represents the utility of the state of death and 1 that of perfect health. In this sense, a year of life lived in perfect health is equivalent to 1 QALY (1 year of life x 1 utility = 1 QALY). Thus, to determine the value in QALYs we multiply the utility associated with a state of health by the years lived in that state. For example, half a year in perfect health (0.5 years x 1 utility) would be equivalent to one year lived with some ailments (1 year x 0.5 utility).
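The QALY arithmetic of the previous paragraph can be written as a tiny helper; both hypothetical trajectories below yield the same 0.5 QALYs:

```python
def qalys(trajectory):
    """Sum of (years lived in each health state) x (utility of that state)."""
    return sum(years * utility for years, utility in trajectory)

print(qalys([(0.5, 1.0)]))  # half a year in perfect health -> 0.5
print(qalys([(1.0, 0.5)]))  # one year with utility 0.5     -> 0.5
```

The same helper also handles a life course that passes through several states, simply by summing the contribution of each one.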

– Type of economic analysis. We can choose between four types of economic analysis.

The first is the cost-minimization analysis. It is used when there is no difference in effect between the two options compared, a situation in which it is enough to compare the costs and choose the cheapest. The second is the cost-effectiveness analysis. It is used when the interventions are similar and determines the relationship between the costs and consequences of the interventions in units commonly used in clinical practice (decrease in days of admission, for example). The third is the cost-utility analysis. It is similar to cost-effectiveness, but effectiveness is adjusted for quality of life, so the outcome is the QALY. Finally, the fourth method is the cost-benefit analysis. In this type everything is measured in monetary units, which we usually understand quite well, although it can be a little complicated to express health gains with them.

Analysis of results. The analysis will depend on the type of economic analysis used. In the case of cost-effectiveness studies, it is typical to calculate two measures: the average cost-effectiveness (dividing the cost by the benefit) and the incremental cost-effectiveness (the extra cost per unit of additional benefit obtained with one option with respect to the other). This last parameter is important, since it constitutes a limit on the efficiency of the intervention, which will be chosen or not depending on how much we are willing to pay for an additional unit of effectiveness.
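As a hedged sketch of these two measures (all figures are invented), the calculation is simple division:

```python
# Sketch of the two cost-effectiveness measures mentioned above, with
# hypothetical costs (in euros) and benefits (e.g., QALYs gained).

def average_cost_effectiveness(cost: float, effect: float) -> float:
    """Cost divided by the benefit obtained."""
    return cost / effect

def incremental_cost_effectiveness(cost_new: float, effect_new: float,
                                   cost_old: float, effect_old: float) -> float:
    """Extra cost per additional unit of benefit of one option over the other."""
    return (cost_new - cost_old) / (effect_new - effect_old)

# New treatment: 12,000 euros for 4 QALYs; usual care: 5,000 euros for 3 QALYs
print(average_cost_effectiveness(12000, 4))               # 3000.0 per QALY
print(incremental_cost_effectiveness(12000, 4, 5000, 3))  # 7000.0 per extra QALY
```

The second figure is the one that matters for the decision: each additional QALY of the new option costs 7,000 euros, and we choose it or not depending on whether we are willing to pay that much.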

Sensitivity analysis. As with other types of designs, EVs are not free of uncertainty, generally due to the lack of reliability of the available data. Therefore, it is convenient to evaluate this uncertainty through a sensitivity analysis, checking the stability of the results and how they may change if the main variables vary. An example may be varying the chosen discount rate.

There are five types of sensitivity analysis: univariate (the study variables are modified one by one), multivariate (two or more are modified at once), extremes (we put ourselves in the most optimistic and most pessimistic scenarios for the intervention), threshold (identifying whether there is a critical value above or below which the choice switches to one or the other of the interventions compared) and probabilistic (assuming a certain probability distribution for the uncertainty of the parameters used).
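A univariate analysis like the one just mentioned (varying only the discount rate) might look like this sketch, where the annual cost and time horizon are hypothetical:

```python
# Univariate sensitivity analysis sketch: only the discount rate is modified,
# and we watch how the present value of a stream of future costs changes.
# The annual cost (1,000 euros) and horizon (10 years) are hypothetical.

def present_value(annual_cost: float, years: int, rate: float) -> float:
    """Discounted sum of a constant annual cost over a time horizon."""
    return sum(annual_cost / (1 + rate) ** t for t in range(1, years + 1))

for rate in (0.0, 0.03, 0.05):  # the 0%-5% range usually tested
    print(f"rate {rate:.0%}: {present_value(1000, 10, rate):.2f}")
```

If the conclusion of the study holds across the whole range of rates, the result is stable; if the preferred option changes at some rate, we have found a threshold.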

Conclusion. This is the last section of the development of an EV. The conclusions should take into account two aspects: internal validity (correct analysis for patients included in the study) and external validity (possibility of extrapolating the conclusions to other groups of similar patients).

As we said at the beginning of this post, EVs have a lot of jargon and their own methodological aspects, which makes it difficult for us to make a critical appraisal and reach a correct understanding of their content. But let no one get discouraged: we can do it by relying on our three basic pillars: validity, relevance and applicability.

There are multiple guides that systematically explain how to assess an EV. Perhaps the first to appear was that of the British NICE (National Institute for Clinical Excellence), but others have arisen subsequently, such as that of the Australian PBAC (Pharmaceutical Benefits Advisory Committee) and that of the Canadian CADTH (Canadian Agency for Drugs and Technologies in Health). In Spain we could not be outdone, and the Laín Entralgo Health Technology Assessment Unit also developed an instrument to determine the quality of an EV. This guide establishes recommendations for 17 domains that closely resemble what we have said so far, complemented with a checklist to facilitate the assessment of the quality of the EV.

Anyway, as my usual sufferers know, I prefer to use a simpler checklist that is freely available on the Internet, which is none other than the tool provided by the CASPe group, downloadable from their website. We are going to follow its 11 questions, although without losing sight of the recommendations of the Spanish guide we have just mentioned.

As always, we will start with the VALIDITY, trying to answer two elimination questions first. If the answer is negative, we can put the study aside and devote ourselves to a more productive task.

Is the question or objective of the evaluation well defined? The research question should be clear and define the target population of the study. Three fundamental aspects should also be clear in the objective: the options compared, the perspective of the analysis and the time horizon. Is there a sufficient description of all possible alternatives and their consequences? The actions to follow must be perfectly defined in all the compared options, including who applies each action, where and to whom. The usual approach will be to compare the new technology, at least, with the one in habitual use, always justifying the choice of comparator, especially if it is non-treatment (in the case of pharmacological interventions).

If we have been able to answer these two questions affirmatively, we will move on to the four questions of detail. Is there evidence of the effectiveness of the intervention or of the evaluated program? We will see if there are trials, reviews or other previous studies that prove the effectiveness of the interventions. Think of a cost minimization study, in which we want to know which of two options, both effective, is cheaper: logically, we will need prior evidence of that effectiveness. Are the effects of the intervention (or interventions) identified, measured and appropriately valued or considered? These effects can be measured with simple units, often derived from clinical practice, with monetary units or with more elaborate calculation units, such as the QALYs mentioned above. Are the costs incurred by the intervention (or interventions) identified, measured and appropriately valued? The resources used must be well identified and measured in the appropriate units. The method and source used to assign a value to the resources used must be specified, as we have already mentioned. Finally, were discount rates applied to the costs of the intervention/s? And to the effects? As we already know, this is fundamental when the time horizon of the study is long. In Spain, it is recommended to use a discount rate of 3% for basic resources. When doing the sensitivity analysis, this rate will be tested between 0% and 5%, which will allow comparison with other studies.

Once the internal validity of our EV has been assessed, we will answer the questions regarding the RELEVANCE of the results. Firstly, what are the results of the evaluation? We will review the units that have been used (QALYs, monetary costs, etc.) and whether the incremental benefit analysis has been carried out, where appropriate. The second question in this section refers to whether an adequate sensitivity analysis has been carried out to know how the results would vary with changes in costs or effectiveness. In addition, it is recommended that the authors justify the modifications made with respect to the base case, the choice of the variables that are modified and the method used in the sensitivity analysis. Our Spanish guide recommends carrying out, whenever possible, a probabilistic sensitivity analysis, detailing all the statistical tests performed and the confidence intervals of the results.

Finally, we will assess the APPLICABILITY, or external validity, of our study by answering the last three questions. Would the program be equally effective in our environment? It will be necessary to consider whether the target population, the perspective, the availability of technologies, etc., are applicable to our clinical context. We must then reflect on whether the costs would be transferable to our environment and, finally, whether it would be worth applying the intervention there. This may depend on social, political, economic and population differences, among others, between our environment and that in which the study was carried out.

And with this we are going to finish for today. Even if your head is spinning after all we have said, believe me when I tell you that we have done nothing but scratch the surface of this stormy world of economic evaluation studies. We have not discussed anything, for example, about the statistical methods that can be used in sensitivity analyses, which can become complicated, nor about studies using modeling, which employ techniques only available to privileged minds, like Markov chains, stochastic models or discrete event simulation models, to name a few. Neither have we talked about the type of studies on which economic evaluations are based. These can be experimental or observational studies, but they have a series of peculiarities that differentiate them from other studies of similar design but different function. This is the case of clinical trials that incorporate an economic evaluation (also known as piggy-back clinical trials), which tend to have a more pragmatic design than conventional trials. But that is another story…

## King Kong versus Godzilla

What a mess these two make when they are let loose and come together! In this story, almost as old as me (please, do not run off to look up what year the movie was made), poor King Kong, who must have traveled more than Tarzan, leaves his Skull Island to defend a village from an evil giant octopus and drinks a potion that leaves him sound asleep. Then, some Japanese gentlemen seize the opportunity to take him to their country. I, who have visited Japan, can imagine the effect it produced on the poor ape when he woke up, so he had no choice but to escape, with the misfortune of meeting Godzilla, who had also escaped from an iceberg where he had previously been frozen. And so the fight begins, stones over here, atomic rays over there, until the thing gets out of control and King Kong ends up attacking Tokyo, I do not remember exactly why. I swear I have not taken any hallucinogens, the film is like that, and I will not reveal more so as not to spoil the ending in the unlikely case that you want to see the film after what I have told you. What I do not know is what the screenwriters had taken before planning this story.

At this point you will be thinking about how today’s post may be related to this story. Well, the truth is that it has nothing to do with what we are going to talk about, but I could not think of a better way to start. Well, it may actually be related, because today we are going to talk about a family of monsters within epidemiological studies: the ecological studies. It’s funny that when you read something about ecological studies, it always starts by saying that they are simple. Well, I do not think so. The truth is that they have a lot to get our teeth into and we are going to try to explain them in a simple way. I thank my friend Eduardo (to whom I dedicate this post) for the effort he made to describe them intelligibly. Thanks to him I could understand them. Well… a little bit.

Ecological studies are observational studies that have the peculiarity that the study population is not made up of individual subjects, but of grouped subjects (clusters), so the level of inference of their estimates is also aggregated. They tend to be cheap and quick to perform (hence, I suppose, their supposed simplicity), since they usually use data from secondary sources already available, and they are very useful when it is not possible to measure the exposure at the individual level or when the effect can only be measured at the population level (such as the results of a vaccination campaign, for example).

The problem comes when we want to make inferences at the individual level based on their results, since they are subject to a series of biases that we will comment on later. In addition, since they tend to be descriptive studies of historical temporality, it can be difficult to establish the temporal sequence between the exposure and the effect studied.

We will look at the specific characteristics of three aspects of their methodology: types of variables and analyses, types of studies and biases.

Ecological variables are classified into aggregate and environmental variables (also called global variables). The aggregate ones summarize individual observations. They are usually averages or proportions, such as the mean age at which King Kong's first movie is seen or the rate of geeks per 1000 moviegoers, to name two absurd examples.

On the other hand, environmental measures are characteristic of a specific place. These can have a parallel at the individual level (for example, the levels of environmental pollution, related to the crap that each of us swallows) or be attributes of groups without equivalence at the individual level (such as water quality, to name one).

As for the analysis, it can be done at the aggregate level, using data from groups of participants, or at the individual level, but preferably without mixing the two types. Moreover, if data of both types are collected, it will be more convenient to transform them into a single level, the simplest way being to aggregate the individual data, although it can also be done the other way around and we can even analyze at both levels with hierarchical multilevel statistical techniques, only afforded by a few privileged minds.
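The simplest route mentioned, collapsing individual data to the aggregate level, is just a grouped summary. A sketch with invented data:

```python
# Collapsing individual records to the group level, the simplest way of
# unifying the analysis level mentioned above. Groups and outcomes invented.
from collections import defaultdict

individuals = [
    ("region_A", 1), ("region_A", 0), ("region_A", 1),
    ("region_B", 0), ("region_B", 0), ("region_B", 1),
]  # (group, presents the outcome: 1/0)

counts = defaultdict(lambda: [0, 0])  # group -> [cases, subjects]
for group, outcome in individuals:
    counts[group][0] += outcome
    counts[group][1] += 1

rates = {g: cases / n for g, (cases, n) in counts.items()}
print(rates)  # aggregate rates per group, ready for an ecological analysis
```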

Obviously, the level of inference we want to apply will depend on our objective. If we want to study the effects of a risk factor at the individual level, the inference will be individual. An example would be to study the relationship between the number of hours of television watched and the incidence of brain cancer. On the other hand, and following a very pediatric example, if we want to know the effectiveness of a vaccine, the inferences will be made in aggregate form from the data on vaccination coverage in the population. And, to top it all off, we can measure an exposure factor in both ways, individual and grouped; for example, the density of Mexican restaurants in a population and the frequency of antacid intake. In this case we would make a contextual inference.

Regarding the type of ecological studies, we can classify them according to the exposure method and the grouping method.

According to the exposure method, the thing is relatively simple and we can find two types of studies. If we do not measure the exposure variable, or we do it partially, we talk about exploratory studies. In the opposite case, we will find ourselves before an analytical study.

According to the grouping method, we can consider three types: multiple (several zones are selected), temporal (there is measurement over time) and mixed (a combination of both).

The complexity begins when the two dimensions (exposure and grouping) are combined, since then we can find ourselves before a series of more complex designs. Thus, multiple-group studies can be exploratory (the exposure factor is not measured, but the effect is) or analytical (the most frequent; here we measure both). Temporal trend studies, not to be outdone, can also be exploratory or analytical, in a similar way to the previous ones but with a temporal trend. Finally, there are mixed studies, which compare the temporal trends of several geographical areas. Simple, isn't it?

Well, this is nothing compared to the complexity of the statistical techniques used in these studies. Until recently the analyses were very simple, based on measures of association or linear correlation, but recent times have seen the development of numerous techniques based on regression models and more exotic things such as multiplicative log-linear models or Poisson regression. The merit of all these techniques is that, based on the grouped measures, they allow us to estimate how many exposed or unexposed subjects present the effect, thus allowing the calculation of rates, attributable fractions, etc. Do not fear, we will not go into detail, but there is bibliography available for those who want to dig deeper.

To finish with the methodological aspects of ecological studies, we will list some of their most characteristic biases, favored by the use of aggregate units of analysis.

The most famous of all is the ecological bias, also known as the ecological fallacy. This occurs when the grouped measure does not reflect the biological effect at the individual level, in such a way that the individual inference made is erroneous. This bias became famous with the New England Journal of Medicine study that concluded that there was a relationship between chocolate consumption and Nobel prizes. The problem is that, beyond the funny side of this example, the ecological fallacy is the main limitation of this type of study.
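The fallacy is easy to reproduce with toy numbers. In the sketch below (all data invented), the correlation between the group averages is essentially +1 while the relationship within every group is essentially -1, so an individual-level inference drawn from the aggregate result would point in exactly the wrong direction:

```python
# Toy demonstration of the ecological fallacy: aggregate and individual
# correlations can have opposite signs. All data are invented.

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

groups = {
    "A": ([1, 2, 3], [10, 9, 8]),
    "B": ([11, 12, 13], [20, 19, 18]),
    "C": ([21, 22, 23], [30, 29, 28]),
}

# Within every group the individual-level correlation is negative...
for name, (xs, ys) in groups.items():
    print(name, pearson(xs, ys))  # close to -1 each time

# ...yet the correlation between the group means is positive
mean_x = [sum(xs) / len(xs) for xs, _ in groups.values()]
mean_y = [sum(ys) / len(ys) for _, ys in groups.values()]
print(pearson(mean_x, mean_y))  # close to +1
```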

Another bias that has some peculiarities in this type of study is confounding bias. In studies dealing with individual units, confounding occurs when a third variable is related to both the exposure and the effect without being part of the causal pathway between the two. This ménage à trois is a bit more complex in ecological studies: a factor can behave as a confounder at the ecological level but not at the individual level and, vice versa, it is possible that confounding factors at the individual level do not produce confounding at the aggregate level. In any case, as in the rest of the studies, we must try to control the confounding factors, for which there are two fundamental approaches.

The first one is to include the possible confounding variables in the mathematical model as covariates and perform a multivariate analysis, although this complicates the study of the effect. The second one is to adjust or standardize the rates of the effect by the confounding variables and build the regression model with the adjusted rates. To be able to do this, it is essential that all the variables introduced in the model be adjusted for the same confounding variables and that the covariances of the variables be known, which does not always happen. In any case, and not to discourage anyone, many times we cannot be sure that the confounding factors have been adequately controlled, even using the most recent and sophisticated multilevel analysis techniques, since the origin may lie in unknown characteristics of the distribution of data among groups.

Other gruesome aspects of ecological studies are the temporal ambiguity bias (as we have already commented, it is often difficult to ensure that the exposure precedes the effect) and collinearity (the difficulty of assessing the effects of two or more exposures that can occur simultaneously). In addition, although these are not specific to ecological studies, they are very susceptible to information biases.

You can see that I was right at the beginning when I told you that ecological studies are many things, but not simple. In any case, it is worth understanding what their methodology is based on because, with the development of new analysis techniques, they have gained in prestige and power, and it is more than likely that we will meet them more and more frequently.

But do not despair. The important thing for us, consumers of medical literature, is to understand how they work so that we can make a critical appraisal of the articles when we come across them. Although, as far as I know, there are no checklists as structured as those CASP provides for other designs, the critical appraisal will follow the usual general scheme based on our three pillars: validity, relevance and applicability.

The study of VALIDITY will be done in a similar way to other types of cross-sectional observational studies. The first thing will be to check that there is a clear definition of the population and the exposure or effect under study. The units of analysis and their level of aggregation will have to be clearly specified, as well as the methods of measuring the effect and exposure, the latter, as we already know, only in analytical studies.

The sample of the study should be representative, for which we will have to review the selection procedures, the inclusion and exclusion criteria and its size. These data will also influence the external validity of the results.

As in any observational study, the measurement of exposure and effect should be done blindly and independently, using valid instruments. The authors must present the data completely, accounting for any losses or out-of-range values. Finally, there must be a correct analysis of the results, with control of the typical biases of these studies: ecological, information, confounding, temporal ambiguity and collinearity.

In the RELEVANCE section we can begin with a quantitative assessment, summarizing the most important result and reviewing the magnitude of the effect. We must look for, or calculate ourselves if possible, the most appropriate impact measures: differences in incidence rates, attributable fraction in the exposed, etc. If the authors do not offer these data but do provide the regression model, it is possible to calculate the impact measures from the coefficients of the independent variables of the model. I am not going to put the list of formulas here, so as not to make this post even more unfriendly, but you know they exist in case you need them one day.

Then we will make a qualitative assessment of the results, trying to judge the clinical interest of the main outcome measure, the interest of the effect size and the impact it may have for the patient, the system or society.

We will finish this section with a comparative assessment (looking for similar studies and comparing the main outcome measure and other alternative measures) and an assessment of the relationship between benefits, risks and costs, as we would do with any other type of study.

Finally, we will consider the APPLICABILITY of the results in clinical practice, taking into account aspects such as adverse effects, economic cost, etc. We already know that the fact that the study is well done does not mean that we have to apply it obligatorily in our environment.

And here we are going to leave it for today. When you read or do an ecological study, be careful not to fall into the temptation of drawing causality conclusions. Regardless of the pitfalls that the ecological fallacy may have for you, ecological studies are observational, so they can be used to generate hypotheses of causality, but not to confirm them.

And now we’re leaving. I did not tell you who won the fight between King Kong and Godzilla so as not to be a spoiler, but surely the smartest of you have already imagined it. After all, and to its disgrace, only one of the two later traveled to New York. But that is another story…

## The crystal ball

How I wish I could predict the future! And not only to win millions in the lottery, which is the first thing that comes to mind. There are more important things in life than money (or so some say): decisions that we make based on assumptions that end up not being fulfilled and that complicate our lives to unsuspected limits. We have all thought at some point about "if I could live twice…". I have no doubt that, if I met the genie of the lamp, one of the three wishes I would ask for would be a crystal ball to see the future.

And it would also serve us well in our work as doctors. In our day-to-day we are forced to make decisions about the diagnosis or prognosis of our patients, and we always do it on the swampy terrain of uncertainty, always assuming the risk of making some mistake. We, especially as we gain experience, consciously or unconsciously estimate the likelihood of our assumptions, which helps us make diagnostic or therapeutic decisions. However, it would be good to also have a crystal ball to know the patient's clinical course more accurately.

The problem, as with other inventions that would be very useful in medicine (like the time machine), is that nobody has yet managed to manufacture a crystal ball that really works. But let us not get discouraged. We cannot know for sure what will happen, but we can estimate the probability that a certain result will occur.

For this, we can use all those variables related to the patient that have a known diagnostic or prognostic value and integrate them to perform the calculation of probabilities. Well, doing such a thing would be the same as designing and applying what is known as a clinical prediction rule (CPR).

Thus, if we get a little formal, we can define a CPR as a tool composed of a set of variables of clinical history, physical examination and basic complementary tests, which provides us with an estimate of the probability of an event, suggesting a diagnosis or predicting a concrete response to a treatment.

The critical appraisal of an article about a CPR shares aspects with that of articles about diagnostic tests, and also has specific aspects related to the methodology of its design and application. For this reason, we will briefly review the methodological aspects of CPRs before entering into their critical appraisal.

In the process of developing a CPR, the first thing to do is to define it. The four key elements are the study population, the variables that we will consider as potentially predictive, the gold or reference standard that classifies whether the event we want to predict occurs or not and the criterion of assessment of the result.

It must be borne in mind that the variables we choose must be clinically relevant, they must be collected accurately and, of course, they must be available at the time we want to apply the CPR for decision making. It is advisable not to fall into the temptation of adding variables left and right since, apart from complicating the application of the CPR, it can decrease its validity. In general, it is recommended that for every variable introduced in the model there should have been at least 10 events of the kind we want to predict (the design is made in a certain sample whose components all have the variables, but only some of whom end up presenting the event to be predicted).

I would also like to highlight the importance of the gold standard. There must be a diagnostic test or a set of well-defined criteria that allow us to clearly define the event we want to predict with the CPR.

Finally, it is convenient that those who collect the variables during this definition phase are unaware of the results of the gold standard, and vice versa. The absence of blinding decreases the validity of the CPR.

The next step is the derivation or design phase itself. This is where the statistical methods are applied that allow us to include the predictive variables and exclude those that will not contribute anything. We will not go into the statistics; suffice it to say that the most commonly used methods are based on logistic regression, although discriminant analysis, survival analysis and even more exotic approaches based on discriminant risks or neural networks can be used, only afforded by a few virtuous ones.

In logistic regression models, the event will be the dichotomous dependent variable (it happens or it does not) and the other variables will be the predictive or independent variables. Thus, the coefficient that multiplies each predictive variable will be the natural logarithm of the adjusted odds ratio. In case anyone has not understood: the adjusted odds ratio for each predictive variable is calculated by raising the number "e" to the value of that variable's coefficient in the regression model.
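In code terms the relationship is a single exponentiation; the coefficient below is, of course, hypothetical:

```python
import math

# The regression coefficient of a predictive variable is the natural log of
# its adjusted odds ratio, so the OR is recovered by exponentiating it.
# The coefficient value here is hypothetical.
coefficient = 0.693  # ln(OR) for some predictive variable
adjusted_or = math.exp(coefficient)
print(round(adjusted_or, 2))  # 2.0: the odds of the event double per unit of the variable

# And the other way round: from an odds ratio back to its coefficient
print(round(math.log(2.0), 3))  # 0.693
```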

The usual thing is that a certain score is assigned on a scale according to the weight of each variable, so that the total sum of the points of all the predictive variables allows us to classify the patient into a specific range of predicted probability of the event. There are also other, more complex methods using regression equations, but in the end you always get the same thing: an individualized estimate of the probability of the event in a particular patient.

With this process we categorize patients into homogeneous probability groups, but we still need to know whether this categorization fits reality or, in other words, what the discrimination capacity of the CPR is.

The overall validity or discrimination capacity of the CPR will be assessed by contrasting its results with those of the gold standard, using techniques similar to those used to assess the power of diagnostic tests: sensitivity, specificity, predictive values and likelihood ratios. In addition, in cases where the CPR provides a quantitative estimate, we can resort to ROC curves, since the area under the curve will represent the global validity of the CPR.
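With a hypothetical 2x2 table contrasting the rule's dichotomized prediction with the gold standard, these indices come straight from the usual formulas:

```python
# Contrasting a CPR's dichotomized prediction against the gold standard,
# exactly as with a diagnostic test. The 2x2 counts are hypothetical.

def cpr_performance(tp: int, fp: int, fn: int, tn: int):
    """Sensitivity, specificity and likelihood ratios from a 2x2 table."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return sensitivity, specificity, lr_positive, lr_negative

sens, spec, lr_pos, lr_neg = cpr_performance(tp=80, fp=10, fn=20, tn=90)
print(f"sensitivity {sens:.2f}, specificity {spec:.2f}")  # 0.80, 0.90
print(f"LR+ {lr_pos:.1f}, LR- {lr_neg:.2f}")              # 8.0, 0.22
```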

The last step of the design phase will be the calibration of the CPR, which is nothing more than checking its good behavior throughout the range of possible results.

Some CPR authors stop here, but they forget two fundamental steps of the elaboration: the validation and the calculation of the clinical impact of the rule.

The validation consists of testing the CPR in samples different from the one used for its design. We may be in for a surprise and find that a rule that works well in a certain sample does not work in another. Therefore, it must be tested not only in similar patients (limited validation), but also in different clinical settings (broad validation), which will increase the external validity of the CPR.

The last phase is to check its clinical performance. This is where many CPRs fall apart after having passed all the previous steps (maybe that is why this last check is often skipped). To assess the clinical impact, we will have to apply the CPR to our patients and see how clinical outcome measures such as survival, complications, costs, etc., change. The ideal way to analyze the clinical impact of a CPR is to conduct a clinical trial with two groups of patients managed with and without the rule.

For those self-sacrificing souls who are still reading: now that we know what a CPR is and how it is designed, let us see how the critical appraisal of these studies is done. For this, as usual, we will rely on our three pillars: validity, relevance and applicability. So as not to forget anything, we will follow the questions listed in the CASP tool's grid for CPR studies.

Regarding VALIDITY, we will start with some elimination questions. If the answer to these is negative, it may be time to wait until someone finally invents a crystal ball that works.

Does the rule answer a well-defined question? The population, the event to be predicted, the predictive variables and the outcome evaluation criteria must be clearly defined. If this is not done or these components do not fit our clinical scenario, the rule will not help us. The predictive variables must be clinically relevant, reliable and well defined in advance.

Did the study population from which the rule was derived include an adequate spectrum of patients? It must be verified that the method of patient selection is adequate and that the sample is representative. In addition, it must include patients from the entire spectrum of the disease. As with diagnostic tests, events may be easier to predict in certain groups, so all of them must be represented. Finally, we must see if the rule was validated in a different group of patients. As we have already said, it is not enough that the rule works in the group of patients in which it was derived; it must be tested in other groups, similar to or different from those with which it was generated.

If the answer to these three questions has been affirmative, we can move on to the next three. Was there a blind evaluation of the outcome and of the predictor variables? As we have already commented, it is important that the person who collects the predictive variables does not know the result of the reference standard, and vice versa. The collection of information must be prospective and independent. The next thing to ask is whether the predictor variables and the outcome were measured in all the patients. If the outcome or the variables are not measured in all patients, the validity of the CPR can be compromised. In any case, the authors should explain the exclusions, if there are any. Finally, are the methods of derivation and validation of the rule described? We already know that it is essential that the results of the rule be validated in a population different from the one used for the design.

If the answers to the previous questions indicate that the study is valid, we will answer the questions about the RELEVANCE of the results. The first is whether you can calculate the performance of the CPR. The results should be presented with their sensitivity, specificity, odds ratios, ROC curves, etc., depending on the result provided by the rule (scoring scales, regression formulas, etc.). All these indicators will help us to calculate the probabilities of occurrence of the event in environments with different prevalence. This is similar to what we did with studies of diagnostic tests, so I invite you to review the post on the subject so as not to repeat too much. The second question is: what is the precision of the results? We will not go into detail here either: remember our revered confidence intervals, which will inform us of the precision of the rule's results.

To finish, we will consider the APPLICABILITY of the results to our environment, for which we will try to answer three questions. Will the reproducibility of the CPR and its interpretation be satisfactory in our setting? We will have to think about the similarities and differences between the field in which the CPR was developed and our clinical environment. In this sense, it will help if the rule has been validated in several samples of patients from different settings, which will increase its external validity. Is the test acceptable in this case? We will think about whether the rule is easy to apply in our environment and whether it makes sense to apply it from the clinical point of view. Finally, will the results modify clinical behavior, health outcomes or costs? If, from our point of view, the results of the CPR are not going to change anything, the rule will be useless and a waste of time. Here our own opinion will matter, but we must also look for studies that assess the impact of the rule on costs or on health outcomes.

And that is everything I wanted to tell you about the critical appraisal of studies on CPRs. Anyway, before finishing I would like to tell you a little about a checklist that, of course, also exists for the appraisal of this type of study: the CHARMS checklist (CHecklist for critical Appraisal and data extraction for systematic Reviews of prediction Modeling Studies). You will not tell me that the name, although a bit fancy, is not lovely.

This list is designed to assess the primary studies of a systematic review on CPRs. It tries to answer some general design questions and assesses 11 domains to extract enough information to perform the critical appraisal. The two main aspects assessed are the risk of bias of the studies and their applicability. The risk of bias refers to design or validation flaws that may make the model less discriminative, excessively optimistic, etc. Applicability, on the other hand, refers to the degree to which the primary studies match the question that motivates the systematic review, telling us whether the rule can be applied to the target population. This list is good and helps to assess and understand the methodological aspects of this type of study but, in my humble opinion, it is easier to make a systematic critical appraisal using the CASP tool.

And here, finally, we leave it for today. So as not to go on too long, we have said nothing about what to do with the result of the rule. The fundamental thing, as we already know, is that we can calculate the probability of occurrence of the event in individual patients from environments with different prevalences. But that is another story…

## Doc, is this serious?

I wonder how many times I have heard this question or one of its many variants. Because it turns out that we are always thinking about clinical trials and clinical questions about diagnosis and treatment, but think about whether a patient ever asked you if the treatment you were proposing was endorsed by a randomized controlled trial that meets the criteria of the CONSORT statement and has a good score on the Jadad scale. I can say, at least, that it has never happened to me. But they do ask me daily what will happen to them in the future.

And here lies the relevance of prognostic studies. Note that you cannot always cure and that, unfortunately, many times all we can do is accompany and relieve and, if possible, soften the announcement of serious sequelae or death. But it is essential to have good quality information about the future course of our patient’s disease. This information will also serve to calibrate therapeutic efforts in each situation depending on the risks and benefits. Besides, prognostic studies are used to compare results between different departments or hospitals. Nobody can come up and say that one hospital is worse than another because its mortality is higher without first checking that the prognosis of their patients is similar.

Before getting into the critical appraisal of prognostic studies, let’s clarify the difference between a risk factor and a prognostic factor. A risk factor is a characteristic of the environment or the subject that favors the development of the disease, while a prognostic factor is one that, once the disease occurs, influences its evolution. Risk factor and prognostic factor are different things, although sometimes they can coincide. What the two do share is the same type of study design. The ideal would be to use clinical trials, but most of the time it is neither feasible nor ethical to randomize prognostic or risk factors. Suppose we want to demonstrate the deleterious effect of booze on the liver. The approach with the highest degree of evidence would be to make two random groups of participants and give 10 whiskeys a day to the participants of one arm and some water to those of the other, to see the differences in liver damage after a year, for example. However, it is evident to anyone that we cannot do a clinical trial like this. Not because we could not find subjects for the intervention arm, but because ethics and common sense prevent us from doing it.

For this reason, it is usual to use cohort studies: we would study what liver differences there may be between individuals who, by their own choice, drink alcohol and those who do not. In cases that require very long follow-ups, or in which the effect we want to measure is very rare, case-control studies can be used, but they will always provide weaker evidence because they carry a higher risk of bias. Following our ethylic example, we would study people with and without liver damage and see whether either of the two groups had been exposed to alcohol.

A prognostic study should inform us of three aspects: what outcome we evaluate, how likely it is to happen, and over what time frame we expect it to happen. And to appraise it, as always, we will rely on our three pillars: validity, relevance and applicability.

To assess the VALIDITY, we’ll first consider whether the article meets a set of primary or elimination criteria. If the answer is no, we had better throw the paper away and go read the latest nonsense our Facebook friends have written on our wall.

Is the study sample well defined and is it representative of patients at a similar stage of the disease? The sample, usually called the initial or inception cohort, should be formed by a group of patients at the same stage of the disease, ideally at its onset, and it should be followed up prospectively. The type of patients included, the criteria used to diagnose them and the method of selection should be well specified. We must also verify that the follow-up has been long enough and complete enough to observe the event we are studying. Each participant has to be followed up from entry until the end of the study, whether that comes because they are cured, because they present the event or because the study finishes. It is very important to take into account losses during the study, which are very common in designs with long follow-up. The study should provide the characteristics of the patients lost and the reasons for the losses. If they are similar to those of the patients not lost during follow-up, we can still get valid results. If the number of patients lost to follow-up is greater than 20%, a sensitivity analysis is usually done using the worst possible scenario, which assumes that all losses have had a poor prognosis; the results are then recalculated to check whether they change, in which case the study results could be invalidated.
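The worst-case check for losses can be sketched in a few lines of Python. This is a minimal illustration with made-up numbers (the function name and the figures are mine, not from any study): we compare the event rate among completers with the rate we would get if every lost patient had had the event.

```python
def worst_case_event_rate(events, completed, lost):
    """Observed event rate among patients who completed follow-up,
    versus the worst-case rate that assumes every patient lost to
    follow-up also had the event (poor prognosis)."""
    observed = events / completed
    worst = (events + lost) / (completed + lost)
    return observed, worst

# Hypothetical cohort: 200 patients finish follow-up with 30 events,
# and 60 are lost (more than 20%, so the check is warranted)
obs, worst = worst_case_event_rate(events=30, completed=200, lost=60)
```

If the conclusions of the study survive the jump from `obs` to `worst`, the losses are less worrying; if they do not, the validity of the results is in doubt.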

Once these two aspects have been assessed, we turn to the secondary criteria concerning internal validity or scientific rigor.

Were outcomes measured objectively and without bias? It must be clearly specified what is being measured and how, before starting the study. In addition, in order to avoid information bias, outcomes should ideally be measured by a researcher who is blinded, that is, who does not know whether the subject in question is exposed to any of the prognostic factors.

Were the results adjusted for all relevant prognostic factors? We must take into account all the confounding variables and prognostic factors that may influence the results. Factors already known from previous studies can be taken into account directly. Otherwise, the authors will have to estimate their effects using stratified data analysis (the simplest method) or multivariate analysis (more powerful and complex), usually through a proportional hazards model or Cox regression analysis. Although we are not going to talk about regression models now, there are two simple aspects that we can take into account. First, these models need a certain number of events per variable included in the model, so distrust those in which many variables are analyzed, especially with small samples. Second, the variables included are decided by the authors and differ from one work to another, so we will have to assess whether they have left out any that may be relevant to the final result.
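The first of those two checks is easy to apply yourself. A minimal sketch, assuming the common rule of thumb of roughly 10 events per candidate predictor (the function name and the example figures are mine):

```python
def events_per_variable(n_events, n_predictors):
    """Events-per-variable (EPV) check for a multivariable model.
    A frequent rule of thumb distrusts models with fewer than
    about 10 events per candidate predictor."""
    return n_events / n_predictors

# Hypothetical paper: 45 events but 9 variables in the Cox model
epv = events_per_variable(n_events=45, n_predictors=9)  # 5.0: distrust
```

An EPV of 5, as here, suggests the model is probably overfitted, whatever the p-values in the paper say.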

Were the results validated in other groups of patients? When we set up groups of variables and make multiple comparisons, we run the risk that chance plays a trick on us and shows us associations that don’t exist. This is why, when a risk factor is described in one group (the training or derivation group), the results should be replicated in an independent group (the validation group) to be really sure about the effect.

Now we must consider what the results are to determine their RELEVANCE. For this, we’ll check if the probability of the outcome of the study is estimated and provided by the authors, as well as the accuracy of this estimate and the risk associated with the factors influencing the prognosis.

Is the probability of the event specified for a given period of time? There are several ways to present the number of events occurring during the follow-up period. The simplest would be to provide an incidence rate (events per person per unit of time) or the cumulative frequency at a given time. Another indicator is the median survival, which is simply the point in the follow-up at which the event has happened in half of the cohort participants (remember that, although we speak about survival, the event need not necessarily be death).
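The two simplest indicators can be computed directly. A small sketch with invented data (function names and numbers are mine): the incidence rate divides events by person-time, and the median survival is read off the sorted event times.

```python
def incidence_rate(events, person_years):
    """Events per person-year of follow-up."""
    return events / person_years

def median_survival(event_times, n_total):
    """Time at which half of the starting cohort has had the event;
    returns None if fewer than half ever present it.
    event_times: times of the events actually observed."""
    half = n_total / 2
    for i, t in enumerate(sorted(event_times), start=1):
        if i >= half:
            return t
    return None

# Hypothetical cohort of 10 patients followed for 120 person-years,
# with events at these times (the other 4 never present the event)
rate = incidence_rate(events=6, person_years=120)       # 0.05 per person-year
med = median_survival([2, 3, 5, 7, 8, 12], n_total=10)  # 5th event, at t = 8
```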

We can use survival curves of various kinds to determine the probability of the occurrence of the event in each period and the rate at which it presents. Actuarial or life tables are used for larger samples when we do not know the exact time of the event and we use fixed time periods. However, the most often used are the Kaplan-Meier curves, which better measure the probability of the event at each particular time with smaller samples. This method can provide hazard ratios and median survival, as well as other parameters according to the regression model used.
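To see what a Kaplan-Meier curve actually computes, here is a minimal product-limit estimator in plain Python, on invented data (the function name is mine; real analyses would use a statistical package): at each event time, survival is multiplied by the fraction of patients at risk who do not present the event, and censored patients simply leave the risk set.

```python
def kaplan_meier(times, events):
    """Product-limit (Kaplan-Meier) survival estimate.
    times: follow-up time of each subject.
    events: 1 if the subject presented the event, 0 if censored.
    Returns a list of (time, S(time)) at each distinct event time."""
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(set(times)):
        # events at this exact time
        d = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        if d:
            surv *= 1 - d / at_risk
            curve.append((t, surv))
        # everyone with this time (event or censored) leaves the risk set
        at_risk -= sum(1 for ti in times if ti == t)
    return curve

# Tiny hypothetical cohort: events at t=1, 2 and 4; censored at t=3 and 5
curve = kaplan_meier(times=[1, 2, 3, 4, 5], events=[1, 1, 0, 1, 0])
```

Note how the censored subject at t=3 does not drop the curve but does shrink the denominator for the event at t=4, which is exactly why Kaplan-Meier handles losses better than a naive cumulative frequency.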

To assess the precision of the results we will look, as always, at the confidence intervals. The wider the interval, the less precise the estimate of the probability of occurrence in the general population, which is what we really want to know. Keep in mind that the number of patients generally decreases as time passes, so it is usual for survival curves to be more precise at the beginning than at the end of the follow-up. Finally, we will assess the factors that modify the prognosis. The right thing is to present all the variables that may influence the prognosis with their corresponding relative risks, which will allow us to evaluate the clinical significance of the association.

Finally, we must consider the APPLICABILITY of the results. Do they apply to my patients? We will look for similarities between the study patients and ours and assess whether the differences we find allow us to extrapolate the results to our practice. But besides, are the results useful? The fact that they’re applicable doesn’t necessarily mean that we have to implement them. We have to assess carefully if they’re going to help us to decide what treatment to apply and how to inform our patients and their families.

As always, I recommend you use a template, such as those provided by CASP, to make a systematic critical appraisal without leaving any important matter unassessed.

You can see that articles about prognosis have a lot to say. And we have hardly talked about regression models and survival curves, which are often the statistical core of this type of article. But that’s another story…

## You have to know what you are looking for

Every day we find articles that present new diagnostic tests that appear to have been designed to solve all our problems. But we should not be tempted to give credit to everything we read before reconsidering what we have, in fact, read. At the end of the day, if we believed everything we read, we would be swollen from drinking Coca-Cola.

We know that a diagnostic test is not going to say whether or not a person is sick. Its result will only allow us to increase or decrease the probability that the individual is sick or not so we can confirm or rule out the diagnosis, but always with some degree of uncertainty.

Everyone has a certain risk of suffering from any disease, which is nothing more than the prevalence of the disease in the general population. Below a certain level of probability, it seems so unlikely that the patient is sick that we leave him alone and do not run any diagnostic tests (although some find it hard to restrain the urge to always ask for something). This is the diagnostic or test threshold.

But if, in addition to belonging to the population, one has the misfortune of having symptoms, that probability will increase until the threshold is exceeded, at which point the probability of presenting the disease justifies performing diagnostic tests. Once we have the result of the test we have chosen, the probability (the post-test probability) will have changed. It may have decreased and fallen below the test threshold, so we discard the diagnosis and leave the patient alone again. It may also exceed another threshold, the therapeutic one, at which the probability of the disease reaches a level sufficient to make further tests unnecessary and allow us to initiate treatment.

The usefulness of the diagnostic test lies in its ability to reduce the probability below the test threshold (and discard the diagnosis) or, on the contrary, to increase it to the threshold at which it is justified to start treatment. Of course, sometimes the test leaves us halfway and we have to do additional tests before confirming the diagnosis with enough certainty to start the treatment.
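This two-threshold logic can be sketched numerically. A minimal Python illustration with made-up figures (the function names, the thresholds and the likelihood ratio are all hypothetical): the pretest probability is converted to odds, multiplied by the likelihood ratio, converted back, and then compared against the two thresholds.

```python
def posttest_probability(pretest, lr):
    """Update a disease probability with a likelihood ratio via odds."""
    odds = pretest / (1 - pretest)
    post_odds = odds * lr
    return post_odds / (1 + post_odds)

def decision(posttest, test_threshold, treat_threshold):
    """Below the test threshold we discard the diagnosis; above the
    therapeutic threshold we treat; in between we keep testing."""
    if posttest < test_threshold:
        return "discard diagnosis"
    if posttest >= treat_threshold:
        return "start treatment"
    return "more tests needed"

# Hypothetical patient: pretest probability 0.3, positive test with LR+ = 8
p = posttest_probability(0.3, 8)        # about 0.77
verdict = decision(p, test_threshold=0.05, treat_threshold=0.7)
```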

Diagnostic test studies should provide information about the ability of a test to produce the same results when performed under similar conditions (reliability) and about the accuracy with which its measurements reflect what they intend to measure (validity). But they also give us data about their discriminatory power (sensitivity and specificity), their clinical performance (positive and negative predictive values), their ability to modify the probability of illness and move us between the two thresholds (likelihood ratios), and about other aspects that allow us to assess whether it is worthwhile to test our patients with the diagnostic test in question. And to check whether a study gives us the right information, we need to make a critical appraisal and read the paper based on our three pillars: validity, relevance and applicability.
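All the indicators listed above come from the same 2x2 table of test result against reference standard. A short sketch with an invented table (the counts and the function name are mine, chosen only for illustration):

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard indicators from a 2x2 table: tp/fp/fn/tn are the
    true positives, false positives, false negatives, true negatives."""
    sens = tp / (tp + fn)           # discriminatory power in the sick
    spec = tn / (tn + fp)           # discriminatory power in the healthy
    ppv = tp / (tp + fp)            # clinical performance (depends on prevalence)
    npv = tn / (tn + fn)
    lr_pos = sens / (1 - spec)      # likelihood ratios: prevalence-independent
    lr_neg = (1 - sens) / spec
    return {"sens": sens, "spec": spec, "ppv": ppv,
            "npv": npv, "lr+": lr_pos, "lr-": lr_neg}

# Hypothetical study: 100 diseased, 200 healthy subjects
m = diagnostic_metrics(tp=90, fp=30, fn=10, tn=170)
```

Note that the predictive values would change if this same test were taken to a population with a different prevalence, while sensitivity, specificity and the likelihood ratios would not.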

Let’s start with VALIDITY. First, we’ll ask ourselves some basic elimination questions about the primary criteria of the study. If the answer to these questions is no, probably the best you can do is use the article to wrap your mid-morning snack.

Was the diagnostic test blindly and independently compared with an appropriate gold standard or reference test? We must check that the results of the reference test were not interpreted differently depending on the results of the study test, which would be an incorporation bias that could invalidate the results. Another problem that can arise is that the reference test results are frequently inconclusive. If we made the mistake of excluding those doubtful cases, we would commit an indeterminate exclusion bias which, in addition to overestimating the sensitivity and specificity of the test, would compromise the external validity of the study, whose conclusions would then only be applicable to patients with conclusive results.

Do the patients encompass a spectrum similar to the one we will find in our practice? The inclusion criteria of the study should be clear, and the study must include healthy subjects and diseased subjects at varying stages of severity or progression of the disease. As we know, the prevalence influences the clinical performance of the test, so if it is validated, for example, in a tertiary center (where the probability of being sick is statistically greater), its diagnostic capabilities will be overestimated when we use the test in a Primary Care center or with the general population (where the proportion of diseased will be lower).

At this point, if we think it is worth reading further, we’ll focus on the secondary criteria, which are those that add value to the study design. Another question to ask is: did the results of the study test have any influence on the decision to do the reference test? We have to check that there has not been a sequence bias or a diagnostic verification bias, whereby patients with a negative study test are excluded from verification. Although this is common in current practice (we start with simple tests and perform the more invasive ones only in positive patients), doing so in a diagnostic test study affects the validity of the results. Both tests should be done independently and blindly, so that the subjectivity of the observer does not influence the results (review bias). Finally, is the method described in enough detail to allow its reproduction? It should be clear what is considered normal and abnormal, what criteria we have used to define them and how we have interpreted the results of the test.

Having analyzed the internal validity of the study, we’ll appraise the RELEVANCE of the presented data. The purpose of a diagnostic study is to determine the ability of a test to correctly classify individuals according to the presence or absence of disease. Actually, and to be more precise, we want to know how the likelihood of being ill changes after knowing the test’s result (the post-test probability). It is therefore essential that the study give information about the direction and magnitude of this change (pretest to posttest), which we know depends on the characteristics of the test and, to a large extent, on the prevalence or pretest probability.

Does the work present likelihood ratios, or is it possible to calculate them from the data? This information is critical because, without it, we could not estimate the clinical impact of the study test. We have to be especially careful with tests with quantitative results for which the researcher has established a cutoff point of normality. When using ROC curves, it is usual to move the cutoff to favor the sensitivity or the specificity of the test, but we must always appraise how this measure affects the external validity of the study, since it may limit its applicability to a particular group of patients.
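The sensitivity-specificity trade-off behind a moving cutoff is easy to see with a toy example. The marker values, the cutoffs and the function name below are all invented for illustration (a real ROC analysis would sweep every possible cutoff):

```python
def sens_spec_at_cutoff(diseased, healthy, cutoff):
    """Sensitivity and specificity of a quantitative test where values
    at or above the cutoff are called positive (higher = more abnormal).
    diseased/healthy: marker values in sick and healthy subjects."""
    sens = sum(v >= cutoff for v in diseased) / len(diseased)
    spec = sum(v < cutoff for v in healthy) / len(healthy)
    return sens, spec

# Hypothetical marker values in each group
diseased = [5, 6, 7, 8, 9, 10]
healthy = [1, 2, 3, 4, 5, 6]

low = sens_spec_at_cutoff(diseased, healthy, cutoff=5)   # favors sensitivity
high = sens_spec_at_cutoff(diseased, healthy, cutoff=7)  # favors specificity
```

Lowering the cutoff catches every diseased subject at the price of more false positives; raising it does the opposite, which is exactly the choice the authors make when they pick their "normality" threshold.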

How reliable are the results? We will have to determine whether the results are reproducible and how they may be affected by variations among different observers or when the test is repeated in succession. But we must assess not only the reliability but also the precision of the results. The study was done on a sample of patients, but it should provide an estimate of the values in the population, so results should be expressed with their corresponding confidence intervals.

The third pillar in critical appraisal is that of APPLICABILITY or external validity, which will help us to determine whether the results are useful for our patients. In this regard, we ask three questions. Is the test available and is it possible to perform it in our patients? If the test is not available, all we will have achieved with the study is to increase our vast knowledge. But if we can apply the test, we must ask whether our patients fulfill the inclusion and exclusion criteria of the study and, if not, consider how these differences may affect the applicability of the test.

The second question is whether we know the pretest probability of our patients. If our prevalence is very different from that of the study, the actual usefulness of the test may be modified. One solution may be to do a sensitivity analysis evaluating how the study results would change after replacing the pre- and post-test probability values with different ones that are clinically reasonable.

Finally, we should ask ourselves the most important question: can the posttest probability change our therapeutic attitude and thus be helpful to the patient? For example, if the pretest probability is very low, the posttest probability will probably also be very low and will not reach the therapeutic threshold, so it would not be worth spending money and effort on the test. Conversely, if the pretest probability is very high, it may be worth starting treatment without any more tests, unless the treatment is very expensive or dangerous. As always, virtue lies in the middle ground, and it is in these intermediate zones where the most benefit can be obtained from the diagnostic test under study. In any case, we must never forget who our boss is (I mean the patient, not our boss at the office): we must not be content with studying only the effectiveness or cost-effectiveness, but must also consider the risks, the discomfort and the patients’ preferences, as well as the consequences that performing the diagnostic test can lead to.

If you allow me a piece of advice, when critically appraising an article about diagnostic tests I recommend you use the CASP templates, which can be downloaded from their website. They will help you make the critical appraisal in a systematic and easy way.

A clarification to go running out: we must not confuse the studies of diagnostic tests with diagnostic prediction rules. Although the assessment is similar, the prediction rules have specific characteristics and methodological requirements that must be assessed in an appropriate way and that we will see in another post.

Finally, let me just say that everything we have said so far applies to specific papers about diagnostic tests. However, the assessment of diagnostic tests may be part of observational studies such as cohort or case-control studies, which can have some peculiarities in the sequence of implementation and in the validation criteria of the study and reference tests. But that’s another story…

## The King under review

We all know that the randomized clinical trial is the king of interventional methodological designs. It is the type of epidemiological study that allows a better control of systematic errors or biases, since the researcher controls the variables of the study and the participants are randomly assigned among the interventions that are compared.

In this way, if two homogeneous groups that differ only in the intervention present some difference of interest during the follow-up, we can affirm with some confidence that this difference is due to the intervention, the only thing that the two groups do not have in common. For this reason, the clinical trial is the preferred design to answer clinical questions about intervention or treatment, although we will always have to be prudent with the evidence generated by a single clinical trial, no matter how well performed. When we perform a systematic review of randomized clinical trials on the same intervention and combine them in a meta-analysis, the answers we get will be more reliable than those obtained from a single study. That’s why some people say that the ideal design for answering treatment questions is not the clinical trial, but the meta-analysis of clinical trials.

In any case, as systematic reviews assess their primary studies individually and as it is more usual to find individual trials and not systematic reviews, it is advisable to know how to make a good critical appraisal in order to draw conclusions. In effect, we cannot relax when we see that an article corresponds to a clinical trial and take its content for granted. A clinical trial can also contain its traps and tricks, so, as with any other type of design, it will be a good practice to make a critical reading of it, based on our usual three pillars: validity, importance and applicability.

As always, when studying scientific rigor or VALIDITY (internal validity), we will first look at a series of essential primary criteria. If these are not met, it is better not to waste time with the trial and try to find another more profitable one.

Is there a clearly defined clinical question? At its origin, the trial must be designed to answer a structured clinical question about treatment, motivated by one of our multiple knowledge gaps. A working hypothesis should be proposed, with its corresponding null and alternative hypotheses, if possible on a topic that is relevant from the clinical point of view. It is preferable that the study try to answer only one question. When there are several questions, the trial may become excessively complicated and end up answering none of them completely and properly.

Was the assignment done randomly? As we have already said, in order to be able to affirm that the differences between the groups are due to the intervention, the groups must be homogeneous. This is achieved by assigning patients randomly, the only way to control the known confounding variables and, more importantly, also those we do not know. If the groups were different and we attributed the difference only to the intervention, we could incur a confounding bias. The trial should contain the usual and essential table 1 with the frequencies of the demographic and confounding variables of both samples, so we can check that the groups are homogeneous. A frequent error is to look for differences between the two groups and evaluate them according to their p-values, when we know that p does not measure homogeneity. If we have distributed them at random, any difference we observe will necessarily be due to chance (we will not need a p to know that). The sample size is not designed to discriminate between demographic variables, so a non-significant p may simply indicate that the sample is too small to reach statistical significance. On the other hand, any minimal difference can reach statistical significance if the sample is large enough. So forget about the p: if there is any difference, what you have to do is assess whether it has sufficient clinical relevance to have influenced the results or, more elegantly, control the covariates that randomization left unbalanced. Fortunately, it is increasingly rare to find tables of the study groups with the comparison of p-values between the intervention and control groups.

But it is not enough for the study to be randomized; we must also consider whether the randomization sequence was generated correctly. The method used must ensure that all members of the selected population have the same probability of being chosen, so random number tables or computer-generated sequences are preferred. The randomization must also be concealed, so that it is not possible to know which group the next participant will be assigned to; that is why centralized systems by telephone or through the Internet are popular. And here is something very curious: it is well known that randomization tends to produce groups of different sizes, especially if the samples are small, which is why samples are sometimes randomized in blocks balanced in size. And I ask you: how many studies have you read with exactly the same number of participants in the two arms that claimed to be randomized? Do not trust equal groups, especially if they are small, and do not be fooled: you can always use one of the many binomial probability calculators available on the Internet to find out the probability that chance alone would generate the groups that the authors present (we are always speaking of simple randomization, not randomization by blocks, clusters, minimization or other techniques). You will be surprised by what you find.
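You do not even need an online calculator: the binomial computation fits in two lines of Python. A minimal sketch (the function name and the example size are mine) of the probability that simple 1:1 randomization of n participants yields two groups of exactly the same size:

```python
import math

def prob_equal_groups(n):
    """Probability that simple 1:1 randomization of n subjects (each
    assigned by a fair coin) produces two arms of exactly n/2 each.
    Binomial(n, 0.5) evaluated at n // 2; n should be even."""
    return math.comb(n, n // 2) / 2 ** n

# For a trial of 100 participants, perfectly equal arms happen
# only about 8% of the time under simple randomization
p = prob_equal_groups(100)
```

So when a small "simply randomized" trial reports arms of exactly the same size, a little suspicion is healthy: either some balancing technique was used and not reported, or something else is going on.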

It is also important that the follow-up has been long and complete enough, so that the study lasts long enough to be able to observe the outcome variable and that every participant who enters the study is taken into account at the end. As a general rule, if the losses exceed 20%, it is admitted that the internal validity of the study may be compromised.

We will always have to analyze the nature of the losses during follow-up, especially if they are high. We must try to determine whether the losses are random or related to some specific variable (which would be a bad sign) and estimate what effect they may have on the results of the trial. The usual approach is to adopt the so-called worst-case scenario: it is assumed that all the losses in the control group have done well and all those in the intervention group have done badly, and the analysis is repeated to check whether the conclusions change, in which case the validity of the study would be seriously compromised. The last important aspect is to consider whether patients who did not receive the treatment they were assigned (there is always someone who does not comply and messes things up) have been analyzed according to the intention-to-treat principle, since it is the only way to preserve all the benefits obtained with randomization. Everything that happens after randomization (such as a change of assignment group) can influence the probability that the subject experiences the effect we are studying, so it is important to respect this intention-to-treat analysis and analyze each participant in the group to which they were initially assigned.

Once these primary criteria have been verified, we will look at three secondary criteria that influence internal validity. It will be necessary to verify that the groups were similar at the beginning of the study (we have already talked about the table with the data of the two groups), that the masking was carried out appropriately as a form of bias control, and that the two groups were managed and followed in a similar way except, of course, for the intervention under study. We know that masking or blinding allows us to minimize the risk of information bias, which is why researchers and participants usually do not know which group each subject is assigned to, which is known as double blinding. Sometimes, given the nature of the intervention (think of a group that undergoes surgery and another that does not), it will be impossible to mask researchers and participants, but we can always give the masked data to the person who performs the analysis of the results (the so-called blinded evaluator), which mitigates this inconvenience.

To summarize this section on the validity of the trial, we can say that we will have to check that there is a clear definition of the study population, the intervention and the outcome of interest, that the randomization has been done properly, that information biases have been controlled through masking, that there has been an adequate follow-up with control of the losses and that the analysis has been correct (intention-to-treat analysis and control of covariates not balanced by randomization).

A very simple tool that can also help us assess the internal validity of a clinical trial is the Jadad scale, also called the Oxford quality scoring system. Jadad, a Colombian doctor, devised a scoring system with 7 questions. First, 5 questions whose affirmative answer adds 1 point:

1. Is the study described as randomized?
2. Is the method used to generate the randomization sequence described and is it adequate?
3. Is the study described as double blind?
4. Is the masking method described and is it adequate?
5. Is there a description of the losses during follow-up?

Finally, two questions whose negative answer subtracts 1 point:

1. Is the method used to generate the randomization sequence adequate?
2. Is the masking method appropriate?

As you can see, the Jadad scale assesses the key points that we have already mentioned: randomization, masking and follow-up. A trial is considered methodologically rigorous if it scores 5 points. If the study has 3 points or less, we had better use it to wrap the sandwich.
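The arithmetic of the scale is simple enough to sketch in a few lines. This is only an illustration of the scoring rules described above (answers are booleans), not a validated appraisal tool:

```python
# Minimal sketch of the Jadad (Oxford quality) score as described above.

def jadad_score(randomized, randomization_described_adequate,
                double_blind, masking_described_adequate,
                losses_described,
                randomization_method_adequate, masking_method_adequate):
    # +1 point for each affirmative answer to the first five questions
    score = sum([randomized, randomization_described_adequate,
                 double_blind, masking_described_adequate,
                 losses_described])
    # -1 point for each negative answer to the last two questions
    if not randomization_method_adequate:
        score -= 1
    if not masking_method_adequate:
        score -= 1
    return score

# A trial that ticks every box gets the maximum of 5 points...
print(jadad_score(True, True, True, True, True, True, True))        # 5
# ...while this one scores 2 and ends up wrapping the sandwich.
print(jadad_score(True, True, False, False, True, False, True))     # 2
```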

We will now proceed to consider the results of the study to gauge their clinical RELEVANCE. We will need to look at the variables measured to see whether the trial adequately expresses the magnitude and precision of the results. It is important, once again, not to settle for being inundated with multiple p-values full of zeros. Remember that the p-value only indicates the probability that we are accepting as real differences that exist only by chance (or, to put it simply, of making a type 1 error), and that statistical significance does not have to be synonymous with clinical relevance.

In the case of continuous variables such as survival time, weight, blood pressure, etc., the magnitude of the results is usually expressed as a difference in means or medians, depending on which measure of centralization is most appropriate. In the case of dichotomous variables (alive or dead, healthy or sick, etc.), the relative risk, its relative and absolute reductions, and the number needed to treat (NNT) will be used. Of all of these, the one that best expresses clinical efficiency is always the NNT. Any trial worthy of our attention must provide this information or, failing that, the information necessary for us to calculate it.
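These effect measures are all derived from the event rates in the two groups. Here is a minimal sketch of the standard formulas, with made-up rates (they are not from any real trial):

```python
# Standard effect measures for a dichotomous outcome; rates are invented.
import math

def effect_measures(rate_control, rate_treatment):
    """Relative risk, relative and absolute risk reductions, and NNT."""
    rr = rate_treatment / rate_control   # relative risk
    rrr = 1 - rr                         # relative risk reduction
    arr = rate_control - rate_treatment  # absolute risk reduction
    nnt = math.ceil(1 / arr)             # number needed to treat, rounded up
    return rr, rrr, arr, nnt

# 20% events in the control group vs 15% in the treated group:
rr, rrr, arr, nnt = effect_measures(0.20, 0.15)
# RR 0.75, RRR 0.25, ARR 0.05: we need to treat 20 patients
# to prevent one event.
print(rr, rrr, arr, nnt)
```

Note how a modest-sounding 25% relative risk reduction translates into treating 20 patients to avoid a single event; this is why the NNT is the most honest of the bunch.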

But to get a more realistic estimate of the results in the population, we need to know the precision of the study, and nothing is easier than resorting to confidence intervals. These intervals, in addition to precision, also inform us of statistical significance: the result will be statistically significant if the interval of the risk ratio does not include the value one, or if that of the difference in means does not include the value zero. If the authors do not provide them, we can use a calculator to obtain them, such as those available on the CASP website.
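As a sketch of what such a calculator does under the hood, the usual 95% confidence interval for a risk ratio uses the log method with a normal approximation. The counts below are invented for the example:

```python
# 95% CI for a risk ratio (log method, normal approximation).
# Counts are invented for illustration.
import math

def risk_ratio_ci(a, n1, c, n2, z=1.96):
    """a events out of n1 in the treated group; c out of n2 in controls."""
    rr = (a / n1) / (c / n2)
    # standard error of ln(RR)
    se_log = math.sqrt(1/a - 1/n1 + 1/c - 1/n2)
    low = math.exp(math.log(rr) - z * se_log)
    high = math.exp(math.log(rr) + z * se_log)
    return rr, low, high

# 15/100 events in the treated group vs 30/100 in the control group:
rr, low, high = risk_ratio_ci(15, 100, 30, 100)
print(round(rr, 2), round(low, 2), round(high, 2))  # RR 0.5, CI ~(0.29, 0.87)
# The interval excludes 1, so the result is statistically significant.
```

If the upper limit had crossed 1, the same point estimate would no longer be statistically significant, which is exactly the information a bare p-value hides.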

A good way to organize the study of the clinical importance of a trial is to structure it around four aspects: quantitative assessment (measures of effect and their precision), qualitative assessment (relevance from the clinical point of view), comparative assessment (whether the results are consistent with those of previous studies) and cost-benefit assessment (this point links to the next section of the critical appraisal, which deals with the applicability of the trial's results).

To finish the critical reading of a treatment article, we will assess its APPLICABILITY (also called external validity), for which we will have to ask ourselves whether the results can be generalized to our patients or, in other words, whether there is any difference between our patients and those of the study that prevents generalization of the results. Bear in mind in this regard that the stricter the inclusion criteria of a study, the more difficult it will be to generalize its results, thereby compromising its external validity.

But, in addition, we must consider whether all clinically important outcomes have been taken into account, including side effects and undesirable effects. The outcome variable measured must be important to both the investigator and the patient. Do not forget that demonstrating that the intervention is effective does not necessarily mean that it is beneficial for our patients. We must also assess the harmful or annoying effects and weigh the benefit-cost-risk balance, as well as any difficulties there may be in applying the treatment in our setting, the patient's preferences, etc.

As is easy to understand, a study can have great methodological validity and results of great clinical importance and still not be applicable to our patients, either because our patients are different from those of the study, because it does not fit their preferences, or because it is unfeasible in our setting. However, the opposite does not usually happen: if the validity is poor or the results are unimportant, we will hardly consider applying the conclusions of the study to our patients.

To finish, let me recommend that you use some of the tools available for critical appraisal, such as the CASP templates, or a checklist, such as CONSORT, so as not to leave any of these points unconsidered. Of course, everything we have talked about concerns randomized controlled clinical trials; what happens with nonrandomized trials or other kinds of quasi-experimental studies? Well, for those we follow another set of rules, such as those of the statement. But that is another story…