You all know the case of someone who, after carrying out a study and collecting several million variables, went to the statistician at his workplace and, reliably demonstrating his clarity of ideas about his own work, said: please (one has to be polite), cross everything with everything and see what comes out.
At this point, several things can happen to you. If the statistician is an unscrupulous soul, he will give you a half smile and tell you to come back in a few days. Then he will hand you several hundred sheets of graphs, tables and numbers that you will not know what to do with. Another thing that can happen is that she sends you to hell, tired as she is of getting similar requests.
But you may be lucky and find a competent and patient statistician who, in a self-sacrificing way, will explain to you that this is not how things should work. The logical thing is that, before collecting any data, you write a project protocol that plans, among other things, what is to be analyzed and which variables must be compared with each other. She may even suggest that, if the analysis is not very complicated, you try to do it yourself.
The latter may seem like the delirium of a mind disturbed by mathematics but, if you think about it for a moment, it is not such a bad idea. If we do the analysis, at least the preliminary, of our results, it can help us to better understand the study. Also, who can know what we want better than ourselves?
With the current statistical packages, the simplest bivariate statistics can be within our reach. We only have to be careful in choosing the right hypothesis test, for which we must take into account three aspects: the type of variables that we want to compare, if the data are paired or independent and if we have to use parametric or non-parametric tests. Let’s see these three aspects.
Regarding the type of variables, there are multiple denominations depending on the classification or the statistical package that we use but, simplifying, we will say that there are three types of variables. First, there are the continuous variables. As the name suggests, they collect the value of a continuous quantity such as weight, height, blood glucose concentration, etc. Second, there are the nominal variables, which consist of two or more mutually exclusive categories. For example, the variable “hair color” can have the categories “brown”, “blond” and “red”. When these variables have two categories, we call them dichotomous (yes / no, alive / dead, etc.). Finally, when the categories are ordered by rank, we speak of ordinal variables: “does not smoke”, “smokes little”, “smokes moderately”, “smokes a lot”. Although they sometimes use numbers, these only indicate the position of the categories within the series, without implying, for example, that the distance from category 1 to 2 is the same as that from 2 to 3. For example, we can classify vesicoureteral reflux in grades I, II, III and IV (a grade IV is more than a grade II, but it does not mean that you have twice as much reflux).
Knowing what kind of variable we are dealing with is simple. If in doubt, we can follow this reasoning, based on the answer to two questions:
Does the variable have infinite theoretical values? Here we have to do a bit of abstraction and think about what “theoretical values” really means. For example, if we measure the weight of the subjects of the study, the theoretical values will be infinite although, in practice, they will be limited by the precision of our scale. If the answer to this first question is “yes”, we are dealing with a continuous variable. If it is not, we move on to the next question.
Are the values sorted in some kind of rank? If the answer is “yes”, we will be dealing with an ordinal variable. If the answer is “no”, we will have a nominal variable.
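For those who prefer to see the two questions written down as a procedure, here is a minimal sketch in Python. The function name and its two parameters are invented for this illustration; they simply encode the answers to the two questions above.

```python
def classify_variable(infinite_theoretical_values: bool, ordered_categories: bool) -> str:
    """Classify a variable from the answers to the two questions in the text.

    Both parameter names are hypothetical; they stand for the answers
    "does it have infinite theoretical values?" and "are the values ranked?".
    """
    if infinite_theoretical_values:
        return "continuous"
    if ordered_categories:
        return "ordinal"
    return "nominal"


# Weight: infinite theoretical values, so the second question is irrelevant.
print(classify_variable(True, False))
# Vesicoureteral reflux grade: finite values, ordered by rank.
print(classify_variable(False, True))
# Hair color: finite values, no order.
print(classify_variable(False, False))
```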
The second aspect is that of paired or independent measures. Two measures are paired when a variable is measured twice after having applied some change, usually in the same subject. For example: blood pressure before and after a stress test, weight before and after a nutritional intervention, etc. On the other hand, independent measures are those that are not related to each other (they are different variables): weight, height, gender, age, etc.
Finally, we mentioned the possibility of using parametric or non-parametric tests. We are not going to go into detail now, but in order to use a parametric test the variable must fulfill a series of characteristics, such as following a normal distribution, having a certain sample size, etc. In addition, there are techniques that are more robust than others when it comes to having to meet these conditions. When in doubt, it is preferable to use non-parametric techniques unnecessarily (the only problem is that it is more difficult to achieve statistical significance, but the contrast is just as valid) than using a parametric test when the necessary requirements are not met.
Once we have answered these three questions, all that remains is to form the pairs of variables that we are going to compare and choose the appropriate statistical test. You can see it summarized in the attached table. The rows show the type of independent variable, which is the one whose value does not depend on another variable (it usually goes on the x axis of graphical representations) and which is usually the one we modify in the study to see its effect on another variable (the dependent one). The columns, on the other hand, show the dependent variable, which is the one whose value changes with the changes in the independent variable. Anyway, do not get muddled: the statistical software will run the hypothesis test without taking into account which variable is dependent and which is independent, considering only the types of variables.
The table is self-explanatory, so we will not spend much time on it. For example, if we have measured blood pressure (continuous variable) and we want to know if there are differences between men and women (gender, nominal dichotomous variable), the appropriate test will be Student’s t test for independent samples. If we wanted to see if there is a difference in pressure before and after a treatment, we would use the same Student’s t test but for paired samples.
Another example: if we want to know if there are significant differences between hair color (nominal, polytomous: “blond”, “brown” and “red”) and whether the participant is from the north or south of Europe (nominal, dichotomous), we could use a chi-square test.
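The choice described in these examples can be sketched as a small lookup function. This is only an illustration, not a reproduction of the attached table: it covers the combinations mentioned in the text, plus the usual non-parametric alternatives to Student’s t (Mann-Whitney U and Wilcoxon signed-rank), which I have added on my own. The function name and the type labels are invented for the example.

```python
CATEGORICAL = {"dichotomous", "nominal"}


def choose_test(type_a, type_b, paired=False, parametric=True):
    """Suggest a bivariate test for two variable types (illustrative sketch only).

    type_a / type_b are hypothetical labels: "continuous", "dichotomous"
    or "nominal". Combinations not discussed in the text raise an error.
    """
    pair = {type_a, type_b}
    if pair == {"continuous", "dichotomous"}:
        if parametric:
            if paired:
                return "Student's t test for paired samples"
            return "Student's t test for independent samples"
        # Usual non-parametric counterparts (an addition, not from the text).
        return "Wilcoxon signed-rank test" if paired else "Mann-Whitney U test"
    if pair <= CATEGORICAL and not paired:
        return "chi-square test"
    raise ValueError("combination not covered by this sketch; check the table")


# Blood pressure (continuous) by gender (dichotomous), independent groups.
print(choose_test("continuous", "dichotomous"))
# Blood pressure before and after a treatment: paired measures.
print(choose_test("continuous", "dichotomous", paired=True))
# Hair color (nominal) by north/south of Europe (dichotomous).
print(choose_test("nominal", "dichotomous"))
```

Note that the software will still run whatever test we ask for; checking the conditions of each test remains our job.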
And here we will end for today. We have not talked about the peculiarities of each test that we have to take into account; we have only mentioned the tests themselves. For example, the chi-square test requires a minimum number of expected observations in each cell of the contingency table, and for Student’s t we must consider whether the variances are equal (homoscedasticity) or not, etc. But that is another story…
When Georg Cantor set out to develop set theory, he could not have imagined everything that would come after it, probably at the hands of mathematicians as dedicated as he was. I am thinking of the curious case of binary relations, which the older ones among you will remember from the time when children learned things at school.
It turns out that some mathematical genius started thinking and described a series of properties. The first is the reflexive property. This means that any number x is equal to itself: x is x. In case anyone has not understood, let us give an anatomical example: my right hand is my right hand. I believe that the genius who invented the reflexive property needed a long recovery in some spa after such a huge mental strain.
It was in this spa where he decided to do something more intense, so he described the symmetric property, which is much more complex: whenever a number x equals y, then y equals x. Going back to the anatomical simile, if my arms and legs are my extremities, you will have to agree that my extremities are my arms and my legs. Algebra is fascinating.
Luckily, in the end, to pad out his record and cover his back, our anonymous genius invented the transitive property, which says more or less this: if a number x is related to y, and y is related to z, there will be transitivity if x is related to z. Again, to the anatomy: if my leg is mine and my foot belongs to my leg, my foot is also mine. After that, more properties were derived from these three, but we shall leave it here for the moment, because today we are going to use the power of the transitive property to find out which of two things that we have never actually compared is the better of the two. Think, for example, of a crazed mob running into a shopping center on the first day of sales. They look at everything before deciding what to buy, but it is not necessary to compare all the products pairwise to know which one we like best.
In medicine something similar happens. The usual thing is that there are several options to treat the same disease (although those of us who have been in the business for a long time know that the more there are, the more likely it is that none works at all). Clinical trials, and meta-analyses of clinical trials, only compare pairs, and it may happen that no one has compared the two options we have at our disposal, or that we want to know which is, in theory, the best of all those available.
Well, for that a methodological design called network meta-analysis (NMA), also called multiple-treatments meta-analysis or mixed-treatments comparisons meta-analysis, has been invented. And in this last term, mixed comparisons, is the crux of the matter, because it turns out that there are several types of comparisons. Let’s see them.
Let’s assume we have three possible treatments that, after deep reflection, I have decided to call A, B and C. The simplest situation is to compare two of them, A and B, for example, with a conventional clinical trial. We would be making a direct comparison between the two interventions. But it may happen that we do not have any trial that directly compares A and B, but there are two different trials that compare each of them with another intervention, C (you can see it in the attached figure). In this case we can resort to the power of the transitive property and make an indirect comparison between A and B based on their relative efficacy against C. For example, if A reduces mortality by 50% compared to C (risk ratio 0.5) and B reduces it by 25% compared to C (risk ratio 0.75), the indirect risk ratio of A versus B is 0.5 / 0.75 ≈ 0.67, so we can say that A reduces mortality by about a third relative to B. Of course, in order to do this, transitivity has to be fulfilled, something that we cannot take for granted. For example, if I like pork and pigs like to wallow in mud, that does not mean that I like to wallow in mud. Transitivity is not fulfilled in this case (I think).
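To make the arithmetic of an indirect comparison concrete, here is a minimal sketch of what is often called an adjusted indirect comparison: on the logarithmic scale, the effect of A versus B is the difference between the effects of A versus C and B versus C, and the variances add up. All the numbers (risk ratios and standard errors) are hypothetical, and the function name is mine. The sketch assumes that the two trials are independent and that transitivity holds.

```python
import math


def indirect_comparison(rr_ac, se_log_ac, rr_bc, se_log_bc, z=1.96):
    """Indirect risk ratio of A vs B through the common comparator C.

    rr_ac, rr_bc: risk ratios of A vs C and B vs C (hypothetical inputs).
    se_log_ac, se_log_bc: standard errors of the log risk ratios.
    Assumes independent trials and that transitivity holds.
    """
    log_rr_ab = math.log(rr_ac) - math.log(rr_bc)
    # Variances of independent estimates add, so the indirect
    # estimate is always less precise than either direct one.
    se_ab = math.sqrt(se_log_ac ** 2 + se_log_bc ** 2)
    low = math.exp(log_rr_ab - z * se_ab)
    high = math.exp(log_rr_ab + z * se_ab)
    return math.exp(log_rr_ab), low, high


# Hypothetical data: A halves mortality vs C, B reduces it by 25% vs C.
rr_ab, low, high = indirect_comparison(0.50, 0.15, 0.75, 0.20)
print(f"Indirect RR of A vs B: {rr_ab:.2f} (95% CI {low:.2f} to {high:.2f})")
```

The price of not having a direct trial shows up in the interval: it is wider than the interval of either of the two comparisons it is built from.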
Well, an NMA is nothing more than a series of direct, indirect and mixed comparisons (the latter combine direct and indirect evidence on the same pair of treatments) that allow us to compare the relative effects of several interventions. Multiple comparisons are typically represented as a network diagram where we can see the direct, indirect and mixed comparisons. Each node in the network, which can vary in size according to its contribution (usually the number of studies or participants), corresponds to one of the interventions compared, while the lines joining the nodes represent the direct comparisons available. The complete network will represent all the comparisons between treatments identified from the primary studies of the review that incorporates our NMA.
As with the other types of meta-analyses coupled with a systematic review, the validity of the NMA will depend on the validity of the primary studies, the heterogeneity among them and the possible existing information biases, factors that will condition the quality of the direct comparisons.
In addition, indirect comparisons are considered observational and require, as we have already mentioned, that the researcher judge the transitivity of the interventions based on her knowledge about them, about the disease and about the designs of the primary studies.
Another specific aspect of the NMA is that of coherence or consistency, which makes reference to the level of agreement among the evidence coming from direct and indirect comparisons. This level of agreement, which can be measured with specific statistical methods, must be high in order for the summary result measure to be valid. The results of the comparisons must go in the same direction, they cannot be divergent. When this is not fulfilled, the cause probably lies in the poor methodological quality of the primary studies, in their heterogeneity or in the presence of biases.
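As an illustration of how this agreement can be checked, here is a minimal sketch of one simple option: a z test on the difference between the direct and the indirect estimate of the same comparison, on the logarithmic scale. It is not necessarily the method a given NMA uses, the numbers are hypothetical, and it assumes that both estimates are independent.

```python
import math


def inconsistency_test(log_direct, se_direct, log_indirect, se_indirect):
    """Compare direct and indirect estimates of the same comparison.

    Inputs are log effect estimates (e.g. log odds ratios) and their
    standard errors; all values in the example are hypothetical.
    Returns the difference, its z statistic and a two-sided p-value.
    """
    diff = log_direct - log_indirect
    se_diff = math.sqrt(se_direct ** 2 + se_indirect ** 2)
    z = diff / se_diff
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return diff, z, p_value


# Hypothetical loop: direct OR 0.70, indirect OR 0.85.
diff, z, p = inconsistency_test(math.log(0.70), 0.12, math.log(0.85), 0.20)
print(f"difference in log OR = {diff:.3f}, z = {z:.2f}, p = {p:.3f}")
```

A small p-value would suggest that direct and indirect evidence disagree more than chance alone would explain, which sends us back to look for heterogeneity, intransitivity or biases.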
As in other meta-analyses, the result of the NMA is expressed with a summary result measure that can be an odds ratio, a mean difference, a risk ratio, etc. This point estimate is accompanied by an interval that gives us information about the precision of the estimate. The statistical analysis of the NMA can use frequentist methods (the ones we usually see in conventional clinical trials) or Bayesian methods. The latter assign a prior probability distribution to the treatment effect before analyzing the data and then update it into a posterior probability once the data have been analyzed. For what interests us here, frequentist methods will assess the precision of the point estimate by means of the familiar confidence intervals (usually 95%), while Bayesian methods will provide credible intervals (also 95%), with a similar interpretation.
With all this data we will obtain an ordered rank of the compared treatments, with the best heading the list. But do not trust yourself too much, you have to look at these ranks carefully for several reasons. First, the best treatment in one situation may not be so in another. Second, we must take into account other factors such as cost, availability, knowledge of the clinician, etc. Third, these ordered ranks do not take into account the magnitude of the differences between the different elements. And fourth, chance can play tricks on us and put in a good position a treatment that, in reality, is not as good as it may seem.
Once reviewed, at a glance, the peculiarities of the NMA, what can we say about their critical appraisal? As we have a checklist for the systematic review with the usual meta-analysis, the PRISMA statement, there is a specific declaration for the NMA, the PRISMA-NMA. This list includes, as specific items, aspects such as the description of the geometry of the treatments network, the consideration of the transitivity and consistency assumptions and the description of the methods used to analyze the structure of the network and the suitability of the comparisons, in case some may have a lower degree of evidence. All this will be facilitated if the authors provide the graph with the study network and briefly explain its characteristics.
Anyway, you know that I’d rather resort to CASP’s tools for critical appraisal. Although there is no specific one for NMA, I advise you to use the one for systematic reviews with conventional meta-analysis and, afterwards, to make some considerations about the specific aspects of the NMA.
To not extend this post much, we will skip the whole part that NMA share with any other systematic review and go directly to its specific aspects. You can consult the corresponding post where we reviewed the critical appraisal of a systematic review. As always, we will follow our three pillars of wisdom: validity, relevance and applicability.
Regarding VALIDITY, we will ask three specific questions.
Does the review respond to a well-defined clinical question that justifies carrying out an NMA? This question has the classic components of the PICO question, although the intervention and the comparison will encompass the multiple comparisons of the network.
Was an exhaustive search of the relevant studies carried out? This aspect is important to avoid publication bias and to ensure the inclusion of all the important information available. The absence of relevant studies can affect the consistency of the comparisons.
There should be a clear specification of the target population, the treatments evaluated and the outcome measures used. All these aspects can condition the validity of indirect comparisons. If we want to infer the relationship between the effects of A and B by comparing their individual effects with respect to C, it is essential that A and B are treated similarly in their comparison with C, that the A-C and B-C comparisons are made with similar patients, that the same outcome measures are used and that the risk of bias in the studies is low. The latter can be assessed with the usual tools, such as Cochrane’s.
To finish this section, we will check that the results are analyzed and presented in an appropriate way, which statistical method has been used (frequentist or Bayesian), and if confidence or credibility intervals, the analysis of the network, etc. are provided.
Although we will not go into it, we will say that there are multiple types of networks (star, loop, line …). For comparisons to be more valid, indirect comparisons must be supported by direct ones. This can be seen in the network scheme by the presence of triangles similar to the graph that I attached at the beginning of the post (or other closed geometric shapes). In conditions of equality of other factors that can have an influence and that we have already mentioned, the more triangles we see, the more valid the comparisons will be.
As a last aspect, we will evaluate if the authors have used the appropriate methods to assess the heterogeneity and the possible existence of inconsistency: sensitivity analysis, metaregression, etc.
Going to the RELEVANCE section, we will assess the results of the meta-analysis. Here we will consider five specific aspects:
What is the result? As in any other meta-analysis, we will assess the result and its importance from the clinical point of view.
It will be necessary to assess how the result could have been influenced by the risk of bias in the primary studies: the greater the risk of bias, the farther our estimate can be from the truth.
Are the results precise? Here we must assess the width of the confidence or credible intervals, taking into account how the conclusions of the study would be affected at each end of the interval.
Is there consistency of results among different studies? There may be variability by pure chance or because of heterogeneity among the studies. We can assess it by observing the shape of the forest plots, with the help of the usual statistical methods, such as I².
Are indirect comparisons reliable? We return again to the concept of transitivity, which must be taken into account together with the other factors that we have previously commented on and which affect the risk of bias: homogeneous populations, similar outcome variables, common comparators, etc.
Is there consistency between direct and indirect comparisons? We will have to check for closed geometric shapes within the network (our triangles or loops), as well as rule out causes of inconsistency, which are the same ones we have already mentioned as causing heterogeneity and intransitivity.
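Since the I² statistic comes up when assessing consistency among studies, here is a minimal sketch of how it is usually derived from Cochran’s Q under a fixed-effect (inverse variance) weighting. The effect estimates and standard errors are invented for the example, and the function name is mine.

```python
def cochran_q_and_i2(effects, std_errors):
    """Cochran's Q and the I² statistic (as a percentage).

    effects: study effect estimates on an additive scale (e.g. log odds ratios).
    std_errors: their standard errors. Example values below are hypothetical.
    """
    weights = [1 / se ** 2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I² is the share of total variability beyond what chance would explain;
    # it is truncated at zero when Q is smaller than its degrees of freedom.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, i2


# Hypothetical log odds ratios from four primary studies.
q, i2 = cochran_q_and_i2([-0.40, -0.10, -0.65, 0.05], [0.15, 0.20, 0.25, 0.18])
print(f"Q = {q:.2f}, I2 = {i2:.1f}%")
```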
Finally, we will finish our critical appraisal by making some special considerations regarding the APPLICABILITY of the results.
In addition to taking into account, as usual, whether all the important effects and variables for the patient have been considered and whether the patients are similar to those of our environment, we will ask the questions specifically related to the use of an NMA, such as whether the network has considered all the treatment options or whether the different comparison subgroups that have been established are credible from the clinical point of view.
And here we will leave it for today. A difficult beast to tame, this NMA. And we have not said anything about its statistical methodology, which is quite complex but which computer packages carry out without flinching. In addition, we could have talked a lot about the types of networks and the comparisons that can be drawn from each of them. But that’s another story…
The genius that I am talking about in the title of this post is none other than Alan Mathison Turing, considered one of the fathers of computer science and a forerunner of modern computing.
For mathematicians, Turing is best known for his involvement in the solution of the decision problem previously proposed by Gottfried Wilhelm Leibniz and David Hilbert, who were seeking to define a method that could be applied to any mathematical sentence to prove whether that sentence was true or not (for those interested in the matter, it was demonstrated that such a method does not exist).
But what Turing is famous for among the general public comes thanks to the cinema and to his work in statistics during World War II. Turing took to exploiting Bayesian magic to explore how the evidence we collect during an investigation can support the initial hypothesis or, on the contrary, favor an alternative one. This allowed him to break the code of the Enigma machine, which the German navy used to encrypt its messages, and that is the story that has been taken to the screen. This line of work led to the development of concepts such as the weight of evidence and the use of probabilities to compare null and alternative hypotheses, which were later applied in biomedicine and enabled the development of new ways to evaluate the performance of diagnostic tests, such as the ones we are going to deal with today.
But all this story about Alan Turing is just a tribute to one of the people whose contributions made it possible to develop the methodological design that we are going to talk about today, which is none other than the meta-analysis of diagnostic accuracy.
We already know that a meta-analysis is a quantitative synthesis method that is used in systematic reviews to integrate the results of primary studies into a summary result measure. The most common is to find systematic reviews on treatment, for which the implementation methodology and the choice of summary result measure are quite well defined. Reviews on diagnostic tests, which have been possible after the development and characterization of the parameters that measure the diagnostic performance of a test, are less common.
The process of conducting a diagnostic systematic review essentially follows the same guidelines as a treatment review, although there are some specific differences that we will try to clarify. We will focus first on the choice of the outcome summary measure and try to take into account the rest of the peculiarities when we give some recommendations for a critical appraisal of these studies.
When choosing the outcome measure, we will find the first big difference with the meta-analyses of treatment. In the meta-analysis of diagnostic accuracy (MDA) the most frequent way to assess the test is to combine sensitivity and specificity as summary values. However, these indicators have the problem that the cut-off points used to consider the result of the test as positive or negative usually vary among the different primary studies of the review. Moreover, in some cases positivity may depend on the subjectivity of the evaluator (think of the results of imaging tests). All this, besides being a source of heterogeneity among the primary studies, is the origin of a typical MDA bias called the threshold effect, which we will look at a little later.
For this reason, many authors do not like to use sensitivity and specificity as summary measures and resort to positive and negative likelihood ratios. These ratios have two advantages. First, they are more robust against the presence of a threshold effect. Second, as we know, they allow us to calculate the post-test probability, either using Bayes’ rule (pre-test odds x likelihood ratio = post-test odds) or Fagan’s nomogram (you can review these concepts in the corresponding post).
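Bayes’ rule in its odds form takes just three steps: convert the pre-test probability to odds, multiply by the likelihood ratio and convert back to a probability. A minimal sketch, with an invented function name and hypothetical numbers:

```python
def post_test_probability(pre_test_probability, likelihood_ratio):
    """Post-test probability from pre-test probability and a likelihood ratio.

    Implements: pre-test odds x likelihood ratio = post-test odds.
    The pre-test probability must be strictly between 0 and 1.
    """
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


# Hypothetical example: 20% pre-test probability and a positive LR of 4.
print(f"{post_test_probability(0.20, 4):.2f}")
```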
Finally, a third possibility is to resort to another of the inventions that derive from Turing’s work: the diagnostic odds ratio (DOR).
The DOR is defined as the ratio between the odds of testing positive among the sick and the odds of testing positive among the healthy. This phrase may seem a bit cryptic, but it is not so. The odds of a sick patient testing positive versus negative is just the ratio between true positives (TP) and false negatives (FN): TP / FN. On the other hand, the odds of a healthy subject testing positive versus negative is the quotient between false positives (FP) and true negatives (TN): FP / TN. Seeing this, all that remains is to take the ratio between the two odds, DOR = (TP / FN) / (FP / TN), as you can see in the attached figure. The DOR can also be expressed in terms of the predictive values and the likelihood ratios, according to the expressions that you can see in the same figure. Finally, it is also possible to calculate its confidence interval, according to the formula that ends the figure.
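For those who do not have the figure at hand, here is a minimal sketch of the calculation from the four cells of a 2x2 table. The confidence interval uses the usual standard error of the logarithm of an odds ratio, which I assume is the formula the figure refers to. The counts are hypothetical, and the sketch does not handle cells equal to zero.

```python
import math


def diagnostic_odds_ratio(tp, fp, fn, tn, z=1.96):
    """DOR and its confidence interval from the cells of a 2x2 table.

    tp, fp, fn, tn: true positives, false positives, false negatives and
    true negatives (hypothetical counts in the example; no zero cells).
    """
    dor = (tp / fn) / (fp / tn)
    # Standard error of log(DOR), as for any odds ratio.
    se_log = math.sqrt(1 / tp + 1 / fp + 1 / fn + 1 / tn)
    low = math.exp(math.log(dor) - z * se_log)
    high = math.exp(math.log(dor) + z * se_log)
    return dor, low, high


def dor_from_likelihood_ratios(tp, fp, fn, tn):
    """Same DOR expressed as positive LR divided by negative LR."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive / lr_negative


# Hypothetical study: 100 sick (90 TP, 10 FN) and 100 healthy (20 FP, 80 TN).
dor, low, high = diagnostic_odds_ratio(90, 20, 10, 80)
print(f"DOR = {dor:.1f} (95% CI {low:.1f} to {high:.1f})")
```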
Like all odds ratios, the possible values of the DOR go from zero to infinity. The null value is 1, which means that the test has no discriminatory capacity between the healthy and the sick. A value greater than one indicates discriminatory capacity, which will be greater the greater the value. Finally, values between zero and 1 will indicate that the test not only does not discriminate well between the sick and healthy, but classifies them in a wrong way and gives us more negative values among the sick than among the healthy.
The DOR is a global parameter easy to interpret and does not depend on the prevalence of the disease, although it must be said that it can vary between groups of patients with different severity of disease. In addition, it is also a very robust measure against the threshold effect and is very useful for calculating the summary ROC curves that we will comment on below.
The second peculiar aspect of MDA that we are going to deal with is the threshold effect. We must always assess its presence when we are faced with an MDA. The first thing will be to look at the clinical heterogeneity among the primary studies, which may be evident without needing to make many considerations. There is also a simple mathematical approach, which is to calculate Spearman’s correlation coefficient between sensitivity and specificity. If there is a threshold effect, there will be an inverse correlation between the two, the stronger the greater the threshold effect.
Finally, a graphical method is to assess the dispersion of the sensitivity and specificity pairs of the primary studies around the summary ROC curve of the meta-analysis. Dispersion allows us to suspect a threshold effect, but it can also be due to the heterogeneity of the studies and to other biases such as selection or verification bias.
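The Spearman check is easy to sketch without any statistical package: rank the sensitivities and the specificities of the primary studies and compute the ordinary correlation between the ranks. The helper names and the data are invented for the illustration.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's coefficient: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical primary studies: as sensitivity rises, specificity falls.
sensitivity = [0.95, 0.90, 0.80, 0.70, 0.85]
specificity = [0.60, 0.70, 0.85, 0.92, 0.75]
print(f"Spearman rho = {spearman(sensitivity, specificity):.2f}")
```

A strongly negative coefficient, as in this made-up data set, is what we would expect to see when the studies use different cut-off points.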
The third specific element of MDA that we are going to comment on is that of the summary ROC curve (sROC), which is an estimate of the common ROC curve adjusted according to the results of the primary studies of the review. There are several ways to calculate it, some quite complicated from the mathematical point of view, but the most used are the regression models that use the DOR as an estimator, since, as we have said, it is very robust against heterogeneity and the threshold effect. But do not be alarmed, most of the statistical packages calculate and represent the sROC with little effort.
The sROC is read like any other ROC curve. The two most used parameters are the area under the ROC curve (AUC) and the Q index. The AUC of a perfect curve is equal to 1. Values above 0.5 indicate discriminatory diagnostic capacity, which will be higher the closer the value gets to 1. A value of 0.5 tells us that the test is as useful as flipping a coin. Finally, values below 0.5 indicate that the test does not contribute at all to the diagnosis it intends to make.
On the other hand, the Q index corresponds to the point at which sensitivity and specificity are equal. As with the AUC, a value greater than 0.5 indicates the overall effectiveness of the diagnostic test, which will be higher the closer the index value is to 1. In addition, confidence intervals can be calculated for both the AUC and the Q index, with which it will be possible to assess the precision of the estimate of the summary measure of the MDA.
Having seen (at a glance) the specific aspects of MDA, we will give some recommendations for the critical appraisal of this type of study. The CASP network does not provide a specific tool for MDA, but we can follow the lines of the systematic review of treatment studies, taking into account the differential aspects of MDA. As always, we will follow our three basic pillars: validity, relevance and applicability.
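To get a feel for these two parameters, here is a minimal sketch under a strong simplifying assumption that is mine and not the post’s: a symmetric sROC curve in which the DOR is the same at every threshold. In that case both the Q index and the AUC can be written as functions of the DOR alone. Real packages fit more elaborate models, so take this only as an illustration.

```python
import math


def q_index(dor):
    """Point of the curve where sensitivity equals specificity.

    Assumes a symmetric sROC curve with constant DOR (simplification).
    """
    root = math.sqrt(dor)
    return root / (1 + root)


def symmetric_sroc_auc(dor):
    """Area under a symmetric sROC curve with constant DOR.

    Closed form of the integral of DOR*x / (1 + (DOR - 1)*x) for x in [0, 1],
    where x is the false positive rate.
    """
    if dor == 1:
        return 0.5  # no discriminatory capacity: the diagonal
    return dor / (dor - 1) ** 2 * ((dor - 1) - math.log(dor))


# Hypothetical summary DOR of 9.
print(f"Q = {q_index(9):.2f}, AUC = {symmetric_sroc_auc(9):.3f}")
```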
Let’s start with the questions that assess the VALIDITY of the study.
The first question asks whether the subject of the review has been clearly specified. As with any systematic review, reviews of diagnostic tests should try to answer a specific question that is clinically relevant, which is usually posed following the PICO scheme of a structured clinical question. The second question makes us consider whether the type of studies included in the review is adequate. The ideal design is that of a cohort to which the diagnostic test that we want to assess and the gold standard are blindly and independently applied. Other studies based on case-control designs are less valid for the evaluation of diagnostic tests, and will reduce the validity of the results.
If the answer to both questions is yes, we turn to the secondary criteria. Have the important studies on the subject been included? We must verify that a comprehensive and unbiased search of the literature has been carried out. The methodology of the search is similar to that of systematic reviews on treatment, although we should take some precautions. For example, diagnostic studies are usually indexed differently in databases, so the use of the usual filters of other types of reviews can cause us to lose relevant studies. We will have to carefully check the search strategy, which must be provided by the authors of the review.
In addition, we must verify that the authors have ruled out the possibility of a publication bias. This poses a special problem in MDA, since the study of the publication bias in these studies is not well developed and the usual methods such as the funnel plot or the Egger’s test are not very reliable. The most conservative thing to do is always assume that there may be a publication bias.
It is very important that enough has been done to assess the quality of the studies, looking for possible biases. For this the authors can use specific tools, such as QUADAS-2.
To finish the section of internal or methodological validity, we must ask ourselves if it was reasonable to combine the results of the primary studies. It is fundamental, in order to draw conclusions from combined data, that studies are homogeneous and that the differences among them are due solely to chance. We will have to assess the possible sources of heterogeneity and if there may be a threshold effect, which the authors have had to take into account.
In summary, the fundamental aspects that we will have to analyze to assess the validity of a MDA will be: 1) that the objectives are well defined; 2) that the bibliographic search has been exhaustive; and 3) that the internal or methodological validity of the included studies has also been verified. In addition, we will review the methodological aspects of the meta-analysis technique: the convenience of combining the studies to perform a quantitative synthesis, an adequate evaluation of the heterogeneity of the primary studies and the possible threshold effect and use of an adequate mathematical model to combine the results of the primary studies (sROC, DOR, etc.).
Regarding the RELEVANCE of the results we must consider what is the overall result of the review and if the interpretation has been made in a judicious manner. We will value more those MDA that provide more robust measures against possible biases, such as likelihood ratios and DOR. In addition, we must assess the accuracy of the results, for which we will use our beloved confidence intervals, which will give us an idea of the precision of the estimation of the true magnitude of the effect in the population.
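To make these measures a little less abstract, here is a small sketch that computes the likelihood ratios and the DOR for a single 2x2 table, with an approximate 95% confidence interval for the DOR obtained on the logarithmic scale. The counts are invented and the function name is hypothetical. A real MDA pools several studies with specific models, which this example does not attempt to do.

```python
import math

def diagnostic_summary(tp, fp, fn, tn, z=1.96):
    """Accuracy measures for one 2x2 table (all four cells must be non-zero)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    dor = (tp * tn) / (fp * fn)
    # Standard error of log(DOR), used for an approximate confidence interval.
    se_log_dor = math.sqrt(1 / tp + 1 / fp + 1 / fn + 1 / tn)
    ci_low = math.exp(math.log(dor) - z * se_log_dor)
    ci_high = math.exp(math.log(dor) + z * se_log_dor)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lr_positive": lr_positive,
        "lr_negative": lr_negative,
        "dor": dor,
        "dor_ci": (ci_low, ci_high),
    }

# Invented counts: 90 true positives, 20 false positives,
# 10 false negatives and 80 true negatives.
summary = diagnostic_summary(tp=90, fp=20, fn=10, tn=80)
for name, value in summary.items():
    print(name, value)
```

Note that the DOR is simply the positive likelihood ratio divided by the negative one, and that the narrower its confidence interval, the more precise the estimate.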
We will conclude the critical appraisal of MDA assessing the APPLICABILITY of the results to our environment. We will have to ask whether we can apply the results to our patients and how they will influence the attention to them. We will have to see if the primary studies of the review describe the participants and if they resemble our patients. In addition, it will be necessary to see if all the relevant results have been considered for decision making in the problem under study and, as always, the benefit-cost-risk ratio must be assessed. The fact that the conclusion of the review seems valid does not mean that we have to apply it in a compulsory way.
Well, with all that said, we are going to finish today. The title of this post refers to the mistreatment suffered by a genius. We already know what genius we were referring to: Alan Turing. Now, we will clarify the abuse. Despite being one of the most brilliant minds of the 20th century, as witnessed by his work on statistics, computing, cryptography, cybernetics, etc., and having saved his country from the blockade of the German Navy during the war, in 1952 he was tried for his homosexuality and convicted of gross indecency and sexual perversion. As is easy to understand, his career ended after the trial and Alan Turing died in 1954, apparently after eating a piece of an apple poisoned with cyanide, which was labeled as suicide, although there are theories that speak rather of murder. They say that this is the origin of the bitten apple of a well-known brand of computers, although there are others who say that the apple just represents a play on words between bite and byte.
I do not know which of the two theories is true, but I prefer to recall Turing every time I see the little-apple. My humble tribute to a great man.
And now we finish. We have seen the peculiarities of the meta-analyses of diagnostic accuracy and how to assess them. Much more could be said of all the mathematics associated with their specific aspects, such as the presentation of variables, the study of publication bias, the threshold effect, etc. But that’s another story…
The unreal mixture of different parts of animals has been an obsession of so-called human beings since time immemorial. The most emblematic case is that of Chimera (which gives its name to the whole family of mixtures of different animals). This mythological being, daughter of Typhon and the viper Echidna, had a lion’s head, a goat’s body and a dragon’s tail, and breathed flames that scared off everyone who passed by. Of course, it did not help her when Bellerophon, mounted on Pegasus (another weirdo, a horse with wings), insisted on running her through with his lead spear. You see, in her strength was her downfall: the fire melted the tip of the spear inside this rare creature, which resulted in its death.
Besides Chimera, there are many more of these beings, all of them fruit of human imagination. To name a few, we can remember the unicorns (these had worse luck than Pegasus: instead of wings they had horns, one per animal), the basilisks (a kind of snake rooster of quite bad character), the gryphon (lion’s body and eagle for the rest) and all those in which part of the mixture is human, such as manticores (head of man and body of lion), centaurs, the Minotaur, Medusa (with her snakes instead of hair), mermaids…
In any case, among all the beings of this imaginary zoo, I am left with the chickenphant (gallifante in Spanish). This was a mixture of chicken and elephant that was used on TV to reward the wit of children who attended a popular contest. Millennials will have no idea what I’m talking about, but surely those who grew up in the 80s do know what I mean.
And all this came to my mind when I was reflecting on the number of chimeras that also exist among the possible types of scientific study designs, especially among observational studies. Let’s get to know three of these chickenphants of epidemiology a little: the case-control studies nested in a cohort and the case and cohort studies, to end with another particular specimen, the case-crossover or self-controlled studies.
Within observational studies, we all know the classic cohort and case-control studies, the most frequently used.
In a cohort study, a group or cohort is subjected to an exposure and followed over time to compare the frequency of appearance of the effect with that of an unexposed cohort, which acts as a control. These studies tend to be of an antegrade direction, so they allow us to measure the incidence of the disease and calculate the risk ratio between the two groups. On the other hand, a case-control study starts from two population groups, one of which presents the effect or disease under study, and compares its exposure to a specific factor with that of the group that does not have the disease and that acts as a control. Being of retrograde direction and directly selecting cases of disease, it is not possible to directly calculate the incidence density and, therefore, the risk ratio between the two groups, making the odds ratio the measure of association typical of case-control studies.
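To see how the two measures of association relate, here is a minimal sketch that computes the risk ratio and the odds ratio from the four cells of a 2x2 table. All the counts are invented and the function names are hypothetical. It only illustrates that both measures are close when the disease is rare and drift apart when it is frequent.

```python
def risk_ratio(a, b, c, d):
    """Risk ratio from a 2x2 table.

    a: exposed with disease, b: exposed without disease,
    c: unexposed with disease, d: unexposed without disease.
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return risk_exposed / risk_unexposed

def odds_ratio(a, b, c, d):
    """Odds ratio (cross-product ratio) from the same 2x2 table."""
    return (a * d) / (b * c)

# Invented rare disease: risks of 1% and 0.5%.
rare = (10, 990, 5, 995)
# Invented frequent disease: risks of 40% and 20%.
frequent = (400, 600, 200, 800)

for label, table in (("rare", rare), ("frequent", frequent)):
    print(label, round(risk_ratio(*table), 2), round(odds_ratio(*table), 2))
```

With the rare disease both figures are around 2, while with the frequent one the odds ratio clearly overestimates the risk ratio.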
The cohort study is the most solid of the two from a methodological point of view. The problem is that they usually require long follow-up periods and large cohorts, especially when the frequency of the disease studied is low, which leads to the need to manage all the covariates of this large cohort, which increases the costs of the study.
Well, for these cases in which neither the cases and controls nor the cohorts are well suited to the needs of the researcher, epidemiologists have invented a series of designs that are halfway between the two and can mitigate their shortcomings. These hybrid designs are the case-control studies nested in a cohort and the case and cohort studies to which we have already referred.
In another order of things, in classical observational studies the key point is in the selection of controls, which have to be representative of the level of exposure to the risk factor evaluated in the population from which the cases originate. An adequate selection of controls becomes even more difficult when the effect occurs abruptly. For example, if we want to know if a copious meal increases the risk of heart attack, we would have great difficulty in collecting representative controls of the population, since the risk factors can act instants before the event.
To avoid these difficulties, the principle of “you make your bed, you lie in it” was applied and the third type of chimera we have mentioned was designed, in which each participant acts, at the same time, as his own control. They are case-crossover studies, also known as self-controlled case studies.
Let’s see these weirdos, beginning with cases and controls nested in a cohort.
Suppose we have done a study in which we have used a cohort with many participants. Well, we can reuse it in a nested case-control study. We take the cohort and follow it over time, selecting as cases those subjects who develop the disease and assigning to them as controls individuals from the same cohort who have not yet presented it (although they may do so later). Thus, cases and controls come from the same cohort. It is convenient to match them taking into account confounding and time-dependent variables, such as the years they have been included in the cohort. In this way, the same subject can act as a control on several occasions and end up as a case on another, which will have to be taken into account at the time of the statistical analysis. As this seems a bit confusing, I show you a scheme of this type of study in the first attached figure.
As we are seeing how cases arise, we are sampling by incidence density, which will allow us to estimate risk ratios. This is an important difference from conventional case-control studies, in which an odds ratio is usually calculated, which can only be assimilated to the relative risk when the frequency of the effect is very low.
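The idea of sampling by incidence density can be sketched in a few lines: each time a case arises, the controls are drawn from the subjects who are still under follow-up and free of disease at that moment. This is only an illustrative sketch under simplifying assumptions (a tiny invented cohort, no matching on confounders, hypothetical names such as Subject and sample_risk_sets), not the procedure of any specific study.

```python
import random
from dataclasses import dataclass

@dataclass
class Subject:
    id: int
    exit_time: float  # time at which follow-up ends (disease, loss or end of study)
    is_case: bool     # True if follow-up ended because the disease appeared

def sample_risk_sets(cohort, controls_per_case=2, seed=0):
    """For each case, draw controls among those still at risk at its event time."""
    rng = random.Random(seed)
    cases = sorted((s for s in cohort if s.is_case), key=lambda s: s.exit_time)
    matched_sets = []
    for case in cases:
        # Still followed and disease-free when this case occurs.
        # A future case is a valid control here.
        at_risk = [s for s in cohort
                   if s.id != case.id and s.exit_time > case.exit_time]
        k = min(controls_per_case, len(at_risk))
        matched_sets.append((case, rng.sample(at_risk, k)))
    return matched_sets

# Invented cohort of four subjects.
cohort = [
    Subject(1, exit_time=2.0, is_case=True),
    Subject(2, exit_time=5.0, is_case=True),
    Subject(3, exit_time=6.0, is_case=False),
    Subject(4, exit_time=1.0, is_case=False),  # lost before the first case
]

matched_sets = sample_risk_sets(cohort)
for case, controls in matched_sets:
    print("case", case.id, "controls", sorted(c.id for c in controls))
```

In this toy cohort subject 2 serves as a control for the first case and later becomes a case, which is exactly the situation that the statistical analysis has to take into account.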
Another difference is that all the information about the cohort is collected at the beginning of the study, so there is less risk of producing the classic information biases of the case-control studies, usually of a retrospective nature.
The other type of hybrid observational design that we are going to discuss is that of the case and cohort studies. Here we also start from a large initial cohort, from which we select a more manageable sub-cohort that will be used as a comparison group. Thus, we see which individuals of the initial cohort develop the disease and compare them with the sub-cohort (regardless of whether or not they belong to the sub-cohort). You can see the outline of a case and cohort study in the second attached figure.
As in the previous example, when choosing cases over time we can estimate the density of incidence in cases and not cases, calculating the risk ratio from them. As we can imagine, this design is cheaper than conventional studies because it greatly reduces the volume of information of healthy subjects that must be handled, without losing efficiency when studying rare diseases. The problem that arises is that the sub-cohort has an overrepresentation of cases, so that the analysis of the results cannot be done as in traditional cohorts, but has its own methodology, much more complicated.
To summarize what has been said so far, we will say that the nested case-control study is more like the classic case-control study, while the case and cohort study is more like the conventional cohort study. The fundamental difference between the two is that in the nested study the sampling of the controls is done by incidence density and by pairing, so we must wait until all cases have been produced to select the entire reference population. This is not the case in the case and cohort study, which is much simpler, in which the reference population is selected at the beginning of the study.
To put an end to these hybrid studies, we will say some things about case-crossover studies. These focus on the moment in which the event occurs and try to see if there has been something unusual that has favored it, comparing the exposures in the moments immediately before the event with previous ones that serve as control. Therefore, we compare case moments with control moments, each individual acting as their own control.
For the study to be valid from the methodological point of view, the authors have to clearly describe a series of characteristic periods of time. The first is the induction period, which is the delay time that occurs from the beginning of the exposure until the production of the effect.
The second is the period of effect, which is the interval during which exposure can trigger the effect. Finally, the period of risk would be the sum of the two previous periods, from the moment of exposure to the beginning of the event.
The induction period is usually very brief, so the period of risk and the period of effect are usually equivalent. In the attached figure I show you the relationship between the three periods so that you understand it better.
It is essential that these three periods be clearly specified, since a poor estimate of the period of effect, both by excess and by defect, produces a dilution of the effect of the exposure and makes its detection more difficult.
Some of you will tell me that these studies are similar to other designs, such as matched (paired) case-control studies. The difference is that in the latter one or more similar controls are chosen for each case, while in the self-controlled studies each participant is their own control. They also look a little like cross-over clinical trials, in which all participants are subjected to intervention and control, but these are experimental studies in which the researcher intervenes in the production of the exposure, while self-controlled studies are observational studies.
Where they do resemble matched case-control studies is in the statistical analysis, except that here case moments and control moments are analyzed. Thus, it is usual to use conditional logistic regression models, the most common measure of association being the odds ratio.
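In the simplest situation, with one case moment and one control moment per person and a dichotomous exposure, the odds ratio estimated by conditional logistic regression reduces to the ratio of discordant pairs: people exposed only at the case moment divided by people exposed only at the control moment. The following sketch, with invented data and a hypothetical function name, shows that calculation. More complex designs (several control moments, covariates) need a proper conditional logistic model.

```python
def case_crossover_odds_ratio(windows):
    """Matched-pair odds ratio from (exposed_at_case_moment, exposed_at_control_moment).

    Only discordant pairs contribute. Concordant pairs carry no information
    about the association in a matched analysis.
    """
    case_only = sum(1 for case_exp, ctrl_exp in windows if case_exp and not ctrl_exp)
    control_only = sum(1 for case_exp, ctrl_exp in windows if ctrl_exp and not case_exp)
    if control_only == 0:
        raise ValueError("no subject was exposed only at the control moment")
    return case_only / control_only

# Invented data for 41 people who suffered the event.
windows = (
    [(True, False)] * 12    # exposed only just before the event
    + [(False, True)] * 4   # exposed only at the control moment
    + [(True, True)] * 5    # exposed at both moments
    + [(False, False)] * 20 # exposed at neither moment
)

print(case_crossover_odds_ratio(windows))
```

Here the 25 people with the same exposure at both moments do not contribute to the estimate, which is 12 divided by 4.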
As you can see, hybrid studies are a whole new family that threatens to grow in number and complexity. As far as I know, there are no checklists to critically appraise these types of designs, so we will have to apply judiciously the principles we apply when analyzing classical observational studies, taking into account, in addition, the particularities of each type of study.
For this, we will follow our three pillars: validity, relevance and applicability.
In the VALIDITY section we will assess the methodological quality with which the study was made. We will check that there is a clear definition of the study population, the exposure and the effect. If we use a reference cohort, it should be representative of the population and should be followed completely. On the other hand, the cases must be representative of the population of cases from which they come and the controls have to come from a population with an exposure level representative of the case population.
The measurement of the exposure and the effect must be done blindly, so that the measurement of the effect is independent of the knowledge of the level of exposure. In addition, we will analyze if attention has been paid to the temporal relationship of events between exposure and effect and if there was a relationship between the level of exposure and the degree of effect. Finally, the statistical analysis should be correct, taking into account the control of possible confounding factors. This part can be complicated by the complexity of the statistical analyses that this type of design usually requires.
In addition, as we have already mentioned, if we are facing a case-crossover study, we must ensure that there has been a correct definition of the three periods, especially the period of effect, whose inaccuracy may affect the conclusion of the study to a greater degree.
Next, we will evaluate the RELEVANCE of the results and their precision as measured by their confidence intervals. We will look for the impact measurements calculated by the authors of the study and, if they do not provide them, we will try to calculate them ourselves. Finally, we will compare the results with others previously published in the literature to see if they are concordant with the existing knowledge and what new knowledge they provide.
We will finish the critical appraisal by assessing the APPLICABILITY of the results. We will consider whether the participants are comparable to our patients and whether the conclusions are applicable to our environment.
And here we are going to finish this post. We have seen a whole new range of hybrid studies that combine the advantages of two observational studies to better adapt to situations in which classical studies are more difficult to apply. The drawback of these studies, as we have said, is that the analysis is a bit more complicated than that of the conventional studies, since a crude analysis of the results is not enough: it must be adjusted for the possibility that a participant acts as both control and case (in the nested studies) and for the overrepresentation of cases in the sub-cohort (in the case and cohort studies).
I will just finish by commenting that all I have said about the case-crossover studies refers to the so-called unidirectional case-crossover ones, studies in which there is a very precise temporal relationship between exposure and effect. For the cases in which the exposure is more sustained, other types of case-crossover studies called bidirectional case-crossover studies can be used, in which control periods are selected before and after the effect. But that is another story…
And there are other lives, but they are in you. It was already said by Paul Éluard, that surrealist of the last century who had the bad idea of visiting Cadaqués accompanied by his wife, Elena Ivanovna Diakonova, better known as Gala. He was not very clever there, but his phrase has given rise to many more things.
For example, it has been used by many writers who love the unknown, myths and mystery. I first came across the phrase when I was a young teenager because it was written as a preface to a series of science fiction books. In more recent times, it has even been related to that other incorporeal world that is cyberspace, where we spend an ever greater part of our life.
But, to help Éluard rest peacefully in his tomb at Père-Lachaise, I’ll tell you that I prefer his original idea about our two worlds, between which we can share our limited lifetime: the real world, where we do most things, and the world of the imagination, our intimate space, where we dream our most impossible realities.
You will think that today I am very metaphysical, but this is the thought that has come to my mind when I started thinking about the topic that we are going to deal with in this post. And the fact is that in the realm of medicine there are two worlds too.
We are very used to numbers and the objective results of our quantitative research. As an example, we have our revered systematic reviews, which gather the scientific evidence available on a specific health technology to assess its efficacy, safety, economic impact, etc. If we want to know if watching a lot of TV is a risk factor for suffering this terrible disease that is fildulastrosis, the best thing will be to do a systematic review of clinical trials (assuming there are any). Thus, we can calculate a multitude of parameters that, with a number, will give us a full idea of the impact of such an unhealthy habit.
But if what we want to know is how fildulastrosis affects the person who suffers it, how much unhappiness it produces, how it alters family and social life, things get a little complicated with this type of research methodology. And this is important because the social and cultural aspects related to the real context of people are increasingly valued. Luckily, there are other worlds and they are in this one. I am referring to the world of qualitative research. Today we are going to take a look (a short one) at this world.
Qualitative research is a method that studies reality in its natural context, as it occurs, in order to interpret the phenomena according to the meanings they have for the people involved. And for this it uses all kinds of sources and materials that help us to describe the routine and the meaning of problematic situations for people’s lives: interviews, life stories, images, sounds … Although all this has nothing to do with the gridded world of quantitative research, both methods are not incompatible and may even be complementary. Simply, qualitative methods provide alternative information, different and complementary to that of quantitative methods, which is useful for evaluating the perspectives of the people involved in the problem we are studying. Quantitative research is a way to address the problem deductively, while qualitative uses an inductive approach.
Logically, the methods used by qualitative research are different from those of quantitative research. In addition, they are numerous, so we will not describe them in depth. We will say that the specific methods most used are meta-synthesis, phenomenology, meta-ethnography, meta-study, meta-interpretation, grounded theory, the biographical method and the aggregative review, among others.
The most frequently used of these methods is meta-synthesis, which starts with a research question and a bibliographic search, in a similar way to what we know about systematic reviews. However, there are a couple of important differences. In quantitative research, the research question must be clearly defined, while in qualitative research this question is, by definition, flexible and is usually modified and refined as data collection progresses. The other aspect has to do with the literature search, because in qualitative research it is not so clearly defined which databases have to be used, and the filters and methodologies that documentalists have for reviews of quantitative research are not available.
Also, techniques used for collecting data are different to those we are more accustomed to in quantitative research. One of them is observation, which allows the researcher to obtain information about the phenomenon as it occurs. The paradigm of observation in qualitative research is participant observation, in which the observer interacts socially with the subjects of the medium in which the phenomenon of study occurs. For example, if we want to assess the experiences of travelers on a commercial flight, nothing better than buying a ticket and posing as another traveler, collecting all the information about comfort, punctuality, attention provided by the flight staff, quality of the snacks, etc.
Another widely used technique is the interview, in which a person asks another person or a group of people for information on a specific topic. When it is done with groups it is called, as could not be otherwise, a group interview. In this case the script is quite closed and the role of the interviewer is quite prominent, unlike in focus group discussions, in which everything can be more open, at the discretion of the group’s facilitator. Anyway, when we want to know the opinion of many people, we can resort to the questionnaire technique, which polls the opinion of large groups so that each member of the group spends a minimum time completing it, unlike focus groups, in which everyone remains for the whole interview.
The structure of a qualitative research study usually includes five fundamental steps, which can be influenced according to the methods and techniques used:
– Definition of the problem. As we have already mentioned when discussing the research question, the definition of the problem has a certain degree of provisionality and can change throughout the study, since one of the objectives may be to find out precisely if the definition of the problem is well done.
– Study design. It must also be flexible. The problem with this phase is that there are times when the proposed design is not what we see in the published article. There is still a certain lack of definition of many methodological aspects, especially when compared with the methodology of quantitative research.
– Data collection. The techniques we have discussed are used: interview, observation, reading of texts, etc.
– Analysis of the data. This aspect also differs from the quantitative analysis. Here it will be interesting to unravel the meaning structures of the collected data to determine their scope and social implications. Although methods are being devised to express the results in numerical form, the usual thing is that we do not see many figures here and, of course, nothing like those of quantitative methods.
– Report and validation of the information. The objective is to generate conceptual interpretations of the facts to get a sense of the meaning they have for the people involved. Again, and unlike with quantitative research, the goal is not to project the results of possible interventions on the environment, but to interpret facts that are at hand.
At this point, what can we say about the critical appraisal of qualitative research? Well, to give you an idea, I will tell you that there is a great variety in opinions on this subject, from those who think that it makes no sense to evaluate the quality of a qualitative study to those who try to design evaluation instruments that provide numerical results similar to those of quantitative studies. So, my friends, there is no uniform consensus on whether you should evaluate, in the first place, or on how, in the second. In addition, some people think that even studies that can be considered of low quality should be taken into account because, after all, who is able to define with certainty what a good qualitative research study is?
In general, when we make a critical appraisal of a qualitative research study, we will have to assess a series of aspects such as its integrity, complexity, creativity, validity of the data, quality of the descriptive narrative, the interpretation of the results and the scope of its conclusions. We are going to continue here our habit of resorting to the CASPe’s critical appraisal program, which provides us with a template with 10 questions to perform the critical appraisal of a qualitative study. These questions are structured in three pillars: rigor, credibility and relevance.
The questions of rigor refer to the suitability of the methods used to answer the clinical question. As usual, the first ones are elimination questions. If the answer is not affirmative, we will have resolved the controversy since, at least with this study, it will not be worthwhile to continue with our assessment. Were the objectives of the research clearly defined? It is necessary to check that the question is well specified, as well as the objective of the investigation and the justification of its necessity. Is the qualitative methodology congruent? We will have to decide if the methods used by the authors are adequate to obtain the data that will allow them to reach the objective of the investigation. Finally, is the research method used suitable for achieving the objectives? Researchers must explicitly state the method they have used (meta-synthesis, grounded theory …). In addition, the specified method must match the one used, which sometimes may not be the case.
If we have answered affirmatively to these three questions, it will be worth continuing and we will move on to the detailed questions. Is the participant selection strategy consistent with the research question and the method used? It must be justified why the selected participants were the most suitable, as well as explain who called them, where, etc. Are data collection techniques used congruent with the research question and the method used? The technique of collecting data (for example, discussion groups) and the registration format will have to be specified and justified. If the collection strategy is modified throughout the study, the reason for this will have to be justified.
Has the relationship between the researcher and the object of research (reflexivity) been considered? It will be necessary to consider if the involvement of the researcher in the process may have biased the data obtained and if this has been taken into account when designing the data collection, the selection of the participants and the scope of the study. To finish with the assessment of the rigor of the work, we will ask ourselves if the ethical aspects have been taken into account. It will be necessary to consider aspects shared with quantitative research, such as informed consent, approval by an ethics committee or confidentiality of data, as well as specific aspects about the effect of the study on participants before and after its completion.
The next block of two questions has to do with the credibility of the study, which is related to the ability of the results to represent the phenomenon from the subjective point of view of the participants. The first question makes us think about whether the analysis of the data was sufficiently rigorous. The entire analysis process should be described: the categories that may have arisen from the collected data, whether the subjectivity of the researcher has been assessed and how data that could be contradictory have been handled. In the case that fragments of testimonies of participants are presented to elaborate the results, the reference of their origin must be clearly specified. The second question has to do with whether the results were presented clearly. They should be presented in a detailed and understandable manner, showing their relationship to the research question. We will review at this point the strategies adopted to ensure the credibility of the results, as well as whether the authors have reflected on the limitations of the study.
We will finish the critical assessment by answering the only question of the block that has to do with the relevance of the study, which is nothing more than its usefulness or applicability to our clinical practice. Are the results of the investigation applicable? We will have to assess how the results contribute to our practice, how they contribute to the existing knowledge and in what contexts they may be applicable.
And here we are going to leave it for today. You have already seen that we have taken a look into a world quite different from the one we are more used to, in which we have to change a little the mentality of how to pose and study problems. Before leaving, I have to warn you, as in previous posts, not to look for fildulastrosis, because you will not find this disease anywhere. Actually, fildulastrosis is an invention of mine in homage to a very illustrious character, sadly deceased: Forges. Antonio Fraguas (his nom de guerre comes from the English translation of his last name) was, in my humble opinion, the best graphic humorist I can remember. For many years I began the day looking at Forges’ daily cartoon, so for some time now there have been mornings when one does not know how to start the day. Forges invented many words of his own and I really liked his percutoria’s fildulastro, which had the defect of escalporning now and then. Hence comes my fildulastrosis, so from here I thank him and offer him this little tribute.
And now we’re leaving. We have not talked much about other methods of qualitative research such as grounded theory, meta-ethnography, etc. Those interested can find literature where they are explained better than I could do it. And, of course, as in quantitative research, there are also ways to combine qualitative research studies. But that is another story…
Yes, as the illustrious Francisco de Quevedo y Villegas once said, powerful gentleman is Don Dinero (Mr. Money). A great truth because, who, purely in love, does not humble himself before the golden yellow? And even more in a mercantilist and materialist society like ours.
But the problem is not that we are materialistic and just think about money. The problem is that nobody believes they have all the money they need. Even the wealthiest would like to have much more money. And many times, it is true, we do not have enough money to cover all our needs as we would like.
And that does not only happen at the individual level, but also at the level of social groups. Any country has a limited amount of money, which is why you cannot spend everything you want and you have to choose where you spend your money. Let’s think, for example, of our healthcare system, in which new health technologies (new treatments, new diagnostic techniques, etc.) are getting better … and more expensive (sometimes, even bordering on obscenity). If we are spending at the limit of our possibilities and want to apply a new treatment, we only have two choices: either we increase our wealth (where do we get the money from?) or we stop spending it on something else. There would be a third one that is used frequently, even if it is not the right thing to do: spend what we do not have and pass on the debt to whoever comes next.
Yes, my friends, the saying that health is priceless does not hold up economically. Resources are always limited and we must all be aware of the so-called opportunity cost of a product: whatever it costs is money that we will have to stop spending on something else.
Therefore, it is very important to properly evaluate any new health technology before deciding its implementation in the health system, and this is why the so-called economic evaluation studies have been developed, aimed at identifying what actions should be prioritized to maximize the benefits produced in an environment with limited resources. These studies are a tool to assist in decision-making, but are not meant to replace it, so other elements have to be taken into account, such as justice, equity and freedom of choice.
The economic evaluation (EV) studies encompass a whole set of methods and specific terminology that is usually little known by those who are not dedicated to the evaluation of health technologies. Let’s briefly review their characteristics and then give some recommendations on how to make a critical appraisal of these studies.
The first thing is to explain the two characteristics that define an EV. These are the measurement of the costs and benefits of the interventions (the first one) and the choice or comparison between two or more alternatives (the second one). These two features are essential to say that we are facing an EV, which can be defined as the comparative analysis of different health interventions in terms of costs and benefits. The methodology of development of an EV will have to take into account a number of aspects that we list below and that you can see summarized in the attached table.
– Objective of the study. It will be determined if the use of a new technology is justified in terms of the benefits it produces. For this, a research question will be formulated with a structure similar to that of other types of epidemiological studies.
– Perspectives of the analysis. It is the point of view of the person or institution to whom the analysis is targeted, which will include the costs and benefits that must be taken into account from the positioning chosen. The most global perspective is that of the Society, although the one of the funders, that of specific organizations (for example, hospitals) or that of patients and families can also be adopted. The most usual is to adopt the perspective of the funders, sometimes accompanied by the social one. If so, both must be well differentiated.
– Time horizon of the analysis. It is the period of time during which the main economic and health effects of the intervention are evaluated.
– Choice of the comparator. It is a crucial point to be able to determine the incremental effectiveness of the new technology, and the importance of the study for the decision makers will largely depend on it. In practice, the most commonly used comparator is the alternative in habitual use (the gold standard), although the new technology can sometimes be compared with the non-treatment option, which must be justified.
– Identification of costs. Costs are usually calculated from the total amount of the resource consumed and the monetary value of the resource unit (you know, as the friendly hostesses of an old TV contest said: 25 responses, at 5 pesetas each, 125 pesetas). The costs are classified as direct and indirect and as health and non-health costs. The direct ones are those clearly related to the illness (hospitalization, laboratory tests, laundry and kitchen, etc.), while the indirect ones refer to productivity or its loss (work functionality, mortality). On the other hand, health costs are those related to the intervention (medicines, diagnostic tests, etc.), while non-health costs are those that the patient or other entities have to pay or those related to productivity.
What costs will be included in an EV? It will depend on the intervention being analyzed and, especially, on the perspective and time horizon of the analysis.
– Quantification of costs. It will be necessary to determine the amount of resources used, either individually or in aggregate, depending on the information available.
– Cost assessment. They will be assigned a unit price, specifying the source and the method used to assign this price. When the study covers long periods of time, it must be borne in mind that things do not cost the same over the years. If I tell you that I knew a time when you went out at night with a thousand pesetas (the equivalent of about 6 euros now) and came back home with money in your pocket, you will think it is another of my frequent ravings, but I swear it is true.
To take this into account, a weighting factor or discount rate is used, which is usually between 3% and 6%. For those who are curious, the general formula is CV = FV / (1 + d)^n, where CV is the current value, FV is the future value, n is the number of years and d is the discount rate.
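For those who prefer to see the discount formula at work, here is a minimal sketch in Python. The function name and the figures (a cost of 1000 monetary units incurred 10 years from now) are made up for the example; only the formula comes from the text.

```python
def present_value(future_value: float, discount_rate: float, years: int) -> float:
    """Current value of a future amount: CV = FV / (1 + d)^n."""
    return future_value / (1 + discount_rate) ** years


# Hypothetical example: a cost of 1000 that will be incurred in 10 years,
# discounted at several rates within the range usually tested.
for rate in (0.0, 0.03, 0.05):
    value = present_value(1000.0, rate, 10)
    print(f"discount rate {rate:.0%}: current value {value:.2f}")
```

With a rate of 0% nothing changes, and the higher the rate, the less a future cost weighs in today's accounts.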
– Identification, measurement and evaluation of results. The benefits obtained can be classified into health and non-health ones. Health benefits are clinical consequences of the intervention, generally measured from a point of view of interest to the patient (improvement of blood pressure figures, deaths avoided, etc.). On the other hand, the non-health ones are divided according to whether they produce improvements in productivity or in quality of life.
The first ones are easy to understand: productivity can improve because people go back to work earlier (shorter hospitalization, shorter convalescence) or because they work better as their health conditions improve. The second ones are related to the concept of health-related quality of life, which reflects the impact of the disease and its treatment on the patient.
Health-related quality of life can be estimated using a series of questionnaires on the preferences of patients, summarized in a single score that, together with the quantity of life, will provide us with the quality-adjusted life year (QALY).
To assess the quality of life we refer to the utilities of the health states, which are expressed with a numerical value between 0 and 1, in which 0 represents the utility of the state of death and 1 that of perfect health. In this sense, a year of life lived in perfect health is equivalent to 1 QALY (1 year of life x 1 utility = 1 QALY). Thus, to determine the value in QALYs we will multiply the value associated with a state of health by the years lived in that state. For example, half a year in perfect health (0.5 years x 1 utility) would be equivalent to one year with some ailments (1 year x 0.5 utility).
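The arithmetic of QALYs described above can be sketched in a few lines of Python. The function name and the way of representing the health states (pairs of years and utility) are my own choices for the illustration, not part of any standard tool.

```python
def qalys(states):
    """Sum of years lived in each health state multiplied by its utility.

    `states` is a list of (years, utility) pairs; utility must lie
    between 0 (death) and 1 (perfect health).
    """
    total = 0.0
    for years, utility in states:
        if not 0.0 <= utility <= 1.0:
            raise ValueError("utility must be between 0 and 1")
        total += years * utility
    return total


# The example from the text: half a year in perfect health is worth
# the same as one year lived with a utility of 0.5.
print(qalys([(0.5, 1.0)]))  # 0.5
print(qalys([(1.0, 0.5)]))  # 0.5
```

A patient who goes through several states simply adds them up: two years with utility 0.75 followed by three years with utility 0.5 give 3 QALYs.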
– Type of economic analysis. We can choose between four types of economic analysis.
The first is the cost minimization analysis. This is used when there is no difference in effect between the two options compared, in which case it is enough to compare the costs and choose the cheapest. The second is the cost-effectiveness analysis. This is used when the interventions are similar and determines the relationship between costs and consequences of interventions in units usually used in clinical practice (decrease in days of admission, for example). The third is the cost-utility analysis. It is similar to cost-effectiveness, but the effectiveness is adjusted for quality of life, so the outcome is the QALY. Finally, the fourth method is the cost-benefit analysis. In this type everything is measured in monetary units, which we usually understand quite well, although it can be a little complicated to express gains in health with them.
– Analysis of results. The analysis will depend on the type of economic analysis used. In the case of cost-effectiveness studies, it is typical to calculate two measures, the average cost-effectiveness (dividing the cost by the benefit) and the incremental cost-effectiveness (the extra cost per unit of additional benefit obtained with one option with respect to the other). This last parameter is important, since it marks a limit of efficiency for the intervention, which will be chosen or not depending on how much we are willing to pay for an additional unit of effectiveness.
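To make the two measures less abstract, here is a small Python sketch. All the figures (costs, QALYs and the willingness-to-pay threshold) are invented for the example, and the function names are mine.

```python
def average_cost_effectiveness(cost: float, effect: float) -> float:
    """Cost divided by the benefit obtained with a single option."""
    return cost / effect


def incremental_cost_effectiveness(cost_new, effect_new, cost_old, effect_old):
    """Extra cost per additional unit of benefit of the new option over the old one."""
    delta_effect = effect_new - effect_old
    if delta_effect == 0:
        raise ZeroDivisionError("both options have the same effect; compare costs only")
    return (cost_new - cost_old) / delta_effect


# Hypothetical data: the new treatment costs 12000 and yields 6 QALYs,
# the usual one costs 8000 and yields 5 QALYs.
icer = incremental_cost_effectiveness(12000, 6, 8000, 5)
print(average_cost_effectiveness(12000, 6))  # 2000.0 per QALY
print(average_cost_effectiveness(8000, 5))   # 1600.0 per QALY
print(icer)                                  # 4000.0 per additional QALY

# Hypothetical threshold: what we are willing to pay for one more QALY.
willingness_to_pay = 30000
print("adopt" if icer <= willingness_to_pay else "reject")
```

Note that when both options have the same effect the incremental ratio makes no sense, which is precisely the situation in which a cost minimization analysis is enough.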
– Sensitivity analysis. As with other types of designs, EVs are not free of uncertainty, generally due to lack of reliability of the available data. Therefore, it is convenient to evaluate the degree of uncertainty through a sensitivity analysis to check the degree of stability of the results and how they can be modified if the main variables vary. An example may be the variation of the discount rate chosen.
There are five types of sensitivity analysis: univariate (the study variables are modified one by one), multivariate (two or more are modified), extremes (we put ourselves in the most optimistic and most pessimistic scenarios for the intervention), threshold (identifies if there is a critical value above or below which the choice is reversed towards one or the other of the interventions compared) and probabilistic (assuming a certain probability distribution for the uncertainty of the parameters used).
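As an illustration of the simplest case, the univariate analysis, the following Python sketch varies only the discount rate and recalculates the total cost. The cost stream is invented, and I am assuming that the costs of the first year (year 0) are not discounted; each later cost is discounted as FV / (1 + d)^n.

```python
def discounted_total(annual_costs, discount_rate):
    """Sum of yearly costs, each discounted as FV / (1 + d)^n.

    Assumption for this sketch: the first cost occurs in year 0
    and is therefore not discounted.
    """
    return sum(cost / (1 + discount_rate) ** year
               for year, cost in enumerate(annual_costs))


def one_way_sensitivity(annual_costs, rates):
    """Recompute the result changing a single parameter: the discount rate."""
    return {rate: discounted_total(annual_costs, rate) for rate in rates}


# Hypothetical intervention that costs 1000 per year for 5 years.
annual_costs = [1000.0] * 5
results = one_way_sensitivity(annual_costs, [0.0, 0.03, 0.05])
for rate, total in results.items():
    print(f"discount rate {rate:.0%}: total cost {total:.2f}")
```

If the conclusion of the study holds across the whole range of rates, we can be a little calmer about that source of uncertainty.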
– Conclusion. This is the last section of the development of an EV. The conclusions should take into account two aspects: internal validity (correct analysis for patients included in the study) and external validity (possibility of extrapolating the conclusions to other groups of similar patients).
As we said at the beginning of this post, EVs have a lot of jargon and their own methodological aspects, which makes it difficult for us to appraise them critically and understand their content correctly. But let no one get discouraged, we can do it by relying on our three basic pillars: validity, relevance and applicability.
There are multiple guides that systematically explain how to assess an EV. Perhaps the first to appear was that of the British NICE (National Institute for Clinical Excellence), but subsequently others have arisen such as that of the Australian PBAC (Pharmaceutical Benefits Advisory Committee) and that of the Canadian CADTH (Canadian Agency for Drugs and Technologies in Health). In Spain we could not be less and the Laín Entralgo’s Health Technology Assessment Unit also developed an instrument to determine the quality of an EV. This guide establishes recommendations for 17 domains that closely resemble what we have said so far, completing with a checklist to facilitate the assessment of the quality of the EV.
Anyway, as my usual sufferers know, I prefer to use a simpler checklist that is available on the Internet for free, which is none other than the tool provided by the CASPe group and that you can download from their website. We are going to follow these 11 CASPe’s questions, although without losing sight of the recommendations of the Spanish guide that we have mentioned.
As always, we will start with the VALIDITY, trying to answer first two elimination questions. If the answer is negative, we can leave the study aside and dedicate ourselves to another more productive task.
Is the question or objective of the evaluation well defined? The research question should be clear and define the target population of the study. There will also be three fundamental aspects that should be clear in the objective: the options compared, the perspective of the analysis and the time horizon. Is there a sufficient description of all possible alternatives and their consequences? The actions to follow must be perfectly defined in all the compared options, including who, where and to whom each action is applied. The usual will be to compare the new technology, at least, with the one of habitual use, always justifying the choice of the comparison technology, especially if this is the non-treatment one (in the case of pharmacological interventions).
If we have been able to answer these two questions affirmatively, we will move on to the four questions of detail. Is there evidence of the effectiveness of the intervention or of the evaluated program? We will see if there are trials, reviews or other previous studies that prove the effectiveness of the interventions. Think of a cost minimization study, in which we want to know which of the two options, both effective, is cheaper. Logically, we will need prior evidence of this effectiveness. Are the effects of the intervention (or interventions) identified, measured and appropriately valued or considered? These effects can be measured with simple units, often derived from clinical practice, with monetary units and with more elaborate calculation units, such as the QALYs mentioned above. Are the costs incurred by the intervention (or interventions) identified, measured and appropriately valued? The resources used must be well identified and measured in the appropriate units. The method and source used to assign the value to the resources used must be specified, as we have already mentioned. Finally, were discount rates applied to the costs of the intervention or interventions? And to the effects? As we already know, this is fundamental when the time horizon of the study is prolonged. In Spain, it is recommended to use a discount rate of 3% for basic resources. When doing sensitivity analysis this rate will be tested between 0% and 5%, which will allow comparison with other studies.
Once we have assessed the internal validity of our EV, we will answer the questions regarding the RELEVANCE of the results. Firstly, what are the evaluation results? We will review the units that have been used (QALYs, monetary costs, etc.) and whether an incremental benefit analysis has been carried out, in appropriate cases. The second question in this section refers to whether an adequate sensitivity analysis has been carried out to know how the results would vary with changes in costs or effectiveness. In addition, it is recommended that the authors justify the modifications made with respect to the base case, the choice of the variables that are modified and the method used in the sensitivity analysis. Our Spanish guide recommends carrying out, whenever possible, a probabilistic sensitivity analysis, detailing all the statistical tests performed and the confidence intervals of the results.
Finally, we will assess the APPLICABILITY or external validity of our study by answering the last three questions. Would the program be equally effective in your environment? It will be necessary to consider if the target population, the perspective, the availability of technologies, etc., are applicable to our clinical context. Finally, we must reflect on whether the costs would be transferable to our environment and if it would be worth applying the intervention in our environment. This may depend on social, political, economic, population, etc. differences between our environment and that in which the study has been carried out.
And with this we are going to finish this post for today. Even if your mind is blown after all we have said, you can believe me if I tell you that we have done nothing but scratch the surface of this stormy world of economic evaluation studies. We have not discussed anything, for example, about the statistical methods that can be used in sensitivity analyses, which can become complicated, nor about the studies using modeling, employing techniques only available to privileged minds, like Markov chains, stochastic models or discrete event simulation models, to name a few. Neither have we talked about the type of studies on which economic evaluations are based. These can be experimental or observational studies, but they have a series of peculiarities that differentiate them from other studies of similar design, but with different functions. This is the case of clinical trials that incorporate an economic evaluation (also known as piggy-back clinical trials), which tend to have a more pragmatic design than conventional trials. But that is another story…
What a mess these two elements make when they are left loose and come together! In this story, almost as old as me (please, do not run to look at what year the movie was made) poor King Kong, who must have traveled more than Tarzan, leaves his Skull Island to defend a village from an evil giant octopus and drinks a potion that leaves him sound asleep. Then, some Japanese gentlemen seized the opportunity to take him to their country. I, who have visited Japan, can imagine the effect it produced on the poor monkey when he woke up, so it had no choice but to escape, with the misfortune of meeting Godzilla, who had also escaped from an iceberg where it had been previously frozen. And there they are bundled and the fight begins, stones over here, atomic rays over there, until the thing gets out of control and finally King Kong is going to attack Tokyo, I do not remember exactly for what reason. I swear I have not taken any hallucinogenic, the film is like that and I will not reveal more for not spoiling the end in the incredible case that you want to see the film after what I have told you. What I do not know is what the screenwriters would have taken before planning this story.
At this point you will be thinking about how today’s post may be related to this story. Well, the truth is that it has nothing to do with what we are going to talk about, but I could not think of a better way to start. Well, it may actually be related, because today we are going to talk about a family of monsters within epidemiological studies: the ecological studies. It’s funny that when you read something about ecological studies, it always starts by saying that they are simple. Well, I do not think so. The truth is that they have a lot to get our teeth into and we are going to try to explain them in a simple way. I thank my friend Eduardo (to whom I dedicate this post) for the effort he made to describe them intelligibly. Thanks to him I could understand them. Well… a little bit.
Ecological studies are observational studies that have the peculiarity that the study population are not individual subjects, but grouped subjects (in conglomerates), so the level of inference of their estimates is also aggregated. They tend to be cheap and quick to perform (I suppose that hence its supposed simplicity), since they usually use data from secondary sources already available, and are very useful when it is not possible to measure the exposure at the individual level or when the measurement of the effect can only be measured at the population level (such as the results of a vaccination campaign, for example).
The problem comes when we want to make inferences at the individual level based on their results, since they are subject to a series of biases that we will discuss later on. In addition, since they are usually descriptive studies with historical data, it can be difficult to determine the temporal sequence between the exposure and the effect studied.
We will look at the specific characteristics in relation to three aspects of their methodology: types of variables and analyses, types of studies and biases.
Ecological variables are classified into aggregate and environmental variables (also called global variables). The aggregate ones show a summary of individual observations. They are usually averages or proportions, such as the mean age at which the first King Kong movie is seen or the rate of geeks for every 1000 moviegoers, to name two absurd examples.
On the other hand, environmental measures are characteristic of a specific place. These can have a parallelism at an individual level (for example, the levels of environmental pollution, related to the crap that each swallows) or be attributes of groups without equivalence at the individual level (such as water quality, to say the least).
As for the analysis, it can be done at the aggregate level, using data from groups of participants, or at the individual level, but better without mixing the two types. Moreover, if data of both types is collected, it will be more convenient to transform them into a single level, the simplest being to aggregate the individual data, although it can also be done the other way around and, even, make an analysis in the two levels with techniques of hierarchical multilevel statistics, only afforded by a few privileged minds.
Obviously, the level of inference we want to apply will depend on what our objective is. If we want to study the effects of a risk factor at the individual level, the inference will be individual. An example would be to study the relationship between the number of hours television is watched and the incidence of brain cancer. On the other hand, and following a very pediatric example, if we want to know the effectiveness of a vaccine, the inferences will be made in an aggregated form from the data of vaccination coverage in the population. And to finish curling the curl, we can measure an exposure factor of the two forms, individual and grouped. For example, density of Mexican restaurants in a population and frequency of antacids intake. In this case we would make a contextual inference.
Regarding the type of ecological studies, we can classify them according to the exposure method and the grouping method.
According to the exposure method, the thing is relatively simple and we can find two types of studies. If we do not measure the exposure variable, or we do it partially, we talk about exploratory studies. In the opposite case, we will find ourselves before an analytical study.
According to the grouping method, we can consider three types: multiple-group (when multiple zones are selected), temporal trend (there is measurement over time) and mixed (combination of both).
The complexity begins when the two dimensions (exposure and grouping) are combined, since then we can find ourselves before a series of more complex designs. Thus, multiple group studies can be exploratory (the exposure factor is not measured, but the effect is measured) or analytical studies (the most frequent, we measure both here). The studies of temporal tendency, to not be less, can also be exploratory and analytical, in a similar way to the previous ones, but with a temporal trend. Finally, there will be mixed studies that compare the temporal trends of several geographical areas. Simple, isn’t it?
Well, this is nothing compared to the complexity of the statistical techniques used in these studies. Until recently the analyses were very simple and based on measures of association or linear correlation, but in recent times we have seen the development of numerous techniques based on regression models and more exotic things such as log-linear multiplicative models or Poisson regression. The merit of all these techniques is that, based on the grouped measures, they allow us to estimate how many exposed or unexposed subjects have the effect, thus allowing the calculation of rates, attributable fractions, etc. Do not fear, we will not go into detail, but there is available bibliography for those who want to keep warm from head to feet.
To finish with the methodological aspects of the ecological studies, we will list some of its most characteristic biases, favored by the fact of using aggregate analysis units.
The most famous of all is the ecological bias, also known as ecological fallacy. This occurs when the grouped measure does not reflect the biological effect at the individual level, in such a way that the individual inference made is erroneous. This bias became famous with the study published in the New England Journal of Medicine that found a relationship between chocolate consumption and Nobel prizes. However amusing this example may be, the ecological fallacy is the main limitation of this type of studies.
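A toy example may help to see how treacherous the ecological fallacy can be. In the following Python sketch the data are completely invented (three groups with three subjects each) and deliberately extreme: at the group level, exposure and effect go hand in hand, but within every group the relationship between individuals is exactly the opposite.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson linear correlation coefficient between two lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented data: (exposure, effect) for each individual of three groups.
groups = {
    "A": ([1, 2, 3], [3, 2, 1]),
    "B": ([4, 5, 6], [6, 5, 4]),
    "C": ([7, 8, 9], [9, 8, 7]),
}

# Aggregate (ecological) level: one mean exposure and one mean effect per group.
mean_exposure = [sum(x) / len(x) for x, _ in groups.values()]
mean_effect = [sum(y) / len(y) for _, y in groups.values()]
ecological_r = pearson(mean_exposure, mean_effect)

# Individual level, within each group.
within_r = {name: pearson(x, y) for name, (x, y) in groups.items()}

print("correlation between group means:", ecological_r)
print("correlation within each group:", within_r)
```

Whoever looked only at the group means would conclude that more exposure means more effect for each person, which is just the opposite of what happens inside every group.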
Another bias that has some peculiarities in this type of studies is confounding bias. In studies dealing with individual units, confounding occurs when a third variable is related to both the exposure and the effect, without being part of the causal relationship between the two. This ménage à trois is a bit more complex in ecological studies. A variable can behave as a confounder at the ecological level but not at the individual level and, vice versa, it is possible that confounding factors at the individual level do not produce confounding at the aggregate level. In any case, as in the rest of the studies, we must try to control the confounding factors, for which there are two fundamental approaches.
The first one is to include the possible confounding variables in the mathematical model as covariates and perform a multivariate analysis, with which it will be more complicated to study the effect. The second one is to adjust or standardize the rates of the effect by the confounding variables and fit the regression model with the adjusted rates. To be able to do this it is essential that all the variables introduced in the model are also adjusted for the same confounding variable and that the covariances of the variables are known, which does not always happen. In any case, and not to discourage anyone, many times we cannot be sure that the confounding factors have been adequately controlled, even using the most recent and sophisticated multilevel analysis techniques, since the origin can be in unknown characteristics of the distribution of data among groups.
Other gruesome aspects of ecological studies are the temporal ambiguity bias (as we have already commented, it is often difficult to ensure that exposure precedes the effect) and collinearity (difficulty in assessing the effects of two or more exposures that can occur simultaneously). In addition, although information biases are not specific to ecological studies, these studies are very susceptible to them.
You can see that I was right at the beginning when I told you that ecological studies seem to me a lot of things, but not simple. In any case, it is convenient to understand what their methodology is based on, because, with the development of new analysis techniques, they have gained in prestige and power and it is more than possible that we meet them more and more frequently.
But do not despair, the important thing for us, consumers of medical literature, is to understand how they work so that we can make a critical appraisal of the articles when we deal with them. Although, as far as I know, there are no checklists as structured as CASP has for other designs, the critical appraisal will be done following the usual general scheme according to our three pillars: validity, relevance and applicability.
The study of VALIDITY will be done in a similar way to other types of cross-sectional observational studies. The first thing will be to check that there is a clear definition of the population and the exposure or effect under study. The units of analysis and their level of aggregation will have to be clearly specified, as well as the methods of measuring the effect and exposure, the latter, as we already know, only in analytical studies.
The sample of the study should be representative, for which we will have to review the selection procedures, the inclusion and exclusion criteria and its size. These data will also influence the external validity of the results.
As in any observational study, the measurement of exposure and effect should be done blindly and independently, using valid instruments. The authors must present the data completely, taking into account if there are losses or out-of-range values. Finally, there must be a correct analysis of the results, with a control of the typical biases of these studies: ecological, information, confounding, temporal ambiguity and collinearity.
In the RELEVANCE section we can begin with a quantitative assessment, summarizing the most important result and reviewing the magnitude of the effect. We must search for or calculate ourselves, if possible, the most appropriate impact measures: differences in incidence rates, attributable fraction in exposed, etc. If the authors do not offer this data, but do provide the regression model, it is possible to calculate the impact measures from the coefficients that multiply the independent variables of the model. I’m not going to put the list of formulas here so as not to make this post even more unfriendly, but you know that they exist in case one day you need them.
Then we will make a qualitative assessment of the results, trying to assess the clinical interest of the main outcome measure, the interest of the effect size and the impact it may have for the patient, the system or the Society.
We will finish this section with a comparative assessment (looking for similar studies and comparing the main outcome measure and other alternative measures) and an assessment of the relationship between benefits, risks and costs, as we would do with any other type of study.
Finally, we will consider the APPLICABILITY of the results in clinical practice, taking into account aspects such as adverse effects, economic cost, etc. We already know that the fact that the study is well done does not mean that we have to apply it obligatorily in our environment.
And here we are going to leave it for today. When you read or do an ecological study, be careful not to fall into the temptation of drawing causality conclusions. Regardless of the pitfalls that the ecological fallacy may have for you, ecological studies are observational, so they can be used to generate hypotheses of causality, but not to confirm them.
And now we’re leaving. I did not tell you who won the fight between King Kong and Godzilla so as not to be a spoiler, but surely the smartest of you have already imagined it. After all, and to its disgrace, only one of the two later traveled to New York. But that is another story…
How I wish I could predict the future! And not only to win millions in the lottery, which is the first thing you can think of. There are more important things in life than money (or so some say), decisions that we make based on assumptions that end up not being fulfilled and that complicate our lives to unsuspected limits. We have all thought at some point about “if you lived twice…” I have no doubt: if I met the genie of the lamp, one of the three wishes I would ask for would be a crystal ball to see the future.
And we could also do well in our work as doctors. In our day to day we are forced to make decisions about the diagnosis or prognosis of our patients and we always do it on the swampy terrain of uncertainty, always assuming the risk of making some mistake. We, especially when we are more experienced, estimate consciously or unconsciously the likelihood of our assumptions, which helps us in making diagnostic or therapeutic decisions. However, it would be good to also have a crystal ball to know more accurately the evolution of the patient’s course.
The problem, as with other inventions that would be very useful in medicine (like the time machine), is that nobody has yet managed to manufacture a crystal ball that really works. But let’s not lose heart. We cannot know for sure what will happen, but we can estimate the probability that a certain result will occur.
For this, we can use all those variables related to the patient that have a known diagnostic or prognostic value and integrate them to perform the calculation of probabilities. Well, doing such a thing would be the same as designing and applying what is known as a clinical prediction rule (CPR).
Thus, if we get a little formal, we can define a CPR as a tool composed of a set of variables of clinical history, physical examination and basic complementary tests, which provides us with an estimate of the probability of an event, suggesting a diagnosis or predicting a concrete response to a treatment.
The critical appraisal of an article about a CPR shares similar aspects with those of the ones about diagnostic tests and also has specific aspects related to the methodology of its design and application. For this reason, we will briefly look at the methodological aspects of CPRs before entering into their critical assessment.
In the process of developing a CPR, the first thing to do is to define it. The four key elements are the study population, the variables that we will consider as potentially predictive, the gold or reference standard that classifies whether the event we want to predict occurs or not and the criterion of assessment of the result.
It must be borne in mind that the variables we choose must be clinically relevant, they must be collected accurately and, of course, they must be available at the time we want to apply the CPR for decision making. It is advisable not to fall into the temptation of putting variables everywhere and endlessly since, apart from complicating the application of the CPR, it can decrease its validity. In general, it is recommended that for every variable that is introduced in the model there should have been at least 10 events that we want to predict (the design is made in a certain sample whose components have the variables but only a certain number have ended up presenting the event to predict).
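The rule of thumb of 10 events per variable is easy to turn into arithmetic. The following is only a minimal sketch: the function name and the sample figures are invented for illustration, and the threshold of 10 is simply the one quoted above.

```python
def max_predictors(n_events: int, events_per_variable: int = 10) -> int:
    """Largest number of candidate predictors allowed by the rule of thumb."""
    if n_events < 0 or events_per_variable <= 0:
        raise ValueError("n_events must be >= 0 and events_per_variable > 0")
    return n_events // events_per_variable


# Hypothetical derivation sample: 1000 patients, of whom 80 presented the event.
print(max_predictors(80))
```

Note that what counts is the number of events, not the size of the sample: with 80 events among 1000 patients, the rule would support about 8 variables, however many patients were recruited.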
I would also like to highlight the importance of the gold standard. There must be a diagnostic test or a set of well-defined criteria that allow us to clearly define the event we want to predict with the CPR.
Finally, it is convenient that those who collect the variables during this definition phase are unaware of the results of the gold standard, and vice versa. The absence of blinding decreases the validity of the CPR.
The next step is the derivation or design phase itself. This is where we apply the statistical methods that allow us to include the predictive variables and exclude those that are not going to contribute anything. We will not go into the statistics; suffice it to say that the most commonly used methods are those based on logistic regression, although discriminant analysis, survival analysis and even more exotic approaches based on discriminant risks or neural networks, within the reach of only a few virtuosos, can also be used.
In logistic regression models, the event will be the dichotomous dependent variable (it happens or it does not happen) and the other variables will be the predictive or independent variables. Thus, the coefficient that multiplies each predictive variable will be the natural logarithm of its adjusted odds ratio. In case anyone has not understood, the adjusted odds ratio for each predictive variable is calculated by raising the number “e” to the value of the coefficient of that variable in the regression model.
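For those who prefer to see it with numbers, here is a small sketch of how a coefficient becomes an adjusted odds ratio (raising “e” to the coefficient) and how the whole model becomes an individual probability. The intercept, the coefficients and the patient are made up for the example; they do not come from any real rule.

```python
import math


def odds_ratio(coefficient: float) -> float:
    """Adjusted odds ratio of a predictor: e raised to its coefficient."""
    return math.exp(coefficient)


def predicted_probability(intercept, coefficients, values):
    """Probability of the event given by a logistic regression model."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, values))
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical model with two predictors (say, fever yes/no and age in decades).
intercept = -3.0
coefficients = [0.7, 0.2]

for b in coefficients:
    print(f"coefficient {b:.2f} -> adjusted odds ratio {odds_ratio(b):.2f}")

# Hypothetical patient: fever present (1), 5 decades old.
print(f"probability of the event: {predicted_probability(intercept, coefficients, [1, 5]):.2f}")
```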
The usual thing is that a certain score is assigned on a scale according to the weight of each variable, so that the total sum of points of all the predictive variables will allow us to classify the patient in a specific range of probability of the event. There are also other more complex methods using regression equations, but in the end you always get the same thing: an individualized estimate of the probability of the event in a particular patient.
With this process we perform the categorization of patients in homogenous groups of probability, but we still need to know if this categorization is adjusted to reality or, what is the same, what is the capacity of discrimination of the CPR.
The overall validity or discrimination capacity of the CPR will be assessed by contrasting its results with those of the gold standard, using techniques similar to those used to assess the power of diagnostic tests: sensitivity, specificity, predictive values and likelihood ratios. In addition, in cases where the CPR provides a quantitative estimate, we can resort to ROC curves, since the area under the curve will represent the global validity of the CPR.
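As an illustration, these indicators can be obtained from the 2x2 table that crosses the result of the rule with that of the gold standard. This is just a sketch with invented counts and scores; the area under the ROC curve is computed here as the proportion of (event, no event) pairs in which the patient with the event receives the higher score, counting ties as one half.

```python
def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Indicators from a 2x2 table of rule result versus gold standard."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        # Undefined (division by zero) if specificity is exactly 1.
        "lr_positive": sensitivity / (1 - specificity),
        "lr_negative": (1 - sensitivity) / specificity,
    }


def auc(scores_event, scores_no_event) -> float:
    """Area under the ROC curve as a proportion of concordant pairs."""
    wins = 0.0
    for e in scores_event:
        for n in scores_no_event:
            if e > n:
                wins += 1.0
            elif e == n:
                wins += 0.5
    return wins / (len(scores_event) * len(scores_no_event))


# Hypothetical counts: 100 patients with the event and 200 without it.
for name, value in diagnostic_metrics(tp=80, fp=30, fn=20, tn=170).items():
    print(f"{name}: {value:.3f}")

# Hypothetical scores given by the rule to patients with and without the event.
print(f"AUC: {auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]):.3f}")
```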
The last step of the design phase will be the calibration of the CPR, which is nothing more than checking its good behavior throughout the range of possible results.
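One simple way of looking at calibration, among several possible ones, is to order the patients by their predicted probability, split them into groups and compare, within each group, the mean predicted probability with the proportion of events actually observed. The sketch below does just that; the function name and the data are hypothetical.

```python
def calibration_table(predicted, observed, n_groups=4):
    """Return (mean predicted probability, observed event rate, n) per risk group."""
    pairs = sorted(zip(predicted, observed))  # order patients by predicted risk
    n = len(pairs)
    table = []
    for g in range(n_groups):
        group = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not group:  # fewer patients than groups
            continue
        mean_predicted = sum(p for p, _ in group) / len(group)
        observed_rate = sum(o for _, o in group) / len(group)
        table.append((mean_predicted, observed_rate, len(group)))
    return table


# Hypothetical predictions of the rule and observed outcomes (1 = event).
predicted = [0.05, 0.10, 0.15, 0.20, 0.60, 0.70, 0.80, 0.90]
observed = [0, 0, 0, 1, 0, 1, 1, 1]

for mean_predicted, observed_rate, size in calibration_table(predicted, observed, n_groups=2):
    print(f"n={size}  predicted={mean_predicted:.2f}  observed={observed_rate:.2f}")
```

A well-calibrated rule will show similar predicted and observed figures in every group, not only in the middle of the range.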
Some authors of CPRs stop here, but they forget two fundamental steps of the development process: the validation and the calculation of the clinical impact of the rule.
The validation consists in testing the CPR in samples different from the one used for its design. We may be surprised to find that a rule that works well in a certain sample does not work in another. Therefore, it must be tested not only in similar patients (limited validation), but also in different clinical settings (broad validation), which will increase the external validity of the CPR.
The last phase is to check its clinical performance. This is where many CPRs crash after having gone through all the previous steps (maybe that’s why this last check is often avoided). To assess the clinical impact, we will have to apply the CPR to our patients and see how clinical outcome measures such as survival, complications, costs, etc. change. The ideal way to analyze the clinical impact of a CPR is to conduct a clinical trial with two groups of patients, managed with and without the rule.
For those self-sacrificing people who are still reading, now that we know what a CPR is and how it is designed, we will see how the critical appraisal of these works is done. And for this, as usual, we will use our three pillars: validity, relevance and applicability. So as not to forget anything, we will follow the questions that are listed on the grid for CPR studies of the CASP tool.
Regarding VALIDITY, we will start first with some elimination questions. If the answer is negative, it may be time to wait until someone finally makes up a crystal ball that works.
Does the rule answer a well-defined question? The population, the event to be predicted, the predictive variables and the outcome evaluation criteria must be clearly defined. If this is not done or these components do not fit our clinical scenario, the rule will not help us. The predictive variables must be clinically relevant, reliable and well defined in advance.
Did the study population from which the rule was derived include an adequate spectrum of patients? It must be verified that the method of patient selection is adequate and that the sample is representative. In addition, it must include patients from the entire spectrum of the disease. As with diagnostic tests, events may be easier to predict in certain groups, so there must be representatives of all of them. Finally, we must see if the rule was validated in a different group of patients. As we have already said, it is not enough that the rule works in the group of patients in which it was derived; it must also be tested in other groups, similar to or different from those with which it was generated.
If the answer to these three questions has been affirmative, we can move on to the next three questions. Was there a blind evaluation of the outcome and of the predictor variables? As we have already mentioned, it is important that the person who collects the predictive variables does not know the result of the reference standard, and vice versa. The collection of information must be prospective and independent. The next thing to ask is whether the predictor variables and the outcome were measured in all the patients. If the outcome or the variables are not measured in all patients, the validity of the CPR can be compromised. In any case, the authors should explain the exclusions, if there are any. Finally, are the methods of derivation and validation of the rule described? We already know that it is essential that the results of the rule be validated in a population different from the one used for the design.
If the answers to the previous questions indicate that the study is valid, we will answer the questions about the RELEVANCE of the results. The first is whether you can calculate the performance of the CPR. The results should be presented with their sensitivity, specificity, odds ratios, ROC curves, etc., depending on the result provided by the rule (scoring scales, regression formulas, etc.). All these indicators will help us to calculate the probabilities of occurrence of the event in environments with different prevalence. This is similar to what we did with the studies of diagnostic tests, so I invite you to review the post on the subject so as not to repeat myself too much. The second question is: what is the precision of the results? We will not dwell on this either: remember our revered confidence intervals, which will inform us of the precision of the results of the rule.
To finish, we will consider the APPLICABILITY of the results to our environment, for which we will try to answer three questions. Will the reproducibility of the CPR and its interpretation be satisfactory within the scope of the scenario? We will have to think about the similarities and differences between the setting in which the CPR was developed and our clinical environment. In this sense, it will be helpful if the rule has been validated in several samples of patients from different environments, which will increase its external validity. Is the test acceptable in this case? We will think about whether the rule is easy to apply in our environment and whether it makes sense to do so from the clinical point of view. Finally, will the results modify clinical behavior, health outcomes or costs? If, from our point of view, the results of the CPR are not going to change anything, the rule will be useless and a waste of time. Here our opinion will be important, but we must also look for studies that assess the impact of the rule on costs or on health outcomes.
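To see why these indicators let us move from one environment to another, here is a sketch of the usual calculation with a likelihood ratio: the prevalence is converted to odds, multiplied by the likelihood ratio and converted back to a probability. The likelihood ratio of 5 and the two prevalences are invented figures, only for illustration.

```python
def post_test_probability(prevalence: float, likelihood_ratio: float) -> float:
    """Probability of the event after applying the rule, via pre-test odds."""
    pre_test_odds = prevalence / (1 - prevalence)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)


# Hypothetical rule with a positive likelihood ratio of 5,
# applied in two settings with different prevalence of the event.
for prevalence in (0.10, 0.50):
    probability = post_test_probability(prevalence, likelihood_ratio=5.0)
    print(f"prevalence {prevalence:.2f} -> post-test probability {probability:.2f}")
```

The same positive result of the rule means something quite different when the event is rare than when half of the patients present it.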
And that is everything I wanted to tell you about the critical appraisal of studies on CPRs. Anyway, before finishing I would like to tell you a little about a checklist that, of course, also exists for the assessment of this type of study: the CHARMS checklist (CHecklist for critical Appraisal and data extraction for systematic Reviews of prediction Modeling Studies). You will not tell me that the name, although a bit fancy, is not lovely.
This list is designed to assess the primary studies of a systematic review on CPRs. It tries to answer some general design questions and assesses 11 domains to extract enough information to perform the critical appraisal. The two main aspects assessed are the risk of bias of the studies and their applicability. The risk of bias refers to design or validation flaws that may result in the model being less discriminative, excessively optimistic, etc. The applicability, on the other hand, refers to the degree to which the primary studies are in agreement with the question that motivates the systematic review, so it informs us of whether the rule can be applied to the target population. This list is good and helps to assess and understand the methodological aspects of this type of study but, in my humble opinion, it is easier to make a systematic critical appraisal by using the CASP tool.
And here, finally, we leave it for today. We have not spoken anything, so as not to stretch ourselves too long, of what to do with the result of the rule. The fundamental thing, we already know, is that we can calculate the probability of occurrence of the event in individual patients from environments with different prevalence. But that is another story…
Yes, I know that the saying goes just the opposite. But that is precisely the problem we have with so much new information technology. Today anyone can write and make public what goes through his head, reaching a lot of people, although what he says is bullshit (and no, I do not take this personally, not even my brother-in-law reads what I post!). The trouble is that much of what is written is not worth a bit, not to refer to any type of excreta. There is a lot of smoke and little fire, when we all would like the opposite to happen.
The same happens in medicine when we need information to make some of our clinical decisions. Whatever source we turn to, the volume of information will not only overwhelm us but, above all, most of it will be of no use to us at all. Also, even if we find a well-done article, it may not be enough to answer our question completely. That’s why we love so much the literature reviews that some generous souls publish in medical journals. They save us the task of reviewing a lot of articles and summarizing the conclusions. Great, isn’t it? Well, sometimes it is, sometimes it is not. As when we read any type of study in the medical literature, we should always make a critical appraisal and not rely solely on the know-how of its authors.
Reviews, of which we already know there are two types, also have their limitations, which we must know how to assess. The simplest form of review, our favorite when we are younger and ignorant, is what is known as a narrative review or author’s review. This type of review is usually done by an expert in the topic, who reviews the literature, analyzes what she finds as she sees fit (that is why she is the expert) and summarizes the qualitative synthesis with her expert conclusions. These types of reviews are good for getting a general idea about a topic, but they do not usually serve to answer specific questions. In addition, since it is not specified how the information search is done, we cannot reproduce it or verify that it includes everything important that has been written on the subject. With these reviews we can do little critical appraisal, since there is no precise systematization of how these summaries have to be prepared, so we will have to trust unreliable aspects such as the prestige of the author or the impact of the journal where it is published.
As our knowledge of the general aspects of science increases, our interest shifts towards other types of reviews that provide us with more specific information about aspects that escape our increasingly wide knowledge. This other type of review is the so-called systematic review (SR), which focuses on a specific question, follows a clearly specified methodology for searching and selecting information, and performs a rigorous and critical analysis of the results found. Moreover, when the primary studies are sufficiently homogeneous, the SR goes beyond the qualitative synthesis and also performs a quantitative synthesis, which has the nice name of meta-analysis. With these reviews we can do a critical appraisal following an ordered and pre-established methodology, in a similar way as we do with other types of studies.
The prototype of SR is the one made by the Cochrane Collaboration, which has developed a specific methodology that you can consult in the manuals available on its website. But, if you want my advice, do not trust even Cochrane, and make a careful critical appraisal even if the review has been done by them, not taking it for granted simply because of its origin. As one of my teachers in these disciplines says (I’m sure he’s smiling if he’s reading these lines), there is life after Cochrane. And, besides, there is a lot of it, and good, I would add.
Although SRs and meta-analyses may be a bit intimidating at the beginning, do not worry: they can be critically appraised in a simple way by considering the main aspects of their methodology. And to do it, nothing better than to systematically review our three pillars: validity, relevance and applicability.
Regarding VALIDITY, we will try to determine whether or not the review gives us unbiased results and responds correctly to the question posed. As always, we will look for some primary validity criteria. If these are not fulfilled, we will consider whether it is time to walk the dog: we will probably make better use of our time.
Has the aim of the review been clearly stated? All SRs should try to answer a specific question that is relevant from the clinical point of view, and that usually arises following the PICO scheme of a structured clinical question. It is preferable that the review tries to answer only one question, since if it tries to answer several there is a risk of not answering any of them adequately. This question will also determine the type of studies that the review should include, so we must assess whether the appropriate type has been included. Although it is most common to find SRs of clinical trials, they can include other types of studies, such as observational studies, diagnostic test studies, etc. The authors of the review must specify the criteria for inclusion and exclusion of the studies, in addition to considering their characteristics regarding setting, study groups, results, etc. Differences among the included studies in terms of (P) patients, (I) intervention or (O) outcomes can lead two SRs that ask the same question to reach different conclusions.
If the answer to the two previous questions is affirmative, we will consider the secondary criteria and leave the dog’s walk for later. Have the important studies that have to do with the subject been included? We must verify that a global and unbiased search of the literature has been carried out. It is common to do an electronic search including the most important databases (generally PubMed, Embase and the Cochrane Library), but this must be completed with a search strategy in other media to look for other works (references of the articles found, contact with well-known researchers, pharmaceutical industry, national and international registries, etc.), including the so-called gray literature (theses, reports, etc.), since there may be important unpublished works. And let no one be surprised by the latter: it has been proven that studies that obtain negative conclusions are at greater risk of not being published, so they do not appear in the SR. We must verify that the authors have ruled out the possibility of this publication bias. In general, this entire selection process is usually captured in a flow diagram that shows what happened to all the studies assessed in the SR.
It is very important that enough has been done to assess the quality of the studies, looking for the existence of possible biases. For this, the authors can use a tool designed ad hoc or, more usually, resort to one that is already recognized and validated, such as the risk of bias tool of the Cochrane Collaboration, in the case of reviews of clinical trials. This tool assesses five criteria of the primary studies to determine their risk of bias: adequate randomization sequence (prevents selection bias), adequate masking (prevents performance and detection biases, both information biases), concealment of allocation (prevents selection bias), losses to follow-up (attrition bias) and selective reporting of data (information bias). The studies are classified as high, low or indeterminate risk of bias according to the most important aspects of the methodology of the design (clinical trials in this case).
In addition, this must be done independently by two authors and, ideally, without knowing the authors of the primary studies or the journals where they were published. Finally, the degree of agreement between the two reviewers should be recorded, as well as what they did if they did not agree (the most common is to resort to a third party, who will probably be the boss of both).
To conclude with the internal or methodological validity, in case the results of the studies have been combined to draw common conclusions with a meta-analysis, we must ask ourselves if it was reasonable to combine the results of the primary studies. It is fundamental, in order to draw conclusions from combined data, that the studies are homogeneous and that the differences among them are due solely to chance. Although some variability among the studies increases the external validity of the conclusions, we cannot pool the data for the analysis if there is a lot of variability. There are numerous methods to assess homogeneity that we are not going to discuss now, but we are going to insist on the need for the authors of the review to have studied it adequately.
In summary, the fundamental aspects that we will have to analyze to assess the validity of a SR will be: 1) that the aims of the review are well defined in terms of population, intervention and measurement of the result; 2) that the bibliographic search has been exhaustive; 3) that the criteria for inclusion and exclusion of primary studies in the review have been adequate; and 4) that the internal or methodological validity of the included studies has also been verified. In addition, if the SR includes a meta-analysis, we will review the methodological aspects that we saw in a previous post: the suitability of combining the studies to make a quantitative synthesis, the adequate evaluation of the heterogeneity of the primary studies and the use of a suitable mathematical model to combine the results of the primary studies (you know, that of the fixed effect and random effects models).
Regarding the RELEVANCE of the results, we must consider what the overall result of the review is and whether the interpretation has been made in a judicious manner. The SR should provide a global estimate of the effect of the intervention based on a weighted average of the included quality studies. Most often, relative measures such as the risk ratio or the odds ratio are reported, although ideally they should be complemented with absolute measures such as the absolute risk reduction or the number needed to treat (NNT). In addition, we must assess the precision of the results, for which we will use our beloved confidence intervals, which will give us an idea of the precision of the estimate of the true magnitude of the effect in the population. As you can see, the way of assessing the importance of the results is practically the same as that used for the results of the primary studies. In this case we give examples of clinical trials, which is the type of study that we will see most frequently, but remember that there may be other types of studies that express the relevance of their results better with other parameters. Of course, confidence intervals will always help us to assess the precision of the results.
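To make the difference between relative and absolute measures tangible, here is a minimal sketch with an invented trial of 100 patients per arm. The function name and the counts are hypothetical; the formulas are the usual ones for the risk ratio, the odds ratio, the absolute risk reduction and the NNT.

```python
def effect_measures(events_treated, n_treated, events_control, n_control):
    """Relative and absolute effect measures from the counts of a two-arm trial."""
    risk_treated = events_treated / n_treated
    risk_control = events_control / n_control
    odds_treated = events_treated / (n_treated - events_treated)
    odds_control = events_control / (n_control - events_control)
    arr = risk_control - risk_treated  # absolute risk reduction
    return {
        "risk_ratio": risk_treated / risk_control,
        "odds_ratio": odds_treated / odds_control,
        "arr": arr,
        # The NNT is the inverse of the ARR; undefined if there is no difference.
        "nnt": 1 / arr if arr != 0 else float("inf"),
    }


# Hypothetical trial: 10 events among 100 treated, 20 among 100 controls.
for name, value in effect_measures(10, 100, 20, 100).items():
    print(f"{name}: {value:.2f}")
```

In this made-up example the risk is halved, which sounds impressive, while the absolute reduction is 10 percentage points, which means treating about 10 patients to avoid one event.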
The results of meta-analyses are usually represented in a standardized way, usually using the so-called forest plot. A graph is drawn with a vertical line of no effect (at one for relative risk and odds ratio, and at zero for mean differences) and each study is represented as a mark (its result) in the middle of a segment (its confidence interval). Studies with statistically significant results are those whose segment does not cross the vertical line. Generally, the most powerful studies have narrower intervals and contribute more to the overall result, which is expressed as a diamond whose lateral ends represent its confidence interval. Only diamonds that do not cross the vertical line will have statistical significance. Also, the narrower the interval, the more precise the result. And, finally, the further away from the line of no effect, the clearer the difference between the treatments or the compared exposures will be.
If you want a more detailed explanation about the elements that make up a forest plot, you can go to the previous post where we explained it or to the online manuals of the Cochrane Collaboration.
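The diamond of the forest plot can be sketched with a few lines of code. What follows assumes a fixed effect model with inverse-variance weights on the logarithmic scale, which is only one of the possible models, and the two studies are invented; the function names are also hypothetical.

```python
import math


def pooled_fixed_effect(log_effects, standard_errors, z=1.96):
    """Inverse-variance pooled ratio (RR or OR) with its 95% confidence interval."""
    weights = [1 / se ** 2 for se in standard_errors]  # precise studies weigh more
    pooled = sum(w * y for w, y in zip(weights, log_effects)) / sum(weights)
    se_pooled = math.sqrt(1 / sum(weights))
    low, high = pooled - z * se_pooled, pooled + z * se_pooled
    return math.exp(pooled), math.exp(low), math.exp(high)


def crosses_null(low, high, null_value=1.0):
    """True if the interval includes the line of no effect (1 for ratios)."""
    return low <= null_value <= high


# Two hypothetical studies, each with a risk ratio of 0.5
# and a standard error of 0.2 on the logarithmic scale.
estimate, low, high = pooled_fixed_effect([math.log(0.5), math.log(0.5)], [0.2, 0.2])
print(f"pooled RR {estimate:.2f} (95% CI {low:.2f} to {high:.2f})")
print("crosses the line of no effect:", crosses_null(low, high))
```

For mean differences the same idea applies without logarithms and with the line of no effect at zero.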
We will conclude the critical appraisal of the SR by assessing the APPLICABILITY of the results to our environment. We will have to ask ourselves if we can apply the results to our patients and how they will influence the care we give them. We will have to see if the primary studies of the review describe the participants and if they resemble our patients. In addition, although we have already said that it is preferable that the SR is oriented to a specific question, it will be necessary to see if all the relevant results have been considered for decision making in the problem under study, since sometimes it will be convenient to consider some additional secondary variable. And, as always, we must assess the benefit-cost-risk ratio. The fact that the conclusion of the SR seems valid does not mean that we are obliged to apply it.
If you want to correctly evaluate an SR without forgetting any important aspect, I recommend that you use a checklist such as PRISMA’s or some of the tools available on the Internet, such as the grids that can be downloaded from the CASP page, which are the ones we have used for everything we have said so far.
The PRISMA statement (Preferred Reporting Items for Systematic reviews and Meta-Analyses) consists of 27 items, classified in 7 sections that refer to the sections of title, summary, introduction, methods, results, discussion and financing:
Title: it must be identified as SR, meta-analysis or both. If it is specified, in addition, that it deals with clinical trials, it will be given priority over other types of reviews.
Summary: it should be a structured summary that includes background, objectives, data sources, inclusion criteria, limitations, conclusions and implications. The registration number of the review must also be included.
Introduction: includes two items, the justification of the study (what is known, controversies, etc.) and the objectives (what question it tries to answer, in the PICO terms of the structured clinical question).
Methods. It is the section with the largest number of items (12):
– Protocol and registration: indicate the registration number and its availability.
– Eligibility criteria: justification of the characteristics of the studies and the search criteria used.
– Sources of information: describe the sources used and the last search date.
– Search: complete electronic search strategy, so that it can be reproduced.
– Selection of studies: specify the selection process and the inclusion and exclusion criteria.
– Data extraction process: describe the methods used to extract the data from the primary studies.
– Data list: define the variables used.
– Risk of bias in primary studies: describe the method used and how it has been used in the synthesis of results.
– Summary measures: specify the main summary measures used.
– Results synthesis: describe the methods used to combine the results.
– Risk of bias between studies: describe biases that may affect cumulative evidence, such as publication bias.
– Additional analyses: if additional analyses are performed (sensitivity, meta-regression, etc.), specify which were pre-specified.
Results. Includes 7 items:
– Selection of studies: it is expressed through a flow chart that assesses the number of records in each stage (identification, screening, eligibility and inclusion).
– Characteristics of the studies: present the characteristics of the studies from which data were extracted and their bibliographic references.
– Risk of bias in the studies: communicate the risks in each study and any evaluation that is made about the bias in the results.
– Results of the individual studies: study data for each study or intervention group and estimation of the effect with their confidence interval. The ideal is to accompany it with a forest plot.
– Synthesis of the results: present the results of all the meta-analyses performed with the confidence intervals and the consistency measures.
– Risk of bias between the studies: present any evaluation that is made of the risk of bias between the studies.
– Additional analyses: if they have been carried out, provide their results.
Discussion. Includes 3 items:
– Summary of the evidence: summarize the main findings with the strength of the evidence of each main result and the relevance from the clinical point of view or of the main interest groups (care providers, users, health decision-makers, etc.).
– Limitations: discuss the limitations of the results, the studies and the review.
– Conclusions: general interpretation of the results in context with other evidences and their implications for future research.
Financing: describe the sources of funding and the role they played in the realization of the SR.
As you can see, we have said practically nothing about meta-analysis, with all its statistical techniques to assess homogeneity and its fixed and random effects models. The thing is that meta-analysis is a beast that must be eaten separately, so we have already devoted two posts to it alone, which you can check whenever you want. But that is another story…
I wonder how many times I have heard this question or one of its many variants. Because it turns out that we are always thinking about clinical trials and clinical questions about diagnosis and treatment, but think about whether a patient ever asked you if the treatment you were proposing was endorsed by a randomized controlled trial that meets the criteria of the CONSORT statement and has a good score on the Jadad scale. I can say, at least, that it has never happened to me. But they do ask me daily what will happen to them in the future.
And here lies the relevance of prognostic studies. Note that we cannot always heal and that, unfortunately, many times all we can do is accompany and, if possible, soften the announcement of serious sequelae or death. But it is essential to have good quality information about the future of our patient’s disease. This information will also serve to calibrate therapeutic efforts in each situation depending on the risks and benefits. Besides, prognostic studies are used to compare results between different departments or hospitals. Nobody should say that a hospital is worse than another because its mortality is higher without first checking that the prognosis of its patients is similar.
Before getting into the critical appraisal of prognostic studies, let’s clarify the difference between risk factor and prognostic factor. The risk factor is a characteristic of the environment or the subject that favors the development of the disease, while the prognostic factor is that which, once the disease occurs, influences its evolution. Risk factor and prognostic factor are different things, although sometimes they can coincide. What the two do share is the same type of study design. The ideal would be to use clinical trials, but most of the time we cannot randomize the prognostic or risk factors, or it is not ethical to do so. Let’s suppose we want to demonstrate the deleterious effect of booze on the liver. The way with the highest degree of evidence to prove it would be to make two random groups of participants and give 10 whiskeys a day to the participants of one arm and some water to the participants of the other, to see the differences in liver damage after a year, for example. However, it is evident to anyone that we cannot do a clinical trial like this. Not because we cannot find subjects for the intervention arm, but because ethics and common sense prevent us from doing it.
For this reason, it is usual to use cohort studies: we would study what differences in the liver there may be between individuals who drink and who do not drink alcohol by their own choice. In cases that require very long follow-ups or in which the effect we want to measure is very rare, case-control studies can be used, but they will always be less powerful because they have a higher risk of bias. Following our alcohol example, we would study people with and without liver damage and we would see whether either of the two groups had been more exposed to alcohol.
A prognostic study should inform us of three aspects: what outcome we evaluate, how likely it is to happen, and in what time frame we expect it to happen. And to appraise it, as always, we will rely on our three pillars: validity, relevance and applicability.
To assess the VALIDITY, we’ll first consider if the article meets a set of primary or elimination criteria. If the answer is no, we had better throw the paper away and go to read the last bullshit our Facebook friends have written on our wall.
Is the study sample well defined and is it representative of patients at a similar stage of disease? The sample, which is usually called initial or incipient cohort, should be formed by a group of patients at the same stage of disease, ideally at the beginning, at it should be followed-up prospectively. It should be well specified the type of patients included, the criteria for diagnosing them and the method of selection. We must also verify that the follow-up has been long enough and complete enough to observe the event we study. Each participant has to be followed-up from the start to the end of the study, either because he’s healed, because he presents the event or because the study ends. It is very important to take into account losses during the study, very common in designs with long follow-up. The study should provide the characteristics of patients lost and the reasons for the loss. If they are similar to those who are not lost during follow-up, we can get valid results. If the number of patients lost to follow-up is greater than 20% it’s usually done a sensitivity analysis using the worst possible scenario, which considers that all losses have had a poor prognosis and then recalculate the results to check if they are modified, in which case the study results could be invalidated.
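The worst possible scenario is simple to reproduce. In this sketch, with invented figures and a hypothetical function name, the event rate is first calculated among the patients who completed follow-up and then recalculated assuming that every lost patient presented the event.

```python
def worst_case_event_rate(events: int, completed: int, lost: int):
    """Observed event rate and the rate if all losses had a poor prognosis."""
    observed = events / completed
    worst_case = (events + lost) / (completed + lost)
    return observed, worst_case


# Hypothetical cohort: 200 patients enrolled, 150 completed follow-up
# (30 of them with the event) and 50 were lost (25% of the cohort).
observed, worst_case = worst_case_event_rate(events=30, completed=150, lost=50)
print(f"observed rate: {observed:.2f}, worst-case rate: {worst_case:.2f}")
```

If the two figures lead to different conclusions, as they would in this made-up cohort, the results of the study cannot be taken at face value.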
Once these two aspects have been assessed, we turn to the secondary criteria about internal validity or scientific rigor.
Were outcomes measured objectively and without bias? What is being measured, and how, must be clearly specified before starting the study. In addition, to avoid information bias, the ideal is that outcomes are measured by a blinded researcher, who must not know whether the subject in question is exposed to any of the prognostic factors.
Were the results adjusted for all relevant prognostic factors? We must take into account all the confounding variables and prognostic factors that may influence the results. If they are known from previous studies, those known factors can be taken into account. Otherwise, the authors will determine these effects using stratified data analysis (the easiest method) or multivariate analysis (more powerful and more complex), usually a proportional hazards model or Cox regression analysis. Although we're not going to talk about regression models now, there are two simple aspects that we can take into account. First, these models need a certain number of events per variable included in the model, so be wary of those in which many variables are analyzed, especially with small samples. Second, the variables included are decided by the author and differ from one work to another, so we will have to assess whether they have left out any that may be relevant to the final result.
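The events-per-variable check is simple arithmetic. The sketch below assumes the commonly quoted rule of thumb of about 10 events per variable; that threshold is an assumption added here for illustration (the text does not give a number), and the study figures are invented.

```python
MIN_EVENTS_PER_VARIABLE = 10  # assumed rule of thumb, not taken from the text


def events_per_variable(n_events, n_variables):
    """Number of outcome events available for each variable in the model."""
    if n_variables <= 0:
        raise ValueError("the model needs at least one variable")
    return n_events / n_variables


# Hypothetical paper: 40 events observed, 8 variables in the Cox model.
epv = events_per_variable(n_events=40, n_variables=8)
print(f"events per variable: {epv:.1f}")
print("enough events" if epv >= MIN_EVENTS_PER_VARIABLE else "be suspicious")
```

Note that what counts is the number of events, not the number of patients: a large cohort with very few events can still support only a small model.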
Were the results validated in other groups of patients? When we set up groups of variables and make multiple comparisons, we run the risk that chance plays a trick on us and shows us associations that don't exist. This is why, when a risk factor is described in one group (training or derivation group), the results should be replicated in an independent group (validation group) to be really sure about the effect.
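To get a feeling for how easily chance can trick us, we can compute the probability of finding at least one spurious association when several comparisons are made. The sketch assumes that the comparisons are independent and that each uses the conventional significance level of 0.05; both are simplifying assumptions made for the example.

```python
def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """Chance of one or more false positives among n independent tests."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 5, 20):
    p = prob_at_least_one_false_positive(n_tests)
    print(f"{n_tests:>2} comparisons -> {p:.2f}")
```

With 20 comparisons the probability is already well above one half, which is why a finding from the derivation group deserves a second look in a validation group.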
Now we must consider what the results are to determine their RELEVANCE. To do this, we'll check whether the authors estimate and report the probability of the outcome, as well as the precision of this estimate and the risk associated with the factors that influence the prognosis.
Is the probability of the event specified in a given period of time? There are several ways to present the number of events occurring during the follow-up period. The simplest would be to provide an incidence rate (events per person-time, for example per person-year) or the cumulative frequency at any given time. Another indicator is the median survival, which is just the moment of follow-up at which the event has happened in half of the cohort participants (remember that, although we speak about survival, the event does not necessarily have to be death).
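As a small illustration of an incidence rate, the following sketch divides the number of events by the total follow-up time contributed by all participants. The function name and the six patients are made up for the example.

```python
def incidence_rate(follow_up_years, had_event):
    """Events divided by total person-years of follow-up."""
    person_years = sum(follow_up_years)
    if person_years <= 0:
        raise ValueError("total follow-up time must be positive")
    return sum(had_event) / person_years


# Hypothetical cohort of six patients: years followed and whether the
# event occurred during that time.
years = [2.0, 5.0, 3.0, 5.0, 1.0, 4.0]
event = [True, False, True, False, True, False]

rate = incidence_rate(years, event)  # 3 events / 20 person-years
print(f"{rate * 100:.0f} events per 100 person-years")
```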
We can use survival curves of various kinds to determine the probability of the occurrence of the event in each period and the rate at which it is presenting. Actuarial or life tables are used for larger samples when we don't know the exact time of the event and we use fixed time periods. However, the most often used are the Kaplan-Meier curves, which better measure the probability of the event at each particular time with smaller samples. These analyses can provide the median survival and, depending on the regression model used, hazard ratios and other parameters.
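For the curious, the Kaplan-Meier estimate can be sketched in plain Python: at each time at which an event occurs, the survival probability is multiplied by the fraction of patients at risk who did not have the event. The function names and the eight patients below are invented for the example, and for real work an established statistical package should be used instead.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    events[i] is 1 if the event was observed at times[i] and 0 if the
    patient was censored (lost, or still event-free at the end).
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group together all patients whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving
    return curve


def median_survival(curve):
    """First time at which survival drops to 50% or less (None if never)."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Hypothetical follow-up in months for eight patients.
times = [1, 2, 2, 3, 4, 5, 6, 7]
events = [1, 1, 0, 1, 0, 1, 1, 0]

curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"month {t}: S(t) = {s:.3f}")
print("median survival:", median_survival(curve))
```

Censored patients do not lower the curve, but they do reduce the number at risk, so each later event produces a bigger step down.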
To assess the precision of the results we will look, as always, for the confidence intervals. The wider the interval, the less precise the estimate of the probability of occurrence in the general population, which is what we really want to know. Keep in mind that the number of patients generally decreases as time passes, so survival curves are usually more precise at the beginning than at the end of follow-up. Finally, we'll assess the factors that modify the prognosis. The right thing is to present all the variables that may influence the prognosis with their corresponding relative risks, which will allow us to evaluate the clinical significance of the association.
Finally, we must consider the APPLICABILITY of the results. Do they apply to my patients? We will look for similarities between the study patients and ours and assess whether the differences we find allow us to extrapolate the results to our practice. And besides, are the results useful? The fact that they're applicable doesn't necessarily mean that we have to implement them. We have to assess carefully whether they're going to help us decide what treatment to apply and how to inform our patients and their families.
As always, I recommend that you use a template, such as those provided by CASP, to carry out a systematic critical appraisal without leaving any important matter unassessed.
You can see that articles about prognosis have a lot to say. And we have hardly talked about regression models and survival curves, which are often the statistical core of this type of article. But that's another story…