Science without sense…double nonsense

Pills on evidence-based medicine

The following articles were authored by mmolina

Worshipped, but misunderstood

Statistics wears out most of us who call ourselves “clinicians”. The knowledge of the subject acquired during our formative years has long since faded into the foggy world of oblivion. We vaguely remember terms such as probability distribution, hypothesis testing, analysis of variance, regression… That is why we always feel a bit apprehensive when we reach the methods section of scientific articles, where all these techniques are detailed: we recognize them, but we do not know them in enough depth to interpret their results correctly.

Fortunately, Providence has given us a lifebelt: our beloved and worshipped p. Who has not felt lost in a cumbersome description of mathematical methods, only to breathe a sigh of relief on finding the p-value? Especially if the p is small and has many zeros.

The problem with p is that, although it is unanimously worshipped, it is also mostly misunderstood. Its value is, very often, misinterpreted. And this is so because many of us harbor misconceptions about what the p-value really means.

Let’s try to clarify it.

Whenever we want to know something about a variable, the effect of an exposure, the comparison of two treatments, etc., we will run into the ubiquity of chance: it is everywhere and we can never get rid of it, although we can try to limit it and, of course, try to measure its effect.

Let’s give an example to understand it better. Suppose we are doing a clinical trial to compare the effect of two diets, A and B, on weight gain in two groups of participants. Simplifying, the trial will have one of three outcomes: those on diet A gain more weight, those on diet B gain more weight, or both groups gain the same weight (there could even be a fourth: both groups lose weight). In practice, we will almost always observe some difference between the groups, just by chance (even if the two diets are identical).

Imagine that those on diet A put on 2 kg and those on diet B, 3 kg. Is diet B really more fattening, or is the difference due to chance (the samples chosen, biological variability, measurement error, etc.)? This is where our hypothesis test comes in.

When we are going to do the test, we start from the hypothesis of equality, of no difference in effect (the two diets induce the same weight gain). This is what we call the null hypothesis (H0) which, I repeat to keep it clear, we assume to be true. If the variable we are measuring follows a known probability distribution (normal, chi-square, Student’s t, etc.), we can calculate the probability of obtaining each of the values of the distribution. In other words, we can calculate the probability of obtaining a result as far from equality as the one we have observed, or further, always under the assumption that H0 is true.

That is the p-value: the probability of observing a difference like the one we found, or a larger one, by chance alone, assuming H0 is true. By agreement, if that probability is less than 5% (0.05) it will seem unlikely that the difference is due to chance and we will reject H0, the equality hypothesis, accepting the alternative hypothesis (Ha) which, in this example, will say that one diet is better than the other. On the other hand, if the probability is greater than 5%, we will not feel confident enough to affirm that the difference is not due to chance, so we DO NOT reject H0 and we stick with the hypothesis of equal effects: the two diets are similar.
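
To make this concrete, here is a minimal sketch in Python of what such a test might look like; the weight gains are invented for the example, and SciPy’s two-sample t-test simply returns the p-value on which the decision rule above operates.

```python
# Hypothetical weight gains (kg) in the two diet groups; invented numbers.
from scipy import stats

diet_a = [1.8, 2.1, 2.4, 1.9, 2.2, 2.0, 1.7, 2.3]
diet_b = [2.9, 3.2, 2.7, 3.1, 3.4, 2.8, 3.0, 3.3]

# Two-sample t-test under H0: both diets produce the same mean weight gain.
t_stat, p_value = stats.ttest_ind(diet_a, diet_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# By agreement, p < 0.05 leads us to reject H0 and accept Ha (the diets differ);
# otherwise we keep H0 and cannot claim any difference in effect.
print("Reject H0" if p_value < 0.05 else "Do not reject H0")
```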

Keep in mind that we always move in the realm of probability. If p is less than 0.05 (statistically significant), we will reject H0, but always with some probability of committing a type 1 error: taking for granted an effect that, in reality, does not exist (a false positive). On the other hand, if p is greater than 0.05, we stay with H0 and say that there is no difference in effect, but always with some probability of committing a type 2 error: not detecting an effect that actually exists (a false negative).

We can see, therefore, that the p-value is conceptually rather simple. However, there are a number of common errors about what the p-value does or does not represent. Let’s try to clarify them.

It is false that a p-value less than 0.05 means that the null hypothesis is false, and a p-value greater than 0.05 that the null hypothesis is true. As we have already mentioned, the approach is always probabilistic. A p < 0.05 only means that, by agreement, it is unlikely that H0 is true, so we reject it, although always with a small probability of being wrong. On the other hand, if p > 0.05, it is not guaranteed that H0 is true either, since there may be a real effect that the study does not have sufficient power to detect.

At this point we must emphasize one fact: the null hypothesis can only be falsified. This means that we can only reject it (in which case we stay with Ha, with some probability of error), but we can never affirm that it is true. If p > 0.05 we cannot reject it, so we remain with the initial assumption of equal effects, which we cannot demonstrate in a positive way.

It is false that the p-value is related to the reliability of the study. We may think that the conclusions of the study will be more reliable the lower the p-value, but this is not true either. Strictly speaking, the p-value is the probability of obtaining a result like ours by chance if we repeated the experiment under the same conditions, and it does not depend only on whether the effect we want to demonstrate exists or not. Other factors can influence the magnitude of the p-value: the sample size, the effect size, the variance of the measured variable, the probability distribution used, etc.

It is false that the p-value indicates the relevance of the result. As we have already repeated several times, the p-value is only the probability that the observed difference is due to chance. A statistically significant difference does not necessarily have to be clinically relevant. Clinical relevance is established by the researcher, and it is possible to find results with a very small p that are not relevant from the clinical point of view and, vice versa, non-significant results that are clinically relevant.

It is false that the p-value represents the probability that the null hypothesis is true. This belief is why we sometimes look for the exact value of p and do not settle for knowing only whether it is greater or less than 0.05. This misconception stems from a misinterpretation of conditional probability. We are interested in knowing the probability that H0 is true once we have obtained some results with our test. Mathematically expressed, we want to know P(H0 | results). However, the p-value gives us the probability of obtaining our results under the assumption that the null hypothesis is true, that is, P(results | H0).

Therefore, if we interpret the probability that H0 is true in view of our results, P(H0 | results), as being equal to the p-value, P(results | H0), we are falling into the inverse fallacy, also known as the fallacy of the transposed conditional.

In fact, the probability that H0 is true does not depend only on the results of the study, but is also influenced by the prior probability estimated before the study, which is a measure of subjective belief reflecting its plausibility, generally based on previous studies and knowledge. Suppose we want to test an effect that we believe is very unlikely to be true. We will interpret a p-value < 0.05 with caution, even though it is significant. On the contrary, if we are convinced that the effect exists, we will be satisfied with far less demanding p-values.

In summary, to calculate the probability that the effect is real we must calibrate the p-value with the baseline probability of H0, which will be assigned by the researcher or taken from previously available data. There are mathematical methods to calculate this probability from the baseline probability and the p-value, but the simplest way is to use a graphical tool, Held’s nomogram, which you can see in the figure.

To use Held’s nomogram we just draw a line from the prior probability of H0 that we assume, through the p-value, and extend it to read the posterior probability we reach. As an example, we have represented a study with a p-value of 0.03 in which we believe that the probability of H0 is 20% (we believe there is an 80% probability that the effect is real). If we extend the line it will tell us that the minimum posterior probability of H0 is about 6%: there is a 94% probability that the effect is real. On the other hand, think of another study with the same p-value but in which we think the probability of the effect is lower, say 20% (so the probability of H0 is 80%). For the same p-value, the minimum posterior probability of H0 is about 50%, so there is only a 50% probability that the effect is real. As we can see, the posterior probability changes according to the prior probability.
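
For those who prefer numbers to graphs, the posterior probability can be approximated with the minimum Bayes factor bound −e·p·ln(p), which, as far as I know, is the kind of bound on which Held’s nomogram is based; take the sketch below as an illustration of that calculation rather than as the exact tool (the small differences with respect to the figures quoted above come from reading values off a graph).

```python
import math

def min_posterior_h0(p_value, prior_h0):
    """Minimum posterior probability of H0 for a given p-value and prior P(H0),
    using the minimum Bayes factor bound -e * p * ln(p) (valid for p < 1/e)."""
    bf_min = -math.e * p_value * math.log(p_value)  # smallest Bayes factor in favour of H0
    prior_odds = prior_h0 / (1 - prior_h0)
    post_odds = prior_odds * bf_min
    return post_odds / (1 + post_odds)

# The two hypothetical scenarios discussed above, both with p = 0.03.
for prior in (0.20, 0.80):
    post = min_posterior_h0(0.03, prior)
    print(f"prior P(H0) = {prior:.0%} -> minimum posterior P(H0) = {post:.0%}")
```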

And here we will end for today. We have seen that the p-value only gives us an idea of the role that chance may have played in our results and that, in addition, it depends on other factors, perhaps the most important of which is the sample size. The conclusion is that, in many cases, the p-value allows us to assess the relevance of the results of a study only in a very limited way. To do it better, it is preferable to resort to confidence intervals, which allow us to assess both clinical relevance and statistical significance. But that is another story…

The cheaters detector

When we think about inventions and inventors, the name of Thomas Alva Edison, known among his friends as the Wizard of Menlo Park, comes to most of us. This gentleman created more than a thousand inventions, some of which can be said to have changed the world. Among them we can name the incandescent bulb, the phonograph, the kinetoscope, the polygraph, the quadruplex telegraph, etc., etc., etc. But perhaps his great merit is not having invented all these things, but having applied assembly-line methods and teamwork to the research process, favoring the dissemination of his inventions and the creation of the first industrial research laboratory.

But in spite of all his genius and excellence, Edison never got around to inventing something that would have been as useful as the light bulb: a cheaters detector. The explanation for this omission is twofold: he lived between the nineteenth and twentieth centuries and he did not read articles about medicine. If he had lived in our time and had had to read the medical literature, I have no doubt that the Wizard of Menlo Park would have realized the usefulness of this invention and would have pulled his socks up.

And it is not that I am especially negative today; the problem is that, as Altman said more than 15 years ago, the material sent to medical journals is methodologically defective in a very high percentage of cases. It is sad, but the most appropriate place to store many published studies is the rubbish bin.

In most cases the cause is probably the ignorance of those who write. “We are clinicians”, we say, so we leave aside the methodological aspects, of which our knowledge is, in general, quite deficient. To fix it, journal editors send our studies to other colleagues, who are more or less like us. “We are clinicians”, they say, so all our mistakes go unnoticed by them.

Although this is, in itself, serious, it can be remedied by studying. What is even more serious is that, sometimes, these errors can be intentional, with the aim of leading the reader to a particular conclusion after reading the article. The remedy for this problem is to make a critical appraisal of the study, paying attention to its internal validity. In this sense, perhaps the most difficult aspect for a clinician without methodological training to assess is the one related to the statistics used to analyze the results of the study. It is here, undoubtedly, where our ignorance can be most easily exploited, by using methods that provide more striking results instead of the right ones.

As I know that you will not be willing to do a master’s degree in biostatistics while waiting for someone to invent the cheaters detector, we are going to give a series of clues so that non-expert readers can suspect the existence of these cheats.

The first may seem obvious, but it is not: has a statistical method been used at all? Although it is exceptionally rare, there may be authors who do not consider using any. I remember a medical congress I attended at which the values of a variable were shown going first up and then down over the course of the study, which allowed the speaker to conclude that the result was not “on the blink”. As is logical and evident, any comparison must be made with the proper hypothesis test, and the level of significance and the statistical test used have to be specified. Otherwise, the conclusions will lack any validity.

A key aspect of any study, especially those with an intervention, is the prior calculation of the necessary sample size. The investigator must define the clinically relevant effect that he wants to be able to detect with his study and then calculate what sample size will give the study enough power to demonstrate it. The sample of a study is not large or small, but sufficient or insufficient. If the sample is insufficient, an existing effect may go undetected due to lack of power (type 2 error). On the other hand, a sample larger than necessary may show as statistically significant an effect that is not relevant from the clinical point of view. Here are two very common cheats. First, the study that does not reach significance and whose authors say it is due to lack of power (insufficient sample size), but make no effort to calculate the power, which can always be done a posteriori. In that case, we can calculate it ourselves using statistical programs or any of the calculators available on the internet, such as GRANMO. Second, the sample size is increased until the observed difference becomes significant, finding the desired p < 0.05. This case is simpler: we only have to assess whether the effect found is relevant from the clinical point of view. I advise you to practice calculating the necessary sample sizes of studies and comparing them with those defined by the authors. You may be in for a surprise.
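
As an illustration of that a posteriori check, here is a minimal sketch using Python’s statsmodels; the effect size, alpha and group size are hypothetical, chosen only for the example.

```python
# Power achieved with a given sample size, and the sample size needed for a
# clinically relevant effect; all the numbers are hypothetical.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Power actually achieved with 30 participants per group for a standardized
# effect size (Cohen's d) of 0.5 at alpha = 0.05.
achieved_power = analysis.power(effect_size=0.5, nobs1=30, alpha=0.05)
print(f"Achieved power: {achieved_power:.2f}")

# Sample size per group needed to detect the same effect with 80% power.
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"Required sample size per group: {n_per_group:.0f}")
```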

Once the participants have been selected, a fundamental aspect is the baseline homogeneity of the groups. This is especially important in the case of clinical trials: if we want to be sure that the observed difference in effect between the two groups is due to the intervention, the two groups should be the same in everything except the intervention.

For this we will look at the classic table 1 of the trial publication. Here we have to say that, if we have distributed the participants at random between the two groups, any difference between them will be due, one way or another, to chance. Do not be fooled by the p: remember that the sample size is calculated for the clinically relevant magnitude of the main outcome variable, not for the baseline characteristics of the two groups. If you see any difference that seems clinically relevant, it will be necessary to verify that the authors have taken its influence on the results into account and have made the appropriate adjustment during the analysis phase.

The next point is randomization. This is a fundamental part of any clinical trial, so it must be clearly described how it was done. Here I have to tell you that chance is capricious and has many vices, but it rarely produces groups of exactly equal size. Think for a moment about flipping a coin 100 times. Although the probability of getting heads on each toss is 50%, it will be quite unusual to get exactly 50 heads in 100 tosses. The greater the number of participants, the more suspicious we should be when the two groups turn out exactly equal. But beware, this only applies to simple randomization. There are randomization methods in which the groups can be more balanced.
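
To put a number on the coin example, a quick check with SciPy (nothing more than the binomial probability of a perfect 50/50 split):

```python
from scipy.stats import binom

# Probability of getting exactly 50 heads in 100 fair tosses.
print(f"P(exactly 50 heads) = {binom.pmf(50, 100, 0.5):.3f}")  # about 0.08
```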

Another hot spot is the misuse that is sometimes made of qualitative variables. Although qualitative variables can be coded with numbers, be very careful about doing arithmetic with them: it will probably not make any sense. Another cheat we may find has to do with categorizing a continuous variable. Turning a continuous variable into a categorical one usually entails a loss of information, so it must have a clear clinical meaning. Otherwise, we may suspect that the reason is the search for a p-value below 0.05, always easier to achieve with the categorized variable.

Moving on to the analysis of the data, we must check that the authors have followed the study protocol designed a priori. Always be wary of post hoc analyses that were not planned from the beginning. If we search long enough, we will always find some group that behaves as we want. As the saying goes, if you torture the data long enough, it will confess to anything.

Another unacceptable behavior is to finish the study ahead of time because the results look good. Once again, if the duration of follow-up was established during the design phase as the best time to detect the effect, it must be respected. Any violation of the protocol must be more than justified. Logically, it is ethical to finish a study ahead of time for safety reasons, but it will be necessary to take into account how this affects the evaluation of the results.

Before performing the analysis of the results, the authors of any study have to clean their data, reviewing the quality and integrity of the values collected. In this sense, one of the aspects to pay attention to is the handling of outliers. These are the values that are far from the central values of the distribution. On many occasions they may be due to errors in the calculation, measurement or transcription of the value of the variable, but they can also be real values that are due to the particular idiosyncrasy of the variable. The problem is that there is a tendency to eliminate them from the analysis even when there is no certainty that they are due to an error. The correct thing to do is to take them into account in the analysis and use, if necessary, robust statistical methods that can accommodate these deviations.

Finally, the aspect that can be most strenuous for those who are not experts in statistics is knowing whether the correct statistical method has been used. A frequent error is the use of parametric tests without first checking whether the necessary requirements are met. This can be done out of ignorance or to obtain statistical significance, since parametric tests are less demanding in this respect: to put it simply, the p-value will be smaller than if we used the equivalent non-parametric test.

Also, with some frequency, other requirements needed to apply a particular test are ignored. For example, in order to perform a Student’s t test or an ANOVA, homoscedasticity (a very ugly word that means that the variances are equal) must be checked, and that check is overlooked in many studies. The same happens with regression models which, frequently, are not accompanied by the mandatory model diagnostics that allow and justify their use.
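
By way of illustration, here is a minimal sketch (with made-up data) of the kind of checks that should precede a parametric comparison of two groups: normality, homoscedasticity and the fall-back to a non-parametric alternative when the requirements are not met.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=100, scale=15, size=25)  # hypothetical measurements
group_b = rng.normal(loc=110, scale=15, size=25)

# 1) Normality of each group (Shapiro-Wilk).
normal_a = stats.shapiro(group_a).pvalue > 0.05
normal_b = stats.shapiro(group_b).pvalue > 0.05

# 2) Homoscedasticity (equality of variances, Levene's test).
equal_var = stats.levene(group_a, group_b).pvalue > 0.05

# 3) Choose the test accordingly.
if normal_a and normal_b:
    # Student's t when the variances are equal, Welch's t otherwise.
    result = stats.ttest_ind(group_a, group_b, equal_var=equal_var)
    test_name = "Student's t" if equal_var else "Welch's t"
else:
    result = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
    test_name = "Mann-Whitney U"

print(f"{test_name}: p = {result.pvalue:.4f}")
```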

Another issue in which there may be cheating is that of multiple comparisons. For example, when an ANOVA reaches significance, it means that there are at least two means that are different, but we do not know which, so we start comparing them two by two. The problem is that when we make repeated comparisons the probability of a type 1 error increases, that is, the probability of finding significant differences only by chance. This may allow finding, if only by chance, a p < 0.05, which improves the appearance of the study (especially if you spent a lot of time and/or money doing it). In these cases, the authors must use one of the available corrections (such as Bonferroni’s, one of the simplest) so that the global alpha remains below 0.05. The price to pay is simple: the p-value has to be much smaller to be significant. When we see multiple comparisons without a correction, there can only be two explanations: the ignorance of whoever did the analysis, or an attempt to find a statistical significance that, probably, would not survive the correction.
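
A minimal sketch of such a correction, with hypothetical p-values from a set of pairwise post hoc comparisons:

```python
# Bonferroni correction for pairwise comparisons made after an ANOVA.
# The raw p-values are invented for illustration.
from statsmodels.stats.multitest import multipletests

raw_p = [0.04, 0.01, 0.20, 0.03]

reject, adjusted_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")

for p, p_adj, sig in zip(raw_p, adjusted_p, reject):
    print(f"raw p = {p:.2f} -> adjusted p = {p_adj:.2f} -> significant: {sig}")
# Comparisons that looked significant on their own (p = 0.04 and p = 0.03)
# stop being so once the global alpha is kept at 0.05.
```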

Another frequent victim of the misuse of statistics is Pearson’s correlation coefficient, which is used for almost everything. Correlation, as such, tells us whether two variables are related, but it tells us nothing about whether one variable causes the other. Another misuse is to use the correlation coefficient to compare the results obtained by two observers, when what should probably be used in this case is the intraclass correlation coefficient (for continuous variables) or the kappa index (for dichotomous qualitative variables). Finally, it is also incorrect to compare two measurement methods (for example, capillary and venous glycaemia) by correlation or linear regression. For these cases the correct approach would be to use Passing-Bablok regression.

Another situation that would make a paranoid mind like mine suspicious is one in which the statistical method employed is not known even by the most knowledgeable people around. Whenever there is a better-known (and often simpler) way to do the analysis, we must ask ourselves why such an odd method has been used. In these cases, we should require the authors to justify their choice and to provide a reference where we can review the method. In statistics, you have to try to choose the right technique for each occasion and not the one that gives the most appealing result.

In any of the previous tests, the authors usually set the level of significance at p < 0.05, as usual, but the test can be done with one or two tails. When we do a trial to test a new drug, what we expect is that it works better than the placebo or the drug we are comparing it with. However, two other situations can occur that we cannot dismiss: that it works the same or, even, that it works worse. A two-tailed (bilateral) test does not assume the direction of the effect, since it calculates the probability of obtaining a difference equal to or greater than that observed in either direction. If the researcher is very sure of the direction of the effect, he can do a one-tailed (unilateral) test, measuring the probability of the result in the direction considered. The problem arises when he does it for another reason: the p-value of a two-tailed test is twice as large as that of the one-tailed test, so it will be easier to achieve statistical significance with the one-tailed test. The wrong thing is to do the one-tailed test for that reason. The correct thing, unless there are well-justified reasons, is to do a two-tailed test.
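
The sketch below, with hypothetical data (and assuming a version of SciPy recent enough to accept the alternative argument), shows the arithmetic behind that temptation: for the same data, the one-tailed p-value in the hypothesized direction is half the two-tailed one.

```python
from scipy import stats

placebo = [4.1, 3.8, 4.5, 4.0, 3.9, 4.2, 4.4, 3.7]  # hypothetical outcomes
drug    = [4.6, 4.9, 4.3, 5.0, 4.8, 4.5, 4.7, 5.1]

p_two = stats.ttest_ind(drug, placebo, alternative="two-sided").pvalue
p_one = stats.ttest_ind(drug, placebo, alternative="greater").pvalue

print(f"two-tailed p = {p_two:.4f}")
print(f"one-tailed p = {p_one:.4f}  (half the two-tailed value)")
```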

To wrap up this tricky post, we will say a few words about the use of appropriate measures to present the results. There are many ways to dress up the truth without actually lying and, although they basically all say the same thing, the appearance can be very different depending on how we say it. The most typical example is to use relative risk measures instead of absolute and impact measures. Whenever we see a clinical trial, we must demand that the authors provide the absolute risk reduction and the number needed to treat (NNT). The relative risk reduction gives a larger number than the absolute one, so the impact will seem greater. Given that the absolute measures are easier to calculate and are obtained from the same data as the relative ones, we should be suspicious if the authors do not offer them to us: perhaps the effect is not as important as they are trying to make us believe.

Another example is the use of the odds ratio versus the risk ratio (when both can be calculated). The odds ratio tends to magnify the association between the variables, so its unjustified use can also make us suspicious. If you can, calculate the risk ratio and compare the two measures.
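
A worked sketch with an invented 2×2 outcome table shows how different the same result can look depending on the measure we quote:

```python
# Hypothetical trial: 10/100 events in the treated group, 20/100 in controls.
events_treated, n_treated = 10, 100
events_control, n_control = 20, 100

eer = events_treated / n_treated   # experimental event rate = 0.10
cer = events_control / n_control   # control event rate      = 0.20

arr = cer - eer                    # absolute risk reduction = 0.10
rrr = arr / cer                    # relative risk reduction = 0.50
nnt = 1 / arr                      # number needed to treat  = 10
rr  = eer / cer                    # risk ratio              = 0.50

# Odds ratio, which drifts further from 1 than the risk ratio.
odds_treated = events_treated / (n_treated - events_treated)
odds_control = events_control / (n_control - events_control)
odds_ratio = odds_treated / odds_control   # about 0.44

print(f"ARR = {arr:.2f}, RRR = {rrr:.0%}, NNT = {nnt:.0f}, "
      f"RR = {rr:.2f}, OR = {odds_ratio:.2f}")
# "Reduces the risk by 50%" sounds far more impressive than "you need to treat
# 10 patients to prevent one event", yet both come from the same data.
```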

Likewise, we will be suspicious of studies of diagnostic tests that do not provide us with the likelihood ratios and limit themselves to sensitivity, specificity and predictive values. Predictive values can be high when the prevalence of the disease in the study population is high, but they will not be applicable to populations with a lower proportion of patients. This is avoided with the use of likelihood ratios. We should always ask ourselves why the authors might have omitted the most valid parameter for gauging the performance of a diagnostic test.
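
A sketch with a hypothetical test (sensitivity 90%, specificity 80%) shows why likelihood ratios travel better than predictive values from one population to another:

```python
# Hypothetical diagnostic test: sensitivity 90%, specificity 80%.
sens, spec = 0.90, 0.80

lr_pos = sens / (1 - spec)        # positive likelihood ratio = 4.5
lr_neg = (1 - sens) / spec        # negative likelihood ratio = 0.125

def ppv(prevalence):
    """Positive predictive value at a given prevalence (Bayes' theorem)."""
    true_pos = sens * prevalence
    false_pos = (1 - spec) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

for prev in (0.30, 0.05):
    print(f"prevalence {prev:.0%}: PPV = {ppv(prev):.0%}, "
          f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.3f}")
# The PPV collapses when the test is taken to a low-prevalence population,
# while the likelihood ratios stay the same.
```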

And finally, be very careful with graphical representations of results: here the possibilities for dressing up the truth are limited only by the imagination. You have to look at the units used and try to extract the information from the graph beyond what it might seem to show at first glance.

And here we leave the topic for today. We have not spoken in detail about another of the most misunderstood and manipulated entities, which is none other than our p. Many meanings are attributed to p, usually erroneously, such as being the probability that the null hypothesis is true, a probability that has its own specific method of estimation. But that is another story…

Pairing

You will all know the case of someone who, after carrying out a study and collecting several million variables, went to the statistician at his workplace and, demonstrating beyond doubt the clarity of his ideas about his own work, said: please (one has to be polite), cross everything with everything, and let’s see what comes out.

At this point, several things can happen to you. If the statistician is an unscrupulous soul, she will give you a half smile and tell you to come back in a few days. Then you will be handed several hundred sheets of graphs, tables and numbers that you will not know what to do with. Another thing that can happen is that she sends you to hell, tired as she will be of receiving similar requests.

But you may be lucky and find a competent and patient statistician who, in a self-sacrificing way, will explain to you that things should not work like that. The logical thing is that, before collecting any data, you have prepared a project protocol that plans, among other things, what is to be analyzed and which variables are to be crossed with which. She may even suggest that, if the analysis is not very complicated, you try to do it yourself.

The latter may seem like the delirium of a mind disturbed by mathematics but, if you think about it for a moment, it is not such a bad idea. If we do the analysis of our results ourselves, at least the preliminary one, it can help us to understand the study better. Besides, who can know better than we do what we want?

With the current statistical packages, the simplest bivariate statistics can be within our reach. We only have to be careful in choosing the right hypothesis test, for which we must take into account three aspects: the type of variables that we want to compare, if the data are paired or independent and if we have to use parametric or non-parametric tests. Let’s see these three aspects.

Regarding the type of variables, there are multiple denominations depending on the classification or the statistical package that we use but, simplifying, we will say that there are three types of variables. First, there are the continuous variables. As the name suggests, they record the value of a continuous quantity such as weight, height, blood glucose concentration, etc. Second, there are the nominal variables, which consist of two or more mutually exclusive categories. For example, the variable “hair color” can have the categories “brown”, “blonde” and “redhead”. When these variables have two categories, we call them dichotomous (yes/no, alive/dead, etc.). Finally, when the categories are ordered by rank, we speak of ordinal variables: “do not smoke”, “smoke little”, “smoke moderately”, “smoke a lot”. Although they are sometimes coded with numbers, these only indicate the position of each category within the series, without implying, for example, that the distance from category 1 to 2 is the same as that from 2 to 3. For example, we can classify vesicoureteral reflux into grades I, II, III and IV (a grade IV is more than a grade II, but it does not mean twice as much reflux).

Knowing what kind of variable we are dealing with is simple. If in doubt, we can follow a simple line of reasoning based on the answers to two questions:

  1. Does the variable have infinite theoretical values? Here we have to do a bit of abstraction and think about what “theoretical values” really means. For example, if we measure the weight of the subjects of the study, the theoretical values will be infinite although, in practice, this will be limited by the precision of our scale. If the answer to this first question is “yes”, we are dealing with a continuous variable. If it is not, we move on to the next question.
  2. Are the values ordered in some kind of rank? If the answer is “yes”, we are dealing with an ordinal variable. If the answer is “no”, we have a nominal variable.

The second aspect is that of paired or independent measures. Two measures are paired when the same variable is measured twice in the same subject, usually before and after applying some change. For example: blood pressure before and after a stress test, weight before and after a nutritional intervention, etc. Independent measures, on the other hand, are those that are not related to each other (measurements made in different, unrelated subjects or of different variables): weight, height, gender, age, etc.

Finally, we mentioned the possibility of using parametric or non-parametric tests. We are not going to go into detail now, but in order to use a parametric test the variable must fulfill a series of conditions, such as following a normal distribution, having a certain sample size, etc. In addition, some techniques are more robust than others regarding these conditions. When in doubt, it is preferable to use a non-parametric technique unnecessarily (the only problem is that it is harder to achieve statistical significance, but the test is just as valid) than to use a parametric test when the necessary requirements are not met.

Once we have settled these three aspects, all that remains is to form the pairs of variables that we are going to compare and choose the appropriate statistical test. You can see it summarized in the attached table. The rows show the type of independent variable, which is the one whose value does not depend on another variable (it usually goes on the x axis of graphical representations) and is usually the one we modify in the study to see the effect on another variable (the dependent one). The columns, on the other hand, show the dependent variable, which is the one whose value changes with the changes of the independent variable. Anyway, do not get muddled: the statistical software will do the hypothesis test without taking into account which is the dependent and which the independent variable, only the types of variables involved.

The table is self-explanatory, so we will not spend much time on it. For example, if we have measured blood pressure (continuous variable) and we want to know whether there are differences between men and women (gender, a dichotomous nominal variable), the appropriate test will be Student’s t test for independent samples. If we wanted to see whether there is a difference in pressure before and after a treatment, we would use the same Student’s t test but for paired samples.

Another example: if we want to know whether hair color (nominal, polytomous: “blond”, “brown” and “redhead”) differs according to whether the participant comes from the north or the south of Europe (nominal, dichotomous), we could use a chi-square test.
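
A minimal sketch (with invented data) of the three examples just mentioned:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# 1) Blood pressure (continuous) vs gender (nominal, dichotomous):
#    Student's t test for independent samples.
bp_men = rng.normal(130, 10, 40)
bp_women = rng.normal(126, 10, 40)
print("independent t:", stats.ttest_ind(bp_men, bp_women).pvalue)

# 2) Blood pressure before and after a treatment (paired measures):
#    Student's t test for paired samples.
bp_before = rng.normal(140, 10, 30)
bp_after = bp_before - rng.normal(5, 3, 30)
print("paired t:", stats.ttest_rel(bp_before, bp_after).pvalue)

# 3) Hair color (nominal, polytomous) vs north/south of Europe (dichotomous):
#    chi-square test on the contingency table (hypothetical counts).
table = [[30, 45, 10],   # north: blond, brown, redhead
         [10, 60, 5]]    # south: blond, brown, redhead
chi2, p, dof, expected = stats.chi2_contingency(table)
print("chi-square:", p)
```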

And here we will end for today. We have not talked about the peculiarities of each test that have to be taken into account, but have only mentioned the tests themselves. For example, the chi-square test requires minimum expected counts in each cell of the contingency table, in the case of Student’s t we must consider whether the variances are equal (homoscedasticity) or not, etc. But that is another story…

The power of the transitive property

When Georg Cantor set out to develop set theory, he could not have imagined everything that would come after it, probably at the hands of mathematicians as dedicated as he was. I am thinking of the curious case of binary relations, which the older ones among you will remember from the time when children learned things at school.

It turns out that some mathematical genius starts thinking and describes a series of properties. The first is the reflexive property. This means that, if a number x is equal to x, then, well, it is x. In case anyone has not understood, let us give an anatomical example: my right hand is my right hand. I believe the genius who invented the reflexive property needed a long recovery in some spa after such a huge mental strain.

It was in this spa where he decided to do something more intense, so he described the symmetric property, which is much more complex: whenever a number x equals y, then y equals x. Going back to the anatomical simile, if my arms and legs are my extremities, you will have to agree that my extremities are my arms and my legs. Algebra is fascinating.

Luckily, in the end, just to round things off, our anonymous genius invented the transitive property, which goes more or less like this: if a number x is related to y, and y is related to z, then x is related to z. Back to the anatomy: if my leg is mine and my foot is part of my leg, then my foot is also mine. After that, more properties were derived from these three, but we shall leave it here for the moment, because today we are going to use the power of the transitive property to find out which of two things that have never actually been compared head to head is the better one. Think, for example, of a crazed mob running into a shopping center on the first day of the sales. They look at everything before deciding what to buy, but it is not necessary to compare all the products two by two to know which one we like best.

In medicine something similar happens. The usual thing is that there are several options to treat the same disease (although those of us who have been in the business for a long time know that the more options there are, the more likely it is that none of them will work at all). Clinical trials, and meta-analyses of clinical trials, only compare interventions in pairs, and it may happen that nobody has compared the two we have at our disposal, or that we want to know which is, in theory, the best of all those available.

Well, for that purpose a methodological design called network meta-analysis (NMA), also known as multiple-treatments meta-analysis or mixed-treatment comparisons meta-analysis, has been invented. And this last term, mixed comparisons, is the crux of the matter, because it turns out that there are several types of comparisons. Let’s see them.

Let’s assume we have three possible treatments that, after deep reflection, I have decided to call A, B and C. The simplest situation is to compare two of them, A and B for example, with a conventional clinical trial. We would be making a direct comparison between the two interventions. But it may happen that we do not have any trial that directly compares A and B, while there are two different trials that compare each of them with another intervention, C (you can see it in the attached figure). In this case we can resort to the power of the transitive property and make an indirect comparison between A and B based on their relative efficacy against C. For example, if A halves mortality compared with C (a 50% relative reduction) and B reduces it by only 25% compared with C, we can estimate that A reduces mortality by about a third compared with B (0.50 / 0.75 ≈ 0.67). Of course, in order to do this, transitivity has to hold, something we cannot take for granted. For example, if I like pork and pigs like to wallow in the mud, that does not mean that I like to wallow in the mud. Transitivity is not fulfilled in this case (I think).
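
In practice, this kind of indirect estimate is usually made on the log scale, in the style of the Bucher adjusted indirect comparison. The sketch below is only an illustration with hypothetical risk ratios and confidence intervals; note how the indirect interval comes out wider than either of the direct ones.

```python
import math

def indirect_comparison(rr_ac, ci_ac, rr_bc, ci_bc, z=1.96):
    """Bucher-style adjusted indirect comparison of A vs B through the common
    comparator C, working on the log risk-ratio scale."""
    log_rr = math.log(rr_ac) - math.log(rr_bc)
    # Standard errors recovered from the 95% confidence intervals.
    se_ac = (math.log(ci_ac[1]) - math.log(ci_ac[0])) / (2 * z)
    se_bc = (math.log(ci_bc[1]) - math.log(ci_bc[0])) / (2 * z)
    se = math.sqrt(se_ac ** 2 + se_bc ** 2)
    rr_ab = math.exp(log_rr)
    ci_ab = (math.exp(log_rr - z * se), math.exp(log_rr + z * se))
    return rr_ab, ci_ab

# Hypothetical direct results: A vs C and B vs C.
rr_ab, ci_ab = indirect_comparison(rr_ac=0.50, ci_ac=(0.35, 0.72),
                                   rr_bc=0.75, ci_bc=(0.55, 1.02))
print(f"Indirect RR of A vs B = {rr_ab:.2f} "
      f"(95% CI {ci_ab[0]:.2f} to {ci_ab[1]:.2f})")
```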

Well, an NMA is nothing more than a series of direct, indirect and mixed comparisons that allow us to estimate the relative effects of several interventions. These multiple comparisons are typically represented with a network diagram in which we can see the direct, indirect and mixed comparisons. Each node in the network, whose size can vary according to its specific contribution, corresponds to one of the interventions compared, while the lines joining the nodes represent the direct comparisons made in the primary studies of the review. The complete network will represent all the comparisons of treatments identified from the primary studies of the systematic review that incorporates our NMA.

As with other types of meta-analyses linked to a systematic review, the validity of the NMA will depend on the validity of the primary studies, the heterogeneity among them and the possible information biases, factors that will condition the quality of the direct comparisons.

In addition, indirect comparisons are considered observational and require, as we have already mentioned, that the researcher assess the transitivity of the interventions based on her knowledge of them, of the disease and of the designs of the primary studies.

Another specific aspect of the NMA is that of coherence or consistency, which refers to the level of agreement between the evidence coming from direct and indirect comparisons. This level of agreement, which can be measured with specific statistical methods, must be high for the summary result measure to be valid. The results of the comparisons must go in the same direction; they cannot be divergent. When this is not fulfilled, the cause probably lies in the poor methodological quality of the primary studies, in their heterogeneity or in the presence of biases.

As in other meta-analyses, the result of the NMA is expressed with a summary measure that can be an odds ratio, a mean difference, a risk ratio, etc. This point estimate is accompanied by an interval that gives us information about its accuracy. The statistical analysis of the NMA can use frequentist methods (the ones we usually see in conventional clinical trials) or Bayesian methods. The latter are based on assigning a prior probability to the treatment effect before analyzing the data and then obtaining a posterior probability after the analysis. For what interests us here, frequentist methods will express the precision of the point estimate by means of the familiar confidence intervals (usually 95%), while Bayesian methods will provide credible intervals (also 95%), with a similar interpretation.

With all these data we will obtain a ranking of the compared treatments, with the best heading the list. But do not get too confident; these rankings have to be looked at carefully, for several reasons. First, the best treatment in one situation may not be so in another. Second, we must take into account other factors such as cost, availability, the clinician’s familiarity with it, etc. Third, these rankings do not take into account the magnitude of the differences between the different elements. And fourth, chance can play tricks on us and place in a good position a treatment that, in reality, is not as good as it may seem.

Having reviewed, at a glance, the peculiarities of the NMA, what can we say about its critical appraisal? Just as we have a checklist for the systematic review with a conventional meta-analysis, the PRISMA statement, there is a specific statement for the NMA, the PRISMA-NMA. This list includes, as specific items, aspects such as the description of the geometry of the treatment network, the consideration of the transitivity and consistency assumptions, and the description of the methods used to analyze the structure of the network and the suitability of the comparisons, in case some have a lower degree of evidence. All this will be easier if the authors provide the graph with the study network and briefly explain its characteristics.

Anyway, you know that I prefer to resort to the CASP tools for the critical appraisal of documents. Although there is no specific one for NMA, I advise you to use the one for systematic reviews with conventional meta-analysis and, afterwards, to make some considerations about the specific aspects of the NMA.

In order not to make this post too long, we will skip the whole part that NMAs share with any other systematic review and go directly to their specific aspects. You can consult the corresponding post in which we reviewed the critical appraisal of a systematic review. As always, we will follow our three pillars of wisdom: validity, relevance and applicability.

Regarding VALIDITY, we will ask three specific questions.

  1. Does the review answer a well-defined clinical question that justifies carrying out an NMA? This question has the classic components of the PICO question, although the intervention and the comparison will encompass the multiple comparisons of the network.
  2. Was an exhaustive search of the relevant studies carried out? This aspect is important to avoid publication bias and to include all the important information available. The absence of relevant studies can affect the consistency of the comparisons.
  3. There should be a clear specification of the target population, the treatments evaluated and the outcome measures used. All these aspects can condition the validity of the indirect comparisons. If we want to infer the relationship between the effects of A and B by comparing their individual effects with respect to C, it is essential that A and B are handled similarly in their comparisons with C, that the A-C and B-C comparisons are made with similar patients, that the same outcome measures are used, and that the risk of bias in the studies is low. The latter can be assessed with the usual tools, such as the Cochrane risk of bias tool.

To finish this section, we will check that the results are analyzed and presented in an appropriate way, which statistical method has been used (frequentist or Bayesian), and whether confidence or credible intervals, the analysis of the network, etc., are provided.

Although we will not go into it, we will say that there are multiple types of networks (star, loop, line…). For comparisons to be more valid, indirect comparisons must be supported by direct ones. This can be seen in the network diagram as triangles like the one in the graph attached at the beginning of the post (or other closed geometric shapes). All other influential factors being equal (we have already mentioned them), the more triangles we see, the more valid the comparisons will be.

As a last aspect, we will evaluate whether the authors have used the appropriate methods to assess heterogeneity and the possible existence of inconsistency: sensitivity analyses, meta-regression, etc.

Moving on to the RELEVANCE section, we will assess the results of the meta-analysis. Here we will consider five specific aspects:

  1. What is the result? As in any other meta-analysis, we will assess the result and its importance from the clinical point of view. It will also be necessary to assess how the result could have been influenced by the risk of bias in the primary studies: the greater the risk of bias, the further our estimate may be from the truth.
  2. Are the results accurate? Here we must assess the width of the confidence or credible intervals, taking into account how the conclusions of the study would be affected at each end of the interval.
  3. Is there consistency of results among the different studies? There may be variability due to pure chance or to heterogeneity among the studies. We can assess it by looking at the shape of the forest plots and by using the usual statistical methods, such as I2.
  4. Are the indirect comparisons reliable? We return again to the concept of transitivity, which must be considered together with the other factors that we have previously mentioned and which may increase the risk of bias: homogeneous populations, common outcome variables and comparators, etc.
  5. Is there consistency between direct and indirect comparisons? We will have to look for closed geometric shapes within the network (our triangles or loops), as well as rule out causes of inconsistency, which are the same ones already mentioned as causes of heterogeneity and intransitivity.

Finally, we will finish our critical appraisal by making some special considerations regarding the APPLICABILITY of the results.

In addition to considering, as usual, whether all the effects and variables important for the patient have been taken into account and whether the patients are similar to those in our setting, we will ask the questions specifically related to the use of an NMA, such as whether the network has considered all the treatment possibilities or whether the comparison subgroups that have been established are credible from the clinical point of view.

And here we will leave it for today. A difficult beast to tame, this NMA. And we have not said anything about its statistical methodology, which is quite complex but which statistical packages handle without flinching. In addition, we could have talked at length about the types of networks and the comparisons that can be drawn from each of them. But that’s another story…

An unfairly treated genius

The genius that I am talking about in the title of this post is none other than Alan Mathison Turing, considered one of the fathers of computer science and a forerunner of modern computing.

For mathematicians, Turing is best known for his involvement in the solution of the decision problem, previously posed by Gottfried Wilhelm Leibniz and David Hilbert, who sought a method that could be applied to any mathematical statement to prove whether or not it was true (for those interested in the matter, it was eventually demonstrated that such a method does not exist).

But what Turing is famous for among the general public comes thanks to the cinema and to his work in statistics during World War II. Turing set about exploiting Bayesian magic to deepen the idea of how the evidence we collect during an investigation supports, or fails to support, the initial hypothesis, thus favoring the development of an alternative hypothesis. This allowed him to decipher the code of the Enigma machine, which the German navy used to encrypt its messages, and that is the story that has been taken to the screen. This line of work led to the development of concepts such as the weight of evidence, with which to confront null and alternative hypotheses, which were later applied in biomedicine and enabled the development of new ways to evaluate the capabilities of diagnostic tests, such as the ones we are going to deal with today.

But this whole story about Alan Turing is really just an acknowledgement of one of the people whose contributions made it possible to develop the methodological design we are going to talk about today, which is none other than the meta-analysis of diagnostic accuracy.

We already know that a meta-analysis is a quantitative synthesis method that is used in systematic reviews to integrate the results of primary studies into a summary result measure. The most common is to find systematic reviews on treatment, for which the implementation methodology and the choice of summary result measure are quite well defined. Reviews on diagnostic tests, which have been possible after the development and characterization of the parameters that measure the diagnostic performance of a test, are less common.

The process of conducting a diagnostic systematic review essentially follows the same guidelines as a treatment review, although there are some specific differences that we will try to clarify. We will focus first on the choice of the outcome summary measure and try to take into account the rest of the peculiarities when we give some recommendations for a critical appraisal of these studies.

When choosing the outcome measure, we find the first big difference with meta-analyses of treatment. In the meta-analysis of diagnostic accuracy (MDA), the most frequent way to assess the test is to combine sensitivity and specificity as summary values. However, these indicators present the problem that the cut-off points used to consider the results of the test positive or negative usually vary among the different primary studies of the review. Moreover, in some cases positivity may depend on the judgment of the evaluator (think of the results of imaging tests). All this, besides being a source of heterogeneity among the primary studies, is the origin of a typical MDA bias called the threshold effect, on which we will dwell a little later.

For this reason, many authors do not like to use sensitivity and specificity as summary measures and resort to the positive and negative likelihood ratios. These ratios have two advantages. First, they are more robust against the presence of a threshold effect. Second, as we know, they allow us to calculate the post-test probability, either using Bayes’ rule (pre-test odds x likelihood ratio = post-test odds) or a Fagan’s nomogram (you can review these concepts in the corresponding post).
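
Just as a reminder of how that rule is applied, a tiny worked sketch (the pre-test probability and the likelihood ratio are hypothetical):

```python
def post_test_probability(pre_test_prob, likelihood_ratio):
    """Bayes' rule in odds form: pre-test odds x LR = post-test odds."""
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

# Hypothetical example: pre-test probability of 20% and a positive LR of 8.
print(f"post-test probability = {post_test_probability(0.20, 8):.0%}")  # about 67%
```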

Finally, a third possibility is to resort to another of the inventions that derive from Turing’s work: the diagnostic odds ratio (DOR).

The DOR is defined as the ratio between the odds of testing positive when having the disease and the odds of testing positive when healthy. This phrase may seem a bit cryptic, but it is not. The odds of a diseased patient testing positive versus negative is just the ratio between true positives (TP) and false negatives (FN): TP / FN. On the other hand, the odds of a healthy person testing positive versus negative is the quotient between false positives (FP) and true negatives (TN): FP / TN. And having seen this, we only have to take the ratio between the two odds, as you can see in the attached figure. The DOR can also be expressed in terms of the predictive values and the likelihood ratios, according to the expressions you can see in the same figure. Finally, it is also possible to calculate its confidence interval, according to the formula that closes the figure.
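
Since the figure with the formulas is not reproduced here, the sketch below lays out the standard expressions with a hypothetical 2×2 table; the confidence interval uses the usual approximation on the log scale.

```python
import math

# Hypothetical 2x2 table of a diagnostic test against the gold standard.
tp, fn = 90, 10   # diseased: true positives, false negatives
fp, tn = 20, 80   # healthy:  false positives, true negatives

dor = (tp / fn) / (fp / tn)   # = (tp * tn) / (fp * fn) = 36

# Equivalent expression through the likelihood ratios.
sens, spec = tp / (tp + fn), tn / (tn + fp)
lr_pos, lr_neg = sens / (1 - spec), (1 - sens) / spec
assert abs(dor - lr_pos / lr_neg) < 1e-6

# 95% confidence interval on the log scale.
se_log = math.sqrt(1 / tp + 1 / fp + 1 / fn + 1 / tn)
ci = (math.exp(math.log(dor) - 1.96 * se_log),
      math.exp(math.log(dor) + 1.96 * se_log))
print(f"DOR = {dor:.1f} (95% CI {ci[0]:.1f} to {ci[1]:.1f})")
```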

Like all odds ratios, the possible values of the DOR range from zero to infinity. The null value is 1, which means that the test has no discriminatory capacity between the healthy and the sick. A value greater than 1 indicates discriminatory capacity, which will be greater the higher the value. Finally, values between zero and 1 indicate that the test not only fails to discriminate well between the sick and the healthy, but classifies them the wrong way round, giving us more negative results among the sick than among the healthy.

The DOR is a global parameter easy to interpret and does not depend on the prevalence of the disease, although it must be said that it can vary between groups of patients with different severity of disease. In addition, it is also a very robust measure against the threshold effect and is very useful for calculating the summary ROC curves that we will comment on below.

The second peculiar aspect of MDA that we are going to deal with is the threshold effect. We must always assess its presence when we are faced with an MDA. The first thing is to look at the clinical heterogeneity among the primary studies, which may be evident without many considerations. There is also a simple mathematical approach, which is to calculate Spearman’s correlation coefficient between sensitivity and specificity. If there is a threshold effect, there will be an inverse correlation between the two, which will be stronger the greater the threshold effect.
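
A quick sketch of that check, with sensitivities and specificities invented for a handful of primary studies:

```python
from scipy.stats import spearmanr

# Hypothetical sensitivity/specificity pairs from the primary studies.
sensitivities = [0.95, 0.90, 0.85, 0.80, 0.70]
specificities = [0.60, 0.70, 0.78, 0.85, 0.92]

rho, p = spearmanr(sensitivities, specificities)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
# A strong inverse correlation like this one suggests a threshold effect:
# studies with a laxer cut-off gain sensitivity at the expense of specificity.
```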

Finally, a graphical method is to look at the scatter of the primary studies’ sensitivity and specificity values around the summary ROC curve of the meta-analysis. A wide scatter makes us suspect a threshold effect, but it can also be due to the heterogeneity of the studies or to other biases, such as selection or verification bias.

The third specific element of MDA that we are going to comment on is that of the summary ROC curve (sROC), which is an estimate of the common ROC curve adjusted according to the results of the primary studies of the review. There are several ways to calculate it, some quite complicated from the mathematical point of view, but the most used are the regression models that use the DOR as an estimator, since, as we have said, it is very robust against heterogeneity and the threshold effect. But do not be alarmed, most of the statistical packages calculate and represent the sROC with little effort.

The reading of the sROC is similar to that of any ROC curve. The two most used parameters are the area under the ROC curve (AUC) and the Q index. The AUC of a perfect curve is equal to 1. Values above 0.5 indicate discriminatory diagnostic capacity, which will be higher the closer the value gets to 1. A value of 0.5 tells us that the usefulness of the test is the same as flipping a coin. Finally, values below 0.5 indicate that the test does not contribute at all to the diagnosis it is intended to support.

On the other hand, the Q index corresponds to the point at which sensitivity and specificity are equal. As with the AUC, a value greater than 0.5 indicates overall effectiveness of the diagnostic test, which will be higher the closer the index value is to 1. In addition, confidence intervals can be calculated both for the AUC and for the Q index, with which we can assess the precision of the estimation of the summary measure of the MDA.

Having seen (at a glance) the specific aspects of MDA, we will give some recommendations for the critical appraisal of this type of study. The CASP network does not provide a specific tool for MDA, but we can follow the outline for systematic reviews of treatment studies, taking into account the differential aspects of MDA. As always, we will follow our three basic pillars: validity, relevance and applicability.

Let’s start with the questions that value the VALIDITY of the study.

The first question asks whether the topic of the review has been clearly specified. As with any systematic review, a review of diagnostic tests should try to answer a specific question that is clinically relevant, usually proposed following the PICO scheme of a structured clinical question. The second question makes us reflect on whether the type of studies included in the review is adequate. The ideal design is that of a cohort to which the diagnostic test we want to assess and the gold standard are applied blindly and independently. Other studies, based on case-control designs, are less valid for the evaluation of diagnostic tests and will reduce the validity of the results.

If the answer to both questions is yes, we turn to the secondary criteria. Have the important studies on the subject been included? We must verify that a global and unbiased search of the literature has been carried out. The search methodology is similar to that of systematic reviews on treatment, although we should take some precautions. For example, diagnostic studies are often indexed differently in databases, so using the usual filters of other types of reviews can cause us to miss relevant studies. We will have to check the search strategy carefully, which must be provided by the authors of the review.

In addition, we must verify that the authors have ruled out the possibility of a publication bias. This poses a special problem in MDA, since the study of the publication bias in these studies is not well developed and the usual methods such as the funnel plot or the Egger’s test are not very reliable. The most conservative thing to do is always assume that there may be a publication bias.

It is very important that enough has been done to assess the quality of the studies, looking for possible biases. For this, the authors can use specific tools, such as the one provided by the QUADAS-2 statement.

To finish the section of internal or methodological validity, we must ask ourselves if it was reasonable to combine the results of the primary studies. It is fundamental, in order to draw conclusions from combined data, that studies are homogeneous and that the differences among them are due solely to chance. We will have to assess the possible sources of heterogeneity and if there may be a threshold effect, which the authors have had to take into account.

In summary, the fundamental aspects we will have to analyze to assess the validity of a MDA are: 1) that the objectives are well defined; 2) that the bibliographic search has been exhaustive; and 3) that the internal or methodological validity of the included studies has been verified. In addition, we will review the methodological aspects of the meta-analysis technique itself: the appropriateness of combining the studies in a quantitative synthesis, an adequate evaluation of the heterogeneity of the primary studies and of the possible threshold effect, and the use of an adequate mathematical model to combine the results of the primary studies (sROC, DOR, etc.).

Regarding the RELEVANCE of the results, we must consider the overall result of the review and whether it has been interpreted judiciously. We will value more those MDA that provide measures that are more robust against possible biases, such as likelihood ratios and the DOR. In addition, we must assess the precision of the results, for which we will use our beloved confidence intervals, which give us an idea of the precision of the estimate of the true magnitude of the effect in the population.

We will conclude the critical appraisal of the MDA by assessing the APPLICABILITY of the results to our setting. We will have to ask whether we can apply the results to our patients and how they will influence their care. We will have to see whether the primary studies of the review describe the participants and whether these resemble our patients. In addition, it will be necessary to check that all the results relevant to decision making in the problem under study have been considered and, as always, the benefit-cost-risk balance must be assessed. The fact that the conclusion of the review seems valid does not mean that we are obliged to apply it.

Well, with all that said, we are going to finish for today. The title of this post refers to the mistreatment suffered by a genius. We already know which genius we were referring to: Alan Turing. Now let us clarify the abuse. Despite being one of the most brilliant minds of the 20th century, as witnessed by his work on statistics, computing, cryptography, cybernetics and so on, and despite having saved his country from the blockade of the German Navy during the war, in 1952 he was tried for his homosexuality and convicted of gross indecency. As is easy to understand, his career ended after the trial, and Alan Turing died in 1954, apparently after eating a piece of an apple poisoned with cyanide. His death was labeled a suicide, although there are theories that speak rather of murder. They say that this is where the bitten apple of a well-known computer brand comes from, although others say that the apple simply represents a play on words between bite and byte.

I do not know which of the two theories is true, but I prefer to remember Turing every time I see the little apple. My humble tribute to a great man.

And now we finish. We have seen the peculiarities of meta-analyses of diagnostic accuracy and how to appraise them. Much more could be said about all the mathematics associated with their specific aspects, such as the presentation of variables, the study of publication bias, the threshold effect, etc. But that’s another story…

Chickenphant

The unreal mixture of different parts of animals has been an obsession of so-called human beings since time immemorial. The most emblematic case is that of the Chimera (which gives its name to the whole family of mixtures of different animals). This mythological being, daughter of Typhon and the viper Echidna, had a lion’s head, a goat’s body and a dragon’s tail, which allowed it to breathe flames and scare off everyone who passed by. Of course, that did not help it when Bellerophon, mounted on Pegasus (another weirdo, a horse with wings), insisted on running it through with his lead spear. You see, in its strength was its downfall: the creature’s fire melted the tip of the spear inside it, which resulted in its death.

Besides the Chimera, there are many more of these beings, all of them the fruit of human imagination. To name a few, we can recall the unicorns (these had worse luck than Pegasus: instead of wings they had horns, one per animal), the basilisks (a kind of snake-rooster with quite a bad temper), the gryphon (lion’s body, eagle for the rest) and all those in which part of the mixture is human, such as manticores (head of a man and body of a lion), centaurs, the Minotaur, Medusa (with her snakes instead of hair), mermaids…

In any case, among all the beings of this imaginary zoo, I am left with the chickenphant (gallifante in Spanish). This was a mixture of chicken and elephant that was used on TV to reward the wit of the children who took part in a popular contest. Millennials will have no idea what I’m talking about, but those who grew up in the 80s surely know what I mean.

And all this came to my mind when I was reflecting on the number of chimeras that also exist among the possible types of scientific study designs, especially among observational studies. Let’s get to know three of these chickenphants of epidemiology: the case-control study nested in a cohort and the case-cohort study, ending with another peculiar specimen, the case-crossover or self-controlled study.

Within observational studies, we all know the classic cohort and case-control studies, the ones most frequently used.

In a cohort study, a group or cohort is subjected to an exposure and followed over time to compare the frequency of appearance of the effect with that of an unexposed cohort, which acts as a control. These studies usually have a forward (antegrade) direction, so they allow us to measure the incidence of the disease and calculate the risk ratio between the two groups. On the other hand, a case-control study starts from two population groups, one of which presents the effect or disease under study, and compares its exposure to a specific factor with that of the group that does not have the disease and acts as a control. Being of retrograde direction and selecting cases of disease directly, it is not possible to calculate the incidence density directly and, therefore, neither the risk ratio between the two groups, which makes the odds ratio the typical measure of association of case-control studies.

The cohort study is the more solid of the two from a methodological point of view. The problem is that cohort studies usually require long follow-up periods and large cohorts, especially when the frequency of the disease studied is low, with the need to manage all the covariates of that large cohort, which increases the costs of the study.

Well, for those cases in which neither case-control nor cohort studies suit the researcher’s needs well, epidemiologists have invented a series of designs that lie halfway between the two and can mitigate their shortcomings. These hybrid designs are the case-control study nested in a cohort and the case-cohort study to which we have already referred.

On another note, in classical observational studies the key point lies in the selection of controls, who have to be representative of the level of exposure to the risk factor in the population from which the cases originate. An adequate selection of controls becomes even more difficult when the effect occurs abruptly. For example, if we want to know whether a copious meal increases the risk of heart attack, we would have great difficulty collecting controls representative of the population, since the risk factor can act just instants before the event.

To avoid these difficulties, the principle of “you make your bed, you lie in it” was applied and the third type of chimera we have mentioned was designed, in which each participant acts, at the same time, as his or her own control. These are the case-crossover studies, also known as self-controlled case studies.

Let’s look at these weirdos, beginning with the case-control study nested in a cohort.

Suppose we have done a study in which we used a cohort with many participants. Well, we can reuse it in a nested case-control study. We take the cohort and follow it over time, selecting as cases those subjects who develop the disease and assigning as controls individuals from the same cohort who have not yet presented it (although they may do so later). Thus, cases and controls come from the same cohort. It is convenient to match them taking into account confounding and time-dependent variables, such as the number of years they have been included in the cohort. In this way, the same subject can act as a control on several occasions and end up as a case later, which will have to be taken into account in the statistical analysis of these studies. As this may seem a bit confusing, I show you a scheme of this type of study in the first attached figure.

Since we see the cases as they arise, we are sampling by incidence density, which will allow us to estimate risk ratios. This is an important difference from conventional case-control studies, in which an odds ratio is usually calculated, which can only be assimilated to the relative risk when the frequency of the effect is very low.
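As a quick illustration of that last point, here is a toy calculation (invented counts, not from any real study) showing that the odds ratio and the risk ratio practically coincide when the effect is rare, but diverge when it is frequent:

```python
# Toy 2x2 tables comparing the risk ratio (RR) and the odds ratio (OR)
def rr_and_or(a, b, c, d):
    """a, b: diseased / healthy among exposed; c, d: diseased / healthy among unexposed."""
    rr = (a / (a + b)) / (c / (c + d))
    odds_ratio = (a * d) / (b * c)
    return rr, odds_ratio

print(rr_and_or(10, 990, 5, 995))     # rare effect: RR = 2.00, OR = 2.01 (almost identical)
print(rr_and_or(400, 600, 200, 800))  # frequent effect: RR = 2.00, OR = 2.67 (OR overestimates)
```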

Another difference is that all the information about the cohort is collected at the beginning of the study, so there is less risk of producing the classic information biases of case-control studies, which are usually retrospective in nature.

The other type of hybrid observational design we are going to discuss is the case-cohort study. Here we also start from a large initial cohort, from which we select a more manageable sub-cohort that will be used as a comparison group. We then see which individuals of the initial cohort develop the disease and compare them with the sub-cohort (regardless of whether or not they belong to it). You can see the outline of a case-cohort study in the second attached figure.

As in the previous example, since the cases are chosen as they occur over time, we can estimate the incidence density in cases and non-cases and calculate the risk ratio from it. As you can imagine, this design is cheaper than conventional studies because it greatly reduces the volume of information on healthy subjects that must be handled, without losing efficiency when studying rare diseases. The problem is that the cases end up overrepresented with respect to the sub-cohort, so the analysis of the results cannot be done as in traditional cohort studies but has its own, much more complicated, methodology.

To summarize what has been said so far: the nested case-control study is more like the classic case-control study, while the case-cohort study is more like the conventional cohort study. The fundamental difference between the two is that in the nested study the sampling of the controls is done by incidence density and by matching, so we must wait until all the cases have occurred to select the entire reference population. This is not so in the case-cohort study, which is much simpler, in which the reference population is selected at the start of the study.

To wrap up these hybrid studies, let us say a few things about case-crossover studies. These focus on the moment at which the event occurs and try to see whether there was something unusual that favored it, comparing the exposures in the moments immediately before the event with those of earlier periods that serve as controls. Thus, we compare case moments with control moments, with each individual acting as his or her own control.

For the study to be valid from the methodological point of view, the authors have to clearly describe a series of characteristic time periods. The first is the induction period, which is the delay from the beginning of the exposure to the production of the effect.

The second is the effect period, which is the interval during which the exposure can trigger the effect. Finally, the risk period would be the sum of the two previous periods, from the moment of exposure to the onset of the event.

The induction period is usually very brief, so the risk and effect periods are usually equivalent. In the attached figure I show you the relationship between the three periods so that you can understand it better.

It is essential that these three periods be clearly specified, since a poor estimate of the effect period, whether by excess or by defect, dilutes the effect of the exposure and makes its detection more difficult.

Some of you will tell me that these studies are similar to other self-controlled designs, such as matched case-control studies. The difference is that in the latter one or more similar controls are chosen for each case, whereas in self-controlled studies each subject is his or her own control. They also look a little like cross-over clinical trials, in which all participants undergo both intervention and control, but those are experimental studies in which the researcher intervenes in producing the exposure, while self-controlled studies are observational.

Where they do resemble matched case-control studies is in the statistical analysis, except that here case moments and control moments are analyzed. Thus, conditional logistic regression models are usually used, the odds ratio being the most common measure of association.
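For the simplest situation, one case moment and one control moment per subject and a binary exposure, the conditional estimate of the odds ratio reduces to the ratio of discordant pairs, as in this minimal sketch with invented counts (with several control moments per case, or continuous exposures, a full conditional logistic regression model would be needed):

```python
# Invented counts of discordant pairs in a 1:1 case-crossover analysis
exposed_case_moment_only = 30     # exposed just before the event, unexposed at the control moment
exposed_control_moment_only = 12  # unexposed just before the event, exposed at the control moment

# Conditional (matched-pairs) estimate of the odds ratio: ratio of discordant pairs
odds_ratio = exposed_case_moment_only / exposed_control_moment_only
print(f"OR = {odds_ratio:.2f}")  # 2.50
```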

As you can see, hybrid studies are a whole new family that threatens to grow in number and complexity. As far as I know, there are no checklists to critically appraise these types of designs, so we will have to judiciously apply the principles we use when analyzing classical observational studies, taking into account, in addition, the particularities of each type of study.

For this, we will follow our three pillars: validity, relevance and applicability.

In the VALIDITY section we will assess the methodological quality with which the study was carried out. We will check that there is a clear definition of the study population, the exposure and the effect. If a reference cohort is used, it should be representative of the population and should be followed completely. On the other hand, the cases must be representative of the population of cases from which they come, and the controls have to come from a population whose level of exposure is representative of that of the population giving rise to the cases.

The measurement of the exposure and the effect must be done blindly, with the measurement of the effect independent of the knowledge of the level of exposure. In addition, we will analyze whether attention has been paid to the temporal relationship between exposure and effect and whether there was a relationship between the level of exposure and the degree of effect. Finally, the statistical analysis should be correct, taking into account the control of possible confounding factors. This part can be complicated by the complexity of the statistical analyses that these designs usually require.

In addition, as we have already mentioned, if we are facing a case-crossover study, we must make sure that the three periods have been correctly defined, especially the effect period, whose inaccuracy may affect the conclusions of the study to a greater degree.

Next, we will evaluate the RELEVANCE of the results and their precision, as measured by their confidence intervals. We will look for the impact measures calculated by the authors of the study and, if they do not provide them, we will try to calculate them ourselves. Finally, we will compare the results with others previously published in the literature to see whether they are concordant with existing knowledge and what new knowledge they provide.

We will finish the critical appraisal by assessing the APPLICABILITY of the results. We will consider whether the participants can be assimilated to our patients and whether the conclusions are applicable to our setting.

And here we are going to finish this post. We have seen a whole new range of hybrid studies that combine the advantages of two observational designs to better adapt to situations in which classical studies are more difficult to apply. The drawback of these studies, as we have said, is that the analysis is a bit more complicated than that of conventional studies, since a crude analysis of the results is not enough: it must be adjusted for the possibility that a participant acts as both control and case (in the nested studies) and for the overrepresentation of the cases with respect to the sub-cohort (in the case-cohort studies).

Let me just add that everything I have said about case-crossover studies refers to the so-called unidirectional ones, studies in which there is a very precise temporal relationship between exposure and effect. For cases in which the exposure is more sustained, other types of case-crossover studies, called bidirectional, can be used, in which control periods are selected both before and after the effect. But that is another story…

There is another world, and it is this one

And there are other lives, but they are in you. So said Paul Éluard, that surrealist of the last century who had the bad idea of visiting Cadaqués accompanied by his wife, Elena Ivanovna Diakonova, better known as Gala. He was not very clever there, but his phrase has given for much more.

For example, it has been used by many writers who love the unknown, myths and mystery. I personally discovered the phrase as a young teenager because it was written as a preface to a series of science fiction books. Even, in more recent times, it has been related to that other incorporeal world that is cyberspace, where we spend an ever greater part of our lives.

But, to help Éluard rest peacefully in his tomb at Père-Lachaise, I will tell you that I prefer his original idea about our two worlds, between which we share our limited lifetime: the real world, where we do most things, and the world of the imagination, our intimate space, where we dream our most impossible realities.

You will think that today I am very metaphysical, but this is the thought that came to my mind when I started thinking about the topic we are going to deal with in this post. And the fact is that in the realm of medicine there are two worlds too.

We are very used to the numbers and objective results of our quantitative research. As an example, we have our revered systematic reviews, which gather the scientific evidence available on a specific health technology to assess its efficacy, safety, economic impact, etc. If we want to know whether watching a lot of TV is a risk factor for suffering that terrible disease called fildulastrosis, the best thing will be to do a systematic review of clinical trials (assuming there are any). Thus, we can calculate a multitude of parameters that, with a number, will give us a full idea of the impact of such an unhealthy habit.

But if what we want to know is how fildulastrosis affects the person who suffers from it, how much unhappiness it produces, how it alters family and social life, things get a little complicated with this type of research methodology. And this is important, because the social and cultural aspects related to the real context of people are increasingly valued. Luckily, there are other worlds and they are in this one. I am referring to the world of qualitative research. Today we are going to take a (short) look at this world.

Qualitative research is a method that studies reality in its natural context, as it occurs, in order to interpret phenomena according to the meanings they have for the people involved. For this it uses all kinds of sources and materials that help us describe the routine and the meaning of problematic situations in people’s lives: interviews, life stories, images, sounds… Although all this has little to do with the gridded world of quantitative research, the two methods are not incompatible and may even be complementary. Simply put, qualitative methods provide alternative information, different from and complementary to that of quantitative methods, which is useful for evaluating the perspectives of the people involved in the problem we are studying. Quantitative research addresses the problem deductively, while qualitative research uses an inductive approach.

Logically, the methods used in qualitative research are different from those of quantitative research. In addition, they are numerous, so we will not describe them in depth. Suffice it to say that the specific methods most used are meta-synthesis, phenomenology, meta-ethnography, meta-study, meta-interpretation, grounded theory, the biographical method and the aggregative review, among others.

The most frequently used of these methods is meta-synthesis, which starts with a research question and a bibliographic search, in a similar way to what we know from systematic reviews. However, there are a couple of important differences. In quantitative research the research question must be clearly defined, while in qualitative research this question is, by definition, flexible and is usually modified and refined as data collection progresses. The other aspect has to do with the literature search, because in qualitative research it is not so clearly defined which databases have to be used, and the filters and methodologies available to documentalists for reviews of quantitative research do not exist here.

Also, the techniques used for collecting data are different from those we are more accustomed to in quantitative research. One of them is observation, which allows the researcher to obtain information about the phenomenon as it occurs. The paradigm of observation in qualitative research is participant observation, in which the observer interacts socially with the subjects of the setting in which the phenomenon under study occurs. For example, if we want to assess the experiences of travelers on a commercial flight, nothing beats buying a ticket and posing as just another traveler, collecting all the information about comfort, punctuality, attention provided by the flight staff, quality of the snacks, etc.

Another widely used technique is the interview, in which a person asks another person or a group of people for information on a specific topic. When it is done with groups it is called, as it could not be otherwise, a group interview. In this case the script is quite closed and the role of the interviewer quite prominent, unlike in focus group discussions, in which everything can be more open, at the discretion of the group’s facilitator. In any case, when we want to know the opinion of many people, we can resort to the questionnaire technique, which polls the opinion of large groups so that each member of the group spends minimal time completing it, unlike focus groups, in which everyone remains throughout the interview.

The structure of a qualitative research study usually includes five fundamental steps, which may vary according to the methods and techniques used:

  1. Definition of the problem. As we have already mentioned when discussing the research question, the definition of the problem has a certain degree of provisionality and can change throughout the study, since one of the objectives may be precisely to find out whether the problem has been well defined.
  2. Study design. It must also be flexible. The problem with this phase is that, at times, the proposed design is not what we see in the published article. There is still a certain lack of definition of many methodological aspects, especially when compared with the methodology of quantitative research.
  3. Data collection. The techniques we have discussed are used: interview, observation, reading of texts, etc.
  4. Analysis of the data. This aspect also differs from quantitative analysis. Here it will be interesting to unravel the structures of meaning of the collected data in order to determine their scope and social implications. Although methods are being devised to express the results in numerical form, the usual thing is that we will not see many figures here and, of course, nothing like quantitative methods.
  5. Report and validation of the information. The objective is to generate conceptual interpretations of the facts in order to capture the meaning they have for the people involved. Again, and unlike quantitative research, the goal is not to project the results of possible interventions onto the environment, but to interpret the facts at hand.

At this point, what can we say about the critical appraisal of qualitative research? Well, to give you an idea, I will tell you that there is a great variety of opinions on this subject, from those who think that it makes no sense to evaluate the quality of a qualitative study to those who try to design evaluation instruments that provide numerical results similar to those of quantitative studies. So, my friends, there is no uniform consensus on whether we should evaluate in the first place, or on how to do it in the second. In addition, some people think that even studies that may be considered of low quality should be taken into account because, after all, who can define with certainty what a good qualitative research study is?

In general, when we critically appraise a qualitative research study, we will have to assess aspects such as its integrity, complexity, creativity, the validity of the data, the quality of the descriptive narrative, the interpretation of the results and the scope of its conclusions. We will continue here with our habit of resorting to the CASPe critical appraisal program, which provides us with a template of 10 questions to perform the critical appraisal of a qualitative study. These questions are structured around three pillars: rigor, credibility and relevance.

The questions about rigor refer to the suitability of the methods used to answer the research question. As usual, the first questions are elimination questions. If the answer to any of them is not affirmative, the matter is settled: at least with this study, it will not be worthwhile to continue with our appraisal. Were the objectives of the research clearly defined? We need to check that the question is well specified, as well as the objective of the research and the justification of its necessity. Is the qualitative methodology congruent? We will have to decide whether the methods used by the authors are adequate to obtain the data that will allow them to reach the objective of the research. Finally, is the research method used suitable for achieving the objectives? Researchers must explicitly state the method they have used (meta-synthesis, grounded theory…). In addition, the specified method must match the one actually used, which sometimes may not be the case.

If we have answered these three questions affirmatively, it will be worth continuing and we will move on to the detailed questions. Is the participant selection strategy consistent with the research question and the method used? It must be justified why the selected participants were the most suitable, as well as explained who recruited them, where, etc. Are the data collection techniques congruent with the research question and the method used? The data collection technique (for example, discussion groups) and the recording format will have to be specified and justified. If the collection strategy is modified during the study, the reason must be justified.

Has the relationship between the researcher and the object of research (reflexivity) been considered? It will be necessary to consider whether the involvement of the researcher in the process may have biased the data obtained and whether this has been taken into account when designing the data collection, the selection of participants and the scope of the study. To finish with the assessment of the rigor of the work, we will ask ourselves whether the ethical aspects have been taken into account. We will have to consider aspects common to quantitative research, such as informed consent, approval by an ethics committee or data confidentiality, as well as specific aspects concerning the effect of the study on participants before and after its completion.

The next block of two questions has to do with the credibility of the study, which is related to the ability of the results to represent the phenomenon from the subjective point of view of the participants. The first question makes us consider whether the analysis of the data was sufficiently rigorous. The entire analysis process should be described, along with the categories that may have arisen from the collected data, whether the subjectivity of the researcher has been assessed and how mutually contradictory data have been handled. If fragments of participants’ testimonies are presented to support the results, the reference to their origin must be clearly specified. The second question asks whether the results were presented clearly. They should be presented in a detailed and understandable manner, showing their relationship to the research question. At this point we will review the strategies adopted to ensure the credibility of the results, as well as whether the authors have reflected on the limitations of the study.

We will finish the critical appraisal by answering the only question of the block that has to do with the relevance of the study, which is nothing more than its usefulness or applicability to our clinical practice. Are the results of the research applicable? We will have to assess how the results contribute to our practice, how they add to existing knowledge and in what contexts they may be applicable.

And here we are going to leave it for today. You have already seen that we have taken a look into a world quite different from the one we are used to, in which we have to change a little our mentality about how to pose and study problems. Before leaving, I have to warn you, as in previous posts, not to look for fildulastrosis, because you will not find this disease anywhere. Actually, fildulastrosis is an invention of mine, in homage to a very illustrious character, sadly deceased: Forges. Antonio Fraguas (his nom de guerre comes from the English translation of his surname) was, in my humble opinion, the best graphic humorist I can remember. For many years I began the day with the daily Forges cartoon, so for some time now there have been mornings when one does not quite know how to start the day. Forges invented many words of his own, and I really liked his percutoria’s fildulastro, which had the defect of escalporning every now and then. Hence my fildulastrosis, so from here I thank him and offer him this little tribute.

And now we’re leaving. We have not said much about other methods of qualitative research, such as grounded theory, meta-ethnography, etc. Those interested have a bibliography where they are explained better than I could. And, of course, as in quantitative research, there are also ways to combine qualitative research studies. But that is another story…

Powerful gentleman

Yes, as the illustrious Francisco de Quevedo y Villegas once said, a powerful gentleman is Don Dinero (Mr. Money). A great truth because who, however purely in love, does not humble himself before the golden yellow? And even more so in a mercantilist and materialist society like ours.

But the problem is not that we are materialistic and just think about money. The problem is that nobody believes they have all the money they need. Even the wealthiest would like to have much more money. And many times, it is true, we do not have enough money to cover all our needs as we would like.

And that does not happen only at the individual level, but also at the level of social groups. Any country has a limited amount of money, which is why it cannot spend on everything it wants and has to choose where to spend it. Let’s think, for example, of our healthcare system, in which new health technologies (new treatments, new diagnostic techniques, etc.) are getting better… and more expensive (sometimes bordering on obscenity). If we are spending at the limit of our possibilities and want to apply a new treatment, we only have two choices: either we increase our wealth (where do we get the money from?) or we stop spending it on something else. There is a third one that is used frequently, even though it is not the right thing to do: spend what we do not have and pass the debt on to whoever comes next.

Yes, my friends, the saying that health is priceless does not hold up economically. Resources are always limited, and we must all be aware of the so-called opportunity cost of a product: its price is the money we will have to stop spending on something else.

Therefore, it is very important to properly evaluate any new health technology before deciding on its implementation in the health system, and this is why the so-called economic evaluation studies have been developed, aimed at identifying which actions should be prioritized to maximize the benefits produced in an environment with limited resources. These studies are a tool to assist decision-making, but they do not aim to replace it, so other elements have to be taken into account, such as justice, equity and freedom of choice.

Economic evaluation (EV) studies involve a whole set of specific methodology and terminology that is usually little known by those not dedicated to the evaluation of health technologies. Let’s briefly review their characteristics and finish with some recommendations on how to critically appraise these studies.

The first thing is to explain the two characteristics that define an EV: the measurement of the costs and benefits of the interventions (the first) and the choice or comparison between two or more alternatives (the second). These two features are essential to say that we are facing an EV, which can be defined as the comparative analysis of different health interventions in terms of costs and benefits. The methodology for developing an EV has to take into account a number of aspects that we list below and that you can see summarized in the attached table.

– Objective of the study. It will be determined whether the use of the new technology is justified in terms of the benefits it produces. For this, a research question will be formulated with a structure similar to that of other types of epidemiological studies.

– Perspective of the analysis. This is the point of view of the person or institution for whom the analysis is intended, which determines the costs and benefits that must be taken into account from the chosen standpoint. The most global perspective is that of society, although the perspective of the funders, of specific organizations (for example, hospitals) or of patients and families can also be adopted. The most usual is to adopt the perspective of the funders, sometimes accompanied by the social one. If so, both must be well differentiated.

– Time horizon of the analysis. It is the period of time during which the main economic and health effects of the intervention are evaluated.

– Choice of the comparator. This is a crucial point in determining the incremental effectiveness of the new technology, and the importance of the study for decision-makers will largely depend on it. In practice, the most common comparator is the alternative in habitual use (the gold standard), although it can sometimes be compared with the no-treatment option, which must be justified.

– Identification of costs. Costs are usually considered taking into account the total amount of the resource consumed and the monetary value of the resource unit (you know, as the friendly hostesses of an old TV contest used to say: 25 answers, at 5 pesetas each, 125 pesetas). Costs are classified as direct or indirect and as health or non-health costs. Direct costs are those clearly related to the illness (hospitalization, laboratory tests, laundry and kitchen, etc.), while indirect costs refer to productivity or its loss (work capacity, mortality). On the other hand, health costs are those related to the intervention (medicines, diagnostic tests, etc.), while non-health costs are those that the patient or other entities have to pay, or those related to productivity.

What costs will be included in an EV? It will depend on the intervention being analyzed and, especially, on the perspective and time horizon of the analysis.

– Quantification of costs. It will be necessary to determine the amount of resources used, either individually or in aggregate, depending on the information available.

– Cost assessment. Each resource will be assigned a unit price, specifying the source and the method used to assign this price. When the study covers long periods of time, it must be borne in mind that things do not cost the same over the years. If I tell you that I knew a time when you could go out at night with a thousand pesetas (the equivalent of about 6 euros now) and come back home with money in your pocket, you will think it is another of my frequent ravings, but I swear it is true.

To take this into account, a weighting factor or discount rate is used, which is usually between 3% and 6%. For those who are curious, the general formula is CV = FV / (1 + d)^n, where CV is the current (present) value, FV the future value, n the number of years and d the discount rate.
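As a small illustration of how the formula works, this sketch (with made-up figures) discounts a cost of 10,000 euros per year over five years at a 3% rate:

```python
# Present value of future costs: CV = FV / (1 + d)**n (invented figures)
annual_cost = 10_000   # euros per year
discount_rate = 0.03   # 3%
years = 5

present_value = sum(annual_cost / (1 + discount_rate) ** n for n in range(1, years + 1))
print(f"Discounted total: {present_value:,.0f} euros")  # about 45,797 vs 50,000 undiscounted
```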

– Identification, measurement and evaluation of results. The benefits obtained can be classified as health and non-health benefits. Health benefits are the clinical consequences of the intervention, generally measured from a point of view of interest to the patient (improvement in blood pressure figures, deaths avoided, etc.). Non-health benefits, on the other hand, are divided according to whether they produce improvements in productivity or in quality of life.

The former are easy to understand: productivity can improve because people return to work earlier (shorter hospitalization, shorter convalescence) or because they work better thanks to the improvement in their health. The latter are related to the concept of health-related quality of life, which reflects the impact of the disease and its treatment on the patient.

Health-related quality of life can be estimated using a series of questionnaires on patients’ preferences, summarized in a single score that, together with the quantity of life, will provide us with the quality-adjusted life year (QALY).

To assess quality of life we refer to the utilities of health states, which are expressed as a numerical value between 0 and 1, in which 0 represents the utility of the state of death and 1 that of perfect health. In this sense, a year of life lived in perfect health is equivalent to 1 QALY (1 year of life x 1 utility = 1 QALY). Thus, to determine the value in QALYs we multiply the value associated with a state of health by the years lived in that state. For example, half a year in perfect health (0.5 years x 1 utility) would be equivalent to one year with some ailments (1 year x 0.5 utility).
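The arithmetic is simple enough to sketch in a couple of lines, using hypothetical health states (years lived and their utilities):

```python
# QALYs = sum of (years lived in each state) x (utility of that state), hypothetical states
health_states = [
    (2.0, 1.0),  # two years in perfect health
    (3.0, 0.7),  # three years with moderate ailments
    (1.0, 0.4),  # one year with severe limitation
]

qalys = sum(years * utility for years, utility in health_states)
print(f"Total QALYs = {qalys:.1f}")  # 2.0 + 2.1 + 0.4 = 4.5
```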

– Type of economic analysis. We can choose between four types of economic analysis.

The first is the cost-minimization analysis, used when there is no difference in effect between the two options compared, a situation in which it is enough to compare the costs and choose the cheapest option. The second is the cost-effectiveness analysis, used when the interventions are similar; it determines the relationship between the costs and the consequences of the interventions in units usually employed in clinical practice (decrease in days of admission, for example). The third is the cost-utility analysis. It is similar to cost-effectiveness, but effectiveness is adjusted for quality of life, so the outcome is the QALY. Finally, the fourth is the cost-benefit analysis. In this type everything is measured in monetary units, which we usually understand quite well, although it can be a little complicated to express health gains with them.

– Analysis of results. The analysis will depend on the type of economic analysis used. In the case of cost-effectiveness studies, it is typical to calculate two measures: the average cost-effectiveness (dividing the cost by the benefit) and the incremental cost-effectiveness (the extra cost per unit of additional benefit obtained with one option with respect to the other). This last parameter is important, since it constitutes a limit on the efficiency of the intervention, which will be chosen or not depending on how much we are willing to pay for an additional unit of effectiveness.
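With made-up figures for a new technology and its usual-care comparator, the two measures would be calculated like this:

```python
# Average and incremental cost-effectiveness with invented figures
cost_old, qaly_old = 8_000, 5.0    # usual treatment
cost_new, qaly_new = 20_000, 6.5   # new technology

average_ce_old = cost_old / qaly_old                   # 1,600 euros per QALY
average_ce_new = cost_new / qaly_new                   # about 3,077 euros per QALY
icer = (cost_new - cost_old) / (qaly_new - qaly_old)   # extra cost per additional QALY

print(f"ICER = {icer:,.0f} euros per extra QALY")      # 8,000
```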

– Sensitivity analysis. As with other types of designs, EVs are not free of uncertainty, generally due to the lack of reliability of the available data. Therefore, it is convenient to evaluate the degree of uncertainty through a sensitivity analysis, to check the stability of the results and how they may change if the main variables vary. An example may be varying the chosen discount rate.

There are five types of sensitivity analysis: univariate (the study variables are modified one by one), multivariate (two or more are modified at once), extreme-scenario (we place ourselves in the most optimistic and most pessimistic scenarios for the intervention), threshold (identifying whether there is a critical value above or below which the choice switches from one of the compared interventions to the other) and probabilistic (assuming a certain probability distribution for the uncertainty of the parameters used).
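As an example of the simplest of these, a univariate analysis, we could recompute the incremental cost-effectiveness of the previous sketch while varying a single uncertain parameter, here the price of the new technology (again, invented figures):

```python
# One-way (univariate) sensitivity analysis: vary only the price of the new technology
cost_old, qaly_old = 8_000, 5.0
qaly_new = 6.5

for cost_new in (15_000, 20_000, 25_000):  # three price scenarios around the base case
    icer = (cost_new - cost_old) / (qaly_new - qaly_old)
    print(f"new cost {cost_new:>6,} euros -> ICER = {icer:>6,.0f} euros per extra QALY")
```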

– Conclusion. This is the last section in the development of an EV. The conclusions should take into account two aspects: internal validity (correct analysis for the patients included in the study) and external validity (the possibility of extrapolating the conclusions to other groups of similar patients).

As we said at the beginning of this post, EVs have a lot of jargon and their own methodological particularities, which makes their critical appraisal and correct understanding difficult for us. But let no one be discouraged: we can do it by relying on our three basic pillars: validity, relevance and applicability.

There are multiple guides that systematically explain how to assess an EV. Perhaps the first to appear was that of the British NICE (National Institute for Clinical Excellence), but others have arisen subsequently, such as that of the Australian PBAC (Pharmaceutical Benefits Advisory Committee) and that of the Canadian CADTH (Canadian Agency for Drugs and Technologies in Health). In Spain we could not be left behind, and the Health Technology Assessment Unit of the Laín Entralgo Agency also developed an instrument to determine the quality of an EV. This guide establishes recommendations for 17 domains that closely resemble what we have said so far, completed with a checklist to facilitate the assessment of the quality of the EV.

Anyway, as my usual sufferers know, I prefer to use a simpler checklist that is freely available on the Internet, which is none other than the tool provided by the CASPe group, which you can download from their website. We are going to follow these 11 CASPe questions, although without losing sight of the recommendations of the Spanish guide we have mentioned.

As always, we will start with the VALIDITY, first trying to answer two elimination questions. If the answer is negative, we can set the study aside and devote ourselves to a more productive task.

Is the question or objective of the evaluation well defined? The research question should be clear and define the target population of the study. Three fundamental aspects should also be clear in the objective: the options compared, the perspective of the analysis and the time horizon. Is there a sufficient description of all the possible alternatives and their consequences? The actions to be followed must be perfectly defined in all the compared options, including who applies each action, where and to whom. The usual approach will be to compare the new technology, at least, with the one in habitual use, always justifying the choice of the comparator, especially if this is the no-treatment option (in the case of pharmacological interventions).

If we have been able to answer these two questions affirmatively, we will move on to the four detailed questions. Is there evidence of the effectiveness of the intervention or of the evaluated program? We will see whether there are trials, reviews or other previous studies that prove the effectiveness of the interventions. Think of a cost-minimization study, in which we want to know which of two options, both effective, is cheaper: logically, we will need prior evidence of that effectiveness. Are the effects of the intervention (or interventions) identified, measured and appropriately valued? These effects can be measured with simple units, often derived from clinical practice, with monetary units or with more elaborate units of calculation, such as the QALYs mentioned above. Are the costs incurred by the intervention (or interventions) identified, measured and appropriately valued? The resources used must be well identified and measured in the appropriate units. The method and source used to assign value to the resources must be specified, as we have already mentioned. Finally, were discount rates applied to the costs of the intervention(s)? And to the effects? As we already know, this is fundamental when the time horizon of the study is long. In Spain, it is recommended to use a discount rate of 3% for basic resources. When doing the sensitivity analysis, this rate will be tested between 0% and 5%, which will allow comparison with other studies.

Once the internal validity of our EV has been assessed, we will answer the questions regarding the RELEVANCE of the results. Firstly, what are the results of the evaluation? We will review the units that have been used (QALYs, monetary costs, etc.) and whether the incremental benefit analysis has been carried out, in the appropriate cases. The second question in this section refers to whether an adequate sensitivity analysis has been carried out to know how the results would vary with changes in costs or effectiveness. In addition, it is recommended that the authors justify the modifications made with respect to the base case, the choice of the variables that are modified and the method used in the sensitivity analysis. Our Spanish guide recommends carrying out, whenever possible, a probabilistic sensitivity analysis, detailing all the statistical tests performed and the confidence intervals of the results.

Finally, we will assess the APPLICABILITY or external validity of our study by answering the last three questions. Would the program be equally effective in our environment? It will be necessary to consider whether the target population, the perspective, the availability of technologies, etc., are applicable to our clinical context. In addition, we must reflect on whether the costs would be transferable to our environment and whether it would be worth applying the intervention there. This may depend on social, political, economic and population differences, among others, between our environment and that in which the study was carried out.

And with this we are going to finish this post for today. Even if your head is spinning after all we have said, believe me when I tell you that we have done nothing but scratch the surface of the stormy world of economic evaluation studies. We have said nothing, for example, about the statistical methods that can be used in sensitivity analyses, which can get complicated, nor about modeling studies, which employ techniques only within reach of privileged minds, such as Markov chains, stochastic models or discrete event simulation models, to name a few. Neither have we talked about the types of studies on which economic evaluations are based. These can be experimental or observational studies, but they have a series of peculiarities that differentiate them from other studies of similar design but different purpose. This is the case of clinical trials that incorporate an economic evaluation (also known as piggy-back clinical trials), which tend to have a more pragmatic design than conventional trials. But that is another story…

King Kong versus Godzilla

What a mess these two make when they are let loose and come together! In this story, almost as old as I am (please do not run off to check what year the movie was made), poor King Kong, who must have traveled more than Tarzan, leaves his Skull Island to defend a village from an evil giant octopus and drinks a potion that leaves him sound asleep. Then some Japanese gentlemen seize the opportunity to take him to their country. I, who have visited Japan, can imagine the effect this produced on the poor ape when he woke up, so he had no choice but to escape, with the misfortune of running into Godzilla, who had also escaped from an iceberg where he had previously been frozen. And there they get entangled and the fight begins, rocks over here, atomic rays over there, until things get out of control and finally King Kong goes off to attack Tokyo, I do not remember exactly why. I swear I have not taken any hallucinogens, the film is like that, and I will not reveal any more so as not to spoil the ending in the unlikely case that you want to see the film after what I have told you. What I do not know is what the screenwriters had taken before coming up with this story.

At this point you will be wondering how today’s post can be related to this story. Well, the truth is that it has nothing to do with what we are going to talk about, but I could not think of a better way to start. Actually, it may be related after all, because today we are going to talk about a family of monsters within epidemiological studies: ecological studies. It is funny that whenever you read something about ecological studies, it always starts by saying that they are simple. Well, I do not think so. The truth is that there is a lot to get our teeth into, and we are going to try to explain them in a simple way. I thank my friend Eduardo (to whom I dedicate this post) for the effort he made to describe them intelligibly. Thanks to him I could understand them. Well… a little bit.

Ecological studies are observational studies with the peculiarity that the study population is not made up of individual subjects but of groups of subjects (clusters), so the level of inference of their estimates is also aggregate. They tend to be cheap and quick to perform (hence, I suppose, their supposed simplicity), since they usually use data from secondary sources that are already available, and they are very useful when it is not possible to measure the exposure at the individual level or when the effect can only be measured at the population level (such as the results of a vaccination campaign, for example).

The problem comes when we want to make inferences at the individual level based on their results, since they are subject to a series of biases that we will comment on later. In addition, since they are often descriptive studies with a historical time frame, it can be difficult to establish the temporal sequence between the exposure and the effect studied.

We will look at their specific characteristics in relation to three aspects of their methodology: types of variables and analyses, types of studies, and biases.

Ecological variables are classified into aggregate and environmental (also called global) variables. Aggregate variables summarize individual observations. They are usually means or proportions, such as the mean age at which the first King Kong movie is seen or the rate of geeks per 1,000 moviegoers, to name two absurd examples.

Environmental measures, on the other hand, are characteristic of a specific place. They can have a parallel at the individual level (for example, levels of environmental pollution, related to the crap each of us swallows) or be attributes of groups with no equivalent at the individual level (such as water quality, to name one).

As for the analysis, it can be done at the aggregate level, using data from groups of participants, or at the individual level, but preferably without mixing the two. Moreover, if data of both types are collected, it will be more convenient to transform them into a single level, the simplest option being to aggregate the individual data, although it can also be done the other way around or, even, both levels can be analyzed with hierarchical multilevel techniques, within reach of only a few privileged minds.

Obviously, the level of inference we want to apply will depend on our objective. If we want to study the effects of a risk factor at the individual level, the inference will be individual. An example would be to study the relationship between the number of hours of television watched and the incidence of brain cancer. On the other hand, and following a very pediatric example, if we want to know the effectiveness of a vaccine, the inferences will be made in aggregate from the data on vaccination coverage in the population. And to take it one step further, we can measure an exposure factor in both ways, individual and grouped, for example the density of Mexican restaurants in a population and the frequency of antacid intake. In this case we would make a contextual inference.

Regarding the type of ecological studies, we can classify them according to the exposure method and the grouping method.

According to the exposure method, things are relatively simple and we can find two types of studies. If we do not measure the exposure variable, or do so only partially, we speak of exploratory studies. Otherwise, we are dealing with an analytical study.

According to the grouping method, we can consider three types: multiple (when multiple zones are selected), temporal (there is measurement over time) and mixed (a combination of both).

The complexity begins when the two dimensions (exposure and grouping) are combined, since then we can find a series of more complex designs. Thus, multiple-group studies can be exploratory (the exposure factor is not measured, but the effect is) or analytical (the most frequent; here we measure both). Temporal-trend studies, not to be outdone, can also be exploratory or analytical, in a similar way to the previous ones but with a temporal trend. Finally, there are mixed studies, which compare the temporal trends of several geographical areas. Simple, isn’t it?

Well, this is nothing compared to the complexity of the statistical techniques used in these studies. Until recently the analyses were very simple and based on measures of association or linear correlation, but in recent times numerous techniques based on regression models and more exotic things, such as multiplicative log-linear models or Poisson regression, have been developed. The merit of all these methods is that, starting from the grouped measures, they allow us to know how many exposed or unexposed subjects present the effect, thus allowing the calculation of rates, attributable fractions, etc. Do not fear, we will not go into detail, but there is bibliography available for those who want to rack their brains.
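Just to give a flavor of what one of these models looks like, here is a minimal sketch of a Poisson regression on aggregate data (all the figures are invented), with the population of each area as an offset so that the exponentiated coefficient can be read as a rate ratio; the analyses in real ecological studies are usually far more elaborate:

```python
import numpy as np
import statsmodels.api as sm

# Invented aggregate data: event counts, exposure level and population for five areas
events = np.array([12, 30, 25, 8, 40])
exposure = np.array([0.2, 0.8, 0.6, 0.1, 0.9])
population = np.array([10_000, 12_000, 11_000, 9_000, 13_000])

# Poisson regression with log(population) as offset, so the coefficients act on rates
X = sm.add_constant(exposure)
result = sm.GLM(events, X, family=sm.families.Poisson(), offset=np.log(population)).fit()

# exp(intercept) = baseline rate; exp(slope) = rate ratio per unit increase in exposure
print(np.exp(result.params))
```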

To finish with the methodological aspects of ecological studies, we will list some of their most characteristic biases, favored by the use of aggregate units of analysis.

The most famous of all is the ecological bias, also known as the ecological fallacy. It occurs when the grouped measure does not capture the biological effect at the individual level, so that the individual inference made from it is erroneous. This bias became famous with the New England Journal of Medicine study that concluded that there was a relationship between chocolate consumption and Nobel prizes. The problem is that, beyond the funny side of this example, the ecological fallacy is the main limitation of this type of study.

Another bias with some peculiarities in this type of study is confounding bias. In studies dealing with individual units, confounding occurs when a third variable is related to both the exposure and the effect without being part of the causal relationship between the two. This ménage à trois is a bit more complex in ecological studies: a factor can act as a confounder at the ecological level but not at the individual level and, vice versa, it is possible that confounding factors at the individual level do not produce confounding at the aggregate level. In any case, as in other studies, we must try to control the confounding factors, for which there are two fundamental approaches.

The first is to include the possible confounding variables in the mathematical model as covariates and perform a multivariate analysis, which makes studying the effect more complicated. The second is to adjust or standardize the rates of the effect by the confounding variables and fit the regression model with the adjusted rates. For this to be possible, it is essential that all the variables introduced into the model be adjusted for the same confounding variable and that the covariances of the variables be known, which does not always happen. In any case, and not to be discouraging, we often cannot be sure that the confounding factors have been adequately controlled, even using the most recent and sophisticated multilevel analysis techniques, since the origin of the problem may lie in unknown characteristics of the distribution of the data among the groups.
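As a purely illustrative example of the second approach, here is a minimal sketch of direct standardization of rates by a confounder (age, in this invented example). The stratum-specific rates and the standard population weights are made up; the point is only to show that the adjusted rate is a weighted average of the stratum rates.

```
# Minimal sketch of direct standardization of rates by a confounder (age group).
# All numbers are made up for illustration.
import numpy as np

# Stratum-specific rates (events per 1,000 person-years) in the study area
stratum_rates = np.array([0.5, 2.0, 8.0])        # young, middle-aged, old
# Weight of each age stratum in the chosen standard population
standard_weights = np.array([0.40, 0.40, 0.20])

# The age-adjusted rate is the weighted average of the stratum-specific rates
adjusted_rate = np.sum(stratum_rates * standard_weights)
print(f"Age-adjusted rate: {adjusted_rate:.2f} per 1,000 person-years")
```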

Other gruesome aspects of ecological studies are temporal ambiguity bias (as we have already mentioned, it is often difficult to ensure that the exposure precedes the effect) and collinearity (the difficulty of assessing the effects of two or more exposures that may occur simultaneously). In addition, although these are not specific to ecological studies, they are very prone to information biases.

You can see that I was right at the beginning when I told you that ecological studies seem to me many things, but not simple. In any case, it is worth understanding what their methodology is based on because, with the development of new analysis techniques, they have gained in prestige and power and it is more than likely that we will come across them more and more frequently.

But do not despair: the important thing for us, consumers of the medical literature, is to understand how they work so that we can critically appraise the articles when we come across them. Although, as far as I know, there are no checklists as structured as those CASP provides for other designs, the critical appraisal will follow the usual general scheme based on our three pillars: validity, relevance and applicability.

The study of VALIDITY is done in a similar way as for other types of cross-sectional observational studies. The first thing is to check that there is a clear definition of the population and of the exposure or effect under study. The units of analysis and their level of aggregation must be clearly specified, as well as the methods of measuring the effect and the exposure, the latter, as we already know, only in analytical studies.

The study sample should be representative, for which we will have to review the selection procedures, the inclusion and exclusion criteria, and the sample size. These data will also influence the external validity of the results.

As in any observational study, the measurement of exposure and effect should be done blindly and independently, using valid instruments. The authors must present the data completely, accounting for any losses or out-of-range values. Finally, there must be a correct analysis of the results, with control of the biases typical of these studies: ecological, information, confounding, temporal ambiguity and collinearity.

In the RELEVANCE section we can begin with a quantitative assessment, summarizing the most important result and reviewing the magnitude of the effect. We must look for, or calculate ourselves if possible, the most appropriate impact measures: differences in incidence rates, attributable fraction in the exposed, etc. If the authors do not provide these data but do provide the regression model, it is possible to calculate the impact measures from the coefficients of the independent variables of the model. I am not going to list the formulas here, so as not to make this post even more unfriendly, but you should know they exist in case you ever need them.
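As a taste of one of those formulas, here is a tiny hypothetical example: if a Poisson model gives us a coefficient for a dichotomous exposure, exponentiating it gives the rate ratio, and from the rate ratio we can compute the attributable fraction in the exposed as (RR − 1) / RR. The coefficient value below is invented.

```
# Hypothetical example: deriving impact measures from a regression coefficient.
# Assume a Poisson model returned the coefficient beta for a dichotomous exposure.
import math

beta = 0.47                      # made-up coefficient for the exposure variable
rate_ratio = math.exp(beta)      # RR = e^beta
# Attributable fraction in the exposed: the proportion of the rate among exposed
# subjects that could be attributed to the exposure (assuming causality)
af_exposed = (rate_ratio - 1) / rate_ratio

print(f"Rate ratio: {rate_ratio:.2f}")
print(f"Attributable fraction in exposed: {af_exposed:.1%}")
```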

Then we will make a qualitative assessment of the results, trying to judge the clinical interest of the main outcome measure, the relevance of the effect size and the impact it may have for the patient, the health system or society.

We will finish this section with a comparative assessment (looking for similar studies and comparing the main outcome measure and other alternative measures) and an assessment of the relationship between benefits, risks and costs, as we would do with any other type of study.

Finally, we will consider the APPLICABILITY of the results in clinical practice, taking into account aspects such as adverse effects, economic cost, etc. We already know that the fact that a study is well done does not mean that we are obliged to apply it in our setting.

And here we are going to leave it for today. When you read or carry out an ecological study, be careful not to fall into the temptation of drawing conclusions about causality. Quite apart from the traps the ecological fallacy may set for you, ecological studies are observational, so they can be used to generate hypotheses about causality, but not to confirm them.

And now we are leaving. I did not tell you who won the fight between King Kong and Godzilla so as not to spoil it, but surely the smartest among you have already guessed. After all, and to its misfortune, only one of the two later traveled to New York. But that is another story…

The crystal ball

How I wish I could predict the future! And not only to win millions in the lottery, which is the first thing that comes to mind. There are more important things in life than money (or so some say): decisions that we make based on assumptions that end up not being fulfilled and that complicate our lives to unsuspected limits. We have all thought at some point about what we would do if we could live twice. I have no doubt that, if I met the genie of the lamp, one of the three wishes I would ask for would be a crystal ball to see the future.

And it would also serve us well in our work as doctors. In our day-to-day practice we are forced to make decisions about the diagnosis or prognosis of our patients, and we always do so on the swampy terrain of uncertainty, always assuming the risk of making some mistake. We, especially as we become more experienced, estimate consciously or unconsciously the likelihood of our assumptions, which helps us in making diagnostic or therapeutic decisions. Even so, it would be nice to have a crystal ball to know the patient's course more accurately.

The problem, as with other inventions that would be very useful in medicine (like the time machine), is that nobody has yet managed to manufacture a crystal ball that really works. But let us not be discouraged: we cannot know for sure what will happen, but we can estimate the probability that a certain result will occur.

For this, we can take all those patient variables with a known diagnostic or prognostic value and integrate them to calculate probabilities. Doing such a thing would amount to designing and applying what is known as a clinical prediction rule (CPR).

Thus, if we get a little formal, we can define a CPR as a tool composed of a set of variables from the clinical history, the physical examination and basic complementary tests, which provides us with an estimate of the probability of an event, suggests a diagnosis or predicts a specific response to a treatment.

The critical appraisal of an article about a CPR shares aspects with that of articles about diagnostic tests, and it also has specific aspects related to the methodology of its design and application. For this reason, we will briefly review the methodological aspects of CPRs before going into their critical appraisal.

In the process of developing a CPR, the first thing to do is to define it. The four key elements are the study population, the variables that we will consider as potentially predictive, the gold or reference standard that classifies whether the event we want to predict occurs or not, and the criteria for assessing the result.

It must be borne in mind that the variables we choose must be clinically relevant, collected accurately and, of course, available at the time we want to apply the CPR for decision making. It is advisable not to fall into the temptation of piling up variables endlessly since, apart from complicating the application of the CPR, this can decrease its validity. In general, it is recommended that, for every variable introduced into the model, there be at least 10 of the events we want to predict (the rule is derived in a sample whose members all have the variables measured, but only some of whom end up presenting the event to be predicted). For example, a derivation sample with 80 events would support at most eight candidate predictors.

I would also like to highlight the importance of the gold standard. There must be a diagnostic test or a set of well-defined criteria that allow us to clearly define the event we want to predict with the CPR.

Finally, it is convenient that those who collect the variables during this definition phase are unaware of the results of the gold standard, and vice versa. The absence of blinding decreases the validity of the CPR.

The next step is the derivation or design phase itself. This is where the statistical methods are applied that allow us to include the predictive variables and exclude those that will not contribute anything. We will not go deep into the statistics; suffice it to say that the most commonly used methods are those based on logistic regression, although discriminant analysis, survival analysis and even more exotic analyses based on discriminant risks or neural networks can be used, which only a few virtuosos can afford.

In logistic regression models, the event will be the dichotomous dependent variable (it either happens or it does not) and the other variables will be the predictive or independent variables. Thus, the coefficient that multiplies each predictive variable will be the natural logarithm of the adjusted odds ratio. In case anyone has not followed, the adjusted odds ratio for each predictive variable is obtained by raising the number "e" to the value of the coefficient of that variable in the regression model.
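To make that concrete, here is a minimal sketch of fitting a logistic model and converting its coefficients into adjusted odds ratios. The data, the variable names and the use of Python with statsmodels are all my own invented assumptions; the only real point is the last line, where "e" is raised to each coefficient.

```
# Minimal sketch: fit a logistic model and obtain adjusted odds ratios as exp(coefficient).
# Toy, simulated data with invented variable names.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "age": rng.normal(60, 10, n),
    "fever": rng.integers(0, 2, n),
})
# Simulated dichotomous outcome loosely related to the predictors
logit = -8 + 0.12 * df["age"] + 0.8 * df["fever"]
df["event"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = sm.add_constant(df[["age", "fever"]])
result = sm.Logit(df["event"], X).fit(disp=False)

# Adjusted odds ratios: "e" raised to each regression coefficient
print(np.exp(result.params))
```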

The usual thing is to assign each variable a certain number of points on a scale according to its weight in the model, so that the total sum of points over all the predictive variables classifies the patient into a specific range of predicted probability of the event. There are also other more complex methods using regression equations but, in the end, you always get the same thing: an individualized estimate of the probability of the event in a particular patient.
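Here is a hypothetical illustration of one simple way of turning coefficients into a points scale: divide each coefficient by the smallest one and round to the nearest integer. The coefficients and variable names are invented, and real rules may use other weighting schemes; this is only a sketch of the general idea.

```
# Hypothetical sketch: converting regression coefficients into a simple points score.
coefficients = {"age_over_65": 0.45, "fever": 0.90, "high_lactate": 1.35}

smallest = min(coefficients.values())
points = {var: round(coef / smallest) for var, coef in coefficients.items()}
print(points)   # e.g. {'age_over_65': 1, 'fever': 2, 'high_lactate': 3}

# A patient's total score is the sum of the points of the variables they present
patient = {"age_over_65": 1, "fever": 0, "high_lactate": 1}
total = sum(points[var] * present for var, present in patient.items())
print(f"Total score: {total}")
```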

With this process we categorize patients into homogeneous groups of probability, but we still need to know whether this categorization fits reality or, in other words, what the discriminating capacity of the CPR is.

The overall validity or discriminating capacity of the CPR is assessed by contrasting its results with those of the gold standard, using techniques similar to those used to assess the performance of diagnostic tests: sensitivity, specificity, predictive values and likelihood ratios. In addition, in cases where the CPR provides a quantitative estimate, we can resort to ROC curves, since the area under the curve represents the global validity of the CPR.
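As a minimal sketch of what that assessment looks like in practice, here is an example that computes sensitivity and specificity at an arbitrary cut-off and the area under the ROC curve. The data are toy numbers and the use of Python with scikit-learn is simply my choice of illustration.

```
# Minimal sketch of assessing discrimination against the gold standard (toy data).
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])                       # gold standard
y_prob = np.array([0.9, 0.2, 0.7, 0.6, 0.3, 0.1, 0.8, 0.6, 0.2, 0.4])   # CPR estimates

y_pred = (y_prob >= 0.5).astype(int)          # arbitrary cut-off for illustration
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
auc = roc_auc_score(y_true, y_prob)           # area under the ROC curve
print(f"Sensitivity: {sensitivity:.2f}, Specificity: {specificity:.2f}, AUC: {auc:.2f}")
```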

The last step of the design phase is the calibration of the CPR, which is nothing more than checking that it behaves well across the whole range of possible results.
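In practice, calibration is usually checked by comparing the probabilities predicted by the rule with the frequencies actually observed across risk groups. The sketch below, with simulated data and scikit-learn (my own assumptions, not anything prescribed by CPR methodology), shows the basic idea.

```
# Minimal sketch of checking calibration: observed event frequency versus
# mean predicted probability, per probability bin (simulated toy data).
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_prob = rng.random(500)                            # probabilities estimated by the rule
y_true = (rng.random(500) < y_prob).astype(int)     # simulated, well-calibrated events

prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=5)
for observed, predicted in zip(prob_true, prob_pred):
    print(f"predicted {predicted:.2f} -> observed {observed:.2f}")
```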

Some authors of CPRs stop here, but they forget two fundamental steps of the development: the validation and the calculation of the clinical impact of the rule.

Validation consists of testing the CPR in samples different from the one used for its design. We may be surprised to find that a rule that works well in one sample does not work in another. Therefore, it must be tested not only in similar patients (limited validation) but also in different clinical settings (broad validation), which will increase the external validity of the CPR.

The last phase is to check its clinical performance. This is where many CPRs fall apart after having passed all the previous steps (maybe that is why this last check is often skipped). To assess the clinical impact, we have to apply the CPR to our patients and see how clinical outcome measures such as survival, complications or costs change. The ideal way to analyze the clinical impact of a CPR is to conduct a clinical trial with two groups of patients, managed with and without the rule.

For those self-sacrificing souls who are still reading, now that we know what a CPR is and how it is designed, we will see how the critical appraisal of these studies is done. And for this, as usual, we will rely on our three pillars: validity, relevance and applicability. So as not to forget anything, we will follow the questions listed in the CASP grid for studies on CPRs.

Regarding VALIDITY, we will start with some elimination questions. If the answer to any of them is negative, it may be time to wait until someone finally invents a crystal ball that works.

Does the rule answer a well-defined question? The population, the event to be predicted, the predictive variables and the outcome evaluation criteria must be clearly defined. If this is not done or these components do not fit our clinical scenario, the rule will not help us. The predictive variables must be clinically relevant, reliable and well defined in advance.

Did the study population from which the rule was derived include an adequate spectrum of patients? We must verify that the method of patient selection was adequate and that the sample is representative. In addition, it must include patients from the entire spectrum of the disease: as with diagnostic tests, events may be easier to predict in certain groups, so all of them must be represented. Finally, we must see whether the rule was validated in a different group of patients. As we have already said, it is not enough for the rule to work in the group of patients from which it was derived; it must also be tested in other groups, similar to or different from the one used to generate it.

If the answer to these three questions has been affirmative, we can move on to the next three. Was there a blind evaluation of the outcome and of the predictor variables? As we have already commented, it is important that the person who collects the predictive variables does not know the result of the reference standard, and vice versa. The collection of information must be prospective and independent. The next thing to ask is whether the predictor variables and the outcome were measured in all the patients. If the outcome or the variables are not measured in all patients, the validity of the CPR may be compromised; in any case, the authors should explain any exclusions. Finally, are the methods of derivation and validation of the rule described? We already know that it is essential that the results of the rule be validated in a population different from the one used for its design.

If the answers to the previous questions indicate that the study is valid, we will move on to the questions about the RELEVANCE of the results. The first is whether the performance of the CPR can be calculated. The results should be presented with their sensitivity, specificity, odds ratios, ROC curves, etc., depending on the kind of result the rule provides (scoring scales, regression formulas, etc.). All these indicators will help us to calculate the probability of occurrence of the event in settings with different prevalences. This is similar to what we did with studies of diagnostic tests, so I invite you to review the post on that subject so as not to repeat myself too much. The second question is: what is the precision of the results? We will not go on at length here either: remember our revered confidence intervals, which inform us of the precision of the results of the rule.
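Purely as an illustration of that last point, here is a hypothetical example of putting a confidence interval around the sensitivity of a rule, using invented counts from an imaginary two-by-two table and the statsmodels library (again, my choice, not something the CASP questions require).

```
# Hypothetical example of reporting precision: a Wilson 95% confidence interval
# around the sensitivity of the rule, from made-up counts.
from statsmodels.stats.proportion import proportion_confint

true_positives, false_negatives = 45, 9          # invented counts
sensitivity = true_positives / (true_positives + false_negatives)
low, high = proportion_confint(true_positives,
                               true_positives + false_negatives,
                               alpha=0.05, method="wilson")
print(f"Sensitivity: {sensitivity:.2f} (95% CI {low:.2f} to {high:.2f})")
```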

To finish, we will consider the APPLICABILITY of the results to our setting, for which we will try to answer three questions. Will the reproducibility of the CPR and its interpretation be satisfactory in our scenario? We will have to think about the similarities and differences between the setting in which the CPR was developed and our clinical environment. In this sense, it helps if the rule has been validated in several samples of patients from different settings, which increases its external validity. Is the test acceptable in our case? We will consider whether the rule is easy to apply in our setting and whether it makes sense to do so from the clinical point of view. Finally, will the results modify clinical behavior, health outcomes or costs? If, from our point of view, the results of the CPR are not going to change anything, the rule will be useless and a waste of time. Here our opinion matters, but we must also look for studies that assess the impact of the rule on costs or on health outcomes.

And that is everything I wanted to tell you about the critical appraisal of studies on CPRs. Anyway, before finishing, I would like to mention a checklist that, of course, also exists for the assessment of this type of study: the CHARMS checklist (CHecklist for critical Appraisal and data extraction for systematic Reviews of prediction Modeling Studies). You cannot deny that the name, although a bit fancy, is lovely.

This list is designed to assess the primary studies included in a systematic review on CPRs. It tries to answer some general questions about the design and assesses 11 domains in order to extract enough information to perform the critical appraisal. The two major aspects assessed are the risk of bias of the studies and their applicability. The risk of bias refers to flaws in the design or validation that may make the model less discriminating, excessively optimistic, etc. Applicability, on the other hand, refers to the degree to which the primary studies match the question that motivates the systematic review, telling us whether the rule can be applied to the target population. This list is good and helps to assess and understand the methodological aspects of this type of study but, in my humble opinion, it is easier to do a systematic critical appraisal using the CASP tool.

And here, finally, we leave it for today. So as not to go on too long, we have not said anything about what to do with the result of the rule. The fundamental thing, as we already know, is that we can calculate the probability of the event occurring in individual patients from settings with different prevalences. But that is another story…