Critical appraisal of clinical prediction rules
How I wish I could predict the future! And not only to win millions in the lottery, which is the first thing you can think of. There are more important things in life than money (or so that’s what some say), decisions that we make based on assumptions that end up not being fulfilled and that complicate our lives to unsuspected limits. We all have ever thought about “if you lived twice …” I have no doubt, if I met the genie of the lamp one of the three wishes I would ask would be a crystal ball to see the future.
And we could also do well in our work as doctors. In our day to day we are forced to make decisions about the diagnosis or prognosis of our patients and we always do it on the swampy terrain of uncertainty, always assuming the risk of making some mistake. We, especially when we are more experienced, estimate consciously or unconsciously the likelihood of our assumptions, which helps us in making diagnostic or therapeutic decisions. However, it would be good to also have a crystal ball to know more accurately the evolution of the patient’s course.
The problem, as with other inventions that would be very useful in medicine (like the time machine), is that nobody has yet managed to manufacture a crystal ball that really works. But do not let us down. We cannot know for sure what will happen, but we can estimate the probability that a certain result will occur.
For this, we can use all those variables related to the patient that have a known diagnostic or prognostic value and integrate them to perform the calculation of probabilities. Well, doing such a thing would be the same as designing and applying what is known as a clinical prediction rule (CPR).
Thus, if we get a little formal, we can define a CPR as a tool composed of a set of variables of clinical history, physical examination and basic complementary tests, which provides us with an estimate of the probability of an event, suggesting a diagnosis or predicting a concrete response to a treatment.
The critical appraisal of an article about a CPR shares similar aspects with those of the ones about diagnostic tests and also has specific aspects related to the methodology of its design and application. For this reason, we will briefly look at the methodological aspects of CPRs before entering into their critical assessment.
Clinical prediction rules
In the process of developing a CPR, the first thing to do is to define it. The four key elements are the study population, the variables that we will consider as potentially predictive, the gold or reference standard that classifies whether the event we want to predict occurs or not and the criterion of assessment of the result.
It must be borne in mind that the variables we choose must be clinically relevant, they must be collected accurately and, of course, they must be available at the time we want to apply the CPR for decision making. It is advisable not to fall into the temptation of putting variables everywhere and endlessly since, apart from complicating the application of the CPR, it can decrease its validity. In general, it is recommended that for every variable that is introduced in the model there should have been at least 10 events that we want to predict (the design is made in a certain sample whose components have the variables but only a certain number have ended up presenting the event to predict).
I would also like to highlight the importance of the gold standard. There must be a diagnostic test or a set of well-defined criteria that allow us to clearly define the event we want to predict with the CPR.
Finally, it is convenient that those who collect the variables during this definition phase are unaware of the results of the gold standard, and vice versa. The absence of blinding decreases the validity of the CPR.
The next step is the derivation or design phase itself. This is where the statistical methods that allow to include predictive variables and exclude those that are not going to contribute anything are applied. We will not go into statistics, just say that the most commonly used methods are those based on logistic regression, although discriminant, survival and even more exotic analysis based on discriminant risks or neural networks can be used, only afforded by a few virtuous ones.
In the logistic regression models, the event will be the dichotomous dependent variable (it happens or it does not happen) and the other variables will be the predictive or independent variables. Thus, each coefficient that multiplies each predictive variable will be the natural antilogarithm of the adjusted odds ratio. In case anyone has not understood, the adjusted odds ratio for each predictive variable will be calculated raising the number “e” to the value of the coefficient of that variable in the regression model.
The usual thing is that a certain score is assigned on a scale according to the weight of each variable, so that the total sum of points of all the predictive variables will allow to classify the patient in a specific range of prediction of event production. There are also other more complex methods using regression equations, but after all you always get the same thing: an individualized estimate of the probability of the event in a particular patient.
With this process we perform the categorization of patients in homogenous groups of probability, but we still need to know if this categorization is adjusted to reality or, what is the same, what is the capacity of discrimination of the CPR.
The overall validity or discrimination capacity of the PRC will be assess by contrasting its results with those of the gold standard, using similar techniques to those used to assess the power of diagnostic tests: sensitivity, specificity, predictive values and likelihood ratios. In addition, in cases where the CPR provides a quantitative estimate, we can resort to the use of the ROC curves, since the area under the curve will represent the global validity of the CPR.
The last step of the design phase will be the calibration of the CPR, which is nothing more than checking its good behavior throughout the range of possible results.
Some CPR’s authors end this here, but they forget two fundamental steps of the elaboration: the validation and the calculation of the clinical impact of the rule.
The validation consists in testing the CPR in samples different to the one used for its design. We can take a surprise and verify that a rule that works well in a certain sample does not work in another. Therefore, it must be tested, not only in similar patients (limited validation), but also in different clinical settings (broad validation), which will increase the external validity of the CPR.
The last phase is to check its clinical performance. This is where many CPRs crash down after having gone through all the previous steps (maybe that’s why this last check is often avoided). To assess the clinical impact, we will have to apply CPR in our patients and see how clinical outcome measures change such as survival, complications, costs, etc. The ideal way to analyze the clinical impact of a CPR is to conduct a clinical trial with two groups of patients managed with and without the rule.
Critical appraisal of clinical prediction rules
For those self-sacrificing people who are still reading, now that we know what a CPR is and how it is designed, we will see how the critical appraisal of these works is done. And for this, as usual, we will use our three pillars: validity, relevance and applicability. To not forget anything, we will follow the questions that are listed on the grid for CRP studies of the CASP tool.
Regarding VALIDITY, we will start first with some elimination questions. If the answer is negative, it may be time to wait until someone finally makes up a crystal ball that works.
Does the rule answer a well-defined question? The population, the event to be predicted, the predictive variables and the outcome evaluation criteria must be clearly defined. If this is not done or these components do not fit our clinical scenario, the rule will not help us. The predictive variables must be clinically relevant, reliable and well defined in advance.
Did the study population from which the rule was derived include an adequate spectrum of patients? It must be verified that the method of patient selection is adequate and that the sample is representative. In addition, it must include patients from the entire spectrum of the disease. As with diagnostic tests, events may be easier to predict in certain groups, so there must be representatives of all of them. Finally, we must see if the sample was validated in a different group of patients. As we have already said, it is not enough that the rule works in the group of patients in which it has been derived, but that it must be tested in other groups that are similar or different from those with which it was generated.
If the answer to these three questions has been affirmative, we can move on to the three next questions. Was there a blind evaluation of the outcome and of the predictor variables? We have already commented, it is important that the person who collects the predictive variables does not know the result of the reference pattern, and vice versa. The collection of information must be prospective and independent. The next thing to ask is whether the predictor variables and the outcome in all the patients were measured. If the outcome or the variables are not measured in all patients, the validity of the CPR can be compromised. In any case, the authors should explain the exclusions, if there are any. Finally, are the methods of derivation and validation of the rule described? We already know that it is essential that the results of the rule be validated in a population different from the one used for the design.
If the answers to the previous questions indicate that the study is valid, we will answer the questions about the RELEVANCE of the results. The first is if you can calculate the performance of the CRP. The results should be presented with their sensitivity, specificity, odds ratios, ROC curves, etc., depending on the result provided by the rule (scoring scales, regression formulas, etc.). All these indicators will help us to calculate the probabilities of occurrence of the event in environments with different prevalence. This is similar to what we did with the studies of diagnostic tests, so I invite you to review the post on the subject to not repeat too much. The second question is: what is the precision of the results? Here we will not extend either: remember our revered confidence intervals, which will inform us of the accuracy of the results of the rule.
To finish, we will consider the APPLICABILITY of the results to our environment, for which we will try to answer three questions. Will the reproducibility of the PRC and its interpretation be satisfactory within the scope of the scenario? We will have to think about the similarities and differences between the field in which the CPR develops and our clinical environment. In this sense, it will be helpful if the rule has been validated in several samples of patients from different environments, which will increase its external validity. Is the test acceptable in this case? We will think wether the rule is easy to apply in our environment and wether it makes sense to do it from the clinical point of view in our environment. Finally, will the results modify clinical behavior, health outcomes or costs? If, from our point of view, the results of the CPR are not going to change anything, the rule will be useless and a waste of time. Here our opinion will be important, but we must also look for studies that assess the impact of the rule on costs or on health outcomes.
And up to here everything I wanted to tell you about critical appraising of studies on CPRs. Anyway, before finishing I would like to tell you a little about a checklist that, of course, also exists for the valuation of this type of studies: the checklist CHARMS (CHecklist for critical Appraisal and data extraction for systematic Reviews of prediction Modeling Studies). You will not tell me that the name, although a bit fancy, is not lovely.
This list is designed to assess the primary studies of a systematic review on CPRs. It try to answer some general design questions and assess 11 domains to extract enough information to perform the critical appraisal. The two great parts that are valued are the risk of bias in the studies and its applicability. The risk of bias refers to the design or validation flaws that may result in the model being less discriminative, excessively optimistic, etc. The applicability, on the other hand, refers to the degree to which the primary studies are in agreement with the question that motivates the systematic review, for which it informs us of whether the rule can be applied to the target population. This list is good and helps to assess and understand the methodological aspects of this type of studies but, in my humble opinion, it is easier to make a systematic critical appraisal by using the CASP’s tool.
And here, finally, we leave it for today. We have not spoken anything, so as not to stretch ourselves too long, of what to do with the result of the rule. The fundamental thing, we already know, is that we can calculate the probability of occurrence of the event in individual patients from environments with different prevalence. But that is another story…