Data imputation includes a number of techniques for assigning theoretical values to variables with missing data.
I guess you know the quote. It’s usually used to say that someone is not quite right in the head, judging by something he has said. You also know that the phrase belongs to a longer proverb which says that “not all who are in there are, and not all who are, are in there”, which usually refers to asylums.
According to the saying, not all who are admitted to a mental hospital are crazy, and not all of us who are outside are sane. Personally, I cannot say anything about the first half, because I’ve never been admitted to one, although there’s still time. As for the second half, it’s clearly true. I’d dare say there are even more crazy people outside than inside.
But today we’re not going to talk about crazy people, but about others who are not all there either. And not for any special reason, but simply because they are missing. We’re going to talk about missing data.
The absence of data is very common in any research study. There’s no survey or study database without empty cells, sometimes even in data of great interest to the researcher. There can be many reasons for missing data. Sometimes respondents do not answer for lack of time or interest. Sometimes they respond but give a meaningless answer, or the investigator encodes a wrong answer. Other times, missing data are related to the loss to follow-up that occurs in many studies, or to lack of compliance with trial treatments.
Types of missing data
There are several ways to deal with data loss, but which one to choose depends largely on the mechanism that causes this lack of data. In this regard, data may be missing at random (MAR), missing not at random (MNAR) or missing completely at random (MCAR).
With MAR data, the fact that a value is missing may be related to other variables in the study, but not to the missing value itself. For example, if we assess the teratogenic effect of a drug, the value of the outcome variable will depend on variables such as “previous pregnancy” or “drug prescription”, which may also be missing from the registry. Another example is the accidental omission of, or failure to answer, one of the survey questions.
On the other hand, MCAR data are not related to any of the measured variables, nor to any known or unknown factor that may influence the variable. As the name implies, losses occur totally at random, although this rarely happens. The assumption that losses are completely random is difficult to prove, because they can always be due to an overlooked variable that has an unknown effect on the outcome variable.
Lastly, MNAR are due to a certain cause, usually unobserved. For instance, if trial participants miss an intermediate visit due to forgetfulness, lack of data at that visit may be random. But if they miss that visit because they are sick as a result of the trial intervention, the missing data cannot be considered random.
MAR and MCAR data can be ignored, but always at the risk of introducing bias. MNAR data, however, must never be ignored. Doing so will always lead to biased estimates, compromising the internal and external validity of our results.
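To see what ignoring each type of loss can do, here is a minimal simulation sketch. Everything in it is an assumption made up for illustration: the variables x and y, the missingness probabilities and the function name simulate. It simply compares the true mean of y with the mean of the values that remain observed under each mechanism.

```python
import random
import statistics

def simulate(mechanism, n=20000, seed=0):
    """Return (true_mean, observed_mean) of y under one missingness mechanism."""
    rng = random.Random(seed)
    all_y, observed_y = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)            # covariate that is always observed
        y = 2 * x + rng.gauss(0, 1)    # outcome that may go missing
        if mechanism == "MCAR":
            p_missing = 0.3                      # same probability for everyone
        elif mechanism == "MAR":
            p_missing = 0.6 if x > 0 else 0.1    # depends only on the observed x
        elif mechanism == "MNAR":
            p_missing = 0.6 if y > 0 else 0.1    # depends on the unseen y itself
        else:
            raise ValueError(mechanism)
        all_y.append(y)
        if rng.random() >= p_missing:
            observed_y.append(y)
    return statistics.fmean(all_y), statistics.fmean(observed_y)

results = {m: simulate(m) for m in ("MCAR", "MAR", "MNAR")}
for m, (true_mean, observed_mean) in results.items():
    print(f"{m}: true mean {true_mean:.2f}, mean of observed values {observed_mean:.2f}")
```

In this toy setup the observed mean stays close to the true one only under MCAR. Under MAR the naive mean is also off, but the observed x carries the information needed to correct it, which is not the case under MNAR.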
So, how can we deal with missing data? Ideally, of course, we should avoid losing data in the first place, for which we must be careful in designing the study, especially the data collection phase. But however careful we are, we will rarely end up without any missing data. In this situation we have two options: ignore them or invent them.
We can ignore them and do a complete-case analysis, using only the participants with no missing data. The problem is that we lose all the information from participants with any missing data, besides running the risk of introducing bias. We have also said that this practice is strongly discouraged when dealing with MNAR data. In these cases, losses should be analyzed and explained.
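A tiny sketch of how much information the complete-case approach can throw away. The records and the function name complete_cases are invented for the example; None marks an empty cell.

```python
# Hypothetical survey records; None marks a missing answer.
records = [
    {"age": 34, "weight": 70.5, "smoker": "no"},
    {"age": None, "weight": 82.0, "smoker": "yes"},
    {"age": 51, "weight": None, "smoker": "no"},
    {"age": 29, "weight": 64.2, "smoker": None},
    {"age": 45, "weight": 77.3, "smoker": "yes"},
]

def complete_cases(rows):
    """Keep only the rows in which no variable is missing (listwise deletion)."""
    return [row for row in rows if all(value is not None for value in row.values())]

kept = complete_cases(records)
lost = len(records) - len(kept)
print(f"kept {len(kept)} of {len(records)} participants, lost {lost}")
```

Only 3 of the 15 cells are empty, yet 3 of the 5 participants are dropped, because a single missing value is enough to discard the whole record.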
The other option is to invent them but, as that sounds really bad, we use the verb to impute instead. There are two kinds of data imputation techniques: simple (also called single) and multiple.
Among the simple imputation techniques are the unconditional mean method, the conditional mean method for grouped data, imputation with dummy variables, imputation using a conditional distribution (hot-deck method), the last observation carried forward method, the cold-deck method and the regression imputation method.
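Three of these simple techniques are easy to sketch in a few lines. The function names and the data below are hypothetical, and real statistical packages offer more careful implementations; this is only meant to show what each method puts in the empty cells.

```python
import statistics

def mean_impute(values):
    """Unconditional mean: fill every gap with the overall observed mean."""
    mean = statistics.fmean(v for v in values if v is not None)
    return [mean if v is None else v for v in values]

def group_mean_impute(values, groups):
    """Conditional mean: fill each gap with the observed mean of its own group.

    Assumes every group has at least one observed value.
    """
    by_group = {}
    for v, g in zip(values, groups):
        if v is not None:
            by_group.setdefault(g, []).append(v)
    means = {g: statistics.fmean(vs) for g, vs in by_group.items()}
    return [means[g] if v is None else v for v, g in zip(values, groups)]

def locf(values):
    """Last observation carried forward: reuse the previous observed value."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)  # stays None if nothing has been observed yet
    return filled

# Hypothetical data: weights of six participants and their sex.
weights = [60.0, None, 80.0, 90.0, None, 70.0]
sex = ["F", "F", "M", "M", "M", "F"]
# Hypothetical repeated measurements of one participant over five visits.
visits = [120, None, 118, None, None]

by_mean = mean_impute(weights)
by_group = group_mean_impute(weights, sex)
carried = locf(visits)
print(by_mean)
print(by_group)
print(carried)
```

Note that all three fill each gap with a single value and then treat it as if it had really been observed, so none of them reflects the uncertainty about what the missing value actually was.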
Most researchers prefer multiple imputation methods, making sure beforehand that missing data are at random, which can sometimes be tricky, as we mentioned previously. These methods use Monte Carlo simulation and replace each missing value with values obtained from a number of simulations; the optimal number is usually considered to be between 3 and 10. The math involved is complex, although most statistical software packages implement some data imputation algorithm.
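The general idea can be sketched as follows: create several completed copies of the data, analyze each one, and pool the results with Rubin's rules (the pooled estimate is the average of the estimates, and its variance adds the average within-copy variance to the between-copy variance multiplied by 1 + 1/m). This is a deliberately simplified sketch, not what any particular statistical package does. The data are simulated, the function names are made up, and the imputation model (a regression refitted on a bootstrap resample, plus random noise) is just one simple way to generate the copies.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in residuals) / (len(xs) - 2)) ** 0.5
    return a, b, sd

def impute_once(xs, ys, rng):
    """One completed copy of ys: each gap gets a regression prediction plus noise."""
    complete = [(x, y) for x, y in zip(xs, ys) if y is not None]
    # Refit on a bootstrap resample so each copy also reflects model uncertainty.
    sample = [rng.choice(complete) for _ in complete]
    a, b, sd = fit_line([x for x, _ in sample], [y for _, y in sample])
    return [a + b * x + rng.gauss(0, sd) if y is None else y for x, y in zip(xs, ys)]

def pooled_mean(xs, ys, m=5, seed=0):
    """Estimate the mean of y from m imputed copies, pooled with Rubin's rules."""
    rng = random.Random(seed)
    estimates, variances = [], []
    for _ in range(m):
        completed = impute_once(xs, ys, rng)
        estimates.append(statistics.fmean(completed))
        variances.append(statistics.variance(completed) / len(completed))
    within = statistics.fmean(variances)       # average sampling variance
    between = statistics.variance(estimates)   # variance across the m copies
    total = within + (1 + 1 / m) * between
    return statistics.fmean(estimates), total ** 0.5

# Simulated data: y depends on x and goes missing more often when x is positive (MAR).
rng = random.Random(1)
xs = [rng.gauss(0, 1) for _ in range(1000)]
true_ys = [2 * x + rng.gauss(0, 1) for x in xs]
ys = [None if rng.random() < (0.6 if x > 0 else 0.1) else y for x, y in zip(xs, true_ys)]

observed = [y for y in ys if y is not None]
estimate, std_error = pooled_mean(xs, ys, m=5)
print(f"full-data mean     {statistics.fmean(true_ys):.2f}")
print(f"complete-case mean {statistics.fmean(observed):.2f}")
print(f"pooled MI mean     {estimate:.2f} (SE {std_error:.2f})")
```

Here m = 5 was picked from the 3 to 10 range mentioned above. In this toy example the pooled estimate should land much closer to the full-data mean than the complete-case mean does, because the imputation model uses the observed x that drives the losses.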
It is difficult to decide when to use a simple imputation method and when a multiple one. In general, if we are dealing with a complex survey and the number of missing data is not very high, it’s likely that a simple method will reproduce well the characteristics of the subpopulation of interest in which data are missing. However, let’s not succumb to the temptation of taking the easy way: multiple imputation methods are often more suitable for this purpose than simple methods.
To finish with missing data, let’s just say that there are other options besides ignoring or inventing them. For example, when dealing with continuous variables we could use a linear model for repeated measures to analyze results along follow-up. For categorical variables, there are other more sophisticated statistical techniques, such as generalized estimating equations or random-effects generalized linear mixed models. But that’s another story…