The obfuscation factor helps us, when conducting a survey, to assess the bias that can be caused by false answers or lies.
Today you’ll have to let me be a little gross. Gross and piggish, as a matter of fact. The thing is that I’ve recently been mulling over something I’ve noticed many times. I’m sure some of you have noticed it too.
Have you noticed how many drivers (women too, don’t be surprised) take advantage of red lights to pick their noses? Some of them, so help me God, even eat their boogers. Yuck!
However, when I ask people around me, no one admits to doing it, so I wonder why I have the bad luck of running into the piggiest people in the neighborhood whenever I’m stopped at a red light. Of course, the reason might be that the people I ask are too embarrassed to admit to such an unhealthy habit.
It seems that finding out the truth poses a huge problem. Imagine that I want to run a survey. I go to the traffic office, get a list of drivers’ phone numbers and start calling people to ask them: do you pick your nose while stopped at a red light?
Any survey you do can be distorted by four sources of error.
The first one is selection bias, which appears when you choose the wrong respondents. If I only call people from preppy neighborhoods, most of them will answer “no” (and not because they don’t do it, but because they will have qualms about confessing the truth). The second source of error is non-response: many respondents will hang up the phone without answering, sending their regards to my family while they’re at it. The third source is recall bias, which means that the respondent says he or she doesn’t remember the answer to my question. I think this would hardly apply to our example. What we will find a lot of in our survey is the fourth source of error: lying.
The people at the Finance Ministry know this well. They are very used to people trying to cheat them. If they call you and ask whether you’ve ever cheated on your taxes, what will you answer?
But can we do something to get rid of lying? Well, short of asking the questions in person after giving the respondents truth serum, we can’t get rid of it completely, but we can minimize it a lot with a little trick.
Let’s say I propose the following game to my telephone respondents: roll a die and, if you get a one or a two, tell me that you pick your nose, even if that is a lie and you don’t. Otherwise, if the die shows three to six, tell me the truth. In either case, you never tell me which number you rolled.
Thus, the person I’m asking will understand that I cannot know whether he or she is telling the truth or lying, and so will be less likely to lie. This protection of the respondent’s privacy means that I cannot know the true answer of each individual respondent but, in return, I can know the aggregate behavior of the sample, although always with some uncertainty. How do we do it? Let’s develop our example.
First, we’re going to think about who will answer “yes”. On the one hand, those who roll a one or a two will answer “yes”. We call p the probability of this event (2/6 in our example). If I ask n people, I can expect a total of n times p of them to answer this way (this is the expected number of successes in a series of n trials according to the binomial distribution).
On the other hand, people who roll three to six and do pick their noses at red lights will also answer “yes”. Their expected number is n (total respondents) multiplied by the probability of that outcome of the die (1-p, 4/6 in our example) and multiplied by the probability of practicing such a dirty habit (its prevalence, Pr, which is precisely what we want to know).
So if we add both numbers of “yes” answers, forced and truthful, we come up with the following formula, where m is the number of people who answer that they pick their noses:
m = np + n(1-p)Pr
And now we can solve for Pr using our broad knowledge of algebra (subtract np from both sides and divide by n(1-p)):
Pr = [(m/n) - p] / (1 - p)
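If you don’t trust the algebra, you can check it with a quick simulation. The following is only a minimal sketch in Python: the function names, the sample size and the “true” prevalence of 0.43 are made up for illustration, and it assumes that every respondent follows the rules of the game and answers honestly when the die tells them to.

```python
import random


def simulate_survey(n, p, true_prevalence, seed=0):
    """Play the dice game with n respondents and count the "yes" answers.

    Assumption: everybody follows the rules, so a respondent says "yes"
    either because the die forces it (probability p) or because it is true.
    """
    rng = random.Random(seed)
    yes = 0
    for _ in range(n):
        forced_yes = rng.random() < p              # the die shows one or two
        has_habit = rng.random() < true_prevalence  # the respondent's real behavior
        if forced_yes or has_habit:
            yes += 1
    return yes


def estimate_prevalence(m, n, p):
    """Pr = [(m/n) - p] / (1 - p)"""
    return ((m / n) - p) / (1 - p)


n, p = 100_000, 2 / 6
m = simulate_survey(n, p, true_prevalence=0.43)
print(round(estimate_prevalence(m, n, p), 3))
```

With a sample this large, the estimate should land very close to the prevalence we put into the simulation, even though no individual answer can be trusted.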
Suppose we surveyed 100 individuals and 62 of them answered “yes”. How many of them actually pick their noses regularly? Substituting the values in our formula (m=62, n=100, p=2/6) yields a figure of 0.43. That means that an estimated 43% of people take advantage of red lights to do some mining. And the real figure is probably higher, because some people will still lie despite our clever ruse.
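For those who want to check the arithmetic, here it is as a short Python sketch. The function name is just an illustrative choice, and exact fractions are used so that no rounding gets in the way.

```python
from fractions import Fraction


def estimate_prevalence(m, n, p):
    """Pr = [(m/n) - p] / (1 - p), computed with exact fractions."""
    return (Fraction(m, n) - p) / (1 - p)


# 62 "yes" answers out of 100 respondents, forced "yes" with probability 2/6
pr = estimate_prevalence(62, 100, Fraction(2, 6))
print(pr, float(pr))  # prints: 43/100 0.43
```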
p is usually called the obfuscation factor, and we can set its value using dice, coins or whatever. But be careful when you choose its value. If it is too large, the respondent will feel more confident about answering honestly, but the uncertainty in our calculation will be higher. On the other hand, the smaller the p, the more afraid the respondent will be that we can link him to his real answer, so he will be prone to lie through his teeth. As always, virtue lies in the middle.
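To get a feel for the cost of a large p, we can look at the standard error of the estimate. The sketch below is not part of the recipe above: it assumes that everybody plays by the rules, so that the number of “yes” answers is binomial with probability p + (1-p)Pr, and it divides the usual binomial standard error of m/n by (1-p), as the formula for Pr does. The function name and the chosen values of p are illustrative.

```python
import math


def standard_error(n, p, prevalence):
    """Approximate standard error of the estimated prevalence.

    Assumption: respondents follow the rules, so the proportion of "yes"
    answers is binomial with probability p + (1 - p) * prevalence.
    """
    yes_rate = p + (1 - p) * prevalence
    return math.sqrt(yes_rate * (1 - yes_rate) / n) / (1 - p)


# Same survey size and prevalence as in the example, different dice rules
for p in (1 / 6, 2 / 6, 3 / 6, 4 / 6):
    se = standard_error(100, p, 0.43)
    print(f"p = {p:.2f}  standard error = {se:.3f}")
```

The standard error grows as p grows, which is the price we pay for making respondents feel safer.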
Those of you who haven’t run off to vomit by now have seen how we used binomial probability to address such a disgusting issue. By the way, if you think about it, what we have done resembles calculating the prevalence of a disease in a population when we know the sensitivity and specificity of a diagnostic test. But that’s another story…