## The oddities of small towns

I remember that when I was a child at school, almost everyone had a village to go to during the holidays. Of course, those were other times: most children’s parents had recently moved to the city, so almost everyone had “his village”. Now things are different. Most schoolchildren were born where they live, so it’s almost frowned upon to be a rube.

However, small towns have many interesting things going for them. For example, they are usually more peaceful places to live, with a healthier lifestyle. But, although few people know it, small towns are haunted by chance: they are easy prey for a thing called the law of small numbers. Do you know what I’m talking about? Let’s try to explain it with an example.

When I worked as a resident there was a village, whose name I won’t mention so as not to offend anyone, where almost all our patients with rare diseases came from. In our ignorance, we even speculated that the abundant slate around the village was radioactive and was to blame for the apparently high incidence of strange pathology among its inhabitants. However, the explanation was much simpler and didn’t require any conspiracy theory. The blame lay with small numbers.

Let’s assume that the risk of suffering from fildulastrosis is one in a thousand (prevalence Pv = 0.001). As we all know, this genetic disease is caused by a mutation that occurs totally at random, so having or not having the disease can be treated as a Bernoulli event, and the number of cases in a town follows a binomial probability distribution.

According to the prevalence we have chosen, if we toured the villages we would expect to find one case of fildulastrosis per 1,000 inhabitants. If we got to a town of 5,000 inhabitants and it had only one case instead of five, what would we say? Surely we would think that we were beholding one more of the benefits of country life: much healthier, less stressful, and in contact with nature.

And what if we got to an even smaller one, say 1,000 inhabitants, and saw that there were four sick people? Following reasoning just as foolish as the above, we would think we were beholding one of the drawbacks of country life, with fewer health resources and contact with farm animals and other filthy stuff of wildlife.

But we would be wrong in both cases. Living in the countryside is not the cause of more or fewer people getting sick. Let’s see what happens in these villages.

If there are 1,000 people we expect to find one case of fildulastrosis (Pv = 0.001). In fact, if we use a binomial probability calculator, the probability of there being at least one patient is 63%. But if we play around with the calculator, we can see that the probability of there being two or more is 26%, three or more 8%, and four or more 2%. You see, the prevalence at least doubles in one out of four villages of 1,000 inhabitants, and at least triples in roughly one out of twelve, by chance alone. Now suppose the town has 10,000 inhabitants. The expected number of cases is 10 (the probability of finding at least 10 is 54%). However, the probability of there being at least 20 cases falls to 0.3%, and the probability of 30 or more is close to zero. This means that chance is much more whimsical with small villages. Large samples are more precise, and it is harder to find extreme values in them just by chance.

What about the other example? It is the same story: the small sample is less precise and more likely to drift towards extreme values just by chance. As the first village has 5,000 inhabitants, we expect to find five cases of fildulastrosis (the probability of finding five or fewer is about 62%). If we use the calculator, we see that the probability of there being four or fewer is 44%, three or fewer 26%, and two or fewer 12%. This means that in one out of eight villages of 5,000 inhabitants the prevalence drops to 0.0004 or below just by chance. What would happen with a larger town, say 10,000 inhabitants? We would find 10 cases or fewer with a probability of 58%, but the probability of the prevalence dropping to 0.0004 (four cases or fewer) falls to 3%. And if you do the calculation for a city of 100,000 inhabitants you’ll see that the probability of the prevalence falling by half is practically zero.
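If you don’t have a binomial calculator at hand, the tail probabilities quoted above can be checked with a few lines of Python. This is only a minimal sketch: the function names (`binom_pmf`, `prob_at_most`, `prob_at_least`) are my own, and it assumes that every inhabitant gets the disease independently with the same prevalence.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k cases among n inhabitants with prevalence p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_at_most(k: int, n: int, p: float) -> float:
    """P(X <= k): probability of finding k cases or fewer."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))


def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k), computed as the complement of P(X <= k - 1)."""
    return 1.0 - prob_at_most(k - 1, n, p)


PREVALENCE = 0.001  # assumed prevalence of the made-up disease

# Upper tail: at least twice the expected number of cases.
for inhabitants in (1_000, 10_000):
    expected = round(inhabitants * PREVALENCE)
    print(inhabitants, prob_at_least(2 * expected, inhabitants, PREVALENCE))

# Lower tail: an observed prevalence of 0.0004 or less.
for inhabitants in (5_000, 10_000):
    cases = round(inhabitants * 0.0004)
    print(inhabitants, prob_at_most(cases, inhabitants, PREVALENCE))
```

In both loops the probability of the extreme result shrinks as the town grows, which is the law of small numbers at work.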

The law of small numbers works both ways. We will no longer have to come up with absurd explanations when we find a small town with an abnormally high or low prevalence of a known disease. We’ll know it is due to the whim of chance and its law of small numbers.

And here we end for today. I hope no one has gone to Google to find out what kind of disease fildulastrosis is, but if anyone has found it, please explain it to me. The example we have used is very simple, to make it easier to demonstrate the imprecision of small samples. In real life, it is likely that the onset of certain diseases entails an increased risk of disease in relatives, which could further exaggerate the effect we have shown towards the emergence of more extreme values. But that’s another story…

## The cook and his cake

Knowing how to cook is a plus. How good you look when you have guests and you know how to cook properly! But it can happen that it takes you two or three hours to buy the ingredients, you spend a fortune on them, you work another two or three hours in the kitchen… and, in the end, the great dish you were preparing ends up a wreck.

And this may happen even to the best cooks. We can never be sure that our dish will turn out well, even if we have prepared it many times before. So you will understand my cousin’s problem.

As it happens, he’s going to give a party and dessert has fallen to him. He knows how to make a pretty and tasty cake, but it only turns out really good half of the times he tries. So, understandably, he’s very worried about making a fool of himself at the party. Of course, my cousin is very clever and has figured that, if he makes more than one cake, at least one of them will turn out good. But how many cakes does he have to make to get at least one good one?

The problem with this question is that it doesn’t have an exact answer. The more cakes he makes, the more likely it is that one of them turns out good. But, of course, he could make two hundred cakes and have the bad luck that all of them turn out bad. But do not despair: although we cannot give a number with absolute certainty, we can measure the probability of getting by with a certain number of cakes. Let’s see how.

We are going to picture the probability distribution, which is just the set of all the possible outcomes together with the probability of each. For example, if my cousin makes one cake, it can turn out good (G) or bad (B), each with a probability of 0.5. You can see it represented in Figure A. He’ll have a 50% chance of success.

If he makes two cakes, he may get one good cake, two, or none. The possible combinations are GG, GB, BG, and BB. The chance of getting exactly one good cake is 0.5 and the chance of getting two is 0.25, so the probability of getting at least one good cake is 0.75 or 75% (3/4). It’s represented in Figure B. We see that his prospects have improved, but there is still much room for failure. If he makes three cakes, the options will be GGG, GGB, GBG, GBB, BGB, BGG, BBG, and BBB. The situation keeps improving: he has an 87.5% (7/8) probability of getting at least one good cake. We represent it in Figure C.
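The counting we have just done by hand can be delegated to the computer. Here is a small sketch that lists every possible sequence of good (G) and bad (B) cakes and counts those with at least one G. The function name `prob_at_least_one_good` is made up for the occasion, and the counting only works because both results are assumed equally likely (p = 0.5).

```python
from itertools import product


def prob_at_least_one_good(cakes: int) -> float:
    """Enumerate every G/B sequence and count those with at least one G.

    Assumes each cake is good with probability 0.5, so all the
    sequences are equally likely.
    """
    outcomes = list(product("GB", repeat=cakes))
    favourable = [outcome for outcome in outcomes if "G" in outcome]
    return len(favourable) / len(outcomes)


for cakes in (1, 2, 3):
    print(cakes, prob_at_least_one_good(cakes))
```

For one, two and three cakes it prints 0.5, 0.75 and 0.875. The list of sequences doubles with every extra cake, so this brute-force approach soon becomes as tiresome for the machine as it is for us.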

And what if he makes four cakes, or five, or…? The issue becomes a pain in the ass: it’s increasingly difficult to picture all the possible combinations. What can we do? Well, we can think a little.

If we look at the graphs, each bar represents the probability of one of the possible outcomes. As the number of possibilities, and therefore of vertical bars, increases, the bars begin to take on a bell shape, conforming to a well-known probability distribution: the binomial distribution.

People who know about this stuff call experiments with only two possible outcomes (dichotomous ones) Bernoulli experiments, like flipping a coin (heads or tails) or baking our cakes (good or bad). The binomial distribution, in turn, gives the probability of each number of successes (k) in a series of Bernoulli experiments (n), each with a certain probability of success (p).

In our case the probability is p = 0.5, and we can calculate the probability of getting k good cakes when repeating the experiment (baking a cake) n times using the following formula:

$$P(k, n) = \binom{n}{k} p^{k} (1 - p)^{n - k}$$

If we replace p by 0.5 (the probability that a cake comes out good), we can play with different values of n to obtain the probability of getting at least one good cake (k ≥ 1).

If he makes four cakes, the probability of having at least one good one is 93.75%, and if he makes five the probability increases to 96.88%, a reasonable probability for what we are dealing with. I believe that if my cousin makes five cakes it will be very hard for him to ruin his party.
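These two figures are easiest to obtain through the complement: the only way to end up without a good cake is for all n of them to turn out bad. A short derivation, where writing the result as P(k ≥ 1) is my own shorthand:

$$
P(k \ge 1) = 1 - P(k = 0) = 1 - (1 - p)^{n}
$$

With p = 0.5, four cakes give 1 - 1/16 = 0.9375 and five cakes give 1 - 1/32 = 0.96875.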

We could also rearrange the formula and calculate the reverse: given a value of P(k,n), get the number of attempts needed. You can also calculate all these things without the formula, using any of the probability calculators available online.
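As an illustration of that reverse calculation, here is a sketch that looks for the smallest number of cakes that reaches a desired probability of at least one success. The names `prob_at_least_one` and `cakes_needed` are hypothetical, and instead of solving the equation with logarithms the sketch simply tries n = 1, 2, 3… until the target is met.

```python
def prob_at_least_one(n: int, p: float = 0.5) -> float:
    """Probability of at least one good cake out of n independent attempts."""
    return 1.0 - (1.0 - p) ** n


def cakes_needed(target: float, p: float = 0.5) -> int:
    """Smallest n whose probability of at least one good cake reaches target."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if not 0.0 <= target < 1.0:
        raise ValueError("target must be in [0, 1): certainty is never reached")
    n = 1
    while prob_at_least_one(n, p) < target:
        n += 1
    return n


print(cakes_needed(0.95))  # cakes needed to be at least 95% safe
print(cakes_needed(0.99))  # cakes needed to be at least 99% safe
```

With p = 0.5 the first call returns 5 and the second returns 7. Asking for a target of exactly 1 is rejected, since no number of cakes gives absolute certainty.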

And this is the end of this tasty post. There are, as you can imagine, more types of probability distributions, both discrete, like the binomial, and continuous, like the normal distribution, the most famous of them all. But that’s another story…

## The tails of p

Forgive me, my friends from the other side of the Atlantic, but I am not thinking about the kind of tails that many perverse minds are. Far from it: today we’re going to talk about much more boring tails, but ones that are very important if we want to do hypothesis testing. And, as usual, we will illustrate the point with an example to understand it better.

Let’s suppose we take a coin and, armed with infinite patience, toss it 1,000 times, getting heads 560 times. We all know that the probability of getting heads is 0.5, so if we toss the coin 1,000 times we expect to get 500 heads on average. But we’ve got 560, so two possibilities come to mind immediately.

First, the coin is fair and we’ve got 60 more heads just by chance. This will be our null hypothesis (H0), which says that the probability of getting heads [P(heads)] is equal to 0.5. Second, our coin is not fair, but loaded to produce more heads. This will be our alternative hypothesis (Ha), which states that P(heads) > 0.5.

Well, let’s do a hypothesis test using one of the binomial probability calculators available on the Internet. Assuming the null hypothesis that the coin is fair, the probability of obtaining 560 heads or more is 0.008%. As this is lower than 5%, we reject our null hypothesis: the coin is loaded.

Now, if you look closely, the alternative hypothesis has a direction, P(heads) > 0.5, but we could have hypothesized that the coin was not fair without presupposing whether it was loaded in favor of heads or tails: P(heads) not equal to 0.5. In this case we would calculate the probability of getting a number of heads 60 or more above or below 500, in both directions. This probability is 0.016%, so we’d reject our null hypothesis and conclude that the coin is not fair. The problem is that the test doesn’t tell us in which direction it’s loaded but, in view of the results, we assume it favors heads.

In the first example we did a one-tailed test, while in the second we did a two-tailed test. In the figure, you can see the probability areas for both tests. In the one-tailed test, the small red area on the right represents the probability of getting, just by chance, a result at least as far above the expected value as the one observed, if the coin is fair. In the two-tailed test, this area is doubled and located on both sides of the probability distribution. Notice that the two-tailed p-value is double the one-tailed one. In our example, both p-values are so low that we can reject the null hypothesis in either case. But this is not always so, and there may be occasions when the researcher chooses to do a one-tailed test to reach a statistical significance that is out of reach with the two-tailed test.

And I say one of the two tails because we have calculated the right-tail probability, but we could have calculated that of the left tail. Consider the unlikely event that, even though the coin is loaded in favor of tails, we got more heads just by chance. Our Ha now says that P(heads) < 0.5. In this case we’d calculate the probability that, under the null hypothesis that the coin is fair, we get 560 heads or fewer. This p-value is above 99.9%, so we cannot reject our null hypothesis that the coin is fair.
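The three p-values of this example (right tail, both tails and left tail) can be reproduced without an online calculator. The following is just a sketch under the null hypothesis of a fair coin: the helper `prob_heads_between` is a name of my own, and it counts favourable sequences with exact integers so that the enormous binomial coefficients don’t lose precision.

```python
from math import comb

TOSSES = 1000
HEADS = 560


def prob_heads_between(low: int, high: int, tosses: int = TOSSES) -> float:
    """P(low <= X <= high) for a fair coin, using exact integer counts."""
    favourable = sum(comb(tosses, k) for k in range(low, high + 1))
    return favourable / 2**tosses


excess = HEADS - TOSSES // 2  # 60 heads more than the 500 expected

# Ha: P(heads) > 0.5, right tail: 560 heads or more.
p_right = prob_heads_between(HEADS, TOSSES)

# Ha: P(heads) != 0.5, both tails: 440 heads or fewer, or 560 or more.
p_two_sided = prob_heads_between(0, TOSSES // 2 - excess) + p_right

# Ha: P(heads) < 0.5, left tail: 560 heads or fewer.
p_left = prob_heads_between(0, HEADS)

print(p_right, p_two_sided, p_left)
```

Because the distribution of a fair coin is symmetric, the two-tailed value comes out as exactly twice the right-tail value. That would not hold for a null hypothesis other than 0.5.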

But what is going on here, you’ll ask. The first hypothesis test we did allowed us to reject the null hypothesis and the last one says otherwise. With the same coin and the same data, shouldn’t we have reached the same conclusion? As it turns out, no. Remember that failing to reject the null hypothesis is not the same as concluding that it is true, something we can never be sure of. In the last example, the null hypothesis that the coin is fair is a better option than the alternative that it is loaded in favor of tails. However, that does not mean we can conclude that the coin is fair.

You see, therefore, how important it is to be clear about the meaning of the null and alternative hypotheses when doing a hypothesis test. And always remember that failing to reject the null hypothesis doesn’t necessarily imply that it is true. It could just be that we don’t have enough power to reject it. This leads me to think about type I and type II errors and their relationship with power and sample size. But that’s another story…

## Do not gamble

Have you been to Las Vegas? It’s an interesting city to visit. Once. Twice, tops. Casinos are an amazing thing, with everyone playing like crazy in the hope of getting rich with little effort.

But who do you think pays for everything you see in Las Vegas? You guessed it: those who gamble. The house never loses. Take my advice and don’t risk your money in casinos, because the likelihood of winning is rather small and, even if you win, you most likely won’t win much money. Of course, this may not be true if you bet large amounts of money, but if you have that much money you have no need to bet to get rich.

Let’s see with an example how difficult it is to become a millionaire by this method. Think about one of the possible bets you can place at roulette: the street bet or three-number bet. For those of you who have never played, our wheel has 38 numbers.

In this bet, we put our chips on three numbers in the same row and the wheel spins. Suppose we bet one euro on every play. The street bet pays 11 to 1, which means that if the ball lands on any of our three numbers the house gives us back our euro and 11 more. But if the ball lands on any of the other 35 numbers, we lose our euro.

So the probability of winning is p = 3/38 and that of losing is q = 35/38. Let’s consider first what the theoretical net gain of each play is: 11 euros times the probability of winning, minus one euro times the probability of losing:

Net gain = (3/38 x 11) – 35/38 = -0.0526 €

This means that, on average, we’ll lose a little more than five cents on every play. And what if we play 300 times in a row? Can we get rich then?

Not then either, because the expected gain is the average profit of each play multiplied by the total number of bets, i.e. -0.0526 x 300 = -15.78 €. So why on earth do people gamble if the more times you bet the higher your expected loss? Precisely because it’s an expected amount, but the number of times you win or lose follows a binomial distribution, so there will be lucky people with small losses or even profits, but also unfortunate ones who lose much more than expected.
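The expected gain is simple enough to check with exact fractions. This is a sketch with names of my own choosing (`WIN_PROB`, `expected_gain`), assuming the 38-number wheel, the one-euro stake and the 11 to 1 payout of our example.

```python
from fractions import Fraction

WIN_PROB = Fraction(3, 38)   # three winning numbers out of 38
LOSE_PROB = 1 - WIN_PROB     # the remaining 35 numbers
PAYOUT = 11                  # euros won per euro staked on a street bet
STAKE = 1                    # euros lost when the bet fails


def expected_gain(bets: int) -> Fraction:
    """Expected net gain in euros after the given number of bets."""
    per_bet = WIN_PROB * PAYOUT - LOSE_PROB * STAKE
    return per_bet * bets


print(expected_gain(1), float(expected_gain(1)))
print(expected_gain(300), float(expected_gain(300)))
```

Working with fractions, the expected gain per bet is exactly -1/19 of a euro, and after 300 bets it is -300/19, about -15.79 euros. The cent of difference with the figure in the text comes from rounding the per-play value to -0.0526 before multiplying.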

The next thing you may be wondering is what our chances of winning are if we play three hundred times in a row. Let’s find out.

Let’s call W the number of times we win and G our net profit after 300 bets. The net gain is the number of times we win multiplied by 11 (remember that the bet pays eleven to one) minus the number of times we don’t win (and lose one euro). In turn, the number of times we lose equals the number of bets minus the number of times we win. Thus:

G = 11 W + (-1)(300 – W) = 12 W – 300

If we win, our net gain G must be greater than zero. If we put it in the above equation:

12W – 300 > 0

We have

W > 300/12

W > 25

This means that, to avoid losing money, we have to win a minimum of 25 times out of the 300 we play (with exactly 25 wins we break even, and from 26 onwards we make a profit). Is 25 a lot or a little? To me, honestly, it seems a lot, but we can calculate the probability of winning at least 25 times.
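The relationship between wins and net gain can be written as a tiny function, which also lets us ask how many wins a given profit requires. The names `net_gain` and `wins_needed` are illustrative, and the defaults are the ones of our example (300 bets of one euro paying 11 to 1).

```python
def net_gain(wins: int, bets: int = 300, payout: int = 11, stake: int = 1) -> int:
    """Net gain in euros: each win pays `payout` stakes, each loss costs one."""
    losses = bets - wins
    return payout * stake * wins - stake * losses


def wins_needed(target: int, bets: int = 300) -> int:
    """Smallest number of wins whose net gain is at least `target` euros."""
    for wins in range(bets + 1):
        if net_gain(wins, bets) >= target:
            return wins
    raise ValueError("target cannot be reached with this number of bets")


print(wins_needed(0))    # wins needed to avoid losing money
print(wins_needed(1))    # wins needed to make any profit at all
print(wins_needed(100))  # wins needed to go home with 100 euros or more
```

It prints 25, 26 and 34: with 25 wins we end up exactly where we started, the first real profit (12 euros) comes with the 26th win, and 34 wins leave us 108 euros up.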

We have already said that the number of wins at the wheel follows a binomial probability distribution:

$$P(k) = \binom{n}{k} p^{k} q^{n-k}$$

Here n is the number of bets, k is the number of successes, p the probability of winning and q, or (1-p), the probability of losing. If we plug in our data (n = 300, p = 3/38) and add up the probabilities for every k from 25 to 300, we get the probability of winning at least 25 times. The problem is that the numbers become so large that they are hard to deal with, so I would advise you to use a statistical program or any calculator available on the Internet. I’ve done it and I’ve come up with a probability of 42%.

Good!, some of you may think. A 42% chance of winning is not so bad. But think for a moment that what is not bad, at least from the casino’s point of view, is the 58% chance we have of losing. And besides, 42% is the overall probability of not losing, however small the profit. If you calculate the number of bets you have to win to get a net gain of 100 euros, you will see that it is at least 34, and the probability of winning 34 or more bets out of 300 goes down to 2.2%.
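For those who prefer a few lines of code to an online calculator, here is a sketch of the same calculation. The function name `prob_wins_at_least` is my own, and the probability of winning k times or more is obtained as one minus the probability of winning fewer than k times.

```python
from math import comb

BETS = 300
WIN_PROB = 3 / 38


def prob_wins_at_least(k: int, n: int = BETS, p: float = WIN_PROB) -> float:
    """P(W >= k): one minus the probability of winning fewer than k bets."""
    below = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below


print(prob_wins_at_least(25))  # at least break even
print(prob_wins_at_least(26))  # make some profit
print(prob_wins_at_least(34))  # win 100 euros or more
```

The first and third values should land close to the 42% and 2.2% mentioned above, and the second one shows how much of that 42% corresponds to merely breaking even.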

To end this playful post, let me just tell you that if you don’t have a binomial probability calculator, you can get an approximation using the normal distribution. You’d have to calculate the average profit and its standard error and, using both of them and your desired net gain, calculate the standardized z-value to estimate its probability. But that’s another story…
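Just to give a taste of that other story, this is a rough sketch of the normal approximation. Everything in it is an assumption of mine rather than a recipe from this post: the names, the use of `math.erf` to get the normal cumulative probability, and the optional `correction` parameter, a continuity correction that shifts the target by a few euros to account for the fact that the net gain can only move in steps of 12.

```python
from math import erf, sqrt

WIN_PROB = 3 / 38
PAYOUT = 11.0  # euros won on a successful one-euro street bet
STAKE = 1.0    # euros lost otherwise


def normal_cdf(z: float) -> float:
    """Cumulative probability of the standard normal distribution."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def prob_gain_at_least(target: float, bets: int = 300, correction: float = 0.0) -> float:
    """Normal approximation to the probability of a net gain >= target euros."""
    mean_bet = WIN_PROB * PAYOUT - (1 - WIN_PROB) * STAKE
    second_moment = WIN_PROB * PAYOUT**2 + (1 - WIN_PROB) * STAKE**2
    sd_bet = sqrt(second_moment - mean_bet**2)
    mean_total = bets * mean_bet       # expected net gain after all the bets
    sd_total = sd_bet * sqrt(bets)     # its standard deviation
    z = (target - correction - mean_total) / sd_total
    return 1.0 - normal_cdf(z)


print(prob_gain_at_least(0))                  # plain approximation
print(prob_gain_at_least(0, correction=6.0))  # with continuity correction
```

Passing `correction=6.0` (half of the 12-euro step between consecutive results) brings the approximation noticeably closer to the exact binomial figure than the plain version.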

## Lie to me

Today you are going to let me be a little dirty. Dirty and piggish, as a matter of fact. The thing is that I’ve recently been mulling over something that I’ve noticed many times. I’m sure some of you have noticed it too.

Have you noticed how many drivers (women too, don’t be surprised) take advantage of red lights to pick their boogers? Some of them, so help me God, even eat them. Yuck!

However, if I ask people around me, no one admits to doing it, so it intrigues me why I have the bad luck of always running into the most piggish part of the neighborhood while I’m stopped at a red light. Of course, the reason might be that the people I ask feel too embarrassed to admit that they practice such an unhealthy habit.

It seems that finding out the truth poses a huge problem. Imagine that I want to run a survey. I go to the traffic office, get a list of drivers’ phone numbers and start calling people to ask them: do you pick your boogers while stopped at a red light?

Any survey can be distorted by four sources of error. The first one is selection bias, when you make a poor choice of respondents. If I only call people from preppy neighborhoods, most of them will answer “no” (and not because they don’t do it, but because they will have qualms about confessing the truth). The second source of error is non-response: many respondents will hang up the phone without answering, sending their regards to my family on the way. The third source is recall bias, which means that the respondent says he or she doesn’t remember the answer to my question. I think this would hardly apply to our example. What we will find a lot of in our survey is the fourth source of error: lying.

This fact is well known to the people at the tax office. They are very used to people trying to cheat them. If they call you asking whether you’ve ever cheated on your taxes, what will you answer?

But can we do something to get rid of lies? Well, short of asking the questions in person after giving the respondents truth serum, we can’t get rid of them completely, but we can minimize them a lot with a little trick.

Suppose I propose the following game to my telephone respondents: roll a die and, if you get a one or a two, tell me that you pick your boogers, even if it is a lie and you don’t. If, on the other hand, the die shows three to six, tell me the truth. In any case, never tell me the number you rolled.

This way, the person I’m asking will understand that I cannot know whether he or she is telling the truth and so will be less likely to lie. This protection of the respondents’ privacy means that I cannot know the true answer of any individual respondent but, in return, I can know the aggregate behavior of the sample, although always with some uncertainty. How do we do it? Let’s develop our example.

First, let’s think about who will answer “yes”. On the one hand, those who roll a one or a two will answer “yes”. We call p the probability of this event (2/6 in our example). If I ask n people, we expect a total of n times p of them (this is the expected number of hits in a series of trials according to binomial probability theory).

On the other hand, people who roll three to six and do pick their boogers at red lights will also answer “yes”. Their number is n (total respondents) multiplied by the probability of that outcome of the die (1-p, 4/6 in our example) and multiplied by the probability of practicing such a dirty habit (its prevalence, Pr, which is precisely what we want to know).

So if we add both numbers of “yes” answers, forced and truthful, we come up with the following formula, where m is the number of people who answer that they pick their boogers:

m = np + n(1-p)Pr

And now we can solve for Pr using our broad knowledge of algebra:

Pr = [(m/n) - p] / (1 - p)

Suppose we surveyed 100 individuals and 62 of them answered “yes”. How many of them actually pick their boogers regularly? Substituting the values in our formula (m = 62, n = 100, p = 2/6) yields a figure of 0.43. That means that about 43% of people take advantage of red lights to do some mining work. And the real figure is probably higher, because some people will still lie despite our clever ruse.
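The formula fits in a one-line function. A minimal sketch follows, with parameter names that are mine: `forced_yes_prob` is the p of the text, the probability that the die forces a “yes”.

```python
def estimate_prevalence(yes_answers: int, respondents: int, forced_yes_prob: float) -> float:
    """Estimate Pr from m yes answers out of n, given the forced-yes probability p."""
    observed_yes_rate = yes_answers / respondents
    return (observed_yes_rate - forced_yes_prob) / (1.0 - forced_yes_prob)


# The survey of the example: m = 62, n = 100, p = 2/6.
print(estimate_prevalence(62, 100, 2 / 6))
```

Note that if fewer people answer “yes” than the die alone would force, the estimate comes out negative. That can happen by chance in small samples and simply means the data are compatible with a prevalence close to zero.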

p is usually called the obfuscation factor, and we can set its value using dice, coins or whatever. But be careful when you choose its value. If it is too large the respondent will feel more confident to answer honestly, but the uncertainty in our calculation will be higher. On the other hand, the smaller the p, the more afraid the respondent will be that we can link him to his real answer, so he will be prone to lie through his teeth. As always, virtue lies in the middle.
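To see how the uncertainty grows with p, we can compute the standard error of the estimate. This sketch rests on assumptions of my own that go beyond the post: that everybody follows the rules of the game honestly, that the number of “yes” answers is binomial, and that the standard error of the observed proportion is simply divided by (1 - p), as the formula for Pr suggests. The function name is also mine.

```python
from math import sqrt


def standard_error(true_prevalence: float, respondents: int, forced_yes_prob: float) -> float:
    """Standard error of the estimated prevalence for a given obfuscation factor."""
    # Expected proportion of yes answers: forced ones plus truthful ones.
    yes_rate = forced_yes_prob + (1 - forced_yes_prob) * true_prevalence
    # Standard error of that observed proportion (binomial sampling).
    se_yes_rate = sqrt(yes_rate * (1 - yes_rate) / respondents)
    # The estimate divides by (1 - p), so its error is inflated accordingly.
    return se_yes_rate / (1 - forced_yes_prob)


for p in (0.1, 2 / 6, 0.5):
    print(round(p, 2), standard_error(0.43, 100, p))
```

With 100 respondents and a prevalence of 0.43, the standard error is around 0.07 for p = 2/6, and it keeps growing as p increases, which is the price we pay for the extra privacy.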

Those who haven’t run off to vomit by now have seen how we have used binomial probability calculations to address such a disgusting issue. By the way, if you think about it, what we have done resembles calculating the prevalence of a disease in a population when we know the sensitivity and specificity of a diagnostic test. But that’s another story…