I don’t like the end of summer. The days of bad weather begin, I wake up in complete darkness, and in the evening it gets dark earlier and earlier. And, as if this were not bad enough, the cumbersome moment of the change from summer time to winter time is approaching.
In addition to the inconvenience of the change and the tedium of spending two or three days remembering what time it is and what time it would be if there had been no change, we have to adjust a lot of clocks by hand. And, no matter how hard you try to change them all, you always leave some on the old time. It does not happen with the kitchen clock, which you always look at to know how fast you have to eat breakfast, or with the one in the car, which stares at you every morning. But surely there are some that you do not change. It has even happened to me that I only realize it at the next change, when I see that I don’t need to adjust a clock because I left it unchanged the previous time.
These forgotten clocks remind me a little of categorical or qualitative variables.
You will think that, once again, I forgot to take my pill this morning, but no. There is a reason for everything. When we finish a study and already have the results, the first thing we do is describe them, and then we go on to run all kinds of hypothesis tests, if applicable.
Well, qualitative variables are always neglected when we apply our knowledge of descriptive statistics. We usually limit ourselves to classifying them and building frequency tables from which to calculate some indices, such as their relative or cumulative frequency, and to giving some representative measure such as the mode, and little else. We tend to work a little harder on their graphic representation, with bar charts, pie charts, pictograms and other similar inventions. And finally, we make a little more effort when we relate two qualitative variables through a contingency table.
However, we forget their variability, something we would never do with a quantitative variable. Quantitative variables are like that kitchen wall clock that looks us straight in the eye every morning and does not allow us to leave it showing the wrong time. For them, we use concepts we understand very well, such as the mean and the variance or standard deviation. But the fact that we do not know how to objectively measure the variability of qualitative or categorical variables, whether nominal or ordinal, does not mean that there is no way to do it. For this purpose, several diversity indexes have been developed, which some authors classify as dispersion, variability and disparity indexes. Let’s see some of them, whose formulas you can see in the attached box, so you can enjoy the beauty of mathematical language.
The two best-known indexes used to measure variability or diversity are Blau’s index (also called the Hirschman-Herfindahl index) and the entropy index (or Teachman’s index). Both have a very similar meaning and, in fact, are linearly correlated.
Blau’s index quantifies the probability that two individuals chosen at random from a population are in different categories of a variable (provided that the population size is infinite or the sampling is performed with replacement). Its minimum value, zero, indicates that all members are in the same category, so there is no variety. The higher its value, the more spread out the members of the group are among the different categories of the variable. The maximum value is reached when the members are distributed equally among all categories (their relative frequencies are equal). This maximum is (k-1)/k, which is a function of k (the number of categories of the qualitative variable) and not of the population size. It tends to 1 as the number of categories increases (to put it more correctly, when k tends to infinity).
Let’s look at some examples to clarify it a bit. If you look at the Blau’s index formula, the value of the sum of the squares of the relative frequencies in a totally homogeneous population will be 1, so the index will be 0. There will only be one category with frequency 1 (100%) and the rest with zero frequency.
As we have said, even when the subjects are distributed equally among all categories, the index increases as the number of categories increases. For example, if there are four categories with a frequency of 0.25 each, the index will be 0.75 (1 - (4 x 0.25²)). If there are five categories with a frequency of 0.2 each, the index will be 0.8 (1 - (5 x 0.2²)). And so on.
As a practical example, imagine a disease in which there is diversity from the genetic point of view. In city A, 85% of patients have genotype 1 and 15% have genotype 2. Blau’s index is 1 - (0.85² + 0.15²) = 0.255. In view of this result, we can say that, although the population is not homogeneous, the degree of heterogeneity is not very high.
Now imagine a city B with 60% of genotype 1, 25% of genotype 2 and 15% of genotype 3. Blau’s index will be 1 - (0.6² + 0.25² + 0.15²) = 0.555. Clearly, the degree of heterogeneity is greater among the patients of city B than among those of A. The smartest of you will tell me that this was already clear without calculating the index, but you have to take into account that I chose a very simple example so as not to wear myself out with the calculations. In more complex, real-life studies it is not usually so obvious and, in any case, it is always more objective to quantify the measure than to rely on our subjective impression.
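For those who prefer to let the computer do the sums, here is a minimal Python sketch of the calculation for the two cities. It assumes the usual definition of Blau’s index (one minus the sum of the squared relative frequencies); the function name blau_index is just an illustrative choice, not something taken from any library.

```python
def blau_index(proportions):
    """Blau's index: 1 minus the sum of the squared relative frequencies."""
    # Assumption: the input holds relative frequencies that add up to 1.
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("relative frequencies must sum to 1")
    return 1.0 - sum(p * p for p in proportions)


city_a = [0.85, 0.15]          # genotypes 1 and 2
city_b = [0.60, 0.25, 0.15]    # genotypes 1, 2 and 3

print(round(blau_index(city_a), 3))      # 0.255
print(round(blau_index(city_b), 3))      # 0.555
print(round(blau_index([0.25] * 4), 3))  # 0.75, the maximum (k-1)/k for k = 4
```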
This index could also be used to compare the diversity of two different variables (as long as it makes sense to do so), but the fact that its maximum value depends on the number of categories of the variable, and not on the size of the sample or population, calls into question its usefulness for comparing the diversity of variables with different numbers of categories. To avoid this problem, Blau’s index can be normalized by dividing it by its maximum, (k-1)/k, thus obtaining the index of qualitative variation. Its meaning is, of course, the same as that of Blau’s index, and its value ranges between 0 and 1. Thus, we can use either one if we compare the diversity of two variables with the same number of categories, but it will be more correct to use the index of qualitative variation if the variables have different numbers of categories.
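The normalization can be sketched in a few lines of Python. This is only an illustration under two assumptions: that k is taken as the number of categories supplied (including any with zero frequency), and that the function name qualitative_variation_index is a made-up label for this post.

```python
def qualitative_variation_index(proportions):
    """Blau's index divided by its maximum, (k-1)/k, so the result lies in [0, 1]."""
    k = len(proportions)  # assumption: every category of the variable is listed
    if k < 2:
        raise ValueError("at least two categories are needed")
    blau = 1.0 - sum(p * p for p in proportions)
    return blau / ((k - 1) / k)


city_a = [0.85, 0.15]
city_b = [0.60, 0.25, 0.15]

print(round(qualitative_variation_index(city_a), 4))  # 0.51
print(round(qualitative_variation_index(city_b), 4))  # 0.8325
```

With the two cities of the example, the normalized values (0.51 versus about 0.83) can be compared directly even though the variable has two categories in A and three in B.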
The other index, somewhat less famous, is Teachman’s index or entropy index, whose formula is also attached. Very briefly, we will say that its minimum value, which is zero, indicates that there are no differences between the members in the variable of interest (the population is homogeneous). Its maximum value can be calculated as the negative of the natural logarithm of the inverse of the number of categories, -ln(1/k), which is the same as ln(k), and is reached when all categories have the same relative frequency (entropy reaches its maximum value). As you can see, it is very similar to Blau’s index, which is much easier to calculate than Teachman’s.
To end this entry, the third index that I want to talk about today tells us, more than about the variability of the population, about the dispersion of its members with respect to the most frequent value. This can be measured by the variation ratio, which indicates the degree to which the observed values do not coincide with the mode, which is the most frequent category. As with the previous ones, I also show the formula in the attached box.
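As a rough sketch, and assuming the usual definition of the entropy index (minus the sum of each relative frequency multiplied by its natural logarithm, with empty categories contributing zero), the calculation could look like this in Python. The name teachman_index is only illustrative.

```python
import math


def teachman_index(proportions):
    """Entropy index: -sum(p * ln(p)), skipping categories with zero frequency."""
    # Assumption: 0 * ln(0) is treated as 0, the usual convention for entropy.
    return -sum(p * math.log(p) for p in proportions if p > 0)


city_a = [0.85, 0.15]
city_b = [0.60, 0.25, 0.15]

print(round(teachman_index(city_a), 3))  # 0.423
print(round(teachman_index(city_b), 3))  # 0.938

# Maximum for k equally frequent categories: -ln(1/k), that is, ln(k)
k = 3
print(math.isclose(teachman_index([1 / k] * k), math.log(k)))  # True
```

As with Blau’s index, city B comes out as more diverse than city A, and neither reaches its maximum (ln 2 for two categories, ln 3 for three).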
So as not to be out of step with the previous ones, its minimum value is also zero, and it is obtained when all cases coincide with the mode. The lower the value, the less the dispersion. The lower the relative frequency of the mode, the closer the ratio will be to 1, the value that indicates maximum dispersion. I think this index is very simple, so we are not going to devote more attention to it.
And we have reached the end of this post. I hope that from now on we will pay more attention to the descriptive analysis of the results of qualitative variables. Of course, it would be necessary to complete it with an adequate graphic description using the well-known bar charts or pie charts and other less well-known ones such as Pareto charts. But that is another story…
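Simple as it is, a short Python sketch may still help. It assumes the common definition of the variation ratio (one minus the proportion of cases that fall in the modal category), and the patient counts below are hypothetical, chosen only to match the percentages of the two cities used earlier.

```python
from collections import Counter


def variation_ratio(observations):
    """Proportion of cases that do NOT fall in the modal (most frequent) category."""
    counts = Counter(observations)
    modal_count = max(counts.values())
    return 1.0 - modal_count / len(observations)


# Hypothetical samples of 100 patients per city
city_a = ["genotype 1"] * 85 + ["genotype 2"] * 15
city_b = ["genotype 1"] * 60 + ["genotype 2"] * 25 + ["genotype 3"] * 15

print(round(variation_ratio(city_a), 2))  # 0.15
print(round(variation_ratio(city_b), 2))  # 0.4
```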