There are days when I feel biblical. Other days I feel mythological. Today I feel philosophical and even a little Masonic.
And the reason is that the other day I found myself wondering what the difference between exoteric and esoteric was, so I consulted that friend of us all who knows so much about everything, our friend Google. It kindly explained to me that both terms are similar and usually describe two aspects of the same doctrine. Exotericism refers to knowledge that is not restricted to a certain group within the community that deals with that doctrine, and that can be disclosed and made available to anyone. Esotericism, on the other hand, refers to knowledge that belongs to a deeper and higher order, available only to a privileged few specially educated to understand it.
And now that the difference is understood, I ask you a slightly tricky question: is multivariate statistics exoteric or esoteric? The answer, of course, will depend on each person, but we are going to see whether it is true that the two concepts are not contradictory but complementary, and whether we can strike a balance between them, at least in understanding the usefulness of multivariate techniques.
We are more used to working with univariate or bivariate statistical techniques, which allow us to study together at most two characteristics of the individuals in a population to detect relationships between them.
However, with mathematical development and, above all, the computing power of modern computers, multivariate statistical techniques are becoming increasingly important.
We can define multivariate analysis as the set of statistical procedures that simultaneously study various characteristics of the same subject or entity, in order to analyze the interrelation that may exist among all the random variables that these characteristics represent. Let me insist on the two aspects of these techniques: the multiplicity of variables and the study of their possible interrelations.
There are many multivariate analysis techniques, ranging from purely descriptive methods to those that use statistical inference to draw conclusions from the data and that can build models capturing relationships that are not obvious to the naked eye when simply looking at the data. They will also allow us to develop prediction models for several variables and establish relationships among them.
Some of these techniques are the extension of their equivalents with two variables, one dependent and the other independent or explanatory. Others have nothing similar in two-dimensional statistics.
Some authors classify these techniques into three broad groups: full-rank and non-full-rank models, techniques to reduce dimensionality, and classification and discrimination methods. Do not worry if this sounds like gibberish; we will try to simplify it a bit.
To be able to talk about the FULL-RANK AND NON-FULL-RANK TECHNIQUES, I think it will be necessary to explain first what rank we are talking about.
A preliminary digression
Although we are not going to go into the subject in depth, all these methods involve matrix calculations. You know, matrices or arrays: sets of numbers arranged in two dimensions (the ones we are going to discuss here), forming rows and columns, that can be added and multiplied together, in addition to other calculations.
The rank of a matrix is defined as the number of rows or columns that are linearly independent (it does not matter whether we count rows or columns, the number is the same). The rank can take values from 0 to the smaller of the number of rows and the number of columns. For example, a matrix with 2 rows and 3 columns may have a rank from 0 to 2. A matrix with 5 rows and 3 columns may have a rank from 0 to 3. Now imagine a matrix with two rows, the first 1 2 3 and the second 3 6 9 (it has 3 columns). Its maximum rank would be 2 (the smaller of the number of rows and columns) but, if you look closely, the second row is the first one multiplied by 3, so there is only one linearly independent row and its rank is equal to 1.
Well, a matrix is said to be of full rank when its rank is equal to the largest possible for a matrix of the same dimensions. The third example that I have given you would be a non-full-rank matrix, since a 2×3 matrix would have a maximum rank of 2 and that of our matrix is 1.
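If you want to check the example above on a computer, here is a minimal sketch using NumPy, whose function `numpy.linalg.matrix_rank` counts the linearly independent rows (or columns) of a matrix. The matrix `B` and the helper `is_full_rank` are my own additions for illustration; they are not part of any standard recipe.

```python
import numpy as np

# The 2x3 example from the text: the second row is 3 times the first.
A = np.array([[1, 2, 3],
              [3, 6, 9]])

# A hypothetical variation: changing one number makes the rows independent.
B = np.array([[1, 2, 3],
              [3, 6, 10]])


def is_full_rank(m):
    """Return True when the rank equals the smaller of rows and columns."""
    return np.linalg.matrix_rank(m) == min(m.shape)


rank_A = np.linalg.matrix_rank(A)
rank_B = np.linalg.matrix_rank(B)

print("A:", rank_A, is_full_rank(A))
print("B:", rank_B, is_full_rank(B))
```

The first matrix has only one independent row, while the second one reaches the maximum possible value for its dimensions.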
Once this is understood, let's move on to the full-rank and non-full-rank methods.
Multiple linear regression
The first one we will look at is the multiple linear regression model. This model, an extension of the simple linear regression model, is used when we have a dependent variable and a series of explanatory variables, all of them quantitative, when the relationship between them can be assumed to be linear and the explanatory variables form a full-rank matrix.
Like simple regression, this technique allows us to predict changes in the dependent variable based on the values of the explanatory variables. The formula is like that of simple regression, but including all the explanatory variables, so I'm not going to bore you with it. However, since I have punished you with ranks and matrices, let me tell you that, in matrix terms, it can be expressed as follows:
Y = Xβ + e
where Y is the vector of values of the dependent variable, X is the full-rank matrix of the explanatory variables, β is the vector of coefficients and e is the vector of errors. This error term is justified by the possible omission of relevant explanatory variables from the model or by measurement errors.
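To make the matrix expression a little less abstract, here is a small sketch that simulates data from that equation and recovers the coefficients by ordinary least squares with `numpy.linalg.lstsq`. Everything in it is invented for illustration: the sample size, the three explanatory variables, the "true" coefficients in `beta_true` and the size of the error term. The column of ones is my assumption that the model includes an intercept.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the example is repeatable

n = 200                                    # hypothetical number of subjects
X_vars = rng.normal(size=(n, 3))           # three made-up explanatory variables
X = np.column_stack([np.ones(n), X_vars])  # add a column of ones (intercept)

beta_true = np.array([2.0, 0.5, -1.0, 3.0])  # invented coefficients
e = rng.normal(scale=0.1, size=n)            # error term
Y = X @ beta_true + e                        # the model: Y = X beta + e

# Ordinary least squares estimate of beta; lstsq also reports the rank of X.
beta_hat, _, rank_X, _ = np.linalg.lstsq(X, Y, rcond=None)

print("rank of X:", rank_X, "out of", X.shape[1], "columns")
print("estimated coefficients:", np.round(beta_hat, 2))
```

Because X is of full rank here, there is a single set of coefficients that minimizes the squared errors, and the estimates should land close to the invented values.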
To complicate matters, imagine that we were to simultaneously correlate several independent variables with several dependent ones. In this case, multiple regression does not help us, and we would have to resort to the canonical correlation technique, which allows us to make predictions of various dependent variables based on the values of several explanatory variables.
Non-full-rank techniques
If you remember bivariate statistics, analysis of variance (ANOVA) is the technique that allows us to study the effect on a quantitative dependent variable of explanatory variables when these are categories of a qualitative variable (we call these categories factors). In this case, since each observation can belong to one and only one of the factors of the explanatory variable, matrix X will not be of full rank.
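A quick way to see why this happens is to build such a matrix by hand. The sketch below assumes a particular coding that is my choice, not something stated above: a column of ones for the intercept plus one 0/1 indicator column per category. The six observations and their three categories are invented.

```python
import numpy as np

# Hypothetical data: 6 observations, each belonging to exactly one of 3 categories.
groups = np.array([0, 0, 1, 1, 2, 2])

# One 0/1 indicator column per category (each row contains a single 1).
indicators = np.eye(3)[groups]

# Design matrix: intercept column plus the three indicator columns.
X = np.column_stack([np.ones(len(groups)), indicators])

# The indicator columns add up to the intercept column, so one column is redundant.
rank_X = np.linalg.matrix_rank(X)

# Dropping one indicator (using that category as the reference) restores full rank.
X_reduced = X[:, :-1]
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(X.shape, rank_X)
print(X_reduced.shape, rank_reduced)
```

Since every observation falls in one and only one category, the indicator columns always sum to the column of ones, and the matrix with four columns can never reach rank 4.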
A slightly more complicated situation occurs when the explanatory variables are a quantitative variable and one or more factors of a qualitative variable. On these occasions we resort to a general linear model called the analysis of covariance (ANCOVA).
Transferring what we have just said to the realm of multivariate statistics, we would have to use the extension of these techniques. The extension of ANOVA when there is more than one dependent variable that cannot be combined into one is the multivariate analysis of variance (MANOVA). If factors of qualitative variables coexist with quantitative variables, we will resort to the multivariate analysis of covariance (MANCOVA).
The second group of multivariate techniques comprises those that aim at the REDUCTION OF DIMENSIONALITY.
Sometimes we must handle such a high number of variables that it is complex to organize them and reach some useful conclusion. Now, if we are lucky that the variables are correlated with each other, the information provided by the set will be redundant, since the information given by some variables will include that already provided by other variables in the set.
In these cases, it is useful to reduce the dimension of the problem by decreasing the number of variables to a smaller set of variables that are not correlated with each other and that collect most of the information included in the original set. And we say most of the information because, obviously, the more we reduce the number, the more information we will lose.
The two fundamental techniques that we will use in these cases are principal component analysis and factor analysis.
Principal component analysis
Principal component analysis takes a set of p correlated variables and transforms them into a new set of uncorrelated variables, which we call principal components. These principal components allow us to explain the variables in terms of their common dimensions.
Without going into detail, a correlation matrix and a series of vectors are calculated that will provide us with the new principal components, ordered from highest to lowest according to the variance of the original data that each component encompasses. Each component will be a linear combination of the original variables, somewhat like a regression line.
Let’s imagine a very simple case with six explanatory variables (X1 to X6). Principal component 1 (PC1) can be, let’s say, 0.15X1 + 0.5X2 – 0.6X3 + 0.25X4 – 0.1X5 – 0.2X6 and, in addition, encompass 47% of the total variance. If PC2 turns out to encompass 30% of the variance, with PC1 and PC2 we will have accounted for 77% of the total variance with a data set that is easier to handle (imagine if instead of 6 variables we had 50). And not only that: if we plot PC1 against PC2, we can see whether some type of grouping of the variable under study occurs according to the values of the principal components.
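For the curious, the whole procedure fits in a few lines. This is only a sketch under my own assumptions: the six variables are simulated from two hidden sources plus noise (so that they are correlated), and the components are obtained from the eigenvalues and eigenvectors of the correlation matrix with `numpy.linalg.eigh`. The percentages you get will depend on the simulated data and will not match the invented 47% and 30% of the example.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the example is repeatable

# Hypothetical data: 300 subjects, 6 correlated variables (X1 to X6)
# generated from 2 hidden sources plus some noise.
n, p = 300, 6
latent = rng.normal(size=(n, 2))
mixing = rng.normal(size=(2, p))
X = latent @ mixing + 0.3 * rng.normal(size=(n, p))

# Standardize each variable and compute the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
R = np.corrcoef(Z, rowvar=False)

# Eigen-decomposition of the symmetric matrix R, sorted from largest
# to smallest eigenvalue (eigh returns them in ascending order).
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Share of the total variance encompassed by each component.
explained = eigvals / eigvals.sum()

# Each column of eigvecs holds the weights of one linear combination;
# the scores are the values of the components for every subject.
scores = Z @ eigvecs

print("weights of PC1:", np.round(eigvecs[:, 0], 2))
print("variance explained:", np.round(explained, 2))
print("PC1 + PC2:", round(float(explained[:2].sum()), 2))
```

Plotting `scores[:, 0]` against `scores[:, 1]` would give the PC1 versus PC2 graph mentioned above.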
In this way, if we are lucky and a few components collect most of the variance of the original variables, we will have reduced the dimension of the problem. And although, sometimes, this is not possible, it can always help us to find groupings in the data defined by a large number of variables, which links us to the following technique, factor analysis.
We know that the total variance of our data (the one studied by principal component analysis) is the sum of three components: the common or shared variance, the specific variance of each variable, and the variance due to chance and measurement errors. Again, without going into detail, the factor analysis method starts from the correlation matrix to isolate only the common variance and try to find a series of common underlying dimensions, called factors, that are not observable by looking at the original set of variables.
As we can see, these two methods are very similar, so there is a lot of confusion about when to use one or the other, especially considering that principal component analysis may be the first step in the factor analysis methodology.
As we have already said, principal component analysis tries to explain the maximum possible proportion of the total variance of the original data, while the objective of factor analysis is to explain the covariance or correlation that exists among its variables. Therefore, principal component analysis will usually be used to search for linear combinations of the original variables and reduce one large data set to a smaller and more manageable one, while we will resort to factor analysis when looking for a new set of variables, generally smaller than the original, that represents what the original variables have in common.
Moving forward on today's arduous path, for those hard-working readers who are still with us, we are going to discuss CLASSIFICATION AND DISCRIMINATION METHODS, which are two: cluster analysis and discriminant analysis.
Cluster analysis tries to recognize patterns to summarize the information contained in the initial set of variables, which are grouped according to their greater or lesser homogeneity. In summary, we look for groups that are mutually exclusive, so that the elements are as similar as possible to those of their own group and as different as possible from those of the other groups.
The most famous part of cluster analysis is, without a doubt, its graphic representation with tree diagrams called dendrograms, in which homogeneous groups are represented that become increasingly different the farther apart they lie among the branches of the tree.
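A dendrogram is simply a drawing of a sequence of merges: we start with every element on its own and keep joining the two closest groups. The sketch below is a deliberately naive version of that idea, written by me for illustration (the function `agglomerate` is hypothetical, and real statistical packages use much more efficient algorithms). I have assumed Euclidean distance and the average distance between groups as the joining criterion, which is only one of several possible choices.

```python
import numpy as np


def agglomerate(points, n_clusters):
    """Naive agglomerative clustering using average distance between groups.

    Returns the final groups (lists of row indices) and the list of merges
    (group a, group b, distance), which is what a dendrogram draws.
    """
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    merges = []
    while len(clusters) > n_clusters:
        best = None
        # Look for the pair of groups with the smallest average distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.mean([np.linalg.norm(points[i] - points[j])
                             for i in clusters[a] for j in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], float(d)))
        clusters[a] = clusters[a] + clusters[b]  # join the two groups
        del clusters[b]
    return clusters, merges


# Hypothetical data: two obvious groups of four points each.
points = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                   [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)

clusters, merges = agglomerate(points, n_clusters=2)
print(clusters)
for group_a, group_b, distance in merges:
    print(group_a, "+", group_b, "at distance", round(distance, 2))
```

Reading the list of merges from first to last is like climbing the dendrogram from the leaves to the trunk.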
But, instead of wanting to segment the population, let’s assume that we already have a population segmented into a number of classes, k. Suppose we have a group of individuals defined by a number p of random variables. If we want to know to what class of the population a certain individual may belong, we will resort to the technique of discriminant analysis.
Suppose that we have a new treatment that is awfully expensive, so we only want to give it to patients who we are sure will comply with the treatment. Thus, our population is segmented into compliant and non-compliant classes. It would be very useful for us to select a set of variables that would allow us to discriminate which class a specific person belongs to, and even to know which of these variables best discriminate between the two groups. Thus, we will measure the variables in the candidate for treatment and, using what is known as a discrimination criterion or rule, we will assign the candidate to one group or the other and proceed accordingly. Of course, do not forget, there will always be a probability of being wrong, so we will be interested in finding the discriminant rule that minimizes the probability of misclassification.
Discriminant analysis may seem similar to cluster analysis, but if we think about it, the difference is clear. In discriminant analysis the groups are defined beforehand (compliant or non-compliant, in our example), while in cluster analysis we look for groups that are not evident: we would analyze the data and discover that there are patients who do not take the pill that we give them, something that had not even crossed our minds (in addition to our ignorance, we would demonstrate our innocence).
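To give an idea of what such a rule can look like, here is a sketch of one classical option, Fisher's linear discriminant for two groups. It is not necessarily what any given statistical package does, and it rests on assumptions of mine: two invented variables measured in 100 compliant and 100 non-compliant simulated patients, similar spread in both groups, and both kinds of error being equally costly. The function `classify` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed so the example is repeatable

# Hypothetical training data: two variables measured in patients whose
# class (compliant or non-compliant) is already known.
compliant = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(100, 2))
non_compliant = rng.normal(loc=[-2.0, -2.0], scale=1.0, size=(100, 2))

m1 = compliant.mean(axis=0)
m0 = non_compliant.mean(axis=0)

# Pooled covariance (simple average, since both groups have the same size).
S = (np.cov(compliant, rowvar=False) + np.cov(non_compliant, rowvar=False)) / 2

# Fisher's direction: the weights that best separate the two group means.
w = np.linalg.solve(S, m1 - m0)

# Cut-off halfway between the projected means (assumes equal priors and costs).
threshold = w @ (m1 + m0) / 2


def classify(x):
    """Discriminant rule: assign a new patient to one of the two classes."""
    return "compliant" if w @ x > threshold else "non-compliant"


# Proportion of training patients that the rule gets wrong.
train = np.vstack([compliant, non_compliant])
labels = np.array([1] * 100 + [0] * 100)
predicted = (train @ w > threshold).astype(int)
error_rate = float(np.mean(predicted != labels))

print("weights:", np.round(w, 2))
print("training error rate:", error_rate)
print(classify(np.array([3.0, 3.0])), classify(np.array([-3.0, -3.0])))
```

The size of each weight in `w` gives a rough idea of how much each variable contributes to telling the two groups apart, and the error rate on the training data is an optimistic estimate of the probability of being wrong with a new patient.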
We are leaving…
And here we are going to leave it for today. We have flown over the steep landscape of multivariate statistics from a great height and I hope it has served to transfer it from the field of the esoteric to that of the exoteric (or was it the other way around?). We have not entered into the specific methodology of each technique, since we could have written an entire book. By roughly understanding what each method is and what it is for, I think we have achieved quite a lot. In addition, statistical packages carry them out, as always, effortlessly.
Also, do not think that we have talked about all the methods that have been developed for multivariate analysis. There are many others, such as conjoint analysis and multidimensional scaling, widely used in advertising to determine the attributes of an object that are preferred by the population and how they influence their perception of it. We could also get lost among other newer techniques, such as correspondence analysis, or linear probability models, such as logit and probit analysis, which are combinations of multiple regression and discriminant analysis, not to mention simultaneous or structural equation models. But that is another story…