Today we are going to talk again about the relationship that can exist between two variables. We saw in a previous post how to measure that relationship by means of correlation, which quantifies the strength of the relationship between two variables when neither one can be considered a predictor of the other. That is, when the values of one do not serve to calculate the values of the other, although both vary together in a predictable way.

A similar technique, which we will discuss in this post, is regression. Regression not only describes the relationship between two variables, but also quantifies how one of them, which we call dependent, changes with changes in the other, which we call independent.

But we can go one step further: the values of the independent variable can serve to predict the values of the dependent one. Suppose we measure weight and height and calculate the regression model between both variables. If we know the height of an individual, we can use the regression equation to estimate his weight (in this case, height is the independent variable and weight the dependent one).

If we call the independent variable x and the dependent y, any simple regression model can be represented by the following expression:

Function(y) = a + bx

In this expression, a represents the result of the function when x equals zero. It is usually called the intercept because it is the point where the regression line crosses the y-axis. Meanwhile, b is usually called the slope, and represents the amount by which the function of y changes for each change in x (if x increases by one unit, the function of y changes by b units).

And what is the meaning of function(y)? It depends on the type of the dependent variable. We know that variables are classified as quantitative (or continuous), qualitative (nominal or ordinal) and time to event (also called survival variables). Function(y) will be different for each of these types, because the regression model that we apply is different for each one.

In the case of continuous variables, the model applied is simple linear regression, and function(y) is the arithmetic mean of y. The equation is as follows:

y = a + bx

Using the example of weight and height, if we replace x with the desired value of height and solve the equation, we obtain the mean weight of individuals of that height.
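
To make the weight and height example concrete, here is a minimal sketch in Python of how a and b could be estimated by ordinary least squares and then used to estimate a mean weight. The data are invented purely for illustration, and the function names (fit_simple_linear, predict_mean) are hypothetical, not taken from any library.

```python
def fit_simple_linear(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a = mean_y - b * mean_x
    return a, b


def predict_mean(a, b, x):
    """Estimated mean of y for a given value of x."""
    return a + b * x


# Invented data, for illustration only.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 60, 67, 77, 84]

a, b = fit_simple_linear(heights_cm, weights_kg)
estimated_weight = predict_mean(a, b, 175)
print(f"intercept a = {a:.2f}, slope b = {b:.2f}")
print(f"estimated mean weight at 175 cm = {estimated_weight:.2f} kg")
```

Note that moving from 175 to 176 cm changes the estimate by exactly b, which is the reading of the slope given above.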

In the event that the dependent variable is binary qualitative, we use a logistic regression model. In this case we code the dependent variable as zero or one, and the function of y is no longer the mean but the natural logarithm of the odds of the variable taking the value one. Suppose we study the relationship between weight (independent variable) and gender (dependent variable). We could code female as one and male as zero, representing the regression line as follows:

ln(odds) = a + bx

If we substitute the weight in question for x and solve the equation, we get the logarithm of the odds of being a woman (value 1). To get the odds we raise the number e to the result of the equation (that is, we take the antilogarithm). From here it is easy to calculate the probability of being female (p = odds / (1 + odds)) or male (one minus the probability of being female).

This function ln(odds) is often written as ln(p / (1 - p)), since the odds are the probability of an event occurring (p) divided by the probability of it not occurring (1 - p). This function is called the logit, so we can also see the logistic regression model represented as follows:

logit(y) = a + bx
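
The steps from the solved equation to a probability can be sketched in a few lines of Python. The coefficients a and b below are hypothetical values chosen only to show the arithmetic; they do not come from any real data set.

```python
import math


def logit(p):
    """Natural logarithm of p / (1 - p)."""
    return math.log(p / (1 - p))


def probability_from_logit(value):
    """Undo the logit: take the antilogarithm, then divide by one plus it."""
    antilog = math.exp(value)
    return antilog / (1 + antilog)


# Hypothetical coefficients, for illustration only (female coded as 1).
a = 7.0
b = -0.1
weight_kg = 60

solved = a + b * weight_kg          # result of the regression equation
p_female = probability_from_logit(solved)
p_male = 1 - p_female

print(f"logit = {solved:.2f}")
print(f"p(female) = {p_female:.3f}, p(male) = {p_male:.3f}")
```

Applying logit to p_female returns the solved value of the equation, which is a quick way to check that the two functions are inverses of each other.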

Finally, the dependent variable may be of the time to event type. In this case we must use a Cox proportional hazards regression model. Its structure is very similar to that of logistic regression, except that the function of y is the logarithm of the hazard ratio rather than the logit:

ln(HR) = a + bx

As we did with logistic regression, to get the value of the hazard ratio we have to take the antilogarithm of the solution of the regression equation (e raised to the result of the equation).
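
As a small sketch of that last step in Python, the antilogarithm is simply the exponential function. The value of ln_hr below is a hypothetical solution of the equation, not the output of a fitted model.

```python
import math


def hazard_ratio_from_log(ln_hr):
    """Antilogarithm: e raised to the solution of the regression equation."""
    return math.exp(ln_hr)


# Hypothetical solution of the regression equation, for illustration only.
ln_hr = 0.4
hr = hazard_ratio_from_log(ln_hr)
print(f"ln(HR) = {ln_hr:.2f} corresponds to HR = {hr:.2f}")
```

A solution of zero gives a hazard ratio of one, a positive solution gives a ratio above one and a negative solution gives a ratio below one.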

Although there are many more, these are the three most commonly used regression models. In all these cases we have talked about equations with one independent variable, so we say that we are dealing with simple regression. But we can include as many independent variables as we want, using the following formula:

Function(y) = a + bx_{1} + cx_{2} + … + nx_{n}

Of course, we will then no longer talk about simple regression but about multiple regression, although everything we have described is equally applicable.
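
The multiple regression formula is the same sum with more terms, which the following Python sketch illustrates. The function name linear_predictor, the coefficients and the second variable (age) are all hypothetical and serve only to show the calculation.

```python
def linear_predictor(intercept, coefficients, values):
    """Compute a + b*x1 + c*x2 + ... for any number of independent variables."""
    if len(coefficients) != len(values):
        raise ValueError("one coefficient is needed per independent variable")
    return intercept + sum(c * x for c, x in zip(coefficients, values))


# Hypothetical model: mean weight (kg) from height (cm) and age (years).
intercept = -80.0
coefficients = [0.8, 0.2]
individual = [175, 40]

mean_weight = linear_predictor(intercept, coefficients, individual)
print(f"estimated mean weight = {mean_weight:.1f} kg")
```

With a single coefficient the function reduces to a + bx, the simple regression case. For a logistic or Cox model, the same sum would be the logit or the logarithm of the hazard ratio, to which the antilogarithm is then applied.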

And here we will leave the topic. We could talk about how the intercept and the slope are interpreted depending on whether the independent variable is continuous or qualitative, since they are read a little differently in each case. But that is another story…