**Published on** January 17th, 2013

# Regression Analysis

**By Tor G. Jakobsen**

**Regression analysis is the workhorse of the social sciences, and a tool every student should know at least something about. It allows the researcher to include several variables in the same model, a so-called multivariate analysis.**

The researcher can thus control for the effect of other independent variables. This matters because a bivariate analysis (with only one explanatory and one dependent variable) can often give a false picture of reality.
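To make the idea of controlling for another variable concrete, here is a minimal Python sketch on made-up data. The variable names (`z`, `x`, `y`) and the helper functions are illustrative assumptions. `y` is constructed to depend only on `z`, while `x` merely moves together with `z`. The control step uses partialling out (removing the part of `x` and `y` that `z` accounts for, then relating what is left), which gives the same coefficient for `x` as a regression that includes both `x` and `z`.

```python
def slope_and_intercept(xs, ys):
    """Least-squares slope and intercept for one explanatory variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    b1 = sxy / sxx
    return b1, mean_y - b1 * mean_x


def residuals(xs, ys):
    """The part of ys that a straight line in xs does not account for."""
    b1, b0 = slope_and_intercept(xs, ys)
    return [b - (b0 + b1 * a) for a, b in zip(xs, ys)]


# Made-up data: y depends only on z; x is merely correlated with z.
z = [1, 2, 3, 4, 5, 6]
x = [2, 1, 4, 3, 6, 5]
y = [3 * value for value in z]

# Bivariate analysis: x looks like it has a strong effect on y.
bivariate_slope, _ = slope_and_intercept(x, y)

# Controlling for z: the apparent effect of x disappears.
controlled_slope, _ = slope_and_intercept(residuals(z, x), residuals(z, y))

print(f"slope of x without control: {bivariate_slope:.2f}")
print(f"slope of x controlling for z: {controlled_slope:.2f}")
```

In this constructed case the bivariate slope is clearly positive, while the slope of `x` after controlling for `z` is zero, which is the kind of false picture a bivariate analysis can give.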

A regression analysis gives us the slope of the independent (*X*) variable. The dependent variable (*Y*) is modeled as a function of *X*. For each one-unit increase in *X*, the predicted value of *Y* for a unit of analysis (for example, a person, company, or country) changes by the slope of *X*, which we call the coefficient, or *B*.

There will be variation in *Y* that is not explained by our model. We denote this with an *e* (the error term). For simplicity we will start with the equation for a bivariate model: *Y* = *B*0 + *B*1*X* + *e*

*B*0 is the constant (or intercept): the value of *Y* where the regression line crosses the *Y*-axis, that is, the predicted *Y* when *X* is zero.
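The pieces of the bivariate equation can be traced with a few lines of Python. This is only a sketch: the coefficient values, the observation, and the function name `predict` are hypothetical and chosen for easy arithmetic.

```python
def predict(b0, b1, x):
    """Predicted Y from the bivariate equation, leaving out the error term."""
    return b0 + b1 * x


# Hypothetical coefficients: intercept (B0) and slope (B1).
b0, b1 = 1.0, 1.25

# One hypothetical observation.
x_observed, y_observed = 10, 14.0

y_predicted = predict(b0, b1, x_observed)

# The error term e is whatever the line does not explain for this observation.
e = y_observed - y_predicted

# A one-unit increase in X changes the prediction by exactly the slope.
change_per_unit = predict(b0, b1, x_observed + 1) - y_predicted

# At X = 0 the prediction equals the intercept.
value_at_zero = predict(b0, b1, 0)

print(y_predicted, e, change_per_unit, value_at_zero)
```

Here the observed *Y* equals the prediction plus the error term, the change per unit equals *B*1, and the prediction at *X* = 0 equals *B*0.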

The figure shows both the regression line and a horizontal line for the average of *Y* (the sample mean, denoted with the *Y* bar). The regression line improves our ability to predict values of *Y* compared with simply guessing the sample mean, as the graph illustrates. The dots are usually closer to the regression line than they are to the mean line.

The technique used to calculate a standard regression line is called *Ordinary Least Squares*, which minimizes the sum of the squared error terms. In other words, it draws the line that makes the total squared vertical distance from all the dots to the regression line as small as possible.
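As a rough illustration, the sketch below fits a least-squares line in plain Python using the standard textbook formulas for the bivariate case. It then compares the sum of squared errors of that line with the sum for the mean line and for a slightly different line. The data (years of education and hourly wage) are made up, and the function names are assumptions chosen for readability.

```python
def ols_line(xs, ys):
    """Return (b0, b1) for the line that minimizes the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def sum_squared_errors(observed, predicted):
    """Sum of squared vertical distances between observed and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))


# Made-up data: years of education (X) and hourly wage (Y).
education = [8, 10, 12, 14, 16]
wage = [11.0, 14.0, 15.0, 19.0, 21.0]

b0, b1 = ols_line(education, wage)
mean_wage = sum(wage) / len(wage)

# Errors around the regression line versus errors around the mean line.
sse_regression = sum_squared_errors(wage, [b0 + b1 * x for x in education])
sse_mean = sum_squared_errors(wage, [mean_wage] * len(wage))

# Any other line, here one with a slightly steeper slope, fits worse.
sse_nudged = sum_squared_errors(wage, [b0 + (b1 + 0.1) * x for x in education])

print(f"intercept={b0}, slope={b1}")
print(f"SSE regression={sse_regression}, SSE mean={sse_mean}, SSE nudged={sse_nudged}")
```

With these numbers the squared errors around the regression line are far smaller than those around the mean line, and nudging the slope away from the fitted value makes the fit worse.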

We usually get the slope expressed as a number in a regression table. If the line slopes upwards, we get a positive coefficient. If it slopes downwards, we get a negative coefficient. If the line is flat, we get a coefficient of zero.

Ordinary Least Squares (OLS) regression presupposes that the dependent variable *Y* (but not the *X*-variables) is continuous. It makes little sense to talk about an increase in *Y* if the dependent variable is categorical. In practice, the dependent variable should have at least 5–7 values.

Make sure not to include an *X*-variable that is too similar to the phenomenon you are explaining. Nor should you include *X*-variables that are too similar to each other. One solution to the latter problem is to combine them into a scale or an index.
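One way to check whether two *X*-variables are too similar is to look at their correlation, and one simple way to combine them is an additive index of standardized scores. The sketch below does both in Python. The survey items (`trust_police`, `trust_courts`), their values, and the helper names are made up for illustration, and averaging standardized scores is only one of several ways to build an index.

```python
from statistics import mean, stdev


def pearson_r(a, b):
    """Pearson correlation between two equally long lists of numbers."""
    mean_a, mean_b = mean(a), mean(b)
    sab = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    saa = sum((p - mean_a) ** 2 for p in a)
    sbb = sum((q - mean_b) ** 2 for q in b)
    return sab / (saa * sbb) ** 0.5


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Made-up survey items that measure nearly the same thing.
trust_police = [2, 4, 5, 7, 8]
trust_courts = [3, 4, 6, 7, 9]

r = pearson_r(trust_police, trust_courts)

# Additive index: the average of the two standardized items per respondent.
trust_index = [
    (p + c) / 2
    for p, c in zip(standardize(trust_police), standardize(trust_courts))
]

print(f"correlation between items: {r:.2f}")
print("index values:", [round(v, 2) for v in trust_index])
```

The index can then enter the regression as a single *X*-variable in place of the two highly correlated items.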
