Dummy Variables

Chapter 11: Regression Analysis: Multiple Linear Regression

Besides quantitative predictor variables, a regression model can also include categorical predictor variables. This is done by creating one or more dummy variables.
Dummy Variable

A dummy variable is a binary variable used in regression analysis to represent a particular subgroup of the sample.

A dummy variable takes on a value of $1$ if an individual belongs to a particular subgroup and a value of $0$ if the individual does not belong to that subgroup.

If you want to add a categorical predictor variable with two levels to the regression model, a single dummy variable is sufficient.

If you want to add a categorical predictor variable with more than two levels to the regression model, multiple dummy variables need to be created. For a categorical variable with $k$ levels, you will need $k-1$ dummy variables.
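The $k-1$ rule can be illustrated with a minimal sketch in plain Python. The function name `dummy_code`, the level names, and the choice of reference level are all hypothetical and only serve the illustration; statistical packages typically do this coding for you.

```python
def dummy_code(value, levels, reference):
    """Return k-1 dummy variables (0 or 1) for `value`, leaving out `reference`."""
    if value not in levels or reference not in levels:
        raise ValueError("value and reference must both be known levels")
    # One dummy per non-reference level; the reference level gets all zeros.
    return {level: int(value == level) for level in levels if level != reference}


# Hypothetical categorical variable with k = 3 levels.
levels = ["low", "medium", "high"]

for value in levels:
    # Each individual gets k - 1 = 2 dummy values.
    print(value, dummy_code(value, levels, reference="high"))
```

An individual in the reference level is identified by all dummies being zero, which is why a $k$-th dummy variable would carry no extra information.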

Example: Adding a Binary Variable to the Model

Consider the following regression equation:

$\hat{Y}=-12+9X_1$

Here $X_1$ is a person's age and $\hat{Y}$ is their predicted income in thousands of euros.

Now suppose that, besides a person's age, you would also like the model to account for whether or not the person has Dutch nationality. This variable can take on two values: a person is either Dutch or not.

To incorporate this variable into the model, a dummy variable $X_2$ can be introduced, which takes on a value of $1$ if the person in question is Dutch and a value of $0$ if the person has another nationality.

Suppose the new model is described by the following regression equation:

$\hat{Y}=-12+9X_1+5X_2$

Here $b_2=5$. Because income is measured in thousands of euros, the model predicts that a person with Dutch nationality earns $5000$ euros more than a person of the same age with a different nationality.

On the basis of this dummy variable, the model can be written as two separate regression equations: one for people with Dutch nationality and one for people with another nationality.

The predicted income of someone with Dutch nationality is:
- $\hat{Y}_1=-12+9X_1+5\cdot 1=-7+9X_1$
The predicted income of someone with a different nationality is:
- $\hat{Y}_2=-12+9X_1+5\cdot 0=-12+9X_1$

Notice that both equations have the same slope (the regression coefficient of age, $b_1=9$) but different intercepts. The two regression lines are therefore parallel, and the vertical distance between them is equal to the coefficient of the dummy variable, $b_2=5$.
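The constant gap between the two lines can be checked numerically. The sketch below simply evaluates the regression equation from this example; the function name `predict_income` and the ages used are arbitrary choices for illustration.

```python
def predict_income(age, is_dutch):
    """Predicted income in thousands of euros: -12 + 9*age + 5*dummy."""
    x2 = 1 if is_dutch else 0  # dummy variable: 1 = Dutch, 0 = other nationality
    return -12 + 9 * age + 5 * x2


for age in (20, 30, 40):  # arbitrary ages
    dutch = predict_income(age, is_dutch=True)
    other = predict_income(age, is_dutch=False)
    # The difference equals the dummy coefficient at every age.
    print(age, dutch, other, dutch - other)
```

Whatever age is plugged in, the difference between the two predictions is the dummy coefficient, which is what makes the two regression lines parallel.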

Example: Multiple Dummy Variables

Consider the following regression equation:

$\hat{Y}=-12+9X$

Here $X$ is a person's age and $\hat{Y}$ is their predicted income in thousands of euros.

Now suppose that, instead of treating age as a quantitative variable, you want to treat age as a categorical variable by grouping people into $4$ age groups: Child, Teen, Adult, and Elder.

Since there are $4$ age groups (levels), you need $k-1=4-1=3$ dummy variables:

The variable $X_1$ is one if the person is a child and zero otherwise.
The variable $X_2$ is one if the person is a teen and zero otherwise.
The variable $X_3$ is one if the person is an adult and zero otherwise.
For an elderly person, $X_1$, $X_2$, and $X_3$ are all zero, so Elder serves as the reference group.

The full model is $\hat{Y}=b_0+b_1X_1+b_2X_2+b_3X_3$. Substituting the dummy values of each age group into this model gives four different regression equations, one for each age group.

Age group	$X_1$	$X_2$	$X_3$	Regression Model
Child	1	0	0	$\hat{Y}=b_0+b_1$
Teen	0	1	0	$\hat{Y}=b_0+b_2$
Adult	0	0	1	$\hat{Y}=b_0+b_3$
Elder	0	0	0	$\hat{Y}=b_0$
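To make the table concrete, the sketch below evaluates the model $\hat{Y}=b_0+b_1X_1+b_2X_2+b_3X_3$ for each age group. The coefficient values are hypothetical and not taken from any fitted model; they are chosen only to show how each dummy coefficient measures the difference from the reference group (Elder).

```python
# Hypothetical coefficients (in thousands of euros), for illustration only.
B0, B1, B2, B3 = 40.0, -35.0, -30.0, 10.0

# Dummy coding (X1, X2, X3) for each age group; Elder is the reference group.
DUMMIES = {
    "Child": (1, 0, 0),
    "Teen": (0, 1, 0),
    "Adult": (0, 0, 1),
    "Elder": (0, 0, 0),
}


def predict(group):
    """Predicted income for an age group under the dummy-coded model."""
    x1, x2, x3 = DUMMIES[group]
    return B0 + B1 * x1 + B2 * x2 + B3 * x3


for group in DUMMIES:
    # Second number: difference from the reference group = that group's coefficient.
    print(group, predict(group), predict(group) - predict("Elder"))
```

Under this coding, $b_0$ is the predicted value for the reference group, and $b_1$, $b_2$, and $b_3$ are the predicted differences between Child, Teen, and Adult, respectively, and the reference group.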