Chapter 11: Regression Analysis: Multiple Linear Regression
Dummy Variables
Besides quantitative predictor variables, it is also possible to include categorical predictor variables in a regression model. This is done by creating one or more dummy variables.
Dummy Variable
A dummy variable is a binary variable used in regression analysis to represent a particular subgroup of the sample.
A dummy variable takes on a value of $1$ if an individual belongs to a particular subgroup and a value of $0$ if the individual does not belong to that subgroup.
If you want to add a categorical predictor variable with two levels to the regression model, a single dummy variable is sufficient.
If you want to add a categorical predictor variable with more than two levels to the regression model, multiple dummy variables need to be created. For a categorical variable with $k$ levels, you will need $k - 1$ dummy variables. The remaining level serves as the reference group, for which all dummy variables are $0$.
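The encoding rule can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function name `dummy_encode` and the example levels are hypothetical, and the choice of the last level as the reference group is an assumption made here for convenience.

```python
def dummy_encode(value, levels):
    """Encode one categorical value as k - 1 dummy variables.

    Assumption: the last entry of `levels` is the reference level,
    so it is represented by all zeros.
    """
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    # One dummy per non-reference level; at most one of them equals 1.
    return [1 if value == level else 0 for level in levels[:-1]]


# Two levels need a single dummy variable.
print(dummy_encode("yes", ["yes", "no"]))
print(dummy_encode("no", ["yes", "no"]))

# Three levels need two dummy variables.
colors = ["red", "green", "blue"]
for color in colors:
    print(color, dummy_encode(color, colors))
```

Dropping one level avoids redundancy: once the other dummy variables are known, membership of the reference group is already determined.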
Example: Adding a Binary Variable to the Model
Consider the following regression equation:
$$\hat{y} = b_0 + b_1 x$$
where $x$ is a person's age and $\hat{y}$ is their predicted income in 1000 euros.
Now suppose that, besides a person's age, you would also like to incorporate into the model whether or not the person has a Dutch nationality. This variable can take on two values: either you are Dutch or you aren't.
To incorporate this variable into the model, a dummy variable can be introduced, which takes on a value of $1$ if the person in question is Dutch and a value of $0$ if the person has another nationality.
Suppose the new model is described by the following regression equation:
$$\hat{y} = b_0 + b_1 x + b_2 d$$
Here $x$ is the person's age, $d$ is the dummy variable for Dutch nationality, and $b_2$ is its coefficient. So if you have a Dutch nationality, the model predicts you will earn $b_2$ thousand euros more than a person of the same age with a different nationality.
On the basis of this dummy variable, it is possible to construct two models: one for people with a Dutch nationality and one for people with another nationality.
- The predicted income of someone with a Dutch nationality ($d = 1$) is: $\hat{y} = (b_0 + b_2) + b_1 x$
- The predicted income of someone with a different nationality ($d = 0$) is: $\hat{y} = b_0 + b_1 x$
Notice that both equations have the same regression coefficient for age, $b_1$, but different intercepts. The two regression lines are therefore parallel, and the vertical distance between them is equal to the coefficient of the dummy variable, $b_2$.
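To make the parallel-lines idea concrete, here is a small Python sketch. The coefficient values are hypothetical, chosen only for illustration and not taken from the example above; the function name `predicted_income` is likewise an assumption.

```python
# Hypothetical coefficients, chosen only for illustration (in 1000 euros).
INTERCEPT = 10.0     # predicted income at age 0 for a non-Dutch person
AGE_COEF = 0.5       # change in predicted income per year of age
DUTCH_COEF = 3.0     # coefficient of the Dutch-nationality dummy


def predicted_income(age, is_dutch):
    """Predicted income in 1000 euros from age and a 0/1 nationality dummy."""
    dummy = 1 if is_dutch else 0
    return INTERCEPT + AGE_COEF * age + DUTCH_COEF * dummy


for age in (20, 40, 60):
    dutch = predicted_income(age, is_dutch=True)
    other = predicted_income(age, is_dutch=False)
    # The gap is the same at every age: it equals the dummy coefficient.
    print(age, dutch, other, dutch - other)
```

Whatever age is plugged in, the difference between the two predictions stays equal to the dummy coefficient, which is what it means for the two lines to share a slope and differ only in intercept.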
Example: Multiple Dummy Variables
Consider the following regression equation:
$$\hat{y} = b_0 + b_1 x$$
where $x$ is a person's age and $\hat{y}$ is their predicted income in 1000 euros.
Now suppose that, instead of treating age as a quantitative variable, you want to treat it as a categorical variable by grouping people into age groups: Child, Teen, Adult, and Elder.
Since there are $4$ age groups (levels), you need $3$ dummy variables:
- The variable $d_1$ is one if the person is a child and zero otherwise.
- The variable $d_2$ is one if the person is a teen and zero otherwise.
- The variable $d_3$ is one if the person is an adult and zero otherwise.
- For an elderly person, $d_1$, $d_2$, and $d_3$ are all zero.
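On the basis of these three dummy variables, you can construct four different regression models, one for each age group. Writing the model as $\hat{y} = b_0 + b_1 d_1 + b_2 d_2 + b_3 d_3$, where $d_1$, $d_2$, and $d_3$ are the dummy variables for child, teen, and adult, each group's model reduces to a constant:
Age Group | $d_1$ (child) | $d_2$ (teen) | $d_3$ (adult) | Regression Model
Child | 1 | 0 | 0 | $\hat{y} = b_0 + b_1$
Teen | 0 | 1 | 0 | $\hat{y} = b_0 + b_2$
Adult | 0 | 0 | 1 | $\hat{y} = b_0 + b_3$
Elder | 0 | 0 | 0 | $\hat{y} = b_0$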
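The table can be reproduced with a short Python sketch. All numbers below are hypothetical and serve only to show the mechanics; the names `REFERENCE_INCOME`, `DUMMY_COEFFICIENTS`, `dummies`, and `predicted_income` are assumptions introduced for this illustration.

```python
# Hypothetical values in 1000 euros, for illustration only.
REFERENCE_INCOME = 40.0  # intercept: predicted income for the Elder group
DUMMY_COEFFICIENTS = {"Child": -39.0, "Teen": -35.0, "Adult": 5.0}
GROUPS = ["Child", "Teen", "Adult", "Elder"]


def dummies(group):
    """Return the three 0/1 dummy values (child, teen, adult) for a group."""
    if group not in GROUPS:
        raise ValueError(f"unknown age group: {group!r}")
    return [1 if group == name else 0 for name in DUMMY_COEFFICIENTS]


def predicted_income(group):
    """Intercept plus the coefficient of whichever dummy equals 1, if any."""
    values = dummies(group)
    coefficients = DUMMY_COEFFICIENTS.values()
    return REFERENCE_INCOME + sum(c * d for c, d in zip(coefficients, values))


for group in GROUPS:
    print(group, dummies(group), predicted_income(group))
```

Because Elder is the reference group, its prediction is the intercept alone, and each dummy coefficient reads as the difference in predicted income between that group and the Elder group.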