Chapter 11: Regression Analysis: Multiple Linear Regression
Dummy Variables
Besides quantitative predictor variables, it is also possible to include categorical predictor variables in a regression model. This is done by creating one or more dummy variables.
Dummy Variable
A dummy variable is a binary variable used in regression analysis to represent a particular subgroup of the sample.
A dummy variable takes on a value of #1# if an individual belongs to a particular subgroup and a value of #0# if the individual does not belong to that subgroup.
If you want to add a categorical predictor variable with two levels to the regression model, a single dummy variable is sufficient.
If you want to add a categorical predictor variable with more than two levels to the regression model, multiple dummy variables need to be created. For a categorical variable with #k# levels, you will need #k-1# dummy variables.
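The #k-1# rule can be illustrated with a minimal Python sketch. The function name `dummy_code` and the choice of the last level as the reference category (coded as all zeros) are assumptions made for this illustration only.

```python
def dummy_code(value, levels):
    """Return the k-1 dummy values (0 or 1) for `value`, given a list of k levels.

    Assumption of this sketch: the last level in `levels` is the reference
    category, so it is coded as all zeros.
    """
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    # One dummy variable per level, except the reference level.
    return [1 if value == level else 0 for level in levels[:-1]]


# Two levels -> a single dummy variable.
print(dummy_code("Dutch", ["Dutch", "Other"]))  # [1]
print(dummy_code("Other", ["Dutch", "Other"]))  # [0]

# Three levels -> two dummy variables.
print(dummy_code("B", ["A", "B", "C"]))         # [0, 1]
print(dummy_code("C", ["A", "B", "C"]))         # [0, 0]
```

Any level could serve as the reference category; the choice only changes which group the other coefficients are compared against.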
Example: Adding a Binary Variable to the Model
Consider the following regression equation:
\[\hat{Y}=-12+9X_1\]
Here #X_1# is a person's age and #\hat{Y}# is their predicted income in thousands of euros.
Now suppose that, besides a person's age, you would also like the model to take into account whether or not the person has Dutch nationality. This variable can take on two values: either you are Dutch or you are not.
To incorporate this variable into the model, a dummy variable #X_2# can be introduced, which takes on a value of #1# if the person in question is Dutch and a value of #0# if the person has another nationality.
Suppose the new model is described by the following regression equation:
\[\hat{Y}=9X_1-12 + 5X_2\]
Here #b_2=5# is the coefficient of the dummy variable #X_2#. So if you have Dutch nationality, the model predicts that you will earn #5000# euros more than a person of the same age with a different nationality.
On the basis of this dummy variable, it is possible to construct two models: one for people with a Dutch nationality and one for people with another nationality.
- The predicted income of someone with Dutch nationality is:
- #\hat{Y}_1=9X_1-12+5\cdot 1=9X_1-7#
- The predicted income of someone with a different nationality is:
- #\hat{Y}_2=9X_1-12+5\cdot 0=9X_1-12#
Notice that both equations have the same slope (the regression coefficient of #X_1#) but different intercepts. The vertical distance between the two parallel regression lines is equal to the coefficient of the dummy variable, #b_2=5#.
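As a quick check, the two regression lines can be evaluated in Python. This is only a sketch: the function name `predicted_income` and the age of #30# are hypothetical choices, while the coefficients are those of the regression equation above.

```python
def predicted_income(age, is_dutch):
    """Predicted income in thousands of euros: Y-hat = -12 + 9*X1 + 5*X2."""
    x2 = 1 if is_dutch else 0  # dummy variable for Dutch nationality
    return -12 + 9 * age + 5 * x2


age = 30  # hypothetical age, chosen only for illustration
dutch = predicted_income(age, is_dutch=True)    # 9*30 - 7  = 263
other = predicted_income(age, is_dutch=False)   # 9*30 - 12 = 258

print(dutch, other)   # 263 258
print(dutch - other)  # 5, the coefficient of the dummy variable
```

The difference is #5# for every age, because the dummy variable shifts the intercept and leaves the slope unchanged.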
Example: Multiple Dummy Variables
Consider the following regression equation:
\[\hat{Y}=-12+9X\]
Here #X# is a person's age and #\hat{Y}# is their predicted income in thousands of euros.
Now suppose that, instead of treating age as a quantitative variable, you want to treat it as a categorical variable by grouping people into #4# age groups: Child, Teen, Adult, and Elder.
Since there are #4# age groups (levels), you need #k-1=4-1=3# dummy variables:
- The variable #X_1# is one if the person is a child and zero otherwise.
- The variable #X_2# is one if the person is a teen and zero otherwise.
- The variable #X_3# is one if the person is an adult and zero otherwise.
- For an elderly person, #X_1, X_2#, and #X_3# are all zero.
On the basis of these three dummy variables, you can construct four different regression models, one for each age group.
Age group | #X_1# | #X_2# | #X_3# | Regression Model
Child | 1 | 0 | 0 | #\hat{Y}=b_0+b_1\cdot 1=b_0+b_1#
Teen | 0 | 1 | 0 | #\hat{Y}=b_0+b_2\cdot 1=b_0+b_2#
Adult | 0 | 0 | 1 | #\hat{Y}=b_0+b_3\cdot 1=b_0+b_3#
Elder | 0 | 0 | 0 | #\hat{Y}=b_0#
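The four group-specific predictions all follow from the single model #\hat{Y}=b_0+b_1X_1+b_2X_2+b_3X_3#. The Python sketch below illustrates this. The coefficient values are made up purely for illustration (the text gives none), and the names `AGE_GROUPS` and `predict` are likewise hypothetical.

```python
# Hypothetical coefficients, chosen only for illustration (not from the text).
b0, b1, b2, b3 = 40, -38, -30, 5

# Elder is the reference group: all three dummy variables are zero.
AGE_GROUPS = ["Child", "Teen", "Adult", "Elder"]


def predict(group):
    """Evaluate Y-hat = b0 + b1*X1 + b2*X2 + b3*X3 for one age group."""
    if group not in AGE_GROUPS:
        raise ValueError(f"unknown age group: {group!r}")
    # X1, X2, X3 are the dummies for Child, Teen, and Adult respectively.
    x1, x2, x3 = (1 if group == g else 0 for g in AGE_GROUPS[:-1])
    return b0 + b1 * x1 + b2 * x2 + b3 * x3


for group in AGE_GROUPS:
    print(group, predict(group))
# Child 2   (b0 + b1)
# Teen 10   (b0 + b2)
# Adult 45  (b0 + b3)
# Elder 40  (b0)
```

Each coefficient #b_1#, #b_2#, and #b_3# is the difference in predicted income between that age group and the reference group (Elder), whose prediction is the intercept #b_0#.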