Dummy Variables

Besides quantitative predictor variables, a regression model can also include categorical predictor variables. This is done by creating one or more dummy variables.

Dummy Variable

A dummy variable is a binary variable used in regression analysis to represent a particular subgroup of the sample.

A dummy variable takes on a value of #1# if an individual belongs to a particular subgroup and a value of #0# if the individual does not belong to that subgroup.

If you want to add a categorical predictor variable with two levels to the regression model, a single dummy variable is sufficient.

If you want to add a categorical predictor variable with more than two levels to the regression model, multiple dummy variables need to be created. For a categorical variable with #k# levels, you will need #k-1# dummy variables.
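
The #k-1# rule can be illustrated with a short Python sketch. The function name `make_dummies`, the `reference` argument, and the three-level `housing` variable are hypothetical and chosen only for illustration. The level passed as `reference` gets no dummy of its own, so all dummies are #0# for it.

```python
def make_dummies(values, reference):
    """Return one 0/1 dummy column per non-reference level.

    `values` is a list of category labels; `reference` is the level
    that gets no dummy of its own (all dummies are 0 for it).
    """
    # Sort the levels so the column order is deterministic.
    levels = sorted(set(values))
    if reference not in levels:
        raise ValueError("reference level must occur in values")
    dummy_levels = [lvl for lvl in levels if lvl != reference]
    return {lvl: [1 if v == lvl else 0 for v in values] for lvl in dummy_levels}


# Hypothetical variable with k = 3 levels, so k - 1 = 2 dummies are created.
housing = ["rent", "own", "other", "own", "rent"]
dummies = make_dummies(housing, reference="other")
print(dummies)
```

An individual in the reference level is still identifiable, because it is the only level for which every dummy equals #0#. That is why a #k#-th dummy variable is not needed.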

Example: Adding a Binary Variable to the Model

Consider the following regression equation:

\[\hat{Y}=-12+9X_1\]

Here, #X_1# is a person's age and #\hat{Y}# is their predicted income in thousands of euros.

Now suppose that, besides a person's age, you would also like the model to account for whether or not the person has Dutch nationality. This variable can take on two values: the person is either Dutch or not Dutch.

To incorporate this variable into the model, a dummy variable #X_2# can be introduced, which takes on a value of #1# if the person in question is Dutch and a value of #0# if the person has another nationality.

Suppose the new model is described by the following regression equation:

\[\hat{Y}=-12+9X_1+5X_2\]

Here, the coefficient of the dummy variable is #b_2=5#. Since income is measured in thousands of euros, the model predicts that a person with Dutch nationality earns #5000# euros more than a person of the same age with a different nationality.

On the basis of this dummy variable, it is possible to construct two models: one for people with a Dutch nationality and one for people with another nationality.

The predicted income of someone with Dutch nationality (#X_2=1#) is:
- #\hat{Y}_1=-12+9X_1+5\cdot 1=-7+9X_1#

The predicted income of someone with a different nationality (#X_2=0#) is:
- #\hat{Y}_2=-12+9X_1+5\cdot 0=-12+9X_1#

Notice that both equations have the same slope (the regression coefficient of age, #b_1=9#) but different intercepts. The two regression lines are therefore parallel, and the vertical distance between them is equal to the coefficient of the dummy variable, #b_2=5#.
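
The constant gap between the two lines can be checked numerically. The following Python sketch uses the coefficients from the example equation; the function name `predicted_income` and the age of 30 are assumptions made only for illustration.

```python
def predicted_income(age, is_dutch):
    """Predicted income in thousands of euros: Y_hat = -12 + 9*X1 + 5*X2."""
    b0, b1, b2 = -12, 9, 5      # coefficients from the example equation
    x2 = 1 if is_dutch else 0   # dummy variable for Dutch nationality
    return b0 + b1 * age + b2 * x2


age = 30  # arbitrary illustrative age
dutch = predicted_income(age, is_dutch=True)       # -12 + 270 + 5 = 263
not_dutch = predicted_income(age, is_dutch=False)  # -12 + 270 + 0 = 258
print(dutch - not_dutch)  # the gap equals b2 = 5, whatever the age
```

Changing `age` shifts both predictions by the same amount, so the difference stays at #5#, that is, #5000# euros.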

Example: Multiple Dummy Variables

Consider the following regression equation:

\[\hat{Y}=-12+9X\]

Here, #X# is a person's age and #\hat{Y}# is their predicted income in thousands of euros.

Now suppose that, instead of treating age as a quantitative variable, you want to treat it as a categorical variable by grouping people into #4# age groups: Child, Teen, Adult, and Elder.

Since there are #4# age groups (levels), you need #k-1=4-1=3# dummy variables:

The variable #X_1# is #1# if the person is a child and #0# otherwise.
The variable #X_2# is #1# if the person is a teen and #0# otherwise.
The variable #X_3# is #1# if the person is an adult and #0# otherwise.
For an elderly person, #X_1#, #X_2#, and #X_3# are all #0#, which makes Elder the reference category.

On the basis of these three dummy variables, you can construct four different regression models, one for each age group.

Age group	#X_1#	#X_2#	#X_3#	Regression Model
Child	1	0	0	#\hat{Y}=b_0+b_1#
Teen	0	1	0	#\hat{Y}=b_0+b_2#
Adult	0	0	1	#\hat{Y}=b_0+b_3#
Elder	0	0	0	#\hat{Y}=b_0#
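
The four models in the table follow from a single regression equation that contains all three dummy variables. As a sketch, assuming the coefficients are labelled #b_0# to #b_3# as in the table:

\[\hat{Y}=b_0+b_1X_1+b_2X_2+b_3X_3\]

Substituting the dummy values of each age group gives the corresponding row. For a child, #X_1=1# and #X_2=X_3=0#, so:

\[\hat{Y}=b_0+b_1\cdot 1+b_2\cdot 0+b_3\cdot 0=b_0+b_1\]

For an elderly person, all three dummy variables are #0# and only the intercept remains:

\[\hat{Y}=b_0+b_1\cdot 0+b_2\cdot 0+b_3\cdot 0=b_0\]

Under this coding, #b_0# is the predicted income of the reference group (Elder), and each of #b_1#, #b_2#, and #b_3# is the difference in predicted income between the corresponding age group and the reference group.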