Introduction to Regression Analysis

Chapter 11: Regression Analysis: Simple Linear Regression

Regression Analysis

Regression analysis is a statistical procedure for estimating the relationship between variables. Specifically, regression is used to predict the value of a continuous outcome (dependent) variable on the basis of one or more predictor (independent) variables.

Linear regression analysis evaluates the relationship between variables by finding the best-fitting straight line through a set of data points. The resulting line is called the regression line.
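
As a rough illustration of what "best-fitting" can mean, the sketch below assumes the ordinary least squares criterion, that is, the line that minimizes the sum of squared vertical distances to the data points. The text above does not name a criterion, so this is an assumption. The function name `fit_line` and the toy data are made up for this illustration.

```python
def fit_line(xs, ys):
    """Return (a, b) of the least-squares line y_hat = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of X and Y divided by the variation of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    # Intercept: the least-squares line passes through the point of means.
    b = mean_y - a * mean_x
    return a, b


# Toy data lying exactly on y = 2x + 1 (hypothetical, not from the text).
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
a, b = fit_line(xs, ys)
print(a, b)  # 2.0 1.0
```

Because the toy points lie exactly on a line, the fit recovers that line. With real, scattered data the fitted line would only approximate the points.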

The most straightforward type of regression is Simple Linear Regression.

Simple Linear Regression

In Simple Linear Regression, the value of the outcome variable is predicted using a single predictor variable.

The regression line of a Simple Linear Regression is mathematically described by the following regression equation:

\[\hat{Y}=aX+b\]

Where:

#\hat{Y}# is the predicted value of the outcome variable #Y#.
#X# is the predictor variable.
#a# is the slope of the regression line and is called the regression coefficient.
#b# is the value of #\hat{Y}# when #X=0# and is called the intercept.

Describing the relationship between two variables as a straight line provides an easy way to predict the value of the outcome variable #Y# for a given value of the predictor variable #X#: enter the value of #X# into the regression equation to obtain the predicted value #\hat{Y}#.

Example: Regression Analysis

For #10# days, the owner of an ice cream truck has kept track of how much ice cream he sold and what the maximum temperature in #^\circ{}C# was that day. He has calculated the regression line to find the relationship between the maximum temperature and the amount of ice cream sold.

Take a look at the scatterplot below. The blue dots represent the #10# #\blue{\textbf{data points}}# that serve as the basis for the regression analysis. The #\orange{\textbf{regression line}}# #\hat{Y} = 2.93X - 20.45# is drawn in orange.

Here, #a=2.93# is the regression coefficient. It is the predicted increase in the amount of ice cream sold #Y# for each increase of #1^\circ{}C# in the maximum temperature #X#. For example, if the maximum temperature increases by #2^\circ{}C#, the amount of ice cream sold is predicted to increase by #2\cdot 2.93=5.86#.
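
The reason the increase does not depend on the starting temperature can be made explicit with a short calculation. Writing #\hat{Y}(X)# for the prediction at temperature #X# (a notation introduced here only for this illustration), the intercept cancels when two predictions are subtracted:

\[\hat{Y}(X+2)-\hat{Y}(X)=\bigl(2.93(X+2)-20.45\bigr)-\bigl(2.93X-20.45\bigr)=2\cdot 2.93=5.86\]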

The intercept #b# is #-20.45#. In this case, the negative value of the intercept holds no particular meaning, since it's not possible to sell a negative amount of ice cream.
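
As a small side calculation, not part of the original example, setting the predicted value to zero shows where the regression line crosses the horizontal axis:

\[2.93X-20.45=0 \quad\Longrightarrow\quad X=\frac{20.45}{2.93}\approx 6.98\]

For maximum temperatures below roughly #7^\circ{}C#, the line would therefore predict a negative amount of ice cream sold, which is another sign that it should not be read literally at such temperatures.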

To calculate the predicted amount of ice cream sold at a particular maximum temperature, simply enter a value for #X# into the equation. For example, at a maximum temperature of #X=25#, the predicted amount of ice cream sold is:

\[\hat{Y}=2.93X-20.45=2.93\cdot25-20.45=52.8\]
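
The same calculation can be sketched in a few lines of Python. The coefficient and intercept are taken from the example above. The function name `predict_sales` is hypothetical and chosen only for this illustration.

```python
def predict_sales(max_temp_c):
    """Predicted amount of ice cream sold, using the example's regression line."""
    a = 2.93    # regression coefficient (slope) from the example
    b = -20.45  # intercept from the example
    return a * max_temp_c + b


# Rounding hides floating-point noise in the last decimal places.
print(round(predict_sales(25), 2))  # 52.8
print(round(predict_sales(0), 2))   # -20.45, the intercept
```

The second call shows that the prediction at #X=0# is simply the intercept.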

An important thing to consider when performing a regression analysis is that even a single outlier can have a large impact on the results of the analysis, especially when working with relatively small datasets.

Example: Effect of Outlier

Let's revisit the ice cream truck example, but this time at a temperature of #22# degrees, the owner of the truck sells #500# ice creams.

This value is notably larger than the other data points. Such a data point is called an influential outlier, and it pulls the entire regression line upwards. When you find such an extreme value, you can consider omitting it from the analysis.
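
The effect of such an outlier can be sketched numerically. The ten observations from the example are not listed in the text, so the `temps` and `sales` values below are invented for illustration. The fit assumes ordinary least squares, and `fit_line` is a hypothetical helper written for this sketch.

```python
def fit_line(xs, ys):
    """Least-squares slope a and intercept b for y_hat = a*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    return a, mean_y - a * mean_x


# Hypothetical data (NOT the data behind the example in the text).
temps = [15, 17, 18, 20, 21, 22, 24, 25, 27, 30]
sales = [24, 29, 33, 38, 41, 44, 50, 53, 58, 68]

# Same data, but with 500 ice creams sold on the 22-degree day.
sales_outlier = [500 if t == 22 else s for t, s in zip(temps, sales)]

a, b = fit_line(temps, sales)
a_out, b_out = fit_line(temps, sales_outlier)

# How far the fitted line moves at each observed temperature.
shifts = [(a_out * t + b_out) - (a * t + b) for t in temps]

print(f"without outlier: y_hat = {a:.2f}x {b:+.2f}")
print(f"with outlier:    y_hat = {a_out:.2f}x {b_out:+.2f}")
print(f"smallest upward shift: {min(shifts):.1f}")
```

With this made-up data, the line fitted with the outlier lies above the original line at every observed temperature, even though only one of the ten points changed.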