Chapter Thirteen

McGraw-Hill/Irwin

Linear Regression and Correlation

Important to know if relationships exist among variables.

Egs.

Amount of gas & the mileage

Population vs. precipitation

Recognizing and modeling the relationship between two variables can be useful in predicting.

Eg.

Predicting how much the sales revenue would be if a certain dollar amount is spent on advertising.

The Dependent Variable is the variable being predicted or estimated.

The Independent Variable provides the basis for estimation.  It is the predictor variable.

Correlation Analysis

Measurement of association between two variables.

A Scatter Diagram is a chart that portrays the relationship between two variables.

If you suspect that two variables are related, start by drawing a scatter plot.

Using Excel to create a Scatter Plot (Chart Wizard)

Example on Page 379-81

Correlation Coefficient (Pearson r)

• Measures strength of the relationship between two variables.
• It requires interval or ratio-scaled data.
• It can range from -1.00 to 1.00.
• Positive values indicate a  direct relationship & negative values indicate an  inverse relationship.
• Values of -1.00 or 1.00 indicate perfect and strong correlation.
• Values close to 0.0 indicate weak correlation.

The coefficient of determination ($R^2$)

• It is the square of the coefficient of correlation (r).
• It ranges from 0 to 1.
• It is the proportion of the total variation in the dependent variable (Y) that is explained or accounted for by the variation in the independent variable (X).

Eg. 80% of the variation in miles driven is accounted for by the number of gallons in the tank. The remaining 20% is influenced by road conditions, number of passengers, etc. (more discussion later!).
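As a rough sketch of how the two coefficients relate, the snippet below computes Pearson's r from its definition and then squares it. The data are an assumption: five (years, units) pairs chosen only because they are consistent with the summary figures quoted in the worked example later in the chapter. The function name `pearson_r` is likewise illustrative.

```python
import math

# Assumed illustrative data (not given in the text):
# X = years on the job, Y = units produced in a week.
years = [14, 7, 3, 15, 11]
units = [6, 5, 3, 9, 7]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(years, units)
r_squared = r ** 2  # coefficient of determination

print(f"r = {r:.3f}, R^2 = {r_squared:.2f}")
```

With these assumed numbers, r is about 0.89 and its square is 0.80, which matches the "80% of the variation" style of statement used above.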

Correlation and Cause

A high correlation indicates a strong relationship between the variables.

But it does not necessarily mean that one variable influences the other.

Another eg:

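A study measured the number of TV sets per person (say, X) and the life expectancy (say, Y) for every country.

The study found a high correlation.

On this basis, it was concluded that countries with more TV sets have higher life expectancies. The correlation alone does not show that TV sets lengthen lives; a third factor, such as national wealth, may influence both.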

Regression Analysis

If there is a strong relationship (r value) between two variables, one can estimate a linear model of the form:

$Y' = a + bX$   [a = estimate of $\alpha$; b = estimate of $\beta$]

where

$Y'$ is the predicted value and Y is the actual value for a given X.

a is the Y-intercept (the value of $Y'$ when X = 0).

b is the slope of the line, or the average change in $Y'$ for each change of one unit in X.

The least squares principle is used to fit the line, i.e., $\sum (Y - Y')^2$ is minimized.

a and b are calculated as:

$b = r \dfrac{s_y}{s_x}$

$a = \bar{Y} - b\bar{X}$

{ The regression line always passes through $(\bar{X}, \bar{Y})$ }

{ If r = 1, the slope is similar to $\Delta y / \Delta x$ }
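The slope and intercept formulas can be tried out in a few lines. This is only a sketch under assumptions: the five data pairs are not given in the text and were picked because they reproduce the equation quoted in the worked example, and r is computed here from the sample covariance.

```python
import math
import statistics

# Assumed illustrative data (not given in the text):
# X = years on the job, Y = units produced in a week.
years = [14, 7, 3, 15, 11]
units = [6, 5, 3, 9, 7]

n = len(years)
mean_x = statistics.mean(years)
mean_y = statistics.mean(units)
s_x = statistics.stdev(years)  # sample standard deviation of X
s_y = statistics.stdev(units)  # sample standard deviation of Y

# Pearson r from the sample covariance and the two standard deviations.
covariance = sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(years, units)) / (n - 1)
r = covariance / (s_x * s_y)

b = r * s_y / s_x        # slope: b = r * (s_y / s_x)
a = mean_y - b * mean_x  # intercept: a = mean(Y) - b * mean(X)

print(f"Y' = {a:.1f} + {b:.1f}X")

# The fitted line passes through the point of means.
print(math.isclose(a + b * mean_x, mean_y))
```

For these assumed values the fitted line works out to an intercept of 2 and a slope of 0.4, and the final check confirms that the line passes through the point of means.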

[Figure: error in prediction, the gap between the actual value (Y) and the predicted value ($Y'$) for a given X. $(Y' - Y)$ is the error in prediction.]

Example (page 400)

The production supervisor of XYZ Inc. looked at the number of units produced by 5 of his employees during a week. He also looked at how long they had been working for the company.

[Table: years with the company (X) and number of units produced (Y) for the 5 employees]

The supervisor wants to know

(i) if there is a correlation between X and Y (i.e., r)

(ii) the equation of the regression line (i.e., $Y' = a + bX$)

(iii) how much of the variation in Y is explained by X (i.e., $R^2$)

Using Excel for Regression

1. What is the independent variable?

2. What is the dependent variable?

3. What is the regression equation?

4. Is the model a significant predictor of #Units?

5. Is Years a significant predictor of #Units?

6. If one had Years=20, predict #Units.

7. Construct a 95% CI around it.

Use SE to calculate CI
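Questions 6 and 7 can be sketched in code. Several things here are assumptions rather than the book's procedure: the five data pairs are illustrative, the interval is the standard confidence interval for the mean of Y at a given X, and the t value 3.182 is the usual table entry for 95% confidence with n - 2 = 3 degrees of freedom. The Excel route may use a simpler "prediction plus or minus t times SE" shortcut.

```python
import math

# Assumed illustrative data (not given in the text).
years = [14, 7, 3, 15, 11]  # X
units = [6, 5, 3, 9, 7]     # Y

a, b = 2.0, 0.4   # regression equation Y' = 2 + 0.4X
T_CRIT = 3.182    # t table value, 95% confidence, 3 degrees of freedom

n = len(years)
mean_x = sum(years) / n
sxx = sum((x - mean_x) ** 2 for x in years)

# Standard error of estimate: sqrt(SSE / (n - 2)).
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(years, units))
std_error = math.sqrt(sse / (n - 2))

def confidence_interval(x_new):
    """Return (prediction, lower, upper) for the mean of Y at x_new."""
    y_pred = a + b * x_new
    margin = T_CRIT * std_error * math.sqrt(
        1 / n + (x_new - mean_x) ** 2 / sxx
    )
    return y_pred, y_pred - margin, y_pred + margin

y_pred, low, high = confidence_interval(20)
print(f"Predicted #Units at Years=20: {y_pred:.1f}")
print(f"95% CI: ({low:.2f}, {high:.2f})")
```

Years = 20 lies outside the range of the assumed data (3 to 15), so the interval is wide and the prediction is an extrapolation.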

Watch the screencam tutorial in the book CD to learn how to use Excel for regression

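A p-value of about 0.04 means that, if Years and #Units were unrelated, a fit this strong would occur in only about 4% of samples. It does not mean the equation is correct in 96% of cases.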

Calculating Total Variation (page 402)

Now, we want to find out how much of this variation is contributed by Years on Job.


The sample mean of Y is 6. Total variation is given by $\sum (Y - \bar{Y})^2$.

[see Chapter 3, pages 78 & 80]
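The total variation is just a sum of squared deviations from the mean, which the following sketch computes. The five Y values are an assumption, chosen to have the sample mean of 6 stated above; the actual table is not reproduced in the text.

```python
# Assumed illustrative Y values (units produced), with a sample mean of 6.
units = [6, 5, 3, 9, 7]

mean_units = sum(units) / len(units)

# Total variation: sum of squared deviations of Y from its mean.
total_variation = sum((y - mean_units) ** 2 for y in units)

print(f"mean = {mean_units}, total variation = {total_variation}")
```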

When we came up with the Regression equation,

$Y' = 2 + 0.4X$

we added the assumption that Years on the job & Production are related.

Let us see how well this equation fits our data.

It can be seen that the "fit" between $Y'$ and Y is not "perfect".

Let us calculate the error variation as shown in the next slide.

Catching the "Error"

[Table: actual (Y) and predicted ($Y'$) values]

Calculating Unexplained Variation (Error in prediction) (page 401)
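A minimal sketch of the unexplained variation: predict each Y from the regression equation Y' = 2 + 0.4X, take the error (Y' - Y), and sum the squared errors. The data pairs are assumed for illustration, and `predict` is a hypothetical helper name.

```python
# Assumed illustrative data (not given in the text).
years = [14, 7, 3, 15, 11]  # X
units = [6, 5, 3, 9, 7]     # Y (actual)

def predict(x):
    """Regression equation from the example: Y' = 2 + 0.4X."""
    return 2 + 0.4 * x

unexplained_variation = 0.0
for x, y in zip(years, units):
    y_pred = predict(x)
    error = y_pred - y  # error in prediction, (Y' - Y)
    unexplained_variation += error ** 2
    print(f"X={x:2d}  Y={y}  Y'={y_pred:.1f}  error={error:+.1f}")

print(f"Unexplained variation = {unexplained_variation:.1f}")
```

With these assumed values the squared errors add up to about 4.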

Calculating Coefficient of Determination

$R^2 = 1 - \dfrac{\text{Unexplained variation}}{\text{Total variation}}$

Substituting the values for the Unexplained & Total variations from our example problem, we get

$R^2 = 1 - 4/20 = 16/20 = 0.8$ *

Thus, we say that 80% of the variation in weekly production is explained by years of experience on the job.

* Compare this with the computer output on the next slide

$R^2 = \dfrac{\text{Explained variation}}{\text{Total variation}}$    (Equation 1)

$\text{Explained variation} = \text{Total variation} - \text{Unexplained variation}$    (Equation 2)

$SSR = SST - SSE$

Substituting Equation 2 in Equation 1,

$R^2 = \dfrac{\text{Total variation} - \text{Unexplained variation}}{\text{Total variation}}$    (Equation 3)
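The identity behind Equations 1 to 3 can be checked numerically. In this sketch the data pairs are assumed for illustration, and SSR is computed directly as the sum of squared deviations of the predictions from the mean of Y, so that it can be compared with SST - SSE.

```python
# Assumed illustrative data (not given in the text).
years = [14, 7, 3, 15, 11]  # X
units = [6, 5, 3, 9, 7]     # Y

mean_y = sum(units) / len(units)
predicted = [2 + 0.4 * x for x in years]  # Y' = 2 + 0.4X

sst = sum((y - mean_y) ** 2 for y in units)                # total variation
sse = sum((y - p) ** 2 for y, p in zip(units, predicted))  # unexplained
ssr = sum((p - mean_y) ** 2 for p in predicted)            # explained

r2_equation_1 = ssr / sst
r2_equation_3 = (sst - sse) / sst

print(f"SST = {sst:.1f}, SSE = {sse:.1f}, SSR = {ssr:.1f}")
print(f"R^2 (Equation 1) = {r2_equation_1:.2f}")
print(f"R^2 (Equation 3) = {r2_equation_3:.2f}")
```

Both routes give the same value because, for a least squares line, the explained and unexplained parts add up to the total variation.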

Interpreting Excel Regression Output

Make sure you know how to interpret the Excel output.

(No kidding!)

[Excel regression output with callouts: "p-value"; "Use this for CI"]

F, $R^2$ & SE tell whether the regression model is really useful for prediction.
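To make the three summary figures less of a black box, the sketch below recomputes them by hand for a one-predictor model. The data pairs are assumed for illustration, and the critical t value 3.182 (95% confidence, 3 degrees of freedom) is taken from a standard t table; Excel reports a p-value instead of comparing against a table value.

```python
import math

# Assumed illustrative data (not given in the text).
years = [14, 7, 3, 15, 11]  # X
units = [6, 5, 3, 9, 7]     # Y
a, b = 2.0, 0.4             # regression equation Y' = 2 + 0.4X
T_CRIT = 3.182              # t table value, 95% confidence, 3 df

n = len(years)
mean_x = sum(years) / n
mean_y = sum(units) / n
sxx = sum((x - mean_x) ** 2 for x in years)

sst = sum((y - mean_y) ** 2 for y in units)
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(years, units))
ssr = sst - sse

r_squared = ssr / sst
std_error = math.sqrt(sse / (n - 2))          # standard error of estimate
f_stat = (ssr / 1) / (sse / (n - 2))          # F for one predictor
t_slope = b / (std_error / math.sqrt(sxx))    # t statistic for the slope

# With one predictor, F equals the square of the slope's t statistic.
significant = abs(t_slope) > T_CRIT

print(f"R^2 = {r_squared:.2f}, SE = {std_error:.3f}")
print(f"F = {f_stat:.2f}, t = {t_slope:.3f}")
print(f"Significant at the 5% level: {significant}")
```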

Multiple Regression

You can extend the idea of linear regression and make the dependent variable depend on more than one independent variable.

Eg. The price of a house can be dependent on Sq ft, Number of bedrooms, Baths, Pool, Garage, etc. [see page 503].

The general multiple regression equation is:

$Y' = a + b_1 X_1 + b_2 X_2 + \dots + b_n X_n$
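One way to see what fitting this equation involves is a small least squares solver built on the normal equations. Everything below is a sketch under assumptions: the house data are synthetic, generated from a known equation (price = 50 + 100 × sq ft in thousands + 10 × bedrooms, in $1000s) so that the fit can be checked, and `solve` and `fit` are hypothetical helper names. In practice Excel or a statistics library does this step.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        # Partial pivoting: bring the largest entry into the pivot row.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit(rows, ys):
    """Least squares fit of Y' = a + b1*X1 + ... + bn*Xn."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 for intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)

# Synthetic data: (sq ft in thousands, bedrooms) and price in $1000s.
houses = [(1.0, 2), (1.5, 3), (2.0, 3), (2.5, 4), (3.0, 5), (1.2, 3)]
prices = [170, 230, 280, 340, 400, 200]

a, b1, b2 = fit(houses, prices)
print(f"Y' = {a:.1f} + {b1:.1f}*X1 + {b2:.1f}*X2")

# Predicted price for a hypothetical 2,200 sq ft, 4-bedroom house.
predicted_price = a + b1 * 2.2 + b2 * 4
print(f"Predicted price: {predicted_price:.1f}")
```

Because the synthetic prices follow the generating equation exactly, the fit recovers its coefficients; real data would leave some unexplained variation.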

Practice!

1. What are the independent variables?

2. What is the dependent variable?

3. What is the regression equation?

4. Is the model a significant predictor of Price?

5. Is Bedrooms a significant predictor of Price?

6. Is Baths a significant predictor of Price?

7. If one had 8 Bedrooms, predict Price.

8. Construct a 95% CI around it. 