Multiple Regression and Excel

Multiple Regression and Excel (Overview)

In the lectures, you have been introduced to the idea of regression based on a two quantitative variables. We usually label the explanatory variable X and the response variable Y. With such data, regression allows you to predict Y values based on X observations via the regression line. In a standard regression for Y on X the regression line has the form Y-pred = a + bX. [Herein, Y-pred is used in place of the notation adopted in Section 3.3 of Agresti and Franklin, Y-hat, and refers to the Y value predicted by the regression equation.]

Multiple regression

In some studies, it may be considered too simplistic or misleading to assume that the response variable is associated to only one explanatory variable. As suggested in Example 14 of Chapter 3 [Agresti and Franklin 129-30], for a study of crime rates in Florida [by county] one may want to associate such observations with rates of education and rates of urbanization. That is, the person conducting the study may well want to examine the extent to which rates of education and urbanization are associated with crime rates in Florida.

To understand the possible association between two [or more] explanatory variables X₁, X₂ [in our example, rate of education and rate of urbanization] and a single response variable Y [crime rate], one performs a multiple regression. The result of a multiple regression is a linear equation of the form

Y-pred = a + (b₁)X₁ + (b₂)X₂.

A predicted Y value can be obtained from a particular X₁ observation and X₂ observation by using the regression equation in the usual way. The question is: What values are to be used for a, b₁, and b₂?

Luckily, Excel can help you here. Under the Tools -> Data Analysis menu you can choose Regression. This allows you to do a multiple regression. The data for the X₁, X₂ [etc], and Y variables should already be entered into columns. The Regression toolkit will ask for the cells containing the Y [response variable] information and all the cells containing X [explanatory variable] information. The latter may be contained in several [adjacent] columns. The rest of the toolkit is self-explanatory.

Included in this supplementary folder you'll find the [multiple] regression analysis performed on our example with Excel that reflects aspects of regression as discussed in Chapter 3 of the text.

The results are quite another matter. There are really only a few pieces of information produced that you need to focus on at the moment. Two are contained in the small table labeled 'Regression Statistics.' The "Multiple R' measures the strength of the association; it doesn't reflect the direction [see below]. The 'R Square' value is, as the name implies, the square of the correlation coefficient for the regression. This measures the strength of the regression prediction compared with predicting solely by the response mean. The important thing is: The closer to 1 the stronger the regression prediction, the closer to 0 the weaker. [To be precise: (r²)% of the observed variability in Y can be attributed to X. For more information on this measure, see Section 3.3 of the text.]

The only other pieces of information that can be explained at this point are in the last table of results. The column of information labeled 'Coefficients' contains, in order, the values a, b₁ and b₂ [etc] that you need to generate the regression equation for your data.

Not only are a, b₁ and b₂important values for generating the regression equation, they have another important use. For example, the value b₁ tells you how Y is predicted to vary as X₁ is varied and X₂ is held constant. In statistical parlance, you can see how Y varies with the variable X₁ while controlling for the influence of the variable X₂ by considering b₁. [A simple regression for Y on X₁wouldn't provide this information, since the influence of X₂ isn't involved in the regression analysis.] The value of b₂ can be similarly useful in predicting how Y varies with X₂while controlling for the influence of X₁.

In our example, the value of b₁ is -0.6. In this case, we could say that, controlling for the effects of urbanization [that is, among counties having the same rate of urbanization], a 1unit increase in the rate of education would suggest a decrease of about 0.6 [6 crimes per 10,000 residents] in the crime rate.

When Excel performs a regression with more than one explanatory variable, you won't be able to measure the direction of the association or obtain a scatter plot. This is because the regression equation based on two or more explanatory variables as described above would generate a [hyper]plane in three [or more] dimensions. However, you can obtain residual plots for each of the explanatory variables used [turn the option Residual Plots ON by checking the appropriate box], which might help to identify heteroscedastic behavior or regression outliers in your data set.

We'll revisit [multiple] regression analyses with Excel in the context of material discussed in later chapters of the text.

Non-linear scatter plots and multiple regression

Imagine a scatter plot suggests that the association between two variables X and Y is non-linear. Perhaps the scatter plot looks more like a parabola. This might indicate a quadratic relationship between X and Y [that is, Y-pred = a + (b₁)X + (b₂)X²]. In this case, you can use multiple regression to fit a parabola to the data. To do this, you would set up three columns of data: X, X², and Y. With these, you would then have Excel carry out a multiple regression for Y on the variables X and X².

Included in this supplementary folder is an example of a multiple regression performed with Excel that applies to a quadratic [non-linear] analysis.