In the lectures, you have been introduced to the idea of regression based on a two quantitative variables. We usually label the explanatory variable X and the response variable Y. With such data, regression allows you to predict Y values based on X observations via the regression line. In a standard regression for Y on X the regression line has the form Y-pred = a + bX. [Herein, Y-pred is used in place of the notation adopted in Section 3.3 of Agresti and Franklin, Y-hat, and refers to the Y value predicted by the regression equation.]

**Multiple regression**

In some studies, it may be considered too simplistic or misleading
to assume that
the response variable is associated to only one explanatory variable.
As suggested in Example 14 of Chapter 3 [Agresti and Franklin 129-30],
for a study of crime rates in Florida [by county] one may want to
associate
such
observations with rates of education **and** rates of urbanization.
That is, the person conducting the study may well want to
examine
the extent to which rates of education **and** urbanization
are
associated with crime rates in Florida.

To understand the possible association between two [or more]
explanatory variables X_{1}, X_{2} [in our example,
rate of education and rate of urbanization] and a single response
variable Y [crime rate], one performs a **multiple
regression**. The result of a
multiple regression is a linear equation of the form

Y-pred = a + (b_{1})X_{1} + (b_{2})X_{2}.

A predicted Y value can be obtained from a particular X_{1}
observation and X_{2} observation by using the regression
equation in the usual way. The question is: What
values
are to be used for a, b_{1}, and b_{2}?

Luckily, Excel can help you here. Under the Tools -> Data
Analysis menu
you can choose Regression.
This allows you to do a multiple regression.
The data for the X_{1}, X_{2} [etc], and Y variables
should already be entered into columns. The Regression toolkit will ask
for the cells containing the Y [response variable] information and all
the cells containing X [explanatory variable] information. The latter
may
be contained in several [adjacent] columns. The rest of the toolkit is
self-explanatory.

Included in this supplementary folder you'll find the [multiple] regression analysis performed on our example with Excel that reflects aspects of regression as discussed in Chapter 3 of the text.

The results are quite another matter. There are really only a few
pieces of information produced that you need to focus on at the moment.
Two are contained
in the small table labeled 'Regression Statistics.' The "Multiple R'
measures the strength of the association; it doesn't reflect the
direction [see below]. The 'R Square'
value is, as the name implies, the square of the correlation
coefficient for the regression. This measures the strength of the
regression prediction compared with predicting solely by the response
mean. The important thing is: The closer to 1 the stronger the
regression prediction, the closer to 0 the weaker. [To be
precise: (r^{2})% of the observed variability in Y can be
attributed to X. For more
information on this measure, see Section 3.3 of the text.]

The only other pieces of information that can be explained at this
point are in the last table of results. The column of information
labeled 'Coefficients' contains, in order, the values a, b_{1}
and b_{2} [etc]
that you need to generate the regression equation for your data.

Not only are a, b_{1} and b_{2 }important values
for generating the regression equation, they have another important
use. For example, the value b_{1} tells you how Y is
predicted to vary
as X_{1} is varied **and** X_{2} is held
constant. In statistical parlance, you can see how Y varies with
the variable X_{1} while **controlling for the influence of
the variable** X_{2} by considering b_{1}. [A
simple regression for Y on X_{1 }wouldn't provide this
information, since the influence of X_{2} isn't involved in the
regression analysis.] The value of b_{2} can be similarly
useful in predicting how Y varies with X_{2 }while controlling
for the influence of X_{1}.

In our example, the value of b_{1} is -0.6. In this
case, we could say that, controlling for the
effects of urbanization [that is, among counties having the same rate
of urbanization],
a 1unit increase in the rate of education would suggest a decrease of
about 0.6 [6 crimes per 10,000 residents] in the crime rate.

When Excel performs a regression with more than one explanatory variable, you won't be able to measure the direction of the association or obtain a scatter plot. This is because the regression equation based on two or more explanatory variables as described above would generate a [hyper]plane in three [or more] dimensions. However, you can obtain residual plots for each of the explanatory variables used [turn the option Residual Plots ON by checking the appropriate box], which might help to identify heteroscedastic behavior or regression outliers in your data set.

We'll revisit [multiple] regression analyses with Excel in the
context of material discussed in later chapters of the text.

**Non-linear scatter plots and
multiple regression**

Imagine a scatter plot suggests that the association between two
variables X and Y is non-linear. Perhaps the scatter plot looks
more like a parabola.
This might indicate a quadratic relationship between X and Y [that is,
Y-pred = a + (b_{1})X + (b_{2})X^{2}]. In
this case, you can use multiple regression to fit a parabola to the
data.
To do this, you would set up three columns of data: X, X^{2},
and Y. With these, you would then have Excel carry out a multiple
regression
for Y on the variables X and X^{2}.

Included in this supplementary folder is an example of a multiple
regression performed with Excel that applies to a quadratic
[non-linear] analysis.

Last revised: January 2007. Questions to: kvalente@mail.colgate.edu

Copyright 2003 © Colgate University. All rights reserved.