Spreadsheet Macros: Histogram, Regression and ANOVA

Many statistical calculations are fairly complex involving many inputs and outputs. The process of setting up your spreadsheet for e.g. creating a histogram is the same for each variable but involves new data and thus new results. Excel (and other spreadsheets) provide "macros" or scripts to set up and perform these operations in a convenient fashion.

There are some disadvantages to using these Macros. Most do not include links to the original data. So if you go back and correct a data point, or if you have random numbers generating your data and they get recomputed, then the output of the macro no longer reflects the data in your spreadsheet. This is a serious deficiency and has wasted a lot of time for students who get caught without realizing what is going on.

In this discussion we will be looking at two analysis tools: Histogram and Regression. We describe how to get this info without any macro package and then with each of the Macro packages for Excel: Data Analysis Toolpak and StatPlusLE. We start by explaining how to set up the Macro packages.

Loading Macro Tools

The macro tool of choice for Excel is called the Data Analysis Toolpak. It appears under the Data tab in excel 2007 otherwise the Tools menu. With a standard install it is not loaded so you must add it.

If you have Excel 2008 (which is a Mac version) then Microsoft did not include a scripting/macro language and thus you have no Data Analysis Toolpak. In its place Microsoft is suggesting people use a free program (with lots of advertisement to buy the "pro" version) called StatPlusLE. It is available at: http://www.analystsoft.com/en/products/statplusmacle/ Once you download and install it it will run alongside Excel. Rather than running a single application, it runs as a separate application transferring data back and forth. The slight inconvenience here is partially offset by slightly better output (more info with better formatting). It installs like most any Mac application. Just follow the instructions.

Histogram

Here we wish to take a list of data for one variable and divide it into "bins", counting how many datapoints lie in each bin. Then we plot a bar chart showing the distribution of that variable. Warning: Excel does not allow bars of unequal width, so we will restrict ourselves to equal width bins.

You start with a column of data and a column of "bins". The number in the bin column are actually the top end of each bin. There will be one more bar in the histogram than numbers in the bin range because the last bar represents all values larger than the final "top end of each bin". This is often denoted by the term "More". Usually the bin range is created by hand and it is tradition to label it "Bin" at the top of the column. It is also reasonable to put the word "More" after the last value. For example, to use the Bins 0-5, 5-10, 10-15 the column should be entered as:

Bin
   5
  10
  15
More

Now you are ready for creating the histogram.

No Macros

Use the FREQUENCY() function. This is an Excel array function which means that it returns more than one value and you must press Control-Shift-Return to have it fill all the values.

Type "Frequency" as a label for the new column in the cell next to the cell with "Bin" in it. Then select the whole column down to the cell next to the word "More". Type the function expression (or use the "function wizard" button remembering to press Control-Shift while hitting the OK button). The expression should be:

=frequency(A1:A100, B2:B5)

with A1:A100 replaced by the range of input data and B2:B5 replaced by your bin values (without the words Bin or More). Now press Control-Shift-Return (or press Control-Shift and click on OK) and all the values should appear--including one next to the word More. If you forgot the Control-Shift, only one number will appear. Just highlight all the cells again, click in the formula bar at the top to activate the formula and then press Control-Shift-Return.

Now that you have the data, you can create a chart. Select both the bin and frequency columns that you want to plot (including labels). Press the chart wizard button or the Insert->Chart menu. Choose a Column chart. Then the next screen has two tabs at the top: Data Range and Series. Choose the "Series" tab and make the bins column the Category X-axis labels, with the frequency values being the series values. You can preview the chart in this window. Type a title, x-label and y-label in the next screen as desired. When you press Finish the chart will appear in front of your spreadsheet. you can then move it as you like.

Your hand-made table and chart will automatically be updated when you change the data.

Data Analysis

Select the Data Analysis item and then select Histogram from the dialog box. It will ask for two ranges: the input range is for the data and the bin range is for your bins. Press the arrow buttons beside each entry box to select the ranges by hand if you prefer that to typing them in. Press the arrow button again when done selecting. Select only the numbers not the column labels. If you want a chart, click the checkbox for Chart Output. Finally, you need to choose where to put the output. That can be on another worksheet, or in any cell you choose. If you choose a cell, leave enough room down and to the right for the output.

StatPlusLE

Open the StatPlusLE program. Select the menu Statistics->Basic Statistics and Tables->Histogram. Select your data column for the "Continuous Variables" and the bin values for your bin range. The buttons with arrows allow you to use your mouse to select the ranges. After selecting the range get back to StatPlusLE and it will be entered into the blank for you. You don't need to use the advanced features of Frequency Variables (which could tell how many times a value should be repeated) or Layer Variables. Press OK and a new worksheet will be created with the results of the histogram in both table and chart form. You can copy and paste this back into your worksheet if you like.

Regression

Here we wish to find the best linear fit for a single dependent variable Y as a function of one or more independent variables X. If you just want the slope and intercept of a line with a single column of X data use the =SLOPE() and =INTERCEPT() functions. If you want more information about the regression and tests of hypotheses for whether the slopes are zero, use the LINEST() function or the Macros (which use the LINEST function).

You should start with a single column of Y-values and a set of X-values with each X variable in its own column and all X values next to each other.

No Macros

Set up a place for the output: Type the words in a column with some room to the right of it. This will be the left side of your output. The column headings for the output should be the names of your X variables in order followed by a column for the Intercept term. This might look like:

                  GNP       Unemployment      Intercept
Coefficients
Std Error
R^2 and SE_y
F and DoF
SSreg, SSresid

Now select the region in the table where the results go. Type the LINEST function (or paste using the function wizard) with the correct data columns for y-variables and x-variables and the values True for the 3rd and 4th arguments which correspond to intercept and stats respectively. The intercept flag can be False if you want your line to go through the origin. The stats flag should be turned on to get the stats you want.

Press Control-Shift-Return when ready. There will be some cells that show an error code because there is no output for that cell. I don't know how to get rid of the error messages. If they really bug you use one of the macros below.

Your regression results will now adjust if you change your data.

Data Analysis

Select the Tools->Data Analysis menu item and the Regression tool from the resulting dialog box. That will ask you for a range of y-values and a range of x-values. If you data columns have labels at the top, select those labels too and check the Labels checkbox in the dialog window. Select an output range with plenty of space to the right and below. There is quite a bit of output from this macro. Check any plots you want to be created. Then press OK.

The output appears in sections: Regression Statistics, ANOVA and then intercept and coefficients. Your instructor can tell you More about what these items mean.

StatPlusLE

Open the StatPlusLE program. Select the Statistics->Regression->Linear Regression menu item. Enter the ranges for the Dependent(Y) variable and then the Independent(X) variables. When you press OK a new worksheet will open with lots of useful results. This includes all from the Data Analysis Toolpak (which your instructor can explain) plus a few more (like the formula for the best line and residual values).

ANOVA

ANOVA is used to determine whether a continuous dependent variable (Y) depends on categorical/group independent variables (X). It can also be performed using regression analysis by introducing a binary dependent variable for each group. Thus some people group ANOVA into regression analysis. Both require continuous dependent variables. A nice description of the problem can be found at http://math.colgate.edu/math102/dlantz/examples/ANOVA/anova.html

No Macros

I don't know of any function that returns the results for ANOVA. If someone does contact me.

Data Analysis

Select the Tools->Data Analysis menu item and the ANOVA single factor tool from the resulting dialog box. That will ask you for the data range and whether the data is grouped by column or row. Also select an output range with some room to the right and below. The output looks like that on website referred to above and your instructor can give you details.

StatPlusLE

Open the StatPlusLE program. Select the Statistics->ANOVA->Simple ANOVA(One way) menu item. Enter the data range. A new worksheet is created with the results. The results are essentially identical to that of the Data Analysis Toolpak.