The "Sum of Squares Total" in ANOVA
Many of the computer implementations of ANOVA, including the one
in Excel, print out two values that are not used in the later
steps of ANOVA: the sum of the SSE (the sum of the squared
deviations within samples from the sample averages) and the SSG
(the sum of the squared deviations of the sample averages from
the overall average, weighted by the size of the samples), and the
sum of the DFE and the DFG. The first of these is
often denoted SST and called the "total squared deviation
(from the average)", because it is also equal to the sum of the squared
deviations of all the data values from the grand average. And the
second, denoted DFT, is called the total degrees of freedom.
With N data values in all, divided among I samples, it is easy to
see that
DFT = DFE + DFG = (N - I) + (I - 1) = N - 1,
and this is a reasonable quantity to call the "total degrees of
freedom". But
it is not so obvious that the two ways of interpreting SST,
on the one hand as the sum of SSE and SSG, and
on the other hand as the sum
of the squares of the differences of the data values from the overall
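average, give the same value. Writing $x_{gi}$ for the $i$-th data
value in sample $g$, $n_g$ for the size of sample $g$,
$\mathrm{AV}_g$ for the average of sample $g$, and $\mathrm{AV}$ for
the overall average, the claim amounts to the following equation:

$$\sum_{g=1}^{I}\sum_{i=1}^{n_g} (x_{gi} - \mathrm{AV}_g)^2 + \sum_{g=1}^{I} n_g\,(\mathrm{AV}_g - \mathrm{AV})^2 = \mathrm{SSE} + \mathrm{SSG} = \mathrm{SST} = \sum_{g=1}^{I}\sum_{i=1}^{n_g} (x_{gi} - \mathrm{AV})^2$$

At first glance, it looks reasonable that the two end expressions
are equal. But in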
general, it is not true that
$$(A - B)^2 + (B - C)^2 = (A - C)^2.$$
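A short expansion, offered here as an illustration rather than as part of the original argument, shows where the naive equation goes wrong: writing A - C as (A - B) + (B - C) and squaring leaves a cross term.

```latex
\begin{aligned}
(A - C)^2 &= \bigl((A - B) + (B - C)\bigr)^2 \\
          &= (A - B)^2 + 2(A - B)(B - C) + (B - C)^2 .
\end{aligned}
```

So the two sides agree only when the middle term 2(A - B)(B - C) vanishes. With A = 3, B = 2, C = 0 (values picked only for this example), (A - B)^2 + (B - C)^2 = 1 + 4 = 5, whereas (A - C)^2 = 9; the missing 4 is the middle term 2(1)(2).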
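Because of the squaring, there are several "middle terms" in these
expressions. In this case, do they really "cancel out", so that the
two interpretations of SST are equal? To see why it works here, we
note first that the definitions of $\mathrm{AV}_g$ and $\mathrm{AV}$
(with $x_{gi}$ the $i$-th data value in sample $g$, $n_g$ the size of
that sample, and $N = \sum_g n_g$) give us some substitutions to use:

$$\sum_{i=1}^{n_g} x_{gi} = n_g\,\mathrm{AV}_g, \qquad \sum_{g=1}^{I}\sum_{i=1}^{n_g} x_{gi} = \sum_{g=1}^{I} n_g\,\mathrm{AV}_g = N\,\mathrm{AV}.$$

Using these and some familiar facts about summations, starting with
the more complicated expression, SSE + SSG written out in full, we
have:

$$\begin{aligned}
&\sum_{g}\sum_{i} (x_{gi} - \mathrm{AV}_g)^2 + \sum_{g} n_g\,(\mathrm{AV}_g - \mathrm{AV})^2 \\
&\quad= \Bigl[\sum_{g}\sum_{i} x_{gi}^2 - 2\sum_{g} \mathrm{AV}_g \sum_{i} x_{gi} + \sum_{g} n_g\,\mathrm{AV}_g^2\Bigr]
      + \Bigl[\sum_{g} n_g\,\mathrm{AV}_g^2 - 2\,\mathrm{AV}\sum_{g} n_g\,\mathrm{AV}_g + N\,\mathrm{AV}^2\Bigr] \\
&\quad= \sum_{g}\sum_{i} x_{gi}^2 - \sum_{g} n_g\,\mathrm{AV}_g^2 + \sum_{g} n_g\,\mathrm{AV}_g^2 - N\,\mathrm{AV}^2 \qquad (*) \\
&\quad= \sum_{g}\sum_{i} x_{gi}^2 - N\,\mathrm{AV}^2,
\end{aligned}$$

where the second and third terms in the expression labelled (*) add
to zero. In a similar way, but working on the sum of the squared
deviations from the overall average, we have:

$$\begin{aligned}
\sum_{g}\sum_{i} (x_{gi} - \mathrm{AV})^2
&= \sum_{g}\sum_{i} x_{gi}^2 - 2\,\mathrm{AV}\sum_{g}\sum_{i} x_{gi} + N\,\mathrm{AV}^2 \\
&= \sum_{g}\sum_{i} x_{gi}^2 - 2N\,\mathrm{AV}^2 + N\,\mathrm{AV}^2 \\
&= \sum_{g}\sum_{i} x_{gi}^2 - N\,\mathrm{AV}^2.
\end{aligned}$$

This is the same expression as we got from SSE + SSG, so they are
indeed equal.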
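The identity can also be checked numerically. The following is a minimal sketch in Python, assuming the data are supplied as a list of samples; the function name `anova_sums` and the data values are made up for illustration and are not taken from Excel or any other ANOVA package.

```python
from math import isclose


def anova_sums(samples):
    """Return (SSE, SSG, SST, DFE, DFG, DFT) for a list of samples.

    Hypothetical helper for illustration: each sample is a non-empty
    list of numbers, and SST is computed directly from the overall
    average rather than as SSE + SSG.
    """
    all_values = [x for sample in samples for x in sample]
    n_total = len(all_values)      # N: total number of data values
    n_samples = len(samples)       # I: number of samples
    grand_avg = sum(all_values) / n_total

    sse = 0.0
    ssg = 0.0
    for sample in samples:
        n_g = len(sample)
        avg_g = sum(sample) / n_g
        # Squared deviations within the sample from the sample average
        sse += sum((x - avg_g) ** 2 for x in sample)
        # Squared deviation of the sample average, weighted by sample size
        ssg += n_g * (avg_g - grand_avg) ** 2

    # Squared deviations of every data value from the overall average
    sst = sum((x - grand_avg) ** 2 for x in all_values)

    return sse, ssg, sst, n_total - n_samples, n_samples - 1, n_total - 1


# Made-up data: three samples of unequal size
samples = [[4.0, 5.0, 6.0], [7.0, 9.0], [1.0, 2.0, 3.0, 8.0]]
sse, ssg, sst, dfe, dfg, dft = anova_sums(samples)

print(sse, ssg, sst)   # SSE = 33.0, SSG = 27.0, SST = 60.0
print(dfe, dfg, dft)   # DFE = 6, DFG = 2, DFT = 8

assert isclose(sse + ssg, sst)
assert dfe + dfg == dft
```

The comparison uses `isclose` rather than `==` because, for data less tidy than these, floating-point rounding can make the two sides differ in the last few digits even though they are equal algebraically.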