# 2.2 - One-Way ANOVA Sums of Squares, Mean Squares, and F-test

by

The previous discussion showed two ways of estimating the model but still hasn't addressed how to assess evidence related to whether the observed differences in the means among the groups is "real". In this section, we develop what is called the * ANOVA F-test* that provides a method of aggregating the differences among the means of 2 or more groups and testing our null hypothesis of no difference in the means vs the alternative. In order to develop the test, some additional notation needs to be defined. The sample size in each group is denoted

*n*and the total sample size is

_{j}**N=Σn**where Σ (capital sigma) means "add up over whatever follows". An estimated

_{j}= n_{1}+n_{2}+...+n_{J}*(e*

**residual**_{ij}) is the difference between an observation, γ

_{ij}, and the model estimate, γ̂

_{ij}=μ̂

_{j}, for that observation, γ

_{ij}- γ̂

_{ij}=e

_{ij}. It is basically what is left over that the mean part of the model (μ̂

_{j}) does not explain and is our window into how "good" the model might be.

Consider the four different fake results for a situation with four groups in Figure 2-3. In Situation 1, it looks like there is little evidence for a difference in the means and in Situation 2, it looks fairly clear that there is a difference in the group means. Why? It is because the variation in the means looks "clear" relative to the variation around the means. Consider alternate versions of each result in Situations 3 and 4 and how much evidence there appears to be for same sizes of differences in the means. In the plots, there are two sources of variability in the responses - how much the group means vary across the groups and how much variability there is around the means in each group. So we need a test statistic to help us make some sort of comparison of the groups and to account for the amount of variability present around the means. The statistic is called the * ANOVA F-statistic*. It is developed using

*which are measures of total variation like used in the numerator of the standard deviation that took all the observations, subtracted the mean, squared the differences, and then added up the results over all the observations to generate a measure of total variability. With multiple groups, we will focus on decomposing that total variability (*

**sums of squares***) into variability among the means (we'll call this*

**Total Sums of Squares***) and variability in the residuals or errors (*

**Explanatory Variable A's Sums of Squares***). We define each of these quantities in the One-Way ANOVA situation as follows:*

**Error Sums of Squares**- ⚪
= Total Sums of Squares**SS**_{Total}- ■ By summing over all nj observations in each group and then adding those results up across the groups , we accumulate the variation across all N observations.
- ■ Total variation is assessed by squaring the deviations of the responses around the overall or
(γ̿, the estimated mean for all the observations and available from the mean-only model).**grand mean** - ■ Note: this is the residual variation if the null model is used, so there is no further decomposition possible for that model.
- ■ This is also equivalent to the numerator of the sample variance which is what you get when you ignore the information on the potential differences in the groups.

- ⚪
= Explanatory Variable A's Sums of Squares**SS**_{A}- ■ Variation in the group means around the grand mean based on explanatory variable A.
- ■ Also called sums of squares for the treatment, regression, or model.

- ⚪
= Error (Residual) Sums of Squares**SS**_{E}- ■ Variation in the responses around the group means.
- ■ Also called the sums of squares for the residuals.

The possibly surprising result given the mass of notation just presented is that the total sums of squares is ALWAYS equal to the sum of explanatory variable A's sum of squares and the error sums of squares, **SS _{Total} = SS_{A} + SS_{E}**. This equality means that if the

**SS**goes up, then the

_{A}**SS**must go down if

_{E}**SS**remains the same. This result is called the

_{Total}*. We use these results to build our test statistic and organize this information in what is called an*

**sums of squares decomposition formula***. The ANOVA table is generated using the anova function applied to the reference-coded model:*

**ANOVA table**> lm2 <- lm(Years~Attr,data=MockJuryR)

> anova(lm2)

Analysis of Variance Table

Response: Years

Df | Sum Sq | Mean Sq | F value | Pr(>F) | ||

Attr | 2 | 70.94 | 35.469 | 2.77 | 0.067 | . |

Residuals | 111 | 1421.32 | 12.805 |

Note that the ANOVA table has a row labelled Attr, which contains information for the grouping variable (we'll generally refer to this as explanatory variable A but here it is the picture group that was randomly assigned), and a row labelled Residuals, which is synonymous with "Error". The SS are available in the Sum Sq column. It doesn't show a row for "Total" but the SS_{Total} =SS_{A}+SS_{E} = 1492.26.

> 70.94+1421.32

[1] 1492.26

It may be easiest to understand the sums of squares decomposition by connecting it to our permutation ideas. In a permutation situation, the total variation (SS_{Total}) cannot change - it is the same responses varying around the grand mean. However, the amount of variation attributed to variation among the means and in the residuals can change if we change which observations go with which group. In Figure 2-4, the means and 95% confidence intervals are displayed for the three treatment levels. In panel (a), the results for the original data set (a) are presented including sums of squares. Three permuted versions of the data set are summarized in panels (b), (c), and (d). The SS_{A} is 70.9 in the real data set and between 6.6 and 11 in the permuted data sets. If you had to pick among the plots for the one with the most evidence of a difference in the means, you hopefully would pick panel (a). This visual "unusualness" suggests that this observed result is unusual relative to the possibilities under permutations, which are, again, the possibilities tied to having the null hypothesis being true. But note that the differences are not that great between these permuted data sets and the real one.

One way to think about **SS _{A}** is that it is a function that converts the variation in the group means into a single value. This makes it a reasonable test statistic in a permutation testing context. By comparing the observed SS

_{A}=70.9 to the permutation results of 6.7, 6.6, and 11 we see that the observed result is much more extreme than the three alternate versions. In contrast to our previous test statistics where positive and negative differences were possible, SS

_{A}is always positive with a value of 0 corresponding to no variation in the means. The larger the SS

_{A}, the more variation there was in the means. The permutation p-value for the alternative hypothesis of

**some**(not of greater or less than!) difference in the true means of the groups will involve counting the number of permuted SS

_{A}* results that are larger than what we observed.

To do a permutation test, we need to be able to calculate and extract the SS_{A} value. In the ANOVA table, it is in the first row and is the second number and we can use the [,] referencing to extract that number from the ANOVA table that anova produces (anova(lm(Years~Attr,data=MockJury))[1,2]). We'll store the observed value of SSA is Tobs:

> Tobs <- anova(lm(Years~Attr,data=MockJury))[1,2]; Tobs

[1] 70.93836

The following code performs the permutations using the shuffle function and then makes a plot of the resulting permutation distribution:

> B<-1000

> Tstar<-matrix(NA,nrow=B)

> for (b in (1:B)){

+ Tstar[b]<-anova(lm(Years~shuffle(Attr),data=MockJury))[1,2]

+ }

> hist(Tstar,labels=T)

> abline(v=Tobs,col="red",lwd=3)

> plot(density(Tstar),main="Density curve of Tstar")

> abline(v=Tobs,col="red",lwd=3)

The right-skewed distribution (Figure 2-5) contains the distribution of SS_{A}*'s under permutations (where all the groups are assumed to be equivalent under the null hypothesis). While the observed result is larger than many SS_{A}*'s, there are also many results that are much larger than observed that showed up when doing permutations. The proportion of permuted results that exceed the observed value is found using pdata as before, except only for the area to the right of the observed result. We know that Tobs will always be positive so no absolute values are required now.

> pdata(Tobs,Tstar,lower.tail=F)

[1] 0.071

This provides a permutation-based p-value of 0.071 and suggests marginal evidence against the null hypothesis of no difference in the true means. We would interpret this as saying that there is a 7.1% chance of getting a SS_{A} as large or larger than we observed, given that the null hypothesis is true.

It ends up that some nice parametric statistical results are available (if our assumptions are met) for the ratio of estimated variances, which are called ** Mean Squares**. To turn sums of squares into mean square (variance) estimates, we divide the sums of squares by the amount of free information available. For example, remember the typical variance estimator introductory statistics, , where we "lose" one piece of information to estimate the mean and there are N deviations around the single mean so we divide by N-1. Now consider which still has N deviations but it varies around the J means, so the

**= MS**

*Mean Square Error*_{E}= SS

_{E}/(N-J). Basically, we lose J pieces of information in this calculation because we have to estimate J means. The sums of squares for explanatory variable A is harder to see in the formula , but the same reasoning can be used to understand the denominator for forming the

**or MS**

*Mean Square for variable A*_{A}: there are

*J*means that vary around the grand mean so MS

_{A}= SS

_{A}/(

*J*-1). In summary, the two mean squares are simply:

- ■ MS
_{A}= SS_{A}/(*J*-1), which estimates the variance of the group means around the grand mean. - ■ MS
_{Error}= SS_{Error}/(*N-J*), which estimates the variation of the errors around the group means.

These results are put together using a ratio to define the ** ANOVA F-statistic** (also called the

**) as**

*F-ratio**F*=MS_{A}/MS_{Error}.

This statistic is close to 1 if the variability in the means is "similar" to the variability in the residuals and would lead to no evidence being found of a difference in the means. If the MS_{A} is much larger than the MS_{E}, the *F*-statistic will provide evidence against the null hypothesis. The "size" of the *F*-statistic is formalized by finding the p-value. The *F*-statistic, if assumptions discussed below are met and we assume the null hypothesis is true, follows an *F*-distribution. The ** F-distribution** is a right-skewed distribution whose shape is defined by what are called the

**(**

*numerator degrees of freedom**J*-1) and the

*denominator degrees of freedom**(N-J)*. These names correspond to the values that we used to calculate the mean squares and where in the

*F*-ratio each mean square was used;

*F*-distributions are denoted by their degrees of freedom using the convention of

*F(numerator df, denominator df)*. Some examples of different

*F*-distributions are displayed for you in Figure 2-6.

The characteristics of the F-distribution can be summarized as:

- ⚪ Right skewed,
- ⚪ Nonzero probabilities for values greater than 0,
- ⚪ Shape changes depending on the
**numerator**and**denominator DF**, and - ⚪
**Always use the right-tailed area for p-values.**

Now we are ready to see an ANOVA table when we know about all its components. Note the general format of the ANOVA table is^{22}:

Source | DF | Sums of Squares | Mean Squares | F-ratio | P-value |

Variable A | J-1 | SS_{A} | MS_{A} = SS_{A}/(J-1) | F=MS_{A}/MS_{E} | Right tail of F(J-1,N-J) |

Residuals | N-J | SS_{E} | MS_{E} =SS_{E}/(N-J) | ||

Total | N-1 | SS_{Total} |

The table is oriented to help you reconstruct the *F*-ratio from each of its components. The output from R is similar although it does not provide the last row. The R version of the table for the type of picture effect (Attr) with *J*=3 levels and *N*=114 observations, repeated from above, is:

> anova(lm2)

Analysis of Variance Table

Response: Years

Df | Sum Sq | Mean Sq | F value | Pr(>F) | ||

Attr | 2 | 70.94 | 35.469 | 2.77 | 0.067 | . |

Residuals | 111 | 1421.32 | 12.805 |

The p-value from the *F*-distribution is 0.067. We can verify this result using the observed F-statistic of 2.77 (which came from the ratio of the means squares: 35.47/12.8) which follows an F(2,111) if the null hypothesis is true and other assumptions are met. Using the pf function provides us with areas in the specified F-distribution with the df1 provided to the function as the numerator DF and df2 as the denominator and lower.tail=F reflecting our desire for a right tailed area.

> pf(2.77,df1=2,df2=111,lower.tail=F)

[1] 0.06699803

The result from the F-distribution using this parametric procedure is similar to the p-value obtained using permutations with the test statistic of the SS_{A}, which was 0.071. The *F*-statistic obviously is another potential test statistic to use as a test statistic in a permutation approach. We should check that we get similar results from it with permutations as we did from using SS_{A} as a test statistic. The following code generates the permutation distribution for the *F*-statistic (Figure 2-7) and assesses how unusual the observed *F*-statistic of 2.77 was in this permutation distribution. The only change in the code involves moving from extracting SS_{A} to extracting the *F*-ratio which is in the 4^{th} column of the anova output:

> anova(lm(Years~Attr,data=MockJury))[1,4]

[1] 2.770024

> Tobs <- anova(lm(Years~Attr,data=MockJury))[1,4]; Tobs

[1] 2.770024

> B<-1000

> Tstar<-matrix(NA,nrow=B)

> for (b in (1:B)){

+ Tstar[b]<-anova(lm(Years~shuffle(Attr),data=MockJury))[1,4]

+ }

> hist(Tstar,labels=T)

> abline(v=Tobs,col="red",lwd=3)

> plot(density(Tstar),main="Density curve of Tstar")

> abline(v=Tobs,col="red",lwd=3)

> pdata(Tobs,Tstar,lower.tail=F)

[1] 0.064

The permutation-based p-value is 0.064 which, again, matches the other results closely. The first conclusion is that using a test statistic of the F-statistic or the SS_{A} provide similar permutation results. However, we tend to favor using the F-statistic because it is more commonly used in reporting ANOVA results, not because it is any better in a permutation context.

It is also interesting to compare the permutation distribution for the *F-statistic* and the parametric *F*(2,111) distribution (Figure 2-8). They do not match perfectly but are quite similar. Some the differences around 0 are due to the behavior of the method used to create the density curve and are not really a problem for the methods. This explains why both methods give similar results. In some situations, the correspondence will not be quite to close.

So how can we rectify this result (p-value≈0.06) and the Chapter 1 result that detected a difference between *Average* and *Unattractive* with a p-value≈0.03? I selected the two groups to compare in Chapter 1 because they were furthest apart. "Cherry-picking" the comparison that is likely to be most different creates a false sense of the real situation and inflates the Type I error rate because of the selection. If the entire suite of comparisons are considered, this result may lose some of its luster. In other words, if we consider the suite of all pair-wise differences (and the tests) implicit in comparing all of them, we need stronger evidence in the most different pair than a p-value of 0.033 to suggest overall differences. The *Beautiful* and *Average* groups are not that different from each other so they do not contribute much to the overall *F-test*. In Section 2.5, we will revisit this topic and consider a method that is statistically valid for performing all possible pair-wise comparisons.

^{22}Make sure you can work from left to right and up and down to fill in the ANOVA table given just the necessary information to determine the other components - there is always a question like this on the exam...