• 19. R ANOVA pipeline


Here’s a brief review of how to complete all the bits of an ANOVA in R. To do so, we will look into Sepal.Width by Species in the iris dataset.

Brought to you by popular demand!

ANOVA calculations by hand

We can decompose variance components ourselves as follows:

First as the sums of squares

SS_iris <- iris |>
  mutate(grand_mean = mean(Sepal.Width))|>
  group_by(Species)|>
  mutate(group_mean = mean(Sepal.Width))|>
  ungroup()|>
  summarise(SS_model = sum((group_mean   - grand_mean)^2) , 
            SS_error = sum((Sepal.Width - group_mean)^2) , 
            SS_total = sum((Sepal.Width - grand_mean)^2),
            n_total  = n(),
            n_groups = n_distinct(Species))
SS_model SS_error SS_total n_total n_groups
11.345 16.962 28.307 150 3

Then as full blown ANOVA results

anova_results <- SS_iris |>
  mutate(df_model = n_groups - 1,
         df_error  = n_total  - n_groups,
         MS_model  = SS_model / df_model,
         MS_error  = SS_error / df_error,
         F_stat    = MS_model / MS_error,
         r2        = SS_model / SS_total,
         p_value   = pf(F_stat , df1 = df_model, df2 = df_error, 
                        lower.tail = FALSE)) 
SS_model SS_error SS_total df_model df_error MS_model MS_error F_stat r2 p_value
11.34 16.96 28.31 2 147 5.67 0.12 49.16 0.4 4.49e-17

Or R can do this for you

We saw we can do this with the aov() |> summary() pipeline, or the lm() |> anova() pipeline. Here I do the latter. Either way our answers matvh those that we derived above:

lm(Sepal.Width ~ Species, data = iris) |> 
  anova()
Analysis of Variance Table

Response: Sepal.Width
           Df Sum Sq Mean Sq F value    Pr(>F)    
Species     2 11.345  5.6725   49.16 < 2.2e-16 ***
Residuals 147 16.962  0.1154                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The tidy() function in the broom package makes this output tidy and shows our p-values are identical!

library(broom)
lm(Sepal.Width ~ Species, data = iris) |> 
  anova()|>
  tidy()
# A tibble: 2 × 6
  term         df sumsq meansq statistic   p.value
  <chr>     <int> <dbl>  <dbl>     <dbl>     <dbl>
1 Species       2  11.3  5.67       49.2  4.49e-17
2 Residuals   147  17.0  0.115      NA   NA       

Posthoc tests and significance groups

As above there are a few ways to do this, and, as above we stuck the the lm() framework.

library(multcomp)
lm(Sepal.Width ~ Species, data = iris)|>
   glht(linfct = mcp(Species = "Tukey"))|> 
   summary()

     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: lm(formula = Sepal.Width ~ Species, data = iris)

Linear Hypotheses:
                            Estimate Std. Error t value Pr(>|t|)    
versicolor - setosa == 0    -0.65800    0.06794  -9.685 < 0.0001 ***
virginica - setosa == 0     -0.45400    0.06794  -6.683 < 0.0001 ***
virginica - versicolor == 0  0.20400    0.06794   3.003  0.00875 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

We see that all groups differ significantly from one another. So call one significance group “a”, another significance group “b”, and the third significance group “c”. It doesn’t matter which one’s which—just that they’re all different. These are arbitrary labels. For more complex cases, R can find significance groups for us:

lm(Sepal.Width ~ Species, data = iris)|>
   glht(linfct = mcp(Species = "Tukey"))|> 
   cld()
    setosa versicolor  virginica 
       "a"        "b"        "c"