• 19. R ANOVA pipeline

Here’s a brief review of how to complete all the bits of an ANOVA in R. To do so, we will look into Sepal.Width by Species in the iris dataset.

Brought to you by popular demand!

ANOVA calculations by hand

We can decompose variance components ourselves as follows:

First as the sums of squares

SS_iris <- iris |>
  mutate(grand_mean = mean(Sepal.Width))|>
  group_by(Species)|>
  mutate(group_mean = mean(Sepal.Width))|>
  ungroup()|>
  summarise(SS_model = sum((group_mean   - grand_mean)^2) , 
            SS_error = sum((Sepal.Width - group_mean)^2) , 
            SS_total = sum((Sepal.Width - grand_mean)^2),
            n_total  = n(),
            n_groups = n_distinct(Species))

SS_model	SS_error	SS_total	n_total	n_groups
11.345	16.962	28.307	150	3

Then as full blown ANOVA results

anova_results <- SS_iris |>
  mutate(df_model = n_groups - 1,
         df_error  = n_total  - n_groups,
         MS_model  = SS_model / df_model,
         MS_error  = SS_error / df_error,
         F_stat    = MS_model / MS_error,
         r2        = SS_model / SS_total,
         p_value   = pf(F_stat , df1 = df_model, df2 = df_error, 
                        lower.tail = FALSE))

SS_model	SS_error	SS_total	df_model	df_error	MS_model	MS_error	F_stat	r2	p_value
11.34	16.96	28.31	2	147	5.67	0.12	49.16	0.4	4.49e-17

Or R can do this for you

We saw we can do this with the aov() |> summary() pipeline, or the lm() |> anova() pipeline. Here I do the latter. Either way our answers matvh those that we derived above:

lm(Sepal.Width ~ Species, data = iris) |> 
  anova()

Analysis of Variance Table

Response: Sepal.Width
           Df Sum Sq Mean Sq F value    Pr(>F)    
Species     2 11.345  5.6725   49.16 < 2.2e-16 ***
Residuals 147 16.962  0.1154                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The tidy() function in the broom package makes this output tidy and shows our p-values are identical!

library(broom)
lm(Sepal.Width ~ Species, data = iris) |> 
  anova()|>
  tidy()

# A tibble: 2 × 6
  term         df sumsq meansq statistic   p.value
  <chr>     <int> <dbl>  <dbl>     <dbl>     <dbl>
1 Species       2  11.3  5.67       49.2  4.49e-17
2 Residuals   147  17.0  0.115      NA   NA

Posthoc tests and significance groups

As above there are a few ways to do this, and, as above we stuck the the lm() framework.

library(multcomp)
lm(Sepal.Width ~ Species, data = iris)|>
   glht(linfct = mcp(Species = "Tukey"))|> 
   summary()


     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: lm(formula = Sepal.Width ~ Species, data = iris)

Linear Hypotheses:
                            Estimate Std. Error t value Pr(>|t|)    
versicolor - setosa == 0    -0.65800    0.06794  -9.685 < 0.0001 ***
virginica - setosa == 0     -0.45400    0.06794  -6.683 < 0.0001 ***
virginica - versicolor == 0  0.20400    0.06794   3.003  0.00875 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

We see that all groups differ significantly from one another. So call one significance group “a”, another significance group “b”, and the third significance group “c”. It doesn’t matter which one’s which—just that they’re all different. These are arbitrary labels. For more complex cases, R can find significance groups for us:

lm(Sepal.Width ~ Species, data = iris)|>
   glht(linfct = mcp(Species = "Tukey"))|> 
   cld()

    setosa versicolor  virginica 
       "a"        "b"        "c"