SS_iris <- iris |>
mutate(grand_mean = mean(Sepal.Width))|>
group_by(Species)|>
mutate(group_mean = mean(Sepal.Width))|>
ungroup()|>
summarise(SS_model = sum((group_mean - grand_mean)^2) ,
SS_error = sum((Sepal.Width - group_mean)^2) ,
SS_total = sum((Sepal.Width - grand_mean)^2),
n_total = n(),
n_groups = n_distinct(Species))• 19. R ANOVA pipeline
Here’s a brief review of how to complete all the bits of an ANOVA in R. To do so, we will look into Sepal.Width by Species in the iris dataset.
Brought to you by popular demand!
ANOVA calculations by hand
We can decompose variance components ourselves as follows:
First as the sums of squares
| SS_model | SS_error | SS_total | n_total | n_groups |
|---|---|---|---|---|
| 11.345 | 16.962 | 28.307 | 150 | 3 |
Then as full blown ANOVA results
anova_results <- SS_iris |>
mutate(df_model = n_groups - 1,
df_error = n_total - n_groups,
MS_model = SS_model / df_model,
MS_error = SS_error / df_error,
F_stat = MS_model / MS_error,
r2 = SS_model / SS_total,
p_value = pf(F_stat , df1 = df_model, df2 = df_error,
lower.tail = FALSE)) | SS_model | SS_error | SS_total | df_model | df_error | MS_model | MS_error | F_stat | r2 | p_value |
|---|---|---|---|---|---|---|---|---|---|
| 11.34 | 16.96 | 28.31 | 2 | 147 | 5.67 | 0.12 | 49.16 | 0.4 | 4.49e-17 |
Or R can do this for you
We saw we can do this with the aov() |> summary() pipeline, or the lm() |> anova() pipeline. Here I do the latter. Either way our answers matvh those that we derived above:
lm(Sepal.Width ~ Species, data = iris) |>
anova()Analysis of Variance Table
Response: Sepal.Width
Df Sum Sq Mean Sq F value Pr(>F)
Species 2 11.345 5.6725 49.16 < 2.2e-16 ***
Residuals 147 16.962 0.1154
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The tidy() function in the broom package makes this output tidy and shows our p-values are identical!
library(broom)
lm(Sepal.Width ~ Species, data = iris) |>
anova()|>
tidy()# A tibble: 2 × 6
term df sumsq meansq statistic p.value
<chr> <int> <dbl> <dbl> <dbl> <dbl>
1 Species 2 11.3 5.67 49.2 4.49e-17
2 Residuals 147 17.0 0.115 NA NA
Posthoc tests and significance groups
As above there are a few ways to do this, and, as above we stuck the the lm() framework.
library(multcomp)
lm(Sepal.Width ~ Species, data = iris)|>
glht(linfct = mcp(Species = "Tukey"))|>
summary()
Simultaneous Tests for General Linear Hypotheses
Multiple Comparisons of Means: Tukey Contrasts
Fit: lm(formula = Sepal.Width ~ Species, data = iris)
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
versicolor - setosa == 0 -0.65800 0.06794 -9.685 < 0.0001 ***
virginica - setosa == 0 -0.45400 0.06794 -6.683 < 0.0001 ***
virginica - versicolor == 0 0.20400 0.06794 3.003 0.00875 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)
We see that all groups differ significantly from one another. So call one significance group “a”, another significance group “b”, and the third significance group “c”. It doesn’t matter which one’s which—just that they’re all different. These are arbitrary labels. For more complex cases, R can find significance groups for us:
lm(Sepal.Width ~ Species, data = iris)|>
glht(linfct = mcp(Species = "Tukey"))|>
cld() setosa versicolor virginica
"a" "b" "c"