• 15. Is It Normal?

Motivating scenario: Many common statistical approaches assume that your data (or the model’s residuals) are normally distributed. So, before we can run and interpret such models, we must be able to evaluate whether this assumption is fair. Here I show you how to do that!

Learning goals: By the end of this chapter you should be able to:

Explain why visually assessing normality is often preferred over formal statistical tests of the null hypothesis that data came from a normal distribution.
Create and interpret a Quantile-Quantile (QQ) plot to evaluate if a dataset is approximately normal.
Visually recognize common patterns of non-normality (e.g. skew and bimodality).

Cartoon of a normal distribution looking skeptically at an excited looking bimodal / negatively skewed distribution. The first says to the second, "you're not normal." — Figure 1: A fun picture of a normal (orange) and non-normal (blue) distribution from Allison Horst.

Is it normal?

Many standard statistical approaches rely, to some extent, on normal data (or more specifically, a normally distributed sampling distribution of residuals). It is therefore often important to know if our data (or at least the residuals of a linear model) are normally distributed, as this influences how much faith we have in the results of a given statistical procedure.

While there are ways to formally test the null hypothesis that data come from a normal distribution, we rarely use these because deviations from normality can be most critical when we have the least power to detect them (that is, when samples are small). For this reason, we typically rely on visual inspection rather than null hypothesis significance testing to assess whether data are approximately normal.
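To make the power argument concrete, here is a minimal simulation sketch. It uses the Shapiro-Wilk test (shapiro.test() in base R) as one example of a formal normality test. The helper rejection_rate(), the sample sizes, and the two distributions are illustrative choices of mine, not something taken from the analyses in this chapter.

```r
set.seed(42)

# Hypothetical helper: the proportion of simulated samples for which the
# Shapiro-Wilk test rejects the null of normality at level alpha.
rejection_rate <- function(draw_sample, n_reps = 200, alpha = 0.05) {
  p_values <- replicate(n_reps, shapiro.test(draw_sample())$p.value)
  mean(p_values < alpha)
}

# Small samples (n = 10) from a strongly right-skewed exponential distribution.
rate_small_skewed <- rejection_rate(function() rexp(10))

# Large samples (n = 2000) from a t distribution with 10 degrees of freedom,
# which is symmetric and close to normal.
rate_large_mild <- rejection_rate(function() rt(2000, df = 10))

c(small_skewed = rate_small_skewed, large_mild = rate_large_mild)
```

In a simulation like this, the test should reject far more often for the large, nearly normal samples than for the small, badly skewed ones. That is the opposite of what we would want from a tool meant to warn us when non-normality matters.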

“Quantile-Quantile” plots and the eye test

A QQ (aka “quantile-quantile”) plot is a useful tool to help us visually evaluate if data are roughly normal. It does so by comparing the quantiles of your data against the theoretical quantiles you would expect if your data came from an ideal version of a specified distribution (in this case, a normal distribution). If your data are normal, the points will fall along a straight line.
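To see what “comparing quantiles” means in practice, the sketch below builds the coordinates of a normal QQ plot by hand and checks them against base R's qqnorm(). The data are simulated, and the object names (x, qq_manual, qq_base) are hypothetical.

```r
set.seed(1)

# Hypothetical data: 30 draws from a normal distribution.
x <- rnorm(30, mean = 50, sd = 10)
n <- length(x)

# Observed quantiles: simply the data sorted from smallest to largest.
observed <- sort(x)

# Theoretical quantiles: the standard normal values expected at n evenly
# spaced cumulative probabilities (ppoints() picks those probabilities).
theoretical <- qnorm(ppoints(n))

qq_manual <- data.frame(theoretical, observed)

# Base R's qqnorm() returns the same coordinates (in the data's original order).
qq_base <- qqnorm(x, plot.it = FALSE)
all.equal(sort(qq_base$x), theoretical)
all.equal(sort(qq_base$y), observed)
```

Plotting observed against theoretical from qq_manual gives the QQ plot: each point pairs the i-th smallest observation with the value a standard normal would be expected to produce at that rank.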

The QQ plot of petal area in parviflora RILs planted at site GC (Figure 2) reveals that the points are fairly close to the predicted line, although both the smallest and largest values are slightly larger than expected. Is this a big deal? Is this deviation surprising? To answer that, we need to understand the variability we expect from a normal distribution.

gc_rils |>
  ggplot(aes(sample = petal_area_mm))+
  geom_qq()+
  geom_qq_line()+
  labs(x = "Theoretical quantiles",
       y = "Observed petal area",
       title = "Normal QQ plot of petal area among parviflora RILs")

The image shows a quantile-quantile (QQ) plot of petal area in *Clarkia xantiana ssp. parviflora* RILs. The x-axis represents the theoretical quantiles (z-transformed expectations), and the y-axis represents the observed petal area data. A straight line is drawn to indicate the expected values if the data were normally distributed. The plotted points fall near the line, though slight deviations can be observed at both the lower and upper ends, where the observed values are larger than expected. — Figure 2: A quantile-quantile plot of petal area in *parviflora* RILs.

Making a QQ plot in R: We can create a QQ plot using the geom_qq() function and add a reference line with geom_qq_line(). Here, we map our quantity of interest onto the sample aesthetic inside aes().

What normal distributions look like

I’m always surprised by how easily I can convince myself that a sample doesn’t come from a normal distribution. Try hitting the Generate a sample from the normal distribution button in the app below a few times, and experiment with the sample size to get a sense of the variability in what samples from a normal distribution can look like.

#| column: page-right
#| standalone: true
#| viewerHeight: 900
library(shiny)
library(ggplot2)
library(bslib)

ui <- fluidPage(
  theme = bs_theme(bootswatch = "flatly"),
  titlePanel("Getting a feel for normal distributions"),
  
  fluidRow(
    column(
      12,
      div(
        style = "margin-bottom: 10px;",
        actionButton("go", "Generate a sample from the normal distribution"),
        br(), br(),
        tags$label("Sample Size:", `for` = "n", style = "font-weight:600;")
      ),
      sliderInput(
        "n", label = NULL, min = 10, max = 100, value = 24, step = 1, ticks = TRUE, width = "100%"
      )
    )
  ),
  
  hr(),
  h3(textOutput("subtitle")),
  br(),
  
  # 2 x 2 plot grid
  fluidRow(
    column(
      6,
      h4("Histogram"),
      plotOutput("hist", height = "220px")
    ),
    column(
      6,
      h4("Density plot"),
      plotOutput("dens", height = "220px")
    )
  ),
  fluidRow(
    column(
      6,
      h4("Quantile-quantile plot"),
      plotOutput("qq", height = "220px")
    ),
    column(
      6,
      h4("Cumulative distribution"),
      plotOutput("ecdf", height = "220px")
    )
  )
)

server <- function(input, output, session) {
  # Draw a new sample ONLY when the button is clicked (using the current n).
  # ignoreNULL = FALSE makes this also run once at startup, so there is
  # something to show before the first click.
  x <- eventReactive(input$go, {
    rnorm(input$n)
  }, ignoreNULL = FALSE)
  
  output$subtitle <- renderText({
    paste0("A sample of size ", length(x()), " from the standard normal distribution")
  })
  
  output$hist <- renderPlot({
    ggplot(data.frame(x = x()), aes(x)) +
      geom_histogram(color = "black", fill = "grey40", alpha = 0.7, bins = max(6, round(sqrt(length(x()))))) +
      labs(x = "x", y = "count") +
      theme_minimal()
  }, res = 150)
  
  output$dens <- renderPlot({
    ggplot(data.frame(x = x()), aes(x)) +
      geom_density(fill = "grey60", alpha = 0.5) +
      labs(x = "x", y = "density") +
      theme_minimal()
  }, res = 150)
  
  output$qq <- renderPlot({
    ggplot(data.frame(x = x()), aes(sample = x)) +
      stat_qq(size = 1.6) +
      stat_qq_line() +
      labs(x = "theoretical", y = "sample") +
      theme_minimal()
  }, res = 150)
  
  output$ecdf <- renderPlot({
    ggplot(data.frame(x = x()), aes(x)) +
      stat_ecdf(geom = "step", linewidth = 0.9) +
      coord_cartesian(ylim = c(0, 1)) +
      labs(x = "x", y = "cumulative proportion") +
      theme_minimal()
  }, res = 150)
}

shinyApp(ui, server)
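If you cannot run the app, a static version of the same exercise is sketched below: draw several samples from a standard normal distribution and look at their QQ plots side by side. The number of samples, the sample size, and the object names are arbitrary choices for illustration.

```r
library(ggplot2)

set.seed(2024)
n_samples   <- 9    # number of independent samples (one panel each)
sample_size <- 24   # observations per sample

# One row per observation, labelled by the sample it belongs to.
normal_samples <- data.frame(
  sample_id = rep(paste("Sample", seq_len(n_samples)), each = sample_size),
  value     = rnorm(n_samples * sample_size)
)

# One QQ plot per sample; every panel shows truly normal data.
normal_qq_grid <- ggplot(normal_samples, aes(sample = value)) +
  geom_qq() +
  geom_qq_line() +
  facet_wrap(~sample_id) +
  labs(x = "Theoretical quantiles", y = "Sample quantiles")

normal_qq_grid
```

Every panel comes from a normal distribution, so whatever wiggle you see around the line is the kind of deviation that sampling noise alone can produce at this sample size. Changing sample_size shows how that noise shrinks as samples get larger.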

Examples of a sample not from a normal distribution

Let’s compare the samples from a normal distribution, in our Shiny app above, to cases in which the data are not normal. For example:

- Figure 3 A-D makes it clear that, pooled across the three Iris species, petal length is bimodal.
- Figure 3 E-H makes it clear that, across all mammals, body weight is strongly right-skewed (roughly exponentially distributed).

These examples are a bit extreme. Over the term, we’ll get practice in visually assessing if data are normal-ish.

A multi-panel figure showing two datasets visualized in four different ways. The columns are labeled Histogram, Density Plot, QQ Plot, and CDF. The top row shows "All iris petal lengths," which has a bimodal (two-humped) distribution. The bottom row shows "Mammal body weights," which has a highly right-skewed distribution, with most data clustered near zero and a long tail extending to the right. — Figure 3: **A-D.** The distribution of petal lengths for the combined iris dataset. The histogram and density plot clearly show a bimodal distribution with two distinct groups, which causes the points on the QQ plot to form an S-curve rather than a straight line. **E-H.** The distribution of mammal body weights. All four plots reveal a strong right-skew, characteristic of an exponential-like distribution. This is seen in the L-shaped histogram, the long right tail of the density plot, and the pronounced upward curve of the points in the QQ plot.
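A rough sketch of how QQ plots like these could be made is below. The iris data are built into R. For the mammals I assume the msleep dataset that ships with ggplot2 (its bodywt column) as a stand-in, which may not be the dataset used for Figure 3. The object names are also mine.

```r
library(ggplot2)

# Petal lengths pooled across all three species in R's built-in iris data.
iris_qq <- ggplot(iris, aes(sample = Petal.Length)) +
  geom_qq() +
  geom_qq_line() +
  labs(x = "Theoretical quantiles",
       y = "Petal length",
       title = "All iris petal lengths")

# ASSUMPTION: msleep$bodywt (from ggplot2) stands in for the mammal
# body-weight data; the figure above may be based on a different dataset.
mammal_qq <- ggplot(msleep, aes(sample = bodywt)) +
  geom_qq() +
  geom_qq_line() +
  labs(x = "Theoretical quantiles",
       y = "Body weight",
       title = "Mammal body weights")

iris_qq
mammal_qq
```

A quick numeric hint of the right skew in the mammal data is that the mean body weight is much larger than the median, which you can check with mean(msleep$bodywt) and median(msleep$bodywt).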