5. Simple Summaries

Motivating Scenario:
You have a fresh new dataset and want to check it out. How do you go about understanding numeric variables of interest?

Learning Goals: By the end of this chapter, you should be able to:

Interpret a histogram and connect it to key descriptions of data, and to:
- Responsibly interpret a given numerical summary.
- Evaluate which summary is most informative for a given variable.
- Use histograms to visually assess measures of central tendency and spread.
Explain and interpret summaries of the location of a numeric variable.
- Understand the mathematics behind these summaries and calculate them manually.
- Use R to calculate and visualize them.
Explain and interpret summaries of the spread of a numeric variable.
- Understand the mathematics behind these summaries and calculate them manually.
- Use R to calculate and visualize them.
Recognize the difference between parametric and non-parametric summaries.
- Identify situations where parametric vs. non-parametric summaries are appropriate.
- Read off standard non-parametric summaries from a boxplot.

Code for selecting data from a few columns from RILs planted at GC

library(dplyr)   # provides case_when(), filter(), and select() used below

ril_link <- "https://raw.githubusercontent.com/ybrandvain/datasets/refs/heads/master/clarkia_rils.csv"
ril_data <- readr::read_csv(ril_link) |>
  # Repair a data-entry error: "1.8O" contains the letter O rather than a zero
  dplyr::mutate(growth_rate = case_when(growth_rate == "1.8O" ~ "1.80",
                                        .default = growth_rate),
                growth_rate = as.numeric(growth_rate),
                visited = mean_visits > 0)
# Keep plants grown at GC that have both a hybrid proportion and visit data
gc_rils <- ril_data |>
  filter(location == "GC", !is.na(prop_hybrid), !is.na(mean_visits)) |>
  select(petal_color, petal_area_mm, num_hybrid, offspring_genotyped, prop_hybrid, mean_visits, asd_mm)

library(ggplot2)
gc_rils |> 
  filter(!is.na(petal_color))|>
  ggplot(aes(x = petal_color))+
  geom_bar()

Figure 1: A bar plot showing the number of pink and white flowered plants, experimentally planted at location GC.

We begin our summary of variation in continuous variables in a somewhat unlikely place: a bar plot of a binary variable. Standard bar plots have a y-axis that starts at zero and shows the number of observations in each category, with the categories displayed on the x-axis. So, for example, Figure 1 shows that there are nearly fifty pink flowered RILs and closer to forty white flowered RILs planted at location GC. Bar plots can have any number of categories on the x-axis, but binary variables have a special characteristic: if we code one option (say pink) as zero and the other option (say white) as one, the mean of that variable is the proportion of ones. Here, the mean tells us that 46.15% of the RILs planted had white flowers. Of course, this is a strange mean, as these plants were either pink or white, and none were 46.15% white, but it is a mean nonetheless.

gc_rils |>
  summarise(mean(petal_color == "white", na.rm = TRUE)) |>
  pull()

[1] 0.4615385
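To make the zero/one coding explicit, here is a minimal sketch using a small made-up vector of flower colors (`toy_color` is hypothetical, not the GC data). It shows that the mean of the 0/1 variable is the same number as the count of white flowers divided by the total.

```r
# Hypothetical toy data: five plants, not the real GC RILs
toy_color <- c("pink", "white", "pink", "white", "pink")

# Code pink as 0 and white as 1
is_white <- as.numeric(toy_color == "white")

# The mean of the 0/1 variable ...
mean(is_white)

# ... equals the number of white flowers divided by the number of plants
sum(is_white) / length(is_white)
```

This is why `mean(petal_color == "white")` works in the code above: R treats `TRUE` as one and `FALSE` as zero when averaging.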

Histograms

We can similarly display discrete numerical variables as a bar plot. So, for example, the x-axis of Figure 2 (in the Num hybrid bar plot tab) shows the number of a mother’s seeds that we found to be hybrids, and the y-axis shows the number of mothers with x seeds shown to be hybrid.

Unfortunately, Figure 2 is somewhat misleading: we aimed to genotype eight seeds per plant, but we sometimes missed this goal. Figure 3 (in the Prop hybrid bar plot tab) is more honest, as it shows the proportion, but it is a bit distracting and confusing. Its occasional odd dips reflect not a biological fact of parviflora hybridization, but an experimental artifact of sample sizes that varied across mothers.
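A small sketch may help show why counts mislead and why the proportion bar plot looks ragged. The mothers below are made up (`toy_moms` is hypothetical), and the sketch assumes the proportion is simply the number of hybrid seeds divided by the number of seeds genotyped.

```r
# Hypothetical mothers (not real data): same hybrid count, different sample sizes
toy_moms <- data.frame(
  num_hybrid          = c(4, 4, 2),
  offspring_genotyped = c(8, 5, 8)
)

# Assumption: proportion hybrid = hybrids found / seeds genotyped
toy_moms$prop <- toy_moms$num_hybrid / toy_moms$offspring_genotyped
toy_moms

# Proportions that are possible with eight vs. seven genotyped seeds
eighths  <- (0:8) / 8
sevenths <- (0:7) / 7

# Only 0 and 1 are shared, so a mother with seven seeds usually gets her own bar
shared <- intersect(round(eighths, 10), round(sevenths, 10))
shared
```

The first two mothers have the same count of hybrids but different proportions, and a mother with seven genotyped seeds can almost never fall on the same x value as a mother with eight. Binning nearby proportions together, as a histogram does, smooths over this.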

Figure 4 (in the Prop hybrid hist tab) is a histogram that displays the distribution of the proportion of hybrid seeds per RIL. A histogram is much like a bar plot, but rather than each bar referring to a single value of the x-variable, a histogram groups values of x into bins.

gc_rils |> 
  ggplot(aes(x = num_hybrid))+
  geom_bar()+
  scale_x_continuous(breaks = 0:8)

Figure 2: A bar plot showing the number of genotyped seeds of each mom shown to be hybrid.

gc_rils |> 
  ggplot(aes(x = prop_hybrid))+
  geom_bar()

Figure 3: A bar plot showing the proportion of genotyped plants shown to be hybrid. Most of the time, we genotyped eight seeds per mom.

gc_rils |> 
  ggplot(aes(x = prop_hybrid))+
  geom_histogram(binwidth = 1/8, color = "white")+
  scale_x_continuous(breaks= seq(0,1,1/8))

Figure 4: A histogram showing the proportion of genotyped plants shown to be hybrid. Most of the time, we genotyped eight seeds per mom.

It helps to remember that, in contrast to a bar plot, where the value on the x-axis is the value of all the observations in that bar, in a histogram this value is the center of the range of values in the bin. So in Figure 4 the first bin corresponds to values between -0.0625 and 0.0625, the second bin (centered on 1/8) corresponds to proportions between 0.0625 and 0.1875, and so on. The x-axis of Figure 5 shows this more explicitly, but it is too busy and complex for standard presentation.

Figure 5: A histogram showing the proportion of genotyped plants shown to be hybrid. Most of the time, we genotyped eight seeds per mom. This is identical to Figure 4, except that the x-label shows the range of x values rather than the center of the bins.
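The bin ranges quoted above can be worked out by hand: each bin runs from its center minus half the bin width to its center plus half the bin width. The sketch below assumes, as described in the text, that bins are 1/8 wide and centered on multiples of 1/8. The names `centers` and `bin_edges` are ours.

```r
binwidth <- 1 / 8

# Bin centers: 0, 1/8, 2/8, ..., 1
centers <- seq(0, 1, by = binwidth)

# Each bin spans half a bin width on either side of its center
bin_edges <- data.frame(
  center = centers,
  lower  = centers - binwidth / 2,
  upper  = centers + binwidth / 2
)

# The first two rows reproduce the ranges described for Figure 4
head(bin_edges, 2)
```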

When making a histogram, you are in charge of the bin size. In ggplot2 you can specify this with one of two arguments:

binwidth: Which tells R how big to make the bins.
bins: Which tells R how many equally sized bins to split the data into.

There is not a universally correct answer for the appropriate bin size; it will vary by data set, and choosing it takes thought and expertise. It is best to experiment until you find the binning that honestly reflects the variability in your data. The trade-off is between capturing that variability and distracting the reader with every bump or dip. I usually start with about thirty bins and then dial that number up or down until I feel like the variability in the data is well-communicated by the plot.
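The two arguments can be compared side by side. This is a minimal sketch on simulated values (the data frame `toy` and the particular bin settings are hypothetical, chosen only for illustration), not a recommendation for any real data set.

```r
library(ggplot2)

# Hypothetical simulated proportions, used only for illustration
set.seed(1)
toy <- data.frame(x = rbeta(200, shape1 = 2, shape2 = 5))

# Option 1: say how wide each bin should be
p_width <- ggplot(toy, aes(x = x)) +
  geom_histogram(binwidth = 0.05, color = "white")

# Option 2: say how many bins to split the data into
p_bins <- ggplot(toy, aes(x = x)) +
  geom_histogram(bins = 30, color = "white")

# Print each plot, then adjust binwidth or bins and compare
p_width
p_bins
```

Specify only one of the two arguments in a given call, then rerun with a few different values to see how the apparent shape of the distribution changes.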

Let’s get started with univariate summaries!

The following sections introduce how to summarize a single variable by:

Describing its shape including its skew, and number of modes.
Changing its shape (if necessary), including when to transform, rules of transformation, how to transform variables in R, and a table of common transformations.
Describing its center, including standard non-parametric and parametric summaries, when to use which, and some useful but less commonly used summaries.
Describing its width including nonparametric and parametric summaries, with a worked example with mathematical calculations, R functions, and an illustrative plot or visualization.

Then we summarize the chapter, present practice questions, a glossary, a review of R functions and R packages introduced, and present additional resources.