• 2. Data in R summary

Links to: Summary, Chatbot Tutor, Practice Questions, Glossary, R functions, R packages introduced, and Additional resources.

Chapter summary

Figure: A beautiful *Clarkia xantiana* flower (a close-up of a vibrant pink flower with deeply lobed petals, dark purple stamens with pollen-covered anthers, and a protruding stigma, photographed in what appears to be a greenhouse).

Keeping data in the tidy format, where each column represents a variable and each row represents an observation, allows you to take full advantage of the tidyverse. In the tidyverse, data are stored in tibbles, a modern update to data frames that prints more readably and does not silently change data types. The dplyr package offers a suite of intuitive functions for transforming and analyzing data. For example, mutate() creates or modifies variables, while summarize() computes summary statistics. Paired with group_by(), summarize() returns one summary per group. Other essential functions include select() for choosing columns, filter() for subsetting rows, rename() for renaming columns, and arrange() for ordering rows. Used together, and especially when chained with the pipe operator (for example, group_by(...) |> summarize(...)), these tools enable clear, reproducible workflows. In the next chapter, you’ll see how tidy data also powers beautiful and flexible plots using ggplot2.
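As a minimal sketch of the workflow summarized above, the following chains several dplyr verbs with the pipe. The tibble `toy_visits` and its values are made up for illustration and are not the real Clarkia dataset; the column names simply mirror those used in the practice questions.

```r
library(dplyr)

# Toy data, invented for illustration (not the real Clarkia dataset)
toy_visits <- tibble(
  location    = c("GC", "GC", "GC", "SR", "SR", "SR"),
  ril         = c("A1", "A100", "A102", "A1", "A100", "A102"),
  mean_visits = c(0, 0.1875, 0.25, 0.6667, 0.5833, NA)
)

location_summary <- toy_visits |>
  filter(!is.na(mean_visits)) |>        # keep rows with a recorded value
  rename(visits = mean_visits) |>       # rename() takes new_name = old_name
  group_by(location) |>                 # one group per location
  summarize(avg_visits = mean(visits),  # one output row per group
            n_rils     = n()) |>        # number of rows in each group
  arrange(desc(avg_visits))             # largest average first

location_summary
```

Each step takes a tibble and returns a tibble, which is what lets the verbs be chained in any sensible order.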

Chatbot tutor

Please interact with the custom chatbot (link here) that I made to help you with this chapter. I suggest at least ten back-and-forth exchanges to ramp up, then stopping once you feel you have gotten what you needed from it.

Practice Questions

Try these questions! Use the R environment below to work without changing tabs.

Q1) The code above returns the error: "Error: could not find function "summarise"". How can you solve this?

- Change “summarise” to “summarize”
- Load the dplyr library
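A hedged sketch of how one might investigate this kind of error, assuming dplyr is installed. A "could not find function" message means R cannot locate the function at all, which typically happens when the package that provides it has not been attached. The built-in iris data are used here only for illustration.

```r
# Attach dplyr so that its functions can be found by bare name
library(dplyr)

# dplyr accepts both the British and American spellings
with_s <- iris |> summarise(mean_petal_length = mean(Petal.Length))
with_z <- iris |> summarize(mean_petal_length = mean(Petal.Length))

# Compare the two results
with_s$mean_petal_length == with_z$mean_petal_length
```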

Q2) Revisit the pollinator visitation dataset we explored. Which location has the greatest mean anther-stigma distance (asd_mm)?

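library(dplyr)  # provides group_by() and summarise()

ril_link <- "https://raw.githubusercontent.com/ybrandvain/datasets/refs/heads/master/clarkia_rils.csv"
ril_data <- readr::read_csv(ril_link)

# Mean anther-stigma distance per location, ignoring missing values
ril_data |>
  group_by(location) |>
  summarise(avg_asd = mean(asd_mm, na.rm = TRUE))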

# A tibble: 5 × 2
  location avg_asd
  <chr>      <dbl>
1 GC         0.920
2 LB         0.861
3 SR         0.921
4 US         0.866
5 <NA>     NaN

Q3) Consider the table below. Are the data tidy or not tidy?

location	GC	GC	GC	GC	GC	GC
ril	A1	A100	A102	A104	A106	A107
mean_visits	0.0000	0.1875	0.2500	0.0000	0.0000	0.0000

Here the data are transposed, so the data are not tidy. Remember that in tidy data each variable is a column, not a row. This layout is particularly hard for R because each column would then mix several types of data (text and numbers), while a column in R can hold only one type.
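To illustrate the point about mixed types, here is a small sketch using values from the first data column of the table above. The object names `transposed_column` and `tidy_visits` are hypothetical.

```r
library(dplyr)

# In the transposed layout, one column would hold a location, a RIL name,
# and a number. A vector has a single type, so the number is coerced to text.
transposed_column <- c("GC", "A100", 0.1875)
class(transposed_column)

# Tidy layout: each variable is its own column with its own type
tidy_visits <- tibble(
  location    = c("GC", "GC", "GC"),
  ril         = c("A1", "A100", "A102"),
  mean_visits = c(0, 0.1875, 0.25)
)

# mean_visits stays numeric, so numeric summaries work directly
mean(tidy_visits$mean_visits)
```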

Q4) Consider the table below. Are the data tidy or not tidy?

location-ril	mean_visits
GC-A1	0.0000
GC-A100	0.1875
GC-A102	0.2500
GC-A104	0.0000
GC-A106	0.0000
GC-A107	0.0000

Here location and ril are combined in a single column, so the data are not tidy. Remember that in tidy data each variable is its own column. In this format it would be hard to compute, for example, mean visits per location or per RIL.
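One possible way to repair a combined column, sketched with base R's sub() inside mutate(). This assumes every value contains exactly one hyphen separating location from RIL. The column name `location_ril` is a hypothetical stand-in for `location-ril`, which would need backticks in R.

```r
library(dplyr)

combined <- tibble(
  location_ril = c("GC-A1", "GC-A100", "GC-A102"),
  mean_visits  = c(0, 0.1875, 0.25)
)

separated <- combined |>
  mutate(
    location = sub("-.*$", "", location_ril),    # text before the hyphen
    ril      = sub("^[^-]*-", "", location_ril)  # text after the hyphen
  ) |>
  select(location, ril, mean_visits)             # drop the combined column

separated
```

With location and ril in their own columns, group_by(location) or group_by(ril) can be used as usual.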

Q5) Consider the table below. Are the data tidy or not tidy?

ril	GC	SR
A1	0.0000	0.6667
A100	0.1875	0.5833
A102	0.2500	0.6667
A104	0.0000	1.7500
A106	0.0000	0.5000
A107	0.0000	1.5000

This is known as “wide format” and is not tidy. Here the values of the variable location (GC and SR) are used as column headings. This can be a fine way to present data to people, but it is not the format we use to analyze data.
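A sketch of converting wide data to tidy (long) format, using the first three rows of the table above. It relies on pivot_longer() from the tidyr package, which is an assumption here: tidyr is part of the tidyverse but is not among the packages introduced in this chapter.

```r
library(dplyr)
library(tidyr)  # assumption: tidyr is installed; not introduced in this chapter

wide_visits <- tibble(
  ril = c("A1", "A100", "A102"),
  GC  = c(0, 0.1875, 0.25),
  SR  = c(0.6667, 0.5833, 0.6667)
)

# Move the GC and SR column headings into a location column,
# and their cell values into a mean_visits column
long_visits <- wide_visits |>
  pivot_longer(cols      = c(GC, SR),
               names_to  = "location",
               values_to = "mean_visits")

# Once tidy, grouped summaries work as usual
long_visits |>
  group_by(location) |>
  summarize(avg_visits = mean(mean_visits))
```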

Q6) You should always make sure data are tidy when (pick the best answer)

- collecting data
- presenting data
- analyzing data with dplyr
- all of the above

Q7) What is wrong with the code below? (Pick the most egregious issue.)

- I overwrote iris and lost the raw data
- I did not show the output
- I used summarise() rather than summarize()
- I did not tell R to remove missing data when calculating the mean.

iris <- iris |> 
  summarise(mean_sepal_length =  mean(Sepal.Length))
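For comparison, a hedged sketch of a pattern that stores a summary under a new name, so the original data frame stays available. The name `iris_summary` is hypothetical.

```r
library(dplyr)

# Store the summary in a new object rather than reusing the name iris
iris_summary <- iris |>
  summarise(mean_sepal_length = mean(Sepal.Length))

iris_summary  # print the summary
dim(iris)     # the built-in iris data still have all rows and columns
```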

Q8) After running the code below, how many rows and columns will the output tibble have? Note: the original data have 593 rows, 7 columns, and 186 unique RILs.

ril_data |>
  group_by(ril) |>
  summarize(avg_visits = mean(mean_visits, na.rm = TRUE))
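To build intuition on a smaller scale, the sketch below runs the same pattern on a made-up tibble (hypothetical values, not the real data) and checks the shape of the result with dim().

```r
library(dplyr)

# Toy data: 6 rows, 3 columns, 3 unique RILs
toy <- tibble(
  ril         = c("A1", "A1", "A100", "A100", "A102", "A102"),
  location    = c("GC", "SR", "GC", "SR", "GC", "SR"),
  mean_visits = c(0, 0.6667, 0.1875, 0.5833, 0.25, NA)
)

toy_summary <- toy |>
  group_by(ril) |>
  summarize(avg_visits = mean(mean_visits, na.rm = TRUE))

# Rows and columns of the summary
dim(toy_summary)
```

The summary has one row per group, plus one column for the grouping variable and one for each summary computed.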

Glossary of Terms

Tidy Data: A structured format where:
- Each row represents an observation.
- Each column represents a variable.
- Each cell contains a single measurement.
Tibbles: A modern form of a data frame in R with:
- Cleaner printing (only first 10 rows, fits columns to screen).
- Explicit display of data types (e.g., <dbl>, <chr>).
- Strict subsetting ([ always returns a tibble, and $ does not partially match column names).
- Character data is not automatically converted to factors.
Piping (|>): A way to chain operations together by passing the result of one function to the next, making code more readable and modular.
Grouping in Data Analysis: Grouped operations allow calculations within subsets of data (e.g., mean visits per location).
Missing Data (NA): R uses NA to represent missing values. Operations with NA return NA unless handled explicitly (e.g., na.rm = TRUE in mean() to ignore missing values, or use = "pairwise.complete.obs" in cor()).
Warnings: Indicate a possible issue but allow code to run (e.g., NAs introduced by coercion).
Errors: Stop execution completely when something is invalid.
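The glossary entries on missing data, warnings, and errors can be sketched in a few lines of base R. The object names are hypothetical, and tryCatch() is used here only to capture the conditions so the script keeps running.

```r
x <- c(1, NA, 3)

mean(x)                # NA, because one value is missing
mean(x, na.rm = TRUE)  # missing value dropped before averaging

# Warning: the call still returns a result (NA for the unparseable text).
# Here the warning message is captured instead of printed.
warning_text <- tryCatch(
  as.numeric("abc"),
  warning = function(w) conditionMessage(w)
)

# Error: execution of the call stops; tryCatch() intercepts it
error_result <- tryCatch(
  log("a"),
  error = function(e) "error caught"
)
```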

Key R functions

read_csv() (readr): Reads a CSV file into R as a tibble, automatically guessing column types.
select() (dplyr): Selects specific columns from a dataset.
mutate() (dplyr): Creates or modifies columns in a dataset.
case_when() (dplyr): Replaces values conditionally within a column.
as.numeric(): Converts a vector to a numeric data type.
summarize() (dplyr): Computes summary statistics on a dataset (e.g., mean, sum).
mean(): Computes the mean (average) of a numeric vector.
- Argument na.rm = TRUE: Used in functions like mean() and sd() to remove missing values (NA) before computation.
pull() (dplyr): Extracts a single column from a tibble as a vector.
group_by() (dplyr): Groups data by one or more variables for summary operations.
|> (Base R Pipe Operator): Passes the result of one function into another, making code more readable.
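A brief sketch tying several of these functions together on a made-up tibble. The data and the names `toy_raw`, `toy_clean`, and `visited` are hypothetical. Here visits are deliberately stored as text to show as.numeric() at work.

```r
library(dplyr)

# Toy data with visits stored as character strings
toy_raw <- tibble(
  ril    = c("A1", "A100", "A102"),
  visits = c("0", "0.1875", "0.25"),
  notes  = c("a", "b", "c")
)

toy_clean <- toy_raw |>
  select(ril, visits) |>                    # keep only the columns needed
  mutate(
    visits  = as.numeric(visits),           # convert text to numbers
    visited = case_when(                    # conditional recoding
      visits > 0  ~ "yes",
      visits == 0 ~ "no"
    )
  )

# pull() returns the column as a plain vector rather than a tibble
visit_vector <- toy_clean |> pull(visits)
mean(visit_vector)
```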

R Packages Introduced

readr: A tidyverse package for reading rectangular data files (e.g., read_csv()).
dplyr: A tidyverse package for data manipulation, including mutate(), glimpse(), and across().

Additional resources

R Recipes:

Other web resources:

Chapter 10: Tidy data from R for Data Science (Grolemund & Wickham, 2018).
Animated dplyr functions from R for the Rest of Us.

Videos:

Basic Data Manipulation (From Stat454).
Calculations on tibble (From Stat454).