Keeping data in tidy format, where each column represents a variable and each row represents an observation, lets you take full advantage of the tidyverse. In the tidyverse, data are stored in tibbles, a modern update to data frames that prints more readably and keeps data types consistent. The dplyr package offers a suite of intuitive functions for transforming and analyzing data. For example, mutate() creates or modifies variables, while summarize() computes summary statistics. Paired with group_by(), summarize() generates summaries for each group. Other essential functions include select() for choosing columns, filter() for subsetting rows, rename() for renaming columns, and arrange() for ordering rows. Together, and especially when chained with the pipe operator (group_by(...) |> summarize(...)), these tools enable clear, reproducible workflows. In the next chapter, you’ll see how tidy data also powers beautiful and flexible plots using ggplot2.
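As a quick reminder of how these verbs chain together, here is a minimal sketch. The tibble visits, its notes column, and the new column name line are made up for illustration. They are not the chapter's dataset.

```r
library(dplyr)

# Hypothetical toy data (not the chapter's dataset)
visits <- tibble(
  location    = c("GC", "GC", "SR", "SR"),
  ril         = c("A1", "A100", "A1", "A100"),
  mean_visits = c(0, 0.1875, 0.6667, 0.5833),
  notes       = c("a", "b", "c", "d")
)

result <- visits |>
  select(location, ril, mean_visits) |>  # keep only the columns we need
  filter(mean_visits > 0) |>             # keep rows with at least some visits
  rename(line = ril) |>                  # new_name = old_name
  arrange(desc(mean_visits))             # order rows, largest first

result
```

Each step takes a tibble and returns a tibble, which is what makes the pipe read like a recipe from top to bottom.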
Chatbot tutor
Please interact with this custom chatbot (link here) I have made to help you with this chapter. I suggest interacting with at least ten back-and-forths to ramp up and then stopping when you feel like you got what you needed from it.
Practice Questions
Try these questions! Use the R environment below to work without changing tabs.
Q1) The code above returns the error: 'Error: could not find function "summarise"'. How can you solve this?
Q2) Revisit the pollinator visitation dataset we explored. Which location has the greatest mean anther-stigma distance (asd_mm)?
# A tibble: 5 × 2
  location avg_asd
  <chr>      <dbl>
1 GC         0.920
2 LB         0.861
3 SR         0.921
4 US         0.866
5 <NA>     NaN
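A grouped summary like the one printed above could come from code along these lines. This is a minimal sketch: the tibble visits and its values are made up, and only the column names location, asd_mm, and avg_asd come from the chapter.

```r
library(dplyr)

# Hypothetical toy data; the last row has a missing location and a missing measurement
visits <- tibble(
  location = c("GC", "GC", "SR", "SR", NA),
  asd_mm   = c(0.9, 1.0, NA, 0.8, NA)
)

asd_summary <- visits |>
  group_by(location) |>
  summarize(avg_asd = mean(asd_mm, na.rm = TRUE))  # drop NAs within each group

asd_summary
```

A group whose values are all missing has nothing left to average once na.rm = TRUE removes them, so its mean is NaN. That is why the row for the missing location shows NaN.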
Q3) Consider the table below. Are the data tidy?
location     GC      GC      GC      GC      GC      GC
ril          A1      A100    A102    A104    A106    A107
mean_visits  0.0000  0.1875  0.2500  0.0000  0.0000  0.0000
Here the data are transposed, so they are not tidy. Remember that in tidy data each variable is a column, not a row. This layout is particularly hard for R to work with because each column mixes several types of data (text and numbers).
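One possible way to untangle a transposed table like this uses tidyr, sketched below. The column names variable and obs1 to obs6 are made up for illustration, and every cell starts as text because each column mixes words and numbers.

```r
library(dplyr)
library(tidyr)

# The transposed table, stored the only way R can hold it: all text
transposed <- tibble(
  variable = c("location", "ril", "mean_visits"),
  obs1 = c("GC", "A1",   "0.0000"),
  obs2 = c("GC", "A100", "0.1875"),
  obs3 = c("GC", "A102", "0.2500"),
  obs4 = c("GC", "A104", "0.0000"),
  obs5 = c("GC", "A106", "0.0000"),
  obs6 = c("GC", "A107", "0.0000")
)

tidy <- transposed |>
  pivot_longer(-variable, names_to = "obs", values_to = "value") |>  # one row per cell
  pivot_wider(names_from = variable, values_from = value) |>         # one column per variable
  mutate(mean_visits = as.numeric(mean_visits)) |>                   # restore the numeric type
  select(-obs)

tidy
```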
Q4) Consider the table below. Are the data tidy?
location-ril  mean_visits
GC-A1         0.0000
GC-A100       0.1875
GC-A102       0.2500
GC-A104       0.0000
GC-A106       0.0000
GC-A107       0.0000
Here location and ril are combined in a single column, so the data are not tidy. Remember that in tidy data each variable gets its own column. In this format it would be hard to compute, for example, means for each RIL or each location.
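If you met data like this, one way to split the combined column is tidyr's separate(). This is a sketch, not something the chapter covered. The values are copied from the table above, and the object names combined, split, and by_location are made up.

```r
library(dplyr)
library(tidyr)

combined <- tibble(
  `location-ril` = c("GC-A1", "GC-A100", "GC-A102", "GC-A104", "GC-A106", "GC-A107"),
  mean_visits    = c(0, 0.1875, 0.25, 0, 0, 0)
)

# Split the combined column at the hyphen into two separate variables
split <- combined |>
  separate(`location-ril`, into = c("location", "ril"), sep = "-")

# With location in its own column, a grouped mean is straightforward
by_location <- split |>
  group_by(location) |>
  summarize(mean_visits = mean(mean_visits))

by_location
```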
Q5) Consider the table below. Are the data tidy?
ril   GC      SR
A1    0.0000  0.6667
A100  0.1875  0.5833
A102  0.2500  0.6667
A104  0.0000  1.7500
A106  0.0000  0.5000
A107  0.0000  1.5000
This is known as “wide format” and is not tidy. Here the values of the variable location (GC and SR) are used as column headings. This can be a fine way to present data to people, but it is not the format we use to analyze data.
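Wide data like this can be reshaped into tidy (long) format with tidyr's pivot_longer(). This is a minimal sketch using the values from the table above. The object names wide and long are made up.

```r
library(dplyr)
library(tidyr)

wide <- tibble(
  ril = c("A1", "A100", "A102", "A104", "A106", "A107"),
  GC  = c(0, 0.1875, 0.25, 0, 0, 0),
  SR  = c(0.6667, 0.5833, 0.6667, 1.75, 0.5, 1.5)
)

# Column headings GC and SR become values of a new location column
long <- wide |>
  pivot_longer(cols = c(GC, SR), names_to = "location", values_to = "mean_visits")

long
```

The result has one row per RIL-by-location combination, so location is once again a variable you can group by or filter on.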
Q6) You should always make sure data are tidy when (pick the best answer)
Q7) What is wrong with the code below? (Pick the most egregious issue.)
Q8) After running the code below, how many rows and columns will the output tibble have? NOTE: The original data have 593 rows, 7 columns, and 186 unique RILs.
Tibbles: A modern update to data frames, offering:
Cleaner printing (only first 10 rows, fits columns to screen).
Explicit display of data types (e.g., <chr>, <dbl>).
Strict subsetting (prevents automatic type conversion).
Character data is not automatically converted to factors.
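The subsetting and type behaviors listed above can be seen in a small comparison. This is a sketch with made-up object names (df, tb) and a few values borrowed from the practice questions.

```r
library(tibble)

df <- data.frame(ril = c("A1", "A100"), mean_visits = c(0, 0.1875))
tb <- tibble(ril = c("A1", "A100"), mean_visits = c(0, 0.1875))

# Selecting one column with [ , ]: a data frame silently drops to a vector,
# while a tibble stays a tibble
df_col <- df[, "mean_visits"]
tb_col <- tb[, "mean_visits"]

is.numeric(df_col)   # TRUE: a plain numeric vector
is_tibble(tb_col)    # TRUE: still a one-column tibble

# Character columns in a tibble stay character
class(tb$ril)        # "character"
```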
Piping (|>) functions: A way to chain operations together, making code more readable and modular.
Grouping in Data Analysis: Grouped operations allow calculations within subsets of data (e.g., mean visits per location).
Missing Data (NA): R uses NA to represent missing values. Operations on NA return NA unless missing values are handled explicitly (e.g., na.rm = TRUE in mean() to ignore missing values, or use = "pairwise.complete.obs" in cor()).
Warnings: Indicate a possible issue but allow code to run (e.g., NAs introduced by coercion).
Errors: Stop execution completely when something is invalid.
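The glossary entries on missing data, warnings, and errors can be illustrated with a few lines of base R. This is a sketch with made-up values. The tryCatch() call is used here only so the example can run past the error, and it is not something the chapter introduced.

```r
visits <- c(2, NA, 4)

mean_with_na    <- mean(visits)                # NA: the missing value propagates
mean_without_na <- mean(visits, na.rm = TRUE)  # 3: NA is dropped before averaging

# A warning: the code still runs and returns a result, with NA where
# conversion failed (suppressWarnings() only hides the message)
converted <- suppressWarnings(as.numeric(c("2", "many")))

# An error: execution stops unless the error is caught
error_result <- tryCatch(
  log("many"),
  error = function(e) "error caught"
)
```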
Key R functions
read_csv() (readr): Reads a CSV file into R as a tibble, automatically guessing column types.
select() (dplyr): Selects specific columns from a dataset.
mutate() (dplyr): Creates or modifies columns in a dataset.
case_when() (dplyr): Replaces values conditionally within a column.
as.numeric(): Converts a vector to a numeric data type.
summarize() (dplyr): Computes summary statistics on a dataset (e.g., mean, sum).
mean(): Computes the mean (average) of a numeric vector.
Argument na.rm = TRUE: An argument used in functions like mean() and sd() to remove missing values (NA) before computation.
pull() (dplyr): Extracts a single column from a tibble as a vector.
group_by() (dplyr): Groups data by one or more variables for summary operations.
|> (Base R Pipe Operator): Passes the result of one function into another, making code more readable.
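Several of the functions above often appear together in one pipeline. The sketch below uses a made-up tibble raw in which visit counts were entered as text, including the word "none". It is an illustration, not the chapter's data.

```r
library(dplyr)

# Hypothetical raw data with counts stored as text
raw <- tibble(
  location = c("GC", "GC", "SR", "SR"),
  visits   = c("2", "none", "3", "1")
)

clean <- raw |>
  mutate(
    visits = case_when(visits == "none" ~ "0",   # recode the text entry
                       TRUE ~ visits),           # leave everything else alone
    visits = as.numeric(visits)                  # now safe to convert
  )

mean_by_location <- clean |>
  group_by(location) |>
  summarize(mean_visits = mean(visits, na.rm = TRUE))

# pull() returns a plain vector rather than a tibble
sr_mean <- mean_by_location |>
  filter(location == "SR") |>
  pull(mean_visits)
```

Recoding with case_when() before calling as.numeric() avoids the "NAs introduced by coercion" warning that converting "none" directly would trigger.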
R Packages Introduced
readr: A tidyverse package for reading rectangular data files (e.g., read_csv()).
dplyr: A tidyverse package for data manipulation, including mutate(), glimpse(), and across().