location | GC | GC | GC | GC | GC | GC |
---|---|---|---|---|---|---|
ril | A1 | A100 | A102 | A104 | A106 | A107 |
mean_visits | 0.0000 | 0.1875 | 0.2500 | 0.0000 | 0.0000 | 0.0000 |
• 2. Data in R summary
Links to: Summary, Chatbot Tutor, Practice Questions, Glossary, R functions, R packages introduced, and Additional resources.
Chapter summary
Keeping data in the tidy format, where each column represents a variable and each row represents an observation, allows you to fully leverage the tools of the tidyverse. In the tidyverse, data are stored in tibbles, a modern update to data frames that enhances readability and maintains consistent data types. The dplyr package offers a suite of intuitive functions for transforming and analyzing data. For example, mutate() lets you create or modify variables, while summarize() computes summary statistics. When paired with group_by(), summarize() generates summaries for each group. Other essential functions include select() for choosing columns, filter() for subsetting rows, rename() for renaming columns, and arrange() for ordering data. Together, and especially when chained with the pipe operator (group_by(...) |> summarize(...)), these tools enable clear, reproducible workflows. In the next chapter, you’ll see how tidy data also powers beautiful and flexible plots using ggplot2.
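As a minimal sketch of the workflow summarized above, the code below chains several dplyr verbs with the pipe. The tibble `visits` is a hypothetical toy dataset typed in by hand from the mean_visits values shown in the practice questions for six RILs at the GC and SR locations. The names `visits`, `any_visits`, `per_location`, `sr_sorted`, and `ril_id` are illustrative assumptions, not names used in the chapter.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: one row per RIL-by-location observation (tidy format).
visits <- tibble(
  location    = rep(c("GC", "SR"), each = 6),
  ril         = rep(c("A1", "A100", "A102", "A104", "A106", "A107"), times = 2),
  mean_visits = c(0.0000, 0.1875, 0.2500, 0.0000, 0.0000, 0.0000,
                  0.6667, 0.5833, 0.6667, 1.7500, 0.5000, 1.5000)
)

# mutate() adds a variable; group_by() + summarize() returns one row per group.
per_location <- visits |>
  mutate(any_visits = mean_visits > 0) |>
  group_by(location) |>
  summarize(avg_visits = mean(mean_visits),
            n_visited  = sum(any_visits))

# filter() keeps rows, select() keeps columns, rename() renames, arrange() sorts.
sr_sorted <- visits |>
  filter(location == "SR") |>
  select(ril, mean_visits) |>
  rename(ril_id = ril) |>
  arrange(desc(mean_visits))

per_location
sr_sorted
```

Because every step takes a tibble and returns a tibble, each verb can be added or removed from the chain without rewriting the rest.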
Chatbot tutor
Please interact with this custom chatbot (link here) that I made to help you with this chapter. I suggest going through at least ten back-and-forth exchanges to ramp up, and then stopping when you feel you have gotten what you need from it.
Practice Questions
Try these questions! Use the R environment below to work without changing tabs.
Q2) Revisit the pollinator visitation dataset we explored. Which location has a greater anther stigma distance (asd_mm)?
Q3) Consider the table below. The data are
Here the data are transposed, so the data are not tidy. Remember that in tidy data each variable is a column, not a row. This layout is particularly hard for R because each column then contains several types of data.
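To illustrate why mixed types within a column are a problem, here is a small sketch using one column of a transposed table like the one described (a location, a RIL, and a visit value stacked together). The vector name `transposed_col` is a hypothetical name chosen for this example.

```r
# One "column" of a transposed table mixes a location, a RIL, and a number.
transposed_col <- c("GC", "A1", 0.0000)

# An R vector holds a single type, so the number is silently coerced to text.
class(transposed_col)

# The visit value can no longer be used in arithmetic without converting it back.
is.numeric(transposed_col)
```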
Q4) Consider the table below. The data are
location-ril | mean_visits |
---|---|
GC-A1 | 0.0000 |
GC-A100 | 0.1875 |
GC-A102 | 0.2500 |
GC-A104 | 0.0000 |
GC-A106 | 0.0000 |
GC-A107 | 0.0000 |
Here location and ril are combined in a single column, so the data are not tidy. Remember that in tidy data each variable is its own column. In this format it would be hard to get, for example, the mean for each RIL or for each location.
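One way to repair a combined column like this is sketched below. It assumes the tidyr package, which this chapter summary does not list among its key functions, and uses its `separate()` function to split the column at the hyphen. The object names `combined` and `tidy_visits` are hypothetical.

```r
library(tibble)
library(tidyr)

# The untidy table from the question: two variables share one column.
combined <- tibble(
  `location-ril` = c("GC-A1", "GC-A100", "GC-A102", "GC-A104", "GC-A106", "GC-A107"),
  mean_visits    = c(0.0000, 0.1875, 0.2500, 0.0000, 0.0000, 0.0000)
)

# separate() splits the combined column at the hyphen into two new columns.
tidy_visits <- combined |>
  separate(`location-ril`, into = c("location", "ril"), sep = "-")

tidy_visits
```

With location and ril in their own columns, either one can be passed to group_by() to summarize by RIL or by location.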
Q5) Consider the table below. The data are
ril | GC | SR |
---|---|---|
A1 | 0.0000 | 0.6667 |
A100 | 0.1875 | 0.5833 |
A102 | 0.2500 | 0.6667 |
A104 | 0.0000 | 1.7500 |
A106 | 0.0000 | 0.5000 |
A107 | 0.0000 | 1.5000 |
This is known as “wide format” and is not tidy. Here the variable, location, is used as a column heading. This can be a fine way to present data to people, but it’s not how we are analyzing data.
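A wide table like this can be reshaped into tidy form. The sketch below assumes the tidyr package (not among the functions listed for this chapter) and its `pivot_longer()` function. The object names `wide` and `long` are hypothetical, and the values are copied from the table in the question.

```r
library(tibble)
library(tidyr)

# The wide table from the question: location values (GC, SR) are column headings.
wide <- tibble(
  ril = c("A1", "A100", "A102", "A104", "A106", "A107"),
  GC  = c(0.0000, 0.1875, 0.2500, 0.0000, 0.0000, 0.0000),
  SR  = c(0.6667, 0.5833, 0.6667, 1.7500, 0.5000, 1.5000)
)

# pivot_longer() moves the column headings into a location column
# and the cell values into a mean_visits column.
long <- wide |>
  pivot_longer(cols = c(GC, SR),
               names_to = "location",
               values_to = "mean_visits")

long
```

The result has one row per RIL-by-location combination, which is the shape that group_by() and summarize() expect.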
Q6) You should always make sure data are tidy when (pick the best answer)
Q7) What is wrong with the code below? (Pick the most egregious issue.)
iris <- iris |>
  summarise(mean_sepal_length = mean(Sepal.Length))
Q8) After running the code below, how many rows and columns will the output tibble have? NOTE: The original data has 593 rows, 7 columns, and 186 unique RILs.
ril_data |>
  group_by(ril) |>
  summarize(avg_visits = mean(mean_visits, na.rm = TRUE))
Glossary of Terms
- Tidy Data: A structured format where:
  - Each row represents an observation.
  - Each column represents a variable.
  - Each cell contains a single measurement.
- Tibbles: A modern form of a data frame in R with:
  - Cleaner printing (only first 10 rows, fits columns to screen).
  - Explicit display of data types (e.g., <dbl>, <chr>).
  - Strict subsetting (prevents automatic type conversion).
  - Character data is not automatically converted to factors.
- Piping (|>) functions: A way to chain operations together, making code more readable and modular.
- Grouping in Data Analysis: Grouped operations allow calculations within subsets of data (e.g., mean visits per location).
- Missing Data (NA): R uses NA to represent missing values. Operations with NA return NA unless handled explicitly (e.g., na.rm = TRUE to ignore missing values, use = "pairwise.complete.obs", etc.).
- Warnings: Indicate a possible issue but allow code to run (e.g., NAs introduced by coercion).
- Errors: Stop execution completely when something is invalid.
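Several of the glossary terms above can be seen in a few lines of R. The sketch below is illustrative only: the object names (`df`, `tb`, `converted`, `error_message`) and the toy values are assumptions, and `log("a")` is simply a convenient way to trigger an error.

```r
library(tibble)

# Tibbles: strict subsetting keeps the result a tibble.
df <- data.frame(ril = c("A1", "A100", "A102"), mean_visits = c(0, 0.1875, NA))
tb <- as_tibble(df)
class(df[, "mean_visits"])   # a data frame drops to a plain vector
class(tb[, "mean_visits"])   # a tibble stays a tibble

# Missing data: NA propagates unless it is handled explicitly.
mean(tb$mean_visits)
mean(tb$mean_visits, na.rm = TRUE)

# Warning: the code still runs, but R flags the entry it had to turn into NA.
converted <- withCallingHandlers(
  as.numeric(c("1", "2", "three")),
  warning = function(w) {
    message("Warning caught: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

# Error: execution of the failing call stops; tryCatch() captures the message.
error_message <- tryCatch(
  log("a"),
  error = function(e) conditionMessage(e)
)

converted
error_message
```

The handlers are only there so the example runs from top to bottom. In interactive use you would normally just read the warning or error that R prints.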
Key R functions
- read_csv() (readr): Reads a CSV file into R as a tibble, automatically guessing column types.
- case_when() (dplyr): Replaces values conditionally within a column.
- as.numeric(): Converts a vector to a numeric data type.
- summarize() (dplyr): Computes summary statistics on a dataset (e.g., mean, sum).
- mean(): Computes the mean (average) of a numeric vector.
  - Argument: na.rm = TRUE: An argument used in functions like mean() and sd() to remove missing values (NA) before computation.
- pull() (dplyr): Extracts a single column from a tibble as a vector.
- group_by() (dplyr): Groups data by one or more variables for summary operations.
- |> (Base R Pipe Operator): Passes the result of one function into another, making code more readable.
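The functions listed above can be combined in one short pipeline. This is a sketch under stated assumptions: the CSV contents, the "none" placeholder for a missing value, and the object names (`csv_path`, `raw`, `cleaned`, `avg_by_location`, `avg_vector`) are all hypothetical and written to a temporary file so the example is self-contained.

```r
library(readr)
library(dplyr)

# Write a small hypothetical CSV to a temporary file so read_csv() has input.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("location,ril,visits",
             "GC,A1,0",
             "GC,A100,0.1875",
             "SR,A1,none",
             "SR,A100,0.5833"),
           csv_path)

# read_csv() guesses column types; "none" makes visits a character column.
raw <- read_csv(csv_path)

# case_when() recodes the placeholder to NA, then as.numeric() converts the column.
cleaned <- raw |>
  mutate(
    visits = case_when(visits == "none" ~ NA_character_,
                       TRUE             ~ visits),
    visits = as.numeric(visits)
  )

# group_by() + summarize() gives one mean per location, ignoring the NA.
avg_by_location <- cleaned |>
  group_by(location) |>
  summarize(avg_visits = mean(visits, na.rm = TRUE))

# pull() extracts the summary column as a plain numeric vector.
avg_vector <- avg_by_location |> pull(avg_visits)
avg_vector
```

Recoding the placeholder before calling as.numeric() avoids the "NAs introduced by coercion" warning, because the conversion never sees a non-numeric string.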
R Packages Introduced
Additional resources
R Recipes:
- Selecting columns.
- Add a new column (or modify an existing one).
- Summarize data.
- Summarize data by group.
Other web resources:
- Chapter 10: Tidy data from R for data science (Grolemund & Wickham (2018)).
- Animated dplyr functions from R for the Rest of Us.
Videos:
- Basic Data Manipulation (From Stat454).
- Calculations on tibble (From Stat454).