• 6. Association Summary

Links to: Summary. Chatbot tutor. Questions. Glossary. R functions. R packages. More resources.

Chapter Summary

Correlation doesn't imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing *look over there*.

A cartoon on correlation from xkcd. The original rollover text says: “Correlation doesn’t imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing look over there”. See this link for a more detailed explanation.

Associations reveal how variables relate to one another - e.g. if they tend to increase together, differ across groups, or cluster. Differences in conditional means (or proportions) describe how a numeric (or categorical) response variable varies across levels of a categorical explanatory variable. For two numeric variables, covariance captures how deviations from their means align, and correlation standardizes this to a unitless scale between -1 and 1. While these summaries can highlight patterns, interpretation requires care: strong associations don’t necessarily imply causation, and predictions may not hold across contexts or datasets.

Chatbot tutor

Please interact with this custom chatbot (link here) I have made to help you with this chapter. I suggest interacting with at least ten back-and-forths to ramp up and then stopping when you feel like you got what you needed from it.

Practice Questions

Try these questions! By using the R environment you can work without leaving this “book”. To help you jump right into thinking and analysis, I have loaded the ril data, cleaned it some, an have started some of the code!

Q1) Extend the analysis above to examine the association between leaf water content (lwc) and the proportion of hybrid seeds (prop_hybrid). The correlation between lwc and prop_hybrid is:

Q2) Based on the analysis above, which variable – leaf water content (lwc), or petal area (log10_petal_area_mm) is more plausibly interpreted as influencing proportion hybrid seed set (prop_hybrid)?
Q3) Based on the observed negative association between leaf water content and proportion hybrid seed set, which explanation best accounts for this pattern?

The set of questions below focuses on comparing the association between petal color and pollinator visitation to the association between petal color and proportion hybrid seed. Use the webR console above to work through these!

Q4) The difference in conditional mean hybrid proportion between pink and white flowers is:

Q5) The pooled standard deviation of hybrid proportion between pink and white flowers is:

Q6) Which trait is more strongly associated with petal color — the proportion of hybrid seeds or visits from a pollinator (visits)?

Q7 SETUP We collected 131 plants (74 parviflora, 57 xantiana) from a natural hybrid zone between xantiana and parviflora at Sawmill Road. We then genotyped these plants at a chloroplast marker that distinguishes between chloroplasts originating from parviflora and xantiana. All 74 parviflora plants had a parviflora chloroplast, while 49 of the 57 xantiana plants had a xantiana chloroplast (the remaining 8 had a parviflora chloroplast).

Q7A) If having a xantiana chloroplast and being a xantiana plant were independent, what proportion of plants would you expect to be xantiana and have a xantiana chloroplast?

If two binary variables are independent, the expected joint proportion (i.e. the probability of A and B) is the product of their proportions:

\[ P(A \text{ and } B) = P(A) \times P(B) \]

Q7B) Quantify the difference between the proportion of plants that are xantiana and have xantiana chloroplasts vs. what we expect if these two binary variables were independent.

Q7C) What is the covariance between being a xantiana plant and having a xantiana chloroplast? Hint: remember Bessel’s correction.

Q8) In the code above, I calculated the correlation and covariance between lwc and prop_hybrid using their mathematical formulas. However, my calculated values don’t match those returned by cor(). Why not?

Q9 SETUP Consider the plots above

Q9A) In which plot are x and y most tightly associated?

Q9B) In which plot are x and y most tightly linearly associated?

Q9C) In which plot do x and y have the largest correlation coefficient?

Q9C) In which plot are does x do the worst job of predicting y?

📊 Glossary of Terms

🔗 1. Types of Association

  • Association: A relationship or pattern between two variables, without assuming causation.
  • Correlation: A numerical summary of how two variables move together.
    • Positive: As one increases, the other tends to increase.
    • Negative: As one increases, the other tends to decrease.
  • Causation: A relationship in which changes in one variable directly produce changes in another.

⚖️ 2. Categorical Associations

  • Conditional Proportion: The proportion of a category (e.g., visited flowers) within levels of another variable (e.g., pink or white petals).
    • Written as \(P(A|B)\), the probability of A given B.
  • Multiplication Rule: If two variables are independent, then \(P(A \text{ and } B) = P(A) \times P(B)\).
  • Relative Risk: The ratio of conditional proportions between two groups.
  • Confounding Variable: A third variable that creates a false appearance of association between two others.

🔢 3. Numeric Associations

  • Covariance (cov()): Measures how two numeric variables co-vary.

    • Positive: variables increase together.
    • Negative: one increases as the other decreases.
    • Sensitive to scale.
  • Cross Product: For two variables, the product of their deviations from their means:
    \((X_i - \bar{X})(Y_i - \bar{Y})\)

  • Correlation Coefficient (cor()): A unitless summary of linear association, ranging from -1 to 1.
    \(r = \frac{\text{Cov}_{X,Y}}{s_X s_Y}\)

    • r ≈ 0: No linear relationship
    • r > 0: Positive linear relationship
    • r < 0: Negative linear relationship

📏 4. Comparing Group Means

  • Conditional Mean: The average of a numeric variable within each group of a categorical variable.

  • Difference in Means: A common summary of how a numeric variable differs across groups.

  • Cohen’s D: Standardized difference between two group means.
    \(D = \frac{\bar{X}_1 - \bar{X}_2}{s_{pooled}}\)

  • Pooled Standard Deviation: A weighted average of within-group standard deviations, used in Cohen’s D.


📈 5. Visual Summaries of Associations

  • Scatterplot: Plots individual observations for two numeric variables. Good for spotting trends and calculating correlation.
  • Boxplot: Shows distributions (medians, IQRs) across groups.
  • Barplot of Conditional Proportions: Visualizes proportions of one categorical variable within levels of another.
  • Sina Plot: A jittered density-style plot used to show distributions of numeric values within categories, especially useful when overplotting is an issue.

Key R Functions

📊 Visualizing Associations


📈 Summarizing Associations Between Variables

  • group_by() ([dplyr]): Groups data for grouped summaries like conditional proportions or means.
  • summarise() ([dplyr]): Summarizes multiple rows into a single value, e.g., a mean, covariance, or correlation.
  • mean() ([base R]): Computes means (or proportions). In this chapter we combine this with group_by() to find conditional means (or conditional proportions).
  • cov(): Calculates covariance between two numeric variables.
  • cor(): Calculates the correlation coefficient.

We often combine these below with the following chain of operations.
- For conditional means: data|>group_by()|>summarize(mean()).
- For associations: data |>group_by()|>summarize(cor()).

R Packages Introduced

  • GGally: Extends ggplot2 with convenient functions for exploring relationships among multiple variables. The ggpairs() function produces a matrix of plots showing pairwise associations, including histograms, scatterplots, and correlation coefficients.

  • ggforce: Provides advanced geoms for ggplot2. This chapter uses geom_sina() to reduce overplotting by jittering points while preserving density.

Additional resources

Other web resources:

Videos:

  • Correlation Doesn’t Equal Causation: Crash Course Statistics #8.

  • Calling Bullshit has a fantastic set of videos on correlation and causation.

    • Correlation and Causation: “Correlations are often used to make claims about causation. Be careful about the direction in which causality goes. For example: do food stamps cause poverty?”
    • What are Correlations? :“Jevin providers an informal introduction to linear correlations.”
    • Spurious Correlations?: “We look at Tyler Vigen’s silly examples of quantities appear to be correlated over time), and note that scientific studies may accidentally pick up on similarly meaningless relationships.”
    • Correlation Exercise” “When is correlation all you need, and causation is beside the point? Can you figure out which way causality goes for each of several correlations?”
    • Common Causes: “We explain how common causes can generate correlations between otherwise unrelated variables, and look at the correlational evidence that storks bring babies. We look at the need to think about multiple contributing causes. The fallacy of post hoc propter ergo hoc: the mistaken belief that if two events happen sequentially, the first must have caused the second.”
    • Manipulative Experiments: “We look at how manipulative experiments can be used to work out the direction of causation in correlated variables, and sum up the questions one should ask when presented with a correlation.