MathBored

Essential Math Primer
H48 • Lesson 70 of 105

Correlation vs. Causation

Relationship analysis, correlation coefficient, inference

High School Essentials • 9-12

Prerequisites: H47

Key Concepts

  • correlation
  • causation
  • inference

"Ice cream sales and drowning deaths both increase in summer -- does ice cream cause drowning?" Of course not. Both are driven by a third factor: warm weather. Learning to distinguish correlation (two things move together) from causation (one thing makes the other happen) is one of the most important skills in data literacy.

Scatter Plots and Correlation

A scatter plot displays paired data as points on a coordinate plane. The pattern reveals the relationship between the two variables.

Positive Correlation

As x increases, y tends to increase. Points trend upward from left to right.

Example: study hours and test scores.

Negative Correlation

As x increases, y tends to decrease. Points trend downward.

Example: price of a product and quantity demanded.

No Correlation

There is no clear pattern. The points form a shapeless cloud.

The Correlation Coefficient r

The correlation coefficient r quantifies the strength and direction of a linear relationship between two variables.

Value of r          Interpretation
r = 1               Perfect positive linear relationship
r = -1              Perfect negative linear relationship
r = 0               No linear relationship
0.7 ≤ |r| ≤ 1       Strong linear association
0.4 ≤ |r| < 0.7     Moderate linear association
|r| < 0.4           Weak linear association
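
As a rough illustration of how r is computed and read against the cutoffs above, here is a minimal Python sketch. The function names `pearson_r` and `describe_strength` and the six data points are made up for this example; the formula is the standard Pearson correlation coefficient, which this lesson does not spell out.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator pieces: sums of squared deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def describe_strength(r):
    """Label |r| using the cutoffs from the table in this lesson."""
    size = abs(r)
    if size >= 0.7:
        return "strong"
    if size >= 0.4:
        return "moderate"
    return "weak"

# Hypothetical data: hours studied and test scores for six students.
hours = [1, 2, 3, 4, 5, 6]
scores = [55, 62, 60, 71, 78, 80]

r = pearson_r(hours, scores)
direction = "positive" if r > 0 else "negative"
print(round(r, 2), describe_strength(r), direction)
```

The sign of r gives the direction and the size of |r| gives the strength, so the two are reported separately.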

Key Property of r

The correlation coefficient r only measures linear relationships. A data set with a perfect U-shape pattern might have r ≈ 0 even though there is a strong (non-linear) relationship. Always look at the scatter plot, not just the number.
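
A small sketch can make this concrete. The data below are constructed for illustration: y is exactly x squared, so the points form a perfect U, yet the computed r is zero. The helper `pearson_r` is a hypothetical name for the standard formula.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Constructed data: a perfect U shape, symmetric around x = 0.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]  # y is completely determined by x

r_u = pearson_r(xs, ys)
print(r_u)  # the downward left half and upward right half cancel out
```

The left half of the U slopes down and the right half slopes up, so the two halves cancel in the calculation. Only the scatter plot reveals the pattern.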

Why Correlation Does Not Imply Causation

Even when two variables are strongly correlated, there are several possible explanations besides direct causation:

Explanation           Description                                     Example
Confounding variable  A third variable causes both                    Ice cream sales and drownings (both driven by summer heat)
Reverse causation     The direction is opposite to what you assumed   Firefighters at fires: bigger fires draw more firefighters; more firefighters do not cause bigger fires
Coincidence           Random chance produced a pattern                Correlation between Nicolas Cage films and pool drownings
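
The confounding case can be sketched with a small constructed data set. None of these numbers are real: temperature is built to drive both ice cream sales and drownings, while the day-to-day variation in the two is deliberately unrelated. The names `pearson_r`, `wiggle_a`, and `wiggle_b` are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Constructed (not real) data: 5 temperature levels, 4 days at each level.
temps = [15, 20, 25, 30, 35]
wiggle_a = [1, -1, 1, -1]  # day-to-day variation in ice cream sales
wiggle_b = [1, 1, -1, -1]  # day-to-day variation in drownings, unrelated to wiggle_a

days = []  # each entry: (temperature, ice_cream_sales, drownings)
for t in temps:
    for a, b in zip(wiggle_a, wiggle_b):
        days.append((t, 10 * t + 20 * a, t // 5 + b))

# Correlation across all days, ignoring temperature.
ice = [d[1] for d in days]
drown = [d[2] for d in days]
r_all = pearson_r(ice, drown)
print("all days:", round(r_all, 2))

# Correlation among days that share the same temperature.
within = {}
for t in temps:
    same_temp = [d for d in days if d[0] == t]
    within[t] = pearson_r([d[1] for d in same_temp], [d[2] for d in same_temp])
    print("temperature", t, ":", round(within[t], 2))
```

Across all days the correlation is strong (roughly 0.79), but among days with the same temperature it drops to zero. In this constructed example, the whole association comes from the shared cause.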

Worked Example 1 -- Identifying Confounding Variables

Study finds: students who eat breakfast score higher on tests. Does breakfast cause better scores?

  • Possible confounders: families who provide breakfast may also provide a stable home, more sleep, better nutrition overall, and stronger academic support.
  • Conclusion: we cannot conclude causation from this observational data alone. The correlation is real, but the cause may be the overall home environment, not breakfast itself.

Observational vs. Experimental Studies

Observational Study

Researchers observe and record data without interfering. They can find correlations but cannot establish causation because confounders may be present.

Experiment (Randomized Controlled Trial)

Researchers randomly assign subjects to treatment and control groups. Randomization balances confounders across the groups on average, so a difference in outcomes can be attributed to the treatment rather than to pre-existing differences between the groups.

Worked Example 2 -- Study Design

A researcher wants to know if a new fertilizer increases crop yield. Describe an experiment.

  • Randomly divide 100 identical plots into two groups of 50.
  • Treatment group: apply the new fertilizer. Control group: no fertilizer (or a standard one).
  • Keep all other conditions (water, sunlight, soil) the same.
  • Compare average yields. If the treatment group yields significantly more, we have evidence the fertilizer caused the increase.
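
The random assignment step above could be sketched as follows. This is a minimal illustration, not a full analysis: the function name `assign_groups`, the plot numbering, and the seed value are assumptions made for the example.

```python
import random

def assign_groups(plot_ids, seed=None):
    """Shuffle the plots and split them into two equal-sized groups."""
    rng = random.Random(seed)   # a fixed seed makes the assignment repeatable
    shuffled = list(plot_ids)   # copy so the original list is left unchanged
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical plot labels 1 through 100.
plots = list(range(1, 101))
treatment, control = assign_groups(plots, seed=42)

print(len(treatment), len(control))
print(sorted(treatment)[:5])  # a few of the plots that get the new fertilizer
```

Because chance alone decides which plots receive the fertilizer, no property of a plot (soil quality, drainage, location) can systematically favor one group.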

Worked Example 3 -- Interpreting r

A study finds r = -0.82 between hours of TV watched per day and GPA. Interpret this.

  • The negative sign means as TV hours increase, GPA tends to decrease.
  • |r| = 0.82 indicates a strong linear association.
  • However, this is likely an observational study. We cannot conclude TV causes lower GPA. Confounders (motivation, time management, socioeconomic factors) may explain the pattern.

Common Mistake

Assuming that a large or statistically significant correlation proves causation. Even r = 0.95 does not prove that one variable causes the other. Only a well-designed experiment with randomization and controls can support causal claims.

Practice Problems

  1. A scatter plot shows points trending downward from left to right with moderate scatter. Estimate the correlation coefficient.
    Solution:

    A downward trend with moderate scatter suggests a moderate negative correlation, so r is roughly between -0.4 and -0.7.

  2. "Countries that consume more chocolate win more Nobel Prizes." Name a likely confounding variable.
    Solution:

    National wealth (GDP per capita). Wealthier countries can afford both more chocolate consumption and more research funding, which produces more Nobel laureates.

  3. Is the following an observational study or an experiment? "Researchers surveyed 500 people about their exercise habits and measured their blood pressure."
    Solution:

    Observational study -- the researchers did not assign exercise habits; they simply recorded existing behavior.

  4. If r = 0.02 between shoe size and intelligence, what can you conclude?
    Solution:

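    There is essentially no linear relationship between shoe size and intelligence: r = 0.02 is almost exactly 0. Because r only measures linear relationships, a scatter plot would be needed to rule out a non-linear pattern.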

  5. A randomized experiment finds that patients taking a new drug recover faster than those taking a placebo (p < 0.01). Can we conclude the drug caused faster recovery? Explain.
    Solution:

    Yes, cautiously. Because subjects were randomly assigned, confounders are balanced between groups on average. The statistically significant result (p < 0.01) means a difference this large would be unlikely if the drug had no effect, so it provides strong evidence that the drug, not some other factor, caused the faster recovery. Replication would strengthen the conclusion further.
