Correlation vs. Causation

"Ice cream sales and drowning deaths both increase in summer -- does ice cream cause drowning?" Of course not. Both are driven by a third factor: warm weather. Learning to distinguish correlation (two things move together) from causation (one thing makes the other happen) is one of the most important skills in data literacy.

Scatter Plots and Correlation

A scatter plot displays paired data as points on a coordinate plane. The pattern reveals the relationship between the two variables.

Positive Correlation

As x increases, y tends to increase. Points trend upward from left to right.

Example: study hours and test scores.

Negative Correlation

As x increases, y tends to decrease. Points trend downward.

Example: price of a product and quantity demanded.

No Correlation

There is no clear pattern. The points form a shapeless cloud.

The Correlation Coefficient r

The correlation coefficient r quantifies the strength and direction of a linear relationship between two variables.
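As a rough illustration of how r is obtained from paired data, the sketch below computes the Pearson correlation coefficient from its usual definition: the sum of products of deviations from the means, divided by the square root of the product of the sums of squared deviations. The function name `pearson_r` and the six data points are made up for this example and are not taken from any real study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: how x and y deviate from their means together.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator pieces: how much each variable varies on its own.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: hours studied and test scores for six students.
study_hours = [1, 2, 3, 4, 5, 6]
test_scores = [55, 62, 60, 71, 78, 85]

r = pearson_r(study_hours, test_scores)
print(f"r = {r:.2f}")  # positive and close to 1 for this made-up data
```

Dividing by the spread of each variable is what keeps r between -1 and 1 regardless of the units used for x and y.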

Value of r	Interpretation
r = 1	Perfect positive linear relationship
r = -1	Perfect negative linear relationship
r = 0	No linear relationship
0.7 ≤ |r| ≤ 1	Strong linear association
0.4 ≤ |r| < 0.7	Moderate linear association
|r| < 0.4	Weak linear association
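The table above can be turned into a small lookup, sketched here. The function name `describe_r` is hypothetical, and the cutoffs 0.4 and 0.7 are simply the ones used in this lesson; other textbooks draw the lines in slightly different places.

```python
def describe_r(r):
    """Describe r in words, using the cutoffs from the table above."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    if r == 0:
        return "no linear relationship"
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if size == 1:
        strength = "perfect"
    elif size >= 0.7:
        strength = "strong"
    elif size >= 0.4:
        strength = "moderate"
    else:
        strength = "weak"
    return f"{strength} {direction} linear association"


for value in (0.95, -0.82, 0.55, -0.1):
    print(value, "->", describe_r(value))
```

The description covers only direction and strength. It says nothing about whether one variable causes the other.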

Key Property of r

The correlation coefficient r only measures linear relationships. A data set with a perfect U-shape pattern might have r ≈ 0 even though there is a strong (non-linear) relationship. Always look at the scatter plot, not just the number.
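A quick way to see this is to compute r for points that lie exactly on a parabola. The sketch below uses made-up points on y = x² and a small helper named `pearson_r` (an illustrative name, not a library function).

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# A perfect U shape: every point lies exactly on y = x^2.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]
r_full = pearson_r(xs, ys)

# Only the right-hand side of the same curve, where y rises with x.
xs_right = [0, 1, 2, 3]
ys_right = [x ** 2 for x in xs_right]
r_right = pearson_r(xs_right, ys_right)

print(f"full U shape:    r = {r_full:.2f}")
print(f"right half only: r = {r_right:.2f}")
```

On the full U shape the downward left side and the upward right side cancel, so r comes out at essentially zero even though y is completely determined by x. On the right half alone r is strongly positive.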

Why Correlation Does Not Imply Causation

Even when two variables are strongly correlated, there are several possible explanations besides direct causation:

Explanation	Description	Example
Confounding variable	A third variable causes both	Ice cream and drowning (both caused by summer heat)
Reverse causation	The direction is opposite to what you assumed	Firefighters at fires -- bigger fires draw more firefighters; more firefighters do not cause bigger fires
Coincidence	Random chance produced a pattern	Correlation between Nicolas Cage films and pool drownings
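The confounding row can be illustrated with a small simulation. In the sketch below, temperature drives both ice cream sales and a drowning index, and neither of those two affects the other. All numbers (temperature range, slopes, noise levels, the 20 to 22 degree band) are invented for illustration, and `pearson_r` is a helper defined here rather than a library function.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical toy model: temperature is the only thing linking the two variables.
temperature = [rng.uniform(10, 35) for _ in range(5000)]        # degrees Celsius
ice_cream = [20 * t + rng.gauss(0, 60) for t in temperature]    # daily sales
drownings = [0.3 * t + rng.gauss(0, 1.5) for t in temperature]  # toy risk index

# Across all days the two variables look strongly related.
overall_r = pearson_r(ice_cream, drownings)

# Hold the confounder roughly fixed: keep only days between 20 and 22 degrees.
band = [(i, d) for t, i, d in zip(temperature, ice_cream, drownings) if 20 <= t < 22]
band_r = pearson_r([i for i, _ in band], [d for _, d in band])

print(f"all days:             r = {overall_r:.2f}")
print(f"days at 20-22 deg C:  r = {band_r:.2f}")
```

Once temperature is held nearly constant, the correlation between ice cream and drownings mostly disappears, which is what we expect when a third variable is responsible for the pattern.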

Worked Example 1 -- Identifying Confounding Variables

Study finds: students who eat breakfast score higher on tests. Does breakfast cause better scores?

Possible confounders: families who provide breakfast may also provide a stable home, more sleep, better nutrition overall, and stronger academic support.
Conclusion: we cannot conclude causation from this observational data alone. The correlation is real, but the cause may be the overall home environment, not breakfast itself.

Observational vs. Experimental Studies

Observational Study

Researchers observe and record data without interfering. They can find correlations but cannot establish causation because confounders may be present.

Experiment (Randomized Controlled Trial)

Researchers randomly assign subjects to treatment and control groups. Randomization balances confounders across groups on average, so a difference in outcomes can be attributed to the treatment rather than to pre-existing differences between the groups.

Worked Example 2 -- Study Design

A researcher wants to know if a new fertilizer increases crop yield. Describe an experiment.

Randomly divide 100 identical plots into two groups of 50.
Treatment group: apply the new fertilizer. Control group: no fertilizer (or a standard one).
Keep all other conditions (water, sunlight, soil) the same.
Compare average yields. If the treatment group yields significantly more, we have evidence the fertilizer caused the increase.
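The four steps above can be sketched as a simulation. Everything numeric here is an assumption made for illustration: the natural yield of a plot (mean 50, standard deviation 5), the fertilizer's effect (+5), and the use of a permutation test as one possible way to judge whether the difference is "significant".

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is repeatable

N_PLOTS = 100
ASSUMED_EFFECT = 5.0  # hypothetical yield boost from the new fertilizer

# Step 1: randomly divide the plots into two groups of 50.
plot_ids = list(range(N_PLOTS))
rng.shuffle(plot_ids)
treatment_ids = set(plot_ids[:50])

# Steps 2-3: simulate yields. Other conditions are the same for every plot,
# so the only systematic difference between groups is the fertilizer.
yields = []
treated = []
for plot in range(N_PLOTS):
    natural_yield = rng.gauss(50.0, 5.0)  # made-up plot-to-plot variation
    is_treated = plot in treatment_ids
    yields.append(natural_yield + (ASSUMED_EFFECT if is_treated else 0.0))
    treated.append(is_treated)


def mean_difference(values, flags):
    """Average of flagged values minus average of unflagged values."""
    flagged = [v for v, f in zip(values, flags) if f]
    unflagged = [v for v, f in zip(values, flags) if not f]
    return mean(flagged) - mean(unflagged)


# Step 4: compare average yields.
observed = mean_difference(yields, treated)

# Permutation test: how often does a random relabeling of the plots
# produce a difference at least as large as the one observed?
n_shuffles = 2000
labels = treated[:]
at_least_as_large = 0
for _ in range(n_shuffles):
    rng.shuffle(labels)
    if mean_difference(yields, labels) >= observed:
        at_least_as_large += 1
p_value = (at_least_as_large + 1) / (n_shuffles + 1)

print(f"difference in mean yield: {observed:.2f}")
print(f"one-sided permutation p-value: {p_value:.4f}")
```

Because the groups were formed by chance, a small p-value here points to the fertilizer itself rather than to some pre-existing difference between the plots.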

Worked Example 3 -- Interpreting r

A study finds r = -0.82 between hours of TV watched per day and GPA. Interpret this.

The negative sign means as TV hours increase, GPA tends to decrease.
|r| = 0.82 indicates a strong linear association.
However, this is likely an observational study. We cannot conclude TV causes lower GPA. Confounders (motivation, time management, socioeconomic factors) may explain the pattern.

Common Mistake

Assuming that a large or statistically significant correlation proves causation. Even r = 0.95 does not prove that one variable causes the other. Only a well-designed experiment with randomization and controls can support causal claims.

Practice Problems

A scatter plot shows points trending downward from left to right with moderate scatter. Estimate the correlation coefficient.
Solution: Negative direction with moderate scatter suggests r is somewhere around -0.5 to -0.7 (moderate negative correlation).
"Countries that consume more chocolate win more Nobel Prizes." Name a likely confounding variable.
Solution: National wealth (GDP per capita). Wealthier countries can afford both more chocolate consumption and more research funding, which produces more Nobel laureates.
Is the following an observational study or an experiment? "Researchers surveyed 500 people about their exercise habits and measured their blood pressure."
Solution: Observational study -- the researchers did not assign exercise habits; they simply recorded existing behavior.
If r = 0.02 between shoe size and intelligence, what can you conclude?
Solution: There is essentially no linear relationship between shoe size and intelligence in this data. A value this close to 0 rules out a meaningful linear association, although r alone cannot rule out a non-linear one.
A randomized experiment finds that patients taking a new drug recover faster than those taking a placebo (p < 0.01). Can we conclude the drug caused faster recovery? Explain.
Solution: Yes, cautiously. Because subjects were randomly assigned, confounders are balanced between groups. The statistically significant result (p < 0.01) provides strong evidence that the drug, not some other factor, caused the improvement. Replication would strengthen the conclusion further.

Summary

The correlation coefficient r measures the strength and direction of the linear relationship between two variables; it ranges from -1 to 1.
Correlation does NOT imply causation -- confounders, reverse causation, and coincidence can all produce correlations.
Only randomized controlled experiments can establish causation.
Always look at the scatter plot, not just the correlation coefficient.