Relationship analysis, correlation coefficient, inference
High School Essentials • 9-12
"Ice cream sales and drowning deaths both increase in summer -- does ice cream cause drowning?" Of course not. Both are driven by a third factor: warm weather. Learning to distinguish correlation (two things move together) from causation (one thing makes the other happen) is one of the most important skills in data literacy.
A scatter plot displays paired data as points on a coordinate plane. The pattern reveals the relationship between the two variables.
Positive correlation: as x increases, y tends to increase. Points trend upward from left to right.
Example: study hours and test scores.
Negative correlation: as x increases, y tends to decrease. Points trend downward from left to right.
Example: price of a product and quantity demanded.
No correlation: no clear pattern. The points form a shapeless cloud.
The correlation coefficient r quantifies the strength and direction of a linear relationship between two variables.
| Value of r | Interpretation |
|---|---|
| r = 1 | Perfect positive linear relationship |
| r = -1 | Perfect negative linear relationship |
| r = 0 | No linear relationship |
| 0.7 ≤ \|r\| ≤ 1 | Strong linear association |
| 0.4 ≤ \|r\| < 0.7 | Moderate linear association |
| \|r\| < 0.4 | Weak linear association |
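To make the table concrete, here is a minimal Python sketch that computes r from its standard definition (the sum of products of deviations divided by the square root of the product of the sums of squared deviations) and then labels its strength with the cutoffs above. The function names `pearson_r` and `describe_strength` and the study-hours data are invented for illustration, and the cutoffs are a rule of thumb rather than a universal standard.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: sum of products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator pieces: sums of squared deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)

def describe_strength(r):
    """Label |r| using the cutoffs from the table (a rule of thumb)."""
    size = abs(r)
    if size >= 0.7:
        return "strong"
    if size >= 0.4:
        return "moderate"
    return "weak"

# Made-up data for illustration: hours studied and test scores.
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 60, 61, 70, 74, 85]

r = pearson_r(hours, scores)
direction = "positive" if r > 0 else "negative"
print(f"r = {r:.2f}: {describe_strength(r)} {direction} linear association")
```

The sign of r gives the direction and its absolute value gives the strength, which is why `describe_strength` works with `abs(r)`.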
The correlation coefficient r only measures linear relationships. A data set with a perfect U-shape pattern might have r ≈ 0 even though there is a strong (non-linear) relationship. Always look at the scatter plot, not just the number.
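The U-shape warning can be checked numerically. The sketch below uses made-up points that lie exactly on the curve y = x², so y is completely determined by x, yet the linear correlation comes out to zero because the downward left half cancels the upward right half. The helper `pearson_r` is an illustrative implementation, not a reference to any particular library.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up points on a perfect U: y depends entirely on x, but not linearly.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]

r_u_shape = pearson_r(xs, ys)
print(f"r for the U-shaped data: {r_u_shape:.2f}")
```

A value of r near zero here says only that no straight line fits the data, not that the variables are unrelated.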
Even when two variables are strongly correlated, there are several possible explanations besides direct causation:
| Explanation | Description | Example |
|---|---|---|
| Confounding variable | A third variable causes both | Ice cream and drowning (both caused by summer heat) |
| Reverse causation | The direction is opposite to what you assumed | Firefighters at fires -- more firefighters do not cause bigger fires; bigger fires draw more firefighters |
| Coincidence | Random chance produced a pattern | Correlation between Nicolas Cage films and pool drownings |
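The confounding row can be illustrated with a small simulation. In the sketch below, every number is synthetic and chosen only for illustration: temperature drives both ice cream sales and drownings, and neither variable affects the other. The two still come out strongly correlated overall, but the correlation nearly vanishes when only days with almost the same temperature are compared.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic days: temperature is the confounder behind both outcomes.
days = []
for _ in range(5000):
    temp = rng.uniform(10, 35)                # degrees Celsius
    ice_cream = 20 * temp + rng.gauss(0, 40)  # depends on temperature only
    drownings = 0.3 * temp + rng.gauss(0, 1)  # depends on temperature only
    days.append((temp, ice_cream, drownings))

# Correlation across all days: inflated by the shared cause.
r_all = pearson_r([d[1] for d in days], [d[2] for d in days])

# Hold the confounder roughly fixed: keep only days between 24 and 26 degrees.
band = [d for d in days if 24 <= d[0] <= 26]
r_band = pearson_r([d[1] for d in band], [d[2] for d in band])

print(f"all days:          r = {r_all:.2f}")
print(f"similar-temp days: r = {r_band:.2f}")
```

Comparing only similar-temperature days is a simple way to hold the confounder roughly constant; it is a teaching device here, not a full statistical adjustment.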
Study finds: students who eat breakfast score higher on tests. Does breakfast cause better scores?
Observational study: researchers observe and record data without interfering. They can find correlations but cannot establish causation because confounders may be present.
Experiment: researchers randomly assign subjects to treatment and control groups. Randomization balances confounders across the groups on average, so a difference in outcomes is evidence that the treatment caused it.
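Random assignment is easy to sketch in code. The example below is a minimal illustration with invented subjects: each one has a hidden trait (here called `sleep_hours`, a hypothetical confounder) that the researcher never looks at. After shuffling and splitting, the two groups end up with nearly the same average of that trait, which is what "balancing confounders" means in practice.

```python
import random

def randomly_assign(subjects, rng):
    """Shuffle a copy of the subjects and split it into two halves."""
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def mean(values):
    return sum(values) / len(values)

rng = random.Random(42)  # fixed seed so the run is repeatable

# Invented subjects with a hidden trait that could affect the outcome.
subjects = [{"id": i, "sleep_hours": rng.uniform(5, 9)} for i in range(1000)]

treatment, control = randomly_assign(subjects, rng)

# The assignment ignored sleep_hours, yet the group averages come out close.
gap = abs(mean([s["sleep_hours"] for s in treatment])
          - mean([s["sleep_hours"] for s in control]))
print(f"treatment: {len(treatment)}, control: {len(control)}")
print(f"difference in average sleep_hours: {gap:.3f}")
```

The balance holds on average and improves with larger groups; with only a handful of subjects, a random split can still be lopsided by chance.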
A researcher wants to know if a new fertilizer increases crop yield. Describe an experiment.
A study finds r = -0.82 between hours of TV watched per day and GPA. Interpret this.
Common mistake: assuming that a large or statistically significant correlation proves causation. Even r = 0.95 does not prove that one variable causes the other. Only a well-designed experiment with randomization and controls can support causal claims.
Negative direction with moderate scatter suggests r is somewhere around -0.5 to -0.7 (moderate negative correlation).
National wealth (GDP per capita). Wealthier countries can afford both more chocolate consumption and more research funding, which produces more Nobel laureates.
Observational study -- the researchers did not assign exercise habits; they simply recorded existing behavior.
There is essentially no linear relationship between shoe size and intelligence. The variables are not correlated.
Yes, cautiously. Because subjects were randomly assigned, confounders are balanced between groups. The statistically significant result (p < 0.01) provides strong evidence that the drug, not some other factor, caused the improvement. Replication would strengthen the conclusion further.