Finding trend lines for scatter plot data and making predictions
When real-world data is plotted on a graph, it rarely falls perfectly on a line. But often there is a trend -- a general direction. A line of best fit (also called a regression line or trend line) is the straight line that best represents the overall pattern in a scatter plot.
A scatter plot displays data as individual points on a coordinate plane. Each point represents a pair of related values (for example, hours studied and test score).
Scatter plots can show different relationships:
| Pattern | Meaning | Example |
|---|---|---|
| Positive correlation | As x increases, y tends to increase | Height vs. shoe size |
| Negative correlation | As x increases, y tends to decrease | Price vs. demand |
| No correlation | No apparent pattern | Birthday month vs. GPA |
When drawing a trend line by hand:
A teacher records hours studied (x) and test scores (y) for 6 students: (1, 52), (2, 58), (3, 65), (4, 71), (5, 74), (6, 82). The line of best fit is approximately y = 5.6x + 47.
The mathematically "best" line minimizes the sum of the squared vertical distances from the data points to the line. This is called the least squares method.
For each data point, the vertical distance to the line is called a residual. The least squares line makes the sum of all squared residuals as small as possible. You will typically use a calculator or software to find this line exactly.
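As a rough illustration of what that software does, the sketch below fits the least squares line to the six study-time points from the earlier example and compares its sum of squared residuals with that of the rounded line y = 5.6x + 47. The function names `least_squares_line` and `sum_squared_residuals` are made up for this example; the slope and intercept formulas are the standard ones for a line fitted by least squares.

```python
# Data from the example: (hours studied, test score)
points = [(1, 52), (2, 58), (3, 65), (4, 71), (5, 74), (6, 82)]


def least_squares_line(points):
    """Return (slope, intercept) of the least squares line (hypothetical helper)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Slope = sum of (x - mean_x)(y - mean_y) divided by sum of (x - mean_x)^2
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    # The least squares line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_residuals(points, slope, intercept):
    """Add up the squared vertical distances from each point to the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)


slope, intercept = least_squares_line(points)
ssr_exact = sum_squared_residuals(points, slope, intercept)
ssr_rounded = sum_squared_residuals(points, 5.6, 47)

print(f"least squares line: y = {slope:.2f}x + {intercept:.2f}")  # y = 5.83x + 46.60
print(f"squared residuals, least squares line: {ssr_exact:.2f}")  # 5.49
print(f"squared residuals, y = 5.6x + 47:      {ssr_rounded:.2f}")  # 7.36
```

The two lines are close, but the least squares line has the smaller total, which is exactly what "best fit" means here.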
Using the line y = 5.6x + 47, predict the test score for a student who studies 8 hours.
y = 5.6(8) + 47 = 44.8 + 47 = 91.8, so the predicted score is about 92.
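Using y = 5.6x + 47 to predict a score for x = 20 hours gives y = 159 -- impossible on a 100-point test! The linear relationship breaks down beyond the data range (here, 1 to 6 hours). Predicting outside the data range is called extrapolation; predicting inside it is called interpolation. Always be cautious with extrapolation.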
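One way to build that caution into a calculation is to have the prediction report whether x lies inside the range of the original data. The sketch below is only an illustration: the function name `predict` and its parameters are hypothetical, and the default range of 1 to 6 hours comes from the six students in the example.

```python
def predict(x, slope=5.6, intercept=47, x_min=1, x_max=6):
    """Predict y from the trend line and label the prediction.

    x_min and x_max are the smallest and largest x values in the data
    (assumed here to be 1 and 6 hours, as in the study-time example).
    """
    y = slope * x + intercept
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return y, kind


for hours in (4, 8, 20):
    score, kind = predict(hours)
    print(f"{hours} hours -> predicted score {score:.1f} ({kind})")
# 4 hours -> predicted score 69.4 (interpolation)
# 8 hours -> predicted score 91.8 (extrapolation)
# 20 hours -> predicted score 159.0 (extrapolation)
```

The label does not make an extrapolated prediction wrong, but it is a reminder to ask whether the trend can reasonably continue that far.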
Data points: (10, 85), (20, 78), (30, 70), (40, 61), (50, 55). What type of correlation?
Negative correlation. As x increases from 10 to 50, y decreases steadily from 85 to 55.
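The direction of a relationship can also be checked numerically. The sketch below computes the correlation coefficient r, a standard measure that this lesson does not otherwise use: r near +1 suggests a strong positive correlation, r near -1 a strong negative one, and r near 0 little or no linear pattern. The helper names and the cutoff of 0.3 in `describe` are assumptions chosen only for illustration.

```python
import math

# Data points from the question above
points = [(10, 85), (20, 78), (30, 70), (40, 61), (50, 55)]


def correlation(points):
    """Return the (Pearson) correlation coefficient r for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    return sxy / math.sqrt(sxx * syy)


def describe(r, cutoff=0.3):
    """Label the direction of the correlation; the cutoff is an arbitrary choice."""
    if r >= cutoff:
        return "positive correlation"
    if r <= -cutoff:
        return "negative correlation"
    return "little or no correlation"


r = correlation(points)
print(f"r = {r:.3f}: {describe(r)}")  # r = -0.998: negative correlation
```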
Just because two variables are correlated does not mean one causes the other. Ice cream sales and drowning incidents are both correlated with hot weather, but ice cream does not cause drowning. Always consider whether there is a lurking variable.
1. A line of best fit is y = 3x + 10. Predict y when x = 7.
y = 3(7) + 10 = 21 + 10 = 31
2. Data shows: as temperature increases, hot chocolate sales decrease. What type of correlation is this?
Negative correlation. As one variable goes up, the other goes down.
3. A line of best fit is based on data from x = 5 to x = 30. Is predicting at x = 15 interpolation or extrapolation?
Interpolation, because x = 15 falls within the data range (5 to 30).
4. A regression line has slope -2.5. Interpret this in context if x = advertising dollars (thousands) and y = unsold inventory.
For every additional $1,000 spent on advertising, unsold inventory decreases by approximately 2.5 units.
5. Why would predicting y at x = 100 be risky if the data only covers x = 0 to x = 20?
This is extrapolation far beyond the data range. The linear trend observed between 0 and 20 may not hold at x = 100. The relationship could curve, plateau, or reverse.