Line of Best Fit & Linear Regression

When real-world data is plotted on a graph, it rarely falls perfectly on a line, but it often shows a trend: a general direction. A line of best fit (also called a regression line or trend line) is the straight line that best represents the overall pattern in a scatter plot.

Scatter Plots

A scatter plot displays data as individual points on a coordinate plane. Each point represents a pair of related values (for example, hours studied and test score).

Scatter plots can show different relationships:

Pattern	Meaning	Example
Positive correlation	As x increases, y tends to increase	Height vs. shoe size
Negative correlation	As x increases, y tends to decrease	Price vs. demand
No correlation	No apparent pattern	Birthday month vs. GPA

Drawing a Line of Best Fit

When drawing a trend line by hand:

The line should follow the general direction of the data.
Roughly equal numbers of points should fall above and below the line.
The line should pass through the "middle" of the data cluster.
The line does not need to pass through any particular data point.

Worked Example 1: Using a Line of Best Fit

A teacher records hours studied (x) and test scores (y) for 6 students: (1, 52), (2, 58), (3, 65), (4, 71), (5, 74), (6, 82). A reasonable line of best fit is approximately y = 5.6x + 47.

Slope interpretation: for each additional hour studied, the predicted score increases by about 5.6 points.
y-intercept interpretation: the line predicts that a student who studied 0 hours would score about 47 (the baseline).
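
As a quick check of the hand-drawing guidelines, the sketch below compares each student's actual score with the value of the line y = 5.6x + 47 at the same x and counts how many points fall above and below it. The names `trend_line` and `gaps` are illustrative choices, not part of the lesson.

```python
# Data from Worked Example 1: (hours studied, test score)
points = [(1, 52), (2, 58), (3, 65), (4, 71), (5, 74), (6, 82)]


def trend_line(x):
    """Approximate line of best fit from the example: y = 5.6x + 47."""
    return 5.6 * x + 47


# Vertical gap between each actual score and the line's value at the same x.
# A positive gap means the point lies above the line.
gaps = [y - trend_line(x) for x, y in points]

above = sum(1 for g in gaps if g > 0)
below = sum(1 for g in gaps if g < 0)

for (x, y), g in zip(points, gaps):
    print(f"x={x}: actual={y}, line={trend_line(x):.1f}, gap={g:+.1f}")
print(f"{above} points above the line, {below} below")
```

For this data set the split is three points above and three below, which matches the guideline that roughly equal numbers of points should fall on each side.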

The Least Squares Method (Concept)

The mathematically "best" line minimizes the sum of the squared vertical distances from the data points to the line. This is called the least squares method.

For each data point, the residual is the vertical difference between the observed y-value and the value the line predicts at the same x (observed minus predicted). The least squares line makes the sum of all squared residuals as small as possible. You will typically use a calculator or software to find this line exactly.
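
To make the idea concrete, here is a minimal sketch of what a calculator does behind the scenes, applied to the data from Worked Example 1. It uses the standard closed-form result for one variable: the slope is the sum of (x - mean x)(y - mean y) divided by the sum of (x - mean x) squared, and the intercept is mean y minus slope times mean x. The function names are illustrative, not from the lesson.

```python
def least_squares_line(points):
    """Return (slope, intercept) of the least squares line through points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sum of products of deviations, and sum of squared x deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_residuals(points, slope, intercept):
    """Add up (observed - predicted)^2 over all points for a given line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)


# Data from Worked Example 1: (hours studied, test score)
points = [(1, 52), (2, 58), (3, 65), (4, 71), (5, 74), (6, 82)]

slope, intercept = least_squares_line(points)
ssr_exact = sum_squared_residuals(points, slope, intercept)
ssr_approx = sum_squared_residuals(points, 5.6, 47)

print(f"least squares line: y = {slope:.2f}x + {intercept:.1f}")
print(f"sum of squared residuals (least squares line): {ssr_exact:.2f}")
print(f"sum of squared residuals (y = 5.6x + 47):      {ssr_approx:.2f}")
```

The exact line comes out at roughly y = 5.83x + 46.6, close to the approximate line y = 5.6x + 47 used in the worked examples, and its sum of squared residuals is the smaller of the two.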

Worked Example 2: Making a Prediction

Using the line y = 5.6x + 47, predict the test score for a student who studies 8 hours.

Substitute x = 8: y = 5.6(8) + 47 = 44.8 + 47 = 91.8
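Predicted score: about 92 points. Note that 8 hours lies slightly beyond the observed range of 1 to 6 hours, so this prediction assumes the trend continues.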
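
The substitution above can be sketched as a small function. The name `predict` and its default arguments are illustrative; the slope and intercept are the ones given in the example.

```python
def predict(x, slope=5.6, intercept=47):
    """Evaluate the line y = slope * x + intercept at x."""
    return slope * x + intercept


hours = 8
score = predict(hours)
print(f"Predicted score for {hours} hours: {score:.1f} (about {round(score)})")
```

Passing a different slope and intercept reuses the same function for any other line of best fit.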

Interpolation vs. Extrapolation

Interpolation: predicting within the range of your data. (If data covers x = 1 to 6, predicting at x = 3.5 is interpolation.) Generally reliable.

Extrapolation: predicting beyond the range of your data. (If data covers x = 1 to 6, predicting at x = 15 is extrapolation.) Risky, because the trend may not continue.

Extrapolation Can Be Dangerous

Using y = 5.6x + 47 to predict a score for x = 20 hours gives y = 5.6(20) + 47 = 159, which is impossible on a 100-point test! The linear relationship breaks down beyond the data range. Always be cautious with extrapolation.
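
One way to build this caution into a calculation is to label every prediction as interpolation or extrapolation before trusting it. The sketch below does that for the study-hours data. `classify_prediction` and `MAX_SCORE` are hypothetical names introduced for illustration, and the 100-point ceiling is taken from the example above.

```python
def classify_prediction(x, data_xs):
    """Label x as interpolation if it lies within the observed x-range."""
    lowest, highest = min(data_xs), max(data_xs)
    return "interpolation" if lowest <= x <= highest else "extrapolation"


data_xs = [1, 2, 3, 4, 5, 6]   # hours studied in Worked Example 1
MAX_SCORE = 100                # assumed ceiling: a 100-point test

for x in (3.5, 15, 20):
    y = 5.6 * x + 47           # line of best fit from the worked examples
    kind = classify_prediction(x, data_xs)
    warning = " (exceeds the maximum possible score)" if y > MAX_SCORE else ""
    print(f"x = {x}: predicted y = {y:.1f}, {kind}{warning}")
```

Only the prediction at x = 3.5 falls inside the data range; the other two are extrapolations, and both give scores above 100.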

Worked Example 3: Identifying Correlation

Data points: (10, 85), (20, 78), (30, 70), (40, 61), (50, 55). What type of correlation?

As x increases (10, 20, 30, ...), y decreases (85, 78, 70, ...).
This is a negative correlation.
A reasonable line of best fit: y = -0.75x + 93.
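
The direction of the relationship can also be checked numerically. The sketch below computes the Pearson correlation coefficient r, a standard measure that this lesson does not introduce; it is used here only as a check on the sign. The `cutoff` value of 0.3 is an arbitrary illustrative threshold for labeling a weak relationship as "little or no linear correlation", not a rule from the lesson.

```python
import math


def pearson_r(points):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    return sxy / math.sqrt(sxx * syy)


def correlation_type(r, cutoff=0.3):
    """Classify by the sign of r; cutoff is an assumed, illustrative threshold."""
    if r >= cutoff:
        return "positive correlation"
    if r <= -cutoff:
        return "negative correlation"
    return "little or no linear correlation"


# Data from Worked Example 3
points = [(10, 85), (20, 78), (30, 70), (40, 61), (50, 55)]
r = pearson_r(points)
print(f"r = {r:.3f}: {correlation_type(r)}")
```

For these five points r is very close to -1, which agrees with the negative correlation identified above.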

Correlation Does Not Imply Causation

Just because two variables are correlated does not mean one causes the other. Ice cream sales and drowning incidents are correlated with each other because both rise in hot weather, but ice cream does not cause drowning. Always consider whether there is a lurking variable (here, the weather) that drives both.

Practice Problems

1. A line of best fit is y = 3x + 10. Predict y when x = 7.

Solution: y = 3(7) + 10 = 21 + 10 = 31

2. Data shows: as temperature increases, hot chocolate sales decrease. What type of correlation is this?

Solution: Negative correlation. As one variable goes up, the other goes down.

3. A line of best fit is based on data from x = 5 to x = 30. Is predicting at x = 15 interpolation or extrapolation?

Solution: Interpolation, because x = 15 falls within the data range (5 to 30).

4. A regression line has slope -2.5. Interpret this in context if x = advertising dollars (thousands) and y = unsold inventory.

Solution: For every additional $1,000 spent on advertising, the predicted unsold inventory decreases by about 2.5 units.

5. Why would predicting y at x = 100 be risky if the data only covers x = 0 to x = 20?

Solution: This is extrapolation far beyond the data range. The linear trend observed between 0 and 20 may not hold at x = 100. The relationship could curve, plateau, or reverse.

Lesson Summary

A scatter plot shows the relationship between two variables as individual points.
A line of best fit summarizes the trend; the least squares method finds the line that minimizes the sum of squared residuals.
The slope tells you the rate of change; the y-intercept is the predicted value when x = 0.
Interpolation (within the data range) is generally reliable; extrapolation (beyond it) is risky.
Correlation does not imply causation.