Lesson 3 — Subtracting the line and reading what's left

BIO 202, Spring 2026, draft v1. In Lesson 1 you knew nothing about each individual. Here you know one extra thing per person.

What you'll do

Predict adult weight four times. Each round you get a different piece of information to work with. Don't read ahead.

A — Predict an adult's weight with no other information

Same setup as Lesson 1 Stage B, with weight instead of height. Move the slider, watch the errors.

Answer the pretest above to unlock this section.

Scenario

NHANES adult weights. The slider is your constant guess. Gray = the truth, red = your signed errors. Move the slider and watch the two distributions move relative to each other.

Weight (gray) and signed errors from your guess (red)

Readouts: N, μ, σ, your guess, MSE.

Prediction (required before slider unlocks)

  Q1. The constant predictor with the smallest MSE on adult weight is:
  Q2. Comparing MSE with mean absolute error (MAE): which criterion does the mean minimize, and which does the median?
Try at least 5 different guesses to unlock Stage B.

Controls

Guess slider: 81.0 kg

R code — constant predictor for weight

nh <- read.csv("data/clean/nhanes_adults.csv")
y <- nh$Weight
mean(y); median(y); sd(y)
guess <- 81.0
err <- guess - y
mean(err^2)        # MSE -- minimized by guess = mean
mean(abs(err))     # MAE -- minimized by guess = median
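
You can check the two "minimized by" comments by brute force. A minimal sketch, assuming simulated right-skewed weights (invented numbers, not NHANES) and a hypothetical grid of guesses 0.1 kg apart: compute both losses for every guess and see where each bottoms out.

```r
set.seed(1)
# Simulated, right-skewed "weights" in kg (an assumption, not NHANES data)
y <- 60 + rexp(2001, rate = 1 / 20)

# Grid of constant guesses, 0.1 kg apart, spanning the data
guesses <- seq(min(y), max(y), by = 0.1)

# Loss of each constant guess under the two criteria
mse <- sapply(guesses, function(g) mean((g - y)^2))
mae <- sapply(guesses, function(g) mean(abs(g - y)))

best_mse <- guesses[which.min(mse)]
best_mae <- guesses[which.min(mae)]

c(best_mse = best_mse, mean = mean(y))       # these two should nearly agree
c(best_mae = best_mae, median = median(y))   # and so should these
```

Because the simulated weights are skewed, the mean and median sit apart, so the two criteria pick visibly different guesses.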

B — Now you know each person's height. Use it.

Same dataset. One new piece of information per person.

Complete Stage A (submit prediction, try 5 guesses) to unlock this section.

Scenario

Each NHANES adult has a height and a weight. Blue line: weight ~ α + β·height. Dashed gray: the Stage A constant predictor.

yᵢ ~ Normal(α + β·xᵢ, σ)

Toggle "show residuals" to see the vertical errors. Watch σ as you do.
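
One way to read the model line is to run it forward. A minimal sketch, assuming made-up values for α, β, and σ (the names alpha_true, beta_true, and sigma_true are hypothetical, as are the numbers): simulate heights, draw weights from the model, then see whether lm() gets the parameters back.

```r
set.seed(2)
n          <- 400
alpha_true <- -80    # hypothetical intercept (kg)
beta_true  <- 0.95   # hypothetical slope (kg per cm)
sigma_true <- 15     # hypothetical residual SD (kg)

# Heights in cm, then weights drawn from Normal(alpha + beta * height, sigma)
height <- rnorm(n, mean = 168, sd = 10)
weight <- rnorm(n, mean = alpha_true + beta_true * height, sd = sigma_true)

fit <- lm(weight ~ height)
coef(fit)            # estimates of alpha and beta
sd(residuals(fit))   # estimate of sigma
```

The slope and residual SD land close to the values you put in. The intercept is much noisier, because height = 0 is far outside the data.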

Weight vs height with fitted line (and residual bars)

Readouts: α̂, β̂, R², residual σ, MSE of the constant predictor (Stage A), MSE of the line (Stage B).

Prediction (required before controls unlock)

  Q1. When you fit a line of weight on height, the residual SD compared to the marginal SD of weight will be:
  Q2. The intercept α̂ from lm(Weight ~ Height) reads as:
Toggle controls (residuals on/off, subsample, refit) at least 5 times to unlock Stage C.

Controls

Subsample size: 400
Seed: 42

R code — fit weight on height

nh <- read.csv("data/clean/nhanes_adults.csv")
set.seed(42)
idx <- sample(nrow(nh), 400)
d <- nh[idx, ]
fit <- lm(Weight ~ Height, data = d)
summary(fit)
sd(residuals(fit))   # residual SD -- smaller than sd(d$Weight)
plot(d$Height, d$Weight, pch = 16, col = "#444444",
     xlab = "height (cm)", ylab = "weight (kg)")
abline(fit, col = "#2f6b8f", lwd = 2)
abline(h = mean(d$Weight), col = "gray60", lty = 2)
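
The two MSE readouts are tied together by R². A minimal sketch, assuming simulated heights and weights (invented coefficients, not NHANES): for a least-squares line with an intercept, the line's MSE equals (1 - R²) times the MSE of the best constant guess.

```r
set.seed(3)
n <- 400
# Simulated data (an assumption): heights in cm, weights in kg
height <- rnorm(n, mean = 168, sd = 10)
weight <- rnorm(n, mean = -80 + 0.95 * height, sd = 15)

fit <- lm(weight ~ height)

mse_const <- mean((weight - mean(weight))^2)   # Stage A: best constant guess
mse_line  <- mean(residuals(fit)^2)            # Stage B: fitted line
r2        <- summary(fit)$r.squared

c(mse_const = mse_const, mse_line = mse_line)
all.equal(mse_line, (1 - r2) * mse_const)      # TRUE up to rounding
```

So R² is the fraction of the Stage A squared error that the line removes.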

C — Residual-pattern drill (5 rounds)

Five scatters. Each has a straight-line fit. Some are wrong on purpose. Pick the residual pattern. Move on.

Complete Stage B (submit prediction, toggle controls 5 times) to unlock this section.

Scenario

Each round: a scatter with a fitted line (top) and a residuals plot (bottom). Pick the pattern. Five rounds, real data each time.

Round 1 of 5 — fit and residuals

round: 1/5  |  correct so far: 0

Prediction (required before the drill starts)

  Q1. A residual-vs-fitted plot shows a curve (like a smile or a frown). The single most likely diagnosis is:
  Q2. A residual-vs-fitted plot shows two disjoint clouds, one above and one below the zero line. The single most likely diagnosis is:
Finish all 5 rounds to unlock Stage D.

R code — residuals as a diagnostic

# Round 1: mammal mass vs gestation, log-log (clean)
m <- read.csv("data/clean/pantheria_mammals.csv")
fit <- lm(log(AdultBodyMass_g) ~ log(GestationLen_d), data = m)
plot(fitted(fit), residuals(fit), pch = 16)
abline(h = 0, col = "#b23a48")
# Same plot, linear-linear: curvature appears
fit_lin <- lm(AdultBodyMass_g ~ GestationLen_d, data = m)
plot(fitted(fit_lin), residuals(fit_lin), pch = 16)
abline(h = 0, col = "#b23a48")
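
A curved residual plot can also be read off as numbers. A minimal sketch, assuming simulated data whose true relationship is quadratic (the coefficients are invented): fit a straight line anyway, then average the residuals in the low, middle, and high thirds of the fitted values.

```r
set.seed(4)
x <- runif(300, min = 0, max = 10)
y <- 2 + 0.5 * x^2 + rnorm(300, sd = 2)   # the truth is curved (hypothetical)

fit_line <- lm(y ~ x)                     # straight line, wrong on purpose
r <- residuals(fit_line)

# Split the fitted values into thirds and average the residuals in each
breaks <- quantile(fitted(fit_line), probs = c(0, 1/3, 2/3, 1))
thirds <- cut(fitted(fit_line), breaks = breaks,
              include.lowest = TRUE, labels = c("low", "mid", "high"))
third_means <- tapply(r, thirds, mean)
third_means   # positive, negative, positive: a "smile"
```

With a well-specified line, all three averages would hover near zero.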

D — Beren and Cyrus: click where the missing measurement should sit

Real growth data for both kids. One measurement is hidden per round. Click where you think it goes; the reveal tells you what was actually there.

Complete Stage C (finish all 5 drill rounds) to unlock this section.

Scenario

One kid's mass-by-age trajectory with a single point hidden. WHO median in gray. Click where you think the missing point belongs.

The reveal color tells you something about that point. Six rounds.

Round 1 of 6 — click where the missing point should be

Readouts: kid, round, your click, actual, prediction error, sick proxy.

Prediction (required before the canvas unlocks)

  Q1. A child is measured at the doctor's office while sick. Compared with a healthy-day measurement at the same age, the sick-day mass measurement will tend to be:
  Q2. If you fit "mass = f(age)" and ignore the sick-day flag, the residuals from that fit on sick-day measurements will:
Complete all 6 rounds to wrap up.

R code — kids' growth with sick-proxy flag

k <- read.csv("data/clean/kids_growth.csv")
b <- subset(k, kid == "beren" & measure == "mass")
fit <- lm(value ~ poly(age_years, 2), data = b)
b$residual <- residuals(fit)
# compare residuals by sick_proxy flag
aggregate(residual ~ sick_proxy, data = b, mean)
boxplot(residual ~ sick_proxy, data = b,
        col = c("#6f8a4a", "#a86a1a"))
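
To see what a flag hiding in the residuals looks like when you know the answer, try it on fake data first. A minimal sketch, assuming an invented child with 40 visits, every fourth one on a sick day, and a sick-day offset of 0.4 kg that is made up for illustration (none of this comes from kids_growth.csv):

```r
set.seed(5)
# Hypothetical child: 40 visits between ages 0.1 and 5 years
age_years  <- sort(runif(40, min = 0.1, max = 5))
sick_proxy <- rep(c(0, 0, 0, 1), times = 10)   # every fourth visit is a sick day

# Invented growth curve (kg); sick-day visits are shifted down by 0.4 kg
value <- 4 + 4.5 * age_years - 0.35 * age_years^2 -
  0.4 * sick_proxy + rnorm(40, sd = 0.25)
b <- data.frame(age_years, sick_proxy, value)

# Fit that ignores the flag, then split its residuals by the flag
fit_age <- lm(value ~ poly(age_years, 2), data = b)
b$residual <- residuals(fit_age)
by_flag <- aggregate(residual ~ sick_proxy, data = b, mean)
by_flag

# Fit that includes the flag: the residual SD cannot go up
fit_flag <- lm(value ~ poly(age_years, 2) + sick_proxy, data = b)
c(age_only = sd(residuals(fit_age)), with_flag = sd(residuals(fit_flag)))
```

The age-only fit pushes the sick-day offset into the residuals, which is why the sick-day group averages below zero. Whether the real data behave the same way is for you to find out.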

Stretch challenge (optional, recorded)

The downloaded .R file for Stage D shows the boxplot step: pull the residuals from the smooth "mass = f(age)" fit on Beren, then split them by sick_proxy. Run it. Are the sick-day residuals systematically more negative? Refit with sick_proxy as a predictor and report the new residual SD.
