Lesson 3 — Subtracting the line and reading what's left

BIO 202, Spring 2026, draft v1. In Lesson 1 you knew nothing about each individual. Here you know one extra thing per person.

What you'll do

Predict adult weight four times. Each round you get a different piece of information to work with. Don't read ahead.

Why "subtract the line"? Once you fit a line that says what current function should produce, the residual is what current function can't explain. A giraffe's recurrent laryngeal nerve takes ~16 ft to cover ~6 in between brain and larynx — fit nerve length on body size, and the giraffe is the residual screaming "ancestry, not engineering." The olm's eye sits in a lightless cave — fit eye complexity on light environment, and the olm is the residual. Residuals like these are the diagnostic for one specific thing: vertical transmission won over re-design. The structure is here because it was inherited from a context where it once made sense, not because current selection is putting it here. Today you write the fit down.

L — Warm-up: intercept and slope, on data you already have intuition for

Every spoken-aloud quote in a film adaptation maps to a page in the source book. Plot minute vs. page, drag a line by hand, read what the intercept and slope mean.

Scenario

Each point is a quote from Peter Jackson's Fellowship of the Ring theatrical cut. The x-coordinate is what minute it appears in the film. The y-coordinate is what page it appears on in the book.

Drag the α (intercept) and β (slope) sliders to fit a line by eye. Then read what each parameter is:

  • α = what page the film opens on at minute 0 (probably not page 0 — Jackson cuts the prologue).
  • β = pages per minute of runtime (the film's pacing).
  • residual = how far this quote is from where a perfectly linear adaptation would put it (out-of-order moments).
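
The three quantities above can be sketched in R. This is a minimal illustration on made-up quote positions, not the course's film data; the names `minute`, `page`, and `ssr`, and the hand-picked line (α = 20, β = 2.0), are assumptions for the example only.

```r
set.seed(1)
# Made-up quote positions (NOT real film data): minute in film, page in book.
minute <- sort(runif(30, 0, 170))
page   <- 20 + 2.0 * minute + rnorm(30, sd = 15)

# Sum of squared residuals (SSR) for any line you drag by hand.
ssr <- function(alpha, beta) sum((page - (alpha + beta * minute))^2)

# Closed-form OLS optimum: slope = cov / var, and the line passes through the means.
beta_star  <- cov(minute, page) / var(minute)
alpha_star <- mean(page) - beta_star * mean(minute)

# lm() should land on the same two numbers.
fit <- lm(page ~ minute)

ssr_hand <- ssr(20, 2.0)                 # an arbitrary hand-picked line
ssr_ols  <- ssr(alpha_star, beta_star)   # no other line gives a smaller SSR
```

Comparing `ssr_hand` with `ssr_ols` is the same game as dragging the sliders: any line you pick by eye can tie the OLS line on SSR but cannot beat it.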

Jackson's Fellowship of the Ring — quote position

Readouts: α (intercept, pages)  |  β (slope, pages/min)  |  SSR  |  OLS optimum (α*, β*)

Controls

α slider: 20
β slider: 2.00

Why this warm-up?

"Pages per minute" is a quantity you can imagine without learning new vocabulary. The same arithmetic — α, β, residual, fit-to-minimize-SSR — will appear with body weight and finch beaks for the rest of the lesson. The math is the new thing in the room; the data should not also be new.

A — Predict an adult's weight with no other information

Same setup as Lesson 1 Stage B, with weight instead of height. Move the slider, watch the errors.

Scenario

NHANES adult weights. The slider is your constant guess. Gray = the truth, red = your signed errors. Move the slider and watch the two distributions move relative to each other.

Weight (gray) and signed errors from your guess (red)

Readouts: N  |  μ  |  σ  |  your guess  |  MSE

Prediction (required before slider unlocks)

  1. Q1. You have to make one guess for every NHANES adult's weight and you'll be graded on squared error. Which single number minimizes the average penalty?
  2. Q2. If you scored guesses by mean squared error (MSE) and by mean absolute error (MAE), which metric would the mean win and which would the median win?
Try at least 5 different guesses to unlock Stage B.

Controls

guess slider: 81.0

R code — constant predictor for weight

nh <- read.csv("data/clean/nhanes_adults.csv")
y <- nh$Weight
mean(y); median(y); sd(y)

guess <- 81.0
err <- guess - y
mean(err^2)        # MSE -- minimized by guess = mean
mean(abs(err))     # MAE -- minimized by guess = median
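
To see why the two comments in that code hold, you can sweep the guess across a grid and find the minimum of each penalty directly. This is a sketch on simulated right-skewed weights, not NHANES; the distribution and the grid are assumptions chosen only to make the mean and the median differ visibly.

```r
set.seed(2)
# Simulated right-skewed "weights" in kg (NOT NHANES data).
y <- 45 + rgamma(2000, shape = 4, scale = 9)

# Try every constant guess from 50 to 120 kg in 0.1 kg steps.
guesses <- seq(50, 120, by = 0.1)
mse <- sapply(guesses, function(g) mean((g - y)^2))
mae <- sapply(guesses, function(g) mean(abs(g - y)))

best_mse <- guesses[which.min(mse)]   # should sit next to mean(y)
best_mae <- guesses[which.min(mae)]   # should sit next to median(y)

c(mean = mean(y), best_mse = best_mse, median = median(y), best_mae = best_mae)
```

Because the simulated distribution has a long right tail, the mean sits above the median, so the two penalties pick different winners.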

B — Now you know each person's height. Use it.

Same dataset. One new piece of information per person.

Complete Stage A (submit prediction, try 5 guesses) to unlock this section.

Scenario

Each NHANES adult has a height and a weight. Blue line: weight ~ α + β·height. Dashed gray: the Stage A constant predictor.

y_i ~ Normal(α + β·x_i, σ)

Toggle "show residuals" to see the vertical errors. Watch σ as you do.

Weight vs height with fitted line (and residual bars)

Readouts: α̂  |  β̂  |  R²  |  residual σ  |  MSE constant predictor (Stage A)  |  MSE line (Stage B)

Prediction (required before controls unlock)

  1. Q1. When you fit a line of weight on height, the residual SD compared to the marginal SD of weight will be:
  2. Q2. R prints an intercept α̂ when you fit lm(weight ~ height). What is that number actually predicting?
Toggle controls (residuals on/off, subsample, refit) at least 5 times to unlock Stage C.

Controls

subsample size: 400
seed: 42

R code — fit weight on height

nh <- read.csv("data/clean/nhanes_adults.csv")
set.seed(42)
idx <- sample(nrow(nh), 400)
d <- nh[idx, ]

fit <- lm(Weight ~ Height, data = d)
summary(fit)
sd(residuals(fit))   # residual SD -- smaller than sd(d$Weight)

plot(d$Height, d$Weight, pch = 16, col = "#444",
     xlab = "height (cm)", ylab = "weight (kg)")
abline(fit, col = "#2f6b8f", lwd = 2)
abline(h = mean(d$Weight), col = "gray60", lty = 2)
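
Two things in this stage are easy to check numerically: how the residual SD relates to the marginal SD, and what the intercept is a prediction of. The sketch below uses simulated heights and weights, not NHANES; the coefficients and noise level are made up for illustration.

```r
set.seed(42)
# Simulated heights (cm) and weights (kg); NOT NHANES data.
height <- rnorm(400, mean = 168, sd = 10)
weight <- -70 + 0.9 * height + rnorm(400, sd = 18)

fit <- lm(weight ~ height)
r2  <- summary(fit)$r.squared

sd_marginal <- sd(weight)           # spread around the constant predictor
sd_resid    <- sd(residuals(fit))   # spread around the fitted line

# With an intercept in the model, R^2 ties the two SDs together.
sd_from_r2 <- sd_marginal * sqrt(1 - r2)

# The intercept is the fitted weight at height = 0 cm,
# an extrapolation far outside the observed heights.
at_zero <- predict(fit, newdata = data.frame(height = 0))
```

Here `sd_resid` and `sd_from_r2` agree, which is one way to read R²: it is the fraction of the marginal variance that the line removed.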

C — Residual-pattern drill (5 rounds)

Five scatters. Each has a straight-line fit. Some are wrong on purpose. Pick the residual pattern. Move on.

Complete Stage B (submit prediction, toggle controls 5 times) to unlock this section.

Scenario

Each round: a scatter with a fitted line (top) and a residuals plot (bottom). Pick the pattern. Five rounds, real data each time.

Round 1 of 5 — fit and residuals

Readouts: round  |  correct so far

Prediction (required before the drill starts)

  1. Q1. A residual-vs-fitted plot shows a curve (like a smile or a frown). The single most likely diagnosis is:
  2. Q2. A residual-vs-fitted plot shows two disjoint clouds, one above and one below the zero line. The single most likely diagnosis is:
Finish all 5 rounds to unlock Stage D.

R code — residuals as a diagnostic

# Round 1: mammal mass vs gestation, log-log (clean)
m <- read.csv("data/clean/pantheria_mammals.csv")
fit <- lm(log(AdultBodyMass_g) ~ log(GestationLen_d), data = m)
plot(fitted(fit), residuals(fit), pch = 16)
abline(h = 0, col = "#b23a48")

# Same plot, linear-linear: curvature appears
# which = 1 keeps only the residuals-vs-fitted panel of plot.lm()
plot(lm(AdultBodyMass_g ~ GestationLen_d, data = m), which = 1)
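
If you want a number to go with the "smile or frown" in a residual plot, one rough option is to ask how much of the residual variance a quadratic in the fitted values can pick up. This is a sketch on simulated power-law data, not PanTHERIA; the helper `curvature()` is a hypothetical name and a crude check, not a standard diagnostic.

```r
set.seed(3)
# Simulated power-law data (NOT PanTHERIA): y = 2 * x^2 with multiplicative noise.
x <- runif(200, 1, 100)
y <- 2 * x^2 * exp(rnorm(200, sd = 0.1))

fit_raw <- lm(y ~ x)             # straight line on the raw scale (wrong shape)
fit_log <- lm(log(y) ~ log(x))   # straight line on the log-log scale (right shape)

# Crude curvature check: regress the residuals on the fitted values plus a
# centered quadratic term, and report the R^2 of that auxiliary fit.
curvature <- function(fit) {
  f <- fitted(fit)
  r <- residuals(fit)
  summary(lm(r ~ f + I((f - mean(f))^2)))$r.squared
}

curv_raw <- curvature(fit_raw)   # large: the residuals bend
curv_log <- curvature(fit_log)   # near zero: nothing left to bend
```

A large value says the straight line has the wrong shape; it does not tell you which transformation fixes it.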

D — Beren and Cyrus: click where the missing measurement should sit

Real growth data for both kids. One measurement is hidden per round. Click where you think it goes; the reveal tells you what was actually there.

Complete Stage C (finish all 5 drill rounds) to unlock this section.

Scenario

One kid's mass-by-age trajectory with a single point hidden. WHO median in gray. Click where you think the missing point belongs.

The reveal color tells you something about that point. Six rounds.

Round 1 of 6 — click where the missing point should be

Readouts: kid  |  round  |  your click  |  actual  |  prediction error  |  sick proxy

Prediction (required before the canvas unlocks)

  1. Q1. A child is measured at the doctor when they are sick. Compared to a same-age clean measurement, the sick-day mass measurement will tend to be:
  2. Q2. If you fit "mass = f(age)" and ignore the sick-day flag, the residuals from that fit on sick-day measurements will:
Complete all 6 rounds to wrap up.

R code — kids' growth with sick-proxy flag

k <- read.csv("data/clean/kids_growth.csv")
b <- subset(k, kid == "beren" & measure == "mass")
fit <- lm(value ~ poly(age_years, 2), data = b)
b$residual <- residuals(fit)

# compare residuals by sick_proxy flag
aggregate(residual ~ sick_proxy, data = b, mean)
boxplot(residual ~ sick_proxy, data = b,
        col = c("#6f8a4a", "#a86a1a"))

Stretch challenge (optional, recorded)

The downloaded .R for Stage D shows the boxplot move: pull residuals from the smooth "mass = f(age)" fit on Beren, then split them by sick_proxy. Do it. Are the sick-day residuals systematically more negative? Refit including sick_proxy as a predictor and report the new residual SD.
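
If you want to rehearse the move before touching the real file, here is a sketch on a simulated growth record. It is not Beren's data: the growth curve, the every-third-visit sick flag, and the 0.6 kg sick-day dip are all assumptions made up for the example.

```r
set.seed(4)
# Simulated growth record for one hypothetical child (NOT kids_growth.csv).
age_years  <- seq(0.5, 8, by = 0.25)
sick_proxy <- rep(c(0, 0, 1), length.out = length(age_years))  # every third visit
# Assumption for illustration: sick-day visits read 0.6 kg low.
value <- 6 + 2.4 * age_years - 0.05 * age_years^2 - 0.6 * sick_proxy +
  rnorm(length(age_years), sd = 0.3)
kid <- data.frame(age_years, sick_proxy, value)

fit_age  <- lm(value ~ poly(age_years, 2), data = kid)               # ignores the flag
fit_both <- lm(value ~ poly(age_years, 2) + sick_proxy, data = kid)  # uses the flag

# Mean residual from the age-only fit, split by the flag.
mean_resid <- tapply(residuals(fit_age), kid$sick_proxy, mean)

sd_age  <- sd(residuals(fit_age))
sd_both <- sd(residuals(fit_both))
```

On the simulated record the flagged visits carry the negative residuals and `sd_both` comes out below `sd_age`. Whether the real data behave the same way is what the challenge asks you to find out.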

E — A scatter where the "noise" turned out to be a missing variable

Every quote from The Lord of the Rings film adaptations, plotted as book page against film minute. Fit one line to all of them — read R². Then color by which film the quote came from.

Complete Stage D (finish all 6 rounds) to unlock this section.

Scenario

Three film adaptations of Tolkien's trilogy. Each colored point is one quote spoken on screen: the x coordinate is what minute it appears in the film, the y coordinate is what page it appears on in the source book.

If the films were perfect linear adaptations, every point would sit on one line. They don't. The residuals from a pooled fit look like noise — until you color them.

Quote position — film minute (x) vs book page (y)

Readouts: pooled R²  |  colored by film R²  |  slope diff (Jackson Fellowship vs Bakshi)

Prediction

  1. Q1. Before you color by film, the pooled fit looks noisy. After you color by film, which one is true?
  2. Q2. Which of these is a fair statistical reading of what just happened? (Pick the strongest.)
Toggle the color-by-film view at least once.

R code — residuals partition by a categorical

lotr <- read.csv("data/clean/lotr_quotes.csv")

# Pooled fit ignores 'film'
pooled <- lm(page ~ minute, data = lotr)

# Stratified fit lets each film have its own intercept and slope
strat  <- lm(page ~ minute * film, data = lotr)

summary(pooled)$r.squared; summary(strat)$r.squared

Draw the missing variable

The pooled fit ignored "film". In a DAG, "film" is a categorical node that affects both minute (when does this film's runtime end?) and page (which book is this film adapting?). Build that DAG below and watch the scatter behave: pooled vs colored-by-film. This is the same picture you saw above, generated from a causal model you control.
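
The DAG described above can also be written as a few lines of simulation. This is a sketch of one hypothetical generative model, not the course's quote data: the three films, their runtimes, starting pages, and paces are invented numbers, and `sim_film()` is a helper defined only for this example.

```r
set.seed(5)
# Hypothetical generative model (NOT lotr_quotes.csv):
#   film -> minute  (each film has its own runtime)
#   film -> page    (each film starts at its own page and moves at its own pace)
films <- data.frame(
  film    = c("film_1", "film_2", "film_3"),
  runtime = c(180, 180, 200),   # minutes, made up
  start   = c(0, 400, 750),     # first page covered, made up
  pace    = c(2.2, 1.8, 1.5)    # pages per minute, made up
)

# Draw n quotes for one film (one row of `films`).
sim_film <- function(f, n = 60) {
  minute <- runif(n, 0, f$runtime)
  page   <- f$start + f$pace * minute + rnorm(n, sd = 20)
  data.frame(film = f$film, minute = minute, page = page)
}
lotr_sim <- do.call(rbind, lapply(seq_len(nrow(films)),
                                  function(i) sim_film(films[i, ])))

pooled <- lm(page ~ minute, data = lotr_sim)          # ignores film
strat  <- lm(page ~ minute * film, data = lotr_sim)   # intercept and slope per film

r2_pooled <- summary(pooled)$r.squared
r2_strat  <- summary(strat)$r.squared
```

In this simulation most of what the pooled fit calls noise is the gap between films' starting pages, so `r2_strat` ends up far above `r2_pooled`.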

Showcase — the residual you just learned, drawn on the lecture example

No prediction, no controls. Just the same regression-and-residual move applied to the comparative anatomy from lecture.

What you're looking at

Ten mammals. x = the direct distance from the brainstem to the larynx (cm) — the path a sensible engineer would route the nerve down. y = the actual length of the recurrent laryngeal nerve (cm) — the path evolution actually took, looping under the aortic arch at the heart and back up.

The line is fit to the nine non-giraffe mammals. The giraffe is plotted in red — and the residual panel below shows how far off the line it sits.

Approximate values from comparative-anatomy references (Wedel 2012 for giraffe; standard texts for the rest). Treat as pedagogical rather than research-grade.

Top: brain-to-larynx distance vs. nerve length. Bottom: residuals from a fit on the non-giraffe mammals.

Readouts, fit on non-giraffe (n = 9): slope  |  R²  |  giraffe predicted (cm)  |  giraffe actual (cm)  |  residual (cm)

What the residual is doing

The nine-mammal fit says: for every 1 cm of direct brain-to-larynx distance, expect roughly 2–3 cm of recurrent laryngeal nerve (the loop around the aortic arch adds a relatively fixed amount). For a giraffe with a ~25 cm direct distance, the line predicts ~60 cm of nerve.

The giraffe's actual nerve is roughly 270 cm. That's the residual you're looking at — the part of the giraffe's anatomy that the "what should a nerve do?" line can't explain. Lecture's framing: there is no engineering reason for it. There is only an ancestry reason — a vertical-transmission channel laying down nerve routing through every fish, every amphibian, every early tetrapod, every short-necked mammal, until it reached an animal whose neck stretched in a way the routing was never asked to anticipate.

Same machine you used on Beren's sick-day weights. Different residual, same diagnostic. By Unit 5 you will use this exact diagnostic at every scale of life: when current function and inherited form disagree, the disagreement is data about which channel of inheritance produced what you are seeing.

R code — fit on non-giraffe, then look where the giraffe sits

v <- read.csv("data/clean/vagus_nerve.csv")
non_g <- subset(v, species != "giraffe")
fit <- lm(rln_length_cm ~ direct_cm, data = non_g)

# predict the giraffe from the non-giraffe regression
giraffe <- subset(v, species == "giraffe")
pred <- predict(fit, newdata = giraffe)
giraffe$rln_length_cm - pred   # the residual: ~200 cm