BIO 202 — Lesson 3: Subtracting the line and reading what's left

What you'll do

Predict adult weight four times. Each round you get a different piece of information to work with. Don't read ahead.

Why "subtract the line"? Once you fit a line that says what current function should produce, the residual is what current function can't explain. A giraffe's recurrent laryngeal nerve takes ~16 ft to cover ~6 in between brain and larynx — fit nerve length on body size, and the giraffe is the residual screaming "ancestry, not engineering." The olm's eye sits in a lightless cave — fit eye complexity on light environment, and the olm is the residual. Residuals like these are the diagnostic for one specific thing: vertical transmission won over re-design. The structure is here because it was inherited from a context where it once made sense, not because current selection is putting it here. Today you write the fit down.

L — Warm-up: intercept and slope, on data you already have intuition for

Every spoken-aloud quote in a film adaptation maps to a page in the source book. Plot minute vs. page, drag a line by hand, read what the intercept and slope mean.

Scenario

Each point is a quote from Peter Jackson's Fellowship of the Ring theatrical cut. The x-coordinate is what minute it appears in the film. The y-coordinate is what page it appears on in the book.

Drag the α (intercept) and β (slope) sliders to fit a line by eye. Then read what each parameter is:

α = what page the film opens on at minute 0 (probably not page 0 — Jackson cuts the prologue).
β = pages per minute of runtime (the film's pacing).
residual = how far this quote is from where a perfectly linear adaptation would put it (out-of-order moments).
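
The slider arithmetic can be sketched in R. This is a minimal sketch on simulated quote positions, not the real film/book data; the variable names (minute, page, ssr) and the by-eye values 30 and 1.8 are assumptions made up for illustration. It scores one hand-picked line by its sum of squared residuals (SSR), then compares that with the line lm() finds.

```r
# Simulated stand-in for the quote data (NOT the real film/book positions).
set.seed(1)
minute <- sort(runif(40, 0, 180))                 # film minute of each quote
page   <- 20 + 2 * minute + rnorm(40, sd = 15)    # book page, roughly linear

# Sum of squared residuals for any line you pick with the sliders
ssr <- function(alpha, beta) sum((page - (alpha + beta * minute))^2)

ssr_by_eye <- ssr(alpha = 30, beta = 1.8)         # one by-eye guess

# OLS picks the alpha and beta that make SSR as small as possible
fit        <- lm(page ~ minute)
alpha_star <- coef(fit)[["(Intercept)"]]
beta_star  <- coef(fit)[["minute"]]
ssr_ols    <- ssr(alpha_star, beta_star)

c(by_eye = ssr_by_eye, ols = ssr_ols)
```

No hand-picked pair can score below the OLS pair on SSR; that is what "optimum" means in the readout.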

Jackson's Fellowship of the Ring — quote position

α (intercept): — pages | β (slope): — pages/min | SSR: —

OLS optimum: α* = —, β* = —

Controls

α (intercept, pages): 20

β (slope, pages/min): 2.00

Why this warm-up?

"Pages per minute" is a quantity you can imagine without learning new vocabulary. The same arithmetic — α, β, residual, fit-to-minimize-SSR — will appear with body weight and finch beaks for the rest of the lesson. The math is the new thing in the room; the data should not also be new.

A — Predict an adult's weight with no other information

Same setup as Lesson 1 Stage B, with weight instead of height. Move the slider, watch the errors.

Scenario

NHANES adult weights. The slider is your constant guess. Gray = the truth, red = your signed errors. Move the slider and watch the two distributions move relative to each other.

Weight (gray) and signed errors from your guess (red)

N: — | μ: — | σ: —

your guess: — | MSE: —

Prediction (required before slider unlocks)

Q1. You have to make one guess for every NHANES adult's weight and you'll be graded on squared error. Which single number minimizes the average penalty?
The population mean: squared errors punish being far away, so the balance point of the distribution minimizes the total penalty
The population median: it splits the population in half by count
Any constant: when you have no info about each person, every guess gives the same expected score
Q2. If you scored constant guesses by MSE and by mean absolute error (MAE), which metric would the mean win and which would the median win?
mean wins MSE; median wins absolute error
they win the same metrics
median wins MSE; mean wins absolute error

Try at least 5 different guesses to unlock Stage B. 0/5 guesses

Controls

your guess (kg): 81.0

R code — constant predictor for weight

nh <- read.csv("data/clean/nhanes_adults.csv")
y <- nh$Weight
mean(y); median(y); sd(y)
guess <- 81.0
err <- guess - y
mean(err^2)        # MSE -- minimized by guess = mean
mean(abs(err))     # MAE -- minimized by guess = median
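
The two comments in that code can be checked by brute force. A minimal sketch, assuming simulated right-skewed weights rather than the NHANES file (the gamma parameters and the 50 to 120 kg grid are arbitrary choices): try every guess on a grid and see which one wins each metric.

```r
# Simulated right-skewed "weights" (NOT NHANES data), so mean and median differ.
set.seed(2)
y <- 45 + rgamma(2000, shape = 4, scale = 9)

guesses <- seq(50, 120, by = 0.1)
mse <- sapply(guesses, function(g) mean((g - y)^2))   # squared-error score
mae <- sapply(guesses, function(g) mean(abs(g - y)))  # absolute-error score

best_mse <- guesses[which.min(mse)]   # lands next to mean(y)
best_mae <- guesses[which.min(mae)]   # lands next to median(y)

c(best_mse = best_mse, mean = mean(y),
  best_mae = best_mae, median = median(y))
```

The skew is deliberate: with a symmetric distribution the mean and median nearly coincide and the two winners are hard to tell apart.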

B — Now you know each person's height. Use it.

Same dataset. One new piece of information per person.

Scenario

Each NHANES adult has a height and a weight. Blue line: weight ~ α + β·height. Dashed gray: the Stage A constant predictor.

y_i ~ Normal(α + β·x_i, σ)

Toggle "show residuals" to see the vertical errors. Watch σ as you do.

Weight vs height with fitted line (and residual bars)

α̂: — | β̂: — | R²: — | residual σ: —

MSE constant predictor (Stage A): — | MSE line (Stage B): —

Prediction (required before controls unlock)

Q1. When you fit a line of weight on height, the residual SD compared to the marginal SD of weight will be:
smaller: the covariate accounts for some of the variation
the same: residuals always have the same SD as the response
bigger: adding a predictor injects more noise
Q2. R prints an intercept α̂ when you fit lm(weight ~ height). What is that number actually predicting?
The mean weight in the sample (independent of height)
The predicted weight of an adult with height = 0 cm: geometrically meaningful, biologically nonsense unless you center the predictor
The slope of weight on height (same number as β̂)

Toggle controls (residuals on/off, subsample, refit) at least 5 times to unlock Stage C. 0/5 toggles

Controls

show residual bars

subsample n: 400

seed: 42

R code — fit weight on height

nh <- read.csv("data/clean/nhanes_adults.csv")
set.seed(42)
idx <- sample(nrow(nh), 400)
d <- nh[idx, ]
fit <- lm(Weight ~ Height, data = d)
summary(fit)
sd(residuals(fit))   # residual SD -- smaller than sd(d$Weight)
# R color strings need the full six-digit hex form
plot(d$Height, d$Weight, pch = 16, col = "#444444",
     xlab = "height (cm)", ylab = "weight (kg)")
abline(fit, col = "#2f6b8f", lwd = 2)
abline(h = mean(d$Weight), col = "gray60", lty = 2)
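
Centering the predictor, mentioned in Q2, can be sketched as follows. This uses simulated heights and weights, not the NHANES file, and the coefficients used to generate them are arbitrary assumptions. The point is only what moves and what stays put when you subtract the mean height before fitting.

```r
# Simulated heights and weights (NOT NHANES data).
set.seed(3)
height <- rnorm(400, mean = 168, sd = 10)              # cm
weight <- -70 + 0.9 * height + rnorm(400, sd = 15)     # kg

raw      <- lm(weight ~ height)
height_c <- height - mean(height)                      # centered predictor
centered <- lm(weight ~ height_c)

coef(raw)        # intercept = predicted weight at height 0 cm
coef(centered)   # intercept = predicted weight at the mean height

# Centering moves the intercept only; the slope and residuals are unchanged
same_slope <- all.equal(unname(coef(raw)[2]), unname(coef(centered)[2]))
same_resid <- all.equal(residuals(raw), residuals(centered))
int_is_mean <- all.equal(unname(coef(centered)[1]), mean(weight))
c(same_slope = isTRUE(same_slope), same_resid = isTRUE(same_resid),
  int_is_mean = isTRUE(int_is_mean))
```

After centering, the intercept equals the sample mean weight, which is the Stage A constant predictor.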

C — Residual-pattern drill (5 rounds)

Five scatters. Each has a straight-line fit. Some are wrong on purpose. Pick the residual pattern. Move on.

Scenario

Each round: a scatter with a fitted line (top) and a residuals plot (bottom). Pick the pattern. Five rounds, real data each time.

Round 1 of 5 — fit and residuals

round: 1/5 | correct so far: 0

Prediction (required before the drill starts)

Q1. A residual-vs-fitted plot shows a curve (like a smile or a frown). The single most likely diagnosis is:
the relationship is nonlinear: a straight line misses a curve
heteroskedasticity: the variance changes with fitted value
a single outlier is causing the pattern
Q2. A residual-vs-fitted plot shows two disjoint clouds, one above and one below the zero line. The single most likely diagnosis is:
a missing group/category variable that separates the data into two clusters
heteroskedasticity: wider spread on one side
the relationship is nonlinear

Finish all 5 rounds to unlock Stage D. 0/5 rounds

R code — residuals as a diagnostic

# Round 1: mammal mass vs gestation, log-log (clean)
m <- read.csv("data/clean/pantheria_mammals.csv")
fit <- lm(log(AdultBodyMass_g) ~ log(GestationLen_d), data = m)
plot(fitted(fit), residuals(fit), pch = 16)
abline(h = 0, col = "#b23a48")
# Same plot, linear-linear: curvature appears
# which = 1 asks plot.lm for the residuals-vs-fitted panel only
plot(lm(AdultBodyMass_g ~ GestationLen_d, data = m), which = 1)
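
The three patterns named in the prediction questions can be manufactured on purpose. A minimal sketch, assuming simulated data in which each flaw is built in by construction (none of this is the drill's real data, and the numeric summaries are just one way to put a number on each picture):

```r
set.seed(4)
n <- 300
x <- runif(n, 1, 10)

y_curve <- 2 + 0.5 * x^2 + rnorm(n)            # truth is curved
y_fan   <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)  # noise SD grows with x
g       <- rbinom(n, 1, 0.5)                   # group left out of the model
y_group <- 2 + 3 * x + 8 * g + rnorm(n)

fits <- list(curve = lm(y_curve ~ x), fan = lm(y_fan ~ x),
             group = lm(y_group ~ x))

# One residual-vs-fitted panel per flaw
op <- par(mfrow = c(1, 3))
for (nm in names(fits)) {
  plot(fitted(fits[[nm]]), residuals(fits[[nm]]), pch = 16, main = nm,
       xlab = "fitted", ylab = "residual")
  abline(h = 0, col = "#b23a48")
}
par(op)

# Numeric fingerprints of each pattern
curve_cor <- cor(residuals(fits$curve), (x - mean(x))^2)  # smile: large
fan_cor   <- cor(abs(residuals(fits$fan)), x)             # fan: positive
group_gap <- diff(tapply(residuals(fits$group), g, mean)) # two clouds
c(curve_cor = curve_cor, fan_cor = fan_cor, group_gap = unname(group_gap))
```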

D — Beren and Cyrus: click where the missing measurement should sit

Real growth data for both kids. One measurement is hidden per round. Click where you think it goes; the reveal tells you what was actually there.

Scenario

One kid's mass-by-age trajectory with a single point hidden. WHO median in gray. Click where you think the missing point belongs.

The reveal color tells you something about that point. Six rounds.

Round 1 of 6 — click where the missing point should be

kid: — | round: 1/6 | your click: —

actual: — | prediction error: — | sick proxy: —

Prediction (required before the canvas unlocks)

Q1. A child is measured at the doctor when they are sick. Compared to a same-age clean measurement, the sick-day mass measurement will tend to be:
lower: children eat less and lose water weight when ill
higher: illness adds inflammation weight
the same: illness is uncorrelated with mass
Q2. If you fit "mass = f(age)" and ignore the sick-day flag, the residuals from that fit on sick-day measurements will:
tend to be more negative than clean-day residuals (the model overpredicts sick-day mass)
tend to be more positive
show the same distribution: illness flag doesn't matter

Complete all 6 rounds to wrap up. 0/6 rounds

R code — kids' growth with sick-proxy flag

k <- read.csv("data/clean/kids_growth.csv")
b <- subset(k, kid == "beren" & measure == "mass")
fit <- lm(value ~ poly(age_years, 2), data = b)
b$residual <- residuals(fit)
# compare residuals by sick_proxy flag
aggregate(residual ~ sick_proxy, data = b, mean)
boxplot(residual ~ sick_proxy, data = b,
        col = c("#6f8a4a", "#a86a1a"))

Stretch challenge (optional, recorded)

The downloaded .R for Stage D shows the boxplot move: pull residuals from the smooth "mass = f(age)" fit on Beren, then split them by sick_proxy. Do it. Are the sick-day residuals systematically more negative? Refit including sick_proxy as a predictor and report the new residual SD.
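
The shape of that move, on data that is not Beren's: a minimal sketch with a simulated growth record in which sick days are assumed to cost about 0.4 kg. Every number in the generating line is made up for illustration; the recorded challenge still needs the real file.

```r
# Simulated growth record (NOT Beren's or Cyrus's data).
set.seed(5)
age_years  <- sort(runif(60, 0.1, 5))
sick_proxy <- rbinom(60, 1, 0.25)            # 1 = measured on a sick day
mass_kg    <- 3.5 + 4.2 * age_years - 0.35 * age_years^2 -
              0.4 * sick_proxy + rnorm(60, sd = 0.25)
b <- data.frame(age_years, sick_proxy, mass_kg)

fit_age  <- lm(mass_kg ~ poly(age_years, 2), data = b)
fit_flag <- lm(mass_kg ~ poly(age_years, 2) + sick_proxy, data = b)

# Mean residual by flag from the age-only fit
by_flag <- tapply(residuals(fit_age), b$sick_proxy, mean)
by_flag

# Residual SD before and after the flag enters the model
sd_age  <- sd(residuals(fit_age))
sd_flag <- sd(residuals(fit_flag))
c(age_only = sd_age, with_flag = sd_flag)
```

Adding a predictor can never raise the residual SD of a least-squares fit, so the interesting question on the real data is how much it drops.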


E — A scatter where the "noise" turned out to be a missing variable

Every quote from The Lord of the Rings film adaptations, plotted as book page against film minute. Fit one line to all of them — read R². Then color by which film the quote came from.

Scenario

Three film adaptations of Tolkien's trilogy. Each colored point is one quote spoken on screen: the x coordinate is what minute it appears in the film, the y coordinate is what page it appears on in the source book.

If the films were perfect linear adaptations, every point would sit on one line. They don't. The residuals from a pooled fit look like noise — until you color them.

Quote position — film minute (x) vs book page (y)

pooled R²: — | colored by film R²: — | slope diff (Jackson Fellowship vs Bakshi): —

Prediction

Q1. Before you color by film, the pooled fit looks noisy. After you color by film, which one is true?
The noise was real noise: there's no hidden structure to find
The films share roughly the same slope; the residuals partition cleanly by film
The films have different slopes (different pages-per-minute pacing), and within each, the relationship is much tighter than the pooled fit suggested
Q2. Which of these is a fair statistical reading of what just happened? (Pick the strongest.)
Clean residuals don't confirm a model; dirty residuals reject it. Coloring revealed the variable the pooled model was missing.
Just need more data to get the pooled fit right.
The film variable adds nothing; the pooled line is just as good.

Toggle the color-by-film view at least once. 0/1 toggles

Controls

view

R code — residuals partition by a categorical

lotr <- read.csv("data/clean/lotr_quotes.csv")
# Pooled fit ignores 'film'
pooled <- lm(page ~ minute, data = lotr)
# Stratified fit lets each film have its own intercept and slope
strat  <- lm(page ~ minute * film, data = lotr)
summary(pooled)$r.squared; summary(strat)$r.squared

Draw the missing variable

The pooled fit ignored "film". In a DAG, "film" is a categorical node that affects both minute (when does this film's runtime end?) and page (which book is this film adapting?). Build that DAG below and watch the scatter behave: pooled vs colored-by-film. This is the same picture you saw above, generated from a causal model you control.
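
The same causal picture can be written as a simulation. A minimal sketch, assuming a DAG with film -> minute, film -> page and minute -> page; the three film labels, runtimes, starting pages and pacing values are hypothetical numbers chosen only to make the structure visible, not properties of any real film.

```r
# Simulated from an assumed DAG (NOT the lotr_quotes.csv data).
set.seed(6)
films <- data.frame(
  film     = c("film_1", "film_2", "film_3"),  # hypothetical labels
  runtime  = c(180, 130, 200),  # film -> minute: caps the x range
  start_pg = c(20, 10, 400),    # film -> page: where its source pages begin
  pace     = c(2.0, 4.5, 1.5)   # pages per minute within that film
)
idx    <- sample(nrow(films), 300, replace = TRUE)
minute <- runif(300, 0, films$runtime[idx])
page   <- films$start_pg[idx] + films$pace[idx] * minute +
          rnorm(300, sd = 15)
lotr_sim <- data.frame(film = films$film[idx], minute, page)

pooled <- lm(page ~ minute, data = lotr_sim)
strat  <- lm(page ~ minute * film, data = lotr_sim)
r2 <- c(pooled = summary(pooled)$r.squared,
        strat  = summary(strat)$r.squared)
r2

# Color the pooled residuals by film to see the "noise" separate
plot(lotr_sim$minute, residuals(pooled), pch = 16,
     col = as.integer(factor(lotr_sim$film)),
     xlab = "minute", ylab = "pooled residual")
abline(h = 0, col = "gray60", lty = 2)
```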

Showcase — the residual you just learned, drawn on the lecture example

No prediction, no controls. Just the same regression-and-residual move applied to the comparative anatomy from lecture.

What you're looking at

Ten mammals. x = the direct distance from the brainstem to the larynx (cm) — the path a sensible engineer would route the nerve down. y = the actual length of the recurrent laryngeal nerve (cm) — the path evolution actually took, looping under the aortic arch at the heart and back up.

The line is fit to the nine non-giraffe mammals. The giraffe is plotted in red — and the residual panel below shows how far off the line it sits.

Approximate values from comparative-anatomy references (Wedel 2012 for giraffe; standard texts for the rest). Treat as pedagogical rather than research-grade.

Top: brain-to-larynx distance vs. nerve length. Bottom: residuals from a fit on the non-giraffe mammals.

fit on non-giraffe (n = 9): slope: — | R²: — | giraffe predicted: — cm | giraffe actual: — cm | residual: — cm

What the residual is doing

The nine-mammal fit says: for every 1 cm of direct brain-to-larynx distance, expect roughly 2–3 cm of recurrent laryngeal nerve (the loop around the aortic arch adds a relatively fixed amount). For a giraffe with a ~25 cm direct distance, the line predicts ~60 cm of nerve.

The giraffe's actual nerve is roughly 270 cm. That's the residual you're looking at — the part of the giraffe's anatomy that the "what should a nerve do?" line can't explain. Lecture's framing: there is no engineering reason for it. There is only an ancestry reason — a vertical-transmission channel laying down nerve routing through every fish, every amphibian, every early tetrapod, every short-necked mammal, until it reached an animal whose neck stretched in a way the routing was never asked to anticipate.

Same machine you used on Beren's sick-day weights. Different residual, same diagnostic. By Unit 5 you will use this exact diagnostic at every scale of life: when current function and inherited form disagree, the disagreement is data about which channel of inheritance produced what you are seeing.

R code — fit on non-giraffe, then look where the giraffe sits

v <- read.csv("data/clean/vagus_nerve.csv")
non_g <- subset(v, species != "giraffe")
fit <- lm(rln_length_cm ~ direct_cm, data = non_g)
# predict the giraffe from the non-giraffe regression
giraffe <- subset(v, species == "giraffe")
pred <- predict(fit, newdata = giraffe)
giraffe$rln_length_cm - pred   # the residual: ~200 cm
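
One way to say how far off the line the giraffe sits is to compare it with a prediction interval from the non-giraffe fit. A minimal sketch: the nine non-giraffe rows below are made-up stand-in values, not measurements and not the contents of vagus_nerve.csv; only the giraffe's 25 cm and 270 cm come from the text above.

```r
# Made-up stand-in values for nine mammals (NOT measurements).
v <- data.frame(
  species       = c(paste0("mammal_", 1:9), "giraffe"),
  direct_cm     = c(2, 3, 4, 5, 6, 8, 10, 12, 15, 25),
  rln_length_cm = c(9, 11, 13, 16, 19, 22, 28, 33, 40, 270)
)
non_g   <- subset(v, species != "giraffe")
giraffe <- subset(v, species == "giraffe")
fit <- lm(rln_length_cm ~ direct_cm, data = non_g)

# 95% prediction interval for a new animal at the giraffe's direct distance
pi_g    <- predict(fit, newdata = giraffe, interval = "prediction")
resid_g <- giraffe$rln_length_cm - pi_g[, "fit"]

pi_g
resid_g / sd(residuals(fit))  # residual in units of non-giraffe residual SD
```

The giraffe's direct distance lies beyond the range of the other nine animals, so the interval is an extrapolation; treat it as a rough yardstick for the size of the residual, not as a formal test.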
