← all lessons

Lesson 2 — Model building and residual reading

BIO 202, Spring 2026 — draft v1. Same data, two models, different stories. What the residuals tell you about the variable you forgot.

What Lesson 2 is actually asking

You have one dataset. You can fit several different linear models to it, and the slopes will disagree. Which slope is "right" is not a question about the data — it is a question about what you are willing to assume. The residuals are how you discover that your model is missing something.

Note that "fit a regression" is not a single act. It is a loop: fit, check residuals, decide whether you need another term, fit again. The residuals are evidence.

A — One species, a clean fit, residuals that look like noise

When your model matches the data-generating process, the residuals are unstructured. That is the target — and it is what you will learn to miss in Stage B.

For this stage: say what a residual is (observed minus fitted), and name what a residual plot that says "the model is fine" should look like (no pattern, symmetric scatter, no drift in variance).

Start with Stage A.

Scenario

One species of Darwin's finch. Forty years of beak-depth measurements. Simulate the world where beak depth is a stable trait with a small year effect and Gaussian noise:

beaki,t ~ Normal(α + β·t, σ)

Fit lm(beak ~ year). Three things to look at. The scatter of beak vs year with the fitted line overlaid — what the model "sees." The residuals versus year — what the model did not capture. The Q-Q plot of residuals against a Normal — whether the noise is shaped the way the model assumes.

When the model is right, all three panels show no structure: residuals scatter symmetrically around zero, do not drift in variance, and line up on the Q-Q diagonal.

Note that "residuals look fine" is not a proof that the model is right — it is the absence of evidence that it is wrong. Stage B shows you what evidence of wrongness looks like.

beak ~ year

β̂:  |  α̂:  |  σ̂ (residual SD):

Residuals: vs year  ·  Q-Q

Prediction (required before sliders unlock)

  1. Q1. True slope is −0.02 mm/year, true σ = 0.2 mm. One simulated dataset's residual-vs-year plot will show…
  2. Q2. You set σ very large (say σ = 1.0). The Q-Q plot of residuals will…
Explore the sliders to unlock Stage B. 0/5 moves

Transfer question — new scenario

A colleague fits lm(y ~ x) and reports that the residuals "look fine." You plot them yourself and see a clear U-shape in residual-vs-x. What should your colleague try next?

Controls

20
-0.020
0.20
42

R code (base R)

# Lesson 2, Stage A — one species, clean residualsset.seed(42)n_per_year <- 20n_years    <- 40beta       <- -0.020sigma      <- 0.20year <- rep(1:n_years, each = n_per_year)beak <- rnorm(length(year), mean = 10 + beta*(year - mean(1:n_years)), sd = sigma)fit <- lm(beak ~ year)summary(fit)# the residual diagnostic — three panels that tell you the model is finepar(mfrow = c(1, 3))plot(year, beak, pch = 16, col = rgb(0.18, 0.42, 0.56, 0.35))abline(fit, col = "#b23a48", lwd = 2)plot(year, resid(fit), pch = 16); abline(h = 0, lty = 2)qqnorm(resid(fit)); qqline(resid(fit), col = "#b23a48")

B — Two species in, one model out: the residuals betray what you forgot

Same year effect. Same noise. Two species with different means, not one. Fit the pooled model (y ~ year) and watch the residuals separate into clumps.

For this stage: read a bimodal residual plot as a missing-factor signature. Fit lm(y ~ year + species) and watch the bimodality disappear. The nested-model F-test is the formal version of what your eye is already seeing.

Complete Stage A (submit prediction and move the sliders a few times) to unlock this section.

Scenario

Two species, both sampled every year. They differ in mean beak depth. Both carry the same small year effect.

beaki,t,s ~ Normal(αs + β·t, σ)   — s indexes species

Two models, same data:

  1. lm(beak ~ year) — the pooled model. Species is ignored. The fitted line tries to go through a bimodal cloud. Residuals split into an upper clump and a lower clump.
  2. lm(beak ~ year + species) — the additive model. Each species gets its own intercept. Residuals collapse back to noise.

The nested-model F-test anova(m1, m2) asks whether adding the species term reduces residual variance more than chance would. Under the pooled model's assumption that species does not matter, it nearly always does.

The trick is that you can usually see the missing variable before you know what it is. A bimodal residual distribution is a prompt: what discrete covariate have I forgotten? A smooth curve in residual-vs-x is a different prompt: what continuous covariate have I forgotten?

Scatter: beak vs year, colored by species

fortis scandens — pooled fit — per-species fits
β̂fortis:  |  β̂scandens:  |  β̂pooled:

Residual distributions: pooled model (top) vs additive model (bottom)

nested F (y~year+species vs y~year):  |  p:

Prediction (required before sliders unlock)

  1. Q1. True means are αfortis = 10, αscandens = 12. You fit lm(beak ~ year). The residuals from this model will…
  2. Q2. You switch to lm(beak ~ year + species) on the same data. The residual-vs-year plot now…
Explore the sliders to unlock Stage C. 0/5 moves

Transfer question — new scenario

You fit lm(lung_function ~ age) on a mixed sample of smokers and non-smokers. The residuals are clearly bimodal. The right next step is…

Controls

15
2.0
-0.010
0.30
42

R code (base R)

# Lesson 2, Stage B — two species, two modelsset.seed(42)n_per_year <- 15n_years    <- 40dmean      <- 2.0    # scandens minus fortisbeta       <- -0.010sigma      <- 0.30year    <- rep(1:n_years, each = 2*n_per_year)species <- rep(rep(c("fortis", "scandens"), each = n_per_year), n_years)alpha   <- ifelse(species == "fortis", 10, 10 + dmean)beak    <- rnorm(length(year), alpha + beta*(year - mean(1:n_years)), sigma)m_pool <- lm(beak ~ year)m_full <- lm(beak ~ year + species)anova(m_pool, m_full)    # nested F-testpar(mfrow = c(2, 2))hist(resid(m_pool), breaks = 30)    # bimodalhist(resid(m_full), breaks = 30)    # unimodalplot(year, resid(m_pool), pch = 16, col = factor(species))plot(year, resid(m_full), pch = 16, col = factor(species))

C — Same data, opposite sign: Simpson's paradox as a design artifact

If the two species are sampled in different fractions in different years, the pooled slope can reverse sign relative to every per-species slope. The data did not change — the pooling did.

For this stage: construct a dataset where each species trends down but the pooled model shows an up-trend. Explain, without jargon, why. Name when pooling is safe and when it is not.

Complete Stage B to unlock this section.

Scenario

Both species are present every year, but the fraction of each species in your sample shifts over time. Drag the imbalance slider to move from equal-fractions-every-year (no Simpson) toward heavy-fortis-early and heavy-scandens-late (maximum Simpson).

With heavy imbalance, the pooled slope β̂pooled comes from a sample whose composition is drifting upward over time (because the bigger-beaked species dominates the late years). The pooled line rises even though both species, individually, are declining. The per-species slopes point down; the pooled slope points up. Same data, two stories.

β̂pooled = β + shift in sample composition across t

The nested-model F-test catches this: once species is in the model, the pooled-vs-additive comparison is overwhelming. But if you never fit the additive model, the pooled slope is the only number you see.

Note that Simpson's paradox is not a failure of the data. It is a failure to read the design. The pooled slope answers a specific question: "if I randomly pick a specimen, does its year predict its beak?" — which, under composition drift, is a different question from "does each species's beak depend on year?" Different questions have different answers. The sign flip is the symptom.

Scatter with both fits and the pooled fit

β̂fortis:  |  β̂scandens:  |  β̂pooled:

Composition drift: fraction scandens per year

pooled slope sign matches per-species?

Prediction (required before sliders unlock)

  1. Q1. With imbalance set to 0 (equal species fractions every year) and both species declining at β = −0.02, the pooled slope will be…
  2. Q2. With maximum imbalance (scandens rises from 10% to 90% over 40 years) and both species declining, the pooled slope will be…
Last stage. 0/5 moves

Transfer question — new scenario

A hospital reports that its overall surgical survival rate improved from 75% to 80% over a decade. Both the "low-risk" and "high-risk" departments' survival rates dropped over that same decade. How can this be?

Controls

40
2.0
-0.020
0.70
0.30
42

R code (base R)

# Lesson 2, Stage C — Simpson's paradox from composition driftset.seed(42)n_total   <- 40n_years   <- 40dmean     <- 2.0beta      <- -0.020imb       <- 0.70   # scandens fraction sweeps from (0.5 - 0.4*imb) to (0.5 + 0.4*imb)sigma     <- 0.30dat <- do.call(rbind, lapply(1:n_years, function(t) {  frac_scandens <- 0.5 - 0.4*imb + 0.8*imb*((t - 1) / (n_years - 1))  n_s <- round(n_total * frac_scandens); n_f <- n_total - n_s  y_f <- rnorm(n_f, 10          + beta*(t - mean(1:n_years)), sigma)  y_s <- rnorm(n_s, 10 + dmean  + beta*(t - mean(1:n_years)), sigma)  data.frame(year = t, species = rep(c("fortis", "scandens"), c(n_f, n_s)),             beak = c(y_f, y_s))}))m_pool <- lm(beak ~ year, data = dat)m_full <- lm(beak ~ year + species, data = dat)coef(m_pool)[2]                # pooled slopecoef(m_full)[2]                # shared per-species slope

D — Grant finch data: three fits, one residual pattern that refuses to go away

Forty years of annual mean beak measurements for fortis and scandens on Daphne Major, plus annual population counts. Fit the pooled model, the additive model, and look at the residuals. Simpson's move from Stage C shows up when you weight the pool by abundance. The step-changes from the 1977 drought and the 2004 character displacement sit in the residuals no matter which of these models you choose.

For this stage: compare three fits on the same data, describe where the pooled slope disagrees with the per-species slopes, and identify the year-to-year structure in the additive-model residuals that proves the model is still wrong.

Complete Stage C first.

Scenario

Each point is a species-year mean — 40 years × 2 species = 80 observations, one fortis and one scandens per year. Pooling ignores species. The additive model adds a species intercept. Neither allows the slope itself to differ between species (that would be an interaction). In the unweighted, balanced design, pooled and additive slopes come out identical because species is orthogonal to year. Toggle population-count weighting to see Simpson's paradox in action: fortis is several times more abundant in most years, so count-weighting pulls the pooled slope toward the fortis trajectory while the additive slope holds its ground.

pooled: ȳt,s = α + β·t + ε
additive: ȳt,s = α + β·t + γ·I(s = scandens) + ε

The year-window controls let you fit only pre-1977, only post-2004, or any slice in between. Each slice tells a different story — pre-1977 fortis is flat, 1978–2003 fortis sits at a higher plateau, and post-2004 fortis sits at a lower one. A single line through the full forty years is a summary that misses both jumps.

Note that the additive model is the "right" model relative to pooled — it catches the species-intercept difference. But its residuals are the evidence that it is still not the right model for this question. Large positive residuals cluster in the drought window; large negative residuals cluster post-2004. A year term cannot absorb a jump. Lesson 3 does not fix this — no model of drift alone does. Fixing it requires adding rainfall, or seed-size spectrum, or congener abundance. That is what the residuals are asking for.

Annual means with three fits

β̂fortis: mm/yr  |  β̂scandens: mm/yr  |  β̂pooled: mm/yr  |  β̂additive: mm/yr
fortis annual mean scandens annual mean pooled fit additive (parallel lines)

Residuals of the additive model vs year

residual SD (pooled):  |  residual SD (additive):  |  lag-1 autocorrelation:

Prediction (required before controls unlock)

  1. Q1. Fit pooled (y ~ year) vs additive (y ~ year + species) on beak length, full range 1973–2012, unweighted. The pooled slope and the additive shared slope will…
  2. Q2. Turn on population-count weighting. Fortis outnumbers scandens in most years. The pooled slope will…
Last stage of Lesson 2. 0/5 moves

Transfer question — new scenario

You fit lm(y ~ year + species) to a two-species dataset and the residual SD is small. Looking at the residuals, they are tightly clustered in absolute value but years 5–12 are all negative and years 13–20 are all positive. Is the model adequate?

Controls

1973
2012

R code (base R)

# Lesson 2, Stage D — Grant finch beak trends, pooled vs additive, with residual reading# Source: Grant & Grant 2014, "40 Years of Evolution" (Fig. 01-06, 01-07, 14-03).beak <- read.csv("data/clean/finch_beak.csv")   # year, species, beak_length, beak_depth, beak_widthpop  <- read.csv("data/clean/finch_pop.csv")    # year, fortis, scandensy0 <- 1973y1 <- 2012trait <- "beak_length"d <- subset(beak, year >= y0 & year <= y1)d$y <- d[[trait]]d$w <- ifelse(d$species == "fortis",             pop$fortis[match(d$year, pop$year)],             pop$scandens[match(d$year, pop$year)])m_pool <- lm(y ~ year,           data = d)m_add  <- lm(y ~ year + species, data = d)m_wtd  <- lm(y ~ year,           data = d, weights = d$w)m_f    <- lm(y ~ year, data = subset(d, species == "fortis"))m_s    <- lm(y ~ year, data = subset(d, species == "scandens"))r <- resid(m_add)rho <- cor(r[-length(r)], r[-1])     # lag-1 autocorrelationplot(d$year, d$y, col = ifelse(d$species == "fortis", "#2f6b8f", "#c77a3d"), pch = 16)abline(m_pool, col = "#5a3080", lwd = 2)abline(m_add,  col = "#333",    lwd = 2)plot(d$year, r, col = ifelse(d$species == "fortis", "#2f6b8f", "#c77a3d"), pch = 16)abline(h = 0, lty = 2)