BIO 202, Spring 2026 — draft v1. Same data, two models, different stories. What the residuals tell you about the variable you forgot.
You have one dataset. You can fit several different linear models to it, and the slopes will disagree. Which slope is "right" is not a question about the data — it is a question about what you are willing to assume. The residuals are how you discover that your model is missing something.
y ~ year + species) and watch the residuals fall back into noise.Note that "fit a regression" is not a single act. It is a loop: fit, check residuals, decide whether you need another term, fit again. The residuals are evidence.
When your model matches the data-generating process, the residuals are unstructured. That is the target — and it is what you will learn to miss in Stage B.
For this stage: say what a residual is (observed minus fitted), and name what a residual plot that says "the model is fine" should look like (no pattern, symmetric scatter, no drift in variance).
One species of Darwin's finch. Forty years of beak-depth measurements. Simulate the world where beak depth is a stable trait with a small year effect and Gaussian noise:
beaki,t ~ Normal(α + β·t, σ)
Fit lm(beak ~ year). Three things to look at. The scatter of beak vs year with the fitted line overlaid — what the model "sees." The residuals versus year — what the model did not capture. The Q-Q plot of residuals against a Normal — whether the noise is shaped the way the model assumes.
When the model is right, all three panels show no structure: residuals scatter symmetrically around zero, do not drift in variance, and line up on the Q-Q diagonal.
Note that "residuals look fine" is not a proof that the model is right — it is the absence of evidence that it is wrong. Stage B shows you what evidence of wrongness looks like.
A colleague fits lm(y ~ x) and reports that the residuals "look fine." You plot them yourself and see a clear U-shape in residual-vs-x. What should your colleague try next?
# Lesson 2, Stage A — one species, clean residualsset.seed(42)n_per_year <- 20n_years <- 40beta <- -0.020sigma <- 0.20year <- rep(1:n_years, each = n_per_year)beak <- rnorm(length(year), mean = 10 + beta*(year - mean(1:n_years)), sd = sigma)fit <- lm(beak ~ year)summary(fit)# the residual diagnostic — three panels that tell you the model is finepar(mfrow = c(1, 3))plot(year, beak, pch = 16, col = rgb(0.18, 0.42, 0.56, 0.35))abline(fit, col = "#b23a48", lwd = 2)plot(year, resid(fit), pch = 16); abline(h = 0, lty = 2)qqnorm(resid(fit)); qqline(resid(fit), col = "#b23a48")
Same year effect. Same noise. Two species with different means, not one. Fit the pooled model (y ~ year) and watch the residuals separate into clumps.
For this stage: read a bimodal residual plot as a missing-factor signature. Fit lm(y ~ year + species) and watch the bimodality disappear. The nested-model F-test is the formal version of what your eye is already seeing.
Two species, both sampled every year. They differ in mean beak depth. Both carry the same small year effect.
beaki,t,s ~ Normal(αs + β·t, σ) — s indexes species
Two models, same data:
lm(beak ~ year) — the pooled model. Species is ignored. The fitted line tries to go through a bimodal cloud. Residuals split into an upper clump and a lower clump.lm(beak ~ year + species) — the additive model. Each species gets its own intercept. Residuals collapse back to noise.The nested-model F-test anova(m1, m2) asks whether adding the species term reduces residual variance more than chance would. Under the pooled model's assumption that species does not matter, it nearly always does.
The trick is that you can usually see the missing variable before you know what it is. A bimodal residual distribution is a prompt: what discrete covariate have I forgotten? A smooth curve in residual-vs-x is a different prompt: what continuous covariate have I forgotten?
lm(beak ~ year). The residuals from this model will…lm(beak ~ year + species) on the same data. The residual-vs-year plot now…You fit lm(lung_function ~ age) on a mixed sample of smokers and non-smokers. The residuals are clearly bimodal. The right next step is…
# Lesson 2, Stage B — two species, two modelsset.seed(42)n_per_year <- 15n_years <- 40dmean <- 2.0 # scandens minus fortisbeta <- -0.010sigma <- 0.30year <- rep(1:n_years, each = 2*n_per_year)species <- rep(rep(c("fortis", "scandens"), each = n_per_year), n_years)alpha <- ifelse(species == "fortis", 10, 10 + dmean)beak <- rnorm(length(year), alpha + beta*(year - mean(1:n_years)), sigma)m_pool <- lm(beak ~ year)m_full <- lm(beak ~ year + species)anova(m_pool, m_full) # nested F-testpar(mfrow = c(2, 2))hist(resid(m_pool), breaks = 30) # bimodalhist(resid(m_full), breaks = 30) # unimodalplot(year, resid(m_pool), pch = 16, col = factor(species))plot(year, resid(m_full), pch = 16, col = factor(species))
If the two species are sampled in different fractions in different years, the pooled slope can reverse sign relative to every per-species slope. The data did not change — the pooling did.
For this stage: construct a dataset where each species trends down but the pooled model shows an up-trend. Explain, without jargon, why. Name when pooling is safe and when it is not.
Both species are present every year, but the fraction of each species in your sample shifts over time. Drag the imbalance slider to move from equal-fractions-every-year (no Simpson) toward heavy-fortis-early and heavy-scandens-late (maximum Simpson).
With heavy imbalance, the pooled slope β̂pooled comes from a sample whose composition is drifting upward over time (because the bigger-beaked species dominates the late years). The pooled line rises even though both species, individually, are declining. The per-species slopes point down; the pooled slope points up. Same data, two stories.
β̂pooled = β + shift in sample composition across t
The nested-model F-test catches this: once species is in the model, the pooled-vs-additive comparison is overwhelming. But if you never fit the additive model, the pooled slope is the only number you see.
Note that Simpson's paradox is not a failure of the data. It is a failure to read the design. The pooled slope answers a specific question: "if I randomly pick a specimen, does its year predict its beak?" — which, under composition drift, is a different question from "does each species's beak depend on year?" Different questions have different answers. The sign flip is the symptom.
A hospital reports that its overall surgical survival rate improved from 75% to 80% over a decade. Both the "low-risk" and "high-risk" departments' survival rates dropped over that same decade. How can this be?
# Lesson 2, Stage C — Simpson's paradox from composition driftset.seed(42)n_total <- 40n_years <- 40dmean <- 2.0beta <- -0.020imb <- 0.70 # scandens fraction sweeps from (0.5 - 0.4*imb) to (0.5 + 0.4*imb)sigma <- 0.30dat <- do.call(rbind, lapply(1:n_years, function(t) { frac_scandens <- 0.5 - 0.4*imb + 0.8*imb*((t - 1) / (n_years - 1)) n_s <- round(n_total * frac_scandens); n_f <- n_total - n_s y_f <- rnorm(n_f, 10 + beta*(t - mean(1:n_years)), sigma) y_s <- rnorm(n_s, 10 + dmean + beta*(t - mean(1:n_years)), sigma) data.frame(year = t, species = rep(c("fortis", "scandens"), c(n_f, n_s)), beak = c(y_f, y_s))}))m_pool <- lm(beak ~ year, data = dat)m_full <- lm(beak ~ year + species, data = dat)coef(m_pool)[2] # pooled slopecoef(m_full)[2] # shared per-species slope
Forty years of annual mean beak measurements for fortis and scandens on Daphne Major, plus annual population counts. Fit the pooled model, the additive model, and look at the residuals. Simpson's move from Stage C shows up when you weight the pool by abundance. The step-changes from the 1977 drought and the 2004 character displacement sit in the residuals no matter which of these models you choose.
For this stage: compare three fits on the same data, describe where the pooled slope disagrees with the per-species slopes, and identify the year-to-year structure in the additive-model residuals that proves the model is still wrong.
Each point is a species-year mean — 40 years × 2 species = 80 observations, one fortis and one scandens per year. Pooling ignores species. The additive model adds a species intercept. Neither allows the slope itself to differ between species (that would be an interaction). In the unweighted, balanced design, pooled and additive slopes come out identical because species is orthogonal to year. Toggle population-count weighting to see Simpson's paradox in action: fortis is several times more abundant in most years, so count-weighting pulls the pooled slope toward the fortis trajectory while the additive slope holds its ground.
pooled: ȳt,s = α + β·t + ε
additive: ȳt,s = α + β·t + γ·I(s = scandens) + ε
The year-window controls let you fit only pre-1977, only post-2004, or any slice in between. Each slice tells a different story — pre-1977 fortis is flat, 1978–2003 fortis sits at a higher plateau, and post-2004 fortis sits at a lower one. A single line through the full forty years is a summary that misses both jumps.
Note that the additive model is the "right" model relative to pooled — it catches the species-intercept difference. But its residuals are the evidence that it is still not the right model for this question. Large positive residuals cluster in the drought window; large negative residuals cluster post-2004. A year term cannot absorb a jump. Lesson 3 does not fix this — no model of drift alone does. Fixing it requires adding rainfall, or seed-size spectrum, or congener abundance. That is what the residuals are asking for.
y ~ year) vs additive (y ~ year + species) on beak length, full range 1973–2012, unweighted. The pooled slope and the additive shared slope will…You fit lm(y ~ year + species) to a two-species dataset and the residual SD is small. Looking at the residuals, they are tightly clustered in absolute value but years 5–12 are all negative and years 13–20 are all positive. Is the model adequate?
# Lesson 2, Stage D — Grant finch beak trends, pooled vs additive, with residual reading# Source: Grant & Grant 2014, "40 Years of Evolution" (Fig. 01-06, 01-07, 14-03).beak <- read.csv("data/clean/finch_beak.csv") # year, species, beak_length, beak_depth, beak_widthpop <- read.csv("data/clean/finch_pop.csv") # year, fortis, scandensy0 <- 1973y1 <- 2012trait <- "beak_length"d <- subset(beak, year >= y0 & year <= y1)d$y <- d[[trait]]d$w <- ifelse(d$species == "fortis", pop$fortis[match(d$year, pop$year)], pop$scandens[match(d$year, pop$year)])m_pool <- lm(y ~ year, data = d)m_add <- lm(y ~ year + species, data = d)m_wtd <- lm(y ~ year, data = d, weights = d$w)m_f <- lm(y ~ year, data = subset(d, species == "fortis"))m_s <- lm(y ~ year, data = subset(d, species == "scandens"))r <- resid(m_add)rho <- cor(r[-length(r)], r[-1]) # lag-1 autocorrelationplot(d$year, d$y, col = ifelse(d$species == "fortis", "#2f6b8f", "#c77a3d"), pch = 16)abline(m_pool, col = "#5a3080", lwd = 2)abline(m_add, col = "#333", lwd = 2)plot(d$year, r, col = ifelse(d$species == "fortis", "#2f6b8f", "#c77a3d"), pch = 16)abline(h = 0, lty = 2)