Lesson 4 — Finding the cloud of lines that all fit the data

BIO 202, Spring 2026, draft v1. In Lesson 3 you fit a line. Here you ask what counts as "fit" in the first place.

What you'll do

Move a line by hand. Save 20 lines whose R² matches the OLS optimum to 2 decimal places. Then ask whether a bootstrap produces the same cloud.

Why "many lines, not one"? One line is what you'd draw if you trusted your data perfectly. The bootstrap asks the question that comes next: if I resampled my data, how thick a cloud of lines would the refits trace out? That cloud is your honest uncertainty in every "X correlates with Y" claim.

A — Drag a line and watch R² update

Synthetic scatter with a true generative line. Move the sliders. R² to 2 decimal places sits in the corner.

Scenario

50 points drawn from y = 1.0·x + 0.5 + Normal(0, 1). You place the line (it does not auto-fit). The big number is R² rounded to 2 decimal places.

Aim for the highest R² you can. Then see how far you can wiggle α and β before the rounded number changes.

Scatter + your line

your α: 0.5  |  your β: 1.0  |  R² (2 dp):
OLS best α̂:  |  OLS best β̂:  |  OLS R² (full):

Prediction (required before sliders unlock)

Q1. You're at the OLS optimum and you move β by 0.05 (small). R² to 2 decimal places will:
Q2. Moving the line in any direction from the OLS optimum:
Try at least 8 (α, β) combinations to unlock Stage B.

Controls

α: 0.5  |  β: 1.0  |  seed: 42

R code — hand-fit a line, compare to OLS

set.seed(42)
n <- 50
x <- runif(n, 0, 5)
y <- 0.5 + 1.0*x + rnorm(n, 0, 1)
my_a <- 0.5
my_b <- 1.0
yhat <- my_a + my_b * x
r2_my <- 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)
fit <- lm(y ~ x)
round(summary(fit)$r.squared, 2)
round(r2_my, 2)
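
The "wiggle" step can also be tried in code. The sketch below regenerates the Stage A data and scans slope offsets around the OLS optimum to see which ones leave the rounded R² unchanged. Holding the intercept at its OLS value while the slope moves is an assumption of this sketch, and the names `r2_line`, `offsets`, and `same` are illustrative, not part of the lesson page.

```r
# Sketch: how far can the slope move before the rounded R^2 changes?
# Data regenerated exactly as in the Stage A snippet.
set.seed(42)
n <- 50
x <- runif(n, 0, 5)
y <- 0.5 + 1.0 * x + rnorm(n, 0, 1)

# R^2 of an arbitrary hand-placed line y = a + b * x
r2_line <- function(a, b) {
  1 - sum((y - (a + b * x))^2) / sum((y - mean(y))^2)
}

fit    <- lm(y ~ x)
a_hat  <- unname(coef(fit)[1])
b_hat  <- unname(coef(fit)[2])
r2_ols <- r2_line(a_hat, b_hat)

# Scan slope offsets around the optimum, holding the intercept at a_hat
# (an assumption of this sketch; the sliders let you move both).
offsets <- seq(-0.3, 0.3, by = 0.005)
r2_scan <- sapply(offsets, function(d) r2_line(a_hat, b_hat + d))

# Offsets whose rounded R^2 still equals the rounded OLS value
same <- offsets[round(r2_scan, 2) == round(r2_ols, 2)]
range(same)
```

No scanned line beats the OLS R², and `range(same)` gives the band of slopes that the 2-decimal readout cannot tell apart from the optimum.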

B — Find 20 different lines that share the OLS R² (to 2 dp)

Save every line you find that matches the OLS R² within tolerance. Target: 20 saves.

Complete Stage A (submit prediction, try 8 combos) to unlock this section.

Scenario

Same scatter. When your R² (to 2 dp) matches the OLS R² within the tolerance, click Save this line. Saved lines stack as faint blue. Target: 20 saves.

Watch where the saved lines fall on the α–β plane (Stage C will plot it).

Scatter + your line + saved lines

target R² (OLS):  |  your R² (2 dp):  |  tolerance: ±0.01  |  saved: 0

Prediction (required before sliders unlock)

Q1. You raise β by 0.10 from the OLS optimum. To keep R² (to 2 dp) at the OLS level, you need to:
Q2. After 20 saves, the saved (α, β) pairs will:
Save 20 different lines matching the OLS R² (within tolerance).

Controls

α: 0.5  |  β: 1.0  |  tolerance: 0.01

R code — hunting (α, β) pairs with matching rounded R²

# Same data as Stage A.
target_r2 <- round(summary(lm(y ~ x))$r.squared, 2)
# grid-search (a, b) pairs whose rounded R^2 matches
grid <- expand.grid(a = seq(-1, 2, by = 0.05),
                    b = seq(0.6, 1.4, by = 0.05))
grid$r2 <- apply(grid, 1, function(p) {
  1 - sum((y - (p[1] + p[2]*x))^2) / sum((y - mean(y))^2)
})
keep <- subset(grid, round(r2, 2) == target_r2)
nrow(keep)   # how many pairs share the rounded R^2?
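
To see where the matching pairs fall on the α–β plane before Stage C plots them, a follow-up sketch along these lines may help. It assumes "match" means the unrounded R² lies within 0.01 of the OLS value, which is a simplification of the panel's rounded comparison, and it uses a finer grid centred on the OLS estimates. The grid limits are illustrative choices.

```r
# Sketch: where do the matching (a, b) pairs sit on the alpha-beta plane?
set.seed(42)
n <- 50
x <- runif(n, 0, 5)
y <- 0.5 + 1.0 * x + rnorm(n, 0, 1)

fit    <- lm(y ~ x)
a_hat  <- unname(coef(fit)[1])
b_hat  <- unname(coef(fit)[2])
sst    <- sum((y - mean(y))^2)
r2_ols <- summary(fit)$r.squared

# Fine grid centred on the OLS estimates (limits are illustrative choices)
grid <- expand.grid(a = a_hat + seq(-0.8, 0.8, by = 0.02),
                    b = b_hat + seq(-0.3, 0.3, by = 0.01))
grid$r2 <- mapply(function(a, b) 1 - sum((y - (a + b * x))^2) / sst,
                  grid$a, grid$b)

# Assumption: "match" means unrounded R^2 within 0.01 of the OLS value
keep <- subset(grid, r2_ols - r2 <= 0.01)

nrow(keep)
cor(keep$a, keep$b)           # sign shows the direction of the trade-off
-mean(x) / sqrt(mean(x^2))    # correlation implied by the SSE ellipse, for comparison
```

The kept pairs fill a tilted ellipse rather than a round blob: raising the slope has to be paid for by lowering the intercept, because the x values sit to the right of zero.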

C — The bootstrap cloud (200 resamples of the data)

200 bootstrap resamples. Refit OLS on each. Overlay the 200 fitted lines, then look at the α–β plane next to your Stage B cloud.

Complete Stage B (submit prediction, save 20 lines) to unlock this section.

Scenario

A bootstrap resample = 50 points sampled with replacement from the original 50. Some points appear twice, some not at all. Refit OLS on each resample; collect (α̂, β̂).

200 of them, plotted as faint red lines on the scatter. Your Stage B saved cloud appears green in the α–β plot. Compare.

Scatter + bootstrap fitted lines (red) + your saved cloud overlay

α × β plot: bootstrap (red) vs. your saved (green)

Prediction (required before bootstrap runs)

Q1. The bootstrap cloud of (α̂, β̂) pairs will:
Q2. Compared to the OLS point estimate, the bootstrap cloud's spread tells you:
Run at least 3 bootstrap batches with different seeds.

Controls

resamples: 200  |  seed: 42

R code — bootstrap the OLS slope

# Same x, y as Stage A.
set.seed(42)
reps <- 200
boot <- replicate(reps, {
  i <- sample(length(x), length(x), replace = TRUE)
  coef(lm(y[i] ~ x[i]))
})
apply(boot, 1, quantile, c(.025, .975))
cor(boot[1,], boot[2,])   # negative -- alpha and beta trade off
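
One way to read the spread of the bootstrap cloud is to set it beside the standard errors that `lm()` reports from the OLS formula. The sketch below does that on the Stage A data, regenerated here so the block runs on its own. It is a rough comparison under the lesson's simulated setup, not a claim that the two always agree.

```r
# Sketch: compare the bootstrap spread with lm()'s analytic standard errors.
set.seed(42)
n <- 50
x <- runif(n, 0, 5)
y <- 0.5 + 1.0 * x + rnorm(n, 0, 1)
fit <- lm(y ~ x)

reps <- 200
boot <- replicate(reps, {
  i <- sample(n, n, replace = TRUE)   # resample row indices with replacement
  coef(lm(y[i] ~ x[i]))
})

boot_se  <- apply(boot, 1, sd)                          # SD of each coefficient across resamples
model_se <- summary(fit)$coefficients[, "Std. Error"]   # SEs from the OLS formula

rbind(bootstrap = boot_se, analytic = model_se)
```

With well-behaved simulated noise like this, the two rows should land in the same neighbourhood. The bootstrap row is the one that makes no use of the normal-error formula.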

D — Real data: NHANES height → weight

Same two moves as Stages B and C, on a 400-person NHANES subsample. Compare R² at β̂, at β̂ + 1 SE, and at β̂ − 1 SE.

Complete Stage C (submit prediction, run 3 bootstrap batches) to unlock this section.

Scenario

A 400-person random subsample from NHANES. Fit weight ~ height, bootstrap it 200 times. R² readouts at β̂, β̂ + 1 SE, and β̂ − 1 SE all appear in the panel. Compare them.

NHANES weight ~ height with bootstrap fit cloud

OLS β̂:  |  SE(β̂):  |  95% bootstrap CI on β:
R² at β̂:  |  R² at β̂ + 1 SE:  |  R² at β̂ − 1 SE:

Prediction (required before bootstrap unlocks)

Q1. For NHANES height → weight, R² at β̂ + 1 SE compared to R² at β̂ will be:
Q2. The 95% bootstrap interval on β̂ will be:
Run at least 2 NHANES bootstrap batches to wrap up.

Controls

subsample size: 400  |  resamples: 200  |  seed: 42

R code — same move on NHANES

nh <- read.csv("data/clean/nhanes_adults.csv")
set.seed(42)
idx <- sample(nrow(nh), 400)
d <- nh[idx, ]
fit <- lm(Weight ~ Height, data = d)
B <- 200
boot <- replicate(B, {
  k <- sample(nrow(d), nrow(d), replace = TRUE)
  coef(lm(Weight ~ Height, data = d[k, ]))
})
apply(boot, 1, quantile, c(.025, .975))
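
The R² readouts at β̂ ± 1 SE can be sketched as follows. Because the NHANES file is not bundled here, the block runs on simulated stand-in data: the `height` and `weight` vectors and every number in their generation are made up for illustration. It also assumes the shifted line is re-anchored to pass through the point of means. The panel may instead hold the intercept fixed, which would give a larger drop.

```r
# Sketch on simulated stand-in data (NOT NHANES; all numbers below are made up).
set.seed(1)
n      <- 400
height <- rnorm(n, mean = 168, sd = 10)            # cm, hypothetical
weight <- -70 + 0.85 * height + rnorm(n, 0, 15)    # kg, hypothetical

fit   <- lm(weight ~ height)
b_hat <- unname(coef(fit)[2])
se_b  <- summary(fit)$coefficients[2, "Std. Error"]
sst   <- sum((weight - mean(weight))^2)

# R^2 for a line of slope b that pivots through (mean height, mean weight).
# Assumption: the intercept is re-chosen this way; the panel may do otherwise.
r2_at <- function(b) {
  a <- mean(weight) - b * mean(height)
  1 - sum((weight - (a + b * height))^2) / sst
}

r2_hat   <- r2_at(b_hat)
r2_plus  <- r2_at(b_hat + se_b)
r2_minus <- r2_at(b_hat - se_b)
c(at_bhat = r2_hat, plus_1se = r2_plus, minus_1se = r2_minus)

# With this pivot, a 1 SE shift costs exactly (1 - R^2) / (n - 2)
r2_hat - (1 - r2_hat) / (n - 2)
```

Under the pivot assumption the drop is the same in both directions and shrinks as n grows, so at n = 400 a full standard error of slope barely moves a 2-decimal R².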

Stretch challenge (optional)

Run the same bootstrap at n = 50, n = 200, and n = 1500. Report the bootstrap SE on β̂ for each. Then find a function of n that the three SEs fit, and say which way the cloud widens. Hit "I tried it" once you have a function.
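
A possible scaffold for the challenge is sketched below. It uses simulated Stage A style data as a stand-in so that it runs without the NHANES file; for the real exercise the data-generating lines would be swapped for NHANES subsamples of each size. The helper name `boot_se_slope` is hypothetical.

```r
# Scaffold: bootstrap SE of the slope at several sample sizes.
# Simulated stand-in data; replace the two data lines with NHANES subsamples.
boot_se_slope <- function(n, reps = 200) {
  x <- runif(n, 0, 5)
  y <- 0.5 + 1.0 * x + rnorm(n, 0, 1)
  slopes <- replicate(reps, {
    i <- sample(n, n, replace = TRUE)
    coef(lm(y[i] ~ x[i]))[2]          # keep only the slope
  })
  sd(slopes)                          # bootstrap SE = SD across resamples
}

set.seed(42)
sizes <- c(50, 200, 1500)
se    <- sapply(sizes, boot_se_slope)
data.frame(n = sizes, boot_se = se)

# One way to hunt for a power law: regress log(SE) on log(n)
coef(lm(log(se) ~ log(sizes)))
```

The slope of the log-log fit is the exponent of n in a candidate function. Checking it against your three NHANES values is the actual task.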

Showcase — bootstrapping the age of a chalk cliff

Same cloud you just built. Different question: how old does the bottom of the cliff have to be?

What you're looking at

The White Cliffs of Dover are ~100 m of chalk: coccolithophore shells stacked ~250 per millimeter, deposited in order with nothing strong enough to shuffle them since. The top panel shows modern marine sites where the same carbonate sediment is forming today, with the rate measured at each.

The bootstrap below resamples those rates and asks: at this rate, how long would it take to deposit 100 m of chalk?

Top: 12 modern carbonate-deposition rates. Bottom: bootstrap of "years to accumulate 100 m" (log scale).

mean rate: cm/kyr  |  point estimate of age: Myr
bootstrap 95% CI on age:  |  does CI reach 6000 yr?

What the bootstrap is doing

Every resample is a possible "true" mean deposition rate for the Cretaceous Dover sea, assuming modern rates are a fair stand-in for it, and every one implies an age in the tens of millions of years. The pre-Darwin chronology of ~6,000 years appears nowhere in the cloud. Under that assumption, the data leave no room for a young cliff: the bottom has to be old.

R code — bootstrap the implied age

d <- read.csv("data/clean/coccolith_deposition.csv")
rates <- d$rate_cm_per_kyr
cliff_cm <- 10000   # 100 m of chalk
age_yr <- replicate(5000, {
  r <- mean(sample(rates, length(rates), replace = TRUE))
  cliff_cm / r * 1000   # cm / (cm/kyr) * (yr/kyr) = yr
})
quantile(age_yr, c(0.025, 0.975))   # 95% CI on age
mean(age_yr < 6000)   # bootstrap mass under creationist chronology: 0
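
The unit arithmetic behind the implied age can be checked without any data. The sketch below turns the question around and asks what deposition rate would be needed to fit 100 m of chalk into 6,000 years. The function name `years_to_deposit` is illustrative, and the 1 cm/kyr figure is a round number for scale, not a measured rate.

```r
# Sketch: the unit arithmetic behind the implied age, with no data involved.
cliff_cm <- 100 * 100   # 100 m of chalk, in cm

# Years needed to deposit the cliff at a given rate (cm per thousand years)
years_to_deposit <- function(rate_cm_per_kyr) {
  cliff_cm / rate_cm_per_kyr * 1000
}

# Rate that would be required to fit the whole cliff into 6,000 years
required_rate <- cliff_cm / (6000 / 1000)   # cm/kyr
required_rate

# Round trip: that rate gives back 6,000 years
years_to_deposit(required_rate)

# For scale: an illustrative rate of 1 cm/kyr (a round number, not a measurement)
years_to_deposit(1)
```

Comparing `required_rate` with the mean rate in the readout shows how far the measured rates would have to be from reality for the short chronology to enter the cloud.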