BIO 202 — Lesson 4: Finding the cloud of lines that all fit the data

What you'll do

Move a line by hand. Save 20 lines whose R² matches the OLS optimum to 2 decimal places. Then ask whether a bootstrap reaches the same cloud.

Why "many lines, not one"? One line is what you'd draw if you trusted your data perfectly. The bootstrap asks the question that comes next: if I had drawn a slightly different sample, how thick a cloud of fitted lines would I get? That cloud is the honest uncertainty behind every "X correlates with Y" claim.

A — Drag a line and watch R² update

Synthetic scatter with a true generative line. Move the sliders. R² to 2 decimal places sits in the corner.

Scenario

50 points drawn from y = 1.0·x + 0.5 + Normal(0, 1). You place the line (it does not auto-fit). The big number is R² rounded to 2 decimal places.

Aim for the highest R² you can. Then see how far you can wiggle α and β before the rounded number changes.

Scatter + your line

your α: 0.5 | your β: 1.0 | R² (2 dp): —

OLS best α̂: — | OLS best β̂: — | OLS R² (full): —

Prediction (required before sliders unlock)

Q1. You're at the OLS optimum and you move β by 0.05 (small). R² to 2 decimal places will:
  (a) almost always stay the same (because R² near the max is flat in β)
  (b) drop visibly
  (c) rise
Q2. Moving the line in any direction from the OLS optimum:
  (a) cannot raise R²: OLS already gave the max
  (b) can raise or lower it randomly
  (c) raises it if you go in the right direction

Try at least 8 (α, β) combinations to unlock Stage B.

Controls

α (intercept): 0.5

β (slope): 1.0

seed: 42

R code — hand-fit a line, compare to OLS

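    # Simulate the Stage A scatter: y = 0.5 + 1.0*x + Normal(0, 1)
    set.seed(42)
    n <- 50
    x <- runif(n, 0, 5)
    y <- 0.5 + 1.0*x + rnorm(n, 0, 1)

    # Your hand-placed line and its R^2
    my_a <- 0.5
    my_b <- 1.0
    yhat <- my_a + my_b * x
    r2_my <- 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)

    # OLS fit for comparison
    fit <- lm(y ~ x)
    round(summary(fit)$r.squared, 2)
    round(r2_my, 2)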
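
The "wiggle" experiment can also be run without sliders. The sketch below is one possible way to do it, not the lab's own implementation: it holds the intercept at the OLS value and nudges the slope in steps of 0.05, printing the full and rounded R² at each step. The helper `r2_line` and the data frame `nudges` are illustrative names that do not appear in the lesson.

```r
# Same synthetic data as the Stage A code.
set.seed(42)
n <- 50
x <- runif(n, 0, 5)
y <- 0.5 + 1.0 * x + rnorm(n, 0, 1)

# R^2 of an arbitrary hand-placed line (a = intercept, b = slope).
# Illustrative helper, not part of the lesson code.
r2_line <- function(a, b) {
  1 - sum((y - (a + b * x))^2) / sum((y - mean(y))^2)
}

fit   <- lm(y ~ x)
a_hat <- coef(fit)[[1]]
b_hat <- coef(fit)[[2]]

# Nudge the slope away from the OLS value, holding the intercept fixed.
delta  <- seq(-0.20, 0.20, by = 0.05)
nudges <- data.frame(
  delta_b = delta,
  r2      = sapply(delta, function(d) r2_line(a_hat, b_hat + d))
)
nudges$r2_2dp <- round(nudges$r2, 2)
print(nudges)
```

Read down the `r2_2dp` column to see at which nudge the rounded value first changes. Because the OLS residuals are orthogonal to x, nudges of +d and −d cost the same amount of R² when the intercept is held fixed.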

B — Find 20 different lines that share the OLS R² (to 2 dp)

Save every line you find that matches the OLS R² within tolerance. Target: 20 saves.

Scenario

Same scatter. When your R² (to 2 dp) matches the OLS R² within the tolerance, click Save this line. Saved lines stack as faint blue. Target: 20 saves.

Watch where the saved lines fall on the α–β plane (Stage C will plot it).

Scatter + your line + saved lines

target R² (OLS): — | your R² (2 dp): — | tolerance: ±0.01 | saved: 0

Prediction (required before sliders unlock)

Q1. You raise β by 0.10 from the OLS optimum. To keep R² (to 2 dp) at the OLS level, you need to:
  (a) raise α as well: both go the same direction
  (b) lower α: the two slide along a tilted line of equal-fit pairs
  (c) leave α alone: only β controls R²
Q2. After 20 saves, the saved (α, β) pairs will:
  (a) all collapse to a single point (the OLS optimum): there is no slack
  (b) trace out an elongated ellipse: α and β are negatively correlated
  (c) form a uniform random cloud: no structure

Save 20 different lines matching the OLS R² (within tolerance).

Controls

α (intercept): 0.5

β (slope): 1.0

tolerance: ±0.01

R code — hunting (α, β) pairs with matching rounded R²

    # Same data as Stage A.
    target_r2 <- round(summary(lm(y ~ x))$r.squared, 2)

    # grid-search (a, b) pairs whose rounded R^2 matches
    grid <- expand.grid(a = seq(-1, 2, by = 0.05),
                        b = seq(0.6, 1.4, by = 0.05))
    grid$r2 <- apply(grid, 1, function(p) {
      1 - sum((y - (p[1] + p[2]*x))^2) / sum((y - mean(y))^2)
    })
    keep <- subset(grid, round(r2, 2) == target_r2)
    nrow(keep)   # how many pairs share the rounded R^2?
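
One way to see why the matching pairs line up along a tilt: for any fixed slope b, the intercept that minimizes squared error is mean(y) − b·mean(x), so the best line for that slope still passes through the centroid of the data. The sketch below, under that standard least-squares fact, raises the slope by 0.10 and compares R² with the intercept left alone versus the intercept shifted to keep the line through the centroid. The names `r2_line`, `a_pivot`, and `compare` are illustrative.

```r
# Same synthetic data as Stage A.
set.seed(42)
n <- 50
x <- runif(n, 0, 5)
y <- 0.5 + 1.0 * x + rnorm(n, 0, 1)

# Illustrative helper: R^2 of a hand-placed line.
r2_line <- function(a, b) {
  1 - sum((y - (a + b * x))^2) / sum((y - mean(y))^2)
}

fit   <- lm(y ~ x)
a_hat <- coef(fit)[[1]]
b_hat <- coef(fit)[[2]]

db      <- 0.10                     # raise the slope by 0.10
a_pivot <- a_hat - db * mean(x)     # intercept that keeps the line through (mean(x), mean(y))

compare <- c(
  ols             = r2_line(a_hat, b_hat),
  slope_only      = r2_line(a_hat, b_hat + db),
  slope_and_pivot = r2_line(a_pivot, b_hat + db)
)
round(compare, 4)
```

Since mean(x) is positive here, the compensating move lowers the intercept, and the pivoted line always loses less R² than the slope-only move.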

C — The bootstrap cloud (200 resamples of the data)

200 bootstrap resamples. Refit OLS on each. Overlay the 200 fitted lines, then look at the α–β plane next to your Stage B cloud.

Scenario

A bootstrap resample = 50 points sampled with replacement from the original 50. Some points appear twice, some not at all. Refit OLS on each resample; collect (α̂, β̂).

200 of them, plotted as faint red lines on the scatter. Your Stage B saved cloud appears green in the α–β plot. Compare.
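
To make "some points appear twice, some not at all" concrete, here is a minimal base-R sketch that draws one resample of row indices and counts how often each original point was picked. The variable names are illustrative; the only assumption is n = 50, as in the scenario.

```r
set.seed(42)
n <- 50
i <- sample(n, n, replace = TRUE)   # one bootstrap resample of row indices

counts <- tabulate(i, nbins = n)    # times each original point was drawn
table(counts)                       # 0 = left out, 1 = once, 2+ = repeated
mean(counts == 0)                   # share of original points left out

# Averaged over many resamples, the left-out share approaches (1 - 1/n)^n.
left_out <- replicate(2000, {
  mean(tabulate(sample(n, n, replace = TRUE), nbins = n) == 0)
})
c(simulated = mean(left_out), theory = (1 - 1/n)^n)
```

Each resample therefore refits the line on a somewhat different subset of the points, which is what makes the fitted (α̂, β̂) move from one resample to the next.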

Scatter + bootstrap fitted lines (red) + your saved cloud overlay

α × β plot: bootstrap (red) vs. your saved (green)

Prediction (required before bootstrap runs)

Q1. The bootstrap cloud of (α̂, β̂) pairs will:
  (a) look like a circular cloud around the OLS optimum
  (b) look like a tilted ellipse: α̂ and β̂ are correlated
  (c) have α̂ and β̂ varying independently: uncorrelated
Q2. Compared to the OLS point estimate, the bootstrap cloud's spread tells you:
  (a) nothing new: OLS already gives the answer
  (b) the standard error of β̂: the cloud's width is what SE describes
  (c) the bias of β̂: whether OLS is over- or under-estimating

Run at least 3 bootstrap batches with different seeds.

Controls

reps: 200

seed: 42

R code — bootstrap the OLS slope

    # Same x, y as Stage A.
    set.seed(42)
    reps <- 200
    boot <- replicate(reps, {
      i <- sample(length(x), length(x), replace = TRUE)
      coef(lm(y[i] ~ x[i]))
    })
    apply(boot, 1, quantile, c(.025, .975))
    cor(boot[1,], boot[2,])   # negative -- alpha and beta trade off
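
A natural follow-up is to measure the width of the cloud and set it beside the standard errors that `lm()` reports. The sketch below is one way to do that on the Stage A data; `se_boot` and `se_lm` are illustrative names, and the two rows should be read as rough agreement, not as an exact match.

```r
# Same synthetic data as Stage A.
set.seed(42)
n <- 50
x <- runif(n, 0, 5)
y <- 0.5 + 1.0 * x + rnorm(n, 0, 1)
fit <- lm(y ~ x)

# Refit OLS on 200 resamples; rows of `boot` are intercept and slope.
reps <- 200
boot <- replicate(reps, {
  i <- sample(n, n, replace = TRUE)
  coef(lm(y[i] ~ x[i]))
})

se_boot <- apply(boot, 1, sd)                          # spread of the cloud along each axis
se_lm   <- summary(fit)$coefficients[, "Std. Error"]   # formula-based SEs from lm()
rbind(bootstrap = se_boot, lm_formula = se_lm)
```

With only 200 resamples the bootstrap row will shift a little from seed to seed, which is why the lab asks for several batches.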

D — Real data: NHANES height → weight

Same two moves as Stages B and C, on a 400-person NHANES subsample. Compare R² at β̂, at β̂ + 1 SE, and at β̂ − 1 SE.

Scenario

A 400-person random subsample from NHANES. Fit weight ~ height, bootstrap it 200 times. R² readouts at β̂, β̂ + 1 SE, and β̂ − 1 SE all appear in the panel. Compare them.

NHANES weight ~ height with bootstrap fit cloud

OLS β̂: — | SE(β̂): — | 95% bootstrap β: —

R² at β̂: — | R² at β̂ + 1 SE: — | R² at β̂ − 1 SE: —

Prediction (required before bootstrap unlocks)

Q1. For NHANES height → weight, R² at β̂ + 1 SE compared to R² at β̂ will be:
  (a) the same to 2 decimal places
  (b) noticeably lower: moving 1 SE costs a lot
  (c) exactly zero: only the OLS β̂ gives any fit
Q2. The 95% bootstrap interval on β̂ will be:
  (a) [β̂, β̂]: there's no uncertainty
  (b) roughly β̂ ± 1.96·SE: a Gaussian-like band
  (c) very wide: covering everything from negative to large positive

Run at least 2 NHANES bootstrap batches to wrap up.

Controls

subsample n: 400

reps: 200

seed: 42

R code — same move on NHANES

    # Draw the 400-person subsample
    nh <- read.csv("data/clean/nhanes_adults.csv")
    set.seed(42)
    idx <- sample(nrow(nh), 400)
    d <- nh[idx, ]

    # OLS fit on the subsample
    fit <- lm(Weight ~ Height, data = d)

    # Bootstrap: resample rows with replacement and refit
    B <- 200
    boot <- replicate(B, {
      k <- sample(nrow(d), nrow(d), replace = TRUE)
      coef(lm(Weight ~ Height, data = d[k, ]))
    })
    apply(boot, 1, quantile, c(.025, .975))
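
The panel's three R² readouts can be approximated in code. The sketch below is a hedged illustration, not the lab's implementation: it assumes the intercept is re-chosen so each shifted line still passes through (mean height, mean weight), and it runs on a synthetic stand-in data frame with the same column names, NOT on NHANES values. The helper `r2_at_slope` is hypothetical.

```r
# Hypothetical helper: R^2 of the line with slope b that passes
# through the centroid (mean(x), mean(y)).
r2_at_slope <- function(x, y, b) {
  a <- mean(y) - b * mean(x)
  1 - sum((y - (a + b * x))^2) / sum((y - mean(y))^2)
}

# Synthetic stand-in for the 400-row subsample `d` (made-up numbers, not NHANES).
set.seed(42)
d <- data.frame(Height = rnorm(400, 168, 10))
d$Weight <- -80 + 0.95 * d$Height + rnorm(400, 0, 15)

fit   <- lm(Weight ~ Height, data = d)
b_hat <- coef(fit)[["Height"]]
se_b  <- summary(fit)$coefficients["Height", "Std. Error"]

# R^2 at the OLS slope and one standard error to either side.
slopes <- c(minus_1se = b_hat - se_b, at_bhat = b_hat, plus_1se = b_hat + se_b)
r2 <- sapply(slopes, function(b) r2_at_slope(d$Height, d$Weight, b))
round(r2, 4)
```

Under this pivoting assumption the drop from β̂ to β̂ ± 1 SE works out to (1 − R²)/(n − 2), the same on both sides. If the intercept were held fixed instead, the drop would be far larger, because heights sit a long way from zero; to run this on the real data, replace the synthetic `d` with the subsample from the lesson code.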

Stretch challenge (optional)

Run the same bootstrap at n = 50, n = 200, and n = 1500. Report the bootstrap SE on β̂ for each. Then find a function of n that the three SEs fit, and say which way the cloud widens. Hit "I tried it" once you have a function.
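
If you want a scaffold for the challenge, the sketch below wraps the bootstrap in a function and runs it at the three sample sizes. It is only a starting point: `boot_se_slope` is a hypothetical helper, and the data come from the Stage A generator as a stand-in, so swap in NHANES subsamples of each size for the real exercise. Finding the function of n is left to you.

```r
# Hypothetical helper: bootstrap SE of the OLS slope for one data set.
boot_se_slope <- function(x, y, reps = 200) {
  slopes <- replicate(reps, {
    i <- sample(length(x), length(x), replace = TRUE)
    coef(lm(y[i] ~ x[i]))[[2]]
  })
  sd(slopes)
}

set.seed(42)
sizes <- c(50, 200, 1500)
se_by_n <- sapply(sizes, function(n) {
  x <- runif(n, 0, 5)                    # Stage A generator as a stand-in
  y <- 0.5 + 1.0 * x + rnorm(n, 0, 1)
  boot_se_slope(x, y)
})
data.frame(n = sizes, se_boot = se_by_n)
```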

Showcase — bootstrapping the age of a chalk cliff

Same cloud you just built. Different question: how old does the bottom of the cliff have to be?

What you're looking at

The White Cliffs of Dover are ~100 m of chalk — coccolithophore shells stacked ~250 per millimeter, deposited in order with nothing strong enough to shuffle them since. Above: modern marine sites where the same carbonate sediment is forming today, with the rate measured at each.

The bootstrap below resamples those rates and asks: at this rate, how long would it take to deposit 100 m of chalk?

Top: 12 modern carbonate-deposition rates. Bottom: bootstrap of "years to accumulate 100 m" (log scale).

mean rate: — cm/kyr | point estimate of age: — Myr

bootstrap 95% CI on age: — | does CI reach 6000 yr? —

What the bootstrap is doing

Every resample is a possible "true" mean deposition rate for the Cretaceous Dover sea, and every one implies an age in the tens of millions of years. The pre-Darwin chronology of ~6,000 years appears nowhere in the cloud. The bootstrap forced the bottom of the cliff to be old.

R code — bootstrap the implied age

    d <- read.csv("data/clean/coccolith_deposition.csv")
    rates <- d$rate_cm_per_kyr
    cliff_cm <- 10000   # 100 m of chalk
    age_yr <- replicate(5000, {
      r <- mean(sample(rates, length(rates), replace = TRUE))
      cliff_cm / r * 1000   # cm / (cm/kyr) * (yr/kyr) = yr
    })
    quantile(age_yr, c(0.025, 0.975))   # 95% CI on age
    mean(age_yr < 6000)   # bootstrap mass under creationist chronology: 0