Lesson 0 — Prediction, and what a regression is actually doing

BIO 202, Spring 2026 — draft v3. Each stage hands you one more piece of information and asks the same question again. Each stage unlocks after you predict and explore the previous one.

What Lesson 0 is actually asking

One question, five times: what is your best guess for y, and how wrong will you typically be? Each stage gives you one more piece of information about the frog and asks whether the guess gets any better.

The formulas come along for the ride. The skill is translating a formula back into the prediction question out loud.

A — With no other information, the best guess is the mean

Someone hands you a new frog. You have no other information. The best single-number guess is the population mean μ, and σ is how wrong that guess will typically be.

For this stage: say why μ is the best guess under no-information, why σ is the typical size of the error, and why the sample ȳ wiggles around the true μ run to run.


Scenario

Gray tree frogs, one pond, no other information. Someone hands you a frog and asks for its body mass. The best number to say is the population mean μ: among all single numbers you could pick, it minimizes your average squared error.

yi ~ Normal(μ, σ)

You do not know μ. You have a sample of n frogs, and from them you compute ȳ. ȳ is your estimate of μ. A different n frogs would give a slightly different ȳ. The true μ is a property of the world; the estimate wiggles run to run.

σ is the other half of the story: how wrong this guess will typically be. A typical frog sits about σ away from μ. Small σ means the mean is a sharp prediction. Large σ means the mean is the best you can do, and the best is not very good.

Two parameters, one for the guess and one for the size of the error. Stage B adds one new piece of information about the frog and asks whether the guess gets any better.

Note that Stage D writes y ~ Normal(β·x + α, σ). Set β = 0 and you are back here. Regression does not replace the mean; it extends it.
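A quick way to feel why the mean is the best single guess: score every constant guess by its average squared error and see where the minimum lands. A minimal base-R sketch; the sample values here are illustrative, not the lesson's own simulation.

```r
# Quick check: among all constant guesses, the sample mean minimizes the
# average squared error.
set.seed(1)
y <- rnorm(40, mean = 8.0, sd = 1.2)       # a Stage-A-style sample
mse <- function(guess) mean((y - guess)^2) # score one constant guess
guesses <- seq(6, 10, by = 0.01)           # a grid of candidate guesses
errors  <- sapply(guesses, mse)
guesses[which.min(errors)]                 # lands on top of mean(y)
mean(y)
```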

Simulated sample (histogram with fixed axes)

[live readout: true μ | sample ȳ | sample sd]

Prediction (required before sliders unlock)

  Q1. You raise σ from 0.5 to 2.5, keeping μ and n fixed. The histogram…
  Q2. You raise n from 20 to 500, keeping μ and σ fixed. The sample mean ȳ…

Explore the sliders to unlock Stage B (5 moves required).

Transfer question — new scenario

A colleague samples 15 mice from a single population. She reports a sample mean body mass of 24.3 g with SD 1.8 g. If she re-samples 15 different mice from the same population, the new sample mean will most likely be…

Controls

n = 40 | μ = 8.0 | σ = 1.2 | seed = 42

R code (base R — mirrors this simulation)

# Stage A: one variable, no predictor.
set.seed(42)
n     <- 40
mu    <- 8.0   # true population mean
sigma <- 1.2   # true population SD

# simulate the sample -- every frog is a draw from the SAME distribution
y <- rnorm(n, mean = mu, sd = sigma)

# estimates from the sample (these wiggle run-to-run)
mean(y); sd(y)

hist(y, breaks = 20, col = "gray80", border = "white",
     xlim = c(0, 20), xlab = "body mass (g)", main = "")
abline(v = mu,      col = "#b23a48", lwd = 2, lty = 2)
abline(v = mean(y), col = "#2f6b8f", lwd = 2)

B — Add a group label, and the guess splits in two

Now you know which pond the frog came from. That is new information. The best guess is no longer a single number; it is the group mean. The "slope" on a 0/1 predictor is the difference between the two group means.

For this stage: recognize y = mx + b from algebra inside the regression equation. Read δ as the difference between two group means, not as a separate mysterious object.

Complete Stage A (submit prediction and move the sliders a few times) to unlock this section.

Scenario

Now the frog comes with a tag: lowland (g = 0) or upland (g = 1). One new piece of information. The best guess is no longer μ for everyone. It is α for lowland frogs, and α + δ for upland frogs.

The object is already familiar from algebra. y = mx + b is the same line, renamed and split into two pieces so each piece can do its own job:

μi = α + δ · gi

yi ~ Normal(μi, σ)

α is the intercept — the guess when g = 0. δ is the slope — how much the guess changes when g goes up by one. Since g takes only two values, the "line" has only two points on it: α, and α + δ. δ is not just "the slope." It is the difference between the two group means.

The second line is Stage A layered on: each yi is a noisy draw around its group mean, with spread σ. Same σ for both groups.

A t-test is this regression. lm(y ~ g) returns δ as the slope coefficient, and the standard error on that coefficient is the standard error of the difference between two sample means. Two names, one calculation.

Two groups, side by side (fixed axes)

[live readout: ȳ0 | ȳ1 | δ̂ = ȳ1 − ȳ0 | SE(δ̂)]
[generative truth: α, α + δ, σ]

Prediction (required before sliders unlock)

  Q1. You set δ = 0 (the two ponds actually share one mean). The estimated δ̂ from your sample will be…
  Q2. You raise n (per group) from 10 to 200, keeping α, δ, σ fixed. SE(δ̂) will…

Explore the sliders to unlock Stage C (5 moves required).

Transfer question — new scenario

A friend fits lm(blood_pressure ~ treatment) with treatment coded 0 (placebo) and 1 (new drug). R prints the slope on treatment as −4.2 mmHg. The best reading is…

Controls

n (per group) = 40 | α = 10.0 | δ = 1.00 | σ = 1.0 | seed = 42

R code (base R) — regression and t-test are the same calculation

# Stage B: binary predictor. y ~ N(alpha + delta*g, sigma), g in {0, 1}.
set.seed(42)
n     <- 40    # per group
alpha <- 10.0  # mean of group 0 (lowland)
delta <- 1.00  # mean of group 1 minus mean of group 0
sigma <- 1.0

g <- rep(c(0, 1), each = n)                  # 0/1 predictor
y <- rnorm(2*n, mean = alpha + delta*g, sd = sigma)

# REGRESSION form. Coefficient on g is delta_hat.
fit <- lm(y ~ g)
coef(fit)
summary(fit)$coefficients[2, "Std. Error"]

# T-TEST form. Same number, different packaging.
tt <- t.test(y ~ g, var.equal = TRUE)
tt$estimate[2] - tt$estimate[1]   # = coef(fit)["g"]

plot(jitter(g, amount = 0.08), y, pch = 16, col = "gray40",
     xaxt = "n", xlab = "group", ylab = "body mass (g)")
axis(1, at = c(0, 1), labels = c("lowland", "upland"))
abline(h = mean(y[g==0]), col = "#b23a48", lwd = 2)
abline(h = mean(y[g==1]), col = "#2f6b8f", lwd = 2)

C — Two measurements that travel together

A second number on each frog. Do the two travel together? The correlation coefficient r is one number that says how tightly.

For this stage: read r at a glance. Sign is the direction of the tilt; magnitude is how tightly the two numbers line up. Note that r̂ wiggles run to run even when the true r is zero.

Complete Stage B (submit prediction and move the sliders a few times) to unlock this section.

Scenario

Same frogs. Now you measure a second number on each: snout–vent length. You want to know whether two measurements on the same frog travel together, whether bigger frogs tend to be both longer and heavier.

The correlation coefficient r is one number that answers the question. It lives between −1 and +1:

  • r ≈ +1 — the two numbers track tightly, moving in the same direction.
  • r ≈ 0 — no tilt. Knowing one tells you essentially nothing about the other.
  • r ≈ −1 — the two numbers track tightly, moving in opposite directions.

Sign is the direction of the tilt. Magnitude is how tightly the two numbers line up. That is all r is.

Note that r is symmetric: swap x and y and r does not change. And r is a description, not a cause. A nonzero r says two measurements move together in this sample. It does not say one caused the other.
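Both properties are one-liners to check in base R. A quick sketch with made-up data: swap the arguments, or rescale one variable, and r does not move.

```r
# Quick check: r is symmetric in its arguments and unitless --
# positively rescaling or shifting a variable leaves it unchanged.
set.seed(1)
x <- rnorm(30)
y <- 0.6 * x + rnorm(30)
cor(x, y)             # some value between -1 and +1
cor(y, x)             # identical: r is symmetric
cor(10 * x + 3, y)    # identical: r ignores units and shifts
```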

Watch for: even when the real r is zero, the sample r̂ is almost never exactly zero. Press "new seed" a few times. The wiggle you see is the same thing ȳ was doing in Stage A — an estimate wobbling around a true value that itself does not move.

Bivariate scatter (fixed axes)

[live readout: true r | sample r̂ | |r̂ − r|]

Prediction (required before sliders unlock)

  Q1. You set the true r to 0 and take a sample of n = 30. The sample r̂ will be…
  Q2. r = +0.9 and r = −0.9 — how do the two scatters compare?

Explore the sliders to unlock Stage D (5 moves required).

Transfer question — new scenario

You plot tree height against trunk diameter for 30 trees and compute r̂ = 0.06. The best reading is…

Controls

n = 60 | r = 0.60 | sx = 1.5 | sy = 1.0 | seed = 42

R code (base R)

# Stage C: correlation -- two variables that co-vary.
set.seed(42)
n  <- 60
r  <- 0.60
sx <- 1.5
sy <- 1.0

# build a bivariate normal sample from two independent draws
z1 <- rnorm(n); z2 <- rnorm(n)
x  <- sx * z1
y  <- sy * (r * z1 + sqrt(1 - r^2) * z2)

# sample correlation -- this wiggles sample to sample
cor(x, y)

plot(x, y, pch = 16, col = "gray40",
     xlim = c(-8, 8), ylim = c(-8, 8),
     xlab = "snout-vent length", ylab = "body mass")

D — Turn the second measurement into a prediction rule

Use x to predict y. The rule is a line. R² answers how much that line improved your guess over the Stage A mean.

For this stage: name each piece of the regression in the prediction frame. β is how the guess changes per unit of x; α is the guess at x = 0; σ is the typical error around the line; R² is how much knowing x improved the guess over the Stage A mean. Watch R² and σ move together.

Complete Stage C (submit prediction and move the sliders a few times) to unlock this section.

Scenario

Same frogs. Use snout–vent length x as the predictor for body mass y. The prediction rule is a line:

yi ~ Normal(β·xi + α, σ)

For any frog with length x, the best guess is β · x + α. α is the intercept (the guess at x = 0). β is the slope (how much the guess changes when x goes up by one unit). σ is the spread around the line, the typical size of the error — same role as in Stage A.

The trick is to see this as Stage A with one new piece of information. Set β = 0 and the rule collapses to a single number (the fitted α̂ is just ȳ): you are predicting the same y for every frog, no matter its length. Set β ≠ 0 and the guess slides along a line as x changes. The information in x has entered the guess.

We fit (α̂, β̂) by ordinary least squares ("draw the one line that sits closest to all the points"). That gives one line. Turn on inferred lines and you see many others that the data find nearly as plausible — a cloud of rules, not a single answer. The width of the cloud is the uncertainty in β̂.

R² and σ are two ways of saying the same thing. R² is the fraction of y's variation that the line accounts for. Drag σ up and R² falls; drag σ down and R² climbs toward 1. Small σ means the points hug the line and knowing x helped a lot. Large σ means the points scatter away from the line and knowing x barely helped. One story, two axes.
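To watch that trade directly, hold the true line and the x values fixed and grow σ. A short sketch mirroring the Stage D setup; the σ grid is arbitrary.

```r
# Sketch: same line, same x values, three residual spreads.
# R^2 falls as sigma grows.
set.seed(42)
n <- 40
x <- runif(n, min = 0, max = 10)
r2 <- sapply(c(0.5, 1.2, 3.0), function(sigma) {
  y <- rnorm(n, mean = 4.0 + 0.80 * x, sd = sigma)
  summary(lm(y ~ x))$r.squared
})
round(r2, 2)   # decreasing: larger sigma, smaller R^2
```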

What SE(β̂) is. It is the standard deviation of the cloud of inferred slopes. As a general rule, if β̂ is more than about 2 SE away from zero, zero sits outside the plausible cloud, and a p-value would call the slope "significant." The cloud itself is the real object. A p-value just converts "how far is zero from the cloud?" into a single number.
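The "about 2 SE" rule can be read straight off a fitted model: the t value R prints in summary(lm(...)) is just β̂ divided by SE(β̂). A minimal sketch with simulated data; the parameter values are illustrative.

```r
# Sketch: the rough 2-SE check, read off the summary table.
set.seed(42)
x <- runif(50, 0, 10)
y <- rnorm(50, mean = 4 + 0.8 * x, sd = 1.2)
co <- summary(lm(y ~ x))$coefficients
beta_hat <- co["x", "Estimate"]
se       <- co["x", "Std. Error"]
beta_hat / se              # identical to co["x", "t value"]
abs(beta_hat) > 2 * se     # the rough "clearly nonzero" check
```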

Scatter + fit + inferred-lines cloud (fixed axes)

[live readout: fitted ŷ = α̂ + β̂·x | SE(β̂) | R²]
[generative truth: α, β, σ]

Prediction (required before sliders unlock)

  Q1. You halve n (e.g. 200 → 100), keeping β and σ fixed. The cloud of inferred lines…
  Q2. You set the true β = 0 and run one simulation with n = 50, σ = 1.5. The fitted slope β̂ will be…

Explore the sliders to unlock Stage E (5 moves required).

Transfer question — new scenario

A regression returns ŷ = 1.2 + 0.85·x with R² = 0.07 and SE(β̂) ≈ 0.41. Which summary is best?

Controls

n = 40 | α = 4.0 | β = 0.80 | σ = 1.2 | inferred lines = 60 | seed = 42

R code (base R)

# Stage D: regression with many lines nearly as good as the best one.
set.seed(42)
n     <- 40
alpha <- 4.0    # intercept
beta  <- 0.80   # slope
sigma <- 1.2    # residual SD

x <- runif(n, min = 0, max = 10)
y <- rnorm(n, mean = alpha + beta * x, sd = sigma)

# OLS fit -- one point estimate
fit <- lm(y ~ x)
coef(fit)

# inferred lines -- resample the data with replacement and refit many times.
# Each resample gives one line that is compatible with the data.
n_lines <- 60
inferred <- replicate(n_lines, {
  i <- sample(n, n, replace = TRUE)
  coef(lm(y[i] ~ x[i]))
})

plot(x, y, pch = 16, col = "gray40",
     xlim = c(0, 10), ylim = c(-5, 25),
     xlab = "x", ylab = "y")
apply(inferred, 2, function(ab)
  abline(ab[1], ab[2], col = rgb(0.7, 0.23, 0.28, 0.05)))
abline(fit, col = "#b23a48", lwd = 2)

E — Two populations, two prediction rules. Can we tell them apart?

Each pond has its own line. "Are the slopes clearly different?" becomes a visual question: do the two clouds of plausible slopes overlap?

For this stage: translate "is the difference statistically significant?" into "do the two clouds of plausible slopes overlap?" Same question, no jargon. Say what a permutation p-value actually answers.

Complete Stage D (submit prediction and move the sliders a few times) to unlock this section.

Scenario

Two populations of frogs now: a lowland pond (A) and an upland pond (B). Each has its own prediction rule, its own line:

yi,g ~ Normal(βg·xi,g + αg, σ)

The question is whether these are the same rule or two different rules. Concretely: is βA different from βB?

Draw the cloud of plausible slopes for each pond, separately. If the clouds do not overlap, the data rule out any shared slope — the rules are clearly different. If the clouds do overlap, there is a range of slopes compatible with both ponds, including "both the same." That is the question a significance test answers, with the jargon stripped out.

A second view, more formal. Shuffle the pond labels many times and each time recompute the difference in fitted slopes. The shuffled distribution answers: if the two ponds truly shared one slope, how big a split would we see just from how frogs got assigned? If the observed split sits out in the tail of that distribution, the two slopes are clearly different.

What this p-value says. It is the fraction of shuffled worlds in which the split was at least as extreme as ours. Note that it does not say "there is a 5% chance the ponds share a slope." It says: if they did share one, 5% of random shuffles would look at least this different from each other.

Two scatters + two clouds of inferred lines (fixed axes)

If the groups shared one slope: what splits would we see?

[live readout: β̂A | β̂B | observed Δβ̂ | fraction at least as extreme under the pooled null | clouds overlap?]

Prediction (required before sliders unlock)

  Q1. You set βA = βB (truly equal slopes). The two clouds of inferred lines should…
  Q2. With βA = 1, βB = 0.7, σ = 1.5, you double the n per group. The two clouds will…

Explore the sliders to finish Lesson 0 (5 moves required).

Transfer question — new scenario

A study reports 95% slope intervals for two groups: group 1: [0.45, 0.92] and group 2: [0.81, 1.34]. The best reading is…

Controls

n (per group) = 50 | βA = 1.00 | βB = 0.70 | σ = 1.2 | bootstrap lines K = 200 | permutations P = 500 | seed = 42

R code (base R)

# Stage E: two slopes. Are they clearly different?
set.seed(42)
n  <- 50     # per group
bA <- 1.00
bB <- 0.70
s  <- 1.2

xA <- runif(n, 0, 10); yA <- rnorm(n, bA*xA + 4, s)
xB <- runif(n, 0, 10); yB <- rnorm(n, bB*xB + 4, s)

# Inferred lines per group (bootstrap resampling within each group).
K <- 200
inf_A <- replicate(K, {
  i <- sample(n, n, replace = TRUE)
  coef(lm(yA[i] ~ xA[i]))
})
inf_B <- replicate(K, {
  i <- sample(n, n, replace = TRUE)
  coef(lm(yB[i] ~ xB[i]))
})

# Overlap between the two slope clouds -- the "clearly different" check.
qA <- quantile(inf_A[2, ], c(0.025, 0.975))
qB <- quantile(inf_B[2, ], c(0.025, 0.975))
overlap <- max(0, min(qA[2], qB[2]) - max(qA[1], qB[1]))

# Pooled null: shuffle group labels and recompute the split.
P <- 500
x <- c(xA, xB); y <- c(yA, yB)
null_d <- replicate(P, {
  g <- sample(c(rep(TRUE, n), rep(FALSE, n)))
  coef(lm(y[g]  ~ x[g] ))[2] -
    coef(lm(y[!g] ~ x[!g]))[2]
})
obs <- coef(lm(yA ~ xA))[2] - coef(lm(yB ~ xB))[2]
mean(abs(null_d) >= abs(obs))   # fraction at least as extreme