Lesson 19 — Removing the family resemblance before comparing species

BIO 202, Spring 2026, draft v1. PGLS. Species aren't independent observations — they share ancestors. Subtract the shared part before you ask what's left.

What you'll do

Simulate trait evolution on a tree. Compare naive OLS to phylogenetic independent contrasts. Fit Pagel's λ. End at AVONET birds.

I survey species of earthworms around the world. For each species, I measure body size and soil pH. I get a clean negative regression. But within each clade — these three sister species, those three sister species — the relationship is positive. You can make a negative overall trend out of a bunch of positive smaller trends. The fact of stratification is inherent in all life on Earth, because we're all descended from a set of nested common ancestors. When comparing living things, relatedness will always matter.— 202_lec26_02

A — Naive OLS across species, ignoring phylogeny

Simulate two traits evolving by Brownian motion on a tree. Run OLS on the tip values. The slope you get can be misleading — because the species aren't independent.

Locked — confirm your name above to begin.

Scenario

Generate a tree of N species. Simulate two traits x and y, each by Brownian motion. The traits evolve independently — the "true" slope is 0. But the naive OLS regression of y on x will often show a non-zero slope, because species in the same clade have correlated trait values.

Naive scatter of x vs y across species

N species: 30  |  true slope: 0  |  naive OLS slope:  |  OLS P-value:

Prediction

  1. Q1. Two traits truly evolve independently on a tree. Across 30 species, what fraction of OLS regressions on the tip values give P < 0.05?
Try at least 5 seeds. 0/5 seeds

Controls

42
30

R code — naive OLS

library(ape)set.seed(42); N <- 30tr <- rcoal(N)   # random coalescent treex <- rTraitCont(tr); y <- rTraitCont(tr)   # independent BMsummary(lm(y ~ x))   # naive slope

B — Phylogenetic independent contrasts

Compute contrasts at each internal node. Contrasts are independent under Brownian motion. Regress contrasts of y on contrasts of x — the slope is unbiased.

Complete Stage A.

Scenario

For each internal node of the tree, compute the contrast: (x_left − x_right) / sqrt(t_left + t_right). Do the same for y. Then regress y-contrasts on x-contrasts through the origin. For independent BM traits, this slope is centered on 0.

Naive vs phylogenetic-contrasts slope

naive OLS slope:  |  PIC slope:  |  PIC P-value:

Prediction

  1. Q1. When traits truly evolve independently on a tree, the phylogenetic-contrasts slope across many simulations will be:
Try at least 5 seeds. 0/5 seeds

Controls

42
0.0

R code — phylogenetic independent contrasts

library(ape)px <- pic(x, tr); py <- pic(y, tr)summary(lm(py ~ 0 + px))   # PIC slope through origin

C — Pagel's λ — the phylogenetic signal

λ = 0 means no phylogenetic signal (species are effectively independent). λ = 1 means full Brownian. Fit λ to your traits.

Complete Stage B.

Scenario

Pagel's λ scales the off-diagonal elements of the phylogenetic covariance matrix. At λ=0, off-diagonals are 0 (no shared ancestry effect). At λ=1, full BM. The maximum-likelihood λ tells you how much your trait is phylogenetically conserved.

Log-likelihood profile over λ

true λ: 1.0  |  best-fit λ:  |  95% CI:

Prediction

  1. Q1. If a trait evolves by pure Brownian motion on a tree, Pagel's λ should fit at:
Try at least 3 true λ values. 0/3 values

Controls

1.0
42

R code — Pagel's λ

library(phytools)phylosig(tr, x, method = "lambda", test = TRUE)

What Pagel's λ is, in causal terms

Two species' traits look correlated. Three causal stories could produce that. Build each one in the DAG below and see what the trait-vs-trait scatter looks like.

  1. Direct effect. Trait 1 → Trait 2. Real biology happens within each lineage.
  2. Shared ancestry. Phylogeny → Trait 1, Phylogeny → Trait 2. Both traits inherited from a common ancestor; no within-lineage link.
  3. Both. All three arrows present.

The DAG widget treats "clade" as a categorical confounder — the same role phylogeny plays in Pagel's λ. Build each story, color by clade, watch the within-clade slopes vs the pooled slope.

Pagel's λ measures how much of the across-species covariance is shared-ancestry covariance. λ = 1 means the entire signal is the "phylogeny → both" story. λ = 0 means the entire signal is the within-lineage story. PGLS conditions on the tree, which is the same move as conditioning on "clade" in the DAG.

D — AVONET birds — hand-wing index vs migration

From data/clean/avonet_birds.csv. Regress migration distance on hand-wing index, with and without phylogenetic correction.

Complete Stage C.

Scenario

Hand-wing index (a wing-shape ratio) is known to correlate with dispersal ability. Naive cross-species correlation may overstate the relationship because closely related species share both. Compare naive OLS vs phylogenetic correction.

Hand-wing index vs flight ability proxy

N species shown:  |  naive slope:  |  corrected slope (PIC-like):

Prediction

  1. Q1. The PIC-corrected slope between hand-wing index and a flight-ability proxy will be:
Toggle between traits at least 2 times. 0/2 toggles

Controls

42

R code — AVONET PGLS

library(ape); library(nlme)birds <- read.csv("data/clean/avonet_birds.csv")# Match to BirdTree, then PGLSm <- gls(migration ~ hand_wing_index, data = birds,         correlation = corBrownian(phy = bird_tree))summary(m)

E — The earthworm sign flip — Simpson's paradox on a tree

The lecture's earthworm vignette, drawn as data. Pooled across all species, body size and soil pH show a clear negative relationship. Within each clade — the sister-species pairs — the relationship is positive. The phylogeny is doing all of the work in the pooled slope.

Complete Stage D.

Scenario

Synthetic earthworm data faithful to the structure in the lecture quote: 30 species in 6 clades of 5 species each. Within each clade, larger worms prefer slightly more alkaline soils (positive slope). Across clades, the bigger-bodied clades happen to occupy more acidic soils (so pooled across clades, the slope flips negative). The categorical variable is clade membership — and once it's drawn into the model, the within-clade slope is what's actually happening biologically.

Earthworm body size vs soil pH — pooled vs within-clade

pooled slope:  |  median within-clade slope:

Prediction

  1. Q1. When you switch from pooled to within-clade view, the within-clade slopes will be:
  2. Q2. Which is the right way to describe what the pooled negative slope was measuring?
Toggle the clade-coloring at least once. 0/1 toggles

Controls

R code — Simpson's paradox on a tree

worms <- read.csv("data/clean/earthworm_clades.csv")pooled  <- lm(soil_pH ~ body_size, data = worms)within  <- lm(soil_pH ~ body_size + clade, data = worms)summary(pooled)$coefficients["body_size", ]summary(within)$coefficients["body_size", ]