reef <- read.table("Data/reef.txt", header = TRUE)Workshop 8: Linear models with interactions
BI3010 Statistics for Biologists
expand.grid(). Used as the newdata argument in predict() to generate model predictions across a range of conditions, for example when plotting predicted values across all combinations of two continuous predictors.expand.grid()创建。作为predict()的newdata参数,用于在一系列条件下生成模型预测值,例如绘制两个连续预测变量交互作用的曲面。geom_tile(). Useful for visualising predicted values across all combinations of two continuous predictors. Conveying uncertainty on heatmaps is difficult; one workaround is to produce two additional plots showing the upper and lower 95% confidence limits, though this is imperfect.geom_tile()创建,适用于可视化连续×连续交互作用的预测响应面。在热力图上表达不确定性较为困难;分别绘制置信区间上下界的热力图是常见但并不完美的替代方案。Learning Objectives
- Recognise when the additivity assumption is likely to be violated and explain what an interaction means biologically.
- Fit a linear model with an interaction term and correctly interpret the main effect and interaction coefficients.
- Visualise a continuous-by-continuous interaction using a prediction grid and a heatmap.
- Diagnose an interaction model and evaluate whether including the interaction improved model fit.
- Apply the full workflow independently to a new dataset with a continuous-by-categorical interaction.
Workshop structure
- Formulate a research question.
- Perform exploratory data analysis (EDA).
- Identify any hidden assumptions.
- Fit an appropriate model [Focus of today].
- Diagnose the model and check assumptions.
- Summarise the results.
- Interpret and draw conclusions.
Today’s focus is on interactions between predictors. We’ll first fit a model with only main effects, find a problem in the diagnostics, and then fix it by including an interaction term. We then work through visualising interactions and finish with an independent extension exercise on a second dataset.
1. Formulate a research question
For this workshop we use a dataset from a study set in the coral reef ecosystems of the Coral Triangle, a biodiversity hotspot in the Indo-Pacific. Download it below and save it to your BI3010/data/ folder.
The dataset contains three variables:
shark_density: Tiger shark (Galeocerdo cuvier) density, estimated from underwater visual surveys along transects. Units are sharks per km².reef_complexity: A standardised structural complexity index combining coral cover, surface rugosity, and habitat diversity. Higher values indicate more structurally complex reefs.biomass_change: Net change in fish biomass (kg/m²) over a six-month period, measured using underwater cameras. Positive values indicate population growth; negative values indicate decline.
Our research question is: how does fish biomass change as shark density and reef complexity change? This is a predictive question: we want to know which combinations of shark density and reef complexity are associated with higher or lower biomass change, not to make a causal claim about either variable.
Load the data:
2. Perform exploratory data analysis
Spend a few minutes exploring the dataset. Check for any obvious errors or unusual values. Useful functions include str(), summary(), geom_histogram(), and scatter plots.
str(reef)
summary(reef)ggplot(reef, aes(x = shark_density)) + geom_histogram() + theme_minimal()
ggplot(reef, aes(x = reef_complexity)) + geom_histogram() + theme_minimal()
ggplot(reef, aes(x = biomass_change)) + geom_histogram() + theme_minimal()ggplot(reef, aes(x = shark_density, y = biomass_change)) +
geom_point() +
theme_minimal()
ggplot(reef, aes(x = reef_complexity, y = biomass_change)) +
geom_point() +
theme_minimal()4. Fit an appropriate model
We start with a model that includes only main effects, where the effects of shark density and reef complexity are assumed to be independent of each other. The statistical form of this model is:
\[biomass_i \sim \text{Normal}(\mu_i, \sigma)\]
\[\mu_i = \beta_0 + \beta_1 \times shark_i + \beta_2 \times reef_i\]
where \(\beta_0\) is the predicted biomass change when both predictors are zero, \(\beta_1\) is the change in predicted biomass for each additional shark per km², and \(\beta_2\) is the change for each additional unit of reef complexity.
bio_mod <- lm(biomass_change ~ shark_density + reef_complexity, data = reef)5. Diagnose the model and check assumptions
Before interpreting the model, check the residual plots:
par(mfrow = c(2, 2))
plot(bio_mod)
par(mfrow = c(1, 1))Question 1. Do the residual plots give you any cause for concern?
The residuals vs fitted plot shows a curved pattern rather than a random scatter around zero. This is a signal that the additivity assumption is violated: the model is missing a combined effect of the two predictors. The normal Q-Q plot and scale-location plot look reasonable; the problem is the non-random pattern in the residuals vs fitted plot.
6. Summarise the results
Despite the concern about additivity, let’s look at the parameter estimates to understand what the main-effects model is telling us before we fix it:
summary(bio_mod)
Question 2. From the summary() output, how much does predicted biomass change for each additional unit of shark density?
The shark_density coefficient is approximately 3.22. According to this model, predicted biomass change increases by roughly 3.2 kg/m² for each additional shark per km². This result will look suspicious once you interpret it in the next question.
Question 3. How much does predicted biomass change for each additional unit of reef complexity?
The reef_complexity coefficient is approximately 7.12. Predicted biomass change increases by roughly 7.1 kg/m² for each additional unit of reef complexity, holding shark density constant.
Question 4. Using the model coefficients, predict biomass change when shark density is 5 and reef complexity is 0.1. Then predict it again when shark density is 10 and reef complexity is 0.1. What does the difference between those two predictions tell you?
Using the approximate coefficients from the model:
-27.21 + 3.22 * 5 + 7.12 * 0.1 # ≈ -10.4 kg/m²
-27.21 + 3.22 * 10 + 7.12 * 0.1 # ≈ +5.7 kg/m²
Going from 5 to 10 sharks increases predicted biomass by about 16 kg/m² (the difference is simply 5 × β₁ ≈ 5 × 3.22). In a main-effects model this difference is the same regardless of reef complexity, because the additivity assumption forces the shark slope to be constant. You’ll see later why that assumption is wrong here.
Use ggeffects to quickly visualise the predicted relationship between each predictor and the response:
library(ggeffects)
preds <- ggpredict(bio_mod)
plot(preds)7. Interpret and draw conclusions
Question 5. The main-effects model suggests that higher shark density is associated with higher fish biomass. How would you explain this result? Is there anything problematic about how you arrive at that explanation?
It’s kind of hard to think why adding predators would increase the number of prey. But not impossible. You might try: trophic cascades, a landscape of fear effect reducing prey movement, compensatory breeding by prey, shifts in species composition, and so on.
The point is that you can always come up with some hypothesis to explain your results after the fact. That’s HARKing: Hypothesising After Results are Known, and it’s a serious problem in science. To be very clear, HARKing is a highly questionable research practice (i.e. uncomfortably close to fraud) where researchers try to explain their results after the model tells them the answer, while pretending they were always testing that hypothesis. Any explanation you arrive at after seeing the result should be labelled as exploratory (“we did not predict this, but one possible interpretation is…”), not presented as confirmation of a prior prediction. We’ll see shortly that the result changes completely once we account for the interaction, which makes this exercise even more telling.
Including an interaction
The main-effects model treats the effects of shark density and reef complexity as independent. But think about what is ecologically plausible. In a structurally simple reef with few hiding places, more sharks should reduce fish biomass substantially. In a complex reef with many refuges, the same number of sharks might have much less impact because prey can avoid them. This means the effect of shark density on biomass isn’t constant: it depends on reef complexity. That’s an interaction.
Before working through the maths, test your intuition. Each scenario below describes a relationship between two variables. Decide whether the description implies an interaction or not.
Click a card to select it, then click a zone to place it. Click a placed card to move it back.
The statistical notation for a model with both main effects and an interaction is:
\[biomass_i \sim \text{Normal}(\mu_i, \sigma)\]
\[\mu_i = \beta_0 + \beta_1 \times shark_i + \beta_2 \times reef_i + (\beta_3 \times shark_i \times reef_i)\]
where \(\beta_3\) is the interaction term: the change in the shark slope for every one-unit increase in reef complexity (and symmetrically, the change in the reef slope per unit of shark density).
In R, there are two equivalent ways to write the same model:
biomass_change ~ shark_density + reef_complexity + shark_density:reef_complexity
biomass_change ~ shark_density * reef_complexityThe * notation is shorthand and preferred for readability.
Question 6. Which of the following correctly completes the interaction model equation?
The interaction term multiplies both predictors together: \(\beta_3 \times shark_i \times reef_i\). The main effects (\(\beta_1 \times shark_i\) and \(\beta_2 \times reef_i\)) remain in the model alongside the interaction; removing them would change the meaning of the other parameters entirely. The brackets around the interaction term aren’t required by R but help readers identify it visually.
Now fit the model with the interaction:
bio_mod <- lm(biomass_change ~ shark_density * reef_complexity, data = reef)Diagnose the interaction model
Check the residuals again:
par(mfrow = c(2, 2))
plot(bio_mod)
par(mfrow = c(1, 1))Question 7. Does adding the interaction resolve the concern from Question 1?
Yes, the curved pattern in the residuals vs fitted plot disappears. This isn’t just a mechanical effect of adding a parameter: it’s evidence that the original model was missing the interaction. The interaction term captures the combined effect that the main-effects model could not, and once it’s included, the residuals no longer show a pattern.
Question 8. Does meeting these residual assumptions make the model inherently trustworthy?
Good diagnostics can reveal problems with linearity, equal variance of errors, normality of errors, and influential observations, but they say nothing about whether the data actually measure what we think they do, or whether the sample represents the population we want to generalise to. Those hidden assumptions require domain knowledge and careful study design, not residual plots.
Summarise the interaction model
summary(bio_mod)Question 9. What does the intercept (\(\beta_0\)) represent in this model?
The intercept is the predicted value when all predictors are zero, and that definition doesn’t change when an interaction is added. In this case, zero sharks and zero reef complexity is an unlikely combination in reality, so the intercept has limited direct biological meaning. Its value will also differ from the main-effects model because the interaction changes how the other parameters are estimated.
Question 10. What does the interaction coefficient (shark_density:reef_complexity, or \(\beta_3\)) represent?
The interaction term is symmetrical: you can read it in either direction. If \(\beta_3\) is positive, the shark density slope becomes more positive (or less negative) as reef complexity increases. You can also say that the reef complexity slope becomes more positive as shark density increases. To understand the full effect of either predictor requires plugging both values into the complete equation, not just looking at a single coefficient.
Question 11. Does the interaction model still suggest that higher shark density increases biomass?
No — we get something far more intuitive, requiring fewer mental gymnastics to explain. Predators eat prey. The more predators, the less prey. Just that simple. The positive coefficient from the main-effects model was an artefact of a model missing the interaction term. Once the interaction is included, the model correctly captures that the shark effect is negative but varies with reef complexity. This is also why we should always diagnose before interpreting: a model with violated assumptions can produce coefficients that aren’t just imprecise but point in the wrong direction entirely.
Question 12. The interaction coefficient is positive. What does this tell you about the system?
A positive \(\beta_3\) means that as reef complexity increases, the (negative) shark slope gets closer to zero. In a structurally simple reef, many sharks reduce biomass substantially; in a complex reef with many hiding places, the same number of sharks has a smaller impact on biomass. This is a biologically coherent interpretation: reef complexity acts as a refuge for prey, buffering the effect of predators.
Note that interpreting this as a causal claim requires additional causal reasoning. We framed this as a predictive question, so the interaction describes an association, not necessarily a mechanism.
Visualising interactions
A coefficient table tells you the direction and magnitude of an interaction but not what it looks like across the range of your data. For two continuous predictors, a heatmap is one of the most effective ways to show predicted values across all combinations of the two predictors.
Building a prediction grid
To produce the heatmap, we need predicted values at every combination of shark density and reef complexity. The expand.grid() function generates a data frame containing exactly those combinations:
shark_density_seq <- seq(from = min(reef$shark_density),
to = max(reef$shark_density),
length.out = 20)
reef_complexity_seq <- seq(from = min(reef$reef_complexity),
to = max(reef$reef_complexity),
length.out = 20)
fake <- expand.grid(
shark_density = shark_density_seq,
reef_complexity = reef_complexity_seq
)expand.grid() produces one row for every combination of the two sequences: 20 × 20 = 400 rows. The variable names must match those in the original dataset exactly, because predict() will look for them by name.
Now add the model predictions:
fake$fit <- predict(bio_mod, newdata = fake)Plotting the heatmap
ggplot(fake) +
geom_tile(aes(x = shark_density, y = reef_complexity, fill = fit)) +
labs(x = "Shark density",
y = "Reef complexity",
fill = "Predicted biomass change") +
theme_minimal()The heatmap makes the interaction visible: the highest predicted biomass occurs at high reef complexity combined with high shark density. Without the interaction in mind, high shark density alone would predict low biomass. The interaction flips that relationship at high reef complexity.
The default blue palette can be hard to read. My preferred option is the viridis "magma" palette, which gives much better contrast across the full range:
ggplot(fake) +
geom_tile(aes(x = shark_density, y = reef_complexity, fill = fit)) +
scale_fill_viridis_c(option = "magma") +
labs(x = "Shark density",
y = "Reef complexity",
fill = "Predicted biomass change") +
theme_minimal()One limitation of heatmaps is that it’s hard to communicate uncertainty on them. A common workaround is to produce two additional plots: one showing predictions at the upper 95% confidence limit and one at the lower. But this isn’t ideal either — it’s really hard for our eyes to match up the same region across multiple heatmaps, so you never get a precise sense of the uncertainty at any specific location. If you have any bright ideas for how to effectively convey uncertainty with heatmaps (or even just maps in general), you might have a great paper on your hands. We’re not aware of anyone who has found a particularly compelling solution to this yet.
A new challenger: Asian elephants
If time allows, work through the following dataset on human-wildlife conflict in Asian elephant populations. Download it below and save it to your BI3010/data/ folder.
Researchers recorded elephant activity at sites varying in vegetation cover and proximity to human settlements. The research question is predictive: which combinations of vegetation cover and proximity to towns are associated with higher or lower elephant activity?
The dataset contains:
veg_cover: Vegetation cover at each site (continuous, 0 to 100%).prox_town: Proximity to the nearest human settlement (categorical:"Far","Near","Very Near").activity: An activity index measuring deviation from a baseline level observed in elephants in an undisturbed, dense-vegetation area. Negative values indicate reduced activity relative to that baseline; positive values indicate increased activity.
eleph <- read.table("Data/elephants.txt", header = TRUE)Go through the full seven-step workflow. A few things to keep in mind:
The interaction here is between a continuous variable (
veg_cover) and a categorical variable (prox_town), not two continuous variables. R handles this automatically when you use*in the formula, but the interpretation changes: instead of a single \(\beta_3\), you get a separate interaction coefficient for each category ofprox_townother than the baseline category.For a continuous × categorical interaction, a heatmap isn’t appropriate. Use
ggpredict()fromggeffectswith both terms to produce separate regression lines per proximity level.Think carefully about hidden assumptions. The
activityvariable is defined relative to a “pristine” baseline. What assumptions does that involve, and are they likely to hold?
Question 13. After completing your elephant analysis, summarise your main finding. Does the effect of vegetation cover on elephant activity differ by proximity to towns, and if so, how?
When elephant herds are far from towns, activity increases with vegetation cover: denser vegetation is associated with more normal activity, which makes biological sense as undisturbed habitat. When herds are near or very near to towns, the relationship reverses: higher vegetation cover is associated with reduced activity. One interpretation is that elephants near human settlements use dense vegetation to hide and reduce movement rather than to forage, effectively avoiding contact with people. The interaction means you can’t summarise the vegetation effect with a single slope; its direction depends on proximity to towns.
The hidden assumptions worth flagging: the “baseline” activity level is measured in a separate group of elephants in pristine habitat, so the activity variable assumes those elephants are comparable to the study animals. The validity of the activity index as a measure of genuine ecological behaviour (rather than observer bias or measurement artefact) is also worth questioning.
Assumption table
| Assumption | Description | Example of violation |
|---|---|---|
| 1. Validity | The data can answer the research question. | Using an activity index defined relative to an arbitrary baseline to measure true ecological behaviour. |
| 2. Representativeness | The sample reflects the population of interest. | Reef sites selected for accessibility by divers may not represent remote, undisturbed reefs. |
| 3. Additivity | Predictor effects are independent and additive. | Shark density and reef complexity interact: the shark effect depends on complexity. |
| 4. Linearity | The relationship between predictors and response is linear on the scale modelled. | Biomass change increasing indefinitely with reef complexity rather than levelling off. |
| 5. Independence of errors | Observations are independent. | Multiple transects sampled at the same reef site on the same day. |
| 6. Equal variance of errors | Residuals have constant variance across fitted values. | Biomass variability increasing in high-complexity reefs. |
| 7. Normality of errors | Residuals are approximately normally distributed. | A few extreme biomass collapses pulling the residual distribution heavily left. |