Workshop 8: Linear models with interactions

BI3010 Statistics for Biologists

Author

Roos & Pinard

Published

16 May 2026

Glossary / 词汇表
Model terms
Interaction 交互作用
When the effect of one predictor on the response depends on the value of a second predictor. Two variables interact if knowing the value of one changes how you interpret the slope of the other. An interaction violates the additivity assumption and requires an interaction term in the model.
当一个预测变量对响应变量的影响取决于另一个预测变量的值时,称两者存在交互作用。若一个变量的值改变了另一个变量斜率的解读,则两者存在交互。交互作用违反了可加性假设,需在模型中加入交互项。
Main effect 主效应
The estimated effect of a single predictor on the response when all other predictors are held at zero. In an interaction model, the main effect slopes (β₁ and β₂) describe what happens when the other predictor is exactly zero, not the average effect across all values. Interpreting main effects in isolation from their interaction is usually misleading.
在其他预测变量等于零时,单个预测变量对响应变量的估计效应。在含交互作用的模型中,主效应斜率(β₁和β₂)描述的是另一个预测变量恰好为零时的情况,而非所有取值下的平均效应。单独解读主效应通常会产生误导。
Interaction term (β₃) 交互项(β₃)
The additional parameter in an interaction model, multiplied by both predictors (β₃ × x₁ × x₂). It captures how much the slope of one predictor changes for every one-unit increase in the other. A positive β₃ means the two predictors reinforce each other; a negative β₃ means they dampen each other's effects.
交互作用模型中的额外参数,乘以两个预测变量(β₃×x₁×x₂)。它描述了一个预测变量的斜率随另一个预测变量每增加一个单位而变化的量。正的β₃表示两个预测变量相互增强;负的β₃表示两者相互削弱各自的效应。
Assumptions
Additivity 可加性
The assumption that each predictor's effect on the response is independent of all other predictors. When additivity holds, the combined effect of two predictors equals the sum of their individual effects. An interaction violates this: the combined effect is larger or smaller than the sum of the parts.
每个预测变量对响应变量的效应与其他所有预测变量无关的假设。当可加性成立时,两个预测变量的联合效应等于各自效应之和。交互作用违反了这一假设:联合效应大于或小于各部分之和。
Research practice
HARKing 事后假设
Hypothesising After Results are Known. Presenting post-hoc explanations of unexpected findings as if they were the original, pre-registered hypothesis. HARKing makes exploratory findings look confirmatory, inflates apparent evidence for a hypothesis, and is widely considered a form of scientific misconduct.
即"结果已知后再提出假设"。将意外发现的事后解释伪装成原本就有的、预先注册的假设。这种做法使探索性发现看起来像是验证性发现,夸大了对假设的表面证据,被普遍认为是一种科学不端行为。
Visualisation
Prediction grid 预测网格
A dataset containing all combinations of predictor values of interest, typically created with expand.grid(). Used as the newdata argument in predict() to generate model predictions across a range of conditions, for example when plotting predicted values across all combinations of two continuous predictors.
包含所有感兴趣的预测变量值组合的数据集,通常通过expand.grid()创建。作为predict()newdata参数,用于在一系列条件下生成模型预测值,例如绘制两个连续预测变量交互作用的曲面。
Heatmap 热力图
A two-dimensional plot that uses fill colour to represent a third continuous variable across combinations of two predictors. Created in ggplot2 with geom_tile(). Useful for visualising predicted values across all combinations of two continuous predictors. Conveying uncertainty on heatmaps is difficult; one workaround is to produce two additional plots showing the upper and lower 95% confidence limits, though this is imperfect.
一种使用填充颜色表示第三个连续变量在两个预测变量组合中取值的二维图形。在ggplot2中通过geom_tile()创建,适用于可视化连续×连续交互作用的预测响应面。在热力图上表达不确定性较为困难;分别绘制置信区间上下界的热力图是常见但并不完美的替代方案。

Learning Objectives

Learning objectives
  1. Recognise when the additivity assumption is likely to be violated and explain what an interaction means biologically.
  2. Fit a linear model with an interaction term and correctly interpret the main effect and interaction coefficients.
  3. Visualise a continuous-by-continuous interaction using a prediction grid and a heatmap.
  4. Diagnose an interaction model and evaluate whether including the interaction improved model fit.
  5. Apply the full workflow independently to a new dataset with a continuous-by-categorical interaction.

Workshop structure

The Analytical Workflow
  1. Formulate a research question.
  2. Perform exploratory data analysis (EDA).
  3. Identify any hidden assumptions.
  4. Fit an appropriate model [Focus of today].
  5. Diagnose the model and check assumptions.
  6. Summarise the results.
  7. Interpret and draw conclusions.

Today’s focus is on interactions between predictors. We’ll first fit a model with only main effects, find a problem in the diagnostics, and then fix it by including an interaction term. We then work through visualising interactions and finish with an independent extension exercise on a second dataset.

1. Formulate a research question

For this workshop we use a dataset from a study set in the coral reef ecosystems of the Coral Triangle, a biodiversity hotspot in the Indo-Pacific. Download it below and save it to your BI3010/data/ folder.

Download reef.txt

The dataset contains three variables:

  • shark_density: Tiger shark (Galeocerdo cuvier) density, estimated from underwater visual surveys along transects. Units are sharks per km².
  • reef_complexity: A standardised structural complexity index combining coral cover, surface rugosity, and habitat diversity. Higher values indicate more structurally complex reefs.
  • biomass_change: Net change in fish biomass (kg/m²) over a six-month period, measured using underwater cameras. Positive values indicate population growth; negative values indicate decline.

Our research question is: how does fish biomass change as shark density and reef complexity change? This is a predictive question: we want to know which combinations of shark density and reef complexity are associated with higher or lower biomass change, not to make a causal claim about either variable.

Load the data:

reef <- read.table("Data/reef.txt", header = TRUE)

2. Perform exploratory data analysis

Spend a few minutes exploring the dataset. Check for any obvious errors or unusual values. Useful functions include str(), summary(), geom_histogram(), and scatter plots.

str(reef)
summary(reef)
ggplot(reef, aes(x = shark_density)) + geom_histogram() + theme_minimal()
ggplot(reef, aes(x = reef_complexity)) + geom_histogram() + theme_minimal()
ggplot(reef, aes(x = biomass_change)) + geom_histogram() + theme_minimal()
ggplot(reef, aes(x = shark_density, y = biomass_change)) +
  geom_point() +
  theme_minimal()

ggplot(reef, aes(x = reef_complexity, y = biomass_change)) +
  geom_point() +
  theme_minimal()

3. Identify hidden assumptions

Once you have explored the data, think about the hidden assumptions the model will rely on: validity, representativeness, and additivity. The additivity assumption is especially relevant here. Ask yourself: is it plausible that the effect of shark density on biomass is the same regardless of reef complexity? Or might the two predictors act together in some way?

4. Fit an appropriate model

We start with a model that includes only main effects, where the effects of shark density and reef complexity are assumed to be independent of each other. The statistical form of this model is:

\[biomass_i \sim \text{Normal}(\mu_i, \sigma)\]

\[\mu_i = \beta_0 + \beta_1 \times shark_i + \beta_2 \times reef_i\]

where \(\beta_0\) is the predicted biomass change when both predictors are zero, \(\beta_1\) is the change in predicted biomass for each additional shark per km², and \(\beta_2\) is the change for each additional unit of reef complexity.

bio_mod <- lm(biomass_change ~ shark_density + reef_complexity, data = reef)

5. Diagnose the model and check assumptions

Before interpreting the model, check the residual plots:

par(mfrow = c(2, 2))
plot(bio_mod)
par(mfrow = c(1, 1))

Question 1. Do the residual plots give you any cause for concern?

The residuals vs fitted plot shows a curved pattern rather than a random scatter around zero. This is a signal that the additivity assumption is violated: the model is missing a combined effect of the two predictors. The normal Q-Q plot and scale-location plot look reasonable; the problem is the non-random pattern in the residuals vs fitted plot.

6. Summarise the results

Despite the concern about additivity, let’s look at the parameter estimates to understand what the main-effects model is telling us before we fix it:

summary(bio_mod)

Question 2. From the summary() output, how much does predicted biomass change for each additional unit of shark density?

The shark_density coefficient is approximately 3.22. According to this model, predicted biomass change increases by roughly 3.2 kg/m² for each additional shark per km². This result will look suspicious once you interpret it in the next question.

Question 3. How much does predicted biomass change for each additional unit of reef complexity?

The reef_complexity coefficient is approximately 7.12. Predicted biomass change increases by roughly 7.1 kg/m² for each additional unit of reef complexity, holding shark density constant.

Question 4. Using the model coefficients, predict biomass change when shark density is 5 and reef complexity is 0.1. Then predict it again when shark density is 10 and reef complexity is 0.1. What does the difference between those two predictions tell you?

Using the approximate coefficients from the model:

-27.21 + 3.22 * 5  + 7.12 * 0.1  # ≈ -10.4 kg/m²
-27.21 + 3.22 * 10 + 7.12 * 0.1  # ≈  +5.7 kg/m²

Going from 5 to 10 sharks increases predicted biomass by about 16 kg/m² (the difference is simply 5 × β₁ ≈ 5 × 3.22). In a main-effects model this difference is the same regardless of reef complexity, because the additivity assumption forces the shark slope to be constant. You’ll see later why that assumption is wrong here.

Use ggeffects to quickly visualise the predicted relationship between each predictor and the response:

library(ggeffects)
preds <- ggpredict(bio_mod)
plot(preds)

7. Interpret and draw conclusions

Question 5. The main-effects model suggests that higher shark density is associated with higher fish biomass. How would you explain this result? Is there anything problematic about how you arrive at that explanation?

It’s kind of hard to think why adding predators would increase the number of prey. But not impossible. You might try: trophic cascades, a landscape of fear effect reducing prey movement, compensatory breeding by prey, shifts in species composition, and so on.

The point is that you can always come up with some hypothesis to explain your results after the fact. That’s HARKing: Hypothesising After Results are Known, and it’s a serious problem in science. To be very clear, HARKing is a highly questionable research practice (i.e. uncomfortably close to fraud) where researchers try to explain their results after the model tells them the answer, while pretending they were always testing that hypothesis. Any explanation you arrive at after seeing the result should be labelled as exploratory (“we did not predict this, but one possible interpretation is…”), not presented as confirmation of a prior prediction. We’ll see shortly that the result changes completely once we account for the interaction, which makes this exercise even more telling.

Including an interaction

The main-effects model treats the effects of shark density and reef complexity as independent. But think about what is ecologically plausible. In a structurally simple reef with few hiding places, more sharks should reduce fish biomass substantially. In a complex reef with many refuges, the same number of sharks might have much less impact because prey can avoid them. This means the effect of shark density on biomass isn’t constant: it depends on reef complexity. That’s an interaction.

Before working through the maths, test your intuition. Each scenario below describes a relationship between two variables. Decide whether the description implies an interaction or not.

Click a card to select it, then click a zone to place it. Click a placed card to move it back.

In nutrient-poor soils, fertiliser dramatically increases plant growth. In nutrient-rich soils, the same dose has almost no effect.
Larger animals tend to have slower resting heart rates, and this relationship holds across both mammals and reptiles.
Caffeine improves alertness much more when you are sleep-deprived than when you are well-rested.
A gut parasite reduces host survival by around 20% regardless of whether the host lives in a warm or cold climate.
Elevated CO₂ speeds up photosynthesis in well-watered plants but has almost no effect in drought-stressed plants.
Interaction
No interaction

The statistical notation for a model with both main effects and an interaction is:

\[biomass_i \sim \text{Normal}(\mu_i, \sigma)\]

\[\mu_i = \beta_0 + \beta_1 \times shark_i + \beta_2 \times reef_i + (\beta_3 \times shark_i \times reef_i)\]

where \(\beta_3\) is the interaction term: the change in the shark slope for every one-unit increase in reef complexity (and symmetrically, the change in the reef slope per unit of shark density).

In R, there are two equivalent ways to write the same model:

biomass_change ~ shark_density + reef_complexity + shark_density:reef_complexity
biomass_change ~ shark_density * reef_complexity

The * notation is shorthand and preferred for readability.

Question 6. Which of the following correctly completes the interaction model equation?

The interaction term multiplies both predictors together: \(\beta_3 \times shark_i \times reef_i\). The main effects (\(\beta_1 \times shark_i\) and \(\beta_2 \times reef_i\)) remain in the model alongside the interaction; removing them would change the meaning of the other parameters entirely. The brackets around the interaction term aren’t required by R but help readers identify it visually.

Now fit the model with the interaction:

bio_mod <- lm(biomass_change ~ shark_density * reef_complexity, data = reef)

Diagnose the interaction model

Check the residuals again:

par(mfrow = c(2, 2))
plot(bio_mod)
par(mfrow = c(1, 1))

Question 7. Does adding the interaction resolve the concern from Question 1?

Yes, the curved pattern in the residuals vs fitted plot disappears. This isn’t just a mechanical effect of adding a parameter: it’s evidence that the original model was missing the interaction. The interaction term captures the combined effect that the main-effects model could not, and once it’s included, the residuals no longer show a pattern.

Question 8. Does meeting these residual assumptions make the model inherently trustworthy?

Good diagnostics can reveal problems with linearity, equal variance of errors, normality of errors, and influential observations, but they say nothing about whether the data actually measure what we think they do, or whether the sample represents the population we want to generalise to. Those hidden assumptions require domain knowledge and careful study design, not residual plots.

Summarise the interaction model

summary(bio_mod)

Question 9. What does the intercept (\(\beta_0\)) represent in this model?

The intercept is the predicted value when all predictors are zero, and that definition doesn’t change when an interaction is added. In this case, zero sharks and zero reef complexity is an unlikely combination in reality, so the intercept has limited direct biological meaning. Its value will also differ from the main-effects model because the interaction changes how the other parameters are estimated.

Question 10. What does the interaction coefficient (shark_density:reef_complexity, or \(\beta_3\)) represent?

The interaction term is symmetrical: you can read it in either direction. If \(\beta_3\) is positive, the shark density slope becomes more positive (or less negative) as reef complexity increases. You can also say that the reef complexity slope becomes more positive as shark density increases. To understand the full effect of either predictor requires plugging both values into the complete equation, not just looking at a single coefficient.

Question 11. Does the interaction model still suggest that higher shark density increases biomass?

No — we get something far more intuitive, requiring fewer mental gymnastics to explain. Predators eat prey. The more predators, the less prey. Just that simple. The positive coefficient from the main-effects model was an artefact of a model missing the interaction term. Once the interaction is included, the model correctly captures that the shark effect is negative but varies with reef complexity. This is also why we should always diagnose before interpreting: a model with violated assumptions can produce coefficients that aren’t just imprecise but point in the wrong direction entirely.

Question 12. The interaction coefficient is positive. What does this tell you about the system?

A positive \(\beta_3\) means that as reef complexity increases, the (negative) shark slope gets closer to zero. In a structurally simple reef, many sharks reduce biomass substantially; in a complex reef with many hiding places, the same number of sharks has a smaller impact on biomass. This is a biologically coherent interpretation: reef complexity acts as a refuge for prey, buffering the effect of predators.

Note that interpreting this as a causal claim requires additional causal reasoning. We framed this as a predictive question, so the interaction describes an association, not necessarily a mechanism.

Visualising interactions

A coefficient table tells you the direction and magnitude of an interaction but not what it looks like across the range of your data. For two continuous predictors, a heatmap is one of the most effective ways to show predicted values across all combinations of the two predictors.

Building a prediction grid

To produce the heatmap, we need predicted values at every combination of shark density and reef complexity. The expand.grid() function generates a data frame containing exactly those combinations:

shark_density_seq <- seq(from = min(reef$shark_density),
                         to   = max(reef$shark_density),
                         length.out = 20)

reef_complexity_seq <- seq(from = min(reef$reef_complexity),
                           to   = max(reef$reef_complexity),
                           length.out = 20)

fake <- expand.grid(
  shark_density   = shark_density_seq,
  reef_complexity = reef_complexity_seq
)

expand.grid() produces one row for every combination of the two sequences: 20 × 20 = 400 rows. The variable names must match those in the original dataset exactly, because predict() will look for them by name.

Now add the model predictions:

fake$fit <- predict(bio_mod, newdata = fake)

Plotting the heatmap

ggplot(fake) +
  geom_tile(aes(x = shark_density, y = reef_complexity, fill = fit)) +
  labs(x = "Shark density",
       y = "Reef complexity",
       fill = "Predicted biomass change") +
  theme_minimal()

The heatmap makes the interaction visible: the highest predicted biomass occurs at high reef complexity combined with high shark density. Without the interaction in mind, high shark density alone would predict low biomass. The interaction flips that relationship at high reef complexity.

The default blue palette can be hard to read. My preferred option is the viridis "magma" palette, which gives much better contrast across the full range:

ggplot(fake) +
  geom_tile(aes(x = shark_density, y = reef_complexity, fill = fit)) +
  scale_fill_viridis_c(option = "magma") +
  labs(x = "Shark density",
       y = "Reef complexity",
       fill = "Predicted biomass change") +
  theme_minimal()

One limitation of heatmaps is that it’s hard to communicate uncertainty on them. A common workaround is to produce two additional plots: one showing predictions at the upper 95% confidence limit and one at the lower. But this isn’t ideal either — it’s really hard for our eyes to match up the same region across multiple heatmaps, so you never get a precise sense of the uncertainty at any specific location. If you have any bright ideas for how to effectively convey uncertainty with heatmaps (or even just maps in general), you might have a great paper on your hands. We’re not aware of anyone who has found a particularly compelling solution to this yet.

A new challenger: Asian elephants

If time allows, work through the following dataset on human-wildlife conflict in Asian elephant populations. Download it below and save it to your BI3010/data/ folder.

Download elephants.txt

Researchers recorded elephant activity at sites varying in vegetation cover and proximity to human settlements. The research question is predictive: which combinations of vegetation cover and proximity to towns are associated with higher or lower elephant activity?

The dataset contains:

  • veg_cover: Vegetation cover at each site (continuous, 0 to 100%).
  • prox_town: Proximity to the nearest human settlement (categorical: "Far", "Near", "Very Near").
  • activity: An activity index measuring deviation from a baseline level observed in elephants in an undisturbed, dense-vegetation area. Negative values indicate reduced activity relative to that baseline; positive values indicate increased activity.
eleph <- read.table("Data/elephants.txt", header = TRUE)

Go through the full seven-step workflow. A few things to keep in mind:

  1. The interaction here is between a continuous variable (veg_cover) and a categorical variable (prox_town), not two continuous variables. R handles this automatically when you use * in the formula, but the interpretation changes: instead of a single \(\beta_3\), you get a separate interaction coefficient for each category of prox_town other than the baseline category.

  2. For a continuous × categorical interaction, a heatmap isn’t appropriate. Use ggpredict() from ggeffects with both terms to produce separate regression lines per proximity level.

  3. Think carefully about hidden assumptions. The activity variable is defined relative to a “pristine” baseline. What assumptions does that involve, and are they likely to hold?

Question 13. After completing your elephant analysis, summarise your main finding. Does the effect of vegetation cover on elephant activity differ by proximity to towns, and if so, how?

When elephant herds are far from towns, activity increases with vegetation cover: denser vegetation is associated with more normal activity, which makes biological sense as undisturbed habitat. When herds are near or very near to towns, the relationship reverses: higher vegetation cover is associated with reduced activity. One interpretation is that elephants near human settlements use dense vegetation to hide and reduce movement rather than to forage, effectively avoiding contact with people. The interaction means you can’t summarise the vegetation effect with a single slope; its direction depends on proximity to towns.

The hidden assumptions worth flagging: the “baseline” activity level is measured in a separate group of elephants in pristine habitat, so the activity variable assumes those elephants are comparable to the study animals. The validity of the activity index as a measure of genuine ecological behaviour (rather than observer bias or measurement artefact) is also worth questioning.


Assumption table

Assumption Description Example of violation
1. Validity The data can answer the research question. Using an activity index defined relative to an arbitrary baseline to measure true ecological behaviour.
2. Representativeness The sample reflects the population of interest. Reef sites selected for accessibility by divers may not represent remote, undisturbed reefs.
3. Additivity Predictor effects are independent and additive. Shark density and reef complexity interact: the shark effect depends on complexity.
4. Linearity The relationship between predictors and response is linear on the scale modelled. Biomass change increasing indefinitely with reef complexity rather than levelling off.
5. Independence of errors Observations are independent. Multiple transects sampled at the same reef site on the same day.
6. Equal variance of errors Residuals have constant variance across fitted values. Biomass variability increasing in high-complexity reefs.
7. Normality of errors Residuals are approximately normally distributed. A few extreme biomass collapses pulling the residual distribution heavily left.