boto$predicted <- predict(boto_model)Workshop 4: Diagnosing linear models with a continuous predictor
BI3010 Statistics for Biologists
Learning Objectives
- Understand what residuals are and how they are used in model diagnosis.
- Interpret the four standard diagnostic plots produced by
plot(model). - Identify which assumptions each plot is designed to check.
- Recognise when a diagnostic plot signals a genuine problem and when it doesn’t.
Workshop structure
Today we work with the models from Workshop 3 and apply the full diagnostic workflow. The focus is step 5.
- Formulate a research question.
- Perform exploratory data analysis.
- Identify any hidden assumptions.
- Fit an appropriate model.
- Diagnose the model and check assumptions [Focus of today].
- Summarise the results.
- Interpret and provide inferences.
To keep the workload manageable, start by copying your workflow from Workshop 3 into a new script, or continue in the same script. If you didn’t finish Workshop 3, complete it now before continuing here.
Why diagnostics matter
Starting with the boto dolphin model from Workshop 3, we have the equation with estimated parameters:
\[length_i \sim \text{Normal}(\mu_i, \sigma)\]
\[\mu_i = 213.2 + (-5.3) \times drought_i\]
We used this to generate a figure, but how do we know if we can trust the result? How do we know we haven’t been cursed by the Monkey’s Paw? The uncomfortable answer is that we can’t confirm it with absolute certainty. The results might be completely unreliable, either because the data violated a hidden assumption or because the model isn’t well suited to the data. The particularly bad news is that we almost always get some answer from a model, even when the data or model are badly flawed.
This means it falls to us to actively look for signs that the model results shouldn’t be interpreted. This is especially important when a model could influence real-world decisions. We don’t want to conclude a drug is safe only to discover later it has serious side effects.
One more uncomfortable truth: the assumptions we care about most (validity and representativeness) are the hardest to test formally. They require domain knowledge, careful EDA, and clear thinking about the data-generating process. That is why you should spend the majority of your time on EDA and thinking about the context before fitting any model.
The assumptions that can be checked formally after fitting are the residual-based ones: linearity, equal variance, normality, and independence of errors. That is where today’s workshop focuses.
To start, ensure you have the boto model from the last workshop loaded. Add one extra line to store the model’s predictions for the original dataset:
Unlike the fake dataset approach from Workshop 3, here we pass the original data through the model so we can compute residuals: the difference between each observed length and what the model predicted for that dolphin.
Diagnosing the four horsemen
The four standard diagnostic plots are produced in one call:
par(mfrow = c(2, 2)) # arrange as 2x2 grid
plot(boto_model)
par(mfrow = c(1, 1)) # resetplot() works directly on a model object, returning all four diagnostic plots in one call. It uses base R graphics rather than ggplot2 — the output is plainer, but nothing extra needs setting up, making it the standard tool for a quick check. You will see a ggplot2-based alternative later in this workshop.
Alternatively, press Return in the console after plot(boto_model) to step through them one at a time. Let’s go through each plot and what it tells us.
Residuals vs Fitted
The y-axis shows residual errors: the difference between each dolphin’s actual measured length and what the model predicted. A perfect prediction gives a residual of zero, shown as the dashed horizontal line.
The x-axis shows the fitted (predicted) values. Points at the far left are dolphins the model predicts to be short; points at the right are those the model predicts to be longer. Points labelled with numbers are the row indices of notable observations. For example, the point labelled 6 corresponds to:
boto[6, ]Dolphin 6 is a male that experienced two droughts and was 236 cm long. The model predicted 203 cm, so its residual is approximately +33 cm. We can add a residual column directly:
boto$residual <- boto$length - boto$predicted
# Equivalently: residuals(boto_model)
boto[6, ]This plot is used to check two assumptions.
Linearity
If the residuals display a “wavy” pattern across the fitted values, drawing a straight line through the data isn’t capturing the relationship well. Here’s a simulated example that clearly violates linearity:
The straight line fitted to this curved relationship produces a systematic pattern in the residuals: the model overpredicts at low and high x values and underpredicts in the middle. The diagnostic plot makes this visible:
The red smoothed line through the residuals curves clearly away from zero rather than sitting flat. Compare this to the boto model plot above, where the smoothed line stays close to the dashed zero line across the range of fitted values.
Violating linearity can seriously mislead inference. In the simulated example above, the true relationship peaks in the middle and declines, but a straight line would suggest a flat or slightly positive trend across the whole range. The severity depends on context: sometimes a mild deviation is acceptable; a strong curve is not.
Equal variance of errors
We also use this plot to check whether residuals are equally spread across the range of fitted values. Here is a simulated example of heteroscedasticity (unequal variance):
The residuals fan outward from left to right: predictions are accurate at low x values and increasingly inaccurate at high x values. This violates the assumption that residual variance is constant across all fitted values.
Heteroscedasticity is generally less damaging than nonlinearity. Parameter estimates remain unbiased, but standard errors become inflated. This matters most when you need to quantify uncertainty (next week) or when using p-values (later in the course).
Scale-Location
This plot checks the same two assumptions as Residuals vs Fitted, but transforms the residuals by taking the square root of their absolute values. This makes it easier to spot trends in the spread of residuals rather than their sign. A flat smoothed line indicates equal variance. The two plots are complementary: if one is ambiguous, the other may be clearer.
Q-Q Residuals
Normality of errors
The Q-Q plot is perhaps the least useful of the diagnostic plots, or at least only useful in specific circumstances. That said, it’s worth understanding what it shows so you know when to pay attention to it.
The Q-Q plot compares the distribution of the model’s residuals to a theoretical Normal distribution. If the residuals are approximately Normal, the points fall close to the diagonal reference line. Systematic departures from the line indicate the residuals aren’t consistent with a Normal distribution. The three panels below show reference cases using simulated data.
Left panel: residuals consistent with a Normal distribution, with points following the reference line closely. Centre panel: a violation where the upper tail curves above the line, because values further from zero appear more often than a Normal distribution would predict. Right panel: a violation where both tails deviate away from the line for the same reason, but in both directions.
The following figure shows the same information in a different form, using a histogram of residuals overlaid with the theoretical Normal distribution. You are not expected to reproduce this, but the code is shown for reference:
resids <- residuals(boto_model)
resid_df <- data.frame(residuals = resids)
theory_df <- data.frame(x = seq(min(resids), max(resids), length.out = 200))
theory_df$y <- dnorm(theory_df$x, mean = mean(resids), sd = sd(resids))
ggplot() +
geom_histogram(data = resid_df,
aes(x = residuals, y = after_stat(density)),
binwidth = 3, fill = "lightgrey", colour = "grey60") +
geom_area(data = theory_df, aes(x = x, y = y),
colour = "black", fill = "#4AB54A", alpha = 0.3) +
labs(x = "Residuals", y = "Density") +
theme_minimal()The grey bars are the actual residuals; the green shaded area is the ideal Normal. Ideally the bars and the shading would align. Here, the bars extend slightly further in both tails than the Normal would predict, matching what the Q-Q plot shows.
Does this matter for the boto model? Not really. Minor tail deviations are common and usually acceptable. Normality of errors is the least critical of the residual assumptions, because it only governs how well the model predicts individual observations. If you want to understand how drought impacts botos, do you really care how accurately you can predict each individual dolphin? Not really. Would we expect drought to be the only thing that determines a dolphin’s length? Of course not. Moderate non-normality isn’t a serious problem when the goal is to understand a relationship rather than predict individuals precisely.
Residuals vs Leverage
This plot does not check the distributional assumptions. Instead it identifies observations that have a disproportionate influence on where the fitted line sits. Leverage measures how far an observation is from the centre of the predictor space. Cook’s distance combines leverage and residual size to measure overall influence.
Look for observations close to or beyond a Cook’s distance of 1 (shown as dashed contour lines). An observation with high Cook’s distance is pulling the fitted line substantially toward itself. The boto model shows no observations near that threshold, so no single dolphin is unduly distorting the results.
For Cook’s distance alone, without the leverage axis:
plot(boto_model, which = 4)Cook's distance and leverage: interactive explorer
Select a scenario to add a single special point to the dataset. The grey dashed line shows the baseline fit; the coloured line shows how the fit changes once the special point is included. Stats refer to the special point within the full model.
Diagnosing the boto model
Question 1. Which assumptions can be assessed using the Residuals vs Fitted plot?
Linearity and equal variance of errors. A curved pattern in the residuals suggests non-linearity; a fan or funnel shape suggests heteroscedasticity (unequal variance).
Question 2. Which assumptions can be assessed using the Scale-Location plot?
Linearity and equal variance of errors, the same pair as the Residuals vs Fitted plot, viewed from a different angle. Checking both is worthwhile since one can sometimes reveal a pattern that the other obscures.
Question 3. Which assumption does the Q-Q Residuals plot assess?
Normality of errors. The Q-Q plot compares the distribution of residuals to what a normal distribution would predict. Points that follow the diagonal line closely indicate the normality assumption is well met.
Question 4. How important is it to meet the linearity assumption?
Linearity is generally important, but how much it matters depends on the degree of departure. If the underlying trend is still broadly directional, a linear model may give useful inference. If the relationship has a peak or reversal (as with the beetles), fitting a straight line gives misleading results for at least part of the predictor range and shouldn’t be used for inference.
Question 5. How important is it to meet the equal variance of errors assumption?
Less critical than linearity. Heteroscedasticity inflates standard errors without biasing the parameter estimates themselves. The main consequence is that confidence intervals become wider than they should be, making it harder to detect genuine effects.
Question 6. Looking at the boto diagnostic plots above, do any of them suggest a problem with the model?
No. There is a slight suggestion of increasing residual spread in the Residuals vs Fitted plot, and minor tail deviations in the Q-Q plot, but neither is cause for serious concern. No observation approaches Cook’s distance of 1. Overall, the boto model assumptions are reasonably well met.
Prettier diagnostics with performance
The diagnostic plots we’ve used are standard and useful, but they’re dull as hell. The performance package produces the same diagnostics in a more informative layout, including an additional panel called the “Posterior Predictive Check.” This check uses the fitted model to simulate a range of datasets and overlays them on the observed distribution. If the model is reasonable, the simulated datasets should look similar to the real data.
install.packages(c("performance", "see")) # Run once
library(performance)
check_model(boto_model)Each panel also includes guidance text describing what a passing plot looks like. Remember, though, that not all assumptions are equally important; read the guidance in context rather than treating every deviation as a failure.
Revisiting the beetles
Bring across the beetle model from Workshop 3 (beet_mod <- lm(offspring ~ temperature, data = beetles)) and run the same diagnostics.
Question 1. Do any of the diagnostic plots for the beetle model suggest a reason not to trust the results?
Yes. The Residuals vs Fitted plot shows clear non-linearity: residuals follow a wave pattern, positive at low and high temperatures, negative in the middle. The beetles have a thermal optimum at around 20°C — too cold or too hot and they produce fewer offspring. Basically, we’re trying to fit a straight line through a hump. It just doesn’t work, and the parameter estimates aren’t suitable for inference.
The Scale-Location also shows a wave, though it’s less obvious. The Q-Q and Residuals vs Leverage plots look relatively acceptable, but those can’t override the clear linearity violation.
par(mfrow = c(2, 2))
plot(beet_mod)
par(mfrow = c(1, 1))
# The Residuals vs Fitted pattern is the critical signal here.
Dog disturbance in the Cairngorms
To finish this workshop, work through a new dataset independently. The data come from Cairngorms National Park, where researchers wanted to quantify how much off-leash dogs disturb mammals (foxes, badgers, red and roe deer, marten, stoats, weasels, and red squirrels). Camera traps placed adjacent to paths recorded how long mammals remained on camera. Observers standing on the path counted off-leash dogs passing in each six-hour window.
The dataset contains:
seconds: total time (in seconds) mammals spent in view of a camera trap during a six-hour perioddogs: number of off-leash dogs observed during that same six-hour period
Your tasks: carry out an EDA, fit a model capable of answering the researchers’ question, diagnose the model, and produce a figure of the predicted relationship. You have the remainder of the session for this.
Question 1. Carry out an EDA. Do you notice anything unusual?
One observation has 6120.112 dogs and zero seconds of mammal activity. The maximum number of dogs in all other rows is around 9. This is clearly a data entry error: the values appear to belong in the same row but are in the wrong columns. It is most likely 6120.112 seconds of mammal activity with 0 dogs, not 6120 dogs.
Question 2. What is the range of the dogs variable?
range(dogs$dogs)
# Returns 0 and 6120.112; the erroneous observation is obvious immediately.
Question 3. How should the erroneous observation be corrected? Do not change the data yet.
The values appear swapped: the row should have 0 dogs and 6120.112 seconds. Correct it by swapping the column values for that row:
dogs[dogs$dogs > 6000, ] <- c(6120.112, 0)
Question 4. Is dogs the explanatory or the response variable in a model answering the researchers’ question?
dogs is the explanatory variable. The research question is: how does mammal activity (seconds on camera) change as the number of off-leash dogs increases? If dogs were the response, the question would instead be whether mammal activity predicts dog numbers, reversing the causal direction.
Question 5. Fit a model and check the diagnostic plots. Which plot signals a problem?
Although all plots look slightly unusual, only the Residuals vs Leverage plot shows a clear problem: one observation has an extremely large Cook’s distance, far beyond the contour lines. This is the erroneous observation from Question 1.
dog_mod <- lm(seconds ~ dogs, data = dogs)
par(mfrow = c(2, 2))
plot(dog_mod)
par(mfrow = c(1, 1))
Question 6. Is the observation flagged by the diagnostic plot the same one you spotted in the EDA?
Yes. The Residuals vs Leverage plot labels the problematic observation with its row number (row 132). That is the same row containing the implausible value of 6120 dogs identified in the EDA.
Question 7. Correct the erroneous observation and refit the model. Do the diagnostics look acceptable now?
dogs[dogs$dogs > 6000, ] <- c(6120.112, 0)
dog_mod <- lm(seconds ~ dogs, data = dogs)
par(mfrow = c(2, 2))
plot(dog_mod)
par(mfrow = c(1, 1))
After correction the diagnostics look reasonable. The Residuals vs Fitted and Scale-Location plots show no strong pattern, the Q-Q is acceptable, and no observation approaches a Cook’s distance of 1.
Question 8. Assuming the model is adequate, what is the estimated change in seconds of mammal activity for each additional off-leash dog?
The slope is approximately −382 seconds per dog. For every additional off-leash dog observed, mammal activity on camera decreases by around 382 seconds, or about 6 minutes, in a six-hour window.
summary(dog_mod)
Question 9. Are there any remaining assumptions that should give us pause before interpreting the results?
Representativeness is worth flagging. The data capture mammal activity on camera, which may not represent all mammal activity in the area. Camera placement matters: a camera on a path will miss badgers active away from trails; a camera in a tree will miss ground-level activity. Whether camera activity is a valid proxy for overall mammal behaviour in the presence of dogs is a substantive question that can’t be resolved with diagnostics alone.
Assumption reference
| Assumption | Description | Example of violation |
|---|---|---|
| 1. Validity | The data can answer the research question. | Using IQ to measure overall intelligence. |
| 2. Representativeness | The data reflects the population being studied. | Using survey data from a rural area to draw conclusions about an urban population. |
| 3. Additivity | Effects of predictors are additive with no interactions. | Assuming study hours and sleep independently affect exam scores without considering their combined effect. |
| 4. Linearity | The relationship between predictors and the outcome is linear. | Assuming plant growth continues to increase indefinitely without levelling off. |
| 5. Independence of errors | Observations are independent of each other. | Modelling height using repeated measurements of the same children over time. |
| 6. Equal variance of errors | Residuals have constant variance across all predictor levels. | Modelling income, where salary variability differs across levels of a company hierarchy. |
| 7. Normality of errors | Residuals are approximately normally distributed. | Modelling income, where a small number of extremely wealthy individuals produce a heavily skewed residual distribution. |