Workshop 4: Diagnosing linear models with a continuous predictor

BI3010 Statistics for Biologists

Author

Roos & Pinard

Published

16 May 2026

Learning Objectives

Learning objectives

Understand what residuals are and how they are used in model diagnosis.
Interpret the four standard diagnostic plots produced by plot(model).
Identify which assumptions each plot is designed to check.
Recognise when a diagnostic plot signals a genuine problem and when it doesn’t.

Workshop structure

The Analytical Workflow

Today we work with the models from Workshop 3 and apply the full diagnostic workflow. The focus is step 5.

Formulate a research question.
Perform exploratory data analysis.
Identify any hidden assumptions.
Fit an appropriate model.
Diagnose the model and check assumptions [Focus of today].
Summarise the results.
Interpret and provide inferences.

To keep the workload manageable, start by copying your workflow from Workshop 3 into a new script, or continue in the same script. If you didn’t finish Workshop 3, complete it now before continuing here.

Why diagnostics matter

Starting with the boto dolphin model from Workshop 3, we have the equation with estimated parameters:

\[length_i \sim \text{Normal}(\mu_i, \sigma)\]

\[\mu_i = 213.2 + (-5.3) \times drought_i\]

We used this to generate a figure, but how do we know if we can trust the result? How do we know we haven’t been cursed by the Monkey’s Paw? The uncomfortable answer is that we can’t confirm it with absolute certainty. The results might be completely unreliable, either because the data violated a hidden assumption or because the model isn’t well suited to the data. The particularly bad news is that we almost always get some answer from a model, even when the data or model are badly flawed.

This means it falls to us to actively look for signs that the model results shouldn’t be interpreted. This is especially important when a model could influence real-world decisions. We don’t want to conclude a drug is safe only to discover later it has serious side effects.

One more uncomfortable truth: the assumptions we care about most (validity and representativeness) are the hardest to test formally. They require domain knowledge, careful EDA, and clear thinking about the data-generating process. That is why you should spend the majority of your time on EDA and thinking about the context before fitting any model.

The assumptions that can be checked formally after fitting are the residual-based ones: linearity, equal variance, normality, and independence of errors. That is where today’s workshop focuses.

To start, ensure you have the boto model from the last workshop loaded. Add one extra line to store the model’s predictions for the original dataset:

boto$predicted <- predict(boto_model)

Unlike the fake dataset approach from Workshop 3, here we pass the original data through the model so we can compute residuals: the difference between each observed length and what the model predicted for that dolphin.

Diagnosing the four horsemen

The four standard diagnostic plots are produced in one call:

par(mfrow = c(2, 2))  # arrange as 2x2 grid
plot(boto_model)
par(mfrow = c(1, 1))  # reset

plot() works directly on a model object, returning all four diagnostic plots in one call. It uses base R graphics rather than ggplot2 — the output is plainer, but nothing extra needs setting up, making it the standard tool for a quick check. You will see a ggplot2-based alternative later in this workshop.

Alternatively, press Return in the console after plot(boto_model) to step through them one at a time. Let’s go through each plot and what it tells us.

Residuals vs Fitted

The y-axis shows residual errors: the difference between each dolphin’s actual measured length and what the model predicted. A perfect prediction gives a residual of zero, shown as the dashed horizontal line.

The x-axis shows the fitted (predicted) values. Points at the far left are dolphins the model predicts to be short; points at the right are those the model predicts to be longer. Points labelled with numbers are the row indices of notable observations. For example, the point labelled 6 corresponds to:

boto[6, ]

Dolphin 6 is a male that experienced two droughts and was 236 cm long. The model predicted 203 cm, so its residual is approximately +33 cm. We can add a residual column directly:

boto$residual <- boto$length - boto$predicted
# Equivalently: residuals(boto_model)
boto[6, ]

This plot is used to check two assumptions.

Linearity

If the residuals display a “wavy” pattern across the fitted values, drawing a straight line through the data isn’t capturing the relationship well. Here’s a simulated example that clearly violates linearity:

The straight line fitted to this curved relationship produces a systematic pattern in the residuals: the model overpredicts at low and high x values and underpredicts in the middle. The diagnostic plot makes this visible:

The red smoothed line through the residuals curves clearly away from zero rather than sitting flat. Compare this to the boto model plot above, where the smoothed line stays close to the dashed zero line across the range of fitted values.

Violating linearity can seriously mislead inference. In the simulated example above, the true relationship peaks in the middle and declines, but a straight line would suggest a flat or slightly positive trend across the whole range. The severity depends on context: sometimes a mild deviation is acceptable; a strong curve is not.

Equal variance of errors

We also use this plot to check whether residuals are equally spread across the range of fitted values. Here is a simulated example of heteroscedasticity (unequal variance):

The residuals fan outward from left to right: predictions are accurate at low x values and increasingly inaccurate at high x values. This violates the assumption that residual variance is constant across all fitted values.

Heteroscedasticity is generally less damaging than nonlinearity. Parameter estimates remain unbiased, but standard errors become inflated. This matters most when you need to quantify uncertainty (next week) or when using p-values (later in the course).

Scale-Location

This plot checks the same two assumptions as Residuals vs Fitted, but transforms the residuals by taking the square root of their absolute values. This makes it easier to spot trends in the spread of residuals rather than their sign. A flat smoothed line indicates equal variance. The two plots are complementary: if one is ambiguous, the other may be clearer.

Q-Q Residuals

Normality of errors

The Q-Q plot is perhaps the least useful of the diagnostic plots, or at least only useful in specific circumstances. That said, it’s worth understanding what it shows so you know when to pay attention to it.

The Q-Q plot compares the distribution of the model’s residuals to a theoretical Normal distribution. If the residuals are approximately Normal, the points fall close to the diagonal reference line. Systematic departures from the line indicate the residuals aren’t consistent with a Normal distribution. The three panels below show reference cases using simulated data.

Left panel: residuals consistent with a Normal distribution, with points following the reference line closely. Centre panel: a violation where the upper tail curves above the line, because values further from zero appear more often than a Normal distribution would predict. Right panel: a violation where both tails deviate away from the line for the same reason, but in both directions.

The following figure shows the same information in a different form, using a histogram of residuals overlaid with the theoretical Normal distribution. You are not expected to reproduce this, but the code is shown for reference:

resids  <- residuals(boto_model)
resid_df <- data.frame(residuals = resids)
theory_df <- data.frame(x = seq(min(resids), max(resids), length.out = 200))
theory_df$y <- dnorm(theory_df$x, mean = mean(resids), sd = sd(resids))

ggplot() +
  geom_histogram(data = resid_df,
                 aes(x = residuals, y = after_stat(density)),
                 binwidth = 3, fill = "lightgrey", colour = "grey60") +
  geom_area(data = theory_df, aes(x = x, y = y),
            colour = "black", fill = "#4AB54A", alpha = 0.3) +
  labs(x = "Residuals", y = "Density") +
  theme_minimal()

The grey bars are the actual residuals; the green shaded area is the ideal Normal. Ideally the bars and the shading would align. Here, the bars extend slightly further in both tails than the Normal would predict, matching what the Q-Q plot shows.

Does this matter for the boto model? Not really. Minor tail deviations are common and usually acceptable. Normality of errors is the least critical of the residual assumptions, because it only governs how well the model predicts individual observations. If you want to understand how drought impacts botos, do you really care how accurately you can predict each individual dolphin? Not really. Would we expect drought to be the only thing that determines a dolphin’s length? Of course not. Moderate non-normality isn’t a serious problem when the goal is to understand a relationship rather than predict individuals precisely.

Residuals vs Leverage

This plot does not check the distributional assumptions. Instead it identifies observations that have a disproportionate influence on where the fitted line sits. Leverage measures how far an observation is from the centre of the predictor space. Cook’s distance combines leverage and residual size to measure overall influence.

Look for observations close to or beyond a Cook’s distance of 1 (shown as dashed contour lines). An observation with high Cook’s distance is pulling the fitted line substantially toward itself. The boto model shows no observations near that threshold, so no single dolphin is unduly distorting the results.

For Cook’s distance alone, without the leverage axis:

plot(boto_model, which = 4)

Cook's distance and leverage: interactive explorer

Select a scenario to add a single special point to the dataset. The grey dashed line shows the baseline fit; the coloured line shows how the fit changes once the special point is included. Stats refer to the special point within the full model.

Prettier diagnostics with `performance`

The diagnostic plots we’ve used are standard and useful, but they’re dull as hell. The performance package produces the same diagnostics in a more informative layout, including an additional panel called the “Posterior Predictive Check.” This check uses the fitted model to simulate a range of datasets and overlays them on the observed distribution. If the model is reasonable, the simulated datasets should look similar to the real data.

install.packages(c("performance", "see"))  # Run once
library(performance)
check_model(boto_model)

Each panel also includes guidance text describing what a passing plot looks like. Remember, though, that not all assumptions are equally important; read the guidance in context rather than treating every deviation as a failure.

Revisiting the beetles

Bring across the beetle model from Workshop 3 (beet_mod <- lm(offspring ~ temperature, data = beetles)) and run the same diagnostics.

Question 1. Do any of the diagnostic plots for the beetle model suggest a reason not to trust the results?

No: all four plots look acceptable Yes: the Residuals vs Leverage plot shows a problematic observation with high Cook’s distance Yes: the Residuals vs Fitted plot shows clear non-linearity, indicating the straight-line model is a poor fit Yes: the Q-Q plot shows a serious normality violation

Yes. The Residuals vs Fitted plot shows clear non-linearity: residuals follow a wave pattern, positive at low and high temperatures, negative in the middle. The beetles have a thermal optimum at around 20°C — too cold or too hot and they produce fewer offspring. Basically, we’re trying to fit a straight line through a hump. It just doesn’t work, and the parameter estimates aren’t suitable for inference.

The Scale-Location also shows a wave, though it’s less obvious. The Q-Q and Residuals vs Leverage plots look relatively acceptable, but those can’t override the clear linearity violation.

par(mfrow = c(2, 2))
plot(beet_mod)
par(mfrow = c(1, 1))
# The Residuals vs Fitted pattern is the critical signal here.

Dog disturbance in the Cairngorms

To finish this workshop, work through a new dataset independently. The data come from Cairngorms National Park, where researchers wanted to quantify how much off-leash dogs disturb mammals (foxes, badgers, red and roe deer, marten, stoats, weasels, and red squirrels). Camera traps placed adjacent to paths recorded how long mammals remained on camera. Observers standing on the path counted off-leash dogs passing in each six-hour window.

The dataset contains:

Download dog_disturbance.txt

seconds: total time (in seconds) mammals spent in view of a camera trap during a six-hour period
dogs: number of off-leash dogs observed during that same six-hour period

Your tasks: carry out an EDA, fit a model capable of answering the researchers’ question, diagnose the model, and produce a figure of the predicted relationship. You have the remainder of the session for this.

Question 1. Carry out an EDA. Do you notice anything unusual?

One observation has 6120.112 dogs and zero seconds of mammal activity. The maximum number of dogs in all other rows is around 9. This is clearly a data entry error: the values appear to belong in the same row but are in the wrong columns. It is most likely 6120.112 seconds of mammal activity with 0 dogs, not 6120 dogs.

Question 2. What is the range of the dogs variable?

range(dogs$dogs)
# Returns 0 and 6120.112; the erroneous observation is obvious immediately.

Question 3. How should the erroneous observation be corrected? Do not change the data yet.

The values appear swapped: the row should have 0 dogs and 6120.112 seconds. Correct it by swapping the column values for that row:

dogs[dogs$dogs > 6000, ] <- c(6120.112, 0)

Question 4. Is dogs the explanatory or the response variable in a model answering the researchers’ question?

Response: the researchers want to predict dog numbers from mammal activity Explanatory: it predicts how mammal activity (seconds on camera) changes as off-leash dog numbers increase Both: in a linear model, either variable can play either role Neither: dogs is a nuisance variable to be controlled for

dogs is the explanatory variable. The research question is: how does mammal activity (seconds on camera) change as the number of off-leash dogs increases? If dogs were the response, the question would instead be whether mammal activity predicts dog numbers, reversing the causal direction.

Question 5. Fit a model and check the diagnostic plots. Which plot signals a problem?

Although all plots look slightly unusual, only the Residuals vs Leverage plot shows a clear problem: one observation has an extremely large Cook’s distance, far beyond the contour lines. This is the erroneous observation from Question 1.

dog_mod <- lm(seconds ~ dogs, data = dogs)
par(mfrow = c(2, 2))
plot(dog_mod)
par(mfrow = c(1, 1))

Question 6. Is the observation flagged by the diagnostic plot the same one you spotted in the EDA?

No: the flagged observation is a different row from the one identified in the EDA Yes: the Residuals vs Leverage plot labels it as row 132, the same row with the implausible dog count from the EDA It is impossible to tell without running additional checks Yes, but only by coincidence since the two checks are independent

Yes. The Residuals vs Leverage plot labels the problematic observation with its row number (row 132). That is the same row containing the implausible value of 6120 dogs identified in the EDA.

Question 7. Correct the erroneous observation and refit the model. Do the diagnostics look acceptable now?

dogs[dogs$dogs > 6000, ] <- c(6120.112, 0)
dog_mod <- lm(seconds ~ dogs, data = dogs)
par(mfrow = c(2, 2))
plot(dog_mod)
par(mfrow = c(1, 1))

After correction the diagnostics look reasonable. The Residuals vs Fitted and Scale-Location plots show no strong pattern, the Q-Q is acceptable, and no observation approaches a Cook’s distance of 1.

Question 8. Assuming the model is adequate, what is the estimated change in seconds of mammal activity for each additional off-leash dog?

The slope is approximately −382 seconds per dog. For every additional off-leash dog observed, mammal activity on camera decreases by around 382 seconds, or about 6 minutes, in a six-hour window.

summary(dog_mod)

Question 9. Are there any remaining assumptions that should give us pause before interpreting the results?

Representativeness is worth flagging. The data capture mammal activity on camera, which may not represent all mammal activity in the area. Camera placement matters: a camera on a path will miss badgers active away from trails; a camera in a tree will miss ground-level activity. Whether camera activity is a valid proxy for overall mammal behaviour in the presence of dogs is a substantive question that can’t be resolved with diagnostics alone.

Assumption reference

Assumption	Description	Example of violation
1. Validity	The data can answer the research question.	Using IQ to measure overall intelligence.
2. Representativeness	The data reflects the population being studied.	Using survey data from a rural area to draw conclusions about an urban population.
3. Additivity	Effects of predictors are additive with no interactions.	Assuming study hours and sleep independently affect exam scores without considering their combined effect.
4. Linearity	The relationship between predictors and the outcome is linear.	Assuming plant growth continues to increase indefinitely without levelling off.
5. Independence of errors	Observations are independent of each other.	Modelling height using repeated measurements of the same children over time.
6. Equal variance of errors	Residuals have constant variance across all predictor levels.	Modelling income, where salary variability differs across levels of a company hierarchy.
7. Normality of errors	Residuals are approximately normally distributed.	Modelling income, where a small number of extremely wealthy individuals produce a heavily skewed residual distribution.

Glossary / 词汇表

Model output

Residual 残差

The difference between an observed value and its model-predicted value: residual = observed − predicted. Residuals are the core of all model diagnostics.

观测值与模型预测值之差：残差 = 观测值 − 预测值，是所有模型诊断的核心。

Fitted value 拟合值

The value the model predicts for each observation in the original dataset. Obtained with predict(model) when no newdata argument is provided.

模型对原始数据集中每个观测值的预测值，不提供 newdata 参数时直接调用 predict(model) 即可获得。

Standard error 标准误差

A measure of the uncertainty in an estimated parameter. Heteroscedasticity inflates standard errors, making confidence intervals wider than they should be.

参数估计值不确定性的度量。异方差性会导致标准误差膨胀，使置信区间比实际应有的更宽。

Diagnostic plots

Diagnostic plot 诊断图

A plot used after model fitting to check whether the model's assumptions are reasonably met. In R, plot(model) produces four standard diagnostic plots.

模型拟合后用于检查模型假设是否合理满足的图表。在 R 中，plot(model) 可生成四张标准诊断图。

Residuals vs Fitted 残差 vs 拟合值图

A diagnostic plot showing residuals on the y-axis and fitted values on the x-axis. Used to check linearity (no wave pattern) and equal variance (no fan shape).

以残差为纵轴、拟合值为横轴的诊断图，用于检验线性性（无波浪形状）和等方差性（无扇形分布）。

Scale-Location plot 尺度位置图

A diagnostic plot showing the square root of absolute residuals against fitted values. Complements Residuals vs Fitted for checking equal variance; a flat smoothed line is ideal.

以残差绝对值的平方根对拟合值作图的诊断图，与残差 vs 拟合值图互为补充，用于检验等方差性，理想状态下平滑线应保持平直。

Q-Q plot Q-Q 图（分位数图）

A plot comparing the quantiles of the model's residuals to the quantiles of a theoretical Normal distribution. Points falling close to the diagonal line indicate approximately Normal residuals.

将模型残差的分位数与理论正态分布的分位数进行比较的图表，点落在对角线附近表示残差近似正态分布。

Residuals vs Leverage 残差 vs 杠杆值图

A diagnostic plot used to identify influential observations. Cook's distance contour lines are overlaid; observations close to or beyond a Cook's distance of 1 warrant investigation.

用于识别影响性观测值的诊断图，叠加有库克距离等高线，接近或超过库克距离 1 的观测值需要进一步检查。

Influence and leverage

Cook's distance 库克距离

A measure of how much all fitted values would change if a particular observation were removed. High values (approaching 1) indicate that an observation is pulling the fitted line substantially toward itself.

衡量移除某个观测值后所有拟合值变化程度的指标。值接近 1 表明该观测值对拟合线有显著牵引作用。

Leverage 杠杆值

A measure of how far an observation's predictor value is from the centre of the predictor space. High-leverage points have the potential to be influential, particularly if they also have large residuals.

衡量某观测值的预测变量取值离预测变量中心有多远的指标。高杠杆值的点尤其在残差也较大时，可能对模型拟合具有重大影响。

Influential observation 影响性观测值

An observation whose removal would substantially change the model's estimated parameters. Identified by high Cook's distance in the Residuals vs Leverage plot.

移除后会导致模型估计参数发生显著变化的观测值，通过残差 vs 杠杆值图中较高的库克距离来识别。

Assumption violations

Non-linearity 非线性

A violation of the linearity assumption in which the true relationship between predictor and response follows a curve rather than a straight line. Produces a wave pattern in the Residuals vs Fitted plot.

线性假设被违反的情形，即预测变量与响应变量之间的真实关系呈曲线而非直线，在残差 vs 拟合值图中表现为波浪形。

Heteroscedasticity 异方差性

A violation of the equal variance of errors assumption, in which residual spread changes systematically across fitted values. Often appears as a fan or funnel shape in the Residuals vs Fitted and Scale-Location plots.

违反误差等方差假设的情形，表现为残差的离散程度随拟合值系统性改变，在残差 vs 拟合值图和尺度位置图中常呈扇形或漏斗形。

Homoscedasticity 同方差性

The equal variance of errors assumption is satisfied: residuals have roughly constant spread across all fitted values. Seen as a horizontal band with no fan or funnel in the Residuals vs Fitted and Scale-Location plots.

误差等方差假设得到满足：残差在所有拟合值上的离散程度大致恒定，在残差 vs 拟合值图和尺度位置图中表现为无扇形或漏斗形的水平带状分布。

Tools

performance (package) performance 包

An R package that produces enhanced model diagnostic plots via check_model(). Includes a Posterior Predictive Check panel and guidance text on each plot.

通过 check_model() 生成增强版模型诊断图的 R 包，包含后验预测检验面板及每张图的说明文字。

Posterior Predictive Check 后验预测检验

A diagnostic that uses the fitted model to simulate multiple datasets and compares them to the observed data. If the simulated distributions resemble the real data, the model is considered a plausible data-generating process.

使用已拟合模型模拟多个数据集并与观测数据进行比较的诊断方法。若模拟分布与真实数据相似，则认为该模型是合理的数据生成过程。

par(mfrow) par(mfrow)

A base R graphics parameter that partitions the plot window into a grid. par(mfrow = c(2, 2)) creates a 2x2 grid so four plots display together. Reset to par(mfrow = c(1, 1)) afterwards.

将绘图窗口分割为网格的 R 基础图形参数。par(mfrow = c(2, 2)) 创建 2x2 网格使四张图同时显示，完成后需重置为 par(mfrow = c(1, 1))。

Dataset terms

Thermal optimum 热适温度

The temperature at which an organism performs best. For burying beetles in the workshop dataset, offspring production peaks around 20°C and declines at both lower and higher temperatures, producing a non-linear relationship that a linear model cannot capture.

生物体表现最佳的温度。在本次工作坊的埋葬甲虫数据集中，后代产量在约 20°C 时达到峰值，温度过低或过高均导致产量下降，形成线性模型无法捕捉的非线性关系。