Workshop 6: Diagnosing linear models with a categorical predictor

BI3010 Statistics for Biologists

Author

Roos & Pinard

Published

16 May 2026

Glossary / 词汇表
Diagnostic plot 诊断图
A graphical tool used to check whether a model's assumptions are adequately met. The four standard plots produced by plot(model) in R are Residuals vs Fitted, Scale-Location, Q-Q Residuals, and Residuals vs Leverage.
用于检验模型假设是否得到满足的图形工具。R 中 plot(model) 生成的四张标准诊断图分别为:残差与拟合值图、尺度-位置图、Q-Q 残差图和残差与杠杆点图。
Residuals vs Fitted plot 残差与拟合值图
A diagnostic plot with fitted (predicted) values on the x-axis and residuals on the y-axis. Used to assess the equal variance of errors assumption. With a categorical predictor, there is no linearity assumption to check.
x 轴为拟合值,y 轴为残差的诊断图,用于评估误差等方差假设。对于分类预测变量,无需检验线性假设。
Scale-Location plot 尺度-位置图
A diagnostic plot that plots the square root of the absolute standardised residuals against fitted values. Provides a second view of the equal variance of errors assumption; a flat smooth line indicates equal variance.
以拟合值为 x 轴、标准化残差绝对值的平方根为 y 轴的诊断图,提供对等方差假设的第二视角,平滑线平坦表示方差均等。
Q-Q Residuals plot Q-Q 残差图
A quantile-quantile plot that compares the distribution of model residuals to a theoretical normal distribution. Points following the diagonal dashed line indicate the normality of errors assumption is met.
将模型残差分布与理论正态分布进行比较的分位数图。点落在对角虚线上表明残差正态性假设得到满足。
Residuals vs Leverage plot 残差与杠杆点图
A diagnostic plot that identifies disproportionately influential observations. Points with both high leverage and large residuals, indicated by Cook's distance contour lines, can substantially alter the fitted model.
识别对模型有不成比例影响的观测值的诊断图。具有高杠杆值和大残差的点(由库克距离等值线标示)会显著改变拟合模型。
Leverage 杠杆值
A measure of how unusual an observation's predictor value is relative to the rest of the data. High-leverage points have the potential to exert strong influence on the fitted model, though they do not always do so.
衡量一个观测值的预测变量值相对于其他数据的异常程度。高杠杆点有可能对拟合模型产生较大影响,但并非总是如此。
Cook's distance 库克距离
A measure of the overall influence of an observation on all fitted values. Values above 1 indicate the observation is changing the model substantially and warrants investigation.
衡量某个观测值对所有拟合值的综合影响程度。大于 1 的值表明该观测值对模型有实质性影响,需进一步调查。
Equal variance of errors / homoscedasticity 误差等方差性
The assumption that residuals have roughly constant spread across all fitted values (or groups). A violation means the spread of residuals differs across groups, inflating or deflating standard errors.
假设残差在所有拟合值(或组别)上具有大致相同的离散程度。违反此假设意味着不同组别的残差离散程度不同,导致标准误被高估或低估。
Heteroscedasticity 异方差性
A violation of the equal variance of errors assumption, where residual spread differs across fitted values or groups. It inflates standard errors and widens confidence intervals without biasing the parameter estimates themselves.
违反误差等方差假设,即不同拟合值或组别的残差离散程度不同。会使标准误膨胀、置信区间变宽,但不会使参数估计值本身产生偏差。
Normality of errors 残差正态性
The assumption that residuals follow a normal distribution, assessed using the Q-Q Residuals plot. Modest departures are generally tolerable; severe departures may indicate that a different model family (e.g. a GLM) is needed.
假设残差服从正态分布,通过 Q-Q 残差图进行评估。轻微偏离通常可以接受;严重偏离可能意味着需要选择不同的模型族(如广义线性模型)。
Influential observation 影响观测值
An observation whose removal would substantially change the fitted model. Identified using Cook's distance. An observation can be influential because of high leverage, a large residual, or both.
删除后会显著改变拟合模型的观测值,通过库克距离识别。高杠杆值、大残差或二者兼具均可导致一个观测值具有较大影响力。
Residual 残差
The difference between an observed value and its model-predicted value: residual = observed − predicted. Residuals are the core of all model diagnostics.
观测值与模型预测值之差:残差 = 观测值 − 预测值,是所有模型诊断的核心。

We pick up where Workshop 5 left off, returning to step 5 to check whether the rabbit model is worthy of our trust.

If you didn’t finish Workshop 5, do so before continuing.

Learning Objectives

Learning objectives

By the end of this workshop you’ll be able to:

  • Produce and interpret the four standard diagnostic plots for a linear model with a categorical predictor
  • Identify violations of linearity, equal variance, normality, and independence from residual plots
  • Assess leverage and influence using Cook’s distances
  • Decide whether a model is trustworthy enough to interpret
  • Apply the full diagnostic workflow to a novel dataset

Workshop structure

The Analytical Workflow
  1. Formulate a research question
  2. Perform exploratory data analysis
  3. Identify hidden assumptions
  4. Fit a model
  5. Diagnose the model [Focus of today]
  6. Summarise the results
  7. Interpret and draw conclusions

A reminder of the model

In Workshop 5 you fitted a model to test whether Java mongoose are associated with a reduction in the average weight of Amami rabbits on two islands south of mainland Japan. The model was:

\[weight_i \sim \text{Normal}(\mu_i, \sigma)\]

\[\mu_i = \beta_0 + \beta_1 \times Present_i\]

where \(\beta_0\) is the average weight of rabbits when mongoose are absent (\(Present = 0\)) and \(\beta_1\) is the difference from that average when mongoose are present (\(Present = 1\)).

After fitting the model, the estimated parameters are:

\[\mu_i = 269.6 + (-2.9) \times Present_i\]

If your parameter estimates differ, you have likely missed one or both of the data entry errors in the dataset. Return to Workshop 5 and resolve them before continuing.

Using those estimates, we generated predictions and plotted them:

But, as ever, can we convince ourselves that this model is worthy of our trust before we present it to others?

Diagnosing

Diagnosing a model with a categorical predictor follows the same process as with a continuous predictor. The only meaningful difference is in the Residuals vs Fitted plot, which we cover below. Run the full set of four diagnostic plots first:

par(mfrow = c(2, 2))
plot(rabbit_mod)
par(mfrow = c(1, 1))

Residuals vs Fitted

The y-axis shows residual errors (the difference between each observed weight and the model’s prediction for that rabbit). The x-axis shows the fitted (predicted) values; here, the two group means for rabbits where mongoose are present or absent.

Linearity

With a continuous predictor, we use this plot to check whether the relationship is approximately linear. With a categorical predictor there’s no line to draw between the groups, so there’s no linearity assumption to check here. One less thing to worry about.

Equal variance of errors

What we can check is whether the residual spread is roughly equal across both groups. Look at the two vertical clusters of points: one around 266 g (mongoose present) and one around 270 g (mongoose absent). In both groups, individual rabbits deviate from the group mean by a similar amount, around 40 g in either direction. The assumption of equal variance of errors looks well met here.

To understand what a problematic case looks like, consider this simulated dataset where Group B has much more variable responses than Group A:

In the Residuals vs Fitted plot for that simulated data:

The residual spread for Group B is much wider than for Group A, a clear violation. When we add confidence intervals to the original plot:

The intervals are the same width for both groups, even though Group A’s data is much less variable. Heteroscedasticity inflates the standard error across the board. The group means themselves are estimated accurately, but the uncertainty around them is not.

It’s worth saying that we generally do care about estimating uncertainty accurately. We never really know if our estimates are “true” in the sense that they accurately reflect reality, so having some measure of uncertainty is generally the responsible approach to science. If you need a reminder of why, think back to the lecture where we discussed being given the option of two drugs with different probabilities of unintended mortality, and how our decision changed depending on whether uncertainty was shown or not.

Scale-Location

The Scale-Location plot gives a second view of the equal variance assumption.

The spread is slightly larger for the group predicted at around 270 g than for the group at 266 g, which is very slightly less clean than the Residuals vs Fitted plot suggested. Importantly though, this is nowhere near substantial enough to be worried about. Compare it with the same plot for the simulated data:

That is what a genuine violation looks like: a clear upward trend showing residual spread growing with the fitted value.

Q-Q Residuals

We are checking whether the residuals follow an approximately normal distribution. The assumption is met when points lie broadly along the diagonal dashed line. We want them on the line but we don’t stress out too much if it’s not exactly perfect. If we ever did see utterly bizarre patterns in the points, that would imply one of two things: you need to choose a different distribution for your data-generating process and fit a generalised linear model (GLM), or your model is missing some important predictor variable.

Residuals vs Leverage

This plot does not check the standard assumptions. Instead, it flags observations that have a disproportionate influence on the fitted model. We look for Cook’s distance contour lines: observations beyond the 1.0 contour warrant investigation. Here, those contour lines don’t appear in the plot at all. None of our points are close to 1, so we don’t have any observations that are pulling the model one way or another.

For an alternative view of Cook’s distances without leverage on the x-axis:

plot(rabbit_mod, which = 4)

Question set 1

Question 1. Which assumption can be assessed using the Residuals vs Fitted plot when the predictor is categorical?

Equal variance of errors only. With a categorical predictor there’s no linear trend to assess, so linearity drops out. We’re only asking whether the residual spread is similar across groups.

Question 2. Which assumption can be assessed using the Scale-Location plot when the predictor is categorical?

Equal variance of errors. The Scale-Location plot is checking the same thing as the Residuals vs Fitted plot, but from a different angle. Running both is worthwhile because occasionally one reveals a pattern the other obscures.

Question 3. Which assumption does the Q-Q Residuals plot assess?

Normality of errors. The Q-Q plot compares the quantiles of the residuals to the quantiles of a theoretical normal distribution. Points close to the diagonal line indicate the assumption is met.

Question 4. How important is it to meet the equal variance of errors assumption?

Less critical than the linearity assumption. Heteroscedasticity doesn’t bias the parameter estimates (group means) themselves, but it inflates standard errors, which in turn makes confidence intervals wider or narrower than they should be. If you only need an accurate point estimate of the group mean, the impact is limited. For reliable inference, which usually requires trustworthy uncertainty estimates, meeting this assumption matters more.

Question 5. Looking at the rabbit model diagnostic plots, do any of them indicate a problem?

No. The residual spread is similar across both groups, the Q-Q plot shows only minor tail deviations, and no observation approaches Cook’s distance of 1. The diagnostics raise no concerns.

Question 6. Does passing the diagnostic plots mean the model is trustworthy?

No. Passing the diagnostics only means we haven’t found evidence of specific statistical assumption violations. The more fundamental concern for this model was raised before fitting: we can’t be certain that mongoose are genuinely absent from Tokunoshima. No diagnostic plot can check that. The diagnostics address a subset of assumptions; domain knowledge and critical thinking about the data generating process are needed for the rest.

A more visually appealing alternative

The performance package produces the same diagnostics in a more informative layout. Note that for this particular model it does not work especially well; it tries to draw a smooth line between the two groups, which looks wiggly, but that’s an artefact of the two-point structure and doesn’t indicate a problem.

library(performance)
check_model(rabbit_mod)

The lemurs

Having diagnosed the rabbit model, move on to the lemur model from Workshop 5. If you haven’t fitted that model yet, do so now. Run the diagnostic plots and decide whether you’re willing to trust the results.

Download lemurs.txt

lemur <- read.table("lemurs.txt", header = TRUE)

Lemur diagnostics. Run the four diagnostic plots for your lemur model and note what you observe in each. Are there any concerns?

# Full lemur analysis

library(ggplot2)
library(ggeffects)   # install.packages("ggeffects") if needed

# Load data
lemur <- read.table("lemurs.txt", header = TRUE)

# EDA
str(lemur)         # n = 120
summary(lemur)     # no obvious numeric issues

ggplot(lemur) +
  geom_histogram(aes(x = activity, fill = habitat),
                 colour = "white", binwidth = 0.5) +
  theme_minimal()
# One habitat is recorded as "Rain forest" instead of "Rainforest"; a data entry mistake

# Correct the mistake
lemur$habitat[lemur$habitat == "Rain forest"] <- "Rainforest"

# Hidden assumptions
# - Validity: how is "activity" measured? Accuracy may differ by habitat
#   (e.g. easier to track lemurs in dry forest than in dense rainforest).
# - Equal variance: variation may differ by habitat, something to check in diagnostics.

# Fit model
lem_mod <- lm(activity ~ habitat, data = lemur)

# Diagnose
par(mfrow = c(2, 2))
plot(lem_mod)
par(mfrow = c(1, 1))
# No obvious violations. Residual spread looks broadly similar across habitats.
# Q-Q plot shows minor tail deviations but nothing alarming.
# No influential observations (Cook's distance contours do not appear).

# Summarise
summary(lem_mod)
# Dry forest is the reference group (first alphabetically)
# Predicted activity in Dry forest:   7.9 hours
# Mangrove:    7.9 - 1.05 = 6.85 hours
# Rainforest:  7.9 + 1.8  = 9.7 hours
# Spiny forest: 7.9 - 1.8 = 6.1 hours

# Plot predictions
preds <- ggpredict(lem_mod)
plot(preds)

# Interpretation
# Lemurs are most active in rainforests and least active in spiny forests.
# This could reflect habitat preference, or differences in foraging effort
# (e.g. spending more time searching for food in some habitats).
# The validity concern about how activity is measured remains unresolved.

The diagnostics raise no serious concerns. The main assumptions to reflect on are validity (how activity is recorded and whether it is comparable across habitats) and representativeness (whether the sampled locations reflect the broader habitat types).

Assumption reference

Assumption Description Example of violation
1. Validity The data can answer the research question. Using IQ to measure overall intelligence.
2. Representativeness The data reflects the population being studied. Using survey data from a rural area to draw conclusions about an urban population.
3. Additivity Effects of predictors are additive with no interactions. Assuming study hours and sleep independently affect exam scores without considering their combined effect.
4. Linearity The relationship between predictors and the outcome is linear. Assuming plant growth continues to increase indefinitely without levelling off.
5. Independence of errors Observations are independent of each other. Modelling height using repeated measurements of the same children over time.
6. Equal variance of errors Residuals have constant variance across all predictor levels. Modelling income, where salary variability differs across levels of a company hierarchy.
7. Normality of errors Residuals follow a normal distribution. Modelling income, where the presence of extremely wealthy individuals creates a heavily skewed residual distribution.