Workshop 6: Diagnosing linear models with a categorical predictor
BI3010 Statistics for Biologists
plot(model) in R are Residuals vs Fitted, Scale-Location, Q-Q Residuals, and Residuals vs Leverage.plot(model) 生成的四张标准诊断图分别为:残差与拟合值图、尺度-位置图、Q-Q 残差图和残差与杠杆点图。We pick up where Workshop 5 left off, returning to step 5 to check whether the rabbit model is worthy of our trust.
If you didn’t finish Workshop 5, do so before continuing.
Learning Objectives
By the end of this workshop you’ll be able to:
- Produce and interpret the four standard diagnostic plots for a linear model with a categorical predictor
- Identify violations of linearity, equal variance, normality, and independence from residual plots
- Assess leverage and influence using Cook’s distances
- Decide whether a model is trustworthy enough to interpret
- Apply the full diagnostic workflow to a novel dataset
Workshop structure
- Formulate a research question
- Perform exploratory data analysis
- Identify hidden assumptions
- Fit a model
- Diagnose the model [Focus of today]
- Summarise the results
- Interpret and draw conclusions
A reminder of the model
In Workshop 5 you fitted a model to test whether Java mongoose are associated with a reduction in the average weight of Amami rabbits on two islands south of mainland Japan. The model was:
\[weight_i \sim \text{Normal}(\mu_i, \sigma)\]
\[\mu_i = \beta_0 + \beta_1 \times Present_i\]
where \(\beta_0\) is the average weight of rabbits when mongoose are absent (\(Present = 0\)) and \(\beta_1\) is the difference from that average when mongoose are present (\(Present = 1\)).
After fitting the model, the estimated parameters are:
\[\mu_i = 269.6 + (-2.9) \times Present_i\]
If your parameter estimates differ, you have likely missed one or both of the data entry errors in the dataset. Return to Workshop 5 and resolve them before continuing.
Using those estimates, we generated predictions and plotted them:
But, as ever, can we convince ourselves that this model is worthy of our trust before we present it to others?
Diagnosing
Diagnosing a model with a categorical predictor follows the same process as with a continuous predictor. The only meaningful difference is in the Residuals vs Fitted plot, which we cover below. Run the full set of four diagnostic plots first:
par(mfrow = c(2, 2))
plot(rabbit_mod)
par(mfrow = c(1, 1))Residuals vs Fitted
The y-axis shows residual errors (the difference between each observed weight and the model’s prediction for that rabbit). The x-axis shows the fitted (predicted) values; here, the two group means for rabbits where mongoose are present or absent.
Linearity
With a continuous predictor, we use this plot to check whether the relationship is approximately linear. With a categorical predictor there’s no line to draw between the groups, so there’s no linearity assumption to check here. One less thing to worry about.
Equal variance of errors
What we can check is whether the residual spread is roughly equal across both groups. Look at the two vertical clusters of points: one around 266 g (mongoose present) and one around 270 g (mongoose absent). In both groups, individual rabbits deviate from the group mean by a similar amount, around 40 g in either direction. The assumption of equal variance of errors looks well met here.
To understand what a problematic case looks like, consider this simulated dataset where Group B has much more variable responses than Group A:
In the Residuals vs Fitted plot for that simulated data:
The residual spread for Group B is much wider than for Group A, a clear violation. When we add confidence intervals to the original plot:
The intervals are the same width for both groups, even though Group A’s data is much less variable. Heteroscedasticity inflates the standard error across the board. The group means themselves are estimated accurately, but the uncertainty around them is not.
It’s worth saying that we generally do care about estimating uncertainty accurately. We never really know if our estimates are “true” in the sense that they accurately reflect reality, so having some measure of uncertainty is generally the responsible approach to science. If you need a reminder of why, think back to the lecture where we discussed being given the option of two drugs with different probabilities of unintended mortality, and how our decision changed depending on whether uncertainty was shown or not.
Scale-Location
The Scale-Location plot gives a second view of the equal variance assumption.
The spread is slightly larger for the group predicted at around 270 g than for the group at 266 g, which is very slightly less clean than the Residuals vs Fitted plot suggested. Importantly though, this is nowhere near substantial enough to be worried about. Compare it with the same plot for the simulated data:
That is what a genuine violation looks like: a clear upward trend showing residual spread growing with the fitted value.
Q-Q Residuals
We are checking whether the residuals follow an approximately normal distribution. The assumption is met when points lie broadly along the diagonal dashed line. We want them on the line but we don’t stress out too much if it’s not exactly perfect. If we ever did see utterly bizarre patterns in the points, that would imply one of two things: you need to choose a different distribution for your data-generating process and fit a generalised linear model (GLM), or your model is missing some important predictor variable.
Residuals vs Leverage
This plot does not check the standard assumptions. Instead, it flags observations that have a disproportionate influence on the fitted model. We look for Cook’s distance contour lines: observations beyond the 1.0 contour warrant investigation. Here, those contour lines don’t appear in the plot at all. None of our points are close to 1, so we don’t have any observations that are pulling the model one way or another.
For an alternative view of Cook’s distances without leverage on the x-axis:
plot(rabbit_mod, which = 4)Question set 1
Question 1. Which assumption can be assessed using the Residuals vs Fitted plot when the predictor is categorical?
Equal variance of errors only. With a categorical predictor there’s no linear trend to assess, so linearity drops out. We’re only asking whether the residual spread is similar across groups.
Question 2. Which assumption can be assessed using the Scale-Location plot when the predictor is categorical?
Equal variance of errors. The Scale-Location plot is checking the same thing as the Residuals vs Fitted plot, but from a different angle. Running both is worthwhile because occasionally one reveals a pattern the other obscures.
Question 3. Which assumption does the Q-Q Residuals plot assess?
Normality of errors. The Q-Q plot compares the quantiles of the residuals to the quantiles of a theoretical normal distribution. Points close to the diagonal line indicate the assumption is met.
Question 4. How important is it to meet the equal variance of errors assumption?
Less critical than the linearity assumption. Heteroscedasticity doesn’t bias the parameter estimates (group means) themselves, but it inflates standard errors, which in turn makes confidence intervals wider or narrower than they should be. If you only need an accurate point estimate of the group mean, the impact is limited. For reliable inference, which usually requires trustworthy uncertainty estimates, meeting this assumption matters more.
Question 5. Looking at the rabbit model diagnostic plots, do any of them indicate a problem?
No. The residual spread is similar across both groups, the Q-Q plot shows only minor tail deviations, and no observation approaches Cook’s distance of 1. The diagnostics raise no concerns.
Question 6. Does passing the diagnostic plots mean the model is trustworthy?
No. Passing the diagnostics only means we haven’t found evidence of specific statistical assumption violations. The more fundamental concern for this model was raised before fitting: we can’t be certain that mongoose are genuinely absent from Tokunoshima. No diagnostic plot can check that. The diagnostics address a subset of assumptions; domain knowledge and critical thinking about the data generating process are needed for the rest.
A more visually appealing alternative
The performance package produces the same diagnostics in a more informative layout. Note that for this particular model it does not work especially well; it tries to draw a smooth line between the two groups, which looks wiggly, but that’s an artefact of the two-point structure and doesn’t indicate a problem.
library(performance)
check_model(rabbit_mod)The lemurs
Having diagnosed the rabbit model, move on to the lemur model from Workshop 5. If you haven’t fitted that model yet, do so now. Run the diagnostic plots and decide whether you’re willing to trust the results.
lemur <- read.table("lemurs.txt", header = TRUE)Lemur diagnostics. Run the four diagnostic plots for your lemur model and note what you observe in each. Are there any concerns?
# Full lemur analysis
library(ggplot2)
library(ggeffects) # install.packages("ggeffects") if needed
# Load data
lemur <- read.table("lemurs.txt", header = TRUE)
# EDA
str(lemur) # n = 120
summary(lemur) # no obvious numeric issues
ggplot(lemur) +
geom_histogram(aes(x = activity, fill = habitat),
colour = "white", binwidth = 0.5) +
theme_minimal()
# One habitat is recorded as "Rain forest" instead of "Rainforest"; a data entry mistake
# Correct the mistake
lemur$habitat[lemur$habitat == "Rain forest"] <- "Rainforest"
# Hidden assumptions
# - Validity: how is "activity" measured? Accuracy may differ by habitat
# (e.g. easier to track lemurs in dry forest than in dense rainforest).
# - Equal variance: variation may differ by habitat, something to check in diagnostics.
# Fit model
lem_mod <- lm(activity ~ habitat, data = lemur)
# Diagnose
par(mfrow = c(2, 2))
plot(lem_mod)
par(mfrow = c(1, 1))
# No obvious violations. Residual spread looks broadly similar across habitats.
# Q-Q plot shows minor tail deviations but nothing alarming.
# No influential observations (Cook's distance contours do not appear).
# Summarise
summary(lem_mod)
# Dry forest is the reference group (first alphabetically)
# Predicted activity in Dry forest: 7.9 hours
# Mangrove: 7.9 - 1.05 = 6.85 hours
# Rainforest: 7.9 + 1.8 = 9.7 hours
# Spiny forest: 7.9 - 1.8 = 6.1 hours
# Plot predictions
preds <- ggpredict(lem_mod)
plot(preds)
# Interpretation
# Lemurs are most active in rainforests and least active in spiny forests.
# This could reflect habitat preference, or differences in foraging effort
# (e.g. spending more time searching for food in some habitats).
# The validity concern about how activity is measured remains unresolved.The diagnostics raise no serious concerns. The main assumptions to reflect on are validity (how activity is recorded and whether it is comparable across habitats) and representativeness (whether the sampled locations reflect the broader habitat types).
Assumption reference
| Assumption | Description | Example of violation |
|---|---|---|
| 1. Validity | The data can answer the research question. | Using IQ to measure overall intelligence. |
| 2. Representativeness | The data reflects the population being studied. | Using survey data from a rural area to draw conclusions about an urban population. |
| 3. Additivity | Effects of predictors are additive with no interactions. | Assuming study hours and sleep independently affect exam scores without considering their combined effect. |
| 4. Linearity | The relationship between predictors and the outcome is linear. | Assuming plant growth continues to increase indefinitely without levelling off. |
| 5. Independence of errors | Observations are independent of each other. | Modelling height using repeated measurements of the same children over time. |
| 6. Equal variance of errors | Residuals have constant variance across all predictor levels. | Modelling income, where salary variability differs across levels of a company hierarchy. |
| 7. Normality of errors | Residuals follow a normal distribution. | Modelling income, where the presence of extremely wealthy individuals creates a heavily skewed residual distribution. |