species <- read.table("Data/species.txt", header = TRUE)Workshop 9: Poisson GLMs
BI3010 Statistics for Biologists
exp()) converts predictions back to the response scale, where they are always positive. Slopes on the log scale describe multiplicative changes on the response scale.exp()指数化)将预测值转换回响应尺度,在该尺度上始终为正值。对数尺度上的斜率描述的是响应尺度上的乘法变化。exp() to the predicted value. Back-transformation is needed when reporting or plotting predictions in interpretable units (e.g. expected species counts rather than log-expected species counts). Confidence intervals must also be back-transformed after being calculated on the link scale.exp()。当以可解释单位(例如预期物种数而非对数预期物种数)报告或绘制预测值时,需要进行反变换。置信区间也必须在连接尺度上计算后再进行反变换。summary() output alongside residual degrees of freedom. For a Poisson GLM, dividing residual deviance by residual degrees of freedom gives the dispersion statistic: a value close to 1 is expected, values well above 1 indicate overdispersion.summary()输出中与残差自由度一起报告。对于泊松GLM,将残差偏差除以残差自由度得到离散统计量:接近1的值是正常的,远大于1的值表明存在过度离散。Learning Objectives
By the end of this workshop you should be able to:
- Explain why count data cannot be modelled with a Normal distribution and when the Poisson distribution is more appropriate.
- Describe the three components of a GLM: the distribution family, the link function, and the linear predictor.
- Fit a Poisson GLM in R using
glm()and check for overdispersion using the dispersion statistic. - Interpret model coefficients on the log scale and back-transform predictions to the response scale using
exp(). - Diagnose a Poisson GLM and explain why the normal Q-Q plot is largely uninformative for Poisson residuals.
Workshop structure
As always, we work through the same seven-step workflow:
- Formulate a research question.
- Perform exploratory data analysis (EDA).
- Identify any hidden assumptions.
- Fit an appropriate model [Focus of today].
- Diagnose the model and check assumptions [Focus of today].
- Summarise the results [Focus of today].
- Interpret and draw conclusions.
All previous workshops used linear models with a Normal distribution for the response. Today we extend that to Generalised Linear Models (GLMs), which allow the response variable to follow a different distribution. The specific case here is count data, which calls for a Poisson GLM.
1. Formulate a research question
The dataset for this workshop comes from a study of plant diversity across 90 plots of land. Download it below and save it to your BI3010/data/ folder.
Each row is one plot. The variables are:
Species: the number of plant species recorded in the plot (a count).Biomass: the total weight of plants in the plot, in kg.pH: the soil acidity category ("High","Mid", or"Low").
The researchers wanted to describe how species count is associated with plant biomass and soil pH. This is a predictive question: no causal claim is being made about either predictor.
Load the data:
2. Perform exploratory data analysis
Spend a few minutes exploring the dataset. Check the structure, look at distributions, and plot the response against each predictor. Useful functions: str(), summary(), geom_histogram(), table(), ggplot().
str(species)
summary(species)ggplot(species, aes(x = Species)) + geom_histogram() + theme_minimal()
ggplot(species, aes(x = Biomass)) + geom_histogram() + theme_minimal()
table(species$pH)ggplot(species, aes(x = Biomass, y = Species)) +
geom_point() +
theme_minimal()
ggplot(species, aes(x = pH, y = Species)) +
geom_boxplot() +
theme_minimal()4. Fit an appropriate model
Choosing a distribution
The response variable Species is a count: it can only take non-negative integer values (0, 1, 2, …). From this alone we can already deduce the appropriate distribution (the fact that this workshop is called Poisson GLMs may also offer a rather cryptic hint). A Normal distribution generates continuous values and allows negatives, so it’s a poor fit for count data. The Poisson distribution is designed for counts. It is described by a single parameter λ (lambda), which is both the mean and the expected count.
The distributional assumption for this model is:
\[Species_i \sim Poisson(\lambda_i)\]
where \(Species_i\) is the count of species in plot \(i\), modelled as coming from a Poisson distribution with mean \(\lambda_i\).
Choosing a link function
Because λ must always be positive (you can’t have a negative expected count), we can’t fit a straight line directly to λ — the line could go below zero. The log link solves this: it maps the linear predictor (which can be any value) onto the log scale, where values can range freely, and then the inverse (exp()) maps back to always-positive predictions.
Keep in mind(!): the link function doesn’t transform the species counts — those never change. We’re only changing the linear predictor. This is a super common misunderstanding with GLMs and link functions.
The equation for the linear predictor is:
\[log(\lambda_i) = \beta_0 + \beta_1 \times Biomass_i\]
where \(\beta_0\) is the intercept (the log expected species count when Biomass = 0) and \(\beta_1\) is the slope for biomass on the log scale.
Which explanatory variables to include?
Spare a moment’s thought for how complex the process behind the number of plant species in a plot of land actually is. How much sunlight does each plot get? How much rainfall? What seeds were already present in the soil before data collection started? Why were those seeds there and not others? Why did some of them germinate in time to be counted?
We have no data to explain any of that variation — yet it’s all still there, and it’s what we’re asking the stochastic part of the model to absorb. The deterministic part we can actually study is the role of plant biomass and soil pH.
Fitting the model
As always, fitting the model is the easiest part. We start with just Biomass as a predictor to keep things simple before adding pH. In R, a GLM is fitted with glm() rather than lm(). The family argument specifies both the distribution and the link function:
species_mod <- glm(Species ~ Biomass,
family = poisson(link = "log"),
data = species)5. Diagnose the model and check assumptions
The same four diagnostic plots apply. However, the interpretation of two of them changes for a Poisson GLM:
- The normal Q-Q plot: for a linear model we check whether residuals are approximately Normal. A Poisson GLM doesn’t assume Normal residuals, so this plot has very limited use here and can mostly be ignored.
- The residuals vs fitted and scale-location plots: these still check for patterns in the residuals, but they also pick up overdispersion, which is a Poisson-specific concern.
par(mfrow = c(2, 2))
plot(species_mod)
par(mfrow = c(1, 1))Question 1. What does the residuals vs fitted plot show?
The plot shows a funnel: residuals are small and fairly well-behaved at low fitted values but spread out considerably at higher fitted values. This is a sign that the variation in the data is larger than a Poisson model expects, which is known as overdispersion. A well-fitting Poisson model should show roughly equal spread throughout.
Question 2. What should you conclude from the normal Q-Q plot for this Poisson GLM?
The normality of errors assumption belongs to linear models, not Poisson GLMs. A Poisson GLM assumes counts follow a Poisson distribution, not that residuals are Normal. The Q-Q plot here checks something we never assumed in the first place, so it doesn’t tell us much. Focus on the residuals vs fitted and scale-location plots instead.
Question 3. What does the scale-location plot show?
The scale-location plot reinforces the picture from the residuals vs fitted plot: the standardised residuals grow in size as fitted values increase. Both plots are pointing to the same problem: the variance in the data is growing faster than a Poisson model expects.
Question 4. What does the residuals vs leverage plot show?
No observations exceed a Cook’s distance of 1, which would be the threshold for serious concern. However, a few sit uncomfortably close to the boundary — we’re not exactly jumping for joy with those Cook’s distances. The values are fine, but it’s worth noting. The bigger problem here is overdispersion, not influential observations.
Checking for overdispersion
The residual plots suggested overdispersion. We can put a number on it using the model summary. Run summary(species_mod) and find the residual deviance and the residual degrees of freedom in the output. Dividing residual deviance by residual degrees of freedom gives the dispersion statistic. For a Poisson model this should be close to 1; values well above 1 indicate overdispersion.
summary(species_mod)Question 5. Calculate the dispersion statistic for this model. What does the value tell you?
From the summary output: residual deviance ≈ 433.6 on 88 degrees of freedom.
433.6 / 88 # ≈ 4.9
We have a whopping dispersion of 4.9. My general rule of thumb is that I start getting concerned somewhere in the 1.5 to 1.8 region. 4.9 is doomsday. As a consequence, the standard errors for the coefficients will be artificially small, making p-values lower than they should be and confidence intervals too narrow. The model as it stands should be treated with serious caution.
6. Summarise the results
Despite the overdispersion, we’ll use this model to practise making predictions on the link scale and on the response scale. We’ll fix the overdispersion in the next section.
From summary(species_mod), the fitted equation on the log scale is approximately:
\[log(\hat{\lambda}_i) = 3.55 + (-0.20) \times Biomass_i\]
Predictions on the link scale
To predict on the link scale, substitute the biomass value into the equation directly.
Question 6. On the log scale, what is the predicted species count for a plot with 0 kg of biomass? And for a plot with 5 kg of biomass?
3.55 + (-0.20) * 0 # = 3.55 (at 0 kg biomass)
3.55 + (-0.20) * 5 # = 2.55 (at 5 kg biomass)
These predictions are on the log scale. A value of 3.55 doesn’t mean 3.55 species; it means log(expected species) = 3.55. To get the actual expected species count, we need to back-transform.
Predictions on the response scale
To convert a prediction from the log scale to the expected species count, apply exp() to the log-scale value.
Question 7. On the response scale, what is the expected species count for plots with 0, 5, and 10 kg of biomass?
exp(3.55 + (-0.20) * 0) # ≈ 34.8 species at 0 kg
exp(3.55 + (-0.20) * 5) # ≈ 12.8 species at 5 kg
exp(3.55 + (-0.20) * 10) # ≈ 4.7 species at 10 kg
According to this model, expected species count drops from roughly 35 at zero biomass to roughly 5 at 10 kg. The relationship isn’t linear on the response scale: equal increases in biomass produce proportionally smaller reductions in species count as biomass grows. This is a direct consequence of the log link.
Plotting the predictions
Rather than calculating one prediction at a time, we can generate a smooth prediction curve across the full range of biomass values, in the same way we did for linear models: create a fake dataset, run predict(), back-transform, and plot.
Because the model uses a log link, we need to exponentiate the predictions. We also need to back-transform the confidence interval limits, and we do this after adding and subtracting the standard error, not before.
fake <- data.frame(Biomass = seq(from = min(species$Biomass),
to = max(species$Biomass),
length.out = 20))
preds <- predict(species_mod, newdata = fake, se.fit = TRUE)
fake$fit <- exp(preds$fit)
fake$low <- exp(preds$fit - 1.96 * preds$se.fit)
fake$upp <- exp(preds$fit + 1.96 * preds$se.fit)
Question 8. Using the fake dataset above, create a figure showing the predicted species count against biomass, with a 95% confidence band. Use geom_ribbon() for the band and geom_line() for the fitted line.
ggplot(fake) +
geom_ribbon(aes(x = Biomass, ymin = low, ymax = upp), fill = "grey80") +
geom_line(aes(x = Biomass, y = fit)) +
labs(x = "Biomass (kg)",
y = "Expected species count") +
theme_minimal()
The curve isn’t a straight line on the response scale because of the log link. The confidence band widens at higher biomass values, partly because there are fewer observations there and partly because the overdispersion means the model is less certain than it should be.
7. Interpret and draw conclusions
Based on this model, biomass has a negative association with species count: plots with more plant biomass tend to have fewer plant species. However, the severe overdispersion (dispersion ≈ 4.9) means we shouldn’t place much confidence in the specific p-values or the width of the confidence intervals. The model as fitted isn’t well calibrated.
Resolving the violations
The overdispersion in the first model may partly reflect a missing predictor. We also set out to describe the association with both Biomass and pH, not just Biomass. Adding pH to the model may absorb some of the unexplained variation and reduce overdispersion.
Run a Poisson GLM with both Biomass and pH:
species_mod2 <- glm(Species ~ Biomass + pH,
family = poisson(link = "log"),
data = species)Diagnose the new model
With every new model, we diagnose before we interpret.
par(mfrow = c(2, 2))
plot(species_mod2)
par(mfrow = c(1, 1))
summary(species_mod2)Question 9. Do the diagnostic plots improve compared to the first model? Calculate the new dispersion statistic and interpret it.
The residual plots improve substantially. The funnel in the residuals vs fitted plot is much less pronounced. From the summary: residual deviance ≈ 77.1 on 86 degrees of freedom, giving a dispersion of 77.1 / 86 ≈ 0.9. This is very close to 1 and gives no cause for concern. Including pH absorbed the extra variation that was making the first model look overdispersed.
Note that underdispersion (values below 1) is generally less of a problem than overdispersion, though it can indicate the model is slightly over-fitted.
Interpret the parameters
Run summary(species_mod2) and look at the coefficient table.
summary(species_mod2)Question 10. Which pH level is the reference (baseline) category in this model, and how can you tell?
R sets the reference category to the first level alphabetically unless you specify otherwise. Here that is “High”. Because “High” is the reference, there’s no separate pHhigh coefficient: its effect is part of the intercept. The pHlow and pHmid coefficients are the estimated differences from “High” on the log scale.
Question 11. The Biomass coefficient is approximately −0.16. What does this mean on the log scale, and what does it imply on the response scale?
On the log scale, the slope is −0.16 per kg of biomass. To convert to the response scale, exponentiate: exp(−0.16) ≈ 0.85. This means that for each additional kg of biomass, the expected species count is multiplied by 0.85, a reduction of about 15%, holding pH constant. This is how slopes work on a log link: they are multiplicative on the response scale, not additive.
Question 12. The pHlow coefficient is approximately −1.13. What does this represent?
The pHlow coefficient is a difference on the log scale between low-pH plots and the reference category (high-pH plots), holding biomass constant. To interpret it on the response scale: exp(−1.13) ≈ 0.32, meaning low-pH plots are expected to have roughly 32% as many species as comparable high-pH plots. Similarly, pHmid ≈ −0.47, so mid-pH plots have exp(−0.47) ≈ 0.62: about 62% as many species as high-pH plots.
Plot the model fit
In the interest of ending on an easy note, let’s use ggeffects and have it do all the annoying coding for us:
library(ggeffects)
bio_pred <- ggpredict(species_mod2, terms = "Biomass")
plot(bio_pred)
ph_pred <- ggpredict(species_mod2, terms = "pH")
plot(ph_pred)These plots show the predicted relationship between each predictor and the expected species count, with the other predictor held at its average. The predictions are on the response scale (expected counts), making them directly interpretable.
Assumption table
| Assumption | Description | Example of violation |
|---|---|---|
| 1. Validity | The data can answer the research question. | Using a count of species in a plot to measure overall ecosystem diversity, when diversity also includes abundance and evenness. |
| 2. Representativeness | The sample reflects the population of interest. | Plots selected near roads because they were easiest to reach may not represent remote areas. |
| 3. Additivity | Predictor effects are additive and independent. | The effect of biomass on species count may differ between low and high pH soils. |
| 4. Linearity | The relationship between predictors and log(λ) is linear. | Species count declining steeply at first then levelling off, rather than following the log-linear form. |
| 5. Independence of errors | Observations are independent. | Plots that are physically close to each other may share species, violating independence. |
| 6. Equal variance of errors | Relevant for linear models; for Poisson GLMs, dispersion replaces this check. | Residuals spreading out at higher fitted values (funnel pattern), indicating overdispersion. |
| 7. Normality of errors | Relevant for linear models; not assumed for Poisson GLMs. | Not applicable here: the Q-Q plot is largely uninformative for Poisson residuals. |
| 8. Dispersion | For Poisson GLMs, variance should equal the mean (λ). Residual deviance / residual df should be close to 1. | Residual deviance / df ≈ 4.9 in the first model: severe overdispersion, resolved by including pH. |