Workshop 9: Poisson GLMs

BI3010 Statistics for Biologists

Author

Roos & Pinard

Published

16 May 2026

Glossary / 词汇表
The GLM framework
Generalised Linear Model (GLM) 广义线性模型
An extension of the linear model that allows the response variable to follow distributions other than Normal. A GLM has three components: a distribution family (describing how the response is generated), a link function (connecting the linear predictor to the mean of that distribution), and the linear predictor itself (the equation of predictors and coefficients).
线性模型的扩展,允许响应变量遵循正态分布以外的分布。GLM由三个部分组成:分布族(描述响应变量如何产生)、连接函数(将线性预测子与该分布的均值相连接)以及线性预测子本身(预测变量与系数的方程)。
Distributions and link functions
Poisson distribution 泊松分布
A probability distribution for count data: non-negative integers (0, 1, 2, ...). It is described by a single parameter λ (lambda), which is both the mean and the variance of the distribution. The Poisson distribution is the natural choice when modelling species counts, event frequencies, or any response that cannot be negative or fractional.
用于计数数据(非负整数:0, 1, 2, ...)的概率分布,由单一参数λ(lambda)描述,该参数同时是分布的均值和方差。当对物种数量、事件频率或任何不能为负数或小数的响应变量建模时,泊松分布是自然之选。
Link function 连接函数
A function that connects the linear predictor (which can take any value from −∞ to +∞) to the mean of the response distribution (which may be constrained, e.g. counts must be positive). The link function ensures that model predictions respect the natural constraints of the response. The link function transforms the linear predictor, not the observed data.
将线性预测子(可取从−∞到+∞的任意值)与响应分布的均值(可能受到约束,例如计数必须为正)相连接的函数。连接函数确保模型预测值符合响应变量的自然约束。连接函数变换的是线性预测子,而非观测数据本身。
Log link 对数连接
The default link function for Poisson GLMs. On the link scale, the linear predictor is log(λ), so it can range from −∞ to +∞. Applying the inverse (exponentiating with exp()) converts predictions back to the response scale, where they are always positive. Slopes on the log scale describe multiplicative changes on the response scale.
泊松GLM的默认连接函数。在连接尺度上,线性预测子为log(λ),可从−∞到+∞取任意值。通过取逆(使用exp()指数化)将预测值转换回响应尺度,在该尺度上始终为正值。对数尺度上的斜率描述的是响应尺度上的乘法变化。
Linear predictor 线性预测子
The linear combination of predictors and coefficients: β₀ + β₁x₁ + β₂x₂ + .... In a GLM, the link function connects the linear predictor to the expected value of the response. The linear predictor itself is always linear in the coefficients, even when the relationship between predictors and response is not linear on the response scale.
预测变量与系数的线性组合:β₀ + β₁x₁ + β₂x₂ + ...。在GLM中,连接函数将线性预测子与响应变量的期望值相连接。即使预测变量与响应变量在响应尺度上的关系不是线性的,线性预测子本身在系数上始终是线性的。
Diagnostics and interpretation
Overdispersion 过度离散
When the observed variation in the data is greater than the distribution assumes. For a Poisson GLM, the variance is assumed to equal the mean (λ). Overdispersion occurs when the variance is much larger than the mean. It makes standard errors too small, p-values too low, and confidence intervals too narrow. Checked by dividing residual deviance by residual degrees of freedom: values well above 1 indicate overdispersion.
当数据中观测到的变异大于分布所假设的变异时出现的问题。对于泊松GLM,方差被假定等于均值(λ)。当方差远大于均值时,就会出现过度离散。它使标准误差偏小,p值偏低,置信区间偏窄。通过将残差偏差除以残差自由度来检验:远大于1的值表明存在过度离散。
Back-transformation 反变换
Converting a prediction from the link scale back to the response scale. For a log link, this means applying exp() to the predicted value. Back-transformation is needed when reporting or plotting predictions in interpretable units (e.g. expected species counts rather than log-expected species counts). Confidence intervals must also be back-transformed after being calculated on the link scale.
将预测值从连接尺度转换回响应尺度。对于对数连接,这意味着对预测值应用exp()。当以可解释单位(例如预期物种数而非对数预期物种数)报告或绘制预测值时,需要进行反变换。置信区间也必须在连接尺度上计算后再进行反变换。
Residual deviance 残差偏差
A measure of how well the fitted model describes the data, analogous to the residual sum of squares in a linear model. Reported in the summary() output alongside residual degrees of freedom. For a Poisson GLM, dividing residual deviance by residual degrees of freedom gives the dispersion statistic: a value close to 1 is expected, values well above 1 indicate overdispersion.
衡量拟合模型对数据描述程度的指标,类似于线性模型中的残差平方和。在summary()输出中与残差自由度一起报告。对于泊松GLM,将残差偏差除以残差自由度得到离散统计量:接近1的值是正常的,远大于1的值表明存在过度离散。

Learning Objectives

Learning objectives

By the end of this workshop you should be able to:

  1. Explain why count data cannot be modelled with a Normal distribution and when the Poisson distribution is more appropriate.
  2. Describe the three components of a GLM: the distribution family, the link function, and the linear predictor.
  3. Fit a Poisson GLM in R using glm() and check for overdispersion using the dispersion statistic.
  4. Interpret model coefficients on the log scale and back-transform predictions to the response scale using exp().
  5. Diagnose a Poisson GLM and explain why the normal Q-Q plot is largely uninformative for Poisson residuals.

Workshop structure

The Analytical Workflow

As always, we work through the same seven-step workflow:

  1. Formulate a research question.
  2. Perform exploratory data analysis (EDA).
  3. Identify any hidden assumptions.
  4. Fit an appropriate model [Focus of today].
  5. Diagnose the model and check assumptions [Focus of today].
  6. Summarise the results [Focus of today].
  7. Interpret and draw conclusions.

All previous workshops used linear models with a Normal distribution for the response. Today we extend that to Generalised Linear Models (GLMs), which allow the response variable to follow a different distribution. The specific case here is count data, which calls for a Poisson GLM.

1. Formulate a research question

The dataset for this workshop comes from a study of plant diversity across 90 plots of land. Download it below and save it to your BI3010/data/ folder.

Download species.txt

Each row is one plot. The variables are:

  • Species: the number of plant species recorded in the plot (a count).
  • Biomass: the total weight of plants in the plot, in kg.
  • pH: the soil acidity category ("High", "Mid", or "Low").

The researchers wanted to describe how species count is associated with plant biomass and soil pH. This is a predictive question: no causal claim is being made about either predictor.

Load the data:

species <- read.table("Data/species.txt", header = TRUE)

2. Perform exploratory data analysis

Spend a few minutes exploring the dataset. Check the structure, look at distributions, and plot the response against each predictor. Useful functions: str(), summary(), geom_histogram(), table(), ggplot().

str(species)
summary(species)
ggplot(species, aes(x = Species)) + geom_histogram() + theme_minimal()
ggplot(species, aes(x = Biomass)) + geom_histogram() + theme_minimal()
table(species$pH)
ggplot(species, aes(x = Biomass, y = Species)) +
  geom_point() +
  theme_minimal()

ggplot(species, aes(x = pH, y = Species)) +
  geom_boxplot() +
  theme_minimal()

3. Identify hidden assumptions

Before fitting any model, think about the hidden assumptions. The standard ones apply: validity, representativeness, additivity, and linearity. But there’s also a new one specific to Poisson GLMs: dispersion. We’ll return to this after fitting.

4. Fit an appropriate model

Choosing a distribution

The response variable Species is a count: it can only take non-negative integer values (0, 1, 2, …). From this alone we can already deduce the appropriate distribution (the fact that this workshop is called Poisson GLMs may also offer a rather cryptic hint). A Normal distribution generates continuous values and allows negatives, so it’s a poor fit for count data. The Poisson distribution is designed for counts. It is described by a single parameter λ (lambda), which is both the mean and the expected count.

The distributional assumption for this model is:

\[Species_i \sim Poisson(\lambda_i)\]

where \(Species_i\) is the count of species in plot \(i\), modelled as coming from a Poisson distribution with mean \(\lambda_i\).

Which explanatory variables to include?

Spare a moment’s thought for how complex the process behind the number of plant species in a plot of land actually is. How much sunlight does each plot get? How much rainfall? What seeds were already present in the soil before data collection started? Why were those seeds there and not others? Why did some of them germinate in time to be counted?

We have no data to explain any of that variation — yet it’s all still there, and it’s what we’re asking the stochastic part of the model to absorb. The deterministic part we can actually study is the role of plant biomass and soil pH.

Fitting the model

As always, fitting the model is the easiest part. We start with just Biomass as a predictor to keep things simple before adding pH. In R, a GLM is fitted with glm() rather than lm(). The family argument specifies both the distribution and the link function:

species_mod <- glm(Species ~ Biomass,
                   family = poisson(link = "log"),
                   data = species)

5. Diagnose the model and check assumptions

The same four diagnostic plots apply. However, the interpretation of two of them changes for a Poisson GLM:

  • The normal Q-Q plot: for a linear model we check whether residuals are approximately Normal. A Poisson GLM doesn’t assume Normal residuals, so this plot has very limited use here and can mostly be ignored.
  • The residuals vs fitted and scale-location plots: these still check for patterns in the residuals, but they also pick up overdispersion, which is a Poisson-specific concern.
par(mfrow = c(2, 2))
plot(species_mod)
par(mfrow = c(1, 1))

Question 1. What does the residuals vs fitted plot show?

The plot shows a funnel: residuals are small and fairly well-behaved at low fitted values but spread out considerably at higher fitted values. This is a sign that the variation in the data is larger than a Poisson model expects, which is known as overdispersion. A well-fitting Poisson model should show roughly equal spread throughout.

Question 2. What should you conclude from the normal Q-Q plot for this Poisson GLM?

The normality of errors assumption belongs to linear models, not Poisson GLMs. A Poisson GLM assumes counts follow a Poisson distribution, not that residuals are Normal. The Q-Q plot here checks something we never assumed in the first place, so it doesn’t tell us much. Focus on the residuals vs fitted and scale-location plots instead.

Question 3. What does the scale-location plot show?

The scale-location plot reinforces the picture from the residuals vs fitted plot: the standardised residuals grow in size as fitted values increase. Both plots are pointing to the same problem: the variance in the data is growing faster than a Poisson model expects.

Question 4. What does the residuals vs leverage plot show?

No observations exceed a Cook’s distance of 1, which would be the threshold for serious concern. However, a few sit uncomfortably close to the boundary — we’re not exactly jumping for joy with those Cook’s distances. The values are fine, but it’s worth noting. The bigger problem here is overdispersion, not influential observations.

Checking for overdispersion

The residual plots suggested overdispersion. We can put a number on it using the model summary. Run summary(species_mod) and find the residual deviance and the residual degrees of freedom in the output. Dividing residual deviance by residual degrees of freedom gives the dispersion statistic. For a Poisson model this should be close to 1; values well above 1 indicate overdispersion.

summary(species_mod)

Question 5. Calculate the dispersion statistic for this model. What does the value tell you?

From the summary output: residual deviance ≈ 433.6 on 88 degrees of freedom.

433.6 / 88  # ≈ 4.9

We have a whopping dispersion of 4.9. My general rule of thumb is that I start getting concerned somewhere in the 1.5 to 1.8 region. 4.9 is doomsday. As a consequence, the standard errors for the coefficients will be artificially small, making p-values lower than they should be and confidence intervals too narrow. The model as it stands should be treated with serious caution.

6. Summarise the results

Despite the overdispersion, we’ll use this model to practise making predictions on the link scale and on the response scale. We’ll fix the overdispersion in the next section.

From summary(species_mod), the fitted equation on the log scale is approximately:

\[log(\hat{\lambda}_i) = 3.55 + (-0.20) \times Biomass_i\]

Predictions on the response scale

To convert a prediction from the log scale to the expected species count, apply exp() to the log-scale value.

Question 7. On the response scale, what is the expected species count for plots with 0, 5, and 10 kg of biomass?

exp(3.55 + (-0.20) * 0)   # ≈ 34.8 species at 0 kg
exp(3.55 + (-0.20) * 5)   # ≈ 12.8 species at 5 kg
exp(3.55 + (-0.20) * 10)  # ≈  4.7 species at 10 kg

According to this model, expected species count drops from roughly 35 at zero biomass to roughly 5 at 10 kg. The relationship isn’t linear on the response scale: equal increases in biomass produce proportionally smaller reductions in species count as biomass grows. This is a direct consequence of the log link.

Plotting the predictions

Rather than calculating one prediction at a time, we can generate a smooth prediction curve across the full range of biomass values, in the same way we did for linear models: create a fake dataset, run predict(), back-transform, and plot.

Because the model uses a log link, we need to exponentiate the predictions. We also need to back-transform the confidence interval limits, and we do this after adding and subtracting the standard error, not before.

fake <- data.frame(Biomass = seq(from = min(species$Biomass),
                                 to   = max(species$Biomass),
                                 length.out = 20))

preds       <- predict(species_mod, newdata = fake, se.fit = TRUE)
fake$fit    <- exp(preds$fit)
fake$low    <- exp(preds$fit - 1.96 * preds$se.fit)
fake$upp    <- exp(preds$fit + 1.96 * preds$se.fit)

Question 8. Using the fake dataset above, create a figure showing the predicted species count against biomass, with a 95% confidence band. Use geom_ribbon() for the band and geom_line() for the fitted line.

ggplot(fake) +
  geom_ribbon(aes(x = Biomass, ymin = low, ymax = upp), fill = "grey80") +
  geom_line(aes(x = Biomass, y = fit)) +
  labs(x = "Biomass (kg)",
       y = "Expected species count") +
  theme_minimal()

The curve isn’t a straight line on the response scale because of the log link. The confidence band widens at higher biomass values, partly because there are fewer observations there and partly because the overdispersion means the model is less certain than it should be.

7. Interpret and draw conclusions

Based on this model, biomass has a negative association with species count: plots with more plant biomass tend to have fewer plant species. However, the severe overdispersion (dispersion ≈ 4.9) means we shouldn’t place much confidence in the specific p-values or the width of the confidence intervals. The model as fitted isn’t well calibrated.

Resolving the violations

The overdispersion in the first model may partly reflect a missing predictor. We also set out to describe the association with both Biomass and pH, not just Biomass. Adding pH to the model may absorb some of the unexplained variation and reduce overdispersion.

Run a Poisson GLM with both Biomass and pH:

species_mod2 <- glm(Species ~ Biomass + pH,
                    family = poisson(link = "log"),
                    data = species)

Diagnose the new model

With every new model, we diagnose before we interpret.

par(mfrow = c(2, 2))
plot(species_mod2)
par(mfrow = c(1, 1))

summary(species_mod2)

Question 9. Do the diagnostic plots improve compared to the first model? Calculate the new dispersion statistic and interpret it.

The residual plots improve substantially. The funnel in the residuals vs fitted plot is much less pronounced. From the summary: residual deviance ≈ 77.1 on 86 degrees of freedom, giving a dispersion of 77.1 / 86 ≈ 0.9. This is very close to 1 and gives no cause for concern. Including pH absorbed the extra variation that was making the first model look overdispersed.

Note that underdispersion (values below 1) is generally less of a problem than overdispersion, though it can indicate the model is slightly over-fitted.

Interpret the parameters

Run summary(species_mod2) and look at the coefficient table.

summary(species_mod2)

Question 10. Which pH level is the reference (baseline) category in this model, and how can you tell?

R sets the reference category to the first level alphabetically unless you specify otherwise. Here that is “High”. Because “High” is the reference, there’s no separate pHhigh coefficient: its effect is part of the intercept. The pHlow and pHmid coefficients are the estimated differences from “High” on the log scale.

Question 11. The Biomass coefficient is approximately −0.16. What does this mean on the log scale, and what does it imply on the response scale?

On the log scale, the slope is −0.16 per kg of biomass. To convert to the response scale, exponentiate: exp(−0.16) ≈ 0.85. This means that for each additional kg of biomass, the expected species count is multiplied by 0.85, a reduction of about 15%, holding pH constant. This is how slopes work on a log link: they are multiplicative on the response scale, not additive.

Question 12. The pHlow coefficient is approximately −1.13. What does this represent?

The pHlow coefficient is a difference on the log scale between low-pH plots and the reference category (high-pH plots), holding biomass constant. To interpret it on the response scale: exp(−1.13) ≈ 0.32, meaning low-pH plots are expected to have roughly 32% as many species as comparable high-pH plots. Similarly, pHmid ≈ −0.47, so mid-pH plots have exp(−0.47) ≈ 0.62: about 62% as many species as high-pH plots.

Plot the model fit

In the interest of ending on an easy note, let’s use ggeffects and have it do all the annoying coding for us:

library(ggeffects)

bio_pred <- ggpredict(species_mod2, terms = "Biomass")
plot(bio_pred)

ph_pred <- ggpredict(species_mod2, terms = "pH")
plot(ph_pred)

These plots show the predicted relationship between each predictor and the expected species count, with the other predictor held at its average. The predictions are on the response scale (expected counts), making them directly interpretable.


Assumption table

Assumption Description Example of violation
1. Validity The data can answer the research question. Using a count of species in a plot to measure overall ecosystem diversity, when diversity also includes abundance and evenness.
2. Representativeness The sample reflects the population of interest. Plots selected near roads because they were easiest to reach may not represent remote areas.
3. Additivity Predictor effects are additive and independent. The effect of biomass on species count may differ between low and high pH soils.
4. Linearity The relationship between predictors and log(λ) is linear. Species count declining steeply at first then levelling off, rather than following the log-linear form.
5. Independence of errors Observations are independent. Plots that are physically close to each other may share species, violating independence.
6. Equal variance of errors Relevant for linear models; for Poisson GLMs, dispersion replaces this check. Residuals spreading out at higher fitted values (funnel pattern), indicating overdispersion.
7. Normality of errors Relevant for linear models; not assumed for Poisson GLMs. Not applicable here: the Q-Q plot is largely uninformative for Poisson residuals.
8. Dispersion For Poisson GLMs, variance should equal the mean (λ). Residual deviance / residual df should be close to 1. Residual deviance / df ≈ 4.9 in the first model: severe overdispersion, resolved by including pH.