Workshop 10: Bernoulli GLMs

BI3010 Statistics for Biologists

Author

Roos & Pinard

Published

16 May 2026

Glossary / 词汇表
Probabilities and odds
Bernoulli distribution 伯努利分布
A probability distribution for binary outcomes: observations can only take the value 0 (failure) or 1 (success). The distribution has a single parameter p, the probability of success. It is a special case of the Binomial distribution with one trial. In R, a Bernoulli GLM is fitted using family = binomial(link = "logit"), which handles 0/1 responses automatically.
用于二元结果的概率分布:观测值只能取0(失败)或1(成功)。该分布有一个参数p,即成功的概率。它是试验次数为1的二项分布的特例。在R中,伯努利GLM通过family = binomial(link = "logit")拟合,可自动处理0/1响应变量。
Odds 几率
The ratio of the probability of success to the probability of failure: p / (1 − p). If the probability of success is 0.75, the odds are 0.75 / 0.25 = 3, meaning success is three times as likely as failure. Odds range from 0 (impossible) to +∞ (certain), unlike probabilities which range from 0 to 1.
成功概率与失败概率之比:p / (1 − p)。若成功概率为0.75,则几率为0.75 / 0.25 = 3,表示成功是失败可能性的三倍。与概率(范围0到1)不同,几率的范围从0(不可能)到+∞(必然)。
Log-odds 对数几率
The natural logarithm of the odds: log(p / (1 − p)). Also called the logit of p. Log-odds can range from −∞ (probability of 0) to +∞ (probability of 1), making them suitable as the linear predictor in a Bernoulli GLM. A log-odds of 0 corresponds to a probability of 0.5; positive values correspond to probabilities above 0.5; negative values to probabilities below 0.5.
几率的自然对数:log(p / (1 − p)),也称为p的logit。对数几率的范围从−∞(概率为0)到+∞(概率为1),因此适合作为伯努利GLM的线性预测子。对数几率为0对应概率0.5;正值对应高于0.5的概率;负值对应低于0.5的概率。
The logit link
Logit link logit连接
The link function used in a Bernoulli GLM. It transforms the probability p to the log-odds scale: logit(p) = log(p / (1 − p)). This allows the linear predictor to range from −∞ to +∞ while keeping predicted probabilities between 0 and 1. The logit link transforms the linear predictor, not the observed 0/1 data.
伯努利GLM中使用的连接函数。它将概率p变换到对数几率尺度:logit(p) = log(p / (1 − p))。这使得线性预测子可以在−∞到+∞范围内取值,同时将预测概率保持在0和1之间。logit连接变换的是线性预测子,而非观测到的0/1数据。
Back-transformation (logit) 反变换(logit)
Converting a prediction from the log-odds scale back to a probability. The inverse of the logit function is the logistic function, applied in R with plogis(). For a log-odds value of x, the corresponding probability is plogis(x) = exp(x) / (1 + exp(x)). Confidence interval limits must also be back-transformed after being calculated on the log-odds scale.
将预测值从对数几率尺度转换回概率。logit函数的逆函数是logistic函数,在R中通过plogis()应用。对于对数几率值x,对应的概率为plogis(x) = exp(x) / (1 + exp(x))。置信区间上下限也必须在对数几率尺度上计算后再进行反变换。
Diagnostics
Dispersion (Bernoulli) 离散度(伯努利)
Unlike the Poisson distribution, the Bernoulli distribution has no free dispersion parameter. The variance is entirely determined by the mean: var = p(1 − p). There is therefore no overdispersion to check for in a Bernoulli GLM: the residual deviance / degrees of freedom check used in Workshop 9 does not apply here.
与泊松分布不同,伯努利分布没有自由的离散参数。方差完全由均值决定:var = p(1 − p)。因此,伯努利GLM中无需检查过度离散,第9次研讨课中使用的残差偏差/自由度检验在此不适用。
plogis() plogis()函数
The R function that converts log-odds to a probability. plogis(0) = 0.5, plogis(2) ≈ 0.88, plogis(-2) ≈ 0.12. It is the inverse of qlogis() (which converts a probability to log-odds). Use it whenever you need to back-transform predictions or confidence interval limits from a Bernoulli GLM.
将对数几率转换为概率的R函数。plogis(0) = 0.5,plogis(2) ≈ 0.88,plogis(-2) ≈ 0.12。它是qlogis()(将概率转换为对数几率)的逆函数。每当需要将伯努利GLM的预测值或置信区间上下限进行反变换时,使用此函数。

Learning Objectives

Learning objectives

By the end of this workshop you should be able to:

  1. Explain why binary response data cannot be modelled with a Normal linear model and why a Bernoulli GLM is appropriate.
  2. Describe how the logit link function connects probabilities to log-odds, and convert between the two using plogis().
  3. Fit a Bernoulli GLM in R using glm() with family = binomial(link = "logit").
  4. Interpret the two-banded pattern in Bernoulli GLM residual plots and explain why overdispersion is not a concern.
  5. Interpret model coefficients on the log-odds scale and calculate predicted probabilities for specific combinations of predictor values.

Workshop structure

The Analytical Workflow

As always, we work through the same seven-step workflow:

  1. Formulate a research question.
  2. Perform exploratory data analysis (EDA).
  3. Identify any hidden assumptions.
  4. Fit an appropriate model [Focus of today].
  5. Diagnose the model and check assumptions [Focus of today].
  6. Summarise the results [Focus of today].
  7. Interpret and draw conclusions.

In Workshop 9 we went from Normal linear models to Poisson GLMs for count data. Today we take one more step. The response variable is not a count but a binary outcome: each observation is either a 0 or a 1. You can’t model that with a Normal distribution without all kinds of ugly consequences, which we’ll get to shortly. The appropriate distribution is the Bernoulli, the appropriate link is the logit, and the whole thing is more commonly known as logistic regression.

1. Formulate a research question

The dataset for this workshop comes from a long-term study of Eurasian oystercatcher (Haematopus ostralegus) breeding at a coastal site. Download it below and save it to your BI3010/data/ folder.

Download oystercatcher.txt

Each row is one breeding pair in one season. The variables are:

  • success: whether the pair fledged at least one chick that season (1 = yes, 0 = no).
  • territory_quality: a continuous index of territory quality, scored from 1 (poor) to 10 (excellent) based on food availability and nest-site safety.
  • experience: whether the pair were first-time breeders ("First") or had bred at least once before ("Experienced").

The research question is predictive: which combinations of territory quality and breeding experience are associated with higher or lower probability of breeding success?

Load the data:

oystercatcher <- read.table("Data/oystercatcher.txt", header = TRUE)
oystercatcher$experience <- factor(oystercatcher$experience,
                                   levels = c("First", "Experienced"))

The second line sets the factor levels explicitly so that "First" is the reference (baseline) category in the model. This means the intercept will represent first-time breeders, and any experience coefficient will represent the difference for experienced breeders relative to first-time breeders.

2. Perform exploratory data analysis

The response variable is binary, so some of the usual EDA tools behave a bit differently. geom_histogram() on a 0/1 variable produces a two-bar chart (not very illuminating, but fine for checking the split between successes and failures). table() gives counts; mean() gives the proportion of successes.

str(oystercatcher)
summary(oystercatcher)
table(oystercatcher$success)
mean(oystercatcher$success)   # overall success rate
table(oystercatcher$experience, oystercatcher$success)

A raw scatter plot of 0s and 1s is deeply unhelpful: every point sits on one of two horizontal lines and you get no sense of how the data are distributed. Adding a small amount of vertical jitter helps enormously:

ggplot(oystercatcher, aes(x = territory_quality, y = success)) +
  geom_jitter(height = 0.05, alpha = 0.4) +
  labs(x = "Territory quality", y = "Breeding success (0/1)") +
  theme_minimal()
ggplot(oystercatcher, aes(x = experience, y = success)) +
  geom_bar(stat = "summary", fun = "mean") +
  labs(x = "Breeding experience", y = "Proportion successful") +
  theme_minimal()

3. Identify hidden assumptions

The usual hidden assumptions apply: validity (does success capture what we mean by breeding success?), representativeness (does this population generalise to oystercatchers elsewhere?), and independence (each pair must be a separate, independent observation; pairs sharing a territory or the same individual appearing across years would violate this).

Dispersion isn’t a concern for Bernoulli GLMs. The variance of a Bernoulli outcome is p(1 − p), which is completely determined by p, the probability of success. There’s no free dispersion parameter to estimate or check.

4. Fit an appropriate model

Why not use a linear model?

The response variable success can only be 0 or 1. A Normal linear model fitted to binary data has two problems. First, it will happily predict values below 0 and above 1, and you can’t have a probability of −0.2 or 1.3. Second, the spread of 0/1 residuals is largest when p ≈ 0.5 and shrinks near 0 or 1, so the equal variance of errors assumption is violated by design. The linear model will run without complaint. It’s just wrong.

Question 1. If you fitted lm(success ~ territory_quality, data = oystercatcher), which of the following would be a problem?

The linear model will run without error and produce coefficient estimates, but its predictions for territory quality at the extremes of the range will fall outside [0, 1]. For example, a predicted success probability of 1.3 or −0.2 has no meaning. The model also assumes equal spread of errors at all fitted values, but for a binary response the spread depends on the probability itself: it’s largest when p ≈ 0.5 and smallest near p = 0 or p = 1. Both problems disappear when we use a GLM with the logit link.

Fitting the model

In R, a Bernoulli GLM uses family = binomial(link = "logit"). The family name is binomial rather than bernoulli because Bernoulli is actually a special case of the Binomial (one trial per observation), and R covers both with the same family. As always, fitting the model is the easiest part. Start with territory_quality as the only predictor:

success_mod <- glm(success ~ territory_quality,
                   family = binomial(link = "logit"),
                   data = oystercatcher)

5. Diagnose the model and check assumptions

par(mfrow = c(2, 2))
plot(success_mod)
par(mfrow = c(1, 1))

The residual plots for a Bernoulli GLM look different from those of a linear model or Poisson GLM. Because each observation is either 0 or 1, the raw residuals cluster into two distinct bands in the residuals vs fitted plot: one band for observations where success = 1 and one for where success = 0. This two-banded appearance is expected and doesn’t indicate a problem.

Question 3. You notice the residuals vs fitted plot shows two distinct bands of points rather than a single cloud. What does this mean?

Think of it this way: every observation is exactly 0 or 1. There’s nowhere else for the points to go. The two bands are the residuals from the successes (1s) and failures (0s), not a sign of trouble, just an unavoidable consequence of a binary response. What you’re looking for is a systematic trend within each band. A curve or funnel inside the bands is a concern. Two flat parallel bands with no internal pattern is fine.

Question 4. Should you use the normal Q-Q plot to assess this Bernoulli GLM?

As with the Poisson GLM in Workshop 9, the Q-Q plot is largely uninformative here. Normality of residuals is an assumption of the Normal linear model, not of GLMs. A Bernoulli GLM assumes residuals follow the Bernoulli distribution, and the Q-Q plot doesn’t check that. Focus your diagnostic attention on the residuals vs fitted and scale-location plots, and on the residuals vs leverage plot for influential observations.

Question 5. Do you need to check for overdispersion in this Bernoulli GLM?

Overdispersion is a concern for Poisson GLMs but not for Bernoulli. Once you know p, the variance is exactly p(1 − p): there’s no free dispersion parameter to worry about. Honestly, after the doomsday dispersion values we saw in Workshop 9, it’s a relief. The residual deviance / degrees of freedom check doesn’t apply here.

6. Summarise the results

summary(success_mod)

The coefficient table gives estimates on the log-odds scale. From the output, the fitted equation is approximately:

\[logit(\hat{p}_i) = -2.14 + 0.44 \times quality_i\]

Predictions on the log-odds scale

Substitute a value of territory_quality directly into the equation to get the predicted log-odds.

Question 6. What is the predicted log-odds of success for a pair on a territory with quality = 4? And for quality = 8?

-2.14 + 0.44 * 4   # = -0.38  (quality = 4)
-2.14 + 0.44 * 8   # =  1.38  (quality = 8)

A log-odds of −0.38 is below zero, so the probability of success is below 0.5. A log-odds of +1.38 is above zero, so the probability is above 0.5. To get exact probabilities, back-transform with plogis().

Predictions on the probability scale

To convert log-odds to a probability, use plogis(). You can also work it out from the definition: p = exp(log-odds) / (1 + exp(log-odds)).

Question 7. Using plogis(), convert your log-odds predictions from Question 6 to probabilities. At what territory quality does the model predict a 50% chance of success?

plogis(-0.38)   # ≈ 0.41  (quality = 4: 41% chance of success)
plogis( 1.38)   # ≈ 0.80  (quality = 8: 80% chance of success)

For a 50% chance of success, we need the log-odds to equal 0:

-2.14 + 0.44 * quality = 0
quality = 2.14 / 0.44 ≈ 4.9

The model predicts that a territory quality of around 4.9 gives a 50% probability of success. Below that the odds favour failure; above it they favour success.

Plotting the probability curve

Create a fake dataset, generate predictions on the log-odds scale, then back-transform with plogis() to get predicted probabilities and confidence interval limits:

fake <- data.frame(territory_quality = seq(from = min(oystercatcher$territory_quality),
                                           to   = max(oystercatcher$territory_quality),
                                           length.out = 50))

preds       <- predict(success_mod, newdata = fake, se.fit = TRUE)
fake$fit    <- plogis(preds$fit)
fake$low    <- plogis(preds$fit - 1.96 * preds$se.fit)
fake$upp    <- plogis(preds$fit + 1.96 * preds$se.fit)

Question 8. Using the fake dataset, create a figure showing predicted probability of success against territory quality, with a 95% confidence band. Add the raw data points using geom_jitter().

ggplot() +
  geom_ribbon(data = fake,
              aes(x = territory_quality, ymin = low, ymax = upp),
              fill = "grey80") +
  geom_line(data = fake,
            aes(x = territory_quality, y = fit)) +
  geom_jitter(data = oystercatcher,
              aes(x = territory_quality, y = success),
              height = 0.03, alpha = 0.35) +
  labs(x = "Territory quality",
       y = "Probability of breeding success") +
  theme_minimal()

The result is an S-shaped (logistic) curve: probability rises steeply across the middle of the quality range and flattens out near 0 and 1. The S-shape is a direct consequence of the logit link, and it ensures predictions stay within [0, 1] across the entire range of territory quality.

7. Interpret and draw conclusions

The model predicts that breeding success increases with territory quality. The positive slope means that each additional unit of territory quality increases the log-odds of success by approximately 0.44. On the probability scale, the relationship isn’t linear: an increase from quality 2 to quality 3 has a smaller effect than an increase from quality 5 to quality 6 (where the curve is steepest).

Adding breeding experience

So far we have only included territory_quality. The research question also asks about experience. Add it to the model:

success_mod2 <- glm(success ~ territory_quality + experience,
                    family = binomial(link = "logit"),
                    data = oystercatcher)

Diagnose the new model

par(mfrow = c(2, 2))
plot(success_mod2)
par(mfrow = c(1, 1))

The diagnostic plots should look similar to the first model: two-banded residuals are still expected, and the Q-Q plot is still uninformative. Check the residuals vs leverage plot for any influential observations.

Summarise and interpret

summary(success_mod2)

Question 9. The experienceExperienced coefficient is approximately +2.0. What does this mean?

On the log-odds scale, the coefficient is the difference in log-odds between the two groups, holding all other predictors constant. To make this more intuitive, exponentiate it: exp(2.0) ≈ 7.4. Experienced breeders have roughly 7.4 times the odds of success compared to first-time breeders on the same territory. This doesn’t mean they are 7.4 times more likely to succeed. Odds and probability aren’t the same thing. People confuse them constantly, including in peer-reviewed papers. Don’t be one of them.

Question 10. Using your fitted equation from success_mod2, calculate the predicted probability of success for a first-time breeding pair on a territory with quality = 5, and for an experienced pair on the same territory.

With approximate coefficients: intercept ≈ −3.5, territory_quality ≈ 0.5, experienceExperienced ≈ 2.0.

# First-time breeder, quality = 5
log_odds_first <- -3.5 + 0.5 * 5 + 0
plogis(log_odds_first)   # plogis(-1.0) ≈ 0.27

# Experienced breeder, quality = 5
log_odds_exp <- -3.5 + 0.5 * 5 + 2.0
plogis(log_odds_exp)     # plogis(1.0) ≈ 0.73

At the same territory quality of 5, first-time breeders have a predicted success probability of about 27%, while experienced breeders have about 73%. The difference is large in probability terms even though the log-odds difference is the fixed value of 2.0 regardless of quality.

Plot the model fit

Let’s use ggeffects and have it do the tedious coding for us. When both predictors are included, ggpredict() with terms = c("territory_quality", "experience") shows separate probability curves for each experience level:

library(ggeffects)

preds2 <- ggpredict(success_mod2, terms = c("territory_quality", "experience"))
plot(preds2)

The two S-curves will be offset from each other: the experienced-breeder curve sits to the left of the first-time-breeder curve, meaning experienced breeders achieve the same probability of success at a lower territory quality.

A new challenger: antimicrobial peptides

If time allows, work through the following dataset. It comes from an in vitro study testing whether a novel antimicrobial peptide (AMP) can clear Staphylococcus aureus infections in a cell-culture model. AMPs are short proteins produced by the immune system that disrupt bacterial membranes. There is a lot of interest in them as alternatives to conventional antibiotics, partly because bacteria evolve resistance to them more slowly. Download the data below and save it to your BI3010/data/ folder.

Download amp.txt

The dataset contains 160 experimental replicates. Each row is one replicate, with the following variables:

  • cleared: whether the S. aureus infection was eliminated within 24 hours (1 = yes, 0 = no).
  • peptide_conc: concentration of the AMP applied, in µM (continuous, 1 to 20).
  • strain: whether the bacterial strain was susceptible to the peptide or carried an efflux pump gene associated with reduced AMP uptake ("Susceptible" / "Resistant").

The research question is predictive: which combinations of peptide concentration and strain type are associated with a higher probability of clearing the infection?

amp <- read.table("Data/amp.txt", header = TRUE)
amp$strain <- factor(amp$strain, levels = c("Susceptible", "Resistant"))

Go through the full seven-step workflow. A few things to keep in mind:

  1. The response is binary, so use jitter plots rather than raw scatter plots to visualise it. mean(amp$cleared) gives the overall clearance rate; tapply(amp$cleared, amp$strain, mean) gives it separately by strain.

  2. Think carefully about hidden assumptions. The cleared variable is a binary endpoint measured at 24 hours: what does that assume about the clearance process? Are there any representativeness concerns about using a cell-culture model to draw conclusions about clinical use?

  3. Use plogis() to back-transform any predictions from the log-odds scale to the probability scale.

Question 11. After completing your AMP analysis, summarise your main findings. How does peptide concentration affect the probability of clearing the infection, and does this differ between susceptible and resistant strains?

Higher peptide concentrations are associated with a higher probability of clearing the infection: the peptide_conc coefficient is positive on the log-odds scale. Exponentiating it gives the multiplicative change in odds per additional µM of peptide. For susceptible strains, the model predicts roughly 50% clearance probability at around 10 µM (the point where log-odds cross zero). Resistant strains require substantially higher concentrations to achieve the same probability, with the strainResistant coefficient being negative and corresponding to a reduction in the odds of clearance at any given concentration.

Key hidden assumptions worth flagging: “cleared within 24 hours” is an arbitrary endpoint that may miss slower-acting dynamics, and a cell-culture model excludes host immune responses, pharmacokinetics, and the polymicrobial environment of a real infection. Both concerns are squarely in the validity column of our assumption table.

What you can take away from this course

The skills you’ve learnt on this course aren’t actually “biology” skills. They’re data skills, and data skills go anywhere. Finance, healthcare, tech, government, sport, NGOs; any organisation sitting on data and trying to make sense of it needs people who can do what you can now do. That’s what “data analyst”, “quantitative researcher”, and “data scientist” actually mean in most job ads.

Here’s what to put on your CV:

Statistical methods

  • Linear regression
  • Interaction modelling
  • Generalised Linear Models: Poisson regression for count data, logistic regression for binary outcomes
  • Model diagnostics and assumption checking
  • Prediction and uncertainty quantification

Software

  • R and RStudio
  • ggplot2 for data visualisation

One-liner for applications:

Quantitative data analysis in R, including linear and generalised linear models (Poisson and logistic regression), model diagnostics, and data visualisation.

Logistic regression is worth mentioning specifically. It’s everywhere outside academia: customer churn, clinical outcomes, loan defaults, election forecasting. You can now run them.

You don’t need every detail memorised. You can always come back to this material for the details. You’ve fitted the models, read the diagnostics, and interpreted results in context.

Fin

That’s a wrap. This is the last workshop of the course, so you’re free to go.

Two final requests from me (Deon):

  • I know from past years that the majority of you will forget the details of the course and methods a few months after the course. That’s fine. What I do want you to try and hold onto is this: data and models (statistical, machine learning or AI) aren’t magic boxes that unravel the Truth or reveal the secrets of the universe. A model is just a bunch of code and equations. Don’t be taken in by people throwing data and results at you, whether that’s on the internet, in the news, or in peer-reviewed papers. Ask what assumptions were made. Ask whether the data can actually answer the question being claimed. Be sceptical.

  • If you found any typos, confusing sections, or parts of the workshops or lectures that just didn’t land: please tell me. I genuinely think this material matters, and if I’m not communicating it well, I want to know. Feel free to email or message however you prefer. I’d really appreciate any feedback (just please don’t be needlessly cruel about it).


Assumption table

Assumption Description Example of violation
1. Validity The data can answer the research question. Counting a nest as successful if it survived to hatching but not checking whether chicks actually fledged.
2. Representativeness The sample reflects the population of interest. Data collected only from pairs on the best territories because those were easiest to monitor.
3. Additivity The effects of territory quality and experience are independent and additive on the log-odds scale. If experience matters more on poor territories than on good ones, the two predictors interact.
4. Linearity The relationship between each continuous predictor and the log-odds of success is linear. Success probability rising quickly at low quality then levelling off, rather than following the logistic form.
5. Independence of errors Each observation is an independent breeding attempt. The same individual appearing in multiple years without this being accounted for.
6. Equal variance of errors Not assumed for Bernoulli GLMs. The variance is p(1 − p), fixed once p is known. Not applicable: heteroscedasticity is built into the Bernoulli distribution and is not a violation.
7. Normality of errors Not assumed for Bernoulli GLMs. Not applicable: the Q-Q plot is uninformative and should not be used to assess model fit.
8. Dispersion Not applicable. The Bernoulli distribution has no free dispersion parameter to check. Unlike Poisson GLMs, overdispersion cannot occur with a true Bernoulli response.