Workshop 5: Linear models with a categorical predictor

BI3010 Statistics for Biologists

Author

Roos & Pinard

Published

16 May 2026

Learning Objectives

Learning objectives
  1. Understand how a categorical predictor is encoded as a dummy variable in a linear model.
  2. Identify the reference group and interpret \(\beta_0\) and \(\beta_1\) in a categorical model.
  3. Fit a linear model with a categorical predictor using lm().
  4. Generate model predictions with 95% confidence intervals and visualise them.
  5. Reflect critically on hidden assumptions that formal diagnostics can’t detect.

Workshop structure

The Analytical Workflow

As always, we’re using the workflow below to structure the workshops:

  1. Formulate a research question.
  2. Perform exploratory data analysis.
  3. Identify any hidden assumptions.
  4. Fit an appropriate model [Focus of today].
  5. Diagnose the model and check assumptions.
  6. Summarise the results.
  7. Interpret and provide inferences.

1. Formulating a research question

This workshop introduces fitting a linear model with a single categorical explanatory variable. The dataset comes from Japan, specifically the Ryukyu islands, far to the south of mainland Japan. Two islands are home to the Amami rabbit (Pentalagus furnessi), an ancient species designated a natural monument of Japan and currently threatened by the invasive Java mongoose.

Researchers in Japan have asked us whether the invasive mongoose is causing rabbit size to decline. They have provided weights from two islands: Oshima, where mongoose are confirmed present (multiple captures, with a control programme underway), and Tokunoshima, where no sightings have been reported by local residents.

The dataset contains two columns:

  • weight: weight of the Amami rabbit (grams)
  • mongoose: whether the rabbit was caught on the island with mongoose Present (Oshima) or Absent (Tokunoshima)

Download rabbits.txt

Load the data to get started:

rabbit <- read.table("rabbits.txt", header = TRUE)

As always, begin by looking at the structure and a summary of the data:

str(rabbit)
summary(rabbit)

2. Exploratory data analysis

Before we can work with this dataset, our colleagues in Japan have emailed to flag two data entry mistakes. They report:

Mistake 1: The rabbit in the 7th row was actually from Oshima, not Tokunoshima.

Mistake 2: They cannot remember which rabbit, but they accidentally added 100 to its weight.

They ask you to resolve both issues and also check for any other possible mistakes.

Cleaning the rabbit data

Question 1. What mongoose value should the rabbit in row 7 have?

We were told it’s row 7. We can inspect that row using square brackets:

rabbit[7, ]

The rabbit is currently recorded as Absent, but it should be Present (it came from Oshima).

Question 2. Correct the mistake from Question 1 and confirm it has worked.

rabbit$mongoose[7] <- "Present"
# Or equivalently:
rabbit[7, 1] <- "Present"

# Confirm:
rabbit[7, ]

Both lines do the same thing. After the correction, row 7 should show Present.

Question 3. Which rabbit had the incorrect weight? Find it and correct the mistake at the same time.

A histogram reveals the outlier clearly:

ggplot(rabbit) +
  geom_histogram(aes(x = weight))

There is one rabbit much heavier than all others, well above 350 g. We can use that as a condition to identify it and subtract 100:

rabbit[rabbit$weight > 350, ]
# Row 41 is the culprit.

Question 4. Correct the mistake from Question 3 and confirm it has worked.

rabbit$weight[rabbit$weight > 350] <- rabbit$weight[rabbit$weight > 350] - 100

# Confirm:
rabbit[41, ]
# Or replot the histogram to check no outlier remains.
ggplot(rabbit) +
  geom_histogram(aes(x = weight))

Question 5. Can you spot any additional data entry mistakes in the dataset?

Nothing else obvious. The histogram and summary look reasonable once the outlier has been corrected.

3. Hidden assumptions

Having resolved the data issues, it’s time to think carefully about what assumptions we’re making before fitting any model. “Hidden assumptions” refers broadly to all the assumptions underlying our analysis: some we can think through carefully, and some we can even check. What makes them “hidden” is that they’re easy to overlook and aren’t automatically flagged by any software. Validity and representativeness deserve particular attention because there aren’t any formal diagnostic checks for them at all; evaluating them depends entirely on domain knowledge and thinking critically about how the data were collected and what they actually measure.

Assumptions for the mongoose analysis

Question 1. Can you identify any hidden assumptions with this dataset?

The key issue is whether mongoose are truly absent from Tokunoshima. All we know is that local residents have not reported any sightings. That’s not the same as confirmed absence. It’s possible there’s just a very small population of mongoose on an island that doesn’t have enough people around to spot them.

This puts us at risk of violating the assumption of validity: our mongoose variable may not actually measure what we think it does.

Question 2. If you have identified any issues, can we do anything about them?

Ideally we’d confirm the absence of mongoose on Tokunoshima before proceeding, but that’s not feasible here (we can’t pay for the class to go on a trip to some tropical islands in Japan). We have two options:

  1. Decline to perform the analysis because of the assumption violation.
  2. Perform the analysis but include clear, prominent statements about this limitation.

Both are defensible. For the purposes of this workshop (learning the method), we go with option 2. If this were a real project, I’d push hard to confirm the absence of mongoose before doing any analysis. If that weren’t possible, I’d probably pull out of the project entirely — but others would go with option 2.

Question 3. If mongoose are actually present on Tokunoshima, what effect would that have on the analysis?

If mongoose are present on both islands and they do reduce rabbit weight, then rabbit weights on both islands would be similarly suppressed. The model would find little or no difference between groups, leading us to incorrectly conclude that mongoose have no effect.

The annoying thing is that finding no difference is ambiguous: it could mean mongoose are present on both islands, or that mongoose genuinely don’t affect rabbit weight. We can’t tell which from the data alone.

This is a good example of why validity is the hardest assumption to check and the most consequential to violate.

4. Fitting the model

Assuming we’re comfortable proceeding, the model looks like this:

\[weight_i \sim \text{Normal}(\mu_i, \sigma)\]

\[\mu_i = \beta_0 + \beta_1 \times \underline{\phantom{Present}}_i\]

Where:

  • \(weight_i\) is the weight of rabbit \(i\), assumed to follow a Normal distribution with mean \(\mu_i\) and standard deviation \(\sigma\).
  • \(\mu_i\) is determined by \(\beta_0\) and \(\beta_1\).

The table below shows how R converts a categorical variable into dummy variables. Try changing the category in each row using the dropdowns, or enter different variable names and levels to see how the encoding changes.

From categories to dummy variables

R sorts levels alphabetically and uses the first as the reference group (the intercept, β0). One dummy column is created per remaining level: it takes the value 1 when the observation belongs to that level and 0 otherwise. The reference group is identified by all dummy columns being 0.

Understanding the model equation

If you’re finding the questions below difficult, skip ahead and run the model, then return and use the summary() output to help.

Question 1. What replaces the blank in the equation below?

\[\mu_i = \beta_0 + \beta_1 \times \underline{\phantom{MongoosePresent}}_i\]

Present replaces the blank.

If you said mongoose, remember that mongoose has two values ("Present" and "Absent"), which can’t be used directly in an equation. Instead, R takes one group and converts it to a binary dummy variable (0 or 1). R picks the group that comes first alphabetically as the reference, so Absent becomes the reference (0) and Present becomes the dummy (1). When Present = 0, it must be Absent.

Changing the reference group with relevel()

R picks the reference group alphabetically, which isn’t always what you want. If you’d rather Present be the reference (so that \(\beta_0\) is the mean weight where mongoose are present), use relevel() before fitting the model:

rabbit$mongoose <- relevel(factor(rabbit$mongoose), ref = "Present")
rabbit_mod      <- lm(weight ~ mongoose, data = rabbit)

This doesn’t change the data or the underlying relationship; it only changes which group sits in the intercept and which appears as a coefficient. The predicted values and confidence intervals stay the same.

Question 2. What does \(\beta_0\) represent in this model?

\(\beta_0\) is the predicted weight of rabbits in the reference group, i.e. when mongoose are Absent (Tokunoshima). It is the model’s estimate of the mean weight where Present = 0.

Question 3. What does \(\beta_1\) represent in this model?

\(\beta_1\) is the difference in predicted weight between rabbits where mongoose are Present versus where they are Absent. It’s not the predicted weight when mongoose are present; it’s the change relative to \(\beta_0\).

Running the model

Let’s fit the model:

rabbit_mod <- lm(weight ~ mongoose, data = rabbit)

Now inspect the estimated parameters:

summary(rabbit_mod)

Interpreting model output

Question 1. Replace the blanks in the equation below with the values estimated by the model.

weighti ∼ Normal(μi, σ)

μi =   +   × Presenti

\[\mu_i = 269.6 + (-2.9) \times \text{Present}_i\]

The first blank is 269.566 (\(\beta_0\)) and the second is −2.866 (\(\beta_1\)). Rounding to one decimal place is appropriate here given that weights are recorded in whole grams and the estimated difference is small.

Question 2. Using your completed equation, what weight (\(\mu\)) would you predict for a rabbit where mongoose are present?

Substitute Present = 1:

\[\mu_i = 269.6 + (-2.9) \times 1 = 266.7 \text{ g}\]

Note: I am rounding to one decimal place throughout. Rabbit weights were recorded as whole grams and the estimated coefficient is small, so claiming more decimal places would imply more precision than the data actually support.

Question 3. Using your completed equation, what weight (\(\mu\)) would you predict for a rabbit where mongoose are absent?

Substitute Present = 0:

\[\mu_i = 269.6 + (-2.9) \times 0 = 269.6 \text{ g}\]

When the dummy variable is 0, \(\beta_1\) drops out completely and we are left with \(\beta_0\) alone.

Question 4. Is the presence of mongoose associated with a decline in rabbit weight, as the researchers hypothesised?

The model suggests mongoose presence is associated with a weight reduction of about 3 g, but the standard error on that estimate is large (5.4 g). A quick 95% confidence interval:

  • Upper limit: −2.9 + (5.4 × 1.96) = +7.7 g
  • Lower limit: −2.9 − (5.4 × 1.96) = −13.5 g

The confidence interval spans zero, meaning the data are consistent with no effect at all. There is very little statistical evidence for a meaningful difference.

Question 5. Is the association sufficient to be of biological concern?

3 g is pretty small for a species that weighs on average around 2.7 kg. Combined with the wide uncertainty and the unresolved question of whether mongoose are even absent from Tokunoshima, we’re not really sure what we’ve learned here.

We identified before running the model that the mongoose variable might not be reliable, and we now find a tiny effect. So what should we take away from this? I’m not sure myself.

5. Model diagnostics

We will cover this on Friday.

6. Summarising and visualising results

We will return to model diagnostics on Friday. For now, let’s focus on step 6: turning the model output into a figure. Just like last week, there are two options.

Using ggeffects

Make sure ggeffects is installed and loaded:

library(ggeffects)

Ask ggeffects to predict rabbit weight across the two mongoose categories:

preds <- ggpredict(rabbit_mod)

When you run this, you’ll likely see a Warning (not an Error) telling you that ggeffects prefers factors over character strings. Converting is simple:

rabbit$mongoose <- factor(rabbit$mongoose)

Then refit and rerun ggpredict(). In practice, I tend to convert categorical variables to factors before fitting any model, since some packages refuse to work with character strings. Whether you do so is up to you, but given how little effort it takes, it’s worth the habit.

Regardless of whether you convert, plot the predictions:

plot(preds)

Building the prediction yourself

We can also follow the same manual approach as last week, with small adjustments for a categorical predictor.

When building the fake dataset, mongoose only needs the values that appear in the real data. The unique() function handles this neatly, which is especially useful when there are many groups:

fake <- data.frame(
  mongoose = unique(rabbit$mongoose)
)
fake

This gives a two-row dataset, one row for Absent and one for Present.

Last week we only extracted the point predictions (fit). This week we also want the 95% confidence intervals, so we ask predict() for the standard error too:

preds <- predict(rabbit_mod, newdata = fake, se.fit = TRUE)

Extract the point predictions first:

fake$fit <- preds$fit

Confidence intervals

If you need a reminder of what 95% CIs are: confidence intervals represent the probability that an interval will contain the true parameter value before the model has been run. Once the model has been run, either the interval contains the true value or it doesn’t. We don’t know which. If that’s confusing (which is fair, 95% CIs are weird), think of the ring toss example from the lecture: if we make the hoop big, we have a 95% chance of landing on target before the throw, but once the hoop has left our hand, it either lands on target or it doesn’t.

The 95% confidence interval is constructed by multiplying the standard error by 1.96, then adding and subtracting from fit:

fake$low95 <- preds$fit - 1.96 * preds$se.fit
fake$upp95 <- preds$fit + 1.96 * preds$se.fit
# You can also use confint(rabbit_mod) to check these values.

Now plot the predictions with error bars:

ggplot(fake) +
  geom_errorbar(aes(x = mongoose, ymin = low95, ymax = upp95)) +
  geom_point(aes(x = mongoose, y = fit)) +
  labs(x = "Java mongoose",
       y = "Predicted weight of rabbits (g)") +
  theme_minimal()

The full workflow

Full rabbit workflow. Complete the analysis yourself, then compare your script with the instructor version below.

# Analysis of Amami rabbit weight in relation to mongoose presence

# Load data
rabbit <- read.table("rabbits.txt", header = TRUE)

# EDA
summary(rabbit)
str(rabbit)
ggplot(rabbit) +
  geom_histogram(aes(x = weight))
ggplot(rabbit) +
  geom_point(aes(x = mongoose, y = weight))

# Correct two data entry errors flagged by researchers
rabbit$mongoose[7] <- "Present"
rabbit$weight[rabbit$weight > 350] <- rabbit$weight[rabbit$weight > 350] - 100

# Concern prior to modelling:
# It is unclear whether mongoose are genuinely absent from Tokunoshima,
# or simply undetected. This caveat limits our conclusions.

# Fit model
rabbit_mod <- lm(weight ~ mongoose, data = rabbit)

# Diagnostics (covered in Workshop 6)

# Inference
summary(rabbit_mod)

# Visualise predictions with 95% CI
fake <- data.frame(mongoose = unique(rabbit$mongoose))
preds <- predict(rabbit_mod, newdata = fake, se.fit = TRUE)
fake$fit   <- preds$fit
fake$low95 <- preds$fit - 1.96 * preds$se.fit
fake$upp95 <- preds$fit + 1.96 * preds$se.fit

ggplot(fake) +
  geom_errorbar(aes(x = mongoose, ymin = low95, ymax = upp95)) +
  geom_point(aes(x = mongoose, y = fit)) +
  labs(x = "Java mongoose",
       y = "Predicted weight of rabbits (g)") +
  theme_minimal()

Should we share these results?

Question 1. Once you have completed this analysis (excluding diagnostics, which come on Friday), discuss in groups whether you think these results should be shared with the local government as evidence to inform mongoose eradication decisions.

There’s no clean answer here, and that’s the point.

The assumption of validity is questionable (we don’t truly know mongoose are absent from Tokunoshima). The estimated effect is tiny (about 3 g in an animal weighing 2.7 kg) and the uncertainty is large enough to include zero. The effect could also reflect something mundane, such as using different scales on the two islands.

The hardest part of statistics has nothing to do with equations or distributions. It’s deciding whether the data are appropriate for the question. The thing that would keep me up at night would be if anyone were to use our model to make decisions with real-world consequences. Are we really confident enough in our model to have that rest on our conscience? I don’t think I am.

Personally, I wouldn’t be comfortable sharing this with the local government without first making a serious effort to confirm whether mongoose are present on Tokunoshima.

A new challenger

Once you have finished the Amami rabbit analysis, move on to the lemurs.txt dataset. This comes from researchers interested in how activity levels (hours active per day) vary across four habitat types in Madagascar.

The dataset contains:

  • activity: number of hours lemurs were active in a day (measured to two decimal places)
  • habitat: one of four habitat types; Rainforest, Dry Forest, Spiny Forest, and Mangrove

Download lemurs.txt

Load the data and go through the same workflow you just used for the rabbit analysis:

lemur <- read.table("lemurs.txt", header = TRUE)

Work through each step: EDA (including checking for data issues), hidden assumptions, model fitting, and visualising predictions. The goal is to identify which habitat type lemurs are most active in.

Full lemur analysis. Work through the complete workflow. When you’re done, compare your approach with the instructor solution.

# Analysis of lemur activity by habitat type

library(ggplot2)
library(ggeffects)

lemur <- read.table("lemurs.txt", header = TRUE)

# EDA
str(lemur)      # n = 120, two columns
summary(lemur)  # check ranges

ggplot(lemur) +
  geom_histogram(aes(x = activity), colour = "white", binwidth = 0.5) +
  theme_minimal()

ggplot(lemur) +
  geom_histogram(aes(x = activity, fill = habitat),
                 colour = "white", binwidth = 0.5) +
  theme_minimal()
# One habitat is recorded as "Rain forest" rather than "Rainforest".
# This is clearly a typo, so we can correct without needing confirmation.

lemur$habitat[lemur$habitat == "Rain forest"] <- "Rainforest"

# Confirm the correction
ggplot(lemur) +
  geom_histogram(aes(x = activity, fill = habitat),
                 colour = "white", binwidth = 0.5) +
  theme_minimal()

# Hidden assumptions
# - Validity: how is activity measured? Accuracy may differ across habitats
#   (e.g., tracking lemurs is easier in Dry Forest than dense Rainforest).
# - Equal variance: may differ across habitat types. Worth checking diagnostics.

# Fit model
lem_mod <- lm(activity ~ habitat, data = lemur)

# Diagnostics (covered in Workshop 6)

# Summarise
summary(lem_mod)
# Dry Forest is the reference group (first alphabetically).
# Model predicts ~7.9 h in Dry Forest.
# Mangrove: ~1.1 h less (so ~6.8 h total)
# Rainforest: ~1.8 h more (so ~9.7 h total)
# Spiny Forest: ~1.8 h less (so ~6.1 h total)

preds <- ggpredict(lem_mod)
plot(preds)

# Interpretation:
# Lemurs are most active in Rainforest and least active in Spiny Forest.
# This does not necessarily mean Rainforest is more important habitat;
# it is just where activity rates are highest.
# Measurement bias (easier tracking in open habitats) is a real concern.

Assumption table

Assumption Description Example of violation
1. Validity The data can answer the research question. Using IQ to measure overall intelligence.
2. Representativeness The data reflects the population being studied. Drawing conclusions about an urban population from rural survey data.
3. Additivity Effects of predictors combine additively with no interactions. Assuming study hours and sleep hours affect exam scores independently, without considering how they interact.
4. Linearity The relationship between predictor and outcome is linear. Assuming plant growth increases indefinitely with light exposure.
5. Independence of errors Observations are independent of each other. Modelling height using repeated measurements from the same children over time.
6. Equal variance of errors Residuals have constant spread across predictor levels (homoscedasticity). Variability in income differs widely across levels of a company’s hierarchy.
7. Normality of errors Residuals follow a Normal distribution. Modelling income, where a few extremely wealthy individuals create extreme residuals.
Glossary / 词汇表
Variable encoding
Categorical variable 分类变量
A variable whose values are discrete named groups (e.g., Present/Absent, habitat type) rather than numbers on a continuous scale.
取值为离散命名类别(如在场/缺失、栖息地类型)而非连续数值的变量。
Dummy variable 虚拟变量
A binary (0/1) variable created from a categorical predictor so it can be used in a mathematical equation. R creates these automatically.
由分类预测变量生成的二元(0/1)变量,使其能用于数学方程。R会自动创建。
Reference group 参照组
The group assigned a dummy variable value of 0. R chooses the group that comes first alphabetically by default. The intercept (β₀) is the predicted response for this group.
虚拟变量取值为0的组。R默认选择字母顺序最靠前的组。截距(β₀)即该组的预测响应值。
factor() factor()函数
Converts a character or numeric vector to a factor in R, making group membership explicit and ensuring R treats the variable as categorical.
将字符或数值向量转换为R中的因子,明确组别归属,确保R将变量视为分类变量。
Model parameters
Intercept (β₀) 截距(β₀)
In a categorical model, the predicted mean response for the reference group (when all dummy variables equal 0).
在分类预测变量模型中,参照组(所有虚拟变量为0)的预测均值。
Coefficient (β₁) 系数(β₁)
In a categorical model, the estimated difference in mean response between the non-reference group and the reference group.
在分类模型中,非参照组与参照组之间均值响应的估计差异。
Inference and uncertainty
Confidence interval (95% CI) 置信区间(95% CI)
A range constructed so that, before seeing the data, 95% of such intervals would contain the true parameter value. Wider intervals indicate more uncertainty.
在观测数据之前,该区间有95%的概率包含真实参数值。区间越宽表示不确定性越大。
Standard error (SE) 标准误(SE)
A measure of how precisely a parameter has been estimated. Multiply by 1.96 to get the half-width of a 95% confidence interval.
衡量参数估计精度的指标。乘以1.96得到95%置信区间的半宽。
se.fit 拟合标准误
The standard error of a fitted (predicted) value, returned by predict(..., se.fit = TRUE). Used to compute confidence intervals.
predict(..., se.fit = TRUE)返回的拟合值标准误,用于计算置信区间。
confint() confint()函数
Computes confidence intervals for model parameters in R. A quick alternative to calculating them manually from the standard error.
计算R中模型参数置信区间的函数,是手动根据标准误计算的快捷替代方法。
R functions
lm() lm()函数
Fits a linear model in R. The formula y ~ x specifies the response and predictor. Works with both continuous and categorical predictors.
在R中拟合线性模型。公式y ~ x指定响应变量和预测变量,支持连续和分类预测变量。
predict() predict()函数
Generates model predictions for a new dataset. With se.fit = TRUE it also returns the standard error of each prediction, needed for confidence intervals.
为新数据集生成模型预测值。设置se.fit = TRUE时还返回每个预测值的标准误,用于计算置信区间。
unique() unique()函数
Returns all unique values in a vector. Useful for building a fake prediction dataset when a categorical variable has many levels.
返回向量中所有唯一值。当分类变量有多个水平时,可方便地构建预测数据集。
geom_errorbar() 误差棒图层
A ggplot2 layer that draws vertical error bars between ymin and ymax. Used to show confidence intervals on a categorical plot.
ggplot2图层,在yminymax之间绘制垂直误差棒,用于在分类图中显示置信区间。
ggeffects ggeffects包
An R package that automates model prediction and plotting. ggpredict() returns predictions with confidence intervals; plot() produces the figure.
自动化模型预测和绘图的R包。ggpredict()返回带置信区间的预测值;plot()生成图形。
Assumptions
Validity (assumption) 效度假设
The assumption that the data and variables actually measure what we intend them to measure. One of the hardest assumptions to test formally.
数据和变量确实测量了我们意图测量内容的假设。最难通过正式检验来评估的假设之一。
Representativeness (assumption) 代表性假设
The assumption that the sampled data reflect the broader population of interest. Violations lead to conclusions that do not generalise.
样本数据能代表更广泛目标总体的假设。违反此假设会导致结论无法推广。
Ecological fallacy 生态谬误
The error of drawing conclusions about individuals from group-level data, or assuming that group patterns apply to every member of that group.
从群体水平数据推断个体结论的错误,即假设群体规律适用于该群体每一个成员。
Dataset terms
Invasive species 入侵物种
A non-native species introduced (accidentally or deliberately) to an ecosystem, often causing harm to native species and habitats.
被(偶然或刻意)引入某生态系统的外来物种,通常对本地物种和栖息地造成危害。
Habitat 栖息地
The natural environment in which a species lives, characterised by its physical and biological features (e.g., Rainforest, Dry Forest).
物种生活的自然环境,以其物理和生物特征为特征(如雨林、旱林)。