rabbit <- read.table("rabbits.txt", header = TRUE)Workshop 5: Linear models with a categorical predictor
BI3010 Statistics for Biologists
Learning Objectives
- Understand how a categorical predictor is encoded as a dummy variable in a linear model.
- Identify the reference group and interpret \(\beta_0\) and \(\beta_1\) in a categorical model.
- Fit a linear model with a categorical predictor using
lm(). - Generate model predictions with 95% confidence intervals and visualise them.
- Reflect critically on hidden assumptions that formal diagnostics can’t detect.
Workshop structure
As always, we’re using the workflow below to structure the workshops:
- Formulate a research question.
- Perform exploratory data analysis.
- Identify any hidden assumptions.
- Fit an appropriate model [Focus of today].
- Diagnose the model and check assumptions.
- Summarise the results.
- Interpret and provide inferences.
1. Formulating a research question
This workshop introduces fitting a linear model with a single categorical explanatory variable. The dataset comes from Japan, specifically the Ryukyu islands, far to the south of mainland Japan. Two islands are home to the Amami rabbit (Pentalagus furnessi), an ancient species designated a natural monument of Japan and currently threatened by the invasive Java mongoose.
Researchers in Japan have asked us whether the invasive mongoose is causing rabbit size to decline. They have provided weights from two islands: Oshima, where mongoose are confirmed present (multiple captures, with a control programme underway), and Tokunoshima, where no sightings have been reported by local residents.
The dataset contains two columns:
weight: weight of the Amami rabbit (grams)mongoose: whether the rabbit was caught on the island with mongoosePresent(Oshima) orAbsent(Tokunoshima)
Load the data to get started:
As always, begin by looking at the structure and a summary of the data:
str(rabbit)
summary(rabbit)2. Exploratory data analysis
Before we can work with this dataset, our colleagues in Japan have emailed to flag two data entry mistakes. They report:
Mistake 1: The rabbit in the 7th row was actually from Oshima, not Tokunoshima.
Mistake 2: They cannot remember which rabbit, but they accidentally added 100 to its weight.
They ask you to resolve both issues and also check for any other possible mistakes.
Cleaning the rabbit data
Question 1. What mongoose value should the rabbit in row 7 have?
We were told it’s row 7. We can inspect that row using square brackets:
rabbit[7, ]The rabbit is currently recorded as Absent, but it should be Present (it came from Oshima).
Question 2. Correct the mistake from Question 1 and confirm it has worked.
rabbit$mongoose[7] <- "Present"
# Or equivalently:
rabbit[7, 1] <- "Present"
# Confirm:
rabbit[7, ]Both lines do the same thing. After the correction, row 7 should show Present.
Question 3. Which rabbit had the incorrect weight? Find it and correct the mistake at the same time.
A histogram reveals the outlier clearly:
ggplot(rabbit) +
geom_histogram(aes(x = weight))There is one rabbit much heavier than all others, well above 350 g. We can use that as a condition to identify it and subtract 100:
rabbit[rabbit$weight > 350, ]
# Row 41 is the culprit.Question 4. Correct the mistake from Question 3 and confirm it has worked.
rabbit$weight[rabbit$weight > 350] <- rabbit$weight[rabbit$weight > 350] - 100
# Confirm:
rabbit[41, ]
# Or replot the histogram to check no outlier remains.
ggplot(rabbit) +
geom_histogram(aes(x = weight))Question 5. Can you spot any additional data entry mistakes in the dataset?
Nothing else obvious. The histogram and summary look reasonable once the outlier has been corrected.
4. Fitting the model
Assuming we’re comfortable proceeding, the model looks like this:
\[weight_i \sim \text{Normal}(\mu_i, \sigma)\]
\[\mu_i = \beta_0 + \beta_1 \times \underline{\phantom{Present}}_i\]
Where:
- \(weight_i\) is the weight of rabbit \(i\), assumed to follow a Normal distribution with mean \(\mu_i\) and standard deviation \(\sigma\).
- \(\mu_i\) is determined by \(\beta_0\) and \(\beta_1\).
The table below shows how R converts a categorical variable into dummy variables. Try changing the category in each row using the dropdowns, or enter different variable names and levels to see how the encoding changes.
From categories to dummy variables
R sorts levels alphabetically and uses the first as the reference group (the intercept, β0). One dummy column is created per remaining level: it takes the value 1 when the observation belongs to that level and 0 otherwise. The reference group is identified by all dummy columns being 0.
Understanding the model equation
If you’re finding the questions below difficult, skip ahead and run the model, then return and use the summary() output to help.
Question 1. What replaces the blank in the equation below?
\[\mu_i = \beta_0 + \beta_1 \times \underline{\phantom{MongoosePresent}}_i\]
Present replaces the blank.
If you said mongoose, remember that mongoose has two values ("Present" and "Absent"), which can’t be used directly in an equation. Instead, R takes one group and converts it to a binary dummy variable (0 or 1). R picks the group that comes first alphabetically as the reference, so Absent becomes the reference (0) and Present becomes the dummy (1). When Present = 0, it must be Absent.
relevel()
R picks the reference group alphabetically, which isn’t always what you want. If you’d rather Present be the reference (so that \(\beta_0\) is the mean weight where mongoose are present), use relevel() before fitting the model:
rabbit$mongoose <- relevel(factor(rabbit$mongoose), ref = "Present")
rabbit_mod <- lm(weight ~ mongoose, data = rabbit)This doesn’t change the data or the underlying relationship; it only changes which group sits in the intercept and which appears as a coefficient. The predicted values and confidence intervals stay the same.
Question 2. What does \(\beta_0\) represent in this model?
\(\beta_0\) is the predicted weight of rabbits in the reference group, i.e. when mongoose are Absent (Tokunoshima). It is the model’s estimate of the mean weight where Present = 0.
Question 3. What does \(\beta_1\) represent in this model?
\(\beta_1\) is the difference in predicted weight between rabbits where mongoose are Present versus where they are Absent. It’s not the predicted weight when mongoose are present; it’s the change relative to \(\beta_0\).
Running the model
Let’s fit the model:
rabbit_mod <- lm(weight ~ mongoose, data = rabbit)Now inspect the estimated parameters:
summary(rabbit_mod)Interpreting model output
Question 1. Replace the blanks in the equation below with the values estimated by the model.
weighti ∼ Normal(μi, σ)
μi = + × Presenti
\[\mu_i = 269.6 + (-2.9) \times \text{Present}_i\]
The first blank is 269.566 (\(\beta_0\)) and the second is −2.866 (\(\beta_1\)). Rounding to one decimal place is appropriate here given that weights are recorded in whole grams and the estimated difference is small.
Question 2. Using your completed equation, what weight (\(\mu\)) would you predict for a rabbit where mongoose are present?
Substitute Present = 1:
\[\mu_i = 269.6 + (-2.9) \times 1 = 266.7 \text{ g}\]
Note: I am rounding to one decimal place throughout. Rabbit weights were recorded as whole grams and the estimated coefficient is small, so claiming more decimal places would imply more precision than the data actually support.
Question 3. Using your completed equation, what weight (\(\mu\)) would you predict for a rabbit where mongoose are absent?
Substitute Present = 0:
\[\mu_i = 269.6 + (-2.9) \times 0 = 269.6 \text{ g}\]
When the dummy variable is 0, \(\beta_1\) drops out completely and we are left with \(\beta_0\) alone.
Question 4. Is the presence of mongoose associated with a decline in rabbit weight, as the researchers hypothesised?
The model suggests mongoose presence is associated with a weight reduction of about 3 g, but the standard error on that estimate is large (5.4 g). A quick 95% confidence interval:
- Upper limit: −2.9 + (5.4 × 1.96) = +7.7 g
- Lower limit: −2.9 − (5.4 × 1.96) = −13.5 g
The confidence interval spans zero, meaning the data are consistent with no effect at all. There is very little statistical evidence for a meaningful difference.
Question 5. Is the association sufficient to be of biological concern?
3 g is pretty small for a species that weighs on average around 2.7 kg. Combined with the wide uncertainty and the unresolved question of whether mongoose are even absent from Tokunoshima, we’re not really sure what we’ve learned here.
We identified before running the model that the mongoose variable might not be reliable, and we now find a tiny effect. So what should we take away from this? I’m not sure myself.
5. Model diagnostics
We will cover this on Friday.
6. Summarising and visualising results
We will return to model diagnostics on Friday. For now, let’s focus on step 6: turning the model output into a figure. Just like last week, there are two options.
Using ggeffects
Make sure ggeffects is installed and loaded:
library(ggeffects)Ask ggeffects to predict rabbit weight across the two mongoose categories:
preds <- ggpredict(rabbit_mod)When you run this, you’ll likely see a Warning (not an Error) telling you that ggeffects prefers factors over character strings. Converting is simple:
rabbit$mongoose <- factor(rabbit$mongoose)Then refit and rerun ggpredict(). In practice, I tend to convert categorical variables to factors before fitting any model, since some packages refuse to work with character strings. Whether you do so is up to you, but given how little effort it takes, it’s worth the habit.
Regardless of whether you convert, plot the predictions:
plot(preds)Building the prediction yourself
We can also follow the same manual approach as last week, with small adjustments for a categorical predictor.
When building the fake dataset, mongoose only needs the values that appear in the real data. The unique() function handles this neatly, which is especially useful when there are many groups:
fake <- data.frame(
mongoose = unique(rabbit$mongoose)
)
fakeThis gives a two-row dataset, one row for Absent and one for Present.
Last week we only extracted the point predictions (fit). This week we also want the 95% confidence intervals, so we ask predict() for the standard error too:
preds <- predict(rabbit_mod, newdata = fake, se.fit = TRUE)Extract the point predictions first:
fake$fit <- preds$fitConfidence intervals
If you need a reminder of what 95% CIs are: confidence intervals represent the probability that an interval will contain the true parameter value before the model has been run. Once the model has been run, either the interval contains the true value or it doesn’t. We don’t know which. If that’s confusing (which is fair, 95% CIs are weird), think of the ring toss example from the lecture: if we make the hoop big, we have a 95% chance of landing on target before the throw, but once the hoop has left our hand, it either lands on target or it doesn’t.
The 95% confidence interval is constructed by multiplying the standard error by 1.96, then adding and subtracting from fit:
fake$low95 <- preds$fit - 1.96 * preds$se.fit
fake$upp95 <- preds$fit + 1.96 * preds$se.fit
# You can also use confint(rabbit_mod) to check these values.Now plot the predictions with error bars:
ggplot(fake) +
geom_errorbar(aes(x = mongoose, ymin = low95, ymax = upp95)) +
geom_point(aes(x = mongoose, y = fit)) +
labs(x = "Java mongoose",
y = "Predicted weight of rabbits (g)") +
theme_minimal()The full workflow
Full rabbit workflow. Complete the analysis yourself, then compare your script with the instructor version below.
# Analysis of Amami rabbit weight in relation to mongoose presence
# Load data
rabbit <- read.table("rabbits.txt", header = TRUE)
# EDA
summary(rabbit)
str(rabbit)
ggplot(rabbit) +
geom_histogram(aes(x = weight))
ggplot(rabbit) +
geom_point(aes(x = mongoose, y = weight))
# Correct two data entry errors flagged by researchers
rabbit$mongoose[7] <- "Present"
rabbit$weight[rabbit$weight > 350] <- rabbit$weight[rabbit$weight > 350] - 100
# Concern prior to modelling:
# It is unclear whether mongoose are genuinely absent from Tokunoshima,
# or simply undetected. This caveat limits our conclusions.
# Fit model
rabbit_mod <- lm(weight ~ mongoose, data = rabbit)
# Diagnostics (covered in Workshop 6)
# Inference
summary(rabbit_mod)
# Visualise predictions with 95% CI
fake <- data.frame(mongoose = unique(rabbit$mongoose))
preds <- predict(rabbit_mod, newdata = fake, se.fit = TRUE)
fake$fit <- preds$fit
fake$low95 <- preds$fit - 1.96 * preds$se.fit
fake$upp95 <- preds$fit + 1.96 * preds$se.fit
ggplot(fake) +
geom_errorbar(aes(x = mongoose, ymin = low95, ymax = upp95)) +
geom_point(aes(x = mongoose, y = fit)) +
labs(x = "Java mongoose",
y = "Predicted weight of rabbits (g)") +
theme_minimal()A new challenger
Once you have finished the Amami rabbit analysis, move on to the lemurs.txt dataset. This comes from researchers interested in how activity levels (hours active per day) vary across four habitat types in Madagascar.
The dataset contains:
activity: number of hours lemurs were active in a day (measured to two decimal places)habitat: one of four habitat types; Rainforest, Dry Forest, Spiny Forest, and Mangrove
Load the data and go through the same workflow you just used for the rabbit analysis:
lemur <- read.table("lemurs.txt", header = TRUE)Work through each step: EDA (including checking for data issues), hidden assumptions, model fitting, and visualising predictions. The goal is to identify which habitat type lemurs are most active in.
Full lemur analysis. Work through the complete workflow. When you’re done, compare your approach with the instructor solution.
# Analysis of lemur activity by habitat type
library(ggplot2)
library(ggeffects)
lemur <- read.table("lemurs.txt", header = TRUE)
# EDA
str(lemur) # n = 120, two columns
summary(lemur) # check ranges
ggplot(lemur) +
geom_histogram(aes(x = activity), colour = "white", binwidth = 0.5) +
theme_minimal()
ggplot(lemur) +
geom_histogram(aes(x = activity, fill = habitat),
colour = "white", binwidth = 0.5) +
theme_minimal()
# One habitat is recorded as "Rain forest" rather than "Rainforest".
# This is clearly a typo, so we can correct without needing confirmation.
lemur$habitat[lemur$habitat == "Rain forest"] <- "Rainforest"
# Confirm the correction
ggplot(lemur) +
geom_histogram(aes(x = activity, fill = habitat),
colour = "white", binwidth = 0.5) +
theme_minimal()
# Hidden assumptions
# - Validity: how is activity measured? Accuracy may differ across habitats
# (e.g., tracking lemurs is easier in Dry Forest than dense Rainforest).
# - Equal variance: may differ across habitat types. Worth checking diagnostics.
# Fit model
lem_mod <- lm(activity ~ habitat, data = lemur)
# Diagnostics (covered in Workshop 6)
# Summarise
summary(lem_mod)
# Dry Forest is the reference group (first alphabetically).
# Model predicts ~7.9 h in Dry Forest.
# Mangrove: ~1.1 h less (so ~6.8 h total)
# Rainforest: ~1.8 h more (so ~9.7 h total)
# Spiny Forest: ~1.8 h less (so ~6.1 h total)
preds <- ggpredict(lem_mod)
plot(preds)
# Interpretation:
# Lemurs are most active in Rainforest and least active in Spiny Forest.
# This does not necessarily mean Rainforest is more important habitat;
# it is just where activity rates are highest.
# Measurement bias (easier tracking in open habitats) is a real concern.Assumption table
| Assumption | Description | Example of violation |
|---|---|---|
| 1. Validity | The data can answer the research question. | Using IQ to measure overall intelligence. |
| 2. Representativeness | The data reflects the population being studied. | Drawing conclusions about an urban population from rural survey data. |
| 3. Additivity | Effects of predictors combine additively with no interactions. | Assuming study hours and sleep hours affect exam scores independently, without considering how they interact. |
| 4. Linearity | The relationship between predictor and outcome is linear. | Assuming plant growth increases indefinitely with light exposure. |
| 5. Independence of errors | Observations are independent of each other. | Modelling height using repeated measurements from the same children over time. |
| 6. Equal variance of errors | Residuals have constant spread across predictor levels (homoscedasticity). | Variability in income differs widely across levels of a company’s hierarchy. |
| 7. Normality of errors | Residuals follow a Normal distribution. | Modelling income, where a few extremely wealthy individuals create extreme residuals. |
predict(..., se.fit = TRUE). Used to compute confidence intervals.predict(..., se.fit = TRUE)返回的拟合值标准误,用于计算置信区间。y ~ x specifies the response and predictor. Works with both continuous and categorical predictors.y ~ x指定响应变量和预测变量,支持连续和分类预测变量。se.fit = TRUE it also returns the standard error of each prediction, needed for confidence intervals.se.fit = TRUE时还返回每个预测值的标准误,用于计算置信区间。ymin and ymax. Used to show confidence intervals on a categorical plot.ymin和ymax之间绘制垂直误差棒,用于在分类图中显示置信区间。ggpredict() returns predictions with confidence intervals; plot() produces the figure.ggpredict()返回带置信区间的预测值;plot()生成图形。