Workshop 3: Linear models with a continuous predictor

BI3010 Statistics for Biologists

Author

Roos & Pinard

Published

16 May 2026

Learning Objectives

Learning objectives
  1. Decide whether a Normal distribution is a reasonable assumption for a response variable.
  2. Fit a linear model in R using lm().
  3. Interpret the intercept and slope from model output.
  4. Produce a figure of the fitted relationship using ggeffects or manual predictions.

Workshop structure

The Analytical Workflow

Starting this week, we apply the full analytical workflow to a single dataset before working through an independent one. Today’s focus is step 4.

  1. Formulate a research question.
  2. Perform exploratory data analysis.
  3. Identify any hidden assumptions.
  4. Fit an appropriate model [Focus of today].
  5. Diagnose the model and check assumptions.
  6. Summarise the results.
  7. Interpret and provide inferences.

Formulate a research question

For this first workshop on linear models, we return to a familiar dataset: boto.txt from Workshop 2. As a reminder, it contains:

  • sex: sex of the boto dolphin
  • drought: number of years in the past 20 where drought was recorded in the dolphin’s region
  • length: length of the boto dolphin (cm)

The research question is whether the number of droughts in the past 20 years affects boto length. The broader motivation is climate change and dam construction in the Amazon basin: if droughts become more frequent, could dolphins become smaller and be in poorer condition?

If you no longer have boto.txt from Workshop 2, download it here and place it in your data/ folder:

Download boto.txt

From the question we can identify:

  • Response variable: length
  • Predictor variable: drought

Before fitting any model, we need to decide whether a Normal distribution is a reasonable description of the data-generating process for boto length. Formally, we’re asking whether the following is plausible:

\[length_i \sim \text{Normal}(\mu_i, \sigma)\]

Here, \(length_i\) is the length of dolphin \(i\), assumed to follow a Normal distribution with mean \(\mu_i\) (the model’s predicted length for that dolphin) and standard deviation \(\sigma\).

Is a Normal distribution appropriate?

The first question is whether length is continuous or discrete. A discrete variable can only take integer values (1 cm, 2 cm, etc.). A continuous variable can in principle take any value within a range. Although the researchers rounded lengths to the nearest centimetre, actual dolphin length isn’t constrained to whole-centimetre increments. length is inherently continuous, and the Normal distribution generates continuous values. To see this, run:

rnorm(n = 10, mean = 200, sd = 10)

The generated values look much like the lengths in the boto dataset, which is reassuring. The only visible difference is that the researchers rounded their measurements.

The next question is whether there are natural bounds that might cause problems. Boto length cannot be zero or negative, and the Normal distribution can technically produce such values. However, the minimum observed length is well above zero, so the risk of the model predicting physically impossible values is low. A Normal distribution is a reasonable choice here, though we should verify this after fitting by checking that the model doesn’t produce implausible predictions. If it did, we’d switch to a different distribution using a Generalised Linear Model (GLM), which we cover later in the course.

An important distinction

Choosing a distribution for the response variable is separate from the assumption of “normally distributed errors” (or “normally distributed residuals”). The choice of distribution is based on the nature of the response variable and can be made before even looking at the data. The residuals assumption can only be checked after the model has been fitted.

The Normal distribution in practice

Question 1. Does a Normal distribution generate discrete or continuous values?

Continuous. Although values in a dataset may subsequently be rounded (e.g. to the nearest cm), the underlying process generates continuous values.

Question 2. Run rnorm(n = 100, mean = 50, sd = 20). What is the smallest value you get? Try increasing the standard deviation and observe what happens.

Because these are random numbers, results vary each run. With a mean of 50 and a standard deviation of 20, values below zero are quite plausible, especially if you increase the SD. You can find the minimum quickly with:

min(rnorm(n = 100, mean = 50, sd = 20))

This illustrates that a Normal distribution can produce values below zero, which matters when the response variable has a natural lower bound.

Question 3. If the values from rnorm(n = 100, mean = 50, sd = 20) were your response variable, would it be reasonable to assume a Normal distribution in your model?

Yes. The data were generated by a Normal distribution, so a model assuming Normal errors is entirely appropriate. The subtlety is that in the boto case we’re balancing biology with statistics: dolphins cannot be zero or negative centimetres long, but we can reasonably proceed because no observed values are anywhere near that boundary.

Question 4. If your data were generated by a Normal distribution, does that automatically satisfy the assumption of normality?

No. The assumption of normality refers to the distribution of the residuals (differences between observed and predicted values), not the raw data, and not the distribution we assumed generated the data. Even if the underlying data are normally distributed, the residuals from a poorly specified model may not be. This can only be checked after fitting the model.

Exploratory data analysis

You worked with this dataset in Workshop 2, so EDA is already done. If you want a refresher, copy the relevant code from your Workshop 2 script.

Identifying hidden assumptions

Before fitting the model, consider which assumptions are most likely to be violated. The full assumption list is included at the end of this document.

Assumptions for the boto model

Question 1. Which assumptions do you think are most likely to be violated?

Additivity is a likely concern. In Workshop 2 we saw that drought appears to affect male and female dolphins differently (different slopes on the faceted plot). Where predictors overlap in their effects, additivity is probably broken.

Independence of errors is also plausible. Dolphins from the same region may be related: if a parent is shorter than predicted (for unknown reasons), its offspring may also be shorter for the same reasons. This shared unexplained variation would violate independence: misunderstanding one observation shouldn’t make us more likely to misunderstand another.

Question 2. Which assumptions can only be checked after fitting the model?

All of the residual-based assumptions: independence of errors, equal variance of errors, and normality of errors. Residuals are the difference between what the model predicts and what was observed, so they can only exist after the model has been fitted. Linearity can also be assessed from a plot of residuals versus fitted values, though it can also be explored during EDA.

Fit an appropriate model

The model we will fit is:

\[length_i \sim \text{Normal}(\mu_i, \sigma)\] \[\mu_i = \beta_0 + \beta_1 \times drought_i\]

Where:

  • \(length_i\) is the length of dolphin \(i\), assumed to follow a Normal distribution with mean \(\mu_i\) and standard deviation \(\sigma\).
  • \(\beta_0\) is the intercept: the predicted length when drought = 0.
  • \(\beta_1\) is the slope: the change in predicted length for each additional year of drought.

Fitting this in R takes one line:

boto_model <- lm(length ~ drought, data = boto)

Translating the equation to code

  • lm() fits a linear model, which by default assumes the response follows a Normal distribution.
  • The formula length ~ drought specifies the relationship. length is the response variable; drought is the predictor. This corresponds directly to \(\mu_i = \beta_0 + \beta_1 \times drought_i\). R adds the intercept automatically.
  • We store the result in boto_model so we can use it later without re-running.

Now look at the results:

summary(boto_model)

The Estimates column gives us \(\beta_0\) and \(\beta_1\): 213.3 and −5.3, respectively.

What the parameters mean

To understand what these numbers represent, substitute into the equation. Start with the intercept. If there has never been a drought, then drought = 0:

\[\mu_i = \beta_0 + \beta_1 \times 0 = \beta_0\]

Any number multiplied by zero is zero, so only \(\beta_0\) remains. The intercept is the predicted length when the predictor equals zero: when drought = 0, the model predicts dolphins are approximately 213.3 cm long. The intercept is always “the expected value of Y when X equals zero.”

The slope, \(\beta_1 = -5.3\), tells us how much the predicted length changes per additional year of drought. For every extra year of drought, the model expects dolphins to be 5.3 cm shorter on average. Starting from drought = 0 at a length of 213 cm, each step to the right along the x-axis drops the prediction by 5.3 cm.

The figure below illustrates this:

Nobody talks much about \(\sigma\), but it’s doing important stuff in the model. \(\beta_0\) and \(\beta_1\) describe the line; \(\sigma\) describes the cloud of points around it. Two dolphins with the same drought history won’t be the same length. One is older, one had a better feeding season, one was awkward to measure. None of that is drought, and all of it ends up in \(\sigma\).

In the output, \(\sigma\) is called the Residual standard error. For a Normal distribution, about two thirds of observations fall within one \(\sigma\) of the fitted line. Friday’s session is where we actually dig into what that number is telling us. For now, just know it exists and matters.

Predictions from the equation

Question 1. The model estimated β0 = 213.3 and β1 = −5.3. Write out the completed equation:

μi =   +   × droughti

μi = 213.3 + (−5.3) × droughti

Question 2. Using your completed equation, what length would you predict for a dolphin in a region with zero years of drought?

213.3 + (−5.3) × 0 = 213.3 cm

Question 3. What length would you predict if there had been eight years of drought?

213.3 + (−5.3) × 8 = 170.9 cm

Question 4. What is the predicted difference in length between a dolphin that experienced two years of drought and one that experienced seven?

Two years of drought: 213.3 + (−5.3) × 2 = 202.7 cm

Seven years of drought: 213.3 + (−5.3) × 7 = 176.2 cm

Difference: 202.7 − 176.2 = 26.5 cm

According to the model, the dolphin with seven years of drought would be 26.5 cm shorter.

Diagnose the model and check assumptions

We’ll cover this on Friday.

A note on decimal places

When reporting parameter estimates, how many decimal places should you use? There is no universal answer; it depends on context. The figures below show what happens when you round estimates to different levels of precision.

Here \(\beta_0 = -1.235\) and \(\beta_1 = 3.456\). Rounding to zero decimal places gives \(\beta_0 = -1\) and \(\beta_1 = 3\), producing a noticeably different line than rounding to two decimal places.

In this second example the explanatory variable runs up to one million and the true parameters are extremely small (\(\beta_0 = 0.0000000652\), \(\beta_1 = 0.0005161519\)). Rounding to two decimal places gives \(\beta_0 = 0.00\) and \(\beta_1 = 0.00\), which is useless.

The lesson: before rounding, ask how much the number changes and what the scale of your predictor is. An arbitrary rounding decision can meaningfully alter your results. In BI3010 assessments, the number of decimal places to report will always be specified.

Summarise the results

We’ll return to model diagnosis on Friday. For now, let’s turn the results into a figure. There are two ways to do this.

Option 1: ggeffects

The ggeffects package does the work for you:

install.packages("ggeffects")  # Run once
library(ggeffects)

preds <- ggpredict(boto_model)
plot(preds)

This produces the fitted line with a 95% confidence interval (we’ll cover confidence intervals next week). It’s quick but gives you limited control over the appearance.

Option 2: manual predictions

Making the figure yourself is not much harder and is more transparent. The idea is to create a small dataset of drought values and use predict() to calculate the model’s prediction for each one, exactly what you did by hand in the questions above.

First, create a dataset with the drought values you want to predict at:

fake <- data.frame(drought = 0:8)

Then generate the predictions and add them to the dataset:

fake$fit <- predict(boto_model, newdata = fake)

Try it yourself. Enter any drought values in the table below. The fit column fills in using 213.3 + (−5.3) × drought, and each point appears on the plot connected to the previous one. This is exactly what predict() does in R, repeated for each row.

# drought fit (cm)

Now plot:

ggplot(fake) +
  geom_line(aes(x = drought, y = fit)) +
  labs(x = "Number of drought years",
       y = "Predicted length of boto (cm)") +
  theme_minimal()

Here’s the full workflow from loading the data to the final figure, as you might write it in a script:

# Deon Roos
# Analysis of boto length in relation to droughts

library(ggplot2)

# Open your R Project (.Rproj) before running this script
boto <- read.table("data/boto.txt", header = TRUE)

# EDA
summary(boto)
str(boto)
ggplot(boto) + geom_point(aes(x = drought, y = length))
ggplot(boto) + geom_point(aes(x = drought, y = length, colour = sex))
ggplot(boto) + geom_point(aes(x = drought, y = length, colour = sex)) + facet_wrap(~sex)
# No unusual observations detected in EDA

# Concerns prior to modelling:
# Measurement precision is high (unclear how lengths were measured at this accuracy)
# Validity: depends on measurement method
# Independence: related dolphins (families) may have been measured
# Additivity: drought may affect the sexes differently

# Fit model
boto_model <- lm(length ~ drought, data = boto)

# Diagnostics: covered in the next workshop

# Inference
summary(boto_model)

fake <- data.frame(drought = 0:8)
fake$fit <- predict(boto_model, newdata = fake)

ggplot(fake) +
  geom_line(aes(x = drought, y = fit)) +
  labs(x = "Number of drought years",
       y = "Predicted length of boto (cm)") +
  theme_minimal()

The entire analysis is about 40 lines including comments. It’s reproducible and shareable. As a tip: don’t confuse script length with complexity. A long script is often a sign of uncertainty, not sophistication. Less is usually more.

Burying beetles: independent analysis

Using everything you have just learned, work through a new dataset: beetles.txt on MyAberdeen. This data comes from a lab experiment investigating how short-term temperature exposure (6 hours per day) affects fecundity in burying beetles (Nicrophorus vespilloides), with the aim of understanding how the species might respond to climate change.

Variables:

  • offspring: number of offspring produced
  • temperature: temperature (°C) to which beetles were exposed for 6 hours per day

Download beetles.txt and place it in your data/ folder:

Download beetles.txt

Carry out an EDA, consider which assumptions are likely violated, fit an appropriate model, and plot the results. Make sure you are comfortable with the full workflow, as an equivalent analysis may appear in a quiz.

Worked answer. Once you have completed your own analysis, reveal the instructor’s version to compare approaches.

library(ggplot2)
library(ggeffects)

beetles <- read.table("data/beetles.txt", header = TRUE)

# EDA
str(beetles)      # n = 234
summary(beetles)  # minimum of 1 offspring, plausible if low

ggplot(beetles) +
  geom_histogram(aes(x = offspring), colour = "white", binwidth = 1) +
  theme_minimal()
# 1 offspring seems reasonable, just on the low end

ggplot(beetles) +
  geom_histogram(aes(x = temperature), colour = "white", binwidth = 1) +
  theme_minimal()
# Fairly even spread of temperatures between −10 and 35°C

ggplot(beetles) +
  geom_point(aes(x = temperature, y = offspring)) +
  theme_minimal()
# Clear non-linear trend: offspring increases then decreases as temperature rises

# Hidden assumption concerns:
# Linearity is almost certainly broken (the relationship curves)
# Equal variance of errors: possibly OK (the plot looks roughly homogeneous)
# Representativeness: this is a lab population; extrapolate to wild populations with caution

# Fit model
beet_mod <- lm(offspring ~ temperature, data = beetles)

summary(beet_mod)
# Intercept: 25.3, predicted offspring at 0°C
# Slope: 0.5, each 1°C increase adds roughly 0.5 offspring
# BUT: the model is biased (linearity is broken), so these estimates are unreliable.
# Do not use this model for inference.

preds <- ggpredict(beet_mod)
plot(preds, add.data = TRUE)
# The linear fit to a non-linear relationship is obvious in this plot.
# In reality, we would need to allow temperature to have a non-linear effect in the model.

Assumption reference

Assumption Description Example of violation
1. Validity The data can answer the research question. Using IQ to measure overall intelligence.
2. Representativeness The data reflects the population being studied. Using survey data from a rural area to draw conclusions about an urban population.
3. Additivity Effects of predictors are additive with no interactions. Assuming study hours and sleep hours independently affect exam scores without considering their combined effect.
4. Linearity The relationship between predictors and the outcome is linear. Assuming plant growth continues to increase indefinitely without ever levelling off.
5. Independence of errors Observations are independent of each other. Modelling height using repeated measurements of the same children over time.
6. Equal variance of errors Residuals have constant variance across all predictor levels (homoscedasticity). Modelling income, where salary variability differs across levels of a company hierarchy.
7. Normality of errors Residuals are approximately normally distributed. Modelling income, where a small number of extremely wealthy individuals create a heavily skewed residual distribution.
Glossary / 词汇表
Model concepts
Linear model 线性模型
A statistical model that describes the relationship between a response variable and one or more predictors using a straight line. Fitted in R with lm().
用直线描述响应变量与一个或多个预测变量之间关系的统计模型,在 R 中使用 lm() 拟合。
Normal distribution 正态分布
A symmetric, bell-shaped probability distribution defined by its mean (μ) and standard deviation (σ). The default assumption for the response variable in a linear model.
由均值(μ)和标准差(σ)定义的对称钟形概率分布,是线性模型中响应变量的默认假设。
Response variable 响应变量
The outcome being modelled (the "Y" variable). In the boto example, this is length. Also called the dependent variable.
被建模的结果变量(即"Y"变量)。在亚马逊河豚示例中为体长,也称因变量。
Predictor variable 预测变量
The variable used to explain or predict the response (the "X" variable). Also called the explanatory variable, independent variable, or covariate.
用于解释或预测响应变量的变量(即"X"变量),也称解释变量、自变量或协变量。
Continuous variable 连续变量
A variable that can take any value within a range, including non-integer values. Length, weight, and temperature are continuous. Compatible with a Normal distribution.
可取某范围内任意值(包括非整数值)的变量。体长、体重和温度均为连续变量,与正态分布兼容。
Discrete variable 离散变量
A variable that can only take specific values, usually integers (e.g. count of offspring). A Normal distribution is generally not appropriate for discrete variables.
只能取特定值(通常为整数)的变量,例如后代数量。正态分布通常不适用于离散变量。
Covariate 协变量
Another term for a predictor variable. Often used when the predictor is continuous and included to control for its effect alongside the main variable of interest.
预测变量的另一种说法,通常指连续型预测变量,用于在研究主要变量的同时控制其影响。
Model parameters
Intercept (β₀) 截距
The predicted value of the response variable when all predictors equal zero. In the boto model: the predicted length when drought = 0.
当所有预测变量等于零时,响应变量的预测值。在亚马逊河豚模型中,即干旱年数为零时的预测体长。
Slope (β₁) 斜率
The change in the predicted response for a one-unit increase in the predictor. In the boto model: the change in predicted length for each additional year of drought.
预测变量每增加一个单位时,预测响应变量的变化量。在亚马逊河豚模型中,即每增加一年干旱,预测体长的变化量。
Fitted value (μ) 拟合值
The model's predicted value for a given set of predictor values. Written μ_i for observation i. Calculated as β₀ + β₁ × predictor.
对于给定预测变量值,模型所预测的值,第 i 个观测的拟合值记为 μ_i,计算公式为 β₀ + β₁ × 预测变量。
Residual 残差
The difference between an observed value and its fitted value: residual = observed − predicted. Residuals are used to diagnose whether model assumptions are met.
观测值与拟合值之差:残差 = 观测值 − 预测值。残差用于诊断模型假设是否满足。
Standard deviation (σ) 标准差
A measure of spread around the mean. In a linear model, σ describes how much individual observations vary around the fitted line.
均值周围数据分散程度的度量。在线性模型中,σ 描述各观测值围绕拟合线的变异程度。
Coefficient 系数
An estimated parameter from the model (e.g. the intercept or slope). Reported in the Estimate column of summary() output.
模型估计的参数(如截距或斜率),显示在 summary() 输出的 Estimate 列中。
Inference and uncertainty
Confidence interval 置信区间
A range of values within which the true parameter is likely to fall, with a specified level of confidence (usually 95%). Covered in detail next week.
在指定置信水平(通常为95%)下,真实参数值可能落入的值域范围,将在下周详细介绍。
Biased estimate 有偏估计
An estimate that is systematically too high or too low because the model is misspecified (e.g. fitting a straight line to curved data). Biased estimates should not be used for inference.
由于模型设定错误(如对曲线数据拟合直线)而系统性偏高或偏低的估计值,有偏估计不应用于推断。
R functions
lm() lm()
The R function for fitting a linear model. Takes a formula (response ~ predictor) and a data frame. Returns a model object that can be passed to summary(), predict(), and other functions.
R 中用于拟合线性模型的函数。接受公式(响应变量 ~ 预测变量)和数据框,返回可传递给 summary()、predict() 等函数的模型对象。
summary() for lm summary() 用于线性模型
Prints the key results from a fitted lm object, including the estimated intercept and slope, standard errors, and model fit statistics.
打印拟合线性模型的关键结果,包括截距和斜率的估计值、标准误差及模型拟合统计量。
predict() predict()
Calculates fitted values from a model for a given set of predictor values. Use predict(model, newdata = ...) to generate predictions for new data.
根据给定的预测变量值,从模型计算拟合值。使用 predict(模型, newdata = ...) 为新数据生成预测值。
ggeffects ggeffects
An R package that automatically generates model predictions and confidence intervals, and plots them with ggplot2. Useful for quick visualisation of model fits.
自动生成模型预测值和置信区间,并用 ggplot2 绘图的 R 包,适用于快速可视化模型拟合结果。
Assumptions
Linearity (assumption) 线性性(假设)
The assumption that the relationship between the predictor and the response follows a straight line. Violated when the true relationship curves, as in the beetle temperature data.
预测变量与响应变量之间关系呈直线的假设。当真实关系为曲线时(如甲虫温度数据),该假设被违反。
Dataset terms
Fecundity 繁殖力
The reproductive output of an organism, typically measured as the number of offspring produced. The response variable in the burying beetle dataset.
生物体的生殖产出,通常以产生的后代数量衡量,是埋葬甲虫数据集中的响应变量。