rnorm(n = 10, mean = 200, sd = 10)Workshop 3: Linear models with a continuous predictor
BI3010 Statistics for Biologists
Learning Objectives
- Decide whether a Normal distribution is a reasonable assumption for a response variable.
- Fit a linear model in R using
lm(). - Interpret the intercept and slope from model output.
- Produce a figure of the fitted relationship using
ggeffectsor manual predictions.
Workshop structure
Starting this week, we apply the full analytical workflow to a single dataset before working through an independent one. Today’s focus is step 4.
- Formulate a research question.
- Perform exploratory data analysis.
- Identify any hidden assumptions.
- Fit an appropriate model [Focus of today].
- Diagnose the model and check assumptions.
- Summarise the results.
- Interpret and provide inferences.
Formulate a research question
For this first workshop on linear models, we return to a familiar dataset: boto.txt from Workshop 2. As a reminder, it contains:
sex: sex of the boto dolphindrought: number of years in the past 20 where drought was recorded in the dolphin’s regionlength: length of the boto dolphin (cm)
The research question is whether the number of droughts in the past 20 years affects boto length. The broader motivation is climate change and dam construction in the Amazon basin: if droughts become more frequent, could dolphins become smaller and be in poorer condition?
If you no longer have boto.txt from Workshop 2, download it here and place it in your data/ folder:
From the question we can identify:
- Response variable:
length - Predictor variable:
drought
Before fitting any model, we need to decide whether a Normal distribution is a reasonable description of the data-generating process for boto length. Formally, we’re asking whether the following is plausible:
\[length_i \sim \text{Normal}(\mu_i, \sigma)\]
Here, \(length_i\) is the length of dolphin \(i\), assumed to follow a Normal distribution with mean \(\mu_i\) (the model’s predicted length for that dolphin) and standard deviation \(\sigma\).
Is a Normal distribution appropriate?
The first question is whether length is continuous or discrete. A discrete variable can only take integer values (1 cm, 2 cm, etc.). A continuous variable can in principle take any value within a range. Although the researchers rounded lengths to the nearest centimetre, actual dolphin length isn’t constrained to whole-centimetre increments. length is inherently continuous, and the Normal distribution generates continuous values. To see this, run:
The generated values look much like the lengths in the boto dataset, which is reassuring. The only visible difference is that the researchers rounded their measurements.
The next question is whether there are natural bounds that might cause problems. Boto length cannot be zero or negative, and the Normal distribution can technically produce such values. However, the minimum observed length is well above zero, so the risk of the model predicting physically impossible values is low. A Normal distribution is a reasonable choice here, though we should verify this after fitting by checking that the model doesn’t produce implausible predictions. If it did, we’d switch to a different distribution using a Generalised Linear Model (GLM), which we cover later in the course.
Choosing a distribution for the response variable is separate from the assumption of “normally distributed errors” (or “normally distributed residuals”). The choice of distribution is based on the nature of the response variable and can be made before even looking at the data. The residuals assumption can only be checked after the model has been fitted.
The Normal distribution in practice
Question 1. Does a Normal distribution generate discrete or continuous values?
Continuous. Although values in a dataset may subsequently be rounded (e.g. to the nearest cm), the underlying process generates continuous values.
Question 2. Run rnorm(n = 100, mean = 50, sd = 20). What is the smallest value you get? Try increasing the standard deviation and observe what happens.
Because these are random numbers, results vary each run. With a mean of 50 and a standard deviation of 20, values below zero are quite plausible, especially if you increase the SD. You can find the minimum quickly with:
min(rnorm(n = 100, mean = 50, sd = 20))
This illustrates that a Normal distribution can produce values below zero, which matters when the response variable has a natural lower bound.
Question 3. If the values from rnorm(n = 100, mean = 50, sd = 20) were your response variable, would it be reasonable to assume a Normal distribution in your model?
Yes. The data were generated by a Normal distribution, so a model assuming Normal errors is entirely appropriate. The subtlety is that in the boto case we’re balancing biology with statistics: dolphins cannot be zero or negative centimetres long, but we can reasonably proceed because no observed values are anywhere near that boundary.
Question 4. If your data were generated by a Normal distribution, does that automatically satisfy the assumption of normality?
No. The assumption of normality refers to the distribution of the residuals (differences between observed and predicted values), not the raw data, and not the distribution we assumed generated the data. Even if the underlying data are normally distributed, the residuals from a poorly specified model may not be. This can only be checked after fitting the model.
Exploratory data analysis
You worked with this dataset in Workshop 2, so EDA is already done. If you want a refresher, copy the relevant code from your Workshop 2 script.
Fit an appropriate model
The model we will fit is:
\[length_i \sim \text{Normal}(\mu_i, \sigma)\] \[\mu_i = \beta_0 + \beta_1 \times drought_i\]
Where:
- \(length_i\) is the length of dolphin \(i\), assumed to follow a Normal distribution with mean \(\mu_i\) and standard deviation \(\sigma\).
- \(\beta_0\) is the intercept: the predicted length when
drought = 0. - \(\beta_1\) is the slope: the change in predicted length for each additional year of drought.
Fitting this in R takes one line:
boto_model <- lm(length ~ drought, data = boto)Translating the equation to code
lm()fits a linear model, which by default assumes the response follows a Normal distribution.- The formula
length ~ droughtspecifies the relationship.lengthis the response variable;droughtis the predictor. This corresponds directly to \(\mu_i = \beta_0 + \beta_1 \times drought_i\). R adds the intercept automatically. - We store the result in
boto_modelso we can use it later without re-running.
Now look at the results:
summary(boto_model)The Estimates column gives us \(\beta_0\) and \(\beta_1\): 213.3 and −5.3, respectively.
What the parameters mean
To understand what these numbers represent, substitute into the equation. Start with the intercept. If there has never been a drought, then drought = 0:
\[\mu_i = \beta_0 + \beta_1 \times 0 = \beta_0\]
Any number multiplied by zero is zero, so only \(\beta_0\) remains. The intercept is the predicted length when the predictor equals zero: when drought = 0, the model predicts dolphins are approximately 213.3 cm long. The intercept is always “the expected value of Y when X equals zero.”
The slope, \(\beta_1 = -5.3\), tells us how much the predicted length changes per additional year of drought. For every extra year of drought, the model expects dolphins to be 5.3 cm shorter on average. Starting from drought = 0 at a length of 213 cm, each step to the right along the x-axis drops the prediction by 5.3 cm.
The figure below illustrates this:
Nobody talks much about \(\sigma\), but it’s doing important stuff in the model. \(\beta_0\) and \(\beta_1\) describe the line; \(\sigma\) describes the cloud of points around it. Two dolphins with the same drought history won’t be the same length. One is older, one had a better feeding season, one was awkward to measure. None of that is drought, and all of it ends up in \(\sigma\).
In the output, \(\sigma\) is called the Residual standard error. For a Normal distribution, about two thirds of observations fall within one \(\sigma\) of the fitted line. Friday’s session is where we actually dig into what that number is telling us. For now, just know it exists and matters.
Predictions from the equation
Question 1. The model estimated β0 = 213.3 and β1 = −5.3. Write out the completed equation:
μi = + × droughti
μi = 213.3 + (−5.3) × droughti
Question 2. Using your completed equation, what length would you predict for a dolphin in a region with zero years of drought?
213.3 + (−5.3) × 0 = 213.3 cm
Question 3. What length would you predict if there had been eight years of drought?
213.3 + (−5.3) × 8 = 170.9 cm
Question 4. What is the predicted difference in length between a dolphin that experienced two years of drought and one that experienced seven?
Two years of drought: 213.3 + (−5.3) × 2 = 202.7 cm
Seven years of drought: 213.3 + (−5.3) × 7 = 176.2 cm
Difference: 202.7 − 176.2 = 26.5 cm
According to the model, the dolphin with seven years of drought would be 26.5 cm shorter.
Diagnose the model and check assumptions
We’ll cover this on Friday.
A note on decimal places
When reporting parameter estimates, how many decimal places should you use? There is no universal answer; it depends on context. The figures below show what happens when you round estimates to different levels of precision.
Here \(\beta_0 = -1.235\) and \(\beta_1 = 3.456\). Rounding to zero decimal places gives \(\beta_0 = -1\) and \(\beta_1 = 3\), producing a noticeably different line than rounding to two decimal places.
In this second example the explanatory variable runs up to one million and the true parameters are extremely small (\(\beta_0 = 0.0000000652\), \(\beta_1 = 0.0005161519\)). Rounding to two decimal places gives \(\beta_0 = 0.00\) and \(\beta_1 = 0.00\), which is useless.
The lesson: before rounding, ask how much the number changes and what the scale of your predictor is. An arbitrary rounding decision can meaningfully alter your results. In BI3010 assessments, the number of decimal places to report will always be specified.
Summarise the results
We’ll return to model diagnosis on Friday. For now, let’s turn the results into a figure. There are two ways to do this.
Option 1: ggeffects
The ggeffects package does the work for you:
install.packages("ggeffects") # Run once
library(ggeffects)
preds <- ggpredict(boto_model)
plot(preds)This produces the fitted line with a 95% confidence interval (we’ll cover confidence intervals next week). It’s quick but gives you limited control over the appearance.
Option 2: manual predictions
Making the figure yourself is not much harder and is more transparent. The idea is to create a small dataset of drought values and use predict() to calculate the model’s prediction for each one, exactly what you did by hand in the questions above.
First, create a dataset with the drought values you want to predict at:
fake <- data.frame(drought = 0:8)Then generate the predictions and add them to the dataset:
fake$fit <- predict(boto_model, newdata = fake)Try it yourself. Enter any drought values in the table below. The fit column fills in using 213.3 + (−5.3) × drought, and each point appears on the plot connected to the previous one. This is exactly what predict() does in R, repeated for each row.
Now plot:
ggplot(fake) +
geom_line(aes(x = drought, y = fit)) +
labs(x = "Number of drought years",
y = "Predicted length of boto (cm)") +
theme_minimal()Here’s the full workflow from loading the data to the final figure, as you might write it in a script:
# Deon Roos
# Analysis of boto length in relation to droughts
library(ggplot2)
# Open your R Project (.Rproj) before running this script
boto <- read.table("data/boto.txt", header = TRUE)
# EDA
summary(boto)
str(boto)
ggplot(boto) + geom_point(aes(x = drought, y = length))
ggplot(boto) + geom_point(aes(x = drought, y = length, colour = sex))
ggplot(boto) + geom_point(aes(x = drought, y = length, colour = sex)) + facet_wrap(~sex)
# No unusual observations detected in EDA
# Concerns prior to modelling:
# Measurement precision is high (unclear how lengths were measured at this accuracy)
# Validity: depends on measurement method
# Independence: related dolphins (families) may have been measured
# Additivity: drought may affect the sexes differently
# Fit model
boto_model <- lm(length ~ drought, data = boto)
# Diagnostics: covered in the next workshop
# Inference
summary(boto_model)
fake <- data.frame(drought = 0:8)
fake$fit <- predict(boto_model, newdata = fake)
ggplot(fake) +
geom_line(aes(x = drought, y = fit)) +
labs(x = "Number of drought years",
y = "Predicted length of boto (cm)") +
theme_minimal()The entire analysis is about 40 lines including comments. It’s reproducible and shareable. As a tip: don’t confuse script length with complexity. A long script is often a sign of uncertainty, not sophistication. Less is usually more.
Burying beetles: independent analysis
Using everything you have just learned, work through a new dataset: beetles.txt on MyAberdeen. This data comes from a lab experiment investigating how short-term temperature exposure (6 hours per day) affects fecundity in burying beetles (Nicrophorus vespilloides), with the aim of understanding how the species might respond to climate change.
Variables:
offspring: number of offspring producedtemperature: temperature (°C) to which beetles were exposed for 6 hours per day
Download beetles.txt and place it in your data/ folder:
Carry out an EDA, consider which assumptions are likely violated, fit an appropriate model, and plot the results. Make sure you are comfortable with the full workflow, as an equivalent analysis may appear in a quiz.
Worked answer. Once you have completed your own analysis, reveal the instructor’s version to compare approaches.
library(ggplot2)
library(ggeffects)
beetles <- read.table("data/beetles.txt", header = TRUE)
# EDA
str(beetles) # n = 234
summary(beetles) # minimum of 1 offspring, plausible if low
ggplot(beetles) +
geom_histogram(aes(x = offspring), colour = "white", binwidth = 1) +
theme_minimal()
# 1 offspring seems reasonable, just on the low end
ggplot(beetles) +
geom_histogram(aes(x = temperature), colour = "white", binwidth = 1) +
theme_minimal()
# Fairly even spread of temperatures between −10 and 35°C
ggplot(beetles) +
geom_point(aes(x = temperature, y = offspring)) +
theme_minimal()
# Clear non-linear trend: offspring increases then decreases as temperature rises
# Hidden assumption concerns:
# Linearity is almost certainly broken (the relationship curves)
# Equal variance of errors: possibly OK (the plot looks roughly homogeneous)
# Representativeness: this is a lab population; extrapolate to wild populations with caution
# Fit model
beet_mod <- lm(offspring ~ temperature, data = beetles)
summary(beet_mod)
# Intercept: 25.3, predicted offspring at 0°C
# Slope: 0.5, each 1°C increase adds roughly 0.5 offspring
# BUT: the model is biased (linearity is broken), so these estimates are unreliable.
# Do not use this model for inference.
preds <- ggpredict(beet_mod)
plot(preds, add.data = TRUE)
# The linear fit to a non-linear relationship is obvious in this plot.
# In reality, we would need to allow temperature to have a non-linear effect in the model.
Assumption reference
| Assumption | Description | Example of violation |
|---|---|---|
| 1. Validity | The data can answer the research question. | Using IQ to measure overall intelligence. |
| 2. Representativeness | The data reflects the population being studied. | Using survey data from a rural area to draw conclusions about an urban population. |
| 3. Additivity | Effects of predictors are additive with no interactions. | Assuming study hours and sleep hours independently affect exam scores without considering their combined effect. |
| 4. Linearity | The relationship between predictors and the outcome is linear. | Assuming plant growth continues to increase indefinitely without ever levelling off. |
| 5. Independence of errors | Observations are independent of each other. | Modelling height using repeated measurements of the same children over time. |
| 6. Equal variance of errors | Residuals have constant variance across all predictor levels (homoscedasticity). | Modelling income, where salary variability differs across levels of a company hierarchy. |
| 7. Normality of errors | Residuals are approximately normally distributed. | Modelling income, where a small number of extremely wealthy individuals create a heavily skewed residual distribution. |