abdn <- read.table("abdn_flats.txt", header = TRUE)Workshop 7: Linear models with multiple predictors
BI3010 Statistics for Biologists
Learning Objectives
- Distinguish between predictive and causal modelling goals, and explain how the goal shapes which variables to include.
- Use a DAG to identify confounders, pipes, and colliders, and apply the backdoor criterion to select adjustment variables.
- Fit a multiple linear regression model and interpret its output in both a predictive and a causal context.
- Diagnose a multiple-predictor model and identify the consequences of common assumption violations.
- Avoid the Table 2 Fallacy by recognising that not all coefficients in a causal model should be interpreted causally.
Workshop structure
- Formulate a research question.
- Perform exploratory data analysis (EDA).
- Identify any hidden assumptions.
- Fit an appropriate model [Focus of today].
- Diagnose the model and check assumptions.
- Summarise the results.
- Interpret and draw conclusions.
Today we work through the same dataset twice: once to build a predictive model and once to build a causal model. Using the same data for both goals shows how the purpose of the model shapes every decision you make about it.
The dataset is the Aberdeen rental flats data from Workshop 2. If you no longer have it, download it below and save it to your BI3010/data/ folder.
Predictive modelling
Predictive models are a natural starting point for multiple-predictor models because the inclusion criteria are simple: add variables that improve predictions. We don’t need to worry about causal assumptions here (don’t tell the Machine Learning folks I said that!); we just want the model to guess the rent accurately.
1. Formulate a research question
The question is: can we predict the monthly rent of a listed property? We’ll use floor area (m²), total number of rooms, energy efficiency band (EPC), and council tax band as predictors.
2. Perform exploratory data analysis
Spend a few minutes exploring the dataset. Check for odd values. The dataset includes a URL for each property, so you can check unusual entries directly, a rare luxury in science!
One thing to be aware of: some properties may have had their listed price changed since the data were collected. For example, there’s one property that when I first added it to the dataset was listed at a whopping £3,200 a month but a year later, that price had dropped to £2,900 (because no one was willing to pay £3.2k in rent). You’ll need to decide for yourself how to handle these. Your options are: (1) keep the original price, (2) update to the current price, or (3) create a new row for the “new” listing. Option 3 violates an assumption. Can you identify which one?
4. Fit an appropriate model
In predictive modelling, we can include all sensible explanatory variables without worrying about the underlying causal relationships. The dataset variables are:
| Variable | Description |
|---|---|
Price |
Monthly rent (£) |
Bedrooms |
Number of bedrooms |
PublicRooms |
Number of living rooms |
Rooms |
Total rooms (bedrooms + living rooms) |
Bathrooms |
Number of bathrooms |
FloorArea |
Total floor area (m²) |
EPC |
Energy Performance Certificate band (A = most efficient, G = least) |
Tax |
Council tax band (A = lowest, H = highest) |
DayAdded |
Date the record was added to the dataset |
PropertyURL |
Listing URL |
Address |
Property address |
Latitude |
Latitude coordinates |
Longitude |
Longitude coordinates |
UniCommuteTime |
Walking time to the Zoology Building (minutes) |
HouseType |
Detached, semi-detached, terrace, or flat |
Furnished |
Fully furnished, part furnished, or unfurnished |
We’ll keep it to four predictors: FloorArea, Rooms, EPC, and Tax.
For some reason, the value of a property in 1991 is how the council tax band is decided. I don’t get why 1991 is special (I doubt it actually is) but that’s what they do. It’s dumb. C’est la vie.
One complication: EPC and Tax are inherently continuous quantities (energy efficiency and property value) that have been discretised into letter bands. This forces R to create a separate dummy variable for every band except the reference. Written out in full, the model equation is:
\[Price_i \sim \text{Normal}(\mu_i, \sigma)\]
\[\begin{align} \mu_i = \; &\beta_0 + \beta_1 \times FloorArea_i + \beta_2 \times Rooms_i \\ + \; &\beta_3 \times EPC\text{B}_i + \beta_4 \times EPC\text{C}_i + \beta_5 \times EPC\text{D}_i \\ + \; &\beta_6 \times EPC\text{E}_i + \beta_7 \times EPC\text{F}_i + \beta_8 \times EPC\text{G}_i \\ + \; &\beta_9 \times Tax\text{B}_i + \beta_{10} \times Tax\text{C}_i + \beta_{11} \times Tax\text{D}_i \\ + \; &\beta_{12} \times Tax\text{E}_i + \beta_{13} \times Tax\text{F}_i + \beta_{14} \times Tax\text{G}_i \end{align}\]
where \(\beta_0\) is the predicted rent for a property with zero floor area, zero rooms, EPC band A, and tax band A; \(\beta_1\) is the change per additional m²; \(\beta_2\) is the change per additional room; and \(\beta_3\) through \(\beta_8\) and \(\beta_9\) through \(\beta_{14}\) are the differences between each band and the reference (band A in each case).
As you can see, EPC and Tax being recorded as categories has made our model stupidly and crazily more complicated than it would otherwise be. That’s 14 parameters, compared to just 4 if they had been kept as continuous scores. Discretising continuous variables into categories creates a lot of extra work and often loses information in the process. Avoid it where you can.
Despite the notation complexity, fitting the model in R is straightforward:
rent_mod <- lm(Price ~ FloorArea + Rooms + EPC + Tax, data = abdn)
summary(rent_mod)This produces a giant list of estimated parameters, reminding us why it’s best to avoid excessive categorisation. Keep this in mind for your future work (including your Honour’s project), both for simplicity and for more meaningful analyses!
5. Diagnose the model and check assumptions
par(mfrow = c(2, 2))
plot(rent_mod)
par(mfrow = c(1, 1))Question set 1
Question 1. Do the Residuals vs Fitted and Scale-Location plots suggest any assumption is violated?
Yes: residual spread tends to increase with fitted values, a sign of heteroscedasticity (unequal variance of errors). This isn’t surprising given that the dataset mixes houses and flats of very different price ranges.
Question 2. What is the main consequence of violating the equal variance of errors assumption?
Heteroscedasticity inflates standard errors, so confidence intervals become either too wide or too narrow. The parameter estimates (predicted means) remain unbiased, but uncertainty estimates aren’t reliable.
Question 3. Does the Q-Q Residuals plot suggest a normality problem?
The residuals aren’t perfectly normally distributed. A small number of properties deviate notably from the diagonal line, either cheaper or more expensive than the model expects given their characteristics. The departure isn’t extreme, but it’s present. Minor deviations of this kind are expected in real data and aren’t grounds for rejecting the model outright.
Question 4. Does the Residuals vs Leverage plot flag any observations with disproportionate influence on the model?
Yes. One or more observations typically appear with notably high Cook’s distance. These are usually expensive properties (large houses) whose price is hard for a model fitted mainly on flats to predict well.
Question 5. If there are influential observations, what should you do?
You have three options, in increasing order of rigour:
- Collect more data. Influential observations are most damaging in small datasets. More data dilutes their effect. Not practical here, but the right long-term solution.
- Remove and rerun. Many papers do this, but only if they also report the original results and clearly explain why they removed observations. Silently dropping data isn’t acceptable.
- Fit both models and compare. Run the model with and without the influential observations, then compare the parameter estimates side by side. If removing them barely changes the estimates, keep the original model and note that you checked. If the estimates shift substantially, report both models and discuss which you think is more reliable.
# Example: remove rows 324 and 534 and compare
abdn_trimmed <- abdn[-c(324, 534), ]
trimmed_rent_mod <- lm(Price ~ FloorArea + Rooms + EPC + Tax, data = abdn_trimmed)
coef(trimmed_rent_mod)
coef(rent_mod)
Your row numbers may differ depending on how you handled any data issues earlier. Remove any observation whose Cook’s distance is at or above 1, rerun the model, and check the diagnostics again.
If you identified influential observations, remove them and refit the model before proceeding:
abdn <- abdn[-c(row_number), ] # replace row_number with your identified row(s)
rent_mod <- lm(Price ~ FloorArea + Rooms + EPC + Tax, data = abdn)
par(mfrow = c(2, 2))
plot(rent_mod)
par(mfrow = c(1, 1))6. Summarise the results
With multiple predictors, ggeffects makes plotting much easier than building every prediction data frame by hand:
library(ggeffects)
summary(rent_mod)
preds <- ggpredict(rent_mod)
plot(preds)If you want to appease the ggeffects gods, feel free to convert your categorical variables to factors first:
abdn$EPC <- factor(abdn$EPC)
abdn$Tax <- factor(abdn$Tax)
rent_mod <- lm(Price ~ FloorArea + Rooms + EPC + Tax, data = abdn)
preds <- ggpredict(rent_mod)
plot(preds)To build the FloorArea prediction plot manually, holding other predictors at representative values:
fake <- data.frame(
FloorArea = seq(from = min(abdn$FloorArea), to = max(abdn$FloorArea), by = 5),
Rooms = median(abdn$Rooms),
Tax = "D",
EPC = "D"
)
preds <- predict(rent_mod, newdata = fake, se.fit = TRUE)
fake$fit <- preds$fit
fake$low <- preds$fit - preds$se.fit * 1.96
fake$upp <- preds$fit + preds$se.fit * 1.96
ggplot(fake) +
geom_ribbon(aes(x = FloorArea, ymin = low, ymax = upp), fill = "lightgrey", alpha = 0.8) +
geom_line(aes(x = FloorArea, y = fit)) +
labs(x = "Square metres", y = "Predicted monthly rent (£)") +
theme_minimal()Pay attention to the Estimate for FloorArea. The model is predicting that adding an extra square metre to a property only increases rent by a few pounds. Seems a bit low, right? Do we really think the difference in price between a 50 m² flat and a 100 m² flat will only be a small amount extra per month? We’ll come back to this after we run the causal version of this model.
Question set 2
Question 1. Can we read the coefficient for FloorArea from this model as a causal effect, for example “adding 10 m² to a property causes rent to increase by X”?
No. The predictive model includes variables to improve predictive accuracy, not because a causal analysis told us to. When all of FloorArea, Rooms, EPC, and Tax are in the model, each coefficient represents the association with price after adjusting for the others, but “adjusting for” isn’t the same as “identifying the causal effect of.” The causal model section below shows how the goal changes the analysis.
Question 2. The coefficient for FloorArea in this model is probably smaller than you might expect, only a few pounds per m². Why might that be?
When FloorArea, Rooms, EPC, and Tax are all in the model, each coefficient only captures the portion of the association not explained by the others. FloorArea drives much of the variation in Rooms, EPC, and Tax as well. By including all of them together, the model distributes the “credit” for predicting rent across all four variables, so FloorArea’s individual coefficient ends up smaller than the true causal effect of floor area alone. The causal model below, which uses a DAG to correctly specify the model, will produce a larger and more interpretable coefficient.
Question 3. Go to ASPC and find a recently listed rental property. Use your model to predict its monthly rent, then compare the prediction with the listed price.
# Example using a hypothetical property with 3 rooms, 61 m², EPC C, Tax C
my_property <- data.frame(
Rooms = 3,
FloorArea = 61,
EPC = "C",
Tax = "C"
)
predict(rent_mod, newdata = my_property)
# Add 95% confidence interval
predict(rent_mod, newdata = my_property, interval = "confidence")
The residual is the difference between your model’s prediction and the listed price. If the true price falls within the confidence interval, the model’s uncertainty correctly captured the listing.
If you want to see how this kind of model could be translated into something you might see online, with some marketing saying something like “We’re using the most advanced Generative Artificial Intelligence to predict how much rent you should be asking for”, see this demonstration app. This is a little toy website I made (entirely in R). Two things to note: there’s a decent chance the website will crash if lots of people try to use it at the same time, and the “Fancy model” also accounts for spatial patterns in rent across Aberdeen.
Causal modelling
We now switch goals. Instead of predicting rent as accurately as possible, we want to understand the causal effect of floor area on rent: if a property had an extra 10 m², how much more would we expect it to rent for?
This is a different question and requires a different analytical approach.
Confounders
Before starting the analysis, let’s go over the three types of confounds again to help solidify your understanding. Below we’ll go through each type in turn and see the effect of including the confounded variable in a model or not.
The pipe: mediator sitting between cause and effect
In a pipe, Z is caused by the predictor of interest X and in turn causes the response variable Y: the structure is X → Z → Y. The effect of X on Y is real, but it flows through Z. Including Z in the model blocks the pathway, absorbing the X effect and making the X coefficient shrink towards zero. X still matters; you have removed the route through which it acts.
Click Z in the DAG to add or remove it from the model.
Within this data: FloorArea → Rooms → Price. Larger properties have more rooms, and more rooms drives price. Including Rooms when asking about the causal effect of FloorArea would block this path. The correct causal model leaves Rooms out.
Question 1. In the structure FloorArea → Rooms → Price, what causal role does Rooms play?
Rooms is a pipe. Floor area causes the number of rooms a property has (larger properties have more rooms), and the number of rooms in turn influences price. The causal chain runs FloorArea → Rooms → Price: Rooms is a step along the route from cause to effect, not a separate cause of both.
Question 2. When estimating the total causal effect of FloorArea on Price, should Rooms be included in the model?
No. Including a pipe blocks the causal pathway it sits on. If Rooms enters the model, the model estimates how much FloorArea affects Price over and above the rooms it provides: a narrower, less interpretable quantity. To capture the total effect of floor area (everything it does to price, including through the rooms it provides), leave Rooms out.
The fork: a confounder causing both predictor and response
In a fork, Z causes both X and Y: the structure is X ← Z → Y. X and Y appear associated, but neither causes the other. The association is driven by Z. Without adjusting for Z, the X coefficient picks up Z’s influence and is inflated. Including Z closes the confounded path.
Click Z in the DAG to add or remove it from the model.
Within this data: EPC ← FloorArea → Price. Larger properties tend to have better energy efficiency ratings and also cost more. Any apparent EPC effect on price in a model without FloorArea would partly be floor area in disguise. Including FloorArea as an adjustment variable closes this confounded path.
Question 3. In the structure EPC ← FloorArea → Price, what causal role does FloorArea play?
FloorArea is a fork (confounder). Larger properties tend to have better energy efficiency ratings because there’s more scope for insulation (FloorArea → EPC), and larger properties also command higher rents (FloorArea → Price). Without adjusting for FloorArea, any apparent EPC–Price association partly reflects this shared driver rather than a genuine effect of energy efficiency on rent.
Question 4. When estimating the causal effect of EPC on Price, what should you do with FloorArea?
Include FloorArea. It sits at the top of a fork: it causes both the predictor of interest (EPC) and the response (Price). That fork is a backdoor path, and without closing it, the EPC coefficient picks up FloorArea’s influence on Price as well as any genuine EPC effect. Adding FloorArea to the model closes this path. Note that FloorArea is included only as an adjustment variable; its own coefficient should not be given a causal interpretation (this would be the Table 2 Fallacy).
The collider: including it manufactures a spurious association
In a collider, both X and Y independently cause Z: the structure is X → Z ← Y. X and Y have no association. Including Z in the model opens a confounded path and creates a spurious relationship from nothing.
A concrete example: imagine both floor area and property condition independently influence how quickly a property lets. If you restrict your analysis to only recently let properties (conditioning on letting speed), you introduce a spurious negative correlation between floor area and condition, because among fast-letting properties, a small floor area must be “compensated for” by good condition, and vice versa.
The collider structure doesn’t appear in the current flats DAG, but it’s the hardest mistake to spot. “Controlling for more variables” sounds rigorous, but adding a collider makes the model worse, not better. Click Z to see the effect.
Question 5. A colleague adds a variable Z to their causal model because “it’s associated with both X and Y, so it must be important to control for.” Z is in fact caused by both X and Y. What is wrong with this reasoning?
Z is a collider. Because both X and Y cause Z, conditioning on Z (including it in the model) opens a path between X and Y that was previously closed. The result is a spurious association between X and Y: bias that the model introduces rather than removes. Being associated with both predictor and response is a description of a confounder, a pipe, and a collider: the association alone doesn’t tell you which structure applies. Only the arrows in the DAG do that.
Start by reloading the data cleanly so that any changes made for the predictive model don’t carry over:
abdn <- read.table("abdn_flats.txt", header = TRUE)1. Formulate a research question
“How much does monthly rent change when the floor area of a property increases?” We want the causal effect of FloorArea on Price, not just a predictive association.
Creating a DAG
To estimate a causal effect responsibly, we need to reason about confounding. We do this by drawing a Directed Acyclic Graph (DAG) that captures our beliefs about how variables cause one another. The check_dag() function from the performance package helps build and interrogate DAGs in R.
This requires the developer version of performance. Install it with:
install.packages(c("performance", "see"), repos = "https://easystats.r-universe.dev")Then build the DAG. Each line states what causes what. For example, Rooms ~ FloorArea says “the number of rooms in a property is caused by its floor area.”
library(performance)
library(see)
rent_dag <- check_dag(
Price ~ Rooms + FloorArea + Tax + EPC,
Rooms ~ FloorArea,
EPC ~ FloorArea,
Tax ~ FloorArea,
exposure = "FloorArea",
outcome = "Price"
)
plot(rent_dag)The exact layout of variables in the plot is semi-random, so your figure may look slightly different each time, but the arrows should be the same.
The plot shows two versions of the DAG. The left panel shows the model as specified by exposure and outcome alone (Price ~ FloorArea). The right panel shows what would be needed to estimate the total causal effect of FloorArea on Price. In this case, both panels are identical.
Why? Because FloorArea drives Rooms, EPC, and Tax, and those variables all flow into Price. These are mediating paths (pipes): FloorArea → Rooms → Price, and FloorArea → EPC → Price, and so on. When we ask “what is the total causal effect of FloorArea?”, we want to include all the ways extra floor area changes price, including through the extra rooms it allows and the tax band it implies. Blocking those paths by adding Rooms, EPC, and Tax to the model would give us only the direct effect of FloorArea, not the total effect.
To confirm, print the DAG object:
rent_dagThis confirms: to estimate the total causal effect of FloorArea on Price, no additional adjustment variables are needed. The correct model is simply:
lm(Price ~ FloorArea, data = abdn)This will not always be the case. Most of the time, estimating a causal effect requires including specific adjustment variables to close non-causal (backdoor) paths. It’s a coincidence of this particular DAG that no adjustment is needed here.
4. Fit an appropriate model
rent_causal_mod <- lm(Price ~ FloorArea, data = abdn)5. Diagnose the model and check assumptions
par(mfrow = c(2, 2))
plot(rent_causal_mod)
par(mfrow = c(1, 1))Two issues commonly appear:
Heteroscedasticity. Residual spread tends to increase with fitted values. There are two reasons for this. First, landlords tend to round rents to the nearest £25 to £50 for cheaper properties and to the nearest £50 to £100 for more expensive ones, which creates measurement “noise” that scales with price. Second, it’s easier to pin down the right rent for a cheaper property (the difference between £500 and £750 is proportionally large) whereas for expensive properties a £250 difference is relatively small, so estimates naturally vary more.
Influential observations. One or more expensive properties may appear beyond the Cook’s distance contour lines. Confirm whether the recorded price is correct using the PropertyURL. If the price is correct, you have two options: ignore the problem and note it when reporting, or remove the observation.
Fix any influential observations and rerun:
abdn_trimmed <- abdn[-c(row_number), ] # replace with your identified row(s)
rent_causal_mod <- lm(Price ~ FloorArea, data = abdn_trimmed)
par(mfrow = c(2, 2))
plot(rent_causal_mod)
par(mfrow = c(1, 1))6. Summarise the results
summary(rent_causal_mod)Look at the coefficient for FloorArea. It should be noticeably larger than the same coefficient in the predictive model. The causal model, which keeps all the mediating paths open, attributes the full effect of extra floor area to FloorArea alone. The predictive model spread that effect across FloorArea, Rooms, EPC, and Tax, making each individual coefficient smaller.
Plot the estimated relationship:
fake <- data.frame(
FloorArea = seq(from = min(abdn$FloorArea), to = max(abdn$FloorArea), length.out = 50)
)
preds <- predict(rent_causal_mod, newdata = fake, se.fit = TRUE)
fake$fit <- preds$fit
fake$low <- preds$fit - preds$se.fit * 1.96
fake$upp <- preds$fit + preds$se.fit * 1.96
ggplot(fake) +
geom_ribbon(aes(x = FloorArea, ymin = low, ymax = upp), fill = "lightgrey", alpha = 0.7) +
geom_line(aes(x = FloorArea, y = fit)) +
labs(x = "Floor area (m²)", y = "Predicted monthly rent (£)") +
theme_minimal()Assuming the DAG is correct, the slope from this model is our best estimate of the total causal effect of floor area on rent: the expected change in monthly rent for each additional m², accounting for all the downstream ways that extra space affects price.
What is the causal effect of EPC on price?
Repeat the full causal workflow, but now asking: “What is the causal effect of energy efficiency band (EPC) on rent?” The DAG structure is the same, but the exposure changes.
Reload the data cleanly before starting:
abdn <- read.table("abdn_flats.txt", header = TRUE)
EPC causal analysis. Build the DAG with EPC as the exposure, identify which model to run, fit it, diagnose it, and interpret the results. Compare the DAG-recommended model with what you would have fitted without a DAG.
library(performance)
library(see)
# Same DAG structure, different exposure
# (But keep in mind, this doesn't include things like how nicely it's decorated,
# its general maintenance, how greedy the landlord is, etc.)
rent_dag_epc <- check_dag(
Price ~ Rooms + FloorArea + Tax + EPC,
Rooms ~ FloorArea,
EPC ~ FloorArea,
Tax ~ FloorArea,
exposure = "EPC",
outcome = "Price"
)
plot(rent_dag_epc)
rent_dag_epc
# check_dag() now recommends including FloorArea to block the backdoor path:
# FloorArea → EPC (exposure), but also FloorArea → Price (directly)
# Without adjusting for FloorArea, we confound the EPC → Price relationship
EPC_mod <- lm(Price ~ EPC + FloorArea, data = abdn)
par(mfrow = c(2, 2))
plot(EPC_mod)
par(mfrow = c(1, 1))
# Diagnostics are generally acceptable
summary(EPC_mod)
library(ggeffects)
plot(ggpredict(EPC_mod, terms = "EPC"))
The DAG recommends including FloorArea as an adjustment variable to block the backdoor path (FloorArea causes both EPC and Price). Without it, any apparent EPC effect would be partly a floor area effect in disguise.
Important: in this model, FloorArea is included only to adjust for confounding. Its coefficient should not be interpreted causally; doing so would be the Table 2 Fallacy.
The EPC effect estimates are often noisy with this dataset, with wide confidence intervals that span zero for some bands. This either means the DAG is more complex than assumed, or the dataset is too small to estimate the EPC effect reliably with this approach.
A more complete DAG (accounting for property age, property type, and historic price) would likely change which variables need to be included, and might reveal that the current dataset cannot answer this causal question at all without collecting additional variables.
Assumption reference
| Assumption | Description | Example of violation |
|---|---|---|
| 1. Validity | The data can answer the research question. | Using IQ to measure overall intelligence. |
| 2. Representativeness | The data reflects the population being studied. | Using survey data from a rural area to draw conclusions about an urban population. |
| 3. Additivity | Effects of predictors are additive with no interactions. | Assuming study hours and sleep independently affect exam scores without considering their combined effect. |
| 4. Linearity | The relationship between predictors and the outcome is linear. | Assuming plant growth continues to increase indefinitely without levelling off. |
| 5. Independence of errors | Observations are independent of each other. | Modelling height using repeated measurements of the same children over time. |
| 6. Equal variance of errors | Residuals have constant variance across all predictor levels. | Modelling income, where salary variability differs across levels of a company hierarchy. |
| 7. Normality of errors | Residuals follow a normal distribution. | Modelling income, where a few extremely wealthy individuals create extreme residuals. |