Workshop 7: Linear models with multiple predictors

BI3010 Statistics for Biologists

Author

Roos & Pinard

Published

16 May 2026

Glossary / 词汇表
Model types
Multiple linear regression 多元线性回归
A linear model with more than one explanatory variable. Each predictor gets its own coefficient, estimated while holding all other predictors constant. Multiple linear regression models can serve either predictive or causal goals depending on how they are built and interpreted.
包含多个解释变量的线性模型。每个预测变量在其他变量保持不变的情况下,估计各自的系数。多元线性回归模型可根据构建方式和解读目的,服务于预测或因果两种不同目标。
Predictive model 预测模型
A model whose goal is to produce accurate predictions of the response variable. Predictors can be included freely to improve accuracy without requiring causal justification. Example: given a flat with 60 m², 3 rooms, EPC C, and Tax band C, predict its monthly rent.
以准确预测响应变量为目标的模型。可以自由纳入预测变量以提高准确性,无需进行因果论证。示例:给定一套60平方米、3个房间、EPC等级C、税级C的公寓,预测其月租金。
Causal model 因果模型
A model whose goal is to estimate the causal effect of a specific predictor on the response variable. Which variables to include or exclude is determined by causal reasoning (e.g. a DAG), not predictive performance. Example: by how much does monthly rent increase for each additional square metre of floor area, if you could change only the floor area?
以估计特定预测变量对响应变量因果效应为目标的模型。纳入或排除哪些变量由因果推理(如DAG)决定,而非预测性能。示例:若仅改变建筑面积,每增加一平方米,月租金会增加多少?
Causal concepts
DAG (Directed Acyclic Graph) 有向无环图
A diagram representing assumed causal relationships between variables. Arrows point from cause to effect. Used with the Backdoor Criterion to decide which variables to include in a causal model.
表示变量间假定因果关系的图形。箭头从原因指向结果,结合后门准则决定因果模型中应纳入哪些变量。
Exposure 暴露变量
A term borrowed from medicine and epidemiology for the explanatory variable whose causal effect you are trying to estimate, equivalent to the predictor of interest. In observational research this is typically something subjects were exposed to rather than something a researcher assigned.
借自医学与流行病学的术语,指你试图估计其因果效应的解释变量,等同于"感兴趣的预测变量"。在观察性研究中,通常指受试者所"暴露"的某种因素,而非研究者人为分配的干预。
Outcome 结局变量
A term borrowed from medicine and epidemiology for the response variable: the quantity you are trying to explain or predict. Equivalent to the response or dependent variable in a regression context.
借自医学与流行病学的术语,指响应变量——即你试图解释或预测的量,等同于回归分析中的响应变量或因变量。
Confounder 混淆变量
A variable that is a common cause of both the predictor of interest and the response variable. Failing to account for a confounder creates a spurious association between the predictor and the response. Confounders arise from the fork structure in a DAG.
同时导致感兴趣的预测变量和响应变量的变量。未控制混淆变量会在预测变量与响应变量之间产生虚假关联,混淆变量源于DAG中的叉形结构。
Fork 叉形结构
A causal structure in which a third variable Z causes both the predictor of interest X and the response variable Y (X ← Z → Y). This creates a spurious association between X and Y that disappears when Z is included in the model. The fork is the structure that produces confounding; the fix is to adjust for Z.
一种因果结构,其中第三个变量Z同时导致感兴趣的预测变量X和响应变量Y(X←Z→Y)。这在X与Y之间产生虚假关联,纳入Z后关联消失。叉形结构是产生混淆的根本原因,解决方法是将Z纳入模型。
Pipe 管道变量
A variable that lies on the causal pathway from the predictor of interest to the response variable (predictor → pipe → response). Including a pipe in a causal model blocks part of the causal effect you are trying to estimate. Sometimes called a mediator.
位于感兴趣的预测变量到响应变量因果路径上的变量(预测变量→管道→响应变量)。在因果模型中纳入管道变量会阻断部分因果效应。有时也称为中介变量。
Collider 对撞变量
A variable Z that is caused by both the predictor of interest X and the response variable Y (X → Z ← Y). X and Y have no genuine association, but including Z in the model opens a spurious path between them and manufactures a false relationship. Colliders should be left out of causal models.
同时被感兴趣的预测变量X和响应变量Y所导致的变量Z(X→Z←Y)。X与Y之间本无真实关联,但将Z纳入模型会在它们之间开启虚假路径,制造出不存在的关系。对撞变量不应纳入因果模型。
Backdoor criterion 后门准则
A rule for determining which variables to condition on (include in the model) in order to block all non-causal paths between the predictor of interest and the response variable while keeping causal paths open.
确定哪些变量需要加以控制(纳入模型)的规则,旨在阻断感兴趣的预测变量与响应变量之间所有非因果路径,同时保持因果路径畅通。
Total causal effect 总因果效应
The combined effect of the predictor of interest on the response variable through all causal pathways, including indirect paths via pipes. Contrasts with the direct effect, which excludes paths running through pipes.
感兴趣的预测变量通过所有因果路径(包括经管道变量的间接路径)对响应变量的综合效应,与仅排除管道路径的直接效应相对。
Table 2 Fallacy "表2谬误"
The mistake of interpreting all coefficients in a multiple regression as causal effects. In a causal model targeting one predictor of interest, other included variables are there to adjust for confounding and their coefficients should not be interpreted causally.
将多元回归中所有系数都解读为因果效应的错误。在针对某一感兴趣的预测变量构建的因果模型中,其他变量的纳入是为了控制混淆,其系数不应作因果解释。
Model parameters
Discretisation 离散化
Converting a continuous variable into a categorical one (e.g. energy efficiency score → EPC band A to G). Discretisation loses information and greatly increases the number of parameters a model must estimate.
将连续变量转换为分类变量(如能效评分→EPC等级A至G)。离散化会损失信息,并大幅增加模型需要估计的参数数量。
Residual 残差
The difference between an observed value and its model-predicted value: residual = observed − predicted. Residuals are the core of all model diagnostics.
观测值与模型预测值之差:残差 = 观测值 − 预测值,是所有模型诊断的核心。

Learning Objectives

Learning objectives
  1. Distinguish between predictive and causal modelling goals, and explain how the goal shapes which variables to include.
  2. Use a DAG to identify confounders, pipes, and colliders, and apply the backdoor criterion to select adjustment variables.
  3. Fit a multiple linear regression model and interpret its output in both a predictive and a causal context.
  4. Diagnose a multiple-predictor model and identify the consequences of common assumption violations.
  5. Avoid the Table 2 Fallacy by recognising that not all coefficients in a causal model should be interpreted causally.

Workshop structure

The Analytical Workflow
  1. Formulate a research question.
  2. Perform exploratory data analysis (EDA).
  3. Identify any hidden assumptions.
  4. Fit an appropriate model [Focus of today].
  5. Diagnose the model and check assumptions.
  6. Summarise the results.
  7. Interpret and draw conclusions.

Today we work through the same dataset twice: once to build a predictive model and once to build a causal model. Using the same data for both goals shows how the purpose of the model shapes every decision you make about it.

The dataset is the Aberdeen rental flats data from Workshop 2. If you no longer have it, download it below and save it to your BI3010/data/ folder.

Download abdn_flats.txt

abdn <- read.table("abdn_flats.txt", header = TRUE)

Predictive modelling

Predictive models are a natural starting point for multiple-predictor models because the inclusion criteria are simple: add variables that improve predictions. We don’t need to worry about causal assumptions here (don’t tell the Machine Learning folks I said that!); we just want the model to guess the rent accurately.

1. Formulate a research question

The question is: can we predict the monthly rent of a listed property? We’ll use floor area (m²), total number of rooms, energy efficiency band (EPC), and council tax band as predictors.

2. Perform exploratory data analysis

Spend a few minutes exploring the dataset. Check for odd values. The dataset includes a URL for each property, so you can check unusual entries directly, a rare luxury in science!

One thing to be aware of: some properties may have had their listed price changed since the data were collected. For example, there’s one property that when I first added it to the dataset was listed at a whopping £3,200 a month but a year later, that price had dropped to £2,900 (because no one was willing to pay £3.2k in rent). You’ll need to decide for yourself how to handle these. Your options are: (1) keep the original price, (2) update to the current price, or (3) create a new row for the “new” listing. Option 3 violates an assumption. Can you identify which one?

3. Identify any hidden assumptions

Question 1. The dataset mixes houses and flats without distinguishing between them. What effect might this have on the predictive model?

Property type (house vs flat) almost certainly affects rent, so excluding it from the model leaves an important variable out. Predictions will be less accurate than they could be, perhaps by a lot, perhaps only a little. It will also tend to create unequal variance of errors: the model will struggle most with expensive properties (which are more often houses), producing larger residuals at the upper end of the predicted rent range.

4. Fit an appropriate model

In predictive modelling, we can include all sensible explanatory variables without worrying about the underlying causal relationships. The dataset variables are:

Variable Description
Price Monthly rent (£)
Bedrooms Number of bedrooms
PublicRooms Number of living rooms
Rooms Total rooms (bedrooms + living rooms)
Bathrooms Number of bathrooms
FloorArea Total floor area (m²)
EPC Energy Performance Certificate band (A = most efficient, G = least)
Tax Council tax band (A = lowest, H = highest)
DayAdded Date the record was added to the dataset
PropertyURL Listing URL
Address Property address
Latitude Latitude coordinates
Longitude Longitude coordinates
UniCommuteTime Walking time to the Zoology Building (minutes)
HouseType Detached, semi-detached, terrace, or flat
Furnished Fully furnished, part furnished, or unfurnished

We’ll keep it to four predictors: FloorArea, Rooms, EPC, and Tax.

For some reason, the value of a property in 1991 is how the council tax band is decided. I don’t get why 1991 is special (I doubt it actually is) but that’s what they do. It’s dumb. C’est la vie.

One complication: EPC and Tax are inherently continuous quantities (energy efficiency and property value) that have been discretised into letter bands. This forces R to create a separate dummy variable for every band except the reference. Written out in full, the model equation is:

\[Price_i \sim \text{Normal}(\mu_i, \sigma)\]

\[\begin{align} \mu_i = \; &\beta_0 + \beta_1 \times FloorArea_i + \beta_2 \times Rooms_i \\ + \; &\beta_3 \times EPC\text{B}_i + \beta_4 \times EPC\text{C}_i + \beta_5 \times EPC\text{D}_i \\ + \; &\beta_6 \times EPC\text{E}_i + \beta_7 \times EPC\text{F}_i + \beta_8 \times EPC\text{G}_i \\ + \; &\beta_9 \times Tax\text{B}_i + \beta_{10} \times Tax\text{C}_i + \beta_{11} \times Tax\text{D}_i \\ + \; &\beta_{12} \times Tax\text{E}_i + \beta_{13} \times Tax\text{F}_i + \beta_{14} \times Tax\text{G}_i \end{align}\]

where \(\beta_0\) is the predicted rent for a property with zero floor area, zero rooms, EPC band A, and tax band A; \(\beta_1\) is the change per additional m²; \(\beta_2\) is the change per additional room; and \(\beta_3\) through \(\beta_8\) and \(\beta_9\) through \(\beta_{14}\) are the differences between each band and the reference (band A in each case).

As you can see, EPC and Tax being recorded as categories has made our model stupidly and crazily more complicated than it would otherwise be. That’s 14 parameters, compared to just 4 if they had been kept as continuous scores. Discretising continuous variables into categories creates a lot of extra work and often loses information in the process. Avoid it where you can.

Despite the notation complexity, fitting the model in R is straightforward:

rent_mod <- lm(Price ~ FloorArea + Rooms + EPC + Tax, data = abdn)
summary(rent_mod)

This produces a giant list of estimated parameters, reminding us why it’s best to avoid excessive categorisation. Keep this in mind for your future work (including your Honour’s project), both for simplicity and for more meaningful analyses!

5. Diagnose the model and check assumptions

par(mfrow = c(2, 2))
plot(rent_mod)
par(mfrow = c(1, 1))

Question set 1

Question 1. Do the Residuals vs Fitted and Scale-Location plots suggest any assumption is violated?

Yes: residual spread tends to increase with fitted values, a sign of heteroscedasticity (unequal variance of errors). This isn’t surprising given that the dataset mixes houses and flats of very different price ranges.

Question 2. What is the main consequence of violating the equal variance of errors assumption?

Heteroscedasticity inflates standard errors, so confidence intervals become either too wide or too narrow. The parameter estimates (predicted means) remain unbiased, but uncertainty estimates aren’t reliable.

Question 3. Does the Q-Q Residuals plot suggest a normality problem?

The residuals aren’t perfectly normally distributed. A small number of properties deviate notably from the diagonal line, either cheaper or more expensive than the model expects given their characteristics. The departure isn’t extreme, but it’s present. Minor deviations of this kind are expected in real data and aren’t grounds for rejecting the model outright.

Question 4. Does the Residuals vs Leverage plot flag any observations with disproportionate influence on the model?

Yes. One or more observations typically appear with notably high Cook’s distance. These are usually expensive properties (large houses) whose price is hard for a model fitted mainly on flats to predict well.

Question 5. If there are influential observations, what should you do?

You have three options, in increasing order of rigour:

  1. Collect more data. Influential observations are most damaging in small datasets. More data dilutes their effect. Not practical here, but the right long-term solution.
  2. Remove and rerun. Many papers do this, but only if they also report the original results and clearly explain why they removed observations. Silently dropping data isn’t acceptable.
  3. Fit both models and compare. Run the model with and without the influential observations, then compare the parameter estimates side by side. If removing them barely changes the estimates, keep the original model and note that you checked. If the estimates shift substantially, report both models and discuss which you think is more reliable.
# Example: remove rows 324 and 534 and compare
abdn_trimmed <- abdn[-c(324, 534), ]
trimmed_rent_mod <- lm(Price ~ FloorArea + Rooms + EPC + Tax, data = abdn_trimmed)
coef(trimmed_rent_mod)
coef(rent_mod)

Your row numbers may differ depending on how you handled any data issues earlier. Remove any observation whose Cook’s distance is at or above 1, rerun the model, and check the diagnostics again.

If you identified influential observations, remove them and refit the model before proceeding:

abdn <- abdn[-c(row_number), ]   # replace row_number with your identified row(s)
rent_mod <- lm(Price ~ FloorArea + Rooms + EPC + Tax, data = abdn)
par(mfrow = c(2, 2))
plot(rent_mod)
par(mfrow = c(1, 1))

6. Summarise the results

With multiple predictors, ggeffects makes plotting much easier than building every prediction data frame by hand:

library(ggeffects)
summary(rent_mod)
preds <- ggpredict(rent_mod)
plot(preds)

If you want to appease the ggeffects gods, feel free to convert your categorical variables to factors first:

abdn$EPC <- factor(abdn$EPC)
abdn$Tax <- factor(abdn$Tax)
rent_mod  <- lm(Price ~ FloorArea + Rooms + EPC + Tax, data = abdn)
preds     <- ggpredict(rent_mod)
plot(preds)

To build the FloorArea prediction plot manually, holding other predictors at representative values:

fake <- data.frame(
  FloorArea = seq(from = min(abdn$FloorArea), to = max(abdn$FloorArea), by = 5),
  Rooms     = median(abdn$Rooms),
  Tax       = "D",
  EPC       = "D"
)
preds <- predict(rent_mod, newdata = fake, se.fit = TRUE)
fake$fit <- preds$fit
fake$low <- preds$fit - preds$se.fit * 1.96
fake$upp <- preds$fit + preds$se.fit * 1.96

ggplot(fake) +
  geom_ribbon(aes(x = FloorArea, ymin = low, ymax = upp), fill = "lightgrey", alpha = 0.8) +
  geom_line(aes(x = FloorArea, y = fit)) +
  labs(x = "Square metres", y = "Predicted monthly rent (£)") +
  theme_minimal()

Pay attention to the Estimate for FloorArea. The model is predicting that adding an extra square metre to a property only increases rent by a few pounds. Seems a bit low, right? Do we really think the difference in price between a 50 m² flat and a 100 m² flat will only be a small amount extra per month? We’ll come back to this after we run the causal version of this model.

Question set 2

Question 1. Can we read the coefficient for FloorArea from this model as a causal effect, for example “adding 10 m² to a property causes rent to increase by X”?

No. The predictive model includes variables to improve predictive accuracy, not because a causal analysis told us to. When all of FloorArea, Rooms, EPC, and Tax are in the model, each coefficient represents the association with price after adjusting for the others, but “adjusting for” isn’t the same as “identifying the causal effect of.” The causal model section below shows how the goal changes the analysis.

Question 2. The coefficient for FloorArea in this model is probably smaller than you might expect, only a few pounds per m². Why might that be?

When FloorArea, Rooms, EPC, and Tax are all in the model, each coefficient only captures the portion of the association not explained by the others. FloorArea drives much of the variation in Rooms, EPC, and Tax as well. By including all of them together, the model distributes the “credit” for predicting rent across all four variables, so FloorArea’s individual coefficient ends up smaller than the true causal effect of floor area alone. The causal model below, which uses a DAG to correctly specify the model, will produce a larger and more interpretable coefficient.

Question 3. Go to ASPC and find a recently listed rental property. Use your model to predict its monthly rent, then compare the prediction with the listed price.

# Example using a hypothetical property with 3 rooms, 61 m², EPC C, Tax C
my_property <- data.frame(
  Rooms     = 3,
  FloorArea = 61,
  EPC       = "C",
  Tax       = "C"
)

predict(rent_mod, newdata = my_property)

# Add 95% confidence interval
predict(rent_mod, newdata = my_property, interval = "confidence")

The residual is the difference between your model’s prediction and the listed price. If the true price falls within the confidence interval, the model’s uncertainty correctly captured the listing.

If you want to see how this kind of model could be translated into something you might see online, with some marketing saying something like “We’re using the most advanced Generative Artificial Intelligence to predict how much rent you should be asking for”, see this demonstration app. This is a little toy website I made (entirely in R). Two things to note: there’s a decent chance the website will crash if lots of people try to use it at the same time, and the “Fancy model” also accounts for spatial patterns in rent across Aberdeen.

Causal modelling

We now switch goals. Instead of predicting rent as accurately as possible, we want to understand the causal effect of floor area on rent: if a property had an extra 10 m², how much more would we expect it to rent for?

This is a different question and requires a different analytical approach.

Confounders

Before starting the analysis, let’s go over the three types of confounds again to help solidify your understanding. Below we’ll go through each type in turn and see the effect of including the confounded variable in a model or not.

The pipe: mediator sitting between cause and effect

In a pipe, Z is caused by the predictor of interest X and in turn causes the response variable Y: the structure is X → Z → Y. The effect of X on Y is real, but it flows through Z. Including Z in the model blocks the pathway, absorbing the X effect and making the X coefficient shrink towards zero. X still matters; you have removed the route through which it acts.

Click Z in the DAG to add or remove it from the model.

Within this data: FloorAreaRoomsPrice. Larger properties have more rooms, and more rooms drives price. Including Rooms when asking about the causal effect of FloorArea would block this path. The correct causal model leaves Rooms out.

Question 1. In the structure FloorArea → Rooms → Price, what causal role does Rooms play?

Rooms is a pipe. Floor area causes the number of rooms a property has (larger properties have more rooms), and the number of rooms in turn influences price. The causal chain runs FloorArea → Rooms → Price: Rooms is a step along the route from cause to effect, not a separate cause of both.

Question 2. When estimating the total causal effect of FloorArea on Price, should Rooms be included in the model?

No. Including a pipe blocks the causal pathway it sits on. If Rooms enters the model, the model estimates how much FloorArea affects Price over and above the rooms it provides: a narrower, less interpretable quantity. To capture the total effect of floor area (everything it does to price, including through the rooms it provides), leave Rooms out.

The fork: a confounder causing both predictor and response

In a fork, Z causes both X and Y: the structure is X ← Z → Y. X and Y appear associated, but neither causes the other. The association is driven by Z. Without adjusting for Z, the X coefficient picks up Z’s influence and is inflated. Including Z closes the confounded path.

Click Z in the DAG to add or remove it from the model.

Within this data: EPCFloorAreaPrice. Larger properties tend to have better energy efficiency ratings and also cost more. Any apparent EPC effect on price in a model without FloorArea would partly be floor area in disguise. Including FloorArea as an adjustment variable closes this confounded path.

Question 3. In the structure EPC ← FloorArea → Price, what causal role does FloorArea play?

FloorArea is a fork (confounder). Larger properties tend to have better energy efficiency ratings because there’s more scope for insulation (FloorArea → EPC), and larger properties also command higher rents (FloorArea → Price). Without adjusting for FloorArea, any apparent EPC–Price association partly reflects this shared driver rather than a genuine effect of energy efficiency on rent.

Question 4. When estimating the causal effect of EPC on Price, what should you do with FloorArea?

Include FloorArea. It sits at the top of a fork: it causes both the predictor of interest (EPC) and the response (Price). That fork is a backdoor path, and without closing it, the EPC coefficient picks up FloorArea’s influence on Price as well as any genuine EPC effect. Adding FloorArea to the model closes this path. Note that FloorArea is included only as an adjustment variable; its own coefficient should not be given a causal interpretation (this would be the Table 2 Fallacy).

The collider: including it manufactures a spurious association

In a collider, both X and Y independently cause Z: the structure is X → Z ← Y. X and Y have no association. Including Z in the model opens a confounded path and creates a spurious relationship from nothing.

A concrete example: imagine both floor area and property condition independently influence how quickly a property lets. If you restrict your analysis to only recently let properties (conditioning on letting speed), you introduce a spurious negative correlation between floor area and condition, because among fast-letting properties, a small floor area must be “compensated for” by good condition, and vice versa.

The collider structure doesn’t appear in the current flats DAG, but it’s the hardest mistake to spot. “Controlling for more variables” sounds rigorous, but adding a collider makes the model worse, not better. Click Z to see the effect.

Question 5. A colleague adds a variable Z to their causal model because “it’s associated with both X and Y, so it must be important to control for.” Z is in fact caused by both X and Y. What is wrong with this reasoning?

Z is a collider. Because both X and Y cause Z, conditioning on Z (including it in the model) opens a path between X and Y that was previously closed. The result is a spurious association between X and Y: bias that the model introduces rather than removes. Being associated with both predictor and response is a description of a confounder, a pipe, and a collider: the association alone doesn’t tell you which structure applies. Only the arrows in the DAG do that.


Start by reloading the data cleanly so that any changes made for the predictive model don’t carry over:

abdn <- read.table("abdn_flats.txt", header = TRUE)

1. Formulate a research question

“How much does monthly rent change when the floor area of a property increases?” We want the causal effect of FloorArea on Price, not just a predictive association.

Creating a DAG

To estimate a causal effect responsibly, we need to reason about confounding. We do this by drawing a Directed Acyclic Graph (DAG) that captures our beliefs about how variables cause one another. The check_dag() function from the performance package helps build and interrogate DAGs in R.

This requires the developer version of performance. Install it with:

install.packages(c("performance", "see"), repos = "https://easystats.r-universe.dev")

Then build the DAG. Each line states what causes what. For example, Rooms ~ FloorArea says “the number of rooms in a property is caused by its floor area.”

library(performance)
library(see)

rent_dag <- check_dag(
  Price ~ Rooms + FloorArea + Tax + EPC,
  Rooms ~ FloorArea,
  EPC   ~ FloorArea,
  Tax   ~ FloorArea,
  exposure = "FloorArea",
  outcome  = "Price"
)

plot(rent_dag)

The exact layout of variables in the plot is semi-random, so your figure may look slightly different each time, but the arrows should be the same.

The plot shows two versions of the DAG. The left panel shows the model as specified by exposure and outcome alone (Price ~ FloorArea). The right panel shows what would be needed to estimate the total causal effect of FloorArea on Price. In this case, both panels are identical.

Why? Because FloorArea drives Rooms, EPC, and Tax, and those variables all flow into Price. These are mediating paths (pipes): FloorAreaRoomsPrice, and FloorAreaEPCPrice, and so on. When we ask “what is the total causal effect of FloorArea?”, we want to include all the ways extra floor area changes price, including through the extra rooms it allows and the tax band it implies. Blocking those paths by adding Rooms, EPC, and Tax to the model would give us only the direct effect of FloorArea, not the total effect.

To confirm, print the DAG object:

rent_dag

This confirms: to estimate the total causal effect of FloorArea on Price, no additional adjustment variables are needed. The correct model is simply:

lm(Price ~ FloorArea, data = abdn)

This will not always be the case. Most of the time, estimating a causal effect requires including specific adjustment variables to close non-causal (backdoor) paths. It’s a coincidence of this particular DAG that no adjustment is needed here.

4. Fit an appropriate model

rent_causal_mod <- lm(Price ~ FloorArea, data = abdn)

5. Diagnose the model and check assumptions

par(mfrow = c(2, 2))
plot(rent_causal_mod)
par(mfrow = c(1, 1))

Two issues commonly appear:

Heteroscedasticity. Residual spread tends to increase with fitted values. There are two reasons for this. First, landlords tend to round rents to the nearest £25 to £50 for cheaper properties and to the nearest £50 to £100 for more expensive ones, which creates measurement “noise” that scales with price. Second, it’s easier to pin down the right rent for a cheaper property (the difference between £500 and £750 is proportionally large) whereas for expensive properties a £250 difference is relatively small, so estimates naturally vary more.

Influential observations. One or more expensive properties may appear beyond the Cook’s distance contour lines. Confirm whether the recorded price is correct using the PropertyURL. If the price is correct, you have two options: ignore the problem and note it when reporting, or remove the observation.

Fix any influential observations and rerun:

abdn_trimmed <- abdn[-c(row_number), ]   # replace with your identified row(s)
rent_causal_mod <- lm(Price ~ FloorArea, data = abdn_trimmed)
par(mfrow = c(2, 2))
plot(rent_causal_mod)
par(mfrow = c(1, 1))

6. Summarise the results

summary(rent_causal_mod)

Look at the coefficient for FloorArea. It should be noticeably larger than the same coefficient in the predictive model. The causal model, which keeps all the mediating paths open, attributes the full effect of extra floor area to FloorArea alone. The predictive model spread that effect across FloorArea, Rooms, EPC, and Tax, making each individual coefficient smaller.

Plot the estimated relationship:

fake <- data.frame(
  FloorArea = seq(from = min(abdn$FloorArea), to = max(abdn$FloorArea), length.out = 50)
)
preds <- predict(rent_causal_mod, newdata = fake, se.fit = TRUE)
fake$fit <- preds$fit
fake$low <- preds$fit - preds$se.fit * 1.96
fake$upp <- preds$fit + preds$se.fit * 1.96

ggplot(fake) +
  geom_ribbon(aes(x = FloorArea, ymin = low, ymax = upp), fill = "lightgrey", alpha = 0.7) +
  geom_line(aes(x = FloorArea, y = fit)) +
  labs(x = "Floor area (m²)", y = "Predicted monthly rent (£)") +
  theme_minimal()

Assuming the DAG is correct, the slope from this model is our best estimate of the total causal effect of floor area on rent: the expected change in monthly rent for each additional m², accounting for all the downstream ways that extra space affects price.

What is the causal effect of EPC on price?

Repeat the full causal workflow, but now asking: “What is the causal effect of energy efficiency band (EPC) on rent?” The DAG structure is the same, but the exposure changes.

Reload the data cleanly before starting:

abdn <- read.table("abdn_flats.txt", header = TRUE)

EPC causal analysis. Build the DAG with EPC as the exposure, identify which model to run, fit it, diagnose it, and interpret the results. Compare the DAG-recommended model with what you would have fitted without a DAG.

library(performance)
library(see)

# Same DAG structure, different exposure
# (But keep in mind, this doesn't include things like how nicely it's decorated,
# its general maintenance, how greedy the landlord is, etc.)
rent_dag_epc <- check_dag(
  Price ~ Rooms + FloorArea + Tax + EPC,
  Rooms ~ FloorArea,
  EPC   ~ FloorArea,
  Tax   ~ FloorArea,
  exposure = "EPC",
  outcome  = "Price"
)
plot(rent_dag_epc)
rent_dag_epc

# check_dag() now recommends including FloorArea to block the backdoor path:
# FloorArea → EPC (exposure), but also FloorArea → Price (directly)
# Without adjusting for FloorArea, we confound the EPC → Price relationship
EPC_mod <- lm(Price ~ EPC + FloorArea, data = abdn)

par(mfrow = c(2, 2))
plot(EPC_mod)
par(mfrow = c(1, 1))
# Diagnostics are generally acceptable

summary(EPC_mod)

library(ggeffects)
plot(ggpredict(EPC_mod, terms = "EPC"))

The DAG recommends including FloorArea as an adjustment variable to block the backdoor path (FloorArea causes both EPC and Price). Without it, any apparent EPC effect would be partly a floor area effect in disguise.

Important: in this model, FloorArea is included only to adjust for confounding. Its coefficient should not be interpreted causally; doing so would be the Table 2 Fallacy.

The EPC effect estimates are often noisy with this dataset, with wide confidence intervals that span zero for some bands. This either means the DAG is more complex than assumed, or the dataset is too small to estimate the EPC effect reliably with this approach.

A more complete DAG (accounting for property age, property type, and historic price) would likely change which variables need to be included, and might reveal that the current dataset cannot answer this causal question at all without collecting additional variables.

Assumption reference

Assumption Description Example of violation
1. Validity The data can answer the research question. Using IQ to measure overall intelligence.
2. Representativeness The data reflects the population being studied. Using survey data from a rural area to draw conclusions about an urban population.
3. Additivity Effects of predictors are additive with no interactions. Assuming study hours and sleep independently affect exam scores without considering their combined effect.
4. Linearity The relationship between predictors and the outcome is linear. Assuming plant growth continues to increase indefinitely without levelling off.
5. Independence of errors Observations are independent of each other. Modelling height using repeated measurements of the same children over time.
6. Equal variance of errors Residuals have constant variance across all predictor levels. Modelling income, where salary variability differs across levels of a company hierarchy.
7. Normality of errors Residuals follow a normal distribution. Modelling income, where a few extremely wealthy individuals create extreme residuals.