Workshop 1.5: Hidden Assumptions

BI3010 Statistics for Biologists

Author

Roos & Pinard

Published

16 May 2026

Learning Objectives

By the end of this workshop you will be able to:
  1. Identify potential issues (or “hidden assumptions”) without seeing the data.
  2. Identify potential hidden assumptions in a real world dataset.
  3. Use R to perform simple descriptive summaries of the data to assist in this.

These skills will be used throughout the course.

Workshop structure

The Analytical Workflow

As always, we’re using the workflow below to structure the workshops:

  1. Formulate a research question.
  2. Perform exploratory data analysis.
  3. Identify any hidden assumptions [Focus of today].
  4. Fit an appropriate model.
  5. Diagnose the model and check assumptions.
  6. Summarise the results.
  7. Interpret and provide inferences.

The Seven Assumptions

In this workshop, you’ll come up with examples of datasets that are likely to violate a hidden assumption. Some of these assumptions relate to residual error, which we’ll cover in more detail next week. For now, understand residual error as the difference between the actual observation and what a model predicts: observed minus predicted (Residual = y − ŷ). For example, if a model predicts a flat costs £850 to rent each month, but the actual rent was £800, then the residual error is 800 − 850 = −50. In other words, the model over-predicted the rent by £50.
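The flat example can be reproduced in R. This is a minimal sketch using only the two numbers given above; the variable names are illustrative and not part of the course materials.

```r
# Residual for the flat example: observed minus predicted
observed_rent  <- 800  # actual monthly rent (£)
predicted_rent <- 850  # rent predicted by the model (£)

residual <- observed_rent - predicted_rent
residual
```

A negative residual means the model predicted too high; a positive residual means it predicted too low.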

These assumptions (with some simple examples) are:

Assumption | Description | Example of violation
1. Validity | The data is able to answer the research question effectively. | Using IQ to measure overall intelligence (what does IQ even measure?).
2. Representativeness | The data reflects the population or group being studied. | Using survey data from a rural area to draw conclusions about an urban population.
3. Additivity | Effects of predictors are additive and there are no interactions between them. | Assuming that study hours and sleep hours independently affect exam scores without considering their combined effect.
4. Linearity | The relationship between predictors and the outcome is linear. | Assuming plant growth will continue to increase indefinitely without ever levelling out.
5. Independence of errors | Observations are independent of each other. | If modelling height, the data come from the same group of children over time.
6. Equal variance of errors | Residuals have constant variance across all levels of predictors (“homoscedasticity”). | If modelling income, variability in salaries isn’t the same across all levels of a company’s hierarchy.
7. Normality of errors | Residuals should be normally distributed. | If modelling income, the presence of any extremely wealthy individuals will be hard for the model to understand and explain.

For this exercise, form groups of two to four at your table. Your task is to come up with seven datasets that you believe are likely to violate assumptions one to seven in the above table. Once you have come up with an example for a given assumption, choose one of your group to upload your example to the MyAberdeen discussion board, such that we build up a “portfolio” of lousy datasets.

If it helps, think of this exercise as asking you to be a terrible scientist. For each assumption, think of an experiment, or a way of collecting data, that would violate it. You might pick something you want to understand and then study it in the worst possible way:

  • Ostrich wing length: “we measured wing length by looking at the wing through binoculars and guessing to the nearest mm” (Validity)
  • European otter density: “we measured the density of European otters in Tokyo zoo to understand otter dynamics in Spain” (Representativeness)
  • Plant disease occurrence: “we assume residual error of number of pathogens will be constant as a plant becomes older” (Equal variance of errors)
  • Dolphin weight: “we assume that dolphins will continue to grow at the same rate throughout their lives” (Linearity)
  • Number of corrupt politicians: “we assume most politicians are similar and that it’s not just one or two that are extremely corrupt” (Normality of errors)

You’re by no means limited to the silly examples above. Feel free to think of your own silly (or not) example datasets/context.

To ensure that we have examples for each assumption, within your group, have one person run the following code to determine the order of assumptions (from 1. Validity to 7. Normality of errors) that you will work through:

# The code below is essentially a lottery machine
# x = the options you want to select from (here 1 to 7)
# size = the number of draws you want made
# replace = FALSE means once a number is selected, it cannot be selected again
sample(x = 1:7, size = 7, replace = FALSE)

If you’re unsure what to do, please ask a member of staff, and also discuss your ideas with them during this first task.

Check Your Understanding

For each assumption below, select the scenario that best illustrates it being violated.

Assumption 1: Validity. Which dataset is most likely to violate Validity?

Track counts and population size aren’t the same thing. Track density depends on substrate, weather, species behaviour, and observer skill, not just how many animals are present. The data can’t directly answer the question it’s being used for.

Assumption 2: Representativeness. Which study is most likely to violate Representativeness?

Accessible riverbanks aren’t representative of the whole river system. Deeper water, faster currents, and remote stretches may hold very different fish communities. Sampling only what’s convenient and generalising to the whole introduces systematic bias.

Assumption 3: Additivity. Which scenario is most likely to violate Additivity?

The drug and alcohol interact: the combined effect isn’t simply the sum of each alone. Additivity assumes that predictors work independently, and this is a clear violation. The other examples describe situations where effects are independent and additive.
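To see what a non-additive effect looks like in a simple descriptive summary, here is a hedged sketch using simulated data. Everything in it is an assumption made for illustration: the outcome (`drowsiness`), the effect sizes, and the sample size are invented and do not come from a real study.

```r
set.seed(1)  # make the simulation reproducible

# Balanced, made-up design: 100 people in each drug x alcohol combination
n       <- 400
drug    <- rep(c(0, 1), each = n / 2)
alcohol <- rep(c(0, 1), times = n / 2)

# Invented effects: drug alone +1, alcohol alone +2, and +3 EXTRA when combined
drowsiness <- 1 * drug + 2 * alcohol + 3 * drug * alcohol +
  rnorm(n, mean = 0, sd = 0.5)

# Mean outcome in each of the four groups (rows = drug, columns = alcohol)
cell_means <- tapply(drowsiness, list(drug = drug, alcohol = alcohol), mean)
cell_means

# Compare the combined effect with the sum of the separate effects
baseline      <- cell_means["0", "0"]
drug_alone    <- cell_means["1", "0"] - baseline
alcohol_alone <- cell_means["0", "1"] - baseline
both_together <- cell_means["1", "1"] - baseline

extra_when_combined <- both_together - (drug_alone + alcohol_alone)
extra_when_combined  # near 0 if effects were additive; here it is not
```

If the two effects were additive, `extra_when_combined` would sit close to zero. A value well away from zero is the signature of an interaction.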

Assumption 4: Linearity. Which relationship is most likely to be non-linear?

Bacterial growth is exponential: each bacterium divides to produce two, which divide to produce four, and so on. The relationship between time and population size is a curve, not a straight line. Assuming linearity here would badly underestimate growth at later time points.
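The doubling described above can be sketched in R to show how a straight line copes with it. The starting population (one cell), the time unit (hours), and the ten-hour window are assumptions chosen for illustration. The sketch uses `lm()`, which is covered later in the course, purely to draw the straight line.

```r
# Idealised doubling: 1 cell at hour 0, doubling every hour (assumed values)
hours      <- 0:10
population <- 2^hours

# Fit a straight line to a relationship that is actually a curve
linear_fit <- lm(population ~ hours)

# Compare the straight-line prediction with the true value at the final hour
predicted_final <- unname(predict(linear_fit, newdata = data.frame(hours = 10)))
actual_final    <- population[hours == 10]
c(actual = actual_final, linear_prediction = predicted_final)

# On a log2 scale the same data fall on a straight line with slope 1
log_fit <- lm(log2(population) ~ hours)
coef(log_fit)
```

The straight line falls well short of the true population at the final hour, whereas the log-transformed data are perfectly linear. Plotting `population` against `hours` before fitting anything would reveal the curve immediately.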

Assumption 5: Independence of errors. Which study design is most likely to produce non-independent errors?

Piglets from the same litter share genetics and environment, so they are more similar to each other than to piglets from different litters. Treating them as 200 independent observations ignores this clustering. The effective sample size is closer to 20 (litters) than 200 (piglets).
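One common way to put a rough number on this is the design effect for equally sized clusters, n_eff = n / (1 + (m − 1) × ICC), where m is the cluster size and ICC (the intraclass correlation) describes how similar littermates are. The sketch below is illustrative only: it assumes 10 piglets in each of the 20 litters, and the ICC values are hypothetical rather than estimated from data.

```r
# Effective sample size under clustering (design effect, equal cluster sizes)
# n_total      = total number of observations
# cluster_size = observations per cluster (assumed equal across clusters)
# icc          = intraclass correlation, between 0 and 1 (hypothetical here)
effective_n <- function(n_total, cluster_size, icc) {
  n_total / (1 + (cluster_size - 1) * icc)
}

# 200 piglets in 20 litters of 10, across a range of made-up ICC values
icc_values <- c(0, 0.2, 0.5, 1)
data.frame(
  icc         = icc_values,
  effective_n = effective_n(n_total = 200, cluster_size = 10, icc = icc_values)
)
```

When littermates are no more alike than unrelated piglets (ICC = 0) the effective sample size is the full 200; when littermates are identical (ICC = 1) it drops to 20, the number of litters. Real data fall somewhere in between.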

Assumption 6: Equal variance of errors. Which scenario is most likely to show unequal variance?

When the outcome itself ranges across orders of magnitude, variance almost always scales with the mean. A herb producing around 50 seeds might vary by 10; a tree producing a million might vary by tens of thousands. Assuming constant variance across this range would be badly wrong.
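A quick descriptive check for this is to compare the spread within each group. The sketch below simulates seed counts using the rough figures from the explanation above; the exact standard deviation for the tree (50,000), the sample sizes, and the use of a normal distribution are assumptions made for illustration.

```r
set.seed(42)  # make the simulation reproducible

# Simulated seed counts for 100 herbs and 100 trees (made-up values)
seeds <- data.frame(
  plant = rep(c("herb", "tree"), each = 100),
  count = c(rnorm(100, mean = 50,  sd = 10),     # herb: ~50 seeds, varies by ~10
            rnorm(100, mean = 1e6, sd = 5e4))    # tree: ~1 million, varies by ~50,000
)

# Mean and standard deviation of seed count within each plant type
mean_by_plant <- tapply(seeds$count, seeds$plant, mean)
sd_by_plant   <- tapply(seeds$count, seeds$plant, sd)
rbind(mean = mean_by_plant, sd = sd_by_plant)
```

The standard deviations differ by orders of magnitude between the two groups, which is exactly what a constant-variance assumption rules out.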

Assumption 7: Normality of errors. Which dataset is most likely to produce non-normal residuals?

Income distributions are strongly right-skewed. Most people earn a moderate amount, but a small number earn very large amounts, pulling the tail far to the right. A model assuming normal residuals will struggle with these extreme values, exactly the “extremely wealthy individual” problem from the table above.
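A simple numerical symptom of right skew is a mean that sits above the median. The sketch below simulates incomes from a log-normal distribution, which is one commonly used stand-in for skewed income data; the distribution and its parameters are assumptions for illustration, not real salary figures.

```r
set.seed(123)  # make the simulation reproducible

# 1000 simulated incomes with a typical value around 30,000 (made-up parameters)
income <- rlnorm(1000, meanlog = log(30000), sdlog = 0.8)

# Descriptive summary: minimum, quartiles, median, mean, maximum
summary(income)

# In a right-skewed distribution the long tail drags the mean above the median
mean(income) > median(income)

# The top 1% of earners sit far above the typical (median) person
quantile(income, probs = c(0.5, 0.99))
```

Running `hist(income)` on the same vector shows the long right tail directly.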

Red Deer in the UK

Still within your groups, visit the following NBN Atlas page. This page contains all recordings of red deer in the UK. Spend some time exploring this page, as well as finding out how the data is collected.

Question Set One

Question 1. Using this dataset, consider which assumptions it is most likely to violate.

Validity. A data point only appears when a deer was present, seen, and reported by an observer. We can’t guarantee that every observer spots a deer even when one is there, and skill levels will vary considerably. There’s also the question of identification: what if it was a roe deer, a dog, or even Bigfoot?

Representativeness. The dataset only records successful sightings, not failed searches. Every blank area on the map has two possible meanings: someone looked and found no deer, or no one looked at all. These are very different situations but look identical in the data. Survey effort is also not uniform across the UK. Someone near Fort William is fairly accustomed to red deer, whereas someone spotting one in London would find it much more exciting and be far more likely to report it. That reporting bias makes the dataset unreliable as a picture of where deer actually are.

In short, what the data captures is: observations where a deer was present, seen, and reported. What we want it to represent is: observations where deer are either present or absent. Those aren’t the same thing.
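The gap between “present” and “present, seen, and reported” can be illustrated with a small simulation. All of the probabilities below are invented for illustration; they are not estimates for red deer or for the NBN Atlas.

```r
set.seed(7)  # make the simulation reproducible

n_sites <- 1000

# Made-up probabilities for each step needed to generate a record
deer_present      <- rbinom(n_sites, 1, 0.4)  # is a deer actually at the site?
site_visited      <- rbinom(n_sites, 1, 0.3)  # did anyone look there?
seen_and_reported <- rbinom(n_sites, 1, 0.5)  # if so, was the deer seen and reported?

# A record exists only when all three things happen
recorded <- deer_present * site_visited * seen_and_reported

# Truth versus what ends up in the database
table(truth = deer_present, recorded = recorded)

# Proportion of sites with deer: true value versus what the records suggest
c(true_occupancy = mean(deer_present), recorded_occupancy = mean(recorded))
```

Sites with `recorded = 0` are a mixture of true absences, unvisited sites, and missed deer, and nothing in the recorded data distinguishes between them.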

Having done this, compare the NBN Atlas with the IUCN Red List range map.

Question Set Two

Question 1. How reliable do you think the IUCN geographic range is for the UK?

The core of the range is likely reliable, but the range edge is likely highly unreliable, with some areas probably overestimated and others underestimated. As we approach a species’ range edge, we might find that: 1) the species is less common, so it isn’t seen as often, 2) it is relatively “new” at the range edge, so it isn’t reported as often, 3) it might colonise an area but quickly go locally extinct (meaning we might have false positive “detections”), and 4) lots of additional small factors all contribute to detections being highly unreliable, especially at range edges.

Question 2. Do you think the non-UK geographic range is as reliable, or less so?

Probably less so. The UK public tends to be engaged with wildlife reporting, particularly for larger mammals and birds. In other countries this culture either doesn’t exist or is less pronounced, and in some datasets it’s not uncommon for all observations from a single country to come from a single highly motivated individual. Reporting intensity and accuracy will differ substantially across borders, making a multi-country range map inherently uneven.

Question 3. Do you think there’s any value in the IUCN geographic range for decision making?

There is some value, but in a limited capacity. It’s useful for very broad decisions (“red deer are present in country X”) but is very likely to be detrimental for any fine-scale decisions (“red deer are expanding into neighbourhood Y of city X”). That’s often the case in science: results may still be useful for broad-level understanding even when the underlying data are of limited quality, but if we want to make more precise and accurate decisions, the data really do need to be top quality.

Revisiting the Alligator Data

Return to the alligator_bite.txt dataset from the first workshop. We didn’t give you any information as to how the data were collected, so gauging how likely it was to meet or violate assumptions is difficult (but not impossible). To allow you to do so now, here’s a description of data collection:

Each alligator was retrieved from a communal pond containing 100 alligators. Although the alligators were not individually tagged, farm staff were able to identify each one, ensuring that observations were independent. After retrieval, the alligator was blindfolded and moved to a safe and secure location. Once secured, the alligator was restrained by up to three technicians while its length was measured. To measure the length, two additional technicians used a tape measure from the tip of the nose to the tip of the tail. The tape was passed under the restraining technicians when possible to ensure accurate measurements. Lengths were recorded in centimetres, to one decimal place.

After measuring length, the blindfold was removed, and a modified gnathodynamometer was used to measure bite force in pounds per square inch. Since alligators may not immediately bite the gnathodynamometer, the device was placed in the alligator’s mouth for up to 60 seconds to capture an appropriate bite strength. Once all measurements were taken, the alligator was returned to the pond before the next individual was retrieved.

Question Set Three

Question 1. With the alligator dataset and data collection description, state which assumptions you believe may be most likely to be violated.

Independence of errors. Alligators are returned to the communal pond after measurement before the next one is retrieved. Despite staff assurances that individuals can be identified, it seems plausible that the same alligator could be captured twice. If so, some “observations” are repeated measurements of the same animal — a direct violation of independence.
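To get a feel for how plausible a repeat capture is, here is a rough sketch. It assumes the worst case: each retrieval is a random draw from the 100 alligators, with every animal equally likely to be caught each time and staff unable to tell them apart. The number of measurements taken (30) is a hypothetical value, since the description above doesn’t state it.

```r
# Probability that at least one alligator is measured more than once,
# assuming each capture is an independent random draw from the pond
prob_repeat <- function(n_measured, pond_size = 100) {
  # Probability that all captures are different animals:
  # (100/100) * (99/100) * (98/100) * ... for n_measured captures
  p_all_different <- prod((pond_size - seq_len(n_measured) + 1) / pond_size)
  1 - p_all_different
}

# Hypothetical number of measurements
prob_repeat(n_measured = 30)

# How the risk grows as more animals are measured
sapply(c(5, 10, 20, 30, 50), prob_repeat)
```

Under these assumptions the chance of at least one repeat climbs quickly with the number of captures, which is why relying on staff recognition alone is a weak safeguard for independence.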

Validity (bite force). It sounds like actually getting alligators to bite the device might have been a bit hard. Did each animal bite as hard as it could, or just hard enough for the staff to feel like they had a measurement? If the latter, recorded values may not reflect true maximum bite strength.

Validity (length). The fact that they had to pass the measuring tape in and around technicians doesn’t inspire confidence in accuracy. And that’s before considering what happens if the alligator moves while it’s being measured. It seems unlikely that length was measured especially accurately, never mind to one decimal place (which seems absurdly precise given the difficulties in measuring).

Question 2. Describe how the data collection could have been improved so that the assumptions would be more likely to be met.

The easiest option may be to have some method to identify which alligators had already been measured. Something like a non-toxic paint mark on their head. That would remove the chance that a previously measured alligator was repeatedly measured.

In terms of getting more accurate measurements, it’s a bit harder. Measuring an alligator’s length and bite strength will probably always be difficult and the way they do it seems pretty sensible from the little I know about measuring alligator bite strength (which isn’t much). Personally, I can’t think of a way to resolve this issue, so I’d likely look to resolve the measurement error in the model (note, we don’t teach you how to do this).

The broader point with this example is that sometimes you can’t improve a dataset. In such cases, the important thing is to identify what impact the limitations will have on the results, which we’ll return to as the course progresses.

Glossary
Statistical concepts
Assumption
A condition that a statistical model requires to hold true for its results to be valid. If an assumption is violated, the model’s results may be unreliable.
Predictor
A variable used in a model to explain or predict an outcome. Also called an independent variable.
Residual error
The difference between an observed value and the value a model predicted (Residual = Observed − Predicted).
Data quality
Validity
Whether data actually measures what it is intended to measure. For example, using a researcher’s Twitter follower count to measure scientific impact has low validity.
Representativeness
The degree to which a sample accurately reflects the wider population it is drawn from. For example, if only city residents are surveyed, the conclusions may not apply to rural areas.
Model assumptions
Additivity
The assumption that predictors affect the outcome independently, so their effects simply add together without interacting.
Linearity
The assumption that the relationship between a predictor and the outcome follows a straight line. If the true relationship is a curve (such as exponential growth), this assumption is violated.
Independence of errors
The assumption that observations do not influence each other. Repeatedly measuring the same subject, or measuring individuals from the same group, can easily violate this assumption.
Homoscedasticity
The assumption that residuals have constant variance across all levels of the predictors. Also called equal variance of errors. It is one of the key assumptions of linear regression.
Heteroscedasticity
The opposite of homoscedasticity: residual variance changes across levels of the predictor, violating the equal variance assumption.
Normality of errors
The assumption that model residuals follow a normal (bell-shaped) distribution. This assumption is easily violated when the data contain extreme values.
Interaction
When the effect of one predictor on the outcome depends on the level of another predictor.
Exponential growth
Growth where the rate of increase is proportional to the current value, producing an accelerating curve rather than a straight line, as when bacteria repeatedly double. Unlike linear growth, exponential growth quickly becomes very large.
Sampling and resources
Stratified random sampling
A sampling method where the population is divided into subgroups (strata) and random samples are taken independently from each. This helps to ensure that the sample is representative.
Geographic range
The geographic area in which a species is known to occur.
NBN Atlas
The National Biodiversity Network Atlas: a UK database of wildlife observations submitted by members of the public, and one of the largest wildlife databases in the UK.
IUCN
The International Union for Conservation of Nature: the global authority that assesses the conservation status of species and publishes the Red List of Threatened Species.
Gnathodynamometer
A device used to measure the bite force of an animal. The name comes from Greek: “gnathos” means jaw and “dynamis” means force.