# The code below is essentially a lottery machine
# x = the options you want to select from (here 1 to 7)
# size = the number of draws you want made
# replace = FALSE means once a number is selected, it cannot be selected again
sample(x = 1:7, size = 7, replace = FALSE)

Workshop 1.5: Hidden Assumptions
BI3010 Statistics for Biologists
Learning Objectives
- Learn to identify potential issues (or “hidden assumptions”) without seeing the data.
- Identify potential hidden assumptions in a real world dataset.
- Use R to perform simple descriptive summaries of the data to assist in this.
These skills will be used throughout the course.
Workshop structure
As always, we’re using the workflow below to structure the workshops:
- Formulate a research question.
- Perform exploratory data analysis.
- Identify any hidden assumptions [Focus of today].
- Fit an appropriate model.
- Diagnose the model and check assumptions.
- Summarise the results.
- Interpret and provide inferences.
The Seven Assumptions
In this workshop, you’ll come up with some examples of various datasets that are likely to violate a hidden assumption. Some of these are related to residual error, which we’ll cover in more detail next week. For now, understand residual error as the difference between the actual observation and what a model predicts: observed minus predicted (Residual = y − ŷ). For example, if a model predicts that a flat costs £850 a month to rent, but the actual rent is £800, then the residual error is 800 − 850 = −50. In other words, the model over-predicted by £50.
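If you’d like to see the flat example as arithmetic in R, here is a minimal sketch. The object names (`observed`, `predicted`, `residual`) are just illustrative labels, not anything you need to use later in the course.

```r
# Values taken from the flat-rent example above
observed  <- 800  # actual monthly rent (£)
predicted <- 850  # monthly rent predicted by the model (£)

# Residual = observed minus predicted
residual <- observed - predicted
residual
```

A negative residual means the model predicted too high; a positive one means it predicted too low.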
These assumptions (with some simple examples) are:
| Assumption | Description | Example of violation |
|---|---|---|
| 1. Validity | The data is able to answer the research question effectively. | Using IQ to measure overall intelligence (what does IQ even measure?). |
| 2. Representativeness | The data reflects the population or group being studied. | Using survey data from a rural area to draw conclusions about an urban population. |
| 3. Additivity | Effects of predictors are additive and there are no interactions between them. | Assuming that study hours and sleep hours independently affect exam scores without considering their combined effect. |
| 4. Linearity | The relationship between predictors and the outcome is linear. | Assuming plant growth will continue to increase indefinitely without ever levelling out. |
| 5. Independence of errors | Observations are independent of each other. | If modelling height, the data come from the same group of children over time. |
| 6. Equal variance of errors | Residuals have constant variance across all levels of predictors (“homoscedasticity”). | If modelling income, variability in salaries isn’t the same across all levels of a company’s hierarchy. |
| 7. Normality of errors | Residuals should be normally distributed. | If modelling income, the presence of any extremely wealthy individuals will be hard for the model to understand and explain. |
For this exercise, form groups of two to four at your table. Your task is to come up with seven datasets that you believe are likely to violate assumptions one to seven in the above table. Once you have come up with an example for a given assumption, choose one of your group to upload your example to the MyAberdeen discussion board, such that we build up a “portfolio” of lousy datasets.
If it helps, you can think of this exercise as essentially asking you to be a terrible scientist. For each assumption, think of an experiment, or a way of collecting data, that would violate it. You might pick something you’d like to understand and then design a lousy experiment that studies it in the worst possible way:
- Ostrich wing length: “we measured wing length by looking at the wing through binocs and guessing to the nearest mm” (Validity)
- European otter density: “we measured the density of European otters in Tokyo zoo to understand otter dynamics in Spain” (Representativeness)
- Plant disease occurrence: “we assume residual error of number of pathogens will be constant as a plant becomes older” (Equal variance of errors)
- Dolphin weight: “we assume that dolphins will continue to grow at the same rate throughout their lives” (Linearity)
- Number of corrupt politicians: “we assume most politicians are similar and that it’s not just one or two that are extremely corrupt” (Normality of errors)
You’re by no means limited to the silly examples above. Feel free to think of your own silly (or not) example datasets/context.
To ensure that we have examples for each assumption, within your group, have one person run the following code to determine the order of assumptions (from 1. Validity to 7. Normality of errors) that you will work through:
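One way to run the draw is sketched below, assuming you’d rather see the assumption names than the numbers 1 to 7. It uses the worksheet’s own `sample()` call; the `assumptions` vector and the name `draw_order` are additions made here purely for illustration.

```r
# The seven assumptions, in the same order as the table above
assumptions <- c("Validity", "Representativeness", "Additivity", "Linearity",
                 "Independence of errors", "Equal variance of errors",
                 "Normality of errors")

# Shuffle the numbers 1 to 7: seven draws, each number can only come up once
draw_order <- sample(x = 1:7, size = 7, replace = FALSE)

# The assumptions in the order your group should work through them
assumptions[draw_order]
```

Because the draw is random, each group will most likely get a different order, which is the point.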
If you’re unsure what to do, please ask a member of staff, and also discuss your ideas with them during this first task.
Check Your Understanding
For each assumption below, select the scenario that best illustrates it being violated.
Assumption 1: Validity. Which dataset is most likely to violate Validity?
Track counts and population size aren’t the same thing. Track density depends on substrate, weather, species behaviour, and observer skill, not just how many animals are present. The data can’t directly answer the question it’s being used for.
Assumption 2: Representativeness. Which study is most likely to violate Representativeness?
Accessible riverbanks aren’t representative of the whole river system. Deeper water, faster currents, and remote stretches may hold very different fish communities. Sampling only what’s convenient and generalising to the whole introduces systematic bias.
Assumption 3: Additivity. Which scenario is most likely to violate Additivity?
The drug and alcohol interact: the combined effect isn’t simply the sum of each alone. Additivity assumes that predictors work independently, and this is a clear violation. The other examples describe situations where effects are independent and additive.
Assumption 4: Linearity. Which relationship is most likely to be non-linear?
Bacterial growth is exponential: each bacterium divides to produce two, which divide to produce four, and so on. The relationship between time and population size is a curve, not a straight line. Assuming linearity here would badly underestimate growth at later time points.
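To see how badly a straight line can do here, the sketch below uses a made-up culture that starts at one cell and doubles every hour. The numbers and object names are hypothetical; the idea is simply to fit a line to the early time points and extrapolate it.

```r
# Hypothetical culture: 1 cell at hour 0, doubling every hour (illustrative only)
hours <- 0:10
cells <- 2^hours

# Fit a straight line using only the first five time points (hours 0 to 4)
early <- data.frame(hours = hours[1:5], cells = cells[1:5])
fit <- lm(cells ~ hours, data = early)

# Extrapolate that line to hour 10 and compare with the actual count
predicted_10 <- predict(fit, newdata = data.frame(hours = 10))
c(linear_prediction = unname(predicted_10), actual = cells[11])
```

With these made-up numbers the straight line predicts about 35 cells at hour 10, against an actual 1,024.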
Assumption 5: Independence of errors. Which study design is most likely to produce non-independent errors?
Piglets from the same litter share genetics and environment, so they are more similar to each other than to piglets from different litters. Treating them as 200 independent observations ignores this clustering. The effective sample size is closer to 20 (litters) than 200 (piglets).
Assumption 6: Equal variance of errors. Which scenario is most likely to show unequal variance?
When the outcome itself ranges across orders of magnitude, variance almost always scales with the mean. A herb producing around 50 seeds might vary by 10; a tree producing a million might vary by tens of thousands. Assuming constant variance across this range would be badly wrong.
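A small simulation can make this concrete. The sketch below assumes Poisson-distributed seed counts, chosen only because a Poisson variable’s variance equals its mean; real seed counts may well be more variable than this. The species means are the rough figures from the explanation above, and everything else is hypothetical.

```r
set.seed(1)  # fixed seed so the illustration is reproducible

# Hypothetical seed counts for 1,000 individuals of each of two species.
# Assumption: counts are Poisson, so the variance grows with the mean.
herb_seeds <- rpois(n = 1000, lambda = 50)   # herb: about 50 seeds each
tree_seeds <- rpois(n = 1000, lambda = 1e6)  # tree: about a million seeds each

# Standard deviation of the counts at each end of the range
c(herb_sd = sd(herb_seeds), tree_sd = sd(tree_seeds))
```

The two standard deviations differ by orders of magnitude, so a single “constant variance” could not describe both species.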
Assumption 7: Normality of errors. Which dataset is most likely to produce non-normal residuals?
Income distributions are strongly right-skewed. Most people earn a moderate amount, but a small number earn very large amounts, pulling the tail far to the right. A model assuming normal residuals will struggle with these extreme values, exactly the “extremely wealthy individual” problem from the table above.
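As a rough illustration of right skew, the sketch below simulates incomes from a log-normal distribution. That choice is an assumption made here for convenience (it gives a right-skewed shape), and the parameter values are arbitrary rather than taken from any real income data.

```r
set.seed(1)  # fixed seed so the illustration is reproducible

# Hypothetical incomes for 1,000 people.
# Assumption: log-normal, used only as a convenient right-skewed distribution.
income <- rlnorm(n = 1000, meanlog = 10, sdlog = 1)

# A long right tail pulls the mean above the median
c(mean = mean(income), median = median(income))
```

Running `hist(income)` afterwards shows the long tail to the right.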
Red Deer in the UK
Still within your groups, visit the following NBN Atlas page. This page contains all recordings of red deer in the UK. Spend some time exploring this page, as well as finding out how the data is collected.
Question Set One
Question 1. Using this dataset, consider which assumptions it is most likely to violate.
Validity. A data point only appears when a deer was present, seen, and reported by an observer. We can’t guarantee that every observer spots a deer even when one is there, and skill levels will vary considerably. There’s also the question of identification: what if it was a roe deer, a dog, or even Bigfoot?
Representativeness. The dataset only records successful sightings, not failed searches. Every blank area on the map has two possible meanings: someone looked and found no deer, or no one looked at all. These are very different situations but look identical in the data. Survey effort is also not uniform across the UK. Someone near Fort William is fairly accustomed to red deer, whereas someone spotting one in London would find it much more exciting and be far more likely to report it. That reporting bias makes the dataset unreliable as a picture of where deer actually are.
In short, what the data captures is: observations where a deer was present, seen, and reported. What we want it to represent is: observations where deer are either present or absent. Those aren’t the same thing.
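The gap between those two things can be sketched with a toy example. The four sites below are entirely made up; `surveyed` and `deer_present` are hypothetical columns that a presence-only dataset like this one never actually records.

```r
# Four hypothetical sites covering every combination of effort and presence
sites <- data.frame(
  site         = c("A", "B", "C", "D"),
  surveyed     = c(TRUE, TRUE, FALSE, FALSE),  # did anyone look?
  deer_present = c(TRUE, FALSE, TRUE, FALSE)   # were deer actually there?
)

# A record only exists when someone looked AND deer were there
sites$recorded <- sites$surveyed & sites$deer_present
sites
```

Sites B, C and D all end up with no record, for three different reasons, and the `recorded` column alone cannot tell them apart.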
Having done this, compare the NBN Atlas with the IUCN Red List range map.
Question Set Two
Question 1. How reliable do you think the IUCN geographic range is for the UK?
The core of the range is likely reliable, but the range edge is likely highly unreliable, with some areas probably overestimated and others underestimated. As we approach a species’ range edge, we might find that: 1) the species is less common, so isn’t seen as often, 2) it is relatively “new” at the range edge, so isn’t reported as often, 3) it might colonise an area but quickly go locally extinct (meaning we might have false positive “detections”), and 4) there are lots of additional little things that all contribute to detections being highly unreliable, especially at range edges.
Question 2. Do you think the non-UK geographic range is as reliable, or less so?
Probably less so. The UK public tends to be engaged with wildlife reporting, particularly for larger mammals and birds. In other countries this culture either doesn’t exist or is less pronounced, and in some datasets it’s not uncommon for all observations from a single country to come from a single highly motivated individual. Reporting intensity and accuracy will differ substantially across borders, making a multi-country range map inherently uneven.
Question 3. Do you think there’s any value in the IUCN geographic range for decision making?
There is some value, but in a limited capacity. It’s useful for very broad decisions (“red deer are present in country X”) but is very likely to be detrimental for any fine-scale decisions (“red deer are expanding into neighbourhood Y of city X”). That’s often the case in science: results may still be useful for broad-level understanding even when the underlying data are of limited quality, but if we want to make more precise and accurate decisions, the data really do need to be top quality.
Revisiting the Alligator Data
Return to the alligator_bite.txt dataset from the first workshop. We didn’t give you any information as to how the data were collected, so gauging how likely it was to meet or violate assumptions is difficult (but not impossible). To allow you to do so now, here’s a description of data collection:
Each alligator was retrieved from a communal pond containing 100 alligators. Although the alligators were not individually tagged, farm staff were able to identify each one, ensuring that observations were independent. After retrieval, the alligator was blindfolded and moved to a safe and secure location. Once secured, the alligator was restrained by up to three technicians while its length was measured. To measure the length, two additional technicians used a tape measure from the tip of the nose to the tip of the tail. The tape was passed under the restraining technicians when possible to ensure accurate measurements. Lengths were recorded in centimetres, to one decimal place.
After measuring length, the blindfold was removed, and a modified gnathodynamometer was used to measure bite force in pounds per square inch. Since alligators may not immediately bite the gnathodynamometer, the device was placed in the alligator’s mouth for up to 60 seconds to capture an appropriate bite strength. Once all measurements were taken, the alligator was returned to the pond before the next individual was retrieved.
Question Set Three
Question 1. With the alligator dataset and data collection description, state which assumptions you believe may be most likely to be violated.
Independence of errors. Alligators are returned to the communal pond after measurement before the next one is retrieved. Despite staff assurances that individuals can be identified, it seems plausible that the same alligator could be captured twice. If so, some “observations” are repeated measurements of the same animal, which is a direct violation of independence.
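To get a feel for how likely a repeat capture might be, here is a rough sketch. It rests on assumptions that are not in the data description: that every capture is an independent, equally likely draw from the 100 alligators in the pond, and that staff cannot in practice tell who has already been measured. The function name `p_repeat` and the numbers of measurements tried are hypothetical.

```r
# Probability that at least one alligator is measured more than once.
# Assumptions (not from the data description): each capture is an independent,
# equally likely draw from the pond, with no memory of who was measured before.
p_repeat <- function(n_measured, pond_size = 100) {
  # Chance that all n_measured captures are different animals
  p_all_different <- prod((pond_size - (seq_len(n_measured) - 1)) / pond_size)
  1 - p_all_different
}

# Hypothetical numbers of measurements; the real sample size may differ
sapply(c(5, 10, 20, 40), p_repeat)
```

Under these assumptions the chance of at least one repeat rises quickly as more animals are measured, which is why relying on staff recognition alone is a concern.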
Validity (bite force). It sounds like actually getting alligators to bite the device might have been a bit hard. Did each animal bite as hard as it could, or just hard enough for the staff to feel like they had a measurement? If the latter, recorded values may not reflect true maximum bite strength.
Validity (length). The fact that they had to pass the measuring tape in and around technicians doesn’t inspire confidence in accuracy. That’s before even thinking about what happens if the alligator is moving while they’re trying to measure it. It seems unlikely that length would be measured especially accurately, never mind to one decimal place (which seems absurdly precise given the difficulties in measuring).
Question 2. Describe how the data collection could have been improved so that the assumptions would be more likely to be met.
The easiest option may be to have some method of identifying which alligators have already been measured, such as a non-toxic paint mark on the head. That would remove the chance of measuring the same alligator more than once.
Getting more accurate measurements is a bit harder. Measuring an alligator’s length and bite strength will probably always be difficult, and the way they do it seems pretty sensible from the little I know about measuring alligator bite strength (which isn’t much). Personally, I can’t think of a way to resolve this issue, so I’d likely look to account for the measurement error in the model (note, we don’t teach you how to do this).
The broader point with this example is that sometimes you can’t improve a dataset. In such cases, the important thing is to identify what impact the limitations will have on the results, which we’ll return to as the course progresses.