boto <- read.table("data/boto.txt", header = TRUE)Workshop 2: Data visualisation
BI3010 Statistics for Biologists
Learning Objectives
- Use the
Rpackageggplot2to produce data visualisations. - Distinguish effective visualisations from ineffective ones.
- Apply some techniques to make figures more visually appealing.
These skills come up throughout the course and beyond.
Workshop structure
These workshops follow an analytical workflow. Today we focus on steps 2 and 3, expanding the toolkit you built in Workshop 1 using data visualisation.
- Formulate a research question.
- Perform exploratory data analysis [Focus of today].
- Identify any hidden assumptions [Focus of today].
- Fit an appropriate model.
- Diagnose the model and check assumptions.
- Summarise the results.
- Interpret and provide inferences.
Introduction to ggplot2
ggplot2 implements the “Grammar of Graphics”: a framework for building plots layer by layer. You start with a dataset, map variables to visual properties (called aesthetics) using aes(), then add one or more geometry functions that determine the form of the plot (points, lines, bars, and so on). Scales, labels, and themes refine the result. The approach is flexible enough to produce sophisticated figures with surprisingly little code. Familiarity with ggplot2 is increasingly expected by employers; Spotify has listed it as an essential criterion in job adverts.
That said, ggplot2 isn’t the only option. Base R can produce similar plots, and many of the figures in this course could be recreated that way. We use ggplot2 because we find it more intuitive, and nearly every figure you see on this course was produced with it.
Load the data
For the first part of this workshop we’ll use data on the Amazon river dolphin, or boto (Inia geoffrensis). We’re broadly interested in two questions: is there sexual dimorphism in asymptotic length (the length animals reach once they stop growing), and how does the species respond to drought? As always, check for anything unusual before starting any analysis.
R sessions sometimes reopen with objects from a previous session, because you clicked “Save workspace” on exit. Leftover objects can cause confusion. Clear your environment before starting with:
rm(list = ls())This removes all objects in the global environment but doesn’t delete any files. In RStudio, you can do the same from the Environment pane by clicking the broom icon.
Download the data file below and save it to your BI3010/data/ folder. Open your BI3010.Rproj file to start RStudio with the working directory already set, then open a new script and load the data:
Run a quick summary() and str() to check for anything unexpected:
summary(boto)
str(boto)Building a first plot
Install and load ggplot2 if you haven’t already:
install.packages("ggplot2") # Run once
library(ggplot2)Call ggplot() with no arguments, then with the dataset:
ggplot() # Empty canvas (expected)
ggplot(boto) # Still empty: we haven't said what to plot or howThe canvas stays empty because we haven’t told ggplot2 two things: (1) which form to show the data in and (2) which variables to map to which axes. Let’s put length on the y-axis, drought on the x-axis, and show the data as points:
ggplot(boto) +
geom_point(aes(y = length, x = drought))From this plot and the summary statistics, we can already start drawing conclusions.
First impressions
Question 1. What variables are available in this dataset?
Three variables: sex, drought, and length.
Question 2. How many observations does the dataset contain?
71 boto dolphins.
Question 3. Based on the scatterplot, is the relationship between boto length and number of drought years positive, neutral, or negative?
Very likely negative: longer drought histories appear to correspond to shorter dolphins.
Question 4. What range of lengths do boto dolphins reach at maturity?
From summary(): roughly 170 to 238 cm. The scatterplot gives the same impression.
Additional geom_ layers
How do we include a categorical variable like sex? Swapping drought for sex on the x-axis is a start:
ggplot(boto) +
geom_point(aes(y = length, x = sex))The spread of data for each sex is hard to read as a single column of stacked points. Boxplots and violin plots handle this better. Changing the plot type is as simple as replacing the geometry function:
# Boxplot
ggplot(boto) +
geom_boxplot(aes(y = length, x = sex))
# Violin plot
ggplot(boto) +
geom_violin(aes(y = length, x = sex))Boxplots use the median and interquartile range to summarise the distribution. Violin plots show the full distribution as a smoothed density curve, mirrored symmetrically. Boxplots have a long history in science, but violin plots are increasingly preferred because they reveal the shape of the distribution more directly.
It’s also common to overlay raw data points on top of either. Using geom_point() would just produce a column of stacked points again, so we “jitter” them instead: add a small random offset to x so they spread horizontally. We don’t want to jitter y (that would distort the length values), so we set height = 0:
ggplot(boto) +
geom_violin(aes(y = length, x = sex)) +
geom_jitter(aes(y = length, x = sex), height = 0, width = 0.1)Two things to note. First, geom_jitter() comes after geom_violin(). Layers are drawn in order, so placing the violin last would cover the points. Second, height = 0 and width = 0.1 sit outside aes() because they are fixed values, not columns in the dataset. Anything that changes based on a variable in the data goes inside aes(); anything fixed goes outside it.
Debugging ggplot2 code
The following code snippets contain errors that will either produce an error message or silently fail to do what was intended. Try to spot each problem before running the code.
Puzzle 1. What is wrong with this code?
ggplot(boto) +
geom_jitter(aes(y = length, x = sex, height = 0, width = 0.1))
The code runs but produces a warning: “Ignoring unknown aesthetics: height and width”. The arguments height and width are fixed values, not variables from the dataset, so they belong outside aes().
Puzzle 2. Find the error:
ggplot(boto) +
geom_point(aes(x = length, x = drought))
Error: “formal argument ‘x’ matched by multiple actual arguments”. Two variables are both mapped to x. One should be y = drought.
Puzzle 3. Find the error:
ggplot(boto) +
geom_point(aes(x = length y = drought))
Error: “unexpected symbol in: … geom_point(aes(x = length y”“. There is a missing comma between x = length and y = drought.
Question 4. How would you produce a histogram of boto lengths? (Hint: type geom_ in your script and wait for RStudio’s autocomplete, browse the ggplot2 reference, or search online.)
ggplot(boto) +
geom_histogram(aes(x = length))
Question 5. From the boxplot, what features tell you how symmetrical the distribution is for each sex?
The box spans the interquartile range (Q1 to Q3) and the line inside shows the median. If the median sits centrally in the box and the whiskers are roughly equal in length, the distribution is roughly symmetric. Here, females have a slightly longer upper whisker and males a slightly longer lower whisker, suggesting modest asymmetry in both groups, though the differences are small.
Question 6. What is the difference between the interquartile range shown in the boxplot and the range?
Interquartile range (IQR): the difference between the 75th and 25th percentiles, i.e. the width of the box. It describes the spread of the central 50% of observations.
Range: the difference between the maximum and minimum values in the data.
Question 7. What is the difference between the interquartile range and variance?
Both measure spread, but differently. The IQR captures how wide the central half of the data is, ignoring the extremes. Variance is calculated from all observations and reflects how far each value deviates from the mean on average. A dataset with extreme outliers can have a large variance but a modest IQR.
Adding more information
We can encode additional variables through colour, shape, size, or transparency. Let’s colour the points by sex. Since sex is a column in the boto dataset, it goes inside aes():
ggplot(boto) +
geom_point(aes(y = length, x = drought, colour = sex))With that one addition the points are now coloured by sex. Already we can see some evidence for sexual dimorphism, and that drought has a negative relationship with length for both sexes. Let’s also increase the point size. Since we want all points the same size (not driven by a variable), size goes outside aes():
ggplot(boto) +
geom_point(aes(y = length, x = drought, colour = sex), size = 2)Now let’s apply a theme and improve the axis labels. Themes control the non-data elements of a figure (background, grid lines, axis text, and so on); common options are theme_bw() and theme_minimal(). Because themes don’t depend on the data, they can go anywhere in the layer stack, though by convention they go at the end. Labels are set with labs():
ggplot(boto) +
geom_point(aes(y = length, x = drought, colour = sex), size = 2) +
labs(y = "Boto length (cm)",
x = "Number of droughts in past 20 years",
colour = "Boto sex") +
theme_bw()Facetting splits a figure into multiple smaller panels based on a grouping variable. There are two options: facet_wrap() and facet_grid(). facet_grid() gives more precise control and is most useful when facetting by two variables at once; facet_wrap() automates the layout and is a good default for a single grouping variable. In both, you specify the grouping variable using ~ (called “tilde”), which you can read as “split by”. So facet_wrap(~sex) means “create one panel per level of sex”:
ggplot(boto) +
geom_point(aes(y = length, x = drought, colour = sex), size = 2) +
labs(y = "Boto length (cm)",
x = "Number of droughts in past 20 years",
colour = "Boto sex") +
facet_wrap(~sex) +
theme_bw()Interpreting the faceted figure
Question 1. Imagine drawing a line through the points in each panel. Would those lines have the same slope?
Probably not. Male boto length appears to have a stronger negative relationship with drought than females, suggesting the slopes differ between sexes.
Question 2. Does your answer to Question 1 suggest any assumption might be violated?
Yes, this raises a potential violation of additivity. If drought affects males more strongly than females, the two predictors aren’t acting independently; their combined effect matters. If the slopes were identical for both sexes, the additivity assumption would hold.
Question 3. Having explored the data with plots and summary statistics, do you notice anything suspicious?
Nothing obviously wrong, but the precision is worth questioning. Lengths are recorded to the nearest centimetre. How do you measure a dolphin in the Amazon to 1 cm accuracy? Something worth verifying with whoever collected the data.
Choosing geoms
Browse the ggplot2 reference, specifically the layers section. Your challenge is to identify which single geom_ produces the figure described below, using only x = and y = inside aes(). If you pick the right one, you may be prompted to install an additional package; type 1 and press Enter when asked, and decline if asked whether to install from source.
The figure divides the plotting area into hexagonal cells and colours each one by the number of data points that fall within it.
Challenge. Which geom_ produces a hexagonal bin plot? What does it show, and when might you prefer it over a regular scatterplot?
geom_hex(). It divides the plot area into hexagonal bins and colours each bin by how many points fall inside it. It is useful when there are so many overlapping points that a regular scatterplot becomes a smear, sometimes called “overplotting”.
ggplot(boto) +
geom_hex(aes(x = drought, y = length)) +
theme_bw()
Interpreting common plots
Scatterplots, histograms, density plots, boxplots, and violin plots each have particular strengths for EDA. We’ll assume you’re comfortable with scatterplots and boxplots at this stage. Histograms and density plots are worth a closer look.
Histograms
Histograms show how many observations fall within each equal-width interval (or “bin”) of a variable. The violin plots you made earlier are essentially density plots rotated sideways; the shapes of those violins reflect smoothed versions of histograms.
Create a histogram of boto length:
ggplot(boto) +
geom_histogram(aes(x = length), colour = "white") +
labs(x = "Boto dolphin length",
y = "Frequency") +
theme_bw()By default, ggplot2 uses 30 bins. With this dataset, each bin covers roughly (max - min) / 30, or about 2.3 cm. There is no universally correct binwidth; aim for something that reveals the overall shape without each bar containing just a single observation. Set the binwidth manually:
ggplot(boto) +
geom_histogram(aes(x = length), colour = "white", binwidth = 5) +
labs(x = "Boto dolphin length",
y = "Frequency") +
theme_bw()With binwidth = 5, each bar represents a 5 cm interval. The bin centred on 200 cm contains dolphins between 197.5 and 202.5 cm. The bin at the far right contains the single longest dolphin (around 240 cm).
Density plots
Density plots are very similar to histograms but differ in two ways: (1) the y-axis shows proportion rather than count, and (2) a smooth curve replaces the bars. Read them the same way: “where does the data concentrate?”
ggplot(boto) +
geom_density(aes(x = length)) +
labs(x = "Boto dolphin length",
y = "Proportion") +
theme_bw()Comparing to the histogram, both lead to the same interpretation: most dolphins are below about 210 cm, a fair number fall between 210 and 230 cm, and very few exceed 230 cm.
To overlay both on the same axes, you need to convert the density scale to counts (multiply proportion by the number of observations and by the binwidth):
ggplot(boto) +
geom_histogram(aes(x = length), colour = "white", binwidth = 5) +
geom_density(aes(x = length, y = after_stat(density * n * 5))) +
labs(x = "Boto dolphin length",
y = "Frequency") +
theme_bw()This is rarely done in practice but helps clarify how the two plot types relate to each other.
The Good and The Bad
The second dataset for this workshop is abdn_flats.txt. Download it below and save it to your BI3010/data/ folder. It contains more than 500 records of rental properties listed in Aberdeenshire on the ASPC website. The variables are:
Price: monthly rent (£)Bedrooms: number of bedroomsPublicRooms: number of living roomsRooms: total bedrooms plus living roomsBathrooms: number of bathroomsFloorArea: total floor area (m²)EPC: Energy Performance Certificate (A is most efficient, G is worst)Tax: council tax band (A is lowest, H is highest; based on property value in April 1991)DayAdded: when the record was added to the dataset (not when the property was listed)PropertyURL: web link for the listingAddress: property name or addressLatitude: latitude coordinate (North to South; think of a “ladder”)Longitude: longitude coordinate (East to West; “the long way around”)UniCommuteTime: estimated walk time to the Zoology Building (Google Maps API)HouseType: detached, semi-detached, terrace, or flatFurnished: fully furnished, part furnished, or unfurnished
Do a quick EDA to get familiar with the data and check for anything unusual. If you spot what looks like an error, you can verify it directly using the property URL. We’ll return to this dataset later in the course.
The Good
Using whichever variables you like, produce the most informative and visually appealing figure you can. Imagine you have been hired by a real estate company to help potential renters understand what drives rent prices. Your audience is the general public, not scientists, so aim for something that communicates clearly and catches the eye. Avoid the pitfalls from the lecture.
patchwork
Sometimes the clearest way to show several aspects of a dataset is to arrange multiple plots side by side or stacked. The patchwork package makes this straightforward:
install.packages("patchwork") # run once
library(patchwork)
p1 <- ggplot(abdn) + geom_histogram(aes(x = Price))
p2 <- ggplot(abdn) + geom_point(aes(x = UniCommuteTime, y = Price))
p1 + p2 # side by side
p1 / p2 # stacked
(p1 + p2) / p3 # two on top, one belowEach object (p1, p2, etc.) is just a ggplot saved to a variable. You combine them using + (side by side) and / (stacked). That’s all there is to it.
If you need help implementing an idea, the Intro2R chapter on graphics has a more detailed walkthrough, or ask an LLM. If using an LLM, specify that you’re using R and ggplot2 and ask it not to use additional packages (extra packages add complexity and can cause conflicts while you’re still building familiarity).
Your figure. Produce your best figure. Describe what you think makes it effective.
There’s no single right answer. Here’s one approach combining several views of the data:
abdn <- read.table("data/abdn_flats.txt", header = TRUE)
library(patchwork) # lets you stitch ggplots together
p1 <- ggplot(abdn) +
geom_histogram(aes(x = Price), fill = "#EC1365", colour = "white", binwidth = 25) +
theme_classic() +
labs(x = "Rental price (£)", y = "Frequency")
p2 <- ggplot(abdn) +
geom_density(aes(x = Price, fill = Tax), alpha = 0.7) +
scale_fill_brewer(palette = "Set2") +
theme_classic() +
theme(legend.position = "top") +
labs(x = "Rental price (£)", fill = "Tax band", y = "Density")
p3 <- ggplot(abdn) +
geom_point(aes(x = UniCommuteTime, y = Price), colour = "#EC1365") +
theme_classic() +
labs(x = "Walk time to Uni (mins)", y = "Rental price (£)")
p4 <- ggplot(abdn) +
geom_violin(aes(x = HouseType, y = Price), fill = "#EC1365",
draw_quantiles = c(0.25, 0.75)) +
theme_classic() +
labs(x = "Type of property", y = "Rental price (£)")
(p1 + p2) / (p3 + p4)
The Bad
Now do the opposite: produce the most visually offensive figure you can. It must still contain real data and be at least vaguely readable, but otherwise go wild. Search “graph crimes” or “chart crime” for inspiration. Upload your masterpiece to the MyAberdeen discussion board for the Art Exhibit at the end of the workshop.
Your disaster figure. Describe what you did to make it so terrible.
One approach: abuse the theme() function to override every colour and size setting.
ggplot(abdn) +
geom_point(aes(x = PublicRooms, y = FloorArea, colour = Latitude),
shape = 24, size = 10) +
theme(
panel.background = element_rect(fill = "yellow"),
plot.background = element_rect(fill = "pink"),
panel.grid.major = element_line(colour = "red", linewidth = 2),
panel.grid.minor = element_line(colour = "blue", linewidth = 1),
axis.title.x = element_text(size = 20, angle = 90, colour = "purple"),
axis.title.y = element_text(size = 4, angle = 0, colour = "orange"),
axis.text = element_text(size = 15, colour = "green")
) +
scale_colour_gradient(low = "pink", high = "green") +
labs(x = "Y-axis", y = "X-axis", colour = "[?]") +
facet_wrap(~EPC, scales = "free_y")
Figures from the Real World
Below are figures encountered in the wild. For each one, consider what works, what doesn’t, and why the author made the choices they did. Also think about how you might recreate each figure in ggplot2.
The y-axis!
This figure comes from the Financial Times and shows bus usage inside and outside London. Pay close attention to the y-axis. (The Financial Times produces their figures in ggplot2 and even has their own theme package, ftplottools.)
Discussion. Is there anything inappropriate about this figure?
This figure made a lot of people angry online, mostly because the y-axis runs from -50% to +100% rather than from 0, and because the values have been log-transformed, which some argued exaggerated the differences.
In truth, it’s a pretty clever axis. A 100% increase means doubling (5 buses to 10). A -50% decrease means halving (10 buses to 5). On a log scale, doubling and halving take up the same vertical distance, which makes the axis symmetric and honest. Without the transformation, a -50% decline would look much smaller than an equivalent +100% increase, which would actually be misleading.
People get very wound up about y-axes, especially when the bottom is not zero. Honestly, I don’t think there’s one correct answer. There are times when it’s entirely appropriate to not start at zero, and other times when it’s not. I dislike treating it as though there’s a universal rule. There isn’t. Think about what the best way to show your data is, and do that.
Are you even happy?
This figure comes from the World Happiness Report. Most figures in the full report are reasonably well done; this one is an exception.
Discussion. What is wrong with this figure?
The bars show rankings, not magnitudes. Finland is ranked 1st, New Zealand is 10th. The figure is just a glorified sorted list; the bars add no information beyond the order.
More subtly, I’m always sceptical of “happiness” ratings as data. How do you assign a number to how happy you feel? Would you and I give the same score if we felt the same way? What if I’m just having an off day? How many people do you ask? The data underlying this figure is most likely pretty imprecise. At the very least, it doesn’t pass a validity check. Apply that check before treating numbers like these as meaningful.
Birds and light
This figure comes from Pease et al. (2025), published in Science. The study used more than 60 million bird detections linked to satellite measures of light pollution. The figure shows three stages of data processing: unfiltered detections, filtered detections, and the response variable used in the model.
Discussion. What works and what doesn’t in this figure?
Several problems, published in one of the most prestigious journals in science:
- Stacked bars make comparisons hard. Can you tell whether “Unfiltered detections” in August 2023 differs from September 2023? I genuinely can’t. And the “Response variable” slice at the bottom is so thin it’s almost invisible. That’s the variable that was actually used in the model. What’s the point of including it if it can’t be read?
- The colour palette is probably not colour-blind friendly. It costs nothing to check, and there’s no reason to exclude part of the audience.
- The x-axis is overloaded. Printing every month label makes it cluttered and hard to read. Three labels (start, middle, end) would be much cleaner. This probably happened because the author stored date as a factor in R, which prints every level.
A better approach: points for observation counts connected by a line, split into three separate panels with independent y-axes, so the trend in the “Response variable” is actually visible.
Marmosets and ketamine (?!)
The next figure comes from Wood et al. (2025), also published in Science. Honestly, I don’t really get this paper and I don’t know how to summarise it. The authors did something to monkey brains. There was ketamine. There were vehicles. I don’t know man. Anyway, here’s the figure…
Discussion. What problems do you see?
A pretty miserable figure all things considered. Here are my issues with it:
- The colour palette. The hell is with the weird colour palette and pattern that looks like it was stolen from a 1990s bus seat? Who was like “Yeaaaaaah bruh, this looks siiiiiiick dude.” Why is one bar darker than the other?
- Abbreviations. Do you know what “Sal/Veh” means? Great figures don’t need a legend or a paper for you to make sense of them; you can see amazing figures with zero context and they still make total sense. I don’t know what this figure is telling me even with context.
- Inconsistent data point styling. All the randomly shaped points. What do they mean? Why are some filled and others empty? Are those all the observations they had, because if so — what’s that? Like 10 observations? Get outta here with that.
- Bars are the wrong choice. The bars are completely unnecessary here and probably misleading. What the authors are trying to show are specific values (called “point estimates”). For example, the category “Sal/DCZ” is estimated at roughly -50 on the y-axis. That’s the value that’s important. By using a bar you’re implying there was also a -49, and -48, and -47, all the way to 0. That’s not the case. It’s -50. Just -50. Using bars is, for some reason, super common in some fields. Medics looooooove them. They’re overused and rarely the best choice. A lot of the time (but not always), bars should be replaced by points.
How I’d make this figure: I’d use points for the estimated values instead of bars. I probably wouldn’t add points to show the data, but if I did, I’d make those points much smaller (and be consistent). I’d write the full category name and rotate them 90 degrees so it’d fit. (I’d also get rid of the P-value “bars” and asterisks at the top of the figure.)