Workshop 2: Data visualisation

BI3010 Statistics for Biologists

Author

Roos & Pinard

Published

16 May 2026

Learning Objectives

Learning objectives
  1. Use the R package ggplot2 to produce data visualisations.
  2. Distinguish effective visualisations from ineffective ones.
  3. Apply some techniques to make figures more visually appealing.

These skills come up throughout the course and beyond.

Workshop structure

The Analytical Workflow

These workshops follow an analytical workflow. Today we focus on steps 2 and 3, expanding the toolkit you built in Workshop 1 using data visualisation.

  1. Formulate a research question.
  2. Perform exploratory data analysis [Focus of today].
  3. Identify any hidden assumptions [Focus of today].
  4. Fit an appropriate model.
  5. Diagnose the model and check assumptions.
  6. Summarise the results.
  7. Interpret and provide inferences.

Introduction to ggplot2

ggplot2 implements the “Grammar of Graphics”: a framework for building plots layer by layer. You start with a dataset, map variables to visual properties (called aesthetics) using aes(), then add one or more geometry functions that determine the form of the plot (points, lines, bars, and so on). Scales, labels, and themes refine the result. The approach is flexible enough to produce sophisticated figures with surprisingly little code. Familiarity with ggplot2 is increasingly expected by employers; Spotify has listed it as an essential criterion in job adverts.

That said, ggplot2 isn’t the only option. Base R can produce similar plots, and many of the figures in this course could be recreated that way. We use ggplot2 because we find it more intuitive, and nearly every figure you see on this course was produced with it.

Load the data

For the first part of this workshop we’ll use data on the Amazon river dolphin, or boto (Inia geoffrensis). We’re broadly interested in two questions: is there sexual dimorphism in asymptotic length (the length animals reach once they stop growing), and how does the species respond to drought? As always, check for anything unusual before starting any analysis.

R sessions sometimes reopen with objects from a previous session, because you clicked “Save workspace” on exit. Leftover objects can cause confusion. Clear your environment before starting with:

rm(list = ls())

This removes all objects in the global environment but doesn’t delete any files. In RStudio, you can do the same from the Environment pane by clicking the broom icon.

Download the data file below and save it to your BI3010/data/ folder. Open your BI3010.Rproj file to start RStudio with the working directory already set, then open a new script and load the data:

Download boto.txt

boto <- read.table("data/boto.txt", header = TRUE)

Run a quick summary() and str() to check for anything unexpected:

summary(boto)
str(boto)

Building a first plot

Install and load ggplot2 if you haven’t already:

install.packages("ggplot2")  # Run once
library(ggplot2)

Call ggplot() with no arguments, then with the dataset:

ggplot()     # Empty canvas (expected)
ggplot(boto) # Still empty: we haven't said what to plot or how

The canvas stays empty because we haven’t told ggplot2 two things: (1) which form to show the data in and (2) which variables to map to which axes. Let’s put length on the y-axis, drought on the x-axis, and show the data as points:

ggplot(boto) +
  geom_point(aes(y = length, x = drought))

From this plot and the summary statistics, we can already start drawing conclusions.

First impressions

Question 1. What variables are available in this dataset?

Three variables: sex, drought, and length.

Question 2. How many observations does the dataset contain?

71 boto dolphins.

Question 3. Based on the scatterplot, is the relationship between boto length and number of drought years positive, neutral, or negative?

Very likely negative: longer drought histories appear to correspond to shorter dolphins.

Question 4. What range of lengths do boto dolphins reach at maturity?

From summary(): roughly 170 to 238 cm. The scatterplot gives the same impression.

Additional geom_ layers

How do we include a categorical variable like sex? Swapping drought for sex on the x-axis is a start:

ggplot(boto) +
  geom_point(aes(y = length, x = sex))

The spread of data for each sex is hard to read as a single column of stacked points. Boxplots and violin plots handle this better. Changing the plot type is as simple as replacing the geometry function:

# Boxplot
ggplot(boto) +
  geom_boxplot(aes(y = length, x = sex))

# Violin plot
ggplot(boto) +
  geom_violin(aes(y = length, x = sex))

Boxplots use the median and interquartile range to summarise the distribution. Violin plots show the full distribution as a smoothed density curve, mirrored symmetrically. Boxplots have a long history in science, but violin plots are increasingly preferred because they reveal the shape of the distribution more directly.

It’s also common to overlay raw data points on top of either. Using geom_point() would just produce a column of stacked points again, so we “jitter” them instead: add a small random offset to x so they spread horizontally. We don’t want to jitter y (that would distort the length values), so we set height = 0:

ggplot(boto) +
  geom_violin(aes(y = length, x = sex)) +
  geom_jitter(aes(y = length, x = sex), height = 0, width = 0.1)

Two things to note. First, geom_jitter() comes after geom_violin(). Layers are drawn in order, so placing the violin last would cover the points. Second, height = 0 and width = 0.1 sit outside aes() because they are fixed values, not columns in the dataset. Anything that changes based on a variable in the data goes inside aes(); anything fixed goes outside it.

Debugging ggplot2 code

The following code snippets contain errors that will either produce an error message or silently fail to do what was intended. Try to spot each problem before running the code.

Puzzle 1. What is wrong with this code?

ggplot(boto) +
  geom_jitter(aes(y = length, x = sex, height = 0, width = 0.1))

The code runs but produces a warning: “Ignoring unknown aesthetics: height and width”. The arguments height and width are fixed values, not variables from the dataset, so they belong outside aes().

Puzzle 2. Find the error:

ggplot(boto) +
  geom_point(aes(x = length, x = drought))

Error: “formal argument ‘x’ matched by multiple actual arguments”. Two variables are both mapped to x. One should be y = drought.

Puzzle 3. Find the error:

ggplot(boto) +
  geom_point(aes(x = length y = drought))

Error: “unexpected symbol in: … geom_point(aes(x = length y”“. There is a missing comma between x = length and y = drought.

Question 4. How would you produce a histogram of boto lengths? (Hint: type geom_ in your script and wait for RStudio’s autocomplete, browse the ggplot2 reference, or search online.)

ggplot(boto) +
  geom_histogram(aes(x = length))

Question 5. From the boxplot, what features tell you how symmetrical the distribution is for each sex?

The box spans the interquartile range (Q1 to Q3) and the line inside shows the median. If the median sits centrally in the box and the whiskers are roughly equal in length, the distribution is roughly symmetric. Here, females have a slightly longer upper whisker and males a slightly longer lower whisker, suggesting modest asymmetry in both groups, though the differences are small.

Question 6. What is the difference between the interquartile range shown in the boxplot and the range?

Interquartile range (IQR): the difference between the 75th and 25th percentiles, i.e. the width of the box. It describes the spread of the central 50% of observations.

Range: the difference between the maximum and minimum values in the data.

Question 7. What is the difference between the interquartile range and variance?

Both measure spread, but differently. The IQR captures how wide the central half of the data is, ignoring the extremes. Variance is calculated from all observations and reflects how far each value deviates from the mean on average. A dataset with extreme outliers can have a large variance but a modest IQR.

Adding more information

We can encode additional variables through colour, shape, size, or transparency. Let’s colour the points by sex. Since sex is a column in the boto dataset, it goes inside aes():

ggplot(boto) +
  geom_point(aes(y = length, x = drought, colour = sex))

With that one addition the points are now coloured by sex. Already we can see some evidence for sexual dimorphism, and that drought has a negative relationship with length for both sexes. Let’s also increase the point size. Since we want all points the same size (not driven by a variable), size goes outside aes():

ggplot(boto) +
  geom_point(aes(y = length, x = drought, colour = sex), size = 2)

Now let’s apply a theme and improve the axis labels. Themes control the non-data elements of a figure (background, grid lines, axis text, and so on); common options are theme_bw() and theme_minimal(). Because themes don’t depend on the data, they can go anywhere in the layer stack, though by convention they go at the end. Labels are set with labs():

ggplot(boto) +
  geom_point(aes(y = length, x = drought, colour = sex), size = 2) +
  labs(y = "Boto length (cm)",
       x = "Number of droughts in past 20 years",
       colour = "Boto sex") +
  theme_bw()

Facetting splits a figure into multiple smaller panels based on a grouping variable. There are two options: facet_wrap() and facet_grid(). facet_grid() gives more precise control and is most useful when facetting by two variables at once; facet_wrap() automates the layout and is a good default for a single grouping variable. In both, you specify the grouping variable using ~ (called “tilde”), which you can read as “split by”. So facet_wrap(~sex) means “create one panel per level of sex”:

ggplot(boto) +
  geom_point(aes(y = length, x = drought, colour = sex), size = 2) +
  labs(y = "Boto length (cm)",
       x = "Number of droughts in past 20 years",
       colour = "Boto sex") +
  facet_wrap(~sex) +
  theme_bw()

Interpreting the faceted figure

Question 1. Imagine drawing a line through the points in each panel. Would those lines have the same slope?

Probably not. Male boto length appears to have a stronger negative relationship with drought than females, suggesting the slopes differ between sexes.

Question 2. Does your answer to Question 1 suggest any assumption might be violated?

Yes, this raises a potential violation of additivity. If drought affects males more strongly than females, the two predictors aren’t acting independently; their combined effect matters. If the slopes were identical for both sexes, the additivity assumption would hold.

Question 3. Having explored the data with plots and summary statistics, do you notice anything suspicious?

Nothing obviously wrong, but the precision is worth questioning. Lengths are recorded to the nearest centimetre. How do you measure a dolphin in the Amazon to 1 cm accuracy? Something worth verifying with whoever collected the data.

Choosing geoms

Browse the ggplot2 reference, specifically the layers section. Your challenge is to identify which single geom_ produces the figure described below, using only x = and y = inside aes(). If you pick the right one, you may be prompted to install an additional package; type 1 and press Enter when asked, and decline if asked whether to install from source.

The figure divides the plotting area into hexagonal cells and colours each one by the number of data points that fall within it.

Challenge. Which geom_ produces a hexagonal bin plot? What does it show, and when might you prefer it over a regular scatterplot?

geom_hex(). It divides the plot area into hexagonal bins and colours each bin by how many points fall inside it. It is useful when there are so many overlapping points that a regular scatterplot becomes a smear, sometimes called “overplotting”.

ggplot(boto) +
  geom_hex(aes(x = drought, y = length)) +
  theme_bw()

Interpreting common plots

Scatterplots, histograms, density plots, boxplots, and violin plots each have particular strengths for EDA. We’ll assume you’re comfortable with scatterplots and boxplots at this stage. Histograms and density plots are worth a closer look.

Histograms

Histograms show how many observations fall within each equal-width interval (or “bin”) of a variable. The violin plots you made earlier are essentially density plots rotated sideways; the shapes of those violins reflect smoothed versions of histograms.

Create a histogram of boto length:

ggplot(boto) +
  geom_histogram(aes(x = length), colour = "white") +
  labs(x = "Boto dolphin length",
       y = "Frequency") +
  theme_bw()

By default, ggplot2 uses 30 bins. With this dataset, each bin covers roughly (max - min) / 30, or about 2.3 cm. There is no universally correct binwidth; aim for something that reveals the overall shape without each bar containing just a single observation. Set the binwidth manually:

ggplot(boto) +
  geom_histogram(aes(x = length), colour = "white", binwidth = 5) +
  labs(x = "Boto dolphin length",
       y = "Frequency") +
  theme_bw()

With binwidth = 5, each bar represents a 5 cm interval. The bin centred on 200 cm contains dolphins between 197.5 and 202.5 cm. The bin at the far right contains the single longest dolphin (around 240 cm).

Density plots

Density plots are very similar to histograms but differ in two ways: (1) the y-axis shows proportion rather than count, and (2) a smooth curve replaces the bars. Read them the same way: “where does the data concentrate?”

ggplot(boto) +
  geom_density(aes(x = length)) +
  labs(x = "Boto dolphin length",
       y = "Proportion") +
  theme_bw()

Comparing to the histogram, both lead to the same interpretation: most dolphins are below about 210 cm, a fair number fall between 210 and 230 cm, and very few exceed 230 cm.

To overlay both on the same axes, you need to convert the density scale to counts (multiply proportion by the number of observations and by the binwidth):

ggplot(boto) +
  geom_histogram(aes(x = length), colour = "white", binwidth = 5) +
  geom_density(aes(x = length, y = after_stat(density * n * 5))) +
  labs(x = "Boto dolphin length",
       y = "Frequency") +
  theme_bw()

This is rarely done in practice but helps clarify how the two plot types relate to each other.

The Good and The Bad

The second dataset for this workshop is abdn_flats.txt. Download it below and save it to your BI3010/data/ folder. It contains more than 500 records of rental properties listed in Aberdeenshire on the ASPC website. The variables are:

  1. Price: monthly rent (£)
  2. Bedrooms: number of bedrooms
  3. PublicRooms: number of living rooms
  4. Rooms: total bedrooms plus living rooms
  5. Bathrooms: number of bathrooms
  6. FloorArea: total floor area (m²)
  7. EPC: Energy Performance Certificate (A is most efficient, G is worst)
  8. Tax: council tax band (A is lowest, H is highest; based on property value in April 1991)
  9. DayAdded: when the record was added to the dataset (not when the property was listed)
  10. PropertyURL: web link for the listing
  11. Address: property name or address
  12. Latitude: latitude coordinate (North to South; think of a “ladder”)
  13. Longitude: longitude coordinate (East to West; “the long way around”)
  14. UniCommuteTime: estimated walk time to the Zoology Building (Google Maps API)
  15. HouseType: detached, semi-detached, terrace, or flat
  16. Furnished: fully furnished, part furnished, or unfurnished

Download abdn_flats.txt

Do a quick EDA to get familiar with the data and check for anything unusual. If you spot what looks like an error, you can verify it directly using the property URL. We’ll return to this dataset later in the course.

The Good

Using whichever variables you like, produce the most informative and visually appealing figure you can. Imagine you have been hired by a real estate company to help potential renters understand what drives rent prices. Your audience is the general public, not scientists, so aim for something that communicates clearly and catches the eye. Avoid the pitfalls from the lecture.

Combining multiple plots with patchwork

Sometimes the clearest way to show several aspects of a dataset is to arrange multiple plots side by side or stacked. The patchwork package makes this straightforward:

install.packages("patchwork")  # run once
library(patchwork)

p1 <- ggplot(abdn) + geom_histogram(aes(x = Price))
p2 <- ggplot(abdn) + geom_point(aes(x = UniCommuteTime, y = Price))

p1 + p2        # side by side
p1 / p2        # stacked
(p1 + p2) / p3 # two on top, one below

Each object (p1, p2, etc.) is just a ggplot saved to a variable. You combine them using + (side by side) and / (stacked). That’s all there is to it.

If you need help implementing an idea, the Intro2R chapter on graphics has a more detailed walkthrough, or ask an LLM. If using an LLM, specify that you’re using R and ggplot2 and ask it not to use additional packages (extra packages add complexity and can cause conflicts while you’re still building familiarity).

Your figure. Produce your best figure. Describe what you think makes it effective.

There’s no single right answer. Here’s one approach combining several views of the data:

abdn <- read.table("data/abdn_flats.txt", header = TRUE)

library(patchwork)  # lets you stitch ggplots together

p1 <- ggplot(abdn) +
  geom_histogram(aes(x = Price), fill = "#EC1365", colour = "white", binwidth = 25) +
  theme_classic() +
  labs(x = "Rental price (£)", y = "Frequency")

p2 <- ggplot(abdn) +
  geom_density(aes(x = Price, fill = Tax), alpha = 0.7) +
  scale_fill_brewer(palette = "Set2") +
  theme_classic() +
  theme(legend.position = "top") +
  labs(x = "Rental price (£)", fill = "Tax band", y = "Density")

p3 <- ggplot(abdn) +
  geom_point(aes(x = UniCommuteTime, y = Price), colour = "#EC1365") +
  theme_classic() +
  labs(x = "Walk time to Uni (mins)", y = "Rental price (£)")

p4 <- ggplot(abdn) +
  geom_violin(aes(x = HouseType, y = Price), fill = "#EC1365",
              draw_quantiles = c(0.25, 0.75)) +
  theme_classic() +
  labs(x = "Type of property", y = "Rental price (£)")

(p1 + p2) / (p3 + p4)

The Bad

Now do the opposite: produce the most visually offensive figure you can. It must still contain real data and be at least vaguely readable, but otherwise go wild. Search “graph crimes” or “chart crime” for inspiration. Upload your masterpiece to the MyAberdeen discussion board for the Art Exhibit at the end of the workshop.

Your disaster figure. Describe what you did to make it so terrible.

One approach: abuse the theme() function to override every colour and size setting.

ggplot(abdn) +
  geom_point(aes(x = PublicRooms, y = FloorArea, colour = Latitude),
             shape = 24, size = 10) +
  theme(
    panel.background = element_rect(fill = "yellow"),
    plot.background  = element_rect(fill = "pink"),
    panel.grid.major = element_line(colour = "red", linewidth = 2),
    panel.grid.minor = element_line(colour = "blue", linewidth = 1),
    axis.title.x     = element_text(size = 20, angle = 90, colour = "purple"),
    axis.title.y     = element_text(size = 4,  angle = 0,  colour = "orange"),
    axis.text        = element_text(size = 15, colour = "green")
  ) +
  scale_colour_gradient(low = "pink", high = "green") +
  labs(x = "Y-axis", y = "X-axis", colour = "[?]") +
  facet_wrap(~EPC, scales = "free_y")

Figures from the Real World

Below are figures encountered in the wild. For each one, consider what works, what doesn’t, and why the author made the choices they did. Also think about how you might recreate each figure in ggplot2.

The y-axis!

This figure comes from the Financial Times and shows bus usage inside and outside London. Pay close attention to the y-axis. (The Financial Times produces their figures in ggplot2 and even has their own theme package, ftplottools.)

Discussion. Is there anything inappropriate about this figure?

This figure made a lot of people angry online, mostly because the y-axis runs from -50% to +100% rather than from 0, and because the values have been log-transformed, which some argued exaggerated the differences.

In truth, it’s a pretty clever axis. A 100% increase means doubling (5 buses to 10). A -50% decrease means halving (10 buses to 5). On a log scale, doubling and halving take up the same vertical distance, which makes the axis symmetric and honest. Without the transformation, a -50% decline would look much smaller than an equivalent +100% increase, which would actually be misleading.

People get very wound up about y-axes, especially when the bottom is not zero. Honestly, I don’t think there’s one correct answer. There are times when it’s entirely appropriate to not start at zero, and other times when it’s not. I dislike treating it as though there’s a universal rule. There isn’t. Think about what the best way to show your data is, and do that.

Are you even happy?

This figure comes from the World Happiness Report. Most figures in the full report are reasonably well done; this one is an exception.

Discussion. What is wrong with this figure?

The bars show rankings, not magnitudes. Finland is ranked 1st, New Zealand is 10th. The figure is just a glorified sorted list; the bars add no information beyond the order.

More subtly, I’m always sceptical of “happiness” ratings as data. How do you assign a number to how happy you feel? Would you and I give the same score if we felt the same way? What if I’m just having an off day? How many people do you ask? The data underlying this figure is most likely pretty imprecise. At the very least, it doesn’t pass a validity check. Apply that check before treating numbers like these as meaningful.

Birds and light

This figure comes from Pease et al. (2025), published in Science. The study used more than 60 million bird detections linked to satellite measures of light pollution. The figure shows three stages of data processing: unfiltered detections, filtered detections, and the response variable used in the model.

Discussion. What works and what doesn’t in this figure?

Several problems, published in one of the most prestigious journals in science:

  1. Stacked bars make comparisons hard. Can you tell whether “Unfiltered detections” in August 2023 differs from September 2023? I genuinely can’t. And the “Response variable” slice at the bottom is so thin it’s almost invisible. That’s the variable that was actually used in the model. What’s the point of including it if it can’t be read?
  2. The colour palette is probably not colour-blind friendly. It costs nothing to check, and there’s no reason to exclude part of the audience.
  3. The x-axis is overloaded. Printing every month label makes it cluttered and hard to read. Three labels (start, middle, end) would be much cleaner. This probably happened because the author stored date as a factor in R, which prints every level.

A better approach: points for observation counts connected by a line, split into three separate panels with independent y-axes, so the trend in the “Response variable” is actually visible.

Marmosets and ketamine (?!)

The next figure comes from Wood et al. (2025), also published in Science. Honestly, I don’t really get this paper and I don’t know how to summarise it. The authors did something to monkey brains. There was ketamine. There were vehicles. I don’t know man. Anyway, here’s the figure…

Discussion. What problems do you see?

A pretty miserable figure all things considered. Here are my issues with it:

  1. The colour palette. The hell is with the weird colour palette and pattern that looks like it was stolen from a 1990s bus seat? Who was like “Yeaaaaaah bruh, this looks siiiiiiick dude.” Why is one bar darker than the other?
  2. Abbreviations. Do you know what “Sal/Veh” means? Great figures don’t need a legend or a paper for you to make sense of them; you can see amazing figures with zero context and they still make total sense. I don’t know what this figure is telling me even with context.
  3. Inconsistent data point styling. All the randomly shaped points. What do they mean? Why are some filled and others empty? Are those all the observations they had, because if so — what’s that? Like 10 observations? Get outta here with that.
  4. Bars are the wrong choice. The bars are completely unnecessary here and probably misleading. What the authors are trying to show are specific values (called “point estimates”). For example, the category “Sal/DCZ” is estimated at roughly -50 on the y-axis. That’s the value that’s important. By using a bar you’re implying there was also a -49, and -48, and -47, all the way to 0. That’s not the case. It’s -50. Just -50. Using bars is, for some reason, super common in some fields. Medics looooooove them. They’re overused and rarely the best choice. A lot of the time (but not always), bars should be replaced by points.

How I’d make this figure: I’d use points for the estimated values instead of bars. I probably wouldn’t add points to show the data, but if I did, I’d make those points much smaller (and be consistent). I’d write the full category name and rotate them 90 degrees so it’d fit. (I’d also get rid of the P-value “bars” and asterisks at the top of the figure.)

Glossary / 词汇表
ggplot2 framework
ggplot2 ggplot2
An R package for data visualisation based on the Grammar of Graphics. Plots are built by combining layers (data, aesthetics, geometries, themes).
基于图形语法的 R 数据可视化包。通过组合数据、美学映射、几何对象和主题等图层来构建图表。
Grammar of Graphics 图形语法
A framework that describes plots in terms of data, aesthetics, and geometric objects. ggplot2 implements this framework in R.
一种用数据、美学属性和几何对象来描述图形的框架,ggplot2 在 R 中实现了这一框架。
Aesthetic mapping (aes()) 美学映射
The link between a variable in the dataset and a visual property of the plot, such as position (x, y), colour, shape, or size. Defined inside aes().
数据集中变量与图形视觉属性(如位置、颜色、形状、大小)之间的对应关系,在 aes() 中定义。
Geometry (geom_) 几何对象
The visual form used to represent data: points (geom_point), bars (geom_bar), lines (geom_line), etc. Each geom_ function adds a layer to the plot.
数据的视觉呈现形式,如点(geom_point)、柱(geom_bar)、线(geom_line)等。每个 geom_ 函数向图形添加一个图层。
Layer 图层
One component added to a ggplot, such as a geometry, a set of labels, or a theme. Layers are stacked in the order they are written; later layers appear on top.
添加到 ggplot 图形中的一个组件,如几何对象、标签或主题。图层按照书写顺序叠加,后写的图层显示在上方。
Theme 主题
Controls the non-data elements of a plot: background colour, grid lines, axis text, and so on. Common options include theme_bw() and theme_minimal().
控制图形中非数据元素的样式,如背景颜色、网格线、坐标轴文字等。常用选项包括 theme_bw() 和 theme_minimal()。
Facet 分面
A way to split one plot into multiple panels based on a categorical variable. facet_wrap(~variable) creates one panel per level of the variable.
根据分类变量将一张图拆分为多个子图的方法。facet_wrap(~变量) 为变量的每个水平创建一个子图。
Plot types
Histogram 频率直方图
A plot that counts how many observations fall within each equal-width bin of a continuous variable. Useful for understanding the shape and spread of a distribution.
统计连续变量在每个等宽区间(组距箱)内观测值数量的图表,用于了解数据分布的形状和范围。
Binwidth 组距
The width of each bar in a histogram. Wider bins produce a smoother picture; narrower bins show more detail. Set with the binwidth argument in geom_histogram().
直方图中每个柱的宽度。组距越宽,图形越平滑;组距越窄,细节越多。通过 geom_histogram() 的 binwidth 参数设置。
Density plot 密度图
A smoothed version of a histogram where the y-axis shows proportion rather than count. Read it the same way: taller regions contain more observations.
直方图的平滑版本,纵轴表示比例而非计数。解读方式相同:较高的区域包含更多观测值。
Boxplot 箱形图
A plot that summarises a distribution using the median (central line), interquartile range (box), and range (whiskers). Outliers may appear as individual points beyond the whiskers.
用中位数(中线)、四分位距(箱体)和范围(须线)来概括数据分布的图表。离群值可能以须线外的单独点呈现。
Interquartile range (IQR) 四分位距
The difference between the 75th and 25th percentiles. It captures the spread of the central 50% of observations and equals the box height in a boxplot.
第75百分位数与第25百分位数之差,反映中间50%数据的分散程度,等于箱形图中箱体的高度。
Violin plot 小提琴图
A plot that shows the distribution of a variable as a mirrored density curve. Wider sections contain more observations. Often combined with jittered points.
以对称密度曲线展示变量分布的图表,较宽处包含更多观测值,常与抖动点结合使用。
Scatterplot 散点图
A plot that shows the relationship between two continuous variables using points. Each point represents one observation.
用点展示两个连续变量之间关系的图表,每个点代表一个观测值。
Plotting techniques
Jitter 抖动
A small random offset added to data points to prevent them overlapping. In ggplot2, use geom_jitter() and control the amount with the height and width arguments.
为防止数据点重叠而添加的微小随机偏移量。在 ggplot2 中使用 geom_jitter(),通过 height 和 width 参数控制抖动幅度。
Overplotting 过度叠加
When many data points overlap in a scatterplot, making the density of points hard to read. Solutions include jittering, transparency (alpha), or hexagonal bins (geom_hex).
散点图中大量数据点重叠导致难以判断密度的问题。解决方法包括抖动、透明度(alpha)或六边形分箱(geom_hex)。
Dataset terms
Sexual dimorphism 性别二型性
Systematic differences in body size, shape, or colour between males and females of the same species. The boto data explores whether males and females differ in asymptotic length.
同一物种雌雄个体在体型、形态或颜色上的系统性差异。本次数据探讨雌雄亚马逊河豚的渐近体长是否存在差异。
Colour-blind safe palette 色盲友好调色板
A set of colours that remain distinguishable for people with common forms of colour blindness. In ggplot2, scale_fill_brewer() and scale_colour_brewer() offer several such palettes.
对常见色盲类型人群仍可区分的颜色组合。在 ggplot2 中,scale_fill_brewer() 和 scale_colour_brewer() 提供多种此类调色板。
Energy Performance Certificate (EPC) 能源效能证书
A rating from A (most efficient) to G (least efficient) that indicates how energy efficient a building is. Used in the Aberdeen rental dataset.
评估建筑能源效率的评级,从 A(最高效)到 G(最低效),在阿伯丁租房数据集中使用。
Asymptotic length 渐近体长
The maximum body length an animal approaches as it ages, after growth has effectively stopped. Used in growth models and in this workshop to describe adult boto dolphins.
动物随年龄增长趋于停止生长时所接近的最大体长,用于生长模型,在本次工作坊中用于描述成年亚马逊河豚的体长。