Workshop 1: Getting Back into R

BI3010 Statistics for Biologists

Author

Roos & Pinard

Published

16 May 2026

Welcome to the first BI3010 workshop. Over the next ten workshop sessions, you’ll go from working with raw data to fitting complex statistical models, building up a workflow you can apply to almost any dataset you encounter, including the one for your honours project next year. Today’s workshop focuses on getting comfortable with R and learning to spot problems in data before analysis begins, a step that sounds unglamorous but is where scientists spend, by some estimates, 80% of their time.

Learning Objectives

By the end of this workshop you will be able to:

(Re)familiarise yourself with the basics of R;
Use R to perform simple descriptive summaries of data and data processing;
Use the results of the EDA (Exploratory Data Analysis) to identify any issues in the data;
Begin thinking about the assumptions we make with a specific dataset even before we begin using statistics.

These skills apply every time you pick up a new dataset and will come up throughout the course.

Workshop structure

The Analytical Workflow

These workshops follow an analytical workflow built around linear models, but the same steps apply to almost any analysis: you’re learning how to go from a raw dataset to an evidenced answer.

The general workflow we’ll be using can be broken down into:

Formulate a research question.
Perform exploratory data analysis [Focus of today].
Identify any hidden assumptions.
Fit an appropriate model.
Diagnose the model and check assumptions.
Summarise the results.
Interpret and provide inferences.

We’ll skip Step 1 here for convenience, but in your own work it’s the most important step. Every subsequent decision (how to collect data, which model to fit, how to interpret results) depends on having a clear question. Don’t mistake its absence in these workshops for evidence that it doesn’t matter.

Getting Started with R

New to RStudio?

Before we begin the workshop, let’s make sure that you’re set up to work with R and RStudio. If you’re working on your own machine, make sure you have R and RStudio installed. If you don’t, you can download R from CRAN and RStudio from RStudio’s website. On university computers, these should already be installed.

Once you have R and RStudio set up, open RStudio. You should see a window with four main panes:

Source: This is where you write your scripts.
Console: This is where you can type and execute R code directly.
Environment/History: This pane shows you the objects that are currently stored in R’s memory.
Files/Plots/Packages/Help/Viewer: This is a multi-tabbed pane that allows you to navigate files, view plots, manage packages, and access documentation.

Running Code

There are a few ways to run R code:

Directly in the console: You can type R commands directly into the Console pane and press Enter to run them. This is useful for very quick tests and calculations, but avoid it for project work because nothing typed there is saved.
From a script: Scripts are simple text files that contain R code. You can write multiple lines of code in a script and run them together or one by one. This is particularly useful for more complex analyses where you want to keep a record of your code.

To run code from a script, you can:

Click the “Run” button at the top of the Source pane to run the selected line or code chunk.
Press Ctrl + Enter (or Cmd + Enter on a Mac) to run the current line or selected code.

Instruction Orientation

Commenting guide

Good comments are notes about intent, not narration. They’ll save you time when you return to old code, and some jobs will check them too.

Opinions vary (see here for a stern take), but my rule: briefly describe what you’re trying to do, not what each line does.

For example, I would write:

# Get R set up
library(ggplot2)  # For plots
setwd("my/file/path")
my_data <- read.csv("data/sales_data.csv")

# Plot the data
ggplot(my_data, aes(x = country, y = total_sales)) +
  geom_col()

And not:

# Load the ggplot2 package into the R environment so we can create plots later
# What happened to ggplot1?
library(ggplot2)

# Use the read.csv function to read the sales_data.csv file from the data directory
# and store the resulting data frame in a data frame called 'my_data'
# Remember that it's called my_data and not data. Calling it data made things break :(
my_data <- read.csv("data/sales_data.csv")
# my_data <- read.table("data/sales_data.csv") # DONT USE THIS IT DOESNT WORK
# mi_data <- read_csv("data/sales_data.csv") # ALSO DIDN'T WORK - DUNNO WHY

# Use ggplot to create a bar plot of the total sales by country for 2024.
# The x-axis represents the categories, and the y-axis represents the total sales.
ggplot(my_data, aes(x = country, y = total_sales)) + # using ggplot, take my_data, then plot country on the x and sales on the y
  geom_col() # then have the data visualised as bar, using the structure described in the previous line of code
# When I see comments like this IRL, I die a little bit inside. Keep it clean and concise.

Reading those comments is more work than reading the code itself. Think of comments as post-it notes: short signals for sections you might want to revisit, not a line-by-line narration of what the code does.

On using LLMs

LLMs (Large Language Models, or AI) have become common tools in programming and data analysis: good for boring or fiddly tasks, one-off problems, and tidying code (see here for a pragmatic take on this). The catch is that you need to know enough to tell when the output is correct and when it’s plausible-sounding nonsense. That line blurs fast as the code gets more complex. For analysis, without a sound theoretical understanding, they tend to get it wrong 60% of the time if the user can’t specify their data and question properly (see here). To make sure that doesn’t happen to you, you need to understand statistical theory and assumptions.

Using them well is a skill in itself. Make your prompt specific: describe what you want, paste the relevant code, include any error messages, and say you’re working in R (otherwise you’ll get Python). Mention any packages you want to use, or specify base R.

Workshop Instructions

In this workshop, we’re going to load two datasets, starting with alligator bite strength, and carry out Exploratory Data Analysis (EDA) alongside some simple Data Quality Assurance. The aim is to find weird things in the data, things that might indicate mistakes.

Errors in data entry are more common than you’d think. Imagine entering 5,000 nearly identical observations into a spreadsheet. People get bored, lose their place, hit the wrong key. If we don’t catch those mistakes before analysis, they can quietly wreck the results.

The good news: we don’t need complex statistics to check for this, and the R coding is relatively straightforward.

Setting Up Your Project

We’ll use an R Project to organise our work. An R Project is a folder with a small .Rproj file inside it. When you open that file, RStudio automatically sets the working directory to the project folder, with no manual path-setting required, and it works the same on any computer or operating system.

Step 1: Create the folder structure

Create a folder called BI3010 somewhere sensible:

University computers: the H drive works well, e.g. H:\BI3010
Personal computers: anywhere in your Documents folder is fine

Inside BI3010, create two subfolders:

data: all data files for the course go here
scripts: your R scripts, one per workshop

Your folder should look like this:

BI3010/
├── data/
└── scripts/

Step 2: Create the R Project

Open RStudio and go to File > New Project > Existing Directory. Navigate to your BI3010 folder and click Create Project. RStudio will create a BI3010.Rproj file inside the folder and restart with the project open.

From now on, always open RStudio by double-clicking BI3010.Rproj rather than opening RStudio directly. This ensures the working directory is always set correctly.

Step 3: Create a script for this workshop

Go to File > New File > R Script (or Ctrl + Shift + N). Save it immediately into your scripts folder as workshop1.R. To run code, press the Run button or use Ctrl + Enter.

Download the Data

The data file for this section is embedded in this page. Click the button below to download it, then save it to your BI3010/data/ folder before continuing.

Download alligator_bite.txt

Load File into RStudio

Now we can load the data. The .txt extension tells us it’s a plain text file, so we use read.table(). We need two arguments: the filename (including the data/ subfolder path), and header = TRUE to tell R that the first row contains column names rather than data.

Because the project sets the working directory to BI3010/, we can refer to files using paths relative to that folder, with no need to type out the full path.

Tip: ?read.table (or ? followed by any function name) opens the help page.

ali <- read.table("data/alligator_bite.txt", header = TRUE)

The above fits on one line, but you can split it for readability:

ali <- read.table("data/alligator_bite.txt",
                  header = TRUE)

Checking It Worked

If you hit an error, don’t panic. Nothing in R is permanent (unless you very explicitly make it so). Fix the issue, run the code again, move on. The only things that persist are your script and the original data file, so save your script often and comment as you go. Not sure whether removing something will break things? Comment it out with # first rather than deleting it.

If you get an error loading data, common causes are:

The file isn’t in the data folder inside your BI3010 project, or you didn’t open RStudio via BI3010.Rproj.
You may have misspelled one of the words (including using a capital letter instead of lower case).
Your file may be missing a column heading.
The file isn’t in the correct format (e.g., the file was saved as a .csv instead of a .txt).
The quotation marks are in the wrong location.
You may have missed the final bracket ) in the read.table() function.
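
If the load still fails, it can help to ask R where it’s actually looking. The following is a minimal sketch using base R only; it assumes the folder layout described above and doesn’t change anything on disk.

```r
# Where is R currently looking? With the project open, this should end in "BI3010".
getwd()

# What does R see in the data folder? An empty result means the folder is missing or empty.
list.files("data")

# Does the exact path used in read.table() exist? TRUE means the path is fine.
file.exists("data/alligator_bite.txt")
```

If the last line returns FALSE, the problem is the file location or the working directory, not your read.table() call.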

To confirm that the file has been read properly, click the blue button next to the name in the Environment panel (normally in the upper right of RStudio).

To inspect the data, you have a few options. The following shows the first six rows (use n to change how many, e.g. head(ali, n = 10)):

head(ali)

  length    bite
1  408.9 2268.50
2  378.1 2156.24
3  343.5 1883.48
4  337.2 1981.71
5  419.5 2442.91
6  427.5 2331.68

Or open the full dataset in a new tab; View() gives a spreadsheet-like view with sorting and filtering:

View(ali)

Or print everything to the console:

ali

Not recommended: it floods the console and is hard to read.

Question Set One

Question 1. How many variables do we have in this dataset?

1 2 4 6

There are 2 variables. There are several ways to check this:

summary(ali)
ncol(ali)   # Number of columns (i.e., variables) in the ali dataset
str(ali)    # Structure of the ali dataset
head(ali)   # Displays the first few rows
View(ali)   # Opens the dataset in a spreadsheet-like view

Question 2. What are the variable names?

size and force
length and bite
length and pressure
weight and bite

The variable names are bite and length. The functions above will show you this (except ncol(), which only counts columns).

Question 3. Are the variable names capitalised?

No, both are lower case
Yes, both start with a capital letter
Only Length is capitalised
Only Bite is capitalised

No, bite and length are both lower case. A general tip: keep variable names short and all lower case. You will type them a lot, so save yourself the effort. Avoid names like Alligator_Bite_Strength_PSI.

These questions may seem trivial, but honestly, they’re the first questions we ask ourselves when we load a dataset into R for the first time.

Getting to Know the Data

The dataset has 78 alligators from Parker Island Gator Farm, Florida. For each one, bite strength (pounds per square inch, PSI) and nose-to-tail body length (centimetres, one decimal place) were recorded.

Before going further, do a quick search: what’s the bite strength of a typical alligator, and how long are they? EDA means looking for things that seem weird, and you can only spot weird if you know what normal looks like. Note the typical values as a comment in your script.

Now let’s check that the data is sensibly structured: numbers stored as numbers, text as characters, and so on. A value like 2174.94BIGBITE would cause R to treat the entire column as text, which breaks any numerical analysis.

To do this quickly, we can use the str() function (short for “structure”):

str(ali)

'data.frame':   78 obs. of  2 variables:
 $ length: num  409 378 344 337 420 ...
 $ bite  : num  2268 2156 1883 1982 2443 ...

ali is a data.frame with 78 observations and two variables. Both length and bite are num (numeric). All good. The $ in ali$bite is how R pulls a single column out of a dataset; you’ll use it constantly.

Nothing unusual so far.
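
For contrast, here’s a sketch of what str() would report if a value like 2174.94BIGBITE had crept in. The three-row “file” below is invented for illustration (it isn’t part of the alligator data), and the output described assumes R 4.0 or later; older versions would show the column as a Factor rather than chr.

```r
# Hypothetical three-row file: the second bite value has text stuck to it
raw <- "length bite
408.9 2268.50
378.1 2174.94BIGBITE
343.5 1883.48"

toy <- read.table(text = raw, header = TRUE)

str(toy)          # length is still num, but bite is no longer numeric
mean(toy$length)  # works as normal
mean(toy$bite)    # returns NA with a warning, because the column is not numeric
```

One bad entry is enough to change the type of the whole column, which is why str() is worth running before anything else.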

If we wanted to, we could extract the length variable from the dataset by doing:

ali$length

How would you extract the bite variable?

Let’s calculate some summary statistics. The average length:

mean(ali$length)

[1] 373.9897

The average alligator here is 374 cm, longer than wild alligators, but farm animals often are. Compare this with what you found online. Small differences are fine; large ones are worth questioning.

Now calculate the average bite strength:

mean(ali$bite)

[1] 3546.084

Assuming you found that alligator bite strength is roughly 2,000 PSI, our alligators appear to be biting with an extra 1,500 PSI on average. That seems like a lot more. This is something that should make us sit up and pay attention. We don’t know for sure if there’s a problem, but we have reason to believe there might be one. It’s now up to us to prove that there isn’t a problem.

summary() gives a quick overview of the whole dataset. Run it now and see if anything stands out:

summary(ali)

     length            bite         
 Min.   : 153.5   Min.   :   832.9  
 1st Qu.: 316.0   1st Qu.:  1749.5  
 Median : 359.1   Median :  2027.9  
 Mean   : 374.0   Mean   :  3546.1  
 3rd Qu.: 390.8   3rd Qu.:  2222.0  
 Max.   :2293.0   Max.   :123632.0

Keep in mind, with EDA we’re looking for anything weird in the data.

Question Set Two

Question 1. How small is the smallest alligator in the farm for which we have data, and what’s the biggest?

The smallest alligator is 153.5 cm and the largest is 2293.0 cm. Multiple approaches work:

range(ali$length)    # minimum and maximum
min(ali$length)      # minimum
max(ali$length)      # maximum
summary(ali)         # summary for all variables
summary(ali$length)  # summary for length only

Question 2. What are the minimum and maximum values for bite strength?

The minimum bite strength is 832.9 PSI and the maximum is 123,632.0 PSI.

range(ali$bite)

Question 3. Do any of these minimums or maximums seem strange?

Yes, both the maximum length and maximum bite strength look suspicious
Yes, only the maximum length looks suspicious
Yes, only the maximum bite strength looks suspicious
No, all values seem reasonable

Yes. A 2293 cm alligator is longer than a bus. And 123,632 PSI is about 60 times the average bite strength of 2,000 PSI. At least one of these values is almost certainly a data entry error.
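
There’s another clue in the summary() output: the median bite (2027.9) is close to what we expected, while the mean (3546.1) is not. A single extreme value is enough to produce that gap. The sketch below uses five invented bite values to show the effect; they’re not taken from the dataset.

```r
# Hypothetical bite values: four plausible readings plus one missing-decimal error
bites <- c(1900, 2000, 2100, 2200, 123632)

mean(bites)    # 26366.4: dragged far above every plausible reading
median(bites)  # 2100: still sits in the middle of the plausible readings
```

A large gap between the mean and the median is a cheap early warning that something extreme is lurking in a column.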

Data Detective

Hopefully you spotted at least one suspicious value. Let’s track it down.

We know there’s an alligator recorded as 2293 cm long, which is 23 m, longer than a bus. Here’s how to find it.

Square brackets [] are R’s indexing tool: they extract specific rows or columns from a dataframe using the format dataset[row, column]. Omit the row number and you get all rows; omit the column and you get all columns. For example, to see all columns for row 30:

ali[30,]

   length    bite
30  343.4 2026.86
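
The main indexing patterns can be sketched on a tiny invented data frame. The column names mirror ali, but the object toy and its three rows are just for illustration.

```r
# Hypothetical mini dataset with the same column names as ali
toy <- data.frame(length = c(408.9, 378.1, 343.5),
                  bite   = c(2268.50, 2156.24, 1883.48))

toy[2, ]          # row 2, all columns
toy[, "bite"]     # all rows, bite column only (returned as a plain vector)
toy[2, "bite"]    # a single cell: row 2 of bite
toy[c(1, 3), ]    # rows 1 and 3, all columns
```

The pattern is always the same: rows before the comma, columns after it.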

To find our giant alligator, we need a condition: find rows where length equals 2293. In R, “equal to” is == (a single = assigns a value; == tests whether two things match). The result is a TRUE/FALSE for every row:

ali$length == 2293.0

 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[73] FALSE FALSE FALSE FALSE FALSE FALSE

We can use that condition directly inside square brackets to pull out the matching rows:

ali[ali$length == 2293,]

   length   bite
20   2293 123632
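
Reading a wall of TRUE/FALSE by eye gets old quickly. As a sketch, sum() and which() summarise the same condition for you; the data frame below is invented so that the example runs on its own.

```r
# Hypothetical mini dataset: row 3 holds the suspicious value
toy <- data.frame(length = c(408.9, 378.1, 2293.0, 343.5),
                  bite   = c(2268.50, 2156.24, 123632.00, 1883.48))

matches <- toy$length == 2293   # one TRUE/FALSE per row

sum(matches)     # how many rows match (TRUE counts as 1): 1
which(matches)   # which row numbers match: 3
toy[matches, ]   # the matching rows themselves
```

sum() is a handy first check: if it returns 0, the condition matched nothing and the indexing step would return an empty table.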

length is several times what we’d expect (200–400 cm), and bite is about 60 times the typical 2,000 PSI. Neither has a decimal point. Let’s compare against a normal observation:

ali[1, ]

  length   bite
1  408.9 2268.5

The first alligator looks fine. So either giant alligators are among us, or there’s been a data entry mistake.

Question Set Three

Question 1. What may have occurred to cause this apparent data entry mistake?

The data was deliberately falsified
Decimal points were accidentally omitted during data entry
The measurements were recorded in the wrong units
A duplicate row was accidentally added to the dataset

The decimal points were accidentally omitted during data entry. The correct values are 229.3 cm and 1236.32 PSI. If you look at the raw data file, neither value has a decimal point.

A More Refined Detective

The approach above works, but typing 2293 every time is awkward. Better to store the maximum in a variable and reference that instead:

# Get the maximum alligator length
max_length <- max(ali$length)

# Now use this value to find the problematic rows
ali[ali$length == max_length, ]

   length   bite
20   2293 123632

Storing the result means we don’t have to repeat the raw number, and the code reads more clearly.

The same logic works with other conditions, for example, all alligators longer than average:

# Calculate the average alligator length
average_length <- mean(ali$length)

# Extract all rows where the alligator length is greater than the average
ali[ali$length > average_length, ]

   length      bite
1   408.9   2268.50
2   378.1   2156.24
5   419.5   2442.91
6   427.5   2331.68
7   382.1   2186.78
12  378.1   2204.37
13  399.9   2205.96
18  377.3   2026.39
20 2293.0 123632.00
21  418.4   2383.09
23  412.6   2484.12
28  391.1   2251.63
34  422.3   2474.34
40  375.7   2173.30
43  379.7   2148.50
44  468.4   2649.99
48  390.1   2225.24
49  388.3   2149.13
50  466.4   2593.13
57  403.7   2402.71
59  410.0   2265.09
61  445.6   2559.36
62  401.6   2271.19
63  382.3   2212.23
65  448.8   2510.11
68  392.9   2283.28
69  379.5   2267.27
71  417.9   2434.23
77  435.3   2435.05
78  391.0   2051.66

More than one alligator qualifies this time. Our problem observation is still in there.

Let’s go one step further with quantile(). Look back at summary(ali): the 1st Qu. and 3rd Qu. values are the 25th and 75th percentiles. We can retrieve them directly:

quantile(ali$length, probs = c(0.25,0.75))

    25%     75% 
316.025 390.775

probs sets which percentiles to return. c() combines multiple values into a vector; needed here because we’re asking for two at once. For a single percentile, write quantile(ali$length, probs = 0.75) without c().
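
To see how the pieces fit together, here’s a small sketch on an invented vector where the percentiles are easy to verify by eye. The names x and q are placeholders, not objects from the workshop data.

```r
x <- 1:9   # hypothetical values, chosen so the percentiles are easy to check

q <- quantile(x, probs = c(0.25, 0.75))
q                 # 25% is 3, 75% is 7

q["75%"]          # pick one result out by its name
x[x > q["75%"]]   # values above the 75th percentile: 8 and 9
```

The result of quantile() is just a named vector, so it can be stored and reused inside a condition like any other number.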

Question Set Four

Question 1. Using quantile(), extract all rows in the ali dataset where bite values are above the 3rd quartile (the 75th percentile).

ali[ali$bite > quantile(ali$bite, probs = 0.75),]

Question 2. Repeat the above, but for bite values less than or equal to the 33% quantile. If you’re unsure how to specify “less than or equal to”, have a quick search online or use your preferred LLM.

ali[ali$bite <= quantile(ali$bite, probs = 0.33),]

Question 3. Using the tools from above, can you identify any other possible data entry mistakes caused by missing decimal points?

Yes, there’s at least one more observation with missing decimal points
No, row 20 appears to be the only case

Row 20 appears to be the only observation with this problem. The other bite and length values all look plausible.

Correcting Data

So, we’ve identified a problem. What now? Do we delete the observation? Declare that Sarcosuchus are not extinct and are alive and well in Parker Island Gator Farm, Florida, USA (and have actually doubled in size)? Set up a Facebook group to share our concrete and irrefutable evidence of Sarcosuchus and start producing a film to share this wonderful news with the world? Obviously not. That’d be silly.

But we do need to do something. Row 20 of our dataset is clearly a mistake, but correcting it is a really sensitive topic. Part of the reason for that sensitivity is because a non-trivial number of scientists have been found to actually fake their data in order to further their own careers and prestige (e.g. see here, here, and most ironically here for examples).

We want to be better than that. So before touching anything, go back to the source: not the downloaded file, but the original recorded values. Here that means contacting Parker Island Gator Farm to confirm our suspicion that row 20 is simply missing decimal points.

In the interest of time, I’ll confirm that it’s missing decimal points so we can move on to fixing the problem. Specifically, bite should have two decimal places, and length should have one.

We have two options: edit alligator_bite.txt directly and reload, or fix it in R. Do it in R. If you edit the raw file, there’s no record of what changed or why, and that starts to look a lot like the data manipulation we were just reading about. At minimum, you’d have to say: “We had this weird observation, but we’ve totally sorted it out. Don’t worry. Trust us.” The correction in code is transparent; it’s right there for anyone auditing your work.

The fix uses the indexing skills from above. ali$bite[20] points to row 20 of the bite column specifically (no comma needed since we’re already working with a single column).

We can start by replacing row 20 with itself (which does nothing, but makes the pattern clear):

ali$bite[20] <- ali$bite[20]

Now we just divide by 100, which shifts the decimal point two places to the left:

# Resolving data entry issue with missing decimal point
ali$bite[20] <- ali$bite[20] / 100

Using a hardcoded row number like [20] works here, but it’s fragile: if the dataset were ever re-sorted or a row added above this one, row 20 would point to the wrong observation. A safer approach uses the same conditional indexing we’ve already learned, targeting the value itself rather than its position:

ali$bite[ali$bite > 100000] <- ali$bite[ali$bite > 100000] / 100

This reads as: “for any bite value greater than 100,000, divide it by 100.” It will find the right row regardless of where it sits in the dataset.
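
The conditional version has a second advantage that’s easy to miss: it’s safe to run more than once. The sketch below uses invented values to compare the two approaches when the same correction line is accidentally run twice.

```r
toy <- data.frame(bite = c(2268.50, 123632.00, 1883.48))  # hypothetical values

# Run the conditional fix twice, as if the line had been re-run by accident
toy$bite[toy$bite > 100000] <- toy$bite[toy$bite > 100000] / 100
toy$bite[toy$bite > 100000] <- toy$bite[toy$bite > 100000] / 100

toy$bite   # 2268.50 1236.32 1883.48: the second run found nothing to change

# The positional version has no such guard: each run divides again
pos <- c(2268.50, 123632.00, 1883.48)
pos[2] <- pos[2] / 100
pos[2] <- pos[2] / 100
pos[2]     # 12.3632: now wrong in the other direction
```

The 100,000 threshold is a judgment call for this dataset; pick a cut-off that only the erroneous value can satisfy.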

Question Set Five

Question 1. Resolve the data entry mistake for ali$length. Be sure to add a clear comment indicating that you have made this change, so that if anyone reviews your code, they understand the reason for the modification and don’t suspect any foul play.

# Correcting missing decimal point, confirmed by Parker Island Gator Farm on 1/1/2020
# Row index approach (works here, but fragile if the dataset changes):
ali$length[20] <- ali$length[20] / 10

# More robust: target the value directly using a condition
ali$length[ali$length > 2000] <- ali$length[ali$length > 2000] / 10

Question 2. Having made any changes to your data, it’s always worthwhile checking it again to make sure you didn’t create a new mistake. Do so now using any techniques you feel are appropriate.

summary(ali)
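
If you’d like the check to be explicit rather than done by eye, one option is stopifnot(). This is only a sketch: toy stands in for the corrected data, and the limits are assumptions chosen for illustration, not official cut-offs.

```r
# Hypothetical stand-in for the corrected ali data
toy <- data.frame(length = c(408.9, 229.3, 343.5),
                  bite   = c(2268.50, 1236.32, 1883.48))

range(toy$length)   # smallest and largest length
range(toy$bite)     # smallest and largest bite

# stopifnot() stays silent when every condition is TRUE and stops with an error otherwise
stopifnot(
  all(toy$length > 0 & toy$length < 1000),
  all(toy$bite   > 0 & toy$bite   < 10000)
)
```

Silence is the good outcome here: if a correction went wrong, the script stops at this line instead of carrying the mistake forward.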

Do it for Real

Download the data file below and save it to your BI3010/data/ folder.

Download animal_speed.txt

The goal is the same as before: find all the data entry mistakes. This dataset is larger and the errors are harder to spot, so you’ll need to be more systematic. A few hints:

unique() lists every distinct value in a column, useful for spotting typos in text columns.
mean() returns NA if any values are missing. Add na.rm = TRUE to ignore them: mean(variable, na.rm = TRUE).
There are five mistakes in total.
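
As a quick sketch of the first two hints, using invented values (the typo below is made up and isn’t one of the five mistakes):

```r
# Hypothetical text column with one typo
species <- c("Cheetah", "Cheetah", "Cheeta", "Saiga Antelope")
unique(species)   # the misspelt "Cheeta" shows up as its own entry
table(species)    # counts per spelling; a count of 1 is a hint that something is off

# Hypothetical numeric column with one missing value
speed <- c(90, NA, 110)
mean(speed)                 # NA
mean(speed, na.rm = TRUE)   # 100
```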

Start with an EDA as before. Can you find them all?

Information about the dataset

Variable	Description
`species`	One of three species: Peregrine Falcon, Saiga Antelope, or Cheetah
`weight`	Body weight in grams
`speed`	Speed over a 50 m distance, in km/h

A total of 500 individuals were measured, but these weren’t equally divided between the three species.
Multiple people participated in data entry, and it’s believed that not all were confident in how to do this.
Physical copies of the data are available should any data need to be verified (contact one of the teaching staff if you need a data entry confirmed).

fast <- read.table("data/animal_speed.txt", header = TRUE)

# Start your exploration here
summary(fast)
str(fast)
unique(fast$species)

Mistake 1. One observation has an impossible weight value. Find it, identify which row it is in, and write the code to correct it. (Hint: look at the minimum weight value.)

One individual has a recorded weight of 0 g, which is impossible. After confirming with the data provider, the correct weight is 45,514 g.

fast[fast$weight == 0,]
fast$weight[fast$weight == 0] <- 45514

Mistake 2. There’s a second unusual weight in this dataset. Find it, identify which row it is in, and write the code to correct it.

One individual has a recorded weight of 3 g. The smallest of the three species (Peregrine Falcon) typically weighs around 400 to 1,300 g, so 3 g is clearly wrong. The correct weight is 47,255 g.

fast[fast$weight == 3,]
fast$weight[fast$weight == 3] <- 47255

Mistake 3. One of the species names has been entered incorrectly. Use unique() on the species column to find the problem.

One entry reads “Peregrin Took” instead of “Peregrine Falcon”. This is almost certainly a typo (with a Lord of the Rings flavour).

unique(fast$species)
fast$species[fast$species == "Peregrin Took"] <- "Peregrine Falcon"
fast$species <- factor(fast$species)

Mistake 4. One observation has a speed value that is physically impossible for any living thing. Find it and correct it.

One Peregrine Falcon has a recorded speed of 35,711 km/h. The fastest recorded speed for a Peregrine Falcon is around 390 km/h. Dividing by 100 gives 357.11 km/h, which is plausible, so this is another missing decimal point error. Note that the NA in the speed column makes this tricky: max() needs na.rm = TRUE, and a plain == comparison returns NA for the missing row, which shows up as an extra row of NAs unless the condition is wrapped in which().

fast[which(fast$speed == max(fast$speed, na.rm = TRUE)), ]

# Row index approach (fragile: only reliable if the dataset has not changed):
fast$speed[15] <- fast$speed[15] / 100

# More robust: target the value directly
fast$speed[fast$speed > 10000 & !is.na(fast$speed)] <- fast$speed[fast$speed > 10000 & !is.na(fast$speed)] / 100

Mistake 5. There’s one remaining issue in the dataset. What is it, where is it, and how should it be handled?

One observation has a missing value (NA) in the speed column. NA means “not available” and is R’s standard way to represent missing data. The data provider can’t confirm what the true value should be, so the only honest option is to leave it as NA. Functions like mean() will return NA when any value is missing unless you add na.rm = TRUE.
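
To locate missing values, is.na() is the tool to reach for. The sketch below uses an invented speed vector; it also shows why conditions on a column containing NA can return an unexpected NA entry.

```r
# Hypothetical speeds with one missing value
speed <- c(95.2, NA, 357.11, 80.4)

is.na(speed)          # FALSE TRUE FALSE FALSE
sum(is.na(speed))     # number of missing values: 1
which(is.na(speed))   # position of the missing value: 2

# A comparison involving NA is itself NA, and logical indexing keeps it as an NA entry
speed[speed > 300]          # NA 357.11
# which() drops the NA and keeps only the positions that are TRUE
speed[which(speed > 300)]   # 357.11
```

Note that speed == NA doesn’t work for finding missing values: it returns NA for every element, which is why is.na() exists.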

After all corrections, run a final check:

summary(fast)

Question Set Six

Warning

Make sure you have corrected all five data entry errors before answering these questions - the answers depend on the corrected dataset.

Question 1. What is the median weight in this dataset?

median(fast$weight)

Question 2. What is the median speed in this dataset?

median(fast$speed, na.rm = TRUE)

Question 3. Which species is most common in this dataset?

The table() function gives a count per species:

table(fast$species)
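
With only three species you can read the answer straight off the table, but with many groups it helps to sort it. A minimal sketch on an invented vector follows; the result says nothing about the real answer to Question 3.

```r
# Hypothetical species labels, not the real data
species <- c("Cheetah", "Saiga Antelope", "Cheetah", "Peregrine Falcon", "Cheetah")

counts <- table(species)
sort(counts, decreasing = TRUE)   # largest group first
names(which.max(counts))          # name of the most common group: "Cheetah"
```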

Chupacabras, Aliens and Beer

The final dataset is chupa.txt. We’ll work through it together in the last 30 minutes, but start early if you’re ahead. It contains the following variables:

year: The year in which data was collected
city: The city where the data was collected
alc: The average alcohol consumption in that city, in that year
ufo: The number of unidentified flying objects (i.e. aliens) reported in that city, in that year
belief: The most common response from a selection of members of the public when asked if they believe in the supernatural, in that city, in that year
chupa: The growth rate of chupacabra sightings from the previous year to the current year, in that city (where a value of zero means the number of sightings is constant, -0.5 means the number of sightings has declined by 50%, and 0.5 means they have increased by 50%).

There are data entry errors in this dataset. Find them all using EDA.

Download the data file below and save it to your BI3010/data/ folder.

Download chupa.txt

As a group, we’ll look at snippets of code and identify the errors.

chupa <- read.table("data/chupa.txt", header = TRUE)

head(chupa)
str(chupa)
unique(chupa$city)
summary(chupa)
unique(chupa$belief)

Find and fix all data entry errors in the chupa dataset. Describe each error and write the correction code.

Error 1 - Two spellings of Amarillo
One row has the city entered as “amarillo” (lower case a) while all other rows use “Amarillo”. Use unique(chupa$city) to spot this, then check the rows affected with indexing.

Error 2 - Placeholder value in ufo (9999)
The ufo column contains one value of 9999, which appears to be a placeholder for missing data rather than a real count. A histogram of ufo makes this obvious. This value should be replaced with NA.

Error 3 - Negative value in alc
Alcohol consumption can’t be negative. There’s one observation with a value of -5 in the alc column. A histogram will reveal it as a clear outlier on the left. The correct value is 5. Note: negative values in the chupa column are fine and expected, since it’s a growth rate.
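
The three corrections could be sketched as follows. To keep the example runnable on its own, it uses a small invented data frame called toy with the same column names as the affected chupa columns; the values are made up, and in your script you’d apply the same lines to chupa itself.

```r
# Hypothetical stand-in for chupa: real column names, invented values
toy <- data.frame(city = c("Amarillo", "amarillo", "Houston", "Houston"),
                  ufo  = c(12, 9999, 7, 9),
                  alc  = c(6.1, 4.8, -5, 5.5))

# Error 1: make the city spelling consistent
toy$city[toy$city == "amarillo"] <- "Amarillo"

# Error 2: 9999 is a placeholder for missing data, so record it as NA
toy$ufo[toy$ufo == 9999] <- NA

# Error 3: sign error in alc; the confirmed value is 5
toy$alc[toy$alc == -5] <- 5

unique(toy$city)   # "Amarillo" "Houston"
summary(toy)       # ufo now reports one NA; alc has no negative values
```

For the histograms mentioned above, hist(chupa$ufo) and hist(chupa$alc) are the base R calls. As with the alligator data, each correction targets the value rather than a row number, and each deserves a comment saying why it was made.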

Coding Puzzles

In the following snippets of code, can you identify the error? If you’re not sure, try running them and see what the error says.

Puzzle 1.

summary(alc)

# We have not specified that this is in the chupa dataset
summary(chupa$alc)

Puzzle 2.

chupa[chupa$ufo==9999]

# Without the comma, R treats the condition as a column selection; the comma makes it a row condition
chupa[chupa$ufo == 9999, ]

Puzzle 3.

chupa[chupa$ufo = 9999, ]

# = assigns a value; use == for a conditional test
chupa[chupa$ufo == 9999, ]

Puzzle 4.

chupa[chupa$city = Houston,]

# Houston needs to be in quotation marks (it's a string, not an object)
# Also = should be ==
chupa[chupa$city == "Houston", ]

Puzzle 5.

ali[ali$Length > 200, ]

# Names are case sensitive. Should be `length`, not `Length`
ali[ali$length > 200, ]

Puzzle 6.

ali[ali$bite > mean(ali$bite), ]

# If bite contains NAs, mean() returns NA and the comparison is NA for every row - include na.rm = TRUE
ali[ali$bite > mean(ali$bite, na.rm = TRUE), ]

Glossary / 词汇表

Workflow

Exploratory Data Analysis (EDA) 探索性数据分析

An initial investigation of a dataset using summaries and plots to understand its structure, spot patterns, and identify potential problems before modelling.

在建模之前，通过统计摘要和图表初步了解数据结构、发现规律并识别潜在问题的过程。

Data entry error 数据录入错误

A mistake introduced when recording or entering data (e.g. a typo, wrong units, a placeholder value left in). EDA is the main tool for catching these.

在记录或输入数据时引入的错误（如错别字、单位错误、遗留的占位符值）。探索性数据分析是发现此类错误的主要方法。

Data structure

Data frame 数据框

R's standard table structure: rows are observations, columns are variables. Most datasets you work with in R will be data frames.

R 中最常用的表格结构：行为观测值，列为变量。大多数在 R 中处理的数据集都是数据框。

Observation 观测值

A single row in a dataset, representing one instance of data collection (e.g. one animal, one survey response).

数据集中的一行，代表一次数据采集的记录（如一只动物、一份问卷回答）。

Variable 变量

A column in a dataset. Each variable holds one type of measurement or attribute recorded for every observation.

数据集中的一列，记录每个观测值的某种测量或属性。

Data types

Numeric 数值型

A variable type that holds numbers (e.g. length, weight, temperature). R calls these num or int in str() output.

存储数字的变量类型（如长度、重量、温度）。在 R 的 str() 输出中显示为 num 或 int。

Character 字符型

A variable type that holds text (e.g. species names, city names). R calls these chr in str() output.

存储文本的变量类型（如物种名称、城市名称）。在 R 的 str() 输出中显示为 chr。

Factor 因子

A categorical variable with a fixed set of levels (e.g. sex: male/female). R stores factors efficiently and uses them correctly in models.

具有固定类别水平的分类变量（如性别：雄性/雌性）。R 以因子形式高效存储并在模型中正确使用。

Summary statistics

Mean 均值

The arithmetic average: sum of all values divided by the number of values. Sensitive to extreme outliers.

算术平均数：所有值的总和除以值的个数。对极端离群值较为敏感。

Median 中位数

The middle value when data are sorted. More robust than the mean when outliers are present.

数据排序后的中间值。在存在离群值时比均值更稳健。

Quantile 分位数

A cut point that divides data into equal-sized groups. The 25th and 75th percentiles (Q1 and Q3) bracket the middle half of the data.

将数据分成等份的分界点。第25和第75百分位数（Q1和Q3）框定了数据的中间一半。

Range 极差

The difference between the maximum and minimum values in a variable. A quick check for implausible values. Note that R’s range() returns the minimum and maximum themselves rather than their difference.

变量中最大值与最小值之差。可快速检查是否存在不合理的值。

Missing and unusual values

NA 缺失值

R's symbol for a missing value. Functions like mean() return NA if any value is missing unless you add na.rm = TRUE.

R 中缺失值的符号。若数据中存在 NA，mean() 等函数会返回 NA，除非添加 na.rm = TRUE。

Outlier 离群值

A value that sits far from the rest of the data. May be a genuine extreme observation or a data entry error.

远离其他数据点的值。可能是真实的极端观测值，也可能是数据录入错误。

R programming

Script 脚本

A plain text file (usually .R) containing a sequence of R commands. Run it top to bottom to reproduce an analysis exactly.

包含一系列 R 命令的纯文本文件（通常为 .R）。从上到下运行即可完整重现分析过程。

Function 函数

A named operation in R (e.g. mean(), summary()). You pass inputs inside the parentheses and get an output back.

R 中的命名操作（如 mean()、summary()）。在括号内传入输入值，函数返回输出结果。

Argument 参数

An input you provide to a function, e.g. na.rm = TRUE in mean(x, na.rm = TRUE). Arguments control how the function behaves.

传递给函数的输入值，例如 mean(x, na.rm = TRUE) 中的 na.rm = TRUE。参数控制函数的行为方式。

Indexing 索引

Selecting specific rows or columns from a data frame using square brackets: df[rows, columns]. Leave a slot blank to mean "all".

使用方括号从数据框中选取特定的行或列：df[行, 列]。留空表示"全部"。

Conditional statement 条件语句

A logical test that returns TRUE or FALSE (e.g. x > 5). Used inside square brackets to filter rows that meet a condition.

返回 TRUE 或 FALSE 的逻辑判断（如 x > 5）。用于方括号内筛选满足条件的行。

Working directory 工作目录

The folder R looks in by default when reading or writing files. Set it with setwd() or via the RStudio menu.

R 读写文件时默认查找的文件夹。可通过 setwd() 或 RStudio 菜单设置。