# Get R set up
library(ggplot2) # For plots
setwd("my/file/path")
my_data <- read.csv("data/sales_data.csv")
# Plot the data
ggplot(my_data, aes(x = country, y = total_sales)) +
  geom_col()
Workshop 1: Getting Back into R
BI3010 Statistics for Biologists
Welcome to the first BI3010 workshop. Over the next ten workshop sessions, you’ll go from working with raw data to fitting complex statistical models, building up a workflow you can apply to almost any dataset you encounter, including the data from your honours project next year. Today’s workshop focuses on getting comfortable with R and learning to spot problems in data before analysis begins, a step that sounds unglamorous but is often said to take up most of a scientist’s time (80% is the figure usually quoted).
Learning Objectives
- (Re)familiarise yourself with the basics of R;
- Use R to perform simple descriptive summaries of data and data processing;
- Use the results of the EDA (Exploratory Data Analysis) to identify any issues in the data;
- Begin thinking about the assumptions we make with a specific dataset even before we begin using statistics.
These skills apply every time you pick up a new dataset and will come up throughout the course.
Workshop structure
These workshops follow an analytical workflow built around linear models, but the same steps apply to almost any analysis: you’re learning how to go from a raw dataset to an evidenced answer.
The general workflow we’ll be using can be broken down into:
1. Formulate a research question.
2. Perform exploratory data analysis [Focus of today].
3. Identify any hidden assumptions.
4. Fit an appropriate model.
5. Diagnose the model and check assumptions.
6. Summarise the results.
7. Interpret and provide inferences.
We’ll skip Step 1 here for convenience, but in your own work it’s the most important step. Every subsequent decision (how to collect data, which model to fit, how to interpret results) depends on having a clear question. Don’t mistake its absence in these workshops for evidence that it doesn’t matter.
Getting Started with R
Before we begin the workshop, let’s make sure that you’re set up to work with R and RStudio. If you’re working on your own machine, make sure you have R and RStudio installed. If you don’t, you can download R from CRAN and RStudio from RStudio’s website. On university computers, these should already be installed.
Once you have R and RStudio set up, open RStudio. You should see a window with four main panes:
- Source: This is where you write your scripts.
- Console: This is where you can type and execute R code directly.
- Environment/History: This pane shows you the objects that are currently stored in R’s memory.
- Files/Plots/Packages/Help/Viewer: This is a multi-tabbed pane that allows you to navigate files, view plots, manage packages, and access documentation.
Running Code
There are a few ways to run R code:
- Directly in the console: You can type R commands directly into the Console pane and press Enter to run them. This is useful for very quick tests and calculations, but avoid using it for project work (none of it is saved).
- From a script: Scripts are simple text files that contain R code. You can write multiple lines of code in a script and run them together or one by one. This is particularly useful for more complex analyses where you want to keep a record of your code.
To run code from a script, you can:
- Click the “Run” button at the top of the Source pane to run the selected line or code chunk.
- Press Ctrl + Enter (or Cmd + Enter on a Mac) to run the current line or selected code.
Instruction Orientation
Commenting guide
Good comments are notes about intent, not narration. They’ll save you time when you return to old code, and some jobs will check them too.
Opinions vary (see here for a stern take), but my rule: briefly describe what you’re trying to do, not what each line does.
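For example, I would write comments like the two in the short snippet at the very top of this page (# Get R set up and # Plot the data).
And not: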
# Load the ggplot2 package into the R environment so we can create plots later
# What happened to ggplot1?
library(ggplot2)
# Use the read.csv function to read the sales_data.csv file from the data directory
# and store the resulting data frame in a data frame called 'my_data'
# Remember that it's called my_data and not data. Calling it data made things break :(
my_data <- read.csv("data/sales_data.csv")
# my_data <- read.table("data/sales_data.csv") # DONT USE THIS IT DOESNT WORK
# mi_data <- read_csv("data/sales_data.csv") # ALSO DIDN'T WORK - DUNNO WHY
# Use ggplot to create a bar plot of the total sales by country for 2024.
# The x-axis represents the categories, and the y-axis represents the total sales.
ggplot(my_data, aes(x = country, y = total_sales)) + # using ggplot, take my_data, then plot country on the x and sales on the y
  geom_col() # then have the data visualised as bar, using the structure described in the previous line of code
# When I see comments like this IRL, I die a little bit inside. Keep it clean and concise.
Reading those comments is more work than reading the code itself. Think of comments as post-it notes: short signals for sections you might want to revisit, not a line-by-line narration of what the code does.
On using LLMs
LLMs (Large Language Models, or AI) have become common tools in programming and data analysis: good for boring or fiddly tasks, one-off problems, and tidying code (see here for a pragmatic take on this). The catch is that you need to know enough to tell when the output is correct and when it’s plausible-sounding nonsense. That line blurs fast as the code gets more complex. For statistical analysis in particular, they get it wrong around 60% of the time when the user lacks the theoretical understanding to specify their data and question properly (see here). To avoid that, you need to understand statistical theory and assumptions.
Using them well is a skill in itself. Make your prompt specific: describe what you want, paste the relevant code, include any error messages, and say you’re working in R (otherwise you’ll get Python). Mention any packages you want to use, or specify base R.
Workshop Instructions
In this workshop, we’re going to load two datasets, starting with alligator bite strength, and carry out Exploratory Data Analysis (EDA) alongside some simple Data Quality Assurance. The aim is to find weird things in the data, things that might indicate mistakes.
Errors in data entry are more common than you’d think. Imagine entering 5,000 nearly identical observations into a spreadsheet. People get bored, lose their place, hit the wrong key. If we don’t catch those mistakes before analysis, they can quietly wreck the results.
The good news: we don’t need complex statistics to check for this, and the R coding is relatively straightforward.
Setting Up Your Project
We’ll use an R Project to organise our work. An R Project is a folder with a small .Rproj file inside it. When you open that file, RStudio automatically sets the working directory to the project folder, with no manual path-setting required, and it works the same on any computer or operating system.
Step 1: Create the folder structure
Create a folder called BI3010 somewhere sensible:
- University computers: the H drive works well, e.g. H:\BI3010
- Personal computers: anywhere in your Documents folder is fine
Inside BI3010, create two subfolders:
- data: all data files for the course go here
- scripts: your R scripts, one per workshop
Your folder should look like this:
BI3010/
├── data/
└── scripts/
Step 2: Create the R Project
Open RStudio and go to File > New Project > Existing Directory. Navigate to your BI3010 folder and click Create Project. RStudio will create a BI3010.Rproj file inside the folder and restart with the project open.
From now on, always open RStudio by double-clicking BI3010.Rproj rather than opening RStudio directly. This ensures the working directory is always set correctly.
Step 3: Create a script for this workshop
Go to File > New File > R Script (or Ctrl + Shift + N). Save it immediately into your scripts folder as workshop1.R. To run code, press the Run button or use Ctrl + Enter.
Download the Data
The data file for this section is embedded in this page. Click the button below to download it, then save it to your BI3010/data/ folder before continuing.
Load File into RStudio
Now we can load the data. The .txt extension tells us it’s a plain text file, so we use read.table(). We need two arguments: the filename (including the data/ subfolder path), and header = TRUE to tell R that the first row contains column names rather than data.
Because the project sets the working directory to BI3010/, we can refer to files using paths relative to that folder, with no need to type out the full path.
Tip: ?read.table (or ? followed by any function name) opens the help page.
ali <- read.table("data/alligator_bite.txt", header = TRUE)
The above fits on one line, but you can split it for readability:
ali <- read.table("data/alligator_bite.txt",
                  header = TRUE)
Checking It Worked
If you hit an error, don’t panic. Nothing in R is permanent (unless you very explicitly make it so). Run the code again, fix the issue, move on. The only things that are permanent are your script and your original data, so save your script often and comment as you go. Not sure whether removing something will break things? Comment it out with # first rather than deleting it.
If you get an error loading data, common causes are:
- The file isn’t in the data folder inside your BI3010 project, or you didn’t open RStudio via BI3010.Rproj.
- You may have misspelled one of the words (including using a capital letter instead of lower case).
- Your file may be missing a column heading.
- The file isn’t in the correct format (e.g., the file was saved as a .csv instead of a .txt).
- The quotation marks are in the wrong location.
- You may have missed the final bracket ) in the read.table() function.
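Most of the causes above can be ruled out from the console before you rerun read.table(). The following is a minimal sketch, assuming the folder layout from Step 1 and the file name used above; all three functions are part of base R.

```r
# Where does R think it is? This should end in "BI3010" if you opened the project file
getwd()

# Is the file where we expect it to be? Returns TRUE or FALSE
file.exists("data/alligator_bite.txt")

# What is actually in the data folder? Handy for spotting misspellings or a .csv extension
list.files("data")
```

If file.exists() returns FALSE, the problem is the location or spelling of the file, not your read.table() call.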
To confirm that the file has been read properly, click the blue button next to the name in the Environment panel (normally in the upper right of RStudio).
To inspect the data, you have a few options. The following shows the first six rows (use n to change how many, e.g. head(ali, n = 10)):
head(ali)
  length    bite
1 408.9 2268.50
2 378.1 2156.24
3 343.5 1883.48
4 337.2 1981.71
5 419.5 2442.91
6 427.5 2331.68
Or open the full dataset in a new tab; View() gives a spreadsheet-like view with sorting and filtering:
View(ali)
Or print everything to the console:
ali
Not recommended: it floods the console and is hard to read.
Question Set One
Question 1. How many variables do we have in this dataset?
There are 2 variables. There are several ways to check this:
summary(ali)
ncol(ali) # Number of columns (i.e., variables) in the ali dataset
str(ali) # Structure of the ali dataset
head(ali) # Displays the first few rows
View(ali) # Opens the dataset in a spreadsheet-like view
Question 2. What are the variable names?
The variable names are bite and length. The functions above will show you this (except ncol(), which only counts columns).
Question 3. Are the variable names capitalised?
No, bite and length are both lower case. A general tip: keep variable names short and all lower case. You will type them a lot, so save yourself the effort. Avoid names like Alligator_Bite_Strength_PSI.
These questions may seem trivial, but honestly, they’re the first questions we ask ourselves when we load a dataset into R for the first time.
Getting to Know the Data
The dataset has 78 alligators from Parker Island Gator Farm, Florida. For each one, bite strength (pounds per square inch) and body length nose-to-tail (centimetres, one decimal place) were recorded.
Before going further, do a quick search: what’s the bite strength of a typical alligator, and how long are they? EDA means looking for things that seem weird, and you can only spot weird if you know what normal looks like. Note the typical values as a comment in your script.
Now let’s check that the data is sensibly structured: numbers stored as numbers, text as characters, and so on. A value like 2174.94BIGBITE would cause R to treat the entire column as text, which breaks any numerical analysis.
To do this quickly, we can use the str() function (short for “structure”):
str(ali)
'data.frame': 78 obs. of 2 variables:
$ length: num 409 378 344 337 420 ...
$ bite : num 2268 2156 1883 1982 2443 ...
ali is a data.frame with 78 observations and two variables. Both length and bite are num (numeric). All good. The $ in ali$bite is how R pulls a single column out of a dataset; you’ll use it constantly.
Nothing unusual so far.
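To see what a problem would have looked like, here is a small sketch using a made-up three-row file typed straight into the script. The names bad_text, bad and bite_num are illustrative only, and the text = argument of read.table() reads from a string instead of a file.

```r
# Hypothetical file contents (not the real alligator data), with one corrupted bite value
bad_text <- "length bite
408.9 2268.50
378.1 2174.94BIGBITE
343.5 1883.48"

bad <- read.table(text = bad_text, header = TRUE, stringsAsFactors = FALSE)
str(bad)                      # bite is now chr (text), not num

# Converting to numeric turns anything unparseable into NA (normally with a warning)
bite_num <- suppressWarnings(as.numeric(bad$bite))
which(is.na(bite_num))        # position of the offending entry
bad$bite[is.na(bite_num)]     # the offending text itself
```

One stray entry is enough to change the type of the whole column, which is why str() is worth running on every new dataset.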
If we wanted to, we could extract the length variable from the dataset by doing:
ali$length
How would you extract the bite variable?
Let’s calculate some summary statistics. The average length:
mean(ali$length)
[1] 373.9897
The average alligator here is 374 cm, longer than wild alligators, but farm animals often are. Compare this with what you found online. Small differences are fine; large ones are worth questioning.
Now calculate the average bite strength:
mean(ali$bite)
[1] 3546.084
Assuming you found that alligator bite strength is roughly 2,000 PSI, our alligators appear to be biting with an extra 1,500 PSI on average. That seems like a lot more. This is something that should make us sit up and pay attention. We don’t know for sure if there’s a problem, but we have reason to believe there might be one. It’s now up to us to prove that there isn’t a problem.
summary() gives a quick overview of the whole dataset. Run it now and see if anything stands out:
summary(ali)
     length            bite
Min. : 153.5 Min. : 832.9
1st Qu.: 316.0 1st Qu.: 1749.5
Median : 359.1 Median : 2027.9
Mean : 374.0 Mean : 3546.1
3rd Qu.: 390.8 3rd Qu.: 2222.0
Max. :2293.0 Max. :123632.0
Keep in mind, with EDA we’re looking for anything weird in the data.
Question Set Two
Question 1. How small is the smallest alligator in the farm for which we have data, and what’s the biggest?
The smallest alligator is 153.5 cm and the largest is 2293.0 cm. Multiple approaches work:
range(ali$length) # minimum and maximum
min(ali$length) # minimum
max(ali$length) # maximum
summary(ali) # summary for all variables
summary(ali$length) # summary for length only
Question 2. What are the minimum and maximum values for bite strength?
The minimum bite strength is 832.9 PSI and the maximum is 123,632.0 PSI.
range(ali$bite)
Question 3. Do any of these minimums or maximums seem strange?
Yes. A 2293 cm alligator is longer than a bus. And 123,632 PSI is about 60 times the average bite strength of 2,000 PSI. At least one of these values is almost certainly a data entry error.
Data Detective
Hopefully you spotted at least one suspicious value. Let’s track it down.
We know there’s an alligator recorded as 2293 cm long, which is 23 m, longer than a bus. Here’s how to find it.
Square brackets [] are R’s indexing tool: they extract specific rows or columns from a dataframe using the format dataset[row, column]. Omit the row number and you get all rows; omit the column and you get all columns. For example, to see all columns for row 30:
ali[30,]
   length    bite
30 343.4 2026.86
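The dataset[row, column] pattern is easier to remember once you have tried each variation. Here is a short sketch on a made-up three-row data frame (toy is an illustrative name, and the values are simply copied from the first rows shown earlier):

```r
# A small data frame to practise indexing on
toy <- data.frame(length = c(408.9, 378.1, 343.5),
                  bite   = c(2268.50, 2156.24, 1883.48))

toy[2, ]          # row 2, all columns
toy[, 2]          # all rows, column 2 (returned as a plain vector)
toy[2, 2]         # row 2, column 2: a single value
toy[2, "bite"]    # the same value, selecting the column by name
toy[c(1, 3), ]    # rows 1 and 3, all columns
```

Selecting columns by name is usually safer than by number, because the code still works if the column order changes.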
To find our giant alligator, we need a condition: find rows where length equals 2293. In R, “equal to” is == (a single = assigns a value; == tests whether two things match). The result is a TRUE/FALSE for every row:
ali$length == 2293.0
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[73] FALSE FALSE FALSE FALSE FALSE FALSE
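Scanning a wall of FALSE values for the one TRUE is tedious. As a small aside, base R can summarise a TRUE/FALSE vector for you. The sketch below uses four made-up lengths; len and is_giant are illustrative names.

```r
# Made-up lengths with one suspicious value (not the real data)
len <- c(408.9, 378.1, 2293.0, 337.2)

is_giant <- len == 2293   # one TRUE/FALSE per value
which(is_giant)           # positions of the TRUE values
sum(is_giant)             # how many TRUEs there are (TRUE counts as 1)
any(is_giant)             # is there at least one match?
```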
We can use that condition directly inside square brackets to pull out the matching rows:
ali[ali$length == 2293,]
   length   bite
20 2293 123632
length is about 10 times what we’d expect (200–400 cm), and bite is about 60 times the median of 2,027.9 PSI. Neither has a decimal point. Let’s compare against a normal observation:
ali[1, ]
  length   bite
1 408.9 2268.5
The first alligator looks fine. So either giant alligators are among us, or there’s been a data entry mistake.
Question Set Three
Question 1. What may have occurred to cause this apparent data entry mistake?
The decimal points were accidentally omitted during data entry. The correct values are 229.3 cm and 1236.32 PSI. If you look at the raw data file, neither value has a decimal point.
A More Refined Detective
The approach above works, but typing 2293 every time is awkward. Better to store the maximum in a variable and reference that instead:
# Get the maximum alligator length
max_length <- max(ali$length)
# Now use this value to find the problematic rows
ali[ali$length == max_length, ]
   length   bite
20 2293 123632
Storing the result means we don’t have to repeat the raw number, and the code reads more clearly.
The same logic works with other conditions, for example, all alligators longer than average:
# Calculate the average alligator length
average_length <- mean(ali$length)
# Extract all rows where the alligator length is greater than the average
ali[ali$length > average_length, ]
   length      bite
1 408.9 2268.50
2 378.1 2156.24
5 419.5 2442.91
6 427.5 2331.68
7 382.1 2186.78
12 378.1 2204.37
13 399.9 2205.96
18 377.3 2026.39
20 2293.0 123632.00
21 418.4 2383.09
23 412.6 2484.12
28 391.1 2251.63
34 422.3 2474.34
40 375.7 2173.30
43 379.7 2148.50
44 468.4 2649.99
48 390.1 2225.24
49 388.3 2149.13
50 466.4 2593.13
57 403.7 2402.71
59 410.0 2265.09
61 445.6 2559.36
62 401.6 2271.19
63 382.3 2212.23
65 448.8 2510.11
68 392.9 2283.28
69 379.5 2267.27
71 417.9 2434.23
77 435.3 2435.05
78 391.0 2051.66
More than one alligator qualifies this time. Our problem observation is still in there.
Let’s go one step further with quantile(). Look back at summary(ali): the 1st Qu. and 3rd Qu. values are the 25th and 75th percentiles. We can retrieve them directly:
quantile(ali$length, probs = c(0.25,0.75))
    25%     75%
316.025 390.775
probs sets which percentiles to return. c() combines multiple values into a vector; needed here because we’re asking for two at once. For a single percentile, write quantile(ali$length, probs = 0.75) without c().
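If probs still feels abstract, try it on numbers where you can check the answer by eye. This is a minimal sketch with eleven made-up values (x is an illustrative name):

```r
# 0 to 100 in steps of 10, so the percentiles are easy to sanity-check
x <- seq(0, 100, by = 10)

quantile(x, probs = 0.75)               # a single percentile, no c() needed
quantile(x, probs = c(0.25, 0.75))      # two at once
quantile(x, probs = c(0.1, 0.5, 0.9))   # any number of percentiles
```

The middle value (probs = 0.5) is the median, so quantile(x, probs = 0.5) and median(x) agree.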
Question Set Four
Question 1. Using quantile(), extract all rows in the ali dataset where bite values are above the 3rd quartile.
ali[ali$bite > quantile(ali$bite, probs = 0.75),]
Question 2. Repeat the above, but for bite values less than or equal to the 33% quantile. If you’re unsure how to specify “less than or equal to”, have a quick search online or use your preferred LLM.
ali[ali$bite <= quantile(ali$bite, probs = 0.33),]
Question 3. Using the tools from above, can you identify any other possible data entry mistakes caused by missing decimal points?
Row 20 appears to be the only observation with this problem. The other bite and length values all look plausible.
Correcting Data
So, we’ve identified a problem. What now? Do we delete the observation? Declare that Sarcosuchus are not extinct and are alive and well in Parker Island Gator Farm, Florida, USA (and have actually doubled in size)? Set up a Facebook group to share our concrete and irrefutable evidence of Sarcosuchus and start producing a film to share this wonderful news with the world? Obviously not. That’d be silly.
But we do need to do something. Row 20 of our dataset is clearly a mistake, but correcting it is a really sensitive topic. Part of the reason for that sensitivity is that a non-trivial number of scientists have been found to fake their data in order to further their own careers and prestige (e.g. see here, here, and most ironically here).
We want to be better than that. So before touching anything, go back to the source: not the downloaded file, but the original recorded values. Here that means contacting Parker Island Gator Farm to confirm our suspicion that row 20 is simply missing decimal points.
In the interest of time, I’ll confirm that it’s missing decimal points so we can move on to fixing the problem. Specifically, bite should have two decimal places, and length should have one.
We have two options: edit alligator_bite.txt directly and reload, or fix it in R. Do it in R. If you edit the raw file, there’s no record of what changed or why, and that starts to look a lot like the data manipulation we were just reading about. At minimum, you’d have to say: “We had this weird observation, but we’ve totally sorted it out. Don’t worry. Trust us.” The correction in code is transparent; it’s right there for anyone auditing your work.
The fix uses the indexing skills from above. ali$bite[20] points to row 20 of the bite column specifically (no comma needed since we’re already working with a single column).
We can start by replacing row 20 with itself (which does nothing, but makes the pattern clear):
ali$bite[20] <- ali$bite[20]
Now we just divide by 100, which shifts the decimal point two places to the left:
# Resolving data entry issue with missing decimal point
ali$bite[20] <- ali$bite[20] / 100
Using a hardcoded row number like [20] works here, but it’s fragile: if the dataset were ever re-sorted or a row added above this one, row 20 would point to the wrong observation. A safer approach uses the same conditional indexing we’ve already learned, targeting the value itself rather than its position:
ali$bite[ali$bite > 100000] <- ali$bite[ali$bite > 100000] / 100
This reads as: “for any bite value greater than 100,000, divide it by 100.” It will find the right row regardless of where it sits in the dataset.
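The condition appears twice in that line, which invites typos. One way to tidy it, sketched here on a made-up data frame (toy and is_wrong are illustrative names), is to store the TRUE/FALSE vector first and reuse it:

```r
# Three made-up rows, one with the missing-decimal problem
toy <- data.frame(length = c(408.9, 2293.0, 343.5),
                  bite   = c(2268.50, 123632, 1883.48))

# Flag the implausible bite values once
is_wrong <- toy$bite > 100000
toy[is_wrong, ]   # look before you change anything

# Missing decimal point: divide only the flagged values by 100
toy$bite[is_wrong] <- toy$bite[is_wrong] / 100
toy$bite
```

Storing the flag also gives you a record of which rows were changed, which is useful if anyone asks.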
Question Set Five
Question 1. Resolve the data entry mistake for ali$length. Be sure to add a clear comment indicating that you have made this change, so that if anyone reviews your code, they understand the reason for the modification and don’t suspect any foul play.
# Correcting missing decimal point, confirmed by Parker Island Gator Farm on 1/1/2020
# Row index approach (works here, but fragile if the dataset changes):
ali$length[20] <- ali$length[20] / 10
# More robust: target the value directly using a condition
ali$length[ali$length > 2000] <- ali$length[ali$length > 2000] / 10
Question 2. Having made any changes to your data, it’s always worthwhile checking it again to make sure you didn’t create a new mistake. Do so now using any techniques you feel are appropriate.
summary(ali)
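Beyond eyeballing summary(), you can write your expectations down as code. The following is a hedged sketch on made-up, already-corrected values; the limits of 600 cm and 5,000 PSI are rough guesses at what is plausible, not established thresholds, so adjust them to whatever your own search turned up.

```r
# Made-up values standing in for the corrected dataset
toy <- data.frame(length = c(408.9, 229.3, 343.5),
                  bite   = c(2268.50, 1236.32, 1883.48))

range(toy$length)
range(toy$bite)

# stopifnot() stays silent if every condition is TRUE and stops with an error otherwise
stopifnot(all(toy$length > 0 & toy$length < 600),
          all(toy$bite > 0 & toy$bite < 5000))
```

A check like this at the end of your cleaning code will complain loudly if a later change to the script reintroduces a bad value.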
Do it for Real
Download the data file below and save it to your BI3010/data/ folder.
The goal is the same as before: find all the data entry mistakes. This dataset is larger and the errors are harder to spot, so you’ll need to be more systematic. A few hints:
- unique() lists every distinct value in a column, useful for spotting typos in text columns.
- mean() returns NA if any values are missing. Add na.rm = TRUE to ignore them: mean(variable, na.rm = TRUE).
- There are five mistakes in total.
Start with an EDA as before. Can you find them all?
| Variable | Description |
|---|---|
| species | One of three species: Peregrine Falcon, Saiga Antelope, or Cheetah |
| weight | Body weight in grams |
| speed | Speed over a 50 m distance, in km/h |
- A total of 500 individuals were measured, but these weren’t equally divided between the three species.
- Multiple people participated in data entry, and it’s believed that not all were confident in how to do this.
- Physical copies of the data are available should any data need to be verified (contact one of the teaching staff if you need a data entry confirmed).
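If you haven’t used unique() or na.rm before, it may help to try them on something tiny first. The sketch below uses four made-up rows (toy is an illustrative name, and none of these values come from animal_speed.txt):

```r
# Made-up values with one misspelled species and one missing speed
toy <- data.frame(species = c("Cheetah", "Cheetah", "Cheeta", "Saiga Antelope"),
                  speed   = c(95.2, NA, 101.4, 78.9),
                  stringsAsFactors = FALSE)

unique(toy$species)             # "Cheeta" stands out as a likely typo

mean(toy$speed)                 # NA, because one value is missing
mean(toy$speed, na.rm = TRUE)   # mean of the three recorded values
```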
fast <- read.table("data/animal_speed.txt", header = TRUE)
# Start your exploration here
summary(fast)
str(fast)
unique(fast$species)
Mistake 1. One observation has an impossible weight value. Find it, identify which row it is in, and write the code to correct it. (Hint: look at the minimum weight value.)
One individual has a recorded weight of 0 g, which is impossible. After confirming with the data provider, the correct weight is 45,514 g.
fast[fast$weight == 0,]
fast$weight[fast$weight == 0] <- 45514
Mistake 2. There’s a second unusual weight in this dataset. Find it, identify which row it is in, and write the code to correct it.
One individual has a recorded weight of 3 g. The smallest of the three species (Peregrine Falcon) typically weighs around 400 to 1,300 g, so 3 g is clearly wrong. The correct weight is 47,255 g.
fast[fast$weight == 3,]
fast$weight[fast$weight == 3] <- 47255
Mistake 3. One of the species names has been entered incorrectly. Use unique() on the species column to find the problem.
One entry reads “Peregrin Took” instead of “Peregrine Falcon”. This is almost certainly a typo (with a Lord of the Rings flavour).
unique(fast$species)
fast$species[fast$species == "Peregrin Took"] <- "Peregrine Falcon"
fast$species <- factor(fast$species)
Mistake 4. One observation has a speed value that is physically impossible for any living thing. Find it and correct it.
One Peregrine Falcon has a recorded speed of 35,711 km/h. The fastest recorded speed for a Peregrine Falcon is around 390 km/h. Dividing by 100 gives 357.11 km/h, which is plausible, so this is another missing decimal point error. Note that the NA in the speed column makes max() tricky here.
# which() drops the NA comparison, so only the real match is returned
fast[which(fast$speed == max(fast$speed, na.rm = TRUE)), ]
# Row index approach (fragile: only reliable if the dataset has not changed):
fast$speed[15] <- fast$speed[15] / 100
# More robust: target the value directly
fast$speed[fast$speed > 10000 & !is.na(fast$speed)] <- fast$speed[fast$speed > 10000 & !is.na(fast$speed)] / 100
Mistake 5. There’s one remaining issue in the dataset. What is it, where is it, and how should it be handled?
One observation has a missing value (NA) in the speed column. NA means “not available” and is R’s standard way to represent missing data. The data provider can’t confirm what the true value should be, so the only honest option is to leave it as NA. Functions like mean() will return NA when any value is missing unless you add na.rm = TRUE.
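Finding and counting missing values takes one function, is.na(). Here is a minimal sketch on four made-up speeds (speed is an illustrative vector, not the real column), which also shows why comparisons need care when NA is present:

```r
# Made-up speeds with one missing value and one implausible value
speed <- c(95.2, NA, 101.4, 35711)

is.na(speed)          # TRUE where a value is missing
sum(is.na(speed))     # number of missing values
which(is.na(speed))   # their positions

# Comparisons involving NA return NA rather than TRUE or FALSE...
speed > 10000
# ...so wrap the condition in which() to keep only the definite matches
which(speed > 10000)
```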
After all corrections, run a final check:
summary(fast)
Question Set Six
Make sure you have corrected all five data entry errors before answering these questions - the answers depend on the corrected dataset.
Question 1. What is the median weight in this dataset?
median(fast$weight)
Question 2. What is the median speed in this dataset?
median(fast$speed, na.rm = TRUE)
Question 3. Which species is most common in this dataset?
The table() function gives a count per species, but there are other ways to check:
table(fast$species)
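A few of those other ways are sketched below, using five made-up labels rather than the real counts (species and counts are illustrative names). All of the functions are base R.

```r
# Made-up species labels
species <- c("Cheetah", "Saiga Antelope", "Cheetah", "Peregrine Falcon", "Cheetah")

counts <- table(species)
sort(counts, decreasing = TRUE)   # most common first
names(which.max(counts))          # name of the most common species
prop.table(counts)                # proportions instead of counts
```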
Chupacabras, Aliens and Beer
The final dataset is chupa.txt. We’ll work through it together in the last 30 minutes, but start early if you’re ahead. It contains the following variables:
- year: The year in which data was collected
- city: The city where the data was collected
- alc: The average alcohol consumption in that city, in that year
- ufo: The number of unidentified flying objects (i.e. aliens) reported in that city, in that year
- belief: The most common response from a selection of members of the public when asked if they believe in the supernatural, in that city, in that year
- chupa: The growth rate of chupacabra sightings from the previous year to the current year, in that city (where a value of zero means the number of sightings is constant, -0.5 means the number of sightings has declined by 50%, and 0.5 means they have increased by 50%).
There are data entry errors in this dataset. Find them all using EDA.
Download the data file below and save it to your BI3010/data/ folder.
As a group, we’ll look at snippets of code and identify the errors.
chupa <- read.table("data/chupa.txt", header = TRUE)
head(chupa)
str(chupa)
unique(chupa$city)
summary(chupa)
unique(chupa$belief)
Find and fix all data entry errors in the chupa dataset. Describe each error and write the correction code.
Error 1 - Two spellings of Amarillo
One row has the city entered as “amarillo” (lower case a) while all other rows use “Amarillo”. Use unique(chupa$city) to spot this, then check the rows affected with indexing.
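One possible correction, sketched on made-up rows rather than the real file (toy is an illustrative name and the alc values are invented), follows the same conditional indexing pattern used for the alligators:

```r
# Made-up rows; only the city column matters here
toy <- data.frame(city = c("Amarillo", "amarillo", "Houston"),
                  alc  = c(5.1, 4.8, 6.0),
                  stringsAsFactors = FALSE)

toy[toy$city == "amarillo", ]   # check which rows are affected

# Inconsistent capitalisation: recode to match the other rows
toy$city[toy$city == "amarillo"] <- "Amarillo"
unique(toy$city)
```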
Error 2 - Placeholder value in ufo (9999)
The ufo column contains one value of 9999, which appears to be a placeholder for missing data rather than a real count. A histogram of ufo makes this obvious. This value should be replaced with NA.
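Here is a minimal sketch of that check and fix on five made-up counts (ufo is an illustrative vector, not the real column). It assumes, as stated above, that 9999 is a placeholder and not a genuine observation.

```r
# Made-up counts with one placeholder value
ufo <- c(3, 7, 2, 9999, 5)

hist(ufo)                  # one bar sits far to the right of all the others

# 9999 is a placeholder, not a count: record it as missing
ufo[ufo == 9999] <- NA
summary(ufo)               # the maximum is now believable and the NA is reported
```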
Error 3 - Negative value in alc
Alcohol consumption can’t be negative. There’s one observation with a value of -5 in the alc column. A histogram will reveal it as a clear outlier on the left. The correct value is 5. Note: negative values in the chupa column are fine and expected, since it’s a growth rate.
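A possible fix is sketched below on four made-up values (alc is an illustrative vector). It relies on the correction given above, that the true value is 5; in your own work you would confirm a sign error with the data provider before flipping it.

```r
# Made-up values with one impossible negative
alc <- c(5.1, 4.8, -5, 6.0)

alc[alc < 0]               # inspect the negative entries first

# Sign error: flip the negative values back to positive
alc[alc < 0] <- abs(alc[alc < 0])
range(alc)
```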
Coding Puzzles
In the following snippets of code, can you identify the error? If you’re not sure, try running them and see what the error says.
Puzzle 1.
summary(alc)
# We have not specified that this is in the chupa dataset
summary(chupa$alc)
Puzzle 2.
chupa[chupa$ufo==9999]
# We're missing the comma to specify which columns to show
chupa[chupa$ufo == 9999, ]
Puzzle 3.
chupa[chupa$ufo = 9999, ]
# = assigns a value; use == for a conditional test
chupa[chupa$ufo == 9999, ]
Puzzle 4.
chupa[chupa$city = Houston,]
# Houston needs to be in quotation marks (it's a string, not an object)
# Also = should be ==
chupa[chupa$city == "Houston", ]
Puzzle 5.
ali[ali$Length > 200, ]
# Names are case sensitive. Should be `length`, not `Length`
ali[ali$length > 200, ]
Puzzle 6.
ali[ali$bite > mean(ali$bite), ]
# If bite contains NAs, mean() returns NA and every row comes back as NA - include na.rm = TRUE
ali[ali$bite > mean(ali$bite, na.rm = TRUE), ]
- … shown as num or int in str() output.
- … shown as chr in str() output.
- … functions like mean() return NA if any value is missing unless you add na.rm = TRUE.
- … (.R) containing a sequence of R commands. Run it top to bottom to reproduce an analysis exactly.
- … (mean(), summary()). You pass inputs inside the parentheses and get an output back.
- … na.rm = TRUE in mean(x, na.rm = TRUE). Arguments control how the function behaves.
- … df[rows, columns]. Leave a slot blank to mean "all".
- … TRUE or FALSE (e.g. x > 5). Used inside square brackets to filter rows that meet a condition.
- … set with setwd() or via the RStudio menu.