BIO331 – Lab 05: Data Wrangling

Working with data frames and data wrangling in R

Author

Dee Ruttenberg (Adapted from Scott Wolf, Michelle White)

Published

February 25, 2026

GitHub Classroom Assignment

Reminder: If you have not installed pak, do so now. You can install it from CRAN with install.packages("pak"). Once installed, load it with library(pak).

Data Wrangling with R

The examples below are adapted from tidyr's "Tidy data" vignette.

library(pak)
pak::pkg_install("tidyverse")
library(tidyverse)
# Visualizing the structure of the data to understand it better
head(billboard) # billboard is a dataset that comes with the tidyr package
# A tibble: 6 × 79
  artist      track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
  <chr>       <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 Pac       Baby… 2000-02-26      87    82    72    77    87    94    99    NA
2 2Ge+her     The … 2000-09-02      91    87    92    NA    NA    NA    NA    NA
3 3 Doors Do… Kryp… 2000-04-08      81    70    68    67    66    57    54    53
4 3 Doors Do… Loser 2000-10-21      76    76    72    69    67    65    55    59
5 504 Boyz    Wobb… 2000-04-15      57    34    25    17    17    31    36    49
6 98^0        Give… 2000-08-19      51    39    34    26    26    19     2     2
# ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
#   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
#   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
#   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
#   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
#   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
#   wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …

Tidying data

In this step, the pivot_longer function is employed to transform the billboard dataset from a wide format to a long format.

  • names_to = "week": Specifies that the names of the original set of columns (wk1 to wk76) are to be stored in a new column named week.
  • values_to = "rank": Signifies that the values of the original set of columns will be gathered into a new column named rank.
  • values_drop_na = TRUE: Ensures that any resulting rows containing NA in the rank column are omitted from the billboard2 dataset.
billboard2 <- billboard %>%
  pivot_longer(
    wk1:wk76,
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE
  )

billboard2
# A tibble: 5,307 × 5
   artist  track                   date.entered week   rank
   <chr>   <chr>                   <date>       <chr> <dbl>
 1 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk1      87
 2 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk2      82
 3 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk3      72
 4 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk4      77
 5 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk5      87
 6 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk6      94
 7 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk7      99
 8 2Ge+her The Hardest Part Of ... 2000-09-02   wk1      91
 9 2Ge+her The Hardest Part Of ... 2000-09-02   wk2      87
10 2Ge+her The Hardest Part Of ... 2000-09-02   wk3      92
# ℹ 5,297 more rows

Next, the mutate function is used to create and modify variables within the long-format dataset created in the previous step.

  • week = as.integer(gsub("wk", "", week)): Strips the "wk" prefix from the values in the week column and coerces the result to integer.
  • date = as.Date(date.entered) + 7 * (week - 1): Computes a new date column by adding the elapsed weeks (converted to days) to the date.entered column, giving the calendar date associated with each week's data.
  • date.entered = NULL: Removes the original date.entered column now that the new date column has been created.
billboard3 <- billboard2 %>%
  mutate(
    week = as.integer(gsub("wk", "", week)),
    # Adding to dates in R adds days!
    date = as.Date(date.entered) + 7 * (week - 1),
    date.entered = NULL
  )

billboard3
# A tibble: 5,307 × 5
   artist  track                    week  rank date      
   <chr>   <chr>                   <int> <dbl> <date>    
 1 2 Pac   Baby Don't Cry (Keep...     1    87 2000-02-26
 2 2 Pac   Baby Don't Cry (Keep...     2    82 2000-03-04
 3 2 Pac   Baby Don't Cry (Keep...     3    72 2000-03-11
 4 2 Pac   Baby Don't Cry (Keep...     4    77 2000-03-18
 5 2 Pac   Baby Don't Cry (Keep...     5    87 2000-03-25
 6 2 Pac   Baby Don't Cry (Keep...     6    94 2000-04-01
 7 2 Pac   Baby Don't Cry (Keep...     7    99 2000-04-08
 8 2Ge+her The Hardest Part Of ...     1    91 2000-09-02
 9 2Ge+her The Hardest Part Of ...     2    87 2000-09-09
10 2Ge+her The Hardest Part Of ...     3    92 2000-09-16
# ℹ 5,297 more rows
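The comment in the code above is worth a quick check: adding a number to a Date object in R shifts it by that many days, which is why multiplying the week offset by 7 produces each chart date.

```r
# Date + integer arithmetic in R works in days
d <- as.Date("2000-02-26")
d + 7           # one week later: "2000-03-04"
d + 7 * (3 - 1) # week 3 of the chart run: "2000-03-11"
```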

Finally, the arrange function is applied to organize the dataset based on the artist, track, and week columns. This operation ensures a coherent and ordered display of the dataset, making it more manageable and intuitive for subsequent analysis.

long_billboard_sorted <- billboard3 %>% arrange(artist, track, week)

glimpse(long_billboard_sorted)
Rows: 5,307
Columns: 5
$ artist <chr> "2 Pac", "2 Pac", "2 Pac", "2 Pac", "2 Pac", "2 Pac", "2 Pac", …
$ track  <chr> "Baby Don't Cry (Keep...", "Baby Don't Cry (Keep...", "Baby Don…
$ week   <int> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11…
$ rank   <dbl> 87, 82, 72, 77, 87, 94, 99, 91, 87, 92, 81, 70, 68, 67, 66, 57,…
$ date   <date> 2000-02-26, 2000-03-04, 2000-03-11, 2000-03-18, 2000-03-25, 20…

This example code creates a new song data frame holding the unique artist and track combinations from the billboard3 data frame, and assigns a unique song_id to each row (one per song).

song <- billboard3 %>% 
  distinct(artist, track) %>%
  mutate(song_id = row_number())

glimpse(song)
Rows: 317
Columns: 3
$ artist  <chr> "2 Pac", "2Ge+her", "3 Doors Down", "3 Doors Down", "504 Boyz"…
$ track   <chr> "Baby Don't Cry (Keep...", "The Hardest Part Of ...", "Krypton…
$ song_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…

The song data frame is then joined with the billboard3 data frame to create a new data frame, rank, that replaces the artist and track columns with song_id.

rank <- billboard3 %>%
  left_join(song, by = c("artist", "track")) %>%
  select(song_id, date, week, rank)

glimpse(rank)
Rows: 5,307
Columns: 4
$ song_id <int> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
$ date    <date> 2000-02-26, 2000-03-04, 2000-03-11, 2000-03-18, 2000-03-25, 2…
$ week    <int> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1…
$ rank    <dbl> 87, 82, 72, 77, 87, 94, 99, 91, 87, 92, 81, 70, 68, 67, 66, 57…

Joins/Merges

In this section, we will join the flights dataset with the weather dataset from the nycflights13 R package to analyze how weather conditions might have affected the flights.

If you are working from the provided CSV files, make sure you have loaded flights.csv and weather.csv!

library(nycflights13)  # this package provides the flights and weather datasets

Inner Join

In an inner join, only the rows with matching keys in both data frames are returned. Rows with non-matching keys are excluded from the result. It’s useful when you want to join datasets based on common key columns, and you are only interested in rows with matching keys in both datasets.

# R example
flights_weather_inner_joined <- inner_join(flights, weather, by = c("year", "month", "day", "hour", "origin"))

Left Join

A left join returns all rows from the left dataset and the matched rows from the right dataset. If there is no match found in the right dataset, then the result will contain NA. Use a left join when you want to retain all records from the “left” dataset, and add matching records from the “right” dataset where available.

# R example
flights_weather_left_joined <- left_join(flights, weather, by = c("year", "month", "day", "hour", "origin"))

Right Join

In a right join, all rows from the right dataset and the matched rows from the left dataset are returned. If there is no match found in the left dataset, then the result will contain NA. It is the opposite of a left join and is used when you want to retain all records from the “right” dataset.

# R example
flights_weather_right_joined <- right_join(flights, weather, by = c("year", "month", "day", "hour", "origin"))

Full Join

A full join returns all rows when there is a match in either the left or right dataset. If there is no match found in either dataset, then the result will contain NA. It is useful when you want to retain all records from both datasets.

# R example
flights_weather_full_joined <- full_join(flights, weather, by = c("year", "month", "day", "hour", "origin"))
head(flights_weather_full_joined)
# A tibble: 6 × 29
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
5  2013     1     1      554            600        -6      812            837
6  2013     1     1      554            558        -4      740            728
# ℹ 21 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour.x <dttm>, temp <dbl>, dewp <dbl>,
#   humid <dbl>, wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
#   precip <dbl>, pressure <dbl>, visib <dbl>, time_hour.y <dttm>

Using Joins to Analyze the Data

Use the inner-joined dataset to calculate the average departure delay for flights with precipitation greater than 0.5 inches.

# R example
average_delay_per_condition <- flights_weather_inner_joined %>%
  group_by(precip > 0.5) %>%
  summarise(Average_Departure_Delay = mean(dep_delay, na.rm = TRUE))

average_delay_per_condition
# A tibble: 2 × 2
  `precip > 0.5` Average_Departure_Delay
  <lgl>                            <dbl>
1 FALSE                             12.6
2 TRUE                              48.8

Anti Join

An anti join returns rows from the left dataset where there are no matching keys in the right dataset. It’s useful for identifying records in one dataset that do not have a counterpart in another dataset.

# R example
flights_weather_anti_joined <- anti_join(flights, weather, by = c("year", "month", "day", "hour", "origin"))

head(flights_weather_anti_joined)
# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1     1153           1200        -7     1450           1529
2  2013     1     1     1154           1200        -6     1253           1306
3  2013     1     1     1155           1200        -5     1517           1510
4  2013     1     1     1155           1200        -5     1312           1315
5  2013     1     1     1157           1200        -3     1452           1456
6  2013     1     1     1158           1200        -2     1256           1300
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
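The differences between these join types are easiest to see on tiny inputs. dplyr ships two small practice data frames, band_members and band_instruments, that share a name key; only John and Paul appear in both:

```r
library(dplyr)

band_members      # 3 rows: Mick (Stones), John (Beatles), Paul (Beatles)
band_instruments  # 3 rows: John (guitar), Paul (bass), Keith (guitar)

inner_join(band_members, band_instruments, by = "name")  # John and Paul only
left_join(band_members, band_instruments, by = "name")   # all members; Mick's plays is NA
right_join(band_members, band_instruments, by = "name")  # all instruments; Keith's band is NA
full_join(band_members, band_instruments, by = "name")   # all four names, NA where unmatched
anti_join(band_members, band_instruments, by = "name")   # Mick only: no match in band_instruments
```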

Exercises

We will use the weather and flights datasets from the nycflights13 package (also provided as .csv files for Python users) for the exercises.

Please see the nycflights13 package documentation for more information about the datasets.

Exercise 1 – Filtering and Summarizing

  • Task: Filter the flights dataset to include only flights with a delay of more than 12 hours. Group and count this output by origin and sort the result in descending order.
  • Expected Output: A data.frame showing the number of flights delayed over 12 hours by airport, ordered from most to least.

Exercise 2 – Filtering and Summarizing

  • Task: Calculate the average air time and the number of flights departing from JFK and arriving at LAX in the flights dataset. Make sure to report this result in hours.
  • Expected Output: A data.frame with a single row showing the average air time in hours and the number of flights from JFK to LAX.

Exercise 3 – Wrangling Airport Data

  • Task: Using the airports dataset, report the frequency of the time zones of destinations in descending order. Additionally, find an example of an airport with a missing time zone and report the name of the airport, explaining how you checked for it.
  • Expected Output:
    1. A data.frame listing the time zones by frequency in descending order.
    2. The name of at least one airport with a missing time zone and the code used to identify it.

Exercise 4 – More Wrangling

  • Task: Identify the top 3 months with the highest average departure delays in the flights dataset. For these months, calculate the average, minimum, and maximum departure delay.
  • Expected Output: A data.frame showing the top 3 months along with their respective average, minimum, and maximum departure delay values.