BIO331 – Lab 05: Data Wrangling
Working with data frames and data wrangling in R
Reminder: If you have not installed pak, do so now. You can install it from CRAN with install.packages("pak"). Once installed, load it with library(pak).
library(pak)
pak::pkg_install("tidyverse")
library(tidyverse)
Data Wrangling with R
Examples from Tidy data example.
Today, we will be working with a data structure called a data frame. You can think of a data frame as essentially an Excel spreadsheet. The key difference is that while an Excel spreadsheet is “What You See Is What You Get”, data frames in R have to be manipulated programmatically. While this may be tricky at first, it makes data entry and manipulation FAR easier. It also means that, if you are working with sensitive data, you are far less likely to make small errors you cannot easily identify. Let’s start by learning how to work with a data frame.
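As a minimal sketch (the column names and values below are invented purely for illustration), a data frame can be built by hand and indexed by row and column:

```r
# A tiny hand-built data frame; names and values are invented for illustration
df <- data.frame(
  species = c("A", "B", "C"),
  count   = c(10, 20, 30)
)

df[1, ]    # first row
df[, 2]    # second column (count): 10 20 30
df[3, 1]   # single cell: "C"
```

The `[row, column]` bracket syntax here is exactly what we will use on the billboard data below.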
# Visualizing the structure of the data to understand it better
billboarddf = head(billboard) # billboard is a dataset that comes with the tidyr package
You can think of a data frame as essentially a vector of vectors, so you can index into one by specifying a row and then a column.
# Identifying columns and rows
billboarddf[,1]
# A tibble: 6 × 1
artist
<chr>
1 2 Pac
2 2Ge+her
3 3 Doors Down
4 3 Doors Down
5 504 Boyz
6 98^0
billboarddf[1,]
# A tibble: 1 × 79
artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
<chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 Pac Baby Don'… 2000-02-26 87 82 72 77 87 94 99 NA
# ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
# wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
# wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
# wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
# wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
# wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
# wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …
billboarddf[1:2,]
# A tibble: 2 × 79
artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
<chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 Pac Baby Don… 2000-02-26 87 82 72 77 87 94 99 NA
2 2Ge+her The Hard… 2000-09-02 91 87 92 NA NA NA NA NA
# ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
# wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
# wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
# wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
# wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
# wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
# wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …
billboarddf[3, 2]
# A tibble: 1 × 1
track
<chr>
1 Kryptonite
However, there is also a syntax that makes working with data frames significantly easier. Every column of a data frame has a name. If you use a dollar sign, you can simply refer to those named columns directly:
# Identifying columns and rows
billboarddf$artist
[1] "2 Pac"        "2Ge+her"      "3 Doors Down" "3 Doors Down" "504 Boyz"
[6] "98^0"
billboarddf$artist[1]
[1] "2 Pac"
billboarddf$wk1
[1] 87 91 81 76 57 51
billboarddf$wk1[billboarddf$wk1 >= 80]
[1] 87 91 81
Being able to use conditional expressions is incredibly useful for filtering your data and analyzing only specific rows. It also lets you do math.
# Identifying columns and rows
billboarddf$wk1 >= 80
[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE
billboarddf[billboarddf$wk1 >= 80,]
# A tibble: 3 × 79
artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
<chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
3 3 Doors Do… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
# ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
# wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
# wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
# wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
# wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
# wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
# wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …
billboarddf$wk20 != NaN
[1] NA NA NA NA NA NA
billboarddf[billboarddf$wk20 != NaN,]
# A tibble: 6 × 79
artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
<chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 <NA> <NA> NA NA NA NA NA NA NA NA NA
2 <NA> <NA> NA NA NA NA NA NA NA NA NA
3 <NA> <NA> NA NA NA NA NA NA NA NA NA
4 <NA> <NA> NA NA NA NA NA NA NA NA NA
5 <NA> <NA> NA NA NA NA NA NA NA NA NA
6 <NA> <NA> NA NA NA NA NA NA NA NA NA
# ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
# wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
# wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
# wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
# wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
# wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
# wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …
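The all-NA rows above illustrate a classic R gotcha: a comparison with NA or NaN always returns NA (never TRUE or FALSE), and indexing with NA produces empty NA rows. The idiomatic way to test for missing values is is.na(). A minimal sketch on an invented toy vector:

```r
x <- c(87, NA, 92, NA, 81)  # toy values for illustration

x != NaN        # NA NA NA NA NA -- comparisons with NA/NaN are never TRUE/FALSE
!is.na(x)       # TRUE FALSE TRUE FALSE TRUE
x[!is.na(x)]    # 87 92 81 -- keeps only the non-missing values
```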
mean(billboarddf$wk3)
[1] 60.5
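Column math is vectorized: arithmetic between two columns operates element by element, with no loop required. A toy sketch with invented rank values:

```r
# Invented week-1 and week-2 ranks for three songs
wk1 <- c(87, 91, 81)
wk2 <- c(82, 87, 92)

wk2 - wk1          # element-wise change in rank: -5 -4 11
mean(wk2 - wk1)    # average change across the three songs
```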
We’re now going to discuss ways to transform entire data sets using R.
Tidying data
In this step, the pivot_longer function is employed to transform the billboard dataset from a wide format to a long format.
- names_to = "week": Specifies that the names of the original set of columns (wk1 to wk76) are to be stored in a new column named week.
- values_to = "rank": Signifies that the values of the original set of columns will be gathered into a new column named rank.
- values_drop_na = TRUE: Ensures that any resulting rows containing NA in the rank column are omitted from the billboard2 dataset.
billboard2 <- billboard %>%
pivot_longer(
wk1:wk76,
names_to = "week",
values_to = "rank",
values_drop_na = TRUE
)
billboard2
# A tibble: 5,307 × 5
artist track date.entered week rank
<chr> <chr> <date> <chr> <dbl>
1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
7 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk7 99
8 2Ge+her The Hardest Part Of ... 2000-09-02 wk1 91
9 2Ge+her The Hardest Part Of ... 2000-09-02 wk2 87
10 2Ge+her The Hardest Part Of ... 2000-09-02 wk3 92
# ℹ 5,297 more rows
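pivot_longer() has an inverse, pivot_wider(), which spreads a long table back out to wide format. A minimal round-trip sketch on an invented toy table (requires tidyr; the table and column names are made up for illustration):

```r
library(tidyr)

# Toy wide table in the same shape as billboard: one column per week
scores <- data.frame(
  student = c("A", "B"),
  wk1 = c(90, 75),
  wk2 = c(85, 80)
)

# Wide -> long: one row per student-week combination
long <- pivot_longer(scores, wk1:wk2, names_to = "week", values_to = "score")

# Long -> wide: pivot_wider() reverses the reshape, recovering the layout
wide <- pivot_wider(long, names_from = "week", values_from = "score")
```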
Next, the mutate function is utilized to create and modify variables within the long-format dataset created in the previous step.
- week = as.integer(gsub("wk", "", week)): Converts the week column to integer by removing the “wk” prefix from the values in the week column and then coercing them to integer.
- date = as.Date(date.entered) + 7 * (week - 1): Calculates a new date column by adding the number of weeks (converted to days) to the date.entered column, allowing tracking of the specific date related to each week’s data.
- date.entered = NULL: Removes the original date.entered column after the new date column has been created.
billboard3 <- billboard2 %>%
mutate(
week = as.integer(gsub("wk", "", week)),
# Adding to dates in R adds days!
date = as.Date(date.entered) + 7 * (week - 1),
date.entered = NULL
)
billboard3
# A tibble: 5,307 × 5
artist track week rank date
<chr> <chr> <int> <dbl> <date>
1 2 Pac Baby Don't Cry (Keep... 1 87 2000-02-26
2 2 Pac Baby Don't Cry (Keep... 2 82 2000-03-04
3 2 Pac Baby Don't Cry (Keep... 3 72 2000-03-11
4 2 Pac Baby Don't Cry (Keep... 4 77 2000-03-18
5 2 Pac Baby Don't Cry (Keep... 5 87 2000-03-25
6 2 Pac Baby Don't Cry (Keep... 6 94 2000-04-01
7 2 Pac Baby Don't Cry (Keep... 7 99 2000-04-08
8 2Ge+her The Hardest Part Of ... 1 91 2000-09-02
9 2Ge+her The Hardest Part Of ... 2 87 2000-09-09
10 2Ge+her The Hardest Part Of ... 3 92 2000-09-16
# ℹ 5,297 more rows
Finally, the arrange function is applied to organize the dataset based on the artist, track, and week columns. This operation ensures a coherent and ordered display of the dataset, making it more manageable and intuitive for subsequent analysis.
long_billboard_sorted <- billboard3 %>% arrange(artist, track, week)
glimpse(long_billboard_sorted)
Rows: 5,307
Columns: 5
$ artist <chr> "2 Pac", "2 Pac", "2 Pac", "2 Pac", "2 Pac", "2 Pac", "2 Pac", …
$ track <chr> "Baby Don't Cry (Keep...", "Baby Don't Cry (Keep...", "Baby Don…
$ week <int> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11…
$ rank <dbl> 87, 82, 72, 77, 87, 94, 99, 91, 87, 92, 81, 70, 68, 67, 66, 57,…
$ date <date> 2000-02-26, 2000-03-04, 2000-03-11, 2000-03-18, 2000-03-25, 20…
This example code creates a new song data frame holding unique artist and track combinations from the billboard3 dataframe, and assigns a unique song_id to each row (representing each unique song).
song <- billboard3 %>%
distinct(artist, track) %>%
mutate(song_id = row_number())
glimpse(song)
Rows: 317
Columns: 3
$ artist <chr> "2 Pac", "2Ge+her", "3 Doors Down", "3 Doors Down", "504 Boyz"…
$ track <chr> "Baby Don't Cry (Keep...", "The Hardest Part Of ...", "Krypton…
$ song_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
The song dataframe is then joined with the billboard3 dataframe to create a new dataframe rank that includes the song_id column.
rank <- billboard3 %>%
left_join(song, c("artist", "track")) %>%
select(song_id, date, week, rank)
glimpse(rank)
Rows: 5,307
Columns: 4
$ song_id <int> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
$ date <date> 2000-02-26, 2000-03-04, 2000-03-11, 2000-03-18, 2000-03-25, 2…
$ week <int> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1…
$ rank <dbl> 87, 82, 72, 77, 87, 94, 99, 91, 87, 92, 81, 70, 68, 67, 66, 57…
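Splitting the data this way is a simple form of normalization: song-level facts live in one table, week-level facts in another, linked by song_id. Joining the two tables back on song_id recovers the combined information. A toy sketch with invented miniature tables (requires dplyr):

```r
library(dplyr)

# Miniature versions of the song and rank tables; values are invented
song_toy <- data.frame(song_id = 1:2,
                       artist  = c("2 Pac", "2Ge+her"),
                       track   = c("Baby Don't Cry", "The Hardest Part"))
rank_toy <- data.frame(song_id = c(1, 1, 2),
                       week    = c(1, 2, 1),
                       rank    = c(87, 82, 91))

# left_join() brings the artist/track columns back onto each ranking row
combined <- left_join(rank_toy, song_toy, by = "song_id")
```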
Joins/Merges
In this section, we will join the flights dataset with the weather dataset from the nycflights13 R package to analyze how weather conditions might have affected the flights.
library(nycflights13) # R has built-in datasets that can be loaded directly from a library
Inner Join
In an inner join, only the rows with matching keys in both data frames are returned. Rows with non-matching keys are excluded from the result. It’s useful when you want to join datasets based on common key columns, and you are only interested in rows with matching keys in both datasets.
# R example
flights_weather_inner_joined <- inner_join(flights, weather, by = c("year", "month", "day", "hour", "origin"))
Left Join
A left join returns all rows from the left dataset and the matched rows from the right dataset. If there is no match found in the right dataset, then the result will contain NA. Use a left join when you want to retain all records from the “left” dataset, and add matching records from the “right” dataset where available.
# R example
flights_weather_left_joined <- left_join(flights, weather, by = c("year", "month", "day", "hour", "origin"))
Right Join
In a right join, all rows from the right dataset and the matched rows from the left dataset are returned. If there is no match found in the left dataset, then the result will contain NA. It is the opposite of a left join and is used when you want to retain all records from the “right” dataset.
# R example
flights_weather_right_joined <- right_join(flights, weather, by = c("year", "month", "day", "hour", "origin"))
Full Join
A full join returns all rows when there is a match in either the left or right dataset. If there is no match found in either dataset, then the result will contain NA. It is useful when you want to retain all records from both datasets.
# R example
flights_weather_full_joined <- full_join(flights, weather, by=c("year", "month", "day", "hour", "origin"))
head(flights_weather_full_joined)
# A tibble: 6 × 29
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
# ℹ 21 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour.x <dttm>, temp <dbl>, dewp <dbl>,
# humid <dbl>, wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
# precip <dbl>, pressure <dbl>, visib <dbl>, time_hour.y <dttm>
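The difference between the four join types is easiest to see on tiny tables with partially overlapping keys. A sketch with invented data (requires dplyr):

```r
library(dplyr)

# Invented toy tables: keys 2 and 3 match, keys 1 and 4 do not
left_tbl  <- data.frame(key = c(1, 2, 3), x = c("a", "b", "c"))
right_tbl <- data.frame(key = c(2, 3, 4), y = c("B", "C", "D"))

nrow(inner_join(left_tbl, right_tbl, by = "key"))  # 2: only keys in both
nrow(left_join(left_tbl, right_tbl, by = "key"))   # 3: all left keys kept
nrow(right_join(left_tbl, right_tbl, by = "key"))  # 3: all right keys kept
nrow(full_join(left_tbl, right_tbl, by = "key"))   # 4: union of all keys
```

The unmatched rows in the left, right, and full joins get NA in the columns coming from the other table.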
Using Joins to Analyze the Data
Use the inner joined dataset to calculate the average departure delay for flights with precipitation greater than 0.5.
# R example
average_delay_per_condition <- flights_weather_inner_joined %>%
group_by(precip > 0.5) %>%
summarise(Average_Departure_Delay = mean(dep_delay, na.rm = TRUE))
average_delay_per_condition
# A tibble: 2 × 2
`precip > 0.5` Average_Departure_Delay
<lgl> <dbl>
1 FALSE 12.6
2 TRUE 48.8
Exercises
We will use the weather and flights datasets from the nycflights13 package (also provided as .csv files for Python users) for the exercises.
Please see the nycflights13 documentation for more information about the datasets.
library(nycflights13)
flights
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Exercise 1 – Filtering and Summarizing
- Task: Filter the flights dataset to include only flights with a delay of more than 12 hours. Group and count this output by origin and sort the result in descending order.
- Expected Output: A data.frame showing the number of flights delayed over 12 hours by airport, ordered from most to least.
Exercise 2 – Filtering and Summarizing
- Task: Calculate the average air time and the number of flights departing from JFK and arriving at LAX in the flights data set. Make sure to report this result in hours.
- Expected Output: A data.frame with a single row showing the average air time in hours and the number of flights from JFK to LAX.
Exercise 3 – Wrangling Airport Data
airports# A tibble: 1,458 × 8
faa name lat lon alt tz dst tzone
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/…
2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/…
3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/…
4 06N Randall Airport 41.4 -74.4 523 -5 A America/…
5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/…
6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America/…
7 0G6 Williams County Airport 41.5 -84.5 730 -5 A America/…
8 0G7 Finger Lakes Regional Airport 42.9 -76.8 492 -5 A America/…
9 0P2 Shoestring Aviation Airfield 39.8 -76.6 1000 -5 U America/…
10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/…
# ℹ 1,448 more rows
- Task: Using the airports dataset, report the frequency of the time zones of destinations in descending order. Additionally, find an example of an airport with a missing time zone and report the name of the airport, explaining how you checked for it.
To do this problem, you’re going to need to use a function that is new to you: count()
library(dplyr)
airports_dst_count <- airports %>%
count(dst)
- Expected Output:
  - A data.frame listing the time zones by frequency in descending order.
  - The name of at least one airport with a missing time zone and the code used to identify it.