BIO331 – Lab 05: Data Wrangling
Working with data frames and data wrangling in R
GitHub Classroom Assignment

library(pak)
pak::pkg_install("tidyverse")
library(tidyverse)
Reminder: If you have not installed pak, do so now. You can install it from CRAN with install.packages("pak"). Once installed, load it with library(pak).
Data Wrangling with R
Examples adapted from the "Tidy data" example in the tidyr documentation.
# Visualizing the structure of the data to understand it better
head(billboard) # billboard is a dataset that comes with the tidyr package

# A tibble: 6 × 79
artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
<chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
3 3 Doors Do… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
4 3 Doors Do… Loser 2000-10-21 76 76 72 69 67 65 55 59
5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49
6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2
# ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
# wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
# wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
# wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
# wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
# wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
# wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …
Tidying data
In this step, the pivot_longer function is employed to transform the billboard dataset from a wide format to a long format.
- names_to = "week": Specifies that the names of the original set of columns (wk1 to wk76) are to be stored in a new column named week.
- values_to = "rank": Signifies that the values of the original set of columns will be gathered into a new column named rank.
- values_drop_na = TRUE: Ensures that any resulting rows containing NA in the rank column are omitted from the billboard2 dataset.
billboard2 <- billboard %>%
pivot_longer(
wk1:wk76,
names_to = "week",
values_to = "rank",
values_drop_na = TRUE
)
billboard2

# A tibble: 5,307 × 5
artist track date.entered week rank
<chr> <chr> <date> <chr> <dbl>
1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
7 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk7 99
8 2Ge+her The Hardest Part Of ... 2000-09-02 wk1 91
9 2Ge+her The Hardest Part Of ... 2000-09-02 wk2 87
10 2Ge+her The Hardest Part Of ... 2000-09-02 wk3 92
# ℹ 5,297 more rows
Next, the mutate function is used to create and modify variables in the long-format dataset built in the previous step.
- week = as.integer(gsub("wk", "", week)): Converts the week column to integer by removing the "wk" prefix from its values and then coercing them to integer.
- date = as.Date(date.entered) + 7 * (week - 1): Calculates a new date column by adding the number of elapsed weeks (converted to days) to the date.entered column, giving the specific date for each week's chart position.
- date.entered = NULL: Removes the original date.entered column after the new date column has been created.
billboard3 <- billboard2 %>%
mutate(
week = as.integer(gsub("wk", "", week)),
# Adding to dates in R adds days!
date = as.Date(date.entered) + 7 * (week - 1),
date.entered = NULL
)
billboard3

# A tibble: 5,307 × 5
artist track week rank date
<chr> <chr> <int> <dbl> <date>
1 2 Pac Baby Don't Cry (Keep... 1 87 2000-02-26
2 2 Pac Baby Don't Cry (Keep... 2 82 2000-03-04
3 2 Pac Baby Don't Cry (Keep... 3 72 2000-03-11
4 2 Pac Baby Don't Cry (Keep... 4 77 2000-03-18
5 2 Pac Baby Don't Cry (Keep... 5 87 2000-03-25
6 2 Pac Baby Don't Cry (Keep... 6 94 2000-04-01
7 2 Pac Baby Don't Cry (Keep... 7 99 2000-04-08
8 2Ge+her The Hardest Part Of ... 1 91 2000-09-02
9 2Ge+her The Hardest Part Of ... 2 87 2000-09-09
10 2Ge+her The Hardest Part Of ... 3 92 2000-09-16
# ℹ 5,297 more rows
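The comment in the mutate() call above relies on the fact that adding a number to a Date object in R advances it by that many days. A minimal sketch of the week-to-date arithmetic, using the first 2 Pac row as the example:

```r
# Adding an integer to a Date shifts it by that many days
entered <- as.Date("2000-02-26")  # date.entered for "Baby Don't Cry"
entered + 7 * (3 - 1)             # week 3 -> "2000-03-11", matching the output above
```

This is why no explicit unit conversion is needed: Date objects count days, so 7 * (week - 1) already has the right units.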
Finally, the arrange function is applied to organize the dataset based on the artist, track, and week columns. This operation ensures a coherent and ordered display of the dataset, making it more manageable and intuitive for subsequent analysis.
long_billboard_sorted <- billboard3 %>% arrange(artist, track, week)
glimpse(long_billboard_sorted)

Rows: 5,307
Columns: 5
$ artist <chr> "2 Pac", "2 Pac", "2 Pac", "2 Pac", "2 Pac", "2 Pac", "2 Pac", …
$ track <chr> "Baby Don't Cry (Keep...", "Baby Don't Cry (Keep...", "Baby Don…
$ week <int> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11…
$ rank <dbl> 87, 82, 72, 77, 87, 94, 99, 91, 87, 92, 81, 70, 68, 67, 66, 57,…
$ date <date> 2000-02-26, 2000-03-04, 2000-03-11, 2000-03-18, 2000-03-25, 20…
This example code creates a new song data frame holding unique artist and track combinations from the billboard3 dataframe, and assigns a unique song_id to each row (representing each unique song).
song <- billboard3 %>%
distinct(artist, track) %>%
mutate(song_id = row_number())
glimpse(song)

Rows: 317
Columns: 3
$ artist <chr> "2 Pac", "2Ge+her", "3 Doors Down", "3 Doors Down", "504 Boyz"…
$ track <chr> "Baby Don't Cry (Keep...", "The Hardest Part Of ...", "Krypton…
$ song_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
The song dataframe is then joined with the billboard3 dataframe to create a new dataframe rank that includes the song_id column.
rank <- billboard3 %>%
  left_join(song, by = c("artist", "track")) %>%
select(song_id, date, week, rank)
glimpse(rank)

Rows: 5,307
Columns: 4
$ song_id <int> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
$ date <date> 2000-02-26, 2000-03-04, 2000-03-11, 2000-03-18, 2000-03-25, 2…
$ week <int> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1…
$ rank <dbl> 87, 82, 72, 77, 87, 94, 99, 91, 87, 92, 81, 70, 68, 67, 66, 57…
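Splitting billboard3 into song and rank is a small normalization step, and it is lossless: rejoining the two tables recovers the original columns. A quick sketch (not part of the original lab) to convince yourself of this:

```r
# Rejoining rank and song recovers the artist/track columns
billboard_rebuilt <- rank %>%
  left_join(song, by = "song_id") %>%
  select(artist, track, week, rank, date)

nrow(billboard_rebuilt) == nrow(billboard3)  # TRUE: no rows gained or lost
</imports>
```

Because song_id is unique in song, the left join matches each rank row exactly once, so the row count cannot change.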
Joins/Merges
In this section, we will join the flights dataset with the weather dataset from the nycflights13 R package to analyze how weather conditions might have affected the flights.
Make sure you have loaded in flights.csv and weather.csv!
library(nycflights13) # R has built-in datasets that can be loaded directly from a library

Inner Join
In an inner join, only the rows with matching keys in both data frames are returned. Rows with non-matching keys are excluded from the result. It’s useful when you want to join datasets based on common key columns, and you are only interested in rows with matching keys in both datasets.
# R example
flights_weather_inner_joined <- inner_join(flights, weather, by = c("year", "month", "day", "hour", "origin"))

Left Join
A left join returns all rows from the left dataset and the matched rows from the right dataset. If there is no match found in the right dataset, then the result will contain NA. Use a left join when you want to retain all records from the “left” dataset, and add matching records from the “right” dataset where available.
# R example
flights_weather_left_joined <- left_join(flights, weather, by = c("year", "month", "day", "hour", "origin"))

Right Join
In a right join, all rows from the right dataset and the matched rows from the left dataset are returned. If there is no match found in the left dataset, then the result will contain NA. It is the opposite of a left join and is used when you want to retain all records from the “right” dataset.
# R example
flights_weather_right_joined <- right_join(flights, weather, by = c("year", "month", "day", "hour", "origin"))

Full Join
A full join returns all rows when there is a match in either the left or right dataset. If there is no match found in either dataset, then the result will contain NA. It is useful when you want to retain all records from both datasets.
# R example
flights_weather_full_joined <- full_join(flights, weather, by = c("year", "month", "day", "hour", "origin"))
head(flights_weather_full_joined)

# A tibble: 6 × 29
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
# ℹ 21 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour.x <dttm>, temp <dbl>, dewp <dbl>,
# humid <dbl>, wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
# precip <dbl>, pressure <dbl>, visib <dbl>, time_hour.y <dttm>
Using Joins to Analyze the Data
Use the inner joined dataset to calculate the average departure delay for flights with precipitation greater than 0.5.
# R example
average_delay_per_condition <- flights_weather_inner_joined %>%
group_by(precip > 0.5) %>%
summarise(Average_Departure_Delay = mean(dep_delay, na.rm = TRUE))
average_delay_per_condition

# A tibble: 2 × 2
`precip > 0.5` Average_Departure_Delay
<lgl> <dbl>
1 FALSE 12.6
2 TRUE 48.8
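Grouping directly by the logical expression precip > 0.5 works because group_by() accepts computed expressions, not just column names. An equivalent, arguably more readable sketch first creates a labeled column (the label strings here are our own, not from the original lab):

```r
# Same summary, but with human-readable group labels
average_delay_labeled <- flights_weather_inner_joined %>%
  mutate(conditions = if_else(precip > 0.5, "heavy precipitation", "little/no precipitation")) %>%
  group_by(conditions) %>%
  summarise(Average_Departure_Delay = mean(dep_delay, na.rm = TRUE))
```

The numeric results are identical to those above; only the grouping column's name and values change.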
Anti Join
An anti join returns rows from the left dataset where there are no matching keys in the right dataset. It’s useful for identifying records in one dataset that do not have a counterpart in another dataset.
# R example
flights_weather_anti_joined <- anti_join(flights, weather, by = c("year", "month", "day", "hour", "origin"))
head(flights_weather_anti_joined)

# A tibble: 6 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 1153 1200 -7 1450 1529
2 2013 1 1 1154 1200 -6 1253 1306
3 2013 1 1 1155 1200 -5 1517 1510
4 2013 1 1 1155 1200 -5 1312 1315
5 2013 1 1 1157 1200 -3 1452 1456
6 2013 1 1 1158 1200 -2 1256 1300
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
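A quick way to see how the join types relate is to compare row counts across the results built above (the exact numbers depend on the package version, so none are shown here); a sketch:

```r
# Row counts reveal how much each join keeps or drops
sapply(list(inner = flights_weather_inner_joined,
            left  = flights_weather_left_joined,
            right = flights_weather_right_joined,
            full  = flights_weather_full_joined,
            anti  = flights_weather_anti_joined),
       nrow)
# Expect inner <= left <= full; the anti join counts flights rows with no weather match
```

Note also that nrow(inner) + nrow(anti) equals nrow(flights) when each flight matches at most one weather row, since the anti join returns exactly the left-hand rows the inner join discards.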
Exercises
We will use the weather and flights datasets from the nycflights13 package (also provided as .csv files for Python users) for the exercises.
Please see the nycflights13 package's documentation for more information about the datasets.
Exercise 1 – Filtering and Summarizing
- Task: Filter the flights dataset to include only flights with a delay of more than 12 hours. Group and count this output by origin and sort the result in descending order.
- Expected Output: A data.frame showing the number of flights delayed over 12 hours by airport, ordered from most to least.
Exercise 2 – Filtering and Summarizing
- Task: Calculate the average air time and the number of flights departing from JFK and arriving at LAX in the flights dataset. Make sure to report this result in hours.
- Expected Output: A data.frame with a single row showing the average air time in hours and the number of flights from JFK to LAX.
Exercise 3 – Wrangling Airport Data
- Task: Using the airports dataset, report the frequency of the time zones of destinations in descending order. Additionally, find an example of an airport with a missing time zone and report the name of the airport, explaining how you checked for it.
- Expected Output:
  - A data.frame listing the time zones by frequency in descending order.
  - The name of at least one airport with a missing time zone and the code used to identify it.
Exercise 4 – More Wrangling
- Task: Identify the top 3 months with the highest average departure delays in the flights dataset. For these months, calculate the average, minimum, and maximum departure delay.
- Expected Output: A data.frame showing the top 3 months along with their respective average, minimum, and maximum departure delay values.