BIO331 – Lab 05: Data Wrangling
Working with data frames and data wrangling in R
Reminder: If you have not installed pak, do so now. You can install it from CRAN with install.packages("pak"). Once installed, load it with library(pak).
library(pak)
pak::pkg_install("tidyverse")
library(tidyverse)
Data Wrangling with R
Examples from Tidy data example.
Today, we will be working with a data structure called a data frame. You can think of a data frame as essentially an Excel spreadsheet. The key difference is that while an Excel spreadsheet is “What You See Is What You Get”, data frames in R have to be manipulated programmatically. While this may be tricky at first, it makes data entry and manipulation FAR easier. It also means that, if you are working with sensitive data, you are far less likely to make small errors you cannot easily identify. Let’s start by learning how to work with a data frame.
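As a minimal sketch (the column names and values below are invented purely for illustration), a data frame can be built by hand and indexed by row and column:

```r
# A tiny hand-built data frame; names and values are invented for illustration
df <- data.frame(
  species = c("A", "B", "C"),
  count   = c(10, 20, 30)
)

df[1, ]    # first row
df[, 2]    # second column (count): 10 20 30
df[3, 1]   # single cell: "C"
```

The `[row, column]` bracket syntax here is exactly what we will use on the billboard data below.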
# Visualizing the structure of the data to understand it better
billboarddf = head(billboard) # billboard is a dataset that comes with the tidyr package
You can think of a data frame as essentially a vector of vectors, so you can index into one by specifying a row and then a column.
# Identifying columns and rows
billboarddf[,1]
# A tibble: 6 × 1
artist
<chr>
1 2 Pac
2 2Ge+her
3 3 Doors Down
4 3 Doors Down
5 504 Boyz
6 98^0
billboarddf[1,]
# A tibble: 1 × 79
artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
<chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 Pac Baby Don'… 2000-02-26 87 82 72 77 87 94 99 NA
# ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
# wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
# wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
# wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
# wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
# wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
# wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …
billboarddf[1:2,]
# A tibble: 2 × 79
artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
<chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 Pac Baby Don… 2000-02-26 87 82 72 77 87 94 99 NA
2 2Ge+her The Hard… 2000-09-02 91 87 92 NA NA NA NA NA
# ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
# wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
# wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
# wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
# wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
# wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
# wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …
billboarddf[3, 2]
# A tibble: 1 × 1
track
<chr>
1 Kryptonite
However, there is also a syntax that makes working with data frames significantly easier. Every column of a data frame has a name. If you use a dollar sign, you can simply refer to those named columns directly:
# Identifying columns and rows
billboarddf$artist
[1] "2 Pac"        "2Ge+her"      "3 Doors Down" "3 Doors Down" "504 Boyz"
[6] "98^0"
billboarddf$artist[1]
[1] "2 Pac"
billboarddf$wk1
[1] 87 91 81 76 57 51
billboarddf$wk1[billboarddf$wk1 >= 80]
[1] 87 91 81
Being able to use conditional expressions is incredibly useful for filtering your data and analyzing only specific rows. It also lets you do math.
# Identifying columns and rows
billboarddf$wk1 >= 80
[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE
billboarddf[billboarddf$wk1 >= 80,]
# A tibble: 3 × 79
artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
<chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
3 3 Doors Do… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
# ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
# wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
# wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
# wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
# wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
# wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
# wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …
billboarddf$wk20 != NaN
[1] NA NA NA NA NA NA
billboarddf[billboarddf$wk20 != NaN,]
# A tibble: 6 × 79
artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
<chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 <NA> <NA> NA NA NA NA NA NA NA NA NA
2 <NA> <NA> NA NA NA NA NA NA NA NA NA
3 <NA> <NA> NA NA NA NA NA NA NA NA NA
4 <NA> <NA> NA NA NA NA NA NA NA NA NA
5 <NA> <NA> NA NA NA NA NA NA NA NA NA
6 <NA> <NA> NA NA NA NA NA NA NA NA NA
# ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
# wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
# wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
# wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
# wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
# wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
# wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …
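The all-NA rows above illustrate a classic R gotcha: a comparison with NA or NaN always returns NA (never TRUE or FALSE), and indexing with NA produces empty NA rows. The idiomatic way to test for missing values is is.na(). A minimal sketch on an invented toy vector:

```r
x <- c(87, NA, 92, NA, 81)  # toy values for illustration

x != NaN        # NA NA NA NA NA -- comparisons with NA/NaN are never TRUE/FALSE
!is.na(x)       # TRUE FALSE TRUE FALSE TRUE
x[!is.na(x)]    # 87 92 81 -- keeps only the non-missing values
```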
mean(billboarddf$wk3)
[1] 60.5
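Column math is vectorized: arithmetic between two columns operates element by element, with no loop required. A toy sketch with invented rank values:

```r
# Invented week-1 and week-2 ranks for three songs
wk1 <- c(87, 91, 81)
wk2 <- c(82, 87, 92)

wk2 - wk1          # element-wise change in rank: -5 -4 11
mean(wk2 - wk1)    # average change across the three songs
```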
We’re now going to discuss ways to transform entire data sets using R.
Tidying data
In this step, the pivot_longer function is employed to transform the billboard dataset from a wide format to a long format.
- names_to = "week": Specifies that the names of the original set of columns (wk1 to wk76) are to be stored in a new column named week.
- values_to = "rank": Signifies that the values of the original set of columns will be gathered into a new column named rank.
- values_drop_na = TRUE: Ensures that any resulting rows containing NA in the rank column are omitted from the billboard2 dataset.
billboard2 <- billboard %>%
pivot_longer(
wk1:wk76,
names_to = "week",
values_to = "rank",
values_drop_na = TRUE
)
billboard2
# A tibble: 5,307 × 5
artist track date.entered week rank
<chr> <chr> <date> <chr> <dbl>
1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
7 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk7 99
8 2Ge+her The Hardest Part Of ... 2000-09-02 wk1 91
9 2Ge+her The Hardest Part Of ... 2000-09-02 wk2 87
10 2Ge+her The Hardest Part Of ... 2000-09-02 wk3 92
# ℹ 5,297 more rows
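pivot_longer() has an inverse, pivot_wider(), which spreads a long table back out to wide format. A minimal round-trip sketch on an invented toy table (requires tidyr; the table and column names are made up for illustration):

```r
library(tidyr)

# Toy wide table in the same shape as billboard: one column per week
scores <- data.frame(
  student = c("A", "B"),
  wk1 = c(90, 75),
  wk2 = c(85, 80)
)

# Wide -> long: one row per student-week combination
long <- pivot_longer(scores, wk1:wk2, names_to = "week", values_to = "score")

# Long -> wide: pivot_wider() reverses the reshape, recovering the layout
wide <- pivot_wider(long, names_from = "week", values_from = "score")
```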
Next, the mutate function is utilized to create and modify variables within the long-format dataset created in the previous step.
- week = as.integer(gsub("wk", "", week)): Converts the week column to integer by removing the “wk” prefix from the values in the week column and then coercing them to integer.
- date = as.Date(date.entered) + 7 * (week - 1): Calculates a new date column by adding the number of weeks (converted to days) to the date.entered column, allowing tracking of the specific date related to each week’s data.
- date.entered = NULL: Removes the original date.entered column after the new date column has been created.
billboard3 <- billboard2 %>%
mutate(
week = as.integer(gsub("wk", "", week)),
# Adding to dates in R adds days!
date = as.Date(date.entered) + 7 * (week - 1),
date.entered = NULL
)
billboard3
# A tibble: 5,307 × 5
artist track week rank date
<chr> <chr> <int> <dbl> <date>
1 2 Pac Baby Don't Cry (Keep... 1 87 2000-02-26
2 2 Pac Baby Don't Cry (Keep... 2 82 2000-03-04
3 2 Pac Baby Don't Cry (Keep... 3 72 2000-03-11
4 2 Pac Baby Don't Cry (Keep... 4 77 2000-03-18
5 2 Pac Baby Don't Cry (Keep... 5 87 2000-03-25
6 2 Pac Baby Don't Cry (Keep... 6 94 2000-04-01
7 2 Pac Baby Don't Cry (Keep... 7 99 2000-04-08
8 2Ge+her The Hardest Part Of ... 1 91 2000-09-02
9 2Ge+her The Hardest Part Of ... 2 87 2000-09-09
10 2Ge+her The Hardest Part Of ... 3 92 2000-09-16
# ℹ 5,297 more rows
Finally, the arrange function is applied to organize the dataset based on the artist, track, and week columns. This operation ensures a coherent and ordered display of the dataset, making it more manageable and intuitive for subsequent analysis.
long_billboard_sorted <- billboard3 %>% arrange(artist, track, week)
glimpse(long_billboard_sorted)
Rows: 5,307
Columns: 5
$ artist <chr> "2 Pac", "2 Pac", "2 Pac", "2 Pac", "2 Pac", "2 Pac", "2 Pac", …
$ track <chr> "Baby Don't Cry (Keep...", "Baby Don't Cry (Keep...", "Baby Don…
$ week <int> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11…
$ rank <dbl> 87, 82, 72, 77, 87, 94, 99, 91, 87, 92, 81, 70, 68, 67, 66, 57,…
$ date <date> 2000-02-26, 2000-03-04, 2000-03-11, 2000-03-18, 2000-03-25, 20…
This example code creates a new song data frame holding unique artist and track combinations from the billboard3 dataframe, and assigns a unique song_id to each row (representing each unique song).
song <- billboard3 %>%
distinct(artist, track) %>%
mutate(song_id = row_number())
glimpse(song)
Rows: 317
Columns: 3
$ artist <chr> "2 Pac", "2Ge+her", "3 Doors Down", "3 Doors Down", "504 Boyz"…
$ track <chr> "Baby Don't Cry (Keep...", "The Hardest Part Of ...", "Krypton…
$ song_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
The song dataframe is then joined with the billboard3 dataframe to create a new dataframe rank that includes the song_id column.
rank <- billboard3 %>%
left_join(song, c("artist", "track")) %>%
select(song_id, date, week, rank)
glimpse(rank)
Rows: 5,307
Columns: 4
$ song_id <int> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
$ date <date> 2000-02-26, 2000-03-04, 2000-03-11, 2000-03-18, 2000-03-25, 2…
$ week <int> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1…
$ rank <dbl> 87, 82, 72, 77, 87, 94, 99, 91, 87, 92, 81, 70, 68, 67, 66, 57…
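Splitting the data this way is a simple form of normalization: song-level facts live in one table, week-level facts in another, linked by song_id. Joining the two tables back on song_id recovers the combined information. A toy sketch with invented miniature tables (requires dplyr):

```r
library(dplyr)

# Miniature versions of the song and rank tables; values are invented
song_toy <- data.frame(song_id = 1:2,
                       artist  = c("2 Pac", "2Ge+her"),
                       track   = c("Baby Don't Cry", "The Hardest Part"))
rank_toy <- data.frame(song_id = c(1, 1, 2),
                       week    = c(1, 2, 1),
                       rank    = c(87, 82, 91))

# left_join() brings the artist/track columns back onto each ranking row
combined <- left_join(rank_toy, song_toy, by = "song_id")
```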
Joins/Merges
In this section, we will join the flights dataset with the weather dataset from the nycflights13 R package to analyze how weather conditions might have affected the flights.
library(nycflights13) # R has built-in datasets that can be loaded directly from a library
Inner Join
In an inner join, only the rows with matching keys in both data frames are returned. Rows with non-matching keys are excluded from the result. It’s useful when you want to join datasets based on common key columns, and you are only interested in rows with matching keys in both datasets.
# R example
flights_weather_inner_joined <- inner_join(flights, weather, by = c("year", "month", "day", "hour", "origin"))
Left Join
A left join returns all rows from the left dataset and the matched rows from the right dataset. If there is no match found in the right dataset, then the result will contain NA. Use a left join when you want to retain all records from the “left” dataset, and add matching records from the “right” dataset where available.
# R example
flights_weather_left_joined <- left_join(flights, weather, by = c("year", "month", "day", "hour", "origin"))
Right Join
In a right join, all rows from the right dataset and the matched rows from the left dataset are returned. If there is no match found in the left dataset, then the result will contain NA. It is the opposite of a left join and is used when you want to retain all records from the “right” dataset.
# R example
flights_weather_right_joined <- right_join(flights, weather, by = c("year", "month", "day", "hour", "origin"))
Full Join
A full join returns all rows when there is a match in either the left or right dataset. If there is no match found in either dataset, then the result will contain NA. It is useful when you want to retain all records from both datasets.
# R example
flights_weather_full_joined <- full_join(flights, weather, by=c("year", "month", "day", "hour", "origin"))
head(flights_weather_full_joined)
# A tibble: 6 × 29
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
# ℹ 21 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour.x <dttm>, temp <dbl>, dewp <dbl>,
# humid <dbl>, wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
# precip <dbl>, pressure <dbl>, visib <dbl>, time_hour.y <dttm>
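The difference between the four join types is easiest to see on tiny tables with partially overlapping keys. A sketch with invented data (requires dplyr):

```r
library(dplyr)

# Invented toy tables: keys 2 and 3 match, keys 1 and 4 do not
left_tbl  <- data.frame(key = c(1, 2, 3), x = c("a", "b", "c"))
right_tbl <- data.frame(key = c(2, 3, 4), y = c("B", "C", "D"))

nrow(inner_join(left_tbl, right_tbl, by = "key"))  # 2: only keys in both
nrow(left_join(left_tbl, right_tbl, by = "key"))   # 3: all left keys kept
nrow(right_join(left_tbl, right_tbl, by = "key"))  # 3: all right keys kept
nrow(full_join(left_tbl, right_tbl, by = "key"))   # 4: union of all keys
```

The unmatched rows in the left, right, and full joins get NA in the columns coming from the other table.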
Using Joins to Analyze the Data
Use the inner joined dataset to calculate the average departure delay for flights with precipitation greater than 0.5.
# R example
average_delay_per_condition <- flights_weather_inner_joined %>%
group_by(precip > 0.5) %>%
summarise(Average_Departure_Delay = mean(dep_delay, na.rm = TRUE))
average_delay_per_condition
# A tibble: 2 × 2
`precip > 0.5` Average_Departure_Delay
<lgl> <dbl>
1 FALSE 12.6
2 TRUE 48.8
Exercises
We will use the weather and flights datasets from the nycflights13 package (also provided as .csv files for Python users) for the exercises.
Please see the nycflights13 documentation for more information about the datasets.
library(nycflights13)
flights
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Exercise 1 – Filtering and Summarizing
- Task: Filter the flights dataset to include only flights with a delay of more than 12 hours. Group and count this output by origin and sort the result in descending order.
- Expected Output: A data.frame showing the number of flights delayed over 12 hours by airport, ordered from most to least.
Exercise 2 – Filtering and Summarizing
- Task: Calculate the average air time and the number of flights departing from JFK and arriving at LAX in the flights data set. Make sure to report this result in hours.
- Expected Output: A data.frame with a single row showing the average air time in hours and the number of flights from JFK to LAX.
Exercise 3 – Wrangling Airport Data
airports# A tibble: 1,458 × 8
faa name lat lon alt tz dst tzone
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/…
2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/…
3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/…
4 06N Randall Airport 41.4 -74.4 523 -5 A America/…
5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/…
6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America/…
7 0G6 Williams County Airport 41.5 -84.5 730 -5 A America/…
8 0G7 Finger Lakes Regional Airport 42.9 -76.8 492 -5 A America/…
9 0P2 Shoestring Aviation Airfield 39.8 -76.6 1000 -5 U America/…
10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/…
# ℹ 1,448 more rows
- Task: Using the airports dataset, report the frequency of the time zones of destinations in descending order. Additionally, find an example of an airport with a missing time zone and report the name of the airport, explaining how you checked for it.
To do this problem, you’re going to need to use a function that is new to you: count()
library(dplyr)
airports_dst_count <- airports %>%
count(dst)
- Expected Output:
  - A data.frame listing the time zones by frequency in descending order.
  - The name of at least one airport with a missing time zone and the code used to identify it.