In this document, we will explore two significant aspects of data analysis: Data Input/Output (IO) and String Wrangling with Regular Expressions (Regex). Data IO is fundamental as it’s the first step in the data analysis pipeline, where we read or load data from various sources. String wrangling, particularly using Regex, is vital when we are dealing with text data and need to extract, replace, or modify strings based on specific patterns.
Data IO in R
Data can be read from various sources such as CSV files, Excel files, or databases. Below are some ways to read data into R.
Using read.csv from Base R
# Reading a CSV file using read.csv from base R
data_base_r <- read.csv("places_in_princeton.csv")
# Inspecting the first few rows of the data
head(data_base_r)
id name address
1 1 Nassau Hall 1 Nassau Hall, Princeton, NJ 08544
2 2 Princeton University Art Museum Elm Dr, Princeton, NJ 08544
3 3 Albert Einstein House 112 Mercer St, Princeton, NJ 08540
4 4 Princeton Public Library 65 Witherspoon St, Princeton, NJ 08542
5 5 McCarter Theatre Center 91 University Pl, Princeton, NJ 08540
6 6 Marquand Park 68 Lovers Ln, Princeton, NJ 08540
comment
1 Historical building in Princeton.
2 A place with a vast and varied collection of art.
3 Albert Einstein's residence.
4 The hub of community learning.
5 Famous for its performances and shows.
6 A peaceful place to walk and enjoy nature.
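Since places_in_princeton.csv may not be on hand, the same read step can be tried on a small made-up table. The sketch below uses only base R; the data frame contents and the temporary file path are illustrative assumptions, not the course file. It also covers the output half of Data IO by writing the file before reading it back.

```r
# A small, made-up data frame (hypothetical rows, not the course file)
places <- data.frame(
  id = 1:2,
  name = c("Nassau Hall", "Marquand Park"),
  comment = c("Historical building.", "A peaceful place, with trees."),
  stringsAsFactors = FALSE
)

# Write it to a temporary CSV; row.names = FALSE avoids an extra index column
csv_path <- tempfile(fileext = ".csv")
write.csv(places, csv_path, row.names = FALSE)

# Read it back; stringsAsFactors = FALSE keeps text columns as character
# (this is already the default since R 4.0.0)
places_back <- read.csv(csv_path, stringsAsFactors = FALSE)
str(places_back)
```

The second comment contains a comma on purpose: write.csv quotes character values, so read.csv still sees three columns rather than four.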
Using read_csv from readr
pak::pkg_install("readr")
library(readr)
# Reading a CSV file using read_csv from readr package
data_readr <- read_csv("places_in_princeton.csv")
Rows: 20 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): name, address, comment
dbl (1): id
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Inspecting the first few rows of the data
head(data_readr)
# A tibble: 6 × 4
id name address comment
<dbl> <chr> <chr> <chr>
1 1 Nassau Hall 1 Nassau Hall, Princeton, NJ 08… Histor…
2 2 Princeton University Art Museum Elm Dr, Princeton, NJ 08544 A plac…
3 3 Albert Einstein House 112 Mercer St, Princeton, NJ 08… Albert…
4 4 Princeton Public Library 65 Witherspoon St, Princeton, N… The hu…
5 5 McCarter Theatre Center 91 University Pl, Princeton, NJ… Famous…
6 6 Marquand Park 68 Lovers Ln, Princeton, NJ 085… A peac…
Note: Everything below here, while I’m keeping for the class, will be optional. Learning Regex is, imo, useful, but it is also a skill which is quite unintuitive, so I’m not going to require you to use regex this semester.
Regular Expressions (Regex) are a powerful tool to match and manipulate text and can be used across languages. Here is a brief guide on some of the basic regex symbols and patterns with examples:
\\d: Matches any digit.
Example: "\\d" will match any single digit like 5 in "Price: $5".
\\w: Matches any word character (alphanumeric + underscore).
Example: "\\w" will match any single word character like a in "apple".
\\b: Word boundary.
Example: "\\bword\\b" will match the word "word" but not "swordfish".
+: Matches one or more of the preceding character or group.
Example: "\\d+" will match one or more digits like 123 in "123 apples".
*: Matches zero or more of the preceding character or group.
Example: "a*" will match any number of consecutive a’s, including none, as in "baaa!".
(?:): Non-capturing group.
Example: "\\d(?:\\.\\d)?" will match 1 or 1.5 in full; in ".5" it matches only the 5, because the pattern must start with a digit.
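^: Start of the string (or of each line when multiline matching is enabled).
Example: "^Start" will match any string that begins with "Start".
$: End of the string (or of each line when multiline matching is enabled).
Example: "end$" will match any string that ends with "end".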
[]: Defines a character class.
Example: "[aeiou]" will match any vowel in "apple".
|: Acts as an OR operator.
Example: "apple|orange" will match either "apple" or "orange" in a given text.
\\s: Matches any whitespace character.
Example: "\\s" will match the space in "apple orange".
{n}: Matches exactly n occurrences of the preceding character or group.
Example: "\\d{3}" will match 123 in "12345".
{n,}: Matches n or more occurrences of the preceding character or group.
Example: "\\d{2,}" will match 12345 in "12345" (quantifiers are greedy, so it takes every available digit), but not a lone 1.
{n,m}: Matches between n and m occurrences of the preceding character or group.
Example: "\\d{2,4}" will match 1234 in "12345" and 12 in "12 apples".
By understanding and combining these patterns, you can create complex expressions to match a wide range of strings within your text data.
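To check a few of the patterns above without installing anything, here is a minimal sketch in base R. The helper name all_matches is hypothetical (it is not part of any package), and perl = TRUE is used so that \\d, \\b, and (?:) behave as described in the list.

```r
# Hypothetical helper: return every match of `pattern` found in one string
all_matches <- function(pattern, text) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

all_matches("\\d{3}", "12345")                      # "123"
all_matches("\\d{2,}", "12345")                     # "12345" (greedy)
all_matches("\\d{2,4}", "12345")                    # "1234"  (greedy, capped at 4)
all_matches("\\bword\\b", "a word, not swordfish")  # "word"
all_matches("\\d(?:\\.\\d)?", "1 and 1.5")          # "1" "1.5"
```

Note that the leftover 5 in the {2,4} case is not returned: a single digit is too short to satisfy the pattern on its own.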
String Wrangling in R
Installing and loading the stringr package
pak::pkg_install("stringr")
library(stringr)
stringr examples
Extracting digits from strings can be crucial to isolate specific numerical information such as prices or zip codes.
strings <- c("123 Main St", "Price: $200 0")
digits <- str_extract_all(strings, "\\b\\d+\\b")
digits
# separate_wider_regex() comes from the tidyr package.
# Assumes `df` is a data frame with a character column `str`
# holding values such as "<Sheryl>-F_34".
library(tidyr)
df |>
  separate_wider_regex(
    str,
    patterns = c(
      "<",                 # Match the literal character '<'.
      name = "[A-Za-z]+",  # Match one or more letters (upper or lower case) and store them in a new column 'name'.
      ">-",                # Match the literal string '>-'.
      gender = ".",        # Match any single character (except newline) and store it in a new column 'gender'.
      "_",                 # Match the literal character '_'.
      age = "[0-9]+"       # Match one or more digits and store them in a new column 'age'.
    )
  )
# A tibble: 7 × 3
name gender age
<chr> <chr> <chr>
1 Sheryl F 34
2 Kisha F 45
3 Brandon N 33
4 Sharon F 38
5 Penny F 58
6 Justin M 41
7 Patricia F 84
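For comparison, the same kind of split can be sketched in base R with capture groups. This is a different technique from separate_wider_regex, not what the function does internally. The input strings are an assumption, reconstructed from the pattern and the first three rows of the output above.

```r
# Assumed input, reconstructed from the pattern and the output shown above
raw <- c("<Sheryl>-F_34", "<Kisha>-F_45", "<Brandon>-N_33")

# Parentheses create capture groups; regexec() reports where each group matched
pattern <- "^<([A-Za-z]+)>-(.)_([0-9]+)$"
parts <- regmatches(raw, regexec(pattern, raw))

# Each element is c(full match, name, gender, age); drop the full match
fields <- do.call(rbind, lapply(parts, function(p) p[-1]))
parsed <- data.frame(
  name = fields[, 1],
  gender = fields[, 2],
  age = as.integer(fields[, 3]),
  stringsAsFactors = FALSE
)
parsed
```

Unlike the tibble above, where age is still a character column, this sketch converts age to integer as a final step.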
Exercises
Exercise 1: Reading TSV Files
A TSV file, places_in_princeton.tsv, uses a tab character as a delimiter between values. Your task is to read this file into R using an appropriate reading function.