
Chapter 15

Dimple K Patel

2025-01-07

Chapter 15 Intro
#install.packages("tidyverse")
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(babynames)

# The second argument is the regex pattern to match. str_view() wraps each match in <>.
str_view(fruit, "berry")

## [6] │ bil<berry>
## [7] │ black<berry>
## [10] │ blue<berry>
## [11] │ boysen<berry>
## [19] │ cloud<berry>
## [21] │ cran<berry>
## [29] │ elder<berry>
## [32] │ goji <berry>
## [33] │ goose<berry>
## [38] │ huckle<berry>
## [50] │ mul<berry>
## [70] │ rasp<berry>
## [73] │ salal <berry>
## [76] │ straw<berry>

# The period after the a is a metacharacter: it matches any single character.


str_view(c("a", "ab", "ae", "bd", "ea", "eab"), "a.")
## [2] │ <ab>
## [3] │ <ae>
## [6] │ e<ab>

# a...e matches an "a", then any three characters, then an "e".


str_view(fruit, "a...e")

## [1] │ <apple>
## [7] │ bl<ackbe>rry
## [48] │ mand<arine>
## [51] │ nect<arine>
## [62] │ pine<apple>
## [64] │ pomegr<anate>
## [70] │ r<aspbe>rry
## [73] │ sal<al be>rry

Quantifiers control the number of times a pattern can match:

 ? makes a pattern optional, i.e. it matches 0 or 1 times.
 + lets a pattern repeat, i.e. it matches at least once.
 * lets a pattern be optional or repeat, i.e. it matches any number of times, including 0.
# ab? matches an "a", optionally followed by a "b".
str_view(c("a", "ab", "abb"), "ab?")

## [1] │ <a>
## [2] │ <ab>
## [3] │ <ab>b

# ab+ matches an "a", followed by at least one "b".


str_view(c("a", "ab", "abb"), "ab+")

## [2] │ <ab>
## [3] │ <abb>

# ab* matches an "a", followed by any number of "b"s.


str_view(c("a", "ab", "abb"), "ab*")

## [1] │ <a>
## [2] │ <ab>
## [3] │ <abb>

Character classes are denoted by []. [Nemo] matches the letters N, e, m, or o; [^Nemo] matches anything other than N, e, m, or o.
l <- str_view(words, "[aeiou]x[aeiou]")
m <- str_view(words, "[^aeiou]y[^aeiou]")

Alternation, written with |, picks between two or more alternative patterns.


k <- str_view(fruit, "apple|melon|nut")
j <- str_view(fruit, "aa|ee|ii|oo|uu")
j

## [9] │ bl<oo>d orange
## [33] │ g<oo>seberry
## [47] │ lych<ee>
## [66] │ purple mangost<ee>n

Chapter 15.3 Key Functions

Detect matches with str_detect(): it returns TRUE when the pattern matches an element and FALSE otherwise.
str_detect(c("a", "b", "c"), "[aeiou]")

## [1] TRUE FALSE FALSE

o <- babynames |>
  filter(str_detect(name, "x")) |>
  count(name, wt = n, sort = TRUE)

head(o)

## # A tibble: 6 × 2
## name n
## <chr> <int>
## 1 Alexander 665492
## 2 Alexis 399551
## 3 Alex 278705
## 4 Alexandra 232223
## 5 Max 148787
## 6 Alexa 123032

babynames |>
  group_by(year) |>
  summarize(prop_x = mean(str_detect(name, "x"))) |>
  ggplot(aes(x = year, y = prop_x)) +
  geom_line()

str_count() counts the number of matches, while str_view() highlights them. Regular expressions are case-sensitive; str_to_lower() converts the names to lower case so matching is consistent.
babynames |>
count(name) |>
mutate(
name = str_to_lower(name),
vowels = str_count(name, "[aeiou]"),
consonants = str_count(name, "[^aeiou]")
)

## # A tibble: 97,310 × 4
## name n vowels consonants
## <chr> <int> <int> <int>
## 1 aaban 10 3 2
## 2 aabha 5 3 2
## 3 aabid 2 3 2
## 4 aabir 1 3 2
## 5 aabriella 5 5 4
## 6 aada 1 3 1
## 7 aadam 26 3 2
## 8 aadan 11 3 2
## 9 aadarsh 17 3 4
## 10 aaden 18 3 2
## # ℹ 97,300 more rows
There are three ways to fix a case-sensitivity problem: add the upper-case vowels to the character class, wrap the pattern in regex(ignore_case = TRUE), or convert the text with str_to_lower() first.
str_replace_all() replaces every match with new text; str_remove_all() deletes every match.
x <- c("apple", "pear", "banana")
str_replace_all(x, "[aeiou]", "-")

## [1] "-ppl-" "p--r" "b-n-n-"

x <- c("apple", "pear", "banana")


str_remove_all(x, "[aeiou]")

## [1] "ppl" "pr" "bnn"

Extract variables with separate_wider_regex(), a close relative of separate_wider_position() and separate_wider_delim(). Supply a named character vector of patterns: the named pieces become columns, and the unnamed pieces are matched but dropped. If a row fails to match, add too_few = "debug" to diagnose the problem (see the sketch after the example below).
df <- tribble(
~str,
"<Sheryl>-F_34",
"<Kisha>-F_45",
"<Brandon>-N_33",
"<Sharon>-F_38",
"<Penny>-F_58",
"<Justin>-M_41",
"<Patricia>-F_84",
)

df2 <- df |>
  separate_wider_regex(
    str,
    patterns = c(
      "<",
      name = "[A-Za-z]+",
      ">-",
      gender = ".",
      "_",
      age = "[0-9]+"
    )
  )
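If a row had failed to match, too_few = "debug" (mentioned above) adds diagnostic columns showing where the pattern stopped. A sketch of that call, reusing the same df and patterns:
df |>
  separate_wider_regex(
    str,
    patterns = c("<", name = "[A-Za-z]+", ">-", gender = ".", "_", age = "[0-9]+"),
    too_few = "debug"
  )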
15.3.5 Exercises
1. What baby name has the most vowels? What name has the highest proportion of vowels? (Hint: what is the denominator?)

"Mariaguadalupe" has the most vowels. "Louie" has the highest proportion of vowels.
stuff <- babynames |>
  mutate(
    vow = str_count(name, "[aeiou]"),
    cons = str_count(name, "[^aeiou]"),
    denom = vow + cons,
    prop = vow / denom
  ) |>
  arrange(desc(prop))
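A quick way to pull the actual answers out of stuff (a sketch, not evaluated here; vow and prop are the columns created above):
stuff |> slice_max(vow, n = 1)    # name(s) with the most vowels
stuff |> slice_max(prop, n = 1)   # name(s) with the highest proportion of vowels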

2. Replace all forward slashes in "a/b/c/d/e" with backslashes. What happens if you attempt to undo the transformation by replacing all backslashes with forward slashes? (We'll discuss the problem very soon.)

A naive reverse with the pattern "\\" throws an error, because inside a regex a single backslash is an incomplete escape; you need "\\\\" to match a literal backslash (see the sketch below).
y <- "a/b/c/d/e"
str_replace_all(y, pattern = "/", replacement = "\\\\") |> str_view()

## [1] │ a\b\c\d\e
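A sketch of the undo step (not evaluated above), reusing y from the code just shown:
# Save the backslashed version, then undo it: a literal backslash is "\\\\" in the pattern
y2 <- str_replace_all(y, pattern = "/", replacement = "\\\\")
str_replace_all(y2, pattern = "\\\\", replacement = "/") |> str_view()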

3. Implement a simple version of str_to_lower() using str_replace_all().
test_string3 <- "Other branches opened in Floral City in 1958, and Hernando in 
1959, as well as the freestanding Crystal River and Homosassa Libraries."
str_replace_all(test_string3,
                pattern = "[A-Za-z]",
                replacement = tolower)

## [1] "other branches opened in floral city in 1958, and hernando in \n1959, as well as the freestanding crystal river and homosassa libraries."
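A slightly tighter variant (a sketch, reusing test_string3 from above): only the upper-case letters need replacing, since lower-case letters already map to themselves.
# Only upper-case letters actually need converting
str_replace_all(test_string3, pattern = "[A-Z]", replacement = tolower)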

4. Create a regular expression that will match telephone numbers as commonly written in your country.
telephone_numbers = c(
  "555-123-4567",
  "(555) 555-7890",
  "888-555-4321",
  "(123) 456-7890",
  "555-987-6543",
  "(555) 123-7890"
)
telephone_numbers |>
  str_replace(" ", "-") |>
  str_replace("\\(", "") |>
  str_replace("\\)", "") |>
  as_tibble()

## # A tibble: 6 × 1
## value
## <chr>
## 1 555-123-4567
## 2 555-555-7890
## 3 888-555-4321
## 4 123-456-7890
## 5 555-987-6543
## 6 555-123-7890
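The pipeline above normalizes the numbers rather than matching them. A hedged sketch of a single pattern that matches either US-style format directly (reusing the same telephone_numbers vector; the exact format is an assumption):
# Optional "(", 3 digits, optional ")", optional space or dash, 3 digits, space or dash, 4 digits
str_view(telephone_numbers, "\\(?\\d{3}\\)?[ -]?\\d{3}[ -]\\d{4}")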

15.4.1 Escaping
A period is a metacharacter (it matches any character), so to match a literal . you escape it with a backslash: \. . Because the backslash is also the string escape character, the regex \. has to be written as the string "\\.". To match a literal backslash you need the regex \\, which as a string becomes "\\\\" (four backslashes to represent one). Alternatively, use a raw string such as r"{\\}" to denote a literal backslash.
# To create the regular expression \., we need to use \\.
dot <- "\\."

# But the expression itself only contains one \
str_view(dot)

## [1] │ \.


# And this tells R to look for an explicit .
str_view(c("abc", "a.c", "bef"), "a\\.c")

## [2] │ <a.c>

x <- "a\\b"
str_view(x)

## [1] │ a\b

str_view(x, "\\\\")

## [1] │ a<\>b

str_view(x, r"{\\}")

## [1] │ a<\>b

As an alternative to escaping, you can match a literal ., $, |, *, +, ?, {, }, (, or ) with a character class: [.], [$], [|], … all match the literal character.
str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")

## [2] │ <a.c>

str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c")

## [3] │ <a*c>

Anchors: ^ matches the start of a string and $ matches the end (mnemonic: if you begin with power, you end up with money). Use both ^ and $ together to force the pattern to match the complete string.
str_view(fruit, "^a")

## [1] │ <a>pple
## [2] │ <a>pricot
## [3] │ <a>vocado

str_view(fruit, "a$")

## [4] │ banan<a>
## [15] │ cherimoy<a>
## [30] │ feijo<a>
## [36] │ guav<a>
## [56] │ papay<a>
## [74] │ satsum<a>

str_view(fruit, "apple")

## [1] │ <apple>
## [62] │ pine<apple>

str_view(fruit, "^apple$")

## [1] │ <apple>

\\b matches a word boundary, i.e. the start or end of a word.
y <- c("summary(x)", "summarize(df)", "rowsum(x)", "sum(x)")
z <- c("summary(x)", "summarize(df)", "rowsum(x)", "sum(x)")
z1 <- str_view(z, "sum")
y1 <- str_view(y, "\\bsum\\b")
y1

## [4] │ <sum>(x)
By themselves, anchors yield a zero-length match.
str_view("abc", c("$", "^", "\\b"))

## [1] │ abc<>
## [2] │ <>abc
## [3] │ <>abc<>

str_replace_all("abc", c("$", "^", "\\b"), "--")

## [1] "abc--" "--abc" "--abc--"

Character classes, a.k.a. character sets. Characters with special meaning inside [] include:

 - defines a range, e.g., [a-z] matches any lower-case letter and [0-9] matches any digit.
 \ escapes special characters, so [\^\-\]] matches a literal ^, -, or ].

x <- "abcd ABCD 12345 -!@#%."

x1 <- str_view(x, "[abc]+")
x1

## [1] │ <abc>d ABCD 12345 -!@#%.

x2 <- str_view(x, "[a-z]+")
x2

## [1] │ <abcd> ABCD 12345 -!@#%.

x3 <- str_view(x, "[^a-z0-9]+")
x3

## [1] │ abcd< ABCD >12345< -!@#%.>

# You need an escape to match characters that are otherwise special inside []
x4 <- str_view("a-b-c", "[a-c]")
x4

## [1] │ <a>-<b>-<c>

x5 <- str_view("a-b-c", "[a\\-c]")
x5

## [1] │ <a><->b<-><c>

Three useful shorthand classes (mnemonic: DSW — digits, spaces, word characters):

 \d matches any digit; \D matches anything that isn't a digit.
 \s matches any whitespace (e.g., space, tab, newline); \S matches anything that isn't whitespace.
 \w matches any "word" character, i.e. letters and numbers; \W matches any "non-word" character.

x <- "abcd ABCD 12345 -!@#%."

str_view(x, "\\d+")

## [1] │ abcd ABCD <12345> -!@#%.

str_view(x, "\\D+")

## [1] │ <abcd ABCD >12345< -!@#%.>

str_view(x, "\\s+")

## [1] │ abcd< >ABCD< >12345< >-!@#%.

str_view(x, "\\S+")

## [1] │ <abcd> <ABCD> <12345> <-!@#%.>

str_view(x, "\\w+")

## [1] │ <abcd> <ABCD> <12345> -!@#%.

str_view(x, "\\W+")

## [1] │ abcd< >ABCD< >12345< -!@#%.>

Quantifiers: {n} matches exactly n times, {n,} matches at least n times, and {n,m} matches between n and m times.
Operator precedence: just as PEMDAS tells you how to read arithmetic, regex operators have precedence rules (quantifiers bind more tightly than alternation); use parentheses to make the grouping you want explicit. A short sketch of both ideas follows below.
Capturing groups: parentheses also create capturing groups that record the sub-match.
Back references: \1 refers to the first capturing group, \2 to the second, and so on.
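A short sketch of the counted quantifiers and of using parentheses to control grouping (the example strings are made up; not evaluated here):
reps <- c("a", "aa", "aaa", "aaaa")
str_view(reps, "a{2}")     # exactly two a's
str_view(reps, "a{2,}")    # at least two a's
str_view(reps, "a{1,3}")   # between one and three a's

# A quantifier applies only to the element just before it:
# "ab+" means a(b+), while "(ab)+" repeats the whole "ab"
str_view(c("ababab", "abbb"), "(ab)+")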
#Match repeated letter pairs.
str_view(fruit, "(..)\\1")

## [4] │ b<anan>a
## [20] │ <coco>nut
## [22] │ <cucu>mber
## [41] │ <juju>be
## [56] │ <papa>ya
## [73] │ s<alal> berry

# Match words that start and end with the same pair of letters.
str_view(words, "^(..).*\\1$")
## [152] │ <church>
## [217] │ <decide>
## [617] │ <photograph>
## [699] │ <require>
## [739] │ <sense>

# Switches the 2nd and 3rd words of each sentence.
sentences |>
  str_replace("(\\w+) (\\w+) (\\w+)", "\\1 \\3 \\2") |>
  str_view()

## [1] │ The canoe birch slid on the smooth planks.
## [2] │ Glue sheet the to the dark blue background.
## [3] │ It's to easy tell the depth of a well.
## [4] │ These a days chicken leg is a rare dish.
## [5] │ Rice often is served in round bowls.
## [6] │ The of juice lemons makes fine punch.
## [7] │ The was box thrown beside the parked truck.
## [8] │ The were hogs fed chopped corn and garbage.
## [9] │ Four of hours steady work faced us.
## [10] │ A size large in stockings is hard to sell.
## [11] │ The was boy there when the sun rose.
## [12] │ A is rod used to catch pink salmon.
## [13] │ The of source the huge river is the clear spring.
## [14] │ Kick ball the straight and follow through.
## [15] │ Help woman the get back to her feet.
## [16] │ A of pot tea helps to pass the evening.
## [17] │ Smoky lack fires flame and heat.
## [18] │ The cushion soft broke the man's fall.
## [19] │ The breeze salt came across from the sea.
## [20] │ The at girl the booth sold fifty bonds.
## ... and 700 more

# str_match() returns a matrix; convert it to a tibble and name the columns
# to approximate what separate_wider_regex() does.
sentences |>
  str_match("the (\\w+) (\\w+)") |>
  as_tibble(.name_repair = "minimal") |>
  set_names("match", "word1", "word2")

## # A tibble: 720 × 3
## match word1 word2
## <chr> <chr> <chr>
## 1 the smooth planks smooth planks
## 2 the sheet to sheet to
## 3 the depth of depth of
## 4 <NA> <NA> <NA>
## 5 <NA> <NA> <NA>
## 6 <NA> <NA> <NA>
## 7 the parked truck parked truck
## 8 <NA> <NA> <NA>
## 9 <NA> <NA> <NA>
## 10 <NA> <NA> <NA>
## # ℹ 710 more rows

Use (?:...) to create a non-capturing group.

x <- c("a gray cat", "a grey dog")
str_match(x, "gr(e|a)y")

## [,1] [,2]
## [1,] "gray" "a"
## [2,] "grey" "e"

str_match(x, "gr(?:e|a)y")

## [,1]
## [1,] "gray"
## [2,] "grey"

15.4.7 Exercises
1. How would you match the literal string "'\? How about "$^$"?
input_string <- "\"'\\"
str_view(input_string)

## [1] │ "'\

# Pattern to match the literal string
match_pattern <- "\"\'\\\\"
str_view(match_pattern)

## [1] │ "'\\

input_string <- "\"$^$\""
str_view(input_string)

## [1] │ "$^$"

# Pattern to match the literal string
match_pattern <- "\"\\$\\^\\$\""
str_view(match_pattern)

## [1] │ "\$\^\$"

2. Explain why each of these patterns don't match a \: "\", "\\", "\\\".

"\" is not even a complete string, because the backslash escapes the closing quote. "\\" is a valid string but creates the regex \, which is an incomplete regex escape. "\\\" again leaves the closing quote escaped, so the string is incomplete. To match a literal backslash you need "\\\\".

3. Given the corpus of common words in stringr::words, create regular expressions that find all words that:
a. Start with "y".
words |>
str_view(pattern = "^y")

## [975] │ <y>ear
## [976] │ <y>es
## [977] │ <y>esterday
## [978] │ <y>et
## [979] │ <y>ou
## [980] │ <y>oung

b. Don't start with "y".
words |>
  str_view(pattern = "^(?!y)")

## [1] │ <>a
## [2] │ <>able
## [3] │ <>about
## [4] │ <>absolute
## [5] │ <>accept
## [6] │ <>account
## [7] │ <>achieve
## [8] │ <>across
## [9] │ <>act
## [10] │ <>active
## [11] │ <>actual
## [12] │ <>add
## [13] │ <>address
## [14] │ <>admit
## [15] │ <>advertise
## [16] │ <>affect
## [17] │ <>afford
## [18] │ <>after
## [19] │ <>afternoon
## [20] │ <>again
## ... and 954 more

c. End with "x".
words |> str_view(pattern = "x$")

## [108] │ bo<x>
## [747] │ se<x>
## [772] │ si<x>
## [841] │ ta<x>

d. Are exactly three letters long. (Don't cheat by using str_length()!)
words |>
  str_subset(pattern = "\\b\\w{3}\\b")
##   [1] "act" "add" "age" "ago" "air" "all" "and" "any" "arm" "art" "ask" "bad"
##  [13] "bag" "bar" "bed" "bet" "big" "bit" "box" "boy" "bus" "but" "buy" "can"
##  [25] "car" "cat" "cup" "cut" "dad" "day" "die" "dog" "dry" "due" "eat" "egg"
##  [37] "end" "eye" "far" "few" "fit" "fly" "for" "fun" "gas" "get" "god" "guy"
##  [49] "hit" "hot" "how" "job" "key" "kid" "lad" "law" "lay" "leg" "let" "lie"
##  [61] "lot" "low" "man" "may" "mrs" "new" "non" "not" "now" "odd" "off" "old"
##  [73] "one" "out" "own" "pay" "per" "put" "red" "rid" "run" "say" "see" "set"
##  [85] "sex" "she" "sir" "sit" "six" "son" "sun" "tax" "tea" "ten" "the" "tie"
##  [97] "too" "top" "try" "two" "use" "war" "way" "wee" "who" "why" "win" "yes"
## [109] "yet" "you"

e. Have seven letters or more.
words |>
  str_subset(pattern = "\\b\\w{7,}\\b")

##   [1] "absolute"    "account"     "achieve"     "address"     "advertise"
##   [6] "afternoon"   "against"     "already"     "alright"     "although"
##  [11] "america"     "another"     "apparent"    "appoint"     "approach"
##  [16] "appropriate" "arrange"     "associate"   "authority"   "available"
##  [21] "balance"     "because"     "believe"     "benefit"     "between"
##  [26] "brilliant"   "britain"     "brother"     "business"    "certain"
##  [31] "chairman"    "character"   "Christmas"   "colleague"   "collect"
##  [36] "college"     "comment"     "committee"   "community"   "company"
##  [41] "compare"     "complete"    "compute"     "concern"     "condition"
##  [46] "consider"    "consult"     "contact"     "continue"    "contract"
##  [51] "control"     "converse"    "correct"     "council"     "country"
##  [56] "current"     "decision"    "definite"    "department"  "describe"
##  [61] "develop"     "difference"  "difficult"   "discuss"     "district"
##  [66] "document"    "economy"     "educate"     "electric"    "encourage"
##  [71] "english"     "environment" "especial"    "evening"     "evidence"
##  [76] "example"     "exercise"    "expense"     "experience"  "explain"
##  [81] "express"     "finance"     "fortune"     "forward"     "function"
##  [86] "further"     "general"     "germany"     "goodbye"     "history"
##  [91] "holiday"     "hospital"    "however"     "hundred"     "husband"
##  [96] "identify"    "imagine"     "important"   "improve"     "include"
## [101] "increase"    "individual"  "industry"    "instead"     "interest"
## [106] "introduce"   "involve"     "kitchen"     "language"    "machine"
## [111] "meaning"     "measure"     "mention"     "million"     "minister"
## [116] "morning"     "necessary"   "obvious"     "occasion"    "operate"
## [121] "opportunity" "organize"    "original"    "otherwise"   "paragraph"
## [126] "particular"  "pension"     "percent"     "perfect"     "perhaps"
## [131] "photograph"  "picture"     "politic"     "position"    "positive"
## [136] "possible"    "practise"    "prepare"     "present"     "pressure"
## [141] "presume"     "previous"    "private"     "probable"    "problem"
## [146] "proceed"     "process"     "produce"     "product"     "programme"
## [151] "project"     "propose"     "protect"     "provide"     "purpose"
## [156] "quality"     "quarter"     "question"    "realise"     "receive"
## [161] "recognize"   "recommend"   "relation"    "remember"    "represent"
## [166] "require"     "research"    "resource"    "respect"     "responsible"
## [171] "saturday"    "science"     "scotland"    "secretary"   "section"
## [176] "separate"    "serious"     "service"     "similar"     "situate"
## [181] "society"     "special"     "specific"    "standard"    "station"
## [186] "straight"    "strategy"    "structure"   "student"     "subject"
## [191] "succeed"     "suggest"     "support"     "suppose"     "surprise"
## [196] "telephone"   "television"  "terrible"    "therefore"   "thirteen"
## [201] "thousand"    "through"     "thursday"    "together"    "tomorrow"
## [206] "tonight"     "traffic"     "transport"   "trouble"     "tuesday"
## [211] "understand"  "university"  "various"     "village"     "wednesday"
## [216] "welcome"     "whether"     "without"     "yesterday"

f. Contain a vowel-consonant pair.
words |>
  str_view(pattern = "[aeiou][^aeiou]")

## [2] │ <ab>le
## [3] │ <ab>o<ut>
## [4] │ <ab>s<ol><ut>e
## [5] │ <ac>c<ep>t
## [6] │ <ac>co<un>t
## [7] │ <ac>hi<ev>e
## [8] │ <ac>r<os>s
## [9] │ <ac>t
## [10] │ <ac>t<iv>e
## [11] │ <ac>tu<al>
## [12] │ <ad>d
## [13] │ <ad>dr<es>s
## [14] │ <ad>m<it>
## [15] │ <ad>v<er>t<is>e
## [16] │ <af>f<ec>t
## [17] │ <af>f<or>d
## [18] │ <af>t<er>
## [19] │ <af>t<er>no<on>
## [20] │ <ag>a<in>
## [21] │ <ag>a<in>st
## ... and 924 more

g. Contain at least two vowel-consonant pairs in a row.
words |>
  str_view(pattern = "[aeiou][^aeiou][aeiou][^aeiou]")

## [4] │ abs<olut>e
## [23] │ <agen>t
## [30] │ <alon>g
## [36] │ <amer>ica
## [39] │ <anot>her
## [42] │ <apar>t
## [43] │ app<aren>t
## [61] │ auth<orit>y
## [62] │ ava<ilab>le
## [63] │ <awar>e
## [64] │ <away>
## [70] │ b<alan>ce
## [75] │ b<asis>
## [81] │ b<ecom>e
## [83] │ b<efor>e
## [84] │ b<egin>
## [85] │ b<ehin>d
## [87] │ b<enef>it
## [119] │ b<usin>ess
## [143] │ ch<arac>ter
## ... and 149 more

h. Only consist of repeated vowel-consonant pairs.
words |>
  str_view(pattern = "^(?:[aeiou][^aeiou]){2,}$")

## [64] │ <away>
## [265] │ <eleven>
## [279] │ <even>
## [281] │ <ever>
## [436] │ <item>
## [573] │ <okay>
## [579] │ <open>
## [586] │ <original>
## [591] │ <over>
## [905] │ <unit>
## [911] │ <upon>

4. Create 11 regular expressions that match the British or American spellings for each of the following words: airplane/aeroplane, aluminum/aluminium, analog/analogue, ass/arse, center/centre, defense/defence, donut/doughnut, gray/grey, modeling/modelling, skeptic/sceptic, summarize/summarise. Try and make the shortest possible regex!
exp1 <- "The airplane is made of aluminum. The analog signal is stronger. Don't
be an ass. The center is closed for defense training. I prefer a donut, while 
she likes a doughnut. His hair is gray, but hers is grey. We're modeling a new 
project. The skeptic will not believe it. Please summarize the report."

patterns_to_detect <- c(
"air(?:plane|oplane)",
"alumin(?:um|ium)",
"analog(?:ue)?",
"ass|arse",
"cent(?:er|re)",
"defen(?:se|ce)",
"dou(?:gh)?nut",
"gr(?:a|e)y",
"model(?:ing|ling)",
"skep(?:tic|tic)",
"summar(?:ize|ise)"
)

for (pattern in patterns_to_detect) {
  matches <- str_extract_all(exp1, pattern)
  if (length(matches[[1]]) > 0) {
    exp1 <- str_replace_all(exp1,
                            pattern,
                            paste0("**", matches[[1]], "**"))
  }
}

exp1

## [1] "The **airplane** is made of **aluminum**. The **analog** signal is stronger. Don't\nbe an **ass**. The **center** is closed for **defense** training. I prefer a donut, while \nshe likes a **doughnut**. His hair is **gray**, but hers is **gray**. We're **modeling** a new \nproject. The **skeptic** will not believe it. Please **summarize** the report."
## [2] "The **airplane** is made of **aluminum**. The **analog** signal is stronger. Don't\nbe an **ass**. The **center** is closed for **defense** training. I prefer a donut, while \nshe likes a **doughnut**. His hair is **grey**, but hers is **grey**. We're **modeling** a new \nproject. The **skeptic** will not believe it. Please **summarize** the report."
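Note that a few of the patterns above only cover one spelling ("air(?:plane|oplane)" never matches "aeroplane", "dou(?:gh)?nut" misses "donut", and "skep(?:tic|tic)" misses "sceptic"), which is why "donut" is left unmarked in the output. A hedged sketch of shorter patterns that cover both spellings (not evaluated here):
both_spellings <- c(
  "a(?:ir|ero)plane",  # airplane / aeroplane
  "alumini?um",        # aluminum / aluminium
  "analog(?:ue)?",     # analog / analogue
  "a(?:ss|rse)",       # ass / arse
  "cent(?:er|re)",     # center / centre
  "defen[sc]e",        # defense / defence
  "do(?:ugh)?nut",     # donut / doughnut
  "gr[ae]y",           # gray / grey
  "modell?ing",        # modeling / modelling
  "s[kc]eptic",        # skeptic / sceptic
  "summari[sz]e"       # summarize / summarise
)

# Spot-check a few of the British spellings
str_view(c("aeroplane", "donut", "sceptic", "summarise"), str_flatten(both_spellings, "|"))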

5. Switch the first and last letters in words. Which of those strings are still words? (Checked in the sketch below.)
new_words = words |>
  str_replace_all(pattern = "\\b(\\w)(\\w*)(\\w)\\b",
                  replacement = "\\3\\2\\1")
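To answer the second half of the question, keep only the swapped strings that are still in the corpus (a sketch using new_words from above):
# Swapped strings that are themselves valid words in stringr::words
intersect(new_words, words)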

6. Describe in words what these regular expressions match (read carefully to see if each entry is a regular expression or a string that defines a regular expression):

a. ^.*$ matches an entire string, including the empty string.

b. "\\{.+\\}" matches an expression wrapped in curly braces, like {abc}.

c. \d{4}-\d{2}-\d{2} matches four digits, two digits, and two digits separated by hyphens, like "1989-02-18".

d. "\\\\{4}" defines the regex \\{4}, which matches four literal backslashes in a row.

e. \..\..\.. matches a dot followed by any character, three times in a row, like ".a.b.c" or ".1.2.3".

f. (.)\1\1 captures any single character with (.), and \1 refers back to it, so it matches the same character three times in a row, like "aaa" or "111".

g. "(..)\\1" captures any two characters and requires them to be immediately repeated, like "abab" or the "anan" in "banana".
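A quick spot-check of two of these patterns on made-up strings (a sketch, not evaluated above):
str_view(c("aaa", "111", "abc"), "(.)\\1\\1")     # same character three times
str_view(c("banana", "abab", "abcd"), "(..)\\1")  # a two-character chunk repeated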

7. Solve the beginner regexp crosswords at https://regexcrossword.com/challenges/beginner. (No luck with these!)

Regex flags control the finer details of how the pattern is interpreted. The most useful flag is ignore_case = TRUE.
bananas <- c("banana", "Banana", "BANANA")
str_view(bananas, "banana")

## [1] │ <banana>

str_view(bananas, regex("banana", ignore_case = TRUE))

## [1] │ <banana>
## [2] │ <Banana>
## [3] │ <BANANA>

A second flag, dotall = TRUE, lets . match everything, including \n:
x <- "Line 1\nLine 2\nLine 3"
str_view(x, ".Line")
str_view(x, regex(".Line", dotall = TRUE))

## [1] │ Line 1<
##     │ Line> 2<
##     │ Line> 3

multiline = TRUE makes ^ and $ match the start and end of each line rather
than the start and end of the complete string:
x <- "Line 1\nLine 2\nLine 3"
str_view(x, "^Line")

## [1] │ <Line> 1
## │ Line 2
## │ Line 3

str_view(x, regex("^Line", multiline = TRUE))

## [1] │ <Line> 1
## │ <Line> 2
## │ <Line> 3
comments = TRUE lets you add whitespace and comments inside a pattern; anything after # on a line is ignored.
phone <- regex(
r"(
\(? # optional opening parens
(\d{3}) # area code
[)\-]? # optional closing parens or dash
\ ? # optional space
(\d{3}) # another three numbers
[\ -]? # optional space or dash
(\d{4}) # four more numbers
)",
comments = TRUE
)

str_extract(c("514-791-8141", "(123) 456 7890", "123456"), phone)

## [1] "514-791-8141" "(123) 456 7890" NA

Opt out of the regex rules entirely with fixed(), which matches literally and can also ignore case.
str_view(c("", "a", "."), fixed("."))

## [3] │ <.>

str_view("x X", "X")

## [1] │ x <X>

str_view("x X", fixed("X", ignore_case = TRUE))

## [1] │ <x> <X>

NOTES:
str_view(sentences, "^The") matches ANY sentence that starts with "The", including sentences that begin with "There is no way home", not just sentences that start with the word "The".
str_view(sentences, "^The\\b") matches only sentences that start with the word "The".
str_view(sentences, "^She|He|It|They\\b") does not quite match sentences that start with a pronoun, because ^ applies only to "She" and \\b only to "They". Add parentheses to group the alternatives:
str_view(sentences, "^(She|He|It|They)\\b")
Testing patterns on positive and negative examples is a good way to spot mistakes like this.
str_view(words, "^[^aeiou]+$") matches words made up ONLY of consonants.
str_view(words[!str_detect(words, "[aeiou]")]) ALSO finds words with only consonants.
str_view(words, "a.*b|b.*a") matches words containing "a" and "b" in either order.
words[str_detect(words, "a") & str_detect(words, "b")] is an easier way to require both letters.
words[str_detect(words, "a.*e.*i.*o.*u")] finds words that contain all five vowels in that order. To require all five vowels in any order, combine five str_detect() calls with &:
words[str_detect(words, "a") & str_detect(words, "e") & str_detect(words, "i") & str_detect(words, "o") & str_detect(words, "u")]
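A runnable side-by-side of the pronoun patterns described above (a sketch; only the parenthesized version anchors the whole alternation):
str_view(sentences, "^She|He|It|They\\b")      # ^ applies only to "She", \b only to "They"
str_view(sentences, "^(She|He|It|They)\\b")    # the whole group must start the sentence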
str_flatten() and str_c() can build patterns programmatically to use inside regex functions.
rgb <- c("red", "green", "blue")
j <- str_c("\\b(", str_flatten(rgb, "|"), ")\\b")
str_view(sentences, j)

## [2] │ Glue the sheet to the dark <blue> background.
## [26] │ Two <blue> fish swam in the tank.
## [92] │ A wisp of cloud hung in the <blue> air.
## [148] │ The spot on the blotter was made by <green> ink.
## [160] │ The sofa cushion is <red> and of light weight.
## [174] │ The sky that morning was clear and bright <blue>.
## [204] │ A <blue> crane is a tall wading bird.
## [217] │ It is hard to erase <blue> or <red> ink.
## [224] │ The lamp shone with a steady <green> flame.
## [247] │ The box is held by a bright <red> snapper.
## [256] │ The houses are built of <red> clay bricks.
## [274] │ The <red> tape bound the smuggled food.
## [288] │ Hedge apples may stain your hands <green>.
## [302] │ The plant grew large and <green> in the window.
## [330] │ Bathe and relax in the cool <green> grass.
## [368] │ The lake sparkled in the <red> hot sun.
## [372] │ Mark the spot with a sign painted <red>.
## [452] │ The couch cover and hall drapes were <blue>.
## [491] │ A man in a <blue> sweater sat at the desk.
## [551] │ The small <red> neon lamp went out.
## ... and 6 more

15.6.4 Exercises
1. For each of the following challenges, try solving it by using both a single regular expression and a combination of multiple str_detect() (https://stringr.tidyverse.org/reference/str_detect.html) calls.

a. Find all words that start or end with x.
start_r <- str_detect(words, "^x")
end_r <- str_detect(words, "x$")
words[start_r | end_r]

## [1] "box" "sex" "six" "tax"
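The single-regex version of (a), as a sketch:
# One pattern: starts with x or ends with x
str_view(words, "^x|x$")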

b. Find all words that start with a vowel and end with a consonant.
start_r <- str_detect(words, "^[aeiou]")
end_r <- str_detect(words, "[^aeiou]$")

words[start_r & end_r]

##   [1] "about"       "accept"      "account"     "across"      "act"
##   [6] "actual"      "add"         "address"     "admit"       "affect"
##  [11] "afford"      "after"       "afternoon"   "again"       "against"
##  [16] "agent"       "air"         "all"         "allow"       "almost"
##  [21] "along"       "already"     "alright"     "although"    "always"
##  [26] "amount"      "and"         "another"     "answer"      "any"
##  [31] "apart"       "apparent"    "appear"      "apply"       "appoint"
##  [36] "approach"    "arm"         "around"      "art"         "as"
##  [41] "ask"         "at"          "attend"      "authority"   "away"
##  [46] "awful"       "each"        "early"       "east"        "easy"
##  [51] "eat"         "economy"     "effect"      "egg"         "eight"
##  [56] "either"      "elect"       "electric"    "eleven"      "employ"
##  [61] "end"         "english"     "enjoy"       "enough"      "enter"
##  [66] "environment" "equal"       "especial"    "even"        "evening"
##  [71] "ever"        "every"       "exact"       "except"      "exist"
##  [76] "expect"      "explain"     "express"     "identify"    "if"
##  [81] "important"   "in"          "indeed"      "individual"  "industry"
##  [86] "inform"      "instead"     "interest"    "invest"      "it"
##  [91] "item"        "obvious"     "occasion"    "odd"         "of"
##  [96] "off"         "offer"       "often"       "okay"        "old"
## [101] "on"          "only"        "open"        "opportunity" "or"
## [106] "order"       "original"    "other"       "ought"       "out"
## [111] "over"        "own"         "under"       "understand"  "union"
## [116] "unit"        "university"  "unless"      "until"       "up"
## [121] "upon"        "usual"
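And a single-regex version of (b), again as a sketch:
# One pattern: a leading vowel, anything in between, a trailing consonant
str_view(words, "^[aeiou].*[^aeiou]$")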

c. Are there any words that contain at least one of each different
vowel?
vowels <-
str_detect(words, "a") & str_detect(words, "e") &
str_detect(words, "i") &
str_detect(words, "o") & str_detect(words, "u")

words[vowels]

## character(0)

2. Construct patterns to find evidence for and against the rule “i before e
except after c”?
rule <- str_detect(words, "[A-Za-z]*(cei|[^c]ie)[A-Za-z]*")

pattern_1a = "\\b\\w*ie\\w*\\b"
pattern_1b = "\\b\\w+ei\\w*\\b"

pattern_2a = "\\b\\w*cei\\w*\\b"
pattern_2b = "\\b\\w*cie\\w*\\b"
words[str_detect(words, pattern_1a)]

##  [1] "achieve"    "believe"    "brief"      "client"     "die"
##  [6] "experience" "field"      "friend"     "lie"        "piece"
## [11] "quiet"      "science"    "society"    "tie"        "view"

# Words which contain "e" before an "i", thus giving evidence against
# the rule, unless there is a preceding "c"
words[str_detect(words, pattern_1b)]

## [1] "receive" "weigh"

# Words which contain "e" before an "i" after "c", thus following the rule.
# That is, evidence in favour of the rule
words[str_detect(words, pattern_2a)]
## [1] "receive"

# Words which contain an "i" before "e" after "c", thus violating the rule.
# That is, evidence against the rule
words[str_detect(words, pattern_2b)]

## [1] "science" "society"

3. colors() contains a number of modifiers like "lightgray" and "darkblue". How could you automatically identify these modifiers? (Think about how you might detect and then remove the colors that are modified.)
col_vec = colours(distinct = TRUE)
col_vec = col_vec[!str_detect(col_vec, "\\b\\w*\\d\\w*\\b")]

col_vec[str_detect(col_vec, "\\b(?:light|dark)\\w*\\b")]

##  [1] "darkgoldenrod"        "darkgray"             "darkgreen"
##  [4] "darkkhaki"            "darkmagenta"          "darkolivegreen"
##  [7] "darkorange"           "darkorchid"           "darkred"
## [10] "darksalmon"           "darkseagreen"         "darkslateblue"
## [13] "darkslategray"        "darkturquoise"        "darkviolet"
## [16] "lightblue"            "lightcoral"           "lightcyan"
## [19] "lightgoldenrod"       "lightgoldenrodyellow" "lightgray"
## [22] "lightgreen"           "lightpink"            "lightsalmon"
## [25] "lightseagreen"        "lightskyblue"         "lightslateblue"
## [28] "lightslategray"       "lightsteelblue"       "lightyellow"
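The exercise also asks about removing the modifier; a hedged sketch that strips the prefix from the modified colours (reusing col_vec from above; other modifiers such as "medium" or "pale" could be handled the same way):
modified <- col_vec[str_detect(col_vec, "^(light|dark)")]
str_remove(modified, "^(light|dark)") |> unique()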

4. Create a regular expression that finds any base R dataset. You can get
a list of these datasets via a special use of the data() function:
data(package = "datasets")$results[, "Item"]. Note that a number
of old datasets are individual vectors; these contain the name of the
grouping “data frame” in parentheses, so you’ll need to strip those off.
# Extract all base R datasets into a character vector
base_r_packs = data(package = "datasets")$results[, "Item"]

# Remove all the names of grouping data frames in parentheses
base_r_packs = str_replace_all(base_r_packs,
                               pattern = "\\([^()]+\\)",
                               replacement = "")

# Remove the whitespace, i.e. " ", left after removing the parenthesized names
base_r_packs = str_replace_all(base_r_packs,
                               pattern = "\\s+$",
                               replacement = "")

# Create the regular expression
huge_regex = str_c("\\b(", str_flatten(base_r_packs, "|"), ")\\b")
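A quick hedged check of huge_regex on a made-up string (not evaluated here):
# Should highlight the dataset names "mtcars" and "faithful"
str_view("we fit a model to mtcars and then plotted faithful", huge_regex)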

15.7 Regex in other places

Three tidyverse places that accept a regex: matches() for selecting columns whose names match a pattern, the names_pattern argument of pivot_longer(), and the delim argument of separate_longer_delim() and separate_wider_delim() (wrap the pattern in regex() there).
In base R, apropos("replace") lists every object on the search path whose name contains the pattern "replace".
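A sketch of the base R side (output not shown; the results depend on which packages are attached):
# All objects on the search path whose names contain "replace"
apropos("replace")

# apropos() takes a regex, so e.g. functions ending in "apply":
apropos("apply$")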
list.files() can also take a regular expression (its pattern argument) to match files with specific names.
head(list.files(pattern = "\\.Rmd$"))

## [1] "Chapter-15.Rmd" "Chapter 15.Rmd"
