Chapter 15
Dimple K Patel
2025-01-07
Chapter 15 Intro
#install.packages("tidyverse")
library(tidyverse)
library(babynames)
# The 2nd argument is the regular expression to match. str_view() puts <>
# around each match.
str_view(fruit, "berry")
## [6] │ bil<berry>
## [7] │ black<berry>
## [10] │ blue<berry>
## [11] │ boysen<berry>
## [19] │ cloud<berry>
## [21] │ cran<berry>
## [29] │ elder<berry>
## [32] │ goji <berry>
## [33] │ goose<berry>
## [38] │ huckle<berry>
## [50] │ mul<berry>
## [70] │ rasp<berry>
## [73] │ salal <berry>
## [76] │ straw<berry>
## [1] │ <apple>
## [7] │ bl<ackbe>rry
## [48] │ mand<arine>
## [51] │ nect<arine>
## [62] │ pine<apple>
## [64] │ pomegr<anate>
## [70] │ r<aspbe>rry
## [73] │ sal<al be>rry
## [1] │ <a>
## [2] │ <ab>
## [3] │ <ab>b
## [2] │ <ab>
## [3] │ <abb>
## [1] │ <a>
## [2] │ <ab>
## [3] │ <abb>
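The three unlabeled outputs just above look like the quantifiers ?, + and * applied to the same three strings. The calls that produced them did not survive in these notes, so the following is only a reconstruction under that assumption (the input vector is a guess based on the output, not the author's code):

```r
library(stringr)

# Assumed input; the original call is missing from the notes.
x <- c("a", "ab", "abb")

# ? makes "b" optional (0 or 1 times)
opt <- str_extract(x, "ab?")
# + requires "b" at least once (1 or more times)
plus <- str_extract(x, "ab+")
# * allows any number of "b", including none (0 or more times)
star <- str_extract(x, "ab*")

print(opt)
print(plus)
print(star)
```

Only the + version fails on "a", which would explain why the middle output starts at [2].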
head(o)
## # A tibble: 6 × 2
## name n
## <chr> <int>
## 1 Alexander 665492
## 2 Alexis 399551
## 3 Alex 278705
## 4 Alexandra 232223
## 5 Max 148787
## 6 Alexa 123032
babynames |>
group_by(year) |>
summarize(prop_x = mean(str_detect(name, "x"))) |>
ggplot(aes(x = year, y = prop_x)) +
geom_line()
str_count() counts the matches in each string. str_view() highlights the
matches. Regular expressions are case-sensitive! str_to_lower() converts all
the words to lower case. Mnemonic: TL -> too long, so make it short (TL;DR).
babynames |>
count(name) |>
mutate(
name = str_to_lower(name),
vowels = str_count(name, "[aeiou]"),
consonants = str_count(name, "[^aeiou]")
)
## # A tibble: 97,310 × 4
## name n vowels consonants
## <chr> <int> <int> <int>
## 1 aaban 10 3 2
## 2 aabha 5 3 2
## 3 aabid 2 3 2
## 4 aabir 1 3 2
## 5 aabriella 5 5 4
## 6 aada 1 3 1
## 7 aadam 26 3 2
## 8 aadan 11 3 2
## 9 aadarsh 17 3 4
## 10 aaden 18 3 2
## # ℹ 97,300 more rows
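3 ways to fix the case problem: add the upper case vowels to the character
class ("[aeiouAEIOU]"), tell the regex to ignore case
(regex("[aeiou]", ignore_case = TRUE)), or convert the names with
str_to_lower(). Mnemonic -> ick: Count Chocula experienced the ick and
ignore_case was invoked to prevent upper case fiascos.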
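A minimal sketch of the three usual ways to make the vowel count case-insensitive. The example names are hypothetical, picked only because they contain upper case vowels:

```r
library(stringr)

# Hypothetical example names that include upper case vowels.
x <- c("Olivia", "Ava", "bob")

# 1. Add the upper case vowels to the character class.
by_class <- str_count(x, "[aeiouAEIOU]")

# 2. Tell the regular expression to ignore case.
by_flag <- str_count(x, regex("[aeiou]", ignore_case = TRUE))

# 3. Convert the input to lower case before counting.
by_lower <- str_count(str_to_lower(x), "[aeiou]")

print(by_class)
```

All three give the same counts here; which one to use mostly depends on whether the lower-cased text is needed later anyway.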
str_replace_all() replaces every match -> RA: Raxa is irreplaceable.
str_remove_all() removes every match -> RA: Raxa removed my heart when she
didn't hug me.
x <- c("apple", "pear", "banana")
str_replace_all(x, "[aeiou]", "-")
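The output of the call above is missing from these notes. As a small sketch on the same vector, this is how str_replace_all() and str_remove_all() compare (str_remove_all() behaves like replacing every match with an empty string):

```r
library(stringr)

x <- c("apple", "pear", "banana")

# Replace every vowel with a dash.
replaced <- str_replace_all(x, "[aeiou]", "-")

# Remove every vowel.
removed <- str_remove_all(x, "[aeiou]")

print(replaced)
print(removed)
```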
## [1] │ a\b\c\d\e
## # A tibble: 6 × 1
## value
## <chr>
## 1 555-123-4567
## 2 555-555-7890
## 3 888-555-4321
## 4 123-456-7890
## 5 555-987-6543
## 6 555-123-7890
15.4.1 Escaping
A period is a metacharacter (it matches any character). Use a backslash to
escape a metacharacter. A backslash also escapes a backslash, so matching one
literal ‘\’ takes 4 backslashes in an R string: "\\\\". Alternatively, use a
raw string like r"{\\}" to write the same pattern.
# To create the regular expression \., we need to use \\.
dot <- "\\."
## [1] │ \.
#> [1] │ \.
## [2] │ <a.c>
x <- "a\\b"
str_view(x)
## [1] │ a\b
str_view(x, "\\\\")
## [1] │ a<\>b
str_view(x, r"{\\}")
## [1] │ a<\>b
str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")
## [2] │ <a.c>
## [2] │ <a.c>
## [3] │ <a*c>
Anchors -> ^ matches at the start (to START going inside of a rabbit hole,
dangle the CARROT, or caret). $ matches at the end, bc when the world comes to
an end, it’s probably over money. Either way, directionally speaking, the
anchor PROTECTS the letter. Use both the caret and the dollar sign to force a
pattern to match only the complete string.
str_view(fruit, "^a")
## [1] │ <a>pple
## [2] │ <a>pricot
## [3] │ <a>vocado
str_view(fruit, "a$")
## [4] │ banan<a>
## [15] │ cherimoy<a>
## [30] │ feijo<a>
## [36] │ guav<a>
## [56] │ papay<a>
## [74] │ satsum<a>
str_view(fruit, "apple")
## [1] │ <apple>
## [62] │ pine<apple>
str_view(fruit, "^apple$")
## [1] │ <apple>
"\\b" at the start or end of a word matches the word boundary, so "\\bsum\\b"
matches sum only as a whole word.
y <- c("summary(x)", "summarize(df)", "rowsum(x)", "sum(x)")
z <- c("summary(x)", "summarize(df)", "rowsum(x)", "sum(x)")
z1 <- str_view(z, "sum")
y1 <- str_view(y, "\\bsum\\b")
y1
## [4] │ <sum>(x)
By themselves, anchors yield a zero-length match.
str_view("abc", c("$", "^", "\\b"))
## [1] │ abc<>
## [2] │ <>abc
## [3] │ <>abc<>
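Character classes a.k.a. character sets. Characters with a special meaning
inside the brackets [] include:
- (hyphen) defines a range, e.g., [a-z] matches any lower case letter and
[0-9] matches any digit.
^ at the start negates the class, e.g., [^abc] matches anything except a, b,
or c.
\ escapes special characters, so [\^\-\]] matches ^, -, or ].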
## [1] │ <a>-<b>-<c>
## [1] │ <a><->b<-><c>
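The two outputs above have lost their calls. They are consistent with a range class and an escaped hyphen applied to "a-b-c", so here is a hedged sketch under that assumption (the input string is inferred from the output):

```r
library(stringr)

# Assumed input, inferred from the output shown in the notes.
x <- "a-b-c"

# Unescaped hyphen: a range from a to c, so the dashes are skipped.
range_match <- str_extract_all(x, "[a-c]")[[1]]

# Escaped hyphen: matches the literal characters a, - and c.
literal_match <- str_extract_all(x, "[a\\-c]")[[1]]

# Leading caret: everything that is NOT in the range a to c.
negated_match <- str_extract_all(x, "[^a-c]")[[1]]

print(range_match)
print(literal_match)
print(negated_match)
```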
DSW - Designer Shoe Warehouse. Digits are toes which are in shoes.
\d matches any digit;
\D matches anything that isn’t a digit.
\s matches any whitespace (e.g., space, tab, newline);
\S matches anything that isn’t whitespace.
\w matches any “word” character, i.e. letters and numbers;
\W matches any “non-word” character.
str_view(x, "\\D+")
str_view(x, "\\s+")
str_view(x, "\\S+")
str_view(x, "\\w+")
str_view(x, "\\W+")
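The calls above need an x that mixes letters, digits, spaces and punctuation, and that definition is not in these notes. A sketch with an assumed input string, using str_extract_all() so the pieces can be inspected as plain vectors:

```r
library(stringr)

# Assumed input string mixing letters, digits, whitespace and punctuation.
x <- "abcd ABCD 12345 -!@#%."

digits     <- str_extract_all(x, "\\d+")[[1]]  # runs of digits
non_digits <- str_extract_all(x, "\\D+")[[1]]  # runs of anything else
spaces     <- str_extract_all(x, "\\s+")[[1]]  # runs of whitespace
non_spaces <- str_extract_all(x, "\\S+")[[1]]  # runs of non-whitespace
word_chars <- str_extract_all(x, "\\w+")[[1]]  # runs of letters and numbers
non_word   <- str_extract_all(x, "\\W+")[[1]]  # runs of everything else

print(word_chars)
print(non_word)
```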
Quantifiers -> {n} matches exactly n times. {n,} matches at least n times.
{n,m} matches between n and m times. Write them without a space after the
comma.
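A small sketch of the counted quantifiers on a made-up vector. The patterns are anchored with ^ and $ so that the count applies to the whole string rather than to any run inside it:

```r
library(stringr)

# Hypothetical test strings: one to four copies of "a".
x <- c("a", "aa", "aaa", "aaaa")

exactly_two  <- str_detect(x, "^a{2}$")    # exactly 2
at_least_two <- str_detect(x, "^a{2,}$")   # 2 or more
two_to_three <- str_detect(x, "^a{2,3}$")  # between 2 and 3

print(exactly_two)
print(at_least_two)
print(two_to_three)
```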
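PEMDAS -> just as math follows an order of operations (parentheses, exponents,
multiply, divide, add, subtract), regex has rules for what gets precedence:
quantifiers bind tightly and alternation (|) binds loosely, so ab+ means a(b+)
and ^a|b$ means (^a)|(b$). Parentheses override both.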
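To see the precedence rules in action, here is a short sketch on made-up strings. It compares a pattern with and without parentheses for both a quantifier and an alternation:

```r
library(stringr)

# Quantifiers bind to the single item before them: ab+ is a(b+).
tight   <- str_extract("ababab", "ab+")
grouped <- str_extract("ababab", "(ab)+")

# Alternation binds loosely: ^a|b$ is (^a)|(b$).
x <- c("apple", "cab", "a", "xyz")
loose    <- str_detect(x, "^a|b$")    # starts with a, OR ends with b
anchored <- str_detect(x, "^(a|b)$")  # the whole string is a or b

print(tight)
print(grouped)
print(loose)
print(anchored)
```

When in doubt, adding parentheses costs nothing and makes the intent explicit.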
Capturing groups -> parentheses create capturing groups (like a pseudo net)
that capture a sub-match.
Back reference: \1 refers to the match captured by the 1st set of parentheses,
\2 to the 2nd. Use (?:...) for a group that does not capture.
#Match repeated letter pairs.
str_view(fruit, "(..)\\1")
## [4] │ b<anan>a
## [20] │ <coco>nut
## [22] │ <cucu>mber
## [41] │ <juju>be
## [56] │ <papa>ya
## [73] │ s<alal> berry
## # A tibble: 720 × 3
## match word1 word2
## <chr> <chr> <chr>
## 1 the smooth planks smooth planks
## 2 the sheet to sheet to
## 3 the depth of depth of
## 4 <NA> <NA> <NA>
## 5 <NA> <NA> <NA>
## 6 <NA> <NA> <NA>
## 7 the parked truck parked truck
## 8 <NA> <NA> <NA>
## 9 <NA> <NA> <NA>
## 10 <NA> <NA> <NA>
## # ℹ 710 more rows
## [,1] [,2]
## [1,] "gray" "a"
## [2,] "grey" "e"
str_match(x, "gr(?:e|a)y")
## [,1]
## [1,] "gray"
## [2,] "grey"
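The x used by the str_match() call above is not defined in these notes. A sketch with an assumed two-element input shows the difference between a capturing and a non-capturing group: the capturing version returns an extra column per group.

```r
library(stringr)

# Assumed input; the original definition of x is missing from the notes.
x <- c("a gray cat", "a grey dog")

# Capturing group: column 1 is the full match, column 2 is the group.
capturing <- str_match(x, "gr(e|a)y")

# Non-capturing group: only the full match is returned.
non_capturing <- str_match(x, "gr(?:e|a)y")

print(capturing)
print(non_capturing)
```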
15.4.7 Exercises
1. How would you match the literal string "'\? How about "$^$"?
input_string <- "\"'\\"
str_view(input_string)
## [1] │ "'\
## [1] │ "'\\
## [1] │ "$^$"
## [1] │ "\$\^\$"
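The patterns for this exercise are missing from the notes; only the str_view() outputs remain. One possible answer, written as a sketch (the variable names are made up for illustration): escape each metacharacter with a backslash, then double every backslash for the R string.

```r
library(stringr)

# Target 1: the three characters  " ' \
target1  <- "\"'\\"
# Regex "'\\  -> in an R string every backslash is doubled again.
pattern1 <- "\"'\\\\"

# Target 2: the five characters  " $ ^ $ "
target2  <- "\"$^$\""
# $ and ^ are metacharacters, so each one is escaped.
pattern2 <- "\"\\$\\^\\$\""

match1 <- str_extract(target1, pattern1)
match2 <- str_extract(target2, pattern2)

# fixed() sidesteps the escaping entirely.
match2_fixed <- str_extract(target2, fixed(target2))

print(match1 == target1)
print(match2 == target2)
```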
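2. Explain why each of these patterns don’t match a \: "\", "\\", "\\\".
"\": the single backslash escapes the closing quote, so the string is never
terminated. "\\": this string holds one backslash, which the regex engine
reads as an unfinished escape, so it errors. "\\\": the first two backslashes
make one literal backslash and the third escapes the closing quote, so the
string is again unterminated. Matching a literal backslash takes 4
backslashes, "\\\\": the string holds \\, which the regex reads as one escaped
backslash.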
## [975] │ <y>ear
## [976] │ <y>es
## [977] │ <y>esterday
## [978] │ <y>et
## [979] │ <y>ou
## [980] │ <y>oung
## [1] │ <>a
## [2] │ <>able
## [3] │ <>about
## [4] │ <>absolute
## [5] │ <>accept
## [6] │ <>account
## [7] │ <>achieve
## [8] │ <>across
## [9] │ <>act
## [10] │ <>active
## [11] │ <>actual
## [12] │ <>add
## [13] │ <>address
## [14] │ <>admit
## [15] │ <>advertise
## [16] │ <>affect
## [17] │ <>afford
## [18] │ <>after
## [19] │ <>afternoon
## [20] │ <>again
## ... and 954 more
## [108] │ bo<x>
## [747] │ se<x>
## [772] │ si<x>
## [841] │ ta<x>
## [2] │ <ab>le
## [3] │ <ab>o<ut>
## [4] │ <ab>s<ol><ut>e
## [5] │ <ac>c<ep>t
## [6] │ <ac>co<un>t
## [7] │ <ac>hi<ev>e
## [8] │ <ac>r<os>s
## [9] │ <ac>t
## [10] │ <ac>t<iv>e
## [11] │ <ac>tu<al>
## [12] │ <ad>d
## [13] │ <ad>dr<es>s
## [14] │ <ad>m<it>
## [15] │ <ad>v<er>t<is>e
## [16] │ <af>f<ec>t
## [17] │ <af>f<or>d
## [18] │ <af>t<er>
## [19] │ <af>t<er>no<on>
## [20] │ <ag>a<in>
## [21] │ <ag>a<in>st
## ... and 924 more
## [4] │ abs<olut>e
## [23] │ <agen>t
## [30] │ <alon>g
## [36] │ <amer>ica
## [39] │ <anot>her
## [42] │ <apar>t
## [43] │ app<aren>t
## [61] │ auth<orit>y
## [62] │ ava<ilab>le
## [63] │ <awar>e
## [64] │ <away>
## [70] │ b<alan>ce
## [75] │ b<asis>
## [81] │ b<ecom>e
## [83] │ b<efor>e
## [84] │ b<egin>
## [85] │ b<ehin>d
## [87] │ b<enef>it
## [119] │ b<usin>ess
## [143] │ ch<arac>ter
## ... and 149 more
## [64] │ <away>
## [265] │ <eleven>
## [279] │ <even>
## [281] │ <ever>
## [436] │ <item>
## [573] │ <okay>
## [579] │ <open>
## [586] │ <original>
## [591] │ <over>
## [905] │ <unit>
## [911] │ <upon>
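The code for the outputs above did not survive, only the highlighted matches. The patterns below are a reconstruction that appears consistent with those outputs (for example, the zero-length <> matches suggest a lookahead); they are guesses, not the author's original answers, and the variable names are invented:

```r
library(stringr)

starts_y <- "^y"                       # words that start with y
not_y    <- "^(?!y)"                   # words that don't start with y
ends_x   <- "x$"                       # words that end with x
vc_pair  <- "[aeiou][^aeiou]"          # every vowel-consonant pair
vc_two   <- "([aeiou][^aeiou]){2,}"    # two or more pairs in a row
vc_only  <- "^([aeiou][^aeiou]){2,}$"  # nothing but repeated pairs

# words is the example vector that ships with stringr.
n_y     <- sum(str_detect(words, starts_y))
n_not_y <- sum(str_detect(words, not_y))
n_x     <- sum(str_detect(words, ends_x))

print(c(n_y, n_not_y, n_x))
print(words[str_detect(words, vc_only)])
```

Swapping str_detect() for str_view(words, pattern) should show the highlighted form seen in the notes.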
# Each pattern matches both the American and the British spelling.
patterns_to_detect <- c(
"a(?:ir|ero)plane",
"alumin(?:um|ium)",
"analog(?:ue)?",
"ass|arse",
"cent(?:er|re)",
"defen(?:se|ce)",
"dou(?:gh)?nut",
"gr(?:a|e)y",
"model(?:ing|ling)",
"s(?:k|c)eptic",
"summar(?:ize|ise)"
)
exp1
5. Switch the first and last letters in words. Which of those strings are still
words?
new_words = words |>
str_replace_all(pattern = "\\b(\\w)(\\w*)(\\w)\\b",
replacement = "\\3\\2\\1")
Regex flags: control the specifics of how the regexp matches. Set them by
wrapping the pattern in regex(). Coolest flag is ignore_case = TRUE.
bananas <- c("banana", "Banana", "BANANA")
str_view(bananas, "banana")
## [1] │ <banana>
str_view(bananas, regex("banana", ignore_case = TRUE))
## [1] │ <banana>
## [2] │ <Banana>
## [3] │ <BANANA>
Second regex flag: dotall = TRUE lets . match everything, including \n:
x <- "Line 1\nLine 2\nLine 3"
str_view(x, ".Line")
str_view(x, regex(".Line", dotall = TRUE))
multiline = TRUE makes ^ and $ match the start and end of each line rather
than the start and end of the complete string:
x <- "Line 1\nLine 2\nLine 3"
str_view(x, "^Line")
## [1] │ <Line> 1
## │ Line 2
## │ Line 3
str_view(x, regex("^Line", multiline = TRUE))
## [1] │ <Line> 1
## │ <Line> 2
## │ <Line> 3
comments = TRUE ignores spaces, newlines, and anything after #, so a literal
space has to be written as "\ ".
phone <- regex(
r"(
\(? # optional opening parens
(\d{3}) # area code
[)\-]? # optional closing parens or dash
\ ? # optional space
(\d{3}) # another three numbers
[\ -]? # optional space or dash
(\d{4}) # four more numbers
)",
comments = TRUE
)
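The notes define the commented pattern but never apply it. A short usage sketch follows; the pattern is repeated so the block runs on its own, and the three test numbers are illustrative inputs rather than data from the notes:

```r
library(stringr)

phone <- regex(
  r"(
    \(?     # optional opening parens
    (\d{3}) # area code
    [)\-]?  # optional closing parens or dash
    \ ?     # optional space
    (\d{3}) # another three numbers
    [\ -]?  # optional space or dash
    (\d{4}) # four more numbers
  )",
  comments = TRUE
)

# Illustrative inputs: two valid formats and one string that is too short.
numbers <- c("514-791-8141", "(123) 456 7890", "123456")

# Full match, or NA when the pattern does not match.
extracted <- str_extract(numbers, phone)

# One column for the full match plus one per capturing group.
parts <- str_match(numbers, phone)

print(extracted)
print(parts)
```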
Opt out of regex rules with the fixed() function, which can also ignore case
via fixed(..., ignore_case = TRUE).
str_view(c("", "a", "."), fixed("."))
## [3] │ <.>
## [1] │ x <X>
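A minimal sketch contrasting a regex dot with a fixed() dot, plus the ignore_case argument of fixed(). The inputs are the small vectors already used in these notes:

```r
library(stringr)

x <- c("", "a", ".")

# As a regex, "." matches any single character.
as_regex <- str_detect(x, ".")

# Wrapped in fixed(), "." only matches a literal period.
as_fixed <- str_detect(x, fixed("."))

# fixed() can still ignore case.
both_cases <- str_extract_all("x X", fixed("X", ignore_case = TRUE))[[1]]

print(as_regex)
print(as_fixed)
print(both_cases)
```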
NOTES!:
str_view(sentences, "^The") matches ANYTHING that starts with "The",
including sentences that begin with "There is no way home" and not just
sentences that start with the word "The".
str_view(sentences, "^The\\b") matches ONLY sentences that start with the
word "The".
str_view(sentences, "^She|He|It|They\\b") is meant to match sentences that
start w/ a pronoun, but ^ applies only to She and \\b only to They, so He and
It match anywhere. ADD parentheses like here:
str_view(sentences, "^(She|He|It|They)\\b")
Try testing patterns to spot mistakes!
str_view(words, "^[^aeiou]+$") matches words with ONLY consonants.
str_view(words[!str_detect(words, "[aeiou]")]) ALSO shows words with
ONLY consonants.
str_view(words, "a.*b|b.*a") matches words containing a and b in either
order.
words[str_detect(words, "a") & str_detect(words, "b")] is an easier way to
find the same words.
words[str_detect(words, "a.*e.*i.*o.*u")] finds words with ALL five vowels
only when they appear in that order. Covering every order in one regex would
take 5! = 120 alternatives, so the easier equivalent is 5 str_detect() calls
with &.
words[str_detect(words, "a") & str_detect(words, "e") & str_detect(words,
"i") & str_detect(words, "o") & str_detect(words, "u")]
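To make the comparison concrete, here is a sketch on a small hand-picked vector (not taken from words) that checks a single regex against several str_detect() calls joined with &:

```r
library(stringr)

# Hypothetical test words chosen to cover each case.
x <- c("abacus", "bat", "cat", "education", "facetious")

# a and b in either order: one regex versus two calls.
ab_regex <- str_detect(x, "a.*b|b.*a")
ab_calls <- str_detect(x, "a") & str_detect(x, "b")

# All five vowels: the single regex only covers the order a, e, i, o, u.
vowels_in_order <- str_detect(x, "a.*e.*i.*o.*u")
vowels_any_order <-
  str_detect(x, "a") & str_detect(x, "e") & str_detect(x, "i") &
  str_detect(x, "o") & str_detect(x, "u")

print(ab_regex)
print(vowels_in_order)
print(vowels_any_order)
```

"education" has all five vowels but not in alphabetical order, so only the & version finds it.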
str_flatten() and str_c() can build pattern strings to use inside regex
functions.
rgb <- c("red", "green", "blue")
j <- str_c("\\b(", str_flatten(rgb, "|"), ")\\b")
str_view(sentences, j)
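It can help to look at the string that str_c() and str_flatten() actually build before using it. A small sketch, with two made-up test sentences:

```r
library(stringr)

rgb <- c("red", "green", "blue")

# Join the colours with | and wrap them in word boundaries.
j <- str_c("\\b(", str_flatten(rgb, "|"), ")\\b")

# writeLines() shows the regex as the engine sees it.
writeLines(j)

# Hypothetical test strings: a whole word versus part of a longer word.
hits <- str_detect(c("a red car", "redder"), j)
print(hits)
```

The word boundaries are what keep "redder" from matching.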
15.6.4 Exercises
1. For each of the following challenges, try solving it by using both a
single regular expression, and a combination of multiple
str_detect() (https://stringr.tidyverse.org/reference/str_detect.html)
calls.
b. Find all words that start with a vowel and end with a consonant.
start_r <- str_detect(words, "^[aeiou]")
end_r <- str_detect(words, "[^aeiou]$")
# Combine both conditions; single-regex equivalent: "^[aeiou].*[^aeiou]$"
words[start_r & end_r]
c. Are there any words that contain at least one of each different
vowel?
vowels <-
str_detect(words, "a") & str_detect(words, "e") &
str_detect(words, "i") &
str_detect(words, "o") & str_detect(words, "u")
words[vowels]
## character(0)
2. Construct patterns to find evidence for and against the rule “i before e
except after c”?
# Words that follow the rule: "cei", or "ie" not preceded by "c"
rule <- str_detect(words, "[A-Za-z]*(cei|[^c]ie)[A-Za-z]*")
pattern_1a = "\\b\\w*ie\\w*\\b"
pattern_1b = "\\b\\w*ei\\w*\\b"
pattern_2a = "\\b\\w*cei\\w*\\b"
pattern_2b = "\\b\\w*cie\\w*\\b"
# Words which contain "i" before an "e", thus giving evidence for
# the rule, unless there is a preceding "c"
words[str_detect(words, pattern_1a)]
# Words which contain "e" before an "i", thus giving evidence against
# the rule, unless there is a preceding "c"
words[str_detect(words, pattern_1b)]
# Words which contain "e" before an "i" after "c", thus following the rule.
# That is, evidence in favour of the rule
words[str_detect(words, pattern_2a)]
## [1] "receive"
# Words which contain an "i" before "e" after "c", thus violating the rule.
# That is, evidence against the rule
words[str_detect(words, pattern_2b)]
col_vec[str_detect(col_vec, "\\b(?:light|dark)\\w*\\b")]
4. Create a regular expression that finds any base R dataset. You can get
a list of these datasets via a special use of the data() function:
data(package = "datasets")$results[, "Item"]. Note that a number
of old datasets are individual vectors; these contain the name of the
grouping “data frame” in parentheses, so you’ll need to strip those off.
# Extract all base R datasets into a character vector
base_r_packs = data(package = "datasets")$results[, "Item"]
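The notes stop after extracting the names. One way the exercise could be finished is sketched below; the variable names dataset_names and dataset_regex are made up for illustration, and str_escape() is assumed to be available (stringr 1.5.0 or later):

```r
library(stringr)

# Extract all base R datasets into a character vector.
base_r_packs <- data(package = "datasets")$results[, "Item"]

# Strip the grouping name in parentheses, e.g. "beaver1 (beavers)".
dataset_names <- str_remove(base_r_packs, " \\(.*\\)$")

# Escape metacharacters such as "." and join the names with |.
# The anchors make the pattern match complete names only.
dataset_regex <- str_c(
  "^(", str_flatten(str_escape(dataset_names), "|"), ")$"
)

print(str_detect(c("mtcars", "state.abb", "not_a_dataset"), dataset_regex))
```

Without the escaping step, a name like state.abb would also match strings such as "stateXabb", because the dot would act as a wildcard.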