Regular Expressions (C++) - Microsoft Learn
The regular expression grammar to use is specified by one of the
std::regex_constants::syntax_option_type enumeration values. These regular
expression grammars are defined in std::regex_constants :
ECMAScript: This is closest to the grammar used by JavaScript and the .NET
languages.
basic: The POSIX basic regular expressions or BRE.
extended: The POSIX extended regular expressions or ERE.
awk: This is extended, but it has more escapes for non-printing characters.
grep: This is basic, but it also allows newline ( \n ) characters to separate alternations.
egrep: This is extended, but it also allows newline characters to separate alternations.
nosubs : Ignore marked matches (that is, expressions in parentheses); no substitutions
are stored.
optimize : Make matching faster, at the possible expense of greater construction
time.
collate : Use locale-sensitive collation sequences (for example, ranges of the form
[a-z] ).
Zero or more flags may be combined with the grammar to specify the regular expression
engine behavior. If only flags are specified, ECMAScript is assumed as the grammar.
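As a minimal sketch of how a grammar and flags are combined, the option values can be OR-ed together and passed to the std::regex constructor. The helper name matches_with is hypothetical and exists only for this illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: compiles `pattern` with the given syntax options and
// reports whether it matches the whole of `target`.
bool matches_with(const std::string& pattern, const std::string& target,
                  std::regex::flag_type options = std::regex::ECMAScript)
{
    const std::regex re(pattern, options);
    return std::regex_match(target, re);
}
```

For instance, matches_with("a+b", "AAB", std::regex::extended | std::regex::icase) selects the extended grammar with case-insensitive matching, while passing std::regex::icase alone falls back to ECMAScript as described above.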
Element
An element can be one of the following:
An ordinary character that matches the same character in the target sequence.
A wildcard character '.' that matches any character in the target sequence except a
newline.
A character range of the form ch1-ch2 . Adds the characters that are represented
by values in the closed range [ch1, ch2] to the set defined by expr .
A character class of the form [:name:] . Adds the characters in the named class to
the set defined by expr .
An equivalence class of the form [=elt=] . Adds the collating elements that are
equivalent to elt to the set defined by expr .
A collating symbol of the form [.elt.] . Adds the collation element elt to the set
defined by expr .
A capture group of the form (subexpression), or \(subexpression\) in basic and grep, which
matches the sequence of characters in the target sequence that is matched by the pattern
between the delimiters.
An identity escape of the form \k , which matches the character k in the target
sequence.
Examples:
a matches the target sequence "a" but doesn't match the target sequences "B" ,
"b" , or "c" .
. matches all the target sequences "a" , "B" , "b" , and "c" .
[b-z] matches the target sequences "b" and "c" but doesn't match the target
sequences "a" or "B" .
[[:lower:]] matches the target sequences "a" , "b" , and "c" but doesn't match the
target sequence "B" . (The character class [:lower:] must appear inside a bracket
expression.)
(a) matches the target sequence "a" and associates capture group 1 with the
subsequence "a" , but doesn't match the target sequences "B" , "b" , or "c" .
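The element examples above can be checked with a short sketch such as the following. It assumes the default ECMAScript grammar, and the helper names full_match and first_group are hypothetical. Note that the character class is written inside a bracket expression, as [[:lower:]] .

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when `pattern` (ECMAScript grammar) matches the
// entire `target` sequence.
bool full_match(const std::string& pattern, const std::string& target)
{
    return std::regex_match(target, std::regex(pattern));
}

// Hypothetical helper: returns the text of capture group 1, or an empty
// string when the pattern does not match the whole target.
std::string first_group(const std::string& pattern, const std::string& target)
{
    std::smatch m;
    const std::regex re(pattern);
    if (std::regex_match(target, m, re) && m.size() > 1) {
        return m[1].str();
    }
    return {};
}
```

With these helpers, full_match("[b-z]", "c") is true, full_match("[b-z]", "a") is false, and first_group("(a)", "a") returns "a".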
In ECMAScript, basic, and grep, an element can also be a back reference of the form \dd ,
where dd represents a decimal value N that matches a sequence of characters in the target
sequence that is the same as the sequence of characters that is matched by the Nth
capture group.
For example, (a)\1 matches the target sequence "aa" because the first (and only)
capture group matches the initial sequence "a" and then the \1 matches the final
sequence "a" .
A hexadecimal escape sequence of the form \xhh . Matches a character in the target
sequence that is represented by the two hexadecimal digits hh .
A unicode escape sequence of the form \uhhhh . Matches a character in the target
sequence that is represented by the four hexadecimal digits hhhh .
A control escape sequence of the form \ck . Matches the control character that is
named by the character k .
A word boundary assert of the form \b . Matches when the current position in the
target sequence is immediately after a word boundary.
A negative word boundary assert of the form \B . Matches when the current position
in the target sequence isn't immediately after a word boundary.
Examples:
(?:a) matches the target sequence "a" , but (?:a)\1 is invalid because there's
no capture group 1.
(?=a)a matches the target sequence "a" . The positive assert matches the initial
sequence "a" in the target sequence without consuming it, and the final a in the
regular expression then matches that same initial sequence "a" .
a\b. matches the target sequence "a~" , but doesn't match the target sequence
"ab" .
a\B. matches the target sequence "ab" , but doesn't match the target sequence
"a~" .
An octal escape sequence of the form \ooo . Matches a character in the target
sequence whose representation is the value represented by the one, two, or three
octal digits ooo .
Repetition
Any element other than a positive assert, a negative assert, or an anchor can be followed
by a repetition count. The most general kind of repetition count takes the form {min,max},
or \{min,max\} in basic and grep. An element that is followed by this form of repetition
count matches at least min successive occurrences and no more than max successive
occurrences of a sequence that matches the element.
For example, a{2,3} matches the target sequence "aa" and the target sequence "aaa" ,
but not the target sequence "a" or the target sequence "aaaa" .
* is equivalent to {0,unbounded}.
Examples:
a{2} matches the target sequence "aa" but not the target sequence "a" or the
target sequence "aaa" .
a{2,} matches the target sequence "aa" , the target sequence "aaa" , and so on,
but doesn't match the target sequence "a" .
a* matches the target sequence "" , the target sequence "a" , the target sequence
"aa" , and so on.
For all grammars except basic and grep, a repetition count can also take one of the
following forms:
? is equivalent to {0,1}.
+ is equivalent to {1,unbounded}.
Examples:
a? matches the target sequence "" and the target sequence "a" , but not the target
sequence "aa" .
a+ matches the target sequence "a" , the target sequence "aa" , and so on, but not
the target sequence "" .
In ECMAScript, all the forms of repetition count can be followed by the character ? which
designates a non-greedy repetition.
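One way to see what each repetition count accepts is to test a pattern against runs of the letter a of increasing length. The following is a sketch under that assumption; the helper name accepted_lengths is hypothetical, and the default ECMAScript grammar is used.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every n in [0, max_len] for which a string of
// n 'a' characters is matched in full by `pattern`.
std::vector<std::size_t> accepted_lengths(const std::string& pattern,
                                          std::size_t max_len)
{
    const std::regex re(pattern);
    std::vector<std::size_t> lengths;
    for (std::size_t n = 0; n <= max_len; ++n) {
        if (std::regex_match(std::string(n, 'a'), re)) {
            lengths.push_back(n);
        }
    }
    return lengths;
}
```

For example, accepted_lengths("a{2,3}", 5) yields 2 and 3, while accepted_lengths("a?", 5) yields 0 and 1.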
Concatenation
Regular expression elements, with or without repetition counts, can be concatenated to
form longer regular expressions. The resulting expression matches a target sequence that
is a concatenation of the sequences that are matched by the individual elements.
For example, a{2,3}b matches the target sequence "aab" and the target sequence
"aaab" , but doesn't match the target sequence "ab" or the target sequence "aaaab" .
Alternation
In all regular expression grammars except basic and grep, a concatenated regular
expression can be followed by the character | (pipe) and another concatenated regular
expression. Any number of concatenated regular expressions can be combined in this
manner. The resulting expression matches any target sequence that matches one or more
of the concatenated regular expressions.
When more than one of the concatenated regular expressions match the target sequence,
ECMAScript chooses the first of the concatenated regular expressions that matches the
sequence as the match, which will be referred to as the first match. The other regular
expression grammars choose the one that achieves the longest match.
For example, ab|cd matches the target sequence "ab" and the target sequence "cd" , but
doesn't match the target sequence "abd" or the target sequence "acd" .
Subexpression
In basic and grep, a subexpression is a concatenation. In the other regular expression
grammars, a subexpression is an alternation.
Grammar summary
The following table summarizes the features that are available in the various regular
expression grammars:
alternation using | + + + +
alternation using \n + +
anchor + + + + + +
back reference + + +
bracket expression + + + + + +
identity escape + + + + + +
negative assert +
non-capture group +
non-greedy repetition +
ordinary character + + + + + +
positive assert +
repetition using {} + + + +
repetition using * + + + + + +
wildcard character + + + + + +
Semantic details
Anchor
An anchor matches a position in the target string, not a character. A ^ matches the
beginning of the target string, and a $ matches the end of the target string.
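Because an anchor matches a position, its effect is easiest to see with a search rather than a full match. A minimal sketch, assuming the default ECMAScript grammar and a hypothetical helper named found_in:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when `pattern` matches some subsequence of
// `target` (a search, so the pattern need not cover the whole target).
bool found_in(const std::string& pattern, const std::string& target)
{
    return std::regex_search(target, std::regex(pattern));
}
```

Here found_in("^bc", "bcd") succeeds but found_in("^bc", "abcd") fails, because bc doesn't start at the beginning of the target string. Likewise, found_in("cd$", "abcd") succeeds and found_in("cd$", "abcde") fails.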
Back reference
A back reference is a backslash that is followed by a decimal value N. It matches the
contents of the Nth capture group. The value of N must not be more than the number of
capture groups that precede the back reference. In basic and grep, the value of N is
determined by the decimal digit that follows the backslash. In ECMAScript, the value of N is
determined by all the decimal digits that immediately follow the backslash. Therefore, in
basic and grep, the value of N is never more than 9, even if the regular expression has
more than nine capture groups. In ECMAScript, the value of N is unbounded.
Examples:
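The following sketch exercises back references in the ECMAScript and basic grammars. The helper names is_valid_pattern and matches are hypothetical. It assumes that the library reports a back reference to a nonexistent capture group by throwing std::regex_error when the pattern is compiled.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when `pattern` compiles under `grammar`.
// std::regex reports syntax errors by throwing std::regex_error.
bool is_valid_pattern(const std::string& pattern,
                      std::regex::flag_type grammar = std::regex::ECMAScript)
{
    try {
        const std::regex re(pattern, grammar);
        (void)re;  // only the construction matters here
        return true;
    } catch (const std::regex_error&) {
        return false;
    }
}

// Hypothetical helper: true when `pattern` matches the entire `target`.
bool matches(const std::string& pattern, const std::string& target,
             std::regex::flag_type grammar = std::regex::ECMAScript)
{
    return std::regex_match(target, std::regex(pattern, grammar));
}
```

For instance, matches(R"((\w+) \1)", "hello hello") is true because \1 must repeat the text that capture group 1 matched, and matches(R"(\(a\)\1)", "aa", std::regex::basic) shows the same idea with the basic grammar's \( \) delimiters.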
Bracket expression
A bracket expression defines a set of characters and collating elements. When the bracket
expression begins with the character ^ the match succeeds if no elements in the set match
the current character in the target sequence. Otherwise, the match succeeds if any one of
the elements in the set matches the current character in the target sequence.
The set of characters can be defined by listing any combination of individual characters,
character ranges, character classes, equivalence classes, and collating symbols.
Capture group
A capture group marks its contents as a single unit in the regular expression grammar and
labels the target text that matches its contents. The label that is associated with each
capture group is a number, which is determined by counting the opening parentheses that
mark capture groups up to and including the opening parenthesis that marks the current
capture group. In this implementation, the maximum number of capture groups is 31.
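The numbering rule can be illustrated with a small sketch that returns the text of each capture group in order. The helper name capture_groups is hypothetical, and the default ECMAScript grammar is assumed.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: when `pattern` matches the whole of `target`, returns
// the text of capture groups 1..N in order; otherwise returns an empty vector.
std::vector<std::string> capture_groups(const std::string& pattern,
                                        const std::string& target)
{
    std::vector<std::string> groups;
    std::smatch m;
    const std::regex re(pattern);
    if (std::regex_match(target, m, re)) {
        // m[0] is the whole match; numbered groups start at index 1.
        for (std::size_t i = 1; i < m.size(); ++i) {
            groups.push_back(m[i].str());
        }
    }
    return groups;
}
```

For the pattern ((a+)(b+))(c+) and the target "aabbbc" , counting opening parentheses from the left gives group 1 = "aabbb" , group 2 = "aa" , group 3 = "bbb" , and group 4 = "c" .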
Examples:
ab+ matches the target sequence "abb" , but doesn't match the target sequence
"abab" .
(ab)+ doesn't match the target sequence "abb" , but matches the target sequence
"abab" .
Character class
A character class in a bracket expression adds all the characters in the named class to the
character set that is defined by the bracket expression. To create a character class, use [:
followed by the name of the class, followed by :] .
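A character ch belongs to the named class if traits.isctype(ch, id) returns true, where id
is the value that traits.lookup_classname returns for the class name. The default
regex_traits template supports the class names in the following table.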
digit digits
punct punctuation
space space
xdigit digits, a , b , c , d , e , f , A , B , C , D , E , F
d same as digit
s same as space
w same as alnum
Character range
A character range in a bracket expression adds all the characters in the range to the
character set that is defined by the bracket expression. To create a character range, put the
character '-' between the first and last characters in the range. A character range puts all
characters that have a numeric value that is more than or equal to the numeric value of the
first character, and less than or equal to the numeric value of the last character, into the
set. Notice that this set of added characters depends on the platform-specific
representation of characters. If the character '-' occurs at the beginning or the end of a
bracket expression, or as the first or last character of a character range, it represents itself.
Examples:
On systems that use ASCII character encoding, [h-k] represents the set of characters
{ h , i , j , k }. It matches the target sequences "h" , "i" , and so on, but not "\x8A"
or "0" .
On systems that use EBCDIC character encoding, [h-k] represents the set of
characters { h , i , '\x8A' , '\x8B' , '\x8C' , '\x8D' , '\x8E' , '\x8F' , '\x90' , j , k
} ( h is encoded as 0x88 and k is encoded as 0x92 ). It matches the target sequences
"h" , "i" , "\x8A" , and so on, but not "0" .
On systems that use ASCII character encoding, [+--] represents the set of characters
{ + , the comma character, - }, because the comma (0x2C) lies between + (0x2B) and
- (0x2D).
However, when locale-sensitive ranges are used, the characters in a range are determined
by the collation rules for the locale. Characters that collate after the first character in the
definition of the range and before the last character in the definition of the range are in the
set. The two end characters are also in the set.
Collating element
A collating element is a multi-character sequence that is treated as a single character.
Collating symbol
A collating symbol in a bracket expression adds a collating element to the set that is
defined by the bracket expression. To create a collating symbol, use [. followed by the
collating element, followed by .]
\d [[:d:]] [[:digit:]]
\D [^[:d:]] [^[:digit:]]
\s [[:s:]] [[:space:]]
\S [^[:s:]] [^[:space:]]
\w [[:w:]] [a-zA-Z0-9_] *
\W [^[:w:]] [^a-zA-Z0-9_] *
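The equivalences in the rows above can be spot-checked by comparing two patterns over a set of sample characters. This is only a sketch: the helper name agree_on is hypothetical, and it assumes the ECMAScript grammar and the default "C" locale.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when both patterns give the same verdict for
// every single-character string built from `samples`.
bool agree_on(const std::string& pattern_a, const std::string& pattern_b,
              const std::string& samples)
{
    const std::regex a(pattern_a);
    const std::regex b(pattern_b);
    for (char ch : samples) {
        const std::string one(1, ch);
        if (std::regex_match(one, a) != std::regex_match(one, b)) {
            return false;
        }
    }
    return true;
}
```

For example, agree_on(R"(\d)", "[[:digit:]]", "0123456789abcXYZ _-") checks that the escape and the named class accept exactly the same sample characters.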
Equivalence class
An equivalence class in a bracket expression adds all the characters and collating elements
that are equivalent to the collating element in the equivalence class definition to the set
that is defined by the bracket expression.
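To create an equivalence class, use [= followed by a collating element, followed by =] .
Two collating elements elt1 and elt2 are equivalent if
traits.transform_primary(elt1.begin(), elt1.end()) ==
traits.transform_primary(elt2.begin(), elt2.end()) .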
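File format escape
A file format escape consists of the usual C language character escape sequences \\ , \a ,
\b , \f , \n , \r , \t , and \v . These represent backslash, alert,
backspace, form feed, newline, carriage return, horizontal tab, and vertical tab, respectively.
In ECMAScript, \a and \b aren't allowed. ( \\ is allowed, but it's an identity escape, not a
file format escape.)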
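Hexadecimal escape
A hexadecimal escape is a backslash followed by the letter x and two hexadecimal digits. It
matches the character in the target sequence whose value is represented by those digits.
For example, "\x41" matches the target sequence "A" when ASCII character encoding is
used.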
Identity escape
An identity escape is a backslash followed by a single character. It matches that character.
It's required when the character has a special meaning. Using the identity escape removes
the special meaning. For example:
a* matches the target sequence "aaa" , but doesn't match the target sequence
"a*" .
a\* doesn't match the target sequence "aaa" , but matches the target sequence
"a*" .
The set of characters that are allowed in an identity escape depends on the regular
expression grammar, as shown in the following table.
basic, grep: { ( ) { } . [ \ * ^ $ }
extended, egrep: { ( ) { . [ \ * ^ $ + ? | }
ECMAScript: All characters except those that can be part of an identifier. Typically,
identifier characters include letters, digits, $ , _ , and unicode escape sequences. For more
information, see the ECMAScript Language Specification.
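A common use of identity escapes is to match arbitrary text literally. The sketch below prefixes each character that is special in the ECMAScript grammar with a backslash. The helper name escape_ecmascript is hypothetical and isn't part of the <regex> library.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: prefixes every ECMAScript special character in `text`
// with a backslash so that the resulting pattern matches `text` literally.
std::string escape_ecmascript(const std::string& text)
{
    static const std::string special = R"(^$\.*+?()[]{}|)";
    std::string escaped;
    escaped.reserve(text.size() * 2);
    for (char ch : text) {
        if (special.find(ch) != std::string::npos) {
            escaped += '\\';  // identity escape removes the special meaning
        }
        escaped += ch;
    }
    return escaped;
}
```

For example, escape_ecmascript("a*") produces the pattern a\* , which matches the target sequence "a*" but not "aaa" .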
Individual character
An individual character in a bracket expression adds that character to the character set that
is defined by the bracket expression. Anywhere in a bracket expression except at the
beginning, a ^ represents itself.
Examples:
[abc] matches the target sequences "a" , "b" , and "c" , but not the sequence "d" .
[^abc] matches the target sequence "d" , but not the target sequences "a" , "b" , or
"c" .
[a^bc] matches the target sequences "a" , "b" , "c" , and "^" , but not the target
sequence "d" .
In all regular expression grammars except ECMAScript, if a ] is the first character that
follows the opening [ or is the first character that follows an initial ^ , it represents itself.
Examples:
[]abc] matches the target sequences "a" , "b" , "c" , and "]" , but not the target
sequence "d" .
[^]abc] matches the target sequence "d" , but not the target sequences "a" , "b" ,
"c" , or "]" .
In ECMAScript, use \] to represent the character ] in a bracket expression.
Examples:
[]a matches the target sequence "a" because the bracket expression is empty.
[\]abc] matches the target sequences "a" , "b" , "c" , and "]" but not the target
sequence "d" .
Negative assert
A negative assert matches anything but its contents. It doesn't consume any characters in
the target sequence.
For example, (?!aa)(a*) matches the target sequence "a" and associates capture group 1
with the subsequence "a" . It doesn't match the target sequence "aa" or the target
sequence "aaa" .
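In the ECMAScript grammar, a negative assert is written (?!subexpression) . A minimal sketch of the example above, using a hypothetical helper named matches_without_aa_prefix:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: matches a run of 'a' characters in full, but only when
// the target does not begin with "aa". The assert itself consumes nothing.
bool matches_without_aa_prefix(const std::string& target)
{
    static const std::regex re(R"((?!aa)(a*))");
    return std::regex_match(target, re);
}
```

The call succeeds for "a" and for the empty string, and fails for "aa" and "aaa" , because the assert rejects any position at which "aa" follows.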
Non-capture group
A non-capture group marks its contents as a single unit in the regular expression grammar,
but doesn't label the target text.
For example, (a)(?:b)*(c) matches the target text "abbc" and associates capture group
1 with the subsequence "a" and capture group 2 with the subsequence "c" .
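The difference between the two kinds of group shows up in the number of capture groups that a compiled expression reports. A short sketch, assuming the ECMAScript grammar and a hypothetical helper named group_count:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: number of capture groups in `pattern`, as reported by
// std::regex::mark_count(). Non-capture groups are not counted.
unsigned group_count(const std::string& pattern)
{
    return std::regex(pattern).mark_count();
}
```

Here group_count("(a)(?:b)*(c)") is 2, whereas group_count("(a)(b)*(c)") is 3.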
Non-greedy repetition
A non-greedy repetition consumes the shortest subsequence of the target sequence that
matches the pattern. A greedy repetition consumes the longest. For example, (a+)(a*b)
matches the target sequence "aaab" .
When a non-greedy repetition is used, as in (a+?)(a*b) , it associates capture group 1 with
the subsequence "a" at the beginning of the target sequence and capture group 2 with the
subsequence "aab" at the end of the target sequence.
When a greedy match is used, it associates capture group 1 with the subsequence "aaa"
and capture group 2 with the subsequence "b" .
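The two behaviors can be compared side by side with a sketch like the following, which returns the text of the first two capture groups. The helper name two_groups is hypothetical. In ECMAScript, the non-greedy form of a+ is written a+? .

```cpp
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: for a pattern with exactly two capture groups, returns
// the text of groups 1 and 2 when the pattern matches the whole target;
// otherwise returns a pair of empty strings.
std::pair<std::string, std::string> two_groups(const std::string& pattern,
                                               const std::string& target)
{
    std::smatch m;
    const std::regex re(pattern);
    if (std::regex_match(target, m, re) && m.size() == 3) {
        return {m[1].str(), m[2].str()};
    }
    return {};
}
```

For the target "aaab" , the greedy pattern (a+)(a*b) gives "aaa" and "b" , and the non-greedy pattern (a+?)(a*b) gives "a" and "aab" .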
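Octal escape
An octal escape is a backslash followed by one, two, or three octal digits. It matches the
character in the target sequence whose value is represented by those digits.
For example, \101 matches the target sequence "A" when ASCII character encoding is
used.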
Ordinary character
An ordinary character is any valid character that doesn't have a special meaning in the
current grammar.
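In ECMAScript, the following characters have special meanings:
^ $ \ . * + ? ( ) [ ] { } |
In basic and grep, the following characters have special meanings:
. [ \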
Also in basic and grep, the following characters have special meanings when they're used
in a particular context:
* has a special meaning in all cases except when it's the first character in a regular
expression or the first character that follows an initial ^ in a regular expression, or
when it's the first character of a capture group or the first character that follows an
initial ^ in a capture group.
^ has a special meaning when it's the first character of a regular expression.
$ has a special meaning when it's the last character of a regular expression.
In extended, egrep, and awk, the following characters have special meanings:
. [ \ ( * + ? { |
Also in extended, egrep, and awk, the following characters have special meanings when
they're used in a particular context.
^ has a special meaning when it's the first character of a regular expression.
$ has a special meaning when it's the last character of a regular expression.
An ordinary character matches the same character in the target sequence. By default, this
means that the match succeeds if the two characters are represented by the same value. In
a case-insensitive match, two characters ch0 and ch1 match if
traits.translate_nocase(ch0) == traits.translate_nocase(ch1) . In a locale-
sensitive match, two characters ch0 and ch1 match if traits.translate(ch0) ==
traits.translate(ch1) .
Positive assert
A positive assert matches its contents, but doesn't consume any characters in the target
sequence.
Examples:
(?=aa)(a*) matches the target sequence "aaaa" and associates capture group 1 with
the subsequence "aaaa" .
(aa)(a*) matches the target sequence "aaaa" and associates capture group 1 with
the subsequence "aa" at the beginning of the target sequence and capture group 2
with the subsequence "aa" at the end of the target sequence.
(?=aa)(a)|(a) matches the target sequence "a" and associates capture group 1 with
an empty sequence (because the positive assert failed) and capture group 2 with the
subsequence "a" . A search in the target sequence "aa" also succeeds: the positive
assert succeeds, the match consumes only the first "a" , capture group 1 is associated
with that subsequence "a" , and capture group 2 is associated with an empty sequence.
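In the ECMAScript grammar, a positive assert is written (?=subexpression) . The following sketch lists the capture groups that a search produces; the helper name searched_groups is hypothetical. A group that didn't participate in the match is reported as an empty string.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: searches `target` for `pattern` and returns the text
// of capture groups 1..N. Groups that did not participate yield "".
// Returns an empty vector when the search fails.
std::vector<std::string> searched_groups(const std::string& pattern,
                                         const std::string& target)
{
    std::vector<std::string> groups;
    std::smatch m;
    const std::regex re(pattern);
    if (std::regex_search(target, m, re)) {
        for (std::size_t i = 1; i < m.size(); ++i) {
            groups.push_back(m[i].str());
        }
    }
    return groups;
}
```

For example, searched_groups("(?=aa)(a*)", "aaaa") returns a single group "aaaa" , because the assert checks for "aa" without consuming it, and searched_groups("(aa)(a*)", "aaaa") returns "aa" and "aa" .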
Wildcard character
A wildcard character matches any character in the target sequence except a newline.
Word boundary
A word boundary occurs in the following situations:
The current character is at the beginning of the target sequence and is one of the
word characters A-Za-z0-9_
The current character position is past the end of the target sequence and the last
character in the target sequence is one of the word characters.
The current character is one of the word characters and the preceding character isn't.
The current character isn't one of the word characters and the preceding character is.
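The word boundary rules above can be tried with the ECMAScript asserts \b and \B . This is a minimal sketch; the helper names full_match and found_in are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when `pattern` matches the entire `target`.
bool full_match(const std::string& pattern, const std::string& target)
{
    return std::regex_match(target, std::regex(pattern));
}

// Hypothetical helper: true when `pattern` matches somewhere in `target`.
bool found_in(const std::string& pattern, const std::string& target)
{
    return std::regex_search(target, std::regex(pattern));
}
```

With these helpers, full_match(R"(a\b.)", "a~") is true because a word character is followed by a non-word character, while full_match(R"(a\b.)", "ab") is false. Similarly, found_in(R"(\bcat\b)", "a cat sat") is true, but found_in(R"(\bcat\b)", "concatenate") is false.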
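Matching and searching
A regular expression search succeeds when some subsequence of the target sequence
matches the regular expression; the search finds the first such subsequence.
Examples: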
A search for the regular expression bcd in the target sequence "bcd" succeeds and
matches the entire sequence. The same search in the target sequence "abcd" also
succeeds and matches the last three characters. The same search in the target
sequence "bcde" also succeeds and matches the first three characters.
A search for the regular expression bcd in the target sequence "bcdbcd" succeeds
and matches the first three characters.
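The search examples above can be reproduced with std::regex_search , which reports where the matching subsequence begins. A sketch, assuming a hypothetical helper named search_position that returns -1 when nothing is found:

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: returns the offset of the first subsequence of
// `target` that matches `pattern`, or -1 when the search fails.
std::ptrdiff_t search_position(const std::string& pattern,
                               const std::string& target)
{
    std::smatch m;
    const std::regex re(pattern);
    if (std::regex_search(target, m, re)) {
        return m.position(0);
    }
    return -1;
}
```

For the pattern bcd , the helper returns 0 for "bcd" , "bcde" , and "bcdbcd" , and 1 for "abcd" .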
If there's more than one subsequence that matches at some location in the target
sequence, there are two ways to choose the matching pattern.
First match chooses the subsequence that was found first when the regular expression is
matched.
Longest match chooses the longest subsequence from the ones that match at that
location. If there's more than one subsequence that has the maximal length, longest match
chooses the one that was found first.
For example, when first match is used, a search for the regular expression b|bc in the
target sequence "abcd" matches the subsequence "b" because the left-hand term of the
alternation matches that subsequence; therefore, first match doesn't try the right-hand
term of the alternation. When longest match is used, the same search matches "bc"
because "bc" is longer than "b" .
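The difference between the two rules can be observed by running the same search under two grammars. The sketch below assumes that the library applies the first-match rule for ECMAScript and the longest-match rule for the POSIX extended grammar, as described above. The helper name first_found is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the text of the first subsequence of `target`
// that matches `pattern` under `grammar`, or "" when the search fails.
std::string first_found(const std::string& pattern, const std::string& target,
                        std::regex::flag_type grammar)
{
    std::smatch m;
    const std::regex re(pattern, grammar);
    if (std::regex_search(target, m, re)) {
        return m.str();
    }
    return {};
}
```

Under these assumptions, first_found("b|bc", "abcd", std::regex::ECMAScript) returns "b" , and the same call with std::regex::extended returns "bc" .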
A partial match succeeds if the match reaches the end of the target sequence without
failing, even if it hasn't reached the end of the regular expression. Therefore, after a partial
match succeeds, appending characters to the target sequence could cause a later partial
match to fail. However, after a partial match fails, appending characters to the target
sequence can't cause a later partial match to succeed. For example, with a partial match,
ab matches the target sequence "a" but not "ac" .
Format flags
ECMAScript format rule | sed format rule | Replacement text
$& | & | The character sequence that matches the entire regular expression: [match[0].first, match[0].second)
$$ | | $
| \& | &
$` (dollar sign followed by back quote) | | The character sequence that precedes the subsequence that matches the regular expression: [match.prefix().first, match.prefix().second)
$' (dollar sign followed by forward quote) | | The character sequence that follows the subsequence that matches the regular expression: [match.suffix().first, match.suffix().second)
| \\n | \n
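The format rules can be tried with std::regex_replace , which uses the ECMAScript rules by default and the sed rules when std::regex_constants::format_sed is passed. This is a sketch only; the helper name replace_all is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replaces every match of `pattern` in `text` with the
// expansion of `format`, using the given format flags.
std::string replace_all(
    const std::string& text, const std::string& pattern,
    const std::string& format,
    std::regex_constants::match_flag_type flags =
        std::regex_constants::format_default)
{
    return std::regex_replace(text, std::regex(pattern), format, flags);
}
```

For example, replace_all("abc", "b", "[$&]") returns "a[b]c" , and replace_all("abc", "b", "[&]", std::regex_constants::format_sed) produces the same result with the sed rules.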
See also
C++ Standard Library Overview