8000 bpo-433030: Add support of atomic grouping in regular expressions (GH… · python/cpython@345b390 · GitHub
[go: up one dir, main page]

Skip to content

Commit 345b390

Browse files
serhiy-storchakaJeffrey C. Jacobs
and
Jeffrey C. Jacobs
authored
bpo-433030: Add support of atomic grouping in regular expressions (GH-31982)
* Atomic grouping: (?>...). * Possessive quantifiers: x++, x*+, x?+, x{m,n}+. Equivalent to (?>x+), (?>x*), (?>x?), (?>x{m,n}). Co-authored-by: Jeffrey C. Jacobs <timehorse@users.sourceforge.net>
1 parent 2bde682 commit 345b390

File tree

11 files changed

+593
-92
lines changed

11 files changed

+593
-92
lines changed

Doc/library/re.rst

Lines changed: 54 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -154,6 +154,30 @@ The special characters are:
154154
characters as possible will be matched. Using the RE ``<.*?>`` will match
155155
only ``'<a>'``.
156156

157+
.. index::
158+
single: *+; in regular expressions
159+
single: ++; in regular expressions
160+
single: ?+; in regular expressions
161+
162+
``*+``, ``++``, ``?+``
163+
Like the ``'*'``, ``'+'``, and ``'?'`` qualifiers, those where ``'+'`` is
164+
appended also match as many times as possible.
165+
However, unlike the true greedy qualifiers, these do not allow
166+
back-tracking when the expression following it fails to match.
167+
These are known as :dfn:`possessive` qualifiers.
168+
For example, ``a*a`` will match ``'aaaa'`` because the ``a*`` will match
169+
all 4 ``'a'``s, but, when the final ``'a'`` is encountered, the
170+
expression is backtracked so that in the end the ``a*`` ends up matching
171+
3 ``'a'``s total, and the fourth ``'a'`` is matched by the final ``'a'``.
172+
However, when ``a*+a`` is used to match ``'aaaa'``, the ``a*+`` will
173+
match all 4 ``'a'``, but when the final ``'a'`` fails to find any more
174+
characters to match, the expression cannot be backtracked and will thus
175+
fail to match.
176+
``x*+``, ``x++`` and ``x?+`` are equivalent to ``(?>x*)``, ``(?>x+)``
177+
and ``(?>x?)`` correspondigly.
178+
179+
.. versionadded:: 3.11
180+
157181
.. index::
158182
single: {} (curly brackets); in regular expressions
159183

@@ -178,6 +202,21 @@ The special characters are:
178202
6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
179203
while ``a{3,5}?`` will only match 3 characters.
180204

205+
``{m,n}+``
206+
Causes the resulting RE to match from *m* to *n* repetitions of the
207+
preceding RE, attempting to match as many repetitions as possible
208+
*without* establishing any backtracking points.
209+
This is the possessive version of the qualifier above.
210+
For example, on the 6-character string ``'aaaaaa'``, ``a{3,5}+aa``
211+
attempt to match 5 ``'a'`` characters, then, requiring 2 more ``'a'``s,
212+
will need more characters than available and thus fail, while
213+
``a{3,5}aa`` will match with ``a{3,5}`` capturing 5, then 4 ``'a'``s
214+
by backtracking and then the final 2 ``'a'``s are matched by the final
215+
``aa`` in the pattern.
216+
``x{m,n}+`` is equivalent to ``(?>x{m,n})``.
217+
218+
.. versionadded:: 3.11
219+
181220
.. index:: single: \ (backslash); in regular expressions
182221

183222
``\``
@@ -336,6 +375,21 @@ The special characters are:
336375
.. versionchanged:: 3.7
337376
The letters ``'a'``, ``'L'`` and ``'u'`` also can be used in a group.
338377

378+
``(?>...)``
379+
Attempts to match ``...`` as if it was a separate regular expression, and
380+
if successful, continues to match the rest of the pattern following it.
381+
If the subsequent pattern fails to match, the stack can only be unwound
382+
to a point *before* the ``(?>...)`` because once exited, the expression,
383+
known as an :dfn:`atomic group`, has thrown away all stack points within
384+
itself.
385+
Thus, ``(?>.*).`` would never match anything because first the ``.*``
386+
would match all characters possible, then, having nothing left to match,
387+
the final ``.`` would fail to match.
388+
Since there are no stack points saved in the Atomic Group, and there is
389+
no stack point before it, the entire expression would thus fail to match.
390+
391+
.. versionadded:: 3.11
392+
339393
.. index:: single: (?P<; in regular expressions
340394

341395
``(?P<name>...)``

Doc/whatsnew/3.11.rst

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -295,6 +295,12 @@ os
295295
instead of ``CryptGenRandom()`` which is deprecated.
296296
(Contributed by Dong-hee Na in :issue:`44611`.)
297297

298+
re
299+
--
300+
301+
* Atomic grouping (``(?>...)``) and possessive qualifiers (``*+``, ``++``,
302+
``?+``, ``{m,n}+``) are now supported in regular expressions.
303+
(Contributed by Jeffrey C. Jacobs and Serhiy Storchaka in :issue:`433030`.)
298304

299305
shutil
300306
------

Lib/sre_compile.py

Lines changed: 27 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -17,11 +17,16 @@
1717
assert _sre.MAGIC == MAGIC, "SRE module mismatch"
1818

1919
_LITERAL_CODES = {LITERAL, NOT_LITERAL}
20-
_REPEATING_CODES = {REPEAT, MIN_REPEAT, MAX_REPEAT}
2120
_SUCCESS_CODES = {SUCCESS, FAILURE}
2221
_ASSERT_CODES = {ASSERT, ASSERT_NOT}
2322
_UNIT_CODES = _LITERAL_CODES | {ANY, IN}
2423

24+
_REPEATING_CODES = {
25+
MIN_REPEAT: (REPEAT, MIN_UNTIL, MIN_REPEAT_ONE),
26+
MAX_REPEAT: (REPEAT, MAX_UNTIL, REPEAT_ONE),
27+
POSSESSIVE_REPEAT: (POSSESSIVE_REPEAT, SUCCESS, POSSESSIVE_REPEAT_ONE),
28+
}
29+
2530
# Sets of lowercase characters which have the same uppercase.
2631
_equivalences = (
2732
# LATIN SMALL LETTER I, LATIN SMALL LETTER DOTLESS I
@@ -138,27 +143,21 @@ def _compile(code, pattern, flags):
138143
if flags & SRE_FLAG_TEMPLATE:
139144
raise error("internal: unsupported template operator %r" % (op,))
140145
if _simple(av[2]):
141-
if op is MAX_REPEAT:
142-
emit(REPEAT_ONE)
143-
else:
144-
emit(MIN_REPEAT_ONE)
146+
emit(REPEATING_CODES[op][2])
145147
skip = _len(code); emit(0)
146148
emit(av[0])
147149
emit(av[1])
148150
_compile(code, av[2], flags)
149151
emit(SUCCESS)
150152
code[skip] = _len(code) - skip
151153
else:
152-
emit(REPEAT)
154+
emit(REPEATING_CODES[op][0])
153155
skip = _len(code); emit(0)
154156
emit(av[0])
155157
emit(av[1])
156158
_compile(code, av[2], flags)
157159
code[skip] = _len(code) - skip
158-
if op is MAX_REPEAT:
159-
emit(MAX_UNTIL)
160-
else:
161-
emit(MIN_UNTIL)
160+
emit(REPEATING_CODES[op][1])
162161
elif op is SUBPATTERN:
163162
group, add_flags, del_flags, p = av
164163
if group:
@@ -169,6 +168,17 @@ def _compile(code, pattern, flags):
169168
if group:
170169
emit(MARK)
171170
emit((group-1)*2+1)
171+
elif op is ATOMIC_GROUP:
172+
# Atomic Groups are handled by starting with an Atomic
173+
# Group op code, then putting in the atomic group pattern
174+
# and finally a success op code to tell any repeat
175+
# operations within the Atomic Group to stop eating and
176+
# pop their stack if they reach it
177+
emit(ATOMIC_GROUP)
178+
skip = _len(code); emit(0)
179+
_compile(code, av, flags)
180+
emit(SUCCESS)
181+
code[skip] = _len(code) - skip
172182
elif op in SUCCESS_CODES:
173183
emit(op)
174184
elif op in ASSERT_CODES:
@@ -709,7 +719,8 @@ def print_2(*args):
709719
else:
710720
print_(FAILURE)
711721
i += 1
712-
elif op in (REPEAT, REPEAT_ONE, MIN_REPEAT_ONE):
722+
elif op in (REPEAT, REPEAT_ONE, MIN_REPEAT_ONE,
723+
POSSESSIVE_REPEAT, POSSESSIVE_REPEAT_ONE):
713724
skip, min, max = code[i: i+3]
714725
if max == MAXREPEAT:
715726
max = 'MAXREPEAT'
@@ -725,6 +736,11 @@ def print_2(*args):
725736
print_(op, skip, arg, to=i+skip)
726737
dis_(i+2, i+skip)
727738
i += skip
739+
elif op is ATOMIC_GROUP:
740+
skip = code[i]
741+
print_(op, skip, to=i+skip)
742+
dis_(i+1, i+skip)
743+
i += skip
728744
elif op is INFO:
729745
skip, flags, min, max = code[i: i+4]
730746
if max == MAXREPEAT:

Lib/sre_constants.py

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -13,7 +13,7 @@
1313

1414
# update when constants are added or removed
1515

16-
MAGIC = 20171005
16+
MAGIC = 20220318
1717

1818
from _sre import MAXREPEAT, MAXGROUPS
1919

@@ -97,6 +97,9 @@ def _makecodes(names):
9797
REPEAT_ONE
9898
SUBPATTERN
9999
MIN_REPEAT_ONE
100+
ATOMIC_GROUP
101+
POSSESSIVE_REPEAT
102+
POSSESSIVE_REPEAT_ONE
100103
101104
GROUPREF_IGNORE
102105
IN_IGNORE

Lib/sre_parse.py

Lines changed: 26 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -25,7 +25,7 @@
2525

2626
WHITESPACE = frozenset(" \t\n\r\v\f")
2727

28-
_REPEATCODES = frozenset({MIN_REPEAT, MAX_REPEAT})
28+
_REPEATCODES = frozenset({MIN_REPEAT, MAX_REPEAT, POSSESSIVE_REPEAT})
2929
_UNITCODES = frozenset({ANY, RANGE, IN, LITERAL, NOT_LITERAL, CATEGORY})
3030

3131
ESCAPES = {
@@ -190,6 +190,10 @@ def getwidth(self):
190190
i, j = av.getwidth()
191191
lo = lo + i
192192
hi = hi + j
193+
elif op is ATOMIC_GROUP:
194+
i, j = av.getwidth()
195+
lo = lo + i
196+
hi = hi + j
193197
elif op is SUBPATTERN:
194198
i, j = av[-1].getwidth()
195199
lo = lo + i
@@ -675,16 +679,22 @@ def _parse(source, state, verbose, nested, first=False):
675679
if group is None and not add_flags and not del_flags:
676680
item = p
677681
if sourcematch("?"):
682+
# Non-Greedy Match
678683
subpattern[-1] = (MIN_REPEAT, (min, max, item))
684+
elif sourcematch("+"):
685+
# Possessive Match (Always Greedy)
686+
subpattern[-1] = (POSSESSIVE_REPEAT, (min, max, item))
679687
else:
688+
# Greedy Match
680689
subpattern[-1] = (MAX_REPEAT, (min, max, item))
681690

682691
elif this == ".":
683692
subpatternappend((ANY, None))
684693

685694
elif this == "(":
686695
start = source.tell() - 1
687-
group = True
696+
capture = True
697+
atomic = False
688698
name = None
689699
add_flags = 0
690700
del_flags = 0
@@ -726,7 +736,7 @@ def _parse(source, state, verbose, nested, first=False):
726736
len(char) + 2)
727737
elif char == ":":
728738
# non-capturing group
729-
group = None
739+
capture = False
730740
elif char == "#":
731741
# comment
732742
while True:
@@ -800,6 +810,10 @@ def _parse(source, state, verbose, nested, first=False):
800810
subpatternappend((GROUPREF_EXISTS, (condgroup, item_yes, item_no)))
801811
continue
802812

813+
elif char == ">":
814+
# non-capturing, atomic group
815+
capture = False
816+
atomic = True
803817
elif char in FLAGS or char == "-":
804818
# flags
805819
flags = _parse_flags(source, state, char)
@@ -813,17 +827,19 @@ def _parse(source, state, verbose, nested, first=False):
813827
continue
814828

815829
add_flags, del_flags = flags
816-
group = None
830+
capture = False
817831
else:
818832
raise source.error("unknown extension ?" + char,
819833
len(char) + 1)
820834

821835
# parse group contents
822-
if group is not None:
836+
if capture:
823837
try:
824838
group = state.opengroup(name)
825839
except error as err:
826840
raise source.error(err.msg, len(name) + 1) from None
841+
else:
842+
group = None
827843
sub_verbose = ((verbose or (add_flags & SRE_FLAG_VERBOSE)) and
828844
not (del_flags & SRE_FLAG_VERBOSE))
829845
p = _parse_sub(source, state, sub_verbose, nested + 1)
@@ -832,7 +848,11 @@ def _parse(source, state, verbose, nested, first=False):
832848
source.tell() - start)
833849
if group is not None:
834850
state.closegroup(group, p)
835-
subpatternappend((SUBPATTERN, (group, add_flags, del_flags, p)))
851+
if atomic:
852+
assert group is None
853+
subpatternappend((ATOMIC_GROUP, p))
854+
else:
855+
subpatternappend((SUBPATTERN, (group, add_flags, del_flags, p)))
836856

837857
elif this == "^":
838858
subpatternappend((AT, AT_BEGINNING))

0 commit comments

Comments
 (0)
0