-
-
Notifications
You must be signed in to change notification settings - Fork 32k
bpo-433030: Add support of atomic grouping in regular expressions #31982
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bpo-433030: Add support of atomic grouping in regular expressions #31982
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know anything about re
's implementation. Does anyone? 😉
Doc/library/re.rst
Outdated
successful, continues to match the rest of the pattern following it. If the | ||
subsequent pattern fails to match, the stack can only be unwound to a point | ||
*before* the ``(?>...)`` because once exited, the expression, known as an | ||
:dfn:`Atomic Group`, has thrown away all stack points within itself. Thus, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is 'stack point' the same as 'backtrack point'?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure yet.
BTW, I'll add that I think these (atomic groups and possessive quantifiers) are, and by far, the most important things Python's engine is missing among "modern" features. Although, ya, I would rather see MRAB fold his |
capture = True | ||
atomic = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could use a three-state flag (capturing/non-capturing/atomic) instead of two boolean flags. It could even be a tiny bit faster if use built-in singletons True/False/None. But the current code may be clearer.
I did have doubts about this feature, because it is an advanced feature not supported in older regular expression implementations, and it is difficult to learning. But some regular expressions (e.g. in textwrap) can be simplified with atomic groups, and we have found recently that atomic groups can allow to fix some security issues. So there is a need of atomic groups in the stdlib. Slowly they will be used in user code too. |
I'd say, to the contrary, that these are easy to learn. Indeed, that's why they're near-universal now among other regexp engines. The possibility for backtracking is rarely needed for lexical analysis, and gimmicks that promise not to backtrack save users from worlds of unintended quadratic, cubic, ..., even exponential time (non-)matching disaster cases. |
Example: a while back, our But I expect there are only a relative handful of people now who could repair that if another bug turns up, because the workaround was so obscure and technical. With atomic groups, though, the workaround could be entirely straightforward - indeed, it would be obvious. |
This is a really impressive example. I shudder at the thought that I would have to repeat this work in a modified copy of If there are no objections I am going to merge this PR tomorrow. I can't say that I understand this code 100%, but on the other hand I don't see anything openly suspicious, the tests passed successfully, and the core has remained stable for many years. |
I agree merging is the best thing to do, and especially since it's early in the release cycle. Some encouragement: while I haven't studied Python's re engine, as a general principle engines "like this" use much the same implementation code for atomic groups and forward lookahead assertions. Both try to match starting from the current position, and once they find something that matches, they're done, period - if, after getting out of the group, something later fails to match, they do no internal backtracking. The only "real" difference is that the atomic group consumes the substring it matched (advances the current search position), but the assertion does not. That's why I expect Fredrick said (on the bpo report) that most of the code to do atomic groups was already there. After you merge this, I intend to change |
There are some differences in a code of atomic groups and lookahead assertions. I have tested them, and atomic groups seems work correctly. For example But I have found a bug in possessive repetition. For example, |
Other bug: Well, I wrote the code which allows non-possessive repetition to work correctly in these cases, so I know how to fix it. |
@serhiy-storchaka Shouldn't that not be a bug and shouldn't that match |
@thatbirdguythatuknownot, that's a bug. Note that |
I see. Thanks for clarifying. |
https://bugs.python.org/issue433030