-
Notifications
You must be signed in to change notification settings - Fork 815
Fix unescaping implementation in parser #289
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
tests/test_parser.py
Outdated
self.assertEqual(_replace_escaping('\\\\n'), '\\n') | ||
self.assertEqual(_replace_escaping('\\\\\\n'), '\\\n') | ||
self.assertEqual(_replace_escaping('\\"'), '"') | ||
self.assertEqual(_replace_escaping('\\\\"'), '\\"') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't valid input. This would also be clearer if you used r on the input strings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, what would you like to do about that invalid input: fix the parser to raise an error or just not test for it?
The initial version had r
s, but it doesn't work for strings that end with backslashes (r'\'
) or strings that should contain actual newlines (\n
), so then you have to put some strings with r
and some without, and it felt like it added a bit to the confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd mostly look to test this a level up with whole lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call. I've moved it to work with input lines instead and also found bugs in escaping handling in _parse_labels, so I proceeded to rewrite all the parser on top of re
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There were just some changes contributed to improve performance, and I'd rather not change all that again with an RE implementation rather than the explicit FSM that's in use. If you can fix it under the current setup so thsat at least all valid input is correctly handled, that's sufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I see how this PR is different to 282. Before 282 there was an explicit FSM implementation, and it was replaced by an explicit algorithmic approach that's easy to get wrong, which is exactly what happened. This PR suggests to replace the algorithmic approach with regexps, because they offer a concise and easy-to-understand way to work with sequences of characters that also happens to be faster in some cases than pure-python implementations. Could you elaborate what's wrong with it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're changing from a standard FSM approach for a parser, to an unusual mix of FSM and regex. This is confusing and will be harder to maintain. Please restrict your fix to only what it needs to change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to restate that the departure from a standard FSM approach to a somewhat confusing one has already happened in #282.
I don't think I agree that regexes are unusual in the parsing business, they have been around at least since lex & yacc duo which were introduced in the 1970s.
But OK, the decision is up to you.
bff99ea
to
1f19879
Compare
Signed-off-by: immerrr <immerrr@gmail.com>
Signed-off-by: immerrr <immerrr@gmail.com>
Closing in favour of #291 |
Hi!
This PR addresses some bugs introduced with the recent parser rewrite (#282). The old implementation
would produce invalid results in presence of
\\n
substring in input:\\n -> \n
(simply unescaping the\\
)\\n -> \<NL>
, where<NL>
means newline characterreplace('\\\\', '\\')
at the beginning it would still be wrong (\\n -> \n -> <NL>
)Also, the rewritten version of
_parse_labels
didn't work well with"foo\\"
strings, because it only checked for one slash before quotes. With all that in mind I went on to rewrite the entire thing with regexes.It should be more correct and more performant. Taking the input values from #282 on Python 3.6 gave me the following.
Short string, 1M reps
Long string 100k reps: