8000 Fix unescaping implementation in parser by immerrr · Pull Request #289 · prometheus/client_python · GitHub
[go: up one dir, main page]

Skip to content
8000

Fix unescaping implementation in parser #289

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

immerrr
Copy link
Contributor
@immerrr immerrr commented Jul 18, 2018

Hi!

This PR addresses some bugs introduced with the recent parser rewrite (#282). The old implementation

s.replace("\\n", "\n").replace('\\\\', '\\').replace('\\"', '"')

would produce invalid results in presence of \\n substring in input:

  • the expected result AFAIU is \\n -> \n (simply unescaping the \\)
  • the implementation from Text parser optimization (~4.5x perf) #282 would do \\n -> \<NL>, where <NL> means newline character
  • if you place replace('\\\\', '\\') at the beginning it would still be wrong (\\n -> \n -> <NL>)

Also, the rewritten version of _parse_labels didn't work well with "foo\\" strings, because it only checked for one slash before quotes. With all that in mind I went on to rewrite the entire thing with regexes.

It should be more correct and more performant. Taking the input values from #282 on Python 3.6 gave me the following.

Short string, 1M reps

# OLD
In [7]: %time for i in range(1000000*5): _parse_sample('simple_metric 1.513767429e+09')
CPU times: user 13.6 s, sys: 0 ns, total: 13.6 s
Wall time: 13.6 s

#NEW
In [16]: %time for i in range(1000000*5): _parse_sample('simple_metric 1.513767429e+09')
CPU times: user 9.97 s, sys: 0 ns, total: 9.97 s
Wall time: 9.97 s

Long string 100k reps:

# OLD
In [5]: %time for i in range(100000*5): _parse_sample('kube_service_labels{label_app="kube-state-metrics",label_chart="kube-state-metrics-0.5.0",label_heritage="Tiller",label_release="ungaged
   ...: -panther",namespace="default",service="ungaged-panther-kube-state-metrics"} 1')
CPU times: user 9.07 s, sys: 0 ns, total: 9.07 s
Wall time: 9.07 s

# NEW
In [13]: %time for i in range(100000*5): _parse_sample('kube_service_labels{label_app="kube-state-metrics",label_chart="kube-state-metrics-0.5.0",label_heritage="Tiller",label_release="ungage
    ...: d-panther",namespace="default",service="ungaged-panther-kube-state-metrics"} 1')
CPU times: user 7.49 s, sys: 3 µs, total: 7.49 s
Wall time: 7.49 s

self.assertEqual(_replace_escaping('\\\\n'), '\\n')
self.assertEqual(_replace_escaping('\\\\\\n'), '\\\n')
self.assertEqual(_replace_escaping('\\"'), '"')
self.assertEqual(_replace_escaping('\\\\"'), '\\"')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't valid input. This would also be clearer if you used r on the input strings

Copy link
Contributor Author
@immerrr immerrr Jul 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, what would you like to do about that invalid input: fix the parser to raise an error or just not test for it?

The initial version had rs, but it doesn't work for strings that end with backslashes (r'\') or strings that should contain actual newlines (\n), so then you have to put some strings with r and some without, and it felt like it added a bit to the confusion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd mostly look to test this a level up with whole lines.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. I've moved it to work with input lines instead and also found bugs in escaping handling in _parse_labels, so I proceeded to rewrite all the parser on top of re.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were just some changes contributed to improve performance, and I'd rather not change all that again with an RE implementation rather than the explicit FSM that's in use. If you can fix it under the current setup so thsat at least all valid input is correctly handled, that's sufficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I see how this PR is different to 282. Before 282 there was an explicit FSM implementation, and it was replaced by an explicit algorithmic approach that's easy to get wrong, which is exactly what happened. This PR suggests to replace the algorithmic approach with regexps, because they offer a concise and easy-to-understand way to work with sequences of characters that also happens to be faster in some cases than pure-python implementations. Could you elaborate what's wrong with it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're changing from a standard FSM approach for a parser, to an unusual mix of FSM and regex. This is confusing and will be harder to maintain. Please restrict your fix to only what it needs to change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to restate that the departure from a standard FSM approach to a somewhat confusing one has already happened in #282.

I don't think I agree that regexes are unusual in the parsing business, they have been around at least since lex & yacc duo which were introduced in the 1970s.

But OK, the decision is up to you.

@immerrr immerrr force-pushed the fix-unescaping branch 4 times, most recently from bff99ea to 1f19879 Compare July 21, 2018 07:09
immerrr added 2 commits July 21, 2018 20:14
Signed-off-by: immerrr <immerrr@gmail.com>
Signed-off-by: immerrr <immerrr@gmail.com>
@immerrr
Copy link
Contributor Author
immerrr commented Jul 21, 2018

Closing in favour of #291

@immerrr immerrr closed this Jul 21, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0