Fix unescaping implementation in parser #289

immerrr · 2018-07-18T18:07:13Z

Hi!

This PR addresses some bugs introduced with the recent parser rewrite (#282). The old implementation

s.replace("\\n", "\n").replace('\\\\', '\\').replace('\\"', '"')

would produce invalid results in the presence of a \\n substring in the input:

the expected result, AFAIU, is \\n -> \n (simply unescaping the \\)
the implementation from #282 ("Text parser optimization (~4.5x perf)") would do \\n -> \<NL>, where <NL> means the newline character
if you place replace('\\\\', '\\') at the beginning, it would still be wrong (\\n -> \n -> <NL>)
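To make the ordering problem concrete, here is a minimal sketch that contrasts the chained replace quoted above with a single-pass substitution. The function names and the `_ESCAPES` table are hypothetical and chosen for illustration; this is not necessarily the code proposed in this PR.

```python
import re


def unescape_chained(s):
    # The chained-replace expression quoted above: each pass rescans the
    # whole string, so one pass can consume part of another escape sequence.
    return s.replace("\\n", "\n").replace("\\\\", "\\").replace('\\"', '"')


# Hypothetical single-pass alternative: every escape sequence is matched
# exactly once, left to right, so "\\" is consumed before "n" is looked at.
_ESCAPES = {"\\\\": "\\", "\\n": "\n", '\\"': '"'}
_ESCAPE_RE = re.compile(r'\\[\\n"]')


def unescape_single_pass(s):
    return _ESCAPE_RE.sub(lambda m: _ESCAPES[m.group(0)], s)


raw = r"\\n"  # three characters: backslash, backslash, n

chained = unescape_chained(raw)          # backslash followed by a real newline
single_pass = unescape_single_pass(raw)  # backslash followed by the letter n
```

With the three-character input, the chained version yields a backslash plus a newline character, while the single-pass version yields a backslash plus `n`, which is the result described as expected in the list above.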

Also, the rewritten version of _parse_labels didn't handle strings such as "foo\\" correctly, because it only checked for a single backslash before the closing quote. With all that in mind, I went on to rewrite the entire thing with regexes.
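The trailing-backslash case can be illustrated with a small sketch. Both helpers below are hypothetical and are not taken from `_parse_labels`; they only show why looking at a single preceding character is not enough, and that counting the run of backslashes is one way to decide whether a quote is escaped.

```python
def naive_is_escaped(s, pos):
    # Looks only at the character immediately before s[pos].
    return pos > 0 and s[pos - 1] == "\\"


def is_escaped(s, pos):
    # Counts the unbroken run of backslashes before s[pos]; the character
    # is escaped only if that run has odd length.
    count = 0
    i = pos - 1
    while i >= 0 and s[i] == "\\":
        count += 1
        i -= 1
    return count % 2 == 1


label_value = 'foo\\\\"'  # characters: f o o \ \ "
quote_pos = len(label_value) - 1

naive = naive_is_escaped(label_value, quote_pos)  # True: treats the quote as escaped
counted = is_escaped(label_value, quote_pos)      # False: the two backslashes pair up
```

Here the two backslashes form one escaped backslash, so the quote really does terminate the value; the single-character check gets this wrong.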

It should be more correct and more performant. Taking the input values from #282 on Python 3.6 gave me the following.

Short string, 1M reps

# OLD
In [7]: %time for i in range(1000000*5): _parse_sample('simple_metric 1.513767429e+09')
CPU times: user 13.6 s, sys: 0 ns, total: 13.6 s
Wall time: 13.6 s

# NEW
In [16]: %time for i in range(1000000*5): _parse_sample('simple_metric 1.513767429e+09')
CPU times: user 9.97 s, sys: 0 ns, total: 9.97 s
Wall time: 9.97 s

Long string 100k reps:

# OLD
In [5]: %time for i in range(100000*5): _parse_sample('kube_service_labels{label_app="kube-state-metrics",label_chart="kube-state-metrics-0.5.0",label_heritage="Tiller",label_release="ungaged
   ...: -panther",namespace="default",service="ungaged-panther-kube-state-metrics"} 1')
CPU times: user 9.07 s, sys: 0 ns, total: 9.07 s
Wall time: 9.07 s

# NEW
In [13]: %time for i in range(100000*5): _parse_sample('kube_service_labels{label_app="kube-state-metrics",label_chart="kube-state-metrics-0.5.0",label_heritage="Tiller",label_release="ungage
    ...: d-panther",namespace="default",service="ungaged-panther-kube-state-metrics"} 1')
CPU times: user 7.49 s, sys: 3 µs, total: 7.49 s
Wall time: 7.49 s

brian-brazil · 2018-07-19T13:27:10Z

tests/test_parser.py

+        self.assertEqual(_replace_escaping('\\\\n'), '\\n')
+        self.assertEqual(_replace_escaping('\\\\\\n'), '\\\n')
+        self.assertEqual(_replace_escaping('\\"'), '"')
+        self.assertEqual(_replace_escaping('\\\\"'), '\\"')


This isn't valid input. This would also be clearer if you used the r prefix (raw strings) on the input strings.

Ok, what would you like to do about that invalid input: fix the parser to raise an error or just not test for it?

The initial version had rs, but it doesn't work for strings that end with backslashes (r'\') or strings that should contain actual newlines (\n), so then you have to put some strings with r and some without, and it felt like it added a bit to the confusion.
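The raw-string limitations mentioned here are standard Python behaviour and can be checked with a short, self-contained sketch (the variable names are illustrative only):

```python
# r"\n" is two characters (backslash, n); "\n" is one (a real newline).
raw_len = len(r"\n")
cooked_len = len("\n")

# A raw literal cannot end with an odd number of backslashes. compile() is
# used so the SyntaxError is caught at runtime instead of breaking this file.
try:
    compile("s = r'\\'", "<example>", "exec")
except SyntaxError:
    raw_trailing_backslash_ok = False
else:
    raw_trailing_backslash_ok = True

# A string ending in a single backslash therefore needs a non-raw literal.
single_backslash = "\\"
```

So a test suite that needs both trailing backslashes and real newlines ends up mixing raw and non-raw literals, which is the confusion described above.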

I'd mostly look to test this a level up with whole lines.

Good call. I've moved it to work with input lines instead and also found bugs in escaping handling in _parse_labels, so I proceeded to rewrite all the parser on top of re.

There were just some changes contributed to improve performance, and I'd rather not change all that again with an RE implementation rather than the explicit FSM that's in use. If you can fix it under the current setup so that at least all valid input is correctly handled, that's sufficient.

I'm not sure I see how this PR is different from #282. Before #282 there was an explicit FSM implementation, and it was replaced by an explicit algorithmic approach that's easy to get wrong, which is exactly what happened. This PR proposes replacing the algorithmic approach with regexps, because they offer a concise and easy-to-understand way to work with sequences of characters that also happens to be faster in some cases than pure-Python implementations. Could you elaborate on what's wrong with it?

You're changing from a standard FSM approach for a parser, to an unusual mix of FSM and regex. This is confusing and will be harder to maintain. Please restrict your fix to only what it needs to change.

I'd like to restate that the departure from a standard FSM approach to a somewhat confusing one has already happened in #282.

I don't think I agree that regexes are unusual in the parsing business; they have been around at least since the lex & yacc duo, which was introduced in the 1970s.

But OK, the decision is up to you.

Signed-off-by: immerrr <immerrr@gmail.com>

immerrr · 2018-07-21T18:30:14Z

Closing in favour of #291

immerrr force-pushed the fix-unescaping branch from bc6154c to 5173a6f Compare July 18, 2018 21:41

brian-brazil reviewed Jul 19, 2018


immerrr force-pushed the fix-unescaping branch 4 times, most recently from bff99ea to 1f19879 Compare July 21, 2018 07:09

immerrr added 2 commits July 21, 2018 20:14

Rewrite parser with regexes fixing handling of escape sequences

0374576

Signed-off-by: immerrr <immerrr@gmail.com>

Extend parser tests

8be2424

Signed-off-by: immerrr <immerrr@gmail.com>

immerrr force-pushed the fix-unescaping branch from 1f19879 to 8be2424 Compare July 21, 2018 18:14

immerrr mentioned this pull request Jul 21, 2018

Fix unescaping take 2 #291

Merged

immerrr closed this Jul 21, 2018
