8000 [Yaml] Base the parser on a StringReader by GuilhemN · Pull Request #19750 · symfony/symfony · GitHub
[go: up one dir, main page]

Skip to content

[Yaml] Base the parser on a StringReader #19750

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

GuilhemN
Copy link
Contributor
@GuilhemN GuilhemN commented Aug 26, 2016
Q A
Branch? "master"
Bug fix? no
New feature? yes kind of
BC breaks? no
Deprecations? no
Tests pass? yes
Fixed tickets
License MIT
Doc PR

This PR proposes the introduction of a StringReader in the Yaml component (could be in its own component later if we are satisfied).
It's a huge change (no more regex and offset management in the parser) so I only updated some simple part of the parser for now to show how it could be used.

This would simplify the Yaml parser maintenance by removing regexs but also by using a clearer syntax (no more obvious function calls), the first feature I imagine is allowing custom tags (what about !service foo instead of @foo for dependency injection ?).

WDYT?

@GuilhemN
Copy link
Contributor Author
GuilhemN commented Aug 26, 2016

I think I won't adapt more things in this PR as it become very quickly hard to review and I would like to do that step by step in case it is accepted :)

@GuilhemN GuilhemN changed the title [Yaml] Add a StringReader [Yaml] Base the parser on a StringReader Aug 26, 2016
@staabm
Copy link
Contributor
staabm commented Aug 27, 2016

A perf comparison would be great.

@GuilhemN
Copy link
Contributor Author
GuilhemN commented Aug 27, 2016

@staabm the tests are ~50ms slower on my computer (they took 250ms) but i think they should be faster or the same as currently once the parser will be entirely converted (the StringReader is instantiated several times for now and some variables are instantiated only because other parts aren't converted yet).
The most important performance gain should be on the biggest regex of the Parser (not converted yet).

BTW I think flexibility is the most important part for this component as its result is cached most of the time.

Edit: I think we could improve performances with a StringReaderGenerator that would replace things like eat('foo') by isset($data[$offset+2]) && 'f' === $data[$offset] && 'o' === $data[$offset+1] && 'o' === $data[$offset+2] and eatAny(array('foo', 'fah', 'bar')) by return isset($data[$offset+2]) ? ('f' === $data[$offset] ? ('o' === $data[$offset+1] && 'o' === $data[$offset+2] ? 'foo' : ($data[$offset+1] === 'a' && $data[$offset+2] === 'h' ? 'fah')) : ('b' === $data[$offset] && 'a' === $data[$offset+1] && 'r' === $data[$offset+2] ? 'bar' : false))) : false;.
That's a bit hard to read but if it is automatically generated it doesn't matter much.

We could as well generate ranges and hardcode them [a-zA-Z] can easily be hardcoded to abc...zABC...Z. But we would be forced to use commands to generate codes in the core. WDYT @fabpot ?

@GuilhemN
Copy link
Contributor Author
GuilhemN commented Aug 28, 2016

Maybe you would prefer a real feature directly ? (such as custom tags)

@GuilhemN GuilhemN force-pushed the STRINGREADER branch 2 times, most recently from fc7dd89 to fe6901b Compare August 28, 2016 13:50
@javiereguiluz
Copy link
Member

@Ener-Getick after a quick review of your proposal there is one (minor) thing I don't like: the name of the eat*() methods. Could we replace them by read*()? e.g. eatWhiteSpace() -> readWhiteSpace()

@GuilhemN
Copy link
Contributor Author

@javiereguiluz how would you call eat, eatAny, eatSpan and eatCSpan? readString, readAny, readSpan and readCSpan ?

@javiereguiluz
Copy link
Member

@Ener-Getick yes, exactly like that :)

*
* @return string A YAML string
*
* @throws ParseException When malformed inline YAML string is parsed
*
* @internal
Copy link
Contributor

Choose a reason for 8000 hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this was marked as internal in 2.x then you can even make this method private, no?
If it was marked as internal in 3.x it may not be possible to change it (just yet).

Copy link
Contributor Author
@GuilhemN GuilhemN Aug 29, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the entire class is @internal so I removed this tag ;)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class is only internal since Symfony 3.1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xabbuh if someone looks at the 3.1 code he will see that this method is @internal because its class is. What's the problem?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But you're right, if it should be removed, it should be done in 3.1.

@sstok
Copy link
Contributor
sstok commented Aug 29, 2016

read(Any, Span, CSpan, String) and consumeWhiteSpace (that name is used in PHP-cs-fixer, could even leave out the White part).

// quoted scalar
$output = self::parseQuotedScalar($scalar, $i);
if ($quote = $reader->eatSpan('"\'', 1)) {
$unescaper = new Unescaper();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be injected / defined as a property?

Copy link
Contributor Author
@GuilhemN GuilhemN Aug 29, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its methods are static, it could even not be instantiated.

@javiereguiluz
Copy link
Member

I was looking at the StringReader implementations in .NET and in Java. Some comments in case they're useful:

  • Our readToFullConsumption() method is called readToEnd() in .NET. I like that.
  • Both APIs are based on two versions of read() method, one to read chars and other strings. It's a bit strange to use in our code readString('{') when you are reading just a single char.
  • .NET defines the peek() method, which works as read() but doesn't consume the char. I don't know if we need that.
  • Their constructors is simply StringReader(String) I don't know if we really need the start and end arguments.

@GuilhemN
Copy link
Contributor Author
GuilhemN commented Aug 30, 2016
  • readToEnd looks good to me :)
  • I introduced readChar and readAnyChar, they work as readString and readAny but are a bit faster
  • I also introduced peek to remove a call to read + State::restore()
  • For your last point, we need these arguments during the transition phase as we use several instance of StringReader (see their usage here) edit: after all i used substr instead to optimize readSpan.

@GuilhemN
Copy link
Contributor Author
GuilhemN commented Sep 6, 2016

Do you think it will be possible to have this in 3.2 or is it too late ?
I know it contains a lot of modifications so don't hesitate to ask me if you don't understand why I changed something.

@GuilhemN
Copy link
Contributor Author
GuilhemN commented Sep 9, 2016

BTW this PR fixes several limitations:

  • tags are allowed before any values (eg !foo {my: map}), that's only useful for non-specific tags for now but it will be useful for custom tags if we allow them later
  • tags are allowed for keys ({ !str false: bar })
  • mark to be deprecated several things (it is not deprecated yet):
    • unsupported tags
    • every reserved indicators
    • using consecutive commas in mappings (eg {foo: bar,,})
    • resolvable keys (eg false: should be converted to ! false: as false is not a supported key in php)
  • fix support of spaces in keys ({foo bar: quz} wasn't possible)

@xabbuh
Copy link
Member
xabbuh commented Sep 27, 2016

I am not against this change per se especially if it greatly increases readability. However, we should be sure not to decrease performance as this component is used by many packages (of which many probably do not use a caching system like Symfony is using it when building the DI container).

@GuilhemN
Copy link
Contributor Author
GuilhemN commented Sep 27, 2016

@xabbuh i don't think people should use this component for its performance (there is the yaml extension for that) but instead for its extensibility and its integration with symfony.

Tests are ~10% slower on my computer. Maybe we can improve this more (blackfire attributes a big cost to functions not doing anything themselves, such as readSpan, maybe caused by blackfire itself or because of something else?).
To chose whether to merge this, I think we also have to take into account the ability to fix bugs and add new features more easily as well as having a more spec compliant parser (for example, maps and sequences could be parsed the same way according to the parser, only the output changes).

@GuilhemN
Copy link
Contributor Author

I guess we now have to wait for 3.3?

I don't know if i should continue to add changes here as it would only make the diff bigger but i still think we can really improve the yaml component by changing its internals.
Do you want me to change something else or is this gonna be rejected because it is slower than currently?

@javiereguiluz javiereguiluz added this to the 3.3 milestone Nov 6, 2016
@xabbuh
Copy link
Member
xabbuh commented Dec 4, 2016

I don't have a really strong opinion either way. Though it would be good to get some more opinion of the other @symfony/deciders before investing more time here IMO.

@dunglas
Copy link
Member
dunglas commented Dec 5, 2016

Having a true parser instead of relying on regex is a good idea.

However @xabbuh is right, I saw a lot of PHP applications using this component to parse YAML files at runtime and without cache. Using the C extension is often not an option because it is not broadly available and requires to be able to install packages on the server.
It's one of the most popular Symfony component, performance must not decrease here.

@stof
Copy link
Member
stof commented Dec 5, 2016

@GuilhemN can you share a blackfire comparison of this PR vs master (and maybe the profiles themselves too, at least for the PR) ?

Btw, most usage of readSpan seems to come from consuming whitespaces. It might be worth looking at optimizing them

@stof
Copy link
Member
stof commented Dec 5, 2016

Thus, most places consuming whitespaces don't care about them, so you could have a skipWhitespaces instead of consumeWhitespaces for them, which would move the cursor without actually reading the content of these bytes (so doing the strspn call but not calling internalRead with the result)

@GuilhemN
Copy link
Contributor Author
GuilhemN commented Dec 5, 2016

I used this benchmark based on the config.yml file of the Symfony standard edition.

The first try was 34% slower, then with the consumeWhitespace optimization, it was 18% slower than currently (you can find the last profile here).

@theofidry
Copy link
Contributor
theofidry commented Dec 5, 2016

@GuilhemN I'm getting a 404 for your link (latest profile) :/

@GuilhemN
Copy link
Contributor Author
GuilhemN commented Dec 5, 2016

@theofidry sorry, I forgot to make them public... it should work now.


// bc support of [ foo, , ]
while ($reader->readChar(',')) {
$reader->consumeWhiteSpace();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section looks like one of those rare cases where a do {} while() makes sense.

@stof
Copy link
Member
stof commented Dec 7, 2016

@GuilhemN please use public links for the comparisons too

@GuilhemN
Copy link
Contributor Author
GuilhemN commented Dec 7, 2016

It looks like the links were broken this time...

@javiereguiluz
Copy link
Member

@GuilhemN are you still using the benchmark that you mentioned in this comment?

for ($i = 0; $i < 10000; ++$i) {
    $parser->parse('[foo, bar, quz: foo, {bar: foo, quz: mou,}]');
}

I don't like that benchmark because it parses a small YAML file thousands of times. I prefer something more realistic, like parsing a medium-sized YAML file a few times. For example, you could use the Symfony Demo app/config/config.yml file and parse 5 times or so.

@GuilhemN
Copy link
Contributor Author
GuilhemN commented Dec 7, 2016

@javiereguiluz i did not like it too, i changed it to this one, based on the standard edition.
It is still executed 1000 times, should I change it?

@javiereguiluz
Copy link
Member

@GuilhemN OK thanks! I still think that parsing the YAML file 1,000 times is very unrealistic.

@GuilhemN
Copy link
Contributor Author
GuilhemN commented Dec 7, 2016

@javiereguiluz the results when executed 5 times: before / after / comparison.

@theofidry
Copy link
Contributor

@GuilhemN to be more fair in your benchmark you could profile:

  • parse a small file
  • parse a medium size file
  • parse a big file

You could do so separately or within the same process. You could also use Blackfire prob to exclude the autoloading from the profile which can always be kinda falsy cf. Blackfire doc.

@GuilhemN
Copy link
Contributor Author
GuilhemN commented Dec 7, 2016

@theofidry I don't think that's worth it, a benchmark will never be exact anyway. The goal is to see the slowest parts of the code and see the global trend when changing a big part of the code; it's good enough as is imo :)

For the autoloader, I'll run more samples at the same time, the results should be more accurate.

@GuilhemN
Copy link
Contributor Author

I'm closing this as it needs a lot of work to be rebased and most of the things it fixed we're fixed otherwise.
The win in readability would be great but smaller changes must be easier to review/accept, eventually I'll open smaller PRs.

@GuilhemN GuilhemN closed this May 18, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

0