8000 Add `utils::character_reference` by SabrinaJewson · Pull Request #981 · pulldown-cmark/pulldown-cmark · GitHub
[go: up one dir, main page]

Skip to content

Conversation

SabrinaJewson
Copy link
Contributor

I am a consumer of this crate, but I’m not outputting HTML, and I would like to transform InlineHtml and HtmlBlock nodes according to my own needs. Writing a small HTML parser myself is easy enough, but pulldown-cmark already has the large table required to resolved named HTML entities, and for binary size’s sake I’d prefer to avoid duplicating all of this in my own code.

Therefore, this feature request exposes the functionality of resolving HTML character references in the public API of the crate, via the utils module.

While it may be nice to expose more HTML parsing, such an API would be more subjective and would be hard to make fit everybody’s needs. By contrast, this function is minimal and self-contained, and has legitimate disadvantages to users rolling their own versions of.

@ollpu
Copy link
Collaborator
ollpu commented Dec 11, 2024

Fair enough, I agree that this is something you don't necessarily want to implement yourself.

Just for binary size it might be better that we split the data into a separate crate or use the existing https://github.com/p-jackson/entities.

The implementation is also rather specific to CommonMark and there are a couple cases that behave differently from HTML.. In some use cases it could be that the function is not safe for parsing actual HTML. I think the relevant chapter of the spec should be mentioned in the doc comment, or even as part of the function name.

@SabrinaJewson
Copy link
Contributor Author

I see – you’re right about CommonMark and HTML differing. In that case, in the interests of not wanting to design a HTML parser, I don’t think this function is the best way forward. Are you okay with an entities dep, then?

@SabrinaJewson
Copy link
Contributor Author

Hmm. entities’ data isn’t sorted, meaning we can’t do a binary search like we currently do. I did some benchmarks:

  • binary search is 83ns
  • std hashmap is 16ns
  • rustc_hash hashmap is 5ns
  • foldhash-f and foldhash-q are 6.5 and 7ns respectively
  • BTreeMap is 45ns
  • PHF hashmap is 15ns

Building the rustc_hash map takes 38μs, and it takes up 128KB in memory (cf. entities::ENTITIES takes up 105KB in the array and 25KB in static strings; the current impl takes 68KB in the array and 20KB in static strings). The size of entities::ENTITIES could be reduced by feature-flagging (which I opened an issue on that repo for).

There’s a tradeoff, I suppose. I opened p-jackson/entities#14 to see what the maintainer of that crate is interested in.

@ollpu
Copy link
Collaborator
ollpu commented Dec 16, 2024

Yeah, on closer look it might not be a great fit for us. In addition to the different sorting and extra codepoint data, it contains some of the entities both with and without ;, which isn't needed for CM. So in total binary size might be something like 3­–4x compared to the current table.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

0