Add `utils::character_reference` #981

SabrinaJewson · 2024-11-06T20:38:44Z

I am a consumer of this crate, but I’m not outputting HTML, and I would like to transform InlineHtml and HtmlBlock nodes according to my own needs. Writing a small HTML parser myself is easy enough, but pulldown-cmark already has the large table required to resolved named HTML entities, and for binary size’s sake I’d prefer to avoid duplicating all of this in my own code.

Therefore, this feature request exposes the functionality of resolving HTML character references in the public API of the crate, via the utils module.

While it may be nice to expose more HTML parsing, such an API would be more subjective and would be hard to make fit everybody’s needs. By contrast, this function is minimal and self-contained, and has legitimate disadvantages to users rolling their own versions of.

ollpu · 2024-12-11T15:46:11Z

Fair enough, I agree that this is something you don't necessarily want to implement yourself.

Just for binary size it might be better that we split the data into a separate crate or use the existing https://github.com/p-jackson/entities.

The implementation is also rather specific to CommonMark and there are a couple cases that behave differently from HTML.. In some use cases it could be that the function is not safe for parsing actual HTML. I think the relevant chapter of the spec should be mentioned in the doc comment, or even as part of the function name.

SabrinaJewson · 2024-12-16T19:41:19Z

I see – you’re right about CommonMark and HTML differing. In that case, in the interests of not wanting to design a HTML parser, I don’t think this function is the best way forward. Are you okay with an entities dep, then?

SabrinaJewson · 2024-12-16T20:54:26Z

Hmm. entities’ data isn’t sorted, meaning we can’t do a binary search like we currently do. I did some benchmarks:

binary search is 83ns
std hashmap is 16ns
rustc_hash hashmap is 5ns
foldhash-f and foldhash-q are 6.5 and 7ns respectively
BTreeMap is 45ns
PHF hashmap is 15ns

Building the rustc_hash map takes 38μs, and it takes up 128KB in memory (cf. entities::ENTITIES takes up 105KB in the array and 25KB in static strings; the current impl takes 68KB in the array and 20KB in static strings). The size of entities::ENTITIES could be reduced by feature-flagging (which I opened an issue on that repo for).

There’s a tradeoff, I suppose. I opened p-jackson/entities#14 to see what the maintainer of that crate is interested in.

ollpu · 2024-12-16T21:15:34Z

Yeah, on closer look it might not be a great fit for us. In addition to the different sorting and extra codepoint data, it contains some of the entities both with and without ;, which isn't needed for CM. ~~So in total binary size might be something like 3–4x compared to the current table.~~

SabrinaJewson added 2 commits November 6, 2024 20:30

Add utils::character_reference

c169af5

Add tests

8a5bc6a

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add `utils::character_reference` #981

Add `utils::character_reference` #981

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Add utils::character_reference #981

Are you sure you want to change the base?

Add utils::character_reference #981

Uh oh!

Conversation

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Add `utils::character_reference` #981

Add `utils::character_reference` #981