Avoid slices in entries of decomposition tables by glandium · Pull Request #86 · unicode-rs/unicode-normalization · GitHub

Avoid slices in entries of decomposition tables #86


Merged
Manishearth merged 1 commit into unicode-rs:master on Jun 23, 2022

Conversation

glandium
Contributor

For small arrays of data, slices are expensive:
- there's the obvious size of the array (`size_of::<char>() * length`)
- there's the size of the slice itself (`size_of::<(*const char, usize)>()`)
- there's the size of the relocation for the pointer in the slice. The
  worst case is 64-bit ELF, where it is `3 * size_of::<usize>()`(!).

Most entries in the decomposition tables are two characters or fewer, so the
per-entry overhead in these tables is disproportionately large.
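
To make those costs concrete, here is a minimal sketch (not code from the crate) that prints the relevant sizes on a 64-bit target:

```rust
use std::mem::size_of;

fn main() {
    // Payload of a typical two-character decomposition entry: 2 * 4 = 8 bytes.
    println!("payload: {} bytes", 2 * size_of::<char>());
    // The &[char] fat pointer (data pointer + length): 16 bytes on 64-bit.
    println!("slice header: {} bytes", size_of::<&[char]>());
    // An Elf64_Rela record for relocating that data pointer: 24 bytes,
    // i.e. 3 * size_of::<usize>().
    println!("relocation record: {} bytes", 3 * size_of::<usize>());
}
```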

To give an idea, a "Hello, World" executable (fresh from `cargo new`) built
with `--release` on my machine has 17712 bytes of .rela.dyn (relocations) and
9520 bytes of .data.rel.ro (relocatable read-only data).

After adding unicode-normalization as a dependency and changing the code to
`println!("{}", String::from_iter("Hello, world!".nfc()));`, those jump to
156336 and 147968 bytes, respectively.

For comparison, with unicode-normalization 0.1.8 (the last release before the
perfect hashes), they were 18168 and 9872 bytes. This is, however, compensated
for by the .text (code) being larger (314607 bytes with 0.1.8 vs. 234639 with
0.1.19); likewise for .rodata (non-relocatable read-only data): 225979 bytes
with 0.1.8 vs. 82523 with 0.1.19.

This can be alleviated by replacing the per-entry slices with indices into a
single shared slice per decomposition table, saving 228K overall while adding
barely anything to code size (160 bytes). This also makes the overall cost of
unicode-normalization lower than it was in 0.1.8.
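
The shape of the change, as a minimal sketch with hypothetical names (the crate's generated tables and perfect-hash lookup are more involved than this linear scan):

```rust
// Before: each entry carries a &[char] fat pointer, and every such pointer
// needs its own relocation in the binary.
const ENTRIES_WITH_SLICES: &[(char, &[char])] = &[
    ('\u{C0}', &['\u{41}', '\u{300}']), // À -> A + combining grave accent
];

// After: all decomposed characters live in one shared array per table, and
// each entry stores a plain (offset, length) pair instead of a pointer.
const DECOMPOSED_CHARS: &[char] = &['\u{41}', '\u{300}'];
const ENTRIES_WITH_INDICES: &[(char, (u16, u16))] = &[
    ('\u{C0}', (0, 2)), // (offset, length) into DECOMPOSED_CHARS
];

fn lookup(c: char) -> Option<&'static [char]> {
    ENTRIES_WITH_INDICES
        .iter()
        .find(|&&(k, _)| k == c)
        .map(|&(_, (start, len))| &DECOMPOSED_CHARS[start as usize..(start + len) as usize])
}

fn main() {
    // Both representations yield the same decomposition for U+00C0.
    assert_eq!(Some(ENTRIES_WITH_SLICES[0].1), lookup('\u{C0}'));
}
```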

As far as performance is concerned, at least on my machine, it makes
virtually no difference on `cargo bench`:

on master:
running 22 tests
test bench_is_nfc_ascii                      ... bench:          13 ns/iter (+/- 0)
test bench_is_nfc_normalized                 ... bench:          23 ns/iter (+/- 0)
test bench_is_nfc_not_normalized             ... bench:         347 ns/iter (+/- 2)
test bench_is_nfc_stream_safe_ascii          ... bench:          13 ns/iter (+/- 0)
test bench_is_nfc_stream_safe_normalized     ... bench:          31 ns/iter (+/- 0)
test bench_is_nfc_stream_safe_not_normalized ... bench:         374 ns/iter (+/- 2)
test bench_is_nfd_ascii                      ... bench:           9 ns/iter (+/- 0)
test bench_is_nfd_normalized                 ... bench:          29 ns/iter (+/- 2)
test bench_is_nfd_not_normalized             ... bench:           9 ns/iter (+/- 0)
test bench_is_nfd_stream_safe_ascii          ... bench:          16 ns/iter (+/- 0)
test bench_is_nfd_stream_safe_normalized     ... bench:          40 ns/iter (+/- 0)
test bench_is_nfd_stream_safe_not_normalized ... bench:           9 ns/iter (+/- 0)
test bench_nfc_ascii                         ... bench:         525 ns/iter (+/- 1)
test bench_nfc_long                          ... bench:     186,528 ns/iter (+/- 1,613)
test bench_nfd_ascii                         ... bench:         283 ns/iter (+/- 30)
test bench_nfd_long                          ... bench:     120,183 ns/iter (+/- 4,510)
test bench_nfkc_ascii                        ... bench:         513 ns/iter (+/- 1)
test bench_nfkc_long                         ... bench:     192,922 ns/iter (+/- 1,673)
test bench_nfkd_ascii                        ... bench:         276 ns/iter (+/- 30)
test bench_nfkd_long                         ... bench:     137,163 ns/iter (+/- 2,159)
test bench_streamsafe_adversarial            ... bench:         323 ns/iter (+/- 5)
test bench_streamsafe_ascii                  ... bench:          25 ns/iter (+/- 0)

with patch applied:
running 22 tests
test bench_is_nfc_ascii                      ... bench:          13 ns/iter (+/- 0)
test bench_is_nfc_normalized                 ... bench:          23 ns/iter (+/- 0)
test bench_is_nfc_not_normalized             ... bench:         347 ns/iter (+/- 7)
test bench_is_nfc_stream_safe_ascii          ... bench:          13 ns/iter (+/- 0)
test bench_is_nfc_stream_safe_normalized     ... bench:          36 ns/iter (+/- 1)
test bench_is_nfc_stream_safe_not_normalized ... bench:         377 ns/iter (+/- 14)
test bench_is_nfd_ascii                      ... bench:           9 ns/iter (+/- 0)
test bench_is_nfd_normalized                 ... bench:          29 ns/iter (+/- 3)
test bench_is_nfd_not_normalized             ... bench:          10 ns/iter (+/- 0)
test bench_is_nfd_stream_safe_ascii          ... bench:          16 ns/iter (+/- 0)
test bench_is_nfd_stream_safe_normalized     ... bench:          39 ns/iter (+/- 1)
test bench_is_nfd_stream_safe_not_normalized ... bench:          10 ns/iter (+/- 0)
test bench_nfc_ascii                         ... bench:         545 ns/iter (+/- 2)
test bench_nfc_long                          ... bench:     186,348 ns/iter (+/- 1,660)
test bench_nfd_ascii                         ... bench:         281 ns/iter (+/- 2)
test bench_nfd_long                          ... bench:     124,720 ns/iter (+/- 5,967)
test bench_nfkc_ascii                        ... bench:         517 ns/iter (+/- 4)
test bench_nfkc_long                         ... bench:     194,943 ns/iter (+/- 1,636)
test bench_nfkd_ascii                        ... bench:         274 ns/iter (+/- 0)
test bench_nfkd_long                         ... bench:     127,973 ns/iter (+/- 1,161)
test bench_streamsafe_adversarial            ... bench:         320 ns/iter (+/- 3)
test bench_streamsafe_ascii                  ... bench:          25 ns/iter (+/- 0)
@glandium
Contributor Author

Forgot to mention in the commit message: relocatable data also means it's not shareable between processes.

@glandium
Contributor Author

With a hacked-up cargo-bloat:
on master:

$ cargo bloat --release -n 0 --crates
    Finished release [optimized] target(s) in 0.00s
    Analyzing target/release/foors

File  .text     Size Crate
5.1%  96.6% 221.4KiB std
0.0%   0.6%   1.4KiB unicode_normalization
0.0%   0.5%   1.1KiB tinyvec
0.0%   0.1%     339B foors
0.0%   0.1%     229B [Unknown]
5.3% 100.0% 229.1KiB .text section size, the file size is 4.2MiB

Note: numbers above are a result of guesswork. They are not 100% correct and never will be.
$ cargo bloat --release -n 0 --crates --symbols-section .rodata
    Finished release [optimized] target(s) in 0.00s
    Analyzing target/release/foors

File  .text    Size Crate
1.2%  63.7% 51.3KiB unicode_normalization
0.0%   1.0%    851B std
0.0%   0.2%    174B [Unknown]
1.9% 100.0% 80.5KiB .rodata section size, the file size is 4.2MiB

Note: numbers above are a result of guesswork. They are not 100% correct and never will be.
$ cargo bloat --release -n 0 --crates --symbols-section .data.rel.ro
    Finished release [optimized] target(s) in 0.00s
    Analyzing target/release/foors

File  .text     Size Crate
3.1%  93.3% 134.8KiB unicode_normalization
0.0%   0.0%      72B [Unknown]
0.0%   0.0%      16B std
3.3% 100.0% 144.5KiB .data.rel.ro section size, the file size is 4.2MiB

Note: numbers above are a result of guesswork. They are not 100% correct and never will be.
$ cargo bloat --release -n 0 --crates --symbols-section .rela.dyn
    Finished release [optimized] target(s) in 0.00s
    Analyzing target/release/foors

File  .text     Size Crate
3.5% 100.0% 152.7KiB .rela.dyn section size, the file size is 4.2MiB

Note: numbers above are a result of guesswork. They are not 100% correct and never will be.

with the patch applied:

$ cargo bloat --release -n 0 --crates
    Finished release [optimized] target(s) in 0.00s
    Analyzing target/release/foors

File  .text     Size Crate
5.8%  96.6% 221.4KiB std
0.0%   0.7%   1.6KiB unicode_normalization
0.0%   0.5%   1.1KiB tinyvec
0.0%   0.1%     339B foors
0.0%   0.1%     229B [Unknown]
6.0% 100.0% 229.3KiB .text section size, the file size is 3.7MiB

Note: numbers above are a result of guesswork. They are not 100% correct and never will be.
$ cargo bloat --release -n 0 --crates --symbols-section .rodata
    Finished release [optimized] target(s) in 0.00s
    Analyzing target/release/foors

File  .text     Size Crate
2.8%  83.1% 105.6KiB unicode_normalization
0.0%   0.7%     851B std
0.0%   0.1%     174B [Unknown]
3.3% 100.0% 127.0KiB .rodata section size, the file size is 3.7MiB

Note: numbers above are a result of guesswork. They are not 100% correct and never will be.
$ cargo bloat --release -n 0 --crates --symbols-section .data.rel.ro
    Finished release [optimized] target(s) in 0.00s
    Analyzing target/release/foors

File  .text   Size Crate
0.0%   0.7%    72B [Unknown]
0.0%   0.5%    48B unicode_normalization
0.0%   0.2%    16B std
0.3% 100.0% 9.7KiB .data.rel.ro section size, the file size is 3.7MiB

Note: numbers above are a result of guesswork. They are not 100% correct and never will be.
$ cargo bloat --release -n 0 --crates --symbols-section .rela.dyn
    Finished release [optimized] target(s) in 0.00s
    Analyzing target/release/foors

File  .text    Size Crate
0.5% 100.0% 17.9KiB .rela.dyn section size, the file size is 3.7MiB

Note: numbers above are a result of guesswork. They are not 100% correct and never will be.

@Manishearth merged commit e9dc93a into unicode-rs:master on Jun 23, 2022
@glandium
Contributor Author

@Manishearth are you planning to spin a new release soon?

@Manishearth
Member

No, but I can if you make a PR that bumps the version.

", ".join("'\\u{%s}'" % hexify(c) for c in table[k])))
offsets = {}
offset = 0
out.write("pub(crate) const %s_DECOMPOSED_CHARS: &[char] = &[\n" % name.upper())

Wouldn't a fixed array avoid the relocations altogether? As in, a `%s_DECOMPOSED_CHARS: [char; _] =` or so?

@glandium
Contributor Author
Jun 24, 2022

The tables are const. The slices for the tables themselves are not stored in .data.rel.ro at all because of that, so there are actually no relocations for those (and even then, that would only be one per table, which is not as dramatic as one per entry).
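
To illustrate that point, a sketch with a hypothetical table (on a typical position-independent Linux build): the `const` table's own fat pointer is inlined at each use site, so the table reference itself needs no relocation, but the promoted backing array still stores one pointer per entry, and each of those pointers needs a relative relocation, which is why the data lands in .data.rel.ro.

```rust
const TABLE: &[(char, &[char])] = &[
    ('\u{C0}', &['\u{41}', '\u{300}']), // one relocated pointer per entry
    ('\u{C8}', &['\u{45}', '\u{300}']), // ...and another one here
];

fn main() {
    println!("{} entries", TABLE.len());
}
```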
