8000 Implement Stream-Safe Text Process and QuickCheck algorithms by sujayakar · Pull Request #22 · unicode-rs/unicode-normalization · GitHub
[go: up one dir, main page]

Skip to content

Implement Stream-Safe Text Process and QuickCheck algorithms #22

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
May 2, 2018

Conversation

sujayakar
Copy link
Contributor

This change implements a few new Unicode algorithms: the Stream-Safe Text Process, which bounds decomposition buffers in the face of long runs of non-starter characters, and the QuickCheck algorithms, which allow checking whether a string is normalized efficiently.

First, it refactors the table generation script to simplify the tables. Instead of emitting binary search tables, we just present the compiler a large switch statement, which it's happy to optimize. This change improves performance (see numbers below) and keeps the code simple.

Then, implementing both the algorithms is straightforward per the spec. I've also simplified the decomposition code to remove its custom sorting implementation and only sort each character once.

Before:

test bench_is_nfc_ascii          ... bench:       4,368 ns/iter (+/- 684)
test bench_is_nfc_normalized     ... bench:       5,079 ns/iter (+/- 1,022)
test bench_is_nfc_not_normalized ... bench:       2,405 ns/iter (+/- 362)
test bench_is_nfd_ascii          ... bench:       1,886 ns/iter (+/- 311)
test bench_is_nfd_normalized     ... bench:       2,326 ns/iter (+/- 586)
test bench_is_nfd_not_normalized ... bench:       1,074 ns/iter (+/- 149)
test bench_nfc_ascii             ... bench:       3,412 ns/iter (+/- 1,856)
test bench_nfd_ascii             ... bench:       1,794 ns/iter (+/- 315)

After:

test bench_is_nfc_ascii           ... bench:          51 ns/iter (+/- 6)
test bench_is_nfc_normalized      ... bench:          70 ns/iter (+/- 18)
test bench_is_nfc_not_normalized  ... bench:         967 ns/iter (+/- 255)
test bench_is_nfd_ascii           ... bench:          59 ns/iter (+/- 107)
test bench_is_nfd_normalized      ... bench:          97 ns/iter (+/- 81)
test bench_is_nfd_not_normalized  ... bench:          34 ns/iter (+/- 6)
test bench_nfc_ascii              ... bench:       1,052 ns/iter (+/- 206)
test bench_nfd_ascii              ... bench:         710 ns/iter (+/- 97)
test bench_streamsafe_adversarial ... bench:         612 ns/iter (+/- 86)
test bench_streamsafe_ascii       ... bench:         103 ns/iter (+/- 40)

@alexcrichton
Copy link
Contributor

Some awesome wins, thanks @sujayakar!

@alexcrichton alexcrichton merged commit ba25f07 into unicode-rs:master May 2, 2018
@alexcrichton
Copy link
Contributor

BTW @sujayakar if y'all are interested I could add you as comaintainers of this organization/crate if you'd like!

@sujayakar sujayakar deleted the unicode-opts2 branch May 4, 2018 16:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0