Update to Unicode 9.0.0 by SimonSapin · Pull Request #10 · unicode-rs/unicode-normalization · GitHub

Update to Unicode 9.0.0 #10


Merged 2 commits on Dec 19, 2016

Conversation

SimonSapin
Contributor

No description provided.

@SimonSapin
Contributor Author

Failing test in Unicode source form:

1100 AC00 11A8 11A8;1100 AC01 11A8;1100 1100 1161 11A8 11A8;1100 AC01 11A8;1100 1100 1161 11A8 11A8; 
# (ᄀ각ᆨ; ᄀ각ᆨ; ᄀ각ᆨ; ᄀ각ᆨ; ᄀ각ᆨ; ) HANGUL CHOSEONG KIYEOK, HANGUL SYLLABLE GA, HANGUL JONGSEONG KIYEOK, HANGUL JONGSEONG KIYEOK

when testing column_2 == nfc(column_1):

'assertion failed: `(left == right)` (left: `"ᄀ각ᆨ"`, right: `"ᄀ갂"`)'

@Manishearth
Member
Manishearth commented Dec 19, 2016

So the NFC form looks somewhat conceptually correct to me. GA+KIYEOK+KIYEOK (ga+g+g) is (sort of?) the same as GA+SSANGKIYEOK (ga+gg), which does compose to GAGG in our code. However, KIYEOK+KIYEOK on its own does not normalize to SSANGKIYEOK (g+g != gg), and GAGG decomposes to GA+SSANGKIYEOK, not to GA+KIYEOK+KIYEOK. So two kiyeoks only become a ssangkiyeok when composed within a syllable block, and only in one direction, which is strange.

However, I need to dig into the tables to understand if we're supposed to do this.

ga+g+g is LV+T+T, a form that Unicode does not usually discuss. Usually you have one each of L, V, and T.
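To make the LV+T+T shape concrete, here is a minimal sketch that classifies a code point as L, V, T, LV, or LVT. The constants are the Hangul syllable arithmetic values from the Unicode Standard; the `HangulKind` enum and `classify` function are hypothetical names for illustration and are not part of this crate.

```rust
// Hangul arithmetic constants from the Unicode Standard (section 3.12).
const S_BASE: u32 = 0xAC00;
const L_BASE: u32 = 0x1100;
const V_BASE: u32 = 0x1161;
const T_BASE: u32 = 0x11A7; // one below the first trailing consonant
const L_COUNT: u32 = 19;
const V_COUNT: u32 = 21;
const T_COUNT: u32 = 28;
const S_COUNT: u32 = L_COUNT * V_COUNT * T_COUNT; // 11172

// Hypothetical helper type, for illustration only.
#[derive(Debug, PartialEq, Clone, Copy)]
enum HangulKind {
    L,
    V,
    T,
    LV,
    LVT,
    Other,
}

// Classify a code point by where it falls in the jamo and syllable ranges.
fn classify(c: char) -> HangulKind {
    let cp = c as u32;
    if (L_BASE..L_BASE + L_COUNT).contains(&cp) {
        HangulKind::L
    } else if (V_BASE..V_BASE + V_COUNT).contains(&cp) {
        HangulKind::V
    } else if (T_BASE + 1..T_BASE + T_COUNT).contains(&cp) {
        HangulKind::T
    } else if (S_BASE..S_BASE + S_COUNT).contains(&cp) {
        // LV syllables sit at every T_COUNT-th position; the rest are LVT.
        if (cp - S_BASE) % T_COUNT == 0 {
            HangulKind::LV
        } else {
            HangulKind::LVT
        }
    } else {
        HangulKind::Other
    }
}
```

Applied to the failing input `1100 AC00 11A8 11A8`, this yields L, LV, T, T, so only the first T can attach to the LV syllable.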

@Manishearth
Member

Our algorithm is wrong. GAGG getting composed is a happy accident.

GA+TIKEUT+TIKEUT "\u{AC00}\u{11AE}\u{11AE}" ('가', 'ᆮ', 'ᆮ') composes to '갎' (U+AC0E), which is GALP, or GA+RIEUL-PHIEUPH, not GA+SSANGTIKEUT (SSANGTIKEUT isn't even a jongseong, so that wouldn't work anyway).

The reason GA+G+G composed was that G has an offset of 1, and GAGG is right next to GAG, so our algorithm treated GAG as an LV precomposed syllable block (it is not, it is LVT) and added 1 to it to get GAGG.
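The behavior described above can be reproduced with a small sketch. This is a hypothetical reconstruction of the flawed step, not the crate's actual code: the function name `compose_syllable_t_unchecked` is made up, and it assumes the only missing piece was the LV check.

```rust
// Hangul arithmetic constants from the Unicode Standard (section 3.12).
const S_BASE: u32 = 0xAC00;
const T_BASE: u32 = 0x11A7;
const T_COUNT: u32 = 28;
const S_COUNT: u32 = 11172;

// Hypothetical reconstruction of the flawed step: any precomposed syllable
// followed by a T jamo is offset by the T index, without checking that the
// syllable is an LV block.
fn compose_syllable_t_unchecked(s: char, t: char) -> Option<char> {
    let s_index = (s as u32).checked_sub(S_BASE)?;
    let t_index = (t as u32).checked_sub(T_BASE)?;
    if s_index < S_COUNT && t_index > 0 && t_index < T_COUNT {
        char::from_u32(S_BASE + s_index + t_index)
    } else {
        None
    }
}
```

With this sketch, GAG (U+AC01) plus KIYEOK (U+11A8) gives U+AC02 (GAGG), and GAD (U+AC07) plus TIKEUT (U+11AE) gives U+AC0E (GALP), matching the two cases in this comment.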

I'm going to look through the algorithm in the spec and see what's missing. In essence, we're not distinguishing between LVT s-blocks (which do not compose further) and LV s-blocks (which can compose to LVT).

@Manishearth
Member
Manishearth commented Dec 19, 2016

Fixed in #11 / 98748d3

Manishearth and others added 2 commits December 20, 2016 00:03
Fix #11.

The algorithm for composition of Hangul Jamo is:

 - L (choseong jamo) + V (jungseong jamo) = LV (syllable block)
 - LV (syllable block) + T (jongseong jamo) = LVT (syllable block)
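As an illustration of the first rule, the L + V step can be sketched with the arithmetic from the Unicode Standard. The function name `compose_l_v` is hypothetical and is not taken from this crate.

```rust
// Hangul arithmetic constants from the Unicode Standard (section 3.12).
const S_BASE: u32 = 0xAC00;
const L_BASE: u32 = 0x1100;
const V_BASE: u32 = 0x1161;
const L_COUNT: u32 = 19;
const V_COUNT: u32 = 21;
const T_COUNT: u32 = 28;

// Hypothetical helper: compose a leading consonant and a vowel into an LV
// syllable block. Returns None if either input is outside its jamo range.
fn compose_l_v(l: char, v: char) -> Option<char> {
    let l_index = (l as u32).checked_sub(L_BASE)?;
    let v_index = (v as u32).checked_sub(V_BASE)?;
    if l_index < L_COUNT && v_index < V_COUNT {
        // Each (L, V) pair owns a run of T_COUNT code points; the LV
        // syllable is the first one in that run.
        char::from_u32(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT)
    } else {
        None
    }
}
```

For example, KIYEOK (U+1100) plus A (U+1161) gives GA (U+AC00). Every result of this step has a syllable index that is a multiple of `T_COUNT`.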

However, the LV and LVT syllable blocks are intermingled in the Unicode
block. In particular, for each pair LV, you will first see the syllable block
LV, followed by the LVT syllable blocks for each T. The LV+T
composition was a simple addition of offsets.

Our algorithm did not ignore the LVT syllable blocks, which meant that
LVT+T would just offset further and produce an unrelated syllable block.

By ensuring that the `S_index` is a multiple of `T_count`, we filter
for only LV syllable blocks (which occur every `T_count` codepoints in
the S block).
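
A minimal sketch of the check described in this commit message, assuming the standard Hangul arithmetic constants. The function name `compose_lv_t` is hypothetical and this is not the code from the commit itself.

```rust
// Hangul arithmetic constants from the Unicode Standard (section 3.12).
const S_BASE: u32 = 0xAC00;
const T_BASE: u32 = 0x11A7;
const T_COUNT: u32 = 28;
const S_COUNT: u32 = 11172;

// Hypothetical helper: compose an LV syllable block with a trailing
// consonant. LVT blocks are rejected so they never compose further.
fn compose_lv_t(lv: char, t: char) -> Option<char> {
    let s_index = (lv as u32).checked_sub(S_BASE)?;
    let t_index = (t as u32).checked_sub(T_BASE)?;
    // Only syllables whose index is a multiple of T_COUNT are LV blocks.
    let is_lv = s_index < S_COUNT && s_index % T_COUNT == 0;
    if is_lv && t_index > 0 && t_index < T_COUNT {
        char::from_u32(S_BASE + s_index + t_index)
    } else {
        None
    }
}
```

With the extra condition, GA (U+AC00) plus KIYEOK (U+11A8) still composes to GAG (U+AC01), while GAG plus a second KIYEOK returns `None`, leaving the trailing jamo uncomposed as the failing test expects.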