Update to Unicode 9.0.0 by SimonSapin · Pull Request #10 · unicode-rs/unicode-normalization · GitHub

Conversation

@SimonSapin
Contributor

No description provided.

@SimonSapin
Contributor Author

Failing test in Unicode source form:

1100 AC00 11A8 11A8;1100 AC01 11A8;1100 1100 1161 11A8 11A8;1100 AC01 11A8;1100 1100 1161 11A8 11A8; 
# (ᄀ각ᆨ; ᄀ각ᆨ; ᄀ각ᆨ; ᄀ각ᆨ; ᄀ각ᆨ; ) HANGUL CHOSEONG KIYEOK, HANGUL SYLLABLE GA, HANGUL JONGSEONG KIYEOK, HANGUL JONGSEONG KIYEOK

when testing column_2 == nfc(column_1):

'assertion failed: `(left == right)` (left: `"ᄀ각ᆨ"`, right: `"ᄀ갂"`)'
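
For readers unfamiliar with the test data layout: each data line holds five `;`-separated columns of hexadecimal code points, optionally followed by a `#` comment, and the failing check compares column 2 with the NFC form of column 1. A minimal sketch of decoding such a line is below. `parse_columns` is a hypothetical helper written for illustration and is not taken from this crate's test code.

```rust
/// Hypothetical helper: decode one data line into its five columns.
/// Returns `None` if a field is not valid hex or not a valid scalar value.
fn parse_columns(line: &str) -> Option<Vec<String>> {
    // Everything after `#` is a human-readable comment.
    let data = line.split('#').next()?;
    let columns = data
        .split(';')
        .take(5)
        .map(|field| {
            field
                .split_whitespace()
                .map(|hex| u32::from_str_radix(hex, 16).ok().and_then(char::from_u32))
                .collect::<Option<String>>()
        })
        .collect::<Option<Vec<String>>>()?;
    if columns.len() == 5 { Some(columns) } else { None }
}

fn main() {
    let line = "1100 AC00 11A8 11A8;1100 AC01 11A8;1100 1100 1161 11A8 11A8;\
                1100 AC01 11A8;1100 1100 1161 11A8 11A8; # comment";
    let columns = parse_columns(line).expect("well-formed test line");
    for (i, column) in columns.iter().enumerate() {
        // Print each column as a list of code points.
        let points: Vec<String> = column.chars().map(|c| format!("{:04X}", c as u32)).collect();
        println!("column_{}: {}", i + 1, points.join(" "));
    }
}
```

Column 2 here is the expected value `"ᄀ각ᆨ"` on the left-hand side of the failing assertion.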

@Manishearth
Member
Manishearth commented Dec 19, 2016

So the NFC form looks somewhat conceptually correct to me. GA+KIYEOK+KIYEOK (ga+g+g) is (sort of?) the same as GA+SSANGKIYEOK (ga+gg), which does compose to GAGG in our code. However, KIYEOK+KIYEOK on its own does not normalize to SSANGKIYEOK (g+g != gg), and GAGG decomposes to GA+SSANGKIYEOK, not to GA+KIYEOK+KIYEOK. So treating two kiyeoks as a ssangkiyeok only works when composing within a syllable block, and only in one direction, which is strange.

However, I need to dig into the tables to understand if we're supposed to do this.

ga+g+g is LV+T+T, a form not usually discussed in Unicode. Usually you have one each of L, V, T.

@Manishearth
Member

Our algorithm is wrong. GAGG getting composed is a happy accident.

GA+TIKEUT+TIKEUT "\u{AC00}\u{11AE}\u{11AE}" ('가', 'ᆮ', 'ᆮ') composes to 갎 (U+AC0E), which is GALP, or GA+RIEUL-PHIEUPH, not GA+SSANGTIKEUT (SSANGTIKEUT isn't even a jongseong, so that wouldn't work anyway).

The reason GA+G+G composed was that G has an offset of 1, and GAGG is right next to GAG, so our algorithm treated GAG as an LV precomposed syllable block (it is not, it is LVT) and added 1 to it to get GAGG.

I'm going to look through the algorithm in the spec and see what's missing. In essence, we're not distinguishing between LVT s-blocks (which do not compose further) and LV s-blocks (which can compose to LVT).
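
The offset arithmetic described above can be reproduced in a few lines. This is a hypothetical reconstruction of the flawed step, not the crate's actual source: the function name is invented, and the constants are the standard Hangul values from the Unicode specification.

```rust
const S_BASE: u32 = 0xAC00; // first precomposed syllable (GA)
const T_BASE: u32 = 0x11A7; // one below the first jongseong (KIYEOK is T_BASE + 1)
const T_COUNT: u32 = 28; // trailing-consonant slots per LV, including "no T"
const S_COUNT: u32 = 11172; // total precomposed syllables

/// Hypothetical flawed step: any precomposed syllable followed by a
/// jongseong gets the jongseong's offset added to it.
fn buggy_compose_syllable_t(s: char, t: char) -> Option<char> {
    let s_index = (s as u32).checked_sub(S_BASE)?;
    let t_index = (t as u32).checked_sub(T_BASE)?;
    if s_index < S_COUNT && t_index > 0 && t_index < T_COUNT {
        // Missing: a check that `s` is an LV syllable rather than an LVT one.
        char::from_u32(S_BASE + s_index + t_index)
    } else {
        None
    }
}

fn main() {
    // GAG (LVT) + KIYEOK: offset 1 lands on the neighbouring GAGG.
    assert_eq!(buggy_compose_syllable_t('\u{AC01}', '\u{11A8}'), Some('\u{AC02}'));
    // GAD (LVT) + TIKEUT: offset 7 lands on the unrelated GALP.
    assert_eq!(buggy_compose_syllable_t('\u{AC07}', '\u{11AE}'), Some('\u{AC0E}'));
    println!("both LVT + T inputs were wrongly composed");
}
```

TIKEUT has offset 7, so GA+TIKEUT gives U+AC07 and a second TIKEUT adds another 7 to reach U+AC0E, which matches the GALP result reported above.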

@Manishearth
Member
Manishearth commented Dec 19, 2016

Fixed in #11 / 98748d3

Manishearth and others added 2 commits December 20, 2016 00:03
Fix #11.

The algorithm for composition of Hangul Jamo is:

 - L (choseong jamo) + V (jungseong jamo) = LV (syllable block)
 - LV (syllable block) + T (jongseong jamo) = LVT (syllable block)

However, the LV and LVT syllable blocks are intermingled in the Unicode
block. In particular, for each pair LV, you will first see the syllable block
LV, followed by syllable blocks for LVT for each T. The LV+T
composition was a simple addition of offsets.

Our algorithm did not ignore the LVT syllable blocks, which meant that
LVT+T would just offset further and produce an unrelated syllable block.

By ensuring that the `S_index` is a multiple of `T_count`, we filter
for only LV syllable blocks (which occur every `T_count` codepoints in
the S block).
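
A minimal sketch of the pairwise composition with the `S_index % T_count` filter described in this commit message is shown below. It is an illustration under assumptions, not the code from the commit: `compose_hangul` is a hypothetical name, and the constants are the standard Hangul values from the Unicode specification, written in upper case per Rust convention.

```rust
const S_BASE: u32 = 0xAC00;
const L_BASE: u32 = 0x1100;
const V_BASE: u32 = 0x1161;
const T_BASE: u32 = 0x11A7;
const L_COUNT: u32 = 19;
const V_COUNT: u32 = 21;
const T_COUNT: u32 = 28;
const N_COUNT: u32 = V_COUNT * T_COUNT; // 588 syllables per leading consonant
const S_COUNT: u32 = L_COUNT * N_COUNT; // 11172 syllables in total

/// Hypothetical helper: compose two adjacent code points if they form
/// L + V or LV + T; otherwise return `None`.
fn compose_hangul(a: char, b: char) -> Option<char> {
    let (a, b) = (a as u32, b as u32);
    // L + V = LV
    if (L_BASE..L_BASE + L_COUNT).contains(&a) && (V_BASE..V_BASE + V_COUNT).contains(&b) {
        let l_index = a - L_BASE;
        let v_index = b - V_BASE;
        return char::from_u32(S_BASE + l_index * N_COUNT + v_index * T_COUNT);
    }
    // LV + T = LVT, only when the syllable index is a multiple of T_COUNT,
    // which is true exactly for LV syllables.
    if (S_BASE..S_BASE + S_COUNT).contains(&a)
        && (T_BASE + 1..T_BASE + T_COUNT).contains(&b)
        && (a - S_BASE) % T_COUNT == 0
    {
        return char::from_u32(a + (b - T_BASE));
    }
    None
}

fn main() {
    assert_eq!(compose_hangul('\u{1100}', '\u{1161}'), Some('\u{AC00}')); // L + V
    assert_eq!(compose_hangul('\u{AC00}', '\u{11A8}'), Some('\u{AC01}')); // LV + T
    assert_eq!(compose_hangul('\u{AC01}', '\u{11A8}'), None); // LVT + T
    println!("LVT + T is left uncomposed");
}
```

With this check, the failing input `1100 AC00 11A8 11A8` composes only as far as `1100 AC01 11A8`, since the second `11A8` follows an LVT syllable.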