Update to Unicode 9.0.0 #10

SimonSapin · 2016-12-19T20:28:06Z

No description provided.

SimonSapin · 2016-12-19T20:47:40Z

Failing test in Unicode source form:

1100 AC00 11A8 11A8;1100 AC01 11A8;1100 1100 1161 11A8 11A8;1100 AC01 11A8;1100 1100 1161 11A8 11A8; 
# (ᄀ각ᆨ; ᄀ각ᆨ; ᄀ각ᆨ; ᄀ각ᆨ; ᄀ각ᆨ; ) HANGUL CHOSEONG KIYEOK, HANGUL SYLLABLE GA, HANGUL JONGSEONG KIYEOK, HANGUL JONGSEONG KIYEOK

when testing column_2 == nfc(column_1):

'assertion failed: `(left == right)` (left: `"ᄀ각ᆨ"`, right: `"ᄀ갂"`)'

Manishearth · 2016-12-19T21:45:19Z

So the NFC output looks somewhat conceptually correct to me. GA+KIYEOK+KIYEOK (ga+g+g) is (sort of?) the same as GA+SSANGKIYEOK (ga+gg), which does compose to GAGG in our code. However, KIYEOK+KIYEOK does not normalize to SSANGKIYEOK (so on its own, g+g != gg; it only seems to work within a syllable block). Additionally, GAGG decomposes to GA+SSANGKIYEOK. So treating two kiyeoks as a ssangkiyeok only works when composing them within a syllable block, and only in one direction, which is strange.

However, I need to dig into the tables to understand if we're supposed to do this.

ga+g+g is LV+T+T, a form that Unicode does not usually discuss. Usually you have one each of L, V, and T.

Manishearth · 2016-12-19T22:21:45Z

Our algorithm is wrong. GAGG getting composed is a happy accident.

GA+TIKEUT+TIKEUT "\u{AC00}\u{11AE}\u{11AE}" ('가', 'ᆮ', 'ᆮ') composes to 갎, which is GALP, or GA+RIEUL-PHIEUPH, not GA+SSANGTIKEUT (SSANGTIKEUT isn't even a jongseong, so that wouldn't work anyway).

The reason GA+G+G composed was that G has an offset of 1, and GAGG is right next to GAG, so our algorithm treated GAG as an LV precomposed syllable block (it is not, it is LVT) and added 1 to it to get GAGG.
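The offset arithmetic behind both examples can be reproduced with a small sketch. This is a hypothetical reconstruction of the buggy step, not the crate's actual code: the function name `buggy_compose_syllable_t` is made up for illustration, and the constants are the standard Hangul syllable constants from the Unicode spec.

```rust
// Standard Hangul constants from the Unicode spec.
const S_BASE: u32 = 0xAC00; // first precomposed syllable (GA)
const T_BASE: u32 = 0x11A7; // one before the first trailing consonant
const T_COUNT: u32 = 28; // trailing slots per LV pair, including "no T"
const S_COUNT: u32 = 19 * 21 * 28; // total number of precomposed syllables

/// Hypothetical reconstruction of the bug: ANY precomposed syllable
/// followed by a trailing consonant is offset by the T index, without
/// checking whether the syllable is LV or already LVT.
fn buggy_compose_syllable_t(s: char, t: char) -> Option<char> {
    let (s, t) = (s as u32, t as u32);
    let is_syllable = (S_BASE..S_BASE + S_COUNT).contains(&s);
    let is_trailing = (T_BASE + 1..T_BASE + T_COUNT).contains(&t);
    if is_syllable && is_trailing {
        // KIYEOK (U+11A8) has T index 1, TIKEUT (U+11AE) has T index 7.
        char::from_u32(s + (t - T_BASE))
    } else {
        None
    }
}
```

With this sketch, GAG (U+AC01) + KIYEOK yields U+AC02 (GAGG) only because the two are adjacent, while GAD (U+AC07) + TIKEUT yields U+AC0E (GALP): 7 + 7 = 14, which is the T index of RIEUL-PHIEUPH.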

I'm going to look through the algorithm in the spec and see what's missing. In essence, we're not distinguishing between LVT s-blocks (which do not compose further) and LV s-blocks (which can compose to LVT).

Manishearth · 2016-12-19T22:39:01Z

Fixed in #11 / 98748d3

Fix #11.

The algorithm for composition of Hangul Jamo is:

- L (choseong jamo) + V (jungseong jamo) = LV (syllable block)
- LV (syllable block) + T (jongseong jamo) = LVT (syllable block)

However, the LV and LVT syllable blocks are intermingled in the Unicode block. In particular, for each pair LV, you will first see the syllable block LV, followed by the LVT syllable blocks for each T.

The LV+T composition was a simple addition of offsets. Our algorithm did not ignore the LVT syllable blocks, which meant that LVT+T would just offset further and produce an unrelated syllable block.

By ensuring that the `S_index` is a multiple of `T_count`, we filter for only LV syllable blocks (which occur every `T_count` codepoints in the S block).
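The two composition rules and the LV filter can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the commit: `compose_hangul` is a hypothetical name, and the constants are the standard Hangul constants from the Unicode spec.

```rust
// Standard Hangul constants from the Unicode spec.
const S_BASE: u32 = 0xAC00;
const L_BASE: u32 = 0x1100;
const V_BASE: u32 = 0x1161;
const T_BASE: u32 = 0x11A7; // one before the first trailing consonant
const L_COUNT: u32 = 19;
const V_COUNT: u32 = 21;
const T_COUNT: u32 = 28; // includes the "no trailing consonant" slot
const S_COUNT: u32 = L_COUNT * V_COUNT * T_COUNT;

/// Hypothetical helper: compose a pair of code points per the Hangul rules.
fn compose_hangul(a: char, b: char) -> Option<char> {
    let (a, b) = (a as u32, b as u32);

    // L + V = LV
    if (L_BASE..L_BASE + L_COUNT).contains(&a) && (V_BASE..V_BASE + V_COUNT).contains(&b) {
        let l_index = a - L_BASE;
        let v_index = b - V_BASE;
        return char::from_u32(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT);
    }

    // LV + T = LVT. The `% T_COUNT == 0` check keeps only LV syllables,
    // which occur every T_COUNT code points; LVT syllables are rejected.
    if (S_BASE..S_BASE + S_COUNT).contains(&a)
        && (T_BASE + 1..T_BASE + T_COUNT).contains(&b)
        && (a - S_BASE) % T_COUNT == 0
    {
        return char::from_u32(a + (b - T_BASE));
    }

    None
}
```

With the filter in place, GA (U+AC00) + KIYEOK (U+11A8) still composes to GAG (U+AC01), but GAG + KIYEOK returns `None`, so the second KIYEOK in the failing test stays a separate jamo.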

SimonSapin force-pushed the master branch from 7a54ec8 to cd9a45e Compare December 19, 2016 23:01

Manishearth and others added 2 commits December 20, 2016 00:03

Update to Unicode 9.0.0

7299191

SimonSapin force-pushed the 9 branch from f0e9362 to 7299191 Compare December 19, 2016 23:04

SimonSapin merged commit b3a331c into master Dec 19, 2016

SimonSapin deleted the 9 branch December 19, 2016 23:05
