|
Failing test in Unicode source form: when testing |
|
So the NFC form looks somewhat conceptually correct to me. GA+KIYEOK+KIYEOK (ga+g+g) is (sort of?) the same as GA+SSANGKIYEOK (ga+gg), which does compose to GAGG in our code. However, KIYEOK+KIYEOK does not normalize to SSANGKIYEOK on its own (g+g != gg), and GAGG decomposes to GA+SSANGKIYEOK. So "two kiyeoks make a ssangkiyeok" seems to hold only when composing within a syllable block, and only in one direction, which is strange. I need to dig into the tables to understand whether we're supposed to do this. ga+g+g is LV+T+T, a form that is not usually discussed in Unicode; usually you have one each of L, V, and T. |
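As a cross-check, here is a minimal sketch that runs the same cases through Python's standard `unicodedata` module instead of our code. The variable and helper names are made up for illustration, and this only shows how one independent implementation behaves; it does not replace reading the spec.

```python
import unicodedata

GA = "\uAC00"             # HANGUL SYLLABLE GA (an LV block)
GAG = "\uAC01"            # HANGUL SYLLABLE GAG (an LVT block)
GAGG = "\uAC02"           # HANGUL SYLLABLE GAGG (an LVT block)
KIYEOK_T = "\u11A8"       # HANGUL JONGSEONG KIYEOK (trailing g)
SSANGKIYEOK_T = "\u11A9"  # HANGUL JONGSEONG SSANGKIYEOK (trailing gg)


def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)


def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)


def show(label: str, s: str) -> None:
    # Print code points so the result does not depend on font support.
    print(label, [f"U+{ord(c):04X}" for c in s])


show("NFC(ga+g+g):", nfc(GA + KIYEOK_T + KIYEOK_T))
show("NFC(g+g):   ", nfc(KIYEOK_T + KIYEOK_T))
show("NFC(ga+gg): ", nfc(GA + SSANGKIYEOK_T))
show("NFD(gagg):  ", nfd(GAGG))
```

With `unicodedata`, ga+g+g comes back as GAG followed by a lone KIYEOK, not as GAGG, which suggests the LV+T+T case deserves a closer look in our implementation.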
|
Our algorithm is wrong. GAGG getting composed is a happy accident. GA+TIKEUT+TIKEUT The reason GA+G+G composed is that G has an offset of 1 and GAGG is right next to GAG, so our algorithm treated GAG as a precomposed LV syllable block (it is not; it is LVT) and added 1 to it to get GAGG. I'm going to look through the algorithm in the spec and see what's missing. In essence, we're not distinguishing between LVT s-blocks (which do not compose further) and LV s-blocks (which can compose to LVT). |
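To make the accident concrete, here is a minimal sketch of the kind of unchecked syllable+T step described above. It is not our actual code: `compose_s_t_unchecked` and the constant names are hypothetical, and the numeric values are the standard Hangul constants from the Unicode composition algorithm.

```python
from typing import Optional

S_BASE = 0xAC00   # first precomposed Hangul syllable (GA)
T_BASE = 0x11A7   # one below the first trailing jamo, so T index 0 means "no T"
T_COUNT = 28      # trailing-consonant slots per LV pair (including "no T")
S_COUNT = 11172   # total number of precomposed syllables


def compose_s_t_unchecked(s: str, t: str) -> Optional[str]:
    """Hypothetical buggy step: offsets ANY precomposed syllable by the T index."""
    s_index = ord(s) - S_BASE
    t_index = ord(t) - T_BASE
    if 0 <= s_index < S_COUNT and 0 < t_index < T_COUNT:
        # Bug: nothing checks that `s` is an LV block, so LVT + T "composes" too.
        return chr(ord(s) + t_index)
    return None


GA, GAG, GAD = "\uAC00", "\uAC01", "\uAC07"
KIYEOK_T, TIKEUT_T = "\u11A8", "\u11AE"   # T indices 1 and 7

# LV + T: the legitimate case.
print(hex(ord(compose_s_t_unchecked(GA, KIYEOK_T))))    # 0xac01 (GAG)
# LVT + T with offset 1: lands on the neighbouring block, which happens to be GAGG.
print(hex(ord(compose_s_t_unchecked(GAG, KIYEOK_T))))   # 0xac02 (GAGG)
# LVT + T with offset 7: lands on a syllable unrelated to either input.
print(hex(ord(compose_s_t_unchecked(GAD, TIKEUT_T))))   # 0xac0e
```

The offset-1 case looks plausible only because the block for GA+SSANGKIYEOK sits directly after GAG; with any larger offset the result no longer resembles the input.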
Fix #11.

The algorithm for composition of Hangul Jamo is:

- L (choseong jamo) + V (jungseong jamo) = LV (syllable block)
- LV (syllable block) + T (jongseong jamo) = LVT (syllable block)

However, the LV and LVT syllable blocks are intermingled in the Unicode block. In particular, for each pair LV, you first see the LV syllable block, followed by the LVT syllable block for each T. The LV+T composition was a simple addition of offsets. Our algorithm did not exclude the LVT syllable blocks, which meant that LVT+T would just offset further and produce an unrelated syllable block.

By ensuring that `S_index` is a multiple of `T_count`, we accept only LV syllable blocks (which occur every `T_count` code points in the S block).
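The rule above can be sketched as follows. This is an illustrative Python version, not the code in this PR: `compose_hangul_pair` and the constant names are hypothetical, and the numeric values are the standard Hangul constants from the Unicode spec.

```python
from typing import Optional

S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
S_COUNT = L_COUNT * V_COUNT * T_COUNT  # 11172 precomposed syllables


def compose_hangul_pair(a: str, b: str) -> Optional[str]:
    """Compose one pair of code points, or return None if they do not compose."""
    # L + V -> LV
    l_index = ord(a) - L_BASE
    v_index = ord(b) - V_BASE
    if 0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT:
        return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT)

    # LV + T -> LVT, only when `a` is an LV block (S index divisible by T_COUNT)
    s_index = ord(a) - S_BASE
    t_index = ord(b) - T_BASE
    if (0 <= s_index < S_COUNT
            and s_index % T_COUNT == 0      # rejects LVT blocks
            and 0 < t_index < T_COUNT):     # T index 0 means "no trailing jamo"
        return chr(ord(a) + t_index)

    return None


GA, GAG = "\uAC00", "\uAC01"
KIYEOK_T, SSANGKIYEOK_T = "\u11A8", "\u11A9"

print(compose_hangul_pair("\u1100", "\u1161") == GA)    # True: L + V -> LV
print(compose_hangul_pair(GA, KIYEOK_T) == GAG)         # True: LV + T -> LVT
print(compose_hangul_pair(GAG, KIYEOK_T))               # None: LVT + T is left alone
```

With the divisibility check in place, GA+KIYEOK+KIYEOK stops at GAG followed by an uncomposed KIYEOK, while GA+SSANGKIYEOK still composes to GAGG.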