Closed
This is needed for mixed script detection.
The easy way to do this is to store a slice of script_extensions for each code point or range, but there is actually a limited set of ways script_extensions combine (taken from here):
Adlam (Adlam),
Adlam,Arabic,Hanifi_Rohingya,Mandaic,Manichaean,Psalter_Pahlavi,Sogdian,Syriac (Adlam,Arabic,Hanifi_Rohingya,Mandaic,Manichaean,Psalter_Pahlavi,Sogdian,Syriac),
Ahom (Ahom),
Anatolian_Hieroglyphs (Anatolian_Hieroglyphs),
Arabic (Arabic),
Arabic,Coptic (Arabic,Coptic),
Arabic,Hanifi_Rohingya (Arabic,Hanifi_Rohingya),
Arabic,Hanifi_Rohingya,Syriac,Thaana (Arabic,Hanifi_Rohingya,Syriac,Thaana),
Arabic,Syriac (Arabic,Syriac),
Arabic,Syriac,Thaana (Arabic,Syriac,Thaana),
Arabic,Thaana (Arabic,Thaana),
Armenian (Armenian),
Armenian,Georgian (Armenian,Georgian),
Avestan (Avestan),
Balinese (Balinese),
Bamum (Bamum),
Bassa_Vah (Bassa_Vah),
Batak (Batak),
Bengali (Bengali),
Bengali,Chakma,Syloti_Nagri (Bengali,Chakma,Syloti_Nagri),
Bengali,Devanagari (Bengali,Devanagari),
Bengali,Devanagari,Dogra,Grantha,Gujarati,Gunjala_Gondi,Gurmukhi,Kannada,Khudawadi,Limbu,Mahajani,Malayalam,Masaram_Gondi,Nandinagari,Oriya,Sinhala,Syloti_Nagri,Takri,Tamil,Telugu,Tirhuta (Bengali,Devanagari,Dogra,Grantha,Gujarati,Gunjala_Gondi,Gurmukhi,Kannada,Khudawadi,Limbu,Mahajani,Malayalam,Masaram_Gondi,Nandinagari,Oriya,Sinhala,Syloti_Nagri,Takri,Tamil,Telugu,Tirhuta),
Bengali,Devanagari,Dogra,Grantha,Gujarati,Gunjala_Gondi,Gurmukhi,Kannada,Khudawadi,Mahajani,Malayalam,Masaram_Gondi,Nandinagari,Oriya,Sinhala,Syloti_Nagri,Takri,Tamil,Telugu,Tirhuta (Bengali,Devanagari,Dogra,Grantha,Gujarati,Gunjala_Gondi,Gurmukhi,Kannada,Khudawadi,Mahajani,Malayalam,Masaram_Gondi,Nandinagari,Oriya,Sinhala,Syloti_Nagri,Takri,Tamil,Telugu,Tirhuta),
Bengali,Devanagari,Grantha,Gujarati,Gurmukhi,Kannada,Latin,Malayalam,Oriya,Sharada,Tamil,Telugu,Tirhuta (Bengali,Devanagari,Grantha,Gujarati,Gurmukhi,Kannada,Latin,Malayalam,Oriya,Sharada,Tamil,Telugu,Tirhuta),
Bengali,Devanagari,Grantha,Gujarati,Gurmukhi,Kannada,Latin,Malayalam,Oriya,Tamil,Telugu,Tirhuta (Bengali,Devanagari,Grantha,Gujarati,Gurmukhi,Kannada,Latin,Malayalam,Oriya,Tamil,Telugu,Tirhuta),
Bengali,Devanagari,Grantha,Kannada (Bengali,Devanagari,Grantha,Kannada),
Bengali,Devanagari,Grantha,Kannada,Nandinagari,Oriya,Telugu,Tirhuta (Bengali,Devanagari,Grantha,Kannada,Nandinagari,Oriya,Telugu,Tirhuta),
Bhaiksuki (Bhaiksuki),
Bopomofo (Bopomofo),
Bopomofo,Han (Bopomofo,Han),
Bopomofo,Han,Hangul,Hiragana,Katakana (Bopomofo,Han,Hangul,Hiragana,Katakana),
Bopomofo,Han,Hangul,Hiragana,Katakana,Yi (Bopomofo,Han,Hangul,Hiragana,Katakana,Yi),
Brahmi (Brahmi),
Braille (Braille),
Buginese (Buginese),
Buginese,Javanese (Buginese,Javanese),
Buhid (Buhid),
Buhid,Hanunoo,Tagalog,Tagbanwa (Buhid,Hanunoo,Tagalog,Tagbanwa),
Canadian_Aboriginal (Canadian_Aboriginal),
Carian (Carian),
Caucasian_Albanian (Caucasian_Albanian),
Chakma (Chakma),
Chakma,Myanmar,Tai_Le (Chakma,Myanmar,Tai_Le),
Cham (Cham),
Cherokee (Cherokee),
Common (Common),
Coptic (Coptic),
Cuneiform (Cuneiform),
Cypriot (Cypriot),
Cypriot,Linear_A,Linear_B (Cypriot,Linear_A,Linear_B),
Cypriot,Linear_B (Cypriot,Linear_B),
Cyrillic (Cyrillic),
Cyrillic,Glagolitic (Cyrillic,Glagolitic),
Cyrillic,Latin (Cyrillic,Latin),
Cyrillic,Old_Permic (Cyrillic,Old_Permic),
Deseret (Deseret),
Devanagari (Devanagari),
Devanagari,Dogra,Gujarati,Gurmukhi,Kaithi,Kannada,Khojki,Khudawadi,Mahajani,Malayalam,Modi,Nandinagari,Takri,Tirhuta (Devanagari,Dogra,Gujarati,Gurmukhi,Kaithi,Kannada,Khojki,Khudawadi,Mahajani,Malayalam,Modi,Nandinagari,Takri,Tirhuta),
Devanagari,Dogra,Gujarati,Gurmukhi,Kaithi,Kannada,Khojki,Khudawadi,Mahajani,Modi,Nandinagari,Takri,Tirhuta (Devanagari,Dogra,Gujarati,Gurmukhi,Kaithi,Kannada,Khojki,Khudawadi,Mahajani,Modi,Nandinagari,Takri,Tirhuta),
Devanagari,Dogra,Gujarati,Gurmukhi,Kaithi,Khojki,Khudawadi,Mahajani,Modi,Takri,Tirhuta (Devanagari,Dogra,Gujarati,Gurmukhi,Kaithi,Khojki,Khudawadi,Mahajani,Modi,Takri,Tirhuta),
Devanagari,Dogra,Kaithi,Mahajani (Devanagari,Dogra,Kaithi,Mahajani),
Devanagari,Grantha (Devanagari,Grantha),
Devanagari,Grantha,Kannada (Devanagari,Grantha,Kannada),
Devanagari,Grantha,Latin (Devanagari,Grantha,Latin),
Devanagari,Kannada,Malayalam,Oriya,Tamil,Telugu (Devanagari,Kannada,Malayalam,Oriya,Tamil,Telugu),
Devanagari,Nandinagari (Devanagari,Nandinagari),
Devanagari,Sharada (Devanagari,Sharada),
Devanagari,Tamil (Devanagari,Tamil),
Dogra (Dogra),
Duployan (Duployan),
Egyptian_Hieroglyphs (Egyptian_Hieroglyphs),
Elbasan (Elbasan),
Elymaic (Elymaic),
Ethiopic (Ethiopic),
Georgian (Georgian),
Georgian,Latin (Georgian,Latin),
Glagolitic (Glagolitic),
Gothic (Gothic),
Grantha (Grantha),
Grantha,Tamil (Grantha,Tamil),
Greek (Greek),
Gujarati (Gujarati),
Gujarati,Khojki (Gujarati,Khojki),
Gunjala_Gondi (Gunjala_Gondi),
Gurmukhi (Gurmukhi),
Gurmukhi,Multani (Gurmukhi,Multani),
Han (Han),
Han,Hiragana,Katakana (Han,Hiragana,Katakana),
Hangul (Hangul),
Hanifi_Rohingya (Hanifi_Rohingya),
Hanunoo (Hanunoo),
Hatran (Hatran),
Hebrew (Hebrew),
Hiragana (Hiragana),
Hiragana,Katakana (Hiragana,Katakana),
Imperial_Aramaic (Imperial_Aramaic),
Inherited (Inherited),
Inscriptional_Pahlavi (Inscriptional_Pahlavi),
Inscriptional_Parthian (Inscriptional_Parthian),
Javanese (Javanese),
Kaithi (Kaithi),
Kannada (Kannada),
Kannada,Nandinagari (Kannada,Nandinagari),
Katakana (Katakana),
Kayah_Li (Kayah_Li),
Kayah_Li,Latin,Myanmar (Kayah_Li,Latin,Myanmar),
Kharoshthi (Kharoshthi),
Khmer (Khmer),
Khojki (Khojki),
Khudawadi (Khudawadi),
Lao (Lao),
Latin (Latin),
Latin,Mongolian (Latin,Mongolian),
Lepcha (Lepcha),
Limbu (Limbu),
Linear_A (Linear_A),
Linear_B (Linear_B),
Lisu (Lisu),
Lycian (Lycian),
Lydian (Lydian),
Mahajani (Mahajani),
Makasar (Makasar),
Malayalam (Malayalam),
Mandaic (Mandaic),
Manichaean (Manichaean),
Marchen (Marchen),
Masaram_Gondi (Masaram_Gondi),
Medefaidrin (Medefaidrin),
Meetei_Mayek (Meetei_Mayek),
Mende_Kikakui (Mende_Kikakui),
Meroitic_Cursive (Meroitic_Cursive),
Meroitic_Hieroglyphs (Meroitic_Hieroglyphs),
Miao (Miao),
Modi (Modi),
Mongolian (Mongolian),
Mongolian,Phags_Pa (Mongolian,Phags_Pa),
Mro (Mro),
Multani (Multani),
Myanmar (Myanmar),
Nabataean (Nabataean),
Nandinagari (Nandinagari),
New_Tai_Lue (New_Tai_Lue),
Newa (Newa),
Nko (Nko),
Nushu (Nushu),
Nyiakeng_Puachue_Hmong (Nyiakeng_Puachue_Hmong),
Ogham (Ogham),
Ol_Chiki (Ol_Chiki),
Old_Hungarian (Old_Hungarian),
Old_Italic (Old_Italic),
Old_North_Arabian (Old_North_Arabian),
Old_Permic (Old_Permic),
Old_Persian (Old_Persian),
Old_Sogdian (Old_Sogdian),
Old_South_Arabian (Old_South_Arabian),
Old_Turkic (Old_Turkic),
Oriya (Oriya),
Osage (Osage),
Osmanya (Osmanya),
Pahawh_Hmong (Pahawh_Hmong),
Palmyrene (Palmyrene),
Pau_Cin_Hau (Pau_Cin_Hau),
Phags_Pa (Phags_Pa),
Phoenician (Phoenician),
Psalter_Pahlavi (Psalter_Pahlavi),
Rejang (Rejang),
Runic (Runic),
Samaritan (Samaritan),
Saurashtra (Saurashtra),
Sharada (Sharada),
Shavian (Shavian),
Siddham (Siddham),
Sign_Writing (Sign_Writing),
Sinhala (Sinhala),
Sogdian (Sogdian),
Sora_Sompeng (Sora_Sompeng),
Soyombo (Soyombo),
Sundanese (Sundanese),
Syloti_Nagri (Syloti_Nagri),
Syriac (Syriac),
Tagalog (Tagalog),
Tagbanwa (Tagbanwa),
Tai_Le (Tai_Le),
Tai_Tham (Tai_Tham),
Tai_Viet (Tai_Viet),
Takri (Takri),
Tamil (Tamil),
Tangut (Tangut),
Telugu (Telugu),
Thaana (Thaana),
Thai (Thai),
Tibetan (Tibetan),
Tifinagh (Tifinagh),
Tirhuta (Tirhuta),
Ugaritic (Ugaritic),
Unknown (Unknown),
Vai (Vai),
Wancho (Wancho),
Warang_Citi (Warang_Citi),
Yi (Yi),
Zanabazar_Square (Zanabazar_Square)
We can very easily make a single enum value for each combination and programmatically generate an intersect() function that calculates intersections. This would be faster than intersecting slices at lookup time.
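A minimal sketch of the enum idea, assuming only a handful of the combinations listed above. The names `ScriptExt`, `bits`, and `from_bits` are hypothetical, and the small bitmasks stand in for what a generator would precompute into a match or lookup table; none of this is an existing API.

```rust
/// One variant per distinct Script_Extensions value.
/// Only six of the combinations from the list are shown here;
/// a generator would emit one variant for every entry.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ScriptExt {
    Arabic,
    Syriac,
    Thaana,
    ArabicSyriac,
    ArabicThaana,
    ArabicSyriacThaana,
}

// Hypothetical per-script bits, used only to derive intersections in this sketch.
const ARABIC: u8 = 1 << 0;
const SYRIAC: u8 = 1 << 1;
const THAANA: u8 = 1 << 2;

impl ScriptExt {
    const ALL: [ScriptExt; 6] = [
        ScriptExt::Arabic,
        ScriptExt::Syriac,
        ScriptExt::Thaana,
        ScriptExt::ArabicSyriac,
        ScriptExt::ArabicThaana,
        ScriptExt::ArabicSyriacThaana,
    ];

    /// The set of scripts this variant stands for, as a bitmask.
    const fn bits(self) -> u8 {
        match self {
            ScriptExt::Arabic => ARABIC,
            ScriptExt::Syriac => SYRIAC,
            ScriptExt::Thaana => THAANA,
            ScriptExt::ArabicSyriac => ARABIC | SYRIAC,
            ScriptExt::ArabicThaana => ARABIC | THAANA,
            ScriptExt::ArabicSyriacThaana => ARABIC | SYRIAC | THAANA,
        }
    }

    /// Finds the variant with exactly these scripts, if one exists.
    fn from_bits(bits: u8) -> Option<ScriptExt> {
        ScriptExt::ALL.iter().copied().find(|s| s.bits() == bits)
    }

    /// Returns the variant for the scripts shared by both values.
    /// `None` means the sets are disjoint, or (in a fuller table) that the
    /// intersection is not itself one of the enumerated combinations.
    pub fn intersect(self, other: ScriptExt) -> Option<ScriptExt> {
        ScriptExt::from_bits(self.bits() & other.bits())
    }
}
```

With these definitions, `ScriptExt::ArabicSyriac.intersect(ScriptExt::ArabicThaana)` yields `Some(ScriptExt::Arabic)`, and `ScriptExt::Syriac.intersect(ScriptExt::Thaana)` yields `None`. One thing a generator would need to check is whether the listed combinations are closed under intersection; if they are not, extra variants would have to be added.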
(For performance, it would probably also be worth running these checks only on non-ASCII identifiers.)
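The ASCII fast path could look roughly like this. The function name `is_mixed_script` and the `full_check` parameter are hypothetical; `full_check` stands in for whatever the real script-extension check ends up being.

```rust
/// Sketch of an ASCII fast path in front of a mixed-script check.
/// `full_check` is a placeholder for the real (slower) check.
pub fn is_mixed_script(ident: &str, full_check: impl Fn(&str) -> bool) -> bool {
    // Every ASCII code point is either Latin (the letters) or Common
    // (digits, `_`, punctuation), so an all-ASCII identifier cannot mix scripts.
    if ident.is_ascii() {
        return false;
    }
    // Only identifiers containing non-ASCII characters pay for the full check.
    full_check(ident)
}
```

Since the large majority of identifiers in typical source code are ASCII, the full check would rarely run.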