by Peter Constable, UTC Chair
The Unicode Technical Committee (UTC) met last week (April 23 to
25) in San Jose, California. Thanks to Unicode member company Adobe for hosting.
Here are some highlights from the large number of items that were covered.
Preparing Unicode 16.0 Beta
An important objective was to cover all technical decisions that
would be needed for the Unicode 16.0 Beta preview. The Beta will be available
for public review and comment on May 21, 2024, and will include all charts, data
and annexes for The Unicode Standard as well as other synchronized standards,
including UTS 10, Unicode Collation Algorithm, and UTS 51, Unicode Emoji. Also,
for the first time, the Beta release will include a complete draft of the core
text of the standard.
The character repertoire for Unicode 16.0 was slightly adjusted,
with the removal of two characters: U+0CDC KANNADA ARCHAIC SHRII and U+0C5C
TELUGU ARCHAIC SHRII. These characters were first approved in January 2022 (UTC
#170) and assigned for addition in Unicode 16.0 in April 2023 (UTC #175).
However, in the ISO process for Amendment 2 of ISO/IEC 10646:2022 (which is to
be synchronized with Unicode 16.0), the India national body requested more time
for review by experts in India. To avoid a risk of Unicode 16.0 and Amendment 2
of 10646 not being in sync, UTC decided to delay these two characters for a
later version.
Various character property (UCD) and algorithm changes were made
based on issues reported during the Alpha review or found while the UTC
Properties and Algorithms Working Group prepared data files for 16.0. Two
notable areas for changes are grapheme cluster segmentation (UAX #29) and line
breaking (UAX #14):
- For grapheme clusters, some changes will be
made to extended grapheme cluster segmentation for improved handling of
orthographic syllables in Indic scripts.
- For line breaking, several changes will be
made to data and rules to fix various edge cases, and to incorporate
behaviour for hyphens that has already been implemented in CLDR and ICU for
several years.
Also related to properties, the organization of the
ScriptExtensions.txt file will be changing. Previously, lines of data were
grouped by characters that had the same script extension property values. Going
forward, lines will be ordered by code point. (This is only a change in the
order the data is listed; the parsing of lines is unchanged.) This will make it
much easier to compare changes in property values between different Unicode
versions.
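To illustrate why only the listing order changes, here is a minimal Python sketch that reads data in the usual UCD line layout (a code point or range, a semicolon, space-separated script codes, and an optional "#" comment) into a mapping keyed by code point. The function names and the sample lines are hypothetical, and the sample values are invented for illustration rather than copied from the real ScriptExtensions.txt. A parser written this way gives the same result for the grouped and the code-point-ordered layouts, and comparing two versions becomes a simple per-code-point diff.

```python
def parse_script_extensions(text):
    """Map each code point to a frozenset of script codes.

    Assumes the usual UCD layout: "XXXX[..YYYY] ; Scr1 Scr2 # comment".
    """
    result = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        cp_field, scripts_field = (part.strip() for part in line.split(";", 1))
        first, _, last = cp_field.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        scripts = frozenset(scripts_field.split())
        for cp in range(start, end + 1):
            result[cp] = scripts
    return result


def diff_versions(old, new):
    """Return {code point: (old value, new value)} for every changed entry."""
    changes = {}
    for cp in sorted(old.keys() | new.keys()):
        before, after = old.get(cp), new.get(cp)
        if before != after:
            changes[cp] = (before, after)
    return changes


# Invented sample data, grouped by property value (the previous layout).
GROUPED = """\
# Script_Extensions=Beng Deva
0951..0952 ; Beng Deva # sample comment
1CD5..1CD6 ; Beng Deva
# Script_Extensions=Deva Gran
1CD3 ; Deva Gran
"""

# The same invented data, ordered by code point (the new layout).
BY_CODE_POINT = """\
0951..0952 ; Beng Deva
1CD3 ; Deva Gran
1CD5..1CD6 ; Beng Deva
"""

# A hypothetical later version in which one value has changed.
LATER = BY_CODE_POINT.replace("1CD3 ; Deva Gran", "1CD3 ; Deva Gran Knda")

same = parse_script_extensions(GROUPED) == parse_script_extensions(BY_CODE_POINT)
changed = diff_versions(parse_script_extensions(BY_CODE_POINT),
                        parse_script_extensions(LATER))
```

With the file ordered by code point, a plain text diff of two versions should show roughly what diff_versions computes here, without lines appearing to move between groups.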
In relation to emoji, the set of new emoji for version 16.0 is
unchanged. During the Beta review, the draft update for UTS #51, Unicode Emoji,
will include some proposed revisions related to recommendations for display of
emoji family combinations. These revisions have not yet been reviewed or
approved by UTC, so they will need careful review during the Beta and will be
subject to confirmation or change at the next UTC meeting, after the Beta review
period is over.
UTC action item backlog
UTC has had a growing backlog of open action items, some over ten
years old. For this meeting, the various UTC working groups triaged their action
items that were five or more years old, and outcomes were discussed by the UTC.
Some action items were completed; some were closed as no longer relevant. Many
that required more research were closed as UTC action items and replaced by
issues in the relevant working group’s GitHub repo. Note that tracking them in
this other way doesn’t necessarily mean they will get higher priority. However,
since the working groups are using GitHub issues to organize their regular work,
this should bring more attention to these issues. UTC will repeat this process
at UTC #181, six months from now.
As a side effect of this review of old action items, a document was
submitted to UTC (L2/24-123)
proposing that UTC transition from the way it has handled action items in the
past to tracking issues in a public GitHub repo to allow contributions from a
broader set of volunteers. That document identifies some problems and
limitations of the existing processes, and suggests that a new process could
provide improvements. UTC spent some time discussing this document. It was noted
that the idea was valuable, though such a change in processes would not be a
small change and would involve some not-so-obvious challenges. It would also be
something that affects the Unicode Consortium as a whole, not just UTC. For that
reason, this proposal will need to be considered as part of a broader discussion
of Consortium processes, resources and infrastructure.
New investigation: automatic space handling at inter-script boundaries
East Asian text often combines different scripts, and a common
typographic practice is to insert space between script runs. UTC briefly
discussed a new document, L2/24-057,
which proposes development of an algorithm for automatic spacing between script
runs. The Properties and Algorithms Working Group has assembled experts to
discuss this topic. Interested experts are invited to join the discussion via
issues labeled "auto-spacing" in the public unicodetools repo on GitHub.