
Friday, December 19, 2025

Opening CLDR Survey Tool early for DDL locales


We are announcing an early submission window for the CLDR Survey Tool, exclusively for Digitally Disadvantaged Languages (DDLs). These include languages across the world that lack full digital support, such as Qʼeqchiʼ, with about 1.3M speakers, and many more.


The early submission window will allow more time for individuals and organizations that make DDL contributions, providing crucial data to close the digital support gap. The data will go into the CLDR v50 release, targeted at October 2026. Languages maintained by the CLDR Technical Committee are not available during this special window. They will be available for submission in Q2 2026.


See DDL: Help Center for more information on how to contribute to a DDL language.


If your language is not yet in CLDR, an organization can submit a formal request to add it; see adding a new language.


CLDR Organizations are needed for approval of CLDR data, so that it can be picked up by libraries, applications, programming languages, and operating systems. To register a new CLDR Organization, see adding an organization to CLDR. Individuals can also request languages and submit/approve data; however, the data cannot reach even Basic coverage without at least one CLDR Organization supporting it.



What is CLDR?


CLDR provides key building blocks for software to support the world's languages (dates, times, numbers, sort-order, etc.). All major browsers and all modern mobile phones use CLDR for language support. (See Who uses CLDR?)


Contributors supply data for their languages via the online Survey Tool. This data is widely used to support much of the world’s software and is also a factor in determining which languages are supported on mobile phones and computer operating systems.


The Survey Tool opened on December 18, 2025 for DDL languages. The tool will remain open for data submission and correction until July 2026. A public alpha will make the draft data available in early August 2026. Data contributed at this time will be scheduled for publication and available for use in October 2026.


Each additional CLDR language starts with a small set of Core Data, such as a list of characters used in the language. Submitters of new languages commit to bringing the coverage up to a minimum of Basic coverage (very basic formats for dates, times, numbers, and endonyms). 


Once a language reaches Basic coverage, it will have the minimum support for use in language selection, such as on mobile devices. That is the first step; for broader support the Moderate level is typically required. 


If you would like to contribute missing data for your language, see Survey Tool Accounts.

----------------------------------------------

Adopt a Character and Support Unicode’s Mission

Looking to give that special someone a special something?
Or maybe something to treat yourself?

Adopt a character or emoji to give it the attention it deserves, while also supporting Unicode’s mission to ensure everyone can communicate in their own languages across all devices.

Each adoption includes a digital badge and certificate that you can proudly display!

Have fun and support a good cause

You can also donate funds or gift stock

Tuesday, December 2, 2025

UTC #185 Highlights

Unicode Technical Committee meeting #185 was held October 27 – 29 in Cupertino, CA, hosted by Apple. Here are some highlights.

Starting the Unicode 18.0 cycle

As we've been following an annual September release cycle for the Unicode Standard, the Q4 UTC meeting is the first meeting during a new cycle. While some decisions targeting the release might have been taken at a previous meeting, this is the first meeting in which the next release has particular focus. One of the decisions taken was to plan out the key milestones and dates for the new cycle. Here's a summary of the timeline for Unicode 18.0:

  • November 2025: UTC #185 approved new character repertoire

  • January 2026: UTC #186 will finalize content for the alpha release

  • February – March: alpha release open for public review

  • April: UTC #187 will review alpha feedback and finalize content for the beta release

  • May – June: beta release open for public review

  • July: UTC #188 will finalize 18.0 content

  • September: Unicode 18.0 release

Unicode 18.0 character and emoji repertoire

During a release cycle, the primary focus for the alpha review is on the new character repertoire. The repertoire for the alpha review can be updated at the January UTC meeting; but we like to have that planned repertoire largely determined by the Q4 meeting so that working groups can focus early on preparing content that will be needed for the alpha.

UTC #184 had approved around 60 characters for publication in Unicode 18.0. (Some of those had been planned for Unicode 17.0 but, for various reasons, needed to be postponed.) These included the UAE Dirham sign, and the first tranche of a large set of symbols from the writings of Gottfried Leibniz for which proposals are in development. At UTC #185, nearly 13,000 additional characters were approved for encoding in Unicode 18.0. 

The approved additions include encoding of Small Seal script ("Seal"), a repertoire of 11,328 ideographic characters. Seal is distinct from modern Han ideographs (aka, "CJK"), but is an important precursor of CJK resulting from the first efforts to standardize writing across Chinese-speaking regions during China's Qin Dynasty. As such, Seal has important cultural significance in China and for Chinese speakers throughout the world.

Other additions included 1,276 characters allocated in three new blocks: Archaic Cuneiform Numerals (311 Cuneiform characters from the fourth millennium BCE), and Jurchen and Jurchen Radicals (965 ideographic characters that were used for writing the Jurchen language in the 12th – 13th century CE).

In addition, 321 other characters were approved as additions to a number of existing blocks. This includes many characters for Arabic and Latin scripts, many characters used in phonetic transcription, a number of symbols used in music notation, and a second set of the Leibniz symbols.

Finally, the new characters approved for Unicode 18.0 include nine new emoji characters. Note that many emoji are represented as character sequences, so mentioning the new emoji characters doesn't provide a complete picture. Look for more information about Unicode 18.0 emoji in the coming months.
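To illustrate the point about emoji being represented as character sequences, here is a minimal Python sketch. It uses the existing "woman astronaut" emoji, which is assumed here purely as an example and is not one of the Unicode 18.0 additions. What appears as one emoji is stored as three code points joined by a ZERO WIDTH JOINER.

```python
# One displayed emoji, three code points:
# WOMAN (U+1F469) + ZERO WIDTH JOINER (U+200D) + ROCKET (U+1F680)
astronaut = "\U0001F469\u200D\U0001F680"

# List each code point in the conventional U+XXXX notation.
code_points = [f"U+{ord(ch):04X}" for ch in astronaut]

print(len(astronaut))  # 3
print(code_points)     # ['U+1F469', 'U+200D', 'U+1F680']
```

This is why a count of new emoji characters understates the number of new emoji a user might eventually see: new sequences can be built from characters that are already encoded.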

CJK & Unihan

UTC works on CJK character encoding in collaboration with IRG (Ideographic Research Group), a working group under ISO/IEC JTC 1/SC 2. There are over 100,000 CJK ideographs now encoded in Unicode, and with such a large repertoire of characters there are refinements to the already-encoded characters that continue to be made. At UTC #185, recommendations arising from a recent IRG meeting were reviewed, and a number of changes were approved for Unicode 18.0. Some of these are technical details that are not so visible, such as corrections to source references for certain characters (the references cited when the characters were encoded providing evidence of their usage and identity as distinct characters). Among the significant and visible changes approved by UTC are over 700 horizontal extensions, which will be reflected in the Unicode 18.0 code charts with additional glyphs for already-encoded characters.

For complete details on outcomes from UTC #185, see the draft minutes.

About the Unicode Standard

The world relies on digital communications. The Unicode Standard is a vital building block for global digital communications, providing the encoding for more than 155,000 characters used by thousands of languages and scripts throughout the world. 

Each character (letter, diacritic, symbol, emoji, etc.) is represented by a unique numeric code and has defined property data that determine how it behaves in several text processing algorithms.

With this combination, the Unicode Standard provides the foundation for implementations to support the world's writing systems, enabling billions of people across the globe to seamlessly communicate with one another across platforms and devices. The Standard is also the foundation for the suite of code, libraries, data, and products that the Unicode Consortium delivers for robust language support.
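The pairing of a numeric code with property data, as described above, can be inspected from most programming languages. The following is a small sketch using Python's standard `unicodedata` module; the helper name `describe` is made up for this illustration, and the module reflects whichever Unicode version the Python build ships with, not necessarily the latest release.

```python
import unicodedata


def describe(ch):
    """Return the code point, name, and General_Category of one character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "general_category": unicodedata.category(ch),
    }


print(describe("A"))
# {'code_point': 'U+0041', 'name': 'LATIN CAPITAL LETTER A', 'general_category': 'Lu'}
print(describe("\u0301"))
# {'code_point': 'U+0301', 'name': 'COMBINING ACUTE ACCENT', 'general_category': 'Mn'}
```

The General_Category values shown here (`Lu` for an uppercase letter, `Mn` for a nonspacing mark) are one example of the property data that text processing algorithms consult.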





Wednesday, November 5, 2025

Introducing the Unicode Inflection Library Technical Preview Release

The problem of linguistic inflection has long been a barrier to effective software internationalization. The problem is even more visible today with multimodal UIs. In many languages, word forms change (inflect) based on grammatical context, creating a significant challenge for developers aiming to build truly global applications. Getting the wrong word inflection can be as bad as using the wrong preposition in English.

Today, the Unicode Consortium is announcing a major step forward with the Technical Preview Release of the Unicode Inflection Library. It provides direct access through C and C++ APIs, or can be used in conjunction with Message Format 2.0 functionality.

This library is designed to solve a problem that is particularly acute in languages with a large number of inflectional forms, such as the Slavic, Germanic, Romance, Semitic, Indic and agglutinative families of languages.

The issue extends beyond common words like adjectives, nouns, and verbs. In many of these languages, proper nouns—including geo-location names, brands, and people’s names—can also inflect. This complexity affects a large number of users and has been largely unaddressed by the industry, which has typically opted for narrow, language-specific solutions. Even languages like French require handling inflection for gender and number, demonstrating the problem is not limited to a few specific language families.

The Unicode Inflection Library provides a robust and standardized approach to this challenge. It leverages extensive data sets to handle complex grammatical transformations, enabling more accurate text generation, search functionality, and natural language processing. A key resource for this project is the availability of comprehensive lexicons from the Wikidata project, which provide the foundational data necessary for these operations.

Get Started and Participate

This is a community effort. We invite developers and linguists to explore the library's capabilities and contribute to its development. A detailed tutorial is available to help you get started:

Tutorial: https://github.com/unicode-org/inflection/wiki/Tutorial

Release: https://github.com/unicode-org/inflection/releases/tag/Inflection-0.1

Your feedback and contributions are critical for refining the library's rules, expanding language coverage, and ensuring its performance. By participating, you will help build a foundational tool that will make the digital world more accessible and linguistically accurate for hundreds of millions of users.




Thursday, October 30, 2025

Unicode CLDR 48 available

Unicode CLDR 48 is now available and has been integrated into version 78 of ICU.


Some of the most significant changes in this release are the following (for more detail, see the CLDR 48 release note page):

  • Updated for Unicode 17, including new names and search terms for new emoji, new sort order, and Han→Latin romanization additions for many characters.

  • Updated to the latest external standards and data sources, such as the language subtag registry, UN M49 macro regions, ISO 4217 currencies, etc.

  • Many enhancements of the CLDR specification (LDML)

  • Many additions to language data including:

    • Likely Subtags, for deriving the likely script and region from the language (used in many processes)

  • New formatting options:

    • Rational number formats added, allowing for formats like “5½” in tech preview

    • For timezones, usesMetazone adds two new attributes, stdOffset and dstOffset, so that implementations can use either “main” or “rearguard” TZDB data

    • Combination formats added for relative dates + times, such as “tomorrow at 12:30”

    • Additional units added for scientific contexts (coulombs, farads, teslas, etc.) and for English systems (fortnights, imperial pints, etc.)

  • Many corrections and updates for Metazone data and calendar eras (including removal of eras and fixes to start dates)

  • This is the first release where the new CLDR Organization process is in place for DDL languages. As a result, several locales were able to reach higher levels (see below).

See the CLDR 48 release note page for information on accessing the data, reviewing charts of the changes, and — importantly — Migration issues.
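As a rough illustration of the Likely Subtags item in the list above, the sketch below fills in a missing script and region for a language tag. It is a simplified approximation under stated assumptions: the table is a tiny hand-written excerpt rather than the CLDR data file, the function names are hypothetical, and the lookup order is a reduced version of what the LDML specification defines.

```python
# Illustrative excerpt only: a few entries written out by hand, not loaded from CLDR.
# Keys are (language, script, region); None means "not specified".
LIKELY_SUBTAGS = {
    ("en", None, None): ("en", "Latn", "US"),
    ("ru", None, None): ("ru", "Cyrl", "RU"),
    ("zh", None, None): ("zh", "Hans", "CN"),
    ("zh", "Hant", None): ("zh", "Hant", "TW"),
    ("zh", None, "TW"): ("zh", "Hant", "TW"),
}


def parse_tag(tag):
    """Split a simple tag such as 'zh-TW' into (language, script, region)."""
    parts = tag.replace("_", "-").split("-")
    language, script, region = parts[0].lower(), None, None
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha():
            script = part.title()   # scripts are four letters, e.g. Hant
        elif len(part) == 2 and part.isalpha():
            region = part.upper()   # regions are two letters, e.g. TW
    return language, script, region


def add_likely_subtags(tag):
    """Fill in a missing script and/or region; return the tag unchanged if unknown."""
    language, script, region = parse_tag(tag)
    # Try the most specific key first, then progressively less specific ones.
    candidates = [
        (language, script, region),
        (language, script, None),
        (language, None, region),
        (language, None, None),
    ]
    for key in candidates:
        match = LIKELY_SUBTAGS.get(key)
        if match is not None:
            # Keep whatever the caller specified; fill only the gaps.
            return "-".join((language, script or match[1], region or match[2]))
    return tag


print(add_likely_subtags("en"))     # en-Latn-US
print(add_likely_subtags("zh-TW"))  # zh-Hant-TW
print(add_likely_subtags("en-GB"))  # en-Latn-GB
```

Note that the region can change the likely script: `zh` alone resolves to Simplified Chinese (`Hans`), while `zh-TW` resolves to Traditional Chinese (`Hant`). A production implementation would read the full likely-subtags data through a library such as ICU rather than a hand-written table.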


CLDR provides key building blocks for software to support the world's languages (dates, times, numbers, sort-order, etc.). All major browsers and modern mobile phones use CLDR for language support. (See Who uses CLDR?)


Via the Survey Tool, contributors supply data for their languages — data that is widely used to support much of the world’s software. This data is also a factor in determining which languages are supported on mobile phones and computer operating systems. 

Locale Coverage Levels

Level    | Count | With Script | Regional Variants | Usage
Modern   | 104   | 5           | 305               | Suitable for full UI internationalization
Moderate | 13    | 0           | 1                 | Suitable for “document content” internationalization, e.g. in spreadsheet
Basic    | 57    | 10          | 22                | Suitable for locale selection, e.g. choice of language on mobile phone

Changes in coverage

±  | New Level | Locales
📈 | Modern    | Akan, Bashkir, Chuvash, Kazakh (Arabic), Romansh, Shan, Quechua
📈 | Moderate  | Anii, Esperanto
📈 | Basic     | Buriat, Piedmontese, Sicilian, Tuvinian
📉 | Basic*    | Baluchi (Latin), Kurdish
