The standard as is right now is unfriendly / unusual for tech stacks that are "native utf-16"

The rule for content-char currently looks like this:

content-char = %x01-08        ; omit NULL (%x00), HTAB (%x09) and LF (%x0A)
             ...
             / %x3001-D7FF    ; omit surrogates
             / %xE000-10FFFF
             ...

Excluding surrogates is unusual for platforms that use UTF-16 natively, such as JavaScript, Java, and even the "wide version" of the Windows C APIs (which use wchar_t, a 16-bit type on Windows).

Such languages don't try to enforce UTF-16 correctness (the same way C/C++ don't try to enforce UTF-8 or any other kind of UTF correctness).

Any validation is done "at the edge", when data is ingested, if at all.
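As a rough illustration of what "at the edge" could mean in Java, here is a minimal sketch. The class name EdgeValidationSketch and the helper isWellFormedUtf16 are hypothetical names made up for this example, not part of any existing API. The point is only that the String itself holds unpaired surrogates without complaint, and a check happens only if some boundary code chooses to run one.

```java
import java.nio.charset.StandardCharsets;

public class EdgeValidationSketch {

  // Hypothetical helper: returns true when every surrogate code unit in s
  // belongs to a well-formed pair (high surrogate followed by low surrogate).
  static boolean isWellFormedUtf16(String s) {
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (Character.isHighSurrogate(c)) {
        // A high surrogate must be followed immediately by a low surrogate.
        if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
          i++; // skip the low half of the pair
        } else {
          return false;
        }
      } else if (Character.isLowSurrogate(c)) {
        // A low surrogate that was not consumed as part of a pair is unpaired.
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // In memory, the String keeps both unpaired surrogates untouched.
    String s = "\uda02 42 \udc02";
    System.out.println("code units: " + s.length());
    System.out.println("well-formed: " + isWellFormedUtf16(s));

    // One possible edge: converting to UTF-8 for I/O. String.getBytes is
    // documented to replace malformed input with the charset's default
    // replacement bytes instead of throwing.
    byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
    System.out.println("UTF-8 bytes: " + utf8.length);
  }
}
```

Whether an edge like this rejects, replaces, or ignores unpaired surrogates is a choice made by the code at that boundary. The language does not impose it on every string operation.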

Existing APIs that are similar to MessageFormat 2 work just fine with incorrect surrogate sequences.

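// Each pattern starts with an unpaired high surrogate (U+DA02)
// and ends with an unpaired low surrogate (U+DC02).
@Test
public void testBadSurrogates() {
  dumpHex(String.format("\uda02 %d \udc02", 42));
  dumpHex(java.text.MessageFormat.format("\uda02 {0} \udc02", 42));
  dumpHex(com.ibm.icu.text.MessageFormat.format("\uda02 {0} \udc02", 42));
}

// Prints each UTF-16 code unit of str as four hex digits.
private void dumpHex(String str) {
  str.chars().forEach(c -> System.out.printf(" %04X", c));
  System.out.println();
}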

The code above does not throw, and the result preserves the surrogates "as is".
The output looks like this:

 DA02 0020 0034 0032 0020 DC02
 DA02 0020 0034 0032 0020 DC02
 DA02 0020 0034 0032 0020 DC02

The current restriction also contradicts what was agreed in this thread:

RGN: Surrogate code points. Those are code points reserved for representing code points in UTF-16 that are beyond the first plane (BMP) of 2^16 code points.

MIH: I understand what you mean. But we also implement this in C and Java, and so on. So what should we do if we receive a message with invalid UTF-8 code points? Do we expect to replace them with the replacement character, or do we just pass them through?

RGN: I think what you're asking about, using JavaScript as a concrete example, is that a JS string is allowed to have unpaired surrogates. So the question is a question for the JS adapter / implementation, but that's not a question for the standard itself.

MIH: So we leave it to the implementation?

RGN: Yes.

MIH: Okay, that is fine with me.

https://github.com/unicode-org/message-format-wg/blob/main/meetings/2022/notes-2022-06-13.md
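To make the two options from that discussion concrete (pass through, or replace with the replacement character), here is a minimal sketch of how a Java implementation might offer either one. The class SurrogatePolicySketch and both method names are hypothetical and exist only for this example. Nothing here is taken from the spec or from ICU.

```java
public class SurrogatePolicySketch {

  // Option 1: pass through. The message text is returned unchanged,
  // unpaired surrogates included.
  static String passThrough(String s) {
    return s;
  }

  // Option 2: replace. Every unpaired surrogate becomes U+FFFD
  // (REPLACEMENT CHARACTER); well-formed pairs are kept as they are.
  static String replaceUnpaired(String s) {
    StringBuilder out = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (Character.isHighSurrogate(c)
          && i + 1 < s.length()
          && Character.isLowSurrogate(s.charAt(i + 1))) {
        // Valid pair: copy both halves and skip past the low surrogate.
        out.append(c).append(s.charAt(++i));
      } else if (Character.isSurrogate(c)) {
        // High surrogate without a partner, or a stray low surrogate.
        out.append('\uFFFD');
      } else {
        out.append(c);
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    String input = "\uda02 42 \udc02";
    // Print the code units produced by each policy.
    passThrough(input).chars().forEach(c -> System.out.printf(" %04X", c));
    System.out.println();
    replaceUnpaired(input).chars().forEach(c -> System.out.printf(" %04X", c));
    System.out.println();
  }
}
```

Either policy can live in the implementation. The grammar restriction quoted at the top would instead make the decision in the standard.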

The standard as is right now is unfriendly / unusual for tech stacks that are "native utf-16" #895
