The standard as is right now is unfriendly / unusual for tech stacks that are "native utf-16" · Issue #895 · unicode-org/message-format-wg
Closed
@mihnita

The rule for content-char currently looks like this:

content-char = %x01-08        ; omit NULL (%x00), HTAB (%x09) and LF (%x0A)
             ...
             / %x3001-D7FF    ; omit surrogates
             / %xE000-10FFFF
             ...

That is unusual for languages that use UTF-16 natively, like JavaScript and Java, and even for the "wide version" of the Windows C APIs (using wchar_t, which is 16 bits on Windows).

Such languages don't try to enforce UTF-16 correctness (the same way C/C++ don't try to enforce UTF-8 or any other kind of UTF correctness).

Any validation is done "at the edge", when data is ingested, if at all.
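As an illustration of validation "at the edge", here is a minimal Java sketch that rejects malformed input at the moment bytes are decoded into a String. The class and method names (EdgeValidation, decodeStrict, isWellFormedUtf8) are hypothetical and not part of any MessageFormat API; the sketch only assumes that incoming data arrives as UTF-8 bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: validate once, when data enters the program.
final class EdgeValidation {
    // Strictly decodes UTF-8 bytes. Malformed sequences throw instead of
    // being silently replaced with U+FFFD.
    static String decodeStrict(byte[] utf8) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .decode(ByteBuffer.wrap(utf8))
            .toString();
    }

    // Convenience wrapper that reports the outcome as a boolean.
    static boolean isWellFormedUtf8(byte[] bytes) {
        try {
            decodeStrict(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Once a String has been created, nothing in the language re-checks it, which is why downstream APIs see whatever code units they are handed.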

Existing APIs that are similar to MessageFormat 2 work just fine with incorrect surrogate sequences.

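// Each call formats the number 42 between an unpaired high surrogate (U+DA02)
// and an unpaired low surrogate (U+DC02).
@Test
public void testBadSurrogates() {
  dumpHex(String.format("\uda02 %d \udc02", 42));
  dumpHex(java.text.MessageFormat.format("\uda02 {0} \udc02", 42));
  dumpHex(com.ibm.icu.text.MessageFormat.format("\uda02 {0} \udc02", 42));
}

// Prints every UTF-16 code unit of the string as four hex digits.
private void dumpHex(String str) {
  str.chars().forEach(c -> System.out.printf(" %04X", c));
  System.out.println();
}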

The code above does not throw, and the result preserves the surrogates "as is".
The output looks like this:

 DA02 0020 0034 0032 0020 DC02
 DA02 0020 0034 0032 0020 DC02
 DA02 0020 0034 0032 0020 DC02
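To make concrete which strings the surrogate exclusion in content-char would affect, here is a small sketch that locates the first unpaired surrogate in a UTF-16 string. The class and method names are hypothetical; this is one possible check an implementation could run, not something the spec or ICU prescribes.

```java
// Hypothetical helper for spotting ill-formed UTF-16.
final class SurrogateCheck {
    // Returns the index of the first unpaired surrogate code unit,
    // or -1 if every surrogate belongs to a well-formed pair.
    static int firstLoneSurrogate(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // well-formed pair: skip the low half
                } else {
                    return i; // high surrogate with no low surrogate after it
                }
            } else if (Character.isLowSurrogate(c)) {
                return i; // low surrogate with no high surrogate before it
            }
        }
        return -1;
    }
}
```

For the test string above, "\uda02 42 \udc02", the check reports index 0, while a string containing only paired surrogates (for example an emoji) yields -1.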

The current restriction also contradicts what was agreed in this thread:

RGN: Surrogate code points. Those are code points reserved for representing code points in UTF-16 that are beyond the first plane (BMP) of 2^16 code points.

MIH: I understand what you mean. But we also implement this in C and Java, and so on. So what should we do if we receive a message with invalid UTF-8 code points? Do we expect to replace them with the replacement character, or do we just pass them through?

RGN: I think what you're asking about, using JavaScript as a concrete example, is that a JS string is allowed to have unpaired surrogates. So the question is a question for the JS adapter / implementation, but that's not a question for the standard itself.

MIH: So we leave it to the implementation?

RGN: Yes.

MIH: Okay, that is fine with me.

https://github.com/unicode-org/message-format-wg/blob/main/meetings/2022/notes-2022-06-13.md
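The two options raised in the notes (pass through, or substitute the replacement character) can be sketched as follows. Pass-through is simply doing nothing; the replacement option might look like the code below. The names are hypothetical, and which policy to apply is exactly the choice the notes leave to the implementation.

```java
// Hypothetical policy helper: one of the two behaviors an implementation could pick.
final class SurrogatePolicy {
    // Returns a copy of s in which every unpaired surrogate is replaced
    // with U+FFFD; well-formed surrogate pairs are kept unchanged.
    static String replaceLoneSurrogates(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                // Well-formed pair: copy both halves and skip the low one.
                out.append(c).append(s.charAt(++i));
            } else if (Character.isSurrogate(c)) {
                out.append('\uFFFD'); // unpaired surrogate
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

An implementation that prefers the behavior of the existing Java and ICU formatters would skip this step and return the string untouched.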

Labels

LDML46.1: MF2.0 Draft Candidate
Preview-Feedback: Feedback gathered during the technical preview
resolve-candidate: This issue appears to have been answered or resolved, and may be closed soon.
syntax: Issues related with syntax or ABNF
