Description
The rule for content-char currently looks like this:
content-char = %x01-08 ; omit NULL (%x00), HTAB (%x09) and LF (%x0A)
...
/ %x3001-D7FF ; omit surrogates
/ %xE000-10FFFF
...
Excluding surrogates is unusual for languages that use UTF-16 natively, like JavaScript, Java, and even the "wide version" of the Windows C APIs (using wchar_t, which is 16 bits on Windows).
Such languages don't try to enforce UTF-16 correctness (the same way C/C++ don't try to enforce UTF-8 or any other kind of UTF correctness).
Any validation is done "at the edge", when data is ingested, if at all.
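As a rough illustration of what "at the edge" validation could look like in Java, here is a minimal sketch. The class name EdgeValidation and the helper isWellFormedUtf16 are hypothetical, made up for this example; they are not part of the JDK, ICU, or any MessageFormat 2 implementation. An implementation that cares about unpaired surrogates could run a check like this once, when the message is ingested, rather than having the grammar reject them.

```java
public class EdgeValidation {
    // Hypothetical helper (not a JDK or ICU API): returns true only when
    // every surrogate code unit in s is part of a high+low pair.
    static boolean isWellFormedUtf16(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                // A high surrogate must be followed directly by a low surrogate.
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low half of the valid pair
                } else {
                    return false;
                }
            } else if (Character.isLowSurrogate(c)) {
                // A low surrogate with no preceding high surrogate is unpaired.
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Unpaired surrogates separated by other text.
        System.out.println(isWellFormedUtf16("\uda02 42 \udc02"));
        // A correctly paired supplementary character (U+1F600).
        System.out.println(isWellFormedUtf16("\ud83d\ude00 42"));
    }
}
```

What to do when the check fails (reject, replace with U+FFFD, or pass through) is left open here, since that is exactly the decision being argued as implementation-specific.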
Existing APIs that are similar to MessageFormat 2 work just fine with incorrect surrogate sequences.
@Test
public void testBadSurrogates() {
    dumpHex(String.format("\uda02 %d \udc02", 42));
    dumpHex(java.text.MessageFormat.format("\uda02 {0} \udc02", 42));
    dumpHex(com.ibm.icu.text.MessageFormat.format("\uda02 {0} \udc02", 42));
}

private void dumpHex(String str) {
    str.chars().forEach(c -> System.out.printf(" %04X", c));
    System.out.println();
}
The code above does not throw, and the result preserves the surrogates "as is".
The output looks like this:
DA02 0020 0034 0032 0020 DC02
DA02 0020 0034 0032 0020 DC02
DA02 0020 0034 0032 0020 DC02
The current restriction also contradicts what was agreed in this thread:
RGN: Surrogate code points. Those are code points reserved for representing code points in UTF-16 that are beyond the first plane (BMP) of 2^16 code points.
MIH: I understand what you mean. But we also implement this in C and Java, and so on. So what should we do if we receive a message with invalid UTF-8 code points? Do we expect to replace them with the replacement character, or do we just pass them through?
RGN: I think what you're asking about, using JavaScript as a concrete example, is that a JS string is allowed to have unpaired surrogates. So the question is a question for the JS adapter / implementation, but that's not a question for the standard itself.
MIH: So we leave it to the implementation?
RGN: Yes.
MIH: Okay, that is fine with me.
https://github.com/unicode-org/message-format-wg/blob/main/meetings/2022/notes-2022-06-13.md