-
Notifications
You must be signed in to change notification settings - Fork 7.6k
Implement Unicode escape parsing #3958
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
8000By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement Unicode escape parsing #3958
Conversation
@@ -270,9 +270,44 @@ Describe "ParserTests (admin\monad\tests\monad\src\engine\core\ParserTests.cs)" | |||
$result | should be ([char]0x1b) | |||
} | |||
|
|||
It "Test that escaping any character with no special meaning just returns that char. (line 602)" { | |||
$result = ExecuteCommand '"fo`obar"' | |||
$result | should be "foobar" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we remove the test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! Thanks. Was debugging utf8 encoding problems with the file and removed this one by accident.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Closed.
# These tests require the file to be saved with a BOM. Unfortunately when this UTF8 file is read by | ||
# PowerShell without a BOM, the file is incorrectly interpreted as ASCII. | ||
It 'Test that Unicode escape sequence `u2195 returns the ↕ character.' { | ||
$result = ExecuteCommand '"foo`u2195abc"' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see tests with single quotas 'foo`u2195abc' for expressions, variables, arguments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As in tests that verify the escape sequence is not converted to a Unicode char? Single quoted strings in PowerShell ignore esc sequences.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I don't see such tests at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. It is easy enough to add a single quoted string test. I probably should add here string tests as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should also test an escape in a command name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
@@ -1229,23 +1229,125 @@ private int SkipBlockComment(int i) | |||
return i; | |||
} | |||
|
|||
private static char Backtick(char c) | |||
private string Backtick(char c) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For discussion. Here we have to change both Backtick and calling code. Alternative is to return char "`u" from Backtick and then read a hex in ScanUnicodeEscapeSequence:
c = Backtick(c1);
if ( c == '\u' )
{
ScanUnicodeEscapeSequence;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not a bad idea. Thanks! The code at each of the six call sites would still need to add the result of calling ScanUnicodeEscapeSequence to the local sb/formatSb instances but the intent of that code would be clearer than the current string length > 1 checks. I'll try this change later tonight.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like this might be cleaner, especially if you pass in the sb instances to ScanUnicodeEscapeSequence
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One potential issue with the proposed change is that if someone where to add a new context in which backticks could be used for escaping characters, they can't just call Backtick()
. They have to also know to check the resulting char and if == '\u'
call the ScanUnicodeEscapeSequence()
method. And they might not know because every other escape sequence would work by just calling Backtick()
. Not that I'm opposed to this change but it's a potential downside. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with Backtick
returning a string, I think that's a reasonable change.
I'm less happy with the inconsistent handling at each call site - maybe it's necessary but it's worth discussing more.
It looks like the only reason to check the length is to support doubling {
and }
. Should that happen if someone wrote "`u7b"
instead of "{"
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the only reason to check the length is to support doubling { and }
Actually it is used to determine whether the resulting string is a Unicode esc sequence that corresponds to a surrogate pair. The thinking was that a surrogate pair should just be added to string builders and that you'd never need to "scan" the astral plane character as a token. BTW this is the only case where the string returned by Backtick
contains more than one character.
In the non-bracketed case e.g. "`u007b"
you will never get back a surrogate pair. "`u7b"
is not a valid seq - has to be 4 hex digits in this C#/Bash style Unicode esc seq.
FWIW I didn't feel great about the changes to the six Backtick
call sites. I'm going to try something to at least consolidate the duplicate code in a single method to stay copasetic with the DRY principle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I wasn't clear. I meant - the only reason to have two code paths.
If the Length > 1
code path was used unconditionally - how is that different than what you have today - that's what I meant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I did that because I was trying to change the original logic as little as possible. For instance Backtick
returning a char `n
or z
led to further checking of that character. IOW a call to Backtick was not followed by a call to continue
to the next loop iteration. If you think we can always proceed onto the next iteration without further checking the returned character, then yeah I can simplify that logic.
BTW here is what I'm thinking to eliminate the duplicate code:
private bool IsSurrogatePair(string str, ref char nonSurrogatePairChar, params StringBuilder[] addStrToStringBuilders)
{
Diagnostics.Assert(!string.IsNullOrEmpty(str), "Caller never calls us with an empty string");
Diagnostics.Assert((str.Length <= 2), "Caller never calls us with a string length greater than two");
Diagnostics.Assert((str.Length == 1) || Char.IsSurrogate(str, 0),
"Caller never calls us with a string length of two and not be a surrogate pair");
if (str.Length == 1)
{
nonSurrogatePairChar = str[0];
return false;
}
if (addStrToStringBuilders?.Length > 0)
{
foreach (var sb in addStrToStringBuilders)
{
sb.Append(str);
}
}
return true;
}
Not crazy about the method name so I'm open to ideas but this does clean up the various call sites e.g."
else if (c == '`')
{
// If end of input, go ahead and append the backtick and issue an error later
char c1 = PeekChar();
if (c1 != 0)
{
SkipChar();
string str = Backtick(c1);
if (IsSurrogatePair(str, ref c, sb, formatSb))
{
// Skip further scanning when str is surrogate pair and is outside the basic multilingual plane
continue;
}
}
}
<!-- | ||
Microsoft ResX Schema | ||
|
||
<!-- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please avoid combining reformatting and functional fixes in the same change. Suggestion: Submit a PR for the reformatting before the functional change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, this wasn't an intentional formatting change but just a consequence of me editing in VSCode with trimTrailingWhitespace = true
. I can fix it. Will submit an update later tonight.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW it might be a good idea to add either a .vscode\settings.json
where we can define "files.trimTrailingWhitespace": false
and other desired settings. Or we could cast a wider net and use a top-level .editorconfig
file with trim_trailing_whitespace = true
. In this case, I would add a .vscode\extensions.json
file with the EditorConfig extension listed. In fact, such a file should also recommend the C#, C++ and PowerShell extensions from Microsoft.
private string ScanUnicodeEscapeSequence() | ||
{ | ||
// Unicode replacement char, used to represent unknown, unrecognized or unrepresentable characters. | ||
var unknownChar = ((char)0xffdd).ToString(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a literal string is not possible, you should make this static so we create just one instance instead of creating an instance every time we scan `u
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point.
var sb = GetStringBuilder(); | ||
for (int i = 0; i < maxNumberOfHexDigits; i++) | ||
{ | ||
c = PeekChar(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can use GetChar
here - you're expecting more than one character, then just call UngetChar
when you went past what you consume.
@@ -1229,23 +1229,125 @@ private int SkipBlockComment(int i) | |||
return i; | |||
} | |||
|
|||
private static char Backtick(char c) | |||
private string Backtick(char c) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like this might be cleaner, especially if you pass in the sb instances to ScanUnicodeEscapeSequence
.
break; | ||
} | ||
|
||
if ((c == '\0') || !c.IsHexDigit()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this loop is more complicated than necessary. If you were only checking IsHexDigit
and then break on the first non-hex digit, does that simplify things by checking for the terminator in just one place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Earlier I had an error along the lines of IncompleteInputUnicodeYadaYada
but then got rid of it. So yeah, as long as IsHexDigit()
returns false
on \0
I can rip out the c == '\0'
test.
int unicodeValue; | ||
if (!int.TryParse(hexStr, NumberStyles.AllowHexSpecifier, NumberFormatInfo.InvariantInfo, out unicodeValue)) | ||
{ | ||
// Considering we validate hex characters, we should never get here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a fan of errors like this for impossible conditions. If int.Parse
fails, let that exception go, it might be more useful than this error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I was wondering about that. Will remove the corresponding ParserStrings entry.
# These tests require the file to be saved with a BOM. Unfortunately when this UTF8 file is read by | ||
# PowerShell without a BOM, the file is incorrectly interpreted as ASCII. | ||
It 'Test that Unicode escape sequence `u2195 returns the ↕ character.' { | ||
$result = ExecuteCommand '"foo`u2195abc"' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should also test an escape in a command name.
ShouldBeParseError '"`u219z"' InvalidUnicodeEscapeSequence 1 | ||
ShouldBeParseError '"`u{219z"' UnicodeEscapeSequenceMissingTerminator 7 # error offset "`u{219>>z<<" | ||
ShouldBeParseError '"`u{}"' InvalidBracketedUnicodeEscapeSequence 1 | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need more tests that are missing the closing quote, e.g. "`u
, "`u{
, etc.
I find it handy to type in samples interactively, PSReadline tends to hit these cases automatically as you type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I've noticed that handy little feature of PSReadline. Will add these additional tests. Do you have an opinion on whether I should test single quoted strings/here strings? In theory, none of this new code should be invoked for sq strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll let you decide what you'd like to do.
I'm certain we had such tests using one of our internal frameworks and I'm not too worried about regressions, but it'd be good to get some tests like that into this repo.
|
||
It "Test that a Unicode escape sequence can be used in a command name." { | ||
function xyzzy`u2195($p) {$p} | ||
$cmd = Get-Command xyzzy`u2195 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add "-ErrorAction SilentlyContinue"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
ShouldBeParseError '"`u{' UnicodeEscapeSequenceMissingTerminator,TerminatorExpectedAtEndOfString 4,0 | ||
ShouldBeParseError '"`u{1' UnicodeEscapeSequenceMissingTerminator,TerminatorExpectedAtEndOfString 5,0 | ||
ShouldBeParseError '"`u{123456' UnicodeEscapeSequenceMissingTerminator,TerminatorExpectedAtEndOfString 10,0 | ||
ShouldBeParseError '"`u{1234567' UnicodeEscapeSequenceMissingTerminator,TerminatorExpectedAtEndOfString 10,0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lzybkr Should we report the error posion at '{'?
@rkeithhill What about tests for absent '{' ? '"`u1234567}'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are three error spans:
- Invalid Unicode seq will span from the starting backtick to the last valid character
- Bracketed Unicode seq with missing
}
will span just the one col where it expected the}
. - Bracketed Unicode seq where 6 digit value is > 0x10FFFF will span just the "value" part of the sequence i.e. only the hex digits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An user can type '"`u123}' so we should write an useful error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your example generates the following two errors:
iex '"`u123}'
iex : At line:1 char:2
+ "`u123}
+ ~~~~~
The Unicode escape sequence is not valid. A valid sequence is of the form `uXXXX where X is a hex digit and there must
be four hex digits.
At line:1 char:1
+ "`u123}
+ ~~~~~~~
The string is missing the terminator: ".
At line:1 char:1
+ iex '"`u123}'
+ ~~~~~~~~~~~~~
+ CategoryInfo : ParserError: (:) [Invoke-Expression], ParseException
+ FullyQualifiedErrorId : InvalidUnicodeEscapeSequence,Microsoft.PowerShell.Commands.InvokeExpressionCommand
Effectively "`u1
tells the tokenizer that the Unicode esc seq is not using the bracketed syntax so if it encounters a non-hex digit before it scans four hex digits, it errors telling you that the esc seq isn't valid and tells you what the correct form is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I lost a double quota. My sample is '"`u123}"' .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first error message is exactly the same. There isn't the second error, because the dq string is now terminated:
PS> iex '"`u123}"'
iex : At line:1 char:2
+ "`u123}"
+ ~~~~~
The Unicode escape sequence is not valid. A valid sequence is of the form `uXXXX where X is a hex digit and there must
be four hex digits.
At line:1 char:1
+ iex '"`u123}"'
+ ~~~~~~~~~~~~~~
+ CategoryInfo : ParserError: (:) [Invoke-Expression], ParseException
+ FullyQualifiedErrorId : InvalidUnicodeEscapeSequence,Microsoft.PowerShell.Commands.InvokeExpressionCommand
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarify. The error looks good.
We can be more accurate with the message:
A valid sequence is of the form `uXXXX or `u{XXXX} where X is a hex digit and a maximum value is 0x10FFFF.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I'm definitely looking for help on the parser error messages. I like your idea of showing both forms but we need to indicate the bracketed form accepts 1 to 6 hex digits. How about this?
A valid sequence is of the form `u followed by 4 hex digits e.g. `u2195. Or `u followed by 1 to 6 hex digits
enclosed in curly braces for example `u{A9} `u{1F596}. The maximum allowed hex value is 0x10FFFF.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're not going to be able to pack all the documentation in one message. 😄
I believe we should just specify a common format (one sample) and a range of values. Additional information the user can read in the documentation.
I have addressed most of the issues. The last remaining issue is the proposed change to |
Added the breaking change label - it's similar to the break we did for |
So the Travis build failed because I assumed that on non-Windows systems, a here string used
|
A here string will use the newlines that appear in the file - it doesn't matter what environment you are in. |
@@ -2029,7 +2147,12 @@ private TokenFlags ScanStringExpandable(StringBuilder sb, StringBuilder formatSb | |||
if (c1 != 0) | |||
{ | |||
SkipChar(); | |||
c = Backtick(c1); | |||
string str = Backtick(c1); | |||
if (IsSurrogatePair(str, ref c)) //, sb, formatSb)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm an idiot. Will undo the temp comment (was testing for no sb passed into method).
The first of the two Travis builds passed. The second didn't even seem to really get started before it failed. |
Where does the |
You can find the help topics by starting on MSDN (e.g. here) and then click on |
Cool. Found the v6 version of the file but I won't make any changes unless/until the PR is accepted. |
@@ -1338,7 +1338,7 @@ private string ScanUnicodeEscapeSequence() | |||
} | |||
} | |||
|
|||
private bool IsSurrogatePair(string str, ref char nonSurrogatePairChar, params StringBuilder[] addStrToStringBuilders) | |||
private bool IsSurrogatePair(string str, ref char nonSurrogatePairChar, StringBuilder sb1 = null, StringBuilder sb2 = null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we want default parameters here?
} | ||
} | ||
|
||
private bool IsSurrogatePair(string str, ref char nonSurrogatePairChar, StringBuilder sb1 = null, StringBuilder sb2 = null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a thought - is this cleaner if you use overloads instead of default parameters? It would avoid the null tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we use overloads then presumably the version that takes one SB will delegate to the overload that takes two SBs (and has the implementation). And for that delegation, it would pass null for the second SB - at least that is how I've done overloads before. Are you thinking of a different way?
In general I would use overloads over default parameters for a public API but since this method isn't public ... seemed OK to use default parameters. And it does eliminate the params
GC impact.
BTW this method is perhaps a bit more general purpose than it needs to be since the only existing callers pass either one or two SBs. There are no places that call it without any SB. So we could Diag.Assert
that sb1 is never null and do away with the first sb1 != null
check - if you're comfortable with that. But we would still need to check sb2 != null
since sb2 would be null if we were called from the overload that takes just sb1.
@TravisEz13 Thanks for the link but that is for PS 3.0. I think I should only update the 6.0 version: Right? |
Sorry, something went wrong.
I realize you guys will probably want me to rebase this at some point. Do you think we are getting close on this? I feel like it is getting close, modulo error message tweaks. |
@@ -1,4 +1,4 @@ | |||
Describe "ParserTests (admin\monad\tests\monad\src\engine\core\ParserTests.cs)" -Tags "CI" { | |||
Describe "ParserTests (admin\monad\tests\monad\src\engine\core\ParserTests.cs)" -Tags "CI" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, I believe this line appears changed because I had to save this file with encoding UTF8 w/BOM otherwise the Unicode characters I use in this file don't get read back in properly when Pester runs.
@@ -509,7 +509,7 @@ internal class Tokenizer | |||
|
|||
// Unicode replacement char used to represent an unknown, unrecognized or unrepresentable character in a | |||
// Unicode escape sequence. | |||
private static readonly string s_unknownUnicodeChar = ((char)0xffdd).ToString(); | |||
private static readonly string s_unknownUnicodeChar = ((char)0xfffd).ToString(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo 0xffff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hehe, I wanted to correct the original typo for replacement/unknown character first. The correct value for that is 0xfffd
. Next up, I'm looking at switching to what C# uses - invalidChar which is Char.MaxValue (0xffff).
The last commit adopts the C# approach to Unicode sequence lexing and removes IsSurrogatePair() method. The C# approach is to have the esc sequence and Unicode esc lexing return a char but also have an out char surrogateChar (lowSurrogate). This perhaps makes the Backtick() call sites a bit clearer than the use of the IsSurrogateChar() method. Also change from using s_invalidChar as 0xfffd (replacement char typically used when "rendering" an invalid byte in the current byte stream e.g UTF-8 reading 0x66, 0xFC, 0x72 - the 0xFC is not valid in this context and an editor might use char 0xFFFD to indicate the character was not valid. The C# parser returns 0xFFFF (Char.MaxValue) when there is an error and it needs to return something for the return value. See what you think. I can always revert this change. To see the C# version, look here: https://github.com/dotnet/roslyn/blob/94ca919c6b4c94857dc5d58df4e6e10cb8d55c3c/src/Compilers/CSharp/Portable/Parser/Lexer_StringLiteral.cs#L98 |
27181df
to
a0707d3
Compare
…ir() method. The C# approach is to have the esc sequence and Unicode esc lexing return a char but also have an out char surrogateChar (lowSurrogate). This perhaps makes the Backtick() call sites a bit clearer than the use of the IsSurrogateChar() method. Also change from using s_invalidChar as 0xfffd (replacement char typically used when rendering an invalid byte in the current byte stream e.g UTF-8 reading 0x66, 0xFC, 0x72 - the 0xFC is not valid in this context and an editor might use char 0xFFFD to indicate the character was not understood. The C# parser returns 0xFFFF (Char.MaxValue) when there is an error and it needs to return something for the return value. See what you think. I can always revert this change.
a0707d3
to
04b8e45
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good - thanks for reviewing the Roslyn implementation, that was a great idea.
@lzybkr Could you please comment #3958 (comment) ? It seems like a full imitation to C# too old. |
…hat return early (before the call to GetStringAndRelease).
This PR is tied to the acceptance of the PR PowerShell/PowerShell#3958.
I would appreciate it if a few folks could review the docs I updated for this feature: https://github.com/PowerShell/PowerShell-Docs/pull/1407/files cc @mklement0 Thanks! |
This PR is tied to the acceptance of the PR PowerShell/PowerShell#3958.
Docs are merged, @lzybkr has signed off. The Committee has reviewed the PR. Thanks, @rkeithhill |
@TravisEz13 Thanks. I'd like @lzybkr to take a quick look at my very last commit (made after his approval). It is basically a three line change to make sure we call |
@rkeithhill - I have no preference on leading zeros - I'm fine either way on this one. |
What is conclusion about surrogate pairs? |
I'm not sure I have a conclusion on surrogate pairs - but I think I implied it's reasonable to allow them. |
Could you please re-read last @mklement0 comments above - we should continue and finish the discussion, and maybe documenting it for users. |
I did read that comment before replying and don't really have more to say right now - I think what was merged is fine, but I don't have a strong opinion. |
@iSazonov: I won't be saying anything substantially new over what @lzybkr has already said, but let me attempt a concise summary:
|
This PR is tied to the acceptance of the PR PowerShell/PowerShell#3958.
This addresses the last item in issue #2751. The implementation supports the following forms of Unicode escape sequences:
"`u2195"
evals to↕
"`u{4}"
evals to control char 0x4 (EOT)"`u{a9}"
evals to©
"`u{2195}"
evals to↕
"`u{1f44d}"
evals to👍
In addition, you can use Unicode escape sequences when naming a variable:
And you can specify a Unicode sequence as an argument:
Concerns
For the implementation I had to change the signature of
Backtick()
to return astring
instead ofchar
. This was necessary to support Unicode surrogate pairs like"`u{1f44d}"
which consists of two Unicode characters. This required updating six call sites ofBacktick()
. The primary approach I took was that if the string length was greater than 1, then we have a character from the Unicode astral plane and is not of further interest to the parser. So I put the string in the sb/formatSb StringBuilder andcontinue
the scanning loop. Otherwise, we have a single character, which I assign to the localc
(char) variable and allow processing in that loop iteration to continue.I had to change the encoding of the
Parser.Tests.ps1
file to UTF8 w/BOM. That's because Pester apparently reads the file with the default encoding and the Unicode characters I have in the file do not get recognized correctly. BTW I've noticed this before whereGet-Content
(without specify -Encoding UTF8) on a UTF8 file w/no BOM and with Unicode characters seems to use ASCII encoding and that is no bueno.For the bracket syntax, I'm only supporting up to 6 hex digits. The underlying API we use is
Char.ConvertFromUtf32()
and that supports values up to0x10FFFF
, which 6 hex digits is sufficient to specify. In fact with 6 digits you can specify invalid values (over 0x10FFFF) - the parser errors appropriately on those.I could use some help with wording on the ParserStrings error messages I added. Or at the very least, these should be reviewed. Sorry about the trailing whitespace removal (isn't everybody using VSCode with editor.trimTrailingWhitespace set to true by now?)
This will require a change to the EditorSyntax project to get it to properly recognize the bracketed form in a variable declaration.