8000 Implement Unicode escape parsing by rkeithhill · Pull Request #3958 · PowerShell/PowerShell · GitHub
[go: up one dir, main page]

Skip to content

Implement Unicode escape parsing #3958

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

8000

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

rkeithhill
Copy link
Collaborator
@rkeithhill rkeithhill commented Jun 7, 2017

This addresses the last item in issue #2751. The implementation supports the following forms of Unicode escape sequences:

"`u2195" evals to
"`u{4}" evals to control char 0x4 (EOT)
"`u{a9}" evals to ©
"`u{2195}" evals to
"`u{1f44d}" evals to 👍

In addition, you can use Unicode escape sequences when naming a variable:

${foo`u{2195}} = 42 # creates variable named: foo↕

And you can specify a Unicode sequence as an argument:

Write-Host `u{a9}` Acme` Inc # outputs: © Acme Inc

Concerns

  • For the implementation I had to change the signature of Backtick() to return a string instead of char. This was necessary to support Unicode surrogate pairs like "`u{1f44d}" which consists of two Unicode characters. This required updating six call sites of Backtick(). The primary approach I took was that if the string length was greater than 1, then we have a character from the Unicode astral plane and is not of further interest to the parser. So I put the string in the sb/formatSb StringBuilder and continue the scanning loop. Otherwise, we have a single character, which I assign to the local c (char) variable and allow processing in that loop iteration to continue.

  • I had to change the encoding of the Parser.Tests.ps1 file to UTF8 w/BOM. That's because Pester apparently reads the file with the default encoding and the Unicode characters I have in the file do not get recognized correctly. BTW I've noticed this before where Get-Content (without specify -Encoding UTF8) on a UTF8 file w/no BOM and with Unicode characters seems to use ASCII encoding and that is no bueno.

  • For the bracket syntax, I'm only supporting up to 6 hex digits. The underlying API we use is Char.ConvertFromUtf32() and that supports values up to 0x10FFFF, which 6 hex digits is sufficient to specify. In fact with 6 digits you can specify invalid values (over 0x10FFFF) - the parser errors appropriately on those.

  • I could use some help with wording on the ParserStrings error messages I added. Or at the very least, these should be reviewed. Sorry about the trailing whitespace removal (isn't everybody using VSCode with editor.trimTrailingWhitespace set to true by now?)

  • This will require a change to the EditorSyntax project to get it to properly recognize the bracketed form in a variable declaration.

@rkeithhill rkeithhill cha 10000 nged the title Implements Unicode escape parsing Implement Unicode escape parsing Jun 7, 2017
@@ -270,9 +270,44 @@ Describe "ParserTests (admin\monad\tests\monad\src\engine\core\ParserTests.cs)"
$result | should be ([char]0x1b)
}

It "Test that escaping any character with no special meaning just returns that char. (line 602)" {
$result = ExecuteCommand '"fo`obar"'
$result | should be "foobar"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we remove the test?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Thanks. Was debugging utf8 encoding problems with the file and removed this one by accident.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Closed.

# These tests require the file to be saved with a BOM. Unfortunately when this UTF8 file is read by
# PowerShell without a BOM, the file is incorrectly interpreted as ASCII.
It 'Test that Unicode escape sequence `u2195 returns the ↕ character.' {
$result = ExecuteCommand '"foo`u2195abc"'
Copy link
Collaborator
@iSazonov iSazonov Jun 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see tests with single quotas 'foo`u2195abc' for expressions, variables, arguments.

Copy link
Collaborator Author
@rkeithhill rkeithhill Jun 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in tests that verify the escape sequence is not converted to a Unicode char? Single quoted strings in PowerShell ignore esc sequences.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I don't see such tests at all.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. It is easy enough to add a single quoted string test. I probably should add here string tests as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should also test an escape in a command name.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

@@ -1229,23 +1229,125 @@ private int SkipBlockComment(int i)
return i;
}

private static char Backtick(char c)
private string Backtick(char c)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For discussion. Here we have to change both Backtick and calling code. Alternative is to return char "`u" from Backtick and then read a hex in ScanUnicodeEscapeSequence:

c = Backtick(c1);
if ( c == '\u' )
{
    ScanUnicodeEscapeSequence;
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not a bad idea. Thanks! The code at each of the six call sites would still need to add the result of calling ScanUnicodeEscapeSequence to the local sb/formatSb instances but the intent of that code would be clearer than the current string length > 1 checks. I'll try this change later tonight.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this might be cleaner, especially if you pass in the sb instances to ScanUnicodeEscapeSequence.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One potential issue with the proposed change is that if someone where to add a new context in which backticks could be used for escaping characters, they can't just call Backtick(). They have to also know to check the resulting char and if == '\u' call the ScanUnicodeEscapeSequence() method. And they might not know because every other escape sequence would work by just calling Backtick(). Not that I'm opposed to this change but it's a potential downside. Thoughts?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW '\u' does not compile in C#. It generates the following error:
image.
Now Backtick could default to just returning u and we could test for that in each of the 6 call sites but we're no longer testing for \u. This is starting to feel a bit less elegant.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with Backtick returning a string, I think that's a reasonable change.

I'm less happy with the inconsistent handling at each call site - maybe it's necessary but it's worth discussing more.

It looks like the only reason to check the length is to support doubling { and }. Should that happen if someone wrote "`u7b" instead of "{"?

Copy link
Collaborator Author
@rkeithhill rkeithhill Jun 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the only reason to check the length is to support doubling { and }

Actually it is used to determine whether the resulting string is a Unicode esc sequence that corresponds to a surrogate pair. The thinking was that a surrogate pair should just be added to string builders and that you'd never need to "scan" the astral plane character as a token. BTW this is the only case where the string returned by Backtick contains more than one character.

In the non-bracketed case e.g. "`u007b" you will never get back a surrogate pair. "`u7b" is not a valid seq - has to be 4 hex digits in this C#/Bash style Unicode esc seq.

FWIW I didn't feel great about the changes to the six Backtick call sites. I'm going to try something to at least consolidate the duplicate code in a single method to stay copasetic with the DRY principle.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I wasn't clear. I meant - the only reason to have two code paths.

If the Length > 1 code path was used unconditionally - how is that different than what you have today - that's what I meant.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I did that because I was trying to change the original logic as little as possible. For instance Backtick returning a char `n or z led to further checking of that character. IOW a call to Backtick was not followed by a call to continue to the next loop iteration. If you think we can always proceed onto the next iteration without further checking the returned character, then yeah I can simplify that logic.

BTW here is what I'm thinking to eliminate the duplicate code:

        private bool IsSurrogatePair(string str, ref char nonSurrogatePairChar, params StringBuilder[] addStrToStringBuilders) 
        {
            Diagnostics.Assert(!string.IsNullOrEmpty(str), "Caller never calls us with an empty string");
            Diagnostics.Assert((str.Length <= 2), "Caller never calls us with a string length greater than two");
            Diagnostics.Assert((str.Length == 1) || Char.IsSurrogate(str, 0), 
                "Caller never calls us with a string length of two and not be a surrogate pair");                
            
            if (str.Length == 1) 
            {
                nonSurrogatePairChar = str[0];
                return false;
            }

            if (addStrToStringBuilders?.Length > 0) 
            {
                foreach (var sb in addStrToStringBuilders)
                {
                    sb.Append(str);
                }
            }

            return true;
        }

Not crazy about the method name so I'm open to ideas but this does clean up the various call sites e.g."

                else if (c == '`')
                {
                    // If end of input, go ahead and append the backtick and issue an error later
                    char c1 = PeekChar();
                    if (c1 != 0)
                    {
                        SkipChar();
                        string str = Backtick(c1);
                        if (IsSurrogatePair(str, ref c, sb, formatSb))
                        {
                            // Skip further scanning when str is surrogate pair and is outside the basic multilingual plane
                            continue;
                        } 
                    }
                }

@TravisEz13 TravisEz13 added the Documentation Needed in this repo Documentation is needed in this repo label Jun 7, 2017
<!--
Microsoft ResX Schema

<!--
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please avoid combining reformatting and functional fixes in the same change. Suggestion: Submit a PR for the reformatting before the functional change.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, this wasn't an intentional formatting change but just a consequence of me editing in VSCode with trimTrailingWhitespace = true. I can fix it. Will submit an update later tonight.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW it might be a good idea to add either a .vscode\settings.json where we can define "files.trimTrailingWhitespace": false and other desired settings. Or we could cast a wider net and use a top-level .editorconfig file with trim_trailing_whitespace = true. In this case, I would add a .vscode\extensions.json file with the EditorConfig extension listed. In fact, such a file should also recommend the C#, C++ and PowerShell extensions from Microsoft.

@TravisEz13 TravisEz13 requested a review from lzybkr June 7, 2017 19:43
private string ScanUnicodeEscapeSequence()
{
// Unicode replacement char, used to represent unknown, unrecognized or unrepresentable characters.
var unknownChar = ((char)0xffdd).ToString();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a literal string is not possible, you should make this static so we create just one instance instead of creating an instance every time we scan `u.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.

var sb = GetStringBuilder();
for (int i = 0; i < maxNumberOfHexDigits; i++)
{
c = PeekChar();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can use GetChar here - you're expecting more than one character, then just call UngetChar when you went past what you consume.

@@ -1229,23 +1229,125 @@ private int SkipBlockComment(int i)
return i;
}

private static char Backtick(char c)
private string Backtick(char c)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this might be cleaner, especially if you pass in the sb instances to ScanUnicodeEscapeSequence.

break;
}

if ((c == '\0') || !c.IsHexDigit())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this loop is more complicated than necessary. If you were only checking IsHexDigit and then break on the first non-hex digit, does that simplify things by checking for the terminator in just one place?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Earlier I had an error along the lines of IncompleteInputUnicodeYadaYada but then got rid of it. So yeah, as long as IsHexDigit() returns false on \0 I can rip out the c == '\0' test.

int unicodeValue;
if (!int.TryParse(hexStr, NumberStyles.AllowHexSpecifier, NumberFormatInfo.InvariantInfo, out unicodeValue))
{
// Considering we validate hex characters, we should never get here.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of errors like this for impossible conditions. If int.Parse fails, let that exception go, it might be more useful than this error.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I was wondering about that. Will remove the corresponding ParserStrings entry.

# These tests require the file to be saved with a BOM. Unfortunately when this UTF8 file is read by
# PowerShell without a BOM, the file is incorrectly interpreted as ASCII.
It 'Test that Unicode escape sequence `u2195 returns the ↕ character.' {
$result = ExecuteCommand '"foo`u2195abc"'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should also test an escape in a command name.

ShouldBeParseError '"`u219z"' InvalidUnicodeEscapeSequence 1
ShouldBeParseError '"`u{219z"' UnicodeEscapeSequenceMissingTerminator 7 # error offset "`u{219>>z<<"
ShouldBeParseError '"`u{}"' InvalidBracketedUnicodeEscapeSequence 1
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need more tests that are missing the closing quote, e.g. "`u, "`u{, etc.
I find it handy to type in samples interactively, PSReadline tends to hit these cases automatically as you type.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I've noticed that handy little feature of PSReadline. Will add these additional tests. Do you have an opinion on whether I should test single quoted strings/here strings? In theory, none of this new code should be invoked for sq strings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll let you decide what you'd like to do.

I'm certain we had such tests using one of our internal frameworks and I'm not too worried about regressions, but it'd be good to get some tests like that into this repo.


It "Test that a Unicode escape sequence can be used in a command name." {
function xyzzy`u2195($p) {$p}
$cmd = Get-Command xyzzy`u2195
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add "-ErrorAction SilentlyContinue"

Copy link < 55E /clipboard-copy>
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

ShouldBeParseError '"`u{' UnicodeEscapeSequenceMissingTerminator,TerminatorExpectedAtEndOfString 4,0
ShouldBeParseError '"`u{1' UnicodeEscapeSequenceMissingTerminator,TerminatorExpectedAtEndOfString 5,0
ShouldBeParseError '"`u{123456' UnicodeEscapeSequenceMissingTerminator,TerminatorExpectedAtEndOfString 10,0
ShouldBeParseError '"`u{1234567' UnicodeEscapeSequenceMissingTerminator,TerminatorExpectedAtEndOfString 10,0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lzybkr Should we report the error posion at '{'?

@rkeithhill What about tests for absent '{' ? '"`u1234567}'

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are three error spans:

  1. Invalid Unicode seq will span from the starting backtick to the last valid character
  2. Bracketed Unicode seq with missing } will span just the one col where it expected the }.
  3. Bracketed Unicode seq where 6 digit value is > 0x10FFFF will span just the "value" part of the sequence i.e. only the hex digits.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An user can type '"`u123}' so we should write an useful error.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your example generates the following two errors:

iex '"`u123}'
iex : At line:1 char:2
+ "`u123}
+  ~~~~~
The Unicode escape sequence is not valid. A valid sequence is of the form `uXXXX where X is a hex digit and there must
be four hex digits.
At line:1 char:1
+ "`u123}
+ ~~~~~~~
The string is missing the terminator: ".
At line:1 char:1
+ iex '"`u123}'
+ ~~~~~~~~~~~~~
    + CategoryInfo          : ParserError: (:) [Invoke-Expression], ParseException
    + FullyQualifiedErrorId : InvalidUnicodeEscapeSequence,Microsoft.PowerShell.Commands.InvokeExpressionCommand

Effectively "`u1 tells the tokenizer that the Unicode esc seq is not using the bracketed syntax so if it encounters a non-hex digit before it scans four hex digits, it errors telling you that the esc seq isn't valid and tells you what the correct form is.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I lost a double quota. My sample is '"`u123}"' .

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first error message is exactly the same. There isn't the second error, because the dq string is now terminated:

PS> iex '"`u123}"'
iex : At line:1 char:2
+ "`u123}"
+  ~~~~~
The Unicode escape sequence is not valid. A valid sequence is of the form `uXXXX where X is a hex digit and there must
be four hex digits.
At line:1 char:1
+ iex '"`u123}"'
+ ~~~~~~~~~~~~~~
    + CategoryInfo          : ParserError: (:) [Invoke-Expression], ParseException
    + FullyQualifiedErrorId : InvalidUnicodeEscapeSequence,Microsoft.PowerShell.Commands.InvokeExpressionCommand

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarify. The error looks good.
We can be more accurate with the message:
A valid sequence is of the form `uXXXX or `u{XXXX} where X is a hex digit and a maximum value is 0x10FFFF.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I'm definitely looking for help on the parser error messages. I like your idea of showing both forms but we need to indicate the bracketed form accepts 1 to 6 hex digits. How about this?

A valid sequence is of the form `u followed by 4 hex digits e.g. `u2195. Or `u followed by 1 to 6 hex digits 
enclosed in curly braces for example `u{A9} `u{1F596}.  The maximum allowed hex value is 0x10FFFF.

Copy link
Collaborator
@iSazonov iSazonov Jun 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're not going to be able to pack all the documentation in one message. 😄

I believe we should just specify a common format (one sample) and a range of values. Additional information the user can read in the documentation.

@rkeithhill
Copy link
Collaborator Author

I have addressed most of the issues. The last remaining issue is the proposed change to Backtick() and where we call ScanUnicodeEscapeSequence() from.

@lzybkr lzybkr added the Breaking-Change breaking change that may affect users label Jun 8, 2017
@lzybkr
Copy link
Contributor
lzybkr commented Jun 8, 2017

Added the breaking change label - it's similar to the break we did for `e.

@PowerShell PowerShell deleted a comment from schittli Jun 8, 2017
@PowerShell PowerShell deleted a comment from schittli Jun 8, 2017
@PowerShell PowerShell deleted a comment from schittli Jun 8, 2017
@rkeithhill
Copy link
Collaborator Author
rkeithhill commented Jun 8, 2017

So the Travis build failed because I assumed that on non-Windows systems, a here string used [System.Environment]::NewLine to separate lines. Apparently that's not true which is somewhat surprising.

Expected: {\nfoo`u2195abc\n}
But was:  {\r\nfoo`u2195abc\r\n}

@lzybkr
Copy link
Contributor
lzybkr commented Jun 8, 2017

A here string will use the newlines that appear in the file - it doesn't matter what environment you are in.

@PowerShell PowerShell deleted a comment from schittli Jun 8, 2017
@@ -2029,7 +2147,12 @@ private TokenFlags ScanStringExpandable(StringBuilder sb, StringBuilder formatSb
if (c1 != 0)
{
SkipChar();
c = Backtick(c1);
string str = Backtick(c1);
if (IsSurrogatePair(str, ref c)) //, sb, formatSb))
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm an idiot. Will undo the temp comment (was testing for no sb passed into method).

@rkeithhill
Copy link
Collaborator Author

The first of the two Travis builds passed. The second didn't even seem to really get started before it failed.

@rkeithhill
Copy link
Collaborator Author
rkeithhill commented Jun 8, 2017

Where does the about_Escape_Characters.help.txt file live? It seems to get copied in during build but doesn't live in this repo AFAICT.

@lzybkr
Copy link
Contributor
lzybkr commented Jun 8, 2017

You can find the help topics by starting on MSDN (e.g. here) and then click on Contribute.

@rkeithhill
Copy link
Collaborator Author
E377 rkeithhill commented Jun 8, 2017

Cool. Found the v6 version of the file but I won't make any changes unless/until the PR is accepted.

@@ -1338,7 +1338,7 @@ private string ScanUnicodeEscapeSequence()
}
}

private bool IsSurrogatePair(string str, ref char nonSurrogatePairChar, params StringBuilder[] addStrToStringBuilders)
private bool IsSurrogatePair(string str, ref char nonSurrogatePairChar, StringBuilder sb1 = null, StringBuilder sb2 = null)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we want default parameters here?

}
}

private bool IsSurrogatePair(string str, ref char nonSurrogatePairChar, StringBuilder sb1 = null, StringBuilder sb2 = null)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a thought - is this cleaner if you use overloads instead of default parameters? It would avoid the null tests.

Copy link
Collaborator Author
@rkeithhill rkeithhill Jun 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use overloads then presumably the version that takes one SB will delegate to the overload that takes two SBs (and has the implementation). And for that delegation, it would pass null for the second SB - at least that is how I've done overloads before. Are you thinking of a different way?

In general I would use overloads over default parameters for a public API but since this method isn't public ... seemed OK to use default parameters. And it does eliminate the params GC impact.

BTW this method is perhaps a bit more general purpose than it needs to be since the only existing callers pass either one or two SBs. There are no places that call it without any SB. So we could Diag.Assert that sb1 is never null and do away with the first sb1 != null check - if you're comfortable with that. But we would still need to check sb2 != null since sb2 would be null if we were called from the overload that takes just sb1.

@rkeithhill
Copy link
Collaborator Author
rkeithhill commented Jun 9, 2017

@TravisEz13 Thanks for the link but that is for PS 3.0. I think I should only update the 6.0 version:
https://github.com/PowerShell/PowerShell-Docs/blob/bc94e78134be5d14369b6a9cd4c5ea2b61e1d37f/reference/6/About/about_Escape_Characters.md

Right?

@rkeithhill
Copy link
Collaborator Author

I realize you guys will probably want me to rebase this at some point. Do you think we are getting close on this? I feel like it is getting close, modulo error message tweaks.

@@ -1,4 +1,4 @@
Describe "ParserTests (admin\monad\tests\monad\src\engine\core\ParserTests.cs)" -Tags "CI" {
Describe "ParserTests (admin\monad\tests\monad\src\engine\core\ParserTests.cs)" -Tags "CI" {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, I believe this line appears changed because I had to save this file with encoding UTF8 w/BOM otherwise the Unicode characters I use in this file don't get read back in properly when Pester runs.

@TravisEz13
Copy link
Member

@lzybkr @iSazonov Can you review?

@@ -509,7 +509,7 @@ internal class Tokenizer

// Unicode replacement char used to represent an unknown, unrecognized or unrepresentable character in a
// Unicode escape sequence.
private static readonly string s_unknownUnicodeChar = ((char)0xffdd).ToString();
private static readonly string s_unknownUnicodeChar = ((char)0xfffd).ToString();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo 0xffff.

Copy link
Collaborator Author
@rkeithhill rkeithhill Jun 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hehe, I wanted to correct the original typo for replacement/unknown character first. The correct value for that is 0xfffd. Next up, I'm looking at switching to what C# uses - invalidChar which is Char.MaxValue (0xffff).

@rkeithhill
Copy link
Collaborator Author
rkeithhill commented Jun 25, 2017

The last commit adopts the C# approach to Unicode sequence lexing and removes IsSurrogatePair() method.

The C# approach is to have the esc sequence and Unicode esc lexing return a char but also have an out char surrogateChar (lowSurrogate).

This perhaps makes the Backtick() call sites a bit clearer than the use of the IsSurrogateChar() method.

Also change from using s_invalidChar as 0xfffd (replacement char typically used when "rendering" an invalid byte in the current byte stream e.g UTF-8 reading 0x66, 0xFC, 0x72 - the 0xFC is not valid in this context and an editor might use char 0xFFFD to indicate the character was not valid. The C# parser returns 0xFFFF (Char.MaxValue) when there is an error and it needs to return something for the return value.

See what you think. I can always revert this change.

To see the C# version, look here: https://github.com/dotnet/roslyn/blob/94ca919c6b4c94857dc5d58df4e6e10cb8d55c3c/src/Compilers/CSharp/Portable/Parser/Lexer_StringLiteral.cs#L98

@rkeithhill rkeithhill force-pushed the rkeithhill/impl-unicode-esc-seq branch from 27181df to a0707d3 Compare June 25, 2017 23:08
…ir() method.

The C# approach is to have the esc sequence and Unicode esc lexing return a char but also have an out char surrogateChar (lowSurrogate).

This perhaps makes the Backtick() call sites a bit clearer than the use of the IsSurrogateChar() method.

Also change from using s_invalidChar as 0xfffd (replacement char typically used when rendering an invalid byte in the current byte stream e.g UTF-8 reading 0x66, 0xFC, 0x72 - the 0xFC is not valid in this context and an editor might use char 0xFFFD to indicate the character was not understood.  The C# parser returns 0xFFFF (Char.MaxValue) when there is an error and it needs to return something for the return value.

See what you think. I can always revert this change.
@rkeithhill rkeithhill force-pushed the rkeithhill/impl-unicode-esc-seq branch from a0707d3 to 04b8e45 Compare June 25, 2017 23:11
Copy link
Contributor
@lzybkr lzybkr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good - thanks for reviewing the Roslyn implementation, that was a great idea.

@iSazonov
Copy link
Collaborator

@lzybkr Could you please comment #3958 (comment) ? It seems like a full imitation to C# too old.

…hat return early (before the call to GetStringAndRelease).
rkeithhill added a commit to rkeithhill/PowerShell-Docs that referenced this pull request Jun 27, 2017
This PR is tied to the acceptance of the PR PowerShell/PowerShell#3958.
@rkeithhill
Copy link
Collaborator Author
rkeithhill commented Jun 27, 2017

I would appreciate it if a few folks could review the docs I updated for this feature: https://github.com/PowerShell/PowerShell-Docs/pull/1407/files

cc @mklement0

Thanks!

juanpablojofre pushed a commit to MicrosoftDocs/PowerShell-Docs that referenced this pull request Jun 27, 2017
This PR is tied to the acceptance of the PR PowerShell/PowerShell#3958.
@TravisEz13
Copy link
Member

Docs are merged, @lzybkr has signed off. The Committee has reviewed the PR. Thanks, @rkeithhill
If there isn't any additional feedback, I'll merge this soon.

@rkeithhill
Copy link
Collaborator Author

@TravisEz13 Thanks.

I'd like @lzybkr to take a quick look at my very last commit (made after his approval). It is basically a three line change to make sure we call Release(sb) in the error conditions where we return early (before calling GetStringAndRelease(sb).

@lzybkr
Copy link
Contributor
lzybkr commented Jun 28, 2017

@rkeithhill - I have no preference on leading zeros - I'm fine either way on this one.

@iSazonov
Copy link
Collaborator

What is conclusion about surrogate pairs?

@lzybkr
Copy link
Contributor
lzybkr commented Jun 28, 2017

I'm not sure I have a conclusion on surrogate pairs - but I think I implied it's reasonable to allow them.

@iSazonov
Copy link
Collaborator

Could you please re-read last @mklement0 comments above - we should continue and finish the discussion, and maybe documenting it for users.

@lzybkr
Copy link
Contributor
lzybkr commented Jun 28, 2017

I did read that comment before replying and don't really have more to say right now - I think what was merged is fine, but I don't have a strong opinion.

@mklement0
Copy link
Contributor
mklement0 commented Jun 28, 2017

@iSazonov: I won't be saying anything substantially new over what @lzybkr has already said, but let me attempt a concise summary:

  • C# supports surrogate code points in its \uhhhh syntax - which is unfortunate, given that \Uhhhhhhhh allows you to express any actual character code point directly, but it is in line with C#/.NET strings being UTF-16 strings that operate on code points rather than characters.

  • The only real disadvantage of allowing surrogate escapes in PowerShell is the introduction of historical baggage, but even that is defensible, since we can't escape the reality of the underlying UTF-16 strings.

  • On the flip side, there are advantages that @lzybkr has already mentioned, notably reuse of existing hex codes (it would be nontrivial to determine the single character code point that is the equivalent of a given surrogate pair), the ability to test APIs with isolated surrogates, and last and least, ease of implementation.

juanpablojofre pushed a commit to MicrosoftDocs/PowerShell-Docs that referenced this pull request Jun 29, 2017
This PR is tied to the acceptance of the PR PowerShell/PowerShell#3958.
@joeyaiello joeyaiello removed Documentation Needed in this repo Documentation is needed in this repo labels Oct 15, 2018
@rkeithhill rkeithhill deleted the rkeithhill/impl-unicode-esc-seq branch May 29, 2019 15:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Breaking-Change breaking change that may affect users Committee-Reviewed PS-Committee has reviewed this and made a decision
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants
0