8000 Fix Default/OEM encoding behavior PowerShell Core by iSazonov · Pull Request #3467 · PowerShell/PowerShell · GitHub
[go: up one dir, main page]

Skip to content

Fix Default/OEM encoding behavior PowerShell Core #3467

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Apr 6, 2017

Conversation

iSazonov
Copy link
Collaborator
@iSazonov iSazonov commented Mar 31, 2017

Based on conclusion in discussion #3248

  • - OEM

We need internal fix because .Net Core haven't "OEM" term.
The fix should provide PowerShell Core behavior:

  • on Windows - as Windows PowerShell based on GetOEMCP() and legacy encodings (System.Text.Encoding.CodePages)

  • on Unix - the same as default encoding in PowerShell Core

  • - Default

We need internal fix because CoreFX default for all platforms is UTF8
Until we have completed Encoding RFC we should restore behaivor as in Windows PowerShell.
The fix should provide PowerShell Core behavior:

  • on Windows - as Windows PowerShell based on GetACP() and legacy encodings (System.Text.Encoding.CodePages)

  • on Unix - UTF-8 without BOM (We are expecting that the same will be in CoreFX)

  • - Tests

Now we don't have a full set of encoding tests (only for redirections). I believe we need to create them during future RFC implementation, not now.

@iSazonov iSazonov force-pushed the fixdefaultencoding branch from 36ab1e2 to 9580b4a Compare March 31, 2017 15:20
@iSazonov
Copy link
Collaborator Author

I was able to build locally with
<PackageReference Include="System.Text.Encodings.Web" Version="4.0.11" />
in Microsoft.PowerShell.SDK.csproj
but on CI build failed 😕 I revert to Version="4.3.0" but build failed again.
@daxian-dbw Could you please help?

@@ -7569,8 +7569,7 @@ private static Encoding GetEncodingFromEnum(FileSystemCmdletProviderEncoding typ

case FileSystemCmdletProviderEncoding.Oem:
{
uint oemCP = NativeMethods.GetOEMCP();
encoding = System.Text.Encoding.GetEncoding((int)oemCP);
encoding = ClrFacade.GetOEMEncoding();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NativeMethods.GetOEMCP in this file can be removed if it's no longer used.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is used in ClrFacade.GetOEMEncoding

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ClrFacade has its own copy of GetOEMEncoding, so the one in FileSystemProvider.cs can be removed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

#if CORECLR

#if UNUX // PowerShell Core on Unix
s_defaultEncoding = Encoding.GetEncoding(65001);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not using Encoding.UTF8? Isn't 65001 the same as utf8?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

#if UNUX // PowerShell Core on Unix
s_defaultEncoding = Encoding.GetEncoding(65001);
#else // PowerShell Core on Windows
uint aCp = NativeMethods.GetACP();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe rename the variable to currentAnsiCp? It's helpful to know what GetACP does.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// We will revisit this if it causes any failures when running tests on Core PS.
s_defaultEncoding = Encoding.GetEncoding(28591);
#else
#if CORECLR
Copy link
Member
@daxian-dbw daxian-dbw Mar 31, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The if/def pattern used in other places is

#if UNIX
// unix behavior
#elif CORECLR
// win-core behavior
#else
// win-full behavior
#endif

Could you please follow the same?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

uint aCp = NativeMethods.GetACP();
if (s_oemEncoding == null)
{
Encoding.RegisterProvider(System.Text.CodePagesEncodingProvider.Instance);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it better to separate this into a method to reduce code duplication?

@@ -17,6 +17,7 @@ internal static class PinvokeDllNames
internal const string QueryDosDeviceDllName = "api-ms-win-core-file-l1-1-0.dll"; /*1*/
internal const string CreateSymbolicLinkDllName = "api-ms-win-core-file-l2-1-0.dll"; /*2*/
internal const string GetOEMCPDllName = "api-ms-win-core-localization-l1-2-0.dll"; /*3*/
internal const string GetACPDllName = "api-ms-win-core-localization-l1-2-0.dll"; /*3*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The /*3*/ at the end is important to track if there is a mismatch. Please add this API to the end of the list and use an incremented number.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -139,6 +140,7 @@ internal static class PinvokeDllNames
internal const string QueryDosDeviceDllName = "kernel32.dll"; /*1*/
internal const string CreateSymbolicLinkDllName = "kernel32.dll"; /*2*/
internal const string GetOEMCPDllName = "kernel32.dll"; /*3*/
internal const string GetACPDllName = "kernel32.dll"; /*3*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -555,10 +564,20 @@ internal static Encoding GetOEMEncoding()
{
if (s_oemEncoding == null)
{
#if CORECLR // The OEM code page '437' is not supported by CoreCLR.
// Use the default encoding (ISO-8859-1, CodePage 28591) as the OEM encoding in OneCore powershell.
#if CORECLR
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here regarding the pattern.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

uint oemCp = NativeMethods.GetOEMCP();
if (s_defaultEncoding == null)
{
Encoding.RegisterProvider(System.Text.CodePagesEncodingProvider.Instance);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here regarding the code duplication.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@daxian-dbw daxian-dbw requested a review from mirichmo March 31, 2017 18:27
@daxian-dbw
Copy link
Member

Add @mirichmo to review the new windows API set added to PInvokeDllNames.cs, to see if that's fine for downlevel machines like win7.

@iSazonov I don't see you change the .csproj file. Did you miss that?

@daxian-dbw
Copy link
Member

<PackageReference Inclu A36C de="System.Text.Encodings.Web" Version="4.0.11" />

Why are you using System.Text.Encoding.Web? The CodePagesEncodingProvider type is available in System.Text.Encoding.CodePages

@@ -85,7 +85,7 @@
<PackageReference Include="System.ServiceModel.Primitives" Version="4.3.0" />
<PackageReference Include="System.ServiceModel.Security" Version="4.3.0" />
<PackageReference Include="System.ServiceProcess.ServiceController" Version="4.3.0" />
<PackageReference Include="System.Text.Encoding.CodePages" Version="4.3.0" />
<PackageReference Include="System.Text.Encoding.CodePages" Version="4.0.11" />
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change it to 4.0.11? I don't even find such a version on nuget.org

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With 4.0.11 I can compile locally on my computer. Now I changed to 4.0.1 but failed too.

EncodingRegisterProvider()

uint currentAnsiCp = NativeMethods.GetACP();
s_defaultEncoding = Encoding.GetEncoding((int)currentAnsiCp);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Encoding.Default is coming back in netstandard2.0, are we going to change to it at that point?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have RFC for that. No we only recover "Windows PowerShell" behavior.

@@ -530,17 +530,21 @@ internal static object[] GetCustomAttributes<T>(Assembly assembly)
#region Encoding

/// <summary>
/// Facade for Encoding.Default
/// Facade for getting Encoding.Default
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on what we will do when Encoding.Default is back in netstandard2.0. If we are not going to change to Encoding.Default in powershell core, then the comment should be Facade for getting default encoding.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree.


uint currentAnsiCp = NativeMethods.GetACP();
s_defaultEncoding = Encoding.GetEncoding((int)currentAnsiCp);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the extra blank line.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

uint oemCp = NativeMethods.GetOEMCP();
s_oemEncoding = Encoding.GetEncoding((int)oemCp);

#else // Windows PowerShell
uint oemCp = NativeMethods.GetOEMCP();
s_oemEncoding = Encoding.GetEncoding((int)oemCp);
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't notice the duplicate code between coreclr and fullclr. In this case, maybe the following is better?

#if UNIX
         s_oemEncoding = GetDefaultEncoding();
#else
#if CORECLR
         EncodingRegisterProvider()
#endif
         uint oemCp = NativeMethods.GetOEMCP();
         s_oemEncoding = Encoding.GetEncoding((int)oemCp);
#end

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried - it is difficult to read. In compile time no duplicate code is. So maybe leave as is?

uint oemCp = NativeMethods.GetOEMCP();
s_oemEncoding = Encoding.GetEncoding((int)oemCp);
#endif
}
return s_oemEncoding;
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove extra line, or move private static volatile Encoding s_defaultEncoding; here as well.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved.

private static volatile Encoding s_oemEncoding;

private static EncodingRegisterProvider()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method should be enclosed with #if CoreCLR, as it won't build for fullclr

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

{
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra line

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

{
if (s_defaultEncoding == null && s_oemEncoding == null)
{
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible this line will be called more than once when there are multiple runspace. Will it fail at the second call?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First I put the call in GetDefaultEncoding and GetOEMEncoding without "if" and second call was well.

@daxian-dbw daxian-dbw requested a review from JamesWTruher March 31, 2017 19:58
@iSazonov
Copy link
Collaborator Author

@daxian-dbw
I change CodePages version on 4.0.11 but CodePagesEncodingProvider cannot be finded. 😕

@daxian-dbw
Copy link
Member

@JamesWTruher @mklement0, you have been closely following up with this issue (and related), could you please take a look at the fix?

s_defaultEncoding = Encoding.GetEncoding(28591);
#else
#if UNUX // PowerShell Core on Unix
s_defaultEncoding = Encoding.UTF8;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a mistake to default to a UTF-8 encoding with BOM.
While .NET Core, at least currently, also reports the same BOM-based encoding - unofficially in Encoding.Default (not part of the contract yet) and officially in Encoding.GetDefaultEncoding(0) (and I consider that misguided too) -
at least it doesn't govern the actual default behavior in the sense of what happens if no encoding is specified. The methods of type System.IO.File default to BOM-less UTF-8 - both in .NET Framework since its inception and in .NET Core.
It think it's imperative that we use a BOM-less UTF-8 encoding in the Unix world (and, I would argue, even make that the default on Windows (Core) as well, with opt-in to legacy behavior).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@@ -85,7 +85,7 @@
<PackageReference Include="System.ServiceModel.Primitives" Version="4.3.0" />
<PackageReference Include="System.ServiceModel.Security" Version="4.3.0" />
<PackageReference Include="System.ServiceProcess.ServiceController" Version="4.3.0" />
<PackageReference Include="System.Text.Encoding.CodePages" Version="4.3.0" />
<PackageReference Include="System.Text.Encoding.CodePages" Version="4.0.1" />
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use 4.3.0? The type System.Text.CodePagesEncodingProvider is available in that package.
You need to update SMA.csproj to add this package reference since that's the project depending on it. Then you can remove this entry from Microsoft.PowerShell.SDK.csproj.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many thanks! It is solved the problem and PowerShell is builded well.

But currently Travis is failed on test phase with

Import-Module : Unable to load DLL 'api-ms-win-core-localization-l1-2-0.dll'

Could you comment please?

Copy link
Member
@daxian-dbw daxian-dbw Apr 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#if UNUX

As Steve pointed out, this is why 😄

@mklement0
Copy link
Contributor
mklement0 commented Mar 31, 2017

@daxian-dbw, @JamesWTruher : I know this debate is cumbersome, but I think it's a very important one to have, and I don't think we're ready to move ahead yet - I've just added further thoughts here.

// We will revisit this if it causes any failures when running tests on Core PS.
s_defaultEncoding = Encoding.GetEncoding(28591);
#else
#if UNUX // PowerShell Core on Unix
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be UNIX?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh... fixed.

@@ -135,6 +135,7 @@ internal static class PinvokeDllNames
internal const string CreateToolhelp32SnapshotDllName = "api-ms-win-core-toolhelp-l1-1-0"; /*120*/
internal const string Process32FirstDllName = "api-ms-win-core-toolhelp-l1-1-0"; /*121*/
internal const string Process32NextDllName = "api-ms-win-core-toolhelp-l1-1-0"; /*122*/
internal const string GetACPDllName = "api-ms-win-core-localization-l1-2-0.dll"; /*123*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API set matches the API on my laptop. I don't have access to a Nano VM at the moment to verify it there. On Nano, it must match the exporting API set exactly. You can verify it with dumpbin.exe /exports <path to api set>

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I don't run still dumpbin.exe /exports
MSDN docs say about:

What version we should use?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetACP is in api-ms-win-core-localization-l1-2-0.dll of runtime.win7-x64.Microsoft.NETCore.Windows.ApiSets dump.
But It seems Nano use runtime.win10-x64.Microsoft.NETCore.Windows.ApiSets

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is correct "as-is" here. No change needed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Clear.
Is Nano still under question?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. Nano Server contains the 1-2-0 version of the API set

@iSazonov iSazonov force-pushed the fixdefaultencoding branch from 3a8d68b to 0f04279 Compare April 1, 2017 09:48
@iSazonov
Copy link
Collaborator Author
iSazonov commented Apr 1, 2017

@daxian-dbw @SteveL-MSFT CI have passed - many thanks! Sorry for many typos (make the PR in the weekend night was bad my idea). Please continue code review.

@mklement0 Can you build the branch and check our expectations? If no, could you please attach files with your tests?

@mklement0
Copy link
Contributor

@iSazonov: I built on my macOS 10.12 machine and I can confirm that the following ad-hoc tests return true on macOS (and, judging by the source, therefore all Unix platforms), as desired:

# Setup: *create* a  BOM-less UTF-8 file.
'ö' | set-content -nonewline /tmp/$pid.txt

# Tests: Both should output $True

# Compare the raw bytes of the new file to the UTF-8 encoding of 'ö' (0xc3 0xb6)
# With the current alpha17, this would return $False, because Set-Content creates an
# ISO-8859-1 file.
$null -eq (Compare-Object (Get-Content -Encoding Byte -Raw /tmp/$pid.txt) (0xc3, 0xb6))

# See if the BOM-less UTF-8 file is *read* correctly.
'ö' -eq (Get-Content -Raw /tmp/$pid.txt)

(I wasn't able to run the tests - the powershell binary crashed at the following point:

Describing New-PSSession basic test
 [-] New-PSSession should not crash powershell 228ms
   Expected string length 66 but was 77. Strings differ at index 0.
   Expected: {InvalidOperation,Microsoft.PowerShell.Commands.NewPSSessionCommand}
   But was:  {System.DllNotFoundException,Microsoft.PowerShell.Commands.NewPSSessionCommand}
...

)

@iSazonov
Copy link
Collaborator Author
iSazonov commented Apr 1, 2017

@mklement0 Thanks for confirmation! It means the code is step in right direction!

All tests have passed on CI. I believe "New-PSSession basic test" too and you can ignore the error in context of the PR (or try discover a root of the error: copy-past test code in your session and see $error[0] - I think you don't install any package).

@iSazonov
Copy link
Collaborator Author
iSazonov commented Apr 2, 2017

I added comment about tests in first post.

@mklement0
Copy link
Contributor

@iSazonov:

Great idea, but I suggest, to avoid ambiguity, that you explicitly mention in your comment that this PR implements BOM-less UTF-8, especially given that your comment links to the .NET Core Encoding.Default property, which is currently UTF-8 with BOM (which, as stated, is not the actual default encoding (though, fortunately, it looks like getting Encoding.Default in line with that actual default is coming)).

To add some more fodder for upcoming tests, here's a simple one that tests whether PowerShell itself correctly reads BOM-less UTF-8 source code (such as scripts) by default in PS Core on Unix:

# Setup:
# Create a no-BOM UTF-8 file with the following literal content:
#   'ö' 
# which, when executed as a PS script, should echo 'ö' back, IF the script
# was correctly decoded from UTF-8 by PS.
# UTF-8 bytes: 0x27 (single quote), 0xc3 0xb6 (encoding of 'ö', U+00F6), 0x27 (single quote)
[byte[]] (0x27, 0xc3, 0xb6, 0x27) | Set-Content -Encoding Byte /tmp/$PID.ps1

# Test: See if the 'ö' is echoed back correctly.
#       Should return $True
'ö' -eq (& /tmp/$PID.ps1)

@iSazonov
Copy link
Collaborator Author
iSazonov commented Apr 2, 2017

@mklement0 I corrected first post about BOM-less, in the code it is obvious.

@daxian-dbw
Copy link
Member

@iSazonov are you going to add tests to this PR?

@iSazonov
Copy link
Collaborator Author
iSazonov commented Apr 4, 2017

I added in first post:

Now we don't have a full set of encoding tests (only for redirections). I believe we need to create them during future RFC implementation, not now.

I am confused. Partial test is meaningless. The full test also does not make sense until the approval of the Encoding RFC. What is compromise?

@daxian-dbw
Copy link
Member

@iSazonov the code changes look good to me. If the corresponding tests will come later, then we should keep that work item tracked in an issue. Is #3248 considered fixed with this PR? If not, the test work item should be tracked in the first post of that issue, similar to how you are tracking OEM, Default and tests in this PR.
And @mklement0 provides some good testing snippets here, we should also keep them in the same issue that tracks the test work item, so they can be easily referenced when it's time to complete the tests.

@iSazonov iSazonov mentioned this pull request Apr 5, 2017
6 tasks
@iSazonov
Copy link
Collaborator Author
iSazonov commented Apr 5, 2017

@daxian-dbw Tracking Issue #3488.

@daxian-dbw
Copy link
Member

@iSazonov Thank you! I will merge this PR later to 4D1F day.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants
0