8000 Unify file encoding when a cmdlet creates a file by JamesWTruher · Pull Request #4119 · PowerShell/PowerShell · GitHub
[go: up one dir, main page]

Skip to content

Unify file encoding when a cmdlet creates a file #4119

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

JamesWTruher
Copy link
Collaborator

Implementation of https://github.com/PowerShell/PowerShell-RFC/blob/master/1-Draft/RFC0020-DefaultFileEncoding.md

Cmdlets now create files with consistent encoding (UTF8 without BOM) on all platforms.
A new preference variable PSDefaultFileEncoding is now available to enable users to set encoding for cmdlets. By setting $PSDefaultFileEncoding = "WindowsLegacy" users can select the encoding which exists in PowerShell5 for each specific cmdlet.

Provider and cmdlet encoding methods have been centralized and are now common.

@JamesWTruher JamesWTruher added WG-Cmdlets general cmdlet issues WG-Cmdlets-Management cmdlets in the Microsoft.PowerShell.Management module Breaking-Change breaking change that may affect users labels Jun 27, 2017
/// </summary>
public static Encoding GetDefaultEncoding()
{
return new UTF8Encoding(false);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use static cache for UTF8Encoding(false); here and in Line 142 too - the cache is in Line 311.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point

@lzybkr
Copy link
Contributor
lzybkr commented Jun 29, 2017

Please fix the conflict and rebase.

8000
@JamesWTruher JamesWTruher force-pushed the jameswtruher/encoding3 branch 2 times, most recently from a013795 to 9f77970 Compare June 29, 2017 22:44
@iSazonov
Copy link
Collaborator

If the new test batch covers all the old encoding variants, it would be good to merge the tests before the PR - then we'll be sure to see that the PR has full backward compatibility.

{ "microsoft.powershell.commands.setcontentcommand", Encoding.ASCII },
// Providers are handled here
{ "microsoft.powershell.commands.filesystemprovider", Encoding.ASCII },

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra empty line

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


internal static Encoding GetWindowsLegacyEncoding(string name)
{
if ( legacyEncodingMap.ContainsKey(name))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use legacyEncodingMap.TryGetValue instead

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

{
Encoding resolvedEncoding = GetDefaultEncoding();
FileEncoding encodingPreference = GetEncodingPreference(provider.SessionSta 10000 te);
if ( encoding == FileEncoding.Unknown && encodingPreference != FileEncoding.Unknown )
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to care about WindowsLegacyEncoding here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes - because of the provider cmdlets which leave the encoding to the underlying provider as a dynamic parameter it needs to be handled via a different path than checking against the name of the cmdlet.

MasterStreamOpen(cmdlet, filePath, resolvedEncoding, defaultEncoding, Append, Force, NoClobber, out fileStream, out streamWriter, out readOnlyFileInfo, isLiteralPath);
}

/*
/// <summary>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete the commented block?

return new StreamReader(fileStream, PowerShellEncoding.GetEncoding(command, encoding));
}

/*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete commented block

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed - I missed that one

}
}

}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you check the code coverage with the newly added tests for this file? So we add tests now if we are missing any.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now that we can, i will :)

@@ -206,8 +207,7 @@ public SwitchParameter NoClobber
/// Encoding optional flag
/// </summary>
[Parameter()]
[ValidateSetAttribute(new string[] { "Unicode", "UTF7", "UTF8", "ASCII", "UTF32", "BigEndianUnicode", "Default", "OEM" })]
public string Encoding { get; set; }
public FileEncoding Encoding { get; set; } = FileEncoding.Unknown;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why aren't you using a ValidateSet with the FileEncoding values?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could use #3784 - dynamic validate set, don't expose FileEncoding and allow enhancement to use legacy charsets.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to use validateset here at all. Especially since this is so pervasive in so many areas, I get the same more easily with an enum

@@ -571,8 +571,7 @@ public SwitchParameter UseCulture
/// Encoding optional flag
/// </summary>
[Parameter()]
[ValidateSetAttribute(new[] { "Unicode", "UTF7", "UTF8", "ASCII", "UTF32", "BigEndianUnicode", "Default", "OEM" })]
public string Encoding { get; set; }
public FileEncoding Encoding { get; set; } = FileEncoding.Unknown;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, Why aren't you using a ValidateSet with the FileEncoding values

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

EncodingConversion.Default,
EncodingConversion.OEM })]
public string Encoding { get; set; }
public FileEncoding Encoding { get; set; } = FileEncoding.Unknown;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default should be FileEncoding.UTF8. See BeginProcessing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i want to be sure that I discover the encoding. If I set it to something that is likely to be provided, I won't be able to determine whether the user explicitly set it. By setting it to unknown, I'll calculate the appropriate value and I can tell that it wasn't set by the user

/// </summary>
public static Encoding GetProviderEncoding(CmdletProvider provider, FileEncoding encoding)
{
Encoding resolvedEncoding = GetDefaultEncoding();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest checking for encoding != FileEncoding.Unknown first; none of the other code paths are executed for that case.

{
using (BinaryReader reader = new BinaryReader(stream))
{
bytesRead = reader.Read(initialBytes, 0, 100);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use initialBytes.Length instead of 100.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

return FileEncoding.Default;
}

byte[] initialBytes = new byte[100];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you really need to read 100 bytes?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was part of a refactor where this code has been around for a long time, and I didn't really want to change it. I'm certainly open reducing the size, but it's not too much, not too little and any number here would be a guess anyway, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we look beyond the preamble - a page size seems fine because we'll read that much from disk anyway, so 100 is totally fine.

internal static readonly UTF8Encoding utf8NoBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);

// take a look at the file contents and guess at the best encoding
internal static FileEncoding GetEncoding(string path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest naming this GetFileEncoding for clarity.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

{
if (!File.Exists(path))
{
return FileEncoding.Default;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should probably throw; not return Default.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not necessarily - the file could be later created, so I need to pass back something - this code was refactored from another location, so i wanted to do as little as possible


if (bytesRead > 3)
{
preamble = String.Join("-", initialBytes[0], initialBytes[1], initialBytes[2], initialBytes[3]);
Copy link
Contributor
@dantraMSFT dantraMSFT Jul 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect the string building/lookups are overkill. Consider testing the bytes directly?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was code which was refactored from another location, I didn't really want to change it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do fix it, or open an issue to get it fixed. Testing bytes directly is much better - no extra allocations, no static initialization. It's simple code too - here is my PowerShell version: https://gist.github.com/lzybkr/f040f18d1fbfff9eaf3f4533e24126fe#file-fix_trailing_ws-ps1-L14 - note that my version handles UFT7 which this version does not, but mine misses the big endian encodings.

@@ -37,7 +38,7 @@ internal class FormatXmlWriter
StreamWriter streamWriter;
FileStream fileStream;
FileInfo fileInfo;
PathUtils.MasterStreamOpen(cmdlet, filepath, "ascii", true, false, force, noclobber,
PathUtils.MasterStreamOpen(cmdlet, filepath, PowerShellEncoding.GetEncoding(FileEncoding.Ascii), true, false, force, noclobber,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears to be the only external call to Encoding overload of PathUtils.MasterStreamOpen. Suggest using FileEncoding.Ascii here and merge the two MasterStreamOpen overloads.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yah - I was worried about removing the overload as it was a public interface and thought to do less harm, just in case someone else was using it somehow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MasterStreamOpen is internal.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll investigate

Copy link
Collaborator Author 1CF5

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the encoding overload is used by CSV cmdlets which do their own determination of the encoding (in the append case), and pass that along to the open. I'd rather not refactor all that code if I don't need to.

@@ -4824,6 +4825,7 @@ .ForwardHelpCategory Cmdlet
internal const ActionPreference defaultVerbosePreference = ActionPreference.SilentlyContinue;
internal const ActionPreference defaultWarningPreference = ActionPreference.Continue;
internal const ActionPreference defaultInformationPreference = ActionPreference.SilentlyContinue;
internal const Microsoft.PowerShell.FileEncoding defaultFileEncodingPreference = FileEncoding.Unknown;
Copy link
Contributor
@dantraMSFT dantraMSFT Jul 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can find no references to this field other than setting the session state variable. How is it used and how does it control behavior? Currently, it appears to be a NOP.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's used down in line 4915 where it essentially lays claim to the variable

@@ -8,6 +8,7 @@
using System.Management.Automation.Internal;
using System.Management.Automation.Security;
using System.Reflection;
using Microsoft.PowerShell;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this actually needed? The only other change to this file is the removal of the GetEncoding logic.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@@ -209,7 +210,7 @@ internal static Encoding GetOEMEncoding()
if (s_oemEncoding == null)
{
#if UNIX // PowerShell Core 10000 on Unix
s_oemEncoding = GetDefaultEncoding();
s_oemEncoding = PowerShellEncoding.GetDefaultEncoding();
#elif CORECLR // PowerShell Core on Windows
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the CORECLR branches could use commenting in both GetDefaultEncoding and GetOEMEncoding. The logic reasoning for GetACP and GetOEMCP is opaque.

Copy link
Collaborator
@iSazonov iSazonov Jul 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The discussion was in #3248 and fixed in #3467. Maybe it should be reviewed again. In short we returned behavior Windows PowerShell and exclude the breaking change.

Copy link
Collaborator Author
@JamesWTruher JamesWTruher Jul 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add comments about this, we can change it later after we have live-fire feedback.

/// <param name="streamWriter">Result2: <see cref="StreamWriter"/> (inherits from <see cref="TextWriter"/>) opened for writing</param>
/// <param name="readOnlyFileInfo">Result3: file info that should be used to restore file attributes after done with the file (<c>null</c> is this is not needed)</param>
/// <param name="isLiteralPath">True if wildcard expansion should be bypassed.</param>
internal static void MasterStreamOpen(
Copy link
Contributor
@dantraMSFT dantraMSFT Jul 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest merging this with the Encoding overload. There appears to be only 1 caller to the later overload and it does so by converting a FileEncoding to an Encoding to accomplish it.

@@ -490,7 +537,7 @@ internal static Encoding Convert(Cmdlet cmdlet, string encoding)
return System.Text.Encoding.UTF32;

if (string.Equals(encoding, Default, StringComparison.OrdinalIgnoreCase))
return ClrFacade.GetDefaultEncoding();
return PowerShellEncoding.GetDefaultEncoding();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest using a dictionary for these cascading 'if' tests.

// the parameter is not specifically set, so check the preference variable
encodingPreference = GetEncodingPreference(cmdlet.Context.SessionState);
// If set to unknown, we accept that it is unset
preferenceSetAndValid = encodingPreference != FileEncoding.Unknown;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dead code.

{
Encoding resolvedEncoding = GetDefaultEncoding();
FileEncoding encodingPreference = FileEncoding.Unknown;
bool preferenceSetAndValid = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dead code.

$encoder = [Microsoft.PowerShell.PowerShellEncoding]::GetEncoding($encoding)
$encoder.GetBytes([Environment]::NewLine) -Join "-"
}
function Test-GetEncoding
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is not referenced.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@SteveL-MSFT
Copy link
Member

@PowerShell/powershell-committee reviewed this and we should have an RFC specific to the new public apis (and generally any new public apis should be RFCs going forward)

/// <param name="path">The path to a file to inspect for an encoding</param>
/// <returns>System.Text.Encoding</returns>
/// </summary>
internal static FileEncoding GetFileEncodingFromFile(string path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this should be GetEncodingFromFile to keep the naming consistent

FileEncoding encodingPreference = FileEncoding.Unspecified;
try
{
// It doesn't matter if this fails or throws, we will return unknown in that case
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return Unspecified, not return unknown

/// <param name="encoding">The Encoding parameter value</param>
/// <returns>System.Text.Encoding</returns>
/// </summary>
public static Encoding GetEncoding(Cmdlet cmdlet, FileEncoding encoding)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this api would be better if we passed a SessionState instead of Cmdlet.

Create new class PowerShellEncoding and enum FileEncoding to unify cmdlet and provider code for file encoding.
Created PowerShellEncoding class and FileEncoding enum and removed ClrFacade.GetDefaultEncoding.
PSDefaultFileEncoding preference variable now can set file encoding across all cmdlets.
Setting PSDefaultFileEncoding to WindowsLegacy will set file encoding to historic PowerShell5 encodings.
some tests were failing on Windows because new line is different, calculate the bytes
in newline rather than hardcoding them
ClrFacade retains some of its functionality, but now relies on PowerShellEncoding class for
knowing what the default coding is. The encoding methods which calls native methods is retained.
the WindowsLegacy behavior for New-ModuleManifest should get the correct number of
bytes which will change depending on how many bytes are encoded for [Environment]::NewLine
update calls which create had been creating a new instance of the Utf8 encoding without BOM
to return the available static
only use hardcoded bytes when it's a custom file generation
(like Export-CliXml or New-ModuleManifest)
or when we're looking at a set of partial results
also remove an unused function
…ing files

improve code coverage for PowerShellEncoding class and don't duplicate byte representations when they're not needed
Remove a couple of extraneous using statements
Move encoding code from PathUtils.cs to Encoding.cs
Move and rename GetEncoding method in Utils.cs to GetFileEncodingFromFile in Encoding.cs
Expand some explanatory comments with more details
…oser to what is really happening

Update RedirectionOperator tests to compare bytes in a more sensible manner
Remove PSDefaultFileEncoding from special variable collection so they don't show up in script cmdlets
miscellaneous code clean up
made file encoding probe method internal
F438
{ "239-187-191", FileEncoding.UTF8BOM },
};

internal static char[] nonPrintableCharacters = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strictly speaking, 11 (vertical tab) and 12 (form feed) are also printable characters; given their rarity, I assume they were left out intentionally; if so, perhaps a comment would appease drive-by pedants like me.

if (s_oemEncoding == null)
{
#if UNIX // PowerShell Core on Unix
s_oemEncoding = GetDefaultEncoding();
#else // PowerShell Core on Windows
#else // PowerShell Core on Windows, which needs provider registration
EncodingRegisterProvider();

uint oemCp = NativeMethods.GetOEMCP();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this line is no longer needed.

// The OEM code pages are sometimes used by Win32 console applications, and
// on non-Windows platforms they still may have uses (if installed) and
// could be used if desired.
// On non-windows platforms, they have more limited uses, and probably won't
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sentence is a bit confusing. Doesn't the previous one say everything that needs to be said?

{ "microsoft.powershell.commands.setcontentcommand", Encoding.ASCII },
// Providers are handled here
{ "microsoft.powershell.commands.filesystemprovider", Encoding.ASCII },
};
Copy link
Contributor
@mklement0 mklement0 Aug 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, that doesn't appear to be correct.

As for the writing cmdlets:

Add-Content and Set-Content - despite what the documentation claims - have always used "ANSI" encoding, not ASCII - see MicrosoftDocs/PowerShell-Docs#1483.

What about Send-MailMessage? The help topics claim that ASCII encoding is the default.

New-Item -Type File -Value currently creates BOM-less(!) UTF-8.

As an aside: While Export-Csv without -Append indeed creates ASCII files, Export-Csv -Append currently blindly appends UTF-8 if the existing file's encoding is any of ASCII/UTF-8/"ANSI", but correctly matches UTF-16LE and UTF-16BE.


As for the reading cmdlets:

Get-Content currently defaults to "ANSI" in the absence of a BOM, as does Import-PowerShellDataFile.

Import-Csv currently actually assumes UTF-8 in the absence of a BOM, as do Import-CliXml and Select-String

@JamesWTruher
Copy link
Collaborator Author

i am going to take a different route for this change - closing this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Breaking-Change breaking change that may affect users Committee-Reviewed PS-Committee has reviewed this and made a decision WG-Cmdlets general cmdlet issues WG-Cmdlets-Management cmdlets in the Microsoft.PowerShell.Management module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants
0