-
Notifications
You must be signed in to change notification settings - Fork 7.6k
Unify file encoding when a cmdlet creates a file #4119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unify file encoding when a cmdlet creates a file #4119
Conversation
/// </summary> | ||
public static Encoding GetDefaultEncoding() | ||
{ | ||
return new UTF8Encoding(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can use static cache for UTF8Encoding(false);
here and in Line 142 too - the cache is in Line 311.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point
Please fix the conflict and rebase. |
a013795
to
9f77970
Compare
If the new test batch covers all the old encoding variants, it would be good to merge the tests before the PR - then we'll be sure to see that the PR has full backward compatibility. |
{ "microsoft.powershell.commands.setcontentcommand", Encoding.ASCII }, | ||
// Providers are handled here | ||
{ "microsoft.powershell.commands.filesystemprovider", Encoding.ASCII }, | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extra empty line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
|
||
internal static Encoding GetWindowsLegacyEncoding(string name) | ||
{ | ||
if ( legacyEncodingMap.ContainsKey(name)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use legacyEncodingMap.TryGetValue instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
{ | ||
Encoding resolvedEncoding = GetDefaultEncoding(); | ||
FileEncoding encodingPreference = GetEncodingPreference(provider.SessionSta 10000 te); | ||
if ( encoding == FileEncoding.Unknown && encodingPreference != FileEncoding.Unknown ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to care about WindowsLegacyEncoding here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes - because of the provider cmdlets which leave the encoding to the underlying provider as a dynamic parameter it needs to be handled via a different path than checking against the name of the cmdlet.
MasterStreamOpen(cmdlet, filePath, resolvedEncoding, defaultEncoding, Append, Force, NoClobber, out fileStream, out streamWriter, out readOnlyFileInfo, isLiteralPath); | ||
} | ||
|
||
/* | ||
/// <summary> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delete the commented block?
return new StreamReader(fileStream, PowerShellEncoding.GetEncoding(command, encoding)); | ||
} | ||
|
||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delete commented block
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed - I missed that one
} | ||
} | ||
|
||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you check the code coverage with the newly added tests for this file? So we add tests now if we are missing any.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now that we can, i will :)
@@ -206,8 +207,7 @@ public SwitchParameter NoClobber | |||
/// Encoding optional flag | |||
/// </summary> | |||
[Parameter()] | |||
[ValidateSetAttribute(new string[] { "Unicode", "UTF7", "UTF8", "ASCII", "UTF32", "BigEndianUnicode", "Default", "OEM" })] | |||
public string Encoding { get; set; } | |||
public FileEncoding Encoding { get; set; } = FileEncoding.Unknown; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why aren't you using a ValidateSet with the FileEncoding values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could use #3784 - dynamic validate set, don't expose FileEncoding
and allow enhancement to use legacy charsets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want to use validateset here at all. Especially since this is so pervasive in so many areas, I get the same more easily with an enum
@@ -571,8 +571,7 @@ public SwitchParameter UseCulture | |||
/// Encoding optional flag | |||
/// </summary> | |||
[Parameter()] | |||
[ValidateSetAttribute(new[] { "Unicode", "UTF7", "UTF8", "ASCII", "UTF32", "BigEndianUnicode", "Default", "OEM" })] | |||
public string Encoding { get; set; } | |||
public FileEncoding Encoding { get; set; } = FileEncoding.Unknown; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above, Why aren't you using a ValidateSet with the FileEncoding values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
EncodingConversion.Default, | ||
EncodingConversion.OEM })] | ||
public string Encoding { get; set; } | ||
public FileEncoding Encoding { get; set; } = FileEncoding.Unknown; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default should be FileEncoding.UTF8. See BeginProcessing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i want to be sure that I discover the encoding. If I set it to something that is likely to be provided, I won't be able to determine whether the user explicitly set it. By setting it to unknown, I'll calculate the appropriate value and I can tell that it wasn't set by the user
/// </summary> | ||
public static Encoding GetProviderEncoding(CmdletProvider provider, FileEncoding encoding) | ||
{ | ||
Encoding resolvedEncoding = GetDefaultEncoding(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest checking for encoding != FileEncoding.Unknown first; none of the other code paths are executed for that case.
{ | ||
using (BinaryReader reader = new BinaryReader(stream)) | ||
{ | ||
bytesRead = reader.Read(initialBytes, 0, 100); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use initialBytes.Length instead of 100.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
return FileEncoding.Default; | ||
} | ||
|
||
byte[] initialBytes = new byte[100]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you really need to read 100 bytes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was part of a refactor where this code has been around for a long time, and I didn't really want to change it. I'm certainly open reducing the size, but it's not too much, not too little and any number here would be a guess anyway, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we look beyond the preamble - a page size seems fine because we'll read that much from disk anyway, so 100 is totally fine.
internal static readonly UTF8Encoding utf8NoBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false); | ||
|
||
// take a look at the file contents and guess at the best encoding | ||
internal static FileEncoding GetEncoding(string path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest naming this GetFileEncoding for clarity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
{ | ||
if (!File.Exists(path)) | ||
{ | ||
return FileEncoding.Default; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should probably throw; not return Default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not necessarily - the file could be later created, so I need to pass back something - this code was refactored from another location, so i wanted to do as little as possible
|
||
if (bytesRead > 3) | ||
{ | ||
preamble = String.Join("-", initialBytes[0], initialBytes[1], initialBytes[2], initialBytes[3]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect the string building/lookups are overkill. Consider testing the bytes directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was code which was refactored from another location, I didn't really want to change it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please do fix it, or open an issue to get it fixed. Testing bytes directly is much better - no extra allocations, no static initialization. It's simple code too - here is my PowerShell version: https://gist.github.com/lzybkr/f040f18d1fbfff9eaf3f4533e24126fe#file-fix_trailing_ws-ps1-L14 - note that my version handles UFT7 which this version does not, but mine misses the big endian encodings.
@@ -37,7 +38,7 @@ internal class FormatXmlWriter | |||
StreamWriter streamWriter; | |||
FileStream fileStream; | |||
FileInfo fileInfo; | |||
PathUtils.MasterStreamOpen(cmdlet, filepath, "ascii", true, false, force, noclobber, | |||
PathUtils.MasterStreamOpen(cmdlet, filepath, PowerShellEncoding.GetEncoding(FileEncoding.Ascii), true, false, force, noclobber, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This appears to be the only external call to Encoding overload of PathUtils.MasterStreamOpen. Suggest using FileEncoding.Ascii here and merge the two MasterStreamOpen overloads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yah - I was worried about removing the overload as it was a public interface and thought to do less harm, just in case someone else was using it somehow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MasterStreamOpen is internal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll investigate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the encoding overload is used by CSV cmdlets which do their own determination of the encoding (in the append case), and pass that along to the open. I'd rather not refactor all that code if I don't need to.
@@ -4824,6 +4825,7 @@ .ForwardHelpCategory Cmdlet | |||
internal const ActionPreference defaultVerbosePreference = ActionPreference.SilentlyContinue; | |||
internal const ActionPreference defaultWarningPreference = ActionPreference.Continue; | |||
internal const ActionPreference defaultInformationPreference = ActionPreference.SilentlyContinue; | |||
internal const Microsoft.PowerShell.FileEncoding defaultFileEncodingPreference = FileEncoding.Unknown; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can find no references to this field other than setting the session state variable. How is it used and how does it control behavior? Currently, it appears to be a NOP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's used down in line 4915 where it essentially lays claim to the variable
@@ -8,6 +8,7 @@ | |||
using System.Management.Automation.Internal; | |||
using System.Management.Automation.Security; | |||
using System.Reflection; | |||
using Microsoft.PowerShell; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this actually needed? The only other change to this file is the removal of the GetEncoding logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
@@ -209,7 +210,7 @@ internal static Encoding GetOEMEncoding() | |||
if (s_oemEncoding == null) | |||
{ | |||
#if UNIX // PowerShell Core 10000 on Unix | |||
s_oemEncoding = GetDefaultEncoding(); | |||
s_oemEncoding = PowerShellEncoding.GetDefaultEncoding(); | |||
#elif CORECLR // PowerShell Core on Windows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the CORECLR branches could use commenting in both GetDefaultEncoding and GetOEMEncoding. The logic reasoning for GetACP and GetOEMCP is opaque.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add comments about this, we can change it later after we have live-fire feedback.
/// <param name="streamWriter">Result2: <see cref="StreamWriter"/> (inherits from <see cref="TextWriter"/>) opened for writing</param> | ||
/// <param name="readOnlyFileInfo">Result3: file info that should be used to restore file attributes after done with the file (<c>null</c> is this is not needed)</param> | ||
/// <param name="isLiteralPath">True if wildcard expansion should be bypassed.</param> | ||
internal static void MasterStreamOpen( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest merging this with the Encoding overload. There appears to be only 1 caller to the later overload and it does so by converting a FileEncoding to an Encoding to accomplish it.
@@ -490,7 +537,7 @@ internal static Encoding Convert(Cmdlet cmdlet, string encoding) | |||
return System.Text.Encoding.UTF32; | |||
|
|||
if (string.Equals(encoding, Default, StringComparison.OrdinalIgnoreCase)) | |||
return ClrFacade.GetDefaultEncoding(); | |||
return PowerShellEncoding.GetDefaultEncoding(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest using a dictionary for these cascading 'if' tests.
// the parameter is not specifically set, so check the preference variable | ||
encodingPreference = GetEncodingPreference(cmdlet.Context.SessionState); | ||
// If set to unknown, we accept that it is unset | ||
preferenceSetAndValid = encodingPreference != FileEncoding.Unknown; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dead code.
{ | ||
Encoding resolvedEncoding = GetDefaultEncoding(); | ||
FileEncoding encodingPreference = FileEncoding.Unknown; | ||
bool preferenceSetAndValid = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dead code.
$encoder = [Microsoft.PowerShell.PowerShellEncoding]::GetEncoding($encoding) | ||
$encoder.GetBytes([Environment]::NewLine) -Join "-" | ||
} | ||
function Test-GetEncoding |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is not referenced.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
@PowerShell/powershell-committee reviewed this and we should have an RFC specific to the new public apis (and generally any new public apis should be RFCs going forward) |
/// <param name="path">The path to a file to inspect for an encoding</param> | ||
/// <returns>System.Text.Encoding</returns> | ||
/// </summary> | ||
internal static FileEncoding GetFileEncodingFromFile(string path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like this should be GetEncodingFromFile
to keep the naming consistent
FileEncoding encodingPreference = FileEncoding.Unspecified; | ||
try | ||
{ | ||
// It doesn't matter if this fails or throws, we will return unknown in that case |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return Unspecified, not return unknown
/// <param name="encoding">The Encoding parameter value</param> | ||
/// <returns>System.Text.Encoding</returns> | ||
/// </summary> | ||
public static Encoding GetEncoding(Cmdlet cmdlet, FileEncoding encoding) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this api would be better if we passed a SessionState
instead of Cmdlet
.
Create new class PowerShellEncoding and enum FileEncoding to unify cmdlet and provider code for file encoding. Created PowerShellEncoding class and FileEncoding enum and removed ClrFacade.GetDefaultEncoding. PSDefaultFileEncoding preference variable now can set file encoding across all cmdlets. Setting PSDefaultFileEncoding to WindowsLegacy will set file encoding to historic PowerShell5 encodings.
some tests were failing on Windows because new line is different, calculate the bytes in newline rather than hardcoding them
ClrFacade retains some of its functionality, but now relies on PowerShellEncoding class for knowing what the default coding is. The encoding methods which calls native methods is retained.
the WindowsLegacy behavior for New-ModuleManifest should get the correct number of bytes which will change depending on how many bytes are encoded for [Environment]::NewLine
update calls which create had been creating a new instance of the Utf8 encoding without BOM to return the available static
only use hardcoded bytes when it's a custom file generation (like Export-CliXml or New-ModuleManifest) or when we're looking at a set of partial results also remove an unused function
…ing files improve code coverage for PowerShellEncoding class and don't duplicate byte representations when they're not needed
Remove a couple of extraneous using statements Move encoding code from PathUtils.cs to Encoding.cs Move and rename GetEncoding method in Utils.cs to GetFileEncodingFromFile in Encoding.cs Expand some explanatory comments with more details
…oser to what is really happening Update RedirectionOperator tests to compare bytes in a more sensible manner Remove PSDefaultFileEncoding from special variable collection so they don't show up in script cmdlets miscellaneous code clean up
made file encoding probe method internal
62ba583
to
ff24e22
Compare
{ "239-187-191", FileEncoding.UTF8BOM }, | ||
}; | ||
|
||
internal static char[] nonPrintableCharacters = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strictly speaking, 11
(vertical tab) and 12
(form feed) are also printable characters; given their rarity, I assume they were left out intentionally; if so, perhaps a comment would appease drive-by pedants like me.
if (s_oemEncoding == null) | ||
{ | ||
#if UNIX // PowerShell Core on Unix | ||
s_oemEncoding = GetDefaultEncoding(); | ||
#else // PowerShell Core on Windows | ||
#else // PowerShell Core on Windows, which needs provider registration | ||
EncodingRegisterProvider(); | ||
|
||
uint oemCp = NativeMethods.GetOEMCP(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this line is no longer needed.
// The OEM code pages are sometimes used by Win32 console applications, and | ||
// on non-Windows platforms they still may have uses (if installed) and | ||
// could be used if desired. | ||
// On non-windows platforms, they have more limited uses, and probably won't |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sentence is a bit confusing. Doesn't the previous one say everything that needs to be said?
{ "microsoft.powershell.commands.setcontentcommand", Encoding.ASCII }, | ||
// Providers are handled here | ||
{ "microsoft.powershell.commands.filesystemprovider", Encoding.ASCII }, | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, that doesn't appear to be correct.
As for the writing cmdlets:
Add-Content
and Set-Content
- despite what the documentation claims - have always used "ANSI" encoding, not ASCII - see MicrosoftDocs/PowerShell-Docs#1483.
What about Send-MailMessage
? The help topics claim that ASCII encoding is the default.
New-Item -Type File -Value
currently creates BOM-less(!) UTF-8.
As an aside: While Export-Csv
without -Append
indeed creates ASCII files, Export-Csv -Append
currently blindly appends UTF-8 if the existing file's encoding is any of ASCII/UTF-8/"ANSI", but correctly matches UTF-16LE and UTF-16BE.
As for the reading cmdlets:
Get-Content
currently defaults to "ANSI" in the absence of a BOM, as does Import-PowerShellDataFile
.
Import-Csv
currently actually assumes UTF-8 in the absence of a BOM, as do Import-CliXml
and Select-String
i am going to take a different route for this change - closing this |
Implementation of https://github.com/PowerShell/PowerShell-RFC/blob/master/1-Draft/RFC0020-DefaultFileEncoding.md
Cmdlets now create files with consistent encoding (UTF8 without BOM) on all platforms.
A new preference variable
PSDefaultFileEncoding
is now available to enable users to set encoding for cmdlets. By setting$PSDefaultFileEncoding = "WindowsLegacy"
users can select the encoding which exists in PowerShell5 for each specific cmdlet.Provider and cmdlet encoding methods have been centralized and are now common.