[System.Convert]::FromBase64String causes memory leak with large strings #21473
Hi. Given that all of your file and data operations are performed directly with the .NET API, I suggest this is not a PowerShell problem; it looks like a .NET issue. Is your issue that PowerShell is not running the garbage collector? Does running an explicit garbage collection make any difference? Also be aware that managed memory runtimes often take memory from the OS in order to allocate objects but never return it. So the objects may have been released/disposed/freed and the actual CLR heap is free, but the memory has not been returned to the OS. This is completely normal, as managed runtimes (CLR, JVM, etc.) assume that if they needed the memory once they are likely to need it again, so there is no point giving it back to the OS. You would need to look for tools that examine the state of the CLR heap within the process rather than external process-monitoring tools. When you have an operation that you know is memory intensive then two other options are available:
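A minimal sketch of that kind of check (forcing a collection and comparing the managed heap to the process working set; only standard .NET types and built-in cmdlets are used):

# Force a full garbage collection, then compare live managed-heap usage with the
# process working set; the working set often stays high even after the CLR has
# freed the objects, because the memory is not handed back to the OS.
[System.GC]::Collect()
[System.GC]::WaitForPendingFinalizers()
[System.GC]::Collect()
$heapMB       = [System.GC]::GetTotalMemory($true) / 1MB    # live managed objects
$workingSetMB = (Get-Process -Id $PID).WorkingSet64 / 1MB   # memory held from the OS
"Managed heap: {0:N0} MB, working set: {1:N0} MB" -f $heapMB, $workingSetMB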
I hope that helps. |
I cannot replicate the slowness you see, as 7.4.2 takes less than 5 seconds for me and is in fact faster than WinPS, but I do notice the large memory usage. My guess is that you are now storing not only the 2 byte arrays (the raw file data and the decoded base64 data) but also the base64 string itself, all allocated on the heap as part of the operation. Potentially WinPS/.NET Framework is more aggressive in reusing the array values, but as per the above, the CLR could be allocating the memory and just never freeing it so it can more efficiently reuse the memory in the future. Putting aside the above comment, you can more efficiently base64 encode bytes by streaming them rather than reading all the input bytes into memory:

Function ConvertTo-Base64String {
[OutputType([string])]
[CmdletBinding()]
param (
[Parameter(Mandatory)]
[string]$Path
)
$fs = $cryptoStream = $sr = $null
try {
$fs = [System.IO.File]::OpenRead($Path)
$cryptoStream = [System.Security.Cryptography.CryptoStream]::new(
$fs,
[System.Security.Cryptography.ToBase64Transform]::new(),
[System.Security.Cryptography.CryptoStreamMode]::Read)
$sr = [System.IO.StreamReader]::new($cryptoStream, [System.Text.Encoding]::ASCII)
$sr.ReadToEnd()
}
finally {
${sr}?.Dispose()
${cryptoStream}?.Dispose()
${fs}?.Dispose()
}
}

This will stream the raw bytes from the source file stream and produce the final output string. If you are storing this string into a file then you could optimize it further by streaming the output base64 CryptoStream to a file, avoiding having to store all the data in PowerShell. If you do need to store the base64 string as an object in PowerShell, keep in mind this means you not only have to store the inflated size that base64 uses (4 characters for every 3 input bytes) but also 2 bytes for every character, since .NET strings are UTF-16. |
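If the goal is file-to-file, a sketch of that streaming variant (hypothetical function name; the same standard CryptoStream/ToBase64Transform types as above, so the encoded text never has to be held in memory at once):

Function ConvertTo-Base64File {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory)]
        [string]$Path,

        [Parameter(Mandatory)]
        [string]$Destination
    )

    $source = $target = $cryptoStream = $null
    try {
        $source = [System.IO.File]::OpenRead($Path)
        $target = [System.IO.File]::Create($Destination)
        # Wrap the OUTPUT file in a write-mode CryptoStream so bytes are encoded
        # to base64 as they are copied through.
        $cryptoStream = [System.Security.Cryptography.CryptoStream]::new(
            $target,
            [System.Security.Cryptography.ToBase64Transform]::new(),
            [System.Security.Cryptography.CryptoStreamMode]::Write)
        $source.CopyTo($cryptoStream)
        $cryptoStream.FlushFinalBlock()
    }
    finally {
        ${cryptoStream}?.Dispose()
        ${target}?.Dispose()
        ${source}?.Dispose()
    }
}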
Thank you both for your responses and suggestions for optimization. I did reply to the issue cross-posted in the dotnet/runtime project, which you can see here: dotnet/runtime#101061 (comment). UPDATE: Confirmed below that the FromBase64String delay and excessive memory usage only occur through PowerShell 7, but not through a .NET console app. I am aware that storing a 200 MB file in memory as base64 text is wildly inefficient, and is not how my PowerPass module is intended to be used (which is how I discovered this in the first place), but since I stumbled upon this unexpected behavior I thought it prudent to at least report it. But again, I appreciate all of the comments and feedback here, especially the suggestions for optimization techniques. |
Yes, the same memory usage, but time to complete is about 1.2 seconds |
I retested the following updated script on my desktop PC, a Ryzen 5800X with 128 GB of RAM and PCIe Gen4 NVMe storage. The test ran much faster as expected, but the memory usage still remains high, even after invoking an explicit garbage collection. Reading the 222 MB file into memory takes 0.06 seconds and converting it to base64 takes 0.22 seconds, using 845 MB of RAM across both operations as expected. The last operation, converting the base64 string back to bytes, is where the unexpected time and memory usage appear. I'll cross-post this in the dotnet/runtime issue. Thank you all for the feedback. Updated test script:

$name = "random.bin"
$start = Get-Date
Write-Host "Creating Path to $name test file: " -NoNewline
$now = Get-Date
$file = Join-Path -Path $PSScriptRoot -ChildPath $name
Write-Host " $(((Get-Date) - $now).TotalMilliseconds) ms"
Write-Host "Reading all file bytes into memory: " -NoNewline
$now = Get-Date
$bytes = [System.IO.File]::ReadAllBytes( $file )
Write-Host " $(((Get-Date) - $now).TotalMilliseconds) ms"
Write-Host "Converting file bytes to base64 string: " -NoNewline
$now = Get-Date
$base64 = [System.Convert]::ToBase64String( $bytes )
Write-Host " $(((Get-Date) - $now).TotalMilliseconds) ms"
Write-Host "Converting base64 string back to file bytes: " -NoNewline
$now = Get-Date
$bytes = [System.Convert]::FromBase64String( $base64 )
Write-Host " $(((Get-Date) - $now).TotalMilliseconds) ms"
Write-Host "Test complete"
Write-Host "Total duration: $(((Get-Date) - $start).TotalMilliseconds) ms" |
I just tested this same implementation using a C# console application for the dotnet/runtime team and the issue does NOT occur when running in a console application against .NET 8 on the latest SDK on Windows 11 Professional. My test results and C# code are here: dotnet/runtime#101061 (comment) It seems that this is actually a memory leak in the PowerShell runtime for some reason. The dotnet/runtime crew was asking about the runtime in use. I'm assuming PowerShell 7.4.2 is using .NET 8.0 under the hood. Does it ship with its own .NET runtime or does it rely on the runtime installed on the system? |
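As far as I know, PowerShell 7 ships as a self-contained app with its own copy of the .NET runtime rather than relying on a machine-wide install; either way, a quick sketch for checking what a given session is actually running on (standard types and cmdlets only):

# Report the runtime behind the current pwsh session.
$PSVersionTable.PSVersion                                                   # e.g. 7.4.2
[System.Runtime.InteropServices.RuntimeInformation]::FrameworkDescription  # e.g. ".NET 8.0.x"
[object].Assembly.Location                                                  # where the runtime assemblies were loaded from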
For reference, the script used to generate the random test file:

param(
[int]
$Size
)
$blockSize = 256
$rand = [System.Random]::new()
$total = 0
[byte[]]$data = [System.Array]::CreateInstance( [byte], $blockSize )
$path = Join-Path -Path $PSScriptRoot -ChildPath "random.bin"
if( Test-Path $path ) {
Remove-Item -Path $path -Force
}
$file = [System.IO.File]::OpenWrite( $path )
while( $total -lt $Size ) {
$rand.NextBytes( $data )
$file.Write( $data, 0, $data.Length )
$total += $blockSize
}
$file.Flush()
$file.Close() |
With PowerShell 7.4.2 installed from Microsoft Store,
|
A couple more data points regarding the unexpected slowdown:
|
In |
I would like to remind everyone that earlier we observed slow pwsh operations on files due to antivirus. |
I am also seeing .NET 8.0.4 for the |
The slow operation is the call to FromBase64String. Also, this only happens when doing this in PowerShell 7.4.2. Running this same test in a C# console application takes under 1 second. |
I added a Program Setting for |
Putting the excess memory consumption aside: during the slow FromBase64String operation, is the process actually busy on the CPU? Can you attach an unmanaged-code debugger to the process during FromBase64String and get a stack trace of the thread with the most CPU time? (WinDbg, for example.) |
Here is a stack trace of the thread with the most CPU time. I used WinDbg and broke process execution in the middle of the FromBase64String operation. You can see some fun stuff at the top.
|
Oh, LogMemberInvocation calls ArgumentToString here: PowerShell/src/System.Management.Automation/engine/runtime/Operations/MiscOps.cs Line 3660 in 8ea1598
So does that mean the multi-megabyte base64 string goes via Anti-Malware Scan Interface to Windows Defender…? I guess that would be a sensible design. And then perhaps the Defender implementation of AMSI makes a few more copies of the string. This PowerShell code would apparently log the whole AMSI scan request to the console if you set Why does the AMSI scan take that long, though… does it do useful work all that time, or does it get stuck somehow and give up after a timeout? Perhaps you could try with files of different sizes, graph how the file size affects the FromBase64String duration. If the duration stays the same, then that suggests there is a timeout. |
I ran a test using variable size byte arrays with random payloads starting at 2 MiB in size and going up to 116 MiB in size. You can see that the duration required is linear, and also extremely slow. It takes 2 seconds to convert 32 MiB back to a byte array from a base64 string. The same test conducted at 16 MiB intervals up to 256 MiB also shows a linear trend. One final test at 32 MiB intervals up to 384 MiB shows a linear trend as well suggesting that there may be no upper boundary or timeout no matter how much data you ask PowerShell to convert from base64. |
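A sketch of the kind of sizing sweep used for those measurements (sizes, step, and output formatting are arbitrary choices):

# Measure how FromBase64String duration scales with the size of the input.
$rand = [System.Random]::new()
foreach ($mib in 2, 4, 8, 16, 32, 64) {
    $data = [byte[]]::new($mib * 1MB)
    $rand.NextBytes($data)
    $base64 = [System.Convert]::ToBase64String($data)
    $elapsed = Measure-Command { [System.Convert]::FromBase64String($base64) }
    '{0,4} MiB -> {1:N0} ms' -f $mib, $elapsed.TotalMilliseconds
}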
Or perhaps you might be completely horrified by the idea of deep packet inspection of all arguments and no knowledge of whether that will be sent to 3rd parties. ( or any other party at all, to be honest ) |
I hoped the graph might show a lower boundary, because it could indicate a configuration error that could then be fixed to speed up the operation; for example, if the AMSI code running in-process were unable to contact the Defender service and spent some constant amount of time attempting that. Alas, the linear graph doesn't look like that's the case. There may be ways to change the PowerShell script so that, even though it still triggers the suspicious content detector and causes an AMSI scan, the argument list being scanned does not include the base64 data and the scan finishes faster. But if such a workaround becomes commonly used, I suspect a future version of PowerShell will be changed to scan the data anyway. |
If I am understanding this nonsense with AMSI correctly, then a solution would be to perform the Base64 translation in a compiled C# cmdlet. Given we are talking PowerShell it should be implemented using a pipeline with |
AMSI logging of method invocations was added as an experimental feature in #16496 and changed to non-experimental in #18041. I'm not sure it even uses the suspicious content detector; perhaps the difference between ToBase64String and FromBase64String is that ArgumentToString does not format the elements of a byte[] argument for AMSI, but passes a string argument through. A slowdown was previously reported in #19431. |
I am not seeing similar times
Took 442.478 ms on a little Intel(R) Core(TM) i3-10100Y CPU @ 1.30GHz 1.61 GHz running Windows 11 Pro |
Where can I find documentation on what this actually does? I am personally horrified by the idea that anyone thinks they have the right to log data that was private to a process without its knowledge. When I say |
@rhubarb-geek-nz, your script uses ToBase64String, not FromBase64String.
The best may be the documentation of the PSAMSIMethodInvocationLogging experimental feature in this old version: https://github.com/MicrosoftDocs/PowerShell-Docs/blob/793ed5c687e6c7b64565d1751c532eb1d7d84209/reference/docs-conceptual/learn/experimental-features.md#psamsimethodinvocationlogging The "How AMSI helps" link in that documentation doesn't work on GitHub; use https://learn.microsoft.com/windows/win32/amsi/how-amsi-helps instead. AMSI doesn't necessarily involve telemetry that would send the data off the machine. I don't know whether Windows Defender has telemetry for AMSI scans.
|
Because ArgumentToString does not recognise the char[] type and returns only the type name, I think a [System.Convert]::FromBase64CharArray call should be much faster for AMSI to scan than [System.Convert]::FromBase64String. But who knows how long that will remain so. |
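A sketch of that workaround, assuming ArgumentToString keeps reporting char[] arguments by type name only (the cast avoids calling an instance method on the large string):

# Decode via FromBase64CharArray so the argument AMSI sees is a char[]
# (logged only as a type name) instead of the full multi-megabyte string.
$chars = [char[]]$base64
$bytes = [System.Convert]::FromBase64CharArray($chars, 0, $chars.Length)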
It depends on the context. If you mean a program that is found via the PATH then I might agree, but in general when you are managing large numbers of scripts to perform tasks then keeping the .sh extension is very useful. UNIX exec() does not care about file extensions for executables; the concept of file extensions does not exist within the POSIX C API. You are free to name executable files however you like. One major advantage of maintaining the .sh extension is when you manage the scripts in a source code repository and you are storing text, not a compiled binary: keeping the extension makes that absolutely obvious. It is a Windows-ism to step through extensions (com, bat, exe, cmd) while looking for commands on the path or in the local directory, and similarly PowerShell will try appending .ps1 when looking for a command. |
@rhubarb-geek-nz , we're getting far afield, but let me attempt a summary of the issue at hand first, which implies that there's likely nothing actionable here:
Returning to the tangent:
When it comes to naming a stand-alone executable, it seems to me that the end-user experience should be the driver, trumping any design-time / implementation considerations:
|
I'm half expecting you to make PowerShell recognise MethodInfo.Invoke calls and log each element of the arguments array. |
@KalleOlaviNiemitalo, fair point: Both of the aforementioned workarounds amount to bypassing the intended AMSI calls - I merely summarized them, speaking as someone who's neither a security expert nor speaking in any official capacity. |
Let's go back to the original problem.
Since the early days of computers we have been able to deal with files larger than the available memory of the computer. This is still the case. The first thing to realise is (a) PowerShell is not a UNIX shell and it is really really bad at dealing with streams of bytes. That is not a problem of the PowerShell engine itself, but the existing cmdlets, scripts, patterns and expectations. PowerShell deals with pipelines of typed objects, not text or byte streams. (b) UNIX does this kind of thing in its sleep, literally. A pipe is a byte stream first and foremost. Deciding to treat it as text is an afterthought. So if we were doing this in UNIX we would simply do
The file went through the memory as it was being processed and then out to the final file. Now let's do the same thing with PowerShell,
When you put that pipeline together it takes only about 50MB working set in order to process dotnet-sdk-8.0.204-win-x64.exe and write a copy of the output. Validate it and compare with the SHA512 from the original download site
So how does that work? Split-Content reads a file and writes arrays of 4096 bytes to the success pipeline. ConvertTo-Base64 reads the byte arrays and writes out lines of Base64 encoding of just 64 characters each. ConvertFrom-Base64 reads the strings and converts them back to byte arrays. Set-Content writes the byte arrays to the final file. It only took about 27MB to read, encode and decode the base64, without writing to a file.
So from 3.4GB to 27MB with no change to PowerShell itself is not a bad effort. It was a trade-off of space versus time. It takes about 7 seconds or so to run the read, encode and decode pipeline. |
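The pipeline being described presumably looks something like this (Split-Content, ConvertTo-Base64 and ConvertFrom-Base64 are the commenter's own cmdlets rather than built-ins, so exact names and parameters may differ; Set-Content -AsByteStream is standard in PowerShell 7):

# Streaming round-trip: 4 KB blocks in, 64-character base64 lines through the
# middle, byte arrays back out, with no step holding the whole file in memory.
Split-Content -Path .\dotnet-sdk-8.0.204-win-x64.exe -AsByteStream |
    ConvertTo-Base64 |
    ConvertFrom-Base64 |
    Set-Content -Path .\copy.exe -AsByteStream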
Yes, prior to PS 7.4 raw byte handling in pipelines wasn't supported, but in 7.4+ it now is, between external (native) programs, so the following works as intended from PowerShell (also on Windows, if you install openssl):

# OK in PS 7.4+
openssl base64 -in file.in | openssl base64 -d > file.out

I haven't looked into the implementation, but I assume (hope) that on Unix-like platforms the usual system-level data buffering applies, which is 64KB these days.

The - slow - solution is therefore (byte-by-byte processing on the PowerShell side):

Get-Content file.in -AsByteStream | openssl base64 | openssl base64 -d > file.out

The - much faster - solution, which, however, reads the input file in full, due to -Raw:

Get-Content file.in -Raw -AsByteStream | openssl base64 | openssl base64 -d > file.out

The - more memory-efficient - solution that emulates Unix pipeline buffering is:

Get-Content file.in -ReadCount 64kb -AsByteStream |
% { , [byte[]] $_ } |
openssl base64 | openssl base64 -d > file.out

Note the - unfortunate in terms of both verbosity and performance - need for an intermediate ForEach-Object (%) call that re-wraps each 64KB chunk as a byte[]. Arguably, Get-Content -AsByteStream with -ReadCount should emit byte[] chunks directly. This would obviate the need for the inefficient and awkward intermediate call. |
I did not have much success with Get-Content with ReadCount even in binary mode; I did not think of the array conversion in a ForEach-Object. Hence I wrote Split-Content, which reads directly into a byte array and puts that straight into the output pipeline. No need to convert any arrays. I am not convinced that large buffers like 64K help in the PowerShell pipeline, because it has to fill the entire 64KB first before it passes onto the pipeline. The buffering in UNIX works the other way round: things can keep writing until the pipe buffer is full, then they block until the reader has made some room. A UNIX pipeline has a record size of 1. The PowerShell pipeline above has a record size of 64K, so nothing can move until the record is full. In UNIX, if a network stream is slow then even a few hundred bytes at a time would still dribble through. It would certainly be better if Get-Content always wrote AsByteStream output as byte arrays, but I think it is too late to change that. |
Yes, it's an imperfect emulation of the native Unix pipeline, but with file input (where there's no "dribbling"), it works well. That said, it's rare for Unix-heritage utilities to accept input via stdin (the pipeline) only and not also via file-path operands; thus, with a file as the data source, passing the file's path as an argument to an external program is the simpler and better solution (such as in the openssl base64 -in example above).
Hopefully not: Let's see what becomes of the feature request you've since created: |
Thank you for showing me this technique. So what I understand is happening with the PowerShell pipeline is that the data flows through as a stream of small records rather than all at once. @rhubarb-geek-nz, do you have your cmdlet source on Github? |
Yes, they are on PSGallery and each entry has a Project link which takes you to github; likewise, the releases pages on github have a link to PSGallery. PSGallery: rhubarb-geek-nz.SplitContent/1.0.0; github: rhubarb-geek-nz/SplitContent |
Yes, but a new byte array is written to the output pipeline. So the same total amount of memory is allocated, just not all at the same time. |
Looking at lines 170, 172, and 174 in https://github.com/rhubarb-geek-nz/SplitContent/blob/main/SplitContent/SplitContent.cs you copy the read buffer into a new array for each block rather than reusing it. For example, let's say you:

$b64 = Split-Content -Path 'C:\temp\dotnet.exe' -AsByteStream | ConvertTo-Base64

In your converter, you pass the incoming byte arrays on; could the same buffer not be reused instead?
|
Good questions.
The consequence is the next stage would see the same repeated block of data rather than the next new set of data. The pipelines between cmdlets act like a queue and the cmdlets can work at different speeds. So the writer may add five records to the pipeline before the next stage starts consuming them. If you use the same buffer then the next stage sees the pattern repeated. The pipelines don't hold data, only object references. |
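A small illustration of that aliasing problem (nothing to do with base64; it just shows the reference semantics):

# If a writer reuses one buffer, every record it emitted ends up pointing at the
# same array, so later writes overwrite what earlier records appeared to contain.
function Write-SharedBuffer {
    $buffer = [byte[]]::new(1)
    foreach ($i in 1..3) {
        $buffer[0] = $i
        , $buffer            # emits the SAME array reference each time
    }
}
$records = Write-SharedBuffer
$records | ForEach-Object { $_[0] }   # 3 3 3 -- every record aliases the final write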
Interesting. And I understand why this is done: you can take advantage of parallel processing on multi-core CPUs to improve the performance of a pipeline of commands rather than executing them all in sequence. Does the |
Yes, PowerShell allows you to process cmdlets in parallel; how useful that is depends on the task and how independent the records are. In this case the sequencing of the encoding and decoding is important and the processed records have to be reassembled in the correct order, so for this case all being sequential makes the solution simpler but effectively single threaded. The design of the PSCmdlet is to keep them as simple as possible: ProcessRecord() is called when there is something to process, and it writes the results to the output pipeline. When the input pipeline is closed, EndProcessing() is called on the cmdlet and it is shut down. I would recommend keeping any scheduling or throttling outside of the business logic of the cmdlets and keeping them simple (the do-one-thing-well principle). |
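The same shape is available in script as an advanced function, where begin/process/end map onto BeginProcessing/ProcessRecord/EndProcessing; a minimal sketch (hypothetical function name):

# Pipeline-friendly skeleton: process runs once per incoming record,
# end runs once after the upstream side of the pipeline has closed.
function Invoke-PerRecord {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory, ValueFromPipeline)]
        [byte[]]$Data
    )
    begin   { $total = 0 }
    process { $total += $Data.Length; , $Data }       # pass each record straight through
    end     { Write-Verbose "Processed $total bytes" }
}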
I ran a memory profiler on the scenario:

$fileBytes = [io.file]::ReadAllBytes("C:\Users\staff\Downloads\dotnet-sdk-8.0.204-win-x64.exe");
$base64String =[System.Convert]::ToBase64String($fileBytes);
# $base64String is 620 Mb
# But as the image shows, the call to FromBase64String causes AMSI logging of method invocation
# which allocates 2.7 GB of memory to do the logging.
$roundTripBytes = [System.Convert]::FromBase64String($base64String)

The code responsible for getting the param values for the logging looks like this:

internal static void LogMemberInvocation(string targetName, string name, object[] args)
{
var contentName = "PowerShellMemberInvocation";
var argsBuilder = new Text.StringBuilder();
for (int i = 0; i < args.Length; i++)
{
string value = ArgumentToString(args[i]);
if (i > 0) { argsBuilder.Append(", "); }
argsBuilder.Append($"<{value}>");
}
string content = $"<{targetName}>.{name}({argsBuilder})";
var success = AmsiUtils.ReportContent(contentName, content);
}
private static string ArgumentToString(object arg)
{
object obj = PSObject.Base(arg);
if (obj == null)
return "null";
if (obj is string str)
return str; // perhaps limit to 256 bytes or so
Type type = obj.GetType();
return type.IsEnum || type.IsPrimitive || type == typeof (Guid) || type == typeof (Uri) || type == typeof (Version) || type == typeof (SemanticVersion) || type == typeof (BigInteger) || type == typeof (Decimal) ? obj.ToString() : type.FullName;
}

Maybe it would make sense to limit the size of string arguments that get passed to the AMSI logging. I tagged WG-Security to have them look at this to see if it's viable. |
@PaulHigin Can you advise here? This is starting to cause problems on CI farms where lots of JSON is being generated. In some cases, there have been several orders of magnitude slowdowns, ranging from seconds locally (on folders excluded from AV) to 20 minutes on the farms. |
I have asked until I am blue in the face. The attitude from this project is that it is by design, so it is your problem. The best solutions are (a) to use reflection to obscure your parameters; what makes this so silly is that it demonstrates how trivial it is to bypass the logging. You can also use a cmdlet so that your use of reflection is not logged. I wrote my own Base64 cmdlets, Base64 and Base64String. I do have concerns about the performance of PowerShell Azure Functions moving from 7.2 to 7.4 for this same reason. |
@powercode, note that @PaulHigin is retired now. |
I have done a build of 7.4.5 with the AMSI code removed. This now has similar performance to 7.2 |
@TravisEz13, @SydneyhSmith - Can you have a look? This is a performance issue with potential security implications. |
Perhaps #24853 resolves the issue.
Or perhaps not. That refers to casting the arguments as different types to match the method signature. In the case of FromBase64String, if you provide a string argument, it is still a string argument. If you use Invoke-Reflection you can avoid the logging.
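For anyone curious what the reflection route looks like with nothing but the standard .NET APIs, a sketch (it sidesteps the member-invocation logging as currently implemented, since the argument PowerShell sees is an object[]; no guarantee that remains true in future versions):

# Resolve the method once, then call it via MethodInfo.Invoke; the large base64
# string is hidden inside the object[] argument, which is logged only by type name.
$fromBase64 = [System.Convert].GetMethod('FromBase64String', [type[]]@([string]))
$bytes = $fromBase64.Invoke($null, @($base64))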
|
Are there plans to add AMSI logging to method invocation via reflection? I'm able to claw back some of the catastrophic performance loss that AMSI incurred using this trick, but I'm guessing it's only a matter of time before this gets caught up in AMSI's web too. |
Prerequisites
Steps to reproduce
This was tested on PowerShell 7.4.2
NOTE: If you test this with a newer version of the .NET 8.0 installer, you may have to modify the test script to pick the correct file for the test, since the filename is hard-coded on line 3.
The .NET 8.0 installer for Windows x64 is approximately 222 MB in size. Reading it into memory, converting it to base64, and then converting it back should require about 790 MB of RAM, assuming all variables remain in scope during the process and no garbage collection or object disposal happens. The observed behavior appears to be memory-leak related, as the amount of memory used once the conversion eventually completes is about 3.4 GB of RAM. These data points can be seen in the attached screenshots.
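Rough arithmetic behind that expectation (a sketch; sizes are approximate, and the test script reassigns $bytes, so the original byte[] becomes eligible for collection):

# Approximate live data if everything stays resident at once:
$fileMB    = 222
$rawMB     = $fileMB              # byte[] from ReadAllBytes
$charCount = $fileMB * 4 / 3      # base64 turns every 3 bytes into 4 characters
$stringMB  = $charCount * 2       # .NET strings are UTF-16, 2 bytes per character
$decodedMB = $fileMB              # byte[] from FromBase64String
$rawMB + $stringMB + $decodedMB   # ~1 GB upper bound, ~800 MB once the first byte[] is released;
                                  # either way, far below the ~3.4 GB actually observed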
Expected behavior
Actual behavior
In PowerShell 7.4.2, the time to complete is 82 seconds and memory used is 3.4 GB.
Error details
No response
Environment data
Visuals
Testing in PowerShell 7.4.2
Testing in PowerShell 5.1
PowerShell 7.4.2 Memory Usage
PowerShell 5.1 Memory Usage