WebCmdlets should read the encoding of Content-Type application/json per RFC #5530


Closed
Jaykul opened this issue Nov 22, 2017 · 15 comments · Fixed by #6109
Labels
Issue-Enhancement the issue is more of a feature request than a bug Resolution-Fixed The issue is fixed. WG-Cmdlets-Utility cmdlets in the Microsoft.PowerShell.Utility module

Comments

@Jaykul
Contributor
Jaykul commented Nov 22, 2017

Steps to reproduce

Invoke-RestMethod 'http://api.forismatic.com/api/1.0/?method=getQuote&format=json&lang=ru' -verbose

Expected behavior

It should detect the utf-8 encoding, and produce the same output as this:

# Workaround: grab the raw bytes and decode them as UTF-8 by hand
$resp = Invoke-WebRequest 'http://api.forismatic.com/api/1.0/?method=getQuote&format=json&lang=ru'
$bytes = $resp.RawContentStream.ToArray()
$str = [Text.Encoding]::UTF8.GetString($bytes)
ConvertFrom-Json $str

i.e. something like this:

VERBOSE: GET http://api.forismatic.com/api/1.0/?method=getQuote&format=json with 0-byte payload
VERBOSE: received 536-byte response of content type application/json
VERBOSE: Content encoding: utf-8

quoteText   : Именно внутренний диалог прижимает к земле людей в повседневной жизни. Мир для нас такой-то и такой-то или этакий и этакий лишь потому, что мы сами себе говорим о нем, что он такой-то и такой-то или этакий и этакий.
quoteAuthor : Карлос Кастанеда
senderName  :
senderLink  :
quoteLink   : http://forismatic.com/ru/6309006412/

Actual behavior

It falls back to iso-8859-1 encoding and produces gobbledygook with a lot of Ð's in it. Also, it utterly fails to produce a number for the #-byte response string in the verbose output.

VERBOSE: GET http://api.forismatic.com/api/1.0/?method=getQuote&format=json with 0-byte payload
VERBOSE: received -byte response of content type application/json
VERBOSE: Content encoding: iso-8859-1

quoteText   : �о в��ком� п�ибежи�� об�а�а���� л�ди, м��им�е ���а�ом: к го�ам
              и к ле�ам, к де�ев��м в �о�е, к г�обни�ам.
quoteAuthor : ��дда �а��ама
senderName  :
senderLink  :
quoteLink   : http://forismatic.com/ru/804c7d14d9/

Discussion

When calling an HTTP endpoint that returns the header Content-Type: application/json, the WebCmdlets incorrectly default to iso-8859-1 rather than a proper Unicode encoding, disregarding the application/json RFC's simple specification for how to determine the content encoding.

  1. The JSON standard ECMA-404 (PDF) clearly states that JSON must be Unicode
  2. The application/json RFC (in section 3) clearly indicates how the encoding should be determined from the first 4 bytes of the content (see the sketch below)
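
To make point 2 concrete, here is a minimal PowerShell sketch of the RFC 4627 section 3 detection (Get-JsonEncoding is a made-up name for this example, not actual cmdlet code). Since the first two characters of a JSON text are always ASCII, the pattern of NUL bytes in the first four bytes identifies the Unicode flavor:

function Get-JsonEncoding {
    param([byte[]]$Bytes)

    if ($Bytes.Count -ge 4) {
        # 00 00 00 xx => UTF-32BE;  xx 00 00 00 => UTF-32LE
        if ($Bytes[0] -eq 0 -and $Bytes[1] -eq 0 -and $Bytes[2] -eq 0) { return [Text.Encoding]::GetEncoding('utf-32BE') }
        if ($Bytes[1] -eq 0 -and $Bytes[2] -eq 0 -and $Bytes[3] -eq 0) { return [Text.Encoding]::UTF32 }
        # 00 xx 00 xx => UTF-16BE;  xx 00 xx 00 => UTF-16LE
        if ($Bytes[0] -eq 0 -and $Bytes[2] -eq 0) { return [Text.Encoding]::BigEndianUnicode }
        if ($Bytes[1] -eq 0 -and $Bytes[3] -eq 0) { return [Text.Encoding]::Unicode }
    }
    # no NULs in the first four bytes => UTF-8 (also the default)
    [Text.Encoding]::UTF8
}

# e.g. decode the repro response with the detected encoding:
# $enc = Get-JsonEncoding $resp.RawContentStream.ToArray()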

NOTE: Please don't work around this by just defaulting to utf-8. I'm sure that 90% of the time, you could probably get away with that, but it's not actually correct, and the RFC implementation is trivial.

ALSO NOTE: The WebCmdlets do respect the ; charset=utf-8 attribute if it's present on the content-type header -- which makes sense, but isn't technically standards compliant for an application/* content-type, as far as I can tell.
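
(For reference, pulling that charset attribute out of a Content-Type header is trivial with the stock .NET header types; a purely illustrative sketch:)

$ct = [System.Net.Http.Headers.MediaTypeHeaderValue]::Parse('application/json; charset=utf-8')
$ct.MediaType   # application/json
$ct.CharSet     # utf-8 (the attribute the cmdlets currently honor when present)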

To get started: ProcessResponse and TryGetEncoding

See also #5528, which was a specific instance of this problem. @lipkau was incorrectly convinced by early responders that the problem was in the web server, but it's actually in PowerShell's cmdlets. If you invoke the REST API against the Atlassian wiki, you can see the problem happening in the Verbose stream:

$r = IRM $url -Credential $mycred -Authentication basic -Verbose
VERBOSE: GET https://powershell.atlassian.net/wiki/rest/api/content/13009245?expand=space,version with 0-byte payload
VERBOSE: received -byte response of content type application/json
VERBOSE: Content encoding: iso-8859-1

The content is actually correctly utf-8 encoded (as you could tell from the positions of the nulls in the first 4 bytes), and iso-8859-1 is never a valid encoding for application/json, period.

PS C:\Program Files\PowerShell\6.0.0-rc> $PSVersionTable

Name                           Value
----                           -----
PSVersion                      6.0.0-rc
PSEdition                      Core
GitCommitId                    v6.0.0-rc
OS                             Microsoft Windows 10.0.15063
Platform                       Win32NT
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0...}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0
@Jaykul Jaykul changed the title WebCmdlets should treat Content-Type application/json as utf-8 by default WebCmdlets should read the encoding of Content-Type application/json per RFC Nov 22, 2017
@markekraus markekraus added WG-Cmdlets general cmdlet issues Issue-Enhancement the issue is more of a feature request than a bug WG-Cmdlets-Utility cmdlets in the Microsoft.PowerShell.Utility module and removed WG-Cmdlets general cmdlet issues labels Nov 22, 2017
@markekraus
Contributor
markekraus commented Nov 22, 2017

That seems reasonable. Tell me if this logic makes sense:

If the Content-Type is application/json and does not provide a charset, inspect the first 4 bytes of the content stream to determine the correct Unicode encoding, and if it cannot be determined, fall back to UTF-8.

1 is valid JSON and is only 1 byte in UTF-8 but 2 bytes in UTF-16, and it's painful logic to try to figure out whether 2 bytes are a 2-digit UTF-8 number or a single-digit UTF-16 one.

edit: thinking about it, it wouldn't be too hard. If the response is 3 bytes, it's UTF-8. If it's 2 bytes and the first byte is 00, it's UTF-16BE; if it's 2 bytes and the second byte is 00, it's UTF-16LE; and if neither byte is 00, it's UTF-8.
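
A sketch of that edit's logic for bodies shorter than four bytes (Get-ShortJsonEncoding is hypothetical, not actual cmdlet code):

function Get-ShortJsonEncoding {
    param([byte[]]$Bytes)

    if ($Bytes.Count -eq 2) {
        if ($Bytes[0] -eq 0) { return [Text.Encoding]::BigEndianUnicode }  # 00 xx => UTF-16BE
        if ($Bytes[1] -eq 0) { return [Text.Encoding]::Unicode }           # xx 00 => UTF-16LE
    }
    # 1 or 3 bytes, or two non-NUL bytes => UTF-8
    [Text.Encoding]::UTF8
}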

lipkau added a commit to lipkau/ConfluencePS that referenced this issue Nov 23, 2017
* PowerShell is not setting the encoding to UTF-8 for application/json
* See PowerShell/PowerShell#5530
@SteveL-MSFT SteveL-MSFT added this to the 6.1.0-Consider milestone Dec 12, 2017
@markekraus
Contributor

Hmm, I'm wondering what standard we should follow here. RFC 8259, released in December 2017, now enforces UTF-8 (no BOM) encoding for JSON. https://tools.ietf.org/html/rfc8259#section-8.1

JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8 [RFC3629].

Previous specifications of JSON have not required the use of UTF-8 when transmitting JSON text. However, the vast majority of JSON-based software implementations have chosen to use the UTF-8 encoding, to the extent that it is the only encoding that achieves interoperability.

@SteveL-MSFT Do we have any kind of guidance on what standards should be followed and how soon after acceptance they should be implemented? Currently RFC 8259 is designated "Internet Standard" and has obsoleted the standard @Jaykul cited (RFC 4627), but it conflicts with ECMA-404, which still allows any Unicode encoding and not exclusively UTF-8. I agree with the IETF that UTF-8 reigns supreme (I don't personally know a single JSON API that returns anything other than UTF-8).

It would seem to me the default encoding for application/json responses should be UTF-8. I'm torn on whether we should ignore or honor the character encoding provided by a remote endpoint. I think part of this could be solved if we enforced UTF-8 but allowed users to supply their own response encoding via a parameter. This would solve some of the other open issues where charset detection is not working because the remote endpoints are not compliant.

@SteveL-MSFT
Member

@markekraus the great thing about standards is that there's so many to choose from :)

I think we should default to UTF-8. If the response includes an encoding, we should use that. For edge cases, end users can use Invoke-WebRequest and handle the encoding themselves. I expect this to be rare until proven otherwise.

At this time, I wouldn't bother with providing a parameter until a need arises.

@markekraus
Contributor

@SteveL-MSFT Re: user-supplied encoding: this has come up several times. #3267 from @chuanjiao10 has a few examples; I could dig, but there have been other issues where this came up in the comments. And I suspect we might not see a great number of complaints, as it seems to be a common problem in Asia (language barrier and all that). Web browsers do a fair bit of magic to select the encoding. The least we could do is give the user the option to select an encoding. It's probably not a priority, but there does appear to be a need.

@SteveL-MSFT
Member

@markekraus if it's come up several times already, then we should probably support it.

@markekraus
Contributor

@chuanjiao10 Hopefully, whatever we land on for HTML parsing (AngleSharp) will support multiple character encodings in a single response. For now, we are concerned only with the outer Content-Type encoding of the response as a whole, as our current implementation only does basic HTML parsing. Could you please open a new issue for multiple response encodings in a single HTML document, and maybe include some examples? Thanks!

@iSazonov
Collaborator
iSazonov commented Feb 4, 2018

I agree that we should default to UTF-8; CoreFX uses it, and PowerShell Core does too in many places.

I believe we should get a PowerShell Committee conclusion to close the discussion.

@markekraus
Contributor

@iSazonov for this particular issue, we are only changing the default to UTF-8 for application/json responses, because the RFCs are clear on this one.

The standards are a bit less clear on HTML responses. Historically, when no charset is defined in the header, ISO-8859-1 was assumed, so if we changed the default for everything to UTF-8, legacy systems would potentially break. This is complicated by the fact that the default charset for HTML5 is now UTF-8: HTML5 relies heavily on the charset in the Content-Type HTTP response header, and the HTML rendering engine then uses the first <meta http-equiv="content-type"> or <meta charset> to switch encoding for the inner elements. Unless the HTTP response header includes an alternate charset, one should assume the charset is ISO-8859-1 to ensure backwards compatibility.
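
To make that fallback order concrete, a rough regex-based sketch (Get-HtmlFallbackEncoding is hypothetical, not the cmdlets' implementation, and no substitute for a real HTML parser):

function Get-HtmlFallbackEncoding {
    param([string]$HeaderCharset, [string]$Html)

    # 1. a charset on the Content-Type header wins
    if ($HeaderCharset) { return [Text.Encoding]::GetEncoding($HeaderCharset) }
    # 2. then an HTML5 <meta charset> or HTML4 <meta http-equiv="content-type"> hint
    if ($Html -match '<meta\s+charset\s*=\s*["'']?([\w-]+)') { return [Text.Encoding]::GetEncoding($Matches[1]) }
    if ($Html -match 'http-equiv\s*=\s*["'']?content-type["'']?[^>]*charset\s*=\s*([\w-]+)') { return [Text.Encoding]::GetEncoding($Matches[1]) }
    # 3. otherwise assume iso-8859-1 for backwards compatibility
    [Text.Encoding]::GetEncoding('iso-8859-1')
}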

Ultimately, we will never have perfect encoding in this project because we do not aspire to be a full-parity web browser. For HTML, my hope is that when we move to AngleSharp for HTML parsing, it can handle the terrifying task of navigating all the charset complexities for us.

@markekraus
Contributor

@chuanjiao10 I understand the CJK user experience is currently bad. I have a personal interest in addressing that (for the limited amount of 日本語 I deal with). The problem is that Invoke-WebRequest, under the current plans, no longer seeks to be a full-parity web browser. In PS Core, it doesn't even support HTML parsing at this point. Invoke-RestMethod should work for APIs in any part of the world, because JSON and XML follow stricter standards (and this issue in particular is about bringing the cmdlet into even closer alignment with standards).

I really do feel much of the CJK pain will be reduced when we can add in a proper HTML parser. This project doesn't have the resources to make its own full functioning browser, so we will be relying on another community project that does. I also feel that adding the ability for the user to supply their desired character encoding will address some of this pain.

Invoke-WebRequest will very likely never directly support the multiple-encoding scenarios allowed under HTML5. But ConvertFrom-Html, in theory, would. That leaves us navigating only the baseline encoding for the "outer envelope".

For that, we must follow standards, as painful as they are for a significant user population. Current standards make it clear the default for HTML must be iso-8859-1, and that it is the responsibility of the remote endpoint to suggest an alternate encoding when something other than iso-8859-1 is used. This is on the web developers and web server admins more than on us. Though we can, and do, seek to make this less painful: we currently support a wide range of encoding detection. It's not perfect, but it can be improved given specific examples of where it breaks down. And for when detection fails, allowing you to supply your own encoding should work.

@markekraus
Contributor

@chuanjiao10 That solution uses Internet Explorer's engine to parse the HTML and determine the internal elements' content type. Internet Explorer is a full-fledged web browser; it has teams dedicated to complex charset identification and string encoding. Invoke-WebRequest is not a full web browser and does not have those kinds of resources. Internet Explorer interop is not available in PowerShell Core, so this solution cannot be used there.

Invoke-WebRequest and Invoke-RestMethod in PowerShell Core now support charset detection in the HTML for the outermost tag. This is an improvement over Windows PowerShell 5.1. They do not, and likely never will, support it for the inner elements. That is a task for a separate cmdlet.

UTF-8 is supported if the server declares charset=utf-8 in the Content-Type header or if it is declared in the outermost HTML element. And this is not limited to UTF-8: any encoding that .NET Core supports is possible.
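
For example, you can always decode the raw response bytes with an arbitrary .NET encoding by hand (a sketch; $uri is a placeholder, and on .NET Core legacy code pages such as shift_jis may need the CodePagesEncodingProvider registered first, depending on the host):

# assumption: the host has not already registered the code-pages provider
[Text.Encoding]::RegisterProvider([Text.CodePagesEncodingProvider]::Instance)
$resp = Invoke-WebRequest $uri   # $uri: an endpoint returning Shift-JIS content
$text = [Text.Encoding]::GetEncoding('shift_jis').GetString($resp.RawContentStream.ToArray())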

Why WebCmdlets very bad for a long time?

I can't answer that. I don't work for Microsoft; I'm a community member like yourself. 😄 I do share some of your frustration, though. That's why I have been contributing.

"<a charset=xxx> C </a> <a charset=yyy> J </a>" in html4 is not standard?

It is, but that is only something that can be implemented in a web browser, where you can click links and the browser has context surrounding the URI it retrieves. You are not clicking links in Invoke-WebRequest; you are providing URIs. The burden is 100% on the web developers and web server administrators to ensure the response for a given endpoint returns the proper charset definitions in the "outer envelope". Otherwise, PowerShell has no context to understand what the charset should be based on an anchor element.

In PowerShell, this would be accomplished by parsing the URI and charset from the HTML anchor element (<a>) and then calling Invoke-WebRequest with that information. If we exposed a way for you to declare what encoding you are expecting, it would be possible with something like Invoke-WebRequest -Uri $uri -ResponseCharset 'UTF-8'.

if WebCmdlets do not do well, calling an external mature library

Like I said, we will be using an external library for HTML parsing (AngleSharp), but this won't be baked into Invoke-WebRequest. The focus of the web cmdlets has shifted beginning with 6.0.0, and HTML parsing will never return to them. Instead, a separate cmdlet will be introduced so it can be used for more than just Invoke-WebRequest and Invoke-RestMethod. Invoke-WebRequest alone won't do what you want even in the future, but there will be a way with something like $Html = Invoke-WebRequest $uri | ConvertFrom-Html

I should also note that as far as I know none of these changes will be ported back to Windows PowerShell 5.1 or older. To get these benefits, you will need to move to PowerShell Core.

@SteveL-MSFT
Member

Further HTML parsing discussion should be part of #2867.

For this specific issue, we should default to UTF-8 unless an encoding is provided in the response or the user specifies an encoding they want to use.

@Jaykul
Contributor Author
Jaykul commented Feb 7, 2018

the default encoding for application/json responses should be UTF-8.

This is most certainly correct. Even in the (older) standard I cited, UTF-8 (with no BOM) is specified as the default, and they suggest reading the first 4 bytes only as a way of determining if it's a different Unicode encoding (since UTF-16 and UTF-32 require BOM).

Since you point out the version I cited is twice obsolete, and the newer standard says UTF-8 only, I would be happy with UTF-8 only (i.e. skipping the algorithm to check for BOM).

I'd be willing to bet that nearly every complaint you've had about encoding problems was an instance like this one, where the content was UTF-8 encoded without specifying anything (since that's supposed to be the default) and the cmdlet improperly treated it as the Windows default encoding.

I also agree that for anything else that's not JSON, all bets are off 😉 but I'll go look at #2867 too 😁

@iSazonov iSazonov added the Resolution-Fixed The issue is fixed. label Mar 24, 2018