WebCmdlets should read the encoding of Content-Type application/json per RFC #5530
Comments
That seems reasonable. Tell me if this logic makes sense:

edit: thinking about it, it wouldn't be too hard: if the response is 3 bytes, it's UTF-8; if it's 2 and the first byte is 00, it's UTF-16BE; if it's 2 and the second byte is 00, it's UTF-16LE; and if neither byte 1 nor byte 2 is 00, it's UTF-8.
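For reference, RFC 4627 §3 formalizes this kind of check: since a (pre-RFC 8259) JSON text always begins with two ASCII characters, the pattern of NUL bytes in the first four octets identifies the Unicode encoding. A minimal sketch in PowerShell (the function name is hypothetical; a real fix would live in the C# WebCmdlet code):

```powershell
function Get-JsonEncoding {
    param([byte[]] $Bytes)

    if ($Bytes.Length -ge 4) {
        # RFC 4627 §3 patterns:
        # 00 00 00 xx -> UTF-32BE    00 xx 00 xx -> UTF-16BE
        # xx 00 00 00 -> UTF-32LE    xx 00 xx 00 -> UTF-16LE
        if ($Bytes[0] -eq 0 -and $Bytes[1] -eq 0 -and $Bytes[2] -eq 0) {
            return [System.Text.Encoding]::GetEncoding('utf-32BE')
        }
        if ($Bytes[1] -eq 0 -and $Bytes[2] -eq 0 -and $Bytes[3] -eq 0) {
            return [System.Text.Encoding]::UTF32            # UTF-32LE
        }
        if ($Bytes[0] -eq 0 -and $Bytes[2] -eq 0) {
            return [System.Text.Encoding]::BigEndianUnicode # UTF-16BE
        }
        if ($Bytes[1] -eq 0 -and $Bytes[3] -eq 0) {
            return [System.Text.Encoding]::Unicode          # UTF-16LE
        }
    }
    # No NULs among the first bytes: UTF-8, which RFC 8259 mandates anyway.
    return [System.Text.Encoding]::UTF8
}
```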
Hmm, I'm wondering what standard we should follow here. RFC 8259, released in December 2017, now mandates UTF-8 (no BOM) encoding for JSON: https://tools.ietf.org/html/rfc8259#section-8.1
@SteveL-MSFT Do we have any kind of guidance on what standards should be followed and how soon after acceptance they should be implemented? Currently RFC 8259 is an "Internet Standard" and has obsoleted the standard @Jaykul cited (RFC 4627), but it conflicts with ECMA-404, which still allows any Unicode encoding and not exclusively UTF-8. I agree with the IETF that UTF-8 reigns supreme (I don't personally know a single JSON API that returns anything other than UTF-8). It would seem to me the default encoding for application/json responses should be UTF-8. I'm torn on whether we should ignore or honor the character encoding provided by a remote endpoint. I think part of this could be solved if we enforced UTF-8 but allowed users to supply their own response encoding via a parameter. That would solve some of the other open issues where charset detection is not working because remote endpoints are not compliant.
@markekraus the great thing about standards is that there are so many to choose from :) I think we should default to UTF-8. If the response includes an encoding, we should use that. For edge cases, end users can work around it. At this time, I wouldn't bother with providing a parameter until a need arises.
@SteveL-MSFT Re: user-supplied encoding: this has come up several times. #3267 from @chuanjiao10 has a few examples. I could dig, but there have been other issues where this comes up in the comments. And I suspect we might not see a great number of complaints, as it seems to be a common problem in Asia (language barrier and all that). Web browsers do a fair bit of magic to select the encoding. The least we could do is give the user the option to select the encoding. It's probably not a priority, but there does appear to be a need.
@markekraus if it's come up several times already, then we should probably support it.
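For illustration, user-supplied response encoding might look something like this (the parameter name is purely hypothetical; no such parameter existed at the time of this discussion):

```powershell
# Hypothetical: let the caller override whatever charset the server
# claims. -ResponseEncoding is an illustrative name, not a real parameter.
$enc = [System.Text.Encoding]::GetEncoding('shift_jis')
Invoke-WebRequest -Uri 'http://example.jp/page' -ResponseEncoding $enc
```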
@chuanjiao10 Hopefully, whatever we land on for HTML parsing (AngleSharp) will support multiple character encodings in a single response. For now, we are concerned only with the outer content-type encoding of the response as a whole, as our current implementation only does basic HTML parsing. Could you please open a new issue for multiple encodings in a single HTML response, and maybe include some examples? Thanks!
I agree that we should default to UTF-8; CoreFX uses it, and PowerShell Core does too in many places. I believe we should get a PowerShell Committee conclusion to close the discussion.
@iSazonov for this particular issue, we are only changing the default to UTF-8 for application/json. The standards are a bit less clear on HTML responses. Historically, when no charset was defined in the header, ISO-8859-1 was assumed, so if we changed the default for everything to UTF-8, legacy systems would potentially break. This is complicated because the default charset for HTML5 is now UTF-8, and HTML5 is heavily reliant on the charset declared in the document's meta element. Ultimately, we will never have perfect encoding in this project because we do not aspire to be a full-parity web browser for HTML. My hope is that when we move to AngleSharp for HTML rendering, it can handle the terrifying task of navigating all the charset complexities for us.
@chuanjiao10 I understand the CJK user experience is currently bad. I have a personal interest in addressing that (for the limited amount of 日本語 I deal with). The problem is that I really do feel much of the CJK pain will be reduced when we can add a proper HTML parser. This project doesn't have the resources to build its own fully functioning browser, so we will be relying on another community project that does. I also feel that adding the ability for the user to supply their desired character encoding will address some of this pain.
For that, we must follow standards, as painful as they are for a significant user population. Current standards make it clear the default for HTML must be UTF-8.
@chuanjiao10 That solution uses Internet Explorer's engine to parse the HTML and determine internal element content types. Internet Explorer is a full-fledged web browser; it has teams dedicated to complex charset identification and string encoding.
UTF-8 is supported. If the server declares a charset in the Content-Type header, that charset will be honored.
I can't answer that. I don't work for Microsoft; I'm a community member like yourself. 😄 I do share some of your frustration, though. That's why I have been contributing.
It is, but that is only something that can be implemented in a web browser, where you can click links and the browser has context surrounding the URI it retrieves. You are not clicking links in the WebCmdlets. In PowerShell, this would have to be accomplished by parsing the URI and charset from the HTML anchor element (the `<a>` tag).
Like I said, we will be using an external library for HTML parsing (AngleSharp), but this won't be baked into the WebCmdlets themselves. I should also note that, as far as I know, none of these changes will be ported back to Windows PowerShell 5.1 or older. To get these benefits, you will need to move to PowerShell Core.
Further HTML-parsing discussion should be part of #2867. For this specific issue, we should default to UTF-8 unless an encoding is provided in the response or the user specifies an encoding they want to use.
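That precedence is straightforward to express; a minimal sketch (variable names are illustrative, not from the actual implementation):

```powershell
# Proposed order: explicit user choice, then the response's declared
# charset, then UTF-8 as the fallback (instead of today's ISO-8859-1).
$encoding =
    if ($UserSpecifiedEncoding) { $UserSpecifiedEncoding }
    elseif ($CharsetFromHeader) { [System.Text.Encoding]::GetEncoding($CharsetFromHeader) }
    else                        { [System.Text.Encoding]::UTF8 }
```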
This is most certainly correct. Even in the (older) standard I cited, UTF-8 (with no BOM) is specified as the default, and it suggests reading just the first 4 bytes as a way of determining whether it's a different Unicode encoding (since UTF-16 and UTF-32 require a BOM). Since you point out the version I cited has been obsoleted twice, and the newer standard says UTF-8 only, I would be happy with UTF-8 only (i.e. skipping the algorithm that checks for a BOM). I'd be willing to bet that nearly every complaint you've had about encoding problems is an instance like this one, where the content was UTF-8 encoded without specifying anything (since that's supposed to be the default), and the cmdlet improperly treats it as the Windows default encoding. I also agree that for anything that's not JSON, all bets are off 😉 but I'll go look at #2867 too 😁
Steps to reproduce
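Call any endpoint that returns `Content-Type: application/json` (no charset attribute) with UTF-8 encoded non-ASCII content; for example (a sketch with a placeholder URI, not the endpoint from the original report):

```powershell
# Any endpoint that sends "Content-Type: application/json" without a
# charset attribute and UTF-8 body bytes will reproduce this.
Invoke-RestMethod -Uri 'https://example.com/rest/api/content' -Verbose
```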
Expected behavior
It should detect the utf-8 encoding and produce correctly decoded output — i.e. the same output you get by decoding the raw response bytes as UTF-8 yourself, as sketched below.
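For comparison, a manual decode along these lines shows what the expected output looks like (a sketch, assuming the placeholder endpoint above):

```powershell
# Decode the raw bytes as UTF-8 explicitly, bypassing the cmdlet's
# ISO-8859-1 fallback, then parse the JSON.
$resp  = Invoke-WebRequest -Uri 'https://example.com/rest/api/content'
$bytes = $resp.RawContentStream.ToArray()
[System.Text.Encoding]::UTF8.GetString($bytes) | ConvertFrom-Json
```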
Actual behavior
It falls back to iso-8859-1 encoding and produces gobbledygook with a lot of Ð's in it. Also, it utterly fails to produce a number for the "#-byte response" string in the verbose output.

Discussion
When calling an HTTP endpoint that returns the header:

Content-Type: application/json

the WebCmdlets are incorrectly defaulting to iso-8859-1 rather than a proper Unicode encoding, and are disregarding the application/json RFC's simple specification for how to determine the content encoding.

NOTE: Please don't work around this by just defaulting to utf-8. I'm sure that 90% of the time you could probably get away with that, but it's not actually correct, and the RFC implementation is trivial.
ALSO NOTE: The WebCmdlets do respect the `; charset=utf-8` attribute if it's present on the Content-Type header — which makes sense, but isn't technically standards-compliant for an `application/*` content-type, as far as I can tell.

To get started, see ProcessResponse and TryGetEncoding.
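For orientation, the fallback behavior this issue describes has roughly this shape (a PowerShell-flavored sketch of the C# logic, based only on the behavior reported here; not the actual implementation):

```powershell
# Honor an explicit charset, otherwise fall back to ISO-8859-1 -- which
# is exactly the wrong default for application/json. The fix is to fall
# back to the RFC detection (or plain UTF-8) for that content type.
$contentType = $response.Content.Headers.ContentType   # System.Net.Http header
$encoding = if ($contentType -and $contentType.CharSet) {
    [System.Text.Encoding]::GetEncoding($contentType.CharSet.Trim('"'))
} else {
    [System.Text.Encoding]::GetEncoding('iso-8859-1')  # today's fallback
}
```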
See also #5528, which was a specific instance of this problem. @lipkau was incorrectly convinced by early responders that the problem was in the web server, but it's actually in PowerShell's cmdlets. If you invoke the REST API against the Atlassian wiki, you can see the problem happening in the Verbose stream.
The content is actually correctly utf-8 encoded (as you could tell from the positions of the nulls in the first 4 bytes), and iso-8859-1 is never a valid encoding for application/json, period.