8000 Fix <img /> detection regex in web cmdlets by vexx32 · Pull Request #12099 · PowerShell/PowerShell · GitHub
[go: up one dir, main page]

Skip to content

Conversation

@vexx32
Copy link
Collaborator
@vexx32 vexx32 commented Mar 11, 2020

PR Summary

In c64a28e, the img tag regex was changed to <img\s+[^\s>]*> to avoid a DoS issue.

This regex will not match common image tags that contain more than just the src attribute (e.g., <img src="$url" class="className"/>

Fix is to adjust the regex to not preclude spaces after other content in the tag.

PR Context

Resolves #12089

PR Checklist

@ghost ghost assigned anmenaga Mar 11, 2020
@vexx32
Copy link
Collaborator Author
vexx32 commented Mar 11, 2020

@PoshChan please retry macos

@PoshChan
Copy link
Collaborator

@vexx32, successfully started retry of PowerShell-CI-macOS

@iSazonov
Copy link
Collaborator

We have 6 Regex in the file. It would nice to add tests. xUnit could be most stable.

@TravisEz13
Copy link
Member

There are no existing tests except the test to verify the last change didn't regress. We should add tests, but as this is regression, and we are early in 7.1, my question is have you manually tested it.

But to port it to 7.0, we need to add tests to cover at least a couple of cases.

here is the test I added:

@TravisEz13 TravisEz13 added this to the 7.0.x-Servicing-Consider milestone Mar 11, 2020
@vexx32
Copy link
Collaborator Author
vexx32 commented Mar 11, 2020

@TravisEz13 yep I manually tested it with regex101. On a page that contains a bunch of images (literally just a page about travel blogs with a bunch of pictures), here's the breakdown:

"Pathological" regex from the test:

  • 2ms
  • 25 matches
  • 1708 steps

Currently-used regex in the codebase:

  • ~1ms
  • No matches
  • 1683 steps

Proposed new regex in this PR:

  • 1ms
  • 25 matches
  • 1708 steps

There may be a more effective test case available, but I just grabbed the first thing I could think of 😅

@TravisEz13
Copy link
Member
TravisEz13 commented Mar 11, 2020

for it to be pathological, you need to test it with the pathological case (lots of spaces after img)... but there is a test case for that.

8000 @TravisEz13 TravisEz13 added the CL-General Indicates that a PR should be marked as a general cmdlet change in the Change Log label Mar 11, 2020
@TravisEz13
Copy link
Member

@vexx32 if you aren't planning on adding the cases, can you open an issue to add the cases and port both PR's back to 7.0 after the cases have been added.

@vexx32
Copy link
Collaborator Author
vexx32 commented Mar 11, 2020

I don't mind adding the cases, but I'm not really clear on where / how these cases are defined (would I need to add something to that Get-WebListenerUrl function?), or exactly what additional cases you're looking to have added here? Tests for something like <img$(" " * 5000)src="dummyimage.png"$(" " * 5000)class="classname"/>?

@iSazonov
Copy link
Collaborator

Is there a compliance analyzer tool for Regex?

@iSazonov
Copy link
Collaborator

@vexx32 I think @TravisEz13 says that we could add tests like existing one for rest 5 Regex.

@TravisEz13
Copy link
Member
TravisEz13 commented Mar 12, 2020

@iSazonov we are supposed to use a tool in azure. It's public, but to use it free, we are supposed to use an internal version. To use it, we need a test case (or a few) where the content going to the regex is in a file (separate file per case). The tool is basically and AI and will vary the content of the file until all codepaths have been hit.

I'd really appreciate if someone added a case like this. We could remove the pathological test cases if we had cases like this and those cases are unreliable.

@TravisEz13
Copy link
Member

@vexx32 for this (#12099 (comment)) new case, we don't need to add a large number of spaces.

@iSazonov
Copy link
Collaborator

@TravisEz13 Do you ask to create one HTML test file per Regex from this 5 Regex in the PR? Should it be full featured HTML page or simple string like
<img class="image a11y-hidden" src="https://mc.yandex.ru/watch/723233"/>

@TravisEz13
Copy link
Member
TravisEz13 commented Mar 13, 2020

@iSazonov
This is the tool: https://www.microsoft.com/en-us/security-risk-detection/
The description of the requirement are here:
https://docs.microsoft.com/en-us/security-risk-detection/concepts/seed-files

Microsoft Security Risk Detection
Security Risk Detection is Microsoft's unique fuzz testing service for finding security critical bugs in software, helping customers quickly adopt practices and technology battle-tested at Microsoft.
Guidelines and best practices to choosing the corpus of seed files to maximize fuzzing efficiency and results

@iSazonov
Copy link
Collaborator

@TravisEz13 The docs say that we could have up to 1000 seed files. I think we could create a list of sites based on different engines, grab pages from them to use as seed files. I believe it will be more productive and native than manually create HTML files.

@TravisEz13
Copy link
Member
TravisEz13 commented Mar 16, 2020

@iSazonov The tool may allow that, but during our testing internet access is disabled. So, the test would fail. (It may not be that way today, but it will be in the future.). The whole purpose of this tool is that it does the variation so that you don't have to produce every input.

See https://en.wikipedia.org/wiki/Fuzzing

@iSazonov
Copy link
Collaborator

@TravisEz13 My suggestion is not to access Internet at test time. My suggestion is:

  • create a list of popular sites based on different templates (Sharepoint and so on)
  • download pages from the sites
  • use these downloaded static html pages in tests

@TravisEz13
Copy link
Member

@iSazonov in theory that works, but there are legal issues with that. Sharepoint is fine, but the other sites we may need to get the sites permission.

@vexx32
Copy link
Collaborator Author
vexx32 commented Mar 16, 2020

Opened #12138; feel free to continue that discussion there, as it doesn't look like this will be a simple decision of what tests to add for this PR. 🙂

@TravisEz13
Copy link
Member
TravisEz13 commented Mar 16, 2020

@vexx32 Can you add a test like you were suggesting here: #12099 (comment) minus the spaces (that's already covered)

@vexx32
Copy link
Collaborator Author
vexx32 commented Mar 16, 2020

Happy to, but I'm not not clear on how cases are added here, as it looks like the content generation is handled elsewhere. Can you point me to where that is?

@TravisEz13
Copy link
Member

Here is the commit where I added the last test case and tho WebListener "Controller" that the test cases uses: c64a28e

@iSazonov
Copy link
Collaborator

in theory that works, but there are legal issues with that. Sharepoint is fine, but the other sites we may need to get the sites permission.

@TravisEz13 If I understand right you will make the tests internally so the site list will be not public. We are not going to change, share or distribute the pages. We are not going to download any artifacts like pictures - we need only html. We are not going to check these sites and distribute any information about it - we are going to test only that our code works correctly on the data.

contentType = "text/html; charset=utf8";
body = "<img" + (new string(' ', dosLength));
break;
case "img-attribute":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't actually a DOS... can you just add a comment explaining that?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TravisEz13 Yeah I was torn on that myself. I could also just add a new Controller class here instead, since we plan on adding future cases to test the actual regex patterns' capability to parse what we expect?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need to. I'll leave it up to you, but at least add a comment that this is not a DOS case, but you didn't want to create a new controller for a case that was so similar.

@ghost ghost added Waiting on Author The PR was reviewed and requires changes or comments from the author before being accept and removed Waiting on Author The PR was reviewed and requires changes or comments from the author before being accept labels Mar 18, 2020
$pathologicalRatio | Should -BeGreaterThan 5
}

It 'correctly parses an image with id, class, and src attributes' {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And you should probably move this to a new context too.

@TravisEz13
Copy link
Member
TravisEz13 commented Mar 23, 2020

I used fixing a conflict in your PR to partially fix a mistake I made... Brought a test back, but pended it.
related to #12155

@anmenaga anmenaga merged commit f8fb770 into PowerShell:master Mar 24, 2020
@vexx32 vexx32 deleted the Web/ImageRegex branch March 24, 2020 17:54
@TravisEz13 TravisEz13 modified the milestones: 7.0.1-approved, 7.0.1 Apr 20, 2020
@ghost
Copy link
ghost commented Apr 23, 2020

🎉v7.1.0-preview.2 has been released which incorporates this pull request.:tada:

Handy links:

@ghost ghost mentioned this pull request Apr 23, 2020
@ghost
Copy link
ghost commented May 14, 2020

🎉v7.0.1 has been released which incorporates this pull request.:tada:

Handy links:

silijon pushed a commit to SkyKick/PowerShell that referenced this pull request Jul 2, 2020
# Conflicts:
#	test/powershell/Modules/Microsoft.PowerShell.Utility/WebCmdlets.Tests.ps1
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CL-General Indicates that a PR should be marked as a general cmdlet change in the Change Log

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Invoke-WebRequest Images missing

6 participants

0