Merge branch 'master' into kuvandjiev/master · html5lib/html5lib-python@9a5b127 · GitHub

Commit 9a5b127

Merge branch 'master' into kuvandjiev/master
2 parents dae6201 + f0bb2a6 commit 9a5b127

94 files changed, +9670 -1197 lines changed

.appveyor.yml

Lines changed: 2 additions & 8 deletions
@@ -1,21 +1,18 @@
-# To activate, change the Appveyor settings to use `.appveyor.yml`.
+image: Visual Studio 2019
 environment:
   global:
     PATH: "C:\\Python27\\Scripts\\;%PATH%"
-    PYTEST_COMMAND: "coverage run -m pytest"
   matrix:
     - TOXENV: py27-base
     - TOXENV: py27-optional
-    - TOXENV: py34-base
-    - TOXENV: py34-optional
     - TOXENV: py35-base
     - TOXENV: py35-optional
     - TOXENV: py36-base
     - TOXENV: py36-optional

 install:
   - git submodule update --init --recursive
-  - python -m pip install tox codecov
+  - python -m pip install tox

 build: off

@@ -24,6 +21,3 @@ test_script:

 after_test:
   - python debug-info.py
-
-on_success:
-  - codecov

.github/workflows/python-tox.yml

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
+on: [pull_request, push]
+jobs:
+  build:
+    # Prevent duplicate builds for 'internal' pull requests on existing commits
+    # Credit: https://github.community/t/duplicate-checks-on-push-and-pull-request-simultaneous-event/18012
+    if: github.event.push || github.event.pull_request.head.repo.full_name != github.repository
+    strategy:
+      fail-fast: false
+      matrix:
+        # 2.7, 3.5, and 3.6 run on Windows via AppVeyor
+        python: ["3.7", "3.8", "3.9", "3.10", "3.11"]
+        os: [ubuntu-latest, windows-latest]
+        deps: [base, optional]
+        include:
+          - python: "pypy-2.7"
+            os: ubuntu-latest
+            deps: base
+          - python: "pypy-3.8"
+            os: ubuntu-latest
+            deps: base
+          - python: "2.7"
+            os: ubuntu-latest
+            deps: oldest
+          - python: "3.7"
+            os: ubuntu-latest
+            deps: oldest
+    runs-on: ${{ matrix.os }}
+    steps:
+      - uses: actions/checkout@v3
+        with:
+          submodules: true
+      - if: ${{ matrix.deps == 'base' }}
+        uses: actions/setup-python@v4
+        with:
+          python-version: ${{ matrix.python }}
+          cache: pip
+          cache-dependency-path: |
+            requirements.txt
+            requirements-test.txt
+      - if: ${{ matrix.deps == 'optional' }}
+        uses: actions/setup-python@v4
+        with:
+          python-version: ${{ matrix.python }}
+          cache: pip
+          cache-dependency-path: |
+            requirements.txt
+            requirements-optional.txt
+            requirements-test.txt
+      - if: ${{ matrix.deps == 'oldest' }}
+        uses: actions/setup-python@v4
+        with:
+          python-version: ${{ matrix.python }}
+          cache: pip
+          cache-dependency-path: |
+            requirements-oldest.txt
+      - if: ${{ matrix.os == 'windows-latest' }}
+        name: Determine environment name for Tox (PowerShell)
+        run: python toxver.py ${{ matrix.python }} ${{ matrix.deps }} >> $env:GITHUB_ENV
+      - if: ${{ matrix.os == 'ubuntu-latest' }}
+        name: Determine environment name for Tox (Bash)
+        run: python toxver.py ${{ matrix.python }} ${{ matrix.deps }} >> $GITHUB_ENV
+      - run: pip install tox
+      - run: tox
+      - if: ${{ always() }}
+        run: python debug-info.py
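
Note on the two "Determine environment name for Tox" steps: both append the output of `toxver.py` to `GITHUB_ENV`, so the script presumably prints a `NAME=value` line that later steps (here, `tox`) pick up. `toxver.py` itself is not shown in this part of the diff. The sketch below is hypothetical: the function names `tox_env_name` and `github_env_line`, the `TOXENV` variable, and the exact naming scheme are assumptions modelled on the `py27-base` style names in `.appveyor.yml`, not the repository's actual script.

```python
def tox_env_name(python, deps):
    """Map a workflow matrix entry to a tox environment name (hypothetical scheme)."""
    if python.startswith("pypy-2"):
        prefix = "pypy"
    elif python.startswith("pypy-3"):
        prefix = "pypy3"
    else:
        # "3.10" -> "py310", "2.7" -> "py27"
        prefix = "py" + "".join(python.split(".")[:2])
    return "{}-{}".format(prefix, deps)


def github_env_line(python, deps):
    """Build the NAME=value line that would be appended to $GITHUB_ENV."""
    return "TOXENV=" + tox_env_name(python, deps)


if __name__ == "__main__":
    # Prints: TOXENV=py310-optional
    print(github_env_line("3.10", "optional"))
```

Under this assumption, the matrix entry `python: "2.7"`, `deps: oldest` would select a tox environment named `py27-oldest`.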

.pytest.expect

Lines changed: 149 additions & 116 deletions
Large diffs are not rendered by default.

.travis.yml

Lines changed: 0 additions & 31 deletions
This file was deleted.

AUTHORS.rst

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ Credits
 ``html5lib`` is written and maintained by:

 - James Graham
-- Geoffrey Sneddon
+- Sam Sneddon
 - Łukasz Langa
 - Will Kahn-Greene

CHANGES.rst

Lines changed: 45 additions & 2 deletions
@@ -1,6 +1,49 @@
 Change Log
 ----------

+1.2
+~~~
+
+Unreleased yet
+
+Features:
+
+* Add support for the ``<wbr>`` element in the sanitizer, `which indicates
+  a line break opportunity <https://html.spec.whatwg.org/#the-wbr-element>`_.
+  This element is allowed by default. (#395) (Thank you, Tom Most!)
+* Add support for serializing the ``<ol reversed>`` boolean attribute. (Thank
+  you, Tom Most!) (#396)
+* The ``<ol reversed>`` and ``<ol start>`` attributes are now permitted by the
+  sanitizer. (#321) (Thank you, Tom Most!)
+
+Bug fixes:
+
+* The sanitizer now permits ``<summary>`` tags. It used to allow ``<details>``
+  already. (#423)
+
+1.1
+~~~
+
+Released on June 23, 2020
+
+Breaking changes:
+
+* Drop support for Python 3.3. (#358)
+* Drop support for Python 3.4. (#421)
+
+Deprecations:
+
+* Deprecate the ``html5lib`` sanitizer (``html5lib.serialize(sanitize=True)`` and
+  ``html5lib.filters.sanitizer``). We recommend users migrate to `Bleach
+  <https://github.com/mozilla/bleach>`. Please let us know if Bleach doesn't suffice for your
+  use. (#443)
+
+Other changes:
+
+* Try to import from ``collections.abc`` to remove DeprecationWarning and ensure
+  ``html5lib`` keeps working in future Python versions. (#403)
+* Drop optional ``datrie`` dependency. (#442)
+
 1.0.1
 ~~~~~

@@ -20,7 +63,7 @@ Features:
 * Support Python 3.6. (#333) (Thank you, Jon Dufresne!)
 * Add CI support for Windows using AppVeyor. (Thank you, John Vandenberg!)
 * Improve testing and CI and add code coverage (#323, #334), (Thank you, Jon
-  Dufresne, John Vandenberg, Geoffrey Sneddon, Will Kahn-Greene!)
+  Dufresne, John Vandenberg, Sam Sneddon, Will Kahn-Greene!)
 * Semver-compliant version number.

 Bug fixes:
@@ -73,7 +116,7 @@ Released on July 14, 2016
   tested, doesn't entirely work, and as far as I can tell is
   completely unused by anyone.**

-* Move testsuite to ``py.test``.
+* Move testsuite to ``pytest``.

 * **Fix #124: move to webencodings for decoding the input byte stream;
   this makes html5lib compliant with the Encoding Standard, and
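
Note on the 1.1 entry "Try to import from ``collections.abc``" (#403): the abstract base classes moved to `collections.abc` in Python 3.3, and the old aliases in `collections` were removed in Python 3.10, while Python 2.7 only has the old location. A minimal sketch of the usual fallback pattern follows; the helper name `is_mapping` is purely illustrative and is not taken from html5lib.

```python
try:
    # Python 3.3+: the ABCs live in collections.abc
    from collections.abc import Mapping
except ImportError:
    # Python 2.7: the ABCs are only available from collections
    from collections import Mapping


def is_mapping(obj):
    """Return True if obj implements the read-only mapping interface."""
    return isinstance(obj, Mapping)


if __name__ == "__main__":
    print(is_mapping({"class": "note"}))
    print(is_mapping(["class", "note"]))
```

Trying the new location first means no `DeprecationWarning` is raised on interpreters that still ship both names.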

CONTRIBUTING.rst

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ documentation. Some useful information:
 - We keep the master branch passing all tests at all times on all
   supported versions.

-`Travis CI <https://travis-ci.org/html5lib/html5lib-python/>`_ is run
+`GitHub Actions <https://github.com/html5lib/html5lib-python/actions>`_ is run
 against all pull requests and should enforce all of the above.

 We use `Opera Critic <https://critic.hoppipolla.co.uk/>`_ as an external

README.rst

Lines changed: 13 additions & 13 deletions
@@ -1,9 +1,8 @@
 html5lib
 ========

-.. image:: https://travis-ci.org/html5lib/html5lib-python.svg?branch=master
-  :target: https://travis-ci.org/html5lib/html5lib-python
-
+.. image:: https://github.com/html5lib/html5lib-python/actions/workflows/python-tox.yml/badge.svg
+  :target: https://github.com/html5lib/html5lib-python/actions/workflows/python-tox.yml

 html5lib is a pure-python library for parsing HTML. It is designed to
 conform to the WHATWG HTML specification, as is implemented by all major
@@ -91,23 +90,22 @@ More documentation is available at https://html5lib.readthedocs.io/.
 Installation
 ------------

-html5lib works on CPython 2.7+, CPython 3.4+ and PyPy. To install it,
-use:
+html5lib works on CPython 2.7+, CPython 3.5+ and PyPy. To install:

 .. code-block:: bash

     $ pip install html5lib

+The goal is to support a (non-strict) superset of the versions that `pip
+supports
+<https://pip.pypa.io/en/stable/installing/#python-and-os-compatibility>`_.

 Optional Dependencies
 ---------------------

 The following third-party libraries may be used for additional
 functionality:

-- ``datrie`` can be used under CPython to improve parsing performance
-  (though in almost all cases the improvement is marginal);
-
 - ``lxml`` is supported as a tree format (for both building and
   walking) under CPython (but *not* PyPy where it is known to cause
   segfaults);
@@ -129,7 +127,7 @@ Tests
 -----

 Unit tests require the ``pytest`` and ``mock`` libraries and can be
-run using the ``py.test`` command in the root directory.
+run using the ``pytest`` command in the root directory.

 Test data are contained in a separate `html5lib-tests
 <https://github.com/html5lib/html5lib-tests>`_ repository and included
@@ -146,7 +144,9 @@ which can be found on PyPI.
 Questions?
 ----------

-There's a mailing list available for support on Google Groups,
-`html5lib-discuss <http://groups.google.com/group/html5lib-discuss>`_,
-though you may get a quicker response asking on IRC in `#whatwg on
-irc.freenode.net <http://wiki.whatwg.org/wiki/IRC>`_.
+Check out `the docs <https://html5lib.readthedocs.io/en/latest/>`_. Still
+need help? Go to our `GitHub Discussions
+<https://github.com/html5lib/html5lib-python/discussions>`_.
+
+You can also browse the archives of the `html5lib-discuss mailing list
+<https://www.mail-archive.com/html5lib-discuss@googlegroups.com/>`_.

benchmarks/bench_html.py

Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
+import io
+import os
+import sys
+
+import pyperf
+
+sys.path[0:0] = [os.path.join(os.path.dirname(__file__), "..")]
+import html5lib  # noqa: E402
+
+
+def bench_parse(fh, treebuilder):
+    fh.seek(0)
+    html5lib.parse(fh, treebuilder=treebuilder, useChardet=False)
+
+
+def bench_serialize(loops, fh, treebuilder):
+    fh.seek(0)
+    doc = html5lib.parse(fh, treebuilder=treebuilder, useChardet=False)
+
+    range_it = range(loops)
+    t0 = pyperf.perf_counter()
+
+    for loops in range_it:
+        html5lib.serialize(doc, tree=treebuilder, encoding="ascii", inject_meta_charset=False)
+
+    return pyperf.perf_counter() - t0
+
+
+BENCHMARKS = ["parse", "serialize"]
+
+
+def add_cmdline_args(cmd, args):
+    if args.benchmark:
+        cmd.append(args.benchmark)
+
+
+if __name__ == "__main__":
+    runner = pyperf.Runner(add_cmdline_args=add_cmdline_args)
+    runner.metadata["description"] = "Run benchmarks based on Anolis"
+    runner.argparser.add_argument("benchmark", nargs="?", choices=BENCHMARKS)
+
+    args = runner.parse_args()
+    if args.benchmark:
+        benchmarks = (args.benchmark,)
+    else:
+        benchmarks = BENCHMARKS
+
+    with open(os.path.join(os.path.dirname(__file__), "data", "html.html"), "rb") as fh:
+        source = io.BytesIO(fh.read())
+
+    if "parse" in benchmarks:
+        for tb in ("etree", "dom", "lxml"):
+            runner.bench_func("html_parse_%s" % tb, bench_parse, source, tb)
+
+    if "serialize" in benchmarks:
+        for tb in ("etree", "dom", "lxml"):
+            runner.bench_time_func("html_serialize_%s" % tb, bench_serialize, source, tb)
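
Note on `bench_serialize`: it is registered with `runner.bench_time_func`, which (unlike `bench_func`) expects a function that takes the loop count as its first argument and returns the total elapsed time, so that one-off setup such as parsing the document stays outside the timed region. The standard-library sketch below shows that contract in isolation; `bench_join` and `per_call_seconds` are illustrative names and not part of the commit, and `time.perf_counter` stands in for the `pyperf.perf_counter` call used above.

```python
import time


def bench_join(loops, parts):
    """Time `loops` calls of "".join(parts) and return the total elapsed seconds."""
    # Setup (anything that should not be measured) would go here.
    range_it = range(loops)
    t0 = time.perf_counter()

    for _ in range_it:
        "".join(parts)

    return time.perf_counter() - t0


def per_call_seconds(time_func, loops, *args):
    """Average time per call, derived from the total a time function reports."""
    return time_func(loops, *args) / loops


if __name__ == "__main__":
    print(per_call_seconds(bench_join, 1000, ["a"] * 100))
```

Using a throwaway loop variable (`_`) avoids rebinding `loops` inside the loop, which the committed `bench_serialize` does; that rebinding is harmless there because `range_it` is built beforehand.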

benchmarks/bench_wpt.py

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
+import io
+import os
+import sys
+
+import pyperf
+
+sys.path[0:0] = [os.path.join(os.path.dirname(__file__), "..")]
+import html5lib  # noqa: E402
+
+
+def bench_html5lib(fh):
+    fh.seek(0)
+    html5lib.parse(fh, treebuilder="etree", useChardet=False)
+
+
+def add_cmdline_args(cmd, args):
+    if args.benchmark:
+        cmd.append(args.benchmark)
+
+
+BENCHMARKS = {}
+for root, dirs, files in os.walk(os.path.join(os.path.dirname(os.path.abspath(__file__)), "data", "wpt")):
+    for f in files:
+        if f.endswith(".html"):
+            BENCHMARKS[f[: -len(".html")]] = os.path.join(root, f)
+
+
+if __name__ == "__main__":
+    runner = pyperf.Runner(add_cmdline_args=add_cmdline_args)
+    runner.metadata["description"] = "Run parser benchmarks from WPT"
+    runner.argparser.add_argument("benchmark", nargs="?", choices=sorted(BENCHMARKS))
+
+    args = runner.parse_args()
+    if args.benchmark:
+        benchmarks = (args.benchmark,)
+    else:
+        benchmarks = sorted(BENCHMARKS)
+
+    for bench in benchmarks:
+        name = "wpt_%s" % bench
+        path = BENCHMARKS[bench]
+        with open(path, "rb") as fh:
+            fh2 = io.BytesIO(fh.read())
+
+        runner.bench_func(name, bench_html5lib, fh2)
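
Note on the `BENCHMARKS` discovery loop: it walks `data/wpt` recursively and keys each benchmark by file name with the `.html` suffix removed, so two files with the same name in different subdirectories would collapse into one entry. The sketch below isolates that logic in a function so it can be tried on any directory; `discover_benchmarks` and its `suffix` parameter are illustrative names and not part of the commit.

```python
import os


def discover_benchmarks(data_dir, suffix=".html"):
    """Map benchmark name (file name without suffix) to the file's full path."""
    found = {}
    for root, dirs, files in os.walk(data_dir):
        for f in files:
            if f.endswith(suffix):
                # Key by bare file name; a later duplicate name overwrites an earlier one.
                found[f[: -len(suffix)]] = os.path.join(root, f)
    return found


if __name__ == "__main__":
    for name, path in sorted(discover_benchmarks(".").items()):
        print("wpt_%s -> %s" % (name, path))
```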

benchmarks/data/README.md

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+The files in this data are derived from:
+
+* `html.html`: from [html](http://github.com/whatwg/html), revision
+  77db356a293f2b152b648c836b6989d17afe42bb. This is the first 5000 lines of `source`. (This is
+  representative of the input to [Anolis](https://bitbucket.org/ms2ger/anolis/); first 5000 lines
+  chosen to make it parse in a reasonable time.)
+
+* `wpt`: see `wpt/README.md`.
