Checkpoint multi-page readability work · Harry0201/python-readability@f02fe79 · GitHub

Commit f02fe79

jcharum authored and mitechie committed
Checkpoint multi-page readability work
Restructured code to better support multi-page readability. Improved tests.

Conflicts:
    src/readability_lxml/readability.py
    src/tests/regression.py
1 parent 5cb4b8b commit f02fe79

3 files changed: +63 -1 lines changed

src/readability_lxml/readability.py

Lines changed: 2 additions & 1 deletion
@@ -27,6 +27,7 @@
 
 
 REGEXES = {
+<<<<<<< HEAD:src/readability_lxml/readability.py
     'unlikelyCandidatesRe': re.compile(
         ('combx|comment|community|disqus|extra|foot|header|menu|remark|rss|'
         'shoutbox|sidebar|sponsor|ad-break|agegate|pagination|pager|popup|'
@@ -46,7 +47,7 @@
     'divToPElementsRe': re.compile(
         '<(a|blockquote|dl|div|img|ol|p|pre|table|ul)', re.I),
     # Match: next, continue, >, >>, but not >|, as those usually mean last.
-    'nextLink': re.compile(r'(next|weiter|continue|>[^\|]|$)', re.I),
+    'nextLink': re.compile(r'(next|weiter|continue|>[^\|]$)', re.I), # Match: next, continue, >, >>, but not >|, as those usually mean last.
     'prevLink': re.compile(r'(prev|earl|old|new|<)', re.I),
     'page': re.compile(r'pag(e|ing|inat)', re.I),
     'firstLast': re.compile(r'(first|last)', re.I)
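
The intent stated in the `nextLink` comment can be checked directly against the two patterns. The sketch below copies both expressions from the hunk above and runs them over a few link labels. The sample labels and the helper name `matches` are illustrative assumptions and do not come from the project.

```python
import re

# Both patterns are copied verbatim from the diff.
OLD_NEXT_LINK = re.compile(r'(next|weiter|continue|>[^\|]|$)', re.I)
NEW_NEXT_LINK = re.compile(r'(next|weiter|continue|>[^\|]$)', re.I)

# Hypothetical link labels, chosen only to exercise each alternative.
SAMPLES = ['Next page', 'weiter', '>>', '>', '>|', 'Previous']


def matches(pattern, text):
    """Return True if the pattern is found anywhere in text."""
    return pattern.search(text) is not None


if __name__ == '__main__':
    for label in SAMPLES:
        print('%-12r old=%-5s new=%s' % (
            label,
            matches(OLD_NEXT_LINK, label),
            matches(NEW_NEXT_LINK, label)))
```

Because the old pattern ends in a bare `|$` alternative, it finds an empty match at the end of every string, so it also accepts `'Previous'` and `'>|'`. The new pattern rejects those, but it requires exactly one non-pipe character between `>` and the end of the string, so `'>>'` matches while a lone `'>'` does not, which differs from what the comment describes.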

src/tests/regression.py

Lines changed: 1 addition & 0 deletions
@@ -17,6 +17,7 @@
 import re
 import sys
 import unittest
+import readability.urlfetch
 import yaml
 
 from lxml.html import builder as B

test_data/basic-multi-page-3.html

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
+<html>
+<head>
+<title>A Simple Multi-Page Article For Testing : Page 3</title>
+</head>
+<body>
+<h1>A Simple Multi-Page Article For Testing : Page 3</h1>
+<p>
+Nullam laoreet, nibh non faucibus dictum, tellus libero varius
+erat, lobortis varius est massa quis metus. Donec vitae justo
+lacus, nec convallis metus. Suspendisse potenti. Nunc et rutrum
+justo. Maecenas ultrices ipsum in magna fermentum eleifend. Fusce
+sagittis pretium aliquam. Vestibulum et gravida lorem. Sed turpis
+quam, placerat ac ultrices eu, tempor sit amet elit. Curabitur eu
+imperdiet velit. Quisque pharetra ornare nunc, a volutpat metus
+aliquam quis. Vivamus semper aliquam cursus. Nullam ac nibh nulla,
+luctus pharetra nunc. Etiam ut sapien sem. Fusce vehicula, sem sit
+amet viverra pretium, magna tortor suscipit nisi, id interdum lorem
+orci in tellus. Vivamus vel ipsum eros. Fusce porttitor convallis
+ultricies. Etiam in risus diam, viverra suscipit felis. Duis vitae
+imperdiet est.
+</p>
+<p>
+Nunc nunc magna, facilisis blandit venenatis ut, scelerisque ac
+tortor. Cras condimentum fermentum lectus ac convallis. Suspendisse
+cursus, lacus sit amet sodales molestie, dui erat varius velit, non
+tincidunt metus dui sed nulla. Aliquam lacus orci, convallis ut
+pellentesque ac, molestie et dolor. Ut pretium enim ut nunc auctor
+eget placerat magna luctus. Duis mollis ligula a orci ultrices in
+facilisis felis feugiat. Morbi eget odio eget erat pulvinar
+placerat sed nec erat. Duis dignissim, dolor a lacinia commodo,
+metus erat laoreet dui, in lacinia felis lacus vitae nulla. Fusce
+imperdiet condimentum volutpat. Vivamus ut lacus a eros cursus
+scelerisque non sit amet orci. Phasellus id quam odio. Nulla
+adipiscing venenatis lorem nec feugiat. Aenean sit amet nisl odio,
+tincidunt scelerisque nisl. Curabitur ut nisl a dui facilisis
+vulputate. Mauris eu elit et felis hendrerit blandit. Cras magna
+dolor, imperdiet eget rutrum tempus, euismod nec augue.
+</p>
+<p>
+Ut in sem sit amet felis scelerisque elementum. Suspendisse vitae
+neque magna, in laoreet felis. Aenean elit ligula, tempor in
+vestibulum ac, porttitor nec lacus. Aenean urna mi, dictum feugiat
+placerat eget, congue nec dolor. Etiam pellentesque dictum nulla id
+vulputate. Etiam sit amet vehicula purus. Integer quis mi nisl,
+gravida malesuada enim. Donec malesuada felis nisi. Etiam id magna
+a libero pulvinar ullamcorper in nec neque. Duis pulvinar massa nec
+magna scelerisque vitae vulputate ipsum luctus.
+</p>
+<ul id="pageNumbers">
+<li> 1 </li>
+<li>
+<a title="Page 1" href="/article.html">1</a>
+</li>
+<li>
+<a title="Page 2" href="/article.html?pagewanted=2">2</a>
+</li>
+</ul>
+</body>
+</html>
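
To show how the `page` pattern from `REGEXES` relates to this fixture, here is a minimal sketch that collects the links inside the pagination list. It assumes Python 3 and uses the standard-library `html.parser` instead of lxml. The names `PageLinkCollector` and `collect_page_links` are hypothetical, and this is not the project's actual multi-page detection logic.

```python
import re
from html.parser import HTMLParser

# Copied from REGEXES['page'] in readability.py.
PAGE_RE = re.compile(r'pag(e|ing|inat)', re.I)

# The pagination list from basic-multi-page-3.html.
PAGINATION = '''
<ul id="pageNumbers">
<li> 1 </li>
<li>
<a title="Page 1" href="/article.html">1</a>
</li>
<li>
<a title="Page 2" href="/article.html?pagewanted=2">2</a>
</li>
</ul>
'''


class PageLinkCollector(HTMLParser):
    """Collect (href, title) for anchors inside a <ul> whose id matches PAGE_RE."""

    def __init__(self):
        super().__init__()
        self.in_paging_list = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'ul' and PAGE_RE.search(attrs.get('id') or ''):
            # The list id (here "pageNumbers") looks like pagination.
            self.in_paging_list = True
        elif tag == 'a' and self.in_paging_list and attrs.get('href'):
            self.links.append((attrs['href'], attrs.get('title') or ''))

    def handle_endtag(self, tag):
        if tag == 'ul':
            self.in_paging_list = False


def collect_page_links(html):
    """Return the (href, title) pairs found in pagination lists."""
    parser = PageLinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == '__main__':
    for href, title in collect_page_links(PAGINATION):
        print(href, title)
```

Run against the fixture's list, the collector returns the two links to pages 1 and 2; the unlinked `<li> 1 </li>` entry is skipped because it contains no anchor.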

0 commit comments
