Trim many repeated spaces to make clean() faster · kingking888/python-readability@747c46a · GitHub
Commit 747c46a

Trim many repeated spaces to make clean() faster
When Readability encounters a long run of repeated whitespace, the cleanup regexes in clean() take forever to run, so cap each run of whitespace at 255 characters. Additionally, test the extraction performance with "timeout_decorator".
1 parent 8235f07 commit 747c46a

3 files changed: 19 additions, 0 deletions

readability/readability.py

Lines changed: 3 additions & 0 deletions
@@ -54,6 +54,9 @@ def to_int(x):


 def clean(text):
+    # Many spaces make the following regexes run forever
+    text = re.sub(r'\s{255,}', ' ' * 255, text)
+
     text = re.sub('\s*\n\s*', '\n', text)
     text = re.sub('\t|[ \t]{2,}', ' ', text)
     return text.strip()
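
To make the effect of the new line concrete, here is a minimal standalone sketch of the patched clean(), assuming only the standard `re` module. The names `MAX_RUN` and `collapse_whitespace` are introduced here for illustration and are not in the commit; the regexes are the ones from the diff, written as raw strings.

```python
import re

MAX_RUN = 255  # cap taken from the commit; the constant name is hypothetical


def collapse_whitespace(text):
    """The two substitutions that clean() already had before this commit."""
    text = re.sub(r'\s*\n\s*', '\n', text)
    text = re.sub(r'\t|[ \t]{2,}', ' ', text)
    return text.strip()


def clean(text):
    """clean() as patched: cap long whitespace runs first, then collapse."""
    text = re.sub(r'\s{%d,}' % MAX_RUN, ' ' * MAX_RUN, text)
    return collapse_whitespace(text)


short = clean('a \n\n  b\tc')                         # ordinary input, cap never triggers
long_spaces = clean('foo' + ' ' * 1_000_000 + 'bar')  # the case the commit targets
long_newlines = clean('a' + '\n' * 300 + 'b')         # long run that contains newlines
uncapped_newlines = collapse_whitespace('a' + '\n' * 300 + 'b')

print(repr(short), repr(long_spaces), repr(long_newlines), repr(uncapped_newlines))
```

One thing the sketch brings out: because `\s` also matches newlines, a run of 255 or more whitespace characters that contains a newline is replaced by spaces and ends up as a single space, whereas the unpatched substitutions would have reduced it to a newline. For runs made only of spaces, as in the commit's test, the result is unchanged.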

setup.py

Lines changed: 4 additions & 0 deletions
@@ -28,6 +28,10 @@
         lxml_requirement,
         "cssselect"
     ],
+    tests_require=[
+        # Test timeouts
+        "timeout_decorator",
+    ],
     classifiers=[
         "Environment :: Web Environment",
         "Intended Audience :: Developers",

tests/test_article_only.py

Lines changed: 12 additions & 0 deletions
@@ -2,6 +2,7 @@
 import unittest

 from readability import Document
+import timeout_decorator


 SAMPLES = os.path.join(os.path.dirname(__file__), 'samples')
@@ -92,3 +93,14 @@ def test_correct_cleanup(self):
         assert('punctuation' in s)
         assert(not 'comment' in s)
         assert(not 'aside' in s)
+
+    # Many spaces make some regexes run forever
+    @timeout_decorator.timeout(seconds=3, use_signals=False)
+    def test_many_repeated_spaces(self):
+        long_space = ' ' * 1000000
+        sample = '<html><body><p>foo' + long_space + '</p></body></html>'
+
+        doc = Document(sample)
+        s = doc.summary()
+
+        assert 'foo' in s
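
For readers without lxml or timeout_decorator installed, the same idea can be checked at the regex level with only the standard library. This is a rough sketch, not the commit's test: it copies the patched clean() from the diff, and the helper `elapsed_seconds` and the constant `BUDGET_SECONDS` are hypothetical names. It only measures elapsed time after the call returns, so unlike a real timeout it cannot interrupt a call that hangs.

```python
import re
import time

BUDGET_SECONDS = 3  # same budget as the commit's test; the name is hypothetical


def clean(text):
    """Standalone copy of the patched clean() from readability.py."""
    text = re.sub(r'\s{255,}', ' ' * 255, text)
    text = re.sub(r'\s*\n\s*', '\n', text)
    text = re.sub(r'\t|[ \t]{2,}', ' ', text)
    return text.strip()


def elapsed_seconds(fn, *args):
    """Call fn(*args) and return (result, wall-clock seconds taken)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Same shape of input as test_many_repeated_spaces, minus the HTML wrapper.
sample = 'foo' + ' ' * 1_000_000
result, seconds = elapsed_seconds(clean, sample)

assert result == 'foo'
assert seconds < BUDGET_SECONDS, 'clean() exceeded the time budget'
```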
