dreadful performance in difflib: ndiff and HtmlDiff · Issue #51180 · python/cpython · GitHub
dreadful performance in difflib: ndiff and HtmlDiff #51180


Open
heidarrafn mannequin opened this issue Sep 17, 2009 · 14 comments
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@heidarrafn (mannequin) commented Sep 17, 2009
BPO 6931
Nosy @tim-one, @terryjreedy, @jackdied, @merwok, @funkyHat
Files
  • python.difflib.bug.tgz: a gzipped tar with 6 test files - see comment above
  • 11740.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2009-09-17.14:54:57.533>
    labels = ['type-feature', 'library']
    title = 'dreadful performance in difflib: ndiff and HtmlDiff'
    updated_at = <Date 2017-05-06.22:48:36.280>
    user = 'https://bugs.python.org/heidarrafn'

    bugs.python.org fields:

    activity = <Date 2017-05-06.22:48:36.280>
    actor = 'gregory.p.smith'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2009-09-17.14:54:57.533>
    creator = 'heidar.rafn'
    dependencies = []
    files = ['14911', '21588']
    hgrepos = []
    issue_num = 6931
    keywords = ['patch']
    message_count = 14.0
    messages = ['92768', '99883', '102345', '112910', '132797', '133329', '133330', '133339', '133340', '139891', '222588', '222589', '222590', '223459']
    nosy_count = 10.0
    nosy_names = ['tim.peters', 'terry.reedy', 'jackdied', 'eric.araujo', 'gruszczy', 'heidar.rafn', 'neologix', 'zitrax', 'funkyhat', 'Christopher.Cabanne']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = None
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue6931'
    versions = ['Python 2.7', 'Python 3.4', 'Python 3.5']

@heidarrafn (mannequin, author) commented Sep 17, 2009

    Relatively small set of lines with differences in most lines can destroy
    the performance of difflib.HtmlDiff.make_table and difflib.ndiff.
    I am using it like this:
    ...
    htmldiffer = HtmlDiff()
    return htmldiffer.make_table(src_lines, dst_lines,
                                 fromdesc="file1",
                                 todesc="file2",
                                 context=True)

    I have written the src_lines and dst_lines to files and tried this with
    the Tools/scripts/diff.py wrapper, with the same results when using the
    switches -m or -n.
    The performance is fine when using difflib.unified_diff, or switch -u on
    diff.py.

    Attached are files that show this clearly.
    left200.txt, right200.txt - 200 lines of text - duration 11 sec
    left500.txt, right500.txt - 500 lines of text - duration 2 min 58 sec
    left1000.txt, right1000.txt - 1000 lines of text - duration 29 min 4 sec

    Tested on an Intel dual-core T2500 2 GHz with 2 GB of memory, Python 2.5.2 on
    Ubuntu. Same problem with Python 2.6 on Fedora 11.
    For reference, the kdiff3 utility performs beautifully on these files.

    @heidarrafn (mannequin) added the performance and stdlib labels Sep 17, 2009
    @heidarrafn (mannequin) changed the title from "awful performance in difflib: ndiff and HtmlDiff" to "dreadful performance in difflib: ndiff and HtmlDiff" Sep 17, 2009
@jackdied (contributor) commented:

    Here is a profile run of the 200 line case, run on the 3.x trunk, and with all the trivial functions removed. quick_ratio and __contains__ dominate the time. The process was CPU bound, memory usage stayed low.

    17083155 function calls (17066360 primitive calls) in 248.779 CPU seconds

    Ordered by: standard name

    ncalls tottime percall cumtime percall filename:lineno(function)
    10660375 70.516 0.000 70.516 0.000 :0(__contains__)
    114482 1.048 0.000 1.048 0.000 :0(append)
    4434776 29.254 0.000 29.254 0.000 :0(get)
    685047/685042 4.400 0.000 4.400 0.000 :0(len)
    197349 1.512 0.000 1.512 0.000 :0(min)
    197133 1.452 0.000 1.452 0.000 difflib.py:228(set_seq1)
    2500 1.632 0.001 3.092 0.001 difflib.py:299(__chain_b)
    1082 2.520 0.002 4.768 0.004 difflib.py:350(find_longest_match)
    339143 2.580 0.000 2.580 0.000 difflib.py:40(_calculate_ratio)
    141727 120.599 0.001 220.970 0.002 difflib.py:661(quick_ratio)
    196736 6.956 0.000 12.549 0.000 difflib.py:690(real_quick_ratio)
    8974/794 5.052 0.001 248.431 0.313 difflib.py:945(_fancy_replace)

@neologix (mannequin) commented Apr 4, 2010

    This is because difflib.ndiff (called by difflib.HtmlDiff.make_table), unlike difflib.unified_diff (and probably kdiff3), does not restrict itself to matching whole lines: it also searches for similar lines and for differences inside lines, so its complexity is much worse (see how many times _fancy_replace and quick_ratio are called).
    It might be a good idea to let the user specify which type of diff is needed (ndiff vs unified_diff).
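
    To make the contrast concrete, here is a small sketch (the input lines are invented for illustration, not taken from the attached files) showing that both APIs accept the same line lists, but only ndiff hunts for similar line pairs and marks intraline changes with "?" guide lines:

    ```python
    import difflib

    # Illustrative inputs: each pair of lines is similar but not identical,
    # which is exactly the case where ndiff does extra (expensive) work.
    a = ["value %d = alpha\n" % i for i in range(3)]
    b = ["value %d = alphb\n" % i for i in range(3)]

    # unified_diff only matches identical whole lines (cheap).
    u = list(difflib.unified_diff(a, b, fromfile="file1", tofile="file2"))

    # ndiff also searches for the most similar line pairs and emits
    # "?" guide lines for intraline changes (expensive on near-matches).
    n = list(difflib.ndiff(a, b))
    print("".join(n))
    ```

    On inputs like these, every line lands in one replace block, and ndiff's _fancy_replace compares every line pair looking for the best match.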

@terryjreedy (member) commented:

    An API addition makes this a feature request.

    Heiðar, please make a copy of difflib.py, hand-patch HtmlDiff to use unified_diff instead of ndiff, and see if that solves the problem.

    @terryjreedy added the type-feature label and removed the performance label Aug 4, 2010
@gruszczy (mannequin) commented Apr 2, 2011

    Check also this:

    http://bugs.python.org/issue11740

@neologix (mannequin) commented Apr 8, 2011

    Check also this:

    http://bugs.python.org/issue11740

    You should mark it as a duplicate.

@gruszczy (mannequin) commented Apr 8, 2011

    I have no idea how I should do this. If you explain to me how it should be done, I'll be happy to do it from now on :-)

@terryjreedy (member) commented:

    Filip, you can look at the changes Benjamin made. One needs to be the OP or have admin privileges to change the headers.

    Could you reupload your patch to this issue?

@gruszczy (mannequin) commented Apr 8, 2011

    Here it is.

@merwok (member) commented Jul 5, 2011

    The patch by Filip does not add new features, so I’m adjusting versions.

    I cannot review the patch only by reading it, but if someone gives me a timeit command I can post a benchmark for my Debian machine.

@ChristopherCabanne (mannequin) commented Jul 8, 2014

    I ran into the slowness of difflib.HtmlDiff.make_table yesterday, and see this issue is three years stale now; is there any update or chance that the fix will be applied?

@ChristopherCabanne (mannequin) commented Jul 8, 2014

    Here is a timeit command to test with the attached test files:

    python -m timeit -n 10 "import difflib; difflib.HtmlDiff().make_table(open('left500.txt').readlines(), open('righ500.txt').readlines())"

@ChristopherCabanne (mannequin) commented Jul 8, 2014

    Sorry about the typo; here is the corrected timeit command:

    python -m timeit -n 10 "import difflib; difflib.HtmlDiff().make_table(open('left500.txt').readlines(), open('right500.txt').readlines())"
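
    For readers without the attached files, an equivalent self-contained benchmark can be sketched with synthetic inputs (sizes chosen small so it finishes quickly; the line contents are made up, but each pair is similar-yet-unequal, the worst case described in this thread):

    ```python
    import difflib
    import timeit

    # Synthetic stand-ins for the attached test files: every line pair is
    # similar but never identical, the worst case for ndiff's _fancy_replace.
    left = ["item %04d value %d\n" % (i, i * 7) for i in range(100)]
    right = ["item %04d value %d\n" % (i, i * 7 + 1) for i in range(100)]

    t_unified = timeit.timeit(lambda: list(difflib.unified_diff(left, right)), number=3)
    t_ndiff = timeit.timeit(lambda: list(difflib.ndiff(left, right)), number=3)
    print("unified_diff: %.3fs  ndiff: %.3fs" % (t_unified, t_ndiff))
    ```

    Growing the line count shows the gap widening, consistent with the timings reported in the original message.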

    @gpshead gpshead self-assigned this Jul 9, 2014
@tim-one (member) commented Jul 19, 2014

    I'm sympathetic, but I don't see a good solution here short of backward-incompatible changes.

    ndiff was built to generate "the highest quality diff possible", for text written and edited by humans, where "quality" is measured by human judgment (not an abstract mathematical measure). That was its goal, and I think it does well at that. It wasn't intended to unwind mechanical changes made to machine-generated gibberish (to human eyes) text.

    Not that such a thing has no value, but it was never ndiff's _intent_ to handle such stuff. So the best solution here would be to use a different differencing engine.

    As already noted, unified_diff skips any attempt to find "merely similar" lines, and that's the great source of expense here: ndiff takes time essentially cubic in the number of lines, given inputs with a great many similar (but not identical) lines. It was noted that kdiff3 does fine on these inputs very quickly, but, from the kdiff3 FAQ: "If similar lines appear next to each other, this actually is coincidence but this fortunately is often the case." It just so happens that in the inputs attached to this report, the first lines in each file pair are similar, and the second lines likewise, etc etc. This allows kdiff3 to do a wonderful job without any time at all spent _trying_ to find similar lines.

    But there's nothing coincidental here in what ndiff does: it finds "_the_ most similar pair", and that's a quadratic-time operation (which turns into cubic time when repeated over & over).

    I didn't write any of the HTML-generating functions, nor have I ever used them, so I'm not in a good position to judge what they "should do". If similar lines aren't interesting in this context, then switching to unified_diff would save gobs of time. If similar lines are interesting, then new code would be required to get at _some_ notion of that less expensively than ndiff does it (e.g., the similarity scores for all pairs could be computed once and saved away, reducing worst-case cubic time to worst-case quadratic time, but at the cost of needing additional memory quadratic in the number of lines).
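
    Tim's "compute the scores once" idea could be sketched roughly as follows. Note that pairwise_ratios is a made-up helper, not part of difflib, and a real fix would also have to thread the cached scores through Differ._fancy_replace; this only illustrates the quadratic-time, quadratic-memory trade he describes:

    ```python
    import difflib

    def pairwise_ratios(a_lines, b_lines):
        # Score every (a, b) line pair exactly once: quadratic time and
        # quadratic memory, instead of recomputing the scores inside each
        # recursive _fancy_replace call (worst-case cubic time).
        sm = difflib.SequenceMatcher()
        ratios = {}
        for j, b in enumerate(b_lines):
            sm.set_seq2(b)  # SequenceMatcher caches preprocessing of seq2
            for i, a in enumerate(a_lines):
                sm.set_seq1(a)
                ratios[i, j] = sm.quick_ratio()
        return ratios

    scores = pairwise_ratios(["abcd\n", "efgh\n"], ["abcx\n", "efgy\n"])
    ```

    quick_ratio is an upper bound on the true similarity, which is why Differ uses it as a cheap filter before computing the exact ratio.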

    I'm -1 on 11740.patch: there is no reason, in general, to give up looking for the most similar pair - and give up on intraline differencing - simply because the total number of lines in the two blocks exceeds 99. It would reduce diff quality for all uses of the Differ class, not just for the specific uses of Differ made indirectly (via ndiff) by the make_table function - and it would be backward-incompatible anyway.
