dreadful performance in difflib: ndiff and HtmlDiff #51180
Comments
A relatively small set of lines, with differences in most of them, can destroy the performance of difflib's ndiff and HtmlDiff. I have written the src_lines and dst_lines to files; the attached files show this clearly. Tested on an Intel dual-core T2500 2 GHz with 2 GB of memory, Python 2.5.2.
Here is a profile run of the 200-line case, run on the 3.x trunk, with all the trivial functions removed. quick_ratio and __contains__ dominate the time. The process was CPU-bound; memory usage stayed low.

    1708315 function calls (17066360 primitive calls) in 248.779 CPU seconds
    Ordered by: standard name
    ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
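For reference, a profile like the one above can be reproduced with cProfile. This is a minimal sketch, not taken from the original report; the file names are placeholders for the attached test inputs:

    import cProfile
    import difflib

    # Placeholder file names; substitute the attached test files.
    with open('left.txt') as f:
        src_lines = f.readlines()
    with open('right.txt') as f:
        dst_lines = f.readlines()

    profiler = cProfile.Profile()
    profiler.runcall(difflib.HtmlDiff().make_table, src_lines, dst_lines)
    profiler.print_stats(sort='cumulative')   # quick_ratio and _fancy_replace dominate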
This is because difflib.ndiff (called by difflib.HtmlDiff.make_table), unlike difflib.unified_diff (and probably kdiff3), doesn't restrict itself to matching contiguous identical lines and also searches for differences within lines, so the complexity is much worse (see how many times _fancy_replace and quick_ratio are called).
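To see that difference in practice, the two entry points can be timed side by side. A minimal sketch, assuming the attached files are used as input (the file names here are placeholders):

    import difflib
    import time

    with open('left.txt') as f:
        a = f.readlines()
    with open('right.txt') as f:
        b = f.readlines()

    start = time.perf_counter()
    list(difflib.unified_diff(a, b))   # matches identical lines only
    print('unified_diff:', time.perf_counter() - start)

    start = time.perf_counter()
    list(difflib.ndiff(a, b))          # also scores similar lines and intraline changes
    print('ndiff:', time.perf_counter() - start)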
An API addition makes this a feature request. Heiðar, please make a copy of difflib.py, hand-patch HtmlDiff to use unified_diff instead of ndiff, and see if that solves the problem.
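One hedged, untested way to try that experiment without editing the stdlib copy is to shadow the ndiff that HtmlDiff's helper uses with a matcher that, like unified_diff, only pairs identical lines. This relies on the internal detail that the helper calls the module-level ndiff, so treat it purely as a sketch:

    import difflib

    with open('left.txt') as f:          # placeholder names for the attached files
        src_lines = f.readlines()
    with open('right.txt') as f:
        dst_lines = f.readlines()

    def plain_ndiff(a, b, linejunk=None, charjunk=None):
        # Emit ndiff-style '  '/'- '/'+ ' prefixes, but with no similar-line
        # search and no intraline ('?') hints, similar to unified_diff.
        matcher = difflib.SequenceMatcher(None, a, b)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == 'equal':
                for line in a[i1:i2]:
                    yield '  ' + line
            else:
                for line in a[i1:i2]:
                    yield '- ' + line
                for line in b[j1:j2]:
                    yield '+ ' + line

    difflib.ndiff = plain_ndiff           # the "hand patch", applied at runtime
    table = difflib.HtmlDiff().make_table(src_lines, dst_lines)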
Check also this:
You should mark it as a duplicate.
I have no idea how I should do this. If you explain to me how it should be done, I'll be happy to do this from now on :-)
Filip, you can look at the changes Benjamin made. One needs to be the OP or have admin privileges to change the headers. Could you re-upload your patch to this issue?
Here it is.
The patch by Filip does not add new features, so I'm adjusting the versions. I cannot review the patch just by reading it, but if someone gives me a timeit command I can post a benchmark for my Debian machine.
I ran into the slowness of difflib.HtmlDiff.make_table yesterday, and see this issue is three years stale now; is there any update or chance that the fix will be applied?
Here is a timeit command to test with the attached test files:

    python -m timeit -n 10 "import difflib; difflib.HtmlDiff().make_table(open('left500.txt').readlines(), open('righ500.txt').readlines())"

Sorry about the typo; here is the corrected timeit command:

    python -m timeit -n 10 "import difflib; difflib.HtmlDiff().make_table(open('left500.txt').readlines(), open('right500.txt').readlines())"
I'm sympathetic, but I don't see a good solution here without using incompatible code.

ndiff was built to generate "the highest quality diff possible", for text written and edited by humans, where "quality" is measured by human judgment (not an abstract mathematical measure). That was its goal, and I think it does well at that. It wasn't intended to unwind mechanical changes made to machine-generated gibberish (to human eyes) text. Not that such a thing has no value, but it was never ndiff's _intent_ to handle such stuff.

So the best solution here would be to use a different differencing engine. As already noted, unified_diff skips any attempt to find "merely similar" lines, and that's the great source of expense here: ndiff takes time essentially cubic in the number of lines, given inputs with a great many similar (but no identical) lines.

It was noted that kdiff3 does fine on these inputs very quickly, but, from the kdiff3 FAQ: "If similar lines appear next to each other, this actually is coincidence but this fortunately is often the case." It just so happens that in the inputs attached to this report, the first lines in each file pair are similar, and the second lines likewise, etc etc. This allows kdiff3 to do a wonderful job without any time at all spent _trying_ to find similar lines. But there's nothing coincidental here in what ndiff does: it finds "_the_ most similar pair", and that's a quadratic-time operation (which turns into cubic time when repeated over & over).

I didn't write any of the HTML-generating functions, nor have I ever used them, so I'm not in a good position to judge what they "should do". If similar lines aren't interesting in this context, then switching to unified_diff would save gobs of time. If similar lines are interesting, then new code would be required to get at _some_ notion of that less expensively than ndiff does it (e.g., the similarity scores for all pairs could be computed once and saved away, reducing worst-case cubic time to worst-case quadratic time, but at the cost of needing additional memory quadratic in the number of lines).

I'm -1 on 11740.patch: there is no reason, in general, to give up looking for the most similar pair - and give up on intraline differencing - simply because the total number of lines in the two blocks exceeds 99. It would reduce diff quality for all uses of the Differ class, not just for the specific uses of Differ made indirectly (via ndiff) by the make_table function - and it would be backward-incompatible anyway.
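As a rough illustration of the trade-off described above (not an actual patch), the pairwise similarity scores could be computed once up front; this is quadratic in both time and memory, and every name in the sketch is a placeholder:

    import difflib

    def pairwise_scores(a_lines, b_lines):
        # Score every (a, b) line pair exactly once instead of re-scoring
        # pairs inside each replace block, trading memory for time.
        scores = {}
        matcher = difflib.SequenceMatcher(None)
        for j, b in enumerate(b_lines):
            matcher.set_seq2(b)        # SequenceMatcher caches info about seq2
            for i, a in enumerate(a_lines):
                matcher.set_seq1(a)
                scores[i, j] = matcher.quick_ratio()
        return scores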