dreadful performance in difflib: ndiff and HtmlDiff · Issue #51180 · python/cpython · GitHub
dreadful performance in difflib: ndiff and HtmlDiff #51180


Open
heidarrafn mannequin opened this issue Sep 17, 2009 · 14 comments
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@heidarrafn (mannequin) commented Sep 17, 2009
BPO 6931
Nosy @tim-one, @terryjreedy, @jackdied, @merwok, @funkyHat
Files
  • python.difflib.bug.tgz: a gzipped tar with 6 test files - see comment above
  • 11740.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2009-09-17.14:54:57.533>
    labels = ['type-feature', 'library']
    title = 'dreadful performance in difflib: ndiff and HtmlDiff'
    updated_at = <Date 2017-05-06.22:48:36.280>
    user = 'https://bugs.python.org/heidarrafn'

    bugs.python.org fields:

    activity = <Date 2017-05-06.22:48:36.280>
    actor = 'gregory.p.smith'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2009-09-17.14:54:57.533>
    creator = 'heidar.rafn'
    dependencies = []
    files = ['14911', '21588']
    hgrepos = []
    issue_num = 6931
    keywords = ['patch']
    message_count = 14.0
    messages = ['92768', '99883', '102345', '112910', '132797', '133329', '133330', '133339', '133340', '139891', '222588', '222589', '222590', '223459']
    nosy_count = 10.0
    nosy_names = ['tim.peters', 'terry.reedy', 'jackdied', 'eric.araujo', 'gruszczy', 'heidar.rafn', 'neologix', 'zitrax', 'funkyhat', 'Christopher.Cabanne']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = None
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue6931'
    versions = ['Python 2.7', 'Python 3.4', 'Python 3.5']

@heidarrafn (mannequin, author) commented Sep 17, 2009

    Relatively small set of lines with differences in most lines can destroy
    the performance of difflib.HtmlDiff.make_table and difflib.ndiff.
    I am using it like this:
    ...
    htmldiffer = HtmlDiff()
    return htmldiffer.make_table(src_lines, dst_lines,
                                 fromdesc="file1",
                                 todesc="file2",
                                 context=True)

    I have written the src_lines and dst_lines to files and tried this with
    the Tools/scripts/diff.py wrapper, with the same results when using the
    switches -m or -n.
    The performance is fine when using difflib.unified_diff, or switch -u on
    diff.py.

    Attached are files that show this clearly.
    left200.txt, right200.txt - 200 lines of text - duration 11 sec
    left500.txt, right500.txt - 500 lines of text - duration 2 min 58 sec
    left1000.txt, right1000.txt - 1000 lines of text - duration 29 min 4 sec

    Tested on an Intel dual-core T2500 2 GHz with 2 GB of memory, Python 2.5.2 on
    Ubuntu. Same problem with Python 2.6 on Fedora 11.
    For reference, the kdiff3 utility performs beautifully on these files.

    @heidarrafn (mannequin) added the performance and stdlib labels Sep 17, 2009
    @heidarrafn (mannequin) changed the title from "awful performance in difflib: ndiff and HtmlDiff" to "dreadful performance in difflib: ndiff and HtmlDiff" Sep 17, 2009
@jackdied (contributor) commented:

    Here is a profile run of the 200 line case, run on the 3.x trunk, and with all the trivial functions removed. quick_ratio and __contains__ dominate the time. The process was CPU bound, memory usage stayed low.

    17083155 function calls (17066360 primitive calls) in 248.779 CPU seconds

    Ordered by: standard name

    ncalls tottime percall cumtime percall filename:lineno(function)
    10660375 70.516 0.000 70.516 0.000 :0(__contains__)
    114482 1.048 0.000 1.048 0.000 :0(append)
    4434776 29.254 0.000 29.254 0.000 :0(get)
    685047/685042 4.400 0.000 4.400 0.000 :0(len)
    197349 1.512 0.000 1.512 0.000 :0(min)
    197133 1.452 0.000 1.452 0.000 difflib.py:228(set_seq1)
    2500 1.632 0.001 3.092 0.001 difflib.py:299(__chain_b)
    1082 2.520 0.002 4.768 0.004 difflib.py:350(find_longest_match)
    339143 2.580 0.000 2.580 0.000 difflib.py:40(_calculate_ratio)
    141727 120.599 0.001 220.970 0.002 difflib.py:661(quick_ratio)
    196736 6.956 0.000 12.549 0.000 difflib.py:690(real_quick_ratio)
    8974/794 5.052 0.001 248.431 0.313 difflib.py:945(_fancy_replace)

@neologix (mannequin) commented Apr 4, 2010

    This is because difflib.ndiff (called by difflib.HtmlDiff.make_table), unlike difflib.unified_diff (and probably kdiff3), does not restrict itself to matching whole lines: it also searches for similar lines and for differences inside lines, so its complexity is much worse (see how many times _fancy_replace and quick_ratio are called).
    It might be a good idea to let the user specify which type of diff is needed (ndiff vs unified_diff).
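
    To make the contrast concrete, here is a small sketch (the input lines are invented for illustration, not taken from the attached files) showing that both APIs accept the same line lists, but only ndiff hunts for similar line pairs and marks intraline changes with "?" guide lines:

    ```python
    import difflib

    # Illustrative inputs: each pair of lines is similar but not identical,
    # which is exactly the case where ndiff does extra (expensive) work.
    a = ["value %d = alpha\n" % i for i in range(3)]
    b = ["value %d = alphb\n" % i for i in range(3)]

    # unified_diff only matches identical whole lines (cheap).
    u = list(difflib.unified_diff(a, b, fromfile="file1", tofile="file2"))

    # ndiff also searches for the most similar line pairs and emits
    # "?" guide lines for intraline changes (expensive on near-matches).
    n = list(difflib.ndiff(a, b))
    print("".join(n))
    ```

    On inputs like these, every line lands in one replace block, and ndiff's _fancy_replace compares every line pair looking for the best match.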

@terryjreedy (member) commented:

    An API addition makes this a feature request.

    Heiðar, please make a copy of difflib.py, hand-patch HtmlDiff to use unified_diff instead of ndiff, and see if that solves the problem.

    @terryjreedy added the type-feature label and removed the performance label Aug 4, 2010
@gruszczy (mannequin) commented Apr 2, 2011

    Check also this:

    http://bugs.python.org/issue11740

@neologix (mannequin) commented Apr 8, 2011

    Check also this:

    http://bugs.python.org/issue11740

    You should mark it as a duplicate.

@gruszczy (mannequin) commented Apr 8, 2011

    I have no idea how I should do this. If you explain to me how it should be done, I'll be happy to do it from now on :-)

@terryjreedy (member) commented:

    Filip, you can look at the changes Benjamin made. One needs to be the OP or have admin privileges to change the headers.

    Could you reupload your patch to this issue?

@gruszczy (mannequin) commented Apr 8, 2011

    Here it is.

@merwok (member) commented Jul 5, 2011

    The patch by Filip does not add new features, so I’m adjusting versions.

    I cannot review the patch only by reading it, but if someone gives me a timeit command I can post a benchmark for my Debian machine.

@ChristopherCabanne (mannequin) commented Jul 8, 2014

    I ran into the slowness of difflib.HtmlDiff.make_table yesterday, and see this issue is three years stale now; is there any update or chance that the fix will be applied?

@ChristopherCabanne (mannequin) commented Jul 8, 2014

    Here is a timeit command to test with the attached test files:

    python -m timeit -n 10 "import difflib; difflib.HtmlDiff().make_table(open('left500.txt').readlines(), open('righ500.txt').readlines())"

@ChristopherCabanne (mannequin) commented Jul 8, 2014

    Sorry about the typo; here is the corrected timeit command:

    python -m timeit -n 10 "import difflib; difflib.HtmlDiff().make_table(open('left500.txt').readlines(), open('right500.txt').readlines())"
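
    For readers without the attached files, an equivalent self-contained benchmark can be sketched with synthetic inputs (sizes chosen small so it finishes quickly; the line contents are made up, but each pair is similar-yet-unequal, the worst case described in this thread):

    ```python
    import difflib
    import timeit

    # Synthetic stand-ins for the attached test files: every line pair is
    # similar but never identical, the worst case for ndiff's _fancy_replace.
    left = ["item %04d value %d\n" % (i, i * 7) for i in range(100)]
    right = ["item %04d value %d\n" % (i, i * 7 + 1) for i in range(100)]

    t_unified = timeit.timeit(lambda: list(difflib.unified_diff(left, right)), number=3)
    t_ndiff = timeit.timeit(lambda: list(difflib.ndiff(left, right)), number=3)
    print("unified_diff: %.3fs  ndiff: %.3fs" % (t_unified, t_ndiff))
    ```

    Growing the line count shows the gap widening, consistent with the timings reported in the original message.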

    @gpshead gpshead self-assigned this Jul 9, 2014
@tim-one (member) commented Jul 19, 2014

    I'm sympathetic, but I don't see a good solution here short of backward-incompatible changes.

    ndiff was built to generate "the highest quality diff possible", for text written and edited by humans, where "quality" is measured by human judgment (not an abstract mathematical measure). That was its goal, and I think it does well at that. It wasn't intended to unwind mechanical changes made to machine-generated gibberish (to human eyes) text.

    Not that such a thing has no value, but it was never ndiff's _intent_ to handle such stuff. So the best solution here would be to use a different differencing engine.

    As already noted, unified_diff skips any attempt to find "merely similar" lines, and that's the great source of expense here: ndiff takes time essentially cubic in the number of lines, given inputs with a great many similar (but not identical) lines. It was noted that kdiff3 does fine on these inputs very quickly, but, from the kdiff3 FAQ: "If similar lines appear next to each other, this actually is coincidence but this fortunately is often the case." It just so happens that in the inputs attached to this report, the first lines in each file pair are similar, and the second lines likewise, etc etc. This allows kdiff3 to do a wonderful job without any time at all spent _trying_ to find similar lines.

    But there's nothing coincidental here in what ndiff does: it finds "_the_ most similar pair", and that's a quadratic-time operation (which turns into cubic time when repeated over & over).

    I didn't write any of the HTML-generating functions, nor have I ever used them, so I'm not in a good position to judge what they "should do". If similar lines aren't interesting in this context, then switching to unified_diff would save gobs of time. If similar lines are interesting, then new code would be required to get at _some_ notion of that less expensively than ndiff does it (e.g., the similarity scores for all pairs could be computed once and saved away, reducing worst-case cubic time to worst-case quadratic time, but at the cost of needing additional memory quadratic in the number of lines).
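
    Tim's "compute the scores once" idea could be sketched roughly as follows. Note that pairwise_ratios is a made-up helper, not part of difflib, and a real fix would also have to thread the cached scores through Differ._fancy_replace; this only illustrates the quadratic-time, quadratic-memory trade he describes:

    ```python
    import difflib

    def pairwise_ratios(a_lines, b_lines):
        # Score every (a, b) line pair exactly once: quadratic time and
        # quadratic memory, instead of recomputing the scores inside each
        # recursive _fancy_replace call (worst-case cubic time).
        sm = difflib.SequenceMatcher()
        ratios = {}
        for j, b in enumerate(b_lines):
            sm.set_seq2(b)  # SequenceMatcher caches preprocessing of seq2
            for i, a in enumerate(a_lines):
                sm.set_seq1(a)
                ratios[i, j] = sm.quick_ratio()
        return ratios

    scores = pairwise_ratios(["abcd\n", "efgh\n"], ["abcx\n", "efgy\n"])
    ```

    quick_ratio is an upper bound on the true similarity, which is why Differ uses it as a cheap filter before computing the exact ratio.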

    I'm -1 on 11740.patch: there is no reason, in general, to give up looking for the most similar pair - and give up on intraline differencing - simply because the total number of lines in the two blocks exceeds 99. It would reduce diff quality for all uses of the Differ class, not just for the specific uses of Differ made indirectly (via ndiff) by the make_table function - and it would be backward-incompatible anyway.
