gh-63416: Speed up assertEqual on long sequences #27434
Conversation
I'd recommend changing the PR title to something affirmative, e.g. "Speed up assertEqual on long sequences"

I will! That's definitely more descriptive, thank you.
To sum up comments:
- Please add the new variant to assertMultiLineEqual as well;
- Please adapt the test cases to be pathological (you can use my suggestion) with the original code for both strings and lists;
- Please change the test to simply execute the diff, don't measure time at all. If it becomes too slow, we'll know just by the fact it became super slow.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers, that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase `I have made the requested changes; please review again`. And if you don't make the requested changes, you will be poked with soft cushions!
@ambv before I go any further, I went ahead and made the change from https://github.com/jdevries3133/cpython/tree/bpo-19217-ideas (see bpo-19217). By the way, there is one other problematic use (Lines 1134 to 1136 in 48a6255).
Reproducing the issue is similar to the others:

```python
import unittest

class TestDictDiffSlow(unittest.TestCase):
    def test_slow_assert_equal_dict(self):
        d1 = {i: i * 2 for i in range(10000)}
        d2 = {i: i for i in range(10000)}
        self.assertDictEqual(d1, d2)

if __name__ == '__main__':
    unittest.main()
```

I'm happy to fix all the failing tests, and there are quite a few, but the change from

Before:
After:
Before:
After:
Other Options

I don't know if these are any good, but maybe there are other options to fix this bug:

Check size of input

Check size of input value before passing to
Make
Good thinking about having a threshold above which we can switch to

I guess what I'm saying is that it isn't at all clear how our threshold should be calculated. How about this:

As part of testing here you should deliberately try to come up with examples where the "25 diff lines" limit is still too big. If you can't find any after a good round of trying to break it, then maybe you can double the limit. And try breaking it then. I mean, you can spend as much or as little time on this as you wish. But I agree with you that we can tweak this into a solution that is both fast and preserves nice diffs for most use cases.
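One way to probe that "25 diff lines" style of limit (my own construction, not from the PR): keep the diff short in *lines* while making each line enormous, so a cap on the line count alone doesn't bound the amount of text or work involved:

```python
from difflib import unified_diff

# Few differing *lines*, but each line is huge: 10 lines of ~100 KB
# each.  A limit counting diff lines would happily accept this, even
# though the diff body is about 2 MB of text.
a = ["x" * 100_000 + f"-{i}" for i in range(10)]
b = ["x" * 100_000 + f"+{i}" for i in range(10)]

diff_lines = [l for l in unified_diff(a, b)
              if l.startswith(("-", "+"))
              and not l.startswith(("---", "+++"))]   # skip file headers
print(len(diff_lines))  # → 20
```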
* Now, we switch between difflib.ndiff and difflib.unified_diff based on input size.
* Threshold is sort of arbitrary, but seems to be working in the limited test cases I've written.
* The full test suite is passing, with only very minor tweaks to existing tests!
@ambv I converted this to a draft, but I do have an implementation of what we discussed now. You'll notice a few TODOs. This is why I've marked it as a draft:
I like where this is going. Looking forward to the next iteration.
And another, to show that:

```python
from difflib import ndiff
from time import perf_counter as now
import sys

sys.setrecursionlimit(2000)  # so the 512 case completes

count = 16
while True:
    ax = [f"a{i:010d}" for i in range(count)]
    bx = [s.replace("a", "b") for s in ax]
    start = now()
    ignore = list(ndiff(ax, bx))
    finish = now()
    print(count, finish - start)
    count *= 2
```
It sounds like we're basically trading one regressive performance situation for another.
Someone can always wind up unhappy. Without the ability to bail out if computation is taking too long, we should take Tim's suggestion of going to a linear "first differing element" highlight approach when inputs are "too large", followed by our generic a != b printing code.
I blindly suggest: `max(len(a), len(b)) > 500` prevents calling unified_diff, and `max(len(a), len(b)) > 200` prevents calling ndiff. Just based on the worst-case timings.
Those default values used for tuning could be TestCase attributes that people could override themselves if they want (assumed to be rare), much like maxDiff is today.
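A sketch of how such threshold-driven dispatch might look. Everything here is hypothetical — `sequence_diff`, `NDIFF_MAX`, `UDIFF_MAX`, and the fallback message are illustrative names, not unittest's actual API; the cutoffs are the ones blindly suggested above:

```python
import difflib

NDIFF_MAX = 200   # above this, ndiff is considered too risky
UDIFF_MAX = 500   # above this, even unified_diff is skipped

def sequence_diff(a, b):
    n = max(len(a), len(b))
    if n <= NDIFF_MAX:
        return "\n".join(difflib.ndiff(a, b))
    if n <= UDIFF_MAX:
        return "\n".join(difflib.unified_diff(a, b))
    # Linear "first differing element" fallback for large inputs.
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return f"first differing element {i}: {x!r} != {y!r}"
    return f"lengths differ: {len(a)} != {len(b)}"
```

As TestCase attributes, `NDIFF_MAX`/`UDIFF_MAX` could then be overridden per test class, much like `maxDiff`.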
You can get sharper estimates by cheating a bit: build your own

```python
sm = difflib.SequenceMatcher(None, a, b)
```

That much is worst-case linear time. Then

```python
b2j = sm.b2j
sum(len(b2j[line]) for line in a if line in b2j)
```

is the total number of times a unified diff will go around its innermost loop. It's cheating because

If that's "cheap enough", you can go on to take a guess at worst-case ndiff time (which is never cheaper than unified diff time). Go through

```python
("replace", i1, i2, j1, j2)
```

Then

```python
len1 = i2 - i1
len2 = j2 - j1
ouch = len1 * len2 * min(len1, len2)
```

sets

If that's "cheap enough", ndiff can be called without worry.
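Assembled into runnable form, the two estimates above might look like this. The function names are mine, and note that `b2j` is an undocumented internal attribute of `SequenceMatcher`, which is part of the "cheating":

```python
import difflib

def unified_diff_cost(a, b):
    # Worst-case linear setup.
    sm = difflib.SequenceMatcher(None, a, b)
    b2j = sm.b2j  # internal, undocumented attribute
    # Total trips around unified_diff's innermost loop.
    return sum(len(b2j[line]) for line in a if line in b2j)

def ndiff_worst_case(a, b):
    # Guess at worst-case ndiff work: sum "ouch" over replace blocks.
    sm = difflib.SequenceMatcher(None, a, b)
    total = 0
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "replace":
            len1, len2 = i2 - i1, j2 - j1
            total += len1 * len2 * min(len1, len2)
    return total
```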
I should add that my sketch of ndiff's upper bound ignores the

As a practical matter, though, under the current implementation I think that can be ignored.
Wow, thank you Tim and Gregory – it's a joy to learn from your discussion on this issue; I'm having fun!!

So, here's basically what we have on the table:

To graph it out:

So, first things first, @tim-one, do you have any interest in having Option #2 be part of the difflib API? If so, there's nothing more to do here – the path forward would be to add that feature to difflib and consume it from unittest. Otherwise, I feel like this is the path forward I'd take with this PR, should continuing appear to be the best option:

Note that I'm thinking not to implement Tim's O(n) cost calculation and "first differing element" style diff for expensive inputs. I wouldn't rule it out; I just feel like this heuristic in combination with the autojunk heuristic doesn't leave edge cases behind, but that assumption might turn out to be incorrect – it'd need to be tested more thoroughly. Tim and Gregory, what do you think?
No, I don't see folding any of this into

Far as I'm concerned, the user wrote a low-quality test if the things it's asserting are equal are so large that it's not dead obvious what to show them if the assertion fails. I'm fine with dumping a million lines of text with no indication of where a difference may be - it's what they asked for, after all 😉.

As a thoroughly practical matter, why not just use unified diff, period? It's hard to provoke into pathological behavior, provided

It seems bizarre on the face of it, too, to expect a user to make sense of the fact that there may be multiple diff formats produced, depending on things they have no understanding of.
BTW, if you want to be more extreme, I was wrong: the
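To put "why not just use unified diff, period?" in concrete terms: the same worst-case shape used in the ndiff timing loop earlier in this thread (every line unique and different) stays comfortably fast under unified_diff, whose work remains roughly linear here. A small demonstration (sizes are my own choice):

```python
from difflib import unified_diff
from time import perf_counter as now

# Every line is unique and differs between the two sides -- the shape
# that stalls ndiff -- yet unified_diff handles it in linear-ish time.
ax = [f"a{i:010d}" for i in range(20_000)]
bx = [s.replace("a", "b") for s in ax]

start = now()
out = list(unified_diff(ax, bx))
print(len(out), f"{now() - start:.3f}s")
```

The output is two file headers, one hunk header, and 40,000 diff lines, produced almost instantly.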
The only issue is (if I understand correctly) that currently, millions of lines of text are not dumped as they're generated. Rather, the program hangs – because the big diff is not streamed to stdout as it's being generated; it's just being loaded into memory. Is there a way to just blast the diff into stdout as it's being generated instead of holding it in memory? This would be a better user experience imo, because at least the user can see what is happening.

To your point, users are obviously being a bit abusive of the testing library, but it should also be assumed that assertEqual will do something in linear time other than just lock up. Maybe we can dispatch diff generation to another thread, and quit generating the diff after a timeout? I agree that it's more of an issue if the test suite is gross and convoluted, but there are a lot of gross and convoluted tests out there. I see a lot of generators in difflib, but I'm not sure if something like this is possible?

I'm sorry to backpedal to entirely new solutions, and I know this isn't really the place to discuss it, but your criticism is 100% valid and I'd like to forge a path forward... moreover, I want to slay the 9-year-old bug :(
I have tried this; it causes a lot of breakage of unittest's own test suite, which just makes it annoying, but it could be done. Hopefully asserting against failed test output is not a thing outside of unittest's own test suite....... |
Non-linear diff algorithms don't work that way. It isn't a sequential algorithm; it's trying to find a best/minimalish fit of differences. Otherwise the diff between `1 3 5 7 9` and `1 3 4 5 7 9` would be `-5 -7 -9 +4 +5 +7 +9`, done via a linear O(n) algorithm.

Sure, someone would see it when running tests locally in their terminal, but so much of testing is done on CI test automation systems these days where it's all or nothing and you don't bother to get a result or see progress - you just wait for the entire suite to give a pass or fail, and if it fails you look at its processed output and ultimately logs if needed.

"Just use unified_diff" is one solution, as Tim notes. Nothing wrong with that. It'd work (yay autojunk logic?)!

One other thing to consider is that we intentionally limit the size of printed artifacts for the base case of assertEqual via https://github.com/python/cpython/blob/main/Lib/unittest/util.py

100% agreed that people get away with writing test cases that go overboard today. But this is easy to do when writing tests that pass, as it's only on failures winding up in a poor-behavior corner case of the library that you'd even think about it being an issue. Until then, everything is extremely convenient to all involved.
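A minimal sketch of that linear "first differing element" rendering (my own illustration, not library code): compare positionally until the first mismatch, then dump both tails.

```python
def linear_diff(a, b):
    # O(n): advance past the common prefix, then everything remaining
    # on the left is a deletion and everything on the right an addition.
    i = 0
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    return [f"-{x}" for x in a[i:]] + [f"+{x}" for x in b[i:]]

print(linear_diff([1, 3, 5, 7, 9], [1, 3, 4, 5, 7, 9]))
# → ['-5', '-7', '-9', '+4', '+5', '+7', '+9']
```

This reproduces exactly the coarse delta described above: fast, but far noisier than what a minimizing diff would produce (`+4` alone).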
Ironically, difflib used to - long ago - pump out differences "left to right" as it went along. The algorithm is based on finding the longest contiguous matching substring, then recursively doing the same on the pieces "to the left" and "to the right" of that match. So there's nothing in theory to prevent delivering the deltas on the left part before the right part starts. But as the comments in

It's still the case that ndiff's "fancy replace" algorithm can blow the recursion stack, though. In any case, the

@gpshead, as to spinning off a thread to do the diff, that's why this issue report will never be closed - everyone who looks at it eventually refuses to settle for a "good enough!" partial solution 😉.
This has merge conflicts now.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers, that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase `I have made the requested changes; please review again`.
@tim-one I have not stopped thinking about this problem.
OK, so it seems pretty well established that any heuristic will only shrink the set of possible inputs that would cause a big slowdown. My question to Tim and the other core developers in this thread is: do you see any possible solution that could potentially cause this ticket to be closed?

In this category (ticket closers), I think that spawning, babysitting, and possibly aborting a worker thread is possibly the most bulletproof idea. The only thing I worry about is whether Python's execution model will cause the parent thread to be blocked while the diffing library dead-spins, causing a race condition where the child can't be aborted. I think this wouldn't be the case, but I'm sure that someone here can advise as to whether that might be a problem.

On the broad topic of heuristics, it seems clear to me that any heuristic is effectively just a band-aid that won't block the full set of problematic inputs without being extremely aggressive, like the heuristic that @gpshead suggested:
At the same time, I consider an extremely aggressive heuristic or simply abandoning

To wrap up the heuristic discussion, it seems like the foregone conclusion is that heuristics are just icing on the cake – they don't address the first-principles issues, and a heuristic that is too aggressive might constitute a breaking change.

Tim & other core devs in the thread, my ultimate question is: do you sponsor a thread-based solution as a complete solution to the root problem we're trying to address here? Is there any other solution you'd sponsor as a complete enough solution to close the ticket? If not, are there any solutions that you'd sponsor enough to want merged since they improve the ergonomics here, even if they're not a complete solution?
No threads. stdout from tests is not an API. Anyone depending on that has made a mere change detector; that's their problem. Regardless, we wouldn't significantly change the output in a bugfix backport. A heuristic to avoid potential regressive performance behaviors for something simpler is the best that can be done.
Just a quick note to say that I just ran face-first into this now 12-year-old bug, so it's still out there and causing problems. For whatever it's worth I would be thoroughly on board with any "good enough" solution. |
Try things again under current main branch (3.14.0a6+)? ndiff's
This is the commit that changed ndiff:
And this earlier test case:

```python
import unittest

class TestDictDiffSlow(unittest.TestCase):
    def test_slow_assert_equal_dict(self):
        d1 = {i: i * 2 for i in range(10000)}
        d2 = {i: i for i in range(10000)}
        self.assertDictEqual(d1, d2)

if __name__ == '__main__':
    unittest.main()
```

completes in well under 2 seconds now (under current main branch). I never did wait long enough for it to complete in a released Python (at least 15 minutes elapsed before I gave up).
```python
udiff_differing_lines = [l for l in udiff
                         if l.startswith(('-', '+'))]
```
Suggested change:

```python
udiff_differing_lines = [l for l in udiff if l.startswith(('-', '+'))]
```
```python
# inspect unified_diff output
num_difflines = len(udiff_differing_lines)
total_diffline_length = sum(len(l) for l in udiff_differing_lines)
```
Suggested change:

```python
total_diffline_length = sum(map(len, udiff_differing_lines))
```
hand, unified_diff's cost is the same as the cost of producing `diff`
by itself: O(a + b).
See bpo-19217 for additional context. |
Suggested change:

See gh-63416 for additional context.
If I haven't been clear enough, I'm questioning whether this PR should just be closed, unmerged. The worst cases of
This incorporates changes started by @eamanu:
Additionally, I added two commits to make this ready-to-merge
https://bugs.python.org/issue19217