Speed up slot lookup for class creation #76527
As mentioned in https://mail.python.org/pipermail/python-dev/2017-December/151298.html, slot lookup can take 75 to 80% of the runtime when creating a new class. This was slightly improved in https://bugs.python.org/issue31336, but we are still doing one lookup per possible slot name *and* per class along the MRO.

I propose to speed up this step by caching the known descriptors for each class. This way, when instantiating a new class, you can simply iterate over the known descriptors of the MRO without wasting time on hundreds of dict lookups. (It is reasonably easy to find all slot descriptors in a dict: first select only __dunder__ names, then look them up in a dedicated mapping.)
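The approach can be sketched at the Python level (the actual patch implements this in C inside typeobject.c; the helper below is purely illustrative):

```python
def collect_dunder_descriptors(cls):
    """Illustrative sketch: gather every __dunder__ entry along the MRO
    once, so slot updates can iterate this small mapping instead of
    performing one dict lookup per slot name per ancestor class."""
    cache = {}
    # walk base-first so more derived classes override their ancestors
    for klass in reversed(cls.__mro__):
        for name, value in vars(klass).items():
            if name.startswith("__") and name.endswith("__"):
                cache[name] = value
    return cache
```

For example, `collect_dunder_descriptors(list)` would contain entries such as `__len__` and `__getitem__`, resolved once per class instead of once per slot per ancestor.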
I posted #4902 for this. This approach has two drawbacks:
Some micro-benchmarks:

```
$ ./python -m timeit "class Test: pass"
- before: 8.84 usec per loop
- after: 7.03 usec per loop
$ ./python -m timeit "class Test(tuple): pass"
- before: 10.1 usec per loop
- after: 8.4 usec per loop
$ ./python -m timeit -s "from logging import Logger" "class Test(Logger): pass"
- before: 12 usec per loop
- after: 6.25 usec per loop
$ ./python -m timeit -s "from logging.handlers import DatagramHandler" "class Test(DatagramHandler): pass"
- before: 15 usec per loop
- after: 6.68 usec per loop
$ ./python -m timeit -s "from unittest.mock import MagicMock" "class Test(MagicMock): pass"
- before: 18.2 usec per loop
- after: 6.56 usec per loop
$ ./python -m timeit -s "from shelve import Shelf" "class Test(Shelf): pass"
- before: 28.6 usec per loop
- after: 18.4 usec per loop
```
Updated benchmarks against git master (and using pyperf):

```
$ ./env-orig/bin/pyperf timeit -q "class Test: pass"
Mean +- std dev: 8.89 us +- 0.09 us
$ ./env-orig/bin/pyperf timeit -q -s "from logging import Logger" "class Test(Logger): pass"
Mean +- std dev: 12.0 us +- 0.2 us
$ ./env-orig/bin/pyperf timeit -q -s "from unittest.mock import MagicMock" "class Test(MagicMock): pass"
Mean +- std dev: 18.6 us +- 0.2 us
$ ./env/bin/pyperf timeit -q "class Test: pass"
Mean +- std dev: 6.90 us +- 0.11 us
$ ./env/bin/pyperf timeit -q -s "from logging import Logger" "class Test(Logger): pass"
Mean +- std dev: 5.86 us +- 0.08 us
$ ./env/bin/pyperf timeit -q -s "from unittest.mock import MagicMock" "class Test(MagicMock): pass"
Mean +- std dev: 6.13 us +- 0.08 us
```
Due to its complexity, it will take some time to review this patch.
Yes... The patch itself is not very complex, but you have to dive into the intricacies of typeobject.c to understand it :-/
Is a new slot really required? My idea is:
Is it impossible, or harder than I think?
I don't really understand your proposal, so it's hard to answer :-) Perhaps you can try to implement it so that we can compare?
```
./python -m timeit -- "class A: pass"
./python -m timeit -s "class A: pass" -- "class B(A): pass"
./python -m timeit -s "class A: pass" -s "class B(A): pass" -- "class C(B): pass"
./python -m timeit -s "class A: pass" -s "class B(A): pass" -s "class C(B): pass" -- "class D(C): pass"
./python -m timeit -s "class A: pass" -s "class B(A): pass" -s "class C(B): pass" -s "class D(C): pass" -- "class E(D): pass"
```
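The same measurement can be scripted with the timeit module instead of repeating the shell command once per depth (an illustrative sketch, not part of the patch; absolute numbers will vary by machine and build):

```python
import timeit

def time_class_creation_by_depth(max_depth=5, number=10_000):
    """Time creating one subclass at each inheritance depth,
    mirroring the shell commands above."""
    names = "ABCDE"
    setup_lines = []
    per_class_us = []
    for depth in range(max_depth):
        bases = f"({names[depth - 1]})" if depth else ""
        stmt = f"class {names[depth]}{bases}: pass"
        setup = "\n".join(setup_lines) or "pass"
        elapsed = timeit.timeit(stmt, setup=setup, number=number)
        per_class_us.append(elapsed / number * 1e6)  # usec per class
        setup_lines.append(stmt)  # deeper chains reuse earlier classes
    return per_class_us

for depth, usec in enumerate(time_class_creation_by_depth()):
    print(f"depth {depth}: {usec:.2f} us per class")
```

With the patch, the expectation is that the per-class time grows much more slowly with depth, since the MRO dict lookups are replaced by iteration over the cached descriptors.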
How about reusing tp_cache instead of adding a new slot?
What exactly is tp_cache? Apparently it was added by Guido in 687ae00 but never used (at least officially).
I don't recall what I meant to use tp_cache for, but everything I can find about it says it's unused, so let's kill it if we can. I guess the ABI requires that we keep the slot around until we find a new use for it?
I see, so I should be able to reuse tp_cache for this PR. |
In my environment, `python -X importtime -c 'import asyncio'` speeds up from 53.2 ms to 51 ms.
There's no magic bullet that I know of to reduce startup time by a large amount. So we're left with optimizing progressively. For comparison, #3279 had an even smaller effect on startup time. This change has the nice property that it's a genuine speed-up on a generic operation, and it reduces the cost of adding slots as well as the cost of having long inheritance chains.
The relative speed-up looks nice, but it is just a few microseconds per class: you have to create many thousands of classes to gain a significant fraction of a second. And the complexity added by this patch is not tiny. I'm not sure how this caching works when the parent class is changed after creating the child class.

Without caching the benefit is 20-50% smaller. Perhaps this result can be improved. Actually, we don't need to search in all dictionaries of all classes in the MRO. We can check the corresponding slot in the parent class (the offsets are constants) and look up in the class dict of the first class with a non-zero slot value.

I tried to simplify this change (even at the cost of a smaller speedup), but the result still looks too complex. Since the benefit in any case will be small, and there seem to be no other issues that depend on this change, I suggest deferring it to 3.8. There are higher-priority changes that should be made before the feature freeze in 3.7.
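The shortcut Serhiy describes can be sketched at the Python level (a hypothetical helper, not the actual C code; in C the check would read the slot pointer at a constant offset in the parent type structure):

```python
def resolve_slot(cls, dunder_name):
    """Sketch: if the parent type's slot is "NULL" (no ancestor
    implements the dunder), skip the per-class dict lookups entirely;
    otherwise return the first implementation found along the MRO."""
    # the new class's own namespace is always consulted
    if dunder_name in vars(cls):
        return vars(cls)[dunder_name]
    parent = cls.__mro__[1] if len(cls.__mro__) > 1 else None
    # the shortcut: a "NULL" slot in the parent means no ancestor
    # implements this dunder, so there is nothing to search for
    if parent is None or not hasattr(parent, dunder_name):
        return None
    for klass in cls.__mro__[1:]:
        if dunder_name in vars(klass):
            return vars(klass)[dunder_name]
    return None
```

The common case this saves is a class with few ancestors implementing few slots: most of the hundred-odd possible slot names hit the early `None` return without touching any class dict.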
On 11/01/2018 at 20:25, Serhiy Storchaka wrote:
The caching is invalidated at the same place the method cache is
I'm not sure that's always true. The relationship between class dict
I am not working on any of those changes, so deferring this PR will not
As I understand it, this patch creates a cache for all classes. Is that correct? If so, I'd like an estimate of the GC overhead and memory overhead of the cache.
Here is a simple script creating 10000 classes (a large number, but perhaps not out of sight for a large application importing a lot of libraries (*)).

(*) see the experiment I did in https://mail.python.org/pipermail/python-dev/2017-December/151260.html

Before:
After:

RSS is a bit unstable from run to run, but roughly the patch seems to add 100 to 200 KB in this case. As for full GC time, it is quite stable and there's a 0.1 ms increase with the patch. Note this is really a worst-case benchmark: lots of classes, no methods, no user data beside the classes.
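The script itself is not reproduced in the thread; a minimal sketch of what such a benchmark might look like (the structure below is an assumption, not the original script; `resource` is Unix-only and `ru_maxrss` is kilobytes on Linux but bytes on macOS):

```python
import gc
import resource
import time

# create many empty classes, the worst case described above:
# lots of classes, no methods, no user data beside the classes
NUM_CLASSES = 10_000
classes = [type(f"C{i}", (), {}) for i in range(NUM_CLASSES)]

gc.collect()  # settle the heap before timing
start = time.perf_counter()
gc.collect()  # time one full collection over all the classes
gc_ms = (time.perf_counter() - start) * 1e3

rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"GC time: {gc_ms:.1f} ms")
print(f"peak RSS: {rss} KB")
```

Running this under the original and patched interpreters and comparing the two lines gives the kind of before/after numbers quoted in this comment.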
Serhiy:

This work started with your message in https://mail.python.org/pipermail/python-dev/2017-December/151277.html pointing out that we shouldn't add tp_ slots because it makes type creation and Python initialization slower (*), so I'm a bit surprised that you're now claiming speeding up type creation (by up to 3x!) is not important...

(*) To quote:
'''
Since benchgcclasses.py doesn't create dunder methods, benchgcclasses2.py adds three dunder methods:

```
$ ./python benchgcclasses.py
GC time: 13.0 ms
gc: collecting generation 2...
gc: objects in each generation: 1 0 94942
gc: objects in permanent generation: 0
gc: done, 0.0131s elapsed
RSS:
USER     PID   %CPU %MEM   VSZ   RSS  TTY   STAT START TIME COMMAND
inada-n  29604  0.0  0.1 47052 30692 pts/2  S+   20:42 0:00 ./python benchgcclasses.py
$ ./python-patched benchgcclasses.py
GC time: 17.2 ms
gc: collecting generation 2...
gc: objects in each generation: 1 0 135220
gc: objects in permanent generation: 0
gc: done, 0.0173s elapsed
RSS:
USER     PID   %CPU %MEM   VSZ   RSS  TTY   STAT START TIME COMMAND
inada-n  29626  0.0  0.2 49880 33464 pts/2  S+   20:43 0:00 ./python-patched benchgcclasses.py
```
Hmm, you're right, thank you. I can also reproduce your numbers here (using benchgcclasses2.py). There's definitely an impact. |
I think that Inada is right: there's too much impact on memory consumption and garbage collection to call this an optimization. Serhiy suggested removing the cache, but once you remove the cache, you're going to iterate over all values of all parent classes' dicts to find which values are __dunder__ methods. This risks being much more expensive for classes with tens or hundreds of namespace entries, making the optimization even less interesting. So in the end I think I'm going to reject this issue.