Speed up slot lookup for class creation · Issue #76527 · python/cpython · GitHub

Closed
pitrou opened this issue Dec 16, 2017 · 23 comments

Labels
3.7 (EOL, end of life) · interpreter-core (Objects, Python, Grammar, and Parser dirs) · performance (Performance or resource usage)

Comments

@pitrou
Member
pitrou commented Dec 16, 2017
BPO 32346
Nosy @gvanrossum, @pitrou, @scoder, @methane, @serhiy-storchaka
PRs
  • bpo-32346: speed up slot lookup during class creation #4902

Files
  • benchgcclasses.py
  • benchgcclasses2.py

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2018-01-15.18:34:21.427>
    created_at = <Date 2017-12-16.12:21:18.352>
    labels = ['interpreter-core', '3.7', 'performance']
    title = 'Speed up slot lookup for class creation'
    updated_at = <Date 2018-01-15.18:34:21.426>
    user = 'https://github.com/pitrou'

    bugs.python.org fields:

    activity = <Date 2018-01-15.18:34:21.426>
    actor = 'pitrou'
    assignee = 'none'
    closed = True
    closed_date = <Date 2018-01-15.18:34:21.427>
    closer = 'pitrou'
    components = ['Interpreter Core']
    creation = <Date 2017-12-16.12:21:18.352>
    creator = 'pitrou'
    dependencies = []
    files = ['47378', '47379']
    hgrepos = []
    issue_num = 32346
    keywords = ['patch']
    message_count = 23.0
    messages = ['308474', '308475', '308476', '308738', '308744', '308747', '308866', '308867', '309384', '309411', '309420', '309444', '309465', '309474', '309476', '309824', '309828', '309845', '309848', '309849', '309850', '309851', '309892']
    nosy_count = 5.0
    nosy_names = ['gvanrossum', 'pitrou', 'scoder', 'methane', 'serhiy.storchaka']
    pr_nums = ['4902']
    priority = 'low'
    resolution = 'rejected'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue32346'
    versions = ['Python 3.7']

    @pitrou
    Member Author
    pitrou commented Dec 16, 2017

    As mentioned in https://mail.python.org/pipermail/python-dev/2017-December/151298.html, slot lookup can take 75 to 80% of the runtime when creating a new class. This was slightly improved in https://bugs.python.org/issue31336 but we're still doing one lookup per possible slot name *and* per class along the MRO.

    I propose to speed up this step by caching the known descriptors for each class. This way, when instantiating a new class, you can simply iterate over the known descriptors of the MRO without wasting time on hundreds of dict lookups.

    (and it is reasonably easy to find all slot descriptors in a dict: first select only __dunder__ names, then look them up in a dedicated mapping)
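
    A minimal Python sketch of that filtering step (illustrative only: the real patch implements this in C in Objects/typeobject.c, and SLOT_MAPPING / find_slot_descriptors are invented names):

        # Hypothetical illustration of the proposed descriptor caching.
        # Maps each recognized dunder name to the tp_ slot it fills (abridged).
        SLOT_MAPPING = {
            "__init__": "tp_init",
            "__repr__": "tp_repr",
            "__hash__": "tp_hash",
        }

        def find_slot_descriptors(namespace):
            """Collect the slot descriptors defined in one class namespace."""
            found = {}
            for name, value in namespace.items():
                # First select only __dunder__ names...
                if name.startswith("__") and name.endswith("__"):
                    # ...then look them up in a dedicated mapping.
                    slot = SLOT_MAPPING.get(name)
                    if slot is not None:
                        found[slot] = value
            return found

    With such a per-class cache, creating a subclass only has to walk the cached descriptors of each class along the MRO instead of doing one dict lookup per possible slot name per class.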

    pitrou added the 3.7 (EOL), interpreter-core, and performance labels on Dec 16, 2017
    @pitrou
    Member Author
    pitrou commented Dec 16, 2017

    I posted #4902 for this.

    This approach has two drawbacks:

    • uses an additional tp_ slot (and a corresponding TPFLAGS bit)
    • adds a bit of code complexity (but quite localized)

    @pitrou
    Member Author
    pitrou commented Dec 16, 2017

    Some micro-benchmarks:

    $ ./python -m timeit "class Test: pass"
    - before: 8.84 usec per loop
    - after: 7.03 usec per loop
    
    $ ./python -m timeit "class Test(tuple): pass"
    - before: 10.1 usec per loop
    - after: 8.4 usec per loop
    
    $ ./python -m timeit -s "from logging import Logger" "class Test(Logger): pass"
    - before: 12 usec per loop
    - after: 6.25 usec per loop
    
    $ ./python -m timeit -s "from logging.handlers import DatagramHandler" "class Test(DatagramHandler): pass"
    - before: 15 usec per loop
    - after: 6.68 usec per loop
    
    $ ./python -m timeit -s "from unittest.mock import MagicMock" "class Test(MagicMock): pass"
    - before: 18.2 usec per loop
    - after: 6.56 usec per loop
    
    $ ./python -m timeit -s "from shelve import Shelf" "class Test(Shelf): pass"
    - before: 28.6 usec per loop
    - after: 18.4 usec per loop

    @pitrou
    Member Author
    pitrou commented Dec 20, 2017

    Updated benchmarks against git master (and using pyperf):

    • before:
    $ ./env-orig/bin/pyperf timeit -q "class Test: pass"
    Mean +- std dev: 8.89 us +- 0.09 us
    $ ./env-orig/bin/pyperf timeit -q -s "from logging import Logger" "class Test(Logger): pass"
    Mean +- std dev: 12.0 us +- 0.2 us
    $ ./env-orig/bin/pyperf timeit -q -s "from unittest.mock import MagicMock" "class Test(MagicMock): pass"
    Mean +- std dev: 18.6 us +- 0.2 us
    • after:
    $ ./env/bin/pyperf timeit -q "class Test: pass"
    Mean +- std dev: 6.90 us +- 0.11 us
    $ ./env/bin/pyperf timeit -q -s "from logging import Logger" "class Test(Logger): pass"
    Mean +- std dev: 5.86 us +- 0.08 us
    $ ./env/bin/pyperf timeit -q -s "from unittest.mock import MagicMock" "class Test(MagicMock): pass"
    Mean +- std dev: 6.13 us +- 0.08 us

    @serhiy-storchaka
    Member

    Due to its complexity, it will take some time to review this patch.

    @pitrou
    Member Author
    pitrou commented Dec 20, 2017

    Yes... The patch itself is not very complex, but you have to dive into the intricacies of typeobject.c to understand it :-/

    @methane
    Member
    methane commented Dec 21, 2017

    Is a new slot really required?

    My idea is:

    • Copy slots from MRO
    • Iterate namespace dict and override slots if dunder is defined

    Is that impossible, or harder than I think?
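
    A runnable toy model of this alternative (slots are C struct fields in reality; a per-class dict stands in here, and SLOT_NAMES is an abridged, invented set):

        # Toy model of "copy slots from the MRO, then override from the
        # new class namespace".
        SLOT_NAMES = {"__init__", "__repr__", "__hash__"}  # abridged

        def build_slots(parent_slots, namespace):
            # Start from the parent's already-resolved slots...
            slots = dict(parent_slots)
            # ...then override any slot whose dunder is defined directly
            # on the new class.
            for name in SLOT_NAMES & namespace.keys():
                slots[name] = namespace[name]
            return slots

        # Example: A defines __repr__; B(A) adds __init__ and inherits the rest.
        a_slots = build_slots({}, {"__repr__": object.__repr__})
        b_slots = build_slots(a_slots, {"__init__": lambda self: None})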

    @pitrou
    Member Author
    pitrou commented Dec 21, 2017

    I don't really understand your proposal, so it's hard to answer :-) Perhaps you can try to implement it so that we can compare?

    @serhiy-storchaka
    Member

    ./python -m timeit -- "class A: pass"

    • before: 6.63 usec per loop
    • after: 5.41 usec per loop

    ./python -m timeit -s "class A: pass" -- "class B(A): pass"

    • before: 7.04 usec per loop
    • after: 4.91 usec per loop

    ./python -m timeit -s "class A: pass" -s "class B(A): pass" -- "class C(B): pass"

    • before: 8.24 usec per loop
    • after: 5.09 usec per loop

    ./python -m timeit -s "class A: pass" -s "class B(A): pass" -s "class C(B): pass" -- "class D(C): pass"

    • before: 9.59 usec per loop
    • after: 5.29 usec per loop

    ./python -m timeit -s "class A: pass" -s "class B(A): pass" -s "class C(B): pass" -s "class D(C): pass" -- "class E(D): pass"

    • before: 10.9 usec per loop
    • after: 5.45 usec per loop

    @methane
    Member
    methane commented Jan 3, 2018

    How about reusing tp_cache instead of adding a new slot?

    @pitrou
    Member Author
    pitrou commented Jan 3, 2018

    What exactly is tp_cache? Apparently it was added by Guido in 687ae00 but never used (at least officially).

    commit 687ae00
    Author: Guido van Rossum <guido@python.org>
    Date: Mon Oct 15 22:03:32 2001 +0000

    Get rid of __defined__ and tp_defined -- there's no need to
    distinguish __dict__ and __defined__ any more.  In the C structure,
    tp_cache takes its place -- but this hasn't been implemented yet.
    

    @gvanrossum
    Member

    I don't recall what I meant to use tp_cache for, but everything I can find about it says it's unused, so let's kill it if we can. I guess the ABI requires that we keep the slot around until we find a new use for it?

    @pitrou
    Member Author
    pitrou commented Jan 4, 2018

    I see, so I should be able to reuse tp_cache for this PR.

    @methane
    Member
    methane commented Jan 4, 2018

    In my environment, python -X importtime -c 'import asyncio' sped up from 53.2 ms to 51 ms.
    While I like faster startup times, I don't know whether a 4% speed-up is worth this complexity.

    @pitrou
    Member Author
    pitrou commented Jan 4, 2018

    While I like faster startup times, I don't know whether a 4% speed-up is worth this complexity.

    There's no magic bullet that I know of to reduce startup time by a large amount. So we're left with optimizing progressively. For comparison, #3279 had an even smaller effect on startup time.

    This change has the nice property that it's a genuine speed-up on a generic operation, and it reduces the cost of adding slots as well as the cost of having long inheritance chains.

    @serhiy-storchaka
    Member

    The relative speed-up looks nice. But it is just a few microseconds per class. You have to create many thousands of classes to gain a significant fraction of a second. And the complexity added by this patch is not tiny.

    I'm not sure how this caching works when the parent class is changed after the child class has been created.

    Without caching the benefit is 20-50% smaller. Perhaps this result can be improved. Actually, we don't need to search all the dictionaries of all the classes in the MRO. We can check the corresponding slot in the parent class (the offsets are constants) and look the name up in the class dict of the first class with a non-zero slot value.
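
    A toy Python model of that shortcut (invented names; it also assumes each slot maps cleanly to one dunder, which is questioned in the reply below):

        # Consult the parent's already-resolved slot first, and only search
        # the class dicts along the MRO when the slot is actually filled.
        def resolve_slot(parent_slots, mro_namespaces, name):
            if parent_slots.get(name) is None:
                return None  # no ancestor fills this slot: skip all dict lookups
            # Otherwise find the first class along the MRO defining the dunder.
            for ns in mro_namespaces:
                if name in ns:
                    return ns[name]
            return None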

    I tried to simplify this change (even at the cost of a smaller speedup). But the result still looks too complex.

    Since the benefit will in any case be small, and there seem to be no other issues that depend on this change, I suggest deferring it to 3.8. There are higher-priority changes that should be made before the feature freeze in 3.7.

    @pitrou
    Member Author
    pitrou commented Jan 11, 2018

    On 11/01/2018 at 20:25, Serhiy Storchaka wrote:

    I'm not sure how this caching works when the parent class is changed after the child class has been created.

    The caching is invalidated at the same place the method cache is invalidated.

    Without caching the benefit is 20-50% smaller. Perhaps this result can be improved. Actually, we don't need to search all the dictionaries of all the classes in the MRO. We can check the corresponding slot in the parent class (the offsets are constants) and look the name up in the class dict of the first class with a non-zero slot value.

    I'm not sure that's always true. The relationship between class dict attributes and tp_ slots is not 100% bijective; there's more complicated stuff going on for some of the descriptors. My patch keeps the current algorithm (ensuring no descriptor gets broken) but optimizes the lookups.

    There are higher-priority changes that should be made before the feature freeze in 3.7.

    I am not working on any of those changes, so deferring this PR will not have any effect on them.

    @methane
    Member
    methane commented Jan 12, 2018

    As I understand it, this patch creates a cache for all classes, not only for parent classes. The caches may hold many tuples, and those tuples are GC-tracked because they contain function descriptors. And they actually create reference cycles.

    Am I correct?

    If so, I want an estimate of the GC overhead and memory overhead of the cache for a large project. Is it really negligible?

    @pitrou
    Member Author
    pitrou commented Jan 12, 2018

    Here is a simple script creating 10000 classes (a large number, but perhaps not unrealistic for a large application importing a lot of libraries (*)).

    (*) see the experiment I did in https://mail.python.org/pipermail/python-dev/2017-December/151260.html
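
    The attached script is not reproduced in this thread; a hypothetical sketch of such a benchmark (invented, not the actual file) could look like:

        # Hypothetical stand-in for the attached benchgcclasses.py.
        import gc
        import os
        import time

        # Worst case: many empty classes, no methods, no other user data.
        classes = [type(f"C{i}", (), {}) for i in range(10000)]

        gc.collect()
        start = time.perf_counter()
        gc.collect()  # time a full collection of the oldest generation
        print(f"GC time: {(time.perf_counter() - start) * 1e3:.1f} ms")

        print("RSS:")
        os.system(f"ps u -p {os.getpid()}")  # resident set size, as in the output below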

    Before:
    $ ./python-orig -I benchgcclasses.py
    GC time: 6.8 ms
    RSS:
    USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    antoine 11248 0.0 0.3 41296 24576 pts/1 S+ 12:18 0:00 ./python-orig -I benchgcclasses.py

    After:
    $ ./python -I benchgcclasses.py
    GC time: 6.9 ms
    RSS:
    USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    antoine 11097 0.0 0.3 41300 24740 pts/1 S+ 12:18 0:00 ./python -I benchgcclasses.py

    RSS is a bit unstable from run to run, but roughly the patch seems to add 100 to 200 KB in this case.

    As for full GC time, it is quite stable and there's a 0.1 ms increase with the patch.

    Note this is really a worst-case benchmark: lots of classes, no methods, no user data beside the classes.

    @pitrou
    Member Author
    pitrou commented Jan 12, 2018

    Serhiy:

    The relative speed-up looks nice. But it is just a few microseconds per class. You have to create many thousands of classes to gain a significant fraction of a second.

    This work started with your message in https://mail.python.org/pipermail/python-dev/2017-December/151277.html pointing out that we shouldn't add tp_ slots because it makes type creation and Python initialization slower (*), so I'm a bit surprised that you're now claiming speeding up type creation (by up to 3x!) is not important...

    (*) To quote:

    '''
    2. Increased class initialization time. For every class for every slot
    we need to look up corresponding methods in dictionaries of the class
    itself and all its parents (caching doesn't work fine at this stage).
    Significant part of class initialization time is spent on initializing
    slots. This will increase the startup time and the time of creating
    local classes. The relative overhead is more significant in Cython.
    '''

    @methane
    Member
    methane commented Jan 12, 2018

    Note this is really a worst-case benchmark: lots of classes, no methods, no user data beside the classes.

    Since benchgcclasses.py doesn't create dunder methods, the cache holds no GC-tracked tuples and no reference cycles.

    benchgcclasses2.py adds three dunder methods:

    $ ./python benchgcclasses.py
    GC time: 13.0 ms
    gc: collecting generation 2...
    gc: objects in each generation: 1 0 94942
    gc: objects in permanent generation: 0
    gc: done, 0.0131s elapsed
    RSS:
    USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
    inada-n  29604  0.0  0.1  47052 30692 pts/2    S+   20:42   0:00 ./python benchgcclasses.py
    
    $ ./python-patched benchgcclasses.py
    GC time: 17.2 ms
    gc: collecting generation 2...
    gc: objects in each generation: 1 0 135220
    gc: objects in permanent generation: 0
    gc: done, 0.0173s elapsed
    RSS:
    USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
    inada-n  29626  0.0  0.2  49880 33464 pts/2    S+   20:43   0:00 ./python-patched benchgcclasses.py
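
    For illustration, a hypothetical class factory in the spirit of benchgcclasses2.py (invented; the actual attached file is not reproduced in this thread):

        # Hypothetical: each generated class defines three dunder methods,
        # so the per-class descriptor cache would hold GC-tracked tuples
        # referencing them.
        def make_class(i):
            class C:
                def __init__(self):
                    self.x = i
                def __repr__(self):
                    return f"C{i}"
                def __eq__(self, other):
                    return isinstance(other, C) and self.x == other.x
            return C

        classes = [make_class(i) for i in range(10000)]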

    @pitrou
    Member Author
    pitrou commented Jan 12, 2018

    Since benchgcclasses.py doesn't create dunder methods, the cache holds no GC-tracked tuples and no reference cycles.

    Hmm, you're right, thank you. I can also reproduce your numbers here (using benchgcclasses2.py). There's definitely an impact.

    @pitrou
    Member Author
    pitrou commented Jan 13, 2018

    I think that Inada is right: there's too much impact on memory consumption and garbage collection to call this an optimization.

    Serhiy suggested removing the cache. But once you remove the cache, you have to iterate over all the values of all parent classes' dicts to find which of them are __dunder__ methods. This risks being much more expensive for classes with tens or hundreds of namespace entries, making the optimization even less interesting.

    So in the end I think I'm going to reject this issue.
