Nearest neighbor return structure takes significantly longer to garbage collect · Issue #26873 · scikit-learn/scikit-learn · GitHub

Nearest neighbor return structure takes significantly longer to garbage collect #26873

@BrandonSmithJ

Description

The way output is structured in the nearest neighbor classes (e.g. KDTree) means that garbage collecting the output takes roughly an order of magnitude longer than generating it. For example:

from sklearn.neighbors import KDTree
import time

def mwe(n):
    start = time.time()
    # `a` keeps the query result alive until the function returns
    a = KDTree([[1]]).query_radius([[1]]*int(n), 1)
    print(f'Function completed in: {time.time()-start:.2f} seconds')

# The outer timers also include the time spent freeing `a` when mwe returns
start = time.time()
mwe(1e6)
print(f'Function returned after {time.time()-start:.1f} seconds')

start = time.time()
mwe(1e7)
print(f'Function returned after {time.time()-start:.1f} seconds')

This results in the output:

Function completed in: 0.36 seconds
Function returned after 4.4 seconds

Function completed in: 3.75 seconds
Function returned after 44.0 seconds

Compare this to the output of an equivalent script which uses scipy.spatial.KDTree:

Function completed in: 0.86 seconds
Function returned after 0.9 seconds

Function completed in: 9.17 seconds
Function returned after 9.3 seconds
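
The scipy script itself is not shown in this report. A minimal sketch of what such an equivalent might look like, assuming `query_ball_point` is used as the scipy counterpart to `query_radius` (the function name `mwe_scipy` and the returned count are illustrative additions):

```python
import time
from scipy.spatial import KDTree

def mwe_scipy(n):
    """Radius query against a one-point tree, mirroring the sklearn example."""
    start = time.time()
    # Assumption: query_ball_point stands in for sklearn's query_radius.
    # For multiple query points it returns an object array of Python lists.
    a = KDTree([[1]]).query_ball_point([[1]] * int(n), 1)
    print(f'Function completed in: {time.time()-start:.2f} seconds')
    # `a` is released when the function returns, as in the sklearn version
    return len(a)
```

Calling `mwe_scipy(1e6)` and `mwe_scipy(1e7)` inside the same outer timers as above would give a like-for-like comparison; actual timings will depend on the machine and library versions.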

Though the query operation is slower using scipy, the garbage collection time for the same output is inconsequential; this seems to be because the nested objects scipy returns are Python lists, rather than the NumPy arrays returned by sklearn.
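
One way to check this hypothesis without either KDTree implementation is to time only the release of an object array filled with many small items. The sketch below is illustrative, not a diagnosis: `time_teardown` is a hypothetical helper, and it relies on CPython freeing the array as soon as its last reference is dropped.

```python
import time
import numpy as np

def time_teardown(make_item, n):
    """Fill an object array with n items, then time how long freeing it takes."""
    out = np.empty(n, dtype=object)
    for i in range(n):
        out[i] = make_item()
    start = time.perf_counter()
    # Dropping the only reference frees the container and every nested item
    del out
    return time.perf_counter() - start

def compare_teardown(n):
    """Return (seconds to free nested arrays, seconds to free nested lists)."""
    arrays = time_teardown(lambda: np.zeros(1, dtype=np.intp), n)
    lists = time_teardown(lambda: [0], n)
    return arrays, lists
```

Running `compare_teardown(1_000_000)` on the affected system would show whether nested arrays alone reproduce the gap; the result may differ across platforms and NumPy versions.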

Also, I'm aware that switching the build and query data makes the problem go away in this contrived example, but doing so produces a completely different output representation. In the data I'm actually working with, swapping the two pushes the query runtime up toward the garbage collection times shown above, and garbage collection still has a relatively large impact.
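
To illustrate how the output representation changes when the build and query data are swapped, here is a small sketch on the same contrived data (`compare_layouts` is a hypothetical helper name):

```python
from sklearn.neighbors import KDTree

def compare_layouts(n):
    """Return query_radius results for both build/query arrangements."""
    # One tree point, n query points: n nested arrays holding one index each
    many_queries = KDTree([[1]]).query_radius([[1]] * n, 1)
    # n tree points, one query point: a single nested array holding n indices
    one_query = KDTree([[1]] * n).query_radius([[1]], 1)
    return many_queries, one_query
```

The first arrangement yields one nested array per query point, while the second yields a single nested array containing every matching index, so the two results are not interchangeable downstream.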

System:
    python: 3.10.12 | packaged by Anaconda, Inc. | (main, Jul  5 2023, 19:09:20) [MSC v.1916 64 bit (AMD64)]
executable: env_crest\python.exe
   machine: Windows-10-10.0.19045-SP0

Python dependencies:
      sklearn: 1.2.1
          pip: 23.1.2
   setuptools: 67.8.0
        numpy: 1.23.5
        scipy: 1.10.1
       Cython: None
       pandas: 1.5.3
   matplotlib: 3.7.1
       joblib: 1.2.0
threadpoolctl: 2.2.0

Built with OpenMP: True

threadpoolctl info:
       filepath: env_crest\Library\bin\mkl_rt.2.dll
         prefix: mkl_rt
       user_api: blas
   internal_api: mkl
        version: 2023.1-Product
    num_threads: 6
threading_layer: intel

       filepath: env_crest\vcomp140.dll
         prefix: vcomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 12
