Cannot recover DBSCAN from memory-overuse · Issue #31407 · scikit-learn/scikit-learn · GitHub

Cannot recover DBSCAN from memory-overuse #31407
Open
@hubernikus

Description

Describe the bug

I just ran into the same problem as #22531: the program gets killed while running DBSCAN.

The documentation update already helps, and I think it is fine for the algorithm to fail. However, there is currently no way for me to recover from the failure, and a more informative error message would be useful. Right now DBSCAN just reports "Killed", and it takes some searching to find out what failed:

>>> DBSCAN(eps=1, min_samples=2).fit(np.random.rand(10_000_000, 3))
Killed

For example, something like what NumPy does when an allocation is too large:

>>> n = int(1e6)
>>> np.random.rand(n, n)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "numpy/random/mtrand.pyx", line 1219, in numpy.random.mtrand.RandomState.rand
  File "numpy/random/mtrand.pyx", line 437, in numpy.random.mtrand.RandomState.random_sample
  File "_common.pyx", line 307, in numpy.random._common.double_fill
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 7.28 TiB for an array with shape (1000000, 1000000) and data type float64
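
To illustrate the kind of check I have in mind, here is a rough sketch of a pre-flight guard that raises a NumPy-style MemoryError before the fit starts. It is not scikit-learn code. It assumes the dominant cost is storing the neighbor indices of every point, bounds that by n_samples squared entries, and assumes 8 bytes per index (64-bit platform). The names INDEX_BYTES, worst_case_neighbor_bytes and check_memory_budget are made up for this example, and the bound is very pessimistic for a small eps.

```python
# Assumption: each stored neighbor index takes 8 bytes (64-bit platform).
INDEX_BYTES = 8


def worst_case_neighbor_bytes(n_samples: int) -> int:
    """Upper bound on the bytes needed to store every point's neighbor indices.

    Assumes the worst case where every point lies within eps of every
    other point, so each of the n_samples neighborhoods has n_samples entries.
    """
    return n_samples * n_samples * INDEX_BYTES


def check_memory_budget(n_samples: int, budget_bytes: int) -> None:
    """Raise MemoryError if the worst-case estimate exceeds budget_bytes."""
    needed = worst_case_neighbor_bytes(n_samples)
    if needed > budget_bytes:
        raise MemoryError(
            f"Neighbor indices for {n_samples} samples may need up to "
            f"{needed / 2**30:.2f} GiB, which exceeds the budget of "
            f"{budget_bytes / 2**30:.2f} GiB"
        )
```

Calling check_memory_budget(10_000_000, 16 * 2**30) before the fit would then raise a catchable MemoryError instead of letting the process get killed.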

Additionally, I noticed that memory accumulates across consecutive calls to DBSCAN. This can get the program killed even though there is enough memory for any single fit.
I was able to resolve this by explicitly calling import gc; gc.collect() after each run. Maybe this could be invoked at the end of each DBSCAN fit?
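
For reference, the workaround looks roughly like this. It is a minimal sketch: fit_each_with_cleanup and its make_estimator argument are hypothetical names for this example, and I have not checked whether gc.collect() is strictly required in every setup or only speeds up the release of memory.

```python
import gc


def fit_each_with_cleanup(make_estimator, datasets):
    """Fit a fresh estimator on each dataset and collect garbage in between.

    make_estimator is a zero-argument callable (hypothetical argument), e.g.
    lambda: DBSCAN(eps=1, min_samples=2).
    """
    all_labels = []
    for X in datasets:
        estimator = make_estimator()
        estimator.fit(X)
        # Keep only a copy of the labels so the estimator itself can be freed.
        all_labels.append(estimator.labels_.copy())
        del estimator
        gc.collect()  # the explicit collection described above
    return all_labels
```

With scikit-learn this would be called as fit_each_with_cleanup(lambda: DBSCAN(eps=1, min_samples=2), datasets), where datasets is any iterable of arrays.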

Steps/Code to Reproduce

import numpy as np
from sklearn.cluster import DBSCAN

try:
    DBSCAN(eps=1, min_samples=2).fit(np.random.rand(10_000_000, 3))
except Exception:
    print("Caught exception")

Expected Results

Caught exception

Actual Results

Killed
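
A plain "Killed" usually means the process received SIGKILL, and I assume here that it came from the Linux out-of-memory killer. SIGKILL cannot be caught from inside the process, so the try/except above never runs. The only recovery I can think of today is to run the fit in a child interpreter and inspect its exit status. Below is a minimal, POSIX-only sketch of that idea. run_script_isolated is a hypothetical helper, not a scikit-learn API, and a SIGKILL can have causes other than memory exhaustion.

```python
import signal
import subprocess
import sys


def run_script_isolated(code: str) -> subprocess.CompletedProcess:
    """Run Python source in a child interpreter.

    Raises MemoryError if the child was terminated by SIGKILL, which on
    Linux is what the out-of-memory killer sends (assumed cause here).
    """
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
    )
    # On POSIX, a negative return code -N means "terminated by signal N".
    if result.returncode == -signal.SIGKILL:
        raise MemoryError("child process was killed, possibly due to memory overuse")
    return result
```

The code string would contain the imports and the DBSCAN fit, and would write the labels to a file for the parent to read. The parent can then wrap run_script_isolated in try/except MemoryError and keep running.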

Versions

>>> import sklearn; sklearn.show_versions()

System:
    python: 3.10.12 (main, Feb  4 2025, 14:57:36) [GCC 11.4.0]
executable: /usr/bin/python3
   machine: Linux-6.14.6-arch1-1-x86_64-with-glibc2.35

Python dependencies:
      sklearn: 1.6.1
          pip: None
   setuptools: 80.7.1
        numpy: 1.26.4
        scipy: 1.15.3
       Cython: None
       pandas: 2.2.3
   matplotlib: 3.10.3
       joblib: 1.5.0
threadpoolctl: 3.6.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 20
         prefix: libopenblas
       filepath: /usr/local/lib/python3.10/dist-packages/numpy.libs/libopenblas64_p-r0-0cf96a72.3.23.dev.so
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: Prescott

       user_api: blas
   internal_api: openblas
    num_threads: 20
         prefix: libscipy_openblas
       filepath: /usr/local/lib/python3.10/dist-packages/scipy.libs/libscipy_openblas-68440149.so
        version: 0.3.28
threading_layer: pthreads
   architecture: Haswell

       user_api: openmp
   internal_api: openmp
    num_threads: 20
         prefix: libgomp
       filepath: /usr/local/lib/python3.10/dist-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None
