[BUG] Linkage/Hierarchical clustering methods fail on readonly memmapped datasets · Issue #4114 · cython/cython · GitHub
@jjerphan

Describe the bug

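Some linkage/hierarchical clustering methods fail for some combinations of parameters (for example `affinity="euclidean"` with `linkage="single"`) when the input is a read-only memmapped dataset, as noticed in scikit-learn/scikit-learn#19562 (comment).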

To Reproduce
Modified from @tliu68's reproducer:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.utils._testing import create_memmap_backed_data
from sklearn.cluster import AgglomerativeClustering

X, y = make_blobs(n_samples=50, random_state=1)
X, y = create_memmap_backed_data([X, y])

# does not fail
ag = AgglomerativeClustering(n_clusters=3)
ag.fit(X)

# fails
ag = AgglomerativeClustering(affinity="euclidean", linkage="single")
ag.fit(X)
Full trace of the reproducer:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
 in 
----> 1 ag.fit(X)

~/dev/scikit-learn/sklearn/cluster/_agglomerative.py in fit(self, X, y)
    895         )
    896 
--> 897         out = memory.cache(tree_builder)(X, connectivity=connectivity,
    898                                          n_clusters=n_clusters,
    899                                          return_distance=return_distance,

~/.virtualenvs/sk/lib64/python3.9/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
    350 
    351     def __call__(self, *args, **kwargs):
--> 352         return self.func(*args, **kwargs)
    353 
    354     def call_and_shelve(self, *args, **kwargs):

~/dev/scikit-learn/sklearn/cluster/_agglomerative.py in _single_linkage(*args, **kwargs)
    617 def _single_linkage(*args, **kwargs):
    618     kwargs['linkage'] = 'single'
--> 619     return linkage_tree(*args, **kwargs)
    620 
    621 

~/dev/scikit-learn/sklearn/cluster/_agglomerative.py in linkage_tree(X, connectivity, n_clusters, linkage, affinity, return_distance)
    484             X = np.ascontiguousarray(X, dtype=np.double)
    485 
--> 486             mst = _hierarchical.mst_linkage_core(X, dist_metric)
    487             # Sort edges of the min_spanning_tree by weight
    488             mst = mst[np.argsort(mst.T[2], kind='mergesort'), :]

~/dev/scikit-learn/sklearn/cluster/_hierarchical_fast.pyx in sklearn.cluster._hierarchical_fast.mst_linkage_core()
    456 @cython.nonecheck(False)
    457 def mst_linkage_core(
--> 458         DTYPE_t [:, ::1] raw_data,
    459         DistanceMetric dist_metric):
    460     """

~/dev/scikit-learn/sklearn/cluster/_hierarchical_fast.cpython-39-x86_64-linux-gnu.so in View.MemoryView.memoryview_cwrapper()

~/dev/scikit-learn/sklearn/cluster/_hierarchical_fast.cpython-39-x86_64-linux-gnu.so in View.MemoryView.memoryview.__cinit__()

ValueError: buffer source array is read-only

Expected behavior

Linkage/hierarchical clustering methods should support read-only memmapped datasets.

Additional context

Linkage/hierarchical clustering methods rely on Cython.
However, their Cython implementations declare non-const typed memoryviews (e.g. `DTYPE_t [:, ::1] raw_data` in `mst_linkage_core`), which cannot be obtained by coercing read-only memmapped arrays. This is mainly because const memoryviews with fused types were not implemented in Cython at the time the code was written.

It should be fixable by using const memoryviews.
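
Until the Cython signatures use const memoryviews, one possible caller-side workaround is to copy read-only input into a writable array before it reaches the Cython routine. The sketch below is only an illustration of that idea, not the fix proposed above and not scikit-learn code: `ensure_writable_contiguous` is a hypothetical helper name, and the copy costs extra memory, which partly defeats the purpose of memmapping.

```python
import os
import tempfile

import numpy as np


def ensure_writable_contiguous(X):
    """Hypothetical helper (not part of scikit-learn).

    Return a C-contiguous float64 array whose buffer is writable, so it
    can be coerced to a non-const typed memoryview.
    """
    # Same conversion as in linkage_tree; it keeps the read-only flag.
    X = np.ascontiguousarray(X, dtype=np.double)
    if not X.flags.writeable:
        # Copy into regular, writable memory.
        X = X.copy()
    return X


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "X.dat")
    np.arange(6, dtype=np.double).reshape(3, 2).tofile(path)

    # mode="r" gives a read-only memmap, like the data in the reproducer.
    X_ro = np.memmap(path, dtype=np.double, mode="r", shape=(3, 2))
    readonly_before = not X_ro.flags.writeable

    X_rw = ensure_writable_contiguous(X_ro)

    # Release the memmap before the temporary directory is removed.
    del X_ro
```

The copy is only made when the input is actually read-only, so writable arrays pass through without extra allocation.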

Environment:

  • OS: Linux 5.11.11-200.fc33.x86_64
  • Python version: 3.8, 3.9
  • Cython version: 0.29.21
