8000 MAINT Improve scikit-learn reliability by replacing `cnp.ndarray`s with typed memoryviews · Issue #25484 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content
MAINT Improve scikit-learn reliability by replacing cnp.ndarrays with typed memoryviews #25484
Closed
@jjerphan

Description

@jjerphan

Context

Using cnp.ndarray is not encouragged and using typed memoryviews must be preferred. More precisely const-qualified memoryview must be used when the buffer of the NumPy array is readonly (some Cython interfaces previously have had support for readonly data using cnp.ndarray as a workaround).

On a side-note, better uniform canonical support of readonly buffers has notable value: some frameworks like Ray or libraries like joblib make use of memory mapping which change the writability of buffers. Yet it might crash in some context if scikit-learn's implementations have no clear support for readonly data and this is fatal to users' workflows.

In overall, we have a poor overview for such support but we must. Efforts in this meta-issue will improve the overview for this support.


As an example, the following pattern:

cdef cnp.ndarray[T, ndim=1, mode="c"] readonly_data 

must be changed to:

cdef const T[::1] readonly_data

where T is a concrete type or a fused type.

See occurences of cnp.ndarray in the codebase.

💡 Do note, as mentioned by @thomasjpfan in #25484 (comment), that not all occurences of cnp.ndarray must be removed.

💡 People interested starting working with Cython might be interested in reading Cython Best Practices, Conventions and Knowledge section of scikit-learn documentation.

Proposed solution: treat one file per PR

  • Only perform changes necessary change to use memoryviews
  • If the file is big, you can have several PR to treat it
  • Create PRs for each warning with the following title:
    MAINT Use memoryviews in `source_file.ext`
    
    where:
    • source_file.ext is the file being treated

The following command allows to see which files needs to be treated:

ag "cnp.ndarray" -c

As of b69abf5, the following files need to be treated:

  • sklearn/metrics/cluster/_expected_mutual_info_fast.pyx:4
  • sklearn/metrics/_dist_metrics.pyx.tp:6
  • sklearn/neighbors/_quad_tree.pyx:1
  • sklearn/preprocessing/_csr_polynomial_expansion.pyx:9
  • sklearn/utils/_seq_dataset.pyx.tp:14
  • sklearn/utils/_fast_dict.pyx:4
  • sklearn/utils/_seq_dataset.pxd.tp:10
  • sklearn/utils/arrayfuncs.pyx:2
  • sklearn/utils/sparsefuncs_fast.pyx:50
  • sklearn/linear_model/_cd_fast.pyx:15
  • sklearn/tree/_tree.pxd:10
  • sklearn/tree/_tree.pyx:38
  • sklearn/neighbors/_binary_tree.pxi:1
  • sklearn/tree/_utils.pyx:1
  • sklearn/tree/_utils.pxd:1
  • sklearn/tree/_criterion.pyx:3
  • sklearn/utils/_random.pyx:4
  • sklearn/_isotonic.pyx:6

Miscellaneous

#24875 is also linked to this PR regarding using the newest NumPy C API.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions

      0