8000 MAINT Improve scikit-learn reliability by replacing `cnp.ndarray`s with typed memoryviews · Issue #25484 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

MAINT Improve scikit-learn reliability by replacing cnp.ndarrays with typed memoryviews #25484

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
18 tasks done
jjerphan opened this issue Jan 26, 2023 · 31 comments
Closed
18 tasks done
Labels
Build / CI cython help wanted Meta-issue General issue associated to an identified list of tasks

Comments

@jjerphan
Copy link
Member
jjerphan commented Jan 26, 2023

Context

Using cnp.ndarray is not encouragged and using typed memoryviews must be preferred. More precisely const-qualified memoryview must be used when the buffer of the NumPy array is readonly (some Cython interfaces previously have had support for readonly data using cnp.ndarray as a workaround).

On a side-note, better uniform canonical support of readonly buffers has notable value: some frameworks like Ray or libraries like joblib make use of memory mapping which change the writability of buffers. Yet it might crash in some context if scikit-learn's implementations have no clear support for readonly data and this is fatal to users' workflows.

In overall, we have a poor overview for such support but we must. Efforts in this meta-issue will improve the overview for this support.


As an example, the following pattern:

cdef cnp.ndarray[T, ndim=1, mode="c"] readonly_data 

must be changed to:

cdef const T[::1] readonly_data

where T is a concrete type or a fused type.

See occurences of cnp.ndarray in the codebase.

💡 Do note, as mentioned by @thomasjpfan in #25484 (comment), that not all occurences of cnp.ndarray must be removed.

💡 People interested starting working with Cython might be interested in reading Cython Best Practices, Conventions and Knowledge section of scikit-learn documentation.

Proposed solution: treat one file per PR

  • Only perform changes necessary change to use memoryviews
  • If the file is big, you can have several PR to treat it
  • Create PRs for each warning with the following title:
    MAINT Use memoryviews in `source_file.ext`
    
    where:
    • source_file.ext is the file being treated

The following command allows to see which files needs to be treated:

ag "cnp.ndarray" -c

As of b69abf5, the following files need to be treated:

  • sklearn/metrics/cluster/_expected_mutual_info_fast.pyx:4
  • sklearn/metrics/_dist_metrics.pyx.tp:6
  • sklearn/neighbors/_quad_tree.pyx:1
  • sklearn/preprocessing/_csr_polynomial_expansion.pyx:9
  • sklearn/utils/_seq_dataset.pyx.tp:14
  • sklearn/utils/_fast_dict.pyx:4
  • sklearn/utils/_seq_dataset.pxd.tp:10
  • sklearn/utils/arrayfuncs.pyx:2
  • sklearn/utils/sparsefuncs_fast.pyx:50
  • sklearn/linear_model/_cd_fast.pyx:15
  • sklearn/tree/_tree.pxd:10
  • sklearn/tree/_tree.pyx:38
  • sklearn/neighbors/_binary_tree.pxi:1
  • sklearn/tree/_utils.pyx:1
  • sklearn/tree/_utils.pxd:1
  • sklearn/tree/_criterion.pyx:3
  • sklearn/utils/_random.pyx:4
  • sklearn/_isotonic.pyx:6

Miscellaneous

#24875 is also linked to this PR regarding using the newest NumPy C API.

@github-actions github-actions bot added the Needs Triage Issue requires triage label Jan 26, 2023
@jjerphan jjerphan added Meta-issue General issue associated to an identified list of tasks cython labels Jan 26, 2023
@jjerphan jjerphan removed the Needs Triage Issue requires triage label Jan 26, 2023
@jjerphan jjerphan added Easy Well-defined and straightforward way to resolve Build / CI labels Jan 26, 2023
@jjerphan
Copy link
Member Author

@OmarManzoor: you might be interested in this issue.

@OmarManzoor
Copy link
Contributor

I would like to try out sklearn/tree/_tree.pyx which is also remaining in #24875.

@jpangas
Copy link
Contributor
jpangas commented Jan 29, 2023

@jjerphan , is it okay if I can treat some files ?

@jjerphan
Copy link
Member Author
jjerphan commented Jan 29, 2023

Yes sure! Everyone is welcome to help resolving this issue.

@jjerphan
Copy link
Member Author

@jpangas: thank you for the initiative. Unfortunately scikit-learn does not use assignees on GitHub (especially for meta-issues like this one).

@jpangas
Copy link
Contributor
jpangas commented Jan 30, 2023

Sure, thanks for the clarification @jjerphan.

@jpangas
Copy link
Contributor
jpangas commented Jan 30, 2023

@jjerphan, while working on, sklearn/linear_model/_cd_fast.pyx, I came across #11722 which used memory views on the same file. The PR was merged but the changes are not reflecting on the current file in main. Were the changes reverted and should I proceed with the file?

@Vincent-Maladiere
Copy link
Contributor

Hi @jpangas, the PR you mentioned is almost five years old (2018). It has indeed been reverted lastly here because up to recently, Cython didn't allow fusetyped memoryviews to be const-qualified. We need input memoryviews to be const-qualified so that read-only arrays won't raise an error at runtime.

Long story short, you can now safely convert those back to memoryviews 😉

@jpangas
Copy link
Contributor
jpangas commented Jan 30, 2023

Makes sense. I just followed up on the conversation in the PR you shared. Thanks @Vincent-Maladiere .

@thomasjpfan
Copy link
Member

Note that cnp.ndarray itself has not been deprecated. Directly accessing cnp.ndarray's attributes such as .data was deprecated. I can see places where cnp.ndarray is still good to have, such as the return types of some tree methods in #25540.

@OmarManzoor
Copy link
Contributor
OmarManzoor commented Feb 9, 2023

@jjerphan

sklearn/utils/_vector_sentinel.pyx and sklearn/utils/_vector_sentinel.pxd seem to be of a simil 8000 ar pattern to the instances retained inside _tree. So I think we can leave these out.

cdef cnp.ndarray vector_to_nd_array(vector_typed * vect_ptr):
cdef:
cnp.npy_intp size = deref(vect_ptr).size()
StdVectorSentinel sentinel = _create_sentinel(vect_ptr)
cnp.ndarray arr = cnp.PyArray_SimpleNewFromData(
1, &size, sentinel.get_typenum(), sentinel.get_data())
# Makes the numpy array responsible of the life-cycle of its buffer.
# A reference to the StdVectorSentinel will be stolen by the call to
# `PyArray_SetBaseObject` below, so we increase its reference counter.
# See: https://docs.python.org/3/c-api/intro.html#reference-count-details
Py_INCREF(sentinel)
cnp.PyArray_SetBaseObject(arr, sentinel)
return arr

@jjerphan
Copy link
Member Author
jjerphan commented Feb 9, 2023

Yes, we need to use cnp.ndarray here because we need to create a NumPy array backed by a already allocated buffer here.

@thomasjpfan
Copy link
Member

@OmarManzoor If you are interested, I would prioritize sklearn/linear_model/_cd_fast.pyx because it is one of the remaining few still using the NumPy's deprecated C-API.

@Mike-gag
Copy link
Mike-gag commented Mar 7, 2023

is the list provided in the issue containing all files that need updating up to date? I would like to help out

@OmarManzoor
Copy link
Contributor
OmarManzoor commented Mar 7, 2023

is the list provided in the issue containing all files that need updating up to date? I would like to help out

A number of the files might already have been completed. Would you like to try out sklearn/utils/_random.pyx?
These are the remaining files I think:

  • sklearn/feature_extraction/_hashing_fast.pyx:1
  • sklearn/metrics/_pairwise_distances_reduction/_radius_neighbors.pxd.tp:1
  • sklearn/metrics/_pairwise_distances_reduction/_radius_neighbors.pyx.tp:2
  • sklearn/neighbors/_quad_tree.pyx:1
  • sklearn/tree/_utils.pyx:1
  • sklearn/tree/_utils.pxd:1
  • sklearn/utils/_random.pyx:4

@2357juan
Copy link
Contributor
2357juan commented Mar 8, 2023

I will try to treat sklearn/utils/_random.pyx.

@Mike-gag
Copy link
Mike-gag commented Mar 9, 2023

taking a stab at sklearn/tree/_utils.pyx atm

@npache
Copy link
Contributor
npache commented Mar 13, 2023

Hi! I'll be working on _hashing_fast.pyx

@npache
Copy link
Contributor
npache commented Mar 13, 2023

Hi! I'll be working on _hashing_fast.pyx

hi @jjerphan ! looking to this function, is the fix necesary in this one? I think is the following line
cdef cnp.ndarray values = np.empty(capacity, dtype=dtype)

@jjerphan
Copy link
Member Author

Hi @npache, can you open a Pull Request for _hashing_fast.pyx so that we can discuss it independently of this issue? Thank you. 🙂

@jjerphan
Copy link
Member Author

💡 People interested starting working with Cython might be interested in reading Cython Best Practices, Conventions and Knowledge section of scikit-learn documentation.

@Mike-gag
Copy link

What tests should I run to check if my changes on sklearn/tree/_utils.pyx don't break anything?

@npache
Copy link
Contributor
npache commented Mar 19, 2023

Hi! I'll be working on _criterion.pyx

@astro-satyam75
Copy link

Hi all, I'll be working on _isotonic.pyx

@jjerphan
Copy link
Member Author

What tests should I run to check if my changes on sklearn/tree/_utils.pyx don't break anything?

I do not think you need to add tests.

@antara-gandhi
Copy link

are there any files left to work on @jjerphan ?

@jjerphan
Copy link
Member Author

sklearn/_isotonic.pyx still needs to be treated.

@antara-gandhi
Copy link

sklearn/_isot E0A9 onic.pyx still needs to be treated.

cool @jjerphan then i will claim this and start working on it!

@BabaYaga1221
Copy link
Contributor

@jjerphan I submit #26068 request for changing file sklearn/_isotonic.pyx. I will be grateful if you review the code and provide relevant suggestions. Thanks.

Best Regards,
Farhan

@thomasjpfan thomasjpfan removed Easy Well-defined and straightforward way to resolve good first issue Easy with clear instructions to resolve labels Apr 4, 2023
@jjerphan
Copy link
Member Author
jjerphan commented Apr 6, 2023

#26068 concluded the resolution of this issue.

Thanks to everyone involved!

@jjerphan jjerphan closed this as completed Apr 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Build / CI cython help wanted Meta-issue General issue associated to an identified list of tasks
Projects
None yet
Development

No branches or pull requests

12 participants
0