-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
MAINT Improve scikit-learn reliability by replacing cnp.ndarray
s with typed memoryviews
#25484
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@OmarManzoor: you might be interested in this issue. |
I would like to try out sklearn/tree/_tree.pyx which is also remaining in #24875. |
@jjerphan , is it okay if I can treat some files ? |
Yes sure! Everyone is welcome to help resolving this issue. |
@jpangas: thank you for the initiative. Unfortunately scikit-learn does not use assignees on GitHub (especially for meta-issues like this one). |
Sure, thanks for the clarification @jjerphan. |
Hi @jpangas, the PR you mentioned is almost five years old (2018). It has indeed been reverted lastly here because up to recently, Cython didn't allow fusetyped memoryviews to be const-qualified. We need input memoryviews to be const-qualified so that read-only arrays won't raise an error at runtime. Long story short, you can now safely convert those back to memoryviews 😉 |
Makes sense. I just followed up on the conversation in the PR you shared. Thanks @Vincent-Maladiere . |
Note that |
sklearn/utils/_vector_sentinel.pyx and sklearn/utils/_vector_sentinel.pxd seem to be of a simil
8000
ar pattern to the instances retained inside _tree. So I think we can leave these out. scikit-learn/sklearn/utils/_vector_sentinel.pyx Lines 107 to 120 in 4f85597
|
Yes, we need to use |
@OmarManzoor If you are interested, I would prioritize |
is the list provided in the issue containing all files that need updating up to date? I would like to help out |
A number of the files might already have been completed. Would you like to try out
|
I will try to treat sklearn/utils/_random.pyx. |
taking a stab at sklearn/tree/_utils.pyx atm |
Hi! I'll be working on _hashing_fast.pyx |
hi @jjerphan ! looking to this function, is the fix necesary in this one? I think is the following line |
Hi @npache, can you open a Pull Request for |
💡 People interested starting working with Cython might be interested in reading Cython Best Practices, Conventions and Knowledge section of scikit-learn documentation. |
What tests should I run to check if my changes on sklearn/tree/_utils.pyx don't break anything? |
Hi! I'll be working on _criterion.pyx |
Hi all, I'll be working on _isotonic.pyx |
I do not think you need to add tests. |
are there any files left to work on @jjerphan ? |
|
cool @jjerphan then i will claim this and start working on it! |
#26068 concluded the resolution of this issue. Thanks to everyone involved! |
Context
Using
cnp.ndarray
is not encouragged and using typed memoryviews must be preferred. More preciselyconst
-qualified memoryview must be used when the buffer of the NumPy array is readonly (some Cython interfaces previously have had support for readonly data usingcnp.ndarray
as a workaround).On a side-note, better uniform canonical support of readonly buffers has notable value: some frameworks like Ray or libraries like joblib make use of memory mapping which change the writability of buffers. Yet it might crash in some context if scikit-learn's implementations have no clear support for readonly data and this is fatal to users' workflows.
In overall, we have a poor overview for such support but we must. Efforts in this meta-issue will improve the overview for this support.
As an example, the following pattern:
must be changed to:
where
T
is a concrete type or a fused type.See occurences of
cnp.ndarray
in the codebase.💡 Do note, as mentioned by @thomasjpfan in #25484 (comment), that not all occurences of
cnp.ndarray
must be removed.💡 People interested starting working with Cython might be interested in reading Cython Best Practices, Conventions and Knowledge section of scikit-learn documentation.
Proposed solution: treat one file per PR
source_file.ext
is the file being treatedThe following command allows to see which files needs to be treated:
As of b69abf5, the following files need to be treated:
Miscellaneous
#24875 is also linked to this PR regarding using the newest NumPy C API.
The text was updated successfully, but these errors were encountered: