-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
MAINT Use memory views in _hashing_fast.pyx #25574
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR! In this case, I do not think we can use cnp.ndarray
because of the dtype parameter.
@@ -35,7 +35,7 @@ def transform(raw_X, Py_ssize_t n_features, dtype, | |||
# and values arrays ourselves. Use a Py_ssize_t capacity for safety. | |||
cdef Py_ssize_t capacity = 8192 # arbitrary | |||
cdef cnp.int64_t size = 0 | |||
cdef cnp.ndarray values = np.empty(capacity, dtype=dtype) | |||
cdef double[:] values = np.empty(capacity, dtype=np.float64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We likely need to keep this as cnp.ndarray
because of how the dtype
argument is passed in.
With this PR, _hashing_transform
would always return a float64
. Later, when the csr_matrix
is formed, it will make a copy if self.dtype=np.float32
:
scikit-learn/sklearn/feature_extraction/_hash.py
Lines 185 to 189 in 4f85597
X = sp.csr_matrix( | |
(values, indices, indptr), | |
dtype=self.dtype, | |
shape=(n_samples, self.n_features), | |
) |
With the current code in main
and dtype=np.float32
, the values
would be float32
and csr_matrix
does not need to make a copy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. That means that dtype can only be float32 or float64? In one of the tests I observed that the dtype passed to the fast hash transform function was a string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we declare value as a double I think a fused type like floating would also not be totally suitable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the dtype can be anything. float32
, float64
, int32
, etc. I do not think it's worth fusing everything because we can miss a dtype someone might be using.
It's okay for value
to be double. When setting value[size] = value
, it would do C-style casting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. I am going to close this PR as no change needs to be made here.
Reference Issues/PRs
Towards: #25484
What does this implement/fix? Explain your changes.
Any other comments?
CC: @Vincent-Maladiere @jjerphan @adam2392