MAINT Use memory views in _hashing_fast.pyx#25574
OmarManzoor wants to merge 1 commit into scikit-learn:main
Conversation
Thank you for the PR! In this case, I do not think we can replace cnp.ndarray with a memory view, because of the dtype parameter.
  cdef Py_ssize_t capacity = 8192  # arbitrary
  cdef cnp.int64_t size = 0
- cdef cnp.ndarray values = np.empty(capacity, dtype=dtype)
+ cdef double[:] values = np.empty(capacity, dtype=np.float64)
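For context (not from the PR itself): a typed memoryview such as `double[:]` can only bind a buffer whose element type is exactly C double, which is why a runtime `dtype` argument cannot flow into that declaration. A small Python sketch of the buffer formats involved:

```python
import numpy as np

# A Cython `double[:]` view requires a buffer with format 'd' (C double).
# np.empty(capacity, dtype=dtype) can produce any element type, so binding
# it to a fixed-type view would fail at runtime for e.g. float32 input.
f64 = np.empty(8, dtype=np.float64)
f32 = np.empty(8, dtype=np.float32)

print(memoryview(f64).format)  # 'd' -> compatible with double[:]
print(memoryview(f32).format)  # 'f' -> would be rejected by double[:]
```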
We likely need to keep this as cnp.ndarray because of how the dtype argument is passed in.
With this PR, _hashing_transform would always return float64 values. Later, when the csr_matrix is formed, it will make a copy if self.dtype == np.float32:
scikit-learn/sklearn/feature_extraction/_hash.py, lines 185 to 189 in 4f85597
With the current code in main and dtype=np.float32, the values would be float32 and csr_matrix does not need to make a copy.
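The copy behaviour described above can be sketched with toy data (the arrays here are made up for illustration; only SciPy's csr_matrix constructor is assumed):

```python
import numpy as np
import scipy.sparse as sp

indices = np.array([0, 2], dtype=np.int32)
indptr = np.array([0, 2], dtype=np.int32)

# If the values are float64 but the requested dtype is float32,
# csr_matrix must allocate a converted copy of the data.
data64 = np.array([1.0, 2.0], dtype=np.float64)
X = sp.csr_matrix((data64, indices, indptr), shape=(1, 3), dtype=np.float32)
assert X.data.dtype == np.float32
assert X.data is not data64  # the dtype conversion forced a copy

# Values already produced in the requested dtype need no conversion.
data32 = np.array([1.0, 2.0], dtype=np.float32)
Y = sp.csr_matrix((data32, indices, indptr), shape=(1, 3), dtype=np.float32)
assert Y.data.dtype == np.float32
```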
I see. Does that mean dtype can only be float32 or float64? In one of the tests I observed that the dtype passed to the fast hashing transform function was a string.
Since we declare value as a double, I think a fused type like floating would also not be entirely suitable.
I think the dtype can be anything: float32, float64, int32, etc. I do not think it's worth fusing over every type, because we could miss a dtype someone might be using.
It's okay for value to be a double. When setting values[size] = value, Cython will do a C-style cast.
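As a NumPy analogue of that implicit cast (not the Cython code itself): storing a Python float (a C double) into buffers of other dtypes converts it on assignment, truncating toward zero for integer targets just as a C cast would.

```python
import numpy as np

value = 3.7  # stands in for the C double computed in the Cython code

for dt in (np.float64, np.float32, np.int32):
    values = np.empty(1, dtype=dt)
    values[0] = value  # cast to the buffer's dtype on assignment
    print(dt.__name__, values[0])
```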
Sounds good. I am going to close this PR, as no change needs to be made here.
Reference Issues/PRs
Towards: #25484
What does this implement/fix? Explain your changes.
Any other comments?
CC: @Vincent-Maladiere @jjerphan @adam2392