8000 MAINT Use memory views in _hashing_fast.pyx by OmarManzoor · Pull Request #25574 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

MAINT Use memory views in _hashing_fast.pyx #25574

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion sklearn/feature_extraction/_hashing_fast.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -35,7 +35,7 @@ def transform(raw_X, Py_ssize_t n_features, dtype,
# and values arrays ourselves. Use a Py_ssize_t capacity for safety.
cdef Py_ssize_t capacity = 8192 # arbitrary
cdef cnp.int64_t size = 0
cdef cnp.ndarray values = np.empty(capacity, dtype=dtype)
cdef double[:] values = np.empty(capacity, dtype=np.float64)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We likely need to keep this as cnp.ndarray because of how the dtype argument is passed in.

With this PR, _hashing_transform would always return a float64. Later, when the csr_matrix is formed, it will make a copy if self.dtype=np.float32:

X = sp.csr_matrix(
(values, indices, indptr),
dtype=self.dtype,
shape=(n_samples, self.n_features),
)

With the current code in main and dtype=np.float32, the values would be float32 and csr_matrix does not need to make a copy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. That means that dtype can only be float32 or float64? In one of the tests I observed that the dtype passed to the fast hash transform function was a string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we declare value as a double I think a fused type like floating would also not be totally suitable.

Copy link
Member
@thomasjpfan thomasjpfan Feb 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the dtype can be anything. float32, float64, int32, etc. I do not think it's worth fusing everything because we can miss a dtype someone might be using.

It's okay for value to be double. When setting value[size] = value, it would do C-style casting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I am going to close this PR as no change needs to be made here.


for x in raw_X:
for f, v in x:
Expand Down
0