-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
TST: force dtype of arange to int64
to not be platform dependent
#30793
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TST: force dtype of arange to int64
to not be platform dependent
#30793
Conversation
@adam2392 could you maybe have a look here? |
sklearn/tree/_partitioner.pyx
Outdated
@@ -702,9 +702,15 @@ cdef inline void shift_missing_values_to_left_if_required( | |||
best.pos += best.n_missing | |||
|
|||
|
|||
def _py_sort(float32_t[::1] feature_values, intp_t[::1] samples, intp_t n): | |||
"""Used for testing sort.""" | |||
sort(&feature_values[0], &samples[0], n) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of modify the code here, we should modify the code in the test function:
def test_sort_log2_build():
"""Non-regression test for gh-30554.
Using log2 and log in sort correctly sorts feature_values, but the tie breaking is
different which can results in placing samples in a different order.
"""
rng = np.random.default_rng(75)
some = rng.normal(loc=0.0, scale=10.0, size=10).astype(np.float32)
feature_values = np.concatenate([some] * 5)
samples = np.arange(50, dtype=np.float64)
_py_sort(feature_values, samples, 50)
# fmt: off
# no black reformatting for this specific array
expected_samples = [
0, 40, 30, 20, 10, 29, 39, 19, 49, 9, 45, 15, 35, 5, 25, 11, 31,
41, 1, 21, 22, 12, 2, 42, 32, 23, 13, 43, 3, 33, 6, 36, 46, 16,
26, 4, 14, 24, 34, 44, 27, 47, 7, 37, 17, 8, 38, 48, 28, 18
]
# fmt: on
assert_array_equal(samples, expected_samples)
We need to force the dtype
in the np.arange
call and make sure that we always have np.int64
and that we are not platform dependent of the default int
dtype.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. The bug described in #30782 states the issue is in the unit-test, so all we need to do is fix the testing function(s) that call _py_sort
.
Before code hits Cython in sklearn/tree/*
files, the types are always forced anyways…. Until we support C++ templates :p.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@navneetkumaryadav207001 can you revert these changes, and fix the issue where _py_sort
is called?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes sir, i will surely look into this also sorry for the delay in reply.
int64
to not be platform dependent
6d4fbd2
to
ada9947
Compare
Reference Issues/PRs
Fixes #30782
What does this implement/fix? Explain your changes.
This pull request fixes a buffer dtype mismatch error that occurs on Windows when using the
_py_sort
function from thesklearn.tree._partitioner
module. On Windows,np.arange
by default produces an array with dtypeint32
, while the Cython code expects a buffer of typeintp_t
(which corresponds tonp.intp
, typically a 64-bit integer on 64-bit systems). This mismatch leads to a ValueError ("Buffer dtype mismatch, expected 'intp_t' but got 'long'").To address this without requiring changes in user/test code, the
_py_sort
function has been modified to accept a generic Python object for thesamples
parameter and internally convert it to a contiguous array with dtypenp.intp
before proceeding with the sort. This ensures that the buffer passed to the Cython code is of the correct type, and users no longer need to manually specify the correct dtype.Any other comments?
This fix should help Windows users avoid the buffer dtype mismatch issue when running tests or using scikit-learn on systems where the default integer type is
int32
. The change is minimal and localized within_py_sort
, ensuring backward compatibility and ease of use for end users. Please let me know if any further adjustments or tests are needed.Code Changed
Before:
After: