TST: force dtype of arange to `int64` to not be platform dependent #30793

navneetkumaryadav207001 · 2025-02-08T17:14:58Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

This pull request fixes a buffer dtype mismatch error that occurs on Windows when calling the _py_sort function from the sklearn.tree._partitioner module. On Windows with NumPy 1.x, np.arange produces an int32 array by default (the C long type is 32 bits there), while the Cython code expects a buffer of type intp_t (which corresponds to np.intp, a 64-bit integer on 64-bit systems). This mismatch raises a ValueError ("Buffer dtype mismatch, expected 'intp_t' but got 'long'").
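The mismatch can be inspected directly. The following is a minimal sketch, assuming only that NumPy is installed; the helper name `looks_like_intp_buffer` is hypothetical and only approximates the check Cython performs when binding an `intp_t[::1]` memoryview.

```python
import numpy as np


def looks_like_intp_buffer(arr):
    """Approximate check (hypothetical helper): 1-D, C-contiguous, np.intp dtype."""
    return (
        arr.ndim == 1
        and arr.flags.c_contiguous
        and arr.dtype == np.dtype(np.intp)
    )


# dtype follows the default integer of the platform and NumPy version.
default_samples = np.arange(50)
# dtype is np.intp regardless of platform.
forced_samples = np.arange(50, dtype=np.intp)

print(default_samples.dtype, looks_like_intp_buffer(default_samples))
print(forced_samples.dtype, looks_like_intp_buffer(forced_samples))
```

On 64-bit Windows with NumPy 1.x the first line is expected to report int32 and False, while the second line reports True on every platform.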

To address this without requiring changes in user/test code, the _py_sort function has been modified to accept a generic Python object for the samples parameter and internally convert it to a contiguous array with dtype np.intp before proceeding with the sort. This ensures that the buffer passed to the Cython code is of the correct type, and users no longer need to manually specify the correct dtype.
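One detail of this approach is worth checking: `np.ascontiguousarray` returns a new array whenever a dtype conversion is needed, so an in-place sort on the converted buffer would not reorder the caller's array. A minimal sketch in plain NumPy, independent of scikit-learn (the `int16` input is an assumption chosen so that a conversion happens on every platform):

```python
import numpy as np

# Input whose dtype differs from np.intp on every platform.
narrow = np.arange(5, dtype=np.int16)
converted = np.ascontiguousarray(narrow, dtype=np.intp)

# The conversion allocates a new buffer, so writes do not reach the input.
converted[0] = 99
print(np.shares_memory(narrow, converted))  # False
print(narrow[0])                            # 0

# When the dtype already matches, no copy is made.
native = np.arange(5, dtype=np.intp)
same = np.ascontiguousarray(native, dtype=np.intp)
print(np.shares_memory(native, same))       # True
```

If that reading is right, a test that inspects `samples` after the call would only see the sorted order when the input was already of dtype np.intp.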

Any other comments?

This fix should help Windows users avoid the buffer dtype mismatch issue when running tests or using scikit-learn on systems where the default integer type is int32. The change is minimal and localized within _py_sort, ensuring backward compatibility and ease of use for end users. Please let me know if any further adjustments or tests are needed.

Code Changed

Before:

def _py_sort(float32_t[::1] feature_values, intp_t[::1] samples, intp_t n):
    """Used for testing sort."""
    sort(&feature_values[0], &samples[0], n)

After:

def _py_sort(float32_t[::1] feature_values, object samples, intp_t n):
    """
    Used for testing sort.
    Automatically converts samples to a contiguous array of type np.intp.
    """
    # Convert samples to a contiguous numpy array with dtype=np.intp.
    samples = np.ascontiguousarray(samples, dtype=np.intp)
    # Create a memoryview of the converted array.
    cdef intp_t[::1] samples_view = samples
    # Now call the internal sort using the correctly typed memoryview.
    sort(&feature_values[0], &samples_view[0], n)

github-actions · 2025-02-08T17:16:16Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: ada9947.

adrinjalali · 2025-02-11T18:38:51Z

@adam2392 could you maybe have a look here?

glemaitre · 2025-02-11T23:37:45Z

sklearn/tree/_partitioner.pyx

@@ -702,9 +702,15 @@ cdef inline void shift_missing_values_to_left_if_required(
        best.pos += best.n_missing


-def _py_sort(float32_t[::1] feature_values, intp_t[::1] samples, intp_t n):
-    """Used for testing sort."""
-    sort(&feature_values[0], &samples[0], n)


Instead of modifying the code here, we should modify the code in the test function:

def test_sort_log2_build():
    """Non-regression test for gh-30554.

    Using log2 and log in sort correctly sorts feature_values, but the tie breaking is
    different which can result in placing samples in a different order.
    """
    rng = np.random.default_rng(75)
    some = rng.normal(loc=0.0, scale=10.0, size=10).astype(np.float32)
    feature_values = np.concatenate([some] * 5)
    samples = np.arange(50, dtype=np.int64)
    _py_sort(feature_values, samples, 50)
    # fmt: off
    # no black reformatting for this specific array
    expected_samples = [
        0, 40, 30, 20, 10, 29, 39, 19, 49, 9, 45, 15, 35, 5, 25, 11, 31,
        41, 1, 21, 22, 12, 2, 42, 32, 23, 13, 43, 3, 33, 6, 36, 46, 16,
        26, 4, 14, 24, 34, 44, 27, 47, 7, 37, 17, 8, 38, 48, 28, 18
    ]
    # fmt: on
    assert_array_equal(samples, expected_samples)

We need to force the dtype in the np.arange call so that we always get np.int64 and do not depend on the platform's default int dtype.
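A minimal sketch of what forcing the dtype looks like, assuming only NumPy. It also builds an np.intp array for comparison, since the PR description states that intp_t corresponds to np.intp; the two dtypes coincide on 64-bit platforms.

```python
import numpy as np

# Explicit dtype: the result no longer depends on the platform default int.
samples_int64 = np.arange(50, dtype=np.int64)

# np.intp tracks the pointer size: 64-bit on 64-bit builds, 32-bit on 32-bit builds.
samples_intp = np.arange(50, dtype=np.intp)

print(samples_int64.dtype, samples_int64.itemsize)
print(samples_intp.dtype, samples_intp.itemsize)
print(samples_int64.dtype == samples_intp.dtype)
```

The last line prints True on 64-bit builds. On a 32-bit build the two dtypes differ, so which one to force depends on what the Cython signature actually expects there.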

Agreed. The bug described in #30782 states the issue is in the unit-test, so all we need to do is fix the testing function(s) that call _py_sort.

Before code hits Cython in the sklearn/tree/* files, the types are always forced anyway... until we support C++ templates :p.

@navneetkumaryadav207001 can you revert these changes, and fix the issue where _py_sort is called?

Yes sir, I will look into this. Sorry for the delayed reply.

github-actions bot added cython module:tree labels Feb 8, 2025

glemaitre mentioned this pull request Feb 11, 2025

_py_sort() returns ValueError on windows with numpy 1.26.4 but works correctly with numpy 2.x #30782

Closed

glemaitre reviewed Feb 11, 2025


glemaitre changed the title ~~Fixed _py_sort() for numpy 1.26.4~~ TST: force dtype of arange to int64 to not be platform dependent Feb 11, 2025

navneetkumaryadav207001 closed this Feb 25, 2025

navneetkumaryadav207001 force-pushed the main branch from 6d4fbd2 to ada9947 February 25, 2025 13:39
