8000 TST: force dtype of arange to `int64` to not be platform dependent by navneetkumaryadav207001 · Pull Request #30793 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

TST: force dtype of arange to int64 to not be platform dependent #30793

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

navneetkumaryadav207001
Copy link

Reference Issues/PRs

Fixes #30782

What does this implement/fix? Explain your changes.

This pull request fixes a buffer dtype mismatch error that occurs on Windows when using the _py_sort function from the sklearn.tree._partitioner module. On Windows, np.arange by default produces an array with dtype int32, while the Cython code expects a buffer of type intp_t (which corresponds to np.intp, typically a 64-bit integer on 64-bit systems). This mismatch leads to a ValueError ("Buffer dtype mismatch, expected 'intp_t' but got 'long'").

To address this without requiring changes in user/test code, the _py_sort function has been modified to accept a generic Python object for the samples parameter and internally convert it to a contiguous array with dtype np.intp before proceeding with the sort. This ensures that the buffer passed to the Cython code is of the correct type, and users no longer need to manually specify the correct dtype.

Any other comments?

This fix should help Windows users avoid the buffer dtype mismatch issue when running tests or using scikit-learn on systems where the default integer type is int32. The change is minimal and localized within _py_sort, ensuring backward compatibility and ease of use for end users. Please let me know if any further adjustments or tests are needed.

Code Changed

Before:

def _py_sort(float32_t[::1] feature_values, intp_t[::1] samples, intp_t n):
    """Used for testing sort."""
    sort(&feature_values[0], &samples[0], n)

After:

def _py_sort(float32_t[::1] feature_values, object samples, intp_t n):
    """
    Used for testing sort.
    Automatically converts samples to a contiguous array of type np.intp.
    """
    # Convert samples to a contiguous numpy array with dtype=np.intp.
    samples = np.ascontiguousarray(samples, dtype=np.intp)
    # Create a memoryview of the converted array.
    cdef intp_t[::1] samples_view = samples
    # Now call the internal sort using the correctly typed memoryview.
    sort(&feature_values[0], &samples_view[0], n)

Copy link
github-actions bot commented Feb 8, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: ada9947. Link to the linter CI: here

@adrinjalali
Copy link
Member

@adam2392 could you maybe have a look here?

@@ -702,9 +702,15 @@ cdef inline void shift_missing_values_to_left_if_required(
best.pos += best.n_missing


def _py_sort(float32_t[::1] feature_values, intp_t[::1] samples, intp_t n):
"""Used for testing sort."""
sort(&feature_values[0], &samples[0], n)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of modify the code here, we should modify the code in the test function:

def test_sort_log2_build():
    """Non-regression test for gh-30554.

    Using log2 and log in sort correctly sorts feature_values, but the tie breaking is
    different which can results in placing samples in a different order.
    """
    rng = np.random.default_rng(75)
    some = rng.normal(loc=0.0, scale=10.0, size=10).astype(np.float32)
    feature_values = np.concatenate([some] * 5)
    samples = np.arange(50, dtype=np.float64)
    _py_sort(feature_values, samples, 50)
    # fmt: off
    # no black reformatting for this specific array
    expected_samples = [
        0, 40, 30, 20, 10, 29, 39, 19, 49,  9, 45, 15, 35,  5, 25, 11, 31,
        41,  1, 21, 22, 12,  2, 42, 32, 23, 13, 43,  3, 33,  6, 36, 46, 16,
        26,  4, 14, 24, 34, 44, 27, 47,  7, 37, 17,  8, 38, 48, 28, 18
    ]
    # fmt: on
    assert_array_equal(samples, expected_samples)

We need to force the dtype in the np.arange call and make sure that we always have np.int64 and that we are not platform dependent of the default int dtype.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. The bug described in #30782 states the issue is in the unit-test, so all we need to do is fix the testing function(s) that call _py_sort.

Before code hits Cython in sklearn/tree/* files, the types are always forced anyways…. Until we support C++ templates :p.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@navneetkumaryadav207001 can you revert these changes, and fix the issue where _py_sort is called?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes sir, i will surely look into this also sorry for the delay in reply.

@glemaitre glemaitre changed the title Fixed _py_sort() for numpy 1.26.4 TST: force dtype of arange to int64 to not be platform dependent Feb 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

_py_sort() returns ValueError on windows with numpy 1.26.4 but works correctly with numpy 2.x
4 participants
0