DOC Clarify how mixed array input types handled in array api by lucyleeow · Pull Request #31452 · scikit-learn/scikit-learn · GitHub

DOC Clarify how mixed array input types handled in array api #31452


Open
wants to merge 3 commits into base: main

Conversation

@lucyleeow (Member) commented May 29, 2025

Reference Issues/PRs

Update documentation now that we have reached consensus on these issues:

closes #31286 (I think no more work is required for this, as currently all converted metrics output arrays in the same namespace/device as the input)

Towards #28668
Towards #31274

I have not said "closes" for these two because we would still need to implement conversion of mixed inputs.

What does this implement/fix? Explain your changes.

Clarify in docs:

  • everything follows X for estimators
  • everything follows y_pred for scoring functions
  • when a scoring function outputs several values, an array in the same namespace and on the same device as the input is output

I realise that this update to the docs is not yet true, but I think this is okay as I don't think this will get into 1.7, so it will be in the dev docs only, and I think we will have implemented it by the next release.
Otherwise, happy to leave this PR and implement the mixed input type conversion first.
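
To make the convention concrete, here is a hedged sketch of the "everything follows X" rule for estimators. It is illustrative only: it assumes a CUDA-enabled torch build, `array_api_dispatch` enabled, and the mixed-input conversion described above, which is not fully implemented yet.

```python
# Hedged sketch of the "everything follows X" rule for estimators.
# Assumes a CUDA device is available; illustrative only, since the
# mixed-input conversion described above is not fully implemented yet.
import numpy as np
import torch
import sklearn
from sklearn.linear_model import Ridge

sklearn.set_config(array_api_dispatch=True)

X = torch.randn(100, 3, device="cuda")          # torch array on GPU
y = np.random.default_rng(0).normal(size=100)   # NumPy array on CPU

# y is converted to match X (torch, CUDA); fitted attributes such as
# coef_ are stored in the same namespace and on the same device as X.
model = Ridge(solver="svd").fit(X, y)
```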

Any other comments?

I tried to give reasons as to why we made these decisions. Hopefully it is clear and concise (and correct...), but happy to change/fix.

cc @ogrisel @betatim maybe @StefanieSenger

github-actions bot commented May 29, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 5833e9a. Link to the linter CI: here

@StefanieSenger (Contributor) left a comment

Thanks for your work @lucyleeow!

I like the clarity of how this new addition is written and I am sure it will help many users. I have left a few minor comments, and I have some more general comments.

  1. What about starting this section with an easy-to-grasp takeaway for our users' mental model, one that took us months to uncover: clearly stating up front that other array inputs follow X and y_pred in terms of array library and device?

  2. Another piece of feedback would be to rethink the heading of this paragraph. It now talks not only about return types, but also about the internal processing of inputs that follow X and y_pred. Users interested in that might not expect this information under this heading.

  3. And there is a sub-part that I do not understand (and which might also confuse users): in lines 194-212 you showcase an example where X's namespace is changed as it goes through a pipeline, and several things are unclear to me:

  • When a pipeline starts out with user input where X and y are both on CPU, I would not expect scikit-learn to automatically move anything to GPU for intermediate steps, since we don't even know whether users have a GPU...
  • At first sight it seems that the "y follows X" principle is broken here, but upon thinking about it I could see that this part is meant to demonstrate exactly that: when y is in a different namespace than X at the start of a new pipeline step, it follows X without the user needing to take care of it. But right now there is only a very subtle hint at this in the paragraph and the idea is not clearly expressed.
  • When you write "Note that scikit-learn pipelines do not allow transformation of y (to avoid :ref:`leakage <data_leakage>`).": does that even include changes to the namespace y is stored in? Is that the reason you bring this up here?
  • I wonder why X and y both need to be on CPU for TargetEncoder. Is there a technical reason, or a more goal-oriented one in terms of performance?

Oh my, that has turned into a lot of feedback. 😬 (sorry)

Comment on lines +216 to 218
The `predict` and `transform` method subsequently expect
inputs from the same array library and device as the data passed to the `fit`
method.
Contributor

Maybe mention that as of now, scikit-learn does not interfere with how array libraries handle arrays from a different namespace or device passed to predict or transform. Something like:

"If arrays from a different library or on a different device are passed, behavior depends on the array library: it may raise an error or silently convert them."

Member Author

Sorry, I am confused here: why would an array library be doing things within our predict and transform methods? 🤔

@StefanieSenger (Contributor) commented May 30, 2025

When arrays from different namespaces are passed into a function, the namespace that the function comes from determines the handling. For instance, in `xp.add(array1, array2)` the namespace of `xp` could be torch, with `array1` being a torch array but `array2` not. Then torch will handle this or raise an error.
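
For illustration, a hedged sketch of this point (the explicit conversion at the end is just one way to avoid the ambiguity; nothing here is scikit-learn API):

```python
# When arrays from different namespaces meet, the library whose function
# is called decides what happens; scikit-learn does not interfere here.
import numpy as np
import torch

xp = torch                       # the namespace whose function we call
array1 = xp.asarray([1.0, 2.0])  # native to xp
array2 = np.asarray([3.0, 4.0])  # "foreign" NumPy array

# Calling xp.add(array1, array2) directly may implicitly convert the
# NumPy array or raise, depending on the library and version.
# Converting explicitly first sidesteps that ambiguity:
result = xp.add(array1, xp.asarray(array2))
```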

Member Author

Okay, you've raised some good points 🤔 I'm going to raise these back in the issue.

Comment on lines +225 to +228
:class:`~sklearn.linear_model.Ridge`. `y` cannot be transformed by the pipeline
(recall scikit-learn pipelines do not allow transformation of `y`) but as
:class:`~sklearn.linear_model.Ridge` is able to accept mixed input types,
this is not a problem and the pipeline is able to be run.
Member Author

At first sight it seems that the "y follows X" principle is broken here, but upon thinking about it I could see that this part is meant to demonstrate exactly that: when y is in a different namespace than X at the start of a new pipeline step, it follows X without the user needing to take care of it. But right now there is only a very subtle hint at this in the paragraph and the idea is not clearly expressed.

@StefanieSenger this is the only part that explicitly talks about this. I tried to make it clearer, but I am open to suggestions to improve it. Thanks!

@lucyleeow (Member Author) commented May 30, 2025

These are very good suggestions. I have tried to address them all.

I would not expect scikit-learn to automatically move anything to GPU for intermediate steps, since we don't even know whether users have a GPU...

Sorry, I have made it more explicit now. The function transformer step is `FunctionTransformer(partial(torch.asarray, device="cuda:0"))`, which explicitly moves X to GPU.
For reference, I was mostly copying Olivier's example here: #31274 (comment)

I wonder why X and y both need to be on CPU for TargetEncoder. Is there a technical reason, or a more goal-oriented one in terms of performance?

In my mind, the categorical (string) data in X is being target encoded to numbers by TargetEncoder, so X has to be on CPU (as it contains string data), and y "needs" to be on the same device as X for the estimator to work (TargetEncoder also uses y) - though if we are doing "everything follows X", y technically doesn't need to be on CPU. I will try to make things clearer. I was worried I was going into too much detail and making things too long 🤔
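
For reference, a rough sketch of the pipeline being discussed (adapted from the example linked above; it assumes a CUDA-enabled torch build, `array_api_dispatch` enabled, and the mixed-input handling this PR documents, which is not fully implemented yet):

```python
# Rough, hedged sketch of the CPU-to-GPU pipeline discussed in this thread.
from functools import partial

import torch
import sklearn
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, TargetEncoder

sklearn.set_config(array_api_dispatch=True)

pipeline = make_pipeline(
    # Runs on CPU: X holds categorical (string) data and y is used for the
    # encoding, so both are plain NumPy/pandas inputs at this point.
    TargetEncoder(),
    # Explicitly moves the now-numerical X to the GPU; y is left alone
    # because pipelines never transform y.
    FunctionTransformer(partial(torch.asarray, device="cuda")),
    # Receives X on GPU and y still on CPU; under "everything follows X",
    # y would be converted to match X here.
    Ridge(solver="svd"),
)
```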

When you write "Note that scikit-learn pipelines do not allow transformation of y (to avoid :ref:`leakage <data_leakage>`).": does that even include changes to the namespace y is stored in? Is that the reason you bring this up here?

Yes, pipelines can't touch y.

Comment on lines +184 to +186
Estimators and scoring functions are able to accept input arrays
from different array libraries and/or devices. When mixed input arrays are
passed, scikit-learn silently converted to make them all consistent.
Member

I've just realised that this means that I can/should/could call some_estimator.fit(X_torch_on_gpu0, y_numpy, sample_weights=weights_cupy_gpu1) or some other more obscure combination. Feels like mayhem, but I guess that is "fine"? Maybe somewhere in a guide we can mention that this is technically allowed but that maybe people shouldn't do this?

Contributor

Suggestion (copied and slightly adjusted from the numpy docs):

.. warning::
  While the mixing of array types from different array libraries may be convenient, it is 
  not recommended. It might have unexpected behavior in corner cases. Users should 
  prefer explicitly passing matching types whenever possible.

Comment on lines +184 to +186
Estimators and scoring functions are able to accept input arrays
from different array libraries and/or devices. When mixed input arrays are
passed, scikit-learn silently converted to make them all consistent.
Member

Suggested change
Estimators and scoring functions are able to accept input arrays
from different array libraries and/or devices. When mixed input arrays are
passed, scikit-learn silently converted to make them all consistent.
Estimators and scoring functions are able to accept input arrays
from different array libraries and/or devices. When a mixed set of input arrays is
passed, scikit-learn converts arrays as needed to make them all consistent.

What do you think?

passed, scikit-learn silently converted to make them all consistent.

For estimators, the rule is **"everything follows `X`"** - mixed array inputs are
converted so they are all match the array library and device of `X`.
Member

Suggested change
converted so they are all match the array library and device of `X`.
converted so that they all match the array library and device of `X`.

For estimators, the rule is **"everything follows `X`"** - mixed array inputs are
converted so they are all match the array library and device of `X`.
For scoring functions the rule is **"everything follows `y_pred`"** - mixed array
inputs are converted so they are all match the array library and device of `y_pred`.
Member

Suggested change
inputs are converted so they are all match the array library and device of `y_pred`.
inputs are converted so that they all match the array library and device of `y_pred`.

For scoring functions the rule is **"everything follows `y_pred`"** - mixed array
inputs are converted so they are all match the array library and device of `y_pred`.

When a function or method has been called with Array API compatible inputs, the
convention is to return array values of the same array container type and
Member

How about using "array container type" as a way to make it clear that we are talking about numpy, cupy, pytorch, etc arrays and not "an array of float32s"?

The sentences above are already quite long, so I'm not sure if we should do this.

Contributor

I find the expression "array container type" (which is already in the sentence) quite vague. In other parts of the docs we have talked about "array library" and "array namespace", so maybe like that (?):

Suggested change
convention is to return array values of the same array container type and
convention is to return a type from the same array library and


When an estimator is fitted with an Array API compatible `X`, all other
array inputs, including constructor arguments, (e.g., `y`, `sample_weight`,
`weights_init`) will be silently converted to match the array library and
Member

Suggested change
`weights_init`) will be silently converted to match the array library and
`weights_init`) will be converted to match the array library and

"silently" makes me worried we will have the "explicit is better than implicit" police coming round to complain. And I can convince myself that words ending in -ly are not needed anyway :D

Comment on lines +206 to +209
This allows estimators to accept mixed input types, enabling `X` to be moved
to a different device within a pipeline, without explicitly moving `y`.
Note that scikit-learn pipelines do not allow transformation of `y` (to avoid
:ref:`leakage <data_leakage>`).
Member

Suggested change
This allows estimators to accept mixed input types, enabling `X` to be moved
to a different device within a pipeline, without explicitly moving `y`.
Note that scikit-learn pipelines do not allow transformation of `y` (to avoid
:ref:`leakage <data_leakage>`).
This behaviour enables pipelines to switch from processing on
the CPU to processing on the GPU at a specific point in the pipeline.

This is still a bit clunky :-/ I think it is enough to mention the use-case and then show the example below, which contains a longer explanation/reminder about the fact that you can't move y.

`partial(torch.asarray, device="cuda")`, which moves `X` to GPU, to improve
performance in the next step.
* :class:`~sklearn.linear_model.Ridge`, whose performance can be improved when
passed arrays on GPU, as it only takes numerical inputs.
Member

"on GPU", "on a GPU", "on the GPU"? Not sure which one is right/sounds best (same applies to "on CPU" above) - do you have a preference?

Member

I have a slight preference for "on a GPU".

~~~~~~~~~~~~~~~~~

When an Array API compatible `y_pred` is passed to a scoring function,
all other array inputs (e.g., `y_true`, `sample_weight`) will be silently converted
Member

Suggested change
all other array inputs (e.g., `y_true`, `sample_weight`) will be silently converted
all other array inputs (e.g., `y_true`, `sample_weight`) will be converted
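
For illustration, a hedged sketch of the "everything follows `y_pred`" convention being documented here (illustrative only; as noted in the PR description, the mixed-input conversion is not fully implemented yet):

```python
# Hedged sketch of the "everything follows y_pred" rule for scoring functions.
import numpy as np
import torch
import sklearn
from sklearn.metrics import r2_score

sklearn.set_config(array_api_dispatch=True)

y_pred = torch.asarray([0.1, 1.9, 3.2])   # torch tensor (CPU here)
y_true = np.asarray([0.0, 2.0, 3.0])      # NumPy array

# Under the documented convention, y_true (and sample_weight, if given)
# would be converted to match y_pred's array library and device, and any
# array output would be returned in y_pred's namespace/device.
score = r2_score(y_true, y_pred)
```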

@ogrisel (Member) left a comment

Some partial feedback.

For scoring functions the rule is **"everything follows `y_pred`"** - mixed array
inputs are converted so they are all match the array library and device of `y_pred`.

When a function or method has been called with Array API compatible inputs, the
@ogrisel (Member) commented Jun 2, 2025

Nitpick: I cannot find the discussion anymore, but I think the official casing style for the spec name is "array API" (or the "Python array API") instead of "Array API".

Suggested change
When a function or method has been called with Array API compatible inputs, the
When a function or method has been called with array API compatible inputs, the

At least it's the casing style of the main header of the official spec site: https://data-apis.org/array-api/latest/


@StefanieSenger (Contributor) left a comment

These are cool improvements, thank you @lucyleeow!
Upon reading through, I came across some minor things that could be improved.


Similarly, when an estimator is fitted with Array API compatible inputs, the
fitted attributes will be arrays from the same library as the input and stored
on the same device. The `predict` and `transform` method subsequently expect
For details and reasons behind this choice, see below.
Contributor

I believe that is not really needed here.

Suggested change
For details and reasons behind this choice, see below.

Comment on lines +216 to +217
* :class:`~sklearn.preprocessing.FunctionTransformer`, with the function
`partial(torch.asarray, device="cuda")`, which moves `X` to GPU, to improve
Contributor

I am not sure if it renders correctly, but in my mind this is a bit more obvious (and spares people from looking up the usage of FunctionTransformer before they can read on):

Suggested change
* :class:`~sklearn.preprocessing.FunctionTransformer`, with the function
`partial(torch.asarray, device="cuda")`, which moves `X` to GPU, to improve
* :class:`~sklearn.preprocessing.FunctionTransformer`
(func=partial(torch.asarray, device="cuda")), which moves `X` to GPU, to improve

`partial(torch.asarray, device="cuda")`, which moves `X` to GPU, to improve
performance in the next step.
* :class:`~sklearn.linear_model.Ridge`, whose performance can be improved when
passed arrays on GPU, as it only takes numerical inputs.
Contributor

On "as it only takes numerical inputs": that is not making for improved performance in GPU compared to CPU. We could instead name an actual reason (some matrix computation going on within Ridge) or leave this out completely?

When an Array API compatible `y_pred` is passed to a scoring function,
all other array inputs (e.g., `y_true`, `sample_weight`) will be silently converted
to match the array library and device of `y_pred`, if they do not already.
This allows scoring functions to accept mixed input types, which enables it to be
Contributor

Suggested change
This allows scoring functions to accept mixed input types, which enables it to be
This allows scoring functions to accept mixed input types, which enables them to be
