
Feature request: an additional config context for forcing conversion toward a specific Array API-compatible array library. #25000


Open
fcharras opened this issue Nov 21, 2022 · 12 comments
Labels
Array API · Needs Decision - Include Feature · New Feature

Comments

@fcharras
Contributor
fcharras commented Nov 21, 2022

Describe the workflow you want to enable

Per #22554, the following workflow is possible:

from sklearn.datasets import make_classification
from sklearn import config_context
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
import cupy.array_api as xp

X_np, y_np = make_classification(random_state=0)
X_cu = xp.asarray(X_np)
y_cu = xp.asarray(y_np)
X_cu.device  # confirms the data now lives on the GPU device


with config_context(array_api_dispatch=True):
    lda = LinearDiscriminantAnalysis()
    X_trans = lda.fit_transform(X_cu, y_cu)

(see https://scikit-learn.org/dev/modules/array_api.html)

I would like to additionally enable the following workflow:

X_np, y_np = make_classification(random_state=0)

with config_context(array_api_dispatch=True, array_api_namespace=xp):
    lda = LinearDiscriminantAnalysis()
    X_trans = lda.fit_transform(X_np, y_np)

where X_trans is the same in both examples, and in both examples the compute is done with cupy. The difference is that we pass numpy inputs and rely on the estimator to convert them using the appropriate array namespace.

The added value for the user is that the check_array function could be used to load arbitrary inputs into the array library, relying on its xp.asarray method, in a way that is consistent with sklearn.utils.validation.check_array's requirements across all array libraries.

Currently, a user that has data formatted as, let's say, a list of lists, has two choices:

  • either pass the list of lists directly to lda.fit, but then, since a list of lists does not have an __array_namespace__ attribute, it will fall back to the numpy namespace (illustrated in the snippet after this list);
  • or convert the input to cupy first, but then, unless the user methodically re-implements check_array with cupy conversion, the behavior of the conversion might differ from what check_array would do when converting to numpy; and in any case, re-implementing check_array is not very efficient to begin with.
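
A minimal illustration of that fallback (assuming cupy is installed; dispatch keys on the presence of the __array_namespace__ attribute):

data = [[0.0, 1.0], [2.0, 3.0]]
print(hasattr(data, "__array_namespace__"))  # False -> dispatch falls back to numpy

import cupy.array_api as xp

X_cu = xp.asarray(data)
print(hasattr(X_cu, "__array_namespace__"))  # True -> dispatch picks cupy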

Edit: following the discussion below, adding examples that highlight the added value such an interface could bring to memory management:

X_np, y_np = my_data_loader()  # let's assume that X_np, y_np are typed `np.int64`...

# as a consequence, here the cupy arrays are also typed `int64`
X_cu = xp.asarray(X_np)
y_cu = xp.asarray(y_np)
X_cu.device

with config_context(array_api_dispatch=True):
    lda = LinearDiscriminantAnalysis()
    # But `lda` requires `floats` to work
    # (see https://github.com/scikit-learn/scikit-learn/blob/f3f51f9b6/sklearn/discriminant_analysis.py#L551)
    # so, implicitly, an additional copy of X_cu and y_cu in `float` is going to be created,
    # resulting in twice the memory allocation for the data.
    X_trans = lda.fit_transform(X_cu, y_cu)

For the unwary user, this workflow is safer:

X_np, y_np = my_data_loader()  # let's assume that X_np, y_np are typed `np.int64`...

with config_context(array_api_dispatch=True, array_api_namespace=xp):
    lda = LinearDiscriminantAnalysis()
    # X_np and y_np are again implicitly converted to cupy float arrays, but
    # the intermediate copy the user would otherwise create is skipped (or at
    # worst, deleted right after creation), resulting in better memory management.
    X_trans = lda.fit_transform(X_np, y_np)

Describe your proposed solution

The get_namespace function can query the config_context and force the use of the __array_namespace__ that is set by the config rather than the __array_namespace__ attached to the input (if any). (A rather simple solution.)
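
For illustration, a simplified sketch of what that lookup could look like. The array_api_namespace config key is hypothetical at this point, and the real get_namespace in sklearn.utils._array_api is more involved (it wraps numpy rather than returning it directly):

import numpy
from sklearn import get_config

def get_namespace(*arrays):
    config = get_config()
    if not config.get("array_api_dispatch", False):
        return numpy, False
    # Proposed: a namespace forced through the config context takes
    # precedence over the namespace attached to the inputs (if any).
    forced_xp = config.get("array_api_namespace")  # hypothetical key
    if forced_xp is not None:
        return forced_xp, True
    # Existing behavior: infer the namespace from the inputs themselves.
    namespaces = {
        x.__array_namespace__()
        for x in arrays
        if hasattr(x, "__array_namespace__")
    }
    if len(namespaces) > 1:
        raise ValueError("Inputs belong to different array namespaces.")
    if namespaces:
        return namespaces.pop(), True
    return numpy, False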

@fcharras fcharras added the Needs Triage and New Feature labels Nov 21, 2022
@thomasjpfan
Member
thomasjpfan commented Nov 21, 2022

When dealing with the Array API, I prefer not to support a list of lists at all. For a list of lists, one can convert it into a NumPy array with check_array, convert it to a cupy array, and pass it into scikit-learn. If this was the only use case for the feature, I am -1 on the feature.
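
For reference, the conversion path described above, sketched with the existing public APIs (assuming cupy is installed):

from sklearn.utils.validation import check_array
import cupy.array_api as xp

X_list = [[0.0, 1.0], [2.0, 3.0]]
X_np = check_array(X_list)  # list of lists -> validated numpy array
X_cu = xp.asarray(X_np)     # explicit host-to-device conversion by the user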

The other use case for adding array_api_namespace is to make the UX nicer for NumPy arrays. I feel like this is a nicer UX overall, but it adds device-to-device transfer to scikit-learn, which increases our scope. Currently, device-to-device transfer is handled by the user. I am overall +0.3 on this.

@thomasjpfan thomasjpfan added the Needs Decision - Include Feature label and removed the Needs Triage label Nov 21, 2022
@betatim
Member
betatim commented Nov 22, 2022

It isn't clear to me why

X_np, y_np = make_classification(random_state=0)

with config_context(array_api_dispatch=True, array_api_namespace=xp):
    lda = LinearDiscriminantAnalysis()
    X_trans = lda.fit_transform(X_np, y_np)

is nicer for end users than:

X_np, y_np = make_classification(random_state=0)

X_cp = cp.array_api.asarray(X_np)
y_cp = cp.array_api.asarray(y_np)

with config_context(array_api_dispatch=True):
    lda = LinearDiscriminantAnalysis()
    X_trans = lda.fit_transform(X_cp, y_cp)

Both require the user to have quite a bit of knowledge about the underlying Array API specifics. For example, you have to know that you need to use cupy.array_api.asarray to create your array, not cupy.array. Similarly, you have to know that you need to pass cupy.array_api, and not cupy, to the array_api_namespace argument.

I agree it is annoying that a user who has a GPU and cupy installed, and who is running code that can take advantage of all this, will still get the CPU path. However, I think this is how the Array API is meant to work, so it is a feature and not a bug. (One reason why scikit-learn's engine mechanism should keep asking each engine whether it wants to handle a situation, instead of deducing it from the types ;) )

@fcharras
Contributor Author
fcharras commented Nov 22, 2022

When dealing with the Array API, I prefer not to support a list of lists at all

I was thinking that in this case it's not up to sklearn to support a given type, but up to what the xp.asarray call is able to accept? Currently, sklearn's support for lists of lists is made convenient by the fact that np.asarray can accept them, I think?

Also, my point is not only about UX but also about compute efficiency, and especially memory efficiency. If the input is converted within the estimator, the raw input is more likely to be converted to an array with the appropriate dtype and order without an intermediate copy. If the user is responsible for converting the input first, it is more likely that the dtype or order doesn't match the constraints enforced by the validation pipeline, in which case the xp.asarray call creates an extra copy.

With the current usage, users need knowledge of those mechanisms to adequately convert data beforehand if memory efficiency is a concern, so it might seem simpler to some to just set array_api_namespace=xp and have the estimator do the conversion correctly?
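
To make the allocation difference concrete, here is a minimal sketch, assuming cupy is installed (the dtype= keyword on asarray is part of the Array API spec; the float64 requirement stands in for what the estimator's validation would enforce):

import numpy as np
import cupy.array_api as xp

X_int = np.arange(12, dtype=np.int64).reshape(4, 3)

# Naive path: the int64 data lands on the device first, then the
# estimator's validation casts it to float there, so two device
# allocations coexist.
X_cu = xp.asarray(X_int)
X_cu_float = xp.asarray(X_cu, dtype=xp.float64)

# Memory-aware path: cast on the host (or, ideally, load as float to
# begin with), so only one device array is ever allocated.
X_cu_direct = xp.asarray(X_int.astype(np.float64))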

@thomasjpfan
Member
thomasjpfan commented Nov 22, 2022

Currently, sklearn's support for lists of lists is made convenient by the fact that np.asarray can accept them, I think?

I spoke with NumPy devs about "array-like" and "list of lists" inputs, and I have always sided with "domain libraries such as scikit-learn should require ndarrays and not support Python lists for numerical input". (But this is off topic and deserves its own discussion.)

Also my point is not only about UX but also about compute efficiency and especially memory efficiency,

Can you update the opening message to include this point about memory efficiency? If memory efficiency is the goal, then I think this feature request makes more sense. Functionally, the Array API specification does not support order, so we would only get efficiency gains from dtype.

@fcharras
Contributor Author

In general I would also agree with that (implicit conversions are against the Zen of Python 🙏), but I understand that sklearn tries to just "make it work" so that beginner users can get to the point faster.

About memory efficiency, here are examples of what I mean (I'll also add the examples to the OP):

X_np, y_np = my_data_loader()  # let's assume that X_np, y_np are typed `np.int64`...

# as a consequence, here the cupy arrays are also typed `int64`
X_cu = xp.asarray(X_np)
y_cu = xp.asarray(y_np)
X_cu.device

with config_context(array_api_dispatch=True):
    lda = LinearDiscriminantAnalysis()
    # But `lda` requires `floats` to work
    # (see https://github.com/scikit-learn/scikit-learn/blob/f3f51f9b6/sklearn/discriminant_analysis.py#L551)
    # so, implicitly, an additional copy of X_cu and y_cu in `float` is going to be created,
    # resulting in twice the memory allocation for the data.
    X_trans = lda.fit_transform(X_cu, y_cu)

For the unwary user, this workflow is safer:

X_np, y_np = my_data_loader()  # let's assume that X_np, y_np are typed `np.int64`...

with config_context(array_api_dispatch=True, array_api_namespace=xp):
    lda = LinearDiscriminantAnalysis()
    # X_np and y_np are again implicitly converted to cupy float arrays, but
    # the intermediate copy the user would otherwise create is skipped (or at
    # worst, deleted right after creation), resulting in better memory management.
    X_trans = lda.fit_transform(X_np, y_np)

On a side note, implicit copies in sklearn estimators' data validation pipeline seem to almost always be avoidable if the user adapts their data loader to load the data with the adequate dtype the first time it is read from disk. Has issuing a warning when those implicit copies are triggered ever been considered?

@fcharras
Contributor Author
fcharras commented Nov 24, 2022

this point about memory efficiency? If memory efficiency is the goal, then I think this feature request makes more sense. Functionally, the Array API specification does not support order, so we would only get efficiency gains from dtype.

I wish that could change; the current default of silently ignoring order in the early Array API work is confusing. Shouldn't an error be raised in this case? I've recently been working with another library, dpctl.tensor, that implements the Array API; its asarray method also exposes the order argument, but it will not be used because of this. What would you think of inspecting the signature of xp.asarray and passing the order argument if xp.asarray implements it?
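
Concretely, the inspection could look something like this minimal sketch (asarray_with_optional_order is a hypothetical helper name, not an existing sklearn function):

import inspect

def asarray_with_optional_order(xp, obj, order="C", **kwargs):
    # Pass `order` only to implementations whose asarray accepts it
    # (e.g. dpctl.tensor); fall back to the plain spec call otherwise.
    # inspect.signature can raise for some C-implemented callables,
    # hence the defensive try/except.
    try:
        accepts_order = "order" in inspect.signature(xp.asarray).parameters
    except (TypeError, ValueError):
        accepts_order = False
    if accepts_order:
        return xp.asarray(obj, order=order, **kwargs)
    return xp.asarray(obj, **kwargs)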

Edit: despite #22554 (comment), I think order is widely used in sklearn? Implementations in low-level languages will usually perform better with a given memory layout (KMeans, for example). Is there still an ongoing discussion on the Array API side, or is it closed at the moment? If closed, we might also consider exposing get_namespace / _asarray_with_order to plugins.

@thomasjpfan
Member

What would you think of inspecting the signature of xp.asarray and passing the order argument if xp.asarray implements it?

I think dpctl.tensor is the first Array API implementation I've seen that adds the order argument to asarray. CuPy and NumPy both natively support order, but their Array API implementations do not.

Is there still an ongoing discussion with the Array API or is it closed at the moment ?

It's currently closed. From my understanding, order was not included in the specification because there are array libraries that do not support it. Most notably, TensorFlow's Tensor and MXNet's Array are C-ordered only. NumPy, JAX, PyTorch, and CuPy all support order.

Given that this is a pain point for us, and that dpctl.tensor adds order to its asarray implementation, I think it's worth reopening the discussion in https://github.com/data-apis/array-api/issues. I do not think there is a public discussion around order, so it would be nice to see what other domain libraries think.

@ogrisel
Member
ogrisel commented Dec 1, 2022

To come back to the original issue, I think we need a story to deal with the following situation:

  • original data is in a CSV or Parquet file or a SQL database, and pandas reads it as a regular dataframe;
  • then the dataframe is fed to a pipeline with:
    • a preprocessing step that uses a column transformer to extract numerical features (e.g. one hot encoding, splines, whatever);
    • possibly a few transformers that maybe support the Array API spec to work on GPU and as a result would return transformed data with the matching container namespace;
    • a final supervised model like LDA that supports the Array API spec to work on GPU.

Then we want to run cross-validation / grid search on this pipeline and make sure that the later GPU-capable stages of the pipeline run on the GPU as much as possible. The grid search should be able to tweak the hyperparameters of the column transformer.

Right now this is not possible because:

  • a. the column transformer always generates a numpy array in main memory;
  • b. the target variable y_train could manually be either a CuPy or a NumPy array, and different stages in the pipeline might not be CuPy aware.

To solve a., we could introduce a dedicated transformer in charge of converting a numpy array or a numerical-valued dataframe into an Array API container for the user-requested namespace (ArrayAPIConverter(namespace="cupy.array_api", dtype="float32")), to be inserted in the pipeline just before estimators such as LDA that are expected to work better with float32 cupy containers.
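
A minimal sketch of what such a transformer could look like; the class name and parameters come from the comment above, and the body is purely illustrative:

import importlib
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ArrayAPIConverter(TransformerMixin, BaseEstimator):
    """Stateless converter to an Array API namespace (illustrative only)."""

    def __init__(self, namespace="cupy.array_api", dtype="float32"):
        self.namespace = namespace
        self.dtype = dtype

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        xp = importlib.import_module(self.namespace)
        # Accept numpy arrays or numerical dataframes, then move the data
        # to the target namespace with the target dtype in one step.
        return xp.asarray(np.asarray(X, dtype=self.dtype))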

Solving b. is trickier because, unfortunately, our pipeline / transformer API does not allow transforming y_train (unless we rely on meta-estimators).

What @fcharras proposes could potentially be a solution to tackle both a. and b.

Alternatively, we might add a new constructor argument or public setter method on estimators with Array API support, to request that they handle the conversion of any compatible inputs to a given namespace.

I have not yet converged on an opinion on this matter, but I recognize that the current state of things is not satisfying.

@betatim
Member
betatim commented Dec 1, 2022

From reading all the comments here I think the problem we are trying to solve is:

  1. Data in a numpy array is passed into an estimator or a pipeline.
  2. The estimator (or most of the steps in the pipeline) has support for "Array API" arrays and cupy is installed (using cupy as an example, but it could be any Array API library that provides acceleration).
  3. This means we could run on the GPU, which would be faster, but we don't, because the input is a numpy array.

Is this a good summary?

@ogrisel
Member
ogrisel commented Dec 1, 2022

Is this a good summary?

If the data is a numpy array, it's easy to just convert it to cupy and be done with it. The problem is when the original data fed to the input of the pipeline cannot be represented by a numpy or cupy array, but only by a heterogeneous data structure such as a pandas dataframe. The conversion to the cupy container needs to happen only in the later stages of the pipeline. At the moment we do not have a way to do this.

@betatim
Member
betatim commented Dec 2, 2022

Even converting to a cupy array is something that scikit-learn can't do right now. But yeah, for something like a pandas df that can only be converted to an array later in a pipeline, it is even trickier.

Because it is tricky, I am wondering if the best option is to let the user do it. For arrays there are accelerated versions, and the same goes for dataframes. Which one to choose, and when, might be easier to delegate to a human.

@fcharras
Contributor Author

Opened an issue for discussion on the order parameter at data-apis/array-api#571

@glemaitre glemaitre moved this to Discussion in Array API May 17, 2024