Feature request: an additional config context for forcing conversion toward a specific Array API-compatible array library. #25000
When dealing with the Array API, I prefer not to support a list of lists at all. For a list of lists, one can convert it into a NumPy array with

The other use case for adding
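For illustration, a minimal sketch of the explicit up-front conversion alluded to here (my own example, not code from the thread): `np.asarray` turns the list of lists into an ndarray once, and that ndarray is what would then be handed to the estimator.

```python
import numpy as np

rows = [[0.0, 1.0], [2.0, 3.0]]          # "list of lists" input
X = np.asarray(rows, dtype=np.float64)   # one explicit, up-front conversion
print(X.shape)   # (2, 2)
```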
It isn't clear to me why

```python
X_np, y_np = make_classification(random_state=0)

with config_context(array_api_dispatch=True, array_api_namespace=xp):
    lda = LinearDiscriminantAnalysis()
    X_trans = lda.fit_transform(X_np, y_np)
```

is nicer for end users than:

```python
X_np, y_np = make_classification(random_state=0)
X_cp = cp.array_api.asarray(X_np)
y_cp = cp.array_api.asarray(y_np)

with config_context(array_api_dispatch=True):
    lda = LinearDiscriminantAnalysis()
    X_trans = lda.fit_transform(X_cp, y_cp)
```

Both require the user to have quite a bit of knowledge about the underlying Array API specifics. For example, you have to know that you need to use

I agree it is annoying that a user who has a GPU, has cupy installed, and is running code that could take advantage of all this will still get the CPU path. However, I think this is how the Array API is meant to work, so it is a feature and not a bug. (One reason why scikit-learn's engines should keep asking each engine whether it wants to handle a situation, instead of deducing it from the types ;) )
I was thinking that in this case it's not up to sklearn to support a given type but up to what the

Also, my point is not only about UX but also about compute efficiency and especially memory efficiency. If the input is converted within the estimator, it's more likely that the raw input is converted to an array with the appropriate dtype and order without an intermediate copy. If the user is responsible for converting the input first, it's more likely that the dtype or order doesn't match the constraints enforced by the validation pipeline, and as a consequence the

With the current usage, users need knowledge of those mechanisms to adequately convert data beforehand if memory efficiency is an issue, so it might seem simpler to some to just set
I spoke with NumPy devs about "array-like" and "list of list" and I have always sided with "domain libraries such as scikit-learn should require ndarrays and not support Python lists for numerical input". (But this is off topic and deserves its own discussion).
Can you update the opening message to include this point about memory efficiency? If memory efficiency is the goal, then I think this feature request makes more sense. Functionally, the Array API specification does not support
In general I would also agree with that (implicit conversions are against the zen of Python 🙏), but I understand that sklearn tries to just "make it work" so that beginner users can get to the point faster.

About memory efficiency, here are examples of what I mean (I'll also add the examples to the OP):

```python
X_np, y_np = my_data_loader()  # let's assume that X_np, y_np are typed `np.int64`...

# as a consequence, here the cupy arrays are also typed `int64`
X_cu = xp.asarray(X_np)
y_cu = xp.asarray(y_np)
X_cu.device

with config_context(array_api_dispatch=True):
    lda = LinearDiscriminantAnalysis()
    # But `lda` requires floats to work
    # (see https://github.com/scikit-learn/scikit-learn/blob/f3f51f9b6/sklearn/discriminant_analysis.py#L551)
    # so implicitly, an additional copy of X_cu and y_cu in `float` is going to be created,
    # resulting in twice the memory allocation for the data.
    X_trans = lda.fit_transform(X_cu, y_cu)
```

For the unwary user, this workflow is safer:

```python
X_np, y_np = my_data_loader()  # let's assume that X_np, y_np are typed `np.int64`...

with config_context(array_api_dispatch=True, array_api_namespace=xp):
    lda = LinearDiscriminantAnalysis()
    # X_np, y_np are again implicitly transformed into cupy float arrays, but
    # the intermediate copy that the user would have created is skipped (or at
    # worst, deleted right after creation), resulting in better memory management.
    X_trans = lda.fit_transform(X_np, y_np)
```

On a side note, implicit copies in the data validation pipeline of sklearn estimators seem to almost always be avoidable if the user adapts their data loader to load the data with the adequate dtype the first time it is read from disk. Has it ever been considered to issue a warning when those implicit copies are triggered?
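As a numpy-only illustration of the copy behavior described above (my own example, not code from the thread): converting to a mismatched dtype always allocates a new buffer, while a matching dtype is a no-op.

```python
import numpy as np

X_int = np.arange(6, dtype=np.int64).reshape(3, 2)

# Matching dtype: np.asarray returns the input object itself, no copy.
same = np.asarray(X_int, dtype=np.int64)
print(same is X_int)        # True

# Mismatched dtype: a new float64 buffer is allocated, so the int64 and
# float64 versions are alive at once -- the "twice the memory allocation"
# situation described above.
X_float = np.asarray(X_int, dtype=np.float64)
print(X_float is X_int)                   # False
print(np.shares_memory(X_int, X_float))   # False
```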
I wish that could change; the current default of ignoring

Edit: despite #22554 (comment), I think in sklearn there is expanded use of
I think
It's currently closed. From my understanding, order was not included in the specification because there are array libraries that do not support it. Most notably, TensorFlow's Tensor and MXNet's Array are C-ordered only, while NumPy, JAX, PyTorch, and CuPy all support order. Given that this is a pain point for us and
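To make the order point concrete, here is a small numpy-only sketch (my own illustration; the Array API specification itself exposes no `order` keyword, which is exactly the gap discussed here):

```python
import numpy as np

# Row-major (C) layout: the only layout some libraries support.
X_c = np.ones((4, 3), order="C")

# Requesting Fortran order from a C-ordered 2-D array forces a copy,
# i.e. an extra allocation the user may not expect.
X_f = np.asarray(X_c, order="F")
print(X_f.flags["F_CONTIGUOUS"])    # True
print(np.shares_memory(X_c, X_f))   # False
```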
To come back to the original issue, I think we need a story to deal with the following situation:
Then we want to run cross-validate / grid-search on this pipeline and make sure that the latest GPU-capable stages of the pipeline run on the GPU as much as possible. The grid search should be able to tweak the hyperparameters of the column transformer. Right now this is not possible because:

To solve a. we could introduce a dedicated transformer in charge of converting a numpy array or a numerical-valued dataframe into an Array API container for the user-requested namespace.

Solving b. is more tricky because unfortunately our pipeline / transformer API does not allow transforming

What @fcharras proposes could potentially be a solution to tackle both a. and b. Alternatively, we might add a new constructor argument or a public setter method on estimators with Array API spec support to request that they handle the conversion of any compatible inputs to a given namespace.

I have not converged yet on an opinion on this matter. But I do recognize that the current state of things is not satisfying.
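Idea a. could be sketched as a tiny pipeline step. Everything here (the `ToNamespace` name and its parameters) is hypothetical, not an existing scikit-learn API; it only assumes the target namespace exposes `asarray`, which the Array API standard guarantees.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ToNamespace(TransformerMixin, BaseEstimator):
    """Hypothetical pipeline step: convert array-likes (or numerical
    dataframes, via their __array__ protocol) into a target namespace."""

    def __init__(self, xp=np, dtype=np.float64):
        self.xp = xp        # target Array API namespace
        self.dtype = dtype  # dtype to convert to in one pass

    def fit(self, X, y=None):
        return self  # stateless

    def transform(self, X):
        # asarray is the one conversion entry point the Array API guarantees
        return self.xp.asarray(X, dtype=self.dtype)
```

With `xp` set to a GPU namespace, such a step could sit right before the GPU-capable stages of a pipeline, leaving the earlier dataframe-based stages untouched.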
From reading all the comments here I think the problem we are trying to solve is:
Is this a good summary?
If the data is a numpy array, it's easy to just convert it to cupy and be done with it. The problem is when the original data fed to the input of the pipeline cannot be represented by a numpy or cupy array but only by a heterogeneous data structure such as a pandas dataframe. The conversion to the cupy container then needs to happen only in the later stages of the pipeline. At the moment we do not have a way to do this.
Even converting to a cupy array is something that scikit-learn can't do right now. But yeah, for something like a pandas df that can only be converted to an array later in a pipeline it is even more tricky. Because it is tricky, I am wondering if the best option is to let the user do it. For arrays there are accelerated versions, same for data frames. Which one to choose when might be easier to delegate to a human.
Opened an issue for discussion on the
Describe the workflow you want to enable
Per #22554 such a workflow is possible:

(see https://scikit-learn.org/dev/modules/array_api.html)

I would like to additionally enable the following workflow:
where `X_trans` is the same in both examples, and in both examples the compute has been done with `cupy`. The difference is that we pass numpy inputs and rely on the estimator to convert using the appropriate array namespace.

The added value here for the user is that the `check_array` function could be used to load arbitrary inputs into the array library relying on its `xp.asarray` method, in a way that is consistent with `sklearn.utils.validation.check_array` requirements across all array libraries.

Currently, a user who has data formatted as, let's say, a list of lists, has two choices:

- pass it to `lda.fit`, but then, since the list of lists does not have an `__array_namespace__` attribute, it will fall back to the numpy array namespace;
- convert it to `cupy` first, but then, unless the user methodically re-implements `check_array` with `cupy` conversion, the behavior of the conversion might differ from what `check_array` would do when converting to `numpy`, and anyway reimplementing `check_array` is not very efficient to begin with.

Edit: following discussions, adding examples that highlight the added value I can see for memory management with such an interface:

For the unwary user, this workflow is safer:
Describe your proposed solution
The `get_namespace` function can query the `config_context` and force the use of the `__array_namespace__` that is set by the config rather than the `__array_namespace__` attached to the input (if any). (A rather simple solution.)