[RFC] Always convert lists of lists of numbers to numpy arrays during input validation. #24745
Comments
Tests currently fail because X can be a list of lists of numbers and thus not have a `dtype` attribute. See this RFC for discussion: scikit-learn#24745
As far as I know, we would always convert a list of lists to a NumPy array. I can think of 2 cases where this is not the case:
What I am missing here is the context in which an estimator would actually use these lists of lists in its internal algorithm.
I am attaching a few pointers to the description of this PR.
Typically So we certainly want to avoid validating the data twice. Maybe the workaround here is to store the
This is already what
Ah yes, I missed that. I have been able to find my way around and I was not informed when I wrote this RFC. Hence let's close it.
Describe the workflow you want to enable
Transformers and estimators accept lists of lists of numbers as valid inputs for parameters such as `X`.
Yet when it comes to accessing basic attributes of the dataset (such as the shape and the dtype, which NumPy arrays expose) or to reaching the best performance (e.g. being able to use Cython implementations, which only operate on contiguous buffers of memory), the list-of-lists structure is inconvenient.
Also, lists of lists are mostly used for simple examples (such as doctests) and are unlikely to be used in practice.
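To make the inconvenience concrete, here is a minimal sketch that assumes only NumPy. The variable names `X_list` and `X_arr` are illustrative and do not come from the scikit-learn code base.

```python
import numpy as np

# A list of lists of numbers, as typically written in doctests.
X_list = [[0, 1], [2, 3]]

# Plain Python lists expose neither `shape` nor `dtype`.
has_shape = hasattr(X_list, "shape")
has_dtype = hasattr(X_list, "dtype")
print(has_shape, has_dtype)

# After conversion, both attributes are available and the data lives
# in a single contiguous buffer, which is what Cython code expects.
X_arr = np.asarray(X_list)
print(X_arr.shape, X_arr.flags["C_CONTIGUOUS"])
```

Any code path that reads `X.dtype` before conversion has to special-case the list input, which is the failure mode described in the first comment of this thread.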
Describe your proposed solution
I propose changing input validation to always convert lists of lists of numbers to their natural NumPy array equivalent.
In this context:
- lists of lists of `int` will be converted to 2D NumPy arrays of `np.int64`
- lists of lists of `float` will be converted to 2D NumPy arrays of `np.float64`
- `RuntimeError` will be raised if leaf elements aren't numbers
- `RuntimeError` will be raised if internal lists have different lengths (the case of ragged arrays)

There might be some cost and maintenance complexity in converting lists of lists to NumPy arrays.
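The proposed rules could be sketched roughly as follows. This is only an illustration under stated assumptions: `convert_list_of_lists` is a hypothetical helper, not an existing scikit-learn function, and it does not reflect how `check_array` is actually implemented.

```python
import numbers

import numpy as np


def convert_list_of_lists(X):
    """Hypothetical helper: convert a list of lists of numbers to a 2D array."""
    # Leave anything that is not a list of lists untouched.
    if not isinstance(X, list) or not all(isinstance(row, list) for row in X):
        return X

    # Ragged input: inner lists must all have the same length.
    if len({len(row) for row in X}) > 1:
        raise RuntimeError("Inner lists have different lengths (ragged input).")

    leaves = [value for row in X for value in row]

    # Every leaf element must be a real number.
    if not all(isinstance(value, numbers.Real) for value in leaves):
        raise RuntimeError("All leaf elements must be numbers.")

    # Only integers -> np.int64, otherwise -> np.float64.
    if all(isinstance(value, numbers.Integral) for value in leaves):
        return np.asarray(X, dtype=np.int64)
    return np.asarray(X, dtype=np.float64)
```

For example, `convert_list_of_lists([[1, 2], [3, 4]])` yields an `np.int64` array, while a single float leaf switches the result to `np.float64`. Note that this sketch treats Python `bool` values as integers, which a real implementation might want to handle differently.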
Changes mostly need to be made in:
- `BaseEstimator._validate_data`: scikit-learn/sklearn/base.py, lines 453 to 460 in 1dc23d7
- `sklearn.utils.check_array`: scikit-learn/sklearn/utils/validation.py, lines 629 to 644 in 7b0a162
Describe alternatives you've considered, if relevant
Continue supporting lists of lists of numbers and introduce utility functions to get the basic attributes of datasets that have this structure.
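Such utility functions might look like the sketch below. The names `get_shape` and `get_dtype` are hypothetical and chosen for illustration only. The sketch assumes well-formed (non-ragged, numeric) input and mirrors the `np.int64` / `np.float64` convention from the proposal above.

```python
import numbers

import numpy as np


def get_shape(X):
    """Hypothetical helper: shape of an array or of a list of lists."""
    if hasattr(X, "shape"):
        return tuple(X.shape)
    n_samples = len(X)
    # Assumes all inner lists have the same length as the first one.
    n_features = len(X[0]) if n_samples else 0
    return (n_samples, n_features)


def get_dtype(X):
    """Hypothetical helper: dtype of an array or of a list of lists."""
    if hasattr(X, "dtype"):
        return X.dtype
    leaves = [value for row in X for value in row]
    if all(isinstance(value, numbers.Integral) for value in leaves):
        return np.dtype(np.int64)
    return np.dtype(np.float64)
```

The drawback of this alternative is that every caller has to remember to use the helpers instead of the attributes, and `get_dtype` walks the whole input each time it is called.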
Additional context
Listing references as I find them:
- 43a61c4 (#22665)