[RFC] Always convert lists of lists of numbers to numpy arrays during input validation. #24745

jjerphan · 2022-10-24T15:19:36Z

Describe the workflow you want to enable

Transformers and Estimators accept list of lists of numbers as valid for inputs like X.

Yet, when it comes to access to some basic attributes of the datasets (like the shape and the dtype which are present for numpy array) or to reach the best performances (e.g. be able to use Cython implementation which only operates on continuous buffers of memory), list of lists of numbers structure is inconvenient.

Also lists of lists really are used for simple examples (such as doctests) but are unlikely used in practice.

Describe your proposed solution

I propose changing inputs validation to always convert list of list of numbers to their associated natural numpy array.

In this context:

lists of lists of Python int will be converted to 2D numpy array of np.int64
lists of lists of Python float will be converted to 2D numpy array of np.float64
a RuntimeError will be raised if leaf element aren't numbers
a RuntimeError will be raised if internals list have different length (the case of ragged array)

There might be some cost and maintenance complexity in converting list of lists to numpy array.

Changes mostly need be made in:

BaseEstimator._validate_data:

scikit-learn/sklearn/base.py

Lines 453 to 460 in 1dc23d7

    
           def _validate_data( 
        
               self, 
        
               X="no_validation", 
        
               y="no_validation", 
        
               reset=True, 
        
               validate_separately=False, 
        
               **check_params, 
        
           ):

sklearn.utils.check_array:

scikit-learn/sklearn/utils/validation.py

Lines 629 to 644 in 7b0a162

    
           def check_array( 
        
               array, 
        
               accept_sparse=False, 
        
               *, 
        
               accept_large_sparse=True, 
        
               dtype="numeric", 
        
               order=None, 
        
               copy=False, 
        
               force_all_finite=True, 
        
               ensure_2d=True, 
        
               allow_nd=False, 
        
               ensure_min_samples=1, 
        
               ensure_min_features=1, 
        
               estimator=None, 
        
               input_name="", 
        
           ):

Describe alternatives you've considered, if relevant

Continue supporting list of lists of numbers and introduce utility functions to be able to get basic attributes of the datasets which this structure.

Additional context

Listing references as I find them:

The text was updated successfully, but these errors were encountered:

Tests currently fails because X can be a list of list of numbers and thus not have a `dtype` attribute. See this RFC for discussions: scikit-learn#24745

glemaitre · 2022-10-24T16:04:54Z

As far as I know, we would always convert a list of lists to a NumPy array. I can think of 2 cases that this is not the case:

Some meta-estimators where we validate using _num_samples and _num_features. From the top of the head it was to avoid multiple validations.
Vectorizer for text will keep a list of str as input.

What I am missing here is the context where an estimator will actually be using these lists of lists in the internal algorithm?

jjerphan · 2022-10-24T16:06:46Z

What I am missing here is the context whe 8000 re an estimator will actually be using these lists of lists in the internal algorithm?

I am attaching a few pointers to the description of this PR.

glemaitre · 2022-10-24T16:46:52Z

Typically 43a61c4 (#22665) is such an example where we delay the validation in _fit because it is shared between estimator.

So we certainly want to avoid validating the data twice. Maybe the workaround here is to store the self._X_dtype in _fit such that we can reuse it _fit.

jeremiedbb · 2022-10-28T09:18:37Z

lists of lists of Python int will be converted to 2D numpy array of np.int64
lists of lists of Python float will be converted to 2D numpy array of np.float64

This is already what check_array does. I'm not sure to understand what you would change. As @glemaitre pointed out, there are specific cases where we don't want to call check_array early to avoid repeating the validation several times (in meta-estimators essentially)

jjerphan · 2022-11-03T15:37:36Z

Ah yes, I missed that. I have been able to find my way around and I was not informed when I wrote this RFC. Hence let's close it.

jjerphan added New Feature RFC Needs Triage Issue requires triage Validation related to input validation frontend labels Oct 24, 2022

jjerphan added a commit to jjerphan/scikit-learn that referenced this issue Oct 24, 2022

Preserve input dtype for LOF

43a61c4

Tests currently fails because X can be a list of list of numbers and thus not have a `dtype` attribute. See this RFC for discussions: scikit-learn#24745

jjerphan mentioned this issue Oct 24, 2022

FEA Introduce PairwiseDistances #23958

Closed

3 tasks

jjerphan added a commit to jjerphan/scikit-learn that referenced this issue Oct 24, 2022

DOC Add references to scikit-learn#24745

4fccb5f

jjerphan changed the title ~~[RFC] Always convert lists of lists of number to numpy arrays during input validation.~~ [RFC] Always convert lists of lists of numbers to numpy arrays during input validation. Oct 24, 2022

glemaitre removed the Needs Triage Issue requires triage label Oct 25, 2022

jjerphan 549D closed this as completed Nov 3, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] Always convert lists of lists of numbers to numpy arrays during input validation. #24745

[RFC] Always convert lists of lists of numbers to numpy arrays during input validation. #24745

[RFC] Always convert lists of lists of numbers to numpy arrays during input validation. #24745

[RFC] Always convert lists of lists of numbers to numpy arrays during input validation. #24745

Comments

Describe the workflow you want to enable

Describe your proposed solution

Describe alternatives you've considered, if relevant

Additional context