Input validation refactoring · Issue #3440 · scikit-learn/scikit-learn · GitHub

Closed
amueller opened this issue Jul 19, 2014 · 18 comments

@amueller
Member

I propose to refactor the input validation. The current zoo of methods is kinda confusing.

Related to #3142.

Checks that we want are:

  • numpy array vs sparse vs list vs anything indexable
  • sparse matrix type
  • inf / nan
  • dtype
  • ndims
  • number of samples is consistent in multiple arrays.
  • contiguous
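Taken together, the checks above can be sketched with plain NumPy/SciPy. This is illustrative only; `sketch_checks` is an ad-hoc name, not an API that exists in scikit-learn:

```python
import numpy as np
import scipy.sparse as sp

def sketch_checks(X, Y=None):
    """Illustrative combination of the checks listed above."""
    if not sp.issparse(X):
        # numpy array vs list vs anything indexable: coerce to ndarray
        X = np.asarray(X)
        # inf / nan
        if not np.isfinite(X).all():
            raise ValueError("Input contains NaN or infinity.")
        # dtype: coerce to float
        X = X.astype(np.float64, copy=False)
        # ndims: require 2d input
        if X.ndim != 2:
            raise ValueError("Expected 2d input, got %dd." % X.ndim)
        # contiguity
        X = np.ascontiguousarray(X)
    else:
        # sparse matrix type: standardize on CSR here
        X = X.tocsr()
    # number of samples consistent across multiple arrays
    if Y is not None and X.shape[0] != np.asarray(Y).shape[0]:
        raise ValueError("Inconsistent numbers of samples.")
    return X

X = sketch_checks([[1.0, 2.0], [3.0, 4.0]], Y=[0, 1])
print(X.shape)  # (2, 2)
```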

I'll now check if there is anything else that we currently check and see what functions we have.

Remaining Issues

  • possibly remove as_float_array and column_or_1d, and make these options of check_array.
  • Currently "ensure2d" makes vectors into rows, which I find pretty counter-intuitive; this is kept for backward compatibility. 1d input is not currently handled consistently.
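The row behavior can be seen with NumPy's own `atleast_2d`, which is the kind of helper these 2d-ensuring functions build on (a quick illustration, not scikit-learn code):

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])
row = np.atleast_2d(v)   # a 1d vector becomes a *row*, shape (1, 3)...
col = v.reshape(-1, 1)   # ...whereas callers usually mean a column, shape (3, 1)

print(row.shape)  # (1, 3)
print(col.shape)  # (3, 1)
```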
@arjoly
Member
arjoly commented Jul 19, 2014

+1

@amueller
Member Author

What we have

(only public functions)

  • assert_all_finite - elements of numpy / sparse matrix are not inf / nan
  • safe_asarray - lets through array, csr, csc, coo; converts other sparse formats to csr; optional finiteness and order checks.
  • array2d - raises on sparse, converts lists to arrays, makes atleast2d, optional finite, copy, order.
  • atleast2d_or_csx - converts sparse csr / csc, makes arrays atleast2d, optional finite, copy, order.
  • check_arrays - (I take the blame for large parts of this) checks that all inputs have the same len or shape[0], optionally converts sparse matrices to a given format, or passes through everything iterable, checks contiguity, dtype, finite.
  • column_or_1d - converts lists to dense arrays, ravels 2-dim column vectors to 1-dim, and optionally warns.
  • as_float_array - make array or sparse into float32 or float64 depending on input type.
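As a concrete example of one of these helpers, here is a rough reimplementation of the `column_or_1d` behavior described above (a sketch for illustration, not the scikit-learn source):

```python
import warnings
import numpy as np

def column_or_1d_sketch(y, warn=False):
    """Illustrative reimplementation of the column_or_1d behavior."""
    y = np.asarray(y)  # converts lists to dense arrays
    if y.ndim == 1:
        return y
    if y.ndim == 2 and y.shape[1] == 1:
        if warn:
            warnings.warn("A column-vector y was passed when a "
                          "1d array was expected.")
        return y.ravel()  # ravel a 2-dim column to 1-dim
    raise ValueError("bad input shape %r" % (y.shape,))

print(column_or_1d_sketch([[1], [2], [3]]).shape)  # (3,)
```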

@amueller
Member Author

```
$ git grep array2d | wc -l
131
$ git grep safe_asarray | wc -l
59
$ git grep atleast2d_or_csr | wc -l
84
$ git grep atleast2d_or_csc | wc -l
28
$ git grep check_arrays | wc -l
152
```

@amueller
Member Author

OK, my proposed interface is:

```python
check_consistent_length(X, y)
# hopefully not as long in real life
X = check_array(X, allowed_sparse=['csr', 'lil'], convert_sparse_to='csr',
                dtype='any_float', order='C', copy=copy, check_finite=True)
y = column_or_1d(y)
```

and array_2d_or_csx would basically be a shortcut for allowed_sparse=['csx'], convert_sparse_to='csx'.
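A minimal sketch of what such a `check_array` could look like (illustrative only; the parameter names follow this proposal, and the interface that was eventually merged differs in its details):

```python
import numpy as np
import scipy.sparse as sp

def check_array_sketch(X, allowed_sparse=None, convert_sparse_to=None,
                       dtype=np.float64, order=None, copy=False,
                       check_finite=True):
    """Sketch of the proposed check_array; not the merged scikit-learn API."""
    if sp.issparse(X):
        if not allowed_sparse:
            raise TypeError("Sparse input is not allowed here.")
        if X.format not in allowed_sparse:
            # convert disallowed formats to the requested (or first allowed) one
            X = X.asformat(convert_sparse_to or allowed_sparse[0])
        X = X.astype(dtype)
        if check_finite and not np.isfinite(X.data).all():
            raise ValueError("Input contains NaN or infinity.")
    else:
        X = np.asarray(X, dtype=dtype, order=order)
        if copy:
            X = X.copy()
        if check_finite and not np.isfinite(X).all():
            raise ValueError("Input contains NaN or infinity.")
    return X

X = check_array_sketch(sp.coo_matrix(np.eye(3)), allowed_sparse=['csr', 'lil'])
print(X.format)  # csr
```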

@GaelVaroquaux
Member

> check_consistent_length(X, y)
> # hopefully not as long in real life
> X = check_array(X, allowed_sparse={'csr'}, convert_sparse_to='csr', dtype=np.float, order='C', copy=copy)
> y = column_or_1d(y)

I like that. I would also propose the renaming of check_array as discussed earlier.

@amueller
Member Author

PR here: #3443

@arjoly
Member
arjoly commented Jul 19, 2014

We could also add a check that raises ValueError on 64-bit sparse matrices where they are not supported.

@arjoly
Member
arjoly commented Jul 20, 2014

There is still the as_float_array utility.

@amueller
Member Author

@arjoly can you comment on the 64-bit sparse matrices? What is the issue there again? Otherwise I'd close this, as I think the current state is pretty decent.

@jnothman
Member
jnothman commented Jan 23, 2015

Sparse matrix index arrays with intp-like dtypes (csr/csc/bsr indices and indptr; coo col and row) were constrained to 32-bit ints until very recently. If the user has a supporting scipy version, and we only perform scipy public-API operations, 64-bit index support should be seamless. If we perform our own Cython ops, or if we use explicit dtyping, things will break with unhelpful error messages. So it's best if we validate the index dtype where that's going to happen, or fix our implementations to allow very big sparse matrices. Basically, we need to alter all the sparse matrix tests to test with both index dtypes, and go from there.
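The index-dtype validation described here could look roughly like this (a sketch under the assumption that a given routine only handles 32-bit indices; `check_32bit_indices` is a hypothetical name, not a scikit-learn helper):

```python
import numpy as np
import scipy.sparse as sp

def check_32bit_indices(X):
    """Raise early if a CSR/CSC matrix carries 64-bit index arrays,
    which a routine with hard-coded 32-bit Cython dtypes cannot handle."""
    for attr in ('indices', 'indptr'):
        idx = getattr(X, attr, None)
        if idx is not None and idx.dtype != np.int32:
            raise ValueError("%s has dtype %s; only 32-bit indices are "
                             "supported by this routine." % (attr, idx.dtype))

X = sp.csr_matrix(np.eye(2))
# force 64-bit index arrays, as scipy would use for very large matrices
X.indices = X.indices.astype(np.int64)
X.indptr = X.indptr.astype(np.int64)
try:
    check_32bit_indices(X)
except ValueError as e:
    print(e)
```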

@amueller
Member Author

Right. That should actually not be too hard.

@amueller
Member Author

I think I will make that a new issue, though, as that is more about tighter testing.

@amueller
Member Author

#4149

@arjoly
Member
arjoly commented Jan 23, 2015

Are there reasons to keep as_float_array?

@amueller
Member Author

Actually, probably not.

@amueller amueller reopened this Jan 23, 2015
@amueller
Member Author

There is a lot of logic in that one, and I'm not sure I want to include this all in the check_array function....

@amueller
Member Author

Maybe that is a good candidate to fix after #4136 has been merged. A problem is that it is used a lot in function interfaces, which are not tested in the common tests :-/

@amueller
Member Author

closing, I think we're happy with check_array for now.
