check_array does not raise error when input contains something other than numbers or strings · Issue #11401 · scikit-learn/scikit-learn · GitHub

Closed
jeremiedbb opened this issue Jul 2, 2018 · 5 comments

Comments

@jeremiedbb (Member)

Now that imputers allow inputs with object dtype, e.g. strings or pandas categoricals, it seems that either check_array should be enhanced or some common tests should be updated.
There is a common test, check_dtype_object, that checks the estimators on an input X that contains numbers plus X[0,0] = {'foo':'bar'}. When numeric input is expected, check_array is called with dtype='numeric' and an error is raised as expected.
However, when it is called with dtype=None or dtype=object, no error is raised. See the code below:

>>> import numpy as np
>>> from sklearn.impute import SimpleImputer

>>> X = np.array([{'foo': 'bar'}, "a", "b", "c"], dtype=object).reshape(-1, 1)
>>> X
array([[{'foo': 'bar'}],
       ['a'],
       ['b'],
       ['c']], dtype=object)

>>> imputer = SimpleImputer(strategy='constant', missing_values='a')
>>> imputer.fit_transform(X)
array([[{'foo': 'bar'}],
       ['missing_value'],
       ['b'],
       ['c']], dtype=object)

No error is raised and the estimator works fine. Don't you think we should raise an error in that case?
This currently passes the test because, when imputing on inputs with object dtype, we can't set dtype='numeric' in check_array. I think the error should be raised even with dtype=object or dtype=None.

We could check that in the fit function of the estimators that accept non-numeric inputs, but I think it is the role of check_array to do that. What's your opinion?
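To make the contrast concrete, here is a minimal reproduction of the two behaviours, using the public `check_array` from `sklearn.utils`:

```python
import numpy as np
from sklearn.utils import check_array

X = np.array([{'foo': 'bar'}, "a", "b", "c"], dtype=object).reshape(-1, 1)

# dtype='numeric' tries to coerce the array to float and fails on the dict
try:
    check_array(X, dtype='numeric')
except (TypeError, ValueError) as exc:
    print("dtype='numeric' raised:", type(exc).__name__)

# dtype=object (and dtype=None) passes the array through without inspecting
# the individual elements, so the dict goes unnoticed
X_checked = check_array(X, dtype=object)
print(X_checked.dtype)  # object
```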

@jnothman (Member) commented Jul 3, 2018

Maybe we need something more custom like dtype='scalar-object' to accept elements being int, str, unicode, None, ...? But there is overhead for checking that the data complies, and I'm not sure it's worth it.
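No such dtype exists in check_array today; as a purely hypothetical sketch, a 'scalar-object' validation would amount to an element-wise isinstance pass (the names `_SCALAR_TYPES` and `check_scalar_object` are made up for illustration):

```python
import numbers
import numpy as np

# Hypothetical whitelist of element types a 'scalar-object' dtype would accept
_SCALAR_TYPES = (numbers.Number, str, bytes, type(None))

def check_scalar_object(X):
    """Raise if any element of an object array is not a scalar-like value."""
    X = np.asarray(X, dtype=object)
    # This per-element loop is exactly the overhead in question: O(X.size)
    for value in X.ravel():
        if not isinstance(value, _SCALAR_TYPES):
            raise ValueError(
                "Expected number, str, bytes or None, got %s"
                % type(value).__name__
            )
    return X

check_scalar_object(np.array([["a"], ["b"]], dtype=object))  # passes
try:
    check_scalar_object(np.array([{'foo': 'bar'}], dtype=object))
except ValueError as exc:
    print(exc)  # the dict element is rejected
```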

@jeremiedbb (Member, Author)

Well, it would only be used in estimators that allow non-numeric data, so maybe the overhead is worth it for those; I'm not sure. The thing is, I think the above code shouldn't pass, don't you?
I can run some tests to see whether the overhead is really significant.

@jnothman (Member) commented Jul 3, 2018 via email

@jorisvandenbossche (Member)

> I think the error should be raised even with dtype=object or dtype=None.

I agree with Joel that I am not sure the overhead to infer the type of all elements in X is worth the error message.


But partly related to this, check_array can also be improved to work with object data in general (not necessarily such 'strange' object data as mentioned in the example above).

Currently, what I do in the encoders is something like:

        # first pass: let check_array infer the dtype
        X_temp = check_array(X, dtype=None)
        # if X had no dtype of its own (e.g. a list) and was coerced to
        # strings, re-run the check forcing object dtype to keep mixed values
        if not hasattr(X, 'dtype') and np.issubdtype(X_temp.dtype, np.str_):
            X = check_array(X, dtype=object)
        else:
            X = X_temp

because there is currently no way to indicate "numeric-or-object" (meaning: if it is numeric, keep it numeric; if it is object, keep it object (up to here this is the dtype=None behaviour); but if there is no dtype yet, the coerced array should be either numeric or object).
This last part mainly targets lists, where numpy's np.array([mixed list]) converts everything to string as soon as there is a single string in the list.
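For example, this is plain NumPy behaviour, shown here to motivate the snippet above:

```python
import numpy as np

# One string in the list and NumPy coerces *everything* to a string dtype,
# silently turning the integers into '1' and '2'
X = np.array([[1, 'a'], [2, 'b']])
print(X.dtype)        # a '<U...' string dtype
print(X[0, 0])        # '1' (a string, no longer an int)

# Forcing dtype=object keeps the original mixed values intact
X_obj = np.array([[1, 'a'], [2, 'b']], dtype=object)
print(X_obj[0, 0])    # 1 (still an int)
```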

@jeremiedbb (Member, Author) commented Jul 3, 2018

I did some tests and the overhead is indeed quite significant.
Also, I spoke with Olivier IRL and his opinion is that we should let those pass through the imputer. The error will eventually be raised by the next estimator in the pipeline.
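For reference, the kind of comparison involved (a hypothetical micro-benchmark, not the actual measurements referred to above):

```python
import timeit
import numpy as np

X = np.array(["a"] * 100_000, dtype=object).reshape(-1, 1)

def elementwise_check(X):
    # The extra per-element pass a 'scalar-object' dtype would add
    for v in X.ravel():
        if not isinstance(v, (int, float, str, bytes, type(None))):
            raise ValueError(type(v).__name__)

# Baseline: the current dtype=object behaviour is essentially a no-op view
t_base = timeit.timeit(lambda: np.asarray(X, dtype=object), number=10)
t_check = timeit.timeit(lambda: elementwise_check(X), number=10)
print(f"no-op: {t_base:.4f}s  element-wise check: {t_check:.4f}s")
```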

That being said, there is still an issue with the estimator_checks, and more precisely with check_dtype_object: the reason this test does not fail with the SimpleImputer is that it's run with its default parameters (I guess), i.e. strategy='mean'. With this strategy, fit calls check_array with dtype=FLOAT_DTYPES, so the error is raised. The test would fail with 'constant' or 'most_frequent'.
So maybe this test needs to change, although I don't know exactly how yet.

edit: the comment of this test is
'# check that estimators treat dtype object as numeric if possible'
But now that some estimators allow non-numeric data, this should be updated.

> I think we should have named the ASSUME_FINITE setting ASSUME_DATA_VALID and we could use it here

Assuming you're talking about ALLOW_NAN, this won't work. It only skips tests on data with NaNs, but we can make a new list: ALLOW_NON_NUMERIC
