8000 modified check_arrays to handle multiple sparse formats by msalahi · Pull Request #3094 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

modified check_arrays to handle multiple sparse formats #3094

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

modified check_arrays to handle multiple sparse formats #3094

wants to merge 1 commit into from

Conversation

msalahi
Copy link
@msalahi msalahi commented Apr 21, 2014

on pull request #3076, it was proposed we extend the functionality of sklearn.utils.validation.check_arrays(...) to allow for multiple sparse formats. for example:

check_arrays(*arrays, sparse_format=['csc', 'csr'])

i did my best to make sure my implementation's compatible with both string arguements (sparse_format='csc') and collections (sparse_format=['csc', 'csr']). wasn't quite sure what to do in the case of something like...

check_arrays(*arrays, sparse_format=['csc', 'dense'])

...where both sparse and dense formats are listed. i assumed the most intuitive way to read that was to say "csc and dense arrays are allowed," and attempted to get the code to behave accordingly.

i could definitely use a watchful eye on this one, as check_arrays is used everywhere.

@@ -192,7 +192,8 @@ def check_arrays(*arrays, **options):
Python lists or tuples occurring in arrays are converted to 1D numpy
arrays, unless allow_lists is specified.

sparse_format : 'csr', 'csc' or 'dense', None by default
sparse_format : 'csr', 'csc', 'dense', or an iterable containing some
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the type should fit on one line to render fine in the doc

I would write

sparse_format : 'csr' | 'csc' | 'dense' | list
    ...
    if is list ...

@msalahi
Copy link
Author
msalahi commented Apr 22, 2014

addressed the issues @agramfort brought up: change documentation line, and avoid copying matrices to other types unnecessarily. There started to be way too much logic going on in check_arrays so i moved the sparse_format keyword arg validation and the array validation into two helper functions.

i decided that the most intuitive way to have the list argument to sparse_format work is to just have it accept ['csc' and/or 'csr']. 'dense,' and None can still be specified, they just can't be stuck in a list.

for example:

# OK
check_arrays(X, Y, sparse_format='csc')
check_arrays(X, Y, sparse_format=['csc'])
check_arrays(X, Y, sparse_format=['csc', 'csr']
check_arrays(X, Y, sparse_format='dense')

# not OK:

check_arrays(X, Y, sparse_format=['dense', 'csc'])
check_arrays(X, Y, sparse_format=[None]

finally, i added a little more test coverage around the sparse_format functionality

elif sparse_format is None:
return array
else:
input_format = array.__class__.__name__
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use array.getformat()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, there's array.format which is the same.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whoops, didn't realize that existed -- much nicer!

@msalahi
Copy link
Author
msalahi commented Apr 23, 2014
8000

substituted my wonky array.__class__.__name__.split(' ')[0] for the much cleaner array.format to get a matrix's sparse format.

also added clearer documentation for the sparse_format keyword argument in check_arrays.

@coveralls
Copy link

Coverage Status

Coverage remained the same when pulling a085b65 on msalahi:check-arrays-formats into feb6eb1 on scikit-learn:master.

@@ -102,6 +102,8 @@ def test_check_arrays_exceptions():
assert_raises(TypeError, check_arrays, [0, 1], [0, 1], meaning_of_life=42)
assert_raises(ValueError, check_arrays, [0], [0], sparse_format='fake')
assert_raises(ValueError, check_arrays, np.zeros((2, 3, 4)), [0])
assert_raises(ValueError, check_arrays, [0], [0], sparse_format=['csc',
'fake'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cosmetics, I would prefer:

assert_raises(ValueError, check_arrays, [0], [0],
              sparse_format=['csc', 'fake'])

@amueller
Copy link
Member

I actually implemented this and also refactored all of the input validation in the last sprint. The check_arrays function is gone and we can allow multiple sparse formats. Therefore closing this.

@amueller amueller closed this Jan 16, 2015
@amueller
Copy link
Member

For reference: #3443 and #3447

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

3E2C Successfully merging this pull request may close these issues.

7 participants
0