-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
[MRG+1] Scaling a sparse matrix along axis 0 should accept a csc by default #5791
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I think this makes sense, and I agree to remove the superfluous copies as well. Could you time |
I don't think this would matter much in the unless done in a loop. But it is good to have for correctness. I do not give similar speeds between this branch and master. |
Well, I see only two places where a sparse matrix is used in test_data.py. So it definitely shouldn't matter in that. |
You have also removed a copy in the dense case. But, yeah, probably not worth a try as inputs in tests are tiny anyway. |
Do you have any more comments or can I update the PR description to +1? |
+1 from my side! |
LGTM. Thank you. Merging. |
[MRG+1] Scaling a sparse matrix along axis 0 should accept a csc by default
forgot to update |
@@ -124,7 +124,7 @@ def scale(X, axis=0, with_mean=True, with_std=True, copy=True): | |||
scaling using the ``Transformer`` API (e.g. as part of a preprocessing | |||
:class:`sklearn.pipeline.Pipeline`) | |||
""" | |||
X = check_array(X, accept_sparse='csr', copy=copy, ensure_2d=False, | |||
X = check_array(X, accept_sparse='csc', copy=copy, ensure_2d=False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not ['csc, 'csr']
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not write the initial code, but I think the idea is to convert it into a format where the column operations are sufficiently faster. I just maintained that convention because converting it to csr would be non-intuitive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you afraid that this will not be backward compatible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How much faster is converting to csc and multipying compared to not converting? ['csc', 'csr']
will not convert to CSR but will let CSR pass.
For scaling a csc_matrix along axis 0, one pass is made across
n_features
.For scaling a csc_matrix along axis 0, two passes are made across
non_zero_values
and a pass acrossn_features
.So the csc_matrix should be accepted by default. This can be validated by this quick benchmark.