8000 [MRG+1] Scaling a sparse matrix along axis 0 should accept a csc by default by MechCoder · Pull Request #5791 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

[MRG+1] Scaling a sparse matrix along axis 0 should accept a csc by default #5791

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Nov 13, 2015

Conversation

MechCoder
Copy link
Member

For scaling a csc_matrix along axis 0, one pass is made across n_features.

For scaling a csc_matrix along axis 0, two passes are made across non_zero_values and a pass across n_features.

So the csc_matrix should be accepted by default. This can be validated by this quick benchmark.

X = np.random.rand(10000, 10000)
X[X < 0.8] = 0.0
csr = sparse.csr_matrix(X)
csc = sparse.csc_matrix(X)
%timeit csr_mean_variance_axis0(csr)
10 loops, best of 3: 83.7 ms per loop
%timeit csc_mean_variance_axis0(csc)
10 loops, best of 3: 41.2 ms per loop

@MechCoder
Copy link
Member Author

@giorgiop ?

@giorgiop
Copy link
Contributor

I think this makes sense, and I agree to remove the superfluous copies as well. Could you time test_data and see how much we gain in runtime in there, for example?

@MechCoder
Copy link
Member Author

I don't think this would matter much in the unless done in a loop. But it is good to have for correctness.

I do not give similar speeds between this branch and master.

@MechCoder
Copy link
Member Author

Well, I see only two places where a sparse matrix is used in test_data.py. So it definitely shouldn't matter in that.

@giorgiop
Copy link
Contributor

Well, I see only two places where a sparse matrix is used in test_data.py. So it definitely shouldn't matter in that.

You have also removed a copy in the dense case. But, yeah, probably not worth a try as inputs in tests are tiny anyway.

@MechCoder
Copy link
Member Author

Do you have any more comments or can I update the PR description to +1?

@giorgiop
Copy link
Contributor

+1 from my side!

@GaelVaroquaux GaelVaroquaux changed the title [MRG] Scaling a sparse matrix along axis 0 should accept a csc by default [MRG+1] Scaling a sparse matrix along axis 0 should accept a csc by default Nov 13, 2015
@GaelVaroquaux
Copy link
Member

LGTM. Thank you. Merging.

GaelVaroquaux added a commit that referenced this pull request Nov 13, 2015
[MRG+1] Scaling a sparse matrix along axis 0 should accept a csc by default
@GaelVaroquaux GaelVaroquaux merged commit d644e8e into scikit-learn:master Nov 13, 2015
@MechCoder MechCoder deleted the scale_csc branch November 13, 2015 16:10
@MechCoder
Copy link
Member Author

forgot to update whatsnew. should I?

@@ -124,7 +124,7 @@ def scale(X, axis=0, with_mean=True, with_std=True, copy=True):
scaling using the ``Transformer`` API (e.g. as part of a preprocessing
:class:`sklearn.pipeline.Pipeline`)
"""
X = check_array(X, accept_sparse='csr', copy=copy, ensure_2d=False,
X = check_array(X, accept_sparse='csc', copy=copy, ensure_2d=False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not ['csc, 'csr']?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not write the initial code, but I think the idea is to convert it into a format where the column operations are sufficiently faster. I just maintained that convention because converting it to csr would be non-intuitive.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you afraid that this will not be backward compatible?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How much faster is converting to csc and multipying compared to not converting? ['csc', 'csr'] will not convert to CSR but will let CSR pass.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0