ENH add sparse output to SplineTransformer #24145
Conversation
Thank you for the PR
```python
# Note that scipy BSpline returns float64 arrays and converts input
# x=X[:, i] to c-contiguous float64.
n_out = self.n_features_out_ + n_features * (1 - self.include_bias)
if X.dtype in FLOAT_DTYPES:
    dtype = X.dtype
else:
    dtype = np.float64
XBS = np.zeros((n_samples, n_out), dtype=dtype, order=self.order)
if use_sparse:
    output_list = []
```
There are quite a few sparse matrices being constructed in this implementation. For each feature:
- A CSR design matrix is constructed
- This matrix can be converted into a lil matrix
- hstack converts it all back to CSR.
What is the runtime of `transform` with sparse output compared to dense output?
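For reference, here is a minimal sketch of the per-feature pattern described above (illustrative data and knot vector, not the PR's actual code):

```python
import numpy as np
from scipy import sparse
from scipy.interpolate import BSpline

# Illustrative only: one CSR design matrix per feature, converted to LIL,
# then hstacked back to CSR.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 4.0, size=(10_000, 3))
degree = 3
# Uniform knot vector covering [0, 4] with `degree` padding knots on each side.
t = np.arange(-degree, 5 + degree, dtype=float)

blocks = []
for i in range(X.shape[1]):
    XBS_i = BSpline.design_matrix(X[:, i], t, degree)  # CSR (scipy >= 1.8)
    blocks.append(XBS_i.tolil())  # the LIL round-trip discussed above
XBS = sparse.hstack(blocks, format="csr")  # and back to CSR
```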
```
import numpy as np
from sklearn.preprocessing import SplineTransformer

X = np.linspace([-1, -10, 100], [1, 10, 100], 10000)
st_sparse = SplineTransformer(sparse_output=True, extrapolation="error").fit(X)
st_dense = SplineTransformer(extrapolation="error").fit(X)

%timeit st_dense.transform(X)
2.13 ms ± 20.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit st_sparse.transform(X)
43.7 ms ± 336 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
That's unfortunate. Using n_features=1 doesn't change the relative timings. But memory consumption should be better.
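An illustrative way to check the memory side, on the same toy data (the benefit grows with `n_knots`, since each row only has `degree + 1` nonzeros per feature regardless of the number of knots):

```python
import numpy as np
from sklearn.preprocessing import SplineTransformer

X = np.linspace([-1, -10, 100], [1, 10, 100], 10000)

dense = SplineTransformer(n_knots=20, extrapolation="error").fit_transform(X)
sp = SplineTransformer(
    n_knots=20, sparse_output=True, extrapolation="error"
).fit_transform(X)

print(dense.nbytes)  # full (n_samples, n_features_out) float64 array
print(sp.data.nbytes + sp.indices.nbytes + sp.indptr.nbytes)  # nonzeros only
```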
The bottleneck seems to be the function `_make_design_matrix` in scipy's `_bspl.pyx`. In contrast, `evaluate_spline` does all loops explicitly. We might report this upstream.
Not sure what's going on here TBH, but csr->lil->csr does sound expensive. Where is the bottleneck, what does the profiler say?
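For example (illustrative only), one way to get a first answer with the standard library profiler:

```python
import cProfile

import numpy as np
from sklearn.preprocessing import SplineTransformer

X = np.linspace([-1, -10, 100], [1, 10, 100], 10000)
st_sparse = SplineTransformer(sparse_output=True, extrapolation="error").fit(X)

# Sort by cumulative time to see whether scipy's design-matrix construction
# or the csr/lil conversions dominate.
cProfile.run("st_sparse.transform(X)", sort="cumulative")
```

In IPython, `%prun st_sparse.transform(X)` gives the same information.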
What's the impact of scipy/scipy#16840 on the scikit-learn benchmark in https://github.com/scikit-learn/scikit-learn/pull/24145/files#r940357974?
Are there scipy nightly builds to install? Otherwise, this is above my current time budget to benchmark.
Scipy 1.10 will not only speed up `BSpline.design_matrix` but also make it easier to implement spline extrapolation in scikit-learn, thereby further reducing runtime and memory consumption.
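For context, a sketch of the scipy >= 1.10 API being referred to (the `extrapolate` keyword of `BSpline.design_matrix` is not available in scipy 1.8/1.9; check the scipy docs for the exact behavior):

```python
import numpy as np
from scipy.interpolate import BSpline

degree = 3
t = np.arange(-degree, 5 + degree, dtype=float)  # base interval [0, 4]
x = np.array([-0.5, 2.0, 4.5])  # includes points outside the base interval

# scipy >= 1.10: out-of-range points are extrapolated instead of raising,
# so extrapolation no longer has to be patched in afterwards.
XBS = BSpline.design_matrix(x, t, degree, extrapolate=True)
print(XBS.toarray())
```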
SciPy has nightly builds:
pip install -i https://pypi.anaconda.org/scipy-wheels-nightly/simple scipy
but their CI is failing and not updating the nightly builds. I opened MacPython/scipy-wheels#175 to track the SciPy issue.
With `scipy==1.10.0.dev0`, one gets improved runtime:

```
import numpy as np
from sklearn.preprocessing import SplineTransformer

X = np.linspace([-1, -10, 100], [1, 10, 100], 10000)
st_sparse = SplineTransformer(sparse_output=True, extrapolation="error").fit(X)
st_dense = SplineTransformer(extrapolation="error").fit(X)

%timeit st_dense.transform(X)
1.89 ms ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit st_sparse.transform(X)
4.89 ms ± 53.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
On my laptop, I get (scipy 1.10.1):

```
%timeit st_dense.transform(X)
2.12 ms ± 46.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit st_sparse.transform(X)
4.61 ms ± 162 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
Thank you for adding this support, @lorentzenchr.
Here are a few comments and questions.
We won't have time to finish the review on this one before the 1.2 release. Moving it to 1.3.
Finally, CI is 🟢. Edit: The Codecov failure does not count, as it is clear that not everything can be tested with a single scipy version.
LGTM!
I just have a few last comments. Errors on the CI with scipy-dev are unrelated (see #26154).
Here is another pass of review. The performance is still poor with extrapolation (at least "periodic") even with scipy 1.10.1. From the inline comments, it's not clear if this is expected or not.
If it's expected and there is no easy way around it, I think the inline comments should be updated to make this more explicit (see below for details).
Other than that, LGTM!
```python
    XBS = sparse.hstack(output_list)
elif self.sparse_output:
    # TODO: Remove once scipy 1.10 is the minimum version. See comments above.
    XBS = sparse.csr_matrix(XBS)
```
This comment seems to imply that with scipy 1.10, the `.tolil` conversion would no longer be necessary. However, as far as I understand, we still have to go through this condition when `extrapolate="periodic"`, even with scipy 1.10 or later. Would there be a way to avoid the `.tolil` conversion completely with recent scipy versions?
At the moment, sparse periodic extrapolation is more than 20x slower than its dense counterpart:
```
In [1]: import numpy as np
   ...: from sklearn.preprocessing import SplineTransformer
   ...:
   ...: X = np.linspace([-1, -10, 100], [1, 10, 101], 10000)
   ...: extrapolation="periodic"
   ...: st_sparse = SplineTransformer(sparse_output=True, extrapolation=extrapolation).fit(X)
   ...: st_dense = SplineTransformer(extrapolation=extrapolation).fit(X)

In [2]: %timeit st_dense.transform(X)
1.29 ms ± 1.25 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [3]: %timeit st_sparse.transform(X)
64.1 ms ± 851 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

scipy version: 1.10.1
It is slower, yes, but this is not a performance regression, as sparse output is a new feature and the default is dense output. And yes, with scipy >= 1.10, we can get rid of (at least most of) the lil conversions, as we can use `design_matrix(..., extrapolate=True)`.
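If `design_matrix` indeed accepts `extrapolate="periodic"` (scipy >= 1.10; worth double-checking in the scipy docs), a rough sketch of the LIL-free path for the periodic case could look like this:

```python
import numpy as np
from scipy.interpolate import BSpline

degree = 3
t = np.arange(-degree, 5 + degree, dtype=float)  # base interval [0, 4]
x = np.array([-0.5, 2.0, 4.5])  # mapped back into [0, 4] periodically

# Let scipy handle the periodic wrap-around and return CSR directly,
# with no .tolil() round-trip on the scikit-learn side.
XBS = BSpline.design_matrix(x, t, degree, extrapolate="periodic")
```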
Oops, I broke the linter when resolving the conflicts via the GitHub UI. Let me push a fix.
Merged. Thanks @lorentzenchr!
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz> Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Reference Issues/PRs
Fixes #20998.
What does this implement/fix? Explain your changes.
This PR adds the argument ~~`sparse`~~ `sparse_output` to `SplineTransformer`. Set to `True`, it returns a sparse CSR matrix.
Any other comments?
This is available only for scipy >= 1.8. Further improvements will be possible with scipy 1.10 (the `extrapolate` argument for `BSpline.design_matrix`).
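A minimal usage sketch (requires scipy >= 1.8 and a scikit-learn version that includes this PR):

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import SplineTransformer

X = np.linspace([-1, -10, 100], [1, 10, 100], 100)
XBS = SplineTransformer(sparse_output=True).fit_transform(X)
print(sparse.issparse(XBS), XBS.format)  # True csr
```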