Add Spline Transformer · Issue #17027 · scikit-learn/scikit-learn · GitHub

Add Spline Transformer #17027


Closed
lorentzenchr opened this issue Apr 24, 2020 · 14 comments · Fixed by #18368

@lorentzenchr
Member
lorentzenchr commented Apr 24, 2020

Describe the workflow you want to enable

I propose to add a SplineTransformer to preprocessing. This is similar to PolynomialFeatures, but gives linear models more flexibility (and numerical stability) when dealing with continuous numerical features.

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import SplineTransformer
from sklearn.pipeline import make_pipeline

# get data X, y
...
model = make_pipeline(SplineTransformer(degree=3, n_knots=20,
                                        positioning='quantile'),
                      LogisticRegression())
model.fit(X, y)

Describe your proposed solution

Add SplineTransformer and internally use scipy for splines. Start with
- 1-dimensional b-splines
- equidistant knots
- quantile based knots
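To make the proposal concrete, here is a rough sketch of building such a B-spline design matrix with scipy, supporting both equidistant and quantile-based knots. This is not the proposed implementation; the function and parameter names (`bspline_design_matrix`, `positioning`) are illustrative only.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_design_matrix(x, n_knots=5, degree=3, positioning="uniform"):
    """Return the (n_samples, n_knots + degree - 1) B-spline basis for x."""
    if positioning == "quantile":
        # Note: may produce duplicate knots for heavily tied data.
        inner = np.quantile(x, np.linspace(0, 1, n_knots))
    else:  # equidistant knots over the data range
        inner = np.linspace(x.min(), x.max(), n_knots)
    # Repeat the boundary knots degree times (clamped knot vector)
    # so the basis spans the full data range.
    t = np.r_[[inner[0]] * degree, inner, [inner[-1]] * degree]
    n_bases = len(t) - degree - 1
    # Evaluate each basis element via a one-hot coefficient vector.
    cols = []
    for i in range(n_bases):
        c = np.zeros(n_bases)
        c[i] = 1.0
        cols.append(BSpline(t, c, degree, extrapolate=False)(x))
    return np.nan_to_num(np.column_stack(cols))

X = bspline_design_matrix(np.linspace(0, 1, 50), n_knots=5, degree=3)
```

Inside the data range the basis functions sum to one (partition of unity), which is one reason the resulting design matrix is better conditioned than raw polynomial features.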

Additional context

Patsy has an implementation of these splines that matches the R versions.

References

Eilers, Marx "Flexible Smoothing with B-splines and Penalties" passes the scikit-learn inclusion criteria by some margin 😏

@Reksbril
Contributor
Reksbril commented Apr 26, 2020

Can I work on this, or does it need some further discussion?

@lorentzenchr
Member Author

@Reksbril Thanks for volunteering. It needs discussion first.

@lorentzenchr
Member Author

Any comment from a core developer is very welcome, especially regarding whether it is worth starting a PR.

@rth
Copy link
Member
rth commented Jul 4, 2020

I'm not very familiar with the topic, so I can't comment on practical considerations. I have used splines for interpolation in the past but not so much in the ML context. Overall it seems to be a fairly standard and well established approach that would pass the inclusion criteria.

I'm a bit surprised spline regression isn't more mainstream in the Python ML ecosystem. In particular, I can't find any earlier issues in scikit-learn about this. If we ever want to go beyond 1-d splines for multi-dimensional data, the placement and number of knots seem less straightforward, as discussed e.g. in this SO answer. Also, the ESL book says on this topic:

In many cases when the number of potential dimensions (features) is large, automatic methods are more desirable. The MARS and MART procedures both fall into this category.

MARS is implemented in https://github.com/scikit-learn-contrib/py-earth BTW.

Another question I have is why B-splines and not, say, smoothing splines, which would have fewer hyper-parameters?

It would be nice to have a few examples, maybe using patsy, showing how this compares for linear models and multi-dimensional data, e.g. with KBinsDiscretizer.
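As a minimal sketch of such a comparison baseline (data and parameters here are illustrative): KBinsDiscretizer yields a piecewise-constant one-hot basis, essentially the degree-0 counterpart of the spline basis proposed in this issue.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Piecewise-constant baseline: one indicator column per quantile bin.
binned = make_pipeline(
    KBinsDiscretizer(n_bins=10, encode="onehot-dense", strategy="quantile"),
    Ridge(alpha=1e-3),
).fit(X, y)
```

A spline transformer would replace the indicator columns with overlapping smooth basis functions, giving a continuous fitted curve instead of a step function.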

Maybe @agramfort @ogrisel would have other comments?

@thomasjpfan
Member

@amueller and I have been interested in splines in the context of GAMs, which would start with 1-D splines for each feature. I have been planning on pushing this forward for scikit-learn.

@lorentzenchr
Member Author

@rth

Another question I have is why B-splines and say not smoothing splines, which would have fewer hyper-parameters?

B-splines are just a numerically convenient 1-D basis for splines and are available in scipy. You can represent a smoothing spline (a natural cubic spline) in terms of a B-spline basis.
While smoothing splines have indeed nice properties, they place a knot at every sample (if there are no ties), which is a lot. As soon as you are in a multivariate setting, in my experience you would rather place a fixed number of knots (say 40), equidistantly or quantile based. If available, and you have the time, you can add a penalty for the splines, which thereby become P-splines, and select the penalty strength by some sort of cross-validation. Here again B-splines have a nice property: the integral over the squared 2nd derivative (often used as a penalty) is well approximated by a 2nd order difference matrix, see Eilers & Marx.
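As a small illustration of the Eilers & Marx idea (illustrative code, not part of the proposal): the P-spline penalty replaces the integrated squared second derivative with second-order differences of neighbouring B-spline coefficients.

```python
import numpy as np

def second_diff_penalty(n_bases):
    """Second-order difference matrix D, shape (n_bases - 2, n_bases).

    The P-spline penalty is lam * ||D @ coef||^2, a cheap approximation
    of the integrated squared second derivative of the spline.
    """
    return np.diff(np.eye(n_bases), n=2, axis=0)

def fit_pspline(B, y, lam=1.0):
    """Penalized least squares: coef = (B'B + lam * D'D)^-1 B'y."""
    D = second_diff_penalty(B.shape[1])
    return np.linalg.solve(B.T @ B + lam * (D.T @ D), B.T @ y)
```

Because the penalty only couples neighbouring coefficients, the normal equations stay banded and cheap to solve, and `lam` can be tuned by cross-validation as described above.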

For scikit-learn, it would be nice to have splines available at all. Penalties are trickier due to API constraints (SLEP006 sample properties, and maybe also feature names, ring a bell), as the SplineTransformer would need to tell the linear model which columns belong to the same spline/original continuous feature, and the linear model might have to learn new penalties. In this regard, the truncated power series basis for splines might be easier, because the typical penalty for it is the plain L2 penalty.

@thomasjpfan

I have been planning on pushing this forward for scikit-learn.

Nice to hear 😏

@lorentzenchr
Member Author

@thomasjpfan I like splines not so much for their interpretability as for their flexibility in modelling continuous features in a smooth and controllable way (a good mix between manual and automatic). As a counterexample, the fashionable decision-tree-based methods have discontinuities all over the place. Depending on the application, this may be a concern.

@jnothman
Member
jnothman commented Jul 6, 2020 via email

@mayer79
Contributor
mayer79 commented Aug 7, 2020

Great initiative - I really miss splines in scikit-learn. In practical applications, I very often work with natural cubic splines, see its options in R. They are very stable, have acceptable extrapolation properties and use astonishingly few extra parameters compared to a polynomial approximation. It would be great if one could (optionally) pass the knot positions.

@ecm0
ecm0 commented Feb 1, 2021

Very nice to see spline-based features made their way to scikit-learn! Today I posted a short demo of a multivariate spline-based transformer able to capture correlations between features, in case this could be useful for future developments: https://gist.github.com/ecm0/fe8966f9170409cfbc4f34c919462f98

@lorentzenchr
Member Author

@ecm0 Nice to hear you like splines.

@ecm0
ecm0 commented Feb 3, 2021

@lorentzenchr By the way, I'm happy to help in case multivariate splines are considered in the future.

@lorentzenchr
Member Author

By the way, I'm happy to help in case multivariate splines are considered in the future.

You're welcome to open an issue to propose and motivate new functionality. But note that we have a high barrier for new features, see this FAQ section.
In the case of multivariate splines, I think this is already available with a combination of SplineTransformer and PolynomialFeatures in a Pipeline. An example showing the usefulness could be helpful.

@ecm0
ecm0 commented Feb 4, 2021

@lorentzenchr I had not thought about combining PolynomialFeatures and SplineTransformer. I agree this does the same as what I did in my demo. So this is already covered by the new implementation.
