Add Spline Transformer · Issue #17027 · scikit-learn/scikit-learn · GitHub

Add Spline Transformer #17027


Closed
lorentzenchr opened this issue Apr 24, 2020 · 14 comments · Fixed by #18368

@lorentzenchr
Member
lorentzenchr commented Apr 24, 2020

Describe the workflow you want to enable

I propose to add a SplineTransformer to preprocessing. This is similar to PolynomialFeatures, but gives linear models more flexibility (and numerical stability) when dealing with continuous numerical features.

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import SplineTransformer
from sklearn.pipeline import make_pipeline

# get data X, y
...
model = make_pipeline(SplineTransformer(degree=3, n_knots=20,
                                        positioning='quantile'),
                      LogisticRegression())
model.fit(X, y)

Describe your proposed solution

Add SplineTransformer and internally use scipy for splines. Start with
- 1-dimensional b-splines
- equidistant knots
- quantile based knots
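To make the proposal concrete, here is a rough sketch of building such a B-spline design matrix with scipy, supporting both equidistant and quantile-based knots. This is not the proposed implementation; the function and parameter names (`bspline_design_matrix`, `positioning`) are illustrative only.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_design_matrix(x, n_knots=5, degree=3, positioning="uniform"):
    """Return the (n_samples, n_knots + degree - 1) B-spline basis for x."""
    if positioning == "quantile":
        # Note: may produce duplicate knots for heavily tied data.
        inner = np.quantile(x, np.linspace(0, 1, n_knots))
    else:  # equidistant knots over the data range
        inner = np.linspace(x.min(), x.max(), n_knots)
    # Repeat the boundary knots degree times (clamped knot vector)
    # so the basis spans the full data range.
    t = np.r_[[inner[0]] * degree, inner, [inner[-1]] * degree]
    n_bases = len(t) - degree - 1
    # Evaluate each basis element via a one-hot coefficient vector.
    cols = []
    for i in range(n_bases):
        c = np.zeros(n_bases)
        c[i] = 1.0
        cols.append(BSpline(t, c, degree, extrapolate=False)(x))
    return np.nan_to_num(np.column_stack(cols))

X = bspline_design_matrix(np.linspace(0, 1, 50), n_knots=5, degree=3)
```

Inside the data range the basis functions sum to one (partition of unity), which is one reason the resulting design matrix is better conditioned than raw polynomial features.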

Additional context

Patsy has an implementation of these splines that matches the R versions.

References

Eilers, Marx "Flexible Smoothing with B-splines and Penalties" passes the scikit-learn inclusion criteria by some margin 😏

@Reksbril
Contributor
Reksbril commented Apr 26, 2020

Can I work on this, or does it need some further discussion?

@lorentzenchr
Member Author

@Reksbril Thanks for volunteering. It needs discussion first.

@lorentzenchr
Member Author

Any comment from a core developer is very welcome, especially regarding whether it is worth starting a PR.

@rth
Copy link
Member
rth commented Jul 4, 2020

I'm not very familiar with the topic, so I can't comment on practical considerations. I have used splines for interpolation in the past but not so much in the ML context. Overall it seems to be a fairly standard and well established approach that would pass the inclusion criteria.

I'm a bit surprised spline regression isn't more mainstream in the Python ML ecosystem. In particular, I can't find any earlier issues in scikit-learn about this. If we ever want to go beyond 1-d splines for multi-dimensional data, the placement and number of knots seem less straightforward, as discussed e.g. in this SO answer. Also, the ESL book says on this topic:

In many cases when the number of potential dimensions (features) is large, automatic methods are more desirable. The MARS and MART procedures both fall into this category.

MARS is implemented in https://github.com/scikit-learn-contrib/py-earth BTW.

Another question I have is why B-splines and not, say, smoothing splines, which would have fewer hyper-parameters?

It would be nice to have a few examples, maybe using patsy, showing how this compares for linear models and multi-dimensional data, e.g. with KBinsDiscretizer.
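As a minimal sketch of such a comparison baseline (data and parameters here are illustrative): KBinsDiscretizer yields a piecewise-constant one-hot basis, essentially the degree-0 counterpart of the spline basis proposed in this issue.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Piecewise-constant baseline: one indicator column per quantile bin.
binned = make_pipeline(
    KBinsDiscretizer(n_bins=10, encode="onehot-dense", strategy="quantile"),
    Ridge(alpha=1e-3),
).fit(X, y)
```

A spline transformer would replace the indicator columns with overlapping smooth basis functions, giving a continuous fitted curve instead of a step function.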

Maybe @agramfort @ogrisel would have other comments?

@thomasjpfan
Member

@amueller and I have been interested in splines in the context of GAMs, which would start with 1-D splines for each feature. I have been planning on pushing this forward for scikit-learn.

@lorentzenchr
Member Author

@rth

Another question I have is why B-splines and say not smoothing splines, which would have fewer hyper-parameters?

B-splines are just a numerically convenient 1-D basis for splines and are available in scipy. You can represent a smoothing spline (a natural cubic spline) in terms of a B-spline basis.
While smoothing splines have indeed nice properties, they place a knot at every sample (if there are no ties), which is a lot. As soon as you are in a multivariate setting, in my experience you would rather place a fixed number of knots (say 40), equidistantly or quantile based. If available, and you have the time, you can add a penalty for the splines, which thereby become P-splines, and select the penalty strength by some sort of cross-validation. Here again B-splines have a nice property: the integral over the squared 2nd derivative (often used as a penalty) is well approximated by a 2nd order difference matrix, see Eilers & Marx.
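As a small illustration of the Eilers & Marx idea (illustrative code, not part of the proposal): the P-spline penalty replaces the integrated squared second derivative with second-order differences of neighbouring B-spline coefficients.

```python
import numpy as np

def second_diff_penalty(n_bases):
    """Second-order difference matrix D, shape (n_bases - 2, n_bases).

    The P-spline penalty is lam * ||D @ coef||^2, a cheap approximation
    of the integrated squared second derivative of the spline.
    """
    return np.diff(np.eye(n_bases), n=2, axis=0)

def fit_pspline(B, y, lam=1.0):
    """Penalized least squares: coef = (B'B + lam * D'D)^-1 B'y."""
    D = second_diff_penalty(B.shape[1])
    return np.linalg.solve(B.T @ B + lam * (D.T @ D), B.T @ y)
```

Because the penalty only couples neighbouring coefficients, the normal equations stay banded and cheap to solve, and `lam` can be tuned by cross-validation as described above.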

For scikit-learn, it would be nice to have splines available at all. Penalties are trickier due to API constraints (SLEP006 sample properties, and maybe also feature names, ring a bell), as the SplineTransformer would need to tell the linear model which columns belong to the same spline/original continuous feature, and the linear model might have to learn new penalties. In this regard, the truncated power series basis for splines might be easier, because the typical penalty for it is the plain L2 penalty.

@thomasjpfan

I have been planning on pushing this forward for scikit-learn.

Nice to hear 😏

@lorentzenchr
Member Author

@thomasjpfan I like splines not so much for their interpretability as for their flexibility in modelling continuous features in a smooth and controllable way (a good mix between manual and automatic). As a counterexample, the fashionable decision-tree-based methods have discontinuities all over the place. Depending on the application, this may be a concern.

@jnothman
Member
jnothman commented Jul 6, 2020 via email

@mayer79
Contributor
mayer79 commented Aug 7, 2020

Great initiative - I really miss splines in scikit-learn. In practical applications, I very often work with natural cubic splines, see its options in R. They are very stable, have acceptable extrapolation properties and use astonishingly few extra parameters compared to a polynomial approximation. It would be great if one could (optionally) pass the knot positions.

@ecm0
ecm0 commented Feb 1, 2021

Very nice to see spline-based features made their way to scikit-learn! Today I posted a short demo of a multivariate spline-based transformer able to capture correlations between features, in case this could be useful for future developments: https://gist.github.com/ecm0/fe8966f9170409cfbc4f34c919462f98

@lorentzenchr
Member Author

@ecm0 Nice to hear you like splines.

@ecm0
ecm0 commented Feb 3, 2021

@lorentzenchr By the way, I'm happy to help in case multivariate splines are considered in the future.

@lorentzenchr
Member Author

By the way, I'm happy to help in case multivariate splines are considered in the future.

You're welcome to open an issue to propose and motivate new functionality. But note that we have a high barrier for new features, see this FAQ section.
In the case of multivariate splines, I think this is already available with a combination of SplineTransformer and PolynomialFeatures in a Pipeline. An example showing the usefulness could be helpful.

@ecm0
ecm0 commented Feb 4, 2021

@lorentzenchr I had not thought about combining PolynomialFeatures and SplineTransformer. I agree this does the same as what I did in my demo. So this is already covered by the new implementation.
