Feature request: Group aware Time-based cross validation #14257
Comments
@ogrisel I am interested in working on this
Feel free to give it a try:
Good luck!
@souravsingh Are you working on this?
@aditya1702 Yes, I am. I will be putting up a WIP PR.
@ogrisel I'd like to work on this issue.
I think this is way more complex than what we usually tag as "good first issue", which are usually much, much smaller.
Is anyone working on this one? I'd like to help.
@getgaurav2 ah, you already put up a new WIP with the implementation code. Actually I am working on this (after an OK from @mfcabrera a few days ago in the linked pull request), and I took time to understand the feature requirements, which I would still like to have confirmed here. So I'll ask for your suggestions anyway.

It has to take the group effect into TimeSeriesSplit. For example, if my data has groups A, B, and C, then for the first split only group C might be in the test set, and those C samples also have to be in the future of the group A and B samples that are in the training set. (Does that mean the parts of groups A and B that are not in the past of the C test set have to be excluded?) For the second split, say group B is the test set; as time increases, the number of group A and C samples in the training set grows, because of the nature of TimeSeriesSplit. (If the number of folds is not more than the number of groups, does that mean the splits give more weight to the groups that appear in the training set later?) Maybe your code answers this already; I will go read it carefully.

Since you already have the implementation code written, I don't know what's next. I would still like to do it, but if I'm too slow and you would like to have this feature done soon, please let me know how we should proceed. There's also the test code that should be fixed. @ogrisel
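The semantics described above (each fold tests on a single group whose samples lie entirely in the future of the training samples) can be sketched roughly as follows. This is a hypothetical illustration of the idea, not scikit-learn's API: it assumes each group's samples are contiguous and the data is already ordered in time, and the helper name `group_time_series_splits` is made up.

```python
import numpy as np

def group_time_series_splits(groups, n_splits):
    """Sketch of a group-aware time-series split (not the sklearn API).

    Assumes `groups` is time-ordered with contiguous groups. Fold i
    trains on all groups before the test group, so every test sample
    is in the future of every training sample.
    """
    groups = np.asarray(groups)
    # Unique groups in order of first appearance, i.e. time order.
    _, first_idx = np.unique(groups, return_index=True)
    unique_groups = groups[np.sort(first_idx)]
    if n_splits >= len(unique_groups):
        raise ValueError("n_splits must be smaller than the number of groups")
    n_train_groups = len(unique_groups) - n_splits
    for i in range(n_splits):
        train_groups = unique_groups[: n_train_groups + i]
        test_group = unique_groups[n_train_groups + i]
        train_idx = np.flatnonzero(np.isin(groups, train_groups))
        test_idx = np.flatnonzero(groups == test_group)
        yield train_idx, test_idx

groups = ["A", "A", "B", "B", "C", "C", "D"]
for train, test in group_time_series_splits(groups, n_splits=2):
    # fold 0: train on A+B, test on C; fold 1: train on A+B+C, test on D
    print(train, test)
```

Whether samples from the training groups that are *after* the test window should be dropped (the question raised above) is exactly the kind of design decision this sketch does not settle.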
@Pimpwhippa - this PR was just a way to save my work. I realized later that you were already on it. Apologies for the confusion.
@getgaurav2 Thank you for your answers. Yes, I will be happy to work with you on the testing side. Can I make changes from @mfcabrera's version and try to pass the tests without taking your implementation code into account? Did his code fail the checks just because there was no implementation code? Then I can make changes on the testing side only. Sorry for the newbie questions; I just want to know what to do next.
@Pimpwhippa yes - his test was failing because there was no implementation. I agree with his test cases otherwise (just from looking at the code so far). Eventually, the split function from my implementation should replace your hard-coded expected output and pass the test cases.
@getgaurav2 Alright, I will try that.
@getgaurav2 If you have any comments on the PR please let me know. Should I add/change anything? I'll continue working on linting and documentation in the meantime.
@jnothman Thank you for dropping by to give guidance.
Hey guys, can I ask what the progress is on this? I have written a working iterator which passes all of @Pimpwhippa's tests I could find (https://github.com/Pimpwhippa/scikit-learn/blob/afd31b1a337d3c6d02491fd6bff67b51b1a05e91/sklearn/model_selection/tests/test_split.py). It would need proper docstrings (and linting?) and some critical review of course, but maybe I could contribute?
#16236 is waiting on a reviewer at this point. |
Ah, it moved! Thanks for the #, that code looks great! Do you think it'll make the next release?
I hope it makes it into 0.24. @albertvillanova might be able to share some more info.
Hello, |
@labdmitriy When you say window mode, are you talking about rolling equal-length windows for time-series CV? I was thinking about starting to work on a PR for that, but if it's already in the pipeline I won't.
Yes, I added an option to choose a rolling or expanding window, and it also works with sklearn's GridSearchCV.
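For reference, plain TimeSeriesSplit (without group awareness) already supports both behaviours: the default is an expanding window, and the real max_train_size parameter caps the training set to give a rolling window. A minimal sketch:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)

# Default: expanding window -- the training set grows with each split.
expanding = TimeSeriesSplit(n_splits=3)
# max_train_size caps the training set length -> rolling window.
rolling = TimeSeriesSplit(n_splits=3, max_train_size=4)

for (tr_exp, test), (tr_roll, _) in zip(expanding.split(X), rolling.split(X)):
    # In the final fold, the expanding train set is [0..7] while the
    # rolling train set is capped to the last four samples, [4..7].
    print("expanding:", tr_exp, "rolling:", tr_roll, "test:", test)
```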
Hello @ogrisel @jnothman,
Hi @labdmitriy, I don't really understand what the alternative proposal is here. I don't think our TimeSeriesSplit is going to fulfil every possible need around time-series splitting, but we should support some key use cases to avoid users falling into traps. Scikit-learn does not generally see time series as within its primary scope...
Hi @jnothman, |
Hi @jnothman, I've shared my implementation of group time-series cross-validation which is compatible with sklearn. Article: https://medium.com/@labdmitriy/advanced-group-time-series-validation-bb00d4a74bcc Update (2022-05-27): an enhanced version of this implementation is now part of the mlxtend library. Thank you.
When will this finally be part of the sklearn library? I don't really like having to depend on external libs to do something as basic as this.
Basically combining TimeSeriesSplit with the group awareness of other CV strategies such as GroupKFold.

I think it's a good first issue for first-time contributors who are already familiar with the existing cross-validation tools in scikit-learn:
https://scikit-learn.org/stable/modules/cross_validation.html

Source code is here:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/model_selection/_split.py
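One concrete trap the combined splitter would avoid: TimeSeriesSplit.split accepts a groups argument but ignores it, so a group's samples can land on both the train and test sides of the same fold. A minimal sketch of the problem (the group labels here are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(8).reshape(-1, 1)
groups = np.array(["A", "A", "A", "B", "B", "B", "C", "C"])

# `groups` is accepted but ignored by TimeSeriesSplit, so group A
# leaks across the train/test boundary of the first fold.
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X, groups=groups):
    print("train groups:", sorted(set(groups[train_idx])),
          "test groups:", sorted(set(groups[test_idx])))
```

A group-aware variant would instead keep each group entirely on one side of every split, as GroupKFold does for unordered data.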