10000 Feature request: Group aware Time-based cross validation · Issue #14257 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

Feature request: Group aware Time-based cross validation #14257

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
ogrisel opened this issue Jul 4, 2019 · 37 comments · May be fixed by #16236
Open

Feature request: Group aware Time-based cross validation #14257

ogrisel opened this issue Jul 4, 2019 · 37 comments · May be fixed by #16236
Assignees
Labels
Enhancement Moderate Anything that requires some knowledge of conventions and best practices module:model_selection New Feature

Comments

@ogrisel
Copy link
Member
ogrisel commented Jul 4, 2019

Basically combining TimeSeriesSplit with the Group awareness of other CV strategies such as GroupKFold.

I think it's a good first issue for first time contributors that are already familiar with the existing cross validation tools in scikit-learn:

https://scikit-learn.org/stable/modules/cross_validation.html

Source code is here:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/model_selection/_split.py

@ogrisel ogrisel added Enhancement good first issue Easy with clear instructions to resolve help wanted Moderate Anything that requires some knowledge of conventions and best practices New Feature Sprint labels Jul 4, 2019
@souravsingh
Copy link
Contributor

@ogrisel I am interested in working on this

@ogrisel
Copy link
Member Author
ogrisel commented Jul 5, 2019

Feel free to give it a try:

  • start by reading the documentation and the code of the existing CV split strategies,
  • then start by writing the tests for the new strategy and issue and early [WIP] PR to get feedback on the tests before writing the implementation itself.

Good luck!

@aditya1702
Copy link
Contributor

@souravsingh Are you working on this?

@souravsingh
Copy link
Contributor

@aditya1702 Yes, I am. I will be putting up a WIP PR.

@abhishek-jana
Copy link
Contributor
abhishek-jana commen 8000 ted Jul 12, 2019

@ogrisel I'd like to work on this issue.

@amueller
Copy link
Member

I think this is way more complex than what we usually tag as "good first issue" which are usually much much smaller.

@ManishAradwad
Copy link
Contributor

Is anyone working on this one?? I'd like to help..

@PimwipaV
Copy link

@getgaurav2 ah, you already put up the new WIP with the implementation code. Actually I am working on this (after an ok from @mfcabrera a few days ago in the pull request linked). And I took time to understand the feature requirements, which I still would like to have it confirmed here. So I'll ask for your suggestions anyway.

So, it has to take the Group effect onto the TimeSeriesSplit. For example, if my data has groups A, B, and C, for the first split, say only group C will be in the test set, and those C samples also have to be in the future of the samples from group A and B that are in training set. (Here, does it mean part of group A and B that are not in the past of C test set will have to be excluded?)

For the second split, say group B is the test set, and as time increases, the size of group A and C in the training set will be larger, because the nature of TimeSeriesSplit. (Here, if the number of folds are not more than number of groups, will it mean the splits will give more weights to the groups that will be in the training set later?) maybe your code answer this already. I will go read it carefully.

Since you already have the implementation code written, I don't know what's next now. I would still like to do it, but if I'm too slow and you would like to have this feature done soon please let me know how should we proceed. There's also the test code that should be fixed. @ogrisel

@getgaurav2
Copy link
Contributor
getgaurav2 commented Jan 26, 2020

@Pimpwhippa - this PR was just a way to save my work . I realized later that you were already on it. Apologies for the confusion .
My 2 cents on your questions:

  • Training set shall include Group A and B and test shall include Group C . Test C shall always be at a later time than A and B . To answer your ques more directly - "does it mean part of group A and B that are not in the past of C test set will have to be excluded" - No .
  • Yes
  • I would be happy to work with you if want to come at it from the testing side.

@PimwipaV
Copy link

@getgaurav2 Thank you for your answers. Yes I will be happy to work with you looking at the testing side. Can I make changes from mfcabrera s version and try to pass the test without taking into account your implementation code? Did his code fail the checks just because there was no implementation code? So I can make changes of the testing side only. Sorry for my newbie question. Just want to know what to do next.

@getgaurav2
Copy link
Contributor

@Pimpwhippa yes - his test was failing because of no implementation . I agree with his test cases otherwise ( just by looking at the code so far )
To begin purely on the test side , You can hard code an expected o/p of the split function and make sure it passes in local .

Eventually , the split function from my implementation should replace your hard coded expected o/p and pass the test cases .

@PimwipaV
Copy link

@getgaurav2 Alright. Will try so.

@PimwipaV
Copy link

@getgaurav2 if you have any comments on the PR please let me know. Should I add/change anything? I'll continue working on linting and documentation in the meantime.

getgaurav2 pushed a commit to getgaurav2/scikit-learn that referenced this issue Feb 18, 2020
getgaurav2 pushed a commit to getgaurav2/scikit-learn that referenced this issue Feb 18, 2020
getgaurav2 pushed a commit to getgaurav2/scikit-learn that referenced this issue Feb 19, 2020
getgaurav2 pushed a commit to getgaurav2/scikit-learn that referenced this issue Feb 19, 2020
getgaurav2 pushed a commit to getgaurav2/scikit-learn that referenced this issue Feb 24, 2020
getgaurav2 pushed a commit to getgaurav2/scikit-learn that referenced this issue Feb 24, 2020
getgaurav2 pushed a commit to getgaurav2/scikit-learn that referenced this issue Feb 24, 2020
@PimwipaV
Copy link

@jnothman Thank you for dropping by to give the guidance.

getgaurav2 pushed a commit to getgaurav2/scikit-learn that referenced this issue Sep 5, 2020
… same number of splits as in n-split argument.
getgaurav2 pushed a commit to getgaurav2/scikit-learn that referenced this issue Sep 6, 2020
getgaurav2 pushed a commit to getgaurav2/scikit-learn that referenced this issue Sep 6, 2020
getgaurav2 pushed a commit to getgaurav2/scikit-learn that referenced this issue Sep 29, 2020
@gosuto-inzasheru
Copy link
gosuto-inzasheru commented Dec 3, 2020

Hey guys, can I ask what the progress is on this? I having written a working iterator which gets through all of @Pimpwhippa's tests I could find (https://github.com/Pimpwhippa/scikit-learn/blob/afd31b1a337d3c6d02491fd6bff67b51b1a05e91/sklearn/model_selection/tests/test_split.py). It would need some proper docstringing(, linting?) and some critical review of course but maybe I could contribute?

@getgaurav2
Copy link
Contributor

Hey guys, can I ask what the progress is on this? I having written a working iterator which gets through all of @Pimpwhippa's tests I could find (https://github.com/Pimpwhippa/scikit-learn/blob/afd31b1a337d3c6d02491fd6bff67b51b1a05e91/sklearn/model_selection/tests/test_split.py). It would need some proper docstringing(, linting?) and some critical review of course but maybe I could contribute?

#16236 is waiting on a reviewer at this point.

@gosuto-inzasheru
Copy link

Ah it moved! Thanks for the #, that code looks great! Do you think it'll make 0.24?

@getgaurav2
Copy link
Contributor

Ah it moved! Thanks for the #, that code looks great! Do you think it'll make 0.24?

I Hope it makes it into 0.24 . @albertvillanova might be able to share some more info .

@labdmitriy
Copy link

Hello,
Is this feature completed and will be included in future releases?
I have no experience in contribution yet, but I wrote from scratch a class for grouped time series split with parameters, like test size, train size, number of splits, gap, shift size and window modes, and also different arguments checking and calculations.
Also it is pretty fast, because it uses only sklearn helper functions like indexable and _num_samples and numpy.
So if the current implementation will be included, I will try to write an article about that, but if not - I would happy to try to contribute my implementation in sklearn, if it is possible.
Thank you.

@devin-moonrise
Copy link

@labdmitriy When you say window mode, are you talking about rolling equal length windows for timeseries CV? I was thinking about starting to work on a PR for that, but if it's already in the pipe-ish I won't.

Hello,
Is this feature completed and will be included in future releases?
I have no experience in contribution yet, but I wrote from scratch a class for grouped time series split with parameters, like test size, train size, number of splits, gap, shift size and window modes, and also different arguments checking and calculations.
Also it is pretty fast, because it uses only sklearn helper functions like indexable and _num_samples and numpy.
So if the current implementation will be included, I will try to write an article about that, but if not - I would happy to try to contribute my implementation in sklearn, if it is possible.
Thank you.

@labdmitriy
Copy link
labdmitriy commented Dec 17, 2020

Yes, I added option to choose rolling or expanding window, also It works with sklearn GridSearchCV.
But I am not sure that my implementation will be interesting due to current implementation of this Feature Request, so your PR probably may be relevant.

@labdmitriy
Copy link

Hello @ogrisel @jnothman ,
Could you please comment the question above (about alternative implementation of this feature)? Just to make sure that it is not relevant now due to the current implementation by @getgaurav2 and if not - I will plan to publish my implementaion somewhere else.
Thank you.

@jnothman
Copy link
Member

Hi @labdmitriy, I don't really understand what the alternative proposal is here. I don't think our TimeSeriesSplit is going to fulfil every possible need around time series splitting, but we should support some key use cases to avoid users falling into traps. Scikit-learn does not generally see time-series as within it primary scope...

@jnothman
Copy link
Member

I think when we first discussed this, I did not appreciate that a primary need was to handle non-overlapping groups; #16236 assumes that each group is contiguous in the sample ordering. Was this your intention @ogrisel?

@labdmitriy
Copy link

Hi @labdmitriy, I don't really understand what the alternative proposal is here. I don't think our TimeSeriesSplit is going to fulfil every possible need around time series splitting, but we should support some key use cases to avoid users falling into traps. Scikit-learn does not generally see time-series as within it primary scope...

Hi @jnothman,
Thank you for your answer.
I mean that it is possible to create more general grouped time series, but if there is no need - no problem, I will share my implementation in the article later.
Thank you.

@labdmitriy
Copy link
labdmitriy commented Nov 21, 2021

Hi @labdmitriy, I don't really understand what the alternative proposal is here. I don't think our TimeSeriesSplit is going to fulfil every possible need around time series splitting, but we should support some key use cases to avoid users falling into traps. Scikit-learn does not generally see time-series as within it primary scope...

Hi @jnothman, Thank you for your answer. I mean that it is possible to create more general grouped time series, but if there is no need - no problem, I will share my implementation in the article later. Thank you.

Hi @jnothman,

I've shared my implementation of group time series cross-validation which is compatible with sklearn.
Implementation contains ideas from different libraries and also additional functionality.
Implemented using Python standard library + numpy + sklearn helper functions.
I hope that it will be useful for implementation in sklearn, and I will be glad to contribute it to sklearn if it would be possible.

Article: https://medium.com/@labdmitriy/advanced-group-time-series-validation-bb00d4a74bcc
GitHub repository with source code: https://github.com/labdmitriy/ml-lab

Update (2022-05-27): enhanced version of this implementation now is the part of the mlxtend library.

Thank you.

@gerardsimons
Copy link

When will this finally be a part of sklearn library? I don't really like to have to depend on external libs to be able to do something so basic as this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
0