[MRG] Stratifiedkfold continuous (fixed) by DSLituiev · Pull Request #6598 · scikit-learn/scikit-learn · GitHub

Status: Open. Wants to merge 2 commits into base: main.
Conversation

@DSLituiev (Contributor)

This PR addresses issue #4757.

Tests are attached (fixed).
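For context, here is a minimal sketch of the general idea behind stratified k-fold for a continuous target (my illustration, not this PR's actual implementation): discretize y into quantile bins, then stratify on the bin labels with the existing StratifiedKFold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.RandomState(0)
y = rng.lognormal(size=100)          # continuous, skewed target
n_bins = 5                           # hypothetical binning parameter

# assign each sample to a quantile bin of y
edges = np.percentile(y, np.linspace(0, 100, n_bins + 1))
y_binned = np.digitize(y, edges[1:-1])

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(np.zeros((len(y), 1)), y_binned):
    # each test fold now covers the full range of y roughly evenly
    counts = np.bincount(y_binned[test_idx], minlength=n_bins)
    assert counts.max() - counts.min() <= 1
```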

@DSLituiev DSLituiev changed the title Stratifiedkfold continuous [MRG] Stratifiedkfold continuous (fixed) Mar 27, 2016
@DSLituiev DSLituiev force-pushed the stratifiedkfold_continuous branch 4 times, most recently from f128148 to 555fd93 Compare March 27, 2016 19:25
@agramfort (Member)

Please run flake8 on your code.

@DSLituiev DSLituiev force-pushed the stratifiedkfold_continuous branch from 555fd93 to 354e02d Compare March 27, 2016 22:28
@DSLituiev (Contributor, Author)

Done.

            msg="y_train falls into bins of too ragged sizes")


    def test_binnedstratifiedkfold_has_more_stable_distribution_moments_between_folds():
raghavrv (Member):
The name could be a bit shorter I feel ;)

DSLituiev (Contributor, Author):
well, unfortunately expressiveness requires space. your suggestion?

raghavrv (Member):
I was thinking more like you could name it simply to test_binned_stratified_kfold_stable_dist_moments and add a comment inside the test explaining it in more detail if you prefer?

DSLituiev (Contributor, Author):
I would just leave it as it is, though I don't mind if you amend it.

On Wed, Apr 13, 2016 at 9:32 AM, Raghav R V notifications@github.com wrote:

In sklearn/tests/test_cross_validation.py (#6598 (comment)):
        #bins = np.percentile(y, np.arange(n_folds))
        bins = np.array([np.percentile(y, q) for q in range(n_folds)])
        for train_index, test_index in skf:
            y_test = y[test_index]
            hist_test, _ = np.histogram(y_test, bins=bins)
            assert_true(all(abs(hist_test - np.mean(hist_test)) <= 1),
                        msg="y_test falls into bins of too ragged sizes")
            y_train = y[train_index]
            hist_train, _ = np.histogram(y_test, bins=bins)
            assert_true(all(abs(hist_train - np.mean(hist_train)) <= 1),
                        msg="y_train falls into bins of too ragged sizes")

    +def test_binnedstratifiedkfold_has_more_stable_distribution_moments_between_folds():

I was thinking more like you could name it simply to
test_binned_stratified_kfold_stable_dist_moments and add a comment inside
the test explaining it in more detail if you prefer?



raghavrv (Member):
Okay no problem!

@raghavrv (Member)

Thanks heaps for the work. As we are deprecating the cross_validation.py, your implementation needs to be done for the model_selection module. Could you modify your implementation to have a data independent class initialization and add this to model_selection instead of cross_validation please?
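To illustrate the requested shape, here is a hypothetical sketch (not the PR's final code, and the class name is my placeholder): all parameters are set in __init__, and the data appears only in split(), matching the model_selection API.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


class BinnedStratifiedKFoldSketch:
    # Hypothetical illustration; the real implementation belongs in
    # model_selection/_split.py. Initialization is data-independent.
    def __init__(self, n_splits=5, n_bins=10, random_state=None):
        self.n_splits = n_splits
        self.n_bins = n_bins
        self.random_state = random_state

    def split(self, X, y):
        # the continuous y is binned here, at split time, not in __init__
        y = np.asarray(y)
        edges = np.percentile(y, np.linspace(0, 100, self.n_bins + 1))
        y_binned = np.digitize(y, edges[1:-1])
        skf = StratifiedKFold(n_splits=self.n_splits, shuffle=True,
                              random_state=self.random_state)
        return skf.split(X, y_binned)
```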

@DSLituiev (Contributor, Author)

It will take a bit of time, but I'll do it.

@DSLituiev (Contributor, Author)

@rvraghav93: can you please point me to the file I should put it in for model_selection?

@raghavrv (Member)

Please take your time!

And it should go in here - model_selection/_split.py

@DSLituiev (Contributor, Author)

Thank you! I have also written a visual test for the train/test splitting, which visualizes a split like
x---x-x--x-x, with x marking test samples with respect to the original y. Do you think it could be useful? If so, where should it be added?
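The described output format could be produced with a few lines like these (my illustration, not the author's actual helper):

```python
import numpy as np


def render_split(n_samples, test_index):
    # mark test samples with 'x' and train samples with '-', in the
    # order of the original y
    marks = np.full(n_samples, "-", dtype="<U1")
    marks[np.asarray(test_index)] = "x"
    return "".join(marks)
```

For example, `render_split(12, [0, 4, 6, 9, 11])` returns `"x---x-x--x-x"`, the pattern quoted above.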

@DSLituiev (Contributor, Author)

Another question about the model_selection variant:
I basically use only y, not X. Since y is an optional argument, is it OK to add a foolproof mechanism that works when split(y) is called? I.e., if no y is given and len(X.shape) == 1 or X.shape[1] == 1, treat X as y?

@DSLituiev DSLituiev force-pushed the stratifiedkfold_continuous branch 2 times, most recently from ae275a5 to 567e9a8 Compare April 16, 2016 06:02
@DSLituiev DSLituiev force-pushed the stratifiedkfold_continuous branch from 567e9a8 to 42dea3b Compare April 16, 2016 17:45
@@ -577,6 +578,150 @@ def __len__(self):
return self.n_folds


class BinnedStratifiedKFold(_BaseKFold):
raghavrv (Member):
This needs to be removed from cross_validation.py. As we are deprecating the whole import path. You can safely have it implemented inside model_selection alone.

@raghavrv (Member)

From a cursory look, this looks like great work to me! However, you should leave cross_validation.py/test_cross_validation.py untouched, as we discussed earlier (#6598 (comment)). Please ping me once done.

@raghavrv (Member)

Thank you! I also have written a visual test for the test/train splitting, which visualizes split like:
x---x-x--x-x with x for test with respect to original y. Do you think it can be useful?

That sounds cool, but I don't think it fits well with our API. Sorry!

@raghavrv (Member)

if no y is assigned and len(X.shape)==1 or X.shape[1] == 1 treat X as y?

We discussed this at length in #4294, and decided not to support such behavior. Even if only y is used, both X and y must be supplied.
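In practice that just means passing a placeholder X of the right length when only y matters. For instance, with the existing StratifiedKFold:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
X = np.zeros((len(y), 1))   # required by the API even though it is unused

skf = StratifiedKFold(n_splits=3)
# each test fold contains one sample per class
folds = [test for _, test in skf.split(X, y)]
```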

@raghavrv (Member) commented May 24, 2016

You could also add a nice example using one of regression datasets and compare the std of KFold with the BinnedStratifiedKFold...

Such an example should go in here.
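The suggested comparison could be sketched like this (my illustration; BinnedStratifiedKFold was never merged, so StratifiedKFold on quantile-binned labels stands in for it):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import KFold, StratifiedKFold

X, y = load_diabetes(return_X_y=True)


def fold_mean_spread(cv, labels):
    # std of the per-fold mean of y: lower means more homogeneous folds
    means = [y[test].mean() for _, test in cv.split(X, labels)]
    return np.std(means)


# bin the continuous target at its quintiles for stratification
y_binned = np.digitize(y, np.percentile(y, [20, 40, 60, 80]))
plain = fold_mean_spread(KFold(n_splits=5, shuffle=True, random_state=0), y)
binned = fold_mean_spread(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=0), y_binned)
# binned stratification should typically give more homogeneous fold means
```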

@amueller (Member)

So why are we doing the binning strategy and not the sorting one? Sorting wouldn't have any additional parameters, while binning does, right?
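The parameter-free sorting strategy alluded to here can be sketched in a couple of lines (my illustration of the alternative under discussion): sort by y and deal samples into folds round-robin, so every fold spans the whole range of y without choosing a bin count.

```python
import numpy as np


def sorted_kfold_assignments(y, n_splits):
    # fold[i] is the fold index of sample i; consecutive samples in
    # sorted order of y land in different folds
    fold = np.empty(len(y), dtype=int)
    fold[np.argsort(y)] = np.arange(len(y)) % n_splits
    return fold
```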

@riccamastellone
Has any progress been made on this issue? It would be nice to have BinnedStratifiedKFold in the stable package.
If @DSLituiev is not porting his work to the model_selection module, I could look into it.

@amueller (Member)

I'm still not convinced by binning vs sorting. @riccamastellone is there a particular reason why you'd want binning?

@riccamastellone
@amueller you're probably right: no real need for it

@raghavrv (Member)

@DSLituiev Are you with us? :)

@DSLituiev (Contributor, Author)

What is the suggestion: that there is no need for binning, i.e. shutting the issue down? What about sorting? Binning is based on sorting. I don't get what the proposal is. @raghavrv @amueller

@jnothman (Member) commented Mar 12, 2017 via email

@jnothman (Member) commented Mar 12, 2017 via email

@amueller (Member)

This is described in Max Kuhn's book "Applied predictive modeling" on page 68: "To account for the outcome when splitting the data, stratified random sampling applies random sampling within subgroups (such as the classes). In this way, there is a higher likelihood that the outcome distributions will match. When the outcome is a number, a similar strategy can be used; the numeric values are broken into similar groups (e.g., low, medium, and high) and the randomization is executed within these groups."

We should add that to the references.
I'm also now pro-binning because it allows repeated stratified k-fold. If we did sorting there would be "only one" correct sorting; binning allows for randomization.
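The repetition point can be made concrete with today's API (a sketch under the assumption that stratifying on binned labels stands in for BinnedStratifiedKFold): with bins, RepeatedStratifiedKFold gives a different randomization per repeat, which a deterministic sort cannot.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

rng = np.random.RandomState(0)
y = rng.normal(size=90)
# break the numeric target into low/medium/high groups, as in Kuhn's book
y_binned = np.digitize(y, np.percentile(y, [100 / 3, 200 / 3]))

rkf = RepeatedStratifiedKFold(n_splits=3, n_repeats=2, random_state=0)
# 3 folds x 2 repeats = 6 test sets, re-randomized within bins per repeat
splits = [tuple(test) for _, test in rkf.split(np.zeros((len(y), 1)), y_binned)]
```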

@jnothman (Member) commented May 21, 2018 via email

@amueller (Member)

How local is local shuffling / how would you do that?
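One possible reading of "local shuffling" (my assumption, not something spelled out in the thread): sort by y, shuffle within fixed-size windows of the sorted order, then deal samples into folds round-robin. The window size controls how local the shuffle is.

```python
import numpy as np


def locally_shuffled_folds(y, n_splits, window, random_state=None):
    rng = np.random.RandomState(random_state)
    order = np.argsort(y)
    for start in range(0, len(order), window):
        rng.shuffle(order[start:start + window])  # shuffles the view in place
    fold = np.empty(len(y), dtype=int)
    fold[order] = np.arange(len(y)) % n_splits   # round-robin over shuffled order
    return fold
```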

@amueller (Member)

Though I think we can move forward with binning if we agree it's pretty reasonable and intuitive. And I kinda feel better if I can point to a book that says it makes sense (does that make sklearn the Wikipedia of ML?)

@jnothman (Member) commented May 21, 2018 via email

@sainathadapa
I'm interested in this feature and would like to help bring it into sklearn. As a start, I can rebase @DSLituiev's branch on top of the latest master and fix any issues. Or is there some other way I can help?

@amueller (Member) commented Aug 2, 2019

@sainathadapa that would be a good start

8 participants