[MRG] Stratifiedkfold continuous (fixed) #6598

DSLituiev · 2016-03-27T17:47:27Z

this PR addresses the issue #4757

tests attached (fixed)

agramfort · 2016-03-27T21:00:07Z

please run flake8 on your code

DSLituiev · 2016-03-27T22:29:37Z

done

raghavrv · 2016-04-13T09:57:48Z

sklearn/tests/test_cross_validation.py

+                        msg = "y_train falls into bins of too ragged sizes")
+
+
+def test_binnedstratifiedkfold_has_more_stable_distribution_moments_between_folds():


The name could be a bit shorter I feel ;)

well, unfortunately expressiveness requires space. your suggestion?

I was thinking more like you could name it simply to test_binned_stratified_kfold_stable_dist_moments and add a comment inside the test explaining it in more detail if you prefer?

I would just leave it as it is, though I don't mind if you amend it.

On Wed, Apr 13, 2016 at 9:32 AM, Raghav R V notifications@github.com
wrote:

In sklearn/tests/test_cross_validation.py
#6598 (comment)
:

#bins = np.percentile(y, np.arange(n_folds))

bins = np.array([np.percentile(y, q) for q in range(n_folds)])

for train_index, test_index in skf:

y_test = y[test_index]

hist_test, _ = np.histogram( y_test, bins = bins )

assert_true(all(abs(hist_test - np.mean(hist_test)) <= 1),

msg = "y_test falls into bins of too ragged sizes")

y_train = y[train_index]

hist_train, _ = np.histogram( y_test, bins = bins )

assert_true(all(abs(hist_train - np.mean(hist_train)) <= 1),

msg = "y_train falls into bins of too ragged sizes")

+def test_binnedstratifiedkfold_has_more_stable_distribution_moments_between_folds():

I was thinking more like you could name it simply to
test_binned_stratified_kfold_stable_dist_moments and add a comment inside
the test explaining it in more detail if you prefer?

—
You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub
https://github.com/scikit-learn/scikit-learn/pull/6598/files/c0af740545c373a3dc5f6a95c415c96873e56491#r59580505

Okay no problem!

raghavrv · 2016-04-15T22:58:48Z

Thanks heaps for the work. As we are deprecating the cross_validation.py, your implementation needs to be done for the model_selection module. Could you modify your implementation to have a data independent class initialization and add this to model_selection instead of cross_validation please?

DSLituiev · 2016-04-15T23:04:40Z

it will take a bit of time, but i'll do it

DSLituiev · 2016-04-15T23:07:07Z

@rvraghav93: can you please point me to the file I should put it in for model_selection?

raghavrv · 2016-04-15T23:08:11Z

Please take your time!

And it should go in here - model_selection/_split.py

DSLituiev · 2016-04-15T23:10:14Z

Thank you! I also have written a visual test for the test/train splitting, which visualizes split like:
x---x-x--x-x with x for test with respect to original y. Do you think it can be useful? If so, where can one add it?

DSLituiev · 2016-04-15T23:52:55Z

Another question: for model_selection variant:
I basically use only y, not X. As y is an optional argument, is it OK to make a foolproof mechanism, which will work when split(y) is called? I.e. if no y is assigned and len(X.shape)==1 or X.shape[1] == 1 treat X as y?

raghavrv · 2016-05-24T12:10:14Z

sklearn/cross_validation.py

@@ -577,6 +578,150 @@ def __len__(self):
        return self.n_folds


+class BinnedStratifiedKFold(_BaseKFold):


This needs to be removed from cross_validation.py. As we are deprecating the whole import path. You can safely have it implemented inside model_selection alone.

raghavrv · 2016-05-24T12:16:03Z

From a cursory look, this look like great work to me! However you should leave the cross_validation.py/test_cross_validation.py untouched as we discussed earlier (#6598 (comment)). Please ping me once done.

raghavrv · 2016-05-24T12:18:27Z

Thank you! I also have written a visual test for the test/train splitting, which visualizes split like:
x---x-x--x-x with x for test with respect to original y. Do you think it can be useful?

That sounds cool, but I don't think it fits well with our API. Sorry!

raghavrv · 2016-05-24T12:21:25Z

if no y is assigned and len(X.shape)==1 or X.shape[1] == 1 treat X as y?

We discussed this at length in #4294, and decided not to support such behavior. Even if only y is used, both X and y must be supplied.

raghavrv · 2016-05-24T12:23:56Z

You could also add a nice example using one of regression datasets and compare the std of KFold with the BinnedStratifiedKFold...

Such an example should go in here.

amueller · 2016-05-25T20:45:17Z

So why are we doing the binning strategy and not the sorting one? Sorting wouldn't have any additional parameters, while binning does, right?

riccamastellone · 2016-10-20T15:43:13Z

Has any progress made on this issue? It would be nice to have the BinnedStratifiedKFold function in the stable package.
If @DSLituiev is not porting his work as a model_selection module, I could look into it

amueller · 2016-10-20T15:57:47Z

I'm still not convinced by binning vs sorting. @riccamastellone is there a particular reason why you'd want binning?

riccamastellone · 2016-10-24T10:30:46Z

@amueller you're probably right: no real need for it

raghavrv · 2016-11-11T02:01:44Z

@DSLituiev Are you with us? :)

DSLituiev · 2017-03-10T17:05:59Z

What is the suggestion, no need for the binning, i.e. shutting issue down? What about sorting? Binning is based on sorting. I do not get what is the proposal. @raghavrv @amueller

jnothman · 2017-03-12T00:22:18Z

I have my doubts, but if you show that one or the other provides a better estimate of generalisation error given imbalanced density of the regression target than shuffled kfold, particularly if the approach is described in the literature, then do what works...

…

On 11 Mar 2017 4:06 am, "Dmytro Lituiev" ***@***.***> wrote: What is the suggestion, no need for the binning, i.e. shutting issue down? What about sorting? Binning is based on sorting. I do not get what is the proposal. @raghavrv <https://github.com/raghavrv> @amueller <https://github.com/amueller> — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#6598 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAEz6y-9W3WH_m0nlGXbuAafHo_yp3wxks5rkYL4gaJpZM4H5gYe> .

jnothman · 2017-03-12T00:26:59Z

I suppose by better I mean lower variance

…

On 12 Mar 2017 11:21 am, "Joel Nothman" ***@***.***> wrote: I have my doubts, but if you show that one or the other provides a better estimate of generalisation error given imbalanced density of the regression target than shuffled kfold, particularly if the approach is described in the literature, then do what works... On 11 Mar 2017 4:06 am, "Dmytro Lituiev" ***@***.***> wrote: > What is the suggestion, no need for the binning, i.e. shutting issue > down? What about sorting? Binning is based on sorting. I do not get what is > the proposal. @raghavrv <https://github.com/raghavrv> @amueller > <https://github.com/amueller> > > — > You are receiving this because you are subscribed to this thread. > Reply to this email directly, view it on GitHub > <#6598 (comment)>, > or mute the thread > <https://github.com/notifications/unsubscribe-auth/AAEz6y-9W3WH_m0nlGXbuAafHo_yp3wxks5rkYL4gaJpZM4H5gYe> > . >

amueller · 2018-05-21T15:54:08Z

This is described in Max Kuhn's book "Applied predictive modeling" on page 68: "To account for the outcome when splitting the data, stratified random sampling applies random sampling within subgroups (such as the classes). In this way, there is a higher likelihood that the outcome distributions will match. When the outcome is a number, a similar strategy can be used; the numeric values are broken into similar groups (e.g., low, medium, and high) and the randomization is executed within these groups."

We should add that to the references.
I'm also now pro-binning because that allows repeated stratified k-fold. If we'd do sorting there would be "only one" correct sorting, the binning allows for randomization.

jnothman · 2018-05-21T22:03:20Z

(I wonder how binning compares to local shuffling after sorting)

amueller · 2018-05-21T22:09:24Z

How local is local shuffling / how would you do that?

amueller · 2018-05-21T22:10:41Z

Though I think we can move forward with binning if we agree it's pretty reasonable and intuitive. And I kinda feel better if I can point to a book that says it makes sense (does that make sklearn the wikipedia of ML?)

jnothman · 2018-05-21T23:28:04Z

yes, let's do binning, then. By local shuffling I had thought randomly swapping adjacent pairs would be like adding random noise in the quantile transform.

sainathadapa · 2018-08-03T11:36:14Z

I'm interested in this feature, and would like to help to bring this feature into the sklearn. As a start, I can rebase the @DSLituiev's branch on top of the latest master, and fix any issues. Or is there some other way I can help?

amueller · 2019-08-02T21:06:48Z

@sainathadapa that would be a good start

DSLituiev changed the title ~~Stratifiedkfold continuous~~ [MRG] Stratifiedkfold continuous (fixed) Mar 27, 2016

DSLituiev force-pushed the stratifiedkfold_continuous branch 4 times, most recently from f128148 to 555fd93 Compare March 27, 2016 19:25

DSLituiev force-pushed the stratifiedkfold_continuous branch from 555fd93 to 354e02d Compare March 27, 2016 22:28

BinnedStratifiedKFold for conitinuous variables

c0af740

DSLituiev force-pushed the stratifiedkfold_continuous branch from 354e02d to c0af740 Compare March 28, 2016 18:38

MechCoder mentioned this pull request Apr 13, 2016

[WIP] ENH Make stratified a parameter and conflate all the stratified/non-stratified cross-validator class pairs. #5569

Closed

3 tasks

raghavrv reviewed Apr 13, 2016
View reviewed changes

DSLituiev force-pushed the stratifiedkfold_continuous branch 2 times, most recently from ae275a5 to 567e9a8 Compare April 16, 2016 06:02

BinnedStratifiedKFold ported to model_selection

42dea3b

DSLituiev force-pushed the stratifiedkfold_continuous branch from 567e9a8 to 42dea3b Compare April 16, 2016 17:45

raghavrv reviewed May 24, 2016
View reviewed changes

ktran9891 mentioned this pull request Nov 20, 2017

extend StratifiedKFold to float for regression #4757

Open

jnothman mentioned this pull request Mar 7, 2018

[Feature-Request] Add a flag to StratifiedKFold to force classes with only 1 sample in training #10767

Open

thomasjpfan added the Stalled label Aug 6, 2019

github-actions bot added the module:model_selection label Mar 2, 2020

Base automatically changed from master to main January 22, 2021 10:49

glemaitre mentioned this pull request Oct 7, 2024

Add balance_regression option to train_test_split for regression problems #30009

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[MRG] Stratifiedkfold continuous (fixed) #6598

[MRG] Stratifiedkfold continuous (fixed) #6598

		msg = "y_train falls into bins of too ragged sizes")


		def test_binnedstratifiedkfold_has_more_stable_distribution_moments_between_folds():

		@@ -577,6 +578,150 @@ def __len__(self):
		return self.n_folds


		class BinnedStratifiedKFold(_BaseKFold):

[MRG] Stratifiedkfold continuous (fixed) #6598

Are you sure you want to change the base?

[MRG] Stratifiedkfold continuous (fixed) #6598

Conversation

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment