[MRG] Stratifiedkfold continuous (fixed) #6598
Conversation
force-pushed from f128148 to 555fd93
please run flake8 on your code
force-pushed from 555fd93 to 354e02d
done
force-pushed from 354e02d to c0af740
msg = "y_train falls into bins of too ragged sizes") | ||
|
||
|
||
def test_binnedstratifiedkfold_has_more_stable_distribution_moments_between_folds(): |
The name could be a bit shorter I feel ;)
Well, unfortunately expressiveness requires space. Your suggestion?
I was thinking more like you could name it simply to test_binned_stratified_kfold_stable_dist_moments
and add a comment inside the test explaining it in more detail if you prefer?
I would just leave it as it is, though I don't mind if you amend it.
On Wed, Apr 13, 2016 at 9:32 AM, Raghav R V <notifications@github.com> wrote:
In sklearn/tests/test_cross_validation.py (#6598 (comment)):
#bins = np.percentile(y, np.arange(n_folds))
bins = np.array([np.percentile(y, q) for q in range(n_folds)])
for train_index, test_index in skf:
    y_test = y[test_index]
    hist_test, _ = np.histogram(y_test, bins=bins)
    assert_true(all(abs(hist_test - np.mean(hist_test)) <= 1),
                msg="y_test falls into bins of too ragged sizes")
    y_train = y[train_index]
    hist_train, _ = np.histogram(y_train, bins=bins)
    assert_true(all(abs(hist_train - np.mean(hist_train)) <= 1),
                msg="y_train falls into bins of too ragged sizes")
+def test_binnedstratifiedkfold_has_more_stable_distribution_moments_between_folds():
I was thinking more like you could name it simply to
test_binned_stratified_kfold_stable_dist_moments and add a comment inside
the test explaining it in more detail if you prefer?
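As an aside on the quoted snippet: np.percentile expects percentile ranks between 0 and 100, so range(n_folds) requests the 0th through (n_folds-1)th percentiles rather than evenly spaced quantile edges. A minimal sketch of what the edges presumably should look like (illustrative only, not code from the PR):

import numpy as np

y = np.random.RandomState(0).exponential(size=100)
n_folds = 5

# n_folds + 1 evenly spaced quantile edges covering the full 0-100 range
bins = np.percentile(y, np.linspace(0, 100, n_folds + 1))
hist, _ = np.histogram(y, bins=bins)
print(hist)  # roughly len(y) / n_folds samples per bin, even for skewed y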
Okay no problem!
Thanks heaps for the work. As we are deprecating the …
It will take a bit of time, but I'll do it.
@rvraghav93: can you please point me to the file I should put it in for …
Please take your time! And it should go in here: …
Thank you! I have also written a visual test for the test/train splitting, which visualizes the split like: …
Another question: for …
force-pushed from ae275a5 to 567e9a8
force-pushed from 567e9a8 to 42dea3b
@@ -577,6 +578,150 @@ def __len__(self):
        return self.n_folds


class BinnedStratifiedKFold(_BaseKFold):
This needs to be removed from cross_validation.py, as we are deprecating the whole import path. You can safely have it implemented inside model_selection alone.
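For readers following along, the core idea can be sketched against today's model_selection API by binning the continuous target and delegating to the existing StratifiedKFold. This is an illustrative sketch, not the PR's implementation; the equal-frequency binning rule and the n_bins default are assumptions:

import numpy as np
from sklearn.model_selection import StratifiedKFold


def binned_stratified_splits(X, y, n_splits=5, n_bins=10, random_state=None):
    # Sketch only: bin the continuous target into equal-frequency bins,
    # then delegate stratification to the existing StratifiedKFold.
    y = np.asarray(y)
    # interior quantile edges; digitize maps samples to labels 0..n_bins-1
    edges = np.percentile(y, np.linspace(0, 100, n_bins + 1))[1:-1]
    y_binned = np.digitize(y, edges)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True,
                          random_state=random_state)
    return skf.split(X, y_binned)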
From a cursory look, this looks like great work to me! However you should leave the …
That sounds cool, but I don't think it fits well with our API. Sorry!
We discussed this at length in #4294, and decided not to support such behavior. Even if only …
You could also add a nice example using one of the regression datasets and compare the std of … Such an example should go in here.
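Something along these lines, perhaps (a sketch of the kind of example suggested; the dataset, estimator, and bin count are arbitrary choices, and whether the std actually drops depends on the data):

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

X, y = load_diabetes(return_X_y=True)

# baseline: plain shuffled KFold
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores_kf = cross_val_score(Ridge(), X, y, cv=kf)

# binned stratification: stratify on a quantile-binned copy of y;
# cross_val_score accepts an explicit list of (train, test) splits as cv
edges = np.percentile(y, np.linspace(0, 100, 11))[1:-1]
y_binned = np.digitize(y, edges)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores_bin = cross_val_score(Ridge(), X, y, cv=list(skf.split(X, y_binned)))

print(scores_kf.std(), scores_bin.std())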
So why are we doing the binning strategy and not the sorting one? Sorting wouldn't have any additional parameters, while binning does, right?
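For reference, the sorting strategy being contrasted here is usually the parameter-free sort-and-deal construction (a sketch under that assumption, not code from the PR):

import numpy as np


def sorted_kfold_test_folds(y, n_splits=5):
    # Sketch: sort samples by target value, then deal consecutive samples
    # round-robin into folds, so every fold spans the full range of y.
    order = np.argsort(y)
    return [order[i::n_splits] for i in range(n_splits)]

Each returned array is one fold's test indices; the train indices are the complement. No bin count is needed, which is the point being made above.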
Has any progress been made on this issue? It would be nice to have the BinnedStratifiedKFold function in the stable package.
I'm still not convinced by binning vs sorting. @riccamastellone is there a particular reason why you'd want binning?
@amueller you're probably right: no real need for it
@DSLituiev Are you with us? :)
I have my doubts, but if you show that one or the other provides a better estimate of generalisation error given imbalanced density of the regression target than shuffled KFold, particularly if the approach is described in the literature, then do what works...

On 11 Mar 2017 4:06 am, "Dmytro Lituiev" wrote:

> What is the suggestion: no need for the binning, i.e. shutting the issue down? What about sorting? Binning is based on sorting. I do not get what the proposal is. @raghavrv @amueller
I suppose by better I mean lower variance.
This is described in Max Kuhn's book "Applied Predictive Modeling" on page 68: "To account for the outcome when splitting the data, stratified random sampling applies random sampling within subgroups (such as the classes). In this way, there is a higher likelihood that the outcome distributions will match. When the outcome is a number, a similar strategy can be used; the numeric values are broken into similar groups (e.g., low, medium, and high) and the randomization is executed within these groups." We should add that to the references.
(I wonder how binning compares to local shuffling after sorting)
How local is local shuffling / how would you do that?
Though I think we can move forward with binning if we agree it's pretty reasonable and intuitive. And I kinda feel better if I can point to a book that says it makes sense (does that make sklearn the Wikipedia of ML?)
Yes, let's do binning, then. By local shuffling I had thought randomly swapping adjacent pairs would be like adding random noise in the quantile transform.
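One possible reading of "randomly swapping adjacent pairs" (a sketch; the exact scheme was never pinned down in the thread):

import numpy as np


def locally_shuffled_order(y, seed=None):
    # Sketch: sort by y, then swap each adjacent (2i, 2i+1) pair with
    # probability 0.5, adding a little local noise to the sorted order.
    rng = np.random.default_rng(seed)
    order = np.argsort(y)
    for i in range(0, len(order) - 1, 2):
        if rng.random() < 0.5:
            order[i], order[i + 1] = order[i + 1], order[i]
    return order

The jittered order could then be dealt round-robin into folds, as in the sorting sketch above.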
I'm interested in this feature and would like to help bring it into sklearn. As a start, I can rebase @DSLituiev's branch on top of the latest master and fix any issues. Or is there some other way I can help?
@sainathadapa that would be a good start
This PR addresses issue #4757. Tests attached (fixed).