[MRG+1] Add sample_weight support to Dummy Regressor #3779
Conversation
@@ -389,25 +390,43 @@ def fit(self, X, y, sample_weight=None):
                              "'mean', 'median', 'quantile' or 'constant'"
                              % self.strategy)

-        y = check_array(y, accept_sparse='csr', ensure_2d=False)
+        y = check_array(y, ensure_2d=False)
I removed `accept_sparse='csr'` since it's not supported.

This is ready for review and it will fix #3420
Force-pushed from 9e4f6d5 to de49e67
Force-pushed from de49e67 to da31344
        check_consistent_length(X, y)
        self.output_2d_ = y.ndim == 2
        if y.ndim == 1:
            y = np.reshape(y, (-1, 1))
Noob question, out of curiosity: is there any difference between doing this and `y = y[:, np.newaxis]`? I always use the latter.
`y = y[:, np.newaxis]` doesn't preserve contiguity. This is a known bug and will be solved in the current or next release of numpy.
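A quick standalone check of the two spellings (the array and names here are illustrative, not taken from the PR):

```python
import numpy as np

y = np.arange(5.0)           # a 1-d contiguous array
a = np.reshape(y, (-1, 1))   # explicit reshape to a column vector
b = y[:, np.newaxis]         # the same values via a new trailing axis

# Both produce the same (n, 1) array of values.
assert np.array_equal(a, b)
assert a.shape == (5, 1) and b.shape == (5, 1)

# Contiguity can be inspected directly; in current NumPy both report
# C-contiguous here, so the discrepancy discussed above was a
# 2014-era flag bug rather than a difference in the values.
print(a.flags['C_CONTIGUOUS'], b.flags['C_CONTIGUOUS'])
```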
oh yes, I remember now :)
        # Find index of median prediction for each sample
        weight_cdf = sample_weight[sorted_idx].cumsum()
        percentile_or_above = weight_cdf >= (percentile / 100.0) * weight_cdf[-1]
        percentile_idx = percentile_or_above.argmax()
If I'm understanding right, these two lines can be replaced by
`percentile_idx = np.searchsorted(weight_cdf, (percentile / 100.) * weight_cdf[-1])`
or am I wrong?
Do you think this could be optimized in another PR? I have just taken what @pprett did previously and moved it here so it is useful to more than just gradient boosting.
Okay, unless @pprett thinks it is fine to change it over here.
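The suggested equivalence is easy to verify in isolation; this sketch uses made-up inputs, not the PR's test data:

```python
import numpy as np

array = np.array([3.0, 2.0, 4.0])
sample_weight = np.array([1.0, 2.0, 3.0])
percentile = 50

sorted_idx = np.argsort(array)
weight_cdf = sample_weight[sorted_idx].cumsum()
threshold = (percentile / 100.0) * weight_cdf[-1]

# Current code: boolean mask + argmax gives the first index whose
# cumulative weight reaches the threshold (a linear scan).
idx_argmax = (weight_cdf >= threshold).argmax()

# Suggested replacement: searchsorted with the default side='left'
# returns the first index where weight_cdf[i] >= threshold
# (a binary search).
idx_search = np.searchsorted(weight_cdf, threshold)

assert idx_argmax == idx_search
```

Since `weight_cdf` is non-decreasing, the two agree for every threshold; `searchsorted` is O(log n) where the mask-and-argmax scan is O(n), which is presumably the motivation for the suggestion.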
@arjoly Updated the PR description.

Thanks @MechCoder !
        # Find index of median prediction for each sample
        weight_cdf = sample_weight[sorted_idx].cumsum()
        percentile_or_above = weight_cdf >= (percentile / 100.0) * weight_cdf[-1]
Sorry for being stupid, but I am not able to get this to work. My arguments are [3, 2, 4] and [1, 2, 3] for array and sample_weight respectively. sorted_idx is an array, and indexing with it throws a TypeError. I wonder what the expected arguments are here.
sample_weight should be a numpy array
Thanks!
Thanks @MechCoder !
In the long term, I hope to replace the dummy estimator in gradient boosting with the dummy regressor and classifier. Any last reviewer?
def test_dummy_regressor_sample_weight(n_samples=10):
    random_state = np.random.RandomState(seed=1)

    X = [[0]] * n_samples
Would it be better to generate X randomly? Just as a sanity check.
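A sketch of what a test with random X might look like. `DummyRegressor` and its `fit(X, y, sample_weight=...)` signature are what this PR provides; the random data and the exact assertion are my own illustration, not the PR's test:

```python
import numpy as np
from sklearn.dummy import DummyRegressor

rng = np.random.RandomState(1)
n_samples = 10

X = rng.rand(n_samples, 1)          # random X instead of [[0]] * n_samples
y = rng.rand(n_samples)
sample_weight = rng.rand(n_samples)

# With strategy="mean", the fitted constant should be the weighted mean of y.
est = DummyRegressor(strategy="mean")
est.fit(X, y, sample_weight=sample_weight)

expected = np.average(y, weights=sample_weight)
assert np.allclose(est.predict(X), expected)
```

X being random makes no difference to a dummy estimator's predictions, which is exactly why it works as a sanity check: the fit must depend only on y and the weights.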
I suppose you can go ahead and merge if no one replies in 2-3 days, or whenever you feel like it. I believe no one is following these changes and this is not that big a diff.

Thanks @MechCoder ! I am OK to merge. Still, I would appreciate a quick last review of this small PR.

I think this should go in.
Thanks @arjoly .

Thanks :-)

@MechCoder Just to let you know for later: whenever you merge a branch, it's better to do it by rebase so the history stays linear. I have been told that it greatly eases maintenance and bug fixing.

Thanks for the tip!