-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
[MRG+1] Fix: FeatureHasher now accepts string values #6173
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Is there a way to make this work with just one call to isinstance ? |
What I can do is encode for both |
elif isinstance(v, string_types): | ||
f = "%s%s%s" % (f, '=', v) | ||
value = 1 | ||
elif isinstance(v, unicode): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand when this case happens. Surely the isinstance(v, string_types)
branch will already have been executed.
I guess one further thing that is not tested here is that Otherwise, the tests seem fine. |
value = v | ||
if isinstance(v, integer_types): | ||
value = v | ||
elif isinstance(v, string_types): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be the if
case without any need for elif
branches.
@jnothman I've made the changes. Could you please check? Thanks! |
@@ -19,6 +19,23 @@ def test_feature_hasher_dicts(): | |||
assert_array_equal(X1.toarray(), X2.toarray()) | |||
|
|||
|
|||
def test_feature_hasher_dicts_with_string_values(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't find this test necessary. There is already a test for the equivalence of dict and pair representation. You could more easily add a string-valued pair to the existing test and still show equivalence of the representations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fine I'll modify the existing test to add this.
@jnothman sorry for the delay. I have addressed your comments. Can you please check once? Thanks! |
@jnothman gentle ping :) |
Would love to see this too! |
We could more explicitly test equivalence of handling to DictVectorizer using the latter's inverse_transform, but I think what you've got here suffices. Thanks. LGTM. Second review? |
Please add an entry in doc/whatsnew.rst crediting both authors. |
@jnothman I have added the whats_new entry. Hope it's alright. |
It'll do! |
@@ -8,6 +8,8 @@ from libc.stdlib cimport abs | |||
cimport numpy as np | |||
import numpy as np | |||
|
|||
from ..externals.six import string_types, integer_types |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
integer_types
import is not needed anymore.
Can I please get another review on this one? |
@MechCoder quick review? |
x1, x2 = h.transform(raw_X).toarray() | ||
x1_nz = sorted(np.abs(x1[x1 != 0])) | ||
x2_nz = sorted(np.abs(x2[x2 != 0])) | ||
assert_equal([1, 1], x1_nz) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for my ignorance, but how do we know these values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when we'll transform the first row (stored in x1), the "a" will be hashed as 1. Here x1
will be: [ 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
and x2
will be: [ 0. 0. 0. 1. 4. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. -1.]
.
We might also want to add a minor test to show that the non-zero feature column got by transforming |
@MechCoder I've added the test. I hope this is fine now. |
Thanks ! |
[MRG+1] Fix: FeatureHasher now accepts string values
Thanks for reviews @MechCoder @jnothman! |
Thanks for the PR! On 19 March 2016 at 15:08, Devashish Deshpande notifications@github.com
|
Addresses issue #4883. Builds on the work done in #4913.