8000 [MRG+1] Fix: FeatureHasher now accepts string values by devashishd12 · Pull Request #6173 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content
8000

[MRG+1] Fix: FeatureHasher now accepts string values #6173

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 19, 2016
Merged

[MRG+1] Fix: FeatureHasher now accepts string values #6173

merged 1 commit into from
Mar 19, 2016

Conversation

devashishd12
Copy link
Contributor

Addresses issue #4883. Builds on the work done in #4913.

@devashishd12
Copy link
Contributor Author

@jnothman @amueller can you please review this once?

@devashishd12 devashishd12 changed the title [MRG] Fix: FeatureHasher accepts string values [WIP] Fix: FeatureHasher accepts string values Jan 15, 2016
@devashishd12 devashishd12 changed the title [WIP] Fix: FeatureHasher accepts string values [MRG] Fix: FeatureHasher accepts string values Jan 15, 2016
@devashishd12 devashishd12 changed the title [MRG] Fix: FeatureHasher accepts string values [MRG] Fix: FeatureHasher now accepts string values Jan 15, 2016
@rrohan
Copy link
Contributor
rrohan commented Jan 17, 2016

Is there a way to make this work with just one call to isinstance ?

@devashishd12
Copy link
Contributor Author

What I can do is encode for both string_types and unicode but I'm not sure if it would give any added advantage over the current method.

@devashishd12
Copy link
Contributor Author

@jnothman sorry for disturbing but I'm not quite sure about the tests for this one. Are they fine?
cc: @amueller

elif isinstance(v, string_types):
f = "%s%s%s" % (f, '=', v)
value = 1
elif isinstance(v, unicode):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand when this case happens. Surely the isinstance(v, string_types) branch will already have been executed.

@jnothman
Copy link
Member

I guess one further thing that is not tested here is that "foo":"bar" and "foo":"baz" (likely) hash to different columns.

Otherwise, the tests seem fine.

value = v
if isinstance(v, integer_types):
value = v
elif isinstance(v, string_types):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be the if case without any need for elif branches.

@devashishd12
Copy link
Contributor Author

@jnothman I've made the changes. Could you please check? Thanks!

@@ -19,6 +19,23 @@ def test_feature_hasher_dicts():
assert_array_equal(X1.toarray(), X2.toarray())


def test_feature_hasher_dicts_with_string_values():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't find this test necessary. There is already a test for the equivalence of dict and pair representation. You could more easily add a string-valued pair to the existing test and still show equivalence of the representations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine I'll modify the existing test to add this.

@devashishd12
Copy link
Contributor Author

@jnothman sorry for the delay. I have addressed your comments. Can you please check once? Thanks!

@devashishd12
Copy link
Contributor Author

@jnothman gentle ping :)

@sritchie
Copy link

Would love to see this too!

@jnothman
Copy link
Member

We could more explicitly test equivalence of handling to DictVectorizer using the latter's inverse_transform, but I think what you've got here suffices. Thanks.

LGTM. Second review?

@jnothman
Copy link
Member

Please add an entry in doc/whatsnew.rst crediting both authors.

@jnothman jnothman changed the title [MRG] Fix: FeatureHasher now accepts string values [MRG+1] Fix: FeatureHasher now accepts string values Feb 27, 2016
@devashishd12
Copy link
Contributor Author

@jnothman I have added the whats_new entry. Hope it's alright.

@jnothman
Copy link
Member

It'll do!

@@ -8,6 +8,8 @@ from libc.stdlib cimport abs
cimport numpy as np
import numpy as np

from ..externals.six import string_types, integer_types
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

integer_types import is not needed anymore.

@devashishd12
Copy link
Contributor Author

Can I please get another review on this one?

@jnothman
Copy link
Member
jnothman commented Mar 8, 2016

@MechCoder quick review?

x1, x2 = h.transform(raw_X).toarray()
x1_nz = sorted(np.abs(x1[x1 != 0]))
x2_nz = sorted(np.abs(x2[x2 != 0]))
assert_equal([1, 1], x1_nz)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for my ignorance, but how do we know these values?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when we'll transform the first row (stored in x1), the "a" will be hashed as 1. Here x1 will be: [ 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.] and x2 will be: [ 0. 0. 0. 1. 4. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. -1.].

@MechCoder
Copy link
Member

We might also want to add a minor test to show that the non-zero feature column got by transforming a = [{'bax': 'abc'}, {'bax': 'abc'}] for both the rows are equal

@devashishd12
Copy link
Contributor Author

@MechCoder I've added the test. I hope this is fine now.

@MechCoder
Copy link
Member

Thanks !

MechCoder added a commit that referenced this pull request Mar 19, 2016
[MRG+1] Fix: FeatureHasher now accepts string values
@MechCoder MechCoder merged commit c2eaf75 into scikit-learn:master Mar 19, 2016
@devashishd12
Copy link
Contributor Author

Thanks for reviews @MechCoder @jnothman!

@devashishd12 devashishd12 deleted the featurehasher_fix branch March 19, 2016 04:08
@jnothman
Copy link
Member

Thanks for the PR!

On 19 March 2016 at 15:08, Devashish Deshpande notifications@github.com
wrote:

Thanks for reviews @MechCoder https://github.com/MechCoder @jnothman
https://github.com/jnothman!


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#6173 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0