-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
Expose Seed in FeatureHasher and HashingVectorizer #29748
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I think that there is a lot of discussion in #14605 that makes this change at least controversial. I would need to read the entire discussion again to make sure what was the issue raised against adding a random state or seed to the |
What confuses me is that AFAICT those changes were merged and then reverted within 3 weeks without discussion. Looking at the file's commit history the seed gets merged here, but removed the next commit during a renaming What's odd is that the seed variable doesn't even appear during the renaming, it looks like an old file was used. Maybe I missed or misunderstood something, but is it possible removing the seed was accidental? If so, could the seed be added back like was done by #14605? In the PR's discussion people seemed happy with the solution settled upon (variable called |
As far I see, the
I went through the discussion and I could spot the following points. For a consistent API, we would prefer to have So it lead to discussion around fixing (or not) the behaviour of So the middle ground in #14605 was to introduce |
My bad, I misunderstood the original commit. For the discussion about naming the variable, is there a similar case in the Sklearn API that serve as a reference? I'm not sure how to help or who could help otherwise It would be nice to have a long term solution and not just modify sklearn files locally to have access to the seed |
Basically this is a particular case which make it difficult. In this case, we need to come with a consensus among the 6284 core developer. I'll bring this topic on the next developer meeting at the end of the month. |
Describe the workflow you want to enable
Varying the seed of the FeatureHasher allows the user to control what inputs collide. This can allow for a better feature space either through experimentation (as a hyperparameter) or explicitly searching for a space that minimizes "bad" collisions
Describe your proposed solution
Add an optional "seed" parameter to the init of FeatureHasher which defaults to 0 (the current behavior, see the underlying hashing function). The seed would be thread through to _hashing_transform
Ditto for HashingVectorizer. Only difference here is that the seed would be passed to the FeatureHasher instance
This seems straightforward so I can implement this solution if it makes sense
Describe alternatives you've considered, if relevant
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: