8000 Expose Seed in FeatureHasher and HashingVectorizer · Issue #29748 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

Expose Seed in FeatureHasher and HashingVectorizer #29748

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
FelixLabelle opened this issue Aug 30, 2024 · 5 comments
Open

Expose Seed in FeatureHasher and HashingVectorizer #29748

FelixLabelle opened this issue Aug 30, 2024 · 5 comments
Labels

Comments

@FelixLabelle
Copy link

Describe the workflow you want to enable

Varying the seed of the FeatureHasher allows the user to control what inputs collide. This can allow for a better feature space either through experimentation (as a hyperparameter) or explicitly searching for a space that minimizes "bad" collisions

Describe your proposed solution

Add an optional "seed" parameter to the init of FeatureHasher which defaults to 0 (the current behavior, see the underlying hashing function). The seed would be thread through to _hashing_transform

Ditto for HashingVectorizer. Only difference here is that the seed would be passed to the FeatureHasher instance

This seems straightforward so I can implement this solution if it makes sense

Describe alternatives you've considered, if relevant

No response

Additional context

No response

@FelixLabelle FelixLabelle added Needs Triage Issue requires triage New Feature labels Aug 30, 2024
@glemaitre
Copy link
Member

I think that there is a lot of discussion in #14605 that makes this change at least controversial. I would need to read the entire discussion again to make sure what was the issue raised against adding a random state or seed to the __init__.

@glemaitre glemaitre added Needs Decision Requires decision and removed Needs Triage Issue requires triage labels Aug 31, 2024
@FelixLabelle
Copy link
Author
FelixLabelle commented Sep 1, 2024

What confuses me is that AFAICT those changes were merged and then reverted within 3 weeks without discussion.

Looking at the file's commit history the seed gets merged here, but removed the next commit during a renaming

What's odd is that the seed variable doesn't even appear during the renaming, it looks like an old file was used. Maybe I missed or misunderstood something, but is it possible removing the seed was accidental? If so, could the seed be added back like was done by #14605?

In the PR's discussion people seemed happy with the solution settled upon (variable called seed in transform that defaults to 0) and it would work for me

@glemaitre
Copy link
Member

Looking at the file's commit history the seed gets merged here, but removed the next commit during a renaming

As far I see, the seed parameter is added to the helper function of the cython routine and not the public class (i.e. FeatureHasher and HashingVectorizer).

If so, could the seed be added back like was done by #14605?

I went through the discussion and I could spot the following points. For a consistent API, we would prefer to have random_state. However, the behaviour is not in line with other estimators' random_state and it is surprising. So seed is a better name but we might not want to introduce it because it makes it inconsistent.

So it lead to discussion around fixing (or not) the behaviour of random_state with random number generator that is tightly related to scikit-learn/enhancement_proposals#88.

So the middle ground in #14605 was to introduce seed into the cython function only.

@FelixLabelle
Copy link
Author

My bad, I misunderstood the original commit.

For the discussion about naming the variable, is there a similar case in the Sklearn API that serve as a reference? I'm not sure how to help or who could help otherwise

It would be nice to have a long term solution and not just modify sklearn files locally to have access to the seed

@glemaitre
Copy link
Member

For the discussion about naming the variable, is there a similar case in the Sklearn API that serve as a reference? I'm not sure how to help or who could help otherwise

Basically this is a particular case which make it difficult. In this case, we need to come with a consensus among the 6284 core developer. I'll bring this topic on the next developer meeting at the end of the month.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants
0