Expose Seed in FeatureHasher and HashingVectorizer #29748

FelixLabelle · 2024-08-30T10:51:13Z

Describe the workflow you want to enable

Varying the seed of the FeatureHasher allows the user to control what inputs collide. This can allow for a better feature space either through experimentation (as a hyperparameter) or explicitly searching for a space that minimizes "bad" collisions

Describe your proposed solution

Add an optional "seed" parameter to the init of FeatureHasher which defaults to 0 (the current behavior, see the underlying hashing function). The seed would be thread through to _hashing_transform

Ditto for HashingVectorizer. Only difference here is that the seed would be passed to the FeatureHasher instance

This seems straightforward so I can implement this solution if it makes sense

Describe alternatives you've considered, if relevant

No response

Additional context

No response

glemaitre · 2024-08-31T22:01:03Z

I think that there is a lot of discussion in #14605 that makes this change at least controversial. I would need to read the entire discussion again to make sure what was the issue raised against adding a random state or seed to the __init__.

FelixLabelle · 2024-09-01T10:42:35Z

What confuses me is that AFAICT those changes were merged and then reverted within 3 weeks without discussion.

Looking at the file's commit history the seed gets merged here, but removed the next commit during a renaming

What's odd is that the seed variable doesn't even appear during the renaming, it looks like an old file was used. Maybe I missed or misunderstood something, but is it possible removing the seed was accidental? If so, could the seed be added back like was done by #14605?

In the PR's discussion people seemed happy with the solution settled upon (variable called seed in transform that defaults to 0) and it would work for me

glemaitre · 2024-09-02T11:47:56Z

Looking at the file's commit history the seed gets merged here, but removed the next commit during a renaming

As far I see, the seed parameter is added to the helper function of the cython routine and not the public class (i.e. FeatureHasher and HashingVectorizer).

If so, could the seed be added back like was done by #14605?

I went through the discussion and I could spot the following points. For a consistent API, we would prefer to have random_state. However, the behaviour is not in line with other estimators' random_state and it is surprising. So seed is a better name but we might not want to introduce it because it makes it inconsistent.

So it lead to discussion around fixing (or not) the behaviour of random_state with random number generator that is tightly related to scikit-learn/enhancement_proposals#88.

So the middle ground in #14605 was to introduce seed into the cython function only.

FelixLabelle · 2024-09-04T20:14:34Z

My bad, I misunderstood the original commit.

For the discussion about naming the variable, is there a similar case in the Sklearn API that serve as a reference? I'm not sure how to help or who could help otherwise

It would be nice to have a long term solution and not just modify sklearn files locally to have access to the seed

glemaitre · 2024-09-04T20:17:04Z

For the discussion about naming the variable, is there a similar case in the Sklearn API that serve as a reference? I'm not sure how to help or who could help otherwise

Basically this is a particular case which make it difficult. In this case, we need to come with a consensus among the 6284 core developer. I'll bring this topic on the next developer meeting at the end of the month.

FelixLabelle added Needs Triage Issue requires triage New Feature labels Aug 30, 2024

glemaitre added Needs Decision Requires decision and removed Needs Triage Issue requires triage labels Aug 31, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Expose Seed in FeatureHasher and HashingVectorizer #29748

Expose Seed in FeatureHasher and HashingVectorizer #29748

Expose Seed in FeatureHasher and HashingVectorizer #29748

Expose Seed in FeatureHasher and HashingVectorizer #29748

Comments

Describe the workflow you want to enable

Describe your proposed solution

Describe alternatives you've considered, if relevant

Additional context