-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
Use a stable sorting algorithm when selecting the max_features
in TfidfVectorizer
#21446
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I suppose the performance impact is negligible compared to other operations when calling fit or transform on a such a corpus. So this looks good to me. The only problem is that at the moment in our NUMPY_MIN_VERSION = "1.14.6" I suppose we will be able to bump this for the scikit-learn 1.1 release but I am not 100% sure. |
According to https://numpy.org/neps/nep-0029-deprecation_policy.html#support-table we are already way more conservative than the recommended support. |
For backward compatibility, I think we still need to add a parameter to |
That's a good point. Otherwise it will cause very nasty bugs in existing pipelines. |
Hi. Thanks for the feedback. I'm interested on working on this.
It should a boolean parameter? |
The current code is actually not very intuitive and would deserve some refactoring to make it simpler (not sure how though): # in CountVectorizer.fit_transform
if max_features is not None:
X = self._sort_features(X, vocabulary)
X, self.stop_words_ = self._limit_features(
X, vocabulary, max_doc_count, min_doc_count, max_features
)
if max_features is None:
X = self._sort_features(X, vocabulary) But then preserving backward compat would be quite complex because if Assuming we can refactor/simplify while making it possible to preserve the backward compat, or do the minimal change in use_stable_feature="warn" that would raise a But I wouldn't really see the point of enabling |
Could we consider this as a bug fix (and skip introducing an option that we then deprecate in order to finally remove it)? |
That's an option. But I am not sure was is the most disruptive, the current "bug", the bug fix without deprecation warning or the complex deprecation procedure. On the long term we obviously need to fix the issue, no question. It's just the way to get there... |
@ogrisel And what's your preferred option? |
I think I prefer to be explicit and introduce a new option to enable the switch to the new behavior explicitly with a future warning otherwise. Also, once #22617 is merged we will be able to get use |
@ogrisel do you intend the new option to be deprecated in 2 cycles or stay 'forvever'? |
Describe the workflow you want to enable
Currently when selecting
max_features
in TfidfVectorizers, the algorithm (line 1172) uses numpy.argsort with thequicksort
enabled (the default argument).Given that quicksort is not stable, this leads to inconsistent behavior across different datasets, for example:
As all the features have the same frequency they are return in lexicographic order, this does not happens for the following dataset:
I would expect the result to be the same as the first example but is not.
Describe your proposed solution
Since
1.15.0
the option stable was added to numpy.argsort, so simply changing the line to:should be enough to solve the problem.
Describe alternatives you've considered, if relevant
As an alternative use the option
mergesort
that is stableAdditional context
The text was updated successfully, but these errors were encountered: