8000 Use a stable sorting algorithm when selecting the `max_features` in TfidfVectorizer · Issue #21446 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

Use a stable sorting algorithm when selecting the max_features in TfidfVectorizer #21446

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
mesejo opened this issue Oct 24, 2021 · 11 comments

Comments

@mesejo
Copy link
mesejo commented Oct 24, 2021

Describe the workflow you want to enable

Currently when selecting max_features in TfidfVectorizers, the algorithm (line 1172) uses numpy.argsort with the quicksort enabled (the default argument).

Given that quicksort is not stable, this leads to inconsistent behavior across different datasets, for example:

corpus = ['AAA', 'AAB', 'AAC', 'AAD', 'AAE', 'AAF']
vectorizer = TfidfVectorizer(smooth_idf=False, max_features=5)
vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())   # ['aaa' 'aab' 'aac' 'aad' 'aae']

As all the features have the same frequency they are return in lexicographic order, this does not happens for the following dataset:

corpus = ['AAA', 'AAB', 'AAC', 'AAD', 'AAE', 'AAF', 'AAG', 'ABA', 'ABB', 'ABC', 'ABD', 'ACA', 'ACB', 'ADA', 'AEA',
          'AFA', 'BAA']
vectorizer = TfidfVectorizer(smooth_idf=False, max_features=5)
vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # ['aaa' 'aca' 'acb' 'ada' 'aea']

I would expect the result to be the same as the first example but is not.

Describe your proposed solution

Since 1.15.0 the option stable was added to numpy.argsort, so simply changing the line to:

mask_inds = (-tfs[mask]).argsort(kind="stable")[:limit]

should be enough to solve the problem.

Describe alternatives you've considered, if relevant

As an alternative use the option mergesort that is stable

mask_inds = (-tfs[mask]).argsort(kind="mergesort")[:limit]

Additional context

sklearn.__version__
'1.0'
@ogrisel
Copy link
Member
ogrisel commented Oct 26, 2021

I suppose the performance impact is negligible compared to other operations when calling fit or transform on a such a corpus. So this looks good to me.

The only problem is that at the moment in our sklearn/_min_dependencies.py states:

    NUMPY_MIN_VERSION = "1.14.6"

I suppose we will be able to bump this for the scikit-learn 1.1 release but I am not 100% sure.

@ogrisel
Copy link
Member
ogrisel commented Oct 26, 2021

According to https://numpy.org/neps/nep-0029-deprecation_policy.html#support-table we are already way more conservative than the recommended support.

@thomasjpfan
Copy link
Member

For backward compatibility, I think we still need to add a parameter to TfidfVectorizer to control the sorting behavior.

@ogrisel
Copy link
Member
ogrisel commented Oct 26, 2021

For backward compatibility, I think we still need to add a parameter to TfidfVectorizer to control the sorting behavior.

That's a good point. Otherwise it will cause very nasty bugs in existing pipelines.

@mesejo
Copy link
Author
mesejo commented Oct 27, 2021

Hi. Thanks for the feedback. I'm interested on working on this.

For backward compatibility, I think we still need to add a parameter to TfidfVectorizer to control the sorting behavior.

It should a boolean parameter?

@ogrisel
Copy link
Member
ogrisel commented Oct 28, 2021

The current code is actually not very intuitive and would deserve some refactoring to make it simpler (not sure how though):

            # in CountVectorizer.fit_transform

            if max_features is not None:
                X = self._sort_features(X, vocabulary)
            X, self.stop_words_ = self._limit_features(
                X, vocabulary, max_doc_count, min_doc_count, max_features
            )
            if max_features is None:
                X = self._sort_features(X, vocabulary)

But then preserving backward compat would be quite complex because if limit is None (max_features=None) we do not apply the argsort-based operation within the call to_limit_features but we still sort the features later using Python's sorted in the call to _sort_features. Note that sorted is guaranteed to be stable.

Assuming we can refactor/simplify while making it possible to preserve the backward compat, or do the minimal change in argsort without refactoring we could have add a constructor param such as:

use_stable_feature="warn"

that would raise a FutureWarning to make it explicit that in 1.1 the feature sorting is not stable and in the future (1.3) the feature sorting will be changed to be stable by default. Setting use_stable_feature=True makes it possible to enable the future sorting behavior and silence the warning. Then in 1.3 we would deprecate the use_stable_feature parameter itself and always use stable storing.

But I wouldn't really see the point of enabling stable_feature_sort=False to keep the current behavior without the warning since we want to always be stable by default in the future.

@lorentzenchr
Copy link
Member

Could we consider this as a bug fix (and skip introducing an option that we then deprecate in order to finally remove it)?

@ogrisel
Copy link
Member
ogrisel commented Dec 13, 2021

Could we consider this as a bug fix (and skip introducing an option that we then deprecate in order to finally remove it)?

That's an option. But I am not sure was is the most disruptive, the current "bug", the bug fix without deprecation warning or the complex deprecation procedure.

On the long term we obviously need to fix the issue, no question. It's just the way to get there...

@lorentzenchr
Copy link
Member

@ogrisel And what's your preferred option?

@ogrisel
Copy link
Member
ogrisel commented Feb 26, 2022

I think I prefer to be explicit and introduce a new option to enable the switch to the new behavior explicitly with a future warning otherwise.

Also, once #22617 is merged we will be able to get use .argsort(kind="stable").

@lucyleeow
Copy link
Member

I think I prefer to be explicit and introduce a new option to enable the switch to the new behavior explicitly with a future warning otherwise.

@ogrisel do you intend the new option to be deprecated in 2 cycles or stay 'forvever'?

@lucyleeow lucyleeow moved this from Todo📬 to In Progress🏗 in Quansight's scikit-learn Project Board May 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants
0