Use a stable sorting algorithm when selecting the `max_features` in TfidfVectorizer #21446

mesejo · 2021-10-24T12:28:26Z

Describe the workflow you want to enable

Currently, when selecting max_features in TfidfVectorizer, the algorithm (line 1172) uses numpy.argsort with its default kind, quicksort.

Given that quicksort is not stable, this leads to inconsistent behavior across different datasets, for example:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['AAA', 'AAB', 'AAC', 'AAD', 'AAE', 'AAF']
vectorizer = TfidfVectorizer(smooth_idf=False, max_features=5)
vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())   # ['aaa' 'aab' 'aac' 'aad' 'aae']

As all the features have the same frequency, they are returned in lexicographic order. This does not happen for the following dataset:

corpus = ['AAA', 'AAB', 'AAC', 'AAD', 'AAE', 'AAF', 'AAG', 'ABA', 'ABB', 'ABC', 'ABD', 'ACA', 'ACB', 'ADA', 'AEA',
          'AFA', 'BAA']
vectorizer = TfidfVectorizer(smooth_idf=False, max_features=5)
vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # ['aaa' 'aca' 'acb' 'ada' 'aea']

I would expect the result to be the same as in the first example, but it is not.
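The tie-breaking difference can be reproduced with NumPy alone. The sketch below assumes 17 features with identical scores, mirroring the second corpus; the `scores` array is illustrative and not taken from scikit-learn's code. Only the stable result is guaranteed, since what the default sort returns for ties can depend on the NumPy version and platform.

```python
import numpy as np

# 17 tied scores, mirroring the 17 single-term documents in the second corpus.
scores = np.ones(17)

# Default kind ("quicksort"): the order among tied elements is unspecified.
default_top = (-scores).argsort()[:5]

# Stable kind: tied elements keep their original (here lexicographic) positions.
stable_top = (-scores).argsort(kind="stable")[:5]

print("default:", default_top)
print("stable: ", stable_top)
```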

Describe your proposed solution

Since NumPy 1.15.0, numpy.argsort accepts kind="stable", so simply changing the line to:

mask_inds = (-tfs[mask]).argsort(kind="stable")[:limit]

should be enough to solve the problem.
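As a rough illustration of what the proposed change achieves, here is a minimal sketch of a top-k selection with a stable sort. The helper `select_top_features` and its arguments are hypothetical and much simpler than the actual scikit-learn code; it assumes the feature names are already in lexicographic order.

```python
import numpy as np

def select_top_features(feature_names, term_freqs, limit):
    """Hypothetical helper: keep the `limit` most frequent features."""
    term_freqs = np.asarray(term_freqs)
    # Negate so that argsort ranks the highest frequency first; the stable
    # kind keeps tied features in their original order.
    top = (-term_freqs).argsort(kind="stable")[:limit]
    # Restore vocabulary order among the selected features.
    top.sort()
    return [feature_names[i] for i in top]

names = ["aaa", "aab", "aac", "aad", "aae", "aaf"]
print(select_top_features(names, [1, 1, 1, 1, 1, 1], limit=5))
# ['aaa', 'aab', 'aac', 'aad', 'aae']
```

With a stable sort, the outcome for tied frequencies depends only on the input order, not on how many features there are.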

Describe alternatives you've considered, if relevant

As an alternative, use kind="mergesort", which is also stable:

mask_inds = (-tfs[mask]).argsort(kind="mergesort")[:limit]
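Because a stable sort determines a unique permutation for a given input, both kinds should select the same features. A quick check on illustrative data with many ties (the random scores are an assumption made for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
# Integer scores drawn from a small range, so ties are frequent.
scores = rng.integers(0, 3, size=1000)

by_stable = (-scores).argsort(kind="stable")
by_mergesort = (-scores).argsort(kind="mergesort")

# Any two stable sorts must agree on the full permutation.
print(np.array_equal(by_stable, by_mergesort))
```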

Additional context

sklearn.__version__
'1.0'

ogrisel · 2021-10-26T07:31:34Z

I suppose the performance impact is negligible compared to other operations when calling fit or transform on such a corpus. So this looks good to me.

The only problem is that at the moment our sklearn/_min_dependencies.py states:

    NUMPY_MIN_VERSION = "1.14.6"

I suppose we will be able to bump this for the scikit-learn 1.1 release but I am not 100% sure.

ogrisel · 2021-10-26T07:35:00Z

According to https://numpy.org/neps/nep-0029-deprecation_policy.html#support-table we are already way more conservative than the recommended support.

thomasjpfan · 2021-10-26T15:00:12Z

For backward compatibility, I think we still need to add a parameter to TfidfVectorizer to control the sorting behavior.

ogrisel · 2021-10-26T15:50:40Z

> For backward compatibility, I think we still need to add a parameter to TfidfVectorizer to control the sorting behavior.

That's a good point. Otherwise it will cause very nasty bugs in existing pipelines.

mesejo · 2021-10-27T23:03:20Z

Hi. Thanks for the feedback. I'm interested in working on this.

> For backward compatibility, I think we still need to add a parameter to TfidfVectorizer to control the sorting behavior.

Should it be a boolean parameter?

ogrisel · 2021-10-28T12:50:09Z

The current code is actually not very intuitive and would deserve some refactoring to make it simpler (not sure how though):

            # in CountVectorizer.fit_transform

            if max_features is not None:
                X = self._sort_features(X, vocabulary)
            X, self.stop_words_ = self._limit_features(
                X, vocabulary, max_doc_count, min_doc_count, max_features
            )
            if max_features is None:
                X = self._sort_features(X, vocabulary)

But then preserving backward compat would be quite complex because if limit is None (max_features=None) we do not apply the argsort-based operation within the call to _limit_features, but we still sort the features later using Python's sorted in the call to _sort_features. Note that sorted is guaranteed to be stable.
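The stability guarantee of Python's sorted can be seen in a small example. The (term, count) pairs below are made up for illustration and are unrelated to scikit-learn's internal data structures.

```python
# (term, count) pairs, already in lexicographic order by term.
pairs = [("aaa", 1), ("aab", 2), ("aac", 1), ("aad", 2), ("aae", 1)]

# Sort by descending count only; tied pairs keep their input order.
ranked = sorted(pairs, key=lambda pair: -pair[1])
print(ranked)
# [('aab', 2), ('aad', 2), ('aaa', 1), ('aac', 1), ('aae', 1)]
```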

Assuming we can refactor/simplify while making it possible to preserve the backward compat, or do the minimal change in argsort without refactoring, we could add a constructor param such as:

use_stable_feature="warn"

that would raise a FutureWarning to make it explicit that in 1.1 the feature sorting is not stable and that in the future (1.3) the feature sorting will be changed to be stable by default. Setting use_stable_feature=True makes it possible to enable the future sorting behavior and silence the warning. Then in 1.3 we would deprecate the use_stable_feature parameter itself and always use stable sorting.

But I wouldn't really see the point of allowing use_stable_feature=False to keep the current behavior without the warning since we want to always be stable by default in the future.
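A minimal sketch of how such a transition parameter could behave, under the assumptions stated in this comment. The function `resolve_sort_kind` is hypothetical and is not part of scikit-learn; the warning text is also invented for illustration.

```python
import warnings

def resolve_sort_kind(use_stable_feature="warn"):
    """Hypothetical helper mapping the proposed parameter to an argsort kind."""
    if use_stable_feature == "warn":
        # Default during the transition: keep the old behavior but warn.
        warnings.warn(
            "Feature sorting is currently not stable and will become stable "
            "in a future release. Pass use_stable_feature=True to opt in now "
            "and silence this warning.",
            FutureWarning,
        )
        return "quicksort"
    if use_stable_feature is True:
        return "stable"
    # False is deliberately unsupported, following the reasoning above.
    raise ValueError("use_stable_feature must be 'warn' or True")

print(resolve_sort_kind(True))
```

The returned string would then be passed as the kind argument of argsort.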

lorentzenchr · 2021-11-07T17:40:06Z

Could we consider this as a bug fix (and skip introducing an option that we then deprecate in order to finally remove it)?

ogrisel · 2021-12-13T17:47:29Z

> Could we consider this as a bug fix (and skip introducing an option that we then deprecate in order to finally remove it)?

That's an option. But I am not sure what is the most disruptive: the current "bug", the bug fix without deprecation warning, or the complex deprecation procedure.

In the long term we obviously need to fix the issue, no question. It's just the way to get there...

lorentzenchr · 2021-12-13T18:00:24Z

@ogrisel And what's your preferred option?

ogrisel · 2022-02-26T22:00:37Z

I think I prefer to be explicit and introduce a new option to enable the switch to the new behavior explicitly with a future warning otherwise.

Also, once #22617 is merged we will be able to use .argsort(kind="stable").

lucyleeow · 2024-05-08T04:39:03Z

> I think I prefer to be explicit and introduce a new option to enable the switch to the new behavior explicitly with a future warning otherwise.

@ogrisel do you intend the new option to be deprecated in 2 cycles or stay 'forever'?

mesejo added the New Feature label Oct 24, 2021

ogrisel mentioned this issue Oct 26, 2021

RFC bump up dependencies (numpy, scipy and python) minimum versions for scikit-learn 1.1 #21460

Closed

thomasjpfan moved this to Delegate📪 in Quansight's scikit-learn Project Board Apr 14, 2022

thomasjpfan added this to Quansight's scikit-learn Project Board Apr 14, 2022

thomasjpfan moved this from Delegate📪 to Todo📬 in Quansight's scikit-learn Project Board Apr 28, 2022

cmarmo added the module:feature_extraction label Sep 14, 2022

lucyleeow moved this from Todo📬 to In Progress🏗 in Quansight's scikit-learn Project Board May 9, 2024
