
Added option to use standard idf term for TfidfTransformer and TfidfVectorizer #14748


Closed
wants to merge 8 commits

Conversation

@mitchelldehaven commented Aug 24, 2019

The TfidfTransformer and TfidfVectorizer use an idf that is not the standard textbook definition. I added a parameter to both as an option to use the standard definition, while leaving the previously defined idf term as the default. I created some test cases to make sure it works but I didn't include them in tests/test_text.py. Should I include them there?
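For reference, the two definitions at issue (with `smooth_idf=False`; these match the formulas given in the scikit-learn documentation):

```latex
\mathrm{idf}(t) = \log \frac{n}{\mathrm{df}(t)}      % standard textbook definition
\mathrm{idf}(t) = \log \frac{n}{\mathrm{df}(t)} + 1  % scikit-learn's current definition
```

where n is the number of documents and df(t) is the number of documents containing term t.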

@amueller (Member)

I'm +.5 on this given how much conversation this has sparked in the past

@mitchelldehaven (Author)

> I'm +.5 on this given how much conversation this has sparked in the past

Have there been issues with adding the standard idf definition in the past? Or was the issue that the option wasn't there?

@amueller (Member)

No, there were many complaints that we don't implement the standard definition, after which we changed the docs to be very explicit about it.

@mitchelldehaven (Author) commented Aug 24, 2019

I see. I was just working with the TfidfVectorizer and our test cases weren't behaving as expected. Then, looking at the TfidfTransformer, I found the explicit description of the idf term. The workaround is to use CountVectorizer, then TfidfTransformer, and apply (idf_ - 1) to the transformed data. I figured having the option to use the standard idf instead of the workaround would be a better solution.
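For context, a minimal sketch of what that workaround computes, here done by hand from raw counts (the toy corpus and variable names are illustrative, not code from the PR):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

docs = ["the cat sat", "the dog sat", "the dog barked"]  # toy corpus

counts = CountVectorizer().fit_transform(docs)  # raw term counts (sparse)
n_docs = counts.shape[0]
df = np.bincount(counts.indices, minlength=counts.shape[1])  # document frequencies
idf = np.log(n_docs / df)  # textbook idf, i.e. scikit-learn's idf_ minus the +1 offset
tfidf = normalize(counts.multiply(idf).tocsr(), norm="l2")  # tf * idf, l2-normalized

# Note: "the" occurs in every document, so its idf is 0 and it is dropped entirely;
# this is exactly the behavior scikit-learn's +1 offset was meant to avoid.
```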

@mitchelldehaven (Author)

@amueller I am a bit unfamiliar with how all of this works; should I close the PR due to inactivity?

@rth (Member) commented Sep 10, 2019

I also have mixed feelings about this. On one hand, it would be great to have the standard IDF weighting as an option; on the other, I think that TfidfVectorizer already has way too many options, and adding one more is not helping.

Overall I might be +1 to adding it only to TfidfTransformer.

> the effect of adding "1" to the idf term in the equation above is that terms with zero idf, i.e., terms that occur in all documents in a training set, will not be entirely ignored

Note that this is not the only effect. Adding that +1 generally reduces the difference between frequent and less frequent terms. For instance, in the standard formulation a term with df=0.6 will have about 2x higher weight than one with df=0.8, while if we add that +1, the difference is only about 20%.
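A quick numeric check of that claim (toy values):

```python
import numpy as np

df = np.array([0.6, 0.8])    # document frequencies of two terms
standard = np.log(1 / df)    # [0.51, 0.22]
offset = standard + 1        # [1.51, 1.22]

print(standard[0] / standard[1])  # ~2.3: the rarer term weighs roughly 2x more
print(offset[0] / offset[1])      # ~1.2: only ~20% more once the +1 is added
```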

@amueller (Member)

@mitchelldehaven Sorry, we can be quite slow. Please be patient.

@rth I find having mismatched options between the two classes more confusing.

@rth (Member) commented Sep 10, 2019

> I find having mismatched options between the two classes more confusing.

OK, I guess we could merge this then if no one is opposed. cc @jnothman @qinhanmin2014.

@jnothman (Member) commented Sep 10, 2019 via email

I'm also happy to slowly deprecate TfidfVectorizer because it provides no benefit over a pipeline and creates much confusion.

@rth (Member) commented Sep 11, 2019

> I'm also happy to slowly deprecate TfidfVectorizer because it provides no benefit over a pipeline and creates much confusion.

+1 for deprecating it as well. Opened an issue about it: #14951

In which case, again if there are no objections, we can change only TfidfTransformer in this PR (or also add it to TfidfVectorizer pending removal) -- it doesn't really matter.

@mitchelldehaven (Author)

Should I modify the commit to apply the changes only to TfidfTransformer, or leave it as is?

@qinhanmin2014 (Member)

Now we have smooth_idf, sublinear_tf and standard_idf? Hmm, I think it will be a little bit difficult for users to understand the meaning of these parameters.
I'm fine with this PR.

@rth mentioned this pull request Oct 1, 2019
@rth mentioned this pull request Oct 25, 2019
@rth added this to the 0.23 milestone Apr 17, 2020
@rth (Member) left a comment


OK, let's merge as is. LGTM.

@mitchelldehaven Would you mind resolving the conflicts?
Also, please add an entry to the change log at doc/whats_new/v0.23.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself with :user:.

> Now we have smooth_idf, sublinear_tf and standard_idf? Hmm, I think it will be a little bit difficult for users to understand the meaning of these parameters.

I agree the previous names are not ideal, but it's unclear what we can do about that in this PR. We are also unlikely to deprecate TfidfVectorizer given its popularity (at least in the near future).

There seem to be enough users bothered by the fact that there is currently no such option (or by the defaults, though it would be difficult to change those now): https://twitter.com/BayesForDays/status/1250798944688525315

@adrinjalali (Member) left a comment


Thanks @mitchelldehaven for the PR. Looks neat.

My comments are just nitpicks about the docstring's clarity. I think you could use the :math: directive to better document what the difference is, and have it render better in the docs. If it's not clear how to do that, please let us know and we can help.
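A hypothetical sketch of what that could look like in the docstring (the parameter name and wording here are placeholders, not the PR's final text):

```python
class TfidfTransformer:
    r"""...

    Parameters
    ----------
    standard_idf : bool, default=False
        If True, use the textbook definition
        :math:`\mathrm{idf}(t) = \log\frac{n}{\mathrm{df}(t)}`
        instead of the default
        :math:`\mathrm{idf}(t) = \log\frac{n}{\mathrm{df}(t)} + 1`.
    """
```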

@jnothman (Member)

I'm happy with this, but wondering if we should reuse the use_idf param for this rather than create a new one. For example use_idf='standard' vs use_idf='nonzero' or something?

@rth (Member) commented Apr 27, 2020

> but wondering if we should reuse the use_idf param for this rather than create a new one. For example use_idf='standard' vs use_idf='nonzero' or something?

+1, that would be even better.

@mitchelldehaven (Author)

> I'm happy with this, but wondering if we should reuse the use_idf param for this rather than create a new one. For example use_idf='standard' vs use_idf='nonzero' or something?

So for this, use_idf='standard' is the textbook definition of the idf term and 'non-zero' is the current default behavior?

@rth (Member) commented Apr 27, 2020

> So for this, use_idf='standard' is the textbook definition of the idf term and 'non-zero' is the current default behavior?

Not too keen on these names, but I have a hard time finding better ones. Generally, I think it would make sense to have one main parameter for term weighting, one for document weighting (restricted to IDF variants), and one for normalization, as is classically done. Though smooth_idf is a bit harder to fit there.

So yes, I think we could support the following use_idf values (sketched below):

  • "non-zero" - current version
  • True (default) - alias for "non-zero"
  • "standard"

@adrinjalali (Member)

removing from the milestone.

@adrinjalali removed this from the 0.23 milestone May 4, 2020
@louisguitton (Contributor)

As discussed just now with @rth and @ogrisel (as part of the EuroPython 2020 sprint), in order to continue work on this PR we need to:

  1. fix conflicts here and address the existing comments, including discussing the naming of the new parameter: there was a consensus among us for offset_idf, which has to be compared to the approach proposed in this PR so far
  2. do a small bibliographical review to look for the different TF and IDF formulas, along with their canonical names. For that bibliography, textbooks should be preferred over Wikipedia, as the latter could have been influenced by sklearn anyway

@louisguitton (Contributor) commented Jul 26, 2020

Bibliography

  • [1] R. Baeza-Yates and B. Ribeiro-Neto (2011). Modern Information Retrieval. Addison Wesley, pp. 68-74.
    PDF not available, but slides for the chapter of interest can be found here

    • log is taken in base 2
  • [2] C.D. Manning, P. Raghavan and H. Schütze (2008). Introduction to Information Retrieval. Cambridge University Press, pp. 118-120, 126-128, 226-227 (found 2 more relevant bits)
    PDF can be found here

    • introduces SMART notation
    • introduces a probabilistic interpretation of the idf term (using the formula log(N/n_i))
  • [3] Wikipedia: tf-idf

  • [4] Wikipedia: SMART Info Retrieval and FreeDiscovery

  • [5] Mining of Massive Datasets, pp. 9-10 (see the scans below; ignore the handwritten annotations, as their source is probably Wikipedia)

    • TF term uses the so-called max-normalisation mentioned in [2], with a=1
  • [6] Apache Lucene (behind Solr and Elasticsearch) defines TF-IDF in a different way

    • TF term uses a square root that is not mentioned in SMART
    • IDF term uses the +1 offset like sklearn (the only other place so far) and applies smoothing only to the denominator
  • [7] Gensim defines TF-IDF in a more general way

    • you can plug in your own functions for TF and for IDF, which cuts short any debate: the library ships sane defaults and also lets the user customise; this is a pretty attractive idea
    • their default for TF is raw frequency
    • their defaults for IDF are offset=0, log_base=2, no smoothing

Mining of Massive Datasets (Stanford textbook): [scans of pp. 9-10 not reproduced]

@ogrisel (Member) commented Jul 26, 2020

Interestingly, we use log (base e) instead of log2 (base 2). It seems to be more common to use base 2, but depending on the normalization that might not have a significant impact.
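A quick sketch of why the normalization matters here (toy values; my illustration, not from the thread): a pure change of log base rescales every idf value by the same constant, which the default l2 normalization cancels out, but the +1 offset breaks that proportionality.

```python
import numpy as np
from sklearn.preprocessing import normalize

tf = np.array([[2.0, 1.0, 0.0], [0.0, 1.0, 3.0]])    # toy term frequencies
ratios = np.array([3.0, 1.5, 3.0])                   # n_docs / df per term
idf_ln, idf_log2 = np.log(ratios), np.log2(ratios)   # log2(x) = ln(x) / ln(2)

# Without the offset, the constant factor 1/ln(2) cancels under l2 normalization:
print(np.allclose(normalize(tf * idf_ln), normalize(tf * idf_log2)))  # True

# With the +1 offset, the two bases are no longer proportional, so results differ:
print(np.allclose(normalize(tf * (idf_ln + 1)), normalize(tf * (idf_log2 + 1))))  # False
```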

@louisguitton (Contributor) commented Jul 26, 2020

I've updated the bibliography and my conclusions are as follows:

  • For TF, let's not add any more customisation; there are already binary and sublinear, and although others are mentioned (like max-normalisation or even square root (!)), it is not clear that they have a use case in which they would beat frequency (which seems to be recommended for short docs) or sublinear (which seems to be recommended for long docs)
  • For IDF:
    • for the offset: only Lucene has offset = 1 too. Gensim allows setting the offset as a parameter, while defaulting to 0. The rest of the literature I've looked at today doesn't mention an offset
    • for the log base: take the log in base 2, or at least introduce a log_base parameter that could default to np.e in order not to break the defaults. In my opinion this could be tackled in a separate PR.
    • for the smoothing: although some smoothings are applied only to the denominator, I would keep the smoothing as it is (i.e., applied to both numerator and denominator)
  • In general, we should consider supporting a generic case like gensim does (with arbitrary TF and IDF functions) so that in the future we don't have to support more flags; see the sketch below. In my opinion, although it leaves more work on the user side, it would make the API of TfidfTransformer cleaner.
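For comparison, a minimal sketch of gensim's pluggable weighting via its documented wlocal/wglobal hooks (the toy corpus is illustrative; the lambdas reproduce gensim's defaults, i.e. raw TF and log2(N/df) with no offset or smoothing):

```python
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import TfidfModel

texts = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "dog", "barked"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

model = TfidfModel(
    corpus,
    wlocal=lambda tf: tf,                             # raw term frequency
    wglobal=lambda df, n_docs: np.log2(n_docs / df),  # textbook idf, base 2, offset 0
)
print(model[corpus[0]])  # weighted bag-of-words for the first document
```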

@mitchelldehaven (Author)

Sorry, life kind of got in the way. If someone else wants to continue this they can, or I can; either way. At the time I was reviewing quite a bit of TF-IDF material and noticed the inconsistency between the implementation and what I saw as the "standard" or "textbook" implementation. It looks like @louisguitton has a much better understanding of the various implementations than I ever did.

@ogrisel (Member) commented Jul 27, 2020

> for the offset: only Lucene has offset = 1 too.

That's a big one. Lucene is the engine behind Elasticsearch, which is probably one of the most widely deployed full-text search components in corporate and online services. However, information retrieval is not exactly the same as machine learning and text mining.

@ogrisel (Member) commented Jul 27, 2020

> In general, we should consider supporting a generic case like gensim does (with arbitrary TF and IDF functions) so that in the future we don't have to support more flags. In my opinion, although it leaves more work on the user side, it would make the API of TfidfTransformer cleaner.

Honestly, I am not sure this is necessary. Just adding the option to set the IDF offset to 0 instead of 1, and updating the documentation to explain how to configure TfidfTransformer / TfidfVectorizer to recover some common formulations, would be enough in my opinion.

@cmarmo removed the help wanted label Sep 8, 2020
@cmarmo (Contributor) commented Sep 8, 2020

@louisguitton, if you are still interested in working on this, feel free to open a draft pull request directly on scikit-learn. This will make it clear that you are taking care of the issue and bring some attention from the core devs. Thanks.

Base automatically changed from master to main January 22, 2021 10:51
@mitchelldehaven (Author) commented Dec 15, 2021

I will be picking this up again. To summarize what needs to be done (afaik):

  • Fix merge conflicts.
  • Improve doc strings.
  • Reuse use_idf instead of standard_idf:
    • use_idf = "standard" is the standard idf definition being added in this PR.
    • use_idf = "nonzero" is current behavior.
    • use_idf = True is current behavior.

@rth Does ^ look correct?

@mitchelldehaven (Author)

@rth @amueller See the last comment above. I just want to make sure the above is what is wanted from the PR before I start working on it.

@adrinjalali (Member)

I think your suggestion looks quite okay; you can go ahead and implement it :)

@mitchelldehaven (Author)

Sorry, I did take a look at this recently and had a bit of an issue fixing the merge conflicts via a rebase against the main branch. Since the number of commits since I started this PR is quite massive, the number of conflicts makes it quite a task. I generally use rebase, although I haven't had to do it against quite so many commits before. Would a git merge make more sense here to resolve the conflicts?

@adrinjalali (Member)

Yes, we usually recommend a merge instead of a rebase.

@adrinjalali (Member)

Closing due to no response. Happy to reopen, or to have a new PR, if @mitchelldehaven or somebody else picks this up.
