
Added option to use standard idf term for TfidfTransformer and TfidfVectorizer #14748


Closed
wants to merge 8 commits

Conversation

@mitchelldehaven commented Aug 24, 2019

The TfidfTransformer and TfidfVectorizer use an idf that is not the standard textbook definition. I added a parameter to both as an option to use the standard definition, while leaving the previously defined idf term as the default. I created some test cases to make sure it works but I didn't include them in tests/test_text.py. Should I include them there?
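For reference, the two definitions at issue (with `smooth_idf=False`; these match the formulas given in the scikit-learn documentation):

```latex
\mathrm{idf}(t) = \log \frac{n}{\mathrm{df}(t)}      % standard textbook definition
\mathrm{idf}(t) = \log \frac{n}{\mathrm{df}(t)} + 1  % scikit-learn's current definition
```

where n is the number of documents and df(t) is the number of documents containing term t.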

@amueller (Member)

I'm +.5 on this given how much conversation this has sparked in the past

@mitchelldehaven (Author)

> I'm +.5 on this given how much conversation this has sparked in the past

Have there been issues with adding the standard idf definition in the past? Or was the issue that the option wasn't there?

@amueller (Member)

No, there were many complaints that we don't implement the standard definition, after which we changed the docs to be very explicit about it.

@mitchelldehaven (Author) commented Aug 24, 2019

I see. I was just working with the TfidfVectorizer and our test cases weren't behaving as expected. Then, looking at the TfidfTransformer, I found the explicit description of the idf term. The workaround is to use CountVectorizer, then TfidfTransformer, and apply (idf_ - 1) to the transformed data. I figured having the option to use the standard idf instead of the workaround would be a better solution.
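For context, a minimal sketch of what that workaround computes, here done by hand from raw counts (the toy corpus and variable names are illustrative, not code from the PR):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

docs = ["the cat sat", "the dog sat", "the dog barked"]  # toy corpus

counts = CountVectorizer().fit_transform(docs)  # raw term counts (sparse)
n_docs = counts.shape[0]
df = np.bincount(counts.indices, minlength=counts.shape[1])  # document frequencies
idf = np.log(n_docs / df)  # textbook idf, i.e. scikit-learn's idf_ minus the +1 offset
tfidf = normalize(counts.multiply(idf).tocsr(), norm="l2")  # tf * idf, l2-normalized

# Note: "the" occurs in every document, so its idf is 0 and it is dropped entirely;
# this is exactly the behavior scikit-learn's +1 offset was meant to avoid.
```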

@mitchelldehaven (Author)

@amueller I am a bit unfamiliar with how all of this works; should I close the PR due to inactivity?

@rth (Member) commented Sep 10, 2019

I also have mixed feelings about this. On one hand, it would be great to have the standard IDF weighting as an option; on the other, I think that TfidfVectorizer already has way too many options, and adding one more is not helping.

Overall I might be +1 to adding it only to TfidfTransformer.

> the effect of adding "1" to the idf term in the equation above is that terms with zero idf, i.e., terms that occur in all documents in a training set, will not be entirely ignored

Note that this is not the only effect. Adding that +1 generally reduces the difference between frequent and less frequent terms. For instance, in the standard formulation a term with df=0.6 will have about 2x higher weight than one with df=0.8, while if we add that +1, the difference is only about 20%.
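A quick numeric check of that claim (toy values):

```python
import numpy as np

df = np.array([0.6, 0.8])    # document frequencies of two terms
standard = np.log(1 / df)    # [0.51, 0.22]
offset = standard + 1        # [1.51, 1.22]

print(standard[0] / standard[1])  # ~2.3: the rarer term weighs roughly 2x more
print(offset[0] / offset[1])      # ~1.2: only ~20% more once the +1 is added
```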

@amueller (Member)

@mitchelldehaven Sorry, we can be quite slow. Please be patient.

@rth I find having mismatched options between the two classes more confusing.

@rth (Member) commented Sep 10, 2019

> I find having mismatched options between the two classes more confusing.

OK, I guess we could merge this then if no one is opposed. cc @jnothman @qinhanmin2014.

@jnothman (Member) commented Sep 10, 2019 via email

I'm also happy to slowly deprecate TfidfVectorizer because it provides no benefit over a pipeline and creates much confusion.

@rth (Member) commented Sep 11, 2019

> I'm also happy to slowly deprecate TfidfVectorizer because it provides no benefit over a pipeline and creates much confusion.

+1 for deprecating it as well. Opened an issue about it: #14951

In which case, again if there are no objections, we can change only TfidfTransformer in this PR (or also add it to TfidfVectorizer pending removal) -- it doesn't really matter.

@mitchelldehaven (Author)

Should I modify the commit to apply the changes only to TfidfTransformer, or leave it as is?

@qinhanmin2014 (Member)

Now we have smooth_idf, sublinear_tf and standard_idf? Hmm, I think it will be a little bit difficult for users to understand the meaning of these parameters.
I'm fine with this PR.

@rth mentioned this pull request Oct 1, 2019
@rth mentioned this pull request Oct 25, 2019
@rth added this to the 0.23 milestone Apr 17, 2020
@rth (Member) left a comment


OK, let's merge as is. LGTM.

@mitchelldehaven Would you mind resolving the conflicts?
Also, please add an entry to the change log at doc/whats_new/v0.23.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself with :user:.

> Now we have smooth_idf, sublinear_tf and standard_idf? Hmm, I think it will be a little bit difficult for users to understand the meaning of these parameters.

I agree the previous names are not ideal, but it's unclear what we can do about that in this PR. We are also unlikely to deprecate TfidfVectorizer given its popularity (at least in the near future).

There seem to be enough users bothered by the fact that there is currently no such option (or by the defaults, though it would be difficult to change those now): https://twitter.com/BayesForDays/status/1250798944688525315

@adrinjalali (Member) left a comment


Thanks @mitchelldehaven for the PR. Looks neat.

My comments are just nitpicks about the docstring's clarity. I think you could use the :math: directive to better document what the difference is, and have it render better in the docs. If it's not clear how to do that, please let us know and we can help.
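A hypothetical sketch of what that could look like in the docstring (the parameter name and wording here are placeholders, not the PR's final text):

```python
class TfidfTransformer:
    r"""...

    Parameters
    ----------
    standard_idf : bool, default=False
        If True, use the textbook definition
        :math:`\mathrm{idf}(t) = \log\frac{n}{\mathrm{df}(t)}`
        instead of the default
        :math:`\mathrm{idf}(t) = \log\frac{n}{\mathrm{df}(t)} + 1`.
    """
```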

@jnothman (Member)

I'm happy with this, but wondering if we should reuse the use_idf param for this rather than create a new one. For example use_idf='standard' vs use_idf='nonzero' or something?

@rth (Member) commented Apr 27, 2020

> but wondering if we should reuse the use_idf param for this rather than create a new one. For example use_idf='standard' vs use_idf='nonzero' or something?

+1, that would be even better.

@mitchelldehaven (Author)

> I'm happy with this, but wondering if we should reuse the use_idf param for this rather than create a new one. For example use_idf='standard' vs use_idf='nonzero' or something?

So for this, use_idf='standard' is the textbook definition of the idf term and 'non-zero' is the current default behavior?

@rth (Member) commented Apr 27, 2020

> So for this, use_idf='standard' is the textbook definition of the idf term and 'non-zero' is the current default behavior?

Not too keen on these names, but I have a hard time finding better ones. Generally, I think it would make sense to have one main parameter for term weighting, one for document weighting (restricted to IDF variants), and one for normalization, as is classically done. Though smooth_idf is a bit harder to fit there.

So yes, I think we could support the following use_idf values (sketched below):

  • "non-zero" - current version
  • True (default) - alias for "non-zero"
  • "standard"

@adrinjalali (Member)

removing from the milestone.

@adrinjalali removed this from the 0.23 milestone May 4, 2020
@louisguitton (Contributor)

As discussed just now with @rth and @ogrisel (as part of the EuroPython 2020 sprint), in order to continue work on this PR we need to:

  1. fix conflicts here and address the existing comments, including discussing the naming of the new parameter: there was a consensus among us for offset_idf, which has to be compared to the approach proposed in this PR so far
  2. do a small bibliographical review to look for the different TF and IDF formulas, along with their canonical names. For that bibliography, textbooks should be preferred over Wikipedia, as the latter could have been influenced by sklearn anyway

@louisguitton (Contributor) commented Jul 26, 2020

Bibliography

  • [1] R. Baeza-Yates and B. Ribeiro-Neto (2011). Modern Information Retrieval. Addison Wesley, pp. 68-74.
    PDF not available, but slides for the chapter of interest can be found here

    • log is taken in base 2
  • [2] C.D. Manning, P. Raghavan and H. Schütze (2008). Introduction to Information Retrieval. Cambridge University Press, pp. 118-120, 126-128, 226-227 (found 2 more relevant bits)
    PDF can be found here

    • introduces SMART notation
    • introduces a probabilistic interpretation of the idf term (using the formula log(N/n_i))
  • [3] Wikipedia: tf-idf

  • [4] Wikipedia: SMART Info Retrieval and FreeDiscovery

  • [5] Mining of Massive Datasets, pp. 9-10 (see the scans below; ignore the handwritten annotations, as their source is probably Wikipedia)

    • TF term uses the so-called max-normalisation mentioned in [2], with a=1
  • [6] Apache Lucene (behind Solr and Elasticsearch) defines TF-IDF in a different way

    • TF term uses a square root that is not mentioned in SMART
    • IDF term uses the +1 offset like sklearn (the only other place so far) and applies smoothing only to the denominator
  • [7] Gensim defines TF-IDF in a more general way

    • you can plug in your own functions for TF and for IDF, which cuts short any debate: the library ships sane defaults and also lets the user customise; this is a pretty attractive idea
    • their default for TF is raw frequency
    • their defaults for IDF are offset=0, log_base=2, no smoothing

Mining of Massive Datasets (Stanford textbook): [scans of pp. 9-10 not reproduced]

@ogrisel (Member) commented Jul 26, 2020

Interestingly, we use log (base e) instead of log2 (base 2). It seems to be more common to use base 2, but depending on the normalization that might not have a significant impact.
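A quick sketch of why the normalization matters here (toy values; my illustration, not from the thread): a pure change of log base rescales every idf value by the same constant, which the default l2 normalization cancels out, but the +1 offset breaks that proportionality.

```python
import numpy as np
from sklearn.preprocessing import normalize

tf = np.array([[2.0, 1.0, 0.0], [0.0, 1.0, 3.0]])    # toy term frequencies
ratios = np.array([3.0, 1.5, 3.0])                   # n_docs / df per term
idf_ln, idf_log2 = np.log(ratios), np.log2(ratios)   # log2(x) = ln(x) / ln(2)

# Without the offset, the constant factor 1/ln(2) cancels under l2 normalization:
print(np.allclose(normalize(tf * idf_ln), normalize(tf * idf_log2)))  # True

# With the +1 offset, the two bases are no longer proportional, so results differ:
print(np.allclose(normalize(tf * (idf_ln + 1)), normalize(tf * (idf_log2 + 1))))  # False
```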

@louisguitton (Contributor) commented Jul 26, 2020

I've updated the bibliography and my conclusions are as follows:

  • For TF, let's not add any more customisation; there are already binary and sublinear, and although others are mentioned (like max-normalisation or even square root (!)), it is not clear that they have a use case in which they would beat frequency (which seems to be recommended for short docs) or sublinear (which seems to be recommended for long docs)
  • For IDF:
    • for the offset: only Lucene has offset = 1 too. Gensim allows setting the offset as a parameter, while defaulting to 0. The rest of the literature I've looked at today doesn't mention an offset
    • for the log base: take the log in base 2, or at least introduce a log_base parameter that could default to np.e in order not to break the defaults. In my opinion this could be tackled in a separate PR.
    • for the smoothing: although some smoothings are applied only to the denominator, I would keep the smoothing as it is (i.e., applied to both numerator and denominator)
  • In general, we should consider supporting a generic case like gensim does (with arbitrary TF and IDF functions) so that in the future we don't have to support more flags; see the sketch below. In my opinion, although it leaves more work on the user side, it would make the API of TfidfTransformer cleaner.
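For comparison, a minimal sketch of gensim's pluggable weighting via its documented wlocal/wglobal hooks (the toy corpus is illustrative; the lambdas reproduce gensim's defaults, i.e. raw TF and log2(N/df) with no offset or smoothing):

```python
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import TfidfModel

texts = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "dog", "barked"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

model = TfidfModel(
    corpus,
    wlocal=lambda tf: tf,                             # raw term frequency
    wglobal=lambda df, n_docs: np.log2(n_docs / df),  # textbook idf, base 2, offset 0
)
print(model[corpus[0]])  # weighted bag-of-words for the first document
```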

@mitchelldehaven (Author)

Sorry, life kind of got in the way. If someone else wants to continue this they can, or I can; either way. At the time I was reviewing quite a bit of TF-IDF material and noticed the inconsistency between the implementation and what I saw as the "standard" or "textbook" implementation. It looks like @louisguitton has a much better understanding of the various implementations than I ever did.

@ogrisel (Member) commented Jul 27, 2020

> for the offset: only Lucene has offset = 1 too.

That's a big one. Lucene is the engine behind Elasticsearch, which is probably one of the most widely deployed full-text search components in corporate and online services. However, information retrieval is not exactly the same as machine learning and text mining.

@ogrisel (Member) commented Jul 27, 2020

> In general, we should consider supporting a generic case like gensim does (with arbitrary TF and IDF functions) so that in the future we don't have to support more flags. In my opinion, although it leaves more work on the user side, it would make the API of TfidfTransformer cleaner.

Honestly, I am not sure this is necessary. Just adding the option to set the IDF offset to 0 instead of 1, and updating the documentation to explain how to configure TfidfTransformer / TfidfVectorizer to recover some common formulations, would be enough in my opinion.

@cmarmo removed the help wanted label Sep 8, 2020
@cmarmo (Contributor) commented Sep 8, 2020

@louisguitton, if you are still interested in working on this, feel free to open a draft pull request directly on scikit-learn. This will make it clear that you are taking care of the issue and bring some attention from the core devs. Thanks.

Base automatically changed from master to main January 22, 2021 10:51
@mitchelldehaven (Author) commented Dec 15, 2021

I will be picking this up again. To summarize what needs to be done (afaik):

  • Fix merge conflicts.
  • Improve doc strings.
  • Reuse use_idf instead of standard_idf:
    • use_idf = "standard" is the standard idf definition being added in this PR.
    • use_idf = "nonzero" is current behavior.
    • use_idf = True is current behavior.

@rth Does ^ look correct?

@mitchelldehaven (Author)

@rth @amueller See the last comment above. I just want to make sure the above is what is wanted from the PR before I start working on it.

@adrinjalali (Member)

I think your suggestion looks quite okay; you can go ahead and implement it :)

@mitchelldehaven (Author)

Sorry, I did take a look at this recently and had a bit of an issue fixing the merge conflicts via a rebase against the main branch. Since the number of commits since I started this PR is quite massive, the number of conflicts makes it quite a task. I generally use rebase, although I haven't had to do it against quite so many commits before. Would a git merge make more sense here to resolve the conflicts?

@adrinjalali (Member)

Yes, we usually recommend a merge instead of a rebase.

@adrinjalali (Member)

Closing due to no response. Happy to reopen, or to have a new PR, if @mitchelldehaven or somebody else picks this up.
