Added option to use standard idf term for TfidfTransformer and TfidfVectorizer #14748
Conversation
I'm +.5 on this given how much conversation this has sparked in the past.
Have there been issues with having the standard idf definition added in the past? Or was the issue that the option wasn't there?
No, there were many complaints that we don't implement the standard, after which we changed the docs to be very explicit about it.
I see. I was just working with the TfidfVectorizer and our test cases weren't behaving as expected. Then, looking at the TfidfTransformer, I found the explicit description of the idf term. The workaround is to use CountVectorizer, then TfidfTransformer, and apply (idf_ - 1) to the transformed data. I figured having the option to use the standard idf instead of the workaround would be a better solution.
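The workaround described above can be sketched in plain NumPy (toy counts standing in for CountVectorizer output; `smooth_idf=False` assumed for simplicity, since the smoothed variant also shifts n and df by 1):

```python
import numpy as np

# Hypothetical toy document-term counts (3 docs x 4 terms).
counts = np.array([[3.0, 0.0, 1.0, 0.0],
                   [2.0, 1.0, 0.0, 0.0],
                   [0.0, 1.0, 1.0, 1.0]])
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)            # document frequency per term

# TfidfTransformer(smooth_idf=False) fits idf_ = ln(n/df) + 1,
# so subtracting 1 recovers the textbook idf = ln(n/df).
sklearn_idf = np.log(n_docs / df) + 1.0
standard_idf = sklearn_idf - 1.0
assert np.allclose(standard_idf, np.log(n_docs / df))

standard_tfidf = counts * standard_idf   # standard tf-idf, no normalization
```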
@amueller I am a bit unfamiliar with how all of this works; should I close the PR due to inactivity?
I also have mixed feelings about this. On one side it would be great to have the standard IDF weighting as an option, on the other I think that … Overall I might be +1 to only add it to
Note that this is not the only effect. Adding that +1 generally reduces the difference between frequent and less frequent terms. For instance, in the standard formulation a term with df=0.6 will have a 2x higher weight than one with df=0.8, while if we add that +1, the difference is only about 20%.
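The numbers above can be checked directly, taking df as a fraction of the corpus and the standard idf = log(1/df):

```python
import math

# Document frequencies expressed as fractions of the corpus.
w_rare = math.log(1 / 0.6)   # standard idf for df = 0.6
w_freq = math.log(1 / 0.8)   # standard idf for df = 0.8

# Without the +1 offset the rarer term gets roughly 2.3x the weight;
# with the offset the gap shrinks to roughly 24%.
print(round(w_rare / w_freq, 2))
print(round((w_rare + 1) / (w_freq + 1), 2))
```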
@mitchelldehaven Sorry, we can be quite slow. Please be patient. @rth I find having mismatching options between the two classes more confusing.
OK, I guess we could merge this then if no one is opposed. cc @jnothman @qinhanmin2014.
I'm also happy to slowly deprecate TfidfVectorizer because it provides no benefit over a pipeline and creates much confusion. I've seen users do weird stuff like compare TfidfVectorizer to CountVectorizer but use different params.
+1 for deprecating it as well. Opened an issue about it: #14951. In which case, again if there are no objections, we can only change |
Should I modify the commit to only apply the changes to TfidfTransformer or leave it as is? |
Now we have |
OK, let's merge as is. LGTM.
@mitchelldehaven Would you mind resolving the conflict?
Also please add an entry to the change log at doc/whats_new/v0.23.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself with :user:.
Now we have smooth_idf, sublinear_tf and standard_idf? Hmm, I think it will be a little bit difficult for users to understand the meaning of these parameters.
I agree the previous names are not ideal, but it's unclear what we can do about it in this PR. We are also unlikely to deprecate TfidfVectorizer given its popularity (at least in the near future).
There seem to be enough users bothered by the fact that currently there is no such option (or by the defaults, though those would be difficult to change now): https://twitter.com/BayesForDays/status/1250798944688525315
Thanks @mitchelldehaven for the PR. Looks neat.
My comments are just nitpicks about the docstring's clarity. I think you could use the :math: directive to better document what the difference is, and have it rendered better in the docs. If it's not really clear how to do that, please let us know and we can help.
I'm happy with this, but wondering if we should reuse the |
+1 That would be even better.
So for this, use_idf='standard' is the textbook definition of the idf term, and 'non-zero' is the current default behavior?
Not too keen on these names, but I have a hard time finding better ones. Generally, I think it would really make sense to have one main parameter for term weighting, one for document weighting (restricted to IDF variants), and one for normalization, as classically done. Though the … So yes, I think we could support the following
|
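The three-way split suggested above (term weighting, document weighting, normalization, in the style of SMART notation) could be sketched roughly as follows. All names here are illustrative; none of them exist in scikit-learn, and normalization is omitted for brevity:

```python
import math

# Hypothetical scheme tables; names are made up for illustration.
TF_SCHEMES = {
    "natural":   lambda tf: float(tf),
    "sublinear": lambda tf: 1.0 + math.log(tf) if tf > 0 else 0.0,
}
IDF_SCHEMES = {
    "standard": lambda n, df: math.log(n / df),                    # textbook idf
    "nonzero":  lambda n, df: math.log(n / df) + 1.0,              # smooth_idf=False today
    "smooth":   lambda n, df: math.log((1 + n) / (1 + df)) + 1.0,  # current default
}

def term_weight(tf, n, df, tf_scheme="natural", idf_scheme="standard"):
    """Weight of one term: tf scheme times idf scheme."""
    return TF_SCHEMES[tf_scheme](tf) * IDF_SCHEMES[idf_scheme](n, df)

print(term_weight(2, 10, 4, idf_scheme="standard"))  # 2 * ln(10/4)
```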
Removing from the milestone.
As discussed just now with @rth and @ogrisel (as part of the EuroPython 2020 sprint), in order to continue work on this PR we need to:
|
Bibliography
Mining of Massive Datasets (Stanford textbook)
Interestingly, we use log (base e) instead of log2 (base 2). It seems more common to use base 2, but depending on the normalization that might not have a significant impact.
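Indeed, with the plain idf = log(n/df) and L2 normalization, the base cancels out entirely, since log2(x) = ln(x)/ln(2) rescales every idf weight by the same constant. A small sketch on assumed toy counts:

```python
import numpy as np

# Hypothetical toy counts, chosen so no normalized row is all zeros.
counts = np.array([[3.0, 1.0, 1.0],
                   [2.0, 1.0, 0.0],
                   [0.0, 1.0, 1.0]])
n = counts.shape[0]
df = (counts > 0).sum(axis=0)

def l2_rows(m):
    return m / np.linalg.norm(m, axis=1, keepdims=True)

# The two idf vectors differ only by the constant factor 1/ln(2),
# which L2 normalization cancels out.
tfidf_e = l2_rows(counts * np.log(n / df))
tfidf_2 = l2_rows(counts * np.log2(n / df))
assert np.allclose(tfidf_e, tfidf_2)
```

Note that with the +1 offset (idf = log(n/df) + 1) the two bases are no longer proportional, so the choice of base would then matter.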
I've updated the bibliography and my conclusions are as follows:
|
Sorry, life kind of got in the way. If someone else wants to continue this they can, or I can; either way. At the time I was reviewing quite a bit of TF-IDF stuff and noticed the inconsistency between the implementation and what I saw as the "standard" or "textbook" implementation. Looks like @louisguitton has a much better understanding of the various implementations than I ever did.
That's a big one. Lucene is the engine of Elasticsearch, which is probably one of the most widely deployed full-text components of corporate and online services. However, information retrieval is not exactly the same as machine learning and text mining.
Honestly, I am not sure this is necessary. Just adding the option to set the IDF offset to 0 instead of 1, and updating the documentation to explain how to configure |
@louisguitton, if you are still interested in working on this, feel free to open a draft pull request directly on scikit-learn. This will make it clear that you are taking care of the issue and bring some attention from core devs. Thanks.
I will be picking this up again. To summarize what needs to be done (afaik):
@rth Does ^ look correct?
I think your suggestion looks quite okay; you can go ahead and implement it :)
Sorry, I did take a look at this recently and had a bit of an issue fixing merge conflicts via a rebase against the main branch. Obviously, since the number of commits since I started this PR is quite massive, the number of conflicts makes it quite a task. I generally use rebase, although I haven't had to do it against quite so many commits before. Is a |
Yes, we usually recommend a |
Closing for no response. Happy to reopen or have a new PR if @mitchelldehaven or somebody else picks this up. |
The TfidfTransformer and TfidfVectorizer use an idf that is not the standard textbook definition. I added a parameter to both as an option to use the standard definition, while leaving the previously defined idf term as the default. I created some test cases to make sure it works, but I didn't include them in tests/test_text.py. Should I include them there?