Issue w/ tf-idf computation #4391
+1. I think it would be better to give users more tf-idf formula options.
@amueller What do you think?
Interesting. I feel pretty strongly that libraries should do what users expect and, therefore, use standard definitions (at least by default). Also, I forgot to mention above: I checked the source of both libshorttext and INDRI, and both do tf * idf, not tf * (idf + 1). I'm happy to check other projects too.
I totally agree. I don't know what the standard libraries in the field are, but my impression was also that the + 1 is non-standard.
Unless I am wrong, isn't that what smooth_idf=False does? (I don't have any strong opinion on what the best default behaviour should be, though.)
Oh, you are right. That is what the option does.
So either change the default, or at least document in the docstring that we do smoothing by default (currently you don't need to read the code, but you do have to read the parameter descriptions to find this out).
No. That is not what smooth_idf does. smooth_idf adds one to the document counts inside the log; the + 1 this issue is about is added to the idf after the log, regardless of the smooth_idf setting.
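For concreteness, here is a quick check of what smooth_idf actually changes (the formulas in the comments match scikit-learn's documented behaviour; the toy counts are made up):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy term counts: 3 documents, 2 terms; term 0 occurs in every document.
counts = np.array([[3, 0],
                   [2, 0],
                   [3, 1]])

for smooth in (True, False):
    idf = TfidfTransformer(smooth_idf=smooth, norm=None).fit(counts).idf_
    print(f"smooth_idf={smooth}: {idf}")

# smooth_idf=False: idf = ln(n / df) + 1              -> [1.0, 2.0986...]
# smooth_idf=True:  idf = ln((1 + n) / (1 + df)) + 1  -> [1.0, 1.6931...]
# The trailing "+ 1" is applied in both cases, so smooth_idf does not
# remove the behaviour this issue is about.
```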
Indeed, read too fast, thanks for the report! Maybe @larsmans can comment on this?
This issue is a duplicate of #2998. Re: standard behavior, if you look at what Lucene does (and therefore Elasticsearch, Solr), it does the (idf + 1) thing and two additional hacks: it takes the square root of the raw term frequency, and it adds one to the document frequency inside the log (idf = log(numDocs / (docFreq + 1)) + 1).
So we're closer to the textbook than the industry-standard version of tf-idf, which people in academia use too without ever worrying about this. I'm not in favor of changing the default behavior, as it may break people's code and change their results (for the worse, typically).
I am personally +1 on fixing our TF-IDF vectorizer to implement one of the most standard variants, but changing the default behavior would silently change the results of existing users' code.
Our variant is not really principled and keeps surprising users (as reported on Stack Overflow, in blog-post comments, and on this issue tracker several times).
I haven't really looked at it, but the Wikipedia page lists quite a few variations: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
There are so many variants that there is, in fact, a systematic naming scheme for them.
The naming scheme doesn't talk about the various +1's, though.
It's not exhaustive, it's just what was implemented in SMART.
What do you think about adding this flag?
I think the addition of 1 to the idf is unusual. I'm not very happy with the current default behaviour, but I understand that deprecating the default is not very kind to users. I think that we need to make it clearer that this is probably undesirable, and have a more precise interpretation of the "feature" in the documentation. While helpful, I'm not sure that @hannawallach's interpretation of what we're doing will be clear to end-users. What we're doing is essentially interpolating between tf and tf * idf.
I'm closing this hoping that the documentation in #7015 is as close as we can get to a resolution without properly upsetting users!
Old thread is old, but I also had this problem. Not only is it not standard, but it privileges term frequencies far too much, so common tokens like "the" and "." tend to be ranked at the top. I'd vote for @hannawallach's suggestion also, but in the meantime, one workaround is to set sublinear_tf=True.
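For anyone landing here, a minimal sketch of another workaround: computing textbook tf * idf yourself from raw counts (the variable names are mine, not an sklearn API):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog sat", "the dog barked"]

counts = CountVectorizer().fit_transform(docs)   # raw term frequencies (sparse)
n_docs = counts.shape[0]
# Document frequency: number of documents each term occurs in.
df = np.bincount(counts.nonzero()[1], minlength=counts.shape[1])
idf = np.log(n_docs / df)                        # textbook idf, no "+ 1"
tfidf = counts.multiply(idf)                     # plain tf * idf, no interpolation
```

With this weighting, a term like "the" that occurs in every document gets idf = 0 and is zeroed out entirely, which is exactly what the current tf * (idf + 1) default prevents.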
Re-opening. #14748 would partially address this, by allowing users to at least use standard tf-idf weighting.
Hi,
I have some concerns about what is currently implemented as tf-idf in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py -- i.e., tf-idf = tf * (idf + 1).
My concern is regarding the "+ 1". The comment in the source code says, "The actual formula used for tf-idf is tf * (idf + 1) = tf + tf * idf, instead of tf * idf. The effect of this is that terms with zero idf, i.e. that occur in all documents of a training set, will not be entirely ignored." This is indeed true -- the computation is effectively linearly interpolating between tf * idf and tf -- however, adding one to idf is not standard practice. Even after extensive search, reading many IR and NLP papers regarding tf-idf, and consulting w/ IR and NLP researchers, I have not (yet!) encountered an instance of anyone doing this. I am therefore concerned that this is the default behavior in sklearn -- all the more so because there is no mention of this in the documentation (I had to read the source code to figure it out).
Although there are several standard variants for computing tf and for computing idf (these are actually pretty well documented on the Wikipedia page), they are always combined as tf * idf, not tf * (idf + 1). Granted (and taking the standard definition of idf for simplicity), you can subsume the "+ 1" into the idf score itself, giving idf + 1 = log(D / D_v) + 1 = log(D / D_v) + log(e) = log(e * D / D_v) -- in effect this means you're pretending you saw e times as many documents as you actually did, and none of them contained any terms in the vocabulary (this means that all terms -- frequent and infrequent -- will be treated as if they occurred in fewer documents than they actually did). However, this is not equivalent to any of the standard variants of idf.
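A quick numeric check of that identity (D and D_v are arbitrary made-up values, and log is the natural log here):

```python
import math

D, D_v = 1000, 100            # hypothetical corpus size / document frequency
idf = math.log(D / D_v)       # textbook idf: ln(10) ≈ 2.3026
# Adding 1 is the same as pretending the corpus is e times larger:
assert math.isclose(idf + 1, math.log(math.e * D / D_v))
```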
Since sklearn's tf-idf is widely used, I think it should reflect standard behavior/definitions in the IR and NLP literature, and thus the "+ 1" should be removed. To be clear, I'm happy for the "+ 1" to be left as an option (or even changed to be "+ c" with c specified by the user), thereby allowing the user to linearly interpolate between tf and "standard" tf-idf, but I think users should be aware that they are doing this, rather than doing this and thinking they're using "standard" tf-idf.
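A sketch of the "+ c" generalisation proposed above (tf_idf_c is a hypothetical helper, not an existing sklearn option):

```python
def tf_idf_c(tf, idf, c=1.0):
    # tf * (idf + c) = tf * idf + c * tf: c = 0 gives textbook tf-idf,
    # c = 1 reproduces sklearn's current default, and larger c weights
    # raw term frequency more heavily relative to tf * idf.
    return tf * (idf + c)
```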