Issue w/ tf-idf computation #4391 · scikit-learn/scikit-learn

Closed

hannawallach opened this issue Mar 13, 2015 · 24 comments

Comments

@hannawallach
Contributor

Hi,

I have some concerns about what is currently implemented as tf-idf in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py -- i.e., tf-idf = tf * (idf + 1).

My concern is regarding the "+ 1". The comment in the source code says, "The actual formula used for tf-idf is tf * (idf + 1) = tf + tf * idf, instead of tf * idf. The effect of this is that terms with zero idf, i.e. that occur in all documents of a training set, will not be entirely ignored." This is indeed true -- the computation is effectively linearly interpolating between tf * idf and tf -- however, adding one to idf is not standard practice. Even after extensive search, reading many IR and NLP papers regarding tf-idf, and consulting w/ IR and NLP researchers, I have not (yet!) encountered an instance of anyone doing this. I am therefore concerned that this is the default behavior in sklearn -- all the more so because there is no mention of this in the documentation (I had to read the source code to figure it out).

Although there are several standard variants for computing tf and for computing idf (these are actually pretty well documented on the Wikipedia page), they are always combined as tf * idf, not tf * (idf + 1). Granted (and taking the standard definition of idf for simplicity), you can subsume the "+ 1" into the idf score itself, giving idf + 1 = log(D / D_v) + 1 = log(D / D_v) + log(e) = log(e * D / D_v) -- in effect this means you're pretending you saw e times as many documents as you actually did, and none of them contained any terms in the vocabulary (this means that all terms -- frequent and infrequent -- will be treated as if they occurred in fewer documents than they actually did). However, this is not equivalent to any of the standard variants of idf.
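
To make that arithmetic concrete, a minimal sketch in plain NumPy (D, D_v, and tf as defined above):

```python
import numpy as np

D, D_v = 100, 25        # total documents, documents containing term v
tf = 3.0                # term frequency of v in some document

idf = np.log(D / D_v)   # "textbook" idf (natural log)

# sklearn-style combination: tf * (idf + 1) = tf + tf * idf
assert np.isclose(tf * (idf + 1), tf + tf * idf)

# absorbing the "+ 1" into idf: idf + 1 = log(e * D / D_v)
assert np.isclose(idf + 1, np.log(np.e * D / D_v))
```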

Since sklearn's tf-idf is widely used, I think it should reflect standard behavior/definitions in the IR and NLP literature, and thus the "+ 1" should be removed. To be clear, I'm happy for the "+ 1" to be left as an option (or even changed to "+ c" with c specified by the user), thereby allowing the user to linearly interpolate between tf and "standard" tf-idf, but I think users should be aware that they are doing this, rather than doing this and thinking they're using "standard" tf-idf.

@xuewei4d
Contributor

+1. I think it would be better to give users more tf-idf formula options.

@hannawallach
Contributor Author

@amueller What do you think?

@amueller
Member

This has come up a couple of times. I tend to agree with you that this should be a) better documented and b) as standard as possible. I have not worked with IR and NLP much, and I think @larsmans and @ogrisel argued in favor of keeping it, but I have to check to be sure.

@hannawallach
Contributor Author

Interesting. I feel pretty strongly that libraries should do what users expect and, therefore, use standard definitions (at least by default). Also, I forgot to mention above: I checked the source of both libshorttext and INDRI, and both do tf * idf, not tf * (idf + 1). I'm happy to check other projects too.

@amueller
Member

I totally agree. I don't know what the standard libraries in the field are, but my impression was also that the +1 was non-standard.
The way to change this would be to create an option cant_comeupwithaname=None that adds some scalar; if it is not passed explicitly, raise a warning that the default will change from 1 to 0.
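
A rough sketch of that pattern (purely hypothetical, reusing the placeholder name from above; not actual sklearn code):

```python
import warnings
import numpy as np

def tfidf(tf, idf, cant_comeupwithaname=None):
    # Hypothetical sketch: warn when the caller relies on the old default,
    # then fall back to the current behaviour of adding 1 to idf.
    if cant_comeupwithaname is None:
        warnings.warn(
            "The default idf offset will change from 1 to 0; "
            "pass it explicitly to silence this warning.",
            FutureWarning,
        )
        cant_comeupwithaname = 1
    return np.asarray(tf) * (np.asarray(idf) + cant_comeupwithaname)
```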

@glouppe
Contributor
glouppe commented Mar 16, 2015

Unless I am wrong, isn't this what smooth_idf is made for? You can always set it to False to fall back to tf * idf.

(I don't have a strong opinion on what the best default behaviour should be, though.)

@amueller
Member

Oh, you are right, that is what the option does.

@amueller
Member

So either change the default, or at least document in the docstring that we do smoothing by default (currently you don't need to read the code, but you do need to read the description of the parameters).

@hannawallach
Contributor Author

No. That is not what smooth_idf=True does. From reading the code (lines 973 and 974) smooth_idf=True means that idf is computed as idf = log((D + 1) / (D_v + 1)) rather than idf = log(D / D_v). The resultant idf (irrespective of the value of smooth_idf) is still combined w/ tf according to tf * (idf + 1). Basically smooth_idf=True is equivalent to adding one document that contains every term in the vocabulary. Computing tf * (idf + 1) is equivalent to adding e documents (assuming log is base e, which it is here) of which none contain any terms in the vocabulary.
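
This is easy to verify against TfidfTransformer itself; a minimal check:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# 3 documents, 2 terms; term 0 occurs in every document
X = np.array([[1, 0],
              [1, 1],
              [1, 0]])

for smooth in (True, False):
    print(smooth, TfidfTransformer(smooth_idf=smooth).fit(X).idf_)

# Either way term 0 gets idf_ = 1.0 (log(3/3) + 1 unsmoothed,
# log(4/4) + 1 smoothed): the "+ 1" is applied regardless of smooth_idf.
```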

@amueller amueller added the Bug label Mar 16, 2015
@amueller amueller added this to the 0.16 milestone Mar 16, 2015
@glouppe
Contributor
glouppe commented Mar 16, 2015

Indeed, I read too fast; thanks for the report! Maybe @larsmans can comment on this?

@larsmans
Member

This issue is a duplicate of #2998.

Re: standard behavior, if you look at what Lucene does (and therefore Elasticsearch, Solr), it does the (idf+1) thing and two additional hacks:

tfidf = tf * idf²
idf = 1 + log(n_docs / (df + 1))

So we're closer to the textbook version of tf-idf than to the industry-standard one, which people in academia also use without ever worrying about this. I'm not in favor of changing the default behavior, as it may break people's code and change their results (typically for the worse).
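
As a sketch, those two formulas in code (function names are mine, not Lucene's API):

```python
import math

def lucene_idf(n_docs, df):
    # idf = 1 + log(n_docs / (df + 1)), as quoted above
    return 1.0 + math.log(n_docs / (df + 1))

def lucene_tfidf(tf, n_docs, df):
    # tfidf = tf * idf^2, as quoted above
    return tf * lucene_idf(n_docs, df) ** 2
```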

@ogrisel
Member
ogrisel commented Mar 24, 2015

I am personally +1 on fixing our TF-IDF vectorizer to implement one of the most standard variants, but:

  • I would rather avoid introducing yet another hyper-param on this beast,
  • I think we should check that the variant we implement does not cause too large a performance drop on standard benchmarks (20 newsgroups, IMDB reviews, sentiment140, Amazon reviews or such).

Our variant is not really principled and keeps surprising users (as reported on SO, blog post comments and this issue tracker several times).

@amueller
Member
amueller commented Apr 4, 2015

I haven't really looked at it, but the Wikipedia page lists quite a few variants: https://en.wikipedia.org/wiki/Tf%E2%80%93idf

@larsmans
Member
larsmans commented Apr 9, 2015

There are so many variants that there is, in fact, a systematic naming scheme for them.

@amueller
Member
amueller commented Apr 9, 2015

The naming scheme doesn't cover the various +1's, though.

@larsmans
Member
larsmans commented Apr 9, 2015

It's not exhaustive; it's just what was implemented in SMART.

@ermakovpetr

What do you think about adding this flag?

@amueller amueller modified the milestones: 0.16, 0.17 Sep 11, 2015
@amueller amueller modified the milestones: 1.0, 0.17 Sep 20, 2015
@ermakovpetr

#5900

@jnothman
Member

I think the addition of 1 to idf is not only non-standard, but seems inappropriate for machine learning. One might argue that in a retrieval context (i.e. Lucene's similarity score) a term that appears in (almost) every document should still match documents in which that term is heavily weighted.

I'm not very happy with the current default behaviour, but I understand that deprecating the default is not very kind to users. I think that we need to make it clearer that this is probably undesirable, and have a more precise interpretation of the "feature" in documentation.

While @hannawallach's interpretation of what we're doing is helpful, I'm not sure it will be clear to end-users. What we're doing is essentially interpolating tf with tf.idf, saying that there is an extent to which we don't care about document frequency in the weighting of a term.

@jnothman
Member

I'm closing this hoping that the documentation in #7015 is as close as we can get to a resolution without properly upsetting users!

@mayhewsw

Old thread is old, but I also had this problem. Not only is it not standard, but it privileges term frequencies far too much, so common tokens like "the" and "." tend to be ranked at the top. I'd vote for @hannawallach's suggestion also, but in the meantime, one workaround is to set sublinear_tf=True, which deprivileges tf to some extent.
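
A minimal sketch of that workaround (the feature-name accessor assumes a recent scikit-learn):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]

# sublinear_tf=True replaces tf with 1 + log(tf) for tf > 0, damping
# raw counts so very frequent tokens like "the" dominate less.
vec = TfidfVectorizer(sublinear_tf=True)
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())
print(X.toarray().round(2))
```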

@rth
Member
rth commented Oct 25, 2019

Re-opening. #14748 would partially address this by at least allowing standard tf-idf weighting.

@adrinjalali
Member

This was added to the 1.0 milestone a long time ago. Removing the milestone tag. I'd personally be happy if we get both #17933 and #14748 in, though.

@adrinjalali adrinjalali removed this from the 1.0 milestone Aug 22, 2021
@adrinjalali
Member

Seems like #7015 improved things a bit, and we can't really figure out a feasible way forward (also ref #14748). So closing might make sense here.
