Issue w/ tf-idf computation #4391 · scikit-learn/scikit-learn

Closed

hannawallach opened this issue Mar 13, 2015 · 24 comments

Comments

@hannawallach
Contributor

Hi,

I have some concerns about what is currently implemented as tf-idf in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py -- i.e., tf-idf = tf * (idf + 1).

My concern is regarding the "+ 1". The comment in the source code says, "The actual formula used for tf-idf is tf * (idf + 1) = tf + tf * idf, instead of tf * idf. The effect of this is that terms with zero idf, i.e. that occur in all documents of a training set, will not be entirely ignored." This is indeed true -- the computation is effectively linearly interpolating between tf * idf and tf -- however, adding one to idf is not standard practice. Even after extensive search, reading many IR and NLP papers regarding tf-idf, and consulting w/ IR and NLP researchers, I have not (yet!) encountered an instance of anyone doing this. I am therefore concerned that this is the default behavior in sklearn -- all the more so because there is no mention of this in the documentation (I had to read the source code to figure it out).

Although there are several standard variants for computing tf and for computing idf (these are actually pretty well documented on the Wikipedia page), they are always combined as tf * idf, not tf * (idf + 1). Granted (and taking the standard definition of idf for simplicity), you can subsume the "+ 1" into the idf score itself, giving idf + 1 = log(D / D_v) + 1 = log(D / D_v) + log(e) = log(e * D / D_v) -- in effect this means you're pretending you saw e times as many documents as you actually did, and none of them contained any terms in the vocabulary (this means that all terms -- frequent and infrequent -- will be treated as if they occurred in fewer documents than they actually did). However, this is not equivalent to any of the standard variants of idf.
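
To make that arithmetic concrete, a minimal sketch in plain NumPy (D, D_v, and tf as defined above):

```python
import numpy as np

D, D_v = 100, 25        # total documents, documents containing term v
tf = 3.0                # term frequency of v in some document

idf = np.log(D / D_v)   # "textbook" idf (natural log)

# sklearn-style combination: tf * (idf + 1) = tf + tf * idf
assert np.isclose(tf * (idf + 1), tf + tf * idf)

# absorbing the "+ 1" into idf: idf + 1 = log(e * D / D_v)
assert np.isclose(idf + 1, np.log(np.e * D / D_v))
```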

Since sklearn's tf-idf is widely used, I think it should reflect standard behavior/definitions in the IR and NLP literature, and thus the "+ 1" should be removed. To be clear, I'm happy for the "+ 1" to be left as an option (or even changed to "+ c" with c specified by the user), thereby allowing the user to linearly interpolate between tf and "standard" tf-idf, but I think users should be aware that they are doing this, rather than doing this and thinking they're using "standard" tf-idf.

@xuewei4d
Contributor

+1. I think it would be better to give users more tf-idf formula options.

@hannawallach
Contributor Author

@amueller What do you think?

@amueller
Member

This has come up a couple of times. I tend to agree with you that this should be a) better documented and b) as standard as possible. I have not worked with IR and NLP much, and I think @larsmans and @ogrisel argued in favor of keeping it, but I have to check to be sure.

@hannawallach
Contributor Author

Interesting. I feel pretty strongly that libraries should do what users expect and, therefore, use standard definitions (at least by default). Also, I forgot to mention above: I checked the source of both libshorttext and INDRI, and both do tf * idf, not tf * (idf + 1). I'm happy to check other projects too.

@amueller
Member

I totally agree. I don't know what the standard libraries in the field are, but my impression was also that the +1 was non-standard.
The way to change this would be to create an option cant_comeupwithaname=None that adds some scalar; if it is not passed explicitly, raise a warning that the default will change from 1 to 0.
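
A rough sketch of that pattern (purely hypothetical, reusing the placeholder name from above; not actual sklearn code):

```python
import warnings
import numpy as np

def tfidf(tf, idf, cant_comeupwithaname=None):
    # Hypothetical sketch: warn when the caller relies on the old default,
    # then fall back to the current behaviour of adding 1 to idf.
    if cant_comeupwithaname is None:
        warnings.warn(
            "The default idf offset will change from 1 to 0; "
            "pass it explicitly to silence this warning.",
            FutureWarning,
        )
        cant_comeupwithaname = 1
    return np.asarray(tf) * (np.asarray(idf) + cant_comeupwithaname)
```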

@glouppe
Contributor
glouppe commented Mar 16, 2015

Unless I am wrong, isn't this what smooth_idf is made for? You can always set it to False to fall back to tf * idf.

(I don't have a strong opinion on what the best default behaviour should be, though.)

@amueller
Member

Oh, you are right, that is what the option does.

@amueller
Member

So either change the default, or at least document in the docstring that we do smoothing by default (currently you don't need to read the code, but you do need to read the description of the parameters).

@hannawallach
Contributor Author

No. That is not what smooth_idf=True does. From reading the code (lines 973 and 974) smooth_idf=True means that idf is computed as idf = log((D + 1) / (D_v + 1)) rather than idf = log(D / D_v). The resultant idf (irrespective of the value of smooth_idf) is still combined w/ tf according to tf * (idf + 1). Basically smooth_idf=True is equivalent to adding one document that contains every term in the vocabulary. Computing tf * (idf + 1) is equivalent to adding e documents (assuming log is base e, which it is here) of which none contain any terms in the vocabulary.
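
This is easy to verify against TfidfTransformer itself; a minimal check:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# 3 documents, 2 terms; term 0 occurs in every document
X = np.array([[1, 0],
              [1, 1],
              [1, 0]])

for smooth in (True, False):
    print(smooth, TfidfTransformer(smooth_idf=smooth).fit(X).idf_)

# Either way term 0 gets idf_ = 1.0 (log(3/3) + 1 unsmoothed,
# log(4/4) + 1 smoothed): the "+ 1" is applied regardless of smooth_idf.
```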

@amueller amueller added the Bug label Mar 16, 2015
@amueller amueller added this to the 0.16 milestone Mar 16, 2015
@glouppe
Contributor
glouppe commented Mar 16, 2015

Indeed, I read too fast; thanks for the report! Maybe @larsmans can comment on this?

@larsmans
Member

This issue is a duplicate of #2998.

Re: standard behavior, if you look at what Lucene does (and therefore Elasticsearch, Solr), it does the (idf+1) thing and two additional hacks:

tfidf = tf * idf²
idf = 1 + log(n_docs / (df + 1))

So we're closer to the textbook version of tf-idf than to the industry-standard one, which people in academia also use without ever worrying about this. I'm not in favor of changing the default behavior, as it may break people's code and change their results (typically for the worse).
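
As a sketch, those two formulas in code (function names are mine, not Lucene's API):

```python
import math

def lucene_idf(n_docs, df):
    # idf = 1 + log(n_docs / (df + 1)), as quoted above
    return 1.0 + math.log(n_docs / (df + 1))

def lucene_tfidf(tf, n_docs, df):
    # tfidf = tf * idf^2, as quoted above
    return tf * lucene_idf(n_docs, df) ** 2
```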

@ogrisel
Member
ogrisel commented Mar 24, 2015

I am personally +1 on fixing our TF-IDF vectorizer to implement one of the most standard variants, but:

  • I would rather avoid introducing yet another hyper-param on this beast,
  • I think we should check that the variant we implement does not cause too large a performance drop on standard benchmarks (20 newsgroups, IMDB reviews, sentiment140, Amazon reviews or such).

Our variant is not really principled and keeps surprising users (as reported on SO, blog post comments and this issue tracker several times).

@amueller
Member
amueller commented Apr 4, 2015

I haven't really looked at it, but the Wikipedia page lists quite a few variants: https://en.wikipedia.org/wiki/Tf%E2%80%93idf

@larsmans
Member
larsmans commented Apr 9, 2015

There are so many variants that there is, in fact, a systematic naming scheme for them.

@amueller
Member
amueller commented Apr 9, 2015

The naming scheme doesn't cover the various +1's, though.

@larsmans
Member
larsmans commented Apr 9, 2015

It's not exhaustive; it's just what was implemented in SMART.

@ermakovpetr

What do you think about adding this flag?

@amueller amueller modified the milestones: 0.16, 0.17 Sep 11, 2015
@amueller amueller modified the milestones: 1.0, 0.17 Sep 20, 2015
@ermakovpetr

#5900

@jnothman
Member

I think the addition of 1 to idf is not only non-standard, but seems inappropriate for machine learning. One might argue that in a retrieval context (i.e. Lucene's similarity score) a term that appears in (almost) every document should still match documents in which that term is heavily weighted.

I'm not very happy with the current default behaviour, but I understand that deprecating the default is not very kind to users. I think that we need to make it clearer that this is probably undesirable, and have a more precise interpretation of the "feature" in documentation.

While @hannawallach's interpretation of what we're doing is helpful, I'm not sure it will be clear to end-users. What we're doing is essentially interpolating tf with tf.idf, saying that there is an extent to which we don't care about document frequency in the weighting of a term.

@jnothman
Member

I'm closing this hoping that the documentation in #7015 is as close as we can get to a resolution without properly upsetting users!

@mayhewsw

Old thread is old, but I also had this problem. Not only is it not standard, but it privileges term frequencies far too much, so common tokens like "the" and "." tend to be ranked at the top. I'd vote for @hannawallach's suggestion also, but in the meantime, one workaround is to set sublinear_tf=True, which deprivileges tf to some extent.
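
A minimal sketch of that workaround (the feature-name accessor assumes a recent scikit-learn):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]

# sublinear_tf=True replaces tf with 1 + log(tf) for tf > 0, damping
# raw counts so very frequent tokens like "the" dominate less.
vec = TfidfVectorizer(sublinear_tf=True)
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())
print(X.toarray().round(2))
```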

@rth
Member
rth commented Oct 25, 2019

Re-opening. #14748 would partially address this by at least allowing standard tf-idf weighting.

@adrinjalali
Member

This was added to the 1.0 milestone a long time ago. Removing the milestone tag. I'd personally be happy if we get both #17933 and #14748 in, though.

@adrinjalali adrinjalali removed this from the 1.0 milestone Aug 22, 2021
@adrinjalali
Member

Seems like #7015 improved things a bit, and we can't really figure out a feasible way forward (also ref #14748). So closing might make sense here.
