tf*idf is non-standard in scikit-learn #2998

Closed
kmike opened this issue Mar 24, 2014 · 5 comments

@kmike (Contributor) commented Mar 24, 2014

Hi,

Several months ago @tpeng pointed me to https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L966 - why is 1.0 added to idf?

This 1.0 makes idf positive when n_samples == df, and the comment suggests it is there to avoid some division errors. What I don't understand is what these division errors are: idf is a multiplier, not a divisor, in tf*idf, and idf itself is a logarithm, so where would a division by it occur?
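
To make the question concrete, here is a minimal sketch (toy numbers, not the actual scikit-learn code) of how the +1 changes idf for a term that occurs in every document:

import numpy as np

# Sketch only: a term that occurs in every document, so df == n_samples.
n_samples, df = 4, 4

idf_standard = np.log(float(n_samples) / df)        # log(1) == 0.0
idf_sklearn = np.log(float(n_samples) / df) + 1.0   # 1.0, never zero

print(idf_standard, idf_sklearn)                    # 0.0 1.0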

When this 1.0 summand is commented out, tests in sklearn.feature_extraction.tests.test_text start to fail, as expected:

..............FF...........F........
======================================================================
FAIL: sklearn.feature_extraction.tests.test_text.test_tf_idf_smoothing
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/kmike/envs/scraping/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/kmike/svn/scikit-learn/sklearn/feature_extraction/tests/test_text.py", line 322, in test_tf_idf_smoothing
    assert_array_almost_equal((tfidf ** 2).sum(axis=1), [1., 1., 1.])
  File "/Users/kmike/envs/scraping/lib/python2.7/site-packages/numpy/testing/utils.py", line 811, in assert_array_almost_equal
    header=('Arrays are not almost equal to %d decimals' % decimal))
  File "/Users/kmike/envs/scraping/lib/python2.7/site-packages/numpy/testing/utils.py", line 644, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Arrays are not almost equal to 6 decimals

(mismatch 33.3333333333%)
 x: array([ 1.,  1.,  0.])
 y: array([ 1.,  1.,  1.])
>>  raise AssertionError('\nArrays are not almost equal to 6 decimals\n\n(mismatch 33.3333333333%)\n x: array([ 1.,  1.,  0.])\n y: array([ 1.,  1.,  1.])')


======================================================================
FAIL: sklearn.feature_extraction.tests.test_text.test_tfidf_no_smoothing
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/kmike/envs/scraping/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/kmike/svn/scikit-learn/sklearn/feature_extraction/tests/test_text.py", line 342, in test_tfidf_no_smoothing
    assert_array_almost_equal((tfidf ** 2).sum(axis=1), [1., 1., 1.])
  File "/Users/kmike/envs/scraping/lib/python2.7/site-packages/numpy/testing/utils.py", line 811, in assert_array_almost_equal
    header=('Arrays are not almost equal to %d decimals' % decimal))
  File "/Users/kmike/envs/scraping/lib/python2.7/site-packages/numpy/testing/utils.py", line 644, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Arrays are not almost equal to 6 decimals

(mismatch 33.3333333333%)
 x: array([ 1.,  1.,  0.])
 y: array([ 1.,  1.,  1.])
>>  raise AssertionError('\nArrays are not almost equal to 6 decimals\n\n(mismatch 33.3333333333%)\n x: array([ 1.,  1.,  0.])\n y: array([ 1.,  1.,  1.])')


======================================================================
FAIL: sklearn.feature_extraction.tests.test_text.test_vectorizer_inverse_transform
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/kmike/envs/scraping/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/kmike/svn/scikit-learn/sklearn/feature_extraction/tests/test_text.py", line 695, in test_vectorizer_inverse_transform
    assert_array_equal(terms, inversed_terms)
  File "/Users/kmike/envs/scraping/lib/python2.7/site-packages/numpy/testing/utils.py", line 718, in assert_array_equal
    verbose=verbose, header='Arrays are not equal')
  File "/Users/kmike/envs/scraping/lib/python2.7/site-packages/numpy/testing/utils.py", line 599, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Arrays are not equal

(shapes (4,), (3,) mismatch)
 x: array([u'beer', u'copyright', u'pizza', u'the'], 
      dtype='<U9')
 y: array([u'beer', u'copyright', u'pizza'], 
      dtype='<U9')
>>  raise AssertionError("\nArrays are not equal\n\n(shapes (4,), (3,) mismatch)\n x: array([u'beer', u'copyright', u'pizza', u'the'], \n      dtype='<U9')\n y: array([u'beer', u'copyright', u'pizza'], \n      dtype='<U9')")

What fails are the normalization checks and the inverse transform test. If I comment out the normalization checks, the rest of test_text.test_tfidf_no_smoothing as well as test_tf_idf_smoothing passes. As I read it, the rest of these tests is supposed to check for zero division errors, and those errors do not occur.
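
Here is a minimal sketch (toy counts, not the test code) of why the normalization check reports a zero norm: without the +1, every term of a document can end up with idf == 0, so its tf-idf row is all zeros and cannot be scaled to unit length:

import numpy as np

tf = np.array([2.0, 1.0])           # hypothetical counts for one document
idf = np.array([0.0, 0.0])          # both terms occur in every document
row = tf * idf                      # [0., 0.]
print(np.sqrt((row ** 2).sum()))    # 0.0 instead of the expected 1.0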

By the way, the SkipTest exceptions in these tests are likely useless because they would be raised before assert_warns_message; this looks like a bug introduced in e1bdd99.

It is not clear to me what these failing normalization tests mean. The comment about zero division errors doesn't explain why the formula is non-standard either: there are smoothing terms inside the logarithm, but what is the +1 outside the logarithm for? Maybe it is explained in Yates2011, but I don't have access to it; in that case it would be better to add more notes to the source code.

The existing behavior was introduced in 0d1daad, so maybe @ogrisel knows the answer?

@larsmans (Member) commented:

I'm not really sure why this was done, but I see two effects. One is that we actually compute tf * (idf + 1) = tf*idf + tf, so term frequencies get boosted. This appears to work well in classification and clustering settings.

The other is that tf-idf is never exactly zero when tf is non-zero, so inverse_transform produces the intersection of a document's terms with the vocabulary, rather than intersecting this again with the terms that have non-zero df. Of course, for this purpose any ε > 0 would have sufficed instead of 1.
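
A toy illustration of both effects (hypothetical numbers, not the estimator code):

import numpy as np

tf = np.array([3.0, 1.0, 2.0])
idf = np.array([0.0, 0.7, 1.4])          # the first term occurs in every document

standard = tf * idf                      # [0. , 0.7, 2.8]
ours = tf * (idf + 1.0)                  # [3. , 1.7, 4.8]
assert np.allclose(ours, standard + tf)  # tf * (idf + 1) = tf*idf + tf

# The first entry stays non-zero in our variant, which is why
# inverse_transform still recovers that term for the document.
print(standard[0], ours[0])              # 0.0 3.0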

@ogrisel (Member) commented Mar 25, 2014

I will try to play a bit with the variants again, at least to fix the comment, which is wrong.

@larsmans (Member) commented:

+1. I'm not really in favor of changing the formula, since it might break people's code.

@larsmans (Member) commented:

We currently have two mechanisms in place that modify idf: smooth_idf and the hardcoded + 1.0. The former increments df, while the latter effectively increments tf, so we have something close to Lidstone smoothing, which is how I originally intended smooth_idf to work:

  • Concatenate the entire corpus into a single document and add this document back to the corpus.
  • Now df' = df + 1 for all terms. This is exactly what smooth_idf=1 entails.
  • idf = log(N / df') should now really be idf = log((N + 1) / df'), or more generally N + smooth_idf since there's an extra document. This is a bug.
  • In any case, to get Lidstone smoothing we should actually set tf' = tf + 1 to obtain tfidf = (tf + 1) × idf, but instead we add it in at the end via the aforementioned tfidf = tf × (idf + 1) = tf + tf × idf (see the sketch after this list). This differs from true tf-idf, but it has the advantage that per-class differences in term frequencies can still be learned by supervised estimators down the road even when idf is zero.
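
A sketch of the difference (toy corpus statistics, not the estimator code; it applies the N + 1 fix noted above):

import numpy as np

n = 4.0                                  # documents in the original corpus
df = np.array([4.0, 2.0])                # document frequencies of two terms
tf = np.array([2.0, 1.0])                # counts in one document

idf = np.log((n + 1.0) / (df + 1.0))     # smooth_idf=1 with the N + 1 fix

lidstone = (tf + 1.0) * idf              # true Lidstone-style tf' = tf + 1
current = tf * (idf + 1.0)               # what we compute: tf + tf * idf
print(lidstone)                          # [0.         1.02165125]
print(current)                           # [2.         1.51082562]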

I pushed some documentation for the latter in fbe974b.

@jnothman (Member) commented:

I'm closing this as "won't fix, but documented". People are welcome to reopen if that resolution isn't good enough.
