tf*idf is non-standard in scikit-learn #2998
Comments
I'm not really sure why this was done, but I see two effects: one is that we actually compute tf * (idf + 1) = tf + tf * idf rather than plain tf * idf. The other is that tf-idf is never exactly zero when tf is non-zero, so terms occurring in every document are not entirely discarded.
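A minimal numeric sketch of those two effects (my own illustration, not scikit-learn's actual code; the toy matrix and variable names are made up):

```python
import numpy as np

# Sketch of idf = log(n_samples / df) with the extra "+ 1" discussed above.
# The first term appears in every document.
tf = np.array([[3., 1.],
               [2., 0.],
               [1., 2.]])
n_samples = tf.shape[0]
df = (tf > 0).sum(axis=0)            # document frequency per term: [3, 2]

idf_plain = np.log(n_samples / df)   # textbook idf: 0 for the ubiquitous term
idf_plus1 = idf_plain + 1.0          # the variant under discussion

print(tf * idf_plain)   # first column is all zeros
print(tf * idf_plus1)   # equals tf + tf * idf_plain: non-zero wherever tf > 0
```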
I will try to play a bit more with this variant, at least to fix the comment, which is wrong.
+1. I'm not really in favor of changing the formula, since it might break people's code.
We currently have two mechanisms in place that modify idf:

- `smooth_idf`, which adds one to the document frequencies (and to the document count), as if an extra document containing every term in the collection had been seen; and
- the constant 1 added to the idf itself, so that terms occurring in all documents of the training set are not entirely ignored.

I pushed some documentation for the latter in fbe974b.
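A small sketch of how both mechanisms can be observed through the public API (a rough illustration; I'm assuming `TfidfTransformer` with its `smooth_idf` and `norm` parameters and the fitted `idf_` attribute behave as currently documented):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy term-count matrix: the first term appears in every document.
counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 0, 0],
                   [4, 0, 0],
                   [3, 2, 0],
                   [3, 0, 2]])

# Mechanism 1: smooth_idf adds one to the document frequencies (and to the
# document count), as if an extra document containing every term had been seen.
smoothed = TfidfTransformer(smooth_idf=True, norm=None).fit(counts)
unsmoothed = TfidfTransformer(smooth_idf=False, norm=None).fit(counts)

# Mechanism 2: a constant 1 is added to the idf itself, so the idf of a term
# present in every document comes out as 1.0 rather than 0.0.
print(smoothed.idf_)
print(unsmoothed.idf_)
```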
I'm closing this as "won't fix, but have documented". People are welcome to reopen if that resolution isn't good enough.
Hi,
Several months ago @tpeng pointed me to https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L966 - why is 1.0 added to idf?
This 1.0 makes idf positive when n_samples == df, and the comment suggests it is there to avoid some division errors. What I don't understand is what these division errors are: idf is a multiplier, not a divisor, in tf*idf, and we compute a logarithm for idf, so why would anything be divided by it?
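One plausible reading (my own inference from the failing normalization checks described below, not from the code comment itself): without the 1.0, a term occurring in every document gets an idf of exactly zero, and a document consisting only of such terms becomes an all-zero row that cannot be L2-normalized. A tiny worked illustration:

```python
import numpy as np

# When a term occurs in every document, df == n_samples and log(n/df) == 0.
n_samples, df = 4, 4
idf = np.log(n_samples / df)          # 0.0
idf_with_one = idf + 1.0              # 1.0: the weight no longer vanishes

# A document containing only such ubiquitous terms would then have an
# all-zero tf-idf row, whose L2 norm is 0 -- normalizing it divides by zero.
row = np.array([2.0, 1.0]) * idf
print(idf, idf_with_one, np.linalg.norm(row))   # 0.0 1.0 0.0
```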
When this 1.0 summand is commented out, the tests in `sklearn.feature_extraction.tests.test_text` start to fail, as expected. What fails are the normalization checks and the inverse transform test. If I comment out the normalization checks, the rest of `test_text.test_tfidf_no_smoothing` as well as `test_tf_idf_smoothing` passes. As I read it, the rest of these tests is supposed to check for zero division errors, and those errors are not present.

By the way, the SkipTest exceptions in these tests are likely useless because they should happen before `assert_warns_message` - this looks like a bug introduced in e1bdd99.

It is not clear to me what these failing normalization tests mean. But the comment about zero division errors doesn't explain why the formula is non-standard. There are smoothing terms inside the logarithm, but what is the +1 outside the logarithm for? Maybe it is explained in Yates2011, but I don't have access to it, so it would be better to add some more notes to the source code.
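For reference, here is my own summary of the formulas being compared (based on how I read the code and docs; the exact smoothing constants may not match every scikit-learn version):

```latex
\begin{align*}
  \text{standard textbook idf:} \quad
    & \mathrm{idf}(t) = \log\frac{n}{\mathrm{df}(t)} \\
  \text{smoothing inside the log (smooth\_idf):} \quad
    & \mathrm{idf}(t) = \log\frac{1 + n}{1 + \mathrm{df}(t)} \\
  \text{the variant questioned here (+1 outside the log):} \quad
    & \mathrm{idf}(t) = \log\frac{1 + n}{1 + \mathrm{df}(t)} + 1
\end{align*}
```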
Existing behavior was introduced by 0d1daad, so maybe @ogrisel knows the answer?