DOC Correct TF-IDF formula in TfidfTransformer comments. (#13054) · scikit-learn/scikit-learn@fdf2f38 · GitHub

Commit fdf2f38

vishaalkapoor authored and jnothman committed
DOC Correct TF-IDF formula in TfidfTransformer comments. (#13054)
1 parent 1deb95a · commit fdf2f38

File tree

2 files changed (+27, -25 lines)

doc/modules/feature_extraction.rst

Lines changed: 15 additions & 14 deletions
@@ -436,11 +436,12 @@ Using the ``TfidfTransformer``'s default settings,
 the term frequency, the number of times a term occurs in a given document,
 is multiplied with idf component, which is computed as
 
-:math:`\text{idf}(t) = log{\frac{1 + n_d}{1+\text{df}(d,t)}} + 1`,
+:math:`\text{idf}(t) = \log{\frac{1 + n}{1+\text{df}(t)}} + 1`,
 
-where :math:`n_d` is the total number of documents, and :math:`\text{df}(d,t)`
-is the number of documents that contain term :math:`t`. The resulting tf-idf
-vectors are then normalized by the Euclidean norm:
+where :math:`n` is the total number of documents in the document set, and
+:math:`\text{df}(t)` is the number of documents in the document set that
+contain term :math:`t`. The resulting tf-idf vectors are then normalized by the
+Euclidean norm:
 
 :math:`v_{norm} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v{_1}^2 +
 v{_2}^2 + \dots + v{_n}^2}}`.
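The smoothed idf and the Euclidean normalization documented in this hunk can be sanity-checked numerically; a minimal sketch, where the helpers `idf_smooth` and `l2_normalize` are illustrative names, not scikit-learn API:

```python
import math

def idf_smooth(n, df):
    # Smoothed idf from the corrected docs:
    # idf(t) = log((1 + n) / (1 + df(t))) + 1
    return math.log((1 + n) / (1 + df)) + 1

def l2_normalize(v):
    # v / ||v||_2, as in the normalization formula above
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# A term occurring in all 6 documents of a 6-document set gets idf = 1:
print(idf_smooth(6, 6))               # log(7/7) + 1 = 1.0
print(l2_normalize([3.0, 0.0, 4.0]))  # [0.6, 0.0, 0.8]
```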
@@ -455,14 +456,14 @@ computed in scikit-learn's :class:`TfidfTransformer`
 and :class:`TfidfVectorizer` differ slightly from the standard textbook
 notation that defines the idf as
 
-:math:`\text{idf}(t) = log{\frac{n_d}{1+\text{df}(d,t)}}.`
+:math:`\text{idf}(t) = \log{\frac{n}{1+\text{df}(t)}}.`
 
 
 In the :class:`TfidfTransformer` and :class:`TfidfVectorizer`
 with ``smooth_idf=False``, the
 "1" count is added to the idf instead of the idf's denominator:
 
-:math:`\text{idf}(t) = log{\frac{n_d}{\text{df}(d,t)}} + 1`
+:math:`\text{idf}(t) = \log{\frac{n}{\text{df}(t)}} + 1`
 
 This normalization is implemented by the :class:`TfidfTransformer`
 class::
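The distinction this hunk documents, the textbook "+1" in the denominator versus scikit-learn's "+1" added to the idf itself with ``smooth_idf=False``, can be seen numerically for an assumed term with df(t) = 2 in a set of n = 6 documents:

```python
import math

n, df = 6, 2

# Standard textbook idf: the "+1" sits in the denominator.
idf_textbook = math.log(n / (1 + df))

# scikit-learn with smooth_idf=False: the "+1" is added to the idf instead.
idf_sklearn = math.log(n / df) + 1

print(round(idf_textbook, 4))  # log(6/3)
print(round(idf_sklearn, 4))   # log(6/2) + 1
```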
@@ -509,21 +510,21 @@ v{_2}^2 + \dots + v{_n}^2}}`
 For example, we can compute the tf-idf of the first term in the first
 document in the `counts` array as follows:
 
-:math:`n_{d} = 6`
+:math:`n = 6`
 
-:math:`\text{df}(d, t)_{\text{term1}} = 6`
+:math:`\text{df}(t)_{\text{term1}} = 6`
 
-:math:`\text{idf}(d, t)_{\text{term1}} =
-log \frac{n_d}{\text{df}(d, t)} + 1 = log(1)+1 = 1`
+:math:`\text{idf}(t)_{\text{term1}} =
+\log \frac{n}{\text{df}(t)} + 1 = \log(1)+1 = 1`
 
 :math:`\text{tf-idf}_{\text{term1}} = \text{tf} \times \text{idf} = 3 \times 1 = 3`
 
 Now, if we repeat this computation for the remaining 2 terms in the document,
 we get
 
-:math:`\text{tf-idf}_{\text{term2}} = 0 \times (log(6/1)+1) = 0`
+:math:`\text{tf-idf}_{\text{term2}} = 0 \times (\log(6/1)+1) = 0`
 
-:math:`\text{tf-idf}_{\text{term3}} = 1 \times (log(6/2)+1) \approx 2.0986`
+:math:`\text{tf-idf}_{\text{term3}} = 1 \times (\log(6/2)+1) \approx 2.0986`
 
 and the vector of raw tf-idfs:
 
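The worked example in this hunk (``smooth_idf=False``) can be replayed in plain Python; `tfidf` here is an illustrative helper, not scikit-learn code:

```python
import math

n = 6  # total documents in the document set

def tfidf(tf, df):
    # smooth_idf=False variant from the worked example:
    # tf-idf = tf * (log(n / df) + 1)
    return tf * (math.log(n / df) + 1)

print(tfidf(3, 6))            # term 1: 3 * (log(1) + 1) = 3.0
print(tfidf(0, 1))            # term 2: 0.0
print(round(tfidf(1, 2), 4))  # term 3: 1 * (log(3) + 1) ≈ 2.0986
```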
@@ -540,12 +541,12 @@ Furthermore, the default parameter ``smooth_idf=True`` adds "1" to the numerator
 and denominator as if an extra document was seen containing every term in the
 collection exactly once, which prevents zero divisions:
 
-:math:`\text{idf}(t) = log{\frac{1 + n_d}{1+\text{df}(d,t)}} + 1`
+:math:`\text{idf}(t) = \log{\frac{1 + n}{1+\text{df}(t)}} + 1`
 
 Using this modification, the tf-idf of the third term in document 1 changes to
 1.8473:
 
-:math:`\text{tf-idf}_{\text{term3}} = 1 \times log(7/3)+1 \approx 1.8473`
+:math:`\text{tf-idf}_{\text{term3}} = 1 \times \log(7/3)+1 \approx 1.8473`
 
 And the L2-normalized tf-idf changes to
 
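The smoothed recomputation in this hunk can be confirmed in one line of Python (variable names are illustrative):

```python
import math

# smooth_idf=True: both n and df(t) are incremented by one.
n, df_term3, tf_term3 = 6, 2, 1
tfidf_term3 = tf_term3 * (math.log((1 + n) / (1 + df_term3)) + 1)
print(round(tfidf_term3, 4))  # log(7/3) + 1 ≈ 1.8473
```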
sklearn/feature_extraction/text.py

Lines changed: 12 additions & 11 deletions
@@ -1146,17 +1146,18 @@ class TfidfTransformer(BaseEstimator, TransformerMixin):
     informative than features that occur in a small fraction of the training
     corpus.
 
-    The formula that is used to compute the tf-idf of term t is
-    tf-idf(d, t) = tf(t) * idf(d, t), and the idf is computed as
-    idf(d, t) = log [ n / df(d, t) ] + 1 (if ``smooth_idf=False``),
-    where n is the total number of documents and df(d, t) is the
-    document frequency; the document frequency is the number of documents d
-    that contain term t. The effect of adding "1" to the idf in the equation
-    above is that terms with zero idf, i.e., terms that occur in all documents
-    in a training set, will not be entirely ignored.
-    (Note that the idf formula above differs from the standard
-    textbook notation that defines the idf as
-    idf(d, t) = log [ n / (df(d, t) + 1) ]).
+    The formula that is used to compute the tf-idf for a term t of a document d
+    in a document set is tf-idf(t, d) = tf(t, d) * idf(t), and the idf is
+    computed as idf(t) = log [ n / df(t) ] + 1 (if ``smooth_idf=False``), where
+    n is the total number of documents in the document set and df(t) is the
+    document frequency of t; the document frequency is the number of documents
+    in the document set that contain the term t. The effect of adding "1" to
+    the idf in the equation above is that terms with zero idf, i.e., terms
+    that occur in all documents in a training set, will not be entirely
+    ignored.
+    (Note that the idf formula above differs from the standard textbook
+    notation that defines the idf as
+    idf(t) = log [ n / (df(t) + 1) ]).
 
     If ``smooth_idf=True`` (the default), the constant "1" is added to the
     numerator and denominator of the idf as if an extra document was seen
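A minimal sketch of the idf described in the corrected docstring, covering both ``smooth_idf`` settings (the `idf` function below is illustrative only, not the library's implementation):

```python
import math

def idf(df, n, smooth_idf=True):
    # idf as described in the TfidfTransformer docstring:
    #   smooth_idf=True:  log((1 + n) / (1 + df(t))) + 1
    #   smooth_idf=False: log(n / df(t)) + 1
    if smooth_idf:
        return math.log((1 + n) / (1 + df)) + 1
    return math.log(n / df) + 1

# A term present in every one of n documents still gets idf = 1
# (the "+1" keeps it from being entirely ignored), in both modes:
print(idf(6, 6, smooth_idf=False))  # 1.0
print(idf(6, 6, smooth_idf=True))   # 1.0
```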
0 commit comments
