@@ -436,11 +436,12 @@ Using the ``TfidfTransformer``'s default settings,
the term frequency, the number of times a term occurs in a given document,
is multiplied with the idf component, which is computed as

- :math:`\text{idf}(t) = log{\frac{1 + n_d}{1 + \text{df}(d, t)}} + 1`,
+ :math:`\text{idf}(t) = \log{\frac{1 + n}{1 + \text{df}(t)}} + 1`,

- where :math:`n_d` is the total number of documents, and :math:`\text{df}(d,t)`
- is the number of documents that contain term :math:`t`. The resulting tf-idf
- vectors are then normalized by the Euclidean norm:
+ where :math:`n` is the total number of documents in the document set, and
+ :math:`\text{df}(t)` is the number of documents in the document set that
+ contain term :math:`t`. The resulting tf-idf vectors are then normalized by the
+ Euclidean norm:

:math:`v_{norm} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v{_1}^2 +
v{_2}^2 + \dots + v{_n}^2}}`.
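
As a rough cross-check of the smoothed idf and the Euclidean normalization
above, here is a minimal NumPy sketch; the arrays ``tf`` and ``df`` and the
scalar ``n`` are illustrative stand-ins, not part of the scikit-learn API::

    import numpy as np

    tf = np.array([3.0, 0.0, 1.0])   # term counts in one document (toy values)
    df = np.array([6.0, 1.0, 2.0])   # number of documents containing each term
    n = 6                            # total number of documents

    # smoothed idf, as with the default smooth_idf=True
    idf = np.log((1 + n) / (1 + df)) + 1

    # raw tf-idf, then L2 (Euclidean) normalization
    tfidf = tf * idf
    tfidf = tfidf / np.sqrt((tfidf ** 2).sum())

Up to floating point, this should mirror what :class:`TfidfTransformer`
computes with its default ``smooth_idf=True`` and ``norm='l2'`` settings.
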
@@ -455,14 +456,14 @@ computed in scikit-learn's :class:`TfidfTransformer`
and :class:`TfidfVectorizer` differ slightly from the standard textbook
notation that defines the idf as

- :math:`\text{idf}(t) = log{\frac{n_d}{1 + \text{df}(d, t)}}.`
+ :math:`\text{idf}(t) = \log{\frac{n}{1 + \text{df}(t)}}.`


In the :class:`TfidfTransformer` and :class:`TfidfVectorizer`
with ``smooth_idf=False``, the
"1" count is added to the idf instead of the idf's denominator:

- :math:`\text{idf}(t) = log{\frac{n_d}{\text{df}(d, t)}} + 1`
+ :math:`\text{idf}(t) = \log{\frac{n}{\text{df}(t)}} + 1`

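Sticking with the same illustrative ``df`` array and ``n`` as in the sketch
above, the unsmoothed variant would be::

    import numpy as np

    df = np.array([6.0, 1.0, 2.0])   # documents containing each term (toy values)
    n = 6                            # total number of documents

    # smooth_idf=False: the "1" is added to the idf itself, not inside the log
    idf = np.log(n / df) + 1
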
This normalization is implemented by the :class:`TfidfTransformer`
class::
@@ -509,21 +510,21 @@ v{_2}^2 + \dots + v{_n}^2}}`
For example, we can compute the tf-idf of the first term in the first
document in the `counts` array as follows:

- :math:`n_{d} = 6`
+ :math:`n = 6`

- :math:`\text{df}(d, t)_{\text{term1}} = 6`
+ :math:`\text{df}(t)_{\text{term1}} = 6`

- :math:`\text{idf}(d, t)_{\text{term1}} =
- log \frac{n_d}{\text{df}(d, t)} + 1 = log(1)+1 = 1`
+ :math:`\text{idf}(t)_{\text{term1}} =
+ \log \frac{n}{\text{df}(t)} + 1 = \log(1)+1 = 1`

:math:`\text{tf-idf}_{\text{term1}} = \text{tf} \times \text{idf} = 3 \times 1 = 3`

Now, if we repeat this computation for the remaining 2 terms in the document,
we get

- :math:`\text{tf-idf}_{\text{term2}} = 0 \times (log(6/1)+1) = 0`
+ :math:`\text{tf-idf}_{\text{term2}} = 0 \times (\log(6/1)+1) = 0`

- :math:`\text{tf-idf}_{\text{term3}} = 1 \times (log(6/2)+1) \approx 2.0986`
+ :math:`\text{tf-idf}_{\text{term3}} = 1 \times (\log(6/2)+1) \approx 2.0986`

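These per-term values can be reproduced with a short sketch, using only the
counts and document frequencies quoted in this example::

    import numpy as np

    n = 6                                 # documents in the toy corpus
    tf_doc1 = np.array([3.0, 0.0, 1.0])   # counts of the three terms in document 1
    df = np.array([6.0, 1.0, 2.0])        # documents containing each term

    # smooth_idf=False, as in this part of the example
    tfidf_doc1 = tf_doc1 * (np.log(n / df) + 1)
    # -> approximately [3.0, 0.0, 2.0986]
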
and the vector of raw tf-idfs:
@@ -540,12 +541,12 @@ Furthermore, the default parameter ``smooth_idf=True`` adds "1" to the numerator
and denominator as if an extra document was seen containing every term in the
collection exactly once, which prevents zero divisions:

- :math:`\text{idf}(t) = log{\frac{1 + n_d}{1 + \text{df}(d, t)}} + 1`
+ :math:`\text{idf}(t) = \log{\frac{1 + n}{1 + \text{df}(t)}} + 1`

Using this modification, the tf-idf of the third term in document 1 changes to
1.8473:

- :math:`\text{tf-idf}_{\text{term3}} = 1 \times log(7/3)+1 \approx 1.8473`
+ :math:`\text{tf-idf}_{\text{term3}} = 1 \times \log(7/3)+1 \approx 1.8473`

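Again as a quick sanity check, with the same illustrative numbers as in the
earlier sketches::

    import numpy as np

    # term 3 occurs once in document 1 and appears in 2 of the 6 documents;
    # with smooth_idf=True, numerator and denominator each gain a "+1"
    tfidf_term3 = 1 * (np.log((1 + 6) / (1 + 2)) + 1)   # approximately 1.8473
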
And the L2-normalized tf-idf changes to