DOC TfidfVectorizer documentation details #6839
@@ -940,7 +940,7 @@ class TfidfTransformer(BaseEstimator, TransformerMixin):
     Parameters
     ----------
-    norm : 'l1', 'l2' or None, optional
+    norm : 'l1', 'l2' or None, default='l2'
         Norm used to normalize term vectors. None for no normalization.

     use_idf : boolean, default=True
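Not part of the diff, but a quick sketch of what the `norm` default means in practice, using only the public `TfidfTransformer` API (`use_idf=False` isolates the normalization step):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

counts = np.array([[3, 0, 1]])

# default norm='l2': each row is rescaled to unit Euclidean length
l2 = TfidfTransformer(use_idf=False).fit_transform(counts).toarray()
print(l2, np.linalg.norm(l2))  # [[0.9487 0.     0.3162]] 1.0

# norm=None: the raw term frequencies pass through unchanged
raw = TfidfTransformer(use_idf=False, norm=None).fit_transform(counts).toarray()
print(raw)  # [[3. 0. 1.]]
```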
@@ -1054,6 +1054,26 @@ def idf_(self):
 class TfidfVectorizer(CountVectorizer):
     """Convert a collection of raw documents to a matrix of TF-IDF features.

+    Tf means term-frequency while tf-idf means term-frequency times inverse
+    document-frequency. This is a common term weighting scheme in information
+    retrieval that has also found good use in document classification.
+
+    The goal of using tf-idf instead of the raw frequencies of occurrence of a
+    token in a given document is to scale down the impact of tokens that occur
+    very frequently in a given corpus and that are hence empirically less
+    informative than features that occur in a small fraction of the training
+    corpus.
+
+    The actual formula used for tf-idf is tf * (idf + 1) = tf + tf * idf,
Review comment: Interesting, I never knew we did this. Can @jnothman shed some light on whether this is a common practice? It does not show up here (https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Inverse_document_frequency_2) if I am right.
+    instead of tf * idf. The effect of this is that terms with zero idf, i.e.
+    that occur in all documents of a training set, will not be entirely
+    ignored. The formulas used to compute tf and idf depend on parameter
+    settings that correspond to the SMART notation used in IR, as follows:
+
+    Tf is "n" (natural) by default, "l" (logarithmic) when sublinear_tf=True.
+    Idf is "t" when use_idf is given, "n" (none) otherwise.
+    Normalization is "c" (cosine) when norm='l2', "n" (none) when norm=None.
+

Review comment: I feel these 3 lines can be omitted. The parameters below state the same thing.
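To illustrate the tf * (idf + 1) point questioned above, a minimal sketch (with norm=None and smooth_idf=False so the raw weights are visible) showing that a term occurring in every document keeps a nonzero weight:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# column 0: a term present in every document (idf == 0)
# column 1: a term present in a single document
counts = np.array([[1, 1],
                   [1, 0],
                   [1, 0]])

weights = TfidfTransformer(norm=None, smooth_idf=False).fit_transform(counts)
print(weights.toarray())
# column 0 stays at tf * (0 + 1) = 1.0 rather than being zeroed out,
# which is the behaviour the docstring describes
```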
     Equivalent to CountVectorizer followed by TfidfTransformer.

     Read more in the :ref:`User Guide <text_feature_extraction>`.
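The SMART letters above map directly onto constructor arguments; for instance, the logarithmic ("l") tf variant is selected with sublinear_tf=True. A small sketch (idf and normalization switched off so only the tf transform is visible):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

counts = np.array([[10, 1]])

# sublinear_tf=True replaces each nonzero tf with 1 + ln(tf)
sub = TfidfTransformer(use_idf=False, norm=None, sublinear_tf=True)
print(sub.fit_transform(counts).toarray())  # [[3.3026 1.    ]] = [1 + ln(10), 1 + ln(1)]
```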
@@ -1165,7 +1185,7 @@ class TfidfVectorizer(CountVectorizer):
     dtype : type, optional
         Type of the matrix returned by fit_transform() or transform().

-    norm : 'l1', 'l2' or None, optional
+    norm : 'l1', 'l2' or None, default='l2'
         Norm used to normalize term vectors. None for no normalization.

     use_idf : boolean, default=True
Review comment: Should we explicitly mention the formula here?
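If the formula does get spelled out, one way to check what the code actually does: with smooth_idf=False the learned idf_ vector equals ln(n / df) + 1, which is the same thing as the tf * (idf + 1) weighting described above (a hedged sketch; with smooth_idf=True, 1 is added to n and df before taking the log):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

counts = np.array([[1, 1],
                   [1, 0],
                   [1, 0]])

t = TfidfTransformer(smooth_idf=False).fit(counts)
n = counts.shape[0]             # number of documents
df = (counts > 0).sum(axis=0)   # document frequency per term
print(t.idf_)                   # idf vector learned by the transformer
print(np.log(n / df) + 1)       # ln(n / df) + 1 -- agrees with idf_
```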