8000 DOC improve tfidfvectorizer documentation (#12822) · scikit-learn/scikit-learn@ff46f6e · GitHub

Commit ff46f6e

Reshama Shaikh authored and jnothman committed
DOC improve tfidfvectorizer documentation (#12822)
1 parent 82e2160 commit ff46f6e

File tree

1 file changed: +40 additions, −32 deletions


sklearn/feature_extraction/text.py

Lines changed: 40 additions & 32 deletions
@@ -169,7 +169,7 @@ def _word_ngrams(self, tokens, stop_words=None):
             space_join = " ".join
 
             for n in range(min_n,
-                           min(max_n + 1, n_original_tokens + 1)):
+                            min(max_n + 1, n_original_tokens + 1)):
                 for i in range(n_original_tokens - n + 1):
                     tokens_append(space_join(original_tokens[i: i + n]))
 
@@ -1177,18 +1177,23 @@ class TfidfTransformer(BaseEstimator, TransformerMixin):
 
     Parameters
     ----------
-    norm : 'l1', 'l2' or None, optional
-        Norm used to normalize term vectors. None for no normalization.
-
-    use_idf : boolean, default=True
+    norm : 'l1', 'l2' or None, optional (default='l2')
+        Each output row will have unit norm, either:
+        * 'l2': Sum of squares of vector elements is 1. The cosine
+          similarity between two vectors is their dot product when l2 norm has
+          been applied.
+        * 'l1': Sum of absolute values of vector elements is 1.
+        See :func:`preprocessing.normalize`
+
+    use_idf : boolean (default=True)
         Enable inverse-document-frequency reweighting.
 
-    smooth_idf : boolean, default=True
+    smooth_idf : boolean (default=True)
         Smooth idf weights by adding one to document frequencies, as if an
         extra document was seen containing every term in the collection
         exactly once. Prevents zero divisions.
 
-    sublinear_tf : boolean, default=False
+    sublinear_tf : boolean (default=False)
         Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
 
     Attributes
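The rewritten ``norm`` description can be checked empirically. A minimal sketch (the corpus below is made up for illustration; it assumes a standard scikit-learn install) showing that with ``norm='l2'`` every output row has unit norm, so the cosine similarity of two rows reduces to their dot product:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["the cat sat", "the dog sat", "cats and dogs"]
counts = CountVectorizer().fit_transform(docs)

# norm='l2' is the default; shown explicitly to match the docstring.
X = TfidfTransformer(norm='l2').fit_transform(counts).toarray()

# Sum of squares of each row's elements is 1 ...
print(np.allclose((X ** 2).sum(axis=1), 1.0))  # True

# ... so cosine similarity equals the plain dot product.
cosine = X[0] @ X[1] / (np.linalg.norm(X[0]) * np.linalg.norm(X[1]))
print(np.allclose(cosine, X[0] @ X[1]))  # True
```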
@@ -1305,7 +1310,8 @@ def idf_(self, value):
 class TfidfVectorizer(CountVectorizer):
     """Convert a collection of raw documents to a matrix of TF-IDF features.
 
-    Equivalent to CountVectorizer followed by TfidfTransformer.
+    Equivalent to :class:`CountVectorizer` followed by
+    :class:`TfidfTransformer`.
 
     Read more in the :ref:`User Guide <text_feature_extraction>`.
 
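The equivalence that the docstring states can be demonstrated directly; a small sketch (example corpus is invented) comparing the one-step vectorizer against the two-step pipeline with default settings:

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)

docs = ["the quick brown fox", "the lazy dog", "the quick dog"]

# TfidfVectorizer in one step ...
direct = TfidfVectorizer().fit_transform(docs)

# ... equals CountVectorizer followed by TfidfTransformer.
counts = CountVectorizer().fit_transform(docs)
pipeline = TfidfTransformer().fit_transform(counts)

print(np.allclose(direct.toarray(), pipeline.toarray()))  # True
```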
@@ -1326,13 +1332,13 @@ class TfidfVectorizer(CountVectorizer):
         If bytes or files are given to analyze, this encoding is used to
         decode.
 
-    decode_error : {'strict', 'ignore', 'replace'}
+    decode_error : {'strict', 'ignore', 'replace'} (default='strict')
         Instruction on what to do if a byte sequence is given to analyze that
         contains characters not of the given `encoding`. By default, it is
         'strict', meaning that a UnicodeDecodeError will be raised. Other
         values are 'ignore' and 'replace'.
 
-    strip_accents : {'ascii', 'unicode', None}
+    strip_accents : {'ascii', 'unicode', None} (default=None)
         Remove accents and perform other character normalization
         during the preprocessing step.
         'ascii' is a fast method that only works on characters that have
@@ -1343,14 +1349,14 @@ class TfidfVectorizer(CountVectorizer):
         Both 'ascii' and 'unicode' use NFKD normalization from
         :func:`unicodedata.normalize`.
 
-    lowercase : boolean, default True
+    lowercase : boolean (default=True)
         Convert all characters to lowercase before tokenizing.
 
-    preprocessor : callable or None (default)
+    preprocessor : callable or None (default=None)
         Override the preprocessing (string transformation) stage while
         preserving the tokenizing and n-grams generation steps.
 
-    tokenizer : callable or None (default)
+    tokenizer : callable or None (default=None)
         Override the string tokenization step while preserving the
         preprocessing and n-grams generation steps.
         Only applies if ``analyzer == 'word'``.
@@ -1363,7 +1369,7 @@ class TfidfVectorizer(CountVectorizer):
         If a callable is passed it is used to extract the sequence of features
         out of the raw, unprocessed input.
 
-    stop_words : string {'english'}, list, or None (default)
+    stop_words : string {'english'}, list, or None (default=None)
         If a string, it is passed to _check_stop_list and the appropriate stop
         list is returned. 'english' is currently the only supported string
         value.
@@ -1384,58 +1390,63 @@ class TfidfVectorizer(CountVectorizer):
         or more alphanumeric characters (punctuation is completely ignored
         and always treated as a token separator).
 
-    ngram_range : tuple (min_n, max_n)
+    ngram_range : tuple (min_n, max_n) (default=(1, 1))
         The lower and upper boundary of the range of n-values for different
         n-grams to be extracted. All values of n such that min_n <= n <= max_n
         will be used.
 
-    max_df : float in range [0.0, 1.0] or int, default=1.0
+    max_df : float in range [0.0, 1.0] or int (default=1.0)
         When building the vocabulary ignore terms that have a document
         frequency strictly higher than the given threshold (corpus-specific
         stop words).
         If float, the parameter represents a proportion of documents, integer
         absolute counts.
         This parameter is ignored if vocabulary is not None.
 
-    min_df : float in range [0.0, 1.0] or int, default=1
+    min_df : float in range [0.0, 1.0] or int (default=1)
         When building the vocabulary ignore terms that have a document
         frequency strictly lower than the given threshold. This value is also
         called cut-off in the literature.
         If float, the parameter represents a proportion of documents, integer
         absolute counts.
         This parameter is ignored if vocabulary is not None.
 
-    max_features : int or None, default=None
+    max_features : int or None (default=None)
         If not None, build a vocabulary that only consider the top
         max_features ordered by term frequency across the corpus.
 
         This parameter is ignored if vocabulary is not None.
 
-    vocabulary : Mapping or iterable, optional
+    vocabulary : Mapping or iterable, optional (default=None)
         Either a Mapping (e.g., a dict) where keys are terms and values are
         indices in the feature matrix, or an iterable over terms. If not
         given, a vocabulary is determined from the input documents.
 
-    binary : boolean, default=False
+    binary : boolean (default=False)
         If True, all non-zero term counts are set to 1. This does not mean
         outputs will have only 0/1 values, only that the tf term in tf-idf
         is binary. (Set idf and normalization to False to get 0/1 outputs.)
 
-    dtype : type, optional
+    dtype : type, optional (default=float64)
         Type of the matrix returned by fit_transform() or transform().
 
-    norm : 'l1', 'l2' or None, optional
-        Norm used to normalize term vectors. None for no normalization.
+    norm : 'l1', 'l2' or None, optional (default='l2')
+        Each output row will have unit norm, either:
+        * 'l2': Sum of squares of vector elements is 1. The cosine
+          similarity between two vectors is their dot product when l2 norm has
+          been applied.
+        * 'l1': Sum of absolute values of vector elements is 1.
+        See :func:`preprocessing.normalize`
 
-    use_idf : boolean, default=True
+    use_idf : boolean (default=True)
         Enable inverse-document-frequency reweighting.
 
-    smooth_idf : boolean, default=True
+    smooth_idf : boolean (default=True)
         Smooth idf weights by adding one to document frequencies, as if an
         extra document was seen containing every term in the collection
         exactly once. Prevents zero divisions.
 
-    sublinear_tf : boolean, default=False
+    sublinear_tf : boolean (default=False)
         Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
 
     Attributes
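The vocabulary-pruning and n-gram parameters documented above interact in a way worth seeing concretely; a small sketch with an invented three-document corpus (integer ``min_df``/``max_df`` bound the document frequency directly, and ``ngram_range=(1, 2)`` adds bigrams):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the bird sat on the mat",
]

# min_df=2 drops terms in fewer than 2 documents (cat, dog, log, bird);
# max_df=2 drops terms in strictly more than 2 (the, sat, on).
vec = TfidfVectorizer(min_df=2, max_df=2)
vec.fit(docs)
print(sorted(vec.vocabulary_))  # ['mat']

# ngram_range=(1, 2) extracts all unigrams and bigrams.
vec2 = TfidfVectorizer(ngram_range=(1, 2))
vec2.fit(["quick brown fox"])
print(sorted(vec2.vocabulary_))
# ['brown', 'brown fox', 'fox', 'quick', 'quick brown']
```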
@@ -1474,13 +1485,10 @@ class TfidfVectorizer(CountVectorizer):
 
     See also
     --------
-    CountVectorizer
-        Tokenize the documents and count the occurrences of token and return
-        them as a sparse matrix
+    CountVectorizer : Transforms text into a sparse matrix of n-gram counts.
 
-    TfidfTransformer
-        Apply Term Frequency Inverse Document Frequency normalization to a
-        sparse matrix of occurrence counts.
+    TfidfTransformer : Performs the TF-IDF transformation from a provided
+        matrix of counts.
 
     Notes
     -----
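The ``smooth_idf`` wording ("as if an extra document was seen containing every term") corresponds to scikit-learn's idf formula ``ln((1 + n) / (1 + df)) + 1``; a sketch verifying it against the fitted ``idf_`` attribute on an invented corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["apple banana", "apple cherry", "apple banana cherry"]
vec = TfidfVectorizer(smooth_idf=True)  # smooth_idf=True is the default
vec.fit(docs)

# Features are indexed alphabetically: apple, banana, cherry,
# with document frequencies 3, 2, 2 across the n=3 documents.
n = len(docs)
df = np.array([3, 2, 2])
expected = np.log((1 + n) / (1 + df)) + 1
print(np.allclose(vec.idf_, expected))  # True
```

With ``smooth_idf=False`` the unsmoothed form ``ln(n / df) + 1`` is used instead, which can divide by zero for unseen terms; the smoothing above is what "Prevents zero divisions" refers to.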

0 commit comments
