[MRG] partial_fit implementation in TfidfTransformer #9014
@@ -11,7 +11,7 @@
 The :mod:`sklearn.feature_extraction.text` submodule gathers utilities to
 build feature vectors from text documents.
 """
-from __future__ import unicode_literals
+from __future__ import unicode_literals, division

 import array
 from collections import Mapping, defaultdict

@@ -1015,7 +1015,8 @@ def __init__(self, norm='l2', use_idf=True, smooth_idf=True,
         self.sublinear_tf = sublinear_tf

     def fit(self, X, y=None):
-        """Learn the idf vector (global term weights)
+        """Learn the df vector (global term weights).
+        It is used to calculate the idf scores for the terms.

         Parameters
         ----------
@@ -1025,18 +1026,44 @@ def fit(self, X, y=None):
         if not sp.issparse(X):
             X = sp.csc_matrix(X)
         if self.use_idf:
-            n_samples, n_features = X.shape
-            df = _document_frequency(X)
+            self._n_samples, n_features = X.shape
+            self._df = _document_frequency(X)

             # perform idf smoothing if required
-            df += int(self.smooth_idf)
-            n_samples += int(self.smooth_idf)
+            self._df += int(self.smooth_idf)
+            self._n_samples += int(self.smooth_idf)

-            # log+1 instead of log makes sure terms with zero idf don't get
-            # suppressed entirely.
-            idf = np.log(float(n_samples) / df) + 1.0
-            self._idf_diag = sp.spdiags(idf, diags=0, m=n_features,
-                                        n=n_features, format='csr')
-
         return self

+    def partial_fit(self, X, y=None):
+        """Update the df vector (global term weights),
+        which is used to calculate the idf scores for the terms.
+        This method should only be called after `fit` since it
+        is supposed to not change the number of features.
+
+        Parameters
+        ----------
+        X : sparse matrix, [n_samples, n_features]
+            a matrix of term/token counts
+        """
+        if not hasattr(self, '_df'):
+            raise ValueError("fit should be called before partial_fit")
+
+        if not sp.issparse(X):
+            X = sp.csc_matrix(X)
+        if self.use_idf:
+            n_samples, n_features = X.shape
+
+            expected_n_features = self._df.shape[0]
+            if n_features != expected_n_features:
+                raise ValueError("The update input has n_features=%d while"
+                                 " the model has been trained with n_features="
+                                 "%d" % (n_features, expected_n_features))
+
+            df = _document_frequency(X)
+            self._df += df
+            self._n_samples += n_samples
+
+        return self

Review comment on the `partial_fit` docstring: PEP257: one-line summary here.
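For reference, a minimal usage sketch of the workflow this diff enables (assuming this branch is installed; `TfidfTransformer.partial_fit` does not exist in released scikit-learn, and the toy corpus below is made up):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# A fixed vocabulary keeps n_features identical across batches,
# which partial_fit requires.
vocab = ['apple', 'banana', 'cherry']
counts = CountVectorizer(vocabulary=vocab)

batch1 = ["apple banana", "banana banana cherry"]
batch2 = ["cherry cherry", "apple cherry banana"]

tfidf = TfidfTransformer()
X1 = counts.transform(batch1)
X2 = counts.transform(batch2)

tfidf.fit(X1)           # initializes the _df / _n_samples counters
tfidf.partial_fit(X2)   # accumulates document frequencies from the second batch

print(tfidf.idf_)                     # idf derived lazily from the accumulated counts
print(tfidf.transform(X1).toarray())
```

With the changes above, `idf_` is recomputed from `_df` and `_n_samples` on access, so it reflects both batches.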
@@ -1070,15 +1097,19 @@ def transform(self, X, copy=True):
             X.data += 1

         if self.use_idf:
-            check_is_fitted(self, '_idf_diag', 'idf vector is not fitted')
+            check_is_fitted(self, '_df', 'df vector is not fitted')

-            expected_n_features = self._idf_diag.shape[0]
+            expected_n_features = self._df.shape[0]
             if n_features != expected_n_features:
                 raise ValueError("Input has n_features=%d while the model"
                                  " has been trained with n_features=%d" % (
                                      n_features, expected_n_features))

+            idf_diag = sp.spdiags(self.idf_, diags=0, m=n_features,
+                                  n=n_features, format='csr')
+
             # *= doesn't work
-            X = X * self._idf_diag
+            X = X * idf_diag

         if self.norm:
             X = normalize(X, norm=self.norm, copy=False)

Inline review thread on building `idf_diag` inside `transform`:

Comment: Given the power law distribution of vocabularies, this is potentially doing a lot more work at each call to `transform`.

Reply: The scenario with a somewhat degraded performance would be a sequence of `transform()` calls after a `fit()`/`partial_fit()`, since the idf vector would be re-calculated each time. If we think it's a problem, I see the following possibilities to avoid it: … I think that in NLP applications the vocabulary would be upper-bounded by millions, hence always storing both idf and df at the same time shouldn't be much of a problem. Which option do you like most?

Comment: So if I understand the issue correctly, … In any case, it might be worth actually benchmarking the overhead of creating `idf_diag`.

Comment: @n0mad Aww, so you actually did benchmarks above. Could you also test for e.g. running a transform on a single document?
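To follow up on the benchmarking question, a rough micro-benchmark along these lines could compare the two strategies (all sizes and names here are illustrative assumptions, not taken from the PR):

```python
import timeit

import numpy as np
import scipy.sparse as sp

n_features = 100_000                       # assumed vocabulary size
idf = np.random.rand(n_features) + 1.0     # stand-in for the learned idf vector
X = sp.random(1, n_features, density=0.0005, format='csr')  # one short document

# Precomputed diagonal idf matrix (the old eager strategy).
idf_diag = sp.spdiags(idf, diags=0, m=n_features, n=n_features, format='csr')

# Rebuild idf_diag on every transform() call (the strategy in this diff).
rebuild = timeit.timeit(
    lambda: X * sp.spdiags(idf, diags=0, m=n_features,
                           n=n_features, format='csr'),
    number=100)
# Reuse the cached idf_diag.
cached = timeit.timeit(lambda: X * idf_diag, number=100)

print("rebuild idf_diag each call: %.4fs / 100 calls" % rebuild)
print("reuse cached idf_diag:      %.4fs / 100 calls" % cached)
```

Whether the rebuild cost matters in practice should depend mostly on `n_features`, since constructing the diagonal matrix is roughly O(n_features) while the product itself only touches the non-zero entries of X.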
@@ -1087,9 +1118,7 @@ def transform(self, X, copy=True):

     @property
     def idf_(self):
-        # if _idf_diag is not set, this will raise an attribute error,
-        # which means hasattr(self, "idf_") is False
-        return np.ravel(self._idf_diag.sum(axis=0))
+        return np.log(self._n_samples / self._df) + 1.0


 class TfidfVectorizer(CountVectorizer):
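As a quick sanity check of the lazy `idf_` above, it should match what the old eager path stored in `_idf_diag`; a small sketch with toy numbers (assuming `smooth_idf=True`, as in `fit` above):

```python
import numpy as np
import scipy.sparse as sp

# Toy counts: 3 documents, 2 terms.
X = sp.csr_matrix(np.array([[1, 0],
                            [2, 1],
                            [0, 3]]))
n_samples, n_features = X.shape

# Document frequency per term, as in sklearn's _document_frequency for CSR input.
df = np.bincount(X.indices, minlength=n_features)

# With smooth_idf=True, both df and n_samples are incremented by 1 before the log.
idf_lazy = np.log(float(n_samples + 1) / (df + 1)) + 1.0

# Old eager path: build the diagonal matrix once and read idf back from it.
idf_diag = sp.spdiags(idf_lazy, diags=0, m=n_features, n=n_features, format='csr')
idf_eager = np.ravel(idf_diag.sum(axis=0))

assert np.allclose(idf_lazy, idf_eager)
print(idf_lazy)
```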
Review discussion on the fit/partial_fit semantics:

Comment: No, we'd usually use `partial_fit` all the way through, not `fit` the first time.

Reply: I was wondering what the semantics would be, given that partial_fit cannot change the number of features (see the issue discussion). OK, will re-do.

Comment: @n0mad When calling `partial_fit` all the way through, the use case would be to provide a fixed vocabulary at initialization (otherwise it wouldn't make sense indeed). Generally, to do out-of-core text vectorization one would do a first pass over the dataset to estimate the vocabulary (but currently there is no way of doing that in scikit-learn without also computing the full Bag of Words matrix), then a second pass to actually compute the BoW matrix (though that's probably not compatible with the sklearn API). Gensim does something like that, I think, but I can't find the exact reference anymore...
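For illustration, the two-pass, fixed-vocabulary workflow described in the last comment might look roughly like this (a sketch only: it assumes this PR's `partial_fit` is available, and the batched corpus is made up):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical streaming corpus, read in batches.
batches = [
    ["the cat sat", "the dog barked"],
    ["a cat and a dog", "the bird sang"],
]

# Pass 1: estimate the vocabulary only. Note that CountVectorizer.fit still
# computes the BoW matrix internally, which is exactly the limitation
# mentioned in the comment above.
vocab_builder = CountVectorizer()
vocabulary = set()
for batch in batches:
    vocab_builder.fit(batch)
    vocabulary.update(vocab_builder.vocabulary_)

# Pass 2: fixed vocabulary, so every batch has the same n_features and the
# df / n_samples counters can be updated incrementally.
counts = CountVectorizer(vocabulary=sorted(vocabulary))
tfidf = TfidfTransformer()

first = True
for batch in batches:
    X = counts.transform(batch)
    if first:
        tfidf.fit(X)          # or partial_fit all the way through, per the discussion
        first = False
    else:
        tfidf.partial_fit(X)

print(tfidf.idf_)
```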