[MRG] partial_fit implementation in TfidfTransformer by n0mad · Pull Request #9014 · scikit-learn/scikit-learn
Open · wants to merge 6 commits into main
17 changes: 17 additions & 0 deletions sklearn/feature_extraction/tests/test_text.py
@@ -324,6 +324,23 @@ def test_tf_idf_smoothing():
assert_true((tfidf >= 0).all())


def test_tfidf_partial_fit():
X = [[1, 1, 1],
[1, 1, 0],
[1, 0, 0]]

tr_full = TfidfTransformer(smooth_idf=True, norm='l2')
tfidf_full = tr_full.fit_transform(X).toarray()

tr_partial = TfidfTransformer(smooth_idf=True, norm='l2')
tr_partial.fit([[1, 1, 1]])
Member:
No, we'd usually use partial_fit all the way through, not fit the first time.

Author (n0mad):
I was wondering what the semantics would be if partial_fit cannot change the number of features (see the issue discussion).
OK, will redo.

Member (@rth, Sep 1, 2017):
@n0mad When calling partial_fit all the way through, the use case would be to provide a fixed vocabulary at initialization (otherwise it wouldn't make sense, indeed).

Generally, to do out-of-core text vectorization one would do a first pass over the dataset to estimate the vocabulary (currently there is no way of doing that in scikit-learn without also computing the full Bag-of-Words matrix), then a second pass to actually compute the BoW matrix (though that's probably not compatible with the sklearn API). Gensim does something like that, I think, but I can't find the exact reference anymore...
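
A minimal sketch of the batch-wise pattern this PR would enable, assuming its semantics of fit on the first batch and partial_fit on later batches; the toy vocabulary and text_batches data below are hypothetical placeholders, and CountVectorizer with a fixed vocabulary is just one way to keep n_features constant across batches:

# Sketch only: assumes the partial_fit semantics of this PR (fit on the
# first batch, partial_fit on later batches); vocabulary and text_batches
# are hypothetical toy data.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vocabulary = ["apple", "banana", "cherry"]            # fixed, known in advance
text_batches = [["apple banana cherry"],              # stand-in for an out-of-core source
                ["apple banana"],
                ["apple"]]

counter = CountVectorizer(vocabulary=vocabulary)      # fixed vocab => constant n_features
tfidf = TfidfTransformer(smooth_idf=True, norm='l2')

for i, batch in enumerate(text_batches):
    X_counts = counter.transform(batch)               # BoW counts for this batch only
    if i == 0:
        tfidf.fit(X_counts)                           # initialises df / n_samples
    else:
        tfidf.partial_fit(X_counts)                   # accumulates df / n_samples

# idf statistics now reflect all batches seen so far
X_new = tfidf.transform(counter.transform(["apple banana"]))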

tr_partial.partial_fit([[1, 1, 0]])
tr_partial.partial_fit([[1, 0, 0]])
tfidf_partial = tr_partial.transform(X).toarray()

assert_array_almost_equal(tfidf_full, tfidf_partial)


def test_tfidf_no_smoothing():
X = [[1, 1, 1],
[1, 1, 0],
63 changes: 46 additions & 17 deletions sklearn/feature_extraction/text.py
@@ -11,7 +11,7 @@
The :mod:`sklearn.feature_extraction.text` submodule gathers utilities to
build feature vectors from text documents.
"""
from __future__ import unicode_literals
from __future__ import unicode_literals, division

import array
from collections import Mapping, defaultdict
@@ -1015,7 +1015,8 @@ def __init__(self, norm='l2', use_idf=True, smooth_idf=True,
self.sublinear_tf = sublinear_tf

def fit(self, X, y=None):
"""Learn the idf vector (global term weights)
"""Learn the df vector (global term weights).
It is used to calculate the idf scores for the terms.

Parameters
----------
Expand All @@ -1025,18 +1026,44 @@ def fit(self, X, y=None):
if not sp.issparse(X):
X = sp.csc_matrix(X)
if self.use_idf:
n_samples, n_features = X.shape
df = _document_frequency(X)
self._n_samples, n_features = X.shape
self._df = _document_frequency(X)

# perform idf smoothing if required
df += int(self.smooth_idf)
n_samples += int(self.smooth_idf)
self._df += int(self.smooth_idf)
self._n_samples += int(self.smooth_idf)

return self

def partial_fit(self, X, y=None):
"""Update the df vector (global term weights),
Member:
PEP257: one-line summary here

which is used to calculate the idf scores for the terms.
This method should only be called after `fit` since it
is supposed to not change the number of features.

Parameters
----------
X : sparse matrix, [n_samples, n_features]
a matrix of term/token counts
"""

# log+1 instead of log makes sure terms with zero idf don't get
# suppressed entirely.
idf = np.log(float(n_samples) / df) + 1.0
self._idf_diag = sp.spdiags(idf, diags=0, m=n_features,
n=n_features, format='csr')
if not hasattr(self, '_df'):
raise ValueError("fit should be called before partial_fit")

if not sp.issparse(X):
X = sp.csc_matrix(X)
if self.use_idf:
n_samples, n_features = X.shape

expected_n_features = self._df.shape[0]
if n_features != expected_n_features:
raise ValueError("The update input has n_features=%d while"
" the model has been trained with n_features="
"%d" % (n_features, expected_n_features))

df = _document_frequency(X)
self._df += df
self._n_samples += n_samples

return self

@@ -1070,15 +1097,19 @@ def transform(self, X, copy=True):
X.data += 1

if self.use_idf:
check_is_fitted(self, '_idf_diag', 'idf vector is not fitted')
check_is_fitted(self, '_df', 'df vector is not fitted')

expected_n_features = self._idf_diag.shape[0]
expected_n_features = self._df.shape[0]
if n_features != expected_n_features:
raise ValueError("Input has n_features=%d while the model"
" has been trained with n_features=%d" % (
n_features, expected_n_features))

idf_diag = sp.spdiags(self.idf_, diags=0, m=n_features,
Member:
Given the power-law distribution of vocabularies, this is potentially doing a lot more work at each call to transform relative to the vocabulary used in each transformed document. We need a benchmark of repeated calls to transform. I now realise that converting the diagonal to CSR in transform already requires O(vocab_size) time, so we're unlikely to see a big change (except that the previous O(vocab_size) operation should have been a little faster). :( Perhaps we should be storing idf_diag from call to call.

Author (n0mad):
The scenario with somewhat degraded performance would be a sequence of transform() calls after a fit()/partial_fit(), since the idf vector would be re-calculated each time.

If we think it's a problem, I see the following possibilities to avoid it:

  1. cache idf_diag (a), maybe under a flag, or simply update it after each partial_fit (b) (i.e. store idf_diag, n_samples, and df);
  2. store only idf_diag and n_samples, then use them to re-calculate df, update it, and store the updated idf_diag & n_samples again. Super ugly, and has performance costs for partial_fit().

I think that in NLP applications the vocabulary would be upper-bounded by millions, so always storing both idf and df at the same time shouldn't be much of a problem.

Which option do you like most?
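
Purely as an illustration (not code from the PR), option 1(b) could look roughly like the following method inside TfidfTransformer, reusing the _df and _n_samples attributes this PR introduces; the helper name _update_idf_diag is hypothetical:

# Hypothetical sketch of option 1(b): refresh a cached CSR diagonal after
# every fit/partial_fit so transform() can reuse it instead of rebuilding
# spdiags on each call.  Not part of the PR.
import numpy as np
import scipy.sparse as sp

def _update_idf_diag(self):
    n_features = self._df.shape[0]
    idf = np.log(float(self._n_samples) / self._df) + 1.0
    self._idf_diag = sp.spdiags(idf, diags=0, m=n_features,
                                n=n_features, format='csr')

# fit() and partial_fit() would call self._update_idf_diag() just before
# returning, and transform() would multiply by the cached self._idf_diag.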

Member:
So, if I understand the issue correctly:

  • idf_ needs to be exposed and valid after each partial_fit, since it's a public attribute
  • to calculate idf_ at each iteration you need the df from the previous iteration and the n_samples
  • in transform, to multiply the rows of X by idf_ we need to create a diagonal sparse matrix. We can do that either in partial_fit (and cache it) or directly in transform. The overhead would be the same, and in my opinion it's largely a matter of whether we would (a) do numerous partial_fit calls or (b) numerous transform calls. I think (a) is always true by design (and thus more likely), and I'm not sure the possible use case of calling transform multiple times is worth adding an additional flag for caching idf_diag.

In any case, it might be worth actually benchmarking the overhead of creating sp.spdiags(self.idf_, ...) to see if it's anything we need to worry about...
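
For illustration only (this is not the author's benchmark mentioned in the next comment), a micro-benchmark of the spdiags construction for a hypothetical vocabulary of one million terms might look like:

# Illustrative micro-benchmark, assuming a vocabulary of 1e6 terms.
# Times only the construction of the sparse diagonal idf matrix that an
# uncached transform() would rebuild on every call.
import timeit
import numpy as np
import scipy.sparse as sp

n_features = 1000000
idf = np.random.uniform(1.0, 10.0, size=n_features)

n_calls = 20
t = timeit.timeit(
    lambda: sp.spdiags(idf, diags=0, m=n_features, n=n_features,
                       format='csr'),
    number=n_calls)
print("spdiags -> csr for %d features: %.1f ms per call"
      % (n_features, 1000.0 * t / n_calls))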

Member:
@n0mad Aww, so you actually did benchmarks above. Could you also test e.g. running a transform on a single document?

n=n_features, format='csr')

# *= doesn't work
X = X * self._idf_diag
X = X * idf_diag

if self.norm:
X = normalize(X, norm=self.norm, copy=False)
Expand All @@ -1087,9 +1118,7 @@ def transform(self, X, copy=True):

@property
def idf_(self):
# if _idf_diag is not set, this will raise an attribute error,
# which means hasattr(self, "idf_") is False
return np.ravel(self._idf_diag.sum(axis=0))
return np.log(self._n_samples / self._df) + 1.0


class TfidfVectorizer(CountVectorizer):
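
For reference, the reason the accumulated _df/_n_samples statistics reproduce the batch result (the property test_tfidf_partial_fit checks) can be verified with a few lines of NumPy; the arrays below mirror the toy data from the test and are not code from the PR:

# Illustrative check: with smooth_idf=True, fit adds the smoothing term once
# (df += 1, n_samples += 1) and partial_fit then adds raw counts, so the
# accumulated statistics equal those of a single fit on the stacked data,
# and idf_ = log(n_samples / df) + 1 is identical in both cases.
import numpy as np

X1 = np.array([[1, 1, 1]])                     # first batch (fit)
X2 = np.array([[1, 1, 0], [1, 0, 0]])          # second batch (partial_fit)

df1, n1 = (X1 > 0).sum(axis=0), X1.shape[0]
df2, n2 = (X2 > 0).sum(axis=0), X2.shape[0]

# incremental: smoothing applied once in fit, raw counts added afterwards
idf_incremental = np.log(float(n1 + 1 + n2) / (df1 + 1 + df2)) + 1.0

# batch: a single fit on the stacked matrix
X = np.vstack([X1, X2])
idf_batch = np.log(float(X.shape[0] + 1) / ((X > 0).sum(axis=0) + 1)) + 1.0

assert np.allclose(idf_incremental, idf_batch)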