FEA Online implementation of non-negative matrix factorization (#16948) · jjerphan/scikit-learn@52ca62d

Commit 52ca62d

FEA Online implementation of non-negative matrix factorization (scikit-learn#16948)

cmarmo, TomDLT, jeremiedbb and thomasjpfan authored and committed.

Co-authored-by: Tom Dupré la Tour <tom.dupre-la-tour@m4x.org>
Co-authored-by: jeremie du boisberranger <jeremiedbb@yahoo.fr>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>

1 parent d955701 · commit 52ca62d

File tree: 9 files changed, +1163 −140 lines

doc/computing/scaling_strategies.rst (+1 line)

@@ -80,6 +80,7 @@ Here is a list of incremental estimators for different tasks:
       + :class:`sklearn.decomposition.MiniBatchDictionaryLearning`
       + :class:`sklearn.decomposition.IncrementalPCA`
       + :class:`sklearn.decomposition.LatentDirichletAllocation`
+      + :class:`sklearn.decomposition.MiniBatchNMF`
   - Preprocessing
       + :class:`sklearn.preprocessing.StandardScaler`
       + :class:`sklearn.preprocessing.MinMaxScaler`

doc/modules/classes.rst (+1 line)

@@ -319,6 +319,7 @@ Samples generator
     decomposition.MiniBatchDictionaryLearning
     decomposition.MiniBatchSparsePCA
     decomposition.NMF
+    decomposition.MiniBatchNMF
     decomposition.PCA
     decomposition.SparsePCA
     decomposition.SparseCoder

doc/modules/decomposition.rst (+26 lines)

@@ -921,6 +921,29 @@ stored components::
 * :ref:`sphx_glr_auto_examples_applications_plot_topics_extraction_with_nmf_lda.py`
 * :ref:`sphx_glr_auto_examples_decomposition_plot_beta_divergence.py`
 
+.. _MiniBatchNMF:
+
+Mini-batch Non-Negative Matrix Factorization
+--------------------------------------------
+
+:class:`MiniBatchNMF` [7]_ implements a faster, but less accurate, version of
+non-negative matrix factorization (i.e. :class:`~sklearn.decomposition.NMF`),
+better suited for large datasets.
+
+By default, :class:`MiniBatchNMF` divides the data into mini-batches and
+optimizes the NMF model in an online manner by cycling over the mini-batches
+for the specified number of iterations. The ``batch_size`` parameter controls
+the size of the batches.
+
+In order to speed up the mini-batch algorithm, it is also possible to scale
+past batches, giving them less importance than newer batches. This is done by
+introducing a so-called forgetting factor, controlled by the ``forget_factor``
+parameter.
+
+The estimator also implements ``partial_fit``, which updates ``H`` by iterating
+only once over a mini-batch. This can be used for online learning when the data
+is not readily available from the start, or when the data does not fit into memory.
+
 .. topic:: References:
 
     .. [1] `"Learning the parts of objects by non-negative matrix factorization"
@@ -945,6 +968,9 @@ stored components::
            the beta-divergence" <1010.1763>`
            C. Fevotte, J. Idier, 2011
 
+    .. [7] :arxiv:`"Online algorithms for nonnegative matrix factorization with the
+           Itakura-Saito divergence" <1106.4198>`
+           A. Lefevre, F. Bach, C. Fevotte, 2011
 
 .. _LatentDirichletAllocation:

doc/whats_new/v1.1.rst (+5 lines)

@@ -288,6 +288,11 @@ Changelog
 :mod:`sklearn.decomposition`
 ............................
 
+- |MajorFeature| Added a new estimator :class:`decomposition.MiniBatchNMF`. It is a
+  faster but less accurate version of non-negative matrix factorization, better suited
+  for large datasets. :pr:`16948` by :user:`Chiara Marmo <cmarmo>`,
+  :user:`Patricio Cerda <pcerda>` and :user:`Jérémie du Boisberranger <jeremiedbb>`.
+
 - |Enhancement| :class:`decomposition.PCA` exposes a parameter `n_oversamples` to tune
   :func:`sklearn.decomposition.randomized_svd` and
   get accurate results when the number of features is large.

examples/applications/plot_topics_extraction_with_nmf_lda.py (+72 −3 lines)

@@ -30,13 +30,15 @@
 import matplotlib.pyplot as plt
 
 from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
-from sklearn.decomposition import NMF, LatentDirichletAllocation
+from sklearn.decomposition import NMF, MiniBatchNMF, LatentDirichletAllocation
 from sklearn.datasets import fetch_20newsgroups
 
 n_samples = 2000
 n_features = 1000
 n_components = 10
 n_top_words = 20
+batch_size = 128
+init = "nndsvda"
 
 
 def plot_top_words(model, feature_names, n_top_words, title):
@@ -101,7 +103,15 @@ def plot_top_words(model, feature_names, n_top_words, title):
     "n_samples=%d and n_features=%d..." % (n_samples, n_features)
 )
 t0 = time()
-nmf = NMF(n_components=n_components, random_state=1, alpha=0.1, l1_ratio=0.5).fit(tfidf)
+nmf = NMF(
+    n_components=n_components,
+    random_state=1,
+    init=init,
+    beta_loss="frobenius",
+    alpha_W=0.00005,
+    alpha_H=0.00005,
+    l1_ratio=1,
+).fit(tfidf)
 print("done in %0.3fs." % (time() - t0))
 
 
@@ -121,10 +131,12 @@ def plot_top_words(model, feature_names, n_top_words, title):
 nmf = NMF(
     n_components=n_components,
     random_state=1,
+    init=init,
     beta_loss="kullback-leibler",
     solver="mu",
     max_iter=1000,
-    alpha=0.1,
+    alpha_W=0.00005,
+    alpha_H=0.00005,
     l1_ratio=0.5,
 ).fit(tfidf)
 print("done in %0.3fs." % (time() - t0))
@@ -137,6 +149,63 @@ def plot_top_words(model, feature_names, n_top_words, title):
     "Topics in NMF model (generalized Kullback-Leibler divergence)",
 )
 
+# Fit the MiniBatchNMF model
+print(
+    "\n" * 2,
+    "Fitting the MiniBatchNMF model (Frobenius norm) with tf-idf "
+    "features, n_samples=%d and n_features=%d, batch_size=%d..."
+    % (n_samples, n_features, batch_size),
+)
+t0 = time()
+mbnmf = MiniBatchNMF(
+    n_components=n_components,
+    random_state=1,
+    batch_size=batch_size,
+    init=init,
+    beta_loss="frobenius",
+    alpha_W=0.00005,
+    alpha_H=0.00005,
+    l1_ratio=0.5,
+).fit(tfidf)
+print("done in %0.3fs." % (time() - t0))
+
+
+tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
+plot_top_words(
+    mbnmf,
+    tfidf_feature_names,
+    n_top_words,
+    "Topics in MiniBatchNMF model (Frobenius norm)",
+)
+
+# Fit the MiniBatchNMF model
+print(
+    "\n" * 2,
+    "Fitting the MiniBatchNMF model (generalized Kullback-Leibler "
+    "divergence) with tf-idf features, n_samples=%d and n_features=%d, "
+    "batch_size=%d..." % (n_samples, n_features, batch_size),
+)
+t0 = time()
+mbnmf = MiniBatchNMF(
+    n_components=n_components,
+    random_state=1,
+    batch_size=batch_size,
+    init=init,
+    beta_loss="kullback-leibler",
+    alpha_W=0.00005,
+    alpha_H=0.00005,
+    l1_ratio=0.5,
+).fit(tfidf)
+print("done in %0.3fs." % (time() - t0))
+
+tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
+plot_top_words(
+    mbnmf,
+    tfidf_feature_names,
+    n_top_words,
+    "Topics in MiniBatchNMF model (generalized Kullback-Leibler divergence)",
+)
+
 print(
     "\n" * 2,
     "Fitting LDA models with tf features, n_samples=%d and n_features=%d..."

sklearn/decomposition/__init__.py (+6 −1 lines)

@@ -5,7 +5,11 @@
 """
 
 
-from ._nmf import NMF, non_negative_factorization
+from ._nmf import (
+    NMF,
+    MiniBatchNMF,
+    non_negative_factorization,
+)
 from ._pca import PCA
 from ._incremental_pca import IncrementalPCA
 from ._kernel_pca import KernelPCA
@@ -31,6 +35,7 @@
     "IncrementalPCA",
     "KernelPCA",
     "MiniBatchDictionaryLearning",
+    "MiniBatchNMF",
     "MiniBatchSparsePCA",
     "NMF",
     "PCA",

0 commit comments