A High-Performance Parallel Algorithm
for Nonnegative Matrix Factorization
Ramakrishnan Kannan (Georgia Tech, rkannan@gatech.edu)
Grey Ballard (Sandia National Laboratories, gmballa@sandia.gov)
Haesun Park (Georgia Tech, hpark@cc.gatech.edu)
Abstract

Non-negative matrix factorization (NMF) is the problem of determining two non-negative low rank factors W and H, for a given input matrix A, such that A ≈ WH. NMF is a useful tool for many applications in different domains such as topic modeling in text mining, background separation in video analysis, and community detection in social networks. Despite its popularity in the data mining community, there is a lack of efficient distributed algorithms to solve the problem for big data sets.

We propose a high-performance distributed-memory parallel algorithm that computes the factorization by iteratively solving alternating non-negative least squares (NLS) subproblems for W and H. It maintains the data and factor matrices in memory (distributed across processors), uses MPI for interprocessor communication, and, in the dense case, provably minimizes communication costs (under mild assumptions). As opposed to previous implementations, our algorithm is also flexible: (1) it performs well for both dense and sparse matrices, and (2) it allows the user to choose any one of multiple algorithms for solving the updates to the low rank factors W and H within the alternating iterations. We demonstrate the scalability of our algorithm and compare it with baseline implementations, showing significant performance improvements.

1. Introduction

Non-negative Matrix Factorization (NMF) is the problem of finding two low rank factors W ∈ R^{m×k}_+ and H ∈ R^{k×n}_+ for a given input matrix A ∈ R^{m×n}_+, such that A ≈ WH. Here, R^{m×n}_+ denotes the set of m × n matrices with non-negative real values. Formally, the NMF problem [24] can be defined as

    min_{W≥0, H≥0} ‖A − WH‖_F ,                                   (1)

where ‖X‖_F = (Σ_{ij} x_{ij}^2)^{1/2} is the Frobenius norm.

NMF is widely used in data mining and machine learning as a dimension reduction and factor analysis method. It is a natural fit for many real world problems, as the non-negativity is inherent in many representations of real-world data and the resulting low rank factors are expected to have a natural interpretation. The applications of NMF range from text mining [22], computer vision [11], and bioinformatics [13] to blind source separation [3], unsupervised clustering [16, 17], and many other areas. In the typical case, k ≪ min(m, n); for problems today, m and n can be on the order of millions or more, and k is on the order of tens or hundreds.

There is a vast literature on algorithms for NMF and their convergence properties [15]. The commonly adopted NMF algorithms are (i) Multiplicative Update (MU) [24], (ii) Hierarchical Alternating Least Squares (HALS) [3, 10], (iii) NMF using Block Principal Pivoting (ANLS-BPP) [14], and (iv) Stochastic Gradient Descent (SGD) updates [8]. Most of the algorithms in the NMF literature are based on alternately optimizing each of the low rank factors W and H while keeping the other fixed, in which case each subproblem is a constrained convex optimization problem. Subproblems can then be solved using standard optimization techniques such as projected gradient or interior point methods; a detailed survey of methods for solving such problems can be found in [15, 26]. In this paper, our implementation uses a fast active-set based method called Block Principal Pivoting (BPP) [14], but the parallel algorithm proposed in this paper can be easily extended to other algorithms such as MU and HALS.

Recently, with the advent of large scale internet data and interest in Big Data, researchers have started studying the scalability of many foundational machine learning algorithms. To illustrate the dimensions of matrices commonly used in the machine learning community, we present a few examples. Nowadays the adjacency matrix of a billion-node social network is common. In the matrix representation of video data, every frame contains three matrices, one for each RGB color, and each frame is reshaped into a column. Thus, in the case of 4K video, every frame takes approximately 27 million rows (4096 row pixels × 2196 column pixels × 3 colors). Similarly, the popular representation of documents in text mining is a bag-of-words matrix, where the rows are the dictionary and the columns are the documents (e.g., webpages). Each entry A_ij of the bag-of-words matrix is generally the frequency count of word i in document j. Typically, with the explosion of new terms in social media, the number of words spans to millions.

To handle such high dimensional matrices, it is important to study low rank approximation methods in a data-distributed environment. For example, in many large scale scenarios, data samples are collected and stored over many general purpose computers, as the set of samples is too large to store on a single machine. In this case, the computation must also be distributed across processors. Local computation is preferred as local access of data is much faster than remote access due to the costs of interprocessor communication. However, for low rank approximation algorithms like MU, HALS, and BPP, some communication is necessary.

The simplest way to organize these distributed computations on large data sets is through a MapReduce framework like Hadoop, but this simplicity comes at the expense of performance. In particular, most MapReduce frameworks require data to be read from and written to disk at every iteration, and they involve communication-intensive global, input-data shuffles across machines.
In this work, we present a much more efficient algorithm and implementation using tools from the field of High-Performance Computing (HPC). We maintain data in memory (distributed across processors), take advantage of optimized libraries for local computational routines, and use the Message Passing Interface (MPI) standard to organize interprocessor communication. The current trend for high-performance computers is that available parallelism (and therefore aggregate computational rate) is increasing much more quickly than improvements in network bandwidth and latency. This trend implies that the relative cost of communication (compared to computation) is increasing.

To address this challenge, we analyze algorithms in terms of both their computation and communication costs. The two major tasks of the NMF algorithm are (a) performing matrix multiplications and (b) solving many Non-negative Least Squares (NLS) subproblems. In this paper, we use a carefully chosen data distribution in order to use a communication-optimal algorithm for each of the matrix multiplications, while at the same time exploiting the parallelism in the NLS problems. In particular, our proposed algorithm ensures that after the input data is initially read into memory, it is never communicated; we communicate only the factor matrices and other smaller temporary matrices among the p processors that participate in the distributed computation. Furthermore, we prove that in the case of dense input and under the assumption that k ≤ √(mn/p), our proposed algorithm minimizes bandwidth cost (the amount of data communicated between processors) to within a constant factor of the lower bound. We also reduce latency costs (the number of times processors communicate with each other) by utilizing MPI collective communication operations, along with temporary local memory space, so that each iteration requires only O(log p) messages, the minimum achievable for aggregating global data.

Fairbanks et al. [5] present a parallel NMF algorithm designed for multicore machines. To demonstrate the importance of minimizing communication, we consider this approach to parallelizing an alternating NMF algorithm in distributed memory. While this naive algorithm exploits the natural parallelism available within the alternating iterations (the fact that rows of W and columns of H can be computed independently), it performs more communication than necessary to set up the independent problems. We compare the performance of this algorithm with our proposed approach to demonstrate the importance of designing algorithms to minimize communication; that is, simply parallelizing the computation is not sufficient for satisfactory performance and parallel scalability.

The main contribution of this work is a new, high-performance parallel algorithm for non-negative matrix factorization. The algorithm is flexible, as it is designed for both sparse and dense input matrices and can leverage many different algorithms for determining the non-negative low rank factors W and H. The algorithm is also efficient, maintaining data in memory, using MPI collectives for interprocessor communication, and using efficient libraries for local computation. Furthermore, we provide a theoretical communication cost analysis to show that our algorithm reduces communication relative to the naive approach, and, in the case of dense input, that it provably minimizes communication. We show with performance experiments that the algorithm outperforms the naive approach by significant factors, and that it scales well for up to hundreds of processors on both synthetic and real-world data.

2. Preliminaries

2.1 Notation

Table 1 summarizes the notation we use throughout this paper. We use upper case letters for matrices and lower case letters for vectors. We use both subscripts and superscripts for sub-blocks of matrices. For example, A_i is the ith row block of matrix A, and A^i is the ith column block. Likewise, a_i is the ith row of A, and a^i is the ith column. We use m and n to denote the numbers of rows and columns of A, respectively, and we assume without loss of generality m ≥ n throughout.

Table 1: Notation

A        Input matrix
W        Left low rank factor
H        Right low rank factor
m        Number of rows of input matrix
n        Number of columns of input matrix
k        Low rank
M_i      ith row block of matrix M
M^i      ith column block of matrix M
M_ij     (i, j)th subblock of M
p        Number of parallel processes
p_r      Number of rows in processor grid
p_c      Number of columns in processor grid

2.2 Communication model

To analyze our algorithms, we use the α-β-γ model of distributed-memory parallel computation. In this model, interprocessor communication occurs in the form of messages sent between two processors across a bidirectional link (we assume a fully connected network). We model the cost of a message of size n words as α + nβ, where α is the per-message latency cost and β is the per-word bandwidth cost. Each processor can compute floating point operations (flops) on data that resides in its local memory; γ is the per-flop computation cost. With this communication model, we can predict the performance of an algorithm in terms of the number of flops it performs as well as the number of words and messages it communicates. For simplicity, we ignore the possibility of overlapping computation with communication in our analysis. For more details on the α-β-γ model, see [2, 25].

2.3 MPI collectives

Point-to-point messages can be organized into collective communication operations that involve more than two processors. MPI provides an interface to the most commonly used collectives like broadcast, reduce, and gather, as the algorithms for these collectives can be optimized for particular network topologies and processor characteristics. The algorithms we consider use the all-gather, reduce-scatter, and all-reduce collectives, so we review them here, along with their costs. Our analysis assumes optimal collective algorithms are used (see [2, 25]), though our implementation relies on the underlying MPI implementation.

At the start of an all-gather collective, each of p processors owns data of size n/p. After the all-gather, each processor owns a copy of the entire data of size n. The cost of an all-gather is α·log p + β·((p − 1)/p)·n. At the start of a reduce-scatter collective, each processor owns data of size n. After the reduce-scatter, each processor owns a subset of the sum over all data, which is of size n/p. (Note that the reduction can be computed with other associative operators besides addition.) The cost of a reduce-scatter is α·log p + (β + γ)·((p − 1)/p)·n. At the start of an all-reduce collective, each processor owns data of size n. After the all-reduce, each processor owns a copy of the sum over all data, which is also of size n. The cost of an all-reduce is 2α·log p + (2β + γ)·((p − 1)/p)·n. Note that the costs of each of the collectives are zero when p = 1.
3. Related Work

In the data mining and machine learning literature there is an overlap between low rank approximations and matrix factorizations due to the nature of applications. Despite its name, non-negative matrix "factorization" is really a low rank approximation.

The recent distributed NMF algorithms in the literature are [6, 8, 18, 19, 29]. Liu et al. propose running Multiplicative Update (MU) for KL divergence, squared loss, and "exponential" loss functions [19]. Matrix multiplication, element-wise multiplication, and element-wise division are the building blocks of the MU algorithm. The authors discuss performing these matrix operations effectively in Hadoop for sparse matrices. Using similar approaches, Liao et al. implement an open source Hadoop-based MU algorithm and study its scalability on large-scale biological data sets [18]. Also, Yin, Gao, and Zhang present a scalable NMF that can perform frequent updates, which aim to use the most recently updated data [29]. Gemulla et al. propose a generic algorithm that works on different loss functions, often involving the distributed computation of the gradient [8]. According to the authors, the formulation presented in the paper can also be extended to handle non-negativity constraints. Similarly, Faloutsos et al. propose a distributed, scalable method for decomposing matrices, tensors, and coupled data sets through stochastic gradient descent on a variety of objective functions [6]. The authors also provide an implementation that can enforce non-negativity constraints on the factor matrices.

We note that Spark [30] is a popular big-data processing infrastructure that is generally more efficient for iterative algorithms such as NMF than Hadoop, as it maintains data in memory and avoids file system I/O. Although Spark has collaborative filtering libraries such as MLlib [21], which use matrix factorization and can impose non-negativity constraints, none of them implement pure NMF, and so we do not have a direct comparison against NMF running on Spark. The problem of collaborative filtering is different from NMF because zero entries are treated as missing values rather than zeros, and therefore different computations are performed at each iteration.

Apart from distributed NMF algorithms using Hadoop, there are also implementations of the MU algorithm in a distributed memory setting using X10 [9] and on a GPU [20].

4. Foundations

In this section, we introduce the Alternating-Updating NMF (AU-NMF) framework, multiple methods for solving NMF, and details on ANLS-BPP (Alternating Non-negative Least Squares with Block Principal Pivoting). We also present a straightforward approach to parallelization of the framework.

4.1 Alternating-Updating NMF Algorithms

NMF algorithms take a non-negative input matrix A ∈ R^{m×n}_+ and a low rank k and determine two non-negative low rank factors W ∈ R^{m×k}_+ and H ∈ R^{k×n}_+ such that A ≈ WH. We define Alternating-Updating NMF algorithms as those that alternate between updating W for a given H and updating H for a given W. In the context of our parallel framework, we restrict attention to the class of NMF algorithms that use the Gram matrix associated with a factor matrix and the product of the input data matrix A with the corresponding factor matrix, as we show in Algorithm 1.

Algorithm 1: [W, H] = AU-NMF(A, k)
Require: A is an m × n matrix, k is the rank of the approximation
1: Initialize H with a non-negative matrix in R^{n×k}_+
2: while stopping criteria not satisfied do
3:   Update W using HH^T and AH^T
4:   Update H using W^T W and W^T A
5: end while

The specifics of lines 3 and 4 depend on the NMF algorithm. In the block coordinate descent framework where the two blocks are the unknown factors W and H, we solve the following subproblems, which have a unique solution for full rank H and W:

    W ← argmin_{W̃ ≥ 0} ‖A − W̃H‖_F ,
    H ← argmin_{H̃ ≥ 0} ‖A − WH̃‖_F .                              (2)

Since each subproblem involves nonnegative least squares, this two-block BCD method is also called the Alternating Non-negative Least Squares (ANLS) method [15]. Block Principal Pivoting (BPP), discussed in more detail in Section 4.2, is an algorithm that solves these NLS subproblems. In the context of the AU-NMF algorithm, this ANLS method maximally reduces the overall NMF objective function value by finding the optimal solution for given H and W in lines 3 and 4, respectively.

There are other popular NMF algorithms that update the factor matrices alternately without maximally reducing the objective function value each time, in the same sense as in ANLS. These updates do not necessarily solve each of the subproblems (2) to optimality but simply improve the overall objective function (1). Such methods include Multiplicative Update (MU) [24] and Hierarchical Alternating Least Squares (HALS) [3], which was also independently proposed as Rank-one Residual Iteration (RRI) [10]. To show how these methods can fit into the AU-NMF framework, we discuss them in more detail.

In the case of HALS/RRI, individual columns of W and rows of H are updated with all other entries in the factor matrices fixed. This approach is a block coordinate descent method with 2k blocks, set to minimize the function

    f(w_1, ..., w_k, h_1, ..., h_k) = ‖A − Σ_{i=1}^k w_i h_i‖_F ,   (3)

where w_i is the ith column of W and h_i is the ith row of H. The update rules can be written in closed form:

    w_i ← [ w_i + ((AH^T)_i − W(HH^T)_i) / (HH^T)_{ii} ]_+ ,
    h_i ← [ h_i + ((W^T A)_i − (W^T W)_i H) / (W^T W)_{ii} ]_+ .    (4)

Note that the columns of W and rows of H are updated in order, so that the most up-to-date values are always used, and these 2k updates can be done in an arbitrary order. However, if all the W updates are done before H (or vice-versa), the method falls into the AU-NMF framework. After computing the matrices HH^T, AH^T, W^T W, and W^T A, the extra computation is 2(m + n)k^2 flops for updating both W and H.

In the case of MU, individual entries of W and H are updated with all other entries fixed. In this case, the update rules are

    w_ij ← w_ij · (AH^T)_ij / (WHH^T)_ij ,
    h_ij ← h_ij · (W^T A)_ij / (W^T WH)_ij .                        (5)
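As a concrete illustration of how MU fits the AU-NMF template, the sketch below applies update (5) with Armadillo, using only the four products HH^T, AH^T, W^T W, and W^T A required by Algorithm 1. This is our own illustrative snippet, assuming dense factors; the small constant added to the denominator (a common safeguard against division by zero) is our addition and not part of the update as stated above.

```cpp
// Sketch (ours, not the paper's code): one MU sweep in AU-NMF form, given the
// Gram matrices and data-factor products used in Algorithm 1.
#include <armadillo>

void mu_update(const arma::mat& A, arma::mat& W, arma::mat& H) {
  const double eps = 1e-16;            // guard against division by zero (our choice)

  // Update all of W with H fixed: W_ij <- W_ij * (AH^T)_ij / (W HH^T)_ij.
  arma::mat HHt = H * H.t();           // k x k
  arma::mat AHt = A * H.t();           // m x k
  W %= AHt / (W * HHt + eps);          // element-wise multiply and divide

  // Then update all of H with W fixed: H_ij <- H_ij * (W^T A)_ij / (W^T W H)_ij.
  arma::mat WtW = W.t() * W;           // k x k
  arma::mat WtA = W.t() * A;           // k x n
  H %= WtA / (WtW * H + eps);
}
```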
Instead of performing these (m + n)k updates in an arbitrary order, if all of W is updated before H (or vice-versa), this method also follows the AU-NMF framework. The extra cost of computing W(HH^T) and (W^T W)H is 2(m + n)k^2 flops to perform updates for all entries of W and H.

The convergence properties of these different algorithms are discussed in detail by Kim, He, and Park [15]. We emphasize here that both HALS/RRI and MU require computing Gram matrices and matrix products of the input matrix and each factor matrix. Therefore, if the update ordering follows the convention of updating all of W followed by all of H, both methods fit into the AU-NMF framework. Our proposed parallel algorithm (presented in Section 5) can be extended to these methods (or any other AU-NMF method) with only a change in local computation.

4.2 Block Principal Pivoting

In this paper, we focus on and use the BPP method [14] to solve the NLS problem, as it is the fastest algorithm (in terms of number of iterations). As argued in Section 4.1, we note that many NMF algorithms, including MU and HALS, can be used within our parallel frameworks (Algorithms 2 and 3).

BPP is the state-of-the-art method for solving the NLS subproblems in Eq. (2). The main subroutine of BPP is the single right-hand side NLS problem

    min_{x ≥ 0} ‖Cx − b‖_2 .                                        (6)

The Karush-Kuhn-Tucker (KKT) optimality conditions for Eq. (6) are as follows:

    y = C^T Cx − C^T b ,                                            (7a)
    y ≥ 0 ,                                                         (7b)
    x ≥ 0 ,                                                         (7c)
    x_i y_i = 0  ∀i.                                                (7d)

The KKT conditions (7) state that at optimality, the support sets (i.e., the non-zero elements) of x and y are complementary to each other. Therefore, Eq. (7) is an instance of the Linear Complementarity Problem (LCP), which arises frequently in quadratic programming. When k ≪ min(m, n), active-set and active-set-like methods are very suitable because most computations involve matrices of sizes m × k, n × k, and k × k, which are small and easy to handle.

If we knew which indices correspond to nonzero values in the optimal solution, then computing the solution is an unconstrained least squares problem on these indices. In the optimal solution, call the set of indices i such that x_i = 0 the active set, and let the remaining indices be the passive set. The BPP algorithm works to find this final active set and passive set. It greedily swaps indices between the intermediate active and passive sets until finding a partition that satisfies the KKT conditions. In the partition of the optimal solution, the values of the indices that belong to the active set are zero. The values of the indices that belong to the passive set are determined by solving the unconstrained least squares problem restricted to the passive set. Kim and Park [14] discuss the BPP algorithm in further detail. We use the notation

    X ← SolveBPP(C^T C, C^T B)

to define the (local) function for using BPP to solve Eq. (6) for every column of X. We define C_BPP(k, c) as the cost of SolveBPP, given the k × k matrix C^T C and the k × c matrix C^T B. SolveBPP mainly involves solving least squares problems over the intermediate passive sets. Our implementation uses the normal equations to solve the unconstrained least squares problems because the normal equations matrices have been pre-computed in order to check the KKT conditions. However, more numerically stable methods such as QR decomposition can also be used.

4.3 Naive Parallel NMF Algorithm

In this section we present a naive parallelization of NMF algorithms [5]. Each NLS problem with multiple right-hand sides can be parallelized based on the observation that the problems for multiple right-hand sides are independent from each other. That is, we can solve several instances of Eq. (6) independently for different b where C is fixed, which implies that we can optimize row blocks of W and column blocks of H in parallel.

Algorithm 2: [W, H] = Naive-Parallel-NMF(A, k)
Require: A is an m × n matrix distributed both row-wise and column-wise across p processors, k is the rank of the approximation
Require: Local matrices: A_i is m/p × n, A^i is m × n/p, W_i is m/p × k, H^i is k × n/p
1: p_i initializes H^i
2: while stopping criteria not satisfied do
     /* Compute W given H */
3:   collect H on each processor using all-gather
4:   p_i computes W_i ← SolveBPP(HH^T, A_i H^T)
     /* Compute H given W */
5:   collect W on each processor using all-gather
6:   p_i computes (H^i)^T ← SolveBPP(W^T W, (W^T A^i)^T)
7: end while
Ensure: W, H ≈ argmin_{W̃≥0, H̃≥0} ‖A − W̃H̃‖_F
Ensure: W is an m × k matrix distributed row-wise across processors, H is a k × n matrix distributed column-wise across processors

Algorithm 2 presents a straightforward approach to setting up the independent subproblems. Let us divide W into row blocks W_1, ..., W_p and H into column blocks H^1, ..., H^p. We then double-partition the data matrix A accordingly into row blocks A_1, ..., A_p and column blocks A^1, ..., A^p so that processor i owns both A_i and A^i (see Figure 1). With these partitions of the data and the variables, one can implement any ANLS algorithm in parallel, with only one communication step for each solve.

The computation cost of Algorithm 2 depends on the local NLS algorithm. For comparison with our proposed algorithm, we assume each processor uses BPP to solve the local NLS problems. Thus, the computation at line 4 consists of computing A_i H^T, HH^T, and solving NLS given the normal equations formulation of rank k for m/p columns. Likewise, the computation at line 6 consists of computing W^T A^i, W^T W, and solving NLS for n/p columns. In the dense case, this amounts to 4mnk/p + (m + n)k^2 + C_BPP((m + n)/p, k) flops. In the sparse case, processor i performs 2(nnz(A_i) + nnz(A^i))k flops to compute A_i H^T and W^T A^i instead of 4mnk/p.

The communication cost of the all-gathers at lines 3 and 5, based on the expressions given in Section 2.3, is α · 2 log p + β · (m + n)k. The local memory requirement includes storing each processor's part of the matrices A, W, and H. In the case of dense A, this is 2mn/p + (m + n)k/p words, as A is stored twice; in the sparse case, processor i requires nnz(A_i) + nnz(A^i) words for the input matrix and (m + n)k/p words for the output factor matrices. Local memory is also required for storing the temporary matrices W and H of size (m + n)k words.

We summarize the algorithmic costs of Algorithm 2 in Table 2. This naive algorithm [5] has three main drawbacks: (1) it requires storing two copies of the data matrix (one in row distribution and one in column distribution) and both full factor matrices locally, (2) it does not parallelize the computation of HH^T and W^T W (each processor computes them redundantly), and (3) as we will see in Section 5, it communicates more data than necessary.
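The following sketch shows how one iteration of Algorithm 2 might be organized with MPI and Armadillo: each process all-gathers the full factor, forms its local normal-equations inputs, and then solves its independent NLS block. The helper solve_bpp here only stands in for SolveBPP (we substitute an unconstrained normal-equations solve followed by a crude projection, which is not BPP), so treat this as a structural sketch under our assumptions rather than the authors' implementation.

```cpp
// Structural sketch (ours) of one Naive-Parallel-NMF iteration (Algorithm 2).
#include <mpi.h>
#include <armadillo>

// Placeholder for SolveBPP: solves the normal equations and projects onto the
// nonnegative orthant. This is NOT the actual block principal pivoting method.
static arma::mat solve_bpp(const arma::mat& gram, const arma::mat& rhs) {
  return arma::clamp(arma::solve(gram, rhs), 0.0, arma::datum::inf);
}

// Ai: local m/p x n row block, Acol: local m x n/p column block,
// Wi: m/p x k, Hi: k x n/p, all owned by this process.
void naive_iteration(const arma::mat& Ai, const arma::mat& Acol,
                     arma::mat& Wi, arma::mat& Hi, int p, MPI_Comm comm) {
  const int k = static_cast<int>(Wi.n_cols);
  const int n_local = static_cast<int>(Hi.n_cols);
  const int m_local = static_cast<int>(Wi.n_rows);

  // --- Compute W given H: all-gather the k x n/p column blocks of H.
  arma::mat H(k, n_local * p);
  MPI_Allgather(Hi.memptr(), k * n_local, MPI_DOUBLE,
                H.memptr(), k * n_local, MPI_DOUBLE, comm);
  arma::mat HHt = H * H.t();                 // computed redundantly on every process
  Wi = solve_bpp(HHt, H * Ai.t()).t();       // solve (HH^T) X = H A_i^T, then W_i = X^T

  // --- Compute H given W: all-gather the m/p x k row blocks of W.
  // We gather W^T so that each contributed block is contiguous in Armadillo's
  // column-major storage.
  arma::mat Wit = Wi.t();                    // k x m/p
  arma::mat Wt(k, m_local * p);
  MPI_Allgather(Wit.memptr(), k * m_local, MPI_DOUBLE,
                Wt.memptr(), k * m_local, MPI_DOUBLE, comm);
  arma::mat W = Wt.t();                      // m x k
  arma::mat WtW = W.t() * W;
  Hi = solve_bpp(WtW, W.t() * Acol);         // solve (W^T W) H^i = W^T A^i
}
```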
Algorithm            | Flops                                       | Words                      | Messages  | Memory
Naive-Parallel-NMF   | 4mnk/p + (m+n)k^2 + C_BPP((m+n)/p, k)       | O((m + n)k)                | O(log p)  | O(mn/p + (m + n)k)
HPC-NMF (m/p > n)    | 4mnk/p + (m+n)k^2/p + C_BPP((m+n)/p, k)     | O(nk)                      | O(log p)  | O(mn/p + mk/p + nk)
HPC-NMF (m/p < n)    | 4mnk/p + (m+n)k^2/p + C_BPP((m+n)/p, k)     | O(√(mnk^2/p))              | O(log p)  | O(mn/p + √(mnk^2/p))
Lower Bound          | —                                           | Ω(min{√(mnk^2/p), nk})     | Ω(log p)  | mn/p + (m + n)k/p

Table 2: Leading order algorithmic costs for Naive-Parallel-NMF and HPC-NMF (per iteration). Note that the computation and memory costs assume the data matrix A is dense, but the communication costs (words and messages) apply to both dense and sparse cases.
[Figure 1: Distribution of matrices for Naive-Parallel-NMF (Algorithm 2), for p = 3. Note that A_i is m/p × n, A^i is m × n/p, W_i is m/p × k, and H^i is k × n/p.]

5. High Performance Parallel NMF

We present our proposed algorithm, HPC-NMF, as Algorithm 3. The main ideas of the algorithm are to (1) exploit the independence of NLS problems for rows of W and columns of H and (2) use communication-optimal matrix multiplication algorithms to set up the NLS problems. The naive approach (Algorithm 2) shares the first property, by parallelizing over rows of W and columns of H, but it uses parallel matrix multiplication algorithms that communicate more data than necessary. The central intuition for communication-efficient parallel algorithms for computing HH^T, AH^T, W^T W, and W^T A comes from a classification proposed by Demmel et al. [4]. They consider three cases, depending on the relative sizes of the dimensions of the matrices and the number of processors; the four multiplies for NMF fall into either the "one large dimension" or "two large dimensions" cases. HPC-NMF uses a careful data distribution in order to use a communication-optimal algorithm for each of the matrix multiplications, while at the same time exploiting the parallelism in the NLS problems.

The algorithm uses a 2D distribution of the data matrix A across a p_r × p_c grid of processors (with p = p_r p_c), as shown in Figure 2. Algorithm 3 performs an alternating method in parallel with a per-iteration bandwidth cost of O(min{√(mnk^2/p), nk}) words, latency cost of O(log p) messages, and load-balanced computation (up to the sparsity pattern of A and convergence rates of local BPP computations).

To minimize the communication cost and local memory requirements, in the typical case p_r and p_c are chosen so that m/p_r ≈ n/p_c ≈ √(mn/p), in which case the bandwidth cost is O(√(mnk^2/p)). If the matrix is very tall and skinny, i.e., m/p > n, then we choose p_r = p and p_c = 1. In this case, the distribution of the data matrix is 1D, and the bandwidth cost is O(nk) words.

The matrix distributions for Algorithm 3 are given in Figure 2; we use a 2D distribution of A and 1D distributions of W and H. Recall from Table 1 that M_i and M^i denote row and column blocks of M, respectively. Thus, the notation (W_i)_j denotes the jth row block within the ith row block of W. Lines 3–8 compute W for a fixed H, and lines 9–14 compute H for a fixed W; note that the computations and communication patterns for the two alternating iterations are analogous.

In the rest of this section, we derive the per-iteration computation and communication costs, as well as the local memory requirements. We also argue the communication-optimality of the algorithm in the dense case. Table 2 summarizes the results of this section and compares them to Naive-Parallel-NMF.

Computation Cost  Local matrix computations occur at lines 3, 6, 9, and 12. In the case that A is dense, each processor performs

    (n/p)k^2 + 2(m/p_r)(n/p_c)k + (m/p)k^2 + 2(m/p_r)(n/p_c)k = 4mnk/p + (m + n)k^2/p

flops. In the case that A is sparse, processor (i, j) performs (m + n)k^2/p flops in computing U_ij and X_ij, and 4 nnz(A_ij)k flops in computing V_ij and Y_ij. Local non-negative least squares problems occur at lines 8 and 14. In each case, the symmetric positive semi-definite matrix is k × k and the numbers of columns/rows of length k to be computed are m/p and n/p, respectively. These costs together require C_BPP(k, (m + n)/p) flops. There are computation costs associated with the all-reduce and reduce-scatter collectives, but those contribute only to lower order terms.
Algorithm 3: [W, H] = HPC-NMF(A, k)
Require: A is an m × n matrix distributed across a p_r × p_c grid of processors, k is the rank of the approximation
Require: Local matrices: A_ij is m/p_r × n/p_c, W_i is m/p_r × k, (W_i)_j is m/p × k, H^j is k × n/p_c, and (H^j)_i is k × n/p
1: p_ij initializes (H^j)_i
2: while stopping criteria not satisfied do
     /* Compute W given H */
3:   p_ij computes U_ij = (H^j)_i (H^j)_i^T
4:   compute HH^T = Σ_{i,j} U_ij using all-reduce across all procs
       ▷ HH^T is k × k and symmetric
5:   p_ij collects H^j using all-gather across proc columns
6:   p_ij computes V_ij = A_ij (H^j)^T
       ▷ V_ij is m/p_r × k
7:   compute (AH^T)_i = Σ_j V_ij using reduce-scatter across proc rows to achieve a row-wise distribution of (AH^T)_i
       ▷ p_ij owns the m/p × k submatrix ((AH^T)_i)_j
8:   p_ij computes (W_i)_j ← SolveBPP(HH^T, ((AH^T)_i)_j)
     /* Compute H given W */
9:   p_ij computes X_ij = (W_i)_j^T (W_i)_j
10:  compute W^T W = Σ_{i,j} X_ij using all-reduce across all procs
       ▷ W^T W is k × k and symmetric
11:  p_ij collects W_i using all-gather across proc rows
12:  p_ij computes Y_ij = W_i^T A_ij
       ▷ Y_ij is k × n/p_c
13:  compute (W^T A)^j = Σ_i Y_ij using reduce-scatter across proc columns to achieve a column-wise distribution of (W^T A)^j
       ▷ p_ij owns the k × n/p submatrix ((W^T A)^j)_i
14:  p_ij computes ((H^j)_i)^T ← SolveBPP(W^T W, (((W^T A)^j)_i)^T)
15: end while
Ensure: W, H ≈ argmin_{W̃≥0, H̃≥0} ‖A − W̃H̃‖_F
Ensure: W is an m × k matrix distributed row-wise across processors, H is a k × n matrix distributed column-wise across processors

[Figure 2: Distribution of matrices for HPC-NMF (Algorithm 3), for p_r = 3 and p_c = 2. Note that A_ij is m/p_r × n/p_c, W_i is m/p_r × k, (W_i)_j is m/p × k, H^j is k × n/p_c, and (H^j)_i is k × n/p.]
Communication Cost  Communication occurs during six collective operations (lines 4, 5, 7, 10, 11, and 13). We use the cost expressions presented in Section 2.3 for these collectives. The communication cost of the all-reduces (lines 4 and 10) is α · 4 log p + β · 2k^2; the cost of the two all-gathers (lines 5 and 11) is α · log p + β · ((p_r − 1)nk/p + (p_c − 1)mk/p); and the cost of the two reduce-scatters (lines 7 and 13) is α · log p + β · ((p_c − 1)mk/p + (p_r − 1)nk/p).

In the case that m/p < n, we choose p_r = √(mp/n) > 1 and p_c = √(np/m) > 1, and these communication costs simplify to α·O(log p) + β·O(mk/p_r + nk/p_c + k^2) = α·O(log p) + β·O(√(mnk^2/p) + k^2). In the case that m/p > n, we choose p_c = 1, and the costs simplify to α · O(log p) + β · O(nk).

Memory Requirements  The local memory requirement includes storing each processor's part of the matrices A, W, and H. In the case of dense A, this is mn/p + (m + n)k/p words; in the sparse case, processor (i, j) requires nnz(A_ij) words for the input matrix and (m + n)k/p words for the output factor matrices. Local memory is also required for storing the temporary matrices H^j, W_i, V_ij, and Y_ij, of size 2mk/p_r + 2nk/p_c words.

In the dense case, assuming k < n/p_c and k < m/p_r, the local memory requirement is no more than a constant times the size of the original data. For the optimal choices of p_r and p_c, this assumption simplifies to k < max{√(mn/p), m/p}.

We note that if the temporary memory requirements become prohibitive, the computation of ((AH^T)_i)_j and ((W^T A)^j)_i via all-gathers and reduce-scatters can be blocked, decreasing the local memory requirements at the expense of greater latency costs. While this case is plausible for sparse A, we did not encounter local memory issues in our experiments.

Communication Optimality  In the case that A is dense, Algorithm 3 provably minimizes communication costs. Theorem 5.1 establishes the bandwidth cost lower bound for any algorithm that computes W^T A or AH^T each iteration. A latency lower bound of Ω(log p) exists in our communication model for any algorithm that aggregates global information [2], and for NMF, this global aggregation is necessary in each iteration. Based on the costs derived above, HPC-NMF is communication optimal under the assumption k < √(mn/p), matching these lower bounds to within constant factors.

Theorem 5.1 ([4]). Let A ∈ R^{m×n}, W ∈ R^{m×k}, and H ∈ R^{k×n} be dense matrices, with k < n ≤ m. If k < √(mn/p), then any distributed-memory parallel algorithm on p processors that load balances the matrix distributions and computes W^T A and/or AH^T must communicate at least Ω(min{√(mnk^2/p), nk}) words along its critical path.

Proof. The proof follows directly from [4, Section II.B]. Each matrix multiplication W^T A and AH^T has dimensions k < n ≤ m, so the assumption k < √(mn/p) ensures that neither multiplication has "3 large dimensions." Thus, the communication lower bound is either Ω(√(mnk^2/p)) in the case of p > m/n (or "2 large dimensions"), or Ω(nk) in the case of p < m/n (or "1 large dimension"). If p < m/n, then nk < √(mnk^2/p), so the lower bound can be written as Ω(min{√(mnk^2/p), nk}).
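As a sanity check that the chosen grid attains this bound, one can minimize the dominant bandwidth term of Algorithm 3 over the grid shape directly; the short derivation below is our own and uses only the AM-GM inequality and the constraint p_r p_c = p.

```latex
\[
\frac{mk}{p_r} + \frac{nk}{p_c}
\;\ge\; 2\sqrt{\frac{mk}{p_r}\cdot\frac{nk}{p_c}}
\;=\; 2\sqrt{\frac{mnk^2}{p_r p_c}}
\;=\; 2\sqrt{\frac{mnk^2}{p}},
\]
% with equality exactly when mk/p_r = nk/p_c, i.e., p_r = \sqrt{mp/n} and
% p_c = \sqrt{np/m}; this matches the O(\sqrt{mnk^2/p}) entry in Table 2 and the
% lower bound of Theorem 5.1 up to a constant factor.
```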
We note that the communication costs of Algorithm 3 are the same for dense and sparse data matrices (the data matrix itself is never communicated). In the case that A is sparse, this communication lower bound does not necessarily apply, as the required data movement depends on the sparsity pattern of A. Thus, we cannot make claims of optimality in the sparse case (for general A). The communication lower bounds for W^T A and/or AH^T (where A is sparse) can be expressed in terms of hypergraphs that encode the sparsity structure of A [1]. Indeed, hypergraph partitioners have been used to reduce communication and achieve load balance for a similar problem: computing a low-rank representation of a sparse tensor (without non-negativity constraints on the factors) [12].

6. Experiments

In this section, we describe our implementation of HPC-NMF and evaluate its performance. We identify a few synthetic and real world data sets to experiment with HPC-NMF as well as Naive-Parallel-NMF, comparing performance and exploring scaling behavior.

In the data mining and machine learning community, there has been a large interest in using Hadoop for large scale implementations. Hadoop requires disk I/O and is designed for processing gigantic text files. Many of the real world data sets that are available for research are large scale sparse internet text data, recommender systems, social networks, etc. Towards this end, there has been interest in Hadoop implementations of matrix factorization algorithms [8, 18, 19]. However, the use of NMF extends beyond sparse internet data and is also applicable to dense real world data such as video, images, etc. Hence, in order to keep our implementation applicable to a wider audience, we choose to use MPI for our distributed implementation. Apart from the application point of view, we use an MPI/C++ implementation for other technical advantages that are necessary for scientific applications, such as (1) the ability to leverage recent hardware improvements, (2) effective communication between nodes, and (3) the availability of numerically stable and efficient BLAS and LAPACK routines.

6.1 Experimental Setup

6.1.1 Data Sets

We used sparse and dense matrices that are either synthetically generated or from real world applications. We explain the data sets in this section.

• Dense Synthetic Matrix (DSyn): We generate a uniform random matrix of size 172,800 × 115,200 and add random Gaussian noise. The dimensions of this matrix are chosen such that it is uniformly distributable across processes. Every process has its own prime seed, different from other processes, to generate its part of the input random matrix A.

• Sparse Synthetic Matrix (SSyn): We generate a random sparse Erdős-Rényi matrix of dimension 172,800 × 115,200 with density 0.001. That is, every entry is nonzero with probability 0.001.

• Dense Real World Matrix (Video): NMF can be performed on video data for background subtraction to detect moving objects. The low rank matrix Â = WH represents the background, and the error matrix A − Â represents the moving objects. Detecting moving objects has many real-world applications such as traffic estimation [7], security monitoring, etc. In the case of detecting moving objects, only the last minute or two of video is taken from the live video camera. An algorithm to incrementally adjust the NMF based on new streaming video is presented in [15]. To simulate this scenario, we collected a video at a busy intersection of the Georgia Tech campus at 20 frames per second for two minutes. We then reshaped the matrix such that every RGB frame is a column of our matrix, so that the matrix is dense with dimensions 1,013,400 × 2400.

• Sparse Real World Matrix (Webbase): This data set is a very large, directed sparse graph with nearly 1 million nodes (1,000,005) and 3.1 million edges (3,105,536), which was first reported by Williams et al. [27]. The NMF output of this directed graph helps us understand clusters in graphs.

The sizes of both real world data sets were adjusted to the nearest dimensions for uniformly distributing the matrix.

6.1.2 Machine

We conducted our experiments on "Edison" at the National Energy Research Scientific Computing Center. Edison is a Cray XC30 supercomputer with a total of 5,576 compute nodes, where each node has dual-socket 12-core Intel Ivy Bridge processors. Each of the 24 cores has a clock rate of 2.4 GHz (translating to a peak floating point rate of 460.8 Gflops/node) and private 64 KB L1 and 256 KB L2 caches; each of the two sockets has a (shared) 30 MB L3 cache; each node has 64 GB of memory. Edison uses a Cray "Aries" interconnect that has a dragonfly topology. Because our experiments use a relatively small number of nodes, we consider the local connectivity: a "blade" comprises 4 nodes and a router, and sets of 16 blades' routers are fully connected via a circuit board backplane (within a "chassis"). Our experiments do not exceed 64 nodes, so we can assume a very efficient, fully connected network.

6.1.3 Software

Our objective in the implementation is to use open source software as much as possible to promote reproducibility and reuse of our code. The entire C++ code was developed using the matrix library Armadillo [23]. In Armadillo, the elements of a dense matrix are stored in column major order, and sparse matrices are stored in Compressed Sparse Column (CSC) format. For dense BLAS and LAPACK operations, we linked Armadillo with OpenBLAS [28]. We use Armadillo's own implementation of sparse matrix-dense matrix multiplication, and the default GNU C++ compiler and MPI library on Edison.

6.1.4 Initialization and Stopping Criteria

To ensure fair comparison among algorithms, the same random seed was used across different methods appropriately. That is, the initial random matrix H was generated with the same random seed when testing with different algorithms (note that W need not be initialized). This ensures that all the algorithms perform the same computations (up to roundoff errors), though the only computation with a running time that is sensitive to matrix values is the local NLS solve via BPP.

In this paper, we used the number of iterations as the stopping criterion for all the algorithms. For fair comparison, all the algorithms in the paper were executed for 10 iterations.

6.2 Algorithms

For each of our data sets, we benchmark and compare three algorithms: (1) Algorithm 2, (2) Algorithm 3 with p_r = p and p_c = 1 (1D processor grid), and (3) Algorithm 3 with p_r ≈ p_c ≈ √p (2D processor grid). We choose these three algorithms to confirm the following conclusions from the analysis of Section 5: the performance of the naive parallelization (Naive-Parallel-NMF, Algorithm 2) will be severely limited by communication overheads, and the correct choice of processor grid within Algorithm 3 is necessary to optimize performance.
To demonstrate the latter conclusion, we choose the two extreme choices of processor grids and test some data sets where a 1D processor grid is optimal (e.g., the Video matrix) and some where a squarish 2D grid is optimal (e.g., the Webbase matrix).

While we would like to compare against other high-performance NMF algorithms in the literature, the only other distributed-memory implementations of which we're aware are implemented using Hadoop and are designed only for sparse matrices [18], [19], [8], [29], and [6]. We stress that Hadoop is not designed for high performance computing of iterative numerical algorithms, requiring disk I/O between steps, so a run time comparison between a Hadoop implementation and a C++/MPI implementation is not a fair comparison of parallel algorithms. A qualitative example of the differences in run time is that a Hadoop implementation of the MU algorithm on a large sparse matrix of dimension 2^17 × 2^16 with 2 × 10^8 nonzeros (with k = 8) takes on the order of 50 minutes per iteration [19], while our implementation takes a second per iteration for the synthetic data set (which is an order of magnitude larger in terms of rows, columns, and nonzeros) running on only 24 nodes.

6.3 Time Breakdown

To differentiate the computation and communication costs among the algorithms, we present the time breakdown among the various tasks within the algorithms for both performance experiments. For Algorithm 3, there are three local computation tasks and three communication tasks to compute each of the factor matrices:

• MM, computing a matrix multiplication with the local data matrix and one of the factor matrices;
• NLS, solving the set of NLS problems using BPP;
• Gram, computing the local contribution to the Gram matrix;
• All-Gather, to compute the global matrix multiplication;
• Reduce-Scatter, to compute the global matrix multiplication;
• All-Reduce, to compute the global Gram matrix.

In our results, we do not distinguish the costs of these tasks for W and H separately; we report their sum, though we note that we do not always expect balance between the two contributions for each task. Algorithm 2 performs all of these tasks except Reduce-Scatter and All-Reduce; all of its communication is in All-Gather.

6.4 Algorithmic Comparison

Our first set of experiments is designed primarily to compare the three parallel implementations. For each data set, we fix the number of processors to be 600 and vary the rank k of the desired factorization. Because most of the computation (except for NLS) and bandwidth costs are linear in k (except for the All-Reduce), we expect linear performance curves for each algorithm individually.

The left side of Figure 3 shows the results of this experiment for all four data sets. The first conclusion we draw is that HPC-NMF with a 2D processor grid performs significantly better than Naive-Parallel-NMF; the largest speedup is 4.4×, for the sparse synthetic data and k = 10 (a particularly communication bound problem). Also, as predicted, the 2D processor grid outperforms the 1D processor grid on the squarish matrices. While we expect the 1D processor grid to outperform the 2D grid for the tall-and-skinny Video matrix, their performances are comparable; this is because both algorithms are computation bound, as we see from Figure 3g, so the difference in communication is negligible.

The second conclusion we can draw is that the scaling with k tends to be close to linear, with an exception in the case of the Webbase matrix. We see from Figure 3e that this problem spends much of its time in NLS, which does not scale linearly with k. Note that for a fixed problem, the size of the local NLS problem remains the same across algorithms. Thus, we expect similar timing results and observe that to be true for most cases.

We can also compare HPC-NMF with a 1D processor grid against Naive-Parallel-NMF for squarish matrices (SSyn, DSyn, and Webbase). Our analysis does not predict a significant difference in the communication costs of these two approaches (when m ≈ n), and we see from the data that Naive-Parallel-NMF outperforms HPC-NMF for two of the three matrices (but the opposite is true for DSyn). The main differences appear in the All-Gather versus Reduce-Scatter communication costs and the local MM (all of which are involved in the W^T A computation). In all three cases, our proposed 2D processor grid (with the optimal choice of m/p_r ≈ n/p_c) performs better than both alternatives.

6.5 Strong Scalability

The goal of our second set of experiments is to demonstrate the strong scalability of each of the algorithms. For each data set, we fix the rank k to be 50 and vary the number of processors (this is a strong-scaling experiment because the size of the data set is fixed). We run our experiments on {24, 96, 216, 384, 600} processors/cores, which translates to {1, 4, 9, 16, 25} nodes. The dense matrices are too large for 1 or 4 nodes, so we benchmark only on {216, 384, 600} cores in those cases.

The right side of Figure 3 shows the scaling results for all four data sets, and Table 3 gives the overall per-iteration time for each algorithm, number of processors, and data set. We first consider the HPC-NMF algorithm with a 2D processor grid: comparing the performance results on 25 nodes (600 cores) to 1 node (24 cores), we see nearly perfect parallel speedups. The parallel speedups are 23× for SSyn and 28× for the Webbase matrix. We believe the superlinear speedup of the Webbase matrix is a result of the running time being dominated by NLS; with more processors, the local NLS problem is smaller and more likely to fit in smaller levels of cache, providing better performance. For the dense matrices, the speedup of HPC-NMF on 25 nodes over 9 nodes is 2.7× for DSyn and 2.8× for Video, which are also nearly linear.

In the case of the Naive-Parallel-NMF algorithm, we do see parallel speedups, but they are not linear. For the sparse data, we see parallel speedups of 10× and 11× with a 25× increase in the number of processors. For the dense data, we observe speedups of 1.6× and 1.8× with a 2.8× increase in the number of processors. The main reason for not achieving perfect scaling is the unnecessary communication overhead.

6.6 Weak Scalability

Our third set of experiments shows the weak scalability of each of the algorithms. We consider only the synthetic data sets so that we can flexibly scale the dimensions of the data matrix. Again, we run our experiments on {24, 96, 216, 384, 600} processors/cores. We fix the input data size per processor in this scaling experiment: mn/p is the same across all experiments (dense and sparse). The data matrix dimensions range from 57600 × 38400 on 24 cores up to 288000 × 192000 on 600 cores; for HPC-NMF with a 2D processor grid, the local matrix is always 9600 × 9600. The dimensions are chosen so that this experiment matches the strong-scaling experiment on 216 processors. Figure 4 shows our results.

We emphasize that while this experiment fixes the amount of input matrix data per processor, it does not fix the amount of factor matrix data per processor (which decreases as we scale up). Likewise, it fixes the number of MM flops performed by each processor but not the number of NLS flops; the latter also decreases as we scale up. Thus, if communication were free, we would expect the overall time to decrease as we scale to more processors, at a rate that depends on the relative time spent in MM and NLS.
[Figure 3: Comparison and strong-scaling experiments on sparse and dense data sets for three algorithms: Naive (N), HPC-NMF-1D (1D), HPC-NMF-2D (2D). Plots on the left vary the low rank k for fixed p = 600, and plots on the right vary the number of processes (cores) p for fixed k = 50. The reported time is the average over 10 iterations. Panels: (a) Sparse Synthetic (SSyn) Comparison, (b) Sparse Synthetic (SSyn) Strong Scaling, (c) Dense Synthetic (DSyn) Comparison, (d) Dense Synthetic (DSyn) Strong Scaling, (e) Webbase Comparison, (f) Webbase Strong Scaling, (g) Video Comparison, (h) Video Strong Scaling. Each bar (in seconds) is broken down into MM, NLS, Gram, All-Gather, Reduce-Scatter, and All-Reduce time.]
[Figure 4: Weak-scaling experiments on dense (left) and sparse (right) synthetic data sets for three algorithms: Naive (N), HPC-NMF-1D (1D), HPC-NMF-2D (2D). The ratio mn/p is fixed across all data points (dense and sparse), with data matrix dimensions ranging from 57600 × 38400 on 24 cores up to 288000 × 192000 on 600 cores. The reported time is the average over 10 iterations. Panels: (a) Dense Synthetic (DSyn) Weak Scaling, (b) Sparse Synthetic (SSyn) Weak Scaling. Each bar (in seconds) is broken down into MM, NLS, Gram, All-Gather, Reduce-Scatter, and All-Reduce time.]
        |        Naive-Parallel-NMF        |           HPC-NMF-1D             |           HPC-NMF-2D
Cores   | DSyn    SSyn    Video   Webbase  | DSyn    SSyn    Video   Webbase  | DSyn    SSyn    Video   Webbase
24      |  --    6.5632    --     48.0256  |  --    5.0821    --     52.8549  |  --    4.8427    --     84.6286
96      |  --    1.5929    --     18.5507  |  --    1.4836    --     14.5873  |  --    1.1147    --     16.6966
216     | 2.1819  0.6027  2.7899   7.1274  | 2.1548  0.9488  4.7928   9.2730  | 1.5283  0.4816  1.6106   7.4799
384     | 1.2594  0.6466  2.2106   5.1431  | 1.2559  0.7695  3.8295   6.4740  | 0.8620  0.2661  0.8963   4.0630
600     | 1.1745  0.5592  1.7583   4.6825  | 0.9685  0.6666  0.5994   6.2751  | 0.5519  0.1683  0.5699   2.7376

Table 3: Average per-iteration running times (in seconds) of parallel NMF algorithms for k = 50. Dashes mark the dense data sets (DSyn, Video), which are too large to run on 1 or 4 nodes (24 or 96 cores).
This behavior generally holds true in Figure 4: in the dense case, since most of the time is in MM, the time generally holds steady as the number of processors increases, while in the sparse case, more time is spent in NLS and times decrease from left to right.

We also point out that we used the same matrix dimensions in the dense and sparse cases; because the communication does not depend on the input matrix sparsity, we see that the communication costs are the same (for each algorithm and number of processors). The main difference in running time comes from MM, which is much cheaper in the sparse case.

In the case of HPC-NMF-2D, the weak scaling is nearly perfect, as the time spent in communication is negligible. This is explained by the theory (see Table 2): if mn/p is fixed, then the bandwidth cost O(√(mnk^2/p)) is also fixed, so we expect HPC-NMF-2D to scale well to much larger numbers of processors. In the case of Naive-Parallel-NMF and HPC-NMF-1D, we see that communication costs increase as we scale up. Again, Table 2 shows that the bandwidth costs of those algorithms increase as we scale up, so we don't expect them to scale as well in this case. We note that this is only one form of weak scaling; for example, if we were to fix the quantity m/p, then we would expect HPC-NMF-1D to scale well (though Naive-Parallel-NMF would not). The best overall speedup we observe from this experiment is in the sparse case on 600 processors: HPC-NMF-2D is 2.7× faster than Naive-Parallel-NMF.

7. Conclusion

In this paper, we propose a high-performance distributed-memory parallel algorithm that computes an NMF by iteratively solving alternating non-negative least squares (ANLS) subproblems. We carefully designed a parallel algorithm which avoids communication overheads and scales well to modest numbers of cores.

For the data sets on which we experimented, we showed that an efficient implementation of a naive parallel algorithm spends much of its time in interprocessor communication. In the case of HPC-NMF, the problems remain computation bound on up to 600 processors, typically spending most of the time in local matrix multiplication or NLS solves.

We focus in this work on BPP, because it has been shown to reduce overall running time in the sequential case by requiring fewer iterations [14]. Because much of the time per iteration of HPC-NMF is spent on local NLS, we believe further empirical exploration is necessary to understand the proposed HPC-NMF's advantages for other AU-NMF algorithms such as MU and HALS. We note that if we use the MU or HALS approach for determining the low rank factors, the relative cost of interprocessor communication will grow, making the communication efficiency of our algorithm more important.

In future work, we would like to extend the HPC-NMF algorithm to dense and sparse tensors, computing the CANDECOMP/PARAFAC decomposition in parallel with non-negativity constraints on the factor matrices. We would also like to explore more intelligent distributions of sparse matrices: while our 2D distribution is based on evenly dividing rows and columns, it does not necessarily load balance the nonzeros of the matrix, which can lead to load imbalance in MM. We are interested in using graph and hypergraph partitioning techniques to load balance the memory and computation while at the same time reducing communication costs as much as possible. Finally, we have not yet reached the limits of the scalability of HPC-NMF; we would like to expand our benchmarks to larger numbers of nodes on the same size data sets to study performance behavior when communication costs completely dominate the running time.
Acknowledgments
This research was supported in part by an appointment to the Sandia National Laboratories Truman Fellowship in National Security Science and Engineering, sponsored by Sandia Corporation (a wholly owned subsidiary of Lockheed Martin Corporation) as Operator of Sandia National Laboratories under its U.S. Department of Energy Contract No. DE-AC04-94AL85000.

This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

Partial funding for this work was also provided by AFOSR Grant FA9550-13-1-0100, National Science Foundation (NSF) grants IIS-1348152 and ACI-1338745, and Defense Advanced Research Projects Agency (DARPA) XDATA program grant FA8750-12-2-0309. We also thank the NSF for the travel grant (CCF-1552229) that supported presenting this work at the conference. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the USDOE, NERSC, AFOSR, NSF or DARPA.

References
[1] G. Ballard, A. Druinsky, N. Knight, and O. Schwartz. Brief announcement: Hypergraph partitioning for parallel sparse matrix-matrix multiplication. In Proceedings of SPAA, pages 86–88, 2015. URL http://doi.acm.org/10.1145/2755573.2755613.
[2] E. Chan, M. Heimlich, A. Purkayastha, and R. van de Geijn. Collective communication: theory, practice, and experience. Concurrency and Computation: Practice and Experience, 19(13):1749–1783, 2007. URL http://dx.doi.org/10.1002/cpe.1206.
[3] A. Cichocki, R. Zdunek, A. H. Phan, and S.-i. Amari. Nonnegative matrix and tensor factorizations: applications to exploratory multi-way data analysis and blind source separation. Wiley, 2009.
[4] J. Demmel, D. Eliahu, A. Fox, S. Kamil, B. Lipshitz, O. Schwartz, and O. Spillinger. Communication-optimal parallel recursive rectangular matrix multiplication. In Proceedings of IPDPS, pages 261–272, 2013. URL http://dx.doi.org/10.1109/IPDPS.2013.80.
[5] J. P. Fairbanks, R. Kannan, H. Park, and D. A. Bader. Behavioral clusters in dynamic graphs. Parallel Computing, 47:38–50, 2015. URL http://dx.doi.org/10.1016/j.parco.2015.03.002.
[6] C. Faloutsos, A. Beutel, E. P. Xing, E. E. Papalexakis, A. Kumar, and P. P. Talukdar. Flexi-FaCT: Scalable flexible factorization of coupled tensors on Hadoop. In Proceedings of SDM, pages 109–117, 2014. URL http://epubs.siam.org/doi/abs/10.1137/1.9781611973440.13.
[7] R. Fujimoto, A. Guin, M. Hunter, H. Park, G. Kanitkar, R. Kannan, M. Milholen, S. Neal, and P. Pecher. A dynamic data driven application system for vehicle tracking. Procedia Computer Science, 29:1203–1215, 2014. URL http://dx.doi.org/10.1016/j.procs.2014.05.108.
[8] R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of KDD, pages 69–77. ACM, 2011. URL http://dx.doi.org/10.1145/2020408.2020426.
[9] D. Grove, J. Milthorpe, and O. Tardieu. Supporting array programming in X10. In Proceedings of the ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, ARRAY'14, pages 38:38–38:43, 2014. URL http://doi.acm.org/10.1145/2627373.2627380.
[10] N.-D. Ho, P. V. Dooren, and V. D. Blondel. Descent methods for nonnegative matrix factorization. CoRR, abs/0801.3199, 2008.
[11] P. O. Hoyer. Non-negative matrix factorization with sparseness constraints. JMLR, 5:1457–1469, 2004. URL www.jmlr.org/papers/volume5/hoyer04a/hoyer04a.pdf.
[12] O. Kaya and B. Uçar. Scalable sparse tensor decompositions in distributed memory systems. In Proceedings of SC, pages 77:1–77:11. ACM, 2015. URL http://doi.acm.org/10.1145/2807591.2807624.
[13] H. Kim and H. Park. Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics, 23(12):1495–1502, 2007. URL http://dx.doi.org/10.1093/bioinformatics/btm134.
[14] J. Kim and H. Park. Fast nonnegative matrix factorization: An active-set-like method and comparisons. SIAM Journal on Scientific Computing, 33(6):3261–3281, 2011. URL http://dx.doi.org/10.1137/110821172.
[15] J. Kim, Y. He, and H. Park. Algorithms for nonnegative matrix and tensor factorizations: A unified view based on block coordinate descent framework. Journal of Global Optimization, 58(2):285–319, 2014. URL http://dx.doi.org/10.1007/s10898-013-0035-4.
[16] D. Kuang, C. Ding, and H. Park. Symmetric nonnegative matrix factorization for graph clustering. In Proceedings of SDM, pages 106–117, 2012. URL http://epubs.siam.org/doi/pdf/10.1137/1.9781611972825.10.
[17] D. Kuang, S. Yun, and H. Park. SymNMF: nonnegative low-rank approximation of a similarity matrix for graph clustering. Journal of Global Optimization, pages 1–30, 2013. URL http://dx.doi.org/10.1007/s10898-014-0247-2.
[18] R. Liao, Y. Zhang, J. Guan, and S. Zhou. CloudNMF: A MapReduce implementation of nonnegative matrix factorization for large-scale biological datasets. Genomics, Proteomics & Bioinformatics, 12(1):48–51, 2014. URL http://dx.doi.org/10.1016/j.gpb.2013.06.001.
[19] C. Liu, H.-c. Yang, J. Fan, L.-W. He, and Y.-M. Wang. Distributed nonnegative matrix factorization for web-scale dyadic data analysis on MapReduce. In Proceedings of WWW, pages 681–690. ACM, 2010. URL http://dx.doi.org/10.1145/1772690.1772760.
[20] E. Mejía-Roa, D. Tabas-Madrid, J. Setoain, C. García, F. Tirado, and A. Pascual-Montano. NMF-mGPU: non-negative matrix factorization on multi-GPU systems. BMC Bioinformatics, 16(1):43, 2015. URL http://dx.doi.org/10.1186/s12859-015-0485-4.
[21] X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. B. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin, R. Zadeh, M. Zaharia, and A. Talwalkar. MLlib: Machine learning in Apache Spark, May 2015. URL http://arxiv.org/abs/1505.06807.
[22] V. P. Pauca, F. Shahnaz, M. W. Berry, and R. J. Plemmons. Text mining using nonnegative matrix factorizations. In Proceedings of SDM, 2004.
[23] C. Sanderson. Armadillo: An open source C++ linear algebra library for fast prototyping and computationally intensive experiments. Technical report, NICTA, 2010. URL http://arma.sourceforge.net/armadillo_nicta_2010.pdf.
[24] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. NIPS, 13:556–562, 2001.
[25] R. Thakur, R. Rabenseifner, and W. Gropp. Optimization of collective communication operations in MPICH. International Journal of High Performance Computing Applications, 19(1):49–66, 2005. URL http://hpc.sagepub.com/content/19/1/49.abstract.
[26] Y.-X. Wang and Y.-J. Zhang. Nonnegative matrix factorization: A comprehensive review. TKDE, 25(6):1336–1353, June 2013. URL http://dx.doi.org/10.1109/TKDE.2012.51.
[27] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. Parallel Computing, 35(3):178–194, 2009.
[28] Z. Xianyi. OpenBLAS, last accessed 03-Dec-2015. URL http://www.openblas.net.
[29] J. Yin, L. Gao, and Z. Zhang. Scalable nonnegative matrix factorization with block-wise updates. In Machine Learning and Knowledge Discovery in Databases, volume 8726 of LNCS, pages 337–352, 2014. URL http://dx.doi.org/10.1007/978-3-662-44845-8_22.
[30] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, HotCloud'10, pages 10–10. USENIX Association, 2010. URL http://dl.acm.org/citation.cfm?id=1863103.1863113.