
2013 IEEE 27th International Symposium on Parallel & Distributed Processing Workshops and PhD Forum

An efficient parallelization strategy for dynamic programming on GPU

Karl-Eduard Berger∗†, François Galea∗


∗ CEA, LIST
Embedded real time systems laboratory
Centre de Saclay, PC 172, 91191 Gif-sur-Yvette Cedex, France
† University of Versailles Saint-Quentin-en-Yvelines
PRiSM laboratory
55 avenue des Etats-Unis, 78035 Versailles Cedex, France
Email: karl-eduard.berger@cea.fr, francois.galea@cea.fr

978-0-7695-4979-8/13 $26.00 © 2013 IEEE. DOI 10.1109/IPDPSW.2013.208

Abstract—Optimization methods generally do not fall into the most suitable algorithms for parallelization on a GPU. However, relatively good efficiency can still be obtained if the method is properly adapted to the GPU programming model, which is the case for dynamic programming. In this article, we propose a thread-grouping parallelization strategy for dynamic programming in CUDA. We show that parametrizing the solver parallelism according to the hardware allows better performance. The strategy provides good acceleration compared to a standard GPU parallel strategy on a dynamic programming-based implementation of the knapsack problem. We show this strategy is helpful in the case of the multi-dimensional knapsack problem, where computing multi-dimensional indices is a costly operation.

Keywords—dynamic programming; parallelism; knapsack problem; GPU computing; combinatorial optimization; CUDA.

I. INTRODUCTION

Since the introduction of programmable Graphics Processing Units (GPUs) in general purpose computers, the scientific community quickly understood the interest of using their computing power for scientific computations. Even though those processors were first designed for graphics rendering, they provide a much higher computing power than the general purpose Central Processing Units (CPUs) in these computers. GPGPU (General Purpose GPU) computing is the practice of programming the GPU from a computer program in order to speed up time-consuming computations. Indeed, due to their massive internal parallelism, GPUs can provide a good acceleration for computations if they are programmed properly, and if the computations to be performed expose a significant enough level of parallelism.

Early GPGPU computing consisted in using the available graphics libraries (OpenGL, Direct3D) to access and control the GPU computations [16]. As the graphics rendering pipelines of the graphics cards became more and more programmable, the field of use for GPGPU computing broadened at the same time. On 15 February 2007, the graphics card manufacturer NVIDIA released the CUDA Software Development Kit, which allowed for the first time to specifically program GPUs to perform non-graphics computations. This move made access to GPGPU much easier for programmers and gave the GPGPU community much more control over GPU management.

GPUs are increasingly used in scientific computations, which require ever more computing power. Linear algebra libraries are being made available to take advantage of GPU acceleration, including BLAS-compatible basic matrix/vector algebra functions and linear algebra solvers [32], [33].

On a GPU, the best efficiency is achieved on compute-bound applications, for which the computation time is much greater than the communication and synchronization time between the processors.

In this article we address the problem of efficiently implementing dynamic programming on a GPU. Dynamic programming (DP) naturally exposes an important level of parallelism, but also requires frequent synchronization between the parallel processes. It is applicable to many different problems including shortest path [3], matrix chain multiplication [10], and the longest common substring problem [9].

We take as a use case the multidimensional knapsack problem (MKP). MKP is a multidimensional extension of the well-known knapsack problem (KP). Even though large instances of KP and MKP are usually solved to near optimality using collaborative approaches of metaheuristics and Integer Linear Programming methods, optimal solving of small instances can be efficiently performed using dynamic programming.

This paper is not about efficiently solving large instances of the MKP: it rather focuses on an efficient parallelization of the general DP process on GPU, which can be used for parallelizing any DP-based algorithm, such as those for the problems cited above. This is especially useful when the problems to be solved are small subproblems of a larger general problem. In this case, it may be relevant to reduce each step's computation time as much as possible; GPU parallelization is a possible approach to achieve this goal.

Section II of this article presents related work on parallel dynamic programming and optimization on GPU. Section III presents a formulation of the multidimensional knapsack problem. We present in Section IV how the MKP can be solved using dynamic programming, with a reference sequential algorithm. Section V details the proposed parallelization strategy, as well as an efficient method to deal with multiple index management for the MKP. Experimental results are presented in Section VI.

II. RELATED WORK

There have been several publications about parallel dynamic programming. Boyer et al. [28] give references for DP methods for the knapsack problem on the ICL DAP, the CM-2 Connection Machine or the MasPar MP-1 parallel computers. More recent studies concerned sequence comparison on multi-threaded computers [11], discrete optimization on multi-transputer systems [12] or RNA secondary structure prediction on petascale multi-core architectures [17].

Even though GPGPU computing is a recent field, using it for solving optimization problems has become a natural trend in the parallel optimization domain. As a non-exhaustive list of problems tackled with GPGPU, one can find the 3-satisfiability problem, solved using a genetic algorithm implemented with shader programming [13], the Permuted Perceptron Problem, solved using a tabu search method implemented in CUDA [20], and the flowshop scheduling problem, solved using a multiobjective local search [24]. Successful attempts have also been made on the simplex algorithm [19], [22] and the branch and bound method [29].

Dynamic programming on GPU has been proposed by Boyer et al. for solving the knapsack problem [28]. They propose a parallelization strategy involving one parallel DP evaluation in every GPU thread. Their method also stores the solution array using a compression scheme which provides good results in terms of memory usage and compression time. This method has been extended to the multiple GPU case [23]. Other applications of DP on GPU include the matrix chain optimization problem [26], [25] and the Optimal Polygon Triangulation problem [27]. The parallelization strategies used in these works depend on the structure of the DP process, where computations of varying levels of parallelism are done in a single DP step.

III. THE MULTI-DIMENSIONAL KNAPSACK PROBLEM

The MKP first appeared as a model for budgeting problems [1, 1955], [4, 1957]. The first dynamic programming method for solving small sized instances of KP was presented by Bellman in 1957 [2], and its extension to MKP by Gilmore and Gomory in 1966 [5]. Methods for solving large MKP instances include heuristics [7], [8], hybrid exact/heuristic methods [21], and more recently exact methods using tree search, including branch and bound [18], [30]. Relatively recent complete surveys on the MKP were published in 2004 [14], [15].

The MKP is a generalisation of the well-known knapsack problem and is NP-hard. It consists in choosing a subset of a given set of items which maximizes the total profit while satisfying a set of capacity constraints. It can be stated as follows:

$$
\begin{aligned}
\text{maximize}\quad & \sum_{j=1}^{n} p_j x_j \\
\text{subject to}\quad & \sum_{j=1}^{n} w_{ij} x_j \le c_i \qquad \forall i \in \{1,\dots,m\} \\
& x_j \in \{0,1\} \qquad\qquad\;\, \forall j \in \{1,\dots,n\}
\end{aligned}
$$

where:
• m is the number of constraints.
• c_i represents the available capacity of constraint i. The c_i values are the components of an m-sized capacity vector we denote as c.
• n is the number of items.
• p ∈ N^n is the profit vector; p_j represents the profit of the j-th item.
• w ∈ N^{m×n} is the cost matrix; w_{ij} is the cost, in constraint i, of selecting item j.
• x_j is a binary decision variable, which equals 1 if item j has been selected, and 0 otherwise.

Each w_{ij} cost can be viewed as a resource demand of item j in terms of resource i. For instance, consider a real-time operating system in which we want to run a set of periodic tasks while maximizing the global efficiency. The different resource constraints could be the CPU time (the total CPU usage cannot exceed 100% of the available processing power), the memory consumption, or the available I/O throughput.

If m is a fixed integer with m ≥ 1, the problem can be optimally solved in pseudo-polynomial time [14]. Note that if m = 1, the problem is the standard knapsack problem (KP).
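
As a small illustration (this toy instance is ours, not from the paper), take $m = 2$ constraints, $n = 3$ items, profits $p = (6, 10, 12)$, capacities $c = (4, 5)$ and cost rows $w_{1\cdot} = (1, 2, 3)$, $w_{2\cdot} = (2, 2, 3)$ (row $i$ holds the costs in constraint $i$). Enumerating the $2^3$ subsets, $\{1,2\}$ uses $(3,4) \le (4,5)$ for a profit of 16 and $\{1,3\}$ uses $(4,5) \le (4,5)$ for a profit of 18, while $\{2,3\}$ and $\{1,2,3\}$ violate the first constraint; the optimum is therefore $x = (1, 0, 1)$ with total profit 18.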

IV. SEQUENTIAL DP ALGORITHM FOR MKP

Dynamic programming is a well known optimization method for problems which exhibit an optimal substructure. If the optimal solution of a problem can be recursively expressed using the optimal solutions of some of its subproblems, then dynamic programming can be used.

This is the case for the knapsack problems. If we know all optimal solution values T_{k-1}(d) for a set of k−1 items and any knapsack of vector size d less than or equal to c = (c_1, c_2, ..., c_m), then the decision whether to take a new item with weight vector w_k = (w_{k1}, w_{k2}, ..., w_{km}), w_k ≤ d, and profit value p_k, for a given knapsack vector size d, only depends on the resulting value of two solutions:
• T_{k-1}(d) if item k is not taken.
• T_{k-1}(d − w_k) + p_k if item k is taken.
Of course, if $w_k \not\le d$, the item cannot be chosen for a multidimensional knapsack of vector size d. Hence, the optimal solution value at step k for a knapsack of vector size d is

$$
T_k(d) =
\begin{cases}
\max\bigl(T_{k-1}(d),\; T_{k-1}(d - w_k) + p_k\bigr) & \text{if } w_k \le d \\
T_{k-1}(d) & \text{if } w_k \not\le d
\end{cases}
$$

Moreover, Toth [6, 1980] demonstrated for the KP that at the k-th step, if the total weight $\hat{w}$ satisfies

$$
\hat{w} \;<\; c - \sum_{i=k+1}^{n} w_i
$$

then the current state will never lead to an optimal solution and therefore does not need to be computed. Indeed, in such a case, even when selecting all items from k+1 to n, the total weight remains lower than the total knapsack capacity. This property, which we call the "Toth property", also applies in the multidimensional case; the < comparison is then a vector (component-wise) comparison.

From these properties, we derive the sequential dynamic programming algorithm for the MKP (Algorithm 1).

Algorithm 1 The sequential MKP
 1: function SEQUENTIALMKP(n, m, p, w, c)
    Input: n ∈ N          ▷ number of items
    Input: m ∈ N          ▷ number of MKP constraints
    Input: p ∈ N^n        ▷ item profits vector
    Input: w ∈ N^{n×m}    ▷ item weight matrix
    Input: c ∈ N^m        ▷ MKP capacities vector
 2:   T_0 ← 0                                ▷ optimal values with 0 objects
 3:   σ ← Σ_{k=1..n} w_k                     ▷ sum of vector weights
 4:   for k = 1 → n do
 5:     σ ← σ − w_k
 6:     d_min ← max(0, c − σ)
 7:     for all d_min ≤ d ≤ c do             ▷ all possible MKP vector capacities
 8:       if w_k ≰ d then
 9:         T_k(d) ← T_{k−1}(d)
10:       else
11:         T_k(d) ← max(T_{k−1}(d), T_{k−1}(d − w_k) + p_k)
12:       end if
13:     end for
14:   end for
15:   return T_n(c)
16: end function

In this algorithm, each T_k is an associative array which contains the current optimal knapsack value for each vector MKP capacity, after the first k items of the set have been processed. The σ vector is initialized with the sum of the weight vectors, and for each item, the Toth property is used to determine a minimum value d_min for the possible d vector capacities.

At step k, the T_k values are computed only from the T_{k-1} values of the previous step. A typical sequential implementation of this algorithm would therefore use a single array T and reuse the previous values to compute the new ones. In that case, one would choose an ordering of the possible vector knapsack sizes d such that newly computed T_k values do not overwrite T_{k-1} values before they are used. For instance, if m = 1, the scalar values of d would typically be processed in decreasing order. If m > 1, reverse lexicographic order would be a valid possibility.

Best performance is achieved if the set of items is sorted so that the highest w_k weights come first. Indeed, in that case the components of σ decrease faster during the first iterations of the k loop, potentially enabling higher values of d_min earlier in the algorithm execution. The ordering we chose is the decreasing order of the sum of the weight components of the items. This simple method provided significantly better results than a random ordering.
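
For the one-dimensional case (m = 1), the single-array update in decreasing capacity order together with the Toth lower bound can be written as the following short sequential routine. This is our own illustrative sketch, not code from the paper; the names Item and sequentialKP are ours.

    // Sequential DP for the 1-D knapsack (m = 1): a single array T updated
    // in decreasing capacity order, with the Toth lower bound d_min.
    #include <algorithm>
    #include <vector>

    struct Item { int w, p; };                 // weight, profit

    int sequentialKP(const std::vector<Item>& items, int c) {
        std::vector<int> T(c + 1, 0);          // T[d] = best profit within capacity d
        long long sigma = 0;                   // sum of the remaining weights
        for (const Item& it : items) sigma += it.w;
        for (const Item& it : items) {
            sigma -= it.w;                     // sigma = sum of weights of items k+1..n
            int dmin = (int)std::max<long long>(it.w, c - sigma);  // Toth lower bound
            for (int d = c; d >= dmin; --d)    // decreasing order: T[d - w] still holds T_{k-1}
                T[d] = std::max(T[d], T[d - it.w] + it.p);
        }
        return T[c];
    }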

V. PARALLEL IMPLEMENTATIONS ON GPU

The parallel implementation of the MKP consists in parallelizing the for loop at line 7 of Algorithm 1, which we call the "inner loop", as opposed to the main loop iterating over the knapsack items, which we call the "outer loop" of the algorithm. As the computations are independent between iterations, the potential level of parallelism is the number of iterations.

In this section we propose several parallel implementations on a GPU. We first present how GPUs work and how we can use them to solve the MKP. We then present different approaches for the parallelization of a simple knapsack problem dynamic programming solver, and we address the specific problem of efficiently managing indices with multiple dimensions on a GPU.

A. Presentation of a GPU

A GPU is a massively parallel SPMD (single program, multiple data) processor. It is generally embedded in a graphics card (the device) connected to a computer (the host) via a PCI bus interface. The device also embeds local memory, available for graphics or GPGPU use. The device memory contains the GPU program code and all the data that the GPU programs manipulate. The host can allocate blocks in the device memory, transfer data blocks from the host memory to the device memory and vice versa, as well as execute kernels (GPU programs) to manipulate the data in the device memory.

A kernel is typically executed by a large number of threads, and the GPU logic is responsible for executing the threads with the maximum parallelism the GPU is able to provide.

GPU threads run in thread groups called warps, which typically consist of 32 threads. As all threads in a warp share the same program control logic, all of its threads always execute the same instruction at the same time. This behaviour is well suited to performing computations on arrays: thread 0 in the warp handles element 0 in the array, thread 1 handles element 1, and so on. Moreover, memory accesses are optimal when all threads in the warp access consecutive memory elements in the warp order. Such parallel memory accesses are said to be coalesced. This is possible in the case of the MKP: if we parallelize the for loop at line 7 of Algorithm 1, since the accesses to the elements of arrays T_k and T_{k-1} are performed according to the current value of d, we can order the array elements and the values of d for that purpose.

Thread warp scheduling is managed by the GPU hardware, enabling fast and efficient scheduling of several thousands of threads. To take advantage of this, and for maximum efficiency, a typical GPU program spawns a number of threads large enough to use all of the available hardware thread contexts.

GPU kernels are executed in blocks of threads. All threads in a block execute the same program, and the GPU is responsible for scheduling all the warps so that the block finishes its execution. The typical maximum number of threads per block is 1024.

Each of the threads executing the same kernel is assigned a unique pair of block and thread identifiers. It is up to the programmer to use these values in the kernel program to identify the specific data each thread has to manipulate. Figure 1 illustrates how threads are organized in blocks: threads form a 2D grid, one dimension being the number of blocks and the other being the number of threads per block (at most 1024).

Figure 1. Organization of threads in blocks in a GPU program

Figure 2 is an illustration of the memory structure of the GPU. All threads from all blocks have direct access to the device memory. There is another type of faster memory available, though in much lower quantity: the shared memory. Shared memory is used for communication between threads of the same block; we do not have a specific use for it in dynamic programming. It is however used as storage for the kernel arguments as well as for part of the thread-local variables. The other storage for thread-local variables is the registers: registers are local to every thread and are the fastest available storage.

Figure 2. Memory structure in a GPU device

Each block of threads is executed by a hardware execution unit called a multiprocessor. The maximum number of threads which can be assigned (or scheduled) to a single multiprocessor depends on hardware limits and also on the number of registers used by the kernel code executed by all threads. Depending on the recency of the hardware, the multiprocessors of a typical modern GPU may embed 32k or 64k registers, allowing several tens of registers to be used in a single thread.
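
Since the coarse strategy described below sizes its grid from these hardware characteristics, it is worth noting that they can be queried at run time through the CUDA runtime API. The following sketch is our own, not from the paper; it only prints the values used later in this article and assumes device 0.

    // Querying the hardware limits the parallelization strategy relies on.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;
        printf("multiprocessors       : %d\n", prop.multiProcessorCount);
        printf("max threads per block : %d\n", prop.maxThreadsPerBlock);
        printf("registers per block   : %d\n", prop.regsPerBlock);
        printf("warp size             : %d\n", prop.warpSize);
        return 0;
    }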

B. Determination of a parallelization strategy

During this work, we first implemented the mono-dimensional knapsack problem (KP) with different parallelization strategies. This corresponds to the MKP in which the number of constraints m equals one; the capacity c is therefore a scalar value, as is the weight value w_j of each item j.

The first strategy, called the fine grained strategy, results from a particular interpretation of the Nvidia CUDA programming documentation [31], which emphasizes the need to maximize parallel execution. The behavior commonly adopted in GPGPU computing to achieve this goal is to spawn as many threads as there are independent parallel operations, which in our case is the number of iterations of the inner loop over the different capacities. This value, in the general case, is close to c, the capacity of the knapsack. Thus, the method consists in executing the outer loop iterations on the CPU, and calling the GPU kernel to run the inner loop in parallel, with the finest possible grain of parallelism. The necessary synchronization of all GPU threads between iterations of the outer loop is implicit, as the CPU waits for all GPU threads to finish their execution before the next outer loop iteration. This is the strategy commonly used in many related works on parallel dynamic programming on GPU.
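
For reference, a minimal sketch of such a fine grained step is shown below (our own illustration, not a listing from the paper); it ignores the Toth bound for simplicity: one thread per capacity value, one kernel launch per item.

    // Fine grained baseline: one GPU thread per capacity value d,
    // the outer loop over the items remaining on the CPU.
    __global__ void fineGrainedStep(int c, int w, int p,
                                    const int *T1, int *T) {
        int d = threadIdx.x + blockIdx.x * blockDim.x;
        if (d >= c) return;                  // c = knapsack capacity + 1
        int t = T1[d];
        if (d >= w)
            t = max(T1[d - w] + p, t);       // take item k if it fits
        T[d] = t;
    }

    // Host side, for each item k (the experiments in Section VI use
    // 192 threads per block for this strategy):
    //   fineGrainedStep<<<(c + 191) / 192, 192>>>(c, it[k].w, it[k].p, T1, T);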

A second strategy we developed results from the observation that each thread in the fine grained strategy performs a very small amount of work. This led us to investigate limiting the number of spawned threads while maximizing the GPU utilization and minimizing the intervention of the CPU in the algorithm. It consists in a full implementation of the whole KP algorithm in a single GPU kernel. In this configuration, the different thread blocks execute the outer loop over the items simultaneously and share the parallel iterations of the inner loop over the different capacities. Synchronization is performed between all threads at the end of each iteration of the inner loop to handle concurrency. While thread synchronization is supported by CUDA for threads within the same block, there is no supported way to synchronize threads from different blocks. This is due to the fact that CUDA blocks can be considered as jobs which may or may not be executed in parallel, depending on the hardware they are running on. However, we implemented a barrier mechanism between concurrently running blocks, using the hardware specification of the GPU to determine the number of blocks which can actually run in parallel. In order to reduce the overhead due to the inter-block synchronization, we tried to minimize the number of created blocks while preserving the maximum GPU utilization. This led us to use the maximum number of threads per block, which is 1024 on all the GPUs we used for testing. A significant performance increase resulted from this method.

The final strategy we developed, called the coarse strategy, is an intermediate solution which basically follows the scheme of the fine grained strategy, while reducing the number of blocks to the same level as the second strategy. What basically happens is that each thread executes several iterations of the inner capacity loop of the MKP algorithm, resulting in a larger grain of parallelism. In this case there is no longer any need for explicit thread synchronization, as we get the same implicit synchronization as in the fine grained strategy. This parallel strategy produces very similar results to the second strategy, while its design is much simpler and does not require undocumented tricks to achieve its goal. This is the strategy we propose and describe in the rest of this paper.

As described above, the coarse strategy consists in performing the main loop of Algorithm 1 (the k loop) on the CPU. At each iteration, a pre-determined number of thread blocks with the maximum number of threads per block is spawned on the GPU to execute the inner loop (the d loop) in parallel.

The notation in CUDA for spawning threads is the following:

kernel<<<NB,NT>>>(arg1, arg2, ...);

This corresponds to spawning NB blocks of NT threads each, all running the "kernel" function on the GPU with the same list of arguments. CUDA threads also have additional implicit variables. Those we use identify the current thread's block number (from 0 to NB−1) and its thread number within the block (from 0 to NT−1); these values are respectively stored in the blockIdx.x and threadIdx.x variables. While the GPU threads are running, the CPU waits for their termination before continuing its execution.

The CPU/GPU implementation is presented in Algorithm 2. For maximum accuracy we write it down in CUDA language, which mostly consists of C/C++ with additions to describe GPU kernels and to spawn GPU threads.

Algorithm 2 Parallel KP, CPU/GPU, coarse strategy

// fixed number of threads per block
#define NT 1024
// fixed number of blocks
#define NB 7

// GPU kernel for the d loop
__global__ void kernel(int n, int c,
    int x0, int x1, int w, int p, int *T, int *T1) {
  int x = threadIdx.x + blockIdx.x * NT;
  int d, d_ = min(x0, x1);
  d_ -= d_ % 32;                       // align the starting index on a warp boundary
  for (d = x + d_; d < c; d += NT*NB) {
    int t = T1[d];
    if (d >= x0)
      t = max(T1[d-w] + p, t);
    T[d] = t;
  }
}

// Knapsack method - main function on CPU
// n: number of items
// c: knapsack capacity + 1
// T, T1: two arrays[c] in GPU memory
// it: a sorted set of n items in CPU memory
int KPKernelLoop(
    int n, int c, int *T, int *T1, item *it) {
  int result;
  int k, x0, x1, s = 0;
  // compute the sum of all weights
  for (k=0; k<n; ++k)
    s += it[k].w;
  // set the contents of the T array to all zeros
  cudaMemset(T, 0, c*sizeof(int));
  x1 = 0;
  for (k=0; k<n; ++k) {
    swap(T, T1);                       // exchange the roles of the two arrays (e.g. std::swap)
    s -= it[k].w;
    x0 = max(c - 1 - s, it[k].w);      // Toth lower bound (c - 1 is the actual capacity)
    kernel<<<NB, NT>>>(n, c, x0, x1,
        it[k].w, it[k].p, T, T1);
    x1 = x0;
  }
  cudaMemcpy(&result, &T[c-1],
      sizeof(int), cudaMemcpyDeviceToHost);
  return result;
}

In this algorithm, the T and T1 arrays correspond to the T_k and T_{k-1} arrays of Algorithm 1. The values computed for the x0 and x1 variables correspond to the application of the Toth property. The contents of each item (weight and profit) are passed as arguments to the kernel function, and the management of the Toth property is done by the CPU. At the beginning of the GPU kernel execution, a d_ starting index is computed according to the Toth property. To ensure some extent of memory access coalescing (as described above), the d_ index value is lowered to the nearest lower multiple of 32 (the number of threads per warp). Each thread executes a loop on a d value, starting from the current thread id added to the d_ index, and incrementing the value at the end of each iteration by the total number of running threads (NT × NB). This respects the sequential alignment requirements for coalesced memory accesses when reading or writing at each d position in the shared arrays. Misaligned memory accesses occur in the general case when accessing the d−w element of array T1 at iteration d; however, modern GPUs (compute capability 2.0 and above on Nvidia GPUs) perform misaligned sequential memory accesses at the limited cost of two accesses instead of one.

Some previous works (for instance the work by Boyer et al. [28]) make use of the texture memory access mechanism of the GPUs to improve the performance of memory accesses. As our experiments using texture memory did not show any significant performance increase on our hardware platforms, our proposed methods do not make use of it.

Our method does not require any costly data transfer between the main memory and the GPU memory. Indeed, the initial values of the T array are set to zero by the GPU, the KP item values (weight and profit) are passed as arguments when calling the DP kernel, and only the value of the last element of the T array is copied from the GPU memory to the main memory.

C. Dynamic management of multiple constraints

Having presented our parallelization strategy, we now present the different methods we developed for the management of several knapsack constraints. All these methods have been implemented in a parallel MKP solver using the coarse strategy presented above.

Managing multiple constraints involves an associative array of maximum profits corresponding to the different possible MKP capacity values. As MKP capacities are multi-dimensional, this is equivalent to using a vector for indexing the different capacities. As a matter of fact, this is exactly how multi-dimensional arrays are handled in many programming languages such as C: such arrays are implemented as a one-dimensional array, and a conversion combines all the index values into a unique one-dimensional index corresponding to the multi-dimensional index.

For instance, in the three-dimensional case, using a maximum capacity vector $c = (c_0, c_1, c_2)$, any capacity vector $d = (d_0, d_1, d_2)$, $d \le c$, can be assigned the one-dimensional index $d = d_0 + d_1 c_0 + d_2 c_0 c_1$. In this case, the associative array contains $c_0 c_1 c_2$ elements, which may cause a memory allocation problem when dealing with large knapsack sizes. Our method is therefore best suited for relatively small instances.

Given a one-dimensional index d, the values of the d vector can be obtained as

$$
d_0 = d \bmod c_0, \qquad
d_1 = \lfloor d / c_0 \rfloor \bmod c_1, \qquad
d_2 = \lfloor d / (c_0 c_1) \rfloor \bmod c_2 .
$$

More generally, for m constraints and a maximum capacity vector $c = (c_0, c_1, \dots, c_{m-1})$, any capacity vector $d = (d_0, d_1, \dots, d_{m-1})$, $d \le c$, can be assigned the 1D index

$$
d \;=\; \sum_{i=0}^{m-1} d_i \prod_{j=0}^{i-1} c_j .
$$
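
On the host side, the conversion in both directions can be written in a few lines. This is an illustrative sketch of ours (function names are ours; c[i] holds the capacity of constraint i plus one, i.e. the number of admissible values in that dimension):

    #include <vector>

    // Vector index -> linear index: d0 + d1*c0 + d2*c0*c1 + ...
    int toLinear(const std::vector<int>& d, const std::vector<int>& c) {
        int idx = 0, stride = 1;
        for (size_t i = 0; i < d.size(); ++i) {
            idx += d[i] * stride;
            stride *= c[i];
        }
        return idx;
    }

    // Linear index -> vector index, using the modulo formulas above.
    std::vector<int> toVector(int idx, const std::vector<int>& c) {
        std::vector<int> d(c.size());
        for (size_t i = 0; i < c.size(); ++i) {
            d[i] = idx % c[i];
            idx /= c[i];
        }
        return d;
    }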

Our first implementation is an attempt to handle any possible number of constraints. For this, it converts a 1D index d into multiple indices using the formulas above. The CUDA kernel implementation is presented as Algorithm 3.

Algorithm 3 Parallel MKP GPU kernel, dynamic management of constraints

struct item {
  // profit
  int p;
  // offset in MKP array (= w[0] + w[1]*c[0] + ...)
  int ww;
};

// GPU kernel for MKP, variable number of constraints
// C  : product of all (capacities+1), C = c[0] * c[1] * ...
// m  : number of constraints
// c  : array[m] of knapsack capacities + 1
// it : pointer in GPU memory to current item
// w  : array[m] of weights for the current item
// T  : destination MKP array
// T1 : source (for item k-1) MKP array
__global__ void kernel(int C, int m, int *c,
    item *it, int *w, int *T, int *T1) {
  int x = threadIdx.x + blockIdx.x * NT;
  int i, d, di, td, test;
  for (d = x; d < C; d += NT*NB) {
    int t = T1[d];
    td = d;
    test = 1;
    for (i = 0; i < m; i++) {
      // compute d[i] : i-th coordinate of the d vector
      di = td % c[i];
      // test whether d0 >= w0 && d1 >= w1 && ...
      if (di < w[i]) {
        test = 0;
        break;
      }
      td = td / c[i];
    }
    if (test)
      t = max(T1[d - it->ww] + it->p, t);
    T[d] = t;
  }
}

D. Static dimension specialization

The previous method is fully dynamic, which allows it to handle any number of dimensions. However, it has one major drawback: because of its adaptability, it requires all parameters (items, MKP capacities) to be stored in GPU global memory. As GPUs are massively parallel, such memory accesses are to be avoided as much as possible. Moreover, the memory accesses for the item and capacity components are not coalesced, which is detrimental to the efficiency of the method. This is particularly unfortunate as the corresponding component values are constant for all threads of a given GPU kernel invocation. Therefore, another possibility is to specialize the code for fixed values of the number of MKP constraints, in order to profit from the massive number of available registers and from the fast-access shared memory of the GPU. This can be justified because, in most engineering problems, the dimensions are fixed at least for whole classes of instances. Fixing the dimension of the problem allows us to pass all the item data and capacity components as arguments of the GPU kernel, which the CUDA compiler implements using shared memory. We designed a generator program which produces versions of the MKP kernel code for specified numbers of constraints. This allows us to provide the best tailored version of the solver for any specific class of the problem.

An example implementation of a specialized kernel for the two-dimensional MKP is presented in Algorithm 4. One can notice in this implementation that the item data (weights and profit) and the knapsack capacities are now stored in local variables. One can also observe that all accesses to elements d of the T and T1 arrays are coalesced, while only the occasional accesses to the d − ww elements of T1 are not coalesced, though they respect the misaligned sequential access pattern described for Algorithm 2.

Algorithm 4 Parallel MKP GPU kernel, fixed number of constraints

/* Parallel MKP GPU kernel, 2D version
   C      : total number of capacity combinations, C = c0*c1
   c0     : capacity of constraint 0
   ww     : w0 + c0*w1
   w0, w1 : weights of the item in constraints 0 and 1
   x0, x1 : minimum index values to update the T array
   p      : item profit
   T, T1  : arrays T_k and T_{k-1}
*/
__global__ void kernel(int C, int c0, int ww, int w0,
    int w1, int x0, int x1, int p, int *T, int *T1) {
  int x = threadIdx.x + blockIdx.x * NT;
  int d;
  for (d = x; d < C; d += NT*NB) {
    int d0 = d % c0;   // coordinate in dimension 0
    int d1 = d / c0;   // coordinate in dimension 1
    if (d0 >= x0 && d1 >= x1) {
      int t = T1[d];
      if (d0 >= w0 && d1 >= w1)
        t = max(T1[d-ww] + p, t);
      T[d] = t;
    }
  }
}

Even though this specialized version is a major improvement, significant computing power is still necessary to generate the multi-dimensional indices, as we make heavy use of the integer divide and modulo operators to convert the global one-dimensional index into the multi-dimensional one. Such operators are known to be expensive, and Nvidia recommends avoiding their use as much as possible.

The previous version of the MKP kernel can be improved by computing the index offsets only for the first application of the dynamic programming recurrence function, and using only additions and subtractions to obtain the index values for the subsequent iterations. This improvement is presented in the example 3D-specialized kernel of Algorithm 5. In this implementation, the divide and modulo operators are only used to determine the initial values of the d vector, potentially significantly reducing the computing overhead of the whole algorithm execution.

Algorithm 5 Parallel MKP GPU kernel, fixed number of constraints, lighter computation of indices

/* Parallel MKP GPU kernel, 3D version
   C          : total number of capacity combinations = c0*c1*c2
   c0, c1     : capacity of constraints 0 and 1
   i0, i1, i2 : components of the increment vector
   ww         : w0 + c0*w1 + c0*c1*w2
   w0, w1, w2 : weights of the item in constraints 0 to 2
   x0, x1, x2 : minimum index values to update T
   p          : item profit
   T, T1      : arrays T_k and T_{k-1}
*/
__global__ void kernel(int C, int c0, int c1,
    int i0, int i1, int i2, int ww, int w0, int w1, int w2,
    int x0, int x1, int x2, int p, int *T, int *T1) {
  int x = threadIdx.x + blockIdx.x * NT;
  int d = x;
  int d0 = d % c0;
  int d1 = (d / c0) % c1;
  int d2 = d / (c0 * c1);
  for (; d < C; d += NT*NB) {
    if (d0 >= x0 && d1 >= x1 && d2 >= x2) {
      int t = T1[d];
      if (d0 >= w0 && d1 >= w1 && d2 >= w2)
        t = max(T1[d-ww] + p, t);
      T[d] = t;
    }
    d0 += i0;
    if (d0 >= c0) { d0 -= c0; d1 += 1; }
    d1 += i1;
    if (d1 >= c1) { d1 -= c1; d2 += 1; }
    d2 += i2;
  }
}
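
The paper does not spell out how the increment vector (i0, i1, i2) is obtained; one consistent choice, shown here as our own hedged sketch, is to decompose the per-thread stride NT*NB in the same mixed-radix system on the host before launching the kernel, so that each component stays below its radix and the single carry handled per dimension in Algorithm 5 is sufficient:

    // Assumed host-side derivation of the increment vector for the 3-D kernel:
    // decompose the linear stride NT*NB with the same radices (c0, c1, ...).
    void makeIncrement(int stride, int c0, int c1,
                       int *i0, int *i1, int *i2) {
        *i0 = stride % c0;            // 0 <= i0 < c0, so d0 overflows by at most one carry
        *i1 = (stride / c0) % c1;     // 0 <= i1 < c1
        *i2 = stride / (c0 * c1);
    }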

VI. EXPERIMENTAL RESULTS

A. Hardware specification

We tested our algorithm on three different platforms. They are listed in Table I. All GPUs are from Nvidia. The GPU methods have been implemented in CUDA 4.2. All CPU programs were compiled with gcc, with code optimizations (-O3) enabled.

Table I. Testing hardware platforms. The number of GPU cores (#C) and multiprocessors (#MP) is shown.

  #   CPU                          GPU                #C   #MP
  1   AMD Sempron 140, 2.7 GHz     GeForce GTX460    336     7
  2   Intel Xeon X5650, 2.67 GHz   Tesla C2075       448    14
  3   Intel Xeon E5-2620, 2 GHz    GeForce GTX680   1536     8

B. Experiments

We first tested our coarse parallelization strategy on the mono-dimensional knapsack problem. For this purpose, we compared the performance of our implementation with an implementation of the commonly used fine grained strategy. We also developed an efficient sequential CPU implementation of the algorithm in order to compare the GPU method with the CPU on each platform.

Instances: The instances we tested are those used in [28]. These instances present a strong correlation between weights and profits, so that no item is dominated by any other one. Typically, the weights have been generated randomly between 1 and 1000, and the profit is the weight plus 50. The instance sizes correspond to the number of items. The knapsack sizes have been calculated in order to obtain a ratio between the knapsack size and the sum of all item weights (the tightness value) of 0.5.
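
For reproducibility, such a correlated instance can be generated in a few lines; the sketch below is ours (function and type names are illustrative) and simply follows the description above:

    // Correlated KP instance: weights uniform in [1, 1000],
    // profit = weight + 50, capacity chosen for a tightness of 0.5.
    #include <cstdlib>
    #include <vector>

    struct KPItem { int w, p; };

    std::vector<KPItem> makeInstance(int n, long long *capacity) {
        std::vector<KPItem> it(n);
        long long sum = 0;
        for (int k = 0; k < n; ++k) {
            it[k].w = 1 + rand() % 1000;
            it[k].p = it[k].w + 50;
            sum += it[k].w;
        }
        *capacity = sum / 2;   // tightness = capacity / sum of weights = 0.5
        return it;
    }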

Number of blocks: To determine the number of thread blocks to be spawned for maximum efficiency of the coarse parallel strategy, we ran the method with different configurations. As explained in Section V-B, we fixed the number of threads per block to its maximum value, which is 1024 on all the GPUs we tested. The method was run with numbers of blocks varying from 1 to 32. Results are shown in Figure 3. These tests were performed on a knapsack instance of 20000 items and a knapsack size of 4989314; every other instance shows a very similar behavior.

Figure 3. Parallel run times (seconds, logarithmic scale) depending on the number of blocks (1 to 32), for the GTX 460, Tesla C2075 and GTX 680.

Between 1 and 5 blocks, performance is poor, as expected, since we do not exploit the maximum parallelism the GPU provides. Run times then converge to a first local minimum, which corresponds for each curve to the number of multiprocessors of the corresponding GPU. After reaching this first local minimum, the following values increase, potentially up to a local maximum. Other local minima can be found at twice the number of multiprocessors. Then, depending on the GPU, a similar pattern may or may not follow.

In each curve, the first local minimum appears to correspond to the global minimum, or at least to a very good candidate value for the number of blocks used in the method. As these values are obtained when the number of blocks equals the number of multiprocessors, in the following tests we use the number of multiprocessors as the number of blocks.

Parallelization strategies: The run times of the different CPU and GPU implementations are shown in Table II. The fine grained method is executed with 192 threads per block; this value was chosen because it provided the best results on average. The required number of blocks for each instance is determined accordingly.

Table II. Run times (s) and speedups between the GPU and CPU KP versions. For each platform: CPU time, fine grained time and speedup over the CPU, coarse time and speedup over the CPU, and relative speedup of coarse over fine grained.

  platform   items     CPU     fine    sp.   coarse    sp.   rel. sp.
  1           10000    26.5     3.2    8.2      1.6   16.8      2.03
              20000   107.1    12.8    8.3      6.3   17.1      2.05
              30000   241.2    28.7    8.4     14.0   17.3      2.06
              40000   431.0    51.2    8.4     25.1   17.1      2.04
              50000   653.8    80.0    8.2     38.9   16.8      2.06
              60000   943.1   115.2    8.2     56.5   16.7      2.04
              70000  1327.2   141.9    9.3     76.4   17.4      1.86
              80000  1723.4   185.4    9.3    100.6   17.1      1.84
              90000  2172.5   234.6    9.3    127.3   17.1      1.84
             100000  2693.0   289.8    9.3    156.0   17.3      1.86
  2           10000    14.2     2.2    6.4      1.1   13.2      2.07
              20000    59.4     8.9    6.7      4.3   13.9      2.08
              30000   134.1    19.8    6.8      9.5   14.1      2.08
              40000   248.5    35.3    7.0     17.0   14.7      2.08
              50000   400.7    55.1    7.3     26.4   15.2      2.09
              60000   571.4    79.3    7.2     38.0   15.0      2.09
              70000   774.0   100.8    7.7     51.8   15.0      1.95
              80000   962.2   131.6    7.3     67.6   14.2      1.95
              90000  1238.1   166.6    7.4     85.4   14.5      1.95
             100000  1518.3   205.8    7.4    105.5   14.4      1.95
  3           10000    11.1     1.6    7.0      0.9   11.8      1.68
              20000    47.8     6.3    7.6      3.8   12.7      1.68
              30000   108.8    14.1    7.7      8.4   13.0      1.69
              40000   195.4    25.1    7.8     14.9   13.1      1.68
              50000   306.1    39.2    7.8     23.3   13.1      1.68
              60000   441.6    56.5    7.8     33.5   13.2      1.68
              70000   602.1    76.9    7.8     45.6   13.2      1.68
              80000   786.6   100.3    7.8     59.6   13.2      1.68
              90000   996.2   127.0    7.8     75.4   13.2      1.68
             100000  1229.1   156.2    7.9     92.6   13.3      1.69

From the table, we can observe that the coarse strategy outperforms the fine grained strategy on all GPUs. A relative speedup of at least 1.84 can be observed, and it exceeds 2 on some instances.

The parallel strategy which strictly follows the parallelization method recommended by Nvidia turned out to be the slowest. Spawning as many threads as there are KP capacities does indeed expose the maximum parallelism to the GPU; however, the cost of spawning a thread does not seem negligible compared to the actual work performed in a single application of the DP recurrence function. We observed that scheduling one block of 1024 threads on each of the multiprocessors of the GPU turns out to be a good parallelization strategy.

This assertion holds even when the multi-threading logic of the GPU is not fully saturated: for instance, each of the 8 multiprocessors of the GTX680 can execute two blocks of 1024 threads in parallel if the kernel code uses fewer than 32 registers. As the code of our coarse strategy implementation uses 7 registers, this condition is fulfilled. Figure 3 however shows there is no improvement when using 16 blocks instead of 8.
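
In CUDA toolkits newer than the 4.2 release used here, the number of co-resident blocks per multiprocessor can also be queried directly instead of being derived by hand from the register count; a minimal sketch (ours, assuming the DP kernel of Algorithm 2 is in scope):

    #include <cuda_runtime.h>

    // kernel: the DP kernel of Algorithm 2, declared elsewhere.
    __global__ void kernel(int n, int c, int x0, int x1,
                           int w, int p, int *T, int *T1);

    int blocksPerMultiprocessor() {
        int blocks = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks, kernel, 1024 /* threads per block */, 0 /* dynamic shared mem */);
        return blocks;
    }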

Multiple index management: We now compare the different methods we implemented for managing the MKP indices of items. For this, we tested the methods on the third hardware platform presented earlier (Xeon E5-2620, GTX680). The multidimensional instances are presented in Table III. All knapsack sizes are the same in all dimensions. For each item, the weights in every dimension have been randomly chosen between 0 and the maximum weight minus one, and the profit is the sum of the weights plus 5. The number of items is chosen so that the expected tightness ratio statistically approaches 0.5. We did not use the instances from the literature, because their sizes would require too much memory when treated with dynamic programming. All our instances are available for download, see [34].

Table III. MKP instance details.

  instance   # constraints   knapsack size   max weight   # items
  2.1        2               3000            20           600
  2.2        2               5000            40           500
  2.3        2               7500            50           600
  3.1        3                300            10           120
  3.2        3                400            10           160
  3.3        3                600            15           160
  3.4        3                600            12           200
  4.1        4                 80            10            32
  4.2        4                 90            10            36
  4.3        4                120            10            48
  4.4        4                120             8            60

The results are shown in Table IV.

Table IV. MKP run times (s) with the different multiple dimension management strategies.

  instance    CPU     GPU dynamic   GPU static   GPU light
  2.1          4.80       1.64         0.39        0.31
  2.2         11.09       3.78         0.92        0.77
  2.3         31.06      10.21         2.60        2.25
  3.1          2.84       1.32         0.34        0.21
  3.2          9.12       4.19         1.03        0.68
  3.3         31.52      14.12         3.49        2.35
  3.4         37.57      17.67         4.31        2.86
  4.1          1.16       0.67         0.22        0.11
  4.2          2.12       1.21         0.37        0.20
  4.3          8.88       5.10         1.52        0.77
  4.4         10.46       6.45         1.81        0.93

The first observation we can make from these figures is that the dynamic GPU implementation is by far the most inefficient. One explanation is that most of the data it manages (knapsack dimensions, item weights, etc.) is stored in GPU global memory, meaning that all threads have to fetch the same data from memory, moreover in an uncoalesced manner. On the contrary, the static GPU version, due to the way it is programmed (passing most of the item and instance data as kernel arguments), makes efficient use of the GPU shared memory (where kernel function arguments are stored); its global memory accesses are strictly limited to the update of the global MKP array itself. Finally, the lighter version of dimension management avoids recomputing the index values between iterations by replacing all unnecessary divide and modulo operators, which are very expensive on the GPU in terms of performance, with additions and subtractions. While the cost of computing indices increases with the number of dimensions in the static GPU implementation, the lighter version maintains good run times whatever the number of dimensions, with a speedup varying from 10 to 15 in all tested cases.

CONCLUSION

Experimental results show that, even though they were not strictly designed for that purpose, GPUs can provide satisfactory acceleration for dynamic programming applications.

To reduce the performance bottleneck due to the necessary synchronization, we determined a parallelization strategy which takes advantage of the knowledge of the specific GPU being used. These parameters can be automatically queried from the GPU using dedicated CUDA library function calls, allowing the code to run efficiently on every possible target.

Taking advantage of the iterative nature of the function executed in each GPU thread, we proposed an efficient management of multiple constraint indices which avoids performing costly operations at each step. The method is efficient for small MKP instances; a more suitable approach is necessary to deal with larger instances.

The parallelization strategy we determined can be used for a large class of parallel dynamic programming problems on GPU, as it does not depend on the dynamic programming recurrence function. It will be used in further research on different optimization problems.

REFERENCES

[1] J. Lorie and L.J. Savage (1955). Three problems in capital rationing. The Journal of Business 28, pp. 229–239.

[2] R. Bellman (1957). Dynamic Programming. Princeton University Press.

[3] R. Bellman (1958). On a Routing Problem. Quart. Appl. Math. 16, pp. 87–90.

[4] A.S. Manne and H.M. Markowitz (1957). On the solution of discrete programming problems. Econometrica 25, pp. 84–110.

[5] P.C. Gilmore and R. Gomory (1966). The theory and computation of knapsack functions. Operations Research 14, pp. 1045–1075.

[6] P. Toth (1980). Dynamic Programming Algorithms for the Zero-One Knapsack Problem. Computing 25, pp. 29–45.

[7] F. Glover and G.A. Kochenberger (1996). Critical event tabu search for multidimensional knapsack problems. In Metaheuristics: Theory and Applications, I.H. Osman, J.P. Kelly, eds., Kluwer Academic Publishers, pp. 407–427.

[8] P.C. Chu and J.E. Beasley (1998). A genetic algorithm for the multiconstrained knapsack problem. Journal of Heuristics 4, pp. 63–86.

[9] D. Gusfield (1999) [1997]. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. USA: Cambridge University Press. ISBN 0-521-58519-8.

[10] T.H. Cormen, C.E. Leiserson, R.L. Rivest and C. Stein (2001). 15.2: Matrix-chain multiplication. Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, pp. 331–338. ISBN 0-262-03293-7.

[11] W.S. Martins, J.B. Del Curvillo, F.J. Useche, K.B. Theobald and G.R. Gao (2001). A Multithreaded Parallel Implementation of a Dynamic Programming Algorithm for Sequence Comparison. Proceedings of the Pacific Symposium on Biocomputing 2001, pp. 311–322. Hawaii, January 2001.

[12] J. Sadecki (2002). Parallel dynamic programming algorithms: Multi-transputer systems. International Journal of Applied Mathematics and Computer Science 12(2), pp. 241–255.

[13] C.J. Thompson, S. Hahn and M. Oskin (2002). Using Modern Graphics Architectures for General-Purpose Computing: A Framework and Analysis. Proceedings of the 35th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-35), pp. 306–317. Istanbul, Turkey, November 2002.

[14] H. Kellerer, U. Pferschy and D. Pisinger (2004). Knapsack Problems. Springer.

[15] A. Fréville (2004). The multidimensional 0–1 knapsack problem: an overview. European Journal of Operational Research 155, pp. 1–21.

[16] J.D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A.E. Lefohn and T.J. Purcell. A Survey of General-Purpose Computation on Graphics Hardware. In Eurographics 2005, State of the Art Reports, August 2005, pp. 21–51.

[17] G. Tan, N. Sun and R. Gao (2007). A Parallel Dynamic Programming Algorithm on a Multi-core Architecture. 19th Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA'07), 2007, pp. 135–144.

[18] Y. Vimont, S. Boussier and M. Vasquez (2008). Reduced costs propagation in an efficient implicit enumeration for the 0–1 multidimensional knapsack problem. Journal of Combinatorial Optimization 15, pp. 165–178.

[19] J. Bieling, P. Peschlow and P. Martini (2010). An Efficient GPU Implementation of the Revised Simplex Method. Proceedings of the 23rd IEEE International Parallel & Distributed Processing Symposium, Atlanta, USA, April 2010.

[20] T.V. Luong, N. Melab and E. Talbi (2010). Large neighborhood local search optimization on graphics processing units. In Proc. IPDPS Workshops 2010, pp. 1–8.

[21] J. Puchinger, G.R. Raidl and U. Pferschy (2010). The multidimensional knapsack problem: Structure and algorithms. INFORMS Journal on Computing 22(2), pp. 250–265.

[22] M.E. Lalami, V. Boyer and D. El-Baz (2011). Efficient Implementation of the Simplex Method on a CPU-GPU System. Proceedings of the 24th IEEE International Parallel & Distributed Processing Symposium, Anchorage, USA, May 2011.

[23] V. Boyer, D. El Baz and M. Elkihel (2011). Dense Dynamic Programming on Multi GPU. Proceedings of the 19th International Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP-11), pp. 545–551. Ayia Napa, Cyprus, February 2011.

[24] T.V. Luong, N. Melab and E. Talbi (2011). GPU-based Approaches for Multiobjective Local Search Algorithms. A Case Study: the Flowshop Scheduling Problem. EvoCOP'11, Proceedings of the 11th European Conference on Evolutionary Computation in Combinatorial Optimization, pp. 155–166. Springer-Verlag, Berlin, Heidelberg, 2011.

[25] K. Nishida, Y. Ito and K. Nakano (2011). Accelerating the Dynamic Programming for the Matrix Chain Product on the GPU. Second International Conference on Networking and Computing, pp. 320–326. Osaka, Japan, 2011.

[26] C.-C. Wu, J.-Y. Ke, H. Lin and W.-C. Feng (2011). Optimizing Dynamic Programming on Graphics Processing Units via Adaptive Thread-Level Parallelism. 17th International Conference on Parallel and Distributed Systems, pp. 96–103. Tainan, Taiwan, 2011.

[27] K. Nishida, K. Nakano and Y. Ito (2012). Accelerating the Dynamic Programming for the Optimal Polygon Triangulation on the GPU. Lecture Notes in Computer Science 7439, pp. 1–15, 2012.

[28] V. Boyer, D. El Baz and M. Elkihel (2012). Solving knapsack problems on GPU. Computers & Operations Research 39, pp. 42–47.

[29] A. Boukedjar, M.E. Lalami and D. El-Baz (2012). Parallel Branch and Bound on a CPU-GPU System. Proceedings of the 25th IEEE International Parallel & Distributed Processing Symposium, Shanghai, China, May 2012.

[30] R. Mansini and G. Speranza (2012). CORAL: An Exact Algorithm for the Multidimensional Knapsack Problem. INFORMS Journal on Computing 24(3), Summer 2012, pp. 399–415.

[31] Nvidia CUDA Programming Guide, version 4.2 (2012). Nvidia, April 2012.

[32] CULA GPU-accelerated linear algebra libraries. http://www.culatools.com/

[33] ViennaCL, a linear algebra library for computations on GPUs and multi-core CPUs. http://viennacl.sourceforge.net/

[34] Our MKP knapsack instances. http://fgalea.free.fr/mkp.html
