Abstract—Optimization methods are generally not among the most suitable algorithms for parallelization on a GPU. However, a relatively good efficiency can still be obtained if the method is properly adapted to the GPU programming model, which is the case for dynamic programming. In this article, we propose a thread-grouping parallelization strategy for dynamic programming in CUDA. We show that parametrizing the solver parallelism according to the hardware allows better performance. The strategy provides good acceleration compared to a standard GPU parallel strategy on a dynamic programming-based implementation of the knapsack problem. We show this strategy is helpful in the case of the multi-dimensional knapsack problem, where computing multi-dimensional indices is a costly operation.

Keywords—dynamic programming; parallelism; knapsack problem; GPU computing; combinatorial optimization; CUDA.

I. INTRODUCTION

Since the introduction of programmable Graphics Processing Units (GPUs) in general-purpose computers, the scientific community quickly understood the interest of using their computing power for scientific computations. Even though those processors were first designed for graphics rendering, they provide a much higher computing power than the general-purpose Central Processing Units (CPUs) in these computers. GPGPU (General-Purpose GPU) computing is the practice of programming the GPU within a computer program in order to speed up time-consuming computations. Indeed, due to their massive internal parallelism, GPUs can provide a good acceleration for computations if they are programmed properly, and if the computations to be performed expose a significant enough level of parallelism.

Early GPGPU computing consisted in using the available graphics libraries (OpenGL, Direct3D) to access and control the GPU computations [16]. As the graphics rendering pipelines of the graphics cards became more and more programmable, the field of use for GPGPU computing broadened at the same time. On 15 February 2007, the graphics card manufacturer NVIDIA released the CUDA Software Development Kit, which for the first time allowed GPUs to be programmed specifically for non-graphics computations. This move made access to GPGPU much easier for programmers and allowed the GPGPU community to gain much more control over GPU management.

GPUs are increasingly used in scientific computations, which require ever more computing power. Linear algebra libraries are being made available to take advantage of GPU acceleration, including BLAS-compatible basic matrix/vector algebra functions and linear algebra solvers [32], [33].

On a GPU, the best efficiency is achieved on compute-bound applications, for which the computation time is much greater than the communication and synchronization time between the processors.

In this article we address the problem of efficiently implementing dynamic programming on a GPU. Dynamic programming (DP) naturally exposes an important level of parallelism, but also requires frequent synchronization between the parallel processes. It is applicable to many different problems including shortest paths [3], matrix chain multiplication [10], and the longest common substring problem [9].

We take as a use case the multidimensional knapsack problem (MKP). MKP is a multidimensional extension of the well-known knapsack problem (KP). Even though large instances of KP and MKP are usually solved to near optimality using collaborative approaches of metaheuristics and Integer Linear Programming methods, optimal solving of small instances can be efficiently performed using dynamic programming.

This paper is not about efficiently solving large instances of the MKP: it rather focuses on an efficient parallelization of the general DP process on a GPU, which can be used for parallelizing any DP-based algorithm, such as those for the problems cited above. This is especially useful when the problems to be solved are small subproblems of a larger general problem. In this case, it may be relevant to reduce each step's computation time as much as possible; GPU parallelization is a possible approach to achieve this goal.

Section II of this article exposes different related works on parallel dynamic programming and optimization on GPUs. Section III presents a formulation of the multidimensional knapsack problem.
III. THE MULTI-DIMENSIONAL KNAPSACK PROBLEM

The MKP first appeared as a model for budgeting problems [1, 1955], [4, 1957]. The first dynamic programming method for solving small-sized instances of the KP was presented by Bellman in 1957 [2], and its extension to the MKP by Gilmore and Gomory in 1966 [5]. Methods for solving large MKP instances include heuristics [7], [8], hybrid exact/heuristic methods [21], and more recently exact methods using tree search, including branch and bound [18], [30]. Relatively recent complete surveys on the MKP were published in 2004 [14], [15].

IV. SEQUENTIAL DP ALGORITHM FOR MKP

Dynamic programming is a well-known optimization method for problems which exhibit an optimal substructure. If the optimal solution of a problem can be recursively expressed using the optimal solutions of some of its subproblems, then dynamic programming can be used.

This is the case for the knapsack problems. If we know all optimal solution values Tk−1(d) for a set of k − 1 items and any knapsack of vector size d less than or equal to c = (c1, c2, · · · , cm), then the decision whether to take a new item with weight vector wk = (wk1, wk2, · · · , wkm), wk ≤ d, and
profit value pk, for a given knapsack of vector size d, only depends on the resulting values of two solutions:
• Tk−1(d) if item k is not taken;
• Tk−1(d − wk) + pk if item k is taken.
Of course, if wk ≰ d, then the item cannot be chosen for a multidimensional knapsack of vector size d. Hence, the optimal solution value at step k for a knapsack of vector size d is

    Tk(d) = max(Tk−1(d), Tk−1(d − wk) + pk)   if wk ≤ d,
    Tk(d) = Tk−1(d)                            if wk ≰ d.

Moreover, Toth [6, 1980] demonstrated for the KP that at the k-th step, if the total weight ŵ satisfies

    ŵ < c − Σ_{i=k+1}^{n} wi,

then the current state will never lead to an optimal solution and therefore does not need to be computed. Indeed, in such a case, even when selecting all items from k + 1 to n, the total weight remains lower than the total knapsack capacity. This property, which we call the "Toth property", also applies in the multidimensional case; the < comparison is then a vector (component-wise) comparison.

From these properties, we present the sequential dynamic programming algorithm for the MKP (Algorithm 1).

Algorithm 1 The sequential MKP
 1: function SEQUENTIALMKP(n, m, p, w, c)
    Input: n ∈ N, number of items
    Input: m ∈ N, number of MKP constraints
    Input: p ∈ N^n, item profits vector
    Input: w ∈ N^{n×m}, item weight matrix
    Input: c ∈ N^m, MKP capacities vector
 2:   T0 ← 0                                        ▷ optimal values with 0 items
 3:   σ ← Σ_{k=1}^{n} wk                            ▷ sum of the weight vectors
 4:   for k = 1 → n do
 5:     σ ← σ − wk
 6:     d̄ ← max(0, c − σ)                           ▷ Toth property lower bound
 7:     for all d̄ ≤ d ≤ c do                        ▷ all possible MKP vector capacities
 8:       if wk ≰ d then
 9:         Tk(d) ← Tk−1(d)
10:       else
11:         Tk(d) ← max(Tk−1(d), Tk−1(d − wk) + pk)
12:       end if
13:     end for
14:   end for
15:   return Tn(c)
16: end function

In this algorithm, each Tk is an associative array which contains the current optimal knapsack value for each MKP vector capacity, after having processed the first k items of the set. The σ vector is initialized with the sum of the weight vectors, and for each item, the Toth property is used to determine a minimum value d̄ for the possible d vector capacities.

As, at step k, the Tk values are computed only from the Tk−1 values of the previous step, a typical sequential implementation of this algorithm would only use a single array T, and reuse the previous values to compute the new ones. In that case, one would choose an ordering of the possible vector knapsack sizes d such that newly computed Tk values do not overwrite previous Tk−1 values before they are used. For instance, if m = 1, the d scalar values would typically be processed in decreasing order. If m > 1, reverse lexicographic order would be a valid possibility.

Best performance is achieved if the set of items is sorted in such a way that the highest wk weights come first. Indeed, in such a case, the values of the components of σ decrease faster during the first iterations of the k loop, potentially enabling higher values of d̄ earlier in the algorithm execution. The ordering we chose is the decreasing order of the sum of the weight components of the items. This simple method provided significantly better results than a random ordering.
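As an illustration of this single-array scheme, a possible sequential implementation for the one-dimensional case (m = 1) is sketched below. This code is not part of the original solver; the function and variable names are illustrative.

#include <stdlib.h>

/* Sequential DP for the 1-D knapsack (m = 1), using a single array T
   scanned in decreasing capacity order, and the Toth lower bound.
   n: number of items, c: knapsack capacity,
   w[k], p[k]: weight and profit of item k (ideally sorted by decreasing w). */
int sequentialKP1D(int n, int c, const int *w, const int *p)
{
    int *T = (int *)calloc(c + 1, sizeof(int)); /* T[d]: best value for capacity d */
    int k, d, result, sigma = 0;
    for (k = 0; k < n; ++k)                     /* sigma = sum of remaining weights */
        sigma += w[k];
    for (k = 0; k < n; ++k) {
        int dmin;
        sigma -= w[k];
        dmin = c - sigma;                       /* Toth property: smaller d is useless  */
        if (dmin < w[k])
            dmin = w[k];                        /* item k does not fit below its weight */
        for (d = c; d >= dmin; --d)             /* decreasing order: T[d - w[k]] still  */
            if (T[d - w[k]] + p[k] > T[d])      /* holds the step k-1 value             */
                T[d] = T[d - w[k]] + p[k];
    }
    result = T[c];
    free(T);
    return result;
}

The parallel kernels of Section V implement the same recurrence but write into a second array, since the GPU threads update all capacities of a step concurrently.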
V. PARALLEL IMPLEMENTATIONS ON GPU

The parallel implementation of the MKP consists in parallelizing the for loop of Algorithm 1, line 7, which we call the "inner loop", as opposed to the main loop which iterates over the knapsack items, which we call the "outer loop" of the algorithm. As the computations are independent between iterations, the potential level of parallelism is the number of iterations.

In this section we propose several parallel implementations of this scheme on a GPU. We first present how GPUs work and how we can use them to solve the MKP. We then present different approaches for the parallelization of a simple knapsack problem dynamic programming solver, and finally the specific problem of efficiently managing indices with multiple dimensions on a GPU.

A. Presentation of a GPU

A GPU is a massively parallel SPMD (single program, multiple data) processor. It is generally embedded in a graphics card (the device) connected to a computer (the host) via a PCI bus interface. The device also embeds local memory, available for graphics or GPGPU use. The device memory contains the GPU program code and all the data that the GPU programs manipulate. The host can allocate blocks in the device memory, transfer data blocks from the host memory to the device memory and vice versa, as well as execute kernels (GPU programs) to manipulate the data in the device memory.

A kernel is typically executed by a large number of threads, and the GPU logic is responsible for executing the threads with the maximum parallelism the GPU is able to provide.

GPU threads run in thread groups called warps, which typically consist of 32 threads. As all threads in a warp share the same program control logic, all of its threads always execute the same instruction at the same time. This behaviour is well suited to performing the same computation on large amounts of regular data.
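As a concrete illustration of this host/device workflow, a minimal CUDA program allocates device memory, copies the data, launches a kernel over a grid of blocks, and copies the result back. This example is not taken from the paper and all names are illustrative.

#include <cuda_runtime.h>

/* Minimal illustrative CUDA example: each of the nblocks * nthreads
   threads processes one element of a vector. */
__global__ void scale(int n, float a, float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread index */
    if (i < n)
        x[i] = a * x[i];
}

void scaleOnGPU(int n, float a, float *host_x)
{
    float *dev_x;
    int nthreads = 256;
    int nblocks  = (n + nthreads - 1) / nthreads;
    cudaMalloc((void **)&dev_x, n * sizeof(float));
    cudaMemcpy(dev_x, host_x, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<nblocks, nthreads>>>(n, a, dev_x);      /* launch the kernel */
    cudaMemcpy(host_x, dev_x, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev_x);
}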
[Figure: number of blocks and number of threads per block (max 1024).]
In a first, fine-grained strategy, one GPU thread is spawned for each iteration of the inner loop, and one kernel is launched for each iteration of the outer loop. Synchronization of the threads between iterations of the outer loop is implicit, as the CPU waits for all GPU threads to finish their execution before the next outer loop iteration. This is the strategy which is commonly used in many related works on parallel dynamic programming on GPUs.

A second strategy we developed results from the observation that each thread in the fine-grained strategy performs a very small amount of work. This led us to investigate limiting the number of spawned threads while maximizing the GPU utilization and minimizing the intervention of the CPU in the algorithm. It consists in a full implementation of the whole KP algorithm in a single GPU kernel. In this configuration, the different thread blocks execute the outer loop over the different items simultaneously and share the parallel iterations of the inner loop over the different capacities. Synchronization is performed between all threads at the end of each iteration of the inner loop to handle concurrency. While thread synchronization is supported by CUDA for threads within the same block, there is no supported way to synchronize threads from different blocks. This is due to the fact that CUDA blocks can be considered as jobs which may or may not be executed in parallel, depending on the hardware they are running on. However, we implemented a barrier mechanism between concurrently running blocks, using the hardware specification of the GPU to determine the number of blocks which can actually run in parallel. In order to reduce the overhead due to the inter-block synchronization, we tried to minimize the necessary number of created blocks while preserving maximum GPU utilization. This led us to use the maximum number of threads per block, which is 1024 on all the GPUs we used for testing. A significant performance increase resulted from this method.
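The paper does not detail this barrier mechanism. As an illustration only, a common way of building such an inter-block barrier relies on global atomic counters, as sketched below; the names are ours, and the sketch is only safe when all participating blocks are simultaneously resident on the GPU, which is precisely what the hardware-dependent choice of the number of blocks is meant to guarantee.

/* Illustrative inter-block barrier built on global atomic counters.
   This is not the authors' mechanism; it only works if all nblocks
   blocks are resident on the GPU at the same time (otherwise it deadlocks). */
__device__ unsigned int bar_arrived = 0;  /* number of blocks at the barrier */
__device__ unsigned int bar_phase = 0;    /* barrier generation counter      */

__device__ void blockBarrier(unsigned int nblocks)
{
    __syncthreads();                      /* synchronize within the block    */
    if (threadIdx.x == 0) {
        unsigned int phase = bar_phase;
        __threadfence();                  /* publish this block's writes     */
        if (atomicInc(&bar_arrived, nblocks - 1) == nblocks - 1) {
            atomicAdd(&bar_phase, 1);     /* last block: release the others  */
        } else {
            while (atomicAdd(&bar_phase, 0) == phase)
                ;                         /* spin until the phase advances   */
        }
    }
    __syncthreads();
}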
The final strategy we developed, called the coarse strategy, is an intermediate solution which basically follows the scheme of the fine-grained strategy, while reducing the number of blocks to the same level as the second strategy. What basically happens is that each thread executes several iterations of the inner capacity loop of the MKP algorithm, resulting in a larger grain of parallelism. In this case there is no longer any need for explicit thread synchronization, as we get the same implicit synchronization as in the fine-grained strategy. This parallel strategy produces very similar results to the second strategy, while its design is much simpler and does not require undocumented tricks to achieve its goal. This is the strategy we propose and describe in the rest of this paper.

As described above, the coarse strategy consists in performing the main loop of Algorithm 1 (the k loop) on the CPU. At each iteration, a pre-determined number of thread blocks and threads per block is spawned on the GPU with a kernel call of the form

    kernel<<<NB,NT>>>(arg1, arg2, ...);

This corresponds to spawning NB blocks of NT threads each, running the "kernel" function on the GPU, each thread having the same list of arguments. CUDA threads also have additional implicit arguments. Those we use are necessary to identify the current thread's block number (from 0 to NB − 1) and its thread number within the block (from 0 to NT − 1). Those values are respectively stored in the blockIdx.x and threadIdx.x variables. While the GPU threads are running, the CPU waits for their termination before continuing its execution.

The CPU/GPU implementation algorithm is presented in Algorithm 2. For maximum accuracy we write it down in the CUDA language, which mostly consists of C/C++ with additions to describe GPU kernels and to spawn GPU threads.

Algorithm 2 Parallel KP, CPU/GPU, coarse strategy

// fixed number of threads per block
#define NT 1024
// fixed number of blocks
#define NB 7

// GPU kernel for the d loop
__global__ void kernel(int n, int c,
    int x0, int x1, int w, int p, int *T, int *T1) {
  int x = threadIdx.x + blockIdx.x * NT;
  int d, d_ = min(x0, x1);
  d_ -= d_ % 32;                     // align the start index on a warp boundary
  for (d = x + d_; d < c; d += NT*NB) {
    int t = T1[d];
    if (d >= x0)
      t = max(T1[d-w] + p, t);       // take item k if it improves the value
    T[d] = t;
  }
}

// Knapsack method - main function on CPU
// n: number of items
// c: knapsack capacity + 1
// T, T1: two arrays[c] in GPU memory
// it: a sorted set of n items in CPU memory
int KPKernelLoop(
    int n, int c, int *T, int *T1, item *it) {
  int result;
  int k, x0, x1, s = 0;
  // compute the sum of all weights
  for (k = 0; k < n; ++k)
    s += it[k].w;
  // set the contents of the T array to all zeros
  cudaMemset(T, 0, c*sizeof(int));
  x1 = 0;
  for (k = 0; k < n; ++k) {
    swap(T, T1);                     // exchange roles: T1 now holds step k-1 values
    s -= it[k].w;
    x0 = max(c - s, it[k].w);        // Toth bound combined with the item weight
    kernel<<<NB, NT>>>(n, c, x0, x1,
        it[k].w, it[k].p, T, T1);
    x1 = x0;
  }
  cudaMemcpy(&result, &T[c-1],
      sizeof(int), cudaMemcpyDeviceToHost);
  return result;
}
In this algorithm, the T and T1 arrays correspond to the Tk and Tk−1 arrays of Algorithm 1, and the x0 and x1 arguments are lower capacity bounds derived from the Toth property and the item weights (x1 being the bound of the previous iteration). The contents of each item (weight and profit) are passed as arguments to the kernel function, and the management of the Toth property is done by the CPU. At the beginning of the GPU kernel execution, a d_ starting index is computed according to the Toth property. To ensure some extent of memory access coalescing, the d_ index value is lowered to the nearest lower multiple
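For completeness, the following sketch shows how a host program might prepare the data and call KPKernelLoop. It is not part of the paper: the item structure layout, the sorting helper and the solveKP function are assumptions; the listing above only requires items exposing w and p fields, sorted by decreasing weight.

#include <stdlib.h>
#include <cuda_runtime.h>

typedef struct { int w, p; } item;      /* assumed layout: weight and profit */

int KPKernelLoop(int n, int c, int *T, int *T1, item *it);  /* Algorithm 2 */

/* sort the items by decreasing weight, as suggested in Section IV */
static int cmpItem(const void *a, const void *b)
{
    return ((const item *)b)->w - ((const item *)a)->w;
}

int solveKP(int n, int capacity, item *it)
{
    int c = capacity + 1;                /* Algorithm 2 expects capacity + 1 */
    int *T, *T1, result;
    qsort(it, n, sizeof(item), cmpItem);
    cudaMalloc((void **)&T,  c * sizeof(int));
    cudaMalloc((void **)&T1, c * sizeof(int));
    cudaMemset(T1, 0, c * sizeof(int));  /* KPKernelLoop zeroes T; zeroing T1 too
                                            avoids reading uninitialized values
                                            below the Toth bound */
    result = KPKernelLoop(n, c, T, T1, it);
    cudaFree(T);
    cudaFree(T1);
    return result;
}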
of 32 (the number of threads per warp). Each thread executes a loop on a d value, starting from the current thread id number added to the d_ index, and incrementing the value at the end of each iteration by the total number of running threads (NT × NB). This respects the sequential alignment requirements for coalesced memory accesses when reading or writing at each d position in the shared arrays. Misaligned memory accesses only occur, in the general case, when accessing the d − w elements of the T1 array.

In the multi-dimensional case, the capacity space must be mapped onto a one-dimensional thread index. For three constraints, for example, a capacity vector d = (d0, d1, d2) is assigned the one-dimensional index d = d0 + d1 c0 + d2 c0 c1. In this case, the associative array contains c0 c1 c2 elements, which may cause a problem for memory allocation when dealing with large knapsack sizes. Our method is therefore best suited for relatively small instances.

Given a one-dimensional index d, the values of the d vector can be obtained as

    d0 = d mod c0;    d1 = ⌊d / c0⌋ mod c1;    d2 = ⌊d / (c0 c1)⌋ mod c2.

More generally, for m constraints and a maximum capacity vector c = (c0, c1, · · · , cm−1), any capacity vector d = (d0, d1, · · · , dm−1), d ≤ c, can be assigned the 1-D index

    d = Σ_{i=0}^{m−1} di Π_{j=0}^{i−1} cj.
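These conversions can be written as the following helper functions (illustrative code, not taken from the paper's kernels; here c[i] denotes the number of possible values in dimension i):

/* Illustrative index conversions for m constraints (not from the paper). */
void index_to_vector(int d, int m, const int *c, int *dv)
{
    int i;
    for (i = 0; i < m; ++i) {
        dv[i] = d % c[i];     /* d_i = (d / (c_0 ... c_{i-1})) mod c_i */
        d /= c[i];
    }
}

int vector_to_index(int m, const int *c, const int *dv)
{
    int i, d = 0, stride = 1;
    for (i = 0; i < m; ++i) {
        d += dv[i] * stride;  /* d = sum_i d_i * prod_{j<i} c_j */
        stride *= c[i];
    }
    return d;
}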
D. Static dimension specialization

The previous method is fully dynamic, which allows it to handle any number of dimensions. However, it has one major drawback: because of its adaptability, it requires all parameters (items, MKP capacities) to be stored in GPU global memory. As GPUs are massively parallel, such memory accesses are to be avoided as much as possible. Moreover, the memory accesses for item and capacity components are not coalesced, which is detrimental to the efficiency of the method. This is particularly unfortunate as the corresponding component values are constant for all threads of a given GPU kernel invocation. Therefore, another possibility is to specialize the code for fixed values of the number of MKP constraints, in order to profit from the massive number of available registers and the fast-access shared memory of the GPU. This can be justified because, in most engineering problems, the dimensions are fixed at least for whole classes of instances. Fixing the dimension of the problem allows us to provide all the item data and capacity components as arguments of the GPU kernel, which the CUDA compiler implements using shared memory. We designed a generator program that produces versions of the MKP kernel code for specified numbers of constraints. This allows us to provide the best tailored version of the solver for any specific class of the problem.

An example implementation of a specialized kernel for the two-dimensional MKP is presented in Algorithm 4.

Algorithm 4 Parallel MKP GPU kernel, fixed number of constraints

/* Parallel MKP GPU kernel, 2D version
   C : total number of capacity combinations, C = c0*c1
   c0 : capacity of constraint 0
   ww : w0 + c0*w1
   w0, w1 : weights of the item in constraints 0 and 1
   x0, x1 : minimum index values to update the T array
   p : item profit
   T, T1 : arrays T_k and T_{k-1} */
__global__ void kernel(int C, int c0, int ww, int w0,
    int w1, int x0, int x1, int p, int *T, int *T1) {
  int x = threadIdx.x + blockIdx.x * NT;
  int d;
  for (d = x; d < C; d += NT*NB) {
    int d0 = d%c0;   // coordinate in dimension 0
    int d1 = d/c0;   // coordinate in dimension 1
    if (d0>=x0 && d1>=x1) {
      int t = T1[d];
      if (d0>=w0 && d1>=w1)
        t = max(T1[d-ww] + p, t);
      T[d] = t;
    }
  }
}

One can notice in this implementation that the item data (weights and profit) and the knapsack capacities are now stored in local variables. One can also observe that all accesses to elements d of the T and T1 arrays are coalesced, while only the occasional accesses to the d − ww elements of T1 are not coalesced, though they respect the misaligned sequential access pattern described for Algorithm 2.

Even though this specialized version contains a major improvement, significant computing power is still necessary to generate the different multi-dimensional indices, as we make heavy use of the integer divide and modulo operators to convert the global one-dimensional index into the multi-dimensional one. Such operators are known to require heavy computing power, and Nvidia recommends avoiding their use as much as possible.

The previous version of the MKP kernel can be improved in that respect by computing the coordinate indices only for the first application of the dynamic programming recurrence, and using only additions and subtractions to obtain the index values for the further iterations. This improvement is presented in the example 3D-specialized kernel of Algorithm 5. In this implementation, the divide and modulo operators are only used to determine the initial values of the d vector, potentially significantly reducing the computing overhead of the whole algorithm execution.

Algorithm 5 Parallel MKP GPU kernel, fixed number of constraints, lighter computation of indices

/* Parallel MKP GPU kernel, 3D version
   C : total number of capacity combinations = c0*c1*c2
   c0, c1 : capacities of constraints 0 and 1
   i0, i1, i2 : components of the increment vector
   ww : w0 + c0*w1 + c0*c1*w2
   w0, w1, w2 : weights of the item in constraints 0 to 2
   x0, x1, x2 : minimum index values to update T
   p : item profit
   T, T1 : arrays T_k and T_{k-1} */
__global__ void kernel(int C, int c0, int c1,
    int i0, int i1, int i2, int ww, int w0, int w1, int w2,
    int x0, int x1, int x2, int p, int *T, int *T1) {
  int x = threadIdx.x + blockIdx.x * NT;
  int d = x;
  int d0 = d%c0;           // initial coordinates, computed once
  int d1 = (d/c0)%c1;
  int d2 = d/(c0*c1);
  for (; d < C; d += NT*NB) {
    if (d0>=x0 && d1>=x1 && d2>=x2) {
      int t = T1[d];
      if (d0>=w0 && d1>=w1 && d2>=w2)
        t = max(T1[d-ww] + p, t);
      T[d] = t;
    }
    d0 += i0;              // advance the coordinates by the thread stride
    if (d0 >= c0) { d0 -= c0; d1 += 1; }
    d1 += i1;
    if (d1 >= c1) { d1 -= c1; d2 += 1; }
    d2 += i2;
  }
}
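The paper does not show how the host computes the kernel arguments for this version. Assuming the increment vector (i0, i1, i2) is the thread stride NT × NB written in the mixed-radix base (c0, c1, ...), which is consistent with the coordinate update performed in the kernel, a host-side helper could look as follows (the names are ours; NT and NB are the macros of Algorithm 2):

/* Illustrative host-side preparation of arguments for the 3-D kernel of
   Algorithm 5 (an assumption, not shown in the paper). */
void prepare3DArgs(int c0, int c1, int w0, int w1, int w2,
                   int *i0, int *i1, int *i2, int *ww)
{
    int stride = NT * NB;            /* total number of running threads */
    *i0 = stride % c0;               /* decompose the stride in the     */
    *i1 = (stride / c0) % c1;        /* mixed-radix base (c0, c1, ...)  */
    *i2 = stride / (c0 * c1);
    *ww = w0 + c0*w1 + c0*c1*w2;     /* flattened (1-D) item weight     */
}

With this decomposition, adding NT × NB to the linear index d and applying the carry corrections of the kernel keeps (d0, d1, d2) consistent with the coordinates of d.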
VI. EXPERIMENTAL RESULTS

A. Hardware specification

We tested our algorithm on three different platforms, listed in Table I. All GPUs are from Nvidia.
Table I. Testing hardware platforms. The number of GPU cores (#C) and multiprocessors (#MP) are shown.

  # | CPU                        | GPU            | #C   | #MP
  1 | AMD Sempron 140, 2.7 GHz   | GeForce GTX460 |  336 |  7
  2 | Intel Xeon X5650, 2.67 GHz | Tesla C2075    |  448 | 14
  3 | Intel Xeon E5-2620, 2 GHz  | GeForce GTX680 | 1536 |  8

[Figure: solving time (s) on the GTX 460, Tesla C2075 and GTX 680.]

KP run times and speedups (sp.) of the fine and coarse strategies:

  platform | instance | CPU    | fine  | sp. | coarse | sp.  | n/c sp.
  1        |  10000   |   26.5 |   3.2 | 8.2 |    1.6 | 16.8 | 2.03
  1        |  20000   |  107.1 |  12.8 | 8.3 |    6.3 | 17.1 | 2.05
  1        |  30000   |  241.2 |  28.7 | 8.4 |   14.0 | 17.3 | 2.06
  1        |  40000   |  431.0 |  51.2 | 8.4 |   25.1 | 17.1 | 2.04
  1        |  50000   |  653.8 |  80.0 | 8.2 |   38.9 | 16.8 | 2.06
  1        |  60000   |  943.1 | 115.2 | 8.2 |   56.5 | 16.7 | 2.04
  1        |  70000   | 1327.2 | 141.9 | 9.3 |   76.4 | 17.4 | 1.86
  1        |  80000   | 1723.4 | 185.4 | 9.3 |  100.6 | 17.1 | 1.84
  1        |  90000   | 2172.5 | 234.6 | 9.3 |  127.3 | 17.1 | 1.84
  1        | 100000   | 2693.0 | 289.8 | 9.3 |  156.0 | 17.3 | 1.86
  2        |  10000   |   14.2 |   2.2 | 6.4 |    1.1 | 13.2 | 2.07
  2        |  20000   |   59.4 |   8.9 | 6.7 |    4.3 | 13.9 | 2.08
  2        |  30000   |  134.1 |  19.8 | 6.8 |    9.5 | 14.1 | 2.08
  2        |  40000   |  248.5 |  35.3 | 7.0 |   17.0 | 14.7 | 2.08
  2        |  50000   |  400.7 |  55.1 | 7.3 |   26.4 | 15.2 | 2.09
  2        |  60000   |  571.4 |  79.3 | 7.2 |   38.0 | 15.0 | 2.09
  2        |  70000   |  774.0 | 100.8 | 7.7 |   51.8 | 15.0 | 1.95
Table III. MKP instance details.

  D   | # constraints | knapsack size | max weight | # items
  2.1 | 2 | 3000 | 20 | 600
  2.2 | 2 | 5000 | 40 | 500
  2.3 | 2 | 7500 | 50 | 600
  3.1 | 3 |  300 | 10 | 120
  3.2 | 3 |  400 | 10 | 160
  3.3 | 3 |  600 | 15 | 160
  3.4 | 3 |  600 | 12 | 200
  4.1 | 4 |   80 | 10 |  32
  4.2 | 4 |   90 | 10 |  36
  4.3 | 4 |  120 | 10 |  48
  4.4 | 4 |  120 |  8 |  60

Table IV. MKP run times with different multiple-dimension management strategies.

  D   | CPU   | GPU dynamic | GPU static | GPU light
  2.1 |  4.80 |  1.64 | 0.39 | 0.31
  2.2 | 11.09 |  3.78 | 0.92 | 0.77
  2.3 | 31.06 | 10.21 | 2.60 | 2.25
  3.1 |  2.84 |  1.32 | 0.34 | 0.21
  3.2 |  9.12 |  4.19 | 1.03 | 0.68
  3.3 | 31.52 | 14.12 | 3.49 | 2.35
  3.4 | 37.57 | 17.67 | 4.31 | 2.86
  4.1 |  1.16 |  0.67 | 0.22 | 0.11
  4.2 |  2.12 |  1.21 | 0.37 | 0.20
  4.3 |  8.88 |  5.10 | 1.52 | 0.77
  4.4 | 10.46 |  6.45 | 1.81 | 0.93
[3] R. Bellman (1958). On a Routing Problem. Quart. Appl. Math. 16, pp. 87–90.
[4] A. S. Manne and H. M. Markowitz (1957). On the solution of discrete programming problems. Econometrica 25, pp. 84–110.
[5] P. C. Gilmore and R. Gomory (1966). The theory and computation of knapsack functions. Operations Research 14, pp. 1045–1075.
[6] P. Toth (1980). Dynamic Programming Algorithms for the Zero-One Knapsack Problem. Computing 25, pp. 29–45.
[7] F. Glover and G. A. Kochenberger (1996). Critical event tabu search for multidimensional knapsack problems. In Metaheuristics: Theory and Applications, I. H. Osman and J. P. Kelly, eds., Kluwer Academic Publishers, pp. 407–427.
[8] P. C. Chu and J. E. Beasley (1998). A genetic algorithm for the multiconstrained knapsack problem. Journal of Heuristics 4, pp. 63–86.
[9] D. Gusfield (1999) [1997]. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press. ISBN 0-521-58519-8.
[10] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein (2001). 15.2: Matrix-chain multiplication. In Introduction to Algorithms, Second Edition, MIT Press and McGraw-Hill, pp. 331–338. ISBN 0-262-03293-7.
[11] W. S. Martins, J. B. Del Curvillo, F. J. Useche, K. B. Theobald and G. R. Gao (2001). A Multithreaded Parallel Implementation of a Dynamic Programming Algorithm for Sequence Comparison. Proceedings of the Pacific Symposium on Biocomputing 2001, pp. 311–322. Hawaii, January 2001.
[12] J. Sadecki (2002). Parallel dynamic programming algorithms: Multitransputer systems. International Journal of Applied Mathematics and Computer Science 12(2), pp. 241–255.
[13] C. J. Thompson, S. Hahn, M. Oskin (2002). Using Modern Graphics Architectures for General-Purpose Computing: A Framework and Analysis. Proceedings of the 35th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-35), pp. 306–317. Istanbul, Turkey, November 2002.
[14] H. Kellerer, U. Pferschy and D. Pisinger (2004). Knapsack Problems. Springer.
[15] A. Fréville (2004). The multidimensional 0–1 knapsack problem: an overview. European Journal of Operational Research 155, pp. 1–21.
[16] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. E. Lefohn and T. J. Purcell (2005). A Survey of General-Purpose Computation on Graphics Hardware. In Eurographics 2005, State of the Art Reports, August 2005, pp. 21–51.
[17] G. Tan, N. Sun, and G. R. Gao (2007). A Parallel Dynamic Programming Algorithm on a Multi-core Architecture. 19th Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA'07), 2007, pp. 135–144.
[20] T. V. Luong, N. Melab, and E. Talbi (2010). Large neighborhood local search optimization on graphics processing units. In Proc. IPDPS Workshops 2010, pp. 1–8.
[21] J. Puchinger, G. R. Raidl and U. Pferschy (2010). The multidimensional knapsack problem: Structure and algorithms. INFORMS Journal on Computing 22(2), pp. 250–265.
[22] M. E. Lalami, V. Boyer and D. El-Baz (2011). Efficient Implementation of the Simplex Method on a CPU-GPU System. Proceedings of the 24th IEEE International Parallel & Distributed Processing Symposium, Anchorage, USA, May 2011.
[23] V. Boyer, D. El Baz and M. Elkihel (2011). Dense Dynamic Programming on Multi GPU. Proceedings of the 19th International Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP-11), pp. 545–551. Ayia Napa, Cyprus, February 2011.
[24] T. V. Luong, N. Melab and E. Talbi (2011). GPU-based Approaches for Multiobjective Local Search Algorithms. A Case Study: the Flowshop Scheduling Problem. EvoCOP'11, Proceedings of the 11th European Conference on Evolutionary Computation in Combinatorial Optimization, pp. 155–166. Springer-Verlag, Berlin, Heidelberg, 2011.
[25] K. Nishida, Y. Ito and K. Nakano (2011). Accelerating the Dynamic Programming for the Matrix Chain Product on the GPU. Second International Conference on Networking and Computing, pp. 320–326. Osaka, Japan, 2011.
[26] C.-C. Wu, J.-Y. Ke, H. Lin and W.-C. Feng (2011). Optimizing Dynamic Programming on Graphics Processing Units via Adaptive Thread-Level Parallelism. 17th International Conference on Parallel and Distributed Systems, pp. 96–103. Tainan, Taiwan, 2011.
[27] K. Nishida, K. Nakano and Y. Ito (2012). Accelerating the Dynamic Programming for the Optimal Polygon Triangulation on the GPU. Lecture Notes in Computer Science 7439, pp. 1–15, 2012.
[28] V. Boyer, D. El Baz and M. Elkihel (2012). Solving knapsack problems on GPU. Computers & Operations Research 39, pp. 42–47.
[29] A. Boukedjar, M. E. Lalami, D. El-Baz (2012). Parallel Branch and Bound on a CPU-GPU System. Proceedings of the 25th IEEE International Parallel & Distributed Processing Symposium, Shanghai, China, May 2012.
[30] R. Mansini and G. Speranza (2012). CORAL: An Exact Algorithm for the Multidimensional Knapsack Problem. INFORMS Journal on Computing 24(3), pp. 399–415.
[31] Nvidia CUDA Programming Guide, version 4.2. Nvidia, April 2012.
[32] CULA GPU-accelerated linear algebra libraries. http://www.culatools.com/
[33] ViennaCL, a linear algebra library for computations on GPUs and multi-core CPUs. http://viennacl.sourceforge.net/
[34] Our MKP knapsack instances. http://fgalea.free.fr/mkp.html