Introduction to CUDA
P. Kestener
Maison de la Simulation
Chamonix, June 2011
Contents...
• hardware architecture
• many links to on-line resources (courses, source codes, articles, ...)
Summary
Architectures, Parallelism and Moore law
Why multi-core ?
GPU computing
FPGA co-processor
Nvidia / CUDA
Hardware architecture
Programming model
CUDA : optimisation
GPU computing : perspectives
Installing CUDA
CFD example/demo
PyCUDA / PyOpenCL
CUDA / MPI
References
• Before the 1990s, parallel computers were rare and available only for the most critical problems.
• Example from film rendering: computing the 114,000 frames (77 minutes) of an animated feature required 800,000 computer hours; each frame consists of 300 MBytes of data.
• Software for off-line rendering : e.g. RenderMan (http://renderman.pixar.com) by Pixar, from modelling to rendering.
• Hardware rendering : OpenGL, a low-level API used in real-time rendering (i.e. done in dedicated hardware like a GPU), gaming industry.
• Parallel computing relies on exploitable concurrency : the problem can be decomposed into subproblems that can safely execute at the same time (Mattson et al.).
• Concurrency : property of a system in which several computations are executing simultaneously.
Moore's law : the number of transistors that can be placed inexpensively on an integrated circuit doubles approximately every two years.
Moore's law continues with:
• technology scaling (32 nm in 2010),
• new architectures,
• reducing the energy consumed per logic operation to keep power dissipation within limits.
Pollack's rule
• If you double the logic in a processor core, it delivers only about 40% more performance.
• For example, two smaller processor cores, instead of a large monolithic processor core, can potentially provide 70-80% more performance, as compared to only 40% from a comparably large monolithic core.
Multi-core benefits:
• each processor core can be run at its own optimized supply voltage and frequency;
• easier to load balance among processor cores to distribute heat across the die;
• can potentially produce lower die temperatures, improving reliability and leakage.
• More transistors ? What's the purpose ? How to use them ?
  - keep transistor density increasing (more and more difficult) : 32 nm in 2010
• Utilize the transistors on chip efficiently:
  - data-level parallelism (SIMD, vector units) : SSE, Cell SPE, GPU !
  - thread-level parallelism : hardware multi-threading
http://www.ugrad.cs.ubc.ca/~cs448b/2010-1/lecture/2010-09-09-ugrad.pdf
• More transistors = more computing power !
• More transistors ? What's the purpose ? How to use them ?
  - hardware multi-threading (up to 48 active threads per core in Fermi)
http://www.ugrad.cs.ubc.ca/~cs448b/2010-1/lecture/2010-09-09-ugrad.pdf
Multi-core is everywhere, not only in supercomputers:
• multiple small cores at a lower clock frequency are more power-efficient than a single large core; corollary : the parallel version of a code can be much more energy-efficient than the sequential one run at the same frequency.
• Even smartphone processors become multi-core: multiple small cores at a lower frequency are more efficient than a single core (increased battery life-time, ...).
• Modern smartphones perform nearly 100 GOPS within a limited power budget.
http://www.date-conference.com/proceedings/PAPERS/2009/DATE09/PDFFILES/10.1
S. Borkar and A. Chien, The Future of Microprocessors, ACM Communications
Smartphone workloads:
• Application processing : user interface, address books, diaries, sms, java, internet browsing, email, gaming, document and spreadsheet editing, photo handling.
• Radio processing : demodulation, decoding, protocol, ...
• The solution has to be multicore, as multiple small cores at a lower frequency fit the power budget better.
• Thread - hardware / software definition(s):
  - execution unit
  - processing unit
  - lightweight process (Unix jargon) : in software, switching between threads does not involve changing the memory address space.
• SMT (Simultaneous Multi-Threading) : one physical processor appears as several logical processors.
  - Each logical processor maintains a complete set of the architecture state (general-purpose registers, control registers, ...).
  - Logical processors share nearly all other resources, such as caches, execution units, branch predictors, control logic, and buses.
  - Costs of a 2-way Hyper-Threading : chip area (+5%), performance (+15 to 30%).
• CMT (Chip-level Multi-Threading) : multiple cores per chip.
• ( academic ? ) Implicit multi-thread processor : dynamically generate threads from single-threaded programs and execute them in parallel : Multiscalar, dynamic multithreading, speculative multithreading, ...
• Wikipedia [Multi-threading] : (...) multithreading aims to increase utilization of a single core.
• Example : hide memory access latency by thread-context switching (a key feature behind modern GPU high performance).
Why is there a large performance gap between manycore GPUs and general purpose multicore CPUs ?
• Different goals produce different designs:
  - the GPU assumes the work load is highly parallel
  - the CPU must be good at everything, parallel or not
Why is there a large performance gap between manycore GPUs and general purpose multicore CPUs ?
• CPU design goal : optimize the architecture for sequential code performance : minimize the latency experienced by one thread.
Why is there a large performance gap between manycore GPUs and general purpose multicore CPUs ?
• GPU design goal : maximize the throughput of all threads.
  - # threads in flight limited by resources => lots of resources (registers, bandwidth, etc.)
  - multithreading can hide latency => skip the big caches
  - share control logic across many threads
ref: Jared Hoberock, Stanford, cs193, http://code.google.com/p/stanford-cs193g-sp2010/
Why is there a large performance gap between manycore GPUs and general purpose multicore CPUs ?
• The fast-growing game industry put tremendous economic pressure on designing architectures optimized for maximum floating point operations per chip area / power budget.
• The GPU takes advantage of a large number of execution threads to find work to do when other threads are waiting for long-latency memory accesses, thus minimizing the control logic required for each execution thread.
Why is there a large performance gap between manycore GPUs and general purpose multicore CPUs ?
• GPU: much more area dedicated to floating point.
• GPUs are numeric computing engines that will not perform well on some tasks for which CPUs are optimized. Need to take advantage of both !
reference: D. Kirk and W.W. Hwu, Programming massively parallel processors, Morgan Kaufmann eds.
J. Dongarra, CEA-EDF-INRIA summer school, 2011
Why is there a large performance gap between manycore GPUs and general purpose multicore CPUs ?
• GPU: amortize the cost / complexity of managing an instruction stream across many ALUs.
reference: D. Kirk and W.W. Hwu, Programming massively parallel processors, Morgan Kaufmann eds.
J. Dongarra, CEA-EDF-INRIA summer school, 2011
• The cost of software development is best justified by a very large customer population; many developers stayed away from parallel programming projects for some time because they felt the potential market was too small.
reference: D. Kirk and W.W. Hwu, Programming massively parallel processors, Morgan Kaufmann eds.
• The programmer structures the code and data to solve these sub-problems concurrently.
• The goals of parallel computing are:
  - to solve problems in less time, and/or
  - to achieve better solutions.
The problems must be large enough to justify parallel computing and to exhibit exploitable concurrency.
reference : http://courses.engr.illinois.edu/ece498/al/Syllabus.html
Ideal parallel programmer skills:
• Computer architecture:
  - memory organization,
  - SIMT (single instruction multiple threads) versus SPMD (single program multiple data) versus SIMD,
  - floating point accuracy
• Programming models and compilers:
  - parallel execution models,
  - types of available memories,
  - array data layout,
  - loop transformations
• Algorithm techniques:
  - tiling, cutoff, binning
• Domain knowledge:
  - numerical methods, model accuracy required
reference: Kirk, Hwu, Programming massively parallel processors, chapter 10
Summary
Architectures, Parallelism and Moore law
Why multi-core ?
GPU computing
FPGA co-processor
Nvidia / CUDA
Hardware architecture
Programming model
CUDA : optimisation
GPU computing : perspectives
Installing CUDA
CFD example/demo
PyCUDA / PyOpenCL
CUDA / MPI
References
History
http://www.irisa.fr/archi09/defour.pdf
• Most super-computers in the 80s / 90s were vector processors.
• CRAY-1: 1976, 80 MHz, 64-bit data, 24-bit address, vector register file, 160 MIPS, 250 MFLOPS, 8 MB RAM, 5.5 tonnes.
• The fall of vector processors (end of the 80s - beginning of the 90s) : CMOS microprocessors vs the difficulty of fitting a vector processor on a single chip; cache size increase in scalar processors; programming vector processors required assembly language skills.
• Basics of shaders (programmable functionalities in the graphics pipeline)
• Overview of GPU architectures
• Legacy GPGPU (before CUDA, ∼ 2004)
1999
• graphics pipeline : conceptual model of stages that graphics data are sent through.
http://www.nvidia.com/object/Technical_Brief_TandL.html
http://www.cs.unr.edu/~fredh/papers/thesis/023-crow/GPUFinal.pdf
• Design specification of a graphics API (the API should be cross-vendor, cross-platform, and provide hardware abstraction).
http://www.nvidia.com/object/Technical_Brief_TandL.html
http://www.cs.unr.edu/~fredh/papers/thesis/023-crow/GPUFinal.pdf
• 1999 : transfer of transformation (T&L) operations to the GPU (4x4 matrix transforms applied to vertices).
http://www.opengl.org/documentation/specs/version1.1/state.pdf
[Figure: The OpenGL graphics system diagram, Version 1.1. Copyright 1996 Silicon Graphics, Inc. All rights reserved.]
• "All processors aspire to be general-purpose." - Tim Van Hook, Graphics Hardware 2001
• 2001 : Nvidia GeForce3
• programmable graphics pipeline : pixel and vertex shaders written in a low-level language; pros : high flexibility; high-level shading languages later brought standards and faster development.
• "All processors aspire to be general-purpose." - Tim Van Hook, Graphics Hardware 2001
• programmable graphics pipeline : pixel and vertex shaders written in a low-level language; pros : high flexibility; high-level shading languages later brought standards and faster development.
• GPU : Vertex Processors (MIMD), Fragment Processors (SIMD), 32-bit float.
Floating point on GPUs:
• IEEE754 standard written in the mid-80s.
• Intel 80387 : first floating-point coprocessor compliant with IEEE754; the algorithms are quite complex to handle/implement.
• FP16 in 2000.
• FP32 in 2003-2004 : simplified IEEE754 standard; floating point rounding modes are complex and costly in terms of transistor count.
• GPUs were designed for computer graphics (output streams of pixels), not for general purpose computing tasks.
• There is a need for translating computational concepts into GPU graphics concepts.
http://www.mathematik.uni-dortmund.de/~goeddeke/gpgpu/tutorial.html
Have also a look at chapter 31 in GPUGems2, Mapping Computational Concepts to GPUs,
http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter31.html
CPU / GPU memory models:
• CPU memory can be read / written anywhere in a program.
• Vertex programs are not allowed to randomly index into input vertices.
• GPU textures (memory buffers) are read-only, but random-access reads are allowed.
Legacy GPGPU : General Purpose computations on GPU
• CPU - GPU analogies
• GPGPU concept 1 : arrays = textures
  - create a frame buffer object for off-screen rendering; random write access not allowed
  - memory : CPU array → GPU GL_TEXTURE_2D (read-only or write-only)
General Purpose computations on GPU
• GPGPU concept 2 : CPU programs, loops = GPU fragment shaders (a fragment program runs on a stream of fragments).
• GPGPU concept 3 : CPU computing = GPU drawing : the vertex processors transform the geometry, and the rasterizer determines which pixels in the output buffer it covers and generates a fragment for each one.
Legacy GPGPU :
http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter31.html
• Nvidia GeForce 8800, 2006, introduces a unified architecture.
• First generation with hardware features designed with GPGPU in mind : almost full support of the IEEE754 standard for single precision floating point, random read/write in external RAM.
• New hardware architecture generation : CUDA (Compute Unified Device Architecture) : a complete toolchain (compiler, SDK, libraries like cuFFT), a C-like programming language.
Application domains:
• CFD : http://www.nvidia.com/object/computational_fluid_dynamics.html
• Molecular Dynamics / Quantum chemistry
• Financial computing : http://people.maths.ox.ac.uk/gilesm/hpc/slides.html
• Signal/Image processing, Computer vision, etc.
Conference GTC (GPU Technology Conference 2010) : presentation archive at
http://www.nvidia.com/object/gtc2010-presentation-archive.html
Have a look at the CUDA show case:
http://www.nvidia.com/object/cuda_apps_flash_new.html
Amdahl's law : R = 1 / ((1 - p) + p/N), where p is the fraction of the code that can be parallelized and N is the number of processors.
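A quick worked example of the formula above: with p = 0.9 (90% of the run time parallelizable) and N = 240 cores, R = 1 / ((1 - 0.9) + 0.9/240) ≈ 9.6, so the speed-up is capped near 10x by the serial fraction, no matter how many cores the GPU provides.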
Embedded GPU computing:
• OpenGL ES (Open Graphics Library for Embedded Systems) : the 3D API in Symbian, Android, iPhone SDK, ...; programmable hardware.
• New systems-on-chip (CPU+GPU on a single die), tablet-PC market, entertainment centers, ...
http://tegradeveloper.nvidia.com/tegra/
FPGA co-processor
• FPGA : programmable integrated circuits, firmware (hardware with software flexibility), an array of configurable logic blocks (each encodes an n-input boolean function) + routing tracks.
• since ∼ 1985
• low-level HDL (Hardware Description Language), e.g. VHDL
• IP (Intellectual Property) libraries : interfaces (Ethernet, Gigabit transceiver, etc.) and computing blocks (PowerPC CPU, DSP blocks, small floating-point ALU (2003)).
• since 2004-2005, emergence of high-level languages (C-like) + small boards (with PCI-express form factor) + driver.
• a high-level language, as always, allows abstraction, hiding the low-level design flow (logic synthesis into RTL, placement, routing, ...), to focus on the scientific algorithm.
• Appealing, but the design is too complex (too low-level, i.e. hardware knowledge is required to make efficient use of them), and FPGAs are expensive...
• Bibliographic reference : Reconfigurable Computing, Accelerating Computation with FPGAs, by M. Gokhale and P.S. Graham.
• Wikipedia : Reconfigurable computing
• Workshop HPRCTA'10 : Reconfigurable Computing and the possible merging of CPU/GPU architectures and development tools.
Applications :
• computing applications : SDR (Software Defined Radio : telecommunications, radio-astrophysics, military)
• high-throughput networking / storage applications : low latency
• commercial high-level tools (C-to-RTL) : Mitrion-C, ImpulseC, ...; the problems :
• Manufacturers involved in HPC ?
Summary
Architectures, Parallelism and Moore law
Why multi-core ?
GPU computing
FPGA co-processor
Nvidia / CUDA
Hardware architecture
Programming model
CUDA : optimisation
GPU computing : perspectives
Installing CUDA
CFD example/demo
PyCUDA / PyOpenCL
CUDA / MPI
References
Why should we use GPUs ?
• Brute force (high peak GFLOPS rate)
• lower cost per GFLOPS
• massively multi-threaded architecture
• Compare CPU/GPU : much more of the die area is dedicated to computing.
• The gaming mass market behind the scenes (a viable economic model ?) drives GPU development.
• Accessibility : the programming model is simple compared to earlier GPGPU approaches.
A GPU Computing dedicated system (2009)
• performance increase
• electric power saving
• improved reliability (e.g. ECC)
http://gpgpu.univ-perp.fr/images/3/36/GPU_Reliability_2008.pdf
• CPU-GPU link, PCI-express bus x16, Gen2 : bandwidth BW = 16 × 2 × 250 MBytes/s = 8 GBytes/s (i.e. transferring 1 GByte takes at least ∼ 125 ms).
CUDA : GT200 (mid 2008), hardware capability 1.3
• 10 TPC - 3 SM - 8 SP-cores
• 10 TPC × 3 SM × 8 SP-cores = 240 cores
CUDA : Fermi (spring 2010), hardware capability 2.0
• 16 SM - 32 SP-cores
• 16 SM × 32 SP-cores = 512 cores
Programming model:
• PTX (Parallel Thread Execution)
• write the program once, for one thread (as one would for one pixel in a shader)
• a block is a logical array of threads, indexed with threadIdx (built-in variable)
• a grid is a logical array of blocks, indexed with blockIdx (built-in variable)
• Read chapters 1, 2, 3 of ptx_isa_2.2.pdf in the CUDA documentation.
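As an illustration of the indexing scheme above, a minimal sketch (the kernel name and array are placeholders): each thread combines the built-in variables into a unique global index.

// hypothetical kernel: each thread computes its own global index
__global__ void fillIdentity(int *out)
{
    // blockIdx selects the block in the grid, threadIdx the thread in the block
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    out[gid] = gid;
}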
• threads from a given block can:
  - synchronize their execution
  - exchange data using a shared memory
• blocks are independent : no synchronization is possible between blocks (see the sketch below).
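A minimal sketch of intra-block cooperation, assuming a block size of 256 (the kernel name and buffers are illustrative only): threads stage data in shared memory, synchronize, then read a neighbour's value.

// hypothetical kernel: reverse the elements handled by one block
__global__ void reverseInBlock(const float *in, float *out)
{
    __shared__ float tile[256];            // shared among threads of this block
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;
    tile[tid] = in[gid];                   // each thread writes one element
    __syncthreads();                       // wait until the whole tile is filled
    out[gid] = tile[blockDim.x - 1 - tid]; // safe: data written by another thread
}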
• each block of threads can be scheduled on any multiprocessor, sequentially (on the same multiprocessor) or concurrently;
• the independence between blocks of threads gives the programming model its scalability.
• CPU and GPU have physically separate memory spaces; the GPU can't initiate transfers, access the disk, ...
• explicit management for allocation, free, and transfers:
  - you can't tell from the pointer value whether the address is in CPU or GPU memory space !!
  - dereferencing a CPU pointer inside a routine running on the GPU ⇒ CRASH !
GPU memory hierarchy:
• on-chip memory : 8-16 kB/multiprocessor (shared memory)
• off-chip memory :
  - global : RW, up to GBytes, slow
  - constant : RO, 64 kB/chip, __constant__ declared variables, very fast (1-4 cycles), cached
  - texture : RO, located in global memory, cached
  - local : RW, slow, use controlled by the compiler, used if no more registers are available
Legacy vs CUDA memory model:
• local memory (shared among threads)
• Global memory (historical):
  - read-only during computation
  - write-only at the end of computation (precomputed address)
• Global memory (new, i.e. CUDA):
  - allows general scatter/gather (read/write)
  - take care: there are no collision rules; atomic operations are needed
GPU
allocation/release Architectures,
Parallelism and Moore
law
Why multi-core ?
What’s a thread ?
GPU computing
Ï Host (CPU) manages device (GPU) memory: History
FPGA co-processor
• cudaError_t cudaMalloc(void ** pointer, size_t
Nvidia / CUDA
nbytes) Why shoud we use GPU ?
• cudaError_t cudaMemset(void * pointer, intvalue, Hardware architecture
Programming model
size_tcount) CUDA : optimisation
• cudaError_t cudaFree(void* pointer) GPU computing :
perspectives
CFD example/demo
int n = 1024;
PyCUDA / PyOpenCL
int nbytes = 1024*sizeof(int);
CUDA / MPI
int* d_a = 0;
References
cudaMalloc( (void**)&d_a, nbytes);
cudaMemset( d_a, 0, nbytes);
cudaFree(d_a);
81 / 170
Memory copy
• cudaError_t cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction)
  - blocks the CPU thread until all bytes have been copied
  - doesn't start copying until previous CUDA calls complete
• enum cudaMemcpyKind : cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice (a short round-trip sketch follows below).
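A minimal round-trip sketch using the calls above (the array size is arbitrary):

int n = 1024;
size_t nbytes = n * sizeof(float);
float *h_a = (float*) malloc(nbytes);   // host buffer
float *d_a = 0;                         // device buffer
cudaMalloc( (void**)&d_a, nbytes );
cudaMemcpy( d_a, h_a, nbytes, cudaMemcpyHostToDevice ); // CPU -> GPU
// ... launch kernels operating on d_a ...
cudaMemcpy( h_a, d_a, nbytes, cudaMemcpyDeviceToHost ); // GPU -> CPU
cudaFree( d_a );
free( h_a );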
Execution model:
• GPU execution requires dispatching the kernel code together with an execution configuration (grid and block sizes).
  - Concurrent kernel execution (hardware capability 2.0)
• a hardware unit creates, manages and organizes threads (scheduling); threads are grouped into warps (groups of 32 threads with consecutive indexes) ⇒ hardware resources sharing !
• branch divergence : a warp executes one common instruction at a time; when threads of a warp diverge (via a data-dependent conditional branch, e.g.), the warp serially executes each branch path taken, and when all paths complete, the threads converge back to the same execution path. Performance drops. See the reduction code example, and the sketch below.
credits : W. Fung, Dynamic warp formation and scheduling for efficient GPU flow control
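A minimal sketch of a divergent kernel (illustrative only): within one warp, even and odd threads take different paths, so the two branches are serialized.

__global__ void divergent(float *data)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid % 2 == 0)          // even/odd split inside every warp
        data[gid] *= 2.0f;     // path 1, executed with odd threads masked off
    else
        data[gid] += 1.0f;     // path 2, executed with even threads masked off
}

// A branch whose condition is uniform per warp, e.g. (gid / warpSize) % 2 == 0,
// would not diverge.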
• The execution context (PC, registers, ...) of each warp is maintained on-chip during the entire lifetime of the warp !
• GPU threads are very lightweight (creation and context switching are almost free, i.e. they only take a few cycles).
• Read CUDA_C_Programming_Guide.pdf, chapter 4.
• The number of blocks and warps that can reside and be processed together on a multiprocessor depends on the amount of registers and shared memory used by the kernel and the amount of registers and shared memory available on the multiprocessor. (CUDA Programming Guide, section 4.2)
CUDA_C_Programming_Guide.pdf, appendix G
For a given kernel (given number of registers per thread, given shared memory per block), this table helps understanding which hardware resource will be exhausted first when changing run-time parameters (grid size, block size).
[Figures: CUDA compilation and execution flow, based on Brent Oster's slides (NVIDIA)]
http://www.nvidia.com/docs/IO/55972/220401_Reprint.pdf
• NVCC : a compiler driver : the front-end (e.g. nvcc) generates PTX instructions which, in a second stage, are optimized and translated into native hardware instructions (depending on hardware capability).
• PTX : Parallel Thread eXecution : a virtual machine instruction set.
• CUBIN : the binary object (low-level machine instructions).
• the graphics driver can also convert PTX into CUBIN (Just-In-Time optimisation) and issue a PCI-express upload to the GPU.
nvcc documentation : nvcc.pdf
• Function and type qualifiers
__global__ void KernelFunc(...); // kernel callable from host
__device__ void DeviceFunc(...); // function callable on device
__device__ int GlobalVar;        // variable in device memory
__host__ void HostFunc(...);     // function callable on host
• built-in variables : threadIdx and blockDim, blockIdx and gridDim
• <<< ... >>> is used to set the grid and block sizes (it can also set the shared memory size per block); see the launch sketch below.
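A minimal launch sketch combining these extensions (the kernel and sizes are illustrative):

// hypothetical kernel: scale a vector by 2
__global__ void scale(float *d_data)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    d_data[gid] *= 2.0f;
}

// host side (d_data: device pointer allocated elsewhere):
// 16 blocks of 256 threads each cover 4096 elements
dim3 block(256);
dim3 grid(16);
scale<<<grid, block>>>(d_data);
cudaThreadSynchronize(); // CUDA 3.x-era API: wait for kernel completion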
• specific data types (vector_types.h) : example dim3, designed for memory alignment.
• in a CUDA kernel, automatic variables (i.e. without any type qualifier) usually reside in registers; the compiler can choose to place them in local memory (external RAM; beware of the performance drop, you might need to rewrite the kernel).
• specific data types (vector_types.h) : example dim3, designed for memory alignment.
A __constant__ variable:
• resides in constant memory (external DRAM, but cached, so very fast, a few clock cycles);
• has the lifetime of the application (no need to re-init between different kernel calls);
• has static storage (take care if you want to use it in multiple compilation units; you can't use the extern keyword);
• if you want to use the same constant variable in multiple compilation units (different .cu files), you need to init constant memory in each of them, i.e. call cudaMemcpyToSymbol (see the sketch below).
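A minimal sketch of constant memory initialization (the variable name and size are illustrative):

// in each .cu file that uses it
__constant__ float coeffs[16];

// host side: fill constant memory before launching kernels
float h_coeffs[16] = { /* ... */ };
cudaMemcpyToSymbol(coeffs, h_coeffs, 16 * sizeof(float));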
Driver API vs Runtime API:
file:///usr/local/cuda32/doc/html/modules.html
• Runtime API : high-level, built on top of the driver API.
• GPU Context ∼ CPU process
• GPU Module ∼ CPU dynamic library
• Driver and Runtime API can be used/linked together.
• example : sum of vectors

// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

• invocation

int main() {
    // allocate and initialize host (CPU) memory (initialization elided)
    float *h_A = ..., *h_B = ..., *h_C = ...;

    // allocate device (GPU) memory
    float *d_A, *d_B, *d_C;
    cudaMalloc( (void**) &d_A, N * sizeof(float));
    cudaMalloc( (void**) &d_B, N * sizeof(float));
    cudaMalloc( (void**) &d_C, N * sizeof(float));

    // copy host memory to device
    cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
    cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );

    // launch N/256 blocks of 256 threads each (N assumed multiple of 256)
    vecAdd<<< N/256, 256 >>>(d_A, d_B, d_C);

    // copy the result back to host and release device memory
    cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost );
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
A CUDA program should take into account the following constraints:
• memory coalescing : threads with consecutive indexes should access consecutive memory addresses for best performance (see the sketch after this list);
• efficient use of parallelism;
• try to avoid shared memory bank conflicts.
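A minimal sketch contrasting coalesced and strided access patterns (both kernels are illustrative):

// coalesced: thread k of a warp reads address base + k
__global__ void copyCoalesced(const float *in, float *out)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    out[gid] = in[gid];       // consecutive threads, consecutive addresses
}

// strided: consecutive threads hit addresses 32 floats apart,
// so the hardware issues many separate memory transactions
__global__ void copyStrided(const float *in, float *out)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    out[gid] = in[gid * 32];  // assumes `in` is large enough for the stride
}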
Data-parallel primitive libraries:
• Thrust : http://code.google.com/p/thrust
• cudpp : http://gpgpu.org/developer/cudpp
• comparison Thrust/CUDPP : http://code.google.com/p/thrust/wiki/ThrustAndCUDPP
A short Thrust sketch follows below.
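A minimal Thrust sketch (a sort on the GPU; the data are arbitrary):

#include <cstdlib>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>

int main()
{
    thrust::host_vector<int> h(1 << 20);          // 1M integers on the host
    for (size_t i = 0; i < h.size(); i++)
        h[i] = std::rand();
    thrust::device_vector<int> d = h;             // implicit host-to-device copy
    thrust::sort(d.begin(), d.end());             // parallel sort on the GPU
    thrust::copy(d.begin(), d.end(), h.begin());  // copy the result back
    return 0;
}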
Titane: ∼ 100 CPU-TFLOPS + ∼ 200 GPU-TFLOPS (Tesla)
Architectures,
Parallelism and Moore
law
Why multi-core ?
What’s a thread ?
GPU computing
Ï What happens when multiple threads of a warp try to modify the same memory address ? Without atomics, the read-modify-write sequences of different threads can interleave and produce a wrong result.
Ï example atomic function: int atomicAdd(int* address, int val); : no other thread can access this address until the operation is complete
Ï See CUDA programming guide, Section B.5 (Memory fence functions)
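A minimal sketch of my own showing a typical use of atomicAdd (a histogram, where several threads may increment the same bin):

__global__ void histogram(const unsigned int *data, int *bins, int n, int nbins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i] % nbins], 1);  // race-free increment
}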
Device-side printf:
Ï int printf(const char *format[, arg, ...]);
Ï Use with care ! Don't forget to reduce the block sizes !
Ï See CUDA programming guide, section B.14
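A minimal sketch of my own (in-kernel printf requires a device of compute capability ≥ 2.0):

#include <cstdio>

__global__ void hello()
{
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main()
{
    hello<<<2, 4>>>();        // deliberately tiny launch: 2 blocks of 4 threads
    cudaDeviceSynchronize();  // wait for the kernel and flush the printf buffer
    return 0;                 // (cudaThreadSynchronize() on older toolkits)
}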
Linear algebra and utility libraries for CUDA:
Ï cuBLAS: CUDA-based Basic Linear Algebra Subprograms by NVIDIA
Ï culatools: GPU linear algebra package, ∼ cuBLAS
Ï cusp: sparse linear algebra on GPU
Ï libra: GPU SDK for Matlab
Ï cudpp: data parallel primitives for GPU (see also Thrust)
http://www.nvidia.com/object/tesla_software.html
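A minimal cuBLAS sketch of my own, using the legacy cublas.h API that ships with toolkit 3.2, computing y ← α·x + y (saxpy) on the GPU:

#include <cublas.h>
#include <stdio.h>

int main()
{
    const int n = 1024;
    float h_x[1024], h_y[1024];
    for (int i = 0; i < n; i++) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    cublasInit();                              // initialize the library
    float *d_x, *d_y;
    cublasAlloc(n, sizeof(float), (void**)&d_x);
    cublasAlloc(n, sizeof(float), (void**)&d_y);
    cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);
    cublasSetVector(n, sizeof(float), h_y, 1, d_y, 1);

    cublasSaxpy(n, 2.0f, d_x, 1, d_y, 1);      // y = 2*x + y

    cublasGetVector(n, sizeof(float), d_y, 1, h_y, 1);
    printf("y[0] = %f\n", h_y[0]);             // expect 4.0

    cublasFree(d_x); cublasFree(d_y);
    cublasShutdown();
    return 0;
}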
Ï Algorithm optimizations that are hard and labor intensive:
• Locality and regularization transformations
Ï GPU global/external memory allocation is costly
Based on K. Skadron's slides (University of Virginia)
Ï transfer time CPU ←→ GPU: think about the transfer overhead before off-loading computations to the GPU
Ï http://moss.csc.ncsu.edu/~mueller/cluster/nvidia/GPU+CUDA.pdf : illustrations of coalescing problems, shared memory bank conflicts, ... from slide 142.
Figure G1, /usr/local/cuda32/doc/CUDA_C_Programming_Guide.pdf, SDK 3.2 [figure]
Ï Array of Structures (AoS) / Structure of Arrays (SoA)
http://perso.ens-lyon.fr/sylvain.collange/talks/calcul_gpu_scollange.pdf
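A small sketch of my own contrasting the two layouts (all names are assumptions): with SoA, consecutive threads reading the same field access consecutive addresses, which coalesces; with AoS, they stride by the size of the struct.

// Array of Structures: thread i reads p[i].x, so consecutive threads
// are sizeof(Particle) bytes apart: poorly coalesced.
struct Particle { float x, y, z, w; };

__global__ void push_aos(Particle *p, float dx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        p[i].x += dx;
}

// Structure of Arrays: thread i reads x[i], so consecutive threads
// are sizeof(float) bytes apart: fully coalesced.
__global__ void push_soa(float *x, float dx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] += dx;
}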
The CUDA profiler:
Ï environment variables:
• CUDA_PROFILE or COMPUTE_PROFILE : set to 1 to enable profiling
• CUDA_PROFILE_LOG or COMPUTE_PROFILE_LOG : results file name (default is launch directory/cuda_profile.log)
• CUDA_PROFILE_CONFIG or COMPUTE_PROFILE_CONFIG : configuration file; list names of performance counters for logging information
Ï these variables are already set for the hands-on tutorial.
Example configuration file content:
timestamp
gld_incoherent
gst_incoherent
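A possible session, sketched under the assumption of a bash shell and an application named ./my_app (the variable names are those listed above):

export COMPUTE_PROFILE=1
export COMPUTE_PROFILE_LOG=./cuda_profile.log
export COMPUTE_PROFILE_CONFIG=./profile_config.txt
./my_app
cat cuda_profile.log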
Ï gld_incoherent: Number of non-coalesced global memory loads
Ï gld_coherent: Number of coalesced global memory loads
Ï gst_incoherent: Number of non-coalesced global memory stores
Ï gst_coherent: Number of coalesced global memory stores
Ï branch: Number of branch events taken by threads
Ï divergent_branch: Number of divergent branches within a warp
Ï instructions: instruction count
Ï warp_serialize: Number of threads in a warp that serialize based on address conflicts to shared or constant memory
Ï cta_launched: Number of executed thread blocks
Ï data prefetching = load data in advance (e.g. for the next iteration in a loop)
Ï When a memory access operation is executed, it does not block the operations following it, as long as they do not use the data from that operation.
Ï loop without data prefetching:

for (i = 0; i < N; i++) {
    sum += array[i];
}

Ï every addition waits for its data to be loaded from memory
Ï the same loop with data prefetching: the load for the next iteration is issued before the current addition, so the two overlap

temp = array[0];
for (i = 0; i < N-1; i++) {
    temp2 = array[i+1];  // prefetch: issue the load for iteration i+1
    sum  += temp;        // compute with the data of iteration i
    temp  = temp2;
}
sum += temp;             // last element

Ï with data prefetching, inside the loop, the memory load for iteration i+1 and the actual addition for iteration i are done in parallel
Ï data prefetching benefit: fewer warps are needed to hide the memory latency
Ï data prefetching cost on GPU: more registers, hence fewer resident warps
Ï global memory reads/writes are pipelined
Architectures,
Ï L1 cache: ressources shared with shared memory ! Parallelism and Moore
law
Ï L1 cache used for reading local variables, if not enough Why multi-core ?
What’s a thread ?
registers
GPU computing
Ï L2 cache: for global memory History
OpenCL:
Ï standard maintained by the Khronos Group (http://www.khronos.org), version 1.0 (12/2008)
Ï convert a CUDA program to OpenCL :
http://developer.amd.com/documentation/articles/pages/OpenCL-and-the-ATI-Stream-v2.0-Beta.aspx#four
Google Trends with keywords [figure]
Architectures,
Parallelism and Moore
Ï Preprocessor directives à la OpenMP, source code law
transformation: CAPS and its tools HMPP (Hybrid Manycores Why multi-core ?
What’s a thread ?
Architectures,
Parallelism and Moore
Ï Avoir un GPU compatible CUDA : G80 ou plus (≥ 2007), law
What’s a thread ?
Programming model
devdriver_3.2_linux_64_260.24.run CUDA : optimisation
GPU computing :
Ï chaîne de compilation (toolkit): perspectives
Installing CUDA
cudatoolkit_3.2.9_linux_64_ubuntu10.04.run
CFD example/demo
Ï SDK Cuda / OpenCL : gpucomputingsdk_3.2_linux.run PyCUDA / PyOpenCL
To check the installation:
Ï run the deviceQuery example from the SDK
Ï check the driver version : cat /proc/driver/nvidia/version
In SDK example projects:
• nom_du_projet_gold.c : native CPU version, used for functional comparison of the results and possibly for benchmarking performance
CFD example/demo
Euler equations in 2D:

\[ U_t + F(U)_x + G(U)_y = 0 \]

\[
U = \begin{pmatrix} \rho \\ \rho u \\ \rho v \\ E \end{pmatrix}, \quad
F = \begin{pmatrix} \rho u \\ \rho u^2 + p \\ \rho u v \\ u(E+p) \end{pmatrix}, \quad
G = \begin{pmatrix} \rho v \\ \rho u v \\ \rho v^2 + p \\ v(E+p) \end{pmatrix}
\]

Ï Perfect gas equation of state for the internal energy: e = p / (ρ(γ−1)), with γ = 1.4 (air at T = 20°C)
Ï 2nd-order Godunov scheme where the U^n_i are replaced by piecewise-linear functions.
Ï MUSCL-Hancock in 3 steps:
- compute the slopes and the values of U_i at the cell boundaries
- evolve by half a time step
- solve the Riemann problem to obtain the new fluxes
Ï 4 2D grids:
• 1 per conservative variable
Ï closed boundary conditions
Simulation
Ï simulation of a fluid jet injected into the domain
Ï simulation parameters:
Ï run: tend = 1 s, 1 output every 50 time steps, ∆x = 0.05 m
Ï borders: closed on the 4 sides
Ï scheme: 10 iterations for the Riemann solver, Courant factor 0.8
Ï initial conditions: W = [ρ = 1, u = 0, v = 0, p = 1]^T inside, and W = 0 on the borders
Kelvin-Helmholtz instability [figure]
Computation time of a 3D 256x256x16 simulation: 30 minutes
Performance [figure]
PyCUDA / PyOpenCL
Ï CPU development tools for sequential programs have a long history
Ï PyCUDA wraps the CUDA driver API into the Python language
Ï reference : Andreas Klöckner, http://mathema.tician.de/software/pycuda
Ï Python + CUDA = PyCUDA (Python: a high-level language widely used in scientific computing)
Ï download the package from http://mathema.tician.de/software/pycuda
Ï untar and configure (set the CUDA toolkit location):
./configure.py --cuda-root=/usr/local/cuda32/ --cuda-enable-gl
Ï build the package:
make
Ï install it; the package ends up in
/usr/local/lib/python2.6/dist-packages/pycuda-0.94.2-py2.6-linux-x86_64.egg
PyCUDA code demo.py from examples
Compute multiplication by a scalar, elementwise
Ï Initialization and data array declaration (after import pycuda.autoinit, import pycuda.driver as cuda and import numpy):

a = numpy.random.randn(4,4).astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.size * a.dtype.itemsize)
cuda.memcpy_htod(a_gpu, a)
Ï Kernel declaration and execution (SourceModule comes from pycuda.compiler):

# compile and load into GPU device
mod = SourceModule("""
__global__ void doublify(float *a)
{
  int idx = threadIdx.x + threadIdx.y*4;
  a[idx] *= 2;
}
""")
# get a handle
func = mod.get_function("doublify")
# launch GPU computation
func(a_gpu, block=(4,4,1))
Ï Retrieve result on CPU and compare:

a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print a
print "doubled with kernel:"
print a_doubled
Ï the gpuarray module: numpy-like arrays living in GPU memory

a_gpu = gpuarray.to_gpu(numpy.random.randn(4,4).astype(numpy.float32))

Ï Some predefined CUDA kernels implemented as gpuarray instance methods : reductions (min, max, sum, ...), elementwise operators (fabs, trigonometric, ...)
Ï FFT on GPU: http://pypi.python.org/pypi/pyfft
Lattice Boltzmann simulations : Sailfish http://sailfish.us.edu.pl/
Hands-on (TP):
Ï Grab the CUDA kernel code for solving the heat equation, and reuse it from PyCUDA
Ï Have a look at pyFFT (a pyCUDA/pyOpenCL interface to compute FFT on GPU); already installed in /usr/local/lib/python2.6/dist-packages/pyfft-0.3.5-py2.6.egg. Download the sources to get doc and examples.
Ï future ? copperhead : a data-parallel subset of Python, dynamically compiled for the target platform
CUDA / MPI
Ï CUDA and MPI are almost orthogonal: good !
Ï MPI: distributed parallel processing
Ï CUDA: co-processor, CPU accelerator
Ï NVIDIA System Management Interface (from the Nvidia driver):
• nvidia-smi -s : show the current rules for COMPUTE mode
• Default: multiple host threads can use the device at the same time
• Exclusive: only one host thread can use the device at any given time
• Prohibited: no host thread can use the device
• When using the Default mode (a system admin choice), the programmer NEEDS to call cudaSetDevice(devId) to make explicit which GPU device is associated to which MPI process
• example GPU device choice rule: cudaGetDeviceCount(&count); int devId = myRank % count;
Ï 225_GTC2010.pdf slides
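Putting the last two bullets together, a minimal sketch of my own for binding MPI ranks to GPUs round-robin:

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int myRank, count;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);

    cudaGetDeviceCount(&count);     // GPUs visible on this node
    int devId = myRank % count;     // round-robin rule from the slide
    cudaSetDevice(devId);           // required in Default compute mode

    printf("MPI rank %d uses GPU device %d\n", myRank, devId);
    MPI_Finalize();
    return 0;
}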
Ï easier to use the nvcc compiler driver than mpicc to integrate CUDA into an MPI program: just pass the MPI CFLAGS and LDFLAGS to nvcc (see the example below)
Ï integration with a build system: an autotools-based example
Ï Example hands-on: try to parallelize the heat solver using MPI over 2 nodes (only 1 border for MPI communications). Start with the CPU version, then the GPU version.
Ï potentially interesting tools:
• Global Memory for Accelerator (GMAC), a user-level library that provides a unified virtual address space between CPU and GPU
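For example, under the assumption of an Open MPI installation (whose mpicc can report its flags via --showme) and source files named main.cu and solver.cu, the build could look like:

MPI_CFLAGS=$(mpicc --showme:compile)
MPI_LDFLAGS=$(mpicc --showme:link)
nvcc -o heat_solver main.cu solver.cu $MPI_CFLAGS $MPI_LDFLAGS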
References
Ï multimedia : http://www.nvidia.com/object/cuda_education.html
Ï Workshop ASPLOS2008 : http://gpgpu.org/asplos2008
Ï École thématique ARCHI09 (vector processors / GPU precursors)
Ï http://sites.google.com/site/cudaiap2009/home
Ï CUDA wiki at DAAC
Ï CUDA Tutorial at ISC2009
Ï a good summary : http://www.caam.rice.edu/~timwar/NUDG/RMMC/CUDA.html
Bibliography
Ï CUDA documentation from the SDK / Toolkit 3.2 (September 2010):
• CUDA_Toolkit_Reference_Manual.pdf
• nvcc.pdf
• CUDA_C_Best_Practices_Guide.pdf
• ptx_isa_2.2.pdf
Ï RSS feeds / News :
http://twitter.com/statuses/user_timeline/18691870.rss
• Nvidia News : http://news.developer.nvidia.com
Ï the Nvidia forum to ask questions about CUDA
Ï the Nvidia developers' corner : http://developer.nvidia.com/page/home.html
Ï cis565, GPU Programming and Architecture, University of Pennsylvania
Ï CUDA index from the University of Michigan