
Introduction to massively parallel computing with GPU

Pierre Kestener
CEA-Saclay, DSM, France
Maison de la Simulation

Chamonix, June 2011

Contents...

• Architectures, parallelism and Moore's law
• A short historical overview
  - from vector processors to GPU
  - transfer of graphics functionalities CPU/GPU - hardware/software
• Introduction to GPU computing
  - hardware architecture
  - CUDA programming model, workflow, execution model, CUDA language
  - development tools (compiler, debug, etc...)
  - parallel programming patterns : reduction, ...
• Web bibliography
  - many links to on-line resources (courses, source codes, articles, ...)

Program for the week

Summary

• Architectures, Parallelism and Moore law
  - Why multi-core ?
  - What's a thread ?
• GPU computing
  - History
  - Embedded GPU computing ?
  - FPGA co-processor
  - Nvidia / CUDA
  - Why should we use GPU ?
  - Hardware architecture
  - Programming model
  - CUDA : optimisation
• GPU computing : perspectives
• Installing CUDA
• CFD example/demo
• PyCUDA / PyOpenCL
• CUDA / MPI
• References

Parallel computing - Rendering - GPU

• before the 1990's, parallel computers were rare and available for only the most critical problems
• Toy Story (1995) : first completely computer-generated feature-length film, processed on a "renderfarm" consisting of 117 Sun(™) SPARCstation(™) @100MHz workstations. Computing the 114000 frames (77 minutes) required 800000 computer hours. Each frame consists of 300 MBytes of data (one hard-disk in 1995).
• Computer animation (and rendering, i.e. the process of generating an image from a model or scene description) is where parallel computing / HPC and graphics computing / GPU meet.
• Software for off-line rendering : e.g. RenderMan (http://renderman.pixar.com) by Pixar, from modelling to rendering
• Hardware rendering : OpenGL low-level API used in real-time rendering (i.e. done in dedicated hardware like GPU), gaming industry

Parallel computing / Parallel programming

• Parallel computing relies on exploitable concurrency
• Concurrency exists in a computational problem when the problem can be decomposed into subproblems that can safely execute at the same time
  (Mattson et al., in Patterns for Parallel Programming)
• concurrency : property of a system in which several computations are executing simultaneously
• How to structure code to expose and exploit concurrency ?

Moore's law - the free lunch is over...

The number of transistors that can be placed inexpensively on an integrated circuit doubles approximately every two years.

Moore's law - the free lunch is over...

Moore's Law continues with
• technology scaling (32 nm in 2010),
• improving transistor performance to increase frequency,
• increasing transistor integration capacity to realize complex architectures,
• reducing energy consumed per logic operation to keep power dissipation within limit.

Shekhar Borkar, Thousand Core Chips - A Technology Perspective, Intel Corp, Microprocessor Technology Lab, 2007, p. 1-4

Moore's law - Towards multi-core architectures

Pollack's rule
• if you double the logic in a processor core, then it delivers only 40% more performance
• A multi-core microarchitecture has the potential to provide near-linear performance improvement with complexity and power.
• For example, two smaller processor cores, instead of a large monolithic processor core, can potentially provide 70-80% more performance, as compared to only 40% from a large monolithic core (see the worked numbers below).

Shekhar Borkar, Thousand Core Chips - A Technology Perspective, Intel Corp, Microprocessor Technology Lab, 2007, p. 1-4
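
Pollack's rule is commonly stated as performance scaling like the square root of a core's complexity (area); a quick check of the figures quoted above under that assumption (the 1.7-1.8 overhead factor is an illustrative guess, not from the slides):

    perf ∝ sqrt(area)
    doubled core      : sqrt(2) ≈ 1.4            → only +40% performance
    two smaller cores : 2 × sqrt(1) = 2.0 ideal  → ≈ 1.7-1.8 with parallel
                        overheads, i.e. +70-80%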

Moore's law - Towards multi-core architectures

(Heterogeneous) Multi-core processors have several other benefits:
• each processor core could be individually turned on or off, thereby saving power;
• each processor core can be run at its own optimized supply voltage and frequency;
• easier to load balance among processor cores to distribute heat across the die;
• can potentially produce lower die temperatures, improving reliability and leakage.

Moore's law - Towards multi-core architectures

• More transistors = more computing power !
• More transistors ? What's the purpose ? How to use them efficiently ?
• Improve single-core CPU performances:
  - keep frequency increasing (watch electric power !)
  - keep transistor density increasing (more and more difficult) : 32 nm in 2010
• Utilize transistors on chip efficiently:
  - instruction-level parallelism (out-of-order execution, etc...)
  - data-level parallelism (SIMD, vector units) : SSE, Cell SPE, GPU !
  - thread-level parallelism : hardware multi-threading, multi-core, many-core ...

http://www.ugrad.cs.ubc.ca/~cs448b/2010-1/lecture/2010-09-09-ugrad.pdf

Moore's law - Towards multi-core architectures

• More transistors = more computing power !
• More transistors ? What's the purpose ? How to use them efficiently ?
• re-think or re-design algorithms to exploit parallelism (multithreads, multicores, ...) and make them scalable (whatever the number of cores) !
• modern GPUs have massively multi-threaded architectures (up to 48 active threads per core in Fermi)

http://www.ugrad.cs.ubc.ca/~cs448b/2010-1/lecture/2010-09-09-ugrad.pdf

Multi-core - a technology perspective

• not only in supercomputers
• almost every PC in 2010 has a multi-core CPU
• multiple small cores at a lower clock frequency are more power efficient
  corollary : the parallel version of a code can be much more energy-efficient than its sequential version running at the same frequency
• Even smartphone processors become multi-core: multiple small cores at a lower frequency are more efficient than a single core (increase battery life-time, ...)
• Modern smartphones perform nearly 100 GOPS within a power budget of only 1W !

http://www.date-conference.com/proceedings/PAPERS/2009/DATE09/PDFFILES/10.1
S. Borkar and A. Chien, The Future of Microprocessors, ACM Communications

Multi-core parallelism in smartphones

• maximum workload for a 3G smartphone is about 100 GOPS
• Application processing : user interface, address books, diaries, sms, java, internet browsing, email, gaming, document and spreadsheet editing, photo handling
• Radio processing : demodulation, decoding, protocol, ...
• Multi-media processing
• The challenge is to provide 100 GOPS within a 1W power budget
• The solution has to be multicore, as multiple small cores at a lower clock frequency are more power efficient

http://www.date-conference.com/proceedings/PAPERS/2009/DATE09/PDFFILES/10.1

What is a thread anyway ?

• Thread - hardware / software definition(s):
  - execution unit
  - instruction unit
  - processing unit
  - lightweight process (Unix jargon) : in software, switching between threads does not involve changing the memory context.
• confusion : multi-threading ≠ multi-core
• multi-threading on a single core = time-division multiplexing, i.e. the processor switches between different threads. Multiple threads are not actually running simultaneously, just interleaved execution
• SMT (Simultaneous Multi-Threading) : a single core able to process multiple instructions from multiple threads in one cycle. This is called Hyper-Threading by Intel. The two (or more) hardware threads share access to cache memory or virtual memory (TLB), leading potentially to contention.
• CMT (Chip-level Multi-Threading) : integrates two or more processors into one chip, each executing threads independently; also called multi-processing

What is a thread anyway ?

• SMT (Simultaneous Multi-Threading) : one physical processor for multiple logical processors.
  - Each logical processor maintains a complete set of the architecture state (general-purpose registers, control registers, ...)
  - Logical processors share nearly all other resources, such as caches, execution units, branch predictors, control logic, and buses
  Costs of 2-way Hyper-Threading : chip area (+5%), performance (+15 to 30%)
• CMT (Chip-level Multi-Threading)

What is a thread anyway ?

• (academic ?) Implicit multi-threaded processor : dynamically generate threads from single-threaded programs and execute such speculative threads concurrently with the lead thread. Multiscalar, dynamic multithreading, speculative multithreading, ...
• Wikipedia [Multi-threading]:
  (...) multithreading aims to increase utilization of a single core by leveraging thread-level as well as instruction-level parallelism.
• multi-threading aim : improve the instruction throughput of the processor.
• example : hide memory access latency by thread-context switching (a key feature behind modern GPUs' high performance)

From multi-core CPU to manycore GPU

Why is there a large performance gap between manycore GPUs and general purpose multicore CPUs ?

• Different goals produce different designs:
  - GPU assumes the work load is highly parallel
  - CPU must be good at everything, parallel or not

From multi-core CPU to manycore GPU

Why is there a large performance gap between manycore GPUs and general purpose multicore CPUs ?

• CPU design goal : optimize architecture for sequential code performance : minimize the latency experienced by one thread
  - sophisticated (i.e. large chip area) control logic for instruction-level parallelism (branch prediction, out-of-order instruction, etc...)
  - CPUs have large cache memory to reduce the instruction and data access latency

From multi-core CPU to manycore GPU

Why is there a large performance gap between manycore GPUs and general purpose multicore CPUs ?

• GPU design goal : maximize the throughput of all threads
  - # threads in flight limited by resources => lots of resources (registers, bandwidth, etc.)
  - multithreading can hide latency => skip the big caches
  - share control logic across many threads

ref: Jared Hoberock, Stanford, cs193, http://code.google.com/p/stanford-cs193g-sp2010/

From multi-core CPU to manycore GPU

Why is there a large performance gap between manycore GPUs and general purpose multicore CPUs ?

• the fast-growing game industry put tremendous economic pressure on designing architectures optimized for maximum floating point operations per unit of chip area / power budget.
• the GPU takes advantage of a large number of execution threads to find work to do when other threads are waiting for long-latency memory accesses, thus minimizing the control logic required for each execution thread.

From multi-core CPU to manycore GPU

Why is there a large performance gap between manycore GPUs and general purpose multicore CPUs ?

• GPU : much more area dedicated to floating point computations
• GPUs are numeric computing engines that will not perform well on some tasks for which CPUs are optimized. Need to take advantage of both !

reference: D. Kirk and W.W. Hwu, Programming Massively Parallel Processors, Morgan Kaufmann eds.
J. Dongarra, CEA-EDF-INRIA summer school, 2011

From multi-core CPU to manycore GPU

Why is there a large performance gap between manycore GPUs and general purpose multicore CPUs ?

• SIMD
• GPU : amortize the cost / complexity of managing an instruction stream across many ALUs.

reference: D. Kirk and W.W. Hwu, Programming Massively Parallel Processors, Morgan Kaufmann eds.
J. Dongarra, CEA-EDF-INRIA summer school, 2011

Other practical issues to explain GPU rise

• the cost of software development is best justified by a very large customer population:
  - Traditional parallel computing systems used to have a negligible market (e.g. CRAY vector processors in the 80s).
  - the large market (game industry) made GPUs economically attractive
• Ease of accessibility of parallel computing systems
  - before 2006, parallel software ran only on data-centers / clusters
  - According to Kirk and Hwu, NIH refused to fund parallel programming projects for some time because they felt the impact of parallel software would be limited (no huge cluster-based machines in clinical settings). Today, the possibility of having small GPU-based equipment has re-enabled parallel software research for such medical applications.

reference: D. Kirk and W.W. Hwu, Programming Massively Parallel Processors, Morgan Kaufmann eds.

Other practical issues to explain GPU rise

• before 2006, the OpenGL® / Direct3D® APIs were the only way to program GPUs - this is called legacy GPGPU (General Purpose GPU) computing and required highly specific programming skills
  see the GPGPU tutorial http://www.mathematik.uni-dortmund.de/~goeddeke/gpgpu/tutorial.html
• everything changed in 2007, with the release of CUDA (new hardware, new programming model, etc...)

reference: D. Kirk and W.W. Hwu, Programming Massively Parallel Processors, Morgan Kaufmann eds.

Fundamentals of Parallel Computing

• Parallel computing requires that
  - the problem can be decomposed into sub-problems that can be safely solved at the same time
  - the programmer structures the code and data to solve these sub-problems concurrently
• The goals of parallel computing are
  - to solve problems in less time, and/or
  - to solve bigger problems, and/or
  - to achieve better solutions

The problems must be large enough to justify parallel computing and to exhibit exploitable concurrency.

reference : http://courses.engr.illinois.edu/ece498/al/Syllabus.html

Parallel programming / computational thinking

Ideal parallel programmer skills:
• Computer architecture:
  - memory organization,
  - caching and locality,
  - memory bandwidth,
  - SIMT (single instruction multiple thread) versus SPMD (single program multiple data) versus SIMD,
  - floating point accuracy
• Programming models and compilers:
  - parallel execution models,
  - types of available memories,
  - array data layout,
  - loop transformation
• Algorithm techniques:
  - tiling, cutoff, binning
• Domain knowledge:
  - numerical methods, model accuracy required

reference: Kirk, Hwu, Programming Massively Parallel Processors, chapter 10

Summary

• Architectures, Parallelism and Moore law
  - Why multi-core ?
  - What's a thread ?
• GPU computing
  - History
  - Embedded GPU computing ?
  - FPGA co-processor
  - Nvidia / CUDA
  - Why should we use GPU ?
  - Hardware architecture
  - Programming model
  - CUDA : optimisation
• GPU computing : perspectives
• Installing CUDA
• CFD example/demo
• PyCUDA / PyOpenCL
• CUDA / MPI
• References

From vector processors to GPU

http://www.irisa.fr/archi09/defour.pdf
• Vector processor : implements an instruction set containing instructions that operate on 1D arrays of data called vectors
• most super-computers in the 80s / 90s were vector processors
• CRAY-1 : 1976, 80 MHz, 64-bit data, 24-bit address, vector register file, 160 MIPS, 250 MFLOPS, 8 MB RAM, 5.5 tonnes, ~200 kW (cooling included)
• modern/current scalar processors have vector instruction sets (SIMD) : SSE, Altivec (using 128-bit registers).
• the fall of vector processors (end of 80s - beginning of 90s) : CMOS technology; rise of the PC mass market; single-chip microprocessors versus the difficulty of fitting a vector processor on a single chip; cache size increase in scalar processors; programming a vector processor required assembly language skills.

GPU evolution: architecture and programming

• basics of the graphics pipeline
• basics of shaders (programmable functionalities in the graphics pipeline)
• Overview of GPU architectures
• Legacy GPGPU (before CUDA, ~2004)

The Evolution of GPUs for General Purpose Computing, by Ian Buck
http://www.nvidia.com/content/GTC-2010/pdfs/2275_GTC2010.pdf

Overview of GPU architectures: from 1995 to 1999

• graphics pipeline : conceptual model of stages that graphics data is sent through (software or hardware)
• GPU main function : off-load graphics tasks from CPU to GPU
• GPU : dedicated hardware for specialized tasks : GPU hardware designed to optimize highly repetitive tasks, e.g. texturing (bi-, tri-, quadri-linear filtering), rastering (pixel interpolation), z-cull (remove hidden surfaces), etc...

Overview of GPU architectures: from 1995 to 1999

• graphics pipeline : conceptual model of stages that graphics data is sent through (software or hardware)
• from 1995 : 3D rendering, rasterization (vector image into pixel image), more complex rendering

A short historical overview : 1995 to 1999

http://www.nvidia.com/object/Technical_Brief_TandL.html
http://www.cs.unr.edu/~fredh/papers/thesis/023-crow/GPUFinal.pdf

• design specification of graphics APIs (an API should be cross-vendor, cross-platform, and provide hardware abstraction) : OpenGL, DirectX
• 3D graphics boom :
  - Motion picture industry : Toy Story I (1995, Pixar) is the first full-length computer-generated feature film (off-line rendering with RenderMan)
  - mass-market game industry : Quake, Doom, etc...; flight simulators

A short historical overview : 1995 to 1999

http://www.nvidia.com/object/Technical_Brief_TandL.html
http://www.cs.unr.edu/~fredh/papers/thesis/023-crow/GPUFinal.pdf

• 1999 : transfer of transformation operations (4x4 matrix multiplication) and lighting to the GPU
• 1999 : Nvidia introduces the first consumer-level GPU with the entire graphics pipeline in hardware.
• fixed function pipeline (developer limited to a fixed set of features)

OpenGL StateMachine

http://www.opengl.org/documentation/specs/version1.1/state.pdf
The OpenGL graphics system diagram, Version 1.1. Copyright 1996 Silicon Graphics, Inc. All rights reserved.

A short historical overview : 2000 and after

• "All processors aspire to be general-purpose."
  - Tim Van Hook, Graphics Hardware 2001
• 2001 : Nvidia GeForce3
• programmable graphics pipeline : pixel and vertex shaders written in a low-level language; pros : high flexibility, lets the hardware follow the rapid evolution of standards, and eases faster development.
• GPU : Vertex Processors (MIMD), Fragment Processors (SIMD), 32-bit floats
• introduction of high-level languages : Cg (Nvidia, compatible with OpenGL/DirectX), HLSL (Microsoft, compatible with DirectX only)
• hardware abstraction : the programmer only needs to know very little about the hardware

A short historical overview : 2000 and after

Floating point computation capability implemented in GPU hardware

• IEEE 754 standard written in the mid-80s
• Intel 80387 : first IEEE754-compatible floating-point coprocessor
• Value = (−1)^S × M × 2^E ; denormalized numbers, infinity, NaN; rounding algorithms quite complex to handle/implement (a worked FP32 decoding follows below)
• FP16 in 2000
• FP32 in 2003-2004 : simplified IEEE754 standard; floating-point rounding is complex and costly in terms of transistor count
• rounding fully implemented for + and * in 2007; denormalized numbers not completely implemented
• 2010 : 4 mandatory IEEE rounding modes; subnormals at full speed (Nvidia GF100)
• links: http://perso.ens-lyon.fr/sylvain.collange/talks/raim11_scollange.pdf
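
For reference, a worked form of the value formula above for IEEE 754 single precision (FP32 : 1 sign bit, 8 exponent bits with bias 127, 23 mantissa bits) - standard material, not from the slide itself:

    x = (−1)^s × 1.m × 2^(e−127)    (normalized numbers, 1 ≤ e ≤ 254)
    x = (−1)^s × 0.m × 2^(−126)     (denormalized numbers, e = 0)

For example, the bit pattern 0x40490FDB decodes as s = 0, e = 128, 1.m ≈ 1.5707964, giving ≈ 3.1415927, the closest FP32 value to π.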

From legacy GPGPU to CUDA : unified architecture (2007)

Nvidia GeForce 7800

From legacy GPGPU to CUDA : unified architecture (2007)

Nvidia GeForce 8000

Legacy GPU computing - before CUDA

Legacy GPGPU : General Purpose computations on GPU

• Legacy GPGPU : twist the graphics APIs to perform general purpose computing tasks
• GPUs were designed for computer graphics (output streams of colored pixels from input streams of vertices, texels)
• there is a need for translating computational concepts into the GPU programming model
• GPGPU dedicated tutorial for OpenGL (2005) :
  http://www.mathematik.uni-dortmund.de/~goeddeke/gpgpu/tutorial.html
  have also a look at chapter 31 in GPU Gems 2, Mapping Computational Concepts to GPUs :
  http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter31.html

Legacy GPU computing - before CUDA

Legacy GPGPU : General Purpose computations on GPU

• CPU - GPU analogies
• GPGPU concept 1 : arrays = textures
  - CPU memory can be read / written anywhere in a program
  - Vertex programs are not allowed to randomly index into input vertices.
  - GPU textures (memory buffers) are read-only, but random access is allowed !! They are outputs of the vertex/fragment processors.
  - render-to-texture (instead of display), a very recent feature (~2003)

Legacy GPU computing - before CUDA

Legacy GPGPU : General Purpose computations on GPU

• CPU - GPU analogies
• GPGPU concept 1 : arrays = textures
  - create a frame buffer object for off-screen rendering; random access not allowed
  - memory : CPU array → GPU GL_TEXTURE_2D (read-only or write-only), bind a texture to an FBO
  - CPU array indexes (integer) → GPU texture coordinates (float ∈ [0, 1])
  - data type : float → GL_LUMINANCE or GL_RGBA
  - array size : power of 2 (?)
  - scatter operations (a[i] = x) need to be converted into gathers (x = a[i]), as sketched below
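
A minimal plain-C sketch (mine, not from the original slides) of that scatter-to-gather rewrite; perm is a hypothetical index map and inv its precomputed inverse:

    /* Scatter: each input element writes wherever it wants.
       A fragment shader cannot do this: it may only write to
       the output pixel it was assigned. */
    void scatter(const float *in, float *out, const int *perm, int n) {
        for (int i = 0; i < n; i++)
            out[perm[i]] = in[i];
    }

    /* Gather: each output element fetches its own input.
       This is the form that maps onto legacy GPGPU hardware. */
    void gather(const float *in, float *out, const int *inv, int n) {
        for (int j = 0; j < n; j++)
            out[j] = in[inv[j]];
    }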

Legacy GPU computing - before CUDA

Legacy GPGPU : General Purpose computations on GPU

• GPGPU concept 2 : CPU programs / loops = GPU fragment shaders (a fragment program runs on a stream of fragments)
• GPGPU concept 3 : CPU computing = GPU drawing : the vertex processors transform the geometry, and the rasterizer determines which pixels in the output buffer it covers and generates a fragment for each one.

Legacy GPU computing - before CUDA

Legacy GPGPU : General Purpose computations on GPU

• Reduction example : compute the max of the items in a 2D array
  http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter31.html
• Implementation : use a pair of buffers and iterate a 2-step operation of rendering to / reading from a buffer, reducing the output size by 2 until we have only a single element containing the max. We need O(log(n)) passes. (A modern CUDA equivalent is sketched below.)
• other example code : sum of arrays (hands-on) saxpy_cg.cpp
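
For comparison, a minimal CUDA sketch (mine, not from the slides) of the same O(log n) idea written against today's API: each block reduces its tile in shared memory, halving the number of active threads at each pass, and the host then reduces the per-block partial maxima.

    #include <float.h>

    // launch as: max_reduce<<<blocks, threads, threads*sizeof(float)>>>(in, out, n);
    // assumes blockDim.x is a power of two
    __global__ void max_reduce(const float *in, float *out, int n)
    {
        extern __shared__ float s[];
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + threadIdx.x;
        s[tid] = (i < n) ? in[i] : -FLT_MAX;   // pad out-of-range items
        __syncthreads();
        // log2(blockDim.x) passes, like the log(n) render passes above
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride)
                s[tid] = fmaxf(s[tid], s[tid + stride]);
            __syncthreads();
        }
        if (tid == 0)
            out[blockIdx.x] = s[0];            // one partial max per block
    }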

GPU computing - CUDA

• Nvidia GeForce 8800, 2006, introduces a unified architecture (only one type of shader processor)
• first generation with hardware features designed with GPGPU in mind : almost full support of the IEEE 754 standard for single precision floating point, random read/write in external RAM, memory cache controlled by software
• new hardware architecture generation : CUDA (Compute Unified Device Architecture)
• new programming model : CUDA + development tools (toolkit, compiler, SDK, libraries like cuFFT), a C-like programming language

CUDA - a C-like language - HelloWorld
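
The slide's code listing did not survive extraction; as a stand-in, here is a minimal CUDA "hello world" of my own in the same spirit: the host launches a kernel, each thread computes its global index, and the results are copied back.

    #include <stdio.h>

    __global__ void hello(int *out)
    {
        // global thread index: block offset plus thread offset
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = i;
    }

    int main(void)
    {
        const int N = 8;
        int h[8], *d;
        cudaMalloc(&d, N * sizeof(int));
        hello<<<2, 4>>>(d);                 // 2 blocks of 4 threads
        cudaMemcpy(h, d, N * sizeof(int), cudaMemcpyDeviceToHost);
        for (int i = 0; i < N; i++)
            printf("hello from thread %d\n", h[i]);
        cudaFree(d);
        return 0;
    }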

GPU computing - CUDA

Application examples:
• meteorology : numerical weather prediction, http://www.mmm.ucar.edu/wrf/WG2/GPU/
• CFD : http://www.nvidia.com/object/computational_fluid_dynamics.html
• Molecular Dynamics / Quantum chemistry : http://www.nvidia.com/object/molecular_dynamics.html
• Bioinformatics : http://www.nvidia.com/object/tesla_bio_workbench.html
• financial computing : http://people.maths.ox.ac.uk/gilesm/hpc/slides.html
• Signal/Image processing, Computer vision, etc
• too many by now....
• Books: GPU Computing Gems, Emerald Edition

GPU computing - CUDA

GTC conference (GPU Technology Conference 2010) : more than 250 application examples
http://www.nvidia.com/object/gtc2010-presentation-archive.html
Have a look at the CUDA showcase:
http://www.nvidia.com/object/cuda_apps_flash_new.html

What kinds of computation map well to GPUs ?

Nukada et al., SC08
http://www.nvidia.com/content/GTC-2010/pdfs/2084_GTC2010.pdf

Is parallelisation worth it ?

Amdahl's law : R = 1 / ((1 − p) + p/N)

• p : fraction of the work that can be parallelized
• 1 − p : fraction of sequential work
• R : predicted maximum speed-up using N parallel processors (see the short example below)
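
A tiny numerical illustration of the formula (the example values are my own choice): even a 90%-parallel code cannot be sped up more than 10x, no matter how many processors are used.

    #include <stdio.h>

    /* Amdahl's law: predicted speed-up for parallel fraction p on N processors */
    double amdahl(double p, double n) { return 1.0 / ((1.0 - p) + p / n); }

    int main(void)
    {
        printf("p=0.90, N=128 -> R = %.1f\n", amdahl(0.90, 128));  /* ~9.3  */
        printf("p=0.99, N=128 -> R = %.1f\n", amdahl(0.99, 128));  /* ~56.4 */
        return 0;
    }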

embedded systems...

• OpenGL ES (Open Graphics Library for Embedded Systems)
  - OpenGL ES 1.1.X (2004-2008) : fixed-function hardware, chosen as the 3D API in Symbian, Android, iPhone SDK, ...
  - OpenGL ES 2.0.X (2007-...) : programmable hardware, specification of a high-level shading language, used in the new iPod Touch, Nokia N900, ...
• new systems-on-chip (CPU+GPU on a single die), tablet-PC market, entertainment centers, ...
  http://tegradeveloper.nvidia.com/tegra/

FPGA : reconfigurable computing

• FPGA : programmable integrated circuits, firmware (hardware with software flexibility), an array of configurable logic blocks (each encoding an n-input boolean function) + routing tracks
• since ~1985
• low-level HDLs (Hardware Description Languages), e.g. VHDL (ADA-like), allow designing a functionality at the clock-cycle level
• IP (Intellectual Property) libraries
• more and more complex : integration of communication blocks (Ethernet, Gigabit Transceiver, etc) and computing blocks (PowerPC CPU, DSP blocks, small floating-point ALU (2003))
• reconfigurable computing : can we use FPGAs for general purpose computing ?
• since 2004-2005, emergence of high-level languages (C-like) + small boards (with PCI-express form factor) + drivers
• a high-level language, as always, allows abstraction, hiding the low-level design flow (logic synthesis into RTL, placement, routing, ...) : focus on the scientific algorithm
• Appealing, but the design is too complex (too much low-level, i.e. hardware, knowledge required to make efficient use of them), and FPGAs are expensive...

FPGA : reconfigurable computing / hardware accelerator

see http://www.nallatech.com/

FPGA : reconfigurable computing

Some links :
• bibliographic reference : Reconfigurable Computing, Accelerating Computation with FPGA, by M. Gokhale and P.S. Graham, Springer, 2005.
• wikipedia : Reconfigurable computing
• workshop HPRCTA10

• FPGAs seem less appealing after the rise of GPU computing and the possible merging of CPU/GPU architectures and development tools

FPGA : reconfigurable computing

Applications :
• computing applications : SDR (Software Defined Radio : telecommunications, radio-astrophysics, military)
• applications : high-throughput networking / storage : low latency (finance, trading), compression, ...
• some vendors : Nallatech, DRC Computer, XtremeData
• commercial high-level tools (C-to-RTL) : Mitrion-C, ImpulseC, ...

The problems :
• FPGAs are expensive (≠ GPU, gaming market)
• design tool complexity for software programmers
• manufacturers involved in HPC ?

Summary

• Architectures, Parallelism and Moore law
  - Why multi-core ?
  - What's a thread ?
• GPU computing
  - History
  - Embedded GPU computing ?
  - FPGA co-processor
  - Nvidia / CUDA
  - Why should we use GPU ?
  - Hardware architecture
  - Programming model
  - CUDA : optimisation
• GPU computing : perspectives
• Installing CUDA
• CFD example/demo
• PyCUDA / PyOpenCL
• CUDA / MPI
• References

Why should we use GPU ?

• Brute force (high peak GFLOPS rate)
• lower GFLOPS cost
• massively multi-threaded architecture

Why should we use GPU ?

• compare CPU/GPU : many more computing-dedicated transistors and less control logic on the GPU

Why should we use GPU ?

• the gaming mass market behind the scene (a viable model ?); GPU = personal super-computer

Why should we use GPU ?

• accessibility : the programming model is simple compared to legacy GPGPU :
  - creation of a structured set of threads
  - CRCW memory model, explicit memory management
  - C-like language (C + extensions)
  - graphics API (OpenGL) interoperability (eases development of a GUI without the need to read back data to CPU memory)
• CUDA : Compute Unified Device Architecture
  Name of both the architecture (hardware) and the programming model (software)
GPU

GPU computing P. Kestener

Architectures,
Parallelism and Moore
A GPU Computing dedicated system (2009) law
Why multi-core ?

What’s a thread ?

GPU computing
History

and embedded GPU


computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References
Ï performance increase
Ï electric power saving
Ï improve reliability (e.g. ECC)
http://gpgpu.univ-perp.fr/images/3/36/GPU_Reliability_2008.pdf

64 / 170
GPU

GPU computing : double precision P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing
History

and embedded GPU


computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

65 / 170
GPU

CPU/GPU : memory bandwidth P. Kestener

Architectures,
Parallelism and Moore
Ï CPU–GPU link, PCI-Express x16 Gen2 bus:
BW = 16 × 2 × 250 MBytes/s = 8 GBytes/s
Ï CPU-local: DDR memory (f = 266 MHz)
BW = 266 × 10⁶ (cycles/s) × 4 (transfers/cycle) × 8 (bytes/transfer) = 8.5 GBytes/s
Ï GPU-local (GTX280 board): 512-bit bus, DDR @ f = 1100 MHz
BW = 2 × 1100 × 10⁶ (transfers/s) × 512/8 (bytes/transfer) = 140 GBytes/s
(a worked check of this arithmetic follows below)
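A quick check of this arithmetic, as a hedged C sketch (the constants are the figures quoted above; nothing else is assumed):

#include <stdio.h>

int main(void)
{
  /* PCI-Express x16 Gen2: 16 lanes x 2 x 250 MBytes/s per lane */
  double bw_pcie = 16 * 2 * 250e6;          /* =   8.0 GBytes/s */
  /* DDR at f = 266 MHz, 4 transfers/cycle, 8 bytes/transfer   */
  double bw_ddr  = 266e6 * 4 * 8;           /* =   8.5 GBytes/s */
  /* GTX280: 512-bit bus, DDR at f = 1100 MHz                  */
  double bw_gpu  = 2 * 1100e6 * (512 / 8);  /* = 140.8 GBytes/s */
  printf("PCIe x16 Gen2 : %6.1f GBytes/s\n", bw_pcie / 1e9);
  printf("CPU DDR       : %6.1f GBytes/s\n", bw_ddr  / 1e9);
  printf("GPU (GTX280)  : %6.1f GBytes/s\n", bw_gpu  / 1e9);
  return 0;
}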

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

66 / 170
GPU

Hardware architecture: Nvidia CUDA : G80 - P. Kestener

GT200 - Fermi Architectures,


Parallelism and Moore
law
Why multi-core ?

CUDA course : What’s a thread ?

http://courses.ece.uiuc.edu/ece498/al/Syllabus.html GPU computing


History

and embedded GPU


computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

Ï 8 TPC (Thread Processor Cluster) - 2 SM (Streaming


Multiprocessor) - 8 SP (Streaming Processor, SIMD)
Ï 8 TPC × 2 SM × 8 SP-cores = 128 cores
67 / 170
GPU

Hardware architecture: Nvidia CUDA : G80 - P. Kestener

GT200 - Fermi Architectures,


Parallelism and Moore
law
Why multi-core ?

CUDA : G80 (end 2006), hardware capability 1.0 What’s a thread ?

GPU computing
History

and embedded GPU


computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

Ï 8 TPC (Thread Processor Cluster) - 2 SM (Streaming


Multiprocessor) - 8 SP (Streaming Processor, SIMD)
Ï 8 TPC × 2 SM × 8 SP-cores = 128 cores
68 / 170
GPU

Hardware architecture: Nvidia CUDA : G80 - P. Kestener

GT200 - Fermi Architectures,


Parallelism and Moore
law
Why multi-core ?

What’s a thread ?
CUDA : GT200 (mid 2008), hardware capability 1.3
GPU computing
History

and embedded GPU


computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

Ï 10 TPC - 3 SM - 8 SP-cores
Ï 10 TPC × 3 SM × 8 SP-cores = 240 cores

69 / 170
GPU

Hardware architecture: Nvidia CUDA : G80 - P. Kestener

GT200 - Fermi Architectures,


Parallelism and Moore
law
Why multi-core ?

What’s a thread ?
CUDA : Fermi (spring 2010), hardware capability 2.0
GPU computing
History

and embedded GPU


computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

Ï 16 SM - 32 SP-cores
Ï 16 SM × 32 SP-cores = 512 cores

70 / 170
GPU

Architectures comparison: G80 - GT200 - Fermi P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing
History

and embedded GPU


computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

71 / 170
GPU

CUDA : programming model (PTX) P. Kestener

Architectures,
Parallelism and Moore
Ï PTX (Parallel Thread Execution) law
Why multi-core ?
Ï write a program once for a pixel What’s a thread ?

thread GPU computing


History

Ï run this program on multiple and embedded GPU


computing ?

threads FPGA co-processor

Nvidia / CUDA
Ï block is a logical array of threads indexed with threadIdx (built-in variable)
Ï grid is a logical array of blocks indexed with blockIdx (built-in variable)
Ï a block of threads is also called a CTA (Cooperative Thread Array) in CUDA documentation.
Ï Read chapters 1, 2, 3 of ptx_isa_2.2.pdf
(see the indexing sketch below)
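As a minimal sketch of how these built-in variables combine into a global thread index (the kernel name, array and block size are illustrative, not from the original slides):

__global__ void scale(float *a, float alpha, int n)
{
  // global index = block offset + thread offset within the block
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)          // guard: the grid may cover more than n threads
    a[i] *= alpha;
}
// launched e.g. as: scale<<<(n + 255)/256, 256>>>(d_a, 2.0f, n);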

72 / 170
GPU

CUDA : programming model (PTX) P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing
History

and embedded GPU


computing ?

FPGA co-processor
Ï threads from a given block can:
• synchronize execution
• exchange data using a shared memory space
(see the sketch below)
Ï blocks are independent: no cross-block synchronisation, execution in undefined order
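A hedged sketch of intra-block cooperation (kernel name illustrative; assumes blocks of exactly 256 threads):

__global__ void reverseBlock(int *d)
{
  __shared__ int s[256];               // one tile per block
  int t = threadIdx.x;
  int i = blockIdx.x * blockDim.x + t;
  s[t] = d[i];
  __syncthreads();                     // all writes to s[] done before any read
  d[i] = s[blockDim.x - 1 - t];        // each block reversed in place
}
// launch: reverseBlock<<<n/256, 256>>>(d_data);  (n multiple of 256)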

CUDA / MPI

References

73 / 170
GPU

CUDA : programming model (PTX) P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing
History

and embedded GPU


Ï heterogeneous systems: CPU and GPU have separate memory spaces (host and device)
Ï the programmer focuses on code parallelization (algorithm level), not on how to schedule blocks of threads on multiprocessors.

References

74 / 170
GPU

CUDA : Programming model P. Kestener

Architectures,
Parallelism and Moore
law
Ï each block of threads can be scheduled on any of the available multiprocessors (do not rely on a specific execution order when designing algorithms)
• concurrently (on different multiprocessors)
• sequentially (on the same multiprocessor)
Ï the independent execution of blocks of threads is what gives the programming model its scalability.

75 / 170
GPU

CUDA : Programming model P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing
History

and embedded GPU


computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

credits: slides by Andreas Klöckner (author of PyCUDA /


PyOpenCL)

76 / 170
GPU

CUDA : Programming model P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing
History

and embedded GPU


computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

credits: slides by Andreas Klöckner (author of PyCUDA /


PyOpenCL)

77 / 170
GPU

CUDA memory hierarchy : G80/GT200 P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?
Ï CPU and GPU have physically separate memory spaces
• data transfers to/from the GPU are explicit and controlled by the CPU: the GPU can't initiate transfers, access disk, ...
• dedicated memory management for allocation, free, transfers, ...
Ï pointers are only addresses...
• can't tell from the pointer value whether the address lies in CPU or GPU memory space !!
• dereferencing a CPU pointer inside a routine running on the GPU ⇒ CRASH !

78 / 170
GPU

CUDA memory hierarchy : G80/GT200 P. Kestener

Architectures,
Parallelism and Moore
law

Ï on-chip memory:
• shared: RW, very fast (if no bank conflicts), 16 kB/multiprocessor
• register: RW, very fast, 8-16 kB/multiprocessor
Ï off-chip memory:
• global: RW, up to GBytes, slow (∼ 100 clock cycles)
• constant: RO, 64 kB/chip, __constant__ declared variables, very fast (1-4 cycles), cached
• texture: RO, located in global memory, cached
• local: RW, slow, use controlled by the compiler, used when registers run out

79 / 170
GPU

GPU memory model summary for P. Kestener

legacy/modern GPGPU Architectures,


Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing

Limited memory access during computation History

and embedded GPU


computing ?

FPGA co-processor

Ï register (per fragment/thread) Nvidia / CUDA


Why shoud we use GPU ?
• read/write Hardware architecture

Programming model
Ï local memory (shared among threads) CUDA : optimisation

• Does not exist in general (legacy GPGPU) GPU computing :


perspectives

• CUDA allows access to shared memory between threads Installing CUDA

CFD example/demo
Ï Global memory (historical)
PyCUDA / PyOpenCL
• read-only during computation
CUDA / MPI
• write-only at the end of computation (precomputed address)
References
Ï Global memory (new, i.e. CUDA)
• allows general scatter/gather (read/write)
• take care: no collision rules, need atomic operations

80 / 170
GPU

CUDA global/external memory : P. Kestener

allocation/release Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing
Ï Host (CPU) manages device (GPU) memory: History

and embedded GPU


• header cuda_runtime.h for run time API computing ?

FPGA co-processor
• cudaError_t cudaMalloc(void ** pointer, size_t
Nvidia / CUDA
nbytes) Why shoud we use GPU ?
• cudaError_t cudaMemset(void * pointer, intvalue, Hardware architecture

Programming model
size_tcount) CUDA : optimisation
• cudaError_t cudaFree(void* pointer) GPU computing :
perspectives

Ï example use: Installing CUDA

CFD example/demo
int n = 1024;
int nbytes = n * sizeof(int);        // size in bytes, derived from n
int *d_a = 0;
cudaMalloc((void**)&d_a, nbytes);    // allocate on the device
cudaMemset(d_a, 0, nbytes);          // zero-fill on the device
cudaFree(d_a);                       // release

81 / 170
GPU

GPU memory : data transfer P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing
History
Ï cudaError_t cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction)
• header cuda_runtime_api.h for the run time API
• returns after the copy is complete
• blocks the CPU thread until all bytes have been copied
• doesn't start copying until previous CUDA calls complete
Ï enum cudaMemcpyKind
• cudaMemcpyHostToDevice
• cudaMemcpyDeviceToHost
• cudaMemcpyDeviceToDevice
(a short round-trip sketch follows below)
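A short round-trip sketch (h_a is a host array allocated and filled elsewhere; names are illustrative):

size_t nbytes = N * sizeof(float);
float *d_a = 0;
cudaMalloc((void**)&d_a, nbytes);
cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice);   // upload
// ... launch kernels working on d_a ...
cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost);   // download
cudaFree(d_a);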

References

82 / 170
GPU

CUDA memory hierarchy and thread P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing
History

and embedded GPU


computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

cs 193, Stanford 2010,


http://code.google.com/p/stanford-cs193g-sp2010/

83 / 170
GPU

CUDA Code walkthrough P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing
History

and embedded GPU


computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

84 / 170
GPU

CUDA Code walkthrough P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing
History

and embedded GPU


computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

85 / 170
GPU

CUDA Code walkthrough P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing
History

and embedded GPU


computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

86 / 170
GPU

CUDA : Execution model- notion of warp P. Kestener

Architectures,
Parallelism and Moore
law

/usr/local/cuda32/doc/ptx_isa_2.2.pdf (200 pages)
Ï GPU execution requires the kernel code + the grid/block geometry
• concurrent kernel execution (hardware ≥ 2.0)
Ï the multiprocessor control unit creates, manages and schedules threads; threads are grouped into warps (groups of 32 threads with consecutive indexes)
⇒ hardware resource sharing !

87 / 170
GPU

CUDA : Execution model- notion of warp P. Kestener

Architectures,
Parallelism and Moore
Ï branch divergence: a warp executes one common instruction at a time. If threads of a warp diverge via a data-dependent conditional branch (e.g. an if-then-else with a condition on threadIdx), the warp serially executes each branch path taken, disabling threads that are not on that path; when all paths complete, the threads converge back to the same execution path. Performance drops. See the reduction code example, and the sketch below.
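A hedged sketch of the two situations (in/out are device arrays; doubling vs halving stand in for any two code paths):

__global__ void divergent(const float *in, float *out)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  // even/odd lanes of the same warp take different paths:
  // the two branches are executed serially
  if (threadIdx.x % 2 == 0) out[i] = 2.0f * in[i];
  else                      out[i] = 0.5f * in[i];
}

__global__ void uniform(const float *in, float *out)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  // branch granularity = one warp (32 threads):
  // all lanes of a given warp follow the same path, no divergence
  if ((threadIdx.x / 32) % 2 == 0) out[i] = 2.0f * in[i];
  else                             out[i] = 0.5f * in[i];
}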
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

credits :
W. Fung, Dynamic warp formation and scheduling for efficient GPU
flow control
88 / 170
GPU

CUDA : Execution model- notion of warp P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing
Ï the execution context (PC, registers, ...) of each warp is maintained on-chip during the entire lifetime of the warp !
Ï GPU threads are very lightweight (creation and context switching are almost free, i.e. only take a few cycles).
Ï Read CUDA_C_Programming_Guide.pdf, chapter 4
Ï The number of blocks and warps that can reside and be processed together on the multiprocessor for a given kernel depends on the amount of registers and shared memory used by the kernel and the amount of registers and shared memory available on the multiprocessor. (CUDA Programming Guide, section 4.2)
section 4.2)

89 / 170
GPU

CUDA : SIMT Multithread Execution P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing
History

and embedded GPU


computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

CUDA_C_Programming_Guide.pdf, appendix G
For a given kernel (given number of register per thread, given shared
memory per block), this table helps understanding which hardware
resources will be exhausted first when changing run-time parameters (grid
size, block size). 90 / 170
GPU

CUDA : SIMT Multithread Execution P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?
Based on Brent Oster’s slides (NVIDIA) What’s a thread ?

GPU computing
History

and embedded GPU


computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

91 / 170
GPU

CUDA : SIMT Multithread Execution P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?
Based on Brent Oster’s slides (NVIDIA) What’s a thread ?

GPU computing
History

and embedded GPU


computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

92 / 170
GPU

CUDA : SIMT Multithread Execution P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?
Based on Brent Oster’s slides (NVIDIA) What’s a thread ?

GPU computing
History

and embedded GPU


computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

93 / 170
GPU

CUDA : SIMT Multithread Execution P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?
Based on Brent Oster’s slides (NVIDIA) What’s a thread ?

GPU computing
History

and embedded GPU


computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

94 / 170
GPU

CUDA : compilation workflow P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

http://www.nvidia.com/docs/IO/55972/220401_Reprint.pdf What’s a thread ?

GPU computing
Ï NVCC: compiler driver (calls nvopencc (open64), gcc/g++, ... behind the scenes)
Ï PTX: Parallel Thread eXecution
Ï PTX defines an ISA (Instruction Set Architecture) and a low-level virtual machine providing hardware abstraction (portability across GPU hardware evolution and GPU generations, scalability across GPU sizes and numbers of SMs)

95 / 170
GPU

CUDA : compilation workflow P. Kestener

Architectures,
Parallelism and Moore
http://www.nvidia.com/docs/IO/55972/220401_Reprint.pdf law
Why multi-core ?
Ï NVCC: compiler driver (calls nvopencc (open64), gcc/g++, ... behind the scenes)
Ï PTX: Parallel Thread eXecution
Ï high-level language compilers (e.g. nvcc) generate PTX instructions which, in a second stage, are optimized and translated into native hardware instructions (depending on the hardware capability).
Ï other high-level languages can be defined to target the same ISA (e.g. CUDA-Fortran)

96 / 170
GPU

CUDA : compilation workflow P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

http://www.nvidia.com/docs/IO/55972/220401_Reprint.pdf GPU computing


History

and embedded GPU


computing ?

FPGA co-processor

Ï the second stage uses the ptxas tool: PTX assembly to cubin (low-level machine instructions)
Ï the graphics driver can also convert PTX into CUBIN (Just-In-Time optimisation) and issue a PCI-Express upload to the GPU.
PCI-express upload to GPU. CUDA / MPI

References

97 / 170
GPU

CUDA : compilation workflow P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing
History
nvcc documentation : nvcc.pdf and embedded GPU
computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

98 / 170
GPU

CUDA : compilation workflow P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing
History
nvcc documentation : nvcc.pdf and embedded GPU
computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

99 / 170
GPU

CUDA : C-language extensions and run-time API P. Kestener

Architectures,
Parallelism and Moore
law
Ï Function and type qualifiers
__global__ void KernelFunc(...); // kernel callable from host
__device__ void DeviceFunc(...); // function callable on device
__device__ int GlobalVar; // variable in device memory
__shared__ int SharedVar; // shared in PDC by thread block
__host__ void HostFunc(...); // function callable on host
Ï built-in variables: threadIdx and blockDim, blockIdx and gridDim (read-only registers)
Ï kernel function launch syntax
KernelFunc<<<500, 128>>>(...); // launch 500 blocks w/ 128 threads each
<<< ... >>> is used to set grid and block sizes (can also set the shared mem size per block)
Ï thread synchronisation inside a block
__syncthreads(); // barrier synchronization within kernel


Ï libc-like routines (e.g. memory allocation, CPU/GPU data transfers, ...)

100 / 170
GPU

CUDA : C-Programming and run-time API P. Kestener

Architectures,
Parallelism and Moore
Ï specific data types (vector_types.h): e.g. dim3, designed for memory alignment

GPU computing
History

and embedded GPU


computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

Ï in a CUDA kernel, automatic variables (i.e. without any type qualifier) generally reside in registers. In some cases the compiler can choose to place them in local memory (external RAM; take care of the performance drop, you might need to rewrite the algorithm to fit in registers)
Ï the size of an array placed in shared memory can either be set explicitly, or only at launch time; in the latter case use the declaration (see the sketch below):
extern __shared__ float shared[];
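A minimal sketch of the dynamic variant (kernel and sizes illustrative); the shared-memory size is the third <<<...>>> launch parameter:

__global__ void kern(float *d)
{
  extern __shared__ float shared[];    // size fixed only at launch time
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  shared[threadIdx.x] = d[i];
  __syncthreads();
  // ... work on shared[] ...
}
// launch: 128 threads/block, 128 floats of shared memory per block
// kern<<<nBlocks, 128, 128 * sizeof(float)>>>(d_data);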
101 / 170
GPU

CUDA : C-Programming and run-time API P. Kestener

Architectures,
Ï specific data types (vector_types.h): e.g. dim3, designed for memory alignment

What’s a thread ?

GPU computing
History

and embedded GPU


computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

Ï a __constant__ declared variable is placed in constant memory (external DRAM but cached, so very fast, a few clock cycles);
Ï it has the lifetime of the application (no need to re-init between different kernel calls);
Ï it has static storage (take care if you want to use it in multiple compilation units; you can't use the extern keyword);
Ï if you want to use the same constant variable in multiple compilation units (different .cu files), you need to init constant memory in each of them, i.e. call cudaMemcpyToSymbol (see the sketch below)
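A hedged sketch (symbol name and sizes illustrative): declare the symbol, fill it from the host with cudaMemcpyToSymbol, read it from a kernel:

__constant__ float c_coef[16];          // lives in cached constant memory

void setCoefs(const float *h_coef)
{
  // host-side init; to be repeated in every .cu file declaring the symbol
  cudaMemcpyToSymbol(c_coef, h_coef, 16 * sizeof(float));
}

__global__ void apply(float *d, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) d[i] *= c_coef[i % 16];    // broadcast reads hit the cache
}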
102 / 170
GPU

CUDA : C-Programming and run-time API P. Kestener

Architectures,
Parallelism and Moore
file:///usr/local/cuda32/doc/html/modules.html law
Why multi-core ?
Ï Runtime API : high-level, build What’s a thread ?

on top of the driver API (init, GPU computing

context and module History

and embedded GPU

management are implicit, code computing ?

FPGA co-processor

is concise), prefix cuda device Nvidia / CUDA


emulation Why shoud we use GPU ?
Hardware architecture

Ï Driver API : low-level, better Programming model


CUDA : optimisation
level of control, harder to GPU computing :
perspectives
program/debug, optimisation Installing CUDA

PTX JIT (Just-In-Time), code CFD example/demo

lengthy, prefix cu PyCUDA / PyOpenCL

CUDA / MPI
Ï GPU Context ∼ CPU process
References
Ï GPU Module ∼ CPU dynamic
library
Ï Driver and Run-time API can be
used/linked together

103 / 170
GPU

CUDA Run-time API: code walkthrough P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing
History

and embedded GPU


computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

104 / 170
GPU

CUDA Run-time API: code walkthrough P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing
History

and embedded GPU


computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

105 / 170
GPU

CUDA Run-time API: code walkthrough P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing
History

and embedded GPU


computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

106 / 170
GPU

CUDA Run-time API: code walkthrough P. Kestener

Architectures,
Ï example : sum of vectors Parallelism and Moore
law
// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  C[i] = A[i] + B[i];
}
Why shoud we use GPU ?
Ï invocation
int main() {
  // allocate device (GPU) memory
  float *d_A, *d_B, *d_C;   // note: each pointer needs its own *
  cudaMalloc( (void**) &d_A, N * sizeof(float));
  cudaMalloc( (void**) &d_B, N * sizeof(float));
  cudaMalloc( (void**) &d_C, N * sizeof(float));

  // copy host memory to device (h_A, h_B allocated/filled elsewhere)
  cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice);

  // Execute on N/256 blocks of 256 threads each (assumes N % 256 == 0)
  vecAdd<<< N/256, 256>>>(d_A, d_B, d_C);
}
107 / 170
GPU

CUDA Optimisations P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

A CUDA program should take the following constraints into account:
Ï Coalesced access to global memory: threads with consecutive indexes should access consecutive memory addresses for good alignment
Ï Use shared memory (high bandwidth, low latency)
Ï Efficient use of parallelism
Keep the GPU busy as long as possible
Try to have a high compute ops / memory access ratio
Make good use of the thread hierarchy
Ï Try to avoid shared memory bank conflicts

108 / 170
GPU

Parallel primitives P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing
History

and embedded GPU


computing ?
parallel-prefix-sum ("scan"), parallel sort and parallel reduction
Ï Thrust: http://code.google.com/p/thrust
Ï cudpp: http://gpgpu.org/developer/cudpp
Ï comparison Thrust/CUDPP: http://code.google.com/p/thrust/wiki/ThrustAndCUDPP
Ï reduction: see the reduction example in the CUDA SDK, and the sketch below
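A hedged Thrust sketch (C++ templates on top of CUDA; sizes illustrative) showing a reduction and an in-place inclusive scan:

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>
#include <thrust/functional.h>
#include <cstdio>

int main(void)
{
  thrust::device_vector<int> d(1024, 1);   // 1024 ones, on the GPU
  int sum = thrust::reduce(d.begin(), d.end(), 0, thrust::plus<int>());
  thrust::inclusive_scan(d.begin(), d.end(), d.begin()); // prefix sum
  printf("sum = %d, last scan value = %d\n", sum, (int)d[1023]);
  return 0;
}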

PyCUDA / PyOpenCL

CUDA / MPI

References

109 / 170
GPU

CUDA: summary P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing
History

and embedded GPU


computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

110 / 170
GPU

CUDA: summary P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing
History

and embedded GPU


computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

111 / 170
GPU

CUDA: summary P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing
History

and embedded GPU


computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

112 / 170
GPU

CUDA: summary P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing
History

and embedded GPU


computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

113 / 170
GPU

CEA hybrid CPU/GPU Cluster, 2009 - Titane P. Kestener

Architectures,
Parallelism and Moore
Titane: ∼ 100 CPU-TFLOPS + ∼ 200 GPU-TFLOPS (Tesla law

S1070, hw 1.3) Why multi-core ?

What’s a thread ?

GPU computing
History

and embedded GPU


computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

Ï cudaGetDeviceCount(&count); returns 2 because a Titane


node only sees half a Tesla-S1070, i.e. 2 GPU devices.
http://www-ccrt.cea.fr/fr/moyen_de_calcul/titane.htm
114 / 170
GPU

CUDA atomic operations P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing
Ï What happens when multiple threads of a warp try to modify data at the same address (global or shared memory) ?
Ï which thread performs the final write is undefined !
Ï example algorithms: histogram or tree-building
Ï example atomic function: int atomicAdd(int* address, int val); no other thread can access this address until the operation is complete (see the sketch below).
Ï See CUDA programming guide, Section B.5 (Memory fence functions) for another example of a reduction implementation that uses atomic functions.
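A minimal histogram sketch (kernel name illustrative; global atomics need hardware ≥ 1.1): atomicAdd serialises colliding updates so no increment is lost:

__global__ void histo(const unsigned char *in, int n, unsigned int *bins)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    atomicAdd(&bins[in[i]], 1u);   // safe concurrent increment of bin in[i]
}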

116 / 170
GPU

CUDA printf for debug P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing
History

and embedded GPU


computing ?

FPGA co-processor
Ï int printf(const char *format[, arg, ...]);
Ï similar to the standard C library (stdio.h)
Ï only available for hardware ≥ 2.0 (Fermi)
Ï use with care ! Don't forget to reduce the block sizes ! (see the sketch below)
Ï See CUDA programming guide, section B.14
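A minimal sketch (kernel name illustrative; compile with -arch=sm_20), printing one line per block rather than one per thread:

__global__ void debugKernel(float *d)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (threadIdx.x == 0)   // limit the flood: first thread of each block only
    printf("block %d sees d[%d] = %f\n", blockIdx.x, i, d[i]);
}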

PyCUDA / PyOpenCL

CUDA / MPI

References

117 / 170
GPU

CUDA: Miscellaneous development tools... P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

... for numerical applications What’s a thread ?

GPU computing
History

Ï cuFFT: CUDA-based FFT implementation, and embedded GPU


computing ?

FPGA co-processor
Ï cuBLAS: CUDA-based Basic Linear Algebra by NVIDIA, Nvidia / CUDA

Ï culatools: GPU linear algebra package, ∼ cuBLAS, Why shoud we use GPU ?
Hardware architecture

Ï MAGMA: Matrix algebra on GPU and multi-core Programming model


CUDA : optimisation

(faster than cuBLAS), GPU computing :


perspectives

Ï openCurrent: C++ library for solving PDE on GPU, Installing CUDA

CFD example/demo
Ï cusp: sparse linear algebra on GPU, PyCUDA / PyOpenCL

Ï openNL: sparse linear algebra, CUDA / MPI

References
Ï libra: GPU SDK for Matlab,
Ï cudpp: data parallel primitive for GPU (see also Thrust)
http://www.nvidia.com/object/tesla_software.html

118 / 170
GPU

GPU computing challenges P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing

Ï Computations with no known scalable parallel algorithms History

and embedded GPU


computing ?
• Shortest path, Delaunay triangulation, ...
FPGA co-processor

Ï Data distributions that cause catastrophical load imbalance Nvidia / CUDA

in parallel algorithms Why shoud we use GPU ?


Hardware architecture

• Free-form graphs, MRI spiral scan Programming model


CUDA : optimisation
Ï Computations that do not have data reuse GPU computing :
perspectives

• Matrix vector multiplication, ... Installing CUDA

CFD example/demo
Ï Algorithm optimizations that are hard and labor intensive
PyCUDA / PyOpenCL
• Locality and regularization transformations
CUDA / MPI

credits: slides by Wen-mei Hwu References

119 / 170
GPU

CUDA : Memory allocation and transfer time P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

Based on K. Skadron slides (University of Virginia) What’s a thread ?

GPU computing
Ï GPU global/external memory allocation is costly

and embedded GPU


computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

120 / 170
GPU

CUDA : Memory allocation and transfer time P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?
Based on K. Skadron slides (University of Virginia) What’s a thread ?

Ï transfer time CPU ←→ GPU (think about the transfer overhead before off-loading computations to the GPU)
computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

121 / 170
GPU

CUDA optimisation P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing
History

and embedded GPU


computing ?

FPGA co-processor

Ï slides from ISC 2009: CUDA Optimisation
Ï slides http://moss.csc.ncsu.edu/~mueller/cluster/nvidia/GPU+CUDA.pdf
matrix transposition example code (illustrating coalescing, shared memory bank conflicts, ...) from slide 142.

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

122 / 170
GPU

CUDA optimisation : coalescence P. Kestener

Architectures,
Parallelism and Moore
Figure G1, /usr/local/cuda32/doc/CUDA_C_Programming_Guide.pdf, law
Why multi-core ?
SDK 3.2 What’s a thread ?

GPU computing
History

and embedded GPU


computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

123 / 170
GPU

CUDA optimisation : coalescence P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing
History

and embedded GPU


computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

124 / 170
GPU

CUDA optimisation : coalescence P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing
History

and embedded GPU


computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

125 / 170
GPU

CUDA optimisation : coalescence P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing
History

and embedded GPU


computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

126 / 170
GPU

CUDA optimisation : coalescence and SoA/AoS P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing
History
Ï Array of Structures (AoS):
• 3 colors / pixel → alignment issues
• complex access pattern to global memory
Ï Structure of Arrays (SoA):
• coalescing constraint satisfied by design
• coalescing constraint still OK if you add more fields/arrays; e.g. RGB → RGBA
(see the sketch below)
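A hedged sketch of the two layouts for an RGB image (type names are illustrative):

// AoS: one struct per pixel; thread i reads r at a stride of 3 floats,
// which breaks coalescing
struct PixelAoS { float r, g, b; };
PixelAoS *img_aos;          // access: img_aos[i].r, img_aos[i].g, ...

// SoA: one array per component; consecutive threads read consecutive
// addresses, coalesced by construction
struct ImageSoA { float *r, *g, *b; };
ImageSoA img_soa;           // access: img_soa.r[i], img_soa.g[i], ...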

References

http://perso.ens- lyon.fr/sylvain.collange/talks/calcul_gpu_scollange.pdf

127 / 170
GPU

CUDA : profiler P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

Ï Toolkit documentation: Compute_Profiler.txt What’s a thread ?

GPU computing
Ï environment variables: History

and embedded GPU


• CUDA_PROFILE or COMPUTE_PROFILE : 1 for enable computing ?

FPGA co-processor
• CUDA_PROFILE_LOG or COMPUTE_PROFILE_LOG : results file
Nvidia / CUDA
name (default is launch directory/cuda_profile.log) Why shoud we use GPU ?

• CUDA_PROFILE_CONFIG or COMPUTE_PROFILE_CONFIG : config Hardware architecture

Programming model
file; list names of performance counters for logging information CUDA : optimisation

GPU computing :
Ï these variables are already set for hands-on tutorial. perspectives

Installing CUDA

Ï example of CUDA profile config file: CFD example/demo

divergent_branch PyCUDA / PyOpenCL

warp_serialize CUDA / MPI

timestamp References

gld_incoherent
gst_incoherent

128 / 170
GPU

CUDA : profiler / terminology P. Kestener

Architectures,
Parallelism and Moore
law
Ï gld_incoherent: Number of non-coalesced global memory Why multi-core ?

loads What’s a thread ?

GPU computing
Ï gld_coherent: Number of coalesced global memory loads History

and embedded GPU


Ï gst_incoherent: Number of non-coalesced global memory computing ?

FPGA co-processor
stores
Nvidia / CUDA
Ï gst_coherent: Number of coalesced global memory stores Why shoud we use GPU ?
Hardware architecture

Ï local_load: Number of local memory loads Programming model


CUDA : optimisation

Ï local_store: Number of local memory stores GPU computing :


perspectives

Installing CUDA
Ï branch: Number of branch events taken by threads
CFD example/demo
Ï divergent_branch: Number of divergent branches within a PyCUDA / PyOpenCL

warp CUDA / MPI

References
Ï instructions: instruction count
Ï warp_serialize: Number of threads in a warp that serialize
based on address conflicts to shared or constant memory
Ï cta_launched: executed thread blocks

129 / 170
GPU

CUDA optimisation : data prefetching P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing
Ï data prefetching = load data in advance (e.g. for the next iteration in a loop) !!!
Ï when a memory access operation is executed, it does not block the operations following it, as long as they don't use the data from that operation.
Ï loop without data prefetching:
for (i = 0; i < N; i++) {
    sum += array[i];
}
Ï every addition waits for its data to be loaded from memory

130 / 170
GPU

CUDA optimisation : data prefetching P. Kestener

Architectures,
Ï data prefetching = load data in advance (e.g. for the next iteration in a loop) !!!
Ï when a memory access operation is executed, it does not block the operations following it, as long as they don't use the data from that operation.
Ï loop with data prefetching:
temp = array[0];
for (i = 0; i < N-1; i++) {
    temp2 = array[i + 1];   // prefetch next element
    sum += temp;            // compute on current element
    temp = temp2;
}
sum += temp;
Ï with data prefetching, inside the loop, the memory load for iteration i + 1 and the actual addition for iteration i are done in parallel
Ï data prefetching benefit: fewer warps are needed to hide memory latency
Ï data prefetching cost on the GPU: more registers, hence fewer resident warps
GPU

GPU computing - Fermi et cache L1/L2 P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?
Ï global memory R/W are pipelined
Ï global memory latency: a few hundred clock cycles
Ï shared memory / L1 cache latency: 10-20 cycles


computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

132 / 170
GPU

GPU computing - Fermi et cache L1/L2 P. Kestener

Architectures,
Ï L1 cache: resources shared with shared memory !
Ï L1 cache used for reading local variables when registers run out
Ï L2 cache: for global memory
Ï gain:
• caching captures locality, amplifies bandwidth, reduces latency
• caching aids irregular or unpredictable accesses
• ⇒ better performance for algorithms with complex memory access patterns
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

133 / 170
GPU

GPU computing - CUDA P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing
History

and embedded GPU


computing ?

FPGA co-processor

Ï NVIDIA Performance Primitives (NPP)
Ï NVIDIA NPP is a library of functions for performing CUDA accelerated processing
Ï The initial set of functionality in the library focuses on imaging and video processing

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

134 / 170
GPU

GPU computing - OpenCL P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing
History

and embedded GPU


computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

Ï OpenCL on Wikipedia // Introduction to OpenCL
Ï standard http://www.khronos.org, version 1.0 (12/2008)
Ï aim: a programming model for GPUs (Nvidia/ATI), multicore CPUs and the CELL: data- and task-parallel compute model
Ï the OpenCL programming model uses most of the abstract concepts of CUDA
135 / 170
GPU

GPU computing - OpenCL P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing
History

and embedded GPU


computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI
Ï convert CUDA program to OpenCL :
References
http://developer.amd.com/documentation/articles/pages/OpenCL- and- the- ATI-Stream-v2.0- Beta.aspx#four

Ï CUDA used to be ahead of OpenCL; CUDA was adopted rapidly, but people will start to move to OpenCL

137 / 170
GPU

GPU computing - OpenCL P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

GPU computing
Google Trends with keywords NVIDIA Cuda, ATI Stream, OpenCL

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

138 / 170
GPU

GPU computing - HMPP (Caps) P. Kestener

Architectures,
Parallelism and Moore
Ï Preprocessor directives à la OpenMP, source-code transformation: CAPS and its HMPP toolset (Hybrid Manycore Parallel Programming)


History

and embedded GPU


computing ?

FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

140 / 170
GPU

Installing CUDA / OpenCL P. Kestener

Architectures,
Parallelism and Moore
Ï get a CUDA-compatible GPU: G80 or later (≥ 2007); personal example: GeForce GTX-285 board (hardware 1.3)
Ï example system: Linux OS - x86_64 (November 2010) - CUDA 3.2
Ï online resources at NVIDIA http://developer.nvidia.com/object/gpucomputing.html
Ï Nvidia kernel driver: 260.24 - devdriver_3.2_linux_64_260.24.run
Ï compilation toolchain (toolkit): cudatoolkit_3.2.9_linux_64_ubuntu10.04.run
Ï Cuda / OpenCL SDK: gpucomputingsdk_3.2_linux.run
• utilities: command-line argument parsing, compute-time measurement, PGM read/write, GLUT ...
• CUDPP (CUDA Data Parallel Primitives Library)
• ∼ 80 CUDA application examples: memory management, search, reduction, random number generators, dot product, image processing, finance, graphical interface ...
• ∼ 30 OpenCL examples
141 / 170
GPU

Checking for a CUDA-compatible GPU P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?
Ï run the deviceQuery example from the SDK
Ï header cuda_runtime_api.h: cudaGetDeviceCount, cudaGetDeviceProperties (see the sketch below)
Ï to find the installed driver version:
cat /proc/driver/nvidia/version
GeForce GTX 285 / Tesla C1060
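A minimal enumeration sketch built on these two runtime calls (essentially a stripped-down deviceQuery):

#include <cstdio>
#include <cuda_runtime_api.h>

int main(void)
{
  int count = 0;
  cudaGetDeviceCount(&count);
  for (int i = 0; i < count; i++) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, i);
    printf("device %d: %s, capability %d.%d\n",
           i, prop.name, prop.major, prop.minor);
  }
  return 0;
}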

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

142 / 170
GPU

SDK CUDA P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

Ï the SDK examples all follow the same scheme (see the template project), generally 3 source files:
• project_name.cu: contains the main, and thus the entry point to code execution on the GPU
• project_name_kernel.cu: CUDA version of an algorithm
• project_name_gold.c: native CPU version, for functional comparison of results and possibly performance benchmarking
Ï a few pedagogically important examples
• transpose: notion of memory-access coalescing
• reduction: computing the sum of the elements of an array (the underlying algorithms have been turned into template C++ libraries: see Thrust and CUDPP)
Ï CUDA code can be put in a standard C/C++ file, provided it is protected by the __CUDACC__ macro (see the sketch below)
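A minimal sketch of the guard, here used to share one function between nvcc and a plain C/C++ compiler (the macro name HOST_DEVICE is illustrative):

#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE            /* plain compiler: qualifiers vanish */
#endif

HOST_DEVICE float sq(float x) { return x * x; }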

143 / 170
GPU

Summary P. Kestener

Architectures,
Architectures, Parallelism and Moore law Parallelism and Moore
law
Why multi-core ? Why multi-core ?

What’s a thread ? What’s a thread ?

GPU computing
GPU computing History

and embedded GPU


History computing ?

and embedded GPU computing ? FPGA co-processor

Nvidia / CUDA
FPGA co-processor Why shoud we use GPU ?
Hardware architecture
Nvidia / CUDA Programming model

Why shoud we use GPU ? CUDA : optimisation

GPU computing :
Hardware architecture perspectives

Installing CUDA
Programming model
CFD example/demo
CUDA : optimisation PyCUDA / PyOpenCL
GPU computing : perspectives CUDA / MPI

Installing CUDA References

CFD example/demo
PyCUDA / PyOpenCL
CUDA / MPI
References
144 / 170
GPU

Example application: the 2D Euler equations P. Kestener

Architectures,
Parallelism and Moore
law

Ï conservative formulation (∂_t φ + ∇f = 0) of the 2D Euler equations:
U_t + F(U)_x + G(U)_y = 0
Ï conservative variables and flux terms:
U = [ρ, ρu, ρv, E]^T , F = [ρu, ρu² + p, ρuv, u(E + p)]^T , G = [ρv, ρuv, ρv² + p, v(E + p)]^T
Ï E = ρ(½ V² + e), total energy per unit volume
Ï perfect-gas equation of state for the internal energy: e = p / (ρ(γ − 1)), with γ = 1.4 (air at T = 20 °C)
Ï switch to non-conservative variables: U ⇒ W with W = [ρ, u, v, p]^T

145 / 170
GPU

Numerical schemes P. Kestener

Architectures,
Parallelism and Moore
law
Why multi-core ?

What’s a thread ?

Ï first-order Godunov method:
U_i^{n+1} = U_i^n + (Δt/Δx) (F_{i−1/2} − F_{i+1/2})
Ï implementation of the MUSCL-Hancock scheme (Monotone Upstream-centered Schemes for Conservation Laws)
Ï a second-order Godunov scheme, where the U_i^n are replaced by piecewise-linear functions.
Ï MUSCL-Hancock in 3 steps:
- compute the slopes and the values of U_i at the cell boundaries
- evolve by half a time step (see the sketch below)
- solve the Riemann problem to get the new fluxes F_{i+1/2} and update the U_i
Ï 2D solved by directional splitting
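A hedged sketch of the half-time-step (predictor) evolution, written in the standard MUSCL-Hancock form (U_i^L, U_i^R denote the slope-limited states at the left/right faces of cell i; notation as in Toro's textbook, not taken from these slides):

Ū_i^{L,R} = U_i^{L,R} + (Δt / 2Δx) [ F(U_i^L) − F(U_i^R) ]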

146 / 170
GPU

Data structures P. Kestener

Architectures,
Ï 4 grilles 2D: Parallelism and Moore
law
• 1 par variable conservative Why multi-core ?

• discrétisation de l’espace 2D ou évolue le fluide What’s a thread ?

• conditions limites: bordure de 2 éléments, plusieurs types: GPU computing

Ï fermée History

and embedded GPU


Ï ouverte sur l’infini computing ?

Ï ouverte sur la bordure opposée FPGA co-processor

Nvidia / CUDA
Why shoud we use GPU ?
Hardware architecture

Programming model
CUDA : optimisation

GPU computing :
perspectives

Installing CUDA

CFD example/demo

PyCUDA / PyOpenCL

CUDA / MPI

References

147 / 170
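One plausible realization of this layout, with a 2-cell ghost border around each grid; the type and field names below are invented for illustration, not taken from the author's code:

    #define GHOST 2   /* ghost-cell border width, as stated above */

    typedef struct {
        int nx, ny;        /* interior cells */
        int pitch;         /* row length including ghosts: nx + 2*GHOST */
        float *rho, *rhou, *rhov, *E;  /* one device array per conservative variable */
    } HydroGrid2D;

    /* linear index of interior cell (i,j), 0 <= i < nx, 0 <= j < ny */
    __host__ __device__ int cell_index(const HydroGrid2D *g, int i, int j)
    {
        return (j + GHOST) * g->pitch + (i + GHOST);
    }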
Simulation

- Simulation of a fluid jet injected into the domain
- Simulation parameters:
  • run parameters (simulated time, output frequency)
  • geometry parameters (width, length, Δx)
  • boundary types
  • scheme parameters
  • jet parameters

(A possible parameter block is sketched below.)
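These parameters map naturally onto a single configuration structure. The sketch below is hypothetical; the field names are invented and do not come from the author's code:

    typedef struct {
        float t_end;         /* total simulated time */
        int   output_freq;   /* write an output every N time steps */
        int   nx, ny;        /* grid dimensions (width, length) */
        float dx;            /* cell size */
        int   boundary[4];   /* boundary type on each side */
        float cfl;           /* Courant factor of the scheme */
        int   jet_width;     /* jet parameters */
        float jet_velocity;
    } SimParams;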
Example 2D simulation (jet)

- run: t_end = 1 s, one output every 50 time steps
- geometry: rectangular grid 200 × 800, Δx = 0.05 m
- boundaries: closed on all 4 sides
- scheme: 10 iterations for the Riemann solver, Courant factor 0.8
- jet: central, 10 cells wide, v = 300 m·s⁻¹
- initial conditions: W = [ρ = 1, u = 0, v = 0, p = 1]^T in the interior and W = 0 on the borders (see the sketch below)
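A sketch of how the uniform initial state above could be written into the conservative arrays (hypothetical names; note that with u = v = 0 the total energy reduces to E = p/(γ−1)):

    /* Fill the domain with a uniform state W = (rho0, u0, v0, p0),
     * stored in conservative form: E = p/(gamma-1) + rho*(u^2+v^2)/2. */
    __global__ void init_uniform(float *rho, float *rhou, float *rhov, float *E,
                                 int n, float rho0, float u0, float v0, float p0)
    {
        const float gamma = 1.4f;
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            rho[i]  = rho0;
            rhou[i] = rho0 * u0;
            rhov[i] = rho0 * v0;
            E[i]    = p0 / (gamma - 1.0f)
                    + 0.5f * rho0 * (u0 * u0 + v0 * v0);
        }
    }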
Example 3D simulation (Kelvin-Helmholtz instability)

Computation time for a 3D 256×256×16 simulation: 30 minutes (CPU) versus 30 seconds (GPU).

[Figure: snapshot of the Kelvin-Helmholtz instability simulation; the image is not recoverable from the extraction]
Performance

[Figure: CPU vs GPU performance results; the plot itself is not recoverable from the extraction]

Reference: P. Kestener et al., HPCTA 2010, Busan.
Summary

Architectures, Parallelism and Moore law
  Why multi-core?
  What's a thread?
GPU computing
  History
  and embedded GPU computing?
  FPGA co-processor
Nvidia / CUDA
  Why should we use GPU?
  Hardware architecture
  Programming model
  CUDA: optimisation
  GPU computing: perspectives
Installing CUDA
CFD example/demo
PyCUDA / PyOpenCL
CUDA / MPI
References
CUDA - Python bindings

PyCUDA

- CPU development tools for sequential programs have a long history and provide high productivity and efficiency
- GPU development tools are still in their infancy!
- PyCUDA wraps the CUDA driver API into the Python language
- reference: Andreas Klöckner, http://mathema.tician.de/software/pycuda
PyCuda workflow

- GPUs are everything that scripting languages are not:
  • highly parallel
  • very architecture-sensitive
  • built for maximum FP/memory throughput
- Python + CUDA = PyCUDA:
  • manages resources automatically
  • provides a glue language for lower-level building blocks
  • very nice integration with numpy (a must in Python scientific computing)
PyCUDA installation

- download the package from http://pypi.python.org/pypi/pycuda
- untar and configure (set the CUDA toolkit location):
    ./configure.py --cuda-root=/usr/local/cuda32/ --cuda-enable-gl
- build the package:
    make
- install the package system-wide:
    sudo make install
  This last step installs the package in
  /usr/local/lib/python2.6/dist-packages/pycuda-0.94.2-py2.6-linux-x86_64.egg
PyCuda - Code walkthrough

PyCUDA code demo.py from the examples
Compute multiplication by a scalar, elementwise

- Initialization and data array declaration:

    # access to the CUDA driver API
    import pycuda.driver as cuda
    # initialize, create context
    import pycuda.autoinit
    from pycuda.compiler import SourceModule

    import numpy
    a = numpy.random.randn(4,4).astype(numpy.float32)
    a_gpu = cuda.mem_alloc(a.size * a.dtype.itemsize)
    cuda.memcpy_htod(a_gpu, a)

- Kernel declaration and execution
- Retrieve result on CPU and compare
PyCuda - Code walkthrough

PyCUDA code demo.py from the examples
Compute multiplication by a scalar, elementwise

- Initialization and data array declaration
- Kernel declaration and execution:

    # define the CUDA kernel as a python string,
    # compile it and load it onto the GPU device
    mod = SourceModule("""
    __global__ void doublify(float *a)
    {
      int idx = threadIdx.x + threadIdx.y*4;
      a[idx] *= 2;
    }
    """)
    # get a handle to the compiled kernel
    func = mod.get_function("doublify")
    # launch the GPU computation
    func(a_gpu, block=(4,4,1))

- Retrieve result on CPU and compare
PyCuda - Code walkthrough

PyCUDA code demo.py from the examples
Compute multiplication by a scalar, elementwise

- Initialization and data array declaration
- Kernel declaration and execution
- Retrieve result on CPU and compare:

    a_doubled = numpy.empty_like(a)
    cuda.memcpy_dtoh(a_doubled, a_gpu)
    print "original array:"
    print a
    print "doubled with kernel:"
    print a_doubled
PyCuda package: GPUArray

- data array abstraction: gpuarray
    import pycuda.gpuarray as gpuarray
- gpuarray is a numpy.ndarray work-alike that stores its data and performs its computations on the compute device
- it hides the host-to-device memory copy:
    a_gpu = gpuarray.to_gpu(numpy.random.randn(4,4).astype(numpy.float32))
- some predefined CUDA kernels are exposed as gpuarray instance methods: reductions (min, max, sum, ...), elementwise operators (fabs, trigonometric functions, ...)
- pyfft package (execute FFTs on the GPU from Python): http://pypi.python.org/pypi/pyfft
PyCuda - Real life example

Lattice Boltzmann simulations: Sailfish, http://sailfish.us.edu.pl/

Hands-on (TP):

- Have a look at the examples in the pycuda distribution
- Grab the CUDA kernel code for solving the heat equation and implement a pycuda version, plus a Matplotlib GUI (a kernel sketch is given below)
- Play with the sailfish examples
- Implement a pyCUDA version of the CPU program rayleighbenard.py (a simple python LBM solver)
- Have a look at pyFFT (a pyCUDA/pyOpenCL interface to compute FFTs on the GPU); already installed in /usr/local/lib/python2.6/dist-packages/pyfft-0.3.5-py2.6.egg. Download the sources to get the documentation and examples.
- future?: copperhead, a data-parallel subset of Python, dynamically compiled for the target platform
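For the heat-equation exercise, the kernel one would paste into a pycuda SourceModule might look like the following sketch (an explicit 5-point stencil; names and discretization details are illustrative, not the course's reference code):

    /* One explicit Euler step of the 2D heat equation on an nx*ny grid;
     * alpha = D*dt/dx^2 must be small enough for stability (< 0.25). */
    __global__ void heat_step(const float *u, float *u_new,
                              int nx, int ny, float alpha)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i > 0 && i < nx - 1 && j > 0 && j < ny - 1) {
            int idx = j * nx + i;
            u_new[idx] = u[idx] + alpha * (u[idx - 1] + u[idx + 1]
                       + u[idx - nx] + u[idx + nx] - 4.0f * u[idx]);
        }
    }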
Summary

Architectures, Parallelism and Moore law
  Why multi-core?
  What's a thread?
GPU computing
  History
  and embedded GPU computing?
  FPGA co-processor
Nvidia / CUDA
  Why should we use GPU?
  Hardware architecture
  Programming model
  CUDA: optimisation
  GPU computing: perspectives
Installing CUDA
CFD example/demo
PyCUDA / PyOpenCL
CUDA / MPI
References
CUDA - MPI coupling/integration

- CUDA and MPI are almost orthogonal: good!
- MPI: distributed parallel processing
- CUDA: co-processor, CPU accelerator
- NVIDIA System Management Interface (shipped with the Nvidia driver):
  • nvidia-smi -s: show the current rules for COMPUTE applications
  • nvidia-smi -q: query information (utilization rates, memory usage, etc.)
- nvidia-smi compute modes (on Tesla hardware running Linux):
  • Default: multiple host threads can use the device at the same time
  • Exclusive: only one host thread can use the device at any given time
  • Prohibited: no host thread can use the device
  • When using the Default mode (a system administrator's choice), the programmer NEEDS to call cudaSetDevice(devId) to make explicit which GPU device is associated with which MPI process
  • example GPU device choice rule (see the sketch below):
    cudaGetDeviceCount(&count); int devId = myRank % count;
- 225_GTC2010.pdf slides
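A minimal sketch of the rank-modulo device binding quoted above (assumption: one MPI process per GPU, and the global rank is used in place of a per-node rank for simplicity):

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        int myRank, count;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
        cudaGetDeviceCount(&count);       /* GPUs visible to this node */
        cudaSetDevice(myRank % count);    /* bind this process to one GPU */
        /* ... CUDA kernels + MPI halo exchanges ... */
        MPI_Finalize();
        return 0;
    }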
CEA hybrid CPU/GPU Cluster, 2009 - Titane

Titane: ~100 CPU-TFLOPS + ~200 GPU-TFLOPS (Tesla S1070, hw 1.3)

[Figure: view of the Titane cluster; the image is not recoverable from the extraction]

- cudaGetDeviceCount(&count); returns 2, because a Titane node only sees half a Tesla S1070, i.e. 2 GPU devices.

http://www-ccrt.cea.fr/fr/moyen_de_calcul/titane.htm
CUDA MPI: software development issues

- It is easier to use the nvcc compiler driver than mpicc to integrate CUDA into an MPI program
- You just need to pass the MPI CFLAGS and LDFLAGS to nvcc
- Integration with a build system: an example autotools-based project will be provided during the hands-on session
- Example hands-on: parallelize the heat solver using MPI over 2 nodes (only 1 border is involved in the MPI communications). Start with the CPU version, then move to the GPU version.
- Potentially interesting tools:
  • Global Memory for Accelerator (GMAC), a user-level library that implements an Asymmetric Distributed Shared Memory model
  • GPUDirect, the Nvidia peer-to-peer communication tool
Summary

Architectures, Parallelism and Moore law
  Why multi-core?
  What's a thread?
GPU computing
  History
  and embedded GPU computing?
  FPGA co-processor
Nvidia / CUDA
  Why should we use GPU?
  Hardware architecture
  Programming model
  CUDA: optimisation
  GPU computing: perspectives
Installing CUDA
CFD example/demo
PyCUDA / PyOpenCL
CUDA / MPI
References
Online CUDA courses and tutorials

- multimedia: http://www.nvidia.com/object/cuda_education.html
- CUDA Education: http://www.nvidia.com/object/cuda_university_courses.html
- one of the first CUDA courses, University of Illinois
- ASPLOS2008 workshop: http://gpgpu.org/asplos2008
- ARCHI09 thematic school: http://www.irisa.fr/archi09/defour.pdf (history of vector processors as GPU precursors)
- Parallel Programming (Johns Hopkins University): cs420
- NCSA, April 1, 2009
- http://sites.google.com/site/cudaiap2009/home
- CUDA wiki at DAAC
- CUDA Tutorial at ISC2009
- a good summary: http://www.caam.rice.edu/~timwar/NUDG/RMMC/CUDA.html
Bibliography

- CUDA documentation from the SDK / Toolkit 3.2 (September 2010):
  • CUDA_Toolkit_Reference_Manual.pdf
  • nvcc.pdf
  • CUDA_C_Programming_Guide.pdf
  • CUDA_C_Best_Practices_Guide.pdf
  • ptx_isa_2.2.pdf
Keeping up with GPU computing online

- RSS feeds / news:
  • on Twitter: http://twitter.com/nvidiadeveloper and the RSS feed http://twitter.com/statuses/user_timeline/18691870.rss
  • Nvidia news: http://news.developer.nvidia.com
  • GPGPU.org RSS: http://gpgpu.org/feed
- the Nvidia forum for asking questions about CUDA: http://forums.nvidia.com/index.php?showforum=62
- the Nvidia developer corner: http://developer.nvidia.com/page/home.html
- an interesting blog: http://gpumodeling.blogspot.com/
Other interesting links on parallel computing

- Parallel Computing course: http://lca.ucsd.edu/projects/phys244
- cs193, Programming Massively Parallel Processors with CUDA, Stanford University
- cis565, GPU Programming and Architecture, University of Pennsylvania
- CUDA index from the University of Michigan
