High Performance Computing Center
Hanoi University of Science & Technology
Introduction to GP-GPU and CUDA
Duong Nhat Tan (dn.nhattan@gmail.com)
2012
Outline
Overview
What is GPGPU?
GPU Computing with CUDA
Hardware Model
Execution Model
Thread Hierarchy
Memory Model
GPU Computing Application Areas
Summary
Overview
Scientific computing has the following
characteristics:
The problems are not interested.
Computers are used to carry out the arithmetic.
Programs are always expected to run faster.
Examples: weather forecasting, climate
change, modeling, simulation, gene
prediction, docking…
Several Approaches
Supercomputers
Mainframe
Cluster
Multi-/many-core systems
Microprocessor trends
Many cores running at lower frequencies are fundamentally
more power-efficient
Multi-core (2-8 cores)
Intel CPUs: Pentium D, Core Duo, Core 2 Duo, quad-core, Core i3, i5, i7
Many-core (> 8 cores)
GPU - Graphics Processing unit
A. P. Chandrakasan, M. Potkonjak, R. Mehra, J. Rabaey, and R. W. Brodersen,
“Optimizing Power Using Transformations,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
The development of modern GPUs
GPU: NVIDIA GeForce GTX 295
CUDA Cores: 480 (240 per GPU)
Graphics Clock (MHz): 576
Processor Clock (MHz): 1242
Memory Clock (MHz): 999
Memory Bandwidth (GB/sec): 223.8
Benchmark (GFLOPS): 1788.48
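The GFLOPS and bandwidth figures in this specification can be reproduced from the core count and clock rates. The sketch below is only a back-of-the-envelope check. It assumes three floating-point operations per core per clock (one multiply-add plus one multiply), double-data-rate memory, and a 448-bit memory bus per GPU; none of these three values appears on the slide.

```python
# Figures taken from the GeForce GTX 295 specification above.
cuda_cores = 480             # 240 per GPU, two GPUs on the card
processor_clock_mhz = 1242
memory_clock_mhz = 999

# Assumptions (not on the slide):
flops_per_core_per_clock = 3   # one multiply-add (2 flops) + one multiply
transfers_per_clock = 2        # double-data-rate memory
bus_width_bits_per_gpu = 448
num_gpus = 2

# Peak single-precision throughput in GFLOPS.
peak_gflops = cuda_cores * processor_clock_mhz * flops_per_core_per_clock / 1000

# Peak memory bandwidth in GB/s (1 GB = 1000 MB here).
bytes_per_transfer = bus_width_bits_per_gpu / 8
bandwidth_gb_s = (memory_clock_mhz * transfers_per_clock
                  * bytes_per_transfer * num_gpus) / 1000

print(f"Peak throughput: {peak_gflops:.2f} GFLOPS")
print(f"Peak bandwidth:  {bandwidth_gb_s:.1f} GB/s")
```

Both results are theoretical peaks; real applications reach only a fraction of them.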
CPU vs GPU
CPUs are optimized for high performance on sequential code:
transistors dedicated to data caching and flow control
GPUs use additional transistors directly for data processing
Book: “Programming Massively Parallel Processors: A Hands-on Approach”
GPU Solutions
NVIDIA
GeForce (gaming/movie playback)
Quadro (professional graphics)
Tesla (HPC)
AMD/ATI
Radeon (gaming/movie playback)
FireStream (HPC)
AMD FireStream 9170
Motivation
Cost/performance ratio
Power supply costs
Maintenance and operation costs
GPGPU
GPGPU stands for General-Purpose computation on GPUs
A technique that uses the GPU chip on the video card as a coprocessor
to accelerate operations that are normally executed on the CPU
How does GPGPU differ from general graphics operations?
GPGPU runs many kinds of algorithms on a GPU, not necessarily
image processing.
Examples: FFT, Monte Carlo, data sorting, data mining, and the list
continues
Until 2006, developers had to recast their problems as graphics
problems and solve them through a graphics API
Parallel Computing with GPU
NVIDIA GPU
11/2006: NVIDIA released the G80 architecture together with an
application development environment, CUDA
Allows developers to write GPGPU applications in high-level
programming languages
- Built from a scalable array of Streaming Multiprocessors (SMs)
- Each SM contains 8 SPs (Scalar Processors)
- Each SM can initialize, manage, and execute up to 768 threads
G80 Architecture
NVIDIA GPU
G80-based GPU
GeForce 8800 GT
14 SMs (112 cores)
DRAM: 512 MB
06/2008
GeForce GT200 series
30 SMs (240 cores)
DRAM: 1 GB
Tesla
30 SMs (240 cores)
DRAM: 4 GB
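The core counts listed here follow from the number of SMs multiplied by the 8 scalar processors per SM stated for G80. The short check below is a sketch under that assumption. It also assumes that the 30-SM parts use 8 SPs per SM, which is consistent with the 240-core figure but not stated on the slides. The function name `cuda_cores` is made up for this illustration.

```python
SPS_PER_SM = 8             # scalar processors per SM, as stated for G80
G80_THREADS_PER_SM = 768   # resident-thread limit per SM, as stated for G80

def cuda_cores(num_sms, sps_per_sm=SPS_PER_SM):
    """Total scalar processors ("cores") on a GPU with `num_sms` SMs."""
    return num_sms * sps_per_sm

print(cuda_cores(14))            # GeForce 8800 GT
print(cuda_cores(30))            # GT200 series and Tesla
# Upper bound on resident threads for the 14-SM part, using the G80 limit.
print(14 * G80_THREADS_PER_SM)
```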
Tesla Specification
Power consumption: 187 W!
GPU Computing with CUDA
CUDA: Compute Unified Device Architecture
Application development environment for
NVIDIA GPUs
Compiler, debugger, profiler, high-level
programming languages
Libraries (CUBLAS, CUFFT, ...) and code
samples
GPU Computing with CUDA
The GPU is viewed as a compute device that:
Is a coprocessor to the CPU or host
Has its own DRAM (device memory)
CUDA C is an extension of the C/C++ language
Data-parallel programming model
Executes thousands of threads in parallel on
GPUs
Synchronization is inexpensive
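To make the data-parallel model concrete, the sketch below emulates it sequentially in Python. It is not CUDA code and runs on the CPU. The helper names `launch` and `vector_add_kernel` are hypothetical. Only the index formula mirrors what a CUDA C kernel computes from `blockIdx.x`, `blockDim.x` and `threadIdx.x`.

```python
def vector_add_kernel(block_idx, block_dim, thread_idx, a, b, out):
    """Body executed by one logical thread: handle exactly one element."""
    # Global index, as in CUDA: blockIdx.x * blockDim.x + threadIdx.x
    i = block_idx * block_dim + thread_idx
    if i < len(out):          # guard: the last block may be only partly used
        out[i] = a[i] + b[i]

def launch(kernel, grid_dim, block_dim, *args):
    """Sequential stand-in for a kernel launch over grid_dim x block_dim threads."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)

n = 10
a = list(range(n))
b = [10 * x for x in a]
out = [0] * n

block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim   # enough blocks to cover n elements
launch(vector_add_kernel, grid_dim, block_dim, a, b, out)
print(out)
```

On a GPU the two loops in `launch` disappear: every (block, thread) pair runs the kernel body at the same time, which is why each thread has to derive its own element index.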
Hardware implementation
A set of SIMD multiprocessors with on-chip shared memory
Scalable Programming Models
Memory Model
There are 6 memory types:
• Registers
o on chip
o fast access
o per thread
o limited amount
• Local Memory
o in DRAM
o slow
o non-cached
o per thread
o relatively large
• Shared Memory
o on chip
o fast access
o per block
o 16 KByte
o synchronization between threads
• Global Memory
o in DRAM
o slow
o non-cached
o per grid
o communication between grids
• Constant Memory
o in DRAM
o cached
o per grid
o read-only
• Texture Memory
o in DRAM
o cached
o per grid
o read-only
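The role of per-block shared memory and thread synchronization can be illustrated with a block-wise sum. The following is a sequential Python emulation, not CUDA code. The function name `block_sums`, the tree-reduction scheme and the 4-byte element size are assumptions chosen for illustration. Only the 16 KByte limit comes from the slide.

```python
SHARED_MEM_BYTES = 16 * 1024   # per-block shared memory, from the slide
BYTES_PER_VALUE = 4            # assumption: 32-bit values

def block_sums(data, block_dim):
    """Sum `data` block by block, emulating a shared-memory tree reduction.

    `block_dim` must be a power of two.
    """
    assert block_dim * BYTES_PER_VALUE <= SHARED_MEM_BYTES
    partial = []
    for start in range(0, len(data), block_dim):
        # "Shared memory": one scratch buffer per block, zero-padded.
        shared = data[start:start + block_dim]
        shared += [0] * (block_dim - len(shared))
        stride = block_dim // 2
        while stride > 0:
            # One pass = threads 0..stride-1 working in parallel;
            # the end of the pass plays the role of a barrier.
            for tid in range(stride):
                shared[tid] += shared[tid + stride]
            stride //= 2
        partial.append(shared[0])   # one result per block
    return partial

data = list(range(1, 11))
partial = block_sums(data, block_dim=4)
print(partial, sum(partial))
```

Each block produces one partial sum using only its own fast scratch buffer; combining the partial sums would then go through global memory, since blocks cannot see each other's shared memory.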
Memory Model
• Registers
• Shared Memory
o on chip
• Local Memory
• Global Memory
• Constant Memory
• Texture Memory
o in Device Memory
Memory Model
• Global Memory
• Constant Memory
• Texture Memory
o managed by host code
o persistent across kernels
Heterogeneous Programming
GP-GPU Applications
http://www.nvidia.com/object/tesla_computing_solutions.html
Bioinformatics
Sequence alignment: finding the regions of greatest
similarity between sequences
Smith-Waterman: identifies the optimal local
alignment of sequences by scoring their similarity
with dynamic programming
Searching for and matching a new DNA sequence against
huge existing gene databases
BLAST http://blast.ncbi.nlm.nih.gov/Blast.cgi
FASTA http://www.ebi.ac.uk/Tools/sss/fasta/
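As an illustration of the dynamic programming behind Smith-Waterman, here is a minimal Python sketch that computes only the optimal local alignment score. The scoring values (`match`, `mismatch`, `gap`) and the linear gap penalty are assumptions chosen for the example; they are not taken from any of the tools listed above.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # gap in b
            left = H[i][j - 1] + gap    # gap in a
            # The 0 lets an alignment restart anywhere (local alignment).
            H[i][j] = max(0, diag, up, left)
            best = max(best, H[i][j])
    return best

print(smith_waterman_score("GGACGTCC", "TTACGTAA"))
```

Cells on the same anti-diagonal of `H` do not depend on each other, which is one common way to expose parallelism when such a table is filled on a GPU.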
Bioinformatics
CUDA-BLASTP: “CUDA-BLASTP is designed to accelerate NCBI BLASTP for
scanning protein sequence databases on GPUs, programmed using the CUDA
programming model”
CUDASW++: an implementation of the Smith-Waterman (SW) algorithm on NVIDIA GPUs
GPU HMMER: “implements methods using probabilistic models called profile
hidden Markov models on GPU”
Weather Forecasting
MM5/WRF models: numerical weather
prediction systems
Solve systems of equations with
thousands of variables in an acceptable time
Process a huge amount of data (parameters
such as temperature, humidity, wind speed, atmosphere,
…)
“characterize and model performance of the
kernels in terms of computational intensity, data
parallelism, memory bandwidth pressure, etc”
http://www.mmm.ucar.edu/wrf/WG2/GPU/
WRF Single Moment 5 Cloud Microphysics
Michalakes, J. and M. Vachharajani, “GPU Acceleration of Numerical Weather
Prediction”, Parallel Processing Letters Vol. 18 No. 4. World Scientific. Dec. 2008. pp.
531-548
Cryptanalysis
MD5 code breaking using GPU
MD5 is a one-way hash function
Inverse problem
Input: an MD5 hash
Output: the original password
Brute-force attack in 2 steps:
Step 1: Construct the password search space
Step 2: Compute the MD5 hash of every candidate password
on GPUs and compare it with the input hash
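The two steps can be sketched on the CPU with Python's standard `hashlib` module. This is a sequential illustration only. The function name `crack_md5`, the lowercase alphabet and the length limit are assumptions, and a GPU implementation would instead hash many candidates at once, one per thread.

```python
import hashlib
import string
from itertools import product

def crack_md5(target_hex, alphabet=string.ascii_lowercase, max_len=3):
    """Return the password whose MD5 hex digest equals target_hex, or None."""
    # Step 1: enumerate the search space (all strings up to max_len).
    for length in range(1, max_len + 1):
        for chars in product(alphabet, repeat=length):
            candidate = "".join(chars)
            # Step 2: hash the candidate and compare with the target.
            if hashlib.md5(candidate.encode()).hexdigest() == target_hex:
                return candidate
    return None

target = hashlib.md5(b"gpu").hexdigest()
print(crack_md5(target))
```

Every candidate is hashed independently of the others, so the search space splits cleanly across thousands of GPU threads.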
MD5 Bruteforce Benchmarks
World's fastest MD5 cracker: BarsWF
http://3.14.by/en/read/md5_benchmark
Seismic Exploration
“the cost of exploration and drilling deep wells can
reach hundreds of millions of dollars, and there’s
often only one chance to do it successfully”
SeismicCity
Uses the most advanced depth imaging technologies
Uses the Tesla 1U system
20x speedup compared with the previous CPU configuration
http://www.nvidia.com/object/seismiccity.html
http://www.seismiccity.com/
Gaming/Entertainment
Two main methods in 3D rendering
Rasterization (supported by GPU, fast)
Raytracing (intensive computation but high-quality image)
A scene with 15 cars, rendered on
an Apple G5 computer with two 2 GHz
PowerPC processors and 2 GB of memory,
takes 15 hours! (2006)
Per H. Christensen, Julian Fong, David M. Laur and Dana Batali.
Ray Tracing for the Movie 'Cars'. Proceedings of the IEEE
Symposium on Interactive Ray Tracing 2006, p. 1-6
Solutions: NVIDIA OptiX
Other Applications
Web Ranking on GPU
PageRank
HITS
TrustRank
Search Results depend on two scores:
Content score: the relevance of the page content
to the search keywords
Popularity score: determined by analysis of the
web’s hyperlink structure
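A popularity score such as PageRank can be computed from the link structure by power iteration. The sketch below is a small pure-Python version on a toy graph. The function name `pagerank`, the damping parameter `alpha`, the dangling-page handling and the example graph are all assumptions for illustration, not the implementation evaluated in these slides.

```python
def pagerank(links, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power-iteration PageRank.

    links: dict mapping each page to the list of pages it links to.
    Every link target must also be a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Rank held by pages without outgoing links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links[p])
        new_rank = {p: (1 - alpha) / n + alpha * dangling / n for p in pages}
        for p in pages:
            for q in links[p]:
                new_rank[q] += alpha * rank[p] / len(links[p])
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:
            break
    return rank

toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 4))
```

Each iteration is essentially a sparse matrix-vector product, which is the part that maps naturally onto many GPU threads.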
Web Ranking Problems
The web is huge
Very large data size (millions to billions
of web pages)
The web is dynamic
Web pages change constantly (size and structure)
Requires continuous computation within a
short time
Requires huge computing performance
Google’s PageRank on GPU
Compared with a quad-core CPU
implementation, the speedup reaches 21-22x
Figure: PageRank computation time (s) versus alpha, CPU vs GPU
alpha    CPU (s)    GPU (s)
0.8      1195       55
0.85     1737       79
0.9      2532       116
0.95     4656       214
Applying GP-GPU technology in PageRank Computation, MSc thesis,
Pham Nguyen Quang Anh, HUST, 2010
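The quoted 21-22x speedup can be checked against the timings in the chart. The pairing of values with alpha below is an assumption: it takes the larger number in each pair as the CPU time and orders the pairs by increasing alpha, which is the only reading of the chart that matches the stated speedup.

```python
alphas = [0.80, 0.85, 0.90, 0.95]
cpu_seconds = [1195, 1737, 2532, 4656]   # assumed CPU series
gpu_seconds = [55, 79, 116, 214]         # assumed GPU series

# Speedup = CPU time / GPU time for the same alpha.
speedups = [cpu / gpu for cpu, gpu in zip(cpu_seconds, gpu_seconds)]
for alpha, s in zip(alphas, speedups):
    print(f"alpha={alpha:.2f}  speedup={s:.1f}x")
```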
Other Applications
All-Pairs N-Body Simulation:
approximates the evolution of a system of bodies in which
each body continuously interacts with every other body
On GeForce 8800 GTX GPU
http://http.developer.nvidia.com/GPUGems3/gpugems3_ch31.html
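The all-pairs approach computes, for every body, the sum of the gravitational pull of every other body, which is O(N²) work per time step. Below is a minimal pure-Python sketch of that inner computation. The function name `accelerations`, the unit gravitational constant and the `softening` term (used to avoid division by zero when bodies come close) are assumptions chosen for illustration.

```python
import math

def accelerations(positions, masses, G=1.0, softening=1e-3):
    """Acceleration of each body due to all other bodies (3D, all pairs)."""
    n = len(positions)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):              # on a GPU: one thread per body i
        for j in range(n):
            if i == j:
                continue
            # Vector from body i to body j.
            d = [positions[j][k] - positions[i][k] for k in range(3)]
            dist_sq = sum(c * c for c in d) + softening ** 2
            inv_dist_cube = 1.0 / (dist_sq * math.sqrt(dist_sq))
            for k in range(3):
                acc[i][k] += G * masses[j] * d[k] * inv_dist_cube
    return acc

pos = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
acc = accelerations(pos, masses=[1.0, 1.0], softening=0.0)
print(acc)
```

The outer loop iterations are independent of each other, so each body can be assigned to its own thread while the inner loop streams over all other bodies.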
Supercomputers
http://www.top500.org/
The first supercomputer using GPUs
2009, Tsubame, Japan:
170 x Tesla 1U (680 GPUs), 77.48 TFLOPS
Set up in one week!
Ranked 29th in the TOP500
2010: 11 of the TOP500 supercomputers were equipped with GPUs
2011: 37 of the TOP500 supercomputers use GPUs
Tianhe-1A, China
2nd in the TOP500, 2.566 petaFLOPS
Uses 7,168 NVIDIA GPUs and 14,336 Intel CPUs
Summary
GPU computing solutions are very effective
They provide both hardware and software
Very cost-effective compared to
CPU and grid/cluster solutions
Trends
More cores on chip
Better support for floating point
More flexible configuration and control/data flow
Lower prices
Support for higher-level programming languages
THANK YOU