
PP - CUDA

OCT. 2024
Vinay T R
Assistant Professor, MSRIT

PARALLEL PROGRAMMING
WITH CUDA - AD72
Alternative Names: High Performance Computing
Course Content

Unit I
A Short History of Supercomputing: Von Neumann architecture, Cray, multinode computing, NVIDIA and CUDA, alternatives to CUDA, types of parallelism.
Pedagogy / Course delivery tools: Chalk and talk, Power Point Presentation, Videos.
Links: https://onlinecourses.nptel.ac.in/noc20_cs92/preview

Unit II
GPUs and the History of GPU Computing: Flynn's Taxonomy, Some Common Parallel Patterns, Reduced Instruction Set Computers, Multiple-Core Processors, Vector Processors, Limits to Parallelizability, Amdahl's Law on Parallelism.
Pedagogy/Course delivery tools: Chalk and talk, Power Point Presentation, Videos.
Links: https://onlinecourses.nptel.ac.in/noc20_cs92/preview

Unit III
Introduction: GPUs as Parallel Computers, Architecture of a Model GPU, Why More Speed or Parallelism?, GPU Computing. Introduction to CUDA: Data Parallelism, CUDA Program Structure, A Vector Addition Kernel, Device Global Memory and Data Transfer, Kernel Functions and Threading.
Pedagogy/Course delivery tools: Chalk and talk, Power Point Presentation, Videos.
Links: https://onlinecourses.nptel.ac.in/noc20_cs92/preview

Unit IV
CUDA Threads: CUDA Thread Organization, Mapping Threads To Multidimensional Data, Synchronization and Transparent Scalability, Assigning Resources to Blocks, Thread Scheduling and
Latency Tolerance.
Pedagogy/Course delivery tools: Chalk and talk, Power Point Presentation, Videos.
Links: https://www.youtube.com/watch?v=xDtitNlLByQ

Unit V
Implementation of algorithms in CUDA: Matrix-Matrix Multiplication, a program to implement sorting using CUDA, a program for histogram calculation using CUDA, a program to create threads using the default stream in CUDA, and CUDA for Deep Learning - A Case Study.
Pedagogy/Course delivery tools: Chalk and talk, Power Point Presentation, Videos.
Links: https://www.youtube.com/watch?v=IiKhXC6NFDg
Laboratory Session:
1. OpenMP parallel programs using the #pragma directive in C.
2. OpenMP parallel programs using the #pragma directive with work-sharing constructs in C.
3. OpenMP programs using constructs such as omp for, omp sections, and omp single.
4. OpenMP programs on parallel constructs.
5. OpenMP programs on the task construct.
6. OpenMP programs using threadprivate directives.
7. OpenMP programs using threadprivate directives.
8. OpenMP programs on thread scheduling.
9. OpenMP programs using lastprivate, reduction, copyin, and shared clauses.
10. Programs using point-to-point MPI calls.
11. Programs using message-passing MPI calls.
12. CUDA programs on message passing.
13. CUDA programs on broadcasting.
14. Graph processing with GPU.
Suggested Learning Resources
Text Book:
Ananth Grama, Introduction to Parallel Computing, 2nd edition, Pearson Education, 2003.
Shane Cook, CUDA Programming: A Developer's Guide to Parallel Computing with GPUs, Morgan Kaufmann, 2013, ISBN: 978-0-12-415933-4.
Reference:
Tolga Soyata, GPU Parallel Program Development Using CUDA, CRC Press, 2018.
What Is Parallel Computing?

• Serial Computing
• Traditionally, software has been written for serial computation:
• A problem is broken into a discrete series of instructions
• Instructions are executed sequentially one after another
• Executed on a single processor
• Only one instruction may execute at any moment in time
Parallel Computing
• In the simplest sense, parallel computing is the simultaneous use of
multiple compute resources to solve a computational problem:
• A problem is broken into discrete parts that can be solved
concurrently
• Each part is further broken down to a series of instructions
• Instructions from each part execute simultaneously on different
processors
• An overall control/coordination mechanism is employed
•The computational problem should be able to:
• Be broken apart into discrete pieces of work that can be solved simultaneously;
• Execute multiple program instructions at any moment in time;
• Be solved in less time with multiple compute resources than with a single compute
resource.
•The compute resources are typically:
• A single computer with multiple processors/cores
• An arbitrary number of such computers connected by a network
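As a concrete preview (in CUDA, the programming model used throughout this course), the sketch below breaks one problem - doubling every element of an array - into discrete pieces that GPU threads work on simultaneously. This is a minimal, illustrative sketch only: it assumes a CUDA-capable GPU and the nvcc toolchain, and names such as scale and N are chosen for the example rather than taken from the course material.

#include <cstdio>
#include <cuda_runtime.h>

// Each thread handles one discrete piece of the overall problem: one element.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's element
    if (i < n)                                       // guard the last, partial block
        data[i] *= factor;
}

int main() {
    const int N = 1 << 20;                           // about one million elements
    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));
    cudaMemset(d_data, 0, N * sizeof(float));

    // Decompose the problem into blocks of 256 threads; the hardware runs
    // many blocks concurrently on different multiprocessors.
    int threads = 256;
    int blocks  = (N + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_data, 2.0f, N);
    cudaDeviceSynchronize();                         // overall coordination point

    printf("launched %d blocks of %d threads\n", blocks, threads);
    cudaFree(d_data);
    return 0;
}

Compiled with nvcc, each of the roughly 4,096 blocks is an independent piece of work, which is exactly the decomposition described above.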
Parallel Computers
•Virtually all stand-alone computers today are parallel from a hardware perspective:
• Multiple functional units (L1 cache, L2 cache, branch, prefetch, decode,
floating-point, graphics processing (GPU), integer, etc.)
• Multiple execution units/cores
• Multiple hardware threads
Why Use Parallel Computing?
The Real World Is Massively Complex
•In the natural world, many complex, interrelated events are happening at the same time, yet within a
temporal sequence.
•Compared to serial computing, parallel computing is much better suited for modeling, simulating and
understanding complex, real world phenomena.
• For example, imagine modeling these serially.
Example: the grid for a numerical weather model of the Earth.
Main Reasons for Using Parallel Programming
i. SAVE TIME AND/OR MONEY
• In theory, throwing more resources at a task will shorten its time to
completion, with potential cost savings.
• Parallel computers can be built from cheap, commodity components.
ii. SOLVE LARGER / MORE COMPLEX PROBLEMS
• Many problems are so large and/or complex that it is impractical or
impossible to solve them using a serial program, especially given
limited computer memory.
• Example: "Grand Challenge Problems"
(en.wikipedia.org/wiki/Grand_Challenge) requiring petaflops and
petabytes of computing resources.
• Example: Web search engines/databases processing millions of
transactions every second
iii. PROVIDE CONCURRENCY
• A single compute resource can only do one thing at a time. Multiple
compute resources can do many things simultaneously.
• Example: Collaborative Networks provide a global venue where
people from around the world can meet and conduct work "virtually."
iv. MAKE BETTER USE OF UNDERLYING PARALLEL HARDWARE
• Modern computers, even laptops, are parallel in architecture with
multiple processors/cores.
• Parallel software is specifically intended for parallel hardware with
multiple cores, threads, etc.
• In most cases, serial programs run on modern computers "waste"
potential computing power.
The Future
• During the past 20+ years, the trends indicated by ever faster
networks, distributed systems, and multi-processor computer
architectures (even at the desktop level) clearly show that parallelism
is the future of computing.
• In this same time period, there has been a greater
than 500,000x increase in supercomputer performance, with no end
currently in sight.
• The race is already on for Exascale Computing - we are entering
Exascale era
• Exaflop = 10^18 calculations per second
• US DOE Exascale Computing Project: https://www.exascaleproject.org
Who Is Using Parallel Computing?
i. Science and Engineering
• Historically, parallel computing has been considered to be "the high end of computing,"
and has been used to model difficult problems in many areas of science and engineering:
• Atmosphere, Earth, Environment
• Physics - applied, nuclear, particle, condensed matter, high pressure, fusion, photonics
• Bioscience, Biotechnology, Genetics
• Chemistry, Molecular Sciences
• Geology, Seismology
• Mechanical Engineering - from prosthetics to spacecraft
• Electrical Engineering, Circuit Design, Microelectronics
• Computer Science, Mathematics
• Defense, Weapons
ii. Industrial and Commercial
• Today, commercial applications provide an equal or greater driving force in the development of
faster computers. These applications require the processing of large amounts of data in
sophisticated ways. For example:
• "Big Data," databases, data mining
• Artificial Intelligence (AI)
• Oil exploration
• Web search engines, web based business services
• Medical imaging and diagnosis
• Pharmaceutical design
• Financial and economic modeling
• Management of national and multi-national corporations
• Advanced graphics and virtual reality, particularly in the entertainment industry
• Networked video and multi-media technologies
• Collaborative work environments
Global Applications
• Parallel computing is now being used extensively around the world, in
a wide variety of applications.
Need for Speed?
History of Supercomputing
Unit 1: A Short History of Supercomputing

• Definition: Supercomputers are extremely powerful computers used for complex calculations, simulations, and data
processing.
• Purpose: They tackle problems in fields like climate modeling, physics simulations, and large-scale data analysis.
• Early Beginnings (1950s-1960s)
Key Machine: ENIAC (1945) - One of the first general-purpose computers.
Transition: Introduction of transistor technology and more efficient designs (e.g., CDC 6600 in 1964).
• The Rise of Vector Processing (1970s-1980s)
Vector Processors: Designed for handling vector calculations (e.g., Cray-1).
Impact: Revolutionized scientific computing with faster processing capabilities.
• Massively Parallel Processing (1990s)
Trend: Shift to parallel architectures (e.g., IBM Blue Gene).
Advancements: Enabled solving larger problems more efficiently.
• The Petascale Era (2000s)
Milestone: The first petascale supercomputer (e.g., IBM Roadrunner).
Performance: Capable of over a quadrillion calculations per second.

• The Exascale Revolution (2010s-Present)
Goal: Development of exascale computers (1 exaflop = 1 quintillion, i.e. 10^18, calculations per second).
Projects: U.S. and global initiatives for exascale computing (e.g., Frontier).

• Applications of Supercomputing
Fields: Weather forecasting, genomics, materials science, artificial intelligence.
Impact: Enhancing research capabilities and innovation.

• Future of Supercomputing
Trends: Quantum computing, increased energy efficiency, and AI integration.
Outlook: Continuous advancements in processing power and capabilities.
Von Neumann Architecture

• The Von Neumann architecture is a foundational concept in computer design that describes a system
architecture for electronic computers.
• It was proposed by mathematician and physicist John von Neumann in the 1940s.
Key Components of Von Neumann Architecture
1. Central Processing Unit (CPU):
Control Unit (CU): Directs the operation of the processor, managing the execution of instructions by
fetching them from memory, decoding them, and executing them.
Arithmetic Logic Unit (ALU): Performs arithmetic and logical operations (e.g., addition, subtraction,
comparisons).

2. Memory: Stores data and instructions. Memory is typically divided into:
Main Memory (RAM): Where data and instructions currently being used are stored.
Secondary Storage: For long-term storage (e.g., hard drives, SSDs).

3. Input/Output (I/O) Devices: Interfaces through which the computer interacts with the external environment,
such as keyboards, mice, printers, and monitors.
Von Neumann Architecture

• Advantages:
• Simplicity
• Flexibility
• Cost-Effective
• Lower Power Consumption

• Disadvantages:
• Bottleneck
• Sequential Processing Limitation

• Applications:
• Embedded Systems
• Personal Computers
• Simple Servers
Understanding Von Neumann Architecture
• The Von Neumann architecture outlines a system where a computer’s data
and program instructions share the same memory space. It comprises
several key components:
• Processing Unit: Includes an Arithmetic Logic Unit (ALU) and processor
registers, which perform calculations and hold data temporarily.
• Control Unit: Contains an instruction register and a program counter,
directing the processing unit on which instructions to execute.
• Memory: Stores both data and instructions, making them accessible to the
processing unit.
• External Storage: Provides long-term storage for data and programs.
• Input and Output Mechanisms: Facilitate communication between the
computer and the outside world.
• This design allows for flexibility and simplicity, as the same memory system
can store instructions and data.
• The phrase 'one at a time' means that the von Neumann architecture is a sequential processing machine: it fetches, decodes, and executes one instruction at a time, as sketched below.
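To make the stored-program, one-instruction-at-a-time idea concrete, here is a small illustrative sketch (plain C host code, compilable with gcc or nvcc): a toy machine whose single memory array holds both the program and its data, driven by a strictly sequential fetch-decode-execute loop. The instruction set and encoding are invented purely for this example and are not taken from the course material.

#include <stdio.h>

/* Toy von Neumann machine: program and data share one memory array. */
enum { HALT = 0, LOAD = 1, ADD = 2, STORE = 3 };

int main(void) {
    int memory[16] = {
        /* program: */ LOAD, 10, ADD, 11, STORE, 12, HALT, 0, 0, 0,
        /* data:    */ 7, 35, 0, 0, 0, 0
    };
    int pc  = 0;   /* program counter (control unit)     */
    int acc = 0;   /* accumulator register (ALU operand) */

    for (;;) {                       /* sequential fetch-decode-execute */
        int op = memory[pc++];       /* fetch the opcode                */
        if (op == HALT) break;
        int arg = memory[pc++];      /* fetch the operand address       */
        switch (op) {                /* decode and execute              */
            case LOAD:  acc = memory[arg];       break;
            case ADD:   acc = acc + memory[arg]; break;  /* ALU */
            case STORE: memory[arg] = acc;       break;
        }
    }
    printf("memory[12] = %d\n", memory[12]);  /* prints 42 */
    return 0;
}

Because there is a single memory and a single program counter, only one instruction can be fetched and executed at a time - the sequential bottleneck that parallel architectures try to overcome.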
CRAY
Introduction to CRAY Supercomputer

Cray supercomputers are renowned for their exceptional processing power and efficiency, making them pivotal in the field of high-performance computing (HPC). Founded by Seymour Cray in 1972 as Cray Research (later Cray Inc.), the company has consistently pushed the boundaries of what is possible in computing, catering to diverse sectors such as scientific research, engineering, and data analysis.
Key Features:
• High Performance: Cray systems are designed to perform complex calculations at unprecedented
speeds, often ranking among the top supercomputers globally.

• Advanced Architecture: They utilize innovative architectures, including multi-core processors and
advanced interconnects, enabling efficient parallel and vector processing.

• Scalability: Cray supercomputers can be easily scaled to handle large datasets and intensive
computations, making them suitable for both small projects and large-scale research initiatives.
Cray Supercomputer
CRAY 1

The Cray-1 is a historic milestone in computing, recognized as the first successful supercomputer. Introduced in
1976 by Cray Research and designed by Seymour Cray, it revolutionized the field of high-performance
computing (HPC) and set a new standard for scientific calculations.
Key Features:
• Vector Processing:
• The Cray-1 employed vector processing, allowing it to perform multiple calculations on large datasets
simultaneously. This significantly improved its efficiency for mathematical computations, especially in
scientific applications.
• Innovative Design:
• Its distinctive C-shaped architecture was not only visually striking but also optimized for cooling and
space, making it compact compared to other systems of the era.
• Impressive Performance:
• Capable of achieving speeds of up to 80 megaflops (million floating-point operations per second), the
Cray-1 was unparalleled in performance at the time, making it a preferred choice for complex
simulations.
• Advanced Memory System:
• It featured a large, fast memory architecture that utilized integrated circuits, allowing for rapid data
access and processing, which was essential for handling intensive computational tasks.
• Impact and Legacy:
• The Cray-1 quickly became indispensable in various fields, including meteorology, physics, and
molecular biology, enabling breakthroughs in research that required substantial computational power.
• Its success established Cray Research as a leader in the supercomputing market and paved the way for
subsequent generations of supercomputers.

Conclusion:
The Cray-1 not only transformed how calculations were performed but also laid the groundwork for future
advancements in supercomputing. Its innovative design and capabilities have left a lasting legacy in the world
of high-performance computing, influencing both hardware development and computational techniques in
scientific research.
CRAY 2

• The Cray-2, introduced in 1985 by Cray Research, is regarded as one of the most advanced supercomputers
of its time. Building on the success of the Cray-1, the Cray-2 featured several innovations that further
enhanced its performance and efficiency, making it a critical tool for scientific research and industrial
applications.
Key Features:
• Vector Processing:
• Like its predecessor, the Cray-2 utilized vector processing technology, allowing it to perform multiple
calculations simultaneously. This capability was essential for tasks involving large datasets and complex
mathematical computations.
• Cooling Technology:
• The Cray-2 introduced an innovative liquid immersion cooling system, where the entire computer was
submerged in a special coolant. This design not only improved cooling efficiency but also allowed for a
more compact configuration.
• Impressive Performance:
• The Cray-2 achieved performance levels of up to 1.9 gigaflops (billion floating-point operations per
second), making it the fastest supercomputer in the world at the time of its release.
• Advanced Memory Architecture:
• It featured a sophisticated memory system, with up to 256 megabytes of fast memory, allowing for rapid
data access and processing, crucial for high-demand applications.
• Impact and Applications:
• The Cray-2 was widely used in various scientific and engineering fields, including climate modeling,
computational fluid dynamics, and structural analysis. Its power enabled researchers to conduct complex
simulations and analyses that were previously impossible.
• Organizations such as NASA and various national laboratories relied on the Cray-2 for high-stakes
computations, solidifying its reputation as a vital tool for cutting-edge research.
• Legacy:
• The Cray-2's innovative technologies and design set new benchmarks in the field of supercomputing and
influenced future generations of computers.
• Its advancements in cooling technology and memory architecture laid the groundwork for subsequent
supercomputers, enhancing performance and energy efficiency.
• Conclusion:
The Cray-2 represented a significant leap in supercomputing capabilities, combining cutting-edge technology
with powerful performance. Its contributions to scientific research and engineering have left a lasting impact
on the field, ensuring its place in the history of high-performance computing.
Cray Uses
• Scientific Research: Weather and Climate Modeling, Astrophysics
• Engineering and Design: Computational Fluid Dynamics (CFD), Structural
Analysis
• Pharmaceuticals and Healthcare: Drug Discovery
• Financial Services: Risk Analysis and Management, High-Frequency Trading
• Energy Sector: Oil and Gas Exploration, Renewable Energy Research
• Artificial Intelligence and Machine Learning: Data Processing, Predictive
Analytics.
Multinode computing
Moore's Law

• Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
Moore's Law holds also for performance and capacity

                              1945 (ENIAC)     2002 (Laptop)
Vacuum tubes / transistors    18,000           6,000,000,000
Weight (kg)                   27,200           0.9
Size (m^3)                    68               0.0028
Power (watts)                 20,000           60
Cost ($)                      4,630,000        1,000
Memory (bytes)                200              1,073,741,824
Performance (Flop/s)          800              5,000,000,000
Memory Hierarchy
(Figure: the levels of the memory hierarchy; detailed in "An Example Memory Hierarchy" below.)
Processor-Memory Problem
• Processors issue instructions roughly every nanosecond
• DRAM can be accessed roughly every 100 nanoseconds
• The gap is growing:
• processors are getting faster by about 60% per year
• DRAM is getting faster by about 7% per year
CPU Clock Rates

                   1980     1985    1990    1995       2000     2000:1980
processor          8080     286     386     Pentium    P-III
clock rate (MHz)   1        6       20      150        750      750
cycle time (ns)    1,000    166     50      6          1.6      750

More recent processors: AMD Ryzen (12 cores) at about 4.8 GHz; Intel i11 (8 cores) at about 5.3 GHz.
The CPU-Memory Gap
The increasing gap between DRAM, disk, and CPU speeds.
(Figure: disk seek time, DRAM access time, SRAM access time, and CPU cycle time, in ns, plotted for 1980-2000.)
An Example Memory Hierarchy
Storage devices get smaller, faster, and costlier (per byte) toward the top of the hierarchy, and larger, slower, and cheaper (per byte) toward the bottom:
• L0: CPU registers - hold words retrieved from the L1 cache
• L1: on-chip L1 cache (SRAM) - holds cache lines retrieved from the L2 cache
• L2: off-chip L2 cache (SRAM) - holds cache lines retrieved from main memory
• L3: main memory (DRAM) - holds disk blocks retrieved from local disks
• L4: local secondary storage (local disks) - holds files retrieved from disks on remote network servers
• L5: remote secondary storage (distributed file systems, Web servers)
How fast can a serial computer be?
• Consider a 1 Tflop/s sequential machine:
• data must travel some distance, r, to get from memory to the CPU
• to deliver one data element per cycle, the data must make that trip 10^12 times per second, and nothing travels faster than light, c = 3 x 10^8 m/s
• so r < c / 10^12 = 3 x 10^8 / 10^12 m = 0.3 mm
• Now put 1 TB of storage in a 0.3 mm x 0.3 mm area:
• each bit gets an area of roughly 1 square Angstrom - the size of a small atom
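A quick back-of-the-envelope check of this argument, as a small C sketch (the constants are just the numbers quoted above):

#include <stdio.h>

int main(void) {
    const double c    = 3.0e8;    /* speed of light, m/s                   */
    const double rate = 1.0e12;   /* 1 Tflop/s: one memory trip per cycle  */

    double r = c / rate;          /* farthest the data can sit, in metres  */
    printf("r = %.1f mm\n", r * 1e3);                 /* 0.3 mm            */

    double area    = r * r;       /* chip area, m^2 (0.3 mm x 0.3 mm)      */
    double bits    = 8.0e12;      /* 1 TB expressed in bits                */
    double per_bit = area / bits; /* m^2 available per bit                 */
    printf("area per bit = %.1f square Angstrom\n", per_bit / 1e-20);
    return 0;
}

The program prints r = 0.3 mm and roughly 1.1 square Angstroms per bit, matching the estimate above.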
Logical Inference

So, we need Parallel Computing!

GPUs
• The graphics processing unit, or GPU, has become one of the most
important types of computing technology, both for personal and
business computing.
• Designed for parallel processing, the GPU is used in a wide range of
applications, including graphics and video rendering.
• Although they’re best known for their capabilities in gaming, GPUs
are becoming more popular for use in creative production and
artificial intelligence (AI).
• GPU acceleration, or graphics processing unit acceleration, is a
computing technique that uses the enormous power of graphics
processing units to dramatically increase the performance of
applications.
• This technique uses the parallel processing capabilities of GPUs,
allowing you to handle more tasks simultaneously, leading to huge
improvements in computational speeds and efficiency.
NVIDIA and CUDA
• NVIDIA GPUs are programmed with CUDA (Compute Unified Device Architecture), NVIDIA's parallel computing platform and programming model, which extends C/C++ with a small set of keywords for writing kernels that run on the GPU.
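Unit III develops the vector addition kernel in detail; as a preview, here is a minimal, hedged sketch of a complete CUDA program (it assumes an installed CUDA toolkit; names such as vecAdd, h_a and d_a are illustrative conventions, not fixed by the course material):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Kernel: each CUDA thread adds one pair of elements.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 16;
    size_t bytes = n * sizeof(float);

    // Host (CPU) arrays
    float *h_a = (float*)malloc(bytes), *h_b = (float*)malloc(bytes), *h_c = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    // Device (GPU) global memory and host-to-device transfers
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements
    int threads = 256, blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // Copy the result back and check one element
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[100] = %.1f\n", h_c[100]);   // expect 300.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}

The same structure - allocate on the device, copy in, launch a kernel over many threads, copy back - reappears in every CUDA program covered in Units III-V.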
Alternatives to CUDA
• CUDA is a powerful technology that lets you squeeze every bit of performance out of your NVIDIA GPU.
• However, it works only with NVIDIA hardware, and porting existing CUDA code to other platforms is not easy.
• So you may look for an alternative to CUDA.
What are the alternatives to CUDA?
• OpenCL: An open standard for parallel programming across CPUs,
GPUs, and other processors with some performance overhead
compared to CUDA.
• AMD ROCm: An open-source GPU computing platform developed by
AMD that allows the porting of CUDA code to AMD GPUs.
• SYCL: A higher-level programming model based on C++ for
heterogeneous processors enabling code portability across CUDA and
OpenCL through Intel’s DPC++ and hipSYCL.
• Vulkan Compute: It is a compute API of the Vulkan graphics
framework, enabling GPU computing on a wide range of GPUs with
lower-level control.
• Intel oneAPI: It is a cross-architecture programming model from Intel,
including a DPC++ compiler for SYCL, offering an alternative to CUDA
for Intel GPUs.

• OpenMP: An API for parallel programming on CPUs and GPUs. It uses compiler directives, and recent versions support GPU offloading as an alternative to CUDA.
Types of Parallelism
Case Study: Parallelism (how do you allocate tasks to processors?)
1. Temporal Parallelism (pipelining)
Appropriateness and challenges (a CUDA illustration follows this list):
1. Synchronization
2. Bubbles in the pipeline (e.g., if a student has answered only 2 of the questions)
3. Fault tolerance (e.g., a teacher takes a coffee break)
4. Inter-task communication
5. Scalability (the number of teachers cannot be increased arbitrarily)
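In CUDA terms, temporal (pipeline) parallelism shows up when data transfers for one chunk overlap with computation on another, using streams (streams appear again in Unit V). The sketch below is illustrative only: it assumes pinned host memory, a device that supports copy/compute overlap, and invented names such as process and chunkN.

#include <cuda_runtime.h>

// Trivial per-chunk computation used only to illustrate the pipeline.
__global__ void process(float *chunk, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) chunk[i] += 1.0f;
}

int main() {
    const int chunks = 4, chunkN = 1 << 20;
    size_t bytes = chunkN * sizeof(float);

    float *h_data, *d_data;
    cudaMallocHost((void**)&h_data, chunks * bytes);   // pinned memory enables async copies
    cudaMalloc((void**)&d_data, chunks * bytes);
    for (int i = 0; i < chunks * chunkN; ++i) h_data[i] = 0.0f;

    cudaStream_t stream[chunks];
    for (int s = 0; s < chunks; ++s) cudaStreamCreate(&stream[s]);

    // Pipeline: while chunk k is being computed, chunk k+1 can be copied in.
    for (int k = 0; k < chunks; ++k) {
        float *h = h_data + k * chunkN, *d = d_data + k * chunkN;
        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, stream[k]);
        process<<<(chunkN + 255) / 256, 256, 0, stream[k]>>>(d, chunkN);
        cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, stream[k]);
    }
    cudaDeviceSynchronize();                           // drain the pipeline

    for (int s = 0; s < chunks; ++s) cudaStreamDestroy(stream[s]);
    cudaFreeHost(h_data); cudaFree(d_data);
    return 0;
}

Just as in the teacher analogy, an empty stage (a "bubble") appears whenever one stage has nothing to work on while the others are busy.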
2. Data Parallelism
Advantages and disadvantages (a CUDA illustration follows this list):
1. The assignment of jobs to teachers is decided in advance. This is called static assignment; because jobs take different amounts of time, completion time varies from teacher to teacher.
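On a GPU, data parallelism with a static assignment looks like the sketch below: the mapping of array elements to threads is fixed when the kernel is launched (here via a grid-stride loop), no matter how long each element takes to process. This is an illustrative sketch only; saxpy and the launch configuration are conventional examples, not prescribed by the course material.

#include <cstdio>
#include <cuda_runtime.h>

// Static data-parallel assignment: thread t always processes elements
// t, t + stride, t + 2*stride, ... - the schedule is fixed at launch time.
__global__ void saxpy(float a, const float *x, float *y, int n) {
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 22;
    float *x, *y;
    cudaMallocManaged((void**)&x, n * sizeof(float));  // unified memory, for brevity
    cudaMallocManaged((void**)&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy<<<128, 256>>>(3.0f, x, y, n);                // 128*256 threads share n elements
    cudaDeviceSynchronize();

    printf("y[0] = %.1f\n", y[0]);                     // expect 5.0
    cudaFree(x); cudaFree(y);
    return 0;
}

As in the teacher analogy, each thread's share of the work is decided up front; if some elements were more expensive than others, some threads would finish early while others kept working.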
Questionnaires
1. What are the different methods of increasing the speed of computers?
2. List the advantages and disadvantages of using parallel computers.
3. Are parallel computers more reliable than serial computers? If yes, explain why.
4. How do parallel computers reduce rounding errors in solving numerically intensive problems?
