
PP - CUDA

OCT. 2024
Vinay T R
Assistant Professor, MSRIT

PARALLEL PROGRAMMING
WITH CUDA - AD72
Alternative Names: High Performance Computing
Course Content

Unit I
A Short History of Supercomputing: Von Neumann architecture, Cray, multinode computing, NVIDIA and CUDA, alternatives to CUDA, types of parallelism.
Pedagogy / Course delivery tools: Chalk and talk, Power Point Presentation, Videos.
Links: https://onlinecourses.nptel.ac.in/noc20_cs92/preview

Unit II
GPUs and the History of GPU Computing: Flynn's Taxonomy, Some Common Parallel Patterns, Reduced Instruction Set Computers, Multiple-Core Processors, Vector Processors, Limits to Parallelizability, Amdahl's Law on Parallelism.
Pedagogy/Course delivery tools: Chalk and talk, Power Point Presentation, Videos.
Links: https://onlinecourses.nptel.ac.in/noc20_cs92/preview

Unit III
Introduction: GPUs as Parallel Computers, Architecture of a Model GPU, Why More Speed or Parallelism?, GPU Computing. Introduction to CUDA: Data Parallelism, CUDA Program Structure, A Vector Addition Kernel, Device Global Memory and Data Transfer, Kernel Functions and Threading.
Pedagogy/Course delivery tools: Chalk and talk, Power Point Presentation, Videos.
Links: https://onlinecourses.nptel.ac.in/noc20_cs92/preview

Unit IV
CUDA Threads: CUDA Thread Organization, Mapping Threads To Multidimensional Data, Synchronization and Transparent Scalability, Assigning Resources to Blocks, Thread Scheduling and
Latency Tolerance.
Pedagogy/Course delivery tools: Chalk and talk, Power Point Presentation, Videos.
Links: https://www.youtube.com/watch?v=xDtitNlLByQ

Unit V
Implementation of algorithms in CUDA: Matrix-Matrix Multiplication, a program to implement sorting using CUDA, a program for histogram calculation using CUDA, a program to create threads using the default stream in CUDA, and CUDA for Deep Learning - A Case Study.
Pedagogy/Course delivery tools: Chalk and talk, Power Point Presentation, Videos.
Links: https://www.youtube.com/watch?v=IiKhXC6NFDg
Laboratory Session:
1. OpenMP parallel programs using the #pragma directive in C.
2. OpenMP parallel programs using the #pragma directive with work-sharing constructs in C.
3. OpenMP programs using constructs such as omp for, omp sections, and omp single.
4. OpenMP programs on parallel constructs.
5. OpenMP programs on the task construct.
6. OpenMP programs using threadprivate directives.
7. OpenMP programs using threadprivate directives.
8. OpenMP programs on thread scheduling.
9. OpenMP programs using lastprivate, reduction, copyin, and shared clauses.
10. Programs using point-to-point MPI calls.
11. Programs using message-passing MPI calls.
12. CUDA programs on message passing.
13. CUDA programs on broadcasting.
14. Graph processing with GPU.
Suggested Learning Resources
Text Book:
Ananth Grama, Introduction to Parallel Computing, 2nd edition, Pearson Education, 2003.
Shane Cook, CUDA Programming: A Developer's Guide to Parallel Computing with GPUs, Morgan Kaufmann, 2013, ISBN: 978-0-12-415933-4.
Reference:
Tolga Soyata, GPU Parallel Program Development Using CUDA, CRC Press, 2018.
What Is Parallel Computing?

• Serial Computing
• Traditionally, software has been written for serial computation:
• A problem is broken into a discrete series of instructions
• Instructions are executed sequentially one after another
• Executed on a single processor
• Only one instruction may execute at any moment in time
Parallel Computing
• In the simplest sense, parallel computing is the simultaneous use of
multiple compute resources to solve a computational problem:
• A problem is broken into discrete parts that can be solved
concurrently
• Each part is further broken down to a series of instructions
• Instructions from each part execute simultaneously on different
processors
• An overall control/coordination mechanism is employed
•The computational problem should be able to:
• Be broken apart into discrete pieces of work that can be solved simultaneously;
• Execute multiple program instructions at any moment in time;
• Be solved in less time with multiple compute resources than with a single compute
resource.
•The compute resources are typically:
• A single computer with multiple processors/cores
• An arbitrary number of such computers connected by a network
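As a concrete preview (in CUDA, the programming model used throughout this course), the sketch below breaks one problem - doubling every element of an array - into discrete pieces that GPU threads work on simultaneously. This is a minimal, illustrative sketch only: it assumes a CUDA-capable GPU and the nvcc toolchain, and names such as scale and N are chosen for the example rather than taken from the course material.

#include <cstdio>
#include <cuda_runtime.h>

// Each thread handles one discrete piece of the overall problem: one element.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's element
    if (i < n)                                       // guard the last, partial block
        data[i] *= factor;
}

int main() {
    const int N = 1 << 20;                           // about one million elements
    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));
    cudaMemset(d_data, 0, N * sizeof(float));

    // Decompose the problem into blocks of 256 threads; the hardware runs
    // many blocks concurrently on different multiprocessors.
    int threads = 256;
    int blocks  = (N + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_data, 2.0f, N);
    cudaDeviceSynchronize();                         // overall coordination point

    printf("launched %d blocks of %d threads\n", blocks, threads);
    cudaFree(d_data);
    return 0;
}

Compiled with nvcc, each of the roughly 4,096 blocks is an independent piece of work, which is exactly the decomposition described above.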
Parallel Computers
•Virtually all stand-alone computers today are parallel from a hardware perspective:
• Multiple functional units (L1 cache, L2 cache, branch, prefetch, decode,
floating-point, graphics processing (GPU), integer, etc.)
• Multiple execution units/cores
• Multiple hardware threads
Why Use Parallel Computing?
The Real World Is Massively Complex
•In the natural world, many complex, interrelated events are happening at the same time, yet within a
temporal sequence.
•Compared to serial computing, parallel computing is much better suited for modeling, simulating and
understanding complex, real world phenomena.
• For example, imagine modeling these serially.
Example: the grid for a numerical weather model of the Earth.
Main Reasons for Using Parallel Programming
i. SAVE TIME AND/OR MONEY
• In theory, throwing more resources at a task will shorten its time to
completion, with potential cost savings.
• Parallel computers can be built from cheap, commodity components.
ii. SOLVE LARGER / MORE COMPLEX PROBLEMS
• Many problems are so large and/or complex that it is impractical or
impossible to solve them using a serial program, especially given
limited computer memory.
• Example: "Grand Challenge Problems"
(en.wikipedia.org/wiki/Grand_Challenge) requiring petaflops and
petabytes of computing resources.
• Example: Web search engines/databases processing millions of
transactions every second
iii. PROVIDE CONCURRENCY
• A single compute resource can only do one thing at a time. Multiple
compute resources can do many things simultaneously.
• Example: Collaborative Networks provide a global venue where
people from around the world can meet and conduct work "virtually."
iv. MAKE BETTER USE OF UNDERLYING PARALLEL HARDWARE
• Modern computers, even laptops, are parallel in architecture with
multiple processors/cores.
• Parallel software is specifically intended for parallel hardware with
multiple cores, threads, etc.
• In most cases, serial programs run on modern computers "waste"
potential computing power.
The Future
• During the past 20+ years, the trends indicated by ever faster
networks, distributed systems, and multi-processor computer
architectures (even at the desktop level) clearly show that parallelism
is the future of computing.
• In this same time period, there has been a greater
than 500,000x increase in supercomputer performance, with no end
currently in sight.
• The race is already on for Exascale Computing - we are entering
Exascale era
• Exaflop = 10^18 calculations per second
• US DOE Exascale Computing Project: https://www.exascaleproject.org
Who Is Using Parallel Computing?
i. Science and Engineering
• Historically, parallel computing has been considered to be "the high end of computing,"
and has been used to model difficult problems in many areas of science and engineering:
• Atmosphere, Earth, Environment
• Physics - applied, nuclear, particle, condensed matter, high pressure, fusion, photonics
• Bioscience, Biotechnology, Genetics
• Chemistry, Molecular Sciences
• Geology, Seismology
• Mechanical Engineering - from prosthetics to spacecraft
• Electrical Engineering, Circuit Design, Microelectronics
• Computer Science, Mathematics
• Defense, Weapons
ii. Industrial and Commercial
• Today, commercial applications provide an equal or greater driving force in the development of
faster computers. These applications require the processing of large amounts of data in
sophisticated ways. For example:
• "Big Data," databases, data mining
• Artificial Intelligence (AI)
• Oil exploration
• Web search engines, web based business services
• Medical imaging and diagnosis
• Pharmaceutical design
• Financial and economic modeling
• Management of national and multi-national corporations
• Advanced graphics and virtual reality, particularly in the entertainment industry
• Networked video and multi-media technologies
• Collaborative work environments
Global Applications
• Parallel computing is now being used extensively around the world, in
a wide variety of applications.
Need for Speed?
History of Supercomputing
Unit 1: A Short History of Supercomputing

• Definition: Supercomputers are extremely powerful computers used for complex calculations, simulations, and data
processing.
• Purpose: They tackle problems in fields like climate modeling, physics simulations, and large-scale data analysis.
• Early Beginnings (1950s-1960s)
Key Machine: ENIAC (1945) - One of the first general-purpose computers.
Transition: Introduction of transistor technology and more efficient designs (e.g., CDC 6600 in 1964).
• The Rise of Vector Processing (1970s-1980s)
Vector Processors: Designed for handling vector calculations (e.g., Cray-1).
Impact: Revolutionized scientific computing with faster processing capabilities.
• Massively Parallel Processing (1990s)
Trend: Shift to parallel architectures (e.g., IBM Blue Gene).
Advancements: Enabled solving larger problems more efficiently.
• The Petascale Era (2000s)
Milestone: The first petascale supercomputer (e.g., IBM Roadrunner).
Performance: Capable of over a quadrillion calculations per second.

• The Exascale Revolution (2010s-Present)
Goal: Development of exascale computers (1 exaflop = 1 quintillion, i.e. 10^18, calculations per second).
Projects: U.S. and global initiatives for exascale computing (e.g., Frontier).

• Applications of Supercomputing
Fields: Weather forecasting, genomics, materials science, artificial intelligence.
Impact: Enhancing research capabilities and innovation.

• Future of Supercomputing
Trends: Quantum computing, increased energy efficiency, and AI integration.
Outlook: Continuous advancements in processing power and capabilities.
Von Neumann Architecture

• The Von Neumann architecture is a foundational concept in computer design that describes a system
architecture for electronic computers.
• It was proposed by mathematician and physicist John von Neumann in the 1940s.
Key Components of Von Neumann Architecture
1. Central Processing Unit (CPU):
Control Unit (CU): Directs the operation of the processor, managing the execution of instructions by
fetching them from memory, decoding them, and executing them.
Arithmetic Logic Unit (ALU): Performs arithmetic and logical operations (e.g., addition, subtraction,
comparisons).

2. Memory: Stores data and instructions. Memory is typically divided into:
Main Memory (RAM): Where data and instructions currently being used are stored.
Secondary Storage: For long-term storage (e.g., hard drives, SSDs).

3. Input/Output (I/O) Devices: Interfaces through which the computer interacts with the external environment,
such as keyboards, mice, printers, and monitors.
Von Neumann Architecture

• Advantages:
• Simplicity
• Flexibility
• Cost-Effective
• Lower Power Consumption

• Disadvantages:
• Bottleneck
• Sequential Processing Limitation

• Applications:
• Embedded Systems
• Personal Computers
• Simple Servers
Understanding Von Neumann Architecture
• The Von Neumann architecture outlines a system where a computer’s data
and program instructions share the same memory space. It comprises
several key components:
• Processing Unit: Includes an Arithmetic Logic Unit (ALU) and processor
registers, which perform calculations and hold data temporarily.
• Control Unit: Contains an instruction register and a program counter,
directing the processing unit on which instructions to execute.
• Memory: Stores both data and instructions, making them accessible to the
processing unit.
• External Storage: Provides long-term storage for data and programs.
• Input and Output Mechanisms: Facilitate communication between the
computer and the outside world.
• This design allows for flexibility and simplicity, as the same memory system
can store instructions and data.
• The phrase 'one at a time' means that the von Neumann architecture is a sequential processing machine: it fetches, decodes, and executes one instruction at a time, as sketched below.
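To make the stored-program, one-instruction-at-a-time idea concrete, here is a small illustrative sketch (plain C host code, compilable with gcc or nvcc): a toy machine whose single memory array holds both the program and its data, driven by a strictly sequential fetch-decode-execute loop. The instruction set and encoding are invented purely for this example and are not taken from the course material.

#include <stdio.h>

/* Toy von Neumann machine: program and data share one memory array. */
enum { HALT = 0, LOAD = 1, ADD = 2, STORE = 3 };

int main(void) {
    int memory[16] = {
        /* program: */ LOAD, 10, ADD, 11, STORE, 12, HALT, 0, 0, 0,
        /* data:    */ 7, 35, 0, 0, 0, 0
    };
    int pc  = 0;   /* program counter (control unit)     */
    int acc = 0;   /* accumulator register (ALU operand) */

    for (;;) {                       /* sequential fetch-decode-execute */
        int op = memory[pc++];       /* fetch the opcode                */
        if (op == HALT) break;
        int arg = memory[pc++];      /* fetch the operand address       */
        switch (op) {                /* decode and execute              */
            case LOAD:  acc = memory[arg];       break;
            case ADD:   acc = acc + memory[arg]; break;  /* ALU */
            case STORE: memory[arg] = acc;       break;
        }
    }
    printf("memory[12] = %d\n", memory[12]);  /* prints 42 */
    return 0;
}

Because there is a single memory and a single program counter, only one instruction can be fetched and executed at a time - the sequential bottleneck that parallel architectures try to overcome.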
CRAY
Introduction to CRAY Supercomputer

Cray supercomputers are renowned for their exceptional processing power and efficiency, making them pivotal in the field of high-performance computing (HPC). Founded by Seymour Cray in 1972 as Cray Research (later Cray Inc.), the company has consistently pushed the boundaries of what is possible in computing, catering to diverse sectors such as scientific research, engineering, and data analysis.
Key Features:
• High Performance: Cray systems are designed to perform complex calculations at unprecedented
speeds, often ranking among the top supercomputers globally.

• Advanced Architecture: They utilize innovative architectures, including multi-core processors and
advanced interconnects, enabling efficient parallel and vector processing.

• Scalability: Cray supercomputers can be easily scaled to handle large datasets and intensive
computations, making them suitable for both small projects and large-scale research initiatives.
Cray Supercomputer
CRAY 1

The Cray-1 is a historic milestone in computing, recognized as the first successful supercomputer. Introduced in
1976 by Cray Research and designed by Seymour Cray, it revolutionized the field of high-performance
computing (HPC) and set a new standard for scientific calculations.
Key Features:
• Vector Processing:
• The Cray-1 employed vector processing, allowing it to perform multiple calculations on large datasets
simultaneously. This significantly improved its efficiency for mathematical computations, especially in
scientific applications.
• Innovative Design:
• Its distinctive C-shaped architecture was not only visually striking but also optimized for cooling and
space, making it compact compared to other systems of the era.
• Impressive Performance:
• Capable of achieving speeds of up to 80 megaflops (million floating-point operations per second), the
Cray-1 was unparalleled in performance at the time, making it a preferred choice for complex
simulations.
• Advanced Memory System:
• It featured a large, fast memory architecture that utilized integrated circuits, allowing for rapid data
access and processing, which was essential for handling intensive computational tasks.
• Impact and Legacy:
• The Cray-1 quickly became indispensable in various fields, including meteorology, physics, and
molecular biology, enabling breakthroughs in research that required substantial computational power.
• Its success established Cray Research as a leader in the supercomputing market and paved the way for
subsequent generations of supercomputers.

Conclusion:
The Cray-1 not only transformed how calculations were performed but also laid the groundwork for future
advancements in supercomputing. Its innovative design and capabilities have left a lasting legacy in the world
of high-performance computing, influencing both hardware development and computational techniques in
scientific research.
CRAY 2

• The Cray-2, introduced in 1985 by Cray Research, is regarded as one of the most advanced supercomputers
of its time. Building on the success of the Cray-1, the Cray-2 featured several innovations that further
enhanced its performance and efficiency, making it a critical tool for scientific research and industrial
applications.
Key Features:
• Vector Processing:
• Like its predecessor, the Cray-2 utilized vector processing technology, allowing it to perform multiple
calculations simultaneously. This capability was essential for tasks involving large datasets and complex
mathematical computations.
• Cooling Technology:
• The Cray-2 introduced an innovative liquid immersion cooling system, where the entire computer was
submerged in a special coolant. This design not only improved cooling efficiency but also allowed for a
more compact configuration.
• Impressive Performance:
• The Cray-2 achieved performance levels of up to 1.9 gigaflops (billion floating-point operations per
second), making it the fastest supercomputer in the world at the time of its release.
• Advanced Memory Architecture:
• It featured a sophisticated memory system, with up to 256 megabytes of fast memory, allowing for rapid
data access and processing, crucial for high-demand applications.
• Impact and Applications:
• The Cray-2 was widely used in various scientific and engineering fields, including climate modeling,
computational fluid dynamics, and structural analysis. Its power enabled researchers to conduct complex
simulations and analyses that were previously impossible.
• Organizations such as NASA and various national laboratories relied on the Cray-2 for high-stakes
computations, solidifying its reputation as a vital tool for cutting-edge research.
• Legacy:
• The Cray-2's innovative technologies and design set new benchmarks in the field of supercomputing and
influenced future generations of computers.
• Its advancements in cooling technology and memory architecture laid the groundwork for subsequent
supercomputers, enhancing performance and energy efficiency.
• Conclusion:
The Cray-2 represented a significant leap in supercomputing capabilities, combining cutting-edge technology
with powerful performance. Its contributions to scientific research and engineering have left a lasting impact
on the field, ensuring its place in the history of high-performance computing.
Cray Uses
• Scientific Research: Weather and Climate Modeling, Astrophysics
• Engineering and Design: Computational Fluid Dynamics (CFD), Structural
Analysis
• Pharmaceuticals and Healthcare: Drug Discovery
• Financial Services: Risk Analysis and Management, High-Frequency Trading
• Energy Sector: Oil and Gas Exploration, Renewable Energy Research
• Artificial Intelligence and Machine Learning: Data Processing, Predictive
Analytics.
Multinode computing
Moore's Law

• Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
Moore's Law holds also for performance and capacity

                              1945 (ENIAC)     2002 (Laptop)
Vacuum tubes / transistors    18,000           6,000,000,000
Weight (kg)                   27,200           0.9
Size (m^3)                    68               0.0028
Power (watts)                 20,000           60
Cost ($)                      4,630,000        1,000
Memory (bytes)                200              1,073,741,824
Performance (Flop/s)          800              5,000,000,000
Memory Hierarchy
(Figure: the levels of the memory hierarchy; detailed in "An Example Memory Hierarchy" below.)
Processor-Memory Problem
• Processors issue instructions roughly every nanosecond
• DRAM can be accessed roughly every 100 nanoseconds
• The gap is growing:
• processors are getting faster by about 60% per year
• DRAM is getting faster by about 7% per year
CPU Clock Rates

                   1980     1985    1990    1995       2000     2000:1980
processor          8080     286     386     Pentium    P-III
clock rate (MHz)   1        6       20      150        750      750
cycle time (ns)    1,000    166     50      6          1.6      750

More recent processors: AMD Ryzen (12 cores) at about 4.8 GHz; Intel i11 (8 cores) at about 5.3 GHz.
The CPU-Memory Gap
The increasing gap between DRAM, disk, and CPU speeds.
(Figure: disk seek time, DRAM access time, SRAM access time, and CPU cycle time, in ns, plotted for 1980-2000.)
An Example Memory Hierarchy
Storage devices get smaller, faster, and costlier (per byte) toward the top of the hierarchy, and larger, slower, and cheaper (per byte) toward the bottom:
• L0: CPU registers - hold words retrieved from the L1 cache
• L1: on-chip L1 cache (SRAM) - holds cache lines retrieved from the L2 cache
• L2: off-chip L2 cache (SRAM) - holds cache lines retrieved from main memory
• L3: main memory (DRAM) - holds disk blocks retrieved from local disks
• L4: local secondary storage (local disks) - holds files retrieved from disks on remote network servers
• L5: remote secondary storage (distributed file systems, Web servers)
How fast can a serial computer be?
• Consider a 1 Tflop/s sequential machine:
• data must travel some distance, r, to get from memory to the CPU
• to deliver one data element per cycle, the data must make that trip 10^12 times per second, and nothing travels faster than light, c = 3 x 10^8 m/s
• so r < c / 10^12 = 3 x 10^8 / 10^12 m = 0.3 mm
• Now put 1 TB of storage in a 0.3 mm x 0.3 mm area:
• each bit gets an area of roughly 1 square Angstrom - the size of a small atom
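A quick back-of-the-envelope check of this argument, as a small C sketch (the constants are just the numbers quoted above):

#include <stdio.h>

int main(void) {
    const double c    = 3.0e8;    /* speed of light, m/s                   */
    const double rate = 1.0e12;   /* 1 Tflop/s: one memory trip per cycle  */

    double r = c / rate;          /* farthest the data can sit, in metres  */
    printf("r = %.1f mm\n", r * 1e3);                 /* 0.3 mm            */

    double area    = r * r;       /* chip area, m^2 (0.3 mm x 0.3 mm)      */
    double bits    = 8.0e12;      /* 1 TB expressed in bits                */
    double per_bit = area / bits; /* m^2 available per bit                 */
    printf("area per bit = %.1f square Angstrom\n", per_bit / 1e-20);
    return 0;
}

The program prints r = 0.3 mm and roughly 1.1 square Angstroms per bit, matching the estimate above.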
Logical Inference

So, we need Parallel Computing!

GPUs
• The graphics processing unit, or GPU, has become one of the most
important types of computing technology, both for personal and
business computing.
• Designed for parallel processing, the GPU is used in a wide range of
applications, including graphics and video rendering.
• Although they’re best known for their capabilities in gaming, GPUs
are becoming more popular for use in creative production and
artificial intelligence (AI).
• GPU acceleration, or graphics processing unit acceleration, is a
computing technique that uses the enormous power of graphics
processing units to dramatically increase the performance of
applications.
• This technique uses the parallel processing capabilities of GPUs,
allowing you to handle more tasks simultaneously, leading to huge
improvements in computational speeds and efficiency.
NVIDIA and CUDA
• NVIDIA GPUs are programmed with CUDA (Compute Unified Device Architecture), NVIDIA's parallel computing platform and programming model, which extends C/C++ with a small set of keywords for writing kernels that run on the GPU.
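Unit III develops the vector addition kernel in detail; as a preview, here is a minimal, hedged sketch of a complete CUDA program (it assumes an installed CUDA toolkit; names such as vecAdd, h_a and d_a are illustrative conventions, not fixed by the course material):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Kernel: each CUDA thread adds one pair of elements.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 16;
    size_t bytes = n * sizeof(float);

    // Host (CPU) arrays
    float *h_a = (float*)malloc(bytes), *h_b = (float*)malloc(bytes), *h_c = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    // Device (GPU) global memory and host-to-device transfers
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements
    int threads = 256, blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // Copy the result back and check one element
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[100] = %.1f\n", h_c[100]);   // expect 300.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}

The same structure - allocate on the device, copy in, launch a kernel over many threads, copy back - reappears in every CUDA program covered in Units III-V.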
Alternatives to CUDA
• CUDA is a powerful technology that lets you squeeze every bit of performance out of your NVIDIA GPU.
• However, it works only with NVIDIA hardware, and porting existing CUDA code to other platforms is not easy.
• So you may look for an alternative to CUDA.
What are the alternatives to CUDA?
• OpenCL: An open standard for parallel programming across CPUs,
GPUs, and other processors with some performance overhead
compared to CUDA.
• AMD ROCm: An open-source GPU computing platform developed by
AMD that allows the porting of CUDA code to AMD GPUs.
• SYCL: A higher-level programming model based on C++ for
heterogeneous processors enabling code portability across CUDA and
OpenCL through Intel’s DPC++ and hipSYCL.
• Vulkan Compute: It is a compute API of the Vulkan graphics
framework, enabling GPU computing on a wide range of GPUs with
lower-level control.
• Intel oneAPI: It is a cross-architecture programming model from Intel,
including a DPC++ compiler for SYCL, offering an alternative to CUDA
for Intel GPUs.

• OpenMP: An API for parallel programming on CPUs and GPUs. It uses compiler directives, and recent versions support GPU offloading as an alternative to CUDA.
Types of Parallelism
Case Study: Parallelism (how do you allocate tasks to processors?)
1. Temporal Parallelism (pipelining)
Appropriateness and challenges (a CUDA illustration follows this list):
1. Synchronization
2. Bubbles in the pipeline (e.g., if a student has answered only 2 of the questions)
3. Fault tolerance (e.g., a teacher takes a coffee break)
4. Inter-task communication
5. Scalability (the number of teachers cannot be increased arbitrarily)
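In CUDA terms, temporal (pipeline) parallelism shows up when data transfers for one chunk overlap with computation on another, using streams (streams appear again in Unit V). The sketch below is illustrative only: it assumes pinned host memory, a device that supports copy/compute overlap, and invented names such as process and chunkN.

#include <cuda_runtime.h>

// Trivial per-chunk computation used only to illustrate the pipeline.
__global__ void process(float *chunk, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) chunk[i] += 1.0f;
}

int main() {
    const int chunks = 4, chunkN = 1 << 20;
    size_t bytes = chunkN * sizeof(float);

    float *h_data, *d_data;
    cudaMallocHost((void**)&h_data, chunks * bytes);   // pinned memory enables async copies
    cudaMalloc((void**)&d_data, chunks * bytes);
    for (int i = 0; i < chunks * chunkN; ++i) h_data[i] = 0.0f;

    cudaStream_t stream[chunks];
    for (int s = 0; s < chunks; ++s) cudaStreamCreate(&stream[s]);

    // Pipeline: while chunk k is being computed, chunk k+1 can be copied in.
    for (int k = 0; k < chunks; ++k) {
        float *h = h_data + k * chunkN, *d = d_data + k * chunkN;
        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, stream[k]);
        process<<<(chunkN + 255) / 256, 256, 0, stream[k]>>>(d, chunkN);
        cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, stream[k]);
    }
    cudaDeviceSynchronize();                           // drain the pipeline

    for (int s = 0; s < chunks; ++s) cudaStreamDestroy(stream[s]);
    cudaFreeHost(h_data); cudaFree(d_data);
    return 0;
}

Just as in the teacher analogy, an empty stage (a "bubble") appears whenever one stage has nothing to work on while the others are busy.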
2. Data Parallelism
Advantages and disadvantages (a CUDA illustration follows this list):
1. The assignment of jobs to teachers is decided in advance. This is called static assignment; because jobs take different amounts of time, completion time varies from teacher to teacher.
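On a GPU, data parallelism with a static assignment looks like the sketch below: the mapping of array elements to threads is fixed when the kernel is launched (here via a grid-stride loop), no matter how long each element takes to process. This is an illustrative sketch only; saxpy and the launch configuration are conventional examples, not prescribed by the course material.

#include <cstdio>
#include <cuda_runtime.h>

// Static data-parallel assignment: thread t always processes elements
// t, t + stride, t + 2*stride, ... - the schedule is fixed at launch time.
__global__ void saxpy(float a, const float *x, float *y, int n) {
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 22;
    float *x, *y;
    cudaMallocManaged((void**)&x, n * sizeof(float));  // unified memory, for brevity
    cudaMallocManaged((void**)&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy<<<128, 256>>>(3.0f, x, y, n);                // 128*256 threads share n elements
    cudaDeviceSynchronize();

    printf("y[0] = %.1f\n", y[0]);                     // expect 5.0
    cudaFree(x); cudaFree(y);
    return 0;
}

As in the teacher analogy, each thread's share of the work is decided up front; if some elements were more expensive than others, some threads would finish early while others kept working.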
Questionnaires
1. What are the different methods of increasing the speed of computers?
2. List the advantages and disadvantages of using parallel computers.
3. Are parallel computers more reliable than serial computers? If yes, explain why.
4. How do parallel computers reduce rounding errors in solving numerically intensive problems?
