Lesson 1.
Introduction to CUDA
- Memory Allocation and Data Movement API Functions
Objective
To learn the basic API functions in CUDA host code:
- Device Memory Allocation
- Host-Device Data Transfer
Data Parallelism - Vector Addition Example
[Figure: vectors A, B, and C laid out element by element; each output element C[i] is computed from the pair A[i] + B[i], for i = 0 .. N-1]
Vector Addition Traditional C Code
// Compute vector sum C = A + B
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int i;
    for (i = 0; i < n; i++) h_C[i] = h_A[i] + h_B[i];
}

int main()
{
    // Memory allocation for h_A, h_B, and h_C
    // I/O to read h_A and h_B, N elements
    vecAdd(h_A, h_B, h_C, N);
}
Heterogeneous Computing vecAdd
CUDA Host Code

[Figure: Part 1 copies the input vectors from host memory to device memory (CPU to GPU); Part 2 performs the vector addition on the GPU; Part 3 copies the result from device memory back to host memory]
#include <cuda.h>
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    // 1. Allocate device memory for A, B, and C;
    //    copy A and B to device memory

    // 2. Kernel launch code: the device performs the
    //    actual vector addition

    // 3. Copy C from the device memory; free device
    //    vectors
}
Partial Overview of CUDA Memories
Device code can:
- R/W per-thread registers
- R/W all-shared global memory

Host code can:
- Transfer data to/from per-grid global memory

[Figure: a (Device) Grid containing Block (0, 0) and Block (0, 1); within each block, Thread (0, 0) and Thread (0, 1) each have their own Registers; the Host connects to the device Global Memory]
We will cover more memory types later.
CUDA Device Memory Management
API Functions

cudaMalloc()
- Allocates an object in the device global memory
- Two parameters:
  - Address of a pointer to the allocated object
  - Size of the allocated object in bytes

cudaFree()
- Frees an object from device global memory
- One parameter: pointer to the freed object
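As a minimal sketch, the two calls pair up like this (d_A and size are assumed to be declared as in the earlier vecAdd outline):

```cuda
float *d_A;                      // device pointer, set by cudaMalloc
int size = n * sizeof(float);

// cudaMalloc takes the *address* of the pointer (cast to void **)
// and the allocation size in bytes
cudaMalloc((void **) &d_A, size);

// ... use d_A in cudaMemcpy calls and kernel launches ...

// cudaFree takes the device pointer itself
cudaFree(d_A);
```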
Host-Device Data Transfer
API Functions

cudaMemcpy()
- Memory data transfer
- Requires four parameters:
  - Pointer to destination
  - Pointer to source
  - Number of bytes copied
  - Type/direction of transfer
- Transfer to the device is asynchronous
Vector Addition Host Code
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    cudaMalloc((void **) &d_A, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_B, size);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_C, size);

    // Kernel invocation code to be shown later

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
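The kernel invocation is covered later in the course; as a preview, a minimal vector-addition kernel and launch might look like this (the kernel name and the 256-thread block size are chosen here only for illustration):

```cuda
// Each thread computes one element of C
__global__ void vecAddKernel(float *A, float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];   // guard against extra threads
}

// Launch enough 256-thread blocks to cover all n elements
vecAddKernel<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n);
```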
In Practice, Check for API Errors in Host Code
cudaError_t err = cudaMalloc((void **) &d_A, size);
if (err != cudaSuccess) {
    printf("%s in %s at line %d\n",
           cudaGetErrorString(err), __FILE__, __LINE__);
    exit(EXIT_FAILURE);
}
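In real code this check is usually wrapped in a macro so every API call can be tested without repeating the boilerplate. A common sketch (the macro name CUDA_CHECK is our own convention, not part of the CUDA API):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Wrap any CUDA runtime call: print the error and exit on failure
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            printf("%s in %s at line %d\n",                       \
                   cudaGetErrorString(err), __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Usage:
// CUDA_CHECK(cudaMalloc((void **) &d_A, size));
```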
To Learn More, Read
Chapter 3. Thank you!