MODULE TWO:
PROFILING
Dr. Volker Weinberg | LRZ
MODULE OVERVIEW
Topics to be covered
Compiling and profiling sequential code
Explanation of multicore programming
Compiling and profiling multicore code
COMPILING SEQUENTIAL CODE
NVIDIA’S HPC COMPILERS (AKA PGI)
NVIDIA Compiler Names (PGI names still work)
nvc - The command to compile C code (formerly known as 'pgcc')
nvc++ - The command to compile C++ code (formerly known as 'pgc++')
nvfortran - The command to compile Fortran code (formerly known as 'pgfortran'/'pgf90'/'pgf95'/'pgf77')
The -fast flag instructs the compiler to optimize the code to the best of its abilities.

$ nvc -fast main.c            $ pgcc -fast main.c
$ nvc++ -fast main.cpp        $ pgc++ -fast main.cpp
$ nvfortran -fast main.F90    $ pgfortran -fast main.F90
NVIDIA’S HPC COMPILERS (AKA PGI)
-Minfo flag
The -Minfo flag instructs the compiler to print feedback about the compiled code
-Minfo=accel will give us information about what parts of the code were accelerated via OpenACC
-Minfo=opt will give information about all code optimizations
-Minfo=all will give all code feedback, whether positive or negative

$ pgcc -fast -Minfo=all main.c
$ pgc++ -fast -Minfo=all main.cpp
$ pgfortran -fast -Minfo=all main.f90
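The same flags are accepted by the newer nv* driver names; for example:

$ nvc -fast -Minfo=all main.c
$ nvc++ -fast -Minfo=all main.cpp
$ nvfortran -fast -Minfo=all main.f90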
NVIDIA NSIGHT FAMILY
Nsight Product Family
Workflow
Nsight Systems - analyze application algorithms system-wide
Nsight Compute - debug/optimize CUDA kernels
Nsight Graphics - debug/optimize graphics workloads
(Diagram: Nsight Systems captures thread/core migration, processes and threads, thread state, CUDA and OpenGL API trace, cuDNN and cuBLAS trace, kernel and memory transfer activities, and multi-GPU activity.)
This material is released by NVIDIA Corporation under the Creative Commons Attribution 4.0 International (CC BY 4.0)
PROFILING SEQUENTIAL CODE
OPENACC DEVELOPMENT CYCLE
Analyze: analyze your code to determine the most likely places needing parallelization or optimization.
Parallelize: parallelize your code by starting with the most time-consuming parts, check for correctness, and then analyze it again.
Optimize: optimize your code to improve the observed speed-up from parallelization.
(Diagram: Analyze -> Parallelize -> Optimize, repeated as a cycle.)
PROFILING SEQUENTIAL CODE
Step 1: Run Your Code
Record the time it takes for your sequential program to run.
Note the final results to verify correctness later.
Always run a problem that is representative of your real jobs.

Terminal Window:
$ pgcc -fast jacobi.c laplace2d.c
$ ./a.out
    0, 0.250000
  100, 0.002397
  200, 0.001204
  300, 0.000804
  400, 0.000603
  500, 0.000483
  600, 0.000403
  700, 0.000345
  800, 0.000302
  900, 0.000269
 total: 39.432648 s
PROFILING SEQUENTIAL CODE
Step 2: Profile Your Code
Obtain detailed information about how the code ran.
This can include information such as:
Total runtime
Runtime of individual routines
Hardware counters
Identify the portions of code that took the longest to run. We want to focus on these "hotspots" when parallelizing.

Lab Code: Laplace Heat Transfer
Total Runtime: 39.43 seconds (calcNext: 21.49 s, swap: 19.04 s)
PROFILING WITH NSIGHT SYSTEMS AND NVTX
PROFILING SEQUENTIAL CODE
Using Command Line Interface (CLI)
NVIDIA Nsight Systems CLI provides:
A simple interface to collect data
Reports that can be copied to any system and analyzed later
Profiling of both serial and parallel code
For more info, enter nsys --help on the terminal
To profile a serial application with NVIDIA Nsight Systems, we use NVIDIA Tools Extension
(NVTX) API functions in addition to collecting backtraces while sampling.
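For example, the serial lab code can be profiled with a command along these lines (a sketch assembled from the nsys options described later in this module; the output name laplace-seq is an assumption):

$ nsys profile -t nvtx --stats=true -b dwarf --force-overwrite true -o laplace-seq ./a.out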
PROFILING SEQUENTIAL CODE
NVIDIA Tools Extension API (NVTX) library
What is it?
A C-based Application Programming Interface (API) for annotating events
Can be easily integrated into the application
Can be used with NVIDIA Nsight Systems
Why?
Allows manual instrumentation of the application
Allows additional information for profiling (e.g. tracing of CPU events and time ranges)
How?
Import the header-only C library nvToolsExt.h
Wrap the code region or a specific function with nvtxRangePush() and nvtxRangePop()
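A minimal sketch of the push/pop pattern (the function name do_work is a hypothetical placeholder; the lab code below uses the same calls):

#include <nvtx3/nvToolsExt.h>  /* header-only NVTX v3 API */

void do_work(void);  /* hypothetical function to annotate */

int main(void)
{
    nvtxRangePushA("work");  /* open a named range */
    do_work();               /* region of interest */
    nvtxRangePop();          /* close the most recently opened range */
    return 0;
}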
nsys CLI options used with this example:
-t selects the APIs to be traced (nvtx in this example)
--stats if true, generates a summary of statistics after the collection
-b selects the backtrace method to use while sampling; the option dwarf uses DWARF's CFI (Call Frame Information)
--force-overwrite if true, overwrites the existing results
-o sets the output (qdrep) filename

jacobi.c (the start and end of each range are highlighted with the same color):

#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include "laplace2d.h"
#include <nvtx3/nvToolsExt.h>

int main(int argc, char** argv)
{
    const int n = 4096;
    const int m = 4096;
    const int iter_max = 1000;

    const double tol = 1.0e-6;
    double error = 1.0;

    double *restrict A    = (double*)malloc(sizeof(double)*n*m);
    double *restrict Anew = (double*)malloc(sizeof(double)*n*m);

    nvtxRangePushA("init");
    initialize(A, Anew, m, n);
    nvtxRangePop();

    printf("Jacobi relaxation Calculation: %d x %d mesh\n", n, m);

    double st = omp_get_wtime();
    int iter = 0;

    nvtxRangePushA("while");
    while ( error > tol && iter < iter_max )
    {
        nvtxRangePushA("calc");
        error = calcNext(A, Anew, m, n);
        nvtxRangePop();

        nvtxRangePushA("swap");
        swap(A, Anew, m, n);
        nvtxRangePop();

        if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
        iter++;
    }
    nvtxRangePop();

    double runtime = omp_get_wtime() - st;
    printf(" total: %f s\n", runtime);

    deallocate(A, Anew);
    return 0;
}

NVTX range statistics: the "calc" region (calcNext function) takes 26.6% and the "swap" region (swap function) takes 23.4% of total execution time.
Open laplace-seq.qdrep with the Nsight Systems GUI to view the timeline.
PROFILING SEQUENTIAL CODE
Using Nsight Systems
Open the generated report files (*.qdrep) from the command line in the Nsight Systems profiler: File > Open
PROFILING SEQUENTIAL CODE
Using Nsight Systems
Navigate through the “view selector”.
“Analysis summary” shows a summary of the profiling
session. To review the project configuration used to
generate this report, see next slide.
“Timeline View” contains the timeline at the top, and a
bottom pane that contains the events view and the
function table.
Read more: https://docs.nvidia.com/nsight-systems
PROFILING SEQUENTIAL CODE
Using Nsight Systems
(Screenshots: Analysis Summary; Timeline View with charts and the hierarchy on the top pane, and the events view and function table on the bottom pane.)
PROFILING SEQUENTIAL CODE
Using Nsight Systems
To enlarge the view, right-click in the selected region and choose "Zoom into selection".
PROFILING SEQUENTIAL CODE
Viewing captured NVTX events and time ranges via the Nsight Systems GUI
From the Timeline view, right-click on "NVTX" in the top pane and choose "Show in Events View".
In the bottom pane, you can now see the names of the captured events along with their durations.
PLEASE START LAB NOW!
CSC_OPENACC_AMBASSADOR_MAY22
TRAINING SETUP
To get started, follow these steps:
Create an NVIDIA Developer account at http://courses.nvidia.com/join. Select "Log in with my NVIDIA Account" and then "Create Account".
Visit http://courses.nvidia.com/dli-event and enter the event code
LRZ_OPENACC_AMBASSADOR_MY22
To be able to visualize Nsight Systems profiler output during the course, please install the latest version of Nsight Systems on your local system beforehand. The software can be downloaded from https://developer.nvidia.com/nsight-systems.
PROFILING MULTICORE CODE
PROFILING MULTICORE CODE
What is multicore?
Multicore refers to using a CPU with multiple computational cores as our parallel device.
These cores can run independently of each other, but have shared access to memory.
Loop iterations can be spread across CPU threads and can utilize SIMD/vector instructions (SSE, AVX, etc.).
Parallelizing on a multicore CPU is a good starting place, since data management is unnecessary (see the sketch below).
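As a minimal sketch (a hypothetical example, not part of the lab code), a loop like this can be spread across CPU threads when compiled for a multicore target:

/* Hypothetical example: scale a vector. When compiled with
   -ta=multicore, the compiler splits the iterations of this
   loop across CPU threads. */
void scale(double *restrict x, double s, int n)
{
    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
        x[i] *= s;
}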
PROFILING MULTICORE CODE
Using a multicore CPU with OpenACC
OpenACC's generic model involves a combination of a host and a device.
Host generally means a CPU, and the device is some parallel hardware.
When running with a multicore CPU as our device, the host and the device are typically the same.
This also means that their memories will be the same.
(Diagram: Host = Device; Host Memory = Device Memory.)
PROFILING MULTICORE CODE
Compiling code for a specific parallel hardware
The -ta flag allows us to compile our code for a specific target parallel hardware.
'ta' stands for "Target Accelerator," an accelerator being another way to refer to parallel hardware.
Our OpenACC code can be compiled for many different kinds of parallel hardware without having to change the code.

$ pgcc -fast -Minfo=accel -ta=multicore laplace2d.c
calcNext:
     35, Generating Multicore code
         36, #pragma acc loop gang
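With the newer compiler names the invocation is analogous; for example (assuming the nvc driver from earlier in this module accepts the same flags):

$ nvc -fast -Minfo=accel -ta=multicore laplace2d.c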
PROFILING MULTICORE CODE
Profiling the multicore executable

$ nsys profile -t nvtx --stats=true --force-overwrite true -o laplace_parallel ./laplace_parallel
PROFILING OPENACC CODE
PARALLEL VS SEQUENTIAL
Compiler feedback
Have a close look at the PGI compiler feedback for both the sequential and parallel implementations of the application.
It provides information about how your program was optimized or why a particular optimization was not made.
Note: Adding the -Minfo flag (or -Minfo=accel or -Minfo=all) when compiling will enable compiler feedback messages, giving details about the parallel code generated.
(Screenshots: sequential and parallel compiler feedback.)
laplace2d.c (parallelized using OpenACC parallel directives; pragmas highlighted):

#include <math.h>
#include <stdlib.h>
#include <string.h>

#define OFFSET(x, y, m) (((x)*(m)) + (y))

void initialize(double *restrict A, double *restrict Anew, int m, int n)
{
    memset(A, 0, n * m * sizeof(double));
    memset(Anew, 0, n * m * sizeof(double));

    for(int i = 0; i < m; i++){
        A[i]    = 1.0;
        Anew[i] = 1.0;
    }
}

double calcNext(double *restrict A, double *restrict Anew, int m, int n)
{
    double error = 0.0;
    #pragma acc parallel loop reduction(max:error)
    for( int j = 1; j < n-1; j++)
    {
        #pragma acc loop
        for( int i = 1; i < m-1; i++ )
        {
            Anew[OFFSET(j, i, m)] = 0.25 * ( A[OFFSET(j, i+1, m)] + A[OFFSET(j, i-1, m)]
                                           + A[OFFSET(j-1, i, m)] + A[OFFSET(j+1, i, m)]);
            error = fmax( error, fabs(Anew[OFFSET(j, i, m)] - A[OFFSET(j, i, m)]));
        }
    }
    return error;
}

void swap(double *restrict A, double *restrict Anew, int m, int n)
{
    #pragma acc parallel loop
    for( int j = 1; j < n-1; j++)
    {
        #pragma acc loop
        for( int i = 1; i < m-1; i++ )
        {
            A[OFFSET(j, i, m)] = Anew[OFFSET(j, i, m)];
        }
    }
}

void deallocate(double *restrict A, double *restrict Anew)
{
    free(A);
    free(Anew);
}

The profiler reports CUDA API, CUDA kernel, CUDA memory operation, and NVTX range statistics.
NVTX range statistics: the "calc" region (calcNext function) takes 29.2% and the "swap" region (swap function) takes 18.3% of total execution time.
Open laplace-par.qdrep with the Nsight Systems GUI to view the timeline.
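For reference, the report above can be reproduced with a command sequence along these lines (a sketch combining the commands shown earlier; including jacobi.c in the compile line is an assumption about the lab build):

$ pgcc -fast -Minfo=accel -ta=multicore jacobi.c laplace2d.c -o laplace_parallel
$ nsys profile -t nvtx --stats=true --force-overwrite true -o laplace_parallel ./laplace_parallel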
PARALLEL VS SEQUENTIAL SPEEDUP
Viewing captured NVTX events
Have a close look at the captured NVTX events for both the serial and parallel implementations.
Time spent in the "while" loop has significantly decreased.
Achieved speedup: ~47x
(Screenshots: NVTX event durations for the sequential and parallel runs.)
PROFILING PARALLEL CODE
Viewing the timeline via Nsight Systems
The contents of the tree-like hierarchy on the left depend on the project settings used to collect this report.
If a certain feature has not been enabled, the corresponding rows will not be shown on the timeline.
In this example, we chose to trace NVTX and OpenACC while sampling.
Note: Kernel launches are shown in blue and memory transfers in green.
LAB CODE
LAPLACE HEAT TRANSFER
Introduction to lab code - visual
We will observe a simple simulation of heat distributing across a metal plate.
We will apply a consistent heat to the top of the plate.
Then, we will simulate the heat distributing across the plate.
(Diagram: color scale from "Very Hot" to "Room Temp".)
LAPLACE HEAT TRANSFER
Introduction to lab code - technical
The lab simulates a very basic 2-dimensional heat transfer problem.
We have two 2-dimensional arrays, A and Anew.
The arrays represent a 2-dimensional metal plate. Each element in the array is a double value that represents a temperature.
We will simulate the distribution of heat until a minimum change value is achieved, or until we exceed a maximum number of iterations.
(Diagram: A and Anew, with all elements initialized to 0.0.)
LAPLACE HEAT TRANSFER
Introduction to lab code - technical
We initialize the top row to a temperature of 1.0.
The calcNext function will iterate through all of the inner elements of array A and update the corresponding elements in Anew.
We will take the average of the neighboring cells and record it in Anew.
For example, an inner cell just below the heated top row averages to 0.25 * (1.0 + 0.0 + 0.0 + 0.0) = 0.25.
The swap function will copy the contents of Anew to A.
(Diagram: A with its top row set to 1.0; after one calcNext step, the inner cells of the second row of Anew become 0.25.)
LAPLACE HEAT TRANSFER
Introduction to lab code
The swap function will copy the contents of Anew to A.
(Diagram: after the swap, A matches Anew: top row 1.0, second row 0.0 0.25 0.25 0.0, remaining rows 0.0.)
KEY CONCEPTS
In this module we discussed…
Compiling sequential and parallel code
CPU profiling for sequential and parallel execution
Specifics of our Laplace Heat Transfer lab code
LAB GOALS
In this lab you will do the following…
Build and run the example code using NVIDIA's HPC compiler
Use Nsight Systems to understand where the program spends its time
THANK YOU