The Apex-MAP Benchmark

Dr. Volker Weinberg
Leibniz-Rechenzentrum der Bayerischen Akademie der Wissenschaften
volker.weinberg@lrz.de

PRACE Workshop “New Languages & Future Technology Prototypes”, LRZ, 1.-2. March 2010

Outline

1 The Apex-MAP Benchmark
2 Modelling LRZ’s Application Mix
3 Validation of Apex-MAP: Simulation of 2 Mathematical Kernels on 2 different architectures
4 Summary, References and Acknowledgements

The Apex-MAP Benchmark

Goal: A simple synthetic benchmark with tunable, hardware-independent parameters that can model the scientific application mix of the HLRB-II would be very useful for evaluating new hardware platforms.

Apex-MAP: Application Performance Characterisation Project (Apex) Memory Access Probe (MAP)
https://ftg.lbl.gov/ApeX/ApeX.shtml

- Developed by E. Strohmaier & H. Shan, Future Technology Group (FTG), Lawrence Berkeley National Lab (LBNL), California. Paper: https://ftg.lbl.gov/ApeX/mascots.pdf
- Initial assumption: Performance behaviour can be characterised by a small set of code-specific and architecture-independent performance factors.
- Simulates typical memory access patterns and computational intensities of scientific applications.

Volker Weinberg, LRZ | LRZ · 2.3.2010 | The Apex-MAP Benchmark

Main Parameters of the Apex-MAP Benchmark

- memory access patterns: random access pattern, strided access pattern
  - random access: indexed access using an index buffer ind[i] whose length is set by the parameter I
  - strided access: access data with stride S
- M: total size of the allocated memory block in which data accesses are simulated
- L: number of contiguous memory locations (sub-blocks of length L < M) accessed in succession starting at ind[i] → Spatial Locality
- α: shape parameter of the power distribution function (0 ≤ α ≤ 1); describes the temporal reuse of data; α determines the random starting address ind[i] → Temporal Locality
- C: computational intensity: own subroutine that runs with peak performance on Itanium

Temporal Locality

    ind[j] = (L*pow(drand48(), 1/α) * (M/L -1)) ∈ [0;M-L[ ,  α ∈ [0;1]

- α → 0: program accesses a single address repeatedly
- α = 1: uniform random memory access

[Figure: distribution P(x) of x = pow(drand48(), 1/α) ∈ [0;1[ for α = 0.0010, 0.0025, 0.0050, 0.0100, 0.0250, 0.0500, 0.1000, 0.2500, 0.5000, 1.0000]

Apex-MAP Kernel Routine: Memory Access

random access memory pattern

    double *data = (double *) malloc(M*sizeof(double));
    for (int i = 0; i < I; i++) {        /* I random starting addresses */
      for (int k = 0; k < L; k++) {      /* L contiguous locations each */
        W0 += c0*data[ind[i]+k];
        W0 += compute(C);
      }
    }

strided access memory pattern

    for (int k = 0; k < M/S; k+=1) {     /* every S-th element of data */
      W0 += c0*data[k*S];
      W0 += compute(C);
    }

Apex-MAP Kernel Routine: Computational Intensity

    double compute(int C){
      double s0,s1,s2,s3,s4,s5,s6,s7;
      s0=s1=s2=s3=s4=s5=s6=s7=0.;
      for(int i=1;i<=C;i++){
        dummy(&s0,&s1,&s2,&s3,&s4,&s5,&s6,&s7);
        /* 8 x 8 = 64 multiply-adds (128 floating point operations) per iteration */
        s0+=(x[0]*y[0])+(x[0]*y[1])+(x[0]*y[2])+(x[0]*y[3])+
            (x[0]*y[4])+(x[0]*y[5])+(x[0]*y[6])+(x[0]*y[7]);
        s1+=(x[1]*y[0])+(x[1]*y[1])+(x[1]*y[2])+(x[1]*y[3])+
            (x[1]*y[4])+(x[1]*y[5])+(x[1]*y[6])+(x[1]*y[7]);
        ...
        s7+=(x[7]*y[0])+(x[7]*y[1])+(x[7]*y[2])+(x[7]*y[3])+
            (x[7]*y[4])+(x[7]*y[5])+(x[7]*y[6])+(x[7]*y[7]);
      }
      return s0+s1+s2+s3+s4+s5+s6+s7;
    }

Apex-MAP Kernel Routine: Assembler Code for -O2

Intel(R) C++ Compiler for IA-64, Ver. 10.1: icc -S -O2 compute.c
Itanium: 128 floating point registers

- 2 x ld8 base address of x[] and y[]
- START LOOP
- 8 x stfd for s0,...,s7
- dummy() call
- 8 x ldfd for s0,...,s7 (8 floating point registers)
- 8 x ldfd for x[0],...,x[7] (8 floating point registers)
- 8 x ldfd for y[0],...,y[7] (8 floating point registers)
- 64 x consecutive bundles with 1 fma per bundle

      { .mfi
        nop.m 0
        fma.d f15=f39,f47,f12
        nop.i 0
      }

  e.g. for s0:

      f15=f39*f47+f12    f15=x[0]*y[7]+s0
      f07=f39*f38+f15    f07=x[0]*y[6]+x[0]*y[7]+s0
      f09=f39*f35+f07    f09=x[0]*y[5]+x[0]*y[6]+x[0]*y[7]+s0
      f11=f39*f37+f09    f11=x[0]*y[4]+x[0]*y[5]+x[0]*y[6]+x[0]*y[7]+s0
      f13=f39*f34+f11    f13=x[0]*y[3]+x[0]*y[4]+x[0]*y[5]+x[0]*y[6]+x[0]*y[7]+s0
      f15=f39*f36+f13    f15=x[0]*y[2]+x[0]*y[3]+x[0]*y[4]+x[0]*y[5]+x[0]*y[6]+x[0]*y[7]+s0
      f07=f39*f33+f15    f07=x[0]*y[1]+x[0]*y[2]+x[0]*y[3]+x[0]*y[4]+x[0]*y[5]+x[0]*y[6]+x[0]*y[7]+s0
      f13=f39*f32+f07    f13=x[0]*y[0]+x[0]*y[1]+x[0]*y[2]+x[0]*y[3]+x[0]*y[4]+x[0]*y[5]+x[0]*y[6]+x[0]*y[7]+s0

- STOP LOOP
- 8 x stfd for s0,...,s7
- 7 x fadd to compute s=s0+s1+s2+s3+s4+s5+s6+s7 (fadd = fma with f?=f?,f1,f?, a pure addition since f1=1.)

Modelling of HLRB-II Performance using Apex-MAP

Performance Counters (on Itanium):

- Number of floating point operations per cycle: (FP_OPS_RETIRED / CPU_OP_CYCLES_ALL)
- Memory bandwidth, expressed by L3 misses in Bytes/cycle: (L3_MISSES / CPU_OP_CYCLES_ALL × L3 cacheline size)

    Processor                    Intel Itanium2 Montecito    Nehalem-EP
    Clock rate                   1.6 GHz                     2.53 GHz
    max. Flops per cycle         4                           4
    peak performance per core    6.4 GFlop/s                 10.12 GFlop/s
    L3 cache                     9 MB                        8 MB
    L3 cacheline size            128 Bytes                   64 Bytes
    Bandwidth to memory          8.5 GBytes/s                25.6 GBytes/s
                                 5.3 Bytes/cycle             10.1 Bytes/cycle

Modelling of HLRB-II Performance using Apex-MAP

[Figure: number of floating point operations per cycle (FP_OPS_RETIRED / CPU_OP_CYCLES_ALL) versus the memory bandwidth, expressed by L3 misses in Bytes/cycle (L3_MISSES / CPU_OP_CYCLES_ALL × 128 Bytes)]

- real applications (≈ 3 days, sampling every 10 min): 3254784 samples
- Apex-MAP (various Apex-MAP parameters): 23226 samples
- Apex covers 98.9%

Modelling of HLRB-II Performance using Apex-MAP

[Figures: strided access memory pattern; random access memory patterns]

Chosen Data Points to Model the Application Mix

- Overall mean Apex-MAP performance on HLRB-II: 0.898 GFlop/s per core
- Measured HLRB-II performance (3-day interval): 0.48 Flop/cycle × 1.6 GHz = 0.768 GFlop/s

Validation of Apex-MAP

Mathematical kernels used for the validation:

- mod2am: Dense matrix-matrix multiplication
- mod2as: Sparse matrix-vector multiplication

Validating Apex-MAP with the two mathematical kernels takes several steps:

1 Measure the performance of mod2am/as on the original hardware (HLRB II).
2 Measure the hardware counters for mod2am/as on HLRB II.
3 Generate weights for each square and each kernel.
4 Measure the performance of mod2am/as on the target hardware (Nehalem EP).
5 Run Apex-MAP with the weights for mod2am/as on Nehalem and HLRB II.
6 Compare the predicted results (Step 5) with the actual results (Steps 1 and 4).

Validation of Apex-MAP: Measurements on HLRB-II

[Figures: mod2am; mod2as]

Validation of Apex-MAP: Predictions on HLRB-II

[Figures: mod2am; mod2as]

measured mean perf. of mod2am/as:

- mod2am: 5.4 GFlop/s (84% peak)
- mod2as: 0.5 GFlop/s (8% peak)

Validation of Apex-MAP: Predictions on Nehalem-EP

[Figures: mod2am; mod2as]

measured mean perf. of mod2am/as:

- mod2am: 8 GFlop/s (80% peak)
- mod2as: 0.9 GFlop/s (9% peak)

Apex-MAP Kernel Routine: Performance on Nehalem-EP

[Figure: % of Peak versus C (10 to 1e+07, logarithmic axis) for icc -xSSE4.2 -O0, -O1, -O2 and -O3; Intel C++ Compiler for Intel 64, Version 11.1]

SSE2/3 Floating Point Instructions

- x86-64: 16 x 128 bit XMM registers %xmm0, ..., %xmm15 → 32 doubles
- SSE2 Packed Double Precision Data Type
- SSE2 Vertical Operations addpd / mulpd
- SSE3 Horizontal Addition haddpd

Apex-MAP Kernel Routine: Assembler Code for -O2

Intel C++ Compiler for Intel 64, Version 11.1: icc -S -O2 -xSSE4.2 compute.c
x86-64: 16 x 128 bit XMM registers %xmm0, ..., %xmm15 → 32 doubles

- START LOOP
- dummy() call
- 4 x movaps for y[0],...,y[7]

      movaps y(%rip), %xmm9       %xmm9= y[0] | y[1]
      movaps 16+y(%rip), %xmm8    %xmm8= y[2] | y[3]
      movaps 32+y(%rip), %xmm7    %xmm7= y[4] | y[5]
      movaps 48+y(%rip), %xmm0    %xmm0= y[6] | y[7]

- for each of s0,...,s7

      movsd (%rsp), %xmm6         %xmm6= s0 | 0.
      movddup x(%rip), %xmm3      %xmm3= x[0] | x[0]
      movaps %xmm3, %xmm5         %xmm5= x[0] | x[0] (not for s7)
      movaps %xmm3, %xmm2         %xmm2= x[0] | x[0] (not for s7)
      movaps %xmm3, %xmm4         %xmm4= x[0] | x[0] (not for s7)
      mulpd %xmm9, %xmm5          %xmm5= x[0]*y[0] | x[0]*y[1]
      addpd %xmm5, %xmm6          %xmm6= s0+x[0]*y[0] | x[0]*y[1]
      mulpd %xmm8, %xmm2          %xmm2= x[0]*y[2] | x[0]*y[3]
      mulpd %xmm7, %xmm4          %xmm4= x[0]*y[4] | x[0]*y[5]
      addpd %xmm4, %xmm6          %xmm6= s0+x[0]*y[0]+x[0]*y[4] | x[0]*y[1]+x[0]*y[5]
      mulpd %xmm0, %xmm3          %xmm3= x[0]*y[6] | x[0]*y[7]
      addpd %xmm3, %xmm2          %xmm2= x[0]*y[2]+x[0]*y[6] | x[0]*y[3]+x[0]*y[7]
      addpd %xmm2, %xmm6          %xmm6= s0+x[0]*y[0]+x[0]*y[4]+x[0]*y[2]+x[0]*y[6] | x[0]*y[1]+x[0]*y[5]+x[0]*y[3]+x[0]*y[7]
      haddpd %xmm6, %xmm6         %xmm6= s0+x[0]*y[0]+x[0]*y[4]+x[0]*y[2]+x[0]*y[6]+x[0]*y[1]+x[0]*y[5]+x[0]*y[3]+x[0]*y[7] | same

  --> 32 x addpd (SSE2), 32 x mulpd (SSE2), 8 x haddpd (SSE3), 8 x movsd (SSE2) to load s0,...,s7, 8 x movddup (SSE3) to load x[i] | x[i], 7x3 x movaps (SSE) to copy x[i] | x[i]

- STOP LOOP
- 7 x addsd (SSE2) to compute s=s0+s1+s2+s3+s4+s5+s6+s7

Summary

- The Apex-MAP benchmark can be used to model our application mix on HLRB-II → Apex-MAP is well suited for benchmarking the suitability of new hardware architectures, e.g. multi- and manycore CPUs, systems with 10000+ processors, systems with accelerators (GPGPUs, CELL processors, ...).
- It has to be assured that the 128 floating point operations in the loop body are really executed and not cancelled by optimisations of the compiler.
- Future work: Investigation of the quality of predictions; refine the benchmark (kernel routine) so that it adapts easily to new environments.
- Vision: Implementing Apex-MAP in a language that supports multi-core CPUs, GPGPUs and the CELL processor (→ currently only RapidMind) could offer an easy way to simulate typical application performance patterns on a broad range of architectures.

References

- E. Strohmaier, H. Shan: Architecture Independent Performance Characterization and Benchmarking for Scientific Applications. Volendam, The Netherlands, Oct. 2004. https://ftg.lbl.gov/ApeX/mascots.pdf
- R. Patra, M. Brehm, R. Bader, R. Ebner, S. Haupt: Performance Monitoring – A Generic Approach (LRZ-Bericht 2006-06). http://www.lrz-muenchen.de/wir/berichte/TB/LRZ-Bericht-2006-06.pdf
- Volker Weinberg, Matthias Brehm, Iris Christadler: OMI4papps: Optimisation, Modelling and Implementation for Highly Parallel Applications. http://arxiv.org/abs/1001.1860

Acknowledgements

- Matthias Brehm, Iris Christadler, LRZ, Germany
- KONWIHR-II project OMI4papps: Optimisation, Modelling and Implementation for Highly Parallel Applications
- PRACE project funded in part by the EU’s 7th Framework Programme (FP7/2007-2013) under grant agreement no. RI-211528
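
Supplementary note: the derived figures quoted in the slides (peak performance per core, Bytes/cycle, Flop/cycle, % of peak) follow from simple ratios. The C sketch below reproduces that arithmetic so the numbers can be cross-checked. All function names are hypothetical, and the counter arguments simply stand for raw values of the Itanium counters named in the slides.

```c
#include <assert.h>

/* Peak performance per core in GFlop/s: clock rate (GHz) x max. Flops per cycle. */
double peak_gflops(double clock_ghz, double max_flops_per_cycle)
{
    assert(clock_ghz > 0.0);
    return clock_ghz * max_flops_per_cycle;
}

/* Memory bandwidth in Bytes/cycle: bandwidth (GBytes/s) / clock rate (GHz). */
double bytes_per_cycle(double bandwidth_gbytes_s, double clock_ghz)
{
    assert(clock_ghz > 0.0);
    return bandwidth_gbytes_s / clock_ghz;
}

/* Flop/cycle from two raw counter values (floating point operations retired
 * and elapsed cycles). */
double flops_per_cycle(unsigned long long fp_ops_retired,
                       unsigned long long cycles)
{
    assert(cycles > 0);
    return (double)fp_ops_retired / (double)cycles;
}

/* L3 miss traffic in Bytes/cycle: misses / cycles x cacheline size. */
double l3_bytes_per_cycle(unsigned long long l3_misses,
                          unsigned long long cycles,
                          unsigned cacheline_bytes)
{
    assert(cycles > 0);
    return (double)l3_misses / (double)cycles * (double)cacheline_bytes;
}

/* Measured performance as a percentage of peak. */
double percent_of_peak(double measured_gflops, double peak)
{
    assert(peak > 0.0);
    return 100.0 * measured_gflops / peak;
}
```

With the table values, peak_gflops(1.6, 4) gives the 6.4 GFlop/s of the Itanium2 Montecito, bytes_per_cycle(8.5, 1.6) gives about 5.3 Bytes/cycle, and percent_of_peak(5.4, 6.4) gives about 84%, matching the mod2am result on HLRB-II.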