Maximize The Memory Performance
Memory, or RAM, is an important aspect of configuring computers for high performance computing (HPC)
simulation work. The performance of the computer is affected not only by the amount of RAM but also
by the bandwidth, or transfer speed, between the processor and the memory. Bandwidth is affected by
the speed of the memory sticks, but what few people realize is that the way memory sticks are arranged
on the motherboard is even more important. CPUs are capable of accessing information on multiple
memory sticks in parallel through what are referred to as memory channels. Modern CPUs can access
between 1 and 8 memory channels simultaneously (model specific). We have found, however, that many
computers are not configured to make use of all their memory channels, and this creates a bottleneck in
their performance. All ANSYS solvers require significant memory bandwidth to sufficiently feed data to
the many CPU cores available on modern processors.
This guide will review a recent test on how memory channels can impact your simulation performance
and show how to best configure a new or existing system for maximum performance.
Intel and AMD have both worked to regularly increase the available memory
bandwidth. For example, Intel has upgraded their highly successful Xeon-EP
platform (our most commonly recommended solution) as follows:
• 3 channel DDR3-1333 in 2010
• 4 channel DDR3-1600 in 2012
• 4 channel DDR3-1866 in 2013
• 4 channel DDR4-2133 in 2014
• 4 channel DDR4-2400 in 2016
• 6 channel DDR4-2666 in 2017
The increase is substantial, but over the same period the platform has moved from a maximum of 6
cores per CPU to now 28 cores per CPU, and has also increased in both core frequency and operations per
core cycle (AVX). The total memory bandwidth available per theoretical gigaflop of performance is clearly
falling. There are also other system platforms available, such as the dual memory channel Intel consumer
CPUs (i5, i7 series, Xeon E) and the quad memory channel Intel “HEDT” platform (i7, i9 and Xeon W).
Likewise, AMD has CPU models and platforms ranging from 1 to as many as 8 memory channels.
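As a rough illustration (peak theoretical numbers, assuming the standard 64-bit channel width): three
channels of DDR3-1333 deliver about 3 x 10.7 = 32 GB/s, or roughly 5.3 GB/s for each of 6 cores, while six
channels of DDR4-2666 deliver about 6 x 21.3 = 128 GB/s, or roughly 4.6 GB/s for each of 28 cores, and
each of those newer cores can also perform several times more floating point operations per cycle.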
This system would typically be classified as having ample memory bandwidth, having only 2 CPU cores per
DDR4 memory channel. The effect of varying the number of memory channels on solve time is as follows.
Simply removing 1 stick of RAM makes the solver 9% slower; this would also be comparable to running
10-12 cores on 4 channels. Removing 2 sticks, giving 4 CPU cores per memory channel, gives a
performance degradation of 33%. Going to the extreme, and running this computer with only a single
RAM stick, shows the severe effects of CPU bandwidth starvation: the solve time is 226% of normal. This
shows how critical it is that your computer be set up with an optimal memory layout to take advantage of
the available memory channels. Buying 1 stick and “getting another one later” is not a plan for success.
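Summarized, with the fully populated 4 channel configuration as the baseline:
Channels populated    Relative solve time
4 (baseline)          100%
3                     ~109%
2                     ~133%
1                     ~226%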
Note that the 2 channel result, which is comparable to 16 cores on all 4 channels, does not necessarily
make a 16 core CPU a poor choice. The 16 core CPUs available today run at frequencies lower than
4.0 GHz and have more memory channels, and are thus comparable to the 4 channel result above in
terms of efficiency if configured properly. A minor bandwidth loss, but with many more cores, provides
better platform value than a cluster of comparable i7 machines. The performance per core will be lower,
but because a single dual socket Xeon-EP system is far simpler to manage than a cluster, it is the
recommended platform for users requiring up to 36 cores.
The testing consists of ramping up the number of CPU cores being used. The Xeon E7 system will have all
of its memory bandwidth and CPU cache available at any quantity of cores in use, and will be subdividing
those finite resources as more CPU cores are assigned. By comparison, the i7 cluster will ramp up by
adding more topologically identical nodes to the job. Thus, every time a CPU is added, an equal amount
of memory bandwidth and core cache is added with it. This should show the true scaling potential of the
solver, which will only be penalized by communication overhead and mesh overlap.
Please note that core frequency is not constant on the Quad E7 system. It is 3.4 GHz max at low core
counts, but has a base frequency of 2.2 GHz. Turbo frequency decreases as more Xeon cores are added.
The quad Xeon E7 system initially starts out faster than the single i7 machine (both using 8 cores). This is
despite the i7 having a faster core frequency (4.0 GHz vs 3.4 GHz max turbo). This is because the Xeon
system has 4 times the memory bandwidth and 3 times the core cache available; in fact, it is using 16
memory channels to feed only 8 CPU cores. By the time 16 cores are in use the systems are effectively
matched in performance, and from there on the i7 cluster is faster. Scaling is still good up to the 48 core
range on the quad Xeon system, which is effectively 3 CPU cores per memory channel, and then the
performance begins to taper off more severely. In the end, the Xeon system ends up at only 56% of the i7
cluster’s speed.
Another result of significant importance is that when adding additional nodes to a cluster, specifically ones
that are topologically identical and bring as much non-CPU resource as they do CPU power, the scaling of
the CFX solver is very impressive. CFX was able to solve this model 9.91 times faster on 12 machines than
it could on 1, and that was with substantial mesh overlap (18.6% on average) and only 13.6 thousand mesh
nodes per CPU core, which is much lower than the recommended range for good scaling performance
(30-50k). For 64 cores, where there were 20k nodes per core, results were even better at 90% scaling
efficiency.
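For reference, scaling efficiency here is simply the measured speedup divided by the factor of hardware
added: solving 9.91 times faster on 12 identical machines works out to 9.91 / 12 = roughly 83% efficiency,
so the 90% figure at 64 cores is better still.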
If you are unable to open your computer to check that the memory is balanced properly, there is a
command you can run in the Windows command prompt to check your memory layout.
wmic MEMORYCHIP get BankLabel,DeviceLocator,Capacity,ConfiguredClockSpeed
Output:
BankLabel Capacity ConfiguredClockSpeed DeviceLocator
ChannelA 8589934592 1600 ChannelA_Dimm1
ChannelA 4294967296 1600 ChannelA_Dimm2
ChannelB 8589934592 1600 ChannelB_Dimm1
ChannelB 4294967296 1600 ChannelB_Dimm2
ChannelC 8589934592 1600 ChannelC_Dimm1
ChannelC 4294967296 1600 ChannelC_Dimm2
ChannelD 8589934592 1600 ChannelD_Dimm1
ChannelD 4294967296 1600 ChannelD_Dimm2
This computer has an 8GB stick and a 4GB stick of 1600 MHz RAM in every channel, and is thus properly
balanced.
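If wmic is not available on your system (it has been deprecated in recent Windows releases), a roughly
equivalent PowerShell query against the same WMI data should return the same information; this is a
sketch assuming Windows 10 / Server 2016 or newer, where the ConfiguredClockSpeed property is exposed:
Get-CimInstance Win32_PhysicalMemory | Select-Object BankLabel,DeviceLocator,Capacity,ConfiguredClockSpeed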
Buying a New Computer
When selecting a new computer for simulation, we typically recommend a “High End Desktop” (HEDT)
platform or Xeon-EP (Xeon Scalable, i.e. Xeon Gold etc.). The first is a 4 memory channel, single CPU
platform that accommodates 6 to 18 CPU cores (12 for use with 1 ANSYS HPC pack). The second is a 6
memory channel per CPU platform, which can accommodate up to 8 CPUs, but for ANSYS is frequently
used in dual CPU, 12 total memory channel configurations targeting the 36-ish core range. In both cases
the typical ratio is about 1 memory channel per 3 CPU cores.
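For example, 12 cores fed by 4 memory channels on an HEDT system, or 36 cores fed by 12 channels on a
dual Xeon-EP system, both work out to 3 cores per channel.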
Besides choosing the right platform, it is also important to make sure all memory channels are equally
populated, even if it means buying smaller sticks. Four channels with 8 GB sticks for 32 GB total is better
than 2 channels with 16 GB sticks and 2 channels left empty.
Conclusion
Memory bandwidth has a quite noticeable impact on solver performance, even for modern machines with
optimal layout and apparently ample bandwidth. It is very important to consider and plan for an optimal
system memory layout.
Leaving memory channels unpopulated, or unevenly balanced with different RAM sticks, is not a good idea.
Get memory in matched sets that are appropriate for the platform being used (at least 1 identical stick
per channel).
One of the main reasons people see disappointing performance gains when adding more cores to
their simulations is not inefficiency in the solver code, but rather a combination of decreasing core
frequency (turbo speed) and subdivision of out-of-core resources, especially memory bandwidth. The CFX
solver demonstrated near perfect scaling at quite low mesh node counts per core, but this is frequently
not witnessed when testing on a single computer, because that machine is operating in a
resource-constrained environment.
CFX Problem Description