Maximize The Memory Performance
Memory, or RAM, is an important aspect of configuring computers for high performance computing (HPC)
simulation work. The performance of the computer is affected not only by the amount of RAM but also
by the bandwidth, or transfer speed, between the processor and the memory. Bandwidth is affected by
the speed of the memory sticks, but what few people realize is that the way memory sticks are arranged
on the motherboard is even more important. CPUs are capable of accessing information on multiple
memory sticks in parallel through what are referred to as memory channels. Modern CPUs can access
between 1 and 8 memory channels simultaneously (model specific). We have found, however, that many
computers are not configured to make use of all their memory channels, and this creates a bottleneck in
their performance. All ANSYS solvers require significant memory bandwidth to sufficiently feed data to
the many CPU cores available on modern processors.
This guide will review a recent test on how memory channels can impact your simulation performance
and show how to best configure a new or existing system for maximum performance.
Intel and AMD have both worked to regularly increase the available memory
bandwidth. For example, Intel has upgraded their highly successful Xeon-EP
platform (our most commonly recommended solution) as follows:
• 3 channel DDR3-1333 in 2010
• 4 channel DDR3-1600 in 2012
• 4 channel DDR3-1866 in 2013
• 4 channel DDR4-2133 in 2014
• 4 channel DDR4-2400 in 2016
• 6 channel DDR4-2666 in 2017
The increase is substantial, but over the same period the platform has moved from a maximum of 6
cores per CPU to now 28 cores per CPU, and has also increased in both core frequency and operations per
core cycle (AVX). The total memory bandwidth available per theoretical gigaflop of performance is clearly
falling. There are also other system platforms available, such as the dual memory channel Intel consumer
CPUs (i5, i7 series, Xeon E) and the quad memory channel Intel “HEDT” platform (i7, i9 and Xeon W).
Likewise, AMD has CPU models and platforms ranging from 1 to as many as 8 memory channels.
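As a rough illustration (peak theoretical numbers, assuming the standard 64-bit channel width): three
channels of DDR3-1333 deliver about 3 x 10.7 = 32 GB/s, or roughly 5.3 GB/s for each of 6 cores, while six
channels of DDR4-2666 deliver about 6 x 21.3 = 128 GB/s, or roughly 4.6 GB/s for each of 28 cores, and
each of those newer cores can also perform several times more floating point operations per cycle.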
This system would typically be classified as having ample memory bandwidth, having only 2 CPU cores per
DDR4 memory channel. The effect of varying the number of memory channels on solve time is as follows.
Simply removing 1 stick of RAM makes the solver 9% slower; this would also be comparable to running
10-12 cores on 4 channels. Removing 2 sticks, giving 4 CPU cores per memory channel, gives a
performance degradation of 33%. Going to the extreme, and running this computer with only a single
RAM stick, shows the severe effects of CPU bandwidth starvation: the solve time is 226% of normal. This
shows how critical it is that your computer be set up with an optimal memory layout to take advantage of
the available memory channels. Buying 1 stick and “getting another one later” is not a plan for success.
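Summarized, with the fully populated 4 channel configuration as the baseline:
Channels populated    Relative solve time
4 (baseline)          100%
3                     ~109%
2                     ~133%
1                     ~226%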
Note that the 2 channel result, which is comparable to 16 cores on all 4 channels, does not necessarily
make a 16 core CPU a poor choice. The 16 core CPUs available today run at frequencies lower than
4.0 GHz and have more memory channels, and are thus comparable to the 4 channel result above in
terms of efficiency if configured properly. A minor bandwidth loss, but with many more cores, provides
better platform value than a cluster of comparable i7 machines. The performance per core will be lower,
but because a single dual socket Xeon-EP system is far simpler to manage than a cluster, it is the
recommended platform for users requiring up to 36 cores.
The testing consists of ramping up the number of CPU cores being used. The Xeon E7 system will have all
of its memory bandwidth and CPU cache available at any quantity of cores in use, and will be subdividing
those finite resources as more CPU cores are assigned. By comparison, the i7 cluster will ramp up by
adding more topologically identical nodes to the job. Thus, every time a CPU is added, an equal amount
of memory bandwidth and core cache is added with it. This should show the true scaling potential of the
solver, which will only be penalized by communication overhead and mesh overlap.
Please note that core frequency is not constant on the Quad E7 system. It is 3.4 GHz max at low core
counts, but has a base frequency of 2.2 GHz. Turbo frequency decreases as more Xeon cores are added.
The quad Xeon E7 system initially starts out faster than the single i7 machine (both using 8 cores). This is
despite the i7 having a faster core frequency (4.0 GHz vs 3.4 GHz max turbo). This is because the Xeon
system has 4 times the memory bandwidth and 3 times the core cache available; in fact, it is using 16
memory channels to feed only 8 CPU cores. By the time 16 cores are in use the systems are effectively
matched in performance, and from there on the i7 cluster is faster. Scaling is still good up to the 48 core
range on the quad Xeon system, which is effectively 3 CPU cores per memory channel, and then the
performance begins to taper off more severely. In the end, the Xeon system ends up at only 56% of the i7
cluster’s speed.
Another result of significant importance is that when adding additional nodes to a cluster, specifically ones
that are topologically identical and bring as much non-CPU resource as they do CPU power, the scaling of
the CFX solver is very impressive. CFX was able to solve this model 9.91 times faster on 12 machines than
it could on 1, and that was with substantial mesh overlap (18.6% on average) and only 13.6 thousand mesh
nodes per CPU core, which is much lower than the recommended range for good scaling performance
(30-50k). For 64 cores, where there were 20k nodes per core, results were even better at 90% scaling
efficiency.
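For reference, scaling efficiency here is simply the measured speedup divided by the factor of hardware
added: solving 9.91 times faster on 12 identical machines works out to 9.91 / 12 = roughly 83% efficiency,
so the 90% figure at 64 cores is better still.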
If you are unable to open your computer to check that the memory is balanced properly, there is a
command you can run in the Windows command prompt to check your memory layout.
wmic MEMORYCHIP get BankLabel,DeviceLocator,Capacity,ConfiguredClockSpeed
Output:
BankLabel Capacity ConfiguredClockSpeed DeviceLocator
ChannelA 8589934592 1600 ChannelA_Dimm1
ChannelA 4294967296 1600 ChannelA_Dimm2
ChannelB 8589934592 1600 ChannelB_Dimm1
ChannelB 4294967296 1600 ChannelB_Dimm2
ChannelC 8589934592 1600 ChannelC_Dimm1
ChannelC 4294967296 1600 ChannelC_Dimm2
ChannelD 8589934592 1600 ChannelD_Dimm1
ChannelD 4294967296 1600 ChannelD_Dimm2
This computer has an 8GB stick and a 4GB stick of 1600 MHz RAM in every channel, and is thus properly
balanced.
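If wmic is not available on your system (it has been deprecated in recent Windows releases), a roughly
equivalent PowerShell query against the same WMI data should return the same information; this is a
sketch assuming Windows 10 / Server 2016 or newer, where the ConfiguredClockSpeed property is exposed:
Get-CimInstance Win32_PhysicalMemory | Select-Object BankLabel,DeviceLocator,Capacity,ConfiguredClockSpeed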
Buying a New Computer
When selecting a new computer for simulation, we typically recommend a “High End Desktop” (HEDT)
platform or Xeon-EP (Xeon Scalable, i.e. Xeon Gold etc.). The first is a 4 memory channel, single CPU
platform that accommodates 6 to 18 CPU cores (12 for use with 1 ANSYS HPC pack). The second is a 6
memory channel per CPU platform, which can accommodate up to 8 CPUs, but for ANSYS is frequently
used in dual CPU, 12 total memory channel configurations targeting the 36-ish core range. In both cases
the typical ratio is about 1 memory channel per 3 CPU cores.
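For example, 12 cores fed by 4 memory channels on an HEDT system, or 36 cores fed by 12 channels on a
dual Xeon-EP system, both work out to 3 cores per channel.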
Besides choosing the right platform, it is also important to make sure all memory channels are equally
populated, even if it means buying smaller sticks. Four channels with 8 GB sticks for 32 GB total is better
than 2 channels with 16 GB sticks and 2 channels left empty.
Conclusion
Memory bandwidth has a quite noticeable impact on solver performance, even for modern machines with
optimal layout and apparently ample bandwidth. It is very important to consider and plan for an optimal
system memory layout.
Leaving memory channels unpopulated, or unevenly balanced with different RAM sticks, is not a good idea.
Get memory in matched sets that are appropriate for the platform being used (at least 1 identical stick
per channel).
One of the main reasons people see disappointing performance gains when adding more cores to
their simulations is not inefficiency in the solver code, but rather a combination of decreasing core
frequency (turbo speed) and subdivision of out-of-core resources, especially memory bandwidth. The CFX
solver demonstrated near perfect scaling at quite low mesh node counts per core, but this is frequently
not witnessed when testing on a single computer, because that machine is operating in a
resource-constrained environment.
CFX Problem Description