cell, a large amount of the time is spent on injecting the final small amount of charge into the cell. If there is already enough charge in the cell for the next access, the cell does not need to be fully restored. In this case, it is possible to shorten the latter part of the restoration time, creating the opportunity to shorten the corresponding timing parameters (tRAS and tWR). Third, at the end of precharging, i.e., setting the bitline to the initial voltage level (before accessing a cell) for the next access, a large amount of the time is spent on precharging the final small amount of bitline voltage difference from the initial level. When there is already enough charge in the cell to overcome the voltage difference in the bitline, the bitline does not need to be fully precharged. Thus, it is possible to shorten the final part of the precharge time, creating the opportunity to shorten the corresponding timing parameter (tRP). Based on these three observations, we understand that the timing parameters can be shortened if DRAM cells have enough charge.

1.4. Adaptive-Latency DRAM

As explained, the amount of charge in the cell right before an access to it plays a critical role in whether the correct data is retrieved from the cell. In Figure 1, we illustrate the impact of process variation using two different cells: one is a typical cell (left column) and the other is the worst-case cell, which deviates the most from the typical (right column). The worst-case cell contains less charge than the typical cell in its initial state. This is because of two reasons. First, due to its large resistance, the worst-case cell cannot allow charge to flow inside quickly. Second, due to its small capacitance, the worst-case cell cannot store much charge even when it is full. To accommodate such a worst-case cell, existing timing parameters are set to a large value.

[Figure 1: Effect of Reduced Latency: Typical vs. Worst-Case. Typical and worst-case cells (low vs. high resistance) are shown at the typical and worst-case temperatures, indicating charge left unfilled by design, charge lost to leakage, and the charge given up under reduced timing parameters.]

In Figure 1, we also illustrate the impact of temperature dependence using two cells at two different temperatures: i) a typical temperature (55°C, bottom row), and ii) the worst-case temperature (85°C, top row) supported by DRAM standards. Both typical and worst-case cells leak charge at a faster rate at the worst-case temperature. Therefore, not only does the worst-case cell have less charge to begin with, but it is also left with even less charge at the worst-case temperature because it leaks charge at a faster rate (top-right in Figure 1). To accommodate the combined effect of process variation and temperature dependence, existing timing parameters are set to a very large value. That is why the worst-case condition for correctness is specified by the top-right of Figure 1, which shows the least amount of charge stored in the worst-case cell at the worst-case temperature in its initial state. On top of this, DRAM manufacturers still add an extra latency margin even for such worst-case conditions. In other words, the amount of charge at the worst-case condition is still greater than what is required for correctness under that condition.

If we were to reduce the timing parameters, we would also be reducing the charge stored in the cells. It is important to note, however, that we propose to exploit only the additional slack (in terms of charge) compared to the worst case. This allows us to provide a reliability guarantee as strong as that of the worst case. In Figure 1, we illustrate the impact of reducing the timing parameters. The lightened portions inside the cells represent the amount of charge that we give up by using the reduced timing parameters. Note that we do not give up any charge for the worst-case cell at the worst-case temperature. Although the other three cells are not fully charged in their initial state, they are left with a similar amount of charge as the worst case (top-right). This is because these cells are capable of either holding more charge (typical cell, left column) or holding their charge longer (typical temperature, bottom row). Therefore, optimizing the timing parameters (based on the amount of existing charge slack) provides the opportunity to reduce overall DRAM latency while still maintaining the reliability guarantees provided by the DRAM manufacturers.

Based on these observations, we propose Adaptive-Latency DRAM (AL-DRAM), a mechanism that dynamically optimizes the timing parameters for different modules at different temperatures. AL-DRAM exploits the additional charge slack present in the common case compared to the worst case, thereby preserving the level of reliability (at least as high as the worst case) provided by DRAM manufacturers.
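To make this idea concrete, the following is a minimal sketch (ours, not from the paper) of the kind of temperature-indexed timing lookup a memory controller could perform for one module: each module stores reliable timing sets for a few temperature ranges, and the controller applies the most aggressive set whose temperature bound covers the current reading. The timing values in the table and the interface are illustrative assumptions; only the general approach and the 55°C/85°C operating points come from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TimingSet:
    """DRAM timing parameters in nanoseconds."""
    tRCD: float
    tRAS: float
    tWR: float
    tRP: float

# Hypothetical per-module profile: reliable timing sets, each valid up to a
# maximum operating temperature (degrees Celsius). The last entry is the
# DDR3 standard (worst-case) set, which is always safe to use.
MODULE_PROFILE = [
    (55.0, TimingSet(tRCD=10.0, tRAS=24.0, tWR=10.0, tRP=10.0)),            # aggressive set
    (85.0, TimingSet(tRCD=12.5, tRAS=30.0, tWR=12.5, tRP=12.5)),            # moderate set
    (float("inf"), TimingSet(tRCD=13.75, tRAS=35.0, tWR=15.0, tRP=13.75)),  # standard set
]

def select_timings(current_temp_c: float) -> TimingSet:
    """Pick the most aggressive timing set that is reliable at the current temperature."""
    for max_temp, timings in MODULE_PROFILE:
        if current_temp_c <= max_temp:
            return timings
    return MODULE_PROFILE[-1][1]  # fall back to the standard set

print(select_timings(45.0))  # aggressive set applies below 55 degrees C
print(select_timings(70.0))  # moderate set applies up to 85 degrees C
```

In an AL-DRAM-like system, such a table would be populated per module (e.g., from profiling results stored alongside the module's SPD data), and the controller would re-check the temperature periodically.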
1.5. DRAM Latency Profiling

We present and analyze the results of our DRAM profiling experiments, performed on our FPGA-based DRAM testing infrastructure [22, 48, 49, 50, 64, 69, 99]. Figures 2a and 2b show the results of this experiment for the read and write latency tests. The y-axis plots the sum of the relevant timing parameters (tRCD, tRAS, and tRP for the read latency test, and tRCD, tWR, and tRP for the write latency test). The solid black line shows the latency sum of the standard timing parameters (DDR3 DRAM specification). The dotted red line and the dotted blue line show the acceptable latency parameters that do not cause any errors for each DIMM at 85°C and 55°C, respectively. The solid red line and blue line show the average acceptable latency across all DIMMs.

[Figure 2: Access Latency Analysis of 115 DIMMs. (a) Read Latency; (b) Write Latency.]
We make two observations. First, even at the highest temperature, the DIMMs show significant potential for reducing DRAM access latencies. Second, we observe that at lower temperatures (e.g., 55°C), the potential for latency reduction is even greater (32.7% on average for read, and 55.1% on average for write operations), where the corresponding reductions in the timing parameters tRCD/tRAS/tWR/tRP are 17.3%/37.7%/54.8%/35.2% on average across all the DIMMs.
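As a concrete check of how these per-parameter reductions compose, the short sketch below sums the timing parameters the way the profiling plots do and applies the average reductions quoted above. The nominal DDR3-1600 values (tRCD/tRAS/tWR/tRP = 13.75/35/15/13.75 ns) are our assumption about the baseline speed bin, not taken from the report; an analogous sum (tRCD + tWR + tRP) applies to the write test.

```python
# Nominal DDR3-1600 timing parameters in nanoseconds (assumed baseline speed bin).
nominal = {"tRCD": 13.75, "tRAS": 35.0, "tWR": 15.0, "tRP": 13.75}

# Average per-parameter reductions at 55 degrees C reported in the text.
reduction = {"tRCD": 0.173, "tRAS": 0.377, "tWR": 0.548, "tRP": 0.352}

def latency_sum(params, timings):
    """Sum of the timing parameters plotted on the y-axis of the profiling figures."""
    return sum(timings[p] for p in params)

read_params = ("tRCD", "tRAS", "tRP")  # parameters exercised by the read latency test
reduced = {p: nominal[p] * (1.0 - reduction[p]) for p in nominal}

base = latency_sum(read_params, nominal)   # 62.5 ns
fast = latency_sum(read_params, reduced)   # ~42.1 ns
print(f"read latency sum reduced by {100 * (base - fast) / base:.1f}%")  # ~32.7%
```

Under these assumed baseline values, the per-parameter reductions compose to roughly the 32.7% average read-latency reduction quoted above.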
ters simultaneously. Our key observation is that reducing
1.6. Real-System Evaluation one timing parameter leads to decreasing the opportunity
We evaluate AL-DRAM on a real system that offers dynamic to reduce another timing parameter simultaneously.
software-based control over DRAM timing parameters at run- • Analysis of the Repeatability of Cell Failures. We per-
time [9, 10]. We use the minimum values of the timing param- form tests for five different scenarios to determine that
eters that do not introduce any errors at 55◦ C for any module a cell failure due to reduced latency is repeatable: same
to determine the latency reduction at 55◦ C. Thus, the latency test, test with different data patterns, test with timing-
is reduced by 27%/32%/33%/18% for tRCD/tRAS/tWR/tRP, parameter combinations, test with different temperatures,
respectively. Our full methodology is described in our HPCA and read/write test. Most of these scenarios show that a
2015 paper [64]. very high fraction (more than 95%) of the erroneous cells
Figure 3 shows the performance improvement of reducing consistently experience an error over multiple iterations of
the timing parameters in the evaluated memory system with the same test.
one rank and one memory channel at 55◦ C operating temper- • Performance Sensitivity Analyses. We analyze the im-
ature. We run a variety of different applications in two dif- pact of increasing the number of ranks and channels, exe-
ferent configurations. The first one (single-core) runs only cuting heterogeneous workloads, using different row buffer
one thread, and the second one (multi-core) runs multiple ap- policies.
plications/threads. We run each configuration 30 times (only
SPEC benchmarks are executed 3 times due to their large ex- 2. Significance
ecution times), and present the average performance improve- 2.1. Novelty
ment across all the runs and their standard deviation as an error
bar. Based on the last-level cache misses per kilo instructions To our knowledge, our HPCA 2015 paper is the first work to
(MPKI), we categorize our applications into memory-intensive i) provide a detailed qualitative and empirical analysis of the
or non-intensive groups, and report the geometric mean perfor- relationship between process variation and temperature depen-
mance improvement across all applications from each group. dence of modern DRAM devices on the one side, and DRAM
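The grouping and the reported group means can be summarized with a small sketch like the one below. The MPKI threshold and the sample numbers are hypothetical stand-ins; only the use of MPKI for categorization and of the geometric mean for each group's summary comes from the text above.

```python
from math import prod

# Hypothetical per-application results: (name, last-level-cache MPKI, speedup factor).
results = [
    ("mcf",    55.0, 1.17),
    ("libq",   30.2, 1.14),
    ("gcc",     2.1, 1.03),
    ("povray",  0.1, 1.01),
]

MPKI_THRESHOLD = 10.0  # assumed cutoff between memory-intensive and non-intensive

def geomean(values):
    """Geometric mean, the summary statistic reported for each application group."""
    return prod(values) ** (1.0 / len(values))

intensive     = [speedup for _, mpki, speedup in results if mpki >= MPKI_THRESHOLD]
non_intensive = [speedup for _, mpki, speedup in results if mpki < MPKI_THRESHOLD]

print(f"memory-intensive:     {100 * (geomean(intensive) - 1):.1f}% improvement")
print(f"memory-non-intensive: {100 * (geomean(non_intensive) - 1):.1f}% improvement")
```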
[Figure 3: Real System Performance Improvement with AL-DRAM. Per-benchmark speedups (SPEC CPU2006, STREAM, GUPS, memcached, apache, and others) with group means for the memory-intensive, non-intensive, and all-workloads categories, for single-core and multi-core configurations.]

We draw three key conclusions from Figure 3. First, AL-DRAM provides significant performance improvement over the baseline (as high as 20.5% for the very memory-bandwidth-intensive STREAM applications [78]). Second, when the memory system is under higher pressure with multi-core/multi-threaded applications, we observe significantly higher performance improvement (than in the single-core case) across all applications from our workload pool. Third, as expected, memory-intensive applications benefit more in performance than non-memory-intensive workloads (14.0% vs. 2.9% on average). We conclude that by reducing the DRAM timing parameters using AL-DRAM, we can speed up a real system by 10.5% (on average across all 35 workloads on the multi-core/multi-thread configuration).

1.7. Other Results and Analyses in Our Paper

Our HPCA 2015 paper includes more DRAM latency analyses and system performance evaluations.
• Effect of Changing the Refresh Interval on DRAM Latency. We evaluate DRAM latency at different refresh intervals. We observe that refreshing DRAM cells more frequently enables more DRAM latency reduction.
• Effect of Reducing Multiple Timing Parameters. We study the potential for reducing multiple timing parameters simultaneously. Our key observation is that reducing one timing parameter decreases the opportunity to reduce another timing parameter simultaneously.
• Analysis of the Repeatability of Cell Failures. We perform tests for five different scenarios to determine whether a cell failure due to reduced latency is repeatable: the same test, tests with different data patterns, tests with different timing-parameter combinations, tests at different temperatures, and read/write tests. Most of these scenarios show that a very high fraction (more than 95%) of the erroneous cells consistently experience an error over multiple iterations of the same test (see the sketch after this list).
• Performance Sensitivity Analyses. We analyze the impact of increasing the number of ranks and channels, executing heterogeneous workloads, and using different row buffer policies.
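As an illustration of how such a repeatability figure can be computed, the sketch below treats each test iteration as a set of failing cell addresses and reports the fraction of erroneous cells that fail in every iteration. The addresses are made up; the actual test procedure is described in the HPCA 2015 paper [64].

```python
# Hypothetical failing-cell addresses observed in three iterations of the same
# reduced-latency test (e.g., (bank, row, column) tuples flattened to integers).
iterations = [
    {0x1A2B, 0x1A2C, 0x77F0, 0x9001},
    {0x1A2B, 0x1A2C, 0x77F0},
    {0x1A2B, 0x1A2C, 0x77F0, 0x5555},
]

ever_failed   = set().union(*iterations)       # cells that failed at least once
always_failed = set.intersection(*iterations)  # cells that failed in every iteration

repeatability = len(always_failed) / len(ever_failed)
print(f"{100 * repeatability:.1f}% of erroneous cells fail consistently")
```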
2. Significance

2.1. Novelty

To our knowledge, our HPCA 2015 paper is the first work to i) provide a detailed qualitative and empirical analysis of the relationship between process variation and temperature dependence of modern DRAM devices on the one side, and DRAM access latency on the other side (we directly attribute the relationship between the two to the amount of charge in cells), ii) experimentally characterize a large number of existing DIMMs to understand the potential of reducing DRAM timing constraints, iii) provide a practical mechanism that can take advantage of this potential, and iv) evaluate the performance benefits of this mechanism by dynamically optimizing DRAM timing parameters on a real system using a variety of real workloads. We make the following major contributions.

Addressing a Critical Real Problem, High DRAM Latency, with Low Cost. High DRAM latency is a critical bottleneck for overall system performance in a variety of modern computing systems [81, 90], especially in real large-scale server systems [72]. Considering the difficulties in DRAM scaling [46, 81, 90], the problem will only get worse in future systems due to process variation. Our HPCA 2015 work leverages the heterogeneity created by DRAM process variation across DRAM chips and system operating conditions to mitigate the DRAM latency problem. We propose a practical mechanism, Adaptive-Latency DRAM, which mitigates DRAM latency with very modest hardware cost, and with no changes to the DRAM chip itself.
Low Latency DRAM Architectures. Previous works [23, 24, 42, 53, 65, 77, 82, 105, 108, 112, 129] propose new DRAM architectures that provide lower latency. These works improve DRAM latency at the cost of either significant additional DRAM chip area (i.e., extra sense amplifiers [77, 105, 112], an additional SRAM cache [42, 129]), specialized protocols [23, 53, 65, 108], or a combination of these. Our proposed mechanism requires no changes to the DRAM chip and the DRAM interface, and hence has almost negligible overhead. Furthermore, AL-DRAM is largely orthogonal to these proposed designs, and can be applied in conjunction with them, providing greater cumulative reduction in latency.

Large-Scale Latency Profiling of Modern DRAM Chips. Using our FPGA-based DRAM testing infrastructure [22, 48, 49, 50, 64, 69, 99], we profile 115 DRAM modules (862 DRAM chips in total) and show that there is significant timing variation between different DIMMs at different temperatures. We believe that our results are statistically significant enough to validate our hypothesis that the DRAM timing parameters strongly depend on the amount of cell charge. We provide a detailed characterization of each DIMM online at the SAFARI Research Group website [63]. Furthermore, we introduce our FPGA-based DRAM infrastructure and experimental methodology for DRAM profiling, which are carefully constructed to represent the worst-case conditions in power noise, bitline/wordline coupling, data patterns, and access patterns. Such information will hopefully be useful for future DRAM research.

Extensive Real System Evaluation of DRAM Latency. We evaluate our mechanism on a real system and show that it provides significant performance improvement. Reducing the timing parameters strips the excessive margin in DRAM's electrical charge. We show that the remaining margin is enough for DRAM to operate correctly. To verify the correctness of our experiments, we ran our workloads for 33 days non-stop and examined the correctness of both the workloads and the system under reduced timing parameters. Using the reduced timing parameters over the course of these 33 days, our real system was able to execute 35 different workloads in both single-core and multi-core configurations while preserving correctness and remaining error-free. Note that these results do not absolutely guarantee that no errors can be introduced by reducing the timing parameters. However, we believe that we have demonstrated a proof of concept which shows that DRAM latency can be reduced with no impact on DRAM reliability. Ultimately, the DRAM manufacturers can provide reliable timing parameters for different operating conditions and modules.

Other Methods for Lowering Memory Latency. There are many works that reduce overall memory access latency by modifying DRAM, the DRAM-controller interface, and DRAM controllers. These works enable more parallelism and bandwidth [4, 5, 23, 24, 53, 62, 108, 120, 127, 131], reduce refresh counts [48, 69, 70, 99, 119], accelerate bulk operations [24, 107, 108, 109], accelerate computation in the logic layer of 3D-stacked DRAM [2, 3, 40, 126], enable better communication between the CPU and other devices through DRAM [66], leverage DRAM access patterns [41], reduce write-related latencies by better designing DRAM and DRAM control policies [25, 58, 106], reduce overall queuing latencies in DRAM by better scheduling memory requests [11, 12, 28, 35, 43, 45, 51, 52, 59, 60, 61, 78, 79, 80, 86, 87, 92, 104, 114, 115, 116, 117, 118, 130], employ prefetching [6, 21, 26, 27, 32, 34, 36, 37, 59, 83, 84, 85, 88, 89, 91, 93, 113], employ memory/cache compression [1, 7, 8, 29, 31, 33, 94, 95, 96, 97, 111, 121, 128], or employ better caching [47, 100, 101, 110]. Our proposal is orthogonal to all of these approaches and can be applied in conjunction with them to achieve even higher latency reductions.

2.2. Potential Long-Term Impact

Tolerating High DRAM Latency by Exploiting DRAM Intrinsic Characteristics. Today, there is a large latency cliff between the on-chip last-level cache and off-chip DRAM, leading to a large performance fall-off when applications start missing in the last-level cache. By enabling lower DRAM latency, our mechanism, Adaptive-Latency DRAM, smoothens this latency cliff without adding another layer into the memory hierarchy.

Applicability to Future Memory Devices. We show the benefits of common-case timing optimization in modern DRAM devices by taking advantage of the intrinsic characteristics of DRAM. Considering that most memory devices adopt a unified specification dictated by the worst-case operating condition, our approach of optimizing device latency for the common case can be applied to other memory devices by leveraging the intrinsic characteristics of the technology they are built with. We believe there is significant potential for approaches that reduce the latency of Phase Change Memory (PCM) [30, 55, 56, 57, 75, 98, 102, 103, 123, 125], STT-MRAM [54, 68, 75], RRAM [122], and Flash memory [13, 14, 15, 16, 17, 18, 19, 20, 73, 74, 76].

New Research Opportunities. Adaptive-Latency DRAM creates new opportunities by enabling mechanisms that can leverage the heterogeneous latency it offers. We describe a few of these briefly.

Optimizing the operating conditions for faster DRAM access. Adaptive-Latency DRAM provides different access latencies at different operating conditions. Thus, optimizing the DRAM operating conditions enables faster DRAM access with Adaptive-Latency DRAM. For instance, balancing DRAM accesses over different DRAM channels and ranks reduces the DRAM operating temperature, maximizing the benefits of Adaptive-Latency DRAM. At the system level, operating the system at a constant low temperature can enable the use of AL-DRAM's lower latency more frequently.
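One illustrative way to realize such balancing, sketched below under our own assumptions (per-channel/rank access counters as a rough proxy for heating, and an allocator that can choose where a new page goes), is to steer new allocations toward the least-active devices; the actual policy space is much richer than this.

```python
from collections import defaultdict

# Recent access counts per (channel, rank), used here as a crude proxy for how
# much each device has been heated by activity. Both the proxy and the
# interface are assumptions for illustration only.
access_counts = defaultdict(int)
CHANNELS, RANKS = 2, 2

def record_access(channel: int, rank: int) -> None:
    access_counts[(channel, rank)] += 1

def pick_target_for_new_page() -> tuple[int, int]:
    """Place a newly allocated page on the least-active (coolest) channel/rank."""
    candidates = [(c, r) for c in range(CHANNELS) for r in range(RANKS)]
    return min(candidates, key=lambda cr: access_counts[cr])

for _ in range(1000):
    record_access(0, 0)            # imbalance: channel 0, rank 0 is hot
print(pick_target_for_new_page())  # -> (0, 1), the first of the idle devices
```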
Optimizing data placement for reducing overall DRAM access latency. We characterize the latency variation in different DIMMs due to process variation. Placing data based on this information and the latency criticality of data maximizes the benefits of lowering DRAM latency.
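A minimal sketch of this idea, under assumed inputs (a per-DIMM profiled latency and a per-page criticality score, neither of which comes from the report), simply maps the most latency-critical pages to the fastest profiled DIMMs:

```python
# Profiled read-latency sums per DIMM in nanoseconds (assumed values) and
# per-page latency-criticality scores (e.g., derived from miss counts).
dimm_latency_ns = {"DIMM0": 42.5, "DIMM1": 47.5, "DIMM2": 55.0}
page_criticality = {"pageA": 0.95, "pageB": 0.10, "pageC": 0.60}

# Fastest DIMMs first; most latency-critical pages first.
dimms_fast_first = sorted(dimm_latency_ns, key=dimm_latency_ns.get)
pages_hot_first = sorted(page_criticality, key=page_criticality.get, reverse=True)

# Distribute the ranked pages over the ranked DIMMs so the most
# latency-critical data lands on the lowest-latency modules.
placement = {page: dimms_fast_first[i % len(dimms_fast_first)]
             for i, page in enumerate(pages_hot_first)}
print(placement)  # {'pageA': 'DIMM0', 'pageC': 'DIMM1', 'pageB': 'DIMM2'}
```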
Error-correction mechanisms to further reduce DRAM latency. Error-correction mechanisms can fix the errors caused by lowering DRAM latency even further, leading to further reductions in DRAM latency without errors. Future research that uses error correction to enable even lower-latency DRAM is therefore promising, as it opens a new set of trade-offs.
[72] D. Lo et al. Heracles: Improving resource efficiency at scale. In ISCA, 2015.
[73] Y. Lu et al. High-Performance and Lightweight Transaction Support in Flash-Based SSDs. In IEEE TC, 2015.
[74] Y. Luo et al. WARM: Improving NAND flash memory lifetime with write-hotness aware retention management. In MSST, 2015.
[75] J. Meza et al. A Case for Small Row Buffers in Non-Volatile Main Memories. In ICCD, Poster Session, 2012.
[76] J. Meza et al. A large-scale study of flash memory failures in the field. In SIGMETRICS, 2015.
[77] Micron. RLDRAM 2 and 3 Specifications. http://www.micron.com/products/dram/rldram-memory.
[78] T. Moscibroda and O. Mutlu. Memory Performance Attacks: Denial of Memory Service in Multi-core Systems. In USENIX Security, 2007.
[79] T. Moscibroda and O. Mutlu. Distributed Order Scheduling and Its Application to Multi-core DRAM Controllers. In PODC, 2008.
[80] S. P. Muralidhara et al. Reducing memory interference in multicore systems via application-aware memory channel partitioning. In MICRO, 2011.
[81] O. Mutlu. Memory Scaling: A Systems Architecture Perspective. In IMW, 2013.
[82] O. Mutlu. Memory Scaling: A Systems Architecture Perspective. In MemCon, 2013.
[83] O. Mutlu et al. Address-value delta (AVD) prediction: increasing the effectiveness of runahead execution by exploiting regular memory allocation patterns. In MICRO, 2005.
[84] O. Mutlu et al. Techniques for efficient processing in runahead execution engines. In ISCA, 2005.
[85] O. Mutlu et al. Efficient Runahead Execution: Power-efficient Memory Latency Tolerance. In IEEE Micro, 2006.
[86] O. Mutlu and T. Moscibroda. Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors. In MICRO, 2007.
[87] O. Mutlu and T. Moscibroda. Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems. In ISCA, 2008.
[88] O. Mutlu et al. Runahead execution: an alternative to very large instruction windows for out-of-order processors. In HPCA, 2003.
[89] O. Mutlu et al. Runahead execution: An effective alternative to large instruction windows. In IEEE Micro, 2003.
[90] O. Mutlu and L. Subramanian. Research Problems and Opportunities in Memory Systems. In SUPERFRI, 2015.
[91] K. Nesbit et al. AC/DC: an adaptive data cache prefetcher. In PACT, 2004.
[92] K. J. Nesbit et al. Fair Queuing Memory Systems. In MICRO, 2006.
[93] R. H. Patterson et al. Informed Prefetching and Caching. In SOSP, 1995.
[94] G. Pekhimenko et al. Toggle-Aware Bandwidth Compression for GPUs. In HPCA, 2016.
[95] G. Pekhimenko et al. Exploiting Compressed Block Size as an Indicator of Future Reuse. In HPCA, 2015.
[96] G. Pekhimenko et al. Linearly Compressed Pages: A Low-complexity, Low-latency Main Memory Compression Framework. In MICRO, 2013.
[97] G. Pekhimenko et al. Base-Delta-Immediate Compression: A Practical Data Compression Mechanism for On-Chip Caches. In PACT, 2012.
[98] M. Qureshi et al. Enhancing lifetime and security of PCM-based main memory with start-gap wear leveling. In MICRO, 2009.
[99] M. Qureshi et al. AVATAR: A Variable-Retention-Time (VRT) Aware Refresh for DRAM Systems. In DSN, 2015.
[100] M. K. Qureshi et al. Adaptive Insertion Policies for High Performance Caching. In ISCA, 2007.
[101] M. K. Qureshi et al. A Case for MLP-Aware Cache Replacement. In ISCA, 2006.
[102] M. K. Qureshi et al. Scalable High Performance Main Memory System Using Phase-change Memory Technology. In ISCA, 2009.
[103] S. Raoux et al. Phase-change random access memory: A scalable technology. In IBM Journal of Research and Development, 2008.
[104] S. Rixner et al. Memory Access Scheduling. In ISCA, 2000.
[105] Y. Sato et al. Fast Cycle RAM (FCRAM); a 20-ns random row access, pipe-lined operating DRAM. In Symposium on VLSI Circuits, 1998.
[106] V. Seshadri et al. The Dirty-Block Index. In ISCA, 2014.
[107] V. Seshadri et al. Fast Bulk Bitwise AND and OR in DRAM. In IEEE CAL, 2015.
[108] V. Seshadri et al. RowClone: Fast and Energy-efficient in-DRAM Bulk Data Copy and Initialization. In MICRO, 2013.
[109] V. Seshadri et al. Gather-Scatter DRAM: In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses. In MICRO, 2015.
[110] V. Seshadri et al. The Evicted-Address Filter: A Unified Mechanism to Address Both Cache Pollution and Thrashing. In PACT, 2012.
[111] A. Shafiee et al. MemZip: Exploring Unconventional Benefits from Memory Compression. In HPCA, 2014.
[112] Y. H. Son et al. Reducing Memory Access Latency with Asymmetric DRAM Bank Organizations. In ISCA, 2013.
[113] S. Srinath et al. Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers. In HPCA, 2007.
[114] L. Subramanian et al. The Blacklisting Memory Scheduler: Achieving high performance and fairness at low cost. In ICCD, 2014.
[115] L. Subramanian et al. The Blacklisting Memory Scheduler: Balancing Performance, Fairness and Complexity. In TPDS, 2016.
[116] L. Subramanian et al. The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory. In MICRO, 2015.
[117] L. Subramanian et al. MISE: Providing Performance Predictability and Improving Fairness in Shared Main Memory Systems. In HPCA, 2013.
[118] H. Usui et al. DASH: Deadline-Aware High-Performance Memory Scheduler for Heterogeneous Systems with Hardware Accelerators. In ACM TACO, 2016.
[119] R. Venkatesan et al. Retention-Aware Placement in DRAM (RAPID): Software Methods for Quasi-Non-Volatile DRAM. In HPCA, 2006.
[120] F. Ware and C. Hampel. Improving Power and Data Efficiency with Threaded Memory Modules. In ICCD, 2006.
[121] P. R. Wilson et al. The Case for Compressed Caching in Virtual Memory Systems. In ATEC, 1999.
[122] H.-S. Wong et al. Metal Oxide RRAM. In Proceedings of the IEEE, 2012.
[123] H.-S. Wong et al. Phase Change Memory. In Proceedings of the IEEE, 2010.
[124] D. Yaney et al. A meta-stable leakage phenomenon in DRAM charge storage - Variable hold time. In IEDM, 1987.
[125] H. Yoon et al. Efficient Data Mapping and Buffering Techniques for Multilevel Cell Phase-Change Memories. In ACM TACO, 2014.
[126] D. Zhang et al. TOP-PIM: Throughput-oriented Programmable Processing in Memory. In HPCA, 2014.
[127] T. Zhang et al. Half-DRAM: A high-bandwidth and low-power DRAM architecture from the rethinking of fine-grained activation. In ISCA, 2014.
[128] Y. Zhang et al. Frequent value locality and value-centric data cache design. In ASPLOS, 2000.
[129] Z. Zhang et al. Cached DRAM for ILP Processor Memory Access Latency Reduction. In IEEE Micro, 2001.
[130] J. Zhao et al. FIRM: Fair and High-Performance Memory Control for Persistent Memory Systems. In MICRO, 2014.
[131] H. Zheng et al. Mini-rank: Adaptive DRAM architecture for improving memory power efficiency. In MICRO, 2008.