MCM 3 Notes
■ Load instructions that load N 32-bit words of memory, such as LDR and
LDM, take N cycles to issue, but the result of the last word loaded is not
available on the following cycle. The updated load address is available
on the next cycle. This assumes zero-wait-state memory for an uncached
system, or a cache hit for a cached system. An LDM of a single value is
exceptional, taking two cycles. If the instruction loads pc, then add two
cycles.
■ Load instructions that load 16-bit or 8-bit data such as LDRB, LDRSB,
LDRH, and LDRSH take one cycle to issue. The load result is not
available on the following two cycles. The updated load address is
available on the next cycle. This assumes zero-wait-state memory for an
uncached system, or a cache hit for a cached system.
■ Store instructions that store N values take N cycles. This assumes zero-
wait-state memory for an uncached system, or a cache hit or a write
buffer with N free entries for a cached system. An STM of a single value
is exceptional, taking two cycles.
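The timing rules above can be turned into a small cycle-count model. The C sketch below is illustrative only: the function names (load_word_cycles, store_cycles, load_use_stall_cycles) and the gap parameter are assumptions introduced here, not part of the notes. Like the rules themselves, it assumes zero-wait-state memory or cache hits, and it treats every intervening instruction as taking one cycle.

```c
#include <assert.h>

/* Issue cycles for a load of n_words 32-bit words (LDR or LDM).
   is_ldm:   nonzero if the instruction is an LDM.
   loads_pc: nonzero if pc is one of the loaded registers. */
unsigned int load_word_cycles(unsigned int n_words, int is_ldm, int loads_pc)
{
    assert(n_words > 0);
    unsigned int cycles = n_words;      /* N words take N cycles to issue */
    if (is_ldm && n_words == 1)
        cycles = 2;                     /* LDM of a single value is exceptional */
    if (loads_pc)
        cycles += 2;                    /* loading pc adds two cycles */
    return cycles;
}

/* Cycles for a store of n_values values (STR or STM). */
unsigned int store_cycles(unsigned int n_values, int is_stm)
{
    assert(n_values > 0);
    if (is_stm && n_values == 1)
        return 2;                       /* STM of a single value is exceptional */
    return n_values;
}

/* Stall cycles if the loaded value is first used after `gap` cycles of
   independent work. A word load result is unavailable for one cycle,
   a byte or halfword load result for two. */
unsigned int load_use_stall_cycles(int is_narrow, unsigned int gap)
{
    unsigned int unavailable = is_narrow ? 2u : 1u;
    return gap >= unavailable ? 0u : unavailable - gap;
}
```

Under this model, a byte load whose result is used by the very next instruction stalls for two cycles, and placing two independent single-cycle instructions between the load and its use removes the stall.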
In load scheduling by preloading, we load the data required for a loop
iteration at the end of the previous iteration, rather than at the beginning
of the current one. To get a performance improvement with little increase in
code size, we don’t unroll the loop.
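The load-scheduling method described above can be sketched in C. This is a structural illustration under stated assumptions, not the notes' own code: sum_preload is a hypothetical function, and a C compiler is free to reorder the loads, so the benefit only appears when the generated (or hand-written) assembly keeps this shape. The first element is loaded before the loop, and each iteration loads the element for the next iteration before using the current one.

```c
#include <assert.h>

/* Sum n integers, loading each element one iteration ahead of its use. */
int sum_preload(const int *data, unsigned int n)
{
    assert(n > 0);
    int sum = 0;
    int next = *data++;             /* preload the first element before the loop */
    do {
        int cur = next;             /* value loaded during the previous iteration */
        if (--n != 0)
            next = *data++;         /* load for the next iteration; skipped on the
                                       last pass so we never read past the end */
        sum += cur;                 /* use of cur does not wait on the new load */
    } while (n != 0);
    return sum;
}
```

The loop body is not unrolled, so the code size stays close to that of the straightforward loop.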
Negative Indexing
This loop structure counts from −N to 0 (inclusive or exclusive) in steps
of size STEP.
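A minimal C sketch of the negative-indexed loop, assuming a hypothetical function sum_negative_index that sums every STEP-th element of an array. The index runs from -N up to, but not including, 0 and is applied to a pointer just past the end of the data, so the loop's end test is a comparison of the index against zero.

```c
#include <assert.h>

/* Sum data[0], data[step], data[2*step], ... using an index i that counts
   from -n up to 0 (exclusive) in steps of `step`. */
long sum_negative_index(const int *data, int n, int step)
{
    assert(n > 0 && step > 0);
    const int *end = data + n;      /* one past the last element */
    long sum = 0;
    int i = -n;
    do {
        sum += end[i];              /* end[-n] is data[0] */
    } while ((i += step) < 0);      /* exclusive of 0; an inclusive loop would
                                       test <= 0 and also visit i = 0, which is
                                       out of bounds for this array example */
    return sum;
}
```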
Logarithmic Indexing
This loop structure counts down from 2^N to 1 in powers of two. For
example, if N = 4, then it counts 16, 8, 4, 2, 1.
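The logarithmic loop can be sketched in C as a shift-right-until-zero loop. The function log_index_sum is a hypothetical helper that adds up the index values visited, purely so the loop has an observable result. It assumes a 32-bit index and N below 32.

```c
#include <assert.h>
#include <stdint.h>

/* Visit i = 2^n, 2^(n-1), ..., 2, 1 and return the sum of the visited values. */
uint32_t log_index_sum(unsigned int n)
{
    assert(n < 32);
    uint32_t sum = 0;
    uint32_t i = (uint32_t)1 << n;  /* start at 2^n */
    do {
        sum += i;                   /* loop body uses i */
    } while ((i >>= 1) != 0);       /* halve i; stop once it shifts out to 0 */
    return sum;
}
```

For N = 4 the loop visits 16, 8, 4, 2, 1, so the function returns 31.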
Looping Constructs Summary
■ ARM requires two instructions to implement a counted loop: a subtract
that sets flags and a conditional branch.
■ Unroll loops to improve loop performance. Do not overunroll because
this will hurt cache performance. Unrolled loops may be inefficient for a
small number of iterations. You can test for this case and only call the
unrolled loop if the number of iterations is large.
■ Nested loops only require a single loop counter register, which can
improve efficiency by freeing up registers for other uses.
■ ARM can implement negative and logarithmic indexed loops
efficiently.
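The first three summary points can be illustrated with a short C sketch. Everything named here is an assumption made for illustration: the functions sum_countdown, sum_unrolled4, sum_dispatch, and nested_visits, the threshold of 16 iterations, and the choice to pack the outer count into the top 16 bits and the inner count into the bottom 16 bits of one counter. The sketch shows the loop shapes only; it does not claim any particular compiler output.

```c
#include <assert.h>
#include <stdint.h>

/* Counted loop: one decrement-and-test plus one branch per iteration,
   mirroring a flag-setting subtract followed by a conditional branch. */
long sum_countdown(const int *data, unsigned int n)
{
    assert(n > 0);
    long sum = 0;
    do {
        sum += *data++;
    } while (--n != 0);
    return sum;
}

/* Loop unrolled four times: the loop overhead is paid once per four elements. */
long sum_unrolled4(const int *data, unsigned int n)
{
    assert(n >= 4);
    long sum = 0;
    unsigned int blocks = n / 4;
    unsigned int rest = n % 4;
    do {
        sum += data[0];
        sum += data[1];
        sum += data[2];
        sum += data[3];
        data += 4;
    } while (--blocks != 0);
    while (rest != 0) {             /* leftover elements, at most three */
        sum += *data++;
        rest--;
    }
    return sum;
}

enum { UNROLL_THRESHOLD = 16 };     /* hypothetical cut-off, tune by measurement */

/* Only call the unrolled loop when the iteration count is large. */
long sum_dispatch(const int *data, unsigned int n)
{
    if (n == 0)
        return 0;
    if (n < UNROLL_THRESHOLD)
        return sum_countdown(data, n);
    return sum_unrolled4(data, n);
}

/* Nested loops driven by a single counter: outer count in bits 31..16,
   inner count in bits 15..0. Returns the number of inner-body executions. */
unsigned int nested_visits(unsigned int rows, unsigned int cols)
{
    assert(rows >= 1 && rows <= 0xFFFFu);
    assert(cols >= 1 && cols <= 0xFFFFu);
    uint32_t count = (uint32_t)rows << 16;
    unsigned int visits = 0;
    do {
        count += cols;                          /* reload the inner count */
        do {
            visits++;                           /* inner loop body */
        } while ((--count & 0xFFFFu) != 0);
        count -= (uint32_t)1 << 16;             /* one outer iteration done */
    } while (count != 0);
    return visits;
}
```

With 3 rows and 4 columns, nested_visits runs the inner body 12 times while using one counter variable for both loops.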