MCM 3 Notes

Uploaded by pranavcs72
Writing and Optimizing ARM Assembly Code

Writing Assembly Code

Example: This example shows how to convert a C function to an assembly
function, which is usually the first stage of assembly optimization.
Consider the simple C program main.c below, which prints the squares of
the integers from 0 to 9:
#include <stdio.h>

int square(int i);

int main(void)
{
    int i;
    for (i = 0; i < 10; i++)
    {
        printf("Square of %d is %d\n", i, square(i));
    }
}

int square(int i)
{
    return i*i;
}
Let’s see how to replace square by an assembly function that performs
the same action. Remove the C definition of square, but not the
declaration (the second line) to produce a new C file main1.c. Next add
an armasm assembler file square.s with the following contents:
        AREA |.text|, CODE, READONLY
        EXPORT square
; int square(int i)
square
        MUL  r1, r0, r0   ; r1 = r0 * r0
        MOV  r0, r1       ; r0 = r1
        MOV  pc, lr       ; return r0
        END
Profiling and Cycle Counting
The first stage of any optimization process is to identify the critical
routines and measure their current performance.
A profiler is a tool that measures the proportion of time or processing
cycles spent in each subroutine. You use a profiler to identify the most
critical routines.
A cycle counter measures the number of cycles taken by a specific
routine. You can measure your success by using a cycle counter to
benchmark a given subroutine before and after an optimization.
Instruction Scheduling
The time taken to execute instructions depends on the implementation
pipeline. Instructions that are conditional on the value of the ARM
condition codes in the cpsr take one cycle if the condition is not met. If
the condition is met, then the following rules apply:

■ ALU operations such as addition, subtraction, and logical operations
take one cycle. This includes a shift by an immediate value. If you use a
register-specified shift, then add one cycle. If the instruction writes to the
pc, then add two cycles.

■ Load instructions that load N 32-bit words of memory, such as LDR and
LDM, take N cycles to issue, but the result of the last word loaded is not
available on the following cycle. The updated load address is available
on the next cycle. This assumes zero-wait-state memory for an uncached
system, or a cache hit for a cached system. An LDM of a single value is
exceptional, taking two cycles. If the instruction loads pc, then add two
cycles.
■ Load instructions that load 16-bit or 8-bit data such as LDRB, LDRSB,
LDRH, and LDRSH take one cycle to issue. The load result is not
available on the following two cycles. The updated load address is
available on the next cycle. This assumes zero-wait-state memory for an
uncached system, or a cache hit for a cached system.

■ Branch instructions take three cycles.

■ Store instructions that store N values take N cycles. This assumes
zero-wait-state memory for an uncached system, or a cache hit or a write
buffer with N free entries for a cached system. An STM of a single value
is exceptional, taking two cycles.

■ Multiply instructions take a varying number of cycles depending on
the value of the second operand in the product.
To understand how to schedule code efficiently on the ARM, we need to
understand the ARM pipeline and dependencies. The ARM9TDMI
processor performs five operations in parallel:
■ Fetch: Fetch from memory the instruction at address pc. The
instruction is loaded into the core and then proceeds down the core
pipeline.
■ Decode: Decode the instruction that was fetched in the previous cycle.
The processor also reads the input operands from the register bank if
they are not available via one of
the forwarding paths.
■ ALU: Executes the instruction that was decoded in the previous cycle.
Normally this involves calculating the answer for a data processing
operation, or the address for a load, store, or branch operation. Some
instructions may spend several cycles in this stage. For example,
multiply and register-controlled shift operations take several ALU cycles.

Figure: ARM9TDMI pipeline executing in ARM state.

■ LS1: Load or store the data specified by a load or store instruction. If
the instruction is not a load or store, then this stage has no effect.

■ LS2: Extract and zero- or sign-extend the data loaded by a byte or
halfword load instruction. If the instruction is not a load of an 8-bit byte
or 16-bit halfword item, then this stage has no effect.
After an instruction has completed the five stages of the pipeline, the
core writes the result to the register file. Note that pc points to the
address of the instruction being fetched.
The ALU is executing the instruction that was originally fetched from
address pc − 8 in parallel with fetching the instruction at address pc.

If an instruction requires the result of a previous instruction that is not
available, then the processor stalls. This is called a pipeline hazard or
pipeline interlock.
Example: This example shows the case where there is no interlock.
ADD r0, r0, r1
ADD r0, r0, r2
This instruction pair takes two cycles. The ALU calculates r0 + r1 in one
cycle. Therefore this result is available for the ALU to calculate r0 + r2
in the second cycle.
Example: This example shows a one-cycle interlock caused by load use.
LDR r1, [r2, #4]
ADD r0, r0, r1
This instruction pair takes three cycles. The ALU calculates the address
r2 + 4 in the first cycle while decoding the ADD instruction in parallel.
However, the ADD cannot proceed on the second cycle because the load
instruction has not yet loaded the value of r1. Therefore the pipeline
stalls for one cycle while the load instruction completes the LS1 stage.
Now that r1 is ready, the processor executes the ADD in the ALU on the
third cycle.
Example: This example shows why a branch instruction takes three
cycles. The processor must flush the pipeline when jumping to a new
address.
MOV r1, #1
B case1
AND r0, r0, r1
EOR r2, r2, r3
...
case1
SUB r0, r0, r1
The three executed instructions take a total of five cycles. The MOV
instruction executes on the first cycle. On the second cycle, the branch
instruction calculates the destination address. This causes the core to
flush the pipeline and refill it using this new pc value. The refill takes
two cycles. Finally, the SUB instruction executes normally.
The figure below illustrates the pipeline state on each cycle. The
pipeline drops the two instructions following the branch when the branch
takes place.
Scheduling of load instructions
Load instructions occur frequently in compiled code, accounting for
approximately one third of all instructions. Careful scheduling of load
instructions so that pipeline stalls don’t occur can improve performance.
For example, when a byte load of a value c is immediately followed by
instructions that use it, the ARM9TDMI pipeline will stall for two cycles.
The compiler can't do any better, since everything following the load of c
depends on its value. However, there are two ways you can alter the
structure of the algorithm to avoid the stall cycles by using assembly. We
call these methods load scheduling by preloading and load scheduling by
unrolling.
Load Scheduling by Preloading

In this method of load scheduling, we load the data required for the loop
at the end of the previous loop, rather than at the beginning of the current
loop. To get performance improvement with little increase in code size,
we don’t unroll the loop.

Load Scheduling by Unrolling

This method of load scheduling works by unrolling and then interleaving
the body of the loop. For example, we can perform loop iterations i, i +
1, i + 2 interleaved. When the result of an operation from loop i is not
ready, we can perform an operation from loop i + 1 that avoids waiting
for the loop i result.
Register Allocation
You can use 14 of the 16 visible ARM registers to hold general-purpose
data. The other two registers are the stack pointer r13 and the program
counter r15. For a function to be ATPCS compliant it must preserve the
values of the callee-saved registers r4 to r11. ATPCS also specifies that
the stack should be eight-byte aligned; therefore you must preserve this
alignment if calling subroutines. Use the following template for
optimized assembly routines requiring many registers:
routine_name
        STMFD sp!, {r4-r12, lr}   ; stack saved registers
        ; body of routine
        ; the fourteen registers r0-r12 and lr are available
        LDMFD sp!, {r4-r12, pc}   ; restore registers and return
Allocating Variables to Register Numbers
When you write an assembly routine, it is best to start by using names for
the variables, rather than explicit register numbers. This allows you to
change the allocation of variables to register numbers easily. You can
even use different register names for the same physical register number
when their use doesn’t overlap. Register names increase the clarity and
readability of optimized code.
For the most part ARM operations are orthogonal with respect to register
number. In other words, specific register numbers do not have specific
roles. If you swap all occurrences of two registers Ra and Rb in a
routine, the function of the routine does not change. However, there are
several cases where the physical number of the register is important:
■ Argument registers
■ Registers used in a load or store multiple.
■ Load and store double word.
There are several possible ways we can proceed when we run out of
registers:
■ Reduce the number of registers we require by performing fewer
operations in each loop. In this case we could load four words in each
load multiple rather than eight.
■ Use the stack to store the least-used values to free up more registers. In
this case we could store the loop counter N on the stack.
■ Alter the code implementation to free up more registers.
Using More than 14 Local Variables
If you need more than 14 local 32-bit variables in a routine, then you
must store some variables on the stack. The standard procedure is to
work outwards from the innermost loop of the algorithm, since the
innermost loop has the greatest performance impact.

Making the Most of Available Registers

On a load-store architecture such as the ARM, it is more efficient to
access values held in registers than values held in memory. There are
several tricks you can use to fit several sub-32-bit variables into a
single 32-bit register, and thus reduce code size and increase
performance.
Register Allocation Summary
■ ARM has 14 available registers for general-purpose use: r0 to
r12 and r14. The stack pointer r13 and program counter r15
cannot be used for general-purpose data. Operating system
interrupts often assume that the user mode r13 points to a valid
stack, so don’t be tempted to reuse r13.
■ If you need more than 14 local variables, swap the variables out
to the stack, working outwards from the innermost loop.
■ Use register names rather than physical register numbers when
writing assembly routines. This makes it easier to reallocate
registers and to maintain the code.
■ To ease register pressure you can sometimes store multiple
values in the same register.
For example, you can store a loop counter and a shift in one
register. You can also store multiple pixels in one register.
Conditional Execution
The processor core can conditionally execute most ARM instructions.
This conditional execution is based on one of 15 condition codes. If you
don’t specify a condition, the assembler defaults to the execute always
condition (AL). The other 14 conditions split into seven pairs of
complements. The conditions depend on the four condition code flags N,
Z, C, V stored in the cpsr register.

By default, ARM instructions do not update the N, Z, C, V flags in the


ARM cpsr. For most instructions, to update these flags you append an S
suffix to the instruction mnemonic. Exceptions to this are comparison
instructions that do not write to a destination register. Their sole purpose
is to update the flags and so they don’t require the S suffix.
By combining conditional execution and conditional setting of the flags,
you can implement simple if statements without any need for branches.
This improves efficiency since branches can take many cycles and also
reduces code size.
The following C code converts an unsigned integer 0 ≤ i ≤ 15 to a
hexadecimal character c:
if (i < 10)
{
    c = i + '0';
}
else
{
    c = i + 'A' - 10;
}
We can write this in assembly using conditional execution rather than
conditional branches:
        CMP   i, #10
        ADDLO c, i, #'0'
        ADDHS c, i, #'A'-10
The sequence works since the first ADD does not change the condition
codes. The second ADD is still conditional on the result of the compare.
Conditional execution is even more powerful for cascading
conditions.
Conditional Execution Summary

■ You can implement most if statements with conditional
execution. This is more efficient than using a conditional branch.
■ You can implement if statements with the logical AND or OR of
several similar conditions using compare instructions that are
themselves conditional.
Looping Constructs
Most routines critical to performance will contain a loop. This section
describes how to implement these loops efficiently in assembly. We also
look at examples of how to unroll loops for maximum performance.
Decremented Counted Loops
For a decrementing loop of N iterations, the loop counter i counts down
from N to 1 inclusive. The loop terminates with i = 0. An efficient
implementation is
        MOV  i, N
loop
        ; loop body goes here and i = N, N-1, ..., 1
        SUBS i, i, #1
        BGT  loop
Unrolled Counted Loops
This brings us to the subject of loop unrolling. Loop unrolling reduces
the loop overhead by executing the loop body multiple times per loop
iteration. However, there are problems to overcome. What if the loop
count is not a multiple of the unroll amount? What if the loop count is
smaller than the unroll amount?

Multiple Nested Loops

How many loop counters does it take to maintain multiple nested loops?
Actually, one will suffice—or more accurately, one provided the sum of
the bits needed for each loop count does not exceed 32. We can combine
the loop counts within a single register, placing the innermost loop count
at the highest bit positions.
Other Counted Loops
You may want to use the value of a loop counter as an input to
calculations in the loop. It’s not always desirable to count down from N
to 1 or N −1 to 0. For example, you may want to select bits out of a data
register one at a time; in this case you may want a power-of-two mask
that doubles on each iteration.

Negative Indexing
This loop structure counts from −N to 0 (inclusive or exclusive) in steps
of size STEP.
Logarithmic Indexing
This loop structure counts down from 2^N to 1 in powers of two. For
example, if N = 4, then it counts 16, 8, 4, 2, 1.
Looping Constructs Summary
■ ARM requires two instructions to implement a counted loop: a subtract
that sets flags and a conditional branch.
■ Unroll loops to improve loop performance. Do not overunroll because
this will hurt cache performance. Unrolled loops may be inefficient for a
small number of iterations. You can test for this case and only call the
unrolled loop if the number of iterations is large.
■ Nested loops only require a single loop counter register, which can
improve efficiency by freeing up registers for other uses.
■ ARM can implement negative and logarithmic indexed loops
efficiently.
