MCM 3 Notes

Uploaded by pranavcs72
Writing and Optimizing ARM Assembly Code

Writing Assembly Code

Example: This example shows how to convert a C function to an assembly
function, which is usually the first stage of assembly optimization.
Consider the simple C program main.c below, which prints the squares of
the integers from 0 to 9:
#include <stdio.h>

int square(int i);

int main(void)
{
    int i;
    for (i = 0; i < 10; i++)
    {
        printf("Square of %d is %d\n", i, square(i));
    }
}

int square(int i)
{
    return i*i;
}
Let’s see how to replace square by an assembly function that performs
the same action. Remove the C definition of square, but not the
declaration (the second line) to produce a new C file main1.c. Next add
an armasm assembler file square.s with the following contents:
        AREA |.text|, CODE, READONLY
        EXPORT square
; int square(int i)
square
        MUL  r1, r0, r0   ; r1 = r0 * r0
        MOV  r0, r1       ; r0 = r1
        MOV  pc, lr       ; return r0
        END
Profiling and Cycle Counting
The first stage of any optimization process is to identify the critical
routines and measure their current performance.
A profiler is a tool that measures the proportion of time or processing
cycles spent in each subroutine. You use a profiler to identify the most
critical routines.
A cycle counter measures the number of cycles taken by a specific
routine. You can measure your success by using a cycle counter to
benchmark a given subroutine before and after an optimization.
Instruction Scheduling
The time taken to execute instructions depends on the implementation
pipeline. Instructions that are conditional on the value of the ARM
condition codes in the cpsr take one cycle if the condition is not met. If
the condition is met, then the following rules apply:

■ ALU operations such as addition, subtraction, and logical operations
take one cycle. This includes a shift by an immediate value. If you use a
register-specified shift, then add one cycle. If the instruction writes to the
pc, then add two cycles.

■ Load instructions that load N 32-bit words of memory, such as LDR and
LDM, take N cycles to issue, but the result of the last word loaded is not
available on the following cycle. The updated load address is available
on the next cycle. This assumes zero-wait-state memory for an uncached
system, or a cache hit for a cached system. An LDM of a single value is
exceptional, taking two cycles. If the instruction loads pc, then add two
cycles.
■ Load instructions that load 16-bit or 8-bit data such as LDRB, LDRSB,
LDRH, and LDRSH take one cycle to issue. The load result is not
available on the following two cycles. The updated load address is
available on the next cycle. This assumes zero-wait-state memory for an
uncached system, or a cache hit for a cached system.

■ Branch instructions take three cycles.

■ Store instructions that store N values take N cycles. This assumes
zero-wait-state memory for an uncached system, or a cache hit or a write
buffer with N free entries for a cached system. An STM of a single value
is exceptional, taking two cycles.

■ Multiply instructions take a varying number of cycles depending on
the value of the second operand in the product.
To understand how to schedule code efficiently on the ARM, we need to
understand the ARM pipeline and dependencies. The ARM9TDMI
processor performs five operations in parallel:
■ Fetch: Fetch from memory the instruction at address pc. The
instruction is loaded into the core and then proceeds down the core
pipeline.
■ Decode: Decode the instruction that was fetched in the previous cycle.
The processor also reads the input operands from the register bank if
they are not available via one of
the forwarding paths.
■ ALU: Executes the instruction that was decoded in the previous cycle.
Normally this involves calculating the answer for a data processing
operation, or the address for a load, store, or branch operation. Some
instructions may spend several cycles in this stage. For example,
multiply and register-controlled shift operations take several ALU cycles.

Figure: ARM9TDMI pipeline executing in ARM state.

■ LS1: Load or store the data specified by a load or store instruction. If
the instruction is not a load or store, then this stage has no effect.

■ LS2: Extract and zero- or sign-extend the data loaded by a byte or
halfword load instruction. If the instruction is not a load of an 8-bit byte
or 16-bit halfword item, then this stage has no effect.
After an instruction has completed the five stages of the pipeline, the
core writes the result to the register file. Note that pc points to the
address of the instruction being fetched.
The ALU is executing the instruction that was originally fetched from
address pc − 8 in parallel with fetching the instruction at address pc.

If an instruction requires the result of a previous instruction that is not
available, then the processor stalls. This is called a pipeline hazard or
pipeline interlock.
Example: This example shows the case where there is no interlock.
ADD r0, r0, r1
ADD r0, r0, r2
This instruction pair takes two cycles. The ALU calculates r0 + r1 in one
cycle. Therefore this result is available for the ALU to calculate r0 + r2
in the second cycle.
Example: This example shows a one-cycle interlock caused by load use.
LDR r1, [r2, #4]
ADD r0, r0, r1
This instruction pair takes three cycles. The ALU calculates the address
r2 + 4 in the first cycle while decoding the ADD instruction in parallel.
However, the ADD cannot proceed on the second cycle because the load
instruction has not yet loaded the value of r1. Therefore the pipeline
stalls for one cycle while the load instruction completes the LS1 stage.
Now that r1 is ready, the processor executes the ADD in the ALU on the
third cycle.
Example: This example shows why a branch instruction takes three
cycles. The processor must flush the pipeline when jumping to a new
address.
MOV r1, #1
B case1
AND r0, r0, r1
EOR r2, r2, r3
...
case1
SUB r0, r0, r1
The three executed instructions take a total of five cycles. The MOV
instruction executes on the first cycle. On the second cycle, the branch
instruction calculates the destination address. This causes the core to
flush the pipeline and refill it using this new pc value. The refill takes
two cycles. Finally, the SUB instruction executes normally.
The figure below illustrates the pipeline state on each cycle. The
pipeline drops the two instructions following the branch when the branch
takes place.
Scheduling of load instructions
Load instructions occur frequently in compiled code, accounting for
approximately one third of all instructions. Careful scheduling of load
instructions so that pipeline stalls don’t occur can improve performance.
For example, when a byte load of a value c is immediately followed by
instructions that use it, the ARM9TDMI pipeline will stall for two cycles.
The compiler can't do any better, since everything following the load of c
depends on its value. However, there are two ways you can alter the
structure of the algorithm to avoid the stall cycles by using assembly. We
call these methods load scheduling by preloading and load scheduling by
unrolling.
Load Scheduling by Preloading

In this method of load scheduling, we load the data required for the loop
at the end of the previous loop, rather than at the beginning of the current
loop. To get performance improvement with little increase in code size,
we don’t unroll the loop.

Load Scheduling by Unrolling

This method of load scheduling works by unrolling and then interleaving
the body of the loop. For example, we can perform loop iterations i, i +
1, i + 2 interleaved. When the result of an operation from loop i is not
ready, we can perform an operation from loop i + 1 that avoids waiting
for the loop i result.
Register Allocation
You can use 14 of the 16 visible ARM registers to hold general-purpose
data. The other two registers are the stack pointer r13 and the program
counter r15. For a function to be ATPCS compliant it must preserve the
values of the callee-saved registers r4 to r11. ATPCS also specifies that
the stack should be eight-byte aligned; therefore you must preserve this
alignment if calling subroutines. Use the following template for
optimized assembly routines requiring many registers:
routine_name
        STMFD sp!, {r4-r12, lr}   ; stack saved registers
        ; body of routine
        ; the fourteen registers r0-r12 and lr are available
        LDMFD sp!, {r4-r12, pc}   ; restore registers and return
Allocating Variables to Register Numbers
When you write an assembly routine, it is best to start by using names for
the variables, rather than explicit register numbers. This allows you to
change the allocation of variables to register numbers easily. You can
even use different register names for the same physical register number
when their use doesn’t overlap. Register names increase the clarity and
readability of optimized code.
For the most part ARM operations are orthogonal with respect to register
number. In other words, specific register numbers do not have specific
roles. If you swap all occurrences of two registers Ra and Rb in a
routine, the function of the routine does not change. However, there are
several cases where the physical number of the register is important:
■ Argument registers
■ Registers used in a load or store multiple.
■ Load and store double word.
There are several possible ways we can proceed when we run out of
registers:
■ Reduce the number of registers we require by performing fewer
operations in each loop. In this case we could load four words in each
load multiple rather than eight.
■ Use the stack to store the least-used values to free up more registers. In
this case we could store the loop counter N on the stack.
■ Alter the code implementation to free up more registers.
Using More than 14 Local Variables
If you need more than 14 local 32-bit variables in a routine, then you
must store some variables on the stack. The standard procedure is to
work outwards from the innermost loop of the algorithm, since the
innermost loop has the greatest performance impact.

Making the Most of Available Registers

On a load-store architecture such as the ARM, it is more efficient to
access values held in registers than values held in memory. There are
several tricks you can use to fit several sub-32-bit variables into a
single 32-bit register, and thus reduce code size and increase
performance.
Register Allocation Summary
■ ARM has 14 available registers for general-purpose use: r0 to
r12 and r14. The stack pointer r13 and program counter r15
cannot be used for general-purpose data. Operating system
interrupts often assume that the user mode r13 points to a valid
stack, so don’t be tempted to reuse r13.
■ If you need more than 14 local variables, swap the variables out
to the stack, working outwards from the innermost loop.
■ Use register names rather than physical register numbers when
writing assembly routines. This makes it easier to reallocate
registers and to maintain the code.
■ To ease register pressure you can sometimes store multiple
values in the same register.
For example, you can store a loop counter and a shift in one
register. You can also store multiple pixels in one register.
Conditional Execution
The processor core can conditionally execute most ARM instructions.
This conditional execution is based on one of 15 condition codes. If you
don’t specify a condition, the assembler defaults to the execute always
condition (AL). The other 14 conditions split into seven pairs of
complements. The conditions depend on the four condition code flags N,
Z, C, V stored in the cpsr register.

By default, ARM instructions do not update the N, Z, C, V flags in the


ARM cpsr. For most instructions, to update these flags you append an S
suffix to the instruction mnemonic. Exceptions to this are comparison
instructions that do not write to a destination register. Their sole purpose
is to update the flags and so they don’t require the S suffix.
By combining conditional execution and conditional setting of the flags,
you can implement simple if statements without any need for branches.
This improves efficiency since branches can take many cycles and also
reduces code size.
The following C code converts an unsigned integer 0 ≤ i ≤ 15 to a
hexadecimal character c:
if (i < 10)
{
    c = i + '0';
}
else
{
    c = i + 'A' - 10;
}
We can write this in assembly using conditional execution rather than
conditional branches:
        CMP   i, #10
        ADDLO c, i, #'0'
        ADDHS c, i, #'A'-10
The sequence works since the first ADD does not change the condition
codes. The second ADD is still conditional on the result of the compare.
Conditional execution is even more powerful for cascading
conditions.
Conditional Execution Summary

■ You can implement most if statements with conditional
execution. This is more efficient than using a conditional branch.
■ You can implement if statements with the logical AND or OR of
several similar conditions using compare instructions that are
themselves conditional.
Looping Constructs
Most routines critical to performance will contain a loop. This section
describes how to implement these loops efficiently in assembly. We also
look at examples of how to unroll loops for maximum performance.
Decremented Counted Loops
For a decrementing loop of N iterations, the loop counter i counts down
from N to 1 inclusive. The loop terminates with i = 0. An efficient
implementation is
        MOV  i, N
loop
        ; loop body goes here and i = N, N-1, ..., 1
        SUBS i, i, #1
        BGT  loop
Unrolled Counted Loops
This brings us to the subject of loop unrolling. Loop unrolling reduces
the loop overhead by executing the loop body multiple times per loop
iteration. However, there are problems to overcome. What if the loop
count is not a multiple of the unroll amount? What if the loop count is
smaller than the unroll amount?

Multiple Nested Loops

How many loop counters does it take to maintain multiple nested loops?
Actually, one will suffice—or more accurately, one provided the sum of
the bits needed for each loop count does not exceed 32. We can combine
the loop counts within a single register, placing the innermost loop count
at the highest bit positions.
Other Counted Loops
You may want to use the value of a loop counter as an input to
calculations in the loop. It’s not always desirable to count down from N
to 1 or N −1 to 0. For example, you may want to select bits out of a data
register one at a time; in this case you may want a power-of-two mask
that doubles on each iteration.

Negative Indexing
This loop structure counts from −N to 0 (inclusive or exclusive) in steps
of size STEP.
Logarithmic Indexing
This loop structure counts down from 2^N to 1 in powers of two. For
example, if N = 4, then it counts 16, 8, 4, 2, 1.
Looping Constructs Summary
■ ARM requires two instructions to implement a counted loop: a subtract
that sets flags and a conditional branch.
■ Unroll loops to improve loop performance. Do not overunroll because
this will hurt cache performance. Unrolled loops may be inefficient for a
small number of iterations. You can test for this case and only call the
unrolled loop if the number of iterations is large.
■ Nested loops only require a single loop counter register, which can
improve efficiency by freeing up registers for other uses.
■ ARM can implement negative and logarithmic indexed loops
efficiently.
