Module 4 Final
INTRODUCTION TO ARM:
The ARM was originally developed at Acorn Computers Limited of Cambridge, England, between
1983 and 1985. It was the first RISC microprocessor developed for commercial use and has some
significant differences from subsequent RISC architectures. In 1990 ARM Limited was established
as a separate company specifically to widen the exploitation of ARM technology and it is established
as a market-leader for low-power and cost-sensitive embedded applications. The ARM is supported
by a toolkit which includes an instruction set emulator for hardware modelling and software testing
and benchmarking, an assembler, C and C++ compilers, a linker and a symbolic debugger.
The 16-bit CISC microprocessors that were available in 1983 were slower than standard memory parts.
They also had instructions that took many clock cycles to complete (in some cases, many hundreds of
clock cycles), giving them very long interrupt latencies. As a result of these frustrations with the
commercial microprocessor offerings, the design of a proprietary microprocessor was considered and
the ARM chip was designed.
ARM processors are based on a RISC architecture, which has allowed small
implementations with very low power consumption. Implementation size, performance, and very low
power consumption remain the key features in the development of the ARM devices.
The ARM7 processor is based on the von Neumann model, with a single bus for both data and instructions.
Although sharing one bus limits performance, pipelining offsets much of that cost. ARM
uses the Advanced Microcontroller Bus Architecture (AMBA). AMBA includes
two kinds of bus: a system bus, which is either the AMBA High-Speed Bus (AHB) or the Advanced System
Bus (ASB), and the Advanced Peripheral Bus (APB).
The ARM processor consists of:
• Arithmetic Logic Unit (32-bit)
• One Booth multiplier (32-bit)
• One barrel shifter
• One control unit
• Register file of 37 registers, each of 32 bits.
In addition, the ARM contains a 32-bit program status register; special
registers such as the instruction register, the memory data read and write registers and the memory
address register; a priority encoder, which is used in the multiple load and store instructions to indicate
which register in the register file is to be loaded or stored; and multiplexers.
ARM Registers: ARM has a total of 37 registers: 31 general-purpose 32-bit registers and six status
registers. Not all of these registers are visible at once. The processor state and
operating mode decide which registers are available to the programmer. At any time, only 16 of the 31
general-purpose registers are available to the user. The remaining 15 registers are
used to speed up exception processing. There are two kinds of program status register: the CPSR and
the SPSR (the current and saved program status registers, respectively).
In ARM state the registers r0 to r13 are orthogonal—any instruction that you can apply to r0 you can
equally well apply to any of the other registers.
The main bank of 16 registers is used by all unprivileged code. These are the User mode registers.
User mode is different from all other modes as it is unprivileged. In addition to this register bank, there
is also one 32-bit Current Program Status Register (CPSR).
Of the 16 registers, r13 acts as the stack pointer (SP), r14 as the link register (LR) and
r15 as the program counter (PC). The SP holds the address of the top of the stack
and is used by the PUSH and POP instructions.
Register 14 is the Link Register (LR). This register holds the address of the next instruction after a
Branch and Link (BL or BLX) instruction, which is the instruction used to make a subroutine call. It
is also used for return address information on entry to exception modes. At all other times, R14 can be
used as a general-purpose register.
Register 15 is the Program Counter (PC). It can be used in most instructions as a pointer to the
instruction which is two instructions after the instruction being executed.
CPSR: The ARM core uses the CPSR register to monitor and control internal operations. The CPSR
is a dedicated 32-bit register and resides in the register file. The CPSR is divided into four fields, each
8 bits wide: flags, status, extension, and control. The extension and status fields are reserved for
future use. The control field contains the processor mode, state, and interrupt mask bits. The flags
field contains the condition flags. The 32-bit CPSR register is shown below.
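The field layout described above can be sketched as a small Python decoder. This is an illustrative model, not part of the original notes: the individual bit positions (N, Z, C, V in bits 31 to 28; I, F, T in bits 7 to 5; mode in bits 4 to 0) and the mode encodings are taken from the standard ARMv4T CPSR layout and are an assumption here, and the name `decode_cpsr` is hypothetical.

```python
# Bit positions and mode encodings follow the standard ARMv4T CPSR layout.
# They are NOT stated in the notes above and are assumed for this sketch.
FLAG_BITS = {"N": 31, "Z": 30, "C": 29, "V": 28}
CONTROL_BITS = {"I": 7, "F": 6, "T": 5}
MODES = {
    0b10000: "user",
    0b10001: "fiq",
    0b10010: "irq",
    0b10011: "supervisor",
    0b10111: "abort",
    0b11011: "undefined",
    0b11111: "system",
}


def decode_cpsr(cpsr):
    """Split a 32-bit CPSR value into its four 8-bit fields and named bits."""
    cpsr &= 0xFFFFFFFF
    fields = {
        "flags": (cpsr >> 24) & 0xFF,      # condition flags
        "status": (cpsr >> 16) & 0xFF,     # reserved
        "extension": (cpsr >> 8) & 0xFF,   # reserved
        "control": cpsr & 0xFF,            # interrupt masks, state, mode
    }
    named = {**FLAG_BITS, **CONTROL_BITS}
    bits = {name: (cpsr >> pos) & 1 for name, pos in named.items()}
    mode = MODES.get(cpsr & 0x1F, "reserved")
    state = "Thumb" if bits["T"] else "ARM"
    return fields, bits, mode, state


fields, bits, mode, state = decode_cpsr(0x600000D3)
print(fields, bits, mode, state)
```

The sample value 0x600000D3 has Z and C set, both interrupts masked, and the mode bits selecting supervisor mode in ARM state.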
Processor Modes: There are seven processor modes: six privileged modes (abort, fast interrupt
request, interrupt request, supervisor, system, and undefined) and one non-privileged mode called user
mode.
Banked Registers: Of the 37 registers, 20 are hidden from a program at different times.
These registers are called banked registers and are identified by the shading in the diagram. They are
available only when the processor is in a particular mode; for example, abort mode has banked
registers r13_abt, r14_abt and spsr_abt. Banked registers of a particular mode are denoted by an
underscore followed by the mode mnemonic (_mode).
When the T bit of the CPSR is 1, the processor is in Thumb state; when T = 0, the processor is in ARM
state and executes ARM instructions. To change state, the core executes a specialized branch instruction.
There are two interrupt request levels available on the ARM processor core: interrupt request (IRQ)
and fast interrupt request (FIQ).
PIPELINE: A pipeline is the mechanism a RISC processor uses to execute instructions at an
increased speed. It speeds up execution by fetching the next instruction while other
instructions are being decoded and executed. To execute an instruction, the processor
fetches it (loads the instruction from memory), decodes it (identifies the instruction to be
executed) and finally executes it and writes the result back to a register.
The ARM7 processor has a three-stage pipeline: Fetch, Decode and Execute. The ARM9 has a
five-stage pipeline. The three-stage pipeline is explained below.
To explain pipelining, consider three instructions: Compare (CMP), Subtract (SUB) and
Add (ADD). The ARM7 processor fetches the first instruction, CMP, in the first cycle. During the second
cycle it decodes CMP and at the same time fetches SUB. During
the third cycle it executes CMP while decoding SUB and fetching the third instruction, ADD.
Overlapping the stages in this way improves the speed of operation and is a form of
parallel processing. This pipeline example is shown in the following diagram.
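The CMP, SUB, ADD walk-through can also be reproduced with a short Python model that lists which instruction occupies each stage in every clock cycle. This is only a sketch under simplifying assumptions: every instruction spends exactly one cycle per stage and there are no branches or stalls. The function name `pipeline_timeline` is hypothetical.

```python
STAGES = ("Fetch", "Decode", "Execute")


def pipeline_timeline(instructions, stages=STAGES):
    """Return one dict per clock cycle mapping stage name -> instruction or None.

    Assumes each instruction spends exactly one cycle in each stage
    (no stalls, no pipeline flushes).
    """
    depth = len(stages)
    total_cycles = len(instructions) + depth - 1
    timeline = []
    for cycle in range(total_cycles):
        row = {}
        for s, stage in enumerate(stages):
            idx = cycle - s  # instruction idx reaches stage s in cycle idx + s
            row[stage] = instructions[idx] if 0 <= idx < len(instructions) else None
        timeline.append(row)
    return timeline


for number, row in enumerate(pipeline_timeline(["CMP", "SUB", "ADD"]), start=1):
    print(f"cycle {number}: {row}")
```

In the third cycle of this model all three stages are busy at once: ADD is being fetched, SUB decoded and CMP executed.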
As the pipeline length increases, the amount of work done at each stage is reduced, which allows the
processor to attain a higher operating frequency. This in turn increases the performance. One important
feature of this pipeline is that executing a branch instruction, or branching by directly modifying
the PC, causes the ARM core to flush its pipeline.
PROGRAMMER'S MODEL
The programmer's model describes the core components of the processor that the programmer uses to
program it.
Our ARM has a 32-bit data bus and a 32-bit address bus. The data types the processor supports are
Words (32 bits), where words must be aligned to four byte boundaries. Instructions are exactly one
word, and data operations (e.g. ADD) are only performed on word quantities. Load and store
operations can transfer words.
Registers
The processor has a total of 37 registers made up of 31 general 32 bit registers and 6 status registers.
At any one time 16 general registers (R0 to R15) and one or two status registers are visible to the
programmer. The visible registers depend on the processor mode and the other registers (the banked
registers) are switched in to support IRQ, FIQ, Supervisor, Abort and Undefined mode processing.
The register bank organization is shown in the figure below. The banked registers are shaded in the
diagram.
In all modes 16 registers, R0 to R15, are directly accessible. All registers except R15 are general
purpose and may be used to hold data or address values. Register R15 holds the Program Counter
(PC). When R15 is read, bits [1:0] are zero and bits [31:2] contain the PC. A seventeenth register
(the CPSR - Current Program Status Register) is also accessible. It contains condition code flags and
the current mode bits and may be thought of as an extension to the PC.
R14 is used as the subroutine link register and receives a copy of R15 when a Branch and Link
instruction is executed. It may be treated as a general-purpose register at all other times. R14_svc,
R14_irq, R14_fiq, R14_abt and R14_und are used similarly to hold the return values of R15 when
interrupts and exceptions arise, or when Branch and Link instructions are executed within interrupt
or exception routines.
FIQ mode has seven banked registers mapped to R8-14 (R8_fiq-R14_fiq). Many FIQ programs will
not need to save any registers. User mode, IRQ mode, Supervisor mode, Abort mode and
Undefined mode each have two banked registers mapped to R13 and R14. The two banked registers
allow these modes to each have a private stack pointer and link register. Supervisor, IRQ, Abort
and Undefined mode programs which require more than these two banked registers are expected to
save some or all of the caller's registers (R0 to R12) on their respective stacks. They are then free to
use these registers, which they will restore before returning to the caller. In addition, there are
five SPSRs (Saved Program Status Registers) that are loaded with the CPSR when an exception
occurs. There is one SPSR for each exception mode (FIQ, IRQ, Supervisor, Abort and Undefined);
System mode, although privileged, has no SPSR.
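The register counts quoted in these notes (31 general-purpose registers, 6 status registers, 37 in total, 20 of them banked) can be checked by enumerating the register names. The sketch below is an assumption-labelled model: the short mode suffixes (`fiq`, `irq`, `svc`, `abt`, `und`) follow the naming used above, and the function `register_names` is hypothetical.

```python
# r0-r15 as seen in User (and System) mode.
SHARED = [f"r{i}" for i in range(16)]

# General-purpose registers private to each exception mode, as described above:
# FIQ banks r8-r14; the other four modes bank only r13 and r14.
BANKED_GENERAL = {
    "fiq": range(8, 15),
    "irq": range(13, 15),
    "svc": range(13, 15),
    "abt": range(13, 15),
    "und": range(13, 15),
}


def register_names():
    """Return (general_purpose, status) register name lists for the whole core."""
    general = list(SHARED)
    status = ["cpsr"]
    for mode, numbers in BANKED_GENERAL.items():
        general += [f"r{n}_{mode}" for n in numbers]
        status.append(f"spsr_{mode}")  # one SPSR per exception mode
    return general, status


general, status = register_names()
banked = [name for name in general + status if "_" in name]
print(len(general), len(status), len(general) + len(status), len(banked))
```

The banked total is 15 general-purpose registers plus 5 SPSRs, which accounts for the 20 registers hidden from a program at different times.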
(Also explain CPSR Register)
Multiplication
MUL r4, r3, r2 ; r4 = (r3 x r2)[31:0]
– only the bottom 32 bits are returned
– immediate operands are not supported
➢ Multiplication by a constant is usually best done with a short series of adds and subtracts with
shifts
‘Multiply-Accumulate’ form:
MLA r4, r3, r2, r1 ; r4 = (r3 x r2 + r1)[31:0]
➢ 64-bit result forms are also supported.
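A minimal Python sketch of the 32-bit truncation performed by MUL and MLA, and of multiplying by a constant with shifts and adds, is given below. Python integers are unbounded, so the mask models the 32-bit register width. The function names are hypothetical, and the choice of the constant 10 is only an illustration.

```python
MASK32 = 0xFFFFFFFF  # models a 32-bit register


def mul(rm, rs):
    """MUL rd, rm, rs: keep only the bottom 32 bits of the product."""
    return (rm * rs) & MASK32


def mla(rm, rs, rn):
    """MLA rd, rm, rs, rn: multiply-accumulate, bottom 32 bits only."""
    return (rm * rs + rn) & MASK32


def times_10(x):
    """x * 10 without a multiply: (x << 3) + (x << 1), truncated to 32 bits."""
    return ((x << 3) + (x << 1)) & MASK32


print(hex(mul(0x10000, 0x10000)))  # the product needs 33 bits, so bit 32 is lost
print(mla(3, 4, 5))
print(times_10(7))
```

In ARM assembly the same shift-and-add idea uses the barrel shifter on the second operand, for example an ADD with an LSL-shifted register.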
DATA TRANSFER INSTRUCTIONS:
Data transfer instructions transfer data between registers and memory.
➢ Memory to register or LOAD from memory to register
➢ Register to memory or STORE from register to memory
The ARM has three sets of instructions which interact with main memory. These are:
❖ Single register data transfer (LDR/STR)
❖ Block data transfer (LDM/STM)
❖ Single data swap (SWP)
The basic load and store instructions are:
❖ Load and store word, byte or halfword
❖ LDR / STR / LDRB / STRB / LDRH / STRH
Single register data transfer
Load    Store   Data size
LDR     STR     Word
LDRB    STRB    Byte
LDRH    STRH    Halfword
Memory to Register:
❖ LDR r2,[r1]
This instruction will take the address in r1, and then load a 4 byte value from the memory
pointed to by it into register r2
❖ Note: r1 is called the base register
Register to Memory:
❖ STR r2,[r1]
This instruction will take the address in r1, and then store a 4 byte value from the register r2 to the
memory pointed to by r1.
❖ Note: r1 is called the base register
Examples:
LDR/STR r1, [r2, #4] ; offset: immediate 4
; The effective memory address is calculated as r2 + 4
LDR/STR r1, [r2, r3] ; offset: value in register r3
; The effective memory address is calculated as r2 + r3
LDR/STR r1, [r2, r3, LSL #3] ; offset: register value x 2^3
; The effective memory address is calculated as r2 + r3 x 2^3 (that is, r2 + r3 x 8)
❖ Example: LDR r0,[r1,#12]
This instruction will take the pointer in r1, add 12 bytes to it, and then load the value from the memory
pointed to by this calculated sum into register r0
❖ Example: STR r0,[r1,#-8]
This instruction will take the pointer in r1, subtract 8 bytes from it, and then store the value from
register r0 into the memory address pointed to by the calculated sum
❖ Notes:
❖ r1 is called the base register
❖ #constant is called the offset
❖ offset is generally used in accessing elements of array or structure: base reg points to
beginning of array or structure
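The base-plus-offset addressing described above can be modelled in a few lines of Python. This is a sketch under stated assumptions: memory is a small byte array, words are stored little-endian, alignment checks are omitted, and the names `Memory` and `effective_address` are hypothetical.

```python
import struct


class Memory:
    """A tiny byte-addressed memory; little-endian words are an assumption."""

    def __init__(self, size):
        self.data = bytearray(size)

    def load_word(self, addr):
        return struct.unpack_from("<I", self.data, addr)[0]

    def store_word(self, addr, value):
        struct.pack_into("<I", self.data, addr, value & 0xFFFFFFFF)


def effective_address(base, offset=0, shift=0):
    """Address used by LDR/STR rd, [base, offset{, LSL #shift}]."""
    return (base + (offset << shift)) & 0xFFFFFFFF


mem = Memory(64)
r0, r1 = 0xDEADBEEF, 32

# STR r0, [r1, #-8] : store r0 at address r1 - 8
mem.store_word(effective_address(r1, -8), r0)

# LDR r2, [r3, r4, LSL #3] with r3 = 16 and r4 = 1 : address 16 + 1 * 8
r2 = mem.load_word(effective_address(16, 1, 3))
print(hex(r2))
```

Both accesses resolve to address 24, so the load returns the value that was just stored.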
FLOW CONTROL INSTRUCTIONS:
ARM's flow control instructions modify the default sequential execution. They control the operation
of the processor and the sequencing of instructions by determining which instruction is executed next.
Branch instruction
B label
…
label: …
Conditional branches:
MOV R0, #0
loop:
…
ADD R0, R0, #1
CMP R0, #10
BNE loop
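The counting loop above can be traced with a small Python model of CMP and BNE. It is an illustrative sketch only: operands are treated as unsigned 32-bit values, the flag rules are the usual ones for a subtraction (Z set on a zero result, C set when there is no borrow), and the names `cmp_flags` and `run_loop` are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def cmp_flags(rn, operand):
    """Flags produced by CMP rn, operand (rn - operand, result discarded).

    rn and operand are assumed to be unsigned 32-bit values.
    """
    result = (rn - operand) & MASK32
    sign = lambda x: (x >> 31) & 1
    return {
        "N": sign(result),
        "Z": int(result == 0),
        "C": int(rn >= operand),  # carry set means "no borrow"
        "V": int(sign(rn) != sign(operand) and sign(result) != sign(rn)),
    }


def run_loop(limit=10):
    """Model of: MOV R0,#0 / loop: ADD R0,R0,#1 / CMP R0,#limit / BNE loop."""
    r0 = 0               # MOV R0, #0
    iterations = 0
    while True:
        iterations += 1
        r0 = (r0 + 1) & MASK32        # ADD R0, R0, #1
        flags = cmp_flags(r0, limit)  # CMP R0, #10
        if flags["Z"] == 1:           # BNE branches only while Z is clear
            break
    return r0, iterations


print(run_loop())
```

The loop body runs until R0 equals the limit, at which point CMP sets Z and BNE falls through.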
Branch conditions:
Conditional Branches:
Conditional execution
3 STAGE PIPELINE:
The organization of an ARM with a 3-stage pipeline is illustrated in the figure. The principal components include:
• The address register and incrementer, which select and hold all memory addresses and generate
sequential addresses when required.
• The data registers, which hold data passing to and from memory.
• The instruction decoder and associated control logic.
In a single-cycle data processing instruction, two register operands are accessed, the value on the B
bus is shifted and combined with the value on the A bus in the ALU, then the result is written back
into the register bank. The program counter value is in the address register, from where it is fed into
the incrementer, then the incremented value is copied back into r15 in the register bank and also into
the address register to be used as the address for the next instruction fetch.
ARM processors up to the ARM7 employ a simple 3-stage pipeline with the following pipeline stages:
• Fetch:
the instruction is fetched from memory and placed in the instruction pipeline.
• Decode:
the instruction is decoded and the datapath control signals prepared for the next cycle. In this stage
the instruction 'owns' the decode logic but not the datapath.
• Execute:
the instruction 'owns' the datapath; the register bank is read, an operand shifted, the ALU result
generated and written back into a destination register.
At any one time, three different instructions may occupy each of these stages, so the hardware in each
stage has to be capable of independent operation.
When the processor is executing simple data processing instructions the pipeline enables one
instruction to be completed every clock cycle. An individual instruction takes three clock cycles to
complete, so it has a three-cycle latency, but the throughput is one instruction per cycle. The 3-stage
pipeline operation for single-cycle instructions is shown in Figure:
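The relationship between latency and throughput can be made concrete with a small calculation. The sketch below assumes an ideal pipeline in which every instruction is single-cycle and the pipeline is never flushed; the function names are hypothetical.

```python
def cycles_to_run(n_instructions, stages=3):
    """Clock cycles to complete n single-cycle instructions in an ideal pipeline.

    The first instruction takes `stages` cycles (its latency); each later
    instruction completes one cycle after the previous one.
    """
    if n_instructions <= 0:
        return 0
    return stages + (n_instructions - 1)


def cycles_per_instruction(n_instructions, stages=3):
    """Average cycles per instruction; approaches 1 as the program gets longer."""
    return cycles_to_run(n_instructions, stages) / n_instructions


for n in (1, 3, 100):
    print(n, cycles_to_run(n), cycles_per_instruction(n))
```

A single instruction still needs three cycles, but over a long run of instructions the average cost per instruction tends towards one cycle.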
5 STAGE PIPELINING:
Instruction Fetch (IF):
Function: Fetches the next instruction from memory.
Explanation: In this stage, the program counter (PC) is used to fetch the instruction from the memory
address pointed to by the PC. The fetched instruction is then passed to the next stage.
Instruction Decode (ID):
Function: Decodes the instruction to determine the operation to be performed and the operands.
Explanation: The fetched instruction is decoded to understand its opcode and any associated operands.
This stage determines what operation the instruction is requesting and what data it needs.
Execute (EX):
Function: Performs the actual operation or calculation specified by the instruction.
Explanation: This stage executes the operation indicated by the decoded instruction. For example, if
the instruction is an arithmetic operation, the actual computation takes place in this stage.
Memory Access (MEM):
Function: Accesses memory if the instruction involves a memory operation (e.g., load or store).
Explanation: In this stage, the processor interacts with the memory subsystem. If the instruction
involves reading from or writing to memory, the necessary data is transferred between the processor
and memory.
Write Back (WB):
Function: Writes the result of the executed instruction back to the register file.
Explanation: The final stage involves writing the results of the executed instruction back to the register
file. This stage updates the processor's internal registers with the results of the computation.
The principal concessions to the ARM instruction set architecture in the organization shown in the
figure are the three source operand read ports and two write ports in the register file (where a
'classic' RISC has two read ports and one write port), and the inclusion of address incrementing
hardware in the execute stage to support load and store multiple instructions.
A data transfer (load or store) instruction computes a memory address in a manner very similar to the
way a data processing instruction computes its result. A register is used as the base address, to which
is added an offset which again may be another register or an immediate value. A 12-bit immediate
value is used without a shift operation rather than a shifted 8-bit value. The address is sent to the
address register, and in a second cycle the data transfer takes place. Rather than leave the datapath
largely idle during the data transfer cycle, the ALU holds the address components from the first cycle
and is available to compute an auto-indexing modification to the base register if it is required.
Branch Instructions
Branch instructions compute the target address in the first cycle. A 24-bit immediate field is extracted
from the instruction and then shifted left two bit positions to give a word-aligned offset which is added
to the PC. The result is issued as an instruction fetch address, and while the instruction pipeline refills
the return address is copied into the link register (r14) if this is required. The third cycle, which is
required to complete the pipeline refilling, is also used to make a small correction to the value stored
in the link register in order that it points directly at the instruction which follows the branch.
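The target address calculation can be sketched in Python as follows. The sketch assumes ARM state, where the PC value seen by the branch is the address of the branch plus 8 (two instructions ahead, as noted earlier for r15), and it treats the 24-bit field as a signed value. The function names are hypothetical.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)


def branch_target(instr_addr, imm24):
    """Target of B/BL at instr_addr with the given 24-bit immediate field.

    Assumes ARM state: the PC reads as instr_addr + 8 when the offset is added.
    """
    offset = sign_extend(imm24 & 0xFFFFFF, 24) << 2  # word-aligned byte offset
    pc = instr_addr + 8
    return (pc + offset) & 0xFFFFFFFF


def encode_imm24(instr_addr, target):
    """Inverse of branch_target: the 24-bit field that reaches `target`."""
    return ((target - (instr_addr + 8)) >> 2) & 0xFFFFFF


def link_value(instr_addr):
    """Return address placed in r14 by BL: the instruction after the branch."""
    return (instr_addr + 4) & 0xFFFFFFFF


print(hex(branch_target(0x8000, 0xFFFFFE)))
print(hex(encode_imm24(0x8000, 0x8010)))
print(hex(link_value(0x8000)))
```

With an immediate field of 0xFFFFFE the offset is -8 bytes, which cancels the PC's two-instruction lead, so the branch targets its own address.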
ARM IMPLEMENTATION
The design is divided into a datapath section that is described in register transfer level (RTL) notation
and a control section that is viewed as a finite state machine (FSM).
Clocking Scheme
The design is based around 2-phase non-overlapping clocks, as shown in Figure, which are generated
internally from a single input clock signal. This scheme allows the use of level-sensitive transparent
latches. Data movement is controlled by passing the data alternately through latches which are open
during phase 1 and latches which are open during phase 2. The non-overlapping property of the phase
1 and phase 2 clocks ensures that there are no race conditions in the circuit.
Datapath Timing
The normal datapath timing is a 3-stage pipeline. The ALU has input latches which are open during
phase 1, allowing the operands to begin combining in the ALU as soon as they are valid, but they close
at the end of phase 1 so that the phase 2 precharge does not get through to the ALU. The ALU then
continues to process the operands through phase 2, producing a valid output towards the end of the
phase which is latched in the destination register at the end of phase 2.
The first ARM processor prototype used a simple ripple-carry adder as shown in Figure 4.10. Using a CMOS
AND-OR-INVERT gate for the carry logic and alternating AND/OR logic so that even bits use the circuit
shown and odd bits use the dual circuit with inverted inputs and outputs and AND and OR gates swapped
around, the worst-case carry path is 32 gates long. In order to allow a higher clock rate, ARM2 used a 4-bit
carry look-ahead scheme to reduce the worst-case carry path length.
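The behaviour of a ripple-carry adder can be modelled bit by bit in Python. This is a functional sketch only: it reproduces the sum and carry logic, not the alternating AND-OR-INVERT gate arrangement or its timing, and the function name is hypothetical.

```python
def ripple_carry_add(a, b, carry_in=0, width=32):
    """Add two width-bit values one bit at a time, returning (sum, carry_out).

    Each bit position must wait for the carry from the position below it,
    which is why the worst-case path grows with the word width.
    """
    result = 0
    carry = carry_in & 1
    for i in range(width):
        a_bit = (a >> i) & 1
        b_bit = (b >> i) & 1
        sum_bit = a_bit ^ b_bit ^ carry
        # AND-OR form of the carry: generated here, or propagated from below
        carry = (a_bit & b_bit) | (carry & (a_bit | b_bit))
        result |= sum_bit << i
    return result, carry


print(ripple_carry_add(5, 7))
print(ripple_carry_add(0xFFFFFFFF, 1))
```

Adding 1 to 0xFFFFFFFF is the worst case for this structure, because the carry generated at bit 0 has to propagate through every one of the 32 bit positions.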
ALU Functions
The ALU does not only add its two inputs. It must perform the full set of data operations defined by
the instruction set, including address computations for memory transfers, branch calculations, bit-wise
logical functions, and so on.
Multiplier Design
All ARM processors apart from the first prototype have included hardware support for integer
multiplication. Two styles of multiplier have been used:
• Older ARM cores include low-cost multiplication hardware that supports only the 32-bit result
multiply and multiply-accumulate instructions.
• Recent ARM cores have high-performance multiplication hardware and support the 64-bit result
multiply and multiply-accumulate instructions.
Datapath Layout
The ARM datapath is laid out to a constant pitch per bit. The pitch will be a compromise between the
optimum for the complex functions (such as the ALU) which are best suited to a wide pitch and the
simple functions (such as the barrel shifter) which are most efficient when laid out on a narrow pitch.
Control Structures
The control logic on the simpler ARM cores has three structural components which relate to each
other.
1. An instruction decoder PLA (programmable logic array). This unit uses some of the instruction bits
and an internal cycle counter to define the class of operation to be performed on the datapath in the
next cycle.
2. Distributed secondary control associated with each of the major datapath function blocks. This logic
uses the class information from the main decoder PLA to select other instruction bits and/or processor
state information to control the datapath.
3. Decentralized control units for specific instructions that take a variable number of cycles to complete
(load and store multiple, multiply and coprocessor operations). Here the main decoder PLA locks into
a fixed state until the remote control unit indicates completion.
Physical Design
There are two principal mechanisms used to implement an ARM processor core (or any other core, for
that matter) on a particular process:
• a hard macrocell is delivered as physical layout ready to be incorporated into the final design;
• a soft macrocell is delivered as a synthesizable design expressed in a hardware description language
such as VHDL.