ARM9E: DSP-Enhanced ARM Processor

The document discusses the ARM9E processor, which is an ARM9TDMI processor with DSP extensions. It addresses applications that require a mix of DSP and control capabilities. The ARM9E provides single-cycle multiply-accumulate instructions and zero overhead saturating fractional arithmetic to improve DSP performance. It maintains compatibility with other ARM processors while enhancing the architecture for DSP workloads. The document outlines the target applications and how the new instructions improve processing efficiency over previous ARM cores.

Uploaded by

api-3702016
Copyright
© Attribution Non-Commercial (BY-NC)

ARM9E

An ARM9TDMI with DSP extensions

1
Market fit
• The ARM9E addresses high volume applications requiring a mix of DSP and control performance
– Mass storage
• servo control in HDD, DVD and other drives
– Speech coders
• G.723.1 for voice over IP
• Multiple standards for digital cellular telephony
– Networking applications
– Automotive control applications
– Modems
– Audio decoding (Dolby Digital, MP3, etc.)

2
ARM9E is a DSP-enhanced ARM processor
• A 32-bit RISC single engine solution for mixed DSP and control applications
– Maintains full compatibility with ARM9TDMI, ARM7TDMI and all other ARM microprocessors
• Why you want a DSP-enhanced ARM processor
– superb array of development tools and options
– unified development environment reduces costs
– good HLL target - can realistically use C and C++
– easy to learn and program the single architecture
– reduced SOC complexity due to elimination of inter-processor communication and other overheads
3
[Figure: ARM core roadmap. Vertical axis: Performance MIPS (Dhry 2.1), with marks at 100 and 400. Horizontal axis: 1996 to 2002. Labelled families: ARM 7 Thumb Family, ARM 9..., ARM 9E (70-150 DSP MIPS), ARM 10..., ARM xx. Points are annotated with process nodes (0.6µ, 0.35µm, 0.25µm, 0.18µm, 0.15µm) and core areas (4.8mm², 2.1mm², 1.0mm², ~0.5mm²).]

4
Application driven architecture decisions
• ARM has been working with OEMs and analyzing key application code
• ARM processors are good at DSP already
• Analysis identified three bottlenecks
– Solutions:
• Single cycle multiply-accumulate
• Zero overhead saturating fractional arithmetic
• Efficient use of 32-bit bandwidth with packed 16-bit data

5
ARM cores are good at DSP already

• High data bandwidth - 4 bytes per cycle
– same data bandwidth as typical 16-bit DSP
– 600 Mbytes/sec on typical 0.25µm process
– Harvard memory interface
– Large register bank reduces bandwidth required by many algorithms
• Conditional instruction execution
– every instruction is predicated
– eliminates branch penalties

6
DSP enhancements in ARM9E
• New instruction additions give architecture V5TE
• New 32x16 and 16x16 multiply instructions
– SMLAxy, SMLAWy, SMLALxy, SMULxy, SMULWy
– Allows independent access to 16-bit halves of registers
• Gives efficient use of 32-bit bandwidth for packed 16-bit operands
– ARM ISA already has 32x32 multiply instructions
• Zero overhead fractional saturating arithmetic
– QADD, QSUB, QDADD, QDSUB
• Count leading zeros instruction
– CLZ for faster normalisation and division
• Single cycle 32x16 multiplier array
– speeds up all ARM9E multiply instructions
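
As a reading aid for the saturating and CLZ instructions listed above, here is a minimal C sketch of their arithmetic as the ARMv5TE architecture defines it. The function names (sat32, qadd, qsub, qdadd, qdsub, clz32, dsp_ops_demo) are hypothetical helpers, not part of any ARM toolkit, and the sketch does not model the sticky Q flag that the hardware sets when saturation occurs.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a 64-bit intermediate to the signed 32-bit range. */
static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* QADD model: SAT(a + b). */
static int32_t qadd(int32_t a, int32_t b) { return sat32((int64_t)a + b); }

/* QSUB model: SAT(a - b). */
static int32_t qsub(int32_t a, int32_t b) { return sat32((int64_t)a - b); }

/* QDADD model: SAT(a + SAT(2 * b)); the doubling saturates on its own. */
static int32_t qdadd(int32_t a, int32_t b)
{
    return sat32((int64_t)a + sat32(2 * (int64_t)b));
}

/* QDSUB model: SAT(a - SAT(2 * b)). */
static int32_t qdsub(int32_t a, int32_t b)
{
    return sat32((int64_t)a - sat32(2 * (int64_t)b));
}

/* CLZ model: number of leading zero bits, 32 when the input is zero. */
static unsigned clz32(uint32_t x)
{
    unsigned n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) { x <<= 1; n++; }
    return n;
}

/* Hypothetical self-check; call it from main() to exercise the models. */
void dsp_ops_demo(void)
{
    assert(qadd(INT32_MAX, 1) == INT32_MAX);   /* clamps instead of wrapping */
    assert(qdadd(5, 10) == 25);                /* 5 + 2*10 */
    /* Normalisation: shifting by the CLZ count moves the top set bit to bit 31. */
    assert((0x00010000u << clz32(0x00010000u)) == 0x80000000u);
}
```

The doubling step in QDADD and QDSUB is what makes them useful for fractional arithmetic: the product of two Q15 values needs one left shift to become Q31, and the instruction folds that shift into the accumulate.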
7
Using the new multiply instructions
SMLAxy Rd,Rm,Rs,Rn
[Diagram labels: Rm, Rs, Rn, Rd; selectors x=T, x=B, y=T, y=B; multiplier X]
– Rm, Rs: x & y select the upper and lower 16-bits of the 32-bit registers
– X: 16x32 or 16x16 multiply gives 48-bit or 32-bit product
– Rn: 32-bit register or 64-bit register-pair as accumulation source
– Rd: 32-bit register or 64-bit register-pair as accumulation destination

Other instructions include:
SMUL: 16x16 => 32
SMLAL: 16x16 + 64 => 64
SMLAW: 32x16 + 32 => 32
SMULW: 32x16 => 32
MLA: 32x32 + 32 => 32
MLAL: 32x32 + 64 => 64
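
The half-register selection can be illustrated with a small C model. This is a sketch under assumptions: the helper names (half, wrap32, floor_shr16, smulxy, smlaxy, smulwy, smlawy, HALF_B, HALF_T, mul_demo) are hypothetical, the accumulate is modelled as a plain wrapping 32-bit add, the Q flag is not modelled, and the 64-bit SMLAL form is left out for brevity.

```c
#include <assert.h>
#include <stdint.h>

enum { HALF_B = 0, HALF_T = 1 };   /* bottom / top 16-bit half selectors */

/* Sign-extend the selected 16-bit half of a 32-bit register value. */
static int32_t half(int32_t r, int top)
{
    uint32_t u = (uint32_t)r;
    int32_t h = (int32_t)((top ? (u >> 16) : u) & 0xFFFFu);
    return (h >= 0x8000) ? h - 0x10000 : h;
}

/* Reinterpret a 32-bit pattern as signed without implementation-defined casts. */
static int32_t wrap32(uint32_t u)
{
    return (u <= (uint32_t)INT32_MAX) ? (int32_t)u
                                      : (int32_t)(u - 0x80000000u) + INT32_MIN;
}

/* floor(v / 2^16), written so it does not rely on shifting negative values. */
static int64_t floor_shr16(int64_t v)
{
    return (v >= 0) ? (v >> 16) : ~((~v) >> 16);
}

/* SMULxy model: 16x16 => 32. */
static int32_t smulxy(int32_t rm, int x, int32_t rs, int y)
{
    return half(rm, x) * half(rs, y);
}

/* SMLAxy model: 16x16 + 32 => 32, wrapping accumulate. */
static int32_t smlaxy(int32_t rm, int x, int32_t rs, int y, int32_t rn)
{
    return wrap32((uint32_t)smulxy(rm, x, rs, y) + (uint32_t)rn);
}

/* SMULWy model: 32x16 => top 32 bits of the 48-bit product. */
static int32_t smulwy(int32_t rm, int32_t rs, int y)
{
    return (int32_t)floor_shr16((int64_t)rm * half(rs, y));
}

/* SMLAWy model: 32x16 + 32 => 32, wrapping accumulate. */
static int32_t smlawy(int32_t rm, int32_t rs, int y, int32_t rn)
{
    return wrap32((uint32_t)smulwy(rm, rs, y) + (uint32_t)rn);
}

/* Hypothetical self-check: one register holds the packed pair (3, -2). */
void mul_demo(void)
{
    int32_t packed = 0x0003FFFE;                           /* top = 3, bottom = -2 */
    assert(smulxy(packed, HALF_T, packed, HALF_B) == -6);  /* 3 * -2 */
    assert(smlaxy(packed, HALF_B, packed, HALF_B, 10) == 14);
}
```

Because both operands of the 16x16 forms can come from either half of any register, one 32-bit load can supply two 16-bit samples, which is the packed-data point made on the earlier slides.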

8
32x16 saturating multiply primitive used in international standards
16-bit DSP implementation - 4-cycles
Result_32 = L_mult(mier_hi, mand);
temp_32 = L_mult(mier_lo, mand);
temp_32 = temp_32>>15;
Result_32 = Result_32 + temp_32;
ARM9E implementation - 2-cycles
SMULWB Prod, mier, mand
QADD Prod,Prod,Prod
[Diagram labels: SMULWB, X, QADD, SAT]
Replacing QADD with QDADD achieves a 32x16+32 MAC in 2-cycles
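
The two-instruction sequence can be modelled in portable C to show what it computes. This is a sketch, not a bit-exact equivalent of the four-step L_mult sequence, because the slide does not say how mier is split into mier_hi and mier_lo. The names sat32, qadd, qdadd, smulwb, mult_32x16, mac_32x16 and mult_demo are hypothetical; mier is treated as a Q31 fraction and mand as a Q15 fraction.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a 64-bit intermediate to the signed 32-bit range. */
static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* QADD model: SAT(a + b). */
static int32_t qadd(int32_t a, int32_t b) { return sat32((int64_t)a + b); }

/* QDADD model: SAT(a + SAT(2 * b)). */
static int32_t qdadd(int32_t a, int32_t b)
{
    return sat32((int64_t)a + sat32(2 * (int64_t)b));
}

/* SMULWB model: top 32 bits of the 48-bit product mier * mand. */
static int32_t smulwb(int32_t mier, int16_t mand)
{
    int64_t p = (int64_t)mier * mand;
    return (int32_t)((p >= 0) ? (p >> 16) : ~((~p) >> 16));
}

/* SMULWB then QADD Prod,Prod,Prod: Q31 x Q15 => Q31 with saturation. */
static int32_t mult_32x16(int32_t mier, int16_t mand)
{
    int32_t prod = smulwb(mier, mand);
    return qadd(prod, prod);
}

/* SMULWB then QDADD: acc + 2 * product, saturating at both steps. */
static int32_t mac_32x16(int32_t acc, int32_t mier, int16_t mand)
{
    return qdadd(acc, smulwb(mier, mand));
}

/* Hypothetical self-check: 0.5 (Q31) times 0.5 (Q15) is 0.25 (Q31). */
void mult_demo(void)
{
    assert(mult_32x16(0x40000000, 0x4000) == 0x20000000);
    /* -1.0 times -1.0 would be +1.0, which Q31 cannot hold, so it saturates. */
    assert(mult_32x16(INT32_MIN, INT16_MIN) == INT32_MAX);
}
```

The doubling in QADD Prod,Prod,Prod restores the fractional scaling that the 16-bit right shift inside SMULWB removes, and it is the only step that can overflow, which is why saturation costs no extra instruction here.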
9
Programmers prefer ARM9E
• Clean orthogonal architecture with linear 32-bit memory space
– Harvard bus architecture invisible to programmer
• no special table access instructions
– Excellent HLL target
• No ‘extra’ state to keep track of
– instructions select saturation mode etc.
• 32-bit stack pointer with stack located in external memory
– No interrupt nesting limitations imposed by architecture
10
ARM9E Datapath
[Block diagram: Instruction Decode and Datapath control logic above the datapath. Labelled blocks and signals: REGBANK (r0 to r14, PC, PSR); RDATA[] with Byte rotate / Sign Extension; WDATA[] with Byte/Half Replicate; MUL; CLZ; Imm; BARREL SHIFTER; AData[..]; BData[..]; ACC; SAT and SAT(x2); RESULT[..]; DINC; IINC; DA[]; InsAddr.]
11
Dot product performance
Underlying operation for state-space servo control
[Bar chart: cycles per element (axis 0 to 20) for ARM9E and ARM9TDMI. Category labels: Non-saturating Q15xQ15, Saturating Q15xQ15, Saturating Q31xQ15 and Saturating Q31xQ31, each paired with "Best with loop unrolling". Bar values are not recoverable.]
10 element 16x16 dot-product in 125ns on 160MHz ARM9E

12
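
For scale, 125ns at 160MHz is 20 clock cycles, so the quoted 10 element 16x16 dot product works out to 2 cycles per element. The C sketch below shows the two kinds of dot product the chart distinguishes. It assumes, as one plausible mapping rather than a statement of how ARM's benchmark was coded, that the non-saturating form accumulates 16x16 products into a 64-bit sum (the 16x16 + 64 => 64 shape) and that the saturating Q15 form uses a 16x16 multiply followed by a QDADD-style accumulate. All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp a 64-bit intermediate to the signed 32-bit range. */
static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* QDADD model: SAT(a + SAT(2 * b)). */
static int32_t qdadd(int32_t a, int32_t b)
{
    return sat32((int64_t)a + sat32(2 * (int64_t)b));
}

/* Non-saturating dot product: 16x16 products summed into 64 bits. */
static int64_t dot16_acc64(const int16_t *a, const int16_t *b, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}

/* Saturating Q15 x Q15 dot product with a Q31 accumulator. */
static int32_t dot_q15_sat(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc = qdadd(acc, (int32_t)a[i] * (int32_t)b[i]);
    return acc;
}

/* Hypothetical self-check: 0.5*0.5 + 0.5*0.5 = 0.5 in Q31. */
void dot_demo(void)
{
    const int16_t x[] = { 16384, 16384 };
    const int16_t y[] = { 16384, 16384 };
    assert(dot_q15_sat(x, y, 2) == 0x40000000);
}
```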
Voice over IP
• G.723.1 full-duplex
– Takes 25% of ARM9E at 160MHz
– 100% performance improvement from the ARM9E enhancements
• similar improvements with digital cellular speech coders
– Leaves 75% to run other applications
• V.34bis softmodem
– 28% of ARM9E at 160MHz
• Typical VoIP application - single engine internet appliance
– Windows CE or EPOC32, TCP/IP, Modem, Voice coder
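
The percentages above translate into clock budgets with simple arithmetic. The sketch below is illustrative only: load_khz and budget_demo are hypothetical names, and adding the two loads together assumes they combine linearly when run on the same core, which the slides do not state.

```c
#include <assert.h>

/* Clock consumed by a task, in kHz, given the core clock and a percentage load. */
static unsigned load_khz(unsigned core_khz, unsigned percent)
{
    return core_khz * percent / 100u;
}

/* Hypothetical self-check using the figures quoted for a 160MHz ARM9E. */
void budget_demo(void)
{
    assert(load_khz(160000u, 25u) == 40000u);   /* G.723.1: 40MHz */
    assert(load_khz(160000u, 28u) == 44800u);   /* V.34bis: 44.8MHz */
}
```

Under that linear assumption the codec and softmodem together take 53% of the core, leaving 47% for the operating system, TCP/IP and application code.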

13
Audio and speech processing
• Efficient implementation of digital cellular speech coders
– DSP requirements of channel coding rising rapidly. Offloading the voice processing to ARM makes a more balanced system
• MP3 decoding takes just 11% of an ARM9E at 160MHz
– Can run on a PDA platform with:
• EPOC32, Windows CE, others
• Dolby Digital (AC3) takes just 22% of an ARM9E at 160MHz
14
Enhanced debug capabilities

• Real-time debug
– Core has been enhanced to allow a debugger to step and debug one task whilst background interrupt routines continue to run
• Compatible with ARM Real-time Trace solution
– ARM9E connects to ARM Embedded Trace Macrocell
– allows real-time non-intrusive instruction and data tracing

15
Development Tools Support
• ARM9E is fully supported by the ARM software development toolkit
– The ARM Debugger supports the new instructions
– Cycle accurate simulator models are already being used
– The C and C++ compilers support inline assembly using the new instructions
– Assembler supports ISA enhancements
– Real-time trace tools support the ARM9E
• ARM is engaged with third-parties to enable other ARM9E tool chains

16
Everything you need
• EDA
– ARM will use its partnership with leading EDA vendors to enable ARM9E design simulation and co-simulation
• Consulting and training
– ARM provides hardware and software design support services and training for all of its products
• RTOS
– More than 25 RTOSs are already implemented on ARM
• Operating systems
– Symbian EPOC32, Windows CE, Linux, JavaOS

17
Vital statistics
• Both soft and hard macrocell implementations of ARM9E are planned
• ARM9TDMI is only 2.1mm² on 0.25µm
– Area increase of ARM9E is less than 30% over ARM9TDMI
• ARM9E will run at the same clock frequency as ARM9TDMI on the same process
– 160MHz initial implementation on a 0.25µm process
– 200MHz+ on a 0.18µm process
• ARM9E will be delivered to lead partners in Q3 with first silicon in Q4
18
ARM9E

A DSP enhanced ARM9TDMI core gives:
– single engine for both DSP and control code
– full support in ARM’s development and debug tools
– system cost and complexity savings
– faster time-to-market
– an excellent compiler target
– a great solution for high-volume cost sensitive applications

19
