PART18
William Stallings
Computer Organization
and Architecture
10th Edition
© 2016 Pearson Education, Inc., Hoboken,
NJ. All rights reserved.
Chapter 18
Multicore Computers
[Figure: chip organization panels. (a) Superscalar. (c) Multicore: cores 1 to n (each superscalar or SMT), each with its own L1-I and L1-D cache, sharing an L2 cache. Other labels: issue logic, registers 1 to n, PC 1 to n, instruction fetch unit, execution units and queues, L2 cache.]
[Figure: logic and memory curves plotted against feature size (µm), at 0.25, 0.18, 0.13, and 0.10.]
[Figure: relative speedup versus number of processors (1 to 8). (a) Speedup with 0%, 2%, 5%, and 10% sequential portions. (b) Speedup with overheads (curves labeled 5%, 10%, 15%, and 20%).]
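The curves in panel (a) follow Amdahl's law: with a sequential fraction s, the speedup on N processors is 1 / (s + (1 - s) / N). The following is a minimal sketch of that formula only; the function name `amdahl_speedup` and its parameter names are illustrative and do not come from the slides, and the overheads of panel (b) are not modeled.

```python
def amdahl_speedup(sequential_fraction: float, processors: int) -> float:
    """Speedup predicted by Amdahl's law, ignoring all overheads."""
    if not 0.0 <= sequential_fraction <= 1.0:
        raise ValueError("sequential_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    parallel_fraction = 1.0 - sequential_fraction
    return 1.0 / (sequential_fraction + parallel_fraction / processors)


# Sequential portions taken from panel (a); processor counts 1 to 8.
for s in (0.0, 0.02, 0.05, 0.10):
    row = [round(amdahl_speedup(s, n), 2) for n in range(1, 9)]
    print(f"{s:4.0%}: {row}")
```

With a 10% sequential portion, eight processors give a speedup of roughly 4.7 rather than 8, which is why even a small sequential portion matters.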
[Figure: scaling versus number of CPUs (0 to 64), with a perfect-scaling reference line.]
Multi-process applications
• Process-level parallelism
• Characterized by the presence of many single-threaded processes
Java applications
• Embrace threading in a fundamental way
• Java Virtual Machine is a multi-threaded process that provides scheduling and memory management for Java applications
Multi-instance applications
• If multiple application instances require some degree of isolation, virtualization technology can be used to provide each of them with its own separate and secure environment
[Figure: diagram with the labels Scene List, Particles, Character, Bone Setup, Draw, Etc.]
[Figure: cache organization panels. (a) Dedicated L1 cache. (b) Dedicated L2 cache. Labels: L2 cache, main memory, I/O.]
[Figure: two DRAM controllers and two last-level caches.]
                         CPU      GPU
Clock frequency (GHz)    3.8      0.8
Cores                    4        384
FLOPS/core               8        2
GFLOPS                   121.6    614.4
FLOPS = floating point operations per second
FLOPS/core = number of parallel floating point operations that can be performed
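The GFLOPS row can be reproduced from the other three rows as clock frequency (GHz) × cores × FLOPS/core. A minimal sketch of that arithmetic follows; the function name `peak_gflops` is an illustrative assumption, and the inputs are the CPU and GPU values listed above.

```python
def peak_gflops(clock_ghz: float, cores: int, flops_per_core: int) -> float:
    """Peak GFLOPS = clock (GHz) x cores x parallel floating-point ops per core."""
    return clock_ghz * cores * flops_per_core


cpu = peak_gflops(3.8, 4, 8)      # CPU column
gpu = peak_gflops(0.8, 384, 2)    # GPU column
print(f"CPU: {cpu:.1f} GFLOPS, GPU: {gpu:.1f} GFLOPS")
# prints "CPU: 121.6 GFLOPS, GPU: 614.4 GFLOPS"
```

The GPU reaches the higher peak despite a much lower clock frequency because it has far more cores.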
[Figure: chip block diagram. Labels: Multicore Navigator (Queue Manager, Packet DMA); Network Coprocessor (5-Port Ethernet Switch, Security Accelerator, Packet Accelerator, four 1GBE ports); GPIO x32; 2x UART; SRIO x4; EMIF16; PCIe x2; USB 3.0; 3x SPI; 3x I2C.]
[Figure: processor pipeline diagrams. Labels: Dual Issue, Integer, Multiply, Load/Store, Loop Cache, Floating-Point/NEON, Branch, Load, Store.]
[Figure: power versus performance.]
[Figure: cache line states. (a) MESI. (b) MOESI. State labels: Modified, Owned, Exclusive, Shared, Invalid; also Clean.]
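One way to see what the Owned state adds is to look at what happens to a cache line when another core's read is snooped. The sketch below is a simplified, textbook-style model and not the behavior of any particular processor; the names `State` and `on_snooped_read` and the returned pair are assumptions made for illustration.

```python
from enum import Enum
from typing import Tuple


class State(Enum):
    MODIFIED = "M"
    OWNED = "O"       # exists only in MOESI
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def on_snooped_read(state: State, protocol: str) -> Tuple[State, bool]:
    """Return (next_state, writes_back_to_memory) when another core reads the line.

    Simplified model: real implementations differ in detail.
    """
    if protocol not in ("MESI", "MOESI"):
        raise ValueError("protocol must be 'MESI' or 'MOESI'")
    if state is State.OWNED and protocol == "MESI":
        raise ValueError("MESI has no Owned state")
    if state is State.MODIFIED:
        if protocol == "MOESI":
            # Keep the dirty data, supply it cache-to-cache, become its owner.
            return State.OWNED, False
        # MESI: dirty data is written back before the line becomes shared.
        return State.SHARED, True
    if state is State.OWNED:
        return State.OWNED, False    # owner keeps supplying the dirty data
    if state is State.EXCLUSIVE:
        return State.SHARED, False   # clean line, nothing to write back
    return state, False              # Shared and Invalid are unchanged


print(on_snooped_read(State.MODIFIED, "MESI"))
print(on_snooped_read(State.MODIFIED, "MOESI"))
```

In this model the only difference between the two protocols is the Modified case: MOESI shares the dirty line without a write-back.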
[Figure: six cores, each with a 32 kB L1-I cache and a 32 kB L1-D cache, sharing a 12 MB L3 cache.]
[Figure: multicore chip block diagram. Labels: Debug Unit & Trace Interface; Generic Timer; GIC; four cores, each with L1 TLBs, L1 ICache, and L1 DCache; Snoop Control Unit; L2 cache; L2 memory system.]
• Masking of interrupts
• Prioritization of the interrupts
• Distribution of the interrupts to the target A15 cores
• Tracking the status of interrupts
• Generation of interrupts by software
GIC
• Is memory mapped
• Is a single functional unit that is placed in the system alongside A15 cores
• This enables the number of interrupts supported in the system to be independent of the A15 core design
• Is accessed by the A15 cores using a private interface through the SCU
Pending
An interrupt that has been asserted, and for which processing has not started on that CPU
Active
An interrupt that has been started on that CPU, but whose processing is not complete
Can be pre-empted when a new interrupt of higher priority interrupts A15 core interrupt processing
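The Pending and Active states, and pre-emption by a higher-priority interrupt, can be sketched as a toy per-CPU model. This is not the GIC's register interface: the class name, its methods, and the convention that a lower number means higher priority are all assumptions made for illustration, and masking and distribution to target cores are not modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class ToyInterruptController:
    """Toy model of interrupt states for one CPU (hypothetical, not the GIC API)."""
    priorities: Dict[int, int]                        # interrupt id -> priority
    pending: Set[int] = field(default_factory=set)    # asserted, not started
    active: List[int] = field(default_factory=list)   # started, not complete

    def assert_irq(self, irq: int) -> None:
        """Mark an interrupt as Pending unless it is already being processed."""
        if irq not in self.active:
            self.pending.add(irq)

    def dispatch(self) -> Optional[int]:
        """Move the highest-priority Pending interrupt to Active if allowed."""
        if not self.pending:
            return None
        # Assumption: a lower number means a higher priority.
        best = min(self.pending, key=lambda i: self.priorities[i])
        running = self.active[-1] if self.active else None
        # Pre-empt only when strictly higher priority than the running one.
        if running is not None and self.priorities[best] >= self.priorities[running]:
            return None
        self.pending.remove(best)
        self.active.append(best)
        return best

    def complete(self) -> int:
        """Finish processing of the most recently started interrupt."""
        return self.active.pop()


gic = ToyInterruptController(priorities={1: 10, 2: 5})
gic.assert_irq(1)
gic.dispatch()       # interrupt 1: Pending -> Active
gic.assert_irq(2)
gic.dispatch()       # interrupt 2 pre-empts interrupt 1
print(gic.active)    # both are Active; interrupt 2 is the one running
```

Keeping the Active interrupts on a stack means that completing the pre-empting interrupt resumes the one it interrupted.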
[Figure: interrupt handling block diagram. Labels: Decoder, Priority, Status, Interrupt number, Interrupt list, A15 Core 0.]
Migratory lines
Allows moving dirty data between CPUs without writing to L2 and reading back from external memory
[Figure: MCM with PU0, PU1, and PU2 (6 cores each); SC0 and SC1; FBC0, FBC1, and FBC2 on each side; two 48 MB L3 caches and two 192 MB L4 caches.]