William Stallings
Computer Organization
and Architecture
10th Edition
© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.
Chapter 18
Multicore Computers



Figure 18.1 Alternative Chip Organizations: (a) superscalar, (b) simultaneous multithreading, (c) multicore. [Diagram: each organization contains issue logic, an instruction fetch unit, execution units and queues, L1 instruction and data caches, and an L2 cache. The superscalar organization has a single program counter and a single-thread register file; the SMT organization replicates the program counter (PC 1 to PC n) and register file (Registers 1 to n) per thread; the multicore organization replicates whole superscalar or SMT cores (Core 1 to Core n), each with its own L1-I and L1-D caches, above a shared L2 cache.]
Figure 18.2 Power and Memory Considerations [Chart: power density (watts/cm²) on a logarithmic scale from 1 to 100, plotted against feature size from 0.25 µm down to 0.10 µm, with separate curves for power logic and for memory.]


Figure 18.3 Performance Effect of Multiple Cores: (a) speedup with 0%, 2%, 5%, and 10% sequential portions; (b) speedup with overheads (curves labeled 5%, 10%, 15%, and 20%). [Plots of relative speedup versus number of processors, from 1 to 8.]
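The curves in Figure 18.3(a) follow Amdahl's law. A minimal sketch of the calculation (assuming the standard form speedup = 1 / (f + (1 − f)/N) for a sequential fraction f on N processors; the exact overhead model behind panel (b) is not specified here):

```python
# Amdahl's law: relative speedup on n processors when a fraction f of the
# work is inherently sequential (0%, 2%, 5%, and 10% in Figure 18.3a).
def speedup(n_processors, sequential_fraction):
    f = sequential_fraction
    return 1.0 / (f + (1.0 - f) / n_processors)

for f in (0.0, 0.02, 0.05, 0.10):
    print(f, [round(speedup(n, f), 2) for n in range(1, 9)])
```

Even a 10% sequential portion caps the speedup on 8 processors at roughly 4.7x, which is why the upper curves in the figure flatten out.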


Figure 18.4 Scaling of Database Workloads on Multiple-Processor Hardware [Plot of scaling versus number of CPUs (0 to 64) for Oracle DSS 4-way join, TMC data mining, DB2 DSS scan & aggs, and Oracle ad hoc insurance OLTP workloads, shown against a perfect-scaling reference line.]
Effective Applications for Multicore Processors

• Multi-threaded native applications
  - Thread-level parallelism
  - Characterized by having a small number of highly threaded processes

• Multi-process applications
  - Process-level parallelism
  - Characterized by the presence of many single-threaded processes

• Java applications
  - Embrace threading in a fundamental way
  - The Java Virtual Machine is a multi-threaded process that provides scheduling and memory management for Java applications

• Multi-instance applications
  - If multiple application instances require some degree of isolation, virtualization technology can be used to provide each of them with its own separate and secure environment
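To make the first two categories concrete, here is a minimal Python sketch (illustrative only; the worker function and the counts are invented for the example) contrasting thread-level parallelism within a single highly threaded process and process-level parallelism across several single-threaded processes.

```python
# Illustrative sketch only: thread-level vs. process-level parallelism.
import threading
import multiprocessing

def worker(n):
    return sum(i * i for i in range(n))

def run_threads(num_threads=4):
    # One highly threaded process: several threads share one address space.
    threads = [threading.Thread(target=worker, args=(100_000,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def run_processes(num_procs=4):
    # Many single-threaded processes: each has its own address space.
    with multiprocessing.Pool(num_procs) as pool:
        pool.map(worker, [100_000] * num_procs)

if __name__ == "__main__":
    run_threads()
    run_processes()
```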



Figure 18.5 Hybrid Threading for Rendering Module [Diagram of the rendering module's work breakdown: Render covers the skybox, main view, monitor, etc., and builds a scene list; then, for each object, work such as particle simulation and drawing, and character bone setup and drawing, is carried out.]
Figure 18.6 Multicore Organization Alternatives: (a) dedicated L1 cache, (b) dedicated L2 cache, (c) shared L2 cache, (d) shared L3 cache. [Each diagram shows CPU cores 1 through n with L1-D and L1-I caches, the indicated L2/L3 cache arrangement, main memory, and I/O.]
Heterogeneous Multicore Organization

• Refers to a processor chip that includes more than one kind of core

• The most prominent trend is the use of both CPUs and graphics processing units (GPUs) on the same chip
  - This mix, however, presents issues of coordination and correctness

• GPUs are characterized by the ability to support thousands of parallel execution threads

• Thus, GPUs are well matched to applications that process large amounts of vector and matrix data
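To illustrate the kind of data parallelism referred to above, here is a minimal sketch (illustrative only; the function and data are made up, and this is plain Python, not GPU code): an elementwise vector operation in which every iteration is independent, so each element could in principle be handled by its own GPU execution thread.

```python
# Each output element depends only on the inputs at the same index,
# so all iterations could run in parallel -- one GPU thread per element.
def saxpy(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

print(saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))  # [12.0, 24.0, 36.0]
```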



Figure 18.7 Heterogeneous Multicore Chip Elements [Block diagram: CPU cores and GPU cores, each with their own caches, connected through an on-chip interconnection network to last-level caches and DRAM controllers.]
Table 18.1 Operating Parameters of AMD 5100K Heterogeneous Multicore Processor

                         CPU       GPU
Clock frequency (GHz)    3.8       0.8
Cores                    4         384
FLOPS/core               8         2
GFLOPS                   121.6     614.4

FLOPS = floating-point operations per second
FLOPS/core = number of parallel floating-point operations that can be performed


Heterogeneous System Architecture (HSA)

• Key features of the HSA approach include:
  - The entire virtual memory space is visible to both CPU and GPU
  - The virtual memory system brings in pages to physical main memory as needed
  - A coherent memory policy ensures that CPU and GPU caches both see an up-to-date view of data
  - A unified programming interface that enables users to exploit the parallel capabilities of the GPUs within programs that rely on CPU execution as well

• The overall objective is to allow programmers to write applications that exploit the serial power of CPUs and the parallel-processing power of GPUs seamlessly, with efficient coordination at the OS and hardware level
Figure 18.8 Texas Instruments 66AK2H12 Heterogeneous Multicore Chip [Block diagram: eight C66x DSP cores at 1.2 GHz, each with 32 kB L1 P-cache, 32 kB L1 D-cache, and 1 MB L2 cache; four ARM Cortex-A15 cores at 1.4 GHz, each with 32 kB L1 P-cache and D-cache, sharing a 4 MB L2 cache; a memory subsystem with 6 MB MSM SRAM, MSMC, and 72-bit DDR3 EMIFs; a Multicore Navigator (queue manager and packet DMA) and TeraNet interconnect; debug and trace, boot ROM, semaphore, power management, PLL, EDMA, and HyperLink blocks; peripheral interfaces (GPIO, UART, SRIO, EMIF16, PCIe, USB 3.0, SPI, I2C); security and packet accelerators; and a network coprocessor with an Ethernet switch and four 1GbE ports.]


Figure 18.9 big.LITTLE Chip Components [Block diagram: two Cortex-A15 cores and two Cortex-A7 cores, each pair with its own L2 cache; a GIC-400 global interrupt controller delivering interrupts to the cores; an I/O coherent master; and a CCI-400 cache coherent interconnect with memory controller ports and a system port.]
Figure 18.10 Cortex-A7 and Cortex-A15 Pipelines [(a) Cortex-A7: fetch and decode stages feeding a dual-issue stage that dispatches to integer, multiply, floating-point/NEON, and load/store units, followed by write back. (b) Cortex-A15: fetch (with a loop cache) and decode/rename/dispatch stages feeding issue queues that dispatch to two integer units plus multiply, floating-point/NEON, branch, load, and store units, followed by write back.]
Figure 18.11 Cortex-A7 and Cortex-A15 Performance Comparison [Plot of power versus performance showing the range from the lowest to the highest Cortex-A7 operating point and, at higher power and performance, the range from the lowest to the highest Cortex-A15 operating point.]
Cache Coherence

• When multiple caches exist, there is a need for a cache-coherence scheme to avoid access to invalid data

• Coherence may be addressed with software-based techniques, but the software burden consumes too many resources in an SoC

• There are two main approaches to hardware-implemented cache coherence:
  - Directory protocols
  - Snoopy protocols

• ACE (Advanced Extensible Interface Coherence Extensions)
  - Hardware coherence capability developed by ARM
  - Can be configured to implement either a directory or a snoopy approach
  - Designed to support a wide range of coherent masters with differing capabilities
  - Supports coherency between dissimilar processors, enabling ARM big.LITTLE technology
  - Supports I/O coherency for uncached masters, masters with differing cache line sizes, differing internal cache state models, and masters with write-back or write-through caches
Figure 18.12 ARM ACE Cache Line States [Diagram: a valid cache line is classified as Unique or Shared and as Clean or Dirty, giving Modified (unique, dirty), Owned (shared, dirty), Exclusive (unique, clean), and Shared (shared, clean); the remaining state is Invalid.]
Table 18.2 Comparison of States in Snoop Protocols

(a) MESI

State       Clean/Dirty   Unique?   Can write?   Can forward?   Comments
Modified    Dirty         Yes       Yes          Yes            Must write back to share or replace
Exclusive   Clean         Yes       Yes          Yes            Transitions to M on write
Shared      Clean         No        No           Yes            Shared implies clean, can forward
Invalid     N/A           N/A       N/A          N/A            Cannot read

(b) MOESI

State       Clean/Dirty   Unique?   Can write?   Can forward?   Comments
Modified    Dirty         Yes       Yes          Yes            Can share without write back
Owned       Dirty         Yes       Yes          Yes            Must write back to transition
Exclusive   Clean         Yes       Yes          Yes            Transitions to M on write
Shared      Either        No        No           No             Shared, can be dirty or clean
Invalid     N/A           N/A       N/A          N/A            Cannot read

(Table can be found on page 676 in the textbook.)
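As a rough illustration of how the MESI states in Table 18.2(a) are used, here is a deliberately simplified Python sketch (a toy model under stated assumptions, not ARM's or Intel's implementation) of the states a single cache line moves through on local reads and writes and on observed (snooped) remote accesses.

```python
# Toy MESI model for ONE cache line in ONE cache (illustrative only).
# "other_sharers" on a local read stands in for the snoop response that
# tells us whether another cache already holds the line.
MODIFIED, EXCLUSIVE, SHARED, INVALID = "M", "E", "S", "I"

def on_local_read(state, other_sharers):
    if state == INVALID:
        return SHARED if other_sharers else EXCLUSIVE
    return state  # M, E, and S all satisfy the read without a state change

def on_local_write(state):
    # Writing requires ownership; from S or I the other copies are first
    # invalidated (read-for-ownership / upgrade), then the line is Modified.
    return MODIFIED

def on_snooped_read(state):
    # Another cache reads the line: an exclusive or modified copy becomes
    # Shared (a Modified line is written back / forwarded first).
    return SHARED if state in (MODIFIED, EXCLUSIVE, SHARED) else INVALID

def on_snooped_write(state):
    # Another cache gains ownership to write: our copy is invalidated.
    return INVALID

state = INVALID
state = on_local_read(state, other_sharers=False)  # -> E
state = on_local_write(state)                      # -> M
state = on_snooped_read(state)                     # -> S
print(state)                                       # S
```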
Figure 18.13 Intel Core i7-990X Block Diagram [Six cores (Core 0 to Core 5), each with 32 kB L1-I and 32 kB L1-D caches and a dedicated 256 kB L2 cache, sharing a 12 MB L3 cache; DDR3 memory controllers (3 channels × 8 bytes at 1.33 GT/s) and a QuickPath Interconnect (4 links × 20 bits at 6.4 GT/s).]
Figure 18.14 ARM Cortex-A15 MPCore Chip Block Diagram [Four cores (Core 0 to Core 3), each with L1 instruction and data caches and TLBs; a snoop control unit (SCU) and shared L2 cache making up the L2 memory system; plus a generic interrupt controller (GIC), generic timer, and debug and trace interfaces handling interrupts and timer events.]
Interrupt Handling

The generic interrupt controller (GIC) provides:

• Masking of interrupts
• Prioritization of the interrupts
• Distribution of the interrupts to the target A15 cores
• Tracking the status of interrupts
• Generation of interrupts by software

The GIC:

• Is memory mapped
• Is a single functional unit placed in the system alongside the A15 cores
  - This enables the number of interrupts supported in the system to be independent of the A15 core design
• Is accessed by the A15 cores using a private interface through the SCU
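Two of the GIC functions listed above, prioritization and distribution, can be pictured with a small Python sketch (a hypothetical toy model; the data structures are assumptions, with lower numbers meaning higher priority, as in the GIC):

```python
# Toy model of GIC prioritization and distribution (illustrative only).
# Each pending interrupt carries an ID, a priority (lower value = higher
# priority), and the set of target cores it is routed to.
from dataclasses import dataclass

@dataclass
class PendingInterrupt:
    int_id: int
    priority: int
    targets: set  # core numbers this interrupt may be delivered to

def top_priority_for_core(core, pending, mask):
    """Pick the highest-priority unmasked pending interrupt targeting `core`."""
    candidates = [p for p in pending
                  if core in p.targets and p.int_id not in mask]
    return min(candidates, key=lambda p: p.priority, default=None)

pending = [
    PendingInterrupt(34, priority=0xA0, targets={0, 1}),
    PendingInterrupt(35, priority=0x20, targets={1}),      # higher priority
    PendingInterrupt(36, priority=0x60, targets={2, 3}),
]
print(top_priority_for_core(1, pending, mask=set()).int_id)  # 35
```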



GIC

Designed to satisfy two functional requirements:
• Provide a means of routing an interrupt request to a single CPU or multiple CPUs as required
• Provide a means of interprocessor communication so that a thread on one CPU can cause activity by a thread on another CPU

Can route an interrupt to one or more CPUs in the following three ways:
• An interrupt can be directed to a specific processor only
• An interrupt can be directed to a defined group of processors
• An interrupt can be directed to all processors



Interrupts can be:

• Inactive
  - One that is nonasserted, or which in a multiprocessing environment has been completely processed by that CPU but can still be either Pending or Active in some of the CPUs to which it is targeted, and so might not have been cleared at the interrupt source

• Pending
  - One that has been asserted, and for which processing has not started on that CPU

• Active
  - One that has been started on that CPU, but processing is not complete
  - Can be pre-empted when a new interrupt of higher priority interrupts A15 core interrupt processing

Interrupts come from the following sources:
• Interprocessor interrupts (IPIs)
• Private timer and/or watchdog interrupts
• Legacy FIQ lines
• Hardware interrupts



Figure 18.15 Interrupt Distributor Block Diagram [The distributor maintains an interrupt list with per-interrupt priority status. A decoder handles read/write accesses arriving over the private bus, along with core acknowledge and end-of-interrupt (EOI) information from the CPU interfaces. Prioritization and selection logic passes the top-priority interrupt number and its priority to the CPU interface of each A15 core (cores 0 to 3), which raises the IRQ request to its core.]
Cache Coherency

• The Snoop Control Unit (SCU) resolves most of the traditional bottlenecks related to access to shared data and the scalability limitation introduced by coherence traffic

• The L1 cache coherency scheme is based on the MESI protocol

• Direct Data Intervention (DDI)
  - Enables copying clean data between L1 caches without accessing external memory
  - Reduces read-after-write traffic from L1 to L2
  - Can resolve a local L1 miss from a remote L1 rather than from L2

• Duplicated tag RAMs (see the sketch after this list)
  - Cache tags implemented as a separate block of RAM, with the same length as the number of lines in the cache
  - The duplicates are used by the SCU to check data availability before sending coherency commands, so commands are sent only to the CPUs that must update their coherent data cache

• Migratory lines
  - Allow moving dirty data between CPUs without writing to L2 and reading back from external memory
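A minimal Python sketch of the duplicated-tag idea mentioned above (a hypothetical toy model; the data structures and helper names are assumptions, not the Cortex-A15 implementation): the SCU consults its own copy of each core's L1 tags so that a coherency command is sent only to cores that actually hold the line.

```python
# Toy model of an SCU duplicate-tag filter (illustrative only).
# duplicate_tags[core] is the SCU's copy of that core's L1 tag array,
# held as a set of cached line addresses for simplicity.
def cores_to_snoop(line_addr, requesting_core, duplicate_tags):
    """Return only the cores whose duplicated tags show they hold the line."""
    return [core for core, tags in duplicate_tags.items()
            if core != requesting_core and line_addr in tags]

duplicate_tags = {
    0: {0x1000, 0x2000},
    1: {0x2000},
    2: set(),
    3: {0x3000},
}
# Core 0 accesses line 0x2000: only core 1 needs a coherency command.
print(cores_to_snoop(0x2000, requesting_core=0,
                     duplicate_tags=duplicate_tags))  # [1]
```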



Figure 18.16 IBM zEC12 Processor Node Structure [The multichip module (MCM) contains six processor units (PU0 to PU5), each with six cores, and two storage control chips (SC0 and SC1). Each PU connects to a memory control unit (MCU) and a host channel adapter (HCA); the SCs provide fabric book connectivity (FBC0 to FBC2) to other nodes. FBC = fabric book connectivity; HCA = host channel adapter; MCM = multichip module; MCU = memory control unit; PU = processor unit; SC = storage control.]
Figure 18.17 IBM zEC12 Cache Hierarchy [Within each six-core PU (PU0 to PU5), every core has an L1 (64 kB I-cache, 96 kB D-cache) and an L2 (1 MB I-cache, 1 MB D-cache); the cores of a PU share a 48 MB L3 cache; the two storage control chips (SC0, SC1) each provide a 192 MB L4 cache.]
Summary
Chapter 18: Multicore Computers

• Hardware performance issues
  - Increase in parallelism and complexity
  - Power consumption
• Software performance issues
  - Software on multicore
  - Valve game software example
• Multicore organization
  - Levels of cache
  - Simultaneous multithreading
• Heterogeneous multicore organization
  - Different instruction set architectures
  - Equivalent instruction set architectures
  - Cache coherence and the MOESI model
• Intel Core i7-990X
• ARM Cortex-A15 MPCore
  - Interrupt handling
  - Cache coherency
  - L2 cache coherency
• IBM zEnterprise EC12 mainframe
  - Organization
  - Cache structure
