Alpha 364 Architecture and HT Protocol

The Alpha 21364, code-named "Marvel", also known as EV7, is a microprocessor developed by Digital Equipment Corporation (DEC) and later by Compaq.

Alpha 364 Architecture

Introduction:
The Alpha 21364, code-named "Marvel", also known as Alpha 364
and EV7, is a microprocessor developed by Digital Equipment
Corporation (DEC), later Compaq Computer Corporation, that
implemented the Alpha instruction set architecture (ISA).
The Alpha 21364 processor provides a high-performance, highly
scalable, and highly reliable network architecture. The router runs at
1.2 GHz and routes packets at a peak bandwidth of 22.4 GB/s. The
network architecture scales up to a 128-processor configuration,
which can support up to four terabytes of distributed Rambus
memory and hundreds of terabytes of disk storage. The distributed
Rambus memory is kept coherent via a scalable, directory-based,
cache coherence scheme.
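As a quick sanity check on these figures, the per-node share of the maximum configuration can be derived from the totals above (the per-node number is computed here, not stated in the source):

```python
# Derived per-node figure for a maximal 21364 system, from the
# totals quoted above (128 processors, 4 TB of distributed Rambus memory).
PROCESSORS = 128
TOTAL_MEMORY_TB = 4

# 4 TB spread evenly across 128 nodes -> 32 GB of local Rambus per node.
memory_per_node_gb = TOTAL_MEMORY_TB * 1024 // PROCESSORS

print(memory_per_node_gb)  # 32
```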
The network also provides a variety of reliability features, such as
per-flit ECC. These features make the 21364 network architecture
well-suited to support communication-intensive server applications.
The main goal of the EV7 was to achieve high memory bandwidth and
low memory latency by incorporating two on-chip RDRAM memory
controllers and a very large 1.5 MB L2 cache.
A second key goal for the processor was scalability.
The EV7's memory bandwidth scales with the addition of more
processors.
Alpha Roadmap
(Figure: lower cost vs. higher performance, 1995-2001, in CMOS
processes from 0.5 µm down to 0.13 µm.)
EV5/333 (21164), EV56/600 (21164), PCA56/533 (21164PC),
PCA57/600 (21164PC), EV6/575 (21264), EV67/750 (21264),
EV68/1000 (21264), EV7/1000 (21364), EV8.
Each generation brought higher integration, higher MHz, or a new core.
(Estimated time for TPC-C.)

21364 Chip Block Diagram

(Figure: a 21264 core with 64K Icache and 64K Dcache; 16 L1 miss
buffers and 16 L1 victim buffers; an on-chip L2 cache with 16 L2
victim buffers; an integrated RAMBUS memory controller with
address-in and address-out paths; and a network interface with
N, S, E, W, and I/O ports.)
We will start with the 21264 core. The number of outstanding
cache block fills will be increased from 8 to 16.
Misses to the L1 caches will first access the L2 cache. Data
will be returned on a 128-byte-wide bus.
References that miss the L2 cache will access the local
memory and return data to the core. Memory locations not
located in the local memory will access the network.
The integrated network interface will route the request to
the appropriate node in the network using one of the 4
ports (N, S, E, and W).
The 21264 core's 8-entry victim buffer is currently used for
both L1 and L2 victims. The new design will increase the
size of the victim buffer to 16 x 64-byte blocks for L1-to-L2
victims. A new 16 x 64-byte victim buffer will be used to
hold victims leaving the L2 cache for the local memory or
the network.
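The lookup cascade described above (L1, then L2, then local memory, then the network) can be sketched as follows. This is a minimal illustrative model with hypothetical names, not the actual hardware logic; caches are modeled as plain dictionaries and the network as an object with a `fetch` method:

```python
# Sketch of the 21364 miss path: L1 miss -> L2 -> local Rambus
# memory -> network. Caches are dicts mapping address -> block;
# `network` is any object with a fetch(addr) method (hypothetical).
def service_load(addr, l1, l2, local_memory, network):
    """Return (data, level_serviced) for a load to addr."""
    if addr in l1:                      # L1 hit: done
        return l1[addr], "L1"
    if addr in l2:                      # L1 miss: filled from L2
        l1[addr] = l2[addr]             # (over the 128-byte-wide bus)
        return l2[addr], "L2"
    if addr in local_memory:            # L2 miss: integrated memory controller
        l2[addr] = local_memory[addr]
        l1[addr] = local_memory[addr]
        return local_memory[addr], "local-memory"
    # Not in local memory: route over one of the N/S/E/W network ports.
    data = network.fetch(addr)
    l2[addr] = data
    l1[addr] = data
    return data, "network"
```

The victim buffers would sit on the eviction paths (L1 to L2, and L2 to memory or network), which this sketch omits for brevity.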

Here is the block diagram of a 12-processor system using
the 2D torus topology.
Each processor may have its own local memory and may
have its own local I/O connection.
It is possible for a processor to operate in the system
without memory or I/O if that is attractive.
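The defining property of the 2D torus is that every node has exactly four neighbors (N, S, E, W), with links wrapping around at the edges. A small sketch of the addressing (the 4x3 grid dimensions are an assumption for the 12-processor example; the source does not give them):

```python
# Neighbor addressing in a 2D torus of cols x rows nodes.
# Edge links wrap around, so every node has all four neighbors.
def torus_neighbors(x, y, cols, rows):
    return {
        "N": (x, (y - 1) % rows),
        "S": (x, (y + 1) % rows),
        "E": ((x + 1) % cols, y),
        "W": ((x - 1) % cols, y),
    }

# Corner node (0, 0) of a 4x3 torus wraps west to column 3 and
# north to row 2.
print(torus_neighbors(0, 0, 4, 3))
```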

Integrated Memory Controller

The chip contains an integrated Direct Rambus memory controller. Direct Rambus provides high data
capacity per pin along with outstanding bandwidth and latency. The pin-to-pin delay for a page hit in the
RDRAM is 30 ns.
The memory controller will provide 6 GB/sec of read or write bandwidth to the core. At 2 GFLOPs, the chip
provides 3 bytes/FLOP of usable memory bandwidth, a significant improvement over current systems.
To reduce memory latency, the memory controller will track 100s of open pages in the RDRAM array.
A directory-based cache coherence protocol is an integral part of the memory controller.
The memory is protected by a single error correct, double error detect ECC code.
The EV7 contains two integrated Direct Rambus (RDRAM) memory controllers. Direct Rambus provides the
highest data rate per pin, with outstanding bandwidth and good access latency.
Direct Rambus:
High data capacity per pin
800 MHz operation
30 ns CAS latency pin to pin
6 GB/sec read or write bandwidth
100s of open pages
Directory-based cache coherence
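The bytes-per-FLOP claim above follows directly from the two quoted figures:

```python
# 6 GB/s of memory controller bandwidth against a 2 GFLOP/s peak
# gives 3 bytes of memory bandwidth per FLOP, as stated above.
read_write_bw_gb_s = 6.0   # memory controller read or write bandwidth
peak_gflops = 2.0          # chip peak floating-point rate

bytes_per_flop = read_write_bw_gb_s / peak_gflops
print(bytes_per_flop)  # 3.0
```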

Integrated L2 Cache

The 1.5 MB, 6-way set-associative L2 cache can read or write 16
bytes/cycle at 1 GHz, resulting in 16 GB/second of read or write
bandwidth. The L2 cache has a 12-cycle load-to-use latency. This
latency is set by the existing control in the core and is used to
significantly reduce the power consumption of the L2 array. The
array is protected by a single error correct, double error detect
ECC code. Errors are corrected on the fly in hardware.
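The L2 figures are mutually consistent, as a quick check shows:

```python
# A 16-byte port clocked at 1 GHz moves 16 GB/s, and a 12-cycle
# load-to-use latency at 1 GHz is 12 ns.
bytes_per_cycle = 16
clock_ghz = 1.0

l2_bandwidth_gb_s = bytes_per_cycle * clock_ghz   # bytes/cycle * Gcycles/s = GB/s
load_to_use_ns = 12 / clock_ghz                   # cycles / (Gcycles/s) = ns

print(l2_bandwidth_gb_s, load_to_use_ns)  # 16.0 12.0
```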

Integrated Network Interface

The integrated network interface allows multiprocessor
systems to be built using a 2D torus topology. Each node is
capable of moving 10 GB/second. Each hop in the network
will take an average of 15 ns.
The network moves data and control packets from the
source to the destination. It does not guarantee ordering.
Adaptive routing of packets allows the network to detect and
avoid hot spots.
Asynchronous clocking between processors removes the
need to distribute a low-skew clock within a large system.
A fifth port provides up to 3 GB/sec of bandwidth to
industry-standard buses: PCI, PCI-X, AGP, and ServerNet, to
name a few.
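The 15 ns per-hop figure lets us estimate average network traversal time for a large system. A rough sketch (the 16x8 arrangement of a 128-processor system is an assumption, and the n/4 average-distance formula holds for even-sided torus rings):

```python
# Average hop count in a cols x rows torus: in a ring of even size n,
# the mean distance to a uniformly random destination is n/4, and the
# two dimensions are independent.
def avg_hops_torus(cols, rows):
    return cols / 4 + rows / 4

HOP_NS = 15  # average per-hop latency quoted above

# A 128-processor system arranged as a 16x8 torus (assumed dimensions):
hops = avg_hops_torus(16, 8)       # 6.0 hops on average
print(hops * HOP_NS)               # average traversal in ns
```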

Alpha 21364 Technology

Design Specifications
0.18 µm CMOS
1000+ MHz
100 watts @ 1.5 volts
3.5 cm2
6-layer metal
100 million transistors
8 million logic
92 million RAM

HyperTransport Protocol

Introduction
HyperTransport (HT) is a technology for
interconnection of computer processors. It is a
bidirectional serial/parallel high-bandwidth,
low-latency point-to-point link that was
introduced on April 2, 2001.
The HyperTransport Consortium is in charge of
promoting and developing HyperTransport
technology.
HyperTransport is best known as the system
bus architecture of modern AMD central
processing units (CPUs) and the associated
Nvidia nForce motherboard chipsets.
HyperTransport has also been used by IBM
and Apple for the Power Mac G5 machines, as
well as a number of modern MIPS systems.

Links and rates

HyperTransport comes in four versions (1.x, 2.0, 3.0, and 3.1)
which run from 200 MHz to 3.2 GHz. It is also a DDR or "double
data rate" connection, meaning it sends data on both the rising
and falling edges of the clock signal. This allows for a maximum
data rate of 6400 MT/s when running at 3.2 GHz. The operating
frequency is auto-negotiated with the motherboard chipset (North
Bridge) in current computing.
HyperTransport supports an auto-negotiated bit width, ranging
from 2 to 32 bits per link; there are two unidirectional links per
HyperTransport bus. With the advent of version 3.1, using full
32-bit links at the full HyperTransport 3.1 specification's
operating frequency, the theoretical transfer rate is 25.6 GB/s
(3.2 GHz x 2 transfers per clock cycle x 32 bits per link) per
direction, or 51.2 GB/s aggregated throughput, making it faster
than most existing bus standards for PC workstations and servers,
as well as most bus standards for high-performance computing
and networking.
Links of various widths can be mixed together in a single system
configuration, as in one 16-bit link to another CPU and one 8-bit
link to a peripheral device. This allows for a wider interconnect
between CPUs and a lower-bandwidth interconnect to peripherals
as appropriate. It also supports link splitting, where a single
16-bit link can be divided into two 8-bit links.

Packet-Orientation
HyperTransport is packet-based, where each packet consists
of a set of 32-bit words, regardless of the physical width of
the link. The first word in a packet always contains a
command field. Many packets contain a 40-bit address. An
additional 32-bit control packet is prepended when 64-bit
addressing is required. The data payload is sent after the
control packet. Transfers are always padded to a multiple of
32 bits, regardless of their actual length.
HyperTransport packets enter the interconnect in segments
known as bit times. The number of bit times required depends
on the link width. HyperTransport also supports system
management messaging, signaling interrupts, issuing probes
to adjacent devices or processors, I/O transactions, and
general data transactions.
There are two kinds of write commands supported: posted
and non-posted. Posted writes do not require a response from
the target. This is usually used for high-bandwidth devices
such as uniform memory access traffic or direct memory
access transfers. Non-posted writes require a response from
the receiver in the form of a "target done" response. Reads
also require a response, containing the read data.
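The padding and bit-time rules described above can be sketched with two small helpers (illustrative names; the exact framing details are in the HyperTransport specification):

```python
# Payloads are padded to a multiple of 32 bits; the number of "bit
# times" to move a packet depends on the physical link width, since
# each bit time transfers link_width bits across the link.
def padded_bits(payload_bits):
    # Round up to the next multiple of 32 bits.
    return (payload_bits + 31) // 32 * 32

def bit_times(packet_bits, link_width_bits):
    return packet_bits // link_width_bits

# Example: a 40-bit field pads to 64 bits, which takes 8 bit times
# on an 8-bit link but only 2 bit times on a 32-bit link.
print(padded_bits(40), bit_times(padded_bits(40), 8))  # 64 8
```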

Frequency Specifications

Implementations
AMD AMD64 and Direct Connect Architecture based CPUs
SiByte MIPS CPUs from Broadcom
PMC-Sierra RM9000X2 MIPS CPU
Raza Thread Processors
Loongson-3 MIPS processor
ht_tunnel from OpenCores project (MPL licence)
ATI Radeon Xpress 200 for AMD Processor
Nvidia nForce chipsets
nForce Professional MCPs (Media and Communication Processor)
nForce 4 series
nForce 500 series
nForce 600 series
nForce 700 series
ServerWorks (now Broadcom) HyperTransport SystemI/O Controllers
HT-2000
HT-2100
The IBM CPC925 and CPC945 PowerPC 970 northbridges, as co-designed and used by Apple in the Power Mac G5[6]
Several open source cores from the HyperTransport Center of
Excellence
