0% found this document useful (0 votes)

4 views65 pages

02 Architecture

The lecture by Professor Ken Birman covers the evolution and architecture of modern computers, emphasizing the transition from single-core to multicore and NUMA architectures. It discusses the complexities of modern systems, the role of operating systems in managing hardware, and the impact of parallelism on software design. Key topics include the history of Intel processors, machine programming basics, and the importance of understanding architecture for effective programming.

Uploaded by

Mubarek Ramazan

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

4 views65 pages

02 Architecture

Uploaded by

Mubarek Ramazan

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 65

THE EVOLUTION AND ARCHITECTURE Professor Ken Birman

OF MODERN COMPUTERS CS4414 Lecture 2

CORNELL CS4414 - FALL 2021. 1

IDEA MAP FOR TODAY

Computers are multicore

Individual CPUs don’t make this NUMA Compiled languages are
NUMA machines capable
dimension obvious. The whole idea is translated to machine language.
of many forms of parallelism.
that if you don’t want to know, you can Understanding this mapping will allow us to
They are extremely complex
ignore the presence of parallelism make far more effective use of the machine.
and sophisticated.

CORNELL CS4414 - FALL 2021. 2

WHAT’S INSIDE? ARCHITECTURE = COMPONENTS
OF A COMPUTER + OPERATING SYSTEM
Registers Registers
CPU (L1 cache) CPU (L1 cache)

L2 Cache L2 Cache
Core A BIGL3PILE
Cache
OF Core
HARDWARE
Memory Bus
REQUIRING A(DRAM)
Memory Unit LOT OF
HIGHLY SKILLED
SSD CARE AND FEEDING! 100G
storage Ethernet

PCIe Bus
CORNELL CS4414 - FALL 2021. 3
WHAT’S INSIDE? ARCHITECTURE = COMPONENTS
OF A COMPUTER + OPERATING SYSTEM
Registers Registers
CPU (L1 cache) CPU (L1 cache)

L2 Cache L2 Cache
Core L3 Cache Core
Memory Bus
Memory Unit (DRAM)

SSD 100G
storage Ethernet

PCIe Bus
CORNELL CS4414 - FALL 2021. 4
WHAT’S INSIDE? ARCHITECTURE = COMPONENTS
OF A COMPUTER + OPERATING SYSTEM

Operating System Process you

launched by
File System running some
program
Network
Bash shell

Job of the operating system (e.g. Linux) is to manage the

hardware and offer easily used, efficient abstractions that hide
details where feasible
CORNELL CS4414 - FALL 2021. 5
ARCHITECTURES ARE CHANGING RAPIDLY!
As an undergraduate (in the late 1970’s) I programmed a DEC
PDP 11/70 computer:
 A CPU (~1/2 MIPS), main memory (4MB)
 A storage device (8MB rotational magnetic disk), tape drive
 I/O devices (mostly a keyboard with a printer).

At that time this cost about $100,000

CORNELL CS4414 - FALL 2021. 6
ARCHITECTURES ARE CHANGING RAPIDLY!
Bill Gates:
“640K ought to be
As an undergraduate (in the late 1970’s) I programmed a DEC
enough computer:
PDP 11/70 for anybody.”
 A CPU (~1/2 MIPS), main memory (4MB)
 A storage device (8MB rotational magnetic disk), tape drive
 I/O devices (mostly a keyboard with a printer).

At that time this cost about $100,000

CORNELL CS4414 - FALL 2021. 7
TODAY: MACHINE PROGRAMMING I: BASICS
History of Intel processors and architectures
Assembly Basics: Registers, operands, move
Arithmetic & logical operations
C/C++, assembly, machine code

CORNELL CS4414 - FALL 2021. 8

MODERN COMPUTER: DELL R-740: $2,600
2 Intel Xenon chips with 28 “hyperthreaded” cores running at 1GIPS
(clock rate is 3Ghz)

Up to 3 TB of memory, multiple levels of memory caches

All sorts of devices accessible directly or over the network

NVIDIA Tesla T4 GPU: adds $6,000, peaks at 269 TFLOPS

CORNELL CS4414 - FALL 2021. 9

One CPU core actually
MODERN COMPUTER: DELL R-740: $2,600
runs two programs at
the same time
2 Intel Xenon chips with 28 “hyperthreaded” cores running at 1GIPS
(clock rate is 3Ghz)

Up to 3 TB of memory, multiple levels of memory caches

All sorts of devices accessible directly or over the network

NVIDIA Tesla T4 GPU: adds $6,000, peaks at 269 TFLOPS

CORNELL CS4414 - FALL 2021. 10

INTEL XENON NVIDIA TESLA

The GPU has so many cores that a photo of the chip is

pointless. Instead they draw graphics like these to help
you visualize ways of using hundreds of cores to process
a tensor (the “block” in the middle) in parallel!

Each core is like a little computer, talking to the others

over an on-chip network (the CMS)
CORNELL CS4414 - FALL 2021. 11
HOW DID WE GET HERE?
In the early years of computing, we went from machines built from
distinct electronic components (earliest generations) to ones built
from integrated circuits with everything on one chip.

Quickly, people noticed that each new generation of computer

had roughly double the capacity of the previous one and could run
roughly twice as fast! Gordon Moore proposed this as a “law”.

CORNELL CS4414 - FALL 2021. 12

BUT BY 2006 MOORE’S LAW
SEEMED TO BE ENDING

CORNELL CS4414 - FALL 2021. 13

WHAT ENDED MOORE’S LAW?
To run a chip at higher and higher speeds, we If you overclock your
use a faster clock rate and keep more of the desktop this can happen…
circuitry busy.

Computing is a form of “work” and work generates heat… as

roughly the square of the clock rate.

Chips began to fail. Some would (literally) melt or catch fire!

CORNELL CS4414 - FALL 2021. 14
BUT PARALLELISM SAVED US!
A new generation of computers emerged in which we ran the
clocks at a somewhat lower speed (usually around 2 GHz, which
corresponds to about 1 billion instructions per second), but had
many CPUs in each computer.

A computer needs to have nearby memory, but applications

needed access to “all” the memory. This leads to what we call a
“non-uniform memory access behavior”: NUMA.
CORNELL CS4414 - FALL 2021. 15
MOORE’S LAW WITH NUMA

Graph from prior slide

CORNELL CS4414 - FALL 2021. 16

… MAKING MODERN MACHINES COMPLICATED!

Prior to 2006, a good program

 Used the best algorithm: computational complexity, elegance
 Implemented it in a language like C++ that offers efficiency
 Ran on one machine

But the past decade has been disruptive! Suddenly even a single
computer might have the ability to do hundreds of parallel tasks!
CORNELL CS4414 - FALL 2021. 17
THE HARDWARE SHAPES THE
APPLICATION DESIGN PROCESS

We need to ask how a NUMA architecture impacts our designs.

If not all variables are equally fast to access, how can we

“code” to achieve the fastest solution?

And how do we keep all of this hardware “optimally busy”?

CORNELL CS4414 - FALL 2021. 18
DEFINITIONS OF TERMS WE OFTEN USE
Architecture: (also ISA: instruction set architecture)
The parts of a processor design that one needs to understand for
writing correct machine/assembly code
 Examples: instruction set specification, registers
 Machine Code: Byte-level programs a processor executes
 Assembly Code: Readable text representation of machine code

CORNELL CS4414 - FALL 2021. 19

DEFINITIONS OF TERMS WE OFTEN USE
Microarchitecture: “drill down”.
Details or implementation of the architecture
 Examples: memory or cache sizes, clock speed (frequency)

Example ISAs:
 Intel: x86, IA32, Itanium, x86-64
 ARM: Used in almost all mobile phones
 RISC V: New open-source ISA
CORNELL CS4414 - FALL 2021. 20
TODAY: MACHINE PROGRAMMING I: BASICS
History of Intel processors and architectures
Assembly Basics: Registers, operands, move
Arithmetic & logical operations
C/C++, assembly, machine code

CORNELL CS4414 - FALL 2021. 21

HOW A SINGLE THREAD COMPUTES
Common way to
depict a single thread

In CS4414 we think of each computation in terms of a “thread”

A thread is a pointer into the program instructions. The CPU

loads the instruction that the “PC” points to, fetches any operands
from memory, does the action, saves the results back to memory.

Then the PC is incremented to point to the next instruction

CORNELL CS4414 - FALL 2021. 22
ASSEMBLY/MACHINE
CODE VIEW
Programmer-Visible State Memory
 PC: Program counter Byte addressable array
 Address of next instruction Code and user data
 Called “RIP” (x86-64) Stack to support procedures

 Register file
 Heavily used program data Puzzle:
 Condition codes  On a NUMA machine, a CPU is near a fast
 Store status information about most recent memory but can access all memory.
arithmetic or logical operation  How does this impact software design?
 Used for conditional branching

CORNELL CS4414 - FALL 2021. 23

ASSEMBLY/MACHINE
Example: With 6 on-board DRAM modules and 12 NUMA CPUs, each pair of
CODE VIEW
CPUs has one nearby DRAM module. Memory in that range of addresses will be
very fast. The other 5 DRAM modules are further away. Data in those address This memory is
ranges is visible and everything looks identical, but access is slower! slower to access!
Programmer-Visible State Memory
 PC: Program counter Byte addressable array Same with this one…
 Address of next instruction Code and user data
 Called “RIP” (x86-64) Stack to support procedures …

 Register file
 Heavily used program data Puzzle: …

 Condition codes  On a NUMA machine, a CPU is near a fast

 Store status information about most recent memory but can access all memory.
…
arithmetic or logical operation  How does this impact software design?
 Used for conditional branching

CORNELL CS4414 - FALL 2021. 24

LINUX TRIES TO HIDE MEMORY DELAYS
If it runs thread t on core k, Linux tries to allocate memory for t
(stack, malloc…) in the DRAM close to that k.

Yet all memory operations work identically even if the thread is

actually accessing some other DRAM. They are just slower.

Linux doesn’t even tell you which parts of your address space are
mapped to which DRAM units.
CORNELL CS4414 - FALL 2021. 25
(We’ll cover what we
MACHINE LANGUAGE can but probably
won’t have time for
all of this)
CORNELL CS4414 - FALL 2021. 26
THE HARDWARE UNDERSTANDS “PRIMITIVE”
DATA TYPES
“Integer” data of 1, 2, 4, or 8 bytes (SIMD vector data types of 8, 16, 32
Data values or 64 bytes)
Addresses (untyped pointers)

No aggregate types such as arrays or

Floating point data of 4, 8, or 10 structures
bytes (new: 4-bit, 8-bit, 16-bit)  Just contiguously allocated bytes in memory
 Example: Raw images are arrays in a
format defined by the camera or video,
such as RGB, jpeg, mpeg. The camera
Code: Byte sequences encoding understands the format. The host computer
series of instructions the camera is attached to just sees bytes

CORNELL CS4414 - FALL 2021. 27

THE HARDWARE UNDERSTANDS “PRIMITIVE”
DATA TYPES
“Integer” data of 1, 2, 4, or 8 bytes (SIMD vector data types of 8, 16, 32
Data values or 64 bytes)
Addresses (untyped pointers)

No aggregate types such as arrays or

CORNELL CS4414 - FALL 2021. 28

X86-64 INTEGER REGISTERS

Can reference low-order 4 bytes (also low-order 1 & 2 bytes)

Not part of memory (or cache)
CORNELL CS4414 - FALL 2021. 29
SOME HISTORY: IA32 REGISTERS

CORNELL CS4414 - FALL 2021. 30

ASSEMBLY CHARACTERISTICS: OPERATIONS
Transfer data between memory and register
Load data from memory into register
Store register data into memory

Perform arithmetic function on register or memory data

Transfer control
Unconditional jumps to/from procedures
Conditional branches
Indirect branches
CORNELL CS4414 - FALL 2021. 31
Carnegie Mellon

Moving Data %rax

 Moving Data %rcx
movq Source, Dest %rdx
 Operand Types %rbx
 Immediate: Constant integer data %rsi
Example: $0x400, $-533
 %rdi
 Like C constant, but prefixed with ‘$’
%rsp
 Encoded with 1, 2, or 4 bytes
 Register: One of 16 integer registers
%rbp
 Example: %rax, %r13
%rN
 But %rsp reserved for special use
 Others have special uses for particular instructions
 Memory: 8 consecutive bytes of memory at address given by register
 Simplest example: (%rax)
Warning: Intel docs use
 Various other “addressing modes”
mov Dest, Source
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 32
Carnegie Mellon

movq Operand Combinations

Source Dest Src,Dest C/C++ Analog

Reg movq $0x4,%rax temp = 0x4;

Imm
Mem movq $-147,(%rax) *p = -147;

movq Reg movq %rax,%rdx temp2 = temp1;

Reg
Mem movq %rax,(%rdx) *p = temp;

Mem Reg movq (%rax),%rdx temp = *p;

Cannot do memory-memory transfer with a single instruction

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 33
Carnegie Mellon

Simple Memory Addressing Modes

 Normal (R) Mem[Reg[R]]
 Register R specifies memory address
 Aha! Pointer dereferencing in C

movq (%rcx),%rax

 Displacement D(R) Mem[Reg[R]+D]

 Register R specifies start of memory region
 Constant displacement D specifies offset

movq 8(%rbp),%rdx

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 34

Carnegie Mellon

Example of Simple Addressing Modes

void
whatAmI(<type> a, <type> b)
{
????
whatAmI:
}
movq (%rdi), %rax
movq (%rsi), %rdx
movq %rdx, (%rdi)
movq %rax, (%rsi)
ret

%rsi
%rdi

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 35

Carnegie Mellon

Example of Simple Addressing Modes

void swap
(long *xp, long *yp)
{ swap:
long t0 = *xp; movq (%rdi), %rax
long t1 = *yp; movq (%rsi), %rdx
*xp = t1; movq %rdx, (%rdi)
*yp = t0; movq %rax, (%rsi)
} ret

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 36

Carnegie Mellon

Understanding swap()
Memory Addr
void swap Registers
(long *xp, long *yp) xp
{ %rdi
long t0 = *xp;
%rsi
long t1 = *yp;
*xp = t1; %rax
*yp = t0; yp
} %rdx

Register Value
%rdi xp
%rsi yp swap:
%rax t0 movq (%rdi), %rax # t0 = *xp
%rdx t1 movq (%rsi), %rdx # t1 = *yp
movq %rdx, (%rdi) # *xp = t1
movq %rax, (%rsi) # *yp = t0
ret
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 37
Carnegie Mellon

Understanding swap()
Memory
Registers Address
123 0x120
%rdi 0x120
0x118
%rsi 0x100
0x110
%rax 0x108
%rdx 456 0x100