
COM7940

Cloud Computing

Chapter 04

Computer Clusters for Scalable Parallel Computing

Reading: Hwang et al. Chap 2


Distributed and Cloud Computing: from parallel processing to the Internet of Things

Slides are modified from Hwang et al. and from Prof Byron Choi
BACKGROUND

2
What is a computing cluster?
• A computing cluster consists of a collection of interconnected stand-alone/complete computers that can work cooperatively as a single, integrated computing resource. A cluster exploits parallelism at the job level and supports distributed computing with higher availability.
Benefits of Computer Clusters
• A single computer is limited by its CPU power, RAM size, and permanent storage space.
— Computer clusters can be used to solve problems of very large size or long computation time by combining multiple computers.

• High availability (HA): a cluster offers inherently high system availability due to the redundancy of hardware, operating systems, and applications.
• Hardware fault tolerance: a cluster has some degree of redundancy in most system components, including both hardware and software modules.
• OS and application reliability: multiple copies of the OS and applications run concurrently, and reliability is achieved through this redundancy.
• Scalability: servers can be added to a cluster, or more clusters added to a network, as application needs arise.
• High performance: running cluster-enabled programs yields higher throughput.

4
HPC
• High-performance computing (HPC) refers to supercomputers (massively parallel processors, or MPPs) that are dedicated to special purposes.
• The emphasis is on raw speed performance:
—from Gflops in the 1990s to 442 Pflops in 2019.
• Top 5 HPC systems as of today:
—Frontier
—Aurora
—Eagle
—Fugaku https://www.youtube.com/watch?v=eId-MM84zPI
—LUMI
Ref: top500.org 5
Measurement of HPC
• Sustained speed, Rmax: measured from the execution of the benchmark program.
• System efficiency: the ratio of the sustained speed Rmax to the peak speed Rpeak, when all computing elements are fully utilized in the system.
• Power: the measured power consumption.
• Power efficiency: the ratio of sustained speed to power consumption.

6
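As a back-of-the-envelope sketch (the machine figures below are made up, not actual TOP500 entries), these metrics are simple ratios:

```python
# Back-of-the-envelope TOP500-style metrics; the figures below are hypothetical.
rmax_pflops = 400.0       # sustained speed (Rmax) from the benchmark run, in Pflop/s
rpeak_pflops = 540.0      # theoretical peak speed (Rpeak), in Pflop/s
power_mw = 22.0           # measured power consumption, in MW

system_efficiency = rmax_pflops / rpeak_pflops                    # sustained / peak
power_efficiency_gflops_per_w = (rmax_pflops * 1e6) / (power_mw * 1e6)

print(f"system efficiency: {system_efficiency:.1%}")              # 74.1%
print(f"power efficiency:  {power_efficiency_gflops_per_w:.1f} Gflop/s per W")  # 18.2
```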
Top 50 HPC Peak Computing Power

Ranked as of 11/2023, from top500.org.

7
Top 50 HPC Power Efficiency

The most powerful HPC systems are not necessarily the most power-efficient.

8
As of 11/2023, from top500.org.
Top 500 HPC by country

9
Running example: Blue Gene L
• Blue Gene L is a supercomputer jointly
developed by IBM and Lawrence Livermore
National Laboratory
• An introduction:
https://youtu.be/LlS_QcEbPj0?si=bFgCWNu-
ABNLYm9s

11
Overview of Blue Gene L
• It occupies 17 of the top 100 slots in the rankings at
top500.org, including 5 of the top 10 (collected in 2021)
— 360 TeraFLOPS theoretical peak speed
• Largest configuration:
— At Lawrence Livermore Nat’l Lab.
— Runs simulations on US nuclear weapon stockpile
— 64 physical racks
— 65,536 compute nodes
— Torus interconnection network of 64 x 32 x 32
IBM BlueGene/L Supercomputer: the world's fastest message-passing MPP when built in 2005.

Built jointly by IBM and LLNL teams and funded by the US DoE ASCI Research Program.
HTC
• High-throughput computing (HTC) systems are built with parallel and distributed computing technologies, e.g., peer-to-peer networks built over many client machines.
• The focus is on high-flux computing, serving millions of users simultaneously.
• HTC addresses problems such as cost, energy saving, security, and reliability.

14
Evolution of HTC and HPC
systems

15
Cluster Classification

A cluster can be classified by the following attributes:

• Packaging: compact or slack.
— Compact – nodes packaged in racks in a machine room.
— Slack – PCs or workstations, possibly geographically distributed.
• Control: centralized or decentralized.
• Homogeneity: same vs. different platforms (e.g., CPUs, OSs).
• Security: intra-cluster communication exposed or enclosed.
• Dedication: dedicated HPC clusters vs. enterprise clusters that use spare resources.
Attributes Used in Cluster
Classification

Attribute       Attribute Values
Packaging       Compact / Slack
Control         Centralized / Decentralized
Homogeneity     Homogeneous / Heterogeneous
Security        Enclosed / Exposed
Dedication      Dedicated cluster / Enterprise cluster


Homogeneity
• Two compute node architectures:
—Homogeneous design – same architecture
—Hybrid node design – mixed architecture
GPU Clusters for Massive
Parallelism
• Why GPU?
—GPUs are designed for graphics rendering and are optimized for fast floating-point and matrix computations.
• A GPU cluster is often built as a heterogeneous system with:
—CPU host nodes
—GPU nodes
—Cluster interconnect between them
Cluster Classification based on
application demand
• Compute clusters:
— Mainly for collective computation over a single large job, e.g., weather forecasting, protein folding, cryptography.
— Do not handle many I/O operations such as database services.
— Require frequent communication among nodes.
• High-availability clusters:
— For fault tolerance, i.e., achieving high availability of services.
— Use many redundant nodes.
• Load-balancing clusters:
— Achieve higher resource utilization by balancing the load of many small jobs from concurrent users.
— Requests are distributed among the node computers of the cluster.

20
Basic Cluster Architecture
• Built from commodity components, with full support for the desired SSI features and HA capability.
• Commodity nodes are easy to replace or
upgrade with new generations of hardware.
• OS is designed for multiuser, multitasking,
multithreaded applications.
• Interconnected by fast commodity networks

22
Resource Sharing in Cluster of
Computers

Three ways to connect cluster nodes.


Resource Sharing in Cluster of
Computers
• Shared-nothing: used in most clusters.
—Nodes are connected through Ethernet.
• Shared-disk: favored by small-scale availability clusters in business applications.
—Allows fast recovery in case of a node failure.
• Shared-memory: difficult to realize.
—Nodes are connected by a Scalable Coherent Interface (SCI) ring, attached to the memory bus of each node.
FUNDAMENTAL CLUSTER
DESIGN ISSUES

25
Fundamental Cluster Design
Issues
1. Scalable Performance
2. Internode Communication
3. Single-System Image (SSI)
4. Availability Support
5. Fault Tolerance and Recovery
6. Cluster Job Management
1. Scalability
• Size scalability: increase processors, cache,
memory, storage, I/O channel
• Software scalability: upgrade OS or compilers,
adding more math and engineering libraries
• Application scalability: match the problem size with
the machine size
• Technology scalability: new technology must
consider:
— Time
— Space
— Heterogeneity

27
Amdahl’s Law
• Consider a program that runs on a uniprocessor in time T.
• Assume a fraction α of the code must be executed sequentially, called the sequential bottleneck.
• The remaining (1 − α) of the code can be compiled for parallel execution on n processors.
• Total execution time on n processors: T(n) = αT + (1 − α)T/n
• Speedup: S = T / T(n) = 1 / (α + (1 − α)/n), which approaches 1/α as n grows.

28
Example:
Matrix-Vector Multiplication

29
A running example: Matrix-
Vector Multiplication
• Assumptions
— A total of p processes
— Matrix A (m x n) and vector x (n x 1) are created at process 0
• called “master process” because it coordinates the work of other processes
(i.e., “slave processes”)
1. Message passing:
— Process 0 will send (p-1) sub-matrices to corresponding processes
— Process 0 will send vector x to all other p-1 processes
2. Calculations:
— Each process carries out its own matrix-vector multiplication
3. Message passing:
— Processes 1 to (p-1) send the results (i.e., part of vector y) back to
process 0

30
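A minimal sketch of these three steps using mpi4py and NumPy (neither library is named in the slides; the toy sizes and the assumption that p divides m are ours). Unlike the description above, rank 0 also computes one block here, which is a common simplification:

```python
# Minimal mpi4py sketch of the three steps above.
# Assumes the row count m is divisible by the number of processes p.
# Run with e.g.:  mpiexec -n 4 python matvec.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, p = comm.Get_rank(), comm.Get_size()

m, n = 8, 4                          # toy sizes; m % p must be 0 here
rows = m // p

if rank == 0:                        # "master" process creates A and x
    A = np.arange(m * n, dtype='d').reshape(m, n)
    x = np.arange(n, dtype='d')
else:
    A, x = None, np.empty(n, dtype='d')

# Step 1 (message passing): scatter sub-matrices, broadcast x
A_local = np.empty((rows, n), dtype='d')
comm.Scatter(A, A_local, root=0)
comm.Bcast(x, root=0)

# Step 2 (calculation): each process multiplies its block of rows
y_local = A_local @ x

# Step 3 (message passing): gather the partial results of y at process 0
y = np.empty(m, dtype='d') if rank == 0 else None
comm.Gather(y_local, y, root=0)

if rank == 0:
    print(y)
```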
System Efficiency and
Gustafson’s Law
• We can measure whether additional processors are worth the investment using the system efficiency:

  E = S / n = 1 / (αn + 1 − α)

— For α = 0.25: with n = 100, E ≈ 3.9%; with n = 10, E ≈ 31%.

— Efficiency is the speedup obtained per processor.

— The higher the efficiency, the more speedup each additional processor contributes.

35
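A quick numerical check of the efficiency figures quoted above (a sketch, assuming the Amdahl speedup formula from the previous slide):

```python
# Checking the quoted efficiency values with Amdahl's law.
def amdahl_speedup(alpha, n):
    """Speedup with sequential fraction alpha on n processors."""
    return 1.0 / (alpha + (1.0 - alpha) / n)

for n in (10, 100):
    s = amdahl_speedup(0.25, n)
    print(f"n = {n}: speedup = {s:.2f}, efficiency = {s / n:.1%}")
# n = 10:  speedup ~ 3.08, efficiency ~ 31%
# n = 100: speedup ~ 3.88, efficiency ~ 3.9%
```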
System Efficiency and
Gustafson’s Law
• We can also scale up the problem size to match the cluster capability.
— Let W be the workload of a given program.
— Increase the workload to W' when using n processors:
      W' = αW + (1 − α)nW        (1)
— i.e., the sequential part remains unchanged.
— The scaled (fixed-time) speedup is then
      S' = (single-processor time on W') / (time with n processors)
         = (αW + (1 − α)nW) / (αW + (1 − α)nW/n)
         = α + (1 − α)n
— The corresponding efficiency is E' = S'/n = α/n + (1 − α).

36
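A small sketch contrasting the scaled-workload (Gustafson) case with the fixed-workload (Amdahl) case, reusing α = 0.25 (these particular numbers are not from the slides):

```python
# Scaled-workload (Gustafson) speedup vs. Amdahl, for alpha = 0.25.
def gustafson_speedup(alpha, n):
    """Fixed-time (scaled-workload) speedup: alpha + (1 - alpha) * n."""
    return alpha + (1.0 - alpha) * n

for n in (10, 100):
    s = gustafson_speedup(0.25, n)
    print(f"n = {n}: scaled speedup = {s:.2f}, efficiency = {s / n:.1%}")
# n = 10:  scaled speedup = 7.75,  efficiency = 77.5%
# n = 100: scaled speedup = 75.25, efficiency ~ 75%
```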
A running example: Matrix-
Vector Multiplication
1. Message passing: (Sequential)
— Process 0 will send (p-1) sub-matrices to corresponding processes
— Process 0 will send vector x to all other p-1 processes
2. Calculations: (Parallel)
— Each process carries out its own matrix-vector multiplication
3. Message passing: (Sequential)
— Processes 1 to (p-1) send the results (i.e., part of vector y) back to
process 0

Total cost of the sequential part (messaging): n·m + 2·n

Total cost of the parallel part (computation): n·m·n

The speedup: S = (n·m + 2·n + n·m·n) / (n·m + 2·n + (n·m·n)/p). Assuming n = m = 1,000,000 and p = 10, the speedup is approximately 9.9999x, approaching the limit of p.

37
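The same speedup estimate can be reproduced numerically (a sketch using the cost model stated on this slide):

```python
# Reproducing the speedup estimate for n = m = 1,000,000 and p = 10,
# using the cost model stated on this slide.
n = m = 1_000_000
p = 10

sequential_cost = n * m + 2 * n      # message passing done by process 0
parallel_cost = n * m * n            # computation cost as given above

speedup = (sequential_cost + parallel_cost) / (sequential_cost + parallel_cost / p)
print(f"speedup ~ {speedup:.4f}")    # ~ 9.9999, approaching the limit p = 10
```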
2. High Bandwidth
Interconnects
Example: InfiniBand (1)
• Provides applications with an easy-to-use messaging service.
• Gives every application direct access to the messaging service, without needing to rely on the operating system to transfer messages.
• Provides the messaging service by creating a channel connecting an application to any other application or service with which it needs to communicate.
• These channels are created between virtual address spaces.



Example: InfiniBand (2)

• InfiniBand creates a channel directly connecting an application in its virtual address space to an application in another virtual address space.
• The two applications can be in disjoint physical address
spaces – hosted by different servers.
InfiniBand Architecture
• HCA – Host Channel Adapter. An HCA is the point at
which an InfiniBand end node, such as a server or
storage device, connects to the InfiniBand network.
• TCA – Target Channel Adapter. This is a specialized
version of a channel adapter intended for use in an
embedded environment such as a storage appliance.
• Switches – An InfiniBand Architecture switch is
conceptually similar to any other standard networking
switch, but molded to meet InfiniBand’s performance
and cost targets.
• Routers – Although not currently in wide deployment, an
InfiniBand router is intended to be used to segment a
very large network into smaller subnets connected
together by an InfiniBand router.
Example: InfiniBand System
Fabric
Example: The Big Google Search
Engine
• A supercluster built over high-speed PCs and Gigabit LANs for the global web-page search application provided by Google.
• Physically, the cluster is housed in 40 PC/switch racks with 80 PCs per rack, 3,200 PCs in total.
• Two additional racks house the 128 x 128 Gigabit Ethernet switches, the front hosts, UPSs, etc.
• All hardware parts are commercially available; Google-designed software systems support parallel search, URL linking, page ranking, file and database management, etc.
Google Search Engine Cluster
Hardware, Software, and
Middleware Support for connection
Example: Apache ZooKeeper

48
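The slide only names ZooKeeper, so the following is a hypothetical illustration rather than the course's example: cluster nodes can register ephemeral znodes so that membership changes are visible to the rest of the cluster. A minimal sketch with the third-party kazoo client, assuming an ensemble reachable at 127.0.0.1:2181:

```python
# Hypothetical cluster-membership sketch with the kazoo ZooKeeper client.
# Assumes a ZooKeeper ensemble reachable at 127.0.0.1:2181.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

zk.ensure_path("/cluster/nodes")
# Ephemeral + sequential znode: it disappears automatically if this node dies,
# so the remaining nodes can detect the failure.
zk.create("/cluster/nodes/node-", b"worker-host-1", ephemeral=True, sequence=True)

print("live nodes:", zk.get_children("/cluster/nodes"))
zk.stop()
```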
3. Single System Image (SSI)

• A single system image is the illusion, created by software or hardware, that presents a collection of resources as one integrated, powerful resource.
• SSI makes the cluster appear like a single machine to the user, to applications, and to the network.
• A cluster with multiple system images is nothing but a collection of independent computers (i.e., a distributed system in general).
Single-System-Image Features
• Single System: The entire cluster is viewed by the users as
one system, which has multiple processors.
• Single Control: Logically, an end user or system user utilizes
services from one place with a single interface.
• Symmetry: A user can use a cluster service from any node.
All cluster services and functionalities are symmetric to all
nodes and all users, except those protected by access rights.
• Location Transparent: The user is not aware of the
whereabouts of the physical device that eventually provides a
service.
• Some paradigms:
— Single Entry Points
— Single File Hierarchy
— Single I/O, Networking, and Memory Space
— Single Job management system, User Interface, process space,…
A. Single Entry Point

1. Four nodes of a cluster are used as host nodes to receive users' login requests.
2. To log into the cluster, the user issues a standard Unix command such as "telnet cluster.cs.hku.hk", using the symbolic name of the cluster system.
3. The symbolic name is translated by the DNS, which returns the IP address 159.226.41.150 of the least-loaded node, which happens to be node Host1.
4. The user then logs in using this IP address.
5. The DNS periodically receives load information from the host nodes to make load-balancing translation decisions.
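A toy sketch of the load-balancing translation in step 3 (host loads and the extra IP addresses are made up; only 159.226.41.150 and the cluster name come from the slide):

```python
# Toy illustration of the least-loaded translation in step 3.
# Host addresses and load figures are made up (only .150 appears on the slide).
host_load = {
    "159.226.41.150": 0.21,   # Host1
    "159.226.41.151": 0.65,   # Host2
    "159.226.41.152": 0.48,   # Host3
    "159.226.41.153": 0.90,   # Host4
}

def resolve(symbolic_name: str) -> str:
    """Return the IP of the currently least-loaded host node."""
    return min(host_load, key=host_load.get)

print(resolve("cluster.cs.hku.hk"))   # -> 159.226.41.150 (Host1 in this example)
```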
B. Single File Hierarchy
Files can reside in three types of locations in a cluster:
• Local storage - disk on the local node.
• Remote storage - disks on remote nodes.
• Stable storage -
— Persistent - data, once written to the stable storage,
will stay there at least for a period of time (e.g., a
week), even after the cluster shuts down.
— Fault tolerant - to some degree, by using
redundancy and periodical backup to tapes.
Single File Hierarchy
• A cluster that supports a single file hierarchy should provide:
—Single system: there is just one file hierarchy from the user's viewpoint.
—Symmetry: a user can access the global storage (e.g., /scratch) using a cluster service from any node.
—Location transparency: the user is not aware of the whereabouts of the physical device that eventually provides a service.

53
4. High Availability Through
Redundancy
• Three terms often go together: reliability, availability,
and serviceability (RAS).
• Availability combines the concepts of reliability and
serviceability as defined below:
• Reliability measures how long a system can operate
without a breakdown.
• Availability indicates the percentage of time that a
system is available to the user, that is, the
percentage of system uptime.
• Serviceability refers to how easy it is to service the
system, including hardware and software
maintenance, repair, upgrade, etc.
Availability and Failure Rate
Recent Find/SVP Survey of Fortune 1000 companies:
• An average computer is down 9 times a year with an average
downtime of 4 hours.
• The average loss of revenue per hour downtime is $82,500.

The operate-repair cycle of a computer system.

Availability = MTTF / (MTTF + MTTR), where MTTF is the mean time to failure and MTTR is the mean time to repair.


56
Availability Guarantee
Failure Cost Analysis
• Consider a cluster with little availability support. Upon a node
failure, the following sequence of events takes place:
1. The entire system is shut down and powered off
2. The faulty node is replaced if the failure is in hardware
3. The system is powered on and rebooted
4. The user application is reloaded and rerun from the start

• Assume one cluster node fails every 100 hours. Other parts
of the cluster never fail. Steps 1-3 take two hours. Step 4
takes two hours. What is the availability of this cluster?
What is the yearly failure cost if each one-hour downtime
costs $10,000?
— MTTF = 100 hours. MTTR = 2 + 2 = 4 hours.
— So availability = 100 / (100 + 4) = 96.15%
— The downtime in a year is 365x24x(1-96.15%) = 337 hours
— The yearly failure cost is 337 x 10,000 = $3.37 million.

58
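The same arithmetic, as a small sketch:

```python
# Verifying the failure-cost analysis above.
mttf, mttr = 100.0, 4.0                        # hours
availability = mttf / (mttf + mttr)
downtime_hours = 365 * 24 * (1 - availability)
yearly_cost = downtime_hours * 10_000

print(f"availability  = {availability:.2%}")        # ~ 96.15%
print(f"downtime/year = {downtime_hours:.0f} h")     # ~ 337 h
print(f"yearly cost   = ${yearly_cost:,.0f}")        # ~ $3.37 million
```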
Different Types of Failures
• Unplanned failures vs. planned shutdowns
— Unplanned: the system breaks due to an operating system
crash, hardware failure, network disconnection, human
operation errors, power outage, etc.
— Planned: the system is not broken, but is periodically taken off
for upgrades, reconfiguration, and maintenance.
• Transient failures vs. permanent failures
— Transient failures: occur temporarily and then disappear; they can be dealt with without replacing any component.
— Permanent failures: cannot be corrected by rebooting; some hardware or software component must be repaired or replaced.
• Partial failures vs. total failures
— Partial failures: affect only part of the system; the cluster is still usable.
— Total failures: the entire cluster becomes unusable.
A key approach to enhancing availability is to make as many failures
as possible partial failures by removing single points of failure.
59
Examples of Single Point of
Failure

Three workstations (each running its own OS) connected through a single switch: the switch is the single point of failure.
Solution: use a redundant network switch, so that each workstation is connected through two switches.

60
Single Points of Failure in SMP
and Clusters
Fault-Tolerant Cluster
Configurations
• Hot standby server clusters:
— Only the primary nodes actively do useful work. The standby nodes are powered on and run monitoring programs that exchange heartbeat signals to check the status of the primary nodes.
• Active-takeover clusters:
— All servers are primary and do useful work. When a node fails, its user applications fail over to an available node in the cluster. Users may experience some delay or may lose some data.
• Failover clusters:
— When a component fails, the remaining system takes over its services.
— Failover provides failure diagnosis, failure notification, and failure recovery.

62
Failure Cost Analysis Again
• Review the example in Page 58.
• Assume the cluster uses failover to improve the
availability.
— Upon a node failure, its workload automatically fails over
to other nodes after six minutes.
— The faulty node is taken off, repaired, and it rejoins the
cluster, all without impacting the rest of the cluster.
• Failure cost analysis:
— MTTF = 100 hours, MTTR = 0.1 hour
— Availability = 100/100.1 = 99.9%
— The downtime per year is 365x24x0.1% = 8.75 hours
— The yearly failure cost is 8.75x10,000 = $87,500

63
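And the failover variant of the same calculation:

```python
# Same calculation with failover: MTTR drops from 4 hours to 6 minutes (0.1 h).
mttf, mttr = 100.0, 0.1
availability = mttf / (mttf + mttr)
downtime_hours = 365 * 24 * (1 - availability)
print(f"availability ~ {availability:.1%}, yearly cost ~ ${downtime_hours * 10_000:,.0f}")
# availability ~ 99.9%, yearly cost ~ $87,512 (about $87,500)
```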
Checkpointing
• Checkpointing:
—The process of periodically saving the state of an
executing program to stable storage, from which the
system can recover after a failure.
—Each program state saved is called a checkpoint.
—Checkpointing can be realized by the operating system at the kernel level, by a third-party library in user space, or by the applications themselves.

64
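A minimal application-level checkpointing sketch (our own illustration using Python's pickle; a real system would also write checkpoints atomically, e.g., write to a temporary file and then rename):

```python
# Minimal application-level checkpointing sketch using pickle.
# A long-running loop periodically saves its state; after a crash it resumes
# from the last checkpoint instead of restarting from scratch.
import os
import pickle

CKPT_FILE = "state.ckpt"

def load_checkpoint():
    if os.path.exists(CKPT_FILE):
        with open(CKPT_FILE, "rb") as f:
            return pickle.load(f)          # resume from the saved checkpoint
    return {"i": 0, "total": 0}            # fresh start

state = load_checkpoint()
for i in range(state["i"], 1_000_000):
    state["total"] += i
    state["i"] = i + 1
    if i % 100_000 == 0:                   # checkpoint interval
        with open(CKPT_FILE, "wb") as f:
            pickle.dump(state, f)          # each saved state is one checkpoint

print(state["total"])
```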
An example: Checkpoint in
Flink

65
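The Flink slide itself is a figure; as a hedged sketch of what enabling periodic checkpoints looks like with the PyFlink API (the interval here is chosen arbitrarily):

```python
# Hedged PyFlink sketch: enabling periodic checkpoints for a Flink job.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(60_000)    # take a checkpoint every 60 seconds
# ... define sources, transformations, and sinks here ...
# env.execute("checkpointed-job")
```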
6. Cluster Job Scheduling and
Management
• End users make use of a cluster through “jobs”
— Serial jobs: run on a single node
— Parallel jobs: use multiple nodes
— Interactive jobs: require fast turnaround time, and their
input/output is directed to a terminal
• These jobs do not need large resources, and users expect them to execute immediately rather than wait in a queue.
— Batch jobs: need more resources but do not require an immediate response.
• They are submitted to a job queue and are scheduled to run when the resources become available (e.g., during off hours).

66
Cluster Job Scheduling and
Management
A Job Management System (JMS) should have three parts:
• A user server lets the user submit jobs to one or more queues, specify resource requirements for each job, delete a job from a queue, and inquire about the status of a job or a queue.
• A job scheduler that performs job scheduling and
queuing according to job types, resource requirements,
resource availability, and scheduling policies.
• A resource manager that allocates and monitors
resources, enforces scheduling policies, and collects
accounting information.
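A toy sketch (not any real JMS) of how the three parts fit together: the user server accepts submissions into a queue, the scheduler starts jobs first-come-first-served, and the resource manager tracks free nodes:

```python
# Toy JMS sketch: user server (submit), scheduler (FCFS), resource manager
# (node bookkeeping). Purely illustrative, not a real job management system.
from collections import deque

free_nodes = 8
job_queue = deque()                   # jobs wait here with their resource needs

def submit(job_id, nodes_needed):
    """User server: accept a job and record its resource requirement."""
    job_queue.append((job_id, nodes_needed))

def schedule():
    """Scheduler + resource manager: start queued jobs FCFS while nodes are free."""
    global free_nodes
    while job_queue and job_queue[0][1] <= free_nodes:
        job_id, nodes = job_queue.popleft()
        free_nodes -= nodes           # resource manager allocates the nodes
        print(f"started {job_id} on {nodes} nodes ({free_nodes} still free)")

submit("batch-1", 4)
submit("batch-2", 6)                  # has to wait until nodes are released
schedule()                            # starts batch-1 only
```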
JMS Administration
• JMS should be able to dynamically reconfigure the
cluster with minimal impact on the running jobs.
• The administrator’s prologue and epilogue scripts should
be able to run before and after each job for security
checking, accounting, and cleanup.
• Users should be able to cleanly kill their own jobs.
• The administrator or the JMS should be able to cleanly
suspend or kill any job.
— Clean means that when a job is suspended or killed, all its
processes must be included.
— Otherwise, "orphan" processes are left in the system, wasting cluster resources and possibly rendering the system unusable.
Scheduling Modes (1)
Dedicated Mode :
• Only one job runs in the cluster at a time, and at most
one process of the job is assigned to a node at a time.
• The single job runs until completion before it releases
the cluster to run other jobs.
Space Sharing :
Multiple jobs can run on disjoint partitions (groups) of
nodes simultaneously.
• At most one process is assigned to a node at a time.
• Although a partition of nodes is dedicated to a job, the
interconnect and the I/O subsystem may be shared by
all jobs.
Scheduling Modes (2)
• Time sharing :
• Multiple user processes are assigned to the same node.
• Time-sharing introduces the following parallel
scheduling policies:
1. Independent Scheduling (Independent): Uses the operating
system of each cluster node to schedule different processes
as in a traditional workstation.
2. Gang Scheduling: Schedules all processes of a parallel job
together. When one process is active, all processes are
active.
3. Competition with Foreign (Local) Jobs: Scheduling becomes
more complicated when both cluster jobs and local jobs are
running. The local jobs should have priority over cluster jobs.
• Two ways to deal with this situation: the cluster job can stay on the node or migrate to another node.
Lab Session:
• Create your first Telegram ChatBot!

75
