
COM7940

Cloud Computing

Chapter 04

Computer Clusters for Scalable Parallel Computing

Reading: Hwang et al. Chap 2


Distributed and Cloud Computing: from parallel processing to the Internet of Things

Slides are modified from Hwang et al. and from Prof Byron Choi
BACKGROUND

2
What is a computing cluster?
• A computing cluster consists of a collection of interconnected stand-alone/complete computers that can work cooperatively as a single, integrated computing resource. A cluster exploits parallelism at the job level and supports distributed computing with higher availability.
Benefits of Computer Clusters
• A single computer is limited by its CPU power, RAM size, and permanent storage space.
— Computer clusters can be used to solve problems of very large size or long computation time by combining multiple computers.

• High availability (HA): a cluster offers inherently high system availability due to the redundancy of hardware, operating systems, and applications.
• Hardware fault tolerance: a cluster has some degree of redundancy in most system components, including both hardware and software modules.
• OS and application reliability: multiple copies of the OS and applications run concurrently, and reliability is achieved through this redundancy.
• Scalability: servers can be added to a cluster, or more clusters added to a network, as application needs arise.
• High performance: running cluster-enabled programs yields higher throughput.

4
HPC
• High-performance computing (HPC) refers to supercomputers (massively parallel processors, or MPPs) that are dedicated to special purposes.
• The emphasis is on raw speed performance:
—from Gflops in the 1990s to 442 Pflops in 2019.
• Top 5 HPC systems as of today:
—Frontier
—Aurora
—Eagle
—Fugaku https://www.youtube.com/watch?v=eId-MM84zPI
—LUMI
Ref: top500.org 5
Measurement of HPC
• Sustained speed, Rmax: measured from the execution of the benchmark program.
• System efficiency: the ratio of the sustained speed Rmax to the peak speed Rpeak, when all computing elements are fully utilized in the system.
• Power: the measured power consumption.
• Power efficiency: the ratio of sustained speed to power consumption.

6
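As a back-of-the-envelope sketch (the machine figures below are made up, not actual TOP500 entries), these metrics are simple ratios:

```python
# Back-of-the-envelope TOP500-style metrics; the figures below are hypothetical.
rmax_pflops = 400.0       # sustained speed (Rmax) from the benchmark run, in Pflop/s
rpeak_pflops = 540.0      # theoretical peak speed (Rpeak), in Pflop/s
power_mw = 22.0           # measured power consumption, in MW

system_efficiency = rmax_pflops / rpeak_pflops                    # sustained / peak
power_efficiency_gflops_per_w = (rmax_pflops * 1e6) / (power_mw * 1e6)

print(f"system efficiency: {system_efficiency:.1%}")              # 74.1%
print(f"power efficiency:  {power_efficiency_gflops_per_w:.1f} Gflop/s per W")  # 18.2
```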
Top 50 HPC Peak Computing Power

Ranked as of 11/2023, from top500.org.

7
Top 50 HPC Power Efficiency

The most powerful HPC systems are not necessarily the most power-efficient.

8
As of 11/2023, from top500.org.
Top 500 HPC by country

9
Running example: Blue Gene L
• Blue Gene L is a supercomputer jointly
developed by IBM and Lawrence Livermore
National Laboratory
• An introduction:
https://youtu.be/LlS_QcEbPj0?si=bFgCWNu-
ABNLYm9s

11
Overview of Blue Gene L
• It occupies 17 of the top 100 slots in the rankings at
top500.org, including 5 of the top 10 (collected in 2021)
— 360 TeraFLOPS theoretical peak speed
• Largest configuration:
— At Lawrence Livermore Nat’l Lab.
— Runs simulations on US nuclear weapon stockpile
— 64 physical racks
— 65,536 compute nodes
— Torus interconnection network of 64 x 32 x 32
IBM BlueGene/L Supercomputer: the world's fastest message-passing MPP when built in 2005.

Built jointly by IBM and LLNL teams and funded by the US DoE ASCI Research Program.
HTC
• High-throughput computing (HTC) systems are built with parallel and distributed computing technologies, e.g., peer-to-peer networks built over many client machines.
• The focus is on high-flux computing, serving millions of users simultaneously.
• HTC addresses problems such as cost, energy saving, security, and reliability.

14
Evolution of HTC and HPC
systems

15
Cluster Classification

A cluster can be classified by the following attributes:

• Packaging: compact or slack.
— Compact – nodes packaged in racks in a machine room.
— Slack – PCs or workstations, possibly geographically distributed.
• Control: centralized or decentralized.
• Homogeneity: same vs. different platforms (e.g., CPUs, OSs).
• Security: intra-cluster communication exposed or enclosed.
• Dedication: dedicated HPC clusters vs. enterprise clusters that use spare resources.
Attributes Used in Cluster
Classification

Attribute       Attribute Values
Packaging       Compact / Slack
Control         Centralized / Decentralized
Homogeneity     Homogeneous / Heterogeneous
Security        Enclosed / Exposed
Dedication      Dedicated cluster / Enterprise cluster


Homogeneity
• Two compute node architectures:
—Homogeneous design – same architecture
—Hybrid node design – mixed architecture
GPU Clusters for Massive
Parallelism
• Why GPU?
—GPUs are designed for graphics rendering and are optimized for fast floating-point and matrix computations.
• A GPU cluster is often built as a heterogeneous system with:
—CPU host nodes
—GPU nodes
—Cluster interconnect between them
Cluster Classification based on
application demand
• Compute clusters:
— Mainly for collective computation over a single large job, e.g., weather forecasting, protein folding, cryptography.
— Do not handle many I/O operations such as database services.
— Require frequent communication among nodes.
• High-availability clusters:
— For fault tolerance, i.e., achieving high availability of services.
— Use many redundant nodes.
• Load-balancing clusters:
— Achieve higher resource utilization by balancing the load of many small jobs from concurrent users.
— Requests are distributed among the node computers of the cluster.

20
Basic Cluster Architecture
• Built from commodity components, with full support for the desired SSI features and HA capability.
• Commodity nodes are easy to replace or
upgrade with new generations of hardware.
• OS is designed for multiuser, multitasking,
multithreaded applications.
• Interconnected by fast commodity networks

22
Resource Sharing in Cluster of
Computers

Three ways to connect cluster nodes.


Resource Sharing in Cluster of
Computers
• Shared-nothing: used in most clusters.
—Nodes are connected through Ethernet.
• Shared-disk: favored by small-scale availability clusters in business applications.
—Allows fast recovery in case of a node failure.
• Shared-memory: difficult to realize.
—Nodes are connected by a Scalable Coherent Interface (SCI) ring, attached to the memory bus of each node.
FUNDAMENTAL CLUSTER
DESIGN ISSUES

25
Fundamental Cluster Design
Issues
1. Scalable Performance
2. Internode Communication
3. Single-System Image (SSI)
4. Availability Support
5. Fault Tolerance and Recovery
6. Cluster Job Management
1. Scalability
• Size scalability: increase processors, cache,
memory, storage, I/O channel
• Software scalability: upgrade OS or compilers,
adding more math and engineering libraries
• Application scalability: match the problem size with
the machine size
• Technology scalability: new technology must
consider:
— Time
— Space
— Heterogeneity

27
Amdahl’s Law
• Consider a program that runs on a uniprocessor in time T.
• Assume a fraction α of the code must be executed sequentially, called the sequential bottleneck.
• The remaining (1 − α) of the code can be compiled for parallel execution on n processors.
• Total execution time on n processors: T(n) = αT + (1 − α)T/n
• Speedup: S = T / T(n) = 1 / (α + (1 − α)/n), which approaches 1/α as n grows.

28
Example:
Matrix-Vector Multiplication

29
A running example: Matrix-
Vector Multiplication
• Assumptions
— A total of p processes
— Matrix A (m x n) and vector x (n x 1) are created at process 0
• called “master process” because it coordinates the work of other processes
(i.e., “slave processes”)
1. Message passing:
— Process 0 will send (p-1) sub-matrices to corresponding processes
— Process 0 will send vector x to all other p-1 processes
2. Calculations:
— Each process carries out its own matrix-vector multiplication
3. Message passing:
— Processes 1 to (p-1) send the results (i.e., part of vector y) back to
process 0

30
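A minimal sketch of these three steps using mpi4py and NumPy (neither library is named in the slides; the toy sizes and the assumption that p divides m are ours). Unlike the description above, rank 0 also computes one block here, which is a common simplification:

```python
# Minimal mpi4py sketch of the three steps above.
# Assumes the row count m is divisible by the number of processes p.
# Run with e.g.:  mpiexec -n 4 python matvec.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, p = comm.Get_rank(), comm.Get_size()

m, n = 8, 4                          # toy sizes; m % p must be 0 here
rows = m // p

if rank == 0:                        # "master" process creates A and x
    A = np.arange(m * n, dtype='d').reshape(m, n)
    x = np.arange(n, dtype='d')
else:
    A, x = None, np.empty(n, dtype='d')

# Step 1 (message passing): scatter sub-matrices, broadcast x
A_local = np.empty((rows, n), dtype='d')
comm.Scatter(A, A_local, root=0)
comm.Bcast(x, root=0)

# Step 2 (calculation): each process multiplies its block of rows
y_local = A_local @ x

# Step 3 (message passing): gather the partial results of y at process 0
y = np.empty(m, dtype='d') if rank == 0 else None
comm.Gather(y_local, y, root=0)

if rank == 0:
    print(y)
```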
System Efficiency and
Gustafson’s Law
• We can measure whether additional processors are worth the investment using the system efficiency:

  E = S / n = 1 / (αn + 1 − α)

— For α = 0.25: with n = 100, E ≈ 3.9%; with n = 10, E ≈ 31%.

— Efficiency is the speedup obtained per processor.

— The higher the efficiency, the more speedup each additional processor contributes.

35
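A quick numerical check of the efficiency figures quoted above (a sketch, assuming the Amdahl speedup formula from the previous slide):

```python
# Checking the quoted efficiency values with Amdahl's law.
def amdahl_speedup(alpha, n):
    """Speedup with sequential fraction alpha on n processors."""
    return 1.0 / (alpha + (1.0 - alpha) / n)

for n in (10, 100):
    s = amdahl_speedup(0.25, n)
    print(f"n = {n}: speedup = {s:.2f}, efficiency = {s / n:.1%}")
# n = 10:  speedup ~ 3.08, efficiency ~ 31%
# n = 100: speedup ~ 3.88, efficiency ~ 3.9%
```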
System Efficiency and
Gustafson’s Law
• We can also scale up the problem size to match the cluster capability.
— Let W be the workload of a given program.
— Increase the workload to W' when using n processors:
      W' = αW + (1 − α)nW        (1)
— i.e., the sequential part remains unchanged.
— The scaled (fixed-time) speedup is then
      S' = (single-processor time on W') / (time with n processors)
         = (αW + (1 − α)nW) / (αW + (1 − α)nW/n)
         = α + (1 − α)n
— The corresponding efficiency is E' = S'/n = α/n + (1 − α).

36
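A small sketch contrasting the scaled-workload (Gustafson) case with the fixed-workload (Amdahl) case, reusing α = 0.25 (these particular numbers are not from the slides):

```python
# Scaled-workload (Gustafson) speedup vs. Amdahl, for alpha = 0.25.
def gustafson_speedup(alpha, n):
    """Fixed-time (scaled-workload) speedup: alpha + (1 - alpha) * n."""
    return alpha + (1.0 - alpha) * n

for n in (10, 100):
    s = gustafson_speedup(0.25, n)
    print(f"n = {n}: scaled speedup = {s:.2f}, efficiency = {s / n:.1%}")
# n = 10:  scaled speedup = 7.75,  efficiency = 77.5%
# n = 100: scaled speedup = 75.25, efficiency ~ 75%
```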
A running example: Matrix-
Vector Multiplication
1. Message passing: (Sequential)
— Process 0 will send (p-1) sub-matrices to corresponding processes
— Process 0 will send vector x to all other p-1 processes
2. Calculations: (Parallel)
— Each process carries out its own matrix-vector multiplication
3. Message passing: (Sequential)
— Processes 1 to (p-1) send the results (i.e., part of vector y) back to
process 0

Total cost of the sequential part (messaging): n·m + 2·n

Total cost of the parallel part (computation): n·m·n

The speedup: S = (n·m + 2·n + n·m·n) / (n·m + 2·n + (n·m·n)/p). Assuming n = m = 1,000,000 and p = 10, the speedup is approximately 9.9999x, approaching the limit of p.

37
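The same speedup estimate can be reproduced numerically (a sketch using the cost model stated on this slide):

```python
# Reproducing the speedup estimate for n = m = 1,000,000 and p = 10,
# using the cost model stated on this slide.
n = m = 1_000_000
p = 10

sequential_cost = n * m + 2 * n      # message passing done by process 0
parallel_cost = n * m * n            # computation cost as given above

speedup = (sequential_cost + parallel_cost) / (sequential_cost + parallel_cost / p)
print(f"speedup ~ {speedup:.4f}")    # ~ 9.9999, approaching the limit p = 10
```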
2. High Bandwidth
Interconnects
Example: InfiniBand (1)
• Provides applications with an easy-to-use messaging service.
• Gives every application direct access to the messaging service, without needing to rely on the operating system to transfer messages.
• Provides the messaging service by creating a channel connecting an application to any other application or service with which it needs to communicate.
• These channels are created between virtual address spaces.



Example: InfiniBand (2)

• InfiniBand creates a channel directly connecting an application in its virtual address space to an application in another virtual address space.
• The two applications can be in disjoint physical address
spaces – hosted by different servers.
InfiniBand Architecture
• HCA – Host Channel Adapter. An HCA is the point at
which an InfiniBand end node, such as a server or
storage device, connects to the InfiniBand network.
• TCA – Target Channel Adapter. This is a specialized
version of a channel adapter intended for use in an
embedded environment such as a storage appliance.
• Switches – An InfiniBand Architecture switch is
conceptually similar to any other standard networking
switch, but molded to meet InfiniBand’s performance
and cost targets.
• Routers – Although not currently in wide deployment, an
InfiniBand router is intended to be used to segment a
very large network into smaller subnets connected
together by an InfiniBand router.
Example: InfiniBand System
Fabric
Example: The Big Google Search
Engine
• A supercluster built over high-speed PCs and Gigabit LANs for the global web-page search application provided by Google.
• Physically, the cluster is housed in 40 PC/switch racks with 80 PCs per rack, 3,200 PCs in total.
• Two additional racks house the 128 x 128 Gigabit Ethernet switches, the front hosts, UPSs, etc.
• All hardware parts are commercially available; Google-designed software systems support parallel search, URL linking, page ranking, file and database management, etc.
Google Search Engine Cluster
Hardware, Software, and
Middleware Support for connection
Example: Apache ZooKeeper

48
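The slide only names ZooKeeper, so the following is a hypothetical illustration rather than the course's example: cluster nodes can register ephemeral znodes so that membership changes are visible to the rest of the cluster. A minimal sketch with the third-party kazoo client, assuming an ensemble reachable at 127.0.0.1:2181:

```python
# Hypothetical cluster-membership sketch with the kazoo ZooKeeper client.
# Assumes a ZooKeeper ensemble reachable at 127.0.0.1:2181.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

zk.ensure_path("/cluster/nodes")
# Ephemeral + sequential znode: it disappears automatically if this node dies,
# so the remaining nodes can detect the failure.
zk.create("/cluster/nodes/node-", b"worker-host-1", ephemeral=True, sequence=True)

print("live nodes:", zk.get_children("/cluster/nodes"))
zk.stop()
```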
3. Single System Image (SSI)

• A single system image is the illusion, created by software or hardware, that presents a collection of resources as one integrated, powerful resource.
• SSI makes the cluster appear like a single machine to the user, to applications, and to the network.
• A cluster with multiple system images is nothing but a collection of independent computers (i.e., a distributed system in general).
Single-System-Image Features
• Single System: The entire cluster is viewed by the users as
one system, which has multiple processors.
• Single Control: Logically, an end user or system user utilizes
services from one place with a single interface.
• Symmetry: A user can use a cluster service from any node.
All cluster services and functionalities are symmetric to all
nodes and all users, except those protected by access rights.
• Location Transparent: The user is not aware of the
whereabouts of the physical device that eventually provides a
service.
• Some paradigms:
— Single Entry Points
— Single File Hierarchy
— Single I/O, Networking, and Memory Space
— Single Job management system, User Interface, process space,…
A. Single Entry Point

1. Four nodes of a cluster are used as host nodes to receive users' login requests.
2. To log into the cluster, the user issues a standard Unix command such as "telnet cluster.cs.hku.hk", using the symbolic name of the cluster system.
3. The symbolic name is translated by the DNS, which returns the IP address 159.226.41.150 of the least-loaded node, which happens to be node Host1.
4. The user then logs in using this IP address.
5. The DNS periodically receives load information from the host nodes to make load-balancing translation decisions.
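A toy sketch of the load-balancing translation in step 3 (host loads and the extra IP addresses are made up; only 159.226.41.150 and the cluster name come from the slide):

```python
# Toy illustration of the least-loaded translation in step 3.
# Host addresses and load figures are made up (only .150 appears on the slide).
host_load = {
    "159.226.41.150": 0.21,   # Host1
    "159.226.41.151": 0.65,   # Host2
    "159.226.41.152": 0.48,   # Host3
    "159.226.41.153": 0.90,   # Host4
}

def resolve(symbolic_name: str) -> str:
    """Return the IP of the currently least-loaded host node."""
    return min(host_load, key=host_load.get)

print(resolve("cluster.cs.hku.hk"))   # -> 159.226.41.150 (Host1 in this example)
```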
B. Single File Hierarchy
Files can reside in three types of locations in a cluster:
• Local storage - disk on the local node.
• Remote storage - disks on remote nodes.
• Stable storage -
— Persistent - data, once written to the stable storage,
will stay there at least for a period of time (e.g., a
week), even after the cluster shuts down.
— Fault tolerant - to some degree, by using
redundancy and periodical backup to tapes.
Single File Hierarchy
• A cluster that supports a single file hierarchy should provide:
—Single system: there is just one file hierarchy from the user's viewpoint.
—Symmetry: a user can access the global storage (e.g., /scratch) using a cluster service from any node.
—Location transparency: the user is not aware of the whereabouts of the physical device that eventually provides a service.

53
4. High Availability Through
Redundancy
• Three terms often go together: reliability, availability,
and serviceability (RAS).
• Availability combines the concepts of reliability and
serviceability as defined below:
• Reliability measures how long a system can operate
without a breakdown.
• Availability indicates the percentage of time that a
system is available to the user, that is, the
percentage of system uptime.
• Serviceability refers to how easy it is to service the
system, including hardware and software
maintenance, repair, upgrade, etc.
Availability and Failure Rate
Recent Find/SVP Survey of Fortune 1000 companies:
• An average computer is down 9 times a year with an average
downtime of 4 hours.
• The average loss of revenue per hour downtime is $82,500.

The operate-repair cycle of a computer system.

Availability = MTTF / (MTTF + MTTR), where MTTF is the mean time to failure and MTTR is the mean time to repair.


56
Availability Guarantee
Failure Cost Analysis
• Consider a cluster with little availability support. Upon a node
failure, the following sequence of events takes place:
1. The entire system is shut down and powered off
2. The faulty node is replaced if the failure is in hardware
3. The system is powered on and rebooted
4. The user application is reloaded and rerun from the start

• Assume one cluster node fails every 100 hours. Other parts
of the cluster never fail. Steps 1-3 take two hours. Step 4
takes two hours. What is the availability of this cluster?
What is the yearly failure cost if each one-hour downtime
costs $10,000?
— MTTF = 100 hours. MTTR = 2 + 2 = 4 hours.
— So availability = 100 / (100 + 4) = 96.15%
— The downtime in a year is 365x24x(1-96.15%) = 337 hours
— The yearly failure cost is 337 x 10,000 = $3.37 million.

58
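The same arithmetic, as a small sketch:

```python
# Verifying the failure-cost analysis above.
mttf, mttr = 100.0, 4.0                        # hours
availability = mttf / (mttf + mttr)
downtime_hours = 365 * 24 * (1 - availability)
yearly_cost = downtime_hours * 10_000

print(f"availability  = {availability:.2%}")        # ~ 96.15%
print(f"downtime/year = {downtime_hours:.0f} h")     # ~ 337 h
print(f"yearly cost   = ${yearly_cost:,.0f}")        # ~ $3.37 million
```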
Different Types of Failures
• Unplanned failures vs. planned shutdowns
— Unplanned: the system breaks due to an operating system
crash, hardware failure, network disconnection, human
operation errors, power outage, etc.
— Planned: the system is not broken, but is periodically taken off
for upgrades, reconfiguration, and maintenance.
• Transient failures vs. permanent failures
— Transient failures: occur temporarily and then disappear; they can be dealt with without replacing any component.
— Permanent failures: cannot be corrected by rebooting; some hardware or software component must be repaired or replaced.
• Partial failures vs. total failures
— Partial failures: affect only part of the system; the cluster is still usable.
— Total failures: the entire cluster becomes unusable.
A key approach to enhancing availability is to make as many failures
as possible partial failures by removing single points of failure.
59
Examples of Single Point of
Failure

Three workstations (each running its own OS) connected through a single switch: the switch is the single point of failure.
Solution: use a redundant network switch, so that each workstation is connected through two switches.

60
Single Points of Failure in SMP
and Clusters
Fault-Tolerant Cluster
Configurations
• Hot standby server clusters:
— Only the primary nodes actively do useful work. The standby nodes are powered on and run monitoring programs that exchange heartbeat signals to check the status of the primary nodes.
• Active-takeover clusters:
— All servers are primary and do useful work. When a node fails, its user applications fail over to an available node in the cluster. Users may experience some delay or may lose some data.
• Failover clusters:
— When a component fails, the remaining system takes over its services.
— Failover provides failure diagnosis, failure notification, and failure recovery.

62
Failure Cost Analysis Again
• Review the example in Page 58.
• Assume the cluster uses failover to improve the
availability.
— Upon a node failure, its workload automatically fails over
to other nodes after six minutes.
— The faulty node is taken off, repaired, and it rejoins the
cluster, all without impacting the rest of the cluster.
• Failure cost analysis:
— MTTF = 100 hours, MTTR = 0.1 hour
— Availability = 100/100.1 = 99.9%
— The downtime per year is 365x24x0.1% = 8.75 hours
— The yearly failure cost is 8.75x10,000 = $87,500

63
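And the failover variant of the same calculation:

```python
# Same calculation with failover: MTTR drops from 4 hours to 6 minutes (0.1 h).
mttf, mttr = 100.0, 0.1
availability = mttf / (mttf + mttr)
downtime_hours = 365 * 24 * (1 - availability)
print(f"availability ~ {availability:.1%}, yearly cost ~ ${downtime_hours * 10_000:,.0f}")
# availability ~ 99.9%, yearly cost ~ $87,512 (about $87,500)
```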
Checkpointing
• Checkpointing:
—The process of periodically saving the state of an
executing program to stable storage, from which the
system can recover after a failure.
—Each program state saved is called a checkpoint.
—Checkpointing can be realized by the operating system at the kernel level, by a third-party library in user space, or by the applications themselves.

64
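A minimal application-level checkpointing sketch (our own illustration using Python's pickle; a real system would also write checkpoints atomically, e.g., write to a temporary file and then rename):

```python
# Minimal application-level checkpointing sketch using pickle.
# A long-running loop periodically saves its state; after a crash it resumes
# from the last checkpoint instead of restarting from scratch.
import os
import pickle

CKPT_FILE = "state.ckpt"

def load_checkpoint():
    if os.path.exists(CKPT_FILE):
        with open(CKPT_FILE, "rb") as f:
            return pickle.load(f)          # resume from the saved checkpoint
    return {"i": 0, "total": 0}            # fresh start

state = load_checkpoint()
for i in range(state["i"], 1_000_000):
    state["total"] += i
    state["i"] = i + 1
    if i % 100_000 == 0:                   # checkpoint interval
        with open(CKPT_FILE, "wb") as f:
            pickle.dump(state, f)          # each saved state is one checkpoint

print(state["total"])
```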
An example: Checkpoint in
Flink

65
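The Flink slide itself is a figure; as a hedged sketch of what enabling periodic checkpoints looks like with the PyFlink API (the interval here is chosen arbitrarily):

```python
# Hedged PyFlink sketch: enabling periodic checkpoints for a Flink job.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(60_000)    # take a checkpoint every 60 seconds
# ... define sources, transformations, and sinks here ...
# env.execute("checkpointed-job")
```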
6. Cluster Job Scheduling and
Management
• End users make use of a cluster through “jobs”
— Serial jobs: run on a single node
— Parallel jobs: use multiple nodes
— Interactive jobs: require fast turnaround time, and their
input/output is directed to a terminal
• These jobs do not need large resources, and users expect them to execute immediately rather than wait in a queue.
— Batch jobs: need more resources but do not require an immediate response.
• They are submitted to a job queue and are scheduled to run when the resources become available (e.g., during off hours).

66
Cluster Job Scheduling and
Management
A Job Management System (JMS) should have three parts:
• A user server lets the user submit jobs to one or more queues, specify resource requirements for each job, delete a job from a queue, and inquire about the status of a job or a queue.
• A job scheduler that performs job scheduling and
queuing according to job types, resource requirements,
resource availability, and scheduling policies.
• A resource manager that allocates and monitors
resources, enforces scheduling policies, and collects
accounting information.
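A toy sketch (not any real JMS) of how the three parts fit together: the user server accepts submissions into a queue, the scheduler starts jobs first-come-first-served, and the resource manager tracks free nodes:

```python
# Toy JMS sketch: user server (submit), scheduler (FCFS), resource manager
# (node bookkeeping). Purely illustrative, not a real job management system.
from collections import deque

free_nodes = 8
job_queue = deque()                   # jobs wait here with their resource needs

def submit(job_id, nodes_needed):
    """User server: accept a job and record its resource requirement."""
    job_queue.append((job_id, nodes_needed))

def schedule():
    """Scheduler + resource manager: start queued jobs FCFS while nodes are free."""
    global free_nodes
    while job_queue and job_queue[0][1] <= free_nodes:
        job_id, nodes = job_queue.popleft()
        free_nodes -= nodes           # resource manager allocates the nodes
        print(f"started {job_id} on {nodes} nodes ({free_nodes} still free)")

submit("batch-1", 4)
submit("batch-2", 6)                  # has to wait until nodes are released
schedule()                            # starts batch-1 only
```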
JMS Administration
• JMS should be able to dynamically reconfigure the
cluster with minimal impact on the running jobs.
• The administrator’s prologue and epilogue scripts should
be able to run before and after each job for security
checking, accounting, and cleanup.
• Users should be able to cleanly kill their own jobs.
• The administrator or the JMS should be able to cleanly
suspend or kill any job.
— Clean means that when a job is suspended or killed, all its
processes must be included.
— Otherwise, "orphan" processes are left in the system, wasting cluster resources and possibly rendering the system unusable.
Scheduling Modes (1)
Dedicated Mode :
• Only one job runs in the cluster at a time, and at most
one process of the job is assigned to a node at a time.
• The single job runs until completion before it releases
the cluster to run other jobs.
Space Sharing :
Multiple jobs can run on disjoint partitions (groups) of
nodes simultaneously.
• At most one process is assigned to a node at a time.
• Although a partition of nodes is dedicated to a job, the
interconnect and the I/O subsystem may be shared by
all jobs.
Scheduling Modes (2)
• Time sharing :
• Multiple user processes are assigned to the same node.
• Time-sharing introduces the following parallel
scheduling policies:
1. Independent Scheduling (Independent): Uses the operating
system of each cluster node to schedule different processes
as in a traditional workstation.
2. Gang Scheduling: Schedules all processes of a parallel job
together. When one process is active, all processes are
active.
3. Competition with Foreign (Local) Jobs: Scheduling becomes
more complicated when both cluster jobs and local jobs are
running. The local jobs should have priority over cluster jobs.
• Two ways to deal with this situation: the cluster job can stay on the node or migrate to another node.
Lab Session:
• Create your first Telegram ChatBot!

75
