In the name of Allah, the Most Gracious, the Most Merciful
Safia Zadran
Google
What will we cover?
Introducing the case study: Google
Overall architecture and design philosophy
Underlying communication paradigms
Data storage and coordination services
Distributed computation
Introduction to the Case Study (Google)
The ability to create an effective design is an important skill in distributed systems
We illustrate distributed design through a substantial case study, examining in detail the
design of the Google infrastructure: a platform and associated middleware that supports
both Google search and a set of associated web services and applications, including Google
Apps.
Google is one of the largest distributed systems in use today, and the Google infrastructure
has successfully dealt with a variety of demanding requirements.
Google [www.google.com III] is a US-based corporation with its headquarters in Mountain
View, California (the Googleplex), offering Internet search and a range of associated web applications.
Google was born out of a research project at Stanford University, with the company
launched in 1998.
From providing a search engine, Google has grown to be a major player in cloud computing.
The company has grown from its initial production system in 1998 to handling over 88
billion queries a month by the end of 2010. In all that time, the main search engine has never
experienced an outage, and users can expect query results in around 0.2 seconds.
The Google search engine
The Google search engine is complex, but its overall task is simple: to take a given query
and return an ordered list of the most relevant results that match that query, by searching
the content of the Web.
The underlying search engine consists of a set of services for:
* crawling the Web
* indexing
* ranking
Crawling
The task of the crawler is to locate and retrieve the contents of the Web and pass the
contents on to the indexing subsystem.
At the root of every search engine is software known as a crawler.
Crawlers are also known as bots, robots, or spiders.
What do crawlers do?
After the crawler has copied a website, the data must be stored on the search engine's servers.
You can access this copy from the cache.
The Google search engine does not work on your live site,
but on the copy of the site held on its servers.
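To make the crawling idea concrete, here is a minimal breadth-first crawler sketch in Python. This is an illustration only, not Google's crawler: the seed URL, the page limit and the use of Python's standard urllib and html.parser modules are assumptions for the example.

    # Minimal crawler sketch: fetch a page, keep a copy, queue its links.
    # Illustrative only, not Google's crawler.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects the href targets of <a> tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed, max_pages=10):
        """Breadth-first crawl from seed; returns {url: copy of the page}."""
        frontier, seen, copies = deque([seed]), {seed}, {}
        while frontier and len(copies) < max_pages:
            url = frontier.popleft()
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
            except OSError:
                continue                     # unreachable page: skip it
            copies[url] = html               # the stored "cache" copy
            extractor = LinkExtractor()
            extractor.feed(html)
            for link in extractor.links:
                absolute = urljoin(url, link)
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)
        return copies

    # Hypothetical usage: pages = crawl("https://www.example.com")

The copies returned here play the role of the cached pages described above; a real crawler would hand them on to the indexing subsystem.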
Indexing
Indexing is a more complex phase.
It occurs in many sub-phases and takes place in many data centres around the world.
Search engine algorithms extract signals to find the best information.
This process is hidden from the public.
Indexing:
This index will allow us to discover web pages that include the search terms ‘distributed’,
‘systems’ and ‘book’ and, by careful analysis, we will be able to discover pages that
include all of these terms. For example, the search engine will be able to identify that the
three terms can all be found in amazon.com, www.cdk5.net and indeed many other web
sites. Using the index, it is therefore possible to narrow down the set of candidate web
pages from billions to perhaps tens of thousands, depending on the level of discrimination
in the keywords chosen.
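A minimal sketch of the inverted-index idea behind this example, in Python. The page contents below are invented for illustration; a real index also stores positions, term frequencies and many other signals.

    # Inverted index sketch: map each term to the set of pages containing it.
    from collections import defaultdict

    def build_index(pages):
        """pages: {url: text}. Returns {term: set of urls containing it}."""
        index = defaultdict(set)
        for url, text in pages.items():
            for term in text.lower().split():
                index[term].add(url)
        return index

    def search(index, *terms):
        """Pages containing ALL the given terms (intersect the postings)."""
        postings = [index.get(t.lower(), set()) for t in terms]
        return set.intersection(*postings) if postings else set()

    # Hypothetical crawled pages:
    pages = {
        "amazon.com":   "distributed systems book for sale",
        "www.cdk5.net": "distributed systems concepts and design book",
        "example.org":  "cooking recipes and nothing else",
    }
    index = build_index(pages)
    print(search(index, "distributed", "systems", "book"))
    # -> {'amazon.com', 'www.cdk5.net'}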
Ranking
The search engine ranks (orders) all possible results relevant to the search query.
Ranking is based on a number of factors,
such as:
Past searches
Location
Ranking:
A higher rank is an indication of the importance of a page, and it is used to ensure
that important pages are returned nearer to the top of the list of results than lower-ranked
pages.
In PageRank, a page is viewed as important if it is linked to by a large number of
other pages, and links from important pages count for more.
For example, a link from bbc.co.uk will be viewed as more important than a link from
Gordon Blair’s personal web page.
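A simplified sketch of the PageRank computation in Python: each page repeatedly shares its rank among the pages it links to, so a page linked to by many (and by important) pages ends up with a high rank. The link graph and the damping factor of 0.85 are illustrative assumptions, not Google's actual data or algorithm details.

    # Simplified PageRank: iterate rank propagation until it stabilises.
    def pagerank(links, damping=0.85, iterations=50):
        """links: {page: [pages it links to]}. Returns {page: rank}."""
        pages = set(links) | {t for ts in links.values() for t in ts}
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iterations):
            new = {p: (1 - damping) / n for p in pages}
            for page, targets in links.items():
                for t in targets:
                    # each page shares its rank among its outgoing links
                    new[t] += damping * rank[page] / len(targets)
            rank = new
        return rank

    # Hypothetical link graph: many sites link to bbc.co.uk.
    links = {
        "a.com": ["bbc.co.uk"],
        "b.com": ["bbc.co.uk"],
        "c.com": ["bbc.co.uk", "personal.page"],
        "bbc.co.uk": ["a.com"],
    }
    ranks = pagerank(links)
    # ranks["bbc.co.uk"] > ranks["personal.page"]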
Overall architecture and design philosophy
Physical Model
The key philosophy of Google in terms of physical infrastructure is to use very large
numbers of commodity PCs to produce a cost-effective environment for distributed storage
and computation.
Each PC will typically have around 2 terabytes of disk storage and around 16 gigabytes of
DRAM.
The philosophy of building systems from commodity PCs comes from the original research
project (Sergey Brin and Larry Page at Stanford University).
Google has recognized that parts of its infrastructure will fail, and has therefore
designed the infrastructure using a range of strategies to tolerate such failures.
By far the most common source of failure is software, with about 20 machines
needing to be rebooted per day due to software failures. (Interestingly, the rebooting
process is entirely manual.)
Hardware failures represent about one tenth of the failures due to software, with around 2–3% of
PCs failing per annum (per year) due to hardware faults. Of these, 95% are due to faults in
disks or DRAM.
This validates the decision to procure commodity PCs: given that the vast majority of
failures are due to software, it is not worthwhile to invest in more expensive, more reliable
hardware.
The physical architecture is constructed as follows [Hennessy and Patterson 2006]:
There are between 40 and 80 PCs in a given rack; racks are double-sided, and each rack has an Ethernet switch.
The switch inside the rack is modular, supporting either eight 100-Mbps network interfaces or a
single 1-Gbps interface.
Racks are organized into clusters
A cluster typically consists of 30 or more racks and two high-bandwidth switches
providing connectivity to the outside world (the Internet and other Google centres).
Clusters are housed in Google data centres that are spread around the world.
In 2000, Google relied on key data centres in Silicon Valley (two centres) and in Virginia.
There are now centres in many geographical locations across the US and in Dublin (Ireland), Saint-
Ghislain (Belgium), Zurich (Switzerland), Tokyo (Japan) and Beijing (China).
This geographical spread is used to build fault-tolerant, large-scale systems.
If each PC offers 2 terabytes of storage, then a rack of 80 PCs will provide 160 terabytes,
with a cluster of 30 racks offering 4.8 petabytes.
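A quick check of the arithmetic behind these figures, using the numbers given in the text:

    # Storage per rack and per cluster, from the figures above.
    per_pc_tb = 2                     # 2 TB of disk per commodity PC
    pcs_per_rack = 80
    racks_per_cluster = 30

    rack_tb = per_pc_tb * pcs_per_rack           # 2 * 80 = 160 TB per rack
    cluster_tb = rack_tb * racks_per_cluster     # 160 * 30 = 4800 TB
    cluster_pb = cluster_tb / 1000               # = 4.8 PB per cluster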
(Figure: overall physical architecture. To avoid clutter, the Ethernet connections are shown from only one of the clusters to the external links.)
Overall system architecture
Key requirements:
Scalability: the first and most important requirement. The system must cope with:
More queries
Better results
More data
Together, these three dimensions of growth make up the scalability problem.
Reliability: the infrastructure must be highly reliable across all services, including Google Apps (Gmail, Google Calendar, Google Maps).
Performance:
The system must achieve low latency for user interactions, completing web search operations in 0.2 seconds.
Better performance means users return with more queries.
Performance is an end-to-end property, requiring all associated underlying resources to work together,
including network, storage and computational resources.
Openness: It is well known that Google as an organization encourages and nurtures
innovation, and this is most evident in the development of new web applications. This is
only possible with an infrastructure that is extensible and provides support for the
development of new applications.
Google has responded to these needs by developing the overall system architecture
described in the sections that follow.
Data storage and coordination services
There are three complementary services in the Google infrastructure:
Google File System
Chubby
BigTable
The Google File System (GFS)
The Google File System is designed to solve the problem of big data.
GFS is a distributed file system
What is a DFS?
A distributed file system is any file system that allows access to files from multiple hosts via a computer network.
It may include facilities for replication and fault tolerance.
A DFS manages files and folders across multiple computers.
What is the Google File System?
The Google File System is a scalable distributed file system created by Google and developed to
accommodate Google’s expanding data processing requirements.
GFS is formed from many storage systems built from low-cost commodity hardware
elements.
Google File System architecture
Cluster:
Google organizes GFS into clusters of computers. A cluster is simply a network of computers.
Within a GFS cluster there are three kinds of entities:
Clients
Master servers
Chunk servers
Client:
Any entity that makes a file request. Clients can be other computers or computer
applications. Think of the client as the customer of the GFS.
Master:
The master server acts as the coordinator for the cluster.
The master’s duties include maintaining an operation log, which keeps track of the activities
of the master’s cluster. The master maintains a historical record of critical metadata changes,
such as the namespace and the file-to-chunk mapping.
The operation log helps keep service interruptions to a minimum:
if the master server crashes, a replacement server can take its place.
Chunkserver:
Chunkservers are the workhorses of GFS. They are responsible for storing the 64 MB file
chunks.
Chunkservers don’t send chunks to the master server. Instead, they send requested
chunks directly to the client.
GFS copies every chunk multiple times and stores the copies on different chunkservers. Each
copy is called a replica.
By default, GFS makes three replicas per chunk, but users can change the setting and make
more or fewer replicas if desired.
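A toy model of the division of labour just described: the master holds only metadata (the file-to-chunk mapping, replica locations and the operation log), while chunkservers hold the 64 MB chunks and serve them directly to clients. All class and method names here are invented for illustration; this is not the real GFS API.

    # Toy sketch of the GFS roles: master (metadata only) and chunkservers.
    CHUNK_SIZE = 64 * 1024 * 1024    # 64 MB chunks
    REPLICAS = 3                     # default replication factor

    class ChunkServer:
        """Stores chunks and serves them directly to clients."""
        def __init__(self, name):
            self.name, self.chunks = name, {}

        def store(self, handle, data):
            self.chunks[handle] = data

        def read(self, handle):
            return self.chunks[handle]

    class Master:
        """Coordinator: keeps metadata and an operation log, never file data."""
        def __init__(self, servers):
            self.servers = servers
            self.files = {}          # filename -> [chunk handles]
            self.locations = {}      # chunk handle -> [replica servers]
            self.log = []            # record of critical metadata changes

        def write(self, filename, data):
            handles = []
            for i in range(0, len(data), CHUNK_SIZE):
                handle = f"{filename}#{i // CHUNK_SIZE}"
                replicas = self.servers[:REPLICAS]  # simplified placement
                for s in replicas:                  # chunk data goes to the
                    s.store(handle, data[i:i + CHUNK_SIZE])  # chunkservers
                self.locations[handle] = replicas
                self.log.append(("map", handle, [s.name for s in replicas]))
                handles.append(handle)
            self.files[filename] = handles

        def lookup(self, filename):
            """Clients ask the master only WHERE the chunks live."""
            return [(h, self.locations[h]) for h in self.files[filename]]

    # A client then reads chunks directly from a chunkserver:
    servers = [ChunkServer(f"cs{i}") for i in range(5)]
    master = Master(servers)
    master.write("/logs/day1", b"x" * 100)
    for handle, replicas in master.lookup("/logs/day1"):
        data = replicas[0].read(handle)   # direct client-to-chunkserver read

Because a replacement master can replay the operation log, and every chunk has three replicas, the failure of any single machine need not lose data.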
Chubby
Chubby is a crucial service at the heart of the Google infrastructure.
It is a self-described lock service,
offering storage and coordination services for other infrastructure services, including GFS
and Bigtable.
It provides coarse-grained distributed locks to synchronize distributed activities.
In the role of a lock-management tool, the main operations provided are:
Acquire,
TryAcquire
Release
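A sketch of how a client might use these three operations to coordinate a distributed activity such as electing a primary. The ChubbyLock class below is a hypothetical in-process stand-in, not the real Chubby client library, which works over the network against a replicated Chubby cell.

    # Coarse-grained lock usage sketch: Acquire / TryAcquire / Release.
    import threading

    class ChubbyLock:
        """Toy lock keyed by a path-like name (hypothetical API)."""
        _locks = {}
        _guard = threading.Lock()

        def __init__(self, path):
            with ChubbyLock._guard:
                self._lock = ChubbyLock._locks.setdefault(
                    path, threading.Lock())

        def Acquire(self):
            self._lock.acquire()          # block until the lock is held

        def TryAcquire(self):
            return self._lock.acquire(blocking=False)  # never blocks

        def Release(self):
            self._lock.release()

    # Whichever participant gets the lock acts as primary:
    lock = ChubbyLock("/ls/cell/primary-election")
    if lock.TryAcquire():
        try:
            print("I am the primary")     # do coordinated work here
        finally:
            lock.Release()
    else:
        print("Another participant holds the lock")

The locks are coarse-grained in the sense that they are held for long periods (electing a primary, guarding a configuration), not for individual reads and writes.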
BigTable
A NoSQL database developed by Google.
Handles very large datasets.
Highly distributed.
Row/column/timestamp indexing.
No joins.
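The essence of the Bigtable data model, a sparse map indexed by row key, column and timestamp, can be sketched as follows. The table contents below (a web table keyed by reversed URL) are illustrative; the real system adds column families, tablets and distribution across many servers.

    # Bigtable data-model sketch: (row, column, timestamp) -> value.
    # No joins: every lookup is by row/column/timestamp within one table.
    from collections import defaultdict

    class ToyBigtable:
        def __init__(self):
            # row -> column -> {timestamp: value}; sparse by construction
            self.rows = defaultdict(lambda: defaultdict(dict))

        def put(self, row, column, value, ts):
            self.rows[row][column][ts] = value

        def get(self, row, column):
            """Return the most recent value stored in this cell."""
            cell = self.rows[row][column]
            return cell[max(cell)] if cell else None

    # Hypothetical usage:
    t = ToyBigtable()
    t.put("com.cnn.www", "contents:html", "<html>v1</html>", ts=1)
    t.put("com.cnn.www", "contents:html", "<html>v2</html>", ts=2)
    t.put("com.cnn.www", "anchor:cnnsi.com", "CNN", ts=3)
    print(t.get("com.cnn.www", "contents:html"))   # -> <html>v2</html>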