In the name of Allah, the Most Gracious, the Most Merciful
Safia Zadran
Google
What will we cover?
Introducing the case study: Google
Overall architecture and design philosophy
Underlying communication paradigms
Data storage and coordination services
Distributed computation
Introduction to the Case Study (Google)
The ability to create an effective design is an important skill in distributed systems
We illustrate distributed design through a substantial case study, examining in detail the
design of the Google infrastructure: a platform and associated middleware that supports
both Google search and a set of associated web services and applications, including Google
Apps.
Google is one of the largest distributed systems in use today, and the Google infrastructure
has successfully dealt with a variety of demanding requirements.
Google [www.google.com III] is a US-based corporation with its headquarters in Mountain
View, California (the Googleplex), offering Internet search and a range of associated web applications.
Google was born out of a research project at Stanford University, with the company
launched in 1998.
From providing a search engine, Google has grown to be a major player in cloud computing.
The company has grown from its initial production system in 1998 to handling over 88
billion queries a month by the end of 2010. In all that time, the main search engine has never
experienced an outage, and users can expect query results in around 0.2 seconds.
The Google search engine
The Google search engine is complex, but its overall task is simple: to take a given query
and return an ordered list of the most relevant results that match that query, by searching
the content of the Web.
The underlying search engine consists of a set of services for:
* crawling the Web
* indexing
* ranking
Crawling
The task of the crawler is to locate and retrieve the contents of the Web and pass the
contents on to the indexing subsystem.
At the root of every search engine is software known as a crawler.
Crawlers are also known as bots, robots, or spiders.
What do crawlers do?
After the crawler has copied a website, the data must be stored on the search engine's servers.
You can access this copy from the cache.
The Google search engine does not work on your live site,
but on the copy of the site held on its servers.
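To make the crawling idea concrete, here is a minimal breadth-first crawler sketch in Python. This is an illustration only, not Google's crawler: the seed URL, the page limit and the use of Python's standard urllib and html.parser modules are assumptions for the example.

    # Minimal crawler sketch: fetch a page, keep a copy, queue its links.
    # Illustrative only, not Google's crawler.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects the href targets of <a> tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed, max_pages=10):
        """Breadth-first crawl from seed; returns {url: copy of the page}."""
        frontier, seen, copies = deque([seed]), {seed}, {}
        while frontier and len(copies) < max_pages:
            url = frontier.popleft()
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
            except OSError:
                continue                     # unreachable page: skip it
            copies[url] = html               # the stored "cache" copy
            extractor = LinkExtractor()
            extractor.feed(html)
            for link in extractor.links:
                absolute = urljoin(url, link)
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)
        return copies

    # Hypothetical usage: pages = crawl("https://www.example.com")

The copies returned here play the role of the cached pages described above; a real crawler would hand them on to the indexing subsystem.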
Indexing
Indexing is a more complex phase.
It occurs in many sub-phases and takes place in many data centres around the world.
Search engine algorithms extract signals to find the best information.
This process is hidden from the public.
Indexing:
This index will allow us to discover web pages that include the search terms ‘distributed’,
‘systems’ and ‘book’ and, by careful analysis, we will be able to discover pages that
include all of these terms. For example, the search engine will be able to identify that the
three terms can all be found in amazon.com, www.cdk5.net and indeed many other web
sites. Using the index, it is therefore possible to narrow down the set of candidate web
pages from billions to perhaps tens of thousands, depending on the level of discrimination
in the keywords chosen.
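A minimal sketch of the inverted-index idea behind this example, in Python. The page contents below are invented for illustration; a real index also stores positions, term frequencies and many other signals.

    # Inverted index sketch: map each term to the set of pages containing it.
    from collections import defaultdict

    def build_index(pages):
        """pages: {url: text}. Returns {term: set of urls containing it}."""
        index = defaultdict(set)
        for url, text in pages.items():
            for term in text.lower().split():
                index[term].add(url)
        return index

    def search(index, *terms):
        """Pages containing ALL the given terms (intersect the postings)."""
        postings = [index.get(t.lower(), set()) for t in terms]
        return set.intersection(*postings) if postings else set()

    # Hypothetical crawled pages:
    pages = {
        "amazon.com":   "distributed systems book for sale",
        "www.cdk5.net": "distributed systems concepts and design book",
        "example.org":  "cooking recipes and nothing else",
    }
    index = build_index(pages)
    print(search(index, "distributed", "systems", "book"))
    # -> {'amazon.com', 'www.cdk5.net'}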
Ranking
The search engine ranks (orders) all possible results relevant to the search query.
Ranking is based on a number of factors,
such as:
Past searches
Location
Ranking:
A higher rank is an indication of the importance of a page, and it is used to ensure
that important pages are returned nearer to the top of the list of results than lower-ranked
pages.
In PageRank, a page is viewed as important if it is linked to by a large number of
other pages, and links from important pages count for more.
For example, a link from bbc.co.uk will be viewed as more important than a link from
Gordon Blair’s personal web page.
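A simplified sketch of the PageRank computation in Python: each page repeatedly shares its rank among the pages it links to, so a page linked to by many (and by important) pages ends up with a high rank. The link graph and the damping factor of 0.85 are illustrative assumptions, not Google's actual data or algorithm details.

    # Simplified PageRank: iterate rank propagation until it stabilises.
    def pagerank(links, damping=0.85, iterations=50):
        """links: {page: [pages it links to]}. Returns {page: rank}."""
        pages = set(links) | {t for ts in links.values() for t in ts}
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iterations):
            new = {p: (1 - damping) / n for p in pages}
            for page, targets in links.items():
                for t in targets:
                    # each page shares its rank among its outgoing links
                    new[t] += damping * rank[page] / len(targets)
            rank = new
        return rank

    # Hypothetical link graph: many sites link to bbc.co.uk.
    links = {
        "a.com": ["bbc.co.uk"],
        "b.com": ["bbc.co.uk"],
        "c.com": ["bbc.co.uk", "personal.page"],
        "bbc.co.uk": ["a.com"],
    }
    ranks = pagerank(links)
    # ranks["bbc.co.uk"] > ranks["personal.page"]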
Overall architecture and design philosophy
Physical Model
The key philosophy of Google in terms of physical infrastructure is to use very large
numbers of commodity PCs to produce a cost-effective environment for distributed storage
and computation.
Each PC will typically have around 2 terabytes of disk storage and around 16 gigabytes of
DRAM.
The philosophy of building systems from commodity PCs comes from the original research
project (Sergey Brin and Larry Page at Stanford University).
Google has recognized that parts of its infrastructure will fail, and has therefore
designed the infrastructure using a range of strategies to tolerate such failures.
By far the most common source of failure is software, with about 20 machines
needing to be rebooted per day due to software failures. (Interestingly, the rebooting
process is entirely manual.)
Hardware failures represent about one tenth of the failures due to software, with around 2–3% of
PCs failing per annum (per year) due to hardware faults. Of these, 95% are due to faults in
disks or DRAM.
This validates the decision to procure commodity PCs: given that the vast majority of
failures are due to software, it is not worthwhile to invest in more expensive, more reliable
hardware.
The physical architecture is constructed as follows [Hennessy and Patterson 2006]:
There are between 40 and 80 PCs in a given rack; racks are double-sided, and each rack has an Ethernet switch.
The switch inside the rack is modular, supporting either eight 100-Mbps network interfaces or a
single 1-Gbps interface.
Racks are organized into clusters
A cluster typically consists of 30 or more racks and two high-bandwidth switches
providing connectivity to the outside world (the Internet and other Google centres).
Clusters are housed in Google data centres that are spread around the world.
In 2000, Google relied on key data centres in Silicon Valley (two centres) and in Virginia.
There are now centres in many geographical locations across the US and in Dublin (Ireland), Saint-
Ghislain (Belgium), Zurich (Switzerland), Tokyo (Japan) and Beijing (China).
This geographical spread is used to build fault-tolerant, large-scale systems.
If each PC offers 2 terabytes of storage, then a rack of 80 PCs will provide 160 terabytes,
with a cluster of 30 racks offering 4.8 petabytes.
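A quick check of the arithmetic behind these figures, using the numbers given in the text:

    # Storage per rack and per cluster, from the figures above.
    per_pc_tb = 2                     # 2 TB of disk per commodity PC
    pcs_per_rack = 80
    racks_per_cluster = 30

    rack_tb = per_pc_tb * pcs_per_rack           # 2 * 80 = 160 TB per rack
    cluster_tb = rack_tb * racks_per_cluster     # 160 * 30 = 4800 TB
    cluster_pb = cluster_tb / 1000               # = 4.8 PB per cluster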
(Figure: overall physical architecture. To avoid clutter, the Ethernet connections are shown from only one of the clusters to the external links.)
Overall system architecture
Key requirements:
Scalability: the first and most important requirement. The system must cope with:
More queries
Better results
More data
Together, these three dimensions of growth make up the scalability problem.
Reliability: the infrastructure must be highly reliable across all services, including Google Apps (Gmail, Google Calendar, Google Maps).
Performance:
The system must achieve low latency for user interactions, completing web search operations in 0.2 seconds.
Better performance means users return with more queries.
Performance is an end-to-end property, requiring all associated underlying resources to work together,
including network, storage and computational resources.
Openness: It is well known that Google as an organization encourages and nurtures
innovation, and this is most evident in the development of new web applications. This is
only possible with an infrastructure that is extensible and provides support for the
development of new applications.
Google has responded to these needs by developing the overall system architecture
described in the sections that follow.
Data storage and coordination services
There are three complementary services in the Google infrastructure:
Google File System
Chubby
BigTable
The Google File System (GFS)
The Google File System is designed to solve the problem of big data.
GFS is a distributed file system
What is a DFS?
A distributed file system is any file system that allows access to files from multiple hosts via a computer network.
It may include facilities for replication and fault tolerance.
A DFS manages files and folders across multiple computers.
What is the Google File System?
The Google File System is a scalable distributed file system created by Google and developed to
accommodate Google’s expanding data processing requirements.
GFS is formed from many storage systems built from low-cost commodity hardware
elements.
Google File System architecture
Cluster:
Google organizes GFS into clusters of computers. A cluster is simply a network of computers.
Within a GFS cluster there are three kinds of entities:
Clients
Master servers
Chunk servers
Client:
Any entity that makes a file request. Clients can be other computers or computer
applications. Think of the client as the customer of the GFS.
Master:
The master server acts as the coordinator for the cluster.
The master’s duties include maintaining an operation log, which keeps track of the activities
of the master’s cluster. The master maintains a historical record of critical metadata changes,
such as the namespace and the file-to-chunk mapping.
The operation log helps keep service interruptions to a minimum:
if the master server crashes, a replacement server can take its place.
Chunkserver:
Chunkservers are the workhorses of GFS. They are responsible for storing the 64 MB file
chunks.
Chunkservers don’t send chunks to the master server. Instead, they send requested
chunks directly to the client.
GFS copies every chunk multiple times and stores the copies on different chunkservers. Each
copy is called a replica.
By default, GFS makes three replicas per chunk, but users can change the setting and make
more or fewer replicas if desired.
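A toy model of the division of labour just described: the master holds only metadata (the file-to-chunk mapping, replica locations and the operation log), while chunkservers hold the 64 MB chunks and serve them directly to clients. All class and method names here are invented for illustration; this is not the real GFS API.

    # Toy sketch of the GFS roles: master (metadata only) and chunkservers.
    CHUNK_SIZE = 64 * 1024 * 1024    # 64 MB chunks
    REPLICAS = 3                     # default replication factor

    class ChunkServer:
        """Stores chunks and serves them directly to clients."""
        def __init__(self, name):
            self.name, self.chunks = name, {}

        def store(self, handle, data):
            self.chunks[handle] = data

        def read(self, handle):
            return self.chunks[handle]

    class Master:
        """Coordinator: keeps metadata and an operation log, never file data."""
        def __init__(self, servers):
            self.servers = servers
            self.files = {}          # filename -> [chunk handles]
            self.locations = {}      # chunk handle -> [replica servers]
            self.log = []            # record of critical metadata changes

        def write(self, filename, data):
            handles = []
            for i in range(0, len(data), CHUNK_SIZE):
                handle = f"{filename}#{i // CHUNK_SIZE}"
                replicas = self.servers[:REPLICAS]  # simplified placement
                for s in replicas:                  # chunk data goes to the
                    s.store(handle, data[i:i + CHUNK_SIZE])  # chunkservers
                self.locations[handle] = replicas
                self.log.append(("map", handle, [s.name for s in replicas]))
                handles.append(handle)
            self.files[filename] = handles

        def lookup(self, filename):
            """Clients ask the master only WHERE the chunks live."""
            return [(h, self.locations[h]) for h in self.files[filename]]

    # A client then reads chunks directly from a chunkserver:
    servers = [ChunkServer(f"cs{i}") for i in range(5)]
    master = Master(servers)
    master.write("/logs/day1", b"x" * 100)
    for handle, replicas in master.lookup("/logs/day1"):
        data = replicas[0].read(handle)   # direct client-to-chunkserver read

Because a replacement master can replay the operation log, and every chunk has three replicas, the failure of any single machine need not lose data.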
Chubby
Chubby is a crucial service at the heart of the Google infrastructure.
It is a self-described lock service,
offering storage and coordination services for other infrastructure services, including GFS
and Bigtable.
It provides coarse-grained distributed locks to synchronize distributed activities.
In the role of a lock-management tool, the main operations provided are:
Acquire,
TryAcquire
Release
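A sketch of how a client might use these three operations to coordinate a distributed activity such as electing a primary. The ChubbyLock class below is a hypothetical in-process stand-in, not the real Chubby client library, which works over the network against a replicated Chubby cell.

    # Coarse-grained lock usage sketch: Acquire / TryAcquire / Release.
    import threading

    class ChubbyLock:
        """Toy lock keyed by a path-like name (hypothetical API)."""
        _locks = {}
        _guard = threading.Lock()

        def __init__(self, path):
            with ChubbyLock._guard:
                self._lock = ChubbyLock._locks.setdefault(
                    path, threading.Lock())

        def Acquire(self):
            self._lock.acquire()          # block until the lock is held

        def TryAcquire(self):
            return self._lock.acquire(blocking=False)  # never blocks

        def Release(self):
            self._lock.release()

    # Whichever participant gets the lock acts as primary:
    lock = ChubbyLock("/ls/cell/primary-election")
    if lock.TryAcquire():
        try:
            print("I am the primary")     # do coordinated work here
        finally:
            lock.Release()
    else:
        print("Another participant holds the lock")

The locks are coarse-grained in the sense that they are held for long periods (electing a primary, guarding a configuration), not for individual reads and writes.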
BigTable
A NoSQL database developed by Google.
Handles very large datasets.
Highly distributed.
Row/column/timestamp indexing.
No joins.
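The essence of the Bigtable data model, a sparse map indexed by row key, column and timestamp, can be sketched as follows. The table contents below (a web table keyed by reversed URL) are illustrative; the real system adds column families, tablets and distribution across many servers.

    # Bigtable data-model sketch: (row, column, timestamp) -> value.
    # No joins: every lookup is by row/column/timestamp within one table.
    from collections import defaultdict

    class ToyBigtable:
        def __init__(self):
            # row -> column -> {timestamp: value}; sparse by construction
            self.rows = defaultdict(lambda: defaultdict(dict))

        def put(self, row, column, value, ts):
            self.rows[row][column][ts] = value

        def get(self, row, column):
            """Return the most recent value stored in this cell."""
            cell = self.rows[row][column]
            return cell[max(cell)] if cell else None

    # Hypothetical usage:
    t = ToyBigtable()
    t.put("com.cnn.www", "contents:html", "<html>v1</html>", ts=1)
    t.put("com.cnn.www", "contents:html", "<html>v2</html>", ts=2)
    t.put("com.cnn.www", "anchor:cnnsi.com", "CNN", ts=3)
    print(t.get("com.cnn.www", "contents:html"))   # -> <html>v2</html>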