Google file System
Google file System
Google file system (GFS)
➢ Google File System, a scalable distributed file system for large
distributed data-intensive applications.
➢ Google File System (GFS) to meet the rapidly growing demands of
Google’s data processing needs.
➢ GFS shares many of the same goals as other distributed file systems
such as performance, scalability, reliability, and availability.
➢ GFS provides a familiar file system interface.
➢ Files are organized hierarchically in directories and identified by
pathnames.
➢ Support the usual operations to create, delete, open, close, read, and
write files.
GFS
➢ Small as well as multi-GB files are common.
➢Each file typically contains many application objects such as web
documents.
➢ GFS provides an atomic append operation called record append.
In a traditional write, the client specifies the offset at which data is to
be written.
➢Concurrent writes to the same region are not serializable.
➢GFS has snapshot and record append operations.
GFS (snapshot and record append)
➢The snapshot operation makes a copy of a file or a directory.
➢Record append allows multiple clients to append data to the
same file concurrently while guaranteeing the atomicity of each
individual client’s append.
➢ It is useful for implementing multi-way merge results.
➢GFS consist of two kinds of reads: large streaming reads and
small random reads.
➢ In large streaming reads, individual operations typically read
hundreds of KBs, more commonly 1 MB or more.
➢A small random read typically reads a few KBs at some
arbitrary offset.
Common Goals of GFS
and most Distributed File Systems
➢ Performance
➢ Reliability
➢ Scalability
➢ Availability
Other GFS Concepts
➢ Component failures are the norm rather
than the exception.
➢File System consists of hundreds or even thousands
of storage machines built from inexpensive
commodity parts.
➢ Files are Huge. Multi-GB Files are common.
➢ Each file typically contains many application
objects such as web documents.
➢ Append, Append, Append.
➢ Most files are mutated by appending new data
rather than overwriting
Other GFS Concepts
➢ Why assume hardware failure is the norm?
➢ It is cheaper to assume common failure on poor hardware and
account for it, rather than invest in expensive hardware and still
experience occasional failure.
➢The amount of layers in a distributed system (network, disk,
memory, physical connections, power, OS, application) mean
failure on any could contribute to data corruption.
GFS Interface
➢ GFS – familiar file system interface
➢ Files organized hierarchically in directories,
path names
➢ Create, delete, open, close, read, write
operations
➢ Snapshot and record append (allows multiple
clients to append simultaneously - atomic)
GFS Architecture
GFS Architecture
GFS Architecture
Chunk
GFS Architecture
➢A GFS cluster consists of a single master and multiple chunk-
servers and is accessed by multiple clients, Each of these is typically
a commodity Linux machine.
➢ It is easy to run both a chunk-server and a client on the same
machine.
➢As long as machine resources permit, it is possible to run flaky
application code is acceptable.
GFS Architecture
➢Files are divided into fixed-size chunks.
➢ Each chunk is identified by an immutable and
globally unique 64 bit chunk assigned by the
master at the time of chunk creation.
➢Chunk-servers store chunks on local disks as
Linux files, each chunk is replicated on multiple
chunk-servers.
➢The master maintains all file system metadata.
This includes the namespace, access control
GFS is fault tolerance?
Consistency
Consistency
Write control and data flow
Replica placement
Replica placement
Garbage Collection