Lecture Notes On Big-Data Storage
This chapter describes how big-data systems store large files in a distributed file system. The goal
is to teach data scientists how to manage big data efficiently and to explain the implications of some
design and implementation decisions. The material is largely based on the Hadoop Distributed File
System (HDFS), but most of the concepts apply to other distributed file systems. It draws on the
public documentation of Hadoop, personal experience working with Hadoop and Spark, and the
source code of HDFS. It is not meant to be exhaustive or to cover all aspects of HDFS; it describes
the basic features and how they work in a distributed cluster.
Path
Path is essentially a String that stores a path to a file or directory in HDFS. The Path object itself
does not contain any information on whether it points to a file, a directory, a symbolic link, etc.
It provides convenience methods to manipulate the path, such as retrieving the parent directory or
converting a relative path to an absolute path. In general, a path is written in the following form.
hdfs://namenode:9000/absolute/path/filename
• hdfs is the scheme of the file system. Other examples of schemes are http, ftp, and file.
• namenode is the name or the IP address of the host that serves the file system. This is the
namenode for HDFS, or the server address for http and ftp file systems.
• 9000 is the port on which the server is listening.
• /absolute/path/ is the absolute path where the file or directory is located. It always starts with a
leading ‘/’.
• filename is the name of the entity (file or directory) pointed at by the path.
Among all previous parts, only the filename is required. All other parts will be automatically
replaced with their default values if not provided. The default values are defined as follows.
• Scheme: The default file system is defined by the Hadoop configuration “fs.defaultFS".
• namenode: If HDFS is used, the default namenode will be retrieved from Hadoop configura-
tion “fs.defaultFS". For some file systems, this value is ignored, e.g., file, and for others it
is required and there is no default, e.g., ftp and http.
• port: The default port for HDFS is again retrieved from Hadoop configuration. For other file
systems, a corresponding default port is used, e.g., 80 for http and 21 for ftp.
• If the absolute path is not provided, the current working directory is used instead. The default
working directory in the local file system is typically the directory in which you run the
command. In HDFS, the working directory is “/user/$USER" where $USER is the system
username.
Notice that you cannot override the file system without specifying an absolute path. Examples of
possible paths are provided below.
Example 1.1 — Paths. The following are examples of paths.
• new Path("filename.txt") – points to an entity named “filename.txt" in the working
directory of the default file system.
• new Path("/path/to/a/file") – points to an entity named “file" at the given absolute
path starting at the root and in the default file system.
• new Path("hdfs:///path/to/a/file") – points to a file at the absolute path “/path/-
to/a/file" in HDFS with the default HDFS configuration.
• new Path("file:///home/user/input") – points to an entity at the absolute path “/home-
/user/input" in the local file system, i.e., not in HDFS. This is useful to access the local file
system when the default file system configures is HDFS.
• new Path("http://example.com/path/to/file.txt") - points to an entity named in
a remote HTTP server named ‘example.com’ with the default HTTP port (80) and at the
absolute path ‘/path/to/file.txt’ at the server.
The following code snippet shows a few examples of how to use the Path class in Java.
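The sketch below is only illustrative; the path strings are placeholders and the commented outputs assume the paths shown.

import org.apache.hadoop.fs.Path;

public class PathExamples {
    public static void main(String[] args) {
        // A relative path; it is resolved against the working directory of the default file system.
        Path relative = new Path("filename.txt");

        // An absolute path; the scheme and host still come from the default configuration.
        Path absolute = new Path("/path/to/a/file");

        // A fully qualified path that explicitly selects the local file system.
        Path local = new Path("file:///home/user/input");

        // Convenience methods provided by Path.
        System.out.println(local.getName());                  // last component: "input"
        System.out.println(local.getParent());                // the parent directory
        System.out.println(relative.isAbsolute());            // false
        System.out.println(new Path(absolute, "child.txt"));  // "/path/to/a/file/child.txt"
    }
}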
Configuration
Configuration is a key-value map that stores all the configuration that HDFS needs to work.
One example just mentioned above is “fs.defaultFS" which stores the default file system. Other
configuration parameters include the default replication factor and block size. For an exhaustive list
of parameters related to HDFS and their default values, check the following link.
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.
xml
When you create a Configuration object using the default constructor “new Configuration()", it
will load the default values configured for the cluster. You can always override specific parameters
using an appropriate “set" method or retrieve the current value using an appropriate “get” method.
Notice that while some set and get methods seem to deal with non-string values, e.g., setInt and
getBoolean, all values are internally stored as strings. This makes it easier to load the data from
XML and serialize it over the network.
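As a rough sketch, the snippet below reads and overrides a few common parameters; the namenode address and the overridden values are only illustrative.

import org.apache.hadoop.conf.Configuration;

public class ConfigurationExample {
    public static void main(String[] args) {
        // Loads the cluster defaults (e.g., from core-site.xml and hdfs-site.xml if present).
        Configuration conf = new Configuration();

        // Read the default file system; the second argument is a fallback if the key is unset.
        String defaultFS = conf.get("fs.defaultFS", "file:///");
        System.out.println("Default file system: " + defaultFS);

        // Override parameters programmatically. Even with typed setters such as setInt,
        // the values are stored internally as strings.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        conf.setInt("dfs.replication", 2);
        conf.setLong("dfs.blocksize", 64L * 1024 * 1024);

        System.out.println(conf.getInt("dfs.replication", 3)); // prints 2
    }
}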
FileSystem
This abstract class defines a general interface for dealing with any file system. Using this class in
your code allows your application to seamlessly work on both the local file system and HDFS. For
example, when you are developing and testing your code, you can run it with the local file system.
Once you move it to a real distributed cluster, it will work fine with HDFS.
There are three ways to obtain a file system using the HDFS API.
1. From a Path instance. This is the most common way to get a file system. Given a Path, the
method Path#getFileSystem(Configuration) returns the appropriate file system for the given
path as follows.
(a) If the Path contains a file system scheme, the corresponding file system instance is
created.
(b) If the path does not contain a file system scheme, the default file system is returned.
2. Default file system. To obtain the default file system, use the method FileSystem.get(Configuration).
3. Local file system. If you want to access the local file system, use the method
FileSystem.getLocal(Configuration).
Notice how all the above methods require a Configuration instance so that Hadoop can determine
what the file system is and how to access the master node.
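A minimal sketch of the three ways is shown below; the hdfs:/// path assumes that the default file system is configured to be HDFS.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;

public class FileSystemExamples {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // 1. From a Path instance: the scheme of the path decides which file system is returned.
        Path path = new Path("hdfs:///path/to/a/file");
        FileSystem fromPath = path.getFileSystem(conf);

        // 2. The default file system, as defined by fs.defaultFS.
        FileSystem defaultFs = FileSystem.get(conf);

        // 3. The local file system, regardless of the default.
        LocalFileSystem localFs = FileSystem.getLocal(conf);

        System.out.println(fromPath.getUri());
        System.out.println(defaultFs.getUri());
        System.out.println(localFs.getUri());
    }
}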
Figure 1.2: Analogy between Unix file system and HDFS
Self Writing
One special case in writing is when the writer process is running on one of the datanodes. In this
case, the process works exactly the same except for one change. The namenode will detect that the
writer is one of the datanodes and it will always assign the first replica to that node. The second
and third replicas are assigned as before. This ensures that no network communication is needed to
write the first block which reduces the network overhead. While this case might look like a special
case, it is actually a very common case in distributed data processing. Typically, datanodes and
compute nodes run on the same physical machines, as discussed later. In this case, when the final
output is being written to HDFS, every node is writing a small chunk of the output. It makes more
sense in this case that each machine writes the first replica to its own disk while the second and
third replicas are created on other datanodes. If all machines participate in the output writing, the
load will still be roughly balanced.
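The toy model below mimics this placement rule; it is not HDFS's actual block placement policy, and the node names and the uniformly random selection of remaining replicas are assumptions made only for illustration.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ReplicaPlacementSketch {
    // Picks the datanodes that receive the replicas of one block.
    static List<String> placeReplicas(String writer, List<String> datanodes, int replication) {
        List<String> candidates = new ArrayList<>(datanodes);
        List<String> replicas = new ArrayList<>();

        // If the writer is itself a datanode, it always gets the first replica,
        // so the first copy is written without any network transfer.
        if (candidates.remove(writer)) {
            replicas.add(writer);
        }

        // Remaining replicas go to other datanodes (here chosen uniformly at random).
        Collections.shuffle(candidates);
        for (String node : candidates) {
            if (replicas.size() >= replication) break;
            replicas.add(node);
        }
        return replicas;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("DN1", "DN2", "DN3", "DN4", "DN5");
        System.out.println(placeReplicas("DN3", nodes, 3)); // e.g., [DN3, DN1, DN5]
    }
}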
Figure 1.3: Analogy between cluster organization in racks and houses in cities
[A note about racks] In data centers, nodes are organized physically in racks. Each rack
contains about 16-32 nodes, all connected to an on-rack switch. Rack switches are connected
together to allow nodes from different racks to communicate. In this setting, if two nodes
connected to the same switch are communicating, only their on-rack switch is busy and other
switches are not involved. However, when two nodes in different racks communicate, both
their on-rack switches and the medium connecting them are busy, which reduces the network
efficiency. That is why HDFS tries to reduce inter-rack communication by assigning two
replicas to the same rack.
To understand this setting, think of each node as a house and each rack as a city. When cars
(network packets) move between two houses in the same city (rack), the roads in the city
become busy but other cities are not affected. However, when a car needs to move from one
city to another city, then it has to use the roads inside the two cities as well as the highway
that connects all cities. To optimize traffic, you would want more cars moving inside each
city and fewer cars moving on the highway between two cities.
Problem 1.2 Consider an HDFS cluster with ten datanodes (DN1...DN10). DN1 is writing a file
of 9GB and the block size is 128MB and the replication factor is three. Assume that all datanodes
are in one rack.
1. How many blocks are created for this file? Answer: (9×1,024)/128=72 blocks
2. How many block replicas are created in total? Answer: 72×3=216 block replicas
3. How much data is written to the local disk of DN1? Answer: 9 GB (the entire file)
4. What is the average data size written to other datanodes? Answer: Total amount of data
written to other machines = 9×2=18 GB. Average data per node = 18/9=2 GB
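As a quick sanity check, the short snippet below reproduces the arithmetic of Problem 1.2; all numbers come from the problem statement.

public class Problem12 {
    public static void main(String[] args) {
        long fileSizeMB = 9 * 1024;   // 9 GB expressed in MB
        long blockSizeMB = 128;
        int replication = 3;
        int otherDatanodes = 9;       // DN2 ... DN10

        long blocks = fileSizeMB / blockSizeMB;          // 72 blocks
        long replicas = blocks * replication;            // 216 block replicas
        long localGB = 9;                                // DN1 stores the full first replica
        double remoteGB = 9.0 * (replication - 1);       // 18 GB written to other datanodes
        double avgRemoteGB = remoteGB / otherDatanodes;  // 2 GB per other datanode

        System.out.printf("blocks=%d, replicas=%d, local=%d GB, avg remote=%.1f GB%n",
                blocks, replicas, localGB, avgRemoteGB);
    }
}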
Problem 1.3 Repeat the previous question when the writer process is not one of the datanodes.
If HDFS is properly configured, you can use shell commands similar to the example shown after the
link below. Simply, you precede your command with hdfs dfs -. Most commands work similarly to
Unix commands, but there can be some differences. For a list of all commands and how to use them,
check the following link.
https://hadoop.apache.org/docs/r3.2.2/hadoop-project-dist/hadoop-common/FileSystemShell.
html
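For example, a few typical commands look like the following; the file and directory names are placeholders.

hdfs dfs -mkdir -p /user/$USER/data        # create a directory in HDFS
hdfs dfs -put localfile.txt data/          # upload a file from the local file system
hdfs dfs -ls data                          # list the contents of a directory
hdfs dfs -cat data/localfile.txt           # print a file to the console
hdfs dfs -get data/localfile.txt copy.txt  # download a file to the local file system
hdfs dfs -rm -r data                       # delete a directory recursively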
1.7 Exercises
Exercise 1.1 In the following questions, assume we have a cluster of one namenode and 10
datanodes all in one rack. Each datanode has a disk of 10 terabytes which is all available to
HDFS. HDFS is configured with a default replication factor of 3 and a default block size of 128
MB.
1. According to this configuration, what is the capacity of HDFS? In other words, how much
data can we store in HDFS?
2. A driver node that is not one of the data nodes creates a file of size 2GB. How much is the
total network IO (incurred on all the machines) required to upload the data file?
3. One of the data nodes is writing a 2GB file. How much is the total network IO incurred
on all data nodes while the 2GB file is being written? Explain your answer.
4. How much network IO is required to download the file back from HDFS to the master
node?
5. How much is the expected network IO to download the file back from HDFS to one of the
data nodes? (Hint: Calculate the probability of a block being remote to one data node.
You can assume that all nodes are in one rack.)