DISTRIBUTED DATABASES, NOSQL SYSTEMS AND BIG DATA

After a comprehensive study of this chapter, you will be able to understand:
• Distributed Database Concepts and Advantages
• Data Fragmentation, Replication and Allocation Techniques for Distributed Database Design
• Types of Distributed Systems
• Distributed Database Architecture
• Introduction to NoSQL Systems
• The CAP Theorem
• Document-based, Key-value Stores, Column-based and Graph-based Systems
• Big Data
• MapReduce
• Hadoop
DISTRIBUTED DATABASE CONCEPTS AND ADVANTAGES
A major motivation behind the development of database systems is the desire to integrate the operational data of an organization and to provide controlled access to the data. Although integration and controlled access may imply centralization, this is not the intention. In fact, the development of computer networks promotes a decentralized mode of work. This decentralized approach mirrors the organizational structure of many companies, which are logically distributed into divisions, departments, projects, and so on, and physically distributed into offices, plants, and factories, where each unit maintains its own operational data. The share-ability of the data and the efficiency of data access should be improved by the development of a distributed database system that reflects this organizational structure, makes the data in all units accessible, and stores data proximate to the location where it is most frequently used.

Distributed DBMSs should help resolve the islands of information problem. Databases are sometimes regarded as electronic islands that are distinct and generally inaccessible places, like remote islands. This may be a result of geographical separation, incompatible computer architectures, incompatible communication protocols, and so on. Integrating the databases into a logical whole may prevent this way of thinking.
A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed over a computer network. A distributed database management system (DDBMS) is the software that manages the DDB and provides an access mechanism that makes this distribution transparent to the users. The distributed database (DDB) and the distributed database management system (DDBMS) together are called a distributed database system (DDBS).
Figure 4.1: Distributed Database System (Database Technology [Integration] + Computer Networks [Distribution]; Integration ≠ Centralization)
Figure 4.2: Centralized Environment
Figure 4.3: Distributed Environment
Characteristics of Distributed Database System
The Distributed Database System has the following characteristics:
• A collection of logically related shared data;
• The data is split into a number of fragments;
• Fragments may be replicated;
• Fragments/replicas are allocated to sites;
• The sites are linked by a communications network;
• The data at each site is under the control of a DBMS;
• The DBMS at each site can handle local applications autonomously;
• Each DBMS participates in at least one global application.
Components of Distributed Database System
The different components of DDBS are as follows:
• Computer workstations or remote devices (sites or nodes) that form the network system. The distributed database system must be independent of the computer system hardware.
• Network hardware and software components that reside in each workstation or device. The network components allow all sites to interact and exchange data. Because the components (computers, operating systems, network hardware, and so on) are likely to be supplied by different vendors, it is best to ensure that distributed database functions can be run on multiple platforms.
• Communications media that carry the data from one node to another. The DDBMS must be communications media-independent; that is, it must be able to support several types of communications media.
• The transaction processor (TP), which is the software component found in each computer or device that requests data. The transaction processor receives and processes the application's data requests (remote and local). The TP is also known as the application processor (AP) or the transaction manager (TM).
• The data processor (DP), which is the software component residing on each computer or device that stores and retrieves data located at the site. The DP is also known as the data manager (DM). A data processor may even be a centralized DBMS.

Advantages of DDBS
The advantages of DDBS are as follows:
1. Reflects organizational structure: Many organizations are naturally distributed over several locations.
2. Improved share-ability and local autonomy: The geographical distribution of an organization can be reflected in the distribution of the data; users at one site can access data stored at other sites. Data can be placed at the site close to the users who normally use that data. In this way, users have local control of the data, and they can consequently establish and enforce local policies regarding the use of this data. A global DBA is responsible for the entire system. Generally, part of this responsibility is devolved to the local level, so that the local DBA can manage the local DBMS.
3. Improved availability: In a centralized DBMS, a computer failure terminates the operations of the DBMS. However, a failure at one site of a DDBMS, or a failure of a communication link making some sites inaccessible, does not make the entire system inoperable. Distributed DBMSs are designed to continue to function despite such failures. If a single node fails, the system may be able to reroute the failed node's requests to another site.
4. Improved reliability: Because data may be replicated so that it exists at more than one site, the failure of a node or a communication link does not necessarily make the data inaccessible. Performance is improved as the data is located near the site of "greatest demand," and given the inherent parallelism of distributed DBMSs, speed of database access may be better than that achievable from a remote centralized database. Furthermore, since each site handles only a part of the entire database, there may not be the same contention for CPU and I/O services as characterized by a centralized DBMS.
5. Economics: The potential cost saving occurs where databases are geographically remote and the applications require access to distributed data. In such cases, owing to the relative expense of data being transmitted across the network as opposed to the cost of local access, it may be much more economical to partition the application and perform the processing locally at each site.
6. Modular growth: In a distributed environment, it is much easier to handle expansion. New sites can be added to the network without affecting the operations of other sites. This flexibility allows an organization to expand relatively easily.
7. Integration: At the start of this section, we noted that integration was a key aim of the DBMS approach, not centralization. The integration of legacy systems is one particular example that demonstrates how some organizations are forced to rely on distributed data processing to allow their legacy systems to coexist with more modern systems. At the same time, no one package can provide all the functionality that an organization requires nowadays. Thus, it is important for organizations to be able to integrate software components from different vendors to meet their specific requirements.
integrate software components from different vendors to mect theirs4. Integrity control more diffi
5. Lack of standard:
6. Lack of experience:
7. Database desig? more oonign of
8. Remaining competitive: There are a number of relatively recent developments that rely heavily on distributed database technology, such as e-business, computer-supported collaborative work, and workflow management. Many enterprises have had to reorganize their businesses and use distributed database technology to remain competitive.
Disadvantages of DDBS
The disadvantages of DDBS are as follows:
1. Complexity: A distributed DBMS that hides the distributed nature from the user and provides an acceptable level of performance, reliability, and availability is inherently more complex than a centralized DBMS. The fact that data can be replicated also adds an extra level of complexity to the distributed DBMS. If the software does not handle data replication adequately, there will be degradation in availability, reliability, and performance compared with the centralized system, and the advantages we cited earlier will become disadvantages.
2. Cost: Increased complexity means that we can expect the procurement and maintenance costs for a DDBMS to be higher than those for a centralized DBMS. Furthermore, a distributed DBMS requires additional hardware to establish a network between sites. There are ongoing communication costs incurred with the use of this network. There are also additional labor costs to manage and maintain the local DBMSs and the underlying network.
3. Security: In a centralized system, access to the data can be easily controlled. However, in a distributed DBMS not only does access to replicated data have to be controlled in multiple locations, but the network itself has to be made secure. In the past, networks were regarded as an insecure communication medium. Although this is still partially true, significant developments have been made to make networks more secure.
4. Integrity control more difficult: Database integrity refers to the validity and consistency of stored data. Integrity is usually expressed in terms of constraints, which are consistency rules that the database is not permitted to violate. Enforcing integrity constraints generally requires access to a large amount of data that defines the constraint but that is not involved in the actual update operation itself. In a distributed DBMS, the communication and processing costs that are required to enforce integrity constraints may be prohibitive.
5. Lack of standards: Although distributed DBMSs depend on effective communication, we are only now starting to see the appearance of standard communication and data access protocols. This lack of standards has significantly limited the potential of distributed DBMSs. There are also no tools or methodologies to help users convert a centralized DBMS into a distributed DBMS.
6. Lack of experience: General-purpose distributed DBMSs have not been widely accepted, although many of the protocols and problems are well understood. Consequently, we do not yet have the same level of experience in industry as we have with centralized DBMSs. For a prospective adopter of this technology, this may be a significant deterrent.
7. Database design more complex: Besides the normal difficulties of designing a centralized database, the design of a distributed database has to take account of fragmentation of data, allocation of fragments to specific sites, and data replication.
DISTRIBUTED DATABASE DESIGN
In this section, we discuss techniques that are used to break up the database into logical units, called fragments, which may be assigned for storage at the various nodes. We also discuss data replication, which permits certain data to be stored in more than one site to increase availability and reliability; and the process of allocating fragments, or replicas of fragments, for storage at the various nodes. These techniques are used during the process of distributed database design. The information concerning data fragmentation, allocation, and replication is stored in a global directory that is accessed by the DDBS applications as needed.
Data Fragmentation
Fragmentation is the task of dividing a table into a set of smaller tables. The subsets of the table are called fragments. These fragments may be stored at different locations. Moreover, fragmentation increases parallelism and provides better disaster recovery. Fragmentation can be of three types:
• Vertical Fragmentation
• Horizontal Fragmentation
• Hybrid Fragmentation
Fragmentation should be done in such a way that the original table can be reconstructed from the fragments whenever required. This requirement is called "reconstructiveness".
Advantages of Fragmentation
• Since data is stored close to the site of usage, efficiency of the database system is increased.
• Local query optimization techniques are sufficient for most queries since data is locally available.
• Since irrelevant data is not available at the sites, security and privacy of the database system can be maintained.
Disadvantages of Fragmentation
• When data from different fragments are required, the access times may be very high.
• In case of recursive fragmentations, the job of reconstruction will need expensive techniques.
• Lack of back-up copies of data at different sites may render the database ineffective in case of failure of a site.
Vertical Fragmentation
In vertical fragmentation, the fields or columns of a table are grouped into fragments. In order to maintain reconstructiveness, each fragment should contain the primary key field(s) of the table. Vertical fragmentation can be used to enforce privacy of data.
Figure 4.4: Vertical fragmentation
Example 4.1: Let us consider the following Student table.
Student
Stu_id | Stu_name | Stu_address | Dept_id
10 | Maya | Palpa | 1
11 | Abin | Ktm | 2
12 | Arav | Ktm | 1
13 | Ashna | Palpa | 3
14 | Anju | Pokhara | 4
15 | Manish | Banepa | 2
16 | Pinky | Ktm | 1
Figure 4.5: Student table before fragmentation
Now, the address details are maintained in the admin section. In this case, the designer will fragment the database as follows:
CREATE TABLE Std_address AS
SELECT Stu_id, Stu_address
FROM Student;
By executing the above query, we get the following result:

Std_address
Stu_id | Stu_address
10 | Palpa
11 | Ktm
12 | Ktm
13 | Palpa
14 | Pokhara
15 | Banepa
16 | Ktm
Figure 4.6: Std_address Table (Vertical Fragment of Student Table)
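The vertical fragmentation above can be tried outside a DDBMS with a small Python/sqlite3 sketch. A single in-memory database stands in for the distributed sites, and the fragment names mirror the example (Std_main is an invented name for the remaining columns). The key point is reconstructiveness: because both fragments carry the primary key, a join rebuilds the original table.

```python
import sqlite3

# One in-memory database stands in for the sites in this sketch.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Student (Stu_id INTEGER PRIMARY KEY, "
            "Stu_name TEXT, Stu_address TEXT, Dept_id INTEGER)")
cur.executemany("INSERT INTO Student VALUES (?,?,?,?)",
                [(10, "Maya", "Palpa", 1),
                 (11, "Abin", "Ktm", 2),
                 (12, "Arav", "Ktm", 1)])

# Vertical fragments: each keeps the primary key (reconstructiveness).
cur.execute("CREATE TABLE Std_address AS SELECT Stu_id, Stu_address FROM Student")
cur.execute("CREATE TABLE Std_main AS SELECT Stu_id, Stu_name, Dept_id FROM Student")

# Reconstruction: join the fragments on the shared primary key.
rebuilt = cur.execute(
    "SELECT m.Stu_id, m.Stu_name, a.Stu_address, m.Dept_id "
    "FROM Std_main m JOIN Std_address a ON m.Stu_id = a.Stu_id "
    "ORDER BY m.Stu_id").fetchall()
original = cur.execute("SELECT * FROM Student ORDER BY Stu_id").fetchall()
```

If a fragment dropped the primary key, the join would have nothing to match on and the original table could not be rebuilt, which is exactly why the reconstructiveness rule requires it.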
Horizontal Fragmentation
Horizontal fragmentation groups the tuples of a table in accordance with the values of one or more fields. Horizontal fragmentation should also conform to the rule of reconstructiveness. Each horizontal fragment must have all columns of the original base table.
Figure 4.7: Horizontal fragmentation
Example 4.2: In the Student schema shown in figure 4.5, if the details of all students of department 1 need to be maintained at the respective faculty, then the designer will horizontally fragment the database as follows:

CREATE TABLE Department AS
SELECT * FROM Student
WHERE Dept_id = 1;
By executing the above query, we get the following result:

Department
Stu_id | Stu_name | Stu_address | Dept_id
10 | Maya | Palpa | 1
12 | Arav | Ktm | 1
16 | Pinky | Ktm | 1
Figure 4.8: Department Table (Horizontal Fragment of Student Table)
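Horizontal fragmentation and its reconstruction can be sketched the same way in Python/sqlite3. A single in-memory database again stands in for the sites, and the fragment names Dept_1 through Dept_3 are invented for the sketch; reconstruction here is a UNION of the disjoint fragments rather than a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Student (Stu_id INTEGER PRIMARY KEY, "
            "Stu_name TEXT, Stu_address TEXT, Dept_id INTEGER)")
cur.executemany("INSERT INTO Student VALUES (?,?,?,?)",
                [(10, "Maya", "Palpa", 1),
                 (11, "Abin", "Ktm", 2),
                 (12, "Arav", "Ktm", 1),
                 (13, "Ashna", "Palpa", 3)])

# One horizontal fragment per department; each keeps all columns.
for dept in (1, 2, 3):
    cur.execute(f"CREATE TABLE Dept_{dept} AS "
                f"SELECT * FROM Student WHERE Dept_id = {dept}")

# Reconstruction: UNION of the disjoint fragments.
rebuilt = cur.execute(
    "SELECT * FROM Dept_1 UNION SELECT * FROM Dept_2 "
    "UNION SELECT * FROM Dept_3 ORDER BY Stu_id").fetchall()
original = cur.execute("SELECT * FROM Student ORDER BY Stu_id").fetchall()
```

Because the selection predicates are disjoint and together cover every Dept_id value, the UNION loses no rows and duplicates nothing, which is what makes the fragmentation reconstructible.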
Hybrid Fragmentation
In hybrid fragmentation, a combination of horizontal and vertical fragmentation techniques is used. This is the most flexible fragmentation technique since it generates fragments with minimal extraneous information. However, reconstruction of the original table is often an expensive task. Hybrid fragmentation can be done in two alternative ways:
1. At first, generate a set of horizontal fragments; then generate vertical fragments from one or more of the horizontal fragments.
2. At first, generate a set of vertical fragments; then generate horizontal fragments from one or more of the vertical fragments.
Figure 4.9: Hybrid fragmentation
Example 4.3: To maintain only the id and name of the student with Stu_id 12 at a site, the designer will fragment the database as follows:

CREATE TABLE Hybrid AS
SELECT Stu_id, Stu_name FROM Student
WHERE Stu_id = 12;
By executing the above query, we get the following result:

Hybrid
Stu_id | Stu_name
12 | Arav
Figure 4.10: Hybrid Table (Hybrid Fragment of Student Table)
Data Replication
Data replication is the process of generating and reproducing multiple copies of data at one or more sites. It is an important mechanism because it enables organizations to provide users with access to current data where and when they need it. It is intended to increase the availability of the system such that if one database fails, another can continue to serve queries or update requests. Replication is sometimes described using the publishing industry metaphor of publishers, distributors, and subscribers.
• Publisher: A DBMS that makes data available to other locations through replication. The publisher can have one or more publications (made up of one or more articles), each defining a logically related set of objects and data to replicate.
• Distributor: A DBMS that stores replication data and metadata about the publication, and in some cases acts as a queue for data moving from the publisher to the subscribers. A DBMS can act as both the publisher and the distributor.
• Subscriber: A DBMS that receives replicated data. A subscriber can receive data from multiple publishers and publications. Depending on the type of replication chosen, the subscriber can also pass data changes back to the publisher or republish the data to other subscribers.
Replication Purpose
The purposes of replication are as follows:
1. System availability: A distributed database system may remove single points of failure by replicating data, so that data items are accessible from multiple sites. Consequently, even when some sites are down, data may be accessible from other sites.
2. Performance: One of the major contributors to response time is the communication overhead. Replication enables us to locate the data closer to their access points, thereby localizing most of the access, which contributes to a reduction in response time.
3. Scalability: As systems grow geographically and in terms of the number of sites (and, consequently, in terms of the number of access requests), replication allows us to support this growth with acceptable response times.
4. Application requirements: Finally, replication may be dictated by the applications, which may wish to maintain multiple data copies as part of their operational specifications.
Challenges in Replication
1. Placement of replicas: The major challenge in replication is where to put the replicas. There are three places to put replicas:
• Permanent replicas: Permanent replicas consist of a cluster of servers that may be geographically dispersed.
• Server-initiated replicas: Server-initiated replicas include placing replicas in the hosting servers and server caches.
• Client-initiated replicas: Client-initiated replicas include web browser caches.
2. Propagation of updates among replicas: The next challenge is how to propagate updates in one replica among all the replicas as efficiently and as fast as possible.
• Push-based propagation: The replica in which an update occurs pushes the update to all other replicas.
• Pull-based propagation: A replica requests another replica to send the newest data it has.
3. Lack of consistency: If a copy is modified, the copy becomes inconsistent with the rest of the copies. It takes some time for all the copies to become consistent.
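Push-based propagation can be sketched in a few lines of Python. The Replica class below is invented purely for illustration: each replica holds a key-value copy of the data, and the replica that accepts a write pushes the update to all its peers. The sketch ignores failures, ordering, and concurrency, which are exactly the hard parts in a real system.

```python
# Minimal sketch of push-based update propagation among replicas.
# The class and all names here are illustrative, not a real protocol.

class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}      # this site's copy of the data
        self.peers = []     # the other replicas

    def write(self, key, value):
        self.data[key] = value
        for peer in self.peers:      # push-based: the writer propagates
            peer.data[key] = value

    def read(self, key):             # pull-based would instead fetch here
        return self.data.get(key)

site_a, site_b, site_c = Replica("A"), Replica("B"), Replica("C")
for r in (site_a, site_b, site_c):
    r.peers = [p for p in (site_a, site_b, site_c) if p is not r]

site_a.write("stock", 42)            # an update at one site...
```

After the single write at site A, a read at any other site sees the new value; with pull-based propagation the other sites would instead stay stale until they asked a peer for the newest data, which is the window of inconsistency described in point 3 above.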
Advantages of Data Replication
• Reliability: In case of failure of any site, the database system continues to work, since a copy is available at another site(s).
• Reduction in network load: Since local copies of data are available, query processing can be done with reduced network usage, particularly during prime hours. Data updating can be done at non-prime hours.
• Quicker response: Availability of local copies of data ensures quick query processing and consequently quick response time.
• Simpler transactions: Transactions require a smaller number of joins of tables located at different sites and minimal coordination across the network. Thus, they become simpler in nature.
Disadvantages of Data Replication
• Increased storage requirements: Maintaining multiple copies of data is associated with increased storage costs. The storage space required is in multiples of the storage required for a centralized system.
• Increased cost and complexity of data updating: Each time a data item is updated, the update needs to be reflected in all the copies of the data at the different sites. This requires complex synchronization techniques and protocols.
• Undesirable application-database coupling: If complex update mechanisms are not used, removing data inconsistency requires complex coordination at the application level. This results in undesirable application-database coupling.
Data Allocation
Each fragment, or each copy of a fragment, is stored at a particular site in the distributed system with an "optimal" distribution. This process is called data distribution (or data allocation). The choice of sites and the degree of replication depend on the performance and availability goals of the system and on the types and frequencies of transactions submitted at each site.

Example 4.4: If high availability is required, transactions can be submitted at any site, and most transactions are retrieval only, a fully replicated database is a good choice. However, if certain transactions that access particular parts of the database are mostly submitted at a particular site, the corresponding set of fragments can be allocated at that site only. Data that is accessed at multiple sites can be replicated at those sites. If many updates are performed, it may be useful to limit replication. Finding an optimal or even a good solution to distributed data allocation is a complex optimization problem.
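The trade-off in Example 4.4 can be made concrete with a toy cost model. Everything below is an invented assumption for illustration: real allocation algorithms weigh many more factors (site capacities, fragment sizes, link costs), but even this two-term model reproduces the retrieval-heavy versus update-heavy reasoning.

```python
# Toy cost model for the allocation decision; all formulas and numbers
# are invented for illustration only.

def total_cost(strategy, reads, writes, n_sites):
    """Relative communication cost of a workload under a strategy."""
    if strategy == "centralized":
        # every read and write from a remote site crosses the network once
        return reads + writes
    if strategy == "full_replication":
        # reads are local (cost ~0), but each write must reach every copy
        return writes * n_sites
    raise ValueError(strategy)

def choose(reads, writes, n_sites):
    """Pick the cheaper of the two strategies for this workload."""
    options = ("centralized", "full_replication")
    return min(options, key=lambda s: total_cost(s, reads, writes, n_sites))

heavy_read = choose(1000, 10, 4)     # retrieval-mostly workload
heavy_write = choose(10, 1000, 4)    # update-mostly workload
```

With 1000 reads and 10 writes over 4 sites, full replication costs 40 against 1010 for centralization, so replication wins; reversing the frequencies reverses the choice, matching the advice in Example 4.4 to limit replication when many updates are performed.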
There are four alternative strategies regarding the placement of data: centralized, fragmented,
complete replication, and selective replication.
Centralized
This strategy consists of a single database and DBMS stored at one site with users distributed across the network. Locality of reference is at its lowest, as all sites, except the central site, have to use the network for all data accesses. This also means that communication costs are high. Reliability and availability are low, as a failure of the central site results in the loss of the entire database system.
Fragmented (or Partitioned)
This strategy partitions the database into disjoint fragments, with each fragment assigned to one site. If data items are located at the site where they are used most frequently, locality of reference is high. As there is no replication, storage costs are low; similarly, reliability and availability are low, although they are higher than in the centralized case, as the failure of a site results in the loss of only that site's data. Performance should be good and communication costs low if the distribution is designed properly.
Complete Replication
This strategy consists of maintaining a complete copy of the database at each site. Therefore, locality of reference, reliability, availability, and performance are maximized. However, storage costs and communication costs for updates are the most expensive. To overcome these problems, snapshots are sometimes used. A snapshot is a copy of the data at a given time. The copies are updated periodically (for example, hourly or weekly), so they may not be always up to date. Snapshots are also sometimes used to implement views in a distributed database, to improve the time it takes to perform a database operation on a view.
Selective Replication
This strategy is a combination of fragmentation, replication, and centralization. Some data items are fragmented to achieve high locality of reference, and others that are used at many sites and are not frequently updated are replicated; otherwise, the data items are centralized. The objective of this strategy is to have all the advantages of the other approaches but none of the disadvantages. This is the most commonly used strategy, because of its flexibility.
TYPES OF DISTRIBUTED DATABASE SYSTEMS
Distributed databases can be broadly classified into homogeneous and heterogeneous
distributed database environments.
Homogeneous DDBS
In a homogeneous system, all sites use the same DBMS product. Homogeneous systems are much easier to design and manage. This approach provides incremental growth, making the addition of a new site to the DDBMS easy, and allows increased performance by exploiting the parallel processing capability of multiple sites.
Example 4.5: Consider that we have three departments using Oracle 19c as the DBMS. If some changes are made in one department, then it would update the other departments also.

Figure 4.11: Homogeneous distributed system
Types of Homogeneous Distributed Database System
There are two types of homogeneous distributed database system:
• Autonomous: Each database is independent and functions on its own. They are integrated by a controlling application and use message passing to share data updates.
• Non-autonomous: Data is distributed across the homogeneous nodes, and a central or master DBMS coordinates data updates across the sites.
Heterogeneous DDBS
In a heterogeneous system, sites may run different DBMS products, which need not be based on the same underlying data model, and so the system may be composed of relational, network, hierarchical, and object-oriented DBMSs. Heterogeneous systems usually result when individual sites have implemented their own databases and integration is considered at a later stage. In a heterogeneous system, translations are required to allow communication between different DBMSs. To provide DBMS transparency, users must be able to make requests in the language of the DBMS at their local site. The system then has the task of locating the data and performing any necessary translation. Data may be required from another site that may have:
• Different hardware
• Different DBMS products
• Different hardware and different DBMS products.
Example 4.6: In the following diagram, different DBMS software are accessible to each other using generic connectivity (ODBC and JDBC).

Figure 4.12: Heterogeneous distributed system
If the hardware is different but the DBMS products are the same, the translation is straightforward, involving the change of codes and word lengths. If the DBMS products are different, the translation is complicated, involving the mapping of the data structures in one data model to the equivalent data structures in another data model.
Example 4.7: Relations in the relational data model may be mapped to records and sets in the network model. It is also necessary to translate the query language used (for example, SQL SELECT statements are mapped to the network FIND and GET statements). If both the hardware and software are different, then both these types of translation are required. This makes the processing extremely complex.
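The generic connectivity idea from Example 4.6 can be sketched with Python's DB-API, which plays a role loosely analogous to ODBC/JDBC: the same calls work unchanged against different engines. In the sketch below, two sqlite3 connections stand in for two sites (an assumption for illustration; in a real heterogeneous system each connection would come from a different driver module, one per DBMS product, and the table name Emp is invented).

```python
import sqlite3

def fetch_names(conn):
    """Same DB-API calls regardless of which engine is behind conn."""
    return [row[0] for row in conn.execute("SELECT name FROM Emp ORDER BY name")]

# Two connections stand in for two sites; with ODBC/JDBC-style generic
# connectivity only the driver/connect call would differ per DBMS.
site1 = sqlite3.connect(":memory:")
site2 = sqlite3.connect(":memory:")
for conn, names in ((site1, ["Maya", "Abin"]), (site2, ["Arav"])):
    conn.execute("CREATE TABLE Emp (name TEXT)")
    conn.executemany("INSERT INTO Emp VALUES (?)", [(n,) for n in names])

# A simple global query: union the results from both sites.
global_result = sorted(fetch_names(site1) + fetch_names(site2))
```

Because fetch_names depends only on the common interface, the global query logic never needs to know which DBMS product sits behind each connection; the hard heterogeneity problems (data model and query language translation) begin where the engines do not share such an interface.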
Types of Heterogeneous Distributed Database System
• Federated: The heterogeneous database systems are independent in nature and are integrated together so that they function as a single database system.
• Un-federated: The database systems employ a central coordinating module through which the databases are accessed.
DISTRIBUTED DATABASE ARCHITECTURES
DDBS architectures are generally developed depending on three parameters:
• Distribution: It states the physical distribution of data across the different sites, i.e., whether the components of the system are located on the same machine or not.
• Autonomy: It indicates the distribution of control of the database system and the degree to which each constituent DBMS can operate independently. Autonomy is a function of a number of factors, such as whether the component systems (i.e., individual DBMSs) exchange information, whether they can independently execute transactions, and whether one is allowed to modify them.
• Heterogeneity: It refers to the uniformity or dissimilarity of the data models, system components, and databases.

General Architecture of Pure Distributed Databases
Here, we discuss both the logical and component architectural models of a DDBS. In figure 4.13, which describes the generic schema architecture of a DDBS, the enterprise is presented with a consistent, unified view showing the logical structures of underlying data across all nodes.

Figure 4.13: Schema architecture of distributed database
This architecture generally has four levels of schemas:
• External View or Schema (EV or ES): Depicts user view of data.
• Global Conceptual Schema (GCS): Depicts the global logical view of data, which provides network transparency.
• Local Conceptual Schema (LCS): Depicts logical data organization at each site.
• Local Internal Schema (LIS): Depicts physical data organization at each site.
Federated Database Schema Architecture
A typical five-level schema architecture to support global applications in the FDBS environment is shown in figure 4.14. In this architecture, the local schema is the conceptual schema (full database definition) of a component database, and the component schema is derived by translating the local schema into a canonical data model or common data model (CDM) for the FDBS. Schema translation from the local schema to the component schema is accompanied by generating mappings to transform commands on a component schema into commands on the corresponding local schema. The export schema represents the subset of a component schema that is available to the FDBS. The federated schema is the global schema or view, and is the result of integrating all the shareable export schemas. The external schema defines the schema for a user group or an application, as in the three-level schema architecture.
Overview of Three-Tier Client/Server Architecture
Full-scale DDBMSs have not been developed to support all the types of functionality we have discussed so far. Instead, distributed database applications are being developed in the context of the client/server architectures. It is now more common to use a three-tier architecture rather than a two-tier architecture, particularly in Web applications. This architecture is illustrated in figure 4.15.

Figure 4.15: Three-tier client/server architecture (client: user interface or presentation tier, e.g. Web browser, HTML, JavaScript, Visual Basic; HTTP protocol; application server; ODBC, JDBC, SQL/CLI, SQLJ; database server)

In the three-tier client/server architecture, the following three layers exist:
• Presentation layer (client): This provides the user interface and interacts with the user. The programs at this layer present Web interfaces or forms to the client in order to interface with the application. Web browsers are often utilized, and the languages and specifications used include HTML, XHTML, CSS, Flash, MathML, Scalable Vector Graphics (SVG), Java, JavaScript, Adobe Flex, and others. This layer handles user input, output, and navigation by accepting user commands and displaying the needed information, usually in the form of static or dynamic Web pages. The latter are employed when the interaction involves database access. When a Web interface is used, this layer typically communicates with the application layer via the HTTP protocol.
• Application layer (business logic): This layer programs the application logic. For example, queries can be formulated based on user input from the client, or query results can be formatted and sent to the client for presentation. Additional application functionality can be handled at this layer, such as security checks, identity verification, and other functions. The application layer can interact with one or more databases or data sources as needed by connecting to the database using ODBC, JDBC, SQL/CLI, or other database access techniques.
• Database server: This layer handles query and update requests from the application layer, processes the requests, and sends the results. Usually, SQL is used to access the database if it is relational or object-relational, and stored database procedures may also be invoked. Query results (and queries) may be formatted into XML when transmitted between the application server and the database server.
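As a sketch of how these three tiers cooperate, the snippet below uses an in-memory SQLite database to stand in for the database server; the table, function, and data names are illustrative, not from any particular system. A business-logic function formulates the SQL query from client input and formats the result for presentation:

```python
import sqlite3

# Database server tier: an in-memory SQLite database stands in for a
# real database server (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, owner TEXT, balance REAL)")
conn.execute("INSERT INTO account VALUES (1, 'Alice', 120.0)")

# Application tier (business logic): formulates the SQL query from the
# client's input and formats the result for presentation.
def get_balance(account_id):
    row = conn.execute(
        "SELECT owner, balance FROM account WHERE id = ?", (account_id,)
    ).fetchone()
    if row is None:
        return {"error": "no such account"}
    return {"owner": row[0], "balance": row[1]}

# Presentation tier (client): in a real deployment this would be a Web
# page calling the application tier over HTTP rather than a print call.
print(get_balance(1))
```

In a Web application the presentation tier would reach `get_balance` through an HTTP request, and the application tier would reach the database server through ODBC or JDBC rather than an in-process connection.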
INTRODUCTION TO NOSQL SYSTEMS

A NoSQL database, which stands for "non SQL" or "non-relational," is a database that provides a mechanism for data storage and retrieval. It avoids joins and is easy to scale. The major purpose of a NoSQL database is to support distributed data stores with humongous data storage needs. NoSQL is used for big data and real-time web apps. For example, companies like Twitter, Facebook, and Google collect terabytes of user data every single day.

Traditional RDBMS uses SQL syntax to store and retrieve data for further insights. Instead, a NoSQL database system encompasses a wide range of database technologies that can store structured, semi-structured, unstructured, and polymorphic data.
Why NoSQL?

The concept of NoSQL databases became popular with Internet giants like Google, Facebook, Amazon, etc. who deal with huge volumes of data. The system response time becomes slow when you use RDBMS for massive volumes of data.

To resolve this problem, we could "scale up" our systems by upgrading our existing hardware. This process is expensive. The alternative for this issue is to distribute the database load on multiple hosts whenever the load increases. This method is known as "scaling out."

A NoSQL database is non-relational, so it scales out better than relational databases, as NoSQL databases are designed with web applications in mind.
[Figure 4.16: Scale-up (vertical scaling) adds more RAM, CPU, and HDD to a single server; scale-out (horizontal scaling) distributes the load across additional servers.]
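The "scaling out" idea can be sketched as simple key-based sharding. The host names and the modulo placement rule below are hypothetical; production systems typically use consistent hashing so that adding a host moves only a fraction of the keys:

```python
import hashlib

# Hypothetical database hosts in a scaled-out cluster.
HOSTS = ["db-host-0", "db-host-1", "db-host-2"]

def host_for(key):
    # A stable hash ensures the same key always maps to the same host.
    digest = hashlib.md5(key.encode()).hexdigest()
    return HOSTS[int(digest, 16) % len(HOSTS)]

# Each record lands on exactly one host, so storage and query load are
# spread across the cluster instead of concentrated on one upgraded server.
placement = {uid: host_for(uid) for uid in ["user:1", "user:2", "user:3"]}
```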
Differences between RDBMS and NoSQL
RDBMS is called a relational database while NoSQL is called a distributed database. NoSQL databases do not have relations among tables; instead, there is a proper method in NoSQL to use unstructured data. RDBMS is scalable vertically while NoSQL is scalable horizontally. Maintenance of RDBMS is expensive, as manpower is needed to manage the servers added to the database; NoSQL is mostly automatic and does some repairs on its own, so data distribution and administration effort is less in NoSQL.
Example: Let's take data stored in RDBMS as:

User
Uid  Fname    Lname
1    Indra    Chaudhary
2    Chokraj  Dawadi

Skill
User_id  Skill_name
1        Big Data
1        Cloud
2        Calculus

Experience
User_id  Role                     Company
1        Full time faculty        CAB College
2        Principal                New Summit College
2        Visiting faculty member  KMC

Reading this profile would require the application to read 4 rows from three tables.
Joined together, the result rows look like:

Lname      Skill_name  Role                     Company
Chaudhary  Big Data    Full time faculty        CAB College
Chaudhary  Cloud       Full time faculty        CAB College
Dawadi     Calculus    Principal                New Summit College
Dawadi     Calculus    Visiting faculty member  KMC
In NoSQL we can express the above three tables' data in the form of JSON as below:

{
  "User": [
    {
      "Uid": 1,
      "First_name": "Indra",
      "Last_name": "Chaudhary",
      "Skill": ["Big Data", "Cloud"],
      "Experience": [
        {"Role": "Full time faculty", "Company": "CAB College"}
      ]
    },
    {
      "Uid": 2,
      "First_name": "Chokraj",
      "Last_name": "Dawadi",
      "Skill": ["Calculus"],
      "Experience": [
        {"Role": "Principal", "Company": "New Summit College"},
        {"Role": "Visiting faculty member", "Company": "KMC"}
      ]
    }
  ]
}

Note that the skills and experience are nested inside each user, so one document holds a complete profile.
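A short Python sketch shows why this layout matters: the whole profile arrives in one read, with no joins across User, Skill, and Experience tables. The field names mirror the JSON example above; the data is illustrative:

```python
import json

# One self-contained profile document.
profile_json = """
{
  "Uid": 1,
  "First_name": "Indra",
  "Last_name": "Chaudhary",
  "Skill": ["Big Data", "Cloud"],
  "Experience": [
    {"Role": "Full time faculty", "Company": "CAB College"}
  ]
}
"""

# A single read returns the user together with skills and experience;
# no joins are needed to assemble the profile.
profile = json.loads(profile_json)
print(profile["First_name"], profile["Skill"])
```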
Major differences between them are tabulated below:

1. RDBMS: Users know RDBMS well as it is old, and many organizations use this database for the proper format of data.
   NoSQL: This is relatively new, and experts in it are fewer as this database is still evolving.

2. RDBMS: User interface tools to access data are available in the market, so users can try all the schema in the RDBMS infrastructure. This helps to interact with the data well, and users will understand the data in a better manner.
   NoSQL: User interface tools to access and manipulate data in NoSQL are very few, and hence users do not have many options to interact with the data.

3. RDBMS: Scalability and performance face some issues if the data is huge. Servers may not run properly with the available load, and this leads to performance issues.
   NoSQL: It works well with high loads. Scalability is very good in NoSQL. This makes the performance of the database better when compared with RDBMS. A huge amount of data can be easily handled by users.

4. RDBMS: Multiple tables can be joined easily, and this does not cause any latency in the working of the database. The primary key helps in this case.
   NoSQL: Multiple tables cannot be joined in NoSQL; joins are not an easy task for the database and do not work well with its performance.

5. RDBMS: The availability of the database depends on the server performance, and it is mostly available whenever the database is opened. The data provided is consistent and does not confuse users.
   NoSQL: Though the databases are readily available, the consistency provided in some databases is less. This affects the performance of the database.

6. RDBMS: Data analysis and querying can be done easily, even though the queries are complex. Slicing and dicing can be done with the available data to make a proper analysis of the data given.
   NoSQL: Data analysis is also done in NoSQL, and it works well with real-time data analytics. Reports are not done in the database, but if an application has to be built, then NoSQL is a solution for it.

7. RDBMS: Documents cannot be stored in RDBMS, because data in the database should be structured and in a proper format to create identifiers.
   NoSQL: Documents can be stored in a NoSQL database, as the data is unstructured and not in a rows-and-columns format.

8. RDBMS: Partitions cannot be created in the database. Key-value pairs are needed to identify the data in a particular format specified in the schema of the database.
   NoSQL: Partitions can be created in the database easily, and key-value pairs are not needed to identify the data in the source. Software as a service can be integrated with NoSQL.

9. RDBMS examples: MySQL, Oracle, SQL Server, etc.
   NoSQL examples: IBM Domino, Oracle NoSQL, Apache HBase, etc.
Types of NoSQL Databases

Several different varieties of NoSQL databases have been created to support specific needs and use cases. These fall into four main categories:
Document Databases

Document databases store data in documents, like JSON (JavaScript Object Notation) objects. Each document has a set of field and value pairs. The values might be of many sorts, such as texts, integers, Booleans, arrays, or objects, and their structures are usually aligned with the objects that developers interact with within code.

Document databases are useful for a broad number of use cases and may be utilized as a general-purpose database due to their variety of field value types. They can expand out horizontally to accommodate enormous data volumes.
Key-Value Databases
Key-value databases are a simpler form of database that has keys and values for each item.
Learning how to query for a certain key-value pair is usually straightforward because a value can only be accessed by referencing its key.
Key-value databases are ideal for situations in which you need to store a significant quantity of
data but don't need to access it using complicated queries. Caching and saving user preferences
are two common use cases. Popular key-value databases include Redis and DynamoDB.
Wide-Column Stores
Wide-column NoSQL databases store data in tables with rows and columns similar to RDBMS,
but names and formats of columns can vary from row to row across the table. Wide-column databases group columns of related data together. A query can retrieve related data in a single operation because only the columns associated with the query are retrieved. In an RDBMS, the data would be in different rows stored in different places on disk, requiring multiple disk
operations for retrieval.
Graph Databases
Data is stored in nodes and edges in graph databases. Edges hold information about the relationships between nodes, whereas nodes store information about people, locations, and objects. A graph database uses graph structures to store, map, and query relationships. They provide index-free adjacency, so that adjacent elements are linked together without using an index.
Benefits of NoSQL

NoSQL databases offer enterprises important advantages over traditional RDBMS, including:

Scalability

NoSQL databases use a horizontal scale-out methodology that makes it easy to add or reduce capacity quickly and non-disruptively with commodity hardware. This eliminates the tremendous cost and complexity of manual sharding that is necessary when attempting to scale RDBMS.
Performance
By simply adding commodity resources, enterprises can increase performance with NoSQL databases. This enables organizations to continue to deliver reliably fast user experiences with a predictable return on investment for adding resources, again without the overhead associated with manual sharding.
High Availability
NoSQL databases are generally designed to ensure high availability and avoid the complexity that comes with a typical RDBMS architecture that relies on primary and secondary nodes. Some "distributed" NoSQL databases use a masterless architecture that automatically distributes data equally among multiple resources so that the application remains available for both read and write operations even when one node fails.
Global Availability
By automatically replicating data across multiple servers, data centers, or cloud resources, distributed NoSQL databases can minimize latency and ensure a consistent application experience wherever users are located. An added benefit is a significantly reduced database management burden from manual RDBMS configuration, freeing operations teams to focus on other business priorities.
Flexible Data Modeling
NoSQL offers the ability to implement flexible and fluid data models. Application developers can leverage the data types and query options that are the most natural fit to the specific application use case rather than those that fit the database schema. The result is a simpler interaction between the application and the database and faster, more agile development.
There is less management
Despite huge advancements in the DBMS domain over the years, relational databases rely heavily on database administrators, also known as DBAs. On the other hand, NoSQL databases are typically built from the ground up to eliminate needless management, with automated data distribution and simpler data models, resulting in lower administration and maintenance demands.
THE CAP THEOREM

CAP stands for Consistency, Availability, and Partition tolerance. It is very important to understand the limitations of NoSQL databases: NoSQL cannot provide consistency and high availability together. This was first expressed by Eric Brewer in the CAP theorem. The CAP theorem, or Brewer's theorem, states that we can only achieve at most two out of three guarantees in a database: Consistency, Availability, and Partition Tolerance.
Consistency
Consistency is all about data consistency, or in other words, making sure that within a distributed environment, every node of the database has exactly the same information at any given time. Imagine having two nodes with purchase orders from your ecommerce site. If there is a discrepancy between these nodes and the cluster, then the moment you query them, one of them may show you missing transactions, and if you were making calculations based on that data, the results might be wrong. So consistency is definitely an important characteristic of distributed databases; however, not all of them can provide it. What do they do instead? They go for something called "eventual consistency," meaning that while at some point the cluster may not be consistent, it eventually will be. This helps in making sure that you don't get the types of problems mentioned above, not only when reading the data but also when making calculations based on that data.
availability
Availability stands for "high availability," or in other words the ability of the database to always be available, no matter what happens. This is not the same as, and should not be confused with, "fault tolerance." A highly available database is usually one that has replicas in multiple geographical zones; that way, if there is a big network outage, it will still be accessible through one of its other replicas. For example, a system that's only installed and working on one of our servers can't be highly available, because the moment that server fails, we'll lose our database.
Partitioning
Partitioning stands for “partitioning tolerance” or in other words, having the ability to support
broken links within the cluster the database is distributed in. Think about a graph representing
your database cluster. You have multiple nodes sharing data and working wonderfully and
suddenly there is a problem and a section of that cluster fails. If the database is “partition
tolerant," it will still work despite the sudden lack of some of its nodes.
‘The CAP theorem categorizes systems into three categories:
[Figure 4.17: Visual representation of the CAP diagram, with CA, CP, and AP systems at the pairwise intersections of Consistency, Availability, and Partition tolerance.]
CP (Consistent and Partition Tolerant) database: A CP database delivers consistency and partition tolerance at the expense of availability. When a partition occurs between any two nodes, the system has to shut down the non-consistent node (i.e., make it unavailable) until the partition is resolved.

Partition refers to a communication break within a distributed system. Meaning, if a node cannot receive any messages from another node in the system, there is a partition between the two nodes. The partition could have been caused by a network failure, a server crash, or any other reason.
AP (Available and Partition Tolerant) database: An AP database delivers availability and partition tolerance at the expense of consistency. When a partition occurs, all nodes remain available, but those at the wrong end of a partition might return an older version of data than others. When the partition is resolved, the AP databases typically resync the nodes to repair all inconsistencies in the system.
CA (Consistent and Available) database: A CA database delivers consistency and availability in the absence of any network partition. Often a single-node DB server is categorized as a CA system. Single-node DB servers do not need to deal with partition tolerance and are considered CA systems.

In any networked shared-data system or distributed system, partition tolerance is a must. Network partitions and dropped messages are a fact of life and must be handled appropriately. Consequently, system designers must choose between consistency and availability.
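The trade-off can be illustrated with a toy simulation of two replicas whose replication link fails. All class, variable, and key names below are purely illustrative:

```python
# A toy model of two replicas and a replication link that can fail.
class Replica:
    def __init__(self):
        self.data = {}

    def write(self, key, value):
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)

node_a, node_b = Replica(), Replica()

def replicated_write(key, value, partitioned=False):
    node_a.write(key, value)
    if not partitioned:      # replication link is up: keep node_b in sync
        node_b.write(key, value)

replicated_write("order:42", "paid")                        # healthy cluster
replicated_write("order:42", "shipped", partitioned=True)   # link is down

# An AP system keeps serving both nodes, so reads may disagree until the
# partition heals and the nodes resync (eventual consistency). A CP system
# would instead refuse to serve node_b rather than return the stale value.
stale_read = node_a.read("order:42") != node_b.read("order:42")
```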
[Figure 4.18: Classification of different databases based on the CAP theorem. Consistency: every client sees the same data at the same time. Availability: the system continues to operate even in the presence of node failures. Partition tolerance: the system continues to operate in spite of network failures. Databases placed in the diagram include Microsoft SQL Server, Apache HBase, MongoDB, and Cassandra.]
Disadvantages of NoSQL

Various disadvantages of NoSQL are described below:
• No standardization rules
• Limited query capabilities
• RDBMS databases and tools are comparatively mature
• It does not offer traditional database capabilities, like consistency when multiple transactions are performed simultaneously
• When the volume of data increases, it is difficult to maintain unique values as keys
• Doesn't work as well with relational data
• The learning curve is stiff for new developers
• Open source options are not so popular for enterprises
DOCUMENT-BASED, KEY-VALUE STORES, COLUMN-BASED, AND GRAPH-BASED
SYSTEMS
Key-Value Database
Key-value databases are the simplest type of NoSQL database. Thanks to their simplicity, they are also the most scalable, allowing horizontal scaling of large amounts of data.

These NoSQL databases have a dictionary data structure that consists of a set of objects that represent fields of data. Each object is assigned a unique key. To retrieve data stored in a particular object, you need to use a specific key. In turn, you get the value (i.e., the data) assigned to the key. This value can be a number, a string, or even another set of key-value pairs.
[Figure 4.19: Key-value database]
Unlike traditional relational databases, key-value databases do not require a predefined structure. They offer more flexibility when storing data and have faster performance. Without having to rely on placeholders, key-value databases are a lighter solution, as they require fewer resources.

Such functionalities are suitable for large databases that deal with simple data. Therefore, they are commonly used for caching, storing and managing user sessions, ad servicing, and recommendations.
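The key-value interface can be sketched with a tiny in-memory store. The class and method names below are illustrative, loosely modeled on stores such as Redis:

```python
# A tiny in-memory key-value store (illustrative only).
class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        # The value is opaque to the store: a number, a string,
        # or even a nested mapping of key-value pairs.
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

cache = KVStore()
# Typical uses: caching a rendered page and storing a user session.
cache.put("session:abc", {"user": "indra", "theme": "dark"})
cache.put("page:/home", "<html>...</html>")
```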
Document Database
A document database is a type of NoSQL database that consists of sets of key-value pairs grouped into a document. These documents are basic units of data which you can also group into collections (databases) based on their functionality.
Figure 4.20: Document database
Being a NoSQL database, you can easily store data without implementing a schema. You can transfer the object model directly into a document using several different formats. The most commonly used are JSON, BSON, and XML.

Here is an example of a simple document in JSON format that consists of three key-value pairs:
{
  "ID" : "001",
  "Name" : "John",
  "Grade" : "Senior"
}
What's more, you can also use nested values in such formats, providing easier data distribution across multiple disks and enhanced performance. For instance, we can add a nested object (the "Contact" field here is illustrative) to the document above:

{
  "ID" : "001",
  "Name" : "John",
  "Grade" : "Senior",
  "Contact" : {
    "Email" : "john@example.com",
    "City" : "Kathmandu"
  }
}
Due to their structure, document databases are optimal for use cases that require flexibility and fast, continual development. For example, you can use them for managing user profiles, which differ according to the information provided. Their schema-less structure allows you to have different attributes and values. Examples of NoSQL document databases include MongoDB, CouchDB, Elasticsearch, and others.
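The schema-less, per-document flexibility can be sketched in Python as a list of dicts queried by field matching. The `find` helper below is a hypothetical stand-in for a document store's query API, not any real driver:

```python
# A "collection" of schema-free documents: each dict can carry its own
# fields, including nested ones.
students = [
    {"ID": "001", "Name": "John", "Grade": "Senior"},
    {"ID": "002", "Name": "Karen", "Grade": "Freshman",
     "Contact": {"City": "Kathmandu"}},   # an extra nested field is fine
]

def find(collection, **criteria):
    # Return every document whose top-level fields match all criteria.
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

seniors = find(students, Grade="Senior")
```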
Wide-Column Database

Wide-column stores are another type of NoSQL database. In them, data is stored and grouped into separately stored columns instead of rows. Such databases organize information into columns that function similarly to tables in relational databases.
Row-oriented
ID   Name   Grade     GPA
001  John   Senior    4.00
002  Karen  Freshman  3.67
003  Bill   Junior    3.33

Column-oriented
Name  | ID     Grade    | ID     GPA  | ID
John  | 001    Senior   | 001    4.00 | 001
Karen | 002    Freshman | 002    3.67 | 002
Bill  | 003    Junior   | 003    3.33 | 003
However, unlike traditional databases, wide-column databases are highly flexible. They have no fixed keys nor column names. Their schema-free characteristic allows variation of column names even within the same table, as well as adding columns in real time.

The most significant benefit of column-oriented databases is that you can store large amounts of data within a single column. This feature allows you to reduce disk resources and the time it takes to retrieve information. They are also excellent in situations where you have to spread data across multiple servers. Examples of popular wide-column databases include Apache Cassandra, HBase, and Cosmos DB.
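The benefit of reading only the needed columns can be sketched by storing the example table column-wise. This is a simplified illustration of the layout, not any particular engine's storage format:

```python
# The example table stored column-wise: each column is one contiguous
# array, and the i-th entries across the arrays form row i.
columns = {
    "ID":    ["001", "002", "003"],
    "Name":  ["John", "Karen", "Bill"],
    "Grade": ["Senior", "Freshman", "Junior"],
    "GPA":   [4.00, 3.67, 3.33],
}

def project(*wanted):
    # A query touching only some columns scans only those arrays;
    # the remaining columns are never read from disk.
    return list(zip(*(columns[c] for c in wanted)))

names_and_gpas = project("Name", "GPA")
```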
Graph Database
Graph databases use flexible graphical representation to manage data.
‘These graphs consist of two elements:
. Nodes (for storing data entities)
«Edges (for storing the relationship between entities)
These relationships between entities allow data in the store to be linked together directly and, in many cases, retrieved with one operation. Nodes and edges have defined properties, and by using these properties, you can query data easily.

[Figure 4.21: Graph database]
Since this type of data storing is quite specific, it is not a commonly used NoSQL database. However, there are certain use cases in which having graphical representations is the best solution. For instance, social networks often use graphs to store information about how their users are linked. OrientDB, RedisGraph, and Neo4j are just some examples of graph databases you should consider using.
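A property graph with index-free adjacency can be sketched with plain adjacency lists. The node and edge names below are made up for illustration:

```python
# A property graph: nodes carry properties, edges carry a relationship type.
nodes = {
    "alice": {"kind": "person"},
    "bob":   {"kind": "person"},
    "ktm":   {"kind": "city", "name": "Kathmandu"},
}
edges = {
    "alice": [("FRIEND_OF", "bob"), ("LIVES_IN", "ktm")],
    "bob":   [("LIVES_IN", "ktm")],
}

def neighbors(node, rel):
    # Index-free adjacency: follow the node's own edge list directly,
    # with no separate index lookup.
    return [dst for (r, dst) in edges.get(node, []) if r == rel]

friends_of_alice = neighbors("alice", "FRIEND_OF")
```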
BIG DATA

Big Data is a collection of data that is huge in volume, yet growing exponentially with time. It is data of such large size and complexity that none of the traditional data management tools can store or process it efficiently. Big data is also data, but with huge size.
Big data can be categorized as unstructured or structured. Structured data consists of information already managed by the organization in databases and spreadsheets; it is frequently numeric in nature. Unstructured data is information that is unorganized and does not fall into a predetermined model or format. It includes data gathered from social media sources, which help institutions gather information on customer needs.
Big data can be collected from publicly shared comments on social networks and websites, voluntarily gathered from personal electronics and apps, through questionnaires, product purchases, and electronic check-ins. The presence of sensors and other inputs in smart devices allows for data to be gathered across a broad spectrum of situations and circumstances.
Big data is most often stored in computer databases and is analyzed using software specifically
designed to handle large, complex data sets. Many software-as-a-service (SaaS) companies specialize in managing this type of complex data.
[Figure 4.22: Uses of Big Data, with labels including networks, cloud technology, analysis, volume, and research.]
Big Data Architecture
Big data architecture refers to the logical and physical structure that dictates how high volumes of data are ingested, processed, stored, managed, and accessed.

Most big data architectures include some or all of the following components:
[Figure 4.23: Big data architecture, showing data sources feeding data storage and real-time message ingestion, batch processing and stream processing, and an analytics and reporting layer.]
1. Data sources: All big data solutions start with one or more data sources. Examples include:
   • Application data stores, such as relational databases.
   • Static files produced by applications, such as web server log files.
   • Real-time data sources, such as IoT devices.
2. Data storage: Data for batch processing operations is typically stored in a
distributed file store that can hold high volumes of large files in various formats. This
kind of store is often called a data lake. Options for implementing this storage include
Azure Data Lake Store or blob containers in Azure Storage.
3. Batch processing: Because the data sets are so large, often a big data solution must
process data files using long-running batch jobs to filter, aggregate, and otherwise
prepare the data for analysis. Usually these jobs involve reading source files,
processing them, and writing the output to new files. Options include running U-SQL
jobs in Azure Data Lake Analytics, using Hive, Pig, or custom Map/Reduce jobs in an
HDInsight Hadoop cluster, or using Java, Scala, or Python programs in an
HDInsight Spark cluster.
4. Real-time message ingestion: If the solution includes real-time sources, the architecture must include a way to capture and store real-time messages for stream processing. This might be a simple data store, where incoming messages are dropped into a folder for processing. However, many solutions need a message ingestion store to act as a buffer for messages and to support scale-out processing, reliable delivery, and other message queuing semantics. Options include Azure Event Hubs, Azure IoT Hubs, and Kafka.
5. Stream processing: After capturing real-time messages, the solution must process them by filtering, aggregating, and otherwise preparing the data for analysis. The processed stream data is then written to an output sink. Azure Stream Analytics provides a managed stream processing service based on perpetually running SQL queries that operate on unbounded streams. You can also use open source Apache streaming technologies like Storm and Spark Streaming in an HDInsight cluster.

6. Analytical data store: Many big data solutions prepare data for analysis and then serve the processed data in a structured format that can be queried using analytical
tools. The analytical data store used to serve these queries can be a Kimball-style relational data warehouse, as seen in most traditional business intelligence (BI) solutions. Alternatively, the data could be presented through a low-latency NoSQL technology such as HBase, or an interactive Hive database that provides a metadata abstraction over data files in the distributed data store. Azure Synapse Analytics provides a managed service for large-scale, cloud-based data warehousing. HDInsight supports Interactive Hive, HBase, and Spark SQL, which can also be used to serve data for analysis.
7. Analysis and reporting: The goal of most big data solutions is to provide insights into the data through analysis and reporting. To empower users to analyze the data, the architecture may include a data modeling layer, such as a multidimensional OLAP cube or tabular data model in Azure Analysis Services. It might also support self-service BI, using the modeling and visualization technologies in Microsoft Power BI or Microsoft Excel. Analysis and reporting can also take the form of interactive data exploration by data scientists or data analysts. For these scenarios, many Azure services support analytical notebooks, such as Jupyter, enabling these users to leverage their existing skills with Python or R. For large-scale data exploration, you can use Microsoft R Server, either standalone or with Spark.

8. Orchestration: Most big data solutions consist of repeated data processing operations, encapsulated in workflows that transform source data, move data between multiple sources and sinks, load the processed data into an analytical data store, or push the results straight to a report or dashboard. To automate these workflows, you can use an orchestration technology such as Azure Data Factory or Apache Oozie and Sqoop.
Characteristics of Big Data
‘Three characteristics define Big Data: volume,
variety, and velocity. Together, these
characteristics define “Big Data”.
They have created the need for a new class of capabilities to
augment the way things are done today to provide a better line of sight and control over out
existing knowledge domains and the ability to act on them.
[Figure 4.24: The three V big data model. Volume: terabytes of data, billions of records. Velocity: real-time data. Variety: structured, unstructured, and semi-structured data.]
1. Volume of Data

The sheer volume of data being stored today is exploding. In the year 2000, about 800,000 petabytes (PB) of data were stored in the world, and much of the data being created today isn't analyzed at all, which is another problem to be considered. This number was expected to reach 35 zettabytes (ZB) by 2020. Twitter alone generates more than 7 terabytes (TB) of data every day, Facebook 10 TB, and some enterprises generate terabytes of data every hour of every day of the year. It's no longer unheard of for individual enterprises to have storage clusters holding petabytes of data.

The name Big Data itself is related to an enormous size. Big Data is a vast volume of data generated from many sources daily, such as business processes, machines, social media platforms, networks, human interactions, and many more. Facebook can generate approximately a billion messages, the "Like" button is recorded 4.5 billion times, and more than 360 million new posts are uploaded each day. Big data technologies can handle large amounts of data.

2. Variety
Big Data can be structured, unstructured, or semi-structured, collected from different sources. In the past, data was only collected from databases and spreadsheets, but these days data comes in an array of forms: PDFs, emails, audio, social media posts, photos, videos, etc.
3. Velocity
Velocity plays an important role compared to the others. Velocity refers to the speed at which the data is created in real time. It covers the rate of incoming data sets, the rate of change, and activity bursts. A primary aspect of Big Data is to provide demanded data rapidly.

Big data velocity deals with the speed at which data flows from sources like application logs, business processes, networks, social media sites, sensors, mobile devices, etc.
Advantages and disadvantages of Big Data
The advantages of Big Data analytics below show how retrieving information and collecting data is useful.
1. Voluminous Collection
A large amount of market data can be generated using Big Data analytics, and
various graphical and mathematical representations can be made for easy analysis. This massive information is further helpful for deriving market-based conclusions and predicting consumer behavior. However, new technology needs to ramp up in this field, as traditional software cannot process big data.
2. Future Insights are Mentionable Advantages of Big Data

With the predictions and statistical data obtained, a business can control its prospects. The growth and future problems of a business can be well handled using big data analysis. Using these datasets, a company can plan its launches and also create new products and services. Scientists foresee the same benefits of Big Data analytics in the healthcare industry and societal affairs.
3. Big Data Analytics is Cost Efficient
It is essential to understand the changing trends of the market, and to do that, a market analysis needs to be done. At a particular time, a specific trend is followed while others remain constant or decrease. A business is all dependent on actual demand, and if it can predict demand, it can have control over production. This can save costs incurred in storing raw materials and finished products.
4. Research will take less time
New software can easily analyze and interpret data sets, which helps make decisions and saves a lot of time. Also, new data can be generated automatically in bulk with updated information and trends. This can help businesses to stay stable in the long run.
5. Fraud Detection and Prevention
Big Data is capable of stopping fraudulent transactions, as in the case of banking
services. The frauds are getting smarter, and people need to know not to share
personal information; the automated software can detect fraudulent accounts and
cards. Based on recurring patterns and the spending behavior of consumers, it would also be possible to track what's usually missed manually.
Disadvantages of Big Data
Since all the information collected requires a lot of effort and resources, storing it before it can be examined needs vast space. Although the analysis of enormous information seems possible, some significant disadvantages of Big Data come to light in terms of space, cost, and user security.
1. Unstructured Data
The data collected can be arranged or present in the form of random information. More variations in data can create difficulty in processing results and generating solutions. If the information is broken or unstructured, many users can get neglected while deriving future outcomes or analyzing present scenarios.
2. Security Concerns are Most Dreaded Disadvantages of Big Data
For highly secured data or confidential information, highly secured networks are needed for its transfer and storage. Furthermore, with the increased global political tensions and complex situations between nations, leaked data can be used as an advantage by enemies, so keeping it secure is essential and requires building such a network.
3. Expensive
The process of data generation and its analysis is costly without any surety of favorable results. Mainly the top businesses can research this field, much like the space sector, where the wealthiest companies and individuals carry out research. The cost of setting up supercomputers is one of the leading disadvantages of Big Data.
ot SAL Systems and Bigdate 219°
curred 16 info
impo hs toe arranged forand an the information usuall
Oat Asal maintenance,
analytics. Even if the cost i,
ly residing on
‘ fessionals needed to o
me pote carry out renearch and i
highly paid and hard to find. ‘There in a seatcity of indigent sorare are
analyst job despite the increasing ge tulled for the data
resource of the new generation as to remain in the market; nse the
ourself updated with further information, ‘i i necemsary to keep
5, Hardware and Storage
The servers and hardware needed to store and run high-quality software are very
costly and hard to build. Also, the information is available in bulk with constant
changes, and processing it requires faster software and applications. And one cannot
forget the uncertainty involved in getting accurate results.
MAPREDUCE
Map-Reduce is a programming model designed for processing large volumes of data in
parallel by dividing the work into a set of independent tasks. Map-Reduce programs are
written in a particular style influenced by functional programming constructs, specifically
idioms for processing lists of data. This module explains the nature of this programming
model and how it can be used to write programs which run in the Hadoop environment.
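The "functional programming idioms for processing lists" mentioned above can be illustrated with Python's built-in `map` and `functools.reduce`, which operate on lists in exactly this style. This is a minimal illustration of the idioms only, not part of Hadoop itself:

```python
from functools import reduce

# map applies a function independently to every element of a list;
# reduce folds the mapped results into a single value.
numbers = [1, 2, 3, 4]
squares = list(map(lambda x: x * x, numbers))    # [1, 4, 9, 16]
total = reduce(lambda acc, x: acc + x, squares)  # 1 + 4 + 9 + 16 = 30
print(squares, total)
```

Because each application of the mapped function is independent, the map step can be parallelized freely; this is the property MapReduce exploits at cluster scale.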
MapReduce is a Hadoop framework used for writing applications that can process vast
amounts of data on large clusters. It can also be called a programming model in which we
can process large datasets across computer clusters. This application allows data to be
stored in a distributed form. It simplifies enormous volumes of data and large scale
computing.
There are two primary tasks in MapReduce: map and reduce. We perform the former task
before the latter. In the map job, we split the input dataset into chunks, and the map tasks
process these chunks in parallel. The outputs of the map tasks serve as inputs for the
reduce tasks. Reducers process the intermediate data from the maps into smaller tuples,
which leads to the final output of the framework.
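The two phases described above can be sketched in plain Python using the classic word-count example. The chunk splitting, the in-memory grouping (the "shuffle"), and the function names here are illustrative assumptions; in Hadoop these steps are distributed across cluster nodes rather than run in one process:

```python
from collections import defaultdict

def map_phase(chunk):
    """Map task: emit a (word, 1) pair for each word in an input chunk."""
    return [(word, 1) for word in chunk.split()]

def reduce_phase(word, counts):
    """Reduce task: sum the counts collected for one word."""
    return (word, sum(counts))

# Input dataset split into chunks (in Hadoop these would be blocks
# processed in parallel on different nodes).
chunks = ["big data big clusters", "data on large clusters"]

# Map: process each chunk independently.
intermediate = [pair for chunk in chunks for pair in map_phase(chunk)]

# Shuffle: group the intermediate pairs by key before reducing.
grouped = defaultdict(list)
for word, count in intermediate:
    grouped[word].append(count)

# Reduce: produce the final output tuples.
result = dict(reduce_phase(w, c) for w, c in grouped.items())
print(result)  # {'big': 2, 'data': 2, 'clusters': 2, 'on': 1, 'large': 1}
```

Note that each map call sees only its own chunk and each reduce call sees only one key's values, which is what lets the framework schedule both phases as independent parallel tasks.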
The MapReduce framework handles the scheduling and monitoring of tasks, and failed
tasks are re-executed by the framework. The framework can be used easily, even by
programmers with little expertise in distributed processing. MapReduce applications can
be implemented using various programming languages such as Java, Hive, Pig, Scala, and
Python.
How MapReduce in Hadoop works
An overview of MapReduce's architecture and phases will help us understand
how MapReduce in Hadoop works.
MapReduce architecture