Big Data Analytics
BRIEF CONTENTS
+ What's in Store?
+ Where do we Begin?
+ What is Big Data Analytics?
+ What Big Data Analytics isn't?
+ Why this Sudden Hype Around Big Data Analytics?
+ Classification of Analytics
+ Greatest Challenges that Prevent Businesses from Capitalizing on Big Data
+ Top Challenges Facing Big Data
+ Why is Big Data Analytics Important?
+ What Kind of Technologies are we Looking Toward to Help Meet the Challenges Posed by Big Data?
+ Data Science
+ Data Scientist ... Your New Best Friend!!!
+ Terminologies Used in Big Data Environments
+ In-Memory Analytics
+ In-Database Processing
+ Symmetric Multiprocessor System
+ Massively Parallel Processing
+ Difference between Parallel and Distributed Systems
+ Shared Nothing Architecture
+ Consistency, Availability, Partition Tolerance (CAP) Theorem Explained
+ Basically Available Soft State Eventual Consistency (BASE)
+ Few Top Analytics Tools
“If you do not know how to ask the right question, you discover nothing.”
- W. Edwards Deming
WHAT'S IN STORE?
This chapter is about understanding “Big Data Analytics.” We have taken you through the comprehension of the term Big Data: datasets which are voluminous, rich in variety, and call for processing at a great speed. Big data analytics is the process of examining these large datasets of big data to unearth hidden ...
... on the Amazon site, ... that does not escape your attention. Amazon has made a few suggestions (of books on similar topics or books by the same author) to you to help with your next or future purchases. You wonder how Amazon's recommendation engine was able to do this for you. Is it something that they do for all their customers?
Well, Amazon's recommendation engine churns out these sorts of good suggestions for customers like you, day in and day out. The company gathers all information about your past purchases together with what it knows about you, studies your buying patterns, and the buying patterns of customers like you ...
3.1 WHERE DO WE BEGIN?
... company. You ... routes and car... It is one of those busy days w... trucks are engaged in carrying ... help with a cargo delivery. The ... double the charge. You do not want ... opportunity. But which truck should you engage? ... that is the nearest but is facing the heaviest ... the second nearest one but that is occupied and will not be able to take more load. The ... need to analyze the truck load, the fuel consumption, the traffic on various routes, etc. before deciding which truck to select to pick up the new delivery.
Raw data is collected, classified, and organized. Associating it with adequate metadata and laying bare the context converts this data into meaningful information. It is then aggregated and summarized so that it becomes easy to consume for analysis. Gradual accumulation of such meaningful information builds a knowledge repository. This, in turn, helps with actionable insights which prove useful for decision making. Refer Figure 3.1.
Figure 3.1 Transformation of data to yield actionable insights.
Organizations have realized that they will not be able to ignore big data if they want to be competitive enough and make those timely decisions to make the most of fleeting opportunities. They will have to analyze big time and also take into consideration big data that makes it to the organization at an unprecedented level in terms of volume, velocity, and variety.
Figure 3.2 Types of unstructured data available for analysis (websites, billing (POS), ERP, CRM, RFID, social media).
Big data analytics is the process of examining big data to uncover patterns, unearth trends, and find ... to make faster and better decisions. Analytics begin with ...
3.2. WHAT IS BIG DATA ANALYTICS?
Big data analytics is:
1. Technology-enabled analytics: Quite a few data analytics and visualization tools are available in the market today from leading vendors such as IBM, Tableau, SAS, R Analytics, Statistica, World Programming Systems (WPS), etc. to help process and analyze your big data.
2. About gaining a meaningful, deeper, and richer insight into your business to steer it in the right direction, understanding the customer's demographics to cross-sell and up-sell to them, better leveraging the services of your vendors and suppliers, etc.
Author's experience: The other day I was pleasantly surprised to get a few recommendations via email from one of my frequently visited online retailers. They had recommended a clothing line from my favorite brand and also the color suggested was one to my liking. How did they arrive at this? In the recent past, I had been buying a clothing line of a particular brand and the color preference was pastel shades. They had it stored in their database and pulled it out while making recommendations to me.
3. About a competitive edge over your competitors by enabling you with findings that allow quicker and better decision-making.
4. A tight handshake between three communities: IT, business users, and data scientists. Refer Figure 3.3.
5. Working with datasets whose volume and variety exceed the current storage and processing capabilities and infrastructure of your enterprise.
6. About moving code to data. This makes perfect sense as the program for distributed processing is tiny (just a few KBs) compared to the data (Terabytes or Petabytes today and likely to be Exabytes or Zettabytes in the near future).
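The point about moving code to data can be checked with rough arithmetic. The sketch below compares how long it takes to ship a small program versus a large dataset over a network; the 1 Gbit/s link, the 10 KB program, and the 1 TB dataset are illustrative assumptions, not figures from the text.

```python
# Back-of-the-envelope comparison: ship the data to the code, or the code to the data?
# The link speed and both sizes are illustrative assumptions.

KB = 10**3
TB = 10**12
LINK_BITS_PER_SECOND = 10**9  # assumed 1 Gbit/s network link


def transfer_seconds(size_bytes, link_bits_per_second):
    """Time to push size_bytes over the link, ignoring latency and protocol overhead."""
    return size_bytes * 8 / link_bits_per_second


program_time = transfer_seconds(10 * KB, LINK_BITS_PER_SECOND)  # a 10 KB program
dataset_time = transfer_seconds(1 * TB, LINK_BITS_PER_SECOND)   # a 1 TB dataset

print(f"Moving the program: {program_time:.6f} s")
print(f"Moving the dataset: {dataset_time / 3600:.2f} h")
```

Under these assumptions the program arrives in a fraction of a millisecond while the dataset needs hours, which is why distributed platforms prefer to send the computation to the nodes that already hold the data.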
3.3. WHAT BIG DATA ANALYTICS ISN'T?
We have often asked participants of our learning programs what comes to mind when they hear the term “big data,” and the usual answer is “Volume.” But we now have a clear understanding that it is not only about volume; the variety and velocity too are very important factors.
Figure 3.3 What is big data analytics?
Refer Figure 3.4. Big data isn't just about technology: It is about understanding what the data is saying to us. It is about understanding relationships that we thought never existed between datasets. It is about patterns and trends waiting to be unveiled.
And of course, big data analytics is not here to replace our now very robust and powerful Relational Database Management System (RDBMS) or our traditional Data Warehouse. It is here to coexist with both RDBMS and Data Warehouse, leveraging the power of each to yield business value. Big data analytics is not a “one-size fits all” traditional RDBMS built on shared disk and memory.
Figure 3.4 What big data analytics isn't?
And before we think it is only used by huge online companies like Google or Amazon, let us clear the myth. It is for any business and any industry that needs actionable insights out of their data (both internal and external).
3.4 WHY THIS SUDDEN HYPE AROUND BIG DATA ANALYTICS?
If we go by the industry buzz, everywhere there seems to be talk about big data and big data analytics. Why this sudden hype? Refer Figure 3.5.
Let us put it down to three foremost reasons:
1. ... reaching nearly 45 ZB by 2020. In 2010, about 1.2 trillion Gigabytes of data was generated. This amount doubled to 2.4 trillion Gigabytes in 2012 and to about 5 trillion Gigabytes in the year 2014. The volume of business data worldwide is expected to double every 1.2 years. Wal-Mart, the world retailer, processes one million customer transactions per hour. 500 million “tweets” are posted by Twitter users every day. 2.7 billion “Likes” and comments are posted by Facebook users in a day. Every day 2.5 quintillion bytes of data is created, with 90% of the world's data created in the past 2 years alone.
Source:
(a) http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html
(b) http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html
2. Cost per gigabyte of storage has hugely dropped.
3. There are an overwhelming number of user-friendly analytics tools available in the market today.
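The doubling figures quoted in the first reason can be turned into annual growth rates with one formula: a quantity that doubles every T years grows by 2^(1/T) - 1 per year. The sketch below applies it to the two doubling times mentioned above (roughly 2 years for the 2010 to 2014 volumes, and 1.2 years for business data). The function names are illustrative, and a constant doubling time is an assumption.

```python
def annual_growth_rate(doubling_time_years):
    """Annual growth rate implied by a constant doubling time: 2**(1/T) - 1."""
    return 2 ** (1 / doubling_time_years) - 1


def projected_volume(start_volume, years, doubling_time_years):
    """Volume after `years`, assuming the doubling time stays constant."""
    return start_volume * 2 ** (years / doubling_time_years)


rate_overall = annual_growth_rate(2.0)    # 1.2 -> 2.4 trillion GB between 2010 and 2012
rate_business = annual_growth_rate(1.2)   # business data doubling every 1.2 years
volume_2014 = projected_volume(1.2, 4, 2.0)  # trillion GB, starting from 2010

print(f"Doubling every 2 years   = {rate_overall:.1%} per year")
print(f"Doubling every 1.2 years = {rate_business:.1%} per year")
print(f"Projected 2014 volume    = {volume_2014:.1f} trillion GB")
```

The projection of 4.8 trillion GB for 2014 is in line with the “about 5 trillion Gigabytes” quoted in the text.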
3.5 CLASSIFICATION OF ANALYTICS
There are basically two schools of thought:
1. Those that classify analytics into basic, operationalized, advanced, and monetized.
2. Those that classify analytics into analytics 1.0, analytics 2.0, and analytics 3.0.
Figure 3.5 What big data entails? (More data produced, more data stored, more data analyzed, better predictions, growth of analysis.)
3.5.1 First School of Thought
1. Basic analytics: This primarily is slicing and dicing of data to help with basic business insights. This is about reporting on historical data, basic visualizations, etc.
2. Operationalized analytics: It is operationalized analytics if it gets woven into the enterprise's business processes.
3. Advanced analytics: This largely is about forecasting for the future by way of predictive and prescriptive modeling.
4. Monetized analytics: This is analytics in use to derive direct business revenue.
3.5.2. Second School of Thought
Let us take a closer look at analytics 1.0, analytics 2.0, and analytics 3.0. Refer Table 3.1.
Table 3.1 Analytics 1.0, 2.0, and 3.0

Analytics 1.0
Era: mid 1950s to 2009
Descriptive statistics (report on events, occurrences, etc. of the past)
Key questions asked: What happened? Why did it happen?
Data from legacy systems, ERP, CRM, and 3rd party applications.
Small and structured data sources. Data stored in enterprise data warehouses or data marts.
Data was internally sourced.
Relational databases.

Analytics 2.0
Era: 2005 to 2012
Descriptive statistics + predictive statistics (use data from the past to make predictions for the future)
Key questions asked: What will happen? Why will it happen?
Big data.
Big data is being taken up seriously. Data is mainly unstructured, arriving at a much higher pace. This fast flow of data entailed that the influx of big volume data had to be stored and processed rapidly, often on massive parallel servers running Hadoop.
Data was often externally sourced.
Database appliances, Hadoop clusters, SQL to Hadoop environments, etc.

Analytics 3.0
Era: 2012 to present
Descriptive + predictive + prescriptive statistics (use data from the past to make prophecies for the future and at the same time make recommendations to leverage the situation to one's advantage)
Key questions asked: What will happen? When will it happen? Why will it happen? What should be the action taken to take advantage of what will happen?
A blend of big data and data from legacy systems, ERP, CRM, and 3rd party applications.
A blend of big data and traditional analytics to yield insights and offerings with speed and impact.
Data is both being internally and externally sourced.
In memory analytics, in database processing, agile analytical methods, machine learning techniques, etc.
Figure 3.6 Analytics 1.0, 2.0, and 3.0. (Labels in the figure: Why did it happen? What will happen? How can we make it happen? Hindsight, Insight, Foresight.)
Figure 3.6 shows the subtle growth of analytics from Descriptive > Diagnostic > Predictive > Prescriptive analytics.
3.6 GREATEST CHALLENGES THAT PREVENT BUSINESSES FROM CAPITALIZING ON BIG DATA
1. Obtaining executive sponsorships for investments in big data and its related activities (such as training, etc.).
2. Getting the business units to share information across organizational silos.
3. Finding the right skills (business analysts and data scientists) that can manage large amounts of structured, semi-structured, and unstructured data and create insights from it.
4. Determining the approach to scale rapidly and elastically. In other words, the need to address the storage and processing of large volume, velocity, and variety of big data.
5. Deciding whether to use structured or unstructured, internal or external data to make business decisions.
6. Choosing the optimal way to report findings and analysis of big data (visual presentation and analytics) for the presentations to make the most sense.
7. Determining what to do with the insights created from big data.
3.7. TOP CHALLENGES FACING BIG DATA
1. Scale: Storage (RDBMS (Relational Database Management System) or NoSQL (Not only SQL)) is one major concern that needs to be addressed to handle the need for scaling rapidly and elastically. The need of the hour is a storage that can best withstand the onslaught of large volume, velocity, and variety of big data. Should you scale vertically or should you scale horizontally?
2. Security: Most of the NoSQL big data platforms have poor security mechanisms (lack of proper authentication and authorization mechanisms) when it comes to safeguarding big data. A spot that cannot be ignored given that big data carries credit card information, personal information, and other sensitive data.
3. Schema: Rigid schemas have no place. We want the technology to be able to fit our big data and not the other way around. The need of the hour is dynamic schema. Static (pre-defined) schemas are passé.
4. Continuous availability: The big question here is how to provide 24/7 support because almost all RDBMS and NoSQL big data platforms have a certain amount of downtime built in.
5. Consistency: Should one opt for consistency or eventual consistency?
6. Partition tolerant: How to build partition tolerant systems that can take care of both hardware and software failures?
7. Data quality: How to maintain data quality: data accuracy, completeness, timeliness, etc.? Do we have appropriate metadata in place?
3.8 WHY IS BIG DATA ANALYTICS IMPORTANT?
Let us study the various approaches to analysis of data and what it leads to.
1. Reactive - Business Intelligence: What does Business Intelligence (BI) help us with? It allows the businesses to make faster and better decisions by providing the right information to the right person at the right time in the right format. It is about analysis of the past or historical data and then displaying the findings of the analysis or reports in the form of enterprise dashboards, alerts, notifications, etc. It has support for both pre-specified reports as well as ad hoc querying.
2. Reactive - Big Data Analytics: Here the analysis is done on huge datasets but the approach is still reactive as it is still based on static data.
3. Proactive - Analytics: This is to support futuristic decision making by the use of data mining, predictive modeling, text mining, and statistical analysis. This analysis is not on big data as it still uses the traditional database management practices on big data and therefore has severe limitations on the storage capacity and the processing capability.
4. Proactive - Big Data Analytics: This is sieving through terabytes, petabytes, exabytes of information to filter out the relevant data to analyze. This also includes high performance analytics to gain rapid insights from big data and the ability to solve complex problems using more data.
3.9 WHAT KIND OF TECHNOLOGIES ARE WE LOOKING TOWARD TO
HELP MEET THE CHALLENGES POSED BY BIG DATA?
1. The first requirement is of cheap and abundant storage.
2. We need faster processors to help with quicker processing of big data.
3. Affordable open-source, distributed big data platforms, such as Hadoop.
4. Parallel processing, clustering, virtualization, large grid environments (to distribute processing to a number of machines), high connectivity, and high throughputs rather than low latency.
5. Cloud computing and other flexible resource allocation arrangements.
3.10 DATA SCIENCE
Data science is the science of extracting knowledge from data. In other words, it is a science of drawing out hidden patterns amongst data using statistical and mathematical techniques. It employs techniques and theories drawn from many fields from the broad areas of mathematics, statistics, and information technology, including machine learning, data engineering, probability models, statistical learning, pattern recognition and learning, etc.
Today we have a plethora of use-cases for “Data Science” that are already exploring massive datasets (Peta to Zetta bytes of information) for weather predictions, oil drillings, seismic activities, financial frauds, terrorist network and activities, global economic impacts, sensor logs, social media analytics, and so many others beyond standard retail and manufacturing use-cases such as customer churn, market basket analytics (associative mining), collaborative filtering, regression analysis, etc. Data science is multi-disciplinary. Refer to Figure 3.7.
3.10.1 Business Acumen Skills
A data scientist should have the prowess to counter the pressures of business. A firm understanding of the business domain further helps. The following is a list of traits that need to be honed to play the role of a data scientist:
1. Understanding of domain.
2. Business strategy.
3. Problem solving.
4. Communication.
5. Presentation.
6. Inquisitiveness.
Figure 3.7 Data scientist (business acumen, technology expertise).
3.10.2 Technology Expertise
It goes without saying that technology expertise will come in handy if one is to play the role of a data scientist. Cited below are a few skills required as far as technical expertise is concerned.
1. Good database knowledge such as RDBMS.
2. Good NoSQL database knowledge such as MongoDB, Cassandra, HBase, etc.
3. Programming languages such as Java, Python, C++, etc.
4. Open-source tools such as Hadoop.
5. Data warehousing.
6. Data mining.
7. Visualization such as Tableau, Flare, Google visualization APIs, etc.
3.10.3 Mathematics Expertise
Since the core job of the data scientist will require him to comprehend data, interpret it, make sense of it, and analyze it, he/she will have to dabble in learning algorithms. The following are the key skills that a data scientist will have to have in his arsenal:
1. Mathematics.
2. Statistics.
3. Artificial Intelligence (AI).
4. Algorithms.
5. Machine learning.
6. Pattern recognition.
7. Natural Language Processing.
To sum it up, the data science process is:
1. Collecting raw data from multiple disparate data sources.
3. Integrating the data and preparing clean datasets.
4. Engaging in explorative data analysis using models and algorithms.
5. Preparing presentations using data visualizations (commonly called Infographics, or BizAnalytics, or VizAnalytics, etc.).
6. Communicating the findings to all stakeholders.
7. Making faster and better decisions.
3.11 DATA SCIENTIST...YOUR NEW BEST FRIEND!
A data scientist is the best friend that you can gift yourself. Refer Figure 3.8 to learn about what the data scientist can help you with.
3.11.1 Responsibilities of a Data Scientist
Refer Figure 3.8.
1. Data Management: A data scientist employs several approaches to develop the relevant datasets for analysis. Raw data is just “RAW,” unsuitable for analysis. The data scientist works on it to prepare it to reflect the relationships and contexts. This data then becomes useful for processing and further analysis.
2. Analytical Techniques: Depending on the business questions which we are trying to find answers to and the type of data available at hand, the data scientist employs a blend of analytical techniques to develop models and algorithms to understand the data, interpret relationships, spot trends, and unveil patterns.
3. Business Analysis: A data scientist is a business analyst who distinguishes cool facts from insights and is able to apply his business acumen and domain knowledge to see the results in the business context. He is a good presenter and communicator who is able to communicate the results of his findings in a language that is understood by the different business stakeholders.
Figure 3.8 Data scientist: your new best friend!!!
3.12 TERMINOLOGIES USED IN BIG DATA ENVIRONMENTS
In order to get a good handle on the big data environment, let us get familiar with a few key terminologies in this arena.
3.12.1 In-Memory Analytics
Data access from non-volatile storage such as hard disk is a slow process. The more the data is required to be fetched from hard disk or secondary storage, the slower the process gets. One way to combat this challenge is to pre-process and store data (cubes, aggregate tables, query sets, etc.) so that the CPU has to fetch a small subset of records. But this requires thinking in advance as to what data will be required for analysis. If there is a need for different or more data, it is back to the initial process of pre-computing and storing data or fetching it from secondary storage.
This problem has been addressed using in-memory analytics. Here all the relevant data is stored in Random Access Memory (RAM) or primary storage, thus eliminating the need to access the data from hard disk. The advantage is faster access, rapid deployment, better insights, and minimal IT involvement.
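The trade-off between pre-computed aggregates and in-memory analytics can be sketched in a few lines. The example below is only an illustration with made-up sales records and hypothetical function names: the pre-built “cube” answers the one question it was designed for, while keeping the records in RAM lets any new question be answered by a direct scan.

```python
from collections import defaultdict

# Hypothetical sales records held entirely in RAM: (region, product, amount).
SALES = [
    ("north", "books", 120.0),
    ("north", "toys", 80.0),
    ("south", "books", 200.0),
    ("south", "toys", 50.0),
    ("south", "books", 30.0),
]


def build_cube(records, key_index):
    """Pre-compute totals for ONE dimension chosen in advance (a tiny aggregate table)."""
    cube = defaultdict(float)
    for record in records:
        cube[record[key_index]] += record[2]
    return dict(cube)


def in_memory_total(records, predicate):
    """Answer an ad hoc question by scanning the records already in RAM."""
    return sum(amount for region, product, amount in records
               if predicate(region, product))


# The cube was planned for "totals by region" and answers that question instantly.
region_cube = build_cube(SALES, key_index=0)

# A question nobody planned for: the cube cannot answer it, the in-memory scan can.
south_books = in_memory_total(
    SALES, lambda region, product: region == "south" and product == "books")

print(region_cube)
print(south_books)
```

With pre-aggregation, the unplanned question would mean rebuilding the cube or going back to secondary storage; with the data in memory it is just another scan.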
3.12.2 In-Database Processing
In-database processing is also called in-database analytics. It works by fusing data warehouses with analytical systems. Typically the data from various enterprise On Line Transaction Processing (OLTP) systems, after cleaning up (de-duplication, scrubbing, etc.) through the process of ETL, is stored in the Enterprise Data Warehouse (EDW) or data marts. The huge datasets are then exported to analytical programs for complex and extensive computations. With in-database processing, the database program itself can run the computations, eliminating the need for export and thereby saving on time. Leading database vendors are offering this feature to large businesses.
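The difference between exporting data for analysis and computing inside the database can be illustrated with Python's built-in sqlite3 module. This is a minimal sketch on an in-memory table with made-up rows, not a description of any vendor's in-database analytics feature.

```python
import sqlite3
from statistics import mean

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("north", 80.0), ("south", 200.0), ("south", 50.0)],
)

# Approach 1: export every row, then compute in the analytical program.
rows = conn.execute("SELECT region, amount FROM sales").fetchall()
exported = {}
for region, amount in rows:
    exported.setdefault(region, []).append(amount)
export_avg = {region: mean(values) for region, values in exported.items()}

# Approach 2: in-database processing, only the small result leaves the database.
in_db_avg = dict(
    conn.execute("SELECT region, AVG(amount) FROM sales GROUP BY region"))

conn.close()
print(export_avg)
print(in_db_avg)
```

Both approaches give the same averages, but the first moves all four rows out of the database while the second moves only two result rows. On warehouse-sized tables that difference is the time saved by avoiding the export.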
3.12.3 Symmetric Multiprocessor System (SMP)
In SMP, there is a single common main memory that is shared by two or more identical processors. The processors have full access to all I/O devices and are controlled by a single operating system instance. SMP are tightly coupled multiprocessor systems. Each processor has its own high-speed memory, called cache memory, and the processors are connected using a system bus. Refer Figure 3.9.
3.12.4 Massively Parallel Processing
Massively Parallel Processing (MPP) refers to the coordinated processing of programs by a number of processors working in parallel. The processors each have their own operating systems and dedicated memory. They work on different parts of the same program. The MPP processors communicate using some sort of messaging interface. The MPP systems are more difficult to program as the application must be divided in such a way that all the executing segments can communicate with each other. MPP is different from Symmetric Multiprocessing (SMP) in that SMP works with the processors sharing the same operating system and same memory. SMP is also referred to as tightly-coupled multiprocessing.
3.12.5 Difference Between Parallel and Distributed Systems
The next two terms that we discuss are parallel and distributed systems.
As is evident from Figure 3.10, a parallel database system is a tightly coupled system. The processors co-operate for query processing. The user is unaware of the parallelism since he/she has no access to a specific processor of the system. Either the processors have access to a common memory (Refer Figure 3.11) or make use of message passing for communication.
Figure 3.11 Parallel system.
Distributed database systems are known to be loosely coupled and are composed of individual machines. Refer Figure 3.12. Each of the machines can run their individual application and serve their own respective users. The data is usually distributed across several machines, thereby necessitating quite a number of machines to be accessed to answer a user query. Refer Figure 3.13.
3.12.6 Shared Nothing Architecture
Let us look at the three most common types of architecture for multiprocessor high transaction rate systems. They are:
1. Shared Memory (SM).
2. Shared Disk (SD).
3. Shared Nothing (SN).
In shared memory architecture, a common central memory is shared by multiple processors. In shared disk architecture, multiple processors share a common collection of disks while having their own private memory. In shared nothing architecture, neither memory nor disk is shared among multiple processors.
Figure 3.12 Distributed system.
Figure 3.13 Distributed system.
3.12.6.1 Advantages of a “Shared Nothing Architecture”
1. Fault Isolation: A “Shared Nothing Architecture” provides the benefit of isolating fault. A fault in a single node is contained and confined to that node exclusively and exposed only through messages (or lack of it).
2. Scalability: Assume that the disk is a shared resource. It implies that the controller and the disk bandwidth are also shared. Synchronization will have to be implemented to maintain a consistent shared state. This would mean that different nodes will have to take turns to access the critical data. This imposes a limit on how many nodes can be added to the distributed shared disk system, thus compromising on scalability.
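Both advantages can be seen in a toy model of a shared nothing cluster. In the sketch below, each node owns a private partition and is reachable only through messages; a coordinator scatters a query and gathers the partial answers. The Node class, the hash-based placement, and the "count_by_key" message are illustrative assumptions, not part of any particular product.

```python
import zlib
from collections import Counter


class Node:
    """A shared-nothing node: private storage, reachable only through messages."""

    def __init__(self, name):
        self.name = name
        self._rows = []     # private partition, never read directly by other nodes
        self.alive = True

    def store(self, row):
        self._rows.append(row)

    def handle(self, message):
        """Reply to a message, or return None (lack of a reply) if the node has failed."""
        if not self.alive:
            return None
        if message == "count_by_key":
            return Counter(key for key, _ in self._rows)
        raise ValueError(f"unknown message: {message}")


def owner(key, nodes):
    """Pick the node that owns a key using a deterministic hash (assumed placement rule)."""
    return nodes[zlib.crc32(key.encode("utf-8")) % len(nodes)]


def scatter_gather(nodes, message):
    """Send the message to every node and merge whatever partial results come back."""
    total, failed = Counter(), []
    for node in nodes:
        reply = node.handle(message)
        if reply is None:
            failed.append(node.name)
        else:
            total.update(reply)
    return total, failed


nodes = [Node("n0"), Node("n1"), Node("n2")]
events = [("alice", 1), ("bob", 2), ("alice", 3), ("carol", 4), ("bob", 5), ("alice", 6)]
for key, value in events:
    owner(key, nodes).store((key, value))

totals, failed = scatter_gather(nodes, "count_by_key")

# Fault isolation: one node fails, the others keep answering for their own partitions.
victim = owner("carol", nodes)
victim.alive = False
partial, failed_after = scatter_gather(nodes, "count_by_key")
print(totals, failed)
print(partial, failed_after)
```

Because no node waits on a shared disk or shared memory, adding a node only changes the placement rule; and when a node fails, the fault shows up solely as a missing reply.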
3.12.7 CAP Theorem Explained
The CAP theorem is also called the Brewer's Theorem. It states that in a distributed computing environment (a collection of interconnected nodes that share data), it is impossible to provide all three of the following guarantees at the same time. Refer Figure 3.14. At best you can have two of the following three - one must be sacrificed.
1. Consistency
2. Availability
3. Partition tolerance
3.12.7.1 CAP Theorem
Let us spend some time understanding the earlier mentioned terms.
1. Consistency implies that every read fetches the last write.
2. Availability implies that reads and writes always succeed. In other words, each non-failing node will return a response in a reasonable amount of time.
3. Partition tolerance implies that the system will continue to function when network partition occurs.
Let us try to understand this using a real-life situation.
You work for a training institute, “XYZ.” The institute has 50 instructors including you. All of you report to a training coordinator. At the end of the month, all the instructors together with the training coordinator peruse through the training requests received from the various corporate houses and prepare a training schedule for each instructor. These training schedules (one for each instructor) are shared with “Amey,” the office administrator. Each morning, you either call the office helpdesk (essentially Amey's desk) or check in-person with Amey for your schedule for the day. In case a training request has been cancelled or updated (updates can be in the form of change in course, change in duration, change of the training timings, etc.), Amey is informed of the updates and the schedules are subsequently updated by him.
Things were good until now. Few corporate houses were your clients and the schedules of each instructor could be smoothly managed without any major hiccups. But your training institute has been implementing promotion campaigns to expand the business. As a result of advertising in the media and word of mouth publicity by your existing clients, you suddenly see an upsurge in training requests from existing and new clients. In consequence of that, more instructors have been recruited. A few trainers/consultants have also been roped in from other training institutes to help tackle the load.
Figure 3.14 Brewer's CAP.
Now when you go to Amey to check your schedule or call in at the helpdesk, you are prepared for a wait in the queue. Looking at the current state of affairs, the training coordinator decides to recruit another office administrator, “Joey.” The helpdesk number will remain the same and will be shared by both the office administrators.
You: Hey Amey!
You: I think I am scheduled to anchor a training at 3:00 pm today. Can I please have the details?
Amey: Sure! Just a minute.
Amey browses through the file where he maintains the schedules. He does not see any training scheduled against your name at 3:00 pm and responds back, “You do not have any training to conduct at 3:00 pm.”
You: How is that possible? The training coordinator called up yesterday evening to inform of the same and said he has updated the office administrators of the same.
Amey: Oh! Did he say which office administrator? It could have been Joey. Please check with Joey.
Amey: Hey Joey! Please check the schedule for Paul here... Do you see something scheduled at 3:00 pm today?
Joey: Sure enough! He is anchoring the training for client “Z” today at 3:00 pm.
A clear case of inconsistent system!!! The updates in the schedule were shared by the training coordinator with Joey and you were checking for your schedule with Amey.
You share this incident with the training coordinator and that gets him thinking. The issue has to be addressed immediately, otherwise it will be difficult to avoid a chaotic situation. He comes up with a plan and shares it with both the office administrators the following day.
Training Coordinator: Folks, each time that either an instructor or I call any one of you to update a schedule, make sure that both of you update it in your respective files. This way the instructor will always get the most recent and consistent information irrespective of whom amongst the two of you he/she speaks to.
Joey: But that could mean a delay in answering either a phone call or sharing the schedule with the instructor waiting in queue.
Training Coordinator: Yes, I understand. But there is no way that we can give incorrect information.
Amey: There is this other problem as well. Suppose one of us is on leave on a particular day. That would mean that we cannot take any update related calls as we will not be able to simultaneously update both the files (my file and Joey's).
Training Coordinator: Well, good point! That's the availability problem!!! But I have thought about that as well. Here is the plan:
1. If one of you receives the update call (any updates to any schedule), ensure that you inform the other person if he is available.
2. In case the other person is not available, ensure that you inform him of all the updates to all schedules via email. It is a must!!!
3. When the other person resumes duty, the first thing he will do is update his file with all the updates to all schedules that he has received via email.
Wow!!! That is sure a Consistent and Available system!!!
Looks like everything is in control. Wait a minute! There is a tiff that has taken place between the office administrators. The two are pretty much available but are not talking to each other which, in other words, means that the updates are not flowing from one to the other. We have to be partition tolerant!!! As a training coordinator, you instruct them saying that none of you are taking any calls requesting for schedules or updates to schedules till you patch up. This implies that the system is partition tolerant but not available at that time.
In summary, one can at most decide to go with two of the three.
1. Consistent: The instructors or the training coordinator, once they have updated information with you, will always get the most updated information when they call subsequently.
2. Availability: The instructors or the training coordinators will always get the schedule if any or both of the office administrators have reported to work.
3. Partition Tolerance: Work will go on as usual even if there is communication loss between the office administrators owing to a spat or a tiff.
When to choose consistency over availability and vice-versa...
1. Choose availability over consistency when your business requirements allow some flexibility around
when the data in the system synchronizes.
2. Choose consistency over availability when your business requirements demand atomic reads and
writes.
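The choice between consistency and availability during a partition can be replayed with the training institute story. The following is a simplified sketch, not a real database: the Office class, its two modes ("CP" and "AP"), and the Unavailable exception are all hypothetical names introduced for illustration.

```python
class Unavailable(Exception):
    """Raised when an administrator refuses a call rather than risk inconsistency."""


class Office:
    """Two replicas (Amey's file and Joey's file) behind one helpdesk number."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.files = {"Amey": {}, "Joey": {}}
        self.partitioned = False  # True while the two administrators are not talking

    def _check(self):
        # A CP system gives up availability while the partition lasts.
        if self.partitioned and self.mode == "CP":
            raise Unavailable("administrators are not talking; calls are refused")

    def update(self, admin, instructor, slot):
        self._check()
        self.files[admin][instructor] = slot
        if not self.partitioned:
            # Without a partition, the update reaches both files.
            for other in self.files:
                self.files[other][instructor] = slot

    def read(self, admin, instructor):
        self._check()
        return self.files[admin].get(instructor)


def scenario(mode):
    """Replay the story: an update reaches only Joey while the two are not talking."""
    office = Office(mode)
    office.update("Amey", "Paul", "no training")
    office.partitioned = True
    try:
        office.update("Joey", "Paul", "client Z at 3:00 pm")
        return office.read("Amey", "Paul")
    except Unavailable:
        return "unavailable"


cp_answer = scenario("CP")
ap_answer = scenario("AP")
print("CP:", cp_answer)
print("AP:", ap_answer)
```

In CP mode the caller gets no answer until the partition heals, which suits requirements that demand atomic reads and writes. In AP mode the caller always gets an answer, but Amey's file still says “no training,” which is acceptable only when the business can tolerate a delay before the data synchronizes.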
Examples of databases that follow one of the possible three combinations:
1. Availability and Partition Tolerance (AP)
2. Consistency and Partition Tolerance (CP)
3. Consistency and Availability (CA)
Refer Figure 3.15 to get a glimpse of databases that adhere to two of the three characteristics of CAP theorem.
Figure 3.15 Databases and CAP.
A (Availability): Is available/accessible/operational at all times.
C (Consistency): Commits are atomic across the entire distributed system.
P (Partition tolerance): System responds incorrectly only when there is a total network failure.
CA: Traditional RDBMS (PostgreSQL, MySQL, etc.).
AP: Riak, Cassandra, CouchDB, Dynamo-like systems.
CP: HBase, MongoDB, Redis, MemcacheDB, BigTable-like systems.
3.13 BASICALLY AVAILABLE SOFT STATE EVENTUAL CONSISTENCY (BASE)
A few basic questions to start with:
1. Where is it used?
In distributed computing.
2. Why is it used?
To achieve high availability.
3. How is it achieved?
Assume a given data item. If no new updates are made to this given data item for a stipulated period of time, eventually all accesses to this data item will return the updated value. In other words, if no new updates are made to a given data item for a stipulated period of time, all updates that were made in the past and not yet applied to this data item and its several replicas will percolate to them, so that the item stays as current/recent as is possible.
4. What is replica convergence?
A system that has achieved eventual consistency is said to have converged or achieved replica convergence.
5. Conflict resolution: How is the conflict resolved?
(a) Read repair: If the read leads to discrepancy or inconsistency, a correction is initiated. It slows down the read operation.
(b) Write repair: If the write leads to discrepancy or inconsistency, a correction is initiated. This will cause the write operation to slow down.
(c) Asynchronous repair: Here, the correction is not part of a read or write operation.
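Read repair and replica convergence can be sketched with three in-memory replicas. The example assumes that every write carries a version number and that the highest version wins; that rule, and the class and function names, are assumptions made for illustration rather than the mechanism of any specific datastore.

```python
class VersionedReplica:
    """One copy of the data; each key maps to a (version, value) pair."""

    def __init__(self):
        self.data = {}

    def write(self, key, version, value):
        current_version, _ = self.data.get(key, (0, None))
        if version > current_version:  # assumed rule: the highest version wins
            self.data[key] = (version, value)

    def read(self, key):
        return self.data.get(key, (0, None))


def read_with_repair(replicas, key):
    """Read every replica, return the newest value, and correct any stale copies."""
    replies = [replica.read(key) for replica in replicas]
    newest = max(replies, key=lambda reply: reply[0])
    for replica, reply in zip(replicas, replies):
        if reply[0] < newest[0]:
            replica.write(key, *newest)  # the extra work that slows the read down
    return newest[1]


def converged(replicas, key):
    """Replica convergence: every replica holds the same version and value."""
    return len({replica.read(key) for replica in replicas}) == 1


replicas = [VersionedReplica() for _ in range(3)]
for replica in replicas:
    replica.write("paul", 1, "no training")
replicas[0].write("paul", 2, "client Z at 3:00 pm")  # the update reached one replica only

before = converged(replicas, "paul")
value = read_with_repair(replicas, "paul")
after = converged(replicas, "paul")
print(before, value, after)
```

The replicas disagree before the read and agree after it. An asynchronous repair would run the same correction loop in the background instead of inside the read.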
3.14 FEW TOP ANALYTICS TOOLS
There is no dearth of analytical tools in the market. Please find below our list of a few top analytics tools. We have also provided the links after each tool for you to explore more.
1. MS Excel
https://support.office.microsoft.com/en-in/article/Whats-new-in-Excel-2013-1cbc42cd-bfaf-43d7-9031-5688ef1392fd?CorrelationId=142171cc-191f-47de-8a55-08a5f2c9c739&ui=en-US&rs=en-IN&ad=IN
2. SAS
http://www.sas.com/en_us/home.html
3. IBM SPSS Modeler
http://www-01.ibm.com/software/analytics/spss/products/modeler/
4. Statistica
http://www.statsoft.com/
5. Salford Systems
6. WPS (World Programming Systems)
http://www.teamwpc.co.uk/products/wps
3.14.1 Open Source Analytics Tools
Let us look at a couple of open source analytics tools. We have also provided the links after each tool for you to explore more.
1. R analytics
http://www.revolutionanalytics.com/
2. Weka
http://www.cs.waikato.ac.nz/ml/weka/
REMIND ME
+ Quite a few data analytics and visualization tools are available in the market today from leading vendors such as IBM, Tableau, SAS, R Analytics, Statistica, World Programming Systems (WPS), etc. to help process and analyze your big data.
+ Big data analytics is about a tight handshake between three communities: IT, business users, and data scientists.
+ Data science is the science of extracting knowledge from data.
+ The CAP theorem is also called the Brewer's Theorem. It states that in a distributed computing environment (a collection of interconnected nodes that share data), it is impossible to provide the following guarantees. At best you can have two of the following three - one must be sacrificed.
* Consistency
* Availability
* Partition tolerance
CONNECT ME (INTERNET RESOURCES)
+ http://en.wikipedia.org/wiki/Data_science
+ http://simplystatistics.org/2013/12/12/the-key-word-in-data-science-is-not-data-it-i
+ http://www.oralytics.com/2012/06/data-science-is-multidisciplinary.html
+ http://spotfire.tibco.com/blog/?p=4240
+ http://reports.informationweek.com/abstract/106/1255/Financial/tech-center-taking-advantage-of-in-memory-analytics.html
+ http://www.informationweek.com/software/information-management/oracle-analytics-package-expands-in-database-processing-options/d/d-id/1102712?