1.1 Big Data Overview
Data is created constantly, and at an ever-increasing rate. Mobile phones, social media,
imaging technologies to determine a medical diagnosis—all these and more create new
data that must be stored somewhere for some purpose. Devices and sensors
automatically generate diagnostic information that needs to be stored and processed in real
time. Merely keeping up with this huge influx of data is difficult, but substantially more
challenging is analyzing vast amounts of it, especially when it does not conform to
traditional notions of data structure, to identify meaningful patterns and extract useful
information. These challenges of the data deluge present the opportunity to transform
business, government, science, and everyday life.
Several industries have led the way in developing their ability to gather and exploit data:
- Credit card companies monitor every purchase their customers make and can identify fraudulent purchases with a high degree of accuracy using rules derived by processing billions of transactions.
- Mobile phone companies analyze subscribers' calling patterns to determine, for example, whether a caller's frequent contacts are on a rival network. If that rival network is offering an attractive promotion that might cause the subscriber to defect, the mobile phone company can proactively offer the subscriber an incentive to remain in her contract.
- For companies such as LinkedIn and Facebook, data itself is their primary product. The valuations of these companies are heavily derived from the data they gather and host, which contains more and more intrinsic value as the data grows.
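The rule-based fraud detection described for credit card companies can be sketched at toy scale. This is a minimal illustration only: the function name, the z-score rule, and the threshold are invented for the example, and real systems combine thousands of rules learned from billions of transactions.

```python
from statistics import mean, stdev

def flag_suspicious(history, amount, z_threshold=3.0):
    """Flag a purchase whose amount sits far outside the customer's
    historical spending pattern (a simple z-score rule)."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return amount != mu
    return (amount - mu) / sigma > z_threshold

history = [25.0, 40.0, 32.0, 28.0, 35.0]   # typical card activity
print(flag_suspicious(history, 30.0))       # False: ordinary purchase
print(flag_suspicious(history, 5000.0))     # True: extreme outlier
```

A single statistical cutoff like this is only one ingredient; the point is that the rule itself is derived from the customer's transaction history.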
Three attributes stand out as defining Big Data characteristics:
- Huge volume of data: Rather than thousands or millions of rows, Big Data can be billions of rows and millions of columns.
- Complexity of data types and structures: Big Data reflects the variety of new data sources, formats, and structures, including digital traces being left on the web and other digital repositories for subsequent analysis.
- Speed of new data creation and growth: Big Data can describe high-velocity data, with rapid data ingestion and near-real-time analysis.
Although the volume of Big Data tends to attract the most attention, generally the variety
and velocity of the data provide a more apt definition of Big Data. (Big Data is sometimes
described as having 3 Vs: volume, variety, and velocity.) Due to its size or structure, Big
Data cannot be efficiently analyzed using only traditional databases or methods. Big Data
problems require new tools and technologies to store, manage, and realize the business
benefit. These new tools and technologies enable creation, manipulation, and management
of large datasets and the storage environments that house them. Another definition of Big
Data comes from the McKinsey Global report from 2011:

Big Data is data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value.

McKinsey & Co., Big Data: The Next Frontier for Innovation, Competition, and Productivity [1]
McKinsey's definition of Big Data implies that organizations will need new data
architectures and analytic sandboxes, new tools, new analytical methods, and an
integration of multiple skills into the new role of the data scientist, which will be
discussed in Section 1.3. Figure 1.1 highlights several sources of the Big Data deluge.
[Figure: sources driving the data deluge include mobile sensors, social media, video surveillance, video rendering, smart grids, geophysical exploration, medical imaging, and gene sequencing.]
Figure 1.1 What's driving the data deluge
‘The rate of data creation is accelerating, driven by many of the items in Figure 1.1.
Social media and genetic sequencing are among the fastest-growing sources of Big Data
and examples of untraditional sources of data being used for analysis.
For example, in 2012 Facebook users posted 700 status updates per second worldwide,
which can be leveraged to deduce latent interests or political views of users and show
relevant ads. For instance, an update in which a woman changes her relationship status
from “single” to “engaged” would trigger ads on bridal dresses, wedding planning, or
name-changing services.
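The ad-targeting rule described above can be expressed as a simple lookup from a profile-field transition to ad categories. This is a hypothetical sketch; the rule table, function name, and category strings are invented and are not Facebook's actual system.

```python
# Map a profile-field transition to ad categories. The rule table and
# category names are invented for illustration.
AD_RULES = {
    ("relationship_status", "single", "engaged"):
        ["bridal dresses", "wedding planning", "name-changing services"],
}

def ads_for_update(field, old_value, new_value):
    """Return ad categories triggered by a profile update, if any."""
    return AD_RULES.get((field, old_value, new_value), [])

print(ads_for_update("relationship_status", "single", "engaged"))
# ['bridal dresses', 'wedding planning', 'name-changing services']
print(ads_for_update("relationship_status", "engaged", "married"))  # []
```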
Facebook can also construct social graphs to analyze which users are connected to each
other as an interconnected network. In March 2013, Facebook released a new feature
called “Graph Search,” enabling users and developers to search social graphs for people
with similar interests, hobbies, and shared locations.
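A toy version of such a social-graph query can be written with adjacency sets. This is an illustrative sketch only; the graph, the interests, and the two-hop rule are invented and are not how Graph Search is actually implemented.

```python
# A toy social graph as adjacency sets, plus per-user interests.
friends = {
    "ana":  {"ben", "caro"},
    "ben":  {"ana", "dev"},
    "caro": {"ana", "dev"},
    "dev":  {"ben", "caro"},
}
interests = {
    "ana":  {"hiking", "jazz"},
    "ben":  {"chess"},
    "caro": {"hiking"},
    "dev":  {"hiking", "chess"},
}

def similar_in_network(user):
    """People reachable within two hops who share an interest with `user`."""
    two_hops = set().union(*(friends[f] for f in friends[user])) | friends[user]
    two_hops.discard(user)
    return {p for p in two_hops if interests[p] & interests[user]}

print(sorted(similar_in_network("ana")))   # ['caro', 'dev'] share 'hiking'
```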
Another example comes from genomics. Genetic sequencing and human genome mapping provide a detailed understanding of genetic makeup and lineage. The health care industry is looking toward these advances to help predict which illnesses a person is likely to get in his lifetime and take steps to avoid these maladies or reduce their impact through the use of personalized medicine and treatment. Such tests also highlight typical responses to different medications and pharmaceutical drugs, heightening risk awareness of specific drug treatments.
While data has grown, the cost to perform this work has fallen dramatically. The cost to
sequence one human genome has fallen from $100 million in 2001 to $10,000 in 2011,
and the cost continues to drop. Now, websites such as 23andme (Figure 1.2) offer
genotyping for less than $100. Although genotyping analyzes only a fraction of a genome
and does not provide as much granularity as genetic sequencing, it does point to the fact
that data and complex analysis are becoming more prevalent and less expensive to deploy.
[Figure: screenshot of the 23andme website, promoting ancestry breakdowns, family trees, and finding relatives through genotyping.]
Figure 1.2 Examples of what can be learned through genotyping, from 23andme.com
As illustrated by the examples of social media and genetic sequencing, individuals and
organizations both derive benefits from analysis of ever-larger and more complex datasets
that require increasingly powerful analytical capabilities.
1.2.3 Drivers of Big Data
To better understand the market drivers related to Big Data, it is helpful to first understand some history of data stores and the kinds of repositories and tools used to manage these data stores.
As shown in Figure 1.10, in the 1990s the volume of information was often measured in
terabytes. Most organizations analyzed structured data in rows and columns and used
relational databases and data warehouses to manage large stores of enterprise information.
The following decade saw a proliferation of different kinds of data sources—mainly
productivity and publishing tools such as content management repositories and network-attached storage systems—to manage this kind of information, and the data began to
increase in size and started to be measured at petabyte scales. In the 2010s, the
information that organizations try to manage has broadened to include many other kinds of
data. In this era, everyone and everything is leaving a digital footprint. Figure 1.10 shows
a summary perspective on sources of Big Data generated by new applications and the
scale and growth rate of the data. These applications, which generate data volumes that
can be measured in exabyte scale, provide opportunities for new analytics and drive new
value for organizations. The data now comes from multiple sources, such as these:
- Medical information, such as genomic sequencing and diagnostic imaging
- Photos and video footage uploaded to the World Wide Web
- Video surveillance, such as the thousands of video cameras spread across a city
- Mobile devices, which provide geospatial location data of the users, as well as metadata about text messages, phone calls, and application usage on smart phones
- Smart devices, which provide sensor-based collection of information from smart electric grids, smart buildings, and many other public and industry infrastructures
- Nontraditional IT devices, including the use of radio-frequency identification (RFID) readers, GPS navigation systems, and seismic processing

1.1.1 Data Structures

Big data can come in multiple forms, including structured and non-structured data such as
financial data, text files, multimedia files, and genetic mappings. Contrary to much of the
traditional data analysis performed by organizations, most of the Big Data is unstructured
or semi-structured in nature, which requires different techniques and tools to process and
analyze. [2] Distributed computing environments and massively parallel processing (MPP)
architectures that enable parallelized data ingest and analysis are the preferred approach to
process such complex data.
With this in mind, this section takes a closer look at data structures.
Figure 1.3 shows four types of data structures, with 80-90% of future data growth coming from non-structured data types. [2] Though different, the four are commonly mixed. For example, a classic Relational Database Management System (RDBMS) may store call logs for a software support call center. The RDBMS may store characteristics of the support calls as typical structured data, with attributes such as time stamps, machine type, problem type, and operating system. In addition, the system will likely have unstructured, quasi-, or semi-structured data, such as free-form call log information taken from an e-mail ticket of the problem, customer chat history, a transcript of a phone call describing the technical problem and the solution, or an audio file of the phone call conversation. Many insights could be extracted from the unstructured, quasi-, or semi-structured data in the call center data.
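The mix described above can be made concrete with a toy record type that holds both kinds of data side by side. This is a hypothetical sketch; the field names are invented, not a real call-center schema.

```python
from dataclasses import dataclass

@dataclass
class SupportCall:
    # Structured attributes: defined type and format, easy to query.
    timestamp: str
    machine_type: str
    problem_type: str
    operating_system: str
    # Unstructured attribute: free-form text from email, chat, or a transcript.
    call_log: str = ""

call = SupportCall(
    timestamp="2013-03-14T09:22:00",
    machine_type="laptop",
    problem_type="boot failure",
    operating_system="Windows 7",
    call_log="Customer reports a blinking cursor on startup; suggested "
             "checking the boot order in BIOS, which resolved the issue.",
)
print(call.problem_type)        # boot failure
print("BIOS" in call.call_log)  # True
```

The structured fields support exact queries and aggregation, while the free text needs different techniques (search, natural language processing) to yield insight.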
[Figure: spectrum of data structures from structured to unstructured; data growth is increasingly unstructured.]
Figure 1.3 Big Data growth is increasingly unstructured
Although analyzing structured data tends to be the most familiar technique, a different
technique is required to meet the challenges to analyze semi-structured data (shown as
XML), quasi-structured (shown as a clickstream), and unstructured data.
Here are examples of how each of the four main types of data structures may look:

- Structured data: Data containing a defined data type, format, and structure (that is, transaction data, online analytical processing [OLAP] data cubes, traditional RDBMS, CSV files, and even simple spreadsheets). See Figure 1.4.
- Semi-structured data: Textual data files with a discernible pattern that enables parsing (such as Extensible Markup Language [XML] data files that are self-describing and defined by an XML schema). See Figure 1.5.
- Quasi-structured data: Textual data with erratic data formats that can be formatted with effort, tools, and time (for instance, web clickstream data that may contain inconsistencies in data values and formats). See Figure 1.6.
- Unstructured data: Data that has no inherent structure, which may include text documents, PDFs, images, and video. See Figure 1.7.
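Three of these types can be illustrated with a few lines of parsing code using Python's standard library. All sample data below is invented for the sketch; unstructured data (images, audio) is omitted because it needs heavier tooling.

```python
import csv, io, re
import xml.etree.ElementTree as ET

# Structured: fixed rows and columns, parsed directly.
rows = list(csv.DictReader(io.StringIO("state,year,total\nAlabama,2011,198\n")))

# Semi-structured: self-describing tags enable parsing without a fixed layout.
doc = ET.fromstring("<call><problem type='os'>boot failure</problem></call>")

# Quasi-structured: erratic format, recoverable with effort (here, a regex
# over a web-server clickstream line).
click = "GET /products/42?ref=search HTTP/1.1"
path = re.search(r"GET (\S+) HTTP", click).group(1)

print(rows[0]["total"])                    # 198
print(doc.find("problem").attrib["type"])  # os
print(path)                                # /products/42?ref=search
```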
[Figure: table of Summer Food Service Program participation and total federal expenditures (in millions of dollars) by year, an example of data in rows and columns.]
Figure 1.4 Example of structured data

- Structured data: any data that can be stored, accessed, and processed in a fixed format; e.g., data stored in an RDBMS, such as an employee table in a database.
- Unstructured data: data whose structure is unknown; e.g., the output returned by a Google search, which contains text, images, video, and so on.
- Semi-structured data: data containing both structured and unstructured forms; e.g., data represented in an XML file.
History of Big Data

Data keeps on growing:
- Currently (approx.) 328.77 EB of data are created every day.
- Around 120 ZB of data will be generated in 2024.
- 181 ZB of data will be generated in 2025.
- Google processes 2.5 EB daily (02/2023).
- The Wayback Machine holds 3 PB and grows by 100 TB/month (3/2009).
- Facebook holds 300 PB and adds 4 PB a day (09/2023).
- eBay has 6.5 PB of user data and adds 50 TB/day (5/2008).
- The CERN Large Hadron Collider (LHC) generates 1 PB of collision data per second, too large to process in full. Keeping only the most "interesting" events, the CERN Data Centre processes 1 PB/day.
- Videos account for over half of internet data traffic.
What happens in an internet minute:
- $400M in sales on Alibaba
- 439,000 page views on Wikipedia
- 194,000 apps downloaded
- 31,700 hours of music played on Pandora
- 38,000 photographs uploaded to Instagram
- 4.1 million searches on Google
- 139,000 hours of video watched on YouTube
- 10 million ads displayed
- 3.3 million shares on Facebook
Each of these activities generates data.
Each headline number breaks down further. The $400M in sales on Alibaba comprises product views, orders, ratings, and reviews; the 4.1 million searches on Google comprise results returned, results viewed, and results clicked.
Data generated in an internet minute:
- Alibaba, Wikipedia, Pandora, Instagram, Google, YouTube, Facebook, and other companies generate petabytes of data every minute.
- For scale, consider a 1 TB hard disk drive: thousands of such 1 TB drives are filled up every minute by data collected on the web.
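A back-of-envelope calculation shows what "thousands of 1 TB drives per minute" implies at daily scale. The drive count is an assumption chosen only to make the arithmetic concrete.

```python
drives_per_minute = 2000                    # assumption: "1000s" taken as ~2,000
tb_per_day = drives_per_minute * 60 * 24    # 1 TB per drive, minutes per day
print(tb_per_day)                           # 2880000 TB per day
print(tb_per_day / 1_000_000)               # 2.88 EB per day
```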
Collecting truckloads of data - why

Why are web companies collecting truckloads (literally) of data?

Reason #1: Because they can afford it. Storage prices have dropped dramatically over the last two decades.
Reason #2: Because they can monetize it. Large-scale data can be processed to derive huge amounts of value, and everything is personalized: product recommendations on Amazon, the newsfeed on Facebook, the homepage on Netflix, and ads, offers, and promotions just for you!

How do we go from truckloads of data to monetizable products such as recommendations, newsfeeds, and maps?
Companies like Google, Apple, Amazon, and Facebook own huge data centers with millions of servers covering hundreds of acres, running sophisticated proprietary software to process TBs/PBs of data.

The Big Data Paradigm

There are only a handful of companies in the world that have all of the above.
So, should the rest of us even care? Yes: because of cloud providers like AWS, Microsoft Azure, and GCP, anyone can requisition hundreds of servers at a moment's notice. Netflix, Pinterest, and Airbnb run their entire businesses just using cloud services like AWS, and open source technologies such as Hadoop, Spark, HBase, Hive, and many others are available to everyone.
[Figure: in the 1990s data was measured in terabytes (RDBMS and data warehouses); in the 2000s in petabytes (content and digital asset management); in the 2010s it will be measured in exabytes (NoSQL and key-value stores).]
Figure 1.10 Data evolution and rise of Big Data sources
The Big Data trend is generating an enormous amount of information from many new
sources. This data deluge requires advanced analytics and new market players to take
advantage of these opportunities and new market dynamics, which will be discussed in the
following section.

Figure 1.11 Emerging Big Data ecosystems
As illustrated by this emerging Big Data ecosystem, the kinds of data and the related market dynamics vary greatly. These datasets can include sensor data, text, structured datasets, and social media. With this in mind, it is worth recalling that these datasets will not work well within traditional EDWs, which were architected to streamline reporting and dashboards and be centrally managed. Instead, Big Data problems and projects require different approaches to succeed.
Big Data Statistics

What is Data?
Data is a representation of facts, concepts, or instructions in a formalized manner. Its quality is characterized by:
- Accuracy
- Completeness
- Reliability
- Relevance
- Timeliness

What is Big Data?
Big Data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
What is Big Data?
"Big" in Big Data refers to:
- Big size is the ordinary definition.
- Big complexity rather than big volume; Big Data can be small, and not all large datasets are big.
- Size matters, but so do access, interoperability, and reusability.

Big Data is described in V's:
- Volume (internal and external): How much data? TB? PB?
- Velocity: The rate of data creation. Growing how fast? GB/s? Increased transactions and interactions.
- Variety: Different sources, types, formats, and schemas; structured and unstructured data, including audio and video files, Facebook and X/Twitter comments, photos, GPS data, and model files.
- Veracity: The quality and origin of the data; accuracy and trustworthiness.
- Value: How to turn raw data into useful insights.
- Variability: To what extent, and how fast, can the schema of your data change?
Uses of Big Data
- Weather prediction for fishermen and farmers
- Predicting equipment malfunction in nuclear power plants and chemical plants
- Sentiment analysis
- Industrial equipment monitoring and alerting
How to Deal with Big Data?
- Analyzing Big Data requires scale-out solutions, not scale-up solutions.
- Move the analysis to the data.
- Work with scientists to find the most common "20 queries" and make them fast.
Traditional Technology
- Large relational databases
- On SAN (Storage Area Network)
- Highly parallel processors
- Data may be distributed, but processing happens in one place
- Bring data to process
- Limited scalability
- High-end hardware ($80,000/TB)
Big Data Technology
- Parallel processing
- Clusters of commodity hardware
- Fault-tolerant processing
- Distributed data and distributed processing
- Data redundancy
- Data locality: bring process to data
- Commodity hardware ($3,000/TB)
- Data and processing on the same machine
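The contrast between the two lists can be sketched with a tiny map-reduce-style word count: each "node" processes its own local partition, and only small per-partition summaries are merged. This is the data-locality idea at toy scale; the partitioning and data are invented for illustration, and a real cluster framework (Hadoop, Spark) handles distribution, redundancy, and fault tolerance.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def count_words(partition):
    """The 'map' step: process one local partition of lines."""
    return Counter(word for line in partition for word in line.split())

def distributed_word_count(partitions):
    # Each worker process handles its own partition in parallel.
    with ProcessPoolExecutor() as pool:
        partial = pool.map(count_words, partitions)
    # The 'reduce' step: merge only small per-partition summaries.
    total = Counter()
    for c in partial:
        total.update(c)
    return total

if __name__ == "__main__":
    partitions = [["big data big"], ["data locality"], ["big clusters"]]
    print(distributed_word_count(partitions)["big"])   # 3
```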
How is the Big Data Technology Different?
- Scale out on commodity hardware
- Distributed data and distributed processing

Big Data technology
Big Data technology spans several layers, each with its own tools:
- Analytics
- Visualization (e.g., Tableau, Power BI)
- Storage (e.g., Hadoop)
- Mining (e.g., Presto)
Big Data Applications
- Banking and securities
- Media and entertainment
- Healthcare providers
- Education
- Government
- Insurance, etc.
Big Data Importance
1. Cost Savings
2. Time Saving
3. Understand the Market Conditions
4. Social Media Listening
5. Boost Customer Acquisition and Retention
6. Solve Advertisers' Problems
7. Drive Innovations and Product Development
Big Data Analytics is the process of examining large datasets containing a variety of data types to uncover hidden patterns, unknown correlations, market trends, customer preferences, and other useful information.
Challenges
1. Lack of proper understanding and acceptance of Big Data
2. Confusion during Big Data tool selection
3. Paying loads of money
4. Data integration
5. Data security
6. Data analysis
Big data analytics is the use of advanced analytic
techniques against very large, diverse data sets
that include structured, semi-structured and
unstructured data, from different sources, and in
different sizes from terabytes to zettabytes.
Need for Big Data Analytics
1. Optimize business operations by analyzing customer behaviour (e.g., Amazon)
2. Next-generation products (e.g., Netflix, Spotify)
Types of Big Data Analytics
1. Descriptive Analysis
2. Predictive Analysis
3. Prescriptive Analysis
4. Diagnostic Analysis
1. Descriptive Analysis: What is happening now, based on incoming data (e.g., Google).
2. Predictive Analysis: What might happen in the future.
3. Prescriptive Analysis: What action should be taken. Google's self-driving car is a perfect example of prescriptive analysis.
4. Diagnostic Analysis: Why did it happen?
4 90oti Type ef Big Bate f oe
AT Hopped. ®)
() Deseriplive Aralysts ¢wrarabe(
( Predtetve Aref te ( tubo ha
® Prescriptive Analysis ( ‘i
frvallystt twHy p2 ‘aL at lupper
© Diagnostic
© Pes cuplive Anal pit uw’
Gnaights uleto nid bat wena ae
pl a wily Hae wp ag!
detail» This bulp uh a
me 1s.
og Pos fovmaret ok *
yer 2022+
® Preddichtve Pralysts + =
* wnat dl hepper use Hat we
sy Tt type Ak a
eg ot fs of aciton
Neal
He neck
(3) PresckpHve aie ist- a &
s un. Ne analtets Tt explover
rep
ceyedal possible actow & suggestsack! ons clependén: on dhe VUsUltt Of-
ducaiplive 2 predichive auncilypes df:
A gtver datased:
“what achion +e be falcon +o achteve
predicted pusull 9”,
PrediicHve + ) ,
e-9 t Detu?
29 Self Detutng com (MEAs eapine
® Deagqnaste Praal yen = Te gtves a defatlel
ard in-dophh tntlghe fato Het eo
Court of a problem: y
© why did dt happen?
5 & us techni such ct dala dise@vey,
Oe uit cD consedation: ond
dota wen
—_—-. CRD.
—> —
DPdHe cane / Complextiy >”ye Btg Pata Manna gener Cyckes —
Big Data Management Cycle
Organize → Analyze → Act

Big data architecture
A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or complex for traditional database systems.
[Figure: big data architecture diagram, with data sources feeding a batch-processing path and a real-time message-ingestion/stream-processing path into storage, analytics, and orchestration.]
Most big data architectures include some or all of the following components:

1. Data sources: All big data solutions start with one or more data sources. Examples include:
   - Application data stores, such as relational databases.
   - Static files produced by applications, such as web server log files.
   - Real-time data sources, such as IoT devices.
2. Data storage: Data for batch processing operations is typically stored in a distributed file store that can hold high volumes of large files in various formats. This kind of store is often called a data lake. Options for implementing this storage include Azure Data Lake Store or blob containers in Azure Storage.
3. Batch processing: Because the data sets are so large, often a big data
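The data-lake storage described in item 2 above can be mimicked at toy scale with a date-partitioned folder layout on the local filesystem. The folder scheme, field names, and file names are illustrative assumptions, not an Azure API; real lakes use distributed object stores behind the same logical layout.

```python
import json, tempfile
from pathlib import Path

def land_event(root, source, date, name, payload):
    """Write one raw event file into a source/date-partitioned layout,
    e.g. weblogs/2024-01-15/server1.json (a common data-lake convention)."""
    path = Path(root) / source / date / name
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload))
    return path

root = tempfile.mkdtemp()
p = land_event(root, "weblogs", "2024-01-15", "server1.json",
               {"url": "/home", "status": 200})
print(p.relative_to(root))                  # weblogs/2024-01-15/server1.json
print(json.loads(p.read_text())["status"])  # 200
```

Partitioning by source and date lets batch jobs scan only the files relevant to a given day instead of the whole store.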