UVA HPC & BIG DATA COURSE
Introduction to Big Data
Adam Belloum
Content
• General Introduction
• Definitions
• Data Analytics
• Solutions for Big Data Analytics
• The Network (Internet)
• When to consider a Big Data solution
• Scientific e-infrastructure – some challenges to overcome
Jim Gray Vision in 2007
• “ We have to do better at producing tools to support the whole research cycle—
from data capture and data curation to data analysis and data visualization. Today, the
tools for capturing data both at the mega-scale and at the milli-scale are just dreadful.
After you have captured the data, you need to curate it before you can start doing
any kind of data analysis, and we lack good tools for both data curation and data
analysis.”
• “Then comes the publication of the results of your research, and the published
literature is just the tip of the data iceberg. By this I mean that people collect a lot of
data and then reduce this down to some number of column inches in Science or
Nature—or 10 pages if it is a computer science person writing. So what I mean by
data iceberg is that there is a lot of data that is collected but not curated or published
in any systematic way. “
Based on the transcript of a talk given by Jim Gray
to the NRC-CSTB1 in Mountain View, CA, on January 11, 2007
Data keep on growing
• Google processes 20 PB a day (2008)
• Wayback Machine has 3 PB + 100 TB/month (3/2009)
• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• CERN’s Large Hadron Collider (LHC) generates 15 PB a year
Source: www.cs.kent.edu/~jin/Cloud12Spring/BigData.pptx
Data is Big If It is Measured in MW
• A good sweet spot for a data center is 15 MW
• Facebook’s leased data centers are typically
between 2.5 MW and 6.0 MW.
• Facebook’s Prineville data center is 30 MW
• Google’s computing infrastructure uses 260 MW
Robert Grossman and Collin Bennett, University of Chicago / Open Data Group
Big data was big news in 2012
• and probably in 2013 too.
• The Harvard Business Review calls it “The Management Revolution”.
• The Wall Street Journal ran “Meet the New Big Data” and “Big Data is on the Rise, Bringing Big Questions”.
Big Data is the new hype
By the end of 2011 the world had an estimated 1.8 zettabytes (trillion gigabytes) of data, and this is estimated to grow to 35 zettabytes in the next few years.
Where Does Big Data Come From?
• Big Data is not a specific application type, but rather a trend, or even a collection of trends, spanning multiple application types
• Data is growing in multiple ways:
– More data (volume of data)
– More types of data (variety of data)
– Faster ingest of data (velocity of data)
– More accessibility of data (internet, instruments, …)
– Data growth and availability exceed an organization’s ability to make intelligent decisions based on it
Addison Snell, CEO, Intersect360 Research
How to deal with Big Data
Advice From Jim Gray
1. Analysing Big Data requires scale-out solutions, not scale-up solutions.
2. Move the analysis to the data.
3. Work with scientists to find the most common “20
queries” and make them fast.
4. Go from “working to working.”
Source: Robert Grossman and Collin Bennett, University of Chicago / Open Data Group
Content
• General Introduction
• Definitions
• Data Analytics
• Solutions for Big Data Analytics
• The Network (Internet)
• When to consider a Big Data solution
• Scientific e-infrastructure – some challenges to overcome
How Do We Define Big Data?
• “Big” in Big Data refers to:
– Big size: the primary definition.
– Big complexity rather than big volume: big data can be small, and not all large datasets are big data.
– Size matters, but so do accessibility, interoperability and reusability.
• A common definition uses the 3 Vs: volume, variety, velocity
volume, variety, and velocity
• Aggregation that used to be measured
in petabytes (PB) is now referenced by
a term: zettabytes (ZB).
– A zettabyte is a trillion gigabytes (GB)
– or a billion terabytes
• In 2010 we crossed the 1 ZB mark, and by the end of 2011 the total was estimated at 1.8 ZB
volume, variety, and velocity
• The variety characteristic of Big
Data is really about trying to
capture all of the data that pertains
to our decision-making process.
• Making sense of unstructured data, such as opinions, or analysing images.
volume, variety, and velocity
(Type of Data)
• Relational Data (Tables/Transaction/Legacy Data)
• Text Data (Web)
• Semi-structured Data (XML)
• Graph Data
– Social Network, Semantic
Web (RDF), …
• Streaming Data
– You can only scan the data
once
volume, variety, and velocity
• Velocity is the rate at which data arrives at the enterprise and is processed or understood
• In other terms “How long does
it take you to do something
about it or know it has even
arrived?”
volume, variety, and velocity
Today it is possible, using real-time analytics, to optimize Like buttons across both websites and Facebook. Facebook uses anonymised data to show the number of times people:
• saw Like buttons,
• clicked Like buttons,
• saw Like stories on Facebook,
• and clicked Like stories to visit a given website.
volume, variety, velocity, and veracity
• Veracity refers to the quality or trustworthiness of
the data.
• A common complication is that the data is
saturated with both useful signals and lots of noise
(data that can’t be trusted)
The LHC ATLAS detector generates about 1 petabyte of raw data per second during the collision time (about 1 ms)
A Big Data platform must include six key imperatives
The Big Data platform manifesto: imperatives and underlying technologies
Content
• General Introduction
• Definitions
• Data Analytics
• Solutions for Big Data Analytics
• The Network (Internet)
• When to consider a Big Data solution
• Scientific e-infrastructure – some challenges to overcome
Data Analytics
Analytics Characteristics are not new
• Value: produced when the analytics output is put into
action
• Veracity: a measure of accuracy and timeliness
• Quality:
– well-formed data
– missing values
– cleanliness
• Latency: time between measurement and availability
• Data types have differing pre-analytics needs
The Real-Time Boom
Examples from the figure: Facebook real-time social analytics, SaaS real-time user tracking and web analytics, Google real-time search, Twitter paid-tweet analytics, and new real-time analytics startups.
Example of Analytics
(from Analytics @ Twitter )
• Counting
– How many request/day?
– What’s the average latency?
– How many signups, sms, tweets?
• Correlating
– Desktop vs. mobile users?
– What devices fail at the same time?
– What features get users hooked?
• Researching
– What features get re-tweeted?
– Duplicate detection
– Sentiment analysis
Skills required for Big Data Analytics
(A.K.A Data Science)
• Store and process
– Large scale databases
– Software Engineering
– System/network Engineering
• Analyse and model
– Reasoning
– Knowledge Representation
– Multimedia Retrieval
– Modelling and Simulation
– Machine Learning
– Information Retrieval
• Understand and design
– Decision theory
– Visual analytics
– Perception Cognition
Nancy Grady, PhD, SAIC; Co-Chair, Definitions and Taxonomy Subgroup, NIST Big Data Working Group
Content
• General Introduction
• Definitions
• Data Analytics
• Solutions for Big Data Analytics
• The Network (Internet)
• When to consider a Big Data solution
• Scientific e-infrastructure – some challenges to overcome
Traditional analytics applications
• Scale-up Database
– Use traditional SQL database
– Use stored procedure for event driven reports
– Use flash-based disks to reduce disk I/O
– Use read-only replicas to scale out read queries
• Limitations
– Doesn’t scale on write
– Extremely expensive (HW + SW)
CEP – Complex Event Processing
• Process the data as it comes
• Maintain a window of the data in-memory
• Pros:
– Extremely low-latency
– Relatively low-cost
• Cons
– Hard to scale (Mostly limited to scale-up)
– Not agile: queries must be pre-generated
– Fairly complex
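To make the windowing idea concrete, here is a toy sketch in plain Python (illustrative only, not a real CEP engine): it keeps a 60-second in-memory window of readings and fires an alert when the windowed average crosses a threshold. All names and values are invented for the example.

```python
from collections import deque

WINDOW_S = 60          # seconds of data kept in memory
THRESHOLD = 100.0      # alert when the windowed average exceeds this
window = deque()       # (timestamp, value) pairs, oldest first

def on_event(value, now):
    window.append((now, value))
    while window and window[0][0] < now - WINDOW_S:  # evict expired events
        window.popleft()
    avg = sum(v for _, v in window) / len(window)
    if avg > THRESHOLD:
        print(f"ALERT: {WINDOW_S}s average {avg:.1f} exceeds {THRESHOLD}")

# Feed a stream of readings, one per second.
for i, reading in enumerate([90, 95, 120, 130, 140]):
    on_event(reading, now=1000.0 + i)
```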
In Memory Data Grid
• Distributed in-memory
database
– Scale out (Horizontal scaling)
• Pros
– Scale on write/read
– Fits event-driven (CEP-style) and ad hoc query models
• Cons
– Cost of memory vs. disk
– Memory capacity is limited
In Memory Data Grid products
• Hazelcast
hazelcast.org
• JBOSS Infinispan
www.infinispan.org
• IBM eXtreme Scale:
ibm.com/software/products/en/websphere-extreme-scale
• GigaSpaces XAP elastic caching edition:
www.gigaspaces.com/xap-in-memory-caching-scaling/datagrid
• Oracle Coherence
www.oracle.com/technetwork/middleware/coherence
• Terracotta Enterprise Suite
www.terracotta.org/products/enterprise-suite
• Pivotal Gemfire
pivotal.io/big-data/pivotal-gemfire
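As a taste of the programming model, here is a minimal sketch using the hazelcast-python-client package; the cluster address and map name are assumptions, and API details vary by client version.

```python
import hazelcast  # pip install hazelcast-python-client

# Connect to a Hazelcast cluster (localhost by default) and use a
# distributed map as a scale-out, in-memory key/value store.
client = hazelcast.HazelcastClient()
quotes = client.get_map("quotes").blocking()  # .blocking() gives a sync view

quotes.put("AAPL", 187.25)   # writes are partitioned across cluster members
print(quotes.get("AAPL"))    # reads can be served by any member

client.shutdown()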
NoSQL
• Use distributed database
– HBase, Cassandra, MongoDB
• Pros
– Scale on write/read
– Elastic
• Cons
– Read latency
– Consistency tradeoffs are hard
– Maturity – fairly young technology
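For flavour, a minimal document-store sketch with pymongo (assuming a MongoDB server on localhost; the database, collection, and fields are invented for the example):

```python
from pymongo import MongoClient  # pip install pymongo

# Store schema-less documents and query them back; MongoDB accepts
# records whose fields vary from document to document.
client = MongoClient("mongodb://localhost:27017")
events = client["demo"]["events"]

events.insert_one({"user": 42, "action": "click", "page": "/home"})
events.insert_one({"user": 7, "action": "buy", "price": 9.99})  # extra field

for doc in events.find({"action": "buy"}):
    print(doc["user"], doc["price"])
```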
NoSQL
Source: Bill Howe, UW
Hadoop MapReduce
• Distributed batch
processing
• Pros
– Designed to process
massive amount of data
– Mature
– Low cost
• Cons
– Not real-time
Sorting 1 TB of DATA
• Estimate:
– read 100MB/s, write 100MB/s
– no disk seeks, instant sort
– 341 minutes → 5.6 hours
• The terabyte benchmark
winner (2008):
– 209 seconds (3.48 minutes)
– 910 nodes x (4 dual-core
processors, 4 disks, 8 GB
memory)
• October 2012
– ? see http://www.youtube.com/watch?v=XbUPlbYxT8g&feature=youtu.be
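A quick sanity check of the single-machine estimate in Python; with decimal units the figure comes out near the slide's 341 minutes:

```python
# Sequentially read and write 1 TB at 100 MB/s each, with no seeks
# and an instant in-memory sort, as the slide assumes.
tb_in_mb = 1_000_000   # decimal units; binary units give a slightly larger figure
rate_mb_s = 100

seconds = (tb_in_mb / rate_mb_s) * 2   # one full read + one full write
print(f"{seconds/60:.0f} minutes, {seconds/3600:.1f} hours")  # ~333 min, ~5.6 h
```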
MapReduce vs. Databases
• A. Pavlo, et al. "A comparison of approaches to
large-scale data analysis," in SIGMOD ’09:
Proceedings of the 35th SIGMOD international
conference on Management of data, New York, NY,
USA, 2009, pp. 165-178
• Conclusions: … at the scale of the experiments we
conducted, both parallel database systems
displayed a significant performance advantage over
Hadoop MR in executing a variety of data intensive
analysis benchmarks.
Hadoop MapReduce – Reality Check
“With the paths that go through Hadoop [at Yahoo!], the latency is about fifteen minutes. … [I]t will never be true real-time.” (Yahoo CTO Raymie Stata)

“Hadoop/Hive.. not realtime. Many dependencies. Lots of points of failure. Complicated system. Not dependable enough to hit realtime goals.” (Alex Himel, Engineering Manager at Facebook)
"MapReduce
and
other
batch-‐processing
systems
cannot
process
small
updates
individually
as
they
rely
on
creaRng
large
batches
for
efficiency,“
(Google
senior
director
of
engineering
Eisar
Lipkovitz)
Map Reduce
Diagram: very big data → MAP → partitioning function → REDUCE → result
• Map:
– accepts an input key/value pair
– emits intermediate key/value pairs
• Reduce:
– accepts an intermediate key/value* pair (a key and the list of its values)
– emits output key/value pairs
WING Group Meeting, 13 Oct 2006, Hendra Setiawan
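To make the contract concrete, here is a minimal word-count sketch in plain Python that mimics the map and reduce signatures above; the in-memory "shuffle" stands in for what the framework does across machines:

```python
from collections import defaultdict

def map_fn(key, value):              # accepts an input key/value pair
    for word in value.split():
        yield word, 1                # emits intermediate key/value pairs

def reduce_fn(key, values):          # accepts a key and all its values
    yield key, sum(values)           # emits an output key/value pair

docs = [(0, "big data is big"), (1, "data beats hype")]

groups = defaultdict(list)           # the "shuffle": group values by key
for k, v in docs:
    for word, count in map_fn(k, v):
        groups[word].append(count)

for word, counts in groups.items():
    for out_key, out_val in reduce_fn(word, counts):
        print(out_key, out_val)      # big 2, data 2, is 1, beats 1, hype 1
```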
Apache Spark
Lightning-fast cluster computing
• Generality
– Combine SQL, streaming, complex analytics.
• Runs Everywhere
– Spark runs on Hadoop, Mesos, standalone,
or in the cloud. It can access diverse data
sources (HDFS, Cassandra, HBase, and S3)
• Ease of Use
– Write applications quickly in Java, Scala,
Python, R.
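For comparison with the MapReduce sketch above, the same word count in PySpark (a sketch assuming a working Spark installation; the HDFS input path is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

counts = (spark.sparkContext.textFile("hdfs:///data/corpus.txt")  # hypothetical path
          .flatMap(lambda line: line.split())      # map: emit words
          .map(lambda word: (word, 1))             # intermediate key/value pairs
          .reduceByKey(lambda a, b: a + b))        # reduce: sum per key

print(counts.take(10))
spark.stop()
```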
Apache Storm
By Nathan Marz
• Storm is a distributed real-time computation system that addresses the typical downsides of queue-and-worker systems
– Built with Big Data in mind (the “Hadoop of realtime”)
• Storm Trident (a high-level abstraction over Storm core)
– Micro-batching (~ streaming)
Apache Kafka
A high-throughput distributed messaging system
• Apache Kafka is publish-subscribe messaging
rethought as a distributed commit log.
• Kafka maintains feeds of messages in
categories called topics.
– Processes that publish messages to a Kafka topic are called producers.
– Processes that subscribe to topics and process the feed of published messages are called consumers.
• Kafka runs as a cluster of one or more servers, each of which is called a broker.
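A minimal producer/consumer sketch using the kafka-python client (an assumption; the slide does not prescribe a client), against a broker assumed to be at localhost:9092:

```python
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

# Producer: publish a message to the "clicks" topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clicks", b"user=42 page=/home")
producer.flush()

# Consumer: subscribe to the topic and process the feed of messages.
consumer = KafkaConsumer("clicks",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)  # stop after 5 s of silence
for message in consumer:
    print(message.topic, message.offset, message.value)
```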
Performance (figure not reproduced)
Content
• General Introduction
• Definitions
• Data Analytics
• Solutions for Big Data Analytics
• The Network (Internet)
• When to consider a Big Data solution
• Scientific e-infrastructure – some challenges to overcome
The problem
• TCP was never designed to move large datasets over wide-area, high-performance networks.
• For loading a webpage, TCP is great.
• For sustained data transfer, it is far from ideal.
– Most of the time, even though the connection itself is good (say, 45 Mbps), transfers are much slower.
– Two things bring TCP-based file transfer over fast connections to a crawl:
• latency
• and packet loss.
Robert Grossman, University of Chicago / Open Data Group, November 14, 2011
TCP Throughput vs. RTT and Packet Loss
Figure: single-stream TCP throughput (Mb/s, up to 1000) versus round-trip time (1–400 ms) for packet-loss rates of 0.01%, 0.05%, 0.1%, 0.5%, and 1%; RTT ranges marked LAN, US, US-EU, US-ASIA. Throughput drops sharply as RTT and loss increase.
Source: Yunhong Gu, 2007, experiments over wide area 1G.
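The figure's shape follows from a standard back-of-envelope model, the Mathis et al. approximation (not from the slides), which bounds a single TCP stream at roughly MSS / (RTT * sqrt(p)) for loss rate p:

```python
import math

def tcp_throughput_mbps(mss_bytes=1460, rtt_ms=100.0, loss=0.001):
    """Mathis et al. approximation: throughput <= MSS / (RTT * sqrt(p))."""
    bits_per_s = (mss_bytes * 8) / (rtt_ms / 1000.0) / math.sqrt(loss)
    return bits_per_s / 1e6

print(tcp_throughput_mbps(rtt_ms=1, loss=0.0001))    # LAN: ~1170 Mb/s
print(tcp_throughput_mbps(rtt_ms=100, loss=0.001))   # US-EU: ~3.7 Mb/s
print(tcp_throughput_mbps(rtt_ms=200, loss=0.01))    # US-Asia: ~0.6 Mb/s
```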
The solutions
• Use parallel TCP streams
– GridFTP
• Use specialized network protocols
– UDT, FAST, etc.
• Use RAID to stripe data across disks to improve
throughput when reading
• These techniques are well understood in HEP and astronomy, but not yet in biology
Robert Grossman, University of Chicago / Open Data Group, November 14, 2011
Moving 113GB of Bio-mirror Data
Site      RTT (ms)  TCP (min)  UDT (min)  TCP/UDT      Km
NCSA            10        139        139         1      200
Purdue          17        125        125         1      500
ORNL            25        361        120         3    1,200
TACC            37        616        120       5.5    2,000
SDSC            65        750        475       1.6    3,300
CSTNET         274       3722        304        12   12,000
• GridFTP TCP and UDT transfer times for 113 GB from gridip.bio-mirror.net/biomirror/blast/ (Indiana, USA).
– All TCP and UDT times in minutes.
– Source: http://gridip.bio-mirror.net/biomirror/
Robert Grossman, University of Chicago / Open Data Group, November 14, 2011
Case study: CGI 60 genomes
• Trace by Complete Genomics showing performance of moving 60
complete human genomes from Mountain View to Chicago using the
open source Sector/UDT.
• Approximately 18 TB at about 0.5 Gbps on a 1G link.
Robert Grossman, University of Chicago / Open Data Group, November 14, 2011
How FedEx Has More Bandwidth Than
the Internet—and When That'll Change
• If you're looking to transfer hundreds of
gigabytes of data, it's still—weirdly—faster to
ship hard drives via FedEx than it is to transfer
the files over the internet.
• “Cisco estimates that total internet traffic currently averages 167 terabits per second. FedEx has a fleet of 654 aircraft with a lift capacity of 26.5 million pounds daily. A solid-state laptop drive weighs about 78 grams and can hold up to a terabyte. That means FedEx is capable of transferring 150 exabytes of data per day, or 14 petabits per second—almost a hundred times the current throughput of the internet.”
http://gizmodo.com/5981713/how-fedex-has-more-bandwidth-than-the-internetand-when-thatll-change
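The quote's arithmetic is easy to verify:

```python
# Reproduce the quote's back-of-envelope numbers.
lift_lbs_per_day = 26.5e6        # FedEx daily lift capacity, pounds
drive_grams = 78                 # one solid-state laptop drive
drive_bytes = 1e12               # 1 TB per drive

drives_per_day = lift_lbs_per_day * 453.592 / drive_grams
exabytes_per_day = drives_per_day * drive_bytes / 1e18
petabits_per_sec = drives_per_day * drive_bytes * 8 / 86_400 / 1e15

print(f"{exabytes_per_day:.0f} EB/day, {petabits_per_sec:.1f} Pb/s")  # ~154 EB/day, ~14 Pb/s
print(petabits_per_sec * 1e15 / 167e12)   # ~86x Cisco's 167 Tb/s figure
```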
Content
• General Introduction
• Definitions
• Data Analytics
• Solutions for Big Data Analytics
• The Network (Internet)
• When to consider a Big Data solution
• Scientific e-infrastructure – some challenges to overcome
When to Consider a Big Data Solution
User point of view
• You’re limited by your current platform or
environment because you can’t process the
amount of data that you want to process
• You want to involve new sources of data in the
analytics, but you can’t, because it doesn’t fit into
schema-defined rows and columns without
sacrificing fidelity or the richness of the data
When to Consider a Big Data Solution
• You need to ingest data as quickly as possible and need to work with a schema on demand
– You’re forced into a schema-on-write approach (the schema must be created before data is loaded),
– but you need to ingest data quickly, or perhaps in a discovery process, and want the cost benefits of a schema-on-read approach (data is simply copied to the file store, and no special transformation is needed) until you know that you’ve got something that’s ready for analysis.
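As a toy illustration of schema-on-read (invented for this note, not from the source), raw events can be landed as-is and structured only when queried:

```python
import json

# Schema-on-read: land raw events without declaring a schema first.
raw_events = [
    '{"user": 1, "action": "click", "page": "/home"}',
    '{"user": 2, "action": "buy", "price": 9.99}',  # new field; no migration needed
]

events = [json.loads(line) for line in raw_events]

# The "schema" is applied by the query itself, once we know what we need.
revenue = sum(e.get("price", 0.0) for e in events if e.get("action") == "buy")
print(revenue)
```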
When to Consider a Big Data Solution
• You want to analyse not just raw structured data,
but also semi-structured and unstructured data
from a wide variety of sources
• You’re not satisfied with the effectiveness of your algorithms or models
– when all, or most, of the data needs to be analysed
– or when a sampling of the data isn’t going to work
When to Consider a Big Data Solution
• You aren’t completely sure where the investigation will take you, and you want elasticity of compute, storage, and the types of analytics that will be pursued—all of these become useful as you add more sources and new methods
If your answers to any of these questions are “yes,” you
need to consider a Big Data solution.
Content
• General Introduction
• Definitions
• Data Analytics
• Solutions for Big Data Analytics
• The Network (Internet)
• When to consider a Big Data solution
• Scientific e-infrastructure – some challenges to overcome
Scientific e-infrastructure – some
challenges to overcome
• Collection
– How can we make sure that data are collected together with the information necessary to re-use them?
• Trust
– How can we make informed judgements about whether
certain data are authentic and can be trusted?
– How can we judge which repositories we can trust? How can appropriate access and use of resources be granted or controlled?
Riding the wave: How Europe can gain from the rising tide of scientific data
Scientific e-infrastructure – some
challenges to overcome
• Usability
– How can we move to a situation where non-specialists can overcome the barriers and start sensible work on unfamiliar data?
• Interoperability
– How can we implement interoperability within disciplines and move to an
overarching multi-disciplinary way of understanding and using data?
– How can we find unfamiliar but relevant data resources, beyond simple keyword searches, involving a deeper probing into the data?
– How can automated tools find the information they need to tackle the data?
Riding the wave: How Europe can gain from the rising tide of scientific data
Scientific e-infrastructure – some
challenges to overcome
• Diversity
– How do we overcome the problems of diversity –
heterogeneity of data, but also of backgrounds and data-
sharing cultures in the scientific community?
– How do we deal with the diversity of data repositories and
access rules – within or between disciplines, and within or
across national borders?
• Security
– How can we guarantee data integrity?
– How can we avoid data poisoning by individuals or groups intending to bias the data in their own interest?
Riding the wave: How Europe can gain from the rising tide of scientific data
Scientific e-infrastructure – a wish list
• Open deposit, allowing user-community centres
to store data easily
• Bit-stream preservation, ensuring that data
authenticity will be guaranteed for a specified
number of years
• Format and content migration, executing CPU-
intensive transformations on large data sets at
the command of the communities
Riding the wave: How Europe can gain from the rising tide of scientific data
Scientific e-infrastructure – a wish list
• Persistent identification, allowing data centres to
register a huge amount of markers to track the
origins and characteristics of the information
• Metadata support to allow effective
management, use and understanding
• Maintaining proper access rights as the basis of
all trust
• A variety of access and curation services that will
vary between scientific disciplines and over time
Riding the wave: How Europe can gain from the rising tide of scientific data
Scientific e-infrastructure – a wish list
• Execution services that allow a large group of researchers to operate on the stored data
• High reliability, so researchers can count on its
availability
• Regular quality assessment to ensure adherence to
all agreements
• Distributed and collaborative authentication,
authorisation and accounting
• A high degree of interoperability at format and
semantic level
Riding the wave: How Europe can gain from the rising tide of scientific data
Google BigQuery
• Google BigQuery is a web service that lets you
do interactive analysis of massive datasets—up
to billions of rows. Scalable and easy to use,
BigQuery lets developers and businesses tap
into powerful data analytics on demand
– http://www.youtube.com/watch?v=P78T_ZDwQyk
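A minimal sketch with the google-cloud-bigquery Python client, assuming Google Cloud credentials are configured; the query counts words in one of Google's public sample tables:

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()  # picks up project/credentials from the environment

query = """
    SELECT corpus, COUNT(*) AS n
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY corpus
    ORDER BY n DESC
    LIMIT 5
"""
for row in client.query(query).result():  # interactive; billed by bytes scanned
    print(row.corpus, row.n)
```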
IBM BigInsights
• BigInsights = analytical platform for persistent
“big data”
– Based on open-source and IBM technologies
• Distinguishing characteristics
– Built-in Analytics
Big Data: Frequently Asked Questions for IBM InfoSphere BigInsights
http://www.youtube.com/watch?v=I4hsZa2jwAs
References
• T. Hey, S. Tansley, and K. Tolle, Eds., The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, 2009. Available: http://research.microsoft.com/en-us/collaboration/fourthparadigm/
• Enabling knowledge creation in data-driven science: https://sciencenode.org/feature/enabling-knowledge-creation-data-driven-science.php
• Science as an open enterprise: open data for open science: http://royalsociety.org/uploadedFiles/Royal_Society_Content/policy/projects/sape/2012-06-20-SAOE.pdf
• Realtime Analytics for Big Data: A Facebook Case Study
http://www.youtube.com/watch?v=viPRny0nq3o