Lecture 2

Uploaded by

sarahgohar0308

BIG DATA ANALYTICS

Lecture 2 --- Week 2

Content

 BDA system architecture

 Techniques towards Big Data

 Types of data (Transaction data, Document data, Network data, Genomic sequences, Medical data, Environmental data, Behavioral data)

 Types of attributes

 Big Data business use cases

BDA system architecture

[Figure: layered BDA system architecture. Layers in the order shown: specialized services for domain A and specialized services for domain B; Big Data Services Layer; Knowledge Management Layer; Data Storage and Management Layer.]

BDA system architecture

Data Storage and Management Layer

 Large amounts of data in a distributed environment

 Unstructured and semi-structured data
 Not necessarily a schema
 Heterogeneous

 Streams
 Varying quality
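
The points above (no fixed schema, heterogeneous records, varying quality) can be illustrated with a small sketch. It assumes the data arrives as JSON lines; the records, field names, and the helper functions `parse_records` and `order_amounts` are hypothetical and only stand in for what a real storage layer would do at much larger scale.

```python
import json

# Hypothetical raw records, as they might arrive from different sources.
# Field names and values are illustrative only.
raw_lines = [
    '{"id": 1, "type": "order", "amount": 19.99}',
    '{"id": 2, "type": "sensor", "reading": {"temp_c": 21.5}}',
    '{"id": 3, "type": "order"}',   # missing "amount": varying quality
    'not valid json',               # corrupt record
]


def parse_records(lines):
    """Parse JSON lines, separating usable records from rejected ones."""
    good, rejected = [], []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            rejected.append(line)
            continue
        good.append(record)
    return good, rejected


def order_amounts(records):
    """Apply a structure at read time: pick out order amounts, skipping gaps."""
    return [r["amount"] for r in records
            if r.get("type") == "order" and "amount" in r]


records, rejected = parse_records(raw_lines)
amounts = order_amounts(records)
print(len(records), len(rejected), amounts)
```

Because nothing enforces a schema when the data is stored, the reading code has to tolerate missing fields and corrupt lines instead of assuming every record looks the same.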

BDA system architecture

 A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or complex for traditional database systems.

BDA system architecture

 Big data solutions typically involve one or more of the following types of workload:

 Batch processing of big data sources at rest.

 Real-time processing of big data in motion.

 Interactive exploration of big data.

 Predictive analytics and machine learning.

BDA system architecture

 Most big data architectures include some or all of the following components:

 Data sources: All big data solutions start with one or more data sources. Examples include:

 Application data stores, such as relational databases.

 Static files produced by applications, such as web server log files.

 Real-time data sources, such as IoT devices.

 Data storage: Data for batch processing operations is typically stored in a distributed file store that can hold high volumes of large files in various formats. This kind of store is often called a data lake. Options for implementing this storage include Azure Data Lake Store or blob containers in Azure Storage.

BDA system architecture

 Batch processing: Because the data sets are so large, a big data solution must often process data files using long-running batch jobs to filter, aggregate, and otherwise prepare the data for analysis. Usually these jobs involve reading source files, processing them, and writing the output to new files. Options include running U-SQL jobs in Azure Data Lake Analytics, using Hive, Pig, or custom MapReduce jobs in an HDInsight Hadoop cluster, or using Java, Scala, or Python programs in an HDInsight Spark cluster.

 Real-time message ingestion: If the solution includes real-time sources, the architecture must include a way to capture and store real-time messages for stream processing. This might be a simple data store, where incoming messages are dropped into a folder for processing. However, many solutions need a message ingestion store to act as a buffer for messages, and to support scale-out processing, reliable delivery, and other message queuing semantics. Options include Azure Event Hubs, Azure IoT Hub, and Kafka.
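
The two items above can be sketched together on a single machine: messages are dropped into a folder (the "simple data store" case), and a batch job then reads the files, filters and aggregates them, and writes a new output file. This is only an illustration using the Python standard library, not how Hadoop, Spark, or Event Hubs work internally; the function names `drop_messages` and `run_batch_job`, the file layout, and the record fields are all assumptions made for the example.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def drop_messages(inbox: Path, messages: list[dict]) -> None:
    """Simple data store: write each incoming message to its own file."""
    for i, message in enumerate(messages):
        (inbox / f"msg-{i:05d}.json").write_text(json.dumps(message))


def run_batch_job(inbox: Path, outbox: Path) -> Path:
    """Read source files, filter and aggregate them, write a new output file."""
    totals = defaultdict(float)
    for path in sorted(inbox.glob("*.json")):
        record = json.loads(path.read_text())
        if record.get("status") != "ok":             # filter: skip bad readings
            continue
        totals[record["device"]] += record["value"]  # aggregate per device
    result = outbox / "totals.json"
    result.write_text(json.dumps(totals, sort_keys=True))
    return result


# Hypothetical device readings; field names are illustrative.
messages = [
    {"device": "a", "value": 1.5, "status": "ok"},
    {"device": "b", "value": 2.0, "status": "ok"},
    {"device": "a", "value": 9.9, "status": "error"},
    {"device": "a", "value": 0.5, "status": "ok"},
]

with tempfile.TemporaryDirectory() as tmp:
    inbox, outbox = Path(tmp) / "inbox", Path(tmp) / "outbox"
    inbox.mkdir()
    outbox.mkdir()
    drop_messages(inbox, messages)
    totals = json.loads(run_batch_job(inbox, outbox).read_text())

print(totals)
```

A folder like this offers no buffering guarantees, ordering, or scale-out, which is the gap a dedicated message ingestion store is meant to fill.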

BDA system architecture

 Stream processing: After capturing real-time messages, the solution must process them by filtering, aggregating, and otherwise preparing the data for analysis. The processed stream data is then written to an output sink. Azure Stream Analytics provides a managed stream processing service based on perpetually running SQL queries that operate on unbounded streams. You can also use open-source Apache streaming technologies like Storm and Spark Streaming in an HDInsight cluster.

 Analytical data store: Many big data solutions prepare data for analysis and then serve the processed data in a structured format that can be queried using analytical tools. The analytical data store used to serve these queries can be a Kimball-style relational data warehouse, as seen in most traditional business intelligence (BI) solutions. Alternatively, the data could be presented through a low-latency NoSQL technology such as HBase, or an interactive Hive database that provides a metadata abstraction over data files in the distributed data store. Azure Synapse Analytics provides a managed service for large-scale, cloud-based data warehousing. HDInsight supports Interactive Hive, HBase, and Spark SQL, which can also be used to serve data for analysis.
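
As a rough illustration of the two items above, the sketch below aggregates a small event stream into fixed (tumbling) time windows, writes the result to an output sink, and then queries that sink with SQL. An in-memory SQLite table stands in for the analytical data store. The function `tumbling_window_counts`, the table name `page_counts`, and the click events are hypothetical, and the sketch assumes events arrive in timestamp order, which real streams do not guarantee.

```python
import sqlite3
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Group (timestamp, key) events into fixed, non-overlapping windows.

    Yields (window_start, key, count) rows. Assumes events arrive in
    timestamp order.
    """
    current_start, counts = None, defaultdict(int)
    for timestamp, key in events:
        start = timestamp - (timestamp % window_seconds)
        if current_start is not None and start != current_start:
            for k, c in sorted(counts.items()):
                yield current_start, k, c
            counts = defaultdict(int)
        current_start = start
        counts[key] += 1
    for k, c in sorted(counts.items()):  # flush the last open window
        yield current_start, k, c


# Hypothetical click events: (seconds since start, page).
events = [(1, "home"), (4, "cart"), (7, "home"),
          (12, "home"), (15, "cart"), (19, "cart")]

# Output sink: an in-memory SQLite table standing in for the analytical store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page_counts (window_start INTEGER, page TEXT, hits INTEGER)"
)
conn.executemany(
    "INSERT INTO page_counts VALUES (?, ?, ?)",
    tumbling_window_counts(events, window_seconds=10),
)

# Analytical query over the processed, structured data.
rows = conn.execute(
    "SELECT page, SUM(hits) FROM page_counts GROUP BY page ORDER BY page"
).fetchall()
print(rows)
conn.close()
```

The stream processor only ever holds one open window in memory, while the store keeps the compact per-window results that analytical tools query later.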

BDA system architecture

 Analysis and reporting: The goal of most big data solutions is to provide insights into the data through analysis and reporting. To empower users to analyze the data, the architecture may include a data modeling layer, such as a multidimensional OLAP cube or tabular data model in Azure Analysis Services. It might also support self-service BI, using the modeling and visualization technologies in Microsoft Power BI or Microsoft Excel. Analysis and reporting can also take the form of interactive data exploration by data scientists or data analysts. For these scenarios, many Azure services support analytical notebooks, such as Jupyter, enabling these users to leverage their existing skills with Python or R. For large-scale data exploration, you can use Microsoft R Server, either standalone or with Spark.

 Orchestration: Most big data solutions consist of repeated data processing operations, encapsulated in workflows, that transform source data, move data between multiple sources and sinks, load the processed data into an analytical data store, or push the results straight to a report or dashboard. To automate these workflows, you can use an orchestration technology such as Azure Data Factory or Apache Oozie and Sqoop.
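
The idea of a workflow whose steps run in dependency order can be sketched with the standard library's `graphlib` module (Python 3.9+). This is not how Azure Data Factory or Oozie are implemented; the task names and the helpers `make_task` and `run_workflow` are hypothetical, and each task only records that it ran.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
workflow = {
    "extract_source": set(),
    "transform": {"extract_source"},
    "load_warehouse": {"transform"},
    "refresh_dashboard": {"load_warehouse"},
}


def make_task(name, log):
    """Return a stand-in task that only records that it ran."""
    def task():
        log.append(name)
    return task


def run_workflow(graph):
    """Run every task once, after all of its dependencies have run."""
    log = []
    tasks = {name: make_task(name, log) for name in graph}
    for name in TopologicalSorter(graph).static_order():
        tasks[name]()
    return log


execution_order = run_workflow(workflow)
print(execution_order)
```

A real orchestrator adds scheduling, retries, and monitoring on top of this ordering, but the dependency graph is the core of the workflow definition.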

When to use this architecture

 Consider this architecture style when you need to:

 Store and process data in volumes too large for a traditional database.
 Transform unstructured data for analysis and reporting.
 Capture, process, and analyze unbounded streams of data in real time, or with low latency.
 Use Azure Machine Learning or Microsoft Cognitive Services.

Techniques towards Big Data

 Massive Parallelism
 Huge Data Volumes Storage
 Data Distribution
 High-Speed Networks
 High-Performance Computing
 Task and Thread Management
 Data Mining and Analytics
 Data Retrieval
 Machine Learning
 Data Visualization

Types of data

 Transaction data describes an event (the change that results from a transaction) and is usually described with verbs. Transaction data always has a time dimension and a numerical value, and refers to one or more objects (i.e. the reference data).

 Typical transactions are:

 Financial: orders, invoices, payments

 Work: plans, activity records

 Logistics: deliveries, storage records, travel records, etc.
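
The three properties of transaction data named above (a time dimension, a numerical value, and references to other objects) can be made concrete with a small record type. The class `Transaction`, its field names, and the sample reference data are illustrative assumptions, not a standard model.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal


@dataclass(frozen=True)
class Transaction:
    """One event: a verb, a timestamp, a numeric value, and object references."""
    action: str                  # the verb, e.g. "pay" or "deliver"
    occurred_at: datetime        # time dimension
    amount: Decimal              # numerical value
    references: tuple[str, ...]  # IDs of the objects involved (reference data)


# Hypothetical reference data, keyed by ID.
reference_data = {
    "customer:42": {"name": "Example Customer"},
    "invoice:1001": {"total": Decimal("250.00")},
}

payment = Transaction(
    action="pay",
    occurred_at=datetime(2024, 1, 15, 9, 30),
    amount=Decimal("250.00"),
    references=("customer:42", "invoice:1001"),
)

# Every reference in the transaction should resolve to a known object.
unresolved = [ref for ref in payment.references if ref not in reference_data]
print(unresolved)
```

The transaction itself stays small because the descriptive details live in the reference data it points to.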

Types of data

 A key form of document data is metadata, or in other words “data about data”. Metadata are characteristics that describe the data and facilitate its cataloguing and discovery. When you deposit data into a trusted data repository, the repository generates machine-readable metadata.

 Network data is information processed or stored by a computer. This information may be in the form of text documents, images, audio clips, software programs, or other types of data. It can be transferred from one computer to another over a network connection or on various media devices.

Types of data

 Genomic sequencing involves a host of decisions on the part of the provider and client, including whether to be tested, when and how to receive sequencing results, whether to inform biological relatives, and whether to take protective actions to reduce risk.

 Medical (clinical) data refers to health-related information that is associated with regular patient care or collected as part of a clinical trial program.

Types of data

 Environmental data is data based on the measurement of environmental pressures, the state of the environment, and the impacts on ecosystems. All data generated by the execution of environmental law is also considered environmental data.

 Behavioral data refers to information produced as a result of actions, typically commercial behavior, on a range of devices connected to the Internet, such as a PC, tablet, or smartphone. Behavioral data tracks the sites visited, the apps downloaded, or the games played.

Attribute

 An attribute can be seen as a data field that represents a characteristic or feature of a data object. For a customer object, the attributes can be customer ID, address, etc. The set of attributes used to describe a given object is known as an attribute vector or feature vector.
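
As a minimal sketch of the customer example above, a data object can be written as a record whose fields are its attributes, and its attribute vector is then the tuple of field values. The class `Customer`, the extra `age` field, and the sample values are hypothetical.

```python
from dataclasses import astuple, dataclass, fields


@dataclass
class Customer:
    """A data object; each field is one attribute (names are illustrative)."""
    customer_id: int
    address: str
    age: int


customer = Customer(customer_id=101, address="12 Example Street", age=34)

attribute_names = [f.name for f in fields(customer)]
attribute_vector = astuple(customer)
print(attribute_names)
print(attribute_vector)
```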

Types of attributes

 Qualitative (Nominal (N), Ordinal (O), Binary (B)).

 Quantitative (Numeric, Discrete, Continuous).

Qualitative Attributes

 Nominal attributes (related to names): The values of a nominal attribute are names of things or symbols. Each value represents a category or state, which is why nominal attributes are also referred to as categorical attributes. There is no order (rank, position) among the values of a nominal attribute.

Qualitative Attributes

 Binary attributes: Binary data has only two values or states, for example yes or no, affected or unaffected, true or false.

 Symmetric: Both values are equally important (e.g. gender).

 Asymmetric: The two values are not equally important (e.g. a test result).

Qualitative Attributes

 Ordinal attributes: The values of an ordinal attribute have a meaningful sequence or ranking (order) among them, but the magnitude of the difference between values is not known. The order shows what is more important but does not indicate how much more important it is.
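
The difference between nominal and ordinal attributes can be sketched with Python enumerations. The categories `Color` and `Grade` and their values are illustrative assumptions; the point is only that nominal values support equality tests, while ordinal values can also be compared and sorted.

```python
from enum import Enum, IntEnum


class Color(Enum):
    """Nominal: categories with no order (values are illustrative)."""
    RED = "red"
    GREEN = "green"
    BLUE = "blue"


class Grade(IntEnum):
    """Ordinal: the integers encode rank only, not magnitude."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Nominal values support equality tests only.
same_color = Color.RED == Color.RED

# Plain Enum members do not define "<", so ordering them raises TypeError.
try:
    Color.RED < Color.BLUE
    nominal_is_orderable = True
except TypeError:
    nominal_is_orderable = False

# Ordinal values can be compared and sorted.
ranked = sorted([Grade.HIGH, Grade.LOW, Grade.MEDIUM])
print(same_color, nominal_is_orderable, [g.name for g in ranked])
```

Note that `IntEnum` would also allow arithmetic such as `Grade.HIGH - Grade.LOW`, but for an ordinal attribute that difference has no defined meaning, so it should not be used.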

Quantitative Attributes

 Numeric: A numeric attribute is quantitative because it is a measurable quantity, represented by integer or real values. Numeric attributes are of two types: interval and ratio.

 An interval-scaled attribute has values whose differences are interpretable, but it has no true reference point (zero point). Values on an interval scale can be added and subtracted but cannot be meaningfully multiplied or divided. Consider temperature in degrees Celsius: if the temperature on one day is twice that of another day, we cannot say that the first day is twice as hot.

 A ratio-scaled attribute is a numeric attribute with a fixed zero point. If a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value. The values are ordered, we can compute the difference between values, and the mean, median, mode, quantile range, and five-number summary can be given.
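
The temperature example can be worked through numerically. The sketch below assumes two illustrative daily temperatures and converts them to kelvin, a temperature scale with a true zero point, to show that differences agree on both scales while ratios do not.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert an interval-scaled value (Celsius) to a ratio-scaled one (kelvin)."""
    return celsius + 273.15


day_a_c, day_b_c = 10.0, 20.0  # illustrative temperatures

# Differences are meaningful on both scales and agree.
diff_c = day_b_c - day_a_c
diff_k = celsius_to_kelvin(day_b_c) - celsius_to_kelvin(day_a_c)

# Ratios are only meaningful on the ratio scale.
ratio_c = day_b_c / day_a_c
ratio_k = celsius_to_kelvin(day_b_c) / celsius_to_kelvin(day_a_c)

print(diff_c, round(diff_k, 6))
print(ratio_c, round(ratio_k, 3))
```

The Celsius ratio comes out as 2, but on the kelvin scale the second day is only about 3.5% warmer, which is why "twice as hot" is not a valid statement for interval-scaled values.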

Quantitative Attributes

 Discrete: A discrete attribute has a finite or countably infinite set of values. The values can be numerical or categorical.

Quantitative Attributes

 Continuous: Continuous data have an infinite number of possible values and are typically represented as floating-point numbers. For example, there are infinitely many values between 2 and 3.

Business Use Cases
 Product Recommendation

 Customer Churn Analysis

 Customer Segmentation

 Sales Leads Prioritization

 Sentiment Analysis

 Fraud Detection

 Predictive Maintenance

 Market Basket Analysis

 Predictive Medical Diagnosis

 Predicting Patient Re-admission

 Detecting Anomalous Record Access

 Insurance Risk Analysis

 Predicting Oil and Gas Well Production Levels
