BIG DATA ANALYTICS
Lecture 2 --- Week 2
Content
BDA system architecture
Techniques towards Big Data
Types of data (Transaction data, Document data, Network data, Genomic
sequences, Medical data, Environmental data, Behavioral data)
Types of attributes
Big Data business use cases
BDA system architecture
Specialized services for domain A | Specialized services for domain B
Big Data Services Layer
Knowledge Management Layer
Data Storage and Management Layer
Data Storage and Management Layer
Large amounts of data, distributed environment
Unstructured and semi-structured data
Not necessarily a schema
Heterogeneous
Streams
Varying quality
BDA system architecture
A big data architecture is designed to handle the ingestion, processing, and
analysis of data that is too large or complex for traditional database systems.
Big data solutions typically involve one or more of the following types of
workload:
Batch processing of big data sources at rest.
Real-time processing of big data in motion.
Interactive exploration of big data.
Predictive analytics and machine learning.
Most big data architectures include some or all of the following components:
Data sources: All big data solutions start with one or more data sources.
Examples include:
Application data stores, such as relational databases.
Static files produced by applications, such as web server log files.
Real-time data sources, such as IoT devices.
Data storage: Data for batch processing operations is typically stored in a
distributed file store that can hold high volumes of large files in various
formats. This kind of store is often called a data lake. Options for
implementing this storage include Azure Data Lake Store or blob containers in
Azure Storage.
Batch processing: Because the data sets are so large, often a big data solution must
process data files using long-running batch jobs to filter, aggregate, and otherwise
prepare the data for analysis. Usually these jobs involve reading source files, processing
them, and writing the output to new files. Options include running U-SQL jobs in Azure
Data Lake Analytics, using Hive, Pig, or custom Map/Reduce jobs in an HDInsight Hadoop
cluster, or using Java, Scala, or Python programs in an HDInsight Spark cluster.
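The read, process, write pattern of a batch job can be sketched in plain Python. This is a minimal single-machine illustration, not a Hadoop or Spark job; the log layout, the field names `path` and `status`, and the helper `run_batch_job` are assumptions made for the example.

```python
import csv
import io
from collections import Counter


def run_batch_job(source, sink):
    """Read log records, keep successful requests, and count hits per path."""
    hits = Counter()
    for row in csv.DictReader(source):
        # Filter: keep only successful requests.
        if row["status"] == "200":
            # Aggregate: count requests per path.
            hits[row["path"]] += 1
    # Write the prepared result to a new output file.
    writer = csv.writer(sink)
    writer.writerow(["path", "hits"])
    for path, count in sorted(hits.items()):
        writer.writerow([path, count])
    return hits


# In-memory stand-ins for a source file and an output file (hypothetical data).
source = io.StringIO("path,status\n/home,200\n/home,200\n/login,500\n/about,200\n")
sink = io.StringIO()
batch_hits = run_batch_job(source, sink)
```

In a real deployment the same three steps would run over many files in a distributed store, with the framework splitting the work across nodes.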
Real-time message ingestion: If the solution includes real-time sources, the architecture
must include a way to capture and store real-time messages for stream processing. This
might be a simple data store, where incoming messages are dropped into a folder for
processing. However, many solutions need a message ingestion store to act as a buffer for
messages, and to support scale-out processing, reliable delivery, and other message
queuing semantics. Options include Azure Event Hubs, Azure IoT Hub, and Kafka.
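The buffering role of a message ingestion store can be illustrated with an in-process queue. This is only a sketch: Python's `queue.Queue` stands in for a service such as Event Hubs or Kafka and offers none of their durability or delivery guarantees. The message fields and the `producer` and `consumer` helpers are hypothetical.

```python
import queue
import threading


def producer(buffer, messages):
    """Simulate a real-time source pushing messages into the buffer."""
    for msg in messages:
        buffer.put(msg)   # blocks while the buffer is full (back-pressure)
    buffer.put(None)      # sentinel: no more messages


def consumer(buffer, received):
    """Drain the buffer until the sentinel is seen."""
    while True:
        msg = buffer.get()
        if msg is None:
            break
        received.append(msg)


buffer = queue.Queue(maxsize=2)   # deliberately small to exercise back-pressure
received = []
messages = [{"device": "sensor-1", "reading": r} for r in (20.5, 21.0, 21.4, 22.1)]

t_consumer = threading.Thread(target=consumer, args=(buffer, received))
t_producer = threading.Thread(target=producer, args=(buffer, messages))
t_consumer.start()
t_producer.start()
t_producer.join()
t_consumer.join()
```

The producer and consumer never call each other directly; the buffer decouples their speeds, which is the property the ingestion store provides at scale.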
Stream processing: After capturing real-time messages, the solution must process them by filtering,
aggregating, and otherwise preparing the data for analysis. The processed stream data is then written
to an output sink. Azure Stream Analytics provides a managed stream processing service based on
perpetually running SQL queries that operate on unbounded streams. You can also use open source
Apache streaming technologies like Storm and Spark Streaming in an HDInsight cluster.
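Filtering and aggregating over a stream can be sketched with a Python generator that averages readings per fixed (tumbling) time window. This is an illustration under stated assumptions, not how Stream Analytics or Spark Streaming work internally: events are assumed to arrive in timestamp order as `(timestamp_seconds, value)` pairs, and `tumbling_window_avg` is a hypothetical helper.

```python
def tumbling_window_avg(events, window_seconds):
    """Yield (window_start, average) for each window of `window_seconds`.

    `events` is an iterable of (timestamp_seconds, value) pairs in
    timestamp order; pairs whose value is None are filtered out.
    """
    current_window = None
    total, count = 0.0, 0
    for ts, value in events:
        if value is None:          # filter: drop missing readings
            continue
        window = (ts // window_seconds) * window_seconds
        if current_window is not None and window != current_window:
            # The previous window is complete: emit it to the output sink.
            yield current_window, total / count
            total, count = 0.0, 0
        current_window = window
        total += value
        count += 1
    if count:
        # Only reached when the input ends; an unbounded stream never gets here.
        yield current_window, total / count


events = [(0, 10.0), (3, 20.0), (5, None), (11, 30.0), (14, 50.0), (21, 5.0)]
windows = list(tumbling_window_avg(events, window_seconds=10))
```

Because the function is a generator, it processes one event at a time and holds only the current window in memory, which is what makes the approach usable on streams with no defined end.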
Analytical data store: Many big data solutions prepare data for analysis and then serve the processed
data in a structured format that can be queried using analytical tools. The analytical data store used
to serve these queries can be a Kimball-style relational data warehouse, as seen in most traditional
business intelligence (BI) solutions. Alternatively, the data could be presented through a low-latency
NoSQL technology such as HBase, or an interactive Hive database that provides a metadata abstraction
over data files in the distributed data store. Azure Synapse Analytics provides a managed service for
large-scale, cloud-based data warehousing. HDInsight supports Interactive Hive, HBase, and Spark SQL,
which can also be used to serve data for analysis.
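As a small-scale illustration of serving processed data in a structured, queryable form, the sketch below builds a tiny star schema (one fact table, one dimension table) and runs an analytical query against it. SQLite from the Python standard library stands in for a data warehouse here; the table names, column names, and rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        category   TEXT
    );
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        sale_date  TEXT,
        amount     REAL
    );
    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
    INSERT INTO fact_sales VALUES
        (1, '2024-01-01', 12.5),
        (1, '2024-01-02', 7.5),
        (2, '2024-01-01', 30.0);
""")

# Analytical query: total sales per product category.
sales_by_category = conn.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_id = f.product_id
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()
conn.close()
```

The fact table holds the measured events and the dimension table holds the descriptive context, so reporting tools can group and filter without touching the raw source files.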
Analysis and reporting: The goal of most big data solutions is to provide insights into the data
through analysis and reporting. To empower users to analyze the data, the architecture may
include a data modeling layer, such as a multidimensional OLAP cube or tabular data model in
Azure Analysis Services. It might also support self-service BI, using the modeling and visualization
technologies in Microsoft Power BI or Microsoft Excel. Analysis and reporting can also take the form
of interactive data exploration by data scientists or data analysts. For these scenarios, many Azure
services support analytical notebooks, such as Jupyter, enabling these users to leverage their
existing skills with Python or R. For large-scale data exploration, you can use Microsoft R Server,
either standalone or with Spark.
Orchestration: Most big data solutions consist of repeated data processing operations,
encapsulated in workflows, that transform source data, move data between multiple sources and
sinks, load the processed data into an analytical data store, or push the results straight to a report
or dashboard. To automate these workflows, you can use an orchestration technology such as Azure
Data Factory or Apache Oozie and Sqoop.
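The core idea of orchestration, running dependent steps in a valid order, can be sketched with the standard library's `graphlib` (Python 3.9 or later). This is a toy model rather than a picture of Data Factory or Oozie; the task names and the `run_workflow` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
workflow = {
    "ingest": set(),
    "transform": {"ingest"},
    "load_warehouse": {"transform"},
    "refresh_dashboard": {"load_warehouse"},
}


def run_workflow(workflow, tasks):
    """Run every task after all of its prerequisites; return the run order."""
    executed = []
    for name in TopologicalSorter(workflow).static_order():
        tasks[name]()
        executed.append(name)
    return executed


log = []
# Stand-in task bodies that only record that they ran.
tasks = {name: (lambda n=name: log.append(n)) for name in workflow}
order = run_workflow(workflow, tasks)
```

Production orchestrators add scheduling, retries, and monitoring on top of this dependency ordering.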
When to use this architecture
Consider this architecture style when you need to:
Store and process data in volumes too large for a traditional database.
Transform unstructured data for analysis and reporting.
Capture, process, and analyze unbounded streams of data in real time, or
with low latency.
Use Azure Machine Learning or Microsoft Cognitive Services.
Techniques towards Big Data
Massive Parallelism
Huge Data Volumes Storage
Data Distribution
High-Speed Networks
High-Performance Computing
Task and Thread Management
Data Mining and Analytics
Data Retrieval
Machine Learning
Data Visualization
Types of data
Transaction data is data describing an event (the change as a result of
a transaction) and is usually described with verbs. Transaction data always has
a time dimension, a numerical value and refers to one or more objects (i.e.
the reference data).
Typical transactions are:
Financial: orders, invoices, payments
Work: plans, activity records
Logistics: deliveries, storage records, travel records, etc.
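The three properties of transaction data (a time dimension, a numerical value, and references to one or more objects) can be made concrete with a small record type. The class and field names below are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal


@dataclass(frozen=True)
class Transaction:
    occurred_at: datetime   # time dimension
    action: str             # the verb describing the event, e.g. "pay"
    amount: Decimal         # numerical value
    customer_id: str        # reference to an object (reference data)
    order_id: str           # reference to another object


transactions = [
    Transaction(datetime(2024, 3, 1, 9, 30), "order", Decimal("49.90"), "C-001", "O-100"),
    Transaction(datetime(2024, 3, 1, 9, 45), "pay", Decimal("49.90"), "C-001", "O-100"),
    Transaction(datetime(2024, 3, 2, 14, 0), "pay", Decimal("15.00"), "C-002", "O-101"),
]

# Total value of all payment events.
total_paid = sum((t.amount for t in transactions if t.action == "pay"), Decimal("0"))
```

The customer and order themselves would live in separate reference data; the transaction only points at them by identifier.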
A key aspect of document data is its metadata, in other words “data about
data”. Metadata are characteristics describing the data, which facilitate
cataloguing and discovery of the data. When you deposit your data in a trusted
data repository, the repository generates machine-readable metadata.
Network data is information processed or stored by a computer. This
information may be in the form of text documents, images, audio clips, software
programs, or other types of data. Such data can be transferred from one
computer to another over a network connection or on various media devices.
Genomic sequencing involves a host of decisions on the part of the provider
and client, including whether to be tested, when and how to receive
sequencing results, whether to inform biological relatives, and whether to
take protective actions to reduce risk.
Medical (clinical) data refers to health-related information that is associated
with regular patient care or collected as part of a clinical trial program.
Environmental data is data based on the measurement of
environmental pressures, the state of the environment, and the impacts on
ecosystems. All data generated by the execution of environmental law is also
considered environmental data.
Behavioral data refers to information produced as a result of actions,
typically commercial behavior using a range of devices connected to the
Internet, such as a PC, tablet, or smartphone. Behavioral data tracks the sites
visited, the apps downloaded, or the games played.
Attribute
An attribute can be seen as a data field that represents a characteristic or
feature of a data object. For a customer object, attributes can be customer ID,
address, etc. The set of attributes used to describe a given
object is known as an attribute vector or feature vector.
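A minimal sketch of an attribute vector: the customer record, its attribute names, and the chosen order are all hypothetical.

```python
# One data object (a customer) described by named attributes.
customer = {"customer_id": "C-001", "age": 34, "city": "Oslo", "is_member": True}

# Fixing the attribute order turns the object into a feature vector.
attribute_names = ["age", "city", "is_member"]
feature_vector = [customer[name] for name in attribute_names]
```

Every object in a data set is described with the same attribute order, so position `i` in each vector always refers to the same attribute.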
Types of attributes
Qualitative (Nominal (N), Ordinal (O), Binary (B))
Quantitative (Numeric, Discrete, Continuous)
Qualitative Attributes
Nominal Attributes (related to names): The values of a nominal attribute are
names of things or symbols. Each value represents a category or state,
which is why nominal attributes are also
referred to as categorical attributes. There is no order (rank, position)
among the values of a nominal attribute.
Example :
Qualitative Attributes
Binary Attributes: A binary attribute has only two values or states, for example yes or
no, affected or unaffected, true or false.
Symmetric: Both values are equally important (e.g., gender).
Asymmetric: The two values are not equally important (e.g., a test result).
Qualitative Attributes
Ordinal Attributes: An ordinal attribute has values with a
meaningful sequence or ranking (order) between them, but the magnitude
of the difference between values is not known. The order shows which value
ranks higher but does not indicate by how much.
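The sketch below shows what an ordinal scale supports: sorting and picking a middle value are meaningful, while arithmetic on the ranks is not. The three-level scale and the responses are invented for the example.

```python
# Hypothetical ordinal scale, listed from lowest to highest.
LEVELS = ["low", "medium", "high"]
rank = {level: position for position, level in enumerate(LEVELS)}

responses = ["high", "low", "medium", "medium", "high"]

# Ordering is meaningful, so sorting and taking the median are valid.
ordered = sorted(responses, key=rank.__getitem__)
median_level = ordered[len(ordered) // 2]

# The ranks 0, 1, 2 only encode order; the gap between "low" and "medium"
# is not known to equal the gap between "medium" and "high", so averaging
# the rank numbers would not be a meaningful statistic.
```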
Quantitative Attributes
Numeric: A numeric attribute is quantitative because it is a measurable quantity,
represented in integer or real values. Numeric attributes are of two types: interval-scaled
and ratio-scaled.
An interval-scaled attribute has values whose differences are interpretable, but
it has no true reference point, or zero point.
Values on an interval scale can be added and subtracted but cannot be meaningfully multiplied or
divided. Consider temperature in degrees Celsius: if one day's
temperature is twice that of another day, we cannot say that the first day is twice as
hot as the other.
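A short worked example of why ratios fail on an interval scale: the same two temperatures give a different ratio once they are expressed in kelvin, which has a true zero, while their difference stays the same. The temperatures are made up; the conversion K = °C + 273.15 is standard.

```python
def celsius_to_kelvin(celsius):
    """Convert degrees Celsius to kelvin."""
    return celsius + 273.15


day_a, day_b = 10.0, 20.0   # hypothetical daily temperatures in degrees Celsius

# The ratio depends on where the scale puts its zero.
celsius_ratio = day_b / day_a
kelvin_ratio = celsius_to_kelvin(day_b) / celsius_to_kelvin(day_a)

# The difference does not depend on the zero point.
celsius_diff = day_b - day_a
kelvin_diff = celsius_to_kelvin(day_b) - celsius_to_kelvin(day_a)
```

The Celsius ratio is 2, but in kelvin the warmer day is only about 3.5% higher, so "twice as hot" is an artifact of the arbitrary zero.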
A ratio-scaled attribute is a numeric attribute with a fixed zero point. If a measurement is
ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value. The
values are ordered, we can compute the difference between values, and the
mean, median, mode, interquartile range, and five-number summary can be given.
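The summary statistics named above can be computed with Python's `statistics` module. The data values are invented, and the quartiles use the module's "inclusive" method, which is one of several common quartile conventions, so other tools may report slightly different quartiles.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a sequence of numbers."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


# Hypothetical ratio-scaled measurements (e.g. weights in kilograms).
weights = [20, 30, 30, 40, 50, 60, 70]

summary = five_number_summary(weights)
interquartile_range = summary[3] - summary[1]
mean_weight = statistics.mean(weights)
mode_weight = statistics.mode(weights)

# On a ratio scale, multiples are meaningful: 60 kg is twice 30 kg.
ratio = weights[5] / weights[1]
```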
Quantitative Attributes
Discrete: A discrete attribute has a finite or countably infinite set of
values. The values can be numerical or
categorical.
Example:
Quantitative Attributes
Continuous: Continuous data have an infinite number of possible states and are
typically stored as floating-point values. There are infinitely many values between 2 and 3.
Example :
Business Use Cases
Product Recommendation
Customer Churn Analysis
Customer Segmentation
Sales Leads Prioritization
Sentiment Analysis
Fraud Detection
Predictive Maintenance
Market Basket Analysis
Predictive Medical Diagnosis
Predicting Patient Re-admission
Detecting Anomalous Record Access
Insurance Risk Analysis
Predicting Oil and Gas Well Production Levels