QUANTUM Series
* Topic-wise coverage of entire syllabus in Question-Answer form.
* Short Questions (2 Marks)

QUANTUM SERIES
For
B.Tech Students of Third Year
of All Engineering Colleges Affiliated to
Dr. A.P.J. Abdul Kalam Technical University,
Uttar Pradesh, Lucknow
(Formerly Uttar Pradesh Technical University)
Data Analytics
By
Aditya Kumar
QUANTUM PAGE PVT. LTD.
Ghaziabad | New Delhi

PUBLISHED BY : Apram Singh
Quantum Publications®
(A Unit of Quantum Page Pvt. Ltd.)
Plot No. 59/2/7, Site - 4, Industrial Area,
Sahibabad, Ghaziabad-201 010
Phone : 0120- 4160479
Email: pagequantum@gmail.com Website: www.quantumpage.co.in
Delhi Office : 1/6590, East Rohtas Nagar, Shahdara, Delhi-110032
© All Rights Reserved
No part of this publication may be reproduced or transmitted, in any form or by any means, without permission.
Information contained in this work is derived from sources
believed to be reliable. Every effort has been made to ensure
accuracy; however, neither the publisher nor the authors
guarantee the accuracy or completeness of any information
published herein, and neither the publisher nor the authors
shall be responsible for any errors, omissions, or damages
arising out of use of this information.
Data Analytics (CS : Sem-5 and IT : Sem-6)
1st Edition : 2020-21
Price: Rs. 55/- only
Printed Version : e-Book.

UNIT-1 : INTRODUCTION TO DATA ANALYTICS (1-1 J to 1-20 J)
Introduction to Data Analytics: Sources and nature of data,
classification of data (structured, semi-structured, unstructured),
characteristics of data, introduction to Big Data platform, need of
data analytics, evolution of analytic scalability, analytic process
and tools, analysis vs reporting, modern data analytic tools,
applications of data analytics. Data Analytics Lifecycle : Need, key
roles for successful analytic projects, various phases of data analytics
lifecycle — discovery, data preparation, model planning, model
building, communicating results, operationalization
UNIT-2 : DATA ANALYSIS (2-1 J to 2-28 J)
Regression modeling, multivariate analysis, Bayesian modeling,
inference and Bayesian networks, support vector and kernel methods,
analysis of time series: linear systems analysis & nonlinear dynamics,
rule induction, neural networks: learning and generalisation,
competitive learning, principal component analysis and neural
networks, fuzzy logic: extracting fuzzy models from data, fuzzy
decision trees, stochastic search methods
UNIT-3 : MINING DATA STREAMS (3-1 J to 3-20 J)
Introduction to streams concepts, stream data model and
architecture, stream computing, sampling data in a stream, filtering
streams, counting distinct elements in a stream, estimating moments,
counting oneness in a window, decaying window, Real-time
Analytics Platform (RTAP) applications, Case studies — real time
sentiment analysis, stock market predictions.
UNIT-4 : FREQUENT ITEMSETS & CLUSTERING (4-1 J to 4-28 J)
Mining frequent itemsets, market based modelling, Apriori algorithm,
handling large data sets in main memory, limited pass algorithm,
counting frequent itemsets in a stream, clustering techniques:
hierarchical, K-means, clustering high dimensional data, CLIQUE
and ProCLUS, frequent pattern based clustering methods, clustering
in non-euclidean space, clustering for streams & parallelism.
UNIT-5 : FRAME WORKS & VISUALIZATION (5-1 J to 5-30 J)
Frame Works and Visualization: MapReduce, Hadoop, Pig, Hive,
HBase, MapR, Sharding, NoSQL Databases, S3, Hadoop Distributed
File Systems, Visualization: visual data analysis techniques,
interaction techniques, systems and applications. Introduction to R
-R graphical user interfaces, data import and export, attribute and
data types, descriptive statistics, exploratory data analysis,
visualization before analysis, analytics for unstructured data
SHORT QUESTIONS (SQ-1 J to SQ-15 J)
QUANTUM Series

For Semester - 5
(Computer Science & Engineering / Information Technology)

* Database Management System
* Design and Analysis of Algorithm
* Compiler Design
* Web Technology

Departmental Elective-I
* Data Analytics
* Computer Graphics
* Object Oriented System Design

Departmental Elective-II
* Machine Learning Techniques
* Application of Soft Computing
* Human Computer Interface

Common Non Credit Course (NC)
* Constitution of India, Law & Engineering
* Indian Tradition, Culture & Society

Quantum Series is the complete one-stop solution for engineering students looking for a simple yet effective guidance system for core engineering subjects. Based on the needs of students and catering to the requirements of the syllabi, this series uniquely addresses the way in which concepts are tested through university examinations. The easy to comprehend question-answer form adhered to by the books in this series is suitable and recommended for students. The students are able to effortlessly grasp the concepts and ideas discussed in their course books with the help of this series. The solved question papers of previous years act as an additional advantage for students to comprehend the paper pattern, and thus anticipate and prepare for examinations accordingly. The coherent manner in which the books in this series present new ideas and concepts to students makes this series play an essential role in the preparation for university examinations. The detailed and comprehensive discussions, easy to understand examples, objective questions and ample exercises all aid the students to understand everything in an all-inclusive manner.

* Topic-wise coverage in Question-Answer form.
* The perfect assistance for scoring good marks.
* Clears course fundamentals.
* Good for brush up before exams.
* Includes solved question papers.
* Ideal for self-study.

Quantum Publications
(A Unit of Quantum Page Pvt. Ltd.)
Plot No. 59/2/7, Site-4, Industrial Area, Sahibabad, Ghaziabad, 201010 (U.P.)
Phone : 0120-4160479
E-mail : pagequantum@gmail.com  Web : www.quantumpage.co.in
Find us on : facebook.com/quantumseriesofficial

Data Analytics (KCS-051)
Course Outcome (CO) and Bloom's Knowledge Level (KL)

At the end of the course, the student will be able to :

CO1 : Describe the life cycle phases of Data Analytics through discovery, planning and building. (K1, K2)
CO2 : Understand and apply Data Analysis Techniques. (K2, K3)
CO3 : Implement various Data streams. (K3)
CO4 : Understand item sets, Clustering, frame works & Visualizations. (K2)
CO5 : Apply R tool for developing and evaluating real time applications. (K3, K5, K6)
DETAILED SYLLABUS (3-0-0)

Unit I (08 Lectures)
Introduction to Data Analytics : Sources and nature of data, classification of data (structured, semi-structured, unstructured), characteristics of data, introduction to Big Data platform, need of data analytics, evolution of analytic scalability, analytic process and tools, analysis vs reporting, modern data analytic tools, applications of data analytics.
Data Analytics Lifecycle : Need, key roles for successful analytic projects, various phases of data analytics lifecycle - discovery, data preparation, model planning, model building, communicating results, operationalization.

Unit II (08 Lectures)
Data Analysis : Regression modeling, multivariate analysis, Bayesian modeling, inference and Bayesian networks, support vector and kernel methods, analysis of time series : linear systems analysis & nonlinear dynamics, rule induction, neural networks : learning and generalisation, competitive learning, principal component analysis and neural networks, fuzzy logic : extracting fuzzy models from data, fuzzy decision trees, stochastic search methods.

Unit III (08 Lectures)
Mining Data Streams : Introduction to streams concepts, stream data model and architecture, stream computing, sampling data in a stream, filtering streams, counting distinct elements in a stream, estimating moments, counting oneness in a window, decaying window, Real-time Analytics Platform (RTAP) applications, Case studies - real time sentiment analysis, stock market predictions.

Unit IV (08 Lectures)
Frequent Itemsets and Clustering : Mining frequent itemsets, market based modelling, Apriori algorithm, handling large data sets in main memory, limited pass algorithm, counting frequent itemsets in a stream, clustering techniques : hierarchical, K-means, clustering high dimensional data, CLIQUE and ProCLUS, frequent pattern based clustering methods, clustering in non-euclidean space, clustering for streams and parallelism.

Unit V (08 Lectures)
Frame Works and Visualization : MapReduce, Hadoop, Pig, Hive, HBase, MapR, Sharding, NoSQL Databases, S3, Hadoop Distributed File Systems, Visualization : visual data analysis techniques, interaction techniques, systems and applications.
Introduction to R : R graphical user interfaces, data import and export, attribute and data types, descriptive statistics, exploratory data analysis, visualization before analysis, analytics for unstructured data.
Text books and References :
1. Michael Berthold, David J. Hand, "Intelligent Data Analysis", Springer.
2. Anand Rajaraman and Jeffrey David Ullman, "Mining of Massive Datasets", Cambridge University Press.
3. Bill Franks, "Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics", John Wiley & Sons.
4. Michael Minelli, Michelle Chambers, and Ambiga Dhiraj, "Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses", Wiley.
5. David Dietrich, Barry Heller, Beibei Yang, "Data Science and Big Data Analytics", EMC Education Series, John Wiley.
6. Frank J. Ohlhorst, "Big Data Analytics: Turning Big Data into Big Money", Wiley and SAS Business Series.
7. Colleen McCue, "Data Mining and Predictive Analysis: Intelligence Gathering and Crime Analysis", Elsevier.
8. Paul Zikopoulos, Chris Eaton, "Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data", McGraw Hill.
9. Trevor Hastie, Robert Tibshirani, Jerome Friedman, "The Elements of Statistical Learning", Springer.
10. Mark Gardener, "Beginning R: The Statistical Programming Language", Wrox Publication.
11. Pete Warden, "Big Data Glossary", O'Reilly.
12. Glenn J. Myatt, "Making Sense of Data", John Wiley & Sons.
13. Peter Bühlmann, Petros Drineas, Michael Kane, Mark van der Laan, "Handbook of Big Data", CRC Press.
14. Jiawei Han, Micheline Kamber, "Data Mining Concepts and Techniques", Second Edition, Elsevier.
UNIT
Introduction to
Data Analytics
CONTENTS

Part-1 : Introduction of Data Analytics : Sources and Nature of Data, Classification of Data (Structured, Semi-Structured, Unstructured), Characteristics of Data ............ 1-2J to 1-5J
Part-2 : Introduction to Big Data Platform, Need of Data Analytics ............ 1-5J to 1-6J
Part-3 : Evolution of Analytic Scalability, Analytic Process and Tools, Analysis Vs Reporting, Modern Data Analytic Tools, Applications of Data Analysis ............ 1-6J to 1-13J
Part-4 : Data Analytics Lifecycle : Need, Key Roles for Successful Analytic Projects, Various Phases of Data Analytic Life Cycle : Discovery, Data Preparation ............ 1-13J to 1-17J
Part-5 : Model Planning, Model Building, Communicating Results, Operationalization ............ 1-17J to 1-20J
PART-1
Introduction to Data Analytics : Sources and Nature of Data,
Classification of Data (Structured, Semi-Structured, Unstructured),
Characteristics of Data.
Long Answer Type and Medium Answer Type Questions
Que 1.1. | What is data analytics ?
Answer
1. Data analytics is the science of analyzing raw data in order to make
conclusions about that information.
2.
Any type of information can be subjected to data analytics techniques to
get insight that can be used to improve things.
3. Data analytics techniques can help in finding the trends and metrics
that would be used to optimize processes to increase the overall efficiency
of a business or system.
4. Many of the techniques and processes of data analytics have been
automated into mechanical processes and algorithms that work over
raw data for human consumption.
5. For example, manufacturing companies often record the runtime,
downtime, and work queue for various machines and then analyze the
data to better plan the workloads so the machines operate closer to peak
capacity.
Que 1.2. | Explain the source of data (or Big Data).
Answer
Three primary sources of Big Data are:
1. Social data:
a, Social data comes from the likes, tweets & retweets, comments,
video uploads, and general media that are uploaded and shared via
social media platforms.
b. This kind of data provides invaluable insights into consumer
behaviour and sentiment and can be enormously influential in
marketing analytics.
c.
The public web is another good source of social data, and tools like
Google trends can be used to good effect to increase the volume of
big data.
2. Machine data :
a. Machine data is defined as information which is generated by industrial equipment, sensors that are installed in machinery, and even web logs which track user behaviour.
b. This type of data is expected to grow exponentially as the Internet of Things grows ever more pervasive and expands around the world.
c. Sensors such as medical devices, smart meters, road cameras, satellites, games and the rapidly growing Internet of Things will deliver high velocity, value, volume and variety of data in the very near future.
3. Transactional data :
a. Transactional data is generated from all the daily transactions that take place both online and offline.
b. Invoices, payment orders, storage records, delivery receipts are characterized as transactional data.
Que 1.3. | Write short notes on classification of data.
Answer
1. Unstructured data :
a. Unstructured data is the rawest form of data.
b. Data that has no inherent structure, which may include text documents, PDFs, images, and video.
c. This data is often stored in a repository of files.
2. Structured data :
a. Structured data is tabular data (rows and columns) which is very well defined.
b. Data containing a defined data type, format, and structure, which may include transaction data, traditional RDBMS, CSV files, and simple spreadsheets.
3. Semi-structured data :
a. Textual data files with a distinct pattern that enables parsing, such as Extensible Markup Language (XML) data files or JSON.
b. A consistent format is defined; however, the structure is not very strict.
c. Semi-structured data are often stored as files.
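As a small illustration of these three classes, the sketch below reads the same kind of customer information from a structured CSV table, a semi-structured JSON document, and an unstructured free-text note, using only the Python standard library; the field names and sample values are assumptions made for the example.

```python
import csv
import json
import io

# Structured: tabular rows and columns with a fixed schema (CSV).
csv_text = "id,name,spend\n1,Asha,250\n2,Ravi,180\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["name"], rows[0]["spend"])      # fields are well defined

# Semi-structured: a distinct, parseable pattern (JSON) but a loose schema.
json_text = '{"id": 1, "name": "Asha", "tags": ["loyal", "online"]}'
record = json.loads(json_text)
print(record.get("tags", []))                 # optional fields may be absent

# Unstructured: raw text with no inherent structure; needs further processing.
note = "Customer called on Monday, unhappy with late delivery."
tokens = note.lower().split()                 # rudimentary text processing
print(len(tokens), "tokens")
```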
Que 1.4. | Differentiate between structured, semi-structured and unstructured data.

Answer

Properties | Structured data | Semi-structured data | Unstructured data
Technology | It is based on relational database tables. | It is based on XML/RDF. | It is based on character and binary data.
Transaction management | Matured transaction and various concurrency techniques. | Transaction is adapted from DBMS. | No transaction management and no concurrency.
Flexibility | It is schema dependent and less flexible. | It is more flexible than structured data but less flexible than unstructured data. | It is very flexible and there is absence of schema.
Scalability | It is very difficult to scale database schema. | It is more scalable than structured data. | It is very scalable.
Query performance | Structured query allows complex joining. | Queries over anonymous nodes are possible. | Only textual queries are possible.
Que 1.5. | Explain the characteristics of Big Data.
Answer
Big Data is characterized into four dimensions :
1. Volume:
a. Volume is concerned about scale of data i.e., the volume of the data
at which it is growing.
b. The volume of data is growing rapidly, due to several applications of
business, social, web and scientific explorations.
2. Velocity:
a. The speed at which data is increasing thus demanding analysis of
streaming data.
b. The velocity is due to growing speed of business intelligence
applications such as trading, transaction of telecom and banking
domain, growing number of internet connections with the increased
usage of internet etc.
3. Variety : It depicts different forms of data to use for analysis such as structured, semi-structured and unstructured.
4. Veracity :
a. Veracity is concerned with uncertainty or inaccuracy of the data.
b. In many cases the data will be inaccurate, hence filtering and selecting the data which is actually needed is a complicated task.
c. A lot of statistical and analytical processing has to go into data cleansing
for choosing intrinsic data for decision making.
PART-2
Introduction to Big Data Platform, Need of Data Analytics.
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Que 1.6. | Write short note on big data platform.
Answer
1. Big data platform is a type of IT solution that combines the features and capabilities of several big data applications and utilities within a single solution.
2. It is an enterprise class IT platform that enables organizations in developing, deploying, operating and managing a big data infrastructure/environment.
3. Big data platform generally consists of big data storage, servers, database, big data management, business intelligence and other big data management utilities.
4. It also supports custom development, querying and integration with other systems.
5. The primary benefit behind a big data platform is to reduce the complexity of multiple vendors/solutions into one cohesive solution.
6. Big data platforms are also delivered through the cloud, where the provider provides all-inclusive big data solutions and services.
Que 1.7. | What are the features of big data platform ?
Answer
Features of Big Data analytics platform :
1. Big Data platform should be able to accommodate new platforms and tools based on the business requirement.
2. It should support linear scale-out.
3. It should have capability for rapid deployment.
4. It should support variety of data formats.
5. Platform should provide data analysis and reporting tools.
6. It should provide real-time data analysis software.
7. It should have tools for searching the data through large data sets.
Que 1.8. | Why there is need of data analytics ?
Answer
Need of data analytics :
1. It optimizes the business performance.
2. It helps to make better decisions.
3. It helps to analyze customer trends and solutions.
PART-3
Evolution of Analytic Scalability, Analytic Process and Tools,
Analysis vs Reporting, Modern Data Analytic Tools, Applications
of Data Analysis.
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Que 1.9. | What are the steps involved in data analysis ?
Answer
Steps involved in data analysis are:
1. Determine the data:
a. The first step is to determine the data requirements or how the
data is grouped.
b. Data may be separated by age, demographic, income, or gender.
c. Data values may be numerical or be divided by category.
2. Collection of data :
a. The second step in data analytics is the process of collecting it.
b. This can be done through a variety of sources such as computers,
online sources, cameras, environmental sources, or through
personnel.
3. Organization of data :
a. Third step is to organize the data.
b. Once the data is collected, it must be organized so it can be analyzed.
c. Organization may take place on a spreadsheet or other form of software that can take statistical data.
4. Cleaning of data :
a. In the fourth step, the data is then cleaned up before analysis.
b. This means it is scrubbed and checked to ensure there is no duplication or error, and that it is not incomplete.
c. This step helps correct any errors before it goes on to a data analyst
to be analyzed.
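The organization and cleaning steps above can be illustrated with a short, hypothetical pandas sketch; the column names, the -1 sentinel value and the use of pandas itself are assumptions made for the example, not part of the original text.

```python
import pandas as pd
import numpy as np

# Hypothetical survey data illustrating the organize / clean steps above.
raw = pd.DataFrame({
    "age":    [25, 32, 32, np.nan, 41],
    "gender": ["F", "M", "M", "F", None],
    "income": [42000, 58000, 58000, 61000, -1],   # -1 used as a bad sentinel
})

# Organization: the data already sits in a tabular (spreadsheet-like) structure.
# Cleaning: drop exact duplicates, flag sentinels as missing, drop incomplete rows.
clean = (
    raw.drop_duplicates()
       .replace({"income": {-1: np.nan}})
       .dropna()
       .reset_index(drop=True)
)
print(clean)
```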
Que 1.10. | Write short note on evolution of analytics scalability.
Answer
1. In analytic scalability, we have to pull the data together in a separate analytics environment and then start performing analysis.

[Fig. : The data sets are pulled into a separate analytic environment (an analytic server or PC), where the heavy processing occurs.]

2. Analysts do the merge operation on the data sets which contain rows and columns.
3. The columns represent information about the customers such as name, spending level, or status.
4. In merge or join, two or more data sets are combined together. They are typically merged/joined so that specific rows of one data set or table are combined with specific rows of another.
5. Analysts also do data preparation. Data preparation is made up of
joins, aggregations, derivations, and transformations. In this process,
they pull data from various sources and merge it all together to create
the variables required for an analysis.
6. Massively Parallel Processing (MPP) system is the most mature, proven,
and widely deployed mechanism for storing and analyzing large
amounts of data.
7. An MPP database breaks the data into independent pieces managed by independent storage and central processing unit (CPU) resources.
[Fig. 1.10.1 : Massively Parallel Processing system data storage - a 1 terabyte table is split into ten 100 GB chunks, so that instead of a traditional database querying one terabyte one row at a time, 10 simultaneous 100-GB queries can run.]
8. MPP systems build in redundancy to make recovery easy.
9. MPP systems have resource management tools :
a. Manage the CPU and disk space
b. Query optimizer
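The merge/join and data preparation steps described in points 2-5 can be sketched with pandas as below; the customer and order tables, the column names and the derived high_value flag are illustrative assumptions, not material from the text.

```python
import pandas as pd

# Two hypothetical source tables, as in the merge / join step described above.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Asha", "Ravi", "Meena"],
    "status": ["gold", "silver", "gold"],
})
orders = pd.DataFrame({
    "customer_id": [1, 1, 3],
    "amount": [250.0, 90.0, 410.0],
})

# Join specific rows of one table with matching rows of the other,
# then aggregate and derive a new variable for analysis (data preparation).
prepared = (
    orders.merge(customers, on="customer_id", how="left")
          .groupby(["customer_id", "name", "status"], as_index=False)["amount"]
          .sum()
          .rename(columns={"amount": "total_spend"})
)
prepared["high_value"] = prepared["total_spend"] > 300
print(prepared)
```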
Que 1.11. | Write short notes on evolution of analytic process.
Answer
1. With increased level of scalability, it needs to update analytic processes
to take advantage of it.
2.
This can be achieved with the use of analytical sandboxes to provide
analytic professionals with a scalable environment to build advanced
analytics processes.
3. One of the uses of MPP database system is to facilitate the building and
deployment of advanced analytic processes.
4. An analytic sandbox is the mechanism to utilize an enterprise data
warehouse.
5. If used appropriately, an analytic sandbox can be one of the primary
drivers of value in the world of big data.
Analytical sandbox :
1. An analytic sandbox provides a set of resources with which in-depth analysis can be done to answer critical business questions.
2. An analytic sandbox is ideal for data exploration, development of analytical processes, proof of concepts, and prototyping.
3. Once things progress into ongoing, user-managed processes or production processes, then the sandbox should not be involved.
4. A sandbox is going to be leveraged by a fairly small set of users.
5. There will be data created within the sandbox that is segregated from the production database.
6. Sandbox users will also be allowed to load data of their own for brief time periods as part of a project, even if that data is not part of the official enterprise data model.
Que 1.12. | Explain modern data analytic tools.
Answer
Modern data analytic tools :
1. Apache Hadoop :
a. Apache Hadoop is a big data analytics tool which is a Java based free software framework.
b. It helps in effective storage of a huge amount of data in a storage place known as a cluster.
c. It runs in parallel on a cluster and also has the ability to process huge data across all nodes in it.
d. There is a storage system in Hadoop popularly known as the Hadoop Distributed File System (HDFS), which helps to split the large volume of data and distribute it across many nodes present in a cluster.
2. KNIME :
a. KNIME analytics platform is one of the leading open solutions for data-driven innovation.
b. This tool helps in discovering the potential hidden in a huge volume of data; it also mines for fresh insights and predicts new futures.
3. OpenRefine :
a. OpenRefine is one of the efficient tools to work on messy and large volumes of data.
b. It includes cleansing data and transforming that data from one format to another.
c. It helps to explore large data sets easily.
4. Orange :
a. Orange is a famous open-source data visualization tool and helps in data analysis for beginners as well as experts.
b. This tool provides interactive workflows, with a large toolbox option for creating them, which helps in analyzing and visualizing data.
5. RapidMiner :
a. RapidMiner operates using visual programming and is capable of manipulating, analyzing and modeling data.
b. RapidMiner makes data science teams more productive by providing an open-source platform for all their jobs like machine learning, data preparation, and model deployment.
6. R-programming :
a. R is a free open source software programming language and a software environment for statistical computing and graphics.
b. It is used by data miners for developing statistical software and data analysis.
c. It has become a highly popular tool for big data in recent years.
7. Datawrapper :
a. It is an online data visualization tool for making interactive charts.
b. It uses data files in a CSV, PDF or Excel format.
c. Datawrapper generates visualizations in the form of bar, line, map etc. It can be embedded into any other website as well.
8. Tableau :
a. Tableau is another popular big data tool. It is simple and very intuitive to use.
b. It communicates the insights of the data through data visualization. Through Tableau, an analyst can check a hypothesis and explore the data before starting to work on it extensively.
Que 1.13. | What are the benefits of analytic sandbox from the view of an analytic professional ?

Answer

Benefits of analytic sandbox from the view of an analytic professional :
1. Independence : Analytic professionals will be able to work independently on the database system without needing to continually go back and ask for permissions for specific projects.
2. Flexibility : Analytic professionals will have the flexibility to use whatever business intelligence, statistical analysis, or visualization tools that they need to use.
3. Efficiency : Analytic professionals will be able to leverage the existing enterprise data warehouse or data mart, without having to move or migrate data.
4. Freedom : Analytic professionals can reduce focus on the administration of systems and production processes by shifting those maintenance tasks to IT.
5. Speed : Massive speed improvement will be realized with the move to parallel processing. This also enables rapid iteration and the ability to "fail fast" and take more risks to innovate.
Que 1.14. ] What are the benefits of analytic sandbox from the
view of IT?
Answer
Benefits of analytic sandbox from the view of IT :
1. Centralization : IT will be able to centrally manage a sandbox environment just as every other database environment on the system is managed.
2. Streamlining : A sandbox will greatly simplify the promotion of analytic processes into production since there will be a consistent platform for both development and deployment.
3. Simplicity : There will be no more processes built during development that have to be totally rewritten to run in the production environment.
4. Control : IT will be able to control the sandbox environment, balancing sandbox needs and the needs of other users. The production environment is safe from an experiment gone wrong in the sandbox.
5. Costs : Big cost savings can be realized by consolidating many analytic data marts into one central system.
Que 1.15. | Explain the application of data analytics.
Answer

Application of data analytics :
1. Security : Data analytics applications or, more specifically, predictive analysis has also helped in dropping crime rates in certain areas.
2. Transportation :
a. Data analytics can be used to revolutionize transportation.
b. It can be used especially in areas where we need to transport a
large number of people to a specific area and require seamless
transportation.
3. Risk detection :
a. Many organizations were struggling under debt, and they wanted a
solution to problem of fraud.
b. They already had enough customer data in their hands, and so,
they applied data analytics.
c. They used the 'divide and conquer' policy with the data, analyzing recent
expenditure, profiles, and any other important information to
understand any probability of a customer defaulting.
4. Delivery:
a. Several top logistic companies are using data analysis to examine
collected data and improve their overall efficiency.
b. Using data analytics applications, the companies were able to find
the best shipping routes, delivery time, as well as the most cost-
efficient transport means.
5. Fast internet allocation :
a. While it might seem that allocating fast internet in every area
makes a city ‘Smart’, in reality, it is more important to engage in
smart allocation. This smart allocation would mean understanding
how bandwidth is being used in specific areas and for the right
cause.
b. It is also important to shift the data allocation based on timing and
priority. It is assumed that financial and commercial areas require
the most bandwidth during weekdays, while residential areas
require it during the weekends. But the situation is much more
complex. Data analytics can solve it.
c. For example, using applications of data analysis, a community can
draw the attention of high-tech industries and in such cases; higher
bandwidth will be required in such areas
6. Internet searching :
a. When we use Google, we are using one of their many data analytics
applications employed by the company.
b. Most search engines like Google, Bing, Yahoo, AOL etc., use data
analytics. These search engines use different algorithms to deliver
the best result for a search query.
7. Digital advertisement :
a. Data analytics has revolutionized digital advertising.
b. Digital billboards in cities as well as banners on websites, that is,
most of the advertisement sources nowadays use data analytics
using data algorithms.
Que 1.16. | What are the different types of Big Data analytics ?
Answer
Different types of Big Data analytics :
1. Descriptive analytics :
a. It uses data aggregation and data mining to provide insight into the past.
b. Descriptive analytics describe or summarize raw data and make it
interpretable by humans.
2. Predictive analytics:
a. It uses statistical models and forecasts techniques to understand
the future.
b. Predictive analytics provides companies with actionable insights
based on data. It provides estimates about the likelihood of a future
outcome.
3. Prescriptive analytics :
a. It uses optimization and simulation algorithms to advise on possible outcomes.
b. It allows users to "prescribe" a number of different possible actions
and guide them towards a solution.
4. Diagnostic analytics :
a. It is used to determine why something happened in the past.
b. It is characterized by techniques such as drill-down, data discovery, data mining and correlations.
c. Diagnostic analytics takes a deeper look at data to understand the
root causes of the events.
PART-4
Data Analytics Life Cycle : Need, Key Roles For Successful Analytic
Projects, Various Phases of Data Analytic Life Cycle : Discovery,
Data Preparation.
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Que 1.17. | Explain the key roles for a successful analytics project.
Answer
Key roles for a successful analytics project :
1. Business user:
a. Business user is someone who understands the domain area and
usually benefits from the results.
b. This person can consult and advise the project team on the context
of the project, the value of the results, and how the outputs will be
operationalized.
c. Usually a business analyst, line manager, or deep subject matter
expert in the project domain fulfills this role.
2. Project sponsor :
a
Project sponsor is responsible for the start of the project and provides
all the requirements for the project and defines the core business
problem.
Generally provides the funding and gauges the degree of value
from the final outputs of the working team.
This person sets the priorities for the project and clarifies the desired
outputs.
3. Project manager : Project manager ensures that key milestones and
objectives are met on time and at the expected quality.
4. Business Intelligence Analyst :
a
Analyst provides business domain expertise based on a deep
understanding of the data, Key Performance Indicators (KPIs),
key metrics, and business intelligence from a reporting perspective.
Business Intelligence Analysts generally create dashboards and
reports and have knowledge of the data feeds and sources.
5. Database Administrator (DBA):
a
DBA provisions and configures the database environment to support
the analytics needs of the working team.
These responsibilities may include providing access to key databases
or tables and ensuring the appropriate security levels are in place
related to the data repositories.
6. Data engineer : Data engineers have deep technical skills to assist with tuning SQL queries for data management and data extraction, and provide support for data ingestion into the analytic sandbox.
7. Data scientist :
a.
Data scientist provides subject matter expertise for analytical
techniques, data modeling, and applying valid analytical techniques
to given business problems.
b. They ensure overall analytics objectives are met.
c. They design and execute analytical methods and approaches with
the data available to the project.
Que 1.18. ] Explain various phases of data analytics life cycle.
Answer
Various phases of data analytic lifecycle are :
Phase 1 : Discovery :
1. In Phase 1, the team learns the business domain, including relevant history such as whether the organization or business unit has attempted similar projects in the past from which they can learn.
2. The team assesses the resources available to support the project in terms of people, technology, time, and data.
3. Important activities in this phase include framing the business problem as an analytics challenge and formulating initial hypotheses (IHs) to test and begin learning the data.
Phase 2 : Data preparation :
1. Phase 2 requires the presence of an analytic sandbox, in which the team can work with data and perform analytics for the duration of the project.
2. The team needs to execute extract, load, and transform (ELT) or extract, transform and load (ETL) to get data into the sandbox. Data should be transformed in the ETL process so the team can work with it and analyze it.
3. In this phase, the team also needs to familiarize itself with the data thoroughly and take steps to condition the data.
Phase 3 : Model planning :
1. Phase 3 is model planning, where the team determines the methods, techniques, and workflow it intends to follow for the subsequent model building phase.
2. The team explores the data to learn about the relationships between variables and subsequently selects key variables and the most suitable models.
Phase 4 : Model building :
1. In phase 4, the team develops data sets for testing, training, and production purposes.
2. In addition, in this phase the team builds and executes models based on the work done in the model planning phase.
3. The team also considers whether its existing tools will be adequate for running the models, or if it will need a more robust environment for executing models and workflows.
Phase 5 : Communicate results :
1. In phase 5, the team, in collaboration with major stakeholders, determines if the results of the project are a success or a failure based on the criteria developed in phase 1.
2. The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey findings to stakeholders.
Phase 6 : Operationalize :
1. In phase 6, the team delivers final reports, briefings, code, and technical documents.
2. In addition, the team may run a pilot project to implement the models in a production environment.
Que 1.19. | What are the activities should be performed while
identifying potential data sources during discovery phase ?
Answer
Main activities that are performed while identifying potential data sources
during discovery phase are :
1. Identify datasources :
a. Make a list of candidate data sources the team may need to test the
initial hypotheses outlined in discovery phase
b. Make an inventory of the datasets currently available and those
that can be purchased or otherwise acquired for the tests the team
wants to perform.
2. Capture aggregate data sources :
a. This is for previewing the data and providing high-level
understanding.
b. It enables the team to gain a quick overview of the data and perform
further exploration on specific areas.
c. It also points the team to possible areas of interest within the data.
3. Review the raw data:
a. Obtain preliminary data from initial data feeds
b. Begin understanding the interdependencies among the data
attributes, and become familiar with the content of the data, its
quality, and its limitations.
4. Evaluate the data structures and tools needed :
a. The data type and structure dictate which tools the team can use to
analyze the data.
b. This evaluation gets the team thinking about which technologies
may be good candidates for the project and how to start getting
access to these tools.
5. Scope the sort of data infrastructure needed for this type of
problem : In addition to the tools needed, the data influences the kind
of infrastructure required, such as disk storage and network capacity.
Que 1.20. | Explain the sub-phases of data preparation.
Answer
Sub-phases of data preparation are:
1. Preparing an analytics sandbox :
a. The first sub-phase of data preparation requires the team to obtain an analytic sandbox in which the team can explore the data without interfering with live production databases.
b. When developing the analytic sandbox, it is a best practice to collect all kinds of data there, as team members need access to high volumes and varieties of data for a Big Data analytics project.
c. This can include everything from summary-level aggregated data, structured data, raw data feeds, and unstructured text data from call logs or web logs.
2. Performing ETLT :
a. In ETL, users perform extract, transform, load processes to extract data from a data store, perform data transformations, and load the data back into the data store.
b. In this case, the data is extracted in its raw form and loaded into the data store, where analysts can choose to transform the data into a new state or leave it in its original, raw condition.
3. Learning about the data :
a. A critical aspect of a data science project is to become familiar with the data itself.
b. Spending time to learn the nuances of the datasets provides context to understand what constitutes a reasonable value and expected output.
c. In addition, it is important to catalogue the data sources that the team has access to and identify additional data sources that the team can leverage.
4. Data conditioning :
a. Data conditioning refers to the process of cleaning data, normalizing datasets, and performing transformations on the data.
b. Data conditioning can involve many complex steps to join or merge datasets or otherwise get datasets into a state that enables analysis in further phases.
c. It is viewed as a processing step for data analysis.
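A minimal, hypothetical sketch of the data conditioning sub-phase using pandas is given below; the raw extract, its column names and the chosen transformations are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical extract loaded into the sandbox in its raw form (the ELT case).
raw = pd.DataFrame({
    "order_date": ["2020-01-03", "2020-01-05", "2020-02-11"],
    "region":     ["north", "North", "SOUTH"],
    "revenue":    ["1200", "980", "1530"],       # numbers arrived as strings
})

# Conditioning: normalize categories, coerce types, derive analysis-ready columns.
conditioned = raw.assign(
    order_date=pd.to_datetime(raw["order_date"]),
    region=raw["region"].str.lower(),
    revenue=pd.to_numeric(raw["revenue"]),
)
conditioned["revenue_z"] = (
    (conditioned["revenue"] - conditioned["revenue"].mean())
    / conditioned["revenue"].std()
)
print(conditioned.dtypes)
print(conditioned)
```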
PART-5
Model Planning, Model Building, Communicating Results, Operationalization.
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Que 1.21. | What are the activities that are performed in the model planning phase ?
Answer
Activities that are performed in model planning phase are :
1. Assess the structure of the datasets :
a. The structure of the data sets is one factor that dictates the tools
and analytical techniques for the next phase.
b. Depending on whether the team plans to analyze textual data or
transactional data different tools and approaches are required.
2. Ensure that the analytical techniques enable the team to meet the
business objectives and accept or reject the working hypotheses.
3. Determine if the situation allows a single model or a series of techniques
as part of a larger analytic workflow.
Que 1.22. | What are the common tools for the model planning
phase ?
Answer
Common tools for the model planning phase :
1. R :
a. It has a complete set of modeling capabilities and provides a good
environment for building interpretive models with high-quality
code.
b. It has the ability to interface with databases via an ODBC
connection and execute statistical tests and analyses against Big
Data via an open source connection.
2. SQL analysis services : SQL Analysis services can perform in-
database analytics of common data mining functions, involved
aggregations, and basic predictive models.
3. SAS/ACCESS :
a. SAS/ACCESS provides integration between SAS and the analytics
sandbox via multiple data connectors such as OBDC, JDBC, and
OLE DB.
b. SAS itself is generally used on file extracts, but with SAS/ACCESS,
users can connect to relational databases (such as Oracle) and
data warehouse appliances, files, and enterprise applications.
Que 1.23. | Explain the common commercial tools for model building phase.
Answer
Common commercial tools for the model building phase :
1. SAS Enterprise Miner :
a. SAS Enterprise Miner allows users to run predictive and descriptive
models based on large volumes of data from across the enterprise.
b. It interoperates with other large data stores, has many partnerships,
and is built for enterprise-level computing and analytics.
SPSS Modeler provided by IBM : It offers methods to explore and
analyze data through a GUI.
Matlab : Matlab provides a high-level language for performing a variety
of data analytics, algorithms, and data exploration.
4. Alpine Miner : Alpine Miner provides a GUI frontend for users to
develop analytic workflows and interact with Big Data tools and platforms
on the backend.
5. STATISTICA and Mathematica are also popular and well-regarded data
mining and analytics tools.
Que 1.24. | Explain common open-source tools for the model
building phase.
Answer
Free or open source tools are:
1. R and PL/R :
a. R provides a good environment for building interpretive models and PL/R is a procedural language for PostgreSQL with R.
b. Using this approach means that R commands can be executed in-database.
c. This technique provides higher performance and is more scalable than running R in memory.
2. Octave :
a. It is a free software programming language for computational modeling, and has some of the functionality of Matlab.
b. Octave is used in major universities when teaching machine learning.
3. WEKA : WEKA is a free data mining software package with an analytic workbench. The functions created in WEKA can be executed within Java code.
4. Python : Python is a programming language that provides toolkits for machine learning and analysis, such as numpy, scipy, pandas, and related data visualization using matplotlib.
5. MADlib : SQL in-database implementations, such as MADlib, provide an alternative to in-memory desktop analytical tools. MADlib provides an open-source machine learning library of algorithms that can be executed in-database, for PostgreSQL.
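The sketch below shows how the Python toolkits named in point 4 (numpy, pandas and matplotlib) fit together on a small synthetic dataset; the data and the workflow are illustrative assumptions, not an example taken from the text.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")          # headless backend so the script runs anywhere
import matplotlib.pyplot as plt

# Small synthetic dataset to show the numpy / pandas / matplotlib workflow.
rng = np.random.default_rng(42)
df = pd.DataFrame({"x": rng.normal(size=200)})
df["y"] = 2.0 * df["x"] + rng.normal(scale=0.5, size=200)

print(df.describe())           # quick descriptive statistics with pandas

# Visualize the relationship and save the figure to disk.
plt.scatter(df["x"], df["y"], s=8)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Synthetic data: y vs x")
plt.savefig("scatter.png", dpi=120)
```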
UNIT
Data Analysis
CONTENTS

Part-1 : Data Analysis : Regression Modeling, Multivariate Analysis ............ 2-2J to 2-4J
Part-2 : Bayesian Modeling, Inference and Bayesian Networks, Support Vector and Kernel Methods ............ 2-4J to 2-7J
Part-3 : Analysis of Time Series : Linear System Analysis of Non-Linear Dynamics, Rule Induction ............ 2-7J to 2-11J
Part-4 : Neural Networks : Learning and Generalisation, Competitive Learning, Principal Component Analysis and Neural Networks ............ 2-11J to 2-20J
Part-5 : Fuzzy Logic : Extracting Fuzzy Models From Data, Fuzzy Decision Trees, Stochastic Search Methods ............ 2-20J to 2-28J
PART-1
Data Analysis : Regression Modeling, Multivariate Analysis.
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Que 2.1. | Write short notes on regression modeling.
Answer
1. Regression models are widely used in analytics, in general being among
the most easy to understand and interpret type of analytics techniques.
2. Regression techniques allow the identification and estimation of possible
relationships between a pattern or variable of interest, and factors that
influence that pattern.
3. For example, a company may be interested in understanding the
effectiveness of its marketing strategies.
4. Aregression model can be used to understand and quantify which of its
marketing activities actually drive sales, and to what extent.
5. Regression models are built to understand historical data and relationships
to assess effectiveness, as in the marketing effectiveness models.
6. Regression techniques are used across a range of industries, including
financial services, retail, telecom, pharmaceuticals, and medicine.
Que 2.2. | What are the various types of regression analysis
techniques ?
Answer
Various types of regression analysis techniques :
1. Linear regression : Linear regression assumes that there is a linear
relationship between the predictors (or the factors) and the target
variable.
2. Non-linear regression : Non-linear regression allows modeling of
non-linear relationships.
3. Logistic regression : Logistic regression is useful when our target
variable is binomial (accept or reject).
4. Time series regression : Time series regression is used to forecast future behavior of variables based on historical time ordered data.
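As a brief illustration of point 3, the sketch below fits a logistic regression to a synthetic binomial (accept/reject) target; scikit-learn and the generated data are assumptions made for the example, not part of the original text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 2))                       # two predictor variables
# Binary target: more likely "accept" (1) when x1 + x2 is large.
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

model = LogisticRegression().fit(X, y)
print("coefficients:", model.coef_, "intercept:", model.intercept_)
print("P(accept) for a new point:", model.predict_proba([[0.5, 0.2]])[0, 1])
```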
Que 2.3. | Write short note on linear regression models.
Answer

Linear regression model :
1. We consider the modelling between the dependent and one independent variable. When there is only one independent variable in the regression model, the model is generally termed as a linear regression model.
2. Consider a simple linear regression model
      y = β0 + β1X + ε
   where y is termed as the dependent or study variable and X is termed as the independent or explanatory variable. The terms β0 and β1 are the parameters of the model. The parameter β0 is termed as an intercept term, and the parameter β1 is termed as the slope parameter.
3. These parameters are usually called regression coefficients. The unobservable error component ε accounts for the failure of data to lie on the straight line and represents the difference between the true and observed realization of y.
4. There can be several reasons for such difference, such as the effect of all deleted variables in the model, variables may be qualitative, inherent randomness in the observations etc.
5. We assume that ε is observed as an independent and identically distributed random variable with mean zero and constant variance σ², and assume that ε is normally distributed.
6. The independent variables are viewed as controlled by the experimenter, so X is considered as non-stochastic whereas y is viewed as a random variable with
      E(y) = β0 + β1X and Var(y) = σ².
7. Sometimes X can also be a random variable. In such a case, instead of the sample mean and sample variance of y, we consider the conditional mean of y given X = x as
      E(y | x) = β0 + β1x
   and the conditional variance of y given X = x as
      Var(y | x) = σ².
8. When the values of β0, β1, and σ² are known, the model is completely described. The parameters β0, β1 and σ² are generally unknown in practice and ε is unobserved. The determination of the statistical model y = β0 + β1X + ε depends on the determination (i.e., estimation) of β0, β1, and σ². In order to know the values of these parameters, n pairs of observations (xi, yi), i = 1, ..., n, on (X, y) are observed/collected and are used to determine these unknown parameters.
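A minimal sketch of estimating β0, β1 and σ² by ordinary least squares from n observed pairs (xi, yi), using only numpy; the synthetic data and the true parameter values used to generate it are assumptions for illustration.

```python
import numpy as np

# Synthetic (x, y) pairs generated from y = b0 + b1*x + e with known parameters.
rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 10, size=n)
e = rng.normal(0, 1.0, size=n)          # i.i.d. errors, mean 0, constant variance
y = 3.0 + 1.5 * x + e

# Ordinary least squares estimates of the intercept and slope.
x_bar, y_bar = x.mean(), y.mean()
b1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0_hat = y_bar - b1_hat * x_bar

# Residual variance estimate (n - 2 degrees of freedom for two parameters).
residuals = y - (b0_hat + b1_hat * x)
sigma2_hat = np.sum(residuals ** 2) / (n - 2)

print(f"b0_hat={b0_hat:.3f}, b1_hat={b1_hat:.3f}, sigma2_hat={sigma2_hat:.3f}")
```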
Que 24. | Write short note on multivariate analysis.
Answer
1. Multivariate analysis (MVA) is based on the principles of multivariate statistics, which involves observation and analysis of more than one statistical outcome variable at a time.
2. These variables are nothing but prototypes of real time situations, products and services or decision making involving more than one variable.
3. MVA is used to address the situations where multiple measurements are made on each experimental unit and the relations among these measurements and their structures are important.
4. Multiple regression analysis refers to a set of techniques for studying the straight-line relationships among two or more variables.
5. Multiple regression estimates the β's in the equation
      yj = β0 + β1 x1j + β2 x2j + ... + βp xpj + εj
   Where, the x's are the independent variables and y is the dependent variable. The subscript j represents the observation (row) number. The β's are the unknown regression coefficients. Their estimates are represented by b's. Each β represents the original unknown (population) parameter, while b is an estimate of this β. The εj is the error (residual) of observation j.
6. The regression problem is solved by least squares. In the least squares method of regression analysis, the b's are selected so as to minimize the sum of the squared residuals. This set of b's is not necessarily the set we want, since they may be distorted by outliers, points that are not representative of the data. Robust regression, an alternative to least squares, seeks to reduce the influence of outliers.
7. Multiple regression analysis studies the relationship between a dependent (response) variable and p independent variables (predictors, regressors).
8. The sample multiple regression equation is
      ŷ = b0 + b1 x1 + b2 x2 + ... + bp xp
   where b0 is the point at which the regression plane intersects the Y axis. The bi are the slopes of the regression plane in the direction of xi. These coefficients are called the partial-regression coefficients. Each partial regression coefficient represents the net effect the i-th variable has on the dependent variable, holding the remaining x's in the equation constant.
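The least squares solution for the multiple regression coefficients can be sketched with numpy as below; the two-predictor synthetic dataset is an assumption made for the example.

```python
import numpy as np

# Synthetic data with two predictors: y = b0 + b1*x1 + b2*x2 + e.
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 4.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.3, size=n)

# Design matrix with a column of ones for the intercept term.
X = np.column_stack([np.ones(n), x1, x2])

# Least squares: choose b to minimize the sum of squared residuals.
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("intercept and partial regression coefficients:", np.round(b_hat, 3))
```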
PART-2
Bayesian Modeling, Inference and Bayesian Networks,
Support Vector and Kernel Methods.
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Que 2.5. | Write short notes on Bayesian network.
Answer
1. Bayesian networks are a type of probabilistic graphical model that uses
Bayesian inference for probability computations.
2. A Bayesian network is a directed acyclic graph in which each edge
corresponds to a conditional dependency, and each node corresponds to
a unique random variable.
Bayesian networks aim to model conditional dependence by representing
edges in a directed graph.
[Fig. 2.5.1 : A Bayesian network over the variables Cloudy, Sprinkler, Rain and WetGrass, with a conditional probability table attached to each node (e.g., P(Cloudy = T) = P(Cloudy = F) = 0.5).]

3. Through these relationships, one can efficiently conduct inference on the random variables in the graph through the use of factors.
4.
Using the relationships specified by our Bayesian network, we can obtain
acompact, factorized representation of the joint probability distribution
by taking advantage of conditional independence.
5. Formally, if an edge (A, B) exists in the graph connecting random
variables A and B, it means that P(B|A) is a factor in the joint probability
distribution, so we must know P(B | A) for all values of B and A in order
to conduct inference.
6. In Fig. 2.5.1, since Rain has an edge going into WetGrass, it means
that P(WetGrass | Rain) will be a factor, whose probability values are
specified next to the WetGrass node in a conditional probability table.
7. Bayesian networks satisfy the Markov property, which states that a
node is conditionally independent of its non-descendants given its
parents. In the given example, this means that
P(Sprinkler | Cloudy, Rain) = P(Sprinkler | Cloudy)
Since Sprinkler is conditionally independent of its non-descendant, Rain,
given Cloudy.
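A small sketch of the factorized joint distribution for this network, written as plain Python dictionaries; apart from P(Cloudy) = 0.5, the conditional probability values are illustrative assumptions, since the table entries in Fig. 2.5.1 are not fully readable here.

```python
# Factorization P(C, S, R, W) = P(C) * P(S|C) * P(R|C) * P(W|S, R).
# Probability values below are illustrative assumptions (only P(Cloudy) = 0.5
# is readable from the figure in the text).
P_C = {True: 0.5, False: 0.5}
P_S_given_C = {True: {True: 0.1, False: 0.9},     # P(S | C=True)
               False: {True: 0.5, False: 0.5}}    # P(S | C=False)
P_R_given_C = {True: {True: 0.8, False: 0.2},
               False: {True: 0.2, False: 0.8}}
P_W_given_SR = {(True, True): {True: 0.99, False: 0.01},
                (True, False): {True: 0.90, False: 0.10},
                (False, True): {True: 0.90, False: 0.10},
                (False, False): {True: 0.0, False: 1.0}}

def joint(c, s, r, w):
    """Joint probability of one full assignment, using the network's factors."""
    return P_C[c] * P_S_given_C[c][s] * P_R_given_C[c][r] * P_W_given_SR[(s, r)][w]

# Example: probability that it is cloudy, the sprinkler is off, it rains,
# and the grass is wet.
print(joint(c=True, s=False, r=True, w=True))
```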
Que 2.6. | Write short notes on inference over Bayesian network.
Answer
Inference over a Bayesian network can come in two forms.
1. First form :
a. The first is simply evaluating the joint probability of a particular
assignment of values for each variable (or a subset) in the network.
b. For this, we already have a factorized form of the joint distribution,
so we simply evaluate that product using the provided conditional
probabilities.
c. If we only care about a subset of variables, we will need to marginalize
out the ones we are not interested in.
d. In many cases, this may result in underflow, so it is common to take
the logarithm of that product, which is equivalent to adding up the
individual logarithms of each term in the product.
2. Second form :
a. In this form, the inference task is to find P(x | e), or the probability of some assignment of a subset of the variables (x) given assignments of other variables (our evidence, e).
b. In the example shown in Fig. 2.6.1, we have to find P(Sprinkler, WetGrass | Cloudy), where {Sprinkler, WetGrass} is our x, and {Cloudy} is our e.
c. In order to calculate this, we use the fact that P(x | e) = P(x, e) / P(e) = αP(x, e), where α is a normalization constant that we will calculate at the end such that P(x | e) + P(¬x | e) = 1.
d. In order to calculate P(x, e), we must marginalize the joint probability distribution over the variables that do not appear in x or e, which we will denote as Y :
      P(x | e) = α Σ_Y P(x, e, Y)
e. For the given example in Fig. 2.6.1, we can calculate P(Sprinkler, WetGrass | Cloudy) as follows :
      P(Sprinkler, WetGrass | Cloudy)
      = α Σ_Rain P(WetGrass | Sprinkler, Rain) P(Sprinkler | Cloudy) P(Rain | Cloudy) P(Cloudy)
      = α P(WetGrass | Sprinkler, Rain) P(Sprinkler | Cloudy) P(Rain | Cloudy) P(Cloudy)
        + α P(WetGrass | Sprinkler, ¬Rain) P(Sprinkler | Cloudy) P(¬Rain | Cloudy) P(Cloudy)
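The second form of inference can be sketched by enumeration as below, marginalizing Rain and then normalizing; the CPT values are the same illustrative assumptions used in the previous sketch.

```python
from itertools import product

# Illustrative CPTs (assumed values, matching the earlier sketch).
P_C = {True: 0.5, False: 0.5}
P_S = {True: {True: 0.1, False: 0.9}, False: {True: 0.5, False: 0.5}}   # P(S|C)
P_R = {True: {True: 0.8, False: 0.2}, False: {True: 0.2, False: 0.8}}   # P(R|C)
P_W = {(True, True): {True: 0.99, False: 0.01},
       (True, False): {True: 0.90, False: 0.10},
       (False, True): {True: 0.90, False: 0.10},
       (False, False): {True: 0.0, False: 1.0}}                         # P(W|S,R)

def joint(c, s, r, w):
    return P_C[c] * P_S[c][s] * P_R[c][r] * P_W[(s, r)][w]

def query_sw_given_c(c=True):
    """P(Sprinkler, WetGrass | Cloudy=c): marginalize Rain, then normalize."""
    unnorm = {}
    for s, w in product([True, False], repeat=2):
        unnorm[(s, w)] = sum(joint(c, s, r, w) for r in [True, False])
    alpha = 1.0 / sum(unnorm.values())          # normalization constant
    return {key: alpha * val for key, val in unnorm.items()}

for (s, w), p in query_sw_given_c(True).items():
    print(f"P(Sprinkler={s}, WetGrass={w} | Cloudy=True) = {p:.4f}")
```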
PART-3
Analysis of Time Series : Linear System Analysis
of Non-Linear Dynamics, Rule Induction.
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Que 2.7. | Explain the application of time series analysis.
Answer
Applications of time series analysis:
1. Retail sales :
a. For various product lines, a clothing retailer is looking to forecast
future monthly sales.
b. These forecasts need to account for the seasonal aspects of the
customer's purchasing decisions.
ce. An appropriate time series model needs to account for fluctuating
demand over the calendar year.
2. Spare parts planning :
a. Companies' service organizations have to forecast future spare part demands to ensure an adequate supply of parts to repair customer
products. Often the spares inventory consists of thousands of distinct
part numbers.
b. To forecast future demand, complex models for each part number
can be built using input variables such as expected part failure
rates, service diagnostic effectiveness and forecasted new product
shipments.
c. However, time series analysis can provide accurate short-term
forecasts based simply on prior spare part demand history.
3. Stock trading :
a. Some high-frequency stock traders utilize a technique called pairs
trading.
b. In pairs trading, an identified strong positive correlation between
the prices of two stocks is used to detect a market opportunity.
ec. Suppose the stock prices of Company A and Company B consistently
move together.
d. Time series analysis can be applied to the difference of these
companies’ stock prices over time.
e. A statistically larger than expected price difference indicates that it
is a good time to buy the stock of Company A and sell the stock of
Company B, or vice versa.
Que 2.8. | What are the components of time series ?
Answer
A time series can consist of the following components :
1. Trends :
a. The trend refers to the long-term movement in a time series.
b. It indicates whether the observation values are increasing or
decreasing over time.
ce. Examples of trends are a steady increase in sales month over month
or an annual decline of fatalities due to car accidents.
2. Seasonality :
a. The seasonality component describes the fixed, periodic fluctuation
in the observations over time
b. It is often related to the calendar.
c. For example, monthly retail sales can fluctuate over the year due
to the weather and holidays.
3. Cyclic :
a. A cyclic component also refers to a periodic fluctuation, which is not as fixed.
b. For example, retail sales are influenced by the general state of the
economy.
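Assuming statsmodels is available, the sketch below decomposes a synthetic monthly series into the trend and seasonal components described above; the generated data and its parameters are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly retail-style series: upward trend + yearly seasonality + noise.
rng = np.random.default_rng(7)
months = pd.date_range("2015-01-01", periods=60, freq="MS")
trend = np.linspace(100, 160, 60)
seasonal = 12 * np.sin(2 * np.pi * np.arange(60) / 12)
noise = rng.normal(scale=3, size=60)
sales = pd.Series(trend + seasonal + noise, index=months)

# Additive decomposition into trend, seasonal and residual components.
result = seasonal_decompose(sales, model="additive", period=12)
print(result.trend.dropna().head())
print(result.seasonal.head(12))     # the fixed, periodic (calendar) pattern
```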
Que 2.9. | Explain rule induction.
Answer
1. Rule induction is a data mining process of deducing if-then rules from a dataset.
2. These symbolic decision rules explain an inherent relationship between the attributes and class labels in the dataset.
3. Many real-life experiences are based on intuitive rule induction.
4. Rule induction provides a powerful classification approach that can be easily understood by the general users.
5. It is used in predictive analytics by classification of unknown data.
6. Rule induction is also used to describe the patterns in the data.
7. The easiest way to extract rules from a data set is from a decision tree that is developed on the same data set.
Que 2.10. | Explain an iterative procedure of extracting rules from
data sets.
Answer
1. Sequential covering is an iterative procedure of extracting rules from
the data sets.
2. The sequential covering approach attempts to find all the rules in the
data set class by class.
3. One specific implementation of the sequential covering approach is called
RIPPER, which stands for Repeated Incremental Pruning to Produce
Error Reduction.
Following are the steps in sequential covering rules generation approach :
Step 1: Class selection:
a. The algorithm starts with selection of class labels one by one.
b. The rule set is class-ordered where all the rules for a class are
developed before moving on to the next class.
c. The first class is usually the least-frequent class label.
d. From Fig. 2.10.1, the least frequent class is "+" and the algorithm
focuses on generating all the rules for "+" class.
Fig. 2.10.1. Data set with two classes and two dimensions.
Step 2: Rule development:
a. The objective in this step is to cover all “+” data points using
classification rules with none or as few "–" as possible.
b. For example, in Fig. 2.10.2, rule r1 identifies the area of four "+" in
the top left corner.
Fig. 2.10.2. Generation of rule r1.
c. Since this rule is based on simple logic operators in conjuncts, the
boundary is rectilinear.
d. Once rule r1 is formed, all the data points covered by r1 are
eliminated and the next best rule is found from the data set.
Step 3: Learn-One-Rule:
a. Each rule ri is grown by the learn-one-rule approach.
b. Each rule starts with an empty rule set and conjuncts are added
one by one to increase the rule accuracy.
c. Rule accuracy is the ratio of the number of "+" covered by the rule to all
records covered by the rule :
Rule accuracy A(r) = (Correct records covered by the rule) / (All records covered by the rule)
d. Learn-one-rule starts with an empty rule set : if {} then class = "+".
e. The accuracy of this rule is the same as the proportion of + data
points in the data set. Then the algorithm greedily adds conjuncts
until the accuracy reaches 100 %.
f. If the addition of a conjunct decreases the accuracy, then the
algorithm looks for other conjuncts or stops and starts the iteration
of the next rule.
Step 4: Next rule:
a. After a rule is developed, all the data points covered by the
rule are eliminated from the data set.
b. The above steps are repeated for the next rule to cover the rest of
the “+” data points.
c. In Fig. 2.10.3, rule r2 is developed after the data points covered by r1
are eliminated.
Fig. 2.10.3. Elimination of r1 data points and the next rule.
Step 5: Development of rule set:
a. After the rule set is developed to identify all "+" data points, the rule
model is evaluated with a data set used for pruning to reduce
generalization errors.
b. The metric used to evaluate the need for pruning is (p – n) / (p + n),
where p is the number of positive records covered by the rule and
n is the number of negative records covered by the rule.
c. All rules to identify "+" data points are aggregated to form a rule
group.
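A compact sketch of this sequential covering procedure is given below. The two-dimensional data set and the pool of threshold conjuncts are assumed for illustration; learn-one-rule greedily adds conjuncts until the rule covers only "+" records, and the covered points are then eliminated before the next rule is grown.

# Minimal sketch of sequential covering with a greedy learn-one-rule step.
data = [  # (x1, x2, label) - assumed toy data set
    (1, 8, "+"), (2, 9, "+"), (2, 7, "+"), (3, 8, "+"),
    (6, 2, "-"), (7, 3, "-"), (8, 1, "-"), (7, 7, "-"), (2, 2, "-"),
]

candidates = [  # candidate conjuncts: (description, test)
    ("x1 <= 3", lambda r: r[0] <= 3),
    ("x1 >= 6", lambda r: r[0] >= 6),
    ("x2 >= 7", lambda r: r[1] >= 7),
    ("x2 <= 3", lambda r: r[1] <= 3),
]

def accuracy(rule, records, target="+"):
    covered = [r for r in records if all(test(r) for _, test in rule)]
    if not covered:
        return 0.0, covered
    correct = sum(1 for r in covered if r[2] == target)
    return correct / len(covered), covered

def learn_one_rule(records, target="+"):
    """Greedily add conjuncts until accuracy on the covered records is 100 %."""
    rule = []
    while True:
        acc, covered = accuracy(rule, records, target)
        if rule and acc == 1.0:
            return rule, covered
        best = max(candidates, key=lambda c: accuracy(rule + [c], records, target)[0])
        if rule and accuracy(rule + [best], records, target)[0] <= acc:
            return rule, covered              # no conjunct improves the rule further
        rule.append(best)

remaining = list(data)
rules = []
while any(r[2] == "+" for r in remaining):    # Step 4 : repeat for the remaining "+"
    rule, covered = learn_one_rule(remaining)
    rules.append([desc for desc, _ in rule])
    remaining = [r for r in remaining if r not in covered]   # eliminate covered points

print(rules)   # e.g. [['x1 <= 3', 'x2 >= 7']]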
PART-4
Neural Networks : Learning and Generalization, Competitive
Learning, Principal Component Analysis and Neural Networks.
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Que 2.11.| Describe supervised learning and unsupervised
learning.
Answer
Supervised learning:
1. Supervised learning is also known as associative learning, in which the
network is trained by providing it with input and matching output
patterns.
2. Supervised training requires the pairing of each input vector with a
target vector representing the desired output.
3. The input vector together with the corresponding target vector is called
training pair.
4. To solve a problem of supervised learning, the following steps are considered :
a. Determine the type of training examples.
b. Gathering of a training set.
c. Determine the input feature representation of the learned function.
d. Determine the structure of the learned function and corresponding
learning algorithm.
e. Complete the design.
5. Supervised learning can be classified into two categories :
i. Classification    ii. Regression
Unsupervised learning :
1. In unsupervised learning, an output unit is trained to respond to clusters
of pattern within the input.
2. In this method of training, the input vectors of similar type are grouped
without the use of training data to specify how a typical member of each
group looks or to which group a member belongs.
3. Unsupervised training does not require a teacher; it requires certain
guidelines to form groups.
4. Unsupervised learning can be classified into two categories :
i. Clustering    ii. Association
Que 2.12.| Differentiate between supervised learning and
unsupervised learning.
Answer
Difference between supervised and unsupervised learning :
S.No. | Supervised learning | Unsupervised learning
1. | It uses known and labeled data as input. | It uses unknown data as input.
2. | Computational complexity is very complex. | Computational complexity is less.
3. | It uses offline analysis. | It uses real-time analysis of data.
4. | Number of classes is known. | Number of classes is not known.
5. | Accurate and reliable results. | Moderately accurate and reliable results.
Que 2.13. | What is the multilayer perceptron model ? Explain it.
Answer
1. Multilayer perceptron is a class of feed-forward artificial neural network.
2. Multilayer perceptron model has three layers : an input layer, an output
layer, and a layer in between not connected directly to the input or the
output and hence called the hidden layer.
3. For the perceptrons in the input layer, we use a linear transfer function,
and for the perceptrons in the hidden layer and the output layer, we use
a sigmoidal or squashed-S function.
4. The input layer serves to distribute the values it receives to the next
layer and so does not perform a weighted sum or threshold.
5. The input-output mapping of the multilayer perceptron is shown in
Fig. 2.13.1.
Fig. 2.13.1. Multilayer perceptron with an input layer, a hidden layer and an output layer.
6. Multilayer perceptron does not increase computational power over a
single layer neural network unless there is a non-linear activation
function between layers.
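A minimal sketch of the forward pass through such a multilayer perceptron follows, with an input layer that only distributes values and sigmoidal (squashed-S) hidden and output perceptrons; the layer sizes and the random weights are assumed.

# Minimal sketch: forward pass of a multilayer perceptron.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):                      # squashed-S activation
    return 1.0 / (1.0 + np.exp(-z))

n_in, n_hidden, n_out = 3, 4, 2      # assumed layer sizes
W1, b1 = rng.normal(size=(n_hidden, n_in)), np.zeros(n_hidden)   # input -> hidden
W2, b2 = rng.normal(size=(n_out, n_hidden)), np.zeros(n_out)     # hidden -> output

def forward(x):
    h = sigmoid(W1 @ x + b1)         # hidden layer : weighted sum + non-linearity
    y = sigmoid(W2 @ h + b2)         # output layer
    return y

print(forward(np.array([0.5, -1.0, 2.0])))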
Que 2.14. | Draw and explain the multiple perceptron with its
learning algorithm.
Answer
1. The perceptrons which are arranged in layers are called a multilayer
(multiple) perceptron.
2. This model has three layers : an input layer, an output layer and one or
more hidden layers.
3. For the perceptrons in the input layer, a linear transfer function is used,
and for the perceptrons in the hidden layer and output layer, the sigmoidal
or squashed-S function is used. The input signal propagates through the
network in a forward direction.
4. In the multilayer perceptron, bias b(n) is treated as a synaptic weight
driven by a fixed input equal to +1.
x(n) = [+1, x1(n), x2(n), ..., xm(n)]^T
where n denotes the iteration step in applying the algorithm.
5. Correspondingly, we define the weight vector as :
w(n) = [b(n), w1(n), w2(n), ..., wm(n)]^T
6. Accordingly, the linear combiner output is written in the compact form
v(n) = Σ_i wi(n) xi(n) = w^T(n) x(n)
Architecture of multilayer perceptron :
Fig. 2.14.1. Multilayer perceptron with an input layer, two hidden layers
and an output layer producing the output signal.
7. Fig. 2.14.1 shows the architectural model of multilayer perceptron with
two hidden layers and an output layer.
8. Signal flow through the network progresses in a forward direction,
from the left to right and on a layer-by-layer basis.
Learning algorithm :
1. If the nth member of the input set, x(n), is correctly classified into linearly
separable classes by the weight vector w(n), then no adjustment of
weights is done.
w(n + 1) = w(n)
If w^T(n) x(n) > 0 and x(n) belongs to class G1.
w(n + 1) = w(n)
If w^T(n) x(n) < 0 and x(n) belongs to class G2.
2. Otherwise, the weight vector of the perceptron is updated in accordance
with the perceptron learning rule.
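The correction step itself is not written out above; the sketch below assumes the standard perceptron rule, in which a misclassified input produces the update w(n + 1) = w(n) ± η x(n), with the bias treated as a weight driven by the fixed input +1.

# Minimal sketch of perceptron learning (standard correction rule assumed).
import numpy as np

def train_perceptron(X, labels, eta=1.0, epochs=20):
    """X : inputs (without bias); labels : +1 for class G1, -1 for class G2."""
    X = np.hstack([np.ones((len(X), 1)), X])    # bias as weight for fixed input +1
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, d in zip(X, labels):
            y = 1 if w @ x > 0 else -1
            if y != d:                          # misclassified : adjust the weights
                w = w + eta * d * x
            # correctly classified : w(n + 1) = w(n), no adjustment
    return w

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])   # assumed data
labels = np.array([1, 1, -1, -1])
print(train_perceptron(X, labels))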
Que 2.15. | Explain the algorithm to optimize the network size.
Answer
Algorithms to optimize the network size are :
1. Growing algorithms :
a. This group of algorithms begins with training a relatively small
neural architecture and allows new units and connections to be
added during the training process, when necessary.
b. Three growing algorithms are commonly applied: the upstart
algorithm, the tiling algorithm, and the cascade correlation.
c. The first two apply to binary input/output variables and networks
with step activation function.
d. The third one, which is applicable to problems with continuous
input/output variables and with units with sigmoidal activation
function, keeps adding units into the hidden layer until a satisfying
error value is reached on the training set.
2. Pruning algorithms :
a. General pruning approach consists of training a relatively large
network and gradually removing either weights or complete units
that seem not to be necessary.
b. The large initial size allows the network to learn quickly and with a
lower sensitivity to initial conditions and local minima.
c. The reduced final size helps to improve generalization.
Que 2.16. | Explain the approaches for knowledge extraction from
multilayer perceptrons.
Answer
Approach for knowledge extraction from multilayer perceptrons :
a. Global approach:
1. This approach extracts a set of rules characterizing the behaviour
of the whole network in terms of input/output mapping.
2. A tree of candidate rules is defined. The node at the top of the tree
represents the most general rule and the nodes at the bottom of
the tree represent the most specific rules.
3. Each candidate symbolic rule is tested against the network's
behaviour, to see whether such a rule can apply.
4. The process of rule verification continues until most of the training
set is covered.
5. One of the problems connected with this approach is that the
number of candidate rules can become huge when the rule space
becomes more detailed.
b. Local approach:
1. This approach decomposes the original multilayer network into a
collection of smaller, usually single-layered, sub-networks, whose
input/output mapping might be easier to model in terms of symbolic
rules.
2. Based on the assumption that hidden and output units, though
sigmoidal, can be approximated by threshold functions, individual
units inside each sub-network are modeled by interpreting the
incoming weights as the antecedent of a symbolic rule.
3. The resulting symbolic rules are gradually combined together to
define a more general set of rules that describes the network as a
whole.
4. The monotonicity of the activation function is required to limit the
number of candidate symbolic rules for each unit.
5. Local rule-extraction methods usually employ a special error
function and/or a modified learning algorithm, to encourage hidden
and output units to stay in a range consistent with possible rules
and to achieve networks with the smallest number of units and
weights.
Que 2.17. | Discuss the selection of various parameters in BPN.
Answer
Selection of various parameters in BPN (Back Propagation Network) :
1. Number of hidden nodes :
i. The guiding criterion is to select the minimum nodes which would
not impair the network performance so that the memory demand
for storing the weights can be kept minimum.
ii. When the number of hidden nodes is equal to the number of
training patterns, the learning could be fastest.
iii. In such cases, Back Propagation Network (BPN) remembers
training patterns losing all generalization capabilities.
iv. Hence, as far as generalization is concerned, the number of hidden
nodes should be small compared to the number of training patterns
(say 10:1).
2. Momentum coefficient (α) :
i. Another method of reducing the training time is the use of a
momentum factor because it enhances the training process.
Fig. 2.17.1. Influence of the momentum term α[Δw] on the weight change
(weight change without momentum plus the momentum term).
ii. The momentum also overcomes the effect of local minima.
iii. It will carry a weight change process through one or more local minima
and get it into the global minimum.
3. Sigmoidal gain (λ) :
i. When the weights become large and force the neuron to operate
in a region where the sigmoidal function is very flat, a better method
of coping with network paralysis is to adjust the sigmoidal gain.
ii. By decreasing this scaling factor, we effectively spread out the
sigmoidal function over a wider range so that training proceeds faster.
4. Local minima :
i. One of the most practical solutions involves the introduction of a
shock which changes all weights by specific or random amounts.
ii. If this fails, then the solution is to re-randomize the weights and
start the training all over.
iii. Simulated annealing is used to continue training until a local minimum
is reached.
iv. After this, simulated annealing is stopped and BPN continues
until global minimum is reached.
v. In most of the cases, only a few simulated annealing cycles of this
two-stage process are needed.
5. Learning coefficient (η) :
i. The learning coefficient cannot be negative because this would
cause the change of the weight vector to move away from the ideal weight
vector position.
ii. If the learning coefficient is zero, no learning takes place and
hence, the learning coefficient must be positive.
iii. If the learning coefficient is greater than 1, the weight vector will
overshoot from its ideal position and oscillate.
iv. Hence, the learning coefficient must be between zero and one.
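A minimal sketch of the momentum idea from Fig. 2.17.1 follows : the weight change is the plain gradient step plus α times the previous change. The toy error surface, η and α are assumed for illustration.

# Minimal sketch: weight update with a momentum term.
import numpy as np

eta, alpha = 0.1, 0.9              # learning coefficient and momentum coefficient (assumed)
w = np.array([0.1, -0.2])
prev_dw = np.zeros_like(w)

def gradient(w):                   # assumed toy error surface E(w) = ||w||^2 / 2
    return w

for step in range(5):
    dw = -eta * gradient(w) + alpha * prev_dw    # momentum-augmented weight change
    w, prev_dw = w + dw, dw
    print(step, np.round(w, 4))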
Que 2.18. | What is learning rate ? What is its function ?
Answer
1. Learning rate is a constant used in a learning algorithm that defines the
speed and extent of weight matrix corrections.
2. Setting a high learning rate tends to bring instability and the system is
difficult to converge even to a near-optimum solution.
3. A low value will improve stability, but will slow down convergence.
Learning function :
1. In most applications the learning rate is a simple function of time, for
example L.R. = 1/(1 + t).
2. These functions have the advantage of having high values during the
first epochs, making large corrections to the weight matrix, and smaller
values later, when the corrections need to be more precise.
3. Using a fuzzy controller to adaptively tune the learning rate has the
added advantage of bringing all expert knowledge in use.
4. If it were possible to manually adapt the learning rate in every epoch, we
would surely follow rules of the kind listed below :
a. If the change in error is small, then increase the learning rate.
b. If there are a lot of sign changes in error, then largely decrease the
learning rate.
c. If the change in error is small and the speed of error change is
small, then make a large increase in the learning rate.
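A minimal sketch of the time-based schedule L.R. = 1/(1 + t) mentioned above, which gives large corrections in the first epochs and smaller, more precise corrections later :

# Minimal sketch: time-based learning-rate schedule.
def learning_rate(t):
    return 1.0 / (1.0 + t)

for epoch in range(5):
    print(f"epoch {epoch}: learning rate = {learning_rate(epoch):.3f}")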
Que 2.19. | Explain competitive learning.
Answer
1. Competitive learning is a form of unsupervised learning in artificial
neural networks, in which nodes compete for the right to respond to a
subset of the input data.
2. A variant of Hebbian learning, competitive learning works by increasing
the specialization of each node in the network. It is well suited to finding
clusters within data.
3. Models and algorithms based on the principle of competitive learning
include vector quantization and self-organizing maps.
4. In a competitive learning model, there are hierarchical sets of units in
the network with inhibitory and excitatory connections.
5. The excitatory connections are between individual layers and the
inhibitory connections are between units in layered clusters.
6. Units in a cluster are either active or inactive.
7. There are three basic elements to a competitive learning rule :
a. A set of neurons that are all the same except for some randomly
distributed synaptic weights, and which therefore respond
differently to a given set of input patterns.
b. A limit imposed on the “strength” of each neuron.
c. A mechanism that permits the neurons to compete for the right to
respond to a given subset of inputs, such that only one output
neuron (or only one neuron per group), is active (i.e., “on”) at a
time. The neuron that wins the competition is called a “winner-
take-all” neuron.
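A minimal winner-take-all sketch of such a competitive learning rule follows : the unit whose weight vector is closest to the input wins the competition and only the winner's weights move toward the input. The data, the number of competing units and the learning rate are assumed.

# Minimal sketch: winner-take-all competitive learning.
import numpy as np

rng = np.random.default_rng(1)
inputs = rng.normal(size=(200, 2)) + np.array([[3, 3]] * 100 + [[-3, -3]] * 100)  # two clusters
weights = rng.normal(size=(2, 2))      # two competing units
eta = 0.1

for x in inputs:
    winner = np.argmin(np.linalg.norm(weights - x, axis=1))    # competition
    weights[winner] += eta * (x - weights[winner])              # only the winner learns

print(np.round(weights, 2))    # weight vectors move toward the cluster centres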
Que 2.20. | Explain Principal Component Analysis (PCA) in data
analysis.
Answer
1. PCA is a method used to reduce the number of variables in a dataset by
extracting the important ones from a large dataset.
2. It reduces the dimension of our data with the aim of retaining as much
information as possible.
3. In other words, this method combines highly correlated variables
together to form a smaller number of artificial variables, called
principal components (PC), that account for most of the variance in the
data.
4. A principal component can be defined as a linear combination of
optimally-weighted observed variables.
5. The first principal component retains the maximum variation that was
present in the original components.
6. The principal components are the eigenvectors of a covariance matrix,
and hence they are orthogonal.
7. The output of PCA are these principal components, the number of which
is less than or equal to the number of original variables.
8. The PCs possess some useful properties which are listed below :
a. The PCs are essentially the linear combinations of the original
variables and the weights vector.
b. The PCs are orthogonal.
c. The variation present in the PCs decreases as we move from the 1st
PC to the last one.
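A minimal sketch of PCA as described above : centre the data, form the covariance matrix, take its eigenvectors as the (orthogonal) principal components and project the data onto them. The small data matrix is assumed for illustration.

# Minimal sketch: PCA through the eigenvectors of the covariance matrix.
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

Xc = X - X.mean(axis=0)                      # centre each variable
cov = np.cov(Xc, rowvar=False)               # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvectors are orthogonal

order = np.argsort(eigvals)[::-1]            # sort by decreasing variance
components = eigvecs[:, order]               # principal components (weight vectors)
scores = Xc @ components                     # data expressed in the PCs

print("Variance captured by each PC:", np.round(eigvals[order], 3))
print("First PC (weights):", np.round(components[:, 0], 3))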
PART-5
Fuzzy Logic : Extracting Fuzzy Models From Data, Fuzzy
Decision Trees, Stochastic Search Methods.
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Que 2.21. | Define fuzzy logic and its importance in our daily life.
What is the role of crisp sets in fuzzy logic ?
Answer
1. Fuzzy logic is an approach to computing based on “degrees of truth”
rather than “true or false” (1 or 0).
2. Fuzzy logic includes 0 and 1 as extreme cases of truth but also includes
the various states of truth in between.
3. Fuzzy logic allows inclusion of human assessments in computing
problems.
4. It provides an effective means for conflict resolution of multiple criteria
and better assessment of options.
Importance of fuzzy logic in daily life:
1. Fuzzy logic is essential for the development of human-like capabilities
for AI.
2. It is used in the development of intelligent systems for decision making,
identification, optimization, and control.
3. Fuzzy logic is extremely useful for many people involved in research
and development including engineers, mathematicians, computer
software developers and researchers.
4. Fuzzy logic has been used in numerous applications such as facial pattern
recognition, air conditioners, vacuum cleaners, weather forecasting
systems, medical diagnosis and stock trading.
Role of crisp sets in fuzzy logic:
1. It contains the precise location of the set boundaries.
2. It provides the membership value of the set.
Que 2.22. | Define classical set and fuzzy sets. State the importance
of fuzzy sets.
Answer
Classical set :
1. A classical set is a collection of distinct objects.
2. Each individual entity in a set is called a member or an element of the
set.
3. The classical set is defined in such a way that the universe of discourse
is split into two groups : members and non-members.
Fuzzy set:
1. A fuzzy set is a set having degrees of membership between 0 and 1.
2. A fuzzy set A in the universe of discourse U can be defined as a set of
ordered pairs and it is given by
A = {(x, μA(x)) | x ∈ U}
where μA(x) is the degree of membership of x in A.
Importance of fuzzy set :
1. It is used for the modeling and inclusion of contradiction in a knowledge
base.
2. It also increases the system autonomy.
3. It acts as an important part of microchip processor-based appliances.
Que 2.23. | Compare and contrast classical logic and fuzzy logic.
Answer
Crisp (classical) logic | Fuzzy logic
1. In classical logic an element either belongs to or does not belong to a set. | Fuzzy logic supports a flexible sense of membership of elements to a set.
2. Crisp logic is built on 2-state truth values (True/False). | Fuzzy logic is built on multi-state truth values.
3. A statement which is either True or False but not both is called a proposition in crisp logic. | A fuzzy proposition is a statement which acquires a fuzzy truth value.
4. The law of excluded middle and the law of non-contradiction hold good in crisp logic. | The law of excluded middle and the law of contradiction are violated.
Que 2.24. | Define the membership function and state its importance
in fuzzy logic. Also discuss the features of membership functions.
Answer
Membership function:
1. A membership function for a fuzzy set A on the universe of discourse X
is defined as μA : X → [0, 1], where each element of X is mapped to a value
between 0 and 1.
2. This value, called membership value or degree of membership, quantifies
the grade of membership of the element in X to the fuzzy set A.
3. Membership functions characterize fuzziness (i.e., all the information in
fuzzy set), whether the elements in fuzzy sets are discrete or continuous.
4. Membership functions can be defined as a technique to solve practical
problems by experience rather than knowledge.
5. Membership functions are represented by graphical forms.
Importance of membership function in fuzzy logic :
1. Itallows us to graphically represent a fuzzy set.
2. It helps in finding different fuzzy set operation.
Features of membership function:
1. Core :
a. The core of a membership function for some fuzzy set A is defined
as that region of the universe that is characterized by complete and
full membership in the set.
b. The core comprises those elements x of the universe such that
μA(x) = 1.
2. Support:
a. The support of a membership function for some fuzzy set A is
defined as that region of the universe that is characterized by
nonzero membership in the set A.
b. The support comprises those elements x of the universe such that
μA(x) > 0.
Fig. 2.24.1. Core, support, and boundaries of a fuzzy set
(μ(x) plotted from 0 to 1 over the universe).
3. Boundaries :
a. The boundaries of a membership function for some fuzzy set A
are defined as that region of the universe containing elements that
have a non-zero membership but not complete membership.
b. The boundaries comprise those elements x of the universe such
that 0 < μA(x) < 1.
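A minimal sketch using an assumed triangular membership function ("about 5") to compute the core, support and boundary regions defined above :

# Minimal sketch: core, support and boundaries of a triangular fuzzy set.
def mu(x, a=2.0, b=5.0, c=8.0):
    """Triangular membership : 0 outside [a, c], 1 at b, linear in between."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

universe = [x / 2 for x in range(0, 21)]                # X = {0, 0.5, ..., 10}
core     = [x for x in universe if mu(x) == 1.0]        # membership exactly 1
support  = [x for x in universe if mu(x) > 0.0]         # membership greater than 0
boundary = [x for x in universe if 0.0 < mu(x) < 1.0]   # strictly between 0 and 1

print("core:", core)
print("support:", support)
print("boundary:", boundary)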
Que 2.25. | Explain the inference in fuzzy logic.
Answer
Fuzzy Inference:
1. Inference is a technique in which, from a given set of facts or premises
F1, F2, ..., Fn, a goal G is to be derived.
2. Fuzzy inference is the process of formulating the mapping from a given
input to an output using fuzzy logic.
3. The mapping then provides a basis from which decisions can be made.
4. Fuzzy inference (approximate reasoning) refers to computational
procedures used for evaluating linguistic (IF-THEN) descriptions.
5. The two important inferring procedures are :
i. Generalized Modus Ponens (GMP) :
1. GMP is formally stated as
If x is A THEN y is B
x is A'
------------------------
y is B'
Here, A, B, A’ and B' are fuzzy terms.
2. Every fuzzy linguistic statement above the line is analytically known
and what is below it is analytically unknown.
Here B' = A' o R(x, y)
where 'o' denotes the max-min composition (IF-THEN relation).
3. The membership function is
μB'(y) = max_x (min (μA'(x), μR(x, y)))
where μB'(y) is the membership function of B', μA'(x) is the membership
function of A', and μR(x, y) is the membership function of the
implication relation.
ii. Generalized Modus Tollens (GMT):
1. GMT is defined as
If x is A THEN y is B
y is B'
------------------------
x is A'
2. The membership of A’ is computed as
A’ = BoR (x,y)
3. In terms of membership functions,
μA'(x) = max_y (min (μB'(y), μR(x, y)))
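A minimal sketch of GMP by max-min composition, μB'(y) = max over x of min(μA'(x), μR(x, y)), with small assumed membership vectors and implication relation :

# Minimal sketch: Generalized Modus Ponens via max-min composition.
import numpy as np

A_prime = np.array([0.2, 1.0, 0.3])           # membership of A' over x1..x3 (assumed)
R = np.array([[0.1, 0.4],                     # implication relation R(x, y) (assumed)
              [0.8, 0.6],
              [0.3, 0.9]])

# for each y, take min(A'(x), R(x, y)) and then the max over x
B_prime = np.max(np.minimum(A_prime[:, None], R), axis=0)
print(B_prime)    # membership function of B' over y1, y2 -> [0.8, 0.6]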
Que 2.26. | Explain Fuzzy Decision Tree (FDT).
Answer
1. Decision trees are one of the most popular methods for learning and
reasoning from instances.
2. Given a set of n input-output training patterns D = {(X^i, y^i) | i = 1, ..., n},
where each training pattern X^i has been described by a set of p conditional
(or input) attributes (c1, ..., cp) and one corresponding discrete class label
y^i, where y^i ∈ {1, ..., g} and g is the number of classes.
3. The decision attribute y^i represents posterior knowledge regarding
the class of each pattern.
4. An arbitrary class has been indexed by l (1