QUANTUM SERIES

Data Analytics
By Aditya Kumar

For B.Tech students of third year of all engineering colleges affiliated to Dr. A.P.J. Abdul Kalam Technical University, Uttar Pradesh, Lucknow (formerly Uttar Pradesh Technical University). Topic-wise coverage of the entire syllabus in question-answer form, with short (2-mark) questions.

PUBLISHED BY : Apram Singh, Quantum Publications (A Unit of Quantum Page Pvt. Ltd.), Plot No. 59/2/7, Site-4, Industrial Area, Sahibabad, Ghaziabad-201 010.
Phone : 0120-4160479   Email : pagequantum@gmail.com   Website : www.quantumpage.co.in
Delhi Office : 1/6590, East Rohtas Nagar, Shahdara, Delhi-110032

© All Rights Reserved. No part of this publication may be reproduced or transmitted, in any form or by any means, without permission. Information contained in this work is derived from sources believed to be reliable. Every effort has been made to ensure accuracy; however, neither the publisher nor the authors guarantee the accuracy or completeness of any information published herein, and neither the publisher nor the authors shall be responsible for any errors, omissions, or damages arising out of use of this information.

Data Analytics (CS : Sem-5 and IT : Sem-6)
1st Edition : 2020-21
Price : Rs. 55/- only
Printed Version : e-Book

SYLLABUS (KCS-051)

UNIT-1 : INTRODUCTION TO DATA ANALYTICS
Introduction to Data Analytics : Sources and nature of data, classification of data (structured, semi-structured, unstructured), characteristics of data, introduction to Big Data platform, need of data analytics, evolution of analytic scalability, analytic process and tools, analysis vs reporting, modern data analytic tools, applications of data analytics.
Data Analytics Lifecycle : Need, key roles for successful analytic projects, various phases of data analytics lifecycle : discovery, data preparation, model planning, model building, communicating results, operationalization.

UNIT-2 : DATA ANALYSIS
Regression modeling, multivariate analysis, Bayesian modeling, inference and Bayesian networks, support vector and kernel methods, analysis of time series : linear systems analysis and nonlinear dynamics, rule induction, neural networks : learning and generalisation, competitive learning, principal component analysis and neural networks, fuzzy logic : extracting fuzzy models from data, fuzzy decision trees, stochastic search methods.

UNIT-3 : MINING DATA STREAMS
Introduction to streams concepts, stream data model and architecture, stream computing, sampling data in a stream, filtering streams, counting distinct elements in a stream, estimating moments, counting oneness in a window, decaying window, Real-time Analytics Platform (RTAP) applications, case studies : real-time sentiment analysis, stock market predictions.

UNIT-4 : FREQUENT ITEMSETS AND CLUSTERING
Mining frequent itemsets, market based modelling, Apriori algorithm, handling large data sets in main memory, limited pass algorithm, counting frequent itemsets in a stream, clustering techniques : hierarchical, K-means, clustering high dimensional data, CLIQUE and ProCLUS, frequent pattern based clustering methods, clustering in non-Euclidean space, clustering for streams and parallelism.

UNIT-5 : FRAME WORKS AND VISUALIZATION
MapReduce, Hadoop, Pig, Hive, HBase, MapR, Sharding, NoSQL databases, S3, Hadoop Distributed File System, Visualization : visual data analysis techniques, interaction techniques, systems and applications. Introduction to R : R graphical user interfaces, data import and export, attribute and data types, descriptive statistics, exploratory data analysis, visualization before analysis, analytics for unstructured data.

Eight lectures are proposed for each unit. A section of short questions (SQ) is given at the end of the book.

Course Outcomes (CO) with Bloom's Knowledge Level (KL) : at the end of the course, the student will be able to :
CO1 : Describe the life cycle phases of data analytics through discovery, planning and building. (K1, K2)
CO2 : Understand and apply data analysis techniques. (K2, K3)
CO3 : Implement various data streams. (K3)
CO4 : Understand itemsets, clustering, frameworks and visualizations. (K2)
CO5 : Apply the R tool for developing and evaluating real-time applications. (K3, K5, K6)

Text Books and References :
1. Michael Berthold, David J. Hand, "Intelligent Data Analysis", Springer.
2. Anand Rajaraman and Jeffrey David Ullman, "Mining of Massive Datasets", Cambridge University Press.
3. Bill Franks, "Taming the Big Data Tidal Wave : Finding Opportunities in Huge Data Streams with Advanced Analytics", John Wiley & Sons.
4. Michael Minelli, Michelle Chambers, and Ambiga Dhiraj, "Big Data, Big Analytics : Emerging Business Intelligence and Analytic Trends for Today's Businesses", Wiley.
5. David Dietrich, Barry Heller, Beibei Yang, "Data Science and Big Data Analytics", EMC Education Services, John Wiley.
6. Frank J. Ohlhorst, "Big Data Analytics : Turning Big Data into Big Money", Wiley and SAS Business Series.
7. Colleen McCue, "Data Mining and Predictive Analysis : Intelligence Gathering and Crime Analysis", Elsevier.
8. Paul Zikopoulos, Chris Eaton, "Understanding Big Data : Analytics for Enterprise Class Hadoop and Streaming Data", McGraw Hill.
9. Trevor Hastie, Robert Tibshirani, Jerome Friedman, "The Elements of Statistical Learning", Springer.
10. Mark Gardener, "Beginning R : The Statistical Programming Language", Wrox Publication.
11. Pete Warden, "Big Data Glossary", O'Reilly.
12. Glenn J. Myatt, "Making Sense of Data", John Wiley & Sons.
13. Peter Bühlmann, Petros Drineas, Michael Kane, Mark van der Laan, "Handbook of Big Data", CRC Press.
14. Jiawei Han, Micheline Kamber, "Data Mining : Concepts and Techniques", Second Edition, Elsevier.
UNIT-1 : Introduction to Data Analytics

CONTENTS
Part-1 : Introduction to Data Analytics : Sources and Nature of Data, Classification of Data (Structured, Semi-Structured, Unstructured), Characteristics of Data
Part-2 : Introduction to Big Data Platform, Need of Data Analytics
Part-3 : Evolution of Analytic Scalability, Analytic Process and Tools, Analysis vs Reporting, Modern Data Analytic Tools, Applications of Data Analytics
Part-4 : Data Analytics Lifecycle : Need, Key Roles for Successful Analytic Projects, Various Phases of Data Analytics Lifecycle : Discovery, Data Preparation
Part-5 : Model Planning, Model Building, Communicating Results, Operationalization

PART-1
Introduction to Data Analytics : Sources and Nature of Data, Classification of Data (Structured, Semi-Structured, Unstructured), Characteristics of Data.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 1.1. What is data analytics ?

Answer
1. Data analytics is the science of analyzing raw data in order to make conclusions about that information.
2. Any type of information can be subjected to data analytics techniques to gain insight that can be used to improve things.
3. Data analytics techniques help in finding trends and metrics that can be used to optimize processes and increase the overall efficiency of a business or system.
4. Many of the techniques and processes of data analytics have been automated into mechanical processes and algorithms that work over raw data for human consumption.
5. For example, manufacturing companies often record the runtime, downtime, and work queue for various machines and then analyze the data to better plan the workloads so the machines operate closer to peak capacity.

Que 1.2. Explain the sources of data (or Big Data).

Answer
Three primary sources of Big Data are :
1. Social data :
   a. Social data comes from the likes, tweets and retweets, comments, video uploads, and general media that are uploaded and shared via social media platforms.
   b. This kind of data provides invaluable insights into consumer behaviour and sentiment and can be enormously influential in marketing analytics.
   c. The public web is another good source of social data, and tools like Google Trends can be used to good effect to increase the volume of big data.
2. Machine data :
   a. Machine data is defined as information which is generated by industrial equipment, sensors that are installed in machinery, and even web logs which track user behaviour.
   b. This type of data is expected to grow exponentially as the Internet of Things grows ever more pervasive and expands around the world.
   c. Sensors such as medical devices, smart meters, road cameras, satellites, games and the rapidly growing Internet of Things will deliver data with high velocity, value, volume and variety in the very near future.
3. Transactional data :
   a. Transactional data is generated from all the daily transactions that take place both online and offline.
   b. Invoices, payment orders, storage records and delivery receipts are characterized as transactional data.
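To make these source types concrete, here is a minimal Python sketch (not from the book) that turns a few machine-generated log lines and a handful of transactional records into analyzable tables; the file contents, field names, and values are invented purely for illustration.

```python
# A minimal sketch: loading machine data (sensor-style logs) and
# transactional data (invoice records) into analyzable tables with pandas.
# All contents, column names, and values below are invented for illustration.
import io
import re
import pandas as pd

# Machine data: log lines produced by equipment or sensors.
log_text = """2020-01-05 10:00:01 sensor=meter-17 reading=230.4
2020-01-05 10:00:02 sensor=meter-17 reading=231.0
2020-01-05 10:00:03 sensor=cam-02 reading=1.0"""
pattern = re.compile(r"(\S+ \S+) sensor=(\S+) reading=([\d.]+)")
rows = [pattern.match(line).groups() for line in log_text.splitlines()]
machine_df = pd.DataFrame(rows, columns=["timestamp", "sensor", "reading"])
machine_df["reading"] = machine_df["reading"].astype(float)

# Transactional data: invoices / payment orders recorded per transaction.
csv_text = """invoice_id,customer,amount
1001,A,250.0
1002,B,99.5
1003,A,410.0"""
txn_df = pd.read_csv(io.StringIO(csv_text))

# Simple aggregations of the kind the later chapters build on.
print(machine_df.groupby("sensor")["reading"].mean())
print(txn_df.groupby("customer")["amount"].sum())
```

Social data would typically arrive through a platform API in semi-structured JSON form, which connects directly to the classification of data discussed next.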
Que 1.3. Write short notes on classification of data.

Answer
1. Unstructured data :
   a. Unstructured data is the rawest form of data : data that has no inherent structure, which may include text documents, PDFs, images, and video.
   b. This data is often stored in a repository of files.
2. Structured data :
   a. Structured data is tabular data (rows and columns) which is very well defined.
   b. It contains a defined data type, format, and structure, which may include transaction data, traditional RDBMS tables, CSV files, and simple spreadsheets.
3. Semi-structured data :
   a. Semi-structured data consists of textual data files with a distinct pattern that enables parsing, such as Extensible Markup Language (XML) data files or JSON.
   b. A consistent format is defined; however, the structure is not very strict.
   c. Semi-structured data are often stored as files.

Que 1.4. Differentiate between structured, semi-structured and unstructured data.

Answer
1. Technology : Structured data is based on relational database tables; semi-structured data is based on XML/RDF; unstructured data is based on character and binary data.
2. Transaction management : Structured data has matured transaction and concurrency techniques; in semi-structured data, transaction management is adapted from the DBMS; unstructured data has no transaction management and no concurrency.
3. Flexibility : Structured data is schema dependent and less flexible; semi-structured data is more flexible than structured data but less flexible than unstructured data; unstructured data is very flexible, as there is an absence of schema.
4. Scalability : It is very difficult to scale a structured database schema; semi-structured data is more scalable than structured data; unstructured data is very scalable.
5. Query performance : Structured query allows complex joining; for semi-structured data, queries over anonymous nodes are possible; for unstructured data, only textual queries are possible.

Que 1.5. Explain the characteristics of Big Data.

Answer
Big Data is characterized along four dimensions :
1. Volume :
   a. Volume is concerned with the scale of data, i.e., the volume at which data is growing.
   b. The volume of data is growing rapidly due to several applications of business, social, web and scientific explorations.
2. Velocity :
   a. Velocity is the speed at which data is increasing, thus demanding analysis of streaming data.
   b. The velocity is due to the growing speed of business intelligence applications such as trading, transactions in the telecom and banking domains, and the growing number of internet connections with the increased usage of the internet.
3. Variety : It depicts the different forms of data used for analysis, such as structured, semi-structured and unstructured.
4. Veracity :
   a. Veracity is concerned with uncertainty or inaccuracy of the data.
   b. In many cases the data will be inaccurate; hence filtering and selecting the data which is actually needed is a complicated task.
   c. A lot of statistical and analytical processing has to go into data cleansing to choose intrinsic data for decision making.

PART-2
Introduction to Big Data Platform, Need of Data Analytics.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 1.6. Write a short note on big data platform.
Answer L be Big data platform is a type of IT solution that combines the features and capabilities of several big data application and utilities within a single solution. It is an enterprise class IT platform that enables organization in developing, deploying, operating and managing a big data infrastructure! environment. Big data platform generally consists of big data storage, servers, database, big data management, business intelligence and other big data management utilities. It also supports custom development, querying and integration with other systems. The primary benefit behind a big data platform is to reduce the complexity of multiple vendors/ solutions into a one cohesive solution. Big data platform are also delivered through cloud where the provider provides an all inclusive big data solutions and services Que 1.7. ] What are the features of big data platform ? Introduction to Data Analytics 1-6 J (CS-5/IT-6) Answer Features of Big Data analytics platform : 1. Big Data platform should be able to accommodate new platforms and tool based on the business requirement. It should support linear scale-out. It should have capability for rapid deployment. It should support variety of data format Platform should provide data analysis and reporting tools. It should provide real-time data analysis software. Ao Fe WON It should have tools for searching the data through large data sets. Que 1.8. | Why there is need of data analytics ? Answer Need of data analytics : 1. It optimizes the business performance. 2. It helps to make better decisions. 3. Tt helps to analyze customers trends and solutions PART-3 Evolution of Analytic Scalability, Analytic Process and Tools, Analysis vs Reporting, Modern Data Analytic Tools, Applications of Data Analysis. Questions-Answers Long Answer Type and Medium Answer Type Questions Que 1.9. | What are the steps involved in data analysis ? Answer Steps involved in data analysis are: 1. Determine the data: a. The first step is to determine the data requirements or how the data is grouped. b. Data may be separated by age, demographic, income, or gender. ce. Data values may be numerical or be divided by category. Data Analyties 1-7 J (CS-5/IT-6) 2. » Collection of data: a. The second step in data analytics is the process of collecting it. b. This can be done through a variety of sources such as computers, online sources, cameras, environmental sources, or through personnel. Organization of data: a. Third step is to organize the data. b. Once the datais collected, it must be organized so it can be analyzed. e. Organization may take place on a spreadsheet or other form of software that can take statistical data. Cleaning of data: a. In fourth step, the datais then cleaned up before analysis. b. This means it is scrubbed and checked to ensure there is no duplication or error, and that it is not incomplete. ce. This step helps correct any errors before it goes on to adata analyst to be analyzed. Que 1.10. ] write short note on evolution of analytics scalability. Answer L i] In analytic scalability, we have to pull the data together in a separate analytics environment and then start performing analysis. —— —_——oas> TOOT = —— tt The heavy processing occurs |_| in the analytic environment Analytic server or PC Analysts do the merge operation on the data sets which contain rows and columns. <_ The columns represent information about the customers such as name, spending level, or status. In merge or join, two or more data sets are combined together. 
They are typically merged / joined so that specific rows of one data set or table are combined with specific rows of another. Introduction to Data Analytics 1-8 J (CS-5/IT-6) 5. Analysts also do data preparation. Data preparation is made up of joins, aggregations, derivations, and transformations. In this process, they pull data from various sources and merge it all together to create the variables required for an analysis. 6. Massively Parallel Processing (MPP) system is the most mature, proven, and widely deployed mechanism for storing and analyzing large amounts of data. = An MPP database breaks the data into independent pieces managed by independent storage and central processing unit (CPU) resources. 00 GB 100 GB 100 GB 100 GB 100 GB Chunks | | Chunks | | Chunks | | Chunk Chunks 1 terabyte Cd table 100 GB 100 GB 100 GB 100 GB 100 GB Chunks || Chunks | | Chunks | | Chunks | | Chunks A traditional database 10 Simultaneous 100-GB queries will query a one terabyte one row at time. Fig. 1.10.1. Massively Parallel Processing system data storage. 8. MPP systems build in redundancy to make recovery easy. 9. MPP systems have resource management tools : a. Manage the CPU and disk space b. Query optimizer Que 1.11. ] write short notes on evolution of analytic process. Answer 1. With increased level of scalability, it needs to update analytic processes to take advantage of it. w This can be achieved with the use of analytical sandboxes to provide analytic professionals with a scalable environment to build advanced analytics processes. One of the uses of MPP database system is to facilitate the building and deployment of advanced analytic processes. 4. An analytic sandbox is the mechanism to utilize an enterprise data warehouse. 5. If used appropriately, an analytic sandbox can be one of the primary drivers of value in the world of big data. Analytical sandbox : 1. An analytie sandbox provides a set of resources with which in-depth analysis can be done to answer critical business questions. Data Analyties 1-9 J (CS-5/IT-6) 2. ae a An analytic sandbox is ideal for data exploration, development of analytical processes, proof of concepts, and prototyping. Once things progress into ongoing, user-managed processes or production processes, then the sandbox should not be involved Asandbox is going to be leveraged by a fairly small set of users. There will be data created within the sandbox that is segregated from the production database. Sandbox users will also be allowed to load data of their own for brief time periods as part of a project, even if that datais not part of the official enterprise data model. Que 1.12. | Explain modern data analytic tools. Answer Modern data analytic tools : 1 Apache Hadoop : a. Apache Hadoop, a big data analytics tool which is a Java based free software framework. b. It helps in effective storage of huge amount of data in a storage place known as a cluster. ec. It runs in parallel ona cluster and also has ability to process huge data across all nodes in it. d. There isa storage system in Hadoop popularly known as the Hadoop Distributed File System (HDF), which helps to splits the large volume of data and distribute across many nodes present in a cluster. KNIME: a. KNIME analytics platform is one of the leading open solutions for data-driven innovation. b. This tool helps in discovering the potential and hidden in a huge volume of data, it also performs mine for fresh insights, or predicts the new futures. OpenRefine: a. 
OneRefine tool is one of the efficient tools to work on the messy and large volume of data. b. It includes cleansing data, transforming that data from one format another. ec. Ithelps to explore large data sets easily. Orange: a. Orange is famous open-source data visualization and helps in data analysis for beginner and as well to the expert. Introduction to Data Analytics 1-10 J (CS-5/IT-6) Q b. This tool provides interactive workflows with a large toolbox option to create the same which helps in analysis and visualizing of data. RapidMiner: a. RapidMiner tool operates using visual programming and also it is much capable of manipulating, analyzing and modeling the data. b. RapidMiner tools make data science teams easier and productive by using an open-source platform for all their jobs like machine learning, data preparation, and model deployment. R-programming : a. Risa free open source software programming language and a software environment for statistical computing and graphics. b. Itisused by data miners for developing statistical software and data analysis. ec. It has become a highly popular tool for big data in recent years. Datawrapper: a. Itis an online data visualization tool for making interactive charts. b. It uses data file ina esv, pdf or excel format. e. Datawrapper generate visualization in the form of bar, line, map etc. It can be embedded into any other website as well. Tableau : a. Tableauis another popular big data tool. It issimple and very intuitive to use. b. It communicates the insights of the data through data visualization. Through Tableau, an analyst ean check a hypothesis and explore the data before starting to work on it extensively. ue 1.13. | What are the benefits of analytic sandbox from the view Ea of an analytic professional ? Answer | Benefits of analytic sandbox from the view of an analytic professional : 1 Independence : Analytic professionals will be able to work independently on the database system without needing to continually go back and ask for permissions for specific projects. Flexibility : Analytic professionals will have the flexibility to use whatever business intelligence, statistical analysis, or visualization tools that they need to use. Efficiency: Analytic professionals will be able to leverage the existing enterprise data warehouse or data mart, without having to move or migrate data. Data Analyties 1-11 J (CS-5/IT-6) 4. a Freedom: Analytic professionals can reduce focus on the administration of systems and production processes by shifting those maintenance tasks to IT. Speed : Massive speed improvement will be realized with the move to parallel processing. This also enables rapid iteration and the ability to “fail fast” and take more risks to innovate. Que 1.14. ] What are the benefits of analytic sandbox from the view of IT? Answer Benefits of analytic sandbox from the view of IT : 1 i pS ” Centralization : IT will be able to centrally manage a sandbox environment just as every other database environment on the system is managed. Streamlining: A sandbox will greatly simplify the promotion of analytic processes into production since there will be a consistent platform for both development and deployment. Simplicity: There will be no more processes built during development that have to be totally rewritten to run in the production environment. ‘Control : IT will be able to control the sandbox environment, balancing sandbox needs and the needs of other users. 
The production environment is safe from an experiment gone wrong in the sandbox. ‘Costs : Big cost savings can be realized by consolidating many analytic data marts into one central system. Que 1.15. | Explain the application of data analytics. Answer Answer | Application of data analytics : L yp Security : Data analytics applications or, more specifically, predictive analysis has also helped in dropping crime rates in certain areas. Transportation : a. Data analytics can be used to revolutionize transportation. b. It can be used especially in areas where we need to transport a large number of people to a specific area and require seamless transportation. Risk detection: a. Many organizations were struggling under debt, and they wanted a solution to problem of fraud. b. They already had enough customer data in their hands, and so, they applied data analytics. Introduction to Data Analytics 1-12 J (CS-5/IT-6) ce. They used ‘divide and conquer’ policy with the data, analyzing recent expenditure, profiles, and any other important information to understand any probability of a customer defaulting. 4. Delivery: a. Several top logistic companies are using data analysis to examine collected data and improve their overall efficiency. b. Using data analytics applications, the companies were able to find the best shipping routes, delivery time, as well as the most cost- efficient transport means. 5. Fast internet allocation : a. While it might seem that allocating fast internet in every area makes a city ‘Smart’, in reality, it is more important to engage in smart allocation. This smart allocation would mean understanding how bandwidth is being used in specific areas and for the right cause. b. Itis alsoimportant to shift the data allocation based on timing and priority. It is assumed that financial and commercial areas require the most bandwidth during weekdays, while residential areas require it during the weekends. But the situation is much more complex. Data analytics can solve it. ce. For example, using applications of data analysis, a community can draw the attention of high-tech industries and in such cases; higher bandwidth will be required in such areas 6. Internet searching : a. When we use Google, we are using one of their many data analytics applications employed by the company. b. Most search engines like Google, Bing, Yahoo, AOL etc., use data analytics. These search engines use different algorithms to deliver the best result for a search query. Digital advertisement : a. Data analytics has revolutionized digital advertising. ns b. Digital billboards in cities as well as banners on websites, that is, most of the advertisement sources nowadays use data analytics using data algorithms. Que 1.16. | What are the different types of Big Data analytics ? Answer Different types of Big Data analytics : 1. Descriptive analytics : a. Itusesdata aggregation and data mining to provide insight into the past. Data Analyties 1-13 J (CS-5/IT-6) b. Descriptive analytics describe or summarize raw data and make it interpretable by humans. 2. Predictive analytics: a. It uses statistical models and forecasts techniques to understand the future. b. Predictive analytics provides companies with actionable insights based on data. It provides estimates about the likelihood ofa future outcome. 3. Prescriptive analytics : a. Ituses optimization and simulation algorithms to advice on possible outcomes. b. Itallows users to “prescribe” a number of different possible actions and guide them towards a solution. 4. 
Diagnostic analytics : a. It is used to determine why something happened in the past. b. Itis characterized by techniques such as drill-down, data discovery, data mining and correlations. ec. Diagnostic analytics takes a deeper look at data to understand the root causes of the events. PART-4 Data Analytics Life Cycle : Need, Key Roles For Successful Analytic Projects, Various Phases of Data Analytic Life Cycle : Discovery, Data Preparations. Questions-Answers Long Answer Type and Medium Answer Type Questions Que 1.17. | Explain the key roles for asuccessful analytics projects. Answer Key roles for a successful analytics project : 1. Business user: a. Business user is someone who understands the domain area and usually benefits from the results. b. This person can consult and advise the project team on the context of the project, the value of the results, and how the outputs will be operationalized. Introduction to Data Analytics 1-14 J (CS-5/IT-6) e Usually a business analyst, line manager, or deep subject matter expert in the project domain fulfills this role. 2. Project sponsor : a Project sponsor is responsible for the start of the project and provides all the requirements for the project and defines the core business problem. Generally provides the funding and gauges the degree of value from the final outputs of the working team. This person sets the priorities for the project and clarifies the desired outputs. 3. Project manager : Project manager ensures that key milestones and objectives are met on time and at the expected quality. 4. Business Intelligence Analyst : a Analyst provides business domain expertise based on a deep understanding of the data, Key Performance Indicators (KPIs), key metrics, and business intelligence from a reporting perspective. Business Intelligence Analysts generally create dashboards and reports and have knowledge of the data feeds and sources. 5. Database Administrator (DBA): a DBA provisions and configures the database environment to support the analytics needs of the working team. These responsibilities may include providing access to key databases or tables and ensuring the appropriate security levels are in place related to the data repositories. 6. Data engineer: Data engineer have deep technical skills to assist with tuning SQL queries for data management and data extraction, and provides support for data ingestion into the analytic sandbox. 7. Datascientist : a. Data scientist provides subject matter expertise for analytical techniques, data modeling, and applying valid analytical techniques to given business problems. They ensure overall analytics objectives are met. They designs and executes analytical methods and approaches with the data available to the project. Que 1.18. ] Explain various phases of data analytics life cycle. Answer Various phases of data analytic lifecycle are : Phase 1: Discovery: Data Analyties 1-15 J (CS-5/IT-6) 1 In Phase 1, the team learns the business domain, including relevant history such as whether the organization or business unit has attempted similar projects in the past from which they can learn. The team assesses the resources available to support the project in terms of people, technology, time, and data. Important activities in this phase include framing the business problem as an analytics challenge and formulating initial hypotheses (IHs) to test and begin learning the data. 
Phase 2 : Data preparation: L Phase 2 requires the presence of an analytic sandbox, in which the team can work with data and perform analytics for the duration of the project. The team needs to execute extract, load, and transform (ELT) or extract, transform and load (ETL) to get data into the sandbox. Data should be transformed in the ETL process so the team can work with it and analyze it. In this phase, the team also needs to familiarize itself with the data thoroughly and take steps to condition the data. Phase 3 : Model planning: L bo Phase 3 is model planning, where the team determines the methods, techniques, and workflow it intends to follow for the subsequent model building phase. The team explores the data to learn about the relationships between variables and subsequently selects key variables and the most suitable models. Phase 4: Model building: L we In phase 4, the team develops data sets for testing, training, and production purposes. In addition, in this phase the team builds and executes models based on the work done in the model planning phase. The team also considers whether its existing tools will be adequate for running the models, or if it will need a more robust environment for executing models and work flows. Phase 5: Communicate results : L i] In phase 5, the team, in collaboration with major stakeholders, determines if the results of the project are a success or a failure based on the criteria developed in phase 1 The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey findings to stakeholders. Phase 6 : Operationalize : L In phase 6, the team delivers final reports, briefings, code, and technical documents. Introduction to Data Analytics 1-16 J (CS-5/IT-6) 2. Inaddition, the team may run a pilot project to implement the models in a production environment. Que 1.19. | What are the activities should be performed while identifying potential data sources during discovery phase ? Answer Main activities that are performed while identifying potential data sources during discovery phase are : 1. Identify datasources : a. Make alist of candidate data sources the team may need to test the initial hypotheses outlined in discovery phase b. Make an inventory of the datasets currently available and those that can be purchased or otherwise acquired for the tests the team wants to perform. 2. Capture aggregate data sources : a. This is for previewing the data and providing high-level understanding. b. Itenables the team to gain a quick overview of the data and perform further exploration on specific areas. ec. Italso points the team to possible areas of interest within the data. 3. Review the raw data: a. Obtain preliminary data from initial data feeds b. Begin understanding the interdependencies among the data attributes, and become familiar with the content of the data, its quality, and its limitations. 4, Evaluate the datastructures and tools needed : a. The data type and structure dictate which tools the team can use to analyze the data. b. This evaluation gets the team thinking about which technologies may be good candidates for the project and how to start getting access to these tools. Scope the sort of data infrastructure needed for this type of problem : In addition to the tools needed, the data influences the kind of infrastructure required, such as disk storage and network capacity. Que 1.20. | Explain the sub-phases of data preparation. Answer Sub-phases of data preparation are: 1. 
Preparing an analytics sandbox : a Data Analyties 1-17 J (CS-5/IT-6) a The first sub-phase of data preparation requires the team to obtain an analytic sandbox in which the team can explore the data without interfering with live production databases. When developing the analytic sandbox, it is a best practice to collect all kinds of data there, as team members need access to high volumes and varieties of data for a Big Data analytics project. This can include everything from summary-level aggregated data, structured data, raw data feeds, and unstructured text data from call logs or web logs. 2. Performing ETLT: a In ETL, users perform extract, transform, load processes to extract data from a data store, perform data transformations, and lead the data back into the data store In this case, the datais extracted in its raw form and loaded into the data store, where analysts can choose to transform the data into a new state or leave it in its original, raw condition. 8. Learning about the data: a. Acritical aspect of a data science project is to become familiar with the data itself. Spending time to learn the nuances of the datasets provides context to understand what constitutes a reasonable value and expected output. In addition, it is important to catalogue the data sources that the team has access to and identify additional data sources that the team can leverage. 4. Data conditioning: a Data conditioning refers to the process of cleaning data, normalizing datasets, and performing transformations on the data. Data conditioning can involve many complex steps to join or merge datasets or otherwise get datasets into a state that enables analysis in further phases. It is viewed as processing step for data analysis. PART-5 Model Planning, Model Building, Communicating Results Open. Questions-Answers Long Answer Type and Medium Answer Type Questions Introduction to Data Analytics 1-18 J (CS-5/IT-6) Que 1.21. | What are activities that are performed in model planning phase? Answer Activities that are performed in model planning phase are : 1. Assess the structure of the datasets : a. The structure of the data sets is one factor that dictates the tools and analytical techniques for the next phase. b. Depending on whether the team plans to analyze textual data or transactional data different tools and approaches are required. i] Ensure that the analytical techniques enable the team to meet the business objectives and accept or reject the working hypotheses. Determine if the situation allows a single model or a series of techniques as part of a larger analytic workflow. Que 1.22. | What are the common tools for the model planning phase ? Answer Common tools for the model planning phase : 1° oR: a. It has acomplete set of modeling capabilities and provides a good environment for building interpretive models with high-quality code. b. It has the ability to interface with databases via an ODBC connection and execute statistical tests and analyses against Big Data via an open source connection. 2. SQL analysis services : SQL Analysis services can perform in- database analytics of common data mining functions, involved aggregations, and basic predictive models. 3. SAS/IACCESS: a. SAS/ACCESS provides integration between SAS and the analytics sandbox via multiple data connectors such as OBDC, JDBC, and OLE DB. b. SAS itselfis generally used on file extracts, but with SAS/ACCESS, users can connect to relational databases (such as Oracle) and data warehouse appliances, files, and enterprise applications. 
Que 1. 3.] Explain the common commercial tools for model building phase. Data Analyties 1-19 J (CS-5/IT-6) Answer Commercial common tools for the model building phase : L » a SAS enterprise Miner: a. SAS Enterprise Miner allows users to run predictive and descriptive models based on large volumes of data from across the enterprise. b. It interoperates with other large data stores, has many partnerships, and is built for enterprise-level computing and analytics. SPSS Modeler provided by IBM : It offers methods to explore and analyze data through a GUI. Matlab : Matlab provides a high-level language for performing a variety of data analytics, algorithms, and data exploration. Apline Miner : Alpine Miner provides a GUI frontend for users to develop analytic workflows and interact with Big Data tools and platforms on the backend. STATISTICA and Mathematica are also popular and well-regarded data mining and analytics tools. Que 1.24. | Explain common open-source tools for the model building phase. Answer Free or open source tools are: 1 RandPLiR: a. R provides a good environment for building interpretive models and PL/R is a procedural language for PostgreSQL with R. b. Using this approach means that R commands can be executed in database. e. This technique provides higher performance and is more scalable than running R in memory. Octave : a. It is a free software programming language for computational modeling, has some of the functionality of Matlab. b. Octave is used in major universities when teaching machine learning. WEKA: WEKAis a free data mining software package with an analytic workbench. The functions created in WEKA can be executed within Java code Python : Python is a programming language that provides toolkits for machine learning and analysis, such as numpy, scipy, pandas, and related data visualization using matplotlib. Introduction to Data Analytics 1-20 J (CS-5/IT-6) 5. MADIib : SQL in-database implementations, such as MADIib, provide an alternative to in-memory desktop analytical tools. MADIib provides an open-source machine learning library of algorithms that can be executed in-database, for PostgreSQL. ©O®O Data Analyties 2-1 J (CS-5/IT-6) UNIT Data Analysis CONTENTS Part-1 Part-2 Part-3 Part-4 Part-5 Data Analysis : Regression Modeling, a Multivariate Analysis Bayesian Modeling, Inference and Bayesian Networks, Support Vector and Kernel Methods Analysis of Time Series ©... Linear System Analysis of Non-Linear Dynamics, Rule Induction Neural Networks Learning and Generalisation, — Competitive Learning, Principal Component Analysis and Neural Networks Fuzzy Logic : Extracting Fuzzy ... Models From Data, Fuzzy Decision Trees, Stochastic Search Methods _.. 2-27 to 2-45 . 2-113 to 2-205 .. 2-203 to 2-285 2-7J to 2-11J Data Analysis 2-2 J (CS-5/IT-6) Data Analyiss : Regression Modeling, Multivarient Analysis. Questions-Answers Long Answer Type and Medium Answer Type Questions Que 2.1. | Write short notes on regression modeling. Answer 1. Regression models are widely used in analytics, in general being among the most easy to understand and interpret type of analytics techniques. Regression techniques allow the identification and estimation of possible relationships between a pattern or variable of interest, and factors that influence that pattern. be 3. For example, a company may be interested in understanding the effectiveness of its marketing strategies. 4. 
Aregression model can be used to understand and quantify which of its marketing activities actually drive sales, and to what extent. 5. Regression models are built to understand historical data and relationships to assess effectiveness, as in the marketing effectiveness models. 6. Regression techniques are used across a range of industries, including financial services, retail, telecom, pharmaceuticals, and medicine. Que 2.2. | What are the various types of regression analysis techniques ? Answer Various types of regression analysis techniques : 1. Linear regression : Linear regressions assumes that there is a linear relationship between the predictors (or the factors) and the target variable. wm Non-linear regression : Non-linear regression allows modeling of non-linear relationships. 3. Logistic regression : Logistic regression is useful when our target variable is binomial (accept or reject). 4, Time series regression : Time series regressions is used to forecast future behavior of variables based on historical time ordered data. Data Analyties 2-3 J (CS-5/IT-6) Que 2.3. | Write short note on linear regression models. Que 23] Linear regression model: 1. We consider the modelling between the dependent and one independent. variable. When there is only one independent variable in the regression model, the model is generally termed as a linear regression model. be Consider a simple linear regression model yHp +P Xte Where, y is termed as the dependent or study variable and X is termed as the independent or explanatory variable. The terms f, and f, are the parameters of the model. The parameter f, is termed as an intercept term, and the parameter §, is termed as the slope parameter. 3. These parameters are usually called as regression coefficients. The unobservable error component accounts for the failure of data to lie on the straight line and represents the difference between the true and observed realization of y. 4. There can be several reasons for such difference, such as the effect of all deleted variables in the model, variables may be qualitative, inherent randomness in the observations etc 5. Weassume that ¢ is observed as independent and identically distributed random variable with mean zero and constant variance o” and assume that cis normally distributed. 6. The independent variables are viewed as controlled by the experimenter, so it is considered as non-stochastic whereas y is viewed as a random variable with Fy) = 8, +B, Xand Var (y) =o. Sometimes X can also be a random variable. In such a case, instead of the sample mean and sample variance of y, we consider the conditional mean of y given X =x as FQ |x) = B, + Bx and the conditional variance of y given X =x as Var(y |x) =o 8. When the values of B,, B,, and o° are known, the model is completely described. The parameters B,, B, and o? are generally unknownin practice and < is unobserved. The determination of the statistical model y=B,+6,X + depends on the determination (i.e. estimation) of B,, B,, and o*. In order to know the values of these parameters, n pairs of observations (x,,y,(i = 1, ....,m) on (X,y) are observed/collected and are used to determine these unknown parameters. Data Analysis 2-4 J (CS-5/IT-6) Que 24. | Write short note on multivariate analysis. Answer 1, be 10. Multivariate analysis (MVA) is based on the principles of multivariate statistics, which involves observation and analysis of more than one statistical outcome variable at a time. 
These variables are nothing but prototypes of real time situations, products and services or decision making involving more than one variable. MVA is used to address the situations where multiple measurements are made on each experimental unit and the relations among these measurements and their structures are important. Multiple regression analysis refers to a set of techniques for studying the straight-line relationships among two or more variables. Multiple regression estimates the f’s in the equation 9; = Bo + Pity + Bova + + Bt, +5 Where, the x’s are the independent variables. y is the dependent variable. The subscript j represents the observation (row) number. The f'’s are the unknown regression coefficients. Their estimates are represented by b's. Each B represents the original unknown (population) parameter, while bis an estimate of this 6. The ¢, is the error (residual) of observation i Regression problem is solved by least squares. In least squares method regression analysis, the b’s are selected so as to minimize the sum of the squared residuals. This set of b’sis not necessarily the set we want, since they may be distorted by outliers points that are not representative of the data. Robust regression, an alternative to least squares, seeks to reduce the influence of outliers. Multiple regression analysis studies the relationship between a dependent (response) variable and p independent variables (predictors, regressors). The sample multiple regression equation is the point at which the regression plane intersects the Y axis. The 6, are the slopes of the regression plane in the direction of x,. These coefficients are called the partial-regression coefficients. Each partial regression coefficient represents the net effect the i* variable has on the dependent variable, holding the remaining a's in the equation constant Data Analyties 2-5 J (CS-5/IT-6) Bayesian Modeling, Inference and Bayesian Networks, Support Vector and Kernel Methods. Questions-Answers Long Answer Type and Medium Answer Type Questions Que 2.5. | Write short notes on Bayesian network. Answer 1. Bayesian networks area type of probabilistic graphical model that uses Bayesian inference for probability computations. A Bayesian network is a directed acyclic graph in which each edge corresponds to a conditional dependency, and each node corresponds to a unique random variable. Bayesian networks aim to model conditional dependence by representing edges in a directed graph. be go C | P(S=T)P(S=F) og 05 05 P(W=T) P(W=F) Fig. 2.5.1. 3. Through these relationships, one can efficiently conduct inference on the random variables in the graph through the use of factors. Data Analysis 2-6 J (CS-5/IT-6) 4. Using the relationships specified by our Bayesian network, we can obtain acompact, factorized representation of the joint probability distribution by taking advantage of conditional independence. Formally, if an edge (A, B) exists in the graph connecting random variables A and B, it means that P(B|A) is a factor in the joint probability distribution, so we must know P(B | A) for all values of B and A in order to conduct inference. In the Fig. 2.5.1, since Rain has an edge going into WetGrass, it means that P(WetGrass | Rain) will be a factor, whose probability values are specified next to the WetGrass node in a conditional probability table. Bayesian networks satisfy the Markov property, which states that a node is conditionally independent of its non-descendants given its parents. 
In the given example, this means that P(Sprinkler | Cloudy, Rain) = P(Sprinkler | Cloudy) Since Sprinkler is conditionally independent of its non-descendant, Rain, given Cloudy. Que 2.6. | Write short notes on inference over Bayesian network. Answer Inference over a Bayesian network can come in two forms. 1 First form: a. The first is simply evaluating the joint probability of a particular assignment of values for each variable (or a subset) in the network. b. For this, we already have a factorized form of the joint distribution, so we simply evaluate that product using the provided conditional probabilities. ec. Ifwe only care about a subset of variables, we will need to marginalize out the ones we are not interested in. dad Inmany cases, this may result in underflow, so it is common to take the logarithm of that product, which is equivalent to adding up the individual logarithms of each term in the product. Second form: In this form, inference task is to find P (x | ¢) or to find the probability of some assignment of a subset of the variables (x) given assignments of other variables (our evidence, ¢). In the example shown in Fig. 2.6.1, we have to find P(Sprinkler, WetGrass | Cloudy), where {Sprinkler, WetGrass} is our x, and {Cloudy} is oure. Tn order to calculate this, we use the fact that P(x|e) = P(x, e) / Ple) = oP(x, e), where « is a normalization constant that we will calculate at the end such that P(x|e)+ P(ax | e)=1. Data Analyties 2-7 J (CS-5/IT-6) d In order to calculate P(x, ¢), we must marginalize the joint probability distribution over the variables that do not appear inx or e, which we will denote as Y. Plxje)= a> Pixie, Y) For the given example in Fig. 2.6.1 we can caleulate P(Sprinkler, WetGrass | Cloudy) as follows : (Sprinkler, WetGrass | Cloudy) = a = P(WetGrass | Sprinkler,Rain)P(Sprinker | Cloudy)P(Rain | Cloudy) P(Cloudy) = aP(WetGrass | Sprinkler, Rain)P(Sprinker | Cloudy)P(Rain | Cloudy) P(Cloudy) + oP(WetGrass | Sprinkler—Rain)P(Sprinker | Cloudy)P(—Rain | Cloudy) P(Cloudy) PART-3 Analysis of Time Series : Linear System Analysis of Non-Lineor Dynamics, Rule Introduction. Questions-Answers Long Answer Type and Medium Answer Type Questions Que 2.7. | Explain the application of time series analysis. Answer Applications of time series analysis: L 2. Retail sales : a. For various product lines, a clothing retailer is looking to forecast future monthly sales. b. These forecasts need to account for the seasonal aspects of the customer's purchasing decisions. ce. An appropriate time series model needs to account for fluctuating demand over the calendar year. Spare parts planning: a. Companies service organizations have to forecast future spare part demands to ensure an adequate supply of parts to repair customer Data Analysis 2-8 J (CS-5/IT-6) products. Often the spares inventory consists of thousands of distinct part numbers. b. To forecast future demand, complex models for each part number can be built using input variables such as expected part failure rates, service diagnostic effectiveness and forecasted new product shipments. c. However, time series analysis can provide accurate short-term forecasts based simply on prior spare part demand history. Stock trading: a. Some high-frequency stock traders utilize a technique called pairs trading. b. In pairs trading, an identified strong positive correlation between the prices of two stocks is used to detect a market opportunity. ec. Suppose the stock prices of Company A and Company B consistently move together. d. 
Time series analysis can be applied to the difference of these companies’ stock prices over time. e. Astatistically larger than expected price difference indicates that it is a good time to buy the stock of Company A and sell the stock of Company B, or vice versa. Que 2.8. | What are the components of time series ? Answer Atime series can consist of the following components : 1 Trends: a. The trend refers to the long-term movement in a time series. b. It indicates whether the observation values are increasing or decreasing over time. ce. Examples of trends are a steady increase in sales month over month or an annual decline of fatalities due to car accidents. Seasonality : a. The seasonality component describes the fixed, periodic fluctuation in the observations over time b. Itis oftenrelated to the calendar. e. For example, monthly retail sales can fluctuate over the year due to the weather and holidays. Cyclie: a. Acyclic component also refers to a periodic fluctuation, which is not as fixed. Data Analyties 2-9 J (CS-5/IT-6) b. For example, retails sales are influenced by the general state of the economy. Que 2.9. | Explain rule induction. Answer 1. me oo mp oO Rule induction is a data mining process of deducing if-then rules from a dataset. These symbolic decision rules explain an inherent relationship between the attributes and class labels in the dataset Many real-life experiences are based on intuitive rule induction. Rule induction provides a powerful classification approach that can be easily understood by the general users. Itis used in predictive analytics by classification of unknown data. Rule induction is also used to describe the patterns in the data. The easiest way to extract rules from a data set is from a decision tree that is developed on the same data set. Que 2.10. | Explain an iterative procedure of extracting rules from data sets. Answer 1 be Sequential covering is an iterative procedure of extracting rules from the data sets. The sequential covering approach attempts to find all the rules in the data set class by class. One specific implementation of the sequential covering approach is called the RIPPER, which stands for Repeated Incremental Pruning to Produce Error Reduction. Following are the steps in sequential covering rules generation approach : Step 1: Class selection: a. The algorithm starts with selection of class labels one by one. b. The rule set is class-ordered where all the rules for a class are developed before moving on to next class. P The first class is usually the least-frequent class label. da. From Fig. 2.10.1, the least frequent class is “+” and the algorithm focuses on generating all the rules for “+” class. Data Analysis 2-10 J (CS-5/IT-6) Fig. 2.10.1. Data set with two classes and two dimensions. Step 2: Rule development: a. The objective in this step is to cover all “+” data points using classification rules with none or as few “—” as possible b, For example, in Fig. 2.10.2. rule r, identifies the area of four “+” in the top left corner. Fig. 2.10.2. Generation of ruler r,. e. Since this rule is based on simple logic operators in conjuncts, the boundary is rectilinear. da Once rule 7, is formed, the entire data points covered by r, are eliminated and the next best rule is found from data sets. Step 3: Learn-One-Rule: a. Each ruler, is grown by the learn-one-rule approach. b. Each rule starts with an empty rule set and conjuncts are added one by one to increase the rule accuracy. e. 
Que 2.10. Explain an iterative procedure for extracting rules from data sets.

Answer
1. Sequential covering is an iterative procedure for extracting rules from data sets.
2. The sequential covering approach attempts to find all the rules in the data set class by class.
3. One specific implementation of the sequential covering approach is called RIPPER, which stands for Repeated Incremental Pruning to Produce Error Reduction.
Following are the steps in the sequential covering rule generation approach :
Step 1 : Class selection :
   a. The algorithm starts with the selection of class labels one by one.
   b. The rule set is class-ordered, where all the rules for a class are developed before moving on to the next class.
   c. The first class is usually the least-frequent class label.
   d. In Fig. 2.10.1, the least frequent class is "+" and the algorithm focuses on generating all the rules for the "+" class.
Fig. 2.10.1. Data set with two classes and two dimensions.
Step 2 : Rule development :
   a. The objective in this step is to cover all "+" data points using classification rules with none or as few "-" as possible.
   b. For example, in Fig. 2.10.2, rule r1 identifies the area of four "+" points in the top left corner.
Fig. 2.10.2. Generation of rule r1.
   c. Since this rule is based on simple logic operators in conjuncts, the boundary is rectilinear.
   d. Once rule r1 is formed, the data points covered by r1 are eliminated and the next best rule is found from the data set.
Step 3 : Learn-One-Rule :
   a. Each rule ri is grown by the learn-one-rule approach.
   b. Each rule starts with an empty rule set, and conjuncts are added one by one to increase the rule accuracy.
   c. Rule accuracy is the ratio of the number of "+" records covered by the rule to all records covered by the rule :
         Rule accuracy A(ri) = (correct records covered by the rule) / (all records covered by the rule)
   d. Learn-one-rule starts with an empty rule set : if {} then class = "+".
   e. The accuracy of this rule is the same as the proportion of "+" data points in the data set. The algorithm then greedily adds conjuncts until the accuracy reaches 100 %.
   f. If the addition of a conjunct decreases the accuracy, the algorithm looks for other conjuncts, or stops and starts the iteration of the next rule.
Step 4 : Next rule :
   a. After a rule is developed, all the data points covered by the rule are eliminated from the data set.
   b. The above steps are repeated for the next rule to cover the rest of the "+" data points.
   c. In Fig. 2.10.3, rule r2 is developed after the data points covered by r1 are eliminated.
Fig. 2.10.3. Elimination of the r1 data points and the next rule.
Step 5 : Development of rule set :
   a. After the rule set is developed to identify all "+" data points, the rule model is evaluated with a data set used for pruning to reduce generalization errors.
   b. The metric used to evaluate the need for pruning is (p - n) / (p + n), where p is the number of positive records covered by the rule and n is the number of negative records covered by the rule.
   c. All rules that identify "+" data points are aggregated to form a rule group.
A compact sketch of this whole procedure is given below.
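The following is a minimal, illustrative sketch of sequential covering with a greedy learn-one-rule step. It is not the full RIPPER algorithm (there is no pruning), and the toy records and attribute names are assumptions:

```python
# Minimal sketch of sequential covering with a greedy learn-one-rule step.
def accuracy(rule, records, target="+"):
    covered = [r for r in records if all(r[a] == v for a, v in rule.items())]
    correct = sum(1 for r in covered if r["class"] == target)
    return (correct / len(covered) if covered else 0.0), covered

def learn_one_rule(records, attributes, target="+"):
    rule = {}                                    # start from: if {} then class = "+"
    best_acc, _ = accuracy(rule, records, target)
    improved = True
    while improved and best_acc < 1.0:
        improved, best_conjunct = False, None
        for a in attributes:
            if a in rule:
                continue
            for v in {r[a] for r in records}:    # try every candidate conjunct a = v
                acc, _ = accuracy({**rule, a: v}, records, target)
                if acc > best_acc:
                    best_acc, best_conjunct, improved = acc, (a, v), True
        if improved:
            rule[best_conjunct[0]] = best_conjunct[1]
    return rule

def sequential_covering(records, attributes, target="+"):
    rules, remaining = [], list(records)
    while any(r["class"] == target for r in remaining):
        rule = learn_one_rule(remaining, attributes, target)
        if not rule:                             # no conjunct improves accuracy: stop
            break
        _, covered = accuracy(rule, remaining, target)
        rules.append(rule)
        remaining = [r for r in remaining if r not in covered]   # eliminate covered points
    return rules

data = [{"x": "low",  "y": "high", "class": "+"},
        {"x": "low",  "y": "high", "class": "+"},
        {"x": "low",  "y": "low",  "class": "-"},
        {"x": "high", "y": "low",  "class": "-"}]
print(sequential_covering(data, ["x", "y"]))     # -> [{'y': 'high'}]
```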
PART-4
Neural Networks : Learning and Generalization, Competitive Learning, Principal Component Analysis and Neural Networks.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 2.11. Describe supervised learning and unsupervised learning.

Answer
Supervised learning :
1. Supervised learning is also known as associative learning, in which the network is trained by providing it with input and matching output patterns.
2. Supervised training requires the pairing of each input vector with a target vector representing the desired output.
3. The input vector together with the corresponding target vector is called a training pair.
4. To solve a problem with supervised learning, the following steps are considered :
   a. Determine the type of training examples.
   b. Gather a training set.
   c. Determine the input feature representation of the learned function.
   d. Determine the structure of the learned function and the corresponding learning algorithm.
   e. Complete the design.
5. Supervised learning can be classified into two categories :
   i. Classification
   ii. Regression
Unsupervised learning :
1. In unsupervised learning, an output unit is trained to respond to clusters of patterns within the input.
2. In this method of training, input vectors of similar type are grouped without the use of training data to specify how a typical member of each group looks or to which group a member belongs.
3. Unsupervised training does not require a teacher; it requires certain guidelines to form groups.
4. Unsupervised learning can be classified into two categories :
   i. Clustering
   ii. Association

Que 2.12. Differentiate between supervised learning and unsupervised learning.

Answer
Difference between supervised and unsupervised learning :

S.No. | Supervised learning | Unsupervised learning
1. | It uses known and labeled data as input. | It uses unknown (unlabeled) data as input.
2. | Computational complexity is very high. | Computational complexity is less.
3. | It uses offline analysis. | It uses real-time analysis of data.
4. | The number of classes is known. | The number of classes is not known.
5. | Results are accurate and reliable. | Results are moderately accurate and reliable.

Que 2.13. What is the multilayer perceptron model ? Explain it.

Answer
1. The multilayer perceptron is a class of feed-forward artificial neural network.
2. The multilayer perceptron model has three layers : an input layer, an output layer, and a layer in between that is not connected directly to the input or the output and is hence called the hidden layer.
3. For the perceptrons in the input layer we use a linear transfer function, and for the perceptrons in the hidden layer and the output layer we use a sigmoidal or squashed-S function.
4. The input layer serves to distribute the values it receives to the next layer and so does not perform a weighted sum or threshold.
5. The input-output mapping of the multilayer perceptron is shown in Fig. 2.13.1.
Fig. 2.13.1. Input layer, hidden layer and output layer of a multilayer perceptron.
6. A multilayer perceptron does not increase computational power over a single-layer neural network unless there is a non-linear activation function between layers.

Que 2.14. Draw and explain the multilayer (multiple) perceptron with its learning algorithm.

Answer
1. Perceptrons arranged in layers are called a multilayer (multiple) perceptron.
2. This model has three kinds of layers : an input layer, an output layer and one or more hidden layers.
3. For the perceptrons in the input layer a linear transfer function is used, and for the perceptrons in the hidden layer and output layer a sigmoidal or squashed-S function is used.
4. The input signal propagates through the network in a forward direction.
5. In the multilayer perceptron the bias b(n) is treated as a synaptic weight driven by a fixed input equal to +1 :
      x(n) = [+1, x1(n), x2(n), ..., xm(n)]^T
   where n denotes the iteration step in applying the algorithm. Correspondingly, we define the weight vector as :
      w(n) = [b(n), w1(n), w2(n), ..., wm(n)]^T
6. Accordingly, the linear combiner output is written in the compact form :
      v(n) = Σ (i = 0 to m) wi(n) xi(n) = w^T(n) x(n)
Architecture of multilayer perceptron :
Fig. 2.14.1. Architecture of a multilayer perceptron with an input layer, first and second hidden layers, and an output layer.
7. Fig. 2.14.1 shows the architectural model of a multilayer perceptron with two hidden layers and an output layer.
8. Signal flow through the network progresses in a forward direction, from left to right and on a layer-by-layer basis.
Learning algorithm :
1. If the nth member of the input set, x(n), is correctly classified into the linearly separable classes by the weight vector w(n), then no adjustment of weights is done :
      w(n + 1) = w(n)  if  w^T(n) x(n) > 0 and x(n) belongs to class G1,
      w(n + 1) = w(n)  if  w^T(n) x(n) ≤ 0 and x(n) belongs to class G2.
2. Otherwise, the weight vector of the perceptron is updated in accordance with the rule :
      w(n + 1) = w(n) - η(n) x(n)  if  w^T(n) x(n) > 0 and x(n) belongs to class G2,
      w(n + 1) = w(n) + η(n) x(n)  if  w^T(n) x(n) ≤ 0 and x(n) belongs to class G1,
   where η(n) is the learning-rate parameter. A short sketch of this update rule follows.
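As a rough illustration of this update rule (a sketch only; the toy data, class labels and learning-rate value are assumptions), the loop below folds the bias into the weight vector exactly as described above:

```python
import numpy as np

# Sketch of the perceptron weight-update rule described above. The bias is
# folded into the weight vector by prepending a fixed +1 input.
def train_perceptron(X, labels, eta=0.1, epochs=20):
    X = np.hstack([np.ones((len(X), 1)), X])       # x(n) = [+1, x1(n), ..., xm(n)]^T
    w = np.zeros(X.shape[1])                       # w(n) = [b(n), w1(n), ..., wm(n)]^T
    for _ in range(epochs):
        for x, cls in zip(X, labels):              # cls = +1 for class G1, -1 for class G2
            v = w @ x                              # linear combiner output v(n) = w^T(n) x(n)
            if v > 0 and cls == -1:                # misclassified G2 sample
                w = w - eta * x
            elif v <= 0 and cls == +1:             # misclassified G1 sample
                w = w + eta * x
            # correctly classified samples leave the weights unchanged: w(n+1) = w(n)
    return w

X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
labels = np.array([+1, +1, -1, -1])
print(train_perceptron(X, labels))                 # learned [bias, w1, w2]
```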
Que 2.15. Explain the algorithms used to optimize the network size.

Answer
Algorithms to optimize the network size are :
1. Growing algorithms :
   a. This group of algorithms begins with training a relatively small neural architecture and allows new units and connections to be added during the training process, when necessary.
   b. Three growing algorithms are commonly applied : the upstart algorithm, the tiling algorithm, and cascade correlation.
   c. The first two apply to binary input/output variables and networks with a step activation function.
   d. The third one, which is applicable to problems with continuous input/output variables and to units with a sigmoidal activation function, keeps adding units to the hidden layer until a satisfactory error value is reached on the training set.
2. Pruning algorithms :
   a. The general pruning approach consists of training a relatively large network and gradually removing either weights or complete units that seem not to be necessary.
   b. The large initial size allows the network to learn quickly and with a lower sensitivity to initial conditions and local minima.
   c. The reduced final size helps to improve generalization.
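As a rough illustration of the pruning idea only (not any specific published pruning algorithm), the sketch below zeroes out the smallest-magnitude weights of a trained layer; the weight values and pruning fraction are assumptions:

```python
import numpy as np

# Illustrative magnitude-based pruning: weights whose absolute value falls
# below a chosen threshold are treated as unnecessary and set to zero.
def prune_weights(W, fraction=0.5):
    threshold = np.quantile(np.abs(W), fraction)   # prune the smallest `fraction` of weights
    mask = np.abs(W) >= threshold
    return W * mask, mask

W = np.array([[0.80, -0.05, 0.30],
              [0.02, -0.90, 0.10]])                # assumed trained hidden-layer weights
pruned, mask = prune_weights(W, fraction=0.5)
print(pruned)   # small weights removed; the reduced network can then be retrained
```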
Que 2.16. Explain the approaches for knowledge extraction from multilayer perceptrons.

Answer
Approaches for knowledge extraction from multilayer perceptrons :
a. Global approach :
   1. This approach extracts a set of rules characterizing the behaviour of the whole network in terms of its input/output mapping.
   2. A tree of candidate rules is defined. The node at the top of the tree represents the most general rule and the nodes at the bottom of the tree represent the most specific rules.
   3. Each candidate symbolic rule is tested against the network's behaviour, to see whether such a rule can apply.
   4. The process of rule verification continues until most of the training set is covered.
   5. One of the problems connected with this approach is that the number of candidate rules can become huge when the rule space becomes more detailed.
b. Local approach :
   1. This approach decomposes the original multilayer network into a collection of smaller, usually single-layered, sub-networks, whose input/output mapping might be easier to model in terms of symbolic rules.
   2. Based on the assumption that hidden and output units, though sigmoidal, can be approximated by threshold functions, individual units inside each sub-network are modeled by interpreting the incoming weights as the antecedent of a symbolic rule.
   3. The resulting symbolic rules are gradually combined together to define a more general set of rules that describes the network as a whole.
   4. The monotonicity of the activation function is required, to limit the number of candidate symbolic rules for each unit.
   5. Local rule-extraction methods usually employ a special error function and/or a modified learning algorithm, to encourage hidden and output units to stay in a range consistent with possible rules and to achieve networks with the smallest number of units and weights.

Que 2.17. Discuss the selection of various parameters in BPN.

Answer
Selection of various parameters in BPN (Back Propagation Network) :
1. Number of hidden nodes :
   i. The guiding criterion is to select the minimum number of nodes that does not impair the network performance, so that the memory demand for storing the weights can be kept to a minimum.
   ii. When the number of hidden nodes is equal to the number of training patterns, learning can be fastest.
   iii. In such cases, however, the Back Propagation Network (BPN) simply remembers the training patterns, losing all generalization capability.
   iv. Hence, as far as generalization is concerned, the number of hidden nodes should be small compared to the number of training patterns (say 10 : 1).
2. Momentum coefficient (α) :
   i. Another method of reducing the training time is the use of a momentum factor, because it enhances the training process.
Fig. 2.17.1. Influence of the momentum term on the weight change (weight change without momentum vs. with the momentum term).
   ii. The momentum also overcomes the effect of local minima.
   iii. It will carry a weight change process through one or more local minima and get it into the global minimum.
3. Sigmoidal gain (λ) :
   i. When the weights become large and force the neuron to operate in a region where the sigmoidal function is very flat, a better method of coping with network paralysis is to adjust the sigmoidal gain.
   ii. By decreasing this scaling factor, we effectively spread out the sigmoidal function over a wide range so that training proceeds faster.
4. Local minima :
   i. One of the most practical solutions involves the introduction of a shock which changes all weights by specific or random amounts.
   ii. If this fails, then the solution is to re-randomize the weights and start the training all over again.
   iii. Simulated annealing can be used to continue training until a local minimum is reached.
   iv. After this, simulated annealing is stopped and BPN training continues until the global minimum is reached.
   v. In most cases, only a few simulated annealing cycles of this two-stage process are needed.
5. Learning coefficient (η) :
   i. The learning coefficient cannot be negative, because this would cause the change of the weight vector to move away from the ideal weight vector position.
   ii. If the learning coefficient is zero, no learning takes place; hence, the learning coefficient must be positive.
   iii. If the learning coefficient is greater than 1, the weight vector will overshoot its ideal position and oscillate.
   iv. Hence, the learning coefficient must be between zero and one.

Que 2.18. What is learning rate ? What is its function ?

Answer
1. The learning rate is a constant used in a learning algorithm that defines the speed and extent of the corrections to the weight matrix.
2. Setting a high learning rate tends to bring instability, and the system finds it difficult to converge even to a near-optimum solution.
3. A low value will improve stability, but will slow down convergence.
Learning function :
1. In most applications the learning rate is a simple function of time, for example L.R. = 1 / (1 + t), where t is the epoch number.
2. Such functions have the advantage of taking high values during the first epochs, making large corrections to the weight matrix, and smaller values later, when the corrections need to be more precise.
3. Using a fuzzy controller to adaptively tune the learning rate has the added advantage of bringing expert knowledge into use.
4. If it were possible to manually adapt the learning rate in every epoch, we would follow rules of the kind listed below :
   a. If the change in error is small, then increase the learning rate.
   b. If there are a lot of sign changes in the error, then largely decrease the learning rate.
   c. If the change in error is small and the speed of error change is small, then make a large increase in the learning rate.
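A minimal sketch of such a time-decaying learning rate inside a gradient-descent loop is shown below; the loss function and starting point are assumptions for illustration:

```python
# Gradient descent on a toy quadratic loss with the decaying
# learning rate L.R. = 1 / (1 + t).
def gradient(w):
    return 2 * (w - 3.0)          # derivative of the assumed loss (w - 3)^2

w = 10.0
for t in range(20):
    lr = 1.0 / (1.0 + t)          # large corrections early, finer corrections later
    w = w - lr * gradient(w)
print(w)                          # approaches the minimum at w = 3
```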
Que 2.19. Explain competitive learning.

Answer
1. Competitive learning is a form of unsupervised learning in artificial neural networks, in which nodes compete for the right to respond to a subset of the input data.
2. A variant of Hebbian learning, competitive learning works by increasing the specialization of each node in the network. It is well suited to finding clusters within data.
3. Models and algorithms based on the principle of competitive learning include vector quantization and self-organizing maps.
4. In a competitive learning model, there are hierarchical sets of units in the network with inhibitory and excitatory connections.
5. The excitatory connections are between individual layers and the inhibitory connections are between units in layered clusters.
6. Units in a cluster are either active or inactive.
7. There are three basic elements of a competitive learning rule :
   a. A set of neurons that are all the same except for some randomly distributed synaptic weights, and which therefore respond differently to a given set of input patterns.
   b. A limit imposed on the "strength" of each neuron.
   c. A mechanism that permits the neurons to compete for the right to respond to a given subset of inputs, such that only one output neuron (or only one neuron per group) is active (i.e., "on") at a time. The neuron that wins the competition is called a "winner-take-all" neuron.

Que 2.20. Explain Principal Component Analysis (PCA) in data analysis.

Answer
1. PCA is a method used to reduce the number of variables in a dataset by extracting the important ones from a large dataset.
2. It reduces the dimension of our data with the aim of retaining as much information as possible.
3. In other words, this method combines highly correlated variables together to form a smaller number of artificial variables, called principal components (PCs), that account for most of the variance in the data.
4. A principal component can be defined as a linear combination of optimally-weighted observed variables.
5. The first principal component retains the maximum variation that was present in the original components.
6. The principal components are the eigenvectors of a covariance matrix, and hence they are orthogonal.
7. The output of PCA is these principal components, the number of which is less than or equal to the number of original variables.
8. The PCs possess some useful properties, which are listed below :
   a. The PCs are essentially linear combinations of the original variables and the weights vector.
   b. The PCs are orthogonal.
   c. The variation present in the PCs decreases as we move from the 1st PC to the last one.
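These properties can be checked directly with a small numpy sketch that computes the principal components as the eigenvectors of the covariance matrix; the random data used here is an assumption for illustration:

```python
import numpy as np

# PCA via the eigen-decomposition of the covariance matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] = 0.9 * X[:, 0] + 0.1 * X[:, 2]          # make two variables highly correlated

Xc = X - X.mean(axis=0)                          # center the data
cov = np.cov(Xc, rowvar=False)                   # covariance matrix of the variables
eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvectors = principal components

order = np.argsort(eigvals)[::-1]                # sort PCs by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                            # project the data onto the PCs
print(eigvals / eigvals.sum())                   # fraction of variance per PC (decreasing)
print(np.round(eigvecs.T @ eigvecs, 6))          # identity matrix: the PCs are orthogonal
```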
PART-5
Fuzzy Logic : Extracting Fuzzy Models from Data, Fuzzy Decision Trees, Stochastic Search Methods.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 2.21. Define fuzzy logic and its importance in our daily life. What is the role of crisp sets in fuzzy logic ?

Answer
1. Fuzzy logic is an approach to computing based on "degrees of truth" rather than "true or false" (1 or 0).
2. Fuzzy logic includes 0 and 1 as extreme cases of truth but also includes the various states of truth in between.
3. Fuzzy logic allows the inclusion of human assessments in computing problems.
4. It provides an effective means for conflict resolution of multiple criteria and better assessment of options.
Importance of fuzzy logic in daily life :
1. Fuzzy logic is essential for the development of human-like capabilities in AI.
2. It is used in the development of intelligent systems for decision making, identification, optimization, and control.
3. Fuzzy logic is extremely useful for many people involved in research and development, including engineers, mathematicians, computer software developers and researchers.
4. Fuzzy logic has been used in numerous applications such as facial pattern recognition, air conditioners, vacuum cleaners, weather forecasting systems, medical diagnosis and stock trading.
Role of crisp sets in fuzzy logic :
1. A crisp set contains the precise location of the set boundaries.
2. It provides the membership value of the set.

Que 2.22. Define classical sets and fuzzy sets. State the importance of fuzzy sets.

Answer
Classical set :
1. A classical set is a collection of distinct objects.
2. Each individual entity in a set is called a member or an element of the set.
3. The classical set is defined in such a way that the universe of discourse is split into two groups : members and non-members.
Fuzzy set :
1. A fuzzy set is a set whose elements have degrees of membership between 0 and 1.
2. A fuzzy set A in the universe of discourse U can be defined as a set of ordered pairs, given by
      A = {(x, μA(x)) | x ∈ U}
   where μA(x) is the degree of membership of x in A.
Importance of fuzzy sets :
1. They are used for the modeling and inclusion of contradiction in a knowledge base.
2. They also increase the system autonomy.
3. They act as an important part of microchip-processor-based appliances.

Que 2.23. Compare and contrast classical logic and fuzzy logic.

Answer

S.No. | Crisp (classical) logic | Fuzzy logic
1. | In classical logic an element either belongs to or does not belong to a set. | Fuzzy logic supports a flexible sense of membership of elements to a set.
2. | Crisp logic is built on 2-state truth values (True/False). | Fuzzy logic is built on multi-state truth values.
3. | A statement which is either True or False, but not both, is called a proposition in crisp logic. | A fuzzy proposition is a statement which acquires a fuzzy truth value.
4. | The law of the excluded middle and the law of non-contradiction hold good in crisp logic. | The law of the excluded middle and the law of contradiction are violated.

Que 2.24. Define the membership function and state its importance in fuzzy logic. Also discuss the features of membership functions.

Answer
Membership function :
1. A membership function for a fuzzy set A on the universe of discourse X is defined as μA : X → [0, 1], where each element of X is mapped to a value between 0 and 1.
2. This value, called the membership value or degree of membership, quantifies the grade of membership of the element of X in the fuzzy set A.
3. Membership functions characterize fuzziness (i.e., all the information in a fuzzy set), whether the elements in the fuzzy set are discrete or continuous.
4. Membership functions can be seen as a technique to solve practical problems by experience rather than knowledge.
5. Membership functions are represented in graphical form.
Importance of membership functions in fuzzy logic :
1. They allow us to graphically represent a fuzzy set.
2. They help in carrying out the different fuzzy set operations.
Features of a membership function :
1. Core :
   a. The core of a membership function for some fuzzy set A is defined as that region of the universe that is characterized by complete and full membership in the set.
   b. The core comprises those elements x of the universe such that μA(x) = 1.
2. Support :
   a. The support of a membership function for some fuzzy set A is defined as that region of the universe that is characterized by non-zero membership in the set A.
   b. The support comprises those elements x of the universe such that μA(x) > 0.
Fig. 2.24.1. Core, support, and boundaries of a fuzzy set.
3. Boundaries :
   a. The boundaries of a membership function for some fuzzy set A are defined as that region of the universe containing elements that have non-zero membership but not complete membership.
   b. The boundaries comprise those elements x of the universe such that 0 < μA(x) < 1.
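A triangular membership function is a simple way to see the core, support and boundaries in code; the shape parameters below are assumptions chosen for illustration:

```python
# Triangular membership function mu_A(x) with support (a, c) and core {b}.
# The parameter values are assumed purely for illustration.
def triangular(x, a=2.0, b=5.0, c=8.0):
    if x <= a or x >= c:
        return 0.0                      # outside the support
    if x <= b:
        return (x - a) / (b - a)        # rising boundary: 0 < mu < 1
    return (c - x) / (c - b)            # falling boundary: 0 < mu < 1

for x in [1, 3, 5, 7, 9]:
    print(x, triangular(x))             # mu = 1 only at the core x = b = 5
```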
Que 2.25. Explain inference in fuzzy logic.

Answer
Fuzzy inference :
1. Inference is a technique by which, from a given set of facts (premises) F1, F2, ......, Fn, a goal G is to be derived.
2. Fuzzy inference is the process of formulating the mapping from a given input to an output using fuzzy logic.
3. The mapping then provides a basis from which decisions can be made.
4. Fuzzy inference (approximate reasoning) refers to computational procedures used for evaluating linguistic (IF-THEN) descriptions.
The two important inferring procedures are :
i. Generalized Modus Ponens (GMP) :
   1. GMP is formally stated as :
         If x is A THEN y is B
         x is A'
         -----------------------
         y is B'
      Here A, B, A' and B' are fuzzy terms. Every fuzzy linguistic statement above the line is analytically known, and what is below the line is analytically unknown.
   2. Here B' = A' o R(x, y), where 'o' denotes the max-min composition and R(x, y) is the IF-THEN implication relation.
   3. The membership function is
         μB'(y) = max_x ( min ( μA'(x), μR(x, y) ) )
      where μB'(y) is the membership function of B', μA'(x) is the membership function of A', and μR(x, y) is the membership function of the implication relation.
ii. Generalized Modus Tollens (GMT) :
   1. GMT is defined as :
         If x is A THEN y is B
         y is B'
         -----------------------
         x is A'
   2. The membership of A' is computed as A' = B' o R(x, y).
   3. In terms of membership functions,
         μA'(x) = max_y ( min ( μB'(y), μR(x, y) ) )

Que 2.26. Explain Fuzzy Decision Trees (FDT).

Answer
1. Decision trees are one of the most popular methods for learning and reasoning from instances.
2. Given a set of n input-output training patterns D = {(X^i, y^i), i = 1, ....., n}, each training pattern X^i is described by a set of p conditional (or input) attributes (c1, ....., cp) and one corresponding discrete class label y^i, where y^i ∈ {1, ....., g} and g is the number of classes.
3. The decision attribute y^i represents posterior knowledge regarding the class of each pattern.
4. An arbitrary class has been indexed by l (1 ≤ l ≤ g).
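Returning to the GMP rule in Que 2.25, the max-min composition B' = A' o R can be computed directly. The fuzzy sets below and the use of the min operator to build the implication relation are assumptions made for illustration:

```python
import numpy as np

# Generalized Modus Ponens via max-min composition: B' = A' o R.
# A, B, A' and the min-based implication relation are illustrative assumptions.
A  = np.array([1.0, 0.6, 0.2])                 # mu_A over the x-universe
B  = np.array([0.3, 1.0, 0.7])                 # mu_B over the y-universe
A_ = np.array([0.8, 1.0, 0.4])                 # observed fact "x is A'"

R = np.minimum.outer(A, B)                     # mu_R(x, y) = min(mu_A(x), mu_B(y))

# mu_B'(y) = max over x of min(mu_A'(x), mu_R(x, y))
B_ = np.max(np.minimum(A_[:, None], R), axis=0)
print(B_)                                      # membership function of the conclusion B'
```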
