Data Warehouse and Data Mining Organizer

The document discusses the curriculum changes for the Data Warehousing and Data Mining course at MAKAUT, which has been moved to the 6th semester. It covers key concepts such as data warehousing, data mining, and their applications, as well as recent trends and differences between data marts and data warehouses. Additionally, it includes model questions and answers to help students prepare for university assessments.


DATA WAREHOUSING AND DATA MINING

Introduction to Data Warehousing
Classification and Prediction
Mining Time Series Data
Mining Data Streams
Web Mining
Recent Trends in Distributed Warehousing

NOTE: The MAKAUT course structure and syllabus of the 6th semester has been changed from 2021. Previously, DATA WAREHOUSING AND DATA MINING was in the 7th semester. This subject has been redesigned and shifted to the 6th semester as per the present curriculum, and the subject organization has changed slightly. Taking special care of this matter, we are providing the relevant MAKAUT university solutions, along with some model questions and answers for the newly introduced topics, so that students can get an idea of the university question patterns.

POPULAR PUBLICATIONS

Multiple Choice Type Questions

1. A data warehouse is an integrated collection of data because
   a) It is a collection of data of different types
   b) It is a collection of data derived from multiple sources
   c) It is a relational database
   d) It contains summarized data
Answer: (b)

2. A data warehouse is said to contain a 'subject-oriented' collection of data because
   a) Its contents have a common theme
   b) It is built for a specific application
   c) It cannot support multiple subjects
   d) It is a generalization of 'object-oriented'
Answer: (a)

3. A data warehouse is said to contain a 'time-varying' collection of data because [WBUT 2010, 2013, 2015]
   a) Its content varies automatically with time
   b) Its life-span is very limited
   c) Every key structure of the data warehouse contains, either implicitly or explicitly, an element of time
   d) Its content has an explicit time-stamp
Answer: (c)

4. Data warehousing is used for [WBUT 2010, 2012]
   a) Decision support systems
   b) OLAP applications
   c) OLTP applications
   d) Data manipulation applications
Answer: (a)

5. Which of the following is TRUE?
[WBUT 2010, 2012]
   a) Data warehouse can be used for analytical processing only
   b) Data warehouse can be used for information processing (query, report) only
   c) Data warehouse can be used for data mining only
   d) Data warehouse can be used for information processing (query, report), analytical processing and data mining
Answer: (d)

6. Data warehouse architecture is just an overall guideline; it is not a blueprint for the data warehouse.
   a) True
   b) False
Answer: (b)

7. The most distinguishing characteristic of DSS data is [WBUT 2011]
   a) Granularity
   b) Timespan
   c) Dimensionality
   d) Data currency
Answer: (c)

8. A data warehouse is built as a separate repository of data, different from the operational data of an enterprise, because [WBUT 2012, 2013]
   a) It is necessary to keep the operational data free of any warehouse operations
   b) A data warehouse cannot afford to allow corrupted data within it
   c) A data warehouse contains summarized data whereas the operational database contains transactional data
   d) None of these
Answer: (c)

9. Dimension data within a warehouse exhibits which one of the following properties? [WBUT 2012, 2015]
   a) Dimension data consists of the minor part of the warehouse
   b) The aggregated information is actually dimension data
   c) It contains historical data
   d) Dimension data is the information that is used to analyze the elemental transaction
Answer: (b)

10. A data warehouse is said to contain a 'time-varying' collection of data because [WBUT 2014, 2015, 2016, 2017]
   a) its content has an explicit time-stamp
   b) its life-span is very limited
   c) it contains historical data
   d) its contents vary automatically with time
Answer: (c)

11.
The important aspect of the data warehouse environment is that data found within the data warehouse is [WBUT 2016, 2018]
   a) subject-oriented
   b) time-variant
   c) integrated
   d) all of these
Answer: (a)

12. Data warehousing is used for [WBUT 2016]
   a) decision support systems
   b) OLAP applications
   c) OLTP applications
   d) data manipulation applications
Answer: (a)

13. ________ is a subject-oriented, integrated, time-variant, non-volatile collection of data.
   a) Data Mining
   b) Data Warehousing
   c) Web Mining
   d) Text Mining
Answer: (b)

Short Answer Type Questions

1. Define Data Marts. Define the types of Data Marts. [WBUT 2009, 2010, 2011, 2015]
Answer:
1st Part:
A data mart is a group of subjects that are organized in a way that enables departments to make specific decisions. For example, the advertising department will have its own data mart, while the finance department will have a data mart separate from it. In addition, each department will have ownership of the software, hardware, and other components that make up its data mart.
2nd Part:
There are two types of Data Marts:
• Independent data marts - sourced from data captured from an OLTP system, from external providers, or from data generated locally within a particular department or geographic area.
• Dependent data marts - sourced directly from enterprise data warehouses.

2. Define data mining. What are the advantages of data mining over traditional approaches? [WBUT 2009]
Answer:
1st Part:
Data mining, which is also known as knowledge discovery, is one of the most popular topics in information technology. It concerns the process of automatically extracting useful information, and it holds the promise of discovering hidden relationships that exist in large databases. These relationships represent valuable knowledge that is crucial for many applications. Data mining is not confined to the analysis of data stored in data warehouses.
It may analyze data existing at more detailed granularities than the summarized data provided in a data warehouse. It may also analyze transactional, textual, spatial, and multimedia data which are difficult to model with current multidimensional database technology.
2nd Part:
With the help of data mining, organizations are in a better position to predict future business trends: the possible amount of revenue that could be generated, the orders that could be expected, and the type of customers that could be approached. Traditional approaches are not able to generate such accurate results. One major advantage of data mining over a traditional statistical approach is its ability to deal directly with heterogeneous data fields, and it helps businesses in a number of other areas, such as data analysis.

3. What is the importance of Association Rules in Data Mining? [WBUT 2003]
Explain support, confidence, frequent item set and give a formal definition of association rule. [WBUT 2013]
What is an Association Rule? Define Support, Confidence, Item set and Frequent item set in Association Rule Mining. [WBUT 2017]
Answer:
To illustrate the concepts, we use a small example from the supermarket domain. The set of items is I = {milk, bread, butter, beer}, and a small database containing the items is shown in the table below.

Transaction ID | Items
1              | Milk, bread
2              | Bread, butter
3              | Beer
4              | Milk, bread, butter
5              | Bread, butter

An example supermarket database with five transactions.

An example rule for the supermarket could be {milk, bread} => {butter}, meaning that if milk and bread are bought, customers also buy butter. To select interesting rules from the set of all possible rules, constraints on various measures of significance and interest can be used. The best-known constraints are minimum thresholds on support and confidence.
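These two measures can be computed directly from the five transactions in the table above. The following sketch is ours (the `support` and `confidence` helpers are illustrative names, not from the text), but it follows the standard definitions:

```python
# The five supermarket transactions from the table above.
transactions = [
    {"milk", "bread"},
    {"bread", "butter"},
    {"beer"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
]

def support(itemset, transactions):
    """supp(X): fraction of transactions that contain every item of X."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs, transactions):
    """conf(X => Y) = supp(X union Y) / supp(X)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

print(support({"milk", "bread"}, transactions))                 # 0.4
print(confidence({"milk", "bread"}, {"butter"}, transactions))  # 0.5
```

The printed values match the worked example: {milk, bread} occurs in 2 of 5 transactions (support 0.4), and butter appears in 1 of those 2 (confidence 0.5).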
The support supp(X) of an itemset X is defined as the proportion of transactions in the data set which contain the itemset. In the example database, the itemset {milk, bread} has a support of 2/5 = 0.4, since it occurs in 40% of all transactions (2 out of 5 transactions).
The confidence of a rule is defined as conf(X => Y) = supp(X ∪ Y)/supp(X). For example, the rule {milk, bread} => {butter} has a confidence of 0.2/0.4 = 0.5 in the database above, which means that for 50% of the transactions containing milk and bread the rule is correct. Confidence can be interpreted as an estimate of the probability P(Y|X), the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS.
In many (but not all) situations, we only care about association rules or causalities involving sets of items that appear frequently in baskets. For example, we cannot run a good marketing strategy involving items that no one buys anyway. Thus, much data mining starts with the assumption that we only care about sets of items with high support: they appear together in many baskets, and must appear in at least a certain percent of the baskets, called the support threshold. We use the term frequent itemset for "a set S that appears in at least fraction s of the baskets," where s is some chosen constant, typically 0.01 or 1%.
Association rules are statements of the form {X1, X2, ..., Xn} => Y, meaning that if we find all of X1, X2, ..., Xn in the market basket, then we have a good probability of finding Y. The probability of finding Y, for us to accept this rule, is called the confidence of the rule; we would normally search only for rules that had confidence above a certain threshold.

4. State the differences between Data Mart & Data Warehouse. [WBUT 2016, 2015]
Answer:
A data warehouse has a structure which is separate from that of a data mart, even though they may look similar.
Because of this, it is difficult to coordinate the data across departments. Each department will have its own view of how a data mart should look, and the data mart that it uses will be specific to it. In contrast, data warehouses are designed around the organization as a whole. Instead of being owned by one department, a data warehouse will generally be owned by the entire company. While the data contained in data warehouses are granular, the information contained in data marts is not very granular at all.
Another thing that separates data warehouses from data marts is that data warehouses contain larger amounts of information; the information held in data marts is often summarized. A data warehouse will not contain information restricted to a single department; instead, it holds the information that is analyzed and processed by the organization. Much of the information held in data warehouses is historical in nature, and they are designed to process this information.

5. When is a Data Mart appropriate?
Answer:
Data Marts are created for:
a) Speeding up queries by reducing the volume of data to be scanned
b) Structuring data in a form suitable for user access tools
c) Partitioning data to impose access control strategies
d) Segmenting data onto different hardware platforms
Data marts should not be used in other cases, as the operational costs of data marting can be high and, once a strategy is in place, it can be difficult to change it without substantial cost.

6. What is sequence mining?
Answer:
Sequence mining is a type of structured data mining in which the database administrator looks for sequences of trends in the data. This data mining is split into fields: itemset sequence mining is typically used in marketing, and string sequence mining is used in biology research. Sequence mining is different from regular data mining because the data are more specific, which makes building an effective database difficult for designers, and it can sometimes go awry if the data differ from the common sequence.

7.
Differentiate among Enterprise Warehouse, Data Mart and Virtual Warehouse.
Answer:
Enterprise warehouse - collects all of the information about subjects spanning the entire organization.
Data Mart - a subset of corporate-wide data that is of value to specific groups of users. Its scope is confined to specific, selected groups, such as a marketing data mart.
Virtual warehouse - a set of views over operational databases. Only some of the possible summary views may be materialized.

8. How is a data warehouse different from a database? [WBUT 2013, 2016, 2018]
Answer:
A database is used to store data, while a data warehouse is mostly used to facilitate reporting and analysis. Basically, a database is just where the data is stored; in order to access the data or analyze it, a database management system is required. However, a data warehouse does not necessarily require a DBMS. The purpose of a data warehouse is easy access to the data for a user. The data warehouse may also be used to analyze the data; however, the actual process of analysis is called data mining. Some differences between a database and a data warehouse:
• A database is used for Online Transactional Processing (OLTP) but can be used for other purposes such as data warehousing.
• A data warehouse is used for Online Analytical Processing (OLAP). It reads the historical data for the users to make business decisions.
• In a database, the tables and joins are complex since they are normalized for the RDBMS; this reduces redundant data and saves storage space. In a data warehouse, the tables and joins are simple since they are de-normalized; this is done to reduce the response time for analytical queries.
• Relational modeling techniques are used for RDBMS database design, whereas dimensional modeling techniques are used for data warehouse design.
• A database is optimized for write operations, while a data warehouse is optimized for read operations.
• In a database, the performance is low for analytical queries, while in a data warehouse, there is high performance for analytical queries.
• A data warehouse is a step ahead of a database: it includes a database in its structure.

9. What are the issues relating to the diversity of database types? [WBUT 2016]
Answer:
• Handling of relational and complex types of data - The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.
• Mining information from heterogeneous databases and global information systems - The data are available at different data sources, which may be structured, semi-structured or unstructured. Therefore, mining the knowledge from them adds challenges to data mining.

10. Introduce the idea of Data Mining with example. Explain Knowledge Discovery in Databases (KDD).
Answer:
Data Mining: Refer to Question No. 2 (1st Part) of Short Answer Type Questions.
Data mining example: Mobile phone and utilities companies use data mining and business intelligence to predict 'churn', the term they use for when a customer leaves their company to get their phone/gas/broadband from another provider. They collate billing information, customer services interactions, website visits and other metrics to give each customer a probability score, then target offers and incentives to customers whom they perceive to be at risk of churning.
Steps involved in the KDD process:
1. Identify the goal of the KDD process from the customer's perspective.
2. Understand the application domains involved and the knowledge that is required.
3. Select a target data set or subset of data samples on which discovery is to be performed.
4. Cleanse and pre-process data by deciding strategies to handle missing fields and alter the data as per the requirements.
5. Simplify the data sets by removing unwanted variables.
Then, analyze useful features that can be used to represent the data, depending on the goal or task.
6. Match KDD goals with data mining methods to suggest hidden patterns.
7. Choose data mining algorithms to discover hidden patterns. This process includes deciding which models and parameters might be appropriate for the overall KDD process.
8. Search for patterns of interest in a particular representational form, which includes classification rules or trees, regression and clustering.
9. Interpret essential knowledge from the mined patterns.
10. Use the knowledge and incorporate it into another system for further action.
11. Document it and make reports for interested parties.

11. Does a data warehouse involve a transaction? Explain.
Answer:
A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates the analysis workload from the transaction workload and enables an organization to consolidate data from several sources.

Long Answer Type Questions

1. Define Data Warehouse and briefly describe its characteristics.
Answer:
A data warehouse consists of a computer database responsible for the collection and storage of information for a specific organization. This collection of information is used to manage information efficiently and analyze the collected data. Although
‘There are four key characteristics which separate the data warehouse from other major operational systems: 1. Subject Orientation: Data organized by subject 2, Integration: Consistency of defining parameters 3. Non-volatility: Stable data storage medium 4, Time-variance: Timeliness of data and access terms 2. a) What are the different tiers in a typical 3-tier data warehousing architecture? b) What do you mean by warehousing schema? Explain. [WBUT 2009) Answer: a) The bottom tier is a warehouse database server which is almost always a relational database system. The middle tier is an OLAP seryer which is typically implemented using either (1) a Relational. OLAP (ROLAP) model, i.e., an extended relational DBMS that maps operations on multidimensional data to standard relational operations: or (2) a Multidimensional OLAP (MOLAP) model, i.c., 2 special purpose server that directly implements multidimensional data and operations. The top tier is a client, which contains query and reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on). b) The entity-relationship data model is commonly used in the design of relational databases, where a database schema consists of a set of entities or objects, and the elationships between them. Such a data model is appropriate for online transaction Processing. Data warehouses, however, require a concise, subject-oriented schema which facilitates on-line data analysis 3. 2) Introduce the concept of data mining and cite two application areas. -, : : [WBUT 2010, 2012] ae hat are the different steps of a data mining task? (WBUT 2010, 2012] “ppose that the data mining task is to cluster the following eight points (with sy) Tepresenting location) into 3 clusters. A(2,10).4,(2,5),4,(8.4), 8 (5,8). B, (7,5), 2 (6,4),6 (1.2), (4.9) The distance function is Euclidian distance. Initially we assign 4, B, and C, asthe Centre of each cluster. Use k-means algorithm to determine the three clusters. 
Answer: [WBUT 2010, 2013, 2014]
1st Part: Refer to Question No. 2 (1st Part) of Short Answer Type Questions.
2nd Part: One application of data mining is market basket analysis, with which a company is able to find buying behaviors that are common among its customers. Another is churn analysis, which allows a company to estimate which customers may stop using its products or services and go to one of its competitors.
b) The following steps are usually followed in data mining, with the process moving backward whenever needed:
1. Develop an understanding of the application, of the relevant prior knowledge, and of the end user's goals.
2. Create a target data set to be used for discovery.
3. Clean and preprocess data (including handling missing data fields, noise in the data, accounting for time series, and known changes).
4. Reduce the number of variables and find invariant representations of the data if possible.
5. Choose the data-mining task (classification, regression, clustering, etc.).
6. Choose the data-mining algorithm.
7. Search for patterns of interest (this is the actual data mining).
8. Interpret the patterns mined. If necessary, iterate through any of steps 1 through 7.
9. Consolidate knowledge discovered and prepare a report.
c) The given eight points were A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9). The points A1, B2, and C1 were assumed to be the initial three cluster centres. There are two possible ways of clustering the given points using the k-means algorithm, depending on to which cluster we assign the point B1 (B1(5,8) is equidistant, at distance √13, from the centres A1(2,10) and B2(7,5)).

The first way:
1.1. Round One:
- Compute for each of the eight points its distance from each cluster centre, and assign it to the cluster to which it is most similar.
=> The clustering after the first round execution is:
Cluster 1: A1, B1, C2
Cluster 2: B2, A3, B3
Cluster 3: C1, A2
- Update the cluster centres:
The new centre of Cluster 1: (3.67, 9)
The new centre of Cluster 2: (7, 4.33)
The new centre of Cluster 3: (1.5, 3.5)
1.2. Round Two:
- No changes.

The second way:
2.1. Round One:
- Compute for each of the eight points its distance from each cluster centre, and assign it to the cluster to which it is most similar.
=> The clustering after the first round execution is:
Cluster 1: A1, C2
Cluster 2: B2, A3, B1, B3
Cluster 3: C1, A2
- Update the cluster centres:
The new centre of Cluster 1: (3, 9.5)
The new centre of Cluster 2: (6.5, 5.25)
The new centre of Cluster 3: (1.5, 3.5)
2.2. Round Two:
- Recompute for each of the eight points its distance from each cluster centre, and assign it to the cluster to which it is most similar.
=> The clustering after the second round execution is:
Cluster 1: A1, B1, C2
Cluster 2: B2, A3, B3
Cluster 3: C1, A2
- Update the cluster centres:
The new centre of Cluster 1: (3.67, 9)
The new centre of Cluster 2: (7, 4.33)
The new centre of Cluster 3: (1.5, 3.5)
2.3. Round Three:
- No changes.

4. a) Write down the design steps for a typical data warehouse.
b) Explain the flowchart for the KDD process.
c) Define Roll-up and Drill-down process with a suitable example. [WBUT 2015]
Answer:
a) In general, building any data warehouse consists of the following steps:
1. Extracting the transactional data from the data sources into a staging area - The Extract step covers the data extraction from the source system and makes it accessible for further processing. The main objective of the extract step is to retrieve all the required data from the source system with as little resources as possible. The extract step should be designed in a way that it does not negatively affect the source system in terms of performance, response time or any kind of locking.
In of te ee it is helpful to disable any constraints and the load process efficient, oor Gees tad enable them back only after the load completes. integrity needs to be maintained by ETL tool to ensure consistency. b) An Outline of the Steps of the KDD Process yer Patterns * - i (__ seesion t ransorned [paw | Mp Preprocessed | i | i i i ¢ Sept bas The overall process of finding and interpreting patterns from data involves the application of the following steps: 1. Developing an understanding of © the application domain © the relevant prior knowledge © the goals of the end-user » Creating a target data set: selecting a data set, or focusing on variables, or data samples, on which discovery is to be performed. 3. Data cleaning and preprocessing. e pie of noise or outliers. - i? ° lecting necessary information to model or account for noise. 6 Sarena for handling missing data fields. J ing for time sequence information anc known changes. 4. Data reduction and ree aioe ee aoe useful features to represent the data 9. representations for the data. ;, Choosing the data mining task. o Deciding whether the goal of the KDD classificati clustering, ete. procera ge eect ;. Choosing the data mining algorithm(s). o Selecting method(s) to be used for searching for patterns in the data. © Deciding which models and parameters may be appropriate, © Matching a particular data mining method with the overall criteria of the KDD process. Data mining. © Searching for patterns of interest in a particular representational form or a set of such representations as classification rules or trees, regression, clustering, and so forth, Interpreting mined patterns. Consolidating discovered knowledge. ¢) Roll-up Roll-up performs aggregation ona data cube in any of the following ways: By climbing up a concept hierarchy for a dimension By dimension reduction The following diagram illustrates how roll-up works. 
[Figure: roll-up on the location dimension of a sales cube with items Mobile, Modem, Phone and Security, aggregating cities up to countries]
Roll-up is performed by climbing up the concept hierarchy for the dimension location. Initially the concept hierarchy was "street < city < province < country". On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country, so the data is grouped into countries rather than cities. When roll-up is performed, one or more dimensions from the data cube are removed.

Use the k-means or k-medoid algorithm to determine the two clusters.
Answer:
Step 1: Initialize centers. Let us assume c1 = (3,4) and c2 = (7,4), so here c1 and c2 are selected as medoids. The cost (distance) to the nearest medoid for each data object is shown in the table below.

Data object | Distance to c1(3,4) | Distance to c2(7,4) | Nearest medoid
(2,6)       | 3                   | 7                   | c1
(3,8)       | 4                   | 8                   | c1
(4,7)       | 4                   | 6                   | c1
(6,2)       | 5                   | 3                   | c2
(6,4)       | 3                   | 1                   | c2
(7,3)       | 5                   | 1                   | c2
(8,5)       | 6                   | 2                   | c2
(7,6)       | 6                   | 2                   | c2

The clusters then become:
Cluster1 = {(3,4), (2,6), (3,8), (4,7)}
Cluster2 = {(7,4), (6,2), (6,4), (7,3), (8,5), (7,6)}
=(7,3) So now the medoids are c\(3.4) and O'(7,3) ifcl and O' are new medoids, calculate the total cost involved By using the formula in the step 1 i “ Data objects | Cost (distance) ) 1 3 4 2 6 3 3 3 4 3 8 4 4 3 4 4 7 4 5 3 4 6 | 2 3 6 3 4 6 | 4 3 7 3 4 Jams 4 9 3 4 Salers 6 10 3 4 76 4 i ai Data objects Cost CY) (distance) 7 7 3 2 6 8 3 7 sy 3 8 9 4 7 3 4 7 7 3 7 3 6 2 2 6 7 3 6 4 2 7 7 3 7 4 1 9 7 3 8 3 3 10 7 3 7 6 " Toial cost=3 544442429143 43 922 DWM-25 i. POPULAR PUBLICATIONS i i Oris So cost of swapping medoid frome, to " i =22-20=270 = current total cost — past total cost : " S joni to O' would be a bad idea, so the previous choice bi good. So we nonmedoids and found that our first choice was the best. So the configuration ; change and algorithm: terminates here (i.e. there is no change in the medoids), 6. What is metadata in Data warehousing? What is metadata catalog? different categories of metadata used in Feed warehouse. a) What is metadata? b) What is the typical content of metadata of a data warehouse? What is Metadata? Explain different types of Metadata. nswer: . Mata is simply defined as data about data. The data that is used to repre data is known as metadata. In terms of data warehouse, we can define metz follows. © Metadata is the road-map to a data warehouse. © Metadata ina data warehouse defines the warehouse objects. + Metadata acts as a directory, This directory helps the decision support sys locate the contents of a data warehouse. 2 A collection of metadata records combined with data management and search tools a data catalogue, The architecture of these data catalogues can’ vary from tightly’ internal data indexes to widely accessible distributed catalogues spread Internet. Metadata can be broadly categorized into three categories: * . Business Metadata; Ii has the data ownership information, business definition changing policies. 
* Technical Metadata: It includes database system names, table and column and sizes, data types and allowed values. Technical metadata also includes st information such as primary and foreign key attributes and indices. Operational Metadata: It includes currency of data and data lineage. Ci data means whether the data is active, archived, or purged. Lineage of data history of data migrated and transformation applied on it. 4. a) What is clustering? Why is it difficult to handle categorical clustering? Distinguish between partition clustering and hierarchical ¢ What is clustering? What are the features cluster? What do y‘ hierarchical clustering technique? ae [WBUT 2013, DATA WAREHOUSING AND DATA MINING Answers Part: Clustering is a data mining (machine learning) technique used to place data elements into related groups without advance knowledge of the group definitions. ‘There are two main types of clustering techniques, those that create a hierarchy of clusters and non-hierarchy of clusters. The hierarchical clustering techniques create a hierarchy of clusters from small to big. With a hierarchy of clusters defined it is possible to choose the number of clusters that are desired. At the extreme jt is possible to have as many clusters as there are records in the database. In this case the records within the cluster are optimally similar to each other (since there is only one) and certainly different from the other clusters, The advantage of hierarchical clustering methods is that they allow the end user to choose from either many clusters or only a few. 2” Part: Since, the process of grouping a set of physical or abstract objects into classes of similar objects is clustering, similarity metric is important here because it is used for outlier detection. The clustering algorithm which is main memory based can operate only on the following two data structures namely, a) Data Matrix b) Dissimilarity Matrix Soit is difficult to handle categorical data. 
In panition clustering, every data sample is initially assigned to a cluster in some {possibly random) way. Samples are then iteratively transferred from cluster to cluster until some criterion function is minimized. Once the process is complete, the samples will have been partitioned into separate compact clusters. Examples of partition clustering methods are k-means and Lloyd's method, In hierarchical clustering, each sample is initially considered a member of its own cluster, after which clusters are recursively combined in pairs according to some predetermined condition until eventually every point belongs to a single cluster. The resulting hierarchical structure may be represented by a binary tree or "dendrogram", from which the desired clusters may be extracted. Examples of hierarchical clustering methods are the single-link, Ward's, centroid, complete-link, group average, median, and parametric Lance Williams methods. 5) Describe the working of the PAM algorithm. Compare its performance with CLARA and CLARANS. Answer: . iy PAM k-medoids clustering algorithm, for example, evaluates a set of k objects une to be representative objects (medoids) of k clusters within T objects such that 4 eet ee are clustered with the medoid to which it is the most similar (i.e. in terms of the provided distance metric). The process operates by swapping one Gf the medoids with one of the objects iteratively such that the total distance between Talo t! objects and their medoid is reduced. The algorithm can be depicted as DWM-27 lilies SATIONS 7 Step 1: Initialization - choose k medoids from T objects randomly, Step 2: Evaluation - calculate the cost Di, — D, for each Swap of one sane object, where D, is the total distance before the swap and Dj js the total distance swap. ; oe 3: Selection - accept the swap with the best cost and if the cost is step 2; otherwise record the medoids and terminate the program. 
The computational complexity of the PAM algorithm is O((1 + β)k(T − k)²) based on the number of partitions per object, where β is the number of successful swaps. It can also be expressed as O′((1 + β)k²(T − k)²) based on the number of distance calculations, i.e., one partition per object is equivalent to k distance calculations. Clearly, this is time consuming even for a moderate number of objects and a small number of medoids.
The computational complexity of the CLARA algorithm is O(q(ks² + (T − k)) + βk(T − k)) based on the number of partitions per object, or O′(q(k²s² + k(T − k)) + βk²s²) based on the number of distance calculations, where q, s, k, β and T are the number of samples, the sample size per sample, the number of medoids, the number of successful swaps for all tested samples, and the total number of objects, respectively. Clearly, the CLARA algorithm can deal with a larger number of objects than can the PAM algorithm if s ≪ T.
The computational complexity of CLARANS is O((β + numlocal)(T − k)) based on the number of partitions per object, or O′((β + numlocal)k(T − k)) based on the number of distance calculations, where β is the number of test moves between nodes.

2. a) What is clustering? Why is it required for data warehousing and data mining? What are the differences between partitioning clustering and hierarchical clustering?
b) Suppose the following table shows the distances between different cities. Construct a hierarchical tree using the hierarchical clustering method.

       | BA  | FI  | MI  | NA  | RM  | TO
    BA | 0   | 662 | 877 | 255 | 412 | 996
    FI | 662 | 0   | 295 | 468 | 268 | 400
    MI | 877 | 295 | 0   | 754 | 564 | 138
    NA | 255 | 468 | 754 | 0   | 219 | 869
    RM | 412 | 268 | 564 | 219 | 0   | 669
    TO | 996 | 400 | 138 | 869 | 669 | 0

c) Discuss the K-means algorithm with a suitable example.
Answer:
a) 1st Part: Refer to Question No.
1(a) (1st Part) of Long Answer Type Questions.
2nd Part: Clustering is required for data warehousing and data mining for the following reasons:
* Scalability - We need highly scalable clustering algorithms to deal with large databases.
* Ability to deal with different kinds of attributes - Algorithms should be capable of being applied on any kind of data such as interval-based (numerical) data, categorical data, and binary data.
* Discovery of clusters with arbitrary shape - The clustering algorithm should be capable of detecting clusters of arbitrary shape. They should not be bounded to only distance measures that tend to find spherical clusters of small sizes.
* High dimensionality - The clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional space.
* Ability to deal with noisy data - Databases contain noisy, missing or erroneous data. Some algorithms are sensitive to such data and may lead to poor quality clusters.
* Interpretability - The clustering results should be interpretable, comprehensible, and usable.
3rd Part: Refer to Question No. 1(a) (2nd Part) of Long Answer Type Questions.

b) The nearest pair of cities is MI and TO, at distance 138. These are merged into a single cluster called "MI/TO". The level of the new cluster is L(MI/TO) = 138 and the new sequence number is m = 1. Then we compute the distance from this new compound object to all other objects. In single-link clustering the rule is that the distance from the compound object to another object is equal to the shortest distance from any member of the cluster to the outside object. So the distance from MI/TO to RM is chosen to be 564, which is the distance from MI to RM, and so on. After merging MI with TO we obtain the following matrix:

          | BA  | FI  | MI/TO | NA  | RM
    BA    | 0   | 662 | 877   | 255 | 412
    FI    | 662 | 0   | 295   | 468 | 268
    MI/TO | 877 | 295 | 0     | 754 | 564
    NA    | 255 | 468 | 754   | 0   | 219
    RM    | 412 | 268 | 564   | 219 | 0

min d(i,j) = d(NA,RM) = 219 => merge NA and RM into a new cluster called NA/RM
L(NA/RM) = 219
m = 2

          | BA  | FI  | MI/TO | NA/RM
    BA    | 0   | 662 | 877   | 255
    FI    | 662 | 0   | 295   | 268
    MI/TO | 877 | 295 | 0     | 564
    NA/RM | 255 | 268 | 564   | 0

min d(i,j) = d(BA,NA/RM) = 255 => merge BA and NA/RM into a new cluster called BA/NA/RM
L(BA/NA/RM) = 255
m = 3

             | BA/NA/RM | FI  | MI/TO
    BA/NA/RM | 0        | 268 | 564
    FI       | 268      | 0   | 295
    MI/TO    | 564      | 295 | 0

min d(i,j) = d(BA/NA/RM,FI) = 268 => merge BA/NA/RM and FI into a new cluster called BA/FI/NA/RM
L(BA/FI/NA/RM) = 268
m = 4

                | BA/FI/NA/RM | MI/TO
    BA/FI/NA/RM | 0           | 295
    MI/TO       | 295         | 0

Finally, we merge the last two clusters at level 295.
The process is summarized by the following hierarchical tree: MI and TO merge at level 138, NA and RM at 219, BA joins NA/RM at 255, FI joins BA/NA/RM at 268, and the final merge with MI/TO occurs at 295.

c) K-means is an unsupervised learning algorithm that solves the clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The algorithm is composed of the following steps:
1. Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
Suppose that we have n sample feature vectors x1, x2, ..., xn, all from the same class, and we know that they fall into k compact clusters, k < n.

... X => Y, with the user-specified minimum support and minimum confidence, where X and Y can be sets of items at any level of the taxonomy, such that no item in Y is an ancestor of any item in X. For example, there might be a rule which says that "50% of transactions that contain soft drinks also contain snacks; 5% of all transactions contain both these items".

c) PAM Clustering Technique: PAM stands for "partition around medoids". The algorithm is intended to find a sequence of objects called medoids that are centrally located in clusters. Objects that are tentatively defined as medoids are placed into a set S of selected objects. If O is the set of objects, then the set U = O − S is the set of unselected objects.
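The four k-means steps listed in part (c) above can be sketched in a few lines of Python (a minimal illustration; the data points and the starting centroids are invented for the example):

```python
# Minimal k-means sketch following the four steps above.
# Points and initial centroids are invented for illustration.

def kmeans(points, centroids, iterations=100):
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Step 2: assign each point to the group with the closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)),
                      key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[idx].append(p)
        # Step 3: recompute each centroid as the mean of its assigned points.
        new_centroids = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        # Step 4: stop when the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1, 2), (8, 8), (9, 8)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
```

On this toy data the centroids converge to (1, 1.5) and (8.5, 8), one per compact cluster.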
The goal of the algorithm is to minimize the average dissimilarity of objects to their closest selected object. Equivalently, we can minimize the sum of the dissimilarities between objects and their closest selected object. The algorithm has two phases:
(i) In the first phase, BUILD, a collection of k objects is selected for an initial set S.
(ii) In the second phase, SWAP, one tries to improve the quality of the clustering by exchanging selected objects with unselected objects.

d) CLARANS clustering algorithm vis-à-vis PAM and CLARA: The medoid searching in PAM or CLARA can be abstracted as searching k subgraphs from the graph of n points, and based on this understanding, a PAM-like clustering algorithm called CLARANS (Clustering Large Applications based upon RANdomized Search) is defined. PAM searches the whole graph and CLARA searches some random sub-graphs. Moreover, while CLARA samples a set and selects the k medoids by climbing sub-graphs, CLARANS selects the neighboring objects of the medoids as candidates for new medoids. It samples subsets to verify the medoids multiple times so as to avoid bad samples. However, this multiple-time sampling for medoid verification is time consuming, which limits CLARANS from clustering very large datasets in an acceptable time.

e) K-Medoid Algorithm: Partitioning Around Medoids, or the K-medoids algorithm, is a partitional clustering algorithm which is slightly modified from the K-means algorithm. The idea of this algorithm is to first compute the K representative objects, which are called medoids. After finding the set of medoids, each object of the data set is assigned to the nearest medoid. That is, object i is put into cluster vi when medoid m(vi) is nearer than any other medoid m(w).
The algorithm proceeds in two steps:
* BUILD-step: This step sequentially selects k "centrally located" objects to be used as the initial medoids.
* SWAP-step: If the objective function can be reduced by interchanging (swapping) a selected object with an unselected object, then the swap is carried out. This is continued till the objective function can no longer be decreased.
The algorithm is as follows:
1. Initially select k random points as the medoids from the given n data points of the data set.
2. Associate each data point to the closest medoid by using any of the most common distance metrics.
3. For each pair of non-selected object h and selected object i, calculate the total swapping cost TCih.
4. If TCih < 0, i is replaced by h.
5. Repeat steps 2-4 until there is no change of the medoids.

MINING TIME SERIES DATA

Multiple Choice Type Questions

1. To optimize data warehouse design, which one is done? [WBUT 2011]
a) Normalization of fact tables and denormalization of dimension tables
b) Normalization of fact tables and dimension tables
c) Denormalization of fact tables and dimension tables
d) Normalization of dimension tables and denormalization of fact tables
Answer: (a)

2. ______ is the analysis of the value of an attribute as it varies over time.
a) Time series Analysis b) Classification [MODEL QUESTION]
c) Association d) Prediction
Answer: (a)

Long Answer Type Questions

1. a) Define decision tree. [WBUT 2009, 2010]
b) What are the advantages and disadvantages of the decision tree approach over other approaches for data mining? [WBUT 2003]
c) Discuss briefly the tree construction principle. [WBUT 2009]
OR,
Describe the basic algorithm for decision tree induction. [WBUT 2013]
d) What is a classification problem? What is the difference between supervised and unsupervised classification?
[WBUT 2009]
Answer:
a) Decision trees are powerful and popular tools for classification and prediction. The attractiveness of decision trees is due to the fact that, in contrast to neural networks, decision trees represent rules. A decision tree is a classifier in the form of a tree structure (see Figure 1), where each node is either:
* a leaf node - indicates the value of the target attribute (class) of examples, or
* a decision node - specifies some test to be carried out on a single attribute value, with one branch and sub-tree for each possible outcome of the test.
A decision tree can be used to classify an example by starting at the root of the tree and moving through it until a leaf node, which provides the classification of the instance.
Information gain is simply the expected reduction in entropy caused by partitioning the examples according to an attribute. More precisely, the information gain, Gain(S, A), of an attribute A relative to a collection of examples S, is defined as

    Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

where Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has value v (i.e., S_v = {s ∈ S | A(s) = v}). Note the first term in the
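The Gain(S, A) formula above can be evaluated numerically; a minimal sketch (the toy dataset, the attribute name "outlook" and the labels are invented for illustration):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attr):
    """Gain(S, A) = Entropy(S) - sum over v of |S_v|/|S| * Entropy(S_v)."""
    labels = [label for _, label in examples]
    gain = entropy(labels)
    n = len(examples)
    for v in set(feats[attr] for feats, _ in examples):
        subset = [label for feats, label in examples if feats[attr] == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

# Toy examples: ({attribute: value}, class label)
examples = [
    ({"outlook": "sunny"}, "no"),
    ({"outlook": "sunny"}, "no"),
    ({"outlook": "rain"}, "yes"),
    ({"outlook": "rain"}, "yes"),
]
g = info_gain(examples, "outlook")  # attribute splits the classes perfectly
```

Here the collection S has entropy 1 (two classes, evenly split) and each partition S_v is pure (entropy 0), so the gain is the maximum possible, 1 bit.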
