Data Warehousing & Data Mining: MAKAUT Organizer of 2023-24
DATA WAREHOUSING AND
DATA MINING
Contents:
Introduction to Data Warehousing
Classification and Prediction of Data Warehousing
Mining Time Series Data
Mining Data Streams
Web Mining
Recent Trends in Distributed Warehousing
NOTE:
The MAKAUT course structure and syllabus of the 6th semester has been changed from 2021. Previously, DATA WAREHOUSING AND DATA MINING was in the 7th semester. This subject has been redesigned and shifted to the 6th semester as per the present curriculum. The subject organization has changed slightly. Taking special care of this matter, we are providing the relevant MAKAUT university solutions and some model questions & answers for newly introduced topics, so that students can get an idea about university question patterns.
INTRODUCTION TO DATA WAREHOUSING
1. A data warehouse is an integrated collection of data because [WBUT 2009, 2015]
a) It is a collection of data of different types
b) It is a collection of data derived from multiple sources
c) It is a relational database
d) It contains summarized data
Answer: (b)
2. A data warehouse is said to contain a 'subject-oriented' collection of data because [WBUT 2009, 2013]
a) Its contents have a common theme
b) It is built for a specific application
c) It cannot support multiple subjects
d) It is a generalization of 'object-oriented'
Answer: (a)
3. A data warehouse is said to contain a 'time-varying' collection of data because [WBUT 2010, 2013, 2015]
a) Its contents vary automatically with time
b) Its life-span is very limited
c) Every key structure of the data warehouse contains, either implicitly or explicitly, an element of time
d) Its contents have an explicit time-stamp
Answer: (c)
4. Data Warehousing is used for [WBUT 2010, 2012]
a) Decision Support System b) OLTP applications
c) Database applications d) Data Manipulation applications
Answer: (a)
5. Which of the following is TRUE? [WBUT 2010, 2012]
a) Data warehouse can be used for analytical processing only
b) Data warehouse can be used for information processing (query, report) and analytical processing
c) Data warehouse can be used for data mining only
d) Data warehouse can be used for information processing (query, report), analytical processing and data mining
Answer: (d)
6. Data warehouse architecture is just an overall guideline; it is not a blueprint for the data warehouse. [WBUT 2014]
a) True b) False
Answer: (b)
7. The most distinguishing characteristic of DSS data is [WBUT 2011]
a) Granularity b) Timespan c) Dimensionality d) Data currency
Answer: (c)
8. A data warehouse is built as a separate repository of data, different from the
operational data of an enterprise because [WBUT 2012, 2013]
a) It is necessary to keep the operational data free of any warehouse
operations
b) A data warehouse cannot afford to allow corrupted data within it
c) A data warehouse contains summarized data whereas the operational
database contains transactional data
d) None of these
Answer: (c)
9. Dimension data within a warehouse exhibits which one of the following properties? [WBUT 2012, 2015]
a) Dimension data consists of the minor part of the warehouse
b) The aggregated information is actually dimension data
c) It contains historical data
d) Dimension data is the information that is used to analyze the elemental
transaction
Answer: (b)
10. A data warehouse is said to contain a ‘time-varying’ collection of data because
a) its content has an explicit time-stamp [WBUT 2014, 2015, 2016, 2017]
b) its life-span is very limited
c) it contains historical data
d) its contents vary automatically with time
Answer: (c)
11. The important aspect of the data warehouse environment is that data found within the data warehouse is [WBUT 2016, 2018]
a) subject-oriented b) time-variant
c) integrated d) all of these
Answer: (a)
12. Data warehousing is used for [WBUT 2016]
a) decision support system b) OLAP applications
c) Database applications d) Data manipulation applications
Answer: (a)
13. ............ is a subject-oriented, integrated, time-variant, non-volatile collection of data. [WBUT 2017]
a) Data Mining b) Data Warehousing
c) Document Mining d) Text Mining
Answer: (b)
Short Answer Type Questions

1. Define Data Marts. [WBUT 2008, 2010, 2011, 2015, 2017]
Define the types of Data Marts. [WBUT 2009, 2010, 2011, 2015]
Answer:
"Part: .
A data mart is a group of subjects that are organized in a way that allows them to assist
departments in making specific decisions. For example, the advertising department will
have its own data mart, while the finance department will have a data mart that is
separate from it. In addition to this, each department will have full ownership of the
software, hardware, and other components that make up their data mart.
2nd Part:
There are two types of Data Marts:
• Independent data marts - sourced from data captured from OLTP systems, external providers, or from data generated locally within a particular department or geographic area.
• Dependent data marts - sourced directly from enterprise data warehouses.
2. Define data mining. What are the advantages of data mining over traditional approaches? [WBUT 2009]
Answer:
1" Part:
Data mining, which is also kriown as knowledge discovery, is one of the most popular
topics in information technology. It concems the Process of automatically extracting
Useful information and has the promise of discovering hidden relationships that exist in
large darahase.. These relationships represent valuable knowledge that is crucial for
maily applications. Data mining is not confined to the analysis of data stored in data
warehouses!) mav analyze data existing at more detailed granularities ‘than the
ssauanized data vided in a data warehouse. It may also analyze transactional, textual;
Spatial, and mu:timedia data which are difficult to model with current multidimensional
data’. echnolopy.
2nd Part:
With the help of data mining, organizations are in a better position to predict future business trends, the possible amount of revenue that could be generated, the orders that could be expected, and the type of customers that could be approached. Traditional approaches are not able to generate such accurate results, as they use simpler algorithms. One major advantage of data mining over a traditional statistical approach is its ability to deal with heterogeneous data fields.
The advantages of data mining help businesses grow, help keep customers happy, and help in many other areas such as data management.
3. What is the importance of Association Rules in Data mining? [WBUT 2009]
OR,
Explain support, confidence, frequent item set and give a formal definition of association rule. [WBUT 2013]
OR,
What is an Association Rule? Define Support, Confidence, Item set and Frequent item set in Association Rule Mining.
Answer:
To illustrate the concepts, we use a small example from the supermarket domain. The set of items is I = {milk, bread, butter, beer} and a small database containing the items is shown below.

Transaction | Items
1 | Milk, bread
2 | Bread, butter
3 | Beer
4 | Milk, bread, butter
5 | Bread, butter

Table I: An example supermarket database with five transactions.
An example rule for the supermarket could be {milk, bread} => {butter}, meaning that if milk and bread are bought, customers also buy butter. To select interesting rules from the
set of all possible rules, constraints on various measures of significance and interest can
be used. The best-known constraints are minimum thresholds on support and confidence.
The support supp(X) of an itemset X is defined as the proportion of transactions in the
data set which contain the itemset. In the example database in Table I, the itemset {milk,
bread} has a support of 2/5 = 0.4 since it occurs in 40% of all transactions (2 out of 5
transactions).
The confidence of a rule is defined as conf(X => Y) = supp(X ∪ Y) / supp(X). For example, the rule {milk, bread} => {butter} has a confidence of 0.2/0.4 = 0.5 in the database in the table, which means that for 50% of the transactions containing milk and bread the rule is correct. Confidence can be interpreted as an estimate of the probability P(Y|X), the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS.
In many (but not all) situations, we only care about association rules or causalities involving sets of items that appear frequently in baskets. For example, we cannot run a good marketing strategy involving items that no one buys anyway. Thus, much data mining starts with the assumption that we only care about sets of items with high support, i.e., they appear together in many baskets. We then find association rules or causalities involving only high-support sets of items: the set must appear in at least a certain percent of the baskets, called the support threshold. We use the term frequent itemset for "a set S that appears in at least fraction s of the baskets," where s is some chosen constant, typically 0.01 (i.e., 1%).
Association rules are statements of the form {X1, X2, ..., Xn} => Y, meaning that if we find all of X1, X2, ..., Xn in the market basket, then we have a good chance of finding Y. The probability of finding Y for us to accept this rule is the confidence of the rule. We normally would search only for rules that had confidence above a certain threshold.
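To make the definitions concrete, here is a minimal Python sketch (our own illustration, not from the text) computing support and confidence over the five-transaction database of Table I:

transactions = [
    {"milk", "bread"},
    {"bread", "butter"},
    {"beer"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
]

def support(itemset):
    # Fraction of transactions containing every item of the itemset.
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # conf(X => Y) = supp(X union Y) / supp(X)
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"milk", "bread"}))                 # 0.4
print(confidence({"milk", "bread"}, {"butter"}))  # 0.5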
4. State the differences between Data Mart & Warehouse. [WBUT 2010, 2015]
Answer:
A data warehouse has a structure which is separate from a data mart, even though they may look similar. Because of this, it is difficult to coordinate the data across multiple departments. Each department will have its own view of how a data mart should look. The data mart that they use will be specific to them. In contrast, a data warehouse is designed around the organization as a whole. Instead of being owned by one department, a data warehouse will generally be owned by the entire company. While the data contained in data warehouses is granular, the information contained in data marts is not very granular at all.
Another thing that separates data warehouses from data marts is that data warehouses
contain larger amounts of information. The information that is held by data marts is often summarized. Data warehouses will not contain information that is biased on the part
of the department. Instead, it will demonstrate the information that is analyzed and
processed by the organization. Much of the information that is held in data warehouses is
historical in nature, and they are designed to process this information.
5. When is a Data Mart appropriate? [WBUT 2011]
Answer:
Data Marts are created for:
a) Speeding up queries by reducing the volume of data to be scanned
b) Structuring data in a form suitable for user access tools
c) Partitioning data to impose access control strategies
d) Segmenting data into different hardware platforms.
Data marts should not be used in other cases, as the operational costs of data marting can be high, and once a strategy is in place it can be difficult to change it without incurring substantial cost.
6. What is sequence mining? [WBUT 2011]
Answer:
Sequence mining is a type of structured data mining in which the database administrator looks for sequences or trends in the data. This field is split into two: itemset sequence mining, typically used in marketing, and string sequence mining, used in biology research. Sequence mining is different from regular trend mining because the data are more specific, which makes building an effective database difficult for database designers, and the analysis can sometimes go awry if the sequence is any different from the common sequence.
7. Differentiate among Enterprise Warehouse, Data mart and Virtual warehouse.
[WBUT 2013]
Answer:
Enterprise warehouse - collects all of the information about subjects spanning the entire
organization
Data Mart - a subset of corporate-wide data that is of value to specific groups of users. Its scope is confined to specific, selected groups, such as a marketing data mart.
Virtual warehouse - a set of views over operational databases. Only some of the possible summary views may be materialized.
8. How is a data warehouse different from a database? [WBUT 2013, 2016, 2018]
Answer:
A database is used to store data, while a data warehouse is mostly used to facilitate reporting and analysis. Basically, a database is just where the data is stored; in order to access this data or analyze it, a database management system is required. However, a data warehouse does not necessarily require a DBMS. The purpose of a data warehouse is easy access to the data for a user. The data warehouse may also be used to analyze the data; however, the actual process of analysis is called data mining.
Some differences between a database and a data warehouse:
• A database is used for Online Transactional Processing (OLTP) but can be used for other purposes such as data warehousing.
• A data warehouse is used for Online Analytical Processing (OLAP). It reads historical data to let users make business decisions.
• In a database, the tables and joins are complex since they are normalized for the RDBMS. This reduces redundant data and saves storage space.
• In a data warehouse, the tables and joins are simple since they are de-normalized. This is done to reduce the response time for analytical queries.
• Relational modeling techniques are used for RDBMS database design, whereas dimensional modeling techniques are used for data warehouse design.
• A database is optimized for write operations, while a data warehouse is optimized for read operations.
• In a database, the performance is low for analysis queries, while in a data warehouse, there is high performance for analytical queries.
• A data warehouse is a step ahead of a database. It includes a database in its structure.
9. What are the issues relating to the diversity of database types? [WBUT 2016]
Answer:
• Handling of relational and complex types of data - The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.
• Mining information from heterogeneous databases and global information systems - The data is available at different data sources, which may be structured, semi-structured or unstructured. Therefore, mining the knowledge from them adds challenges to data mining.
10. Introduce the idea of Data Mining with example(s). What are the steps involved in Knowledge Discovery in Database (KDD) process? [WBUT 2016]
Answer:
Data Mining: Refer to Question No. 2 (1st Part) of Short Answer Type Questions.
Data mining example:
Mobile phone and utilities companies use Data Mining and Business Intelligence to
predict 'churn', the term they use for when a customer leaves the company to get their
phone/gas/broadband from another provider. They collate billing information, customer
services interactions, website visits and other metrics to give each customer a probability
score, then target offers and incentives to customers whom they perceive to be at a higher
risk of churning.
Steps involved in the KDD process:
1. Identify the goal of the KDD process from the customer's perspective.
2. Understand the application domains involved and the knowledge that is required.
3. Select a target data set or subset of data samples on which discovery is to be performed.
4. Cleanse and pre-process data by deciding strategies to handle missing fields and alter
the data as per the requirements.
5. Simplify the data sets by removing unwanted variables. Then, analyze useful features
that can be used to represent the data, depending on the goal or task.
6. Match KDD goals with data mining methods to suggest hidden patterns.
7. Choose data mining algorithms to discover hidden patterns. This process includes
deciding which models and parameters might be appropriate for the overall KDD
process.
8. Search for patterns of interest in a particular representational form, which include classification rules or trees, regression and clustering.
9. Interpret essential knowledge from the mined patterns.
10. Use the knowledge and incorporate it into another system for further action.
11. Document it and make reports for interested parties.
11. Does a data warehouse involve a transaction? Explain. [WBUT 2018]
Answer:
A data warehouse is a relational database that is designed for query and analysis rather
than for transaction processing. It usually contains historical data derived from
transaction data, but it can include data from other sources. It separates analysis workload
from transaction workload and enables an organization to consolidate data from several
sources.
Long Answer Type Questions

1. Define Data Warehouse and briefly describe its characteristics. [WBUT 2009, 2018]
Answer:
A data warehouse consists of a computer database responsible for the collection and
storage of information for a specific organization. This collection of information is then
used to manage information efficiently and analyze the collected data. Although data
warehouses vary in overall design, the majority of them are subject oriented, meaning that the stored information is connected to objects or events that occur in reality. The data provided by the data warehouse for analysis provides information on a specific subject, rather than on the functions of the company, and is collected from varying sources into one unit with time-variant characteristics.
A data warehouse has significantly different features from other enterprise-wide systems,
particularly in how data is stored, managed and manipulated.
There are four key characteristics which separate the data warehouse from other major operational systems:
1. Subject Orientation: data organized by subject
2. Integration: consistency of defining parameters
3. Non-volatility: stable data storage medium
4. Time-variance: timeliness of data and access terms
2. a) What are the different tiers in a typical 3-tier data warehousing architecture?
b) What do you mean by warehousing schema? Explain. [WBUT 2009]
Answer:
a) The bottom tier is a warehouse database server which is almost always a relational
database system.
The middle tier is an OLAP server which is typically implemented using either (1) a Relational OLAP (ROLAP) model, i.e., an extended relational DBMS that maps operations on multidimensional data to standard relational operations; or (2) a Multidimensional OLAP (MOLAP) model, i.e., a special-purpose server that directly implements multidimensional data and operations.
The top tier is a client, which contains query and reporting tools, analysis tools, and/or
data mining tools (e.g., trend analysis, prediction, and so on).
b) The entity-relationship data model is commonly used in the design of relational databases, where a database schema consists of a set of entities or objects, and the relationships between them. Such a data model is appropriate for online transaction processing. Data warehouses, however, require a concise, subject-oriented schema which facilitates on-line data analysis.
3. a) Introduce the concept of data mining and cite two application areas.
[WBUT 2010, 2012]
b) What are the different steps of a data mining task? [WBUT 2010, 2012]
c) Suppose that the data mining task is to cluster the following eight points (with (x, y) representing location) into 3 clusters:
A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9)
The distance function is Euclidean distance. Initially we assign A1, B2 and C1 as the centre of each cluster. Use the k-means algorithm to determine the three clusters.
[WBUT 2010, 2013, 2014, 2016, 2018]
Answer:
a) t" Part: Refer to of Question No. 2(1" Part) of Short Answer Type Questions.
2" Part:
An application of data mining is market segmentation. With market segmentation, a
company will be able to find behaviors that are common among its customers. One can
look for pattems among customers that seem to purchase the same products at the same
time. Another application of data mining is called customer churn. Customer churn will
allow a company to estimate which customers are the most likely to stop purchasing its
products or services and go to one of its competitors.
b) The following steps are usually followed in data mining. These steps are iterative, with the process moving backward whenever needed.
1. Develop an understanding of the application, of the relevant prior knowledge, and of
the end user’s goals.
2. Create a target data set to be used for discovery.
3. Clean and preprocess data (including handling missing data fields, noise in the data,
accounting for time series, and known changes).
4. Reduce the number of variables and find invariant representations of data if possible.
5. Choose the data-mining task (classification, regression, clustering, etc.)
6. Choose the data-mining algorithm.
7. Search for patterns of interest (this is the actual data mining).
8. Interpret the pattern mined. If necessary, iterate through any of steps 1 through 7.
9. Consolidate knowledge discovered and prepare a report.
c) The given eight points are: A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9).
The points A1, B2 and C1 were assumed to be the initial three cluster centres. There are two possible ways of clustering the given points using the k-means algorithm, depending on to which cluster we assign the point B1 (B1 is equidistant from the centres A1 and B2):
The first way:
1.1. Round One:
- Compute for each of the eight points its distance from each cluster centre, and assign it to the cluster to which it is most similar.
=> The clustering after the first round of execution is:
Cluster 1: A1, B1, C2
Cluster 2: B2, A3, B3
Cluster 3: C1, A2
- Update the cluster centres:
The new centre of Cluster 1: (3.67, 9)
The new centre of Cluster 2: (7, 4.33)
The new centre of Cluster 3: (1.5, 3.5)
1.2. Round Two:
No changes.
The second way:
2.1. Round One:
- Compute for each of the eight points its distance from each cluster centre, and assign it to the cluster to which it is most similar.
=> The clustering after the first round of execution is:
Cluster 1: A1, C2
Cluster 2: B2, A3, B1, B3
Cluster 3: C1, A2
- Update the cluster centres:
The new centre of Cluster 1: (3, 9.5)
The new centre of Cluster 2: (6.5, 5.25)
The new centre of Cluster 3: (1.5, 3.5)
2.2. Round Two:
- Recompute for each of the eight points its distance from each cluster centre, and assign it to the cluster to which it is most similar.
=> The clustering after the second round of execution is:
Cluster 1: A1, B1, C2
Cluster 2: B2, A3, B3
Cluster 3: C1, A2
- Update the cluster centres:
The new centre of Cluster 1: (3.67, 9)
The new centre of Cluster 2: (7, 4.33)
The new centre of Cluster 3: (1.5, 3.5)
2.3. Round Three:
No changes.
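The rounds above can be checked with a short k-means sketch (our own illustration, not from the text; squared Euclidean distance is used for the comparisons, and ties in the assignment step go to the earlier centre, which reproduces the first way):

points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "B1": (5, 8),
          "B2": (7, 5), "B3": (6, 4), "C1": (1, 2), "C2": (4, 9)}

def kmeans(centres, rounds=10):
    for _ in range(rounds):
        # Assignment step: each point joins the cluster of its nearest centre.
        clusters = [[] for _ in centres]
        for name, (x, y) in points.items():
            d = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centres]
            clusters[d.index(min(d))].append(name)
        # Update step: each centre moves to the mean of its cluster
        # (assumes no cluster becomes empty, which holds for this data).
        new = [tuple(sum(points[n][i] for n in c) / len(c) for i in (0, 1))
               for c in clusters]
        if new == centres:   # converged: no centre moved
            break
        centres = new
    return clusters, centres

clusters, centres = kmeans([points["A1"], points["B2"], points["C1"]])
print(clusters)  # [['A1', 'B1', 'C2'], ['A3', 'B2', 'B3'], ['A2', 'C1']]
print(centres)   # approximately [(3.67, 9.0), (7.0, 4.33), (1.5, 3.5)]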
4. a) Write down the design steps for a typical data warehouse. [WBUT 2015]
b) Explain the flowchart for the KDD process.
c) Define Roll-up and Drill-down process with a suitable example.
Answer:
a) In general, building any data warehouse consists of the following steps:
1. Extracting the transactional data from the data sources into a staging area - The extract step covers the data extraction from the source system and makes it accessible for further processing. The main objective of the extract step is to retrieve all the required data from the source system with as few resources as possible. The extract step should be designed in a way that it does not negatively affect the source system in terms of performance, response time or any kind of locking.
2. Transforming the transactional data - The transform step applies a set of rules to
transform the data from the source to the target. This includes converting any
measured data to the same dimension (i.e. conformed dimension) using the same
units so that they can later be joined. The transformation step also requires joining
data from several sources, generating aggregates, generating surrogate keys, sorting, deriving new calculated values, and applying advanced validation rules.
3. Loading the transformed data into a dimensional database - During the load step, it is necessary to ensure that the load is performed correctly and with as few resources as possible. The target of the load process is often a database. In order to make the load process efficient, it is helpful to disable any constraints and indexes before the load and enable them back only after the load completes. The referential integrity needs to be maintained by the ETL tool to ensure consistency.
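A toy end-to-end sketch of the three steps, using only the Python standard library (the file, table and column names are illustrative, not from the text):

import csv, sqlite3

# A stand-in source file, written here only so the sketch is self-contained;
# in a real pipeline this already exists in the operational system.
with open("sales_source.csv", "w", newline="") as f:
    csv.writer(f).writerows([("sale_date", "amount_usd"),
                             ("2024-01-05", "19.99"),
                             ("2024-01-06", "-3.00")])

# 1. Extract: pull the raw rows from the source into a staging area.
with open("sales_source.csv", newline="") as f:
    staged = list(csv.DictReader(f))

# 2. Transform: convert types and apply a simple validation rule.
rows = [(r["sale_date"], round(float(r["amount_usd"]), 2))
        for r in staged if float(r["amount_usd"]) >= 0]

# 3. Load: insert the conformed rows into the dimensional target table.
db = sqlite3.connect("warehouse.db")
db.execute("CREATE TABLE IF NOT EXISTS fact_sales (sale_date TEXT, amount REAL)")
db.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)
db.commit()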
b) An outline of the steps of the KDD process:
The overall process of finding and interpreting patterns from data involves the repeated
application of the following steps:
1. Developing an understanding of
o the application domain
o the relevant prior knowledge
o the goals of the end-user
2. Creating a target data set: selecting a data set, or focusing on a subset of variables or data samples, on which discovery is to be performed.
3. Data cleaning and preprocessing.
o Removal of noise or outliers.
o Collecting necessary information to model or account for noise.
o Strategies for handling missing data fields.
o Accounting for time sequence information and known changes.
4. Data reduction and projection.
o Finding useful features to represent the data depending on the goal of the task.
o Using dimensionality reduction or transformation methods to reduce the
effective number of variables under consideration or to find invariant
representations for the data.
5. Choosing the data mining task.
o Deciding whether the goal of the KDD process is classification, regression,
clustering, etc.
6. Choosing the data mining algorithm(s).
o Selecting method(s) to be used for searching for patterns in the data.
o Deciding which models and parameters may be appropriate.
o Matching a particular data mining method with the overall criteria of the
KDD process.
7. Data mining.
Searching for patterns of interest in a particular representational form or a set
of such representations as classification rules or trees, regression, clustering,
and so forth.
8. Interpreting mined patterns.
9. Consolidating discovered knowledge.
c) Roll-up
Roll-up performs aggregation on a data cube in any of the following ways:
• By climbing up a concept hierarchy for a dimension
• By dimension reduction
The following diagram illustrates how roll-up works:
• Roll-up is performed by climbing up a concept hierarchy for the dimension location.
• Initially the concept hierarchy was "street < city < province < country".
• On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country.
• The data is grouped into countries rather than cities.
• When roll-up is performed, one or more dimensions from the data cube are removed.
Drill-down
Drill-down is the reverse operation of roll-up. It is performed by either of the following
ways:
• By stepping down a concept hierarchy for a dimension
• By introducing a new dimension
The following diagram illustrates how drill-down works:
[Figure: drill-down on the time dimension, descending from quarters to months; item types shown: Mobile, Modem, Phone, Security]
• Drill-down is performed by stepping down a concept hierarchy for the dimension time.
• Initially the concept hierarchy was "day < month < quarter < year".
• On drilling down, the time dimension is descended from the level of quarter to the level of month.
• When drill-down is performed, one or more dimensions are added to the data cube.
• It navigates the data from less detailed data to highly detailed data.
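Roll-up and drill-down can be imitated on a flat sales table with pandas (a sketch assuming pandas is installed; the figures and column names are made up for illustration):

import pandas as pd

sales = pd.DataFrame({
    "country": ["India", "India", "Canada", "Canada"],
    "city":    ["Delhi", "Mumbai", "Toronto", "Ottawa"],
    "amount":  [605, 680, 395, 285],
})

# Roll-up: ascend the location hierarchy from the city level to the country level.
print(sales.groupby("country")["amount"].sum())

# Drill-down is the reverse: descend back to the finer city level.
print(sales.groupby(["country", "city"])["amount"].sum())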
5. Write short notes on the following:
a) Strategic information
b) Data Mart
c) Issues and challenges in data mining [WBUT 2008]
d) Text mining
e) DBMS vis-à-vis Data Mining
Answer:
a) Strategic information:
Strategic information is needed for day-to-day operations, meeting government requirements, as well as long- and short-range planning. Strategic information systems are used by organizations to achieve a competitive advantage for the organization.
A Strategic Information System (SIS) is a system that helps companies change or otherwise alter their business strategy and/or structure. It is typically utilized to streamline and quicken the reaction time to environmental changes and aid the company in achieving a competitive advantage. Strategic information systems rely on data gathered at all organizational levels and from all information systems, whether transaction processing systems, management information systems, decision support systems or executive information systems.
Key features of Strategic Information Systems are the following:
i) Decision support systems that enable the development of a strategic approach to align Information Systems (IS) or Information Technologies (IT) with an organization's business strategies.
ii) Primarily enterprise resource planning solutions that integrate/link the business processes to meet the enterprise objectives for the optimization of the enterprise resources.
iii) Database systems with "data mining" capabilities to make the best use of available corporate information for marketing, production, promotion and innovation. SIS systems also facilitate identification of the data collection strategies to help optimize database marketing opportunities.
iv) Real-time information systems that intend to maintain a rapid response and the quality indicators.
b) Data Mart: Refer to Question No. 1 of Short Answer Type Questions.
c) Issues and challenges in data mining:
One of the key issues raised by data mining technology is not a business or technological one, but a social one. It is the issue of individual privacy. Data mining makes it possible to analyze routine business transactions and glean a significant amount of information about individuals' buying habits and preferences.
Another issue is that of data integrity. Clearly, data analysis can only be as good as the data that is being analyzed. A key implementation challenge is integrating conflicting or redundant data from different sources. For example, a bank may maintain credit card accounts on several different databases. The addresses (or even the names) of a single
cardholder may be different in each. Software must translate data from one system to
another and select the address most recently entered. .
A hotly debated technical issue is whether it is better to set up a relational database
structure or a multidimensional one. In a relational structure, data is stored in tables,
permitting ad hoc queries. In a multidimensional structure, on the other hand, sets of
cubes are arranged in arrays, with subsets created according to category. While
multidimensional structures facilitate multidimensional data mining, relational structures
thus far have performed better in client/server environments. And, with the explosion of
the Internet, the world is becoming one big client/server environment. :
Finally, there is the issue of cost. While system hardware costs have dropped dramatically within the past five years, data mining and data warehousing tend to be self-reinforcing. The more powerful the data mining queries, the greater the utility of the
information being gleaned from the data, and the greater the pressure to increase the
amount of data being collected and maintained, which increases the pressure for faster,
more powerful data mining queries. This increases pressure for larger, faster systems,
which are more expensive.
d) Text mining:
Text mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources. A key element is the linking together of the extracted information to form new facts or new hypotheses to be explored further by more conventional means of experimentation. In text mining, the goal is to discover heretofore unknown information, something that no one yet knows and so could not have yet been written down.
The difference between regular data mining and text mining is that in text mining the
patterns are extracted from natural language text rather than from structured databases of
facts. Databases are designed for programs to process automatically; text is written for people to read. We do not have programs that can "read" text and will not have such for the foreseeable future. Many researchers think it will require a full simulation of how the mind works before we can write programs that read the way people do.
e) DBMS vis-à-vis Data Mining:
DBMS and data mining answer different queries. Data mining helps in predicting the future, whereas a DBMS gives reports.
Example DBMS reports:
• Last month's sales for each service type
• Sales per service grouped by customer sex or age bracket
• List of customers who lapsed their policy
Questions answered using Data Mining:
• What characteristics do customers that lapse their policy have in common, and how do they differ from customers who renew their policy?
• Which motor insurance policy holders would be potential customers for my House Content Insurance policy?
6. What are data mining primitives? [MODEL QUESTION]
Answer:
[Figure: data mining primitives (pre-processing, data mining algorithms, etc.) shown in relation to SQL primitives such as project and select]
Contrary to current belief, data mining is first and foremost a server technology. Until recently, most data mining tools operated on flat files and in-memory databases. With the shift of attention to larger databases, the need for implementing data mining algorithms on top of relational databases has been recognized. However, current relational technology, with its emphasis on efficient updates and exact identification of records, is ill-suited to the support of data mining algorithms.
Data mining algorithms require:
• efficient sampling methods
• storage of intermediate results during execution of queries
• capacity to handle large amounts of data
• geometric indexes that allow users to search in the neighborhood of a record, and bit-mapped operations.
Data mining is supported by client/server technology, with the server technology being of
great importance when working with large databases. All these functions can in principle
be realized on top of existing implementations, as they are based on generalizations of the
same mathematical structure, that is, first-order logic and algebra.
Can these ideas be brought into a general framework for the revision of the relational model? The figure above relates a number of fundamental data mining routines to SQL primitives. In general, data mining routines can be divided into four groups:
1. Pre-processing: Routines in this group take samples, compress attribute values into clusters, and code tables into appropriate input formats for algorithms (that is, transform historical files into time series, multi-valued attributes into sets of binary attributes, tables into bitmaps, etc.).
2. Nearest neighbor search: Neighbor search can be interpreted as a generalization of
exact identification of records by means of traditional keys.
3. N-table scan algorithms: This is a class of algorithms that need to access a table several times during execution. It is not yet clear whether it is possible to speed up the database performance of these algorithms, although enriched user-defined functions may prove to be of help here.
4. Zoom scan algorithms: This is a group of algorithms that also access a table repeatedly, but in this case it is clear that an adaptation of relational technology could improve the speed of execution dramatically. These algorithms zoom into interesting sub-parts of the database during their execution. This basically involves the repeated execution of SQL statements that access similar or overlapping selections of records.
This short overview clearly indicates that it is already possible to draw a rough outline of an enriched relational model. A number of details still have to be worked out, however, both from a formal perspective (neighborhood topologies, tabular projections, zoom scan queries, etc.) and from an implementational view (that is, how to implement this model on a parallel database server capable of handling billions of records). Data mining and data warehousing represent radical technological advances which will necessarily alter our fundamental approach to database technology.
7. Explain scalable methods for mining sequential patterns. [MODEL QUESTION]
Answer:
Sequential pattern mining is computationally challenging because such mining may generate and/or test a combinatorially explosive number of intermediate subsequences. How can we develop efficient and scalable methods for sequential pattern mining?
Recent developments have made progress in two directions: (1) efficient methods for mining the full set of sequential patterns, and (2) efficient methods for mining only the set of closed sequential patterns, where a sequential pattern s is closed if there exists no sequential pattern s' such that s' is a proper super-sequence of s and s' has the same (frequency) support as s. Because all of the subsequences of a frequent sequence are also frequent, mining the set of closed sequential patterns may avoid the generation of unnecessary subsequences and thus lead to more compact results as well as more efficient methods than mining the full set. We will first examine methods for mining the full set and then study how they can be extended for mining the closed set. In addition, we discuss modifications for mining multilevel, multidimensional sequential patterns (i.e., with multiple levels of granularity).
We can give examples of three such approaches to sequential pattern mining, represented by the algorithms GSP, SPADE, and PrefixSpan, respectively. GSP adopts a candidate generate-and-test approach using a horizontal data format (where the data are represented as <sequence_ID : sequence_of_itemsets>, as usual, where each itemset is an event). SPADE adopts a candidate generate-and-test approach using a vertical data format (where the data are represented as <itemset : (sequence_ID, event_ID)>). The vertical data format can be obtained by transforming a horizontally formatted sequence database in just one scan. PrefixSpan is a pattern-growth method, which does not require candidate generation.
All three approaches either directly or indirectly explore the Apriori property, stated as
follows: every nonempty subsequence of a sequential pattern is a sequential pattern.
(Recall that for a pattern to be called sequential, it must be frequent. That is, it must
satisfy minimum support.) The Apriori property is antimonotonic (or downward-closed) in that, if a sequence cannot pass a test (e.g., regarding minimum support), all of its super-sequences will also fail the test. Use of this property to prune the search space can help make the discovery of sequential patterns more efficient.
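The support-counting core shared by these algorithms can be sketched in a few lines of Python (our own illustration; a data sequence is modelled as a list of itemsets, i.e., events):

def is_subsequence(pattern, sequence):
    # True if the events of `pattern` map, in order, onto events of `sequence`.
    i = 0
    for event in sequence:
        if i < len(pattern) and pattern[i] <= event:
            i += 1
    return i == len(pattern)

db = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}],
    [{"a", "d"}, {"c"}, {"b", "c"}],
    [{"e", "f"}, {"a", "b"}, {"d", "f"}],
]
pattern = [{"a"}, {"b", "c"}]
print(sum(is_subsequence(pattern, s) for s in db))  # 2 of the 3 sequences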
CLASSIFICATION AND PREDICTION OF
DATA WAREHOUSING
Multiple Choice Type Questions
1. Which of the following techniques are appropriate for data warehousing? [WBUT 2009, 2013, 2018]
a) Hashing on primary keys
b) Indexing on foreign keys of the fact table
c) Bit-map indexing
d) Join indexing
Answer: (c)
2. ............ is an example of predictive type of data mining whereas ............ is an example of descriptive type of data mining. [WBUT 2010, 2012]
a) Association Rule, Clustering
b) Association Rule, Classification
c) Classification, Clustering
d) Clustering, Classification
Answer: (c)

3. What is Metadata? [WBUT 2017]
a) Summarized data
b) Operational data
c) Data about data
d) None of these
Answer: (c)

4. The full form of OLAP is [WBUT 2017]
a) Online Analytical Processing
b) Online Advanced Processing
c) Online Advanced Preparation
d) Online Analytical Performance
Answer: (a)

5. The apriori algorithm is a [WBUT 2017]
a) top-down search
b) breadth first search
c) depth first search
d) bottom-up search
Answer: (d)

6. Classification rules are extracted from [WBUT 2017]
a) Root node
b) Decision tree
c) Siblings
d) Branches
Answer: (b)

7. Which of the following is a predictive model? [WBUT 2017]
a) Clustering
b) Regression
c) Summarization
d) Association rules
Answer: (b)

8. All sets of items whose support is greater than the user-specified minimum support are called as [WBUT 2017]
a) Border set
b) Frequent set
c) Maximal frequent set
d) Lattice
Answer: (b)
Short Answer Type Questions

1. Explain the use of the Dynamic Itemset Counting algorithm. [WBUT 2009]
OR,
Describe the principle of the Dynamic Itemset Counting technique for Frequent Itemset generation. [WBUT 2010]
Answer:
Dynamic Itemset Counting is an alternative to Apriori itemset generation. Here, itemsets are dynamically added and deleted as transactions are read. It relies on the fact that for an itemset to be frequent, all of its subsets must also be frequent, so we only examine those itemsets whose subsets are all frequent.
The algorithm stops after every M transactions to add more itemsets.
• Train analogy: There are stations every M transactions. The passengers are itemsets. Itemsets can get on at any stop as long as they get off at the same stop in the next pass around the database. Only itemsets on the train are counted when they occur in transactions. At the very beginning we can start counting 1-itemsets; at the first station we can start counting some of the 2-itemsets. At the second station we can start counting 3-itemsets, as well as any more 2-itemsets that can be counted, and so on.
Itemsets are marked in four different ways as they are counted:
• Solid box: confirmed frequent itemset - an itemset we have finished counting that exceeds the support threshold minsupp
• Solid circle: confirmed infrequent itemset - an itemset we have finished counting that is below minsupp
• Dashed box: suspected frequent itemset - an itemset we are still counting that exceeds minsupp
• Dashed circle: suspected infrequent itemset - an itemset we are still counting that is below minsupp
DIC Algorithm
1. Mark the empty itemset with a solid square. Mark all the 1-itemsets with dashed circles. Leave all other itemsets unmarked.
2. While any dashed itemsets remain:
a) Read M transactions (if we reach the end of the transaction file, continue from
the beginning). For each transaction, increment the respective counters for the
itemsets that appear in the transaction and are marked with dashes.
b) If a dashed circle's count exceeds minsupp, turn it into a dashed square. If any immediate superset of it has all of its subsets as solid or dashed squares, add a new counter for it and make it a dashed circle.
Once a dashed itemset has been counted through all the transactions, make it solid and
stop counting it.
2. State the Apriori Algorithm for frequent item set generation. [WBUT 2019]
Answer:
The Apriori Algorithm is an influential algorithm for mining frequent itemsets for boolean association rules.
Key Concepts:
• Frequent Itemsets: the sets of items which have minimum support (denoted by Li for the i-th itemset).
• Apriori Property: any subset of a frequent itemset must be frequent.
• Join Operation: to find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself.
Algorithm:
• Find the frequent itemsets: the sets of items that have minimum support
o A subset of a frequent itemset must also be a frequent itemset, i.e., if {A, B} is a frequent itemset, both {A} and {B} should be frequent itemsets
o Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
• Use the frequent itemsets to generate association rules.
Pseudo-code:
• Join Step: Ck is generated by joining Lk-1 with itself
• Prune Step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset

Ck: candidate itemset of size k
Lk: frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk != empty set; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return the union of all Lk;
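A runnable Python rendering of this pseudo-code, using an absolute support count (our own small implementation, kept deliberately simple):

from itertools import combinations

def apriori(transactions, min_count):
    # Frequent itemsets with support count >= min_count, found level by level.
    def count(itemset):
        return sum(itemset <= t for t in transactions)

    items = {frozenset([i]) for t in transactions for i in t}
    L = {s for s in items if count(s) >= min_count}          # L1
    frequent, k = {s: count(s) for s in L}, 1
    while L:
        # Join step: candidate (k+1)-itemsets from Lk joined with itself.
        C = {a | b for a in L for b in L if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        C = {c for c in C if all(frozenset(s) in L for s in combinations(c, k))}
        # Scan the database and keep the candidates with enough support.
        L = {c for c in C if count(c) >= min_count}
        frequent.update({c: count(c) for c in L})
        k += 1
    return frequent

tx = [{"milk", "bread"}, {"bread", "butter"}, {"beer"},
      {"milk", "bread", "butter"}, {"bread", "butter"}]
print(apriori(tx, min_count=2))  # e.g. {bread, butter}: 3, {milk, bread}: 2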
3. Describe the principle of the partitioning technique for Frequent Itemset generation and justify how it improves the efficiency of frequent item set generation compared to the Apriori algorithm. [WBUT 2010, 2012]
Answer:
A partitioning technique can be used that requires just two database scans to mine the frequent itemsets. It consists of two phases. In Phase I, the algorithm subdivides the transactions of D into n nonoverlapping partitions. If the minimum support threshold for transactions in D is min_sup, then the minimum support count for a partition is min_sup multiplied by the number of transactions in that partition. For each partition, all frequent itemsets within the partition are found. These are referred to as local frequent itemsets. In Phase II, a second scan of D is conducted in which the actual support of each candidate is assessed in order to determine the global frequent itemsets. Partition size and the number of partitions are set so that each partition can fit into main memory and therefore be read only once in each phase.
The Apriori algorithm needs many database scans, and for each scan frequent itemsets are searched for by pattern matching, which is especially time-consuming for large frequent itemsets with long patterns. The partition algorithm uses the intersection of transaction ids (tids) to determine the support count. Because it performs a breadth-first search, it partitions the database into blocks in which local frequent itemsets are found, to overcome the memory limitations. An extra database scan is needed to determine the global frequency of the local frequent itemsets.
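A small sketch of the two-phase idea in Python (our own illustration; the local miner is brute force, acceptable only for toy data):

from itertools import combinations

def local_frequent(partition, min_ratio):
    # Phase I helper: every itemset locally frequent within one partition.
    items = sorted({i for t in partition for i in t})
    local = set()
    for k in range(1, len(items) + 1):
        for c in combinations(items, k):
            if sum(set(c) <= t for t in partition) >= min_ratio * len(partition):
                local.add(frozenset(c))
    return local

def partition_mine(transactions, min_ratio, n_parts=2):
    size = len(transactions) // n_parts
    parts = [transactions[i * size:(i + 1) * size] for i in range(n_parts - 1)]
    parts.append(transactions[(n_parts - 1) * size:])
    # Phase I: the union of local frequent itemsets forms the global candidates.
    candidates = set().union(*(local_frequent(p, min_ratio) for p in parts))
    # Phase II: one more full scan determines the global frequent itemsets.
    return {c for c in candidates
            if sum(c <= t for t in transactions) >= min_ratio * len(transactions)}

tx = [{"milk", "bread"}, {"bread", "butter"}, {"beer"},
      {"milk", "bread", "butter"}, {"bread", "butter"}]
print(partition_mine(tx, min_ratio=0.4))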
4. What is clustering? Discuss two main methods of clustering. [WBUT 2010]
Answer:
1st Part: Refer to Question No. 1(a) (1st Part) of Long Answer Type Questions.
2nd Part:
The non-hierarchical techniques are in general faster to create from the historical database, but require that the user make some decision about the number of clusters desired or the minimum "nearness" required for two records to be within the same cluster. These non-hierarchical techniques are often run multiple times, starting off with some arbitrary or even random clustering and then iteratively improving the clustering by shuffling some records around. Alternatively, these techniques sometimes create clusters with only one pass through the database, adding records to existing clusters when they fit and creating new clusters when no existing cluster is a good candidate for the given record. Because the definition of which clusters are formed can depend on these initial choices of which starting clusters should be chosen, or even how many clusters are chosen, these techniques can be less repeatable than the hierarchical techniques and can sometimes create either too many or too few clusters, because the number of clusters is predetermined by the user and not determined solely by the patterns inherent in the database.
5. Suppose that the data mining task is to cluster the following ten points (with (x, y) representing location) into two clusters:
X1(2,6), X2(3,4), X3(3,8), X4(4,7), X5(6,2), X6(6,4), X7(7,3), X8(7,4), X9(8,5), X10(7,6)
The distance function is defined as |x_i - x_j| + |y_i - y_j|.
Use the k-means or k-medoid algorithm to determine the two clusters. [WBUT 2012]
Answer:
Step 1:
Initialize k centres. Let us assume c1 = (3,4) and c2 = (7,4), so c1 and c2 are selected as medoids. The cost (distance) of each remaining point to each medoid is:

Point | Cost to c1(3,4) | Cost to c2(7,4)
X1(2,6) | 3 | 7
X3(3,8) | 4 | 8
X4(4,7) | 4 | 6
X5(6,2) | 5 | 3
X6(6,4) | 3 | 1
X7(7,3) | 5 | 1
X9(8,5) | 6 | 2
X10(7,6) | 6 | 2

Assigning each point to its nearest medoid, the clusters become:
Cluster1 = {(3,4), (2,6), (3,8), (4,7)}
Cluster2 = {(7,4), (6,2), (6,4), (7,3), (8,5), (7,6)}
Since the points (2,6), (3,8) and (4,7) are closer to c1, they form one cluster, whilst the remaining points form the other cluster.
So the total cost involved is 20.
The cost between any two points is found using the formula
cost(x, c) = sum over the d dimensions of |x_i - c_i|
where x is any data object, c is the medoid, and d is the dimension of the object, which in this case is 2.
The total cost is the summation of the costs of the data objects from the medoids of their clusters, so here:
Total cost = {cost((3,4),(2,6)) + cost((3,4),(3,8)) + cost((3,4),(4,7))}
+ {cost((7,4),(6,2)) + cost((7,4),(6,4)) + cost((7,4),(7,3)) + cost((7,4),(8,5)) + cost((7,4),(7,6))}
= (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2)
= 20
Step 2:
Select one of the non-medoids, O'. Let us assume O' = (7,3), so the candidate medoids are c1(3,4) and O'(7,3). If c1 and O' are the new medoids, calculate the total cost involved using the formula in Step 1:

Point | Cost to c1(3,4) | Cost to O'(7,3)
X1(2,6) | 3 | 8
X3(3,8) | 4 | 9
X4(4,7) | 4 | 7
X5(6,2) | 5 | 2
X6(6,4) | 3 | 2
X8(7,4) | 4 | 1
X9(8,5) | 6 | 3
X10(7,6) | 6 | 3

Total cost = 3 + 4 + 4 + 2 + 2 + 1 + 3 + 3 = 22
DWM-25POPULAR PUBLICATIONS
So cost of swapping medoid from c; to O' is
'S = current total cost — past total cost = 22-20 =2 > 0 ;
‘So moving to O' would be a bad idea, so the previous choice was good. So we try other
nonmedoids and found that our first choice was the best. So the configuration does not
change and algorithm terminates here (i.e. there is no change in the medoids)..
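The two totals can be verified with a short Manhattan-distance sketch in Python (our own illustration, not part of the original solution):

points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
          (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]

def manhattan(p, q):
    # The |x1 - x2| + |y1 - y2| distance used in the question.
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def total_cost(medoids):
    # Sum, over all points, of the distance to the nearest medoid.
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

print(total_cost([(3, 4), (7, 4)]))  # 20 - the initial configuration
print(total_cost([(3, 4), (7, 3)]))  # 22 - swapping c2 for O' is worse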
6. What is metadata in data warehousing? What is a metadata catalog? Discuss the different categories of metadata used in a data warehouse. [WBUT 2014]
OR,
a) What is metadata? [WBUT 2015]
b) What is the typical content of metadata of a data warehouse? [WBUT 2015]
OR,
What is Metadata? Explain different types of Metadata. [WBUT 2017]
Answer:
Metadata is simply defined as data about data. The data that is used to represent other
data is known as metadata. In terms of data warehouse, we can define metadata as
follows.
* Metadata is the road-map to a data warehouse.
* Metadata in a data warehouse defines the warehouse objects.
© Metadata acts as a directory. This directory helps the decision support system to
locate the contents of a data warehouse.
A collection of metadata records combined with data management and search tools forms
a data catalogue. The architecture of these data catalogues can vary from tightly held
internal data indexes to widely accessible distributed catalogues spread across the
Internet.
Metadata can be broadly categorized into three categories:
* Business Metadata: It has the data ownership information, business definition, and
changing policies.
Technical Metadata: It includes database system names, table and column names
and sizes, data types and allowed values. Technical metadata also includes structural
information such as primary and foreign key attributes and indices.
© Operational Metadata: It includes currency of data and data lineage. Currency of
data means whether the data is active, archived, or purged. Lineage of data means the
history of data migrated and transformation applied on it.
Long Answer Type Questions

1. a) What is clustering? Why is it difficult to handle categorical data for clustering? Distinguish between partition clustering and hierarchical clustering. [WBUT 2009]
OR,
What is clustering? What are the features of a good cluster? What do you mean by hierarchical clustering technique? [WBUT 2013, 2014, 2016, 2018]
Answer:
1st Part:
Clustering is a data mining (machine learning) technique used to place data elements into related groups without advance knowledge of the group definitions.
There are two main types of clustering techniques: those that create a hierarchy of clusters, and those that do not. The hierarchical clustering techniques create a hierarchy of clusters from small to big. With a hierarchy of clusters defined, it is possible to choose the number of clusters that are desired. At the extreme, it is possible to have as many clusters as there are records in the database. In this case the records within each cluster are optimally similar to each other (since there is only one) and certainly different from the other clusters. The advantage of hierarchical clustering methods is that they allow the end user to choose from either many clusters or only a few.
2nd Part:
Since the process of grouping a set of physical or abstract objects into classes of similar objects is clustering, the similarity metric is important here because it is used for outlier detection. A clustering algorithm which is main-memory based can operate only on the following two data structures:
a) Data matrix
b) Dissimilarity matrix
So it is difficult to handle categorical data.
In partition clustering, every data sample is initially assigned to a cluster in some (possibly random) way. Samples are then iteratively transferred from cluster to cluster until some criterion function is minimized. Once the process is complete, the samples will have been partitioned into separate compact clusters. Examples of partition clustering methods are k-means and Lloyd's method.
In hierarchical clustering, each sample is initially considered a member of its own cluster, after which clusters are recursively combined in pairs according to some predetermined condition until eventually every point belongs to a single cluster. The resulting hierarchical structure may be represented by a binary tree or "dendrogram", from which the desired clusters may be extracted. Examples of hierarchical clustering methods are the single-link, Ward's, centroid, complete-link, group average, median, and parametric Lance-Williams methods.
b) Describe the working of the PAM algorithm. Compare its performance with CLARA and CLARANS. [WBUT 2009]
Answer:
The PAM k-medoids clustering algorithm evaluates a set of k objects considered to be representative objects (medoids) of k clusters within T objects, such that each non-selected object is clustered with the medoid to which it is most similar (i.e., closest in terms of the provided distance metric). The process operates by iteratively swapping one of the medoids with one of the objects such that the total distance between non-selected objects and their medoids is reduced. The algorithm can be depicted as follows:
Step 1: Initialization - choose k medoids from the T objects randomly.
Step 2: Evaluation - calculate the cost D' - D for each swap of one medoid with one object, where D is the total distance before the swap and D' is the total distance after the swap.
Step 3: Selection - accept the swap with the best cost; if the cost is negative, go to Step 2; otherwise record the medoids and terminate the program.
The computational complexity of the PAM algorithm is O((1 + β)k(T - k)²) based on the number of partitions per object, where β is the number of successful swaps. It can also be expressed as O'((1 + β)k²(T - k)²) based on the number of distance calculations, i.e., one partition per object is equivalent to k distance calculations. Clearly, this is time consuming even for a moderate number of objects and a small number of medoids.
The computational complexity of the CLARA algorithm is O(q(ks² + (T - k)) + βks²) based on the number of partitions per object, or O'(q(k²s² + k(T - k)) + βk²s²) based on the number of distance calculations, where q, s, k, β and T are the number of samples, the object size per sample, the number of medoids, the number of successful swaps for all samples tested, and the total number of objects, respectively. Clearly, the CLARA algorithm can deal with a larger number of objects than can the PAM algorithm if s << T.
The computational complexity of CLARANS is O((β + numlocal)(T - k)) based on the number of partitions per object, or O'((β + numlocal)k(T - k)) based on the number of distance calculations, where β is the number of test moves between nodes.
2. a) What is Clustering? Why is it required for data warehousing and data mining? What are the differences between partitioning clustering and hierarchical clustering?
b) Suppose the following table shows the distances between different cities. Construct a hierarchical tree using the hierarchical clustering method.
   | BA  | FI  | MI  | NA  | RM  | TO
BA | 0   | 662 | 877 | 255 | 412 | 996
FI | 662 | 0   | 295 | 468 | 268 | 400
MI | 877 | 295 | 0   | 754 | 564 | 138
NA | 255 | 468 | 754 | 0   | 219 | 869
RM | 412 | 268 | 564 | 219 | 0   | 669
TO | 996 | 400 | 138 | 869 | 669 | 0
c) Discuss the K-means algorithm with a suitable example. [WBUT 2017]
Answer:
a) 1st Part: Refer to Question No. 1(a) (1st Part) of Long Answer Type Questions.
2nd Part:
Clustering is required for data warehousing and data mining for the following reasons:
• Scalability - We need highly scalable clustering algorithms to deal with large
  databases.
• Ability to deal with different kinds of attributes - Algorithms should be capable of
  being applied to any kind of data, such as interval-based (numerical), categorical
  and binary data.
• Discovery of clusters of arbitrary shape - The clustering algorithm should be
  capable of detecting clusters of arbitrary shape. It should not be bound only to
  distance measures that tend to find small spherical clusters.
• High dimensionality - The clustering algorithm should be able to handle not only
  low-dimensional data but also high-dimensional spaces.
• Ability to deal with noisy data - Databases contain noisy, missing or erroneous data.
  Some algorithms are sensitive to such data and may lead to poor-quality clusters.
• Interpretability - The clustering results should be interpretable, comprehensible and
  usable.
3rd Part: Refer to Question No. 1(a) (2nd Part) of Long Answer Type Questions.
b) The nearest pair of cities is MI and TO, at distance 138. These are merged into a single
cluster called "MI/TO". The level of the new cluster is L(MI/TO) = 138 and the new
sequence number is m = 1.
Then we compute the distance from this new compound object to all other objects. In
single-link clustering the rule is that the distance from the compound object to another
object is equal to the shortest distance from any member of the cluster to the outside
object. So the distance from "MI/TO" to RM is chosen to be 564, which is the distance
from MI to RM, and so on.
After merging MI with TO we obtain the following matrix:
        BA    FI   MI/TO   NA    RM
BA       0   662    877   255   412
FI     662     0    295   468   268
MI/TO  877   295      0   754   564
NA     255   468    754     0   219
RM     412   268    564   219     0
min d(i,j) = d(NA,RM) = 219 => merge NA and RM into a new cluster called NA/RM
L(NA/RM) = 219
m = 2
        BA    FI   MI/TO  NA/RM
BA       0   662    877    255
FI     662     0    295    268
MI/TO  877   295      0    564
NA/RM  255   268    564      0
min d(i,j) = d(BA,NA/RM) = 255 => merge BA and NA/RM into a new cluster called
BA/NA/RM
L(BA/NA/RM) = 255
m = 3
           BA/NA/RM   FI   MI/TO
BA/NA/RM       0     268    564
FI           268       0    295
MI/TO        564     295      0
min d(i,j) = d(BA/NA/RM,FI) = 268 => merge BA/NA/RM and FI into a new cluster
called BA/FI/NA/RM
L(BA/FI/NA/RM) = 268
m = 4
              BA/FI/NA/RM   MI/TO
BA/FI/NA/RM        0         295
MI/TO            295           0
Finally, we merge the last two clusters at level 295.
The process is summarized by the following hierarchical tree:
[Dendrogram: leaves BA, NA, RM, FI, MI, TO; merges at levels 138, 219, 255, 268 and 295]
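The same merge sequence can be reproduced with SciPy (a sketch; the labels and the
distance matrix come from the table above, everything else is an assumption about
how one might check the result):

import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

labels = ["BA", "FI", "MI", "NA", "RM", "TO"]
D = np.array([[  0, 662, 877, 255, 412, 996],
              [662,   0, 295, 468, 268, 400],
              [877, 295,   0, 754, 564, 138],
              [255, 468, 754,   0, 219, 869],
              [412, 268, 564, 219,   0, 669],
              [996, 400, 138, 869, 669,   0]])

# linkage() expects the condensed (upper-triangular) form of the matrix.
Z = linkage(squareform(D), method="single")
print(Z)   # merge heights 138, 219, 255, 268, 295, as in the worked example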
c) K-means is an unsupervised learning algorithm that solves the clustering problem. The
procedure follows a simple and easy way to classify a given data set through a certain
number of clusters (assume k clusters) fixed a priori. The algorithm is composed of the
following steps:
1. Place K points into the space represented by the objects that are being clustered.
These points represent initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation
of the objects into groups from which the metric to be minimized can be calculated.
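The four steps above translate almost directly into NumPy (a minimal sketch; the data
matrix X, the random initialization and the convergence test are assumptions of this
illustration):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: place K points (here: k random samples) as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each object to the group with the closest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 3: recalculate the centroid positions (assumes no cluster
        # becomes empty, which a production version would have to handle).
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centroids no longer move.
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids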
Suppose that we have n sample feature vectors x1, x2, ..., xn all from the same class, and
we know that they fall into k compact clusters, k < n.
… {C, E} => A, confidence = σ(A, C, E)/σ(C, E) = 3/4 = 0.75
4. Write short notes on the following:
a) Dimensional Modeling [WBUT 2018]
b) Generalized Association Rule [WBUT 2010]
c) PAM Clustering Technique [WBUT 2010]
d) CLARANS clustering algorithm vis-a-vis PAM and CLARA
e) K-Medoid Algorithm [WBUT 2017]
Answer:
a) Dimensional Modeling:
A dimensional model is a data structure technique optimized for data warehousing tools.
The concept of dimensional modelling was developed by Ralph Kimball and consists of
"fact" and "dimension" tables.
A dimensional model is designed to read, summarize and analyze numeric information
(values, balances, counts, weights, etc.) in a data warehouse. In contrast, relational
models are optimized for the addition, updating and deletion of data in a real-time Online
Transaction System.
These dimensional and relational models each have their own way of storing data, with
specific advantages.
For instance, in the relational model, normalization and ER modelling reduce redundancy
in the data. On the contrary, the dimensional model arranges data in such a way that it is
easier to retrieve information and generate reports.
Hence, dimensional models are used in data warehouse systems and are not a good fit for
relational systems.
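As a small illustration (not from the original answer; the tables and column names are
invented), the fact/dimension split can be mimicked with pandas, with a reporting query
reduced to joins on surrogate keys plus an aggregation:

import pandas as pd

dim_product = pd.DataFrame({"product_key": [1, 2],
                            "product_name": ["Pen", "Notebook"]})
dim_date = pd.DataFrame({"date_key": [20240101, 20240102],
                         "month": ["Jan", "Jan"]})
# The fact table holds only numeric measures and foreign keys.
fact_sales = pd.DataFrame({"product_key": [1, 2, 1],
                           "date_key": [20240101, 20240101, 20240102],
                           "units": [10, 5, 7],
                           "amount": [20.0, 15.0, 14.0]})

report = (fact_sales.merge(dim_product, on="product_key")
                    .merge(dim_date, on="date_key")
                    .groupby(["month", "product_name"])["amount"].sum())
print(report)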
b) Generalized Association Rule:
In most real-life applications, taxonomies (is-a hierarchies) over the items are available.
In general, a taxonomy can be represented as a directed acyclic graph (DAG). Given a
set of transactions T, each of which is a set of items, and a taxonomy Tax, the problem of
mining generalized association rules is to discover all rules of the form X => Y, with the
user-specified minimum support and minimum confidence. X and Y can be sets of items
at any level of the taxonomy, such that no item in Y is an ancestor of any item in X. For
example, there might be a rule which says that "50% of transactions that contain Soft
drinks also contain Snacks; 5% of all transactions contain both these items".
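One common implementation trick (a sketch; the item names besides "Soft drinks" and
"Snacks", and the child-to-parent map, are assumptions of this illustration) is to extend
every transaction with the ancestors of its items, after which an ordinary frequent-itemset
miner finds the generalized rules:

taxonomy = {"Coke": "Soft drinks", "Pepsi": "Soft drinks",
            "Chips": "Snacks", "Soft drinks": "Beverages"}

def ancestors(item):
    # Walk up the is-a hierarchy (collapsed to a tree here for simplicity).
    out = []
    while item in taxonomy:
        item = taxonomy[item]
        out.append(item)
    return out

def extend(transaction):
    ext = set(transaction)
    for item in transaction:
        ext.update(ancestors(item))
    return ext

print(extend({"Coke", "Chips"}))
# {'Coke', 'Chips', 'Soft drinks', 'Beverages', 'Snacks'}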
c) PAM Clustering Technique:
PAM stands for "Partitioning Around Medoids". The algorithm is intended to find a sequence
of objects called medoids that are centrally located in clusters. Objects that are tentatively
defined as medoids are placed into a set S of selected objects.
If O is the set of all objects, then U = O - S is the set of unselected objects. The goal of
the algorithm is to minimize the average dissimilarity of objects to their closest selected
object. Equivalently, we can minimize the sum of the dissimilarities between each object
and its closest selected object.
The algorithm has two phases:
(i) In the first phase, BUILD, a collection of k objects are selected for an initial set S.
(ii) In the second phase, SWAP, one tries to improve the quality of the clustering by
exchanging selected objects with unselected objects.
4) CLARANS clustering algorithm vis-a-vis PAM and CLARA:
‘The mediods searching in PAM or CLARA is abstracted as searching k subgraphs from n
Points graph, and based on this understanding, a PAM-like clustering algorithm called
CLARANS (Clustering Large Applications based upon RANdomized Search) is defined.
While PAM searches the whole graph and CLARA searches some random sub-graphs,
CLARANS randomly samples a set and selects k medoids in climbing sub-graph
Mountains, CLARANS selects the neighboring objects of medoids as candidates of new
medoids. It samples subsets to verify medoids in multiple times to avoid bad samples,
DWM-33POPULAR PUBLICATIONS
Obviously, multiple time sampling of medoids verification is time consuming which
limits CLARANS from clustering very large datasets in an acceptable time period,
e) K-Medoid Algorithm:
Partitioning Around Medoids, or the K-medoids algorithm, is a partitional clustering
algorithm which is slightly modified from the K-means algorithm. The basic idea of this
algorithm is to first compute the K representative objects, which are called medoids.
After finding the set of medoids, each object of the data set is assigned to the nearest
medoid. That is, object i is put into cluster vi when medoid mvi is nearer than any other
medoid mw.
The algorithm proceeds in two steps:
BUILD-step: This step sequentially selects k "centrally located" objects, to be
used as initial medoids.
SWAP-step: If the objective function can be reduced by interchanging
(swapping) a selected object with an unselected object, then the swap is carried
out. This is continued till the objective function can no longer be decreased.
The algorithm is as follows:
1. Initially select k random points as the medoids from the given n data points of the
   data set.
2. Associate each data point to the closest medoid by using any of the most common
   distance metrics.
3. For each pair of non-selected object h and selected object i, calculate the total
   swapping cost TCih; if TCih < 0, i is replaced by h.
4. Repeat steps 2-3 until there is no change in the medoids.
MINING TIME SERIES DATA
1. To optimize data warehouse design, which one is done? [WBUT 2011]
a) Normalization of fact tables and denormalization of dimension tables
b) Normalization of fact tables and dimension tables
c) Denormalization of fact tables and dimension tables
d) Normalization of dimension tables and denormalization of fact tables
Answer: (d)
2. In ______, the value of an attribute is examined as it varies over time. [MODEL QUESTION]
a) Time-series analysis   b) Classification   c) …   d) Prediction
Answer: (a)
Long Answer Type Questions
1. a) Define decision tree. [WBUT 2009, 2010]
b) What are the advantages and disadvantages of the decision tree approach over
other approaches for data mining? [WBUT 2003]
c) Discuss briefly the tree construction principle. [WBUT 2009]
OR,
Describe the basic algorithm for decision tree induction. [WBUT 2013]
d) What is a classification problem? What is the difference between supervised
and unsupervised classification? [WBUT 2009]
Answer:
a) Decision trees are powerful and popular tools for classification and prediction. The
attractiveness of decision trees is due to the fact that, in contrast to neural networks,
decision trees represent rules.
A decision tree is a classifier in the form of a tree structure (see Figure 1), where each node
is either:
• a leaf node - indicates the value of the target attribute (class) of examples, or
• a decision node - specifies some test to be carried out on a single attribute-value,
  with one branch and sub-tree for each possible outcome of the test.
A decision tree can be used to classify an example by starting at the root of the tree and
moving through it until a leaf node is reached, which provides the classification of the instance.
Information gain is simply the expected reduction in entropy caused by partitioning the
examples according to a given attribute. More precisely, the information gain Gain(S, A) of
an attribute A, relative to a collection of examples S, is defined as

    Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|Sv| / |S|) · Entropy(Sv)

where Values(A) is the set of all possible values for attribute A, and Sv is the subset of S
for which attribute A has value v (i.e., Sv = {s ∈ S | A(s) = v}). Note that the first term in the
equation for Gain is just the entropy of the original collection S, and the second term is
the expected value of the entropy after S is partitioned using attribute A. The expected
entropy described by this second term is simply the sum of the entropies of each subset
Sv, weighted by the fraction of examples |Sv|/|S| that belong to Sv. Gain(S, A) is therefore
the expected reduction in entropy caused by knowing the value of attribute A. Put another
way, Gain(S, A) is the information provided about the target attribute value, given the
value of some other attribute A. The value of Gain(S, A) is the number of bits saved when
encoding the target value of an arbitrary member of S, by knowing the value of attribute
A.
A decision tree can be constructed top-down using the information gain in the following
way (a small sketch of the entropy and gain computations follows the list):
1. begin at the root node
2. determine the attribute with the highest information gain which is not already
   used as an ancestor node
3. add a child node for each possible value of that attribute
4. attach all examples to the child node where the attribute values of the examples
   are identical to the attribute value attached to the node
5. if all examples attached to the child node can be classified uniquely, add that
   classification to that node and mark it as a leaf node
6. go back to step two if there are unused attributes left, otherwise add the
   classification of the majority of the examples attached to the child node
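As mentioned above, here is a small sketch of the two quantities driving step 2 (the
record layout, a list of attribute-to-value dicts, is an assumption of this illustration):

from collections import Counter
from math import log2

def entropy(rows, target):
    counts = Counter(r[target] for r in rows)
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def gain(rows, attr, target):
    # Gain(S, A) = Entropy(S) - sum over v of |Sv|/|S| * Entropy(Sv)
    n = len(rows)
    remainder = 0.0
    for v in {r[attr] for r in rows}:
        sv = [r for r in rows if r[attr] == v]
        remainder += len(sv) / n * entropy(sv, target)
    return entropy(rows, target) - remainder

Calling gain(rows, "Outlook", "Play") on the usual play-tennis records, for instance,
returns the gain used to pick the root attribute.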
b) Advantages of Decision Tree models:
• Decision tree methods tend to produce models that are easy to interpret. At each non-
  terminal node, a decision is based upon just one predictor variable and this makes it
  easy to follow. For example, to explain a particular classification one need only look at
  the series of simple decisions that led to it. The final tree model can in fact be cast
  into a set of rules one can follow to classify a given case. In comparison, generalized
  linear models use linear combinations of variables that can be difficult to interpret or
  explain.
• Tree models make no assumptions about the distribution of the underlying data and
  they are thus a non-parametric procedure. This can be especially useful if the
  distribution of the data is indeed unknown, something that happens a lot in practice.
• Decision tree methods are easily able to handle both categorical and continuous
  variables.
• Decision tree methods have a built-in feature selection method that makes them
  immune to the presence of useless variables. Such variables are ignored and they do
  not affect the tree building process. This is a common problem with over-parametrized
  datasets.
• Tree models are very adept at revealing complex interactions between variables.
  Each branch of a tree can contain different combinations of variables and the same
  variable can appear more than once in different parts of the tree. This can reveal how
  a variable can depend on another and in what specific context this dependency exists.
• Decision tree models are extremely robust to the effect of outliers. This is so because
  the models are constructed in a frequency-based technique where one is counting the
  instances in a split. Outliers that occur in the independent variables do not affect the
  tree growing process because the values used to split each node are not likely to be
  on the outlier values. Outliers in the dependent variable go into their own nodes and
  do not affect the rest of the tree.
• Tree models offer several ways of dealing with missing values that can often
  minimize or eliminate the effect of such values on model performance. In many other
  methods the whole set of instances with missing values would be omitted from
  analysis. This advantage is present both when training the tree model and when
  applying the model on new data for class prediction.
Disadvantages of Decision Tree models:
Decision trees are indeed powerful data mining tools, but they are not perfect and are
suited only to certain types of problems. The main weaknesses in decision tree modeling are
listed below:
• Classification trees can be unstable, and small variations in the data (such as those
  made by randomization) can cause very different looking trees to be generated. This
  is especially so in cases where the goodness of splits for the different variables are
  close to each other in value. In this case, a small variation in the data is enough to
  influence the picking of one split over another.
  o Many of these problems are dealt with by using stratified sampling when creating
    testing and training datasets. In this case, the training and testing datasets have
    the same class distribution.
  o Such problems are also a symptom of a dataset that is too small. In such cases, it
    is important to use the entire dataset to train and test the model. This is the idea
    behind the cross-validation technique. Since the final model is built on the whole
    dataset, randomization of this same data will not cause too much instability.
• Some of the tree models generated may be very large and complex. This can make the
  model difficult to interpret, understand or justify.
• Tree models really shine when applied to classification problems; unfortunately, they
  are not very good at estimation tasks (as in regression), i.e. those having a numerical
  dependent variable.
• Decision tree models are computationally expensive to train.
c) Most algorithms that have been developed for learning decision trees are variations on
a core algorithm that employs a top-down, greedy search through the space of possible
decision trees. Decision tree programs construct a decision tree T from a set of training
cases.
Constructing a Decision Tree using ID3
function ID3
  (R: a set of non-target attributes,
   C: the target attribute,
   S: a training set) returns a decision tree;
begin .
  If S is empty, return a single node with
    value Failure;
  If S consists of records all with the same
    value for the target attribute,
    return a single leaf node with that value;
  If R is empty, then return a single node
    with the value of the most frequent of the
    values of the target attribute that are
    found in records of S; [in that case
    there may be errors, i.e. examples
    that will be improperly classified];
  Let A be the attribute with largest
    Gain(A,S) among attributes in R;
  Let {aj | j = 1, 2, .., m} be the values of
    attribute A;
  Let {Sj | j = 1, 2, .., m} be the subsets of
    S consisting respectively of records
    with value aj for A;
  Return a tree with root labeled A and arcs
    labeled a1, a2, .., am going respectively
    to the trees ID3(R-{A}, C, S1), ID3(R-{A}, C, S2),
    ..., ID3(R-{A}, C, Sm);
  Recursively apply ID3 to subsets {Sj | j = 1, 2, .., m}
    until they are empty
end
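A compact runnable rendering of the same recursion (a sketch that reuses the
entropy/gain helpers from the earlier illustration, represents training records as dicts,
and encodes the tree as nested dicts, all of which are assumptions):

from collections import Counter

def id3(rows, attrs, target):
    if not rows:
        return "Failure"
    classes = [r[target] for r in rows]
    if len(set(classes)) == 1:              # all records share the target value
        return classes[0]
    if not attrs:                           # no attributes left: majority class
        return Counter(classes).most_common(1)[0][0]
    a = max(attrs, key=lambda x: gain(rows, x, target))
    tree = {a: {}}
    for v in {r[a] for r in rows}:          # one subtree per value of A
        subset = [r for r in rows if r[a] == v]
        tree[a][v] = id3(subset, [x for x in attrs if x != a], target)
    return tree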
d) Classification is the process of dividing a dataset into mutually exclusive groups such
that the members of each group are as "close" as possible to one another, and different
groups are as "far" as possible from one another, where distance is measured with respect
to the specific variable(s) one is trying to predict. For example, a typical classification
problem is to divide a database of companies into groups that are as homogeneous as
possible with respect to a creditworthiness variable with values "Good" and "Bad".
With supervised classification, we identify examples of the information classes (i.e. land
cover type) of interest in an image (say). These are called "training sites". The image
processing software system is then used to develop a statistical characterization of the
reflectance for each information class. This stage is often called "signature analysis" and
may involve developing a characterization as simple as the mean or the range of
reflectance on each band, or as complex as detailed analyses of the mean, variances and
covariance over all bands. Once a statistical characterization has been achieved for each
information class, the image is then classified by examining the reflectance for each pixel
and making a decision about which of the signatures it resembles most.
Unsupervised classification is a method which examines a large number of unknown
pixels and divides them into a number of classes based on natural groupings present in the
image values. Unlike supervised classification, unsupervised classification does not
require analyst-specified training data. The basic premise is that values within a given
cover type should be close together in the measurement space (i.e. have similar gray
levels), whereas data in different classes should be comparatively well separated (i.e.
have very different gray levels).
2. a) What is the use of Regression? What may be the reasons for not using the
linear regression model to estimate the output data? [WBUT 2016]
Answer:
Regression analysis is widely used for prediction and forecasting, where its use has
substantial overlap with the field of machine learning. Regression analysis is also used to
understand which among the independent variables are related to the dependent variable,
and to explore the forms of these relationships.
Linear Regression Disadvantages:
• By its nature, linear regression only looks at linear relationships between
  dependent and independent variables. That is, it assumes there is a straight-line
  relationship between them. Sometimes this is incorrect.
• Linear regression looks only at the relationship between the mean of the dependent
  variable and the independent variables.
• Linear regression is sensitive to outliers.
• Linear regression assumes that the data are independent. This is often, but not
  always, sensible. Two common cases where it does not make sense are clustering
  in space and time.
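The outlier point is easy to demonstrate with a quick least-squares fit on synthetic data
(an illustration, not part of the answer):

import numpy as np

x = np.arange(10, dtype=float)
y = 2 * x + 1
slope, _ = np.polyfit(x, y, deg=1)        # clean data: slope is 2.0

y_out = y.copy()
y_out[-1] += 50                           # a single outlier
slope_out, _ = np.polyfit(x, y_out, deg=1)
print(slope, slope_out)                   # the outlier visibly inflates the slope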
b) How is time series data used in pattern analysis? Give the formula for Pearson's
r. [WBUT 2016]
Answer:
Time series data have a temporal order that makes their analysis distinctly different from
other data analysis. The goal of time series analysis can be divided into characterization
and prediction. There is typically a consistent pattern contaminated with random noise,
which requires filtering to aid in identifying the underlying pattern. The pattern itself can
be divided into the main trend and a seasonal component.
Pearson's correlation coefficient, when applied to a population, is

    ρ(X, Y) = cov(X, Y) / (σX · σY)

where:
• cov(X, Y) is the covariance of X and Y,
• σX is the standard deviation of X,
• σY is the standard deviation of Y.
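The sample version of the coefficient is a one-liner in NumPy (a sketch on made-up
data; ddof=1 is used so that the covariance and the standard deviations share the same
normalization):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# r = cov(X, Y) / (sigma_X * sigma_Y)
r = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
print(r)                  # agrees with np.corrcoef(x, y)[0, 1]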
c) Explain Bayesian classification. [WBUT 2016]
Answer:
Bayesian classification technique is based on Bayes' Theorem with an assumption of
independence among predictors. In simple terms, a Naive Bayes classifier assumes that
the presence of a particular feature in a class is unrelated to the presence of any other
feature. For example, a fruit may be considered to be an apple if it is red, round, and
about 3 inches in diameter. Even if these features depend on each other or upon the
existence of the other features, all of these properties independently contribute to the
probability that this fruit is an apple, and that is why it is known as 'Naive'.
The Naive Bayes model is easy to build and particularly useful for very large data sets. Along
with simplicity, Naive Bayes is known to outperform even highly sophisticated
classification methods.
Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x)
and P(x|c):

    P(c|x) = P(x|c) · P(c) / P(x)

Here P(c|x) is the posterior probability, P(x|c) the likelihood, P(c) the class prior
probability and P(x) the predictor prior probability. For predictors x1, ..., xn assumed
independent,

    P(c|X) = P(x1|c) × P(x2|c) × ... × P(xn|c) × P(c)
Above,
• P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
• P(c) is the prior probability of class.
• P(x|c) is the likelihood, which is the probability of predictor given class.
• P(x) is the prior probability of predictor.
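A minimal categorical Naive Bayes sketch (the record layout and the add-one smoothing
are assumptions of this illustration, not part of the answer):

from collections import Counter, defaultdict

def train(rows, target):
    # rows: dicts such as {"color": "red", "shape": "round", "fruit": "apple"}
    prior = Counter(r[target] for r in rows)
    likelihood = defaultdict(Counter)       # (class, attribute) -> value counts
    for r in rows:
        for a, v in r.items():
            if a != target:
                likelihood[(r[target], a)][v] += 1
    return prior, likelihood, len(rows)

def score(prior, likelihood, n, x, c):
    # Proportional to P(c) * product of P(xi|c); P(x) is a shared constant and
    # can be ignored when comparing classes. Add-one smoothing here divides by
    # the number of values seen for the attribute, a rough simplification.
    p = prior[c] / n
    for a, v in x.items():
        seen = likelihood[(c, a)]
        p *= (seen[v] + 1) / (prior[c] + len(seen))
    return p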
3. a) What are the uses of training data set and test data set for a decision tree
classification scheme? [WBUT 2010, 2012, 2018]
b) Discuss the principle of the FP-tree Growth algorithm. [WBUT 2010, 2013]
OR,
Discuss the different phases of the FP-tree growth algorithm. [WBUT 2012]
OR,
Define FP-tree. Discuss the method of computing the FP-tree. [WBUT 2018]
Answer:
a) Decision trees need two kinds of data: Training and Testing. .
Training data, which are usually the bigger part of data, are used for constructing trees.
The more training data collected, the higher the accuracy of the results. The other group
of data, testing, is used to get the accuracy rate and misclassification rate of the decision
tree.
b) The FP-Growth algorithm allows frequent itemset discovery without candidate itemset
generation. It is a two-step approach:
Step 1: Build a compact data structure called the FP-tree, using two passes over the
data set.
Step 2: Extract frequent itemsets directly from the FP-tree by traversing it.
The frequent-pattern tree (FP-tree) is a compact structure that stores quantitative
information about frequent patterns in a database. It has:
1) One root labeled as "null", with a set of item-prefix subtrees as children, and a
frequent-item-header table (presented at the right side of Figure 1);
2) Each node in the item-prefix subtree consists of three fields:
a) Item-name: registers which item is represented by the node;
b) Count: the number of transactions represented by the portion of the path
reaching the node;
c) Node-link: links to the next node in the FP-tree carrying the same item-name, or
null if there is none.
3) Each entry in the frequent-item-header table consists of two fields:
a) Item-name: the same as in the node;
b) Head of node-link: a pointer to the first node in the FP-tree carrying the item-
name.
Additionally, the frequent-item-header table can hold the support count for an item.
Algorithm 1: FP-tree construction
Input: A transaction database DB and a minimum support threshold.
Output: FP-tree, the frequent-pattern tree of DB.
Method: The FP-tree is constructed as follows.
1. Scan the transaction database DB once. Collect F, the set of frequent items, and the
support of each frequent item. Sort F in support-descending order as FList, the list of
frequent items.
2. Create the root of an FP-tree, T, and label it as "null". For each transaction Trans in
DB do the following:
• Select the frequent items in Trans and sort them according to the order of FList.
  Let the sorted frequent-item list in Trans be [p | P], where p is the first element
  and P is the remaining list. Call insert_tree([p | P], T).
• The function insert_tree([p | P], T) is performed as follows. If T has a child N
  such that N.item-name = p.item-name, then increment N's count by 1; else
  create a new node N, with its count initialized to 1, its parent link linked to T,
  and its node-link linked to the nodes with the same item-name via the node-link
  structure. If P is nonempty, call insert_tree(P, N) recursively.
By using this algorithm, the FP-tree is constructed in two scans of the database. The first
scan collects and sorts the set of frequent items; the second constructs the FP-tree.
After constructing the FP-tree it is possible to mine it to find the complete set of frequent
patterns.
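A bare-bones rendering of Algorithm 1 in Python (a sketch; the node and header-table
representations are assumptions, and node-links are kept as simple lists):

from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}

def build_fp_tree(db, min_support):
    # Pass 1: collect frequent items, in support-descending order (FList).
    freq = Counter(i for t in db for i in t)
    flist = [i for i, c in freq.most_common() if c >= min_support]
    root = Node(None, None)
    header = {i: [] for i in flist}          # item-name -> nodes carrying it
    # Pass 2: insert each filtered, FList-ordered transaction into the tree.
    for t in db:
        node = root
        for item in [i for i in flist if i in t]:
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
    return root, header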
Algorithm 2: FP-Growth
Input: A database DB, represented by an FP-tree constructed according to Algorithm 1,
and a minimum support threshold.
Output: The complete set of frequent patterns.