DATA WAREHOUSING AND
DATA MINING
Introduction to Data Warehousing
Classification and Prediction of Data Warehousing
Mining Time Series Data
Mining Data Streams
Web Mining
Recent Trends in Distributed Warehousing
NOTE:
MAKAUT course structure and syllabus of the 6th semester has been changed from 2021. Previously DATA WAREHOUSING AND DATA MINING was in the 7th semester. This subject has been redesigned and shifted to the 6th semester as per the present curriculum. Subject organization has been changed slightly. Taking special care of this matter, we are providing the relevant MAKAUT university solutions and some model questions & answers for newly introduced topics, so that students can get an idea about university question patterns.

INTRODUCTION TO DATA WAREHOUSING

Multiple Choice Type Questions
1. A data warehouse is an integrated collection of data because
a) It is a collection of data of different types
b) It is a collection of data derived from multiple sources
c) It is a relational database
d) It contains summarized data
Answer: (b)
2. A data warehouse is said to contain a 'subject-oriented' collection of data
because
a) Its contents have a common theme
b) It is built for a specific application
c) It cannot support multiple subjects
d) It is a generalization of 'object-oriented'
Answer: (a)
3. A data warehouse is said to contain a 'time-varying' collection of data
because [WBUT 2010, 2013, 2015]
a) Its contents vary automatically with time
b) Its life-span is very limited
c) Every key structure of the data warehouse contains, either implicitly or explicitly, an element of time
d) Its contents have an explicit time-stamp
Answer: (c)
4. Data Warehousing is used for [WBUT 2010, 2012]
a) decision support systems b) OLAP applications
c) ... applications d) data manipulation applications
Answer: (a)
5. Which of the following is TRUE? [WBUT 2010, 2012]
a) Data warehouse can be used for analytical processing only
b) Data warehouse can be used for information processing (query, report) only
c) Data warehouse can be used for data mining only
d) Data warehouse can be used for information processing (query, reporting), analytical processing and data mining
Answer: (d)
6. Data warehouse architecture is just an overall guideline. It is not a blueprint of the data warehouse.
a) True b) False
Answer: (b)
7. The most distinguishing characteristic of DSS data is [WBUT 2011]
a) Granularity b) Time span c) Dimensionality d) Data currency
Answer: (c)
8. A data warehouse is built as a separate repository of data, different from the operational data of an enterprise, because [WBUT 2012, 2013]
a) It is necessary to keep the operational data free of any warehouse operations
b) A data warehouse cannot afford to allow corrupted data within it
c) A data warehouse contains summarized data whereas the operational database contains transactional data
d) None of these
Answer: (c)
9. Dimension data within a warehouse exhibits which one of the following properties? [WBUT 2012, 2015]
a) Dimension data consists of the minor part of the warehouse
b) The aggregated information is actually dimension data
c) It contains historical data
d) Dimension data is the information that is used to analyze the elemental transaction
Answer: (b)
10. A data warehouse is said to contain a 'time-varying' collection of data because [WBUT 2014, 2015, 2016, 2017]
a) its content has an explicit time-stamp
b) its life-span is very limited
c) it contains historical data
d) its contents vary automatically with time
Answer: (c)
11. The important aspect of the data warehouse environment is that data found within the data warehouse is [WBUT 2016, 2018]
a) subject-oriented b) time-variant
c) integrated d) all of these
Answer: (a)
12. Data warehousing is used for [WBUT 2016]
a) decision support systems b) OLAP applications
c) ... applications d) data manipulation applications
Answer: (a)
13. _____ is a subject-oriented, integrated, time-variant, non-volatile collection of data.
a) Data Mining b) Data Warehousing
c) ... d) Text Mining
Answer: (b)
Short Answer Type Questions

1. Define Data Marts. Define the types of Data Marts. [WBUT 2009, 2010, 2011, 2015]
Answer:
1st Part:
A data mart is a group of subjects that are organized in a way that allows them to assist departments in making specific decisions. For example, the advertising department will have its own data mart, while the finance department will have a data mart that is separate from it. In addition to this, each department will have ownership of the software, hardware, and other components that make up its data mart.
2nd Part:
There are two types of Data Marts:
• Independent data marts – sourced from data captured from OLTP systems, external providers, or from data generated locally within a particular department or geographic area.
• Dependent data marts – sourced directly from enterprise data warehouses.
2. Define data mining. What are the advantages of data mining over traditional approaches? [WBUT 2009]
Answer:
1st Part:
Data mining, which is also known as knowledge discovery, is one of the most popular topics in information technology. It concerns the process of automatically extracting useful information and has the promise of discovering hidden relationships that exist in large databases. These relationships represent valuable knowledge that is crucial for many applications. Data mining is not confined to the analysis of data stored in data warehouses. It may analyze data existing at more detailed granularities than the summarized data provided in a data warehouse. It may also analyze transactional, textual, spatial, and multimedia data which are difficult to model with current multidimensional database technology.
2nd Part:
With the help of data mining, organizations are in a better position to predict future business trends: the possible amount of revenue that could be generated, the orders that could be expected, and the type of customers that could be approached. Traditional approaches will not be able to generate such accurate results as they rely on simpler algorithms. One major advantage of data mining over a traditional statistical approach is its ability to deal directly with heterogeneous data fields. The advantages of data mining also help businesses attract and retain customers and help in a number of other areas such as data analysis.
3. What is the importance of Association Rule in Data Mining? [WBUT 2003]
Explain support, confidence, frequent item set and give a formal definition of association rule. [WBUT 2013]
What is an Association Rule? Define Support, Confidence, Item set and Frequent item set in Association Rule Mining. [WBUT 2017]
Answer:
To illustrate the concepts, we use a small example from the supermarket domain. The set of items is I = {milk, bread, butter, beer} and a small database containing the items is shown in the table below.

Transaction ID | Items
1 | Milk, bread
2 | Bread, butter
3 | Beer
4 | Milk, bread, butter
5 | Bread, butter

An example supermarket database with five transactions.
An example rule for the supermarket could be {milk, bread} ⇒ {butter}, meaning that if milk and bread are bought, customers also buy butter. To select interesting rules from the set of all possible rules, constraints on various measures of significance and interest can be used. The best-known constraints are minimum thresholds on support and confidence.
The support supp(X) of an itemset X is defined as the proportion of transactions in the data set which contain the itemset. In the example database above, the itemset {milk, bread} has a support of 2/5 = 0.4 since it occurs in 40% of all transactions (2 out of 5 transactions).
The confidence of a rule is defined as conf(X ⇒ Y) = supp(X ∪ Y) / supp(X). For example, the rule {milk, bread} ⇒ {butter} has a confidence of 0.2/0.4 = 0.5 in the database above, which means that for 50% of the transactions containing milk and bread the rule is correct. Confidence can be interpreted as an estimate of the probability P(Y | X), the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS.
In many (but not all) situations, we only care about association rules or causalities involving sets of items that appear frequently in baskets. For example, we cannot run a good marketing strategy involving items that no one buys anyway. Thus, much data mining starts with the assumption that we only care about sets of items with high support, i.e. sets that appear together in many baskets. We then find association rules or causalities involving only high-support sets of items, which must appear in at least a certain percentage of the baskets, called the support threshold. We use the term frequent itemset for "a set S that appears in at least a fraction s of the baskets," where s is some chosen constant, typically 0.01 or 1%.
Association rules are statements of the form {X1, X2, ..., Xn} ⇒ Y, meaning that if we find all of X1, X2, ..., Xn in the market basket, then we have a good chance of finding Y. The probability of finding Y for us to accept this rule is called the confidence of the rule. We would normally search only for rules that had confidence above a certain threshold.
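To make the definitions above concrete, here is a small illustrative Python sketch (not part of the original answer) that recomputes the support and confidence figures for the five-transaction example; the function names are our own.

```python
# Computes support and confidence for the supermarket example above.
from itertools import combinations

transactions = [
    {"milk", "bread"},
    {"bread", "butter"},
    {"beer"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
]

def support(itemset):
    """supp(X): proportion of transactions that contain every item of X."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """conf(X => Y) = supp(X u Y) / supp(X)."""
    return support(lhs | rhs) / support(lhs)

print(support({"milk", "bread"}))                 # 0.4
print(confidence({"milk", "bread"}, {"butter"}))  # 0.5

# Frequent itemsets for a support threshold s = 0.4 (40% of the baskets).
items = set().union(*transactions)
frequent = [set(c) for r in range(1, len(items) + 1)
            for c in combinations(sorted(items), r) if support(set(c)) >= 0.4]
print(frequent)   # e.g. {bread}, {butter}, {milk}, {bread, milk}, {bread, butter}
```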
4. State the differences between Data Mart & Data Warehouse. [WBUT 2016, 2015]
Answer:
A data warehouse has a structure which is separate from a data mart, even though they may look similar. Because of this, it is difficult to coordinate the data across multiple departments. Each department will have its own view of how a data mart should look. The data mart that they use will be specific to them. In contrast, data warehouses are designed around the organization as a whole. Instead of being owned by one department, a data warehouse will generally be owned by the entire company. While the contents of data warehouses are granular, the information contained in data marts is not very granular at all.
Another thing that separates data warehouses from data marts is that data warehouses contain larger amounts of information. The information that is held in data marts is often summarized. Data warehouses will not contain information that is limited to the needs of a single department. Instead, they will hold the information that is analyzed and processed by the organization as a whole. Much of the information that is held in data warehouses is historical in nature, and they are designed to process this information.
5. When is a Data Mart appropriate?
Answer:
Data marts are created for:
a) Speeding up queries by reducing the volume of data to be scanned
b) Structuring data in a form suitable for user access tools
c) Partitioning data to impose access control strategies
d) Segmenting data onto different hardware platforms.
Data marts should not be used in other cases, as the operational costs of data marting can be high and, once a strategy is in place, it can be difficult to change it without incurring substantial cost.
6. What is sequence mining?
Answer:
Sequence mining is a type of structured data mining in which the database administrator looks for sequences of trends in the data. This form of data mining is split into two fields: itemset sequence mining is typically used in marketing, and string sequence mining is used in biology research. Sequence mining is different from regular data mining because the data are more specific, which makes building an effective model difficult for database designers, and it can sometimes go awry if the data are different from the common sequence. A small sketch of the basic counting step is given below.
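The following illustrative Python sketch (not from the book) shows the core counting operation behind itemset sequence mining: checking how many hypothetical purchase sequences support a candidate subsequence.

```python
# A tiny sketch of the basic operation in sequence mining: counting how many
# customer sequences support a candidate subsequence (illustrative only).
def supports(sequence, candidate):
    """True if the items of `candidate` appear in `sequence` in the same order."""
    it = iter(sequence)
    return all(item in it for item in candidate)

# Hypothetical purchase sequences, one per customer.
sequences = [
    ["bread", "milk", "butter", "jam"],
    ["milk", "bread", "butter"],
    ["bread", "butter", "jam"],
]
candidate = ["bread", "butter"]
support = sum(supports(s, candidate) for s in sequences) / len(sequences)
print(support)   # 1.0 -> the pattern <bread, butter> is supported by every sequence
```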
7. Differentiate among Enterprise Warehouse, Data Mart and Virtual Warehouse.
Answer:
Enterprise warehouse – collects all of the information about subjects spanning the entire organization.
Data Mart – a subset of corporate-wide data that is of value to specific groups of users. Its scope is confined to specific, selected groups, such as a marketing data mart.
Virtual warehouse – a set of views over operational databases. Only some of the possible summary views may be materialized.
8. How is a data warehouse different from a database? [WBUT 2013, 2016, 2018]
Answer:
A database is used to store data, while a data warehouse is mostly used to facilitate reporting and analysis. Basically, a database is just where the data is stored; in order to access the data or analyze it, a database management system is required. However, a data warehouse does not necessarily require a DBMS. The purpose of a data warehouse is easy access to the data for a user. The data warehouse may also be used to analyze the data; however, the actual process of analysis is called data mining.
Some differences between a database and a data warehouse:
• A database is used for Online Transactional Processing (OLTP) but can be used for other purposes such as Data Warehousing.
• A data warehouse is used for Online Analytical Processing (OLAP). This reads the historical data for the users for business decisions.
• In a database the tables and joins are complex since they are normalized for an RDBMS. This reduces redundant data and saves storage space.
• In a data warehouse, the tables and joins are simple since they are de-normalized. This is done to reduce the response time for analytical queries.
• Relational modeling techniques are used for RDBMS database design, whereas dimensional modeling techniques are used for data warehouse design.
• A database is optimized for write operations, while a data warehouse is optimized for read operations.
• In a database, the performance is low for analysis queries, while in a data warehouse, there is high performance for analytical queries.
• A data warehouse is a step ahead of a database. It includes a database in its structure.
9. What are the issues relating to the diversity of database types? [WBUT 2016]
Answer:
• Handling of relational and complex types of data – The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.
• Mining information from heterogeneous databases and global information systems – The data is available at different data sources which may be structured, semi-structured or unstructured. Therefore mining the knowledge from them adds challenges to data mining.
10. Introduce the idea of Data Mining with an example. Discuss the steps involved in Knowledge Discovery in Databases (KDD).
Answer:
Data Mining: Refer to Question No. 2 (1st Part) of Short Answer Type Questions.
Data mining example – Data Mining and Business Intelligence:
Mobile phone and utilities companies use Data Mining to predict 'churn', the term they use for when a customer leaves their company to get their phone/gas/broadband from another provider. They collate billing information, customer services interactions, website visits and other metrics to give each customer a probability score, then target offers and incentives to customers whom they perceive to be at risk of churning.
Steps involved in the KDD process (a small illustrative sketch follows the list):
1. Identify the goal of the KDD process from the customer's perspective.
2. Understand the application domains involved and the knowledge that is required.
3. Select a target data set or subset of data samples on which discovery is to be performed.
4. Cleanse and pre-process data by deciding strategies to handle missing fields and alter the data as per the requirements.
5. Simplify the data sets by removing unwanted variables. Then, analyze useful features that can be used to represent the data, depending on the goal or task.
6. Match KDD goals with data mining methods to suggest hidden patterns.
7. Choose data mining algorithms to discover hidden patterns. This process includes deciding which models and parameters might be appropriate for the overall KDD process.
8. Search for patterns of interest in a particular representational form, which includes classification rules or trees, regression and clustering.
9. Interpret essential knowledge from the mined patterns.
10. Use the knowledge and incorporate it into another system for further action.
11. Document it and make reports for interested parties.
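For illustration only, the following Python sketch (not from the book) walks a tiny, hypothetical customer data set through several of the KDD steps above — selection, cleaning, choosing a mining method, mining, and interpretation — using pandas and scikit-learn. The column names and parameters are assumptions, not part of the original text.

```python
# An illustrative mini-pipeline following the KDD steps on a toy data set.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Steps 1-3. Goal + target data set: hypothetical customer records.
raw = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "monthly_spend": [120.0, None, 80.0, 300.0, 310.0, 95.0],
    "visits": [4, 6, 3, 12, 11, 2],
})

# Steps 4-5. Cleaning and simplification: handle missing fields, drop identifiers.
data = raw.drop(columns=["customer_id"])
data = data.fillna(data.mean())

# Steps 6-7. Choose a mining method and algorithm: clustering with k-means.
X = StandardScaler().fit_transform(data)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Steps 8-10. Search for and interpret patterns: summarize each discovered group.
raw["segment"] = labels
print(raw.groupby("segment")[["monthly_spend", "visits"]].mean())
```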
11. Does a data warehouse involve a transaction? Explain.
Answer:
A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates the analysis workload from the transaction workload and enables an organization to consolidate data from several sources.

Long Answer Type Questions
1. Define Data Warehouse and briefly describe its characteristics.
Answer:
A data warehouse consists of a computer database responsible for the collection and storage of information for a specific organization. This collection of information is used to manage information efficiently and analyze the collected data. Although data warehouses vary in overall design, the majority of them are subject oriented, meaning that the stored information is connected to objects or events that occur in reality. The data provided by the data warehouse for analysis provides information on a specific subject, rather than the functions of the company, and is collected from varying sources into one time-variant unit.
A data warehouse has significantly different features from other enterprise-wide systems, particularly in how data is stored, managed and manipulated.
There are four key characteristics which separate the data warehouse from other major operational systems:
1. Subject Orientation: Data organized by subject
2. Integration: Consistency of defining parameters
3. Non-volatility: Stable data storage medium
4. Time-variance: Timeliness of data and access terms
2. a) What are the different tiers in a typical 3-tier data warehousing architecture?
b) What do you mean by warehousing schema? Explain. [WBUT 2009]
Answer:
a) The bottom tier is a warehouse database server, which is almost always a relational database system.
The middle tier is an OLAP server which is typically implemented using either (1) a Relational OLAP (ROLAP) model, i.e., an extended relational DBMS that maps operations on multidimensional data to standard relational operations; or (2) a Multidimensional OLAP (MOLAP) model, i.e., a special-purpose server that directly implements multidimensional data and operations.
The top tier is a client, which contains query and reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
b) The entity-relationship data model is commonly used in the design of relational databases, where a database schema consists of a set of entities or objects, and the relationships between them. Such a data model is appropriate for online transaction processing. Data warehouses, however, require a concise, subject-oriented schema which facilitates on-line data analysis.
3. a) Introduce the concept of data mining and cite two application areas. [WBUT 2010, 2012]
b) What are the different steps of a data mining task? [WBUT 2010, 2012]
c) Suppose that the data mining task is to cluster the following eight points (with (x, y) representing location) into 3 clusters:
A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9)
The distance function is Euclidean distance. Initially we assign A1, B2, and C1 as the centre of each cluster. Use the k-means algorithm to determine the three clusters. [WBUT 2010, 2013, 2014]
Answer:
a) 1st Part: Refer to Question No. 2 (1st Part) of Short Answer Type Questions.
2nd Part:
One application of data mining is customer attrition (churn) analysis: a company will be able to find behaviours that are common among customers who leave, which allows the company to estimate which customers are likely to stop using its products or services and go to one of its competitors. Another common application area is market basket analysis in retail.
b) The following steps are usually followed in data mining. These steps are iterative, with the process moving backward whenever needed:
1. Develop an understanding of the application, of the relevant prior knowledge, and of the end user's goals.
2. Create a target data set to be used for discovery.
3. Clean and preprocess data (including handling missing data fields, noise in the data, accounting for time series, and known changes).
4. Reduce the number of variables and find invariant representations of data if possible.
5. Choose the data-mining task (classification, regression, clustering, etc.).
6. Choose the data-mining algorithm.
7. Search for patterns of interest (this is the actual data mining).
8. Interpret the patterns mined. If necessary, iterate through any of steps 1 through 7.
9. Consolidate knowledge discovered and prepare a report.
c) The given eight points were:
A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9)
The points A1, B2, and C1 were assumed to be the initial three cluster centres.
There are two possible ways of clustering the given points using the k-means algorithm, depending on to which cluster we assign the point B1 (it is equidistant from the initial centres A1 and B2).
The first way:
1.1. Round One:
- Compute for each of the eight points its distance from each cluster centre, and assign it to the cluster to which it is most similar.
=> The clustering after the first round execution is:
Cluster 1: A1, B1, C2
Cluster 2: B2, A3, B3
Cluster 3: C1, A2
- Update the cluster centres:
The new centre of Cluster 1: (3.67, 9)
The new centre of Cluster 2: (7, 4.33)
The new centre of Cluster 3: (1.5, 3.5)
1.2. Round Two:
- No changes.
The second way:
2.1. Round One:
- Compute for each of the eight points its distance from each cluster centre, and assign it
to the cluster to which it is most similar.
=> The clustering after the first round execution is:
Cluster 1: A1, C2
Cluster 2: B2, A3, B1, B3
Cluster 3: C1, A2
- Update the cluster centres:
The new centre of Cluster 1: (3, 9.5)
The new centre of Cluster 2: (6.5, 5.25)
The new centre of Cluster 3: (1.5, 3.5)
2.2. Round Two:
- Compute for each of the eight points its distance from each cluster centre, and assign it to the cluster to which it is most similar.
=> The clustering after the second round execution is:
Cluster 1: A1, B1, C2
Cluster 2: B2, A3, B3
Cluster 3: C1, A2
- update the cluster centres:
The new centre of Cluster 1: (3.67, 9)
The new centre of Cluster 2: (7, 4.33)
The new centre of Cluster 3: (1.5, 3.5)
2.3. Round Three:
- No changes.
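The iterations above can be checked with a short Python sketch (not part of the original solution). It uses the same eight points and the initial centres A1, B2 and C1; ties are broken toward the lower-indexed centre, so it reproduces "the first way".

```python
import numpy as np

# Points from the worked example (x, y).
points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "B1": (5, 8),
    "B2": (7, 5), "B3": (6, 4), "C1": (1, 2), "C2": (4, 9),
}
X = np.array(list(points.values()), dtype=float)
names = list(points)

# Initial centres as assumed in the text: A1, B2 and C1.
centres = np.array([points["A1"], points["B2"], points["C1"]], dtype=float)

for round_no in range(1, 10):
    # Assign each point to the nearest centre (ties go to the lower index,
    # which corresponds to "the first way" above).
    dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    new_centres = np.array([X[labels == k].mean(axis=0) for k in range(3)])
    print(f"Round {round_no}:")
    for k in range(3):
        members = [n for n, lbl in zip(names, labels) if lbl == k]
        print(f"  Cluster {k + 1}: {members}, new centre {np.round(new_centres[k], 2)}")
    if np.allclose(new_centres, centres):   # centres stopped moving -> done
        break
    centres = new_centres
```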
4. a) Write down the design steps for a typical data warehouse. [WBUT 2015]
b) Explain the flowchart for the KDD process.
c) Define Roll-up and Drill-down process with a suitable example.
Answer:
a) In general, building any data warehouse consists of the following steps:
1. Extracting the transactional data from the data sources into a staging area – The Extract step covers the data extraction from the source system and makes it accessible for further processing. The main objective of the extract step is to retrieve all the required data from the source system with as little resources as possible. The extract step should be designed in a way that it does not negatively affect the source system in terms of performance, response time or any kind of locking.
2. Transforming the transactional data – The Transform step applies a set of rules to transform the data from the source to the target. This includes converting any measured data to the same dimension (i.e., conformed dimension) using the same units so that they can later be joined. The transformation step also requires joining data from several sources, generating aggregates, sorting, deriving new calculated values, and applying advanced validation rules.
3. Loading the transformed data into the target data warehouse – During the Load step, it is necessary to ensure that the load is performed correctly and with as little resources as possible into the target database. In order to make the load process efficient, it is helpful to disable any constraints and indexes before the load and enable them back only after the load completes. Referential integrity needs to be maintained by the ETL tool to ensure consistency. A small sketch of these three steps is given below.
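As mentioned above, here is a minimal illustrative sketch of the three ETL steps in Python, assuming a hypothetical sales.csv source file, hypothetical column names, and a local SQLite database as the warehouse target.

```python
# A minimal ETL sketch of the three steps above (extract, transform, load).
import pandas as pd
import sqlite3

# 1. Extract: pull the transactional data into a staging DataFrame.
staging = pd.read_csv("sales.csv")          # hypothetical source extract

# 2. Transform: convert units, derive values, and aggregate to the grain
#    of the fact table (here: revenue per product per day).
staging["amount_usd"] = staging["amount_cents"] / 100.0
fact_sales = (staging
              .groupby(["sale_date", "product_id"], as_index=False)
              .agg(revenue_usd=("amount_usd", "sum"),
                   units_sold=("quantity", "sum")))

# 3. Load: write the transformed rows into the warehouse fact table.
with sqlite3.connect("warehouse.db") as conn:
    fact_sales.to_sql("fact_sales", conn, if_exists="append", index=False)
```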
b) An Outline of the Steps of the KDD Process
[Flowchart: Data → Selection → Target Data → Preprocessing → Preprocessed Data → Transformation → Transformed Data → Data Mining → Patterns → Interpretation/Evaluation → Knowledge]
The overall process of finding and interpreting patterns from data involves the
application of the following steps:
1. Developing an understanding of
   o the application domain
   o the relevant prior knowledge
   o the goals of the end-user
2. Creating a target data set: selecting a data set, or focusing on a subset of variables, or data samples, on which discovery is to be performed.
3. Data cleaning and preprocessing.
   o Removal of noise or outliers.
   o Collecting necessary information to model or account for noise.
   o Strategies for handling missing data fields.
   o Accounting for time sequence information and known changes.
4. Data reduction and projection.
   o Finding useful features to represent the data depending on the goal of the task.
   o Using dimensionality reduction or transformation methods to reduce the effective number of variables or to find invariant representations for the data.
5. Choosing the data mining task.
   o Deciding whether the goal of the KDD process is classification, regression, clustering, etc.
6. Choosing the data mining algorithm(s).
   o Selecting method(s) to be used for searching for patterns in the data.
   o Deciding which models and parameters may be appropriate.
   o Matching a particular data mining method with the overall criteria of the KDD process.
7. Data mining.
   o Searching for patterns of interest in a particular representational form or a set of such representations as classification rules or trees, regression, clustering, and so forth.
8. Interpreting mined patterns.
9. Consolidating discovered knowledge.
c) Roll-up
Roll-up performs aggregation on a data cube in any of the following ways:
• By climbing up a concept hierarchy for a dimension
• By dimension reduction
The following diagram illustrates how roll-up works.
[Diagram: roll-up on location, aggregating sales of item types (Mobile, Modem, Phone, Security) from the city level to the country level.]
Roll-up is performed by climbing up a concept hierarchy for the dimension location.
• Initially the concept hierarchy was "street < city < province < country".
• On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country.
• The data is grouped into countries rather than cities.
• When roll-up is performed, one or more dimensions from the data cube are removed.
A small sketch of a roll-up on a toy cube is given below.
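As noted above, here is a small illustrative pandas sketch of a roll-up along the location hierarchy; the cube contents are hypothetical.

```python
# A small sketch of roll-up along a location hierarchy using pandas,
# assuming a hypothetical sales cube with city-level rows.
import pandas as pd

cube = pd.DataFrame({
    "country": ["Canada", "Canada", "USA", "USA"],
    "city":    ["Toronto", "Vancouver", "Chicago", "New York"],
    "item":    ["Mobile", "Modem", "Mobile", "Security"],
    "sales":   [605, 825, 440, 940],
})

# Roll-up on location: climb from the city level to the country level,
# which aggregates (sums) the measure and drops the finer dimension.
rolled_up = cube.groupby(["country", "item"], as_index=False)["sales"].sum()
print(rolled_up)

# Drill-down is the reverse operation: returning to the city-level detail.
```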
5. Consider the ten data objects (2,6), (3,4), (3,8), (4,7), (6,2), (6,4), (7,3), (7,4), (8,5), (7,6). Use the k-means or k-medoid algorithm to determine the two clusters. [WBUT]
Answer:
Step 1:
Initialize centres. Let us assume c1 = (3,4) and c2 = (7,4).
So here c1 and c2 are selected as medoids. For each object, the smaller of the two costs (the distance to the nearer medoid) determines its cluster.
Distances (Manhattan cost) from medoid c1 = (3,4):

i | c1 | Data object | Cost (distance)
1 | (3,4) | (2,6) | 3
3 | (3,4) | (3,8) | 4
4 | (3,4) | (4,7) | 4
5 | (3,4) | (6,2) | 5
6 | (3,4) | (6,4) | 3
7 | (3,4) | (7,3) | 5
9 | (3,4) | (8,5) | 6
10 | (3,4) | (7,6) | 6

Distances (Manhattan cost) from medoid c2 = (7,4):

i | c2 | Data object | Cost (distance)
1 | (7,4) | (2,6) | 7
3 | (7,4) | (3,8) | 8
4 | (7,4) | (4,7) | 6
5 | (7,4) | (6,2) | 3
6 | (7,4) | (6,4) | 1
7 | (7,4) | (7,3) | 1
9 | (7,4) | (8,5) | 2
10 | (7,4) | (7,6) | 2

Then the clusters become:
Cluster1 = {(3,4), (2,6), (3,8), (4,7)}
Cluster2 = {(7,4), (6,2), (6,4), (7,3), (8,5), (7,6)}
Since the points (2,6), (3,8) and (4,7) are closer to c1, they form one cluster, whilst the remaining points form the other cluster.
So the total cost involved is 20, where the cost between any two points is found using the formula
cost(x, c) = Σ_{i=1}^{d} |x_i − c_i|
where x is any data object, c is the medoid, and d is the dimension of the object, which in this case is 2.
Total cost is the summation of the cost of each data object from its medoid in its cluster, so here:
Total cost = {cost((3,4),(2,6)) + cost((3,4),(3,8)) + cost((3,4),(4,7))}
 + {cost((7,4),(6,2)) + cost((7,4),(6,4)) + cost((7,4),(7,3)) + cost((7,4),(8,5)) + cost((7,4),(7,6))}
 = (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2)
 = 20
Step 2:
Select one of the non-medoids O′. Let us assume O′ = (7,3).
So now the medoids are c1 = (3,4) and O′ = (7,3).
If c1 and O′ are the new medoids, calculate the total cost involved by using the formula in step 1.

Distances (Manhattan cost) from medoid c1 = (3,4):

i | c1 | Data object | Cost (distance)
1 | (3,4) | (2,6) | 3
3 | (3,4) | (3,8) | 4
4 | (3,4) | (4,7) | 4
5 | (3,4) | (6,2) | 5
6 | (3,4) | (6,4) | 3
7 | (3,4) | (7,4) | 4
9 | (3,4) | (8,5) | 6
10 | (3,4) | (7,6) | 6

Distances (Manhattan cost) from medoid O′ = (7,3):

i | O′ | Data object | Cost (distance)
1 | (7,3) | (2,6) | 8
3 | (7,3) | (3,8) | 9
4 | (7,3) | (4,7) | 7
5 | (7,3) | (6,2) | 2
6 | (7,3) | (6,4) | 2
7 | (7,3) | (7,4) | 1
9 | (7,3) | (8,5) | 3
10 | (7,3) | (7,6) | 3

Total cost = 3 + 4 + 4 + 2 + 2 + 1 + 3 + 3 = 22
So the cost of swapping the medoid from c2 to O′ is:
S = current total cost − past total cost = 22 − 20 = 2 > 0.
So moving to O′ would be a bad idea; the previous choice was good. We try other non-medoids and find that our first choice was the best. So the configuration does not change and the algorithm terminates here (i.e., there is no change in the medoids). A small sketch of this swap evaluation is given below.
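As mentioned above, the swap evaluation can be reproduced with a short Python sketch (not part of the original solution); it uses the same ten objects and the Manhattan cost.

```python
import numpy as np

# The ten data objects from the worked example above.
X = np.array([(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
              (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)])

def total_cost(medoids):
    """Sum of Manhattan distances from each object to its nearest medoid."""
    M = np.array(medoids)
    d = np.abs(X[:, None, :] - M[None, :, :]).sum(axis=2)  # cost(x, c) = sum |x_i - c_i|
    return int(d.min(axis=1).sum())

current = [(3, 4), (7, 4)]            # step 1: initial medoids
print(total_cost(current))            # -> 20

candidate = [(3, 4), (7, 3)]          # step 2: swap (7,4) with non-medoid O' = (7,3)
print(total_cost(candidate))          # -> 22

# Swap cost S = new total cost - current total cost = 22 - 20 = 2 > 0,
# so the swap is rejected and the medoids (3,4) and (7,4) are kept.
print(total_cost(candidate) - total_cost(current))   # -> 2
```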
6. What is metadata in Data Warehousing? What is a metadata catalogue? Describe different categories of metadata used in a data warehouse.
OR
a) What is metadata?
b) What is the typical content of metadata of a data warehouse?
OR
What is Metadata? Explain different types of Metadata.
Answer:
Metadata is simply defined as data about data. The data that is used to represent other data is known as metadata. In terms of a data warehouse, we can define metadata as follows.
• Metadata is the road-map to a data warehouse.
• Metadata in a data warehouse defines the warehouse objects.
• Metadata acts as a directory. This directory helps the decision support system to locate the contents of a data warehouse.
A collection of metadata records combined with data management and search tools forms a data catalogue. The architecture of these data catalogues can vary from tightly coupled internal data indexes to widely accessible distributed catalogues spread across the Internet.
Metadata can be broadly categorized into three categories:
• Business Metadata: It has the data ownership information, business definitions and changing policies.
• Technical Metadata: It includes database system names, table and column names and sizes, data types and allowed values. Technical metadata also includes structural information such as primary and foreign key attributes and indices.
• Operational Metadata: It includes currency of data and data lineage. Currency of data means whether the data is active, archived, or purged. Lineage of data means the history of the data migrated and the transformations applied on it.
1. a) What is clustering? Why is it difficult to handle categorical data in clustering? Distinguish between partition clustering and hierarchical clustering.
OR
What is clustering? What are the features of a cluster? What do you mean by hierarchical clustering technique? [WBUT 2013]
Answer:
1st Part:
Clustering is a data mining (machine learning) technique used to place data elements into
related groups without advance knowledge of the group definitions.
There are two main types of clustering techniques: those that create a hierarchy of clusters and those that do not. The hierarchical clustering techniques create a hierarchy of clusters from small to big. With a hierarchy of clusters defined, it is possible to choose the number of clusters that are desired. At the extreme, it is possible to have as many clusters as there are records in the database. In this case the records within each cluster are optimally similar to each other (since there is only one) and certainly different from the other clusters. The advantage of hierarchical clustering methods is that they allow the end user to choose from either many clusters or only a few.
2nd Part:
Since the process of grouping a set of physical or abstract objects into classes of similar objects is clustering, the similarity metric is important here because it is also used for outlier detection. A clustering algorithm which is main-memory based can operate only on the following two data structures, namely:
a) Data Matrix
b) Dissimilarity Matrix
So it is difficult to handle categorical data.
In partition clustering, every data sample is initially assigned to a cluster in some (possibly random) way. Samples are then iteratively transferred from cluster to cluster until some criterion function is minimized. Once the process is complete, the samples will have been partitioned into separate compact clusters. Examples of partition clustering methods are k-means and Lloyd's method.
In hierarchical clustering, each sample is initially considered a member of its own cluster,
after which clusters are recursively combined in pairs according to some predetermined
condition until eventually every point belongs to a single cluster. The resulting
hierarchical structure may be represented by a binary tree or "dendrogram", from which
the desired clusters may be extracted. Examples of hierarchical clustering methods are
the single-link, Ward's, centroid, complete-link, group average, median, and parametric
Lance Williams methods.
b) Describe the working of the PAM algorithm. Compare its performance with CLARA and CLARANS.
Answer:
The PAM k-medoids clustering algorithm evaluates a set of k objects considered to be representative objects (medoids) of k clusters within T objects, such that the non-selected objects are clustered with the medoid to which each is most similar (i.e., closest in terms of the provided distance metric). The process operates by swapping one of the medoids with one of the other objects iteratively such that the total distance between the objects and their medoids is reduced. The algorithm can be depicted as follows:
Step 1: Initialization – choose k medoids from the T objects randomly.
Step 2: Evaluation – calculate the cost D_new − D_old for each swap of one medoid with another object, where D_old is the total distance before the swap and D_new is the total distance after the swap.
Step 3: Selection – accept the swap with the best cost; if the cost is negative, go to step 2; otherwise record the medoids and terminate the program.
The computational complexity of the PAM algorithm is O((1 + β)k(T − k)²) based on the number of partitions per object, where β is the number of successful swaps. It can also be expressed as O′((1 + β)k²(T − k)²) based on the number of distance calculations, i.e., one partition per object is equivalent to k distance calculations. Clearly, this is time consuming even for a moderate number of objects and a small number of medoids.
The computational complexity of the CLARA algorithm is O(q(ks² + (T − k))) based on the number of partitions per object, or O′(q(k²s² + k(T − k)) + βk²s²) based on the number of distance calculations, where q, s, k, β and T are the number of samples, the size per sample, the number of medoids, the number of successful swaps for all samples tested, and the total number of objects, respectively. Clearly, the CLARA algorithm can deal with a larger number of objects than can the PAM algorithm if s ≪ T.
The computational complexity of CLARANS is O((β + numlocal)(T − k)) based on the number of partitions per object, or O′((β + numlocal)k(T − k)) based on the number of distance calculations, where β is the number of test moves between nodes.
2. a) What is Clustering? Why is it required for data warehousing and data mining? What are the differences between partitioning clustering and hierarchical clustering?
b) Suppose the following table shows the distances between different cities. Construct a hierarchical tree using the hierarchical clustering method.

   | BA  | FI  | MI  | NA  | RM  | TO
BA | 0   | 662 | 877 | 255 | 412 | 996
FI | 662 | 0   | 295 | 468 | 268 | 400
MI | 877 | 295 | 0   | 754 | 564 | 138
NA | 255 | 468 | 754 | 0   | 219 | 869
RM | 412 | 268 | 564 | 219 | 0   | 669
TO | 996 | 400 | 138 | 869 | 669 | 0

c) Discuss the K-means algorithm with a suitable example.
Answer:
a) 1st Part: Refer to Question No. 1(a) (1st Part) of Long Answer Type Questions.
2nd Part:
Clustering is required for data warehousing and data mining for the following reasons:
• Scalability – We need highly scalable clustering algorithms to deal with large databases.
• Ability to deal with different kinds of attributes – Algorithms should be capable of being applied to any kind of data such as interval-based (numerical) data, categorical data, and binary data.
• Discovery of clusters with arbitrary shape – The clustering algorithm should be capable of detecting clusters of arbitrary shape. It should not be bounded to only distance measures that tend to find spherical clusters of small sizes.
• High dimensionality – The clustering algorithm should not only be able to handle low-dimensional data but also high-dimensional spaces.
• Ability to deal with noisy data – Databases contain noisy, missing or erroneous data. Some algorithms are sensitive to such data and may lead to poor quality clusters.
• Interpretability – The clustering results should be interpretable, comprehensible, and usable.
3rd Part: Refer to Question No. 1(a) (2nd Part) of Long Answer Type Questions.
b) The nearest pair of cities is MI and TO, at distance 138. These are merged into a single cluster called "MI/TO". The level of the new cluster is L(MI/TO) = 138 and the new sequence number is m = 1.
Then we compute the distance from this new compound object to all other objects. In single-link clustering the rule is that the distance from the compound object to another object is equal to the shortest distance from any member of the cluster to the outside object. So the distance from MI/TO to RM is chosen to be 564, which is the distance from MI to RM, and so on.
After merging MI with TO we obtain the following matrix:

      | BA  | FI  | MI/TO | NA  | RM
BA    | 0   | 662 | 877   | 255 | 412
FI    | 662 | 0   | 295   | 468 | 268
MI/TO | 877 | 295 | 0     | 754 | 564
NA    | 255 | 468 | 754   | 0   | 219
RM    | 412 | 268 | 564   | 219 | 0
min d(i,j) = d(NA,RM) = 219 ⇒ merge NA and RM into a new cluster called NA/RM.
L(NA/RM) = 219, m = 2.

      | BA  | FI  | MI/TO | NA/RM
BA    | 0   | 662 | 877   | 255
FI    | 662 | 0   | 295   | 268
MI/TO | 877 | 295 | 0     | 564
NA/RM | 255 | 268 | 564   | 0
min d(i,j) = d(BA,NA/RM) = 255 ⇒ merge BA and NA/RM into a new cluster called BA/NA/RM.
L(BA/NA/RM) = 255, m = 3.

         | BA/NA/RM | FI  | MI/TO
BA/NA/RM | 0        | 268 | 564
FI       | 268      | 0   | 295
MI/TO    | 564      | 295 | 0

min d(i,j) = d(BA/NA/RM, FI) = 268 ⇒ merge BA/NA/RM and FI into a new cluster called BA/FI/NA/RM.
L(BA/FI/NA/RM) = 268, m = 4.

            | BA/FI/NA/RM | MI/TO
BA/FI/NA/RM | 0           | 295
MI/TO       | 295         | 0
Finally, we merge the last two clusters at level 295. 4
The process is summarized by the following hierarchical tree:
[Dendrogram: MI and TO join at level 138; NA and RM join at 219; BA joins NA/RM at 255; FI joins BA/NA/RM at 268; finally MI/TO joins BA/FI/NA/RM at 295.]
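The same merges can be reproduced with SciPy's single-link clustering, shown here as a sketch (not part of the original answer), using the distance matrix from the question.

```python
# A sketch reproducing the single-link merges above with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

cities = ["BA", "FI", "MI", "NA", "RM", "TO"]
D = np.array([
    [0,   662, 877, 255, 412, 996],
    [662, 0,   295, 468, 268, 400],
    [877, 295, 0,   754, 564, 138],
    [255, 468, 754, 0,   219, 869],
    [412, 268, 564, 219, 0,   669],
    [996, 400, 138, 869, 669, 0],
])

# squareform condenses the symmetric matrix; 'single' gives single-link
# (nearest-neighbour) clustering.
Z = linkage(squareform(D), method="single")
print(Z[:, 2])   # merge levels -> [138. 219. 255. 268. 295.]

# dendrogram(Z, labels=cities) would draw the hierarchical tree.
```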
c) K-means is an unsupervised learning algorithm that solves the clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The algorithm is composed of the following steps:
1. Place K points into the space represented by the objects that are being clustered. These points represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
Suppose that we have n sample feature vectors x1, x2, ..., xn, all from the same class, and we know that they fall into k compact clusters, k < n.
... rules of the form X ⇒ Y, with the user-specified minimum support and minimum confidence, where X and Y can be sets of items at any level of the taxonomy, such that no item in Y is an ancestor of any item in X. For example, there might be a rule which says that "50% of transactions that contain Soft drinks also contain Snacks; 5% of all transactions contain both these items".
) PAM Clustering Technique:
PAM stands for “partition around medoids”. The algorithm is intended to find a sequence
of objects called medoids that are centrally located in clusters. Objects that are tentatively
defined as medoids are placed into aset S of selected objects.
1fO is the set of objects that the set U =O — Sis the set of unselected objects. The goal of
the algorithm is to minimize the average dissimilarity of objects to their closest selected
abject, Equivalently, we can minimize the sum of the dissimilarities between object and
their closest selected object.
The algorithm has two phases:
(i) Inthe first phase, BUILD, a collection of k objects are selected for an initial set S.
(ii) In the second phase, SWAP, one tries to improve the quality of the clustering by
exchanging selected objects with unselected objects.
The goal of the algorithm is ta minimize the average dissimilarity of objects to their
closest selected object. Equivalently, we can minimize the sum of the dissimilarities
between object and their closest selected object.
4) CLARANS clustering algorithm yis-A-vis PAM and CLARA:
The mediods searching in PAM or CLARA is abstracted as searching k subgraphs from 1
tae mah, and based on this understanding, a PAM-like clustering algorithm called
woe oe (Clustering Large Applications based upon RANdomized Search) is defined.
Cae searches the whole graph and CLARA searches some random sub-graphs,
aeons Apoel samples a set and selects k medoids in climbing sub-graph
nedhite’,, CLARANS selects the neighboring objects of medoids as candidates of new
IS It samples subsets to verify medoids in multiple times to avoid bad samples.
DWM-33TOkworsls, multiple time sampling of medoids verification is time
limits CLARANS from clustering very large datasets in an acceptab
POPULAR PUBLICATIONS
e) K-Medoid Algorithm:
Partitioning Around Medoids, or the K-medoids algorithm, is a partitional clustering algorithm which is slightly modified from the K-means algorithm. The idea of this algorithm is to first compute the K representative objects, which are called medoids. After finding the set of medoids, each object of the data set is assigned to the nearest medoid. That is, object i is put into cluster v_i when medoid m_{v_i} is nearer than any other medoid m_w.
The algorithm proceeds in two steps:
• BUILD-step: This step sequentially selects k "centrally located" objects to be used as initial medoids.
• SWAP-step: If the objective function can be reduced by interchanging (swapping) a selected object with an unselected object, then the swap is carried out. This is continued till the objective function can no longer be decreased.
The algorithm is as follows:
1. Initially select k random points as the medoids from the given n data points of the data set.
2. Associate each data point to the closest medoid by using any of the most common distance metrics.
3. For each pair of non-selected object h and selected object i, calculate the total swapping cost TC_ih.
4. If TC_ih < 0, i is replaced by h.
5. Repeat steps 2-4 until there is no change in the medoids.
MINING TIME SERIES DATA
Multiple Choice Type Questions
1. To optimize data warehouse design, which one is done? [WBUT 2011]
a) Normalization of fact tables and denormalization of dimension tables
b) Normalization of fact tables and dimension tables
c) Denormalization of fact tables and dimension tables
d) Normalization of dimension tables and denormalization of fact tables
Answer: (a)
2. _____ is the analysis of the value of an attribute as it varies over time. [MODEL QUESTION]
a) Time Series Analysis b) Classification
c) Association d) Prediction
Answer: (a)
Long Answer Type Questions
1. a) Define decision tree. [WBUT 2009, 2010]
b) What are the advantages and disadvantages of the decision tree approach over other approaches for data mining? [WBUT 2003]
c) Discuss briefly the tree construction principle. [WBUT 2009]
OR
Describe the basic algorithm for decision tree induction. [WBUT 2013]
d) What is a classification problem? What is the difference between supervised and unsupervised classification? [WBUT 2009]
Answer:
a) Decision trees are powerful and popular tools for classification and prediction. The attractiveness of decision trees is due to the fact that, in contrast to neural networks, decision trees represent rules.
A decision tree is a classifier in the form of a tree structure, where each node is either:
• a leaf node – indicates the value of the target attribute (class) of examples, or
• a decision node – specifies some test to be carried out on a single attribute value, with one branch and sub-tree for each possible outcome of the test.
A decision tree can be used to classify an example by starting at the root of the tree and moving through it until a leaf node is reached, which provides the classification of the instance.
Information gain is simply the expected reduction in entropy caused by partitioning the examples according to an attribute. More precisely, the information gain, Gain(S, A), of an attribute A relative to a collection of examples S, is defined as

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

where Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has value v (i.e., S_v = {s ∈ S | A(s) = v}). Note that the first term in the equation is just the entropy of the original collection S, and the second term is the expected value of the entropy after S is partitioned using attribute A.
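As an illustration of the formula (not part of the original text), the following Python sketch computes entropy and information gain for a small, hypothetical set of training examples; the attribute and class names are assumptions.

```python
# Computes Entropy(S) and Gain(S, A) for a toy attribute, following the formula above.
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum(p * log2(p)) over the class proportions in S."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, attribute, target):
    """Gain(S, A) = Entropy(S) - sum(|Sv|/|S| * Entropy(Sv)) over the values v of A."""
    S = [ex[target] for ex in examples]
    gain = entropy(S)
    for v in {ex[attribute] for ex in examples}:
        Sv = [ex[target] for ex in examples if ex[attribute] == v]
        gain -= (len(Sv) / len(S)) * entropy(Sv)
    return gain

# Hypothetical training examples: does a customer respond to an offer?
data = [
    {"income": "high", "responds": "yes"},
    {"income": "high", "responds": "yes"},
    {"income": "low",  "responds": "no"},
    {"income": "low",  "responds": "no"},
    {"income": "low",  "responds": "yes"},
]
print(round(information_gain(data, "income", "responds"), 3))  # ~0.42
```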