
DATA MINING AND WAREHOUSING

UNIT – I

Introduction: Data mining application – data mining techniques – data mining case
studies-the future of data mining – data mining software - Association rules mining:
Introduction-basics- task and a naïve algorithm- apriori algorithm – improving the efficiency
of the apriori algorithm – mining frequent patterns without candidate generation (FP-
growth) – performance evaluation of algorithms.

UNIT – II
Classification : Introduction – decision tree – overfitting and pruning - DT rules -
naïve bayes method - estimating predictive accuracy of classification methods - other
evaluation criteria for classification methods – classification software

UNIT – III

Cluster analysis: cluster analysis – types of data – computing distances - types of
cluster analysis methods - partitional methods – hierarchical methods – density based
methods – dealing with large databases – quality and validity of cluster analysis methods -
cluster analysis software.

UNIT – IV

Web data mining: Introduction- web terminology and characteristics- locality and
hierarchy in the web- web content mining-web usage mining- web structure mining – web
mining software - Search engines: Search engines functionality- search engines
architecture – ranking of web pages.

UNIT – V
Data warehousing: Introduction – Operational data sources- data warehousing - Data
warehousing design – Guidelines for data warehousing implementation - Data warehousing
metadata - Online analytical processing (OLAP): Introduction – characteristics of
OLAP systems – multidimensional view and data cube - data cube implementation - data
cube operations - OLAP implementation guidelines

BOOK FOR STUDY:


"Introduction to Data Mining with Case Studies", G.K. Gupta, PHI Private Limited, New
Delhi, 2008; 2nd Edition, PHI, 2011.

UNIT-1
INTRODUCTION
Define Data Mining
Data mining may be defined as follows:
1. It refers to extracting or mining knowledge from large amounts of data.
2. It is the technique used to retrieve hidden information from a data warehouse
(a collection of different types of databases).
3. It is a collection of techniques for efficient automated discovery of previously
unknown, valid, novel, useful and understandable patterns in large databases. The
patterns must be actionable so that they may be used in an enterprise’s decision
making process.
4. Data mining or knowledge Discovery in Database (KDD) is a collection of exploration
techniques based on advanced analytical method & tools for handling a large amount of
information.
5. It is the extraction of hidden predictive information from large databases, a powerful new
technology with great potential to help companies focus on the most important
information in their data warehouses.
6. It lies at the intersection of computer science and statistics and is the process that attempts
to discover patterns in large sets of data.
Example: The mining of gold from rock/sand is referred to as gold mining.
Goal and methods of data mining:
 Extract information from a dataset.
 Transform it into an understandable structure for further use.
 Method used: AI, machine learning, statistics, database system.
Other Names:
1. "Knowledge mining from data"
2. "Knowledge mining"
3. "Knowledge extraction"
4. "Data/pattern analysis"
5. "Data archaeology"
6. "Data dredging"
NEED OF DATA MINING:
What are the needs of Data Mining?
 Growth in OLTP data:
 Many enterprises have more than 30 years of experience in using database systems
and have accumulated large amounts of data.
 Techniques are therefore needed to extract useful information from these large data sets.
 Growth in data due to cards:
 The growing use of credit cards and loyalty cards is an important area of data
growth. Credit cards are not the only cards that are generating enormous
amounts of data.
 Growth in data due to the web:
 E-commerce developments have resulted in information about visitors to web sites
being captured, once again resulting in mountains of data for some companies.
 Growth in data due to other sources:
 There are many other sources of data. Some of them are: Telephone transaction,
Medical transactions, banking transactions, shopping transactions and so on.

 Decline in the cost of processing:
 Not only do the prices of processors continue to decline, but the prices of computer
peripherals have also been declining.
 Competitive environment:
 Due to increased globalization of trade, the business environment in most
countries has become very competitive. Owing to this intense competition,
companies have to work harder to find new customers and retain old ones.
 Availability of software:
 A number of companies have developed useful data mining software based on
statistical algorithms.

THE DATA MINING PROCESS:


Explain the steps involved in Data mining process.(5m)
1. Requirements analysis:
 The business problem must be clearly defined.
 Because, the technique to be used & the data that is required differ for different
goals. If the objective is clearly defined, it is easier to evaluate the result of the
project.
2. Data selection and collection:
 This step may include finding the best source databases for the data that is
required.
 If the enterprise has implemented a data warehouse, then most of the data could be
available there.
3. Cleaning and preparing data:
 If data warehouse exists, these tasks are easier, since most of this must have been
done when data was loaded in the warehouse.
 Otherwise, this task is very resource intensive; more than 50% of the effort in data
mining is spent on this step. An ETL (extraction, transformation and loading) tool
may be used to overcome these problems.
4. Data mining exploration and validation:
 Once appropriate data has been collected and cleaned, it is possible to start data
mining exploration. This step is likely to be an iterative process which leads to
selection of one or more techniques that are suitable.
5. Implementing , evaluating and monitoring:
 Once a model has been selected and validated, the model can be implemented for
use by the decision makers.
 Evaluation may involve checking the accuracy and effectiveness of the
technique. Monitoring leads from time to time refinement of tools and
techniques that have been implemented.
6. Results visualization:
 Explaining the results of data mining to the decision makers is an important step
of the data mining process.
 Clever data visualization tools are being developed to display results that deal
with more than two dimensions.

Architecture of Data Mining
Write about the architecture of data mining.(5m)

[Figure: architecture of a data mining system. From top to bottom, the components are:
   User interface
   Pattern evaluation (supported by a knowledge base)
   Data mining engine
   Database or data warehouse server
   Data cleaning, integration and selection
   Data sources: databases, data warehouse, the World Wide Web and other information
   repositories]


DATA MINING APPLICATIONS
Write short notes on data mining applications.(5m)
Discuss the areas where data mining is used.(5m/10m)
 We group the data mining applications into the following 6 groups:
Prediction and description:
 Data mining techniques may be used for sales forecasting and analysis.
 Usually the techniques involve selecting some or all the attributes of the objects
available in a database to predict other variable of interest.
Relationship marketing:
 Data mining can help in analyzing customer profiles, discovering sales triggers, and in
identifying critical issues that determine client loyalty and help in improving customer
retention.
 This also includes building customer profiles and improving direct marketing plans.
 It may be possible to use cluster analysis to identify customers suitable for cross-
selling other products.

Customer profiling:
 It is the process of using the relevant and available information to describe the
characteristics of a group of customers.
 Profiling can help an enterprise identify its most valuable customers so that the
enterprise may cater to their differing needs and values.
 This may also help in increasing the lifetime value of some customer. Also facilitates
loan & credit card approvals.
Outlier identification and fraud detection:
 These might be as simple as identifying unusual expense claims by staff, identifying
anomalies in expenditure between similar units of an enterprise perhaps during
auditing or identifying fraud.
Customer segmentation:
 It is a way to assess and view individuals in the market based on their status and needs.
 Data mining can be used for customer segmentation, for promoting the cross-selling
of services, and in increasing customer retention.
 Data mining may be used to understand and predict customer behaviours and
profitability, to develop new products and services and to effectively market new
offerings.
Web site design and promotion:
 Web mining may be used to discover how users navigate a web site, and the results can
help in improving the site design and making it more visible on the web.
 Data mining may also be used in cross-selling by suggesting to a web customer, items
that he/she may be interested in.
 Data mining is widely used in the following fields: banking, marketing, fraud
detection, insurance, medical/pharmaceutical, telecommunication & e-
commerce.
DATA MINING TECHNIQUES
Discuss about the different types of data mining techniques. (5m/10m)
 Data mining employs a number of techniques which are follows:
Association rules mining or market basket analysis
 Association rules mining is a technique that analyses a set of transactions, like those
captured at a supermarket checkout, each transaction being a list of products or items
purchased by one customer.
 Aim is to determine which items are purchased together frequently so that they may
be grouped together on store shelves or the information may be used for cross-selling.
 The term "Lift" is used to measure the power of association between items that are purchased
together.
 Lift indicates how much more likely an item is to be purchased if the customer has
bought the other item that has been identified as having an association with the first
item compared to the likelihood of it being purchased without the other item being
purchased.
 A simple algorithm called the “Apriori algorithm” may be used to find associations,
but the algorithm becomes inefficient for applications where the data is very large.

Supervised classification
 This is appropriate to use if the data is known to have a small number of classes, the
classes are already known and some training data with their classes known is
available.
 The model built based on the training data may then be used to assign a new object to
a predefined class.
 Supervised classification can be used in predicting the class to which an object or
individual is likely to belong.
 One of the most widely used supervised classification techniques is the decision tree.
The decision tree technique is widely used because it generates easily understandable
rules for classifying data.
Cluster analysis
 Cluster analysis, or clustering, is similar to classification but, in contrast to supervised
classification, it is useful when the classes in the data are not already known and
training data is not available.
 Aim is to find groups that are very different from each other in a collection of data.
Cluster analysis breaks up a single collection of perhaps diverse data into a number of
groups.
 One of the most widely used cluster analysis method is called the k-means algorithm,
which requires that the user specifies not only the number of clusters but also their
starting seeds.
 The algorithm assigns each object in the given data to the closest seed which provides
the initial cluster.
 The algorithm then iteratively reviews the cluster obtained & refines them.
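 The k-means steps just described can be illustrated with a small Python sketch (illustrative
only; the sample points, the number of clusters and the fixed iteration count are hypothetical
choices, not taken from the text):

    # Minimal k-means sketch: pick starting seeds, assign each object to the
    # closest seed, then iteratively recompute the cluster centres.
    import random

    def kmeans(points, k, iterations=10):
        seeds = random.sample(points, k)        # user-specified number of clusters and starting seeds
        clusters = []
        for _ in range(iterations):
            clusters = [[] for _ in range(k)]
            for p in points:                    # assign each object to the closest seed
                idx = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, seeds[i])))
                clusters[idx].append(p)
            for i, members in enumerate(clusters):  # refine: recompute each seed as the cluster mean
                if members:
                    seeds[i] = tuple(sum(vals) / len(members) for vals in zip(*members))
        return seeds, clusters

    centres, groups = kmeans([(1, 1), (1, 2), (0, 1), (8, 8), (9, 8)], k=2)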
Web data mining
 Web mining is the application of data mining techniques to find interesting and
potentially useful knowledge from web data.
 It is normally expected that the hyperlink structure of the web or the web log data, or
both, have been used in the mining process.
Search engines
 Search engines are huge databases of web pages as well as software packages for
indexing and retrieving the pages that enable users to find information of interest to
them.
 When one searches the web using one of the search engines, one is not searching the
entire web. Instead he is searching the database that has been compiled by the search
engine.
Data warehousing and OLAP
 Data warehousing is a process by which an enterprise collects data from the whole
enterprise to build a single version of the truth. This information is useful for decision
makers and may also be used for data mining.
 OLAP tools are decision support tools that are often built on top of a data warehouse
or another database.(multidimensional database).
 Data mining is different from OLAP since in data mining a hypothesis is not being
tested. Instead data mining is used to uncover novel patterns in the data.

DATA MINING CASE STUDIES
Briefly explain some data mining case studies. (5m/10m)
Aviation
Wipro’s Frequent Flyer Program:
 Wipro has reported a study of frequent flyer data from an Indian airline. Before
carrying out data mining, the data was selected and prepared.
 It was discovered that much of data supplied by the airline was incomplete or
inaccurate. Ex: reason for taking a journey.
 Still, there were some interesting and simple facts about customers flying between
metropolitan cities of India.
 It was found that customers flying Mumbai-Delhi also flew to other cities, for ex,
Mumbai-Chennai, Mumbai-Kolkata, Mumbai-Bangalore.
 Most customers flying Bangalore-Hyderabad also travelled Delhi-Bangalore. Those
who flew Guwahati to Bagdogra did not fly back to Guwahati but instead flew to
Delhi.
 These results provided the airline with a better understanding of its business & may
have helped it to refine flight programming to better suit the customers.
Astronomy:
 An interesting data mining application area is astronomy. Astronomers produce huge
amounts of data every night on the fluctuating intensity of around 20 million stars
which are classified by their spectra and their surface temperature.
 Using the enormous amount of data that is being collected, astronomers want to
classify stars and galaxies and find any new classes of objects that might be out there.
 Decision trees and cluster analysis techniques have been used for such data mining.
Banking and finance:
 Banking and finance is a rapidly changing competitive industry. The industry is using
data mining for a variety of task including building customer profiles to better
understand the customer, to identify fraud and so on.
 In the field of credit evaluation, data mining can assist in establishing an automated
decision support system which would allow credit card or loan providing companies
to quickly and accurately assess risk and approve or reject an application.
 Another application of data mining in banking and finance is to identify high-value
customer.
Climate:
 A study has been reported on atmospheric and oceanic parameter that cause drought
in the state of Nebraska in the USA.
 Many variables were considered, including the standardized precipitation index
(SPI), the Pacific/North American index (PNA), the North Atlantic Oscillation index (NAO)
and so on.
Crime Prevention:
 A number of case studies have now been published about the use of data mining
techniques in analyzing crime data.
 As in most case studies, the data was not complete and there were errors and
discrepancies in the data supplied by the police.

Direct mail service:
 In this case study, a direct mail company held a list of a large number of potential
customers. The company’s response rate had been only 1% which the company
wanted to improve.
 To carry out data mining, the company had to first prepare data, which included
sampling the data to select a subset of customers including those that responded to
direct mail and those did not.
 Using the decision tree approach, the company was able to identify the characteristics
of customers who were more likely to respond and was thus able to reduce the
number of customer it mailed to while simultaneously improving the response rate.
Healthcare:
 In drug testing, data mining may assist in isolating those patients where the drug is
most efficient or where the drug is having unintended side effect.
 In this case study, data from a hospital was analyzed. The aim of the study was to be
able to predict the length of hospital stay for patients suffering from spinal cord
injuries.
 The study was undertaken because the spinal injury patients often had extended
hospital stays and the hospital system wanted to know if the length of stay could be
reduced and efficiency of the hospital improved.
Manufacturing:
 Data mining tools have been used to identify factors that lead to critical
manufacturing situations in order to warn engineers of impending problems.
 The data mining was able to predict and prevent the failure of critical manufacturing
components in some manufacturing companies.
 Data mining applications have been reported in power stations, petrochemical plants
and other types of manufacturing plants.
Marketing and Electronic Commerce:
 One of the most widely discussed applications of data mining is that by Amazon.com,
which uses simple data mining tools to recommend to customers what other products
they might consider buying.
 The rules use the customer's own history of purchases as well as purchases by similar
customers.
Telecommunication:
 The telecommunication industry in many countries is in turmoil due to
deregulation. The carriers are offering a variety of products including packaged
landline, broadband and mobile phone services.
 There is a need to retain customers and to poach other companies' customers. Data
mining can provide information about customers who are most likely to remain loyal
and those who are most likely to switch.
 A telecommunication company has to deal with a large number of variables including
cost of local calls, the installation and disconnection rates and so on.
 A variety of data mining tools including data warehousing, OLAP, classification,
clustering and association rules are being used in the telecommunication sector.

THE FUTURE OF DATA MINING
Discuss the future of data mining in detail. (5m/10m)
 Although data mining techniques have moved from research algorithms to business
applications, and although storage prices continue to decline and enterprise data continues to
grow, data mining is still not being used widely.
 Since most of the time in data mining is actually spent in data extraction, data cleaning and
data manipulation, it is expected that technologies like data warehousing will grow in
importance.
 Business users often find the techniques difficult to understand and integrate into
business process.
 To make data mining techniques more accessible to business, academics need to put
more effort into understanding the kinds of problems and business processes involved and
into explaining which data mining techniques are suitable for them.
 For a successful data mining application in an organization, it is perhaps best to
identify one particular area that needs to be explored. This may require a small project
team with suitable expertise.
 Many data mining techniques are not based on a sound theoretical background; sound
practice will therefore have to be a focus of data mining efforts in the future.
 Other techniques that are likely to receive more attention in the future are text and
web-content mining, bioinformatics and multimedia data mining.
 The importance of visualization is expected to grow, and there should be more tools
with better visualization capabilities available in the future.
Guidelines for successful Data mining
Discuss some of the guidelines for data mining applications. (2m)
Some of the basic requirements for a data mining project to be successful are:
 The data must be available.
 The data must be relevant, adequate and clean.
 There must be a well-defined problem.
 The problem should not be solvable by means of ordinary query or OLAP tools.
 The results must be actionable.
DATA MINING SOFTWARE
Explain some of the s/w used in data mining application.(10m)
 There is considerable data mining software available on the market. Most major computing
companies, including IBM, Oracle and Microsoft, offer data mining products.
 Some data mining s/w can be expensive while other software is available for free. A
list of some software packages are:
Angoss software:
 Angoss has data mining software called Knowledge STUDIO. It is a complete data
mining package that includes facilities for classification, cluster analysis and
prediction.
 It provides a visual and easy-to-use interface.
CART and MARS:
 This software from Salford systems includes CART decision tree, MARS predictive
modelling, automated regression, clustering, and anomaly detection and so on.

Clementine:
 This software from SPSS is a well-known and comprehensive package that provides
association rules, classification, cluster analysis, forecasting, prediction and sequence
discovery.
Data Miner Software Kit:
 It is a collection of data mining tools.
DB Miner Technologies:
 DB Miner provides techniques for association rules, classification and cluster
analysis.
Enterprise Miner:
 SAS institute has a comprehensive integrated data mining package. It provides a user-
friendly icon-based GUI front-end using their process model called SEMMA.
Ghost Miner:
 It is a complete data mining suite, including data pre-processing, feature selection, k-
nearest neighbours, and decision tree and so on.
Intelligent Miner:
 This is a comprehensive data mining package from IBM. Intelligent Miner uses DB2
but can access data from other databases.
JDA Intellect:
 JDA software group has a comprehensive package called JDA Intellect that provides
facilities for association rules, classification, cluster analysis and prediction.
Mantas:
 The Mantas suite is designed to focus on detecting and analysing suspicious
behaviour in financial markets and to assist companies in complying with global
regulations.
MCubiX from Diagnose:
 It is a complete and affordable data mining toolbox, including decision tree, neural
networks, association and visualization.
Mine Set:
 Mine Set specializes in visualization and provides a variety of visualization tools
including the scatter visualizer, the tree visualizer, the statistics visualizer and the map
visualizer.
Mining Mart:
 This software focuses on data cleaning and provides a graphical tool for data pre-
processing.
Oracle:
 Oracle 10g has data mining facilities embedded in it. Users of Oracle therefore have
access to techniques for association rules, classification and prediction.
Weka 3:
 A collection of machine learning algorithms for solving data mining problems.

Software Evaluation and Selection:
How to evaluate and select the software in the data mining. (10m/5m)
 It should be noted that there is no software tool or suite of tools that could be
considered the best data mining software.
 What a user looking for data mining software need to do is to find the package or
packages that are suited to meet the needs of the enterprise.
 Many factors must be considered in evaluating the suitability of software. Some of the
important factors are:
1. Product and vendor information:
 Is the vendor reliable?
 How easy is it to input information in the software?
2. Total cost of ownership:
 How much will the product cost?
 Does it include the cost of maintenance?
3. Performance:
 If the performance information is provided, is the information reliable?
 How does the software deal with large datasets?
4. Functionality and Modularity:
 What techniques and algorithms does the software provide?
 Does the software provide tools for data cleaning, data filtering and data
binning?
5. Training and Support:
 What documentation is provided?
 Does the vendor provide training for the software available?
6. Usability:
 Is the software easy to learn?
 Is the software flexible?
 Much information about the software may be obtained from the software vendors’
web sites, but one must be particularly careful in evaluating the information listed on
them.
 Also, when one is attempting to compare a number of software packages, it is best to
use a matrix by which all the different packages may be compared against all
important criteria.
ASSOCIATION RULES MINING
INTRODUCTION
 Association rules mining or market basket analysis involves searching for interesting
customer habits by looking at associations, i.e., finding interesting relationships
among the items in a data set.
 Example is analyzing a large database or supermarket transactions.
 Association rules mining has many applications includes marketing, customer
segmentation, classification, web mining and so on.
 Goal:
Improve the retailer's sales.
Retail analysis: which items are frequently purchased together by customers?

BASICS
Discuss some of the terminologies used in association rules mining. (10m/5m)
 Consider the following association rules mining task of a small shop. The shop sells
only a small variety of products:
Bread, Cheese, Coffee, Juice, Milk, Tea, Biscuits, Newspaper, Sugar
 Assume that the shopkeeper keeps records of what each customer purchases. Such
records of ten customers are given in the below table:
 The shopkeeper wants to find which products are sold together frequently. If for
example, sugar and tea are the two items that are sold together frequently.
 Assume that the number of items the shop stocks is n (n = 9 for this shop) and that these
items are represented by the item set I = {i1, i2, …, in}.
 Also assume that there are N transactions (N = 10), denoted T = {t1, t2, …, tN}, each with a
unique identifier (TID) and each specifying a subset of items from the item set I
purchased by one customer.
 Let each transaction consist of m items {i1, i2, …, im} with m ≤ n.
Define Lift. (2m)
 Association rules are often written as X → Y, meaning that whenever X appears, Y
also tends to appear. X and Y may be single items or sets of items. X is referred to as
the antecedent and Y as the consequent.
 X →Y indicates only that X & Y have been found together frequently in the given
data and does not imply that buying of X by a customer causes him/her to buy Y.
 Suppose items X and Y appear together in only 10% of the transaction but whenever
X appears there is an 80% chance that Y also appears.
 The 10% presence of X and Y together is called the support (or prevalence) of the
rule and the 80% chance is called the confidence (or predictability) of the rule.
 A high level of support indicates that the rule is frequent enough for the business. A
high level of confidence shows that the rule is true often enough to justify a decision
based on it.
 Let the total number of transactions be N. Then,
Support(X) = (Number of times X appears) / N = P(X)
Support(XY) = (Number of times X and Y appear together) / N = P(X ∪ Y)
where P(X) is used to mean the probability of X in the database.
 Confidence for X → Y is defined as the ratio of the support for X and Y together to the
support for X; if X appears much more frequently than X and Y appear together, the
confidence will be low.
Confidence(X → Y) = Support(XY) / Support(X) = P(X ∪ Y) / P(X) = P(Y|X)
 The term "lift" is used to measure the power of association between items that are
purchased together. Lift essentially indicates how much more likely an item Y is to be
purchased if the customer has bought an item X that has been identified as having an
association with Y, compared to the likelihood of Y being purchased without X being
purchased.
Lift(X, Y) = P(Y|X) / P(Y) = Confidence(X → Y) / Support(Y)
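 These measures can be computed directly by counting, as in the following small Python
sketch (not from the textbook; the four transactions are the bread/cheese/juice/milk example
used in the next section):

    # Support, confidence and lift for a rule X -> Y, computed from a transaction list.
    def support(itemset, transactions):
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    def confidence(x, y, transactions):
        return support(x | y, transactions) / support(x, transactions)

    def lift(x, y, transactions):
        return confidence(x, y, transactions) / support(y, transactions)

    transactions = [{"bread", "cheese"}, {"bread", "cheese", "juice"},
                    {"bread", "milk"}, {"cheese", "juice", "milk"}]
    x, y = {"juice"}, {"cheese"}
    print(support(x | y, transactions))    # 0.5
    print(confidence(x, y, transactions))  # 1.0
    print(lift(x, y, transactions))        # 1.0 / 0.75, about 1.33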

TASK AND A NAIVE ALGORITHM
Explain NAIVE algorithm with an example. (5m/10m)
Explain briefly about the improved Naïve algorithm. (5m)
 The naive algorithm is a procedure to discover all association rules which have at least
p% support and at least q% confidence, such that all rules satisfying these constraints
are found efficiently.
EXAMPLE
 Let us consider the following table. The shopkeeper is interested in finding
association rules with a minimum support of 50% and minimum confidence of 75%.
Transaction ID    Items
100               Bread, Cheese
200               Bread, Cheese, Juice
300               Bread, Milk
400               Cheese, Juice, Milk
 The basic idea of the algorithm is:
If we can list all the combinations of the items that we have in stock and find which
of these combinations are frequent, then we can find the association rules that have the
required 'confidence' from these frequent combinations.
 The four items, all the combinations of these four items and their frequencies of
occurrence in the transaction database are shown below:
Item sets Frequency
Bread 3
Cheese 3
Juice 2
Milk 2
(Bread, Cheese) 2
(Bread, Juice) 1
(Bread, Milk) 1
(Cheese, Juice) 2
(Cheese, Milk) 1
(Juice, Milk) 1
(Bread, Cheese, Juice) 1
(Bread, Cheese, Milk) 0
(Bread, Juice, Milk) 0
(Cheese, Juice, Milk) 1
(Bread, Cheese, Juice, Milk) 0
 The required minimum support is 50%. The following item sets are available from the
table:
Item sets             Frequency
Bread 3
Cheese 3
Juice 2
Milk 2
(Bread, Cheese) 2
(Cheese, juice) 2

13
 We proceed to determine if the two 2-itemsets (bread, cheese) and (cheese, juice) lead
to association rules with required confidence of 75%.
 Every 2-itemset (A,B) can lead to two rules A->B & B->A, if both satisfy the required
confidence.
CONFIDENCE OF (A-->B) = SUPPORT (AB) / SUPPORT (A)
Bread-->Cheese; confidence = 2/3 = 67%
Cheese-->Bread; confidence = 2/3 = 67%
Cheese-->Juice; confidence = 2/3 = 67%
Juice-->Cheese; confidence = 2/2 = 100%
 Therefore juice-->cheese alone has confidence above minimum 75% and qualifies.
 The algorithm works well with four items, but if the number of items increases the
number of combinations becomes very large. So the naïve algorithm must be improved
to deal more effectively with larger data sets.
IMPROVED NAÏVE ALGORITHM
 The improved algorithm counts only the combinations that actually occur in the transactions,
rather than all possible item combinations (i.e., we do not count item sets with 0 frequency).
Transaction ID    Items                   Combinations
100               Bread, Cheese           {Bread, Cheese}
200               Bread, Cheese, Juice    {Bread, Cheese} {Bread, Juice} {Cheese, Juice}
                                          {Bread, Cheese, Juice}
300               Bread, Milk             {Bread, Milk}
400               Cheese, Juice, Milk     {Cheese, Juice} {Cheese, Milk} {Juice, Milk}
                                          {Cheese, Juice, Milk}
 The following are the combinations and their frequencies according to the improved naïve
algorithm:
Itemsets Frequency
Bread 3
Cheese 3
Juice 2
Milk 2
(Bread, Cheese) 2
(Bread, Juice) 1
(Bread, Milk) 1
(Cheese, Juice) 2
(Cheese, Milk) 1
(Juice, Milk) 1
(Bread, Cheese, Juice) 1
(Cheese, Juice, Milk) 1
 Now the same naïve algorithm is used to find the support and confidence.
 Here the list of item combinations is significantly reduced (from 15 to 12), and the
reduction will be much larger for bigger problems; a small counting sketch is given below.
 Therefore, the naïve algorithm, or any improvement of it, is still not suitable for large
problems.
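 A minimal Python sketch of this improved counting (illustrative only; the minimum support
of 2 corresponds to the 50% threshold used in the example above):

    # Count only the item combinations that actually occur in some transaction,
    # instead of enumerating all 2^n - 1 possible item sets.
    from itertools import combinations
    from collections import Counter

    def count_occurring_itemsets(transactions):
        counts = Counter()
        for t in transactions:
            for size in range(1, len(t) + 1):
                for combo in combinations(sorted(t), size):
                    counts[combo] += 1
        return counts

    transactions = [{"bread", "cheese"}, {"bread", "cheese", "juice"},
                    {"bread", "milk"}, {"cheese", "juice", "milk"}]
    counts = count_occurring_itemsets(transactions)
    min_support = 2    # 50% of 4 transactions
    frequent = {items: c for items, c in counts.items() if c >= min_support}
    # frequent contains the four single items plus ('bread', 'cheese') and ('cheese', 'juice')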

APRIORI ALGORITHM
List the two parts/stages of the Apriori algorithm. (2m)
Define closed & maximal Itemsets. (2m)
Explain in detail about the Apriori algorithm with an example. (10m)
 Apriori is a seminal algorithm for finding the association rules and was proposed in
1994.
 The name of the algorithm is based on the fact that the algorithm uses “prior
knowledge” of frequent Itemsets properties.
 This algorithm consists of two parts:
1. The item sets that exceed the minimum support requirements are found. Such
Itemsets are called ‘FREQUENT ITEMSET’.
2. Association rules that meet the minimum confidence requirement are found from
the frequent Itemsets.
FIRST PART
Frequent Itemsets:
 Apriori employs an iterative approach called LEVEL-WISE SEARCH, where K-
Itemsets are used to explore (k+1) Itemsets.
 First, the set of frequent 1-itemsets is found, by scanning the database to accumulate
the count for each item and collecting those items that satisfy minimum support. The
resulting set is denoted L1.
 Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3
and so on, until no more frequent K-Itemsets can be found. The finding of each Lk
requires one full scan of the database.
 To improve the efficiency of the level-wise generation of frequent items, an important
property called APRIORI PROPERTY is used.
 APRIORI PROPERTY: "All non-empty subsets of a frequent itemset must be
frequent". It is based on the following observations:
1. If an itemset I does not satisfy the minimum support threshold min_sup, then I
is not frequent,
i.e., P(I) < min_sup.
2. If an item A is added to the itemset I, then the resulting itemset cannot occur
more frequently than I, so if I is not frequent neither is I ∪ A,
i.e., P(I ∪ A) < min_sup.
 The Apriori property is used in the algorithm in a two-step process, as follows:
1. THE JOIN STEP (Ck = Lk-1 ⋈ Lk-1)
To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself. This
set of candidates is denoted by Ck. Two (k-1)-itemsets l1 and l2 of Lk-1 are joined if
their first k-2 items are identical and their last items differ:
(l1[1] = l2[1]) ^ (l1[2] = l2[2]) ^ … ^ (l1[k-2] = l2[k-2]) ^ (l1[k-1] < l2[k-1])
2. THE PRUNE STEP:
It compares the candidate itemsets with the minimum threshold value.
Ck is a superset of Lk, i.e., its members may or may not be frequent, but all of the frequent
k-itemsets are included in Ck.
A scan of the database to determine the count of each candidate in Ck would result in
the finding of Lk.
Ck can be huge and so this can involve heavy computation. To reduce the size of Ck, the
Apriori property is used: any candidate with a (k-1)-subset that is not frequent is removed.
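The join and prune steps can be sketched in Python as follows (a simplified, illustrative
version: the join here takes unions of size k and relies on the prune step to discard invalid
candidates, rather than matching the first k-2 items explicitly; the L2 used is that of the
example which follows):

    # Apriori candidate generation: join L(k-1) with itself, then prune candidates
    # that have a (k-1)-subset which is not frequent (the Apriori property).
    from itertools import combinations

    def apriori_gen(prev_frequent, k):
        candidates = set()
        for a in prev_frequent:              # join step
            for b in prev_frequent:
                union = a | b
                if len(union) == k:
                    candidates.add(union)
        pruned = set()
        for c in candidates:                 # prune step
            if all(frozenset(s) in prev_frequent for s in combinations(c, k - 1)):
                pruned.add(c)
        return pruned

    L2 = {frozenset(x) for x in [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
                                 ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]}
    print(apriori_gen(L2, 3))   # keeps {I1,I2,I3} and {I1,I2,I5}; all other 3-itemsets are pruned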

EXAMPLE
TID       LIST OF ITEMS
T100      I1, I2, I5
T200      I2, I4
T300      I2, I3
T400      I1, I2, I4
T500      I1, I3
T600      I2, I3
T700      I1, I3
T800      I1, I2, I3, I5
T900      I1, I2, I3
 There are nine transactions in the database, i.e., |D| = 9. The Apriori algorithm for
finding the frequent itemsets in D proceeds as follows:
1. In the first iteration, each item is a member of the set of candidate 1-itemsets,
C1. Simply scan all transactions to count the number of occurrences of each item.
2. Suppose min_sup = 2. The set of frequent 1-itemsets, L1, can then be determined. It
consists of the candidate 1-itemsets satisfying minimum support.
STEP 1: SCAN THE WHOLE DATABASE AND COUNT EACH INDIVIDUAL ITEM
C1 (scan D for the count of each candidate)      L1 (min_sup = 2)
ITEMS     Support count                          ITEMS     Support count
{I1}      6                                      {I1}      6
{I2}      7                                      {I2}      7
{I3}      6                                      {I3}      6
{I4}      2                                      {I4}      2
{I5}      2                                      {I5}      2

3. To discover the set of frequent 2-itemsets L2, the algorithm uses the join L1 ⋈ L1
to generate a candidate set of 2-itemsets, C2.
4. Next, the transactions in D are scanned and the support count of each candidate itemset
in C2 is accumulated.
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate
2-itemsets in C2 having minimum support.
6. Next is the generation of the set of candidate 3-itemsets, C3, from the join step
C3 = L2 ⋈ L2:
C3 = {{I1,I2,I3}, {I1,I2,I5}, {I1,I3,I5}, {I2,I3,I4}, {I2,I3,I5}, {I2,I4,I5}}
Using the Apriori property, the last four candidates are removed from C3 because each
contains a 2-item subset that is not frequent.
7. The transactions in D are scanned in order to determine L3, consisting of those
candidate 3-itemsets in C3 having minimum support.
8. The algorithm uses L3 ⋈ L3 to generate a candidate set of 4-itemsets, C4.

STEP 2: C2 = L1 ⋈ L1, then L2
C2 (candidates)    C2 with support counts     L2 (min_sup = 2)
{I1,I2}            {I1,I2}   4                {I1,I2}   4
{I1,I3}            {I1,I3}   4                {I1,I3}   4
{I1,I4}            {I1,I4}   1                {I1,I5}   2
{I1,I5}            {I1,I5}   2                {I2,I3}   4
{I2,I3}            {I2,I3}   4                {I2,I4}   2
{I2,I4}            {I2,I4}   2                {I2,I5}   2
{I2,I5}            {I2,I5}   2
{I3,I4}            {I3,I4}   0
{I3,I5}            {I3,I5}   1
{I4,I5}            {I4,I5}   0

STEP 3: C3 = L2 ⋈ L2 (after pruning), then L3
C3 (candidates)    C3 with support counts     L3 (min_sup = 2)
{I1,I2,I3}         {I1,I2,I3}   2             {I1,I2,I3}   2
{I1,I2,I5}         {I1,I2,I5}   2             {I1,I2,I5}   2
STEP 4: C4 = L3 ⋈ L3
C4 (candidate)        Support count
{I1,I2,I3,I5}         1

9. Although the join results in {I1, I2, I3, I5}, this itemset is pruned because its
subset {I2, I3, I5} is not frequent.
10. Thus C4 is empty and the algorithm terminates, having found all of the frequent itemsets.

SECOND PART
GENERATING ASSOCIATION RULES FROM FREQUENT ITEMSETS
 We need to generate strong association rules from the frequent itemsets (where strong
association rules satisfy both minimum support and minimum confidence).
 This can be done using:
CONFIDENCE (A → B) = P(B|A) = SUPPORT_COUNT(A ∪ B) / SUPPORT_COUNT(A)
where support_count(A ∪ B) = number of transactions containing the itemset A ∪ B, and
support_count(A) = number of transactions containing the itemset A.

 Based on this, association rules can be generated as follows:
1. For each frequent itemset L, generate all non-empty proper subsets of L.
2. For every non-empty subset S of L, output the rule "S → (L - S)"
if support_count(L) / support_count(S) ≥ min_conf,
where min_conf is the minimum confidence threshold.
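This rule-generation procedure can be sketched in Python as follows (illustrative only;
the support counts are those of the example database above):

    # Generate rules S -> (L - S) from a frequent itemset L, keeping those whose
    # confidence support_count(L) / support_count(S) reaches min_conf.
    from itertools import combinations

    def rules_from_itemset(L, support_counts, min_conf):
        L = frozenset(L)
        rules = []
        for size in range(1, len(L)):
            for s in combinations(L, size):
                s = frozenset(s)
                conf = support_counts[L] / support_counts[s]
                if conf >= min_conf:
                    rules.append((set(s), set(L - s), conf))
        return rules

    support_counts = {frozenset(x): c for x, c in [
        (("I1",), 6), (("I2",), 7), (("I5",), 2),
        (("I1", "I2"), 4), (("I1", "I5"), 2), (("I2", "I5"), 2),
        (("I1", "I2", "I5"), 2)]}
    for antecedent, consequent, conf in rules_from_itemset(("I1", "I2", "I5"),
                                                           support_counts, 0.75):
        print(antecedent, "->", consequent, round(conf, 2))
    # Only {I1,I5} -> {I2}, {I2,I5} -> {I1} and {I5} -> {I1,I2} reach the 75% threshold.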
EXAMPLE
1. Frequent set {I1, I2, I5}, which can be separated into rules in all possible ways as follows.
Set the minimum confidence threshold = 75%.
ITEMS        RULE               VALUE    CONFIDENCE
{I1,I2}      I1 ∪ I2 → I5       2/4      50%
{I1,I5}      I1 ∪ I5 → I2       2/2      100%
{I2,I5}      I2 ∪ I5 → I1       2/2      100%
{I1}         I1 → I2 ∪ I5       2/6      33%
{I2}         I2 → I1 ∪ I5       2/7      28%
{I5}         I5 → I1 ∪ I2       2/2      100%

2. Frequent set {I1, I2, I3}, which can be separated into rules in all possible ways as follows.
Set the minimum confidence threshold = 50%.
ITEMS        RULE               VALUE    CONFIDENCE
{I1,I2}      I1 ∪ I2 → I3       2/4      50%
{I1,I3}      I1 ∪ I3 → I2       2/4      50%
{I2,I3}      I2 ∪ I3 → I1       2/4      50%
{I1}         I1 → I2 ∪ I3       2/6      33%
{I2}         I2 → I1 ∪ I3       2/7      28%
{I3}         I3 → I1 ∪ I2       2/6      33%
FURTHER TERMINOLOGY:
1. A frequent CLOSED itemset is a frequent itemset X such that there exists no proper
superset of X with the same support count as X.
2. A frequent itemset Y is MAXIMAL if it is not a proper subset of any other
frequent itemset. A maximal frequent itemset is a closed itemset, but a closed itemset is not
necessarily maximal.
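These two definitions can be checked directly from the supports, as in the small sketch below
(illustrative only; the frequent itemsets listed are a hypothetical subset chosen to show the
distinction):

    # Given all frequent itemsets with their support counts, mark the closed and
    # maximal ones: closed = no proper superset with the same support,
    # maximal = no frequent proper superset at all.
    def closed_and_maximal(frequent):
        closed, maximal = set(), set()
        for x, sup in frequent.items():
            supersets = [y for y in frequent if x < y]
            if not any(frequent[y] == sup for y in supersets):
                closed.add(x)
            if not supersets:
                maximal.add(x)
        return closed, maximal

    frequent = {frozenset({"I1"}): 6, frozenset({"I2"}): 7,
                frozenset({"I1", "I2"}): 4, frozenset({"I1", "I2", "I5"}): 2}
    closed, maximal = closed_and_maximal(frequent)
    # {I1,I2,I5} is maximal (and therefore closed); the other three are closed but not maximal.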
IMPROVING THE EFFICIENCY OF THE APRIORI ALGORITHM
Explain about improving the efficiency of Apriori algorithm. (5m)
 The Apriori algorithm is resource intensive for a large set of transactions that has a large
number of frequent items. The major reasons for this are:
The number of candidate itemsets grows quickly and can result in huge candidate
sets. The larger the candidate set, the higher the processing cost of scanning the database
to find the frequent itemsets.

Since the early sets of candidate itemsets are often very large, the initial iterations
dominate the cost. In later stages, the number of candidate sets that need to be checked
usually goes down.
Apriori requires many scans of the database. If n is the length of the longest
Itemsets, then (n+1) scans are required.
Many trivial rules are derived and it can often be difficult to extract the most
interesting rules from all the rules derived.
Some rules can be inexplicable and very fine grained. Ex: the toothbrush was the most
frequently purchased item on Thursday mornings.
Redundant rules are generated. Ex: if A→B is a rule, then AC→B is redundant.
Apriori assumes sparsity since the number of items in each transaction is small
compared with the total number of items.
 A number of techniques for improving the performance of Apriori algorithm have
been suggested. They can be classified into 4 categories:
Reduce the number of candidate Itemsets.
Reduce the number of transactions.
Reduce the number of comparisons.
Reduce candidate sets efficiently.
 Some algorithms that use one or more of the above approaches are:
Apriori –TID.
Direct Hashing and pruning(DHP)
Dynamic Itemsets Counting (DIC)
Frequent Pattern Growth
EXAMPLE FOR HASHING: min_support = 2

Index    Bucket count    2-itemset hashed to the bucket
0        4               {I1,I2}
1        4               {I1,I3}
2        1               {I1,I4}
3        2               {I1,I5}
4        4               {I2,I3}
5        2               {I2,I4}
6        2               {I2,I5}
7        0               {I3,I4}
8        1               {I3,I5}
9        0               {I4,I5}
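 The direct hashing idea behind this table can be sketched in Python as follows (illustrative
only; the number of buckets and the hash function are hypothetical choices):

    # While scanning the database for 1-itemset counts, hash every 2-itemset occurring
    # in each transaction into a bucket. A 2-itemset can only be frequent if its
    # bucket count reaches min_support, so many candidate pairs are pruned early.
    from itertools import combinations

    def bucket_counts(transactions, n_buckets=10):
        counts = [0] * n_buckets
        for t in transactions:
            for pair in combinations(sorted(t), 2):
                counts[hash(pair) % n_buckets] += 1
        return counts

    transactions = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
                    {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
                    {"I1", "I2", "I3"}]
    counts = bucket_counts(transactions)
    min_support = 2
    # A pair is kept as a candidate for the second scan only if
    # counts[hash(pair) % 10] >= min_support.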

MINING FREQUENT PATTERNS WITHOUT CANDIDATE GENERATION (FP-GROWTH)
Mention the advantage of FP-Tree approach. (2m)
Explain FP-growth algorithm with an example. (5m/10m)
How will you generate FP-Tree? Explain with an example. (5m)

 This algorithm differs from the other algorithms: it does not generate candidates, it only
tests. Apriori, in contrast, generates the candidate itemsets and then tests them.
 The motivation for the FP-tree method is as follows:
 Motivation for FP-tree method is as follows:
 Only the frequent items are needed to find association rules, so it is best to
find the frequent items and ignore the others.
 If the frequent items can be stored in a compact structure, then the original
transaction database does not need to be used repeatedly.
 If multiple transactions share a set of frequent items, it may be possible to
merge the shared sets with the number of occurrences registered as count.
GENERATING THE FP-TREE
1. Scan the transaction database once to find all frequent items and their support.
2. Sort the frequent items in descending order of their support.
3. Initially, start creating the FP-tree with a root 'null'.
4. Get the first transaction from the transaction database. Remove all non-frequent items and
list the remaining items according to the order in the sorted frequent items.
5. Use the transaction to construct the first branch of the tree, with each node corresponding to
a frequent item and showing that item's frequency, which is 1 for the first transaction.
6. Get the next transaction from the transaction database. Remove all non-frequent items and list
the remaining items according to the order in the sorted frequent items.
7. Insert the transaction in the tree using any common prefix that may appear, increasing the
item counts.
8. Continue with step 6 until all transactions in the database are processed.

EXAMPLE 1:
 Let support be 50% and confidence be 75%
TID     ITEMS                          ITEM       FREQUENCY
100     BREAD, CHEESE, EGG, JUICE      BREAD      4
200     BREAD, CHEESE, JUICE           JUICE      4
300     BREAD, MILK, YOGURT            CHEESE     3
400     BREAD, JUICE, MILK             MILK       3
500     CHEESE, JUICE, MILK

 Remove the items that are not frequent from each transaction and order the remaining items
according to their frequency.
TID     ITEMS
100     BREAD, JUICE, CHEESE
200     BREAD, JUICE, CHEESE
300     BREAD, MILK
400     BREAD, JUICE, MILK
500     JUICE, CHEESE, MILK

 An FP-tree for these transactions can then be constructed; its structure is described below.


 An FP-tree consists of nodes. A node contains 3 fields: an item name, a count and a
node link.
 Count: the number of occurrences the path has in the transaction database.
 Node link: a link to the next node containing the same item name, or a null pointer if the
node is the last one with this name.
 Nodes near the root of the tree are more frequent than those further down the tree. The FP-
tree also contains a header table with an entry for each item and a link to the first node
in the tree with the same name.
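 The construction steps listed above can be sketched in Python as follows (a simplified,
illustrative version; the node links back to the header table are omitted for brevity):

    # Build an FP-tree: keep only frequent items, order each transaction by descending
    # support, and insert it along a shared prefix, incrementing node counts.
    from collections import Counter

    class Node:
        def __init__(self, item, parent):
            self.item, self.count, self.parent, self.children = item, 1, parent, {}

    def build_fp_tree(transactions, min_count):
        support = Counter(item for t in transactions for item in t)
        frequent = {i for i, c in support.items() if c >= min_count}
        root = Node(None, None)
        for t in transactions:
            items = sorted((i for i in t if i in frequent),
                           key=lambda i: (-support[i], i))
            node = root
            for item in items:
                if item in node.children:
                    node.children[item].count += 1
                else:
                    node.children[item] = Node(item, node)
                node = node.children[item]
        return root

    transactions = [{"bread", "cheese", "egg", "juice"}, {"bread", "cheese", "juice"},
                    {"bread", "milk", "yogurt"}, {"bread", "juice", "milk"},
                    {"cheese", "juice", "milk"}]
    tree = build_fp_tree(transactions, min_count=3)   # 50% support of 5 transactions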

MINING THE FP-TREE FOR FREQUENT ITEMS


 Mining on FP tree structure is done using an algorithm called frequent pattern
growth(FP-Growth).
 This algorithm starts with the least frequent item, i.e., the last item in the header table. Then it
finds all paths from the root to this item and adjusts the counts according to the item's support
count.
 Consider the FP-tree. We start with item M and find these patterns:
BM(1) BJM(1) JCM(1)
 No frequent itemset is discovered, since no itemset appears 3 times. Next look at C
and find these:
BJC (2) JC (1)
 These two patterns give us a frequent itemset JC (3). Looking at J we obtain:
BJ (3) J (1)
 We obtain a frequent itemset, BJ (3). There is no need to follow links from item B as
there are no other frequent itemsets.
 The process above may be represented by "conditional" FP-trees for M, C and J (figures omitted).

EXAMPLE 2:
Itemset    Support count        Itemset (descending order)    Support count    Node link
{I1}       6                    {I2}                          7
{I2}       7                    {I1}                          6
{I3}       6                    {I3}                          6
{I4}       2                    {I4}                          2
{I5}       2                    {I5}                          2

TID      ITEMSET            ITEMS IN DESCENDING ORDER
T100     I1, I2, I5         I2, I1, I5
T200     I2, I4             I2, I4
T300     I2, I3             I2, I3
T400     I1, I2, I4         I2, I1, I4
T500     I1, I3             I1, I3
T600     I2, I3             I2, I3
T700     I1, I3             I1, I3
T800     I1, I2, I3, I5     I2, I1, I3, I5
T900     I1, I2, I3         I2, I1, I3

ITEM   CONDITIONAL PATTERN BASE        CONDITIONAL FP-TREE   FREQUENT PATTERNS GENERATED
I5     {I2, I1: 1}, {I2, I1, I3: 1}    <I2: 2, I1: 2>        {I2, I5: 2}, {I1, I5: 2}, {I2, I1, I5: 2}
I4     {I2, I1: 1}, {I2: 1}            <I2: 2>               {I2, I4: 2}

ADVANTAGES OF FP-TREE APPROACH


 It avoids scanning the database more than twice to find the support counts.
 It completely eliminates the costly candidate generation.
 FP-growth algorithm is better than Apriori, when the transaction database is huge and
min_sup count is low.
 FP-growth uses a more efficient structure to mine pattern when the database grows.
FURTHER IMPROVEMENTS
 This algorithm works best if the FP-tree fits in main memory, which is not always
possible in the case of large databases.
 The solutions to this problem are:
 Divide the database and then construct an FP-tree for each of the partitioned databases.
 Construct the FP-tree on disk.

PERFORMANCE EVALUATION OF ALGORITHM

Write about the performance of various algorithm of association rule mining. (5m)

 Performance evaluation has been carried out on a number of implementations of
different association rules mining algorithms.
 In one study that compared a number of methods including Apriori, CHARM and FP-
growth methods using real world data and artificial data, it was concluded that:
 FP-growth method was better than Apriori algorithm

 CHARM was better than Apriori and in some cases, CHARM was better than
FP-growth method
 Apriori was generally better than the other algorithms if the support was high,
since high support leads to a smaller number of frequent items, which suits
the Apriori algorithm.
 At very low support, the number of frequent items became large and none of the
algorithms was able to handle large frequent sets.
 In a 2003 performance evaluation of programs, it was found that two algorithms were best.
They are:
 An efficient implementation of the FP-growth algorithm.
 An algorithm that combined a number of algorithms using multiple heuristics.
 The performance evaluation in 2004 found an implementation of an algorithm that
involves a novel tree traversal as the most efficient algorithm for finding frequent,
frequent closed and maximal frequent Itemsets.

UNIT-1 COMPLETED

FAQ
Two marks:

1. Define data mining.


2. What is association rules mining?
3. List any four data mining products.
4. Define confidence.
5. What is the aim of association rules mining?
6. What is the use of data mining?
7. list any four data mining software.

Five marks:

1. Explain about data mining applications.


2. How can the efficiency of the Apriori algorithm be improved? Explain.
3. Describe about Naïve algorithm.
4. Write a note on data mining techniques.
5. Explain about performance Evaluation of Algorithms.
6. Write short notes on Any three data mining Techniques

Ten marks:

1. Discuss about ‘Apriori algorithm’.


2. Explain in detail about Mining Frequent Patterns without candidate generation.

UNIT-II
CLASSIFICATION-INTRODUCTION
Define classification with its types. (2m)
What is supervised learning? (2m)
 “Classification” is the separation or ordering of objects (or things) into classes. If the
classes are created without looking at the data, the classification is called “Apriori
Classification”.
 If the classes are created by looking at the data, the classification is called “Posteriori
Classification”.
 Classification is based on two things
1. Learning
2. Decision Tree
Learning:
 Mostly it is assumed that the classes have been defined a priori, and classification then
consists of training the system so that when a new object is presented to the trained
system it is able to assign the object to one of the existing classes. This approach is
called "Supervised learning".
 A classification process in which classes have been pre-defined needs a method that
will train the classification system to allocate object to the classes.
 The training is based on a training sample, a set of sample data where for each sample
the class is already known.
 We assume each object to have a number of attributes, one of which tells us which
class the object belongs to.
 This attribute is known for the training data, but for data other than training data (i.e.,
"test data") we assume that the value of the attribute is unknown and is to be
determined by the classification method.
 This attribute may be considered as the output of all the other attributes and is referred to as
the 'output attribute' or 'dependent attribute'.
 Attributes other than the output attribute are called 'input attributes' or 'independent
attributes'. Attributes whose domain is not numerical are called 'categorical attributes'.
 Categorical attributes may be ordered (ex: a student's grade) or unordered (ex: gender).
Classification has many applications, ex: prediction of customer behaviour and identification
of fraud.
 In assessing credit card applicants, a credit card company may already have sample
data based on past applicants and knowledge about the applicants that were good
credit risks and those that were not.
 A classification method may use the sample to derive a set of rules for allocating new
applications to either of the two classes.
[Figure: the classification process. Training data (customer name, age, salary, credit rating) is
fed to a classification algorithm, which produces classification rules such as
"IF age = 31..40 AND salary = high THEN credit rating = excellent".
Sample training data: Kalai (age <= 30, salary low, rating fair); Kavi (age > 30, salary high,
rating excellent).]

 In the classification techniques, testing and validation play an important role.
Supervised learning requires training data while testing requires additional data which
also has been pre-classified.
 Classifying the test data and comparing the results with the known results can then
determine the accuracy.
 The number of cases classified correctly provides us with an estimate of the accuracy
of the model.
 Our aim is to find highly accurate models that are easy to understand and which are
efficient when dealing with large data-sets.

Decision tree:

Briefly explain about decision trees with an example (5m)


 It is a popular classification method that results in a flow-chart-like tree structure,
where each node denotes a test on an attribute's value and each branch represents an
outcome of the test.
 Tree leaves represent the classes. Consider this example: let us classify animals using
some training data given below. We want to build a model based on this data.
Name Eggs Pouch Flies Feather Class
Cockatoo Yes No Yes Yes Bird
Dugong No No No No Mammal
Echidna Yes Yes No No Marsupial
Emu Yes No No Yes Bird
Kangaroo No Yes No No Marsupial
Koala No Yes No No Marsupial
Kookaburra Yes No Yes Yes Bird
Owl Yes No Yes Yes Bird
Penguin Yes No No Yes Bird
Platypus Yes No No No Mammal
Possum No Yes No No Marsupial
Wombat No Yes No No Marsupial
 The following decision tree classifies the data in the above table:

                    Pouch?
                   /      \
                 Yes       No
                  |         |
              Marsupial  Feather?
                         /      \
                       Yes       No
                        |         |
                       Bird     Mammal
 Here, each node is a decision while the leaves are the classes. Decision tree is a model
that is both predictive and descriptive.
 A decision tree is a tree that displays relationships found in the training data. The tree
consists of zero or more internal nodes & one or more leaf nodes with each internal
node being a decision node having two or more child nodes.
 Using the training data, the decision tree method generates a tree that consists of
nodes that are rules, to determine the class of an object after the training is completed.
 The training process that generates the tree is called induction.
 The decision tree technique is popular, since the rules generated are easy to describe
& understand, the technique is fast unless the data is very large and there is a variety
of software available.
 Also, this approach does not make any assumptions about the underlying probability
distributions of the attributes values.
 Generally, the number of training samples required is likely to be relatively small if
the number of independent attributes is small and the number of training samples
required is likely to be large when the number of attributes is large.
 The complexity of a decision tree increases as the number of attributes increases,
although in some situations it has been found that only a small number of attributes
can determine the class to which an object belongs and the rest of the attributes have
little or no impact.
 Quality of training data plays an important role in determining the quality of the
decision tree.
 If there are a number of classes, then there should normally be sufficient training data
available that belongs to each of the classes.
 Classification accuracy determined using test data is obviously a good measure but
other measures may be used.
 These include average cost and worst case cost of classifying an object.
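 As an illustration of how such a tree can be induced from the training data, here is a small
sketch (assuming the scikit-learn library is available; attribute values are encoded as 1 = yes,
0 = no, and the library may choose a different but equivalent split order from the tree shown
above):

    from sklearn.tree import DecisionTreeClassifier, export_text

    #    eggs, pouch, flies, feathers            (rows follow the animal table above)
    X = [[1, 0, 1, 1], [0, 0, 0, 0], [1, 1, 0, 0], [1, 0, 0, 1], [0, 1, 0, 0], [0, 1, 0, 0],
         [1, 0, 1, 1], [1, 0, 1, 1], [1, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0]]
    y = ["Bird", "Mammal", "Marsupial", "Bird", "Marsupial", "Marsupial",
         "Bird", "Bird", "Bird", "Mammal", "Marsupial", "Marsupial"]

    clf = DecisionTreeClassifier().fit(X, y)
    print(export_text(clf, feature_names=["eggs", "pouch", "flies", "feathers"]))
    print(clf.predict([[1, 0, 0, 1]]))   # an egg-laying, feathered animal is classified as a Bird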

OVERFITTING AND PRUNING


Define Pruning & pre-pruning. (2m)
Shortly explain about over fitting & pruning. (5m)
 The decision tree building algorithm continues until either all leaf nodes are single-
class nodes or no more attributes are available for splitting a node that has objects of
more than one class.
 When the objects being classified have a large number of attributes and a tree of maximum
possible depth is built, the tree quality may not be high, since the tree is built to deal
correctly with the training set.
 In order to do so, it may become quite complex, with long and very uneven paths.
Some branches of the tree may reflect anomalies due to noise or outliers in the
training samples.
 Such decision trees are a result of overfitting the training data and may result in poor
accuracy for unseen samples.
 It is better to choose a simpler model. We can therefore "shave off" nodes and
branches of a decision tree, essentially replacing a whole subtree by a leaf node, if it
can be established that the expected error rate in the subtree is greater than that in the
single leaf.

 This makes the classifier simpler. A simpler model has less chance of introducing
inconsistencies, ambiguities and redundancies.
 Pruning is the technique used to make an overfitted decision tree simpler and more
general. There are many techniques for pruning; one approach involves
removing branches from a "fully grown" tree to obtain a sequence of progressively
pruned trees.
 The accuracy of these trees is then computed and a pruned tree that is accurate enough
and simple is selected .It is good to use a set of data different from the training data to
decide which is the ‘’best pruned tree’’.
 Another approach is called "pre-pruning", in which tree construction is halted early.
 Essentially, a node is not split if this would result in the goodness measure of the tree
falling below a threshold. It is, however, quite difficult to choose an appropriate threshold.

DECISION TREE (DT) RULES


What are the advantages in converting DT to DT Rules? (2m)
Briefly explain about decision tree rules. (5m)
 The decision tree method is a popular and relatively simple supervised classification
method that involves each node of the tree specifying a test of some attribute and each
branch from the node corresponding to one of the values of the attribute.
 Each path from the root to a leaf of the decision tree therefore consists of attribute
tests, finally reaching a leaf that describes the class.
 The popularity of the decision trees is partly due to the ease of understanding the rules
that the nodes specify.
 The rules specified by a decision tree can be used to retrieve data from a relational
database satisfying the rules using SQL.
 There are a number of advantages in converting a decision tree to rules.
 Decision rules make it easier to make pruning decisions since it is
easier to see the content of each rule.
 Also, converting to rules removes the distinction between attribute
tests that occur near the root of the tree and those that occur near the
leaves.
 Rules are easier to read than a decision tree and they are easier for
people to understand.
 IF-THEN rules may be derived based on the various paths from the root to the leaf
nodes.
 Although the simple approach will lead to as many rules as the leaf nodes, rules can
often be combined to produce a smaller set of rules.
 Ex: If Gender= “Male” then class= B
 If Gender= “Female” and Married= “Yes” then class= C, else class=A
 Once all rules have been generated, it may be possible to simplify the rules. Rules
with only one antecedent (ex: If Gender= “Male” then class= B) cannot be further
simplified, so we only consider those with two or more antecedents.

 It may be possible to eliminate unnecessary rule antecedents that have no effect on the
conclusion reached by the rule.
 Some rules may be unnecessary & these may be removed. In some cases a number of
rules that lead to the same class may be combined.

NAÏVE BAYES METHOD


Mention the advantage of using Naïve Bayes method? (2m)
Explain Naïve Bayes method with an example in detail. (5m/10m)
 This is based on the work of Thomas Bayes. Bayesian classification is quite different
from the decision tree approach.
 In this classification, we have a hypothesis that the given data belongs to a particular
class. We can then calculate the probability for the hypothesis to be true.
 The approach requires only one scan of the whole data. Some notations are defined
below:
P (A) Probability that event A will occur.
P (A|B) Probability that event A will happen, given that event B has already
happened.
Ex: A & B may be the events of passing a course A & passing another
course B.
P (X) Probability of obtaining attribute values X whatever class the object
belongs to.
 Here is the Bayes’ theorem:
P (A|B) = P (B|A) P (A) / P (B) Eq. 1
This can be derived using:
P (A|B) = P (A & B) / P (B) Eq. 2
P (B|A) = P (A & B) / P (A) Eq. 3
Dividing Eq. 2 by Eq. 3 gives us the Bayes’ theorem Eq. 1.
 If we consider X to be an object to be classified then Eq. 1 may be read as giving the
probability of it belonging to one of the classes C1, C2, C3, Etc… by calculating P
(Ci|X)
 Once these probabilities have been computed for all the classes, we simply assign X
to the class that has the highest conditional probability.
 Let us now consider how probabilities P (Ci|X) may be calculated. We have
P (Ci|X) = [P (X|Ci) P (Ci)] / P (X)
P (Ci|X) Probability of the object X belonging to class C i.
P (X|Ci) Probability of obtaining attribute values X if we know that it
belongs to class Ci.
P (Ci) Probability of any objects belonging to class C i without any other
information.
P (X) Probability of obtaining attribute values X whatever class the
object belongs to.
The probabilities we need to compute are P (X|C i), P (Ci) & P (X).

 P (X) is independent of Ci and is not required to be known since we are interested
only in comparing the probabilities P (Ci|X). So we need to find only P (X|Ci) & P (Ci)
for each class.
 To compute P (Ci), we count the number of instances of each class in the training data
and divide each by the total number of instance.
 To compute P (X|C i), we use a naïve approach by assuming that all attributes of X are
independent which is often not true.
 Using the independence of attributes assumption and based on the training data, we
compute an estimate of the probability of obtaining the data X that we have by
estimating the probability of each of the attribute values by counting the frequency of
those values for class Ci.
 We then determine the class allocation of X by computing [P (X|C i) P (Ci)] for each
of the classes & allocating X to the class with the largest values.
 In Bayesian approach, the probability of dependent attribute can be estimated by
computing estimates of the probabilities of independent attributes.
 Advantage of this method is that, it is possible to use this approach even if values of
all the independent attributes are not known since we can still estimate the
probabilities of the attribute values that we know.

Owns Home?  Married  Gender  Employed  Credit Rating  Risk Class
Yes         Yes      Male    Yes       A              B
No          No       Female  Yes       A              A
Yes         Yes      Female  Yes       B              C
Yes         No       Male    No        B              B
No          Yes      Female  Yes       B              C
No          No       Female  Yes       B              A
No          No       Male    No        B              B
Yes         No       Female  Yes       A              A
No          Yes      Female  Yes       A              C
Yes         Yes      Female  Yes       A              C
 There are 10 samples (s=10) & 3 classes:
Credit risk class A=3
Credit risk class B=3
Credit risk class C=4
 The prior probabilities are obtained by dividing these frequencies by the total number
in the training data (i.e., 10). P (A) =3/10=0.3; P (B) =3/10=0.3;
P(C) =4/10=0.4
 If the data that is presented to us is {Yes, No, Female, Yes, A} for the 5 attributes, we
can compute the posterior probability for each class as noted earlier.
EX: P (X|Ci) = P ({Yes, No, Female, Yes, A} | Ci)
= P (Owns Home = Yes | Ci) × P (Married = No | Ci) × P (Gender = Female | Ci)
× P (Employed = Yes | Ci) × P (Credit Rating = A | Ci)
 Using these, we are able to compute the 3 posterior probabilities for the 3 classes, namely that the
person with attribute values X has credit risk class A or class B or class C.

 We compute P (X|Ci) P (Ci) for each of the 3 classes given P (A) =0.3, P (B) =0.3, P(C)
=0.4 and these values are the basis for comparing the 3 classes.
 To compute P (X |Ci) =P ({Yes, No, Female, Yes, A } |Ci) for each of the classes, we
need the following probabilities for each:
P (Owns Home = Yes | Ci)
P (Married = No | Ci)
P (Gender = Female | Ci)
P (Employed = Yes | Ci)
P (Credit Rating = A | Ci)
 These probabilities are given below. We order the data by risk class to make it
convenient.
Owns Home?  Married  Gender  Employed  Credit Rating  Risk Class
No          No       Female  Yes       A              A
No          No       Female  Yes       B              A
Yes         No       Female  Yes       A              A
1/3         1        1       1         2/3            probability of having the
                                                      {Yes, No, Female, Yes, A}
                                                      attribute values given risk class A

Owns Home?  Married  Gender  Employed  Credit Rating  Risk Class
Yes         Yes      Male    Yes       A              B
Yes         No       Male    No        B              B
No          No       Male    No        B              B
2/3         2/3      0       1/3       1/3            probability of having the
                                                      {Yes, No, Female, Yes, A}
                                                      attribute values given risk class B

Owns Home?  Married  Gender  Employed  Credit Rating  Risk Class
Yes         Yes      Female  Yes       B              C
No          Yes      Female  Yes       B              C
No          Yes      Female  Yes       A              C
Yes         Yes      Female  Yes       A              C
0.5         0        1       1         0.5            probability of having the
                                                      {Yes, No, Female, Yes, A}
                                                      attribute values given risk class C
 Given these estimates of probabilities, we can find the posterior probabilities as:
P (X|A) = 1/3 × 1 × 1 × 1 × 2/3 = 2/9
P (X|B) = 0
P (X|C) = 0
 Therefore the values of P (X|Ci) P (Ci) are zero for classes B & C and
0.3 × 2/9 ≈ 0.067 for class A. So, X is assigned to class A.
 Bayes' method assumes that all attributes are independent and that the training
sample is a good sample to estimate probabilities.
 These assumptions are not always true, as attributes are often correlated, but in spite of
this, the naïve Bayes method performs reasonably well.
 Other techniques have been designed to overcome this limitation. One approach is to
use Bayesian networks that combine Bayesian reasoning with causal relationships
between attributes.
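The calculation above can be reproduced with a short program. The following is a minimal Python sketch (an illustration only, not the book's code) that counts frequencies in the 10-record training set and computes P(X|Ci) P(Ci) for each class; no smoothing is applied, so a zero frequency gives a zero posterior exactly as in the worked example.

    from collections import defaultdict

    # training records: (OwnsHome, Married, Gender, Employed, CreditRating) -> risk class
    training = [
        (("Yes", "Yes", "Male",   "Yes", "A"), "B"),
        (("No",  "No",  "Female", "Yes", "A"), "A"),
        (("Yes", "Yes", "Female", "Yes", "B"), "C"),
        (("Yes", "No",  "Male",   "No",  "B"), "B"),
        (("No",  "Yes", "Female", "Yes", "B"), "C"),
        (("No",  "No",  "Female", "Yes", "B"), "A"),
        (("No",  "No",  "Male",   "No",  "B"), "B"),
        (("Yes", "No",  "Female", "Yes", "A"), "A"),
        (("No",  "Yes", "Female", "Yes", "A"), "C"),
        (("Yes", "Yes", "Female", "Yes", "A"), "C"),
    ]

    def naive_bayes_scores(x):
        class_count = defaultdict(int)
        for _, c in training:
            class_count[c] += 1
        scores = {}
        for c, n_c in class_count.items():
            prior = n_c / len(training)                     # P(Ci)
            likelihood = 1.0
            for i, value in enumerate(x):                   # P(X|Ci), attributes assumed independent
                matches = sum(1 for row, cls in training if cls == c and row[i] == value)
                likelihood *= matches / n_c
            scores[c] = prior * likelihood                  # proportional to P(Ci|X)
        return scores

    print(naive_bayes_scores(("Yes", "No", "Female", "Yes", "A")))
    # roughly {'B': 0.0, 'A': 0.067, 'C': 0.0} -> X is assigned to class A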
ESTIMATION PREDICTIVE ACCURACY OF CLASSIFICATION METHODS

Define confusion matrix & its advantage. (2m)


What do you mean by the terms False Positive & False Negative? (2m)
Explain how you will estimate the accuracy of classification methods. (5m/10m)
 The accuracy of a classification method is the ability of the method to correctly
determine the class of a randomly selected data instance. It may be expressed as the
probability of correctly classifying unseen data.
 The accuracy estimate problem is easier when much more data is available than is
required for training the model.
 Often, when a great deal of past data is available, only a part of this data is used for
training and the rest, which does not include the training set, can be used for testing
the method.
 Different sets of training data would often lead to somewhat different models and
therefore testing is very important in determining how accurate each model is.
 When several models are built, the most accurate may then be selected. Accuracy may
be measured using a number of metrics: sensitivity, specificity, precision & accuracy.
 The models estimating errors include holdout, random sub-sampling, k-fold cross-
validation & leave one-out. Some of the methods may be used when the data
available is limited.
 Let us assume that the test data has a total of T objects. When testing a method we
find that C of the T objects is correctly classified.
 The error rate then may be defined as: Error rate = (T-C)/T
 This is only an estimate of the true error rate and it is expected to be a good estimate
if the number of test data T is large and representative of the population.
 A method of estimating the error rate is considered biased if it either tends to under
estimate the error or tends to overestimate it. So, we prefer unbiased methods.
 A “confusion matrix “is sometimes used to represent the result of testing in more
detail as shown below:
                          True class
Predicted class      1        2        3
        1            8        1        1
        2            2        9        2
        3            0        0        7

 Advantage of this is that, it not only tells us how many got misclassified but also what
misclassifications occurred. Ex: we can see that 1 out of 10 objects that belonged to
class 3 got misclassified to class 1. Using this matrix, we can define the terms "false
positive" (FP) & "false negative" (FN).
 FP cases are those that did not belong to a class but were allocated to it. Ex: there
were 2 FPs for class 1 & 4 for class 2. Class 3 has no FPs.
 FN is cases that belong to a class but were not allocated to it. Ex: there were 2
FNs for class 1 & 1 for class 2 & 3 for class 3.
 So, there were a total of 6 FPs & FNs here. We now define sensitivity and specificity.
Sensitivity= TP/ (TP+FN)
Specificity=TN/ (TN+FP)
 Where, TP (true positives) is the total number of correctly classified objects & TN (true
negatives) is the total number of objects that did not get classified to a class they did not
belong to.
For class 1, we have: TP=8, TN=18, FN=2, FP=2
For class 2, we have: TP=9, TN=16, FN=1, FP=4
For class 3, we have: TP=7, TN=20, FN=3, FP=0

 Now we can compute:
Sensitivity= 24/30= 80%
Specificity= 54/ 60= 90%
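These figures can be checked with a few lines of code. The sketch below is an illustration only (not from the text) and assumes the 3 x 3 confusion matrix above, with rows as predicted classes and columns as true classes.

    matrix = [
        [8, 1, 1],   # predicted class 1
        [2, 9, 2],   # predicted class 2
        [0, 0, 7],   # predicted class 3
    ]
    n = sum(sum(row) for row in matrix)                    # 30 test objects in total
    tp = fn = fp = tn = 0
    for i in range(3):
        row_sum = sum(matrix[i])                           # all objects predicted as class i
        col_sum = sum(matrix[j][i] for j in range(3))      # all objects truly in class i
        tp += matrix[i][i]
        fp += row_sum - matrix[i][i]                       # predicted i but not truly i
        fn += col_sum - matrix[i][i]                       # truly i but not predicted i
        tn += n - row_sum - col_sum + matrix[i][i]         # neither predicted nor truly i
    print("Sensitivity =", tp / (tp + fn))                 # 24/30 = 0.8
    print("Specificity =", tn / (tn + fp))                 # 54/60 = 0.9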
 Sometimes terms like “FALSE ACCEPTANCE RATE” (FAR) & “FALSE
REJECTION RATE” (FRR) or type 1 & type 2 errors are also used. Some methods
for estimating the accuracy of a method are given below:

HOLDOUT METHOD
Define Holdout method. (2m)
 Also called the 'test sample method', it requires a training set and a test set. The two sets are
mutually exclusive.
 It may be that only one dataset is available which has been divided into two subsets,
the training subset & the test or holdout subset.
 Once the classification method produces the model using the training set, the test set
can be used to estimate the accuracy.
 A larger training set would produce a better classifier, while a larger test set would produce a
better estimate of the accuracy. A balance must be achieved.
 Since none of the test data is used in training, the estimate is not biased, but a good
estimate is obtained only if the test set is large enough and representative of the
whole population.
Random sub-sampling method
 This is much like the holdout method except that, it does not rely on a single test set.
 The holdout estimation is repeated several times and the accuracy estimate is obtained
by computing the mean of the several trials.
 This method produce better error estimates than holdout method.
K-Fold cross-validation method:
Write the steps involved in K-Fold cross-validation method. (2m)
 In this method, the available data is randomly divided into K disjoint
subsets of approximately equal size.
 One of the subsets is then used as the test set and the remaining K-1 sets are used for
building the classifier. The test set is then used to estimate the accuracy.
 This is done repeatedly K times so that each subset is used as a test subset once. The
accuracy estimate is then the mean of the estimates for each of the classifiers.
 Cross-validation has been tested extensively and has been found to generally work well
when sufficient data is available. A value of 10 for K has been found to be adequate &
accurate.
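A minimal sketch of K-fold cross-validation is given below; the train and test_accuracy callables are hypothetical placeholders for whatever classification method is being evaluated.

    import random

    def k_fold_accuracy(data, k, train, test_accuracy, seed=0):
        data = list(data)
        random.Random(seed).shuffle(data)                # random division into K folds
        folds = [data[i::k] for i in range(k)]
        estimates = []
        for i in range(k):
            test_set = folds[i]
            training_set = [row for j, fold in enumerate(folds) if j != i for row in fold]
            model = train(training_set)                  # build the classifier on the K-1 folds
            estimates.append(test_accuracy(model, test_set))
        return sum(estimates) / k                        # mean of the K accuracy estimates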

Leave-one-out method
What is Leave-one-out method? (2m)
 This is a simpler version of K-Fold cross-validation. Here, one of the training samples
is taken out and the model is generated using the remaining training data.
 Once the model is built, the one remaining sample is used for testing and the result is
coded as 1 or 0 depending if it was classified correctly or not.
 The average of such results provides an estimate of the accuracy. This method is
useful when the dataset is small.
 For large datasets, this method is expensive since many iterations are required.
 This method is unbiased but has high variance & is therefore not particularly reliable.

Bootstrap method
Write about Bootstrap method. (2m)
 In this method, given a dataset of size n, a bootstrap sample is randomly selected
uniformly with replacement by sampling n times and used to build a model.
 It can be shown that only 63.2% of these samples are unique. The error in building the
model is estimated by using the remaining 36.8% of objects that are not in the
bootstrap sample.
 The final error is then computed as 0.632 times the testing error (on the objects not in the
bootstrap sample) plus 0.368 times the training error.
 0.632 & 0.368 are based on assumption that, if there were n samples initially from
which n samples with replacement were randomly selected for training data then the
expected % of unique samples in the training data would be 0.632 or 63.2%
 The process is repeated a number of times and the average of the error estimates is obtained.
 This method is unbiased and has low variance, but many iterations are needed for good
error estimates if the sample is small (25 or less).
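The bootstrap estimate can be sketched as follows (an illustration only); the train and error callables are hypothetical placeholders, and the weights follow the 0.632/0.368 split described above.

    import random

    def bootstrap_632_error(data, train, error, seed=0):
        rng = random.Random(seed)
        n = len(data)
        sample = [data[rng.randrange(n)] for _ in range(n)]       # n draws with replacement
        out_of_bag = [row for row in data if row not in sample]   # the ~36.8% not drawn
        model = train(sample)
        # weight the testing (out-of-bag) error by 0.632 and the training error by 0.368
        return 0.632 * error(model, out_of_bag) + 0.368 * error(model, sample)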

OTHER EVALUATION CRITERIA FOR CLASSIFICATION METHODS


List the criteria’s for evaluating classification methods. (2m)
Explain about different evaluation criteria for classification methods. (5m)
 Some criteria for evaluation of classification methods are:
 Speed
 Robustness
 Scalability
 Interpretability
 Goodness of the model
 Flexibility
 Time complexity
Speed:
 Speed involves not just the time or computation cost of constructing a model (Ex: a
decision tree). It also includes the time required to learn to use the model.
 Clearly, a user wishes to minimize both times although it has to be understood that
any data mining project will take time to plan & prepare the data.
 If the problem to be solved is large, a careful study of the methods available may need
to be carried out so that an efficient classification method may be chosen.
Robustness:
 Data errors are common, in particular when data is being collected from a number of
sources and errors may remain even after data cleaning.
 It is therefore desirable that a method be able to produce good results in spite of some
errors and missing values in datasets.
Scalability:
 Many data mining methods were originally designed for small datasets. Many have
been modified to deal with large problems.
 Given that large datasets are becoming common, it is desirable that a method
continues to work efficiently for large disk-resident databases as well.
Interpretability:
 It is important to ensure that the results of data mining are explained to the decision
makers.
 It is therefore desirable that the end-user be able to understand and gain insight from
the results produced by the classification method.

Goodness of the model:
 For a model to be effective, it needs to fit the problem that is being solved. Ex: in a
decision tree classification, it is desirable to find a decision tree of the “right” size &
compactness with high accuracy.

CLASSIFICATION SOFTWARE

Mention some software’s used for classification with its usage. (2m)
 The lists of software for classification alone are given below. It should be noted that
different software using the same technique may not produce the same results since
there are subtle differences in the techniques used.
1. C4.5, version 8 of the “classic” decision-tree tool.
2. C5.0 deals with large data sets. It constructs classifiers in the form of decision
trees or a set of if-then-else rules.
3. CART 5.0 & Tree Net are decision tree software package. They facilitate for
data pre-processing & predictive modeling.
4. DTREG generates classification trees when the classes are categorical &
regression decision trees when the classes are numerical intervals and finds the
optimal tree size.
5. Model Builder for decision trees specializes in credit card & fraud detection.
6. OCI, Oblique Classifier 1 written in ANSI C is a decision tree system that
accepts numerical attribute values.
7. Quad stone system version 5 provides data extraction, management, pre-
processing & visualization, customer profiling, etc...
8. Shih Data Miner includes Tree Builder to build decision trees using a variety
of split algorithms. The software handles missing values and provides tools for
testing including cross-validation, pruning & misclassification costs.
9. SMILES offer a quite effective handling of costs.
10. NBC: a simple Naïve Bayes classifier.

UNIT-2 Completed

FAQ
Two marks:

1. What is decision tree?


2. What is meant by Apriori classification?
3. What is idection?
4. What is meant by Pruning? Explain.
5. Write any two classification of software.

Five marks:
1. Explain any five Evaluation criteria for classification methods.
2. Write down the decision tree rules.
3. Explain about decision tree in classification.
4. Discuss about classification of software.
5. Write a note on classification.
6. What is Over fitting of a decision tree? What problems can over fitting lead to?

Ten marks:
1. Explain about naïve bayes method.
2. Discuss about Estimating Predictive Accuracy of classification method.

UNIT-3
CLUSTER ANALYSIS
Define cluster analysis. (2m)
Mention the difference between classification & cluster analysis. (2m)
 The process of grouping a set of physical or abstract objects into classes of similar
objects is called clustering.
 A cluster is a collection of data objects that are similar to one another within the same
cluster and are dissimilar to the objects in other clusters. A cluster of data objects can
be treated collectively as one group and so may be considered as a form of data
compression.
 We organize observations or objects or things into meaningful groups so that we are
able to make comments about the groups rather than individual objects.
Example: 1. “Chemical periodic table”
 Here chemical elements are grouped into rows & columns. So that elements adjacent
to each other within a group have similar physical properties.
i.e., Elements are grouped as:
a. Alkali metals e. Transition metals
b. Actinide series f. Nonmetals
c. Alkaline earth metals g. Lanthanide series
d. Other metals h. Noble gases
2. Fauna & flora
 Classification and cluster analysis (or clustering) are important techniques that partition
objects that have many attributes into meaningful disjoint subgroups so that objects
in each group are more similar to each other in the values of their attributes than they
are to objects in other groups. The major difference is as follows:
 In supervised classification, the classes are predefined, the user already knows what
classes there are & some training data that is already labeled by their class
membership is available to train or build a model.
 The classification problem then is to build a model that would be able to label or
classify newly encountered data.
 In cluster analysis, one does not know what classes or clusters exist and the problem
to be solved is to group the given data into meaningful clusters.
Example:
Owns Home? Married Gender Employed Credit Rating
Yes Yes Male Yes A
No No Female Yes A
Yes Yes Female Yes B
Yes No Male No B
No Yes Female Yes B
No No Female Yes B
No No Male No B
Yes No Female Yes A
No Yes Female Yes A
Yes Yes Female Yes A

 We have no prior information about the classes. We have the attribute values but we
do not know if there are a number of groups that might be interesting.
 Cluster analysis is useful for such analysis. A cluster analysis example may include
numerical attributes like: Age, Salary, Length of residence at current address, etc…
 Aim of Cluster Analysis is to find if data naturally falls into meaningful groups with
small within-group variation & large between-group variations.
 Cluster analysis can be defined as an optimization problem in which a given function
consisting of within cluster (intra-cluster) similarity and between cluster (inter-
cluster) dissimilarity needs to be optimized.
 Cluster analysis has applications in many areas like marketing, university, medicine,
banking & finance, real estate, etc...
 Classical cluster analysis algorithms assumed small datasets but with the advent of
computing, bar codes and the web, users wish to apply cluster analysis to large
datasets.
Advantages of such a clustering-based process:
 Adaptable to changes
 Helps single out useful features that distinguish different groups.
Desired Features of Cluster Analysis
List out the desired features of cluster analysis. (2m)
 Following list shows the desired features than an ideal cluster analysis method should
have:
1. Scalability 5. Robustness
2. Only one scan of the dataset 6. Ability of discover different cluster shapes
3. Ability to stop & resume 7. Different data types
4. Minimal input parameters 8. Result independent of data input order

1. Scalability:
 Data mining problems can be large and therefore it is desirable that a cluster
analysis method be able to deal with small as well as large problems
gracefully.
 The method should also scale well to datasets in which the number of
attributes is large.
2. Only one scan of the dataset:
 For large problems, the data must be stored on the disk and the cost of I/O
from the disk can then become significant in solving the problem.
 It is therefore desirable that a cluster analysis method not require more than
one scan of the disk resident data.
3. Ability to stop & resume:
 When the dataset is very large, cluster analysis may require considerable
processor time to complete the task.
 In such cases, it is desirable that the task be able to be stopped and then
resumed when convenient.

4. Minimal input parameters:
 The cluster analysis method should not expect too much guidance from the
user. A data mining analyst may be working with a dataset about which his/her
knowledge is limited.
 It is therefore desirable that the user not be expect to have domain knowledge
of the data and not be expected to process insight into clusters that might exist
in the data.
5. Robustness:
 Most data obtained from a variety of source has errors. It is therefore desirable
that a cluster analysis method be able to deal with noise, outliers and missing
values gracefully.
6. Ability to discover different cluster shapes:
 Clusters come in different shapes and not all clusters are spherical.
 It is therefore desirable that a cluster analysis method be able to discover
cluster shapes other than spherical. Some applications require that various
shapes be considered.
7. Different data types:
 Many problems have a mixture of data types, for example numerical,
categorical and even textual.
 It is therefore desirable that a cluster analysis method be able to deal with not
only numerical data but also Boolean and categorical data.
8. Results independent of data input order:
 It is desirable that a cluster analysis method not be sensitive to data input
order.
 Whatever the order, the result of cluster analysis of the same data should be
the same.
Types of data
List the types of data used in cluster analysis. (2m)
Write short notes on different types of data used in cluster analysis. (5m)
 We assume we have N objects that we wish to cluster. These objects are ti, i = 1, 2, …,
N. Each object has m attributes, so ti = {ti1, ti2, …, tim}.
 Example: Consider 10 objects with each object having 5 attributes. We now discuss
data types of these attributes.
 Datasets come in a number of different forms.
 The data may be quantitative, binary, nominal or ordinal.
Quantitative (or numerical) data: For example: weight, marks, height, price,
salary, and count. There are a number of methods for computing similarity
between quantitative data.
Binary data: data that takes only two values, for example yes/no. A simple
method of computing distance involves counting how many attribute values of
the two objects are different amongst the n attributes and using this as an
indication of distance.
Qualitative nominal data: similar to binary data but may take more than
two values and has no natural order. For example: religion, foods or colors. For

nominal data too, an approach similar to that suggested for computing distance
for binary data may be used.
Qualitative ordinal data: Similar to nominal data except that the data has an
order associated with it. For example: grades A, B, C and D, sizes S,M,L and
XL. The problem of measuring distance between ordinal variables is different
than for nominal variables since the order of the values is important. One
method of computing involves transferring the values to numeric values
according to their rank. Ex: Grades A, B, C, D transformed to 4.0, 3.0, 2.0 and
1.0.

COMPUTING DISTANCES
Write the properties of distance metric. (2m)
Mention the distance measures used in cluster analysis with their formulae. (2m)
Explain about computing distance in cluster analysis. (5m/10m)
 Most cluster analysis methods are based on measuring similar between objects by
computing the distance between each pair.
 There is no distance metric that one can say is the best in all applications. Distance
concept has a number of simple properties:
1. Distance is always positive.
2. Distance from point x to itself is always zero.
3. Distance from point x to point y cannot be greater than the sum of the distance
from x to some other point z and distance from z to y.
4. Distance from x to y is always the same as from y to x.
 Before the data is subject to cluster analysis, it is best to ensure that only the most
descriptive and discriminatory attributes are used in the input.
 Unit of measurement of data can significantly influence the result of analysis.
 Following distance measures assume that equal importance is being given to each
attribute.
 It is possible to assign weight to all attributes indicating their importance. Distance
measure will then use these weights in computing the total distance.
 Let the distance between two points x and y be D(x, y). We now define a number of
distance measures.
Euclidean Distance:
 Euclidean distance or the L2 norm of the difference vector is most commonly used to
compute distances and has an intuitive appeal but the largest valued attribute may
dominate the distance.
 It is therefore essential that the attributes are properly scaled.
D(x, y) = (Σi (xi − yi)²)^1/2
 This distance measure may be used without the square root if one wanted to place
greater weight on differences that are large.
 This method is more appropriate when the data is not standardized.

Manhattan Distance:
 Manhattan distance also called the L1 norm of the difference vector. Mostly, the
results obtained by this measure are similar to those obtained by using the Euclidean
distance.
 The largest-valued attributes can dominate the distance, although not as much as in
the Euclidean distance.
D(x, y) = Σi |xi − yi|
Chebychev Distance:
 This distance metric is based on the maximum attribute difference. It is also called the L∞
norm of the difference vector.
D(x, y) = Maxi |xi − yi|
Categorical Data Distance:
 This distance measure may be used if many attributes have categorical values with
only a small number of values (e.g. binary values).
 Let N be the total number of categorical attributes.
D(x, y) = (number of attributes for which xi ≠ yi) / N
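The four distance measures defined above can be written directly as small functions. The following Python sketch is an illustration only; the example values are students S1 and S2 from the K-means example in the next section.

    def euclidean(x, y):
        return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

    def manhattan(x, y):
        return sum(abs(a - b) for a, b in zip(x, y))

    def chebychev(x, y):
        return max(abs(a - b) for a, b in zip(x, y))

    def categorical(x, y):
        return sum(1 for a, b in zip(x, y) if a != b) / len(x)

    x, y = (18, 73, 75, 57), (18, 79, 85, 75)       # students S1 and S2
    print(manhattan(x, y))                          # 34, the S1-S2 distance used later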
TYPES OF CLUSTER ANALYSIS METHODS
Explain different types of cluster analysis methods. (5m)
 The cluster analysis methods may be divided into the following categories:

Cluster Analysis Methods
    Partitional Methods (e.g., K-means, Expectation Maximization)
    Hierarchical Methods (Agglomerative, Divisive)
    Density-Based Methods
    Grid-Based Methods
    Model-Based Methods
Partitional Methods:
 Partitional methods obtain a single level partition of objects. These methods usually
are based on greedy heuristics that are used iteratively to obtain a local optimum
solution.
 Given n objects, these methods make k ≤ n clusters of data and use an iterative
relocation method.
 It is assumed that each cluster has at least one object and each object belongs to only
one cluster.
 Objects may be relocated between clusters as the clusters are refined.

Hierarchical Methods:
Define Hierarchical method with its types. (2m)
 Hierarchical methods obtain a nested partition of the objects resulting in a tree of
clusters.
 These methods either start with one cluster and then split it into smaller and smaller
clusters (called divisive or top-down), or start with each object in an individual cluster
and then try to merge similar clusters into larger and larger clusters (called agglomerative or
bottom-up).
 In this approach, in contrast to partitioning, tentative clusters may be merged or split
based on some criteria.
Density-based Methods:
 In this class of methods, typically for each data point in a cluster, at least a minimum
number of points must exist within a given radius.
 Density-based methods can deal with arbitrary shape clusters since the major
requirement of such methods is that each cluster be a dense region of points
surrounded by regions of low density.
Grid-based Methods:
 In this class of methods, the object space rather than the data is divided into a grid.
Grid partitioning is based on characteristics of the data and such methods can deal
with non-numeric data more easily.
 Grid-based methods are not affected by data ordering.
Model-based Methods:
 A model is assumed, perhaps based on a probability distribution.
 Essentially the algorithm tries to build clusters with a high level of similarity within
them and a low level of similarity between them.
 Similarity measurement is based on the mean values and the algorithm tries to
minimize the squared-error function.

PARTITIONED METHODS
Explain in detail about the partitioned methods of clusters analysis with suitable
example. (10m)
 Partitional methods are computationally efficient and are more easily adapted for very
large datasets.
 The algorithms are iterative refinement methods (sometimes called hill-climbing or
greedy methods) and they converge to a local minimum rather than the global
minimum.
 These methods not only require specifying the number of clusters a priori, they also
require the user to normally specify the starting states of the clusters.
 This may be difficult for a user since the user may not have such knowledge and there
is no simple reliable method for finding initial conditions for the clusters.
 The aim of partition methods is to reduce the variance within each cluster as much as
possible and have large variance between clusters.

 Two methods of partitioning are:

K-means method.
Expectation maximization method.
K-means method

Explain k-means method with an example.(5m)


Explain centroid method with an example. (5m)
 The method is called k-means since each of the k clusters is represented by the means
of objects within it.
 It is also called the centroid method since at each step the centroid point of each
cluster is assumed to be known and each of the remaining points are allocated to the
cluster whose centroid is closest to it.
 Once this allocation is completed, the centroids of the clusters are recomputed using
simple means and the process of allocating points to each cluster is repeated until
there is no change in the clusters.
 The k-means method uses the Euclidean distance measure, which appears to work
well with compact clusters.
 A variation that uses the sum of absolute differences (the K-median method used in the
example below) can be less sensitive to outliers. The k-means method may be
described as follows:
1. Select the number of clusters. Let this number of K.
2. Pick K seeds as centroids of the K clusters. The seeds may be picked
randomly unless the user has some insight into the data.
3. Compute the Euclidean distance of each object in the dataset from each of
the centroids.
4. Allocate each object to the cluster it is nearest to, based on the distances
computed in the previous step.
5. Compute the centroids of the clusters by computing the means of the
attribute values of the objects in each cluster.
6. Check if the stopping criterion has been met. If yes, go to step 7. If not, go
to step 3.
7. [Optional] one may decide to stop at this stage or to split a cluster or
combine two clusters heuristically until a stopping criterion is met.
Algorithm:
Input:
 K: the number of clusters,
 D: a data set containing n objects.
Output:
A set of K clusters.
Method:
1. Arbitrarily choose K objects from D as the initial cluster centers;
2. Repeat
3. (Re) assign each object to the cluster to which the object is the most similar.
Based on the mean value of the objects in the cluster;

4. Update the cluster means, i.e., calculate the mean value of the objects for each
cluster;
5. Until no change;
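A minimal Python sketch of the iteration just described is given below (an illustration, not the book's code). It uses the sum of absolute differences, i.e. the K-median variation used in the worked example that follows, so its result can be compared with the tables below.

    def distance(x, y):
        return sum(abs(a - b) for a, b in zip(x, y))

    def k_means(objects, seeds, max_iterations=100):
        centroids = [list(s) for s in seeds]
        allocation = None
        for _ in range(max_iterations):
            # Steps 3-4: allocate each object to the nearest centroid
            new_allocation = [min(range(len(centroids)),
                                  key=lambda k: distance(obj, centroids[k]))
                              for obj in objects]
            if new_allocation == allocation:        # Step 6: stop when nothing changes
                break
            allocation = new_allocation
            # Step 5: recompute each centroid as the mean of its cluster
            for k in range(len(centroids)):
                members = [obj for obj, a in zip(objects, allocation) if a == k]
                if members:
                    centroids[k] = [sum(col) / len(members) for col in zip(*members)]
        return allocation, centroids

    students = [(18, 73, 75, 57), (18, 79, 85, 75), (23, 70, 70, 52), (20, 55, 55, 55),
                (22, 85, 86, 87), (19, 91, 90, 89), (20, 70, 65, 60), (21, 53, 56, 59),
                (19, 82, 82, 60), (47, 75, 76, 77)]
    print(k_means(students, seeds=students[:3])[0])
    # [0, 1, 2, 2, 1, 1, 2, 2, 0, 1] -> the clusters found in the worked example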

Advantages:
 Works well when the clusters are compact clouds that are rather well separated from
one another.
 Relatively scalable and efficient in processing.
Limitations:
 Can be applied only when the mean of a cluster is defined.
 The necessity for users to specify k in advance can be seen as a disadvantage.
 Not suitable for discovering clusters with no convex shapes or clusters of very
different size.
 Sensitive to noise and outlier data points because a small number of such data can
substantially influence the mean value.
Example:
Consider the data about students given below. The only attributes are the age and the
three marks.
Student Age Mark1 Mark2 Mark3
S1 18 73 75 57
S2 18 79 85 75
S3 23 70 70 52
S4 20 55 55 55
S5 22 85 86 87
S6 19 91 90 89
S7 20 70 65 60
S8 21 53 56 59
S9 19 82 82 60
S10 47 75 76 77

Step 1 and 2: Let the three seeds be the first three students as shown here:
Student Age Mark1 Mark2 Mark3
S1 18 73 75 57
S2 18 79 85 75
S3 23 70 70 52
Step 3 and 4: Now compute the distances using the four attributes and using the sum of
absolute differences for simplicity (i.e., using the K-median method). The distance values for
all the objects are given in the below table:
Distance from seed Ci to student Sj = |ci1 − sj1| + |ci2 − sj2| + |ci3 − sj3| + |ci4 − sj4|
(the sum of absolute differences over the four attributes)
            Age   Mark1  Mark2  Mark3   From C1  From C2  From C3   Allocation to the
                                                                    nearest cluster
C1 (seed)   18.0  73.0   75.0   57.0
C2 (seed)   18.0  79.0   85.0   75.0
C3 (seed)   23.0  70.0   70.0   52.0
S1          18.0  73.0   75.0   57.0    0.0      34.0     18.0      C1

S2 18.0 79.0 85.0 75.0 34.0 0.0 52.0 C2
S3 23.0 70.0 70.0 52.0 18.0 52.0 0.0 C3
S4 20.0 55.0 55.0 55.0 42.0 76.0 36.0 C3
S5 22.0 85.0 86.0 87.0 57.0 23.0 67.0 C2
S6 19.0 91.0 90.0 89.0 66.0 32.0 82.0 C2
S7 20.0 70.0 65.0 60.0 18.0 46.0 16.0 C3
S8 21.0 53.0 56.0 59.0 44.0 74.0 40.0 C3
S9 19.0 82.0 82.0 60.0 20.0 22.0 36.0 C1
S10 47.0 75.0 76.0 77.0 52.0 44.0 60.0 C2

 Based on these distance, each students is allocated to the nearest cluster. The first
iteration leads to two students in the first cluster and four each in the second and third
clusters.
Step 5: The below table compares the cluster means of clusters found in the above table with
the original seeds.
    Age   Mark 1  Mark 2  Mark 3
C1  18.5  77.5    78.5    58.5
C2  26.5  82.5    84.5    82.0
C3  21.0  62.0    61.5    56.5
Seed 1 18 73 75 57
Seed2 18 79 85 75
Seed3 23 70 70 52

It is interesting to note that the mean marks for C3 are significantly lower than
for C1 and C2.

Step 6 and 7: Use the new cluster means to recompute the distance of each object to each of
the means, again allocating each object to the nearest cluster. The below table shows the
second iteration result:
            Age   Mark1  Mark2  Mark3   From C1  From C2  From C3   Allocation to the
                                                                    nearest cluster
C1 (mean)   18.5  77.5   78.5   58.5
C2 (mean)   26.5  82.5   84.5   82.0
C3 (mean)   21.0  62.0   61.5   56.5
S1          18.0  73.0   75.0   57.0    10.0     52.3     28.0      C1
S2 18.0 79.0 85.0 75.0 25.0 19.8 62.0 C2
S3 23.0 70.0 70.0 52.0 27.0 60.3 23.0 C3
S4 20.0 55.0 55.0 55.0 51.0 90.3 16.0 C3
S5 22.0 85.0 86.0 87.0 47.0 13.8 79.0 C2
S6 19.0 91.0 90.0 89.0 56.0 28.8 92.0 C2
S7 20.0 70.0 65.0 60.0 24.0 60.3 16.0 C3
S8 21.0 53.0 56.0 59.0 50.0 86.3 17.0 C3
S9 19.0 82.0 82.0 60.0 10.0 32.3 46.0 C1
S10 47.0 75.0 76.0 77.0 52.0 41.3 74.0 C2
 The number of students in cluster 1 is again 2 and the other two clusters still have four
students each. A more careful look shows that the clusters have not changed at all.
Therefore the method has converged rather quickly for this very simple dataset.

 The cluster membership is as follows: Cluster 1: S1, S9; Cluster 2: S2, S5, S6, S10;
Cluster 3: S3, S4, S7, S8.
 It should be noted that the above method was K-median rather than K-means because
we wanted to use simple distance measure that a reader could check without using a
computer or a calculator.
 Example: when we started with the last three objects as the seeds we obtained C1 with
4 students, C2 with 5 students and C3 with only one student.
 The one object in C3 was the starting seed for C3. Therefore, we chose as one of the
seeds an object that was an outlier and hence it appeared as a cluster of one object in
the final result.
Scaling and Weighting:
Define scaling & weighting.(2m)
 Using K-means, the clusters were determined based on the three attributes Mark1, Mark2
and Mark3.
 The attribute age did not have much influence on determining which cluster of an
object belonged to since the values of this attribute were smaller than other attributes
and the variation within the attribute was small except for the object S10 which was
an outlier.
 For clustering to be effective, all attributes should be converted to a similar scale
unless we want to give more weight to some attributes that are relatively large in scale.
 One possibility is to transform them all to a normalized score or to a range. Such
transformations are called scaling.
 Some other approaches to scaling are:
Divide each attribute by the mean value of that attribute.
Divide each attribute by the difference between the largest and smallest value.
Convert the attribute values to “standardized scores” by subtracting the mean
of the attribute from each attribute value and dividing it by the standard
deviation.
 Scaling removes the bias due to different scales of the attributes and then gives equal
importance to all attributes.
 In some cases, one or two attributes may be much more important than the rest & it may
be desirable to give more weight to these attributes.
 Purpose of giving weights to some attributes is to introduce bias in the clustering
process by purposely giving higher weights to some attributes.
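A minimal sketch of the standardized-score scaling mentioned above is shown below (an illustration only); applied to the Age column of the student data, it makes the outlier S10 easy to see.

    def standardize(column):
        mean = sum(column) / len(column)
        sd = (sum((v - mean) ** 2 for v in column) / len(column)) ** 0.5
        return [(v - mean) / sd for v in column]

    ages = [18, 18, 23, 20, 22, 19, 20, 21, 19, 47]
    print([round(z, 2) for z in standardize(ages)])   # S10 stands out as a large positive score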
Issues related to K-means method:
Mention some issues related to k-means method.(2m/5m)
 Although k-means method is widely used, there are a number of issues related to this
method:
The results of K-means method depend strongly on the initial guesses of the
seeds.
K-means method can be sensitive to outliers. If an outlier is picked as a
starting seed, it may end up in a cluster of its own. Also, if an outlier moves
from one cluster to another during iterations, it can have a major impact on the

clusters because the means of the two clusters are likely to change
significantly.
K-means method does not consider the size of the clusters. Some clusters may
be large and some very small.
K-means method does not deal with overlapping clusters.
Expectation Maximization Method:
Write notes on Expectation maximization method.(5m)
 K-means method does not explicitly assume any probability distribution for the
attribute values.
 It only assumes that the dataset consists of groups of objects that are similar and the
groups can be discovered because the user has provided cluster seeds.
 In contrast to the K-means method, the Expectation Maximization (EM) method is based on
the assumption that the objects in the dataset have attributes whose values are distributed
according to some linear combination of simple probability distributions.
 The EM method assigns objects to different clusters with certain probabilities in an
attempt to maximize the expectation (or likelihood) of the assignment.
 This method consists of a two-step iterative algorithm.
The first step, called the estimation step or E-step, involves estimating the probability
distribution of the clusters given the data.
The second step, called the maximization step or M-step, involves finding the model
parameters that maximize the likelihood of the solution.
 EM method assumes that all attributes are independent random variables. In a simple
case of just two clusters with objects having only a single attribute, we may assume
that the attribute values vary according to a normal distribution.
 EM method requires the following parameters to be estimated:
Mean & standard deviation of the normal distribution for cluster 1.
Mean & standard deviation of the normal distribution for cluster 2.
Probability p of a sample belonging to cluster 1 and therefore probability 1-p
of belonging to cluster 2.

 EM method then works as follows:


1. Guess the initial values of the five parameters (2 means, 2 SDS and
probability p) given above.
2. Use the two normal distributions and compute the probability of each
object belonging to each of the two clusters.
3. Compute the likelihood of data coming from these two clusters by
multiplying the sum of probabilities of each object.
4. Re-estimate the five parameters and go to step-2 until a stopping criterion
has been met.
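A minimal sketch of this two-step loop for the simple case of one attribute and two normal clusters is given below; the data and the initial guesses of the five parameters are arbitrary assumptions made only for the illustration.

    import math

    def normal_pdf(x, mean, sd):
        return math.exp(-((x - mean) ** 2) / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

    def em_two_clusters(data, mean1, sd1, mean2, sd2, p, iterations=50):
        for _ in range(iterations):
            # E-step: probability of each object belonging to cluster 1
            w = [p * normal_pdf(x, mean1, sd1) /
                 (p * normal_pdf(x, mean1, sd1) + (1 - p) * normal_pdf(x, mean2, sd2))
                 for x in data]
            # M-step: re-estimate the five parameters from these probabilities
            n1, n2 = sum(w), len(data) - sum(w)
            mean1 = sum(wi * x for wi, x in zip(w, data)) / n1
            mean2 = sum((1 - wi) * x for wi, x in zip(w, data)) / n2
            sd1 = math.sqrt(sum(wi * (x - mean1) ** 2 for wi, x in zip(w, data)) / n1) or 1e-6
            sd2 = math.sqrt(sum((1 - wi) * (x - mean2) ** 2 for wi, x in zip(w, data)) / n2) or 1e-6
            p = n1 / len(data)
        return mean1, sd1, mean2, sd2, p

    marks = [73, 79, 70, 55, 85, 91, 70, 53, 82, 75]       # illustrative 1-D data
    print(em_two_clusters(marks, mean1=60.0, sd1=10.0, mean2=85.0, sd2=10.0, p=0.5))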

HIERARCHICAL METHODS
Explain about hierarchical methods with the steps.(10m)
 They produce a nested series of clusters as opposed to the Partitional methods which
produce only a flat set of clusters.
 They attempt to capture the structure of the data by constructing a tree of clusters.
Two types of hierarchical approaches are:
 Agglomerative approach for merging group (or bottom up approach)
Here each object at the start is a cluster by itself and the nearby clusters are
repeatedly merged resulting in larger & larger clusters until some stopping criterion is met or
all the objects are merged into a single large cluster which is the highest level of hierarchy.
 Divisive approach (or top-down approach)
Here all the objects are put in a single cluster to start. Method then repeatedly
performs splitting of clusters resulting in smaller & smaller clusters until a stopping criterion
is reached or each cluster has only one object in it.
Distance Between Clusters:
Define linkage metrics.(2m)
Write about computing distances between clusters in hierarchical clustering
methods.(5m)
 Hierarchical clustering methods require distances between clusters to be computed.
These metrics are often called linkage metrics.
 Computing distance between large clusters is expensive. Example: one cluster has 50
objects and another has 100, then computing most of the distance metrics listed below
would require computing distance between each object in the first cluster with every
object in the second. So, 5000 distances need to computed, just to compute a distance
between two clusters.
 Following are methods for computing distances between clusters:
1. Single-link algorithm
2. Complete-link algorithm
3. Centroid algorithm
4. Average link algorithm
5. Ward’s minimum-variance method

1. Single-link algorithm:
 Also called nearest neighbor algorithm is the simplest algorithm for computing
distance between two clusters.
 The algorithm determines the distance between two clusters as the minimum of
the distances between all pairs of points (a, x) where a is from the first cluster and x
is from the second.
 The algorithm requires that all pairwise distances be computed and the smallest
distance found. The algorithm can form chains and can form elongated
clusters.
 Following shows 2 clusters A & B and the single-link distance between them.

Disadvantage:
Each cluster may have an outlier and the two outliers may be nearby, and so the
distance between the two clusters would be computed to be small.
Single-link can form a chain of objects as clusters are combined since there is
no constraint on the distance between objects that are far away from each other.
2. complete-link Algorithm:
 Also called farthest neighbor algorithm. In this algorithm, the distance
between two clusters is defined as the maximum of the pair wise distances (a,
x).
 Therefore, if there are m elements in one cluster and n in the other, all mn pairwise
distances must be computed and the largest is chosen.
 Complete-link is strongly biased towards compact cluster.
 Following shows 2 clusters A & B and the complete-link distance between
them.

Disadvantage:
Two outliers may be very far away although the clusters are nearby and the
complete-link algorithm will compute the distance as large.
This algorithm works well but if a cluster is naturally of a shape, say, like
banana then perhaps complete link is not appropriate.
3. Centroid:
 Here, the distance between two clusters is determined as the distance between
the centroids of the clusters as shown below:

 The centroid algorithm computes the distance between two clusters as the
distance between the average points of each of the two clusters.
 Usually the squared Euclidean distance between the centroids is used.
 This is more tolerant of somewhat longer clusters than the complete-link
algorithm.

4. Average Link:
 This algorithm computes the distance between two clusters as the average of
all pair wise distances between an object from one cluster and another from
the other cluster.
 Therefore, if there are m elements in one cluster and n in the other, there are
mn distances to be computed, added and divided by mn.
 It tends to join clusters with small variance.

48
5. Ward’s minimum-variance method:
 This method results in creating small tight clusters.
 Ward’s distance is the difference between the total within the cluster sum of
squares for the two clusters separately and within the cluster sum of squares
resulting from merging the two clusters.
 An expression for ward’s distance may be derived. It may be expressed as
follows:
DW(A, B) = NA NB DC(A, B) / (NA + NB)
Where,
DW(A, B) - Ward's minimum-variance distance between clusters A and B, with NA and NB
objects in them respectively.
DC(A, B) - Centroid distance between the two clusters.
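The linkage metrics described above can be written as small functions. The sketch below is an illustration only; it uses the Manhattan distance between objects and two small clusters taken from the student example.

    def d(x, y):
        return sum(abs(a - b) for a, b in zip(x, y))

    def single_link(A, B):
        return min(d(a, b) for a in A for b in B)

    def complete_link(A, B):
        return max(d(a, b) for a in A for b in B)

    def average_link(A, B):
        return sum(d(a, b) for a in A for b in B) / (len(A) * len(B))

    def centroid_link(A, B):
        ca = [sum(col) / len(A) for col in zip(*A)]
        cb = [sum(col) / len(B) for col in zip(*B)]
        return d(ca, cb)

    A = [(18, 73, 75, 57), (19, 82, 82, 60)]        # cluster {S1, S9}
    B = [(20, 55, 55, 55), (21, 53, 56, 59)]        # cluster {S4, S8}
    print(single_link(A, B), complete_link(A, B), average_link(A, B), centroid_link(A, B))
    # 42 60 51.0 50.0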

Agglomerative Method:
Explain agglomerative method in detail with an example.(10m)
 This bottom-up strategy starts by placing each object in its own cluster and then
merges these atomic clusters into larger and larger clusters, until all of the objects are
in a single cluster or until certain termination conditions are satisfied. Most
hierarchical clustering methods belong to this category.
 Some applications have a hierarchical structure. EX: Fauna and Flora.
 The agglomerative clustering method tries to discover such structure given a dataset.
 The basic idea of this method is to start out with n clusters for n data points. i.e., each
cluster consisting of a single data point.
 Using a measure of distance, the method at each step merges the two nearest clusters,
thus reducing the number of clusters and building successively larger clusters.
 The process continues until the required number of clusters has been obtained or all the
data points are in one cluster.
 This method leads to hierarchical clusters in which at each step we build larger
and larger clusters that include increasingly dissimilar objects.
 This method is a bottom-up approach which involves the following step:
1. Allocate each point to a cluster of its own. Thus, we start with n clusters for n
objects.
2. Create a distance matrix by computing distance between all pairs of clusters
either using, for example the single-link metric or the complete link metric.
Some other metric may also be used. Sort these distances in ascending order.
3. Find the two clusters that have the smallest distance between them.
4. Remove the pair of objects and merge them.
5. If there is only one cluster left then stop.
6. Compute all distance from the new cluster and update the distance matrix after
the merger and go to step 3
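A minimal Python sketch of these steps is given below (an illustration, not the book's code); it merges the pair of clusters with the smallest centroid distance until a chosen number of clusters remains.

    def manhattan(x, y):
        return sum(abs(a - b) for a, b in zip(x, y))

    def centroid(cluster):
        return [sum(col) / len(cluster) for col in zip(*cluster)]

    def agglomerative(objects, k):
        clusters = [[obj] for obj in objects]            # step 1: one cluster per object
        while len(clusters) > k:
            # steps 2-3: find the pair of clusters with the smallest centroid distance
            pairs = [(manhattan(centroid(clusters[i]), centroid(clusters[j])), i, j)
                     for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
            _, i, j = min(pairs)
            # step 4: merge the two nearest clusters and continue
            clusters[i] = clusters[i] + clusters[j]
            del clusters[j]
        return clusters

    students = [(18, 73, 75, 57), (18, 79, 85, 75), (23, 70, 70, 52), (20, 55, 55, 55),
                (22, 85, 86, 87), (19, 91, 90, 89), (20, 70, 65, 60), (21, 53, 56, 59),
                (19, 82, 82, 60), (47, 75, 76, 77)]
    print(agglomerative(students, k=3))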
Example: consider the below table. The centroid method is used for computing the distance
between clusters.

Student Age Mark 1 Mark 2 Mark 3
S1 18 73 75 57
S2 18 79 85 75
S3 23 70 70 52
S4 20 55 55 55
S5 22 85 86 87
S6 19 91 90 89
S7 20 70 65 60
S8 21 53 56 59
S9 19 82 82 60
S10 47 75 76 77
Step 1 and step 2:
 Allocate each point to a cluster and compute the distance matrix. The distance matrix
is symmetric, so only half is shown. The distance matrix for the above object will be:
S1 S2 S3 S4 S5 S6 S7 S8 S9
S1 0
S2 34 0
S3 18 52 0
S4 42 76 36 0
S5 57 23 67 95 0
S6 66 32 82 106 15 0
S7 18 46 16 30 65 76 0
S8 44 74 40 8 91 104 28 0
S9 20 22 36 60 37 46 30 115 0
S10 52 44 60 90 55 70 60 98 58
Step 3 and 4:
 The smallest distance is 8 between object S 4 and object S8. They are combined and
removed and we put the combined cluster (C1) where the object S4 was.
 The below table shows the new distance matrix. All distance except those with cluster
C1 remain unchanged.
S1 S2 S3 C1 S5 S6 S7 S9
S1 0
S2 34 0
S3 18 52 0
C1 41 75 38 0
S5 57 23 67 93 0
S6 66 32 82 105 15 0
S7 18 46 16 29 65 76 0
S9 20 22 36 59 37 46 30 0
S10 52 44 60 88 55 70 60 58

Step 5 and 6:

 The smallest distance now is 15 between object S5 and S6. They are combined in a
cluster and S5 and S6 are removed. The following table shows the updated distance
matrix.

S1 S2 S3 C1 C2 S7 S9
S1 0
S2 34 0
S3 18 52 0
C1 41 75 38 0
C2 61.5 27.5 74.5 97.5 0
S7 18 46 16 29 69.5 0
S9 20 22 36 59 41.5 30 0
S10 52 44 60 88 62.5 60 58
 We find the shortest distance again. S3 and S7 are at a distance 16. They are then
merged and the process continues.
 The result of using the agglomerative method is shown in the figure below.
 The first merge is of objects S4 and S8, followed by S5 and S6, and so on.

[Figure 4.6 A possible result of using the agglomerative method: a dendrogram over the
objects S1, S3, S7, S2, S9, S4, S8, S5, S6, S10, in which the early clusters C1, C2 and C3
are merged step by step into larger clusters such as C4.]
Example 2:
The table gives the fat and protein content. Apply minimum spanning tree clustering
algorithm to solve the problem.
Food Item Number Protein content (P) Fat Content (F)
1 1.1 60
2 8.2 20
3 4.2 35
4 1.5 21
5 7.6 15
6 2 55
7 3.9 39
Assume each item to form a separate cluster. Hence, there are 7 clusters.
Calculating the Euclidean distance between each pair of food items by the formula:
D(C1, C2) = √((PC1 − PC2)² + (FC1 − FC2)²)

Formulating a table:
C1 C2 C3 C4 C5 C6 C7
C1 0
C2 40.6 0
C3 25.19 15.52 0
C4 39 6.77 14.25 0
C5 45.46 5.03 20.28 8.55 0
C6 5.08 35.54 20.12 34 46.39 0
C7 21.18 19.48 4.01 18.19 24.28 16.11 0
Between cluster C3 and C7 the distance is minimum, hence merge it.
C37 (protein value) = avg (C3 (P), C7 (P)) =4.05
C37 (fat value) = avg (C3 (F), C7 (F)) =37
New table is as follows:
C1 C2 C37 C4 C5 C6
C1 0
C2 40.6 0
C37 23.18 17.49 0
C4 39 6.77 16.2 0
C5 45.46 5.03 22.28 8.55 0
C6 5.08 35.54 18.18 34 46.39 0
Min distance is between C2 and C5. Hence merging C2, C5.
C25 (protein value) = avg (C2 (P), C5 (P)) =7.9
C25 (fat value) = avg (C2 (F), C5 (F)) =17.5
Next min distance is between C1, C6. Merge C1, C6.
C16 (protein value) = avg (C1 (P), C6 (P)) =1.55
C16 (fat value) = avg (C1 (F), C6 (F)) =57.5
The merged table is:
Cluster Protein Value Fat Value
C16 1.55 57.5
C25 7.9 17.5
C37 4.05 37
C4 1.5 21

Divisive Hierarchical Method


Explain the advantages & disadvantages of Hierarchical methods. (5m)
Explain divisive hierarchical method with an example. (5m/10m)
 This top-down strategy does the reverse of agglomerative hierarchical clustering by
starting with all objects in one cluster. It subdivides the cluster into smaller and
smaller pieces, until each object forms a cluster on its own or until it satisfies certain
termination conditions, such as a desired number of clusters is obtained or the
diameter of each cluster is within a certain threshold.
 Divisive method is the opposite of the agglomerative method in that the method starts
with the whole dataset as one cluster and then proceeds to recursively divide the
cluster into two sub-clusters, and continues until each cluster has only one object or
some other stopping criterion has been reached.

 There are 2 types of divisive methods:
1. Monothetic: it splits a cluster using only one attribute at a time. An attribute that
has the most variation could be selected.
2. Polythetic: it splits a cluster using all of the attributes together. Two clusters far
apart could be built based on distance between objects.
 A Polythetic divisive method works like the following:
 Decide on a method of measuring the distance between two objects. Also
decide a threshold distance.
 Create a distance matrix by computing distances between all pairs of objects
within the cluster. Sort these distances in ascending order.
 Find the two objects that have the largest distance between them. They are the
most dissimilar objects.
 If the distance between the two objects is smaller than the pre-specified
threshold and there is no other cluster that needs to be divided then stop,
otherwise continue.
 Use the pair of objects as seeds of a k-means method to create two new
clusters.
 If there is only one object in each cluster then stop otherwise continue with
step 3.
 In the above method, we need to resolve the 2 issues:
1. Which cluster to split next? 2. How to split a cluster?
Which cluster to split next?
 There are a number of possibilities when selecting the next cluster to split:
1. Split the cluster in some sequential order.
2. Split the cluster that has the largest number of objects.
3. Split the cluster that has the largest variation within it.
 First two are clearly simple but the third is better since it is based on the sound
criterion of splitting a cluster that has the most variance.
How to split a cluster?
 In the above method, a simple approach is used for splitting a cluster based on
distance between the objects in the cluster.
 A distance matrix is created and the two most dissimilar objects are selected as seeds
of 2 new clusters.
 The k-means method is then used to split the cluster.
Example 1:
 Consider the same example used in the above method.
 The distance matrix for that will be:
S1 S2 S3 S4 S5 S6 S7 S8 S9
S1 0
S2 34 0
S3 18 52 0
S4 42 76 36 0
S5 57 23 67 95 0
S6 66 32 82 106 15 0

S7 18 46 16 30 65 76 0
S8 44 74 40 8 91 104 28 0
S9 20 22 36 60 37 46 30 115 0
S10 52 44 60 90 55 70 60 98 58
 The largest distance is 115 between the objects S 8 and S9. They become the seeds of
two new clusters.

 K-means is used to split the group into two clusters.


S1 S2 S3 S4 S5 S6 S7 S8 S9 S10
S8 44 74 40 8 91 104 28 0 115 98
S9 20 22 36 60 37 46 30 115 0 58
Cluster C1 includes S4, S7, S8.
Cluster C2 includes S1, S2, S3, S5, S6, S9, S10.
 Then we find the largest distance in C2. The distance matrix may be built using
distances given in the first table.
S1 S2 S3 S5 S6 S9
S1 0
S2 34 0
S3 18 52 0
S5 57 23 67 0
S6 66 32 82 15 0
S9 20 22 36 37 46 0
S10 52 44 60 55 60 58
 The largest distance in C2 is 82, between S3 and S6. C2 can therefore be split with S3
and S6 as seeds.
 The distance matrix of cluster C1 is given in the table:

S4 S7
S4 0
S7 30 0
S8 8 28
 The largest distance is 30, between S4 and S7. C1 can be split with S4 and S7 as seeds.
The method continues like this until one of the stopping criteria is met.
Advantages of the hierarchical approach:
 Hierarchical approach can provide more insight into the data by showing a hierarchy
of clusters than a flat cluster structure created by a partitioning method like the k -
means method.
 Hierarchical methods are conceptually simpler and can be implemented easily.
 In some applications only proximity data is available and then the hierarchical
approach may be better.
 Hierarchical methods can provide clusters at different levels of granularity.
 There is no need to specify the number of clusters or the cluster seeds in advance.

Disadvantage of the hierarchical approach:
 The hierarchical methods do not include a mechanism by which objects that have
been incorrectly put in a cluster may be reassigned to another cluster.
 Different distance metrics and scaling of data can significantly change the results.
 The agglomerative approach starts with each object in an individual cluster and then
merges clusters to build larger clusters.
 Difficulties regarding the selection of merge or split points.
 It will neither undo what was done previously nor perform objects swapping between
clusters .thus merge or split decisions, if not well chosen at some step, may lead to
low-quality clusters.
 The method does not scale well.
DENSITY BASED METHODS
Explain shortly about density based cluster analysis methods. (5m)
 These methods are based on the assumption that clusters are high-density
collections of data of arbitrary shape that are separated by large regions of low-density
data.
 Basis is that, for each data point in a cluster, at least a minimum number of points
must exist within a given distance.
 Data that is not within such high-density clusters is regarded as outliers or noise.
 Another way of looking at density-based clustering is that the clusters are dense
regions of probability density in the data sets.
 DBSCAN (density based spatial clustering of applications with Noise) is an
example of density-based method.
 The method was designed for spatial databases but can be used in other
applications.
 It requires two input parameters.
1. Size of the neighborhood(R) 2. Minimum points in the
neighborhood(N)
 These parameters determine the density within the clusters the user is willing to
accept since they specify how many points must be in region.
 The number of points not only determines the density of acceptable clusters but it
also determines which objects will be labeled outliers or noise.
 Objects are declared to be outliers if there are few other objects in their
neighborhood.
 The size parameter R determines the size of the clusters found. If R is big enough,
there would be one big cluster and no outliers.
 If R is small, there will be small dense clusters and there might be many outliers.
 The following concepts are required in DBSCAN method;
 Neighborhood: the neighborhood of an object Y is defined as all the objects that are
within the radius R from Y.
 Core object: an object Y is called a core object if there are N objects within its
neighborhood.
 Proximity: two objects are defined to be in proximity of each other if they belong to the same
cluster. Object x1 is in proximity to object x2 if two conditions are satisfied.
a. The objects are close enough to each other, I.e., within a distance of R.
b. X2 is a core object as defined above.
 Connectivity: two objects x1 and xn are connected if there is a path or chain of objects
x1, x2, …, xn from x1 to xn such that each xi+1 is in proximity to object xi.
 The algorithm for density-based clustering is:
1. Select values of R and N.
2. Arbitrarily select an object P.
3. Retrieve all objects that are connected to P, given R and N.
4. If P is a core object, a cluster is formed.
5. If P is a border object, no objects are in its proximity. Choose another object. Go
to step 3.
6. Continue the process until all of the objects have been processed.
 The performance of DBSCAN has been found to be sensitive to the parameters R and
N.
 If R and N are not chosen carefully, an algorithm like DBSCAN may result in either a
large number of very small clusters or few very large clusters.
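A minimal sketch of the DBSCAN idea in Python is given below, assuming the data points are numeric tuples and using plain Euclidean distance; the helper name neighbours and the sample points are illustrative only, not from the text.

import math

def dbscan(points, R, N):
    # Minimal DBSCAN-style clustering: R = neighbourhood radius,
    # N = minimum number of points required for a core object.
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)
    cluster_id = 0

    def neighbours(i):
        # all points within radius R of point i (its neighbourhood)
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= R]

    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < N:              # not a core object: tentatively noise
            labels[i] = NOISE
            continue
        labels[i] = cluster_id          # start a new cluster from this core object
        queue = [j for j in seeds if j != i]
        while queue:                    # grow the cluster through connected objects
            j = queue.pop()
            if labels[j] == NOISE:      # a border object is claimed by the cluster
                labels[j] = cluster_id
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neigh = neighbours(j)
            if len(j_neigh) >= N:       # j is also a core object, so expand further
                queue.extend(j_neigh)
        cluster_id += 1
    return labels

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (25, 25)]
print(dbscan(pts, R=2.0, N=2))          # two dense clusters; the last point is noise (-1)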
DEALING WITH LARGE DATABASES
Explain about clustering when large database are included. (5m/10m)
 Most clustering methods implicitly assumes that all data is accessible in the main
memory.
 Often the size of the database is not considered but a method requiring multiple scans
of data that is disk resident could be quite inefficient for large problems.
 One possible approach to deal with large database that could be used with any type of
clustering method is to draw as large a sample from the large dataset as could be
accommodated in the main memory.
 The sample is then clustered .each remaining object is then assigned to the nearest
cluster obtained from the sample.
 This process could be repeated several times and the clusters that lead to the smallest
within clusters variance could be chosen.
K-means method (for large databases):
 K-means method is modified for the data that is too large to fit in main memory.
 The modified method is based on the assumption that centroids or starting seeds have
been specified and these seeds are reasonable in that the actual cluster is likely to have
attribute values normally distributed around the seeds.
 The method –first picks the number of clusters and their seed centroids and then
attempts to classify each objects to belong to one of the following three groups:
 Those that is certain to belong to clusters. These objects together are called the
discard set. Some information about these objects is computed and saved.
 Those are sufficient close to each other to be replaced by their summary. The
objects are however sufficiently far away from each cluster’s centroid that
they cannot yet be put in the discard set of objects. These objects together are
called the compression set.
 The remaining objects that are too difficult to assign to either of the two
group’s above. These objects are called retained set and stored as individual
objects.
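The grouping into the three sets can be sketched roughly as below. This is only a simplified illustration: it assumes the centroids are already given, it uses two distance thresholds (t_discard and t_compress) that are not part of the text, and it omits the sub-clustering normally used to summarize the compression set.

import math

def partition_objects(objects, centroids, t_discard, t_compress):
    # Split objects into discard / compression / retained sets
    # (the thresholds and the distance test are illustrative only).
    discard, compression, retained = [], [], []
    for x in objects:
        d_min = min(math.dist(x, c) for c in centroids)
        if d_min <= t_discard:
            discard.append(x)        # certain to belong to the nearest cluster
        elif d_min <= t_compress:
            compression.append(x)    # to be replaced by a summary of nearby objects
        else:
            retained.append(x)       # too hard to assign yet; kept as individual objects
    return discard, compression, retained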
Hierarchical method (for large database)
CONCEPT OF FRACTIONATION:
Define fractions & fractionation. (2m)
 Dealing with large datasets is difficult using hierarchical methods since the methods
require an N x N distance matrix to be computed for N objects.
 A modification of classical hierarchical methods is based on the idea of splitting the
data into manageable subsets called “fractions “and then applying a hierarchical
method to each fraction. The concept is called fractionation.
 The basic algorithm used in the method assumes that M is the largest number of objects
to which the hierarchical method may be applied.
 The size M may be determined, perhaps, based on the size of the main memory. The
algorithm is:
1. Split the large dataset into fractions of size M.
2. The hierarchical clustering technique being used is applied to each fraction.
Let the number of clusters so obtained from all the fractions are C.
3. For each of the C clusters, compute the mean of the attribute values of the
objects in it. Let this mean vector be mi, i=1,…, C. These cluster means are
called meta-observations. The meta-observation now becomes the data values
that represent the fractions.
4. If the number of meta-observations C is too large (greater than M), go to step 1;
otherwise apply the same hierarchical clustering technique to the meta-
observations obtained in step 3.
5. Allocate each object of the original dataset to the cluster with the nearest
mean obtained in step 4.
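A rough Python sketch of fractionation is given below. It assumes a helper cluster_fn that clusters a small set of numeric vectors and returns the clusters as lists of vectors; the final allocation of the original objects to the nearest mean (step 5) is left out.

def fractionation(objects, M, cluster_fn):
    # Split the data into fractions of size M, cluster each fraction, then
    # recurse on the cluster means (meta-observations) until they fit in one fraction.
    data = list(objects)
    while len(data) > M:
        meta = []
        for i in range(0, len(data), M):            # step 1: fractions of size M
            fraction = data[i:i + M]
            for c in cluster_fn(fraction):          # step 2: cluster each fraction
                # step 3: one mean vector (meta-observation) per cluster
                mean = [sum(vals) / len(c) for vals in zip(*c)]
                meta.append(mean)
        data = meta                                 # step 4: repeat if still too many
    return cluster_fn(data)                         # final clustering of the meta-observations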

QUALITY AND VALIDITY OF CLUSTER ANALYSIS METHODS

Discuss the quality measures of cluster analysis methods with its property.
 Evaluation is difficult since all clustering methods will produce clusters even if there
are no meaningful clusters in the data, and there is no test data available in cluster
analysis.
 Different methods produce different clusters. Different methods are biased towards
different types of clusters.
 Ex: minimizing the maximum distance between the points in a cluster is trying to
favor a spherical shape.
 The results of K-means may in the first instance be evaluated by examining each
attribute’s mean for each cluster in an attempt to assess how far each cluster is from
the other.
 Another approach is based on computing the within-cluster variation (I) and the
between-cluster variation (E). These variations may be computed as follows:
 Let the number of clusters be k and let the clusters be Ci, i = 1, …, k.
 Let the total number of objects be N and let the number of objects in cluster Ci be
Mi so that
M1 + M2 + … + Mk = N
 The within-cluster variation of the objects in a cluster Ci is defined as
the average squared distance of each object from the centroid of the cluster.
 i.e., if mi is the centroid of the cluster Ci then, summing over the objects xj in Ci,
the centroid is given by
mi = (Σ xj) / Mi
 and the within-cluster variation is given by
Ii = (Σ (xj − mi)²) / Mi
 The between-cluster distance E may now be computed given the centroids of
the clusters. It is the average sum of squares of pairwise distances between the
centroids of the k clusters. We may write E as
E = (1/k²) Σi Σj (mi − mj)²
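These quantities can be computed directly from the cluster contents; the short Python sketch below (written for numeric attribute vectors) follows the definitions above.

def cluster_variations(clusters):
    # clusters: a list of clusters, each a list of numeric vectors.
    # Returns the within-cluster variations Ii and the between-cluster value E.
    centroids, within = [], []
    for c in clusters:
        m = [sum(vals) / len(c) for vals in zip(*c)]          # centroid mi of cluster Ci
        centroids.append(m)
        # average squared distance of each object from the centroid
        I_i = sum(sum((a - b) ** 2 for a, b in zip(x, m)) for x in c) / len(c)
        within.append(I_i)
    k = len(clusters)
    # average of the squared pairwise distances between the k centroids
    E = sum(sum((a - b) ** 2 for a, b in zip(ci, cj))
            for ci in centroids for cj in centroids) / (k * k)
    return within, E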
 Evaluating the quality of clustering methods or results of a cluster analysis is a
challenging task. The quality of a method involves a number of criteria:
Efficiency of the method.
Ability of the method to deal with noisy and missing data.
Ability of the method to deal with large problems.
Ability of the method to deal with a variety of attributes types and magnitude.
 Once several clustering results have been obtained, they can be combined, hopefully
providing better results than those from just one clustering of the given data.
 The clustering can obey at least three properties:
 Scale-invariance
 Richness
 Consistency
Scale-invariance requires that the results produced by a clustering method do not change if the scale of
one or more attributes is changed.
Richness refers to the possibility of achieving any or all possible partitions depending on the
attribute values and the method.
Consistency requires that if we shrink distances between points inside a cluster and expand
distances between points in different clusters, the results are still the same.

CLUSTER ANALYSIS SOFTWARE


Mention some of the cluster analysis software. (2m)
 Following are some of the cluster analysis software:
 ClustanGraphics7 includes k-means, density-based and hierarchical cluster
analysis methods.
 CViz cluster visualization – a visualization tool designed for analyzing high-
dimensional data in large, complex data sets.
 Auto Class – an unsupervised Bayesian Classification system, based on
SNOB.
 Cluster 3.0 – open source software. It uses the k-means method, which
includes multiple trails to find the best clustering solution.
 CLUTO provides a set of clustering methods including Partitional,
agglomerative, and graph partitioning.
 SNOB – It does not require the number of classes or the starting seeds to be
specified.

Unit 3 completed

FAQ
Two marks:
1. List out the categories of cluster analysis methods.
2. What are the types of data?
3. Define clustering with examples.
4. Define the density based method.
5. What is quantitative data?
6. What is the use of the agglomerative method?
7. What is the difference between monothetic and polythetic in hierarchical methods?
8. Bring out the four aspects of quality of a method.

Five marks:
1. Discuss about the types of cluster analysis methods.
2. Describe the agglomerative method in cluster analysis.
3. Write a note on cluster analysis.
4. Describe how to deal with large databases in cluster analysis.
5. Describe the quality and validity of cluster analysis methods.
6. How many concepts are required in the density based method? Explain shortly.
Ten marks:
1. Describe about the methods for computing distances between clusters.
2. Describe in detail about Partition method in Cluster analysis
3. Describe in detail about hierarchical methods in Cluster analysis
4. Explain about computing distance properties and their types.
5. Explain in detail about hierarchical methods
UNIT – IV
WEB DATA MINING
INTRODUCTION
What do you mean by web data mining? (2m)
Explain about the various categories of web data mining. (5m)
 The web has become the number one source for information for internet users. The
information is very diverse in content as well as in topics.
 A feature of web is the hyperlink by which one web page provides links to other
pages allowing a user to traverse the web following links of interest to him.
 Web is an open medium. This means that the web has grown exponentially which is
its strength as well as its weakness.
 Strength – one can find information on just about anything.
 Weakness – there is the problem of abundance of information.
 Ex: searching for the phrase “data mining” on Google will find you more than 2.5
million pages. How to find the most useful one? How reliable is the information? It is
the aim of web mining to help with problems like this.
 “Web mining is the application of data mining techniques to find interesting and
potentially useful knowledge from web data. It is normally expected that either the
hyperlink structure of the web or the web log data or both have been used in the
mining process.”
 The categories of web mining form a hierarchy:
Web mining
   Web content mining (web page content mining, search result mining)
   Web structure mining
   Web usage mining (general access pattern tracking, customized usage tracking)

 Web mining may be divided into several categories:


 Web Content Mining
 Web Structure Mining
 Web Usage Mining
Web Content Mining:
 It deals with discovering useful information or knowledge from web page contents. A
web page includes no machine readable semantic information.
 Web content mining focuses on the web page content rather than the links.
 Web content is a rich information resource consisting of many types of information,
for example: unstructured free text, image, audio, video & metadata as well as
hyperlinks.
 Search engines, subject directories, intelligent agents, cluster analysis & portals are employed
to find what a user might be looking for.

Web Structure Mining:


 It deals with discovering & modeling the link structure of the web. Work has been
carried out to model the web based on the topology of the hyperlinks.
 This can help in discovering similarities between sites or in discovering important
sites for a particular topic or discipline or in discovering web communities.
Web Usage Mining:
 It deals with understanding user behavior in interacting with the web or with a web
site.
 One of the aims is to obtain information that may assist web site reorganization or
assist site adaptation to better suit the user.
 Mined data often includes data logs of user’s interactions with the web.
 Logs include information about the referring pages, user identification, time a user
spends as a website and the sequence of pages visited.
 Information is also collected via cookie files.
 While web structure mining shows that page A has a link to page B, web usage
mining shows who or how many people took that link, which site they came from &
where they went when they left page B.
 Following are the major differences between searching conventional text and
searching the web:
1. Hyperlink:
 Text documents do not have hyperlinks. In hard copy documents in a
library, documents are usually structured, no linkage between them is
identified.
 But web hyperlinks provide important information. A link from page A to
page B includes that the author of page A considered page B of some
value.
2. Types of information:
 Web pages can consists of text, frames, multimedia objects, animation &
other types of information quite different from text documents which
mainly consist of text but may have some other objects link tables,
diagrams & some images.
3. Dynamics:
 Text documents do not change unless new edition of a book appears
while web pages change frequently because the information on the web
including linkage information is updated all the time & new pages appear
almost every second.
4. Quality:
 Text documents are of high quality, but much of information on the web
is of low quality. Only 10% of web pages are really useful & of high
quality.
5. Huge size:
 Size of the web is approaching 100 terabytes which is equivalent to about 200
million books.
6. Document Use:

 Compared to the use of conventional documents, the use of web


documents is very difficult. The web users tend to pose short queries,
browse perhaps the first page of the results and then move on.

WEB TERMINOLOGY AND CHARACTERISTICS


Define the term URL. (2m)
What do you mean by proxy & domain name server? (2m)
Explain briefly the web terminologies used in web data mining. (5m)
Explain in detail about web terminology & its characteristics. (5m/10m)
 We define some web terminology based on the work of World Wide Web
consortium (W3C).
 WWW (World Wide Web) is the set of all the nodes which are interconnected by
hypertext links.
 Link expresses one or more relationships between two or more resources. Links may
also be established within a document by using anchors.
 A web page is a collection of information, consisting of one or more web resources,
intended to be rendered simultaneously & identified by a single URL.
 A Website is a collection of interlinked web pages, including a home page, residing at
the same network location.
 A client browser is the primary user interface to the web. It is a program which allows
a person to view the contents of web pages & to navigate from one page to
another.
 A Uniform Resource Locator (URL) is an identifier for an abstract or physical
resource, for example a server and file path. URLs are location dependent & each
URL contains 4 distinct parts:
 Protocol types (usually http)
 Name of web server
 Directory path
 File name
 If a file is not specified, index.html is assumed.
 A web server serves web pages using http to client machines so that a browser can
display them.
 A client is the role adopted by an application when it is retrieving a web resource.
 A proxy is an intermediary who acts as both a server and a client for the purpose of
retrieving resources on behalf of other clients.
 A domain name server is a distributed database of name to address mappings.
When a DNS server looks up a computer name, it either finds it in its list, or asks
another DNS server which knows more names.
 A cookie is the data sent by a web server to a web client, to be stored locally by the
client and sent back to the server on subsequent request.
 Obtaining information from the web using a search engine is called information
“pull” while information sent to users is called information “push”.
 Ex: users may register with a site & then information is sent (pushed) to such users
without them requesting it.

Graph Terminology
Define Directed & undirected graph in terms of web. (2m)
Define Diameter of a graph. (2m)
Explain basic graph terms & the components of the web. (5m)
 “Directed graph” is a set of nodes (that correspond to pages on the web) denoted
by V and edges (which correspond to links on the web) denoted by E, i.e., a graph
is (V, E) where each edge is an ordered pair of nodes and so has a direction.
 “Undirected graph” may also be represented by nodes and edges (V, E) but the
edges have no direction specified.
 A directed graph is like the web, where a link points from one page to another, while an
undirected graph is not like the pages & links on the web unless we assume the
possibility of traversal in both directions.
 A graph may be searched either by a breadth-first search or by a depth-first search.
 Breadth-first is based on first searching all the nodes that can be reached from the
node where the search is starting and once these nodes have been searched,
searching the nodes at the next level that can be reached from those nodes and so
on.
 Depth-first search is based on searching any unvisited descendants of a given
node first, then visiting the node and then any brother nodes.
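For illustration, a breadth-first traversal over a small web-like directed graph (represented simply as a dictionary of out-links) can be written as below; a depth-first search would use a stack or recursion instead of the queue.

from collections import deque

def breadth_first(graph, start):
    # graph: {page: [pages it links to]}; returns pages in the order they are reached
    visited, order = {start}, []
    queue = deque([start])
    while queue:
        page = queue.popleft()
        order.append(page)
        for nxt in graph.get(page, []):      # follow out-links level by level
            if nxt not in visited:
                visited.add(nxt)
                queue.append(nxt)
    return order

print(breadth_first({'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': []}, 'A'))
# -> ['A', 'B', 'C', 'D']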
 “diameter of a graph” – Maximum of the minimum distances between all possible
ordered node pairs (u, v) i.e., maximum number of links that one would need to
follow starting from any page u to reach any page v assuming that the best path has
been followed.
 An investigation on the structure of the web was carried out & it was found that the
web is not a ball of highly connected nodes.
 It displayed structural properties & the web could be divided into the following 5
components:
1. The Strongly Connected Core(SCC):
This part of the web was found to consist of about 30% of the web, which is
still very large given more than 4 billion pages on the web in 2004.
This is the heart of the web & its main property is that pages in the core can
reach each other following directed edges. (i.e., hyperlinks)
2. The IN Group:
This part of the web was found to consist of about 20% of the web.
Main property – pages in this group can reach the SCC but cannot be reached
from the SCC.
3. The Out Group:
This part of the web was found to consist of about 20% of the web.
Main property – pages in this group can be reached from SCC but cannot
reach the SCC.
4. Tendrils:
This part of the web was found to consist of about 20% of the web.
Main property – pages in this group cannot be reached from the SCC and cannot reach the
SCC.
It does not imply that these pages have no linkages to pages outside the
group since they could well have linkages from the IN group & to the OUT
group.
5. The Disconnected Group:
This part of the web was found to be less than 10% of the web & is essentially
disconnected from the rest of the web world.
Ex: personal pages at many sites that link to no other page and have no links
to them.
Size of the Web
List the categories of web pages. (2m)
Write the categories of web pages. (2m)
 The Deep web includes information stored in searchable databases often
inaccessible to search engines. This information can often only be accessed using
each web site’s interface. Some of this information may be available only to
subscribers.
 The Shallow web (or indexed web) is the information on the web that the search
engines can access without accessing the web databases.
 It has been estimated that the deep web is about 500 times the size of the shallow web.
 The web pages are very dynamic, changing almost daily. Many new web pages are
added to the web every day.
 Perhaps web pages are used for other purposes as well, for example, communicating
information to a small number of individuals via the web rather than via email.
 In many cases, this makes good sense. (Wasting disk storage in mailboxes is avoided)
 But if this grows, then a very large number of such web pages with a short life-span
& low connectivity to other pages are generated each day.
 Large numbers of web sites disappear every day & create many problems on the web.
Links from even well-known sites do not always work.
 Not all results of a search engine search are guaranteed to work. To overcome these
problems, web pages are categorized as follows:
1. A web page that is guaranteed not to change over time.
2. A web page that will not delete any content, may add content/ links but the page
will not disappear.
3. A web page that may change content / links but the page will not disappear.
4. A web page without any guarantee.
Web Metrics
Write short notes on web metrics. (5m)
 There are a number of properties (other than size & structure of the web) about the
web that are useful to measure.
 Some of the measures are based on distances between the nodes of the web graph.
 Centrality may be out-centrality, which is based on distances measured from the
node using its out-links while in-centrality is based on distances measured from
other nodes that are connected to the node using the in-links.
 Based on these metrics, it is possible to define the concept of compactness that
varies from 0 to 1, 0 for a completely disconnected web graph and 1 for a fully
connected web graph.
 Perhaps the most important measurements about web pages are about a page’s
relevance and its quality.
 Relevance is usually related to a user query since it concerns the user finding pages
that are relevant to his query and may be defined in a number of ways.
 A simple approach involves relevance to be measured by the number of query terms
that appear in the page.
 Another approach is based on in-links from relevant pages. In this, relevance of a
page many be defined as the sum of the number of query terms in the pages that
refer to the page.
 Relevance may also use the concept of co-citation. In co-citation, if both pages ‘a’ &
‘b’ point to a page ‘c’ then it is assumed that ‘a’ and ‘b’ have a common interest.
Similarly if ‘a’ points to both ‘b’ and ‘c’, then we assume that ‘b’ & ‘c’ also share a
common interest.
 Quality is not determined by page content since there is no automatic means of
evaluating the quality of content. And it is determined by the link structure.
 Example: if page ‘a’ points to page ‘b’ then it is assumed that page ‘a’ is endorsing
page ‘b’ and we can have some confidence in the quality of page ‘b’.
LOCALITY AND HIERARCHY IN THE WEB
What are the basic principles in designing the content & structure of web? (2m)
Explain briefly about the locality & hierarchy in the web. (5m)
 The web shows a strong hierarchical structure. A website of any enterprise has the
homepage as the root of the tree and as we go down the tree we find more detailed
information about the enterprise.
 Example: the homepage of a university will provide basic information and then links,
for example, to; prospective students, staff, research, information for current
students, information for current staff.
 The prospective students node will have a number of links like: courses offered, admission
requirements, scholarships available, semester dates, etc…
 Web also has a strong locality feature to the extent that almost two-thirds of all links
are to sites within the enterprise domain.
 One-third links to sites outside the enterprise domain have a higher percentage of
broken links.
 Website fetches information from a database to ensure that the information is
accurate & timely.
 Web pages can be classified as:
1. Homepage or the head page: These pages represent an entry point for the web
site of an enterprise (so frequently visited) or a section within the enterprise or
an individual’s web page.
2. Index page: these pages assist the user to navigate through the enterprises
website. A homepage may also act as an index page.
3. Reference page: These pages provide some basic information that is used by a
number of other pages. Ex: each page in a website may have link to a page that
provides enterprise’s privacy policy.
4. Content page: These pages only provide content & have little role in assisting a
user’s navigation. Often they are larger in size, have few out-links & are often the
leaf nodes of a tree.
 Web site structure & content are based on careful analysis of what the most
common user questions are. The content should be organized into categories in a
way that, traversing the site is simple & logical.
 Careful web user data mining can help. A number of simple principles have been
developed to help design the structure & content of a web site.
 Three basic principles are:
1. Relevant linkage principle: It is assumed that links from a page point to other
relevant resources. Links are assumed to reflect the judgment of the page
creator; by providing a link to another page, the creator is recommending that
page as relevant.
2. Topical unity principle: It is assumed that web pages that are co-cited (i.e., linked
from the same pages) are related.
3. Lexical affinity principle: It is assumed that the text and the links within a page
are relevant to each other.
 Unfortunately not all pages follow these basic principles, resulting in difficulties for
web users & web mining researchers.
WEB CONTENT MINING
Explain web content mining in detail. (10m)
 Web content mining examines the content of web pages as well as results of web
searching. The content includes text as well as graphics data. Web content mining is
further divided into web page content mining and search results mining.
 Web page content mining is traditional searching of web pages via content, while
search results mining is a further search of pages found from a previous search.
 Web content mining can improve on traditional search engines through such
techniques as concept hierarchies and synonyms, user profiles, and analyzing the
links between pages.
 Web content mining uses data mining techniques for efficiency, effectiveness and
scalability.
 Web content mining can be divided into:
 Agent-based approach
 Database based approach
1. Agent-based approach:
 They have software systems (agents) that perform the content mining.
 For example, search engines. Intelligent search agents use techniques like
user profiles or knowledge concerning specific domains.
2. Database-based approach:
 It views the web data as belonging to a database.
 Many query languages that target the web.
 Web content mining deals with discovering useful information from the web.
 When we use search engines like Google to search contents on the web, search
engines find pages based on location & frequency of keywords on the page although
some now use the concept of page rank.
 If we consider the type of queries that are posed to search engines, we find that if
the query is very specific we face the scarcity problem.
 If the query is for a broad topic, we face the abundance problem. Because of these
problems, keyword search using a search engine is only suggestive of the
document’s relevance. The user is then left with the task of finding relevant documents.
 Brin presents an example of finding content on the web. It shows how relevant
information from a wide variety of sources presented in a wide variety of formats
may be integrated for the user.
 Example involves extracting a relation of books in the form of (author, title) from the
web starting with a small sample list.
 Problem is defined as: we wish to build a relation R that has a number of attributes.
The information about tuples of R is found on web pages but is unstructured.

Dual Iterative Pattern Relation Extraction (DIPRE)


Write about DIPRE algorithm. (5m)
 Aim is to extract the information and structure it with low error rate. Algorithm
proposed is called Dual Iterative Pattern Relation Extraction (DIPRE). It works as
follows:
1. Sample: Start with a sample S provided by the user.
2. Occurrence: find occurrences of tuples starting with those in S. Once tuples are
found, the context of every occurrence is saved. Let these occurrences be O
(so O is derived from the sample S).
3. Patterns: generate patterns based on the set of occurrences O. This requires
generating patterns with similar contexts (so the patterns P are derived from O).
4. Match patterns: web is now searched for the patterns.
5. Stop if enough matches found, else go to step 2.
 Here a pattern is defined as a tuple like (order, URL prefix, prefix, middle, suffix). It
may then match strings like (author, middle, title) or (title, middle, author).
 In step 2 it is assumed that the occurrence consists of title and author with a variety
of other information.
 Web pages are usually semi-structured (ex: HTML documents). Database generated
HTML pages are more structured, however web pages consist of unstructured free
text data which makes it more difficult to extract information from them.
 The semi-structured web pages may be represented based on the HTML structures
inside the documents. Some may also use hyperlink structure.
 A different approach called database approach involves inferring the structure of a
website & transforming the website content into a database.
 Web pages are published without any control on them. Some of the sophisticated
searching tools being developed are based on the work of the artificial intelligence
community. These are called Intelligent Web Agents.
 Some web agents replace the work of search engines in finding information. Ex:
ShopBot finds information about a product that the user is searching for.
Web Document Clustering
Define the terms browsable summaries & snippet tolerance. (2m)
How will you cluster web documents? Explain. (5m)
 Web document clustering is an approach to find relevant documents on a topic or
about query keywords. Search engines return huge, unmanageable list of
documents, and finding the useful one is often tedious.
 User could apply clustering to a set of documents returned by search engines with
the aim of finding meaningful clusters that are easier to interpret.
 It is not necessary to insist that a document can only belong to one cluster since in
some cases it is justified to have document belong to two or more clusters.
 Web clustering may be based on content alone, may be based on both content &
links or based only on links.
 One approach that is specifically designed for web document cluster analysis is Suffix
Tree Clustering (STC) & it uses a phrase-based clustering approach rather than using
a single word frequency.
 In STC, the key requirements of a web document clustering algorithm include:
Relevance: This is most obvious requirement. We want cluster that are
relevant to user query.
Browsable summaries: The clusters must be easy to understand. The user should be
quickly able to browse the description of a cluster.
 Snippet tolerance: The clustering method should not require whole documents &
should be able to produce relevant cluster based only on the information that the
search engine returns.
Performance: The clustering method should be able to process the results of
search engine quickly & provide the resulting clusters to the user.
 STC algorithm consists of document cleaning, identifying base clusters & combining
base clusters.
 STC only requires the snippets of text that are returned by the search engine and
removes unwanted text that might get in the way of identifying common phrases.
(Ex: prefixes, punctuations...)
 The second step of the algorithm uses the cleaned snippets to mark sentence boundaries
and then build a suffix tree to efficiently identify sets of documents that share
common phrases.
 Once a suffix tree has been constructed, number of base clusters can be derived, by
reading the tree from root to each leaf which gives a phrase.
 If such phrase belongs to more than two snippets, then the phrase is considered a
base cluster. Each base cluster is given a score based on number of words in the
phrase and the number of snippets that have that phrase.
 Many base clusters will be overlapping, since documents often share more than just
one phrase. So clusters must be consolidated.
 A similarity index for every pair of base clusters is created based on the documents
that are in each cluster. If the base clusters are similar, they are combined.
Finding similar web pages
Define replication & mirroring. (2m)
Write the reasons for existence of identical pages in the web. (2m)
Define resemblance, containment. (2m)
 It has been found that almost 30% of all web pages are very similar to other pages.
There are many reasons for identical pages. For example:
A local copy may have been made to enable faster access to the material.
FAQs are duplicated since such pages may be used frequently.
Online documentation of popular software like UNIX may be duplicated for
local use.
There are mirror sites that copy highly accessed sites to reduce traffic.
 In some cases, documents are not identical because different formatting might be
used at different sites. A larger document may be split into smaller ones or a
composite document may be joined together to build a single document.
 Copying a single web page is called replication and copying an entire web site is
called mirroring.
 Similarity between web pages usually means content-based similarity. It is also
possible to consider link-based & usage-based similarity.
 Link-based is related to the concept of co-citation & is used for discovering a core set
of web pages on a topic.
 Usage-based is useful in grouping pages or users into meaningful groups.
 Content-based is based on comparing textual content of the web pages. Non-text
contents are not considered. We define two concepts:
Resemblance: Resemblance of two documents is defined to be a number
between 0 and 1 with 1 indicating that the two documents are virtually
identical & any value close to 1 indicating that the documents are very
similar.
Containment: Containment of one document in another is defined as a
number between 0 and 1 with 1 indicating that the first document is
completely contained in the second.
 Number of approaches is there to assess the similarity of documents. One Brute
force approach is to compare two document using software like ‘diff’ in Unix OS,
which compares two documents as files.
 Other string comparison algorithms may be used to find how many characters need
to be deleted, changed or added to transform one document to other, but this is
expensive for comparing millions of documents.
 Issues in document matching are:
If we are looking to compare millions of documents then the storage
requirement of the method should not be large.
Documents may be in HTML, PDF, MS word. They need to be converted to
text for comparison which may introduce errors.
Method should be robust. i.e., it should not be possible to avoid the matching
process with the changes to a document.
Finger Printing
What do you mean by shingles? (2m)
Define full fingerprinting. (2m)
Explain with an example about fingerprinting documents. (5m/10m)
 An approach for comparing large number of documents is based on the idea of
fingerprinting documents.
 A document may be divided into all possible substrings of length L. These substrings
are called shingles. Based on the shingles we can define resemblance R(X, Y) and
containment C(X, Y) between two documents X & Y as follows.
 We assume S(X) & S(Y) to be a set of shingles for documents X and Y respectively.
R(X, Y) = {S(X) ∩ S(Y)} / {S(X) U S(Y)}
C(X, Y) = {S(X) ∩ S(Y)} / {S(X)}
 Following algorithm may be used to find similar documents:
1. Collect all the documents that one wishes to compare.
2. Choose a suitable shingle width and compute the shingles for each document.
3. Compare the shingles for each pair of documents.
4. Identify those documents that are similar.
 This algorithm is inefficient for large collections of documents. The web is very large and the
algorithm requires enormous storage for the shingles and takes a long time. This
approach is called full fingerprinting.
Example:
 We find if the two documents with the following content are similar:
Document1: “the web continues to grow at a fast rate”
Document2: “the web is growing at a fast rate”
 First step: making a set of shingles from the documents. We obtain the below
shingles if we assume the shingle length to be 3 words.
Shingles in Doc1        Shingles in Doc2
The web continues       The web is
Web continues to        Web is growing
Continues to grow       Is growing at
To grow at              Growing at a
Grow at a               At a fast
At a fast               A fast rate
A fast rate

 Comparing two sets of shingles we find that only 2 of them are identical. So the
documents are not very similar. We illustrate the approach using 3 shingles that are
the shortest in the number of letters including space.
Shingles in Doc1 Number of Letters Shingles in Doc2 Number of Letters
The web continues 17 The web is 10
Web continues to 16 Web is growing 14
Continues to grow 17 Is growing at 13
To grow at 10 Growing at a 12
Grow at a 9 At a fast 9
At a fast 9 A fast rate 11
A fast rate 11

 We select 3 shortest shingles for comparison. For first documents, these are “to
grow at”, “grow at a”, “at a fast”.
 For second document, these are “the web is”, “at a fast”, “a fast rate”. There is only
one match out of 3 shingles.
 False positives with the original document would be obtained for documents like
“the Australian economy is growing at a fast rate”.
 False negatives would be obtained for strings like “the web is growing at quite a
rapid rate”.
 So, small length shingles cause many false positives while larger shingles result in
more false negatives. A better approach involves randomly selecting shingles of
various lengths.
 Issues in comparing documents using fingerprinting are:
1. Should the shingle length be in number of words or characters?
2. How much storage would be needed to store shingles?
3. Should upper & lower-case letters be treated differently?
4. Should punctuation marks be ignored?
5. Should end of paragraph be ignored?
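The shingling, resemblance and containment calculations above can be reproduced with a few lines of Python; the sketch below uses word shingles of width 3 and the two example documents from the text.

def shingles(text, width=3):
    # word shingles of the given width, e.g. width 3 gives three-word phrases
    words = text.lower().split()
    return {' '.join(words[i:i + width]) for i in range(len(words) - width + 1)}

def resemblance(x, y, width=3):
    sx, sy = shingles(x, width), shingles(y, width)
    return len(sx & sy) / len(sx | sy)       # {S(X) ∩ S(Y)} / {S(X) U S(Y)}

def containment(x, y, width=3):
    sx, sy = shingles(x, width), shingles(y, width)
    return len(sx & sy) / len(sx)            # {S(X) ∩ S(Y)} / {S(X)}

d1 = "the web continues to grow at a fast rate"
d2 = "the web is growing at a fast rate"
print(shingles(d1) & shingles(d2))           # the 2 shared shingles: 'at a fast', 'a fast rate'
print(round(resemblance(d1, d2), 2))         # 2 shared out of 11 distinct shingles -> 0.18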
WEB USAGE MINING
Mention some information obtained from web usage mining. (2m)
Explain web usage mining. (5m)
 Web usage mining performs mining on web usage data, or web logs. A web log is a
listing of page reference data. It is also called click stream data because each entry
corresponds to a mouse click.
 When these logs are examined from a server perspective, mining uncovers
information about the sites where the service resides and can help improve the design
of the sites.
 By evaluating a client’s sequence of clicks, information about a user (or a group of
users) is detected. This could be used to perform pre-fetching and caching of pages.
 The objective in web usage mining is to be able to understand and be able to predict
user behavior in interacting with the web or with a web site in order to improve the
quality of service.
 Other aims are to obtain information & discover usage patterns that may assist web
site design / redesign, perhaps to assist navigation through the site.
 Mined data includes web data repositories which include data logs of user’s
interactions with the web, web server logs, proxy server logs, browser server logs,
etc.
 Information collected in the web server logs includes information about the access,
the referrer and the agent. Access information describes the user’s interaction with the server,
referrer information is about the referring page & the agent information is about the
browser used.
 Some of the routine information may be obtained using tools that are available at a
low cost. Using such tools, it is possible to find at least the following information:
1. Number of hits – number of times each page in the web site has been viewed.
2. Number of visitors – number of users who came to the site.
3. Visitor referring web site – web site URL of the site the user came from.
4. Visitor referral web site – web site URL of the site where the user went when he
left the web site.
5. Entry point – which web site page the user entered from.
6. Visitor time & duration – time & day of visit and how long the visitor browsed the
site.
7. Path analysis – list of path of pages that the user took.
8. Visitor IP address – this helps in finding which part of the world the user came
from.
9. Browser type.
10. Platform.
11. Cookies.
 This simple information can assist an enterprise to achieve the following:
1. Shorten the paths to high visit pages.
2. Eliminate or combine low visit pages.
3. Redesign some pages including homepage to help user navigation.
4. Redesign some pages so that the search engines can find them.
5. Help evaluate effectiveness of an advertising campaign.
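Routine statistics such as the number of hits per page, the number of distinct visitors and the referring sites can be extracted from a server log with very little code. The sketch below assumes a deliberately simplified log format (IP address, page, referrer per line); real server logs contain more fields.

from collections import Counter

def summarise_log(lines):
    # each line: "ip_address page referrer" (simplified format, assumed for illustration)
    hits = Counter()        # number of times each page was viewed
    visitors = set()        # distinct client IP addresses
    referrers = Counter()   # where the visitors came from
    for line in lines:
        ip, page, referrer = line.split()
        hits[page] += 1
        visitors.add(ip)
        referrers[referrer] += 1
    return hits, len(visitors), referrers

log = ["10.0.0.1 /index.html google.com",
       "10.0.0.2 /courses.html google.com",
       "10.0.0.1 /courses.html /index.html"]
hits, n_visitors, refs = summarise_log(log)
print(hits.most_common(1), n_visitors)   # most viewed page and the number of visitors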
 In web usage mining it may be desirable to collect information on:
Path traversal: what paths do the customers traverse? What are the most
commonly traversed paths? These patterns need to be analysed & acted
upon.
Conversion rates: what are the basket-to-buy rates for each product?
Impact of advertising: which banners are pulling in the most traffic? What is
their conversion rate?
Impact of promotions: which promotions generate the most sales?
Web site design: which links do the customers click most frequently? What
links do they buy from most frequently?
Customer segmentation: what are the features of customers who stop
without buying? Where do most profitable customers come from?
Enterprise search: which customers use enterprise search? Are they likely to
purchase? What they search?
 Web usage mining also deals with caching & proxy servers. Not all page requests are
recorded in the server log because pages are often cached and served from the
cache when the user requests them again.
 Caching occurs at several levels. The browser itself has a cache to keep copies of recently
visited pages which may be retained only for a short time.
 Enterprises maintain a proxy server to reduce internet traffic & to improve
performance.
 The proxy server intercepts any requests for web pages from any user in the enterprise &
serves them from the proxy if a copy that has not expired is resident in the proxy.
 Use of caches & proxies reduces the number of hits that the web server log records.
 One may be interested in the behavior of users, not just in one session but over
several sessions. Returning visitors may be recognized using cookies (which is issued
by the web server & stored on the client).The cookie is then presented to web server
on return visits.
 Log data analysis has been investigated using the following techniques:
 Using association rules
 Using cluster analysis
 In usage analysis, we want to know the sequence of nodes visited by a user, which is
not possible using association rules analysis.
 It should be noted that, in association rules mining, paths do not play any role.
We are only interested in finding which items appear with which other items
frequently.
 In fact, if pages A and B appear together in an association rule, it is difficult to tell if A
was traversed first or B was.
Uses of web usage mining:
 Personalization for a user can be achieved by keeping track of previously accessed
pages.
 By determining frequent access behaviors for users, needed links can be identified to
improve the performance.
 Identifying common access behaviors can help improve actual design of web pages
and make other modifications to the site.
 Patterns can be used to gather business intelligence to improve sales and
advertisement.
Activities of web usage mining:
 Preprocessing activities center on reformatting the web log data before processing.
 Pattern discovery activities form the major portion of the mining activities because
these activities look to find hidden patterns with in the log data.
 Pattern analysis is the process of looking at and interpreting the results of the
discovery activities.
Data Structure:
 Data structures are used to keep track of patterns identified during the web usage
mining process.
 One of the basic data structures is the trie. A trie is a rooted tree, where each path from
the root to a leaf represents a sequence. Tries are used to store strings for pattern-
matching applications.
 The wastage of space in tries is solved by compressing nodes together when they
have a degree of one. A compressed trie is called a suffix tree.
With the help of a suffix tree, it is efficient to find common subsequences among
multiple sequences.
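A plain (uncompressed) trie of visitor click paths can be built with nested dictionaries as sketched below; a suffix tree would additionally compress chains of single-child nodes. The page names are made up for illustration.

def add_path(trie, pages):
    # insert one visitor's page sequence, counting how many sessions pass each prefix
    node = trie
    for page in pages:
        node = node.setdefault(page, {'_count': 0, '_children': {}})
        node['_count'] += 1
        node = node['_children']

trie = {}
add_path(trie, ['home', 'courses', 'datamining'])
add_path(trie, ['home', 'courses', 'warehousing'])
add_path(trie, ['home', 'staff'])
print(trie['home']['_count'])                            # 3 sessions start at 'home'
print(trie['home']['_children']['courses']['_count'])    # 2 of them continue to 'courses'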

WEB STRUCTURE MINING


Definition:
 With web structure mining, information is obtained from the actual organization of
pages on the web.
 It can be viewed as creating a model of the web organization or a portion of it.
Uses:
To classify web pages
To create similarity measures between documents.
Techniques of web structure mining:
Page rank and
HITS algorithm
Aim of web structure mining
 To discover the link structure or the model that is assumed to underline the web.
Model may be based on the topology of hyperlinks.
 This can help in discovering similarity between sites or in discovering authority sites
for a particular topic or in discovering overview or survey sites that point to many
authority sites (such sites are called hubs).
 The links on web pages provide a useful source of information that may be bound
together in web searches.
 Kleinberg has developed a connectivity analysis algorithm called Hyperlink-induced
topic search (HITS) based on the assumption that links represent human judgment.
 Algorithm also assumes that for any topic, there are a set of “authoritative” sites
that are relevant on the topic and there are “hub” sites that contain useful links to
relevant sites on the topic.
 HITS algorithm is based on the idea that if the creator of page p provides a link to
page q, then p gives some authority on page q.
 But not all links give authority, since some may be for navigational purposes. Ex links
to the home page.
Definitions required for HITS:
A graph is a set of vertices & edges (V, E). The vertices correspond to the
pages & edges correspond to the links.
A directed edge (P, Q) indicates a link from page p to page q.
 HITS algorithm has 2 major steps:
1. Sampling step: It collects a set of relevant web pages given a topic.
2. Iterative step: It finds hubs & authorities using the information collected
during sampling.
 Example: for a query “web mining” the algorithm involves carrying out the sampling
step by querying a search engine and then using the most highly ranked web pages
retrieved for determining the authorities & hubs.
 Posing a query to search engine often results in abundance. In some cases, the
search engine may not retrieve all relevant pages for the query.
 Query for java may not retrieve pages for object-oriented programming, some of
which may also be relevant.

HITS algorithm
Explain HITS algorithm in detail with an example. (10m)
Step-1: sampling step
 First step involves finding a subset of nodes or a sub graph S, which are relevant
authoritative pages. To obtain such a sub graph, the algorithm starts with a root set
of, say 200 pages selected from the result of searching for the query in a traditional
search engine.
 Let the root set be R. Starting from R, we wish to obtain a set S that has the following
properties:
1. S is relatively small.
2. S is rich in relevant pages given the query.
3. S contains most of the strongest authorities.
 HITS expand the root set R into a base set S by using the following algorithm:
1. Let S =R.
2. For each page in S, do steps 3 to 5.
3. Let T be the set of all pages S points to.
4. Let F be the set of all pages that point to S.
5. Let S =S+ T+ some or all of F. (some if F is large)
6. Delete all links with the same domain name.
7. This S is returned.
 Set S is called base set for the given query. We find the hubs & authorities from the
base set as follows:
 One approach is ordering them by the count of their out-degree & the count of their
in-degree.
 Before finding hubs & authorities, step 6 is carried out, i.e., links between pages on the same
site are deleted since they are for navigational purposes, not for conferring authority.
Step-2: Finding Hubs & Authorities
 Algorithm for finding hubs & authorities is:
1. Let a page p have a non-negative authority weight Xp& a non-negative hub
weight Yp. Pages with large weights Xp will be classified to be the authorities
(similarly for hubs).
2. Weights are normalized so their squared sum for each type of weight is 1 since
only the relative weights are important.
3. For a page, the value of Xp is updated to be the sum of Yq over all pages q that link to
p.
4. For a page, the value of Yp is updated to be the sum of Xq over all pages q that p links
to.
5. Continue with step 2 unless a termination condition has been reached.
6. On termination, the output of the algorithm is a set of pages with the largest X p
weights that can be assumed to be authorities & those with the largest Yp
weights that can be assumed to be hubs.
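The iterative step can be written compactly as below. This sketch follows steps 1 to 5 with normalization in every iteration, so the absolute numbers differ from the unnormalized table in the example that follows, though the relative ordering of hubs and authorities comes out similar. Applied to the ten-page example, it ranks G, H, I and J as the strongest authorities and A, B and C among the strongest hubs, which agrees with the relative sizes of the X2 and Y2 values in the table.

def hits(pages, out_links, iterations=20):
    # out_links: {page: [pages it points to]}; returns authority (x) and hub (y) weights
    x = {p: 1.0 for p in pages}   # authority weights Xp
    y = {p: 1.0 for p in pages}   # hub weights Yp
    for _ in range(iterations):
        # authority of p = sum of hub weights of the pages that link to p
        x = {p: sum(y[q] for q in pages if p in out_links.get(q, [])) for p in pages}
        # hub weight of p = sum of authority weights of the pages p links to
        y = {p: sum(x[q] for q in out_links.get(p, [])) for p in pages}
        # normalize so that the squared weights of each type sum to 1
        nx = sum(v * v for v in x.values()) ** 0.5
        ny = sum(v * v for v in y.values()) ** 0.5
        x = {p: v / nx for p, v in x.items()}
        y = {p: v / ny for p, v in y.items()}
    return x, y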
Example:
 Assume a query has been specified. First step has been carried out & a set of pages
that contain most of the authorities has been found.
 Let the set be given by the following 10 pages. Their out-links are listed below:
Page A (out-links to G, H, I, J)
Page B (out-links to A, G, H, I, J)
Page C (out-links to B, G, H, I, J)
Page D (out-links to B, G, H, J)
Page E (out-links to B, H, I, J)
Page F (out-links to B, G, I, J)
Page G (out-links to H, I)
Page H (out-links to G, I, J)
Page I (out-links to H)
Page J (out-links to F, G, H, I)
 Connection matrix for this information is as follows:

Page A B C D E F G H I J
A 0 0 0 0 0 0 1 1 1 1
B 1 0 0 0 0 0 1 1 1 1
C 0 1 0 0 0 0 1 1 1 1
D 0 1 0 0 0 0 1 1 0 1
E 0 1 0 0 0 0 0 1 1 1
F 0 1 0 0 0 0 1 0 1 1
G 0 0 0 0 0 0 0 1 1 0
H 0 0 0 0 0 0 1 0 1 1
I 0 0 0 0 0 0 0 1 0 0
J 0 0 0 0 0 1 1 1 1 0

 Every row in the matrix shows the out-links from the web page that is identified in
the first column. Every column shows the in-links from the web page that is
identified in the first row of the table.
 Another representation is:
Web page   In-links                    Out-links
A          B                           G, H, I, J
B          C, D, E, F                  A, G, H, I, J
C          None                        B, G, H, I, J
D          None                        B, G, H, J
E          None                        B, H, I, J
F          J                           B, G, I, J
G          A, B, C, D, F, H, J         H, I
H          A, B, C, D, E, G, I, J      G, I, J
I          A, B, C, E, F, G, H, J      H
J          A, B, C, D, E, F, H         F, G, H, I
 Next step involves iterations to compute the hubs & authorities. Initially, for ease of
understanding, we carry out the iterations without normalization as in the following
table. We assume all initial weights to be 1.

 Note that the X values are the authority weights and Y values are the hub weights.
Page   X0   Y0   X1   Y1   X2   Y2
A 1 1 1 4 5 118
B 1 1 4 5 17 123
C 1 1 0 5 0 135
D 1 1 0 4 0 104
E 1 1 0 4 0 106
F 1 1 1 4 4 106
G 1 1 7 2 29 60
H 1 1 8 3 29 89
I 1 1 8 1 31 29
J 1 1 7 4 29 93
Problems with the HITS algorithm:
 While the algorithm works well for most queries, it does not work well for some others.
There are a number of reasons for this:
Hubs and authorities:
 A clear-cut distinction between hubs & authorities may not be appropriate since
many sites are hubs as well as authorities.
Topic drift:
 Certain arrangements of tightly connected documents, perhaps due to mutually
reinforcing relationships between hosts, can dominate HITS computation. These
documents may not be the most relevant to query sometimes.
Automatically generated links:
 Some of the links are computer generated & represent no human judgment but HITS
still gives them equal importance.
Non-relevant documents:
 Some queries can return non-relevant documents in the highly ranked queries and
this can then lead to erroneous results from HITS.
Efficiency:
 The real-time performance of the algorithm is not good given the steps that involve
finding sites that are pointed to by pages in the root set.
 Proposals to modify HITS are:
 More careful selection of the base set to reduce the possibility of topic drift.
 One may argue that the in-link information is more important than the out-
link information. A hub can become important just by pointing to a lot of
authorities.
Web communities:
 A web community is generated by a group of individuals that share a common
interest. Ex: religious group, sports…
 Web communities may be discovered by exploring the web as a graph & identifying
sub-graphs that have a strong link structure within them but weak associations with
other parts of the graph. Such sub graphs may be called web-communities or cyber-
communities.
 The idea of cyber-communities is to find all web communities on any topic by
processing the whole web graph.

WEB MINING SOFTWARE


Mention some web mining software. (2m)
 Some software packages listed below are commercial products:
 123 log analyzer- low cost web mining software that provides an overview of a web
site’s performance, statistics about web server activity, web pages viewed, files
downloaded, images that were accessed…
 Analog – an ultra-fast, scalable, highly configurable log analyzer; works on any operating system &
is free.
 Azure web log analyzer- finds usual web information including popular pages,
number of visitors, what browser & computer they used.
 Click Tracks – it offers a number of modules to provide web site analysis.
 Datanautics- web mining software for data collection, processing, analysis and
reporting.
 Net Tracker web analyzer- analyses log files. Comprehensive analysis of behavior of
visitors to internet, intranet sites.
 Nihuo web log analyzer- provides reports on how many visitors came to the web site,
where they came from, which pages they viewed, how long they spent on the site &
more.
 Web log Expert 3.5- produces a report that includes this information: accessed files,
paths through the site, search engines, browsers, OS & more.
 WUM: web Utilization Miner – an open source project. Java-based web mining
software for log files preparation, discovery of sequential patterns, etc…
SEARCH ENGINE
Define search engines. (2m)
Write the goals of search engine. (2m)
 A number of web sites provide a large amount of interesting information about
search engines.
 Search engines are huge databases of web pages as well as software packages for
indexing and retrieving the pages that enable users to find information of interest to
them.
 Normally the search engine databases of web pages are built & updated
automatically by web crawlers.
 Every search engine provides advertising (sometimes called sponsored links). There
are a variety of ways businesses advertise on the search engine.
 Commonly, whenever a particular keyword is typed by the user, related sponsored
links appear with the results of the search.
Goals of web search:
 It has been suggested that the information needs of the users may be divided into
three classes:
1. Navigational: Primary information need in these queries is to reach a web site
that the user has in mind. The user may know that the site exists or may have visited
the site earlier but does not know the URL.
2. Information: Primary information need in such queries is to find a web site that
provides useful information about a topic of interest. User does not have a
particular web site in mind. Ex: user wants to investigate IT career prospects in
Kolkata.
3. Transactional: Primary need in such queries is to perform some kind of
transaction. User may or may not know the target web site. Ex: to buy a book,
users wish to go to amazon.com.
Quality of search results:
 Results from a search engine ideally should satisfy the following quality
requirements:
1. Precision: Only relevant documents should be returned.
2. Recall: All the relevant documents should be returned.
3. Ranking: a ranking of the documents providing some indication of the relative
ranking of the results should be returned.
4. First Screen: First page of results should include the most relevant results.
5. Speed: Results should be provided quickly since users have little patience.
SEARCH ENGINES FUNCTIONALITY
Explain the functionalities of search engines.
 A search engine is a rather complex collection of software modules. A search engine
carries out a variety of tasks. These include:
1. Collecting information:
A search engine would normally collect web pages or information about
them by web crawling or by human submission of pages.
2. Evaluating and categorizing information:
In some cases, for example, when web pages are submitted to a
directory, it may be necessary to evaluate and decide whether a
submitted page should be selected.
3. Creating a database and creating indexes:
The information collected needs to be stored either in a database or
some kind of file system. Indexes must be created so that the information
may be searched efficiently.
4. Computing ranks of the web documents:
A variety of methods are being used to determine the rank of each page
retrieved in response to a user query. The information used may include
frequency of keywords, value of in-links & out-links from the page and
frequency of use of the page.
5. Checking queries and executing them:
Queries posed by users need to be checked, for example, for spelling
errors and whether words in the query are recognizable. Once checked, a
query is executed by searching the search engine database.
6. Presenting results:
How the search engine presents the results to the user is important. The
search engine must determine what results to present and how to display
them.
7. Profiling the users:
To improve search performance, the search engines carry out user
profiling that deals with the way users use search engines.
SEARCH ENGINES ARCHITECHTURE
What are the major components of search engines? (2m)
Define crawler & Term extraction. (2m)
What are the advantages of proxy cache? (2m)
Explain about the crawler and indexer of search engine. (5m)
With a neat diagram explain the architecture of search engines. (10m)
 No two search engines are exactly the same in terms of size, indexing techniques,
page ranking algorithms or speed of search.
 A typical search engine architecture consists of many components, including the
following three major components that carry out the functions listed above.
The crawler and the indexer: it collects pages from the web, creates and
maintains the index.
The user interface: it allows users to submit queries and enables result
presentation.
The database and the query server: it stores information about the web
pages and processes the query and returns results.
The crawler:
 The crawler (or spider or robot or bot) is an application program that carries out a
task similar to graph traversal.
 It is given a set of starting URLs that it uses to automatically traverse the web by
retrieving a page, initially from the starting set.
 Some search engines use a number of distributed crawlers. Since a crawler has a
relatively simple task, it is not CPU-bound. It is bandwidth bound.
 A web crawler must take into account the load (bandwidth, storage) on the search
engine machines and sometimes also on the machine being traversed in guiding its
traversal.
 A web crawler starts with a given set of URLs and fetches the pages that the URLs
refer to. The crawler then reads the out-links of the pages so fetched and fetches
those pages. This continues until no new pages are found.
 Each page found by the crawler is often not stored as a separate file; otherwise four
billion pages would require managing four billion files. Usually many pages are
stored together in one file.
 Crawlers follow an algorithm like the following (a minimal code sketch is given after the list):
1. Find base URLs-a set of known & working hyperlinks are collected.
2. Build a queue-put the base URLs in the queue and add new URLs to the queue as
more are discovered.
3. Retrieve the next page-retrieve the next page in the queue, process and store in
the search engine database.
4. Add to the queue-check if the out-links of the current page have already been
processed. Add the unprocessed out-links to the queue of URLs.
5. Continue the process until some stopping criteria are met.
 A crawler would often have a traversal strategy. It may traverse the web breadth
first, depth first or the traversal may be priority based.
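The following is a minimal Python sketch of the crawler steps listed above. It is only illustrative: the starting URL, the page limit and the regular expression used to find out-links are assumptions, and a real crawler would also respect robots.txt, limit its request rate and run distributed over many machines.

import re
import urllib.request
from collections import deque

def crawl(base_urls, max_pages=50):
    queue = deque(base_urls)          # step 2: build a queue of known URLs
    seen = set(base_urls)
    store = {}                        # stands in for the search engine database

    while queue and len(store) < max_pages:          # step 5: stopping criterion
        url = queue.popleft()                        # step 3: retrieve the next page
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                                 # skip pages that cannot be fetched
        store[url] = html                            # process and "store" the page
        # step 4: extract out-links (very crudely) and add unprocessed ones to the queue
        for link in re.findall(r'href="(http[^"]+)"', html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return store

# pages = crawl(["http://example.com/"])   # hypothetical starting URL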

The Indexer:
 Given the size of the web & the number of documents that current search engines
have in their databases, an indexer is essential to reduce the cost of query
evaluation.
 Many researchers recommend a file structure in which each document is essentially
indexed on every word other than the most common words. Google uses this
approach.
 Indexing can be a very resource intensive process and is often CPU-bound since it
involves a lot of analysis of each page.
 Building an index requires document analysis and term extraction. Term extraction
may involve extracting all the words from each page, elimination of stop words
(common words like “the”, “it”, “and”, “that”) and stemming (transforming words
like “computer”, “computing” and “computation” into one word, say “computer”) to
help build, for example, the inverted file structure.
 A good indexing approach is to create an index that will relate keywords to pages.
The overheads of maintaining such indexes can be high.
 An inverted index, rather than listing keywords for each page, lists pages for each
keyword.
 This description is not quite accurate because an inverted database normally still
maintains a collection of pages, including their content, in some way.
 As an example for inverted index consider the data structure shown below:
Words Page ID
Data 10, 18, 26, 41
Mining 26, 30, 31, 41
Search 72, 74, 45, 46
Engine 72, 74, 79, 82
Google 72, 74, 90
 This index specifies a number of keywords and the pages that contain the keywords.
If we were searching for “data mining” we look for pages for each keyword “data”
and “mining” and then find the intersection of the two lists and obtain pages
numbers 26 & 41.
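A toy Python sketch of term extraction and an inverted index is given below (not from the text). The stop-word list and the crude suffix-stripping function stand in for a real stemmer such as Porter's, and the three pages are hypothetical.

STOP_WORDS = {"the", "it", "and", "that", "a", "of", "is", "in"}

def stem(word):
    # crude suffix stripping; a real system would use a proper stemmer
    for suffix in ("ation", "ing", "ers", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_inverted_index(pages):          # pages: {page_id: text}
    index = {}
    for page_id, text in pages.items():
        for word in text.lower().split():
            if word in STOP_WORDS:
                continue
            index.setdefault(stem(word), set()).add(page_id)
    return index

pages = {26: "data mining and data warehousing",
         41: "mining frequent patterns in data",
         72: "search engine architecture"}
index = build_inverted_index(pages)
# Answer the query "data mining" by intersecting the two posting lists
result = index.get(stem("data"), set()) & index.get(stem("mining"), set())
print(sorted(result))                     # [26, 41]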
 The challenging part is when the two lists are too large. The usual approach is to split the
lists across many machines to find the intersection quickly.
 When the index is distributed over many machines, the distribution may be based on
a local inverted index or global inverted index.
 Local inverted index results in each machine searching for candidate documents for
each query while the global inverted index is so distributed that each machine is
responsible for only some of the index terms.
Updating the index
 As the crawler updates the search engine database, the inverted index must also be
updated.
 Depending on how the index is stored, incremental updating may be relatively easy
but at some stage after many incremental updates it may become necessary to
rebuild the whole index, which normally will take substantial resources.
User profiling
 Currently most search engines provide just one type of interface to the user. They
provide an input box in which the user types in the keywords & then waits for the
results.
 Interface other than those currently provided are being investigated. They include
form fill-in or a menu or even a simple natural language interface.
Query server
 First of all, a search engine needs to receive the query and check the spelling of
keywords that the user has typed.
 If the search engine cannot recognize the keywords as words in the language or
proper nouns, it is desirable to suggest alternative spellings to the user.
 Once the keywords are found to be acceptable, the query may need to be
transformed.
 Stemming is then carried out and stop words are removed unless they form part of a
phrase that the user wants to search for.
Query composition
 Query composition is a major problem. Often a user has to submit a query a number
of times using somewhat different keywords before more or less the “right” result is
obtained.
 A search engine providing query refinement based on user feedback would be
useful. Search engines often cache the results of a query and can then use the
cached results if the refined query is a modification of a query that has already been
processed.
Query processing
 Search engine query processing is different from query processing in relational
database systems. In database systems, query processing requires that attribute
values match exactly the values provided in the query.
 In a search engine an exact match is not always necessary. A partial match or a
fuzzy match may be sufficient.
 Some search engines provide an indication of how well the document matches the
user query.
 A major search engine will run a collection of machines, several replicating the index.
A query may then be broadcast to all that are not heavily loaded.
 If the query consists of a number of keywords, then the indexes will need to be
searched a number of times and the intersection of the results found.
 A search engine may put weight on different words in a query. Ex: If query has words
“going” “to” “Paris”, then the search engine will put greater weight on “Paris”
Caching query results:
 Although more bandwidth is available & cheap, the delay in fetching pages from the
web continues.
 A common approach is to use web caches & proxies as intermediaries between the
client browser processes and the machines serving the web pages.
 A web cache or proxy is an intermediary that acts as a server on behalf of some
other server by supplying web resources without contacting that server.
 Essentially a cache is supposed to catch a user’s request, get the page that the user
has requested if one is not available in the cache & then save a copy of it in the
cache.
 It is assumed that if a page is saved, there is a strong likelihood that the same or another
user will request it again (a small illustrative cache sketch is given at the end of this subsection).
 Web has very popular sites which are requested very frequently. The proxy satisfies
such requests without going to the web server.
 A proxy cache must have algorithms to replace pages in the hope that the pages
being stored in the cache are fresh and are likely to be pages that users are likely to
request.
 Advantages of proxy caching are:
It reduces average latency in fetching web pages, reduces network traffic &
reduces load on busy web servers.
Since all user requests go through cache, it is possible for site managers to
monitor who is accessing what information & if necessary block some sites.
 A part of the user’s hard disk is dedicated to a browser cache to store pages that the user has
browsed. The browser cache is useful when the user hits the back button since the most
recently visited pages are likely to be in the cache and do not need to be fetched
again.
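As an illustration only, the sketch below shows a cache with a least-recently-used (LRU) replacement policy of the kind a proxy or browser cache might use; LRUCache and the fetch function are hypothetical names.

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.pages = OrderedDict()          # url -> page content

    def get(self, url, fetch_from_origin):
        if url in self.pages:               # cache hit: serve without contacting the server
            self.pages.move_to_end(url)
            return self.pages[url]
        page = fetch_from_origin(url)       # cache miss: fetch the page, then keep a copy
        self.pages[url] = page
        if len(self.pages) > self.capacity: # evict the least recently used page
            self.pages.popitem(last=False)
        return page

cache = LRUCache(capacity=2)
fetch = lambda url: f"<html>content of {url}</html>"   # stands in for a real fetch
cache.get("http://a.example/", fetch)
cache.get("http://b.example/", fetch)
cache.get("http://a.example/", fetch)    # hit, served from the cache
cache.get("http://c.example/", fetch)    # miss, evicts http://b.example/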
RESULTS PRESENTATION
 All search engines present results ordered according to relevance. Therefore, before
presentation of results to user, the result must be ranked.
 Relevance ranking is based on page popularity in terms of the number of back links to
the page.
 How much of what has been found should be presented to the user? Most search
engines present the retrieved items as a list on the screen with some information for
each item, e.g. type of document, its URL & a brief extract from the document.
RANKING OF WEB PAGES
Why page ranking is required? (2m)
Define damping factor (2m)
Write short notes on ranking of web pages. (5m)
Explain the page ranking algorithm with an example. (10m)
 The web contains a large number of documents published without any quality control. It
is important to find methods for determining the importance & quality of pages, given a
topic.
Search engines differ in size & so the number of documents that they index may
be quite different. No two search engines will have exactly the same pages on
a given topic.
Page Rank Algorithm
 Google has the most well-known ranking algorithm, called the Page Rank Algorithm
that has been claimed to supply top ranking page that are relevant.
 It is based on using the hyperlinks as indicators of a page’s importance. It is almost
like vote counting in an election.
 Every unique page is assigned a page rank. If a lot of pages vote for a page by linking
to it then the page that is being pointed to will be considered important.
 Votes cast by a link farm (a page with many links) are given less importance than
votes cast by an article that only links to a few pages.
 Internal site links also count in assessing page rank. Page rank has no relationship
with the content of the page
 Page Rank was developed based on a probability model of a random surfer visiting a
web page. The probability of a random surfer clicking on a link may be estimated
based on the number of links on that page
 The probability of the surfer clicking any link is assumed to be inversely proportional
to the number of links on the page. The more links a page has, the less the chance
that a surfer will click on a particular link.
 The probability of the random surfer reaching one page is the sum of the probabilities
of the random surfer following links to that page.
 The model also introduces a damping factor – the random surfer sometimes does not
click any link & jumps to another page.
 If the damping factor is d (assumed to be between 0 and 1) then the probability of the
surfer jumping off to another page is assumed to be 1-d.
 The higher the value of d, the more likely the surfer will follow one of the links. Given that
the surfer has a 1-d probability of jumping to some random page, every page has a minimum
page rank of 1-d.
The algorithm works as follows:
 Let A be the page whose Page Rank PR(A) is required. Let C(T1) be the number of links
going out from page T1. The page rank of A is then given by:
 PR(A) = (1-d) + d (PR(T1)/C(T1) + PR(T2)/C(T2) + … + PR(Tn)/C(Tn)), where d is the
damping factor and T1, …, Tn are the pages that link to A.
 To compute the page rank of A, we need to know the page rank of each page that
points to it and the number of out-links from each of those pages.
Example
 Let us consider an example of three pages. We are given the following information:
1. Damping factor=0.8
2. Page A has an out-link to B
3. Page B has an out-link to A & another to C
4. Page C has an out-link to A
5. Starting page rank for each page is 1
 [Figure: link structure of the example – A links to B; B links to A and to C; C links to A]
 The page rank equations may be written as:
PR(A) = 0.2 + 0.4 PR(B) + 0.8 PR(C)
PR(B) = 0.2 + 0.8 PR(A)
PR(C) = 0.2 + 0.4 PR(B)
 These are three linear equations in 3 unknowns, so we may solve them. If we replace
PR(A) by a & the others similarly, they may be written as:
a − 0.4b − 0.8c = 0.2
b − 0.8a = 0.2
c − 0.4b = 0.2
 The solution of the above equations is:
a = PR(A) = 1.19, b = PR(B) = 1.15, c = PR(C) = 0.66 (a small code sketch verifying this follows).
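The same three-page example can be checked with a small iterative Python sketch (an illustration of this note, not part of the original description); it applies the page rank formula repeatedly until the values settle.

def page_rank(out_links, d=0.8, iterations=50):
    ranks = {page: 1.0 for page in out_links}        # starting page rank of 1
    for _ in range(iterations):
        new_ranks = {}
        for page in out_links:
            incoming = [p for p in out_links if page in out_links[p]]
            new_ranks[page] = (1 - d) + d * sum(
                ranks[p] / len(out_links[p]) for p in incoming)
        ranks = new_ranks
    return ranks

# A links to B; B links to A and C; C links to A (the example above)
out_links = {"A": ["B"], "B": ["A", "C"], "C": ["A"]}
print(page_rank(out_links))   # roughly {'A': 1.19, 'B': 1.15, 'C': 0.66}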
Weaknesses of the algorithm
Mention some weaknesses of the page rank algorithm (2M)
 One weakness is that a link cannot always be considered a recommendation & people are
now trying to influence the links.
 A major weakness is that this algorithm focuses on the importance of a page rather than
on its relevance given the user query.
 A side effect of page rank therefore can be something like this, e.g. suppose
amazon.com decides to get into the real estate business.
 Obviously its real estate web site will have links from all the other amazon.com sites, which
have no relevance whatsoever to real estate. When a user then searches for real
estate, the amazon.com real estate site is likely to be served high in the results because of
its high page rank, while most other real estate businesses, whatever their quality, will
appear further down the list.
 Because of the nature of the algorithm, page rank does not deal with pages fairly since
it makes high page rank pages even more popular by serving them at the top of the results.
Thus the rich get richer & the poor get poorer. It takes time for a new page to become
popular, even if the new page is of high quality.
Other issues in page ranking
 Some page rank algorithms consider the following in determining page rank
 Page title including keywords
 Page content including links, keywords, keywords in link text, spelling
errors, length of the page
 Out-links including relevance of links to the content
 In-links & their importance
 In-link text – keywords
 In-linking page content
 Traffic including the number of unique visitors & the time spent on the
page.
Yahoo! web rank
 Yahoo! has developed its own page ranking algorithm, called Web Rank.
 The rank is calculated by analyzing the web page text, title & description as well as its
associated links & other unique document characteristics.

UNIT-4 COMPLETED

FAQ
Two marks:
1. What are the categories of web mining?
2. Define URL.
3. What is web mining?
4. List out the goal of Web search.
5. What is domain name server?
6. Define crawler.
7. Define cookie.
8. List out the Three major components of search engine.
9. Give the three category of web mining.
10. Explain the terms of precision and recall

Five marks:
1. Write about page rank algorithm.
2. Write down the search engine functionality.
3. Explain about web content mining.
4. Explain about locality and hierarchical in the web.
5. Describe about web structure mining.

Ten marks:
1. Write a note on web content mining.
2. Write a note on web structure mining in detail.
3. Explain about search engine architecture.
4. How is the web represented using graph terminology? Explain briefly.
5. Explain the important characteristics of web with respect to architecture and
standards.
UNIT-V
DATA WAREHOUSING
Introduction

Define data warehouse. (2m)


1. A data warehouse is a subject-oriented, integrated, nonvolatile and time-variant
collection of data in support of management’s decisions.
2. A data warehouse is a semantically consistent data store that serves as a physical
implementation of a decision support data model and stores the information on which
an enterprise needs to make strategic decisions. A data warehouse is also often viewed
as an architecture constructed by integrating data from multiple heterogeneous sources to
support structured and/or ad hoc queries, analytical reporting and decision making.
Define OLTP & on-demand query processing (2m).
Define on demand query processing & eager approach (2m).
Discuss the different types of need for the management staff in an enterprise (5m).
 The OLTP systems are mostly relational database systems designed for transaction
processing.
 The performance of OLTP systems is usually very important since such systems are
used to support the users that provide service to the customers.
 The systems therefore must be able to deal with insert and update operations as well as
answering simple queries quickly.
 The OLTP systems are suitably indexed and designed to handle such transactions but
they are not designed to handle management queries efficiently.
 One simple solution to meet the needs of managers is to pose queries of interest to a
mediator system that decomposes each query into appropriate sub-queries for the
systems that the query needs to access, obtains results from those systems and then
combines and presents the results to the user. This is sometimes called lazy or on-demand
query processing.
 Another approach is that one could collect the most common queries that the managers
ask, run them regularly and have the results available when a manager poses one of
those queries. This approach is called the eager approach.
 The third approach is somewhere in between the two extremes.
 It involves the creation of a separate database that only stores information that is of
interest to the management staff and involves the following two steps:
 The information needs of management staff are analyzed and the
enterprise systems that store some or all of the information are
identified. A suitable data model is developed and the information is
then extracted, transformed and loaded into the new database.
 The new database is then used to answer management queries and the
OLTP systems are not accessed for such queries
Operational data sources
Define ODS. (2m)
What is meant by ZLE? What is its aim? (2m)
What is meant by ODS? Discuss about it. (5m)
Discuss how to implement and design the ODS system. (10m)
 To meet the information needs of management staff, one possible solution is to build a
separate database. One such approach is called the operational data store (ODS) approach.
 An ODS is designed to provide a consolidated view of the enterprise’s current
operational information.
 An ODS has been defined as a subject-oriented, integrated, volatile, current-valued
data store, containing only corporate detailed data.
 The ODS is :
 Subject-oriented – that is, it is organized around the major data subjects of
an enterprise.
 Integrated – that is, it is a collection of subject-oriented data from a
variety of systems to provide an enterprise-wide view of the data.
 Current-valued – that is, an ODS is up-to-date and reflects the current status
of the information.
 Volatile – that is, the data in the ODS changes frequently as new
information refreshes the ODS.
 Detailed – that is, the ODS is detailed enough to serve the needs of the
operational management staff in the enterprise.
 An ODS does not include historical data, since the OLTP systems’ data is changing all
the time. An ODS may also be used as an interim database for a data warehouse.
 Since the contents of an ODS are updated frequently, it may also be used to quickly
perform relatively simple queries (such as finding the status of a customer order).
 An ODS may be viewed as an enterprise’s short-term memory in that it stores only
very recent information.
ODS design and implementation
 To implement an ODS, just like implementing any database system, a data model
should be developed.
The model should match the structure of the enterprise.
 [Figure: a typical ODS architecture – source systems feed an ETL (extraction, transformation,
loading) process that populates the ODS, which serves end users and management reports]
 Although all the attributes included in the source databases do not need to be included
in the ODS, all attributes that are likely to be needed by the operational management
staff should be included.
 The extraction of information from source database needs to be efficient and the
quality of data needs to be maintained.
 Since the data is refreshed regularly and frequently suitable checks are required to
ensure quality of data after each refresh.
 A typical architecture is shown in the figure above.
Why a separate database?
 The most important reason why an ODS should be separate from the operational
database is that from time to time complex queries are likely to degrade the
performance of the OLTP system.
 The OLTP systems have to provide a quick response to operational users
Zero Latency Enterprise (ZLE)
 An ODS typically contains current operational data that is frequently updated
 The Gartner group has used ZLE for real-time integration of operational data so that
there is no significant delay in getting information from one part or one system of an
enterprise to another part or another system that needs the information
 Integration of heterogeneous systems can be a major hurdle because enterprises tend
to have many incompatible systems.
 The heart of a ZLE system is an operational data store. An ODS and a DW solve many such
problems but they are not always able to help management to react in time, i.e. with zero
latency.
 A ZLE data store is something like an ODS that is integrated and up-to-date.
 The aim of a ZLE data store is to allow management a single view of enterprise
information by bringing together relevant data in real –time and providing
management a “360-degree”view of the customer.
 A ZLE usually has the following characteristics:
 It has a unified view of the enterprise operational data
 It has a high level of availability and involves online refreshing of
information
Data warehouses
Define data warehouse. (2m)
What are all the benefits of using data warehouse (5m), compare OLTP & data
warehouse (5m), discuss about data warehouse structure and marts in detail (10m)
 Data warehouse is a process for assembling and managing data from various sources
for the purpose of gaining a single detailed view of an enterprise.
 A widely accepted definition is: data warehouse is an integrated, subject-oriented
and time –variant repository of information in support of management decision
making process.
 The primary aims in building a data warehouse are to provide a single version of the
truth about the enterprise information and to provide good performance for the ad hoc
management queries required for enterprise analysis, to manipulate, animate and
synthesize enterprise information.
 The benefit of implementing a data warehouse are as following :
 To provide a single version of truth about enterprise information
 To speed up ad hoc reports and queries that involve aggregation
across many attributes, which are resource intensive.
 To provide a system in which managers that do not have a strong
technical background are able to run complex queries.
 To provide a database that store relatively clean data
 To provide a database that store historical data that have been deleted
from the OLTP systems.

 The following table shows the comparison between OLTP systems and a data warehouse:
Property                      OLTP                   Data warehouse
Nature of the database        3NF                    Multidimensional
Indexes                       Few                    Many
Joins                         Many                   Some
Duplicated data               Normalized data        Denormalized data
Derived data and aggregates   Rare                   Common
Queries                       Mostly predefined      Mostly ad hoc
Nature of queries             Mostly simple          Mostly complex
Updates                       All the time           Not allowed, only refreshed
Historical data               Often not available    Essential
 A useful way of showing the relationship between OLTP systems, a data warehouse
and an ODS is given below:
 [Figure: relationship between the OLTP systems, the ODS and the data warehouse]
 A data warehouse does not store real time data and does not require real time updates
while the ODS does.
 Just like in building an ODS, data warehousing is a process of integrating enterprise-wide
data, originating from a variety of sources, into a single repository.
 The warehouse may be a central enterprise-wide data warehouse for use by all the
decision makers in the enterprise or it may consist of a number of smaller
data warehouses (often called data marts).
 A data mart stores information for a limited number of subject areas. For example, a
company might have a data mart about marketing and sales. A data mart may be used as a
proof of concept for a data warehouse.
 Steps involved in implementing data warehouses are:
 First of all, the aims of implementing a warehouse should
be clearly identified.
 Once the decision to implement has been made, one needs to
consider whether the enterprise is going to go to a vendor that provides
a full data warehousing solution or to other vendors who provide
parts of the solution.
 [Figure: a central data warehouse, fed by operational databases and legacy systems,
feeding several data marts]
The FEATURES OF DATA WAREHOUSE


Subject –oriented:
 A data warehouse is organized around major subjects such as customer, supplier,
product and sales. Rather than concentrating on the day-to-day operations and
transaction processing of an organization, a data warehouse focuses on the modeling
and analysis of data for decision makers.
 Hence data warehouses typically provide a simple and concise view around particular
subject issues by excluding data that are not useful in the decision support process.
Integrated:
 A data warehouse is usually constructed by integrating multiple heterogeneous sources
such as relational databases, flat files and online transaction records.
 Data cleaning and data integration techniques are applied to ensure consistency in
naming conventions, encoding structures, attribute measures and so on.

Time —variant:
 Data are stored to provide information from a historical perspective (e.g. the past 5-10
years). Every key structure in the data warehouse contains, either implicitly or
explicitly, an element of time.

Nonvolatile:
 A data warehouse is always a physically separate store of data transformed from the
application data found in the operational environment.
 Due to this separation, a data warehouse does not require transaction processing,
recovery and concurrency control mechanisms. It usually requires only two operations
in data accessing: initial loading of data and access of data.
 [Figure: data from the OLTP database is loaded into the separate data warehouse]

ODS and DW Architecture:
 [Figure: ODS and DW architecture – source systems and daily changes flow through ETL
processes into a staging area, then into the ODS, the data warehouse and the data marts]
 The architecture of a system that includes an ODS and a data warehouse is shown in the
figure above.
 It involves extracting information from source systems by using an ETL process and
then storing the information in a staging database. The daily changes also come to the
staging area.
 Another ETL process is used to transform information from the staging area to
populate the ODS. The ODS is then used for supplying information, via another ETL
process, to the data warehouse, which in turn feeds a number of data marts that generate
the reports required by management.
 It should be noted that not all ETL processes in this architecture involve data cleaning;
some may only involve data extraction and transformation to suit the target system.

Data warehouse design


Define star schema & snowflake schema. (2m)
Explain how to design data warehouses (5m/10m)
 There are a number of ways of conceptualizing a data warehouse.
 One approach is to view it as a three-level structure. The lowest level consists of the
OLTP and legacy systems, the middle level consists of the global or central data
warehouse, while the top level consists of local data warehouses or data marts.
 Another approach is possible if the enterprise has an ODS.
 To develop a data model, we view a data warehouse as a multidimensional structure
consisting of dimensions. In the multidimensional view of data, values that depend on
the dimensions are called measures.

 [Figure: a star schema with a Sales fact table (amount sold, quantity sold) surrounded by the
Products, Times, Channels and Customers dimensions]
A data warehouse model often consists of a central fact table and a set of surrounding
dimension tables on which the fact depends. Such a model is called a star schema because of
the shape of the model representation.
 A simple example of such a schema is for a university where we assume that the
number of students is given by the four dimensions: degree, year, country and
scholarship.
 These four dimensions were chosen because we are interested in finding out how
many students come to each degree program, each year, from each country, under each
scholarship scheme.
 A characteristic of a star schema is that all the dimensions directly link to the fact
table. The fact table may look like table 1.1 and the dimension tables may look like
tables 1.2 to 1.5.
An example of the fact table 1.1
year Degree name Country name Scholar name Number
2003-01 BSc Australia Govt 35
1999-02 MBBS Canada None 50
2000-02 LLB USA ABC 22
1999-01 B.Com UK Commonwealth 7
2001-02 LLB Australia Equity 2

 The first dimension is the degree dimension. An example of this dimension table is
shown below.
An example of the degree dimension table 1.2
Name     Faculty     Scholarship eligibility     Number of semesters
BSc      Science     Yes                         6
MBBS     Medicine    No                          10
LLB      Law         Yes                         8
B.Com    Business    No                          6
BA       Arts        No                          6

 We now present the second dimension the country dimension. An example of this
dimension table 1.3:
An example of the country dimension table 1.3
Name Continent Education level Major religion
Nepal Asia Low Hinduism
Indonesia Asia Low Islam
Norway Europe High Christianity
Singapore Asia High Null
Colombia South America Low Christianity

 The third dimension is the scholarship dimension. The dimension table 1.4:
An example of the scholarship dimension table 1.4
Name Amount (%) Scholarship eligibility Number
Colombo 100 All 6
Equity 100 Low income 10
Asia 50 Top 5% 8
Merit 75 Top 5% 5
Bursary 25 Low income 12

 The fourth dimension is the year dimension. The dimension table 1.5
An example of the year dimension table 1.5
Name New programs
2001 Journalism
2002 Multimedia
2003 Biotechnology

 The following figure shows a star schema for the model with four dimensions.
 [Figure: Star schema for the four-dimensional example – a central fact table (degree name,
country name, scholarship name, year, number of students, revenue) linked to the Degree
(name, faculty, scholarship eligibility, number of semesters), Country (name, continent,
education level, major religion), Scholarship (name, amount, eligible, last year) and Year
(name, new programs) dimension tables]

 Star schemas may be refined into snowflake schemas if we wish to provide support for
dimension hierarchies by allowing the dimension tables to have sub-tables to represent
the hierarchies.
 The star and snowflake schemas are intuitive, easy to understand, can deal with
aggregate data and are easily extendible by adding new attributes or new dimensions.
 [Figure: Example of a snowflake schema – the fact table (degree name, scholarship name,
number of students) links to the Degree (name, faculty, scholarship eligibility, number of
semesters) and Scholarship (name, amount, eligibility) dimensions, with the Degree dimension
linked to a Faculty sub-table (name, number of academic staff, budget)]
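As a sketch of how a management query might be answered against such a star schema, the Python code below (assuming the pandas library is available) joins a few fact rows taken from table 1.1 to the degree dimension of table 1.2 and aggregates the measure; the column names used are illustrative.

import pandas as pd

# A few rows of the fact table (table 1.1), with the scholarship column omitted
fact = pd.DataFrame({
    "year": ["2003-01", "1999-02", "2000-02", "1999-01"],
    "degree_name": ["BSc", "MBBS", "LLB", "B.Com"],
    "country_name": ["Australia", "Canada", "USA", "UK"],
    "number": [35, 50, 22, 7],
})
# Part of the degree dimension (table 1.2)
degree_dim = pd.DataFrame({
    "degree_name": ["BSc", "MBBS", "LLB", "B.Com"],
    "faculty": ["Science", "Medicine", "Law", "Business"],
})

# "How many students were admitted per faculty?" -- join the fact table to the
# dimension table, group by the dimension attribute and sum the measure
report = (fact.merge(degree_dim, on="degree_name")
              .groupby("faculty")["number"].sum())
print(report)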
Guidelines for data warehouse implementation
Discuss the implementation steps and guidelines for data warehousing (10m)
 A data warehouse may be built as a centralized data warehouse with data marts or a
distributed data warehouse depending on needs of an enterprise.
 A centralized data warehouse perhaps is relatively simple and is therefore a common
choice for most large organizations.
 The benefit of this approach is that each data mart can have a local view of the data
and most processing can be carried out in the local data mart.
Implementation steps
1. requirements analysis and capacity planning :
 The first step in data warehousing involves defining enterprise needs, defining
architecture, carrying out capacity planning and selecting the hardware and
software tools.
 These steps will involve consulting with senior management as well as with
the various stakeholders.
2. hardware integration:
 Once the hardware and software have been selected, they need to be put
together by integrating the servers, the storage devices and the client software
tools.
3. modeling:
 Modeling is a major step that involves designing the warehouse schema and
views. This may involve using a modeling tool if the data warehouse is
complex
4. ETL:
 The data from the source systems will need to go through an ETL process.
 The step of designing and implementing the ETL process may involve
identifying a suitable ETL tool vendor and purchasing and implementing the
tool.
5. Populating the data warehouse.
 Once the ETL tools have been agreed upon, testing them will be required, perhaps
using a staging area.

6. user application:
 For the data warehouse to be useful there must be end-user applications. This
step involves designing and implementing application required by the end
users
7. Roll-out the warehouse and applications:
 Once the data warehouse has been populated and the end-user applications
tested, the warehouse system and the applications may be rolled out for the
user community to use.
Implementation Guidelines
These are general guidelines not all of them will be applicable to every data warehouse
project.
1. Build incrementally:
 A data warehouse must be built incrementally. Generally it is recommended that
a data mart may first be built with one particular project in mind and, once it
is implemented, a number of other sections of the enterprise may also wish to
implement similar systems.
2. Need a champion:
 A data warehouse project must have a champion who is willing to carry out
considerable research into expected costs and benefits of the project.
 Data warehousing projects require inputs from many units in an enterprise and
therefore need to be driven by someone who is capable of interacting with
people in the enterprise and can actively persuade colleagues.
3. Senior management support:
 A data warehouse project must be fully supported by the senior management.
 Given the resource-intensive nature of such projects and the time they can take
to implement, a warehouse project calls for a sustained commitment from
senior management.
4. Ensure quality:
 Only data that has been cleaned and is of a quality that is understood by the
organization should be loaded in the data warehouse.
 The data quality in the source systems is not always high and often little effort
is made to improve data quality in the source systems
5. Corporate strategy:
 A data warehouse project must fit with corporate strategy and business
objectives. The objectives of the project must be clearly defined before the
start the project.
6. Business plan:
 The financial costs, expected benefits and a project plan for a data warehouse project
must be clearly outlined and understood by all stakeholders.
7. Training:
 A data warehouse project must not overlook the training requirements.
 For a data warehouse project to be successful, the users must be trained to use
the warehouse and to understand its capabilities.
8. Adaptability:
 The project should build in adaptability so that changes may be made to the
data warehouse if and when required. Like any system, a data warehouse will
need to change as enterprise needs change.

9. Joint management:
 The project must be managed by both IT and business professional in the
enterprise.
 To ensure good communication with the stakeholders and that the project is
focused on assisting the enterprise business, business professionals must be
involved in the project along with technical professionals.
Data warehousing metadata
Define metadata. (2m)
Explain about data warehouse metadata (5m)
 Given the complexity of information in an ODS and a data warehouse, it is
essential that there be a mechanism for users to easily find out what
data is there and how it can be used to meet their needs. Providing metadata
about the ODS or the data warehouse achieves this.
 Metadata is data about the data in the warehouse; it answers the “who, what, when,
why, and how” questions about the data warehouse.
 Ex: a library catalogue may be considered metadata.
 The table of contents and the index in a book may be considered metadata for
the book.
 In a database, metadata usually consists of table (relation) lists, primary key
names, attribute names, their domains, schemas, record counts and perhaps a list
of the most common queries.
 Additional information may be provided including logical and physical data
structures and when and what data was loaded
 In the context of a data warehouse, metadata has been defined as “all of the
information in the data warehouse environment that is not the actual data
itself”.
 In a data warehouse, metadata needs to be much more comprehensive. It may
be classified into two groups:
 Back room metadata and front room metadata.
 The front room metadata is more descriptive and could include
information needed by the users.
 Metadata needs to be complete, correct and current, and should also be
accessible, perhaps via the web using secure access, to all users.
Difference between a data warehouse and a data mart
Data warehouse                            Data mart
Corporate/enterprise-wide                 Department-wide
Union of all data marts                   A single business process
Takes time to build                       Faster and easier implementation
High risk of failure                      Low risk of failure
Structure for corporate view of data      Structure to suit the departmental view of data
Data received from staging area           Data received from star joins (facts & dimensions)
Queries on presentation resource          Well-structured architecture

Online analytical processing (OLAP)


Define data cube. (2m)
Introduction:
 A data cube computes aggregates over all subsets of the dimensions specified in the cube.
 A data cube is often implemented as a database in which there are dimension tables,
each of which provides details of a dimension. This database may be the enterprise
data warehouse.
OLAP
Define OLAP. (2m)
Differentiate between OLAP and OLTP systems. (5m)
 OLAP is primarily a software technology concerned with fast analysis of enterprise
information.
 Often OLAP systems are data warehouse front-end software tools to make aggregate
data available efficiently, for advanced analysis, to an enterprise’s managers.
 It is essential that an OLAP system provides facilities for a manager to pose ad hoc
complex queries to obtain the information required. Another term that is being used
increasingly is business intelligence.
 Another definition of OLAP is that OLAP is software technology that enables
analysts, managers and executives to gain insight into data through fast, consistent,
interactive access to a wide variety of possible views of information that has been
transformed from raw data to reflect the real dimensionality of the enterprise as
understood by the user.
 An even simpler definition is that OLAP is fast analysis of shared multidimensional
information for advanced analysis.
 The following is a comparison between OLTP systems and OLAP systems.
Users:
 OLTP systems are designed for office workers while OLAP systems are designed
for decision makers.

100
Function:
 OLTP systems are mission-critical. OLAP systems, on the other hand, are
management-critical.
Nature:
 Although SQL queries often return a set of records, OLTP systems are designed to
process one record at a time, for example a record related to the customer who might
be on the phone or in the store.
 OLAP systems are not designed to deal with individual customer records.
Design:
 OLTP database systems are designed to be application-oriented while OLAP systems
are designed to be subject-oriented.
Data:
 OLTP systems normally deal only with the current status of information.
Kind of use:
 OLTP systems are used for read and write operations while OLAP systems normally
do not update the data.

CHARACTERISTICS OF OLAP SYSTEMS


Briefly write about FASMI characteristics. (5m)
Write short notes on Codd’s OLAP Characteristics. (5m)
What are the characteristics of OLAP systems? Discuss in detail. (10m)
FASMI Characteristics
The FASMI characteristics of OLAP systems, the name derived from the first letters of the
characteristics, are:
Fast:
 Most OLAP queries should be answered very quickly, perhaps within seconds. The
performance of an OLAP system has to be like that of a search engine.
 Achieving such performance is difficult. The data structures must be efficient.
 The hardware must be powerful enough for the amount of data and the number of
users. One approach is to pre-compute the most commonly queried aggregates and
compute the remaining on-the-fly.
Analytic:
 An OLAP system must provide rich analysis functionality and it is expected that most
OLAP queries can be answered without any programming.
 The systems should be able to cope with any relevant queries for the application and
the user.
Shared:
 An OLAP system is a shared resource although it is unlikely to be shared by hundreds
of users. Being a shared system, an OLAP system should provide adequate security
for confidentiality as well as integrity.
Multidimensional:
 This is the basic requirement. Whatever OLAP software is being used, it must provide
a multidimensional conceptual view of the data.
 It is because of the multidimensional view of data that we often refer to the data as a
cube.
Information:
 OLAP systems usually obtain information from a data warehouse. The system should
be able to handle a large amount of input data.
 The capacity of an OLAP system to handle information and its integration with the
data warehouse may be critical.
Codd’s OLAP Characteristics
 Codd listed characteristics (or rules) of OLAP systems. The most important ones are as
follows:
Multidimensional conceptual view:
 As noted above, this is the central characteristics of an OLAP system. By requiring a
multidimensional view, it is possible to carry out operations like slice and dice.
Accessibility (OLAP as a mediator):
 The OLAP software should be sitting between data sources and an OLAP front-end.
Batch extraction Vs interpretive:
 An OLAP system should provide multidimensional data staging plus partial
precalculation of aggregates in large multidimensional database.

Multi-user support:
 Since the OLAP system is shared, the OLAP software should provide many normal
database operations including retrieval, update, concurrency control, integrity and
security.
Storing OLAP results:
 OLAP results data should be kept separate from source data.
 Read-write OLAP applications should not be implemented directly on live transaction
data if OLTP source systems are supplying information to the OLAP system directly.
Extraction of missing values:
 The OLAP system should distinguish missing values from zero values. A large data
cube may have a large number of zeros as well as some missing values.
 If a distinction is not made between zero values and missing values, the aggregates
are likely to be computed incorrectly.
Treatment of missing values:
 An OLAP system should ignore all missing values regardless of their source. Correct
aggregate values will be computed once the missing values are ignored.
Uniform reporting performance:
 Increasing the number of dimensions or database size should not significantly degrade
the reporting performance of the OLAP system.
 This is a good objective although it may be difficult to achieve in practice.
Generic dimensionality:
 An OLAP system should treat each dimension as equivalent in both its structure and
operational capabilities.
 Additional operational capabilities may be granted to selected dimensions but such
additional functions should be grantable to any dimension.
Unlimited dimension and aggregation levels:
 An OLAP system should allow unlimited dimension and aggregation levels. In
practice, the number of dimensions is rarely more than 10 and the number of
hierarchies rarely more than six.
MULTIDIMENSIONAL VIEW AND DATA CUBE
Explain how the data can be viewed in multidimensional with appropriate example.
(10m)
 The multidimensional view of data is in some way a natural view of any enterprise
for managers.
 We illustrate the multidimensional view of data by using an example of a simple
OLAP database consisting of three tables.
 The first relation (student) provides information about students. The second relation
(enrolment) provides information about the student degree enrolment and the semester
of their first enrolment.
 The third relation (degree) provides information about the degree in the university and
the department that offer them.
Student (Student-id, Student-name, Country, DOB, Address)
Enrolment (Student-id, Degree-id, SSemester)
Degree (Degree-id, Degree-name, Degree-length, Fee, Department)

 The table 2.1 presents an example of the relation enrolment.


 In this table, the attribute SSemester is the semester in which the student started the
current degree. We code it by using the year followed by 01 for the first semester and
02 for the second.
The relation enrolment Table 2.1
Student_id Degree_id SSemester
8900020 1256 2002-01
8700074 3271 2002-01
8700074 3321 2002-02
8900020 4444 2000-01
8801234 1256 2001-01
8801234 3321 1999-02
8801234 3333 1999-02
8977665 3333 2000-02
 Table 2.2 shows an example of the relation degree. The degree length is the number of
semesters it normally takes to finish the degree. The fee is assumed to be in thousands of
dollars per year. The relation degree – Table 2.2
Degree-id Degree-name Degree-length Fee Department
1256 BIT 6 18 CS
2345 BSc 6 20 CS
4325 BSc 6 20 Chemistry
3471 BSc 6 20 Physics
3321 B. Com 6 16 Business
4444 MBBS 12 30 Medicine
3333 LLB 8 22 Law
 It is clear that the information given above is not suitable for efficient
management decision making.
 What the managers need is the trends in student numbers in different degree programs
and from different countries.
 We first consider only two dimensions. Therefore we may visualize the data as two-
dimensional.
Country x Degree
 A table that summarizes this type of information may be represented by a two-
dimensional spreadsheet given in table 2.3.
 We may call this table that gives the number of students admitted (in, say, 2000-01) a
two-dimensional “cube”. Table 2.3: A two-dimensional table of aggregates for
semester 2000-01
Degree/country B. SC LLB MBBS B. Com BIT ALL
Australia 5 20 15 50 11 101
India 10 1 15 25 17 67
Malaysia 5 1 10 12 23 51
Singapore 2 2 10 10 31 55
Sweden 5 0 5 25 7 42
UK 5 15 20 20 13 73
USA 0 2 20 15 19 56
ALL 32 40 95 157 121 445
 Using this two-dimensional view we are able to find the number of students joining
any degree from any country (only for semester 2000-01).
 Other queries we are able to answer are:
 How many students started BIT in 2000-01?
 How many students joined from Singapore in 2000-01?
 A similar table may be available for other semesters. Let us assume that the data for
2001-01 is given in table 2.4: A two-dimensional table of aggregates for semester
2001-01 (Table 2.4)
Degree/country B. SC LLB MBBS B. Com BIT ALL
Australia 7 10 16 53 10 96
India 9 0 17 22 13 61
Malaysia 5 1 19 19 20 64
Singapore 2 2 10 12 23 49
Sweden 8 0 5 16 7 36
UK 4 13 20 26 11 74
USA 4 2 10 10 12 38
ALL 39 28 97 158 96 418
 Let us now imagine that table 2.3 is placed on top of table 2.4. We now have a three-
dimensional cube with SSemester as the vertical dimension.
 We now put on top of these two tables another table that gives the vertical sums, as
shown in table 2.5.
Two-dimensional table of aggregate for both semesters Table 2.5
Degree/country B. SC LLB MBBS B. Com BIT ALL
Australia 12 30 31 103 21 197
India 19 0 32 47 30 128
Malaysia 10 2 29 31 43 115
Singapore 4 4 20 22 54 104
Sweden 13 0 10 41 14 78
UK 9 28 40 46 24 147
USA 4 4 30 25 31 94
ALL 71 68 192 315 217 863
 All the above tables together now form a three-dimensional cube.
 Note that a cube does not need to have an equal number of members in each
dimension. Putting the three tables together gives a cube of 8 x 6 x 3 (=144) cells
including the total along every dimension.
 A cube such as we have described could be represented by: Country x Degree x
Semester
[Figure: the cube formed by the tables above]
 Each of the edges in the cube represents a dimension. Each dimension has a number
of members. Similarly the dimensions Country and SSemester have members.
 All the cells in the cube represent measures or aggregations. Different types of
measures may behave differently in computations.
 We can add the number of students admitted in 2000-01 to the number admitted in
2001-01 to obtain the total number of students in the two semesters. Such measures are
called additive.
 Some measures cannot be so added. Such measured are called semi-additive. Some
measures, called non-additive, cannot be combined along any dimensions.
 Many different multidimensional views of the same data cube are possible and the
same data may be consolidated in different ways.
 In the three-dimensional cube discussed here, the following eight types of
aggregations or queries (the term view has also been used since any query may be
defined as a view) are possible (a code sketch follows the list):
1. Null (e.g. how many students are there? Only 1 possible query)
2. Degrees (e.g. how many students are doing BSc? 5 possible queries if we assume
5 different degrees)
3. Semester (e.g. how many students entered in semester 2000-01? 2 possible queries
if we only have data about 2 semesters.)
4. Country (e.g. how many student are from the USA? 7 possible queries if there are
7 countries)
5. Degrees, semester (e.g. how many students entered in 2000-01 to enroll in Bcom?
With 5 degrees and 2 different semesters we have 10 different queries of this type)
6. Semester, country (e.g. how many students from the UK entered in 2000-01? With 2
semesters and 7 different countries we have 14 different queries of this type)
7. Degrees, country (e.g. how many students from Singapore are enrolled in B Com?
With 5 degrees and 7 different countries we have 35 different queries of this type)
8. All (e.g. how many students from Malaysia entered in 2000-01 to enroll in Bcom?
With 5 degrees, 2 different semesters and 7 different countries we have 70
different queries of this type)
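A sketch of how such a multidimensional view could be produced with the pandas library is shown below; the small admissions table is a hypothetical detail table behind tables 2.3 and 2.4, and pivot_table with margins=True adds the ALL totals.

import pandas as pd

admissions = pd.DataFrame({
    "country":  ["Australia", "Australia", "India", "India"],
    "degree":   ["BIT", "BIT", "BSc", "BSc"],
    "semester": ["2000-01", "2001-01", "2000-01", "2001-01"],
    "students": [11, 10, 10, 9],
})

# One slice of the cube per semester (compare tables 2.3 and 2.4)
for sem, part in admissions.groupby("semester"):
    print(sem)
    print(pd.pivot_table(part, values="students", index="country",
                         columns="degree", aggfunc="sum",
                         margins=True, margins_name="ALL"))

# Aggregating over the semester dimension gives the equivalent of table 2.5
print(pd.pivot_table(admissions, values="students", index="country",
                     columns="degree", aggfunc="sum",
                     margins=True, margins_name="ALL"))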
DATA CUBE IMPLEMENTTATION
Briefly explain about MOLAP & ROLAP. (5m)
Differentiate MOLAP & ROLAP. (5m)
Explain how to implement Data Cube in multidimensional data. (5m/10m)
 We have already identified 10,000 aggregates for a simple example and the number of
aggregates can be very large for a real example.
 Millions of aggregates are likely in a large enterprise when there are more than three
dimensions.
 Should all these aggregates be pre-computed? Many of the aggregates will be zero.
How should they be stored? Some possible solutions are:
Pre-compute and store all:
 This means that perhaps millions of aggregates will need to be computed and stored.
 Although this is the best solution as far as query response time is concerned, the
solution is impractical for a large data cube.
Pre-compute (and store) none:
 This means that the aggregates are computed on-the-fly using the raw data whenever a
query is posed.
 This approach does not require additional space for storing the cube but the query
response time is likely to be very poor for large data cubes.
Pre-compute and store some:
 This means that we pre-compute and store the most frequently queried aggregates and
compute others as the need arise.
 It is not appropriate to use data directly from OLTP systems because OLAP
applications are often large and use complex aggregate analysis.
 Also, OLAP often uses data from more than one OLTP system and the process of
merging these multiple systems can be complex.
 The first model, supported by vendors of traditional relational model databases, is
called the ROLAP model or Relational OLAP model.
 The second model is called the MOLAP model for multidimensional OLAP.
ROLAP
 ROLAP uses a relational DBMS to implement an OLAP environment.
 It may be considered a bottom-up approach which is typically based on using a data
warehouse that has been designed using a star schema. The data therefore is likely to
be in a denormalized structure.
MOLAP
 MOLAP is based on using a multidimensional DBMS rather than a data warehouse to
store and access data. It may be considered as a top-down approach to OLAP.
 The multi-dimensional database systems do not have a standard approach to storing
and maintaining their data.
 They often use special-purpose file system or indexes that store pre-computation of all
aggregations in the cube.
 For example, in ROLAP a cell was represented as (BSc, India, 2001-01) with a value
30 stored in the last column.
 In MOLAP, the same information is stored as just 30 and the storage location
implicitly gives the values of the dimensions.
 For example, the cube in fig 1.1 may be represented as an array like the following:
12 30 31 103 21 197 19 0 32 47 30 128 10 2 29 31 43 …
 If the values of the dimensions are known, we can infer the cell location in the array.
If the cell location is known, the values of the dimension may be inferred.
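A minimal Python sketch of this idea is given below (an illustration, not the actual storage format of any MOLAP product): the cube is kept as one flat array and a row-major offset maps dimension member positions to a cell.

dims = {"country": ["Australia", "India", "Malaysia", "Singapore",
                    "Sweden", "UK", "USA", "ALL"],
        "degree": ["BSc", "LLB", "MBBS", "B.Com", "BIT", "ALL"],
        "semester": ["2000-01", "2001-01", "ALL"]}
sizes = [len(v) for v in dims.values()]          # [8, 6, 3]
cube = [0] * (sizes[0] * sizes[1] * sizes[2])    # 144 cells stored as one flat array

def offset(country, degree, semester):
    i = dims["country"].index(country)
    j = dims["degree"].index(degree)
    k = dims["semester"].index(semester)
    return (i * sizes[1] + j) * sizes[2] + k     # row-major cell location

cube[offset("Australia", "BIT", "2000-01")] = 11     # value from table 2.3
print(cube[offset("Australia", "BIT", "2000-01")])   # 11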
 This is obviously a very compact notation for storing a multidimensional data cube
although a couple of problems remain.
 First, the array is likely to be too large to be stored in the main
memory.
 Secondly, this representation does not solve the problem of efficiently
representing sparse cubes.
 To overcome the problem of the array being too large for the main memory, the array
may be split into pieces called “chunks”, each of which is small enough to fit in the
main memory.
 It has been suggested that MOLAP is easier to use and therefore may be more suitable
for inexperienced users.
 Users who need a tool that can use for frequently changing query requirements may
be better advised to use ROLAP.
 The main differences between ROLAP and MOLAP, as discussed above, are:
ROLAP: data stored in relational tables (often a star schema); a bottom-up approach based
on a data warehouse; suited to frequently changing query requirements.
MOLAP: data stored in multidimensional arrays or special-purpose file structures with
pre-computed aggregates; a top-down approach using a multidimensional DBMS; easier to
use and therefore more suitable for inexperienced users.
DATA CUBE OPERATIONS
List out the types of data cube operation. (2m)
Explain the types of operations applied in data cubes. (5m/10m)
 A number of operations may be applied to data cubes. The common ones are:
1. Roll-up   2. Drill-down   3. Slice and dice   4. Pivot
 Roll-up is like zooming out on the data cube. It is required when the user needs
further abstraction or less detail. This operation performs further aggregations on the
data.
 We provide an example of roll-up based on table 2.5. We first define hierarchies
on two dimensions. Among countries, let us define:
1. Asia (India, Malaysia, Singapore)
2. Europe (Sweden, UK)
3. Rest (Australia, USA)
 Another hierarchy is defined on the dimension degree:
1. Science (BSc, BIT)
2. Medicine (MBBS)
3. Business and Law (B Com, LLB)
 The result of a roll-up that aggregates both semesters together, using the hierarchies
above, is shown below (a code sketch follows the table):
Table 3.1 Result of a roll-up operation
Degree/Country    Science   Medicine   Business and Law   ALL
Asia              160       81         106                347
Europe            60        50         115                225
Rest              68        61         162                291
ALL               288       192        383                863
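The roll-up can be sketched in plain Python as below; the region and group mappings are the hierarchies defined above, and the few cells shown are taken from table 2.5.

region = {"India": "Asia", "Malaysia": "Asia", "Singapore": "Asia",
          "Sweden": "Europe", "UK": "Europe",
          "Australia": "Rest", "USA": "Rest"}
group = {"BSc": "Science", "BIT": "Science", "MBBS": "Medicine",
         "B.Com": "Business and Law", "LLB": "Business and Law"}

# (country, degree) -> number of students, a few cells taken from table 2.5
cells = {("India", "BSc"): 19, ("India", "BIT"): 30,
         ("Malaysia", "BSc"): 10, ("Malaysia", "BIT"): 43,
         ("Singapore", "BSc"): 4, ("Singapore", "BIT"): 54}

rolled_up = {}
for (country, degree), n in cells.items():
    key = (region[country], group[degree])          # map members to their hierarchy level
    rolled_up[key] = rolled_up.get(key, 0) + n      # re-aggregate at the coarser level

print(rolled_up)   # {('Asia', 'Science'): 160}, matching table 3.1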
Drill-down
 Drill-down is like zooming in on the data and is therefore the reverse of roll-up.
 It is an appropriate operation when the user needs further details or when the user
wants to partition more finely or wants to focus on some particular values of certain
dimension.
 Drill-down adds more details to the data. A hierarchy defined on a dimension may be
involved in drill-down.
 For example, if someone is interested in more detail then it is possible to drill-down to
tables 2.3 & 2.4 for student numbers in each of the semesters for each country and for
each degree.

Slice and Dice


 Slice and dice are operations for browsing the data in the cube. The terms refer to the
ability to look at information from different viewpoints.
 A slice is a subset of the cube corresponding to a single value for one or more
members of the dimensions.
 For example, a slice operation is performed when the user wants a selection on one
dimension of a three-dimensional cube resulting in a two-dimensional slice.
 The dice operation is similar to slice but dicing does not involve reducing the number
of dimensions. A dice is obtained by performing a selection on two or more
dimensions.
 For example, one may be interested in the degrees BIT and B Com and the countries
Australia, India and Malaysia for semesters 2000-01 and 2001-01.
Table 3.2: A dice from the three-dimensional cube (SSemester 2000-01)
Degree/country B Com BIT
Australia 50 11
India 25 17
Malaysia 12 23
Table 3.3: A dice from the three-dimensional cube (SSemester 2001-01)
Degree/country B Com BIT
Australia 53 10
India 22 13
Malaysia 19 20

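A sketch of slice and dice using the pandas library is given below; the admissions detail table is hypothetical but its values match table 2.3.

import pandas as pd

admissions = pd.DataFrame({
    "country":  ["Australia", "Australia", "India", "India", "Malaysia", "Malaysia"],
    "degree":   ["B Com", "BIT", "B Com", "BIT", "B Com", "BIT"],
    "semester": ["2000-01"] * 6,
    "students": [50, 11, 25, 17, 12, 23],
})

# Slice: fix one dimension (semester = 2000-01) to get a two-dimensional view
slice_2000_01 = admissions[admissions.semester == "2000-01"]

# Dice: select on two or more dimensions without dropping any of them
dice = admissions[admissions.country.isin(["Australia", "India", "Malaysia"])
                  & admissions.degree.isin(["B Com", "BIT"])]
print(dice.pivot_table(values="students", index="country",
                       columns="degree", aggfunc="sum"))   # compare table 3.2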
Pivot or Rotate
 The pivot operation is used when the user wishes to re-orient the view of the data
cube.
 It may involve swapping the rows and columns, or moving one of the row dimensions
into the column dimension.
Table on which other similar tables are piled up in a rotated cube (Country dimension
value = Australia)
Degree/Semester BSc LLB MBBS B Com BIT ALL
2000-01 5 20 15 50 11 101
2001-01 7 10 16 53 10 96
ALL 12 30 31 103 21 197
Another table that is part of a rotated cube (Country dimension value= India)
Degree/Semester BSc LLB MBBS B Com BIT ALL
2000-01 10 0 15 25 17 67
2001-01 9 0 17 22 13 61
ALL 19 0 32 47 30 128
 Clearly this rotated cube gives a different view of the same data. Such views can be
particularly useful if the number of dimensions is greater than three.
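A pivot can be sketched very simply with pandas by transposing a two-dimensional view (the values below are taken from the Australia rows of the table above).

import pandas as pd

view = pd.DataFrame({"BSc": [5, 7], "LLB": [20, 10], "MBBS": [15, 16]},
                    index=["2000-01", "2001-01"])   # semester rows, degree columns
print(view)      # degrees across the top
print(view.T)    # after the pivot: semesters across the top, degrees as rows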
OLAP IMPLEMENTATION GUIDELINES
What are the guidelines used for OLAP implementation? (5m)
 The guidelines are somewhat similar to those presented for data warehouse implementation.
They are:
Vision:
 The OLAP team must, in consultation with the users, develop a clear vision for the OLAP
system.
 This vision including the business objectives should be clearly defined, understood,
and shared by the stakeholders.
Senior Management Support:
 The OLAP project should be fully supported by the senior managers. Since a data
warehouse may have developed already, this should not be difficult.
Selecting an OLAP tool:
 The OLAP team should familiarize themselves with the ROLAP and MOLAP tools
available in the market.
 Since tools are quite different, careful planning may be required in selecting a tool
that is appropriate for the enterprise. In some situations, a combination of ROLAP and
MOLAP may be most effective.
Corporate Strategy:
 The OLAP strategy should fit with the enterprise strategy and business objectives. A
good fit will result in the OLAP tools being used more widely.
Focus on the Users:
 The OLAP project should be focused on the users. Users should, in consultation with
the technical professionals, decide what task will be done first and what will be done
later.
 Attempts should be made to provide each user with a tool suitable for that person’s
skill level and information needs. A good GUI should be provided to
non-technical users.
 The project can only be successful with the full support of the users.
Joint Management:
 The OLAP project must be managed by both the IT and business professionals.
 Many other people should be involved in supplying ideas. An appropriate committee
structure may be necessary to channel these ideas.
Review and adapt: Regular reviews of the project may be required to ensure that the project
is meeting the current needs of the enterprise.

FAQ
Two marks:
1. Define data warehouse.
2. What do you mean by OLAP?
3. What is meant by eager approach?
4. Define measures.
5. What is star schema?
6. Expand :ZLE and ODS.
7. List out four data operations applied to data cubes.
8. Give the acronym for FASMI of OLAP.

Five marks:
1. Compare operational data stores and data warehouse.
2. Describe the data cube operations: drill down and Slice and dice.
3. Explain about data warehouse design.
4. Describe data cube implementation.
5. Describe about data cube operations: Rollup and slice and dice
6. Discuss about data warehouse.
7. What are the major differences between an ODS and a DW?
8. Explain about operational data source.
9. Give the guidelines for OLAP implementation.

Ten marks:
1. Write in detail about data warehouse design.
2. Explain about multidimensional view and data cube in OLAP.
3. Explain briefly about operational data stores.
4. Explain about data cube operations with examples.
5. Describe data warehouse in detail.