What’s the difference between data mining and data warehousing?
Data mining is the process of finding patterns in a given data set. These
patterns can often provide meaningful insight to whoever is interested in
the data. Data mining is used today in a wide variety of contexts – in
fraud detection, as an aid in marketing campaigns, and even by
supermarkets to study their customers.
Data warehousing can be said to be the process of centralizing or
aggregating data from multiple sources into one common repository.
Example of data mining
If you’ve ever used a credit card, then you may know that credit card
companies will alert you when they think that your credit card is being
fraudulently used by someone other than you. This is a perfect example
of data mining – credit card companies have a history of your purchases
from the past and know geographically where those purchases have been
made. If all of a sudden some purchases are made in a city far from
where you live, the credit card companies are put on alert to a possible
fraud since their data mining shows that you don’t normally make
purchases in that city. Then, the credit card company can disable your
card for that transaction or just put a flag on your card for suspicious
activity.
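To make this concrete, here is a toy Python sketch of that kind of location check. The purchase history, city names, and threshold are all invented for illustration; a real card network would use far richer statistical models over many features.

```python
# Toy sketch of location-based fraud flagging (illustrative only).
# The history, cities, and threshold below are invented; real systems
# model many more features than "have we seen this city before?".

purchase_history = ["Boston", "Boston", "Cambridge", "Boston"]  # past purchase cities

def is_suspicious(city, history, min_prior_visits=1):
    """Flag a transaction made in a city the card has rarely been used in."""
    return history.count(city) < min_prior_visits

print(is_suspicious("Boston", purchase_history))     # False: a familiar city
print(is_suspicious("Las Vegas", purchase_history))  # True: never seen before
```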
Another interesting example of data mining is how one grocery store in
the USA used the data it collected on its shoppers to find patterns in
their shopping habits.
They found that when men bought diapers on Thursdays and Saturdays,
they also had a strong tendency to buy beer.
The grocery store could have used this valuable information to increase
their profits. One thing they could have done – odd as it sounds – is
move the beer display closer to the diapers. Or, they could have simply
made sure not to give any discounts on beer on Thursdays and
Saturdays. This is data mining in action – extracting meaningful data
from a huge data set.
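A minimal version of this kind of association mining can be written in a few lines of Python. The transactions below are invented; the point is just to show how "confidence" and "lift" quantify the diaper–beer pattern.

```python
# Minimal association-mining sketch: measure how often beer co-occurs
# with diapers, relative to how often beer is bought overall ("lift").
# The transactions below are invented purely for illustration.

transactions = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "beer", "milk"},
    {"bread", "chips"},
]

n = len(transactions)
p_beer = sum("beer" in t for t in transactions) / n
p_diapers = sum("diapers" in t for t in transactions) / n
p_both = sum({"diapers", "beer"} <= t for t in transactions) / n

confidence = p_both / p_diapers   # P(beer | diapers)
lift = confidence / p_beer        # > 1 means a positive association

print(f"confidence(diapers -> beer) = {confidence:.2f}")  # 1.00
print(f"lift = {lift:.2f}")                               # 1.67
```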
Example of data warehousing – Facebook
A great example of data warehousing that everyone can relate to is what
Facebook does. Facebook basically gathers all of your data – your
friends, your likes, who you stalk, etc. – and then stores that data in one
central repository. Even though Facebook most likely stores your
friends, your likes, etc., in separate databases, they do want to take the
most relevant and important information and put it into one central
aggregated database. Why would they want to do this? For many reasons
– they want to make sure that you see the most relevant ads that you’re
most likely to click on, they want to make sure that the friends that they
suggest are the most relevant to you, etc – keep in mind that this is the
data mining phase, in which meaningful data and patterns are extracted
from the aggregated data. But, underlying all these motives is the main
motive: to make more money – after all, Facebook is a business.
We can say that data warehousing is basically a process in which data
from multiple sources/databases is combined into one comprehensive
and easily accessible database. Then this data is readily available to any
business professionals, managers, etc. who need to use the data to create
forecasts – and who basically use the data for data mining.
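Here is a minimal Python sketch of that warehousing step, assuming two hypothetical source databases (friends and likes) that get merged into one central repository keyed by user. The names and fields are made up for illustration.

```python
# Minimal data warehousing sketch: extract records from several source
# "databases", transform them into one shape, and load them into a single
# central repository keyed by user. All names and fields are hypothetical.

friends_db = {"alice": ["bob", "carol"], "bob": ["alice"]}
likes_db = {"alice": ["hiking", "jazz"], "bob": ["chess"]}

warehouse = {}  # the central, aggregated repository

for user, friends in friends_db.items():   # extract + load friends
    warehouse.setdefault(user, {})["friends"] = friends

for user, likes in likes_db.items():       # extract + load likes
    warehouse.setdefault(user, {})["likes"] = likes

# Now one query answers questions that used to span two databases:
print(warehouse["alice"])
# {'friends': ['bob', 'carol'], 'likes': ['hiking', 'jazz']}
```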
Data warehousing vs. data mining
Remember that data warehousing is a process that must occur before any
data mining can take place. In other words, data warehousing is the
process of compiling and organizing data into one common database,
and data mining is the process of extracting meaningful data from that
database. The data mining process relies on the data compiled in the
data warehousing phase in order to detect meaningful patterns.
In the Facebook example that we gave, the data mining will typically be
done by business users who are not engineers, but who will most likely
receive assistance from engineers when they are trying to manipulate
their data. The data warehousing phase is a strictly engineering phase,
where no business users are involved. And this gives us another way of
defining the two terms: data mining is typically done by business users
with the assistance of engineers, and data warehousing is typically a
process done exclusively by engineers.
| Sr.No. | Data Warehouse (OLAP) | Operational Database (OLTP) |
|--------|-----------------------|-----------------------------|
| 1 | It involves historical processing of information. | It involves day-to-day processing. |
| 2 | OLAP systems are used by knowledge workers such as executives, managers, and analysts. | OLTP systems are used by clerks, DBAs, or database professionals. |
| 3 | It is used to analyze the business. | It is used to run the business. |
| 4 | It focuses on information out. | It focuses on data in. |
| 5 | It is based on Star Schema, Snowflake Schema, and Fact Constellation Schema. | It is based on the Entity Relationship Model. |
| 6 | It is subject oriented. | It is application oriented. |
| 7 | It contains historical data. | It contains current data. |
| 8 | It provides summarized and consolidated data. | It provides primitive and highly detailed data. |
| 9 | It provides a summarized and multidimensional view of data. | It provides a detailed and flat relational view of data. |
| 10 | The number of users is in the hundreds. | The number of users is in the thousands. |
| 11 | The number of records accessed is in the millions. | The number of records accessed is in the tens. |
| 12 | The database size is from 100 GB to 100 PB. | The database size is from 100 MB to 100 GB. |
| 13 | These are highly flexible. | It provides high performance. |
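The contrast in the table shows up in the kinds of queries each system runs. Below is a small sketch using Python's built-in sqlite3 module; the orders table and its rows are invented. The OLTP query touches one current record, while the OLAP query summarizes history.

```python
# Contrast of OLTP-style and OLAP-style queries against the same table,
# using Python's built-in sqlite3. The orders data is invented.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, "alice", 2022, 30.0), (2, "bob", 2022, 45.0),
     (3, "alice", 2023, 25.0), (4, "carol", 2023, 60.0)],
)

# OLTP: touch a handful of current records (run the business).
print(conn.execute("SELECT * FROM orders WHERE id = 3").fetchone())

# OLAP: summarize large swaths of historical data (analyze the business).
print(conn.execute(
    "SELECT year, SUM(amount) FROM orders GROUP BY year"
).fetchall())
```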
How data mining works
While large-scale information technology has been evolving separate
transaction and analytical systems, data mining provides the link
between the two. Data mining software analyzes relationships and
patterns in stored transaction data based on open-ended user queries.
Several types of analytical software are available: statistical, machine
learning, and neural networks. Generally, any of four types of
relationships are sought:
● Classes: Stored data is used to locate data in predetermined
groups. For example, a restaurant chain could mine customer
purchase data to determine when customers visit and what they
typically order. This information could be used to increase traffic
by having daily specials.
● Clusters: Data items are grouped according to logical relationships
or consumer preferences. For example, data can be mined to
identify market segments or consumer affinities (see the clustering
sketch after this list).
● Associations: Data can be mined to identify associations. The
beer-diaper example is an example of associative mining.
● Sequential patterns: Data is mined to anticipate behavior patterns
and trends. For example, an outdoor equipment retailer could
predict the likelihood of a backpack being purchased based on a
consumer's purchase of sleeping bags and hiking shoes.
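As promised above, here is a brief sketch of the "clusters" relationship using scikit-learn's KMeans. The library choice and the customer figures are my own assumptions, not something the text prescribes.

```python
# Sketch of cluster-style mining: group customers into market segments
# by annual spend and visit frequency. The data is invented; assumes
# scikit-learn and NumPy are installed.

import numpy as np
from sklearn.cluster import KMeans

# Each row: [annual spend in dollars, visits per month]
customers = np.array([
    [200, 1], [250, 2], [220, 1],      # light shoppers
    [1200, 8], [1100, 9], [1300, 7],   # heavy shoppers
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
segments = kmeans.fit_predict(customers)
print(segments)  # e.g. [0 0 0 1 1 1]: two market segments
```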
Data mining consists of five major elements:
● Extract, transform, and load transaction data onto the data
warehouse system.
● Store and manage the data in a multidimensional database system
(see the pivot-table sketch after this list).
● Provide data access to business analysts and information
technology professionals.
● Analyze the data with application software.
● Present the data in a useful format, such as a graph or table.
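The "store in a multidimensional system" and "present in a useful format" elements can be approximated in a few lines with a pandas pivot table. This is only a sketch, assuming pandas is installed; the sales rows are made up.

```python
# Sketch of a multidimensional, summarized view of transaction data,
# in the spirit of the storage and presentation steps above. Assumes
# pandas is installed; the sales rows are invented.

import pandas as pd

sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100.0, 120.0, 90.0, 130.0],
})

# One dimension per axis (region x quarter), revenue summed in each cell.
cube = sales.pivot_table(values="revenue", index="region",
                         columns="quarter", aggfunc="sum")
print(cube)
```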
Different levels of analysis are available:
● Artificial neural networks: Non-linear predictive models that
learn through training and resemble biological neural networks in
structure.
● Genetic algorithms: Optimization techniques that use processes
such as genetic combination, mutation, and natural selection in a
design based on the concepts of natural evolution.
● Decision trees: Tree-shaped structures that represent sets of
decisions. These decisions generate rules for the classification of a
dataset. Specific decision tree methods include Classification and
Regression Trees (CART) and Chi-Square Automatic Interaction
Detection (CHAID). CART and CHAID are decision tree
techniques used for classification of a dataset. They provide a set
of rules that you can apply to a new (unclassified) dataset to
predict which records will have a given outcome. CART segments
a dataset by creating 2-way splits while CHAID segments using
chi-square tests to create multi-way splits. CART typically requires
less data preparation than CHAID (see the sketch after this list).
● Nearest neighbor method: A technique that classifies each record
in a dataset based on a combination of the classes of the k record(s)
most similar to it in a historical dataset (where k ≥ 1). Sometimes
called the k-nearest neighbor technique; it is also illustrated in the
sketch after this list.
● Rule induction: The extraction of useful if-then rules from data
based on statistical significance.
● Data visualization: The visual interpretation of complex
relationships in multidimensional data. Graphics tools are used to
illustrate data relationships.
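To ground the decision tree and nearest neighbor entries, here is a small sketch using scikit-learn (an assumption on my part; the text names no library). DecisionTreeClassifier builds CART-style binary splits, and KNeighborsClassifier classifies by the k most similar records. The toy purchase data echoes the backpack example earlier.

```python
# Sketch of two of the techniques above on the same toy data:
# a CART-style decision tree and k-nearest neighbors. Assumes
# scikit-learn is installed; the data is invented.

from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Each row: [sleeping bags bought, hiking shoes bought]; label: bought a backpack?
X = [[1, 1], [1, 0], [0, 1], [0, 0], [2, 1], [0, 2]]
y = [1, 0, 0, 0, 1, 1]

tree = DecisionTreeClassifier().fit(X, y)            # CART-style 2-way splits
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)  # vote among k=3 neighbors

new_customer = [[1, 2]]  # bought one sleeping bag and two pairs of shoes
print(tree.predict(new_customer))  # e.g. [1]: likely to buy a backpack
print(knn.predict(new_customer))   # e.g. [1]
```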
What technological infrastructure is required?
Today, data mining applications are available on all size systems for
mainframe, client/server, and PC platforms. System prices range from
several thousand dollars for the smallest applications up to $1 million a
terabyte for the largest. Enterprise-wide applications generally range in
size from 10 gigabytes to over 11 terabytes. NCR has the capacity to
deliver applications exceeding 100 terabytes. There are two critical
technological drivers:
● Size of the database: the more data being processed and
maintained, the more powerful the system required.
● Query complexity: the more complex the queries and the greater
the number of queries being processed, the more powerful the
system required.
Relational database storage and management technology is adequate for
many data mining applications of less than 50 gigabytes. However, this
infrastructure needs to be significantly enhanced to support larger
applications. Some vendors have added extensive indexing capabilities
to improve query performance. Others use new hardware architectures
such as Massively Parallel Processors (MPP) to achieve order-of-
magnitude improvements in query time. For example, MPP systems
from NCR link hundreds of high-speed Pentium processors to achieve
performance levels exceeding those of the largest supercomputers.