Data Mining Models David L. Olson Download

The document discusses the second edition of 'Data Mining Models' by David L. Olson, which highlights the importance of data mining in business, its processes, and applications. It emphasizes the use of various data mining tools, such as KNIME and Rattle, to analyze large datasets for actionable insights in areas like retail, banking, and insurance. The book aims to provide a comprehensive understanding of data mining techniques and their practical implementations in business contexts.

Data Mining Models
Second Edition

David L. Olson
Data Mining Models, Second Edition

Copyright © Business Expert Press, LLC, 2018.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or
transmitted in any form or by any means—electronic, mechanical, photocopy, recording, or any
other except for brief quotations, not to exceed 400 words, without the prior permission of the
publisher.

First published in 2016 by


Business Expert Press, LLC
222 East 46th Street, New York, NY 10017
www.businessexpertpress.com

ISBN-13: 978-1-94858-049-6 (paperback)


ISBN-13: 978-1-94858-050-2 (e-book)

Business Expert Press Big Data and Business Analytics Collection

Collection ISSN: 2333-6749 (print)


Collection ISSN: 2333-6757 (electronic)

Cover and interior design by Exeter Premedia Services Private Ltd., Chennai, India

Second edition: 2018

10 9 8 7 6 5 4 3 2 1

Printed in the United States of America.


Abstract
Data mining has become the fastest growing topic of interest in business programs in
the past decade. This book is intended to first describe the benefits of data mining in
business, describe the process and typical business applications, describe the workings
of basic data mining models, and demonstrate each with widely available free
software. This second edition updates Chapter 1, and adds more details on Rattle data
mining tools.
The book focuses on demonstrating common business data mining applications. It
provides exposure to the data mining process, to include problem identification, data
management, and available modeling tools. The book takes the approach of
demonstrating typical business data sets with open source software. KNIME is a very
easy-to-use tool, and is used as the primary means of demonstration. R is much more
powerful and is a commercially viable data mining tool. We will demonstrate use of R
through Rattle. We also demonstrate WEKA, a highly useful academic software
package, although its difficulty in manipulating test sets and new cases makes it
problematic for commercial use. We will demonstrate methods with a small but
typical business dataset. We use a larger (but still small) realistic business dataset for
Chapter 9.

Keywords
big data, business analytics, clustering, data mining, decision trees, neural network
models, regression models
Contents
Acknowledgments
Chapter 1 Data Mining in Business
Chapter 2 Business Data Mining Tools
Chapter 3 Data Mining Processes and Knowledge Discovery
Chapter 4 Overview of Data Mining Techniques
Chapter 5 Data Mining Software
Chapter 6 Regression Algorithms in Data Mining
Chapter 7 Neural Networks in Data Mining
Chapter 8 Decision Tree Algorithms
Chapter 9 Scalability

Notes
References
Index
Acknowledgments
I wish to recognize some of the many colleagues I have worked and published with,
specifically Yong Shi, Dursun Delen, Desheng Wu, and Ozgur Araz. There are many
others I have learned from in joint efforts as well, both students and colleagues, all of
whom I wish to recognize with hearty thanks.
CHAPTER 1

Data Mining in Business

Introduction
Data mining refers to the analysis of large quantities of data that are stored in
computers. Bar coding has made checkout very convenient for us and provides retail
establishments with masses of data. Grocery stores and other retail stores are able to
quickly process our purchases and use computers to accurately determine the product
prices. These same computers can help the stores with their inventory management,
by instantaneously determining the quantity of items of each product on hand.
Computers allow the store’s accounting system to more accurately measure costs and
determine the profit that store stockholders are concerned about. All of this
information is available based on the bar coding information attached to each product.
Along with many other sources of information, information gathered through bar
coding can be used for data mining analysis.
The era of big data is here, with many sources pointing out that more data were
created over the past year or two than were generated throughout all prior human
history. Big data involves datasets so large that traditional data analytic methods no
longer work due to data volume. Davenport1 gave the following features of big data:

• Data too big to fit on a single server
• Data too unstructured to fit in a row-and-column database
• Data flowing too continuously to fit into a static data warehouse
• Lack of structure is the most important aspect (even more than the size)
• The point is to analyze, converting data into insights, innovation, and business value

Big data has been said to be more about analytics than about the data itself. The era
of big data is expected to emphasize knowing what (based on correlation) rather than
the traditional obsession with causality. The emphasis will be on discovering patterns
offering novel and useful insights.2 Data will become a raw material for business, a
vital economic input and source of value. Cukier and Mayer-Schönberger3 cite the
following impacts of big data on the statistical body of theory established in the 20th
century: (1) There is so much data available that sampling is usually not needed
(n = all). (2) Precise accuracy of data is, thus, less important, as inevitable errors are
compensated for by the mass of data (any one observation is flooded by others).
(3) Correlation is more important than causality: most data mining applications
involving big data are interested in what is going to happen, and you don’t need to
know why. Automatic trading programs need to detect trend changes, not figure out
that the Greek economy collapsed or that the Chinese government will devalue the
Renminbi (RMB). The programs in vehicles need to detect that an axle bearing is
getting hot, the vehicle is vibrating, and the wheel should be replaced, not whether
this is due to a bearing failure or a housing rusting out.
There are many sources of big data.4 Internal to the corporation, e-mails, blogs,
enterprise systems, and automation lead to structured, unstructured, and
semistructured information within the organization. External data is also widely
available, much of it free over the Internet and much of it from commercial vendors.
Data is also obtainable from social media.
Data mining is not limited to business. Both major parties in the U.S. elections
utilize data mining of potential voters.5 Data mining has been heavily used in the
medical field, from diagnosis using patient records to identifying best practices.6
Business use of data mining is also impressive. Toyota used data mining of its data
warehouse to determine more efficient transportation routes, reducing the time to
deliver cars to their customers by an average of 19 days. Data warehouses are very large
scale database systems capable of systematically storing all transactional data
generated by a business organization, such as Walmart. Toyota also was able to
identify the sales trends faster and to identify the best locations for new dealerships.
Data mining is widely used by banking firms in soliciting credit card customers, by
insurance and telecommunication companies in detecting fraud, by manufacturing
firms in quality control, and many other applications. Data mining is being applied to
improve food product safety, criminal detection, and tourism. Micromarketing targets
small groups of highly responsive customers. Consumer and lifestyle data is
widely available, enabling customized individual marketing campaigns. This is
enabled by customer profiling, identifying those subsets of customers most likely to
be profitable to the business, as well as targeting, determining the characteristics of
the most profitable customers.
Data mining involves statistical and artificial intelligence (AI) analysis, usually
applied to large-scale datasets. There are two general types of data mining studies.
Hypothesis testing involves expressing a theory about the relationship between actions
and outcomes. This approach is referred to as supervised. In a simple form, it can be
hypothesized that advertising will yield greater profit. This relationship has long been
studied by retailing firms in the context of their specific operations. Data mining is
applied to identifying relationships based on large quantities of data, which could
include testing the response rates to various types of advertising on the sales and
profitability of specific product lines. However, there is more to data mining than the
technical tools used. The second form of data mining study is knowledge discovery.
Data mining involves a spirit of knowledge discovery (learning new and useful
things). Knowledge discovery is referred to as unsupervised. In this form of analysis,
a preconceived notion may not be present, but rather relationships can be identified by
looking at the data. This may be supported by visualization tools, which display data,
or through fundamental statistical analysis, such as correlation analysis. Much of this
can be accomplished through automatic means, as we will see in decision tree
analysis, for example. But data mining is not limited to automated analysis.
Knowledge discovery by humans can be enhanced by graphical tools and
identification of unexpected patterns through a combination of human and computer
interaction.

Requirements for Data Mining


Data mining requires identification of a problem, along with the collection of data that
can lead to better understanding, and computer models to provide statistical or other
means of analysis. A variety of analytic computer models have been used in data
mining. In the later sections, we will discuss various types of these models. Also
required is access to data. Quite often, systems including data warehouses and data
marts are used to manage large quantities of data. Other data mining analyses are done
with smaller sets of data, such as can be organized in online analytic processing
systems.
Masses of data generated from cash registers, scanning, and topic-specific databases
throughout the company are explored, analyzed, reduced, and reused. Searches are
performed across different models proposed for predicting sales, marketing response,
and profit. The classical statistical approaches are fundamental to data mining.
Automated AI methods are also used. However, a systematic exploration through
classical statistical methods is still the basis of data mining. Some of the tools
developed by the field of statistical analysis are harnessed through automatic control
(with some key human guidance) in dealing with data.
Data mining tools need to be versatile, scalable, capable of accurately predicting the
responses between actions and results, and capable of automatic implementation.
Versatile refers to the ability of the tool to apply a wide variety of models. Scalable
implies that if a tool works on a small dataset, it should also work on a larger
dataset. Automation is useful, but its application is relative. Some analytic functions
are often automated, but human setup prior to implementing procedures is required. In
fact, analyst judgment is critical to successful implementation of data mining. Proper
selection of data to include in searches is critical. Data transformation also is often
required. Too many variables produce too much output, while too few can overlook
the key relationships in the data.
Data mining is expanding rapidly, with many benefits to business. Two of the most
profitable application areas have been the use of customer segmentation by marketing
organizations to identify those with marginally greater probabilities of responding to
different forms of marketing media, and banks using data mining to more accurately
predict the likelihood that people will respond to offers of different services.
Many companies are using this technology to identify their blue-chip customers, so
that they can provide them with the service needed to retain them.
The casino business has also adopted data warehousing and data mining.
Historically, casinos have wanted to know everything about their customers. A typical
application for a casino is to issue special cards, which are used whenever the
customer plays at the casino, or eats, or stays, or spends money in other ways. The
points accumulated can be used for complimentary meals and lodging. More points
are awarded for activities that provide the casino (Harrah’s, in this example) more profit. The information obtained
is sent to the firm’s corporate database, where it is retained for several years. Instead
of advertising the loosest slots in town, Bellagio and Mandalay Bay have developed
the strategy of promoting luxury visits. Data mining is used to identify high rollers, so
that these valued customers can be cultivated. Data warehouses enable casinos to
estimate the lifetime value of the players. Incentive travel programs, in-house
promotions, corporate business, and customer follow-up are the tools used to maintain
the most profitable customers. Casino gaming is one of the richest datasets available.
Very specific individual profiles can be developed. Some customers are identified as
those who should be encouraged to play longer. Other customers are identified as
those who should be discouraged from playing.

Business Data Mining


Data mining has been very effective in many business venues. The key is to find
actionable information, that is, information that can be utilized in a concrete way to
improve profitability. Some of the earliest applications were in retailing, especially in
the form of market basket analysis. Table 1.1 shows the general application areas we
will be discussing. Note that they are meant to be representative rather than
comprehensive.

Table 1.1 Data mining application areas


Application area           Applications                                          Specifics
Retailing                  Affinity positioning                                  Position products effectively
                           Cross-selling; develop and maintain customer loyalty  Find more products for customers
Banking                    Customer relationship management (CRM)                Identify customer value
                                                                                 Develop programs to maximize the revenue
Credit card management     Lift                                                  Identify effective market segments
                           Churn (Loyalty)                                       Identify likely customer turnover
Insurance                  Fraud detection                                       Identify claims meriting investigation
Telecommunications         Churn                                                 Identify likely customer turnover
Telemarketing              Online information                                    Aid telemarketers with easy data access
                           Recommender systems
Human resource management  Churn (Retention)                                     Identify potential employee turnover

Retailing

Data mining offers retailers, in general, and grocery stores, specifically, valuable
predictive information from mountains of data. Affinity positioning is based on the
identification of products that the same customer is likely to want. For instance, if you
are interested in cold medicine, you probably are interested in tissues. Thus, it would
make marketing sense to locate both items within easy reach of the other. Cross-
selling is a related concept. The knowledge of products that go together can be used
by marketing the complementary product. Grocery stores do that through product
shelf positioning. Retail stores relying on advertising can send ads for sales on
shirts and ties to those who have recently purchased suits. These strategies have long
been employed by wise retailers. Recommender systems are effectively used by
Amazon and other online retailers. Data mining provides the ability to identify less
expected product affinities and cross-selling opportunities. These actions develop and
maintain customer loyalty.
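
The idea of product affinity can be illustrated with a small sketch that counts how often pairs of products appear in the same basket. This is only an illustration, not the method used by any particular retailer or tool: the baskets below are hypothetical, and the function name pair_support is our own.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions (illustrative only, not data from the book).
baskets = [
    {"cold medicine", "tissues", "orange juice"},
    {"cold medicine", "tissues"},
    {"suit", "shirt", "tie"},
    {"tissues", "orange juice"},
    {"suit", "tie"},
]

def pair_support(baskets):
    """Return the fraction of baskets that contain each unordered product pair."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

support = pair_support(baskets)

# Rank pairs by support (ties broken alphabetically) and keep the top three.
top_pairs = sorted(support.items(), key=lambda kv: (-kv[1], kv[0]))[:3]
for pair, share in top_pairs:
    print(pair, share)
```

Pairs with high support are candidates for affinity positioning or cross-selling. Real market basket analysis would add further measures and far larger data, but the counting step is the same in spirit.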
Grocery stores generate mountains of cash register data that require automated tools
for analysis. Software is marketed to service a spectrum of users. In the past, it was
assumed that cash register data was so massive that it couldn’t be quickly analyzed.
However, the current technology enables the grocers to look at customers who have
defected from a store, their purchase history, and characteristics of other potential
defectors.

Banking

The banking industry was one of the first users of data mining. Banks are turning to
technology to find out what motivates their customers and what will keep their
business (customer relationship management—CRM). CRM involves the application
of technology to monitor customer service, a function that is enhanced through data
mining support. Understanding the value a customer provides the firm makes it
possible to rationally evaluate if extra expenditure is appropriate in order to keep the
customer. There are many opportunities for data mining in banking. Data mining
applications in finance include predicting the prices of equities, which involves a
dynamic environment with surprise information, some of which might be inaccurate
and some of which might be too complex to comprehend and reconcile with intuition.
Data mining provides a way for banks to identify patterns. This is valuable in
assessing loan applications as well as in target marketing. Credit unions use data
mining to track member profitability as well as to monitor the effectiveness of
marketing programs and sales representatives. Data mining is also used in member
care, seeking to identify what credit union customers want in the way of
services.

Credit Card Management

The credit card industry has proven very profitable. It has attracted many card issuers,
and many customers carry four or five cards. Balance surfing is a common practice,
where the card user pays an old balance with a new card. Balance surfers are not
considered attractive customers, and one of the uses of data warehousing and data
mining is to identify them. The profitability of the industry has also attracted those who
wish to push the edge of credit risk, both from the customer and the card issuer
perspective. Bank credit card marketing promotions typically generate about 1,000
responses per 100,000 mailed solicitations, a response rate of about 1 percent. This rate is
improved significantly through data mining analysis.
Data mining tools used by banks include credit scoring. Credit scoring is a
quantified analysis of credit applicants with respect to the prediction of on-time loan
repayment. A key is a consolidated data warehouse, covering all products, including
demand deposits, savings, loans, credit cards, insurance, annuities, retirement
programs, securities underwriting, and every other product banks provide. Credit
scoring provides a number for each applicant by multiplying a set of weights
determined by the data mining analysis by the ratings for that
applicant. These credit scores can be used to make accept or reject recommendations,
as well as to establish the size of a credit line. Credit scoring used to be conducted by
bank loan officers, who considered a few tested variables, such as employment,
income, age, assets, debt, and loan history. Data mining makes it possible to include
many more variables, with greater accuracy.
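
A minimal sketch of the scoring step may make this concrete. We assume here the common linear form, in which each rating is multiplied by its weight and the products are summed. The text does not spell out how the products are combined, so the sum is our assumption, and the variable names, weights, and cutoff are hypothetical.

```python
# Hypothetical weights; in practice these would come from the data mining analysis.
WEIGHTS = {
    "employment_years": 2.0,
    "income_thousands": 0.5,
    "debt_ratio": -40.0,
}

def credit_score(ratings, weights=WEIGHTS):
    """Weighted sum of an applicant's ratings (assumed linear scorecard)."""
    return sum(weights[name] * ratings[name] for name in weights)

def recommend(score, cutoff=30.0):
    """Turn a score into an accept or reject recommendation (cutoff is assumed)."""
    return "accept" if score >= cutoff else "reject"

applicant = {"employment_years": 5, "income_thousands": 60, "debt_ratio": 0.25}
score = credit_score(applicant)
print(score, recommend(score))
```

The same score could also be mapped to the size of a credit line rather than to a simple accept or reject decision.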
The new wave of technology is broadening the application of database use and
targeted marketing strategies. In the early 1990s, nearly all credit card issuers were
mass-marketing to expand their card-holder bases. However, with so many cards
available, broad-based marketing campaigns have not been as effective as they
initially were. Card issuers are more carefully examining the expected net present
value of each customer. Data warehouses provide the information, giving the issuers
the ability to try to more accurately predict what the customer is interested in, as well
as their potential value to the issuer. Desktop campaign management software is used
by the more advanced credit card issuers, utilizing data mining tools, such as neural
networks, to recognize customer behavior patterns to predict their future relationship
with the bank.

Insurance

The insurance industry utilizes data mining for marketing, just as retailing and
banking organizations do, but it also has specialty applications. Farmers
Insurance Group has developed a system for underwriting, which generates millions
of dollars in higher revenues and lower claims. The system allows the firm to better
understand narrow market niches and to predict losses for specific lines of insurance.
One discovery was that it could lower its rates on sports cars, which increased its
market share for this product line significantly.
Unfortunately, our complex society leads to some inappropriate business operations,
including insurance fraud. Specialists in this underground industry often use multiple
personas to bilk insurance companies, especially in the automobile insurance
environment. Fraud detection software uses a similarity search engine, analyzing
information in company claims for similarities. By linking names, telephone numbers,
streets, birthdays, and other information with slight variations, patterns can be
identified that indicate possible fraud. The similarity search engine has been found to
be able to identify up to seven times more fraud than exact-match systems.
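
The contrast between similarity search and exact matching can be sketched with the SequenceMatcher class from Python's standard difflib module. This is not the commercial engine referred to above. The claim records, the 0.85 threshold, and the helper names are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical claim records (illustrative only).
claims = [
    {"id": 1, "name": "Jon Smith", "phone": "555-0101", "street": "12 Oak St"},
    {"id": 2, "name": "John Smith", "phone": "555-0101", "street": "12 Oak Street"},
    {"id": 3, "name": "Maria Lopez", "phone": "555-0199", "street": "98 Pine Ave"},
]

def record_key(claim):
    """Join the fields to be compared into one lowercase string."""
    return " ".join([claim["name"], claim["phone"], claim["street"]]).lower()

def similarity(a, b):
    """Similarity ratio between 0 and 1 for two strings."""
    return SequenceMatcher(None, a, b).ratio()

def suspicious_pairs(claims, threshold=0.85):
    """Return id pairs of claims whose records are nearly, but not necessarily exactly, equal."""
    pairs = []
    for a, b in combinations(claims, 2):
        if similarity(record_key(a), record_key(b)) >= threshold:
            pairs.append((a["id"], b["id"]))
    return pairs

flagged = suspicious_pairs(claims)
exact = [(a["id"], b["id"]) for a, b in combinations(claims, 2)
         if record_key(a) == record_key(b)]
print(flagged, exact)
```

An exact-match comparison finds nothing in this data, while the similarity comparison links the two records that differ only by slight variations in name and street.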

Telecommunications

Deregulation of the telephone industry has led to widespread competition. Telephone
service carriers fight hard for customers. The problem is that once a customer is
obtained, that customer is targeted by competitors, and retention of customers is very
difficult. The phenomenon of a customer switching carriers is referred to as churn, a
fundamental concept in telemarketing as well as in other fields.
A director of product marketing for a communications company considered that
one-third of churn is due to poor call quality and up to one-half is due to poor
equipment. That firm has a wireless telephone performance monitor tracking
telephones with poor performance. This system reduced churn by an estimated 61
percent, amounting to about 3 percent of the firm’s overall subscribers over the course
of a year. When a telephone begins to go bad, the telemarketing personnel are alerted
to contact the customer and suggest bringing in the equipment for service.
Another way to reduce churn is to protect customers from subscription and cloning
fraud. Cloning has been estimated to have cost the wireless industry millions. A
number of fraud prevention systems are marketed. These systems provide verification
that is transparent to the legitimate subscribers. Subscription fraud has been estimated
to have an economic impact of $1.1 billion. Deadbeat accounts and service shutoffs
are used to screen potentially fraudulent applicants.
Churn is a concept that is used by many retail marketing operations. Banks widely
use churn information to drive their promotions. Once data mining identifies
customers by characteristic, direct mailing and telemarketing are used to present the
bank’s promotional program. The mortgage market has seen massive refinancing in a
number of periods. Banks were quick to recognize that they needed to keep their
mortgage customers happy if they wanted to retain their business. This has led to
banks contacting current customers who hold a mortgage at a rate
significantly above the market rate. While they may cut their own lucrative financial
packages, banks realize that if they don’t offer a better service to borrowers, a
competitor will.

Human Resource Management

Business intelligence is a way to truly understand markets, competitors, and
processes. Software technology such as data warehouses, data marts, online analytical
processing (OLAP), and data mining makes it possible to sift through data in order to
spot trends and patterns that can be used by the firm to improve profitability. In the
human resources field, this analysis can lead to the identification of individuals who
are liable to leave the company unless additional compensation or benefits are
provided.
Data mining can be used to expand upon things that are already known. A firm
might know that 20 percent of its employees use 80 percent of services offered, but
may not know which particular individuals are in that 20 percent. Business
intelligence provides a means of identifying segments, so that programs can be
devised to cut costs and increase productivity. Data mining can also be used to
examine the way in which an organization uses its people. The question might be
whether the most talented people are working for those business units with the highest
priority or where they will have the greatest impact on profit.
Companies are seeking to stay in business with fewer people. Sound human
resource management would identify the right people, so that organizations could treat
them well to retain them (reduce churn). This requires tracking key performance
indicators and gathering data on talents, company needs, and competitor requirements.

Summary
The era of big data is here, flooding businesses with numbers, text, and often more
complex data forms, such as videos or pictures. Some of this data is generated
internally, through enterprise systems or other software tools to manage a business’s
information. Data mining provides a tool to utilize this data. This chapter reviewed the
basic applications of data mining in business, including customer profiling, fraud
detection, and churn analysis. These will all be explored in greater depth in Chapter 2.
Here, our intent is to provide an overview of what data mining is useful for in
business.
The process of data mining relies heavily on information technology, in the form of
data storage support (data warehouses, data marts, or OLAP tools) as well as software
to analyze the data (data mining software). However, the process of data mining is far
more than simply applying these data mining software tools to a firm’s data.
Intelligence is required on the part of the analyst in selection of model types, in
selection and transformation of the data relating to the specific problem, and in
interpreting results.
CHAPTER 2

Business Data Mining Tools


Have you ever wondered why your spouse gets all of these strange catalogs for
obscure products in the mail? Have you also wondered at his or her strong interest in
these things, and thought that your spouse was overly responsive to advertising of this
sort? For that matter, have you ever wondered why 90 percent of your telephone calls,
especially during meals, are opportunities to purchase products? (Or for that matter,
why calls assuming you are a certain type of customer occur over and over, even
though you continue to tell them that their database is wrong?)
One of the earliest and most effective business applications of data mining is in
support of customer segmentation. This insidious application utilizes massive
databases (obtained from a variety of sources) to segment the market into categories,
which are studied with data mining tools to predict the response to particular
advertising campaigns. It has proven highly effective. It also represents the
probabilistic nature of data mining, in that it is not perfect. The idea is to send catalogs
to (or call) a group of target customers with a 5 percent probability of purchase rather
than waste these expensive marketing resources on customers with a 0.05 percent
probability of purchase. The same principle has been used in election campaigns by
party organizations—give free rides to the voting booth to those in your party;
minimize giving free rides to voting booths to those likely to vote for your opponents.
Some call this bias. Others call it sound business.
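
The economics behind this targeting can be sketched with a short calculation. The two response probabilities come from the discussion above. The mailing size, unit cost, and margin per sale are assumed figures chosen only for illustration.

```python
def expected_profit(n_mailed, p_response, unit_cost, margin_per_sale):
    """Expected profit from mailing n_mailed catalogs to a single segment."""
    expected_sales = n_mailed * p_response
    return expected_sales * margin_per_sale - n_mailed * unit_cost

# Probabilities are from the text; cost and margin are assumptions.
targeted = expected_profit(10_000, 0.05, unit_cost=1.0, margin_per_sale=50.0)
untargeted = expected_profit(10_000, 0.0005, unit_cost=1.0, margin_per_sale=50.0)

print(f"targeted segment:   {targeted:,.0f}")
print(f"untargeted segment: {untargeted:,.0f}")
```

Under these assumed figures, the mailing to the 5 percent segment is profitable, while the same mailing to the 0.05 percent segment loses nearly its entire cost.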
Data mining offers the opportunity to apply technology to improve many aspects of
business. Some standard applications are presented in this chapter. The value of
education is to present you with past applications, so that you can use your
imagination to extend these application ideas to new environments.
Data mining has proven valuable in almost every academic discipline.
Understanding business application of data mining is necessary to expose business
college students to current analytic information technology. Data mining has been
instrumental in customer relationship management,1 credit card management,2
banking,3 insurance,4 telecommunications,5 and many other areas of statistical support
to business. Business data mining is made possible by the generation of masses of
data from computer information systems. Understanding this information generation
system and the tools available for analysis is fundamental for business students in
the 21st century. There are many highly useful applications in practically every field
of scientific study. Data mining support is required to make sense of the masses of
business data generated by computer technology.
This chapter will describe some of the major applications of data mining. By doing
so, there will also be opportunities to demonstrate some of the different techniques
that have proven useful. Table 2.1 compares the aspects of these applications.

Table 2.1 Common business data mining applications


Application      Function                    Statistical technique   AI tool
Catalog sales    Customer segmentation       Cluster analysis        K-means
                 Mail stream optimization                            Neural network
CRM (telecom)    Customer scoring            Cluster analysis        Neural network
                 Churn analysis
Credit scoring   Loan applications           Cluster analysis        K-means
                                             Pattern search
Banking (loans)  Bankruptcy prediction       Prediction              Decision tree
                                             Discriminant analysis
Investment risk  Risk prediction             Prediction              Neural network
Insurance        Customer retention (churn)  Prediction              Decision tree
                 Pricing                     Logistic regression     Neural network

A wide variety of business functions are supported by data mining. Those
applications listed in Table 2.1 represent only some of these applications. The
underlying statistical techniques are relatively simple: to predict, to identify the case
closest to past instances, or to identify some pattern.

Customer Profiling
We begin with probably the most spectacular example of business data mining.
Fingerhut, Inc. was a pioneer in developing methods to improve business. In this case,
they sought to identify the small subset of the most likely purchasers of their specialty
catalogs. They were so successful that they were purchased by Federated Stores.
Ultimately, Fingerhut operations fell victim to the general malaise in IT business in
2001 and 2002. But they still represent a pioneering development of data mining
application in business.

Lift

This section demonstrates the concept of lift used in customer segmentation models.
We can divide the data into groups as fine as we want (here, we divide them into 10
equal portions of the population, or groups of 10 percent each). These groups have
some identifiable features, such as zip code, income level, and so on (a profile). We
can then sample and identify the portion of sales for each group. The idea behind lift
is to send promotional material (which has a unit cost) to those groups that have the
greatest probability of positive response first. We can visualize lift by plotting the
responses against the proportion of the total population of potential customers, as
shown in Table 2.2. Note that the segments are listed in Table 2.2 sorted by expected
customer response.

Table 2.2 Lift calculation


Ordered   Expected customer   Proportion             Cumulative response   Random average   Lift
segment   response            (expected responses)   proportion            proportion
Origin    0                   0                      0                     0                0
1         0.20                0.172                  0.172                 0.10             0.072
2         0.17                0.147                  0.319                 0.20             0.119
3         0.15                0.129                  0.448                 0.30             0.148
4         0.13                0.112                  0.560                 0.40             0.160
5         0.12                0.103                  0.664                 0.50             0.164
6         0.10                0.086                  0.750                 0.60             0.150
7         0.09                0.078                  0.828                 0.70             0.128
8         0.08                0.069                  0.897                 0.80             0.097
9         0.07                0.060                  0.957                 0.90             0.057
10        0.05                0.043                  1.000                 1.00             0.000

Both the cumulative responses and cumulative proportion of the population are
graphed to identify the lift. Lift is the difference between the two lines in Figure 2.1.
Figure 2.1 Lift identified by the mail optimization system
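
The columns of Table 2.2 can be reproduced with a short calculation. The following is a minimal sketch in Python, assuming ten equal-sized segments already sorted by expected response; the function name lift_table is our own illustrative choice and not part of any data mining package.

```python
# Expected response rate per segment, sorted from most to least responsive
# (the "Expected customer response" column of Table 2.2).
expected_response = [0.20, 0.17, 0.15, 0.13, 0.12, 0.10, 0.09, 0.08, 0.07, 0.05]


def lift_table(rates):
    """Return one row per segment: (segment, proportion, cumulative, random, lift).

    Assumes all segments are the same size (illustrative helper, hypothetical name).
    """
    total = sum(rates)
    n = len(rates)
    rows = []
    cumulative = 0.0
    for i, rate in enumerate(rates, start=1):
        proportion = rate / total      # share of all expected responses
        cumulative += proportion       # share captured by mailing segments 1..i
        random_share = i / n           # share expected when mailing at random
        rows.append((i, proportion, cumulative, random_share,
                     cumulative - random_share))
    return rows


if __name__ == "__main__":
    for seg, prop, cum, rand, lift in lift_table(expected_response):
        print(f"{seg:>2}  {prop:.3f}  {cum:.3f}  {rand:.2f}  {lift:.3f}")
```

Lift here is simply the gap between the cumulative response proportion and the proportion that random mailing would capture, which is the vertical distance between the two lines in the lift chart.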

The purpose of lift analysis is to identify the most responsive segments. Here, the
greatest lift is obtained from the first five segments. We are probably more interested
in profit, however. We can identify the most profitable policy by determining what
portion of the population should receive promotional materials. For
instance, if an average profit of $200 is expected for each positive response and a cost
of $25 is expected for each set of promotional material sent out, it would clearly be
profitable to send to the first segment, with its expected 0.2 positive
responses per mailing ($200 times 0.2 equals an expected revenue of $40, covering the
cost of $25 plus an extra $15 profit). But it still might be possible to improve the overall
profit by sending to other segments as well (always selecting segments in order of
decreasing response rate). The plot of cumulative profit is shown in Figure 2.2 for this
set of data. The second most responsive segment would also be profitable, collecting
$200 times 0.17 or $34 per $25 mailing for a net profit of $9. It turns out that the
fourth most responsive segment collects 0.13 times $200 ($26) for a net profit of $1,
while the fifth most responsive segment collects $200 times 0.12 ($24) for a net loss
of $1. Table 2.3 shows the calculation of the expected payoff.
Figure 2.2 Profit impact of lift

Table 2.3 Calculation of the expected payoff


Segment   Expected segment      Cumulative          Random cumulative   Expected
          revenue ($200 × P)    expected revenue    cost ($25 × i)      payoff
0         0                     0                   0                   0
1         40                    40                  25                  15
2         34                    74                  50                  24
3         30                    104                 75                  29
4         26                    130                 100                 30
5         24                    154                 125                 29
6         20                    174                 150                 24
7         18                    192                 175                 17
8         16                    208                 200                 8
9         14                    222                 225                 –3
10        10                    232                 250                 –18

The profit function in Figure 2.2 reaches its maximum with the fourth segment.
It is clear that the maximum profit is found by sending to the four most responsive
segments of the ten in the population. The implication is that in this case, the
promotional materials should be sent to the four segments expected to have the largest
response rates. If there were a limited promotional budget, it would be applied to as
many segments as the budget would support, in order of the expected response rate, up
to the fourth segment.
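
The payoff column of Table 2.3 and the resulting cutoff can be checked with a few lines of code. This is a sketch under the same assumptions as the text ($200 profit per positive response, $25 per mailing, amounts expressed per mailing in each equal-sized segment); the function names cumulative_payoff and best_cutoff are hypothetical and introduced only for illustration.

```python
expected_response = [0.20, 0.17, 0.15, 0.13, 0.12, 0.10, 0.09, 0.08, 0.07, 0.05]
PROFIT_PER_RESPONSE = 200.0   # from the text
COST_PER_MAILING = 25.0       # from the text


def cumulative_payoff(rates, profit, cost):
    """Cumulative expected payoff after mailing segments 1..i, for each i."""
    payoffs = []
    running = 0.0
    for rate in rates:
        running += profit * rate - cost   # expected revenue minus mailing cost
        payoffs.append(running)
    return payoffs


def best_cutoff(payoffs):
    """Number of segments to mail (most responsive first) that maximizes payoff.

    Returns 0 when no cutoff yields a positive payoff.
    """
    best = max(range(len(payoffs)), key=lambda i: payoffs[i])
    return best + 1 if payoffs[best] > 0 else 0


if __name__ == "__main__":
    payoffs = cumulative_payoff(expected_response, PROFIT_PER_RESPONSE,
                                COST_PER_MAILING)
    print([round(p) for p in payoffs])
    print("Mail to the first", best_cutoff(payoffs), "segments")
```

Because the segments are sorted by response rate, the cumulative payoff rises while each added segment earns more than its mailing cost and falls afterward, so the maximum marks the last segment worth mailing.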
It is possible to focus on the wrong measure. The basic objective of lift analysis in
marketing is to identify those customers whose decisions will be influenced by
marketing in a positive way. In short, the methodology described earlier identifies
those segments of the customer base that would be expected to purchase. This may or
may not have been due to the marketing campaign effort. The same methodology can
be applied, but more detailed data is needed to identify those whose decisions would
have been changed by the marketing campaign, rather than simply those who would
purchase.
Another method that considers multiple factors is Recency, Frequency, and
Monetary (RFM) analysis. As with lift analysis, the purpose of RFM is to identify
customers who are more likely to respond to new offers. While lift looks at the static
measure of response to a particular campaign, RFM keeps track of customer
transactions by time, by frequency, and by amount. Time is important as some
customers may not have responded to the last campaign, but might now be ready to
purchase the product being marketed. Customers can also be sorted by the frequency
of responses and by the dollar amount of sales. The subjects are coded on each of the
three dimensions (one approach is to have five cells for each of the three measures,
yielding a total of 125 combinations, each of which can be associated with a positive
response to the marketing campaign). RFM still has limitations, in that there are
usually more than three attributes important to a successful marketing program, such
as product variation, customer age, customer income, customer lifestyle, and so on.6
The approach is the basis for a continuing stream of techniques to improve customer
segmentation marketing.
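
The coding step of RFM can be sketched as follows. This is an illustrative example only: the customer records are hypothetical, the five cells per measure are formed as rank-based fifths (one of several possible coding schemes), and the function names quintile_codes and rfm_cells are our own.

```python
def quintile_codes(values, higher_is_better=True):
    """Code each value from 1 (worst fifth) to 5 (best fifth) by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=not higher_is_better)
    codes = [0] * len(values)
    for rank, idx in enumerate(order):
        codes[idx] = 1 + (rank * 5) // len(values)
    return codes


def rfm_cells(customers):
    """Map customer id -> (recency, frequency, monetary) codes, each 1..5.

    Each record is (id, days_since_last_purchase, purchase_count, total_spend).
    """
    ids = [c[0] for c in customers]
    # Fewer days since the last purchase is better, so recency is reversed.
    r = quintile_codes([c[1] for c in customers], higher_is_better=False)
    f = quintile_codes([c[2] for c in customers])
    m = quintile_codes([c[3] for c in customers])
    return {cid: (r[i], f[i], m[i]) for i, cid in enumerate(ids)}


# Hypothetical customer records, invented for this sketch.
customers = [
    ("A", 5, 12, 940.0), ("B", 40, 3, 120.0), ("C", 200, 1, 35.0),
    ("D", 15, 8, 610.0), ("E", 90, 2, 80.0), ("F", 10, 10, 720.0),
    ("G", 365, 1, 20.0), ("H", 60, 4, 210.0), ("I", 30, 6, 330.0),
    ("J", 120, 2, 95.0),
]

if __name__ == "__main__":
    for cid, cell in sorted(rfm_cells(customers).items()):
        print(cid, cell)
    print("Possible cells:", 5 ** 3)
```

In practice, the observed response rate of past campaigns would be tabulated for each of the 125 cells, and new offers would be directed to the cells with the highest rates.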
Understanding lift enables understanding the value of specific types of customers.
This enables more intelligent customer management, which is discussed in the next
section.

Comparisons of Data Mining Methods


Initial analyses focus on discovering patterns in the data. Classical statistical
methods, such as correlation analysis, are a good start, often supplemented with visual
tools to see the distributions and relationships among variables. Clustering and pattern
search are typically the first activities in data analysis, good examples of knowledge
discovery. Then, appropriate models are built. Data mining can then involve model
building (extension of the conventional statistical model building to very large
datasets) and pattern recognition. Pattern recognition aims to identify groups of
interesting observations. Often, experts are used to assist in pattern recognition.
There are two broad categories of models used for data mining. Continuous,
especially time series, data often calls for forecasting. Linear regression provides one
tool, but there are many others. Business data mining has widely been used for
classification or developing models to predict which category a new case will most
likely belong to (such as a customer profile relative to the expected purchases,
whether or not loans will be problematic, or whether insurance claims will turn out to
be fraudulent). The classification modeling tools include statistically based logistic
regression as well as artificial intelligence-based neural networks and decision trees.
Sung et al. compared a number of these methods with respect to their advantages
and disadvantages. Table 2.4 draws upon their analysis and expands it to include the
other techniques covered.

Table 2.4 Comparison of data mining method features7


Method: Cluster analysis
  Advantages:    Can generate understandable formula
                 Can be applied automatically
  Disadvantages: Computation time increases with dataset size
                 Requires identification of parameters, with results sensitive to choices
  Assumptions:   Need to make data numerical

Method: Discriminant analysis
  Advantages:    Ability to incorporate multiple financial ratios simultaneously
                 Coefficients for combining the independent variables
                 Ability to apply to new data
  Disadvantages: Violates normality and independence assumptions
                 Reduction of dimensionality issues
                 Varied interpretation of the relative importance of variables
                 Difficulty in specifying the classification algorithm
                 Difficulty in interpreting the time-series prediction tests
  Assumptions:   Assume multivariate normality within groups
                 Assume equal group covariances across all groups
                 Groups are discrete, nonoverlapping, and identifiable

Method: Regression
  Advantages:    Can generate understandable formula
                 Widely understood
                 Strong body of theory
  Disadvantages: Computation time increases with dataset size
                 Not very good with nonlinear data
  Assumptions:   Normality of errors
                 No error autocorrelation, heteroskedasticity, multicollinearity

Method: Neural network models
  Advantages:    Can deal with a wide range of problems
                 Produce good results in complicated domains (nonlinear)
                 Can deal with both continuous and categorical variables
                 Have many software packages available
  Disadvantages: Require inputs in the range of 0 to 1
                 Do not explain results
                 May prematurely converge to an inferior solution
  Assumptions:   Groups are discrete, nonoverlapping, and identifiable

Method: Decision trees
  Advantages:    Can generate understandable rules
                 Can classify with minimal computation
                 Use easy calculations
                 Can deal with continuous and categorical variables
                 Provide a clear indication of variable importance
  Disadvantages: Some algorithms can only deal with binary-valued target classes
                 Most algorithms only examine a single field at a time
                 Can be computationally expensive
  Assumptions:   Groups are discrete, nonoverlapping, and identifiable

Knowledge Discovery

Clustering: One unsupervised clustering technique is partitioning, the process of
examining a set of data to define a new categorical variable partitioning the space into
a fixed number of regions. This amounts to dividing the data into clusters. The most
widely known partitioning algorithm is k-means, where k center points are defined,
and each observation is classified to the closest of these center points. The k-means
algorithm attempts to position the centers to minimize the sum of distances. Centroids
are used as centers, and the most commonly used distance metric is Euclidean. Instead
of k-means, k-median can be used, providing a partitioning method expected to be
more stable.
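
The k-means procedure can be sketched in a few lines. The version below is a minimal illustration using centroids and Euclidean distance; the data points and the choice of initial centers are hypothetical, and the function names are our own. Note that the standard form of the algorithm minimizes the sum of squared distances to the centers, which is what recomputing centroids achieves.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, initial_centers, max_iter=100):
    """Alternate assignment and centroid updates until assignments stop changing."""
    centers = [tuple(c) for c in initial_centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each observation goes to its closest center.
        new_labels = [
            min(range(len(centers)), key=lambda j: euclidean(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each center to the centroid of its members.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = centroid(members)
    return centers, labels


# Hypothetical two-dimensional observations forming two loose groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

if __name__ == "__main__":
    centers, labels = k_means(points, initial_centers=[points[0], points[3]])
    print(labels)
    print(centers)
```

The result depends on the starting centers, which is one reason the results of cluster analysis are described as sensitive to parameter choices.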
Pattern search: Objects are often grouped to seek patterns. Clusters of customers
might be identified with particularly interesting average outcomes. On the positive
side, you might look for patterns in highly profitable customers. On the negative side,
you might seek patterns unique to those who fail to pay their bills to the firm.
Both clustering and pattern search seek to group the objects. Cluster analysis is
attractive, in that it can be applied automatically (although ample computational time
needs to be available). It can be applied to all types of data, as demonstrated in our
example. Cluster analysis is also easy to apply. However, its use requires selection
from among alternative distance measures, and weights may be needed to reflect
variable importance. The results are sensitive to these measures. Cluster analysis is
appropriate when dealing with large, complex datasets with many variables and
specifically identifiable outcomes. It is often used as an initial form of analysis. Once
different clusters are identified, pattern search methods are often used to discover the
rules and patterns. Discriminant analysis has been the most widely used data mining
technique in bankruptcy prediction. Clustering partitions the entire data sample,
assigning each observation to exactly one group. Pattern search seeks to identify local
clusterings, in that there are more objects with similar characteristics than one would
expect. Pattern search does not partition the entire dataset, but identifies a few groups
exhibiting unusual behavior. In the application on real data, clustering is useful for
describing broad behavioral classes of customers. Pattern search is useful for
identifying groups of people behaving in an anomalous way.

Predictive Models

Regression is probably the most widely used analytical tool historically. A main
benefit of regression is the broad understanding people have about regression models
and tests of their output. Logistic regression is highly appropriate in data mining, due
to the categorical nature of resultant variables that is usually present. While regression
is an excellent tool for statistical analysis, it does require assumptions about
parameters. Errors are assumed to be normally distributed, without autocorrelation
(errors are not related to the prior errors), without heteroskedasticity (errors don’t
grow with time, for instance), and without multicollinearity (independent variables
don’t contain high degrees of overlapping information content). Regression can deal
with nonlinear data, but only if the modeler understands the underlying nonlinearity
and develops appropriate variable transformations. There usually is a tradeoff—if the
data are fit well with a linear model, regression tends to be better than neural network
models. However, if there is nonlinearity or complexity in the data, neural networks
(and often, genetic algorithms) tend to do better than regression. A major
advantage of regression relative to neural networks is that regression provides an
easily understood formula, while neural network models are very complex.
Neural network algorithms can prove highly accurate, but involve difficulty in the
application to new data or interpretation of the model. Neural networks work well
unless there are many input features. The presence of many features makes it difficult
for the network to find patterns, resulting in long training phases, with lower
probabilities of convergence. Genetic algorithms have also been applied to data
mining, usually to bolster operations of other algorithms.
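
The point that regression can handle nonlinear data once an appropriate variable transformation is applied can be illustrated with a small sketch. The data below are hypothetical, generated exactly from y = 2 + 3x² so that the effect is easy to see, and fit_line and sse are illustrative helper names for ordinary least squares with a single predictor.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sse(xs, ys, intercept, slope):
    """Sum of squared errors of the fitted line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data with a known quadratic relationship.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 + 3.0 * x * x for x in xs]

if __name__ == "__main__":
    # Straight-line fit on the raw variable leaves systematic error.
    a, b = fit_line(xs, ys)
    print("raw x:     ", a, b, sse(xs, ys, a, b))

    # Fit on the transformed variable z = x^2 recovers the relationship.
    zs = [x * x for x in xs]
    a2, b2 = fit_line(zs, ys)
    print("x squared: ", a2, b2, sse(zs, ys, a2, b2))
```

The transformation only helps because the modeler already knows the form of the nonlinearity; when that form is unknown, this is where methods such as neural networks tend to have the advantage.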
Decision tree analysis requires only the last assumption, that groups are discrete,
nonoverlapping, and identifiable. Decision trees generate understandable rules and
can perform classification with minimal computation, using easy calculations.
Decision tree analysis can deal with both continuous and categorical variables,
and provides a clear indication of variable importance in prediction and classification.
Despite the disadvantages of the decision tree method, it is a good choice when the data
mining task is classification of records or prediction of outcomes.
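
To show how a decision tree produces an understandable rule by examining a single field at a time, the sketch below finds the best split on one numeric field for a binary target. It is a simplified illustration, not a full tree algorithm: it uses Gini impurity as the split criterion (one common choice among several), and the loan data and function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Return (threshold, weighted_impurity) for the best split on one field.

    Candidate thresholds are midpoints between adjacent distinct values.
    Returns (None, impurity_of_all_labels) when no split reduces impurity.
    """
    n = len(values)
    pairs = list(zip(values, labels))
    best = (None, gini(labels))
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best[1]:
            best = (threshold, impurity)
    return best


# Hypothetical loan applicants: debt ratio and whether the loan defaulted (1 = yes).
debt_ratio = [0.10, 0.15, 0.22, 0.30, 0.45, 0.50, 0.62, 0.70]
defaulted = [0, 0, 0, 0, 1, 0, 1, 1]

if __name__ == "__main__":
    threshold, impurity = best_split(debt_ratio, defaulted)
    print(f"Rule: predict default if debt ratio > {threshold:.3f} "
          f"(weighted Gini {impurity:.4f})")
```

A full decision tree repeats this search over every field, keeps the best split, and then applies the same procedure to each resulting subset, which is where the computational expense noted in Table 2.4 can arise.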

Summary
Data mining applications are widespread. This chapter sought to give concrete
examples of some of the major business applications of data mining. We began with a
review of Fingerhut data mining to support catalog sales. That application was an
excellent demonstration of the concept of lift applied to retail business. We also
reviewed five other major business applications, intentionally trying to demonstrate a
variety of different functions, statistical techniques, and data mining methods. Most of
those studies applied multiple algorithms (data mining methods). Software such as
Enterprise Miner has a variety of algorithms available, encouraging data miners to
find the method that works best for a specific set of data.
The second portion of the book seeks to demonstrate these methods with small
demonstration examples. The small examples can be run on Excel or other simple
spreadsheet packages with statistical support. Businesses can often conduct data
mining without purchasing large-scale data mining software. Therefore, our
philosophy is that it is useful to understand what the methods are doing, which also
provides the users with better understanding of what they are doing when applying
data mining.
CHAPTER 3

Data Mining Processes and Knowledge Discovery
In order to conduct data mining analysis, a general process is useful. This chapter
describes an industry standard process, which is often used, and a shorter vendor
process. While not every step is needed in every analysis, this process provides good
coverage of the steps needed: data exploration, data collection, data
processing, analysis, drawing inferences, and implementation.
Two standard processes for data mining have been presented. CRISP-DM
(cross-industry standard process for data mining) is an industry standard, and
SEMMA (sample, explore, modify, model, and assess) was developed by the SAS
Institute Inc., a leading vendor of data mining software (and a premier statistical
software vendor). Table 3.1 gives a brief description of the phases of each process.
You can see that they are basically similar, only with different emphases.

Table 3.1 CRISP-DM and SEMMA


CRISP-DM                 SEMMA
Business understanding   Assumes well-defined questions
Data understanding       Sample
Data preparation         Explore
Modeling                 Modify data
Evaluation               Model
Deployment               Assess

Industry surveys indicate that CRISP-DM is used by over 70 percent of industry
professionals, while about half of these professionals use their own methodologies.
SEMMA has a lower reported usage, according to the KDNuggets.com survey.

CRISP-DM
CRISP-DM is widely used by industry members. This model consists of six
phases intended as a cyclical process, shown in Figure 3.1.

Figure 3.1 CRISP-DM process

This six-phase process is not a rigid, by-the-numbers procedure. There is usually a
great deal of backtracking. Additionally, experienced analysts may not need to apply
each phase for every study. But CRISP-DM provides a useful framework for data
mining.

Business Understanding

The key element of a data mining study is understanding the purpose of the study.
This begins with the managerial need for new knowledge and the expression of the
business objective of the study to be undertaken. Goals are needed, stated in terms
such as which types of customers are interested in each of our products, what the
typical profiles of our customers are, and how much value each of them provides to us.
Then, a plan for finding such knowledge needs to be developed, in terms of
those responsible for collecting data, analyzing data, and reporting. At this stage, a
budget to support the study should be established, at least in preliminary terms.

Data Understanding

Once the business objectives and the project plan are established, data understanding
considers data requirements. This step can include initial data collection, data