
Expert Systems with Applications 37 (2010) 1573–1585


An efficient approach for building customer profiles from business data


L.B. Romdhane *, N. Fadhel, B. Ayeb
FSM, Computer Science, Rue de l’ environnement, Monastir, Monastir 5011, Tunisia

Abstract

Data mining (DM) is a new emerging discipline that aims at extracting knowledge from data using several techniques. DM proved to be useful in business, where transactional data turned out to be a mine of information about customer purchase habits. Therefore, developing customer models (also called profiles in the literature) is an important step for targeted marketing. In this paper, we develop an approach for customer profiling composed of three steps. In the first step, we cluster data with an FCM-based algorithm in order to extract "natural" groups of customers. An important feature of our algorithm is that it provides a reliable estimate of the real number of distinct clusters in the data set using the partition entropy as a validity measure. In the second step, we reduce the number of attributes for each computed group of customers by selecting only the "most important" ones for that group. We use the information entropy to quantify the importance of an attribute. Consequently, and as a result of this second step, we obtain a set of groups each described by a distinct set of attributes (or characteristics). In the third and final step of our model, we build a set of customer profiles, each modeled by a backpropagation neural network and trained with the data in the corresponding group of customers. Experimental results on synthetic and large real-world data sets reveal a very satisfactory performance of our approach.

© 2009 Elsevier Ltd. All rights reserved.

Keywords: Data mining; Customer profiling; Fuzzy clustering; Backpropagation neural networks; Partition entropy

* Corresponding author. E-mail address: Lotfi.Ben.Romdhane@Usherbrooke.ca (L.B. Romdhane).
0957-4174/$ - see front matter © 2009 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2009.06.050

1. Introduction

Marketing managers can develop long-term and pleasant relationships with customers if they can detect and predict changes in their consuming habits (or behavior). In the past, researchers generally applied statistical surveys to study customer behavior. Recently, however, data mining techniques have been adopted (Bose & Mahapatra, 2001; Giudici & Passerone, 2002; Nasraoui, Soliman, Saka, Badia, & Germain, 2008). These techniques apply specific algorithms for pattern extraction to obtain implicit, previously unknown, and potentially useful information including knowledge rules, constraints and regularities (Chen, Hen, & Yu, 1996; Mitra, Pal, & Mitra, 2002).

Various successful applications of data mining have been reported in several areas including the Web (Borges & Levene, 2007; Nasraoui et al., 2008; Nasraroui, Spiliopoulou, Srivastava, Mobasher, & Masand, 2007), marketing (Jiang & Tuzhilin, 2006; Kim & Moon, 2006; Yeh, Dai, & Chen, 2007), finance and banking (Hsieha, 2004; Lee, Liu, & Chen, 2006; Zhang & Zhou, 2004), telecommunication (Hu, Shanker, Zhang, & Hung, 2008; Li, Shue, & Lee, 2006; Nanavati et al., 2008; Sohn & Kim, 2008), and functional genomics (Jiang, Pei, & Zhang, 2005), to highlight just a few. Naturally, the goal of using data mining varies from one area to another. In Web mining, the main goal is to personalize the information space by tailoring the user's interaction with the web information based on information about her/him (Nasraroui et al., 2007, 2008). In telecommunication, call data is used to extract customers' calling habits (Hung, Yen, & Yu Wang, 2006; Li et al., 2006; Nanavati et al., 2008), especially fraudulent behavior (Abidogun, 2005; Karahoca, Yalcin, & Kahraman, 2007); whereas network data is used to support network management functions, such as fault isolation (Nunez, Morales, & Triguero, 2002).

The main goal of customer profiling (or segmentation) is to build reliable customer models for targeted marketing campaigns and, consequently, better profitability (Dyche, 2001). One can therefore define data mining in customer profiling as the technology that allows building customer models, each describing the specific habits, needs, and behavior of a group of customers. Once discovered, these models can be used to classify new customers and thereby predict their special needs. Some of the difficulties faced by data mining techniques in this area are the amount of data available to create user models, the adequacy of the data, the noise in that data, and the necessity of capturing the imprecise nature of human behavior (Jiang & Tuzhilin, 2006). Artificial Intelligence (AI) techniques have the ability to discover "hidden" and "implicit" knowledge in large amounts of data, to process uncertainty, and to deal with noise. In addition, AI techniques have a high prediction accuracy compared to other "classical" approaches (Bose & Mahapatra, 2001).

Several AI techniques have been used in the literature to address this problem, such as Bayesian networks (Yi-jun, Peng, & Qiang, 2006), decision trees/association rules (Li et al., 2006), artificial neural networks (Hsieha, 2004; Hu et al., 2008; Shich, Yan, & Chen, 2008; Verù, Garcìa, Senabre, Marìn, & Franco, 2006), and support vector machines (Fung & Mangasarian, 2001; Huang, Chen, & Wang, 2007), among others. In the sequel, we briefly outline some of the proposed models for customer profiling, aiming at giving the novice reader some background in the field. Excellent surveys of existing models for customer profiling can be found in (Bose & Mahapatra, 2001; Chen et al., 1996; Dyche, 2001; Martinez, Chen, & Liu, 2006; Roddick & Spiliopoulou, 2002; Zhang & Zhou, 2004).

Bayesian networks were used in (Yi-jun et al., 2006) to segment customers of a Chinese commercial bank. First, the structure of the Bayesian network is learned from data in an incremental way: an arc between two nodes (or attributes) is added to the network if it improves its classification accuracy. The algorithm stops if no further improvements of the network are possible. Second, the conditional class probabilities of the network are learned from data. An experimental comparison with decision trees using two data sets of 17,000 and 47,000 customers, respectively, reveals a higher prediction accuracy of the proposed approach. Unfortunately, the efficiency of the model in handling a real-world number of classes was not discussed, and the simulated data embedded only two classes. Most importantly, training distinct Bayesian classifiers for distinct data sets coming from different regions may not be the optimal way to proceed for some companies, especially when the "spatial population drift" is not significant.

In Hsieha (2004), the authors proposed an integrated behavioral scoring model for managing credit card customers in a bank. The proposed model is composed of two steps. In the first step, a Self-Organizing Map neural network is used to identify groups based on repayment behavior, recency, frequency, and monetary behavioral scoring predictors. Three groups of customers were identified; namely, revolver, transactor, and convenience customers (Hsieha, 2004). Thereafter, an Apriori-based algorithm was used, focusing on demographic and geographic attributes, in order to extract a set of customer profiles modeled in the form of association rules. Experimentation on a large customer database composed of 104,979 individuals reveals a good performance of the proposed approach. Unfortunately, the computed customer profiles are not accurate since many discovered rules are either trivial, contain irrelevant information, or are redundant. The authors addressed this problem by proposing interactive strategies that allow the user to prune non-essential rules.

The model proposed in Hu et al. (2008) is based on neural network learning and aims at building customer profiles that relate customer characteristics (such as region, income, marital status, etc.) to their communication mode (such as long distance telephone calling, letter/card writing, etc.). The neural architecture is composed of three layers in which the output layer is composed of a single neuron modeling the communication mode of the customer. A key feature of the proposed model and its learning algorithm is the ability to select, among the available customer characteristics, the most important ones by computing the prediction risk over the validation set (Hu et al., 2008). This prediction risk is nothing more than the average squared error of the trained network over the validation set. Experimentation on a data set composed of 3377 customers (1595 for training and 1842 for test) revealed a good prediction accuracy of the neural model compared to logistic regression (Hu et al., 2008). In this model, several network topologies are trained and evaluated for every possible subset of features, and the best network architecture is selected as the one producing the least prediction risk. Naturally, the computational time could be unacceptable, especially for data sets with a large number of attributes, which is often the case.

In Shich et al. (2008), the authors proposed an Adaptive Resonance Theory ART2 neural network for product redesign based on customer requirements. The proposed model involves two phases. In the first phase, customer requirements are expressed through picture sorting; while in the second phase, ART2 neural networks are trained with the collected data to capture customer profiles. The model was applied to mobile hand phone design using a data set of 100 multi-cultural customers. Five customer profiles were discovered by the network, as well as the dominant characteristics for each profile (such as gender and education level). Unfortunately, the scalability and efficiency of the model on real-world large data sets was not discussed.

The model proposed in Setnes and Kaymak (2001) aims at developing customer preference profiles by fuzzy logic concepts. The proposed model consists of two steps. In the first step, a hierarchical top-down fuzzy clustering algorithm called the Extended Fuzzy C-Means (E-FCM) is used to identify groups in customer data. In the second step, the computed clusters are represented by a set of association rules as follows. The Fuzzy Respondent Density measure (FRD) of each cluster is computed (Setnes & Kaymak, 2001). Then, each cluster is represented by an association rule whose antecedent is the center of that cluster and whose consequent is its FRD value. The model was tested on a data set of 15,486 customers, each described by 49 features and obtained from a financial services company. Simulation results reveal a good performance of the proposed model compared to existing statistical approaches, especially in targeting customers, where the target improvement varied between 15% and 20%. However, one major shortcoming of the proposed E-FCM clustering algorithm is that there is no criterion to evaluate the optimal number of clusters (or groups of customers). In addition, there is no simple and automatic method for the validation of the discovered if-then rules; actually, it is up to the analyst to prune non-essential ones (redundant, incoherent or trivial). A similar model was proposed in Sohn and Kim (2008) in the telecommunication industry. In Sohn and Kim (2008), customers are first segmented using several K-means and Self-Organizing Maps clustering models: mainly, eight clusters were identified. Second, clusters are transformed to association rules using an Apriori-based algorithm.

In this paper, we develop an approach for customer profiling composed of three steps. In the first step, we cluster data with a fuzzy clustering algorithm in order to extract "natural" groups of customers. An important feature of our algorithm is that it provides a reliable estimate of the real number of distinct clusters in the data set using the partition entropy as a validity measure. In the second step, we select for each computed cluster the "most important" attributes (or characteristics). The importance of an attribute is based on its entropy. Therefore, and as a result of this step, we obtain a set of distinct groups each described by a distinct set of attributes judged to be the most important. In the third and final step of our model, we construct user profiles using backpropagation neural networks trained by the data in the computed groups. The rest of this paper is structured as follows. In Section 2, we detail each step of our model and illustrate it on a synthetic data set. Section 3 presents experimentation on real-world large data sets, while the last section offers concluding remarks and sheds light on future research directions.

2. Our proposal

Our approach is composed of three steps and is summarized in Fig. 1. In the first step, we use fuzzy clustering to identify groups of customers. The number of groups is computed automatically using the partition entropy as a validity measure: the optimal number of clusters is that producing the lowest partition entropy (Bezdek, 1981).

Each computed cluster describes a group of customers whose consuming habits, incomes, etc., are similar. In the second step, we proceed to a dimensionality reduction of each group by keeping only those attributes that are the "most informative" for that group. We stress the fact that our goal in this second step is to describe each group with its most important attributes. In fact, an attribute may be important for a given group of customers, whereas the same attribute could be of no importance for other groups. For this, we use the information entropy to rank attributes and judge their importance. In the third and final step of our model, we develop a neural network classifier for each reduced cluster using Backpropagation (Rumelhart, Hinton, & Williams, 1986). Hence, each backpropagation network is trained with the customers' data in the corresponding cluster in order to extract the hidden and useful knowledge. As a result of the whole model, we obtain a set of customer profiles each described by a distinct subset of attributes and encoded in the weights of a trained backpropagation neural network. Consequently, profiling of new customers based on their known attributes becomes a trivial task (Widrow & Lehr, 1992). In the following subsections, we will detail each step of our approach.

[Fig. 1. Our model for customer profiling.]

2.1. First step: fuzzy clustering

Clustering is a process for grouping a set of objects into classes (or clusters) so that the objects in a cluster have high similarity, but are very dissimilar to objects in other clusters (Baraldi & Blonda, 1999; Bezdek, 1981). Several clustering algorithms were proposed in the literature (Abeantes & Marques, 1996; Hoppner, Klawonn, Kruse, & Runkler, 1999; Looney, 2002; Pal, Keller, & Bezdek, 2005; Ying, Vijay, Praveen, & Xiaoquan, 2005). For our concern, we will use an approach based on the Fuzzy C-Means (FCM) (Bezdek, 1981), a fuzzy clustering algorithm which turned out to be effective in dealing with fuzzy and categorical data even in the presence of noise and with missing values – see for instance (Ichihashi & Honda, 2005; Ichihashi, Honda, Notsu, & Yagi, 2007). Before going any further, we need the following notation.

Notation 1

- $X = \{x_1, \ldots, x_N\}$ is a data set composed of $N$ multi-dimensional data points where each $x_k$ models a customer;
- $\|\cdot\|_E$ is the Euclidean distance;
- $c \geq 2$ is the number of possible clusters (groups) in $X$;
- $m > 1$ is a weighting parameter called the fuzzifier;
- $U = (\mu_{ki})$, $i = 1, \ldots, c$, $k = 1, \ldots, N$, is a fuzzy c-partition of $X$, where $\mu_{ki} \in\, ]0, 1[$ is the membership degree of data $x_k$ in cluster $i$;
- $V = (v_1, \ldots, v_c)$ is a c-tuple where $v_i$ is the center of cluster $i$.

FCM is based on the minimization of the following objective function (Bezdek, 1981):

$$J_m = \sum_{k=1}^{N} \sum_{i=1}^{c} (\mu_{ki})^m \, \|x_k - v_i\|_E^2 \tag{1}$$

subject to the constraints:

$$\sum_{i=1}^{c} \mu_{ki} = 1 \;\; \forall k, \qquad \sum_{k=1}^{N} \mu_{ki} > 0 \;\; \forall i. \tag{2}$$

Minimization of $J_m$ in Eq. (1) derives the following equations for the membership degrees and cluster centers (Bezdek, 1981):

$$\mu_{ki} = \left[ \sum_{j=1}^{c} \left( \frac{\|x_k - v_i\|_E}{\|x_k - v_j\|_E} \right)^{\frac{2}{m-1}} \right]^{-1} \quad \forall i, k, \tag{3}$$

$$v_i = \frac{\sum_{k=1}^{N} (\mu_{ki})^m \, x_k}{\sum_{k=1}^{N} (\mu_{ki})^m} \quad \forall i. \tag{4}$$

Unfortunately, in FCM, as in most existing fuzzy clustering algorithms, the number of groups $c$ is assumed to be a user-defined parameter, which is still a hard task for practical applications. One possible way consists in computing the optimal number of clusters using the partition entropy as a validity measure (Bezdek, 1981). It is defined, for a given number of clusters $c$ and a fuzzy partition $U$, as follows:

$$H(U, c) = -\frac{1}{N} \sum_{k=1}^{N} \sum_{i=1}^{c} \mu_{ki} \, \log\left(\mu_{ki}\right). \tag{5}$$

It is easy to see from Eq. (5) that the best partition $U$, in which the boundaries between clusters are "clear", will produce the lowest partition entropy.

Our approach is summarized in Algorithm 1. It accepts as input the data set $X$, and $C_{min}$ and $C_{max}$ representing respectively the expected minimal and maximal numbers of clusters in $X$. Our algorithm runs FCM (Bezdek, 1981) for every possible value of $c \in [C_{min}, C_{max}]$, and the optimal number of clusters $c^{\star}$ (resp. the best partition $U^{\star}$) is that producing the lowest partition entropy. At this stage, we should make the following remarks. Naturally, our clustering approach in Algorithm 1 will be computationally expensive if the interval $[C_{min}, C_{max}]$ is large. Therefore, it is up to the analyst to use any prior knowledge about the data set to fix this interval as narrow as possible. We should also notice the use of the fuzzifier $m$ in Eqs. (1)-(4). Unlike the original FCM model, we used a "simulated annealing"-like scheme for $m$ as described in (Romdhane, Ayeb, & Wang, 2002), which adds more stability to the algorithm and speeds up convergence – further details can be found in (Romdhane et al., 2002).

As a final result of this first step, we get a set of clusters each describing a distinct group of customers that share the same characteristics. Subsequently, we will proceed to a reduction of the number of attributes, keeping only the most "informative" ones for each group of customers.
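As an illustration of how Algorithm 1 could be realized, the following is a minimal Python/NumPy sketch and not the authors' implementation: it uses a plain FCM with a fixed fuzzifier m (the "simulated annealing"-like scheme for m of Romdhane et al. (2002) is omitted), and the function and variable names (fcm, partition_entropy, algorithm1) are ours.

```python
import numpy as np

def fcm(X, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Plain fuzzy c-means (Bezdek, 1981). Returns memberships U (N x c) and centers V (c x d)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    U = rng.dirichlet(np.ones(c), size=N)            # random fuzzy partition, rows sum to 1
    for _ in range(max_iter):
        Um = U ** m
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]     # Eq. (4): membership-weighted centers
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)   # N x c distance matrix
        d = np.maximum(d, 1e-12)                     # guard against division by zero
        U_new = 1.0 / (d ** (2.0 / (m - 1.0)))       # Eq. (3), up to row normalization
        U_new /= U_new.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return U, V

def partition_entropy(U):
    """Eq. (5): H(U, c) = -(1/N) * sum_k sum_i u_ki * log(u_ki)."""
    return -np.mean(np.sum(U * np.log(np.maximum(U, 1e-12)), axis=1))

def algorithm1(X, c_min=2, c_max=6, m=2.0):
    """Run FCM for every c in [c_min, c_max]; keep the partition with the lowest partition entropy."""
    best = None
    for c in range(c_min, c_max + 1):
        U, V = fcm(X, c, m=m)
        h = partition_entropy(U)
        if best is None or h < best[0]:
            best = (h, c, U, V)
    return best  # (lowest entropy, optimal c*, best partition U*, centers V*)
```

On a data set such as the synthetic one of Section 2.4, scanning c over [2, 6] with this loop would be expected to reach its lowest partition entropy at c = 3, which is the behavior reported for Algorithm 1 in Fig. 4.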

2.2. Second step: attribute selection

In this second step, we will proceed to a dimensionality reduction of each group of customers by selecting only those attributes that are considered to be the "most informative". In many existing models for data mining, several dimensionality reduction techniques were used for clustering purposes; i.e., to reduce the feature space for efficient clustering (Dash, Liu, & Yao, 1997; Fodor & Kamath, 2002). However, in our model, the original set of attributes describing all customers is supposed to be relevant; i.e., not reducible. If this is not the case, then any existing method could be adopted (Dash et al., 1997; Fodor & Kamath, 2002) to perform dimensionality reduction for clustering purposes. However, after segmenting customers into groups, an attribute may be important for one group and unimportant for another. Consequently, one has to select for each group of customers the most informative (1) attributes. Intuitively speaking, an attribute is important for a group if it is a "common characteristic" of that group. A formal definition follows.

(1) In this paper, we will use interchangeably the words "informative" and "important".

Definition 1 (Important/unimportant attribute). Let $g_i$ be a group of customers. An attribute $a_b$ is said to be important for $g_i$ if it is a grouping characteristic of the customers in $g_i$; i.e., these customers share similar values for $a_b$. Ideally, all customers in $g_i$ will have exactly the same value for $a_b$. Conversely, an attribute $a_b$ is unimportant for $g_i$ if the customers in $g_i$ do not share similar values for $a_b$.

In order to compute the set of "similar values" for an attribute $a_b$ in a group $g_i$, we will use the same clustering approach in Algorithm 1 with the following interpretations: (1) the set of data points input to the algorithm is composed of all values taken by $a_b$ in $g_i$; (2) a cluster computed by that algorithm models values that are very similar to each other, but very dissimilar to values in other clusters. But, before going any further, we need the following notation.

Notation 2

- $X$ is the original set of descriptive attributes;
- $N_i$ is the size of group $g_i$ (number of customers);
- $D_{a,i} = \{S_{a,i}^1, \ldots, S_{a,i}^{c_{a,i}}\}$ is the set of clusters computed by Algorithm 1 when applied to attribute $a$ w.r.t. $g_i$, with $c_{a,i}$ being the number of clusters;
- $f_{a,i}^j$ is the size (number of values) of cluster $S_{a,i}^j$, $j = 1, \ldots, c_{a,i}$.

One should remark that a cluster $S_{a,i}^j \in D_{a,i}$ is composed of a set of values that are very similar and thereby could be considered as being the same value occurring as many times as the size of that cluster, i.e., $f_{a,i}^j$ times. Consequently, $D_{a,i}$ could be seen as the finite set of distinct values taken by attribute $a$ in group $g_i$. Now, we will derive the following corollaries.

Corollary 1 (Important attribute). If an attribute, say $a \in X$, is important for a group of customers, say $g_i$, then $D_{a,i}$ will be composed of dense clusters.

Corollary 1 is straightforward from Definition 1. In fact, an attribute $a$ is important for $g_i$ if all of its members share very similar values for $a$. In principle, these similar values will be clustered together by Algorithm 1, and therefore these computed clusters will be dense. Ideally, if all values of attribute $a$ are exactly the same in $g_i$, then we will get only one cluster in $D_{a,i}$ modeling that value and having the size of $g_i$; i.e., $c_{a,i} = 1$ and $f_{a,i} = N_i$.

Corollary 2 (Unimportant attribute). If an attribute, say $a \in X$, is unimportant for a group of customers, say $g_i$, then $D_{a,i}$ will be composed of scattered sparse clusters.

Corollary 2 is straightforward from Definition 1. In the limit, all values of $a$ taken by customers in $g_i$ are very dissimilar, so that each will compose a single cluster. In such a case, $D_{a,i}$ will be composed of $N_i$ clusters, each embedding a single distinct value.

Theorem 1. Let $a, b \in X$ be two attributes, and $g_i$ be a group of customers. Let us assume that $a$ is important for $g_i$, and that $b$ is unimportant for $g_i$; then we have:

$$c_{a,i} \ll c_{b,i}. \tag{6}$$

Theorem 1 is straightforward from Corollaries 1 and 2, and from the fact that the number of values in a group of whatever attribute is equal to the size of that group; i.e.:

$$\sum_{j=1}^{c_{a,i}} f_{a,i}^j = \sum_{j=1}^{c_{b,i}} f_{b,i}^j = N_i. \tag{7}$$

Theorem 1 simply states that the number of clusters produced by an important attribute is much less than the number of clusters produced by an unimportant attribute.

In order to quantify the importance of an attribute w.r.t. a group of customers, we will use the information entropy as an evaluation criterion. But, first, we need to define the "probability of occurrence" (or normalized frequency) of the value of an attribute $a$ w.r.t. $g_i$ by:

$$p_{a,i}^j = \frac{1}{2} \frac{f_{a,i}^j}{N_i} \qquad \forall S_{a,i}^j \in D_{a,i}, \tag{8}$$

where $f_{a,i}^j$ is the density (or number of customers) of cluster $S_{a,i}^j$. Now, we will forward the following theorem.

Theorem 2. Given an attribute $a$ and a group $g_i$, we have:

$$\sum_{j=1}^{c_{a,i}} p_{a,i}^j = \frac{1}{2}, \tag{9}$$

$$\frac{1}{2 N_i} \leq p_{a,i}^j \leq \frac{1}{2} \qquad \forall S_{a,i}^j \in D_{a,i}. \tag{10}$$

The proof of Theorem 2 is reported in Appendix A. Now, we are ready to define the information entropy of an attribute $a$ w.r.t. a group of customers $g_i$ by:

$$E_i(a) = -\frac{1}{c_{a,i}} \sum_{j=1}^{c_{a,i}} \left[ p_{a,i}^j \log_2\left(p_{a,i}^j\right) + \left(1 - p_{a,i}^j\right) \log_2\left(1 - p_{a,i}^j\right) \right], \tag{11}$$

where $p_{a,i}^j$ is computed by Eq. (8).
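As a hypothetical numeric illustration of Eqs. (8)-(11) (the numbers below are ours, not taken from the paper), consider a group of $N_i = 100$ customers, an attribute $a$ whose values fall into two dense clusters of sizes 60 and 40, and an attribute $b$ whose 100 values are all distinct:

$$p_{a,i}^1 = 0.30,\;\; p_{a,i}^2 = 0.20 \;\;\Rightarrow\;\; E_i(a) = \tfrac{1}{2}\,\big[f(0.30) + f(0.20)\big] \approx \tfrac{1}{2}\,(0.881 + 0.722) \approx 0.80,$$

$$p_{b,i}^j = \tfrac{1}{200} = 0.005 \;\; (j = 1, \ldots, 100) \;\;\Rightarrow\;\; E_i(b) = f(0.005) \approx 0.05,$$

where $f(x) = -[x \log_2 x + (1-x)\log_2(1-x)]$. Hence $E_i(b) \ll E_i(a)$, which is the behavior stated by Theorem 4 below and visible later in the entropy columns of Tables 5 and 6 (the near-constant attribute $a_5$ reaches the maximal entropy 1.000, while the widely spread attribute $a_3$ falls to about 0.19).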

Now, we will forward the following theorems regarding the entropy of an attribute.

Theorem 3. Given a group of customers $g_i$, we have:

$$0 < E_i(a) \leq 1 \qquad \forall a \in X. \tag{12}$$

The proof of Theorem 3 is reported in Appendix B.

Theorem 4. Let $a, b \in X$ be two attributes, and $g_i$ be a group of customers. Let us assume that $a$ is important for $g_i$, and that $b$ is unimportant for $g_i$; then we have:

$$E_i(b) \ll E_i(a). \tag{13}$$

The proof of Theorem 4 is reported in Appendix C. This theorem simply states that the entropy of an attribute in a group is proportional to its importance in that group. Now, we will define the "importance ratio" of an attribute in a group by:

$$I_i(a) = \frac{E_i(a)}{E(g_i)}, \tag{14}$$

where $E_i(a)$ is computed by Eq. (11), and $E(g_i)$ is the entropy of group $g_i$, defined as the summation of the entropies w.r.t. $g_i$ of all attributes and given by:

$$E(g_i) = \sum_{a \in X} E_i(a), \tag{15}$$

where $X$ is the set of attributes. The "importance ratio" of an attribute obeys the following corollary:

Corollary 3. The importance of an attribute $a \in X$ for a group $g_i$ increases as its information entropy $E_i(a)$ increases, and vice versa.

Corollary 3 is straightforward from Eq. (14) and Theorem 4.

The most important attributes for a group $g_i$ are computed using Algorithm 2. The core idea of this algorithm is to select only those attributes whose "importance ratio" computed with Eq. (14) is greater than a fixed threshold $\rho$. The ratio $\rho$ is a user-specified parameter and reflects the "degree of vigilance": as $\rho$ approaches 1, only the most relevant attributes will be selected. At this stage, we should notice that there is actually no simple way for the automatic selection of $\rho$. A similar problem exists in other dimensionality reduction techniques (Dash et al., 1997). Finally, notice that Algorithm 2 is applied to each computed group of customers. Consequently, and as a result of this second step, we obtain a set of heterogeneous groups each described by a distinct set of attributes judged to be the most informative. In the next section, we will describe our learning approach for the extraction of the "knowledge" embedded in each group.
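A possible realization of this second step is sketched below, again in Python/NumPy and under our own naming; it is not the authors' code. It reuses the `algorithm1` sketch of Section 2.1 to cluster the values of one attribute inside one group, then applies Eqs. (8), (11), (14) and (15) together with the threshold ρ of Algorithm 2.

```python
import numpy as np

def attribute_entropy(cluster_sizes, n_i):
    """Eqs. (8) and (11): entropy of one attribute w.r.t. a group of size n_i,
    given the sizes f_j of the value clusters found by Algorithm 1."""
    p = 0.5 * np.asarray(cluster_sizes, dtype=float) / n_i           # Eq. (8)
    h = -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))             # binary entropy of each p_j
    return h.mean()                                                  # Eq. (11): (1/c_{a,i}) * sum_j

def select_attributes(group_values, rho=0.25, c_max=10):
    """Algorithm 2 (sketch): keep the attributes whose importance ratio (Eq. 14) exceeds rho.
    `group_values` maps an attribute name to the 1-D array of its values inside one group."""
    n_i = len(next(iter(group_values.values())))
    entropies = {}
    for attr, vals in group_values.items():
        vals = np.asarray(vals, dtype=float)
        if np.ptp(vals) == 0:                        # all values identical: one cluster of size n_i
            sizes = [n_i]
        else:
            # cluster this attribute's values with the entropy-validated FCM loop (Algorithm 1)
            _, _, U, _ = algorithm1(vals.reshape(-1, 1), c_min=2, c_max=c_max)
            sizes = np.bincount(U.argmax(axis=1))    # crisp cluster sizes f_j
            sizes = sizes[sizes > 0]
        entropies[attr] = attribute_entropy(sizes, n_i)
    e_group = sum(entropies.values())                # Eq. (15): E(g_i)
    ratios = {a: e / e_group for a, e in entropies.items()}          # Eq. (14): I_i(a)
    return {a for a, r in ratios.items() if r > rho}, ratios
```

With ρ = 0.25, such a routine should reproduce, on the synthetic groups of Section 2.4, selections of the form {A1, A2} for G1, {A2, A3} for G2 and {A1, A3} for G3, in line with Fig. 8.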
2.3. Third step: building customer profiles

The main purpose of this third and final step of our approach is to extract the "knowledge" about each group of customers (or each cluster) using artificial neural network learning. Several learning techniques were proposed in the literature, including decision trees (Gehreke, 2003), association rules (Giudici & Passerone, 2002; Mitra et al., 2002), and Self-Organizing Feature Maps (Hung & Tsai, 2008; Kuoa, Ana, Wanga, & Chungbi, 2006), to highlight just a few. Further details can be found in the specialized literature – see for example (Gehreke, 2003; Mitra et al., 2002).

For our concern, we will use backpropagation neural networks (Rumelhart et al., 1986) in order to extract the most important characteristics of each group of customers. This choice is motivated by their impressive results in several areas such as pattern recognition (Haykin, 1994; Widrow & Lehr, 1992). We will use a distinct backpropagation network for each cluster. Remember that each cluster is described by a distinct set of attributes judged to be the most important for it. Each backpropagation network is composed of three layers: input, output and hidden. For a cluster $C_k$ in which customers are described by $b_k$ attributes, we will use $b_k$ neurons in the input layer, $b_k$ sigmoid neurons in the output layer, and $b_k/2$ neurons in the hidden layer. While the number of neurons in the input and output layers is fixed by the number of attributes in the learning data set, the choice of the optimal number of hidden neurons is still an NP-hard problem in backpropagation learning (Blum & Rivest, 1992). The general rule of thumb is that a large number of hidden neurons will lead to overtraining and thereby poor generalization, whereas a small number will lead to poor learning – further details can be found in (Haykin, 1994; Zhanga, Mab, & Yang, 2003). In our case, for a cluster $C_k$ in which customers are described by $b_k$ attributes, we found in practice that $b_k/2$ hidden neurons are sufficient. Each backpropagation network was trained using the data in the corresponding cluster with the generalized delta rule with momentum (Haykin, 1994):

$$\Delta W_{i,j}(t+1) = \eta \, \delta_j(t+1) \, y_i(t+1) + \alpha \, \Delta W_{i,j}(t), \tag{16}$$

where $t$ is the iteration (time) counter, $W_{i,j}$ is the connection weight from neuron $nr_i$ to neuron $nr_j$ in two consecutive layers, $y_i$ is the output signal of $nr_i$, $\delta_j$ is the "learning error" of $nr_j$, and $\eta$ and $\alpha$ are two small constants ($\in\, ]0, 1[$) representing the learning and the momentum rates, respectively. In Eq. (16), the $\alpha$-term is the momentum term used to avoid the oscillation of the algorithm between two local minima (Haykin, 1994). The distinct connections of the network are adjusted according to Eq. (16) until the sum of squared errors over all training samples becomes negligible.

Finally, we obtain a set of trained backpropagation networks each representing a distinct group of customers (or a profile). These profiles will be very useful in categorizing new customers as follows. We present a customer, say $x$, to each trained backpropagation network, say $net_i$, and measure the error as follows:

$$ER_i(x) = \frac{\|y_i(x) - x\|_E}{\|x\|_E} \qquad \forall net_i. \tag{17}$$

Here $\|\cdot\|_E$ is the Euclidean distance and $y_i(x)$ is the output of network $net_i$ when $x$ is fed into it. Now, we define the global error of our model w.r.t. customer $x$ as:

$$ER(x) = \min_i ER_i(x). \tag{18}$$

In order to measure the performance of our model, we use Algorithm 3. It accepts as input the trained backpropagation neural networks, a threshold error $\epsilon$, and a test data set $T$. For each customer $x \in T$, we compute the error of the model $ER(x)$ using Eq. (18), and then we take a decision on $x$: (1) $x$ is unrecognized if $ER(x)$ is greater than $\epsilon$; (2) $x$ is misclassified if $ER(x)$ is below $\epsilon$ but $x$ is assigned to the wrong class; or (3) $x$ is correctly classified if $ER(x)$ is below $\epsilon$ and $x$ is assigned to the correct class. The next section illustrates the distinct steps of our approach.
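The construction and use of the profiles (Eqs. (16)-(18) and Algorithm 3) can be sketched as follows. This is an illustrative fragment under our own assumptions: each profile network is written here as a small one-hidden-layer network trained to reproduce its input (which is how Eq. (17) compares the network output with the customer vector), using batch gradient descent with momentum in the spirit of Eq. (16); the initialization, the absence of bias terms, and all names are ours, not the authors'.

```python
import numpy as np

def train_profile_net(Xc, hidden=None, eta=0.1, alpha=0.9, epochs=500, seed=0):
    """Train one profile network on the customers of a single cluster (rows of Xc, already
    reduced to that cluster's b_k most important attributes and scaled to [0, 1], since the
    outputs are sigmoids). b_k inputs, b_k/2 hidden units, b_k sigmoid outputs; no bias terms."""
    rng = np.random.default_rng(seed)
    n, b = Xc.shape
    h = hidden or max(1, b // 2)
    W1 = rng.normal(0, 0.1, (b, h)); W2 = rng.normal(0, 0.1, (h, b))
    dW1 = np.zeros_like(W1); dW2 = np.zeros_like(W2)
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    for _ in range(epochs):
        H = sig(Xc @ W1)                  # hidden activations
        Y = sig(H @ W2)                   # network output: reconstruction of the input
        d_out = (Y - Xc) * Y * (1 - Y)    # delta of the sigmoid output layer
        d_hid = (d_out @ W2.T) * H * (1 - H)
        dW2 = -eta * (H.T @ d_out) / n + alpha * dW2   # Eq. (16): gradient step + momentum
        dW1 = -eta * (Xc.T @ d_hid) / n + alpha * dW1
        W2 += dW2; W1 += dW1
    return lambda x: sig(sig(x @ W1) @ W2)             # y_i(x): the trained profile network

def classify(x_views, nets, eps=0.15):
    """Algorithm 3 decision for one customer. `x_views[i]` is the customer restricted to the
    attributes kept for profile i, and `nets[i]` the corresponding trained network."""
    errs = [np.linalg.norm(net(x) - x) / np.linalg.norm(x)   # Eq. (17): relative error ER_i(x)
            for net, x in zip(nets, x_views)]
    best = int(np.argmin(errs))                              # Eq. (18): ER(x) = min_i ER_i(x)
    if errs[best] > eps:
        return None, errs[best]          # "unrecognized"
    return best, errs[best]              # predicted profile; caller compares it with the true one
```

Given one trained network per group (each on its own reduced attribute set) and a threshold ε, `classify` mirrors the three-way decision of Algorithm 3: a customer is unrecognized when the smallest relative error exceeds ε, and otherwise counted as correctly classified or misclassified depending on whether the winning profile matches the true category.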

2.4. Illustration

The main purpose of this section is to illustrate the distinct steps of our model. Results on large real-world data sets will be outlined in the next section.

2.4.1. Experimental setup

We used a synthetic data set composed of 600 customers (450 for training, and 150 for testing) categorized into three groups. Each customer is described by three numerical attributes; i.e., $X = \{A_1, A_2, A_3\}$. A three-dimensional plot of this data set is reported in Fig. 2.

A projection of all groups w.r.t. each attribute is reported in Fig. 3. From Fig. 3a, one can read that $A_1$ is a grouping (or important) attribute for both $g_1$ and $g_3$, since customers in these groups share very similar values for $A_1$. On the opposite, we can see that customers in $g_2$ have dissimilar values for $A_1$, and therefore $A_1$ is not important for $g_2$. Regarding attribute $A_2$, we notice in Fig. 3b that it is important for $g_1$ and $g_2$, while it is not important for $g_3$. Finally, attribute $A_3$, depicted in Fig. 3c, is important for groups $g_3$ and $g_2$, while it is not important for $g_1$. All these remarks are summarized in Table 1, where the first column reports the group, the second column its important attributes, and the third column its unimportant ones.

2.4.2. First step: fuzzy clustering

In this first step, we partition the training data composed of 450 customers using Algorithm 1. Fig. 4 depicts the variation of the partition entropy in Eq. (5) w.r.t. the number of groups (or clusters) over the interval $[C_{min}, \ldots, C_{max}] = [2, \ldots, 6]$. It is easy to see that the optimal partitioning is that obtained for three clusters, since it has the lowest partition entropy. Hence, our Algorithm 1 was able to discover the "natural structure" of the training data since it computed the real number of clusters. Table 2 summarizes the obtained optimal partition.

2.4.3. Second step: attribute selection

In this second step, we select for each group its most important attributes. Remember that we use Algorithm 1 to compute clusters of "similar values" of an attribute w.r.t. a given group of customers, and thereby judge its importance. Figs. 5-7 report the variation of the partition entropy of all attributes for groups $G_1$, $G_2$ and $G_3$, respectively. For example, one can read in Fig. 5c that the optimal partitioning of attribute $A_3$ in group $G_1$ is that with 5 clusters. Stated otherwise, the values of $A_3$ taken by individuals in group $G_1$ can be optimally classified into five "similar" values. The characteristics of the clusters computed by Algorithm 1 for each attribute w.r.t. the groups $G_1$, $G_2$, and $G_3$ are summarized in Table 3, where we report in the first column the group, in the second column the attribute, and in the third column the densities of the computed clusters, with the number of clusters being the computed optimal one. Regarding $G_1$, one notices that $A_3$ produces sparse scattered clusters, whereas $A_1$ and $A_2$ produce dense clusters. As for $G_2$, $A_1$ produces sparse scattered clusters, whereas $A_2$ and $A_3$ produce dense clusters. In $G_3$, $A_2$ produces sparse scattered clusters; however, $A_1$ and $A_3$ produce dense clusters.

Using Eq. (14) and Table 3, we computed the importance ratios of all attributes in each group. Results are plotted in Fig. 8. Running Algorithm 2 with an arbitrary threshold $\rho = 0.25$, we get the following sets as being the most important (grouping) attributes: $\{A_1, A_2\}$ for $G_1$, $\{A_2, A_3\}$ for $G_2$, and $\{A_1, A_3\}$ for $G_3$. These obtained results are in total accordance with the synthetic data set properties summarized in Table 1.

[Fig. 2. A synthetic data set of 600 customers categorized into three groups (G1, G2, G3); axes A1, A2, A3.]

[Fig. 3. The projection of the synthetic data set w.r.t. each attribute: (a) A1, (b) A2, (c) A3, per group (G1, G2, G3).]

Table 1
Attributes' importance for customers in G1, G2 and G3.

Group | Important attributes | Unimportant attributes
G1    | {A1, A2}             | {A3}
G2    | {A2, A3}             | {A1}
G3    | {A1, A3}             | {A2}

[Fig. 4. Variation of the partition entropy w.r.t. the number of clusters.]

Table 2
Characteristics of the computed optimal partition.

Group | Size | Density (%)
G1    | 206  | 45.78
G2    | 121  | 26.89
G3    | 123  | 27.33

[Fig. 5. Variation of the partition entropy of all attributes in group G1: (a) attribute A1, (b) attribute A2, (c) attribute A3.]

2.4.4. Third step: building customer profiles

Keeping for each group only the most important (grouping) attributes, we proceed to the construction of the customer profiles using backpropagation neural learning. For this, we have used three backpropagation neural networks, each trained using the data of the corresponding group. The architecture of each network is composed of three layers: an input layer of two neurons, an output layer of two neurons, and a single hidden neuron. Regarding the learning rule with momentum, we used the following arbitrary values: a learning rate $\eta = 0.1$ and a momentum $\alpha = 0.9$. In order to test the performance of the constructed models, we used Algorithm 3 with a test set of 150 customers previously unknown to the system and an arbitrary small threshold $\epsilon = 0.1$. Simulation results are summarized in Table 4. One can see from this table that in 104 cases (or 69.35%), the model was able to recognize the profile of the customers. However, in 46 cases (or 30.65%) the system was unable to recognize the profile of the customers; i.e., it produced an error greater than $\epsilon$. Among the recognized profiles, 90 cases (or 60%) were assigned to the exact profile, and 14 cases (or 9.35%) were assigned to the wrong profile (or misclassified). These results could be considered as satisfactory since the system showed a good recognition rate despite the fact that the test customers were previously unknown. In the next section, we will outline experimental results on real-world large data sets.

3. Experimental results

In this section, we conduct an experimental analysis of our model using large real-world data sets. For this, we considered a data set used at the Data Mining Cup conference 2004 (DMC04, 2004). It is composed of 18,918 customers: 15,000 are used for training (or clustering), and 3918 for testing (or prediction). Each customer is described by 23 attributes (categorical, numerical, and boolean). These attributes include address-based and transactional information: a detailed description is outlined in Appendix D of this paper. The data is about the return of purchases and was gathered by the mail order company Quelle AG. Nowadays, customers are encouraged by the friendly cost-free return of purchases to send back merchandize they have ordered.



[Fig. 6. Variation of the partition entropy of all attributes in group G2: (a) attribute A1, (b) attribute A2, (c) attribute A3.]

[Fig. 7. Variation of the partition entropy of all attributes in group G3: (a) attribute A1, (b) attribute A2, (c) attribute A3.]

Table 3
Computed clusters for the attributes A1, A2, and A3 in groups G1, G2, and G3.

Grp | Attr. | # Clust.       | Cluster densities
G1  | A1    | c_{A1,1} = 2   | 57.8%; 42.2%
G1  | A2    | c_{A2,1} = 2   | 44.2%; 55.8%
G1  | A3    | c_{A3,1} = 5   | 16.0%; 17.9%; 23.3%; 23.4%; 19.4%
G2  | A1    | c_{A1,2} = 5   | 21.5%; 8.3%; 33.1%; 20.6%; 16.5%
G2  | A2    | c_{A2,2} = 2   | 53.7%; 46.3%
G2  | A3    | c_{A3,2} = 2   | 55.4%; 44.6%
G3  | A1    | c_{A1,3} = 2   | 53.7%; 46.3%
G3  | A2    | c_{A2,3} = 7   | 11.3%; 11.4%; 17.1%; 9.9%; 27.6%; 9.7%; 13.0%
G3  | A3    | c_{A3,3} = 2   | 41.5%; 58.5%

[Fig. 8. Importance ratios for G1, G2, and G3.]

Table 4
Prediction accuracy of our model – CC: correctly classified; MS: misclassified; UR: unrecognized.

               | CC | MS   | UR
# Cases        | 90 | 14   | 46
Percentage (%) | 60 | 9.35 | 30.65

Doubtless, the cost of the return management for mail order companies is considerable, and making a prognosis of customer return rates becomes a necessity for customer value analysis. The used data set is composed of two categories of customers: a "low-rate return" category and a "high-rate return" category – further details can be found in (DMC04, 2004).

In the first step, we have clustered the 15,000 training customers using Algorithm 1. Fig. 9 plots the variation of the partition entropy with the number of groups over the interval $[C_{min}, C_{max}] = [2, 5]$. Clearly, the optimal partitioning is that for two groups, since it has the lowest partition entropy. Hence, our Algorithm 1 was able to discover the "natural structure" of the training data since it computed the real number of groups: "low-rate return" and "high-rate return" customers.

[Fig. 9. Variation of the partition entropy w.r.t. the number of clusters.]

In the second step, we have selected the most important attributes for each group as follows. First, remember that given a group of customers we have to cluster the values of each attribute using Algorithm 1 in order to find clusters of "similar values". Tables 5 and 6 summarize the clustering of all attributes w.r.t. the first and the second groups, respectively. The first column reports the attribute, and the second column the optimal number of clusters. This optimal number corresponds to the one producing the lowest partition entropy: the variation of the partition entropy of all attributes in each group is reported in Tables 10 and 11 in Appendix E of this paper. The third column of Tables 5 and 6 reports the densities of the computed clusters, and the final column the entropy of the attribute computed with Eq. (11). These entropy values are used to derive the importance ratios with Eq. (14). Results are summarized in Table 7, where we report the attribute and its importance ratio for the "low-rate return" and "high-rate return" groups. For illustration purposes, rows in Table 7 are sorted w.r.t. the importance ratio in ascending order. We remark that the importance of an attribute varies from one group to another. For example, attribute a7, which is the "percent of foreigner households in postcode area", has an importance of 3.74% in the first group and of 4.45% in the second group. This implicitly means that the "percent of foreigner households in postcode area" is much more important for "high-rate return" customers than for "low-rate return" ones. Using an arbitrary importance threshold of 3% drops the following sets of unimportant attributes: {a17, a3, a23, a2, a1} for the first group, and {a3, a17, a23, a6, a16, a2, a1} for the second group – see Table 7. Consequently, {a6, a16, a7, a15, a18, a19, a12, a20, a9, a21, a4, a22, a11, a14, a8, a10, a13, a5} is the set of important attributes for the first group of customers, and {a15, a19, a12, a18, a20, a7, a9, a21, a11, a22, a14, a8, a10, a13, a4, a5} is the set of important attributes for the second group of customers. Now, let us forward the following remarks. First, the address information a6 ("postal code") and the transactional information a16 ("number of female persons in household") are important for the "low-rate return" category of customers, whereas they are unimportant for the "high-rate return" category. Stated otherwise, the postal code and the number of females information are significant for the "low-rate return" category, whereas they are of much less significance for the "high-rate return" one.

Table 5
Computed clusters for all attributes in the first group of customers (‘‘low-rate return”).

Att. | c* (optimal number of clusters) | Cluster densities | E (entropy, Eq. (11))

a1 3 1.72%; 0.01%; 98.27% 0.357


a2 4 0.03%; 96.62%; 3.01%; 0.34% 0.283
a3 10 0.04%; 0.06%; 0.19%; 0.49%; 0.01%; 13.17%; 54.01%; 1.16%; 0.01%; 30.86% 0.191
a4 2 23.70%; 76.30% 0.742
a5 1 100.00% 1.000
a6 5 16.91%; 18.87%; 20.06%; 19.31%;24.86% 0.468
a7 3 1.17%; 81.88%; 16.95% 0.482
a8 2 57.46%; 42.54% 0.806
a9 2 3.50%; 96.50% 0.563
a10 2 52.77%; 47.23% 0.811
a11 2 25.83%; 74.17% 0.753
a12 2 99.18%; 0.82% 0.519
a13 2 52.41%; 47.59% 0.811
a14 2 35.00%; 65.00% 0.789
a15 2 99.83%; 0.17% 0.505
a16 3 82.08%; 0.29%; 17.64% 0.474
a17 10 0.01%; 1.06%; 0.03%; 26.93%; 0.14%; 0.03%; 0.10%; 67.51%; 4.15%; 0.03% 0.171
a18 2 0.30%; 99.70% 0.508
a19 2 99.47%; 0.53% 0.513
a20 2 3.04%; 96.96% 0.556
a21 2 6.76%; 93.24% 0.605
a22 2 75.06%; 24.94% 0.749
a23 9 0.79%; 2.22%; 1.56%; 2.12%; 10.99%; 17.05%; 1.23%; 1.91%; 62.13% 0.225

Table 6
Computed clusters for all attributes in the second group of customers (‘‘high-rate return”).

Att. | c* (optimal number of clusters) | Cluster densities | E (entropy, Eq. (11))

a1 3 96.77%; 0.30%; 2.93% 0.375


a2 3 0.10%; 1.32%; 98.58% 0.354
a3 10 1.61%; 0.11%; 1.73%; 11.36%; 29.85%; 0.06%; 0.01%; 54.75%; 0.10%; 0.41% 0.195
a4 2 49.46%; 50.54% 0.811
a5 1 100.00% 1.000
a6 8 21.75%; 12.44%; 15.01%; 10.74%; 9.03%; 9.68% ; 9.46%; 11.90% 0.333
a7 2 95.04%; 4.96% 0.583
a8 2 42.41%; 57.59% 0.806
a9 2 5.42%; 94.58% 0.589
a10 2 44.03%; 55.97% 0.808
a11 2 15.37%; 84.63% 0.687
a12 2 1.18%; 98.82% 0.526
a13 2 55.21%; 44.79% 0.809
a14 2 61.21%; 38.79% 0.799
a15 2 0.19%; 99.81% 0.505
a16 5 12.26%; 0.09%; 11.63%; 1.82%; 74.20% 0.337
a17 9 7.37%; 2.51%; 0.02%; 65.39%; 0.01%; 0.09%; 0.21%; 0.60%; 23.80% 0.201
a18 2 1.43%; 98.57% 0.531
a19 2 99.75%; 0.25% 0.507
a20 2 1.51%; 98.49% 0.532
a21 2 6.50%; 93.50% 0.602
a22 2 78.43%; 21.57% 0.730
a23 9 1.77%; 1.20%; 0.76%; 20.00%; 0.39%; 1.01%; 65.12%; 2.87%; 6.88% 0.215

Second, we observe in Table 7 that a5 ("having a phone") is the most important attribute for both groups. Consequently, the fact of having (or not having) a phone is fundamental in the considered case. This seems natural, since it is easier for customers having phones (the "high-rate return" category) to notify the mail company (by phone) of the return of goods than for those who do not have one (the "low-rate return" category). Similar remarks could be derived from Table 7 for other attributes of less importance.

Table 7
The importance of the attributes in the first and second groups.

First group ("low-rate return")     Second group ("high-rate return")
Att.  I (%)                         Att.  I (%)
a17   1.33                          a3    1.52
a3    1.48                          a17   1.57
a23   1.75                          a23   1.67
a2    2.20                          a6    2.59
a1    2.77                          a16   2.62
a6    3.63                          a2    2.76
a16   3.68                          a1    2.92
a7    3.74                          a15   3.94
a15   3.92                          a19   3.95
a18   3.94                          a12   4.10
a19   3.98                          a18   4.13
a12   4.03                          a20   4.14
a20   4.32                          a7    4.54
a9    4.37                          a9    4.59
a21   4.70                          a21   4.69
a4    5.76                          a11   5.35
a22   5.81                          a22   5.69
a11   5.85                          a14   6.23
a14   6.13                          a8    6.28
a8    6.26                          a10   6.29
a10   6.29                          a13   6.30
a13   6.29                          a4    6.32
a5    7.76                          a5    7.79

Keeping for each group only the most important attributes, we proceeded in the third step to the construction of the two customer profiles using backpropagation neural networks trained with the data in the corresponding group. For the "low-rate return" group, we used a backpropagation network of three layers: 18 input neurons, nine hidden neurons, and 18 output neurons. Regarding the "high-rate return" group, we used a backpropagation network of three layers: 16 input neurons, eight hidden neurons, and 16 output neurons. Remember that for a group described by $b_k$ attributes, we use $b_k$ input neurons, $b_k$ output neurons, and $b_k/2$ hidden ones. In addition, we used a learning rate $\eta = 0.1$ and a momentum $\alpha = 0.9$.

In order to test the performance of the constructed models, we used Algorithm 3 with a test set of 3918 customers previously unknown to the system and an arbitrary threshold error $\epsilon = 0.15$. Simulation results are summarized in Table 8. We notice that for 2923 customers (i.e., in 74.60% of the simulations) the model was able to recognize the profile (or category), and for 995 customers (i.e., in 25.39% of the simulations) the system was unable to recognize their profile. Among the recognized profiles, 1464 (or 37.37%) were assigned to the exact profile, and 1459 (or 37.24%) were misclassified. These results could be considered as very satisfactory since the system showed a good recognition rate despite the fact that the test customers were previously unknown. This could be of great profit to mail order companies for the prognosis of customer return rates.

Table 8
Prediction accuracy of our model on the test set – CC: correctly classified; MS: misclassified; UR: unrecognized.

               | CC    | MS    | UR
# Cases        | 1464  | 1459  | 995
Percentage (%) | 37.37 | 37.24 | 25.39

4. Concluding remarks and future research

Customer profiling consists in determining a set of customer models that can be used for predicting the behavior of new customers, which is very useful for targeted marketing. In this paper, we proposed a model for customer profiling based on fuzzy clustering and backpropagation neural networks. Our model is composed of three steps. In the first step, we cluster customers using an FCM-based approach. The number of clusters is computed from the data set at hand using the partition entropy as a validity measure. In the second step, we select for each group the "most important" attributes. For this, we use the same clustering approach but applied to the values of an attribute: similar values will be clustered together and can thereby be considered as being the same.

Thereby, we compute the entropy of an attribute w.r.t. the given group of customers in order to quantify its importance for that group. In the third and final step of our approach, we develop a set of backpropagation neural networks, each corresponding to a group of customers and trained using the data in that group. Each trained backpropagation network corresponds to a distinct customer profile and is useful for predicting the profile of new customers. Experimental results on a real-world large database of a mail order company are very encouraging and should stimulate future research. First, we have to find a simple way to estimate the range ($[C_{min}, C_{max}]$) of the optimal number of clusters using any prior knowledge about the data set. In fact, the computational time could be unacceptable if this interval is very large. Second, we have to develop an algorithm for representing the "knowledge" learned by each backpropagation neural network in a more understandable form; for example, as a set of IF-THEN rules. These two last points constitute our immediate focus.

Appendix A. Proof of Theorem 2

For the sake of readability, let us remind that $p_{a,i}^j$ is defined by:

$$p_{a,i}^j = \frac{1}{2} \frac{f_{a,i}^j}{N_i} \qquad \forall S_{a,i}^j \in D_{a,i}. \tag{19}$$

Using Eq. (19), we get:

$$\sum_{j=1}^{c_{a,i}} p_{a,i}^j = \frac{1}{2 N_i} \sum_{j=1}^{c_{a,i}} f_{a,i}^j. \tag{20}$$

At this stage, we should remind that $f_{a,i}^j$ is the number of values in cluster $S_{a,i}^j$. Consequently, we have:

$$\sum_{j=1}^{c_{a,i}} f_{a,i}^j = N_i. \tag{21}$$

Using Eqs. (20) and (21), we get:

$$\sum_{j=1}^{c_{a,i}} p_{a,i}^j = \frac{1}{2 N_i} \cdot N_i = \frac{1}{2}. \tag{22}$$

This proves the first part of the theorem, which is Eq. (9). Regarding the second part, which is Eq. (10), we have to distinguish two limiting cases:

1st case: $D_{a,i} = \{S_{a,i}\}$ is composed of a single cluster. This happens when all values of attribute $a$ w.r.t. $g_i$ are exactly the same. Naturally, the number of values in $S_{a,i}$ is equal to $N_i$ (the number of individuals in $g_i$); i.e., $f_{a,i} = N_i$. Consequently, we have:

$$p_{a,i} = \frac{1}{2} \frac{f_{a,i}}{N_i} = \frac{1}{2}. \tag{23}$$

2nd case: $D_{a,i} = \{S_{a,i}^1, \ldots, S_{a,i}^{N_i}\}$ is composed of $N_i$ clusters, each cluster embedding a single distinct value; i.e., $f_{a,i}^j = 1$ $(j = 1, \ldots, N_i)$. Then, we have:

$$p_{a,i}^j = \frac{1}{2 N_i} \qquad \forall j. \tag{24}$$

Using Eqs. (23) and (24), we can state that in the general case we have:

$$\frac{1}{2 N_i} \leq p_{a,i}^j \leq \frac{1}{2} \qquad \forall i, j. \tag{25}$$

Appendix B. Proof of Theorem 3

It is known that $f(x) = -[x \log_2(x) + (1-x)\log_2(1-x)]$ is an increasing function over the interval $]0, \frac{1}{2}]$, and that $f(x) \in [0, 1]$ $\forall x \in\, ]0, \frac{1}{2}]$. Using Theorem 2, which states that $\frac{1}{2N_i} \leq p_{a,i}^j \leq \frac{1}{2}$, we can say that:

$$0 < f\left(p_{a,i}^j\right) \leq 1 \qquad \forall j, \tag{26}$$

$$\Rightarrow \; 0 < \frac{1}{c_{a,i}} \sum_{j=1}^{c_{a,i}} f\left(p_{a,i}^j\right) \leq \frac{1}{c_{a,i}} \sum_{j=1}^{c_{a,i}} 1 = 1, \tag{27}$$

$$\Rightarrow \; 0 < E_i(a) \leq 1. \tag{28}$$

Table 9
Description of the attributes of the used experimental data from the DMC cup 2004.

Att. Orig. abbrev Significance/remark


Address data
a1 LIEFER_MENGE Number of delivered articles in prediction period
a2 RETOUREN_MENGE Number of returned articles in prediction period
a3 ANZAHL_BESTELLUNGEN Number of orders in prediction period (only in training data)
a4 AADSEIT Address since month
a5 AATELKZ Phone yes/no
a6 APLZ5ST Postcode (5 digits)
a7 APLZAAH Percent of foreigner households in postcode area
a8 APLZAKH Percent of households with children in postcode area
a9 APLZBVD Population denseness in postcode area
a10 APLZEIN Number of residents in postcode area
a11 APLZFLH Expanse of postcode area
a12 APLZFLI Fluctuation index in postcode area
a13 APLZKKE Purchasing power/resident in postcode area
a14 PALTERK Age of addressee in years
a15 PANREDK Title of addressee
a16 PRSANZW Number of female persons in household
Merchandize management data
a17 BBH0001 Number of orders in period B
a18 BBH0002 Number of orders in period F
a19 BBW0001 Order value in period A in Euro
a20 BBW0002 Order value in period B in Euro
a21 BBW0003 Order value in period D in Euro
a22 BEB0001 First order before max. x months
a23 BLB0001 Last order before max. x months

Table 10
Evolution of the partition entropy with the number of clusters varying between C_min = 2 and C_max = 10 for the first group of customers ("low-rate return").

Att. | c=2 | c=3 | c=4 | c=5 | c=6 | c=7 | c=8 | c=9 | c=10 | c*
a1 0.052 0.044 0.065 0.060 0.067 0.056 0.059 0.064 0.069 3
a2 0.048 0.051 0.046 0.063 0.063 0.071 0.058 0.058 0.058 4
a3 0.038 0.036 0.048 0.042 0.034 0.017 0.022 0.012 0.007 10
a4 0.041 0.054 0.061 0.065 0.068 0.070 0.071 0.072 0.073 2
a5 | c* = 1: all customers in this group have the same value for this attribute
a6 0.048 0.047 0.046 0.046 0.047 0.054 0.048 0.050 0.046 5
a7 0.052 0.050 0.060 0.058 0.063 0.063 0.062 0.065 0.069 3
a8 0.043 0.058 0.066 0.069 0.072 0.076 0.073 0.075 0.076 2
a9 0.014 0.023 0.032 0.032 0.045 0.048 0.052 0.055 0.060 2
a10 0.038 0.050 0.059 0.059 0.065 0.063 0.068 0.069 0.069 2
a11 0.039 0.047 0.053 0.064 0.064 0.064 0.069 0.069 0.067 2
a12 0.004 0.051 0.061 0.065 0.068 0.070 0.072 0.073 0.075 2
a13 0.054 0.062 0.069 0.074 0.066 0.075 0.070 0.073 0.069 2
a14 0.042 0.057 0.065 0.070 0.065 0.067 0.069 0.070 0.069 2
a15 0.000 0.003 0.003 0.001 0.092 0.482 0.937 0.067 0.962 2
a16 0.003 0.000 0.003 0.001 0.404 0.951 0.458 0.712 0.312 3
a17 0.044 0.043 0.053 0.060 0.060 0.040 0.054 0.046 0.030 10
a18 0.042 0.050 0.047 0.058 0.060 0.052 0.051 0.051 0.053 2
a19 0.009 0.033 0.049 0.054 0.061 0.067 0.064 0.067 0.069 2
a20 0.026 0.039 0.049 0.055 0.059 0.061 0.072 0.072 0.074 2
a21 0.037 0.042 0.051 0.063 0.061 0.068 0.067 0.070 0.070 2
a22 0.041 0.054 0.061 0.066 0.065 0.051 0.054 0.057 0.068 2
a23 0.040 0.045 0.037 0.047 0.048 0.030 0.023 0.011 0.020 9

Table 11
Evolution of the partition entropy with the number of clusters varying between C_min = 2 and C_max = 10 for the second group of customers ("high-rate return").

Att. | c=2 | c=3 | c=4 | c=5 | c=6 | c=7 | c=8 | c=9 | c=10 | c*
a1 0.052 0.045 0.048 0.055 0.051 0.055 0.061 0.062 0.062 3
a2 0.048 0.045 0.050 0.057 0.060 0.056 0.056 0.054 0.062 3
a3 0.044 0.047 0.056 0.037 0.032 0.027 0.020 0.024 0.008 10
a4 0.041 0.055 0.061 0.066 0.068 0.070 0.072 0.073 0.072 2
a5 | c* = 1: all customers in this group have the same value for this attribute
a6 0.046 0.056 0.055 0.048 0.057 0.052 0.045 0.053 0.048 8
a7 0.044 0.056 0.064 0.060 0.069 0.061 0.068 0.065 0.072 2
a8 0.044 0.061 0.066 0.073 0.072 0.074 0.074 0.073 0.076 2
a9 0.018 0.038 0.041 0.043 0.045 0.049 0.049 0.054 0.049 2
a10 0.050 0.053 0.061 0.069 0.068 0.066 0.075 0.071 0.074 2
a11 0.021 0.041 0.045 0.043 0.055 0.057 0.056 0.057 0.058 2
a12 0.002 0.050 0.054 0.061 0.061 0.068 0.068 0.067 0.074 2
a13 0.042 0.045 0.056 0.059 0.069 0.067 0.071 0.072 0.077 2
a14 0.042 0.056 0.063 0.068 0.072 0.074 0.070 0.070 0.074 2
a15 0.000 0.003 0.002 0.316 0.205 0.008 0.561 0.105 0.366 2
a16 0.003 0.017 0.003 0.001 0.838 0.523 0.577 0.151 0.094 5
a17 0.047 0.040 0.060 0.068 0.059 0.056 0.061 0.040 0.050 9
a18 0.047 0.054 0.063 0.054 0.060 0.070 0.057 0.064 0.068 2
a19 0.003 0.040 0.049 0.055 0.064 0.063 0.067 0.068 0.069 2
a20 0.019 0.039 0.055 0.055 0.061 0.063 0.068 0.065 0.069 2
a21 0.037 0.049 0.053 0.059 0.066 0.065 0.070 0.070 0.072 2
a22 0.041 0.054 0.061 0.066 0.065 0.051 0.079 0.056 0.057 2
a23 0.040 0.045 0.037 0.047 0.048 0.030 0.022 0.011 0.020 9

Appendix C. Proof of Theorem 4

Let $a, b \in X$ be two attributes, and $g_i$ a group of customers. $D_{a,i} = \{S_{a,i}^1, \ldots, S_{a,i}^{c_{a,i}}\}$ and $D_{b,i} = \{S_{b,i}^1, \ldots, S_{b,i}^{c_{b,i}}\}$ are the sets of clusters corresponding to attributes $a$ and $b$, respectively. Let us assume that $a$ is an important (or grouping) attribute for $g_i$, whereas $b$ is not. Let $f(x) = -[x \log_2(x) + (1-x)\log_2(1-x)]$. For the sake of readability, let us remind that the probabilities $p_{a,i}^j$ and $p_{b,i}^j$ are computed by the following equations:

$$p_{a,i}^j = \frac{1}{2} \frac{f_{a,i}^j}{N_i} \qquad \forall S_{a,i}^j \in D_{a,i}, \tag{29}$$

$$p_{b,i}^j = \frac{1}{2} \frac{f_{b,i}^j}{N_i} \qquad \forall S_{b,i}^j \in D_{b,i}. \tag{30}$$

Using Theorem 2, we can state that:

$$\frac{1}{2 N_i} \leq p_{a,i}^j \leq \frac{1}{2} \qquad \forall S_{a,i}^j \in D_{a,i}, \tag{31}$$

$$\frac{1}{2 N_i} \leq p_{b,i}^j \leq \frac{1}{2} \qquad \forall S_{b,i}^j \in D_{b,i}. \tag{32}$$

It is known that $f(x)$ is increasing over the interval $[0, \frac{1}{2}]$. Using Eqs. (31) and (32), and the fact that clusters in $D_{a,i}$ are more dense (i.e., embed more values) than clusters in $D_{b,i}$ (see Corollaries 1 and 2), we can assert that:

$$\sum_{j=1}^{c_{b,i}} f\left(p_{b,i}^j\right) < \sum_{j=1}^{c_{a,i}} f\left(p_{a,i}^j\right). \tag{33}$$

Using Theorem 1, we can say that $c_{a,i} \ll c_{b,i}$. Consequently, we can deduce from Eq. (33) that:

cb;i  ! ca;i  !
1 X 1 X foundations of comp. intelligence, Honolulu, Hawaii, USA, 1–5 April 2007 (Vol. 1,
f pjb;i  f pja;i ð34Þ pp. 214–221).
cb;i j¼1
ca;i j¼1 Jiang, D., Pei, J., & Zhang, A. (2005). An interactive approach to mining gene
expression data. IEEE Transactions on Knowledge Data Engineering, 17(10),
) Ei ðbÞ  Ei ðaÞ: ð35Þ 1363–1378.
Jiang, T., & Tuzhilin, A. (2006). Segmenting customers from population to
individuals: Does 1-to-1 keep your customers forever. IEEE Transactions on
Knowledge Data Engineering, 18(10), 1297–1311.
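To make the chain (33)-(35) concrete, the short numerical check below (not part of the original paper) evaluates $E_i(\cdot) = \frac{1}{c}\sum_j f(p^{j})$ for two hypothetical attributes of a group of $N_i = 100$ customers, assuming that $f_{\cdot,i}^{j}$ counts the customers falling in cluster $S_{\cdot,i}^{j}$. Attribute a forms two dense clusters, while attribute b is scattered over twenty small ones; the function name and the toy cluster sizes are illustrative only.

import numpy as np

def attribute_entropy(cluster_sizes):
    # E_i(.) = (1/c) * sum_j f(p_j), with p_j = f_j / (2 * N_i) as in Eqs. (29)-(30)
    # and f(x) = -[x*log2(x) + (1-x)*log2(1-x)] as in the proof above.
    N_i = sum(cluster_sizes)          # group size (sum of cluster frequencies)
    c = len(cluster_sizes)            # number of clusters built on this attribute
    p = np.asarray(cluster_sizes, dtype=float) / (2.0 * N_i)
    f = -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))
    return f.sum() / c

E_a = attribute_entropy([60, 40])     # attribute a: 2 dense clusters (grouping attribute)
E_b = attribute_entropy([5] * 20)     # attribute b: 20 sparse clusters (non-grouping)
print(round(E_a, 3), round(E_b, 3))   # roughly 0.80 vs 0.17, i.e. E_i(b) << E_i(a)

The grouping attribute ends up with the markedly larger entropy, which is exactly the ordering asserted by Eq. (35).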
Appendix D. Experimentation data attributes

The data set used is composed of 18,918 customers, each described by the 23 attributes listed in Table 9, where the first column reports the attribute number, the second column the original code used for the attribute, and the third column its significance.
Appendix E. Variation of the partition entropy of all attributes in both groups of customers

See Tables 10 and 11.
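For readers who want to regenerate entropy profiles of the kind reported in Tables 10 and 11, the sketch below clusters a single attribute with a plain fuzzy c-means for C = 2, ..., 10 and records a normalized partition entropy per C. It is only an illustration under stated assumptions: the exact entropy normalization of the paper is not reproduced here, the rule of picking C_H as the C with the lowest score is inferred from the tables rather than quoted from the text, and all function names (fcm_memberships, partition_entropy, entropy_profile) are hypothetical.

import numpy as np

def fcm_memberships(x, C, m=2.0, iters=100, seed=0):
    # Plain fuzzy c-means on one attribute (1-D); returns the C x N membership matrix U.
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    centers = rng.choice(x, size=C, replace=False)
    for _ in range(iters):
        d = np.abs(x[None, :] - centers[:, None]) + 1e-12   # C x N distances to prototypes
        U = d ** (-2.0 / (m - 1.0))
        U /= U.sum(axis=0, keepdims=True)                    # memberships sum to 1 per customer
        centers = (U ** m @ x) / (U ** m).sum(axis=1)        # prototype update
    return U

def partition_entropy(U):
    # Bezdek-style partition entropy, normalized by log2(C) for comparability across C.
    C, N = U.shape
    return -(U * np.log2(U + 1e-12)).sum() / (N * np.log2(C))

def entropy_profile(x, C_min=2, C_max=10):
    scores = {C: partition_entropy(fcm_memberships(x, C)) for C in range(C_min, C_max + 1)}
    C_H = min(scores, key=scores.get)    # inferred rule: the C with the lowest entropy wins
    return scores, C_H

# A synthetic attribute whose values form two well-separated groups.
rng = np.random.default_rng(1)
x = np.r_[rng.normal(0.2, 0.02, 300), rng.normal(0.8, 0.02, 300)]
scores, C_H = entropy_profile(x)
print(C_H)                               # expected to be 2 for such a bimodal attribute

Repeating this per attribute and per customer group yields one entropy row per attribute, which is the kind of profile tabulated in Tables 10 and 11.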