Customer Profile Cluster Techniques
Keywords: Data mining; Customer profiling; Fuzzy clustering; Backpropagation neural networks; Partition entropy

Abstract

Data mining (DM) is an emerging discipline that aims at extracting knowledge from data using several techniques. DM has proved to be useful in business, where transactional data turned out to be a mine of information about customer purchase habits. Developing customer models (also called profiles in the literature) is therefore an important step for targeted marketing. In this paper, we develop an approach for customer profiling composed of three steps. In the first step, we cluster data with an FCM-based algorithm in order to extract "natural" groups of customers. An important feature of our algorithm is that it provides a reliable estimate of the real number of distinct clusters in the data set, using the partition entropy as a validity measure. In the second step, we reduce the number of attributes for each computed group of customers by selecting only the "most important" ones for that group. We use the information entropy to quantify the importance of an attribute. Consequently, as a result of this second step, we obtain a set of groups, each described by a distinct set of attributes (or characteristics). In the third and final step of our model, we build a set of customer profiles, each modeled by a backpropagation neural network and trained with the data in the corresponding group of customers. Experimental results on synthetic and large real-world data sets reveal a very satisfactory performance of our approach.

© 2009 Elsevier Ltd. All rights reserved.
1. Introduction

Marketing managers can develop long-term and pleasant relationships with customers if they can detect and predict changes in their consuming habits (or behavior). In the past, researchers generally used to apply statistical surveys to study customer behavior. However, recently, data mining techniques have been adopted (Bose & Mahapatra, 2001; Giudici & Passerone, 2002; Nasraoui, Soliman, Saka, Badia, & Germain, 2008). These techniques apply specific algorithms for pattern extraction to obtain implicit, previously unknown, and potentially useful information including knowledge rules, constraints and regularities (Chen, Hen, & Yu, 1996; Mitra, Pal, & Mitra, 2002).

Various successful applications of data mining have been reported in several areas including the Web (Borges & Levene, 2007; Nasraoui et al., 2008; Nasraroui, Spiliopoulou, Srivastava, Mobasher, & Masand, 2007), marketing (Jiang & Tuzhilin, 2006; Kim & Moon, 2006; Yeh, Dai, & Chen, 2007), finance and banking (Hsieha, 2004; Lee, Liu, & Chen, 2006; Zhang & Zhou, 2004), telecommunication (Hu, Shanker, Zhang, & Hung, 2008; Li, Shue, & Lee, 2006; Nanavati et al., 2008; Sohn & Kim, 2008), and functional genomics (Jiang, Pei, & Zhang, 2005), to highlight just a few. Naturally, the goal of using data mining varies from one area to another. In Web mining, the main goal is to personalize this information space by tailoring the user's interaction with the web information based on information about her/him (Nasraroui et al., 2007, 2008). In telecommunication, call data is used to extract customers' calling habits (Hung, Yen, & Yu Wang, 2006; Li et al., 2006; Nanavati et al., 2008), especially fraudulent behavior (Abidogun, 2005; Karahoca, Yalcin, & Kahraman, 2007); whereas network data is used to support network management functions, such as fault isolation (Nunez, Morales, & Triguero, 2002).

The main goal of customer profiling (or segmentation) is to build reliable customer models for targeted marketing campaigns and, consequently, a better profitability (Dyche, 2001). One can thus define data mining in customer profiling as the technology that allows building customer models, each describing the specific habits, needs, and behavior of a group of customers. Once discovered, these models can be used to classify new customers and thereby predict their special needs. Some of the difficulties faced by data mining techniques in this area are the amount of data available to create user models, the adequacy of the data, the noise in that data, and the necessity of capturing the imprecise nature of human behavior (Jiang & Tuzhilin, 2006). Artificial Intelligence (AI) techniques have the ability to discover "hidden" and "implicit" knowledge in large amounts of data, to process uncertainty, and to deal with noise. In addition, AI techniques have a high prediction accuracy compared to other "classical" approaches (Bose & Mahapatra, 2001). Several AI techniques were used in the literature to address this problem, such as Bayesian networks (Yi-jun, Peng, & Qiang, 2006), decision trees/association rules (Li et al., 2006), artificial neural networks (Hsieha, 2004; Hu et al., 2008; Shich, Yan, & Chen, 2008; Verù, Garcìa, Senabre, Marìn, & Franco, 2006), and support vector machines (Fung & Mangasarian, 2001; Huang, Chen, & Wang, 2007), among others.
In the sequel, we will briefly outline some of the proposed models for customer profiling, aiming at giving the novice reader some background in the field. Excellent surveys of existing models for customer profiling can be found in (Bose & Mahapatra, 2001; Chen et al., 1996; Dyche, 2001; Martinez, Chen, & Liu, 2006; Roddick & Spiliopoulou, 2002; Zhang & Zhou, 2004).

Bayesian networks were used in (Yi-jun et al., 2006) to segment customers of a Chinese commercial bank. First, the structure of the Bayesian network is learned from data in an incremental way: an arc between two nodes (or attributes) is added to the network if it improves its classification accuracy. The algorithm stops if no further improvements of the network are possible. Second, the conditional class probabilities of the network are learned from data. An experimental comparison with decision trees using two data sets of 17,000 and 47,000 customers, respectively, reveals a higher prediction accuracy of the proposed approach. Unfortunately, the efficiency of the model in handling a real-world number of classes was not discussed, and the simulated data embedded only two classes. Most importantly, training distinct Bayesian classifiers for distinct data sets coming from different regions may not be the optimal way to proceed for some companies, especially when the "spatial population drift" is not significant.

In Hsieha (2004), the authors proposed an integrated behavioral scoring model for managing credit card customers in a bank. The proposed model is composed of two steps. In the first step, a Self-Organizing Map neural network is used to identify groups based on repayment behavior, recency, frequency, and monetary behavioral scoring predictors. Three groups of customers were identified; namely, revolver, transactor, and convenience customers (Hsieha, 2004). Thereafter, an Apriori-based algorithm was used, focusing on demographic and geographic attributes, in order to extract a set of customer profiles modeled in the form of association rules. Experimentation on a large customer database composed of 104,979 individuals reveals a good performance of the proposed approach. Unfortunately, the computed customer profiles are not accurate, since many discovered rules are either trivial, contain irrelevant information, or are redundant. The authors addressed this problem by proposing interactive strategies that allow the user to prune non-essential rules.

The model proposed in Hu et al. (2008) is based on neural network learning and aims at building customer profiles that relate customer characteristics (such as region, income, marital status, etc.) to their communication mode (such as long distance telephone calling, letter/card writing, etc.). The neural architecture is composed of three layers, in which the output layer is composed of a single neuron modeling the communication mode of the customer. A key feature of the proposed model and its learning algorithm is the ability to select, among the available customer characteristics, the most important ones by computing the prediction risk over the validation set (Hu et al., 2008). This prediction risk is nothing more than the average squared error of the trained network over the validation set. Experimentation on a data set composed of 3377 customers (1595 for training and 1842 for test) revealed a good prediction accuracy of the neural model compared to logistic regression (Hu et al., 2008). In this model, several network topologies are trained and evaluated for every possible subset of features, and the best network architecture is selected as that producing the least prediction risk. Naturally, the computational time could be unacceptable, especially for data sets with a large number of attributes, which is often the case.

In Shich et al. (2008), the authors proposed an Adaptive Resonance Theory (ART2) neural network for product redesign based on customer requirements. The proposed model involves two phases. In the first phase, customer requirements are expressed through picture sorting; while in the second phase, ART2 neural networks are trained with the collected data to capture customer profiles. The model was applied to mobile hand phone design using a data set of 100 multi-cultural customers. Five customer profiles were discovered by the network, as well as the dominant characteristics for each profile (such as gender and education level). Unfortunately, the scalability and efficiency of the model on real-world large data sets were not discussed.

The model proposed in Setnes and Kaymak (2001) aims at developing customer preference profiles by fuzzy logic concepts. The proposed model consists of two steps. In the first step, a hierarchical top-down fuzzy clustering algorithm called the Extended Fuzzy C-Means (E-FCM) is used to identify groups in customer data. In the second step, the computed clusters are represented by a set of association rules as follows. The Fuzzy Respondent Density measure (FRD) of each cluster is computed (Setnes & Kaymak, 2001). Then, each cluster is represented by an association rule whose antecedent is the center of that cluster and whose consequent is its FRD value. The model was tested on a data set of 15,486 customers, each described by 49 features and obtained from a financial services company. Simulation results reveal a good performance of the proposed model compared to existing statistical approaches, especially in targeting customers, where the target improvement varied between 15% and 20%. However, one major shortcoming of the proposed E-FCM clustering algorithm is that there is no criterion to evaluate the optimal number of clusters (or groups of customers). In addition, there is no simple and automatic method for the validation of the discovered if-then rules; actually, it is up to the analyst to prune non-essential ones (redundant, incoherent or trivial). A similar model was proposed in Sohn and Kim (2008) in the telecommunication industry. In Sohn and Kim (2008), customers are first segmented using several K-means and Self-Organizing Maps clustering models: mainly eight clusters were identified. Second, clusters are transformed into association rules using an Apriori-based algorithm.

In this paper, we develop an approach for customer profiling composed of three steps. In the first step, we cluster data with a fuzzy clustering algorithm in order to extract "natural" groups of customers. An important feature of our algorithm is that it provides a reliable estimate of the real number of distinct clusters in the data set using the partition entropy as a validity measure. In the second step, we select for each computed cluster the "most important" attributes (or characteristics). The importance of an attribute is based on its entropy. Therefore, as a result of this step, we obtain a set of distinct groups, each described by a distinct set of attributes judged to be the most important. In the third and final step of our model, we construct user profiles using backpropagation neural networks trained by the data in the computed groups. The rest of this paper is structured as follows. In Section 2, we detail each step of our model and illustrate it on a synthetic data set. Section 3 presents experimentation on real-world large data sets, while the last section offers concluding remarks and sheds light on future research directions.

2. Our proposal

Our approach is composed of three steps and is summarized in Fig. 1. In the first step, we use fuzzy clustering to identify groups of customers. The number of groups is computed automatically using the partition entropy as a validity measure: the optimal number of clusters is that producing the lowest partition entropy (Bezdek, 1981).
Fig. 1. Our model for customer profiling.

Each computed cluster describes a group of customers whose consuming habits, incomes, etc., are similar. In the second step, we proceed to a dimensionality reduction of each group by keeping only those attributes that are the "most informative" for that group. We stress the fact that our goal in this second step is to describe each group with its most important attributes. In fact, an attribute may be important for a given group of customers, whereas the same attribute could be of no importance for other groups. For this, we use the information entropy to rank attributes and judge their importance. In the third and final step of our model, we develop a neural network classifier for each reduced cluster using backpropagation (Rumelhart, Hinton, & Williams, 1986). Hence, each backpropagation network is trained with the customers' data in the corresponding cluster in order to extract the hidden and useful knowledge. As a result of the whole model, we obtain a set of customer profiles, each described by a distinct subset of attributes and encoded in the weights of a trained backpropagation neural network. Consequently, profiling of new customers based on their known attributes becomes a trivial task (Widrow & Lehr, 1992). In the following subsections, we will detail each step of our approach.

2.1. First step: fuzzy clustering

Clustering is a process for grouping a set of objects into classes (or clusters) so that the objects in a cluster have high similarity, but are very dissimilar to objects in other clusters (Baraldi & Blonda, 1999; Bezdek, 1981). Several clustering algorithms were proposed in the literature (Abeantes & Marques, 1996; Hoppner, Klawonn, Kruse, & Runkler, 1999; Looney, 2002; Pal, Keller, & Bezdek, 2005; Ying, Vijay, Praveen, & Xiaoquan, 2005). For our concern, we will use an approach based on the Fuzzy C-Means (FCM) (Bezdek, 1981), a fuzzy clustering algorithm which turned out to be effective in dealing with fuzzy and categorical data, even in the presence of noise and with missing values – see for instance (Ichihashi & Honda, 2005; Ichihashi, Honda, Notsu, & Yagi, 2007). Before going any further, we need the following notation.

Notation 1
– X = {x_1, ..., x_N} is the data set of N customers;
– c is the number of clusters and m > 1 is the fuzzifier;
– U = [μ_ki] is the fuzzy partition matrix, where μ_ki is the membership degree of customer x_k in cluster i;
– m_i is the center of cluster i;
– ‖·‖_E is the Euclidean norm.

FCM computes the partition U and the centers m_i by minimizing the objective function

J_m(U, m_1, \ldots, m_c) = \sum_{k=1}^{N} \sum_{i=1}^{c} (\mu_{ki})^m \, \| x_k - m_i \|_E^2,   (1)

\text{subject to } \sum_{i=1}^{c} \mu_{ki} = 1 \;\; \forall k, \qquad \mu_{ki} \in [0, 1].   (2)

Minimization of J_m in Eq. (1) derives the following equations for the membership degrees and cluster centers (Bezdek, 1981):

\mu_{ki} = \left[ \sum_{j=1}^{c} \left( \frac{\| x_k - m_i \|_E}{\| x_k - m_j \|_E} \right)^{2/(m-1)} \right]^{-1} \quad \forall i, k,   (3)

m_i = \frac{\sum_{k=1}^{N} (\mu_{ki})^m \, x_k}{\sum_{k=1}^{N} (\mu_{ki})^m} \quad \forall i.   (4)

Unfortunately, in FCM, as in most existing fuzzy clustering algorithms, the number of groups c is assumed to be a user-defined parameter, which is still a hard task to set in practical applications. One possible way consists in computing the optimal number of clusters using the partition entropy as a validity measure (Bezdek, 1981). It is defined, for a given number of clusters c and a fuzzy partition U, as follows:

H(U, c) = -\frac{1}{N} \sum_{k=1}^{N} \sum_{i=1}^{c} \mu_{ki} \log \mu_{ki}.   (5)

It is easy to see from Eq. (5) that the best partition U, in which the boundaries between clusters are "clear", will produce the lowest partition entropy.

Our approach is summarized in Algorithm 1. It accepts as input the data set X, and C_min and C_max, representing respectively the expected minimal and maximal numbers of clusters in X. Our algorithm runs FCM (Bezdek, 1981) for every possible value of c ∈ [C_min, C_max], and the optimal number of clusters c* (resp. the best partition U*) is that producing the lowest partition entropy. At this stage, we should make the following remarks. Naturally, our clustering approach in Algorithm 1 will be computationally expensive if the interval [C_min, C_max] is large. Therefore, it is up to the analyst to use any prior knowledge about the data set to fix this interval as narrow as possible. We should also notice the use of the fuzzifier m in Eqs. (1)–(4). Unlike the original FCM model, we use a "simulated annealing"-like scheme for m, as described in (Romdhane, Ayeb, & Wang, 2002), which adds more stability to the algorithm and speeds up convergence – further details can be found in (Romdhane et al., 2002).

As a final result of this first step, we get a set of clusters, each describing a distinct group of customers sharing the same characteristics. Subsequently, we will proceed to a reduction of the number of attributes, keeping only the most "informative" ones for each group of customers.
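To make this first step concrete, the following Python sketch reproduces the core of Algorithm 1 under stated assumptions: it uses the standard FCM updates of Eqs. (3)–(4) with a fixed fuzzifier m (whereas the paper anneals m following Romdhane et al., 2002), scores each candidate c with the partition entropy of Eq. (5), and keeps the lowest-entropy partition. Function and parameter names are illustrative, not taken from the paper.

```python
# Minimal sketch of the first step (Algorithm 1): FCM for every candidate c,
# then model-order selection by partition entropy. Names are illustrative.
import numpy as np

def fcm(X, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Plain Fuzzy C-Means; returns the membership matrix U (N x c) and the centers."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    U = rng.random((N, c))
    U /= U.sum(axis=1, keepdims=True)            # memberships of each point sum to 1
    for _ in range(max_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]          # Eq. (4)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                                    # avoid division by zero
        U_new = 1.0 / (d ** (2.0 / (m - 1.0)))
        U_new /= U_new.sum(axis=1, keepdims=True)                # Eq. (3)
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return U, centers

def partition_entropy(U):
    """H(U, c) of Eq. (5): lower means crisper, better separated clusters."""
    return -np.mean(np.sum(U * np.log(np.fmax(U, 1e-12)), axis=1))

def select_optimal_partition(X, c_min, c_max):
    """Run FCM for every c in [c_min, c_max] and keep the lowest-entropy partition."""
    best = None
    for c in range(c_min, c_max + 1):
        U, centers = fcm(X, c)
        h = partition_entropy(U)
        if best is None or h < best[0]:
            best = (h, c, U, centers)
    return best  # (entropy, c*, U*, centers*)
```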
Definition 1 (Important/unimportant attribute). Let g_i be a group of customers. An attribute a_b is said to be important for g_i if it is a grouping characteristic of the customers in g_i; i.e., these customers share similar values for a_b. Ideally, all customers in g_i will have exactly the same value for a_b. Conversely, an attribute a_b is unimportant for g_i if the customers in g_i do not share similar values for a_b.

In order to compute the set of "similar values" of an attribute a_b in a group g_i, we will use the same clustering approach of Algorithm 1 with the following interpretations: (1) the set of data points input to the algorithm is composed of all values taken by a_b in g_i; (2) a cluster computed by that algorithm models values that are very similar to each other, but very dissimilar to the values in other clusters. But, before going any further, we need the following notation.

Notation 2
– Ω is the original set of descriptive attributes;
– N_i is the size of group g_i (number of customers);
– D_{a,i} = {S^1_{a,i}, ..., S^{c_{a,i}}_{a,i}} is the set of clusters computed by Algorithm 1 when applied to attribute a w.r.t. g_i, with c_{a,i} being the number of clusters;
– f^j_{a,i} is the size (number of values) of cluster S^j_{a,i} (j = 1, ..., c_{a,i}).

¹ In this paper, we will use the words "informative" and "important" interchangeably.

Theorem 1 simply states that the number of clusters produced by an important attribute is much smaller than the number of clusters produced by an unimportant attribute.

In order to quantify the importance of an attribute w.r.t. a group of customers, we will use the information entropy as an evaluation criterion. But, first, we need to define the "probability of occurrence" (or normalized frequency) of the values of an attribute a w.r.t. g_i by:

p^j_{a,i} = \frac{1}{2} \, \frac{f^j_{a,i}}{N_i} \quad \forall S^j_{a,i} \in D_{a,i},   (8)

where f^j_{a,i} is the density (or number of customers) of cluster S^j_{a,i}. Now, we will forward the following theorem.

Theorem 2. Given an attribute a and a group g_i, we have:

\sum_{j=1}^{c_{a,i}} p^j_{a,i} = \frac{1}{2},   (9)

\frac{1}{2N_i} \le p^j_{a,i} \le \frac{1}{2} \quad \forall S^j_{a,i} \in D_{a,i}.   (10)

The proof of Theorem 2 is reported in Appendix A. Now, we are ready to define the information entropy of an attribute a w.r.t. a group of customers g_i by:

E_i(a) = -\frac{1}{c_{a,i}} \sum_{j=1}^{c_{a,i}} \left[ p^j_{a,i} \log_2 p^j_{a,i} + \left(1 - p^j_{a,i}\right) \log_2 \left(1 - p^j_{a,i}\right) \right],   (11)

where p^j_{a,i} is computed by Eq. (8).
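The following short sketch (ours, with illustrative names) computes the probability of occurrence of Eq. (8) and the attribute entropy of Eq. (11) directly from the sizes of the value clusters of an attribute, assuming those clusters were obtained beforehand (e.g., with the model-order selection sketched above).

```python
# Sketch of Eqs. (8) and (11): attribute entropy from the value-cluster sizes.
import numpy as np

def occurrence_probabilities(cluster_sizes, group_size):
    """Eq. (8): p^j = (1/2) * f^j / N_i for every value cluster S^j of the attribute."""
    f = np.asarray(cluster_sizes, dtype=float)
    return 0.5 * f / group_size              # the p^j sum to 1/2 (Theorem 2)

def attribute_entropy(cluster_sizes, group_size):
    """Eq. (11): average binary entropy of the p^j; lies in (0, 1] (Theorem 3)."""
    p = occurrence_probabilities(cluster_sizes, group_size)
    h = -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))
    return h.mean()

# Example: 100 customers whose values fall into two clusters of sizes 70 and 30
# (an "important" attribute: few, large clusters) versus 100 singleton clusters.
print(attribute_entropy([70, 30], 100))   # ~0.77 (high entropy, important)
print(attribute_entropy([1] * 100, 100))  # ~0.05 (low entropy, unimportant)
```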
Now, we will forward the following theorems regarding the entropy of an attribute.

Theorem 3. Given a group of customers g_i, we have:

0 < E_i(a) \le 1 \quad \forall a \in \Omega.   (12)

The proof of Theorem 3 is reported in Appendix B.

Theorem 4. Let a, b ∈ Ω be two attributes, and let g_i be a group of customers. Let us assume that a is important for g_i and that b is unimportant for g_i; then we have:

E_i(b) \ll E_i(a).   (13)

The proof of Theorem 4 is reported in Appendix C. This theorem simply states that the entropy of an attribute in a group is proportional to its importance in that group. Now, we will define the "importance ratio" of an attribute in a group by:

I_i(a) = \frac{E_i(a)}{E(g_i)},   (14)

where E_i(a) is computed by Eq. (11), and E(g_i) is the entropy of group g_i, defined as the summation of the entropies w.r.t. g_i of all attributes and given by:

E(g_i) = \sum_{a \in \Omega} E_i(a),   (15)

where Ω is the set of attributes. The "importance ratio" of an attribute obeys the following corollary:

Corollary 3. The importance of an attribute a ∈ Ω for a group g_i increases as its information entropy E_i(a) increases, and vice versa.

Corollary 3 is straightforward from Eq. (14) and Theorem 4. The most important attributes for a group g_i are computed using Algorithm 2. The core idea of this algorithm is to select only those attributes whose "importance ratio" computed with Eq. (14) is greater than a fixed threshold ρ. The ratio ρ is a user-specified parameter and reflects the "degree of vigilance": as ρ approaches 1, only the most relevant attributes will be selected. At this stage, we should notice that there is actually no simple way for the automatic selection of ρ. A similar problem exists in other dimensionality reduction techniques (Dash et al., 1997). Finally, notice that Algorithm 2 is applied to each computed group of customers. Consequently, as a result of this second step, we obtain a set of heterogeneous groups, each described by a distinct set of attributes judged to be the most informative. In the next section, we will describe our learning approach for the extraction of the "knowledge" embedded in each group.

In this third and final step, our goal is to extract the knowledge embedded in each group (or cluster) using artificial neural network learning. Several learning techniques were proposed in the literature, including decision trees (Gehreke, 2003), association rules (Giudici & Passerone, 2002; Mitra et al., 2002), and Self-Organizing Feature Maps (Hung & Tsai, 2008; Kuoa, Ana, Wanga, & Chungbi, 2006), to highlight just a few. Further details can be found in the specialized literature – see for example (Gehreke, 2003; Mitra et al., 2002).

For our concern, we will use backpropagation neural networks (Rumelhart et al., 1986) in order to extract the most important characteristics of each group of customers. This choice is motivated by their impressive results in several areas such as pattern recognition (Haykin, 1994; Widrow & Lehr, 1992). We will use a distinct backpropagation network for each cluster. Remember that each cluster is described by a distinct set of attributes judged to be the most important for it. Each backpropagation network is composed of three layers: input, output and hidden. For a cluster C_k in which customers are described by b_k attributes, we will use b_k neurons in the input layer, b_k sigmoid neurons in the output layer, and b_k/2 neurons in the hidden layer. While the number of neurons in the input and output layers is fixed by the number of attributes in the learning data set, the choice of the optimal number of hidden neurons is still an NP-hard problem in backpropagation learning (Blum & Rivest, 1992). The general rule of thumb is that a large number of hidden neurons will lead to overtraining and thereby poor generalization, whereas a small number will lead to poor learning – further details can be found in (Haykin, 1994; Zhanga, Mab, & Yang, 2003). In our case, for a cluster C_k in which customers are described by b_k attributes, we found in practice that b_k/2 hidden neurons are sufficient. Each backpropagation network was trained using the data in the corresponding cluster with the generalized delta rule with momentum (Haykin, 1994):

\Delta W_{i,j}(t+1) = \eta \, \delta_j(t+1) \, y_i(t+1) + \alpha \, \Delta W_{i,j}(t),   (16)

where t is the iteration (time) counter; w_{i,j} is the connection weight from neuron nr_i to neuron nr_j in two consecutive layers; y_i is the output signal of nr_i; δ_j is the "learning error" of nr_j; and η and α are two small constants (∈ ]0, 1[) representing the learning and momentum rates, respectively. In Eq. (16), the α-term is the momentum term used to avoid the oscillation of the algorithm between two local minima (Haykin, 1994). The distinct connections of the network are adjusted according to Eq. (16) until the sum of squared errors over all training samples becomes negligible.

Finally, we obtain a set of trained backpropagation networks, each representing a distinct group of customers (or a profile). These profiles will be very useful in categorizing new customers as follows. We present a customer, say x, to each trained backpropagation network, say net_i, and measure the error as follows:

ER_i(x) = \frac{\| y_i(x) - x \|_E}{\| x \|_E} \quad \forall net_i.   (17)

Here ‖·‖_E is the Euclidean distance and y_i(x) is the output of network net_i when x is fed into it. Now, we define the global error of our model w.r.t. customer x as:

ER(x) = \min_i ER_i(x).   (18)
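The sketch below illustrates this third step under our own assumptions: one three-layer network per cluster with b_k inputs, b_k/2 hidden units and b_k sigmoid outputs, trained with the momentum update of Eq. (16) to reproduce the customers of its cluster, and the matching rule of Eqs. (17)–(18). The sigmoid hidden activation, the omitted bias terms, the data scaling and the class/function names are assumptions of this sketch, not specifications from the paper.

```python
# Sketch of the third step: one backpropagation "profile" network per cluster,
# trained with the generalized delta rule with momentum (Eq. (16)), and the
# reconstruction-error matching of Eqs. (17)-(18). Customer vectors are assumed
# scaled to [0, 1] so that sigmoid outputs can reproduce them.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ProfileNet:
    def __init__(self, n_attr, eta=0.1, alpha=0.9, seed=0):
        rng = np.random.default_rng(seed)
        n_hid = max(1, n_attr // 2)                       # b_k / 2 hidden neurons
        self.W1 = rng.normal(0.0, 0.1, (n_attr, n_hid))   # input -> hidden weights
        self.W2 = rng.normal(0.0, 0.1, (n_hid, n_attr))   # hidden -> output weights
        self.dW1 = np.zeros_like(self.W1)                 # previous updates (momentum)
        self.dW2 = np.zeros_like(self.W2)
        self.eta, self.alpha = eta, alpha

    def forward(self, x):
        h = sigmoid(x @ self.W1)
        return h, sigmoid(h @ self.W2)

    def train(self, X, epochs=200):
        for _ in range(epochs):
            for x in X:
                h, y = self.forward(x)
                # generalized delta rule with momentum, Eq. (16)
                delta_out = (x - y) * y * (1.0 - y)               # output "learning errors"
                delta_hid = (self.W2 @ delta_out) * h * (1.0 - h)
                self.dW2 = self.eta * np.outer(h, delta_out) + self.alpha * self.dW2
                self.dW1 = self.eta * np.outer(x, delta_hid) + self.alpha * self.dW1
                self.W2 += self.dW2
                self.W1 += self.dW1

def relative_error(net, x):
    """Eq. (17): normalized reconstruction error of customer x under one profile."""
    _, y = net.forward(x)
    return np.linalg.norm(y - x) / max(np.linalg.norm(x), 1e-12)

def assign_profile(nets, x, threshold=0.15):
    """Eqs. (17)-(18): pick the profile with the smallest error, or none if too large."""
    errors = [relative_error(net, x) for net in nets]
    i = int(np.argmin(errors))
    return (i, errors[i]) if errors[i] <= threshold else (None, errors[i])
```

The error threshold of 0.15 mirrors the value used in the experiments of Section 3; any other rejection rule could be substituted.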
[...] summarized in Table 1, where the first column reports the group, the second column its important attributes, and the third column its unimportant ones.
Fig. 2. A synthetic data set of 600 customers categorized into three groups (G1, G2, G3).
Fig. 3. The projection of the synthetic data set w.r.t. each attribute.
Table 2
Characteristics of the computed optimal partition.

Group    Size    Density (%)
G1       206     45.78
G2       121     26.89
G3       123     27.33

Table 3
Computed clusters for the attributes A1, A2, and A3 in groups G1, G2, and G3.

3. Experimental results

In this section, we will conduct an experimental analysis of our model using large real-world data sets. For this, we considered a data set used at the Data Mining Cup conference 2004 (DMC04, 2004). It is composed of 18,918 customers: 15,000 will be used for training (or clustering), and 3918 for testing (or prediction). Each customer is described by 23 attributes (categorical, numerical, and boolean). These attributes include address-based and transactional information; a detailed description is outlined in Appendix D of this paper. The data is about the return of purchases and was gathered by the mail order company Quelle AG. Nowadays, customers are encouraged by the friendly cost-free return of purchases to send back merchandise they have ordered.
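As an illustration of how the three steps could be wired together on such a training set, the sketch below reuses the hypothetical helpers introduced in the earlier sketches (select_optimal_partition, attribute_entropy, ProfileNet); value_cluster_sizes stands for clustering the values of one attribute with Algorithm 1 and returning the resulting cluster sizes. None of these names, nor the preprocessing, come from the original study.

```python
# End-to-end sketch of the pipeline on a preprocessed training matrix X_train
# (rows = customers, columns = the 23 attributes, scaled to [0, 1]).
import numpy as np

def run_pipeline(X_train, c_min=2, c_max=5, rho=0.03):
    # Step 1: fuzzy clustering of the training customers
    _, c_star, U, _ = select_optimal_partition(X_train, c_min, c_max)
    labels = U.argmax(axis=1)                  # hard assignment used for the later steps

    profiles = []
    for g in range(c_star):
        G = X_train[labels == g]
        # Step 2: entropy-based importance of every attribute inside group g
        ent = np.array([attribute_entropy(value_cluster_sizes(G[:, a]), len(G))
                        for a in range(G.shape[1])])
        ratios = ent / ent.sum()               # importance ratios, Eq. (14)
        kept = np.where(ratios >= rho)[0]      # Algorithm 2 with threshold rho (3% here)
        # Step 3: one backpropagation network per reduced group
        net = ProfileNet(len(kept))
        net.train(G[:, kept])
        profiles.append((kept, net))
    return profiles
```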
Table 4
Prediction accuracy of our model – CC: correctly classified; MS: misclassified; UR: unrecognized.

                  CC     MS      UR
# Cases           90     14      46
Percentage (%)    60     9.35    30.65

Doubtless, the cost of the return management for mail order companies is considerable, and making a prognosis of customer return rates becomes a necessity for customer value analysis. The used data set is composed of two categories of customers: the "low-rate return" category and the "high-rate return" category – further details can be found in (DMC04, 2004).

In the first step, we have clustered the 15,000 training customers using Algorithm 1. Fig. 9 plots the variation of the partition entropy with the number of groups over the interval [C_min, C_max] = [2, 5]. Clearly, the optimal partitioning is that for two groups, since it has the lowest partition entropy. Hence, our Algorithm 1 was able to discover the "natural structure" of the training data, since it computed the real number of groups: "low-rate return" and "high-rate return" customers.

Fig. 9. Variation of the partition entropy w.r.t. the number of clusters.

In the second step, we have selected the most important attributes for each group as follows. First, remember that, given a group of customers, we have to cluster the values of each attribute using Algorithm 1 in order to find clusters of "similar values". Tables 5 and 6 summarize the clustering of all attributes w.r.t. the first and the second groups, respectively. The first column reports the attribute, and the second column the optimal number of clusters. This optimal number corresponds to that producing the lowest partition entropy: the variation of the partition entropy of all attributes in each group is reported in Tables 10 and 11 in Appendix E of this paper. The third column of Tables 5 and 6 reports the densities of the computed clusters, and the final column the entropy of the attribute computed with Eq. (11). These entropy values are used to derive the importance ratios with Eq. (14). Results are summarized in Table 7, where we report the attribute and its importance ratios for the "low-rate return" and "high-rate return" groups. For illustration purposes, rows in Table 7 are sorted w.r.t. the importance ratio in ascending order. We remark that the importance of an attribute varies from one group to another. For example, attribute a7, which is the "percent of foreigner households in postcode area", has an importance of 3.74% in the first group and of 4.54% in the second group. This implicitly means that the "percent of foreigner households in postcode area" is much more important for "high-rate return" customers than for "low-rate return" ones. Using an arbitrary importance threshold of 3% drops the following sets of attributes: Ω1 = {a17, a3, a23, a2, a1} for the first group, and Ω2 = {a3, a17, a23, a6, a16, a2, a1} for the second group – see Table 7. Stated otherwise, Ω1 and Ω2 model the sets of unimportant attributes for the first and second groups, respectively. Consequently, Ψ1 = {a6, a16, a7, a15, a18, a19, a12, a20, a9, a21, a4, a22, a11, a14, a8, a10, a13, a5} is the set of important attributes for the first group of customers, and Ψ2 = {a15, a19, a12, a18, a20, a7, a9, a21, a11, a22, a14, a8, a10, a13, a4, a5} is the set of important attributes for the second group of customers. Now, let us forward the following remarks. First, the address information a6 ("postal code") and the transactional information a16 ("number of female persons in household") are important for the "low-rate return" category of customers, whereas they are unimportant for the "high-rate return" category. Stated otherwise, the postal code and the number-of-females information are significant for the "low-rate return" category, whereas they are of much less significance for the "high-rate return" one.
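A small worked check of this 3% cut-off, using a handful of the first-group importance ratios reported in Table 7 (values in percent); purely illustrative.

```python
# Applying the 3% importance threshold to a subset of the Table 7 ratios (group 1).
ratios_group1 = {"a17": 1.33, "a3": 1.48, "a23": 1.75, "a2": 2.20, "a1": 2.77,
                 "a6": 3.63, "a16": 3.68, "a7": 3.74, "a5": 7.76}
threshold = 3.0
unimportant = [a for a, r in ratios_group1.items() if r < threshold]
important = [a for a, r in ratios_group1.items() if r >= threshold]
print(unimportant)  # ['a17', 'a3', 'a23', 'a2', 'a1']  -> the reported set for group 1
print(important)    # ['a6', 'a16', 'a7', 'a5']         (only the subset listed above)
```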
Table 5
Computed clusters for all attributes in the first group of customers ("low-rate return").

Table 6
Computed clusters for all attributes in the second group of customers ("high-rate return").
Table 7
The importance of the attributes in the first and second groups.

First group ("low-rate return")      Second group ("high-rate return")
Att.    I (%)                        Att.    I (%)
a17     1.33                         a3      1.52
a3      1.48                         a17     1.57
a23     1.75                         a23     1.67
a2      2.20                         a6      2.59
a1      2.77                         a16     2.62
a6      3.63                         a2      2.76
a16     3.68                         a1      2.92
a7      3.74                         a15     3.94
a15     3.92                         a19     3.95
a18     3.94                         a12     4.10
a19     3.98                         a18     4.13
a12     4.03                         a20     4.14
a20     4.32                         a7      4.54
a9      4.37                         a9      4.59
a21     4.70                         a21     4.69
a4      5.76                         a11     5.35
a22     5.81                         a22     5.69
a11     5.85                         a14     6.23
a14     6.13                         a8      6.28
a8      6.26                         a10     6.29
a10     6.29                         a13     6.30
a13     6.29                         a4      6.32
a5      7.76                         a5      7.79

Table 8
Prediction accuracy of our model on the test set – CC: correctly classified; MS: misclassified; UR: unrecognized.

                  CC       MS       UR
# Cases           1464     1459     995
Percentage (%)    37.37    37.24    25.39

Second, we observe in Table 7 that a5 ("having a phone") is the most important attribute for both groups. Consequently, the fact of having (or not) a phone is fundamental in the considered case. This seems natural, since it is easier for customers having phones (the "high-rate return" category) to notify the mail company (by phone) of the return of goods than for those who do not have one (the "low-rate return" category). Similar remarks could be derived from Table 7 for other attributes of less importance.

Keeping for each group only the most important attributes, we have proceeded in the third step to the construction of the two customer profiles using backpropagation neural networks trained with the data in the corresponding group. For the "low-rate return" group, we used a backpropagation network of three layers: 18 input neurons, nine hidden neurons, and 18 output neurons. Regarding the "high-rate return" group, we used a backpropagation network of three layers: 16 input neurons, eight hidden neurons, and 16 output neurons. Remember that, for a group described by b_k attributes, we use b_k input neurons, b_k output neurons, and b_k/2 hidden ones. In addition, we used a learning rate η = 0.1 and a momentum α = 0.9.

In order to test the performance of the constructed models, we used Algorithm 3 with a test set of 3918 customers previously unknown to the system and an arbitrary error threshold of 0.15. Simulation results are summarized in Table 8. We notice that for 2923 customers (i.e., in 74.60% of simulations) the model was able to recognize the profile (or category), and for 995 customers (i.e., in 25.39% of simulations) the system was unable to recognize their profile. Among the recognized profiles, 1464 (or 37.37%) were assigned to the exact profile, and 1459 (or 37.24%) were misclassified. These results could be considered very satisfactory, since the system showed a good recognition rate despite the fact that the test customers were previously unknown. This could be of great profit to mail order companies for the prognosis of customer return rates.
4. Concluding remarks and future research

Customer profiling consists in determining a set of customer models that can be used for predicting the behavior of new customers, which is very useful for targeted marketing. In this paper, we proposed a model for customer profiling based on fuzzy clustering and backpropagation neural networks. Our model is composed of three steps. In the first step, we cluster customers using an FCM-based approach. The number of clusters is computed from the data set at hand using the partition entropy as a validity measure. In the second step, we select for each group the "most important" attributes. For this, we use the same clustering approach but applied to the values of an attribute: similar values will be clustered together and can thereby be considered as being the same. Thereby, we compute the entropy of an attribute w.r.t. the given group of customers in order to quantify its importance for that group. In the third and final step of our approach, we develop a set of backpropagation neural networks, each corresponding to a group of customers and trained using the data in that group. Each trained backpropagation network corresponds to a distinct customer profile and is useful for predicting the profile of new customers. Experimental results on a real-world large database of a mail order company are very encouraging and should stimulate future research. First, we have to find a simple way to estimate the range ([C_min, C_max]) of the optimal number of clusters using any prior knowledge about the data set. In fact, the computational time could be unacceptable if this interval is very large. Second, we have to develop an algorithm for representing the "knowledge" learned by each backpropagation neural network in a more understandable form, for example as a set of IF–THEN rules. These two last points constitute our immediate focus.

Appendix A. Proof of Theorem 2

For the sake of readability, let us recall that p^j_{a,i} is defined by:

p^j_{a,i} = \frac{1}{2} \, \frac{f^j_{a,i}}{N_i} \quad \forall S^j_{a,i} \in D_{a,i}.   (19)

Using Eq. (19), we get:

\sum_{j=1}^{c_{a,i}} p^j_{a,i} = \frac{1}{2N_i} \sum_{j=1}^{c_{a,i}} f^j_{a,i}.   (20)

At this stage, we should recall that f^j_{a,i} is the number of values in cluster S^j_{a,i}. Consequently, we have:

\sum_{j=1}^{c_{a,i}} f^j_{a,i} = N_i.   (21)

Using Eqs. (20) and (21), we get:

\sum_{j=1}^{c_{a,i}} p^j_{a,i} = \frac{1}{2N_i} \, N_i = \frac{1}{2}.   (22)

This proves the first part of the theorem, which is Eq. (9). Regarding the second part, which is Eq. (10), we have to distinguish two limiting cases:

1st case: D_{a,i} = {S_{a,i}} is composed of a single cluster. This happens when all values of attribute a w.r.t. g_i are exactly the same. Naturally, the number of values in S_{a,i} is equal to N_i (the number of individuals in g_i); i.e., f_{a,i} = N_i. Consequently, we have:

p_{a,i} = \frac{1}{2} \, \frac{f_{a,i}}{N_i} = \frac{1}{2}.   (23)

2nd case: D_{a,i} = {S^1_{a,i}, ..., S^{N_i}_{a,i}} is composed of N_i clusters, each cluster embedding a single distinct value; i.e., f^j_{a,i} = 1 (j = 1, ..., N_i). Then, we have:

p^j_{a,i} = \frac{1}{2N_i} \quad \forall j.   (24)

Using Eqs. (23) and (24), we can state that, in the general case, we have:

\frac{1}{2N_i} \le p^j_{a,i} \le \frac{1}{2} \quad \forall i, j.   (25)

Appendix B. Proof of Theorem 3

It is known that f(x) = -[x \log_2(x) + (1 - x)\log_2(1 - x)] is an increasing function over the interval ]0, 1/2], and that f(x) ∈ [0, 1] for all x ∈ ]0, 1/2]. Using Theorem 2, which states that 1/(2N_i) ≤ p^j_{a,i} ≤ 1/2, we can say that:

0 < f\left(p^j_{a,i}\right) \le 1 \quad \forall j,   (26)

\Rightarrow \; 0 < \frac{1}{c_{a,i}} \sum_{j=1}^{c_{a,i}} f\left(p^j_{a,i}\right) \le 1,   (27)

\Rightarrow \; 0 < E_i(a) \le 1.   (28)
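As a quick numerical illustration of Theorems 2 and 3 (added here for illustration, not part of the original appendix), consider a group of N_i = 10 customers whose values of an attribute a fall into two clusters of sizes 6 and 4:

p^1_{a,i} = \tfrac{1}{2}\cdot\tfrac{6}{10} = 0.30, \qquad p^2_{a,i} = \tfrac{1}{2}\cdot\tfrac{4}{10} = 0.20, \qquad p^1_{a,i} + p^2_{a,i} = \tfrac{1}{2} \;\; \text{(Eq. (9))},

E_i(a) = -\tfrac{1}{2}\big[(0.30\log_2 0.30 + 0.70\log_2 0.70) + (0.20\log_2 0.20 + 0.80\log_2 0.80)\big] \approx \tfrac{0.881 + 0.722}{2} \approx 0.80 \in \,]0, 1] \;\; \text{(Theorem 3)}.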
Table 9
Description of the attributes of the used experimental data from the DMC cup 2004.
Table 10
Evolution of the partition entropy with the number of clusters varying between C_min = 2 and C_max = 10 for the first group of customers ("low-rate return").

Att.  2  3  4  5  6  7  8  9  10  C*
a1 0.052 0.044 0.065 0.060 0.067 0.056 0.059 0.064 0.069 3
a2 0.048 0.051 0.046 0.063 0.063 0.071 0.058 0.058 0.058 4
a3 0.038 0.036 0.048 0.042 0.034 0.017 0.022 0.012 0.007 10
a4 0.041 0.054 0.061 0.065 0.068 0.070 0.071 0.072 0.073 2
a5   C* = 1: all customers in this group have the same value for this attribute
a6 0.048 0.047 0.046 0.046 0.047 0.054 0.048 0.050 0.046 5
a7 0.052 0.050 0.060 0.058 0.063 0.063 0.062 0.065 0.069 3
a8 0.043 0.058 0.066 0.069 0.072 0.076 0.073 0.075 0.076 2
a9 0.014 0.023 0.032 0.032 0.045 0.048 0.052 0.055 0.060 2
a10 0.038 0.050 0.059 0.059 0.065 0.063 0.068 0.069 0.069 2
a11 0.039 0.047 0.053 0.064 0.064 0.064 0.069 0.069 0.067 2
a12 0.004 0.051 0.061 0.065 0.068 0.070 0.072 0.073 0.075 2
a13 0.054 0.062 0.069 0.074 0.066 0.075 0.070 0.073 0.069 2
a14 0.042 0.057 0.065 0.070 0.065 0.067 0.069 0.070 0.069 2
a15 0.000 0.003 0.003 0.001 0.092 0.482 0.937 0.067 0.962 2
a16 0.003 0.000 0.003 0.001 0.404 0.951 0.458 0.712 0.312 3
a17 0.044 0.043 0.053 0.060 0.060 0.040 0.054 0.046 0.030 10
a18 0.042 0.050 0.047 0.058 0.060 0.052 0.051 0.051 0.053 2
a19 0.009 0.033 0.049 0.054 0.061 0.067 0.064 0.067 0.069 2
a20 0.026 0.039 0.049 0.055 0.059 0.061 0.072 0.072 0.074 2
a21 0.037 0.042 0.051 0.063 0.061 0.068 0.067 0.070 0.070 2
a22 0.041 0.054 0.061 0.066 0.065 0.051 0.054 0.057 0.068 2
a23 0.040 0.045 0.037 0.047 0.048 0.030 0.023 0.011 0.020 9
Table 11
Evolution of the partition entropy with the number of clusters varying between C_min = 2 and C_max = 10 for the second group of customers ("high-rate return").

Att.  2  3  4  5  6  7  8  9  10  C*
a1 0.052 0.045 0.048 0.055 0.051 0.055 0.061 0.062 0.062 3
a2 0.048 0.045 0.050 0.057 0.060 0.056 0.056 0.054 0.062 3
a3 0.044 0.047 0.056 0.037 0.032 0.027 0.020 0.024 0.008 10
a4 0.041 0.055 0.061 0.066 0.068 0.070 0.072 0.073 0.072 2
a5   C* = 1: all customers in this group have the same value for this attribute
a6 0.046 0.056 0.055 0.048 0.057 0.052 0.045 0.053 0.048 8
a7 0.044 0.056 0.064 0.060 0.069 0.061 0.068 0.065 0.072 2
a8 0.044 0.061 0.066 0.073 0.072 0.074 0.074 0.073 0.076 2
a9 0.018 0.038 0.041 0.043 0.045 0.049 0.049 0.054 0.049 2
a10 0.050 0.053 0.061 0.069 0.068 0.066 0.075 0.071 0.074 2
a11 0.021 0.041 0.045 0.043 0.055 0.057 0.056 0.057 0.058 2
a12 0.002 0.050 0.054 0.061 0.061 0.068 0.068 0.067 0.074 2
a13 0.042 0.045 0.056 0.059 0.069 0.067 0.071 0.072 0.077 2
a14 0.042 0.056 0.063 0.068 0.072 0.074 0.070 0.070 0.074 2
a15 0.000 0.003 0.002 0.316 0.205 0.008 0.561 0.105 0.366 2
a16 0.003 0.017 0.003 0.001 0.838 0.523 0.577 0.151 0.094 5
a17 0.047 0.040 0.060 0.068 0.059 0.056 0.061 0.040 0.050 9
a18 0.047 0.054 0.063 0.054 0.060 0.070 0.057 0.064 0.068 2
a19 0.003 0.040 0.049 0.055 0.064 0.063 0.067 0.068 0.069 2
a20 0.019 0.039 0.055 0.055 0.061 0.063 0.068 0.065 0.069 2
a21 0.037 0.049 0.053 0.059 0.066 0.065 0.070 0.070 0.072 2
a22 0.041 0.054 0.061 0.066 0.065 0.051 0.079 0.056 0.057 2
a23 0.040 0.045 0.037 0.047 0.048 0.030 0.022 0.011 0.020 9
Appendix C. Proof of Theorem 4

\frac{1}{c_{b,i}} \sum_{j=1}^{c_{b,i}} f\left(p^j_{b,i}\right) \ll \frac{1}{c_{a,i}} \sum_{j=1}^{c_{a,i}} f\left(p^j_{a,i}\right),   (34)

\Rightarrow \; E_i(b) \ll E_i(a).   (35)

Appendix D. Experimentation data attributes

The used data set is composed of 18,918 customers, each described by 23 attributes described in Table 9, where the first column reports the attribute number, the second column the original code used for the attribute, and the third column its significance.

Appendix E. Variation of the partition entropy of all attributes in both groups of customers

See Tables 10 and 11.

References

Abeantes, A., & Marques, J. (1996). A class of constrained clustering for object boundary extraction. IEEE Transactions on Image Processing, 5, 1507–1521.
Abidogun, O. A. (2005). Data mining, fraud detection and mobile telecommunications: Call pattern analysis with unsupervised neural networks. Master's thesis, University of the Western Cape, August 2005.
Baraldi, A., & Blonda, P. (1999). A survey of fuzzy clustering algorithms for pattern recognition. IEEE Transactions on Systems, Man, and Cybernetics B, 29(6), 786–801.
Bezdek, J. C. (1981). Pattern recognition with fuzzy objective function algorithms. New York: Plenum.
Blum, A. L., & Rivest, R. L. (1992). Training a 3-node neural network is NP-complete. Neural Networks, 5(1), 117–127.
Borges, J., & Levene, M. (2007). Evaluating variable-length Markov chain models for analysis of user web navigation sessions. IEEE Transactions on Knowledge and Data Engineering, 19(4), 441–452.
Bose, I., & Mahapatra, R. K. (2001). Business data mining – A machine learning perspective. Information and Management, 39, 211–225.
Chen, M. S., Hen, J., & Yu, P. S. (1996). Data mining: An overview from a database perspective. IEEE Transactions on Knowledge and Data Engineering, 8(6), 866–883.
Dash, M., Liu, H., & Yao, J. (1997). Dimensionality reduction of unsupervised data. In Proceedings of the 9th international conference on tools with artificial intelligence, November 1997 (pp. 532–539).
DMC04. Data Mining Cup (2004). <http://www.data-mining-cup.com/2004/1213622858/>.
Dyche, J. (2001). The CRM handbook: A business guide to customer relationship management. MA, USA: Addison-Wesley.
Fodor, I. K., & Kamath, C. (2002). Dimension reduction techniques and the classification of bent double galaxies. Computational Statistics and Data Analysis, 41(1), 91–122.
Fung, G., & Mangasarian, O. L. (2001). Proximal support vector machine classifiers. In Proceedings of the 7th ACM SIGKDD international conference on KDDM (pp. 77–86). NY, USA: ACM Press.
Gehreke, J. (2003). Decision trees. In Nong Ye (Ed.), The handbook of data mining (pp. 3–25). NJ, USA: Lawrence Erlbaum Associates.
Giudici, P., & Passerone, G. (2002). Data mining of association structures to model consumer behavior. Computational Statistics and Data Analysis, 38, 533–541.
Haykin, S. (1994). Neural networks: A comprehensive foundation. NY: McMillan, Inc.
Hoppner, F., Klawonn, F., Kruse, R., & Runkler, T. (1999). Fuzzy cluster analysis: Methods for classification, data analysis and image recognition. NY: Wiley.
Hsieha, Nan-Chen (2004). An integrated data mining and behavioral scoring model for analyzing bank customers. Expert Systems with Applications, 27(4), 623–633.
Hu, M. Y., Shanker, M., Zhang, G. P., & Hung, M. S. (2008). Modeling consumer situational choice of long distance communication with neural networks. Decision Support Systems, 44(4), 899–908.
Huang, C.-L., Chen, M.-C., & Wang, C.-J. (2007). Credit scoring with a data mining approach based on support vector machines. Expert Systems with Applications, 33(4), 847–856.
Hung, C., & Tsai, C.-F. (2008). Market segmentation based on hierarchical self-organizing map for markets of multimedia on demand. Expert Systems with Applications, 34(1), 780–787.
Hung, S. Y., Yen, D. C., & Yu Wang, H. (2006). Applying data mining to telecom churn management. Expert Systems with Applications, 31(3), 515–524.
Ichihashi, H., & Honda, K. (2005). Fuzzy c-means classifier for incomplete data sets with outliers and missing values. In Proceedings of the CIMCA-IAWTIC (pp. 457–464). Washington, DC, USA: IEEE Computer Society.
Ichihashi, H., Honda, K., Notsu, A., & Yagi, T. (2007). Fuzzy c-means classifier with deterministic initialization and missing value imputation. In IEEE symposium on foundations of computational intelligence, Honolulu, Hawaii, USA, 1–5 April 2007 (Vol. 1, pp. 214–221).
Jiang, D., Pei, J., & Zhang, A. (2005). An interactive approach to mining gene expression data. IEEE Transactions on Knowledge and Data Engineering, 17(10), 1363–1378.
Jiang, T., & Tuzhilin, A. (2006). Segmenting customers from population to individuals: Does 1-to-1 keep your customers forever? IEEE Transactions on Knowledge and Data Engineering, 18(10), 1297–1311.
Karahoca, A., Yalcin, S., Kahraman, C., & Sanver, M. (2007). Fraud detection using an adaptive neuro-fuzzy inference system in telecommunication networks. In Proceedings of the 10th joint conference on information science, Utah, USA, 18–24 July 2007 (pp. 1440–1446).
Kim, Y.-H., & Moon, B.-R. (2006). Multicampaign assignment problem. IEEE Transactions on Knowledge and Data Engineering, 18(3), 405–414.
Kuoa, R. J., Ana, Y. L., Wanga, H. S., & Chungbi, W. J. (2006). Integration of self-organizing feature maps neural network and genetic K-means algorithm for market segmentation. Expert Systems with Applications, 30(2), 313–324.
Lee, C.-H. L., Liu, A., & Chen, W.-S. (2006). Pattern discovery of fuzzy time series for financial prediction. IEEE Transactions on Knowledge and Data Engineering, 18(5), 613–625.
Li, S.-T., Shue, L.-Y., & Lee, S.-F. (2006). Enabling customer relationship management in ISP services through mining usage pattern. Expert Systems with Applications, 30(4).
Looney, C. G. (2002). Interactive clustering and merging with a new fuzzy expected value. Pattern Recognition, 35(11), 2413–2423.
Martinez, E. F., Chen, S. Y., & Liu, X. (2006). Survey of data mining approaches to user modeling for adaptive hypermedia. IEEE Transactions on Systems, Man, and Cybernetics C, 36(6), 734–749.
Mitra, S., Pal, S. K., & Mitra, P. (2002). Data mining in soft computing framework: A survey. IEEE Transactions on Neural Networks, 13(1), 3–14.
Nanavati, A. A., Singh, R., Chakraborty, D., Dasgupta, K., Mukherjea, S., Das, G., et al. (2008). Analyzing the structure and evolution of massive telecom graphs. IEEE Transactions on Knowledge and Data Engineering, 20(5), 703–718.
Nasraoui, O., Soliman, M., Saka, E., Badia, A., & Germain, R. (2008). A web usage mining framework for mining evolving user profiles in dynamic web sites. IEEE Transactions on Knowledge and Data Engineering, 20(2), 202–215.
Nasraroui, O., Spiliopoulou, M., Srivastava, J., Mobasher, B., & Masand, B. M. (2007). Advances in web mining and web usage analysis. WebKDD (Vol. 4811). PA, USA: Springer.
Nunez, M., Morales, R., & Triguero, F. (2002). Automatic discovery of rules for predicting network management events. IEEE Journal on Selected Areas in Communications, 20(4), 736–745.
Pal, N. R., Keller, J. M., & Bezdek, J. C. (2005). A possibilistic fuzzy c-means clustering algorithm. IEEE Transactions on Fuzzy Systems, 13(4), 517–530.
Roddick, J. F., & Spiliopoulou, M. (2002). A survey of temporal knowledge discovery paradigms and methods. IEEE Transactions on Knowledge and Data Engineering, 14(4), 750–767.
Romdhane, L. B., Ayeb, B., & Wang, S. (2002). On computing the fuzzifier in #FLVQ: A data driven approach. International Journal of Neural Systems, 12(2), 149–157.
Rumelhart, D., Hinton, G., & Williams, R. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.
Setnes, M., & Kaymak, U. (2001). Fuzzy modeling of client preference from large data sets: An application to target selection in direct marketing. IEEE Transactions on Fuzzy Systems, 9(1), 153–163.
Shich, M.-D., Yan, W., & Chen, C.-H. (2008). Soliciting customer requirements for product redesign based on picture sorts and ART2 neural networks. Expert Systems with Applications, 34(1), 194–204.
Sohn, S. Y., & Kim, Y. (2008). Searching customer patterns of mobile service using clustering and quantitative association rules. Expert Systems with Applications, 34(2), 1070–1077.
Verù, S. V., Garcìa, M. O., Senabre, C., Marìn, A. G., & Franco, F. G. (2006). Classification, filtering, and identification of electrical customer load patterns through the use of self-organizing maps. IEEE Transactions on Power Systems, 21(4), 1672–1682.
Widrow, B., & Lehr, M. (1992). 30 years of adaptive neural networks: Perceptron, madaline, and backpropagation. In P. J. Hayes, J. A. Feldman, & D. E. Rumelhart (Eds.), Neural networks: Theoretical foundations and analysis (pp. 27–53). New York: IEEE Press.
Yeh, M.-Y., Dai, B.-R., & Chen, M.-S. (2007). Clustering over multiple evolving streams by events and correlations. IEEE Transactions on Knowledge and Data Engineering, 19(10), 1349–1362.
Yi-jun, L., Peng, Z., & Qiang, Y. (2006). Customer sample difference-oriented Bayes segmentation algorithm. In Proceedings of the international conference on management science and engineering, 5–7 October 2006 (pp. 3–7).
Ying, X., Vijay, V. R., Praveen, D., & Xiaoquan, Z. (2005). A new fuzzy clustering algorithm for optimally finding granular prototypes. International Journal of Approximate Reasoning, 40.
Zhang, D., & Zhou, L. (2004). Discovering golden nuggets: Data mining in financial application. IEEE Transactions on Systems, Man, and Cybernetics C, 34(4), 513–522.
Zhanga, Z., Mab, X., & Yang, Y. (2003). Bounds on the number of hidden neurons in three-layer binary neural networks. Neural Networks, 16, 995–1002.