2017 IEEE/ACS 14th International Conference on Computer Systems and Applications
ParallelCharMax: An Effective Maximal Frequent
Itemset Mining Algorithm Based on MapReduce
Framework
Rania Mkhinini Gahar1 , Olfa Arfaoui2 , Minyar Sassi Hidri3 , Nejib Ben Hadj-Alouane4
1,2,3,4 University of Tunis El Manar
1,2,3,4 National Engineering School of Tunis
3 Imam Abdulrahman Bin Faisal University, Dammam, Saudi Arabia
1,2,3,4 BP. 37, Le Belvèdère 1002, Tunis, Tunisia
Abstract—Nowadays, the explosive growth of data collection in business and scientific areas has created the need to analyze and mine the useful knowledge residing in these data. Recourse to data mining techniques seems inescapable in order to extract useful and novel patterns/models from large datasets. In this context, frequent itemsets (patterns) play an essential role in many data mining tasks that try to find interesting patterns in datasets. However, conventional approaches for mining frequent itemsets encounter significant challenges in the Big Data era, when computing power and memory space are limited. This paper proposes an efficient distributed frequent itemset mining algorithm, called ParallelCharMax, that is based on a powerful sequential algorithm, called Charm, and computes the maximal frequent itemsets, which are considered perfect summaries of the frequent ones. The proposed algorithm has been implemented using the MapReduce framework. The experimental component of the study shows the efficiency and performance of the proposed algorithm compared with well-known algorithms such as MineWithRounds and HMBA.

Keywords—Frequent Itemset Mining, Parallel Mining Algorithm, MapReduce, Charm.

I. INTRODUCTION

Data Mining and Knowledge Discovery in Databases (KDD) is an interdisciplinary field at the intersection of statistics, machine learning, databases, and parallel and distributed computing. It has been driven by the tremendous growth of data in all spheres of human endeavor, and by the economic and scientific need to extract useful information from the collected data. The key challenge in data mining is the extraction of knowledge from massive datasets.

Data mining refers to the overall process of discovering new patterns or building models from a given dataset. The KDD process involves many steps, including data selection, data cleaning and preprocessing, data transformation and reduction, data mining task and algorithm selection, and finally post-processing and interpretation of the discovered knowledge [2], [3]. This KDD process tends to be highly iterative and interactive.

In this context, pattern extraction is one of the most important data mining techniques. It has stimulated a great effort that has led to a variety of algorithms proposed throughout the last two decades. Much of this research has been directed toward extracting the maximal frequent patterns, since the latter can be considered perfect summaries of the frequent itemsets: they can be far less numerous than the frequent closed patterns. Moreover, the problem remains topical as new challenges arise, particularly with the emergence of mega-data (Big Data) and the development of data science.
While data mining has its roots in the traditional fields
of machine learning and statistics, the huge volume of data
today poses the most serious problem. For example, many
companies already have data warehouses in the terabyte range
(e.g., FedEx, UPS, Walmart). Similarly, scientific data is
reaching gigantic proportions (e.g., NASA space missions,
Human Genome Project). Traditional methods typically made
the assumption that the data is memory resident. This assumption is no longer tenable. Implementation of data mining
ideas in high-performance parallel and distributed computing
environments is thus becoming crucial for ensuring system
scalability and interactivity as data continues to grow in size
and complexity [4].
Parallel and distributed computing is expected to relieve current mining methods from the sequential bottleneck, providing the ability to scale to massive datasets and improving the response time. Achieving good performance on today's multiprocessor systems is a non-trivial task. The main challenges include minimizing synchronization and communication, balancing the workload, finding good data layouts and data decompositions, and minimizing disk input/output, which is especially important for data mining.
To address these challenges, we propose a new parallel algorithm for discovering maximal frequent itemsets based on the MapReduce framework.
The rest of the paper is organized as follows. Section 2 presents the basic frequent itemset and association rule mining problems. Section 3 describes related work: the main parallel and distributed techniques used to solve these problems, with a survey of the most influential algorithms proposed during the last decade. Section 4 focuses on a powerful sequential algorithm for searching maximal frequent itemsets, on which we base a new parallel distributed-memory algorithm using the MapReduce paradigm under Hadoop. Section 5 presents this parallel algorithm, called ParallelCharMax, for efficiently finding maximal frequent itemsets. Section 6 discusses computational results of the proposed algorithm. Section 7 summarizes the paper and highlights future work.
II. PRELIMINARIES

Let I be a set of items. A set X = {i1, ..., ik} ⊆ I is called an itemset, or a k-itemset if it contains k items.

A transaction over I is a couple T = (tid, J) where tid is the transaction identifier and J ⊆ I is an itemset. A transaction T = (tid, J) is said to support an itemset X ⊆ I if X ⊆ J. A transaction database D over I is a set of transactions over I. We omit I whenever it is clear from the context.

The cover of an itemset X in D consists of the set of transaction identifiers of transactions in D that support X:

cover(X, D) := {tid | (tid, J) ∈ D, X ⊆ J}.

The support of an itemset X in D is the number of transactions in the cover of X in D:

support(X, D) := |cover(X, D)|.

The frequency of an itemset X in D is the probability of X occurring in a transaction T ∈ D:

frequency(X, D) := P(X) = support(X, D) / |D|.

Note that |D| = support({}, D). We omit D whenever it is clear from the context.

An itemset is called frequent if its support is no less than a given absolute minimal support threshold σabs, with 0 ≤ σabs ≤ |D|. When working with frequencies of itemsets instead of their supports, we use a relative minimal frequency threshold σrel, with 0 ≤ σrel ≤ 1. Obviously, σabs = ⌈σrel · |D|⌉ (for instance, σrel = 0.25 on a database of 10 transactions gives σabs = 3).

A frequent itemset is called closed if none of its proper supersets has the same support; in other words, all its proper supersets have a strictly lower support.

A frequent itemset X in D is called maximal if none of its proper supersets is frequent. The set of maximal frequent itemsets is thus a subset of the set of frequent closed itemsets.

The reason for extracting the maximal frequent itemsets is that their set is generally much smaller than the set of frequent itemsets, and also smaller than the set of frequent closed itemsets. It is possible to regenerate all frequent itemsets from the set of maximal frequent itemsets, but it would not be possible to obtain their supports without scanning the database.

a) Definition 1: Let D be a transaction database over a set of items I, and σ a minimal support threshold. The collection of frequent itemsets in D with respect to σ is denoted by

F(D, σ) := {X ⊆ I | support(X, D) ≥ σ},

or simply F if D and σ are clear from the context.

b) Problem 1 (Itemset Mining): Given a set of items I, a transaction database D over I, and a minimal support threshold σ, find F(D, σ). In practice we are not only interested in the set of itemsets F, but also in the actual supports of these itemsets.

An association rule is an expression of the form X ⇒ Y, where X and Y are itemsets and X ∩ Y = {}. Such a rule expresses that if a transaction contains all items in X, then it also contains all items in Y. X is called the body or antecedent, and Y is called the head or consequent of the rule.

The support of an association rule X ⇒ Y in D is the support of X ∪ Y in D, and similarly, the frequency of the rule is the frequency of X ∪ Y. An association rule is called frequent if its support (frequency) exceeds a given minimal support (frequency) threshold σabs (σrel). Again, we will only work with the absolute minimal support threshold for association rules and omit the subscript abs unless explicitly stated otherwise.

The confidence or accuracy of an association rule X ⇒ Y in D is the conditional probability of having Y contained in a transaction, given that X is contained in that transaction:

confidence(X ⇒ Y, D) := P(Y | X) = support(X ∪ Y, D) / support(X, D).

The rule is called confident if P(Y | X) exceeds a given minimal confidence threshold γ, with 0 ≤ γ ≤ 1.

c) Definition 2: Let D be a transaction database over a set of items I, σ a minimal support threshold, and γ a minimal confidence threshold. The collection of frequent and confident association rules with respect to σ and γ is denoted by

R(D, σ, γ) := {X ⇒ Y | X, Y ⊆ I, X ∩ Y = {}, X ∪ Y ∈ F(D, σ), confidence(X ⇒ Y, D) ≥ γ},

or simply R if D, σ and γ are clear from the context.

d) Problem 2 (Association Rule Mining): Given a set of items I, a transaction database D over I, and minimal support and confidence thresholds σ and γ, find R(D, σ, γ).

Besides the set of all association rules, we are also interested in the support and confidence of each of these rules.

Note that the Itemset Mining problem is actually a special case of the Association Rule Mining problem. Indeed, if we are given the support and confidence thresholds σ and γ, then every frequent itemset X also represents the trivial rule X ⇒ {}, which holds with 100% confidence. Obviously, the support of the rule equals the support of X. Also note that for every frequent itemset I', all rules X ⇒ Y with X ∪ Y = I' hold with at least σrel confidence. Hence, the minimal confidence threshold must be higher than the minimal frequency threshold to be of any effect.
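To make these measures concrete, the following small Java sketch (our toy illustration, not from the paper) computes the support, frequency and confidence defined above over a four-transaction database; the items and values are illustrative only:

import java.util.*;

// Toy example for Section II: a database of four transactions
// over the items {bread, milk, butter}.
public class ItemsetMeasures {
    static Set<String> set(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    // support(X, D) := |cover(X, D)|, the number of transactions containing X.
    static int support(Set<String> x, List<Set<String>> db) {
        int n = 0;
        for (Set<String> t : db) if (t.containsAll(x)) n++;
        return n;
    }

    public static void main(String[] args) {
        List<Set<String>> db = Arrays.asList(
            set("bread", "milk"),
            set("bread", "butter", "milk"),
            set("butter", "milk"),
            set("bread", "butter"));

        Set<String> x = set("bread");
        Set<String> xy = set("bread", "milk"); // X u Y for the rule X => Y

        System.out.println(support(x, db));                      // 3
        // frequency(X, D) := support(X, D) / |D| = 0.75
        System.out.println((double) support(x, db) / db.size());
        // confidence(X => Y, D) := support(X u Y, D) / support(X, D) = 2/3
        System.out.println((double) support(xy, db) / support(x, db));
    }
}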
III. RELATED WORK

Data mining is an important process for knowledge discovery, extracting previously unknown and useful information from databases. The most popular data mining technique is association rule discovery, which is essentially based on frequent pattern mining.
The huge research efforts devoted to these tasks have led to a variety of sophisticated and efficient algorithms for finding frequent itemsets. Among the best-known approaches are Apriori [1], FP-Growth [5] and Eclat [6], which run on a single node.

In [7], [8], the algorithms developed for mining frequent itemsets are classified into two types: the first is the candidate-generation approach, such as the Apriori algorithm (Apriori-like); the second is the approach without candidate generation, such as the FP-growth algorithm (FP-growth-like) [9]. With the arrival of the Big Data era, however, these traditional data mining algorithms have been unable to meet Big Data analysis requirements, since size, complexity and variability are the principal challenges of mining massive data. To address this issue, different parallel mining algorithms have been proposed and implemented.

Thus, to overcome the problems of insufficient memory and computational capability when mining frequent itemsets from massive data, the traditional algorithms have been converted into MapReduce tasks. Consequently, two types of algorithms have appeared: Apriori-like parallel algorithms and FP-growth-like parallel algorithms.

In [9], the authors gave a comparative analysis of different frequent itemset mining techniques that work in the MapReduce framework, presenting the benefits and limitations of each algorithm.

Other algorithms have been developed using ultrametric trees, such as FiDoop [10]. The frequent items ultrametric tree method employs an efficient ultrametric tree structure, known as the FIU-tree, to represent frequent items for mining frequent itemsets [11].
Generally, parallel frequent pattern mining algorithms are divided into two categories: shared-memory parallel algorithms (PLCM, ParaMiner, CFP-Growth, MineWithRounds) and distributed-memory parallel algorithms (Parallel FP-Growth, Dist-Eclat, BigFIM). Note that shared-memory parallel algorithms divide the work among multiple threads (one machine), whereas distributed-memory parallel algorithms divide the work among multiple machines.

PLCM [13] is a parallel algorithm for discovering frequent closed patterns based on the LCM algorithm [14]. Parallelization is carried out on the tree formed by the recursive calls. The authors proposed Melinda, an interface based on the notion of tuple space, to dynamically and efficiently share work between threads.

ParaMiner [15] is a generalization of PLCM that computes frequent closed patterns, closed relational graphs and gradual patterns.

CFP-Growth [16] is a shared-memory parallel algorithm for computing frequent patterns, based on FP-Growth. It relies on FP-tree partitioning techniques and effective load balancing: the main idea is to exploit the knowledge of the tree built within each partition so that the computing load can be divided among a great number of threads.
MineWithRounds [12] is the only shared-memory algorithm we have found in the literature that generates the maximal frequent patterns. It partitions the candidate patterns into rounds such that a pattern is treated at iteration i if it belongs to round i; this partitioning imitates a depth-first traversal.
On the distributed-memory side, a number of frequent itemset mining (FIM) algorithms have been proposed in the past decades [17], including the MapReduce-based parallel FIM algorithms that have emerged in recent years to deal with large-scale datasets.

In general, we can classify the existing parallel FIM algorithms into two categories: one-phase and k-phase algorithms. One-phase algorithms need only one phase (e.g., a single MapReduce job) to find all frequent k-itemsets; they generate many redundant itemsets during processing, which may lead to memory overflow and excessive execution time on large datasets. A k-phase algorithm (where k is the maximum length of the frequent itemsets) needs k phases (e.g., k iterations of a MapReduce job) to find all frequent k-itemsets [18], [19], [20]: phase one determines the frequent 1-itemsets, phase two the frequent 2-itemsets, and so on.

PFP (Parallel FP-Growth) [21], based on FP-Growth, was proposed for extracting frequent patterns in parallel. This algorithm distributes the task fairly among parallel processors. It develops partitioning strategies at the different stages of the computation to achieve balance between processors, and adopts a data structure that reduces the exchange of information between processors.

Dist-Eclat [22] is based on the Eclat algorithm. It distributes the search space as evenly as possible between processors. This technique is able to handle large data, but can be prohibitively expensive when dealing with Big Data.

BigFIM [22] is optimized to handle massive amounts of data using a hybrid algorithm combining the principles of both Apriori and Eclat. It uses an Apriori-based method to extract the frequent patterns of length k, and then switches to the Eclat technique once the projected data fit in memory.

HMBA [23] is an algorithm based on the MapReduce model for computing the maximal frequent patterns. It needs two rounds of MapReduce to compute the frequent patterns; the maximal ones are then extracted in a last Reduce. We note here that finding the maximal patterns in post-processing is inefficient.

IV. CHARMAX: AN ALGORITHM FOR MAXIMAL FREQUENT ITEMSET MINING
Many sequential algorithms exist to compute frequent patterns, but only a few parallel algorithms do. We noted that there are very few shared-memory parallel algorithms that compute the maximal frequent patterns, and few distributed-memory algorithms that compute the closed patterns and the maximal frequent patterns.

We therefore focused on maximal frequent pattern mining algorithms, which seem suitable for Big Data processing. We propose a new distributed algorithm for discovering maximal frequent patterns in massive data, based on MapReduce and Hadoop, which will allow the mining of association rules in order to discover correlations between items in Big Data.

To deal with the problem of extracting frequent patterns from massive data, we have resorted to the development
of a sequential mining algorithm for maximal frequent patterns, called CharMax, which is then parallelized via Hadoop MapReduce.
The idea is to extend the Charm algorithm [24] so that it computes the maximal frequent itemsets. We used Charm because extensive experiments confirm that it offers a significant improvement over existing methods for extracting closed patterns from massive data. To enhance this algorithm so that it extracts the maximal frequent patterns from the closed frequent patterns, we followed these steps:

• At each iteration i, the frequent closed patterns are marked as maximal frequent patterns.
• In the following iterations, we test whether each frequent closed pattern admits a frequent superset. If so, the pattern is unmarked: it is not maximal.

When the algorithm completes all iterations over the frequent closed patterns, the patterns still marked as maximal constitute the set of maximal frequent patterns, as illustrated by the sketch below.
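The following minimal Java sketch is our illustration of this marking step, not the authors' implementation. It filters the maximal patterns out of a set of closed frequent patterns; since every frequent itemset has a closed superset with the same support, it suffices to compare the closed patterns among themselves:

import java.util.*;

// Sketch (ours) of the CharMax post-processing: keep the closed frequent
// patterns that have no proper superset among the closed frequent patterns;
// these are exactly the maximal frequent itemsets.
public class MaximalFromClosed {
    static Set<String> set(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    static List<Set<String>> maximal(List<Set<String>> closed) {
        List<Set<String>> result = new ArrayList<>();
        for (Set<String> c : closed) {
            boolean isMaximal = true;
            for (Set<String> other : closed) {
                // A frequent (closed) proper superset unmarks c.
                if (other.size() > c.size() && other.containsAll(c)) {
                    isMaximal = false;
                    break;
                }
            }
            if (isMaximal) result.add(c);
        }
        return result;
    }

    public static void main(String[] args) {
        // {a} is closed but not maximal, because {a, b} is frequent and closed.
        List<Set<String>> closed = Arrays.asList(set("a"), set("a", "b"), set("c"));
        System.out.println(maximal(closed)); // [[a, b], [c]] (order may vary)
    }
}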
Table I: Execution time and number of maximal frequent itemsets obtained by CharMax.

Dataset        MinSup   Execution time   #Maximal Frequent Itemsets
Accident       0.5      3289             21
BMS            0.45     2008             1
CHESS          0.5      399              36
CONNECT        0.45     7531             32
MUSHROOM       0.1      663              12
PUMSB          0.3      15108            118
RETAIL         0.003    5453             510
T10I4D100K     0.005    3849             567
T40I10D100K    0.05     14025            300
KOSARAK        0.02     10839            585
Table II: Execution time and number of maximal frequent itemsets obtained by FPMax*.

Dataset        MinSup   Execution time   #Maximal Frequent Itemsets
Accident       0.5      7572             21
BMS            0.45     7468             1
CHESS          0.5      42891            36
CONNECT        0.45     8359             32
MUSHROOM       0.1      2479             12
PUMSB          0.3      19837            118
RETAIL         0.003    26571            510
T10I4D100K     0.005    22441            567
T40I10D100K    0.05     39723            300
KOSARAK        0.02     186258           585

Algorithm 1: CharMax
Input: Closed frequent patterns
Output: Maximal frequent patterns
1: L ⇐ length of the largest closed frequent pattern
2: T1 ⇐ Read table(1)
3: for i = 1 to L − 1 do
4:    Ti+1 ⇐ Read table(i + 1)
5:    Find the maximal frequent patterns
6: end for
To assess CharMax, we compared it with the FPMax* algorithm, a sequential mining algorithm for discovering maximal frequent patterns. Comparative studies of maximal frequent itemset mining algorithms such as GenMax [25], MAFIA [27] and FPMax* [26] have shown that FPMax* is the most efficient algorithm in execution time and in the quality of the extracted patterns across all MinSup levels. For these reasons we chose to compare CharMax with FPMax* in order to evaluate its performance.
To evaluate the proposed algorithm, we used the following real datasets:
• Chess: This dataset, created by Christian Sheible, includes annotated chess games that have been posted on chess.com. It contains 37 attributes and 3,196 instances.
• Mushroom: This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota families. Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. It contains 22 attributes and 8,124 instances.
• Pumsb: This dataset contains census data on population and housing. There are 49,046 records with 2,113 distinct data values (separate items).
• Retail: This dataset contains all transactions between 01/12/2010 and 09/12/2011 for an online store belonging
to a retail company. It contains 8 attributes and 541,909 instances.
• Kosarak: This dataset contains 990,000 sequences of click-stream data extracted from a Hungarian news portal.
• T10I4D100K and T40I10D100K: These datasets consist of binary variables in transactional form. They are particularly suitable for extracting association rules.

The execution time and the number of maximal frequent itemsets obtained by CharMax and FPMax* on the different datasets are shown in Tables I and II. Note that, for each dataset, the same MinSup value is used for both algorithms.

The number of maximal frequent patterns differs from one dataset to another, while both algorithms return exactly the same counts on every dataset, as expected: the set of maximal frequent itemsets is uniquely determined. CharMax, however, manipulates fewer intermediate patterns, since it is based on the Charm algorithm, which extracts only the closed frequent itemsets rather than all frequent itemsets.

It is also observed that the execution time of CharMax is lower than that of FPMax* for most datasets. Consequently, the results obtained are reliable and satisfy our needs.

Figure 1 illustrates the variation of the execution time for the sequential algorithms on the different datasets. For a given MinSup ∈ [0, 1], we note that the execution time of the two algorithms is relative to the size of the file, i.e., to the number of attributes and the number of rows that each of these datasets contains.
Figure 1: The variation of the execution time (CPU time, ms) for the sequential algorithms CharMax and FPMax* on the different datasets.
The difference between the two algorithms in terms of execution time is very clear: CharMax is much faster than FPMax* at extracting the maximal frequent patterns. Take, for example, the Connect-4 dataset, where the gap is striking: the curve of FPMax* exhibits a much higher peak, which shows that CharMax clearly outperforms FPMax*.
V. PARALLELCHARMAX: AN EFFECTIVE ALGORITHM FOR MAXIMAL FREQUENT ITEMSETS BASED ON THE MAPREDUCE FRAMEWORK

This section introduces our distributed ParallelCharMax algorithm, which is essentially based on the CharMax algorithm, extended into a parallel algorithm for discovering maximal frequent itemsets. The extension was performed to support the MapReduce paradigm and to operate in a parallel way on large-scale datasets.

ParallelCharMax is a distributed in-memory MapReduce runtime optimized for maximal frequent pattern computations. The architecture of the runtime system that supports ParallelCharMax is shown in Figure 2.

Figure 2: ParallelCharMax on the MapReduce framework.

The runtime system consists of one master, mapper workers and reducer workers. The mapper workers are responsible for executing the map function defined by the user, the reducer workers execute the reduce function, and the master schedules the execution of the parallel tasks.
An algorithmic description of the Main procedure of ParallelCharMax is given in Algorithm 2. This procedure first creates a Hadoop Configuration object, which is required to:

• allow Hadoop to obtain the general configuration of the cluster; the object in question also lets us retrieve the configuration options that interest us;
• allow Hadoop to retrieve the generic arguments available on the command line (for example, the package name of the task to be executed if the .jar contains more than one); we also retrieve the additional arguments for our own use.

The user can specify the name of the input file and the name of the HDFS output directory for our Hadoop tasks on the command line.

A new Hadoop Job object is then created; it designates a Hadoop task and informs Hadoop of the names of our Main, Map and Reduce components. We use the same object to declare the data types used in our program for the <key, value> pairs. The Job object thus created is finally used to trigger the task launch on the Hadoop cluster. A Java sketch of the corresponding driver is given after Functions 1 and 2 below.
Algorithm 2: Main procedure of the ParallelCharMax algorithm.
Input: Dataset
Output: Maximal frequent patterns
1: while the stopping condition is not met do
2:    Hadoop calls Function 1 and Function 2, then copies the maximal frequent patterns from HDFS and stores them locally
3: end while
Function 1: Map(key, value)
Input: key: data record, value: data record values
Output: <key', value'> pair, where value' holds the closed frequent patterns
1: for each key do
2:    compute the closed frequent patterns and store them in value'
3: end for
4: emit the <key', value'> pair

Function 2: Reduce(key, value)
Input: key: data record, value: closed frequent patterns
Output: <key', value'> pair, where value' holds the maximal frequent patterns
1: for each key do
2:    compute the maximal frequent patterns and store them in value'
3: end for
4: emit the <key', value'> pair
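As a concrete illustration of the driver and the two functions, the following Java sketch shows the standard Hadoop 2.x boilerplate that Algorithm 2 and Functions 1 and 2 describe. The class names and stub bodies are our own assumptions, not the authors' code; the map stub stands for the Charm-based extraction of closed patterns (Function 1) and the reduce stub for the maximality filtering (Function 2):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class ParallelCharMaxDriver {

    // Stub for Function 1: each mapper would run Charm on its data split
    // and emit the closed frequent patterns it finds (body is a placeholder).
    public static class CharMaxMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            // ... run Charm locally, then for each closed pattern p:
            // ctx.write(new Text("closed"), new Text(p));
        }
    }

    // Stub for Function 2: the reducer filters the closed patterns,
    // keeping only those with no frequent superset (the maximal ones).
    public static class CharMaxReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            // ... collect the closed patterns, drop the non-maximal ones,
            // then emit each remaining pattern p:
            // ctx.write(new Text("maximal"), new Text(p));
        }
    }

    public static void main(String[] args) throws Exception {
        // General cluster configuration plus generic command-line arguments.
        Configuration conf = new Configuration();
        String[] remaining = new GenericOptionsParser(conf, args).getRemainingArgs();

        // The Job object names the Main/Map/Reduce components and the
        // <key, value> types, then triggers the launch on the cluster.
        Job job = Job.getInstance(conf, "ParallelCharMax");
        job.setJarByClass(ParallelCharMaxDriver.class);
        job.setMapperClass(CharMaxMapper.class);
        job.setReducerClass(CharMaxReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Input file and HDFS output directory supplied on the command line.
        FileInputFormat.addInputPath(job, new Path(remaining[0]));
        FileOutputFormat.setOutputPath(job, new Path(remaining[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}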
VI. COMPUTATIONAL STUDY

In this section, we evaluate the performance of ParallelCharMax by comparing it with MineWithRounds [29] and HMBA [28]. MineWithRounds [29] divides the set of candidate patterns into rounds such that a pattern is processed during iteration i if it belongs to round i, which imitates a depth-first traversal. HMBA (Hadoop Market Basket Analysis) [28] needs two rounds of MapReduce to compute the frequent patterns, the maximal ones being extracted during a last Reduce, i.e., in post-processing.

A. Experimental protocol

We implemented the ParallelCharMax algorithm in Hadoop 2.5. All the data were stored on the same HDFS cluster. We ran our experiments on the Cloudera virtual machine CDH 5.3.0, chosen for its quality of integration and its maturity. Each node has 4 Intel(R) Core i5-5200U CPUs running at 2.20 GHz, 8 GB of memory and a 1 TB disk. All experiments run on CentOS 6.4 with Java 1.8.

We also evaluated the scalability of ParallelCharMax in terms of sizeup and speedup, which are both primordial factors for a parallel algorithm. The sizeup analysis holds the number of cores in the system constant and grows the size of the datasets, to measure whether a given algorithm can deal with larger datasets. The speedup measures how much faster a parallel algorithm is than a corresponding sequential algorithm, by growing the number of cores in the system while keeping the size of the datasets constant. The correctness of ParallelCharMax was also evaluated: all the experimental results are exactly the same as those of MineWithRounds and HMBA.

B. Data of experimentation

In this subsection, we present the experimental results of ParallelCharMax on three large-scale datasets with different characteristics, selected from the University of California at Irvine (UCI) Machine Learning Repository [30]. All experiments were executed three times, and the average was taken as the experimental result. The datasets we used are Mushroom, Connect4 and Agaricus; more details are given in Table III.

Table III: Dataset properties.

Dataset     #Items   #Transactions   Size
Mushroom    119      78,265          120 MB
Connect4    76       135,115         200 MB
Agaricus    23       92,653          170 MB

C. Results

In parallel mode, our algorithm was applied to massive datasets and compared with the two parallel algorithms HMBA and MineWithRounds. Table IV shows the running times obtained by the MineWithRounds, HMBA and ParallelCharMax algorithms on these datasets.

The results obtained with ParallelCharMax are much better in terms of execution time than those of the MineWithRounds and HMBA algorithms. The results are encouraging and demonstrate the performance of ParallelCharMax compared with the existing algorithms, in terms of both running time and number of maximal frequent patterns.

The support degree varies from 0.05 to 0.5 with a step of 0.05. We can see from Table IV that the ParallelCharMax algorithm tends to be more efficient when the support degree is set to a higher level, which is very clear in Figure 3.

We also note that the running time of ParallelCharMax is, for each dataset, clearly smaller than those of the MineWithRounds and HMBA algorithms. For example, for the Connect4 dataset, when the support degree is 0.5, the running time of ParallelCharMax is 19.32 s, nearly 3 times smaller than that of HMBA, i.e., 53.88 s.

Table IV: The running time (s) comparison on the Mushroom, Connect4 and Agaricus datasets.

Dataset    Support degree   ParallelCharMax   HMBA     MineWithRounds
Mushroom   0.05             12.36             287.10   5249.81
Mushroom   0.10             7.32              275.16   2975.17
Mushroom   0.15             7.38              186.54   2166.57
Mushroom   0.20             7.74              182.94   962.95
Mushroom   0.25             7.68              166.92   480.75
Mushroom   0.30             7.86              120.54   300.55
Mushroom   0.35             7.44              104.40   180.36
Mushroom   0.40             7.50              60.12    120.09
Mushroom   0.45             7.62              42.12    102.17
Mushroom   0.50             8.04              17.42    68.05
Connect4   0.05             19.86             622.26   9442.26
Connect4   0.10             36.30             566.70   8114.70
Connect4   0.15             18.66             471.00   5288.91
Connect4   0.20             17.52             322.02   1940.16
Connect4   0.25             310.26            166.92   913.86
Connect4   0.30             18.24             255.12   493.12
Connect4   0.35             19.20             148.80   344.88
Connect4   0.40             19.08             91.26    273.72
Connect4   0.45             17.40             72.36    132.39
Connect4   0.50             19.32             53.88    109.31
Agaricus   0.05             10.08             445.32   4450.09
Agaricus   0.10             9.60              352.50   3728.52
Agaricus   0.15             9.54              298.80   3681.35
Agaricus   0.20             9.54              261.12   2469.05
Agaricus   0.25             9.60              231.78   2410.11
Agaricus   0.30             9.66              179.88   1749.78
Agaricus   0.35             9.36              123.18   1509.20
Agaricus   0.40             9.42              94.44    487.23
Agaricus   0.45             9.30              32.10    177.26
Agaricus   0.50             9.96              28.48    148.01

Figure 3: The running time comparison on the Mushroom, Connect and Agaricus datasets (CPU time vs. MinSup, logarithmic scales, for ParallelCharMax, MineWithRounds and HMBA).
VII. CONCLUSION
Frequent pattern mining is an important research area in the data mining process, since it is widely applied in the real world to find frequent patterns and thus mine human behavior patterns and trends.
In this paper, we have focused on parallel algorithms implemented to discover frequent, closed frequent or maximal itemsets, in order to address the performance deterioration, load balancing and scalability challenges of sequential algorithms, especially with the emergence of the Big Data era. However, these algorithms suffer from potential problems that make it strenuous to scale them up, especially with massive data. Moreover, current parallel frequent pattern mining algorithms often suffer from generating huge amounts of intermediate data, or from scanning the whole transaction database to identify the frequent itemsets.

In order to contribute to the resolution of these problems, we have chosen to work on maximal frequent pattern algorithms, which are more suitable for processing large data and for which few propositions exist. We developed a new parallel algorithm based on Hadoop MapReduce for discovering maximal frequent patterns. This parallel algorithm is a generalization of the sequential algorithm CharMax, itself based on the Charm algorithm, which we modified to generate the maximal frequent patterns. CharMax proved its effectiveness in a comparative study against the powerful FPMax* algorithm; this study showed that CharMax performs better in terms of execution time and pattern quality.

Numerous perspectives are open to enhance this work. We aim to integrate a semantic pruning phase into the discovery process of maximal frequent patterns. An ontology can serve as a pruning support to remove certain candidates from the computation. The goal is to help the expert discover useful maximal frequent patterns by removing the patterns that can already be found using the domain ontology associated with the searched data. The pruning phase can be defined as a posterior processing step for the generation of association rules, in order to generate only useful and interesting rules from massive data.
REFERENCES

[1] G. Bothorel, M. Serrurier, and C. Hurter, “Utilisation d’outils de Visual Data Mining pour l’exploration d’un ensemble de règles d’association,” in IHM 2011, 23ème Conférence Francophone sur l’Interaction Homme-Machine, Nice, France. [Online]. Available: https://hal-enac.archives-ouvertes.fr/hal-01022276
[2] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining. AAAI Press Menlo Park, 1996, vol. 21.
[3] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, “The KDD process for extracting useful knowledge from volumes of data,” Communications of the ACM, vol. 39, no. 11, pp. 27–34, 1996.
[4] M. J. Zaki, “Parallel and distributed data mining: An introduction,” in Large-Scale Parallel Data Mining. Springer, 2000, pp. 1–23.
[5] C. Borgelt, C. Braune, T. Kötter, and S. Grün, “New algorithms for finding approximate frequent item sets,” Soft Comput., vol. 16, no. 5, pp. 903–917, 2012. [Online]. Available: http://dx.doi.org/10.1007/s00500-011-0776-2
[6] M. J. Zaki and K. Gouda, “Fast vertical mining using diffsets,” in Proceedings of the Ninth SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’03, New York, NY, USA, 2003, pp. 326–335. [Online]. Available: http://doi.acm.org/10.1145/956750.956788
[7] M. J. Zaki, “Parallel and distributed association mining: A survey,” IEEE Concurrency, vol. 7, no. 4, pp. 14–25, 1999.
[8] I. Pramudiono and M. Kitsuregawa, “Fp-tax: Tree structure based generalized association rule mining,” in Proceedings of the 9th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery. ACM, 2004, pp. 60–63.
[9] M. A. Shinde and K. Adhiya, “Frequent itemset mining algorithms for big data using mapreduce technique - a review,” 2016.
[10] J. Dean and S. Ghemawat, “Mapreduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[11] S. Ghadge, P. Durge, V. Bhosale, and S. Mishra, “Frequent itemsets parallel mining algorithms,” International Engineering Research Journal (IERJ), vol. 1, no. 8, pp. 599–604, 2015.
[12] S. Maabout, “Contributions à l’Optimisation de Requêtes Multidimensionnelles,” Ph.D. dissertation, Université de Bordeaux. [Online]. Available: https://hal.archives-ouvertes.fr/tel-01274065
[13] B. Negrevergne, A. Termier, J.-F. Méhaut, and T. Uno, “Discovering closed frequent itemsets on multicore: Parallelizing computations and optimizing memory accesses,” in High Performance Computing and Simulation (HPCS), 2010 International Conference on. IEEE, 2010, pp. 521–528.
[14] T. Uno, T. Asai, Y. Uchida, and H. Arimura, “Lcm: An efficient algorithm for enumerating frequent closed item sets,” in FIMI, vol. 90. Citeseer, 2003.
[15] B. Negrevergne, A. Termier, M.-C. Rousset, and J.-F. Méhaut, “Paraminer: a generic pattern mining algorithm for multi-core architectures,” Data Mining and Knowledge Discovery, vol. 28, no. 3, pp. 593–633, 2014.
[16] R. U. Kiran and P. K. Reddy, “An improved frequent pattern-growth approach to discover rare association rules,” in KDIR, 2009, pp. 43–52.
[17] H. Qiu, R. Gu, C. Yuan, and Y. Huang, “Yafim: a parallel frequent itemset mining algorithm with spark,” in Parallel & Distributed Processing Symposium Workshops (IPDPSW), 2014 IEEE International. IEEE, 2014, pp. 1664–1671.
[18] N. Li, L. Zeng, Q. He, and Z. Shi, “Parallel implementation of apriori algorithm based on mapreduce,” in Software Engineering, Artificial Intelligence, Networking and Parallel & Distributed Computing (SNPD), 2012 13th ACIS International Conference on. IEEE, 2012, pp. 236–241.
[19] M.-Y. Lin, P.-Y. Lee, and S.-C. Hsueh, “Apriori-based frequent itemset mining algorithms on mapreduce,” in Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication. ACM, 2012, p. 76.
[20] X. Y. Yang, Z. Liu, and Y. Fu, “Mapreduce as a programming model for association rules algorithm on hadoop,” in Information Sciences and Interaction Sciences (ICIS), 2010 3rd International Conference on. IEEE, 2010, pp. 99–102.
[21] H. Li, Y. Wang, D. Zhang, M. Zhang, and E. Y. Chang, “Pfp: parallel fp-growth for query recommendation,” in Proceedings of the 2008 ACM Conference on Recommender Systems. ACM, 2008, pp. 107–114.
[22] S. Moens, E. Aksehirli, and B. Goethals, “Frequent itemset mining for big data,” in Big Data, 2013 IEEE International Conference on. IEEE, 2013, pp. 111–118.
[23] M. R. Karim, M. Azam Hossain, M. Rashid, B.-S. Jeong, and H.-J. Choi, “An efficient market basket analysis technique with improved mapreduce framework on hadoop: An e-commerce perspective.”
[24] M. J. Zaki and C.-J. Hsiao, “Efficient algorithms for mining closed itemsets and their lattice structure,” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 4, pp. 462–478, 2005.
[25] K. Gouda and M. J. Zaki, “Efficiently mining maximal frequent itemsets,” in Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on. IEEE, 2001, pp. 163–170.
[26] G. Grahne and J. Zhu, “Efficiently using prefix-trees in mining frequent itemsets,” in FIMI, vol. 90, 2003.
[27] D. Burdick, M. Calimlim, and J. Gehrke, “Mafia: A maximal frequent itemset algorithm for transactional databases,” in Data Engineering, 2001. Proceedings. 17th International Conference on. IEEE, 2001, pp. 443–452.
[28] M. R. Karim, J.-H. Jo, B.-S. Jeong, and H.-J. Choi, “Mining e-shopper’s purchase rules by using maximal frequent patterns: An e-commerce perspective,” in Information Science and Applications (ICISA), 2012 International Conference on. IEEE, 2012, pp. 1–6.
[29] T. Radu-Ionel, “Bordures : de la sélection de vues dans un cube de données au calcul parallèle de fréquents maximaux,” Theses, Université Sciences et Technologies - Bordeaux I, Sep. 2010. [Online]. Available: https://tel.archives-ouvertes.fr/tel-01086108
[30] D. N. A. Asuncion, “UCI machine learning repository,” 2007. [Online]. Available: http://www.ics.uci.edu/~mlearn/MLRepository.html