2017 IEEE/ACS 14th International Conference on Computer Systems and Applications
ParallelCharMax: An Effective Maximal Frequent
Itemset Mining Algorithm Based on MapReduce
Framework
Rania Mkhinini Gahar1 , Olfa Arfaoui2 , Minyar Sassi Hidri3 , Nejib Ben Hadj-Alouane4
1,2,3,4 University of Tunis El Manar
1,2,3,4 National Engineering School of Tunis
3 Imam Abdulrahman Bin Faisal University, Dammam, Saudi Arabia
1,2,3,4 BP. 37, Le Belvèdère 1002, Tunis, Tunisia
Abstract—Nowadays, the explosive growth of data collection in business and scientific areas has created the need to analyze and mine the useful knowledge residing in these data. Recourse to data mining techniques seems inescapable in order to extract useful and novel patterns/models from large datasets. In this context, frequent itemsets (patterns) play an essential role in many data mining tasks that try to find interesting patterns in datasets. However, conventional approaches for mining frequent itemsets encounter significant challenges in the Big Data era, when computing power and memory space are limited. This paper proposes an efficient distributed frequent itemset mining algorithm, called ParallelCharMax, that is based on a powerful sequential algorithm, called Charm, and computes the maximal frequent itemsets, which are considered perfect summaries of the frequent ones. The proposed algorithm has been implemented using the MapReduce framework. The experimental component of the study shows the efficiency and performance of the proposed algorithm compared with well-known algorithms such as MineWithRounds and HMBA.

Keywords—Frequent Itemset Mining, Parallel Mining Algorithm, MapReduce, Charm.

I. INTRODUCTION

Data Mining and Knowledge Discovery in Databases (KDD) is an interdisciplinary field at the intersection of statistics, machine learning, databases, and parallel and distributed computing. It has been driven by the tremendous growth of data in all spheres of human endeavor, and by the economic and scientific need to extract useful information from the collected data. The key challenge in data mining is the extraction of knowledge from massive datasets.

Data mining refers to the overall process of discovering new patterns or building models from a given dataset. The KDD process involves many steps, including data selection, data cleaning and preprocessing, data transformation and reduction, data mining task and algorithm selection, and finally post-processing and interpretation of the discovered knowledge [2], [3]. This KDD process tends to be highly iterative and interactive.

In this context, pattern extraction is one of the most important data mining techniques. It has stimulated a great effort that has led to a variety of algorithms proposed throughout the last two decades. Much of this research has been directed toward extracting the maximal frequent patterns, since the latter can be considered perfect summaries of the frequent itemsets: they can be far less numerous than the frequent closed patterns. Moreover, the problem remains topical as new challenges arise, particularly with the emergence of mega-data (Big Data) and the development of data science.
While data mining has its roots in the traditional fields
of machine learning and statistics, the huge volume of data
today poses the most serious problem. For example, many
companies already have data warehouses in the terabyte range
(e.g., FedEx, UPS, Walmart). Similarly, scientific data is
reaching gigantic proportions (e.g., NASA space missions,
Human Genome Project). Traditional methods typically made
the assumption that the data is memory resident. This assumption is no longer tenable. Implementation of data mining
ideas in high-performance parallel and distributed computing
environments is thus becoming crucial for ensuring system
scalability and interactivity as data continues to grow in size
and complexity [4].
Parallel and distributed computing is expected to relieve current mining methods from the sequential bottleneck, providing the ability to scale to massive datasets and improving the response time. Achieving good performance on today's multiprocessor systems is a non-trivial task. The main challenges include minimizing synchronization and communication, balancing the workload, finding good data layouts and data decompositions, and minimizing disk input/output, which is especially important for data mining.
To address these challenges, we propose a new parallel algorithm for discovering maximal frequent itemsets based on the MapReduce framework.
The rest of the paper is organized as follows. Section 2 presents the basic frequent itemset and association rule mining problems. Section 3 describes related work: the main parallel and distributed techniques used to solve these problems, with a survey of the most influential algorithms proposed during the last decade. Section 4 focuses on a powerful sequential algorithm for searching maximal frequent itemsets, on which we base a new parallel distributed-memory algorithm using the MapReduce paradigm under Hadoop. Section 5 presents this parallel algorithm, called ParallelCharMax, for efficiently finding maximal frequent itemsets. Section 6 discusses computational results of the proposed algorithm. Section 7 summarizes the paper and highlights future work.
II. PRELIMINARIES

Let I be a set of items. A set X = {i1, ..., ik} ⊆ I is called an itemset, or a k-itemset if it contains k items.

A transaction over I is a couple T = (tid, J) where tid is the transaction identifier and J ⊆ I is an itemset. A transaction T = (tid, J) is said to support an itemset X ⊆ I if X ⊆ J. A transaction database D over I is a set of transactions over I. We omit I whenever it is clear from the context.

The cover of an itemset X in D consists of the set of transaction identifiers of transactions in D that support X:

cover(X, D) := {tid | (tid, J) ∈ D, X ⊆ J}.

The support of an itemset X in D is the number of transactions in the cover of X in D:

support(X, D) := |cover(X, D)|.

The frequency of an itemset X in D is the probability of X occurring in a transaction T ∈ D:

frequency(X, D) := P(X) = support(X, D) / |D|.

Note that |D| = support({}, D). We omit D whenever it is clear from the context.

An itemset is called frequent if its support is no less than a given absolute minimal support threshold σabs, with 0 ≤ σabs ≤ |D|. When working with frequencies of itemsets instead of their supports, we use a relative minimal frequency threshold σrel, with 0 ≤ σrel ≤ 1. Obviously, σabs = ⌈σrel · |D|⌉ (for instance, σrel = 0.25 on a database of 10 transactions gives σabs = 3).

A frequent itemset is called closed if none of its proper supersets has the same support; in other words, all its proper supersets have a strictly lower support.

A frequent itemset X in D is called maximal if none of its proper supersets is frequent. The set of maximal frequent itemsets is thus a subset of the set of frequent closed itemsets.

The reason for extracting the maximal frequent itemsets is that their set is generally much smaller than the set of frequent itemsets, and also smaller than the set of frequent closed itemsets. It is possible to regenerate all frequent itemsets from the set of maximal frequent itemsets, but it would not be possible to obtain their supports without scanning the database.

a) Definition 1: Let D be a transaction database over a set of items I, and σ a minimal support threshold. The collection of frequent itemsets in D with respect to σ is denoted by

F(D, σ) := {X ⊆ I | support(X, D) ≥ σ},

or simply F if D and σ are clear from the context.

b) Problem 1 (Itemset Mining): Given a set of items I, a transaction database D over I, and a minimal support threshold σ, find F(D, σ). In practice we are not only interested in the set of itemsets F, but also in the actual supports of these itemsets.

An association rule is an expression of the form X ⇒ Y, where X and Y are itemsets and X ∩ Y = {}. Such a rule expresses that if a transaction contains all items in X, then it also contains all items in Y. X is called the body or antecedent, and Y is called the head or consequent of the rule.

The support of an association rule X ⇒ Y in D is the support of X ∪ Y in D, and similarly, the frequency of the rule is the frequency of X ∪ Y. An association rule is called frequent if its support (frequency) exceeds a given minimal support (frequency) threshold σabs (σrel). Again, we will only work with the absolute minimal support threshold for association rules and omit the subscript abs unless explicitly stated otherwise.

The confidence or accuracy of an association rule X ⇒ Y in D is the conditional probability of having Y contained in a transaction, given that X is contained in that transaction:

confidence(X ⇒ Y, D) := P(Y | X) = support(X ∪ Y, D) / support(X, D).

The rule is called confident if P(Y | X) exceeds a given minimal confidence threshold γ, with 0 ≤ γ ≤ 1.

c) Definition 2: Let D be a transaction database over a set of items I, σ a minimal support threshold, and γ a minimal confidence threshold. The collection of frequent and confident association rules with respect to σ and γ is denoted by

R(D, σ, γ) := {X ⇒ Y | X, Y ⊆ I, X ∩ Y = {}, X ∪ Y ∈ F(D, σ), confidence(X ⇒ Y, D) ≥ γ},

or simply R if D, σ and γ are clear from the context.

d) Problem 2 (Association Rule Mining): Given a set of items I, a transaction database D over I, and minimal support and confidence thresholds σ and γ, find R(D, σ, γ).

Besides the set of all association rules, we are also interested in the support and confidence of each of these rules.

Note that the Itemset Mining problem is actually a special case of the Association Rule Mining problem. Indeed, if we are given the support and confidence thresholds σ and γ, then every frequent itemset X also represents the trivial rule X ⇒ {}, which holds with 100% confidence. Obviously, the support of the rule equals the support of X. Also note that for every frequent itemset I', all rules X ⇒ Y with X ∪ Y = I' hold with at least σrel confidence. Hence, the minimal confidence threshold must be higher than the minimal frequency threshold to be of any effect.
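To make these measures concrete, the following small Java sketch (our toy illustration, not from the paper) computes the support, frequency and confidence defined above over a four-transaction database; the items and values are illustrative only:

import java.util.*;

// Toy example for Section II: a database of four transactions
// over the items {bread, milk, butter}.
public class ItemsetMeasures {
    static Set<String> set(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    // support(X, D) := |cover(X, D)|, the number of transactions containing X.
    static int support(Set<String> x, List<Set<String>> db) {
        int n = 0;
        for (Set<String> t : db) if (t.containsAll(x)) n++;
        return n;
    }

    public static void main(String[] args) {
        List<Set<String>> db = Arrays.asList(
            set("bread", "milk"),
            set("bread", "butter", "milk"),
            set("butter", "milk"),
            set("bread", "butter"));

        Set<String> x = set("bread");
        Set<String> xy = set("bread", "milk"); // X u Y for the rule X => Y

        System.out.println(support(x, db));                      // 3
        // frequency(X, D) := support(X, D) / |D| = 0.75
        System.out.println((double) support(x, db) / db.size());
        // confidence(X => Y, D) := support(X u Y, D) / support(X, D) = 2/3
        System.out.println((double) support(xy, db) / support(x, db));
    }
}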
III. RELATED WORK

Data mining is an important process for knowledge discovery, extracting previously unknown and useful information from databases. The most popular data mining technique is association rule discovery, which is essentially based on frequent pattern mining.
The huge research efforts devoted to these tasks have led to a variety of sophisticated and efficient algorithms for finding frequent itemsets. Among the best-known approaches are Apriori [1], FP-Growth [5] and Eclat [6], which run on a single node.

In [7], [8], the algorithms developed for mining frequent itemsets are classified into two types: the first is the candidate-generation approach, such as the Apriori algorithm (Apriori-like); the second is the approach without candidate generation, such as the FP-growth algorithm (FP-growth-like) [9]. With the arrival of the Big Data era, however, these traditional data mining algorithms have been unable to meet Big Data analysis requirements, since size, complexity and variability are the principal challenges of mining massive data. To address this issue, different parallel mining algorithms have been proposed and implemented.

Thus, to overcome the problems of insufficient memory and computational capability when mining frequent itemsets from massive data, the traditional algorithms have been converted into MapReduce tasks. Consequently, two types of algorithms have appeared: Apriori-like parallel algorithms and FP-growth-like parallel algorithms.

In [9], the authors gave a comparative analysis of different frequent itemset mining techniques that work in the MapReduce framework, presenting the benefits and limitations of each algorithm.

Other algorithms have been developed using ultrametric trees, such as FiDoop [10]. The frequent items ultrametric tree method employs an efficient ultrametric tree structure, known as the FIU-tree, to represent frequent items for mining frequent itemsets [11].
Generally, parallel frequent pattern mining algorithms are divided into two categories: shared-memory parallel algorithms (PLCM, ParaMiner, CFP-Growth, MineWithRounds) and distributed-memory parallel algorithms (Parallel FP-Growth, Dist-Eclat, BigFIM). Note that shared-memory parallel algorithms divide the work among multiple threads (one machine), whereas distributed-memory parallel algorithms divide the work among multiple machines.

PLCM [13] is a parallel algorithm for discovering frequent closed patterns based on the LCM algorithm [14]. Parallelization is carried out on the tree formed by the recursive calls. The authors proposed Melinda, an interface based on the notion of tuple space, to dynamically and efficiently share work between threads.

ParaMiner [15] is a generalization of PLCM that computes frequent closed patterns, closed relational graphs and gradual patterns.

CFP-Growth [16] is a shared-memory parallel algorithm for computing frequent patterns, based on FP-Growth. It relies on FP-tree partitioning techniques and effective load balancing: the main idea is to exploit the knowledge of the tree built within each partition so that the computing load can be divided among a great number of threads.
MineWithRounds [12] is the only shared-memory algorithm we have found in the literature that generates the maximal frequent patterns. It partitions the candidate patterns into rounds such that a pattern is treated at iteration i if it belongs to round i; this partitioning imitates a depth-first traversal.
On the distributed-memory side, a number of frequent itemset mining (FIM) algorithms have been proposed in the past decades [17], including the MapReduce-based parallel FIM algorithms that have emerged in recent years to deal with large-scale datasets.

In general, we can classify the existing parallel FIM algorithms into two categories: one-phase and k-phase algorithms. One-phase algorithms need only one phase (e.g., a single MapReduce job) to find all frequent k-itemsets; they generate many redundant itemsets during processing, which may lead to memory overflow and excessive execution time on large datasets. A k-phase algorithm (where k is the maximum length of the frequent itemsets) needs k phases (e.g., k iterations of a MapReduce job) to find all frequent k-itemsets [18], [19], [20]: phase one determines the frequent 1-itemsets, phase two the frequent 2-itemsets, and so on.

PFP (Parallel FP-Growth) [21], based on FP-Growth, was proposed for extracting frequent patterns in parallel. This algorithm distributes the task fairly among parallel processors. It develops partitioning strategies at the different stages of the computation to achieve balance between processors, and adopts a data structure that reduces the exchange of information between processors.

Dist-Eclat [22] is based on the Eclat algorithm. It distributes the search space as evenly as possible between processors. This technique is able to handle large data, but can be prohibitively expensive when dealing with Big Data.

BigFIM [22] is optimized to handle massive amounts of data using a hybrid algorithm combining the principles of both Apriori and Eclat. It uses an Apriori-based method to extract the frequent patterns of length k, and then switches to the Eclat technique once the projected data fit in memory.

HMBA [23] is an algorithm based on the MapReduce model for computing the maximal frequent patterns. It needs two rounds of MapReduce to compute the frequent patterns; the maximal ones are then extracted in a last Reduce. We note here that finding the maximal patterns in post-processing is inefficient.

IV. CHARMAX: AN ALGORITHM FOR MAXIMAL FREQUENT ITEMSET MINING
Many sequential algorithms exist to compute frequent patterns, but only a few parallel algorithms do. We noted that there are very few shared-memory parallel algorithms that compute the maximal frequent patterns, and few distributed-memory algorithms that compute the closed patterns and the maximal frequent patterns.

We therefore focused on maximal frequent pattern mining algorithms, which seem suitable for Big Data processing. We propose a new distributed algorithm for discovering maximal frequent patterns in massive data, based on MapReduce and Hadoop, which will allow the mining of association rules in order to discover correlations between items in Big Data.

To deal with the problem of extracting frequent patterns from massive data, we have resorted to the development
of a sequential mining algorithm for maximal frequent patterns, called CharMax, which is then parallelized via Hadoop MapReduce.
The idea is to extend the Charm algorithm [24] so that it computes the maximal frequent itemsets. We used Charm because extensive experiments confirm that it offers a significant improvement over existing methods for extracting closed patterns from massive data. To enhance this algorithm so that it extracts the maximal frequent patterns from the closed frequent patterns, we followed these steps:

• At each iteration i, the frequent closed patterns are marked as maximal frequent patterns.
• In the following iterations, we test whether each frequent closed pattern admits a frequent superset. If so, the pattern is unmarked: it is not maximal.

When the algorithm completes all iterations over the frequent closed patterns, the patterns still marked as maximal constitute the set of maximal frequent patterns, as illustrated by the sketch below.
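The following minimal Java sketch is our illustration of this marking step, not the authors' implementation. It filters the maximal patterns out of a set of closed frequent patterns; since every frequent itemset has a closed superset with the same support, it suffices to compare the closed patterns among themselves:

import java.util.*;

// Sketch (ours) of the CharMax post-processing: keep the closed frequent
// patterns that have no proper superset among the closed frequent patterns;
// these are exactly the maximal frequent itemsets.
public class MaximalFromClosed {
    static Set<String> set(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    static List<Set<String>> maximal(List<Set<String>> closed) {
        List<Set<String>> result = new ArrayList<>();
        for (Set<String> c : closed) {
            boolean isMaximal = true;
            for (Set<String> other : closed) {
                // A frequent (closed) proper superset unmarks c.
                if (other.size() > c.size() && other.containsAll(c)) {
                    isMaximal = false;
                    break;
                }
            }
            if (isMaximal) result.add(c);
        }
        return result;
    }

    public static void main(String[] args) {
        // {a} is closed but not maximal, because {a, b} is frequent and closed.
        List<Set<String>> closed = Arrays.asList(set("a"), set("a", "b"), set("c"));
        System.out.println(maximal(closed)); // [[a, b], [c]] (order may vary)
    }
}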
Table I: Execution time and number of maximal frequent itemsets obtained by CharMax.

Dataset        MinSup   Execution time   #Maximal Frequent Itemsets
Accident       0.5      3289             21
BMS            0.45     2008             1
CHESS          0.5      399              36
CONNECT        0.45     7531             32
MUSHROOM       0.1      663              12
PUMSB          0.3      15108            118
RETAIL         0.003    5453             510
T10I4D100K     0.005    3849             567
T40I10D100K    0.05     14025            300
KOSARAK        0.02     10839            585
Table II: Execution time and number of maximal frequent itemsets obtained by FPMax*.

Dataset        MinSup   Execution time   #Maximal Frequent Itemsets
Accident       0.5      7572             21
BMS            0.45     7468             1
CHESS          0.5      42891            36
CONNECT        0.45     8359             32
MUSHROOM       0.1      2479             12
PUMSB          0.3      19837            118
RETAIL         0.003    26571            510
T10I4D100K     0.005    22441            567
T40I10D100K    0.05     39723            300
KOSARAK        0.02     186258           585

Algorithm 1: CharMax
Input: Closed frequent patterns
Output: Maximal frequent patterns
1: L ⇐ length of the largest closed frequent pattern
2: T1 ⇐ Read table(1)
3: for i = 1 to L − 1 do
4:    Ti+1 ⇐ Read table(i + 1)
5:    Find the maximal frequent patterns
6: end for
To assess CharMax, we compared it with the FPMax* algorithm, a sequential mining algorithm for discovering maximal frequent patterns. Comparative studies of maximal frequent itemset mining algorithms such as GenMax [25], MAFIA [27] and FPMax* [26] have shown that FPMax* is the most efficient algorithm in execution time and in the quality of the extracted patterns across all MinSup levels. For these reasons we chose to compare CharMax with FPMax* in order to evaluate its performance.
To evaluate the proposed algorithm, we used the following real datasets:
• Chess: This dataset, created by Christian Sheible, includes annotated chess games that have been posted on chess.com. It contains 37 attributes and 3,196 instances.
• Mushroom: This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota families. Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. It contains 22 attributes and 8,124 instances.
• Pumsb: This dataset contains census data on population and housing. There are 49,046 records with 2,113 distinct data values (separate items).
• Retail: This dataset contains all transactions between 01/12/2010 and 09/12/2011 for an online store belonging
to a retail company. It contains 8 attributes and 541,909 instances.
• Kosarak: This dataset contains 990,000 sequences of click-stream data extracted from a Hungarian news portal.
• T10I4D100K and T40I10D100K: These datasets consist of binary variables in transactional form. They are particularly suitable for extracting association rules.

The execution time and the number of maximal frequent itemsets obtained by CharMax and FPMax* on the different datasets are shown in Tables I and II. Note that, for each dataset, the same MinSup value is used for both algorithms.

The number of maximal frequent patterns differs from one dataset to another, while both algorithms return exactly the same counts on every dataset, as expected: the set of maximal frequent itemsets is uniquely determined. CharMax, however, manipulates fewer intermediate patterns, since it is based on the Charm algorithm, which extracts only the closed frequent itemsets rather than all frequent itemsets.

It is also observed that the execution time of CharMax is lower than that of FPMax* for most datasets. Consequently, the results obtained are reliable and satisfy our needs.

Figure 1 illustrates the variation of the execution time for the sequential algorithms on the different datasets. For a given MinSup ∈ [0, 1], we note that the execution time of the two algorithms is relative to the size of the file, i.e., to the number of attributes and the number of rows that each of these datasets contains.
Figure 1: The variation of the execution time (CPU time, ms) for the sequential algorithms CharMax and FPMax* on the different datasets.
The difference between the two algorithms in terms of execution time is very clear: CharMax is much faster than FPMax* at extracting the maximal frequent patterns. Take, for example, the Connect-4 dataset, where the gap is striking: the curve of FPMax* exhibits a much higher peak, which shows that CharMax clearly outperforms FPMax*.
V. PARALLELCHARMAX: AN EFFECTIVE ALGORITHM FOR MAXIMAL FREQUENT ITEMSETS BASED ON THE MAPREDUCE FRAMEWORK

This section introduces our distributed ParallelCharMax algorithm, which is essentially based on the CharMax algorithm, extended into a parallel algorithm for discovering maximal frequent itemsets. The extension was performed to support the MapReduce paradigm and to operate in a parallel way on large-scale datasets.

ParallelCharMax is a distributed in-memory MapReduce runtime optimized for maximal frequent pattern computations. The architecture of the runtime system that supports ParallelCharMax is shown in Figure 2.

Figure 2: ParallelCharMax on the MapReduce framework.

The runtime system consists of one master, mapper workers and reducer workers. The mapper workers are responsible for executing the map function defined by the user, the reducer workers execute the reduce function, and the master schedules the execution of the parallel tasks.
An algorithmic description of the Main procedure of ParallelCharMax is given in Algorithm 2. This procedure first creates a Hadoop Configuration object, which is required to:

• allow Hadoop to obtain the general configuration of the cluster; the object in question also lets us retrieve the configuration options that interest us;
• allow Hadoop to retrieve the generic arguments available on the command line (for example, the package name of the task to be executed if the .jar contains more than one); we also retrieve the additional arguments for our own use.

The user can specify the name of the input file and the name of the HDFS output directory for our Hadoop tasks on the command line.

A new Hadoop Job object is then created; it designates a Hadoop task and informs Hadoop of the names of our Main, Map and Reduce components. We use the same object to declare the data types used in our program for the <key, value> pairs. The Job object thus created is finally used to trigger the task launch on the Hadoop cluster. A Java sketch of the corresponding driver is given after Functions 1 and 2 below.
Algorithm 2: Main procedure of the ParallelCharMax algorithm.
Input: Dataset
Output: Maximal frequent patterns
1: while the stopping condition is not met do
2:    Hadoop calls Function 1 and Function 2, then copies the maximal frequent patterns from HDFS and stores them locally
3: end while
Function 1: Map(key, value)
Input: key: data record, value: data record values
Output: <key', value'> pair, where value' holds the closed frequent patterns
1: for each key do
2:    compute the closed frequent patterns and store them in value'
3: end for
4: emit the <key', value'> pair

Function 2: Reduce(key, value)
Input: key: data record, value: closed frequent patterns
Output: <key', value'> pair, where value' holds the maximal frequent patterns
1: for each key do
2:    compute the maximal frequent patterns and store them in value'
3: end for
4: emit the <key', value'> pair
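As a concrete illustration of the driver and the two functions, the following Java sketch shows the standard Hadoop 2.x boilerplate that Algorithm 2 and Functions 1 and 2 describe. The class names and stub bodies are our own assumptions, not the authors' code; the map stub stands for the Charm-based extraction of closed patterns (Function 1) and the reduce stub for the maximality filtering (Function 2):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class ParallelCharMaxDriver {

    // Stub for Function 1: each mapper would run Charm on its data split
    // and emit the closed frequent patterns it finds (body is a placeholder).
    public static class CharMaxMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            // ... run Charm locally, then for each closed pattern p:
            // ctx.write(new Text("closed"), new Text(p));
        }
    }

    // Stub for Function 2: the reducer filters the closed patterns,
    // keeping only those with no frequent superset (the maximal ones).
    public static class CharMaxReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            // ... collect the closed patterns, drop the non-maximal ones,
            // then emit each remaining pattern p:
            // ctx.write(new Text("maximal"), new Text(p));
        }
    }

    public static void main(String[] args) throws Exception {
        // General cluster configuration plus generic command-line arguments.
        Configuration conf = new Configuration();
        String[] remaining = new GenericOptionsParser(conf, args).getRemainingArgs();

        // The Job object names the Main/Map/Reduce components and the
        // <key, value> types, then triggers the launch on the cluster.
        Job job = Job.getInstance(conf, "ParallelCharMax");
        job.setJarByClass(ParallelCharMaxDriver.class);
        job.setMapperClass(CharMaxMapper.class);
        job.setReducerClass(CharMaxReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Input file and HDFS output directory supplied on the command line.
        FileInputFormat.addInputPath(job, new Path(remaining[0]));
        FileOutputFormat.setOutputPath(job, new Path(remaining[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}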
VI. COMPUTATIONAL STUDY

In this section, we evaluate the performance of ParallelCharMax by comparing it with MineWithRounds [29] and HMBA [28]. MineWithRounds [29] divides the set of candidate patterns into rounds such that a pattern is processed during iteration i if it belongs to round i, which imitates a depth-first traversal. HMBA (Hadoop Market Basket Analysis) [28] needs two rounds of MapReduce to compute the frequent patterns, the maximal ones being extracted during a last Reduce, i.e., in post-processing.

A. Experimental protocol

We implemented the ParallelCharMax algorithm in Hadoop 2.5. All the data were stored on the same HDFS cluster. We ran our experiments on the Cloudera virtual machine CDH 5.3.0, chosen for its quality of integration and its maturity. Each node has 4 Intel(R) Core i5-5200U CPUs running at 2.20 GHz, 8 GB of memory and a 1 TB disk. All experiments run on CentOS 6.4 with Java 1.8.

We also evaluated the scalability of ParallelCharMax in terms of sizeup and speedup, which are both primordial factors for a parallel algorithm. The sizeup analysis holds the number of cores in the system constant and grows the size of the datasets, to measure whether a given algorithm can deal with larger datasets. The speedup measures how much faster a parallel algorithm is than a corresponding sequential algorithm, by growing the number of cores in the system while keeping the size of the datasets constant. The correctness of ParallelCharMax was also evaluated: all the experimental results are exactly the same as those of MineWithRounds and HMBA.

B. Data of experimentation

In this subsection, we present the experimental results of ParallelCharMax on three large-scale datasets with different characteristics, selected from the University of California at Irvine (UCI) Machine Learning Repository [30]. All experiments were executed three times, and the average was taken as the experimental result. The datasets we used are Mushroom, Connect4 and Agaricus; more details are given in Table III.

Table III: Dataset properties.

Dataset     #Items   #Transactions   Size
Mushroom    119      78,265          120 MB
Connect4    76       135,115         200 MB
Agaricus    23       92,653          170 MB

C. Results

In parallel mode, our algorithm was applied to massive datasets and compared with the two parallel algorithms HMBA and MineWithRounds. Table IV shows the running times obtained by the MineWithRounds, HMBA and ParallelCharMax algorithms on these datasets.

The results obtained with ParallelCharMax are much better in terms of execution time than those of the MineWithRounds and HMBA algorithms. The results are encouraging and demonstrate the performance of ParallelCharMax compared with the existing algorithms, in terms of both running time and number of maximal frequent patterns.

The support degree varies from 0.05 to 0.5 with a step of 0.05. We can see from Table IV that the ParallelCharMax algorithm tends to be more efficient when the support degree is set to a higher level, which is very clear in Figure 3.

We also note that the running time of ParallelCharMax is, for each dataset, clearly smaller than those of the MineWithRounds and HMBA algorithms. For example, for the Connect4 dataset, when the support degree is 0.5, the running time of ParallelCharMax is 19.32 s, nearly 3 times smaller than that of HMBA, i.e., 53.88 s.

Table IV: The running time (s) comparison on the Mushroom, Connect4 and Agaricus datasets.

Dataset    Support degree   ParallelCharMax   HMBA     MineWithRounds
Mushroom   0.05             12.36             287.10   5249.81
Mushroom   0.10             7.32              275.16   2975.17
Mushroom   0.15             7.38              186.54   2166.57
Mushroom   0.20             7.74              182.94   962.95
Mushroom   0.25             7.68              166.92   480.75
Mushroom   0.30             7.86              120.54   300.55
Mushroom   0.35             7.44              104.40   180.36
Mushroom   0.40             7.50              60.12    120.09
Mushroom   0.45             7.62              42.12    102.17
Mushroom   0.50             8.04              17.42    68.05
Connect4   0.05             19.86             622.26   9442.26
Connect4   0.10             36.30             566.70   8114.70
Connect4   0.15             18.66             471.00   5288.91
Connect4   0.20             17.52             322.02   1940.16
Connect4   0.25             310.26            166.92   913.86
Connect4   0.30             18.24             255.12   493.12
Connect4   0.35             19.20             148.80   344.88
Connect4   0.40             19.08             91.26    273.72
Connect4   0.45             17.40             72.36    132.39
Connect4   0.50             19.32             53.88    109.31
Agaricus   0.05             10.08             445.32   4450.09
Agaricus   0.10             9.60              352.50   3728.52
Agaricus   0.15             9.54              298.80   3681.35
Agaricus   0.20             9.54              261.12   2469.05
Agaricus   0.25             9.60              231.78   2410.11
Agaricus   0.30             9.66              179.88   1749.78
Agaricus   0.35             9.36              123.18   1509.20
Agaricus   0.40             9.42              94.44    487.23
Agaricus   0.45             9.30              32.10    177.26
Agaricus   0.50             9.96              28.48    148.01

Figure 3: The running time comparison on the Mushroom, Connect and Agaricus datasets (CPU time vs. MinSup, logarithmic scales, for ParallelCharMax, MineWithRounds and HMBA).
VII. CONCLUSION
Frequent pattern mining is an important research area in the data mining process, since it is widely applied in the real world to find frequent patterns and thus mine human behavior patterns and trends.
In this paper, we have focused on parallel algorithms implemented to discover frequent, closed frequent or maximal itemsets, in order to address the performance deterioration, load balancing and scalability challenges of sequential algorithms, especially with the emergence of the Big Data era. However, these algorithms suffer from potential problems that make it strenuous to scale them up, especially with massive data. Moreover, current parallel frequent pattern mining algorithms often suffer from generating huge amounts of intermediate data, or from scanning the whole transaction database to identify the frequent itemsets.

In order to contribute to the resolution of these problems, we have chosen to work on maximal frequent pattern algorithms, which are more suitable for processing large data and for which few propositions exist. We developed a new parallel algorithm based on Hadoop MapReduce for discovering maximal frequent patterns. This parallel algorithm is a generalization of the sequential algorithm CharMax, itself based on the Charm algorithm, which we modified to generate the maximal frequent patterns. CharMax proved its effectiveness in a comparative study against the powerful FPMax* algorithm; this study showed that CharMax performs better in terms of execution time and pattern quality.

Numerous perspectives are open to enhance this work. We aim to integrate a semantic pruning phase into the discovery process of maximal frequent patterns. An ontology can serve as a pruning support to remove certain candidates from the computation. The goal is to help the expert discover useful maximal frequent patterns by removing the patterns that can already be found using the domain ontology associated with the searched data. The pruning phase can be defined as a posterior processing step for the generation of association rules, in order to generate only useful and interesting rules from massive data.
REFERENCES

[1] G. Bothorel, M. Serrurier, and C. Hurter, “Utilisation d’outils de Visual Data Mining pour l’exploration d’un ensemble de règles d’association,” in IHM 2011, 23ème Conférence Francophone sur l’Interaction Homme-Machine, Nice, France. [Online]. Available: https://hal-enac.archives-ouvertes.fr/hal-01022276
[2] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining. AAAI Press Menlo Park, 1996, vol. 21.
[3] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, “The KDD process for extracting useful knowledge from volumes of data,” Communications of the ACM, vol. 39, no. 11, pp. 27–34, 1996.
[4] M. J. Zaki, “Parallel and distributed data mining: An introduction,” in Large-Scale Parallel Data Mining. Springer, 2000, pp. 1–23.
[5] C. Borgelt, C. Braune, T. Kötter, and S. Grün, “New algorithms for finding approximate frequent item sets,” Soft Comput., vol. 16, no. 5, pp. 903–917, 2012. [Online]. Available: http://dx.doi.org/10.1007/s00500-011-0776-2
[6] M. J. Zaki and K. Gouda, “Fast vertical mining using diffsets,” in Proceedings of the Ninth SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’03, New York, NY, USA, 2003, pp. 326–335. [Online]. Available: http://doi.acm.org/10.1145/956750.956788
[7] M. J. Zaki, “Parallel and distributed association mining: A survey,” IEEE Concurrency, vol. 7, no. 4, pp. 14–25, 1999.
[8] I. Pramudiono and M. Kitsuregawa, “Fp-tax: Tree structure based generalized association rule mining,” in Proceedings of the 9th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery. ACM, 2004, pp. 60–63.
[9] M. A. Shinde and K. Adhiya, “Frequent itemset mining algorithms for big data using mapreduce technique - a review,” 2016.
[10] J. Dean and S. Ghemawat, “Mapreduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[11] S. Ghadge, P. Durge, V. Bhosale, and S. Mishra, “Frequent itemsets parallel mining algorithms,” International Engineering Research Journal (IERJ), vol. 1, no. 8, pp. 599–604, 2015.
[12] S. Maabout, “Contributions à l’Optimisation de Requêtes Multidimensionnelles,” Ph.D. dissertation, Université de Bordeaux. [Online]. Available: https://hal.archives-ouvertes.fr/tel-01274065
[13] B. Negrevergne, A. Termier, J.-F. Méhaut, and T. Uno, “Discovering closed frequent itemsets on multicore: Parallelizing computations and optimizing memory accesses,” in High Performance Computing and Simulation (HPCS), 2010 International Conference on. IEEE, 2010, pp. 521–528.
[14] T. Uno, T. Asai, Y. Uchida, and H. Arimura, “Lcm: An efficient algorithm for enumerating frequent closed item sets,” in FIMI, vol. 90. Citeseer, 2003.
[15] B. Negrevergne, A. Termier, M.-C. Rousset, and J.-F. Méhaut, “Paraminer: a generic pattern mining algorithm for multi-core architectures,” Data Mining and Knowledge Discovery, vol. 28, no. 3, pp. 593–633, 2014.
[16] R. U. Kiran and P. K. Reddy, “An improved frequent pattern-growth approach to discover rare association rules,” in KDIR, 2009, pp. 43–52.
[17] H. Qiu, R. Gu, C. Yuan, and Y. Huang, “Yafim: a parallel frequent itemset mining algorithm with spark,” in Parallel & Distributed Processing Symposium Workshops (IPDPSW), 2014 IEEE International. IEEE, 2014, pp. 1664–1671.
[18] N. Li, L. Zeng, Q. He, and Z. Shi, “Parallel implementation of apriori algorithm based on mapreduce,” in Software Engineering, Artificial Intelligence, Networking and Parallel & Distributed Computing (SNPD), 2012 13th ACIS International Conference on. IEEE, 2012, pp. 236–241.
[19] M.-Y. Lin, P.-Y. Lee, and S.-C. Hsueh, “Apriori-based frequent itemset mining algorithms on mapreduce,” in Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication. ACM, 2012, p. 76.
[20] X. Y. Yang, Z. Liu, and Y. Fu, “Mapreduce as a programming model for association rules algorithm on hadoop,” in Information Sciences and Interaction Sciences (ICIS), 2010 3rd International Conference on. IEEE, 2010, pp. 99–102.
[21] H. Li, Y. Wang, D. Zhang, M. Zhang, and E. Y. Chang, “Pfp: parallel fp-growth for query recommendation,” in Proceedings of the 2008 ACM Conference on Recommender Systems. ACM, 2008, pp. 107–114.
[22] S. Moens, E. Aksehirli, and B. Goethals, “Frequent itemset mining for big data,” in Big Data, 2013 IEEE International Conference on. IEEE, 2013, pp. 111–118.
[23] M. R. Karim, M. Azam Hossain, M. Rashid, B.-S. Jeong, and H.-J. Choi, “An efficient market basket analysis technique with improved mapreduce framework on hadoop: An e-commerce perspective.”
[24] M. J. Zaki and C.-J. Hsiao, “Efficient algorithms for mining closed itemsets and their lattice structure,” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 4, pp. 462–478, 2005.
[25] K. Gouda and M. J. Zaki, “Efficiently mining maximal frequent itemsets,” in Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on. IEEE, 2001, pp. 163–170.
[26] G. Grahne and J. Zhu, “Efficiently using prefix-trees in mining frequent itemsets,” in FIMI, vol. 90, 2003.
[27] D. Burdick, M. Calimlim, and J. Gehrke, “Mafia: A maximal frequent itemset algorithm for transactional databases,” in Data Engineering, 2001. Proceedings. 17th International Conference on. IEEE, 2001, pp. 443–452.
[28] M. R. Karim, J.-H. Jo, B.-S. Jeong, and H.-J. Choi, “Mining e-shopper’s purchase rules by using maximal frequent patterns: An e-commerce perspective,” in Information Science and Applications (ICISA), 2012 International Conference on. IEEE, 2012, pp. 1–6.
[29] T. Radu-Ionel, “Bordures : de la sélection de vues dans un cube de données au calcul parallèle de fréquents maximaux,” Theses, Université Sciences et Technologies - Bordeaux I, Sep. 2010. [Online]. Available: https://tel.archives-ouvertes.fr/tel-01086108
[30] D. N. A. Asuncion, “UCI machine learning repository,” 2007. [Online]. Available: http://www.ics.uci.edu/~mlearn/MLRepository.html