1. Introduction
This study addresses association rule mining (ARM), a widely known data mining technique. Data mining extracts useful patterns that are hidden in a huge database; the typical forms of the extracted knowledge are frequent patterns (frequent itemsets) and rule sets. Nevertheless, acquiring hidden knowledge from a massive database is a complicated process [1]. Among the proposed data mining strategies and techniques are ARM [2,3,4,5,6,7,8], classification rules [9,10], clustering rules [8], and sequential patterns [11,12]. ARM, in particular, uncovers correlations between attributes (items) in transaction databases [2,5,6,8,13]. Applications of ARM include market basket analysis and many other problems that demand the discovery of hidden knowledge, such as protein sequences [14], medical diagnosis [14], energy [3,15,16], smart homes [11], and other industrial fields [17]. The Apriori algorithm is widely acknowledged as the reference implementation of the ARM technique [11]. Apriori generates frequent itemsets by testing candidate itemsets: when an itemset among the candidates has support exceeding the minimum support threshold, it is considered a frequent itemset. After all frequent itemsets have been generated, the association rules are derived from them. Each rule has a confidence value, and the user defines a minimum confidence threshold; any rule whose confidence exceeds this threshold is considered a strong rule. Data mining is a highly attractive field and has become even more popular amongst researchers since the introduction of ARM in the work of Agrawal (see [2,13]). Data mining processes have been continually improved and developed by a substantial number of studies, and significant connections have been established between ARM and other fields [6,8], namely machine learning [18], association rule discovery, clustering rules [19,20], classification rules [6,19,21], associative techniques [19], sequential patterns [21], and decision trees [22]. Among these, searching for frequent itemsets in a massive database has become a central task in data mining, since most data mining techniques revolve around discovering frequent patterns, which are the main output of ARM. These techniques must produce the frequent itemsets and rules in a timely manner. Data mining has also been deployed in business domains to increase their value, for example with the Internet of Things and sensors [3]; one popular area in the ARM domain is data streams and the control of sensors [3]. The relationship between data mining and other artificial intelligence technologies involving machines and network sensors continues to grow, along with the opportunities it creates [23]. Data representation and coverage may be insufficient when the mined knowledge is limited; hence, the discovered knowledge typically comprises a set of rules. One property of the discovered knowledge is that it remains valid as long as the database does not change. The question that arises is: 'How does one extend the validity of the knowledge with less processing time?' Precisely, the knowledge should represent the present state of the processed data even under limited changes to the data, including added, updated, or deleted transactions. The two commonly used approaches to maintain the legitimacy of the knowledge are:
Upon database manipulation (adding, deleting, or updating transactions), the changes should be reflected in the elicited knowledge. While Apriori is the most common approach for ARM, other prominent algorithms do exist, such as Eclat [24], DHP [7], AprioriTID [25], McEclat [26], and MsApriori [5], which enhances the collection of rare frequent items. Generally, the ARM pipeline consists of the following two processes [13]:
Discovering the large itemsets, also known as frequent itemsets, which occur in the database more often than a pre-defined support threshold.
Generating strong association rules from the frequent itemsets. A rule is considered strong, and is added to the output list, if it satisfies the pre-defined confidence threshold.
Despite the existence of numerous techniques, ARM is the most widely used method for data mining. Market basket analysis is the most popular case study for association rules. The items of a purchase are saved in a single record (transaction), along with the customer ID, transaction time, and other pertinent details. ARM then discovers rules of the following form: if a customer buys products x and y, then he is highly likely to also purchase product w, where x, y, and w are initially unknown. Customized promotional marketing programs can be designed based on such discovered rules. Apart from devising effective marketing plans, data mining methods can be applied to monitor and minimize consumption, for example of energy and carbon dioxide (see [15,27]). The Apriori algorithm can effectively extract frequent itemsets from a massive dataset. Based on the itemsets discovered at step n, the algorithm generates a candidate itemset at step n + 1 [2]. In each step, the algorithm checks the support count for the candidate itemset, which requires a full database scan. Generating the itemsets in this level-by-level manner leads to the level-wise problem. Many algorithms have been introduced to address the various shortcomings of the Apriori algorithm [10,24]. These algorithms aim to minimize mining costs while generating consistent and stable outcomes. To meet this goal, a vast amount of data must be gathered and analyzed. One of the main mining issues is database size, as new transactions are dynamically added, updated, and removed frequently. The already discovered rule sets then need to be updated, or they no longer reflect the database. Accordingly, this study examines maintaining ARM results incrementally, with a focus on vertical mining. The proposed algorithm overcomes several existing drawbacks by reducing mining time and incorporating all manipulation operations. Minimizing the number of database scans is one of its objectives. The algorithm first builds a complete itemset list and later updates this list without rebuilding it, thereby avoiding the level-wise issue. The paper is organized as follows. In the following section, the related terms are listed and defined.
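To make the level-wise cost concrete, the following minimal Python sketch (an illustration with invented toy data, not the paper's implementation) shows classic Apriori: every level triggers a fresh pass over the whole database, which is precisely the cost the incremental approach aims to avoid.

```python
from itertools import combinations

def apriori(transactions, minsupp):
    """Classic level-wise Apriori: one full database scan per level."""
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    k_sets = [frozenset([i]) for i in items]
    while k_sets:
        # support counting: a full scan of the database at every level
        level = {c: sum(c <= t for t in transactions) for c in k_sets}
        level = {c: n for c, n in level.items() if n >= minsupp}
        frequent.update(level)
        # join step: merge frequent k-itemsets into (k+1)-candidates
        k_sets = sorted({a | b for a, b in combinations(level, 2)
                         if len(a | b) == len(a) + 1}, key=sorted)
    return frequent

db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
result = apriori(db, minsupp=2)   # {a}, {b}, {c}, {ab}, {ac} are frequent
```

Each iteration of the `while` loop corresponds to one level n of the generation process and one pass over `transactions`.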
Section 3 reviews related studies. In Section 4, the incremental Apriori (IApriori) algorithm is elaborated, and Section 5 describes the datasets. Section 6 presents the empirical outcomes, and the comparison of the findings is given in Section 7.
2. Association Rule Terms
This section introduces and defines the key terms used in the ARM domain and in the approach proposed in this study. The market basket scenario is used to clarify and illustrate the terms. All the terms employed in this study are defined as follows:
Definition 1. The set of items, I = {i1, i2, ..., im}, where each i refers to an attribute in the database; in market basket analysis, it denotes a product. The set of items serves as the pool of possible attribute values for the database transactions; in the ARM case study, it is the set of products found in the supermarket. Based on Definition 1, the term 'transaction' is defined as follows:
Definition 2. The transaction, T, is the main element of the database, with a pair structure T = (tid, X), where tid is the transaction identifier and X is a subset of I. The purchase of a set of products by a customer is stored as one record (transaction) in the market database; such a record comprises two essential parts: the tid of the record, and the items bought (X).
From Definition 2, the ‘transaction database’ is defined as follows:
Definition 3. The transaction database (DB) is a collection of transactions over I. DB is the source database that is scanned the first time new knowledge is generated. From Definition 3, a special definition is derived for the modified database.
Definition 4. The modified transaction database is a collection of transactions over I that have been manipulated after mining DB; that is, transactions that have been added, updated, or deleted since the last mining. Based on Definitions 1–4, the term 'itemset' is given in the following:
Definition 5. Itemset: at the core of ARM, an itemset is characterized as x ⊆ I, where x is a set of items that is expected to occur frequently in the database as a pattern. The itemset has several features, with 'support' being the most important one, defined as follows:
Definition 6 (itemset support). The support of an itemset x is the count of the transactions T in D that contain x: support(x) = |{T ∈ D : x ⊆ T}|. Further clarification of the term 'itemset support' is given in the following example:
Example 1. If x is an itemset in a database D that has 10 rows, and x can be found in four transactions, then the support of x is 4 (equivalently, a relative support of 4/10 = 0.4). Based on Definitions 5 and 6, the 'support threshold' is described as follows:
Definition 7. The support threshold (α) is the number, or the ratio, of itemset occurrences required for an itemset to be considered frequent: an itemset whose occurrence count exceeds α is a frequent itemset. The value of α lies between 0 and |D|, where |D| reflects the database size and the occurrence count is the number of transactions T that contain the itemset x.
Definition 8. An association rule has the form X ⇒ Y, where X, Y ⊂ I and X ∩ Y = ∅. X is the body, or antecedent, whereas Y is the head, or consequence, of the rule.
Definition 9. The confidence of an association rule X ⇒ Y is the conditional probability of Y being contained in a transaction, given that X is contained in that transaction. The concept of association rule confidence is drawn out in the following example.
Example 2. Let D be a database of transactions. If X and Y are itemsets, then the rule X ⇒ Y in D has the following confidence: confidence(X ⇒ Y) = support(X ∪ Y) / support(X). The above definitions underpin the ARM technique and apply to most ARM algorithms. The following terms and definitions are used to describe the approach proposed in this study.
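As an illustration only (the toy transactions below are invented, not from the paper), Definitions 6 and 9 translate directly into code:

```python
def support(itemset, db):
    """Definition 6: the number of transactions in db containing itemset."""
    return sum(itemset <= t for t in db)

def confidence(x, y, db):
    """Definition 9: support(X ∪ Y) / support(X)."""
    return support(x | y, db) / support(x, db)

# hypothetical market-basket transactions
db = [{"milk", "bread"}, {"milk", "bread", "eggs"},
      {"bread"}, {"milk", "eggs"}]
c = confidence({"milk"}, {"bread"}, db)  # milk in 3 rows, both items in 2
```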
Definition 10. The log file (Log F) records the changes that occur in a database. Generally, the database management system maintains a log file that registers all actions taken on the database. In the proposed method, three attributes are needed to keep track of the database: transaction ID, action ID, and action date. The following definition introduces the date used to collect the target database for the next mining.
Definition 11. The knowledge date (KD) stores the date of the last mining. For the next mining, KD is used to collect the IDs of all transactions manipulated after the last mining. Based on the above two definitions, the following terms are derived.
Definition 12. LTA is the list of transaction IDs that were added to the database (db+) after the KD date.
Definition 13. LTU is the list of transaction IDs that were modified in the DB after the KD date.
Definition 14. LTD is the list of transaction IDs that were deleted from the DB after the KD date.
Definition 15. Ln is the list of all itemsets in the proposed method, in which the structure of each itemset is extended and defined as the tuple (itemset, support, TID), where itemset is a subset of I, support is the support of the itemset, TID is the list of transaction IDs that contain the itemset, and n represents the length of the list.
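The structure of Definition 15 can be sketched as a small record type; the class and field names below are our own, chosen for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class ItemsetEntry:
    """One element of the Ln list (Definition 15): the itemset itself,
    the IDs of the transactions containing it, and (derived) its support."""
    itemset: frozenset
    tid: set = field(default_factory=set)

    @property
    def support(self) -> int:
        # the support is always recoverable as the TID-list length
        return len(self.tid)

entry = ItemsetEntry(frozenset({"a", "b"}), {1, 3})
```

Keeping the TID list alongside the itemset is what later allows supports to be updated by set operations instead of database scans.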
Definition 16. The candidate set (Cn) is the intermediate list produced by combining the items in Ln.
Definition 17. The frequent itemset list is the subset of Ln containing all itemsets that satisfy the specified support threshold. Section 3 presents the literature review, along with criticism of the existing approaches.
4. The Proposed Algorithm
The proposed algorithm, incremental Apriori (IApriori), is described in this section. Incremental learning approaches were explained in the previous sections, and the defects of the existing incremental algorithms were identified in the literature review: two manipulation operations, update and delete, have been omitted from incremental learning. These defects motivated this study to propose a method that covers all manipulation operations when knowledge is learnt incrementally. The Apriori algorithm extracts a large number of itemsets from the database transactions, with each element in the itemset list carrying its real support value. However, the number of generated itemsets grows exponentially with the number of attributes in the database, so it is impractical to count every subset found in the database transactions. Candidate itemset generation is an iterative, combinatorial process that causes an explosion in the itemset list size. Reducing the number of database scans increases algorithm performance, while increasing the scanning frequency decreases efficiency. One solution is to generate measured itemsets so that support can be counted early. IApriori determines association rules in three main steps. First, the algorithm gathers all itemsets of length one that have support greater than zero. Next, these items are combined to produce all possible itemsets based on the user requirements. Lastly, the association rules are generated from the resulting frequent itemsets. The itemset structure created by IApriori is given in Definition 15: each element in the list has three attributes, namely the itemset, its support, and its transaction ID list.
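The three steps can be sketched as follows; this is a simplified reading of the pipeline under our own naming, not the paper's pseudocode:

```python
def build_l1(db):
    """Step 1: all single items with support > 0, stored with TID lists."""
    l1 = {}
    for tid, items in db.items():
        for i in items:
            l1.setdefault(frozenset([i]), set()).add(tid)
    return l1

def extend(l1):
    """Step 2: grow itemsets level by level; support comes from
    intersecting TID lists, never from re-scanning the database."""
    out, frontier = dict(l1), l1
    while frontier:
        nxt = {}
        for a, ta in frontier.items():
            for b, tb in l1.items():
                c, tc = a | b, ta & tb
                if len(c) == len(a) + 1 and tc and c not in out:
                    nxt[c] = tc
        out.update(nxt)
        frontier = nxt
    return out

def frequent(ln, minsupp):
    """Step 3 precursor: filter by any threshold without rebuilding Ln."""
    return {k: v for k, v in ln.items() if len(v) >= minsupp}

db = {1: {"a", "b"}, 2: {"a", "b", "c"}, 3: {"a", "c"}, 4: {"b"}}
ln = extend(build_l1(db))
```

The database is read once, in `build_l1`; everything afterwards is set intersection on the stored TID lists.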
The main body of the IApriori algorithm is presented in the Algorithm 1 pseudocode. The first line reads the last obtained result, while the second line categorizes the transaction IDs into three categories using the Proper Transaction List method; these are the IDs of transactions manipulated after the last mining. The method takes two input parameters and produces three output parameters. The first input is KD, the date of the last mining; the second is the log file, which contains every manipulation event on the transactions. Each record in the log file has three essential attributes: transaction ID, action ID, and time. A sample of the log file is given in Table 2, where the transaction ID refers to the manipulated transaction and the action ID is a code reflecting the manipulation operation. The output of the Proper Transaction List method is the three lists (LTA, LTU, and LTD) described in Definitions 12, 13, and 14, respectively. Before mining the new data, the process must first discard the effects of deleted and updated transactions. The discard transaction method isolates the effect of removed transactions, as presented in Algorithm 2. The preceding instructions pinpoint the specific part of the database targeted for mining. The generation of the first itemset from the new changes in the database is illustrated in Algorithm 3. First, IApriori fetches the old discovered knowledge (if it exists) and places it in a list. The algorithm then collects the target database transactions and formats them as a list. Next, one of two scenarios occurs, depending on whether an old result exists. If there is no old result, the entire itemset list must be initialized and built from scratch: the generate candidates and get itemset methods are applied, as given in Algorithms 4 and 5, respectively. Otherwise, the algorithm only has to update the values of the itemset attributes in the list by executing the update Ln itemset method, as detailed in Algorithm 7. At this point, the algorithm holds the full itemset list. Based on the ARM concept, a threshold is then required; hence, the next step applies minsupp in the get frequent itemset method, as portrayed in Algorithm 8, to isolate all the itemsets that satisfy the threshold. Finally, as a result of the previous step, the algorithm can generate the final output of the entire process, which is the association rule set.
Algorithm 1: IApriori main body.
Algorithm 2: IApriori discard transaction function.
Algorithm 3: IApriori get first itemset function.
Algorithm 4: IApriori generate candidates function.
Algorithm 5: IApriori get itemset function.
One fundamental method of the proposed algorithm is the Proper Transaction List, displayed in Algorithm 6. This method gathers the IDs of the transactions that have been manipulated. Its inputs are the log file of the database management system (Log F) and the KD, which marks the latest date the knowledge was updated. The method generates three lists of transaction IDs for the transactions manipulated after the KD date: the first gathers the IDs of transactions added to the mined database (LTA), the second the IDs of updated transactions (LTU), and the third the IDs of deleted transactions (LTD). The method iterates over all Log F records and assigns each transaction ID to a list according to the manipulation operation type stored in the log file record.
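The classification described above can be sketched as follows; the record layout and the action codes 'A'/'U'/'D' are illustrative assumptions, not taken from the paper:

```python
from datetime import date

def proper_transaction_list(log, kd):
    """Split transactions manipulated after KD into added / updated /
    deleted ID lists. Each log record is a (tid, action, when) tuple."""
    lta, ltu, ltd = [], [], []
    buckets = {"A": lta, "U": ltu, "D": ltd}
    for tid, action, when in log:
        if when > kd:                 # only changes after the last mining
            buckets[action].append(tid)
    return lta, ltu, ltd

log = [(1, "A", date(2023, 1, 5)),
       (2, "U", date(2023, 2, 1)),
       (3, "D", date(2023, 2, 2)),
       (4, "A", date(2023, 2, 3))]
lta, ltu, ltd = proper_transaction_list(log, kd=date(2023, 1, 31))
```

Record 1 predates KD and is therefore ignored; the remaining records are routed by their action code.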
Algorithm 6: IApriori proper transaction list function.
Table 2 presents a sample of the log file records. The latest changes to the database can be identified using this log file: once the changed transactions are detected, mining can target only the affected transactions. Typically, a log file is built into the database management system to monitor changes, registering every operation that takes place in the database; if no such log exists, one is created.
Algorithm 2 shows a vital component of IApriori, the discard transaction method. The importance of this function stems from eliminating the effect of manipulated transactions on the previously obtained result. The effects of both deleted transactions (LTD) and updated transactions (LTU) are excluded from the result; the updated transactions are then trained again together with the added transactions to weigh in their new effect. The handling of the excluded effect is given in the Algorithm 2 pseudocode: the method iterates over the itemsets in Ln and checks whether each itemset's transaction list contains any of the IDs in LTD or LTU.
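A minimal sketch of this idea, assuming Ln is stored as a mapping from itemset to TID set (our representation, not the paper's exact structure):

```python
def discard_transactions(ln, ltd, ltu):
    """Remove deleted (LTD) and updated (LTU) transaction IDs from every
    itemset's TID list; support shrinks implicitly, because it equals
    the TID-list length."""
    stale = set(ltd) | set(ltu)
    return {itemset: tids - stale for itemset, tids in ln.items()}

ln = {frozenset({"a"}): {1, 2, 3},
      frozenset({"a", "b"}): {1, 3}}
clean = discard_transactions(ln, ltd=[3], ltu=[2])
```

After the call, transaction 3 (deleted) and transaction 2 (to be re-trained in its updated form) no longer contribute to any support count.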
Algorithm 3 presents the instructions of the get first itemset method. As the primary step in IApriori, the algorithm converts the target transactions to a vertical layout, recording for each item the database transactions in which it occurs. Each resulting entry has three attributes, namely the item itself, the transaction ID list, and the support of the itemset. Algorithm 4 shows the generate candidates method of the proposed approach. The main difference between this method and its equivalent in the enhanced Apriori algorithms lies in the pruning strategy: IApriori keeps every itemset with any real support until the end of the generation process, for two reasons. (1) The algorithm performs incremental learning, so an itemset can gain additional support in the next training round, cumulated with the old value. (2) The algorithm gives users the chance to filter the intermediate itemsets with any threshold value without rebuilding the knowledge from scratch. The cost of this choice is that the whole itemset list must be kept and filtered on a specific attribute value to check whether an attribute is linked with other attributes, regardless of the actual support value. Another vital aspect of the proposed approach is that the algorithm avoids the infrequent-itemset issue (see Cases 2 and 3 in Table 1).
IApriori considers all real itemsets, i.e., those with a support value greater than zero. The function in Algorithm 5 isolates the itemsets that never occur in the database; any itemset that occurs at least once is passed to the next level of candidate itemset generation.
IApriori has its own characteristic methods, one of which is stated in the Algorithm 7 pseudocode: the update itemset method, which updates the whole intermediate itemset list at once. This method builds on Algorithm 3, using incremental training to gather the latest updates from the database. After updating the first itemset list, the method intersects the elements of the first itemset list. For instance, if x is an itemset consisting of the three items a, b, and c, then to collect both the support and the TID list of x, the algorithm intersects the TID lists of the items a, b, and c. In this way, the itemset list is maintained while the supports are collected in parallel.
Algorithm 7: IApriori update itemset function.
List IDNew
ForEach (m in Ln) {
    m.TID.merge(IDNew);
    m.support = m.TID.count();
}
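The intent of the update step can be sketched as follows: each itemset's TID list is refreshed by re-intersecting the (already updated) first-level TID lists of its member items. This is our reading of Algorithm 7, not the paper's exact pseudocode:

```python
def update_ln(ln, l1):
    """Refresh every itemset's TID list from the updated first-level
    lists (L1), then let support follow as the TID-list length.
    ln and l1 map itemset -> set of transaction IDs."""
    updated = {}
    for itemset in ln:
        tids = set.intersection(*(set(l1[frozenset([i])]) for i in itemset))
        updated[itemset] = tids
    return updated

l1 = {frozenset({"a"}): {1, 2, 5}, frozenset({"b"}): {1, 2, 4},
      frozenset({"c"}): {2, 5}}
ln = {frozenset({"a", "b"}): {1, 2}, frozenset({"a", "c"}): {2}}
refreshed = update_ln(ln, l1)   # {a, c} now also covers transaction 5
```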
Algorithm 8 presents the get frequent itemset function: any number of frequent itemset lists can be derived by this method, based on the input threshold, without rebuilding the itemset list. Algorithm 9 shows the strong rule generation method, which is identical to the original method, with no changes. Example 3 illustrates the proposed algorithm in two phases: the first phase demonstrates creating and building the intermediate itemset list from scratch, while the second phase shows the update process during incremental training.
Example 3. As presented in Table 3, a database contains four transactions, and no knowledge has yet been obtained from this database. The task is to discover the relationships between the items and find all the itemsets, given pre-defined minsupp and minconf mining parameters. Table 4 shows the log file of the database and the actions that occurred on the database transactions.
Algorithm 8: IApriori get frequent itemset function.
ForEach (Item in Ln)
    IF (Item.support ≥ minsupp) FrequentList.add(Item);
Algorithm 9: IApriori get strong rules function.
ForEach (Item in Ln) {
    LHS = item.subList(0, item.length - 1);
    RHS = item.subList(item.length - 1, item.length);
    Rule = LHS ⇒ RHS;
    Rule.confidence = item.support / LHS.support;
    IF (Rule.confidence ≥ MinConf) RuleList.add(Rule);
}
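The rule generation step can be sketched in Python as follows. The sketch generalizes slightly by trying every single-item consequent rather than only the last item, and computes confidence per Definition 9 as support(itemset) / support(LHS); the function and variable names are our own:

```python
def get_strong_rules(ln, minconf):
    """Generate rules LHS => RHS from each itemset of length >= 2.
    ln maps itemset -> TID set; supports are TID-list lengths."""
    rules = []
    for itemset, tids in ln.items():
        if len(itemset) < 2:
            continue
        for rhs_item in itemset:           # single-item consequents
            lhs = itemset - {rhs_item}
            conf = len(tids) / len(ln[lhs])
            if conf >= minconf:
                rules.append((lhs, frozenset({rhs_item}), conf))
    return rules

ln = {frozenset({"a"}): {1, 2, 3}, frozenset({"b"}): {1, 2},
      frozenset({"a", "b"}): {1, 2}}
rules = get_strong_rules(ln, minconf=0.9)  # only b => a qualifies (2/2)
```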
The main goal of IApriori is to update existing knowledge, so the algorithm must be able to read the current knowledge. Since this is the very first time the algorithm is applied to this database, the knowledge is built from scratch. The absence of an old result means the step that discards manipulated transactions is skipped. The following task prepares the transaction ID lists (LTA, LTU, and LTD); here LTA contains the IDs of all four transactions, while LTU and LTD are empty. In either case, the next step is to generate the first itemset list by applying the get first items method. Table 5 shows the output of this step.
After the first intermediate itemset list has been built, the complete intermediate itemset list is generated; in this case, because the previously obtained knowledge is empty, this means creating it from scratch. At the next level, the itemsets have size 2, giving the candidate set {ab, ac, bc}. The support of each candidate is calculated by intersecting the TID lists of the first itemset list, as follows:
tid(a) ∩ tid(b) contains two transactions, so the support of {ab} is 2.
tid(a) ∩ tid(c) contains two transactions, so the support of {ac} is 2.
tid(b) ∩ tid(c) contains two transactions, so the support of {bc} is 2.
Table 6 shows the size-2 itemsets, along with their support and TID lists. After completing this level, IApriori moves on to create the size-3 itemsets, whose actual support is again collected by crossing the TID lists; in this example the level is limited to {abc}. Table 7 shows the final and complete itemset list. At this point, the end user can apply multiple parameter values to derive various knowledge sets, and this derivation requires minimal time. For instance, based on the support threshold, the frequent itemset list is derived and presented in Table 8. The association rules based on the frequent itemsets of Table 8 are derived and listed in Table 9.
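The intersections of the worked example can be reproduced on a hypothetical four-transaction database (Table 3 itself is not reproduced in this text, so the contents below are assumed, chosen so that every item pair co-occurs in exactly two transactions):

```python
from itertools import combinations

# hypothetical four-transaction database consistent with the example
db = {1: {"a", "b", "c"}, 2: {"a", "b", "c"}, 3: {"a"}, 4: {"b"}}

# vertical layout: item -> set of transaction IDs
tid = {}
for t, items in db.items():
    for i in items:
        tid.setdefault(i, set()).add(t)

# pairwise TID-list intersections give the size-2 supports directly
pair_tids = {frozenset(p): tid[p[0]] & tid[p[1]]
             for p in combinations("abc", 2)}
```

Here tid(a) ∩ tid(b) = {1, 2}, so support({ab}) = 2, and likewise for {ac} and {bc}; intersecting all three lists gives the size-3 support.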
Finally, the mining process for this example is complete, and the knowledge has been built for the first time. To fully demonstrate the IApriori approach, the next example describes the incremental process, building on the previous example. Consider that the database in Table 10 is a complement to the database presented in Table 3. With the same minsupp parameter, the last result exists as the complete itemset list (see Table 7). The tracked changes to the database are displayed in Table 11. The algorithm reads the previous result, tabulated in Table 7, and then collects the transaction ID lists LTA, LTU, and LTD from the tracked changes. The ensuing step discards the effects of LTD and LTU from the latest result.
Table 12 presents the complete itemset list after removing the deleted and updated transaction effects. IApriori then targets both LTA and LTU to update the first itemset list from the target transactions of the database, so as to generate an updated full itemset list. The updated first itemset list is presented in Table 13. Next, IApriori updates the complete itemset list; Table 14 tabulates the final, fully updated itemset list. The previous step can derive numerous knowledge sets based on varying thresholds. The requirement in this example is a minsupp of three transactions. Based on that threshold value, the frequent itemsets are listed in Table 15, while the final association rules generated for the incremental example are presented in Table 16. The next section briefly describes the datasets.
6. Results and Discussion
This section presents the empirical outputs of the IApriori algorithm, along with a comparison against the baseline algorithm and AprioriTID. The experiments were performed to test and compare the scalability of the algorithms. The first experiment evaluated changing the support threshold on the chess dataset. Table 17 and Figure 1 present the retrieved results. The knowledge output was identical for both algorithms, but IApriori achieved exceptional execution times, substantially decreasing the processing time; the Apriori algorithm took between two and four times longer to perform the same tasks. The results clearly exhibit the speed advantage of the IApriori algorithm over the other algorithms. The time values in the table are in milliseconds.
The next experiment involved the mushroom dataset under different values of the support threshold, comparing the outcomes of the two algorithms. The IApriori algorithm again improved the execution time considerably. Table 18 and Figure 2, which present the empirical findings for the mushroom dataset, show that the improvement rates were rather high.
The next dataset was used to test the IApriori algorithm against AprioriTID and Apriori. Figure 3 and Table 19 report the experimental results for this dataset. The results of both the IApriori and AprioriTID algorithms appeared reasonable, while Apriori took considerably longer. It is emphasized here that the experimental setting was kept identical for all algorithms.
The following results come from experiments using the T10I4D100K dataset; Figure 4 clearly presents the recorded findings. Figure 4 portrays the comparison of results obtained from the IApriori algorithm with those of the other tested algorithms on the breast cancer dataset. IApriori achieved the best execution time throughout. For the first experiment (threshold of zero), IApriori demanded additional time, but still less than the other Apriori algorithms; the subsequent experiments required no more than 100 ms. For instance, when the support threshold was 4, IApriori recorded 95 ms, versus 350 and 550 ms for AprioriTID and Apriori, respectively, an outstanding improvement.
Figure 5 shows the experimental results obtained from the algorithms on the Acute Inflammations dataset. Again, IApriori displayed excellent execution time: the time required by IApriori to mine the dataset (support threshold = 2) was 98 ms, versus 395 and 605 ms for AprioriTID and Apriori, respectively.
The next experiment captures the ability of the IApriori algorithm to process different dataset sizes, as well as the effect of dataset size on execution time, again under a changing support threshold. Figure 6 presents the outcomes of mining experiments with a set of database sizes (1K, 5K, 20K, and 75K) (Management, 2009). The execution time was averaged over 10 experiments for each threshold value. Additional time was required only for the first experiment, which took 710 ms for the dataset with 75K transactions.
The above figures portray the outstanding ability of IApriori to substantially minimize execution time. This was tested on various types of datasets across multiple domains, including the standard ARM case study of market basket analysis, as well as medical and game datasets. IApriori performed exceptionally well for different dataset types and sizes, and for different values of the support threshold parameter. Overall, IApriori achieved remarkable outcomes.
6.1. Comparison
This section presents a qualitative comparison between IApriori and other algorithms, divided into two parts: the first part compares IApriori with the baseline algorithm, while the second compares it with algorithms in the incremental domain.
6.2. Apriori
The IApriori algorithm enhances the Apriori algorithm, to which it owes its overall structure. The differences are as follows:
IApriori retrieves the data from the database only once, when it collects the first itemset list. Subsequently, the algorithm obtains the support count for each candidate itemset by intersecting the first itemset TID lists, whereas the original algorithm has to scan the database for every single itemset.
The Apriori algorithm applies a pruning strategy to exclude itemsets that do not satisfy the threshold; the proposed method retains all itemsets.
The Apriori algorithm builds the whole knowledge base for a specific threshold, and the output knowledge is invalid for other thresholds: upon a change in the threshold, mining has to restart from the beginning to build valid knowledge. In contrast, the proposed algorithm can derive many knowledge instances when the threshold changes.
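The last point can be sketched concisely: once the complete itemset list with TID lists is stored, a new threshold is only a filter over it, with no database scan or candidate regeneration. The representation and names below are illustrative assumptions:

```python
def knowledge_for(ln, minsupp):
    """Derive an (itemset -> support) view of the stored knowledge for
    any threshold; no re-mining is involved."""
    return {itemset: len(tids) for itemset, tids in ln.items()
            if len(tids) >= minsupp}

# stored complete itemset list with TID lists (hypothetical values)
ln = {frozenset({"a"}): {1, 2, 3, 4}, frozenset({"b"}): {1, 2},
      frozenset({"a", "b"}): {1, 2}}
strict = knowledge_for(ln, minsupp=3)   # only {a} survives
relaxed = knowledge_for(ln, minsupp=2)  # all three itemsets survive
```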
6.3. Incremental Algorithms
Many algorithms have been proposed in the field of incremental learning for ARM, and they address real problems. The method proposed in this study offers an improved set of features; the differences are listed below:
IApriori weighs in all manipulated data in the database, including added, updated, and deleted data. In contrast, the remaining algorithms dealing with the incremental issue consider only the add operation, neglecting both the update and delete operations.
As for the other algorithms, there were cases in which the algorithm had to rebuild the itemset from scratch. This, nonetheless, was not required for the proposed algorithm.
IApriori maintains credible support and confidence values for the itemsets and rule set, respectively: it updates the support values so that the itemsets reflect the database transactions, and it can filter the itemset list under any changed threshold setting.
Table 20 summarizes the features of the reviewed incremental ARM algorithms.