
Hybrid Data Reduction Technique for Classification of Transaction Data

Ifiok J. Udo and Babajide S. Afolabi

• I.J. Udo is with the Information Storage and Retrieval Group, Department of Computer Science and Engineering, Obafemi Awolowo University, Ile-Ife, Nigeria.
• B.S. Afolabi is with the Department of Computer Science and Engineering, Obafemi Awolowo University, Ile-Ife, Nigeria.

Abstract— Data classification during the mining of transaction data requires a robust and efficient data reduction technique that guards against the loss of essential information. In this paper, we address the concepts of data reduction in transaction processing systems, present the tradeoffs of existing data reduction techniques, and propose a hybrid data reduction technique suitable for addressing classification problems in transaction data.

Index Terms— Database, hybrid data reduction, classification, transaction data.

1 INTRODUCTION

Data classification during the mining of transaction data requires a robust and efficient data reduction technique. Such a technique ensures accuracy in the resulting data and guarantees the retention of essential information for the mining process. The extraction of important features embedded in the data also depends on the data reduction technique adopted. Transaction data comprise a quantity of instances and a number of features, and one-to-many association constraints exist as a result of data normalization. Concurrency, which is of utmost importance in transaction processing systems, is optimized through data normalization, but normalization also renders some data redundant or of little importance to the system.

Furthermore, organizations that must classify huge databases not only incur costs for the required hardware and personnel; they may also fail in business competitiveness, which requires just-in-time, accurate information obtained by extracting significant features from the contents of the available data. In overloaded databases, the general view of the data is hidden from decision makers, making accurate information difficult to obtain and informed decisions harder to reach.

Data reduction, an essential element of data preparation [1], seeks to reduce large databases to manageable sizes and helps decision makers to know the true dimensionality of their business databases. Transaction processing systems, which hold transaction data, support daily business operations [2], and they may contain feature vectors and associations that are of little or no relevance to data classification and related tasks. These feature vectors and associations multiply as the level of data normalization increases to check data redundancy. Normalization, which also brings multiple-instance problems to the fore, is nevertheless indispensable for achieving a high level of concurrency and reduced redundancy in transaction processing systems. Beyond reducing the cost of data management and aiding informed decision making, data reduction is advantageous in many other fields of Computer Science [3], such as data mining, information retrieval, web intelligence, machine learning and knowledge discovery in databases, to name but a few.
Data reduction also aims at diminishing the quantity of irrelevant data in databases so as to enhance the quality of the data for further analysis. Research approaches to data reduction have concentrated on task-specific techniques that are often purpose-built to suit a particular computing task, thereby neglecting a multi-purpose approach that is more scalable and robust. In this paper, we address the concepts of data reduction in transaction processing systems, review the tradeoffs of the existing reduction approaches, and present a proposed hybrid data reduction technique that is suitable for addressing classification problems in relational data mining and capable of performing the twin functions of reducing transaction data in both features and size concurrently.

The remainder of this paper is structured as follows: section 2 focuses on the concepts of data reduction, including its merits; section 3 addresses related work and its drawbacks; section 4 discusses transaction processing systems and the multi-relational setting as a store for transaction data; section 5 presents the proposed hybrid data reduction approach; and section 6 draws the conclusion.

2 DATA REDUCTION

The concept of data reduction came into existence in a bid to make databases reflect the true dimensionality of the business enterprises they represent, apart from rendering data easier to manage. Data reduction, otherwise known as data editing, filtering, thinning or condensing depending on the objective of the task, is considered an essential element of data pre-processing [4] aimed at preparing data for mining and/or further processing, so as to improve the quality of the data obtained as a consequence of the reduction operations. Data reduction also finds use in data classification and document retrieval, among other tasks. In transaction processing systems, where records often exhibit one-to-many association constraints due to a high level of data normalization, data reduction is important because a single record may be described by many associations with feature vectors, of which only a few are relevant for the observed classification and related tasks. An ideal reduction of data in a transaction processing system allows decision makers to know the real dimensionality of their enterprise databases, thus leading to informed decision making. The need for the concurrency and data consistency made possible by the reduction of irrelevant data in transaction processing systems cannot be over-emphasized.

2.1 Importance of Data Reduction

Although data reduction is not without disadvantages (it must, for instance, contend with the curse of dimensionality), many advantages accrue to scientists, decision makers and analysts from reduced data sizes. These benefits, which far outweigh the disadvantages, include:

1. Reflection of the true dimensionality of business databases.
2. Optimal usage of minimal or limited memory.
3. Efficient and fast retrieval of data.
4. Increased efficiency and performance in subsequent data analysis and data mining.
5. Visualization of high-dimensional data for exploratory data analysis.
6. Lower bandwidth consumption during data transmission, as a result of data compression.
7. Cost reduction in terms of the required number of personnel and storage.

3 RELATED WORK

Many data reduction approaches have been applied to both transaction systems and data warehouses. According to [5], there are two major approaches to data reduction: feature selection and data size reduction. The significant drawback of these approaches is that they tend to be task-specific (targeting data mining, pattern recognition, information retrieval, web intelligence or machine learning) rather than general purpose. A further drawback is that they do not combine the twin functions of reducing the data size (i.e. the number of samples) and its dimensionality (i.e. the number of features) concurrently. Hence this work seeks to combine the functionalities of feature selection and data size reduction in a single algorithm, so as to reduce transaction data in both size and features concurrently.

3.1 Feature Selection

This method, also referred to as dimensionality reduction or reducing the number of features, has been implemented with dimensionality reduction techniques. It allows a compact representation of data by mapping each point to a lower-dimensional continuous vector, and it may be supervised, semi-supervised or unsupervised. According to [1], feature extraction and feature subset selection are the two main paradigms of dimensionality reduction. Dimensionality reduction can also be achieved by aggregation [6], [7], graph embedding and extensions [8], and discernibility [9].

3.2 Data Size Reduction

Reducing the number of samples has been implemented with several methods, such as sampling procedures (e.g. simple random sampling and stratified or cluster sampling). These methods are based on statistical sampling, which views data as an expensive resource and assumes that it is practically impossible to collect population data. This view does not suit data reduction in databases, where the population data is assumed to be known. Other methods are adaptive sampling [1], adaptive sampling with a genetic algorithm [10] and discernibility [9]. Moreover, approaches such as wavelets [11] and clustering [12] have been used to reduce the quantity of instances. The adaptive sampling approach, which employs a chi-square criterion, is in our view simple and adaptive in nature: it segments data into categories to ease computation, but it becomes intractable with very large, high-dimensional data. The approaches of [1] and [10] address only one of the two reduction aspects. (A brief sketch of the two classical sampling procedures named above follows.)
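To make the contrast concrete, the following is a minimal sketch, not taken from any of the cited works, of simple random versus stratified sampling for data size reduction. The record layout, the `'segment'` key and the 10% ratio are illustrative assumptions.

```python
# Minimal sketch of simple random vs. stratified sampling for data size
# reduction. Assumes a table held as a list of (segment, value) records;
# the names and the 10% ratio are illustrative, not from the paper.
import random
from collections import defaultdict

def simple_random_sample(records, ratio=0.1):
    """Draw a uniform sample of round(ratio * N) records."""
    k = max(1, round(ratio * len(records)))
    return random.sample(records, k)

def stratified_sample(records, key, ratio=0.1):
    """Sample each stratum (e.g. customer segment) proportionally,
    so rare strata are not lost as they might be under uniform sampling."""
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)
    sample = []
    for group in strata.values():
        k = max(1, round(ratio * len(group)))
        sample.extend(random.sample(group, k))
    return sample

if __name__ == "__main__":
    data = [("retail", i) for i in range(900)] + \
           [("corporate", i) for i in range(100)]
    print(len(simple_random_sample(data)))                    # ~100 records
    print(len(stratified_sample(data, key=lambda r: r[0])))   # 90 + 10 records
```

Both procedures treat the population as unknown; as noted above, this assumption is what makes them a poor fit for databases, where the full population is available.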
4 TRANSACTION PROCESSING SYSTEMS

According to [2], transaction processing systems are databases that support daily business operations. They usually involve a large number of users who simultaneously perform transactions that change real-time data. Concurrency and atomicity are the major characteristics of transaction processing systems, and in a bid to enhance concurrency, data consistency and reduced redundancy, data normalization is often used in these systems. Moreover, the one-to-many association constraints that pose the multiple-instance problem [7] are brought to the fore by a high level of data normalization. A transaction processing environment therefore depicts a multi-relational setting, as described in fig. 1.

The representation in fig. 1 depicts part of a higher education electronic portal schema in Nigeria, whereby a student (a single object) may offer more than one course (multiple instances) and each course (a single object) may also be assigned to more than one lecturer (multiple instances) in the Student-Course-Lecturer schema. The notation "RN" in the Student table stands for the student registration number, "CC" in the Course table stands for the course code, and "LID" stands for the lecturer's identity number. The two levels of one-to-many association comprising the student, course and lecturer relationships are shown in fig. 1: a student has a one-to-many relationship with the Course relation, which in turn associates with the Lecturer relation through the Title (course title) and LID attributes. The objects in fig. 1 can be represented in a vector space model [13], making it possible to manipulate them algebraically.

Fig. 1. Transaction data schema (Student, Course, Lecturer) showing one-to-many association constraints.

4.1 Data Model in a Multi-Relational Setting

According to [14], the relational data model ($X$) comprises a number of features ($J$) and a quantity of instances ($N$); this is the basis from which relational databases evolved. The transaction data model is therefore presented mathematically in equation (1), which simply states that transaction data is a function of the quantity of instances (number of records) and the number of features (number of attributes) it is made up of:

$$X = (N \times J) \qquad (1)$$

Due to the increasing volumes of data in transaction processing systems, databases grow in both features and size, making data normalization an appropriate means of checking data redundancy and hence optimizing concurrency and data consistency. This often leads to multiple-instance problems. In the work of [15], records exhibiting one-to-many association characteristics in a multi-relational setting are encoded into target and non-target tables that can be represented in a vector space model, as presented in equation (2). This representation allows similar instances to be aggregated and permits data size reduction to be carried out effectively on objects. The relational object model in a multi-relational setting is given as:

$$O_i = \left( f_{i1}\log\frac{n}{n_1},\; f_{i2}\log\frac{n}{n_2},\; \ldots,\; f_{iJ}\log\frac{n}{n_J} \right) \qquad (2)$$

where $f_{ij}$ is the frequency of the $j$th feature representation in the $i$th object, $n_j$ is the number of objects containing the $j$th feature, $n$ is the total number of objects, $O_i \in DB$ (with $DB$ denoting the database), $i = 1 \ldots N$ and $j = 1 \ldots J$. (A brief illustration of this encoding follows.)
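As an illustration of equation (2), here is a minimal sketch, not from the original paper, of encoding multi-instance objects (students and the courses they offer, echoing fig. 1) as frequency times inverse-document-frequency style vectors. The toy data and the function name `encode_objects` are our own assumptions.

```python
# Minimal sketch of the object encoding in equation (2): each object O_i
# becomes a vector whose j-th entry is f_ij * log(n / n_j), where f_ij is
# the frequency of feature j in object i and n_j is the number of objects
# containing feature j. Toy data and names are illustrative assumptions.
import math
from collections import Counter

def encode_objects(objects, features):
    n = len(objects)  # total number of objects
    # n_j: number of objects in which feature j appears at least once
    df = {f: sum(1 for obj in objects if f in obj) for f in features}
    vectors = []
    for obj in objects:
        freq = Counter(obj)  # f_ij for this object
        vectors.append([freq[f] * math.log(n / df[f]) if df[f] else 0.0
                        for f in features])
    return vectors

if __name__ == "__main__":
    # Students as objects; the courses they offer as (repeatable) features.
    students = [["CSC101", "CSC102", "CSC101"],
                ["CSC102", "MTH201"],
                ["MTH201", "MTH201", "CSC101"]]
    courses = ["CSC101", "CSC102", "MTH201"]
    for vec in encode_objects(students, courses):
        print([round(v, 3) for v in vec])
```

Once each object is a fixed-length numeric vector, similar instances can be aggregated algebraically, which is what makes the size reduction of section 5 applicable.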
5 HYBRID DATA REDUCTION TECHNIQUE

Our objective in developing a hybrid data reduction technique is to allow the reduction of transaction data in both the number of features and the quantity of instances at the same time. During hybrid data reduction we consider the need for a compact representation of the observed data features, so as to ensure the retention of essential information. The reduction of data size, on the other hand, is achieved with an adaptive sampling procedure, so that the overall reduction process is simplified and made interactive. These two approaches to data reduction are integrated into a single algorithm. The scalability and efficiency achieved by the proposed algorithm is significant in terms of the quality of the resulting data when compared with existing reduction approaches, and this quality can be assessed with a closeness-of-fit measure.

In our approach, a full dataset is considered as a known object on which the reduction operations are performed. The two aspects of data reduction are performed within a single procedure, starting with feature selection and followed by data size reduction. The flowchart of the proposed multi-purpose data reduction approach is shown in fig. 2.

The feature selection phase in our approach adopts the feature subset selection paradigm [5]: the average distance of each object to its nearest neighbour object(s) is calculated using the Euclidean distance measure [16], and the objects are then grouped according to the computed distance (objects with the same distance measure are grouped together to form nodes). The maximum number of nodes to be formed, $Q$, is calculated as

$$Q = p_1 \times J \qquad (3)$$

where $p_1$ is the proportion of feature selection to be performed and $J$ is the total number of data attributes. To carry the reduction process further, node centres, which uniquely identify each node or cluster, are determined from the respective nodes; this is what is otherwise called feature subset selection. Each node centre is computed as the average of the features connected to the node or cluster:

$$z_q = \frac{1}{N_q}\sum_{x_j \in C_q} x_j, \qquad q = 1, 2, \ldots, Q \qquad (4)$$

where $z_q$ is the $q$th cluster or node centre, $N_q$ is the number of interconnected features in node $C_q$ ($q = 1, \ldots, Q$) and $x_j$ is the $j$th object feature contained in the node.

The output of the feature selection phase, a full database with aggregated features and the respective node centres, is then taken as input to the data size reduction problem. The data size reduction problem is formulated as a modified chi-square criterion that minimizes the discrepancy in frequency between the original and the reduced dataset. By subjecting data size reduction to the binary quadratic program presented in equations (5) to (7), our aim is to minimize the closeness-of-fit between the expected and the observed distributions while selecting the true data subset, so as to ensure accuracy. We also incorporate a pattern search technique into the algorithm to select a subset that is truly representative of the observed dataset. The reduction process continues until the specified stopping criteria, namely a set number of iterations and of trials within each iteration, are met. Our data size reduction model, which is in line with the work of [1], is given by

$$\min \chi^2 = \sum_{j=1}^{Q}\sum_{k=1}^{K_j} \frac{\left(n_{jk} - p\,N_{jk}\right)^2}{p\,N_{jk}} \qquad (5)$$

$$\text{subject to:} \qquad \sum_{i=1}^{N} x_{ijk} = n_{jk} \qquad (6)$$

$$x_{ijk} = \begin{cases} 1, & \text{if row } i \text{ of the sample falls in category } k \text{ of attribute } j \text{ in the observed dataset } X_{obs} \\ 0, & \text{otherwise} \end{cases} \qquad (7)$$

where $x_{ijk}$ ($i = 1, \ldots, N$; $k = 1, \ldots, K_j$) is a Boolean indicator whose row must correspond to a row in $X$ ($N \times J$): the entry is 1 if the selected data sample matches the corresponding entry in the observed dataset and 0 otherwise. $\chi^2$ denotes the modified chi-square criterion; $n_{jk}$ and $N_{jk}$ are the frequencies of the $k$th category of the $j$th attribute's node centre in the reduced and the true dataset respectively; $p = n/N$ is the sampling proportion of the data size reduction; $K_j$ ($j = 1, 2, \ldots, Q$) is the number of categories of the $j$th node centre; and $Q$ is the total number of nodes formed.
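To make the formulation in equations (5) to (7) concrete, the following is a minimal sketch under our own simplifying assumptions: columns are treated as categorical, and plain random single-row swaps stand in for the paper's pattern search. All names here are illustrative.

```python
# Minimal sketch of the modified chi-square criterion of equation (5):
# pick a subset of size n whose per-category frequencies n_jk stay close
# to p * N_jk from the full dataset. Random single-row swaps stand in for
# the paper's pattern search; all names are illustrative assumptions.
import random
from collections import Counter

def chi_square(subset, full, p):
    """Sum over attributes j and categories k of (n_jk - p*N_jk)^2 / (p*N_jk)."""
    total = 0.0
    for j in range(len(full[0])):
        N_jk = Counter(row[j] for row in full)    # category counts, full data
        n_jk = Counter(row[j] for row in subset)  # category counts, subset
        total += sum((n_jk[k] - p * N) ** 2 / (p * N) for k, N in N_jk.items())
    return total

def reduce_size(full, p=0.2, iterations=500, seed=0):
    rng = random.Random(seed)
    n = max(1, round(p * len(full)))
    idx = rng.sample(range(len(full)), n)          # initial random subset
    best = chi_square([full[i] for i in idx], full, p)
    for _ in range(iterations):                    # swap one row in, one out
        out_pos = rng.randrange(n)
        in_row = rng.randrange(len(full))
        if in_row in idx:
            continue
        trial = idx[:]
        trial[out_pos] = in_row
        score = chi_square([full[i] for i in trial], full, p)
        if score < best:                           # keep swaps that fit better
            idx, best = trial, score
    return [full[i] for i in idx], best
```

A swap is kept only when it lowers the criterion; the stopping criteria named above (iterations, and trials within each iteration) map onto the single `iterations` loop in this simplified form.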
5.1 Flowchart of the Hybrid Data Reduction Technique

The flowchart in fig. 2 gives a step-by-step graphical description of the multi-purpose data reduction algorithm. The steps are as follows (a sketch of the feature selection phase follows the list):

1. Initialization of the dataset. The dataset to be used for the reduction process is defined and initialized. The preferred dataset should be large in quantity of instances and high-dimensional.

2. The input and target values for the computation are obtained. These are parameters such as the reduction proportion ($p_1$), the number of features to be obtained after the execution of the algorithm ($Q$), the number of iterations and the number of trials within each iteration.

3. Feature selection is performed as described earlier in this section. The result is tested against the set reduction proportion of the number of features until the feature selection phase is completed.

4. The result obtained from the feature selection phase is taken as the input to the data size reduction phase. This result contains parameters such as the total number of nodes computed ($Q$), the respective node centres and the number of features contained in each node. These parameters, together with the quantity of instances in the observed dataset, are used to evaluate the objective function presented in equation (5): a random sample ($X_{red}$) is selected from the observed dataset ($X_{obs}$) and its rows are continuously swapped based on the criteria presented in equations (5), (6) and (7).

5. The result, the reduced dataset, is presented. This result can be tested to ascertain its accuracy and error rate.

6. The algorithm terminates.
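As a companion to steps 3 and 4 above, here is a minimal sketch, under our own assumptions, of the feature selection phase of equations (3) and (4): feature columns are grouped by their rounded average Euclidean nearest-neighbour distance (one simple reading of the grouping rule), the number of nodes is capped at $Q = p_1 \times J$, and each node is replaced by its centre. The helper names and the rounding granularity are illustrative.

```python
# Minimal sketch of the feature selection phase (equations (3)-(4)):
# group feature columns by nearest-neighbour Euclidean distance, cap the
# number of nodes at Q = p1 * J, and output each node's centre (the mean
# of its member columns). Names and the rounding used to treat distances
# as "equal" are our own illustrative assumptions.
import math
from collections import defaultdict

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nn_distance(col, others):
    """Distance from one feature column to its nearest neighbour column."""
    return min(euclidean(col, other) for other in others)

def select_features(X, p1=0.5):
    cols = list(zip(*X))                       # feature columns of the N x J table
    J = len(cols)
    Q = max(1, round(p1 * J))                  # equation (3): Q = p1 * J
    # Group columns whose (rounded) nearest-neighbour distances coincide.
    nodes = defaultdict(list)
    for j, col in enumerate(cols):
        d = nn_distance(col, [c for i, c in enumerate(cols) if i != j])
        nodes[round(d, 1)].append(j)
    # Keep at most Q nodes; each centre is the mean of its member columns.
    centres = []
    for members in list(nodes.values())[:Q]:   # equation (4): node centres
        centre = [sum(cols[j][r] for j in members) / len(members)
                  for r in range(len(X))]
        centres.append(centre)
    return [list(row) for row in zip(*centres)]    # reduced N x Q table

if __name__ == "__main__":
    X = [[1.0, 1.1, 9.0], [2.0, 2.1, 8.0], [3.0, 3.2, 7.0]]
    print(select_features(X, p1=0.66))         # correlated columns collapse
```

In a full implementation, the reduced table produced here would feed the `reduce_size` sketch of the previous subsection, mirroring the hand-off between steps 3 and 4.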
6 CONCLUSION

Modern business organizations' dependence on databases and data warehouses calls for adequate attention to the increasing volumes of data in such systems, which render the true dimensionality of a business enterprise difficult to understand. This inherent problem can be minimized in organizational databases to assist decision makers in informed decision making. In transaction processing systems, where concurrency, atomicity and data consistency must be ensured, system functions are difficult to optimize in the presence of large volumes of irrelevant or redundant data. In view of these problems, transaction data need to be reduced to reflect the real dimensionality of the enterprise.

This can be achieved by adopting the hybrid technique to reduce transaction data, extracting significant and relevant features from the database while diminishing the quantity of irrelevant information. While concurrency is enhanced by data normalization, the many associations and feature vectors that are irrelevant or redundant to data classification and related tasks can be sufficiently reduced with a hybrid data reduction technique. The proposed approach, apart from being able to reduce transaction data in both features and size for all-round usage, can improve the performance of the resulting databases by retaining the minimum relevant data, thus making transaction processing systems more efficient and scalable. The twin functions of reducing the number of features and the quantity of instances are combined in a single hybrid technique, thus minimizing the overall time needed to reduce transaction data.

Fig. 2. A flowchart of the hybrid data reduction technique.

REFERENCES

[1] X. Li, "Data Reduction via Adaptive Sampling", Communications in Information and Systems, vol. 2, no. 1, pp. 53-68, 2002.
[2] E. Malinowski and E. Zimanyi, "Advanced Data Warehouse Design: From Conventional to Spatial and Temporal Applications", Springer-Verlag. Available at http://www.springer.com/978-3-540-74404-7. Visited: November 2008.
[3] Z. Zhang, C. Zhang and S. Zhang, "An Agent-based Hybrid Framework for Database Mining", Applied Artificial Intelligence, vol. 17, no. 5-6, pp. 383-398, 2003.
[4] J.R. Cano, F. Herrera and M. Lozano, "Using Evolutionary Algorithms as Instance Selection for Data Reduction in KDD: An Experimental Study", IEEE Transactions on Evolutionary Computation, vol. 7, no. 6, pp. 561-575, 2003.
[5] Q. Hu, D. Yu and Z. Xie, "Information-preserving Hybrid Data Reduction based on Fuzzy-rough Techniques", Pattern Recognition Letters, vol. 27, pp. 414-423, 2006.
[6] J. Skyt, C.S. Jensen and T.B. Pedersen, "Specification-based Data Reduction in Dimensional Data Warehouses", Information Systems, vol. 33, no. 1, pp. 36-63, 2007.
[7] R. Alfred and D. Kazakov, "Aggregating Multiple Instances in Relational Databases Using Semi-supervised Genetic Algorithm-based Clustering Technique", Local Proceedings of ADBIS, Varna, pp. 136-147, 2007.
[8] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang and S. Lin, "Graph Embedding and Extensions: A General Framework for Dimensionality Reduction", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 1, pp. 40-51, 2007.
[9] Z. Voulgaris and G.D. Magoulas, "Dimensionality Reduction for Feature and Pattern Selection in Classification Problems", Proceedings of the Third International Multi-Conference on Computing in the Global Information Technology (ICCGI), pp. 60-65, 2008.
[10] X. Li and V.S. Jacob, "Adaptive Data Reduction for Large-scale Transaction Data", European Journal of Operational Research, vol. 188, no. 3, pp. 910-924, 2008.
[11] S. Russell and V. Yoon, "Applications of Wavelet Data Reduction in a Recommender System", Expert Systems with Applications, vol. 34, no. 4, pp. 2316-2325, 2008.
[12] O. Okun and H. Priisalu, "Unsupervised Data Reduction", Signal Processing, vol. 87, no. 9, pp. 2260-2267, 2007.
[13] G. Salton, A. Wong and C.S. Yang, "A Vector Space Model for Automatic Indexing", Communications of the ACM, vol. 18, no. 11, pp. 613-620, 1975.
[14] E.F. Codd, "A Relational Model of Data for Large Shared Data Banks", Communications of the ACM, vol. 13, no. 6, pp. 377-387, 1970.
[15] R. Alfred and D. Kazakov, "A Clustering Approach to Generalized Pattern Identification Based on Multi-instanced Objects with DARA", Local Proceedings of ADBIS, Varna, pp. 38-49, 2007.
[16] J.C. Gower, "Euclidean Distance Geometry", The Mathematical Scientist, vol. 7, pp. 1-14, 1982.
Ifiok J. Udo holds a B.Sc. degree in Computer Science and is at present an M.Sc. student in the Department of Computer Science and Engineering, Obafemi Awolowo University, Ile-Ife, Nigeria, and a member of the Information Storage and Retrieval Group in the same department. His research work is on the development of a hybrid data reduction technique for OLTP environments. He has published articles in reputable journals and conferences.

Babajide S. Afolabi holds a Ph.D. in Information and Communication Sciences from Universite Nancy 2, Nancy, France. He heads the Information Storage and Retrieval Group and is a member of the Nigerian Computer Society (NCS) and the Computer Professionals Registration Council of Nigeria (CPN). He is a senior lecturer in the Department of Computer Science and Engineering, Obafemi Awolowo University, Ile-Ife, Nigeria. He has published articles in reputable journals and conferences.