Security Procedures for Classification Mining Algorithms

One of the challenges facing the computer science community is the development of techniques and tools to discover new and useful information from large collections of data. There are a number of basic issues associated with this challenge and many are still unresolved. This situation has led to the emergence of a new area of study called "Knowledge Discovery in Databases" (KDD). The recent efforts of KDD researchers have focused primarily on issues surrounding the individual steps of the discovery process. Those issues that are not directly related to the discovery process have received much less attention. One such issue is the impact of this new technology on database security. In this paper, we investigate issues pertaining to the assessment of the impact of classification mining on database security. In particular, the security threat presented by a category of classification mining algorithms that we refer to as decision-region based is analyzed. Providing safeguards...

https://www.ijert.org/protecting-sensitive-rules-based-on-classification-in-privacy-preserving-data-mining https://www.ijert.org/research/protecting-sensitive-rules-based-on-classification-in-privacy-preserving-data-mining-IJERTV2IS110786.pdf In this paper, we propose a method of hiding sensitive classification rules from data mining algorithms for categorical datasets. Our approach is to reconstruct a dataset according to the classification rules that have been checked and approved by the data owner for release. Unlike other heuristic modification approaches, our method first classifies a given dataset. Subsequently, a set of classification rules is shown to the data owner, who identifies the sensitive rules that should be hidden. We then replace known values with unknown values ("?") in the relevant transactions to hide a given sensitive classification rule. Finally, the sanitized dataset is generated, from which the sensitive classification rules can no longer be mined. Our experiments show that the sensitive rules can be hidden completely in the reconstructed datasets, while non-sensitive rules can still be discovered without any side effects.

Abstract Recently, a new class of data mining methods, known as privacy preserving data mining (PPDM) algorithms, has been developed by the research community working on security and knowledge discovery. The aim of these algorithms is to extract relevant knowledge from large amounts of data while protecting sensitive information.

SECURITY PROCEDURES FOR CLASSIFICATION MINING ALGORITHMS

Tom Johnsten
Computer Science Department
Western Illinois University
T-Johnsten@wlu.edu

Vijay V. Raghavan
Center for Advanced Computer Studies
University of Louisiana Lafayette
raghavan@louisiana.edu

Abstract

Classification mining algorithms can be used to discover protected values from non-protected data that are voluntarily released for mining purposes. We reported earlier on the development of an assessment algorithm, along with a set of security policies, for use in analyzing a protected data element's risk of disclosure with respect to decision-region based algorithms. This paper presents an extension of the previous work. Specifically, two algorithms are developed: one for assessing a protected data element's risk of disclosure with respect to decision-region based algorithms, and the other with respect to extended decision-region based algorithms. The former risk assessment algorithm has a relatively low execution time, but may not achieve an exact assessment of a protected data element's risk of disclosure. However, data collected from initial experiments indicate that the application of this particular assessment algorithm can lead to an exact assessment, under certain security policies.

Keywords: data mining, security, risk assessment, classification, database security.

Introduction

Nowadays companies and organizations frequently use data mining technology to analyze their stored data to discover valuable patterns or rules that can help them maintain their competitive edge. Recently, researchers within the information security community have begun to examine the impact of this technology on database security [1,2,3,4,5].

The original version of this chapter was revised: The copyright line was incorrect. This has been corrected. The Erratum to this chapter is available at DOI: 10.1007/978-0-387-35587-0_24
M. S. Olivier et al. (eds.), Database and Application Security XV © IFIP International Federation for Information Processing 2002

Our previous work focused on the disclosure of a protected data element through classification mining [5]. As part of that work, we identified a general category of classification mining algorithms, referred to as decision-region based, whose members possess a common set of properties related to the security risk assessment of a protected data element. Based on these properties, we developed a security risk assessment algorithm, along with a set of security policies, for use in analyzing decision-region based algorithms. We developed the risk assessment algorithm in the context of the relational data model. In the context of that model, we defined the terms protected attribute, non-protected attribute, and protected tuple. A protected attribute is one that includes the protected data element (or attribute value) in its domain, a non-protected attribute contains no protected data elements in its domain, and a protected tuple contains a protected data element and at least one non-protected data element.

This paper introduces a broader category of classification mining algorithms, which we refer to as extended decision-region based, whose members possess a common set of properties that can be exploited during the assessment of the security risk of a protected data element. These algorithms, unlike decision-region based algorithms, are characterized as allowing for descriptions that are not entirely defined in terms of properties of the object. As a result, new assessment procedures must be developed with respect to this new category of classification algorithms. In this paper, we present an assessment algorithm, EXACT_OB2, designed specifically for use with extended decision-region based algorithms. We also present a new assessment algorithm, APPROX_EVAL, for use with decision-region based algorithms.
The APPROX_EVAL algorithm is characterized as having a relatively low execution time, but may not achieve an exact assessment. However, data collected from initial experiments indicate that, in general, the application of this algorithm can lead to an exact assessment of the security risk associated with a protected data element, under certain security policies.

The rest of this paper is organized as follows. Section 1 presents an overview of our previous work, which we extend in this paper. Section 2 characterizes the extended decision-region based algorithms and presents an outline of the EXACT_OB2 algorithm. Section 3 presents an outline of the APPROX_EVAL algorithm, and Section 4 describes the results of experiments that were conducted to assess the effectiveness of APPROX_EVAL. Finally, Section 5 presents the conclusions and discusses some future research projects.

1. OVERVIEW OF PREVIOUS WORK

We developed an algorithm, referred to as Orthogonal-Boundary (OB), that provides an exact assessment of the security risk of protected data elements with respect to decision-region based algorithms [5]. We define an exact assessment as one that guarantees the implementation of a security policy at the description space level. The notion of a description space level policy follows directly from the definition of a decision-region based classification algorithm. A classification algorithm, A, is a decision-region based algorithm if and only if the following two conditions are satisfied:

• Condition-1: It is possible to identify a priori a finite set of descriptions, D, in terms of the attribute values present in an object O such that the particular description d used by A to classify O is an element of D.
• Condition-2: The "predicted accuracy" of assigning an object satisfying a description d ∈ D to a class c is dependent on the distribution of class label c relative to all other class labels among the objects that satisfy d in the training set.

These conditions lead to the property that the effective assessment of the security threat from decision-region based algorithms requires the construction of a class-accuracy set for each description d ∈ D. We define a class-accuracy set as a collection of ordered pairs (c_i, a_i), where c_i is the i-th attribute value (class label) in the domain of the protected attribute and a_i is the predicted accuracy, according to the classification algorithm, of assigning to the protected tuple the class label c_i.

Inference-based security policies may be applied at two levels. If it is known a priori that a particular description d ∈ D will be selected relative to the protected tuple, then we can apply a policy just to that description. We refer to this type of policy as being defined at the description level. A more desirable approach is to ensure that no matter which d ∈ D is chosen, the stated security policy is satisfied. We refer to this type of policy as being defined at the description space level. Given the above definitions and Condition-1, we concluded that inference-based security policies should be specified at the description space level, since description level policies require prior knowledge of which description will be chosen, which makes the results specific to a particular classification mining algorithm.

The OB algorithm has been designed for use with decision-region based algorithms that produce a specific type of description: descriptions that are expressed as a logical conjunction of (attribute name, value) pairs and constructed from non-protected attribute values present in the protected tuple.
The OB algorithm is, therefore, limited to decision-region based algorithms that divide the search space into hyper-rectangles. Those decision-region based algorithms that divide the search space into "tilted" regions, such as decision-tree algorithms that use an attribute selection criterion based on a weighted sum of attribute values, and k-nearest neighbor classifiers, which are defined in terms of a similarity measure, are outside the scope of OB.

The hyper-rectangular regions associated with a protected tuple are the logical conjunctions formed from one or more non-protected (attribute name, value) pairs that appear within the tuple. We refer to this set of descriptions as the description space, D*, of the protected tuple. It follows from Condition-1 that the class label assigned to a protected tuple is a label associated with a description d ∈ D*. There is no way to identify a priori the description d ∈ D* that will be chosen by a classifier without making explicit assumptions about its operation. Thus, in general, it is necessary to inspect all descriptions that belong to a tuple's description space.

The risk assessment algorithm implements the description level protected rank security policy [1,1]. Given a class-accuracy set, a protected rank policy is satisfied if the ranked position of the protected data element is not within the non-secure range [L,U], where L and U are positive integers such that $1 \le L \le U$ and $L \le U \le |\{a_1, a_2, \ldots, a_n\}|$. To implement a protected rank policy it is necessary to compute a class-accuracy set for each individual description, d ∈ D*. The class-accuracy sets are computed based on Condition-2 as follows. Let c be a class label in the domain of a protected attribute.
Given a description d ∈ D*, the predicted accuracy, a, of assigning the protected tuple T the label c is the ratio of the number of tuples that are assigned label c and satisfy d to the number of tuples that satisfy d. This measure is equal to the classification accuracy measure defined in [6]. There are classification algorithms that satisfy Condition-1, but violate Condition-2. For instance, the CART decision-tree algorithm does not assume that the class labels are uniformly distributed, but instead computes predicted accuracy values using a weighted ratio based on prior probabilities [7].

A high-level outline of the OB algorithm, EXACT_OB1, is shown in Figure-1. The result of executing this algorithm is the implicit or explicit inspection of each description belonging to the protected tuple's description space, D*. The inspection process ensures that all descriptions, d ∈ D*, of a tuple's description space satisfy the stated description level security policy. Unfortunately, the number of descriptions that belong to a protected tuple's description space is exponential in terms of the number of non-protected attributes. However, the recognition of a special type of description that we refer to as a zero description can significantly reduce the required number of inspections. The tuples that satisfy a zero description are characterized as having a class label that does not equal the protected data element. The occurrence of a zero description implies that there is no need to inspect any description that is a specialization since it will also be a zero description.

As shown in Figure-1, the EXACT_OB1 algorithm determines if a description, d ∈ D*, is a zero description. If d is a zero description it is placed on a list so that it is not directly examined by the algorithm; otherwise the algorithm determines if the description is non-secure. A non-secure description, d, is a description that violates the stated description level security policy.
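The inspection step described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the training relation `TRAIN`, the function names, and the tie-handling rule in `rank_of` are hypothetical, and the transformation of non-secure descriptions into secure ones is deliberately left out.

```python
from itertools import combinations

PROTECTED = "Mileage"  # protected attribute (hypothetical example relation)

# Hypothetical training relation, invented for illustration only.
TRAIN = (
    [{"Cyl": 4, "Fuel": "efi", "Tran": "auto", "Mileage": "high"}] * 2
    + [{"Cyl": 4, "Fuel": "2-bbl", "Tran": "manu", "Mileage": "med"}] * 3
    + [{"Cyl": 6, "Fuel": "efi", "Tran": "manu", "Mileage": "low"}] * 3
)

def satisfies(row, desc):
    """A description is a conjunction of (attribute name, value) pairs."""
    return all(row[attr] == value for attr, value in desc)

def class_accuracy_set(rows, desc, labels):
    """Predicted accuracy per label: #(label and d) / #(d)."""
    matching = [r for r in rows if satisfies(r, desc)]
    if not matching:
        return {c: 0.0 for c in labels}
    return {c: sum(r[PROTECTED] == c for r in matching) / len(matching)
            for c in labels}

def rank_of(acc, label):
    """Rank 1 is the highest accuracy; ties take the best rank (assumption)."""
    return 1 + sum(1 for v in acc.values() if v > acc[label])

def is_zero_description(rows, desc, label):
    """No tuple satisfying the description carries the protected label."""
    return not any(satisfies(r, desc) and r[PROTECTED] == label for r in rows)

def inspect(rows, non_protected_values, label, low=1, high=1):
    """Level-wise inspection of D* with zero-description pruning."""
    labels = sorted({r[PROTECTED] for r in rows})
    pairs = sorted(non_protected_values.items())
    zero, non_secure = [], []
    for k in range(1, len(pairs) + 1):          # k-level descriptions
        for desc in combinations(pairs, k):
            if any(set(z) <= set(desc) for z in zero):
                continue                        # specialization of a zero description
            if is_zero_description(rows, desc, label):
                zero.append(desc)
            elif low <= rank_of(class_accuracy_set(rows, desc, labels), label) <= high:
                non_secure.append(desc)         # violates protected rank policy [low, high]
    return zero, non_secure

zero, non_secure = inspect(TRAIN, {"Cyl": 4, "Fuel": "efi", "Tran": "manu"}, "high")
print(zero)
print(non_secure)
```

With this invented relation, (Tran=manu) is the only zero description found, so its three specializations are never inspected, and (Cyl=4 & Fuel=efi) is the only description that places the protected value "high" at rank 1.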
Currently, a description, d, is considered non-secure if it violates the protected rank policy [1,1].

EXACT_OB1 Algorithm
  k = 1
  while (exists descriptions to inspect)
    D = k-level descriptions requiring protection (excluding zero_description list)
    for each (description d in D)
      if (d == zero_description)
        append all specializations of d to zero_description list
      else if (d == non-secure description)
        append d to non-secure description list
    end_for
    transform non-secure descriptions to secure descriptions
      (by protecting subset of attribute values not belonging to protected tuple)
    k = k + 1
  end_while

Figure 1. Orthogonal-Boundary Algorithm: EXACT_OB1.

The transformation of a non-secure description into a secure description is performed by concealing a subset of attribute values from the domain of non-protected attributes that appear in the training set, but not in the protected tuple. The concealment of such values must decrease the number of tuples that satisfy the non-secure description with a class label equal to that of the protected data element, which can be repeated until there is no longer a violation of policy. This particular transformation has the following important property: the implementation of a policy, P, at the description level implies the implementation of P at the description space level. In the context of EXACT_OB1, this property ensures the implementation of the protected rank policy [1,1] at the description space level. It is important to note that EXACT_OB1 is only valid in the context of decision-region based algorithms. In the following section we present an algorithm that is designed for use with another category of classification algorithms, referred to as extended decision-region based.

2. EXTENDED DECISION-REGION BASED ALGORITHMS

We have generalized the EXACT_OB1 algorithm to provide an exact assessment of protected data elements with respect to extended decision-region based algorithms.
A classification algorithm, A, is an extended decision-region based algorithm if and only if the following two conditions are satisfied:

• Condition-1*: It is possible to identify a priori a finite set of descriptions, D, in terms of the attribute values present in an object O, and the attribute values present in a finite set of objects O' derived from O, such that the particular description d used by A to classify O is an element of D.

• Condition-2*: The predicted accuracy of assigning an object satisfying a description d to a class c is dependent on the distribution of class label c relative to all other class labels among the objects that satisfy d in the training set.

In Condition-1*, an object O' is derived from another object O if it has at least one property in common with O. The difference between Condition-1 and Condition-1* is that the latter condition allows for descriptions that are not entirely defined in terms of the properties of the target object. The second condition, Condition-2*, is identical to Condition-2 defined for decision-region based algorithms. Thus, a decision-region based algorithm is also an extended decision-region based algorithm.

An example of an extended decision-region based algorithm is C4.5's consult interpreter when applied to protected tuples with missing non-protected attribute values [8]. To illustrate, consider the protected tuple, (Cyl=4; Fuel=null; Tran=manu; Power=med; Mileage=null), where Mileage is the protected attribute, and the following class-accuracy values, 0 (low), .62 (med) and .38 (high), produced by C4.5. These confidence values were obtained through the assessment of the descriptions shown in Figure-2. It follows from the descriptions that there are two relevant protected tuples, (Cyl=4; Fuel=efi; Tran=manu; Power=med; Mileage=null) and (Cyl=4; Fuel=2-bbl; Tran=manu; Power=med; Mileage=null).
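The two relevant protected tuples above arise by substituting each possible Fuel value for the missing one. A minimal sketch of that substitution follows, assuming tuples are held as dictionaries with None for null and that the attribute domains are supplied by the caller; the function name `relevant_tuples` and the `DOMAINS` table are hypothetical.

```python
from itertools import product

def relevant_tuples(protected_tuple, domains, protected_attr):
    """Expand every missing non-protected attribute over its domain."""
    missing = [attr for attr, value in protected_tuple.items()
               if value is None and attr != protected_attr]
    expanded = []
    # One relevant tuple per combination of values for the missing attributes.
    for values in product(*(domains[attr] for attr in missing)):
        candidate = dict(protected_tuple)
        candidate.update(zip(missing, values))
        expanded.append(candidate)
    return expanded

# Domains assumed for the example relation.
DOMAINS = {"Fuel": ["efi", "2-bbl"], "Power": ["low", "med", "high"]}

t = {"Cyl": 4, "Fuel": None, "Tran": "manu", "Power": "med", "Mileage": None}
for candidate in relevant_tuples(t, DOMAINS, "Mileage"):
    print(candidate)
```

The protected attribute stays null in every generated tuple, and a tuple with no missing non-protected values expands to a single copy of itself.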
In general, there exists a relevant, or derived, protected tuple for each distinct value in the domain of a missing non-protected attribute. The total number of relevant tuples is equal to the product of the domain sizes of the various non-protected attributes to which the missing attribute values belong. For instance, in the above example there are two relevant protected tuples since there is one missing non-protected attribute, Fuel, and its domain size is two, {2-bbl, efi}. Similarly, if the protected tuple contains two missing non-protected attributes, Fuel and Power, the total number of relevant protected tuples is six since the domain size of Fuel is two and the domain size of Power is three, {low, med, high}. The description space of each relevant protected tuple must be inspected to determine if each description, d, satisfies the stated description level policy.

Low  = #(Cyl=4 & Fuel=2-bbl & Mileage=low) / #(Cyl=4)
       + #(Fuel=efi & Cyl=4) / #(Cyl=4) * #(Cyl=4 & Fuel=efi & Power=med & Mileage=low) / #(Cyl=4 & Fuel=efi & Power=med)
Med  = #(Cyl=4 & Fuel=2-bbl & Mileage=med) / #(Cyl=4)
       + #(Fuel=efi & Cyl=4) / #(Cyl=4) * #(Cyl=4 & Fuel=efi & Power=med & Mileage=med) / #(Cyl=4 & Fuel=efi & Power=med)
High = #(Cyl=4 & Fuel=2-bbl & Mileage=high) / #(Cyl=4)
       + #(Fuel=efi & Cyl=4) / #(Cyl=4) * #(Cyl=4 & Fuel=efi & Power=med & Mileage=high) / #(Cyl=4 & Fuel=efi & Power=med)

Figure 2. Example Descriptions.

A high-level description of the EXACT_OB2 algorithm is shown in Figure-3. This algorithm is an extension of the EXACT_OB1 algorithm, and is designed for use with extended decision-region based algorithms. The EXACT_OB2 algorithm, like EXACT_OB1, has been developed in the context of descriptions that are represented by a logical conjunction of (attribute name, value) pairs. In addition, we assume that the set of descriptions D is limited to those descriptions generated from the attribute values of the protected tuple as well as from values in the domain of the missing non-protected attributes of the protected tuple. Based on these restrictions, the result of executing the EXACT_OB2 algorithm is the inspection of the descriptions belonging to the description space of each relevant protected tuple.

EXACT_OB2 Algorithm
  PT = generate all relevant protected tuples
  for each (protected tuple in PT)
    apply EXACT_OB1
  end_for

Figure 3. Orthogonal-Boundary Algorithm: EXACT_OB2.

The structure of EXACT_OB2 is almost identical to the structure of EXACT_OB1. The only difference is that EXACT_OB2 inspects the description space of multiple protected tuples as opposed to a single protected tuple. The biggest limitation of EXACT_OB2, as well as EXACT_OB1, is the potentially high execution time. Specifically, the execution time of both algorithms is dependent upon the number of non-protected attributes and the occurrence of zero descriptions. As a result, we have developed an alternative assessment algorithm to approximate the exact assessment performed by EXACT_OB1.

3. APPROXIMATE ASSESSMENT ALGORITHM

In this section, we present an assessment algorithm, which we will refer to as APPROX_EVAL. It approximates the exact assessment performed by EXACT_OB1. The idea behind APPROX_EVAL is to decrease the required execution time by reducing the size of the description space. One way to reduce the size of a description space is to transform a non-secure description into a secure description by concealing one or more non-protected attribute values. For example, if (Cyl=4) is a non-secure description with respect to the protected tuple, (Fuel=efi; Cyl=4; Tran=manu; Mileage=null), then the concealment of the non-protected attribute value, Cyl=4, removes the description Cyl=4 from D*.
In other words, the description cannot be used by a decision-region based algorithm to assign a value to the protected attribute. The concealed value also removes from D* the following descriptions: (Fuel=efi ∧ Cyl=4), (Cyl=4 ∧ Tran=manu), and (Fuel=efi ∧ Cyl=4 ∧ Tran=manu). As a result, there is no need to explicitly inspect any of these descriptions. However, as illustrated in the previous section, the concealment of non-protected attribute values may cause a classification algorithm that normally adheres to the conditions of a decision-region based algorithm to violate Condition-1 [5]. The APPROX_EVAL algorithm must, therefore, determine if a concealed non-protected attribute value is likely to result in such a violation.

A high-level outline of APPROX_EVAL is shown in Figure-4. If it is determined that a violation is unlikely then the protected tuple is evaluated using the APPROX_OB algorithm; otherwise, it is evaluated using the EXACT_OB1 algorithm described in the previous section. The difference between the two algorithms is that APPROX_OB, unlike EXACT_OB1, transforms a non-secure description into a secure description by concealing one or more attribute values from the domain of non-protected attributes that appear in the protected tuple. In general, the execution time of APPROX_OB will be less than that of EXACT_OB1 due to the concealment of non-protected attribute values, and will also maximize the amount of available data. The APPROX_EVAL algorithm may not achieve an exact assessment since it must predict the impact of a concealed non-protected attribute value on a decision-region based algorithm. However, the results obtained from the experimental investigation suggest that the prediction is likely to be consistent with an exact assessment.

APPROX_EVAL Algorithm
  prediction = predict impact of concealed non-protected attribute values
  if (prediction == violation of Condition-1)
    apply EXACT_OB1 algorithm
  else
    apply APPROX_OB algorithm

Figure 4. Approximate Evaluation Algorithm (High-Level).

3.1. Predicted Accuracy Measure

We have developed a measure to compute a class-accuracy set to determine if a concealed non-protected attribute value is likely to result in a violation of Condition-1. In this context, the class-accuracy set is computed independently of any specific description belonging to the protected tuple's description space. Instead, the predicted accuracy value a, with respect to a class label c, is computed based upon the degree to which the protected tuple's non-protected attribute values predict the class c. The degree to which a non-protected attribute value predicts a protected attribute class depends upon the attribute value's relationship to the tuples that belong to the class as well as to the tuples that belong to other classes. These relationships form the basis of our measure.

Specifically, the proposed measure M, as shown in Figure-5, assigns a numerical value to a non-protected attribute value, a, that reflects the degree to which a predicts the protected attribute class b. The value N is equal to the number of tuples in the relation instance. The range of M is the interval [-1,1], where a value of positive (negative) one implies that the non-protected attribute value provides complete support (non-support) for the protected attribute class. The measure is comprised of two functions, class(a,b) and nonclass(a,b). The function class(a,b) captures the degree to which an implication of the form (b → a') weakens an implication of the form (a → b).
The quantity #(a,b) is the number of tuples in the relation instance with (attribute name, value) pairs A:a and B:b, and the quantity #(a',b) is the number of tuples in the relation with (attribute name, value) pairs A:a' and B:b. The symbol a' represents an element belonging to the set Dom(A) - a. The other function, nonclass(a,b), captures the degree to which an implication of the form (b' → a') weakens an implication of the form (a → b'). The symbol b' represents an element belonging to the set Dom(B) - b. The quantity #(a,b_i) represents the number of tuples in the relation instance with (attribute name, value) pairs A:a and B:b_i, where b_i is an instance of B such that b_i ≠ b, and #(a',b_i) represents the number of tuples in the relation instance with (attribute name, value) pairs A:a' and B:b_i. Finally, the condition expressed as cond-1 is satisfied if #(a,b_i) / #(a',b_i) = 1 and #(a,b_j) / #(a',b_j) < 1 for all j ≠ i, where b_i, b_j ∈ Dom(B) - b. Intuitively, the function nonclass(a,b) computes the degree of predictability associated with the class, b', that the attribute value a predicts to the greatest extent.

$$
M = \begin{cases}
+1 & \text{if } \#(a,b') = 0 \\
-1 & \text{if } \#(a,b) = 0 \\
\log_2 \dfrac{\mathrm{class}(a,b)/N + 1}{\mathrm{nonclass}(a,b)/N + 1} & \text{otherwise}
\end{cases}
$$

$$
\mathrm{class}(a,b) = \begin{cases}
\#(a,b) - \#(a',b) & \text{if } \#(a,b) \ge \#(a',b) \\
\#(a,b) / \#(a',b) & \text{otherwise}
\end{cases}
$$

$$
\mathrm{nonclass}(a,b) = \begin{cases}
0 & \text{if cond-1} \\
\max\limits_{b_i \in \mathrm{Dom}(B) - b}
\begin{cases}
\#(a,b_i) - \#(a',b_i) & \text{if } \#(a,b_i) \ge \#(a',b_i) \\
\#(a,b_i) / \#(a',b_i) & \text{otherwise}
\end{cases} & \text{otherwise}
\end{cases}
$$

Figure 5. Predicted Accuracy Measure M.

The definitions of class(a,b) and nonclass(a,b) represent the concept of a non-protected attribute's relevance and non-relevance odds, respectively. This type of relevance relationship has been used previously in information retrieval systems to determine a document term's degree of importance in distinguishing relevant and non-relevant documents [9].

3.2. APPROX_EVAL Algorithm

The measure M is used to compute the class-accuracy set of a protected tuple.
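One possible reading of the measure M is sketched below in Python. Several choices here are assumptions rather than statements from the paper: #(a',b) is taken as the count of tuples whose A value is anything other than a, the ratio branch of nonclass(a,b) is assumed to mirror that of class(a,b), a zero denominator inside cond-1 is treated as shown in `ratio`, and the relation `ROWS` is invented for illustration.

```python
import math

def measure_m(rows, attr, a, prot, b):
    """Weight in [-1, 1] for how strongly attr=a predicts prot=b."""
    n = len(rows)
    others = {r[prot] for r in rows} - {b}      # the classes b' != b

    def count(has_a, cls):
        # #(a, cls) when has_a is True, #(a', cls) when has_a is False.
        return sum(1 for r in rows
                   if (r[attr] == a) == has_a and r[prot] == cls)

    def odds(cls):
        # Difference when a dominates a' within cls, ratio otherwise
        # (the ratio branch for nonclass is an assumption).
        has, lacks = count(True, cls), count(False, cls)
        return has - lacks if has >= lacks else has / lacks

    def ratio(cls):
        # #(a, cls) / #(a', cls); zero-denominator handling is an assumption.
        has, lacks = count(True, cls), count(False, cls)
        if lacks == 0:
            return math.inf if has else 0.0
        return has / lacks

    if all(count(True, c) == 0 for c in others):    # #(a, b') = 0
        return 1.0
    if count(True, b) == 0:                         # #(a, b) = 0
        return -1.0

    ratios = [ratio(c) for c in others]
    # cond-1: exactly one other class has ratio 1 and the rest are below 1.
    cond_1 = ratios.count(1.0) == 1 and max(ratios) == 1.0
    nonclass = 0.0 if cond_1 else max(odds(c) for c in others)
    return math.log2((odds(b) / n + 1) / (nonclass / n + 1))

# Invented relation: A is a non-protected attribute, B the protected one.
ROWS = ([{"A": "x", "B": "p"}] * 3 + [{"A": "y", "B": "p"}]
        + [{"A": "x", "B": "q"}] + [{"A": "y", "B": "q"}] * 3)

print(measure_m(ROWS, "A", "x", "B", "p"))
print(measure_m(ROWS, "A", "x", "B", "q"))
```

In this symmetric relation the two weights are equal in magnitude and opposite in sign, which matches the intuition that A=x supports class p exactly as much as it speaks against class q.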
Specifically, the assigned weights that correspond to a class, c, are summed together to produce the corresponding predicted accuracy value. For example, suppose the application of M with respect to the protected tuple, (Cyl=6; Fuel=2-bbl; Tran=manu; Power=low; Mileage=null), produces the results shown in Figure-6. In this instance, the predicted class-accuracy set is {(low, -1.09), (med, -.255), (high, -.946)}. In general, a computed predicted accuracy value represents the degree to which the protected tuple predicts the corresponding protected data element.

Low:  {Cyl=6: .1699; Fuel=2-bbl: -.0458; Tran=manu: -.2157; Power=low: -1}
Med:  {Cyl=6: -.1699; Fuel=2-bbl: -.0771; Tran=manu: 0; Power=low: -.0079}
High: {Cyl=6: -1; Fuel=2-bbl: .0458; Tran=manu: 0; Power=low: .0079}

Figure 6. Application of Measure M.

The class-accuracy set is used in conjunction with the Protected Minimum Range (PMR) security policy to determine if there is a violation of Condition-1. The application of a PMR policy is dependent upon the rank position of the predicted accuracy value, a_i, of the protected data element relative to the other predicted accuracy values. If a_i is the maximum predicted accuracy value, then the pmr value is the difference between a_i and the next largest predicted accuracy value a_k; otherwise, it is the difference between a_i and the maximum predicted accuracy value. As shown in Figure-7, if pmr is greater than or equal to some established threshold, then the APPROX_EVAL algorithm evaluates the protected data element using the EXACT_OB1 algorithm; otherwise, the protected data element is evaluated using the APPROX_OB algorithm.

APPROX_EVAL Algorithm
  compute measure M with respect to protected tuple
  compute corresponding class-accuracy set
  pmr = apply Protected Minimum Range policy to class-accuracy set
  if (pmr >= threshold-value)
    apply EXACT_OB1 algorithm
  else
    apply APPROX_OB algorithm

Figure 7. Approximate Evaluation Algorithm (Detail-Level).

4. EXPERIMENTAL INVESTIGATION

Experiments were conducted to validate the use of the APPROX_EVAL algorithm in implementing the description space level protected rank policy [1,1]. In conducting our initial experiments, we restricted the evaluation to C4.5's decision-tree classifier and interpreter [8]. The experiments utilized ten distinct data sets, each of which contained at least ten non-protected attribute values and a single protected attribute with a domain size of five. All of the relations were constructed through the application of the Synthetic Classification Data Set (SCDS) program [10]. A total of fifty-one protected tuples were used in the investigation. Twenty-one of the protected tuples violated the description space level policy [1,1]. Figure-8 shows the pmr value computed for each of the fifty-one protected tuples.

The APPROX_OB algorithm was applied to each protected tuple. The application of APPROX_OB resulted in the transformation of a protected tuple T into a protected tuple T' that, in general, contained one or more concealed non-protected attribute values. The transformed protected tuples, T', were classified using C4.5's consult interpreter.

Figure 8. Computed pmr Values for Protected Tuples.

In order to assess the generality of the APPROX_EVAL algorithm, two distinct decision-tree models were generated for each relation instance. One model was constructed using the gain attribute selection criterion, while the other model was constructed using the gain ratio attribute selection criterion. We identified those protected tuples that violated the description space level protected rank policy [1,1]. There were a total of five such tuples, five with respect to the gain ratio criterion and one with respect to the gain criterion.
These five tuples are shown as solid-filled columns in Figure-8. The experimental results suggest that a pmr value greater than or equal to two identifies a potential violation of Condition-1. Unfortunately, such a pmr value may produce false positives. It appears that the occurrence of false positives may be related to the percentage of non-relevant attributes contained within a relation.

5. CONCLUSION AND FUTURE WORK

In this paper we identified a new category of classification mining algorithms, referred to as extended decision-region based, and presented a security risk assessment algorithm, EXACT_OB2, designed specifically for use with this category of algorithms. In addition, we proposed an alternative assessment algorithm, APPROX_EVAL, designed for use with decision-region based algorithms. This particular assessment algorithm is characterized as having a relatively low execution time, but may not achieve an exact assessment.

We have several additional research projects planned with respect to this challenging research area. Our immediate plans include further verification of experimental results using additional extended decision-region based algorithms and data sets, extending the security assessments to include continuously valued attributes, and the development of additional security assessments for other groups of mining algorithms.

References

[1] Chang, L. and Moskowitz, I. (1998). Parsimonious Downgrading and Decision Trees Applied to the Inference Problem. Proceedings of New Security Paradigms, pp. 82-89.
[2] Chang, L. and Moskowitz, I. (2000). An Integrated Framework for Database Privacy Protection. Proceedings of the Fourteenth Annual IFIP WG 11.3 Working Conference on Database Security.
[3] Clifton, C. and Marks, D. (1996). Security and Privacy Implications of Data Mining. 1996 SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, pp. 15-19.
[4] Clifton, C. (1999). Protecting Against Data Mining Through Samples. Proceedings of the Thirteenth Annual IFIP WG 11.3 Working Conference on Database Security, pp. 193-207.
[5] Johnsten, T. and Raghavan, V. (1999). Impact of Decision-Region Based Classification Mining Algorithms on Database Security. Proceedings of the Thirteenth Annual IFIP WG 11.3 Working Conference on Database Security, pp. 177-191.
[6] Holsheimer, M. and Siebes, A. (1994). Data Mining: The Search for Knowledge in Databases. Report CS-R9406. CWI, Amsterdam, The Netherlands.
[7] Steinberg, D. and Colla, P. (1997). CART User Manual. Salford Systems, San Diego, CA.
[8] Quinlan, J. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.
[9] Robertson, S. and Jones, K. (1976). Relevance Weighting for Search Terms. Journal of the American Society for Information Science, May-June, pp. 129-145.
[10] Synthetic Classification Data Sets, http://fas.sfu.ca/cs/people/GradStudents/melli/SCDS/intro.html.
