SECURITY PROCEDURES FOR
CLASSIFICATION MINING ALGORITHMS
Tom Johnsten
Computer Science Department
Western Illinois University
T-Johnsten@wiu.edu
Vijay V. Raghavan
Center for Advanced Computer Studies
University of Louisiana Lafayette
raghavan@louisiana.edu
Abstract
Classification mining algorithms can be used to discover protected values from
non-protected data that are voluntarily released for mining purposes. We reported
earlier on the development of an assessment algorithm, along with a set of security
policies, for use in analyzing a protected data element's risk of disclosure with
respect to decision-region based algorithms. This paper presents an extension of
the previous work. Specifically, two algorithms, one for assessing a protected
data element's risk of disclosure with respect to decision-region based algorithms
and the other with respect to extended decision-region based algorithms, are
developed. The former risk assessment algorithm has a relatively low execution
time, but may not achieve an exact assessment of a protected data element's
risk of disclosure. However, data collected from initial experiments indicate
that the application of this particular assessment algorithm can lead to an exact
assessment, under certain security policies.
Keywords:
data mining, security, risk assessment, classification, database security.
Introduction
Nowadays companies and organizations frequently use data mining technology to analyze their stored data to discover valuable patterns or rules that
can help them maintain their competitive edge. Recently, researchers within
the information security community have begun to examine the impact of this
technology on database security [1,2,3,4,5].
The original version of this chapter was revised: The copyright line was incorrect. This has been
corrected. The Erratum to this chapter is available at DOI: 10.1007/978-0-387-35587-0_24
M. S. Olivier et al. (eds.), Database and Application Security XV
© IFIP International Federation for Information Processing 2002
Our previous work focused on the disclosure of a protected data element
through classification mining [5]. As part of that work, we identified a general category of classification mining algorithms, referred to as decision-region
based, whose members possess a common set of properties related to the security risk assessment of a protected data element. Based on these properties,
we developed a security risk assessment algorithm, along with a set of security
policies, for use in analyzing decision-region based algorithms. We developed
the risk assessment algorithm in the context of the relational data model. In the
context of that model, we defined the terms protected attribute, non-protected
attribute, and protected tuple. A protected attribute is one that includes the
protected data element (or attribute value) in its domain, a non-protected attribute contains no protected data elements in its domain, and a protected tuple
contains a protected data element and at least one non-protected data element.
This paper introduces a broader category of classification mining algorithms,
which we refer to as extended decision-region based, whose members possess
a common set of properties that can be exploited during the assessment of the
security risk of a protected data element. These algorithms, unlike those for
decision-region based, are characterized as allowing for descriptions that are not
entirely defined in terms of properties of the object. As a result, new assessment
procedures must be developed with respect to this new category of classification
algorithms. In this paper, we present an assessment algorithm, EXACT_OB2,
designed specifically for use with extended decision-region based algorithms.
We also present a new assessment algorithm, APPROX_EVAL, for use with
decision-region based algorithms. The APPROX_EVAL algorithm is characterized as having a relatively low execution time, but may not achieve an exact
assessment. However, data collected from initial experiments indicate that, in
general, the application of the latter algorithm can lead to an exact assessment of
the security risk associated with a protected data element, under certain security
policies.
The rest of this paper is organized as follows. Section 1 presents an overview
of our previous work, which we extend in this paper. Section 2 characterizes
the extended decision-region based algorithms and presents an outline of the
EXACT_OB2 algorithm. Section 3 presents an outline of the APPROX_EVAL
algorithm, and Section 4 describes the results of experiments that were conducted to assess the effectiveness of APPROX_EVAL. Finally, Section 5 presents
the conclusions and discusses some future research projects.
1. OVERVIEW OF PREVIOUS WORK
We developed an algorithm, referred to as Orthogonal-Boundary (OB), that
provides an exact assessment of the security risk of protected data elements with
respect to decision-region based algorithms [5]. We define an exact assessment
as one that guarantees the implementation of a security policy at the description
space level. The notion of a description space level policy follows directly from
the definition of a decision-region based classification algorithm.
A classification algorithm, A, is a decision-region based algorithm if and
only if the following two conditions are satisfied:
• Condition-1: It is possible to identify a priori a finite set of descriptions,
D, in terms of the attribute values present in an object O such that the
particular description d used by A to classify O is an element of D.
• Condition-2: The "predicted accuracy" of assigning an object satisfying
a description d ∈ D to a class c is dependent on the distribution of class
label c relative to all other class labels among the objects that satisfy d in
the training set.
These conditions lead to the property that the effective assessment of the
security threat from decision-region based algorithms requires the construction
of a class-accuracy set for each description d ∈ D. We define a class-accuracy set
as a collection of ordered pairs (c_i, a_i), where c_i is the i-th attribute value (class
label) in the domain of the protected attribute and a_i is the predicted accuracy,
according to the classification algorithm, of assigning to the protected tuple the
class label c_i. Inference-based security policies may be applied at two levels.
If it is known a priori that a particular description d ∈ D will be selected relative
to the protected tuple, then we can apply a policy just to that description. We
refer to this type of policy as being defined at the description level. A more
desirable approach is to ensure that no matter which d ∈ D is chosen, the stated
security policy is satisfied. We refer to this type of policy as being defined
at the description space level. Given the above definitions and Condition-1,
we concluded that inference-based security policies should be specified at the
description space level, since description level policies require prior knowledge
of which description will be chosen, which makes the results specific to a
particular classification mining algorithm.
The OB algorithm has been designed for use with decision-region based
algorithms that produce a specific type of description: descriptions that are expressed as a logical conjunction of (attribute name, value)
pairs and constructed from non-protected attribute values present in the protected tuple. The OB algorithm is, therefore, limited to decision-region based
algorithms that divide the search space into hyper-rectangles. Those decision-region based algorithms that divide the search space into "tilted" regions, such
as decision-tree algorithms that use an attribute selection criterion based on a
weighted sum of attribute values, and k-nearest neighbor classifiers, which are
defined in terms of a similarity measure, are outside the scope of OB.
The hyper-rectangular regions associated with a protected tuple are the logical conjunctions formed from one or more non-protected (attribute name, value)
pairs that appear within the tuple. We refer to this set of descriptions as the
description space, D*, of the protected tuple. It follows from Condition-1 that
the assignment of a class label to a protected tuple is a label associated with a
description d ∈ D*. There is no way to identify a priori the description d ∈ D*
that will be chosen by a classifier without making explicit assumptions about
its operation. Thus, in general, it is necessary to inspect all descriptions that
belong to a tuple's description space.
The risk assessment algorithm implements the description level protected
rank security policy [1, 1]. Given a class-accuracy set, a protected rank policy
is satisfied if the ranked position of the protected data element is not within the
non-secure range [L, U], where L and U are positive integers such that $1 \le L \le U$
and $L \le U \le |\{a_1, a_2, \ldots, a_n\}|$. To implement a protected rank policy it
is necessary to compute a class-accuracy set for each individual description, d
∈ D*. The class-accuracy sets are computed based on Condition-2 as follows.
Let c be a class label in the domain of a protected attribute. Given a description
d ∈ D*, the predicted accuracy, a, of assigning the protected tuple T the label
c is the ratio of the number of tuples that are assigned label c and satisfy d to
the number of tuples that satisfy d. This measure is equal to the classification
accuracy measure defined in [6]. There are classification algorithms that satisfy
Condition-1, but violate Condition-2. For instance, the CART decision-tree
algorithm does not assume that the class labels are uniformly distributed, but
instead computes predicted accuracy values using a weighted ratio based on
prior probabilities [7].
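The ratio-based computation of a class-accuracy set and the protected rank check described above can be sketched in Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the dictionary representation of tuples, and the toy relation are hypothetical, and ranks are taken over distinct accuracy values (rank 1 is the highest).

```python
from collections import Counter

def class_accuracy_set(rows, description, protected_attr):
    """Map each class label to (#tuples with that label satisfying d) / (#tuples satisfying d)."""
    matching = [r for r in rows
                if all(r.get(attr) == val for attr, val in description.items())]
    if not matching:
        return {}
    counts = Counter(r[protected_attr] for r in matching)
    return {label: n / len(matching) for label, n in counts.items()}

def violates_rank_policy(accuracies, protected_value, lower=1, upper=1):
    """True if the protected value's rank lies in the non-secure range [lower, upper]."""
    if protected_value not in accuracies:
        return False  # predicted accuracy is zero for the protected value
    target = accuracies[protected_value]
    # Rank among distinct accuracy values (assumption about tie handling).
    rank = 1 + len({a for a in accuracies.values() if a > target})
    return lower <= rank <= upper

# Hypothetical toy relation; Mileage plays the role of the protected attribute.
rows = [
    {"Cyl": 4, "Mileage": "high"},
    {"Cyl": 4, "Mileage": "high"},
    {"Cyl": 4, "Mileage": "med"},
    {"Cyl": 6, "Mileage": "low"},
]
accuracies = class_accuracy_set(rows, {"Cyl": 4}, "Mileage")
```

With this toy relation, the description (Cyl=4) ranks "high" first, so it would be non-secure under policy [1, 1] if "high" were the protected data element.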
A high-level outline of the OB algorithm, EXACT_OB1, is shown in Figure-1. The result of executing this algorithm is the implicit or explicit inspection of
each description belonging to the protected tuple's description space, D*. The
inspection process ensures that all descriptions, d ∈ D*, of a tuple's description
space satisfy the stated description level security policy.
Unfortunately, the number of descriptions that belong to a protected tuple's
description space is exponential in terms of the number of non-protected attributes. However, the recognition of a special type of description that we refer
to as a zero description can significantly reduce the required number of inspections. The tuples that satisfy a zero description are characterized as having a
class label that does not equal the protected data element. The occurrence of a
zero description implies that there is no need to inspect any description that is
a specialization of it, since every such specialization is also a zero description.
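The pruning effect of zero descriptions can be illustrated with a small level-wise enumeration. The sketch below is hypothetical (function name, data layout, and toy relation are assumptions) and covers only the inspection order and the pruning, not the transformation of non-secure descriptions.

```python
from itertools import combinations

def inspect_description_space(rows, protected_tuple, protected_attr, protected_value):
    """Enumerate descriptions level by level, skipping specializations of zero descriptions."""
    pairs = sorted((a, v) for a, v in protected_tuple.items() if a != protected_attr)
    zero = []                  # descriptions found to be zero descriptions
    inspected, skipped = [], []
    for k in range(1, len(pairs) + 1):
        for combo in combinations(pairs, k):
            d = frozenset(combo)
            if any(z <= d for z in zero):
                skipped.append(d)      # specialization of a known zero description
                continue
            inspected.append(d)
            matching = [r for r in rows if all(r.get(a) == v for a, v in d)]
            # Zero description: no satisfying tuple carries the protected value.
            if not any(r[protected_attr] == protected_value for r in matching):
                zero.append(d)
    return inspected, skipped

# Hypothetical toy relation and protected tuple.
rows = [
    {"Cyl": 4, "Fuel": "efi",   "Tran": "auto", "Mileage": "high"},
    {"Cyl": 4, "Fuel": "2-bbl", "Tran": "manu", "Mileage": "med"},
    {"Cyl": 6, "Fuel": "efi",   "Tran": "manu", "Mileage": "low"},
    {"Cyl": 6, "Fuel": "2-bbl", "Tran": "manu", "Mileage": "low"},
]
protected_tuple = {"Cyl": 4, "Fuel": "efi", "Tran": "manu", "Mileage": None}
inspected, skipped = inspect_description_space(rows, protected_tuple, "Mileage", "high")
```

In this toy case (Tran=manu) is a zero description, so the three descriptions that specialize it are never examined and only four of the seven descriptions are inspected.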
As shown in Figure-1, the EXACT_OB1 algorithm determines if a description, d ∈ D*, is a zero description. If d is a zero description, it is placed on a
list so that it is not directly examined by the algorithm, otherwise the algorithm
determines if the description is non-secure. A non-secure description, d, is a
description that violates the stated description level security policy. Currently,
a description, d, is considered non-secure if it violates the protected rank policy
[1, 1].

EXACT_OB1 Algorithm
  k = 1
  while (exists descriptions to inspect)
    D = k-level descriptions requiring protection
        (excluding zero_description list)
    for each (description d in D)
      if (d == zero_description)
        append all specializations of d to zero_description list
      else if (d == non-secure description)
        append d to non-secure description list
    end_for
    transform non-secure descriptions to secure descriptions
      (by protecting subset of attribute values not belonging to
      protected tuple)
    k = k + 1
  end_while

Figure 1. Orthogonal-Boundary Algorithm: EXACT_OB1.

The transformation of a non-secure description into a secure description
is performed by concealing a subset of attribute values from the domain of nonprotected attributes that appear in the training set, but not in the protected tuple.
The concealment of such values must decrease the number of tuples that satisfy
the non-secure description with a class label equal to that of the protected data
element, which can be repeated until there is no longer a violation of policy.
This particular transformation has the following important property: the implementation of a policy, P, at the description level implies the implementation of
P at the description space level. In the context of EXACT_OB1, this property
ensures the implementation of the protected rank policy [1, 1] at the description
space level. It is important to note that EXACT_OB1 is only valid in the context of decision-region based algorithms. In the following section we present
an algorithm that is designed for use with another category of classification
algorithms, referred to as extended decision-region based.
2. EXTENDED DECISION-REGION BASED ALGORITHMS
We have generalized the EXACT_OB1 algorithm to provide an exact assessment of protected data elements with respect to extended decision-region based
algorithms. A classification algorithm, A, is an extended decision-region based
algorithm if and only if the following two conditions are satisfied:
• Condition-1*: It is possible to identify a priori a finite set of descriptions,
D, in terms of the attribute values present in an object O, and the attribute
values present in a finite set of objects O' derived from O, such that the
particular description d used by A to classify O is an element of D.
• Condition-2*: The predicted accuracy of assigning an object satisfying
a description d to a class c is dependent on the distribution of class label
c relative to all other class labels among the objects that satisfy d in the
training set.
In Condition-1*, an object O' is derived from another object O if it has at
least one property in common with O. The difference between Condition-1
and Condition-1* is that the latter condition allows for descriptions that are
not entirely defined in terms of the properties of the target object. The second
condition, Condition-2*, is identical to Condition-2 defined for decision-region
based algorithms. Thus, a decision-region based algorithm is also an extended
decision-region based algorithm.
An example of an extended decision-region based algorithm is C4.5's consult interpreter when applied to protected tuples with missing non-protected attribute values [8]. To illustrate, consider the protected tuple (Cyl=4; Fuel=null;
Tran=manu; Power=med; Mileage=null), where Mileage is the protected attribute, and the following class-accuracy values, 0 (low), .62 (med) and .38
(high), produced by C4.5. These confidence values were obtained through
the assessment of the descriptions shown in Figure-2. It follows from the
descriptions that there are two relevant protected tuples, (Cyl=4; Fuel=efi;
Tran=manu; Power=med; Mileage=null) and (Cyl=4; Fuel=2-bbl;
Tran=manu; Power=med; Mileage=null). In general, there exists a relevant, or
derived, protected tuple for each distinct value in the domain of a missing non-protected attribute. The total number of relevant tuples is equal to the product of
the domain sizes of the various non-protected attributes to which the missing attribute values belong. For instance, in the above example there are two relevant
protected tuples since there is one missing non-protected attribute, Fuel, and its
domain size is two, {2-bbl, efi}. Similarly, if the protected tuple contains two
missing non-protected attributes, Fuel and Power, the total number of relevant
protected tuples is six since the domain size of Fuel is two and the domain size
of Power is three, {low, med, high}. The description space of each relevant
protected tuple must be inspected to determine if each description, d, satisfies
the stated description level policy.
Low  = #(Cyl = 4 & Fuel = 2-bbl & Mileage = low) / #(Cyl = 4) +
       #(Fuel = efi & Cyl = 4) / #(Cyl = 4) *
       #(Cyl = 4 & Fuel = efi & Power = med & Mileage = low) /
       #(Cyl = 4 & Fuel = efi & Power = med)
Med  = #(Cyl = 4 & Fuel = 2-bbl & Mileage = med) / #(Cyl = 4) +
       #(Fuel = efi & Cyl = 4) / #(Cyl = 4) *
       #(Cyl = 4 & Fuel = efi & Power = med & Mileage = med) /
       #(Cyl = 4 & Fuel = efi & Power = med)
High = #(Cyl = 4 & Fuel = 2-bbl & Mileage = high) / #(Cyl = 4) +
       #(Fuel = efi & Cyl = 4) / #(Cyl = 4) *
       #(Cyl = 4 & Fuel = efi & Power = med & Mileage = high) /
       #(Cyl = 4 & Fuel = efi & Power = med)

Figure 2. Example Descriptions.

A high-level description of the EXACT_OB2 algorithm is shown in Figure-3.
This algorithm is an extension of the EXACT_OB1 algorithm, and is designed
for use with extended decision-region based algorithms. The EXACT_OB2 algorithm, like EXACT_OB1, has been developed in the context of descriptions
that are represented by a logical conjunction of (attribute name, value) pairs. In
addition, we assume that the set of descriptions D is limited to those descriptions
generated from the attribute values of the protected tuple as well as from values
in the domain of the missing non-protected attributes of the protected tuple.
Based on these restrictions, the result of executing the EXACT_OB2 algorithm
is the inspection of the descriptions belonging to the description space of each
relevant protected tuple. The structure of EXACT_OB2 is almost identical to the
structure of EXACT_OB1. The only difference is that EXACT_OB2 inspects
the description space of multiple protected tuples as opposed to a single protected tuple. The biggest limitation of EXACT_OB2, as well as EXACT_OB1,
is the potentially high execution time. Specifically, the execution time of both
algorithms is dependent upon the number of non-protected attributes and the
occurrence of zero descriptions. As a result, we have developed an alternative assessment algorithm to approximate the exact assessment performed by
EXACT_OB1.

EXACT_OB2 Algorithm
  PT = generate all relevant protected tuples
  for each (protected tuple in PT)
    apply EXACT_OB1
  end_for

Figure 3. Orthogonal-Boundary Algorithm: EXACT_OB2.
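The first step of EXACT_OB2, generating the relevant protected tuples, amounts to taking the Cartesian product of the domains of the missing non-protected attributes. A minimal Python sketch follows; the function name and the use of None for a missing value are assumptions made for illustration.

```python
from itertools import product

def relevant_protected_tuples(protected_tuple, domains, protected_attr):
    """Fill every missing non-protected attribute with each value of its domain."""
    missing = [a for a, v in protected_tuple.items()
               if v is None and a != protected_attr]
    relevant = []
    for values in product(*(domains[a] for a in missing)):
        candidate = dict(protected_tuple)
        candidate.update(zip(missing, values))
        relevant.append(candidate)
    return relevant

# Domains taken from the example in the text.
domains = {"Fuel": ["2-bbl", "efi"], "Power": ["low", "med", "high"]}

one_missing = {"Cyl": 4, "Fuel": None, "Tran": "manu", "Power": "med", "Mileage": None}
two_missing = {"Cyl": 4, "Fuel": None, "Tran": "manu", "Power": None, "Mileage": None}

tuples_one = relevant_protected_tuples(one_missing, domains, "Mileage")
tuples_two = relevant_protected_tuples(two_missing, domains, "Mileage")
```

The two calls reproduce the counts given in the text: two relevant tuples when only Fuel is missing, and six when both Fuel and Power are missing. EXACT_OB1 would then be applied to each generated tuple.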
3. APPROXIMATE ASSESSMENT ALGORITHM
In this section, we present an assessment algorithm, which we will refer
to as APPROX_EVAL. It approximates the exact assessment performed by
EXACT_OB1. The idea behind APPROX_EVAL is to decrease the required
execution time by reducing the size of the description space. One way to reduce
the size of a description space is to transform a non-secure description into a
secure description by concealing one or more non-protected attribute values.
For example, if (Cyl=4) is a non-secure description with respect to the protected
tuple, (Fuel=efi; Cyl=4; Tran=manu; Mileage=null), then the concealment of
the non-protected attribute value, Cyl=4, removes the description Cyl=4 from
D*. In other words, the description cannot be used by a decision-region based
algorithm to assign a value to the protected attribute.
The concealment of Cyl=4 also removes the following descriptions from D*: (Fuel=efi ∧ Cyl=4), (Cyl=4 ∧ Tran=manu), and (Fuel=efi ∧ Cyl=4 ∧
Tran=manu). As a result, there is no need to explicitly inspect any of these
descriptions. However, as illustrated in the previous section, the concealment
of non-protected attribute values may cause a classification algorithm that normally adheres to the conditions of a decision-region based algorithm to violate
Condition-1 [5]. The APPROX_EVAL algorithm must, therefore, determine if
a concealed non-protected attribute value is likely to result in such a violation.
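The reduction of D* caused by concealing Cyl=4 can be checked by enumerating the description space before and after concealment. The following Python sketch is illustrative only; the helper names are hypothetical.

```python
from itertools import combinations

def description_space(pairs):
    """All non-empty conjunctions of the given (attribute, value) pairs."""
    pairs = sorted(pairs)
    return [frozenset(c)
            for k in range(1, len(pairs) + 1)
            for c in combinations(pairs, k)]

def conceal(pairs, attribute):
    """Drop the (attribute, value) pair of a concealed non-protected attribute."""
    return [(a, v) for a, v in pairs if a != attribute]

# Non-protected part of the protected tuple (Fuel=efi; Cyl=4; Tran=manu; Mileage=null).
pairs = [("Fuel", "efi"), ("Cyl", 4), ("Tran", "manu")]

full_space = description_space(pairs)
reduced_space = description_space(conceal(pairs, "Cyl"))
removed = [d for d in full_space if d not in reduced_space]
```

For three non-protected values the full space holds seven descriptions; concealing Cyl=4 removes the four that mention it, (Cyl=4) itself plus the three conjunctions listed above, and leaves three to inspect.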
A high-level outline of APPROX_EVAL is shown in Figure-4. If it is determined that a violation is unlikely, then the protected tuple is evaluated using
the APPROX_OB algorithm; otherwise, it is evaluated using the EXACT_OB1
algorithm described in the previous section. The difference between the two
algorithms is that APPROX_OB, unlike EXACT_OB1, transforms a non-secure
description into a secure description by concealing one or more attribute values
from the domain of non-protected attributes that appear in the protected tuple.
In general, the execution time of APPROX_OB will be less than that of EXACT_OB1 due to the concealment of non-protected attribute values, and will
also maximize the amount of available data.
The APPROX_EVAL algorithm may not achieve an exact assessment since
it must predict the impact of a concealed non-protected attribute value on a
decision-region based algorithm. However, the results obtained from the experimental investigation suggest that the prediction is likely to be consistent
with an exact assessment.
APPROX_EVAL Algorithm
  prediction = predict impact of concealed non-protected attribute values
  if (prediction == violation of Condition-1)
    apply EXACT_OB1 algorithm
  else
    apply APPROX_OB algorithm

Figure 4. Approximate Evaluation Algorithm (High-Level).

3.1. Predicted Accuracy Measure
We have developed a measure to compute a class-accuracy set to determine
if a concealed non-protected attribute value is likely to result in a violation
of Condition-1. In this context, the class-accuracy set is computed independently of any specific description belonging to the protected tuple's description
space. Instead, the predicted accuracy value a, with respect to a class label c, is
computed based upon the degree to which the protected tuple's non-protected
attribute values predict the class c. The degree to which a non-protected attribute
value predicts a protected attribute class depends upon the attribute value's relationship to the tuples that belong to the class as well as to the tuples that belong
to other classes.
These relationships form the basis of our measure. Specifically, the proposed
measure M, as shown in Figure-5, assigns a numerical value to a non-protected
attribute value, a, that reflects the degree to which a predicts the protected
attribute class b. The value N is equal to the number of tuples in the relation
instance. The range of M is the interval [-1, 1], where a value of positive
(negative) one implies that the non-protected attribute value provides complete
support (non-support) for the protected attribute class. The measure is comprised of two functions, class(a,b) and nonclass(a,b). The function class(a,b)
captures the degree to which an implication of the form (b → a') weakens
an implication of the form (a → b). The quantity #(a,b) is the number of
tuples in the relation instance with (attribute name, value) pairs A:a and B:b;
and the quantity #(a',b) is the number of tuples in the relation with (attribute
name, value) pairs A:a' and B:b. The symbol a' represents an element belonging to the set Dom(A) - a. The other function, nonclass(a,b), captures the
degree to which an implication of the form (b' → a') weakens an implication
of the form (a → b'). The symbol b' represents an element belonging to the
set Dom(B) - b. The quantity #(a,b_i) represents the number of tuples in the
relation instance with (attribute name, value) pairs A:a and B:b_i, where b_i is an
instance of B such that b_i ≠ b; and #(a',b_i) represents the number of tuples in
the relation instance with (attribute name, value) pairs A:a' and B:b_i. Finally,
the condition expressed as cond-1 is satisfied if #(a,b_i) / #(a',b_i) = 1 and #(a,b_j)
/ #(a',b_j) < 1 for all j ≠ i, where b_i, b_j ∈ Dom(B) - b. Intuitively, the function
nonclass(a,b) computes the degree of predictability associated with the class,
b', that the attribute value a predicts to the greatest extent.
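$$
M(a,b) =
\begin{cases}
+1 & \text{if } \#(a,b') = 0 \\
-1 & \text{if } \#(a,b) = 0 \\
\log_2 \dfrac{\mathrm{class}(a,b)/N + 1}{\mathrm{nonclass}(a,b)/N + 1} & \text{otherwise}
\end{cases}
$$

$$
\mathrm{class}(a,b) =
\begin{cases}
\#(a,b) - \#(a',b) & \text{if } \#(a,b) \ge \#(a',b) \\
\#(a,b) \,/\, \#(a',b) & \text{otherwise}
\end{cases}
$$

$$
\mathrm{nonclass}(a,b) =
\begin{cases}
0 & \text{if cond-1} \\
\max\limits_{b_i \in \mathrm{Dom}(B) - b}
\begin{cases}
\#(a,b_i) - \#(a',b_i) & \text{if } \#(a,b_i) \ge \#(a',b_i) \\
\#(a,b_i) \,/\, \#(a',b_i) & \text{otherwise}
\end{cases}
& \text{otherwise}
\end{cases}
$$

Figure 5. Predicted Accuracy Measure M.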
The definitions of class(a,b) and nonclass(a,b) represent the concept of a
non-protected attribute's relevance and non-relevance odds, respectively. This
type of relevance relationship has been used previously in information retrieval
systems to determine a document term's degree of importance in distinguishing
relevant and non-relevant documents [9].
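One possible reading of the measure M is sketched below in Python. Several points are assumptions rather than statements from the paper: #(a',b) is taken as the number of tuples with A ≠ a and B = b, the "otherwise" branch of nonclass(a,b) is taken to be the ratio #(a,b_i) / #(a',b_i) by analogy with class(a,b), cond-1 is tested on counts to avoid division by zero, and all function names and the toy relation are hypothetical.

```python
import math

def odds(with_a, without_a):
    """Difference when the value dominates, ratio otherwise (shape of class(a,b))."""
    return with_a - without_a if with_a >= without_a else with_a / without_a

def measure_m(rows, attr, a, target, b):
    """Degree to which non-protected value `a` of `attr` predicts class `b` of `target`."""
    n = len(rows)

    def count(has_a, cls):
        return sum(1 for r in rows if (r[attr] == a) == has_a and r[target] == cls)

    others = {r[target] for r in rows} - {b}
    if sum(count(True, c) for c in others) == 0:   # #(a,b') = 0
        return 1.0
    if count(True, b) == 0:                        # #(a,b) = 0
        return -1.0
    class_val = odds(count(True, b), count(False, b))
    pairs = [(count(True, c), count(False, c)) for c in others]
    # cond-1 (assumed reading): exactly one ratio equals 1, all others are below 1.
    ties = sum(1 for x, y in pairs if x == y)
    below = sum(1 for x, y in pairs if x < y)
    cond_1 = ties == 1 and below == len(pairs) - 1
    nonclass_val = 0.0 if cond_1 else max(odds(x, y) for x, y in pairs)
    return math.log2((class_val / n + 1) / (nonclass_val / n + 1))

# Hypothetical relation with attribute A in {x, y} and protected attribute B in {p, q}.
rows = ([{"A": "x", "B": "p"}] * 3 + [{"A": "y", "B": "p"}]
        + [{"A": "x", "B": "q"}] + [{"A": "y", "B": "q"}] * 3)
m_x_p = measure_m(rows, "A", "x", "B", "p")
m_y_p = measure_m(rows, "A", "y", "B", "p")
```

In this toy relation the value x mostly co-occurs with class p, so M(x, p) is positive, while M(y, p) is its mirror image. Both stay inside the interval [-1, 1].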
3.2. APPROX_EVAL Algorithm
The measure M is used to compute the class-accuracy set of a protected
tuple. Specifically, the assigned weights that correspond to a class, C, are
summed together to produce the corresponding predicted accuracy value. For
example, suppose the application of M with respect to the protected tuple,
(Cyl=6; Fuel=2-bbl; Tran=manu; Power=low; Mileage=null), produces the
results shown in Figure-6. In this instance, the predicted class-accuracy set is
{(low, -1.09), (med, -.255), (high, -.946)}. In general, a computed predicted
accuracy value represents the degree to which the protected tuple predicts the
corresponding protected data element.
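The summation can be reproduced directly from the weights reported in Figure-6. The short Python sketch below is illustrative; the dictionary layout and function name are assumptions.

```python
# Weights assigned by M to each non-protected value, per class (from Figure-6).
weights = {
    "low":  {"Cyl=6": 0.1699,  "Fuel=2-bbl": -0.0458, "Tran=manu": -0.2157, "Power=low": -1.0},
    "med":  {"Cyl=6": -0.1699, "Fuel=2-bbl": -0.0771, "Tran=manu": 0.0,     "Power=low": -0.0079},
    "high": {"Cyl=6": -1.0,    "Fuel=2-bbl": 0.0458,  "Tran=manu": 0.0,     "Power=low": 0.0079},
}

def predicted_class_accuracy(weights):
    """Sum the per-value weights of each class to get its predicted accuracy value."""
    return {label: sum(per_value.values()) for label, per_value in weights.items()}

accuracy_set = predicted_class_accuracy(weights)
```

The sums are -1.0916 (low), -0.2549 (med), and -0.9463 (high), which agree with the class-accuracy set reported above up to rounding.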
The class-accuracy set is used in conjunction with the Protected Minimum
Range (PMR) security policy to determine if there is a violation of Condition-1. The application of a PMR policy is dependent upon the rank position of
the predicted accuracy value, a_i, of the protected data element relative to the
other predicted accuracy values. If a_i is the maximum predicted accuracy
value, then the pmr value is the difference between a_i and the next largest
predicted accuracy value a_k; otherwise, it is the difference between a_i and
the maximum predicted accuracy value.
As shown in Figure-7, if pmr is greater than or equal to some established threshold,
then the APPROX_EVAL algorithm evaluates the protected data element using
the EXACT_OB1 algorithm; otherwise, the protected data element is evaluated
using the APPROX_OB algorithm.

Low:  {Cyl=6: .1699; Fuel=2-bbl: -.0458; Tran=manu: -.2157; Power=low: -1}
Med:  {Cyl=6: -.1699; Fuel=2-bbl: -.0771; Tran=manu: 0; Power=low: -.0079}
High: {Cyl=6: -1; Fuel=2-bbl: .0458; Tran=manu: 0; Power=low: .0079}

Figure 6. Application of Measure M.

APPROX_EVAL Algorithm
  compute measure M with respect to protected tuple
  compute corresponding class-accuracy set
  pmr = apply Protected Minimum Range policy to class-accuracy set
  if (pmr >= threshold-value)
    apply EXACT_OB1 algorithm
  else
    apply APPROX_OB algorithm

Figure 7. Approximate Evaluation Algorithm (Detail-Level).
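The PMR computation and the resulting choice of algorithm can be sketched as follows. Both cases of the pmr definition reduce to subtracting the largest accuracy value among the other classes. The function names are hypothetical, the text does not say which class is the protected data element in the running example (so it is passed as a parameter), and the threshold of 2.0 is only the value suggested by the experiments in Section 4.

```python
def pmr(accuracies, protected_value):
    """Protected Minimum Range: a_i minus the largest accuracy among the other classes.

    If a_i is the maximum this is the gap to the next largest value;
    otherwise it is the (negative) gap to the maximum.
    """
    target = accuracies[protected_value]
    others = [a for label, a in accuracies.items() if label != protected_value]
    return target - max(others)

def choose_algorithm(pmr_value, threshold):
    """Dispatch as in the detail-level outline: exact assessment when pmr >= threshold."""
    return "EXACT_OB1" if pmr_value >= threshold else "APPROX_OB"

# Class-accuracy set of the running example.
accuracies = {"low": -1.09, "med": -0.255, "high": -0.946}

pmr_if_med = pmr(accuracies, "med")   # protected element assumed to be "med"
pmr_if_low = pmr(accuracies, "low")   # protected element assumed to be "low"
decision = choose_algorithm(pmr_if_med, threshold=2.0)
```

If the protected data element were med, pmr would be 0.691 and the tuple would be handed to APPROX_OB; if it were low, pmr would be negative (-0.835) and the same branch would be taken.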
4. EXPERIMENTAL INVESTIGATION
Experiments were conducted to validate the use of the APPROX_EVAL algorithm in implementing the description space level protected rank policy [1, 1].
In conducting our initial experiments, we restricted the evaluation to C4.5's
decision-tree classifier and interpreter [8]. The experiments utilized ten distinct
data sets, each of which contained at least ten non-protected attribute values
and a single protected attribute with a domain size of five. All of the relations
were constructed through the application of the Synthetic Classification Data
Set (SCDS) program [10]. A total of fifty-one protected tuples were used in
the investigation. Twenty-one of the protected tuples violated the description
space level policy [1, 1].
Figure-8 shows the pmr value computed for each of the fifty-one protected
tuples. The APPROX_OB algorithm was applied to each protected tuple. The
application of APPROX_OB resulted in the transformation of a protected tuple
T into a protected tuple T' that, in general, contained one or more concealed
non-protected attribute values. The transformed protected tuples, T', were
classified using C4.5's consult interpreter. In order to assess the generality of the
APPROX_EVAL algorithm, two distinct decision-tree models were generated
for each relation instance. One model was constructed using the gain attribute
selection criterion, while the other model was constructed using the gain ratio
attribute selection criterion. We identified those protected tuples that violated
the description space level protected rank policy [1, 1]. There were a total of
five such tuples, five with respect to the gain ratio criterion and one with respect
to the gain criterion. These five tuples are shown as solid-filled columns in
Figure-8.

Figure 8. Computed pmr Values for Protected Tuples.
The experimental results suggest that a pmr value greater than or equal to two
identifies a potential violation of Condition-1. Unfortunately, such a pmr value
may produce false positives. It appears that the occurrence of false positives
may be related to the percentage of non-relevant attributes contained within a
relation.
5. CONCLUSION AND FUTURE WORK
In this paper we identified a new category of classification mining algorithms, referred to as extended decision-region based, and presented a security
risk assessment algorithm, EXACT_OB2, designed specifically for use with
this category of algorithms. In addition, we proposed an alternative assessment algorithm, APPROX_EVAL, designed for use with decision-region based
algorithms. This particular assessment algorithm is characterized as having a
relatively low execution time, but may not achieve an exact assessment.
We have several additional research projects planned with respect to this
challenging research area. Our immediate plans include further verification
of experimental results using additional extended decision-region based algorithms and data sets, extending the security assessments to include continuously
valued attributes, and the development of additional security assessments for
other groups of mining algorithms.
References
[1] Chang, L. and Moskowitz, I. (1998). Parsimonious Downgrading and Decision Trees
Applied to the Inference Problem. Proceedings of New Security Paradigms, pp. 82-89.
[2] Chang, L. and Moskowitz, I. (2000). An Integrated Framework for Database Privacy
Protection. Proceedings of the Fourteenth Annual IFIP WG 11.3 Working Conference on
Database Security.
[3] Clifton, C. and Marks, D. (1996). Security and Privacy Implications of Data Mining. 1996
SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, pp.
15-19.
[4] Clifton, C. (1999). Protecting Against Data Mining Through Samples. Proceedings of the
Thirteenth Annual IFIP WG 11.3 Working Conference on Database Security, pp. 193-207.
[5] Johnsten, T. and Raghavan, V. (1999). Impact of Decision-Region Based Classification
Mining Algorithms on Database Security. Proceedings of the Thirteenth Annual IFIP WG
11.3 Working Conference on Database Security, pp. 177-191.
[6] Holsheimer, M. and Siebes, A. (1994). Data Mining: The Search for Knowledge in
Databases. Report CS-R9406. CWI, Amsterdam, The Netherlands.
[7] Steinberg, D. and Colla, P. (1997). CART User Manual. Salford Systems, San Diego, CA.
[8] Quinlan, J. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.
[9] Robertson, S. and Jones, K. (1976). Relevance Weighting for Search Terms. Journal of
the American Society for Information Science, May-June, pp. 129-145.
[10] Synthetic Classification Data Sets,
http://fas.sfu.ca/cs/people/GradStudents/melli/SCDS/intro.html.