Let $a$ be a numerical attribute with domain $[a, b]$. A partition of the domain $[a, b]$ into $k$ intervals
$$\{[a_0, a_1), [a_1, a_2), \ldots, [a_{k-1}, a_k]\},$$
where $a_0 = a$, $a_k = b$, and $a_{i-1} < a_i$ for $i = 1, 2, \ldots, k$, determines a discretization of $a$. The numbers $a_1$, $a_2$, ..., $a_{k-1}$ are called cut points. In this paper, corresponding intervals are denoted as follows:
$$a_0..a_1,\ a_1..a_2,\ \ldots,\ a_{k-1}..a_k.$$
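As a minimal sketch of this notation, the following code maps raw attribute values to interval labels for a given list of cut points; the function name and the sample values are hypothetical illustrations, not entries of Table 1.

```python
def interval_labels(values, cut_points):
    """Assign each value the interval a_i..a_{i+1} determined by the cut points.

    The domain endpoints are approximated here by the observed min and max.
    """
    bounds = [min(values)] + sorted(cut_points) + [max(values)]
    labels = []
    for v in values:
        # take the first interval whose right endpoint is not exceeded
        for lo, hi in zip(bounds, bounds[1:]):
            if v < hi or hi == bounds[-1]:
                labels.append(f"{lo}..{hi}")
                break
    return labels

# hypothetical attribute values and a single cut point
print(interval_labels([3.9, 4.3, 4.8], [4.1]))  # ['3.9..4.1', '4.1..4.8', '4.1..4.8']
```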
An example of a dataset with numerical attributes is presented in Table 1. In this table, all cases are described by variables called attributes and one variable called a decision. The set of all attributes is denoted by $A$. The decision is denoted by $d$. The set of all cases is denoted by $U$. In Table 1, the attributes are length, width and height, while the decision is quality. Additionally, $U = \{1, 2, 3, 4, 5, 6, 7\}$. For a subset $S$ of the set $U$ of all cases, an entropy of a variable $v$ (attribute or decision) with values $v_1$, $v_2$, ..., $v_n$ is defined by the following formula:
$$H_S(v) = -\sum_{i=1}^{n} p(v_i) \cdot \log p(v_i),$$
where $p(v_i)$ is the probability (relative frequency) of value $v_i$ in the set $S$, $i = 1, 2, \ldots, n$. All logarithms in this paper are binary.
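A minimal sketch of this entropy, computed from the list of values a variable takes on a set $S$ of cases. The decision labels q1..q4 are hypothetical stand-ins, but their distribution matches the concepts $\{1\}$, $\{2\}$, $\{3, 4, 5\}$, $\{6, 7\}$ quoted below.

```python
from collections import Counter
from math import log2

def entropy(values):
    """H_S(v): binary entropy of the value distribution of a variable on S."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# hypothetical decision labels for cases 1..7 with the distribution of Table 1
quality = ['q1', 'q2', 'q3', 'q3', 'q3', 'q4', 'q4']
print(round(entropy(quality), 3))  # 1.842
```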
2.1. Stopping Criterion for Discretization
A stopping criterion of the process of discretization, described in this paper, is the level of consistency [5], based on rough set theory [35,36]. For any subset $B$ of the set $A$ of all attributes, an indiscernibility relation $IND(B)$ is defined, for any $x, y \in U$, in the following way:
$$(x, y) \in IND(B) \text{ if and only if } a(x) = a(y) \text{ for any } a \in B,$$
where $a(x)$ denotes the value of the attribute $a \in A$ for the case $x \in U$. The relation $IND(B)$ is an equivalence relation. The equivalence classes of $IND(B)$ are denoted by $[x]_B$ and are called B-elementary sets. Any finite union of B-elementary sets is B-definable.
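One way to compute the B-elementary sets is to group cases by their tuple of values on $B$, as in the following sketch. The raw length values here are hypothetical, chosen only to be consistent with the cut points 4.1 and 4.5 used later, so that the resulting partition matches the $\{length\}$-elementary sets quoted below.

```python
from collections import defaultdict

def elementary_sets(table, B):
    """Partition U into B-elementary sets: group cases by their values on B."""
    blocks = defaultdict(set)
    for case, row in table.items():
        blocks[tuple(row[a] for a in sorted(B))].add(case)
    return list(blocks.values())

# hypothetical raw values (actual Table 1 entries are not repeated here)
table = {1: {'length': 4.0}, 2: {'length': 4.0}, 3: {'length': 4.2},
         4: {'length': 4.2}, 5: {'length': 4.8}, 6: {'length': 4.8}, 7: {'length': 4.8}}
print(elementary_sets(table, {'length'}))  # [{1, 2}, {3, 4}, {5, 6, 7}]
```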
A partition on $U$ constructed from all B-elementary sets of $IND(B)$ is denoted by $B^*$. $\{d\}$-elementary sets are called concepts, where $d$ is a decision. For example, for Table 1, if $B = \{length\}$, then $B^* = \{\{1, 2\}, \{3, 4\}, \{5, 6, 7\}\}$ and $\{d\}^* = \{\{1\}, \{2\}, \{3, 4, 5\}, \{6, 7\}\}$. In general, an arbitrary $X \in \{d\}^*$ is not B-definable. For example, the concept $\{1\}$ is not B-definable. However, any $X \in \{d\}^*$ may be approximated by a B-lower approximation of $X$, denoted by $\underline{B}X$ and defined as follows:
$$\underline{B}X = \{x \in U \mid [x]_B \subseteq X\},$$
and by a B-upper approximation of $X$, denoted by $\overline{B}X$ and defined as follows:
$$\overline{B}X = \{x \in U \mid [x]_B \cap X \neq \emptyset\}.$$
In our example, $\underline{B}\{1\} = \emptyset$ and $\overline{B}\{1\} = \{1, 2\}$. The B-lower approximation of $X$ is the greatest B-definable set contained in $X$. The B-upper approximation of $X$ is the least B-definable set containing $X$. A level of consistency [5], denoted by $L(B)$, is defined as follows:
$$L(B) = \frac{\sum_{X \in \{d\}^*} |\underline{B}X|}{|U|}.$$
Usually, the requested level of consistency for discretization is 1.0, i.e., we want the discretized dataset to be consistent. For example, for Table 1, the level of consistency $L(A)$ is equal to 1.0, since $A^* = \{\{1\}, \{2\}, \{3\}, \{4\}, \{5\}, \{6\}, \{7\}\}$ and, for any $X$ from $\{quality\}^* = \{\{1\}, \{2\}, \{3, 4, 5\}, \{6, 7\}\}$, we have $\underline{A}X = X$. Additionally, $L(B) \approx 0.286$, where $B = \{length\}$.
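A sketch of the level of consistency computed directly from a partition $B^*$ and the concepts $\{d\}^*$; it reproduces $L(\{length\}) \approx 0.286$ from the partitions quoted above.

```python
def lower_approximation(X, blocks):
    """Union of the B-elementary sets entirely contained in the concept X."""
    return set().union(*(b for b in blocks if b <= X))

def level_of_consistency(blocks, concepts, n_cases):
    """L(B): sum over concepts X of |lower approximation of X|, divided by |U|."""
    return sum(len(lower_approximation(X, blocks)) for X in concepts) / n_cases

B_star = [{1, 2}, {3, 4}, {5, 6, 7}]        # {length}-elementary sets
concepts = [{1}, {2}, {3, 4, 5}, {6, 7}]    # {quality}-elementary sets
print(round(level_of_consistency(B_star, concepts, 7), 3))  # 0.286
```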
2.2. Equal Interval Width and Equal Frequency per Interval
Some discretization methods are obvious. To this category belong the equal interval width and equal frequency per interval methods [14]. These methods are applied to a single attribute at a time, so they are called local [14]. Note that a discretization method is called global if it depends on all attributes [14]. In a local discretization method, the user must specify a positive integer $k$, the number of intervals required for discretization. In the former method, the domain of a numerical attribute $a$ is divided into $k$ intervals of approximately equal width. In the latter method, the domain of the numerical attribute $a$ is divided into $k$ intervals, each containing approximately the same number of cases.
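The two local methods differ only in how cut points are placed, as the following sketch illustrates. The function names and sample values are hypothetical, and equal-frequency cut points are approximated here by midpoints between neighboring sorted values.

```python
def equal_width_cuts(values, k):
    """Cut points dividing the domain [min, max] into k intervals of equal width."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    return [lo + i * step for i in range(1, k)]

def equal_frequency_cuts(values, k):
    """Cut points placing approximately the same number of cases in each interval."""
    xs = sorted(values)
    return [(xs[i * len(xs) // k - 1] + xs[i * len(xs) // k]) / 2 for i in range(1, k)]

values = [1.4, 1.5, 1.7, 1.8, 1.9, 2.0, 2.6]   # hypothetical attribute values
print(equal_width_cuts(values, 2))              # [2.0]
print(equal_frequency_cuts(values, 2))          # [1.75]
```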
These two methods were converted to global methods, using the idea of entropy, in [5]. First, all numerical attributes are discretized with $k = 2$. Then, the level of consistency is computed for the set of all discretized attributes. If the level of consistency satisfies the requirement, the discretization is done. If not, the worst attribute is selected for further discretization.
For an attribute $a$, let $a^D$ denote the discretized attribute. Let $A^D$ denote the set of all discretized attributes. For any partially-discretized attribute $a^D$, we define a measure of quality, called the average block entropy, in the following way:
$$M(a^D) = \frac{\sum_{X \in \{a^D\}^*} \frac{|X|}{|U|} \cdot H_X(d)}{|\{a^D\}^*|}.$$
A partially-discretized attribute $a^D$ with the largest $M(a^D)$ is the worst attribute [5]. The worst attribute is the subject of additional discretization, with its domain divided into $k + 1$ intervals. The rest of the algorithm is defined by recursion. These new methods of discretization are called the globalized version of equal interval width and the globalized version of equal frequency per interval. As follows from [2], both methods are quite successful.
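A sketch of the average block entropy, taking a partition $\{a^D\}^*$ and the decision label of each case; the helper names and the decision labels q1..q4 are hypothetical stand-ins with the distribution of the concepts in Table 1.

```python
from collections import Counter
from math import log2

def entropy(values):
    counts = Counter(values)
    return -sum((c / len(values)) * log2(c / len(values)) for c in counts.values())

def average_block_entropy(blocks, decision):
    """M(a^D): mean over the blocks X of {a^D}* of |X|/|U| * H_X(d)."""
    n = sum(len(X) for X in blocks)
    return sum(len(X) / n * entropy([decision[x] for x in X]) for X in blocks) / len(blocks)

decision = {1: 'q1', 2: 'q2', 3: 'q3', 4: 'q3', 5: 'q3', 6: 'q4', 7: 'q4'}
print(round(average_block_entropy([{1, 2, 3, 4}, {5, 6, 7}], decision), 3))  # 0.625
```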
We will illustrate the globalized version of the equal frequency per interval method by applying it to Table 1. We need to compute the values of cut points for the first iteration. These cut points are: 4.5 for length, 1.85 for width and 1.55 for height. Table 2 presents the partially-discretized dataset. The partitions on $U$ corresponding to the partially-discretized attributes are:
$$\{length^D\}^* = \{\{1, 2, 3, 4\}, \{5, 6, 7\}\},$$
$$\{width^D\}^* = \{\{1, 2, 3, 5, 7\}, \{4, 6\}\},$$
$$\{height^D\}^* = \{\{1, 2, 3, 5\}, \{4, 6, 7\}\}.$$
The partition $(A^D)^*$ is $\{\{1, 2, 3\}, \{4\}, \{5\}, \{6\}, \{7\}\}$. Thus, the level of consistency, after the initial discretization, is $L(A^D) = \frac{4}{7} \approx 0.571$. In order to find the worst attribute, we compute the average block entropy for all initially discretized attributes. For the attribute length,
$$M(length^D) = \frac{\frac{4}{7} \cdot H_{\{1,2,3,4\}}(d) + \frac{3}{7} \cdot H_{\{5,6,7\}}(d)}{2} = \frac{\frac{4}{7} \cdot 1.5 + \frac{3}{7} \cdot 0.918}{2} \approx 0.625.$$
Additionally, $M(width^D) \approx 0.829$ and $M(height^D) \approx 0.625$. The worst attribute is width. We need to compute new cut points for width with $k = 3$. The new cut point is 1.75. This time, $\{width^D\}^* = \{\{1, 5\}, \{2, 3, 7\}, \{4, 6\}\}$, so $(A^D)^* = \{\{1\}, \{2, 3\}, \{4\}, \{5\}, \{6\}, \{7\}\}$, and the new level of consistency $L(A^D)$ is 0.714. The average block entropy, for width with $k = 3$, is 0.417. This time, the worst attribute is length. The third cut point for length is 4.1. This time, $(A^D)^* = \{\{1\}, \{2\}, \{3\}, \{4\}, \{5\}, \{6\}, \{7\}\}$. Table 3 presents the new discretized table. Furthermore, the level of consistency $L(A^D)$ for Table 3 is one, so the process of discretization is done.
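As a check on the first iteration above, a self-contained sketch (with the same hypothetical decision labels q1..q4) reproduces the three average block entropies and identifies width as the worst attribute:

```python
from collections import Counter
from math import log2

def H(labels):
    c = Counter(labels)
    return -sum((v / len(labels)) * log2(v / len(labels)) for v in c.values())

def M(blocks, d):
    """Average block entropy of a partially-discretized attribute."""
    n = sum(len(X) for X in blocks)
    return sum(len(X) / n * H([d[x] for x in X]) for X in blocks) / len(blocks)

d = {1: 'q1', 2: 'q2', 3: 'q3', 4: 'q3', 5: 'q3', 6: 'q4', 7: 'q4'}
partitions = {'length': [{1, 2, 3, 4}, {5, 6, 7}],
              'width':  [{1, 2, 3, 5, 7}, {4, 6}],
              'height': [{1, 2, 3, 5}, {4, 6, 7}]}
for a, blocks in partitions.items():
    print(a, round(M(blocks, d), 3))   # length 0.625, width 0.829, height 0.625
```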
2.3. Multiple Scanning
In the multiple scanning discretization method, the entire attribute set is scanned $t$ times; $t$ is a parameter selected by the user. In our experiments, we applied $t = 1, 2, \ldots$, until the error rate, a result of ten-fold cross-validation computed by C4.5, was the same for two consecutive values of $t$. Initially, for each attribute, the best cut point is selected using the minimum of the conditional entropy $H(d \mid q)$ over all possible cut points $q$, where a cut point $q$ splits the set $S$ of cases into two subsets, $S_1$ and $S_2$, and
$$H_S(d \mid q) = \frac{|S_1|}{|S|} \cdot H_{S_1}(d) + \frac{|S_2|}{|S|} \cdot H_{S_2}(d).$$
During the next scans (i.e., for $t = 2, 3, \ldots$), the entire attribute set is scanned again; for each attribute, we identify one cut point: for each block $X$ of $(A^D)^*$, the best cut point is selected, and the best cut point among all such blocks is accepted as the best cut point for the attribute. If the requested parameter $t$ is reached and the dataset needs more discretization, since $L(A^D) < 1$, the dominant attribute technique is used for the remaining discretization.
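A sketch of this cut point selection: the conditional entropy of a candidate cut point is computed from the two blocks it induces, and the candidate with the minimum value wins. The raw length values are hypothetical, chosen to be consistent with the cut points 4.1 and 4.5.

```python
from collections import Counter
from math import log2

def H(labels):
    c = Counter(labels)
    return -sum((v / len(labels)) * log2(v / len(labels)) for v in c.values())

def cond_entropy(cases, value, decision, q):
    """H_S(d | q): weighted entropy of the two blocks induced by cut point q."""
    S1 = [x for x in cases if value[x] < q]
    S2 = [x for x in cases if value[x] >= q]
    return sum(len(S) / len(cases) * H([decision[x] for x in S]) for S in (S1, S2))

def best_cut(cases, value, decision, candidates):
    """Select the candidate cut point with minimum conditional entropy."""
    return min(candidates, key=lambda q: cond_entropy(cases, value, decision, q))

# hypothetical raw values; decision labels as in the earlier sketches
length = {1: 4.0, 2: 4.0, 3: 4.2, 4: 4.2, 5: 4.8, 6: 4.8, 7: 4.8}
quality = {1: 'q1', 2: 'q2', 3: 'q3', 4: 'q3', 5: 'q3', 6: 'q4', 7: 'q4'}
print(best_cut(range(1, 8), length, quality, [4.1, 4.5]))  # 4.1
```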
Let us discretize Table 1 using the multiple scanning method. First, we need to compute the conditional entropy $H(d \mid q)$ for each attribute and for all possible cut points $q$. The first attribute is length, with two possible cut points: 4.1 and 4.5. The corresponding conditional entropies are:
$$H(d \mid 4.1) = \frac{2}{7} \cdot 1.0 + \frac{5}{7} \cdot 0.971 \approx 0.979, \qquad H(d \mid 4.5) = \frac{4}{7} \cdot 1.5 + \frac{3}{7} \cdot 0.918 \approx 1.250.$$
For the attribute length, the better cut point is 4.1. For the attribute width, there are two possible cut points: 1.75 and 1.85, with $H(d \mid 1.75) = 1.373$ and $H(d \mid 1.85) = 1.659$; the better cut point is 1.75. For the attribute height, there are two possible cut points: 1.45 and 1.55, with $H(d \mid 1.45) = 0.980$ and $H(d \mid 1.55) = 1.251$; the better cut point is 1.45. A partially-discretized dataset, after the first scan, is presented in Table 4.
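The same selection applied to width and height reproduces the conditional entropies quoted above, up to rounding in the final digit. Here each candidate cut point is represented directly by the two blocks it induces on $U$, which are determined by the partitions already given; the decision labels remain the hypothetical q1..q4.

```python
from collections import Counter
from math import log2

def H(labels):
    c = Counter(labels)
    return -sum((v / len(labels)) * log2(v / len(labels)) for v in c.values())

quality = {1: 'q1', 2: 'q2', 3: 'q3', 4: 'q3', 5: 'q3', 6: 'q4', 7: 'q4'}

def H_split(S1, S2):
    """H(d | q) for a cut point q inducing the blocks S1 and S2 on U."""
    n = len(S1) + len(S2)
    return sum(len(S) / n * H([quality[x] for x in S]) for S in (S1, S2))

print(round(H_split({1, 5}, {2, 3, 4, 6, 7}), 3))   # width, 1.75  -> 1.373
print(round(H_split({1, 2, 3, 5, 7}, {4, 6}), 3))   # width, 1.85  -> 1.659
print(round(H_split({1, 2}, {3, 4, 5, 6, 7}), 3))   # height, 1.45 -> 0.979
print(round(H_split({1, 2, 3, 5}, {4, 6, 7}), 3))   # height, 1.55 -> 1.251
```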
For Table 4, $(A^D)^* = \{\{1\}, \{2\}, \{3, 4, 6, 7\}, \{5\}\}$, and $L(A^D) = 0.429$. The remaining discretization is conducted using the dominant attribute method for the sub-table presented in Table 5.
It is clear that the attribute length is the best attribute (with the smallest entropy) and that the remaining cut point is 4.5 for the attribute length. As a result, we obtain Table 6, for which $L(A^D) = 1$.
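To close the example, a self-contained sketch of this last step: on the sub-table of cases $\{3, 4, 6, 7\}$, the conditional entropy of each attribute's remaining cut point is compared, and length wins with entropy 0, since its cut point 4.5 separates the two concepts exactly. The raw values are again hypothetical, chosen only to be consistent with the cut points and partitions above.

```python
from collections import Counter
from math import log2

def H(labels):
    c = Counter(labels)
    return -sum((v / len(labels)) * log2(v / len(labels)) for v in c.values())

def cond_entropy(cases, value, decision, q):
    S1 = [x for x in cases if value[x] < q]
    S2 = [x for x in cases if value[x] >= q]
    return sum(len(S) / len(cases) * H([decision[x] for x in S]) for S in (S1, S2))

# hypothetical raw values for the sub-table of cases {3, 4, 6, 7}
sub = [3, 4, 6, 7]
values = {'length': {3: 4.2, 4: 4.2, 6: 4.8, 7: 4.8},
          'width':  {3: 1.8, 4: 1.9, 6: 1.9, 7: 1.8},
          'height': {3: 1.5, 4: 1.6, 6: 1.6, 7: 1.6}}
cuts = {'length': 4.5, 'width': 1.85, 'height': 1.55}
quality = {3: 'q3', 4: 'q3', 6: 'q4', 7: 'q4'}
for a in values:
    print(a, round(cond_entropy(sub, values[a], quality, cuts[a]), 3))
# length 0.0, width 1.0, height 0.689 -> length is the dominant attribute
```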