Dataset Transformation
To launch the discretizer, use the following command:
$ java mltk.core.processor.Discretizer
It should output a message like this:
Usage: mltk.core.processor.Discretizer
-i input dataset path
-o output dataset path
[-r] attribute file path
[-t] training file path
[-d] discretized attribute file path
[-m] output attribute file path
[-n] maximum num of bins (default: 256)
This class provides two functions. The first is to learn a discretization from training data and create a discretized attribute file. The second is to use that new attribute file (which contains the discretization information) to discretize new datasets.
$ java mltk.core.processor.Discretizer -r <attr file> -t <training data> -m <output attribute file> -i <input dataset> -o <discretized output dataset>
This command loads the training data and discretizes all continuous features into 256 bins (the default). It generates a new attribute file at the path specified by the -m argument. It also takes an input dataset, discretizes it, and saves the new discretized dataset to disk.
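For example, with hypothetical files data.attr (attribute file) and train.txt (training data), an invocation might look like this:
$ java mltk.core.processor.Discretizer -r data.attr -t train.txt -m data.disc.attr -i train.txt -o train.disc.txt
Here train.txt is used both to learn the bin boundaries and as the dataset to discretize; data.disc.attr records the learned bins so that other datasets can be discretized consistently later.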
$ java mltk.core.processor.Discretizer -r <attr file> -i <input dataset> -d <discretized attribute file> -o <discretized output dataset>
This command loads the input dataset, applies the discretization specified by the -d argument, and saves the new discretized dataset to disk.
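For example, to apply a previously learned discretization (stored in the hypothetical file data.disc.attr from the previous example) to a test set:
$ java mltk.core.processor.Discretizer -r data.attr -i test.txt -d data.disc.attr -o test.disc.txt
Reusing the learned bins keeps the encoding of the test set consistent with the training set.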
The Discretizer class can also be used programmatically:
List<Attribute> attributes = instances.getAttributes();
// Discretize every numeric attribute into at most 256 bins
for (int j = 0; j < instances.dimension(); j++) {
    if (attributes.get(j).getType() == Type.NUMERIC) {
        Discretizer.discretize(instances, j, 256);
    }
}
This snippet discretizes all numeric attributes into 256 bins. The corresponding attribute objects are updated in place.
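A minimal end-to-end sketch might look like the following; it assumes MLTK's InstancesReader and InstancesWriter I/O classes for reading and writing datasets, and all file names are placeholders:
import java.util.List;

import mltk.core.Attribute;
import mltk.core.Attribute.Type;
import mltk.core.Instances;
import mltk.core.io.InstancesReader;
import mltk.core.io.InstancesWriter;
import mltk.core.processor.Discretizer;

public class DiscretizeExample {

    public static void main(String[] args) throws Exception {
        // Placeholder paths; substitute your own attribute and data files
        Instances instances = InstancesReader.read("data.attr", "train.txt");

        // Discretize every numeric attribute into at most 256 bins
        List<Attribute> attributes = instances.getAttributes();
        for (int j = 0; j < instances.dimension(); j++) {
            if (attributes.get(j).getType() == Type.NUMERIC) {
                Discretizer.discretize(instances, j, 256);
            }
        }

        // Write the updated attribute file and the discretized dataset
        InstancesWriter.write(instances, "data.disc.attr", "train.disc.txt");
    }
}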
To launch the splitter, use the following command:
$ java mltk.core.processor.InstancesSplitter
It should output a message like this:
Usage: mltk.core.processor.InstancesSplitter
-i input dataset path
-o output directory path
[-r] attribute file path
[-m] splitting mode:parameter. Splitting mode can be split (s) and cross validation (c) (default: c:5)
[-a] attribute name to perform stratified sampling (default: null)
[-s] seed of the random number generator (default: 0)
There are two modes in InstancesSplitter: split (s) and cross validation (c). The output is written under the directory specified by the -o argument; if the directory does not exist, it will be created. Optionally, stratified sampling can be performed. For example, -a label tells the tool to preserve the distribution of the attribute label in all samples.
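For example (file names are placeholders), the following invocation uses the default -m c:5 together with stratified sampling on label:
$ java mltk.core.processor.InstancesSplitter -r data.attr -i data.txt -o splits -a label
The folds are written under the splits directory, and the distribution of the label attribute is kept roughly the same in every fold.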
In split mode (s), the dataset is split into two or more parts, such as a training set and a validation set. The parameters following s determine the portion of points in each part. For example, -m s:0.8 means 80% of the points go to the training set and 20% to the validation set, while -m s:0.7:0.15:0.15 means 70% of the points go to the training set, 15% to the validation set, and 15% to the test set.
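For example (file names are placeholders), a 70/15/15 split could be produced with:
$ java mltk.core.processor.InstancesSplitter -r data.attr -i data.txt -o splits -m s:0.7:0.15:0.15
The resulting training, validation, and test sets are written under the splits directory.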
In cross-validation mode (c), the dataset is split into k folds. Each fold contains a training set, a test set, and an optional validation set. For example, -m c:5 creates 5 directories (cv.0, ..., cv.4) under the output directory, each containing a training set and a test set. 1/k of the points go to the test set and the rest to the training set; in this case, 20% of the points are in the test set and 80% in the training set. All test sets are disjoint, and their union is the whole dataset. -m c:5:0.8 creates an additional validation set for each fold: 20% of the points go to the test set, 16% to the validation set, and 64% to the training set.
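For example (file names are placeholders), 5-fold cross validation with per-fold validation sets could be set up with:
$ java mltk.core.processor.InstancesSplitter -r data.attr -i data.txt -o splits -m c:5:0.8 -s 1
This creates splits/cv.0 through splits/cv.4, each holding a training set, a validation set, and a test set; changing the -s seed produces a different random partition.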