Data Mining Tools Laboratory Manual
OBJECTIVES:
System/Software Requirements:
LIST OF EXPERIMENTS:
TABLE OF CONTENTS
[Link]: 01 Build Data Warehouse and Explore WEKA
A. Build a Data Warehouse/Data Mart (using open source tools like Pentaho Data
Integration tool, Pentaho Business Analytics; or other data warehouse tools like
Microsoft SSIS, Informatica, Business Objects, etc.).
In this task we use the MySQL Administrator and SQLyog Enterprise tools for building and identifying tables in a database, and for populating (filling) those tables with sample data. A data warehouse is constructed by integrating data from multiple heterogeneous sources. It supports analytical reporting, structured and/or ad hoc queries, and decision making. We build a data warehouse by integrating all the tables in the database and analyzing their data. The figure below shows connection establishment in MySQL Administrator.
4
There are different options available in MySQL Administrator. We use another tool, SQLyog Enterprise, for building and identifying tables in a database after a successful connection has been established through MySQL Administrator. Below we can see the SQLyog Enterprise window.
In the left-side navigation we can see the different databases and their related tables. Now we are going to build tables and populate them with data through SQL queries. These tables can then be used for building the data warehouse.
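The table-building and populating steps above can be sketched in Python; here sqlite3 stands in for the MySQL server used in the lab, and the column layout of user_details is an assumption for illustration only:

```python
import sqlite3

# Sketch only: the manual uses MySQL via SQLyog; sqlite3 stands in here
# so the same DDL/DML ideas can be shown without a database server.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Build a table in the "sample" database (column names are illustrative).
cur.execute("""
    CREATE TABLE user_details (
        user_id   INTEGER PRIMARY KEY,
        user_name TEXT NOT NULL,
        city      TEXT
    )
""")

# Populate (fill) the table with sample data through SQL queries.
rows = [(1, "alice", "Hyderabad"), (2, "bob", "Chennai")]
cur.executemany("INSERT INTO user_details VALUES (?, ?, ?)", rows)
conn.commit()

print(cur.execute("SELECT COUNT(*) FROM user_details").fetchone()[0])
```

The same CREATE TABLE and INSERT statements can be issued from the SQLyog query window against MySQL.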
In the above two windows, we created a database named "sample", and in that database we created two tables named "user_details" and "hockey" through SQL queries.
Now we are going to populate (fill) those two tables with sample data through SQL queries, as shown in the windows below.
6
Through MySQL Administrator and SQLyog we can import databases from other sources (.XLS, .CSV, .sql), and we can also export our databases as backups for further processing.
We can connect MySQL to other applications for data analysis & reporting.
The multi-dimensional model was developed for implementing data warehouses; it provides both a mechanism to store data and a way to perform business analysis. The primary components of the dimensional model are dimensions and facts. There are different types of multi-dimensional data models:
1. Star Schema Model
2. Snowflake Schema Model
3. Fact Constellation Model
In the above window, the left-side navigation bar shows a database named "sales_dw" in which six different tables (dimcustdetails, dimcustomer, dimproduct, dimsalesperson, dimstores, factproductsales) have been created.
After creating the tables in the database, we use a tool called "Microsoft Visual Studio 2012 for Business Intelligence" for building multi-dimensional models.
The above window shows Microsoft Visual Studio before a project is created. The right-side navigation bar contains options such as Data Sources, Data Source Views, Cubes, Dimensions, etc.
Through Data Sources we can connect to our MySQL database named "sales_dw". All the tables in that database are then automatically retrieved into the tool for creating multi-dimensional models.
Through data source views and cubes, we can see the retrieved tables as multi-dimensional models. We also need to add dimensions through the Dimensions option. In general, multi-dimensional models consist of dimension tables and fact tables.
A star schema model is a join between a fact table and a number of dimension tables. Each dimension table is joined to the fact table by a primary key to foreign key join, but dimension tables are not joined to each other. It is the simplest style of data warehouse schema.
The entity-relationship diagram of a star schema resembles a star, with points radiating from a central table, as seen in the window below implemented in Visual Studio.
A snowflake schema is represented by a centralized fact table connected to multiple dimension tables, where the dimension tables are further normalized. Snowflaking affects only dimension tables, not fact tables. We developed a snowflake schema for the sales_dw database with the Visual Studio tool, as shown below.
2. Write ETL scripts and implement using data warehouse tools
ETL (Extract-Transform-Load):
ETL comes from data warehousing and stands for Extract-Transform-Load. ETL covers the process of how data is loaded from the source system to the data warehouse. Currently, ETL encompasses a cleaning step as a separate step; the sequence is then Extract-Clean-Transform-Load. Let us briefly describe each step of the ETL process.
Extract:
The Extract step covers data extraction from the source system and makes it accessible for further processing. The main objective of the extract step is to retrieve all the required data from the source system using as few resources as possible. The extract step should be designed so that it does not negatively affect the source system in terms of performance, response time, or any kind of locking.
There are several ways to perform the extract:
• Update notification - if the source system is able to provide a notification that a
record has been changed and describe the change, this is the easiest way to get the
data.
• Incremental extract - some systems may not be able to provide notification that an
update has occurred, but they are able to identify which records have been modified
and provide an extract of such records. During further ETL steps, the system needs
to identify changes and propagate them down. Note that with a daily extract, we may
not be able to handle deleted records properly.
• Full extract - some systems are not able to identify which data has been changed at
all, so a full extract is the only way one can get the data out of the system. The full
extract requires keeping a copy of the last extract in the same format in order to be
able to identify changes. Full extract handles deletions as well.
When using incremental or full extracts, the extract frequency is extremely important. Particularly for full extracts, the data volumes can be in the tens of gigabytes.
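The full-extract approach described above, keeping a copy of the last extract and comparing it against the current one, can be sketched as follows; the record layout is a made-up example:

```python
# Sketch: detecting inserts, updates and deletes by comparing the current
# full extract against a kept copy of the previous one, keyed by record id.
def diff_extracts(previous, current):
    """Both arguments map record id -> record tuple."""
    inserted = {k: current[k] for k in current.keys() - previous.keys()}
    deleted = {k: previous[k] for k in previous.keys() - current.keys()}
    updated = {k: current[k] for k in current.keys() & previous.keys()
               if current[k] != previous[k]}
    return inserted, updated, deleted

# Illustrative snapshots of two consecutive full extracts.
prev = {1: ("alice", "NY"), 2: ("bob", "LA"), 3: ("carol", "SF")}
curr = {1: ("alice", "NY"), 2: ("bob", "TX"), 4: ("dave", "NJ")}
ins, upd, dele = diff_extracts(prev, curr)
print(ins, upd, dele)
```

Note how the comparison recovers deletions (record 3), which an incremental daily extract alone might miss.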
Clean:
The cleaning step is one of the most important, as it ensures the quality of the data in the data warehouse. Cleaning should perform basic data unification rules, such as:
• Making identifiers unique (sex categories Male/Female/Unknown, M/F/null,
Man/Woman/Not Available are translated to standard Male/Female/Unknown)
• Convert null values into a standardized Not Available/Not Provided value
• Convert phone numbers, ZIP codes to a standardized form
• Validate address fields, convert them into proper naming, e.g. Street/St/St./Str./Str
• Validate address fields against each other (State/Country, City/State, City/ZIP code,
City/Street).
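A minimal sketch of the unification rules listed above; the mapping table is an illustrative assumption, since the exact rules depend on the source systems:

```python
# Sketch of a data-unification rule: map assorted sex codes to a standard
# Male/Female/Unknown domain and standardize missing values.
# The mapping below is illustrative, not an exhaustive rule set.
SEX_MAP = {"m": "Male", "male": "Male", "man": "Male",
           "f": "Female", "female": "Female", "woman": "Female"}

def clean_sex(value):
    # Nulls and "not available" variants become the standard Unknown.
    if value is None or str(value).strip().lower() in ("", "null", "not available"):
        return "Unknown"
    # Anything not in the mapping also falls back to Unknown.
    return SEX_MAP.get(str(value).strip().lower(), "Unknown")

print([clean_sex(v) for v in ["M", "Woman", None, "female", "??"]])
```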
Transform:
The transform step applies a set of rules to transform the data from the source to the
target. This includes converting any measured data to the same dimension (i.e. conformed
dimension) using the same units so that they can later be joined. The transformation step
also requires joining data from several sources, generating aggregates, generating
surrogate keys, sorting, deriving new calculated values, and applying advanced validation
rules.
Load:
During the load step, it is necessary to ensure that the load is performed correctly and with as few resources as possible. The target of the load process is often a database. In order to make the load process efficient, it is helpful to disable any constraints and indexes before the load and re-enable them only after the load completes. Referential integrity needs to be maintained by the ETL tool to ensure consistency.
Managing ETL Process:
The ETL process seems quite straightforward. As with every application, there is a
possibility that the ETL process fails. This can be caused by missing extracts from one of
the systems, missing values in one of the reference tables, or simply a connection or
power outage. Therefore, it is necessary to design the ETL process keeping fail-recovery
in mind.
Staging:
It should be possible to restart, at least, some of the phases independently from the others.
For example, if the transformation step fails, it should not be necessary to restart the
Extract step. We can ensure this by implementing proper staging. Staging means that the data is simply dumped to a location (called the staging area) so that it can then be read by the next processing phase. The staging area is also used during the ETL process to store intermediate results of processing. However, the staging area should be accessed by the ETL process only. It should never be available to anyone else, particularly not to end users, as it is not intended for data presentation and may contain incomplete or in-the-middle-of-processing data.
ETL Tool Implementation:
When you are about to use an ETL tool, there is a fundamental decision to be made: will
the company build its own data transformation tool or will it use an existing tool?
Building your own data transformation tool (usually a set of shell scripts) is the preferred
approach for a small number of data sources which reside in storage of the same type.
The reason is that the effort to implement the necessary transformations is small, due to the similar data structures and common system architecture. Also, this approach saves licensing costs and there is no need to train the staff in a new tool. This approach,
however, is dangerous from the TCO (total cost of ownership) point of view. If the transformations become more sophisticated over time, or there is a need to integrate other systems, the complexity of such an ETL system grows while its manageability drops significantly. Moreover, implementing your own tool often amounts to re-inventing the wheel.
There are many ready-to-use ETL tools on the market. The main benefit of using off-the-shelf ETL tools is that they are optimized for the ETL process, providing connectors to common data sources such as databases, flat files, mainframe systems, XML, etc. They provide a means to implement data transformations easily and consistently
across various data sources. This includes filtering, reformatting, sorting, joining,
merging, aggregation and other operations ready to use. The tools also support
transformation scheduling, version control, monitoring and unified metadata
management. Some of the ETL tools are even integrated with BI tools.
Some of the Well Known ETL Tools:
The most well-known commercial tools are Ab Initio, IBM InfoSphere DataStage, Informatica, Oracle Data Integrator, and SAP Data Integrator. Several open-source ETL tools are also available: OpenRefine, Apatar, CloverETL, Pentaho, and Talend.
Of these tools, we are going to use the OpenRefine 2.8 ETL tool on different sample datasets for extracting, cleaning, transforming, and loading data.
Perform various OLAP operations such as slice, dice, roll-up, drill-down and pivot.
OLAP Operations:
Since OLAP servers are based on a multidimensional view of data, we will discuss OLAP operations on multidimensional data.
Here is the list of OLAP operations
• Roll-up (Drill-up)
• Drill-down
• Slice and dice
• Pivot (rotate)
Roll-up (Drill-up):
Roll-up performs aggregation on a data cube in any of the following ways
• By climbing up a concept hierarchy for a dimension
• By dimension reduction
• Roll-up is performed by climbing up a concept hierarchy for the dimension location.
• Initially the concept hierarchy was "street < city < province < country".
• On rolling up, the data is aggregated by ascending the location hierarchy from the
level of city to the level of country.
• The data is grouped into countries rather than cities.
• When roll-up is performed, one or more dimensions from the data cube are removed.
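The city-to-country roll-up described above can be sketched in a few lines of Python; the mapping and the sales figures here are invented for illustration:

```python
from collections import defaultdict

# Sketch: rolling up a tiny cube along the location concept hierarchy
# city -> country (all values below are made up for illustration).
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada",
                   "Chicago": "USA", "New York": "USA"}

sales_by_city = {"Vancouver": 100, "Toronto": 150,
                 "Chicago": 200, "New York": 250}

rolled_up = defaultdict(int)
for city, amount in sales_by_city.items():
    # Climb the concept hierarchy: aggregate city-level cells by country.
    rolled_up[CITY_TO_COUNTRY[city]] += amount

print(dict(rolled_up))
```

Drill-down is simply the inverse: re-expanding each country total back into its city-level cells.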
Drill-down:
Drill-down is the reverse operation of roll-up. It is performed in either of the following ways:
• By stepping down a concept hierarchy for a dimension
• By introducing a new dimension
• Drill-down is performed by stepping down a concept hierarchy for the dimension
time.
• Initially the concept hierarchy was "day < month < quarter < year".
• On drilling down, the time dimension is descended from the level of quarter to the
level of month.
• When drill-down is performed, one or more dimensions from the data cube are
added. It navigates from less detailed data to more detailed data.
Slice:
The slice operation selects one particular dimension from a given cube and
provides a new sub-cube.
Dice:
Dice selects two or more dimensions from a given cube and provides a new sub-cube.
Pivot (rotate):
The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an alternative presentation of the data.
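A small sketch of slice, dice and pivot on a toy cube; the dimension values below are invented for illustration:

```python
# Toy cube: a list of (time, item, location, sales) cells.
cube = [
    ("Q1", "phone", "Canada", 100), ("Q1", "laptop", "Canada", 200),
    ("Q2", "phone", "USA", 150), ("Q2", "laptop", "USA", 300),
]

# Slice: fix one dimension (time = "Q1") to get a sub-cube.
slice_q1 = [c for c in cube if c[0] == "Q1"]

# Dice: fix two or more dimensions (time in {Q1, Q2} and item = "phone").
dice_phone = [c for c in cube if c[0] in ("Q1", "Q2") and c[1] == "phone"]

# Pivot: rotate the axes, e.g. view sales as item -> time instead of
# time -> item, for an alternative presentation of the same data.
pivot = {}
for time, item, _, sales in cube:
    pivot.setdefault(item, {})[time] = sales

print(len(slice_q1), len(dice_phone), pivot["phone"])
```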
Now, we are practically implementing all these OLAP Operations using Microsoft
Excel.
Procedure for OLAP Operations:
1. Open Microsoft Excel, go to the Data tab at the top, and click on "Existing Connections".
2. The Existing Connections window will open; click the "Browse for more" option to import a .cub extension file for performing OLAP operations. As a sample, the [Link] file was used.
3. As shown in the above window, select "PivotTable Report" and click "OK".
4. We now have all the cube data for analyzing the different OLAP operations. First, we performed the drill-down operation as shown below.
Now we are going to perform the roll-up (drill-up) operation. In the above window we selected the January month; the Drill-up option is then automatically enabled at the top. When we click on the Drill-up option, the window below is displayed.
While inserting slicers for the slicing operation, we select two dimensions (e.g. CategoryName and Year) with one measure (e.g. Sum of Sales). After inserting a slicer and adding a filter (CategoryName: AVANT ROCK and BIG BAND; Year: 2009 and 2010), we get the table shown below.
7. Finally, the pivot (rotate) OLAP operation is performed by swapping the rows (Order Date-Year) and columns (Values-Sum of Quantity and Sum of Sales) through the bottom-right navigation bar, as shown below.
After swapping (rotating), we get the result represented below, with a pie chart for Category: Classical and year-wise data.
(v). Explore visualization features of the tool for analysis, like identifying trends, etc.
There are different visualization features for analyzing data for trend analysis in data warehouses. Some of the popular visualizations are:
1. Column Charts
2. Line Charts
3. Pie Charts
4. Bar Graphs
5. Area Graphs
6. X & Y Scatter Graphs
7. Stock Graphs
8. Surface Charts
9. Radar Graphs
10. Treemap
11. Sunburst
12. Histogram
13. Box & Whisker
14. Waterfall
15. Combo Graphs
16. Geo Map
17. Heat Grid
18. Interactive Report
19. Stacked Column
20. Stacked Bar
21. Scatter Area
These types of visualizations can be used for analyzing data for trend analysis. Some of the tools for data visualization are Microsoft Excel, Tableau, Pentaho Business Analytics Online, etc. Practically, the different visualization features were tested with different sample datasets.
In the below window, we used 3D-Column Charts of Microsoft Excel for analyzing data
in data warehouse.
The window below represents data visualization through the Pentaho Business Analytics online tool ([Link]) for a sample dataset.
The Weka GUI Chooser (class [Link]) provides a starting point for launching Weka's main GUI applications and supporting tools. If one prefers an MDI ("multiple document interface") appearance, then this is provided by an alternative launcher called "Main" (class [Link]).
The GUI Chooser consists of four buttons, one for each of the four major Weka applications, and four menus.
Knowledge Flow - This environment supports essentially the same functions as the Explorer, but with a drag-and-drop interface. One advantage is that it supports incremental learning.
SimpleCLI - Provides a simple command-line interface that allows direct execution of WEKA commands, for operating systems that do not provide their own command-line interface.
(iii). Navigate the options available in the WEKA (ex. Select attributes panel,
Preprocess panel, classify panel, Cluster panel, Associate panel and Visualize panel)
When the Explorer is first started only the first tab is active; the others are greyed out.
This is because it is necessary to open (and potentially pre-process) a data set before
starting to explore the data.
The tabs are as follows:
1. Preprocess. Choose and modify the data being acted on.
2. Classify. Train and test learning schemes that classify or perform regression.
3. Cluster. Learn clusters for the data.
4. Associate. Learn association rules for the data.
5. Select attributes. Select the most relevant attributes in the data.
6. Visualize. View an interactive 2D plot of the data.
Once the tabs are active, clicking on them flicks between different screens, on which the
respective actions can be performed. The bottom area of the window (including the status
box, the log button, and the Weka bird) stays visible regardless of which section you are
in.
1. Preprocessing
Loading Data:
The first four buttons at the top of the preprocess section enable you to load data into WEKA:
1. Open file.... Brings up a dialog box allowing you to browse for the data file on the local file system.
2. Open URL.... Asks for a Uniform Resource Locator address where the data is stored.
3. Open DB.... Reads data from a database. (Note that to make this work you might have to edit the file in weka/experiment/[Link].)
4. Generate.... Enables you to generate artificial data from a variety of DataGenerators.
Using the Open file... button you can read files in a variety of formats: WEKA's ARFF format, CSV format, C4.5 format, or serialized Instances format. ARFF files typically have a .arff extension, CSV files a .csv extension, C4.5 files a .data and .names extension, and serialized Instances objects a .bsi extension.
2. Classification:
Selecting a Classifier
At the top of the classify section is the Classifier box. This box has a text field that gives the name of the currently selected classifier and its options. Clicking on the text box with the left mouse button brings up a GenericObjectEditor dialog box, just the same as for filters, that you can use to configure the options of the current classifier. With a right click (or Alt+Shift+left click) you can once again copy the setup string to the clipboard or display the properties in a GenericObjectEditor dialog box. The Choose button allows you to choose one of the classifiers that are available in WEKA.
Test Options
The result of applying the chosen classifier will be tested according to the options that are set by clicking in the Test options box. There are four test modes:
1. Use training set: The classifier is evaluated on how well it predicts the class of the instances it was trained on.
2. Supplied test set: The classifier is evaluated on how well it predicts the class of a set of instances loaded from a file. Clicking the Set... button brings up a dialog allowing you to choose the file to test on.
3. Cross-validation: The classifier is evaluated by cross-validation, using the number of folds that are entered in the Folds text field.
4. Percentage split: The classifier is evaluated on how well it predicts a certain percentage of the data which is held out for testing. The amount of data held out depends on the value entered in the % field.
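The Percentage split and Cross-validation modes can be sketched as simple index bookkeeping; this simplified version omits the shuffling and stratification that WEKA normally applies:

```python
# Sketch of the two hold-out test modes, with no shuffling/stratification.
def percentage_split(n, train_pct):
    """Split n instance indices into (train, test) by a percentage."""
    cut = n * train_pct // 100
    return list(range(cut)), list(range(cut, n))

def kfold_indices(n, folds):
    """Yield (train, test) index lists for each cross-validation fold."""
    for f in range(folds):
        test = [i for i in range(n) if i % folds == f]
        train = [i for i in range(n) if i % folds != f]
        yield train, test

# 66% train / 34% test split over 10 instances.
train, test = percentage_split(10, 66)
print(len(train), len(test))

# 5-fold cross-validation over 10 instances: each instance is tested once.
splits = list(kfold_indices(10, 5))
print(len(splits))
```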
3. Clustering:
Cluster Modes:
The Cluster mode box is used to choose what to cluster and how to evaluate the results. The first three options are the same as for classification: Use training set, Supplied test set, and Percentage split.
4. Associating:
Setting Up
This panel contains schemes for learning association rules, and the learners are chosen
and configured in the same way as the clusterers, filters, and classifiers in the other
panels.
5. Selecting Attributes:
Two objects must be set up: an attribute evaluator and a search method. The evaluator determines what method is used to assign a worth to each subset of attributes. The search method determines what style of search is performed.
6. Visualizing:
Overview
ARFF files have two distinct sections. The first section is the Header information, which is followed by the Data information.
The Header of the ARFF file contains the name of the relation, a list of the attributes (the columns in the data), and their types. An example header and data section for the standard IRIS dataset looks like this:
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
Lines that begin with a % are comments.
The @RELATION, @ATTRIBUTE and @DATA declarations are case insensitive.
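A minimal reader for the ARFF structure just described might look as follows; it is a sketch that handles only the simple cases shown above (no quoted attribute names or sparse data):

```python
# Sketch: a minimal parser for the ARFF layout described above
# (header with @RELATION/@ATTRIBUTE, % comments, then @DATA rows).
def parse_arff(text):
    relation, attributes, data = None, [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):   # blank lines and comments
            continue
        lower = line.lower()                   # declarations are case insensitive
        if lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            attributes.append(line.split(None, 2)[1])
        elif lower.startswith("@data"):
            in_data = True
        elif in_data:
            data.append(line.split(","))
    return relation, attributes, data

sample = """@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor}
% a comment
@DATA
5.1,Iris-setosa
"""
rel, attrs, rows = parse_arff(sample)
print(rel, attrs, rows)
```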
(v). Explore the available data sets in WEKA
There are 23 different datasets available in WEKA (C:\Program Files\Weka-3-6\) by default, for testing purposes. All the datasets are available in .arff format. These datasets are listed below.
(vi). Load a data set (ex. Weather dataset, Iris dataset, etc.)
Procedure:
1. Open the WEKA tool and select the Explorer option.
2. A new window will open, consisting of different tabs (Preprocess, Associate, etc.).
3. In the Preprocess tab, click the "Open file" option.
4. Go to C:\Program Files\Weka-3-6\data to find the different existing .arff datasets.
5. Click on any dataset to load the data; the data will then be displayed as shown below.
iii. Identify the class attribute (if any)
There is one class attribute, which consists of 3 labels. They are:
1. Iris-setosa
2. Iris-versicolor
3. Iris-virginica
iv. Plot Histogram
There is one class attribute (150 records) which consists of 3 labels. They are shown below:
1. Iris-setosa - 50 records
2. Iris-versicolor - 50 records
3. Iris-virginica - 50 records
vi. Visualize the data in various dimensions
Result :
[Link]: 02 Demonstration of preprocessing on dataset [Link]
Aim:
This experiment illustrates some of the basic data preprocessing operations that can be performed using the WEKA Explorer. The sample dataset used for this example is the student data, available in .arff format.
Steps:
Step1: Loading the data. We can load the dataset into WEKA by clicking the Open button in the preprocessing interface and selecting the appropriate file.
Step2: Once the data is loaded, WEKA will recognize the attributes, and during the scan of the data WEKA will compute some basic statistics for each attribute. The left panel in the above figure shows the list of recognized attributes, while the top panel indicates the names of the base relation or table and the current working relation (which are the same initially).
Step3: Clicking on an attribute in the left panel will show the basic statistics for that attribute. For categorical attributes the frequency of each attribute value is shown, while for continuous attributes we can obtain the min, max, mean, standard deviation, etc.
Step4: The visualization in the right panel is in the form of a cross-tabulation across two attributes.
Note: we can select another attribute using the dropdown list.
Step5: Selecting or filtering attributes
Removing an attribute:
When we need to remove an attribute, we can do this using the attribute filters in WEKA. In the filter panel, click on the Choose button. This will show a popup window with a list of available filters.
d) Save the new working relation as an .arff file by clicking the Save button on the top (button) panel. ([Link])
Discretization:
>40, medium, yes, fair, yes
<30, medium, yes, excellent, yes
30-40, medium, no, excellent, yes
30-40, high, yes, fair, yes
>40, medium, no, excellent, no
%
Result :
Steps:
Step1: Loading the data. We can load the dataset into WEKA by clicking the Open button in the preprocessing interface and selecting the appropriate file.
Step2: Once the data is loaded, WEKA will recognize the attributes, and during the scan of the data WEKA will compute some basic statistics for each attribute. The left panel in the above figure shows the list of recognized attributes, while the top panel indicates the names of the base relation or table and the current working relation (which are the same initially).
Step3: Clicking on an attribute in the left panel will show the basic statistics for that attribute. For categorical attributes the frequency of each attribute value is shown, while for continuous attributes we can obtain the min, max, mean, standard deviation, etc.
Step4: The visualization in the right panel is in the form of a cross-tabulation across two attributes.
Note: we can select another attribute using the dropdown list.
Step5: Selecting or filtering attributes
Removing an attribute:
When we need to remove an attribute, we can do this using the attribute filters in WEKA. In the filter panel, click on the Choose button. This will show a popup window with a list of available filters.
d) Save the new working relation as an .arff file by clicking the Save button on the top (button) panel. ([Link])
Discretization:
Sometimes association rule mining can only be performed on categorical data. This requires performing discretization on numeric or continuous attributes. In the following example let us discretize the duration attribute, dividing its values into three bins (intervals).
First load the dataset into WEKA ([Link]).
Select the duration attribute.
Activate the filter dialog box and select "[Link]" from the list.
To change the defaults for the filter, click on the box immediately to the right of the Choose button.
We enter the index of the attribute to be discretized. In this case the attribute is duration, so we must enter '1', corresponding to the duration attribute.
Enter '3' as the number of bins. Leave the remaining field values as they are.
Click the OK button.
Click Apply in the filter panel. This will result in a new working relation with the selected attribute partitioned into three bins.
Save the new working relation in a file called [Link]
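The equal-width binning that this kind of unsupervised discretization performs can be sketched as follows (the duration values below are illustrative):

```python
# Sketch of unsupervised equal-width discretization: the attribute's
# range is cut into a fixed number of equal-width intervals.
def discretize(values, bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    labels = []
    for v in values:
        # The last interval is closed on the right, so max(values)
        # falls into bin `bins`, not a non-existent extra bin.
        idx = min(int((v - lo) / width), bins - 1) if width else 0
        labels.append(f"bin{idx + 1}")
    return labels

durations = [1, 2, 2, 3, 5, 7, 9]  # illustrative duration values
print(discretize(durations, 3))
```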
Dataset [Link]
@relation 'labor-neg-data'
@attribute 'duration' real
@attribute 'wage-increase-first-year' real
@attribute 'wage-increase-second-year' real
@attribute 'wage-increase-third-year' real
@attribute 'cost-of-living-adjustment' {'none','tcf','tc'}
@attribute 'working-hours' real
@attribute 'pension' {'none','ret_allw','empl_contr'}
@attribute 'standby-pay' real
@attribute 'shift-differential' real
@attribute 'education-allowance' {'yes','no'}
@attribute 'statutory-holidays' real
@attribute 'vacation' {'below_average','average','generous'}
@attribute 'longterm-disability-assistance' {'yes','no'}
@attribute 'contribution-to-dental-plan' {'none','half','full'}
@attribute 'bereavement-assistance' {'yes','no'}
@attribute 'contribution-to-health-plan' {'none','half','full'}
@attribute 'class' {'bad','good'}
@data
1,5,?,?,?,40,?,?,2,?,11,'average',?,?,'yes',?,'good'
2,4.5,5.8,?,?,35,'ret_allw',?,?,'yes',11,'below_average',?,'full',?,'full','good'
?,?,?,?,?,38,'empl_contr',?,5,?,11,'generous','yes','half','yes','half','good'
3,3.7,4,5,'tc',?,?,?,?,'yes',?,?,?,?,'yes',?,'good'
3,4.5,4.5,5,?,40,?,?,?,?,12,'average',?,'half','yes','half','good'
2,2,2.5,?,?,35,?,?,6,'yes',12,'average',?,?,?,?,'good'
3,4,5,5,'tc',?,'empl_contr',?,?,?,12,'generous','yes','none','yes','half','good'
3,6.9,4.8,2.3,?,40,?,?,3,?,12,'below_average',?,?,?,?,'good'
2,3,7,?,?,38,?,12,25,'yes',11,'below_average','yes','half','yes',?,'good'
1,5.7,?,?,'none',40,'empl_contr',?,4,?,11,'generous','yes','full',?,?,'good'
3,3.5,4,4.6,'none',36,?,?,3,?,13,'generous',?,?,'yes','full','good'
2,6.4,6.4,?,?,38,?,?,4,?,15,?,?,'full',?,?,'good'
2,3.5,4,?,'none',40,?,?,2,'no',10,'below_average','no','half',?,'half','bad'
3,3.5,4,5.1,'tcf',37,?,?,4,?,13,'generous',?,'full','yes','full','good'
1,3,?,?,'none',36,?,?,10,'no',11,'generous',?,?,?,?,'good'
2,4.5,4,?,'none',37,'empl_contr',?,?,?,11,'average',?,'full','yes',?,'good'
1,2.8,?,?,?,35,?,?,2,?,12,'below_average',?,?,?,?,'good'
1,2.1,?,?,'tc',40,'ret_allw',2,3,'no',9,'below_average','yes','half',?,'none','bad'
1,2,?,?,'none',38,'none',?,?,'yes',11,'average','no','none','no','none','bad'
2,4,5,?,'tcf',35,?,13,5,?,15,'generous',?,?,?,?,'good'
2,4.3,4.4,?,?,38,?,?,4,?,12,'generous',?,'full',?,'full','good'
2,2.5,3,?,?,40,'none',?,?,?,11,'below_average',?,?,?,?,'bad'
3,3.5,4,4.6,'tcf',27,?,?,?,?,?,?,?,?,?,?,'good'
2,4.5,4,?,?,40,?,?,4,?,10,'generous',?,'half',?,'full','good'
1,6,?,?,?,38,?,8,3,?,9,'generous',?,?,?,?,'good'
3,2,2,2,'none',40,'none',?,?,?,10,'below_average',?,'half','yes','full','bad'
2,4.5,4.5,?,'tcf',?,?,?,?,'yes',10,'below_average','yes','none',?,'half','good'
2,3,3,?,'none',33,?,?,?,'yes',12,'generous',?,?,'yes','full','good'
2,5,4,?,'none',37,?,?,5,'no',11,'below_average','yes','full','yes','full','good'
3,2,2.5,?,?,35,'none',?,?,?,10,'average',?,?,'yes','full','bad'
3,4.5,4.5,5,'none',40,?,?,?,'no',11,'average',?,'half',?,?,'good'
3,3,2,2.5,'tc',40,'none',?,5,'no',10,'below_average','yes','half','yes','full','bad'
2,2.5,2.5,?,?,38,'empl_contr',?,?,?,10,'average',?,?,?,?,'bad'
2,4,5,?,'none',40,'none',?,3,'no',10,'below_average','no','none',?,'none','bad'
3,2,2.5,2.1,'tc',40,'none',2,1,'no',10,'below_average','no','half','yes','full','bad'
2,2,2,?,'none',40,'none',?,?,'no',11,'average','yes','none','yes','full','bad'
1,2,?,?,'tc',40,'ret_allw',4,0,'no',11,'generous','no','none','no','none','bad'
1,2.8,?,?,'none',38,'empl_contr',2,3,'no',9,'below_average','yes','half',?,'none','bad'
3,2,2.5,2,?,37,'empl_contr',?,?,?,10,'average',?,?,'yes','none','bad'
2,4.5,4,?,'none',40,?,?,4,?,12,'average','yes','full','yes','half','good'
1,4,?,?,'none',?,'none',?,?,'yes',11,'average','no','none','no','none','bad'
2,2,3,?,'none',38,'empl_contr',?,?,'yes',12,'generous','yes','none','yes','full','bad'
2,2.5,2.5,?,'tc',39,'empl_contr',?,?,?,12,'average',?,?,'yes',?,'bad'
2,2.5,3,?,'tcf',40,'none',?,?,?,11,'below_average',?,?,'yes',?,'bad'
2,4,4,?,'none',40,'none',?,3,?,10,'below_average','no','none',?,'none','bad'
2,4.5,4,?,?,40,?,?,2,'no',10,'below_average','no','half',?,'half','bad'
%
%
Result :
Steps:
Step1: Open the data file in WEKA Explorer. It is presumed that the required data fields have been discretized. In this example it is the age attribute.
Step2: Clicking on the Associate tab will bring up the interface for association rule algorithms.
Step3: We will use the Apriori algorithm. This is the default algorithm.
Step4: In order to change the parameters for the run (e.g. support, confidence, etc.), we click on the text box immediately to the right of the Choose button.
Dataset [Link]
@relation contact-lenses
@attribute age {young, pre-presbyopic, presbyopic}
@attribute spectacle-prescrip {myope, hypermetrope}
@attribute astigmatism{no, yes}
@attribute tear-prod-rate {reduced, normal}
@attribute contact-lenses {soft, hard, none}
@data
%
% 24 instances %
young,myope,no,reduced,none
young,myope,no,normal,soft
young,myope,yes,reduced,none
young,myope,yes,normal,hard
young,hypermetrope,no,reduced,none
young,hypermetrope,no,normal,soft
young,hypermetrope,yes,reduced,none
young,hypermetrope,yes,normal,hard
pre-presbyopic,myope,no,reduced,none
pre-presbyopic,myope,no,normal,soft
pre-presbyopic,myope,yes,reduced,none
pre-presbyopic,myope,yes,normal,hard
pre-presbyopic,hypermetrope,no,reduced,none
pre-presbyopic,hypermetrope,no,normal,soft
pre-presbyopic,hypermetrope,yes,reduced,none
pre-presbyopic,hypermetrope,yes,normal,none
presbyopic,myope,no,reduced,none
presbyopic,myope,no,normal,none
presbyopic,myope,yes,reduced,none
presbyopic,myope,yes,normal,hard
presbyopic,hypermetrope,no,reduced,none
presbyopic,hypermetrope,no,normal,soft
presbyopic,hypermetrope,yes,reduced,none
presbyopic,hypermetrope,yes,normal,none
%
%
%
The following screenshot shows the association rules that were generated when the Apriori algorithm was applied to the given dataset.
Result :
Aim:
This experiment illustrates some of the basic elements of association rule mining
using WEKA. The sample dataset used for this example is [Link]
Steps:
Step 1: Open the data file in the WEKA Explorer. It is presumed that the required data fields have been discretized.
Step 2: Clicking on the Associate tab brings up the interface for the association rule algorithms.
Step 3: We will use the Apriori algorithm, which is the default.
Step 4: In order to change the parameters for the run (e.g. support, confidence), click on the text box immediately to the right of the Choose button.
Dataset [Link]
@relation test
@attribute admissionyear {2005,2006,2007,2008,2009,2010}
@attribute course {cse,mech,it,ece}
@data
%
2005, cse
2005, it
2005, cse
2006, mech
2006, it
2006, ece
2007, it
2007, cse
2008, it
2008, cse
2009, it
2009, ece
%
The following screenshot shows the association rules that were generated when the
Apriori algorithm is applied to the given dataset.
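Apriori's first pass, finding the frequent 1-itemsets, can also be sketched by hand on this dataset. The minimum-support threshold below is a hypothetical value chosen for illustration; in WEKA it plays the role of the lowerBoundMinSupport parameter:

```python
from collections import Counter

# the 12 (admissionyear, course) records from the dataset above
data = [('2005','cse'), ('2005','it'), ('2005','cse'), ('2006','mech'),
        ('2006','it'), ('2006','ece'), ('2007','it'), ('2007','cse'),
        ('2008','it'), ('2008','cse'), ('2009','it'), ('2009','ece')]

min_support = 0.3   # hypothetical threshold for this illustration

# pass 1: count every attribute=value item across all instances
counts = Counter()
for year, course in data:
    counts['admissionyear=' + year] += 1
    counts['course=' + course] += 1

# keep only the items whose relative support reaches the threshold
frequent = {item: n for item, n in counts.items() if n / len(data) >= min_support}
print(frequent)   # only course=cse (4/12) and course=it (5/12) survive
```

Apriori would then combine the surviving items into candidate 2-itemsets and repeat the counting; infrequent items can never appear in a frequent larger itemset, which is what makes the pruning sound.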
Result :
Steps:
Step 1: We begin the experiment by loading the data ([Link]) into WEKA.
Step 2: Next we select the Classify tab and click the Choose button to select the J48 classifier.
Step 3: Now we specify the various parameters. These can be specified by clicking in the text box to the right of the Choose button. In this example we accept the default values. The default version does perform some pruning, but does not perform reduced-error pruning.
Step 4: Under Test options in the main panel, we select 10-fold cross-validation as our evaluation approach. Since we do not have a separate evaluation dataset, this is necessary to get a reasonable estimate of the accuracy of the generated model.
Step 5: We now click Start to generate the model. The ASCII version of the tree, as well as the evaluation statistics, will appear in the right panel when model construction is complete.
Step 6: Note that the classification accuracy of the model is only about 69%. This indicates that more work may be needed, either in preprocessing or in selecting other parameters for classification.
Step 7: WEKA also lets us view a graphical version of the classification tree. This can be done by right-clicking the last result set and selecting Visualize tree from the pop-up menu.
Step 8: We will use our model to classify new instances.
Step 9: Under Test options in the main panel, click the Supplied test set radio button and then click the Set button. This pops up a window that allows you to open the file containing the test instances.
@relation student
@attribute age {<30,30-40,>40}
@attribute income {low, medium, high}
@attribute student {yes, no}
@attribute credit-rating {fair, excellent}
@attribute buyspc {yes, no}
@data
%
<30, high, no, fair, no
<30, high, no, excellent, no
30-40, high, no, fair, yes
>40, medium, no, fair, yes
>40, low, yes, fair, yes
>40, low, yes, excellent, no
30-40, low, yes, excellent, yes
<30, medium, no, fair, no
<30, low, yes, fair, no
>40, medium, yes, fair, yes
<30, medium, yes, excellent, yes
30-40, medium, no, excellent, yes
30-40, high, yes, fair, yes
>40, medium, no, excellent, no
%
The following screenshot shows the classification results that were generated when the
J48 algorithm is applied to the given dataset.
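The 10-fold cross-validation used in Step 4 can be illustrated without WEKA. The sketch below (a hypothetical helper in plain Python) splits the 14 buyspc labels from the dataset above into folds and evaluates a deliberately trivial majority-class baseline; WEKA does the same splitting but trains a real J48 tree on each training part:

```python
import random

# buyspc labels of the 14 student instances listed above
labels = ['no','no','yes','yes','yes','no','yes','no','no','yes','yes','yes','yes','no']

def cross_validation_folds(n, k, seed=1):
    """Shuffle the instance indices and deal them round-robin into k folds;
    each instance ends up in exactly one test fold."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

k = 7                      # 7 folds of 2 instances each (14 is not divisible by 10)
folds = cross_validation_folds(len(labels), k)
correct = 0
for test_fold in folds:
    train = [i for i in range(len(labels)) if i not in test_fold]
    train_labels = [labels[i] for i in train]
    # trivial "classifier": always predict the majority class of the training part
    majority = max(set(train_labels), key=train_labels.count)
    correct += sum(labels[i] == majority for i in test_fold)

accuracy = correct / len(labels)
print(f"{k}-fold CV accuracy of the majority-class baseline: {accuracy:.2f}")
```

Because every instance is tested exactly once, on a model that never saw it during training, the averaged accuracy is a far less biased estimate than testing on the training set.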
Result :
[Link] : 07 Demonstration of classification rule process on dataset
[Link] using J48 algorithm
Aim:
This experiment illustrates the use of the J48 classifier in WEKA. The sample dataset
used in this experiment is the “employee” data, available in ARFF format. This document
assumes that appropriate data preprocessing has been performed.
Steps:
Step 1: We begin the experiment by loading the data ([Link]) into WEKA.
Step 2: Next we select the Classify tab and click the Choose button to select the J48 classifier.
Step 3: Now we specify the various parameters. These can be specified by clicking in the text box to the right of the Choose button. In this example we accept the default values. The default version does perform some pruning, but does not perform reduced-error pruning.
Step 4: Under Test options in the main panel, we select 10-fold cross-validation as our evaluation approach. Since we do not have a separate evaluation dataset, this is necessary to get a reasonable estimate of the accuracy of the generated model.
Step 5: We now click Start to generate the model. The ASCII version of the tree, as well as the evaluation statistics, will appear in the right panel when model construction is complete.
Step 6: Note that the classification accuracy of the model is only about 69%. This indicates that more work may be needed, either in preprocessing or in selecting other parameters for classification.
Step 7: WEKA also lets us view a graphical version of the classification tree. This can be done by right-clicking the last result set and selecting Visualize tree from the pop-up menu.
Step 8: We will use our model to classify new instances.
Step 9: Under Test options in the main panel, click the Supplied test set radio button and then click the Set button. This pops up a window that allows you to open the file containing the test instances.
Data set [Link]:
@relation employee
@attribute age {25, 27, 28, 29, 30, 35, 48}
@attribute salary {10k,15k,17k,20k,25k,30k,35k,32k}
@attribute performance {good, avg, poor}
@data
%
25, 10k, poor
27, 15k, poor
27, 17k, poor
28, 17k, poor
29, 20k, avg
30, 25k, avg
29, 25k, avg
30, 20k, avg
35, 32k, good
48, 35k, good
48, 32k, good
%
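J48 is an implementation of C4.5, which chooses split attributes by gain ratio: information gain normalized by the entropy of the split itself. The following is a minimal plain-Python sketch over the employee rows above, not the WEKA implementation:

```python
from collections import Counter
from math import log2

# (age, salary, performance) rows from employee.arff above
rows = [('25','10k','poor'), ('27','15k','poor'), ('27','17k','poor'),
        ('28','17k','poor'), ('29','20k','avg'), ('30','25k','avg'),
        ('29','25k','avg'), ('30','20k','avg'), ('35','32k','good'),
        ('48','35k','good'), ('48','32k','good')]

def entropy(labels):
    n = len(labels)
    return -sum(c/n * log2(c/n) for c in Counter(labels).values())

def gain_ratio(rows, attr):
    labels = [r[-1] for r in rows]
    groups = {}
    for r in rows:                       # partition the class labels by attribute value
        groups.setdefault(r[attr], []).append(r[-1])
    sizes = [len(g)/len(rows) for g in groups.values()]
    gain = entropy(labels) - sum(p * entropy(g) for p, g in zip(sizes, groups.values()))
    split_info = -sum(p * log2(p) for p in sizes)   # penalizes many-valued splits
    return gain / split_info

print(f"gain ratio of age:    {gain_ratio(rows, 0):.3f}")
print(f"gain ratio of salary: {gain_ratio(rows, 1):.3f}")
```

On this toy data both attributes split the instances into pure groups of the same sizes, so their gain ratios tie at about 0.58; on richer data this ranking is what drives C4.5's choice of split attribute.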
Result :
Aim:
This experiment illustrates the use of the ID3 classifier in WEKA. The sample dataset
used in this experiment is the “employee” data, available in ARFF format. This document
assumes that appropriate data preprocessing has been performed.
Steps:
Step 1: We begin the experiment by loading the data ([Link]) into WEKA.
Step 2: Next we select the Classify tab and click the Choose button to select the ID3 classifier.
Step 3: Now we specify the various parameters. These can be specified by clicking in the text box to the right of the Choose button. In this example we accept the default values (note that ID3 itself performs no pruning).
Step 4: Under Test options in the main panel, we select 10-fold cross-validation as our evaluation approach. Since we do not have a separate evaluation dataset, this is necessary to get a reasonable estimate of the accuracy of the generated model.
Step 5: We now click Start to generate the model. The ASCII version of the tree, as well as the evaluation statistics, will appear in the right panel when model construction is complete.
Step 6: Note that the classification accuracy of the model is only about 69%. This indicates that more work may be needed, either in preprocessing or in selecting other parameters for classification.
Step 7: WEKA also lets us view a graphical version of the classification tree. This can be done by right-clicking the last result set and selecting Visualize tree from the pop-up menu.
Dataset [Link]:
@relation employee
@attribute age {25, 27, 28, 29, 30, 35, 48}
@attribute salary {10k,15k,17k,20k,25k,30k,35k,32k}
@attribute performance {good, avg, poor}
@data
%
25, 10k, poor
27, 15k, poor
27, 17k, poor
28, 17k, poor
29, 20k, avg
30, 25k, avg
29, 25k, avg
30, 20k, avg
35, 32k, good
48, 35k, good
48, 32k, good
%
The following screenshot shows the classification results that were generated when the
ID3 algorithm is applied to the given dataset.
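The core of ID3 — recursively splitting on the attribute with the highest information gain until the node is pure — fits in a few lines. The following is an illustrative sketch on the employee rows above, not WEKA's implementation (which additionally handles unseen and missing values). Here both age and salary happen to split the data into pure groups; ties are broken by attribute order, so the sketch splits on age:

```python
from collections import Counter
from math import log2

# (age, salary, performance) rows from employee.arff above
rows = [('25','10k','poor'), ('27','15k','poor'), ('27','17k','poor'),
        ('28','17k','poor'), ('29','20k','avg'), ('30','25k','avg'),
        ('29','25k','avg'), ('30','20k','avg'), ('35','32k','good'),
        ('48','35k','good'), ('48','32k','good')]
ATTRS = {0: 'age', 1: 'salary'}

def entropy(labels):
    n = len(labels)
    return -sum(c/n * log2(c/n) for c in Counter(labels).values())

def id3(rows, attrs):
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1:            # pure node -> leaf
        return labels[0]
    if not attrs:                        # nothing left to split on -> majority leaf
        return Counter(labels).most_common(1)[0][0]

    def info_gain(a):
        groups = {}
        for r in rows:
            groups.setdefault(r[a], []).append(r[-1])
        rem = sum(len(g)/len(rows) * entropy(g) for g in groups.values())
        return entropy(labels) - rem

    best = max(attrs, key=info_gain)     # ties broken by attribute order
    branches = {}
    for value in set(r[best] for r in rows):
        subset = [r for r in rows if r[best] == value]
        branches[value] = id3(subset, [a for a in attrs if a != best])
    return {ATTRS[best]: branches}

print(id3(rows, [0, 1]))
```

The nested-dictionary result mirrors the ASCII tree WEKA prints: each key is a test, each leaf a class label.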
Result :
Steps:
Step 1: We begin the experiment by loading the data ([Link]) into WEKA.
Step 2: Next we select the Classify tab and click the Choose button to select the NaiveBayes classifier (under the bayes folder).
Step 3: Now we specify the various parameters. These can be specified by clicking in the text box to the right of the Choose button. In this example we accept the default values.
Step 4: Under Test options in the main panel, we select 10-fold cross-validation as our evaluation approach. Since we do not have a separate evaluation dataset, this is necessary to get a reasonable estimate of the accuracy of the generated model.
Step 5: We now click Start to generate the model. The text version of the model, as well as the evaluation statistics, will appear in the right panel when model construction is complete.
Step 6: Note that the classification accuracy of the model is only about 69%. This indicates that more work may be needed, either in preprocessing or in selecting other parameters for classification.
Step 7: For tree classifiers, WEKA also lets us view a graphical version of the model by right-clicking the last result set and selecting Visualize tree; Naive Bayes instead reports its per-class probability tables in the output panel.
Step 8: We will use our model to classify new instances.
Step 9: Under Test options in the main panel, click the Supplied test set radio button and then click the Set button. This pops up a window that allows you to open the file containing the test instances.
Dataset [Link]:
@relation employee
@attribute age {25, 27, 28, 29, 30, 35, 48}
@attribute salary {10k,15k,17k,20k,25k,30k,35k,32k}
@attribute performance {good, avg, poor}
@data
%
25, 10k, poor
27, 15k, poor
27, 17k, poor
28, 17k, poor
29, 20k, avg
30, 25k, avg
29, 25k, avg
30, 20k, avg
35, 32k, good
48, 35k, good
48, 32k, good
%
The following screenshot shows the classification results that were generated when the
Naive Bayes algorithm is applied to the given dataset.
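Naive Bayes scores each class by multiplying the class prior with the per-attribute conditional probabilities. Below is a hand-rolled sketch on the employee rows above, classifying a hypothetical new instance (age 29, salary 20k); Laplace (add-one) smoothing keeps an attribute value unseen for some class from zeroing out that class entirely:

```python
from collections import Counter

# (age, salary, performance) rows from employee.arff above
rows = [('25','10k','poor'), ('27','15k','poor'), ('27','17k','poor'),
        ('28','17k','poor'), ('29','20k','avg'), ('30','25k','avg'),
        ('29','25k','avg'), ('30','20k','avg'), ('35','32k','good'),
        ('48','35k','good'), ('48','32k','good')]
classes = Counter(r[-1] for r in rows)            # poor: 4, avg: 4, good: 3

def nb_score(instance, cls):
    """Unnormalised naive Bayes score: P(cls) * prod_i P(attr_i = v_i | cls),
    with Laplace smoothing over the values seen for each attribute."""
    members = [r for r in rows if r[-1] == cls]
    score = classes[cls] / len(rows)              # class prior
    for i, value in enumerate(instance):
        n_values = len(set(r[i] for r in rows))   # distinct values of attribute i
        matches = sum(1 for r in members if r[i] == value)
        score *= (matches + 1) / (len(members) + n_values)
    return score

query = ('29', '20k')                             # hypothetical new employee
pred = max(classes, key=lambda c: nb_score(query, c))
print(pred)   # 'avg': both attribute values occur only in avg-class rows
```

The scores are unnormalised, but since the same evidence term would divide every class, comparing them is enough to pick the winner.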
Result :
@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.4,3.7,1.5,0.2,Iris-setosa
4.8,3.4,1.6,0.2,Iris-setosa
4.8,3.0,1.4,0.1,Iris-setosa
4.3,3.0,1.1,0.1,Iris-setosa
5.8,4.0,1.2,0.2,Iris-setosa
5.7,4.4,1.5,0.4,Iris-setosa
5.4,3.9,1.3,0.4,Iris-setosa
5.1,3.5,1.4,0.3,Iris-setosa
5.7,3.8,1.7,0.3,Iris-setosa
5.1,3.8,1.5,0.3,Iris-setosa
5.4,3.4,1.7,0.2,Iris-setosa
5.1,3.7,1.5,0.4,Iris-setosa
4.6,3.6,1.0,0.2,Iris-setosa
5.1,3.3,1.7,0.5,Iris-setosa
4.8,3.4,1.9,0.2,Iris-setosa
5.0,3.0,1.6,0.2,Iris-setosa
5.0,3.4,1.6,0.4,Iris-setosa
5.2,3.5,1.5,0.2,Iris-setosa
5.2,3.4,1.4,0.2,Iris-setosa
4.7,3.2,1.6,0.2,Iris-setosa
4.8,3.1,1.6,0.2,Iris-setosa
5.4,3.4,1.5,0.4,Iris-setosa
5.2,4.1,1.5,0.1,Iris-setosa
5.5,4.2,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.0,3.2,1.2,0.2,Iris-setosa
5.5,3.5,1.3,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
4.4,3.0,1.3,0.2,Iris-setosa
5.1,3.4,1.5,0.2,Iris-setosa
5.0,3.5,1.3,0.3,Iris-setosa
4.5,2.3,1.3,0.3,Iris-setosa
4.4,3.2,1.3,0.2,Iris-setosa
5.0,3.5,1.6,0.6,Iris-setosa
5.1,3.8,1.9,0.4,Iris-setosa
4.8,3.0,1.4,0.3,Iris-setosa
5.1,3.8,1.6,0.2,Iris-setosa
4.6,3.2,1.4,0.2,Iris-setosa
5.3,3.7,1.5,0.2,Iris-setosa
5.0,3.3,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
5.7,2.8,4.5,1.3,Iris-versicolor
6.3,3.3,4.7,1.6,Iris-versicolor
4.9,2.4,3.3,1.0,Iris-versicolor
6.6,2.9,4.6,1.3,Iris-versicolor
5.2,2.7,3.9,1.4,Iris-versicolor
5.0,2.0,3.5,1.0,Iris-versicolor
5.9,3.0,4.2,1.5,Iris-versicolor
6.0,2.2,4.0,1.0,Iris-versicolor
6.1,2.9,4.7,1.4,Iris-versicolor
5.6,2.9,3.6,1.3,Iris-versicolor
6.7,3.1,4.4,1.4,Iris-versicolor
5.6,3.0,4.5,1.5,Iris-versicolor
5.8,2.7,4.1,1.0,Iris-versicolor
6.2,2.2,4.5,1.5,Iris-versicolor
5.6,2.5,3.9,1.1,Iris-versicolor
5.9,3.2,4.8,1.8,Iris-versicolor
6.1,2.8,4.0,1.3,Iris-versicolor
6.3,2.5,4.9,1.5,Iris-versicolor
6.1,2.8,4.7,1.2,Iris-versicolor
6.4,2.9,4.3,1.3,Iris-versicolor
6.6,3.0,4.4,1.4,Iris-versicolor
6.8,2.8,4.8,1.4,Iris-versicolor
6.7,3.0,5.0,1.7,Iris-versicolor
6.0,2.9,4.5,1.5,Iris-versicolor
5.7,2.6,3.5,1.0,Iris-versicolor
5.5,2.4,3.8,1.1,Iris-versicolor
5.5,2.4,3.7,1.0,Iris-versicolor
5.8,2.7,3.9,1.2,Iris-versicolor
6.0,2.7,5.1,1.6,Iris-versicolor
5.4,3.0,4.5,1.5,Iris-versicolor
6.0,3.4,4.5,1.6,Iris-versicolor
6.7,3.1,4.7,1.5,Iris-versicolor
6.3,2.3,4.4,1.3,Iris-versicolor
5.6,3.0,4.1,1.3,Iris-versicolor
5.5,2.5,4.0,1.3,Iris-versicolor
5.5,2.6,4.4,1.2,Iris-versicolor
6.1,3.0,4.6,1.4,Iris-versicolor
5.8,2.6,4.0,1.2,Iris-versicolor
5.0,2.3,3.3,1.0,Iris-versicolor
5.6,2.7,4.2,1.3,Iris-versicolor
5.7,3.0,4.2,1.2,Iris-versicolor
5.7,2.9,4.2,1.3,Iris-versicolor
6.2,2.9,4.3,1.3,Iris-versicolor
5.1,2.5,3.0,1.1,Iris-versicolor
5.7,2.8,4.1,1.3,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica
6.3,2.9,5.6,1.8,Iris-virginica
6.5,3.0,5.8,2.2,Iris-virginica
7.6,3.0,6.6,2.1,Iris-virginica
4.9,2.5,4.5,1.7,Iris-virginica
7.3,2.9,6.3,1.8,Iris-virginica
6.7,2.5,5.8,1.8,Iris-virginica
7.2,3.6,6.1,2.5,Iris-virginica
6.5,3.2,5.1,2.0,Iris-virginica
6.4,2.7,5.3,1.9,Iris-virginica
6.8,3.0,5.5,2.1,Iris-virginica
5.7,2.5,5.0,2.0,Iris-virginica
5.8,2.8,5.1,2.4,Iris-virginica
6.4,3.2,5.3,2.3,Iris-virginica
6.5,3.0,5.5,1.8,Iris-virginica
7.7,3.8,6.7,2.2,Iris-virginica
7.7,2.6,6.9,2.3,Iris-virginica
6.0,2.2,5.0,1.5,Iris-virginica
6.9,3.2,5.7,2.3,Iris-virginica
5.6,2.8,4.9,2.0,Iris-virginica
7.7,2.8,6.7,2.0,Iris-virginica
6.3,2.7,4.9,1.8,Iris-virginica
6.7,3.3,5.7,2.1,Iris-virginica
7.2,3.2,6.0,1.8,Iris-virginica
6.2,2.8,4.8,1.8,Iris-virginica
6.1,3.0,4.9,1.8,Iris-virginica
6.4,2.8,5.6,2.1,Iris-virginica
7.2,3.0,5.8,1.6,Iris-virginica
7.4,2.8,6.1,1.9,Iris-virginica
7.9,3.8,6.4,2.0,Iris-virginica
6.4,2.8,5.6,2.2,Iris-virginica
6.3,2.8,5.1,1.5,Iris-virginica
6.1,2.6,5.6,1.4,Iris-virginica
7.7,3.0,6.1,2.3,Iris-virginica
6.3,3.4,5.6,2.4,Iris-virginica
6.4,3.1,5.5,1.8,Iris-virginica
6.0,3.0,4.8,1.8,Iris-virginica
6.9,3.1,5.4,2.1,Iris-virginica
6.7,3.1,5.6,2.4,Iris-virginica
6.9,3.1,5.1,2.3,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
6.8,3.2,5.9,2.3,Iris-virginica
6.7,3.3,5.7,2.5,Iris-virginica
6.7,3.0,5.2,2.3,Iris-virginica
6.3,2.5,5.0,1.9,Iris-virginica
6.5,3.0,5.2,2.0,Iris-virginica
6.2,3.4,5.4,2.3,Iris-virginica
5.9,3.0,5.1,1.8,Iris-virginica
%
%
%
The following screenshot shows the clusters that were generated when the simple
k-means algorithm is applied to the given dataset.
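The two phases of k-means — assign each point to its nearest centroid, then recompute each centroid as the mean of its cluster — can be sketched in a few lines. The points below are a hypothetical hand-picked subset of (petal length, petal width) pairs from the iris data above:

```python
import math

# four short/narrow setosa petals and four long/wide virginica petals
points = [(1.4, 0.2), (1.3, 0.2), (1.5, 0.2), (1.7, 0.4),
          (6.0, 2.5), (5.9, 2.1), (5.8, 2.2), (6.6, 2.1)]

def kmeans(points, k, iters=10):
    centroids = list(points[:k])                 # naive seeding: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                         # assignment step: nearest centroid
            dists = [math.dist(p, c) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        centroids = [tuple(sum(x) / len(c) for x in zip(*c))   # update step: mean
                     for c in clusters if c]     # (empty clusters are dropped)
    return centroids, clusters

centroids, clusters = kmeans(points, k=2)
print(centroids)   # one centroid per species group
```

Even from a poor seeding (both initial centroids inside the setosa group), the assign/update loop converges within two iterations to the two natural clusters; WEKA's seed parameter controls the same kind of initial choice.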
Interpretation of the above visualization:
From the above visualization we can understand the distribution of sepal length and petal
length in each cluster; each cluster, for instance, is dominated by a particular range of
petal length. By changing the colour dimension to other attributes, we can see their
distribution within each cluster.
Step 8: We can also save the resulting dataset, which includes each instance along with its
assigned cluster. To do so, we click the Save button in the visualization window and save the
result as iris k-mean. The top portion of this file is shown in the following figure.
Result :
[Link] : 11 Demonstration of clustering rule process on dataset
[Link] using simple k-means
Aim:
This experiment illustrates the use of simple k-means clustering with the WEKA Explorer.
The sample dataset used for this example is based on the student data, available in
ARFF format. This document assumes that appropriate preprocessing has been
performed. The student dataset includes 14 instances.
Steps:
Step 1: Run the WEKA Explorer and load the data file [Link] in the Preprocess interface.
Step 2: In order to perform clustering, select the Cluster tab in the Explorer and click on the Choose button. This step results in a drop-down list of the available clustering algorithms.
Step 3: In this case we select SimpleKMeans.
Step 4: Next, click on the text box to the right of the Choose button to get the pop-up window shown in the screenshots. In this window we enter six as the number of clusters and leave the seed value as it is. The seed is used to generate a random number, which in turn is used for the initial assignment of instances to clusters.
Step 5: Once the options have been specified, we run the clustering algorithm. In the Cluster mode panel we make sure that the Use training set option is selected, and then we click the Start button. This process and the resulting window are shown in the following screenshots.
Step 6: The result window shows the centroid of each cluster, as well as statistics on the number and percentage of instances assigned to the different clusters. Each cluster centroid is the mean vector of that cluster, and can be used to characterize the cluster.
Step 7: Another way of understanding the characteristics of each cluster is through visualization. To do this, right-click the result set in the Result list panel and select Visualize cluster assignments.
From the above visualization we can understand the distribution of age and instance
number in each cluster. By changing the colour dimension to other attributes, we can
see their distribution within each cluster.
Step 8: We can also save the resulting dataset, which includes each instance along with
its assigned cluster. To do so, we click the Save button in the visualization window and
save the result as student k-mean. The top portion of this file is shown in the following
figure.
@relation student
@attribute age {<30,30-40,>40}
@attribute income {low,medium,high}
@attribute student {yes,no}
@attribute credit-rating {fair,excellent}
@attribute buyspc {yes,no}
@data
%
<30, high, no, fair, no
<30, high, no, excellent, no
30-40, high, no, fair, yes
>40, medium, no, fair, yes
>40, low, yes, fair, yes
>40, low, yes, excellent, no
30-40, low, yes, excellent, yes
<30, medium, no, fair, no
<30, low, yes, fair, no
>40, medium, yes, fair, yes
<30, medium, yes, excellent, yes
30-40, medium, no, excellent, yes
30-40, high, yes, fair, yes
>40, medium, no, excellent, no
%
The following screenshot shows the clusters that were generated when the simple
k-means algorithm is applied to the given dataset.
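Note that every attribute in this dataset is nominal. WEKA's SimpleKMeans can handle nominal attributes directly (cluster centroids then use the mode of each nominal attribute), but when implementing k-means by hand a common approach is to one-hot encode the values first so that Euclidean distance is well defined. A sketch using the @attribute domains declared above:

```python
# One-hot encode the nominal student attributes; the value domains and their
# order are taken from the @attribute declarations of the dataset above.
DOMAINS = [('<30', '30-40', '>40'),       # age
           ('low', 'medium', 'high'),     # income
           ('yes', 'no'),                 # student
           ('fair', 'excellent')]         # credit-rating

def one_hot(instance):
    vec = []
    for value, domain in zip(instance, DOMAINS):
        # one 1.0 at the position of the observed value, 0.0 elsewhere
        vec.extend(1.0 if value == v else 0.0 for v in domain)
    return vec

print(one_hot(('<30', 'high', 'no', 'fair')))
# [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0]
```

Each instance becomes a 10-dimensional numeric vector with exactly one 1.0 per attribute, after which the k-means sketch from the iris experiment applies unchanged.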
Result :
Steps:
Step 1: Run the WEKA Explorer and load the data file [Link] in the Preprocess interface.
Step 2: To perform regression, select the Classify tab in the Explorer and click on the Choose button; expand the functions folder and select LinearRegression.
Cross-validation:
Step 9: We select the Linear Regression algorithm and click on Start with the
cross-validation option set to 10 folds.
Step 10: We then obtain the regression model and its results, as shown below.
Percentage split:
Step 11: We select the Linear Regression algorithm and click on Start with the
percentage split option set to a 66% split.
Step 12: We then obtain the regression model and its results, as shown below.
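The model Linear Regression fits is an ordinary least-squares model. For a single predictor the fit has a closed form, sketched below on a hypothetical hand-picked subset of cpu.arff rows (x = MMAX, y = ERP); WEKA fits all predictors jointly, but the idea is the same:

```python
# Hypothetical hand-picked (MMAX, ERP) pairs from the cpu.arff rows below
xs = [6000, 32000, 16000, 32000, 64000, 3000, 8000, 16000]   # MMAX (max main memory)
ys = [199, 253, 132, 290, 749, 23, 70, 117]                  # ERP

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
# ordinary least squares for one predictor: slope = cov(x, y) / var(x)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx
print(f"ERP = {slope:.4f} * MMAX + {intercept:.2f}")
```

A useful sanity check on any least-squares fit is that the regression line always passes through the point of means (mean of x, mean of y).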
Dataset cpu.arff:
@relation 'cpu'
@attribute vendor { adviser, amdahl, apollo, basf, bti, burroughs, c.r.d, cdc, cambex, dec,
dg, formation, four-phase, gould, hp, harris, honeywell, ibm, ipl, magnuson, microdata,
nas, ncr, nixdorf, perkin-elmer, prime, siemens, sperry, sratus, wang}
@attribute MYCT real
@attribute MMIN real
@attribute MMAX real
@attribute CACH real
@attribute CHMIN real
@attribute CHMAX real
@attribute ERP real
@data
adviser,125,256,6000,256,16,128,199
amdahl,29,8000,32000,32,8,32,253
amdahl,29,8000,32000,32,8,32,253
amdahl,29,8000,32000,32,8,32,253
amdahl,29,8000,16000,32,8,16,132
amdahl,26,8000,32000,64,8,32,290
amdahl,23,16000,32000,64,16,32,381
amdahl,23,16000,32000,64,16,32,381
amdahl,23,16000,64000,64,16,32,749
amdahl,23,32000,64000,128,32,64,1238
apollo,400,1000,3000,0,1,2,23
apollo,400,512,3500,4,1,6,24
basf,60,2000,8000,65,1,8,70
basf,50,4000,16000,65,1,8,117
bti,350,64,64,0,1,4,15
bti,200,512,16000,0,4,32,64
burroughs,167,524,2000,8,4,15,23
burroughs,143,512,5000,0,7,32,29
burroughs,143,1000,2000,0,5,16,22
burroughs,110,5000,5000,142,8,64,124
burroughs,143,1500,6300,0,5,32,35
burroughs,143,3100,6200,0,5,20,39
burroughs,143,2300,6200,0,6,64,40
burroughs,110,3100,6200,0,6,64,45
c.r.d,320,128,6000,0,1,12,28
c.r.d,320,512,2000,4,1,3,21
c.r.d,320,256,6000,0,1,6,28
c.r.d,320,256,3000,4,1,3,22
c.r.d,320,512,5000,4,1,5,28
c.r.d,320,256,5000,4,1,6,27
cdc,25,1310,2620,131,12,24,102
cdc,25,1310,2620,131,12,24,102
cdc,50,2620,10480,30,12,24,74
cdc,50,2620,10480,30,12,24,74
cdc,56,5240,20970,30,12,24,138
cdc,64,5240,20970,30,12,24,136
cdc,50,500,2000,8,1,4,23
cdc,50,1000,4000,8,1,5,29
cdc,50,2000,8000,8,1,5,44
cambex,50,1000,4000,8,3,5,30
cambex,50,1000,8000,8,3,5,41
cambex,50,2000,16000,8,3,5,74
cambex,50,2000,16000,8,3,6,74
cambex,50,2000,16000,8,3,6,74
dec,133,1000,12000,9,3,12,54
dec,133,1000,8000,9,3,12,41
dec,810,512,512,8,1,1,18
dec,810,1000,5000,0,1,1,28
dec,320,512,8000,4,1,5,36
dec,200,512,8000,8,1,8,38
dg,700,384,8000,0,1,1,34
dg,700,256,2000,0,1,1,19
dg,140,1000,16000,16,1,3,72
dg,200,1000,8000,0,1,2,36
dg,110,1000,4000,16,1,2,30
dg,110,1000,12000,16,1,2,56
dg,220,1000,8000,16,1,2,42
formation,800,256,8000,0,1,4,34
formation,800,256,8000,0,1,4,34
formation,800,256,8000,0,1,4,34
formation,800,256,8000,0,1,4,34
formation,800,256,8000,0,1,4,34
four-phase,125,512,1000,0,8,20,19
gould,75,2000,8000,64,1,38,75
gould,75,2000,16000,64,1,38,113
gould,75,2000,16000,128,1,38,157
hp,90,256,1000,0,3,10,18
hp,105,256,2000,0,3,10,20
hp,105,1000,4000,0,3,24,28
hp,105,2000,4000,8,3,19,33
hp,75,2000,8000,8,3,24,47
hp,75,3000,8000,8,3,48,54
hp,175,256,2000,0,3,24,20
harris,300,768,3000,0,6,24,23
harris,300,768,3000,6,6,24,25
harris,300,768,12000,6,6,24,52
harris,300,768,4500,0,1,24,27
harris,300,384,12000,6,1,24,50
harris,300,192,768,6,6,24,18
harris,180,768,12000,6,1,31,53
honeywell,330,1000,3000,0,2,4,23
honeywell,300,1000,4000,8,3,64,30
honeywell,300,1000,16000,8,2,112,73
honeywell,330,1000,2000,0,1,2,20
honeywell,330,1000,4000,0,3,6,25
honeywell,140,2000,4000,0,3,6,28
honeywell,140,2000,4000,0,4,8,29
honeywell,140,2000,4000,8,1,20,32
honeywell,140,2000,32000,32,1,20,175
honeywell,140,2000,8000,32,1,54,57
honeywell,140,2000,32000,32,1,54,181
honeywell,140,2000,32000,32,1,54,181
honeywell,140,2000,4000,8,1,20,32
ibm,57,4000,16000,1,6,12,82
ibm,57,4000,24000,64,12,16,171
ibm,26,16000,32000,64,16,24,361
ibm,26,16000,32000,64,8,24,350
ibm,26,8000,32000,0,8,24,220
ibm,26,8000,16000,0,8,16,113
ibm,480,96,512,0,1,1,15
ibm,203,1000,2000,0,1,5,21
ibm,115,512,6000,16,1,6,35
ibm,1100,512,1500,0,1,1,18
ibm,1100,768,2000,0,1,1,20
ibm,600,768,2000,0,1,1,20
ibm,400,2000,4000,0,1,1,28
ibm,400,4000,8000,0,1,1,45
ibm,900,1000,1000,0,1,2,18
ibm,900,512,1000,0,1,2,17
ibm,900,1000,4000,4,1,2,26
ibm,900,1000,4000,8,1,2,28
ibm,900,2000,4000,0,3,6,28
ibm,225,2000,4000,8,3,6,31
ibm,225,2000,4000,8,3,6,31
ibm,180,2000,8000,8,1,6,42
ibm,185,2000,16000,16,1,6,76
ibm,180,2000,16000,16,1,6,76
ibm,225,1000,4000,2,3,6,26
ibm,25,2000,12000,8,1,4,59
ibm,25,2000,12000,16,3,5,65
ibm,17,4000,16000,8,6,12,101
ibm,17,4000,16000,32,6,12,116
ibm,1500,768,1000,0,0,0,18
ibm,1500,768,2000,0,0,0,20
ibm,800,768,2000,0,0,0,20
ipl,50,2000,4000,0,3,6,30
ipl,50,2000,8000,8,3,6,44
ipl,50,2000,8000,8,1,6,44
ipl,50,2000,16000,24,1,6,82
ipl,50,2000,16000,24,1,6,82
ipl,50,8000,16000,48,1,10,128
magnuson,100,1000,8000,0,2,6,37
magnuson,100,1000,8000,24,2,6,46
magnuson,100,1000,8000,24,3,6,46
magnuson,50,2000,16000,12,3,16,80
magnuson,50,2000,16000,24,6,16,88
magnuson,50,2000,16000,24,6,16,88
microdata,150,512,4000,0,8,128,33
nas,115,2000,8000,16,1,3,46
nas,115,2000,4000,2,1,5,29
nas,92,2000,8000,32,1,6,53
nas,92,2000,8000,32,1,6,53
nas,92,2000,8000,4,1,6,41
nas,75,4000,16000,16,1,6,86
nas,60,4000,16000,32,1,6,95
nas,60,2000,16000,64,5,8,107
nas,60,4000,16000,64,5,8,117
nas,50,4000,16000,64,5,10,119
nas,72,4000,16000,64,8,16,120
nas,72,2000,8000,16,6,8,48
nas,40,8000,16000,32,8,16,126
nas,40,8000,32000,64,8,24,266
nas,35,8000,32000,64,8,24,270
nas,38,16000,32000,128,16,32,426
nas,48,4000,24000,32,8,24,151
nas,38,8000,32000,64,8,24,267
nas,30,16000,32000,256,16,24,603
ncr,112,1000,1000,0,1,4,19
ncr,84,1000,2000,0,1,6,21
ncr,56,1000,4000,0,1,6,26
ncr,56,2000,6000,0,1,8,35
ncr,56,2000,8000,0,1,8,41
ncr,56,4000,8000,0,1,8,47
ncr,56,4000,12000,0,1,8,62
ncr,56,4000,16000,0,1,8,78
ncr,38,4000,8000,32,16,32,80
ncr,38,4000,8000,32,16,32,80
ncr,38,8000,16000,64,4,8,142
ncr,38,8000,24000,160,4,8,281
ncr,38,4000,16000,128,16,32,190
nixdorf,200,1000,2000,0,1,2,21
nixdorf,200,1000,4000,0,1,4,25
nixdorf,200,2000,8000,64,1,5,67
perkin-elmer,250,512,4000,0,1,7,24
perkin-elmer,250,512,4000,0,4,7,24
perkin-elmer,250,1000,16000,1,1,8,64
prime,160,512,4000,2,1,5,25
prime,160,512,2000,2,3,8,20
prime,160,1000,4000,8,1,14,29
prime,160,1000,8000,16,1,14,43
prime,160,2000,8000,32,1,13,53
siemens,240,512,1000,8,1,3,19
siemens,240,512,2000,8,1,5,22
siemens,105,2000,4000,8,3,8,31
siemens,105,2000,6000,16,6,16,41
siemens,105,2000,8000,16,4,14,47
siemens,52,4000,16000,32,4,12,99
siemens,70,4000,12000,8,6,8,67
siemens,59,4000,12000,32,6,12,81
siemens,59,8000,16000,64,12,24,149
siemens,26,8000,24000,32,8,16,183
siemens,26,8000,32000,64,12,16,275
siemens,26,8000,32000,128,24,32,382
sperry,116,2000,8000,32,5,28,56
sperry,50,2000,32000,24,6,26,182
sperry,50,2000,32000,48,26,52,227
sperry,50,2000,32000,112,52,104,341
sperry,50,4000,32000,112,52,104,360
sperry,30,8000,64000,96,12,176,919
sperry,30,8000,64000,128,12,176,978
sperry,180,262,4000,0,1,3,24
sperry,180,512,4000,0,1,3,24
sperry,180,262,4000,0,1,3,24
sperry,180,512,4000,0,1,3,24
sperry,124,1000,8000,0,1,8,37
sperry,98,1000,8000,32,2,8,50
sratus,125,2000,8000,0,2,14,41
wang,480,512,8000,32,0,0,47
wang,480,1000,4000,0,0,0,25
The following screenshot shows the regression model that was generated when the Linear
Regression algorithm is applied to the given dataset.
cross-validation:
percentage split:
Result :
Description: The business of banks is making loans, so assessing the creditworthiness of
an applicant is of crucial importance. You have to develop a system to help a loan officer
decide whether the credit of a customer is good or bad. A bank's business rules
regarding loans must balance two opposing factors. On the one hand, a bank wants to
make as many loans as possible, since interest on these loans is the bank's profit source.
On the other hand, a bank cannot afford to make too many bad loans; too many bad
loans could lead to the collapse of the bank. The bank's loan policy must therefore
involve a compromise: not too strict and not too lenient.
To do the assignment, you first and foremost need some knowledge about the world of
credit. You can acquire such knowledge in a number of ways.
1. Knowledge engineering: Find a loan officer who is willing to talk. Interview her and try
to represent her knowledge in a number of ways.
2. Books: Find some training manuals for loan officers, or perhaps a suitable textbook on
finance. Translate this knowledge from text form to production-rule form.
3. Common sense: Imagine yourself as a loan officer and make up reasonable rules which
can be used to judge the creditworthiness of a loan applicant.
4. Case histories: Find records of actual cases where competent loan officers correctly
judged when, and when not, to approve a loan application.
Actual historical credit data is not always easy to come by because of confidentiality
rules. Here is one such dataset, consisting of 1,000 actual cases collected in Germany.
In spite of the fact that the data is German, you should probably make use of it for this
assignment (unless you really can consult a real loan officer!).
There are 20 attributes used in judging a loan applicant (i.e., 7 numerical attributes and
13 categorical or nominal attributes). The goal is to classify the applicant into one of two
categories: good or bad.
1. Checking_Status
2. Duration
3. Credit_history
4. Purpose
5. Credit_amount
6. Savings_status
7. Employment
8. Installment_Commitment
9. Personal_status
10. Other_parties
11. Residence_since
12. Property_Magnitude
13. Age
14. Other_payment_plans
15. Housing
16. Existing_credits
17. Job
18. Num_dependents
19. Own_telephone
20. Foreign_worker
21. Class
Tasks:
1. List all the categorical (or nominal) attributes and the real valued attributes
separately.
Ans) Steps followed in the Preprocess tab are:
3. Click on invert.
4. Then we get all categorial attributes selected
5. Click on remove
6. Click on visualize all.
The following are the categorical (or nominal) attributes:
1. Checking_Status
2. Credit_history
3. Purpose
4. Savings_status
5. Employment
6. Personal_status
7. Other_parties
8. Property_Magnitude
9. Other_payment_plans
10. Housing
11. Job
12. Own_telephone
13. Foreign_worker
The following are the numerical attributes:
1. Duration
2. Credit_amount
3. Installment_Commitment
4. Residence_since
5. Age
6. Existing_credits
7. Num_dependents
2. What attributes do you think might be crucial in making the credit assessment? Come
up with some simple rules in plain English using your selected attributes.
Ans) The following attributes may be crucial in making the credit assessment:
1. Credit_amount
2. Age
3. Job
4. Savings_status
5. Existing_credits
6. Installment_commitment
7. Property_magnitude
3. One type of model that you can create is a decision tree. Train a decision tree using
the complete dataset as the training data. Report the model obtained after training.
Ans) We created a decision tree by using the J48 technique with the complete dataset as
the training data.
4. Suppose you use your above model, trained on the complete dataset, to classify
credit as good/bad for each of the examples in the dataset. What % of examples can you
classify correctly? (This is also called testing on the training set.) Why do you think you
cannot get 100% training accuracy?
5. Is testing on the training set as you did above a good idea? Why or why not?
Ans) It is not a good idea, because testing on the same data that was used for training
gives an optimistically biased estimate of how the model will perform on unseen data.
6. One approach for solving the problem encountered in the previous question is
using cross-validation. Describe briefly what cross-validation is. Train a decision tree
again using cross-validation and report your results. Does accuracy increase or decrease?
Why?
Ans) Steps followed are:
1. Double click on the [Link] file.
2. Click on the Classify tab.
3. Click on the Choose button.
4. Expand the trees folder and select J48.
5. Click on cross-validation in Test options.
6. Select folds as 10.
7. Click on Start.
8. Change the folds to 5.
9. Again click on Start.
10. Change the folds to 2.
11. Click on Start.
12. Right-click on the blue bar under the Result list and go to Visualize tree.
Output:
Cross-Validation Definition: The classifier is evaluated by cross-validation, using the number
of folds that is entered in the Folds text field.
In the Classify tab, select the cross-validation option with fold size 2, then press the Start
button; next change the fold size to 5 and press Start; then change the fold size to 10 and
press Start.
i) Fold Size-10
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 705 70.5 %
Incorrectly Classified Instances 295 29.5 %
Kappa statistic 0.2467
Mean absolute error 0.3467
Root mean squared error 0.4796
Relative absolute error 82.5233 %
Root relative squared error 104.6565 %
Coverage of cases (0.95 level) 92.8 %
Mean rel. region size (0.95 level) 91.7 %
Total Number of Instances 1000
=== Confusion Matrix ===
a b <-- classified as
588 112 | a = good
183 117 | b = bad
a b <-- classified as
596 104 | a = good
163 137 | b = bad
Note: With this observation, we have seen that accuracy increases when the fold size is
5 and decreases when the fold size is 10.
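The summary statistics can be verified directly from the confusion matrix. The sketch below recomputes the accuracy and the Kappa statistic for the 10-fold matrix above:

```python
# Confusion matrix from the 10-fold run above:
#                 predicted good   predicted bad
#   actual good        588              112
#   actual bad         183              117
cm = [[588, 112], [183, 117]]
n = sum(sum(row) for row in cm)

accuracy = (cm[0][0] + cm[1][1]) / n          # observed agreement: 705/1000
row_totals = [sum(r) for r in cm]
col_totals = [cm[0][j] + cm[1][j] for j in range(2)]
# agreement expected by chance, from the row/column marginals
p_e = sum(row_totals[i] * col_totals[i] for i in range(2)) / n ** 2
kappa = (accuracy - p_e) / (1 - p_e)
print(accuracy, round(kappa, 4))              # 0.705 and 0.2467, matching WEKA's output
```

Kappa discounts the agreement a classifier would get by chance from the class marginals alone, which is why it is so much lower than the raw 70.5% accuracy on this class-imbalanced data.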
7. Check to see if the data shows a bias against “foreign workers” or “personal-status”.
One way to do this is to remove these attributes from the dataset and see if the decision
tree created in those cases is significantly different from the full-dataset case, which you
have already done. Did removing these attributes have any significant effect? Discuss.
Output:
i) If Foreign_worker is removed
a b <-- classified as
668 32 | a = good
109 191 | b = bad
ii) If Personal_status is removed
a b <-- classified as
668 32 | a = good
102 198 | b = bad
Note: With this observation we have seen that when the “Foreign_worker” attribute is removed from the dataset the accuracy (85.9%) barely changes, and similarly when “Personal_status” is removed (86.6%); both are close to the full-dataset case (85.5%), so removing these attributes has no significant effect.
8. Another question might be, do you really need to input so many attributes to get good
results? Maybe only a few would do. For example, you could try just having attributes
2, 3, 5, 7, 10, 17 and 21. Try out some combinations. (You removed two attributes in
problem 7. Remember to reload the .arff data file to get all the attributes back before
you start selecting the ones you want.)
We use the Preprocess Tab in the Weka GUI Explorer to remove the 2nd attribute. In the Classify Tab, select the Use Training set option and press Start.
a b <-- classified as
647 53 | a = good
106 194 | b = bad
Remember to reload the previously removed attribute: press the Undo option in the Preprocess tab. We then use the Preprocess Tab in the Weka GUI Explorer to remove the 3rd attribute (Credit_history). In the Classify Tab, select the Use Training set option and press Start. With this attribute removed, we can see a change in accuracy compared to the full data set.
a b <-- classified as
645 55 | a = good
106 194 | b = bad
Remember to reload the previously removed attribute: press the Undo option in the Preprocess tab. We then use the Preprocess Tab in the Weka GUI Explorer to remove the 5th attribute (Credit_amount). In the Classify Tab, select the Use Training set option and press Start. With this attribute removed, we can see a change in accuracy compared to the full data set.
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances 864 86.4 %
Incorrectly Classified Instances 136 13.6 %
=== Confusion Matrix ===
a b <-- classified as
675 25 | a = good
111 189 | b = bad
Remember to reload the previously removed attribute: press the Undo option in the Preprocess tab. We then use the Preprocess Tab in the Weka GUI Explorer to remove the 7th attribute (Employment). In the Classify Tab, select the Use Training set option and press Start. With this attribute removed, we can see a change in accuracy compared to the full data set.
a b <-- classified as
670 30 | a = good
112 188 | b = bad
Remember to reload the previously removed attribute: press the Undo option in the Preprocess tab. We then use the Preprocess Tab in the Weka GUI Explorer to remove the 10th attribute (Other_parties). In the Classify Tab, select the Use Training set option and press Start. With this attribute removed, we can see a change in accuracy compared to the full data set.
Remember to reload the previously removed attribute: press the Undo option in the Preprocess tab. We then use the Preprocess Tab in the Weka GUI Explorer to remove the 17th attribute (Job). In the Classify Tab, select the Use Training set option and press Start. With this attribute removed, we can see a change in accuracy compared to the full data set.
a b <-- classified as
675 25 | a = good
116 184 | b = bad
Remember to reload the previously removed attribute: press the Undo option in the Preprocess tab. We then use the Preprocess Tab in the Weka GUI Explorer to remove the 21st attribute (Class). In the Classify Tab, select the Use Training set option and press Start. With this attribute removed, we can see a change in accuracy compared to the full data set.
a b <-- classified as
963 0 | a = yes
37 0 | b = no
Note: From these observations we see that when the 3rd attribute is removed from the dataset, the accuracy (83%) decreases, so this attribute is important for classification. When the 2nd or 10th attribute is removed, the accuracy (84%) stays the same, so we can remove either one of them. When the 7th or 17th attribute is removed, the accuracy (85%) also stays the same, so we can remove either one of them. If we remove the 5th or 21st attribute the accuracy increases, so these attributes may not be needed for the classification.
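The accuracies quoted in this note can be recomputed from the confusion matrices reported above; a stdlib-Python sketch (attribute labels follow the text):

```python
# Accuracy after each single-attribute removal, recomputed from the
# confusion matrices reported above.
matrices = {
    "2nd removed": (647, 53, 106, 194),
    "3rd (Credit_history) removed": (645, 55, 106, 194),
    "7th (Employment) removed": (670, 30, 112, 188),
    "17th (Job) removed": (675, 25, 116, 184),
}

def accuracy(matrix):
    a, b, c, d = matrix          # a and d are correct, b and c misclassified
    return (a + d) / (a + b + c + d)

for label, m in matrices.items():
    print(f"{label}: {accuracy(m):.1%}")
```

This makes the comparison in the note easy to verify: the 3rd-attribute run sits near 84 %, while the 7th- and 17th-attribute runs sit near 86 %.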
9. Sometimes the cost of rejecting an applicant who actually has good credit might
be higher than accepting an applicant who has bad credit. Instead of counting the
misclassifications equally in both cases, give a higher cost to the first case (say cost 5) and a
lower cost to the second case, by using a cost matrix in Weka. Train your decision tree
and report the decision tree and cross-validation results. Are they significantly different
from the results obtained in problem 6?
OUTPUT:
In the Weka GUI Explorer, select the Classify Tab and the Use Training set option, then press the Choose button and select J48 as the decision tree technique. Press the More options button to open the classifier evaluation options window; select Cost-sensitive evaluation and press the Set button to open the Cost Matrix Editor. Change Classes to 2 and press the Resize button to get a 2x2 cost matrix. Change the value at location (0,1) to 5; the modified cost matrix is as follows.
0.0 5.0
1.0 0.0
Then close the Cost Matrix Editor, press OK, and press Start.
=== Evaluation on training set ===
=== Summary ===
a b <-- classified as
669 31 | a = good
114 186 | b = bad
Note: With this observation we see that, of the 700 good customers, 669 are classified as good and 31 are misclassified as bad. Of the 300 bad customers, 186 are classified as bad and 114 are misclassified as good.
10. Do you think it is a good idea to prefer simple decision trees instead of long,
complex decision trees? How does the complexity of a decision tree relate to the bias of
the model?
Ans)
It is a good idea to prefer simple decision trees over complex ones: they are easier to interpret and tend to generalize better. A more complex (deeper) tree has lower bias but higher variance, so it can overfit the training data; a simpler tree has higher bias but lower variance.
11. You can make your decision trees simpler by pruning the nodes. One approach
is to use Reduced Error Pruning. Explain this idea briefly. Try reduced error pruning
for training your decision trees using cross-validation and report the decision trees
you obtain. Also report your accuracy using the pruned model. Does your accuracy
increase?
Ans)
We can make our decision tree simpler by pruning nodes. In reduced error pruning, part of the training data is held back as a validation set; starting from the leaves, each subtree is replaced by a leaf labelled with its majority class, and the replacement is kept whenever it does not reduce accuracy on the validation set. To try this, in the Weka GUI Explorer select the Classify Tab and the Use Training set option, press Choose and select J48, then click on the "J48 -C 0.25 -M 2" text beside the Choose button to open the Generic Object Editor. Set the reducedErrorPruning property to True, press OK, and press Start.
12) How can you convert a decision tree into “if-then-else rules”? Make up your own
small decision tree consisting of 2-3 levels and convert it into a set of rules. There also
exist different classifiers that output the model in the form of rules; one such classifier in
Weka is rules.PART. Train this model and report the set of rules obtained. Sometimes
just one attribute can be good enough in making the decision, yes, just one! Can you
predict what attribute that might be in this data set? The OneR classifier uses a single
attribute to make decisions (it chooses the attribute based on minimum error). Report the
rule obtained by training a OneR classifier. Rank the performance of J48, PART and OneR.
Ans)
Converting a decision tree into a set of rules is done as follows. In the Weka GUI Explorer, select the Classify Tab and the Use Training set option. There also exist classifiers that output the model in the form of rules; such classifiers in Weka are “PART” and “OneR”. Go to Choose, select Rules, pick PART, and press Start.
a b <-- classified as
653 47 | a = good
56 244 | b = bad
Then go to Choose and select Rules in that select OneR and press start Button.
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances 742 74.2 %
Incorrectly Classified Instances 258 25.8 %
=== Confusion Matrix ===
a b <-- classified as
642 58 | a = good
200 100 | b = bad
Then go to Choose and select Trees in that select J48 and press start Button.
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances 855 85.5 %
Incorrectly Classified Instances 145 14.5 %
a b <-- classified as
669 31 | a = good
114 186 | b = bad
Note: With this observation we have seen the performance of the classifiers; the ranking is as follows:
1. PART
2. J48
3. OneR
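To illustrate the conversion asked for in the question, here is a hypothetical 2-3 level decision tree (the attribute and value names are made up for illustration, not taken from the dataset) flattened into if-then rules by walking every root-to-leaf path:

```python
# A small decision tree as nested dicts: internal nodes map an
# attribute to its branches; leaves are class labels.
tree = {
    "checking_status": {
        "none": "good",
        "<0": {
            "duration": {
                "short": "good",
                "long": "bad",
            }
        },
    }
}

def tree_to_rules(node, conditions=()):
    """Return one IF-THEN rule per root-to-leaf path of the tree."""
    if not isinstance(node, dict):          # leaf: emit one rule
        body = " AND ".join(conditions) or "TRUE"
        return [f"IF {body} THEN class = {node}"]
    (attr, branches), = node.items()
    rules = []
    for value, child in branches.items():
        rules += tree_to_rules(child, conditions + (f"{attr} = {value}",))
    return rules

for rule in tree_to_rules(tree):
    print(rule)
```

Each leaf yields exactly one rule, so a tree with three leaves produces three rules; this is the same mechanical reading that PART performs internally when it emits rule lists.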
Result :
Dimension
The dimension object (dimension) has the attributes:
_name
_hierarchies
A dimension object consists of a set of levels and a set of hierarchies defined over those levels. Levels represent levels of aggregation. Hierarchies describe parent-child relationships among a set of levels.
For example, a typical calendar dimension could contain five levels. Two hierarchies can be defined on these levels:
H1: YearL > QuarterL > MonthL > DayL
H2: YearL > WeekL > DayL
The hierarchies are described from parent to child, so that Year is the parent of Quarter, Quarter is the parent of Month, and so forth.
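The two hierarchies above can be modelled as ordered lists of levels, parent first; a minimal sketch (the helper function name is illustrative):

```python
# Calendar hierarchies as ordered lists, most aggregated level first,
# matching H1 and H2 above.
H1 = ["YearL", "QuarterL", "MonthL", "DayL"]
H2 = ["YearL", "WeekL", "DayL"]

def parent_level(hierarchy, level):
    """Return the parent of `level` in the hierarchy, or None for the root."""
    i = hierarchy.index(level)
    return hierarchy[i - 1] if i > 0 else None

print(parent_level(H1, "MonthL"))   # QuarterL
print(parent_level(H2, "DayL"))     # WeekL
print(parent_level(H1, "YearL"))    # None (root of the hierarchy)
```

Note that DayL has a different parent in each hierarchy, which is exactly why a dimension can carry several hierarchies over the same base level.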
About unique key constraints
When you create a definition for a hierarchy, Warehouse Builder creates an identifier key for each level of the hierarchy and a unique key constraint on the lowest level (base level).
PATIENT (patient_name, age, address, etc.)
MEDICINE (Medicine_brand_name, Drug_name, supplier, no_units, units_price, etc.)
SUPPLIER (Supplier_name, medicine_brand_name, address, etc.)
If each dimension has 6 levels, decide the levels and hierarchies, assuming the level names suitably.
Design the hospital management system data warehouse using all [Link] the example.
Result :
Data Preprocessing
Objective:
Understanding the purpose of unsupervised attribute/instance filters for
preprocessing the input data.
The preprocess section allows filters to be defined that transform the data in various ways.
The Filter box is used to set up filters that are required. At the left of the Filter box is a
Choose button. By clicking this button it is possible to select one of the filters in Weka.
Once a filter has been selected, its name and options are shown in the field next to the
Choose button. Clicking on this box brings up a GenericObjectEditor dialog box, which lets
you configure a filter. Once you are happy with the settings you have chosen, click OK to
return to the main Explorer window.
Now you can apply it to the data by pressing the Apply button at the right end of the Filter
panel. The Preprocess panel will then show the transformed data. The change can be undone
using the Undo button. Use the Edit button to view your transformed data in the dataset
editor.
Try each of the following Unsupervised Attribute Filters. (Choose -> weka -> filters ->
unsupervised -> attribute)
• Use ReplaceMissingValues to replace missing values in the given dataset.
• Use the filter Add to add the attribute Average.
• Use the filter AddExpression and add an attribute which is the average of attributes
M1 and M2. Name this attribute as AVG.
• Understand the purpose of the attribute filter Copy.
• Use the attribute filters Discretize and PKIDiscretize to discretize the M1 and
M2 attributes into five bins. (NOTE: Open the file afresh to apply the second filter,
since there would be no numeric attribute to discretize after you have applied the first
filter.)
• Perform Normalize and Standardize on the dataset and identify the difference
between these operations.
• Use the attribute filter FirstOrder to convert the M1 and M2 attributes into a single
attribute representing the first differences between them.
• Add a nominal attribute Grade and use the filter MakeIndicator to convert the
attribute into a Boolean attribute.
• Try if you can accomplish the task in the previous step using the filter
MergeTwoValues.
• Try the following transformation functions and identify the purpose of each
• NumericTransform
• NominalToBinary
• NumericToBinary
• Remove
• RemoveType
• RemoveUseless
• ReplaceMissingValues
• SwapValues
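The difference probed in the Normalize/Standardize bullet above can be seen numerically; a stdlib-Python sketch on a toy attribute (WEKA's exact variance estimator may differ):

```python
# Normalize rescales an attribute to [0, 1]; Standardize rescales it
# to zero mean and unit variance.
import statistics

values = [40.0, 55.0, 70.0, 85.0, 100.0]

lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

mean = statistics.mean(values)
stdev = statistics.pstdev(values)        # population std. dev. used here
standardized = [(v - mean) / stdev for v in values]

print(normalized)    # [0.0, 0.25, 0.5, 0.75, 1.0]
print(standardized)  # zero mean, unit variance
```

Normalize depends only on the min and max, so outliers compress the rest of the range; Standardize keeps relative distances in units of the standard deviation.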
Try the following Unsupervised Instance Filters.
• Perform Randomize on the given dataset and try to correlate the resultant sequence
with the given one.
Result :