
DEPARTMENT OF ARTIFICIAL INTELLIGENCE & DATA SCIENCE

LAB MANUAL

U20AI605 – DATA MINING TOOLS LABORATORY

Year / Sem: III / VI


U20AI605 – DATA MINING TOOLS LABORATORY

OBJECTIVES:

• Practical exposure to the implementation of well-known data mining tasks.
• Exposure to real-life data sets for analysis and prediction.
• Learning performance evaluation of data mining algorithms in a supervised and an
unsupervised setting.
• Handling a small data mining project for a given practical domain.

System/Software Requirements:

• Intel-based desktop PC
• WEKA TOOL

LIST OF EXPERIMENTS:

1. Build Data Warehouse and Explore WEKA

2. Perform data preprocessing tasks and demonstrate performing association rule mining on datasets

3. Demonstrate performing classification on datasets

4. Demonstrate performing clustering on datasets

5. Demonstrate performing Regression on datasets

6. Credit Risk Assessment. Sample Programs using German Credit Data

7. Sample Programs using Hospital Management System

8. Beyond the Syllabus- Simple Project on Data Preprocessing

TABLE OF CONTENTS

Ex. No. EXPERIMENT TITLE

1 Build Data Warehouse and Explore WEKA
2 Demonstration of preprocessing on dataset student.arff
3 Demonstration of preprocessing on dataset labor.arff
4 Demonstration of Association rule process on dataset contactlenses.arff using apriori algorithm
5 Demonstration of Association rule process on dataset test.arff using apriori algorithm
6 Demonstration of classification rule process on dataset student.arff using j48 algorithm
7 Demonstration of classification rule process on dataset employee.arff using j48 algorithm
8 Demonstration of classification rule process on dataset employee.arff using id3 algorithm
9 Demonstration of classification rule process on dataset employee.arff using naïve bayes algorithm
10 Demonstration of clustering rule process on dataset iris.arff using simple k-means
11 Demonstration of clustering rule process on dataset student.arff using simple k-means
12 Demonstrate performing Regression on datasets
13 Credit Risk Assessment: Sample Programs using German Credit Data
14 Sample Programs using Hospital Management System
15 Simple Project on Data Preprocessing

Ex. No: 01 Build Data Warehouse and Explore WEKA

A. Build a Data Warehouse/Data Mart (using open source tools like Pentaho Data
Integration tool, Pentaho Business Analytics; or other data warehouse tools like
Microsoft SSIS, Informatica, Business Objects, etc.).

(i). Identify source tables and populate sample data

In this task, we are going to use the MySQL Administrator and SQLyog Enterprise tools for building & identifying tables in a database & also for populating (filling) sample data in the tables of that database. A data warehouse is constructed by integrating data from multiple heterogeneous sources. It supports analytical reporting, structured and/or ad hoc queries, and decision making. We are building a data warehouse by integrating all the tables in the database & analyzing those data. The figure below represents the MySQL Administrator connection establishment.

After a successful login, a new window opens as shown below.
There are different options available in MySQL Administrator. We use another tool, SQLyog Enterprise, for building & identifying tables in a database after a successful connection establishment through MySQL Administrator. Below we can see the SQLyog Enterprise window.

In the left-side navigation, we can see the different databases & their related tables. Now we are going to build tables & populate table data in the database through SQL queries. These tables in the database can be used further for building the data warehouse.

In the above two windows, we created a database named "sample" & in that database we created two tables named "user_details" & "hockey" through SQL queries.
Now, we are going to populate (fill) sample data through SQL queries in those two created tables, as represented in the windows below.
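The same table creation and population can also be scripted from Java via JDBC (with MySQL Connector/J on the classpath). The sketch below is a minimal illustration; the connection URL, credentials, and column layout are assumptions, not the exact tables shown in the screenshots.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class PopulateSample {
    public static void main(String[] args) throws Exception {
        // Connect to the "sample" database (URL, user, and password are assumptions).
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/sample", "root", "password");
             Statement st = con.createStatement()) {

            // Build a source table (hypothetical column layout).
            st.executeUpdate("CREATE TABLE IF NOT EXISTS user_details ("
                    + "id INT PRIMARY KEY, name VARCHAR(50), city VARCHAR(50))");

            // Populate it with sample rows.
            st.executeUpdate("INSERT INTO user_details VALUES (1, 'Asha', 'Chennai')");
            st.executeUpdate("INSERT INTO user_details VALUES (2, 'Ravi', 'Madurai')");
            System.out.println("Sample data loaded.");
        }
    }
}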

Through MySQL Administrator & SQLyog, we can import databases from other sources (.XLS, .CSV, .sql) & also export our databases as backups for further processing. We can connect MySQL to other applications for data analysis & reporting.

(ii). Design multi-dimensional data models namely Star, Snowflake and Fact Constellation schemas for any one enterprise (ex. Banking, Insurance, Finance, Healthcare, Manufacturing, Automobile, etc.).

The multi-dimensional model was developed for implementing data warehouses & it provides both a mechanism to store data and a way for business analysis. The primary components of the dimensional model are dimensions & facts. There are different types of multi-dimensional data models. They are:
1. Star Schema Model
2. Snowflake Schema Model
3. Fact Constellation Model

Now, we are going to design these multi-dimensional models for the Marketing enterprise. First, we need to build the tables in a database through SQLyog as shown below.

In the above window, the left-side navigation bar consists of a database named "sales_dw" in which six different tables (dimcustdetails, dimcustomer, dimproduct, dimsalesperson, dimstores, factproductsales) have been created.
After creating the tables in the database, we are going to use a tool called "Microsoft Visual Studio 2012 for Business Intelligence" for building the multi-dimensional models.

In the above window, we see Microsoft Visual Studio before creating a project, in which the right-side navigation bar contains different options like Data Sources, Data Source Views, Cubes, Dimensions, etc.

Through Data Sources, we can connect to our MySQL database named as
“sales_dw”. Then, automatically all the tables in that database will be retrieved to this
tool for creating multidimensional models.

Through data source views & cubes, we can see our retrieved tables in multi-dimensional models. We also need to add dimensions through the Dimensions option. In general, multidimensional models consist of dimension tables & fact tables.

Star Schema Model:

A star schema model is a join between a fact table and a number of dimension tables. Each dimension table is joined to the fact table using a primary key to foreign key join, but the dimension tables are not joined to each other. It is the simplest style of data warehouse schema.
The entity relationship diagram of this schema resembles a star, with points radiating from a central table, as seen in the window implemented below in Visual Studio.

Snow Flake Schema:


It is slightly different from the star schema: the dimension tables of a star schema are organized into a hierarchy by normalizing them.

The snowflake schema is represented by a centralized fact table which is connected to multiple dimension tables. Snowflaking affects only dimension tables, not fact tables. We developed a snowflake schema for the sales_dw database with the Visual Studio tool as shown below.

(iii). Write ETL scripts and implement using data warehouse tools

ETL (Extract-Transform-Load):
ETL comes from data warehousing and stands for Extract-Transform-Load. ETL covers the process of how data is loaded from the source system into the data warehouse. Currently, ETL encompasses a cleaning step as a separate step; the sequence is then Extract-Clean-Transform-Load. Let us briefly describe each step of the ETL process.
Extract:
The Extract step covers the data extraction from the source system and makes it accessible for further processing. The main objective of the extract step is to retrieve all the required data from the source system with as few resources as possible. The extract step should be designed in a way that it does not negatively affect the source system in terms of performance, response time or any kind of locking.
There are several ways to perform the extract:
• Update notification - if the source system is able to provide a notification that a
record has been changed and describe the change, this is the easiest way to get the
data.
• Incremental extract - some systems may not be able to provide notification that an update has occurred, but they are able to identify which records have been modified and provide an extract of such records. During further ETL steps, the system needs to identify changes and propagate them down. Note that by using a daily extract, we may not be able to handle deleted records properly.
• Full extract - some systems are not able to identify which data has been changed at
all, so a full extract is the only way one can get the data out of the system. The full
extract requires keeping a copy of the last extract in the same format in order to be
able to identify changes. Full extract handles deletions as well.
When using incremental or full extracts, the extract frequency is extremely important, particularly for full extracts, where the data volumes can be in the tens of gigabytes.
Clean:
The cleaning step is one of the most important as it ensures the quality of the data in the
data warehouse. Cleaning should perform basic data unification rules, such as:
• Making identifiers unique (sex categories Male/Female/Unknown, M/F/null,
Man/Woman/Not Available are translated to standard Male/Female/Unknown)
• Convert null values into standardized Not Available/Not Provided value
• Convert phone numbers, ZIP codes to a standardized form
• Validate address fields, convert them into proper naming, e.g. Street/St/St./Str./Str
• Validate address fields against each other (State/Country, City/State, City/ZIP code,
City/Street).
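As a minimal sketch of one such unification rule in Java (the mapping table below is an assumption built from the spellings listed above, not a prescribed standard):

import java.util.HashMap;
import java.util.Map;

public class SexCodeCleaner {
    // Map every known source spelling to the standard code.
    private static final Map<String, String> SEX_CODES = new HashMap<>();
    static {
        SEX_CODES.put("M", "Male");        SEX_CODES.put("Man", "Male");
        SEX_CODES.put("Male", "Male");
        SEX_CODES.put("F", "Female");      SEX_CODES.put("Woman", "Female");
        SEX_CODES.put("Female", "Female");
    }

    // Nulls and unrecognized spellings fall back to the standard "Unknown" value.
    static String clean(String raw) {
        if (raw == null) return "Unknown";
        return SEX_CODES.getOrDefault(raw.trim(), "Unknown");
    }

    public static void main(String[] args) {
        System.out.println(clean("M"));            // Male
        System.out.println(clean("Woman"));        // Female
        System.out.println(clean(null));           // Unknown
        System.out.println(clean("Not Available"));// Unknown
    }
}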
Transform:
The transform step applies a set of rules to transform the data from the source to the
target. This includes converting any measured data to the same dimension (i.e. conformed
dimension) using the same units so that they can later be joined. The transformation step
also requires joining data from several sources, generating aggregates, generating
surrogate keys, sorting, deriving new calculated values, and applying advanced validation
rules.

Load:
During the load step, it is necessary to ensure that the load is performed correctly and
with as little resources as possible. The target of the Load process is often a database. In
order to make the load process efficient, it is helpful to disable any constraints and
indexes before the load and enable them back only after the load completes. The
referential integrity needs to be maintained by ETL tool to ensure consistency.
Managing ETL Process:
The ETL process seems quite straight forward. As with every application, there is a
possibility that the ETL process fails. This can be caused by missing extracts from one of
the systems, missing values in one of the reference tables, or simply a connection or
power outage. Therefore, it is necessary to design the ETL process keeping fail-recovery
in mind.
Staging:
It should be possible to restart, at least, some of the phases independently from the others.
For example, if the transformation step fails, it should not be necessary to restart the
Extract step. We can ensure this by implementing proper staging. Staging means that the
data is simply dumped to the location (called the Staging Area) so that it can then be read
by the next processing phase. The staging area is also used during the ETL process to store intermediate results of processing; this is fine for the ETL process that uses it for that purpose. However, the staging area should be accessed by the ETL process only. It should never be available to anyone else, particularly not to end users, as it is not intended for data presentation and may contain incomplete or in-the-middle-of-the-processing data.
ETL Tool Implementation:
When you are about to use an ETL tool, there is a fundamental decision to be made: will
the company build its own data transformation tool or will it use an existing tool?
Building your own data transformation tool (usually a set of shell scripts) is the preferred
approach for a small number of data sources which reside in storage of the same type.
The reason for that is the effort to implement the necessary transformation is little due to
similar data structure and common system architecture. Also, this approach saves
licensing costs and there is no need to train the staff on a new tool. This approach, however, is dangerous from the TCO (total cost of ownership) point of view. If the transformations become more sophisticated over time or there is a need to integrate other systems, the complexity of such an ETL system grows but the manageability drops significantly. Similarly, the implementation of your own tool often resembles re-inventing the wheel.
There are many ready-to-use ETL tools on the market. The main benefit of using off-the-
shelf ETL tools is the fact that they are optimized for the ETL process by providing
connectors to common data sources like databases, flat files, mainframe systems, xml,
etc. They provide a means to implement data transformations easily and consistently
across various data sources. This includes filtering, reformatting, sorting, joining,
merging, aggregation and other operations ready to use. The tools also support
transformation scheduling, version control, monitoring and unified metadata
management. Some of the ETL tools are even integrated with BI tools.
Some of the Well Known ETL Tools:
The most well-known commercial tools are Ab Initio, IBM InfoSphere
DataStage, Informatica, Oracle Data Integrator, and SAP Data
Integrator. Several open-source ETL tools are OpenRefine, Apatar, CloverETL, Pentaho and Talend.

Among the above tools, we are going to use the OpenRefine 2.8 ETL tool on different sample datasets for extracting, data cleaning, transforming & loading.

(iv). Perform various OLAP operations such as slice, dice, roll up, drill down and pivot.
OLAP Operations:
Since OLAP servers are based on a multidimensional view of data, we will discuss OLAP operations on multidimensional data.
Here is the list of OLAP operations:
• Roll-up (Drill-up)
• Drill-down
• Slice and dice
• Pivot (rotate)
Roll-up (Drill-up):
Roll-up performs aggregation on a data cube in either of the following ways:
• By climbing up a concept hierarchy for a dimension
• By dimension reduction
• Roll-up is performed by climbing up a concept hierarchy for the dimension location.
• Initially the concept hierarchy was "street < city < province < country".
• On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country.
• The data is grouped into countries rather than cities.
• When roll-up is performed, one or more dimensions from the data cube are removed.
Drill-down:
Drill-down is the reverse operation of roll-up. It is performed in either of the following ways:
• By stepping down a concept hierarchy for a dimension
• By introducing a new dimension
• Drill-down is performed by stepping down a concept hierarchy for the dimension time.
• Initially the concept hierarchy was "day < month < quarter < year".
• On drilling down, the time dimension is descended from the level of quarter to the level of month.
• When drill-down is performed, one or more dimensions from the data cube are added. It navigates the data from less detailed data to highly detailed data.
Slice:
The slice operation selects one particular dimension from a given cube and provides a new sub-cube.
Dice:
Dice selects two or more dimensions from a given cube and provides a new sub-cube.
Pivot (rotate):
The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an alternative presentation of the data.
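Before the Excel walk-through, a tiny in-memory sketch may help make roll-up concrete: aggregating city-level monthly sales up the location and time hierarchies to country-level yearly totals. The sample records below are invented for illustration.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class RollUpDemo {
    // One fact row of a tiny sales cube: location (city, country) and time (year, month).
    record Sale(String city, String country, int year, int month, double amount) {}

    public static void main(String[] args) {
        List<Sale> facts = List.of(
                new Sale("Chennai", "India",  2008, 1, 120.0),
                new Sale("Chennai", "India",  2008, 2, 80.0),
                new Sale("Mumbai",  "India",  2008, 1, 200.0),
                new Sale("Paris",   "France", 2008, 1, 150.0));

        // Roll-up: climb city -> country and month -> year, aggregating the measure.
        Map<String, Double> byCountryYear = facts.stream()
                .collect(Collectors.groupingBy(
                        s -> s.country() + "/" + s.year(),
                        Collectors.summingDouble(Sale::amount)));

        // Totals per country/year, e.g. India/2008=400.0 and France/2008=150.0.
        System.out.println(byCountryYear);
    }
}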
Now, we are going to practically implement all these OLAP operations using Microsoft Excel.

Procedure for OLAP Operations:

1. Open Microsoft Excel, go to the Data tab at the top & click on "Existing Connections".
2. The Existing Connections window will open; there, the "Browse for more" option should be clicked to import a .cub extension file for performing OLAP operations. As a sample, we took a sales cube (.cub) file.

3. As shown in the above window, select "PivotTable Report" and click "OK".
4. We now have all the cube data for analyzing the different OLAP operations. First, we performed the drill-down operation as shown below.

In the above window, we selected year '2008' in the 'Electronic' category; the Drill-Down option is then automatically enabled in the top navigation options. We click the 'Drill-Down' option, and the window below is displayed.

Now we are going to perform the roll-up (drill-up) operation. In the above window we selected the month of January; the Drill-up option is then automatically enabled on top. We click the Drill-up option, and the window below is displayed.

5. The next OLAP operation, slicing, is performed by inserting a slicer, as shown in the top navigation options.

While inserting slicers for the slicing operation, we select only 2 dimensions (e.g. CategoryName & Year) with one measure (e.g. Sum of Sales). After inserting a slicer & adding a filter (CategoryName: AVANT ROCK & BIG BAND; Year: 2009 & 2010), we get the table shown below.

6. The dicing operation is similar to the slicing operation. Here we select 3 dimensions (CategoryName, Year, RegionCode) & 2 measures (Sum of Quantity, Sum of Sales) through the 'Insert Slicer' option, and after that add a filter for CategoryName, Year & RegionCode as shown below.

7. Finally, the pivot (rotate) OLAP operation is performed by swapping the rows (Order Date-Year) & columns (Values-Sum of Quantity & Sum of Sales) through the bottom-right navigation bar as shown below.

After swapping (rotating), we get the result represented below, with a pie chart for the Classical category & year-wise data.

(v). Explore visualization features of the tool for analysis like identifying trends etc.
There are different visualization features for analyzing the data for trend analysis in data warehouses. Some of the popular visualizations are:
1. Column Charts
2. Line Charts
3. Pie Charts
4. Bar Graphs
5. Area Graphs
6. X & Y Scatter Graphs
7. Stock Graphs
8. Surface Charts
9. Radar Graphs
10. Treemap
11. Sunburst
12. Histogram
13. Box & Whisker
14. Waterfall
15. Combo Graphs
16. Geo Map
17. Heat Grid
18. Interactive Report
19. Stacked Column
20. Stacked Bar
21. Scatter Area

These types of visualizations can be used for analyzing data for trend analysis. Some of the tools for data visualization are Microsoft Excel, Tableau, Pentaho Business Analytics Online, etc. In practice, the different visualization features are tested with different sample datasets.

In the window below, we used the 3D column charts of Microsoft Excel for analyzing data in the data warehouse.

The window below represents data visualization through the Pentaho Business Analytics online tool for a sample dataset.

B. Explore WEKA Data Mining/Machine Learning Toolkit


(i). Downloading and/or installation of WEKA data mining toolkit
Procedure:
1. Go to the Weka website and download the software. On the left-hand side, click on the link that says Download.
2. Select the appropriate link corresponding to the version of the software based on your operating system and whether or not you already have a Java VM running on your machine (if you don't know what a Java VM is, then you probably don't).
3. The link will forward you to a site where you can download the software from a mirror
site. Save the self-extracting executable to disk and then double click on it to install Weka.
Answer yes or next to the questions during the installation.
4. Click yes to accept the Java agreement if necessary. After you install the program, Weka should appear on your Start menu under Programs (if you are using Windows).
5. To run Weka, from the Start menu select Programs, then Weka. You will see the Weka GUI Chooser. Select Explorer. The Weka Explorer will then launch.
(ii). Understand the features of WEKA toolkit such as Explorer, Knowledge Flow
interface, Experimenter, command-line interface.

The Weka GUI Chooser (class weka.gui.GUIChooser) provides a starting point for launching Weka's main GUI applications and supporting tools. If one prefers an MDI ("multiple document interface") appearance, then this is provided by an alternative launcher called "Main" (class weka.gui.Main).
The GUI Chooser consists of four buttons, one for each of the four major Weka applications, and four menus.

The buttons can be used to start the following applications:


Explorer - An environment for exploring data with WEKA.

a) Click on the "Explorer" button to bring up the Explorer window.
b) Make sure the "Preprocess" tab is highlighted.
c) Open a new file by clicking on "Open file" and choosing a file with the ".arff" extension from the "data" directory.
d) Attributes appear in the window below.
e) Click on the attributes to see the visualization on the right.
f) Click "Visualize all" to see them all.

Experimenter - An environment for performing experiments and conducting statistical tests between learning schemes.

a) The Experimenter is for comparing results.
b) Under the "Setup" tab click "New".
c) Click on "Add new" under the "Datasets" frame. Choose a couple of ARFF files from the "data" directory, one at a time.
d) Click on "Add new" under the "Algorithms" frame. Choose several algorithms, one at a time, by clicking "OK" in the window and then "Add new".
e) Under the "Run" tab click "Start".
f) Wait for WEKA to finish.
g) Under the "Analyse" tab click on "Experiment" to see the results.

Knowledge Flow - This environment supports essentially the same functions as the Explorer but with a drag-and-drop interface. One advantage is that it supports incremental learning.
SimpleCLI - Provides a simple command-line interface that allows direct execution of WEKA commands for operating systems that do not provide their own command line interface.
(iii). Navigate the options available in the WEKA (ex. Select attributes panel,
Preprocess panel, classify panel, Cluster panel, Associate panel and Visualize panel)

When the Explorer is first started only the first tab is active; the others are greyed out.
This is because it is necessary to open (and potentially pre-process) a data set before
starting to explore the data.
The tabs are as follows:
1. Preprocess. Choose and modify the data being acted on.
2. Classify. Train and test learning schemes that classify or perform regression.
3. Cluster. Learn clusters for the data.
4. Associate. Learn association rules for the data.
5. Select attributes. Select the most relevant attributes in the data.
6. Visualize. View an interactive 2D plot of the data.
Once the tabs are active, clicking on them flicks between different screens, on which the
respective actions can be performed. The bottom area of the window (including the status
box, the log button, and the Weka bird) stays visible regardless of which section you are
in.

1. Preprocessing

Loading Data:
The first four buttons at the top of the Preprocess section enable you to load data into WEKA:
1. Open file.... Brings up a dialog box allowing you to browse for the data file on the local file system.
2. Open URL.... Asks for a Uniform Resource Locator address for where the data is stored.
3. Open DB.... Reads data from a database. (Note that to make this work you might have to edit the file in weka/experiment/DatabaseUtils.props.)
4. Generate.... Enables you to generate artificial data from a variety of DataGenerators.

Using the Open file... button you can read files in a variety of formats: WEKA's ARFF format, CSV format, C4.5 format, or serialized Instances format. ARFF files typically have a .arff extension, CSV files a .csv extension, C4.5 files a .data and .names extension, and serialized Instances objects a .bsi extension.

2. Classification:

Selecting a Classifier
At the top of the Classify section is the Classifier box. This box has a text field that gives the name of the currently selected classifier and its options. Clicking on the text box with the left mouse button brings up a GenericObjectEditor dialog box, just the same as for filters, that you can use to configure the options of the current classifier. With a right click (or Alt+Shift+left click) you can once again copy the setup string to the clipboard or display the properties in a GenericObjectEditor dialog box. The Choose button allows you to choose one of the classifiers that are available in WEKA.

Test Options
The result of applying the chosen classifier will be tested according to the options that are set by clicking in the Test options box. There are four test modes:
1. Use training set: The classifier is evaluated on how well it predicts the class of the instances it was trained on.
2. Supplied test set: The classifier is evaluated on how well it predicts the class of a set of instances loaded from a file. Clicking the Set... button brings up a dialog allowing you to choose the file to test on.
3. Cross-validation: The classifier is evaluated by cross-validation, using the number of folds that are entered in the Folds text field.
4. Percentage split: The classifier is evaluated on how well it predicts a certain percentage of the data which is held out for testing. The amount of data held out depends on the value entered in the % field.

3. Clustering:

Cluster Modes:
The Cluster mode box is used to choose what to cluster and how to evaluate the results. The first three options are the same as for classification: Use training set, Supplied test set and Percentage split.

4. Associating:

Setting Up
This panel contains schemes for learning association rules, and the learners are chosen
and configured in the same way as the clusterers, filters, and classifiers in the other
panels.
5. Selecting Attributes:

Searching and Evaluating

Attribute selection involves searching through all possible combinations of attributes in the data to find which subset of attributes works best for prediction. To do this, two objects must be set up: an attribute evaluator and a search method. The evaluator determines what method is used to assign a worth to each subset of attributes. The search method determines what style of search is performed.
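The same evaluator/search pairing can be driven from Weka's Java API. Below is a minimal sketch, assuming a local iris.arff file, using the CfsSubsetEval evaluator with BestFirst search (other evaluator/search combinations plug in the same way).

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SelectAttributesDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff"); // path is an assumption
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CfsSubsetEval()); // assigns a worth to each attribute subset
        selector.setSearch(new BestFirst());        // search strategy over the subsets
        selector.SelectAttributes(data);

        System.out.println(selector.toResultsString()); // chosen subset and search details
    }
}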

6. Visualizing:

WEKA's visualization section allows you to visualize 2D plots of the current relation.
(iv). Study the arff file format
An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a
list of instances sharing a set of attributes. ARFF files were developed by the Machine
Learning Project at the Department of Computer Science of The University of Waikato
for use with the Weka machine learning software.

Overview

ARFF files have two distinct sections. The first section is the Header information, which is followed by the Data information.

The Header of the ARFF file contains the name of the relation, a list of the attributes (the
columns in the data), and their types. An example header on the standard IRIS dataset
looks like this:

% 1. Title: Iris Plants Database
%
% 2. Sources:
% (a) Creator: R.A. Fisher
% (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
% (c) Date: July, 1988
%
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

The Data of the ARFF file looks like the following:

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
Lines that begin with a % are comments.
The @RELATION, @ATTRIBUTE and @DATA declarations are case insensitive.
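To verify the format programmatically, an ARFF file can be loaded through Weka's Java API. A minimal sketch follows, assuming iris.arff is in the working directory and the Weka jar is on the classpath.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadArff {
    public static void main(String[] args) throws Exception {
        // DataSource picks a loader from the file extension (.arff here).
        Instances data = DataSource.read("iris.arff");
        data.setClassIndex(data.numAttributes() - 1); // last attribute is the class

        System.out.println("Relation:   " + data.relationName());
        System.out.println("Attributes: " + data.numAttributes());
        System.out.println("Instances:  " + data.numInstances());
    }
}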
(v). Explore the available data sets in WEKA
There are 23 different datasets available in WEKA (C:\Program Files\Weka-3-6\data) by default for testing purposes. All the datasets are available in .arff format. Those datasets are listed below.
(vi). Load a data set (ex. Weather dataset, Iris dataset, etc.)
Procedure:
1. Open the WEKA tool and select the Explorer option.
2. A new window will open, which consists of different options (Preprocess, Associate, etc.).
3. In the Preprocess tab, click the "Open file" option.
4. Go to C:\Program Files\Weka-3-6\data to find the different existing .arff datasets.
5. Click on any dataset to load the data; the data will then be displayed as shown below.

(vii). Load each dataset and observe the following:

Here we have taken the iris.arff dataset as a sample for observing all of the below.
i. List the attribute names and their types
There are 5 attributes and their datatypes present in the above loaded dataset (iris.arff):
sepallength - Numeric
sepalwidth - Numeric
petallength - Numeric
petalwidth - Numeric
class - Nominal
ii. Number of records in each dataset
There are in total 150 records (instances) in the dataset (iris.arff).

iii. Identify the class attribute (if any)
There is one class attribute which consists of 3 labels. They
are:
1. Iris-setosa
2. Iris-versicolor
3. Iris-virginica
iv. Plot Histogram

v. Determine the number of records for each class.

There is one class attribute (150 records) which consists of 3 labels, shown below:
1. Iris-setosa - 50 records
2. Iris-versicolor - 50 records
3. Iris-virginica - 50 records
vi. Visualize the data in various dimensions

Result :

This program has been successfully executed.

Ex. No: 02 Demonstration of preprocessing on dataset student.arff
Aim:
This experiment illustrates some of the basic data preprocessing operations that can be
performed using WEKA-Explorer. The sample dataset used for this example is the student
data available in arff format.

Steps:
Step1: Loading the data. We can load the dataset into weka by clicking on open button in
preprocessing interface and selecting the appropriate file.
Step2: Once the data is loaded, weka will recognize the attributes, and during the scan of the data weka will compute some basic statistics on each attribute. The left panel in the above figure shows the list of recognized attributes, while the top panel indicates the names of the base relation or table and the current working relation (which are the same initially).
Step3: Clicking on an attribute in the left panel will show the basic statistics on that attribute. For categorical attributes the frequency of each attribute value is shown, while for continuous attributes we can obtain min, max, mean, standard deviation, etc.
Step4: The visualization in the bottom-right panel is in the form of a cross-tabulation across two attributes.
Note: we can select another attribute using the dropdown list.
Step5: Selecting or filtering attributes

Removing an attribute:
When we need to remove an attribute,we can do this by using the attribute filters in [Link]
the filter model panel,click on choose button,This will show a popup window with a list of
available filters.

Scroll down the list and select the “[Link]” filters.


Step 6: a) Next click the textbox immediately to the right of the Choose button. In the resulting dialog box enter the index of the attribute to be filtered out.
b) Make sure that the invert selection option is set to false. Then click OK. Now in the filter box you will see "Remove -R 7".
c) Click the Apply button to apply the filter to this data. This will remove the attribute and create a new working relation.
d) Save the new working relation as an ARFF file by clicking the Save button on the top panel.

Discretization:

1) Sometimes association rule mining can only be performed on categorical data. This requires performing discretization on numeric or continuous attributes. In the following example let us discretize the age attribute.

• Let us divide the values of the age attribute into three bins (intervals).
• First load the dataset into weka (student.arff).
• Select the age attribute.
• Activate the filter dialog box and select "weka.filters.unsupervised.attribute.Discretize" from the list.
• To change the defaults for the filter, click on the box immediately to the right of the Choose button.
• We enter the index of the attribute to be discretized. In this case the attribute is age, so we must enter '1' corresponding to the age attribute.
• Enter '3' as the number of bins. Leave the remaining field values as they are.
• Click the OK button.
• Click Apply in the filter panel. This will result in a new working relation with the selected attribute partitioned into 3 bins.
• Save the new working relation in a new ARFF file.
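Both filtering steps above can also be scripted through Weka's Java API. Below is a minimal sketch, assuming a student.arff whose first attribute is numeric age; note that the manual's own student.arff already stores age as nominal bins, so the attribute indices here are illustrative and should be adjusted to your dataset.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;
import weka.filters.unsupervised.attribute.Remove;

public class PreprocessDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("student.arff");

        // Remove attribute 7, the programmatic equivalent of "Remove -R 7"
        // (index 7 is taken from the steps above; adjust to your dataset).
        Remove remove = new Remove();
        remove.setAttributeIndices("7");
        remove.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, remove);

        // Discretize attribute 1 (age) into 3 equal-width bins (the filter's default binning).
        Discretize discretize = new Discretize();
        discretize.setAttributeIndices("1");
        discretize.setBins(3);
        discretize.setInputFormat(reduced);
        Instances binned = Filter.useFilter(reduced, discretize);

        System.out.println(binned.toSummaryString());
    }
}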
Dataset student.arff:
@relation student
@attribute age {<30,30-40,>40}
@attribute income {low, medium, high}
@attribute student {yes, no}
@attribute credit-rating {fair, excellent}
@attribute buyspc {yes, no}
@data
%
<30, high, no, fair, no
<30, high, no, excellent, no
30-40, high, no, fair, yes
>40, medium, no, fair, yes
>40, low, yes, fair, yes
>40, low, yes, excellent, no
30-40, low, yes, excellent, yes
<30, medium, no, fair, no
<30, low, yes, fair, no

>40, medium, yes, fair, yes
<30, medium, yes, excellent, yes
30-40, medium, no, excellent, yes
30-40, high, yes, fair, yes
>40, medium, no, excellent, no
%

The following screenshot shows the effect of discretization.

Result :

This program has been successfully executed.


Ex. No: 03 Demonstration of preprocessing on dataset labor.arff
Aim:
This experiment illustrates some of the basic data preprocessing operations that can be
performed using WEKA-Explorer. The sample dataset used for this example is the labor
data available in arff format.

Steps:
Step1: Loading the data. We can load the dataset into weka by clicking on open button in
preprocessing interface and selecting the appropriate file.
Step2: Once the data is loaded, weka will recognize the attributes, and during the scan of the data weka will compute some basic statistics on each attribute. The left panel in the above figure shows the list of recognized attributes, while the top panel indicates the names of the base relation or table and the current working relation (which are the same initially).
Step3: Clicking on an attribute in the left panel will show the basic statistics on that attribute. For categorical attributes the frequency of each attribute value is shown, while for continuous attributes we can obtain min, max, mean, standard deviation, etc.

Step4: The visualization in the bottom-right panel is in the form of a cross-tabulation across two attributes.
Note: we can select another attribute using the dropdown list.
Step5: Selecting or filtering attributes

Removing an attribute:
When we need to remove an attribute,we can do this by using the attribute filters in [Link]
the filter model panel,click on choose button,This will show a popup window with a list of
available filters.

Scroll down the list and select the “[Link]” filters.


Step 6: a) Next click the textbox immediately to the right of the Choose button. In the resulting dialog box enter the index of the attribute to be filtered out.
b) Make sure that the invert selection option is set to false. Then click OK. Now in the filter box you will see "Remove -R 7".
c) Click the Apply button to apply the filter to this data. This will remove the attribute and create a new working relation.
d) Save the new working relation as an ARFF file by clicking the Save button on the top panel.
Discretization:
• Sometimes association rule mining can only be performed on categorical data. This requires performing discretization on numeric or continuous attributes. In the following example let us discretize the duration attribute, dividing its values into bins (intervals).
• First load the dataset into weka (labor.arff).
• Select the duration attribute.
• Activate the filter dialog box and select "weka.filters.unsupervised.attribute.Discretize" from the list.
• To change the defaults for the filter, click on the box immediately to the right of the Choose button.
• We enter the index of the attribute to be discretized. In this case the attribute is duration, so we must enter '1' corresponding to the duration attribute.
• Enter '1' as the number of bins. Leave the remaining field values as they are.
• Click the OK button.
• Click Apply in the filter panel. This will result in a new working relation with the selected attribute partitioned into 1 bin.
• Save the new working relation in a new ARFF file.

Dataset labor.arff:
@relation 'labor-neg-data'
@attribute 'duration' real
@attribute 'wage-increase-first-year' real
@attribute 'wage-increase-second-year' real
@attribute 'wage-increase-third-year' real
@attribute 'cost-of-living-adjustment' {'none','tcf','tc'}
@attribute 'working-hours' real
@attribute 'pension' {'none','ret_allw','empl_contr'}
@attribute 'standby-pay' real
@attribute 'shift-differential' real
@attribute 'education-allowance' {'yes','no'}
@attribute 'statutory-holidays' real
@attribute 'vacation' {'below_average','average','generous'}
@attribute 'longterm-disability-assistance' {'yes','no'}
@attribute 'contribution-to-dental-plan' {'none','half','full'}
@attribute 'bereavement-assistance' {'yes','no'}
@attribute 'contribution-to-health-plan' {'none','half','full'}
@attribute 'class' {'bad','good'}
@data
1,5,?,?,?,40,?,?,2,?,11,'average',?,?,'yes',?,'good'
2,4.5,5.8,?,?,35,'ret_allw',?,?,'yes',11,'below_average',?,'full',?,'full','good'
?,?,?,?,?,38,'empl_contr',?,5,?,11,'generous','yes','half','yes','half','good'
3,3.7,4,5,'tc',?,?,?,?,'yes',?,?,?,?,'yes',?,'good'
3,4.5,4.5,5,?,40,?,?,?,?,12,'average',?,'half','yes','half','good'
2,2,2.5,?,?,35,?,?,6,'yes',12,'average',?,?,?,?,'good'
3,4,5,5,'tc',?,'empl_contr',?,?,?,12,'generous','yes','none','yes','half','good'
3,6.9,4.8,2.3,?,40,?,?,3,?,12,'below_average',?,?,?,?,'good'
2,3,7,?,?,38,?,12,25,'yes',11,'below_average','yes','half','yes',?,'good'
1,5.7,?,?,'none',40,'empl_contr',?,4,?,11,'generous','yes','full',?,?,'good'
3,3.5,4,4.6,'none',36,?,?,3,?,13,'generous',?,?,'yes','full','good'
2,6.4,6.4,?,?,38,?,?,4,?,15,?,?,'full',?,?,'good'
2,3.5,4,?,'none',40,?,?,2,'no',10,'below_average','no','half',?,'half','bad'
3,3.5,4,5.1,'tcf',37,?,?,4,?,13,'generous',?,'full','yes','full','good'
1,3,?,?,'none',36,?,?,10,'no',11,'generous',?,?,?,?,'good'
2,4.5,4,?,'none',37,'empl_contr',?,?,?,11,'average',?,'full','yes',?,'good'
1,2.8,?,?,?,35,?,?,2,?,12,'below_average',?,?,?,?,'good'
1,2.1,?,?,'tc',40,'ret_allw',2,3,'no',9,'below_average','yes','half',?,'none','bad'
1,2,?,?,'none',38,'none',?,?,'yes',11,'average','no','none','no','none','bad'
2,4,5,?,'tcf',35,?,13,5,?,15,'generous',?,?,?,?,'good'
2,4.3,4.4,?,?,38,?,?,4,?,12,'generous',?,'full',?,'full','good'
2,2.5,3,?,?,40,'none',?,?,?,11,'below_average',?,?,?,?,'bad'
3,3.5,4,4.6,'tcf',27,?,?,?,?,?,?,?,?,?,?,'good'
2,4.5,4,?,?,40,?,?,4,?,10,'generous',?,'half',?,'full','good'
1,6,?,?,?,38,?,8,3,?,9,'generous',?,?,?,?,'good'
3,2,2,2,'none',40,'none',?,?,?,10,'below_average',?,'half','yes','full','bad'
2,4.5,4.5,?,'tcf',?,?,?,?,'yes',10,'below_average','yes','none',?,'half','good'
2,3,3,?,'none',33,?,?,?,'yes',12,'generous',?,?,'yes','full','good'
2,5,4,?,'none',37,?,?,5,'no',11,'below_average','yes','full','yes','full','good'
3,2,2.5,?,?,35,'none',?,?,?,10,'average',?,?,'yes','full','bad'
3,4.5,4.5,5,'none',40,?,?,?,'no',11,'average',?,'half',?,?,'good'
3,3,2,2.5,'tc',40,'none',?,5,'no',10,'below_average','yes','half','yes','full','bad'
2,2.5,2.5,?,?,38,'empl_contr',?,?,?,10,'average',?,?,?,?,'bad'
2,4,5,?,'none',40,'none',?,3,'no',10,'below_average','no','none',?,'none','bad'
3,2,2.5,2.1,'tc',40,'none',2,1,'no',10,'below_average','no','half','yes','full','bad'
2,2,2,?,'none',40,'none',?,?,'no',11,'average','yes','none','yes','full','bad'
1,2,?,?,'tc',40,'ret_allw',4,0,'no',11,'generous','no','none','no','none','bad'
1,2.8,?,?,'none',38,'empl_contr',2,3,'no',9,'below_average','yes','half',?,'none','bad'
3,2,2.5,2,?,37,'empl_contr',?,?,?,10,'average',?,?,'yes','none','bad'
2,4.5,4,?,'none',40,?,?,4,?,12,'average','yes','full','yes','half','good'
1,4,?,?,'none',?,'none',?,?,'yes',11,'average','no','none','no','none','bad'
2,2,3,?,'none',38,'empl_contr',?,?,'yes',12,'generous','yes','none','yes','full','bad'
2,2.5,2.5,?,'tc',39,'empl_contr',?,?,?,12,'average',?,?,'yes',?,'bad'
2,2.5,3,?,'tcf',40,'none',?,?,?,11,'below_average',?,?,'yes',?,'bad'
2,4,4,?,'none',40,'none',?,3,?,10,'below_average','no','none',?,'none','bad'

2,4.5,4,?,?,40,?,?,2,'no',10,'below_average','no','half',?,'half','bad'
%
%

The following screenshot shows the effect of discretization.

Result :

This program has been successfully executed.


Ex. No: 04 Demonstration of Association rule process on dataset contactlenses.arff using apriori algorithm
Aim:
This experiment illustrates some of the basic elements of association rule mining using WEKA. The sample dataset used for this example is contactlenses.arff.

Steps:
Step1: Open the data file in the Weka Explorer. It is presumed that the required data fields have been discretized. In this example it is the age attribute.
Step2: Clicking on the associate tab will bring up the interface for association rule
algorithm.
Step3: We will use apriori algorithm. This is the default algorithm.
Step4: In order to change the parameters for the run (e.g. support, confidence, etc.) we click on the text box immediately to the right of the Choose button.
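The same run can be reproduced with Weka's Java API. A minimal sketch follows, assuming contactlenses.arff is in the working directory; the support and confidence values below mirror common Apriori defaults and can be changed just like the Explorer parameters.

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("contactlenses.arff"); // all-nominal data, as Apriori requires

        Apriori apriori = new Apriori();
        apriori.setNumRules(10);               // report the 10 best rules
        apriori.setLowerBoundMinSupport(0.2);  // minimum support
        apriori.setMinMetric(0.9);             // minimum confidence
        apriori.buildAssociations(data);

        System.out.println(apriori); // prints the generated association rules
    }
}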

Dataset contactlenses.arff:
@relation contact-lenses
@attribute age {young, pre-presbyopic, presbyopic}
@attribute spectacle-prescrip {myope, hypermetrope}
@attribute astigmatism{no, yes}
@attribute tear-prod-rate {reduced, normal}
@attribute contact-lenses {soft, hard, none}
@data
%
% 24 instances %
young,myope,no,reduced,none
young,myope,no,normal,soft
young,myope,yes,reduced,none
young,myope,yes,normal,hard
young,hypermetrope,no,reduced,none
young,hypermetrope,no,normal,soft
young,hypermetrope,yes,reduced,none
young,hypermetrope,yes,normal,hard
pre-presbyopic,myope,no,reduced,none
pre-presbyopic,myope,no,normal,soft
pre-presbyopic,myope,yes,reduced,none
pre-presbyopic,myope,yes,normal,hard
pre-presbyopic,hypermetrope,no,reduced,none

pre-presbyopic,hypermetrope,no,normal,soft
pre-presbyopic,hypermetrope,yes,reduced,none
pre-presbyopic,hypermetrope,yes,normal,none
presbyopic,myope,no,reduced,none
presbyopic,myope,no,normal,none
presbyopic,myope,yes,reduced,none
presbyopic,myope,yes,normal,hard
presbyopic,hypermetrope,no,reduced,none
presbyopic,hypermetrope,no,normal,soft
presbyopic,hypermetrope,yes,reduced,none
presbyopic,hypermetrope,yes,normal,none
%
%
%

The following screenshot shows the association rules that were generated when the apriori algorithm is applied to the given dataset.

Result :

This program has been successfully executed


Ex. No: 05 Demonstration of Association rule process on dataset test.arff using apriori algorithm

Aim:
This experiment illustrates some of the basic elements of association rule mining using WEKA. The sample dataset used for this example is test.arff.

Steps:
Step1: Open the data file in the Weka Explorer. It is presumed that the required data fields have been discretized.
Step2: Clicking on the associate tab will bring up the interface for association rule
algorithm.
Step3: We will use apriori algorithm. This is the default algorithm.
Step4: In order to change the parameters for the run (e.g. support, confidence, etc.) we click on the text box immediately to the right of the Choose button.

Dataset test.arff:

@relation test
@attribute admissionyear {2005,2006,2007,2008,2009,2010}
@attribute course {cse,mech,it,ece}
@data
%
2005, cse
2005, it
2005, cse
2006, mech
2006, it
2006, ece
2007, it
2007, cse
2008, it
2008, cse
2009, it
2009, ece
%
The following screenshot shows the association rules that were generated when the apriori algorithm is applied to the given dataset.

Result :

This program has been successfully executed.


Ex. No: 06 Demonstration of classification rule process on dataset student.arff using j48 algorithm
Aim:
This experiment illustrates the use of the J48 classifier in weka. The sample data set used in this experiment is the "student" data available in ARFF format. This document assumes that appropriate data preprocessing has been performed.

Steps:
Step-1: We begin the experiment by loading the data (student.arff) into weka.
Step2: Next we select the "classify" tab and click the "choose" button to select the "J48" classifier.
Step3: Now we specify the various parameters. These can be specified by clicking in the text box to the right of the Choose button. In this example, we accept the default values. The default version does perform some pruning but does not perform error pruning.

Step4: Under the "Test options" in the main panel we select 10-fold cross-validation as our evaluation approach. Since we don't have a separate evaluation data set, this is necessary to get a reasonable idea of the accuracy of the generated model.

Step-5: We now click "Start" to generate the model. The ASCII version of the tree as well as evaluation statistics will appear in the right panel when the model construction is complete.
Step-6: Note that the classification accuracy of the model is about 69%. This indicates that more work may be needed (either in preprocessing or in selecting different parameters for the classification).
Step-7: Weka also lets us view a graphical version of the classification tree. This can be done by right-clicking the last result set and selecting "Visualize tree" from the pop-up menu.
Step-8: We will use our model to classify the new instances.
Step-9: In the main panel, under "Test options", click the "Supplied test set" radio button and then click the "Set" button. This will pop up a window which will allow you to open the file containing the test instances.
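Steps 1-5 can also be reproduced programmatically with Weka's Java API. Below is a minimal sketch, assuming student.arff is in the working directory and its last attribute (buyspc) is the class.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Demo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("student.arff");
        data.setClassIndex(data.numAttributes() - 1); // buyspc is the class attribute

        J48 tree = new J48(); // default parameters, as in Step3
        tree.buildClassifier(data);
        System.out.println(tree); // ASCII version of the tree

        // 10-fold cross-validation, as in Step4.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}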

Dataset student.arff:

@relation student
@attribute age {<30,30-40,>40}
@attribute income {low, medium, high}
@attribute student {yes, no}
@attribute credit-rating {fair, excellent}
@attribute buyspc {yes, no}
@data
%
<30, high, no, fair, no
<30, high, no, excellent, no
30-40, high, no, fair, yes
>40, medium, no, fair, yes
>40, low, yes, fair, yes
>40, low, yes, excellent, no
30-40, low, yes, excellent, yes
<30, medium, no, fair, no
<30, low, yes, fair, no
>40, medium, yes, fair, yes
<30, medium, yes, excellent, yes
30-40, medium, no, excellent, yes
30-40, high, yes, fair, yes
>40, medium, no, excellent, no
%

The following screenshot shows the classification rules that were generated when the J48 algorithm is applied to the given dataset.

Result :

This program has been successfully executed.

Ex. No: 07 Demonstration of classification rule process on dataset employee.arff using j48 algorithm

Aim:
This experiment illustrates the use of the J48 classifier in weka. The sample data set used in this experiment is the "employee" data available in ARFF format. This document assumes that appropriate data preprocessing has been performed.

Steps:
Step 1: We begin the experiment by loading the data (employee.arff) into weka.
Step2: Next we select the "classify" tab and click the "choose" button to select the "J48" classifier.
Step3: Now we specify the various parameters. These can be specified by clicking in the text box to the right of the Choose button. In this example, we accept the default values. The default version does perform some pruning but does not perform error pruning.

Step4: Under the "Test options" in the main panel we select 10-fold cross-validation as our evaluation approach. Since we don't have a separate evaluation data set, this is necessary to get a reasonable idea of the accuracy of the generated model.

Step-5: We now click "Start" to generate the model. The ASCII version of the tree as well as evaluation statistics will appear in the right panel when the model construction is complete.
Step-6: Note that the classification accuracy of the model is about 69%. This indicates that more work may be needed (either in preprocessing or in selecting different parameters for the classification).
Step-7: Weka also lets us view a graphical version of the classification tree. This can be done by right-clicking the last result set and selecting "Visualize tree" from the pop-up menu.
Step-8: We will use our model to classify the new instances.
Step-9: In the main panel, under "Test options", click the "Supplied test set" radio button and then click the "Set" button. This will pop up a window which will allow you to open the file containing the test instances.

Dataset employee.arff:

@relation employee
@attribute age {25, 27, 28, 29, 30, 35, 48}
@attribute salary{10k,15k,17k,20k,25k,30k,35k,32k}
@attribute performance {good, avg, poor}
@data
%
25, 10k, poor
27, 15k, poor
27, 17k, poor
28, 17k, poor
29, 20k, avg
30, 25k, avg
29, 25k, avg
30, 20k, avg
35, 32k, good
48, 35k, good
48, 32k, good
%

Result :

This program has been successfully executed.


Ex. No: 08 Demonstration of classification rule process on dataset employee.arff using id3 algorithm

Aim:
This experiment illustrates the use of the Id3 classifier in weka. The sample data set used in this experiment is the "employee" data available in ARFF format. This document assumes that appropriate data preprocessing has been performed.

Steps:
Step-1: We begin the experiment by loading the data (employee.arff) into weka.
Step-2: Next we select the "classify" tab and click the "choose" button to select the "Id3" classifier.
Step-3: Now we specify the various parameters. These can be specified by clicking in the text box to the right of the Choose button. In this example, we accept the default values.

Step-4: Under the "Test options" in the main panel we select 10-fold cross-validation as our evaluation approach. Since we don't have a separate evaluation data set, this is necessary to get a reasonable idea of the accuracy of the generated model.

Step-5: We now click "Start" to generate the model. The ASCII version of the tree as well as evaluation statistics will appear in the right panel when the model construction is complete.
Step-6: Note that the classification accuracy of the model is about 69%. This indicates that more work may be needed (either in preprocessing or in selecting different parameters for the classification).
Step-7: Weka also lets us view a graphical version of the classification tree. This can be done by right-clicking the last result set and selecting "Visualize tree" from the pop-up menu.

Step-8: We will use our model to classify the new instances.

Step-9: In the main panel, under "Test options", click the "Supplied test set" radio button and then click the "Set" button. This will show a pop-up window which will allow you to open the file containing the test instances.

Dataset employee.arff:

@relation employee
@attribute age {25, 27, 28, 29, 30, 35, 48}
@attribute salary{10k,15k,17k,20k,25k,30k,35k,32k}
@attribute performance {good, avg, poor}
@data
%
25, 10k, poor
27, 15k, poor
27, 17k, poor
28, 17k, poor
29, 20k, avg
30, 25k, avg
29, 25k, avg
30, 20k, avg
35, 32k, good
48, 35k, good
48, 32k, good
%
The following screenshot shows the classification rules that were generated when the Id3 algorithm is applied to the given dataset.

Result :

This program has been successfully executed.


Ex. No: 09 Demonstration of classification rule process on dataset employee.arff using naïve bayes algorithm
Aim:
This experiment illustrates the use of the naïve bayes classifier in weka. The sample data set used in this experiment is the "employee" data available in ARFF format. This document assumes that appropriate data preprocessing has been performed.

Steps:
Step-1: We begin the experiment by loading the data (employee.arff) into weka.
Step-2: Next we select the "classify" tab and click the "choose" button to select the "NaiveBayes" classifier.
Step-3: Now we specify the various parameters. These can be specified by clicking in the text box to the right of the Choose button. In this example, we accept the default values.

Step-4: Under the "Test options" in the main panel we select 10-fold cross-validation as our evaluation approach. Since we don't have a separate evaluation data set, this is necessary to get a reasonable idea of the accuracy of the generated model.

Step-5: We now click "Start" to generate the model. The model output as well as evaluation statistics will appear in the right panel when the model construction is complete.
Step-6: Note that the classification accuracy of the model is about 69%. This indicates that more work may be needed (either in preprocessing or in selecting different parameters for the classification).
Step-7: Weka also lets us view further details of the generated model by right-clicking the last result set in the result list.
Step-8: We will use our model to classify the new instances.
Step-9: In the main panel, under "Test options", click the "Supplied test set" radio button and then click the "Set" button. This will show a pop-up window which will allow you to open the file containing the test instances.
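A minimal Java sketch of the same run follows, assuming employee.arff is in the working directory and its last attribute (performance) is the class.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NaiveBayesDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("employee.arff");
        data.setClassIndex(data.numAttributes() - 1); // performance is the class attribute

        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(data);
        System.out.println(nb); // per-class attribute distributions

        // 10-fold cross-validation, as in Step-4.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(nb, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}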

Dataset employee.arff:
@relation employee
@attribute age {25, 27, 28, 29, 30, 35, 48}
@attribute salary{10k,15k,17k,20k,25k,30k,35k,32k}
@attribute performance {good, avg, poor}
@data
%
25, 10k, poor
27, 15k, poor
27, 17k, poor
28, 17k, poor
29, 20k, avg
30, 25k, avg
29, 25k, avg
30, 20k, avg
35, 32k, good
48, 35k, good
48, 32k, good
%

The following screenshot shows the classification rules that were generated when the naive bayes algorithm is applied to the given dataset.

Result :

This program has been successfully executed.


Ex. No: 10 Demonstration of clustering rule process on dataset iris.arff using simple k-means
Aim:
This experiment illustrates the use of simple k-means clustering with the Weka Explorer. The sample data set used for this example is based on the iris data available in ARFF format. This document assumes that appropriate preprocessing has been performed. This iris dataset includes 150 instances.
Steps:
Step 1: Run the Weka Explorer and load the data file iris.arff in the preprocessing interface.
Step 2: In order to perform clustering, select the 'Cluster' tab in the Explorer and click on the Choose button. This step results in a dropdown list of available clustering algorithms.

Step 3: In this case we select 'SimpleKMeans'.
Step 4: Next click the text button to the right of the Choose button to get the popup window shown in the screenshots. In this window we enter six as the number of clusters and we leave the value of the seed as it is. The seed value is used in generating a random number which is used for making the initial assignment of instances to clusters.
Step 5: Once the options have been specified, we run the clustering algorithm. In the 'Cluster mode' panel we make sure that the 'Use training set' option is selected, and then we click the 'Start' button. This process and the resulting window are shown in the following screenshots.
Step 6: The result window shows the centroid of each cluster as well as statistics on the number and the percentage of instances assigned to the different clusters. Here the cluster centroids are mean vectors for each cluster, and they can be used to characterize the clusters. For example, the centroid of cluster 1 shows mean values of 5.4706 for sepal length, 2.4765 for sepal width, 1.1294 for petal width and 3.7941 for petal length.
Step 7: Another way of understanding characterstics of each cluster through visualization
,we can do this, try right clicking the result set on the result. List panel and selecting the
visualize cluster assignments.
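For reference, the same clustering run can be scripted through the WEKA Java API. The
following is a minimal sketch, assuming the dataset below is saved as iris.arff; the class
attribute is removed first so that, as in the Explorer, only the four measurements are
clustered, and the cluster count and seed follow Step 4:

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class IrisKMeans {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff");

        // Drop the class label so that clustering uses only the numeric measurements.
        Remove rm = new Remove();
        rm.setAttributeIndices("last");
        rm.setInputFormat(data);
        Instances noClass = Filter.useFilter(data, rm);

        // Six clusters and the default seed (10), matching Step 4.
        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(6);
        km.setSeed(10);
        km.buildClusterer(noClass);

        // Print the cluster centroids and the number of instances per cluster.
        ClusterEvaluation ce = new ClusterEvaluation();
        ce.setClusterer(km);
        ce.evaluateClusterer(noClass);
        System.out.println(ce.clusterResultsToString());
    }
}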
Dataset iris.arff:

@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.4,3.7,1.5,0.2,Iris-setosa
4.8,3.4,1.6,0.2,Iris-setosa
4.8,3.0,1.4,0.1,Iris-setosa
4.3,3.0,1.1,0.1,Iris-setosa
5.8,4.0,1.2,0.2,Iris-setosa
5.7,4.4,1.5,0.4,Iris-setosa
5.4,3.9,1.3,0.4,Iris-setosa
5.1,3.5,1.4,0.3,Iris-setosa
5.7,3.8,1.7,0.3,Iris-setosa
5.1,3.8,1.5,0.3,Iris-setosa
5.4,3.4,1.7,0.2,Iris-setosa
5.1,3.7,1.5,0.4,Iris-setosa
4.6,3.6,1.0,0.2,Iris-setosa
5.1,3.3,1.7,0.5,Iris-setosa
4.8,3.4,1.9,0.2,Iris-setosa
5.0,3.0,1.6,0.2,Iris-setosa
5.0,3.4,1.6,0.4,Iris-setosa
5.2,3.5,1.5,0.2,Iris-setosa
5.2,3.4,1.4,0.2,Iris-setosa
4.7,3.2,1.6,0.2,Iris-setosa
4.8,3.1,1.6,0.2,Iris-setosa
5.4,3.4,1.5,0.4,Iris-setosa
5.2,4.1,1.5,0.1,Iris-setosa
5.5,4.2,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.0,3.2,1.2,0.2,Iris-setosa
5.5,3.5,1.3,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa

4.4,3.0,1.3,0.2,Iris-setosa
5.1,3.4,1.5,0.2,Iris-setosa
5.0,3.5,1.3,0.3,Iris-setosa
4.5,2.3,1.3,0.3,Iris-setosa
4.4,3.2,1.3,0.2,Iris-setosa
5.0,3.5,1.6,0.6,Iris-setosa
5.1,3.8,1.9,0.4,Iris-setosa
4.8,3.0,1.4,0.3,Iris-setosa
5.1,3.8,1.6,0.2,Iris-setosa
4.6,3.2,1.4,0.2,Iris-setosa
5.3,3.7,1.5,0.2,Iris-setosa
5.0,3.3,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
5.7,2.8,4.5,1.3,Iris-versicolor
6.3,3.3,4.7,1.6,Iris-versicolor
4.9,2.4,3.3,1.0,Iris-versicolor
6.6,2.9,4.6,1.3,Iris-versicolor
5.2,2.7,3.9,1.4,Iris-versicolor
5.0,2.0,3.5,1.0,Iris-versicolor
5.9,3.0,4.2,1.5,Iris-versicolor
6.0,2.2,4.0,1.0,Iris-versicolor
6.1,2.9,4.7,1.4,Iris-versicolor
5.6,2.9,3.6,1.3,Iris-versicolor
6.7,3.1,4.4,1.4,Iris-versicolor
5.6,3.0,4.5,1.5,Iris-versicolor
5.8,2.7,4.1,1.0,Iris-versicolor
6.2,2.2,4.5,1.5,Iris-versicolor
5.6,2.5,3.9,1.1,Iris-versicolor
5.9,3.2,4.8,1.8,Iris-versicolor
6.1,2.8,4.0,1.3,Iris-versicolor
6.3,2.5,4.9,1.5,Iris-versicolor
6.1,2.8,4.7,1.2,Iris-versicolor
6.4,2.9,4.3,1.3,Iris-versicolor
6.6,3.0,4.4,1.4,Iris-versicolor
6.8,2.8,4.8,1.4,Iris-versicolor
6.7,3.0,5.0,1.7,Iris-versicolor
6.0,2.9,4.5,1.5,Iris-versicolor
5.7,2.6,3.5,1.0,Iris-versicolor
5.5,2.4,3.8,1.1,Iris-versicolor
5.5,2.4,3.7,1.0,Iris-versicolor
5.8,2.7,3.9,1.2,Iris-versicolor
6.0,2.7,5.1,1.6,Iris-versicolor
5.4,3.0,4.5,1.5,Iris-versicolor
6.0,3.4,4.5,1.6,Iris-versicolor
6.7,3.1,4.7,1.5,Iris-versicolor
6.3,2.3,4.4,1.3,Iris-versicolor
5.6,3.0,4.1,1.3,Iris-versicolor
5.5,2.5,4.0,1.3,Iris-versicolor
5.5,2.6,4.4,1.2,Iris-versicolor
6.1,3.0,4.6,1.4,Iris-versicolor
5.8,2.6,4.0,1.2,Iris-versicolor
5.0,2.3,3.3,1.0,Iris-versicolor
5.6,2.7,4.2,1.3,Iris-versicolor
5.7,3.0,4.2,1.2,Iris-versicolor
5.7,2.9,4.2,1.3,Iris-versicolor
6.2,2.9,4.3,1.3,Iris-versicolor
5.1,2.5,3.0,1.1,Iris-versicolor
5.7,2.8,4.1,1.3,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica
6.3,2.9,5.6,1.8,Iris-virginica
6.5,3.0,5.8,2.2,Iris-virginica
7.6,3.0,6.6,2.1,Iris-virginica
4.9,2.5,4.5,1.7,Iris-virginica
7.3,2.9,6.3,1.8,Iris-virginica
6.7,2.5,5.8,1.8,Iris-virginica
7.2,3.6,6.1,2.5,Iris-virginica
6.5,3.2,5.1,2.0,Iris-virginica
6.4,2.7,5.3,1.9,Iris-virginica
6.8,3.0,5.5,2.1,Iris-virginica
5.7,2.5,5.0,2.0,Iris-virginica
5.8,2.8,5.1,2.4,Iris-virginica
6.4,3.2,5.3,2.3,Iris-virginica
6.5,3.0,5.5,1.8,Iris-virginica
7.7,3.8,6.7,2.2,Iris-virginica
7.7,2.6,6.9,2.3,Iris-virginica
6.0,2.2,5.0,1.5,Iris-virginica

6.9,3.2,5.7,2.3,Iris-virginica
5.6,2.8,4.9,2.0,Iris-virginica
7.7,2.8,6.7,2.0,Iris-virginica
6.3,2.7,4.9,1.8,Iris-virginica
6.7,3.3,5.7,2.1,Iris-virginica
7.2,3.2,6.0,1.8,Iris-virginica
6.2,2.8,4.8,1.8,Iris-virginica
6.1,3.0,4.9,1.8,Iris-virginica
6.4,2.8,5.6,2.1,Iris-virginica
7.2,3.0,5.8,1.6,Iris-virginica
7.4,2.8,6.1,1.9,Iris-virginica
7.9,3.8,6.4,2.0,Iris-virginica
6.4,2.8,5.6,2.2,Iris-virginica
6.3,2.8,5.1,1.5,Iris-virginica
6.1,2.6,5.6,1.4,Iris-virginica
7.7,3.0,6.1,2.3,Iris-virginica
6.3,3.4,5.6,2.4,Iris-virginica
6.4,3.1,5.5,1.8,Iris-virginica
6.0,3.0,4.8,1.8,Iris-virginica
6.9,3.1,5.4,2.1,Iris-virginica
6.7,3.1,5.6,2.4,Iris-virginica
6.9,3.1,5.1,2.3,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
6.8,3.2,5.9,2.3,Iris-virginica
6.7,3.3,5.7,2.5,Iris-virginica
6.7,3.0,5.2,2.3,Iris-virginica
6.3,2.5,5.0,1.9,Iris-virginica
6.5,3.0,5.2,2.0,Iris-virginica
6.2,3.4,5.4,2.3,Iris-virginica
5.9,3.0,5.1,1.8,Iris-virginica
%
%
%
The following screenshot shows the clustering output that was generated when the simple
k-means algorithm was applied to the given dataset.

Interpretation of the above visualization

From the above visualization, we can understand the distribution of sepal length and petal
length in each cluster. For instance, each cluster is dominated by petal length. By changing
the color dimension to other attributes we can see their distribution within each of the
clusters.
Step 8: We can save the resulting dataset, which includes each instance along with its
assigned cluster. To do so we click the save button in the visualization window and save the
result as iris-k-means. The top portion of this file is shown in the following figure.
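The instance-to-cluster assignment saved in Step 8 can also be produced with the
AddCluster filter, which trains the clusterer and appends a 'cluster' attribute to every
instance. A short sketch under the same assumptions as above:

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.AddCluster;

public class IrisAddCluster {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff");

        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(6);

        // AddCluster builds the clusterer and appends a nominal attribute
        // holding each instance's assigned cluster.
        AddCluster ac = new AddCluster();
        ac.setClusterer(km);
        ac.setIgnoredAttributeIndices("last");   // ignore the class label
        ac.setInputFormat(data);
        Instances withClusters = Filter.useFilter(data, ac);

        // Same content as the file saved from the visualization window.
        System.out.println(withClusters);
    }
}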

Result :

This program has been successfully executed.

Ex. No : 11 Demonstration of clustering rule process on dataset
student.arff using simple k-means

Aim:
This experiment illustrates the use of simple k-means clustering with the Weka explorer.
The sample data set used for this example is based on the student data available in
ARFF format. This document assumes that appropriate preprocessing has been
performed. The student dataset includes 14 instances.

Steps:
Step 1: Run the Weka explorer and load the data file student.arff in the preprocessing interface.
Step 2: In order to perform clustering, select the 'cluster' tab in the explorer and click on the
choose button. This step results in a dropdown list of available clustering algorithms.
Step 3: In this case we select 'simple k-means'.
Step 4: Next click on the text field to the right of the choose button to get the popup window
shown in the screenshots. In this window we enter six as the number of clusters and we
leave the value of the seed as it is. The seed value is used in generating a random
number, which in turn is used for making the initial assignment of instances to clusters.
Step 5: Once the options have been specified, we run the clustering algorithm. Here we
must make sure that we are in the 'cluster mode' panel, that the 'use training set'
option is selected, and then we click the 'start' button. This process and the resulting
window are shown in the following screenshots.
Step 6: The result window shows the centroid of each cluster as well as statistics on
the number and the percentage of instances assigned to the different clusters. Here the
cluster centroids are mean vectors for each cluster. These centroids can be used to
characterize the clusters.
Step 7: Another way of understanding the characteristics of each cluster is through
visualization. We can do this by right-clicking the result set in the result list panel
and selecting the visualize cluster assignments.

Interpretation of the above visualization

From the above visualization, we can understand the distribution of age and instance
number in each cluster. For instance, each cluster is dominated by age. By changing
the color dimension to other attributes we can see their distribution within each of the
clusters.

Step 8: We can save the resulting dataset, which includes each instance along with its
assigned cluster. To do so we click the save button in the visualization window and
save the result as student-k-means. The top portion of this file is shown in the following
figure.
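Since every attribute in student.arff is nominal, SimpleKMeans reports each cluster
centroid as the most frequent value (mode) of each attribute within that cluster rather than
a numeric mean. A minimal sketch via the WEKA Java API, assuming the dataset below is
saved as student.arff:

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class StudentKMeans {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("student.arff");

        // All attributes are nominal; the printed centroids therefore show
        // the mode of each attribute in each cluster.
        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(6);   // six clusters, as entered in Step 4
        km.buildClusterer(data);
        System.out.println(km);
    }
}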

Dataset student.arff:

@relation student
@attribute age {<30,30-40,>40}
@attribute income {low,medium,high}
@attribute student {yes,no}
@attribute credit-rating {fair,excellent}
@attribute buyspc {yes,no}
@data
%
<30, high, no, fair, no
<30, high, no, excellent, no
30-40, high, no, fair, yes
>40, medium, no, fair, yes
>40, low, yes, fair, yes
>40, low, yes, excellent, no
30-40, low, yes, excellent, yes
<30, medium, no, fair, no
<30, low, yes, fair, no
>40, medium, yes, fair, yes
<30, medium, yes, excellent, yes
30-40, medium, no, excellent, yes
30-40, high, yes, fair, yes
>40, medium, no, excellent, no
%

The following screenshot shows the clustering output that was generated when the simple
k-means algorithm was applied to the given dataset.

Result :

This program has been successfully executed.


Ex. No : 12 Demonstrate performing Linear Regression on data sets
Aim:
This experiment illustrates the use of the Linear Regression classifier in Weka. The
sample data set used in this experiment is the "cpu" data available in ARFF format. This
document assumes that appropriate data preprocessing has been performed.

Steps:
Step 1: Run the Weka explorer and load the data file cpu.arff in the preprocessing interface.
Step 2: In order to perform regression, select the 'Classify' tab in the explorer and click on the
choose button. This step results in a dropdown list of available classifiers.

Step 3: In this case we select 'LinearRegression'.

Step 4: Click on start with the 'use training set' option selected.

Step 5: We then get the regression model and its evaluation results as shown below.

Interpretation of the above output

From the above output, we can read off the regression model: a linear equation that
predicts the class attribute ERP as a weighted sum of the other attributes. The summary
also reports the correlation coefficient and the error measures obtained on the training
data.
Step 6: We can save the resulting dataset, which includes each instance along with its
prediction. To do so we click the save button in the visualization window and save the
result as cpu-linear-regression. The top portion of this file is shown in the following
figure.

Cross-validation:
Step 7: We select the Linear Regression algorithm and click on start with the
cross-validation option set to 10 folds.

Step 8: We then get the regression model and its results as shown below.

Percentage split:
Step 9: We select the Linear Regression algorithm and click on start with the
percentage split option set to a 66% split.

Step 10: We then get the regression model and its results as shown below.
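The three evaluation modes used above (use training set, 10-fold cross-validation, and a
66% percentage split) can also be run through the WEKA Java API. A minimal sketch,
assuming the dataset below is saved as cpu.arff and that the last attribute, ERP, is the
target (the random seed is illustrative):

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.functions.LinearRegression;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CpuRegression {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("cpu.arff");
        data.setClassIndex(data.numAttributes() - 1);   // predict ERP

        // Fit on the full training set and print the linear equation.
        LinearRegression lr = new LinearRegression();
        lr.buildClassifier(data);
        System.out.println(lr);

        // 10-fold cross-validation.
        Evaluation cv = new Evaluation(data);
        cv.crossValidateModel(new LinearRegression(), data, 10, new Random(1));
        System.out.println(cv.toSummaryString("=== 10-fold cross-validation ===", false));

        // 66% / 34% percentage split.
        data.randomize(new Random(1));
        int trainSize = (int) Math.round(data.numInstances() * 0.66);
        Instances train = new Instances(data, 0, trainSize);
        Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);
        LinearRegression lr2 = new LinearRegression();
        lr2.buildClassifier(train);
        Evaluation split = new Evaluation(train);
        split.evaluateModel(lr2, test);
        System.out.println(split.toSummaryString("=== 66% percentage split ===", false));
    }
}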

Dataset cpu.arff:
@relation 'cpu'
@attribute vendor { adviser, amdahl, apollo, basf, bti, burroughs, c.r.d, cdc, cambex, dec,
dg, formation, four-phase, gould, hp, harris, honeywell, ibm, ipl, magnuson, microdata,
nas, ncr, nixdorf, perkin-elmer, prime, siemens, sperry, sratus, wang}
@attribute MYCT real
@attribute MMIN real
@attribute MMAX real
@attribute CACH real
@attribute CHMIN real
@attribute CHMAX real
@attribute ERP real
@data
adviser,125,256,6000,256,16,128,199
amdahl,29,8000,32000,32,8,32,253
amdahl,29,8000,32000,32,8,32,253
amdahl,29,8000,32000,32,8,32,253
amdahl,29,8000,16000,32,8,16,132
amdahl,26,8000,32000,64,8,32,290
amdahl,23,16000,32000,64,16,32,381
amdahl,23,16000,32000,64,16,32,381
amdahl,23,16000,64000,64,16,32,749
amdahl,23,32000,64000,128,32,64,1238
apollo,400,1000,3000,0,1,2,23
apollo,400,512,3500,4,1,6,24
basf,60,2000,8000,65,1,8,70
basf,50,4000,16000,65,1,8,117
bti,350,64,64,0,1,4,15
bti,200,512,16000,0,4,32,64
burroughs,167,524,2000,8,4,15,23
burroughs,143,512,5000,0,7,32,29
burroughs,143,1000,2000,0,5,16,22
burroughs,110,5000,5000,142,8,64,124
burroughs,143,1500,6300,0,5,32,35
burroughs,143,3100,6200,0,5,20,39
burroughs,143,2300,6200,0,6,64,40
burroughs,110,3100,6200,0,6,64,45
c.r.d,320,128,6000,0,1,12,28
c.r.d,320,512,2000,4,1,3,21
c.r.d,320,256,6000,0,1,6,28
c.r.d,320,256,3000,4,1,3,22
c.r.d,320,512,5000,4,1,5,28
c.r.d,320,256,5000,4,1,6,27
cdc,25,1310,2620,131,12,24,102
cdc,25,1310,2620,131,12,24,102
cdc,50,2620,10480,30,12,24,74
cdc,50,2620,10480,30,12,24,74
cdc,56,5240,20970,30,12,24,138
cdc,64,5240,20970,30,12,24,136
cdc,50,500,2000,8,1,4,23
cdc,50,1000,4000,8,1,5,29
cdc,50,2000,8000,8,1,5,44
cambex,50,1000,4000,8,3,5,30
cambex,50,1000,8000,8,3,5,41
cambex,50,2000,16000,8,3,5,74
cambex,50,2000,16000,8,3,6,74
cambex,50,2000,16000,8,3,6,74
dec,133,1000,12000,9,3,12,54
dec,133,1000,8000,9,3,12,41
dec,810,512,512,8,1,1,18
dec,810,1000,5000,0,1,1,28
dec,320,512,8000,4,1,5,36
dec,200,512,8000,8,1,8,38
dg,700,384,8000,0,1,1,34
dg,700,256,2000,0,1,1,19
dg,140,1000,16000,16,1,3,72
dg,200,1000,8000,0,1,2,36
dg,110,1000,4000,16,1,2,30
dg,110,1000,12000,16,1,2,56
dg,220,1000,8000,16,1,2,42
formation,800,256,8000,0,1,4,34
formation,800,256,8000,0,1,4,34
formation,800,256,8000,0,1,4,34
formation,800,256,8000,0,1,4,34
formation,800,256,8000,0,1,4,34
four-phase,125,512,1000,0,8,20,19
gould,75,2000,8000,64,1,38,75
gould,75,2000,16000,64,1,38,113
gould,75,2000,16000,128,1,38,157
hp,90,256,1000,0,3,10,18
hp,105,256,2000,0,3,10,20
hp,105,1000,4000,0,3,24,28
hp,105,2000,4000,8,3,19,33
hp,75,2000,8000,8,3,24,47
hp,75,3000,8000,8,3,48,54
hp,175,256,2000,0,3,24,20
harris,300,768,3000,0,6,24,23
harris,300,768,3000,6,6,24,25
harris,300,768,12000,6,6,24,52
harris,300,768,4500,0,1,24,27
harris,300,384,12000,6,1,24,50
harris,300,192,768,6,6,24,18
harris,180,768,12000,6,1,31,53
honeywell,330,1000,3000,0,2,4,23
honeywell,300,1000,4000,8,3,64,30
honeywell,300,1000,16000,8,2,112,73
honeywell,330,1000,2000,0,1,2,20
honeywell,330,1000,4000,0,3,6,25
honeywell,140,2000,4000,0,3,6,28
honeywell,140,2000,4000,0,4,8,29
honeywell,140,2000,4000,8,1,20,32
honeywell,140,2000,32000,32,1,20,175
honeywell,140,2000,8000,32,1,54,57
honeywell,140,2000,32000,32,1,54,181
honeywell,140,2000,32000,32,1,54,181
honeywell,140,2000,4000,8,1,20,32
ibm,57,4000,16000,1,6,12,82
ibm,57,4000,24000,64,12,16,171
ibm,26,16000,32000,64,16,24,361
ibm,26,16000,32000,64,8,24,350
ibm,26,8000,32000,0,8,24,220
ibm,26,8000,16000,0,8,16,113
ibm,480,96,512,0,1,1,15
ibm,203,1000,2000,0,1,5,21
ibm,115,512,6000,16,1,6,35
ibm,1100,512,1500,0,1,1,18
ibm,1100,768,2000,0,1,1,20
ibm,600,768,2000,0,1,1,20
ibm,400,2000,4000,0,1,1,28
ibm,400,4000,8000,0,1,1,45
ibm,900,1000,1000,0,1,2,18
ibm,900,512,1000,0,1,2,17
ibm,900,1000,4000,4,1,2,26
ibm,900,1000,4000,8,1,2,28
ibm,900,2000,4000,0,3,6,28
ibm,225,2000,4000,8,3,6,31
ibm,225,2000,4000,8,3,6,31
ibm,180,2000,8000,8,1,6,42
ibm,185,2000,16000,16,1,6,76
ibm,180,2000,16000,16,1,6,76
ibm,225,1000,4000,2,3,6,26
ibm,25,2000,12000,8,1,4,59
ibm,25,2000,12000,16,3,5,65
ibm,17,4000,16000,8,6,12,101
ibm,17,4000,16000,32,6,12,116
ibm,1500,768,1000,0,0,0,18
ibm,1500,768,2000,0,0,0,20
ibm,800,768,2000,0,0,0,20
ipl,50,2000,4000,0,3,6,30
ipl,50,2000,8000,8,3,6,44
ipl,50,2000,8000,8,1,6,44
ipl,50,2000,16000,24,1,6,82
ipl,50,2000,16000,24,1,6,82
ipl,50,8000,16000,48,1,10,128
magnuson,100,1000,8000,0,2,6,37
magnuson,100,1000,8000,24,2,6,46
magnuson,100,1000,8000,24,3,6,46
magnuson,50,2000,16000,12,3,16,80
magnuson,50,2000,16000,24,6,16,88
magnuson,50,2000,16000,24,6,16,88
microdata,150,512,4000,0,8,128,33
nas,115,2000,8000,16,1,3,46
nas,115,2000,4000,2,1,5,29
nas,92,2000,8000,32,1,6,53
nas,92,2000,8000,32,1,6,53
nas,92,2000,8000,4,1,6,41
nas,75,4000,16000,16,1,6,86
nas,60,4000,16000,32,1,6,95
nas,60,2000,16000,64,5,8,107
nas,60,4000,16000,64,5,8,117
nas,50,4000,16000,64,5,10,119
nas,72,4000,16000,64,8,16,120
nas,72,2000,8000,16,6,8,48
nas,40,8000,16000,32,8,16,126
nas,40,8000,32000,64,8,24,266
nas,35,8000,32000,64,8,24,270
nas,38,16000,32000,128,16,32,426
nas,48,4000,24000,32,8,24,151
nas,38,8000,32000,64,8,24,267
nas,30,16000,32000,256,16,24,603
ncr,112,1000,1000,0,1,4,19
ncr,84,1000,2000,0,1,6,21
ncr,56,1000,4000,0,1,6,26
ncr,56,2000,6000,0,1,8,35
ncr,56,2000,8000,0,1,8,41
ncr,56,4000,8000,0,1,8,47
ncr,56,4000,12000,0,1,8,62
ncr,56,4000,16000,0,1,8,78
ncr,38,4000,8000,32,16,32,80
ncr,38,4000,8000,32,16,32,80
ncr,38,8000,16000,64,4,8,142
ncr,38,8000,24000,160,4,8,281
ncr,38,4000,16000,128,16,32,190
nixdorf,200,1000,2000,0,1,2,21
nixdorf,200,1000,4000,0,1,4,25
nixdorf,200,2000,8000,64,1,5,67
perkin-elmer,250,512,4000,0,1,7,24
perkin-elmer,250,512,4000,0,4,7,24
perkin-elmer,250,1000,16000,1,1,8,64
prime,160,512,4000,2,1,5,25
prime,160,512,2000,2,3,8,20
prime,160,1000,4000,8,1,14,29
prime,160,1000,8000,16,1,14,43
prime,160,2000,8000,32,1,13,53
siemens,240,512,1000,8,1,3,19
siemens,240,512,2000,8,1,5,22
siemens,105,2000,4000,8,3,8,31
siemens,105,2000,6000,16,6,16,41
siemens,105,2000,8000,16,4,14,47
siemens,52,4000,16000,32,4,12,99
siemens,70,4000,12000,8,6,8,67
siemens,59,4000,12000,32,6,12,81
siemens,59,8000,16000,64,12,24,149
siemens,26,8000,24000,32,8,16,183
siemens,26,8000,32000,64,12,16,275
siemens,26,8000,32000,128,24,32,382
sperry,116,2000,8000,32,5,28,56
sperry,50,2000,32000,24,6,26,182
sperry,50,2000,32000,48,26,52,227
sperry,50,2000,32000,112,52,104,341
sperry,50,4000,32000,112,52,104,360
sperry,30,8000,64000,96,12,176,919
sperry,30,8000,64000,128,12,176,978
sperry,180,262,4000,0,1,3,24
sperry,180,512,4000,0,1,3,24
sperry,180,262,4000,0,1,3,24
sperry,180,512,4000,0,1,3,24
sperry,124,1000,8000,0,1,8,37
sperry,98,1000,8000,32,2,8,50
sratus,125,2000,8000,0,2,14,41
wang,480,512,8000,32,0,0,47
wang,480,1000,4000,0,0,0,25

The following screenshot shows the regression model that was generated when the Linear
Regression algorithm was applied to the given dataset.

cross-validation:

percentage split:

Result :

This program has been successfully executed.


Ex. No : 13 Sample Programs using German Credit Data

Credit Risk Assessment

Description: The business of banks is making loans. Assessing the credit worthiness of
an applicant is of crucial importance. You have to develop a system to help a loan officer
decide whether the credit of a customer is good or bad. A bank's business rules
regarding loans must consider two opposing factors. On the one hand, a bank wants to
make as many loans as possible.

Interest on these loans is the bank's profit source. On the other hand, a bank cannot
afford to make too many bad loans. Too many bad loans could lead to the collapse of
the bank. The bank's loan policy must involve a compromise: not too strict and not
too lenient.

To do the assignment, you first and foremost need some knowledge about the world of
credit. You can acquire such knowledge in a number of ways.

1. Knowledge engineering: Find a loan officer who is willing to talk. Interview her and try
to represent her knowledge in a number of ways.

2. Books: Find some training manuals for loan officers or perhaps a suitable textbook on
finance. Translate this knowledge from text form to production rule form.

3. Common sense: Imagine yourself as a loan officer and make up reasonable rules which
can be used to judge the credit worthiness of a loan applicant.

4. Case histories: Find records of actual cases where competent loan officers correctly
judged when, and when not to, approve a loan application.

The German Credit Data

Actual historical credit data is not always easy to come by because of confidentiality
rules. Here is one such data set, consisting of 1000 actual cases collected in Germany.
In spite of the fact that the data is German, you should probably make use of it for this
assignment (unless you really can consult a real loan officer!).

There are 20 attributes used in judging a loan applicant (i.e., 7 numerical attributes and
13 categorical or nominal attributes). The goal is to classify the applicant into one of two
categories: good or bad.

The attributes present in the German credit data are:

1. Checking_Status
2. Duration
3. Credit_history
4. Purpose
5. Credit_amount
6. Savings_status
7. Employment
8. Installment_Commitment
9. Personal_status
10. Other_parties
11. Residence_since
12. Property_Magnitude
13. Age
14. Other_payment_plans
15. Housing
16. Existing_credits
17. Job
18. Num_dependents
19. Own_telephone
20. Foreign_worker
21. Class

Tasks:

1. List all the categorical (or nominal) attributes and the real valued attributes
separately.

Ans) Steps for identifying categorical attributes

1. Double click on the credit-g.arff file.


2. Select all categorical attributes.
3. Click on invert.
4. Then we get all real valued attributes selected
5. Click on remove
6. Click on visualize all.

Steps for identifying real valued attributes

1. Double click on the credit-g.arff file.


2. Select all real valued attributes.

3. Click on invert.
4. Then we get all categorical attributes selected
5. Click on remove
6. Click on visualize all.

The following are the categorical (or nominal) attributes:

1. Checking_Status
2. Credit_history
3. Purpose
4. Savings_status
5. Employment
6. Personal_status
7. Other_parties
8. Property_Magnitude
9. Other_payment_plans
10. Housing
11. Job
12. Own_telephone
13. Foreign_worker

The following are the numerical attributes:

1. Duration
2. Credit_amount
3. Installment_Commitment
4. Residence_since
5. Age
6. Existing_credits
7. Num_dependents

2. What attributes do you think might be crucial in making the credit assessment? Come
up with some simple rules in plain English using your selected attributes.

Ans) The following attributes may be crucial in making the credit assessment:
1. Credit_amount
2. Age
3. Job
4. Savings_status
5. Existing_credits
6. Installment_commitment
7. Property_magnitude

3. One type of model that you can create is a decision tree. Train a decision tree using
the complete data set as the training data. Report the model obtained after training.

Ans) Steps to model decision tree.

1. Double click on the credit-g.arff file.


2. Consider all the 21 attributes for making decision tree.
3. Click on classify tab.
4. Click on choose button.
5. Expand the trees folder and select J48.
6. Click on 'use training set' in test options.
7. Click on start button.
8. Right click on result list and choose the visualize tree to get decision tree.

We created a decision tree by using the J48 technique with the complete dataset as the
training data.

The following model obtained after training.


Output:

=== Run information ===

Scheme     : weka.classifiers.trees.J48 -C 0.25 -M 2
Relation   : german_credit
Instances  : 1000
Attributes : 21
             checking_status, duration, credit_history, purpose, credit_amount,
             savings_status, employment, installment_commitment, personal_status,
             other_parties, residence_since, property_magnitude, age,
             other_payment_plans, housing, existing_credits, job, num_dependents,
             own_telephone, foreign_worker, class

Test mode: evaluate on training data

=== Classifier model (full training set) ===

J48 pruned tree
---------------

Number of Leaves  : 103

Size of the tree  : 140

Time taken to build model: 0.08 seconds

=== Evaluation on training set ===

=== Summary ===

Correctly Classified Instances         855               85.5    %
Incorrectly Classified Instances       145               14.5    %
Kappa statistic                          0.6251
Mean absolute error                      0.2312
Root mean squared error                  0.34
Relative absolute error                 55.0377 %
Root relative squared error             74.2015 %
Coverage of cases (0.95 level)         100      %
Mean rel. region size (0.95 level)      93.3    %
Total Number of Instances             1000
=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.956    0.38     0.854      0.956   0.902      0.857     good
               0.62     0.044    0.857      0.62    0.72       0.857     bad
Weighted Avg.  0.855    0.279    0.855      0.855   0.847      0.857

=== Confusion Matrix ===

   a   b   <-- classified as
 669  31 |  a = good
 114 186 |  b = bad
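The same training-set run can be reproduced through the WEKA Java API. A minimal
sketch, assuming the data is available as credit-g.arff with the class as the last attribute:

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CreditJ48 {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff");
        data.setClassIndex(data.numAttributes() - 1);   // class: good/bad

        // Default J48 settings correspond to the scheme above: -C 0.25 -M 2.
        J48 tree = new J48();
        tree.buildClassifier(data);
        System.out.println(tree);

        // "Use training set": evaluate the model on the data it was built from.
        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(tree, data);
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());
    }
}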

4. Suppose you use your above model trained on the complete dataset, and classify
credit good/bad for each of the examples in the dataset. What % of examples can you
classify correctly? (This is also called testing on the training set.) Why do you think you
cannot get 100% training accuracy?

Ans) Steps followed are:

1. Double click on the credit-g.arff file.


2. Click on classify tab.
3. Click on choose button.
4. Expand the trees folder and select J48.
5. Click on 'use training set' in test options.
6. Click on start button.
7. On right side we find confusion matrix
8. Note the correctly classified instances.
Output:
If we use our above model trained on the complete dataset and classify credit as good/bad
for each of the examples in that dataset, we cannot get 100% training accuracy; only 85.5%
of the examples are classified correctly. The pruned tree cannot model every instance
exactly, so some training examples fall into leaves whose majority class differs from their
own label.

5. Is testing on the training set as you did above a good idea? Why or why not?
Ans) No, it is not a good idea: testing on the same 100% training data set gives an
over-optimistic estimate of how the model will perform on unseen data.

6. One approach for solving the problem encountered in the previous question is
using cross-validation? Describe what is cross validation briefly. Train a decision tree
again using cross validation and report your results. Does accuracy increase/decrease?
Why?

Ans) Steps followed are:
1. Double click on the credit-g.arff file.
2. Click on classify tab.
3. Click on choose button.
4. Expand the trees folder and select J48.
5. Click on cross-validation in test options.
6. Select folds as 10.
7. Click on start.
8. Change the folds to 5.
9. Again click on start.
10. Change the folds to 2.
11. Click on start.
12. Right click on the blue bar under the result list and go to visualize tree.

Output:

Cross-Validation Definition: The classifier is evaluated by cross-validation using the
number of folds that are entered in the folds text field.
In the Classify tab, select the cross-validation option with a fold size of 2 and press the
Start button; next change the fold size to 5 and press start, and then change the fold size
to 10 and press start.

i) Fold Size-10
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 705 70.5 %
Incorrectly Classified Instances 295 29.5 %
Kappa statistic 0.2467
Mean absolute error 0.3467
Root mean squared error 0.4796
Relative absolute error 82.5233 %
Root relative squared error 104.6565 %
Coverage of cases (0.95 level) 92.8 %
Mean rel. region size (0.95 level) 91.7 %
Total Number of Instances 1000

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class


0.84 0.61 0.763 0.84 0.799 0.639 good
0.39 0.16 0.511 0.39 0.442 0.639 bad
Weighted Avg. 0.705 0.475 0.687 0.705 0.692 0.639

=== Confusion Matrix ===

a b <-- classified as
588 112 | a = good
183 117 | b = bad

ii) Fold Size-5


=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 733 73.3 %
Incorrectly Classified Instances 267 26.7 %
Kappa statistic 0.3264
Mean absolute error 0.3293
Root mean squared error 0.4579
Relative absolute error 78.3705 %
Root relative squared error 99.914 %
Coverage of cases (0.95 level) 94.7 %
Mean rel. region size (0.95 level) 93 %
Total Number of Instances 1000

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class


0.851 0.543 0.785 0.851 0.817 0.685 good
0.457 0.149 0.568 0.457 0.506 0.685 bad
Weighted Avg. 0.733 0.425 0.72 0.733 0.724 0.685

=== Confusion Matrix ===

a b <-- classified as
596 104 | a = good
163 137 | b = bad

iii) Fold Size-2


=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 721 72.1%
Incorrectly Classified Instances 279 27.9%
Kappa statistic 0.2443
Mean absolute error 0.3407
Root mean squared error 0.4669
Relative absolute error 81.0491 %
Root relative squared error 101.8806 %
Coverage of cases (0.95 level) 92.8 %
Mean rel. region size (0.95 level) 91.3 %
Total Number of Instances 1000
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.891 0.677 0.755 0.891 0.817 0.662 good
0.323 0.109 0.561 0.323 0.41 0.662 bad
Weighted Avg. 0.721 0.506 0.696 0.721 0.695 0.662

=== Confusion Matrix ===


a b <-- classified as
624 76 | a = good
203 97 | b = bad

Note: With this observation, we have seen that accuracy is highest with a fold size of 5
(73.3%), and lower with 10 folds (70.5%) and 2 folds (72.1%).
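The fold-size comparison can be scripted in one loop through the WEKA Java API; a
minimal sketch, again assuming credit-g.arff (the random seed is illustrative):

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CreditCrossValidation {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Repeat the experiment for 10, 5 and 2 folds.
        for (int folds : new int[] {10, 5, 2}) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, folds, new Random(1));
            System.out.printf("%2d folds: %.1f%% correct%n", folds, eval.pctCorrect());
        }
    }
}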

7. Check to see if the data shows a bias against "foreign workers" or "personal-status".

One way to do this is to remove these attributes from the data set and see if the decision
tree created in those cases is significantly different from the full dataset case, which you
have already done. Did removing these attributes have any significant effect? Discuss.

Ans) steps followed are:


1. Double click on the credit-g.arff file.
2. Click on classify tab.
3. Click on choose button.
4. Expand the trees folder and select J48.
5. Click on cross-validation in test options.
6. Select folds as 10.
7. Click on start.
8. Click on visualization.
9. Now click on the preprocess tab.
10. Select the 9th and 20th attributes.
11. Click on remove button.
12. Go to classify tab.
13. Choose J48 tree.
14. Select cross-validation with 10 folds.
15. Click on start button.
16. Right click on the blue bar under the result list and go to visualize tree.

Output:

We use the Preprocess tab in the Weka GUI Explorer to remove the attributes
"foreign_worker" and "personal_status" one by one. In the Classify tab, select the
'use training set' option and press the Start button. When these attributes are removed
from the dataset, we can see the change in accuracy compared to the full data set.

i) If Foreign_worker is removed

=== Evaluation on training set ===


=== Summary ===
Correctly Classified Instances 859 85.9%
Incorrectly Classified Instances 141 14.1%
Kappa statistic 0.6377
Mean absolute error 0.2233
Root mean squared error 0.3341
Relative absolute error 53.1347%
Root relative squared error 72.9074%
Coverage of cases (0.95 level) 100 %
Mean rel. region size (0.95 level) 91.9%
Total Number of Instances 1000

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class


0.954 0.363 0.86 0.954 0.905 0.867 good
0.637 0.046 0.857 0.637 0.73 0.867 bad
Weighted Avg 0.859 0.268 0.859 0.859 0.852 0.867

=== Confusion Matrix ===

a b <-- classified as
668 32 | a = good
109 191 | b = bad

ii) If Personal_status is removed

=== Evaluation on training set ===


=== Summary ===

Correctly Classified Instances 866 86.6%


Incorrectly Classified Instances 134 13.4%
Kappa statistic 0.6582
Mean absolute error 0.2162
Root mean squared error 0.3288
Relative absolute error 51.4483%
Root relative squared error 71.7411%
Coverage of cases (0.95 level) 100%
Mean rel. region size (0.95 level) 91.7 %
Total Number of Instances 1000

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class


0.954 0.34 0.868 0.954 0.909 0.868 good
0.66 0.046 0.861 0.66 0.747 0.868 bad
Weighted Avg. 0.866 0.252 0.866 0.866 0.86 0.868

=== Confusion Matrix ===

a b <-- classified as
668 32 | a = good
102 198 | b = bad
Note: With this observation we have seen that when the "Foreign_worker" or
"Personal_status" attribute is removed from the dataset, the training-set accuracy changes
only slightly (85.9% and 86.6% versus 85.5% for the full dataset), so removing these
attributes does not have a significant effect.
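The attribute removal can also be done with the Remove filter in the WEKA Java API. A
minimal sketch, assuming credit-g.arff with personal_status and foreign_worker at
1-based positions 9 and 20:

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class CreditDropAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff");

        // Drop personal_status (9) and foreign_worker (20); indices are 1-based.
        Remove rm = new Remove();
        rm.setAttributeIndices("9,20");
        rm.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, rm);
        reduced.setClassIndex(reduced.numAttributes() - 1);

        Evaluation eval = new Evaluation(reduced);
        eval.crossValidateModel(new J48(), reduced, 10, new Random(1));
        System.out.printf("Without attributes 9 and 20: %.1f%% correct%n",
                eval.pctCorrect());
    }
}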

8. Another question might be, do you really need to input so many attributes to get good
results? Maybe only a few would do. For example, you could try just having attributes
2,3,5,7,10,17 and 21. Try out some combinations. (You had removed two attributes in
problem 7. Remember to reload the arff data file to get all the attributes initially before
you start selecting the ones you want.)

Ans) steps followed are:


1. Double click on the credit-g.arff file.
2. Select attributes 2,3,5,7,10,17,21 and tick the check boxes.
3. Click on invert.
4. Click on remove.
5. Click on classify tab.
6. Choose trees and then the J48 algorithm.
7. Select cross-validation folds as 2.
8. Click on start.
OUTPUT:
We use the Preprocess tab in the Weka GUI Explorer to remove the 2nd attribute
(Duration). In the Classify tab, select the 'use training set' option and press the Start
button. If this attribute is removed from the dataset, we can see the change in accuracy
compared to the full data set.

=== Evaluation on training set ===


=== Summary ===

Correctly Classified Instances 841 84.1 %


Incorrectly Classified Instances 159 15.9 %

=== Confusion Matrix ===

a b <-- classified as
647 53 | a = good
106 194 | b = bad

Remember to reload the previously removed attribute by pressing the Undo option in the
Preprocess tab. We then use the Preprocess tab to remove the 3rd attribute
(Credit_history). In the Classify tab, select the 'use training set' option and press the Start
button. If this attribute is removed from the dataset, we can see the change in accuracy
compared to the full data set.

=== Evaluation on training set ===


=== Summary ===
Correctly Classified Instances 839 83.9 %
Incorrectly Classified Instances 161 16.1 %
=== Confusion Matrix ===

a b <-- classified as
645 55 | a = good
106 194 | b = bad
Remember to reload the previously removed attribute by pressing the Undo option in the
Preprocess tab. We then use the Preprocess tab to remove the 5th attribute
(Credit_amount). In the Classify tab, select the 'use training set' option and press the Start
button. If this attribute is removed from the dataset, we can see the change in accuracy
compared to the full data set.

=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances 864 86.4 %
Incorrectly Classified Instances 136 13.6 %
=== Confusion Matrix ===

a b <-- classified as
675 25 | a = good
111 189 | b = bad

Remember to reload the previously removed attribute by pressing the Undo option in the
Preprocess tab. We then use the Preprocess tab to remove the 7th attribute (Employment).
In the Classify tab, select the 'use training set' option and press the Start button. If this
attribute is removed from the dataset, we can see the change in accuracy compared to the
full data set.

=== Evaluation on training set ===


=== Summary ===
Correctly Classified Instances 858 85.8 %
Incorrectly Classified Instances 142 14.2 %
=== Confusion Matrix ===

a b <-- classified as
670 30 | a = good
112 188 | b = bad

Remember to reload the previously removed attribute by pressing the Undo option in the
Preprocess tab. We then use the Preprocess tab to remove the 10th attribute
(Other_parties). In the Classify tab, select the 'use training set' option and press the Start
button. If this attribute is removed from the dataset, we can see the change in accuracy
compared to the full data set.

Time taken to build model: 0.05 seconds

=== Evaluation on training set ===


=== Summary ===

Correctly Classified Instances 845 84.5 %


Incorrectly Classified Instances 155 15.5 %

=== Confusion Matrix ===


a b <-- classified as
663 37 | a = good
118 182 | b = bad

Remember to reload the previously removed attribute by pressing the Undo option in the
Preprocess tab. We then use the Preprocess tab to remove the 17th attribute (Job). In the
Classify tab, select the 'use training set' option and press the Start button. If this attribute
is removed from the dataset, we can see the change in accuracy compared to the full data
set.

=== Evaluation on training set ===


=== Summary ===

Correctly Classified Instances 859 85.9%


Incorrectly Classified Instances 141 14.1%
=== Confusion Matrix ===

a b <-- classified as

675 25 | a = good
116 184 | b = bad

Remember to reload the previously removed attribute by pressing the Undo option in the
Preprocess tab. We then use the Preprocess tab to remove the 21st attribute (Class). In the
Classify tab, select the 'use training set' option and press the Start button. If this attribute
is removed from the dataset, we can see the change in accuracy compared to the full data
set.

=== Evaluation on training set ===


=== Summary ===
Correctly Classified Instances 963 96.3 %
Incorrectly Classified Instances 37 3.7 %
=== Confusion Matrix ===

a b <-- classified as

963 0 | a = yes
37 0 | b = no
Note: With this observation we have seen that when the 3rd attribute is removed from the
dataset, the accuracy (about 84%) decreases, so this attribute is important for classification.
When the 2nd or the 10th attribute is removed, the accuracy (about 84%) stays roughly the
same, so we can remove either of them. When the 7th or the 17th attribute is removed, the
accuracy (about 85%) also stays roughly the same, so we can remove either of them. If we
remove the 5th attribute the accuracy increases, so this attribute may not be needed for the
classification. (Note, though, that the 21st attribute is the class itself; after removing it the
classifier predicts the last remaining attribute, so that 96.3% figure is not comparable.)
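Keeping only a chosen subset of attributes is the same Remove filter with the selection
inverted; a minimal sketch under the same assumptions:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class CreditKeepSubset {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff");

        // Keep only attributes 2,3,5,7,10,17,21 by inverting the removal.
        Remove rm = new Remove();
        rm.setAttributeIndices("2,3,5,7,10,17,21");
        rm.setInvertSelection(true);
        rm.setInputFormat(data);
        Instances subset = Filter.useFilter(data, rm);
        subset.setClassIndex(subset.numAttributes() - 1);   // class is last

        System.out.println(subset.toSummaryString());
    }
}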

9. Sometimes, the cost of rejecting an applicant who actually has good credit might be
higher than accepting an applicant who has bad credit. Instead of counting the
misclassifications equally in both cases, give a higher cost to the first case (say cost 5) and
a lower cost to the second case, by using a cost matrix in Weka. Train your decision tree
and report the decision tree and cross-validation results. Are they significantly different
from the results obtained in problem 6?

Ans) steps followed are:


1. Double click on the credit-g.arff file.
2. Click on classify tab.
3. Click on choose button.
4. Expand tree folder and select J48
5. Click on start
6. Note down the accuracy values
7. Now click on the credit-g.arff file.
8. Click on attributes 2,3,5,7,10,17,21
9. Click on invert
10. Click on classify tab
11. Choose J48 algorithm
12. Select Cross validation fold as 2
13. Click on start and note down the accuracy values.
14. Again make cross validation folds as 10 and note down the accuracy values.
15. Again make cross validation folds as 20 and note down the accuracy values.

OUTPUT:

In the Weka GUI Explorer, select the Classify tab and the 'use training set' option. Press
the Choose button and select J48 as the decision tree technique. Then press the 'More
options' button to get the classifier evaluation options window; select 'cost-sensitive
evaluation' and press the Set button to get the Cost Matrix Editor. Change the number of
classes to 2 and press the Resize button to get a 2x2 cost matrix. Change the value at
location (0,1) to 5; the modified cost matrix is as follows.

0.0 5.0
1.0 0.0

Then close the cost matrix editor, press OK, and press the Start button.
=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances 855 85.5 %


Incorrectly Classified Instances 145 14.5 %

=== Confusion Matrix ===

a b <-- classified as
669 31 | a = good
114 186 | b = bad

Note: With this observation we have seen that, of the 700 good customers, 669 are
classified as good and 31 are misclassified as bad. Of the 300 bad customers, 186 are
classified as bad and 114 are misclassified as good.
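The cost-sensitive evaluation can be reproduced through the WEKA Java API with a
CostMatrix; a minimal sketch, again assuming credit-g.arff (rows are actual classes,
columns are predicted classes, and the matrix only affects the evaluation statistics, as in
the Explorer's cost-sensitive evaluation option):

import weka.classifiers.CostMatrix;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CreditCostSensitive {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Cost 5 for classifying an actually good applicant as bad,
        // cost 1 for the opposite mistake.
        CostMatrix costs = CostMatrix.parseMatlab("[0.0 5.0; 1.0 0.0]");

        J48 tree = new J48();
        tree.buildClassifier(data);

        Evaluation eval = new Evaluation(data, costs);
        eval.evaluateModel(tree, data);
        System.out.println(eval.toSummaryString());
        System.out.println("Total cost: " + eval.totalCost());
    }
}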

10. Do you think it is a good idea to prefer simple decision trees instead of having long,
complex decision trees? How does the complexity of a decision tree relate to the bias of
the model?
Ans)

Steps followed are:

1) Click on the credit-g.arff file.
2) Select all attributes.
3) Click on classify tab.
4) Click on choose and select the J48 algorithm.
5) Select cross-validation folds with 2.
6) Click on start.
7) Write down the time taken to build the model.

It is a good idea to prefer simple decision trees instead of long, complex decision trees. A
simpler tree has higher bias but lower variance, so it is less likely to overfit the training
data, while a very complex tree has low bias but high variance.
11. You can make your decision trees simpler by pruning the nodes. One approach is to
use reduced error pruning. Explain this idea briefly. Try reduced error pruning for
training your decision trees using cross-validation and report the decision trees you
obtain. Also report your accuracy using the pruned model. Does your accuracy increase?

Ans)

Steps followed are:

1) Click on the credit-g.arff file.
2) Select all attributes.
3) Click on classify tab.
4) Click on choose and select J48.
5) Select cross-validation with 2 folds.
6) Click on start.
7) Note down the results.

We can make our decision tree simpler by pruning the nodes. In the Weka GUI Explorer,
select the Classify tab and the 'use training set' option. Press the Choose button and select
J48 as the decision tree technique. Click on the "J48 -C 0.25 -M 2" text beside the Choose
button to get the Generic Object Editor; set the reducedErrorPruning property to True,
press OK, and then press the Start button.

=== Evaluation on training set ===


=== Summary ===
Correctly Classified Instances 786 78.6 %
Incorrectly Classified Instances 214 21.4 %
=== Confusion Matrix ===
a b <-- classified as
662 38 | a = good
176 124 | b = bad
With the pruned model the training-set accuracy decreases (78.6% versus 85.5%). This is
expected: reduced error pruning trades some training accuracy for a simpler tree that is
less likely to overfit.
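Reduced error pruning holds back part of the training data as a pruning set and removes
any subtree whose removal does not hurt accuracy on that held-back set. In the Java API
this is a single J48 property; a minimal sketch, again assuming credit-g.arff:

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CreditREP {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Switch J48 from its default confidence-based pruning
        // to reduced error pruning.
        J48 tree = new J48();
        tree.setReducedErrorPruning(true);
        tree.buildClassifier(data);

        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(tree, data);
        System.out.printf("Accuracy with reduced error pruning: %.1f%%%n",
                eval.pctCorrect());
    }
}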

12) How can you convert a decision tree into "if-then-else rules"? Make up your own
small decision tree consisting of 2-3 levels and convert it into a set of rules. There also
exist classifiers that output the model in the form of rules; one such classifier in Weka is
rules.PART. Train this model and report the set of rules obtained. Sometimes just one
attribute can be good enough in making the decision, yes, just one! Can you predict which
attribute that might be in this dataset? The OneR classifier uses a single attribute to make
decisions (it chooses the attribute based on minimum error). Report the rule obtained by
training a OneR classifier. Rank the performance of J48, PART and OneR.

Ans)

Steps for analyzing the decision tree:

1) Click on the credit-g.arff file.
2) Select all attributes.
3) Click on classify tab.
4) Click on choose and select the J48 algorithm.
5) Select cross-validation folds with 2.
6) Click on start.
7) Note down the accuracy value.
8) Again go to choose and select PART.
9) Select cross-validation folds with 2.
10) Click on start.
11) Note down the accuracy value.
12) Again go to choose and select OneR.
13) Select cross-validation folds with 2.
14) Click on start.
15) Note down the accuracy value.
Sample decision tree of 2-3 levels:

Converting the decision tree into a set of rules is as follows.

Rule1: If age = youth AND student=yes THEN buys_computer=yes


Rule2: If age = youth AND student=no THEN buys_computer=no
Rule3: If age = middle_aged THEN buys_computer=yes

Rule4: If age = senior AND credit_rating=excellent THEN buys_computer=yes


Rule5: If age = senior AND credit_rating=fair THEN buys_computer=no

In the Weka GUI Explorer, select the Classify tab and the 'use training set' option. There
also exist classifiers that output the model in the form of rules; such classifiers in Weka
are "PART" and "OneR". Go to Choose, select Rules, select PART, and press the Start
button.

=== Evaluation on training set ===


=== Summary ===
Correctly Classified Instances 897 89.7 %
Incorrectly Classified Instances 103 10.3 %

=== Confusion Matrix ===

a b <-- classified as
653 47 | a = good
56 244 | b = bad

Then go to Choose, select Rules, select OneR, and press the Start button.

=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances 742 74.2 %
Incorrectly Classified Instances 258 25.8 %
=== Confusion Matrix ===
a b <-- classified as
642 58 | a = good
200 100 | b = bad

Then go to Choose, select Trees, select J48, and press the Start button.
=== Evaluation on training set ===

=== Summary ===
Correctly Classified Instances 855 85.5 %
Incorrectly Classified Instances 145 14.5 %

=== Confusion Matrix ===

a b <-- classified as
669 31 | a = good
114 186 | b = bad

Note: With this observation we have seen the performance of the classifiers, and the
ranking is as follows:

1. PART
2. J48
3. OneR
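The three classifiers can be ranked in one loop through the WEKA Java API; a minimal
sketch, assuming credit-g.arff and the 2-fold cross-validation used in the steps above:

import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.classifiers.rules.PART;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CreditRankClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] models = { new PART(), new J48(), new OneR() };
        for (Classifier model : models) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(model, data, 2, new Random(1));
            System.out.printf("%-5s %.1f%% correct%n",
                    model.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}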

Result :

This program has been successfully executed.


Ex. No : 14 Hospital Management System

A data warehouse consists of dimension tables and a fact table.

REMEMBER the following

Dimension

The dimension object (dimension) consists of:

• name
• attributes (levels), with a primary key
• hierarchies

One time dimension is a must.

About levels and hierarchies

Dimension objects (dimensions) consist of a set of levels and a set of hierarchies defined
over those levels. The levels represent levels of aggregation. Hierarchies describe
parent-child relationships among a set of levels.

For example, a typical calendar dimension could contain five levels. Two hierarchies can
be defined on these levels:

H1: YearL > QuarterL > MonthL > DayL

H2: YearL > WeekL > DayL

The hierarchies are described from parent to child, so that Year is the parent of Quarter,
Quarter is the parent of Month, and so forth.
About unique key constraints

When you create a definition for a hierarchy, Warehouse Builder creates an identifier key
for each level of the hierarchy and a unique key constraint on the lowest level (base
level).

Design a hospital management system data warehouse (TARGET) consisting of the
dimensions patient, medicine, supplier and time. The measures are NO_UNITS and
UNIT_PRICE.

Assume the relational database (SOURCE) table schemas as follows:

TIME (day, month, year)
PATIENT (patient_name, age, address, etc.)
MEDICINE (medicine_brand_name, drug_name, supplier, no_units, unit_price, etc.)
SUPPLIER (supplier_name, medicine_brand_name, address, etc.)

If each dimension has 6 levels, decide the levels and hierarchies; assume the level names
suitably.

Design the hospital management system data warehouse using all the dimensions, and
give an example 4-D cube with assumed names.

Result :

This program has been successfully executed.


Ex. No : 15 Simple Project on Data Preprocessing

Data Preprocessing

Objective:
Understanding the purpose of unsupervised attribute/instance filters for
preprocessing the input data.

Follow the steps mentioned below to configure and apply a filter.

The preprocess section allows filters to be defined that transform the data in various ways.
The Filter box is used to set up filters that are required. At the left of the Filter box is a
Choose button. By clicking this button it is possible to select one of the filters in Weka.
Once a filter has been selected, its name and options are shown in the field next to the
Choose button. Clicking on this box brings up a GenericObjectEditor dialog box, which lets
you configure a filter. Once you are happy with the settings you have chosen, click OK to
return to the main Explorer window.

Now you can apply it to the data by pressing the Apply button at the right end of the Filter
panel. The Preprocess panel will then show the transformed data. The change can be undone
using the Undo button. Use the Edit button to view your transformed data in the dataset
editor.
Try each of the following unsupervised attribute filters (Choose -> weka -> filters ->
unsupervised -> attribute); a Java sketch of a few of these filters follows the list.
• Use ReplaceMissingValues to replace missing values in the given dataset.
• Use the filter Add to add the attribute Average.
• Use the filter AddExpression and add an attribute which is the average of attributes
M1 and M2. Name this attribute as AVG.
• Understand the purpose of the attribute filter Copy.
• Use the attribute filters Discretize and PKIDiscretize to discretize the M1 and
M2 attributes into five bins. (NOTE: Open the file afresh to apply the second filter,
since there would be no numeric attribute to discretize after you have applied the first
filter.)
• Perform Normalize and Standardize on the dataset and identify the difference
between these operations.
• Use the attribute filter FirstOrder to convert the M1 and M2 attributes into a single
attribute representing the first differences between them.

• Add a nominal attribute Grade and use the filter MakeIndicator to convert the
attribute into a Boolean attribute.
• Try if you can accomplish the task in the previous step using the filter
MergeTwoValues.
• Try the following transformation functions and identify the purpose of each
• NumericTransform
• NominalToBinary
• NumericToBinary
• Remove
• RemoveType
• RemoveUseless
• ReplaceMissingValues
• SwapValues
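
A minimal Java sketch of a few of these attribute filters is given below. The input file
name student_marks.arff is hypothetical, and the sketch assumes M1 and M2 are the first
two (numeric) attributes:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.AddExpression;
import weka.filters.unsupervised.attribute.Discretize;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class AttributeFilterDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical file containing numeric attributes M1 and M2.
        Instances data = DataSource.read("student_marks.arff");

        // Replace missing values by the mean (numeric) or mode (nominal).
        ReplaceMissingValues rmv = new ReplaceMissingValues();
        rmv.setInputFormat(data);
        data = Filter.useFilter(data, rmv);

        // Add AVG = average of the first two attributes (M1 and M2).
        AddExpression avg = new AddExpression();
        avg.setExpression("(a1+a2)/2");
        avg.setName("AVG");
        avg.setInputFormat(data);
        data = Filter.useFilter(data, avg);

        // Discretize M1 and M2 into five equal-width bins.
        Discretize disc = new Discretize();
        disc.setAttributeIndices("1,2");
        disc.setBins(5);
        disc.setInputFormat(data);
        data = Filter.useFilter(data, disc);

        System.out.println(data.toSummaryString());
    }
}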
Try the following unsupervised instance filters (Choose -> weka -> filters ->
unsupervised -> instance); a sketch of two of them follows the list.

• Perform Randomize on the given dataset and try to correlate the resultant sequence
with the given one.

• Use RemoveRange filter to remove the last two instances.

• Use RemovePercentage to remove 10 percent of the dataset.

• Apply the filter RemoveWithValues to a nominal and a numeric attribute
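
A minimal Java sketch of two of these instance filters, under the same hypothetical
student_marks.arff input:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.instance.Randomize;
import weka.filters.unsupervised.instance.RemovePercentage;

public class InstanceFilterDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("student_marks.arff");   // hypothetical input

        // Shuffle the order of the instances.
        Randomize rand = new Randomize();
        rand.setRandomSeed(42);
        rand.setInputFormat(data);
        data = Filter.useFilter(data, rand);

        // Remove 10 percent of the instances.
        RemovePercentage rp = new RemovePercentage();
        rp.setPercentage(10.0);
        rp.setInputFormat(data);
        data = Filter.useFilter(data, rp);

        System.out.println(data.numInstances() + " instances remain");
    }
}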

Result :

This program has been successfully executed.

