List of Practicals
Experiment No. 1
1. To study various data mining tools.
Theory:
Introduction
Data mining has a wide range of applications, from marketing and advertising of goods, services or products, artificial intelligence research and the biological sciences to crime investigation and high-level government intelligence. Because of this widespread use and the complexity involved in building data mining applications, a large number of data mining tools have been developed over the decades, each with its own advantages and disadvantages. Data mining provides many techniques to extract knowledge from databases, and data mining tools predict future trends and behaviour, allowing businesses to make proactive, knowledge-driven decisions. The development and application of data mining algorithms requires the use of powerful software tools, and as the number of available tools continues to grow, choosing the most suitable one becomes increasingly difficult. Some of the most popular open source tools available for data mining are briefly described below.
A. Weka
Weka (Waikato Environment for Knowledge Analysis) is a collection of machine learning algorithms for data mining tasks. These algorithms can either be applied directly to a data set or called from your own Java code. The Weka (pronounced Weh-Kuh) workbench contains a collection of tools for visualization and algorithms for data analysis and predictive modeling, together with graphical user interfaces for easy access to this functionality.
1) Technical Specification:
First released in 1997.
Latest version available is WEKA 3.6.11.
Licensed under the GNU General Public License.
Platform-independent software.
Implemented in Java.
Can be downloaded from www.cs.waikato.ac
2) General Features :
Weka is a Java-based open source data mining tool that provides a collection of data mining and machine learning algorithms, including data pre-processing, classification, clustering, and association rule extraction.
Weka provides three graphical user interfaces: the Explorer for exploratory data analysis, supporting preprocessing, attribute selection, learning and visualization; the Experimenter, which provides an experimental environment for testing and evaluating machine learning algorithms; and the Knowledge Flow, a process-model-inspired interface for the visual design of a KDD process.
A simple command-line interface (Simple CLI) for typing commands directly is also provided by Weka.
3) Specialization:
Weka is best suited for mining association rules.
Strong in machine learning techniques.
Well suited for machine learning tasks.
B. KEEL
KEEL (Knowledge Extraction based on Evolutionary Learning) is an application package of machine learning software tools. KEEL is designed to provide solutions to data mining problems and to assess evolutionary algorithms. It includes a collection of libraries for preprocessing and post-processing techniques for data manipulation, soft-computing methods for knowledge extraction and learning, and scientific and research methods.
1) Technical Overview
First released in 2004.
Latest version available is KEEL 2.0.
Licensed under the GNU General Public License.
Can run on any platform.
Implemented in Java.
Can be downloaded from www.sci2s.ugr.es/keel.
2) Specialization
Keel is a software tool to assess evolutionary algorithms for Data Mining problems.
Machine learning tool.
C. R
R is a free software programming language and environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software and performing data analysis. One of R's strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed.
1) Technical Specification
First released in 1997
Latest version available is R 3.1.0.
Licensed under the GNU General Public License.
Cross-platform.
Written in C, Fortran and R.
Can be downloaded from www.r-project.org.
2) General Features
The R project is a platform for the analysis, graphics and software development activities of data miners and related areas.
R is a well-supported, open source, command-line driven statistics package. There are hundreds of extra packages freely available, which provide all sorts of data mining, machine learning and statistical techniques.
It allows statisticians to do very intricate and complicated analyses without knowing the blood and guts of computing systems.
3) Specialization:
It has a large number of users, in particular in the fields of bioinformatics and social science. It is also a freeware replacement for SPSS.
Suited for statistical computing.
D. KNIME
KNIME (Konstanz Information Miner) is an open source data analytics, reporting and integration platform. It has been used in pharmaceutical research, but is also used in other areas such as CRM customer data analysis, business intelligence and financial data analysis. It is based on the Eclipse platform and, through its modular API, is easily extensible. Custom nodes and types can be implemented in KNIME within hours, extending KNIME to comprehend and provide first-tier support for highly domain-specific data formats.
1) Technical Specification
First released in 2004.
Latest version available is KNIME 2.9.
Licensed under the GNU General Public License.
Compatible with Linux, OS X and Windows.
Written in Java.
Can be downloaded from www.knime.org.
2) General Features
KNIME, pronounced "naim", is a nicely designed data mining tool that runs inside the Eclipse development environment.
It is a modular data exploration platform that enables the user to visually create data flows (often referred to as pipelines), selectively execute some or all analysis steps, and later investigate the results through interactive views on data and models.
The KNIME base version already incorporates over 100 processing nodes for data I/O, preprocessing and cleansing, modeling, analysis and data mining, as well as various interactive views such as scatter plots, parallel coordinates and others.
3) Specification
Integration of the Chemistry Development Kit with additional nodes for the
processing of chemical structures, compounds, etc.
Specialized for Enterprise reporting, Business Intelligence, data mining
Theory:
Introduction:
WEKA is a data mining system developed by the University of Waikato in New Zealand that implements data mining algorithms. WEKA is a state-of-the-art facility for developing machine learning (ML) techniques and applying them to real-world data mining problems. It is a collection of machine learning algorithms for data mining tasks. The algorithms are applied directly to a dataset. WEKA implements algorithms for data preprocessing, classification, regression, clustering and association rules; it also includes visualization tools. New machine learning schemes can also be developed with this package. WEKA is open source software issued under the GNU General Public License.
Launching WEKA Explorer
You can launch WEKA from the C:\Program Files directory, from your desktop by selecting its icon, or from the Windows task bar (Start > Programs > Weka 3-4). When the WEKA GUI Chooser window appears on the screen, you can select one of the four options at the bottom of the window:
1. Simple CLI provides a simple command-line interface and allows direct execution of Weka
commands.
2. Explorer is an environment for exploring data.
3. Experimenter is an environment for performing experiments and conducting statistical tests
between learning schemes.
4. KnowledgeFlow is a Java-Beans-based interface for setting up and running machine learning
experiments.
For the exercises in this tutorial you will use Explorer. Click on Explorer button in the WEKA
GUI Chooser window.
3. Preprocessing Data
At the very top of the window, just below the title bar there is a row of tabs. Only the first tab,
Preprocess, is active at the moment because there is no dataset open. The first three buttons at the
top of the preprocess section enable you to load data into WEKA. Data can be imported from a file
in various formats: ARFF, CSV, C4.5, binary; it can also be read from a URL or from an SQL database (using JDBC) [4]. The easiest and the most common way of getting the data into WEKA is to store it as an Attribute-Relation File Format (ARFF) file. You have already been given the weather.arff file for this exercise; therefore, you can skip section 3.1, which guides you through the file conversion.
3.1. File Conversion
We assume that all your data is stored in a Microsoft Excel spreadsheet weather.xls. WEKA expects the data file to be in Attribute-Relation File Format (ARFF). Before you apply the algorithm to your data, you need to convert your data first into a comma-separated file and then into ARFF format (a file with the .arff extension) [1]. To save your data in comma-separated format, select the Save As menu item from the Excel File pull-down menu. In the ensuing dialog box select CSV (Comma Delimited) from the file type pop-up menu, enter a name for the file, and click the Save button. Ignore all messages that appear by clicking OK. Open this file with Microsoft Word. Your screen will look like the screen below.
The rows of the original spreadsheet are converted into lines of text where the elements are
separated from each other by commas. In this file you need to change the first line, which holds the
attribute names, into the header structure that makes up the beginning of an ARFF file. Add a @relation tag with the dataset's name, an @attribute tag with the attribute information, and a @data tag as shown below.
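A rough sketch of what this looks like for the standard 14-instance weather data is given below; the attribute names, value sets and sample rows are assumed from Weka's usual weather sample file rather than reproduced from this manual.

@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
...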
Choose Save As from the File menu and specify Text Only with Line Breaks as the file type.
Enter a file name and click the Save button. Rename the file so that it has the extension .arff to indicate that it is in ARFF format.
3.2. Opening file from a local file system
Click on Open file button.
It brings up a dialog box allowing you to browse for the data file on the local file system, choose
weather.arff file.
Some databases have the ability to save data in CSV format. In this case, you can select CSV file
from the local file system. If you would like to convert this file into ARFF format, you can click on
Save button. WEKA automatically creates ARFF file from your CSV file.
3.3. Opening file from a web site
The URL of the web site in our example is http://gaia.ecs.csus.edu/~aksenovs/. It means that the file is stored in this directory, just as in the case of your local file system. To open this file, click on the Open URL button; it brings up a dialog box requesting the source URL. Enter the URL of the web site followed by the file name; in this example the URL is http://gaia.ecs.csus.edu/~aksenovs/weather.arff, where weather.arff is the name of the file you are trying to load from the website.
3.4. Reading data from a database
Data can also be read from an SQL database using JDBC. Click on Open DB button,
GenericObjectEditor appears on the screen.
To read data from a database, click on Open button and select the database from a
filesystem.
3.5. Preprocessing window
At the bottom of the window there is the Status box. The Status box displays messages that keep you informed about what is going on. For example, when you first open the Explorer, the message says "Welcome to the Weka Explorer". When you load the weather.arff file, the Status box displays the message "Reading from file...". Once the file is loaded, the message in the Status box changes to "OK". Right-click anywhere in the Status box to bring up a menu with two options:
1. Available Memory displays, in the log and in the Status box, the amount of memory available to WEKA in bytes.
2. Run garbage collector forces the Java garbage collector to search for memory that is no longer used, free it up and make it available for new tasks.
To the right of the Status box there is a Log button that opens up the log. The log records every action in WEKA and keeps a record of what has happened. Each line of text in the log contains the time of entry. For example, if the file you tried to open did not load, the log will have a record of the problem that occurred during opening. To the right of the Log button there is an image of a bird; the bird is the WEKA status icon.
The number next to the X symbol indicates the number of concurrently running processes. When you are loading a file, the bird sits down, which means that there are no processes running; the number beside the X symbol is zero, meaning that the system is idle. Later, in the classification problem, when results are being generated, look at the bird: it gets up and starts moving, indicating that a process has started. The number next to X becomes 1, meaning that there is one process running, in this case the calculation. If the bird is standing and not moving for a long time, it means that something has gone wrong; in this case you should restart the WEKA Explorer.
3. Loading data
Let's load the data and look at what happens in the Preprocess window. The most common and easiest way of loading data into WEKA is from an ARFF file, using the Open file button (section 3.2). Click on the Open file button and choose weather.arff from your local filesystem. Note that the data can be loaded from a CSV file as well, because some databases are able to export data only in CSV format.
Once the data is loaded, WEKA recognizes the attributes, which are shown in the Attributes window. The left panel of the Preprocess window shows the list of recognized attributes:
No. is a number that identifies the order of the attributes as they appear in the data file,
Selection tick boxes allow you to select attributes for the working relation,
Name is the name of an attribute as it was declared in the data file.
The Current relation box above the Attributes box displays the base relation (table) name and the current working relation (which are initially the same) - weather, the number of instances - 14, and the number of attributes - 5.
During the scan of the data, WEKA computes some basic statistics for each attribute. The following statistics are shown in the Selected attribute box on the right panel of the Preprocess window:
Name is the name of the attribute,
Type is most commonly Nominal or Numeric,
Missing is the number (percentage) of instances in the data for which this attribute is unspecified,
Distinct is the number of different values that the data contains for this attribute, and
Unique is the number (percentage) of instances in the data having a value for this attribute that no other instances have.
An attribute can be deleted from the Attributes window: highlight the attribute you would like to delete and hit the Delete button on your keyboard. By clicking on an attribute, you can see the basic statistics for that attribute. The frequency of each attribute value is shown for categorical attributes; min, max, mean and standard deviation (StdDev) are shown for continuous attributes.
Click on the attribute Temperature in the Attributes window.
Temperature is a numeric attribute; therefore, you can see min, max, mean and standard deviation in the Selected attribute window.
Missing = 0 means that the attribute is specified for all instances (no missing values),
Distinct = 12 means that Temperature has twelve different values, and
Unique = 10 means that ten of the Temperature values occur in only one instance each (no other instance has the same value).
Because Temperature is numeric, the statistics describing the distribution of its values in the data - Minimum, Maximum, Mean and Standard Deviation - are displayed.
Minimum = 64 is the lowest temperature and Maximum = 85 is the highest; the mean and standard deviation are shown as well.
Compare the result with the attribute table in weather.xls; the numbers in WEKA match the numbers in the table. You can select a class in the Class pull-down box. The last attribute in the Attributes window is the default class selected in the Class pull-down box.
You can visualize the attributes based on the selected class. One way is to visualize a selected attribute based on the class selected in the Class pull-down window; alternatively, visualize all attributes by clicking on the Visualize All button.
4. Building Classifiers
Classifiers in WEKA are the models for predicting nominal or numeric quantities. The learning schemes available in WEKA include decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, and Bayes nets. Meta-classifiers include bagging, boosting, stacking, error-correcting output codes, and locally weighted learning.
Once you have your data set loaded, all the tabs are available to you. Click on the Classify tab.
Now you can start analyzing the data using the provided algorithms. In this exercise you will analyze the data with the C4.5 algorithm using J48, WEKA's implementation of the decision tree learner. The sample data used in this exercise is the weather data from the file weather.arff.
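The same J48 analysis can also be run from your own Java code. The following is a rough sketch assuming the standard Weka 3.x Java API, weka.jar on the classpath and weather.arff in the working directory; it is meant as an illustration, not as the exact code used in this exercise.

import java.util.Random;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.trees.J48;
import weka.classifiers.Evaluation;

public class J48Demo {
    public static void main(String[] args) throws Exception {
        // Load the ARFF file and mark the last attribute (play) as the class
        DataSource source = new DataSource("weather.arff");
        Instances data = source.getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Build the C4.5-style decision tree (J48) on the full data set
        J48 tree = new J48();
        tree.buildClassifier(data);
        System.out.println(tree);

        // Evaluate the tree with 10-fold cross-validation
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}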
Conclusion: Studied the Weka Tool
4. Create an operational system and a data warehouse for the given case study: Medical Store.
Theory:
Identify all dimension tables (minimum 10) and fact tables (minimum 2) for the Medical Store data warehouse.
Identify primary keys and all other attributes with their data types for all dimension tables (minimum 10 attributes).
Identify foreign keys for the fact tables, with measurable facts from the medical store.
Create all tables.
Insert values into the dimension tables.
Display all inserted values for all tables.
Dimensions Identified:
medicine
customers
supplier
Store
Bills
Facts Identified:
Sales
Stock
Tables with attributes:
SQL> create table medicine
(
m_id number,
m_name varchar2(10),
m_comp varchar2(10),
m_supp varchar2(10),
m_mfg date,
m_exp date,
m_mrp number,
m_cp number,
quantity number
);
Table created.
insert into medicine values(1,'ciplox500','cipla','patchem','12/mar/08','12/apr/09',30,27,7);
insert into medicine values(2,'crocin','duphar','tahela','12/mar/08','4/mar/09',20,18,5);
insert into medicine values(3,'action 500','proctor','cgmarket','5/jan/07','5/jan/09',25,21,4);
insert into medicine values(4,'cervin','ranbaxy','sp shah','12/mar/08','4/mar/09',35,30,8);
insert into medicine values(5,'atorva','cadila','saroj','5/dec/06','5/dec/09',30,27,5);
SQL> select * from medicine;
(Output: the listing shows M_ID, M_NAME, M_COMP, M_SUPP, M_MFG, M_EXP, M_MRP, M_CP and QUANTITY for each medicine - ciplox500, crocin, action 500, cervin, atorva, losar 25, gelusil, lasix, digine, combiflam, soframycin, pan 40, perinorm, domstal and shelcal.)
15 rows selected.
SQL> create table customers
(
cust_id number,
cust_name varchar(10),
cust_add varchar(10),
cust_phone number,
cust_bill_no number
);
Table created.
insert into customers values(102,'krunal','ghatkopar',25000111,1);
insert into customers values(103,'kunal','chembur',25000112,2);
insert into customers values(104,'mansi','powai',25000113,3);
insert into customers values(105,'dinesh','chembur',25000114,4);
insert into customers values(106,'aniket','ulhas',25000115,5);
insert into customers values(107,'sagar','parel',25000116,6);
insert into customers values(108,'mitesh','borivili',25000117,7);
insert into customers values(109,'vinay','chembur',25000118,8);
insert into customers values(110,'sarvesh','ghatkopar',25000119,9);
insert into customers values(111,'tarun','vasai',25000120,10);
insert into customers values(112,'sagar','ghatkopar',25000121,11);
insert into customers values(113,'varun','panvel',25000122,12);
insert into customers values(114,'nikhil','panvel',25000123,13);
insert into customers values(115,'megh','sion',25000124,14);
insert into customers values(116,'dhiral','chembur',25000125,15);
SQL> select * from customers;
   CUST_ID CUST_NAME  CUST_ADD   CUST_PHONE CUST_BILL_NO
---------- ---------- ---------- ---------- ------------
       102 krunal     ghatkopar    25000111            1
       103 kunal      chembur      25000112            2
       104 mansi      powai        25000113            3
       105 dinesh     chembur      25000114            4
       106 aniket     ulhas        25000115            5
       107 sagar      parel        25000116            6
       108 mitesh     borivili     25000117            7
       109 vinay      chembur      25000118            8
       110 sarvesh    ghatkopar    25000119            9
       111 tarun      vasai        25000120           10
       112 sagar      ghatkopar    25000121           11
       113 varun      panvel       25000122           12
       114 nikhil     panvel       25000123           13
       115 megh       sion         25000124           14
       116 dhiral     chembur      25000125           15

15 rows selected.
SQL> create table supplier
(
s_id number,
s_name varchar2(10),
s_add varchar2(10),
s_phone number,
s_date date,
s_amount number
);
Table created.
insert into supplier values(1001,'cipla','sion',24000111,'1/jan/2009',400);
insert into supplier values(1002,'duphar','parel',24000112,'1/feb/2009',500);
insert into supplier values(1003,'proctor','thane',24000113,'21/jan/2009',450);
insert into supplier values(1004,'ranbaxy','cst',24000114,'3/mar/2009',300);
insert into supplier values(1005,'cadila','sion',24000115,'5/apr/2009',350);
insert into supplier values(1006,'unichem','sion',24000116,'1/jan/2009',200);
insert into supplier values(1007,'pfizer','chembur',24000117,'11/jan/2009',100);
insert into supplier values(1008,'avantis','kandivili',24000118,'21/jan/2009',200);
insert into supplier values(1009,'amol','daishar',24000119,'15/dec/2008',100);
insert into supplier values(1010,'alchem','parle',24000120,'18/jan/2009',120);
insert into supplier values(1011,'unique','goregaon',24000121,'31/jan/2009',130);
insert into supplier values(1012,'ipca','vasai',24000122,'19/nov/2008',250);
insert into supplier values(1013,'torrent','mira road',24000123,'14/feb/2009',290);
insert into supplier values(1014,'elder','ghatkopar',24000124,'17/mar/2009',2000);
insert into supplier values(1015,'glaxo','sion',24000125,'1/jan/2009',400);
SQL> select * from supplier;
      S_ID S_NAME     S_ADD         S_PHONE S_DATE      S_AMOUNT
---------- ---------- ---------- ---------- --------- ----------
      1001 cipla      sion         24000111 01-JAN-09        400
      1002 duphar     parel        24000112 01-FEB-09        500
      1003 proctor    thane        24000113 21-JAN-09        450
      1004 ranbaxy    cst          24000114 03-MAR-09        300
      1005 cadila     sion         24000115 05-APR-09        350
      1006 unichem    sion         24000116 01-JAN-09        200
      1007 pfizer     chembur      24000117 11-JAN-09        100
      1008 avantis    kandivili    24000118 21-JAN-09        200
      1009 amol       daishar      24000119 15-DEC-08        100
      1010 alchem     parle        24000120 18-JAN-09        120
      1011 unique     goregaon     24000121 31-JAN-09        130
      1012 ipca       vasai        24000122 19-NOV-08        250
      1013 torrent    mira road    24000123 14-FEB-09        290
      1014 elder      ghatkopar    24000124 17-MAR-09       2000
      1015 glaxo      sion         24000125 01-JAN-09        400

15 rows selected.
SQL> create table bill
(
b_no number,
b_date date,
cust_name varchar2(10),
b_doc varchar(10),
b_amount number,
quantity number
);
Table created.
insert into bill values(001,'11/mar/2009','krunal','batra',400,5);
insert into bill values(002,'11/apr/2009','mansi','batra',20,6);
insert into bill values(003,'11/mar/2009','dinesh','batra',70,7);
insert into bill values(004,'21/mar/2009','aniket','batra',90,8);
insert into bill values(005,'5/apr/2009','mitesh','batra',300,9);
insert into bill values(006,'15/mar/2009','amit','batra',400,10);
insert into bill values(007,'21/mar/2009','sarvesh','batra',160,11);
insert into bill values(008,'29/mar/2009','tarun','batra',100,12);
insert into bill values(009,'26/mar/2009','varun','batra',250,5);
insert into bill values(010,'23/mar/2009','sagar','batra',90,6);
insert into bill values(011,'12/apr/2009','vinay','batra',300,4);
insert into bill values(012,'11/apr/2009','krunal','batra',85,5);
insert into bill values(013,'11/apr/2009','megh','batra',40,2);
insert into bill values(014,'10/apr/2009','megh','batra',400,3);
insert into bill values(015,'10/apr/2009','nikhil','batra',450,5);
SQL> select * from bill;
      B_NO B_DATE    CUST_NAME  B_DOC        B_AMOUNT   QUANTITY
---------- --------- ---------- ---------- ---------- ----------
         1 11-MAR-09 krunal     batra             400          5
         2 11-APR-09 mansi      batra              20          6
         3 11-MAR-09 dinesh     batra              70          7
         4 21-MAR-09 aniket     batra              90          8
         5 05-APR-09 mitesh     batra             300          9
         6 15-MAR-09 amit       batra             400         10
         7 21-MAR-09 sarvesh    batra             160         11
         8 29-MAR-09 tarun      batra             100         12
         9 26-MAR-09 varun      batra             250          5
        10 23-MAR-09 sagar      batra              90          6
        11 12-APR-09 vinay      batra             300          4
        12 11-APR-09 krunal     batra              85          5
        13 11-APR-09 megh       batra              40          2
        14 10-APR-09 megh       batra             400          3
        15 10-APR-09 nikhil     batra             450          5

15 rows selected.
SQL> create table sales
(
m_name varchar2(10),
m_comp varchar2(10),
s_name varchar2(10),
profit number,
monthly_sales number
);
Table created.
insert into sales values('ciplox 500','cipla','patchem',50,1000);
insert into sales values('crocin','duphar','tahela',90,500);
insert into sales values('action 500','proctor','cg market',80,700);
insert into sales values('cervin','ranbaxy','sp shah',100,850);
insert into sales values('atorva','cadila','saroj',150,700);
insert into sales values('losar 25','unichem','amritlal',100,850);
insert into sales values('gelusil','pfizer','subdha',25,600);
insert into sales values('lasix','avantis','subdha',80,400);
insert into sales values('digine','amol','meher',39,600);
insert into sales values('combiflam','avantis','arihant',39,500);
insert into sales values('soframycin','avantis','eastern',30,600);
insert into sales values('perinorm','ipca','md pharma',90,500);
insert into sales values('domstal','elder','kishore',30,600);
insert into sales values('shelcal','avantis','arihant',30,900);
insert into sales values('pan 40','alchem','navdeep',50,100);
SQL> select * from sales;
M_NAME     M_COMP     S_NAME         PROFIT MONTHLY_SALES
---------- ---------- ---------- ---------- -------------
ciplox 500 cipla      patchem            50          1000
crocin     duphar     tahela             90           500
action 500 proctor    cg market          80           700
cervin     ranbaxy    sp shah           100           850
atorva     cadila     saroj             150           700
losar 25   unichem    amritlal          100           850
gelusil    pfizer     subdha             25           600
lasix      avantis    subdha             80           400
digine     amol       meher              39           600
combiflam  avantis    arihant            39           500
soframycin avantis    eastern            30           600
perinorm   ipca       md pharma          90           500
domstal    elder      kishore            30           600
shelcal    avantis    arihant            30           900
pan 40     alchem     navdeep            50           100

15 rows selected.
SQL> select profit, sum(monthly_sales)
from sales
group by profit;
    PROFIT SUM(MONTHLY_SALES)
---------- ------------------
        25                600
        30               2100
        39               1100
        50               1100
        80               1100
        90               1000
       100               1700
       150                700

8 rows selected.
ii) Create Snowflake Schema
(Snowflake schema diagram: the fact table STORES (Store_id, M_name, S_id, B_no, Cust_id, B_amount, B_cash, B_cheque) is linked to the dimension tables BILL (B_no, B_date, Cust_name, B_doc, B_amount, Quantity), SUPPLIER (S_id, S_name, S_add, S_phone, S_date, S_amount) and CUSTOMERS (Cust_id, Cust_name, Cust_add, Cust_phone, Cust_bill_no); SUPPLIER is further normalized into the sub-dimension S_PHONE (S_phone, S_mobile, S_office) and CUSTOMERS into the sub-dimension CUST_ADD.)
(Query output: a listing of M_NAME, M_SUPP and M_MRP from the medicine table, including cervin, atorva, losar 25, gelusil, lasix, combiflam, soframycin, perinorm, domstal, shelcal and pan 40.)
14 rows selected.
CUST_NAME    B_AMOUNT
---------- ----------
amit              400
megh               40
megh              400
mansi              20
sagar              90
tarun             100
varun             250
vinay             300
aniket             90
dinesh             70
krunal             85
krunal            400
mitesh            300
nikhil            450
sarvesh           160

15 rows selected.
SQL> select cust_name,sum(quantity)
from bill
group by (cust_name,quantity);
CUST_NAME  SUM(QUANTITY)
---------- -------------
amit                  10
megh                   2
megh                   3
mansi                  6
sagar                  6
tarun                 12
varun                  5
vinay                  4
aniket                 8
dinesh                 7
krunal                10
mitesh                 9
nikhil                 5
sarvesh               11

14 rows selected.
Experiment No. 6
Aim: Perform Online Analytical Processing (OLAP) operations on the operational system created.
Theory: Write details of the following OLAP operations and execute them:
Roll-up
Drill-down
Slicing
Dicing
Pivoting (Rotate)
Experiment No. 7
7. WAP to implement Classification using Decision Tree Induction / ID3
Software: Java & Weka mining tool
Theory:
Classification
predicts categorical class labels
classifies data (constructs a model) based on the training set and the values (class
labels) in a classifying attribute and uses it in classifying new data
Prediction:
models continuous-valued functions, i.e., predicts unknown or missing values
Typical Applications
credit approval
target marketing
medical diagnosis
treatment effectiveness analysis
Supervised learning
Supervision: The training data (observations, measurements, etc.) are accompanied
by labels indicating the class of the observations
New data is classified based on the training set
Issues regarding classification and prediction
Data cleaning
Preprocess data in order to reduce noise and handle missing values
Relevance analysis (feature selection)
Remove the irrelevant or redundant attributes
Data transformation
Generalize and/or normalize data
Algorithm
Basic algorithm (a greedy algorithm)-Decision Tree Induction
Tree is constructed in a top-down recursive divide-and-conquer manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are discretized in advance)
Examples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or statistical measure (e.g.,
information gain)
Conditions for stopping partitioning:
All samples for a given node belong to the same class.
There are no remaining attributes for further partitioning; in this case majority voting is employed for classifying the leaf.
There are no samples left.
Program Code:
import java.io.*;
import java.util.*;
public class ID3 {
int numAttributes;
String []attributeNames;
Vector []domains;
class DataPoint {
public int []attributes;
public DataPoint(int numattributes)
{
attributes = new int[numattributes];
}
};
/* The class to represent a node in the decomposition tree.
*/
class TreeNode {
public double entropy;             // Entropy of the data points if this is a leaf node
public Vector data;                // The set of data points if this is a leaf node
public int decompositionAttribute; // If this is not a leaf node, the attribute that is used to divide the set of data points
public int decompositionValue;     // The attribute-value that is used to divide the parent node
public TreeNode []children;        // If this is not a leaf node, references to the children nodes
public TreeNode parent;            // The parent to this node. The root has parent == null
public TreeNode() {
data = new Vector();
}
};
/* The root of the decomposition tree */
TreeNode root = new TreeNode();
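/*
 * In decomposeNode(TreeNode node): for each candidate attribute i, the weighted
 * average entropy of splitting node.data on that attribute is computed below,
 * and the attribute with the lowest average entropy (i.e. the highest information
 * gain) is remembered in selectedAttribute and used to divide the node.
 */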
double averageentropy = 0;
for (int j=0; j< numvalues; j++) {
Vector subset = getSubset(node.data, i, j);
if (subset.size() == 0) continue;
double subentropy = calculateEntropy(subset);
averageentropy += subentropy * subset.size(); // Weighted sum
}
averageentropy = averageentropy / numdata; // Taking the weighted average
if (selected == false) {
selected = true;
bestEntropy = averageentropy;
selectedAttribute = i;
} else {
if (averageentropy < bestEntropy) {
selected = true;
bestEntropy = averageentropy;
selectedAttribute = i;
}
}
}
if (selected == false) return;
// Now divide the dataset using the selected attribute
int numvalues = domains[selectedAttribute].size();
node.decompositionAttribute = selectedAttribute;
node.children = new TreeNode [numvalues];
for (int j=0; j< numvalues; j++) {
node.children[j] = new TreeNode();
node.children[j].parent = node;
node.children[j].data = getSubset(node.data, selectedAttribute, j);
node.children[j].decompositionValue = j;
}
// Recursively divides children nodes
for (int j=0; j< numvalues; j++) {
decomposeNode(node.children[j]);
}
node.data = null;
}
/** Function to read the data file.*/
public int readData(String filename) throws Exception {
FileInputStream in = null;
try {
File inputFile = new File(filename);
in = new FileInputStream(inputFile);
} catch ( Exception e) {
System.err.println( "Unable to open data file: " + filename + "\n" + e);
return 0;
}
BufferedReader bin = new BufferedReader(new InputStreamReader(in) );
String input;
while(true) {
input = bin.readLine();
if (input == null) {
System.err.println( "No data found in the data file: " +
filename + "\n");
return 0;
}
if (input.startsWith("//")) continue;
if (input.equals("")) continue;
break;
}
StringTokenizer tokenizer = new StringTokenizer(input);
numAttributes = tokenizer.countTokens();
if (numAttributes <= 1) {
System.err.println( "Read line: " + input);
System.err.println( "Could not obtain the names of attributes in the
line");
System.err.println( "Expecting at least one input attribute and one
output attribute");
return 0;
}
domains = new Vector[numAttributes];
for (int i=0; i < numAttributes; i++) domains[i] = new Vector();
attributeNames = new String[numAttributes];
for (int i=0; i < numAttributes; i++) {
attributeNames[i] = tokenizer.nextToken();
}
while(true) {
input = bin.readLine();
if (input == null) break;
if (input.startsWith("//")) continue;
if (input.equals("")) continue;
tokenizer = new StringTokenizer(input);
Output (decision rules printed by the program, fragment):
} else {
if( Shape == "square") {
Slab = "two-way";
} else {
Slab = "one-way";
}
}
0 Seconds
Input data file (attributes Span and Shape, target Slab):
Span    Shape      Slab
//**************************************
long    square     waffle
long    rectangle  waffle
short   square     two-way
short   rectangle  one-way
Aim: Implement classification using the k-Nearest Neighbour (k-NN) algorithm.
Theory:
The k-nearest neighbour algorithm (k-NN) is a method for classifying objects based on the closest training examples; the function is only approximated locally and all computation is deferred until classification. The k-nearest neighbour algorithm is amongst the simplest of all machine learning algorithms: an object is classified by a majority vote of its neighbours, with the object being assigned to the class most common amongst its k nearest neighbours (k is a positive integer, typically small). If k = 1, the object is simply assigned to the class of its nearest neighbour.
Nearest Neighbours classification is quite straightforward: examples are classified based on the class of their nearest neighbours. It is often useful to take more than one neighbour into account, so the technique is more commonly referred to as k-Nearest Neighbours (k-NN) classification, where the k nearest neighbours are used in determining the class.
Example of k-NN classification
The test sample (green circle) should be classified either into the first class of blue squares or into the second class of red triangles. If k = 3, it is assigned to the second class because there are 2 triangles and only 1 square inside the inner circle. If k = 5, it is assigned to the first class (3 squares vs. 2 triangles inside the outer circle).
Implementation
The algorithm to compute the k nearest neighbours is as follows (a Java sketch is given after the steps):
1. Determine the parameter K, the number of nearest neighbours, beforehand. This value is up to you.
2. Calculate the distance between the query instance and all the training samples. You can use any distance measure.
3. Sort the distances for all the training samples and determine the nearest neighbours based on the K-th minimum distance.
4. Since this is supervised learning, get the categories of the training samples whose sorted distances fall under K.
5. Use the majority category of the nearest neighbours as the prediction value.
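As a rough illustration of these steps, the following self-contained Java sketch classifies a query point by Euclidean distance and majority vote. The training points, their class labels and the value of K are made-up examples, not data from this experiment.

import java.util.*;

public class KNN {
    // One labelled training example
    static class Sample {
        double[] x;
        String label;
        Sample(double[] x, String label) { this.x = x; this.label = label; }
    }

    // Euclidean distance between two feature vectors
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    static String classify(List<Sample> train, double[] query, int k) {
        // Steps 2-3: compute all distances and sort the training samples by distance
        train.sort(Comparator.comparingDouble((Sample s) -> distance(s.x, query)));
        // Steps 4-5: count the labels of the k nearest samples and return the majority
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < k && i < train.size(); i++)
            votes.merge(train.get(i).label, 1, Integer::sum);
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        List<Sample> train = new ArrayList<>(Arrays.asList(
            new Sample(new double[]{7, 7}, "bad"),
            new Sample(new double[]{7, 4}, "bad"),
            new Sample(new double[]{3, 4}, "good"),
            new Sample(new double[]{1, 4}, "good")));
        // Step 1: choose K; here the query instance (3, 7) is classified with K = 3
        System.out.println(classify(train, new double[]{3, 7}, 3));
    }
}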
Conclusion:
k-NN is very simple to understand and easy to implement, so it should be considered when seeking a solution to any classification problem. Some advantages of k-NN are as follows (many of these derive from its simplicity and interpretability):
1. Because the process is transparent, it is easy to implement and debug.
2. In situations where an explanation of the output of the classifier is useful, k-NN can be very
effective if an analysis of the neighbours is useful as explanation.
3. There are some noise reduction techniques that work only for k-NN that can be effective in
improving the accuracy of the classifier.
7b. WAP to implement Classification using the Naive Bayesian classifier
Theory:
Introduction
The Bayesian classification represents a supervised learning method as well as a statistical method for classification. It assumes an underlying probabilistic model and allows us to capture uncertainty about the model in a principled way by determining the probabilities of the outcomes. It can solve diagnostic and predictive problems. This classification is named after Thomas Bayes (1702-1761), who proposed the Bayes theorem. Bayesian classification provides practical learning algorithms, and prior knowledge and observed data can be combined. Bayesian classification provides a useful perspective for understanding and evaluating many learning algorithms. It calculates explicit probabilities for hypotheses and is robust to noise in the input data.
The Bayes Theorem:
P(h/D) = [ P(D/h) * P(h) ] / P(D)
where
P(h) : Prior probability of hypothesis h
P(D) : Prior probability of training data D
P(h/D) : Probability of h given D
P(D/h) : Probability of D given h
Que: To calculate the probability that the customer will buy a computer, given the following assumptions:
Age = 35
Income = $50,000
Let
D : a 35-year-old customer with an income of $50,000 p.a.
h : the hypothesis that our customer will buy our computer
Theorem:
If we assume P(hi) = P(hj), i.e. all hypotheses are equally probable a priori, the maximum a posteriori hypothesis reduces to the maximum likelihood hypothesis:
hML = arg max P(D/hi) (where hi belongs to H)
h1 : Customer buys a computer = Yes
h2 : Customer buys a computer = No
where h1 and h2 are subsets of our hypothesis space H
P(h/D) (Final Outcome) = arg max{ P(D/h1) P(h1) , P(D/h2) P(h2) }
P(D) can be ignored as it is the same for both terms.
Algorithm applied:
P (buys computer = yes) = 5/10 = 0.5
P (buys computer = no) = 5/10 = 0.5
P (customer is 35 yrs & earns $50,000) = 4/10 = 0.4
P (customer is 35 yrs & earns $50,000 / buys computer = yes) = 3/5 =0.6
P (customer is 35 yrs & earns $50,000 / buys computer = no) =1/5 = 0.2
Customer buys a computer: P(h1/D) = P(h1) * P(D/h1) / P(D) = 0.5 * 0.6 / 0.4 = 0.75
Customer does not buy a computer: P(h2/D) = P(h2) * P(D/h2) / P(D) = 0.5 * 0.2 / 0.4 = 0.25
Final Outcome = max { P(h1/D), P(h2/D) } = max(0.75, 0.25) = 0.75
=> Customer buys a computer
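A minimal Java sketch of this calculation is given below. It simply plugs in the probabilities assumed above; it is a worked example of the Bayes theorem, not a general Naive Bayes implementation.

public class BayesDemo {
    public static void main(String[] args) {
        // Prior probabilities of the two hypotheses
        double pH1 = 0.5;        // P(buys computer = yes)
        double pH2 = 0.5;        // P(buys computer = no)
        // Likelihood of the evidence D (35 yrs, $50,000) under each hypothesis
        double pDgivenH1 = 0.6;  // P(D / buys = yes)
        double pDgivenH2 = 0.2;  // P(D / buys = no)
        double pD = 0.4;         // P(D)

        // Bayes theorem: P(h/D) = P(D/h) * P(h) / P(D)
        double posterior1 = pDgivenH1 * pH1 / pD;
        double posterior2 = pDgivenH2 * pH2 / pD;

        System.out.println("P(h1/D) = " + posterior1);  // 0.75
        System.out.println("P(h2/D) = " + posterior2);  // 0.25
        System.out.println(posterior1 >= posterior2
                ? "=> Customer buys a computer"
                : "=> Customer does not buy a computer");
    }
}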
Output
Conclusion: A Naive Bayes classifier was proposed which is very efficient in classifying the data present in the database.
Experiment No. 8
8. Implement the DBSCAN method for clustering
Theory:
Introduction
Density-based clustering locates regions of high density that are separated from one another by regions of low density.
Density = number of points within a specified radius (Eps).
DBSCAN is a density-based algorithm. A core point is a point that has more than a specified number of points (MinPts) within Eps; these are points in the interior of a cluster.
1) A border point has fewer than MinPts within Eps, but is in the neighbourhood of a core point.
2) A noise point is any point that is neither a core point nor a border point.
Any two core points that are close enough (within a distance Eps of one another) are put in the same cluster. Any border point that is close enough to a core point is put in the same cluster as the core point; noise points are discarded. A sketch of the algorithm is given below.
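The following compact, self-contained Java sketch implements DBSCAN over 2-D points along the lines described above. The point set, Eps and MinPts values are illustrative assumptions only.

import java.util.*;

public class DBSCAN {
    static final int NOISE = -1, UNVISITED = 0;  // cluster ids start at 1

    static double[][] pts;   // the data set
    static int[] label;      // cluster id per point (0 = unvisited, -1 = noise)
    static double eps;
    static int minPts;

    static double dist(double[] a, double[] b) {
        return Math.hypot(a[0] - b[0], a[1] - b[1]);
    }

    // Indices of all points within eps of point p (including p itself)
    static List<Integer> regionQuery(int p) {
        List<Integer> n = new ArrayList<>();
        for (int i = 0; i < pts.length; i++)
            if (dist(pts[p], pts[i]) <= eps) n.add(i);
        return n;
    }

    public static void main(String[] args) {
        pts = new double[][]{{1,1},{1.2,0.8},{0.8,1},{1,1.2},{8,8},{8.2,8.1},{7.9,8},{8,7.8},{4.5,4.5}};
        eps = 0.5; minPts = 3;
        label = new int[pts.length];

        int cluster = 0;
        for (int p = 0; p < pts.length; p++) {
            if (label[p] != UNVISITED) continue;
            List<Integer> neighbours = regionQuery(p);
            if (neighbours.size() < minPts) { label[p] = NOISE; continue; }  // not a core point
            cluster++;                       // start a new cluster at this core point
            label[p] = cluster;
            Deque<Integer> seeds = new ArrayDeque<>(neighbours);
            while (!seeds.isEmpty()) {
                int q = seeds.poll();
                if (label[q] == NOISE) label[q] = cluster;   // border point that was marked noise
                if (label[q] != UNVISITED) continue;
                label[q] = cluster;
                List<Integer> qn = regionQuery(q);
                if (qn.size() >= minPts) seeds.addAll(qn);   // q is also a core point: expand
            }
        }
        for (int i = 0; i < pts.length; i++)
            System.out.println(Arrays.toString(pts[i]) + " -> " +
                    (label[i] == NOISE ? "noise" : "cluster " + label[i]));
    }
}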
Conclusion: A density based algorithm called DBSCAN was proposed which is very efficient in
discovering clusters of arbitrary shape and also noise present in the database
Experiment No. 9
9. Draw a scatter plot of your data and find whether linear regression can be used or not.
Theory:
Scatter plot
A scatter plot or scatter graph is a type of mathematical diagram using Cartesian coordinates to display values for two variables for a set of data. The data is displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis. This kind of plot is also called a scatter chart, scattergram, scatter diagram or scatter graph. A scatter plot is used when a variable exists that is under the control of the experimenter. If a parameter exists that is systematically incremented and/or decremented by the other, it is called the control parameter or independent variable and is customarily plotted along the horizontal axis. The measured or dependent variable is customarily plotted along the vertical axis. If no dependent variable exists, either type of variable can be plotted on either axis. A scatter plot illustrates only the degree of correlation (not causation) between two variables.
Regression
Definition: Regression is a data mining (machine learning) technique used to fit an
equation to a dataset. The simplest form of regression, linear regression, uses the formula of a
straight line (y = mx + b) and determines the appropriate values for m and b to predict the value of y
based upon a given value of x. Advanced techniques, such as multiple regression, allow the use of
more than one input variable and allow for the fitting of more complex models, such as a quadratic
equation.
Linear Regression
A linear regression technique can be used if the relationship between the predictors and the target can be approximated with a straight line. Regression with a single predictor is the easiest to visualize. Simple linear regression with a single predictor is shown in Figure 1.
Figure 1: Linear Regression with a Single Predictor
Linear regression with a single predictor can be expressed with the following equation:
y = θ1 + θ2x + e
The regression parameters in simple linear regression are:
the slope of the line (θ2) - the angle between a data point and the regression line, and
the y intercept (θ1) - the point where the line crosses the y axis (x = 0).
Working of linear regression
Minimizing sum-of-squares: The goal of linear regression is to adjust the values of slope and
intercept to find the line that best predicts Y from X. More precisely, the goal of regression is to
minimize the sum of the squares of the vertical distances of the points from the line. Why minimize
the sum of the squares of the distances? Why not simply minimize the sum of the actual distances?
If the random scatter follows a Gaussian distribution, it is far more likely to have two medium-size deviations (say 5 units each) than to have one small deviation (1 unit) and one large one (9 units). A procedure that minimized the sum of the absolute values of the distances would have no preference between a line that was 5 units away from two points and one that was 1 unit away from one point and 9 units from another: the sum of the distances (more precisely, the sum of the absolute values of the distances) is 10 units in each case. A procedure that minimizes the sum of the squares of the distances prefers to be 5 units away from two points (sum-of-squares = 50) rather than 1 unit away from one point and 9 units away from another (sum-of-squares = 82). If the scatter is Gaussian (or nearly so), the line determined by minimizing the sum-of-squares is most likely to be correct.
Slope and intercept
Prism reports the best-fit values of the slope and intercept, along with their standard errors and
confidence intervals. The slope quantifies the steepness of the line. It equals the change in Y for
each unit change in X. It is expressed in the units of the Y-axis divided by the units of the X-axis.
If the slope is positive, Y increases as X increases. If the slope is negative, Y decreases as X
increases. The Y intercept is the Y value of the line when X equals zero. It defines the elevation of
the line.
The standard error values of the slope and intercept can be hard to interpret, but their main purpose is to compute the 95% confidence intervals. If you accept the assumptions of linear regression, there is a 95% chance that the 95% confidence interval of the slope contains the true value of the slope, and that the 95% confidence interval for the intercept contains the true value of the intercept.
r2, a measure of goodness-of-fit of linear regression
The value r2 is a fraction between 0.0 and 1.0, and has no units. An r2 value of 0.0 means that knowing X does not help you predict Y: there is no linear relationship between X and Y, and the best-fit line is a horizontal line going through the mean of all Y values. When r2 equals 1.0, all points lie exactly on a straight line with no scatter; knowing X lets you predict Y perfectly.
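To judge from the scatter plot data whether a linear fit is reasonable, the Pearson correlation coefficient r (and r2, the goodness-of-fit measure described above) can be computed directly. A small Java sketch is shown below; the (x, y) sample values and the 0.8 threshold are made-up illustrations, not values from the experiment's dataset.

public class ScatterCheck {
    public static void main(String[] args) {
        // Illustrative sample data only (not from the experiment's input dataset)
        double[] x = {1, 2, 3, 4, 5, 6, 7, 8};
        double[] y = {2.1, 4.3, 5.9, 8.2, 10.1, 11.8, 14.2, 15.9};
        int n = x.length;

        double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i]; sy += y[i];
            sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i];
        }
        // Pearson correlation coefficient r and its square r2
        double r = (n * sxy - sx * sy) /
                   Math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy));
        double r2 = r * r;

        System.out.println("r  = " + r);
        System.out.println("r2 = " + r2);
        // Rule of thumb: an r2 close to 1 suggests a straight line fits well,
        // so linear regression can be used; an r2 close to 0 suggests it cannot.
        System.out.println(r2 > 0.8 ? "Linear regression looks appropriate"
                                    : "Linear regression may not be appropriate");
    }
}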
Input Dataset:
Output
Conclusion: The scatter plot is very efficient for analyzing whether linear regression can be used on the data or not.
Experiment No. 10
10. Implement a regression analysis and show the prediction on sample data.
Introduction
Regression is a data mining function that predicts a number. Profit, sales, mortgage rates, house
values, square footage, temperature, or distance could all be predicted using regression techniques.
For example, a regression model could be used to predict the value of a house based on location,
number of rooms, lot size, and other factors.
A regression task begins with a data set in which the target values are known.
For example, a regression model that predicts house values could be developed based on observed
data for many houses over a period of time. In addition to the value, the data might track the age of
the house, square footage, number of rooms, taxes, school district, proximity to shopping centers,
and so on. House value would be the target, the other attributes would be the predictors, and the
data for each house would constitute a case. In the model build (training) process, a regression
algorithm estimates the value of the target as a function of the
predictors for each case in the build data. These relationships between predictors and target are
summarized in a model, which can then be applied to a different data set in which the target values
are unknown.
Regression algorithms predict one or more continuous variables, such as profit or loss, based
on other attributes in the dataset. An example of a regression algorithm is the Microsoft Time
Series Algorithm. Types of regression:
Linear Regression
A linear regression technique can be used if the relationship between the predictors and
the target can be approximated with a straight line. Regression with a single predictor is
the easiest to visualize. Simple linear regression with a single predictor is shown in 1.
Figure 1: Linear Regression with a Single Predictor
Linear regression with a single predictor can be expressed with the following equation:
y = θ1 + θ2x + e
The regression parameters in simple linear regression are:
the slope of the line (θ2) - the angle between a data point and the regression line, and
the y intercept (θ1) - the point where the line crosses the y axis (x = 0).
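As a rough sketch of how such a model can be built and used for prediction, the following Java program fits a simple linear regression (one predictor) by ordinary least squares and predicts the target for a new x value. The training pairs and the query value are illustrative assumptions, not the pima_diabetes data.

public class SimpleRegression {
    // Fit y = theta1 + theta2 * x by ordinary least squares and predict for a new x
    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5, 6};               // predictor values (illustrative)
        double[] y = {3.1, 4.9, 7.2, 9.1, 10.8, 13.2}; // target values (illustrative)
        int n = x.length;

        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
        meanX /= n; meanY /= n;

        // Least-squares estimates of the slope (theta2) and intercept (theta1)
        double num = 0, den = 0;
        for (int i = 0; i < n; i++) {
            num += (x[i] - meanX) * (y[i] - meanY);
            den += (x[i] - meanX) * (x[i] - meanX);
        }
        double theta2 = num / den;
        double theta1 = meanY - theta2 * meanX;
        System.out.println("Model: y = " + theta1 + " + " + theta2 + " * x");

        // Prediction for an unseen predictor value
        double newX = 7.5;
        System.out.println("Predicted y for x = " + newX + " is " + (theta1 + theta2 * newX));
    }
}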
Multivariate Linear Regression
The term multivariate linear regression refers to linear regression with two or more predictors (x1, x2, ..., xn). When multiple predictors are used, the regression line cannot be visualized in two-dimensional space. However, the line can be computed simply by expanding the equation for single-predictor linear regression to include the parameters for each of the predictors:
y = θ1 + θ2x1 + θ3x2 + ..... + θnxn-1 + e
Regression Coefficients
In multivariate linear regression, the regression parameters are often referred to as
coefficients. When you build a multivariate linear regression model, the algorithm
computes a coefficient for each of the predictors used by the model. The coefficient is a
measure of the impact of the predictor x on the target y. Numerous statistics are available
for analyzing the regression coefficients to evaluate how well the regression line fits the
data.
Nonlinear Regression
Often the relationship between x and y cannot be approximated with a straight line. In this case, a nonlinear regression technique may be used. Alternatively, the data could be preprocessed to make the relationship linear. Nonlinear regression models define y as a function of x using an equation that is more complicated than the linear regression equation. In Figure 2, x and y have a nonlinear relationship.
Figure 2: Nonlinear Regression with a Single Predictor
Multivariate Nonlinear Regression
The term multivariate nonlinear regression refers to nonlinear regression with two or more predictors (x1, x2, ..., xn). When multiple predictors are used, the nonlinear relationship cannot be visualized in two-dimensional space.
Example:
Input dataset: pima_diabetes
Output (Logistic):
Conclusion:
A regression analysis is very efficient in predicting data present in the database.