
LIST OF PRACTICAL

1. To study various data mining tools.


2. To study the Weka tool.
3. a. Creation of star, snowflake and fact constellation schemas.
b. Creation of a sample database schema for any of the above-mentioned three schema
types, e.g. i) Medical store ii) Shopping mall iii) Tour ticket booking.

4. a. Implementation of classification of customer data using WEKA.


b. Implementation of classification of weather forecasting data using WEKA.
c. Implementation of classification of agricultural data using WEKA.
d. Implementation of classification of medical store data using WEKA.
5. a. Implementation of prediction on customer data using WEKA.
b. Implementation of prediction on weather forecasting data using WEKA.
c. Implementation of prediction on agricultural data using WEKA.
d. Implementation of prediction on medical store data using WEKA.
6. a. Perform OLAP operations.
b. Execution of sample OLAP operations on sample database schemas, e.g. i) Medical store
ii) Shopping mall iii) Tour ticket booking.
7. a. Implement the K-means algorithm for clustering.
b. Implement the Naïve Bayes classifier for sample data.
8. Implement the DBSCAN method for clustering.
9. Draw a scatter plot of your data and find whether linear regression can be used or not.
10. Implement a regression analysis and show the prediction on sample data. Compare predicted
and actual values and calculate the error in your system.

1. To study various data mining tools


2. Aim: To Study Weka Tool
3. Study of a typical architecture of a Data Warehousing system
4. Create operational system, a data warehouse for given case study - Medical Store
5. Creation of Star, Snowflake and Fact Constellation Schemas
6. Perform Online Analytical Processing (OLAP) operations on Operational System - Case Study
7. WAP to implement Classification using Decision Tree Induction / ID3
Experiment No 7a
7a. Implementation of various K-nearest neighbour classifiers
7b. Implement Naïve Bayes Classifier for sample data
8. Implement DBSCAN method for clustering
9. Draw scatter plot of your data and find whether linear regression can be used or not
10. Implement a regression analysis and show the prediction on sample data

Experiment No. 1
1. To study various data mining tools.

Theory:
Introduction
Data mining has a wide range of applications, from marketing and advertising of
goods, services or products, artificial intelligence research and the biological sciences, to crime
investigation and high-level government intelligence. Due to its widespread use and the
complexity involved in building data mining applications, a large number of data mining
tools have been developed over the decades. Every tool has its own advantages and
disadvantages. Data mining provides many techniques to extract knowledge from
databases. Data mining tools predict future trends and behaviours, allowing businesses to make
proactive, knowledge-driven decisions. The development and application of data mining
algorithms requires the use of very powerful software tools. As the number of available tools
continues to grow, the choice of the most suitable tool becomes increasingly difficult. Some of
the most widely used open source tools for data mining are briefly described below.
A. Weka:
Weka (Waikato Environment for Knowledge Analysis) is a collection of
machine learning algorithms for data mining tasks. These algorithms can either be applied
directly to a data set or be called from your own Java code. The Weka (pronounced
Weh-Kuh) workbench contains a collection of tools for visualization and algorithms
for data analysis and predictive modeling, together with graphical user interfaces for
easy access to this functionality.
1) Technical Specification:
First released in 1997.
Latest version available is WEKA 3.6.11.
Licensed under the GNU General Public License.
Platform independent software.
Written in Java.
Can be downloaded from www.cs.waikato.ac
2) General Features:
Weka is a Java-based open source data mining tool which is a collection of many
data mining and machine learning algorithms, including data pre-processing, classification,
clustering, and association rule extraction.
Weka provides three graphical user interfaces: the Explorer for exploratory data
analysis, supporting preprocessing, attribute selection, learning and visualization; the
Experimenter, which provides an experimental environment for testing and evaluating machine
learning algorithms; and the Knowledge Flow, a process-model-inspired interface for visual
design of the KDD process.
A simple command-line interface for typing commands directly is also provided by Weka.
3) Specialization:
Weka is best suited for mining association rules.
It is strong in machine learning techniques and well suited for machine learning tasks.
B. KEEL
KEEL (Knowledge Extraction based on Evolutionary Learning) is an application package of
machine learning software tools. KEEL is designed to provide solutions to data mining
problems and to assess evolutionary algorithms. It has a collection of libraries for
preprocessing and post-processing techniques for data manipulation, soft-computing
methods for knowledge extraction and learning, and scientific and research
methods.
1) Technical Overview
First released in 2004.
Latest version available is KEEL 2.0.
Licensed under the GNU General Public License.
Can run on any platform.
Written in Java.
Can be downloaded from www.sci2s.ugr.es/keel.

2) Specialization
KEEL is a software tool for assessing evolutionary algorithms for data mining problems.
It is primarily a machine learning tool.
C. R
R is a free software programming language and software environment for
statistical computing and graphics. The R language is widely used among statisticians and
data miners for developing statistical software and performing data analysis. One of R's strengths is the
ease with which well-designed publication-quality plots can be produced, including
mathematical symbols and formulae where needed.
1) Technical Specification
First released in 1997.
Latest version available is 3.1.0.
Licensed under the GNU General Public License.
Cross-platform.
Written in C, Fortran and R.
Can be downloaded from www.r-project.org.
2) General Features
The R project is a platform for the analysis, graphics and software development activities
of data miners and related areas.
R is a well-supported, open source, command-line driven statistics package. There are
hundreds of extra packages freely available which provide all sorts of data mining, machine
learning and statistical techniques.
It allows statisticians to do very intricate and complicated analyses without knowing the
blood and guts of computing systems.
3) Specialization:
It has a large number of users, in particular in the fields of bio-informatics and social
science. It is also a freeware replacement for SPSS.
Suited for statistical computing.
D. KNIME
KNIME (Konstanz Information Miner) is an open source data analytics, reporting and
integration platform. It has been used in pharmaceutical research, but is also used in other
areas such as CRM customer data analysis, business intelligence and financial data analysis. It
is based on the Eclipse platform and, through its modular API, is easily extensible.
Custom nodes and types can be implemented in KNIME within hours, thus extending
KNIME to comprehend and provide first-tier support for highly domain-specific data formats.
1) Technical Specification
First released in 2004.
Latest version available is KNIME 2.9.
Licensed under the GNU General Public License.
Compatible with Linux, OS X and Windows.
Written in Java.
Can be downloaded from www.knime.org.
2) General Features
KNIME, pronounced naim, is a nicely designed data mining tool that runs inside the
Eclipse development environment.
It is a modular data exploration platform that enables the user to visually create data
flows (often referred to as pipelines), selectively execute some or all analysis steps, and later
investigate the results through interactive views on data and models.
The KNIME base version already incorporates over 100 processing nodes for data I/O,
preprocessing and cleansing, modeling, analysis and data mining, as well as various
interactive views such as scatter plots, parallel coordinates and others.
3) Specialization
Integration of the Chemistry Development Kit adds nodes for the
processing of chemical structures, compounds, etc.
Specialized for enterprise reporting, business intelligence and data mining.
Conclusion: Studied the various data mining tools.

2. Aim: To Study Weka Tool

Theory:
Introduction:
WEKA is a data mining system developed by the University of Waikato in New Zealand that
implements data mining algorithms. WEKA is a state-of-the-art facility for
developing machine learning (ML) techniques and their application to real-world data mining
problems. It is a collection of machine learning algorithms for data mining tasks. The algorithms are
applied directly to a dataset. WEKA implements algorithms for data preprocessing, classification,
regression, clustering and association rules; it also includes visualization tools. New machine
learning schemes can also be developed with this package. WEKA is open source software issued
under the GNU General Public License.
Launching WEKA Explorer
You can launch Weka from the C:\Program Files directory, from your desktop by selecting its
icon, or from the Windows task bar via Start -> Programs -> Weka 3-4. When the WEKA GUI
Chooser window appears on the screen, you can select one of the four options at the bottom of the
window:

1. Simple CLI provides a simple command-line interface and allows direct execution of Weka
commands.
2. Explorer is an environment for exploring data.
3. Experimenter is an environment for performing experiments and conducting statistical tests
between learning schemes.
4. KnowledgeFlow is a Java-Beans-based interface for setting up and running machine learning
experiments.
For the exercises in this tutorial you will use the Explorer. Click on the Explorer button in the WEKA
GUI Chooser window.

The WEKA Explorer window appears on the screen.

3. Preprocessing Data
At the very top of the window, just below the title bar, there is a row of tabs. Only the first tab,
Preprocess, is active at the moment because there is no dataset open. The first three buttons at the
top of the preprocess section enable you to load data into WEKA. Data can be imported from a file
in various formats: ARFF, CSV, C4.5 or binary; it can also be read from a URL or from an SQL
database (using JDBC) [4]. The easiest and most common way of getting data into WEKA is
to store it as an Attribute-Relation File Format (ARFF) file. You have already been given the
weather.arff file for this exercise; therefore, you can skip section 3.1, which guides you through the file
conversion.
3.1. File Conversion
We assume that all your data is stored in a Microsoft Excel spreadsheet, weather.xls.

WEKA expects the data file to be in Attribute-Relation File Format (ARFF). Before you apply
an algorithm to your data, you first need to save your data in comma-separated format and then convert
it into ARFF format (a file with the .arff extension) [1]. To save your data in comma-separated format, select
the Save As menu item from the Excel File pull-down menu. In the ensuing dialog box select
CSV (Comma Delimited) from the file type pop-up menu, enter a name for the file, and click the
Save button. Ignore all messages that appear by clicking OK. Open this file with Microsoft
Word. Your screen will look like the screen below.

The rows of the original spreadsheet are converted into lines of text in which the elements are
separated from each other by commas. In this file you need to change the first line, which holds the
attribute names, into the header structure that makes up the beginning of an ARFF file. Add a
@relation tag with the dataset's name, an @attribute tag for each attribute, and a @data
tag as shown below.
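For reference, a minimal sketch of what the converted weather.arff file might look like is shown here (the attribute names and the few sample rows are taken from the standard WEKA weather data set; your own data will of course differ):

@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes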

Choose Save As from the File menu and specify Text Only with Line Breaks as the file type.
Enter a file name and click the Save button. Rename the file so that it has the extension .arff to indicate
that it is in ARFF format.
3.2. Opening file from a local file system
Click on Open file button.

It brings up a dialog box allowing you to browse for the data file on the local file system, choose
weather.arff file.

Some databases have the ability to save data in CSV format. In this case, you can select a CSV file
from the local file system. If you would like to convert this file into ARFF format, you can click on the
Save button; WEKA automatically creates an ARFF file from your CSV file.

3.3. Opening file from a web site


A file can be opened from a website. Suppose, that weather.arff is on the following website:

The URL of the web site in our example is http://gaia.ecs.csus.edu/~aksenovs/. It means that the file
is stored in this directory, just as in the case with your local file system. To open this file, click on
Open URL button, it brings up a dialog box requesting to enter source URL.

Enter the URL of the web site followed by the file name; in this example the URL is
http://gaia.ecs.csus.edu/~aksenovs/weather.arff, where weather.arff is the name of the file you are
trying to load from the website.
3.4. Reading data from a database
Data can also be read from an SQL database using JDBC. Click on the Open DB button and the
GenericObjectEditor appears on the screen.

To read data from a database, click on the Open button and select the database from the
file system.
3.5. Preprocessing window

At the bottom of the window there is a Status box. The Status box displays messages that keep you
informed about what is going on. For example, when you first open the Explorer, the message
says, Welcome to the Weka Explorer. While you are loading the weather.arff file, the Status box
displays the message Reading from file. Once the file is loaded, the message in the Status box
changes to OK. Right-click anywhere in the Status box to bring up a menu with two options:
1. Available Memory, which displays in the log and in the Status box the amount of
memory available to WEKA in bytes.
2. Run garbage collector, which forces the Java garbage collector to search for memory
that is no longer used, free it up and make it available for new tasks. To the right of the
Status box there is a Log button that opens up the log. The log records every action in WEKA
and keeps a record of what has happened. Each line of text in the log contains the time of entry. For
example, if the file you tried to open was not loaded, the log will have a record of the problem that
occurred during opening. To the right of the Log button there is an image of a bird; the bird is the
WEKA status icon.
The number next to the X symbol indicates the number of concurrently running processes. When you
load a file, the bird sits down, which means that there are no processes running. The number of
processes beside the symbol X is zero, which means that the system is idle. Later, in the classification
problem, when a result is being generated, look at the bird: it gets up and starts moving, which indicates
that a process has started. The number next to X becomes 1, which means that there is one process
running, in this case the calculation.

If the bird is standing and not moving for a long time, it means that something has gone wrong. In
this case you should restart the WEKA Explorer.
3. Loading data
Let's load the data and look at what happens in the Preprocess window. The most common and
easiest way of loading data into WEKA is from an ARFF file, using the Open
file button (section 3.2). Click on the Open file button and choose the weather.arff file from your
local file system. Note that the data can also be loaded from a CSV file, because some databases are
only able to export data in CSV format.

Once the data is loaded, WEKA recognizes the attributes, which are shown in the Attribute window. The
left panel of the Preprocess window shows the list of recognized attributes:
No. is a number that identifies the order of the attributes as they appear in the data file,
Selection tick boxes allow you to select the attributes for the working relation,
Name is the name of an attribute as it was declared in the data file.

The Current relation box above Attribute box displays the base relation (table) name and the
current working relation (which are initially the same) - weather, the number of instances 14
and the number of attributes - 5.
During the scan of the data, WEKA computes some basic statistics on each attribute. The following
statistics are shown in Selected attribute box on the right panel of Preprocess window:
Name is the name of an attribute,
Type is most commonly Nominal or Numeric, and
Missing is the number (percentage) of instances in the data for which this attribute is
unspecified,
Distinct is the number of different values that the data contains for this attribute, and
Unique is the number (percentage) of instances in the data having a value for this attribute that no
other instances have.
An attribute can be deleted from the Attributes window. Highlight an attribute you would like to
delete and hit Delete button on your keyboard. By clicking on an attribute, you can see the basic
statistics on that attribute. The frequency for each attribute value is shown for categorical attributes.
Min, max, mean, standard deviation (StdDev) is shown for continuous attributes.
Click on the attribute Outlook in the Attribute window.

Temperature is a numeric attribute; therefore, you can see min, max, mean and standard deviation in the
Selected Attribute window.
Missing = 0 means that the attribute is specified for all instances (no missing values),
Distinct = 12 means that Temperature has twelve different values, and
Unique = 10 means that ten of the Temperature values occur in only one instance each, i.e. no other
instance has the same value.
Because Temperature is a numeric attribute, the statistics describing the distribution of
values in the data - Minimum, Maximum, Mean and Standard Deviation - are shown.
Minimum = 64 is the lowest temperature,
Maximum = 85 is the highest temperature, together with the mean and standard deviation.
Compare the result with the attribute table weather.xls; the numbers in WEKA match the numbers
in the table. You can select a class in the Class pull-down box. The last attribute in the Attributes
window is the default class selected in the Class pull-down box

You can visualize the attributes based on the selected class. One way is to visualize a selected attribute
based on the class selected in the Class pull-down box, or you can visualize all attributes by clicking
on the Visualize All button.

4. Building Classifiers
Classifiers in WEKA are models for predicting nominal or numeric quantities. The learning
schemes available in WEKA include decision trees and lists, instance-based classifiers, support
vector machines, multi-layer perceptrons, logistic regression, and Bayes nets. Meta-classifiers
include bagging, boosting, stacking, error-correcting output codes, and locally weighted learning.
Once you have your data set loaded, all the tabs are available to you. Click on the Classify
tab.

Classify window comes up on the screen.

Now you can start analyzing the data using the provided algorithms. In this exercise you will
analyze the data with the C4.5 algorithm using J48, WEKA's implementation of the decision tree learner.
The sample data used in this exercise is the weather data from the file weather.arff.
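As an illustration, a minimal Java sketch of the equivalent steps performed programmatically (assuming weka.jar is on the classpath and weather.arff is in the working directory; the Explorer performs the same steps through the GUI) might look like this:

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import java.util.Random;

public class J48Demo {
    public static void main(String[] args) throws Exception {
        // Load the ARFF file and mark the last attribute (play) as the class
        Instances data = DataSource.read("weather.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Build the C4.5-style decision tree (WEKA's J48 implementation)
        J48 tree = new J48();
        tree.buildClassifier(data);

        // Evaluate with 10-fold cross-validation and print the tree and a summary
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(tree);
        System.out.println(eval.toSummaryString());
    }
}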
Conclusion: Studied the Weka Tool

3. Study of a typical architecture of a Data Warehousing system.

Theory:
A typical data warehousing system follows a three-tier architecture. The bottom tier is the
warehouse database server, which is populated from the operational source systems through
extraction, cleaning, transformation and loading (ETL) tools and is described by a metadata
repository; departmental data marts may also be derived from it. The middle tier is an OLAP
server (ROLAP or MOLAP) that presents the data as multidimensional cubes. The top tier
contains the front-end tools for querying, reporting, analysis and data mining.


Conclusion: Studied the architecture of a typical datawarehouse application.

4. Create an operational system and a data warehouse for the given case study - Medical Store.

Theory:

Identify all dimension tables (minimum 10) and fact tables (minimum 2) for the Medical Store
data warehouse.
Identify primary keys and all other attributes with their data types for all dimension tables
(minimum 10 attributes).
Identify foreign keys for the fact tables, with measurable facts from the medical store.
Create all tables.
Insert values into the dimension tables.
Display all inserted values for all tables.

Dimensions Identified:
medicine
customers
supplier
Store
Bills
Facts Identified:
Sales
Stock
Tables with attributes:
SQL> create table medicine
(
m_id number,
m_name varchar2(10),
m_comp varchar2(10),
m_supp varchar2(10),
m_mfg date,
m_exp date,
m_mrp number,
m_cp number,
quantity number
);
Table created.
insert into medicine values(1,'ciplox500','cipla','patchem','12/mar/08','12/apr/09',30,27,7);
insert into medicine values(2,'crocin','duphar','tahela','12/mar/08','4/mar/09',20,18,5);
insert into medicine values(3,'action 500','proctor','cgmarket','5/jan/07','5/jan/09',25,21,4);
insert into medicine values(4,'cervin','ranbaxy','sp shah','12/mar/08','4/mar/09',35,30,8);
insert into medicine values(5,'atorva','cadila','saroj','5/dec/06','5/dec/09',30,27,5);

insert into medicine values(6,'losar 25','unichem','amritlal','5/dec/06','5/dec/09',30,25,5);


insert into medicine values(7,'gelusil','pfizer','subdha','25/apr/07','25/apr/07',45,40,9);
insert into medicine values(8,'lasix','avantis','subdha','2/jun/07','25/apr/09',45,41,9);
insert into medicine values(9,'digine','amol','meher','2/jun/07','25/apr/09',5,3,2);
insert into medicine values(10,'combiflam','avantis','arihant','25/mar/07','25/apr/09',10,7,3);
insert into medicine values(11,'soframycin','avantis','eastern','25/mar/07','25/apr/09',25,21,5);
insert into medicine values(12,'pan 40','alchem','navdeep','2/may/07','25/june/09',10,8,3);
insert into medicine values(13,'perinorm','ipca','md pharma','27/nov/07','25/apr/09',30,25,5);
insert into medicine values(14,'domstal','elder','kishore','21/jul/07','25/apr/09',40,35,6);
insert into medicine values(15,'shelcal','avantis','arihant','23/mar/07','25/apr/09',25,20,7);
SQL> select * from medicine;
M_ID M_NAME     M_COMP   M_SUPP     M_MFG     M_EXP     M_MRP M_CP QUANTITY
---- ---------- -------- ---------- --------- --------- ----- ---- --------
   1 ciplox500  cipla    patchem    12-MAR-08 12-APR-09    30   27        7
   2 crocin     duphar   tahela     12-MAR-08 04-MAR-09    20   18        5
   3 action 500 proctor  cgmarket   05-JAN-07 05-JAN-09    25   21        4
   4 cervin     ranbaxy  sp shah    12-MAR-08 04-MAR-09    35   30        8
   5 atorva     cadila   saroj      05-DEC-06 05-DEC-09    30   27        5
   6 losar 25   unichem  amritlal   05-DEC-06 05-DEC-09    30   25        5
   7 gelusil    pfizer   subdha     25-APR-07 25-APR-07    45   40        9
   8 lasix      avantis  subdha     02-JUN-07 25-APR-09    45   41        9
   9 digine     amol     meher      02-JUN-07 25-APR-09     5    3        2
  10 combiflam  avantis  arihant    25-MAR-07 25-APR-09    10    7        3
  11 soframycin avantis  eastern    25-MAR-07 25-APR-09    25   21        5
  12 pan 40     alchem   navdeep    02-MAY-07 25-JUN-09    10    8        3
  13 perinorm   ipca     md pharma  27-NOV-07 25-APR-09    30   25        5
  14 domstal    elder    kishore    21-JUL-07 25-APR-09    40   35        6
  15 shelcal    avantis  arihant    23-MAR-07 25-APR-09    25   20        7

15 rows selected.
SQL> create table customers
(
cust_id number,
cust_name varchar(10),
cust_add varchar(10),
cust_phone number,
cust_bill_no number
);
Table created.
insert into customers values(102,'krunal','ghatkopar',25000111,1);
insert into customers values(103,'kunal','chembur',25000112,2);
insert into customers values(104,'mansi','powai',25000113,3);
insert into customers values(105,'dinesh','chembur',25000114,4);
insert into customers values(106,'aniket','ulhas',25000115,5);
insert into customers values(107,'sagar','parel',25000116,6);
insert into customers values(108,'mitesh','borivili',25000117,7);
insert into customers values(109,'vinay','chembur',25000118,8);
insert into customers values(110,'sarvesh','ghatkopar',25000119,9);
insert into customers values(111,'tarun','vasai',25000120,10);
insert into customers values(112,'sagar','ghatkopar',25000121,11);
insert into customers values(113,'varun','panvel',25000122,12);
insert into customers values(114,'nikhil','panvel',25000123,13);
insert into customers values(115,'megh','sion',25000124,14);
insert into customers values(116,'dhiral','chembur',25000125,15);
SQL> select * from customers;
CUST_ID CUST_NAME  CUST_ADD   CUST_PHONE CUST_BILL_NO
------- ---------- ---------- ---------- ------------
    102 krunal     ghatkopar    25000111            1
    103 kunal      chembur      25000112            2
    104 mansi      powai        25000113            3
    105 dinesh     chembur      25000114            4
    106 aniket     ulhas        25000115            5
    107 sagar      parel        25000116            6
    108 mitesh     borivili     25000117            7
    109 vinay      chembur      25000118            8
    110 sarvesh    ghatkopar    25000119            9
    111 tarun      vasai        25000120           10
    112 sagar      ghatkopar    25000121           11
    113 varun      panvel       25000122           12
    114 nikhil     panvel       25000123           13
    115 megh       sion         25000124           14
    116 dhiral     chembur      25000125           15

15 rows selected.
SQL> create table supplier
(
s_id number,
s_name varchar2(10),
s_add varchar2(10),
s_phone number,
s_date date,
s_amount number
);
Table created.
insert into supplier values(1001,'cipla','sion',24000111,'1/jan/2009',400);
insert into supplier values(1002,'duphar','parel',24000112,'1/feb/2009',500);
insert into supplier values(1003,'proctor','thane',24000113,'21/jan/2009',450);
insert into supplier values(1004,'ranbaxy','cst',24000114,'3/mar/2009',300);
insert into supplier values(1005,'cadila','sion',24000115,'5/apr/2009',350);
insert into supplier values(1006,'unichem','sion',24000116,'1/jan/2009',200);
insert into supplier values(1007,'pfizer','chembur',24000117,'11/jan/2009',100);
insert into supplier values(1008,'avantis','kandivili',24000118,'21/jan/2009',200);
insert into supplier values(1009,'amol','daishar',24000119,'15/dec/2008',100);
insert into supplier values(1010,'alchem','parle',24000120,'18/jan/2009',120);
insert into supplier values(1011,'unique','goregaon',24000121,'31/jan/2009',130);
insert into supplier values(1012,'ipca','vasai',24000122,'19/nov/2008',250);
insert into supplier values(1013,'torrent','mira road',24000123,'14/feb/2009',290);
insert into supplier values(1014,'elder','ghatkopar',24000124,'17/mar/2009',2000);
insert into supplier values(1015,'glaxo','sion',24000125,'1/jan/2009',400);
SQL> select * from supplier;
S_ID S_NAME  S_ADD      S_PHONE  S_DATE    S_AMOUNT
---- ------- ---------- -------- --------- --------
1001 cipla   sion       24000111 01-JAN-09      400
1002 duphar  parel      24000112 01-FEB-09      500
1003 proctor thane      24000113 21-JAN-09      450
1004 ranbaxy cst        24000114 03-MAR-09      300
1005 cadila  sion       24000115 05-APR-09      350
1006 unichem sion       24000116 01-JAN-09      200
1007 pfizer  chembur    24000117 11-JAN-09      100
1008 avantis kandivili  24000118 21-JAN-09      200
1009 amol    daishar    24000119 15-DEC-08      100
1010 alchem  parle      24000120 18-JAN-09      120
1011 unique  goregaon   24000121 31-JAN-09      130
1012 ipca    vasai      24000122 19-NOV-08      250
1013 torrent mira road  24000123 14-FEB-09      290
1014 elder   ghatkopar  24000124 17-MAR-09     2000
1015 glaxo   sion       24000125 01-JAN-09      400

15 rows selected.
SQL> create table bill
(
b_no number,
b_date date,
cust_name varchar2(10),
b_doc varchar(10),
b_amount number,
quantity number
);
Table created.
insert into bill values(001,'11/mar/2009','krunal','batra',400,5);
insert into bill values(002,'11/apr/2009','mansi','batra',20,6);
insert into bill values(003,'11/mar/2009','dinesh','batra',70,7);
insert into bill values(004,'21/mar/2009','aniket','batra',90,8);
insert into bill values(005,'5/apr/2009','mitesh','batra',300,9);
insert into bill values(006,'15/mar/2009','amit','batra',400,10);
insert into bill values(007,'21/mar/2009','sarvesh','batra',160,11);
insert into bill values(008,'29/mar/2009','tarun','batra',100,12);
insert into bill values(009,'26/mar/2009','varun','batra',250,5);
insert into bill values(010,'23/mar/2009','sagar','batra',90,6);
insert into bill values(011,'12/apr/2009','vinay','batra',300,4);
insert into bill values(012,'11/apr/2009','krunal','batra',85,5);
insert into bill values(013,'11/apr/2009','megh','batra',40,2);
insert into bill values(014,'10/apr/2009','megh','batra',400,3);
insert into bill values(015,'10/apr/2009','nikhil','batra',450,5);
SQL> select * from bill;
B_NO B_DATE    CUST_NAME  B_DOC B_AMOUNT QUANTITY
---- --------- ---------- ----- -------- --------
   1 11-MAR-09 krunal     batra      400        5
   2 11-APR-09 mansi      batra       20        6
   3 11-MAR-09 dinesh     batra       70        7
   4 21-MAR-09 aniket     batra       90        8
   5 05-APR-09 mitesh     batra      300        9
   6 15-MAR-09 amit       batra      400       10
   7 21-MAR-09 sarvesh    batra      160       11
   8 29-MAR-09 tarun      batra      100       12
   9 26-MAR-09 varun      batra      250        5
  10 23-MAR-09 sagar      batra       90        6
  11 12-APR-09 vinay      batra      300        4
  12 11-APR-09 krunal     batra       85        5
  13 11-APR-09 megh       batra       40        2
  14 10-APR-09 megh       batra      400        3
  15 10-APR-09 nikhil     batra      450        5

15 rows selected.
SQL> create table sales
(
m_name varchar2(10),
m_comp varchar2(10),
s_name varchar2(10),
profit number,
monthly_sales number
);
Table created.
insert into sales values('ciplox 500','cipla','patchem',50,1000);
insert into sales values('crocin','duphar','tahela',90,500);
insert into sales values('action 500','proctor','cg market',80,700);
insert into sales values('cervin','ranbaxy','sp shah',100,850);
insert into sales values('atorva','cadila','saroj',150,700);
insert into sales values('losar 25','unichem','amritlal',100,850);
insert into sales values('gelusil','pfizer','subdha',25,600);
insert into sales values('lasix','avantis','subdha',80,400);
insert into sales values('digine','amol','meher',39,600);
insert into sales values('combiflam','avantis','arihant',39,500);
insert into sales values('soframycin','avantis','eastern',30,600);
insert into sales values('perinorm','ipca','md pharma',90,500);
insert into sales values('domstal','elder','kishore',30,600);
insert into sales values('shelcal','avantis','arihant',30,900);
insert into sales values('pan 40','alchem','navdeep',50,100);
SQL> select * from sales;
M_NAME     M_COMP  S_NAME     PROFIT MONTHLY_SALES
---------- ------- ---------- ------ -------------
ciplox 500 cipla   patchem        50          1000
crocin     duphar  tahela         90           500
action 500 proctor cg market      80           700
cervin     ranbaxy sp shah       100           850
atorva     cadila  saroj         150           700
losar 25   unichem amritlal      100           850
gelusil    pfizer  subdha         25           600
lasix      avantis subdha         80           400
digine     amol    meher          39           600
combiflam  avantis arihant        39           500
soframycin avantis eastern        30           600
perinorm   ipca    md pharma      90           500
domstal    elder   kishore        30           600
shelcal    avantis arihant        30           900
pan 40     alchem  navdeep        50           100

15 rows selected.
SQL> create table store
(
store_id number,
m_name varchar2(10),
s_id number,
b_no number,
cust_id number
);
insert into store values(01,'ciplox 500',1001,1,102);
insert into store values(02,'crocin',1002,2,103);
insert into store values(03,'action 500',1003,3,104);
insert into store values(04,'ceruvin',1004,4,105);
insert into store values(05,'atorva',1005,5,106);
insert into store values(06,'losar 25',1006,6,107);
insert into store values(07,'gelusil',1007,7,108);
insert into store values(08,'lasix',1008,8,109);
insert into store values(09,'digine',1009,9,110);
insert into store values(10,'combiflam',1010,10,111);
insert into store values(11,'soframycin',1011,11,112);
insert into store values(12,'pan 40',1012,12,113);
insert into store values(13,'perinorm',1013,13,114);
insert into store values(14,'domstal',1014,14,115);
insert into store values(15,'shelcal',1015,15,116);
SQL> select * from store;
STORE_ID M_NAME     S_ID B_NO CUST_ID
-------- ---------- ---- ---- -------
       1 ciplox 500 1001    1     102
       2 crocin     1002    2     103
       3 action 500 1003    3     104
       4 ceruvin    1004    4     105
       5 atorva     1005    5     106
       6 losar 25   1006    6     107
       7 gelusil    1007    7     108
       8 lasix      1008    8     109
       9 digine     1009    9     110
      10 combiflam  1010   10     111
      11 soframycin 1011   11     112
      12 pan 40     1012   12     113
      13 perinorm   1013   13     114
      14 domstal    1014   14     115
      15 shelcal    1015   15     116

15 rows selected.

5. Creation of Star, Snowflake and Fact Constellation Schemas


Aim: To create Star, Snowflake and Fact constellation Schemas for selected case study.
Theory:
i) Creation of Star Schema
Identify all dimension tables (minimum 10) and fact tables (minimum 2) for the Medical Store
data warehouse.
Identify primary keys and all other attributes with their data types for all dimension tables
(minimum 10 attributes).
Identify foreign keys for the fact tables, with measurable facts from the medical store.
Create all tables.

Fig-STAR SCHEMA FOR SALES OF MEDICAL STORE

SQL> select count(*) from medicine;

COUNT(*)
--------
      15
SQL> select m_name,m_comp,m_supp,m_mrp,quantity,s_add,s_date,s_amount
from medicine,supplier
where s_add='sion' and m_mrp=30;
M_NAME     M_COMP  M_SUPP     M_MRP QUANTITY S_ADD S_DATE    S_AMOUNT
---------- ------- ---------- ----- -------- ----- --------- --------
ciplox500  cipla   patchem       30        7 sion  01-JAN-09      400
atorva     cadila  saroj         30        5 sion  01-JAN-09      400
losar 25   unichem amritlal      30        5 sion  01-JAN-09      400
perinorm   ipca    md pharma     30        5 sion  01-JAN-09      400
ciplox500  cipla   patchem       30        7 sion  05-APR-09      350
atorva     cadila  saroj         30        5 sion  05-APR-09      350
losar 25   unichem amritlal      30        5 sion  05-APR-09      350
perinorm   ipca    md pharma     30        5 sion  05-APR-09      350
ciplox500  cipla   patchem       30        7 sion  01-JAN-09      200
atorva     cadila  saroj         30        5 sion  01-JAN-09      200
losar 25   unichem amritlal      30        5 sion  01-JAN-09      200
perinorm   ipca    md pharma     30        5 sion  01-JAN-09      200
ciplox500  cipla   patchem       30        7 sion  01-JAN-09      400
atorva     cadila  saroj         30        5 sion  01-JAN-09      400
losar 25   unichem amritlal      30        5 sion  01-JAN-09      400
perinorm   ipca    md pharma     30        5 sion  01-JAN-09      400

16 rows selected.
SQL> select s_name
from supplier
where s_date='01/jan/09' order by s_name;
S_NAME
----------
cipla
glaxo
unichem

3 rows selected.
SQL> select max(profit) from sales;
MAX(PROFIT)
-----------
        150
SQL> select profit,sum(monthly_sales)

from sales
group by profit;
PROFIT SUM(MONTHLY_SALES)
------ ------------------
    25                600
    30               2100
    39               1100
    50               1100
    80               1100
    90               1000
   100               1700
   150                700

8 rows selected.
ii) Creation of Snowflake Schema

Normalize the dimension tables further to reduce redundancy and complexity.


(Figure: the snowflake schema normalizes the supplier's phone numbers and the customer's address
into separate detail tables. Tables shown: MEDICINE (M_name, M_type), SALES (M_name, M_comp,
S_name, Profit, Monthly_sales), MONTHLY SALES, BILL (B_no, B_date, Cust_name, B_doc,
B_amount, Quantity), STORES (Store_id, M_name, S_id, B_no, Cust_id, B_amount, B_cash,
B_cheque), SUPPLIER (S_id, S_name, S_add, S_phone, S_date, S_amount), S_PHONE (S_phone,
S_mobile, S_office), CUSTOMERS (Cust_id, Cust_name, Cust_add, Cust_phone, Cust_bill_no) and
CUST_ADD.)
Fig-SNOWFLAKE SCHEMA FOR SALES OF MEDICAL STORE


SQL> select m_name,m_supp,m_mrp
from medicine
where m_mrp between 10 and 100;
M_NAME     M_SUPP     M_MRP
---------- ---------- -----
ciplox500  patchem       30
crocin     tahela        20
action 500 cgmarket      25
cervin     sp shah       35
atorva     saroj         30
losar 25   amritlal      30
gelusil    subdha        45
lasix      subdha        45
combiflam  arihant       10
soframycin eastern       25
perinorm   md pharma     30
domstal    kishore       40
shelcal    arihant       25
pan 40     navdeep       10

14 rows selected.

SQL> select m_name,m_supp,m_mrp,quantity


from medicine
where quantity>=7 and m_mrp<=90;
M_NAME     M_SUPP   M_MRP QUANTITY
---------- -------- ----- --------
ciplox500  patchem     30        7
cervin     sp shah     35        8
gelusil    subdha      45        9
lasix      subdha      45        9
shelcal    arihant     25        7

5 rows selected.
SQL> select sum(profit)
from sales;
SUM(PROFIT)
-----------
        983

SQL> select cust_name,b_amount


from bill
group by (cust_name,b_amount);
CUST_NAME  B_AMOUNT
---------- --------
amit            400
megh             40
megh            400
mansi            20
sagar            90
tarun           100
varun           250
vinay           300
aniket           90
dinesh           70
krunal           85
krunal          400
mitesh          300
nikhil          450
sarvesh         160

15 rows selected.
SQL> select cust_name,sum(quantity)
from bill
group by (cust_name,quantity);
CUST_NAME  SUM(QUANTITY)
---------- -------------
amit                  10
megh                   2
megh                   3
mansi                  6
sagar                  6
tarun                 12
varun                  5
vinay                  4
aniket                 8
dinesh                 7
krunal                10
mitesh                 9
nikhil                 5
sarvesh               11

14 rows selected.
Experiment No-6

6. Perform Online Analytical Processing (OLAP) operations on Operational System - Case Study

Aim: Perform Online Analytical Processing (OLAP) operations on Operational System created.
Theory: The basic OLAP operations performed on the data cube are:

Roll-up - aggregates the data by climbing up a concept hierarchy of a dimension or by dimension reduction.
Drill-down - the reverse of roll-up; navigates from less detailed data to more detailed data.
Slicing - selects a single value for one dimension, producing a sub-cube.
Dicing - selects specific values of two or more dimensions, producing a sub-cube.
Pivoting (Rotate) - rotates the data axes to provide an alternative presentation of the data.

SQL> select m_id,m_name,quantity,s_name,profit,monthly_sales


from(medicine natural full outer join sales)
where profit=30 or quantity=5;
M_ID M_NAME     QUANTITY S_NAME     PROFIT MONTHLY_SALES
---- ---------- -------- ---------- ------ -------------
   5 atorva            5 saroj         150           700
   2 crocin            5 tahela         90           500
  14 domstal           6 kishore        30           600
   6 losar 25          5 amritlal      100           850
  13 perinorm          5 md pharma      90           500
  15 shelcal           7 arihant        30           900
  11 soframycin        5 eastern        30           600

7 rows selected.
SQL> select profit,sum(monthly_sales)
from sales
group by CUBE(profit);
PROFIT SUM(MONTHLY_SALES)
------ ------------------
                     9400
    25                600
    30               2100
    39               1100
    50               1100
    80               1100
    90               1000
   100               1700
   150                700

9 rows selected.

SQL> select profit,max(monthly_sales)
from sales
group by CUBE(profit);
PROFIT MAX(MONTHLY_SALES)
------ ------------------
                     1000
    25                600
    30                900
    39                600
    50               1000
    80                700
    90                500
   100                850
   150                700

9 rows selected.
SQL> select profit,max(monthly_sales)
from sales
group by ROLLUP(PROFIT);
PROFIT MAX(MONTHLY_SALES)
------ ------------------
    25                600
    30                900
    39                600
    50               1000
    80                700
    90                500
   100                850
   150                700
                     1000

9 rows selected.

Experiment No 7
7. WAP to implement Classification using Decision Tree Induction / ID3
Software: Java & Weka Mining Tool
Theory:
Classification
predicts categorical class labels
classifies data (constructs a model) based on the training set and the values (class
labels) in a classifying attribute and uses it in classifying new data
Prediction:
models continuous-valued functions, i.e., predicts unknown or missing values
Typical Applications
credit approval
target marketing
medical diagnosis
treatment effectiveness analysis
Supervised learning
Supervision: The training data (observations, measurements, etc.) are accompanied
by labels indicating the class of the observations
New data is classified based on the training set
Issues regarding classification and prediction
Data cleaning
Preprocess data in order to reduce noise and handle missing values
Relevance analysis (feature selection)
Remove the irrelevant or redundant attributes
Data transformation
Generalize and/or normalize data
Algorithm
Basic algorithm (a greedy algorithm)-Decision Tree Induction
Tree is constructed in a top-down recursive divide-and-conquer manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are discretized in advance)
Examples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or statistical measure (e.g.,
information gain)
Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning; in this case majority voting is
employed for classifying the leaf
There are no samples left
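For reference, the heuristic used by the program below is information gain, computed from entropy.
For a set S in which class i occurs with proportion p_i:
Entropy(S) = - SUM_i p_i * log(p_i)
Gain(S, A) = Entropy(S) - SUM_v (|S_v| / |S|) * Entropy(S_v)
where S_v is the subset of S for which attribute A takes the value v. The attribute with the highest
gain, or equivalently the lowest weighted average entropy of its subsets (which is what the code
minimizes), is selected to split the node. The code uses the natural logarithm, which only scales the
entropy by a constant and does not change which attribute is chosen.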
Program Code:

import java.io.*;
import java.util.*;
public class ID3 {
int numAttributes;          // number of attributes, including the output attribute
String []attributeNames;    // the names of all attributes
Vector []domains;           // the domain (set of observed symbols) of each attribute
class DataPoint {
public int []attributes;
public DataPoint(int numattributes)
{
attributes = new int[numattributes];
}
};
/* The class to represent a node in the decomposition tree.
*/
class TreeNode {
public double entropy;              // The entropy of the data points if this node is a leaf node
public Vector data;                 // The set of data points if this is a leaf node
public int decompositionAttribute;  // If this is not a leaf node, the attribute used to divide the set of data points
public int decompositionValue;      // The attribute value that is used to divide the parent node
public TreeNode []children;         // If this is not a leaf node, references to the children nodes
public TreeNode parent;             // The parent of this node; the root has parent == null
public TreeNode() {
data = new Vector();
}
};
/* The root of the decomposition tree */
TreeNode root = new TreeNode();

public int getSymbolValue(int attribute, String symbol) {

int index = domains[attribute].indexOf(symbol);


if (index < 0) {
domains[attribute].addElement(symbol);
return domains[attribute].size() -1;
}
return index;
}
/* Returns all the values of the specified attribute in the data set */
public int []getAllValues(Vector data, int attribute) {
Vector values = new Vector();
int num = data.size();
for (int i=0; i< num; i++) {
DataPoint point = (DataPoint)data.elementAt(i);
String symbol = (String)domains[attribute].elementAt(point.attributes[attribute]);
int index = values.indexOf(symbol);
if (index < 0) {
values.addElement(symbol);
}
}

int []array = new int[values.size()];


for (int i=0; i< array.length; i++) {
String symbol = (String)values.elementAt(i);
array[i] = domains[attribute].indexOf(symbol);
}
values = null;
return array;
}
/* Returns a subset of data, in which the value of the specfied attribute of all data points is
the specified value */
public Vector getSubset(Vector data, int attribute, int value) {
Vector subset = new Vector();
int num = data.size();
for (int i=0; i< num; i++) {
DataPoint point = (DataPoint)data.elementAt(i);
if (point.attributes[attribute] == value) subset.addElement(point);
}
return subset;
}
/* Calculates the entropy of the set of data points.
The entropy is calculated using the values of the output attribute which is the last
element in the array attribtues
*/
public double calculateEntropy(Vector data) {
int numdata = data.size();
if (numdata == 0) return 0;

int attribute = numAttributes-1;


int numvalues = domains[attribute].size();
double sum = 0;
for (int i=0; i< numvalues; i++) {
int count=0;
for (int j=0; j< numdata; j++) {
DataPoint point = (DataPoint)data.elementAt(j);
if (point.attributes[attribute] == i) count++;
}
double probability = 1.*count/numdata;
if (count > 0) sum += -probability*Math.log(probability);
}
return sum;
}
/* This function checks if the specified attribute is used to decompose the data set in any
of the parents of the specfied node in the decomposition tree.Recursively checks the specified node
as well as all parents
*/
public boolean alreadyUsedToDecompose(TreeNode node, int attribute) {
if (node.children != null) {
if (node.decompositionAttribute == attribute )
return true;
}
if (node.parent == null) return false;
return alreadyUsedToDecompose(node.parent, attribute);
}
/* This function decomposes the specified node according to the ID3 algorithm.
Recursively divides all children nodes until it is not possible to divide any further */
public void decomposeNode(TreeNode node) {
double bestEntropy=0;
boolean selected=false;
int selectedAttribute=0;
int numdata = node.data.size();
int numinputattributes = numAttributes-1;
node.entropy = calculateEntropy(node.data);
if (node.entropy == 0) return;
/* In the following two loops, the best attribute is located which causes
maximum decrease in entropy
*/
for (int i=0; i< numinputattributes; i++) {
int numvalues = domains[i].size();
if ( alreadyUsedToDecompose(node, i) ) continue;
// Use the following variable to store the entropy for the test node created with the attribute i
double averageentropy = 0;
for (int j=0; j< numvalues; j++) {
Vector subset = getSubset(node.data, i, j);
if (subset.size() == 0) continue;
double subentropy = calculateEntropy(subset);
averageentropy += subentropy * subset.size(); // Weighted sum
}
averageentropy = averageentropy / numdata; // Taking the weighted average
if (selected == false) {
selected = true;
bestEntropy = averageentropy;
selectedAttribute = i;
} else {
if (averageentropy < bestEntropy) {
selected = true;
bestEntropy = averageentropy;
selectedAttribute = i;
}
}
}
if (selected == false) return;
// Now divide the dataset using the selected attribute
int numvalues = domains[selectedAttribute].size();
node.decompositionAttribute = selectedAttribute;
node.children = new TreeNode [numvalues];
for (int j=0; j< numvalues; j++) {
node.children[j] = new TreeNode();
node.children[j].parent = node;
node.children[j].data = getSubset(node.data, selectedAttribute, j);
node.children[j].decompositionValue = j;
}
// Recursively divides children nodes
for (int j=0; j< numvalues; j++) {
decomposeNode(node.children[j]);
}
node.data = null;
}
/** Function to read the data file.*/
public int readData(String filename) throws Exception {
FileInputStream in = null;

try {
File inputFile = new File(filename);
in = new FileInputStream(inputFile);
} catch ( Exception e) {
System.err.println( "Unable to open data file: " + filename + "\n" + e);
return 0;
}
BufferedReader bin = new BufferedReader(new InputStreamReader(in) );
String input;
while(true) {
input = bin.readLine();
if (input == null) {
System.err.println( "No data found in the data file: " +
filename + "\n");
return 0;
}
if (input.startsWith("//")) continue;
if (input.equals("")) continue;
break;
}
StringTokenizer tokenizer = new StringTokenizer(input);
numAttributes = tokenizer.countTokens();
if (numAttributes <= 1) {
System.err.println( "Read line: " + input);
System.err.println( "Could not obtain the names of attributes in the
line");
System.err.println( "Expecting at least one input attribute and one
output attribute");
return 0;
}
domains = new Vector[numAttributes];
for (int i=0; i < numAttributes; i++) domains[i] = new Vector();
attributeNames = new String[numAttributes];
for (int i=0; i < numAttributes; i++) {
attributeNames[i] = tokenizer.nextToken();
}
while(true) {
input = bin.readLine();
if (input == null) break;
if (input.startsWith("//")) continue;
if (input.equals("")) continue;
tokenizer = new StringTokenizer(input);

int numtokens = tokenizer.countTokens();


if (numtokens != numAttributes) {
System.err.println( "Read " + root.data.size() + " data");
System.err.println( "Last line read: " + input);
System.err.println( "Expecting " + numAttributes
+ "
attributes");
return 0;
}
DataPoint point = new DataPoint(numAttributes);
for (int i=0; i < numAttributes; i++) {
point.attributes[i] = getSymbolValue(i, tokenizer.nextToken() );
}
root.data.addElement(point);
}
bin.close();
return 1;
}

// End of function readData//

/* This function prints the decision tree in the form of rules.*/


public void printTree(TreeNode node, String tab) {
int outputattr = numAttributes-1;
if (node.children == null) {
int []values = getAllValues(node.data, outputattr );
if (values.length == 1) {
System.out.println(tab + "\t" + attributeNames[outputattr] + " = \"" + domains[outputattr].elementAt(values[0]) + "\";");
return;
}
System.out.print(tab + "\t" + attributeNames[outputattr] + " = {");
for (int i=0; i < values.length; i++) {
System.out.print("\"" + domains[outputattr].elementAt(values[i]) + "\" ");
if ( i != values.length-1 ) System.out.print( " , " );
}
System.out.println( " };");
return;
}
int numvalues = node.children.length;
for (int i=0; i < numvalues; i++) {
System.out.println(tab + "if( " + attributeNames[node.decompositionAttribute] + "
== \"" +
domains[node.decompositionAttribute].elementAt(i) + "\") {" );
printTree(node.children[i], tab + "\t");

if (i != numvalues-1) System.out.print(tab + "} else ");


else System.out.println(tab + "}");
}
}
/* creates the decision tree and prints it in the form of rules */
public void createDecisionTree() {
decomposeNode(root);
printTree(root, "");
}
/* main function */
public static void main(String[] args) throws Exception {
int num = args.length;
if (num != 1) {
System.out.println("You need to specify the name of the datafile at the
command line " );
return;
}
ID3 me = new ID3();
long startTime = System.currentTimeMillis(); // to print the time taken to process the data


int status = me.readData(args[0]);
if (status <= 0) return;
me.createDecisionTree();
long endTime = System.currentTimeMillis();
long totalTime = (endTime-startTime)/1000;
System.out.println( totalTime + " Seconds");
}
/* End of the main function */
}
OUTPUT:
if( Span == "long") {
Slab = "waffle";

} else {
if( Shape == "square") {
Slab = "two-way";
} else {
Slab = "one-way";
}
}
0 Seconds
Span   Shape      Slab
//**************************************
long   square     waffle
long   rectangle  waffle
short  square     two-way
short  rectangle  one-way

7a. Implementation of the K-nearest neighbour classifier.

Theory:
The k-nearest neighbour algorithm (k-NN) is a method for classifying objects based on the closest
training examples, where the function is only approximated locally and all computation is deferred until
classification. The k-nearest neighbour algorithm is amongst the simplest of all machine learning
algorithms: an object is classified by a majority vote of its neighbours, with the object being
assigned to the class most common amongst its k nearest neighbours (k is a positive integer,
typically small). If k = 1, then the object is simply assigned to the class of its nearest neighbour.
Nearest-neighbour classification is quite straightforward; examples are classified based on the class of
their nearest neighbours. It is often useful to take more than one neighbour into account, so the
technique is more commonly referred to as k-Nearest Neighbours (k-NN) classification, where the k
nearest neighbours are used in determining the class.
Example of k-NN classification
The test sample (green circle) should be classified either into the first class of blue squares or into the
second class of red triangles. If k = 3, it is assigned to the second class because there are 2 triangles
and only 1 square inside the inner circle. If k = 5, it is assigned to the first class (3 squares vs. 2
triangles inside the outer circle).

Implementation
The algorithm for computing the K nearest neighbours is as follows (a minimal Java sketch of these
steps is given after the list):
1. Determine the parameter K = the number of nearest neighbours beforehand. This value is up to
you.
2. Calculate the distance between the query instance and all the training samples. You can use any
distance measure.
3. Sort the distances for all the training samples and determine the nearest neighbours based on the
K-th minimum distance.
4. Since this is supervised learning, get the categories of the training samples that fall within the K
nearest neighbours.
5. Use the majority class of the nearest neighbours as the prediction value.
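A minimal Java sketch of these steps, assuming numeric feature vectors, Euclidean distance and a small hard-coded training set (all names and values here are illustrative only), might look as follows:

import java.util.*;

public class KNN {
    // A labelled training sample: feature vector plus class label
    static class Sample {
        double[] x; String label;
        Sample(double[] x, String label) { this.x = x; this.label = label; }
    }

    // Step 2: Euclidean distance between the query and a training sample
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    // Steps 3-5: sort by distance and take a majority vote among the k nearest samples
    static String classify(List<Sample> train, double[] query, int k) {
        train.sort(Comparator.comparingDouble(s -> distance(query, s.x)));
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < k && i < train.size(); i++)
            votes.merge(train.get(i).label, 1, Integer::sum);
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        List<Sample> train = new ArrayList<>(Arrays.asList(
            new Sample(new double[]{1.0, 1.1}, "square"),
            new Sample(new double[]{1.2, 0.9}, "square"),
            new Sample(new double[]{3.0, 3.2}, "triangle"),
            new Sample(new double[]{3.1, 2.9}, "triangle"),
            new Sample(new double[]{2.9, 3.0}, "triangle")));
        // Step 1: K = 3; the expected output for this query is "triangle"
        System.out.println(classify(train, new double[]{2.8, 3.1}, 3));
    }
}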
Conclusion:

k-NN is very simple to understand and easy to implement, so it should be considered when seeking a
solution to any classification problem. Some advantages of k-NN are as follows (many of these
derive from its simplicity and interpretability):
1. Because the process is transparent, it is easy to implement and debug.
2. In situations where an explanation of the output of the classifier is useful, k-NN can be very
effective if an analysis of the neighbours provides that explanation.
3. There are some noise reduction techniques that work only for k-NN and that can be effective in
improving the accuracy of the classifier.

7b. Implement Naïve Bayes Classifier for sample data

Theory:
Introduction
Bayesian classification represents a supervised learning method as well as a
statistical method for classification. It assumes an underlying probabilistic model and
allows us to capture uncertainty about the model in a principled way by determining
probabilities of the outcomes. It can solve both diagnostic and predictive problems. This
classification is named after Thomas Bayes (1702-1761), who proposed the Bayes
Theorem. Bayesian classification provides practical learning algorithms in which prior knowledge and
observed data can be combined. Bayesian classification provides a useful perspective for
understanding and evaluating many learning algorithms. It calculates explicit probabilities for
hypotheses and it is robust to noise in input data.
The Bayes Theorem:
P(h/D) = P(D/h) * P(h) / P(D)
where
P(h) : Prior probability of hypothesis h
P(D) : Prior probability of training data D
P(h/D) : Probability of h given D (posterior probability)
P(D/h) : Probability of D given h (likelihood)
Que: Calculate the probability that a customer buys a computer, given the following assumptions:
Age = 35
Income = $50,000
Let,
D : 35 year old customer with an income of $50,000 PA
h : Hypothesis that our customer will buy our computer
Theorem:
If we assume P(hi) = P(hj) for all hypotheses, the maximum a posteriori hypothesis reduces to the
maximum likelihood hypothesis: hML = arg max P(D/hi) (where hi belongs to H)
h1: Customer buys a computer = Yes
h2 : Customer buys a computer = No
where h1 and h2 are subsets of our Hypothesis Space H
P(h/D) (Final Outcome) = arg max{ P( D/h1) P(h1) , P(D/h2) P(h2)}
P(D) can be ignored as it is the same for both the terms
Algorithm applied:
P (buys computer = yes) = 5/10 = 0.5
P (buys computer = no) = 5/10 = 0.5
P (customer is 35 yrs & earns $50,000) = 4/10 = 0.4
P (customer is 35 yrs & earns $50,000 / buys computer = yes) = 3/5 =0.6
P (customer is 35 yrs & earns $50,000 / buys computer = no) =1/5 = 0.2
Customer buys a computer: P(h1/D) = P(h1) * P(D/h1) / P(D) = 0.5 * 0.6 / 0.4 = 0.75
Customer does not buy a computer: P(h2/D) = P(h2) * P(D/h2) / P(D) = 0.5 * 0.2 / 0.4 = 0.25
Final Outcome = arg max {P(h1/D), P(h2/D)} = max(0.75, 0.25) = 0.75
=> Customer buys a computer
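A minimal Java sketch of this calculation (the probabilities are taken directly from the worked example above; only the final Bayes-rule comparison is automated) might be:

public class NaiveBayesDemo {
    public static void main(String[] args) {
        // Prior probabilities of the two hypotheses (from the example above)
        double pH1 = 0.5;           // P(buys computer = yes)
        double pH2 = 0.5;           // P(buys computer = no)
        // Likelihood of the evidence D (35 years old, earns $50,000) under each hypothesis
        double pDgivenH1 = 0.6;     // P(D / h1)
        double pDgivenH2 = 0.2;     // P(D / h2)
        double pD = 0.4;            // P(D)

        // Posterior probabilities by the Bayes theorem
        double posteriorYes = pH1 * pDgivenH1 / pD;   // 0.75
        double posteriorNo  = pH2 * pDgivenH2 / pD;   // 0.25

        System.out.println("P(h1/D) = " + posteriorYes);
        System.out.println("P(h2/D) = " + posteriorNo);
        System.out.println(posteriorYes > posteriorNo
                ? "=> Customer buys a computer"
                : "=> Customer does not buy a computer");
    }
}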
Output

Conclusion: A Naïve Bayes classifier was implemented, which is very efficient in classifying the data
present in the database.

Experiment no:8
8. Implement DBSCAN method for clustering

Theory:
Introduction
Density-based clustering locates regions of high density that are separated from one another by
regions of low density.
Density = number of points within a specified radius (Eps).
DBSCAN is a density-based algorithm. A point is a core point if it has more than a
specified number of points (MinPts) within Eps; these are points that lie in the interior of a cluster.
1) A border point has fewer than MinPts within Eps, but is in the neighbourhood of a core point.
2) A noise point is any point that is neither a core point nor a border point. Any two core points that are
close enough (within a distance Eps of one another) are put in the same cluster. Any border point that is
close enough to a core point is put in the same cluster as the core point; noise points are discarded.

There are two concepts in DBSCAN, which are as follows:


1. Reachability is the first building block in DBSCAN. It defines whether two nearby points belong to the same cluster. A point p1 is density reachable from p2 if two conditions are satisfied: (i) the points are close enough to each other: distance(p1, p2) < e, and (ii) there are enough points in the neighbourhood of p2: |{r : distance(r, p2) < e}| > m, where r is a database point. The figure illustrates a density reachable point p2 from p1.
Reachability
2. Connectivity is the last building step of DBSCAN. Points p0 and pn are density connected if there is a sequence of density reachable points p1, p2, ..., p(n-1) from p0 to pn such that p(i+1) is density reachable from pi.


Connectivity
Algorithm:
1. Select a point p
2. Retrieve all points density-reachable from p w.r.t. Eps and MinPts.
3. If p is a core point, a cluster is formed.
4. If p is a border point, no points are density-reachable from p and DBSCAN
visits the next point of the database.
5. Continue the process until all of the points have been processed.
The result is largely independent of the order in which the points are processed (only the cluster assignment of border points may vary). A sample implementation is sketched below.
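Sample code (Python): a minimal sketch of the algorithm above. It is not part of the original manual; the two-dimensional sample points and the values of Eps and MinPts are illustrative assumptions.

# DBSCAN sketch (illustrative; sample data, Eps and MinPts are assumed)
import numpy as np

def dbscan(X, eps=1.5, min_pts=3):
    n = len(X)
    labels = [None] * n          # None = unvisited, -1 = noise, >= 0 = cluster id
    cluster_id = -1
    # neighbourhood query: indices of points within eps of point i
    def region(i):
        return [j for j in range(n) if np.linalg.norm(X[i] - X[j]) <= eps]
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbours = region(i)
        if len(neighbours) < min_pts:       # not a core point -> tentatively noise
            labels[i] = -1
            continue
        cluster_id += 1                     # i is a core point: start a new cluster
        labels[i] = cluster_id
        seeds = list(neighbours)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:             # border point previously marked as noise
                labels[j] = cluster_id
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbours = region(j)
            if len(j_neighbours) >= min_pts:    # j is also a core point: expand further
                seeds.extend(j_neighbours)
    return labels

# Hypothetical sample data: two dense groups plus one outlier
X = np.array([[1, 1], [1, 2], [2, 1], [2, 2],
              [8, 8], [8, 9], [9, 8], [9, 9],
              [5, 15]], dtype=float)
print(dbscan(X, eps=1.5, min_pts=3))   # e.g. [0, 0, 0, 0, 1, 1, 1, 1, -1]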
For Example:

Conclusion: A density-based algorithm called DBSCAN was implemented, which is very efficient in discovering clusters of arbitrary shape and in identifying the noise present in the database.

Experiment no 9
9. Draw scatter plot of your data and find whether linear regression can be used or not.

Theory:
Scatter plot
A scatter plot or scatter graph is a type of mathematical diagram using Cartesian coordinates to
display values for two variables for a set of data. The data is displayed as a collection of points,
each having the value of one variable determining the position on the horizontal axis and the value
of the other variable determining the position on the vertical axis.This kind of plot is also called a
scatter chart, scattergram, scatter diagram or scatter graph. A scatter plot is used when a variable exists that is under the control of the experimenter. If a parameter exists that is systematically incremented and/or decremented by the experimenter, it is called the control parameter or independent variable and is customarily plotted along the horizontal axis. The measured or dependent variable is customarily plotted along the vertical axis. If no dependent variable exists, either type of variable can be plotted on either axis. A scatter plot illustrates only the degree of correlation (not causation) between two variables.

Regression
Definition: Regression is a data mining (machine learning) technique used to fit an
equation to a dataset. The simplest form of regression, linear regression, uses the formula of a
straight line (y = mx + b) and determines the appropriate values for m and b to predict the value of y
based upon a given value of x. Advanced techniques, such as multiple regression, allow the use of

more than one input variable and allow for the fitting of more complex models, such as a quadratic
equation.
Linear Regression
A linear regression technique can be used if the relationship between the predictors and the target
can be approximated with a straight line. Regression with a single predictor is the easiest to
visualize. Simple linear regression with a
single predictor is shown in Figure 1.

Figure 1: Linear Regression with a Single Predictor
Linear regression with a single predictor can be expressed with the following equation.
y = θ2x + θ1 + e
The regression parameters in simple linear regression are:
The slope of the line (θ2) - the change in y for each unit change in x
The y intercept (θ1) - the point where the line crosses the y axis (x = 0)
Working of linear regression
Minimizing sum-of-squares: The goal of linear regression is to adjust the values of slope and
intercept to find the line that best predicts Y from X. More precisely, the goal of regression is to
minimize the sum of the squares of the vertical distances of the points from the line. Why minimize
the sum of the squares of the distances? Why not simply minimize the sum of the actual distances?
If the random scatter follows a Gaussian distribution, it is far more likely to have two medium size
deviations (say 5 units each) than to have one small deviation (1 unit) and one large (9 units). A
procedure that minimized the sum of the absolute value of the distances would have no preference
over a line that was 5 units away from two points and one that was

1 unit away from one point and 9 units from another. The sum of the distances (more precisely, the sum of the absolute value of the distances) is 10 units in each case. A procedure that minimizes the sum of the squares of the distances prefers to be 5 units away from two points (sum-of-squares = 50) rather than 1 unit away from one point and 9 units away from another (sum-of-squares = 82). If the scatter is Gaussian (or nearly so), the line determined by minimizing the sum-of-squares is most likely to be correct.
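Sample code (Python): as a concrete illustration of minimizing the sum of squares, the best-fit slope and intercept can be computed in closed form. This sketch is not part of the original manual, and the sample x and y values are assumed.

# Ordinary least-squares fit of y = m*x + b (illustrative sample data)
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])

x_mean, y_mean = x.mean(), y.mean()
# slope m = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2); intercept b = y_mean - m*x_mean
m = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
b = y_mean - m * x_mean

residuals = y - (m * x + b)
print("slope =", m, "intercept =", b)
print("sum of squared residuals =", (residuals ** 2).sum())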
Slope and intercept
Prism reports the best-fit values of the slope and intercept, along with their standard errors and
confidence intervals. The slope quantifies the steepness of the line. It equals the change in Y for
each unit change in X. It is expressed in the units of the Y-axis divided by the units of the X-axis.
If the slope is positive, Y increases as X increases. If the slope is negative, Y decreases as X
increases. The Y intercept is the Y value of the line when X equals zero. It defines the elevation of
the line.
The standard error values of the slope and intercept can be hard to interpret, but their
main purpose is to compute the 95% confidence intervals. If you accept the assumptions of linear
regression, there is a 95% chance that the 95% confidence interval of the slope contains the true
value of the slope, and that the 95% confidence interval for the intercept contains the true value of
the intercept.
r2, a measure of goodness-of-fit of linear regression
The value r2 is a fraction between 0.0 and 1.0, and has no units. An r2 value of 0.0
means that knowing X does not help you predict Y. There is no linear relationship
between X and Y, and the best-fit line is a horizontal line going through the mean of all Y values.
When r2 equals 1.0, all points lie exactly on a straight line with no scatter. Knowing X lets you
predict Y perfectly.
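Sample code (Python): one way to carry out this experiment is sketched below. It is not part of the original manual, and the sample data values are illustrative assumptions. The scatter plot is drawn with matplotlib and r2 is computed to judge whether linear regression can be used.

# Scatter plot and r^2 check (illustrative; the sample data are assumed)
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical observations of an independent variable x and a dependent variable y
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([1.9, 4.1, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 20.2])

# Correlation coefficient r and goodness-of-fit r^2
r = np.corrcoef(x, y)[0, 1]
print("r =", r, "r^2 =", r ** 2)   # r^2 close to 1.0 suggests linear regression is suitable

plt.scatter(x, y)
plt.xlabel("x (independent variable)")
plt.ylabel("y (dependent variable)")
plt.title("Scatter plot of sample data (r^2 = %.3f)" % (r ** 2))
plt.show()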

Input Dataset:

Output

Conclusion: Scatter plot is very efficient for analyzing whether linear regression can be used in
data or not.

Experiment no 10
10. Implement a regression analysis and show the prediction on sample data.

Introduction
Regression is a data mining function that predicts a number. Profit, sales, mortgage rates, house
values, square footage, temperature, or distance could all be predicted using regression techniques.
For example, a regression model could be used to predict the value of a house based on location,
number of rooms, lot size, and other factors.
A regression task begins with a data set in which the target values are known.
For example, a regression model that predicts house values could be developed based on observed
data for many houses over a period of time. In addition to the value, the data might track the age of
the house, square footage, number of rooms, taxes, school district, proximity to shopping centers,
and so on. House value would be the target, the other attributes would be the predictors, and the
data for each house would constitute a case. In the model build (training) process, a regression
algorithm estimates the value of the target as a function of the
predictors for each case in the build data. These relationships between predictors and target are
summarized in a model, which can then be applied to a different data set in which the target values
are unknown.
Regression algorithms predict one or more continuous variables, such as profit or loss, based
on other attributes in the dataset. An example of a regression algorithm is the Microsoft Time
Series Algorithm.
Types of regression:
Linear Regression
A linear regression technique can be used if the relationship between the predictors and
the target can be approximated with a straight line. Regression with a single predictor is
the easiest to visualize. Simple linear regression with a single predictor is shown in Figure 1.
Figure 1: Linear Regression With a Single Predictor
Linear regression with a single predictor can be expressed with the following equation.

y = θ2x + θ1 + e
The regression parameters in simple linear regression are:
The slope of the line (θ2) - the change in y for each unit change in x
The y intercept (θ1) - the point where the line crosses the y axis (x = 0)
Multivariate Linear Regression
The term multivariate linear regression refers to linear regression with two or more predictors (x1, x2, ..., xn). When multiple predictors are used, the regression line cannot be visualized in two-dimensional space. However, the line can be computed simply by expanding the equation for single-predictor linear regression to include the parameters for each of the predictors.
y = θ1 + θ2x1 + θ3x2 + ... + θn x(n-1) + e
Regression Coefficients
In multivariate linear regression, the regression parameters are often referred to as
coefficients. When you build a multivariate linear regression model, the algorithm
computes a coefficient for each of the predictors used by the model. The coefficient is a
measure of the impact of the predictor x on the target y. Numerous statistics are available
for analyzing the regression coefficients to evaluate how well the regression line fits the
data.
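Sample code (Python): a minimal sketch of multivariate linear regression using scikit-learn. It is not part of the original manual, and the two-predictor sample data are assumptions; the fitted coefficients correspond to the regression parameters described above.

# Multivariate linear regression (illustrative; the sample data are assumed)
import numpy as np
from sklearn.linear_model import LinearRegression

# Two predictors (e.g. square footage and number of rooms) and a target (e.g. house value)
X = np.array([[1000, 2], [1500, 3], [1800, 3], [2400, 4], [3000, 5]], dtype=float)
y = np.array([200000, 280000, 320000, 410000, 500000], dtype=float)

model = LinearRegression().fit(X, y)
print("intercept:", model.intercept_)
print("coefficients (impact of each predictor):", model.coef_)
print("prediction for a 2000 sq ft, 3 room house:", model.predict([[2000, 3]]))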
Nonlinear Regression
Often the relationship between x and y cannot be approximated with a straight line. In this case, a nonlinear regression technique may be used. Alternatively, the data could be preprocessed to make the relationship linear. Nonlinear regression models define y as a function of x using an equation that is more complicated than the linear regression equation. In Figure 2, x and y have a nonlinear relationship.
Figure 2: Nonlinear Regression With a Single Predictor
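Sample code (Python): one common nonlinear technique is polynomial regression, in which the input is expanded into polynomial features before a linear model is fitted. This sketch is not part of the original manual; the quadratic sample data and the degree-2 choice are assumptions.

# Polynomial (nonlinear) regression (illustrative; sample data and degree are assumed)
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Sample data following a roughly quadratic relationship y ~ x^2 + noise
x = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = (x.ravel() ** 2) + np.array([0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2, 0.4, -0.1, 0.2])

# Fit a degree-2 polynomial model
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print("prediction at x = 12:", model.predict([[12.0]]))   # close to 144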
Multivariate Nonlinear Regression

The term multivariate nonlinear regression refers to nonlinear regression with two or more predictors (x1, x2, ..., xn). When multiple predictors are used, the nonlinear relationship cannot be visualized in two-dimensional space.
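Sample code (Python): the experiment also asks for a comparison of predicted and actual values and an error calculation. The sketch below is not part of the original manual; the randomly generated sample data, the train/test split and the error metrics are illustrative assumptions.

# Regression prediction vs. actual values and error calculation (illustrative)
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical sample data: one predictor, target follows y ~ 3x + 5 + noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 3 * X.ravel() + 5 + rng.normal(0, 1, size=50)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)

y_pred = model.predict(X_test)
for actual, predicted in zip(y_test[:5], y_pred[:5]):
    print("actual = %.2f  predicted = %.2f  error = %.2f" % (actual, predicted, actual - predicted))
print("mean absolute error:", mean_absolute_error(y_test, y_pred))
print("mean squared error:", mean_squared_error(y_test, y_pred))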
Example:
Input Dataset: pima_diabetes

Bayesian Logistic Regression

Logistic

Conclusion:
Regression analysis is very efficient at predicting values for the data present in the database.
