
SHRI SAI SHIKSHAN SANSTHA’S

NAGPUR INSTITUTE OF TECHNOLOGY, NAGPUR

DEPARTMENT OF INFORMATION TECHNOLOGY


Session 2024-2025
Laboratory Manual

Seventh Semester
Subject Name: Data Warehouse & Mining
Practical Code: BEIT701P
COURSE OUTCOME

On successful completion of the course, students will be able to:


1. Perform pre-processing on datasets.
2. Perform classification on datasets.
3. Perform normalization and discretization on datasets.

List of Practicals

Sr. No.   Practical Code   Title of Experiment
1         PR01             To create a .csv file from Excel and read it as an ARFF file in Weka.
2         PR02             Perform preprocessing on the bank dataset by copying attributes in the dataset.
3         PR03             Study discretization on the iris dataset.
4         PR04             Study normalization on the iris dataset.
5         PR05             Perform nominal-to-binary conversion on the weather dataset.
6         PR06             Apply the Remove filter on the weather dataset.
7         PR07             Generate a decision tree using the J48 algorithm.
8         PR08             Perform association on the contact lenses dataset using the Apriori algorithm.
9         PR09             Perform classification on the labor dataset using a decision tree.
10        PR10             Perform classification of the supermarket dataset using the Naive Bayesian classifier.
11        PR11             Demonstrate standardization on the weather dataset.
12        PR12             Perform classification on the weather dataset using the ZeroR rule.
13        PR13             Perform classification on the weather dataset using the OneR rule.
14        PR14             Perform k-means clustering on the iris dataset.
15        PR15             Use multiple ROC curves for model evaluation.
Practical No. 01
AIM: To create a .csv file from Excel and read it as an ARFF file in Weka.
THEORY:

Attribute Relation File Format (ARFF) is an ASCII text file that describes a list of instances
sharing a set of attributes. ARFF files were developed by the Machine Learning Project at the Department
of Computer Science of the University of Waikato for use with the Weka machine learning software. An ARFF
file can be created from an Excel file by saving the spreadsheet in comma-separated values (CSV) format.

CSV is a simple file format used to store tabular data, such as a spreadsheet or database. Files in
the CSV format can be imported to and exported from programs that store data in tables, such as Microsoft
Excel or OpenOffice Calc.
CSV stands for "comma-separated values". Its data fields are most often separated, or delimited,
by a comma. For example, let's say you had a spreadsheet containing the following data.

Name              Class   Dorm             Room   GPA
Sally Whittaker   2018    McCarren House   312    3.75
Belinda Jameson   2017    Cushing House    148    3.52
Jeff Smith        2018    Prescott House   17-D   3.20
Sandy Allen       2019    Oliver House     108    3.48
The above data could be represented in a CSV-formatted file as follows:

Sally Whittaker,2018,McCarren House,312,3.75

Belinda Jameson,2017,Cushing House,148,3.52

Jeff Smith,2018,Prescott House,17-D,3.20

Sandy Allen,2019,Oliver House,108,3.48
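
For comparison, the same data in ARFF format carries an explicit header that declares every attribute and its type before the data rows. The sketch below shows what such a file could look like; the attribute types shown are assumptions about what Weka's CSV loader would infer (Room must be a string because of values like 17-D):

@relation student_dataset

@attribute Name string
@attribute Class numeric
@attribute Dorm string
@attribute Room string
@attribute GPA numeric

@data
'Sally Whittaker',2018,'McCarren House',312,3.75
'Belinda Jameson',2017,'Cushing House',148,3.52
'Jeff Smith',2018,'Prescott House','17-D',3.20
'Sandy Allen',2019,'Oliver House',108,3.48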

Procedure:
Step 1: Create the dataset in an Excel file. Name the file student_dataset.

Step 2: Save the dataset in CSV format.

Step 3: In the Weka Explorer, click the "Open file..." button and select the .csv file created in the previous step.
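
The conversion can also be scripted with the Weka Java API instead of the GUI. The following is a minimal sketch, assuming weka.jar is on the classpath and the file from Step 2 is named student_dataset.csv:

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

public class CsvToArff {
    public static void main(String[] args) throws Exception {
        // Read the CSV file created in Step 2.
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("student_dataset.csv"));
        Instances data = loader.getDataSet();

        // Write the same instances back out in ARFF format.
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("student_dataset.arff"));
        saver.writeBatch();
    }
}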

Output:

Practical No. 02
AIM: Perform preprocessing on the bank dataset by copying attributes in the dataset.

THEORY:

In data preprocessing, we can copy one or more attributes if required; some applications
need duplicated attributes in the dataset. This exercise illustrates some of the basic data preprocessing
operations that can be performed using WEKA. The sample dataset used for this example is the "bank data"
available in comma-separated format (bank-data.csv).

The data contains the following fields:

1. id - a unique identification number
2. age - age of the customer in years (numeric)
3. sex - MALE / FEMALE
4. region - inner_city / rural / suburban / town
5. income - income of the customer (numeric)
6. married - is the customer married? (YES/NO)
7. children - number of children (numeric)
8. car - does the customer own a car? (YES/NO)
9. save_act - does the customer have a savings account? (YES/NO)
10. current_act - does the customer have a current account? (YES/NO)
11. mortgage - does the customer have a mortgage? (YES/NO)
12. pep - did the customer buy a PEP (Personal Equity Plan) after the last mailing? (YES/NO)

Loading the data: In addition to the native ARFF data file format, WEKA can read files in
".csv" format. This is fortunate, since many databases and spreadsheet applications can save or export
data into flat files in this format. As can be seen in the sample data file, the first row contains the attribute
names (separated by commas), followed by each data row with attribute values listed in the same order (also
separated by commas). Once loaded into WEKA, the dataset can be saved in ARFF format. In
this example, we load the dataset into WEKA and perform a series of operations using WEKA's preprocessing
filters. While all of these operations can be performed from the command line, we use the WEKA Explorer
GUI. Initially (in the Preprocess tab), click "Open file..." and navigate to the directory containing the
data file (.csv or .arff).
Procedure:

Step 1: Load the bank dataset in the Weka Explorer.

Step 2: Choose the Copy filter in the filters panel.

Step 3: Choose the index of the attribute you want to copy, then apply the filter.
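
The same preprocessing step can be scripted with the Weka Java API. The following is a minimal sketch, assuming weka.jar is on the classpath and the bank-data.csv file from this practical is in the working directory (the attribute indices are an example choice):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Copy;

public class CopyAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("bank-data.csv");

        // Copy attributes 2 and 3 (age and sex); the copies are appended
        // to the end of the attribute list as "Copy of ...".
        Copy filter = new Copy();
        filter.setAttributeIndices("2,3");
        filter.setInputFormat(data);
        Instances copied = Filter.useFilter(data, filter);

        System.out.println(copied.toSummaryString());
    }
}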

Output:

Practical No. 03
AIM: Study discretization on the iris dataset.
THEORY:

Data discretization techniques can be used to reduce the number of values for a given continuous
attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace
actual data values [5]. This leads to a concise, easy-to-use, knowledge-level representation of mining results.
Data discretization can be performed before or while doing data mining. Most real datasets contain
continuous attributes, and some machine learning algorithms that can handle both continuous and discrete
attributes perform better with discrete-valued attributes. Discretization involves:

 Dividing the range of a continuous attribute into intervals.
 Satisfying classification algorithms that only accept categorical attributes.
 Reducing data size.
 Preparing the data for further analysis.

Discretization techniques are often used by classification algorithms. Unsupervised discretization
algorithms do not use class information when dividing continuous ranges into sub-ranges [8].
Discretization has several advantages, some of which are given below:

 Discretization reduces the number of continuous feature values, which places smaller demands on system storage.
 Discretization can make learning more accurate and faster.
 A number of classification learning algorithms can only deal with discrete data.
 Data can be reduced and simplified through discretization.
 For both users and experts, discrete features are easier to understand, use, and explain.

Procedure:

Step 1: Load the iris dataset in the Weka Explorer.

Step 2: Select Discretize from the filters panel.

Step 3: Select the attribute indices on which discretization is to be performed, fill in the number of
bins required in the bins field, and apply the filter.
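
Equivalently, discretization can be scripted against the Weka Java API. A minimal sketch, assuming weka.jar is on the classpath and the iris.arff sample shipped in Weka's data directory:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class DiscretizeIris {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/iris.arff");

        // Equal-width binning into 3 bins over all numeric attributes;
        // the nominal class attribute is left untouched.
        Discretize filter = new Discretize();
        filter.setBins(3);
        filter.setAttributeIndices("first-last");
        filter.setInputFormat(data);
        Instances discretized = Filter.useFilter(data, filter);

        System.out.println(discretized.toSummaryString());
    }
}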

Step 4: Output
Practical No. 04
AIM: Study normalization on the iris dataset.

THEORY:
In creating a database, normalization is the process of organizing it into tables in such a way that the
results of using the database are always unambiguous and as intended. Normalization may have the effect of
duplicating data within the database and often results in the creation of additional tables. (While
normalization tends to increase the duplication of data, it does not introduce redundancy, which is
unnecessary duplication.) Normalization is typically a refinement process after the initial exercise of
identifying the data objects that should be in the database, identifying their relationships, and defining the
tables required and the columns within each table.

Procedure:

Step 1: Load the iris dataset in the Weka Explorer.

Step 2: Select Normalize from the filters panel.

Step 3: Select the scale for normalization, along with other fields, and apply the filter.
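
The scale and translation fields of the filter have direct counterparts in the Weka Java API. A minimal sketch, assuming weka.jar is on the classpath and Weka's bundled data/iris.arff:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;

public class NormalizeIris {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/iris.arff");

        // The default output range is [0, 1]; scale 2.0 with translation -1.0
        // rescales every numeric attribute into [-1, 1] instead.
        Normalize filter = new Normalize();
        filter.setScale(2.0);
        filter.setTranslation(-1.0);
        filter.setInputFormat(data);
        Instances normalized = Filter.useFilter(data, filter);

        System.out.println(normalized.toSummaryString());
    }
}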
Step 4: Output
Practical No. 05
AIM: Perform nominal-to-binary conversion on the weather dataset.
THEORY:

 Normalization is a scaling technique, a mapping technique, or a pre-processing stage [1] in which we
compute a new range from an existing one. It can be very helpful for prediction and forecasting
purposes [2].
 There are many ways to predict or forecast, but their outputs can vary widely. To contain this large
variation in prediction and forecasting, a normalization technique is required to bring the values
closer together.
 Existing normalization techniques include Min-Max, Z-score, and Decimal scaling; beyond these, an
Integer Scaling technique has also been proposed, derived from AMZD (Advanced on Min-Max Z-score
Decimal scaling).
 Data normalization is also a process in which data attributes within a data model are organized to
increase the cohesion of entity types. In other words, the goal of data normalization is to reduce and
even eliminate data redundancy, an important consideration for application developers, because it is
incredibly difficult to store objects in a relational database that maintains the same information in
several places.

In Weka, the NominalToBinary filter used in this practical converts nominal attributes into binary numeric
attributes, creating one indicator attribute per value for attributes with more than two values.

1. First Normal Form (1NF)

Let's consider an example. An entity type is in first normal form (1NF) when it contains no repeating
groups of data. For example, in Figure 1 you see that there are several repeating attributes in the
Order0NF table: the ordered-item information repeats nine times and the contact information is
repeated twice, once for shipping and once for billing. Although this initial version
of orders could work, what happens when an order has more than nine order items? Do you create
additional order records for them? What about the vast majority of orders that have only one or two items?
Do we really want to waste all that storage space in the database on empty fields? Likely not.
Furthermore, do you want to write the code required to process the nine copies of item information, even if
only to marshal it back and forth between the appropriate number of objects? Once again, likely not.

2. Second Normal Form (2NF)

Although the solution presented in Figure 2 is improved over that of Figure 1, it can be normalized
further. Figure 3 presents the data schema of Figure 2 in second normal form (2NF). An entity type is in
second normal form (2NF) when it is in 1NF and every non-key attribute (any attribute that is not part
of the primary key) is fully dependent on the primary key. This was definitely not the case with
the OrderItem1NF table, therefore we need to introduce the new table Item2NF. The problem
with OrderItem1NF is that item information, such as the name and price of an item, does not depend on an
order for that item. For example, if Hal Jordan orders three widgets and Oliver Queen orders five widgets,
the facts that the item is called a "widget" and that the unit price is $19.95 remain constant. This information
depends on the concept of an item, not on the concept of an order for an item, and therefore should not be
stored in the order items table; this is why the Item2NF table was introduced. OrderItem2NF retains a
calculated value: the number of items ordered multiplied by the price of the item. The value of
the SubtotalBeforeTax column within the Order2NF table is the total of the extended prices of all of its
order items.
Procedure:

Step 1: Load the weather dataset in the Weka Explorer.

Step 2: Save the file with a .csv extension.

Step 3: Open the file in the WEKA Explorer.

Step 4: Choose the NominalToBinary filter.

Step 5: Indicate, by index, the attribute you want to convert to binary.

Step 6 (Output): After selection, click the Apply button.
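
The same conversion can be scripted with the Weka Java API. A minimal sketch, assuming weka.jar is on the classpath and the weather.nominal.arff sample from Weka's data directory (with the class attribute set, Weka leaves it unconverted):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NominalToBinary;

public class WeatherToBinary {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1); // class attribute: play

        // Convert every remaining nominal attribute (outlook, windy, ...)
        // into binary indicator attributes.
        NominalToBinary filter = new NominalToBinary();
        filter.setAttributeIndices("first-last");
        filter.setInputFormat(data);
        Instances binary = Filter.useFilter(data, filter);

        System.out.println(binary.toSummaryString());
    }
}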


Practical No. 06
AIM: Apply the Remove filter on the weather dataset.
THEORY: The Remove filter deletes a range of attributes from the dataset. It will re-order the remaining
attributes if the invert matching sense is turned on and the attribute column indices are not specified in
ascending order.

Procedure:

Step 1: Load the weather dataset in the Weka Explorer.

Step 2: Save the file with a .csv extension.

Step 3: Open the file in the WEKA Explorer.

Step 4: Choose the Remove filter.

Step 5: Indicate, by index, the attribute you want to remove.

Step 6: After selection, click the Apply button and the selected attribute will be removed.
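
The filter can also be applied from the Weka Java API. A minimal sketch, assuming weka.jar is on the classpath and Weka's bundled weather.nominal.arff:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class RemoveAttribute {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/weather.nominal.arff");

        // Remove the first attribute (outlook). With setInvertSelection(true)
        // the filter would instead keep only the listed attributes.
        Remove filter = new Remove();
        filter.setAttributeIndices("1");
        filter.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, filter);

        System.out.println(reduced.toSummaryString());
    }
}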
Practical No. 07
AIM: Generate a decision tree using the J48 algorithm.
THEORY:

 A decision tree is a predictive machine-learning model that decides the target value (dependent
variable) of a new sample based on various attribute values of the available data.
 The internal nodes of a decision tree denote the different attributes, the branches between the nodes
tell us the possible values that these attributes can have in the observed samples, while the terminal
nodes tell us the final value (classification) of the dependent variable.
 The attribute that is to be predicted is known as the dependent variable, since its value depends
upon, or is decided by, the values of all the other attributes. The other attributes, which help in
predicting the value of the dependent variable, are known as the independent variables in the dataset.

Modified J48 Decision Tree Algorithm


 The 16-bit representation of the device MAC address is presented in the Current Active Directory
List. The modified J48 decision tree algorithm examines the normalized information gain that results
from choosing an attribute for splitting the data.
 To make the decision, the attribute with the highest normalized information gain is used. Then the
algorithm recurses on the smaller subsets. The splitting procedure stops if all instances in a subset
belong to the same class.
 Then a leaf node is created in the decision tree telling us to choose that class. In this case, the modified
J48 decision tree algorithm creates a decision node higher up in the tree using the expected value of
the class.
 If the generated LSB value in the CADL and the incoming protocol device MAC address are the same, then
the device is authenticated; otherwise the device is flagged as an intruder.

Disadvantages of the J48 algorithm:


The run-time complexity of the algorithm corresponds to the tree depth, which cannot be greater than the
number of attributes. Tree depth is linked to tree size, and thereby to the number of examples, so the size of
C4.5 trees increases linearly with the number of examples. C4.5 rules are slow for large and noisy datasets.
Space complexity is very large, as values have to be stored repeatedly in arrays.
Fig: Create weather dataset and save with extension .csv.

Fig: Explore the excel file in WEKA explorer.


Fig: Use J48 tree classifier.

Fig: Classifier Output.


Fig: Select visualize tree option.

Fig: Classifier decision tree view.
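
The same tree can be built programmatically with the Weka Java API. A minimal sketch, assuming weka.jar is on the classpath and Weka's bundled weather.nominal.arff:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Weather {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1); // predict 'play'

        // Train on the full dataset and print the tree in text form.
        J48 tree = new J48();
        tree.buildClassifier(data);
        System.out.println(tree);

        // 10-fold cross-validation for an accuracy estimate.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}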

Result: Hence, we have generated a decision tree using the J48 algorithm.


Practical No. 08
AIM: Perform association on the contact lenses dataset using the Apriori algorithm.

THEORY:

The Apriori algorithm was proposed by Agrawal and Srikant in 1994. Apriori is designed to operate
on databases containing transactions (for example, collections of items bought by customers, or details of
website visits). Other algorithms are designed for finding association rules in data having no
transactions, or having no timestamps (DNA sequencing). Each transaction is seen as a set of items
(an itemset). Given a threshold C, the Apriori algorithm identifies the item sets which are subsets of at
least C transactions in the database. Apriori uses a "bottom-up" approach, where frequent subsets are extended
one item at a time (a step known as candidate generation), and groups of candidates are tested against the
data. The algorithm terminates when no further successful extensions are found.

Apriori uses breadth-first search and a hash tree structure to count candidate item sets efficiently. It
generates candidate item sets of length k from item sets of length k-1, then prunes the candidates which have
an infrequent sub-pattern. The pseudocode for the algorithm is usually given for a transaction database T and a
support threshold epsilon. Usual set-theoretic notation is employed, though note that T is a multiset. C_k
denotes the candidate set for level k. At each step, the algorithm is assumed to generate the candidate sets
from the large item sets of the preceding level, heeding the downward closure lemma. count[c] accesses a field
of the data structure that represents candidate set c, which is initially assumed to be zero. Many details are
omitted here; usually the most important part of the implementation is the data structure used for storing the
candidate sets and counting their frequencies.
Procedure:

Step 1: Open the contact lenses dataset in WEKA.

Step 2: Go to the Associate tab for the contact lenses dataset.

Step 3: Apply the Apriori algorithm to the dataset.

Step 4: Change the number of rules.

Step 5: The number of rules reported by Apriori changes accordingly.
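
The association step can also be run from the Weka Java API. A minimal sketch, assuming weka.jar is on the classpath and the contact-lenses.arff sample from Weka's data directory:

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriContactLenses {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/contact-lenses.arff");

        // The counterpart of Step 4: request 20 rules instead of the default 10,
        // with a minimum support of 20% of the transactions.
        Apriori apriori = new Apriori();
        apriori.setNumRules(20);
        apriori.setLowerBoundMinSupport(0.2);
        apriori.buildAssociations(data);

        System.out.println(apriori); // prints the discovered rules
    }
}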

Result: Hence, we have performed association on the contact lenses dataset using the Apriori algorithm.
Practical No. 09
AIM: Perform classification on the labor dataset using a decision tree.

THEORY:
Decision trees are classic supervised learning algorithms, easy to understand and easy to use.
In this practical we describe the basic mechanism behind decision trees and see the
algorithm in action using Weka (Waikato Environment for Knowledge Analysis).

The main concept behind decision tree learning is the following: starting from the training data,
we build a predictive model which is mapped to a tree structure. The goal is to achieve
perfect classification with a minimal number of decisions, although this is not always possible due to noise
or inconsistencies in the data.

Step 1: Open the labor dataset in WEKA.

Step 2: Apply a decision tree classifier to the dataset.

Step 3: Right-click the result entry to visualize classifier errors.
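
A scripted version of the same experiment, assuming weka.jar is on the classpath and the labor.arff sample from Weka's data directory; the confusion matrix printed at the end is the textual counterpart of visualizing classifier errors:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LaborTree {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/labor.arff");
        data.setClassIndex(data.numAttributes() - 1); // class: good / bad

        // Cross-validate a J48 decision tree on the labor data.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());
    }
}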


Practical No. 10
AIM: Perform classification on the supermarket dataset using the Naive Bayes algorithm.

THEORY:
It is a classification technique based on Bayes Theorem with an assumption of
independence among predictors. In simple terms, a Naive Bayes classifier assumes that the
presence of a particular feature in a class is unrelated to the presence of any other feature. For
example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in
diameter. Even if these features depend on each other or upon the existence of the other
features, all of these properties independently contribute to the probability that this fruit is an
apple and that is why it is known as ‘Naive’.

Naive Bayes model is easy to build and particularly useful for very large data sets. Along with
simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.

Step 1: Open the supermarket dataset in WEKA.

Step 2: Apply the Naive Bayes classifier to the dataset.

Step 3: Right-click the result entry to visualize classifier errors.
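
The same classification can be scripted against the Weka Java API. A minimal sketch, assuming weka.jar is on the classpath and the supermarket.arff sample from Weka's data directory:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SupermarketNaiveBayes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/supermarket.arff");
        data.setClassIndex(data.numAttributes() - 1); // class attribute: total

        // 10-fold cross-validation of a Naive Bayes classifier.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}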

Practical No. 11
AIM: Demonstrate standardization on the weather dataset.
THEORY:
In standardization, all numeric attributes in the given dataset are transformed to have zero mean and unit
variance (apart from the class attribute, if set).

Procedure:

Step 1: Load the weather dataset in the Weka Explorer.

Step 2: Select Standardize from the filters panel.

Step 3: Apply the filter (Standardize takes no scale parameter; it always produces zero mean and unit variance).
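
A scripted version of the filter, assuming weka.jar is on the classpath and the numeric weather sample (weather.numeric.arff) from Weka's data directory:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Standardize;

public class StandardizeWeather {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/weather.numeric.arff");

        // Every numeric attribute (temperature, humidity) is transformed
        // to zero mean and unit variance; nominal attributes are untouched.
        Standardize filter = new Standardize();
        filter.setInputFormat(data);
        Instances standardized = Filter.useFilter(data, filter);

        System.out.println(standardized.toSummaryString());
    }
}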
Step 4: Output
Practical No. 12
AIM: Perform classification on the weather dataset using the ZeroR rule.

THEORY:
ZeroR rule:

ZeroR is a class for building and using a 0-R classifier: it predicts the mean (for a numeric class) or the
mode (for a nominal class).

ZeroR is the simplest classification method; it relies solely on the frequency of the class values.

Procedure:

Step 1: Load the weather dataset in the Weka Explorer.

Step 2: Classify the weather dataset using the ZeroR rule.
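
The same baseline can be computed from the Weka Java API. A minimal sketch, assuming weka.jar is on the classpath and Weka's bundled weather.nominal.arff:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.ZeroR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ZeroRWeather {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1); // class attribute: play

        // ZeroR simply predicts the most frequent class value.
        ZeroR rule = new ZeroR();
        rule.buildClassifier(data);
        System.out.println(rule);

        // Cross-validated accuracy of the baseline.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new ZeroR(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}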


Output:
Practical No. 13
AIM: Perform classification on the weather dataset using the OneR rule.

THEORY:
OneR: learns a one-level decision tree, i.e. generates a set of rules that test one particular attribute. Basic
version (assuming nominal attributes):
• One branch for each of the attribute’s values
• Each branch assigns most frequent class
• Error rate: proportion of instances that don't belong to the majority class of their corresponding
branch
• Choose attribute with lowest error rate
Procedure:

Step 1: Load the weather dataset in the Weka Explorer.

Step 2: Classify the weather dataset using the OneR rule.
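
A scripted version of the same step, assuming weka.jar is on the classpath and Weka's bundled weather.nominal.arff:

import weka.classifiers.rules.OneR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class OneRWeather {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1); // class attribute: play

        // OneR picks the single attribute whose one-level rules give the
        // lowest training error and prints those rules.
        OneR rule = new OneR();
        rule.buildClassifier(data);
        System.out.println(rule);
    }
}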

Output: Visualize the threshold curve.


Practical No. 14
AIM: Perform k-means clustering on the iris dataset.

THEORY:

K-means is an unsupervised clustering algorithm that partitions the instances into k clusters. Each instance
is assigned to the cluster with the nearest centroid, the centroids are then recomputed as the mean of their
assigned instances, and these two steps repeat until the assignments no longer change. Weka implements the
algorithm in its SimpleKMeans clusterer.

Procedure:

Step 1: Load the iris dataset in the Weka Explorer.

Step 2: Apply k-means clustering to the iris dataset.
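
The clustering can also be run from the Weka Java API. A minimal sketch, assuming weka.jar is on the classpath and Weka's bundled iris.arff:

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class KMeansIris {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/iris.arff");

        // Clustering is unsupervised, so drop the class attribute first.
        Remove remove = new Remove();
        remove.setAttributeIndices("last");
        remove.setInputFormat(data);
        Instances noClass = Filter.useFilter(data, remove);

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(3); // iris has three species
        kmeans.setSeed(10);
        kmeans.buildClusterer(noClass);

        System.out.println(kmeans); // centroids and cluster sizes
    }
}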


Output: Visualize the cluster assignments.
Practical No. 15
AIM: Use multiple ROC curves for model evaluation.

THEORY:

A ROC (Receiver Operating Characteristic) curve plots the true positive rate against the false positive rate
across all classification thresholds. Drawing the ROC curves of several models on one chart lets their
performance be compared visually: the larger the area under the curve (AUC), the better the model separates
the classes.

Procedure:

Step 1: Click on Knowledge Flow.

Step 2: Select an ArffLoader.

Step 3: Select a ClassAssigner.

Step 4: Apply Naive Bayes.

Step 5: Add a Model Performance Chart.

Step 6: Connect the components and run the flow.

Step 7: Output.
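
The Knowledge Flow chart draws the ROC curves themselves; the area under such a curve (AUC) can also be computed programmatically. A minimal sketch, assuming weka.jar is on the classpath, a recent Weka version (which retains cross-validation predictions by default), and the two-class labor.arff sample:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.evaluation.ThresholdCurve;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RocAuc {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/labor.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));

        // Build the ROC curve for the first class value and report its AUC;
        // repeating this for several classifiers gives the data behind a
        // multiple-ROC comparison chart.
        ThresholdCurve tc = new ThresholdCurve();
        Instances curve = tc.getCurve(eval.predictions(), 0);
        System.out.println("AUC: " + ThresholdCurve.getROCArea(curve));
    }
}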
