
ECLT5810/SEEM5750

E-Commerce Data Mining Technique


Tutorial 2: Decision tree; Regression; Assignment 1
Wenxuan ZHANG
wxzhang@se.cuhk.edu.hk
What is Decision Tree?
A decision tree is a decision-making tool that uses a tree-like graph or model of decisions and their possible consequences, such as event outcomes, resource costs, and utility.

All the conditional control statements used in the decision tree can be displayed, making it easy to understand the logic behind it.
What is Decision Tree?
A decision tree is a flowchart-like structure that contains three components:

◦ each internal node represents a “test” on an attribute (e.g., whether a coin flip
comes up heads or tails)

◦ each branch represents the outcome of the test

◦ each leaf node represents a class label (decision taken after computing all
attributes).

The paths from the root to leaf represent classification rules.


What is Decision Tree?
This is an example of a decision tree for the target variable response. This variable has two labels: 1 for response and 0 for no response.

Each internal node determines which attribute should be used to split the dataset, based on information gain. In this example, Node 1 uses Income as the splitting attribute: instances with income < $25k go to Node 2 and those with income >= $25k go to Node 3.

There are 4 leaf nodes (Nodes 4-7) that determine the predicted label.
Decision Tree in Weka
Weka provides several classification algorithms, including decision trees, so users can easily construct a predictive model from their training data.

Among its many tree-based algorithms, this tutorial uses J48, which is an implementation of the C4.5 algorithm.
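
For readers who prefer scripting the same steps, here is a minimal sketch using the Weka Java API; it assumes bank.csv sits in the working directory and that its last attribute is the class label y, mirroring the GUI walkthrough below.

```java
import java.io.File;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.CSVLoader;

public class BuildTree {
    public static void main(String[] args) throws Exception {
        // Load the CSV training data (assumes bank.csv is in the working directory)
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("bank.csv"));
        Instances data = loader.getDataSet();

        // Assume the class attribute y is the last column, as in the GUI
        data.setClassIndex(data.numAttributes() - 1);

        // J48 is Weka's implementation of C4.5
        J48 tree = new J48();
        tree.setMinNumObj(30);     // same effect as setting minNumObj to 30 in the GUI
        tree.buildClassifier(data);

        System.out.println(tree);  // prints the learned tree in text form
    }
}
```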
Preparation for building Decision Tree
Before constructing our decision tree, we first need to prepare our training data.

Open Weka, choose Explorer in the Weka GUI Chooser


Preparation for building Decision Tree
Click Open file, then open the bank.csv file used in the last tutorial.

Again, please remember to change the file type to CSV data files (*.csv).
Preparation for building Decision Tree
Now, the data is loaded into the Explorer.

We could perform feature engineering before building the decision tree, but this time we simply use the original dataset as it is.
Building Decision Tree
Click Classify
Building Decision Tree
Click Choose
Building Decision Tree
Under
classifiers->trees
select J48
Building Decision Tree
Click on the text near Choose to access the configuration
Building Decision Tree
Here is the configuration of J48.

Change minNumObj from 2 to 30 so that each leaf must cover at least 30 instances; this reduces the size of the tree.

Then, click OK.
Building Decision Tree
In the Test options panel:

◦ Use training set means the whole training set is also used for testing.

◦ Supplied test set means we provide an external test set for testing.

◦ Percentage split means part of the training set is split off and used as the test set.
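
As a rough sketch of what the percentage split option does, the same evaluation can be written with the Weka Java API; the 66% ratio and the random seed below are illustrative assumptions.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class SplitEval {
    // 'data' is the full training set, loaded as in the earlier sketch
    static void percentageSplit(Instances data) throws Exception {
        data.randomize(new Random(1));                       // shuffle before splitting
        int trainSize = (int) Math.round(data.numInstances() * 0.66);
        Instances train = new Instances(data, 0, trainSize); // first 66%
        Instances test  = new Instances(data, trainSize,
                                        data.numInstances() - trainSize);

        J48 tree = new J48();
        tree.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);                      // test on the held-out 34%
        System.out.println(eval.toSummaryString());
    }
}
```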
Cross-Validation
Cross-validation splits the training set into n folds. In each round, the model is trained on n-1 folds and tested on the remaining fold. This step is repeated n times so that every fold serves as the test set exactly once.

Here is a simple illustration. Say we have 5 folds.

Step 1: folds 1-4 as training, fold 5 as testing.
Step 2: folds 1, 2, 3 and 5 as training, fold 4 as testing.
...and so on, until finally...
Step 5: folds 2-5 as training, fold 1 as testing.

Cross-Validation
Cross-validation is widely used for testing predictive model performance, as it gives a better understanding of how the model generalizes and helps us investigate overfitting.
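
A minimal sketch of running cross-validation through the Weka Java API; the choice of 5 folds and a fixed random seed is illustrative.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class CrossVal {
    static void fiveFold(Instances data) throws Exception {
        Evaluation eval = new Evaluation(data);
        // Weka handles the fold splitting, training, and testing internally
        eval.crossValidateModel(new J48(), data, 5, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```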
Building Decision Tree
This time, we simply use a percentage split of 66% as our testing option.
Building Decision Tree
Click Start to start our decision tree
construction
Visualizing the Decision Tree
We can visualize the trained
decision tree.

In the result list, right-click the newly trained model.

Click Visualize tree.

Visualizing the Decision Tree
The trained decision tree will be
shown in a new window
Viewing the Classifier output
The result is shown on the right panel.

The accuracy of our model is 89.525%.
Viewing the Classifier output
Model accuracy alone sometimes cannot reflect all aspects of model performance. As a result, Weka provides several statistics for us to better investigate our model's performance, including the Confusion Matrix and the TP rate, FP rate, Precision, Recall and F-measure for each class.
Viewing the Classifier output
Here is the Confusion Matrix. It can be viewed as follows:

                     Predicted a             Predicted b
  Actual a     True Positive (TP)     False Negative (FN)
  Actual b     False Positive (FP)    True Negative (TN)
Viewing the Classifier output
True Positive is an outcome where the model correctly predicts the positive class.

True Negative is an outcome where the model correctly predicts the negative class.

False Positive is an outcome where the model incorrectly predicts the positive class.

False Negative is an outcome where the model incorrectly predicts the negative class.
Viewing the Classifier output
Here is the Detailed Accuracy By Class:

◦ TP rate is calculated by TP / (TP + FN)
◦ FP rate is calculated by FP / (FP + TN)
◦ Precision is calculated by TP / (TP + FP)
◦ Recall is calculated by TP / (TP + FN)
◦ F-measure is calculated by 2 * Precision * Recall / (Precision + Recall)
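
These statistics can also be read off a Weka Evaluation object in code; a minimal sketch, assuming eval comes from one of the earlier evaluation sketches and that class index 0 is the positive class a.

```java
import weka.classifiers.Evaluation;

public class Metrics {
    // Assumes class index 0 is the positive class ("a" in the matrix above)
    static void printStats(Evaluation eval) throws Exception {
        System.out.println("TP rate:   " + eval.truePositiveRate(0));
        System.out.println("FP rate:   " + eval.falsePositiveRate(0));
        System.out.println("Precision: " + eval.precision(0));
        System.out.println("Recall:    " + eval.recall(0));
        System.out.println("F-measure: " + eval.fMeasure(0));
        System.out.println(eval.toMatrixString());  // the confusion matrix
    }
}
```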
Logistic Regression
Logistic regression is similar to linear regression in that both aim to find a straight line. However, linear regression uses that straight line to fit the data, while logistic regression uses the line to separate the data.

[Figure: linear regression fitting the data vs. logistic regression separating the data]

Logistic Regression

Just like a Linear Regression model, a Logistic Regression model computes a weighted sum of the input features (plus a bias term), but instead of outputting the result directly like the Linear Regression model does, it outputs the logistic of this result.
What is Logistic Regression?
Ideally, we could use a unit-step function to determine the class label after obtaining the straight line f(x).

However, if the value of f(x) is very close to 0, we might misclassify the data using the unit-step function. To have more flexibility, logistic regression uses a function called the Sigmoid function instead of the unit-step function.
What is Logistic Regression?
The Sigmoid function is an "S"-shaped curve with a maximum value of 1 and a minimum value of 0. It is defined by

σ(z) = 1 / (1 + e^(-z))
What is Logistic Regression?
The formula of logistic regression is therefore

y = σ(w·x + b) = 1 / (1 + e^(-(w·x + b)))

where the weight vector w (together with the bias term b) is the target parameter to be learned by the regression.
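
A tiny sketch of this computation in plain Java; the weights, bias, and input values are made-up numbers purely for illustration.

```java
public class LogisticExample {
    // The logistic (sigmoid) function: squeezes any real number into (0, 1)
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    public static void main(String[] args) {
        // Made-up weights and bias for a two-feature example
        double[] w = {0.8, -0.4};
        double b = 0.1;
        double[] x = {1.5, 2.0};   // one input instance

        // Weighted sum of inputs plus bias, as in linear regression
        double z = b;
        for (int i = 0; i < w.length; i++) {
            z += w[i] * x[i];
        }

        double p = sigmoid(z);     // probability of the positive class
        String label = (p >= 0.5) ? "yes" : "no";
        System.out.println("p = " + p + ", predicted label = " + label);
    }
}
```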


Logistic Regression in Weka
We can easily perform a logistic regression in Weka.

Weka will do the calculation for us. We only need to prepare our dataset.
Logistic Regression in Weka
In the Classify tab, click Choose
Logistic Regression in Weka
Under classifiers->functions
select Logistic
Logistic Regression in Weka
Use a percentage split of 66% as our testing option.
Logistic Regression in Weka
Click Start to start our logistic
regression
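
The same run can be scripted with Weka's Logistic classifier; a minimal sketch, reusing the 66% split idea from earlier and keeping Logistic's default parameters.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.Logistic;
import weka.core.Instances;

public class LogisticRun {
    static void run(Instances data) throws Exception {
        data.randomize(new Random(1));
        int trainSize = (int) Math.round(data.numInstances() * 0.66);
        Instances train = new Instances(data, 0, trainSize);
        Instances test  = new Instances(data, trainSize,
                                        data.numInstances() - trainSize);

        Logistic logistic = new Logistic();   // Weka's logistic regression
        logistic.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(logistic, test);
        System.out.println("Accuracy: " + eval.pctCorrect() + "%");
    }
}
```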
Logistic Regression in Weka
The result is shown on the right panel.

The accuracy of our model is 89.7202%. It is slightly better than our previous decision tree model.
Save Model and Make Predictions on New Data
After we have found a well-performing machine learning model, we can finalize
our model and save it.

If we have some new data later, we can load our previously trained model and make predictions on the new data.
Save Machine Learning Model
Suppose we want to save the logistic regression model trained in the last section.

In the result list, right-click the model.

Click Save model.

Save Machine Learning Model
Select a location and enter a filename such as logistic, then click Save.

Our model is now saved to the file "logistic.model".
Load Our Machine Learning Model
Suppose we want to use our trained model to make predictions.

Right-click on the Result list and click Load model, then select the model saved in the previous slide, "logistic.model".
Load Our Machine Learning Model
Now, the model is loaded, and
we can see some information
on the right panel.
Make Predictions on New Data
Suppose we have some new data and we want to use our trained model to make predictions on it.

We will use the file bank-new.csv as our new data. It contains the first 100 instances of bank.csv, but the class label (i.e., the attribute y) is changed from yes/no to "?".
Make Predictions on New Data
Go to the Classify tab.

Select the Supplied test set option in the Test options pane.
Make Predictions on New Data
Click Set, then click Open file in the options window and select the new dataset "bank-new.csv".

For the Class, select y.

Then, click Close.
Make Predictions on New Data
Click “More options…” to bring up options for evaluating the classifier.
Make Predictions on New Data
Uncheck the following options:

◦ Output model
◦ Output per-class stats
◦ Output confusion matrix
◦ Store predictions for visualization
◦ Collect predictions for evaluation based on
AUROC, etc.

For Output predictions, choose PlainText

Click OK
Make Predictions on New Data
Right-click on the list item for our loaded model in the Result list.

Choose Re-evaluate model on current test set.
Make Predictions on New Data
The predictions for each test instance are then listed in the Classifier Output.

Specifically, the middle column of the results is the predicted label, which is "yes" or "no".
