
ECLT5810/SEEM5750

E-Commerce Data Mining Technique


Tutorial 2: Decision tree; Regression; Assignment 1
Wenxuan ZHANG
wxzhang@se.cuhk.edu.hk
What is Decision Tree?
A decision tree is a decision-making tool that uses a tree-like graph or model of decisions and their possible consequences, such as event outcomes, resource costs, and utility.

All the conditional control statements used in the decision tree can be displayed, making it easy to understand the logic behind it.
What is Decision Tree?
A decision tree is a flowchart-like structure that contains three components:

◦ each internal node represents a “test” on an attribute (e.g., whether a coin flip
comes up heads or tails)

◦ each branch represents the outcome of the test

◦ each leaf node represents a class label (decision taken after computing all
attributes).

The paths from the root to leaf represent classification rules.


What is Decision Tree?
This is an example of a decision tree for the target variable response. This variable has two labels: 1 for response and 0 for no response.

Each internal node determines which attribute should be used to split the dataset, based on information gain. In this example, Node 1 uses Income as the splitting attribute: instances with income < $25k go to Node 2 and those with income >= $25k go to Node 3.

There are 4 leaf nodes (Nodes 4-7) that determine the predicted label.
Decision Tree in Weka
Weka provides several classification algorithms, including decision trees, so users can easily construct a predictive model from their training data.

Among its many tree-based algorithms, this tutorial uses J48, which is an implementation of the C4.5 algorithm.
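
For readers who prefer scripting the same steps, here is a minimal sketch using the Weka Java API; it assumes bank.csv sits in the working directory and that its last attribute is the class label y, mirroring the GUI walkthrough below.

```java
import java.io.File;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.CSVLoader;

public class BuildTree {
    public static void main(String[] args) throws Exception {
        // Load the CSV training data (assumes bank.csv is in the working directory)
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("bank.csv"));
        Instances data = loader.getDataSet();

        // Assume the class attribute y is the last column, as in the GUI
        data.setClassIndex(data.numAttributes() - 1);

        // J48 is Weka's implementation of C4.5
        J48 tree = new J48();
        tree.setMinNumObj(30);     // same effect as setting minNumObj to 30 in the GUI
        tree.buildClassifier(data);

        System.out.println(tree);  // prints the learned tree in text form
    }
}
```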
Preparation for building Decision Tree
Before constructing our decision tree, we first need to prepare our training data.

Open Weka, choose Explorer in the Weka GUI Chooser


Preparation for building Decision Tree
Click Open file, then open the bank.csv file used in the last tutorial.

Again, please remember to change the file type to CSV data files (*.csv).
Preparation for building Decision Tree
Now, the data is loaded into the Explorer.

We could perform feature engineering before building the decision tree, but this time we simply use the original dataset as it is.
Building Decision Tree
Click Classify
Building Decision Tree
Click Choose
Building Decision Tree
Under
classifiers->trees
select J48
Building Decision Tree
Click on the text near Choose to access the configuration
Building Decision Tree
Here is the configuration of J48.

Change minNumObj from 2 to 30 so that each leaf must cover at least 30 instances; this reduces the size of the tree.

Then, click OK.
Building Decision Tree
In the Test options panel:

◦ Use training set means the whole training set is also used for testing.

◦ Supplied test set means we provide an external test set for testing.

◦ Percentage split means part of the training set is split off and used as the test set.
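
As a rough sketch of what the percentage split option does, the same evaluation can be written with the Weka Java API; the 66% ratio and the random seed below are illustrative assumptions.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class SplitEval {
    // 'data' is the full training set, loaded as in the earlier sketch
    static void percentageSplit(Instances data) throws Exception {
        data.randomize(new Random(1));                       // shuffle before splitting
        int trainSize = (int) Math.round(data.numInstances() * 0.66);
        Instances train = new Instances(data, 0, trainSize); // first 66%
        Instances test  = new Instances(data, trainSize,
                                        data.numInstances() - trainSize);

        J48 tree = new J48();
        tree.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);                      // test on the held-out 34%
        System.out.println(eval.toSummaryString());
    }
}
```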
Cross-Validation
Cross-validation splits the training set into n folds. In each round, the model is trained on n-1 folds and tested on the remaining fold. This step is repeated n times so that every fold serves as the test set exactly once.

Here is a simple illustration. Say we have 5 folds.

Step 1: folds 1-4 as training, fold 5 as testing.
Step 2: folds 1, 2, 3 and 5 as training, fold 4 as testing.
...and so on, until finally...
Step 5: folds 2-5 as training, fold 1 as testing.

Cross-Validation
Cross-validation is widely used for testing predictive model performance, as it gives a better understanding of how the model generalizes and helps us investigate overfitting.
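
A minimal sketch of running cross-validation through the Weka Java API; the choice of 5 folds and a fixed random seed is illustrative.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class CrossVal {
    static void fiveFold(Instances data) throws Exception {
        Evaluation eval = new Evaluation(data);
        // Weka handles the fold splitting, training, and testing internally
        eval.crossValidateModel(new J48(), data, 5, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```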
Building Decision Tree
This time, we simply use a percentage split of 66% as our testing option.
Building Decision Tree
Click Start to start our decision tree
construction
Visualizing the Decision Tree
We can visualize the trained
decision tree.

In the result list, right-click the newly trained model.

Click Visualize tree.

Visualizing the Decision Tree
The trained decision tree will be
shown in a new window
Viewing the Classifier output
The result is shown on the right panel.

The accuracy of our model is 89.525%.
Viewing the Classifier output
Model accuracy alone sometimes cannot reflect all aspects of model performance. As a result, Weka provides several statistics for us to better investigate our model's performance, including the Confusion Matrix and the TP rate, FP rate, Precision, Recall and F-measure for each class.
Viewing the Classifier output
Here is the Confusion Matrix. It can be viewed as follows:

                     Predicted a             Predicted b
  Actual a     True Positive (TP)     False Negative (FN)
  Actual b     False Positive (FP)    True Negative (TN)
Viewing the Classifier output
True Positive is an outcome where the model correctly predicts the positive class.

True Negative is an outcome where the model correctly predicts the negative class.

False Positive is an outcome where the model incorrectly predicts the positive class.

False Negative is an outcome where the model incorrectly predicts the negative class.
Viewing the Classifier output
Here is the Detailed Accuracy By Class:

◦ TP rate is calculated by TP / (TP + FN)
◦ FP rate is calculated by FP / (FP + TN)
◦ Precision is calculated by TP / (TP + FP)
◦ Recall is calculated by TP / (TP + FN)
◦ F-measure is calculated by 2 * Precision * Recall / (Precision + Recall)
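
These statistics can also be read off a Weka Evaluation object in code; a minimal sketch, assuming eval comes from one of the earlier evaluation sketches and that class index 0 is the positive class a.

```java
import weka.classifiers.Evaluation;

public class Metrics {
    // Assumes class index 0 is the positive class ("a" in the matrix above)
    static void printStats(Evaluation eval) throws Exception {
        System.out.println("TP rate:   " + eval.truePositiveRate(0));
        System.out.println("FP rate:   " + eval.falsePositiveRate(0));
        System.out.println("Precision: " + eval.precision(0));
        System.out.println("Recall:    " + eval.recall(0));
        System.out.println("F-measure: " + eval.fMeasure(0));
        System.out.println(eval.toMatrixString());  // the confusion matrix
    }
}
```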
Logistic Regression
Logistic regression is similar to linear regression in that both aim to find a straight line. However, linear regression uses that straight line to fit the data, while logistic regression uses the line to separate the data.

[Figure: linear regression fitting the data vs. logistic regression separating the data]

Logistic Regression

Just like a Linear Regression model, a Logistic Regression model computes a weighted sum of the input features (plus a bias term), but instead of outputting the result directly like the Linear Regression model does, it outputs the logistic of this result.
What is Logistic Regression?
Ideally, we could use a unit-step function to determine the class label after obtaining the straight line f(x).

However, if the value of f(x) is very close to 0, we might misclassify the data using the unit-step function. To have more flexibility, logistic regression uses a function called the Sigmoid function instead of the unit-step function.
What is Logistic Regression?
The Sigmoid function is an "S"-shaped curve with a maximum value of 1 and a minimum value of 0. It is defined by

σ(z) = 1 / (1 + e^(-z))
What is Logistic Regression?
The formula of logistic regression is therefore

y = σ(w·x + b) = 1 / (1 + e^(-(w·x + b)))

where the weight vector w (together with the bias term b) is the target parameter to be learned by the regression.
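
A tiny sketch of this computation in plain Java; the weights, bias, and input values are made-up numbers purely for illustration.

```java
public class LogisticExample {
    // The logistic (sigmoid) function: squeezes any real number into (0, 1)
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    public static void main(String[] args) {
        // Made-up weights and bias for a two-feature example
        double[] w = {0.8, -0.4};
        double b = 0.1;
        double[] x = {1.5, 2.0};   // one input instance

        // Weighted sum of inputs plus bias, as in linear regression
        double z = b;
        for (int i = 0; i < w.length; i++) {
            z += w[i] * x[i];
        }

        double p = sigmoid(z);     // probability of the positive class
        String label = (p >= 0.5) ? "yes" : "no";
        System.out.println("p = " + p + ", predicted label = " + label);
    }
}
```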


Logistic Regression in Weka
We can easily perform a logistic regression in Weka.

Weka will do the calculation for us. We only need to prepare our dataset.
Logistic Regression in Weka
In the Classify tab, click Choose
Logistic Regression in Weka
Under classifiers->functions
select Logistic
Logistic Regression in Weka
Use a percentage split of 66% as our testing option.
Logistic Regression in Weka
Click Start to start our logistic
regression
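
The same run can be scripted with Weka's Logistic classifier; a minimal sketch, reusing the 66% split idea from earlier and keeping Logistic's default parameters.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.Logistic;
import weka.core.Instances;

public class LogisticRun {
    static void run(Instances data) throws Exception {
        data.randomize(new Random(1));
        int trainSize = (int) Math.round(data.numInstances() * 0.66);
        Instances train = new Instances(data, 0, trainSize);
        Instances test  = new Instances(data, trainSize,
                                        data.numInstances() - trainSize);

        Logistic logistic = new Logistic();   // Weka's logistic regression
        logistic.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(logistic, test);
        System.out.println("Accuracy: " + eval.pctCorrect() + "%");
    }
}
```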
Logistic Regression in Weka
The result is shown on the right panel.

The accuracy of our model is 89.7202%. It is slightly better than our previous decision tree model.
Save Model and Make Predictions on New Data
After we have found a well-performing machine learning model, we can finalize
our model and save it.

If we have some new data later, we can load our previously trained model and make predictions on the new data.
Save Machine Learning Model
Suppose we want to save the logistic regression model trained in the last section.

In the result list, right-click the model.

Click Save model.

Save Machine Learning Model
Select a location and enter a filename such as logistic, then click Save.

Our model is now saved to the file "logistic.model".
Load Our Machine Learning Model
Suppose we want to use our trained model to make predictions.

Right-click on the Result list and click Load model, then select the model saved in the previous slide, "logistic.model".
Load Our Machine Learning Model
Now, the model is loaded, and
we can see some information
on the right panel.
Make Predictions on New Data
Suppose we have some new data and we want to use our trained model to make predictions on it.

We will use the file bank-new.csv as our new data. It contains the first 100 instances of bank.csv, but the class label (i.e., the attribute y) is changed from yes/no to "?".
Make Predictions on New Data
Go to the Classify tab.

Select the Supplied test set option in the Test options pane.
Make Predictions on New Data
Click Set, then click Open file in the options window and select the new dataset "bank-new.csv".

For the Class, select y.

Then, click Close.
Make Predictions on New Data
Click “More options…” to bring up options for evaluating the classifier.
Make Predictions on New Data
Uncheck the following options:

◦ Output model
◦ Output per-class stats
◦ Output confusion matrix
◦ Store predictions for visualization
◦ Collect predictions for evaluation based on
AUROC, etc.

For Output predictions, choose PlainText

Click OK
Make Predictions on New Data
Right-click on the list item for our loaded model in the Result list.

Choose Re-evaluate model on current test set.
Make Predictions on New Data
The predictions for each test instance are then listed in the Classifier Output.

Specifically, the middle column of the results is the predicted label, which is "yes" or "no".
