
Predictive Data Mining
Chapter 7

© 2021 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Introduction (Slide 1 of 2)

• An observation, or record, is the set of recorded values of variables associated with a single entity.
• Supervised learning: Data mining methods for predicting an outcome based on a set of input variables, or features.
• Supervised learning can be used for:
  • Estimation of a continuous outcome.
  • Classification of a categorical outcome.
Introduction (Slide 2 of 2)

The data mining process comprises the following steps:
1. Data sampling.
2. Data preparation.
3. Data partitioning.
4. Model construction.
5. Model assessment.
Data Sampling, Preparation,
and Partitioning

© 2021 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Data Sampling, Preparation, and Partitioning (Slide 1 of 7)

• When dealing with large volumes of data, best practice is to extract a representative sample for analysis.
• A sample is representative if the analyst can make the same conclusions from it as from the entire population of data.
• The sample of data must be large enough to contain significant information, yet small enough to be manipulated quickly.
• Data mining algorithms typically are more effective given more data.
Data Sampling, Preparation, and Partitioning (Slide 2 of 7)

• When obtaining a representative sample, it is generally best to include as many variables as possible in the sample.
• After exploring the data with descriptive statistics and visualization, the analyst can eliminate variables that are not of interest.
• Data mining applications deal with an abundance of data, which simplifies the process of assessing the accuracy of data-based estimates of variable effects.
Data Sampling, Preparation, and Partitioning (Slide 3 of 7)

• Overfitting occurs when the analyst builds a model that does a great job of explaining the sample of data on which it is based, but fails to accurately predict outside the sample data.
• We can use the abundance of data to guard against the potential for overfitting by splitting the data set into different subsets for:
  • The training (or construction) of candidate models.
  • The validation (or performance comparison) of candidate models.
  • The testing (or assessment) of future performance of a selected model.
Data Sampling, Preparation, and Partitioning (Slide 4 of 7)

Static Holdout Method
• Training set: Consists of the data used to build the candidate models.
• Validation set: The data set to which the promising subset of models is applied to identify which model is the most accurate at predicting observations that were not used to build the model.
• Test set: The data set to which the final model should be applied to estimate this model's effectiveness when applied to data that have not been used to build or select the model.
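A minimal sketch of a static holdout partition, assuming scikit-learn (a tooling choice; the slides are software-agnostic) and an illustrative 60/20/20 split on hypothetical data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))       # hypothetical feature matrix
y = rng.integers(0, 2, size=1000)    # hypothetical binary outcome

# First carve off 60% for training, then split the remainder evenly
# into a 20% validation set and a 20% test set.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Candidate models are built on the training set, compared on the
# validation set, and the chosen model is assessed once on the test set.
```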
Data Sampling, Preparation, and Partitioning (Slide 5 of 7)

k-Fold Cross-Validation
• k-Fold cross-validation: A robust procedure to train and validate models in which the observations to be used to train and validate the model are repeatedly randomly divided into k subsets called folds. In each iteration, one fold is designated as the validation set and the remaining k − 1 folds are designated as the training set. The results of the iterations are then combined and evaluated.
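A sketch of the procedure with scikit-learn's KFold on hypothetical data; setting n_splits equal to the number of observations would give the leave-one-out special case described on the next slide:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # hypothetical features
y = rng.integers(0, 2, size=100)     # hypothetical binary outcome

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in kf.split(X):
    # One fold is held out for validation; the other k - 1 folds train.
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))

combined_accuracy = np.mean(scores)  # combine the k iteration results
```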
Data Sampling, Preparation, and Partitioning (Slide 6 of 7)

k-Fold Cross-Validation (cont.)
• A special case of k-fold cross-validation is leave-one-out cross-validation.
• In this case, the number of folds equals the number of observations in the combined training and validation data.
Data Sampling, Preparation, and Partitioning (Slide 7 of 7)

Class-Imbalanced Data
• There are two basic sampling approaches for modifying the class distribution of the training set:
  • Undersampling: Balances the number of Class 1 and Class 0 observations in a training set by removing majority-class observations from the training set.
  • Oversampling: Balances the number of Class 1 and Class 0 observations in a training set by inserting copies of minority-class observations into the training set.
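Both approaches can be sketched directly with NumPy index manipulation (dedicated libraries such as imbalanced-learn exist, but the plain version shows the idea; the labels below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([1] * 10 + [0] * 90)          # hypothetical imbalanced labels
class1 = np.flatnonzero(y == 1)            # minority-class indices
class0 = np.flatnonzero(y == 0)            # majority-class indices

# Undersampling: keep a random subset of majority-class rows equal in
# size to the minority class.
keep0 = rng.choice(class0, size=class1.size, replace=False)
undersampled_idx = np.concatenate([class1, keep0])

# Oversampling: duplicate randomly chosen minority-class rows until the
# two classes are the same size.
extra1 = rng.choice(class1, size=class0.size, replace=True)
oversampled_idx = np.concatenate([extra1, class0])
```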
Performance Measures
Evaluating the Classification of Categorical Outcomes
Evaluating the Estimation of Continuous Outcomes
Performance Measures (Slide 1 of 19)

Evaluating the Classification of Categorical Outcomes:
• By counting the classification errors on a sufficiently large validation set and/or test set that is representative of the population, we will generate an accurate measure of the model's classification performance.
• Classification confusion matrix: Displays a model's correct and incorrect classifications.
Performance Measures (Slide 2 of 19)

Table 9.1: Confusion Matrix

• Many measures of classification performance are based on the confusion matrix.
• Overall error rate: Percentage of misclassified observations:

$$\text{Overall error rate} = \frac{n_{10} + n_{01}}{n_{11} + n_{10} + n_{01} + n_{00}}$$
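A sketch of the count notation in code, assuming scikit-learn and hypothetical labels; passing labels=[1, 0] makes the matrix entries line up with the $n_{11}, n_{10}, n_{01}, n_{00}$ notation above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

actual    = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 1])   # hypothetical
predicted = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 1])   # hypothetical

# Rows are actual classes, columns are predicted classes, ordered [1, 0].
(n11, n10), (n01, n00) = confusion_matrix(actual, predicted, labels=[1, 0])

overall_error_rate = (n10 + n01) / (n11 + n10 + n01 + n00)
accuracy = 1 - overall_error_rate
```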
Performance Measures (Slide 3 of 19)

Evaluating the Classification of Categorical Outcomes (cont.):
• One minus the overall error rate is often referred to as the accuracy of the model.
• While the overall error rate conveys an aggregate measure of misclassification, it counts misclassifying an actual Class 0 observation as a Class 1 observation (a false positive) the same as misclassifying an actual Class 1 observation as a Class 0 observation (a false negative).
Performance Measures (Slide 4 of 19)

Evaluating the Classification of Categorical Outcomes (cont.):
• To account for asymmetric costs of misclassification, we define the error rate with respect to the individual classes:

$$\text{Class 1 error rate} = \frac{n_{10}}{n_{11} + n_{10}} \qquad \text{Class 0 error rate} = \frac{n_{01}}{n_{01} + n_{00}}$$

• Cutoff value: The probability threshold above which an observation is classified as Class 1; varying the cutoff reveals the tradeoff between the Class 1 error rate and the Class 0 error rate.
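The effect of the cutoff can be sketched with NumPy on hypothetical predicted probabilities; lowering the cutoff classifies more observations as Class 1, trading a lower Class 1 error rate for a higher Class 0 error rate:

```python
import numpy as np

actual = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 1])               # hypothetical
prob_class1 = np.array([0.95, 0.80, 0.75, 0.60, 0.55,
                        0.45, 0.40, 0.30, 0.20, 0.10])          # hypothetical

for cutoff in (0.25, 0.50, 0.75):
    pred = (prob_class1 > cutoff).astype(int)
    class1_error = np.mean(pred[actual == 1] == 0)   # n10 / (n11 + n10)
    class0_error = np.mean(pred[actual == 0] == 1)   # n01 / (n01 + n00)
    print(f"cutoff={cutoff}: Class 1 error={class1_error:.2f}, "
          f"Class 0 error={class0_error:.2f}")
```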
Performance Measures (Slide 5 of 19)

Table 9.2: Classification Probabilities

Actual Class   Probability of Class 1     Actual Class   Probability of Class 1
1              1.00                       0              0.66
1              1.00                       0              0.65
0              1.00                       1              0.64
1              1.00                       0              0.62
0              1.00                       0              0.60
0              0.90                       0              0.51
1              0.90                       0              0.49
0              0.88                       0              0.49
0              0.88                       1              0.46
1              0.88                       0              0.46
Performance Measures (Slide 6 of 19)

Table 9.2: Classification Probabilities (cont.)

Actual Class   Probability of Class 1     Actual Class   Probability of Class 1
0              0.87                       1              0.45
0              0.87                       1              0.45
0              0.87                       0              0.45
0              0.86                       0              0.44
1              0.86                       0              0.44
0              0.86                       0              0.30
0              0.86                       0              0.28
0              0.85                       0              0.26
0              0.84                       1              0.24
0              0.84                       0              0.22
Performance Measures (Slide 7 of 19)

Table 9.2: Classification Probabilities (cont.)

Actual Class   Probability of Class 1     Actual Class   Probability of Class 1
0              0.83                       0              0.21
0              0.68                       0              0.04
0              0.67                       0              0.04
0              0.67                       0              0.01
0              0.67                       0              0.00
Performance Measures (Slide 8 of 19)

Table 9.3: Confusion Matrices for Various Cutoff Values
Performance Measures (Slide 9 of 19)

Table 9.3: Classification Confusion Matrices and Error Rates for Various Cutoff Values (cont.)
Performance Measures (Slide 10 of 19)

Table 9.3: Classification Confusion Matrices and Error Rates for Various Cutoff Values (cont.)
Performance Measures (Slide 11 of 19)

Figure 9.1: Classification Error Rates vs. Cutoff Value
Performance Measures (Slide 12 of 19)

Evaluating the Classification of Categorical Outcomes (cont.):
• Cumulative lift chart: Compares the number of actual Class 1 observations identified when observations are considered in decreasing order of their estimated probability of being in Class 1 to the number of actual Class 1 observations identified by random selection.
• Decile-wise lift chart: Another way to view how much better a classifier is at identifying Class 1 observations than random classification.
  • Observations are ordered in decreasing probability of Class 1 membership and then considered in 10 equal-sized groups.
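The quantity plotted by a cumulative lift chart can be sketched with NumPy (hypothetical data; the plotting itself is omitted):

```python
import numpy as np

actual = np.array([1, 0, 1, 0, 0, 1, 0, 0, 1, 0])               # hypothetical
prob_class1 = np.array([0.90, 0.85, 0.80, 0.70, 0.60,
                        0.55, 0.40, 0.35, 0.30, 0.10])          # hypothetical

order = np.argsort(prob_class1)[::-1]        # decreasing estimated probability
cumulative_found = np.cumsum(actual[order])  # actual Class 1 cases found so far
random_baseline = np.arange(1, actual.size + 1) * actual.mean()
lift = cumulative_found / random_baseline    # > 1 means better than random
```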
Performance Measures (Slide 13 of 19)

Figure 9.2: Cumulative and Decile-Wise Lift Charts
Performance Measures (Slide 14 of 19)

Evaluating the Classification of Categorical Outcomes (cont.):
• The ability to correctly predict Class 1 (positive) observations is commonly expressed as sensitivity, or recall, and is calculated as:

$$\text{Sensitivity} = 1 - \text{Class 1 error rate} = \frac{n_{11}}{n_{11} + n_{10}}$$

• The ability to correctly predict Class 0 (negative) observations is commonly expressed as specificity and is calculated as:

$$\text{Specificity} = 1 - \text{Class 0 error rate} = \frac{n_{00}}{n_{01} + n_{00}}$$
Performance Measures (Slide 15 of 19)

Evaluating the Classification of Categorical Outcomes (cont.):
• Precision is a measure that corresponds to the proportion of observations predicted to be Class 1 by a classifier that are actually in Class 1:

$$\text{Precision} = \frac{n_{11}}{n_{11} + n_{01}}$$

• The F1 score combines precision and sensitivity into a single measure and is defined as:

$$F_1\ \text{score} = \frac{2 n_{11}}{2 n_{11} + n_{01} + n_{10}}$$
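All four measures follow directly from the confusion-matrix counts; a sketch with hypothetical counts (scikit-learn's precision_score, recall_score, and f1_score compute the same quantities from labels):

```python
# Hypothetical confusion-matrix counts.
n11, n10, n01, n00 = 40, 10, 20, 130

sensitivity = n11 / (n11 + n10)          # recall: 1 - Class 1 error rate
specificity = n00 / (n01 + n00)          # 1 - Class 0 error rate
precision   = n11 / (n11 + n01)          # predicted Class 1 that are Class 1
f1 = 2 * n11 / (2 * n11 + n01 + n10)     # combines precision and sensitivity

# Equivalently, F1 is the harmonic mean of precision and sensitivity.
assert abs(f1 - 2 * precision * sensitivity / (precision + sensitivity)) < 1e-12
```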
Performance Measures (Slide 16 of 19)

Evaluating the Classification of Categorical Outcomes (cont.):
• The receiver operating characteristic (ROC) curve is an alternative graphical approach for displaying the tradeoff between a classifier's ability to correctly identify Class 1 observations and its Class 0 error rate.
• In general, we can evaluate the quality of a classifier by computing the area under the ROC curve, often referred to as the AUC.
• The greater the area under the ROC curve, i.e., the larger the AUC, the better the classifier performs.
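A sketch of computing the ROC curve and AUC with scikit-learn on hypothetical data; the false positive rate returned by roc_curve is exactly the Class 0 error rate at each cutoff:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

actual = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0])               # hypothetical
prob_class1 = np.array([0.90, 0.80, 0.70, 0.65, 0.50,
                        0.45, 0.40, 0.30, 0.20, 0.10])          # hypothetical

# tpr is sensitivity and fpr is the Class 0 error rate at each cutoff.
fpr, tpr, cutoffs = roc_curve(actual, prob_class1)
auc = roc_auc_score(actual, prob_class1)    # area under the ROC curve
```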
Performance Measures (Slide 17 of 19)

Figure 9.3: Receiver Operating Characteristic (ROC) Curve
Performance Measures (Slide 18 of 19)

Evaluating the Estimation of Continuous Outcomes:
• The measures of accuracy are some function of the error in estimating an outcome for an observation i, where $e_i$ is the error in estimating the outcome for observation i.
• Two common measures are:

$$\text{Average error} = \frac{\sum_{i=1}^{n} e_i}{n} \qquad \text{Root mean squared error (RMSE)} = \sqrt{\frac{\sum_{i=1}^{n} e_i^2}{n}}$$

• The average error estimates the bias in a model's predictions:
  • If the average error is negative, then the model tends to overestimate the value of the outcome variable.
  • If the average error is positive, the model tends to underestimate.
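Both measures are one line each in NumPy; the sign convention $e_i = \text{actual} - \text{estimated}$ (an assumption consistent with the bias interpretation above) makes a negative average error indicate overestimation:

```python
import numpy as np

actual    = np.array([120.0, 80.0, 200.0, 150.0])   # hypothetical outcomes
estimated = np.array([110.0, 95.0, 180.0, 160.0])   # hypothetical estimates

errors = actual - estimated              # e_i for each observation i
average_error = errors.mean()            # negative => model overestimates
rmse = np.sqrt(np.mean(errors ** 2))     # root mean squared error
```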
Performance Measures (Slide 19 of 19)

Table 9.4: Computed Error in Estimates of Average Balance for 10 Customers
Logistic Regression
Logistic Regression (Slide 1 of 8)

• Logistic regression attempts to classify a binary categorical outcome (y = 0 or 1) as a linear function of explanatory variables.
• A linear regression model fails to appropriately explain a categorical outcome variable.
Logistic Regression (Slide 2 of 8)

Figure 9.4: Scatter Chart and Simple Linear Regression Fit for Oscars Example
Logistic Regression (Slide 3 of 8)

Figure 9.5: Residuals for Simple Linear Regression on Oscars Data
• An unmistakable pattern of systematic misprediction suggests that the simple linear regression model is not appropriate.
Logistic Regression (Slide 4 of 8)

• Odds is a measure related to probability. If an estimate of the probability of an event is $\hat{p}$, then the equivalent odds measure is $\hat{p}/(1-\hat{p})$.
• The odds metric ranges between zero and positive infinity.
• We eliminate the fit problem by using the logit, $\ln\left(\hat{p}/(1-\hat{p})\right)$.
• Estimating the logit with a linear function results in the estimated logistic regression model.
Logistic Regression (Slide 5 of 8)

• Logistic regression model:

$$\ln\left(\frac{\hat{p}}{1-\hat{p}}\right) = b_0 + b_1 x_1 + \cdots + b_q x_q$$

• Given a set of explanatory variables, a logistic regression algorithm determines the values of $b_0, b_1, \ldots, b_q$ that best estimate the log odds.
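A sketch of fitting and applying such a model with scikit-learn; the nomination counts and win labels below are hypothetical stand-ins for the Oscars example, not the textbook's data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: x = total Oscar nominations, y = 1 if the film won.
x = np.array([[14], [11], [10], [6], [5], [12], [4], [9]])
y = np.array([1, 0, 0, 1, 0, 1, 0, 0])

model = LogisticRegression().fit(x, y)
b0 = model.intercept_[0]                       # estimated intercept of the logit
b1 = model.coef_[0][0]                         # estimated coefficient of the logit
prob_win = model.predict_proba([[11]])[0, 1]   # estimated P(Class 1) at x = 11
```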
Logistic Regression (Slide 6 of 8)

Figure 9.6: Logistic S-Curve for Oscars Example
Logistic Regression (Slide 7 of 8)

• Logistic regression classifies an observation by using the logistic function to compute the probability of the observation belonging to Class 1 and then comparing this probability to a cutoff value.
  • If the probability exceeds the cutoff value, the observation is classified as Class 1; otherwise, it is classified as Class 0.
• While a logistic regression model used for prediction should ultimately be judged based on its classification accuracy on validation and test sets, Mallows' $C_p$ statistic is a measure commonly computed by statistical software that can be used to identify models with promising sets of variables.
Logistic Regression (Slide 8 of 8)

Table 9.5: Predicted Probabilities by Logistic Regression for Oscars Example

Total Number of      Predicted Probability
Oscar Nominations    of Winning               Predicted Class   Actual Class
14                   0.89                     Winner            Winner
11                   0.58                     Winner            Loser
10                   0.44                     Loser             Loser
6                    0.07                     Loser             Winner
k-Nearest Neighbors
Classifying Categorical Outcomes with k-Nearest Neighbors
Estimating Continuous Outcomes with k-Nearest Neighbors
k-Nearest Neighbors (Slide 1 of 7)

• k-Nearest neighbors (k-NN): This method can be used either to classify a categorical outcome or to estimate a continuous outcome.
• k-NN uses the k most similar observations from the training set, where similarity is typically measured with Euclidean distance.
k-Nearest Neighbors (Slide 2 of 7)

Classifying Categorical Outcomes with k-Nearest Neighbors:
• A nearest-neighbor classifier is a "lazy learner" that directly uses the entire training set to classify observations in the validation and test sets.
• The value of k can plausibly range from 1 to n, the number of observations in the training set.
  • If k = 1, then the classification of a new observation is set equal to the class of the single most similar observation from the training set.
  • If k = n, then the new observation's class is naïvely assigned to the most common class in the training set.
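A sketch with scikit-learn on data patterned after the loan-default example in Table 9.6 on the next slide. The features are standardized first so that Euclidean distance is not dominated by the much larger balance scale (the mean and standard deviation reported in the table suggest this step, though scikit-learn's scaling details may differ slightly from the text's):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Training data patterned on Table 9.6: (average balance, age) -> default.
X = np.array([[49, 38], [671, 26], [772, 47], [136, 48], [123, 40],
              [36, 29], [192, 31], [6574, 35], [2200, 58], [2100, 30]])
y = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

# Standardize so Euclidean distance weights both variables comparably.
scaler = StandardScaler().fit(X)
knn = KNeighborsClassifier(n_neighbors=3).fit(scaler.transform(X), y)

# Classify the new observation with balance 900 and age 28 (cf. Table 9.7).
new_obs = scaler.transform(np.array([[900, 28]]))
predicted_class = knn.predict(new_obs)[0]
```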
k-Nearest Neighbors (Slide 3 of 7)

Table 9.6: Training Set Observations for k-NN Classifier

Observation            Average Balance   Age    Loan Default
1                      49                38     1
2                      671               26     1
3                      772               47     1
4                      136               48     1
5                      123               40     1
6                      36                29     0
7                      192               31     0
8                      6,574             35     0
9                      2,200             58     0
10                     2,100             30     0
Average:               1,285             38.2
Standard Deviation:    2,029             10.2
k-Nearest Neighbors (Slide 4 of 7)

Figure 9.7: Scatter Chart for k-NN Classification
k-Nearest Neighbors (Slide 5 of 7)

Table 9.7: Classification of Observation with Average Balance = 900 and Age = 28 for Different Values of k

k     % of Class 1 Neighbors   Classification
1     1.00                     1
2     0.50                     1
3     0.33                     0
4     0.25                     0
5     0.40                     0
6     0.50                     1
7     0.57                     1
8     0.63                     1
9     0.56                     1
10    0.50                     1
k-Nearest Neighbors (Slide 6 of 7)

Estimating Continuous Outcomes with k-Nearest Neighbors:
• When k-NN is used to estimate a continuous outcome, a new observation's outcome value is predicted to be the average of the outcome values of its k nearest neighbors in the training set.
• The value of k can plausibly range from 1 to n, the number of observations in the training set.

Figure 9.8: Scatter Chart for k-NN Estimation
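A sketch with scikit-learn's KNeighborsRegressor, estimating average balance from age using the Table 9.6 data; with k = 10 the estimate is simply the overall mean of $1,285, agreeing with the last row of Table 9.8 (tie-breaking among equidistant neighbors could differ from the text for some intermediate k):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Table 9.6 data: estimate average balance from age for a customer aged 28.
age = np.array([[38], [26], [47], [48], [40], [29], [31], [35], [58], [30]])
balance = np.array([49, 671, 772, 136, 123, 36, 192, 6574, 2200, 2100])

for k in (1, 5, 10):
    knn = KNeighborsRegressor(n_neighbors=k).fit(age, balance)
    # The estimate is the mean balance of the k nearest ages to 28.
    estimate = knn.predict(np.array([[28]]))[0]
    print(f"k={k}: estimated average balance = ${estimate:,.0f}")
```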
k-Nearest Neighbors (Slide 7 of 7)

Table 9.8: Estimated Average Balance for Observation with Age = 28 for Different Values of k

k     Average Balance Estimate
1     $36
2     $936
3     $936
4     $750
5     $1,915
6     $1,604
7     $1,392
8     $1,315
9     $1,184
10    $1,285
Classification and Regression Trees
Classifying Categorical Outcomes with a Classification Tree
Estimating Continuous Outcomes with a Regression Tree
Ensemble Methods
Classification and Regression Trees (Slide 1 of 20)

• Classification and regression trees (CART) successively partition a data set of observations into increasingly smaller and more homogeneous subsets.
• At each iteration of the CART method, a subset of observations is split into two new subsets based on the values of a single variable.
• The CART method can be thought of as a series of questions that successively narrow down observations into smaller and smaller groups of decreasing impurity, which is a measure of the heterogeneity in a group of observations' outcome classes or outcome values.
Classification and Regression Trees (Slide 2 of 20)

Classifying Categorical Outcomes with a Classification Tree:
• Classification trees: The impurity of a group of observations is based on the proportion of observations belonging to the same class.
  • There is zero impurity if all observations in a group are in the same class.
• After a final tree is constructed, the classification of a new observation is based on the final partition into which the new observation belongs.
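A sketch of growing and applying a classification tree with scikit-learn; the two features mimic the DemoHHI example on the next slide (percentage of $ and ! characters), but the data here are randomly generated placeholders:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Hypothetical stand-in: two features (% of '$' and '!' characters).
X = rng.uniform(0, 5, size=(46, 2))
y = (X[:, 0] + rng.normal(0, 1, size=46) > 2.5).astype(int)

# Each split uses a single variable to reduce impurity (Gini index here);
# limiting depth is one simple guard against overfitting.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0).fit(X, y)

# A new observation is assigned the class of its final partition (leaf).
predicted_class = tree.predict(np.array([[1.2, 0.4]]))[0]
```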
Classification and Regression Trees (Slide 3 of 20)

Classifying a Categorical Outcome with a Classification Tree (cont.):
• To explain how a classification tree categorizes observations:
  • We use a small sample of data from DemoHHI consisting of 46 observations.
  • Only two variables from HHI are used—the percentage of the $ character and the percentage of the ! character.
Classification and Regression Trees (Slide 4 of 20)

Figure 9.9: Construction Sequence of Branches in a Classification Tree
Classification and Regression Trees (Slide 5 of 20)

Figure 9.10: Geometric Illustration of Classification Tree Partitions
• The figure shows the final partitioning resulting from the sequence of variable splits.
Classification and Regression Trees (Slide 6 of 20)

Figure 9.11: Classification Tree with One Pruned Branch
Classification and Regression Trees (Slide 7 of 20)

Table 9.9: Classification Error Rates on Sequence of Pruned Trees

Number of        % Classification Error   % Classification Error
Decision Nodes   on Training Set          on Validation Set
0                43.5                     39.4
1                8.7                      20.9
2                8.7                      20.9
3                8.7                      20.9
4                6.5                      20.9
5                4.3                      21.3
6                2.2                      21.3
7                0                        21.6
Classification and Regression Trees (Slide 8 of 20)

Figure 9.12: Best-Pruned Classification Tree
Classification and Regression Trees (Slide 9 of 20)

Estimating Continuous Outcomes with a Regression Tree:
• A regression tree successively partitions observations of the training set into smaller and smaller groups in a similar fashion as a classification tree.
• The differences are:
  • A regression tree bases the impurity of a partition on the variance of the outcome value for the observations in the group.
  • After a final tree is constructed, the estimated outcome value of an observation is based on the mean outcome value of the partition into which the new observation belongs.
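The regression-tree variant in scikit-learn, sketched on hypothetical data; splits reduce the within-partition variance of the outcome, and each leaf predicts its partition's mean:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))             # hypothetical features
y = 3.0 * X[:, 0] + rng.normal(0, 1.0, size=200)  # hypothetical continuous outcome

# The "squared_error" criterion makes each split reduce the variance of y
# within the resulting partitions; a leaf predicts its group's mean outcome.
tree = DecisionTreeRegressor(criterion="squared_error", max_depth=3).fit(X, y)
estimate = tree.predict(np.array([[4.0, 7.5]]))[0]
```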
Classification and Regression Trees (Slide 10 of 20)

Figure 9.13: Geometric Illustration of First Six Rules of a Regression Tree
Classification and Regression Trees (Slide 11 of 20)

Ensemble Methods:
• In an ensemble method, predictions are made based on the combination of a collection of models.
• Two necessary conditions for an ensemble to perform better than a single model:
  1. Individual base models are constructed independently of each other.
  2. Individual models perform better than just randomly guessing.
Classification and Regression Trees (Slide 12 of 20)

Ensemble Methods (cont.):
• Two primary steps to an ensemble approach:
  1. The development of a committee of individual base models.
  2. The combination of the individual base models' predictions to form a composite prediction.
• A classification or estimation method is unstable if relatively small changes in the training set cause its predictions to fluctuate.
• Three different ways to construct an ensemble of classification or regression trees:
  • Bagging.
  • Boosting.
  • Random forests.
Classification and Regression Trees (Slide 13 of 20)

Ensemble Methods (cont.):
• In the bagging approach, the committee of individual base models is generated by first constructing multiple training sets by repeated random sampling of the n observations in the original data with replacement.

Table 9.10: Original 10-Observation Training Data

Age            29  31  35  38  47  48  53  54  58  70
Loan default    0   0   0   1   1   1   1   0   0   0
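The bootstrap-sampling step of bagging is easy to sketch with NumPy using the Table 9.10 data; each pass draws n = 10 observations with replacement, so some rows appear multiple times and others not at all:

```python
import numpy as np

rng = np.random.default_rng(0)
age = np.array([29, 31, 35, 38, 47, 48, 53, 54, 58, 70])   # Table 9.10
default = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0, 0])

m, n = 10, age.size
for tree_number in range(m):
    # Sample n row indices with replacement to form one new training set.
    idx = rng.integers(0, n, size=n)
    boot_age, boot_default = age[idx], default[idx]
    # A base classification tree would be fit to (boot_age, boot_default)
    # here; the m trees' predictions are later combined by majority vote.
```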
Classification and Regression Trees (Slide 14 of 20)

Ensemble Methods (cont.):
• The boosting method generates its committee of individual base models by sampling multiple training sets.
  • Boosting iteratively adapts how it samples the original data when constructing a new training set based on the prediction error of the models constructed on the previous training sets.
• Random forests can be viewed as a variation of bagging specifically tailored for use with classification or regression trees.
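Both ensembles are available off the shelf in scikit-learn; the sketch below uses hypothetical data, with RandomForestClassifier for the bagging-plus-random-feature-subset idea and GradientBoostingClassifier as one common boosting implementation (the text does not prescribe a specific library):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))                                # hypothetical
y = (X[:, 0] + X[:, 1] + rng.normal(0, 1, size=300) > 0).astype(int)

# Random forest: bagging plus a random subset of variables at each split;
# the trees are independent, so they can be built simultaneously.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Boosting: trees are built sequentially, each one focusing on the
# observations the previous trees mispredicted.
booster = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X, y)
```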
Classification and Regression Trees (Slide 15 of 20)

Table 9.11: Bagging: Generation of 10 New Training Sets and Corresponding Classification Trees
Classification and Regression Trees (Slide 16 of 20)

Table 9.11: Bagging: Generation of 10 New Training Sets and Corresponding Classification Trees (cont.)
Classification and Regression Trees (Slide 17 of 20)

Table 9.11: Bagging: Generation of 10 New Training Sets and Corresponding Classification Trees (cont.)
Classification and Regression Trees (Slide 18 of 20)

Table 9.12: Classification of 10 Observations from Validation Set with Bagging Ensemble

                                                          Overall Classification
Age             26   29   30   32   34   37   42   47   48   54   Error Rate
Loan default     1    0    0    0    0    1    0    1    1    0
Tree 1           0    0    0    0    0    1    1    1    1    1   30%
Tree 2           0    0    0    0    0    0    0    0    0    0   40%
Tree 3           0    0    0    0    0    1    1    1    1    1   30%
Tree 4           0    0    0    0    0    1    1    1    1    1   30%
Tree 5           0    0    0    0    0    0    1    1    1    1   40%
Tree 6           1    1    1    1    1    1    1    1    1    0   50%
Tree 7           1    1    1    1    1    1    1    1    1    0   50%
Classification and Regression Trees (Slide 19 of 20)

Table 9.12: Classification of 10 Observations from Validation Set with Bagging Ensemble (cont.)

                                                          Overall Classification
Age                26   29   30   32   34   37   42   47   48   54   Error Rate
Loan default        1    0    0    0    0    1    0    1    1    0
Tree 8              1    1    1    1    1    1    1    1    1    0   50%
Tree 9              1    1    1    1    1    1    1    1    1    0   50%
Tree 10             0    0    0    0    0    0    0    0    0    0   40%
Average Vote      0.4  0.4  0.4  0.4  0.4  0.7  0.8  0.8  0.8  0.4
Bagging Ensemble    0    0    0    0    0    1    1    1    1    0   20%
Classification and Regression Trees (Slide 20 of 20)

Ensemble Methods (cont.):
• For most problems, the predictive accuracy of boosting ensembles exceeds the predictive performance of bagging ensembles.
• Boosting achieves its performance advantage because:
  • It evolves its committee of models by focusing on observations that are mispredicted.
  • The member models' votes are weighted by their accuracy.
• Boosting is more computationally expensive than bagging.
• There is no adaptive feedback in a bagging approach, so all m training sets and corresponding models can be constructed simultaneously.
• The random forests approach has performance similar to boosting, but maintains the computational simplicity of bagging.
End of Chapter 7
