
BIG DATA ANALYTICS

Lecture 9 --- Week 10


Content

 Classification versus Regression

 Supervised vs. Unsupervised Learning

 Evaluating Predictive Models

 Supervised Learning Algorithms

 Model Evaluation
Classification vs Regression

 Classification:
 predicts categorical class labels
 constructs a model from the training set and the values (class labels)
of a classifying attribute, and uses the model to classify new data
 Regression:
 models continuous-valued functions, i.e., predicts unknown or missing
numeric values
 Typical Applications
 credit approval
 target marketing
 medical diagnosis
 treatment effectiveness analysis
Classification – A Motivating
Application
 Credit approval
 A bank wants to classify its customers based on whether they
are expected to pay back their approved loans
 The history of past customers is used to train the classifier
 The classifier provides rules, which identify potentially reliable
future customers
 Classification rule:
 If age = “31...40” and income = high then credit_rating = excellent
 Future customers
 Paul: age = 35, income = high → excellent credit rating
 John: age = 20, income = medium → fair credit rating
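
A minimal Python sketch of how such a rule could be applied to new customers; the function name and the "fair" default for non-matching cases are assumptions for illustration, since the slide gives only the one rule.

# Hypothetical sketch: apply the classification rule
# "IF age = 31...40 AND income = high THEN credit_rating = excellent".
def credit_rating(age, income):
    if 31 <= age <= 40 and income == "high":
        return "excellent"
    return "fair"  # assumed default; the slide defines only the one rule

print(credit_rating(35, "high"))    # Paul -> excellent
print(credit_rating(20, "medium"))  # John -> fair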
Classification—A Two-Step Process

 Model construction: describing a set of predetermined classes


 Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
 The set of tuples used for model construction: training set
 The model is represented as classification rules, decision trees, or
mathematical formulae
 Model usage: for classifying future or unknown objects
 Estimate accuracy of the model
 The known label of test samples is compared with the classified result
from the model
 Accuracy rate is the percentage of test set samples that are correctly
classified by the model
 Test set is independent of training set, otherwise over-fitting will occur
Classification Process (1): Model Construction

Training Data → Classification Algorithm → Classifier (Model)

NAME   RANK            YEARS   TENURED
Mike   Assistant Prof  3       no
Mary   Assistant Prof  7       yes
Bill   Professor       2       yes
Jim    Associate Prof  7       yes
Dave   Assistant Prof  6       no
Anne   Associate Prof  3       no

Learned model (classification rule):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Classification Process (2): Use the Model in Prediction

Classifier → Testing Data (to estimate accuracy) → Unseen Data

Testing Data:
NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Mellisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

Accuracy = ?

Unseen data: (Jeff, Professor, 4) → Tenured?
Classification (Training Phase)

 In the first step, a classifier is built describing a predetermined set of data classes or concepts.
 This is the learning step (or training phase), where a classification algorithm builds the classifier by analyzing or “learning from” a training set made up of database tuples and their associated class labels.
 A tuple, X, is represented by an n-dimensional attribute vector, X = (x1, x2, …, xn), depicting n measurements made on the tuple from n database attributes, respectively, A1, A2, …, An.
 Each tuple, X, is assumed to belong to a predefined class as determined by another database attribute called the class label attribute.
 The class label attribute is discrete-valued and unordered.
 It is categorical (or nominal) in that each value serves as a category or class.

 The individual tuples making up the training set are referred to as training
tuples and are randomly sampled from the database under analysis. In the
context of classification, data tuples can be referred to as samples, examples,
instances, data points, or objects.

 This first step of the classification process can also be viewed as the learning of a mapping or function, y = f(X), that can predict the associated class label y of a given tuple X.
 In this view, we wish to learn a mapping or function that separates the data classes.
 This mapping is represented in the form of classification rules, decision trees, or mathematical formulae.
 For example, the mapping may be represented as classification rules that identify loan applications as being either safe or risky.
 The rules can be used to categorize future data tuples, as well as provide deeper insight into the data contents.
 They also provide a compressed data representation.


Classification (Testing Phase)

 In the second step, the model is used for classification.

 First, the predictive accuracy of the classifier is estimated.

 If we were to use the training set to measure the classifier’s accuracy, this estimate would likely be optimistic, because the classifier tends to overfit the data (i.e., during learning it may incorporate some particular anomalies of the training data that are not present in the general data set overall).
Supervised vs. Unsupervised
Learning
 Supervised learning (classification)
 Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating
the class of the observations
 New data is classified based on the training set
 Unsupervised learning (clustering)
 The class labels of the training data are unknown
 Given a set of measurements, observations, etc., the aim is to
establish the existence of classes or clusters in the data
Evaluating Predictive Models

 Predictive Accuracy
 Speed
 time to construct the model
 time to use the model
 Robustness
 handling noise and missing values
 Scalability
 efficiency in disk-resident databases
 Interpretability:
 understanding and insight provided by the model
 Goodness of rules (quality)
 True Positives, True Negatives, False Negatives, False Positives
 compactness of classification rules
Supervised Learning Algorithms

 Artificial Neural Network


 Linear Regression
 Support Vector Machine
Artificial Neural Networks

 Perceptron
 Developed by Frank Rosenblatt, building on the McCulloch and Pitts
neuron model, the perceptron is the basic operational unit of artificial
neural networks. It employs a supervised learning rule and is able to
classify data into two classes.
 Operational characteristics of the perceptron: it consists of a single
neuron with an arbitrary number of inputs along with adjustable
weights, and the output of the neuron is 1 or 0 depending upon the
threshold. It also has a bias input, which is fixed at 1.
 A perceptron thus has the following three basic elements (sketched in
code below) −
 Links − a set of connection links that carry weights, including a
bias link whose input is always 1.
 Adder − adds the inputs after they are multiplied by their
respective weights.
 Activation function − limits the output of the neuron. The most basic
activation function is a Heaviside step function that has two possible
outputs: it returns 1 if the input is positive, and 0 for any
negative input.
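
A minimal Python/NumPy sketch of a perceptron with the three elements above (weighted links, adder, Heaviside step activation); the learning rate and the AND toy data are assumptions for illustration, not from the slides.

import numpy as np

def step(z):
    # Heaviside step activation: 1 for positive input, 0 otherwise
    return 1 if z > 0 else 0

def train_perceptron(X, y, lr=0.1, epochs=20):
    w = np.zeros(X.shape[1])  # adjustable weights on the input links
    b = 0.0                   # weight on the bias link (bias input fixed at 1)
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = step(np.dot(w, xi) + b)  # adder, then activation
            w += lr * (target - pred) * xi  # perceptron learning rule
            b += lr * (target - pred)       # bias update (bias input = 1)
    return w, b

# Toy two-class problem: logical AND (linearly separable)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print([step(np.dot(w, xi) + b) for xi in X])  # expected [0, 0, 0, 1]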
Linear Regression
 Linear regression may be defined as a statistical model that
analyzes the linear relationship between a dependent variable and a
given set of independent variables. A linear relationship between
variables means that when the value of one or more independent
variables changes (increases or decreases), the value of the dependent
variable changes accordingly (increases or decreases).
 Mathematically the relationship can be represented with the help of
the following equation (fitted in the sketch below) −
Y = mX + b
 Here, Y is the dependent variable we are trying to predict
 X is the independent variable we are using to make predictions
 m is the slope of the regression line, which represents the effect X
has on Y
 b is a constant, known as the Y-intercept; if X = 0, Y would be
equal to b
 Positive Linear Relationship
 A linear relationship is called positive if the independent and
dependent variables increase together (the fitted line slopes upward).
 Negative Linear Relationship
 A linear relationship is called negative if the dependent variable
decreases as the independent variable increases (the fitted line slopes
downward).
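
A minimal NumPy sketch of fitting Y = mX + b by least squares; the data points are made up for illustration.

import numpy as np

# Made-up data roughly following Y = 2X
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

# np.polyfit with degree 1 returns the slope m and intercept b
m, b = np.polyfit(X, Y, 1)
print(f"m = {m:.2f}, b = {b:.2f}")

# Predict the dependent variable for a new independent value
x_new = 6.0
print("prediction:", m * x_new + b)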
Support Vector Machines

 An SVM model is basically a representation of the different classes
separated by a hyper-plane in a multidimensional space. The hyper-plane
is generated in an iterative manner by SVM so that the error can be
minimized. The goal of SVM is to divide the datasets into classes by
finding a maximum marginal hyper-plane (MMH).
 The following are important concepts in SVM −
 Support Vectors − the data points closest to the hyper-plane are
called support vectors. The separating line is defined with the help of
these data points.
 Hyper-plane − a decision plane or space that divides a set of
objects having different classes.
 Margin − the gap between the two lines through the closest data
points of the different classes. It can be calculated as the
perpendicular distance from the line to the support vectors. A large
margin is considered a good margin and a small margin is considered a
bad margin.
 The main goal of SVM is to divide the datasets into classes by finding
a maximum marginal hyper-plane (MMH), which can be done in the
following two steps (see the sketch below) −
 First, SVM generates hyper-planes iteratively that segregate
the classes in the best way.
 Then, it chooses the hyper-plane that separates the classes
correctly with the largest margin.
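
A minimal sketch of the two steps using scikit-learn's SVC with a linear kernel (an assumed library choice, not named in the slides); the toy points are for illustration.

from sklearn.svm import SVC

# Toy two-class data; clearly separable, so a maximum-margin
# hyper-plane exists
X = [[0, 0], [1, 1], [2, 2], [8, 8], [9, 9], [10, 10]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear")  # linear kernel -> separating hyper-plane
clf.fit(X, y)

print(clf.support_vectors_)           # the points closest to the hyper-plane
print(clf.predict([[3, 3], [7, 7]]))  # classify unseen points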
Model Evaluation

 Metrics for Performance Evaluation


 How to evaluate the performance of a model?

 Methods for Performance Evaluation


 How to obtain reliable estimates?

 Methods for Model Comparison


 How to compare the relative performance among competing models?
Metrics for Performance Evaluation

 Focus on the predictive capability of a model


 Rather than how long it takes to classify or build models, scalability, etc.
 Confusion Matrix:

                    PREDICTED CLASS
                    Class=Yes    Class=No
ACTUAL   Class=Yes  a (TP)       b (FN)
CLASS    Class=No   c (FP)       d (TN)

a: TP (true positive)    b: FN (false negative)
c: FP (false positive)   d: TN (true negative)
Metrics for Performance Evaluation…

                    PREDICTED CLASS
                    Class=Yes    Class=No
ACTUAL   Class=Yes  a (TP)       b (FN)
CLASS    Class=No   c (FP)       d (TN)

 Most widely-used metric (computed in the sketch below):

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Methods of Estimation
 Holdout
 Reserve 2/3 for training and 1/3 for testing
 Random subsampling
 One sample may be biased -- Repeated holdout
 Cross validation
 Partition data into k disjoint subsets
 k-fold: train on k-1 partitions, test on the
remaining one (see the sketch below)
 Leave-one-out: k = n
 Guarantees that each record is used the same number of times for training
and testing
 Bootstrap
 Sampling with replacement
 ~63% of records used for training, ~37% for testing
ROC (Receiver Operating Characteristic)

 Developed in the 1950s for signal detection theory to analyze noisy signals
 Characterizes the trade-off between positive hits and false alarms
 ROC curve plots TPR (on the y-axis) against FPR (on the x-axis);
both rates are computed in the sketch below

TPR = TP / (TP + FN): the fraction of positive instances predicted as positive
FPR = FP / (FP + TN): the fraction of negative instances predicted as positive

                 PREDICTED CLASS
                 Yes       No
ACTUAL   Yes     a (TP)    b (FN)
CLASS    No      c (FP)    d (TN)
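
A minimal Python sketch computing one (FPR, TPR) point from classifier scores at a given threshold; sweeping the threshold traces out the ROC curve. The labels and scores are made up for illustration.

# Made-up binary labels (1 = positive) and classifier scores
labels = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.6, 0.5]

def roc_point(labels, scores, threshold):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(l == 1 and p == 1 for l, p in zip(labels, preds))
    fn = sum(l == 1 and p == 0 for l, p in zip(labels, preds))
    fp = sum(l == 0 and p == 1 for l, p in zip(labels, preds))
    tn = sum(l == 0 and p == 0 for l, p in zip(labels, preds))
    return fp / (fp + tn), tp / (tp + fn)  # (FPR, TPR)

for t in (0.3, 0.5, 0.7):
    print(t, roc_point(labels, scores, t))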
