Classification
• Classification: Basic Concepts
• Logistic Regression
• Decision Tree Induction
• Bayes Classification Methods
2
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
– Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating
the class of the observations
– New data is classified based on the training set
• Unsupervised learning (clustering)
– The class labels of training data is unknown
– Given a set of measurements, observations, etc. with the
aim of establishing the existence of classes or clusters in
the data
3
Prediction Problems: Classification vs.
Numeric Prediction
• Classification
– predicts categorical class labels (discrete or nominal)
– classifies data (constructs a model) based on the training
set and the values (class labels) in a classifying attribute
and uses it in classifying new data
• Numeric Prediction
– models continuous-valued functions, i.e., predicts unknown
or missing values
• Typical applications
– Credit/loan approval:
– Medical diagnosis: if a tumor is cancerous or benign
– Fraud detection: if a transaction is fraudulent
– Web page categorization: which category it is
4
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
– The set of tuples used for model construction is training set
– The model is represented as classification rules, decision trees, or
mathematical formulae
• Model usage: for classifying future or unknown objects
– Estimate accuracy of the model
• The known label of test sample is compared with the classified
result from the model
• Accuracy rate is the percentage of test set samples that are
correctly classified by the model
• Test set is independent of training set (otherwise overfitting)
– If the accuracy is acceptable, use the model to classify new data
• Note: If the test set is used to select models, it is called validation (test) set
5
Process (1): Model Construction
Classification
Algorithms
Training
Data
Classifier
(Model)
IF rank = ‘professor’
OR years > 6
THEN tenured = ‘yes’
6
Process (2): Using the Model in Prediction
Classifier
Testing Unseen
Data Data
(Jeff, Professor, 4)
Tenured?
7
Logistic Regression
References
• http://cse.iitkgp.ac.in/~saptarshi/courses/ml2
018spring/ML-02-linear-regression.pdf
• Course material of Andrew Ng
http://cs229.stanford.edu/