Course Overview and Introduction
CE-717: Machine Learning
Sharif University of Technology
M. Soleymani
Fall 2016
Course Info
Instructor: Mahdieh Soleymani
Email: soleymani@sharif.edu
Lectures: Sun-Tue (13:30-15:00)
Website: http://ce.sharif.edu/cources/95-96/1/ce717-2
Textbooks
Pattern Recognition and Machine Learning, C. Bishop, Springer, 2006.
Machine Learning, T. Mitchell, McGraw-Hill, 1997.
Additional readings will be made available when appropriate.
Other books:
The Elements of Statistical Learning, T. Hastie, R. Tibshirani, J. Friedman, Springer, Second Edition, 2008.
Machine Learning: A Probabilistic Perspective, K. Murphy, MIT Press, 2012.
Marking Scheme
Midterm Exam: 25%
Final Exam: 30%
Project: 5-10%
Homework (written & programming): 20-25%
Mini-exams: 15%
Machine Learning (ML) and Artificial Intelligence (AI)
ML first appeared as a branch of AI
ML is now also a preferred approach to other subareas of AI:
Computer Vision, Speech Recognition, …
Robotics
Natural Language Processing
ML is a strong driver in Computer Vision and NLP
A Definition of ML
Tom Mitchell (1998): Well-posed learning problem
“A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.”
Using the observed data to make better decisions
Generalizing from the observed data
ML Definition: Example
Consider an email program that learns how to filter spam
according to emails you do or do not mark as spam.
T: Classifying emails as spam or not spam.
E: Watching you label emails as spam or not spam.
P: The number (or fraction) of emails correctly classified as
spam/not spam.
The essence of machine learning
A pattern exists
We do not know it mathematically
We have data on it
Example: Home Price
Housing price prediction
[Plot: housing prices, price in $1000's (y-axis) vs. size in feet² (x-axis)]
Figure adopted from slides of Andrew Ng, Machine Learning course, Stanford.
Example: Bank loan
Input: the applicant's form
Output: approving or denying the request
Components of (Supervised) Learning
Unknown target function: $f: \mathcal{X} \rightarrow \mathcal{Y}$
Input space: $\mathcal{X}$
Output space: $\mathcal{Y}$
Training data: $(\boldsymbol{x}_1, y_1), (\boldsymbol{x}_2, y_2), \ldots, (\boldsymbol{x}_N, y_N)$
Pick a formula $g: \mathcal{X} \rightarrow \mathcal{Y}$ that approximates the target function $f$
selected from a set of hypotheses $\mathcal{H}$
Training data: Example
Training data:

 x1   x2    y
0.9  2.3    1
3.5  2.6    1
2.6  3.3    1
2.7  4.1    1
1.8  3.9    1
6.5  6.8   -1
7.2  7.5   -1
7.9  8.3   -1
6.9  8.3   -1
8.8  7.9   -1
9.1  6.2   -1

[Scatter plot: the same points in the $(x_1, x_2)$ plane, one class per marker]
Components of (Supervised) Learning
[Diagram: learning model]
Solution Components
The learning model is composed of:
Learning algorithm
Hypothesis set
Perceptron example
Perceptron classifier
Input: $\boldsymbol{x} = (x_1, \ldots, x_d)$
Classifier:
If $\sum_{i=1}^{d} w_i x_i > \text{threshold}$, output 1; else output $-1$
The linear formula $g \in \mathcal{H}$ can be written:
$g(\boldsymbol{x}) = \operatorname{sign}\left(\left(\sum_{i=1}^{d} w_i x_i\right) + w_0\right)$, where $w_0 = -\text{threshold}$
If we add a coordinate $x_0 = 1$ to the input:
$g(\boldsymbol{x}) = \operatorname{sign}\left(\sum_{i=0}^{d} w_i x_i\right)$
Vector form: $g(\boldsymbol{x}) = \operatorname{sign}(\boldsymbol{w}^T \boldsymbol{x})$
[Figure: a line separating the two classes in the $(x_1, x_2)$ plane]
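A minimal sketch of this classifier in Python (NumPy assumed; the function and variable names are illustrative, not from the slides):

```python
import numpy as np

def predict(w, x):
    """Perceptron output g(x) = sign(w^T x).

    w: weight vector of length d+1, with w[0] playing the role of w0
    x: input vector of length d (the x0 = 1 coordinate is added below)
    """
    x_aug = np.concatenate(([1.0], x))   # prepend the constant coordinate x0 = 1
    return 1 if w @ x_aug > 0 else -1    # ties (exactly zero) mapped to -1 here
```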
Perceptron learning algorithm:
linearly separable data
Given the training data $(\boldsymbol{x}^{(1)}, y^{(1)}), \ldots, (\boldsymbol{x}^{(N)}, y^{(N)})$
A data point $(\boldsymbol{x}^{(n)}, y^{(n)})$ is misclassified when:
$\operatorname{sign}(\boldsymbol{w}^T \boldsymbol{x}^{(n)}) \neq y^{(n)}$
Repeat
Pick a misclassified data point $(\boldsymbol{x}^{(n)}, y^{(n)})$ from the training data and update $\boldsymbol{w}$:
$\boldsymbol{w} \leftarrow \boldsymbol{w} + y^{(n)} \boldsymbol{x}^{(n)}$
Until all training data points are correctly classified by $g$
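A sketch of this loop in Python (assumes the data are linearly separable, otherwise it never terminates; the vectorized misclassification check is an implementation choice, not part of the slide):

```python
import numpy as np

def perceptron_learn(X, y):
    """Perceptron learning algorithm (PLA).

    X: (N, d+1) array of inputs, each row already augmented with x0 = 1
    y: (N,) array of labels in {+1, -1}
    Returns w such that sign(X @ w) matches y on every training point.
    """
    w = np.zeros(X.shape[1])
    while True:
        preds = np.where(X @ w > 0, 1, -1)   # current outputs g(x)
        wrong = np.flatnonzero(preds != y)   # indices of misclassified points
        if wrong.size == 0:                  # all points correct: stop
            return w
        n = wrong[0]                         # pick a misclassified point
        w = w + y[n] * X[n]                  # the update w <- w + y(n) x(n)
```

On the linearly separable table shown earlier, X would be the 11×3 matrix with rows (1, x1, x2) and y the column of ±1 labels.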
Perceptron learning algorithm:
Example of weight update
[Figures: the decision boundary in the $(x_1, x_2)$ plane before and after one weight update]
Experience (E) in ML
Basic premise of learning:
“Using a set of observations to uncover an underlying
process”
The different paradigms of ML differ in the type of observations available and how they are obtained.
Paradigms of ML
Supervised learning (regression, classification)
predicting a target variable for which we get to see examples.
Unsupervised learning
revealing structure in the observed data
Reinforcement learning
partial (indirect) feedback, no explicit guidance
Given rewards for a sequence of moves, it learns a policy and utility functions
Other paradigms: semi-supervised learning, active learning,
online learning, etc.
Supervised Learning:
Regression vs. Classification
Supervised Learning
Regression: predict a continuous target variable
E.g., $y \in [0,1]$
Classification: predict a discrete target variable
E.g., $y \in \{1, 2, \ldots, C\}$
Data in Supervised Learning
Data are usually considered as vectors in a $d$-dimensional space
For now, we make this assumption for illustrative purposes
We will see that it is not necessary
[Table: rows are samples 1…n; columns are features $x_1, x_2, \ldots, x_d$ plus the target $y$]
Columns: features/attributes/dimensions
Rows: data/points/instances/examples/samples
$y$ column: target/outcome/response/label
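A minimal sketch of this layout (the feature and target values below are hypothetical):

```python
import numpy as np

# Feature matrix X: one row per sample, one column per feature (here d = 2,
# invented house sizes in ft^2 and numbers of rooms).
X = np.array([[1000.0, 3],
              [1500.0, 4],
              [2000.0, 5]])

# Target column y: one outcome per sample (price in $1000's, invented values).
y = np.array([200.0, 300.0, 400.0])

assert X.shape[0] == y.shape[0]   # one target per row of X
```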
Regression: Example
Housing price prediction
[Plot: housing prices, price in $1000's (y-axis) vs. size in feet² (x-axis)]
Figure adopted from slides of Andrew Ng
Classification: Example
[Plots: classifying animals as cat (0) or dog (1) from their weight]
Supervised Learning vs. Unsupervised
Learning
Supervised learning
Given: training set
a labeled set of $N$ input-output pairs $D = \{(\boldsymbol{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$
Goal: learn a mapping from $\boldsymbol{x}$ to $y$
Unsupervised learning
Given: training set $\{\boldsymbol{x}^{(i)}\}_{i=1}^{N}$
Goal: find groups or structures in the data (see the sketch below)
Discover the intrinsic structure in the data
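The slides leave the method unspecified here; as one illustrative sketch (not from the slides), k-means clustering finds $k$ groups by alternating point assignment and centroid updates:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """A minimal k-means sketch: group the rows of X into k clusters."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # init from the data
    for _ in range(iters):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):   # converged
            break
        centers = new_centers
    return labels, centers
```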
Supervised Learning: Samples
[Scatter plot: labeled samples from two classes in the $(x_1, x_2)$ plane (classification)]
Unsupervised Learning: Samples
[Scatter plot: unlabeled samples in the $(x_1, x_2)$ plane forming three groups, Type I-III (clustering)]
Sample Data in Unsupervised Learning
Unsupervised Learning:
[Table: rows are samples 1…n; columns are features $x_1, x_2, \ldots, x_d$; no target column]
Columns: features/attributes/dimensions
Rows: data/points/instances/examples/samples
Unsupervised Learning: Example
Applications
Clustering documents based on their similarities
Grouping news stories on the Google News site
Market segmentation: group customers into different
market segments given a database of customer data.
Social network analysis
Reinforcement Learning
Provides only an indication as to whether an action is
correct or not
Data in supervised learning:
(input, correct output)
Data in Reinforcement Learning:
(input, some output, a grade of reward for this output)
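A toy sketch of this kind of feedback (a hypothetical two-armed bandit; the reward probabilities below are invented for illustration):

```python
import random

# True (hidden) reward probabilities of two actions; the learner never sees these.
TRUE_P = [0.3, 0.7]

def pull(action):
    """Environment: returns a reward grade (1 or 0), not the correct answer."""
    return 1 if random.random() < TRUE_P[action] else 0

# Epsilon-greedy learner: estimates each action's value from observed rewards.
counts, values = [0, 0], [0.0, 0.0]
for t in range(1000):
    a = random.randrange(2) if random.random() < 0.1 else values.index(max(values))
    r = pull(a)                               # (input, chosen output, reward grade)
    counts[a] += 1
    values[a] += (r - values[a]) / counts[a]  # running average of observed rewards

print(f"estimated values: {values}")          # should approach TRUE_P
```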
Reinforcement Learning
Typically, we need to make a sequence of decisions
It is usually assumed that reward signals refer to the entire sequence
Is learning feasible?
Learning an unknown function is impossible.
The function can assume any value outside the data we have.
However, it is feasible in a probabilistic sense.
Generalization
We don’t intend to memorize data but need to figure out
the pattern.
A core objective of learning is to generalize from the
experience.
Generalization: the ability of a learning algorithm to perform accurately on new, unseen examples after learning from a finite training set.
Components of (Supervised) Learning
[Diagram: learning model]
Main Steps of Learning Tasks
Selection of hypothesis set (or model specification)
Which class of models (mappings) should we use for our data?
Learning: find mapping 𝑓 (from hypothesis set) based on the
training data
Which notion of error should we use? (loss functions)
Optimization of loss function to find mapping 𝑓
Evaluation: how well 𝑓 generalizes to yet unseen examples
How do we ensure that the error on future data is minimized?
(generalization)
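A compact sketch of these steps on toy data (the values are hypothetical; least squares on a line is used only as one example of a hypothesis set and loss):

```python
import numpy as np

# Toy data: size in ft^2 -> price in $1000's (values invented for illustration).
X = np.array([500.0, 1000.0, 1500.0, 2000.0])
y = np.array([110.0, 195.0, 310.0, 405.0])

# Step 1 (hypothesis set): lines g(x) = w1*x + w0, encoded via a design matrix.
A = np.stack([X, np.ones_like(X)], axis=1)

# Step 2 (learning): minimize squared loss on a training split.
train, test = slice(0, 3), slice(3, 4)
w, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)

# Step 3 (evaluation): error on the held-out point estimates generalization.
test_error = (((A[test] @ w) - y[test]) ** 2).item()
print(w, test_error)
```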
Some Learning Applications
Face, speech, handwritten character recognition
Document classification and ranking in web search
engines
Photo tagging
Self-customizing programs (recommender systems)
Database mining (e.g., medical records)
Market prediction (e.g., stock/house prices)
Computational biology (e.g., annotation of biological
sequences)
Autonomous vehicles
ML in Computer Science
Why are ML applications growing?
Improved machine learning algorithms
Availability of data (increased data capture, networking, etc.)
Demand for self-customization to user or environment
Software too complex to write by hand
Handwritten Digit Recognition Example
Data: labeled samples
[Images: sample handwritten digits labeled 0-9]
Example: Input representation
Example: Illustration of features
Example: Classification boundary
Main Topics of the Course
Supervised learning (most of the lectures are on this topic)
Regression
Classification (our main focus)
Learning theory
Unsupervised learning
Reinforcement learning
Some advanced topics & applications
Resource
Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin, “Learning from Data”, AMLBook, 2012.