Machine Learning Fundamentals
Introduction
This webinar covers
Identifying the needs and goals
Analysing the requirements
Gathering and preprocessing data
Understanding how to apply machine learning in commercial settings
Lecturer: Samson Hui
IT Support for Research: https://www.polyu.edu.hk/its/researchsupport/en/
Materials on Git Repo: https://polyu.hk/OJETT
Contact Person
Timothy Yim
Senior Specialist
Information and Technology Service
timothy.yim@polyu.edu.hk
Computer vs Human
Computers are good at
94893 × 1235 = 117192855, 2394 ÷ 13804 = 0.17342799…
Fast Memory
Fast Calculation
Fast Signal Transmission
Humans are good at
Recognition
Thinking outside the box
Making decisions based on intelligence and life experience
Our Goal
Develop algorithms and models so that computers can perform
tasks that traditionally humans are better at.
And with the help of high computational power and data storage,
hopefully computers can outperform humans in terms of
accuracy, speed and volume.
AI vs Machine Learning vs Deep Learning
Artificial intelligence – programs and machines that solve problems
like humans
Machine learning is a subset of AI – learning without being explicitly programmed
Deep learning is a subset of machine learning – based on neural networks
Data Visualization
Data visualization is the graphical representation of information and data.
Charts, graphs and maps
Key tools to tell stories
Curating data into a form easier to understand
Data Visualization – Data Table vs Graph
Data Visualization – Classifying Iris
Problem:
• Classifying three types of Iris, Setosa, Versicolour and Virginica.
Existing dataset information
• Sepal length (cm)
• Sepal width (cm)
• Petal length (cm)
• Petal width (cm)
Data Visualization – Demo with Orange
Data Preprocessing
Data Cleaning
Data Integration
Data Transformation
Data Transformation
Example: distance (1 km, 1 m, 1 cm)
Measurements on different scales do not contribute equally to the
analysis.
Correlation between features is more important than their raw magnitudes.
Data transformation or feature scaling methods are
needed.
Standardization
Standardization is a widely used data transformation technique that
changes raw feature vectors into a representation better suited to machine learning algorithms.
Transform the data to center it by removing the mean value of each
feature, then scale it by dividing non-constant features by their
standard deviation.
Therefore, scaled data has zero mean and unit variance.
Standardization Demo with Python
Example Python Code for Standardization
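A minimal sketch of the standardization step described above, using only the Python standard library. The helper name `standardize` and the sample distances are illustrative assumptions, not the webinar's own demo code.

```python
from statistics import fmean, pstdev

def standardize(column):
    """Return z-scores: subtract the mean, then divide by the standard deviation."""
    mean = fmean(column)
    std = pstdev(column)  # population standard deviation
    if std == 0:
        # Constant feature: only centre it, since dividing by zero is undefined.
        return [0.0 for _ in column]
    return [(value - mean) / std for value in column]

# Hypothetical distances recorded in metres (illustrative values only).
distances_m = [1000.0, 1.0, 0.01, 250.0]
scaled = standardize(distances_m)

print(scaled)
print("mean:", fmean(scaled), "std:", pstdev(scaled))
```

After scaling, the mean is approximately 0 and the standard deviation approximately 1, up to floating-point rounding. On real datasets a library routine such as scikit-learn's `StandardScaler` is usually the more practical choice.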
Machine Learning Algorithms
Machine learning algorithms or models are used to make decisions
or predictions from data, e.g. KNN, neural networks, SVM.
The model is said to learn from existing data and give outputs
for new data.
For example, traffic pattern prediction.
We will focus on machine learning algorithms in later
webinars.
Validation
We need to know that our trained algorithm is working as expected.
Validation is important before we publish our machine learning
program to the world.
The challenges of validation
Limited existing data
Past data may not represent the future
We will introduce one of the most widely used validation methods to
address these problems.
K-Fold Cross-Validation
Cross-validation is a resampling procedure
used to evaluate a model on a limited data sample.
k refers to the number of folds
Scenario
The data set is divided into k groups, e.g. 10 groups
9 groups of data are used to train the machine learning model and
the remaining group is used for testing.
Repeat so that each group takes a turn as the testing data set.
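The scenario above can be sketched as an index-splitting helper. This is a simplified illustration: the function name `k_fold_indices` is hypothetical, and the data are not shuffled, which a real workflow would normally do first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# 20 samples split into 10 groups: 9 groups train the model, 1 group tests it.
folds = list(k_fold_indices(n_samples=20, k=10))
for fold_number, (train, test) in enumerate(folds, start=1):
    print(f"fold {fold_number}: train on {len(train)} samples, test on {test}")
```

Every sample is used for testing exactly once and for training in the other k - 1 rounds, which is what makes the method useful when data are limited.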
K-Fold Cross-Validation
Applied Machine Learning in Business
Thanks!
Machine Learning Fundamentals (Session 2)
Objectives
Machine learning is a subset of AI
Learning without being explicitly programmed
Supervised learning vs unsupervised learning
And now, we are going to learn how to build a self-learning program
How Humans Learn
Imagine we are learning how to throw darts.
Our goal is to hit specific targets, e.g. bull’s eye, triple 20, single 16.
[Diagram labels: Brain, Eyes (feedback), Algorithms]
Supervised Learning
• Trained on a pre-defined set of data
• Reaches conclusions when given new data
• Develop the function y = f(x), where x is the input and y is the output
Supervised Learning – Classification vs Regression
• Supervised learning problems can be further grouped into classification and
regression problems.
• Classification – When the output variable is a category, e.g. true or false, red or
blue
• Regression – When the output variable is a real value, e.g. exchange rate,
weight
Supervised Learning – K Nearest Neighbors
• K nearest neighbours is a simple algorithm that stores all available cases and
classifies new cases based on a similarity measure (e.g. distance function).
• A new case is classified by majority vote of its K nearest neighbours
• Nearness is measured by a distance function
• If K = 1, the case is assigned to the class of its nearest neighbour
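A minimal sketch of the K nearest neighbours idea described above, assuming Euclidean distance as the distance function. The function name `knn_predict` and the sample points are illustrative assumptions.

```python
from collections import Counter
from math import dist

def knn_predict(training_points, training_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort the stored cases by Euclidean distance to the query point.
    neighbours = sorted(
        zip(training_points, training_labels),
        key=lambda pair: dist(pair[0], query),
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-D points with two classes (illustrative values only).
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 8.0)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(points, labels, query=(2.0, 2.0), k=3))  # A
print(knn_predict(points, labels, query=(6.0, 7.0), k=1))  # B
```

There is no training step in this sketch: the "model" is simply the stored cases, and all the work happens when a new point is classified.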
Supervised Learning – K Nearest Neighbors
• When K=3, Class B
• When K=6, Class A
Supervised Learning – K Nearest Neighbors
The black line is the decision boundary
KNN – How K Influences the algorithm
• The boundary becomes smoother as the value of K increases.
• When K is 1, the algorithm overfits the boundary.
• When K is as large as the whole training set, every prediction becomes the
overall majority class, which is useless.
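The effect of K can be illustrated with a small experiment. The data below are hypothetical: five "A" points, three "B" points, and one of the "B" points sits inside the "A" region like a noisy label. The helper `knn_predict` is an illustrative name and assumes Euclidean distance.

```python
from collections import Counter
from math import dist

def knn_predict(points, labels, query, k):
    """Majority vote among the k training points closest to `query`."""
    ranked = sorted(zip(points, labels), key=lambda pair: dist(pair[0], query))
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]

# Hypothetical data; the last "B" point lies among the "A" points.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 3), (8, 8), (9, 8), (2.4, 2.4)]
labels = ["A", "A", "A", "A", "A", "B", "B", "B"]

query = (2.5, 2.5)
for k in (1, 3, len(points)):
    print(f"K={k}: {knn_predict(points, labels, query, k)}")

# With K equal to the dataset size, even a point deep in the "B" region
# receives the overall majority class.
print(knn_predict(points, labels, (8.5, 8), len(points)))
```

With K = 1 the single noisy neighbour decides the result ("B"), with K = 3 the surrounding "A" points outvote it, and with K equal to the dataset size every query returns "A" regardless of its position.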
Error Rates
Most of the time, our trained model will have errors
• Classifying the target into the wrong class
• The predicted value is not exactly equal to the real value
We calculate the error rate to evaluate the effectiveness of our trained model
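A small sketch of how the two kinds of error above might be measured. The classification error rate follows directly from the text. Mean absolute error for the regression case is an assumption chosen for illustration, since the slides do not name a specific regression metric. All values are hypothetical.

```python
def error_rate(true_labels, predicted_labels):
    """Fraction of predictions that do not match the true class."""
    wrong = sum(1 for t, p in zip(true_labels, predicted_labels) if t != p)
    return wrong / len(true_labels)

def mean_absolute_error(true_values, predicted_values):
    """Average absolute gap between predicted and real values (regression)."""
    gaps = [abs(t - p) for t, p in zip(true_values, predicted_values)]
    return sum(gaps) / len(gaps)

# Hypothetical predictions (illustrative values only).
classification_error = error_rate(["A", "B", "B", "A"], ["A", "B", "A", "A"])
regression_error = mean_absolute_error([7.8, 7.9, 8.0], [7.7, 7.9, 8.2])

print(classification_error)  # one wrong out of four
print(regression_error)
```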
Bayes Error
• The lowest possible error rate for any classifier of a random outcome;
it is analogous to the irreducible error.
Error Rates
In the KNN example, we fine-tune the value of K to lower the error as much as possible.
But what if we cannot improve the success rate any further and it is still bad?
Supervised Learning – Neural Network
Supervised Learning – Neural Network History
• Warren McCulloch and Walter Pitts (1943) opened the subject by creating a
computational model for neural networks.
• The first functional networks with many layers were published by Ivakhnenko and
Lapa in 1965.
• The basics of continuous backpropagation were derived in the context of control theory by Kelley
in 1960 and by Bryson in 1961, using principles of dynamic programming.
• In the 1970s, a lot of research was carried out, but progress stagnated because computers at
that time lacked sufficient power to process useful neural networks.
• Recently, the rise of high-performance GPUs and CPUs has made multi-layer
neural networks feasible, and neural networks have become popular.
Supervised Learning – Neural Network
Neural networks are computing systems vaguely inspired by the biological neural
networks that constitute animal brains
Components
• Neurons
• Input layer
• Hidden layer
• Output layer
• Connections and Weights
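To make the components above concrete, here is a minimal forward pass through a network with four inputs, one hidden layer of three neurons and three output neurons. The weights are hand-picked hypothetical numbers, and the sigmoid activation and bias terms are assumptions added for illustration. A real network would learn its weights from data.

```python
from math import exp

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + exp(-x))

def layer_forward(inputs, weights, biases):
    """Each neuron outputs sigmoid(weighted sum of its inputs + bias)."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]

# Connections and weights: one row of weights per neuron (hypothetical values).
hidden_weights = [[0.5, -0.2, 0.1, 0.4], [-0.3, 0.8, -0.5, 0.2], [0.1, 0.1, 0.6, -0.7]]
hidden_biases = [0.0, 0.1, -0.1]
output_weights = [[0.3, -0.6, 0.9], [-0.4, 0.2, 0.5], [0.7, 0.1, -0.8]]
output_biases = [0.0, 0.0, 0.0]

# Input layer: four example measurements in cm (sepal length/width, petal length/width).
inputs = [5.1, 3.5, 1.4, 0.2]
hidden = layer_forward(inputs, hidden_weights, hidden_biases)    # hidden layer
outputs = layer_forward(hidden, output_weights, output_biases)   # output layer
print(outputs)  # three values, one per output neuron
```

In a classifier, each output neuron would typically correspond to one class, and the largest output would be taken as the prediction.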
Supervised Learning – Classifying Iris
Problem:
• Classifying three types of Iris, Setosa, Versicolour and Virginica.
Existing dataset information
• Sepal length (cm)
• Sepal width (cm)
• Petal length (cm)
• Petal width (cm)
Unsupervised Learning
• Dataset without labelled responses
• Find hidden patterns
• Find groupings in data
• Usually less accurate and less trustworthy than supervised learning
• Clustering is a common unsupervised learning technique
Clustering
• Involves the grouping of data points
• Similar properties in the same group
• Highly dissimilar properties in different groups
• Works best if the classes do not overlap
Examples:
• K-means clustering
• Hierarchical clustering
• Fuzzy c-means clustering
K-means Clustering
• Target number k – number of centroids
• A centroid is the imaginary or real location representing the center of the cluster
• Allocates every data point to the nearest cluster
• Keeps each cluster as compact as possible, i.e. keeps the distances between data points and their centroid small
K-means Clustering - Steps
1. Randomly initialize k cluster centres
2. Assign each point to the closest centre
3. Re-compute each centre as the mean of the data points assigned to it
4. Iterate a set number of times or until the centres no longer change much
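The four steps above can be sketched in plain Python. This is a simplified illustration: the function name `k_means`, the choice of random data points as initial centres, the fixed `seed` and the sample data are all assumptions. Production code would normally use a library implementation.

```python
import random
from math import dist

def k_means(points, k, max_iterations=100, seed=0):
    """Return (centres, assignments) after running the four steps."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centres.
    centres = rng.sample(points, k)
    assignments = []
    for _ in range(max_iterations):
        # Step 2: assign each point to its closest centre.
        assignments = [
            min(range(k), key=lambda c: dist(point, centres[c])) for point in points
        ]
        # Step 3: recompute each centre as the mean of its assigned points.
        new_centres = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centres.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centres.append(centres[c])  # keep an empty cluster's centre
        # Step 4: stop once the centres no longer move.
        if new_centres == centres:
            break
        centres = new_centres
    return centres, assignments

# Hypothetical 2-D points (illustrative values only).
data = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centres, assignments = k_means(data, k=2)
print(centres)
print(assignments)
```

The result depends on the initial centres, so in practice the algorithm is often run several times with different seeds and the most compact clustering is kept.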
Summary
Supervised Learning
• Labelled data
• Develop a finely tuned function that predicts outputs from inputs
• Can be very precise, but labelled data are harder to collect
Unsupervised learning
• Unlabelled data
• Find hidden patterns
• Less trustworthy, but data are easier to collect