
APS1070

Foundations of Data Analytics and Machine Learning
Fall 2022
Lecture 3:
• End-to-end Machine Learning
• Data Retrieval and Preparation
• Plotting and Visualization
• Making Predictions
• Decision Trees

Samin Aref
Agenda
➢ Today’s focus is on Foundations of Learning
1. End-to-end machine learning
2. Python Libraries
—NumPy
—Matplotlib
—Pandas
—Scikit-Learn
3. Decision Trees

Part 1
End-to-End Machine Learning
End-to-End Machine Learning

1. Understand the problem
2. Retrieve the data
3. Explore and visualize the data to gain insights
4. Prepare the data for the algorithm/model
5. Select and train the algorithm/model
6. Fine-tune your algorithm/model
7. Present your solution
8. Launch, monitor, and maintain your system
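
A minimal sketch of steps 2–6 with scikit-learn, assuming a tabular CSV dataset; the file name, the "target" column, and the model choice are placeholders:

```python
# Hypothetical end-to-end sketch; "data.csv" and the "target" column
# stand in for a real dataset.
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("data.csv")                       # 2. retrieve the data
print(df.describe())                               # 3. explore the data
X, y = df.drop(columns=["target"]), df["target"]   # 4. prepare the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
search = GridSearchCV(DecisionTreeClassifier(),    # 5. select and train
                      {"max_depth": [2, 4, 8]})    # 6. fine-tune
search.fit(X_train, y_train)
print(search.score(X_test, y_test))                # test before launching
```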
End-to-End Machine Learning

(Pipeline: Understand Problem → Data Collection → Data Visualization → Data Preparation → Model Selection → Model Training → Test and Assess)
Classification vs. Regression
➢ Classification: discrete target
➢ Separate the dataset
➢ Apples or oranges?
➢ Dog or cat?
➢ Handwritten digit recognition

➢ Regression: continuous target
➢ Fit the dataset
➢ Price of a house
➢ Revenue of a company
➢ Age of a tree

(Figures: a classifier separating two classes in a Feature #1 vs. Feature #2 plot, and a regressor fitting Target vs. Feature #1.)

Understand the Problem
➢ Often, we need to make some sort of decisions (predictions)
➢ Two common types of decisions that we make are:
➢ Classification: discrete number of possibilities
➢ Regression: continuous number of real-valued possibilities

Learning tasks by output type:

             Supervised        Unsupervised
Discrete     classification    clustering
Continuous   regression        dimensionality reduction

Understand the Problem
Input data is represented
by features that can come
in many forms:

➢ Raw pixels
➢ Histograms
➢ Tabular data
➢ Spectrograms
➢…

Data Exploration
➢ Understand your data through
visualization
➢ Assess the difficulty of the problem

➢ You have a data set D = {(x(i),y(i))}


➢ You want to learn y = f(x) from D
➢ more precisely, you want to minimize
error in predictions

➢ What kind of model (algorithm) do you need?

(Figure: a low-dimensional projection of MNIST.)
Model Selection
Many classifiers to choose from
➢ Support-Vector Machine (SVM)
➢ Logistic Regression
➢ Random Forests
➢ Naive Bayes
➢ Bayesian network
➢ K-Nearest Neighbour
➢ (Deep) Neural networks
➢ Etc.
Model Selection
➢ Often the easiest algorithm to
implement is k-Nearest Neighbours
➢ Match to similar data using a distance
metric

Q: What happens as we increase #data?


Q: What about as #data approaches
infinity?
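
A minimal k-NN sketch with scikit-learn; the iris data below stands in for any small labelled dataset:

```python
# k-NN stores the training data and predicts by majority vote of the
# k nearest points, so test time grows with the amount of stored data.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)  # Euclidean distance by default
knn.fit(X, y)                              # "training" is just memorization
print(knn.predict(X[:3]))                  # vote among the 5 nearest neighbours
```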

Test and Assess
➢ Unlike us, computers have no trouble with
memorization.
➢ The real question is, how well does our
algorithm make predictions on new data?

➢ We need a way to measure how well our algorithm (model) generalizes to new, never-before-seen data.

Regression Example
➢ Let’s look at a more concrete example…
➢ Given noisy sample data (blue), we want to find the polynomial that generated the data
➢ Q: What kind of a problem is this?

(Figure: stock price vs. media exposure.)
Mean Squared Error
➢ We first need to define our error term; in this case we can use the mean squared error (MSE)
➢ Error is measured by finding the squared error in the prediction of y from x
➢ The error for the red polynomial can be measured by the mean of the squared vertical errors
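
In this notation, with $N$ training pairs $(x^{(i)}, y^{(i)})$ and a candidate polynomial $f$, the mean squared error is:

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( f(x^{(i)}) - y^{(i)} \right)^{2}$$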
Fitting the Data
Q: Which polynomial fits the
data best?
➢ based on training data?
➢ based on test data?

Overfitting vs Underfitting

➢ Underfit: high training error and high test error
➢ Good fit: acceptable training error and acceptable test error
➢ Overfit: perfect (zero) training error and high test error
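
A minimal NumPy sketch of this comparison; the noisy data below is synthetic (a sine curve plus noise), chosen purely for illustration:

```python
# Fit polynomials of increasing degree to 10 noisy samples and compare
# their training MSE; high degrees fit the noise (overfitting).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)     # least-squares polynomial fit
    mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(degree, mse)                    # training MSE shrinks with degree,
                                          # but test MSE would not
```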

Generalization
➢ Giving the model a greater capacity (more complexity) to
fit the data… does not necessarily help
➢ How do we evaluate the model's performance? Verify the model on new data!
Overfitting
➢ In brief: fitting characteristics of training data that do
not generalize to future test data

➢ Central problem in machine learning


➢ Particularly problematic if #data << #parameters
➢ … don’t have enough data to “identify” parameters

Generalization
➢ Machine learning is a game of balance, with our objective
being to generalize to all possible future data

(Figure: error (% incorrect) vs. model capacity (complexity). Error on the training samples keeps decreasing as capacity grows, while error on new samples falls and then rises again: too little capacity underfits, too much overfits.)


Bias-Variance Trade-off

➢ Models with too few parameters are inaccurate because of a large bias (not enough flexibility).

➢ Models with too many parameters are inaccurate because of a large variance (too much sensitivity to the sample).

Inductive Bias
➢ Let’s avoid making assumptions about the model (polynomial order)
➢ Assume for simplicity that D = {(x(i),y(i))} is noise free
➢ x(i)’s in D only cover small subset of input space x
➢ Q: What’s the best we can do?
➢ If we’ve seen x=x(i) report y=y(i)
➢ If we have not seen x= x(i), can’t say anything (no assumptions)
➢ This is called rote learning… boring, eh?
➢ Key idea: you can't generalize to unseen data w/o assumptions!
➢ Thus, key to ML is generalization
➢ To generalize, ML algorithm must have some inductive bias
➢ Bias usually in the form of a restricted model (hypothesis) space
➢ Important to understand restrictions (and whether appropriate)
Inductive Bias
➢ Example: Nearest neighbors
– We assume that most of the cases in a small neighbourhood in feature space belong to the same class: given a case whose class is unknown, we predict that it belongs to the same class as the majority in its immediate neighbourhood.
– This is the inductive bias used in the k-nearest neighbours algorithm.

Training and Testing Data
➢ Track generalization error by splitting data into
training and testing
➢ e.g., 80% training and 20% testing

➢ More data = better model


➢ We would like to use all our data for training; however, we need some way to evaluate our model
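
A minimal sketch of the 80/20 split with scikit-learn; the arrays below are toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features (toy data)
y = np.arange(10) % 2              # toy binary labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # 80% training, 20% testing
print(X_train.shape, X_test.shape)          # (8, 2) (2, 2)
```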

The problem with tracking test accuracy
➢ What should K be?

➢ If we track test error/accuracy in our training curve, then:
➢ We may make decisions about model architecture using the test accuracy, making the testing meaningless.
➢ The final test accuracy will not be a realistic estimate of how our model will perform on a new data set!
Validation Set
➢ We still want to track the loss/accuracy on a data set not used for training
➢ Idea: set aside a separate data set, called the validation set
➢ Track validation accuracy in the training curve
➢ Make decisions about model architecture using the validation set
K is a hyperparameter.
We tune hyperparameters using the validation set

Validation and Holdout Data
➢ Training, Validation and Testing Data
➢ Less data for your training model
➢ Ideally use the holdout data only once
➢ Requires a great deal of discipline to not look at the
holdout data

(Figure: splitting data into training, validation, and holdout sets.)

Cross-Validation
➢ Splitting training and validation data into several folds during training
➢ This is known as k-fold Cross-Validation
➢ Model parameters are selected based on the average achieved over the k folds
Source: scikit-learn
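
A minimal 5-fold cross-validation sketch with scikit-learn, treating k-NN's n_neighbors as the hyperparameter being tuned; the dataset and candidate values are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for k in (1, 5, 15):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(k, scores.mean())   # pick the k with the best average over the folds
```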
Data Processing
➢ Q: You test your model on new data and you find it fails to predict certain samples. Why could this be happening?

(Figure: test data lying outside the range of the training data.)
Data Augmentation
➢ For example, how can your algorithm (model) predict on rotations if it has never seen a rotated sample?

➢ Apply Data Augmentation! For instance (see the sketch below):
➢ translation,
➢ scaling,
➢ rotation,
➢ reflection,
➢…

Linear Algebra to the Rescue!
Source: https://morioh.com/p/928228425a08
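
A minimal sketch of one such augmentation, rotating 2-D points with a rotation matrix; the angle and points below are arbitrary:

```python
import numpy as np

theta = np.deg2rad(15)                           # arbitrary rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # 2-D rotation matrix
points = np.array([[1.0, 0.0], [0.0, 1.0]])      # toy sample coordinates
print(points @ R.T)                              # rotated copies of the samples
```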

More Data Processing
➢ Q: A large input feature size (short and wide data) is problematic. Why do you think that is?

➢ Curse of dimensionality!
➢ As the number of features grows, you require more model capacity (complexity) to represent the data
➢ Models of greater complexity require exponentially more training data

Dimensionality Reduction
Solution:
➢ Reduce the number of features using dimensionality reduction
➢ Principal Component Analysis (PCA)
➢ more details provided in weeks 7 and 8

Source: Data Courses
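
A minimal PCA sketch with scikit-learn; the dataset and the choice of two components are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)            # 150 samples, 4 features
X_2d = PCA(n_components=2).fit_transform(X)  # project onto top 2 components
print(X_2d.shape)                            # (150, 2)
```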

Deep Learning
➢ Principal Component Analysis (PCA) is limited
to linear transformations
➢ Deep Learning techniques can be used to
learn and apply nonlinear transformations
for dimensionality reduction
➢ More detail on model-based machine learning
techniques in weeks 9 – 11

Roadmap for the rest of APS1070

(Pipeline: Understand Problem → Data Collection → Data Visualization → Data Preparation → Model Selection → Model Training → Test and Assess)

End-to-end machine learning is just one piece of the pie. The concepts we’ll cover in
this course have utility that goes far beyond machine learning.
Basic Python Check-up
Tutorials 0 and 1: Python Basics
❑ Data Types
❑ Single: int, float, bool
❑ Multiple: str, list, set, tuple, dict
❑ Conditionals: if, elif, else
❑ Functions: def, return, recursion, default vals
❑ Loops: for, while, range; list comprehension
❑ Operations
❑ arithmetic: +, *, -, /, //, %, **
❑ boolean: not, and, or
❑ relational: ==, !=, >, <, >=, <=
❑ index [], slice [::], mutability
❑ Display: print, end, sep
❑ Files: open, close, with; read, write; CSV
❑ Object-Oriented Programming (OOP): class, methods, attributes; __init__, __str__, polymorphism
Other resources for Python
➢ Toronto-based and internationally popular resources:
➢ Kaggle 5-hour course on Python (by Colin Morris)
https://www.kaggle.com/learn/python
➢U of T MOOC Learn to Program: The Fundamentals
https://www.coursera.org/learn/learn-to-program
➢U of T MOOC Learn to Program: Crafting Quality Code
https://www.coursera.org/learn/program-code
➢U of T Coders (student-run group)
https://uoftcoders.github.io/

➢ Google is your (BEST) friend!


➢ APS1070 Piazza Discussion Board
Scientific Computing Tools for Python
➢ Scientific computing in Python builds upon a small core of
packages:
➢ NumPy, the fundamental package for numerical computation. It defines the
numerical array and matrix types and basic operations on them.
➢ The SciPy library, a collection of numerical algorithms and domain-specific
toolboxes, including signal processing, optimization, statistics and much
more.
➢ Matplotlib, a mature and popular plotting package that provides publication-quality 2D plotting as well as rudimentary 3D plotting

➢ Data and computation:


➢ pandas, providing high-performance, easy to use data structures.
➢ scikit-learn is a collection of algorithms and tools for machine learning.

Source: https://www.scipy.org/about.html
NumPy
➢ Let’s start with NumPy. Among other things, NumPy
contains:
➢ A powerful N-dimensional array object.
➢ Sophisticated (broadcasting/universal) functions.
➢ Tools for integrating C/C++ and Fortran code.
➢ Useful linear algebra, Fourier transform, and random number
capabilities.
➢ Besides its obvious scientific uses, NumPy can also be used as
an efficient multi-dimensional container of generic data.
➢ Many other python libraries are built on NumPy
➢ Provides vectorization of mathematical operations on arrays and matrices, which significantly improves performance
NumPy
➢ The key to NumPy is the ndarray object, an n-dimensional array
of homogeneous data types, with many operations being
performed in compiled code for performance.
➢ There are several important differences between NumPy arrays
and the standard Python sequences:
➢ NumPy arrays have a fixed size. Modifying the size means creating a
new array.
➢ All elements of a NumPy array must be of the same data type, but this type can be a Python object.
➢ More efficient mathematical operations than built-in sequence types

NumPy
➢ To begin, NumPy supports a wider variety of data types than
are built-in to the Python language by default. They are defined
by the numpy.dtype class and include:
➢ intc (same as a C integer) and intp (used for indexing)
➢ int8, int16, int32, int64
➢ uint8, uint16, uint32, uint64
➢ float16, float32, float64
➢ complex64, complex128
➢ bool_, int_, float_, complex_ are shorthand for defaults.
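
A small sketch of selecting and converting dtypes; the values are arbitrary:

```python
import numpy as np

x = np.array([1, 2, 3], dtype=np.int8)   # 1 byte per element
y = x.astype(np.float32)                 # explicit conversion to a float type
print(x.dtype, y.dtype)                  # int8 float32
```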

NumPy
➢ There are a couple of mechanisms for creating arrays in NumPy:
➢ Conversion from other Python structures (e.g., lists, tuples).
➢ Built-in NumPy array creation functions (e.g., arange, ones, zeros, etc.).
➢ Reading arrays from disk, either from standard or custom formats (e.g.
reading in from a CSV file).
➢ and others …

➢ In general, any numerical data that is stored in an array-like container can be converted to an ndarray through use of the array() function. The most obvious examples are sequence types like lists and tuples.
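
A minimal sketch of these creation mechanisms; the CSV line is commented out because the file name is a placeholder:

```python
import numpy as np

a = np.array([1, 2, 3])        # conversion from a Python list
b = np.array((4.0, 5.0))       # conversion from a tuple
c = np.arange(0, 10, 2)        # built-in creation: 0, 2, 4, 6, 8
d = np.zeros((2, 3))           # built-in creation: 2x3 array of zeros
# e = np.loadtxt("data.csv", delimiter=",")   # reading from disk
print(a.dtype, b.dtype, c, d.shape)
```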
SciPy
➢ Collection of algorithms for linear algebra, differential
equations, numerical integration, optimization, statistics and
much more
➢ Part of SciPy Stack
➢ Built on NumPy

➢ With SciPy, an interactive Python session becomes a data-processing and system-prototyping environment rivaling systems such as MATLAB, IDL, Octave, R-Lab, and SciLab.

SciPy
➢ SciPy’s functionality is implemented in a number of specific sub-
modules. These include:
➢ Special mathematical functions (scipy.special) -- airy, elliptic, bessel, etc.
➢ Integration (scipy.integrate)
➢ Optimization (scipy.optimize)
➢ Interpolation (scipy.interpolate)
➢ Fourier Transforms (scipy.fftpack)
➢ Signal Processing (scipy.signal)
➢ Linear Algebra (scipy.linalg)
➢ Statistics (scipy.stats)
➢ Multidimensional image processing (scipy.ndimage)
➢ Data IO (scipy.io)
➢ and more!
Pandas
➢ Adds data structures and tools designed to work with table-like
data (similar to Series and Data Frames in R)
➢ Provides tools for data manipulation: reshaping, merging,
sorting, slicing, aggregation etc.
➢ Aggregation - computing a summary statistic for groups
➢ min, max, count, sum, prod, mean, median, mode, mad, std, var

➢ Allows for handling missing data

Source: http://pandas.pydata.org/
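
A minimal pandas sketch of aggregation and missing-data handling; the table below is made up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"fruit": ["apple", "apple", "orange", "orange"],
                   "width": [7.1, 6.8, 9.0, np.nan]})
print(df.groupby("fruit")["width"].agg(["mean", "count"]))  # aggregation
print(df.dropna())                      # drop rows with missing values
print(df.fillna(df["width"].mean()))    # or impute with a summary statistic
```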
Matplotlib
➢ Matplotlib is an incredibly powerful (and beautiful!) 2-D plotting
library. It’s easy to use and provides a huge number of examples
for tackling unique problems.
➢ Similar to MATLAB

Seaborn
➢ Seaborn builds on Matplotlib and has more convenient commands and options
❑ Kaggle 4-hour course on information visualization (by Alexis Cook and Dan Becker)
https://www.kaggle.com/learn/data-visualization

pyplot
➢ At the center of most matplotlib scripts is pyplot.
➢ The pyplot module is stateful and tracks changes to a figure. All
pyplot functions revolve around creating or manipulating the
state of a figure.

pyplot
➢ The plot function can actually take any number of arguments.
➢ The format string argument associated with a pair of sequence
objects indicates the color and line type of the plot (e.g. ‘bs’
indicates blue squares and ‘ro’ indicates red circles).
➢ Generally speaking, the x_values and y_values will be numpy
arrays and if not, they will be converted to numpy arrays
internally.
➢ Line properties can be set via keyword arguments to the plot
function. Examples include label, linewidth, animated, color,
etc…
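
A minimal pyplot sketch using the format strings described above; the data is arbitrary:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0.0, 5.0, 0.5)
plt.plot(x, x ** 2, 'bs', label="x^2")       # blue squares
plt.plot(x, x ** 1.5, 'ro', label="x^1.5")   # red circles
plt.plot(x, x, color="green", linewidth=2, label="x")  # keyword line properties
plt.legend()
plt.show()
```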
Jupyter Notebook
➢ All of these libraries come preinstalled on Google Colab
➢ Google Colab uses a Jupyter notebook environment
that runs in the cloud and requires no setup to use
➢ Runs in Python 3
➢ Includes all the commonly used machine learning (data
science) libraries
➢ e.g., NumPy, SciPy, Matplotlib, Pandas, PyTorch, TensorFlow, etc.

➢ Alternatively, you can use a Jupyter notebook on your own computer
Part 2
Python Libraries and Titanic
Let’s take a look at the week 3 Jupyter Notebook

Part 3
Decision Trees
Decision Trees
➢ A rule-based supervised learning algorithm
➢ Powerful algorithm capable of fitting complex
datasets.
➢ Can be applied to classification (discrete) and
regression (continuous) tasks.
➢ Highly interpretable!

➢ A fundamental component of Random Forests, which are one of the most used machine learning algorithms today
Lemon vs. Orange!

Flowchart-like structure!

Test example

Constructing a Decision Tree
➢ Decision trees make predictions by recursively splitting
on different attributes according to a tree structure

(Figure: recursive splits on width (cm).)
What if the attributes are discrete?

What if the attributes are discrete?

Attributes are the features (inputs)! They can be discrete or continuous.

Output is Discrete
(Example: a classification tree whose leaves predict "Wait for table" or "Go somewhere else".)
Output is Continuous (Regression)

➢ Instead of predicting a
class at each leaf node,
predict a value based on
the average of all
instances at the leaf node.

Source: GDCoder
Summary: Discrete vs Continuous Output
➢ Classification Tree:
➢ discrete output
➢ output node (leaf) typically set to the most common value

➢ Regression Tree:
➢ continuous output
➢ output node (leaf) value typically set to the mean value in data

Generalization
➢ Decision trees can fit any function arbitrarily closely
➢ Could potentially create a leaf for each example in
the training dataset
➢ Not likely to generalize to test data!

➢ Need some way to prune the tree!

Managing Overfitting
➢ Add parameters to reduce potential for overfitting
➢ Parameters include:
➢ depth of tree
➢ minimum number of samples

Random Forests
➢ One of the most popular variants of
decision trees
➢ Addresses overfitting by training multiple trees on subsamples of the data and random subsets of the features, among other things
➢ Majority vote of all the trees is used
to make the final output

Source: Venkata Jagannath
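
A minimal random-forest sketch with scikit-learn; the dataset and hyperparameters are illustrative:

```python
# Each tree sees a bootstrap sample and a random subset of features;
# the forest predicts by majority vote over all trees.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0)
forest.fit(X, y)
print(forest.predict(X[:3]))
```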

Decision Trees are interpretable models
Gini Impurity is a measurement of the likelihood of
an incorrect classification of a new instance of a random
variable, if that new instance were randomly classified
according to the distribution of class labels from the data set.
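
For a node whose samples have class proportions $p_1, \dots, p_K$, the Gini impurity is:

$$G = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^{2}$$

A pure node ($p_k = 1$ for some class) has $G = 0$; a 50/50 binary split has $G = 0.5$.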

Comparison to k-NN
➢ There are many advantages of Decision Trees over k-Nearest Neighbours:
➢ Good with discrete attributes
➢ Robust to scale of inputs (does not require normalization)
➢ Easily handle missing values
➢ Good at handling lots of attributes, especially when only a few are important
➢ Fast test time
➢ More interpretable
➢ Decision trees are not good at handling rotations in the data
➢ Individual decision trees have limited predictive performance (addressed by more advanced tree-based models)

Next Time
➢ Week 3 Q&A Support Session
➢ Help with Python and Project 1
➢ Reading assignment 3 is out
➢ Project 1 is out
➢ Week 4 Lecture – Uncertainty and Performance
➢ K-Means Clustering
➢ Probability Theory
➢ Summary Statistics
➢ Multivariate Gaussians
➢ Performance Metrics
Decision Trees Code Example (Google Colab)
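
The Colab notebook itself is not reproduced in these slides; a minimal stand-in with scikit-learn on toy data might look like:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)  # pruning knobs
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))   # accuracy on held-out data
print(export_text(tree))            # the learned rules, human-readable
```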
