
APS1070

Foundations of Data Analytics and Machine Learning
Fall 2022
Lecture 3:
• End-to-end Machine Learning
• Data Retrieval and Preparation
• Plotting and Visualization
• Making Predictions
• Decision Trees

Samin Aref
Agenda
➢ Today’s focus is on Foundations of Learning
1. End-to-end machine learning
2. Python Libraries
—NumPy
—Matplotlib
—Pandas
—Scikit-Learn
3. Decision Trees

Part 1
End-to-End Machine Learning
End-to-End Machine Learning

1. Understand the problem
2. Retrieve the data
3. Explore and visualize the data to gain insights
4. Prepare the data for the algorithm/model
5. Select and train the algorithm/model
6. Fine-tune your algorithm/model
7. Present your solution
8. Launch, monitor, and maintain your system
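
A minimal sketch of steps 2–6 with scikit-learn, assuming a tabular CSV dataset; the file name, the "target" column, and the model choice are placeholders:

```python
# Hypothetical end-to-end sketch; "data.csv" and the "target" column
# stand in for a real dataset.
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("data.csv")                       # 2. retrieve the data
print(df.describe())                               # 3. explore the data
X, y = df.drop(columns=["target"]), df["target"]   # 4. prepare the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
search = GridSearchCV(DecisionTreeClassifier(),    # 5. select and train
                      {"max_depth": [2, 4, 8]})    # 6. fine-tune
search.fit(X_train, y_train)
print(search.score(X_test, y_test))                # test before launching
```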
End-to-End Machine Learning

(Pipeline: Understand Problem → Data Collection → Data Visualization → Data Preparation → Model Selection → Model Training → Test and Assess)
Classification vs. Regression
➢ Classification: discrete target
➢ Separate the dataset
➢ Apples or oranges?
➢ Dog or cat?
➢ Handwritten digit recognition

➢ Regression: continuous target
➢ Fit the dataset
➢ Price of a house
➢ Revenue of a company
➢ Age of a tree

(Figures: a classifier separating two classes in a Feature #1 vs. Feature #2 plot, and a regressor fitting Target vs. Feature #1.)

Understand the Problem
➢ Often, we need to make some sort of decisions (predictions)
➢ Two common types of decisions that we make are:
➢ Classification: discrete number of possibilities
➢ Regression: continuous number of real-valued possibilities

Learning tasks by output type:

             Supervised        Unsupervised
Discrete     classification    clustering
Continuous   regression        dimensionality reduction

Understand the Problem
Input data is represented
by features that can come
in many forms:

➢ Raw pixels
➢ Histograms
➢ Tabular data
➢ Spectrograms
➢…

Data Exploration
➢ Understand your data through
visualization
➢ Assess the difficulty of the problem

➢ You have a data set D = {(x(i),y(i))}


➢ You want to learn y = f(x) from D
➢ more precisely, you want to minimize
error in predictions

➢ What kind of model (algorithm) do you need?

(Figure: a low-dimensional projection of MNIST.)
Model Selection
Many classifiers to choose from
➢ Support-Vector Machine (SVM)
➢ Logistic Regression
➢ Random Forests
➢ Naive Bayes
➢ Bayesian network
➢ K-Nearest Neighbour
➢ (Deep) Neural networks
➢ Etc.
Model Selection
➢ Often the easiest algorithm to
implement is k-Nearest Neighbours
➢ Match to similar data using a distance
metric

Q: What happens as we increase #data?


Q: What about as #data approaches
infinity?
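
A minimal k-NN sketch with scikit-learn; the iris data below stands in for any small labelled dataset:

```python
# k-NN stores the training data and predicts by majority vote of the
# k nearest points, so test time grows with the amount of stored data.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)  # Euclidean distance by default
knn.fit(X, y)                              # "training" is just memorization
print(knn.predict(X[:3]))                  # vote among the 5 nearest neighbours
```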

Test and Assess
➢ Unlike us, computers have no trouble with
memorization.
➢ The real question is, how well does our
algorithm make predictions on new data?

➢ We need a way to measure how well our algorithm (model) generalizes to new, never-before-seen data.

Regression Example
➢ Let’s look at a more concrete example…
➢ Given noisy sample data (blue), we want to find the polynomial that generated the data
➢ Q: What kind of a problem is this?

(Figure: stock price vs. media exposure.)
Mean Squared Error
➢ We first need to define our error term; in this case we can use the mean squared error (MSE)
➢ Error is measured by finding the squared error in the prediction of y from x
➢ The error for the red polynomial can be measured by the mean of the squared vertical errors
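
In this notation, with $N$ training pairs $(x^{(i)}, y^{(i)})$ and a candidate polynomial $f$, the mean squared error is:

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( f(x^{(i)}) - y^{(i)} \right)^{2}$$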
Fitting the Data
Q: Which polynomial fits the
data best?
➢ based on training data?
➢ based on test data?

Overfitting vs Underfitting

➢ Underfit: high training error and high test error
➢ Good fit: acceptable training error and acceptable test error
➢ Overfit: perfect (zero) training error and high test error
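
A minimal NumPy sketch of this comparison; the noisy data below is synthetic (a sine curve plus noise), chosen purely for illustration:

```python
# Fit polynomials of increasing degree to 10 noisy samples and compare
# their training MSE; high degrees fit the noise (overfitting).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)     # least-squares polynomial fit
    mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(degree, mse)                    # training MSE shrinks with degree,
                                          # but test MSE would not
```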

Generalization
➢ Giving the model a greater capacity (more complexity) to
fit the data… does not necessarily help
➢ How do we evaluate the model's performance? Verify the model on new data!
Overfitting
➢ In brief: fitting characteristics of training data that do
not generalize to future test data

➢ Central problem in machine learning


➢ Particularly problematic if #data << #parameters
➢ … don’t have enough data to “identify” parameters

Generalization
➢ Machine learning is a game of balance, with our objective
being to generalize to all possible future data

(Figure: error (% incorrect) vs. model capacity (complexity). Error on the training samples keeps decreasing as capacity grows, while error on new samples falls and then rises again: too little capacity underfits, too much overfits.)


Bias-Variance Trade-off

➢ Models with too few parameters are inaccurate because of a large bias (not enough flexibility).

➢ Models with too many parameters are inaccurate because of a large variance (too much sensitivity to the sample).

Inductive Bias
➢ Let’s avoid making assumptions about the model (polynomial order)
➢ Assume for simplicity that D = {(x(i),y(i))} is noise free
➢ x(i)’s in D only cover small subset of input space x
➢ Q: What’s the best we can do?
➢ If we’ve seen x=x(i) report y=y(i)
➢ If we have not seen x= x(i), can’t say anything (no assumptions)
➢ This is called rote learning… boring, eh?
➢ Key idea: you can't generalize to unseen data w/o assumptions!
➢ Thus, key to ML is generalization
➢ To generalize, ML algorithm must have some inductive bias
➢ Bias usually in the form of a restricted model (hypothesis) space
➢ Important to understand restrictions (and whether appropriate)
Inductive Bias
➢ Example: Nearest neighbors
– We assume that most of the cases in a small neighbourhood in feature space belong to the same class: given a case whose class is unknown, we predict that it belongs to the same class as the majority in its immediate neighbourhood.
– This is the inductive bias used in the k-nearest neighbours algorithm.

Training and Testing Data
➢ Track generalization error by splitting data into
training and testing
➢ e.g., 80% training and 20% testing

➢ More data = better model


➢ We would like to use all our data for training; however, we need some way to evaluate our model
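
A minimal sketch of the 80/20 split with scikit-learn; the arrays below are toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features (toy data)
y = np.arange(10) % 2              # toy binary labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # 80% training, 20% testing
print(X_train.shape, X_test.shape)          # (8, 2) (2, 2)
```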

The problem with tracking test accuracy
➢ What should K be?

➢ If we track test error/accuracy in our training curve, then:
➢ We may make decisions about model architecture using the test accuracy, making the testing meaningless.
➢ The final test accuracy will not be a realistic estimate of how our model will perform on a new data set!
Validation Set
➢ We still want to track the loss/accuracy on a data set not used for training
➢ Idea: set aside a separate data set, called the validation set
➢ Track validation accuracy in the training curve
➢ Make decisions about model architecture using the validation set
K is a hyperparameter.
We tune hyperparameters using the validation set

Validation and Holdout Data
➢ Training, Validation and Testing Data
➢ Less data for your training model
➢ Ideally use the holdout data only once
➢ Requires a great deal of discipline to not look at the
holdout data

(Figure: splitting data into training, validation, and holdout sets.)

Cross-Validation
➢ Splitting training and validation data into several folds during training
➢ This is known as k-fold Cross-Validation
➢ Model parameters are selected based on the average achieved over the k folds
Source: scikit-learn
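
A minimal 5-fold cross-validation sketch with scikit-learn, treating k-NN's n_neighbors as the hyperparameter being tuned; the dataset and candidate values are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for k in (1, 5, 15):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(k, scores.mean())   # pick the k with the best average over the folds
```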
Data Processing
➢ Q: You test your model on new data and you find it fails to predict certain samples. Why could this be happening?

(Figure: test data lying outside the range of the training data.)
Data Augmentation
➢ For example, how can your algorithm (model) predict on rotations if it has never seen a rotated sample?

➢ Apply Data Augmentation! For instance (see the sketch below):
➢ translation,
➢ scaling,
➢ rotation,
➢ reflection,
➢…

Linear Algebra to the Rescue!
Source: https://morioh.com/p/928228425a08
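
A minimal sketch of one such augmentation, rotating 2-D points with a rotation matrix; the angle and points below are arbitrary:

```python
import numpy as np

theta = np.deg2rad(15)                           # arbitrary rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # 2-D rotation matrix
points = np.array([[1.0, 0.0], [0.0, 1.0]])      # toy sample coordinates
print(points @ R.T)                              # rotated copies of the samples
```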

More Data Processing
➢ Q: A large input feature size (short and wide data) is problematic. Why do you think that is?

➢ Curse of dimensionality!
➢ As the number of features grows, you require more model capacity (complexity) to represent the data
➢ Models of greater complexity require exponentially more training data

Dimensionality Reduction
Solution:
➢ Reduce the number of features using dimensionality reduction
➢ Principal Component Analysis (PCA)
➢ more details provided in weeks 7 and 8

Source: Data Courses
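
A minimal PCA sketch with scikit-learn; the dataset and the choice of two components are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)            # 150 samples, 4 features
X_2d = PCA(n_components=2).fit_transform(X)  # project onto top 2 components
print(X_2d.shape)                            # (150, 2)
```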

Deep Learning
➢ Principal Component Analysis (PCA) is limited
to linear transformations
➢ Deep Learning techniques can be used to
learn and apply nonlinear transformations
for dimensionality reduction
➢ More detail on model-based machine learning
techniques in weeks 9 – 11

Roadmap for the rest of APS1070

(Pipeline: Understand Problem → Data Collection → Data Visualization → Data Preparation → Model Selection → Model Training → Test and Assess)

End-to-end machine learning is just one piece of the pie. The concepts we’ll cover in
this course have utility that goes far beyond machine learning.
Basic Python Check-up
Tutorials 0 and 1: Python Basics
❑ Data Types
❑ Single: int, float, bool
❑ Multiple: str, list, set, tuple, dict
❑ Conditionals: if, elif, else
❑ Functions: def, return, recursion, default vals
❑ Loops: for, while, range; list comprehension
❑ Operations
❑ arithmetic: +, *, -, /, //, %, **
❑ boolean: not, and, or
❑ relational: ==, !=, >, <, >=, <=
❑ index [], slice [::], mutability
❑ Display: print, end, sep
❑ Files: open, close, with; read, write; CSV
❑ Object-Oriented Programming (OOP): class, methods, attributes; __init__, __str__, polymorphism
Other resources for Python
➢ Toronto-based and internationally popular resources:
➢ Kaggle 5-hour course on Python (by Colin Morris)
https://www.kaggle.com/learn/python
➢U of T MOOC Learn to Program: The Fundamentals
https://www.coursera.org/learn/learn-to-program
➢U of T MOOC Learn to Program: Crafting Quality Code
https://www.coursera.org/learn/program-code
➢U of T Coders (student-run group)
https://uoftcoders.github.io/

➢ Google is your (BEST) friend!


➢ APS1070 Piazza Discussion Board
Scientific Computing Tools for Python
➢ Scientific computing in Python builds upon a small core of
packages:
➢ NumPy, the fundamental package for numerical computation. It defines the
numerical array and matrix types and basic operations on them.
➢ The SciPy library, a collection of numerical algorithms and domain-specific
toolboxes, including signal processing, optimization, statistics and much
more.
➢ Matplotlib, a mature and popular plotting package that provides publication-quality 2D plotting as well as rudimentary 3D plotting

➢ Data and computation:


➢ pandas, providing high-performance, easy to use data structures.
➢ scikit-learn is a collection of algorithms and tools for machine learning.

Source: https://www.scipy.org/about.html
NumPy
➢ Let’s start with NumPy. Among other things, NumPy
contains:
➢ A powerful N-dimensional array object.
➢ Sophisticated (broadcasting/universal) functions.
➢ Tools for integrating C/C++ and Fortran code.
➢ Useful linear algebra, Fourier transform, and random number
capabilities.
➢ Besides its obvious scientific uses, NumPy can also be used as
an efficient multi-dimensional container of generic data.
➢ Many other python libraries are built on NumPy
➢ Provides vectorization of mathematical operations on arrays and matrices, which significantly improves performance
NumPy
➢ The key to NumPy is the ndarray object, an n-dimensional array
of homogeneous data types, with many operations being
performed in compiled code for performance.
➢ There are several important differences between NumPy arrays
and the standard Python sequences:
➢ NumPy arrays have a fixed size. Modifying the size means creating a
new array.
➢ All elements of a NumPy array must be of the same data type, but this type can be a Python object.
➢ More efficient mathematical operations than built-in sequence types

NumPy
➢ To begin, NumPy supports a wider variety of data types than
are built-in to the Python language by default. They are defined
by the numpy.dtype class and include:
➢ intc (same as a C integer) and intp (used for indexing)
➢ int8, int16, int32, int64
➢ uint8, uint16, uint32, uint64
➢ float16, float32, float64
➢ complex64, complex128
➢ bool_, int_, float_, complex_ are shorthand for defaults.
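
A small sketch of selecting and converting dtypes; the values are arbitrary:

```python
import numpy as np

x = np.array([1, 2, 3], dtype=np.int8)   # 1 byte per element
y = x.astype(np.float32)                 # explicit conversion to a float type
print(x.dtype, y.dtype)                  # int8 float32
```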

NumPy
➢ There are a couple of mechanisms for creating arrays in NumPy:
➢ Conversion from other Python structures (e.g., lists, tuples).
➢ Built-in NumPy array creation functions (e.g., arange, ones, zeros, etc.).
➢ Reading arrays from disk, either from standard or custom formats (e.g.
reading in from a CSV file).
➢ and others …

➢ In general, any numerical data that is stored in an array-like container can be converted to an ndarray through use of the array() function. The most obvious examples are sequence types like lists and tuples.
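
A minimal sketch of these creation mechanisms; the CSV line is commented out because the file name is a placeholder:

```python
import numpy as np

a = np.array([1, 2, 3])        # conversion from a Python list
b = np.array((4.0, 5.0))       # conversion from a tuple
c = np.arange(0, 10, 2)        # built-in creation: 0, 2, 4, 6, 8
d = np.zeros((2, 3))           # built-in creation: 2x3 array of zeros
# e = np.loadtxt("data.csv", delimiter=",")   # reading from disk
print(a.dtype, b.dtype, c, d.shape)
```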
SciPy
➢ Collection of algorithms for linear algebra, differential
equations, numerical integration, optimization, statistics and
much more
➢ Part of SciPy Stack
➢ Built on NumPy

➢ With SciPy, an interactive Python session becomes a data-processing and system-prototyping environment rivaling systems such as MATLAB, IDL, Octave, R-Lab, and SciLab.

SciPy
➢ SciPy’s functionality is implemented in a number of specific sub-
modules. These include:
➢ Special mathematical functions (scipy.special) -- airy, elliptic, bessel, etc.
➢ Integration (scipy.integrate)
➢ Optimization (scipy.optimize)
➢ Interpolation (scipy.interpolate)
➢ Fourier Transforms (scipy.fftpack)
➢ Signal Processing (scipy.signal)
➢ Linear Algebra (scipy.linalg)
➢ Statistics (scipy.stats)
➢ Multidimensional image processing (scipy.ndimage)
➢ Data IO (scipy.io)
➢ and more!
Pandas
➢ Adds data structures and tools designed to work with table-like
data (similar to Series and Data Frames in R)
➢ Provides tools for data manipulation: reshaping, merging,
sorting, slicing, aggregation etc.
➢ Aggregation - computing a summary statistic for groups
➢ min, max, count, sum, prod, mean, median, mode, mad, std, var

➢ Allows for handling missing data

Source: http://pandas.pydata.org/
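
A minimal pandas sketch of aggregation and missing-data handling; the table below is made up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"fruit": ["apple", "apple", "orange", "orange"],
                   "width": [7.1, 6.8, 9.0, np.nan]})
print(df.groupby("fruit")["width"].agg(["mean", "count"]))  # aggregation
print(df.dropna())                      # drop rows with missing values
print(df.fillna(df["width"].mean()))    # or impute with a summary statistic
```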
Matplotlib
➢ Matplotlib is an incredibly powerful (and beautiful!) 2-D plotting
library. It’s easy to use and provides a huge number of examples
for tackling unique problems.
➢ Similar to MATLAB

Seaborn
➢ Seaborn builds on Matplotlib and has more convenient commands and options
❑ Kaggle 4-hour course on information visualization (by Alexis Cook and Dan Becker)
https://www.kaggle.com/learn/data-visualization

pyplot
➢ At the center of most matplotlib scripts is pyplot.
➢ The pyplot module is stateful and tracks changes to a figure. All
pyplot functions revolve around creating or manipulating the
state of a figure.

pyplot
➢ The plot function can actually take any number of arguments.
➢ The format string argument associated with a pair of sequence
objects indicates the color and line type of the plot (e.g. ‘bs’
indicates blue squares and ‘ro’ indicates red circles).
➢ Generally speaking, the x_values and y_values will be numpy
arrays and if not, they will be converted to numpy arrays
internally.
➢ Line properties can be set via keyword arguments to the plot
function. Examples include label, linewidth, animated, color,
etc…
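
A minimal pyplot sketch using the format strings described above; the data is arbitrary:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0.0, 5.0, 0.5)
plt.plot(x, x ** 2, 'bs', label="x^2")       # blue squares
plt.plot(x, x ** 1.5, 'ro', label="x^1.5")   # red circles
plt.plot(x, x, color="green", linewidth=2, label="x")  # keyword line properties
plt.legend()
plt.show()
```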
Jupyter Notebook
➢ All of these libraries come preinstalled on Google Colab
➢ Google Colab uses a Jupyter notebook environment
that runs in the cloud and requires no setup to use
➢ Runs in Python 3
➢ Includes all the commonly used machine learning (data
science) libraries
➢ e.g., NumPy, SciPy, Matplotlib, Pandas, PyTorch, TensorFlow, etc.

➢ Alternatively, you can use a Jupyter notebook on your own computer
Part 2
Python Libraries and Titanic
Let’s take a look at the week 3 Jupyter Notebook

Part 3
Decision Trees
Decision Trees
➢ A rule-based supervised learning algorithm
➢ Powerful algorithm capable of fitting complex
datasets.
➢ Can be applied to classification (discrete) and
regression (continuous) tasks.
➢ Highly interpretable!

➢ A fundamental component of Random Forests, which are one of the most used machine learning algorithms today
Lemon vs. Orange!

Flowchart-like structure!

Test example

Constructing a Decision Tree
➢ Decision trees make predictions by recursively splitting
on different attributes according to a tree structure

(Figure: recursive splits on width (cm).)
What if the attributes are discrete?

What if the attributes are discrete?

Attributes are the features (inputs)! They can be discrete or continuous.

Output is Discrete
(Example: a classification tree whose leaves predict "Wait for table" or "Go somewhere else".)
Output is Continuous (Regression)

➢ Instead of predicting a
class at each leaf node,
predict a value based on
the average of all
instances at the leaf node.

Source: GDCoder
Summary: Discrete vs Continuous Output
➢ Classification Tree:
➢ discrete output
➢ output node (leaf) typically set to the most common value

➢ Regression Tree:
➢ continuous output
➢ output node (leaf) value typically set to the mean value in data

Generalization
➢ Decision trees can fit any function arbitrarily closely
➢ Could potentially create a leaf for each example in
the training dataset
➢ Not likely to generalize to test data!

➢ Need some way to prune the tree!

Managing Overfitting
➢ Add parameters to reduce potential for overfitting
➢ Parameters include:
➢ depth of tree
➢ minimum number of samples

Random Forests
➢ One of the most popular variants of
decision trees
➢ Addresses overfitting by training multiple trees on subsamples of the data and random subsets of the features, among other things
➢ Majority vote of all the trees is used
to make the final output

Source: Venkata Jagannath
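
A minimal random-forest sketch with scikit-learn; the dataset and hyperparameters are illustrative:

```python
# Each tree sees a bootstrap sample and a random subset of features;
# the forest predicts by majority vote over all trees.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0)
forest.fit(X, y)
print(forest.predict(X[:3]))
```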

Decision Trees are interpretable models
Gini Impurity is a measurement of the likelihood of
an incorrect classification of a new instance of a random
variable, if that new instance were randomly classified
according to the distribution of class labels from the data set.
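
For a node whose samples have class proportions $p_1, \dots, p_K$, the Gini impurity is:

$$G = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^{2}$$

A pure node ($p_k = 1$ for some class) has $G = 0$; a 50/50 binary split has $G = 0.5$.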

Comparison to k-NN
➢ There are many advantages of Decision Trees over k-Nearest Neighbours:
➢ Good with discrete attributes
➢ Robust to scale of inputs (does not require normalization)
➢ Easily handle missing values
➢ Good at handling lots of attributes, especially when only a few are important
➢ Fast test time
➢ More interpretable
➢ Decision trees are not good at handling rotations in the data
➢ Individual decision trees have limited predictive performance (addressed by more advanced tree-based models)

Next Time
➢ Week 3 Q&A Support Session
➢ Help with Python and Project 1
➢ Reading assignment 3 is out
➢ Project 1 is out
➢ Week 4 Lecture – Uncertainty and Performance
➢ K-Means Clustering
➢ Probability Theory
➢ Summary Statistics
➢ Multivariate Gaussians
➢ Performance Metrics
Decision Trees Code Example (Google Colab)
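
The Colab notebook itself is not reproduced in these slides; a minimal stand-in with scikit-learn on toy data might look like:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)  # pruning knobs
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))   # accuracy on held-out data
print(export_text(tree))            # the learned rules, human-readable
```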
