0% found this document useful (0 votes)

13 views31 pages

(MLP) MidtermNote

Uploaded by

quynhanhphamnguyen221

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

13 views31 pages

(MLP) MidtermNote

Uploaded by

quynhanhphamnguyen221

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 31

01 and 02: Introduction, Regression Analysis, and Gradient Descent

What is machine learning?

● Here we...
○ Define what it is
○ When to use it
● Not a well defined definition
○ Couple of examples of how people have tried to define it
● Arthur Samuel (1959)
○ Machine learning: "Field of study that gives computers the ability to learn without being explicitly
programmed"
■ Samuels wrote a checkers playing program
■ Had the program play 10000 games against itself
■ Work out which board positions were good and bad depending on wins/losses
● Tom Michel (1999)

○ Well posed learning problem: "A computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P,
improves with experience E."
■ The checkers example,
■ E = 10000s games
■ T is playing checkers
■ P if you win or not
Several types of learning algorithms
○ Supervised learning
■ Teach the computer how to do something, then let it use it;s new found knowledge to do it
○ Unsupervised learning
■ Let the computer learn how to do something, and use this to determine structure and patterns in data
○ Reinforcement learning
○ Recommender systems
This course
○ Look at practical advice for applying learning algorithms
○ Learning a set of tools and how to apply them

Supervised learning - introduction

Probably the most common problem type in machine learning

Starting with an example
How do we predict housing prices
■ Collect data regarding housing prices and how they relate to size in feet

Example problem: "Given this data, a friend has a house 750 square feet - how much can they be expected to get?"

What approaches can we use to solve this?

○ Straight line through data
■ Maybe $150 000
○ Second order polynomial
■ Maybe $200 000
○ One thing we discuss later - how to chose straight or curved line?
○ Each of these approaches represent a way of doing supervised learning
What does this mean?
○ We gave the algorithm a data set where a "right answer" was provided
○ So we know actual prices for houses
■ The idea is we can learn what makes the price a certain value from the training data
■ The algorithm should then produce more right answers based on new training data where we don't know the
price already
■ i.e. predict the price
We also call this a regression problem
○ Predict continuous valued output (price)
○ No real discrete delineation
Another example
○ Can we definer breast cancer as malignant or benign based on tumour size
● Looking at data

○ Five of each
○ Can you estimate prognosis based on tumor size?
○ This is an example of a classification problem
■ Classify data into one of two discrete classes - no in between, either malignant or not
■ In classification problems, can have a discrete number of possible values for the output
■ e.g. maybe have four values
■ 0 - benign
■ 1 - type 1
■ 2 - type 2
■ 3 - type 4
● In classification problems we can plot data in a different way

● Use only one attribute (size)

○ In other problems may have multiple attributes
○ We may also, for example, know age and tumor size

● Based on that data, you can try and define separate classes by
○ Drawing a straight line between the two groups
○ Using a more complex function to define the two groups (which we'll discuss later)
○ Then, when you have an individual with a specific tumor size and who is a specific age, you can hopefully use that
information to place them into one of your classes
● You might have many features to consider
○ Clump thickness
○ Uniformity of cell size
○ Uniformity of cell shape
● The most exciting algorithms can deal with an infinite number of features
○ How do you deal with an infinite number of features?
○ Neat mathematical trick in support vector machine (which we discuss later)
■ If you have an infinitely long list - we can develop and algorithm to deal with that
● Summary
○ Supervised learning lets you get the "right" data a
○ Regression problem
○ Classification problem

Unsupervised learning - introduction

● Second major problem type

● In unsupervised learning, we get unlabeled data
○ Just told - here is a data set, can you structure it
● One way of doing this would be to cluster data into to groups
○ This is a clustering algorithm
Linear Regression

● Housing price data example used earlier

○ Supervised learning regression problem
● What do we start with?
○ Training set (this is your data set)
○ Notation (used throughout the course)
■ m = number of training examples
■ x's = input variables / features
■ y's = output variable "target" variables
■ (x,y) - single training example
■ (xi, yj) - specific example (ith training example)

■ i is an index to training set

● With our training set defined - how do we used it?

○ Take training set
○ Pass into a learning algorithm
○ Algorithm outputs a function (denoted h ) (h = hypothesis)
■ This function takes an input (e.g. size of new house)
■ Tries to output the estimated value of Y
● How do we represent hypothesis h ?
○ Going to present h as;
■ hθ(x) = θ0 + θ1x
■ h(x) (shorthand)

● What does this mean?

○ Means Y is a linear function of x!
○ θi are parameters
■ θ0 is zero condition
■ θ1 is gradient
● This kind of function is a linear regression with one variable
○ Also called univariate linear regression
● So in summary
○ A hypothesis takes in some variable
○ Uses parameters determined by a learning system
○ Outputs a prediction based on that input

Linear regression - implementation (cost function)

A cost function lets us figure out how to fit the best straight line to our data
Choosing values for θi (parameters)
○ Different values give you different functions
○ If θ0 is 1.5 and θ1 is 0 then we get straight line parallel with X along 1.5 @ y
○ If θ1 is > 0 then we get a positive slope
Based on our training set we want to generate parameters which make the straight line
○ Chosen these parameters so hθ(x) is close to y for our training examples
■ Basically, uses xs in training set with hθ(x) to give output which is as close to the actual y value as possible
■ Think of hθ(x) as a "y imitator" - it tries to convert the x into y, and considering we already have y we can
evaluate how well hθ(x) does this
To formalize this;
○ We want to want to solve a minimization problem
○ Minimize (hθ(x) - y)2
■ i.e. minimize the difference between h(x) and y for each/any/every example
○ Sum this over the training set
● Minimize squared different between predicted house price and actual house price
○ 1/2m
■ 1/m - means we determine the average
■ 1/2m the 2 makes the math a bit easier, and doesn't change the constants we determine at all (i.e. half the
smallest value is still the smallest value!)
○ Minimizing θ0/θ1 means we get the values of θ0 and θ1 which find on average the minimal deviation of x from y when we
use those parameters in our hypothesis function
● More cleanly, this is a cost function

● And we want to minimize this cost function

○ Our cost function is (because of the summation term) inherently looking at ALL the data in the training set at any time
● So to recap
○ Hypothesis - is like your prediction machine, throw in an x value, get a putative y value

○ Cost - is a way to, using your training data, determine values for your θ values which make the hypothesis as accurate as
possible

■ This cost function is also called the squared error cost function
■ This cost function is reasonable choice for most regression functions
■ Probably most commonly used function
○ In case J(θ0,θ1) is a bit abstract, going into what it does, why it works and how we use it in the coming sections

Cost function - a deeper look

Lets consider some intuition about the cost function and why we want to use it
○ The cost function determines parameters
○ The value associated with the parameters determines how your hypothesis behaves, with different values generate
different
Simplified hypothesis
○ Assumes θ0 = 0

● Cost function and goal here are very similar to when we have θ0, but with a simpler parameter
○ Simplified hypothesis makes visualizing cost function J() a bit easier
So hypothesis pass through 0,0
Two key functins we want to understand
hθ(x): Hypothesis is a function of x - function of what the size of the house is
J(θ1): Is a function of the parameter of θ1
So for example: θ1 = 1, J(θ1) = 0
Plot θ1 vs J(θ1). Data
1) θ1 = 1, J(θ1) = 0
2) θ1 = 0.5, J(θ1) = ~0.58
3) θ1 = 0, J(θ1) = ~2.3
○ If we compute a range of values plot
■ J(θ1) vs θ1 we get a polynomial (looks like a quadratic)

The optimization objective for the learning algorithm is find the value of θ1 which minimizes J(θ1)
○ So, here θ1 = 1 is the best value for θ1

A deeper insight into the cost function - simplified cost function

Assume you're familiar with contour plots or contour figures

Using same cost function, hypothesis and goal as previously
It's OK to skip parts of this section if you don't understand cotour plots
Using our original complex hyothesis with two pariables,
So cost function is
J(θ0, θ1)
Example,
Say
θ0 = 50
θ1 = 0.06
Previously we plotted our cost function by plotting
θ1 vs J(θ1)
Now we have two parameters
Plot becomes a bit more complicated
Generates a 3D surface plot where axis are
■ X = θ1
■ Z = θ0
■ Y = J(θ0,θ1)

● We can see that the height (y) indicates the value of the cost function, so find where y is at a minimum

● Instead of a surface plot we can use a contour figures/plots

○ Set of ellipses in different colors
○ Each colour is the same value of J(θ0, θ1), but obviously plot to different locations because θ1 and θ0 will vary
○ Imagine a bowl shape function coming out of the screen so the middle is the concentric circles

● Each point (like the red one above) represents a pair of parameter values for Ɵ0 and Ɵ1
○ Our example here put the values at
■ θ0 = ~800
■ θ1 = ~-0.15
○ Not a good fit
■ i.e. these parameters give a value on our contour plot far from the center
○ If we have
■ θ0 = ~360
■ θ1 = 0
■ This gives a better hypothesis, but still not great - not in the center of the countour plot
○ Finally we find the minimum, which gives the best hypothesis
● Doing this by eye/hand is a pain in the ass
○ What we really want is an efficient algorithm fro finding the minimum for θ0 and θ1

Gradient descent algorithm

● Minimize cost function J

● Gradient descent
○ Used all over machine learning for minimization
● Start by looking at a general J() function
● Problem
○ We have J(θ0, θ1)
○ We want to get min J(θ0, θ1)
● Gradient descent applies to more general functions
○ J(θ0, θ1, θ2 .... θn)
○ min J(θ0, θ1, θ2 .... θn)

How does it work?

● Start with initial guesses

○ Start at 0,0 (or any other value)
○ Keeping changing θ0 and θ1 a little bit to try and reduce J(θ0,θ1)
● Each time you change the parameters, you select the gradient which reduces J(θ0,θ1) the most possible
● Repeat
● Do so until you converge to a local minimum
● Has an interesting property
○ Where you start can determine which minimum you end up

○ Here we can see one initialization point led to one local minimum
○ The other led to a different one

A more formal definition

● Do the following until convergence

● What does this all mean?

○ Update θj by setting it to (θj - α) times the partial derivative of the cost function with respect to θj
● Notation
○ :=
■ Denotes assignment
■ NB a = b is a truth assertion
○ α (alpha)
■ Is a number called the learning rate
■ Controls how big a step you take
■ If α is big have an aggressive gradient descent
■ If α is small take tiny steps
● Derivative term

○ Not going to talk about it now, derive it later

● There is a subtly about how this gradient descent algorithm is implemented
○ Do this for θ0 and θ1
○ For j = 0 and j = 1 means we simultaneously update both
○ How do we do this?
■ Compute the right hand side for both θ0 and θ1
■ So we need a temp value
■ Then, update θ0 and θ1 at the same time
■ We show this graphically below

● If you implement the non-simultaneous update it's not gradient descent, and will behave weirdly
○ But it might look sort of right - so it's important to remember this!

Understanding the algorithm

To understand gradient descent, we'll return to a simpler function where we minimize one parameter to help explain the algorithm in
more detail
○ min θ1 J(θ1) where θ1 is a real number
Two key terms in the algorithm
○ Alpha
○ Derivative term
Notation nuances
○ Partial derivative vs. derivative
■ Use partial derivative when we have multiple variables but only derive with respect to one
■ Use derivative when we are deriving with respect to all the variables
○ Derivative says
■ Lets take the tangent at the point and look at the slope of the line
■ So moving towards the minimum (down) will create a negative derivative, alpha is always positive, so will
update j(θ1) to a smaller value
■ Similarly, if we're moving up a slope we make j(θ1) a bigger numbers

Alpha term (α)

When you get to a local minimum
○ Gradient of tangent/derivative is 0
○ So derivative term = 0
○ alpha * 0 = 0
○ So θ1 = θ1- 0
○ So θ1 remains the same
As you approach the global minimum the derivative term gets smaller, so your update gets smaller, even with alpha is fixed
○ Means as the algorithm runs you take smaller steps as you approach the minimum

○ So no need to change alpha over time

Linear regression with gradient descent

● Apply gradient descent to minimize the squared error cost function J(θ0, θ1)
● Now we have a partial derivative

● So here we're just expanding out the first expression

○ J(θ0, θ1) = 1/2m....
○ hθ(x) = θ0 + θ1*x
● So we need to determine the derivative for each parameter - i.e.
○ When j = 0
○ When j = 1
● Figure out what this partial derivative is for the θ0 and θ1 case
○ When we derive this expression in terms of j = 0 and j = 1 we get the following

● To check this you need to know multivariate calculus

○ So we can plug these values back into the gradient descent algorithm
● How does it work
○ Risk of meeting different local optimum
○ The linear regression cost function is always a convex function - always has a single minimum
■ Bowl shaped
■ One global optima
■ So gradient descent will always converge to global optima
○ In action
■ Initialize values to
■ θ0 = 900
■ θ1 = -0.1
● End up at a global minimum
● This is actually Batch Gradient Descent
○ Refers to the fact that over each step you look at all the training data
■ Each step compute over m training examples
○ Sometimes non-batch versions exist, which look at small data subsets
■ We'll look at other forms of gradient descent (to use when m is too large) later in the course
● There exists a numerical solution for finding a solution for a minimum function
○ Normal equations method
○ Gradient descent scales better to large data sets though
○ Used in lots of contexts and machine learning

What's next - important extensions

Two extension to the algorithm

● 1) Normal equation for numeric solution

○ To solve the minimization problem we can solve it [ min J(θ0, θ1) ] exactly using a numeric method which avoids the
iterative approach used by gradient descent
○ Normal equations method
○ Has advantages and disadvantages
■ Advantage
■ No longer an alpha term
■ Can be much faster for some problems
■ Disadvantage
■ Much more complicated
○ We discuss the normal equation in the linear regression with multiple features section
● 2) We can learn with a larger number of features
○ So may have other parameters which contribute towards a prize
■ e.g. with houses
■ Size
■ Age
■ Number bedrooms
■ Number floors
■ x1, x2, x3, x4
○ With multiple features becomes hard to plot
■ Can't really plot in more than 3 dimensions
■ Notation becomes more complicated too
■ Best way to get around with this is the notation of linear algebra
■ Gives notation and set of things you can do with matrices and vectors
■ e.g. Matrix

● We see here this matrix shows us

○ Size
○ Number of bedrooms
○ Number floors
○ Age of home
● All in one variable

○ Block of numbers, take all data organized into one big block
● Vector
○ Shown as y
○ Shows us the prices
● Need linear algebra for more complex linear regression modles
● Linear algebra is good for making computationally efficient models (as seen later too)
○ Provide a good way to work with large sets of data sets
○ Typically vectorization of a problem is a common optimization technique
03: Linear Algebra - Review
Matrices - overview

● Rectangular array of numbers written between square brackets

○ 2D array
○ Named as capital letters (A,B,X,Y)
● Dimension of a matrix are [Rows x Columns]
○ Start at top left
○ To bottom left
○ To bottom right
○ R[r x c] means a matrix which has r rows and c columns

■ Is a [4 x 2] matrix
● Matrix elements
○ A(i,j) = entry in ith row and jth column

● Provides a way to organize, index and access a lot of data

Matrix manipulation

● Matrix by vector multiplication

○ [3 x 2] matrix * [2 x 1] vector
■ New matrix is [3 x 1]
■ More generally if [a x b] * [b x c]
■ Then new matrix is [a x c]
■ How do you do it?
■ Take the two vector numbers and multiply them with the first row of the matrix
■ Then add results together - this number is the first number in the new vector
■ The multiply second row by vector and add the results together
■ Then multiply final row by vector and add them together

● Detailed explanation
○ A*x=y
■ A is m x n matrix
■ x is n x 1 matrix
■ n must match between vector and matrix
■ i.e. inner dimensions must match
■ Result is an m-dimensional vector
○ To get yi - multiply A's ith row with all the elements of vector x and
add them up
● Neat trick
○ Say we have a data set with four values
○ Say we also have a hypothesis hθ(x) = -40 + 0.25x
■ Create your data as a matrix which can be multiplied by
a vector
■ Have the parameters in a vector which your matrix can
be multiplied by

○ Means we can do
■ Prediction = Data Matrix * Parameters

■ Here we add an extra column to the data with 1s - this means our θ0 values can be calculated and expressed
The diagram above shows how this works
○ This can be far more efficient computationally than lots of for loops
○ This is also easier and cleaner to code (assuming you have appropriate libraries to do matrix multiplication)
Matrix-matrix multiplication
○ General idea
■ Step through the second matrix one column at a time
■ Multiply each column vector from second matrix by the entire first matrix, each time generating a vector
■ The final product is these vectors combined (not added or summed, but literally just put together)
○ Details
■ AxB=C

■ A = [m x n]
■ B = [n x o]
■ C = [m x o]
■ With vector multiplications o = 1
■ Can only multiply matrix where columns in A match rows in B
○ Mechanism
■ Take column 1 of B, treat as a vector
■ Multiply A by that column - generates an [m x 1] vector
■ Repeat for each column in B
■ There are o columns in B, so we get o columns in C

Implementation/use

● House prices, but now we have three hypothesis and the same data set
● To apply all three hypothesis to all data we can do this efficiently using matrix-matrix multiplication
○ Have
■ Data matrix
■ Parameter matrix
○ Example
■ Four houses, where we want to predict the prize
■ Three competing hypotheses
■ Because our hypothesis are one variable, to make the matrices match up we make our data (houses sizes)
vector into a 4x2 matrix by adding an extra column of 1s

● What does this mean

○ Can quickly apply three hypotheses at once, making 12 predictions
○ Lots of good linear algebra libraries to do this kind of thing very efficiently
Matrix multiplication properties

● Can pack a lot into one operation

○ However, should be careful of how you use those operations
○ Some interesting properties
● Commutativity

○ When working with raw numbers/scalars multiplication is commutative

■ 3 * 5 == 5 * 3
○ This is not true for matrix
■ A x B != B x A
■ Matrix multiplication is not commutative
● Associativity

○ 3 x 5 x 2 == 3 x 10 = 15 x 2
■ Associative property
○ Matrix multiplications is associative
■ A x (B x C) == (A x B) x C
● Identity matrix
○ 1 is the identity for any scalar
■ i.e. 1 x z = z
■ for any real number
○ In matrices we have an identity matrix called I
■ Sometimes called I{n x n}

● See some identity matrices above

○ Different identity matrix for each set of dimensions
○ Has
■ 1s along the diagonals
■ 0s everywhere else
○ 1x1 matrix is just "1"
Has the property that any matrix A which can be multiplied by an identity matrix gives you matrix A back
○ So if A is [m x n] then
A*I
■ I=nxn
I*A
■ I=mxm
(To make inside dimensions match to allow multiplication)
Identity matrix dimensions are implicit
Remember that matrices are not commutative AB != BA
○ Except when B is the identity matrix
○ Then AB == BA

Inverse and transpose operations

● Matrix inverse
○ How does the concept of "the inverse" relate to real numbers?
■ 1 = "identity element" (as mentioned above)
■ Each number has an inverse
■ This is the number you multiply a number by to get the identify element
■ i.e. if you have x, x * 1/x = 1
■ e.g. given the number 3
■ 3 * 3-1 = 1 (the identity number/matrix)
■ In the space of real numbers not everything has an inverse
■ e.g. 0 does not have an inverse
○ What is the inverse of a matrix
■ If A is an m x m matrix, then A inverse = A-1
■ So A*A-1 = I
■ Only matrices which are m x m have inverses
■ Square matrices only!
○ Example
■ 2 x 2 matrix

■ How did you find the inverse

■ Turns out that you can sometimes do it by hand, although this is very hard
■ Numerical software for computing a matrices inverse
■ Lots of open source libraries
○ If A is all zeros then there is no inverse matrix
■ Some others don't, intuition should be matrices that don't have an inverse are a singular matrix or a degenerate
matrix (i.e. when it's too close to 0)
■ So if all the values of a matrix reach zero, this can be described as reaching singularity
Matrix transpose
Have matrix A (which is [n x m]) how do you change it to become [m x n] while keeping the same values
■ i.e. swap rows and columns!
How you do it;
■ Take first row of A - becomes 1st column of AT
■ Second row of A - becomes 2nd column...
A is an m x n matrix
■ B is a transpose of A
■ Then B is an n x m matrix
■ A(i,j) = B(j,i)

04: Linear Regression with Multiple Variables

Linear regression with multiple features

New version of linear regression with multiple features

Multiple variables = multiple features

In original version we had
○ X = house size, use this to predict
○ y = house price
If in a new scheme we have more variables (such as number of bedrooms, number floors, age of the home)
○ x1, x2, x3, x4 are the four features
■ x1 - size (feet squared)
■ x2 - Number of bedrooms
■ x3 - Number of floors
■ x4 - Age of home (years)
○ y is the output variable (price)
More notation
○ n
■ number of features (n = 4)
○ m
■ number of examples (i.e. number of rows in a table)
○ xi
■ vector of the input for an example (so a vector of the four parameters for the ith input example)
■ i is an index into the training set
■ So
■ x is an n-dimensional feature vector
■ x3 is, for example, the 3rd house, and contains the four features associated with that house
i
○ xj
■ The value of feature j in the ith training example
■ So
■ x23 is, for example, the number of bedrooms in the third house
Now we have multiple features
○ What is the form of our hypothesis?
○ Previously our hypothesis took the form;
■ hθ(x) = θ0 + θ1x
■ Here we have two parameters (theta 1 and theta 2) determined by our cost function
■ One variable x
○ Now we have multiple features
■ hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + θ4x4
○ For example

■ hθ(x) = 80 + 0.1x1 + 0.01x2 + 3x3 - 2x4

■ An example of a hypothesis which is trying to predict the price of a house
■ Parameters are still determined through a cost function
○ For convenience of notation, x0 = 1
■ For every example i you have an additional 0th feature for each example
■ So now your feature vector is n + 1 dimensional feature vector indexed from 0

■ This is a column vector called x

■ Each example has a column vector associated with it
■ So let's say we have a new example called "X"
■ Parameters are also in a 0 indexed n+1 dimensional vector
■ This is also a column vector called θ
■ This vector is the same for each example
○ Considering this, hypothesis can be written
■ hθ(x) = θ0x0 + θ1x1 + θ2x2 + θ3x3 + θ4x4
○ If we do
■ hθ(x) =θT X
■ θT is an [1 x n+1] matrix
■ In other words, because θ is a column vector, the transposition operation transforms it into a row
vector
■ So before
■ θ was a matrix [n + 1 x 1]
■ Now
■ θT is a matrix [1 x n+1]
■ Which means the inner dimensions of θT and X match, so they can be multiplied together as
■ [1 x n+1] * [n+1 x 1]
■ = hθ(x)
■ So, in other words, the transpose of our parameter vector * an input example X gives
you a predicted hypothesis which is [1 x 1] dimensions (i.e. a single value)
■ This x0 = 1 lets us write this like this
○ This is an example of multivariate linear regression

Gradient descent for multiple variables

● Fitting parameters for the hypothesis with gradient descent

○ Parameters are θ0 to θn
○ Instead of thinking about this as n separate values, think about the parameters as a single vector (θ)
■ Where θ is n+1 dimensional
● Our cost function is

● Similarly, instead of thinking of J as a function of the n+1 numbers, J() is just a function of the parameter vector
○ J(θ)
● Gradient descent

● Once again, this is

○ θj = θj - learning rate (α) times the partial derivative of J(θ) with respect to θJ(...)
○ We do this through a simultaneous update of every θj value
● Implementing this algorithm
○ When n = 1

● Above, we have slightly different update rules for θ0 and θ1

○ Actually they're the same, except the end has a previously undefined x0(i) as 1, so wasn't shown
● We now have an almost identical rule for multivariate gradient descent
● What's going on here?
○ We're doing this for each j (0 until n) as a simultaneous update (like when n = 1)
○ So, we re-set θj to
■ θj minus the learning rate (α) times the partial derivative of of the θ vector with respect to θj
■ In non-calculus words, this means that we do
■ Learning rate
■ Times 1/m (makes the maths easier)
■ Times the sum of
■ The hypothesis taking in the variable vector, minus the actual value, times the j-th value
in that variable vector for EACH example
○ It's important to remember that

● These algorithm are highly similar

Gradient Decent in practice: 1 Feature Scaling

Having covered the theory, we now move on to learn about some of the practical tricks
Feature scaling
○ If you have a problem with multiple features
○ You should make sure those features have a similar scale
■ Means gradient descent will converge more quickly
○ e.g.
■ x1 = size (0 - 2000 feet)
■ x2 = number of bedrooms (1-5)
■ Means the contours generated if we plot θ1 vs. θ2 give a very tall and thin shape due to the huge range difference
○ Running gradient descent on this kind of cost function can take a long time to find the global minimum

Pathological input to gradient descent

So we need to rescale this input so it's more effective
So, if you define each value from x1 and x2 by dividing by the max for each feature
Contours become more like circles (as scaled between 0 and 1)
May want to get everything into -1 to +1 range (approximately)
Want to avoid large ranges, small ranges or very different ranges from one
another
Rule a thumb regarding acceptable ranges
■ -3 to +3 is generally fine - any bigger bad
■ -1/3 to +1/3 is ok - any smaller bad
Can do mean normalization
Take a feature xi
■ Replace it by (xi - mean)/max
■ So your values all have an average of about 0

● Instead of max can also use standard deviation

Learning Rate α

● Focus on the learning rate (α)

● Topics
○ Update rule
○ Debugging
○ How to chose α

Make sure gradient descent is working

● Plot min J(θ) vs. no of iterations

○ (i.e. plotting J(θ) over the course of gradient descent
● If gradient descent is working then J(θ) should decrease after every iteration
● Can also show if you're not making huge gains after a certain number
○ Can apply heuristics to reduce number of iterations if need be
○ If, for example, after 1000 iterations you reduce the parameters by nearly nothing you could chose to only run 1000
iterations in the future
○ Make sure you don't accidentally hard-code thresholds like this in and then forget about why they're their though!

Number of iterations varies a lot

■ 30 iterations
■ 3000 iterations
■ 3000 000 iterations
■ Very hard to tel in advance how many iterations will be needed
■ Can often make a guess based a plot like this after the first 100 or so iterations
Automatic convergence tests
■ Check if J(θ) changes by a small threshold or less
■ Choosing this threshold is hard
■ So often easier to check for a straight line
■ Why? - Because we're seeing the straightness in the context of the whole algorithm
■ Could you design an automatic checker which calculates a threshold based on the
systems preceding progress?
Checking its working
■ If you plot J(θ) vs iterations and see the value is increasing - means you probably need a smaller α
■ Cause is because your minimizing a function which looks like this

But you overshoot, so reduce learning rate so you actually reach the minimum (green line)

So, use a smaller α

Another problem might be if J(θ) looks like a series of waves
Here again, you need a smaller α
However
If α is small enough, J(θ) will decrease on every iteration
BUT, if α is too small then rate is too slow
■ A less steep incline is indicative of a slow convergence, because we're decreasing by less on each iteration than
a steeper slope
Typically
Try a range of alpha values
Plot J(θ) vs number of iterations for each version of alpha
Go for roughly threefold increases
■ 0.001, 0.003, 0.01, 0.03. 0.1, 0.3
Features and polynomial regression

● Choice of features and how you can get different learning algorithms by choosing appropriate features
● Polynomial regression for non-linear function
Example
House price prediction
■ Two features
■ Frontage - width of the plot of land along road (x1)
■ Depth - depth away from road (x2)
You don't have to use just two features
■ Can create new features
Might decide that an important feature is the land area
■ So, create a new feature = frontage * depth (x3)
■ h(x) = θ0 + θ1x3
■ Area is a better indicator
Often, by defining new features you may get a better model
Polynomial regression
May fit the data better
θ0 + θ1x + θ2x2 e.g. here we have a quadratic function
For housing data could use a quadratic function
■ But may not fit the data so well - inflection point means housing prices decrease when size gets really big
■ So instead must use a cubic function

● How do we fit the model to this data

○ To map our old linear hypothesis and cost functions to these polynomial descriptions the easy thing to do is set
■ x1 = x
■ x2 = x2
■ x3 = x3
○ By selecting the features like this and applying the linear regression algorithms you can do polynomial linear regression
○ Remember, feature scaling becomes even more important here
● Instead of a conventional polynomial you could do variable ^(1/something) - i.e. square root, cubed root etc
● Lots of features - later look at developing an algorithm to chose the best features

Normal equation

● For some linear regression problems the normal equation provides a better solution
● So far we've been using gradient descent
○ Iterative algorithm which takes steps to converse
● Normal equation solves θ analytically
○ Solve for the optimum value of theta
● Has some advantages and disadvantages

How does it work?

Simplified cost function

J(θ) = aθ2 + bθ + c
■ θ is just a real number, not a vector
Cost function is a quadratic function
How do you minimize this?
■ Do

■ Take derivative of J(θ) with respect to θ

■ Set that derivative equal to 0
■ Allows you to solve for the value of θ which minimizes J(θ)
In our more complex problems;
Here θ is an n+1 dimensional vector of real numbers
Cost function is a function of the vector value
■ How do we minimize this function
■ Take the partial derivative of J(θ) with respect θj and set to 0 for every j
■ Do that and solve for θ0 to θn
■ This would give the values of θ which minimize J(θ)
If you work through the calculus and the solution, the derivation is pretty complex
■ Not going to go through here
■ Instead, what do you need to know to implement this process

Example of normal equation

Here
○ m=4
○ n=4
To implement the normal equation
○ Take examples
○ Add an extra column (x0 feature)
○ Construct a matrix (X - the design matrix) which contains all the training data features in an [m x n+1] matrix
○ Do something similar for y
■ Construct a column vector y vector [m x 1] matrix
○ Using the following equation (X transpose * X) inverse times X transpose y

● If you compute this, you get the value of theta which minimize the cost function

General case

Have m training examples and n features

The design matrix (X)
■ Each training example is a n+1 dimensional feature column vector
■ X is constructed by taking each training example, determining its transpose (i.e. column -> row) and using it
for a row in the design A
■ This creates an [m x (n+1)] matrix

Vector y
■ Used by taking all the y values into a column vector
What is this equation?!
○ (XT * X)-1
■ What is this --> the inverse of the matrix (XT * X)
■ i.e. A = XT X
■ A-1 = (XT X)-1
In octave and MATLAB you could do;

pinv(X'*x)*x'*y

■ X' is the notation for X transpose

■ pinv is a function for the inverse of a matrix
In a previous lecture discussed feature scaling
○ If you're using the normal equation then no need for feature scaling

When should you use gradient descent and when should you use feature scaling?

Gradient descent
Need to chose learning rate
Needs many iterations - could make it slower
Works well even when n is massive (millions)
■ Better suited to big data
■ What is a big n though
■ 100 or even a 1000 is still (relativity) small
■ If n is 10 000 then look at using gradient descent
Normal equation
No need to chose a learning rate
No need to iterate, check for convergence etc.
Normal equation needs to compute (XT X)-1
■ This is the inverse of an n x n matrix
■ With most implementations computing a matrix inverse grows by O(n3 )
■ So not great
Slow of n is large
■ Can be much slower

Normal equation and non-invertibility

● Advanced concept
○ Often asked about, but quite advanced, perhaps optional material
○ Phenomenon worth understanding, but not probably necessary
● When computing (XT X)-1 * XT * y)
○ What if (XT X) is non-invertible (singular/degenerate)
■ Only some matrices are invertible
■ This should be quite a rare problem
■ Octave can invert matrices using
■ pinv (pseudo inverse)
■ This gets the right value even if (XT X) is non-invertible
■ inv (inverse)
○ What does it mean for (XT X) to be non-invertible
■ Normally two common causes
■ Redundant features in learning model
■ e.g.
■ x1 = size in feet
■ x2 = size in meters squared
■ Too many features
■ e.g. m <= n (m is much larger than n)
■ m = 10
■ n = 100
■ Trying to fit 101 parameters from 10 training examples
■ Sometimes work, but not always a good idea
■ Not enough data
■ Later look at why this may be too little data
■ To solve this we
■ Delete features
■ Use regularization (let's you use lots of features for a small training set)
○ If you find (XT X) to be non-invertible
■ Look at features --> are features linearly dependent?
■ So just delete one, will solve problem

06: Logistic Regression

Classification

Where y is a discrete value

Develop the logistic regression algorithm to determine what class a new input should fall into
Classification problems
Email -> spam/not spam?
Online transactions -> fraudulent?
Tumor -> Malignant/benign
Variable in these problems is Y
Y is either 0 or 1
■ 0 = negative class (absence of something)
■ 1 = positive class (presence of something)
Start with binary class problems
Later look at multiclass classification problem, although this is just an extension of binary classification
How do we develop a classification algorithm?
Tumour size vs malignancy (0 or 1)
We could use linear regression
■ Then threshold the classifier output (i.e. anything over some value is yes, else no)
■ In our example below linear regression with thresholding seems to work

We can see above this does a reasonable job of stratifying the data points into one of two classes
○ But what if we had a single Yes with a very small tumour
○ This would lead to classifying all the existing yeses as nos
Another issues with linear regression
○ We know Y is 0 or 1
○ Hypothesis can give values large than 1 or less than 0
So, logistic regression generates a value where is always either 0 or 1
○ Logistic regression is a classification algorithm - don't be confused

Hypothesis representation

What function is used to represent our hypothesis in classification

We want our classifier to output values between 0 and 1
○ When using linear regression we did hθ(x) = (θT x)
○ For classification hypothesis representation we do hθ(x) = g((θT x))
■ Where we define g(z)
■ z is a real number
■ g(z) = 1/(1 + e-z)
■ This is the sigmoid function, or the logistic function
■ If we combine these equations we can write out the hypothesis as

What does the sigmoid function look like

Crosses 0.5 at the origin, then flattens out]
○ Asymptotes at 0 and 1

● Given this we need to fit θ to our data

Interpreting hypothesis output

● When our hypothesis (hθ(x)) outputs a number, we treat that value as the estimated probability that y=1 on input x
○ Example
■ If X is a feature vector with x0 = 1 (as always) and x1 = tumourSize
■ hθ(x) = 0.7
■ Tells a patient they have a 70% chance of a tumor being malignant
○ We can write this using the following notation
■ hθ(x) = P(y=1|x ; θ)
○ What does this mean?
■ Probability that y=1, given x, parameterized by θ
● Since this is a binary classification task we know y = 0 or 1
○ So the following must be true
■ P(y=1|x ; θ) + P(y=0|x ; θ) = 1
■ P(y=0|x ; θ) = 1 - P(y=1|x ; θ)

Decision boundary

● Gives a better sense of what the hypothesis function is computing

● Better understand of what the hypothesis function looks like
○ One way of using the sigmoid function is;

■ When the probability of y being 1 is greater than 0.5 then we can predict y = 1
■ Else we predict y = 0
○ When is it exactly that hθ(x) is greater than 0.5?
■ Look at sigmoid function
■ g(z) is greater than or equal to 0.5 when z is greater than or equal to 0

■ So if z is positive, g(z) is greater than 0.5

■ z = (θT x)
■ So when
■ θT x >= 0
■ Then hθ >= 0.5
● So what we've shown is that the hypothesis predicts y = 1 when θT x >= 0
○ The corollary of that when θT x <= 0 then the hypothesis predicts y = 0
○ Let's use this to better understand how the hypothesis makes its predictions

Decision boundary

● hθ(x) = g(θ0 + θ1x1 + θ2x2)

So, for example

○ θ0 = -3
○ θ1 = 1
○ θ2 = 1
So our parameter vector is a column vector with the above values
○ So, θT is a row vector = [-3,1,1]
What does this mean?
○ The z here becomes θT x
○ We predict "y = 1" if
■ -3x0 + 1x1 + 1x2 >= 0
■ -3 + x1 + x2 >= 0
We can also re-write this as
○ If (x1 + x2 >= 3) then we predict y = 1
○ If we plot
■ x1 + x2 = 3 we graphically plot our decision boundary
Means we have these two regions on the graph
Blue = false
Magenta = true
Line = decision boundary
■ Concretely, the straight line is the set of points where
hθ(x) = 0.5 exactly
The decision boundary is a property of the hypothesis
■ Means we can create the boundary with the hypothesis
and parameters without any data
■ Later, we use the data to determine the
parameter values
■ i.e. y = 1 if
■ 5 - x1 > 0
■ 5 > x1

Non-linear decision boundaries

Get logistic regression to fit a complex non-linear data set

Like polynomial regress add higher order terms
So say we have
hθ(x) = g(θ0 + θ1x1+ θ3x12 + θ4x22)
We take the transpose of the θ vector times the input vector
Say θT was [-1,0,0,1,1] then we say;
Predict that "y = 1" if
■ -1 + x12 + x22 >= 0
or
■ x12 + x22 >= 1
If we plot x12 + x22 = 1
■ This gives us a circle with a radius of 1 around 0

Mean we can build more complex decision boundaries by fitting complex parameters to this (relatively) simple hypothesis
More complex decision boundaries?
○ By using higher order polynomial terms, we can get even more complex decision boundaries
Cost function for logistic regression

Fit θ parameters
Define the optimization object for the cost function we use the fit the parameters
Training set of m training examples
■ Each example has is n+1 length column vector

● This is the situation

○ Set of m training examples
○ Each example is a feature vector which is n+1 dimensional
○ x0 = 1
○ y ∈ {0,1}
○ Hypothesis is based on parameters (θ)
■ Given the training set how to we chose/fit θ?
● Linear regression uses the following function to determine θ

Instead of writing the squared error term, we can write

○ If we define "cost()" as;
■ cost(hθ(xi), y) = 1/2(hθ(xi) - yi)2
■ Which evaluates to the cost for an individual example using the same measure as used in linear regression
○ We can redefine J(θ) as

■ Which, appropriately, is the sum of all the individual costs over the training data (i.e. the same as linear
regression)
To further simplify it we can get rid of the superscripts
○ So

What does this actually mean?

○ This is the cost you want the learning algorithm to pay if the outcome is hθ(x) and the actual outcome is y
○ If we use this function for logistic regression this is a non-convex function for parameter optimization
■ Could work....
What do we mean by non convex?
○ We have some function - J(θ) - for determining the parameters
○ Our hypothesis function has a non-linearity (sigmoid function of hθ(x) )
■ This is a complicated non-linear function
○ If you take hθ(x) and plug it into the Cost() function, and them plug the Cost() function into J(θ) and plot J(θ) we find
many local optimum -> non convex function
○ Why is this a problem
■ Lots of local minima mean gradient descent may not find the global optimum - may get stuck in a global
minimum
○ We would like a convex function so if you run gradient descent you converge to a global minimum
A convex logistic regression cost function

● To get around this we need a different, convex Cost() function which means we can apply gradient descent

This is our logistic regression cost function

○ This is the penalty the algorithm pays
○ Plot the function
Plot y = 1
○ So hθ(x) evaluates as -log(hθ(x))

So when we're right, cost function is 0

○ Else it slowly increases cost function as we become "more" wrong
○ X axis is what we predict
○ Y axis is the cost associated with that prediction
This cost functions has some interesting properties
○ If y = 1 and hθ(x) = 1
■ If hypothesis predicts exactly 1 and thats exactly
correct then that corresponds to 0 (exactly, not nearly 0)
○ As hθ(x) goes to 0
■ Cost goes to infinity
■ This captures the intuition that if hθ(x) = 0 (predict P
(y=1|x; θ) = 0) but y = 1 this will penalize the learning algorithm with a massive cost
What about if y = 0
then cost is evaluated as -log(1- hθ( x ))
○ Just get inverse of the other function

● Now it goes to plus infinity as hθ(x) goes to 1

● With our particular cost functions J(θ) is going to be convex and avoid local
minimum

Simplified cost function and gradient descent

● Define a simpler way to write the cost function and apply gradient descent to the logistic regression
○ By the end should be able to implement a fully functional logistic regression function
● Logistic regression cost function is as follows
This is the cost for a single example
For binary classification problems y is always 0 or 1
■ Because of this, we can have a simpler way to write the cost function
■ Rather than writing cost function on two lines/two cases
■ Can compress them into one equation - more efficient
Can write cost function is
■ cost(hθ, (x),y) = -ylog( hθ(x) ) - (1-y)log( 1- hθ(x) )
■ This equation is a more compact of the two cases above
We know that there are only two possible cases
■ y=1
■ Then our equation simplifies to
■ -log(hθ(x)) - (0)log(1 - hθ(x))
■ -log(hθ(x))
■ Which is what we had before when y = 1
■ y=0
■ Then our equation simplifies to

■ -(0)log(hθ(x)) - (1)log(1 - hθ(x))

■ = -log(1- hθ(x))
■ Which is what we had before when y = 0
■ Clever!
So, in summary, our cost function for the θ parameters can be defined as

Why do we chose this function when other cost functions exist?

This cost function can be derived from statistics using the principle of maximum likelihood estimation
Note this does mean there's an underlying Gaussian assumption relating to the distribution of features
Also has the nice property that it's convex
To fit parameters θ:
Find parameters θ which minimize J(θ)
This means we have a set of parameters to use in our model for future predictions
Then, if we're given some new example with set of features x, we can take the θ which we generated, and output our prediction using

This result is
p(y=1 | x ; θ)
■ Probability y = 1, given x, parameterized by θ

How to minimize the logistic regression cost function

Now we need to figure out how to minimize J(θ)

○ Use gradient descent as before
○ Repeatedly update each parameter using a learning rate

● If you had n features, you would have an n+1 column vector for θ
● This equation is the same as the linear regression rule
○ The only difference is that our definition for the hypothesis has changed
● Previously, we spoke about how to monitor gradient descent to check it's working
○ Can do the same thing here for logistic regression
● When implementing logistic regression with gradient descent, we have to update all the θ values (θ0 to θn) simultaneously
○ Could use a for loop
○ Better would be a vectorized implementation
● Feature scaling for gradient descent for logistic regression also applies here
Advanced optimization

Previously we looked at gradient descent for minimizing the cost function

Here look at advanced concepts for minimizing the cost function for logistic regression
Good for large machine learning problems (e.g. huge feature set)
What is gradient descent actually doing?
We have some cost function J(θ), and we want to minimize it
We need to write code which can take θ as input and compute the following
■ J(θ)
■ Partial derivative if J(θ) with respect to j (where j=0 to j = n)

● Given code that can do these two things

○ Gradient descent repeatedly does the following update

So update each j in θ sequentially

So, we must;
Supply code to compute J(θ) and the derivatives
Then plug these values into gradient descent
Alternatively, instead of gradient descent to minimize the cost function we could use
Conjugate gradient
BFGS (Broyden-Fletcher-Goldfarb-Shanno)
L-BFGS (Limited memory - BFGS)
These are more optimized algorithms which take that same input and minimize the cost function
These are very complicated algorithms
Some properties
Advantages
■ No need to manually pick alpha (learning rate)
■ Have a clever inner loop (line search algorithm) which tries a bunch of alpha values and picks a
good one
■ Often faster than gradient descent
■ Do more than just pick a good learning rate
■ Can be used successfully without understanding their complexity
Disadvantages
■ Could make debugging more difficult
■ Should not be implemented themselves
■ Different libraries may use different implementations - may hit performance

Multiclass classification problems

Getting logistic regression for multiclass classification using one vs. all
Multiclass - more than yes or no (1 or 0)
○ Classification with multiple classes for assignment

Given a dataset with three classes, how do we get a learning algorithm to work?
Use one vs. all classification make binary classification work for multiclass classification
One vs. all classification
Split the training set into three separate binary classification problems
i.e. create a new fake training set
■ Triangle (1) vs crosses and squares (0) hθ1(x)
■ P(y=1 | x1; θ)
■ Crosses (1) vs triangle and square (0) hθ2(x)
■ P(y=1 | x2; θ)
■ Square (1) vs crosses and square (0) hθ3(x)
■ P(y=1 | x3; θ)
● Overall

○ Train a logistic regression classifier hθ(i)(x) for each class i to

predict the probability that y = i
○ On a new input, x to make a prediction, pick the class i that
maximizes the probability that hθ(i)(x) = 1

07: Regularization
The problem of overfitting

● So far we've seen a few algorithms - work well for many applications, but can suffer from the problem of overfitting
● What is overfitting?
● What is regularization and how does it help

Overfitting with linear regression

Using our house pricing example again

Fit a linear function to the data - not a great model
■ This is underfitting - also known as high bias
■ Bias is a historic/technical one - if we're fitting a straight line to the data we have a strong preconception that there should
be a linear fit
■ In this case, this is not correct, but a straight line can't help being straight!
Fit a quadratic function
■ Works well
Fit a 4th order polynomial
■ Now curve fit's through all five examples
■ Seems to do a good job fitting the training set
■ But, despite fitting the data we've provided very well, this is actually not such a good model
■ This is overfitting - also known as high variance
Algorithm has high variance
■ High variance - if fitting high order polynomial then the hypothesis can basically fit any data
■ Space of hypothesis is too large

To recap, if we have too many features then the learned hypothesis may give a cost function of exactly zero
○ But this tries too hard to fit the training set
○ Fails to provide a general solution - unable to generalize (apply to new examples)

Overfitting with logistic regression

Same thing can happen to logistic regression
○ Sigmoidal function is an underfit
○ But a high order polynomial gives and overfitting (high variance hypothesis)

Addressing overfitting

Later we'll look at identifying when overfitting and underfitting is occurring

Earlier we just plotted a higher order function - saw that it looks "too curvy"
Plotting hypothesis is one way to decide, but doesn't always work
Often have lots of a features - here it's not just a case of selecting a degree polynomial, but also harder to plot the data and
visualize to decide what features to keep and which to drop
If you have lots of features and little data - overfitting can be a problem
How do we deal with this?
1) Reduce number of features
■ Manually select which features to keep
■ Model selection algorithms are discussed later (good for reducing number of features)
■ But, in reducing the number of features we lose some information
■ Ideally select those features which minimize data loss, but even so, some info is lost
2) Regularization
■ Keep all features, but reduce magnitude of parameters θ
■ Works well when we have a lot of features, each of which contributes a bit to predicting y

Cost function optimization for regularization

● Penalize and make some of the θ parameters really small

○ e.g. here θ3 and θ4

The addition in blue is a modification of our cost function to help penalize θ3 and θ4
○ So here we end up with θ3 and θ4 being close to zero (because the constants are massive)
○ So we're basically left with a quadratic function

In this example, we penalized two of the parameter values

○ More generally, regularization is as follows
Regularization
Small values for parameters corresponds to a simpler hypothesis (you effectively get rid of some of the terms)
A simpler hypothesis is less prone to overfitting
Another example
Have 100 features x1, x2, ..., x100
Unlike the polynomial example, we don't know what are the high order terms
How do we pick the ones to pick to shrink?
With regularization, take cost function and modify it to shrink all the parameters
Add a term at the end
■ This regularization term shrinks every parameter
■ By convention you don't penalize θ0 - minimization is from θ1 onwards

In practice, if you include θ0 has little impact

λ is the regularization parameter
○ Controls a trade off between our two goals
■ 1) Want to fit the training set well
■ 2) Want to keep parameters small
With our example, using the regularized objective (i.e. the cost function with the regularization term) you get a much smoother
curve which fits the data and gives a much better hypothesis
○ If λ is very large we end up penalizing ALL the parameters (θ1, θ2 etc.) so all the parameters end up being close to zero
■ If this happens, it's like we got rid of all the terms in the hypothesis
■ This results here is then underfitting
■ So this hypothesis is too biased because of the absence of any parameters (effectively)
So, λ should be chosen carefully - not too big...
○ We look at some automatic ways to select λ later in the course

Regularized linear regression

Previously, we looked at two algorithms for linear regression

○ Gradient descent
○ Normal equation
Our linear regression with regularization is shown below

Previously, gradient descent would repeatedly update the parameters θj, where j = 0,1,2...n simultaneously
○ Shown below

● We've got the θ0 update here shown explicitly

○ This is because for regularization we don't penalize θ0 so treat it slightly differently
● How do we regularize these two rules?
○ Take the term and add λ/m * θj
■ Sum for every θ (i.e. j = 0 to n)
○ This gives regularization for gradient descent
● We can show using calculus that the equation given below is the partial derivative of the regularized J(θ)
● The update for θj
○ θj gets updated to
■ θj - α * [a big term which also depends on θj]
● So if you group the θj terms together

The term

○ Is going to be a number less than 1 usually

○ Usually learning rate is small and m is large
■ So this typically evaluates to (1 - a small number)
■ So the term is often around 0.99 to 0.95
This in effect means θj gets multiplied by 0.99
○ Means the squared norm of θj a little smaller
○ The second term is exactly the same as the original gradient descent

Regularization with the normal equation

Normal equation is the other linear regression model

○ Minimize the J(θ) using the normal equation
○ To use regularization we add a term (+ λ [n+1 x n+1]) to the equation
■ [n+1 x n+1] is the n+1 identity matrix

Regularization for logistic regression

● We saw earlier that logistic regression can be prone to overfitting with lots of features
● Logistic regression cost function is as follows;

● To modify it we have to add an extra term

● This has the effect of penalizing the parameters θ1, θ2 up to θn

○ Means, like with linear regression, we can get what appears to be a better fitting lower order hypothesis
● How do we implement this?
○ Original logistic regression with gradient descent function was as follows

● Again, to modify the algorithm we simply need to modify the update rule for θ1, onwards
○ Looks cosmetically the same as linear regression, except obviously the hypothesis is very different

What Is Machine Learning by Coursera
No ratings yet
What Is Machine Learning by Coursera
47 pages
Machine Learning Basics Explained
No ratings yet
Machine Learning Basics Explained
12 pages
Machine Learning Notes Cs229 1
No ratings yet
Machine Learning Notes Cs229 1
217 pages
Machine Learning Basics Stanford Notes
No ratings yet
Machine Learning Basics Stanford Notes
15 pages
Anuranan Das Summer of Sciences, 2019. Understanding and Implementing Machine Learning
No ratings yet
Anuranan Das Summer of Sciences, 2019. Understanding and Implementing Machine Learning
17 pages
cs229 Notes1 PDF
No ratings yet
cs229 Notes1 PDF
28 pages
ML:Introduction What Is Machine Learning?: Continuous and Discrete Data
No ratings yet
ML:Introduction What Is Machine Learning?: Continuous and Discrete Data
6 pages
Machine Learning
No ratings yet
Machine Learning
60 pages
(Machine Learning Coursera) Lecture Note Week 1
No ratings yet
(Machine Learning Coursera) Lecture Note Week 1
8 pages
Regression
No ratings yet
Regression
30 pages
Machine Learning Notes AndrewNg
No ratings yet
Machine Learning Notes AndrewNg
141 pages
Machine Learning Basics for Students
No ratings yet
Machine Learning Basics for Students
7 pages
Stanford ML CS229-Merged Notes
No ratings yet
Stanford ML CS229-Merged Notes
126 pages
Machine Learning Notes by Standard Andrew NG
No ratings yet
Machine Learning Notes by Standard Andrew NG
142 pages
CS229 Lecture Notes: Supervised Learning
No ratings yet
CS229 Lecture Notes: Supervised Learning
293 pages
cs229 2
No ratings yet
cs229 2
275 pages
Notes 1
No ratings yet
Notes 1
30 pages
Linear Regression and Gradient Descent
No ratings yet
Linear Regression and Gradient Descent
30 pages
CS229 Lecture Notes: Supervised Learning
No ratings yet
CS229 Lecture Notes: Supervised Learning
30 pages
CS229
No ratings yet
CS229
69 pages
Machine Learning Guide 2017
No ratings yet
Machine Learning Guide 2017
15 pages
Linear Regression Notes
No ratings yet
Linear Regression Notes
15 pages
Machine Learning Basics Explained
No ratings yet
Machine Learning Basics Explained
10 pages
CS 229: Supervised Learning Basics
100% (1)
CS 229: Supervised Learning Basics
48 pages
Supervised Machine Learning
No ratings yet
Supervised Machine Learning
7 pages
ML Coursera
No ratings yet
ML Coursera
10 pages
ML: Introduction 1. What Is Machine Learning?
No ratings yet
ML: Introduction 1. What Is Machine Learning?
38 pages
A Layman's Guide To The Project
No ratings yet
A Layman's Guide To The Project
34 pages
Algorithms For Data Science: Attendance: 88772147
No ratings yet
Algorithms For Data Science: Attendance: 88772147
35 pages
Slide 3 - Linear Regression One Variable
No ratings yet
Slide 3 - Linear Regression One Variable
60 pages
Machine Learning Notes
No ratings yet
Machine Learning Notes
43 pages
Andrew NG Week 1-2
No ratings yet
Andrew NG Week 1-2
120 pages
Machine Learning Basics for Beginners
No ratings yet
Machine Learning Basics for Beginners
8 pages
Linear Regression For Absolute Beginners With Implementation in Python
No ratings yet
Linear Regression For Absolute Beginners With Implementation in Python
17 pages
L3 Linear Regression and Gradient Descent
No ratings yet
L3 Linear Regression and Gradient Descent
46 pages
CS229 Lecture Notes
No ratings yet
CS229 Lecture Notes
142 pages
Machine Learning Shortnote
No ratings yet
Machine Learning Shortnote
14 pages
Week 6
No ratings yet
Week 6
72 pages
ML:Introduction: Week 1 Lecture Notes
No ratings yet
ML:Introduction: Week 1 Lecture Notes
5 pages
ML - Mca
No ratings yet
ML - Mca
48 pages
Week 1 Lecture Notes
No ratings yet
Week 1 Lecture Notes
7 pages
Lecture3 - Linear Regression and Logistic Regression
No ratings yet
Lecture3 - Linear Regression and Logistic Regression
60 pages
ML:Introduction: Week 1 Lecture Notes
No ratings yet
ML:Introduction: Week 1 Lecture Notes
10 pages
(MLP) Lecture Notes
No ratings yet
(MLP) Lecture Notes
22 pages
ML Day2
No ratings yet
ML Day2
7 pages
01B DL2023 LinearModels
No ratings yet
01B DL2023 LinearModels
47 pages
CSE 440 AI Volume1 (p1)
No ratings yet
CSE 440 AI Volume1 (p1)
4 pages
Lec1 PDF
No ratings yet
Lec1 PDF
56 pages
Multivariable Linear Regression Guide
No ratings yet
Multivariable Linear Regression Guide
7 pages
Regression Analysis
No ratings yet
Regression Analysis
54 pages
Introduction To Machine Learning: Dr. Muhammad Amjad Iqbal
No ratings yet
Introduction To Machine Learning: Dr. Muhammad Amjad Iqbal
20 pages
Machine Learning Summary
No ratings yet
Machine Learning Summary
38 pages
Brief Summary ML
No ratings yet
Brief Summary ML
25 pages
Week 4 Exercise
No ratings yet
Week 4 Exercise
2 pages
Report Mẫu K15
No ratings yet
Report Mẫu K15
15 pages
Week 1 Exercises
No ratings yet
Week 1 Exercises
1 page
Requirements
No ratings yet
Requirements
1 page
ITITIU19141 - NGuyenAnhKhoa - Entre - Midterm
No ratings yet
ITITIU19141 - NGuyenAnhKhoa - Entre - Midterm
8 pages
HW 2 - AEIOU Framework
No ratings yet
HW 2 - AEIOU Framework
2 pages
ESPACE Interior Design Trifold Brochure
No ratings yet
ESPACE Interior Design Trifold Brochure
2 pages
Entre Group5 Assignment5
No ratings yet
Entre Group5 Assignment5
23 pages
Lecture 9
No ratings yet
Lecture 9
22 pages
ID - Name - Entre - Midterm
No ratings yet
ID - Name - Entre - Midterm
9 pages
Background and Context of The Problem, Including Trends
No ratings yet
Background and Context of The Problem, Including Trends
3 pages
Midterm Examination: SUBJECT: Scalable and Distributed Computing (ID: IT1391U)
No ratings yet
Midterm Examination: SUBJECT: Scalable and Distributed Computing (ID: IT1391U)
3 pages
ITITIU19141 - NGuyenAnhKhoa - Entre - Midterm
No ratings yet
ITITIU19141 - NGuyenAnhKhoa - Entre - Midterm
9 pages
Song Lyríc
No ratings yet
Song Lyríc
4 pages
GradientDescent - Implementation - Ipynb - Colab
No ratings yet
GradientDescent - Implementation - Ipynb - Colab
5 pages
Linear Regr GD
No ratings yet
Linear Regr GD
3 pages
Cybersecurity Compliance for IT Teams
No ratings yet
Cybersecurity Compliance for IT Teams
22 pages
Piping Design Model Review Guide
100% (1)
Piping Design Model Review Guide
5 pages
Case Study - Student Result Management System
No ratings yet
Case Study - Student Result Management System
68 pages
DLL Ict Matatag W2
No ratings yet
DLL Ict Matatag W2
11 pages
Laboratory-Scientific Equipment: Suction Units / Autoclave / Sterilizers
No ratings yet
Laboratory-Scientific Equipment: Suction Units / Autoclave / Sterilizers
27 pages
Harvard University Digital Giza A Portal To The Pyramids
No ratings yet
Harvard University Digital Giza A Portal To The Pyramids
36 pages
Aquacrop-: Reference Manual
No ratings yet
Aquacrop-: Reference Manual
20 pages
Life Orientation Grade 12 CAPS Step Ahead Notes 2021
100% (1)
Life Orientation Grade 12 CAPS Step Ahead Notes 2021
48 pages
Microsoft PL 200 Dumps by Morgan 24 05 2024 7qa Braindumpscollection
No ratings yet
Microsoft PL 200 Dumps by Morgan 24 05 2024 7qa Braindumpscollection
11 pages
Network Engineering Report 1
No ratings yet
Network Engineering Report 1
8 pages
Tech Innovations Weekly: Exploring The Future of Technology
No ratings yet
Tech Innovations Weekly: Exploring The Future of Technology
3 pages
Bsides Ahmedabad - Dark Side of DeFi
No ratings yet
Bsides Ahmedabad - Dark Side of DeFi
46 pages
Automata Theory T
No ratings yet
Automata Theory T
5 pages
FSD Week 1
No ratings yet
FSD Week 1
30 pages
Advanced Outlook User Guide
No ratings yet
Advanced Outlook User Guide
17 pages
Class X Math Holiday Homework
100% (1)
Class X Math Holiday Homework
56 pages
2506.apex Forklift AMR Configurations (Legacy) - ForwardX Wiki
No ratings yet
2506.apex Forklift AMR Configurations (Legacy) - ForwardX Wiki
54 pages
Encoder Tool
No ratings yet
Encoder Tool
3 pages
Counting Bears Graphing
No ratings yet
Counting Bears Graphing
8 pages
Object Oriented Technique Using Java (Theory and Lab)
No ratings yet
Object Oriented Technique Using Java (Theory and Lab)
6 pages
Aura Studio 3: Visually Stunning Sound
No ratings yet
Aura Studio 3: Visually Stunning Sound
2 pages
Resume - II Siddesh
No ratings yet
Resume - II Siddesh
2 pages
My Resume
No ratings yet
My Resume
1 page
C Programming Output Analysis
No ratings yet
C Programming Output Analysis
32 pages
Ar-En-005-0001-04 - Firewall Plan and Details
No ratings yet
Ar-En-005-0001-04 - Firewall Plan and Details
1 page
Motion Control of Robot by Using Kinect Sensor
No ratings yet
Motion Control of Robot by Using Kinect Sensor
6 pages
ZeryxPaid (1) - 1
No ratings yet
ZeryxPaid (1) - 1
52 pages
Documento Web 2017 EO 13636 - 13691 Section 5 Report - Signed 012618 - Final
No ratings yet
Documento Web 2017 EO 13636 - 13691 Section 5 Report - Signed 012618 - Final
49 pages
Versions of Table Read and Table Write Instructions
No ratings yet
Versions of Table Read and Table Write Instructions
7 pages
Inventory Management System Software
No ratings yet
Inventory Management System Software
4 pages

(MLP) MidtermNote

Uploaded by

(MLP) MidtermNote

Uploaded by

01 and 02: Introduction, Regression Analysis, and Gradient Descent

What is machine learning?

Supervised learning - introduction

​ Probably the most common problem type in machine learning

​ What approaches can we use to solve this?

● Use only one attribute (size)

Unsupervised learning - introduction

● Second major problem type

● Housing price data example used earlier

■ i is an index to training set

● With our training set defined - how do we used it?

● What does this mean?

Linear regression - implementation (cost function)

● And we want to minimize this cost function

Cost function - a deeper look

A deeper insight into the cost function - simplified cost function

​ Assume you're familiar with contour plots or contour figures

● Instead of a surface plot we can use a contour figures/plots

Gradient descent algorithm

● Minimize cost function J

How does it work?

● Start with initial guesses

A more formal definition

● Do the following until convergence

● What does this all mean?

○ Not going to talk about it now, derive it later

Understanding the algorithm

​ Alpha term (α)

○ So no need to change alpha over time

Linear regression with gradient descent

● So here we're just expanding out the first expression

● To check this you need to know multivariate calculus

What's next - important extensions

● 1) Normal equation for numeric solution

● We see here this matrix shows us

● Rectangular array of numbers written between square brackets

● Provides a way to organize, index and access a lot of data

● Matrix by vector multiplication

● What does this mean

● Can pack a lot into one operation

○ When working with raw numbers/scalars multiplication is commutative

● See some identity matrices above

Inverse and transpose operations

■ How did you find the inverse

04: Linear Regression with Multiple Variables

New version of linear regression with multiple features

​ Multiple variables = multiple features

■ hθ(x) = 80 + 0.1x1 + 0.01x2 + 3x3 - 2x4

■ This is a column vector called x

Gradient descent for multiple variables

● Fitting parameters for the hypothesis with gradient descent

● Once again, this is

● Above, we have slightly different update rules for θ0 and θ1

● These algorithm are highly similar

Gradient Decent in practice: 1 Feature Scaling

​ Pathological input to gradient descent

● Instead of max can also use standard deviation

● Focus on the learning rate (α)

Make sure gradient descent is working

● Plot min J(θ) vs. no of iterations

​ Number of iterations varies a lot

​ So, use a smaller α

● How do we fit the model to this data

How does it work?

​ Simplified cost function

■ Take derivative of J(θ) with respect to θ

Example of normal equation

​ Have m training examples and n features

■ X' is the notation for X transpose

Normal equation and non-invertibility

06: Logistic Regression

​ Where y is a discrete value

​ What function is used to represent our hypothesis in classification

​ What does the sigmoid function look like

● Given this we need to fit θ to our data

Interpreting hypothesis output

Probably the most common problem type in machine learning

What approaches can we use to solve this?

Assume you're familiar with contour plots or contour figures

Alpha term (α)

Multiple variables = multiple features

Pathological input to gradient descent

Number of iterations varies a lot

So, use a smaller α

Simplified cost function

Have m training examples and n features

Where y is a discrete value

What function is used to represent our hypothesis in classification

What does the sigmoid function look like

So, for example

Get logistic regression to fit a complex non-linear data set

Instead of writing the squared error term, we can write

What does this actually mean?

This is our logistic regression cost function

So when we're right, cost function is 0

Why do we chose this function when other cost functions exist?

Now we need to figure out how to minimize J(θ)

Previously we looked at gradient descent for minimizing the cost function

So update each j in θ sequentially

Using our house pricing example again

Later we'll look at identifying when overfitting and underfitting is occurring

In this example, we penalized two of the parameter values

In practice, if you include θ0 has little impact

Previously, we looked at two algorithms for linear regression

Normal equation is the other linear regression model