(MLP) MidtermNote
(MLP) MidtermNote
● Here we...
○ Define what it is
○ When to use it
● Not a well defined definition
○ Couple of examples of how people have tried to define it
● Arthur Samuel (1959)
○ Machine learning: "Field of study that gives computers the ability to learn without being explicitly
programmed"
■ Samuels wrote a checkers playing program
■ Had the program play 10000 games against itself
■ Work out which board positions were good and bad depending on wins/losses
● Tom Michel (1999)
○ Well posed learning problem: "A computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P,
improves with experience E."
■ The checkers example,
■ E = 10000s games
■ T is playing checkers
■ P if you win or not
Several types of learning algorithms
○ Supervised learning
■ Teach the computer how to do something, then let it use it;s new found knowledge to do it
○ Unsupervised learning
■ Let the computer learn how to do something, and use this to determine structure and patterns in data
○ Reinforcement learning
○ Recommender systems
This course
○ Look at practical advice for applying learning algorithms
○ Learning a set of tools and how to apply them
Example problem: "Given this data, a friend has a house 750 square feet - how much can they be expected to get?"
○ Five of each
○ Can you estimate prognosis based on tumor size?
○ This is an example of a classification problem
■ Classify data into one of two discrete classes - no in between, either malignant or not
■ In classification problems, can have a discrete number of possible values for the output
■ e.g. maybe have four values
■ 0 - benign
■ 1 - type 1
■ 2 - type 2
■ 3 - type 4
● In classification problems we can plot data in a different way
● Based on that data, you can try and define separate classes by
○ Drawing a straight line between the two groups
○ Using a more complex function to define the two groups (which we'll discuss later)
○ Then, when you have an individual with a specific tumor size and who is a specific age, you can hopefully use that
information to place them into one of your classes
● You might have many features to consider
○ Clump thickness
○ Uniformity of cell size
○ Uniformity of cell shape
● The most exciting algorithms can deal with an infinite number of features
○ How do you deal with an infinite number of features?
○ Neat mathematical trick in support vector machine (which we discuss later)
■ If you have an infinitely long list - we can develop and algorithm to deal with that
● Summary
○ Supervised learning lets you get the "right" data a
○ Regression problem
○ Classification problem
A cost function lets us figure out how to fit the best straight line to our data
Choosing values for θi (parameters)
○ Different values give you different functions
○ If θ0 is 1.5 and θ1 is 0 then we get straight line parallel with X along 1.5 @ y
○ If θ1 is > 0 then we get a positive slope
Based on our training set we want to generate parameters which make the straight line
○ Chosen these parameters so hθ(x) is close to y for our training examples
■ Basically, uses xs in training set with hθ(x) to give output which is as close to the actual y value as possible
■ Think of hθ(x) as a "y imitator" - it tries to convert the x into y, and considering we already have y we can
evaluate how well hθ(x) does this
To formalize this;
○ We want to want to solve a minimization problem
○ Minimize (hθ(x) - y)2
■ i.e. minimize the difference between h(x) and y for each/any/every example
○ Sum this over the training set
● Minimize squared different between predicted house price and actual house price
○ 1/2m
■ 1/m - means we determine the average
■ 1/2m the 2 makes the math a bit easier, and doesn't change the constants we determine at all (i.e. half the
smallest value is still the smallest value!)
○ Minimizing θ0/θ1 means we get the values of θ0 and θ1 which find on average the minimal deviation of x from y when we
use those parameters in our hypothesis function
● More cleanly, this is a cost function
○ Cost - is a way to, using your training data, determine values for your θ values which make the hypothesis as accurate as
possible
■ This cost function is also called the squared error cost function
■ This cost function is reasonable choice for most regression functions
■ Probably most commonly used function
○ In case J(θ0,θ1) is a bit abstract, going into what it does, why it works and how we use it in the coming sections
Lets consider some intuition about the cost function and why we want to use it
○ The cost function determines parameters
○ The value associated with the parameters determines how your hypothesis behaves, with different values generate
different
Simplified hypothesis
○ Assumes θ0 = 0
● Cost function and goal here are very similar to when we have θ0, but with a simpler parameter
○ Simplified hypothesis makes visualizing cost function J() a bit easier
So hypothesis pass through 0,0
Two key functins we want to understand
hθ(x): Hypothesis is a function of x - function of what the size of the house is
J(θ1): Is a function of the parameter of θ1
So for example: θ1 = 1, J(θ1) = 0
Plot θ1 vs J(θ1). Data
1) θ1 = 1, J(θ1) = 0
2) θ1 = 0.5, J(θ1) = ~0.58
3) θ1 = 0, J(θ1) = ~2.3
○ If we compute a range of values plot
■ J(θ1) vs θ1 we get a polynomial (looks like a quadratic)
The optimization objective for the learning algorithm is find the value of θ1 which minimizes J(θ1)
○ So, here θ1 = 1 is the best value for θ1
● We can see that the height (y) indicates the value of the cost function, so find where y is at a minimum
● Each point (like the red one above) represents a pair of parameter values for Ɵ0 and Ɵ1
○ Our example here put the values at
■ θ0 = ~800
■ θ1 = ~-0.15
○ Not a good fit
■ i.e. these parameters give a value on our contour plot far from the center
○ If we have
■ θ0 = ~360
■ θ1 = 0
■ This gives a better hypothesis, but still not great - not in the center of the countour plot
○ Finally we find the minimum, which gives the best hypothesis
● Doing this by eye/hand is a pain in the ass
○ What we really want is an efficient algorithm fro finding the minimum for θ0 and θ1
○ Here we can see one initialization point led to one local minimum
○ The other led to a different one
● If you implement the non-simultaneous update it's not gradient descent, and will behave weirdly
○ But it might look sort of right - so it's important to remember this!
To understand gradient descent, we'll return to a simpler function where we minimize one parameter to help explain the algorithm in
more detail
○ min θ1 J(θ1) where θ1 is a real number
Two key terms in the algorithm
○ Alpha
○ Derivative term
Notation nuances
○ Partial derivative vs. derivative
■ Use partial derivative when we have multiple variables but only derive with respect to one
■ Use derivative when we are deriving with respect to all the variables
○ Derivative says
■ Lets take the tangent at the point and look at the slope of the line
■ So moving towards the minimum (down) will create a negative derivative, alpha is always positive, so will
update j(θ1) to a smaller value
■ Similarly, if we're moving up a slope we make j(θ1) a bigger numbers
● Apply gradient descent to minimize the squared error cost function J(θ0, θ1)
● Now we have a partial derivative
○ Block of numbers, take all data organized into one big block
● Vector
○ Shown as y
○ Shows us the prices
● Need linear algebra for more complex linear regression modles
● Linear algebra is good for making computationally efficient models (as seen later too)
○ Provide a good way to work with large sets of data sets
○ Typically vectorization of a problem is a common optimization technique
03: Linear Algebra - Review
Matrices - overview
■ Is a [4 x 2] matrix
● Matrix elements
○ A(i,j) = entry in ith row and jth column
Matrix manipulation
● Detailed explanation
○ A*x=y
■ A is m x n matrix
■ x is n x 1 matrix
■ n must match between vector and matrix
■ i.e. inner dimensions must match
■ Result is an m-dimensional vector
○ To get yi - multiply A's ith row with all the elements of vector x and
add them up
● Neat trick
○ Say we have a data set with four values
○ Say we also have a hypothesis hθ(x) = -40 + 0.25x
■ Create your data as a matrix which can be multiplied by
a vector
■ Have the parameters in a vector which your matrix can
be multiplied by
○ Means we can do
■ Prediction = Data Matrix * Parameters
■ Here we add an extra column to the data with 1s - this means our θ0 values can be calculated and expressed
The diagram above shows how this works
○ This can be far more efficient computationally than lots of for loops
○ This is also easier and cleaner to code (assuming you have appropriate libraries to do matrix multiplication)
Matrix-matrix multiplication
○ General idea
■ Step through the second matrix one column at a time
■ Multiply each column vector from second matrix by the entire first matrix, each time generating a vector
■ The final product is these vectors combined (not added or summed, but literally just put together)
○ Details
■ AxB=C
■ A = [m x n]
■ B = [n x o]
■ C = [m x o]
■ With vector multiplications o = 1
■ Can only multiply matrix where columns in A match rows in B
○ Mechanism
■ Take column 1 of B, treat as a vector
■ Multiply A by that column - generates an [m x 1] vector
■ Repeat for each column in B
■ There are o columns in B, so we get o columns in C
Implementation/use
● House prices, but now we have three hypothesis and the same data set
● To apply all three hypothesis to all data we can do this efficiently using matrix-matrix multiplication
○ Have
■ Data matrix
■ Parameter matrix
○ Example
■ Four houses, where we want to predict the prize
■ Three competing hypotheses
■ Because our hypothesis are one variable, to make the matrices match up we make our data (houses sizes)
vector into a 4x2 matrix by adding an extra column of 1s
○ 3 x 5 x 2 == 3 x 10 = 15 x 2
■ Associative property
○ Matrix multiplications is associative
■ A x (B x C) == (A x B) x C
● Identity matrix
○ 1 is the identity for any scalar
■ i.e. 1 x z = z
■ for any real number
○ In matrices we have an identity matrix called I
■ Sometimes called I{n x n}
● Matrix inverse
○ How does the concept of "the inverse" relate to real numbers?
■ 1 = "identity element" (as mentioned above)
■ Each number has an inverse
■ This is the number you multiply a number by to get the identify element
■ i.e. if you have x, x * 1/x = 1
■ e.g. given the number 3
■ 3 * 3-1 = 1 (the identity number/matrix)
■ In the space of real numbers not everything has an inverse
■ e.g. 0 does not have an inverse
○ What is the inverse of a matrix
■ If A is an m x m matrix, then A inverse = A-1
■ So A*A-1 = I
■ Only matrices which are m x m have inverses
■ Square matrices only!
○ Example
■ 2 x 2 matrix
● Similarly, instead of thinking of J as a function of the n+1 numbers, J() is just a function of the parameter vector
○ J(θ)
● Gradient descent
Having covered the theory, we now move on to learn about some of the practical tricks
Feature scaling
○ If you have a problem with multiple features
○ You should make sure those features have a similar scale
■ Means gradient descent will converge more quickly
○ e.g.
■ x1 = size (0 - 2000 feet)
■ x2 = number of bedrooms (1-5)
■ Means the contours generated if we plot θ1 vs. θ2 give a very tall and thin shape due to the huge range difference
○ Running gradient descent on this kind of cost function can take a long time to find the global minimum
But you overshoot, so reduce learning rate so you actually reach the minimum (green line)
● Choice of features and how you can get different learning algorithms by choosing appropriate features
● Polynomial regression for non-linear function
Example
House price prediction
■ Two features
■ Frontage - width of the plot of land along road (x1)
■ Depth - depth away from road (x2)
You don't have to use just two features
■ Can create new features
Might decide that an important feature is the land area
■ So, create a new feature = frontage * depth (x3)
■ h(x) = θ0 + θ1x3
■ Area is a better indicator
Often, by defining new features you may get a better model
Polynomial regression
May fit the data better
θ0 + θ1x + θ2x2 e.g. here we have a quadratic function
For housing data could use a quadratic function
■ But may not fit the data so well - inflection point means housing prices decrease when size gets really big
■ So instead must use a cubic function
Normal equation
● For some linear regression problems the normal equation provides a better solution
● So far we've been using gradient descent
○ Iterative algorithm which takes steps to converse
● Normal equation solves θ analytically
○ Solve for the optimum value of theta
● Has some advantages and disadvantages
Here
○ m=4
○ n=4
To implement the normal equation
○ Take examples
○ Add an extra column (x0 feature)
○ Construct a matrix (X - the design matrix) which contains all the training data features in an [m x n+1] matrix
○ Do something similar for y
■ Construct a column vector y vector [m x 1] matrix
○ Using the following equation (X transpose * X) inverse times X transpose y
● If you compute this, you get the value of theta which minimize the cost function
General case
Vector y
■ Used by taking all the y values into a column vector
What is this equation?!
○ (XT * X)-1
■ What is this --> the inverse of the matrix (XT * X)
■ i.e. A = XT X
■ A-1 = (XT X)-1
In octave and MATLAB you could do;
pinv(X'*x)*x'*y
When should you use gradient descent and when should you use feature scaling?
Gradient descent
Need to chose learning rate
Needs many iterations - could make it slower
Works well even when n is massive (millions)
■ Better suited to big data
■ What is a big n though
■ 100 or even a 1000 is still (relativity) small
■ If n is 10 000 then look at using gradient descent
Normal equation
No need to chose a learning rate
No need to iterate, check for convergence etc.
Normal equation needs to compute (XT X)-1
■ This is the inverse of an n x n matrix
■ With most implementations computing a matrix inverse grows by O(n3 )
■ So not great
Slow of n is large
■ Can be much slower
● Advanced concept
○ Often asked about, but quite advanced, perhaps optional material
○ Phenomenon worth understanding, but not probably necessary
● When computing (XT X)-1 * XT * y)
○ What if (XT X) is non-invertible (singular/degenerate)
■ Only some matrices are invertible
■ This should be quite a rare problem
■ Octave can invert matrices using
■ pinv (pseudo inverse)
■ This gets the right value even if (XT X) is non-invertible
■ inv (inverse)
○ What does it mean for (XT X) to be non-invertible
■ Normally two common causes
■ Redundant features in learning model
■ e.g.
■ x1 = size in feet
■ x2 = size in meters squared
■ Too many features
■ e.g. m <= n (m is much larger than n)
■ m = 10
■ n = 100
■ Trying to fit 101 parameters from 10 training examples
■ Sometimes work, but not always a good idea
■ Not enough data
■ Later look at why this may be too little data
■ To solve this we
■ Delete features
■ Use regularization (let's you use lots of features for a small training set)
○ If you find (XT X) to be non-invertible
■ Look at features --> are features linearly dependent?
■ So just delete one, will solve problem
We can see above this does a reasonable job of stratifying the data points into one of two classes
○ But what if we had a single Yes with a very small tumour
○ This would lead to classifying all the existing yeses as nos
Another issues with linear regression
○ We know Y is 0 or 1
○ Hypothesis can give values large than 1 or less than 0
So, logistic regression generates a value where is always either 0 or 1
○ Logistic regression is a classification algorithm - don't be confused
Hypothesis representation
● When our hypothesis (hθ(x)) outputs a number, we treat that value as the estimated probability that y=1 on input x
○ Example
■ If X is a feature vector with x0 = 1 (as always) and x1 = tumourSize
■ hθ(x) = 0.7
■ Tells a patient they have a 70% chance of a tumor being malignant
○ We can write this using the following notation
■ hθ(x) = P(y=1|x ; θ)
○ What does this mean?
■ Probability that y=1, given x, parameterized by θ
● Since this is a binary classification task we know y = 0 or 1
○ So the following must be true
■ P(y=1|x ; θ) + P(y=0|x ; θ) = 1
■ P(y=0|x ; θ) = 1 - P(y=1|x ; θ)
Decision boundary
■ When the probability of y being 1 is greater than 0.5 then we can predict y = 1
■ Else we predict y = 0
○ When is it exactly that hθ(x) is greater than 0.5?
■ Look at sigmoid function
■ g(z) is greater than or equal to 0.5 when z is greater than or equal to 0
Decision boundary
Mean we can build more complex decision boundaries by fitting complex parameters to this (relatively) simple hypothesis
More complex decision boundaries?
○ By using higher order polynomial terms, we can get even more complex decision boundaries
Cost function for logistic regression
Fit θ parameters
Define the optimization object for the cost function we use the fit the parameters
Training set of m training examples
■ Each example has is n+1 length column vector
■ Which, appropriately, is the sum of all the individual costs over the training data (i.e. the same as linear
regression)
To further simplify it we can get rid of the superscripts
○ So
● To get around this we need a different, convex Cost() function which means we can apply gradient descent
● Define a simpler way to write the cost function and apply gradient descent to the logistic regression
○ By the end should be able to implement a fully functional logistic regression function
● Logistic regression cost function is as follows
This is the cost for a single example
For binary classification problems y is always 0 or 1
■ Because of this, we can have a simpler way to write the cost function
■ Rather than writing cost function on two lines/two cases
■ Can compress them into one equation - more efficient
Can write cost function is
■ cost(hθ, (x),y) = -ylog( hθ(x) ) - (1-y)log( 1- hθ(x) )
■ This equation is a more compact of the two cases above
We know that there are only two possible cases
■ y=1
■ Then our equation simplifies to
■ -log(hθ(x)) - (0)log(1 - hθ(x))
■ -log(hθ(x))
■ Which is what we had before when y = 1
■ y=0
■ Then our equation simplifies to
This result is
p(y=1 | x ; θ)
■ Probability y = 1, given x, parameterized by θ
● If you had n features, you would have an n+1 column vector for θ
● This equation is the same as the linear regression rule
○ The only difference is that our definition for the hypothesis has changed
● Previously, we spoke about how to monitor gradient descent to check it's working
○ Can do the same thing here for logistic regression
● When implementing logistic regression with gradient descent, we have to update all the θ values (θ0 to θn) simultaneously
○ Could use a for loop
○ Better would be a vectorized implementation
● Feature scaling for gradient descent for logistic regression also applies here
Advanced optimization
Getting logistic regression for multiclass classification using one vs. all
Multiclass - more than yes or no (1 or 0)
○ Classification with multiple classes for assignment
Given a dataset with three classes, how do we get a learning algorithm to work?
Use one vs. all classification make binary classification work for multiclass classification
One vs. all classification
Split the training set into three separate binary classification problems
i.e. create a new fake training set
■ Triangle (1) vs crosses and squares (0) hθ1(x)
■ P(y=1 | x1; θ)
■ Crosses (1) vs triangle and square (0) hθ2(x)
■ P(y=1 | x2; θ)
■ Square (1) vs crosses and square (0) hθ3(x)
■ P(y=1 | x3; θ)
● Overall
07: Regularization
The problem of overfitting
● So far we've seen a few algorithms - work well for many applications, but can suffer from the problem of overfitting
● What is overfitting?
● What is regularization and how does it help
To recap, if we have too many features then the learned hypothesis may give a cost function of exactly zero
○ But this tries too hard to fit the training set
○ Fails to provide a general solution - unable to generalize (apply to new examples)
Addressing overfitting
The addition in blue is a modification of our cost function to help penalize θ3 and θ4
○ So here we end up with θ3 and θ4 being close to zero (because the constants are massive)
○ So we're basically left with a quadratic function
Previously, gradient descent would repeatedly update the parameters θj, where j = 0,1,2...n simultaneously
○ Shown below
The term
● We saw earlier that logistic regression can be prone to overfitting with lots of features
● Logistic regression cost function is as follows;
● Again, to modify the algorithm we simply need to modify the update rule for θ1, onwards
○ Looks cosmetically the same as linear regression, except obviously the hypothesis is very different