CIS 419/519 Introduction To Machine Learning Assignment 2: Instructions
Assignment 2
Due: October 11, 2017 11:59pm
Instructions
Read all instructions in this section thoroughly.
Collaboration: Make certain that you understand the course collaboration policy, described on the course
website. You must complete this assignment individually; you are not allowed to collaborate with anyone
else. You may discuss the homework to understand the problems and the mathematics behind the various
learning algorithms, but you are not allowed to share problem solutions or your code with any
other students. You must also not consult code on the internet that is directly related to the programming
exercise. We will be using automatic checking software to detect academic dishonesty, so please don’t do it.
You are also prohibited from posting any part of your solution to the internet, even after the course is
complete. Similarly, please don’t post this PDF file or the homework skeleton code to the internet.
Formatting: This assignment consists of two parts: a problem set and program exercises.
For the problem set, you must write up your solutions electronically and submit it as a single PDF document.
We will not accept handwritten or paper copies of the homework. Your problem set solutions must use
proper mathematical formatting. For this reason, we strongly encourage you to write up your responses
using LaTeX. (Alternative word processors, such as MS Word, produce very poorly formatted mathematics.)
Your solutions to the programming exercises must be implemented in Python, following the precise instruc-
tions included in Part 2. Portions of the programming exercise will be graded automatically, so it is imperative
that your code follows the specified API. A few parts of the programming exercise ask you to create plots
or describe results; these should be included in the same PDF document that you create for the problem set.
Homework Template and Files to Get You Started: The homework zip file contains the skeleton
code and data sets that you will require for this assignment. Please read through the documentation
provided in ALL files before starting the assignment.
Citing Your Sources: Any sources of help that you consult while completing this assignment (other
students, textbooks, websites, etc.) *MUST* be noted in your README file. This includes anyone you
briefly discussed the homework with. If you received help from the following sources, you do not need to cite
it: course instructor, course teaching assistants, course lecture notes, course textbooks or other readings.
Submitting Your Solution: We will post instructions for submitting your solution one week before the
assignment is due. Be sure to check Piazza then for details.
CIS 519 ONLY Problems: Several problems are marked as “[CIS 519 ONLY]” in this assignment. Only
students enrolled in CIS 519 are required to complete these problems. However, we do encourage students
in CIS 419 to read through these problems, even though they are not required to complete them.
All homeworks will receive a percentage grade, but CIS 519 homeworks will be graded out of a different
total number of points than CIS 419 homeworks. Students in CIS 419 choosing to complete CIS 519 ONLY
exercises will not receive any credit for answers to these questions (i.e., they will not count as extra credit
nor will they compensate for points lost on other problems).
Acknowledgements: Parts of the programming exercises have been adapted from course materials by
Andrew Ng.
Assignment Version 20170926a
PART I: PROBLEM SET
Your solutions to the problems will be submitted as a single PDF document. Be certain that your problems
are well-numbered and that it is clear what your answers are. Additionally, you will be required to duplicate
your answers to particular problems in the README file that you will submit.
d.) Solve for w0 , using your value of w and the two constraints (2)–(3) for the max margin classifier.
e.) Write down the form of the discriminant h(x) = w0 + w^T φ(x) as an explicit function in terms of x.
PART II: PROGRAMMING EXERCISES
script fits a polynomial of degree d = 8 with no regularization (λ = 0). From the plot, we see that the
function fits the data well, but will not generalize well to new data points. Try increasing the amount of
regularization, and examine the resulting effect on the function.
Once you thoroughly understand the learningCurve() function, run the test_polyreg_learningCurve.py
script to plot the learning curves for various values of λ and d. You should see plots similar to the following:
• The y-axis is using a log-scale and the ranges of the y-scale are all different for the plots. The dashed
black line indicates the y = 1 line as a point of reference between the plots.
• The plot of the unregularized model with d = 1 shows high training error, indicating high bias (i.e.,
it is a standard univariate linear regression fit).
• The plot of the unregularized model (λ = 0) with d = 8 shows that the training error is low, but that
the testing error is high. There is a huge gap between the training and testing errors caused by the
model overfitting the training data, indicating a high variance problem.
• As the regularization parameter increases (e.g., λ = 1) with d = 8, we see that the gap between the
training and testing error narrows, with both the training and testing errors converging to a low value.
We can see that the model fits the data well and generalizes well, and therefore does not have either a
high bias or a high variance problem. Effectively, it has a good tradeoff between bias and variance.
• Once the regularization parameter is too high (λ = 100), we see that the training and testing errors
are once again high, indicating a poor fit. Effectively, there is too much regularization, resulting in
high bias.
Make absolutely certain that you understand these observations, and how they relate to the learning curve
plots. In practice, we can choose the value for λ via cross-validation to achieve the best bias-variance tradeoff.
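As a rough sketch of how such a cross-validation search might look (the choose_lambda helper and the make_model interface below are illustrative assumptions, not part of the homework skeleton):

import numpy as np

def choose_lambda(X, y, lambdas, make_model, n_folds=5):
    # Pick the regularization value with the lowest average held-out squared error.
    # make_model(lam) is assumed to return a fresh regression object with fit(X, y) and predict(X).
    n = len(y)
    folds = np.array_split(np.random.permutation(n), n_folds)
    best_lambda, best_err = None, np.inf
    for lam in lambdas:
        errs = []
        for k in range(n_folds):
            test_idx = folds[k]
            train_idx = np.hstack([folds[j] for j in range(n_folds) if j != k])
            model = make_model(lam)
            model.fit(X[train_idx], y[train_idx])
            pred = model.predict(X[test_idx])
            errs.append(np.mean((pred - y[test_idx]) ** 2))
        if np.mean(errs) < best_err:
            best_lambda, best_err = lam, np.mean(errs)
    return best_lambda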
2 Logistic Regression (35 points)
Now that we’ve implemented a basic regression model using gradient descent, we will use a similar technique
to implement the logistic regression classifier.
Relevant Files in Homework Skeleton2
2.1 Implementation
Implement regularized logistic regression by completing the LogisticRegression class in logreg.py. Your
class must implement the following API:
• __init__(alpha, regLambda, epsilon, maxNumIters): the constructor, which takes in α, λ, ε, and
maxNumIters as arguments
Note that these methods have already been defined correctly for you in logreg.py; be very careful not to
change the API.
Sigmoid Function You should begin by implementing the sigmoid(z) function. Recall that the logistic
regression hypothesis h() is defined as:
h_θ(x) = g(θ^T x)
where g() is the sigmoid function defined as:
g(z) = 1 / (1 + exp(−z))
The Sigmoid function has the property that g(+∞) ≈ 1 and g(−∞) ≈ 0. Test your function by calling
sigmoid(z) on different test samples. Be certain that your sigmoid function works with both
vectors and matrices — for either a vector or a matrix, your function should apply the sigmoid function
to every element.
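A minimal vectorized implementation might look like the following (np.exp already operates element-wise, so the same code handles scalars, vectors, and matrices):

import numpy as np

def sigmoid(z):
    # Apply the logistic function element-wise to a scalar, vector, or matrix.
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# Quick sanity checks: large positive inputs approach 1, large negative inputs approach 0.
print(sigmoid(0))                                  # 0.5
print(sigmoid(np.array([-100.0, 0.0, 100.0])))     # approximately [0, 0.5, 1]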
2 Bold text indicates files that you will need to complete; you should not need to modify any of the other files.
Cost Function and Gradient Implement the functions to compute the cost function and the gradient
of the cost function. Recall that the cost function for regularized logistic regression is a scalar value given by
J(θ) = Σ_{i=1}^{n} [ −y^(i) log(h_θ(x^(i))) − (1 − y^(i)) log(1 − h_θ(x^(i))) ] + (λ/2) ‖θ‖_2^2 .
The gradient of the cost function is a d-dimensional vector, where the jth element (for j = 1, ..., d) is given by

∂J(θ)/∂θ_j = Σ_{i=1}^{n} (h_θ(x^(i)) − y^(i)) x_j^(i) + λ θ_j .
We must be careful not to regularize the θ0 parameter (corresponding to the 1’s feature we add to each
instance), and so
∂J(θ)/∂θ_0 = Σ_{i=1}^{n} (h_θ(x^(i)) − y^(i)) .
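The following sketch shows one way these two computations could be vectorized with numpy; the function names and the assumption that X already includes the leading column of ones are illustrative, not requirements of the skeleton:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # element-wise logistic function (see the sketch above)

def computeCost(theta, X, y, regLambda):
    # J(theta) as defined above; X is n-by-(d+1) with a leading 1's column, y and theta are column vectors.
    h = sigmoid(X.dot(theta))
    cost = np.sum(-y * np.log(h) - (1 - y) * np.log(1 - h))
    return cost + (regLambda / 2.0) * np.sum(theta ** 2)   # (lambda/2)||theta||_2^2 as in the formula above

def computeGradient(theta, X, y, regLambda):
    # Gradient of J(theta); the theta_0 entry gets no lambda*theta_0 term.
    h = sigmoid(X.dot(theta))
    grad = X.T.dot(h - y) + regLambda * theta
    grad[0] = np.sum(h - y)
    return grad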
Training and Prediction Once you have the cost and gradient functions complete, implement the fit
and predict methods. To make absolutely certain that the un-regularized θ0 corresponds to the 1’s feature
that we add to the input, we will augment both the training and testing instances within the fit and
predict methods (instead of relying on it being done externally to the classifier). Recall that you can do
this via:
X = np.c_[np.ones((n, 1)), X]
Your fit method should train the model via gradient descent, relying on the cost and gradient functions.
Instead of simply running gradient descent for a specific number of iterations (as in the linear regression
exercise), we will use a more sophisticated method: we will stop it after the solution has converged. Stop
the gradient descent procedure when θ stops changing between consecutive iterations. You can detect this
convergence when
‖θ_new − θ_old‖_2 ≤ ε ,    (6)
for some small ε (e.g., ε = 10E-4). For readability, we'd recommend implementing this convergence test as
a dedicated function hasConverged. For safety, we will also set the maximum number of gradient descent
iterations, maxNumIters. The values of λ, ε, maxNumIters, and α (the gradient descent learning rate) are
arguments to LogisticRegression’s constructor. At the start of gradient descent, θ should be initialized
to random values with mean 0, as described in the linear regression exercise.
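A sketch of how fit might be structured (it reuses the computeGradient helper sketched above; names other than hasConverged are illustrative):

import numpy as np

def hasConverged(theta_new, theta_old, epsilon):
    # Convergence test from Equation (6): stop when theta barely changes.
    return np.linalg.norm(theta_new - theta_old, 2) <= epsilon

def fit_sketch(X, y, alpha, regLambda, epsilon, maxNumIters):
    y = np.asarray(y).reshape(-1, 1)
    n = len(y)
    X = np.c_[np.ones((n, 1)), X]             # augment with the 1's feature inside fit
    theta = np.random.randn(X.shape[1], 1)    # random initialization with mean 0
    for _ in range(maxNumIters):
        theta_old = theta
        theta = theta - alpha * computeGradient(theta, X, y, regLambda)
        if hasConverged(theta, theta_old, epsilon):
            break
    return theta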
2.3 Analysis
Varying λ changes the amount of regularization, which acts as a penalty on the objective function for complex
solutions. In test_logreg1.py, we provide the code to draw the decision boundary of the model. Vary the
value of λ in test_logreg1.py and plot the decision boundary for each value. Include several plots in your
PDF writeup for this assignment, labeling each plot with the value of λ in your writeup. Write a brief
paragraph (2-3 sentences) describing what you observe as you increase the value of λ and explain why.
2.4 [CIS 519 ONLY] Learning a Nonlinear Decision Surface (10 pts)
We strongly encourage students in CIS 419 to read through this exercise in detail, even though you are not
required to complete it. This exercise is required only for students in CIS 519.
Some classification problems that cannot be separated in a low-dimensional feature space may become
separable in a higher-dimensional space. We can make our logistic regression classifier much more powerful by adding
additional features to the input. Imagine that we are given a data set with only two features, x1 and x2 , but
the decision surface is non-linear in these two features. We can expand the possible feature set into higher
dimensions by adding new features that are functions of the given features. For example, we could add a
new feature that is the product of the original features, or any other mathematical function of those two
features that we want.
In this example, we will map the two features into all polynomial terms of x1 and x2 up to the 6th power:

mapFeature(x1, x2) = [ 1, x1, x2, x1^2, x1 x2, x2^2, · · · , x1 x2^5, x2^6 ]^T    (7)
Given a 2-dimensional instance x, we can project that instance into a 28-dimensional feature space using
the transformation above. Then, instead of training the classifier on instances in the 2-dimensional space,
we train it on the 28-dimensional projection of those instances. The resulting decision boundary will then
be more complex and will appear non-linear in the 2D plot.
Complete the mapFeature function in mapFeature.py to map instances from a 2D feature space to a 28-
dimensional feature space defined by all polynomial terms of x1 and x2 up to the sixth power. Your function
mapFeature(X1, X2) will take in two column matrices, X1 and X2, which correspond to the 1st and 2nd
columns respectively of the data set X (recall that X is n-by-d, and so both X1 and X2 will have n entries).
It will return an n-by-28 dimensional matrix, where each row represents the expanded feature set for one
instance.
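One possible implementation, assuming the term ordering 1, x1, x2, x1^2, x1 x2, x2^2, ... shown in Equation (7):

import numpy as np

def mapFeature(X1, X2, degree=6):
    # Map two column vectors of features to all polynomial terms up to the given degree.
    # For degree 6 this produces 28 columns per instance, including the constant term.
    X1 = np.asarray(X1).reshape(-1, 1)
    X2 = np.asarray(X2).reshape(-1, 1)
    columns = []
    for i in range(degree + 1):        # i is the total degree of the term
        for j in range(i + 1):         # j is the power of x2 within that total degree
            columns.append((X1 ** (i - j)) * (X2 ** j))
    return np.hstack(columns)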
Once this function is complete, you can run python test_logreg2.py from the command line. This script
will load a data set that is non-linearly separable, use your mapFeature function to project the instances
into a higher dimensional space, and then train a logistic regression model in that higher dimensional feature
space. When we project the resulting non-linear classifier back down into 2D, we get a decision surface that
appears similar to the following:
3 Support Vector Machines (20 points)
In this section, we’ll implement various kernels for the support vector machine (SVM). This exercise looks
long, but in practice you’ll be writing only a few lines of code. The scikit learn package already includes
several SVM implementations; in this case, we’ll focus on the SVM for classification, sklearn.svm.SVC.
Before starting this assignment, be certain to read through the documentation for SVC, available at
http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html.
While we could certainly create our own SVM implementation, most people applying SVMs to real problems
rely on highly optimized SVM toolboxes, such as LIBSVM or SVMlight.3 These toolboxes provide highly
optimized SVM implementations that use a variety of optimization techniques to enable them to scale to
extremely large problems. Therefore, we will focus on implementing custom kernels for the SVMs, which is
often essential for applying these SVM toolboxes to real applications.
Relevant Files in Homework 2 Skeleton4
• example_svm.py
• example_svmCustomKernel.py
• svmKernels.py
• test_svmGaussianKernel.py
• test_svmPolyKernel.py
• data/svmData.dat
• data/svmTuningData.dat
3 The SVC implementation provided with scikit learn is based on LIBSVM, but is not quite as efficient.
4 Bold text indicates files that you will need to complete; you should not need to modify any of the other files.
3.2 Implementing Custom Kernels
The SVC implementation allows us to define our own kernel functions to learn non-linear decision surfaces
using the SVM. The SVC constructor’s kernel argument can be defined as either a string specifying one of
the built-in kernels (e.g., ’linear’, ’poly’ (polynomial), ’rbf’ (radial basis function), ’sigmoid’, etc.)
or it can take as input a custom kernel function, as we will define in this exercise.
For example, the following code snippet defines a custom kernel and uses it to train the SVM:
def myCustomKernel(X1, X2):
    """
    Custom kernel:
        k(X1, X2) = X1 * M * X2.T,  where M = [[3, 0],
                                               [0, 2]]
    """
    M = np.array([[3.0, 0.0], [0.0, 2.0]])   # assumes numpy is imported as np
    return np.dot(np.dot(X1, M), X2.T)
When the SVM calls the custom kernel function during training, X1 and X2 are both initialized to be
the same as X (i.e., n_train-by-d numpy arrays); in other words, they both contain a complete copy of
the training instances. The custom kernel function returns an n_train-by-n_train numpy array during the
training step. Later, when it is used for testing, X1 will be the n_test testing instances and X2 will be the
n_train training instances, and so it will return an n_test-by-n_train numpy array. For a complete example, see
example_svmCustomKernel.py, which uses the custom kernel above to generate the following figure:
Figure 3: Sample output of test_svmPolyKernel.py
For the built-in polynomial kernel, the degree is specified in the SVC constructor. However, in our custom
kernel we must set the degree via a global variable polyDegree. Therefore, be sure to use the value of the
global variable polyDegree as the degree in your polynomial kernel. The test_svmPolyKernel.py script
uses polyDegree for the degree of both your custom kernel and the built-in polynomial kernel.
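A sketch of what such a custom kernel could look like; the exact formula (in particular whether a constant is added before raising to the power) should follow the definition given in the assignment, so treat the +1 term below as an assumption:

import numpy as np

polyDegree = 3   # global degree used by the custom kernel, as described above

def myPolynomialKernel(X1, X2):
    # Candidate polynomial kernel: K(X1, X2) = (X1 X2^T + 1)^polyDegree.
    # X1 is m-by-d and X2 is n-by-d, so this returns an m-by-n Gram matrix.
    return (np.dot(X1, X2.T) + 1) ** polyDegree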
Vary both C and d and study how the SVM reacts.
Again, vary both C and σ and study how the SVM reacts.
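Similarly, a vectorized Gaussian kernel might be sketched as follows; the name of the global holding σ is an assumption here (use whatever variable the skeleton specifies):

import numpy as np

_gaussSigma = 1.0   # hypothetical global for sigma

def myGaussianKernel(X1, X2):
    # Candidate Gaussian kernel: K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)),
    # using ||x - x'||^2 = ||x||^2 + ||x'||^2 - 2 x.x' to avoid explicit loops.
    sq1 = np.sum(X1 ** 2, axis=1).reshape(-1, 1)   # m-by-1
    sq2 = np.sum(X2 ** 2, axis=1).reshape(1, -1)   # 1-by-n
    sqdist = sq1 + sq2 - 2.0 * np.dot(X1, X2.T)
    return np.exp(-sqdist / (2.0 * _gaussSigma ** 2))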
Write a brief paragraph describing how the SVM reacts as both C and d vary for the polynomial kernel, and
as C and σ vary for the Gaussian kernel. Put this paragraph in your writeup, and also in the README file.
Your task is to determine the optimal values of C and σ for an SVM with your Gaussian kernel as applied to the data
in data/svmTuningData.dat, depicted below. You should use whatever method you like to search over the
space of possible combinations of C and σ, and determine the optimal fit as measured by accuracy. You may
use any built-in methods from scikit learn you wish (e.g., sklearn.grid_search.GridSearchCV). We recom-
mend that you search over multiplicative steps (e.g., . . . , 0.01, 0.03, 0.06, 0.1, 0.3, 0.6, 1, 3, 6, 10, 30, 60, 100, . . .).
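As a sketch of one way to run this search with a held-out split (GridSearchCV is awkward to use directly if σ lives in a module-level global, so this uses a plain double loop; X, y, and the myGaussianKernel sketch above are assumed to be defined already):

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=0)

best = (None, None, -np.inf)
for C in [0.01, 0.03, 0.06, 0.1, 0.3, 0.6, 1, 3, 6, 10, 30, 60, 100]:
    for sigma in [0.01, 0.03, 0.06, 0.1, 0.3, 0.6, 1, 3, 6, 10]:
        _gaussSigma = sigma   # update the global sigma used by the kernel
                              # (if the kernel lives in svmKernels.py, set the global on that module instead)
        model = SVC(C=C, kernel=myGaussianKernel)
        model.fit(Xtrain, ytrain)
        acc = model.score(Xtest, ytest)
        if acc > best[2]:
            best = (C, sigma, acc)
print("best C = %g, sigma = %g, held-out accuracy = %.3f" % best)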
Once you determine the optimal parameter values, report those optimal values and the corresponding es-
timated accuracy in the README file. For reference, the SVM with the optimal parameters we found
produced the following decision surface.
The resulting decision surface for your optimal parameters may look slightly different than ours.
3.6 Implementing the Cosine Similarity Kernel (CIS 519 ONLY) (10 pts)
Implement the cosine similarity kernel by completing myCosineSimilarityKernel() in svmKernels.py. We
have not provided (quite on purpose!) a test script that compares your implementation to the cosine similarity
kernel already provided by scikit learn, but you are welcome to adapt the test_svmGaussianKernel.py script
for this purpose.
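A minimal sketch of the kernel itself (sklearn.metrics.pairwise.cosine_similarity computes the same quantity and can serve as a reference when you build your own test):

import numpy as np

def myCosineSimilarityKernel(X1, X2):
    # Cosine similarity kernel: K(x, x') = (x . x') / (||x|| ||x'||),
    # returned as the m-by-n matrix of similarities between rows of X1 and rows of X2.
    norms1 = np.linalg.norm(X1, axis=1).reshape(-1, 1)
    norms2 = np.linalg.norm(X2, axis=1).reshape(1, -1)
    return np.dot(X1, X2.T) / (norms1 * norms2)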