Data Mining Exercise 3

Exercise 3: Collaborative filtering and Alternating Least Squares

1. Collaborative filtering (4 pts)

Find out what collaborative filtering is; there is plenty of information available online. A few links:
- Course book.
- https://en.wikipedia.org/wiki/Collaborative_filtering
- http://bugra.github.io/work/notes/2014-04-19/alternating-least-squares-method-for-collaborative-filtering/
- http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html

There are also many examples built with the MovieLens data:


- https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html
- https://www.codementor.io/spark/tutorial/building-a-recommender-with-apache-spark-python-example-app-part1

Task 1: Describe collaborative filtering briefly. (1 pts)

Task 2: Find some example cases where collaborative filtering can be used. (1 pts)

Task 3: Find some implementation types of collaborative filtering, for example the Alternating Least Squares (ALS) algorithm. (1 pts)

Task 4: For what purpose is the implicit version of collaborative filtering used in the ALS algorithm? How does it differ from the explicit version? (check the Spark MLlib link above) (1 pts)

Input dataset
We will test the recommendation algorithm with the MovieLens 1M dataset. The dataset is described here:
http://grouplens.org/datasets/movielens/

Run the following command to download and unzip the MovieLens dataset. You can also upload the dataset manually in case the download link does not work.

import os
os.system('wget http://files.grouplens.org/datasets/movielens/ml-1m.zip; unzip ml-1m.zip')

The data format can be found in the README or on the MovieLens webpage.
2. ALS model, model parameter search (3 pts)

Before we can use the recommendation model we need to determine whether our model is good. In the following code we analyze the model behavior while varying the rank, the number of iterations and the regularization parameter lambda. For this particular dataset we can assume that the rank should be quite small, less than 20. For the iterations parameter at least a few iterations are needed; the upper bound is limited by computation time. The regularization parameter lambda is largely guesswork and is estimated by trying values computationally. Its default value is 0.1.

Unfortunately, we don't have a cluster to compute results in a reasonable timeframe. At least this is the point where you notice that even a small amount of data can be computationally expensive.

To evaluate how the parameters affect the model's average prediction error, inspect the resulting plot, which shows how the model error behaves against the test dataset. We can only show the prediction error surface for two parameters at a time, so we need to fix one parameter to complete the analysis for the remaining one.
%matplotlib notebook

#python imports
import pyspark
import pyspark.mllib.recommendation as reco
import numpy
from timeit import default_timer as timer
import matplotlib
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

try:
    sc = pyspark.SparkContext('local[2]')
except:
    sc = sc

#read ratings file as lines of text, assuming no errors on data
lines = sc.textFile("ml-1m/ratings.dat").repartition(2)
ratings = lines.map(lambda l: l.split("::")) \
               .map(lambda p: reco.Rating(user=int(p[0]), product=int(p[1]), rating=float(p[2])))

#split the data into random test and training subsets
traindata, testdata = ratings.randomSplit([0.9, 0.1])
#cache data at this point as we are going to reuse it (see drop in time after first iteration!)
traindata.persist()
#remove ratings for test prediction
predictdata = testdata.map(lambda x: (x[0],x[1]))
predictdata.persist()

testdata = testdata.map(lambda r: ((r[0], r[1]), r[2]))


testdata.persist()

x_values = (3, 6, 9, 15)    #rank
y_values = (2, 5, 8, 15)    #iterations
mses = numpy.zeros((len(x_values), len(y_values)))
lambd = 0.1

last_time = timer()
for x in range(0, len(x_values)):
    for y in range(0, len(y_values)):
        model = reco.ALS.train(traindata, x_values[x], y_values[y], lambd,
                               nonnegative=True, seed=10)
        #calculate mean squared error between real user ratings and predicted ratings
        predictions = model.predictAll(predictdata).map(lambda r: ((r[0], r[1]), r[2]))
        ratesAndPreds = testdata.join(predictions)
        mses[x, y] = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
        #print progress percentage and the time taken by this model
        print(str((x*len(y_values) + y + 1) / (len(x_values)*len(y_values)) * 100) + '% ' +
              format(timer() - last_time, '.1f') + ' s')
        last_time = timer()

#plotting
X, Y = numpy.meshgrid(x_values, y_values, indexing='ij')
fig = plt.figure()
ax = fig.gca(projection='3d')
surf = ax.plot_surface(X, Y, mses,rstride=1, cstride=1,cmap='YlGn');

ax.set_xlabel('Rank')
ax.set_ylabel('Iterations')
ax.set_zlabel('Mean squared prediction error')

NOTE: The 3D plot in the figure is interactive; you can rotate it with the mouse!

Find the parameters of the best model from the figure by looking at the prediction error versus the parameter axes. Some expert knowledge needs to be applied when selecting the model parameters. For example, if you select a model with too high a rank, based only on its lower prediction error, you can end up with a feature space that doesn't mean anything: you could be splitting movies into very small subcategories where the only link between the movies is something like sharing an actor. Another thing that limits our parameter selection is computing time. With too many iterations, the gains in prediction power can be minimal, while the extra computation time on a cluster can cost money.

In a real-world case, after the parameters have been selected, we would run our model against a test set of individual personal movie tastes to see whether the model truly produces plausible results. We skip that for now and try it as our last exercise step.

Task 1: After you have analyzed the model behavior under iteration and rank parameter changes, document your rank and iteration values; you will need them later!
RANK: (one value)
ITERATIONS: (one value) (2 pts)

We are looking for the set of values for {rank; iterations} for which the model is the best. “The best” means
that:
- the MSE is low
- rank is between 0 and 20, for this dataset
- iterations value is not excessively high. Otherwise, it would take too much time to compute.

Here are the 3D plots obtained for ranks ranging from 8 to 14 and 12 to 18.

I tried different ranges by changing the values in these lines:


Interpretation:
When we look at one specific iteration value at a time on the 3D plot, we can see that the individual curves have the same shape along the rank range. Still, for a given iteration value, the MSE gets lower as the rank increases. It goes down to around 0.74 for 20 iterations (our highest iteration value here). Beyond 20 iterations the computation begins to take quite long and is not worth it, because the MSE does not get proportionally lower.

So the chosen values are:


RANK: 18
ITERATION: 20

Task 2: Replace the code below the x_values line with the following code and then find a good lambda value from the figure. Document your lambda value!
LAMBDA: (one value) (1 pts)

RANK = 8
x_values = (0.0075, 0.04, 0.09, 0.15)   #lambda
y_values = (2, 8, 15)                   #iterations
mses = numpy.zeros((len(x_values), len(y_values)))

last_time = timer()
for x in range(0, len(x_values)):
    for y in range(0, len(y_values)):
        model = reco.ALS.train(traindata, RANK, y_values[y], x_values[x],
                               nonnegative=True, seed=10)
        #calculate mean squared error between real user ratings and predicted ratings
        predictions = model.predictAll(predictdata).map(lambda r: ((r[0], r[1]), r[2]))
        ratesAndPreds = testdata.join(predictions)
        mses[x, y] = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
        #print progress percentage and the time taken by this model
        print(str((x*len(y_values) + y + 1) / (len(x_values)*len(y_values)) * 100) + '% ' +
              format(timer() - last_time, '.1f') + ' s')
        last_time = timer()

#plotting
X, Y = numpy.meshgrid(x_values, y_values, indexing='ij')
fig = plt.figure()
ax = fig.gca(projection='3d')
surf = ax.plot_surface(X, Y, mses,rstride=1, cstride=1,cmap='YlGn');

ax.set_xlabel('Lambda')
ax.set_ylabel('Iterations')
ax.set_zlabel('Mean squared prediction error')
Looking at the 3D plot, we see that the MSE is lowest for lambda equal to 0.04. So the solution is:

LAMBDA: 0.04
Recommendations based on user input and the ALS model
Do a fresh start (for example, create a new notebook).

3. Import movie name file (2 pts)


Start with the following code and continue adding code:

import pyspark
import pyspark.mllib.recommendation as reco

try:
    sc = pyspark.SparkContext('local[*]')
except:
    sc = sc

#read ratings file as lines of text, assuming no errors on data
lines = sc.textFile("ml-1m/ratings.dat")
ratings = lines.map(lambda l: l.split("::")) \
               .map(lambda p: reco.Rating(user=int(p[0]), product=int(p[1]), rating=float(p[2])))

You need the movie names to find the movies you want to rate, and also because at the end we want to print out the results with the names of the movies. The movie names are listed in movies.dat.

Task 1: Start by writing a data importer for the movie names. Map the movie names to key-value pairs (MovieID, (Title, Genres)) and store the result in the movienames variable. MovieID, etc. are the columns of movies.dat. Remember to convert MovieID to an integer!! (1 pts)

Here is the code for the movie name data importer. To convert MovieID to an integer, I reused the code from Lab 2.
The for loop lets us check the format of the result; we obtain a key-value map of (MovieID, (Title, Genres)) pairs.
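A minimal sketch of such an importer (assuming the MovieID::Title::Genres layout described in the dataset README; not necessarily the exact code used here, and note that accented characters in titles may not decode cleanly since the file is not UTF-8):

movielines = sc.textFile("ml-1m/movies.dat")
#build (MovieID, (Title, Genres)) pairs, converting MovieID to an integer
movienames = movielines.map(lambda l: l.split("::")) \
                       .map(lambda p: (int(p[0]), (p[1], p[2])))

#quick format check
for row in movienames.take(5):
    print(row)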

Task 2: Print movie names in sorted order so that you can find the movies you want to rate. (1 pts)

There, I printed 30 rows from the map showing names sorted in alphabetical order.
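A minimal sketch of one way to print them, sorting by the title (the first element of the value); the exact code may differ:

for movieid, (title, genres) in movienames.sortBy(lambda x: x[1][0]).take(30):
    print(movieid, title)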
4. Collect your movie ids from the printed list and rate them (2 pts)
Task 1: Collect MovieIDs from the printed list and rate them on a 1 to 5 scale. Rate movies that you know; otherwise it will be hard to tell whether your model works at the end of the exercise. The more you rate, the better your recommendation results will be! In the code block below, replace the existing ratings with yours. User id 0 is not used in the dataset, so we use it to inject our ratings into the dataset. (1 pts)

Add the following code block and insert your ratings.


my_user_id = 0

my_ratings = [
    (my_user_id, 1214, 5),
    (my_user_id, 1127, 4)]

new_ratings = sc.parallelize(my_ratings) \
                .map(lambda p: reco.Rating(user=int(p[0]), product=int(p[1]), rating=int(p[2])))

After you have rated enough movies, disable movie name printing!

Here is the code used to add new ratings for movies I know:


Task 2: Union the rows from your new_ratings rdd into the ratings rdd. (rdd has a union function) (1 pts)

FAILURE TO COMPLETE TASK 2 WILL RESULT IN SILENT FAILURE AT THE END!!!

Here is the code for the union between the rating sets and the result. In my case, the union is stored in an rdd named “union”:
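A minimal sketch of such a union, using the variable name described above (the exact code may differ):

union = ratings.union(new_ratings)
#sanity check: the count should grow by the number of my own ratings
print(union.count())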
5. Calculating movie rating counts for result inspection and to remove
movies that have too few ratings to be useful (3 pts)
Task 1: Calculate into moviecounts rdd how many ratings each movie has in the ratings rdd. (Hint:
Remember the word count example?) (1 pts)

Here is the code from Lab 2 adapted to our problem. You can find the result below; for example, the movie with ID 1367 has been rated 365 times.
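A minimal word-count-style sketch of this counting (counting MovieID occurrences in the ratings rdd; if you stored the union in a separate rdd, count on that instead):

moviecounts = ratings.map(lambda r: (r.product, 1)).reduceByKey(lambda a, b: a + b)
#check the count for a single movie
print(moviecounts.lookup(1367))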

Task 2: Make another filtered rdd from the moviecounts and store it in the too_few_ratings variable, based on how many ratings have been given. Keep the MovieIDs that have fewer ratings than some defined value (like 30). The structure of the too_few_ratings variable must be a Python list of MovieIDs.

Add the filtering with “filter(lambda x: x[1]<30)”, remove the counts with “map(lambda x: x[0])” and finally call “collect()”. Check that your result list contains only MovieID numbers! (1 pts)
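Putting the hinted pieces together, a sketch with 30 as the assumed threshold:

too_few_ratings = moviecounts.filter(lambda x: x[1] < 30) \
                             .map(lambda x: x[0]) \
                             .collect()
#the list should contain plain MovieID numbers only
print(too_few_ratings[:10])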

Task 3: Calculate the model again with your parameters from the parameter search in part 2. (1 pts)
iterations = 5
rank = 5
lambd = 0.01
#model calculation; make sure the ratings rdd here includes your own ratings from the union step
model = reco.ALS.train(ratings, rank, iterations, lambd, nonnegative=True, seed=10)

6. Selecting movies to be rated for you (1 pts)


Find out which movies we want recommendations to be generated for.

The following code builds the list of movies for which recommendation ratings will be generated. It removes the movies you rated and the movies with too few ratings.

my_not_rated = movienames.filter(lambda x: x[0] not in [row[1] for row in my_ratings]) #remove my rated
my_not_rated = my_not_rated.filter(lambda x: x[0] not in too_few_ratings) #remove with too few ratings
my_not_rated_with_user_id = my_not_rated.map(lambda p: (my_user_id,p[0])) #add userid

Task 1: Use the model's predictAll function with my_not_rated_with_user_id to find the recommendation ratings for the movies and store the result in the rdd recommendations. (1 pts)
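A minimal sketch of that call:

recommendations = model.predictAll(my_not_rated_with_user_id)
#each element is a Rating(user, product, rating)
print(recommendations.take(3))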

7. Result processing (5 pts)


The model.predictAll function returns an rdd with (UserID, MovieID, rating) columns.

Task 1: Map the recommendations into key-value pairs (MovieID, rating); in other words, get rid of the userid key. Then join the recommendations by key to movienames, and with another join to moviecounts. (1 pts)
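One possible shape for this (the intermediate variable names are only illustrative, and the nesting of the joined values depends on the join order you choose):

recs_by_movie = recommendations.map(lambda r: (r.product, r.rating))   #(MovieID, rating)
recs_named = recs_by_movie.join(movienames)     #(MovieID, (rating, (Title, Genres)))
recs_full = recs_named.join(moviecounts)        #(MovieID, ((rating, (Title, Genres)), count))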

Task 2: Sort the recommendations with the sortBy function by rating value into descending order (ascending=False). Print the top 100 recommendations from the sorted list, retrieved with recommendations.take(100).

Copy the top 15 results to your report and highlight the results that make some sense. (1 pts)
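Continuing the illustrative sketch above, the sorting and printing could look like this:

top = recs_full.sortBy(lambda x: x[1][0][0], ascending=False).take(100)
for movieid, ((rating, (title, genres)), count) in top[:15]:
    print(format(rating, '.2f'), title, genres, count)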

Task 3: The movies with too few ratings were removed; explain why they must be removed. (Set the moviecounts filter to compare against 0 to see what happens.) (1 pts)

Task 4:

Do your movie recommendations make any sense?

Do you think you have given a diverse enough set of input ratings for the model to work with?

Check the predicted ratings to see whether any of them overshoot past five. What do you think could have gone wrong if you get a lot of values way past five?

(2 pts)
