Data Mining Exercise 3

Exercise 3: Collaborative filtering and Alternating Least Squares

1. Collaborative filtering (4 pts)

Find out what collaborative filtering is; there is plenty of information available online. A few links:
- Course book.
- https://en.wikipedia.org/wiki/Collaborative_filtering
- http://bugra.github.io/work/notes/2014-04-19/alternating-least-squares-method-for-collaborative-filtering/
- http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html

There are also many examples built with the MovieLens data:


- https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html
- https://www.codementor.io/spark/tutorial/building-a-recommender-with-apache-spark-python-example-app-part1

Task 1: Describe collaborative filtering briefly. (1 pts)

Task 2: Find some example cases where collaborative filtering can be used. (1 pts)

Task 3: Find some implementation types of collaborative filtering, for example the Alternating Least Squares (ALS) algorithm. (1 pts)

Task 4: For what purpose is the implicit version of collaborative filtering used in the ALS algorithm? How does it differ from the explicit version? (check the Spark MLlib link above) (1 pts)

Input dataset
We will test the recommendation algorithm with the MovieLens 1M dataset. The dataset is described here:
http://grouplens.org/datasets/movielens/

Run the following command to download and unzip the MovieLens dataset. You can also upload the dataset manually in case the download link does not work.

import os
os.system('wget http://files.grouplens.org/datasets/movielens/ml-1m.zip; unzip ml-1m.zip')

The data format can be found in the README or on the MovieLens webpage.
2. ALS model, model parameter search (3 pts)

Before we can use the recommendation model we need to determine whether our model is good. In the following code we analyze the model behavior while varying the rank, the number of iterations and the regularization parameter lambda. For this particular dataset we can assume that the rank should be quite small, less than 20. For the iterations parameter at least a few iterations are needed; the upper bound is limited by computation time. The regularization parameter lambda is largely guesswork and is estimated by trying values computationally. Its default value is 0.1.

Unfortunately, we don't have a cluster to compute results in a reasonable timeframe. At least this is the point where you notice that even a small amount of data can be computationally expensive.

To evaluate how the parameters affect the model's average prediction error, inspect the resulting plot, which shows how the model error behaves against the test dataset. We can only show the prediction error surface for two parameters at a time, so we need to fix one parameter to complete the analysis for the remaining one.
%matplotlib notebook

#python imports
import pyspark
import pyspark.mllib.recommendation as reco
import numpy
from timeit import default_timer as timer
import matplotlib
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

try:
    sc = pyspark.SparkContext('local[2]')
except:
    sc = sc

#read ratings file as lines of text, assuming no errors on data
lines = sc.textFile("ml-1m/ratings.dat").repartition(2)
ratings = lines.map(lambda l: l.split("::")) \
               .map(lambda p: reco.Rating(user=int(p[0]), product=int(p[1]), rating=float(p[2])))

#split the data into random test and training subsets
traindata, testdata = ratings.randomSplit([0.9, 0.1])
#cache data at this point as we are going to reuse it (see drop in time after first iteration!)
traindata.persist()
#remove ratings for test prediction
predictdata = testdata.map(lambda x: (x[0],x[1]))
predictdata.persist()

testdata = testdata.map(lambda r: ((r[0], r[1]), r[2]))


testdata.persist()

x_values = (3, 6, 9, 15)    #rank
y_values = (2, 5, 8, 15)    #iterations
mses = numpy.zeros((len(x_values), len(y_values)))
lambd = 0.1

last_time = timer()
for x in range(0, len(x_values)):
    for y in range(0, len(y_values)):
        model = reco.ALS.train(traindata, x_values[x], y_values[y], lambd,
                               nonnegative=True, seed=10)
        #calculate mean squared error between real user ratings and predicted ratings
        predictions = model.predictAll(predictdata).map(lambda r: ((r[0], r[1]), r[2]))
        ratesAndPreds = testdata.join(predictions)
        mses[x, y] = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
        #print progress percentage and the time taken by this model
        print(str((x*len(y_values) + y + 1) / (len(x_values)*len(y_values)) * 100) + '% ' +
              format(timer() - last_time, '.1f') + ' s')
        last_time = timer()

#plotting
X, Y = numpy.meshgrid(x_values, y_values, indexing='ij')
fig = plt.figure()
ax = fig.gca(projection='3d')
surf = ax.plot_surface(X, Y, mses,rstride=1, cstride=1,cmap='YlGn');

ax.set_xlabel('Rank')
ax.set_ylabel('Iterations')
ax.set_zlabel('Mean squared prediction error')

NOTE: The 3D plot in the figure is interactive; you can rotate it with the mouse!

Find the parameters of the best model from the figure by looking at the prediction error versus the parameter axes. Some expert knowledge needs to be applied when selecting the model parameters. For example, if you select a model with too high a rank, based only on its lower prediction error, you can end up with a feature space that doesn't mean anything: you could be splitting movies into very small subcategories where the only link between the movies is something like sharing an actor. Another thing that limits our parameter selection is computing time. With too many iterations, the gains in prediction power can be minimal, while the extra computation time on a cluster can cost money.

In a real-world case, after the parameters have been selected, we would run our model against a test set of individual personal movie tastes to see whether the model truly produces plausible results. We skip that for now and try it as our last exercise step.

Task 1: After you have analyzed the model behavior under iteration and rank parameter changes, document your rank and iteration values; you will need them later!
RANK: (one value)
ITERATIONS: (one value) (2 pts)

We are looking for the set of values for {rank; iterations} for which the model is the best. “The best” means
that:
- the MSE is low
- rank is between 0 and 20, for this dataset
- iterations value is not excessively high. Otherwise, it would take too much time to compute.

Here are the 3D plots obtained for ranks ranging from 8 to 14 and 12 to 18.

I tried different ranges by changing the values in these lines:


Interpretation:
When we look at one specific iteration value at a time on the 3D plot, we can see that the individual curves have the same shape along the rank range. Still, for a given iteration value, the MSE gets lower as the rank increases. It goes down to around 0.74 for 20 iterations (our highest iteration value here). Beyond 20 iterations the computation begins to take quite long and is not worth it, because the MSE does not get proportionally lower.

So the chosen values are:


RANK: 18
ITERATION: 20

Task 2: Replace the code below the x_values line with the following code and then find a good lambda value from the figure. Document your lambda value!
LAMBDA: (one value) (1 pts)

RANK = 8
x_values = (0.0075, 0.04, 0.09, 0.15)   #lambda
y_values = (2, 8, 15)                   #iterations
mses = numpy.zeros((len(x_values), len(y_values)))

last_time = timer()
for x in range(0, len(x_values)):
    for y in range(0, len(y_values)):
        model = reco.ALS.train(traindata, RANK, y_values[y], x_values[x],
                               nonnegative=True, seed=10)
        #calculate mean squared error between real user ratings and predicted ratings
        predictions = model.predictAll(predictdata).map(lambda r: ((r[0], r[1]), r[2]))
        ratesAndPreds = testdata.join(predictions)
        mses[x, y] = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
        #print progress percentage and the time taken by this model
        print(str((x*len(y_values) + y + 1) / (len(x_values)*len(y_values)) * 100) + '% ' +
              format(timer() - last_time, '.1f') + ' s')
        last_time = timer()

#plotting
X, Y = numpy.meshgrid(x_values, y_values, indexing='ij')
fig = plt.figure()
ax = fig.gca(projection='3d')
surf = ax.plot_surface(X, Y, mses,rstride=1, cstride=1,cmap='YlGn');

ax.set_xlabel('Lambda')
ax.set_ylabel('Iterations')
ax.set_zlabel('Mean squared prediction error')
Looking at the 3D plot, we see that the MSE is lowest for lambda equal to 0.04. So the solution is:

LAMBDA: 0.04
Recommendations based on user input and the ALS model
Do a fresh start (for example, create a new notebook).

3. Import movie name file (2 pts)


Start with the following code and continue adding code:

import pyspark
import pyspark.mllib.recommendation as reco

try:
    sc = pyspark.SparkContext('local[*]')
except:
    sc = sc

#read ratings file as lines of text, assuming no errors on data
lines = sc.textFile("ml-1m/ratings.dat")
ratings = lines.map(lambda l: l.split("::")) \
               .map(lambda p: reco.Rating(user=int(p[0]), product=int(p[1]), rating=float(p[2])))

You need the movie names to find the movies you want to rate, and also because at the end we want to print out the results with the names of the movies. The movie names are listed in movies.dat.

Task 1: Start by writing a data importer for the movie names. Map the movie names to key-value pairs (MovieID, (Title, Genres)) and store the result in the movienames variable. MovieID, etc. are the columns of movies.dat. Remember to convert MovieID to an integer!! (1 pts)

Here is the code for the movie name data importer. To convert MovieID to an integer, I reused the code from Lab 2.
The for loop lets us check the format of the result; we obtain a key-value map of (MovieID, (Title, Genres)) pairs.
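A minimal sketch of such an importer (assuming the MovieID::Title::Genres layout described in the dataset README; not necessarily the exact code used here, and note that accented characters in titles may not decode cleanly since the file is not UTF-8):

movielines = sc.textFile("ml-1m/movies.dat")
#build (MovieID, (Title, Genres)) pairs, converting MovieID to an integer
movienames = movielines.map(lambda l: l.split("::")) \
                       .map(lambda p: (int(p[0]), (p[1], p[2])))

#quick format check
for row in movienames.take(5):
    print(row)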

Task 2: Print movie names in sorted order so that you can find the movies you want to rate. (1 pts)

There, I printed 30 rows from the map showing names sorted in alphabetical order.
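A minimal sketch of one way to print them, sorting by the title (the first element of the value); the exact code may differ:

for movieid, (title, genres) in movienames.sortBy(lambda x: x[1][0]).take(30):
    print(movieid, title)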
4. Collect your movie ids from the printed list and rate them (2 pts)
Task 1: Collect MovieIDs from the printed list and rate them on a 1 to 5 scale. Rate movies that you know; otherwise it will be hard to tell whether your model works at the end of the exercise. The more you rate, the better your recommendation results will be! In the code block below, replace the existing ratings with yours. User id 0 is not used in the dataset, so we use it to inject our ratings into the dataset. (1 pts)

Add the following code block and insert your ratings.


my_user_id = 0

my_ratings = [
    (my_user_id, 1214, 5),
    (my_user_id, 1127, 4)]

new_ratings = sc.parallelize(my_ratings) \
                .map(lambda p: reco.Rating(user=int(p[0]), product=int(p[1]), rating=int(p[2])))

After you have rated enough movies, disable movie name printing!

Here is the code used to add new ratings for movies I know:


Task 2: Union the rows from your new_ratings rdd into the ratings rdd. (rdd has a union function) (1 pts)

FAILURE TO COMPLETE TASK 2 WILL RESULT IN SILENT FAILURE AT THE END!!!

Here is the code for the union between the rating sets and the result. In my case, the union is stored in an rdd named “union”:
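A minimal sketch of such a union, using the variable name described above (the exact code may differ):

union = ratings.union(new_ratings)
#sanity check: the count should grow by the number of my own ratings
print(union.count())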
5. Calculating movie rating counts for result inspection and to remove
movies that have too few ratings to be useful (3 pts)
Task 1: Calculate into moviecounts rdd how many ratings each movie has in the ratings rdd. (Hint:
Remember the word count example?) (1 pts)

Here is the code from Lab 2 adapted to our problem. You can find the result below; for example, the movie with ID 1367 has been rated 365 times.
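A minimal word-count-style sketch of this counting (counting MovieID occurrences in the ratings rdd; if you stored the union in a separate rdd, count on that instead):

moviecounts = ratings.map(lambda r: (r.product, 1)).reduceByKey(lambda a, b: a + b)
#check the count for a single movie
print(moviecounts.lookup(1367))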

Task 2: Make another filtered rdd from the moviecounts and store it in the too_few_ratings variable, based on how many ratings have been given. Keep the MovieIDs that have fewer ratings than some defined value (like 30). The structure of the too_few_ratings variable must be a Python list of MovieIDs.

Add the filtering with “filter(lambda x: x[1]<30)”, remove the counts with “map(lambda x: x[0])” and finally call “collect()”. Check that your result list contains only MovieID numbers! (1 pts)
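Putting the hinted pieces together, a sketch with 30 as the assumed threshold:

too_few_ratings = moviecounts.filter(lambda x: x[1] < 30) \
                             .map(lambda x: x[0]) \
                             .collect()
#the list should contain plain MovieID numbers only
print(too_few_ratings[:10])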

Task 3: Calculate the model again with your parameters from the parameter search in part 2. (1 pts)
iterations = 5
rank = 5
lambd = 0.01
#model calculation; make sure the ratings rdd here includes your own ratings from the union step
model = reco.ALS.train(ratings, rank, iterations, lambd, nonnegative=True, seed=10)

6. Selecting movies to be rated for you (1 pts)


Find out which movies we want recommendations to be generated for.

The following code builds the list of movies for which recommendation ratings will be generated. It removes the movies you rated and the movies with too few ratings.

my_not_rated = movienames.filter(lambda x: x[0] not in [row[1] for row in my_ratings]) #remove my rated
my_not_rated = my_not_rated.filter(lambda x: x[0] not in too_few_ratings) #remove with too few ratings
my_not_rated_with_user_id = my_not_rated.map(lambda p: (my_user_id,p[0])) #add userid

Task 1: Use the model's predictAll function with my_not_rated_with_user_id to find the recommendation ratings for the movies and store the result in the rdd recommendations. (1 pts)
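A minimal sketch of that call:

recommendations = model.predictAll(my_not_rated_with_user_id)
#each element is a Rating(user, product, rating)
print(recommendations.take(3))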

7. Result processing (5 pts)


The model.predictAll function returns an rdd with (UserID, MovieID, rating) columns.

Task 1: Map the recommendations into key-value pairs (MovieID, rating); in other words, get rid of the userid key. Then join the recommendations by key to movienames, and with another join to moviecounts. (1 pts)
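One possible shape for this (the intermediate variable names are only illustrative, and the nesting of the joined values depends on the join order you choose):

recs_by_movie = recommendations.map(lambda r: (r.product, r.rating))   #(MovieID, rating)
recs_named = recs_by_movie.join(movienames)     #(MovieID, (rating, (Title, Genres)))
recs_full = recs_named.join(moviecounts)        #(MovieID, ((rating, (Title, Genres)), count))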

Task 2: Sort the recommendations with the sortBy function by rating value into descending order (ascending=False). Print the top 100 recommendations from the sorted list, retrieved with recommendations.take(100).

Copy the top 15 results to your report and highlight the results that make some sense. (1 pts)
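Continuing the illustrative sketch above, the sorting and printing could look like this:

top = recs_full.sortBy(lambda x: x[1][0][0], ascending=False).take(100)
for movieid, ((rating, (title, genres)), count) in top[:15]:
    print(format(rating, '.2f'), title, genres, count)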

Task 3: The movies with too few ratings were removed; explain why they must be removed. (Set the moviecounts filter to compare against 0 to see what happens.) (1 pts)

Task 4:

Do your movie recommendations make any sense?

Do you think you have given a diverse enough set of input ratings for the model to work with?

Check the predicted ratings to see whether any of them overshoot past five. What do you think could have gone wrong if you get a lot of values way past five?

(2 pts)
