Movie Recommendation System Using Graph Database
Technologies used:
Dataset Description:
1. MovieLens (Small) dataset, according to its own description, describes 5-star rating and free-
text tagging activity from [MovieLens]( http://movielens.org ), a movie recommendation service. It
contains 100836 ratings and 3683 tag applications across 9742 movies. These data were created by
610 users between March 29, 1996 and September 24, 2018. This dataset was generated on
September 26, 2018.
https://grouplens.org/datasets/movielens/
Here is what the downloaded zip file looks like:
The file links.csv contains 3 different ids for each movie: movieId, the id used in the MovieLens
dataset; imdbId, the corresponding id in the IMDB dataset; and tmdbId, the corresponding id in the
TMDB dataset (https://www.themoviedb.org/), which we’ll use to get information about the actors and
directors of the movies. E.g., the movie Toy Story has the link https://www.themoviedb.org/movie/862.
The file movies.csv contains the movie id, the title along with the year of release in
parentheses, and the genres, separated by “|”, which are selected from the following list:
* Action
* Adventure
* Animation
* Children's
* Comedy
* Crime
* Documentary
* Drama
* Fantasy
* Film-Noir
* Horror
* Musical
* Mystery
* Romance
* Sci-Fi
* Thriller
* War
* Western
* (no genres listed)
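The layout of movies.csv described above can be unpacked with a few lines of Python. This is a minimal sketch, not part of the original pipeline: the helper name parse_movie is hypothetical, and it assumes the year always appears as four digits in parentheses at the end of the title.

```python
import re

def parse_movie(title_field, genres_field):
    """Split a MovieLens title such as 'Toy Story (1995)' into title and year,
    and a '|'-separated genre string into a list of genre names."""
    # Assumption: the release year is the last parenthesised 4-digit group.
    match = re.search(r"\((\d{4})\)\s*$", title_field)
    if match:
        year = int(match.group(1))
        title = title_field[:match.start()].strip()
    else:
        year = None
        title = title_field.strip()
    # The placeholder value means the movie has no genres at all.
    if genres_field == "(no genres listed)":
        genres = []
    else:
        genres = genres_field.split("|")
    return title, year, genres

print(parse_movie("Toy Story (1995)", "Adventure|Animation|Comedy"))
```

Keeping the genres as a list makes it straightforward to create one Genre node per name later on.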
Each line of the file ratings.csv contains one rating of one movie by one user, made on a 5-star
scale with half-star increments (0.5 stars - 5.0 stars). The user is represented by id only; user ids
have been anonymized.
Each row of file tags.csv represents one tag applied to one movie by one user. Tags are user-
generated metadata about movies.
2. TMDB 5000 Movie Dataset. This dataset was generated from The Movie Database API. It
contains 2 csv files: one with detailed information about each movie (budget, genres, original
language and so forth), and a second one with movie credits (actors, directors, producers). Only
the csv file with credits will be used.
Compressed size of tmdb_5000_credits.csv file: 7.64 MB; uncompressed: 39 MB.
https://www.kaggle.com/tmdb/tmdb-movie-metadata
Each row of this file contains movie_id, title, cast and crew information.
We will use only the actors’ names from the cast column and the directors from the crew column.
Each cell of the “cast” column contains JSON-formatted data which looks as follows:
[
{
"cast_id": 4,
"character": "Captain Jack Sparrow",
"credit_id": "52fe4232c3a36847f800b50d",
"gender": 2,
"id": 85,
"name": "Johnny Depp",
"order": 0
},
{
"cast_id": 5,
"character": "Will Turner",
"credit_id": "52fe4232c3a36847f800b511",
"gender": 2,
"id": 114,
"name": "Orlando Bloom",
"order": 1
},
…
Data reading and preprocessing:
All queries have been written in the Cypher language.
To reproduce the results, please go to http://localhost:7474/browser/
and simply copy the queries from this report to the command line at the top of the page:
Data from the MovieLens dataset can be easily loaded into a neo4j database using the LOAD CSV
clause.
First, I placed the csv files in the %NEO4J_HOME%/import folder.
Let’s load the movie information by creating the label Movie with the properties id and title and the
label Genre with the single property title:
Added 9762 labels, created 9762 nodes, set 19504 properties, created 22084
relationships, completed after 38484 ms.
By loading data from ratings.csv we create the label User with id as its only property (because data
about users are anonymized) and the relationship RATED with the property rating: (User)-[:RATED
{ rating:}]->(Movie).
LOAD CSV WITH HEADERS FROM "file:///ratings.csv" AS line
MATCH (m:Movie {id:line.movieId})
MERGE (u:User {id:line.userId})
MERGE (u)-[:RATED { rating: toFloat(line.rating)}]->(m);
Added 610 labels, created 610 nodes, set 101446 properties, created 100836
relationships, completed after 695343 ms.
Tags:
Set 3683 properties, created 3683 relationships, completed after 21062 ms.
As described above, the file links.csv contains 3 different ids for each movie: movieId, the id used
in the MovieLens dataset; imdbId, the corresponding id in the IMDB dataset; and tmdbId, the
corresponding id in the TMDB dataset (https://www.themoviedb.org/), which we’ll use to get
information about the actors and directors of the movies. Let’s add a new property, tmdbId, to each movie:
Now let’s proceed with the information about actors and directors. The content of
tmdb_5000_credits.csv is not that easy to load into neo4j (some columns of the csv hold JSON), and
we don’t need all the information from this file (we will not use, for example, information about the
Director of Photography or the Casting Director to make a recommendation, with all due respect to
them). So let’s create a simple Python application which will read all the info from the
tmdb_5000_credits.csv file, filter it, and create csv files that are easy for neo4j to read.
First, let’s read data from csv file to Pandas dataframe:
import pandas as pd
import json
data = pd.read_csv("C:\\Users\\Aleks\\Desktop\\BD final\\tmdb_5000_credits.csv")
data.head()
Let’s examine the “cast” column. Each cell contains JSON-formatted data which looks as follows:
[
{
"cast_id": 4,
"character": "Captain Jack Sparrow",
"credit_id": "52fe4232c3a36847f800b50d",
"gender": 2,
"id": 85,
"name": "Johnny Depp",
"order": 0
},
{
"cast_id": 5,
"character": "Will Turner",
"credit_id": "52fe4232c3a36847f800b511",
"gender": 2,
"id": 114,
"name": "Orlando Bloom",
"order": 1
},
{
"cast_id": 6,
"character": "Elizabeth Swann",
"credit_id": "52fe4232c3a36847f800b515",
"gender": 1,
"id": 116,
"name": "Keira Knightley",
"order": 2
},
…
We need only the actor’s name and role. Let’s read the data we need into a new dataframe:
castDf = pd.DataFrame({'movieId':[], 'person_name':[], 'role':[]})
castDf.count()
movieId 106257
person_name 106257
role 106257
dtype: int64
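The loop that fills castDf is not shown in the report. The following is a minimal sketch of how such a flattening step could look, not the author's actual code. It uses only the standard library; the function name explode_cast and the two-row sample are hypothetical, while the keys "name" and "character" are taken from the JSON sample above.

```python
import json

def explode_cast(rows):
    """Flatten rows holding a JSON 'cast' cell into one record per actor."""
    records = []
    for row in rows:
        # Each 'cast' cell is a JSON array of objects, one per cast member.
        for member in json.loads(row["cast"]):
            records.append({
                "movieId": row["movie_id"],
                "person_name": member["name"],
                "role": member["character"],
            })
    return records

# Hypothetical stand-in for two rows of tmdb_5000_credits.csv.
sample_rows = [
    {"movie_id": 1,
     "cast": '[{"character": "Captain Jack Sparrow", "name": "Johnny Depp"},'
             ' {"character": "Will Turner", "name": "Orlando Bloom"}]'},
    {"movie_id": 2, "cast": "[]"},
]
cast_records = explode_cast(sample_rows)
print(cast_records)
```

With pandas, the input rows could come from data.to_dict("records") and the result could be wrapped with pd.DataFrame(cast_records) to obtain a frame shaped like castDf.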
We don’t need that many actors. Most of them probably appear only once, in a role like Waitress.
Let’s keep only those who appear more than 5 times, as the others are unlikely to be helpful in movie
recommendations:
# number of rows (appearances) per actor, repeated on each of that actor's rows
castDf['count'] = castDf.groupby('person_name')['person_name'].transform('count')
castDf = castDf[castDf['count'] > 5]
castDf = castDf.drop(columns='count')
castDf.count()
movieId 33470
person_name 33470
role 33470
dtype: int64
movieId 4773
person_name 4773
dtype: int64
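The step that builds directorDf from the crew column is not shown in the report. A minimal sketch of one way to do it follows; it is an illustration under assumptions, not the author's code. It assumes each crew entry carries "job" and "name" keys, and the helper names extract_directors and to_csv_text, as well as the sample names, are hypothetical.

```python
import csv
import io
import json

def extract_directors(rows):
    """Keep only crew members whose job is exactly 'Director'."""
    records = []
    for row in rows:
        for member in json.loads(row["crew"]):
            # Jobs such as 'Director of Photography' are skipped on purpose.
            if member.get("job") == "Director":
                records.append({"movieId": row["movie_id"],
                                "person_name": member["name"]})
    return records

def to_csv_text(records, fieldnames):
    """Serialise records as CSV with a header row, ready for LOAD CSV WITH HEADERS."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

# Hypothetical sample row; the names are placeholders, not real credits.
sample_rows = [
    {"movie_id": 1,
     "crew": '[{"job": "Director", "name": "Director A"},'
             ' {"job": "Director of Photography", "name": "Camera B"}]'},
]
directors = extract_directors(sample_rows)
print(to_csv_text(directors, ["movieId", "person_name"]))
```

The header names movieId and person_name match the columns that the LOAD CSV queries in this report read from directors.csv.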
Following the same logic as with actors, let’s keep only those who directed more than 3 movies, as
the others wouldn’t be of much help for recommendations:
# number of directed movies per director, repeated on each of that director's rows
directorDf['count'] = directorDf.groupby('person_name')['person_name'].transform('count')
directorDf = directorDf[directorDf['count'] > 3]
directorDf = directorDf.drop(columns='count')
directorDf.count()
movieId 2058
person_name 2058
dtype: int64
And directors.csv:
Now we can go back to neo4j and read data about directors and actors:
LOAD CSV WITH HEADERS FROM "file:///directors.csv" AS line
MATCH (m:Movie{ tmdbId:line.movieId})
MERGE (p:Person{name:line.person_name})
MERGE (p)-[:DIRECTED]->(m);
Added 339 labels, created 339 nodes, set 339 properties, created
1889 relationships, completed after 13784 ms.
Now, when all data have been read, let’s review general information about the obtained database:
Building recommendations
Let’s examine what our data looks like.
All genres, actors and directors of a movie:
MATCH (m:Movie {title: "Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)"})-[:ACTED_IN|:IS_GENRE|:DIRECTED]-(p)
RETURN m, p
Movies with shared actors or directors (connected through 2nd-degree connections):
MATCH q=(m:Movie {title: "Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)"})-[:ACTED_IN|:DIRECTED*..2]-(p)
RETURN q LIMIT 50
Users who rated or tagged this movie:
MATCH (m:Movie {title: "Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)"})-[:RATED|:TAGGED]-(u)
RETURN m, u LIMIT 25
Now that we are familiar with the data, let’s build some recommendations, starting with the
simplest one and gradually increasing the complexity of our queries.
The approach in which we take into consideration only what other users liked is called
Collaborative Filtering.
Let’s find the movies the targeted user likes, then find users who also liked those movies, and
recommend movies that those other users liked but which our user hasn’t seen (rated), sorted by the
number of “paths” that led to a particular recommendation.
MATCH (me:User{id:'220'})-[r1:RATED]->(m:Movie)<-[r2:RATED]-(other:User)-[r3:RATED]->(m2:Movie)
WHERE r1.rating > 3 AND r2.rating > 3 AND r3.rating > 3 AND NOT (me)-[:RATED]->(m2)
RETURN distinct m2 AS recommended_movie, count(*) AS score
ORDER BY score DESC
LIMIT 15
╒══════════════════════════════════════════════════════════════════════╤═══════╕
│"recommended_movie" │"score"│
╞══════════════════════════════════════════════════════════════════════╪═══════╡
│{"title":"Silence of the Lambs, The (1991)","tmdbId":"274","id":"593"}│7203 │
├──────────────────────────────────────────────────────────────────────┼───────┤
│{"title":"Lord of the Rings: The Fellowship of the Ring, The (2001)","│6563 │
│tmdbId":"120","id":"4993"} │ │
├──────────────────────────────────────────────────────────────────────┼───────┤
│{"title":"American Beauty (1999)","tmdbId":"14","id":"2858"} │6227 │
├──────────────────────────────────────────────────────────────────────┼───────┤
│{"title":"Braveheart (1995)","tmdbId":"197","id":"110"} │5894 │
├──────────────────────────────────────────────────────────────────────┼───────┤
│{"title":"Gladiator (2000)","tmdbId":"98","id":"3578"} │5777 │
├──────────────────────────────────────────────────────────────────────┼───────┤
│{"title":"Schindler's List (1993)","tmdbId":"424","id":"527"} │5663 │
├──────────────────────────────────────────────────────────────────────┼───────┤
│{"title":"Monty Python and the Holy Grail (1975)","tmdbId":"762","id":│5377 │
│"1136"} │ │
├──────────────────────────────────────────────────────────────────────┼───────┤
│{"title":"Ocean's Eleven (2001)","tmdbId":"161","id":"4963"} │5011 │
├──────────────────────────────────────────────────────────────────────┼───────┤
│{"title":"Alien (1979)","tmdbId":"348","id":"1214"} │4951 │
├──────────────────────────────────────────────────────────────────────┼───────┤
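To make the “number of paths” score concrete, here is a minimal Python sketch of the same counting on toy data. It is an illustration of the idea, not a replacement for the Cypher query: the function name recommend_by_paths and the users and movies in the sample are hypothetical.

```python
from collections import Counter

def recommend_by_paths(ratings, me, threshold=3.0):
    """Count paths me -> liked movie <- other user -> liked unseen movie.

    ratings maps user -> {movie: rating}; a movie is 'liked' when its
    rating is strictly above the threshold.
    """
    my_ratings = ratings[me]
    liked_by_me = {m for m, r in my_ratings.items() if r > threshold}
    scores = Counter()
    for other, other_ratings in ratings.items():
        if other == me:
            continue
        # Number of movies that both users liked: each one starts a path.
        shared = sum(1 for m in liked_by_me if other_ratings.get(m, 0) > threshold)
        if shared == 0:
            continue
        for movie, rating in other_ratings.items():
            # Only movies the other user liked and I have not rated at all.
            if rating > threshold and movie not in my_ratings:
                scores[movie] += shared
    return scores.most_common()

toy = {
    "u1": {"A": 5, "B": 4},
    "u2": {"A": 4.5, "B": 5, "C": 4, "D": 2},
    "u3": {"A": 4, "C": 5, "E": 3.5},
}
print(recommend_by_paths(toy, "u1"))
```

In this toy graph, movie C is reached through two paths via u2 and one via u3, so it scores 3, while E is reached only once.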
As every user tends to give higher or lower ratings in general, let’s filter by the average rating of
the particular user, rather than just the constant “3”:
MATCH (me:User{id:'220'})-[r:RATED]-(m)
WITH me, avg(r.rating) AS average
MATCH (me)-[r1:RATED]->(m:Movie)<-[r2:RATED]-(other:User)-[r3:RATED]->(m2:Movie)
WHERE r1.rating > average AND r2.rating > average AND r3.rating > average AND
NOT (me)-[:RATED]->(m2)
RETURN distinct m2 AS recommended_movie, count(*) AS score
ORDER BY score DESC
LIMIT 15
╒══════════════════════════════════════════════════════════════════════╤═══════╕
│"recommended_movie" │"score"│
╞══════════════════════════════════════════════════════════════════════╪═══════╡
│{"title":"Silence of the Lambs, The (1991)","tmdbId":"274","id":"593"}│5322 │
├──────────────────────────────────────────────────────────────────────┼───────┤
│{"title":"Lord of the Rings: The Fellowship of the Ring, The (2001)","│4276 │
│tmdbId":"120","id":"4993"} │ │
├──────────────────────────────────────────────────────────────────────┼───────┤
│{"title":"American Beauty (1999)","tmdbId":"14","id":"2858"} │4129 │
├──────────────────────────────────────────────────────────────────────┼───────┤
│{"title":"Schindler's List (1993)","tmdbId":"424","id":"527"} │4086 │
├──────────────────────────────────────────────────────────────────────┼───────┤
│{"title":"Braveheart (1995)","tmdbId":"197","id":"110"} │3982 │
├──────────────────────────────────────────────────────────────────────┼───────┤
│{"title":"Gladiator (2000)","tmdbId":"98","id":"3578"} │3537 │
├──────────────────────────────────────────────────────────────────────┼───────┤
│{"title":"Alien (1979)","tmdbId":"348","id":"1214"} │3502 │
├──────────────────────────────────────────────────────────────────────┼───────┤
│{"title":"Monty Python and the Holy Grail (1975)","tmdbId":"762","id":│3408 │
│"1136"} │ │
├──────────────────────────────────────────────────────────────────────┼───────┤
│{"title":"Godfather: Part II, The (1974)","tmdbId":"240","id":"1221"} │3330 │
├──────────────────────────────────────────────────────────────────────┼───────┤
Let’s use tags: find the tags our user gave to describe movies he likes, and find other movies with the
same tags (without taking into consideration whether the other users who tagged those movies liked
them or not).
Here are the movies and tags matching our user’s liking:
MATCH (me:User{id:'318'})-[r:RATED]-(m)
WITH me, avg(r.rating) AS average
MATCH (me)-[t1:TAGGED]->(m:Movie)-[r:RATED]-(me)
MATCH (other:User)-[t2:TAGGED]->(m1:Movie)
WHERE r.rating > average AND t1.tag=t2.tag AND NOT (me)-[:TAGGED]->(m1) AND
NOT (me)-[:RATED]->(m1)
RETURN m1, other
Every movie in this subgraph contains a tag our user liked.
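The tag-matching logic can be sketched in plain Python as follows. This is a simplified illustration on hypothetical toy data: the function name movies_sharing_tags and the sample users, movies and tags are assumptions, and tags are compared by exact string equality, as in the query.

```python
def movies_sharing_tags(tags, ratings, me):
    """Find movies, unseen by `me`, carrying a tag `me` used on a liked movie.

    tags is a list of (user, movie, tag) triples; ratings maps
    user -> {movie: rating}. 'Liked' means rated above the user's own average.
    """
    my_ratings = ratings[me]
    average = sum(my_ratings.values()) / len(my_ratings)
    # Tags I applied to movies that I rated above my own average.
    liked_tags = {tag for user, movie, tag in tags
                  if user == me and my_ratings.get(movie, 0) > average}
    my_tagged = {movie for user, movie, _ in tags if user == me}
    result = {}
    for user, movie, tag in tags:
        # Skip movies I have already tagged or rated.
        if tag in liked_tags and movie not in my_tagged and movie not in my_ratings:
            result.setdefault(movie, set()).add(tag)
    return result

toy_ratings = {"u1": {"A": 5, "B": 2}}
toy_tags = [
    ("u1", "A", "heist"), ("u1", "B", "boring"),
    ("u2", "C", "heist"), ("u2", "D", "boring"), ("u3", "A", "heist"),
]
print(movies_sharing_tags(toy_tags, toy_ratings, "u1"))
```

Only movie C qualifies here: it shares the tag "heist" with a movie u1 rated above average, whereas D shares a tag u1 used only on a disliked movie.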
Now let’s use the collaborative approach together with information about the content of the movie
(we have actors, directors, and genres).
First, let’s find the actors in movies our user liked, sorted by the number of times a particular
actor appears in such movies:
MATCH (me:User{id:'318'})-[r:RATED]-(m:Movie)
WITH me, avg(r.rating) AS average
MATCH (me)-[r:RATED]->(m:Movie)-[:ACTED_IN]-(p:Person)
WHERE r.rating > average
RETURN p as actor, COUNT(*) AS score
ORDER BY score DESC LIMIT 10
╒═════════════════════════════════╤═══════╕
│"actor" │"score"│
╞═════════════════════════════════╪═══════╡
│{"name":"Johnny Depp"} │10 │
├─────────────────────────────────┼───────┤
│{"name":"Matt Damon"} │10 │
├─────────────────────────────────┼───────┤
│{"name":"George Clooney"} │9 │
├─────────────────────────────────┼───────┤
│{"name":"Bill Hader"} │8 │
├─────────────────────────────────┼───────┤
│{"name":"Brad Pitt"} │8 │
├─────────────────────────────────┼───────┤
│{"name":"Steve Buscemi"} │8 │
├─────────────────────────────────┼───────┤
│{"name":"John C. Reilly"} │7 │
├─────────────────────────────────┼───────┤
│{"name":"Philip Seymour Hoffman"}│7 │
├─────────────────────────────────┼───────┤
│{"name":"Sean Penn"} │7 │
├─────────────────────────────────┼───────┤
│{"name":"Josh Brolin"} │6 │
└─────────────────────────────────┴───────┘
And genres:
MATCH (me:User{id:'318'})-[r:RATED]-(m:Movie)
WITH me, avg(r.rating) AS average
MATCH (me)-[r:RATED]->(m:Movie)-[:IS_GENRE]-(p:Genre)
WHERE r.rating > average
RETURN p.title as genre, COUNT(*) AS score
ORDER BY score DESC LIMIT 10
╒═════════════╤═══════╕
│"genre" │"score"│
╞═════════════╪═══════╡
│"Drama" │232 │
├─────────────┼───────┤
│"Comedy" │150 │
├─────────────┼───────┤
│"Thriller" │77 │
├─────────────┼───────┤
│"Action" │73 │
├─────────────┼───────┤
│"Crime" │72 │
├─────────────┼───────┤
│"Adventure" │66 │
├─────────────┼───────┤
│"Documentary"│61 │
├─────────────┼───────┤
│"Romance" │49 │
├─────────────┼───────┤
│"Sci-Fi" │49 │
├─────────────┼───────┤
│"Animation" │47 │
└─────────────┴───────┘
Now let’s use the combined information about favorite actors, directors and genres to provide the
user with a weighted recommendation, sorted by the number of overlapping paths that lead to a
particular recommended movie:
MATCH (me:User{id:'318'})-[r:RATED]-(m:Movie)
WITH me, avg(r.rating) AS average
MATCH (me)-[r:RATED]->(m:Movie)
WHERE r.rating > average
MATCH (m)-[:IS_GENRE]->(g:Genre)<-[:IS_GENRE]-(rm:Movie)
WITH me, m, rm, COUNT(*) AS gs
OPTIONAL MATCH (m)<-[:ACTED_IN]-(a:Person)-[:ACTED_IN]->(rm)
WITH me, m, rm, gs, COUNT(a) AS as
OPTIONAL MATCH (m)<-[:DIRECTED]-(d:Person)-[:DIRECTED]->(rm)
WITH me, m, rm, gs, as, COUNT(d) AS ds
MATCH (rm)
WHERE NOT (me)-[:RATED]->(rm)
RETURN rm.title AS recommendation,
gs as genre_score, as as actor_score, ds as director_score,
(5*gs)+(2*as)+(5*ds) AS weighed_score
ORDER BY weighed_score DESC LIMIT 10
5, 2 and 5 are the weights of the genre, actor and director scores; we can adjust them if we want to
give more weight to any of the categories.
╒══════════════════════════════════════════════╤═════════════╤═════════════╤════════════════╤═══════════════╕
│"recommendation" │"genre_score"│"actor_score"│"director_score"│"weighed_score"│
╞══════════════════════════════════════════════╪═════════════╪═════════════╪════════════════╪═══════════════╡
│"Toy Story 3 (2010)" │5 │11 │0 │47 │
├──────────────────────────────────────────────┼─────────────┼─────────────┼────────────────┼───────────────┤
│"The Hunger Games: Mockingjay - Part 2 (2015)"│2 │13 │1 │41 │
├──────────────────────────────────────────────┼─────────────┼─────────────┼────────────────┼───────────────┤
│"Kung Fu Panda 3 (2016)" │3 │11 │0 │37 │
├──────────────────────────────────────────────┼─────────────┼─────────────┼────────────────┼───────────────┤
│"Cloudy with a Chance of Meatballs (2009)" │3 │11 │0 │37 │
├──────────────────────────────────────────────┼─────────────┼─────────────┼────────────────┼───────────────┤
│"Ice Age 2: The Meltdown (2006)" │4 │6 │1 │37 │
├──────────────────────────────────────────────┼─────────────┼─────────────┼────────────────┼───────────────┤
│"22 Jump Street (2014)" │3 │8 │1 │36 │
├──────────────────────────────────────────────┼─────────────┼─────────────┼────────────────┼───────────────┤
│"Grown Ups 2 (2013)" │1 │13 │1 │36 │
├──────────────────────────────────────────────┼─────────────┼─────────────┼────────────────┼───────────────┤
│"Madagascar: Escape 2 Africa (2008)" │6 │3 │0 │36 │
├──────────────────────────────────────────────┼─────────────┼─────────────┼────────────────┼───────────────┤
│"Toy Story 3 (2010)" │3 │10 │0 │35 │
├──────────────────────────────────────────────┼─────────────┼─────────────┼────────────────┼───────────────┤
│"Despicable Me 2 (2013)" │3 │9 │0 │33 │
└──────────────────────────────────────────────┴─────────────┴─────────────┴────────────────┴───────────────┘
Started streaming 10 records after 64342 ms and completed after 64342 ms.
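The weighted score in the last column is plain arithmetic over the three counts. As a small worked check, the sketch below recomputes it for three rows of the table above; the function name weighted_score is hypothetical, and the default weights are the 5, 2, 5 used in the query.

```python
def weighted_score(genre_score, actor_score, director_score, weights=(5, 2, 5)):
    """Combine the three path counts as w_g * gs + w_a * as + w_d * ds."""
    w_genre, w_actor, w_director = weights
    return (w_genre * genre_score
            + w_actor * actor_score
            + w_director * director_score)

# (title, genre_score, actor_score, director_score) copied from the table above.
rows = [
    ("Toy Story 3 (2010)", 5, 11, 0),
    ("The Hunger Games: Mockingjay - Part 2 (2015)", 2, 13, 1),
    ("Madagascar: Escape 2 Africa (2008)", 6, 3, 0),
]
for title, gs, actor, ds in rows:
    print(title, weighted_score(gs, actor, ds))
```

For the first row this gives 5*5 + 2*11 + 5*0 = 47, matching the table. Passing a different weights tuple shows how the ranking would shift if, say, actors mattered more than genres.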
Here is a somewhat simplified query, with only actors, to visualize the connections:
MATCH (me:User{id:'318'})-[r:RATED]-(m:Movie)
WITH me, avg(r.rating) AS average
MATCH (me)-[r:RATED]->(m:Movie)
WHERE r.rating > 4.5
MATCH (m)<-[:ACTED_IN]-(a:Person)-[:ACTED_IN]->(rm)
MATCH (rm)
WHERE NOT (me)-[:RATED]->(rm)
RETURN rm, a, me ,m LIMIT 50
It shows that our user liked the movie Dinner for Schmucks (2010), in which Paul Rudd, Rick Overton,
and others played, so we take a look at the other movies they played in. In the original query, we
sorted recommendations by the number of overlapping paths that lead to a particular recommended movie.
Apart from obvious recommendations like movies from the same series, we got pretty good matches
such as “Clerks” and “Chasing Amy”, and so forth.
Let’s go back to Collaborative Filtering. Instead of taking into consideration the opinion of all users
in the system, let’s find the most “similar” users, i.e., users who have the same taste. The easiest way
to do so is to compute the correlation coefficient between the targeted user and each other user, and
then use ratings given only by “like-minded” users.
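We’ll use the sample Pearson correlation coefficient, which is defined as follows:
$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} $$
where
n is the sample size;
x_i, y_i are the individual sample points indexed with i;
\bar{x}, \bar{y} are the sample means of x and y.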
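The formula translates directly into a few lines of Python. The sketch below is a plain implementation of the definition on two equally long lists of ratings; the function name pearson is hypothetical, and returning None for a zero denominator is a design choice of this sketch that mirrors the b <> 0 guard in the Cypher queries of this report.

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation of two equally long sequences of numbers."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = (math.sqrt(sum((x - mean_x) ** 2 for x in xs))
                   * math.sqrt(sum((y - mean_y) ** 2 for y in ys)))
    if denominator == 0:
        # One of the users gave the same rating everywhere: r is undefined.
        return None
    return numerator / denominator

print(pearson([5, 4, 1], [4.5, 4, 2]))
```

Note one difference from this textbook form: the Cypher queries in this report take each user's mean over all of that user's ratings, not only over the movies both users rated, so their values can differ from what this function returns on the shared movies alone.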
Let’s find users with a large correlation coefficient between ratings given by our user and all others:
MATCH (me:User {id:"220"})-[r:RATED]->(m:Movie)
WITH me, avg(r.rating) AS my_average
MATCH (me)-[r1:RATED]->(m:Movie)<-[r2:RATED]-(other)
WITH me, my_average, other, COLLECT({r1: r1, r2: r2}) AS ratings WHERE size(ratings)
> 10
MATCH (other)-[r:RATED]->(m:Movie)
WITH me, my_average, other, avg(r.rating) AS other_average, ratings
UNWIND ratings AS r
WITH sum( (r.r1.rating- my_average) * (r.r2.rating- other_average) ) AS a,
sqrt( sum( (r.r1.rating - my_average)^2) * sum( (r.r2.rating - other_average) ^2)) AS b,
me, other
WHERE b <> 0
RETURN me.id, other.id, a/b as correlation
ORDER BY correlation DESC LIMIT 10
╒═══════╤══════════╤══════════════════╕
│"me.id"│"other.id"│"correlation" │
╞═══════╪══════════╪══════════════════╡
│"220" │"494" │0.7825315077845476│
├───────┼──────────┼──────────────────┤
│"220" │"32" │0.7818916367269141│
├───────┼──────────┼──────────────────┤
│"220" │"485" │0.7633105914491696│
├───────┼──────────┼──────────────────┤
│"220" │"97" │0.7547965924537339│
├───────┼──────────┼──────────────────┤
│"220" │"79" │0.7399103445131804│
├───────┼──────────┼──────────────────┤
│"220" │"88" │0.7328107458190376│
├───────┼──────────┼──────────────────┤
│"220" │"124" │0.718593011670177 │
├───────┼──────────┼──────────────────┤
│"220" │"235" │0.7165393509283289│
├───────┼──────────┼──────────────────┤
│"220" │"436" │0.7043763345369647│
├───────┼──────────┼──────────────────┤
│"220" │"500" │0.6597414172261901│
├───────┼──────────┼──────────────────┤
Let’s show what the movies rated by both the targeted user and the user with the largest
correlation value look like:
MATCH (me:User {id:"220"})-[:RATED]->(m:Movie)
MATCH (other:User {id:"494"})-[:RATED]->(m:Movie)
RETURN me, other, m
As we see, highly rated movies by user 220 are also highly rated by user 494; poorly rated movies by
user 220 are also poorly rated by user 494.
MATCH (me:User {id:"220"})-[r:RATED]->(m:Movie)
WITH me, avg(r.rating) AS my_average
MATCH (me)-[r1:RATED]->(m:Movie)<-[r2:RATED]-(other)
WITH me, my_average, other, COLLECT({r1: r1, r2: r2}) AS ratings WHERE size(ratings)
> 10
MATCH (other)-[r:RATED]->(m:Movie)
WITH me, my_average, other, avg(r.rating) AS other_average, ratings
UNWIND ratings AS r
WITH sum( (r.r1.rating- my_average) * (r.r2.rating- other_average) ) AS a,
sqrt( sum( (r.r1.rating - my_average)^2) * sum( (r.r2.rating - other_average) ^2)) AS b,
me, other
WHERE b <> 0
WITH me, other, a/b as correlation
ORDER BY correlation DESC LIMIT 10
MATCH (other)-[r:RATED]->(m:Movie) WHERE NOT EXISTS( (me)-[:RATED]->(m) )
WITH m, SUM( correlation* r.rating) AS score, COLLECT(other) AS other
RETURN m, other, score
ORDER BY score DESC LIMIT 10
Started streaming 25 records after 237 ms and completed after 237 ms.
Here we see the movie title, the list of users whose opinions were taken into consideration, and the
score, which is the sum, over those users, of the rating given by each user multiplied by that user’s
correlation coefficient.
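The score described above can be sketched in Python as a correlation-weighted sum. This is an illustration on hypothetical numbers: the function name score_unseen and the sample neighbours are assumptions, and the correlations are taken as already computed.

```python
from collections import defaultdict

def score_unseen(my_movies, neighbours):
    """Score movies I have not rated by summing correlation * rating.

    neighbours is a list of (correlation, {movie: rating}) pairs for the
    most similar users; my_movies is the set of movies I already rated.
    """
    scores = defaultdict(float)
    for correlation, their_ratings in neighbours:
        for movie, rating in their_ratings.items():
            if movie not in my_movies:
                scores[movie] += correlation * rating
    # Highest score first; ties keep their first-seen order.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

toy_neighbours = [
    (0.5, {"A": 5, "B": 4, "C": 2}),
    (0.25, {"B": 4, "D": 4}),
]
print(score_unseen({"A"}, toy_neighbours))
```

Movie B comes first because two similar users rated it, contributing 0.5*4 + 0.25*4 = 3.0. A consequence of summing rather than averaging is that widely rated movies are favoured over movies rated highly by a single neighbour.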
Here is the visualization.
Here we can find the users who are highly correlated with user 220, and their ratings of the chosen
movies. 6 such users gave a high rating to the leading movie, Silence of the Lambs.
Similarly, we can find users with negative correlation: if the targeted user likes particular movies, a
user with a high negative correlation will dislike them, and vice versa. We can then use such
“anti-recommendations” to hide these movies from the user in order not to upset them.
I used the neo4j graph database and the declarative graph query language Cypher to create a model for
a movie recommendation system based on previous user experience. As data sources, I chose 2 separate
datasets: MovieLens, which contains ratings and tags applied to movies by users, and the
TMDB 5000 Movie Dataset, which gave me access to the movies’ actors and directors. Data from the 2
datasets were joined using the links.csv file, which contains both the “internal” movie id (used
throughout the MovieLens files) and the “foreign” id, which refers to the movie id in the TMDB 5000
Movie Dataset.
Neo4j fits this task well. We constantly have to traverse connections between entities: for example,
find movies liked by user1 that are also liked by other users, and then find movies that those other
users liked but user1 hasn’t seen. Had we used a traditional relational database, we’d end up with a
large number of joins, which are expensive for an RDBMS. With a graph database, on the other hand,
we have fast access to both the data (user, movie, genre) and the relationships between them. As all
relationships are easily and quickly accessible, queries can be processed very fast, which makes the
model usable for real-time recommendation engines.
Most queries used in this work took about 200-500 ms to process. The longest query took ~60000
ms; in an RDBMS it would require ~10 joins and would likely take much longer.
Another advantage of using a graph database for this model is that it’s easy to visualize the
connections and paths that led us to a particular result and, by doing so, to understand the
underlying pattern better.
The graph query language Cypher is very easy to learn but very powerful. It allows a user to write
moderately complex queries even without prior knowledge of the language. I, for example, had
never used it before, except for one homework in this course, yet I thoroughly enjoyed working
with it.
I used different models: Content-Based, Collaborative Filtering, and a combination of the two. It’s
hard to evaluate the performance of such models. We would have to propose movies to a user and
then see whether he or she liked them, so we would need “access” to a real user.
It would be interesting to expand our model with other features, such as user demographic
information, social relationships, more consistent tags that describe the movies, and more information
about the movies themselves, such as which series a movie belongs to (we wouldn’t want to recommend
episode #8 of a long series to a user who has never watched any of the previous ones, even if his
friends like it; it would be better to recommend watching from the beginning).
As I discovered, the problem of creating a model for a recommendation engine, in particular for a
movie recommendation system, can be solved successfully and easily using a graph database.
References:
Code and technical info:
https://neo4j.com/
https://anaconda.org/anaconda/anaconda-navigator
https://www.python.org/
http://jupyter.org/
http://guides.neo4j.com/sandbox/recommendations
https://neo4j.com/developer/movie-database/#_import_instructions
https://neo4j.com/graphgist/competency-management-a-matter-of-filtering-and-recommendation-engines#competences
https://github.com/citruz/movies4j
https://neo4j.com/blog/real-time-recommendation-engine-data-science/
https://en.wikipedia.org/wiki/Jaccard_index
https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
Data source:
https://www.kaggle.com/tmdb/tmdb-movie-metadata
https://www.themoviedb.org/
https://grouplens.org/datasets/movielens/
F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context.
ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19.
https://doi.org/10.1145/2827872