Kalyani Charitable Trust’s
Late G. N. Sapkal College of Engineering
Kalyani Hills, Anjaneri, Trimbakeshwar
Road, Nashik – 422 213
Experiment No. 15
Mini Project Report
Movie Recommendation System
Instructor’s name
Prof. S. V. Kangane
Subject
Data Science and Big Data Analytics Lab
DEPARTMENT OF COMPUTER ENGINEERING
2024-2025
Acknowledgement:
I would like to take this opportunity to express my sincere gratitude to everyone who
contributed to the successful completion of this project titled "Movie
Recommendation System Using scikit-learn".
First and foremost, I would like to express my heartfelt thanks to our subject instructor
Prof. S. V. Kangane for their valuable guidance, encouragement, and constant support
throughout the duration of the project. Their insightful lectures on Data Science and
Big Data Analytics (DSBDA), practical approach to teaching, and detailed feedback
helped me refine my understanding of recommendation systems and machine learning
techniques.
I am also thankful to the Department of Computer Engineering at LGNCOE for
providing the resources and environment conducive to research and development. The
structured curriculum and the emphasis on hands-on learning played a significant role
in the successful execution of this project.
I extend my sincere appreciation to all the faculty members and lab instructors,
whose foundational teachings and technical inputs over the semesters contributed
directly or indirectly to the knowledge applied in this work.
Abstract:
With the exponential growth of digital content, users often face difficulty in choosing
the right movie to watch. Recommendation systems help users discover content that
aligns with their interests. This project focuses on developing a content-based movie
recommendation system using Python and scikit-learn. It uses textual metadata such as
genres, keywords, cast, and director to find similarities between movies. The system
processes this information using TF-IDF Vectorization and computes similarity using
cosine similarity. The goal is to enhance user experience by providing personalized
recommendations. This system can serve as a foundational model for more advanced
systems used in the industry.
Introduction:
Recommendation systems have become essential in almost every industry dealing with
consumer content, particularly in entertainment, e-commerce, and social media. From
Netflix recommending shows to Spotify suggesting music, these systems
significantly improve user experience and retention. There are two primary types of
recommendation systems: collaborative filtering and content-based filtering. This
project focuses on the latter, wherein the recommendation is based on the features of
the item itself rather than user behavior. By analyzing movie metadata, this system
provides suggestions that share similar thematic elements and creative attributes with a
given movie. The implementation of this project gave me a solid understanding of how
to process unstructured text data, compute similarity metrics, and create a functional
recommendation engine.
Objective:
The objective of this project is to build a content-based movie recommendation
system that:
• Accepts a movie title as input from the user
• Extracts and processes metadata related to the movie
• Computes similarity with other movies based on textual attributes
• Returns the top N movies that are most similar to the input movie
The goal is to enhance user discovery and content exploration using data-driven
methods.
Technology Stack:
Programming Language:
Python: Chosen for its simplicity and extensive data science ecosystem.
Libraries & Tools:
• pandas: For loading and manipulating tabular data.
• numpy: For numerical operations.
• scikit-learn: For machine learning algorithms, especially TF-IDF vectorization
and similarity metrics.
• TfidfVectorizer: Converts textual data into vector space by calculating term
frequency-inverse document frequency.
• cosine_similarity: Measures similarity between two vectors.
Dataset:
movie_dataset.csv: A dataset containing movie details including cast,
keywords, genres, and director.
Dataset Description:
The dataset used in this project is a CSV file with around 4,800 records, where each
record represents a movie. The key columns utilized are:
• Title – The name of the movie.
• Genres – The type of movie (e.g., Action, Drama, Comedy).
• Keywords – Descriptive tags or phrases.
• Cast – Leading actors in the movie.
• Director – Director's name.
Data Cleaning:
• Handled missing values using .fillna('') to ensure smooth concatenation of
features.
• Combined selected features into a new column combined_features for further
vectorization.
Dataset Description:
The dataset used in this project is a CSV file containing metadata about a wide range of
movies. It serves as the backbone of the recommendation engine by providing essential
information needed for comparing and suggesting similar movies.
Columns Used
The following columns from the dataset were selected and combined to build the
recommendation model:
• Keywords: These are specific terms or tags associated with each movie, giving
an idea of its themes or story elements.
• Cast: Contains the names of leading actors and actresses. Cast information helps
identify similarities in star power and performance styles.
• Genres: Lists the categories the movie belongs to, such as Action, Drama,
Comedy, etc. This is crucial for genre-based recommendations.
• Director: Indicates the name of the director, which can influence the style and
type of the movie.
These columns were chosen because they carry rich textual information that can be
compared across movies to determine similarity.
Number of Records
The dataset includes information for approximately 4,800 movies. This provides a
sufficiently large and diverse sample to generate meaningful recommendations based
on various combinations of genre, cast, and other metadata.
Data Cleaning Performed
Before building the model, several data cleaning steps were taken:
• Missing Values Handling: Some entries in the selected columns were empty or
null. These were filled with empty strings ('') to prevent errors during text
processing.
• Text Normalization: All text data was converted to lowercase to maintain
consistency during comparison.
• Feature Merging: The values from the keywords, cast, genres, and director
columns were combined into a new column named combined_features. This
merged text was used for feature extraction through vectorization.
• Duplicate Removal: Checked and removed duplicate entries if any were found
to avoid biased similarity scores.
These preprocessing steps ensured that the dataset was clean, consistent, and ready for
use in the content-based recommendation system.
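The four cleaning steps can be sketched on a tiny, made-up table. This is only an illustration under stated assumptions: the rows are invented, the helper name clean_metadata is hypothetical, and removing duplicates by title is an assumed criterion. Only the column names come from this report.

```python
import pandas as pd

# Column names taken from the report; everything else here is illustrative.
FEATURES = ["keywords", "cast", "genres", "director"]

def clean_metadata(df):
    """Return a cleaned copy of df with a combined_features column."""
    out = df.copy()
    # 1. Missing values: replace nulls with empty strings.
    out[FEATURES] = out[FEATURES].fillna("")
    # 2. Text normalization: lowercase every selected column.
    for col in FEATURES:
        out[col] = out[col].str.lower()
    # 3. Feature merging: one space-separated string per movie.
    out["combined_features"] = out[FEATURES].apply(
        lambda row: " ".join(row), axis=1
    )
    # 4. Duplicate removal (assumed: same title means same movie),
    #    then renumber rows so positions and labels stay aligned.
    out = out.drop_duplicates(subset="title").reset_index(drop=True)
    return out

# Invented sample rows, including one duplicate and two missing values.
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie A"],
    "keywords": ["Space War", None, "Space War"],
    "cast": ["Actor One", "Actor Two", "Actor One"],
    "genres": ["Action", "Drama", "Action"],
    "director": ["Director X", None, "Director X"],
})
cleaned = clean_metadata(toy)
print(cleaned[["title", "combined_features"]])
```

Resetting the index after dropping duplicates matters if row positions are later used to index a similarity matrix, since the matrix rows are numbered 0 to n-1.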
Methodology:
1. Data Preprocessing
Clean and merge relevant columns into a single text string per movie.
2. TF-IDF Vectorization
Convert text into numerical feature vectors using the TF-IDF technique to give
more importance to unique terms.
3. Similarity Computation
Use cosine similarity to measure how close two movies are in vector space.
4. Recommendation Engine
Retrieve similar movies based on the highest similarity scores.
Implementation:
• Code Snippets with Explanation:
a) Data Preprocessing
import pandas as pd

# Load dataset
df = pd.read_csv("movie_dataset.csv")

# Select relevant features and replace missing values with empty strings
features = ['keywords', 'cast', 'genres', 'director']
for feature in features:
    df[feature] = df[feature].fillna('')

# Combine selected features into a single string
def combine_features(row):
    return ' '.join(row[feature] for feature in features)

df["combined_features"] = df.apply(combine_features, axis=1)
Explanation:
We clean the data by filling missing values and combining important columns to create
a meaningful text representation for each movie.
b) Feature Vectorization
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF weighting, as described in the Methodology section
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(df["combined_features"])
Explanation:
This step converts the combined text into a matrix of TF-IDF weights, which is the input
for similarity measurement.
c) Similarity Computation
from sklearn.metrics.pairwise import cosine_similarity

# cosine_sim[i][j] is the similarity between movie i and movie j
cosine_sim = cosine_similarity(tfidf_matrix)
Explanation:
Cosine similarity is used to calculate how similar two movies are, based on the TF-IDF
matrix.
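For intuition, cosine similarity between two feature vectors a and b is their dot product divided by the product of their lengths. The sketch below computes it by hand on three invented vectors; the function name cosine and the vectors are illustrative, and returning 0.0 for an all-zero vector is a choice made for this sketch.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # sketch-level choice: an empty vector matches nothing
    return dot / (norm_a * norm_b)

# Invented vectors over a three-term vocabulary.
movie_a = [1.0, 1.0, 0.0]
movie_b = [1.0, 0.0, 1.0]
movie_c = [0.0, 0.0, 2.0]

print(cosine(movie_a, movie_b))  # shares one of two terms
print(cosine(movie_a, movie_c))  # shares no terms
print(cosine(movie_b, movie_c))  # same direction on the shared term
```

Because the score depends only on the angle between vectors, movie_c's larger magnitude does not inflate its similarity, which is why the measure suits descriptions of different lengths.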
d) Recommendation Function
def get_title_from_index(index):
    return df.iloc[index]["title"]

def get_index_from_title(title):
    # Match case-insensitively and ignore surrounding whitespace
    try:
        return df[df.title.str.lower() == title.strip().lower()].index.values[0]
    except IndexError:
        return None

def recommend_movies(movie_title):
    index = get_index_from_title(movie_title)
    if index is None:
        return f"No movie found with title '{movie_title}'"
    similar_movies = list(enumerate(cosine_sim[index]))
    # Sort by score, highest first; skip the first entry (the movie itself), keep the next 5
    sorted_similar_movies = sorted(similar_movies, key=lambda x: x[1],
                                   reverse=True)[1:6]
    recommendations = [get_title_from_index(element[0]) for element in
                       sorted_similar_movies]
    return recommendations
Explanation:
This function takes the user input, finds the closest matches, and returns the top 5
recommendations based on similarity.
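The Objective mentions returning the top N movies, while the function above fixes N at 5. One possible way to make N a parameter is sketched below on an invented 4x4 similarity matrix. The name top_n_similar and the matrix values are hypothetical, and excluding the query by its index (rather than by skipping the first sorted entry) is a design choice of this sketch.

```python
import numpy as np

def top_n_similar(similarity, index, n=5):
    """Return positions of the n rows most similar to `index`, excluding itself."""
    scores = np.asarray(similarity[index], dtype=float)
    # Sort positions by descending score; stable sort keeps ties in row order.
    order = np.argsort(-scores, kind="stable")
    # Drop the query movie explicitly instead of assuming it sorts first.
    order = order[order != index]
    return order[:n].tolist()

# Invented symmetric similarity matrix for four movies.
toy_similarity = np.array([
    [1.0, 0.8, 0.1, 0.4],
    [0.8, 1.0, 0.2, 0.3],
    [0.1, 0.2, 1.0, 0.9],
    [0.4, 0.3, 0.9, 1.0],
])

print(top_n_similar(toy_similarity, 0, n=2))
print(top_n_similar(toy_similarity, 2, n=5))
```

Filtering by index stays correct even when another movie has identical features and therefore ties with the query at the top of the ranking.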
e) User Input Integration
user_movie = input("Enter a movie title: ")
recommendations = recommend_movies(user_movie)
if isinstance(recommendations, list):
    print("Top 5 movie recommendations:")
    for i, movie in enumerate(recommendations, start=1):
        print(f"{i}. {movie}")
else:
    print(recommendations)
Explanation:
Takes user input and displays recommendations using the previously defined functions.
Output:
• Screenshots of results
Screenshots of the console output can be taken after running the recommendation
function with inputs such as those listed under Handling User Inputs.
• Handling User Inputs
The system handles the following scenarios:
✅ Valid Input:
Enter a movie title: The Dark Knight
Output: List of top 5 similar movies.
❌ Invalid Input:
Enter a movie title: Unknown Hero
Output: No movie found with title 'Unknown Hero'
✅ Case Insensitive Input:
Enter a movie title: the DARK knight
Output: Top 5 similar movie recommendations (same as proper case).
✅ Trimmed Input:
Enter a movie title: The Dark Knight (typed with leading or trailing spaces)
Output: Top 5 recommendations shown correctly.
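The four scenarios above all reduce to how the typed title is normalized before lookup. A minimal sketch of that idea follows, assuming titles are compared after stripping surrounding whitespace and folding case. The helper names normalize_title and find_title and the two-item title list are hypothetical and stand in for the dataset's title column.

```python
def normalize_title(raw):
    """Strip surrounding whitespace and fold case for comparison."""
    return raw.strip().casefold()

def find_title(user_input, titles):
    """Return the position of the matching title, or None if absent."""
    wanted = normalize_title(user_input)
    for position, title in enumerate(titles):
        if normalize_title(title) == wanted:
            return position
    return None

# Stand-in for the dataset's title column (second entry is invented).
titles = ["The Dark Knight", "Another Movie"]

print(find_title("The Dark Knight", titles))        # valid input
print(find_title("the DARK knight", titles))        # case-insensitive input
print(find_title("   The Dark Knight  ", titles))   # untrimmed input
print(find_title("Unknown Hero", titles))           # invalid input
```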
Applications:
• OTT Platforms (Netflix, Amazon Prime): Enhance recommendations based on
user preferences.
• E-commerce: Recommend products similar to previously viewed items.
• E-learning: Suggest courses based on course history.
• Social Media: Recommend friends, pages, or groups based on content interests.
Future Scope:
• Collaborative Filtering: Incorporate user ratings and preferences.
• Hybrid Models: Combine collaborative and content-based methods.
• Deep Learning: Use word embeddings and neural networks for semantic
similarity.
• Web Deployment: Use Flask or Django to turn the system into a web application.
• Real-Time Feedback Loop: Adapt recommendations based on continuous user
interaction.
Conclusion:
This project successfully demonstrates a content-based Movie Recommendation
System using Python and scikit-learn.
It recommends similar movies based on features like genres, cast, keywords, and
director.
TF-IDF vectorization and cosine similarity were used to measure the similarity between
movies.
The system is simple yet effective in helping users discover relevant movies.
It improved my understanding of data preprocessing and machine learning concepts.
Handling real-world data taught me the importance of data cleaning and feature
selection.
The project also enhanced my coding and problem-solving skills.
Though basic, the model provides a foundation for future improvements like hybrid
systems.
It can be extended into a full-fledged app or integrated into streaming platforms.
Overall, it was a great learning experience in Data Science and Big Data Analytics.
Reference:
◻ Kaggle Dataset
Movie Metadata Dataset
Source: https://www.kaggle.com/datasets
Description: A publicly available dataset containing information about thousands
of movies, including genres, cast, director, and keywords.
◻ Scikit-learn Documentation
scikit-learn: Machine Learning in Python
Official Documentation: https://scikit-learn.org/stable/documentation.html
Description: Used for TF-IDF vectorization and calculating cosine similarity.
◻ Pandas Documentation
pandas: Powerful data structures for data analysis and manipulation
Official Website: https://pandas.pydata.org/
Description: Used for data loading, cleaning, and manipulation.
◻ NumPy Documentation
NumPy: The fundamental package for scientific computing with Python
Official Site: https://numpy.org/
Description: Used for numerical operations in the background during matrix
computation.
◻ Python Official Documentation
Python Language Reference
Website: https://docs.python.org/3/
Description: Referred to for core programming logic and syntax.
◻ Jupyter Notebook
Tool used for writing, testing, and visualizing the code in an interactive
environment.