Dsbda Mini Project

The document presents a mini project report on a Movie Recommendation System developed using Python and scikit-learn, focusing on content-based filtering. It utilizes movie metadata such as genres, keywords, cast, and director to generate personalized recommendations through TF-IDF vectorization and cosine similarity. The project highlights the importance of data preprocessing and machine learning concepts, providing a foundation for future enhancements in recommendation systems.

Uploaded by

mthorat535


Kalyani Charitable Trust’s

Late G. N. Sapkal College of Engineering

Kalyani Hills, Anjaneri, Trimbakeshwar
Road, Nashik – 422 213

Experiment No. 15

Mini Project Report

Movie Recommendation System

Instructor’s name
Prof. S. V. Kangane

Subject
Data Science and Big Data Analytics Lab

DEPARTMENT OF COMPUTER ENGINEERING

2024-2025

Acknowledgement:
I would like to take this opportunity to express my sincere gratitude to everyone who
contributed to the successful completion of this project titled "Movie
Recommendation System Using scikit-learn".

First and foremost, I would like to express my heartfelt thanks to our subject instructor
Prof. S. V. Kangane for their valuable guidance, encouragement, and constant support
throughout the duration of the project. Their insightful lectures on Data Science and
Big Data Analytics (DSBDA), practical approach to teaching, and detailed feedback
helped me refine my understanding of recommendation systems and machine learning
techniques.

I am also thankful to the Department of Computer Engineering at LGNCOE for
providing the resources and environment conducive to research and development. The
structured curriculum and the emphasis on hands-on learning played a significant role
in the successful execution of this project.

I extend my sincere appreciation to all the faculty members and lab instructors,
whose foundational teachings and technical inputs over the semesters contributed
directly or indirectly to the knowledge applied in this work.

Abstract:
With the exponential growth of digital content, users often face difficulty in choosing
the right movie to watch. Recommendation systems help users discover content that
aligns with their interests. This project focuses on developing a content-based movie
recommendation system using Python and scikit-learn. It uses textual metadata such as
genres, keywords, cast, and director to find similarities between movies. The system
processes this information using TF-IDF Vectorization and computes similarity using
cosine similarity. The goal is to enhance user experience by providing personalized
recommendations. This system can serve as a foundational model for more advanced
systems used in the industry.

Introduction:
Recommendation systems have become essential in almost every industry dealing with
consumer content, particularly in entertainment, e-commerce, and social media. From
Netflix recommending shows, to Spotify suggesting music, these systems
significantly enhance user experience and retention. There are two primary types of
recommendation systems: collaborative filtering and content-based filtering. This
project focuses on the latter, wherein the recommendation is based on the features of
the item itself rather than user behavior. By analyzing movie metadata, this system
provides suggestions that share similar thematic elements and creative attributes with a
given movie. The implementation of this project gave me a solid understanding of how
to process unstructured text data, compute similarity metrics, and create a functional
recommendation engine.

Objective:
The objective of this project is to build a content-based movie recommendation
system that:

• Accepts a movie title as input from the user
• Extracts and processes metadata related to the movie
• Computes similarity with other movies based on textual attributes
• Returns the top N movies that are most similar to the input movie

The goal is to enhance user discovery and content exploration using data-driven
methods.

Technology Stack:
• Programming Language:
  Python: Chosen for its simplicity and extensive data science ecosystem.

• Libraries & Tools:
  pandas: For loading and manipulating tabular data.
  numpy: For numerical operations.
  scikit-learn: For machine learning algorithms, especially TF-IDF vectorization
  and similarity metrics.
  TfidfVectorizer: Converts textual data into a vector space by calculating term
  frequency-inverse document frequency (TF-IDF) weights.
  cosine_similarity: Measures similarity between two vectors.

• Dataset:
  movie_dataset.csv: A dataset containing movie details including cast,
  keywords, genres, and director.

Dataset Description:
The dataset used in this project is a CSV file with around 4,800 records, where each
record represents a movie. The key columns utilized are:

• Title – The name of the movie.
• Genres – The type of movie (e.g., Action, Drama, Comedy).
• Keywords – Descriptive tags or phrases.
• Cast – Leading actors in the movie.
• Director – Director's name.

Data Cleaning:

• Handled missing values using .fillna('') to ensure smooth concatenation of
  features.
• Combined selected features into a new column, combined_features, for further
  vectorization.

Dataset Description:
The dataset used in this project is a CSV file containing metadata about a wide range of
movies. It serves as the backbone of the recommendation engine by providing essential
information needed for comparing and suggesting similar movies.

Columns Used

The following columns from the dataset were selected and combined to build the
recommendation model:

• Keywords: These are specific terms or tags associated with each movie, giving
  an idea of its themes or story elements.
• Cast: Contains the names of leading actors and actresses. Cast information helps
  identify similarities in star power and performance styles.
• Genres: Lists the categories the movie belongs to, such as Action, Drama,
  Comedy, etc. This is crucial for genre-based recommendations.
• Director: Indicates the name of the director, which can influence the style and
  type of the movie.

These columns were chosen because they carry rich textual information that can be
compared across movies to determine similarity.

Number of Records

The dataset includes information for approximately 4,800 movies. This provides a
sufficiently large and diverse sample to generate meaningful recommendations based
on various combinations of genre, cast, and other metadata.

Data Cleaning Performed

Before building the model, several data cleaning steps were taken:

• Missing Values Handling: Some entries in the selected columns were empty or
  null. These were filled with empty strings ('') to prevent errors during text
  processing.
• Text Normalization: All text data was converted to lowercase to maintain
  consistency during comparison.
• Feature Merging: The values from the keywords, cast, genres, and director
  columns were combined into a new column named combined_features. This
  merged text was used for feature extraction through vectorization.
• Duplicate Removal: Checked and removed duplicate entries if any were found
  to avoid biased similarity scores.

These preprocessing steps ensured that the dataset was clean, consistent, and ready for
use in the content-based recommendation system.
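
The four cleaning steps listed above can be sketched with pandas as follows. This is a minimal illustration, not the project's actual code: the toy DataFrame and its values are invented stand-ins for movie_dataset.csv, and the variable names toy and cleaned are assumptions introduced here.

```python
import pandas as pd

# Toy stand-in for movie_dataset.csv; every value below is invented for illustration.
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie B", "Movie C"],
    "keywords": [None, "Family Drama", "Family Drama", "heist crew"],
    "cast": ["Actor One", "Actor Two", "Actor Two", None],
    "genres": ["Action SciFi", "Drama", "Drama", "Crime Thriller"],
    "director": ["Director X", "Director Y", "Director Y", "Director Z"],
})

features = ["keywords", "cast", "genres", "director"]

# Duplicate removal: drop repeated rows and renumber the remaining ones.
cleaned = toy.drop_duplicates().reset_index(drop=True)

# Missing values handling and text normalization:
# fill nulls with empty strings, then lowercase everything.
for feature in features:
    cleaned[feature] = cleaned[feature].fillna("").str.lower()

# Feature merging: join the four columns into one string per movie.
cleaned["combined_features"] = cleaned[features].agg(" ".join, axis=1)

print(cleaned[["title", "combined_features"]])
```

Removing duplicates before resetting the index keeps row positions contiguous, which matters later if row positions are used to look up similarity scores.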

Methodology:
1. Data Preprocessing
   • Clean and merge relevant columns into a single text string per movie.

2. TF-IDF Vectorization
   • Convert text into numerical feature vectors using TF-IDF, which gives more
     weight to terms that are rare across the dataset than to terms that appear
     in most movies.

3. Similarity Computation
   • Use cosine similarity to measure how close two movies are in vector space.

4. Recommendation Engine
   • Retrieve the movies with the highest similarity scores for the selected movie.


Implementation:
• Code Snippets with Explanation:

a) Data Preprocessing
import pandas as pd

# Load the dataset
df = pd.read_csv("movie_dataset.csv")

# Select relevant features and replace missing values with empty strings
features = ['keywords', 'cast', 'genres', 'director']
for feature in features:
    df[feature] = df[feature].fillna('')

# Combine the selected features into a single string per movie
def combine_features(row):
    return ' '.join(row[feature] for feature in features)

df["combined_features"] = df.apply(combine_features, axis=1)

Explanation:
We clean the data by filling missing values and combining important columns to create
a meaningful text representation for each movie.

b) Feature Vectorization
from sklearn.feature_extraction.text import TfidfVectorizer

# Learn the vocabulary and build one TF-IDF vector per movie
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df["combined_features"])

Explanation:
This step converts the combined text into a sparse matrix of TF-IDF weights, with one
row per movie and one column per term. TfidfVectorizer implements the TF-IDF
weighting described in the Methodology, and the resulting matrix is the input to the
similarity measurement.

c) Similarity Computation
from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarity between every pair of movies
cosine_sim = cosine_similarity(tfidf_matrix)

Explanation:
Cosine similarity is used to calculate how similar two movies are, based on their
TF-IDF vectors. The entry cosine_sim[i][j] is the score between movie i and movie j.
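
To make the measure concrete, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following is a minimal hand-written sketch using only the standard library; the function name cosine and the three small vectors are invented for illustration, and the project itself relies on scikit-learn's cosine_similarity.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|); returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Invented term vectors over the vocabulary [action, batman, romance].
movie_a = [1, 1, 0]
movie_b = [2, 2, 0]   # same direction as movie_a, larger magnitude
movie_c = [0, 0, 1]   # shares no terms with movie_a

print(cosine(movie_a, movie_b))   # close to 1: same mix of terms
print(cosine(movie_a, movie_c))   # 0.0: no terms in common
```

Because the score depends on direction rather than magnitude, a movie with many repeated terms is not favoured over a movie with the same mix of terms mentioned fewer times.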

d) Recommendation Function
def get_title_from_index(index):
    return df.iloc[index]["title"]

def get_index_from_title(title):
    # Compare case-insensitively and ignore surrounding whitespace
    matches = df[df.title.str.lower() == title.strip().lower()]
    try:
        return matches.index.values[0]
    except IndexError:
        return None

def recommend_movies(movie_title):
    index = get_index_from_title(movie_title)
    if index is None:
        return f"No movie found with title '{movie_title}'"

    # Pair every movie index with its similarity to the selected movie
    similar_movies = list(enumerate(cosine_sim[index]))

    # Sort by score, highest first; position 0 is normally the movie itself,
    # so skip it and keep the next five
    sorted_similar_movies = sorted(similar_movies, key=lambda x: x[1],
                                   reverse=True)[1:6]

    recommendations = [get_title_from_index(element[0])
                       for element in sorted_similar_movies]
    return recommendations

Explanation:
This function takes the user input, finds the closest matches, and returns the top 5
recommendations based on similarity.
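
The slice [1:6] assumes that the selected movie is the first entry after sorting, and it fixes the result size at five. A hedged alternative, sketched below with plain Python lists, excludes the selected movie by index and takes the number of results as a parameter, in line with the "top N" wording of the Objective. The function top_n_similar, its parameter n, and the sample scores are hypothetical and not part of the project's code.

```python
def top_n_similar(similarity_row, movie_index, n=5):
    """Return (index, score) pairs for the n highest scores, excluding movie_index."""
    candidates = [
        (i, score) for i, score in enumerate(similarity_row) if i != movie_index
    ]
    # Sort by score, highest first; the sort is stable, so ties keep index order.
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:n]

# Invented similarity scores between movie 2 and five movies (itself included).
row = [0.10, 0.75, 1.00, 0.40, 0.75]
print(top_n_similar(row, movie_index=2, n=3))
```

Excluding by index also behaves predictably when another movie has identical features and therefore the same score as the selected movie.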

e) User Input Integration

user_movie = input("Enter a movie title: ")
recommendations = recommend_movies(user_movie)

if isinstance(recommendations, list):
    print("Top 5 movie recommendations:")
    for i, movie in enumerate(recommendations, start=1):
        print(f"{i}. {movie}")
else:
    print(recommendations)

Explanation:
Takes user input and displays recommendations using the previously defined functions.

Output:
• Screenshots of results

The results were captured from the console output after running the recommendation
function with inputs such as the following:

• Handling User Inputs

The system handles the following scenarios:

✅ Valid Input:

Enter a movie title: The Dark Knight

Output: List of top 5 similar movies.

❌ Invalid Input:

Enter a movie title: Unknown Hero

Output: No movie found with title 'Unknown Hero'

✅ Case Insensitive Input:

Enter a movie title: the DARK knight

Output: Top 5 similar movie recommendations (same as proper case).

✅ Trimmed Input:

Enter a movie title:    The Dark Knight    (typed with leading and trailing spaces)

Output: Top 5 recommendations shown correctly.
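
The four input scenarios above can be reproduced with a small lookup helper. This is a minimal sketch under stated assumptions: it works on an in-memory list of titles rather than the dataset, and normalize_title, find_title_index, and the sample titles are hypothetical names introduced for illustration.

```python
def normalize_title(title):
    """Trim surrounding whitespace and ignore case for title comparison."""
    return title.strip().casefold()

def find_title_index(titles, query):
    """Return the position of the first title matching query, or None."""
    wanted = normalize_title(query)
    for position, title in enumerate(titles):
        if normalize_title(title) == wanted:
            return position
    return None

# Invented title list standing in for the dataset's title column.
titles = ["Avatar", "The Dark Knight", "Inception"]

print(find_title_index(titles, "The Dark Knight"))        # valid input
print(find_title_index(titles, "the DARK knight"))        # case-insensitive input
print(find_title_index(titles, "   The Dark Knight  "))   # input with extra spaces
print(find_title_index(titles, "Unknown Hero"))           # invalid input
```

Normalizing both the stored title and the query in one place keeps the comparison rule consistent across all four scenarios.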

Applications:
• OTT Platforms (Netflix, Amazon Prime): Enhance recommendations based on
  user preferences.
• E-commerce: Recommend products similar to previously viewed items.
• E-learning: Suggest courses based on course history.
• Social Media: Recommend friends, pages, or groups based on content interests.

Future Scope:
• Collaborative Filtering: Incorporate user ratings and preferences.
• Hybrid Models: Combine collaborative and content-based methods.
• Deep Learning: Use word embeddings and neural networks for semantic
  similarity.
• Web Deployment: Use Flask or Django to turn the system into a web application.
• Real-Time Feedback Loop: Adapt recommendations based on continuous user
  interaction.

Conclusion:

This project successfully demonstrates a content-based Movie Recommendation System
using Python and scikit-learn. It recommends similar movies based on features like
genres, cast, keywords, and director. TF-IDF vectorization and cosine similarity were
used to measure the similarity between movies. The system is simple yet effective in
helping users discover relevant movies.

It improved my understanding of data preprocessing and machine learning concepts.
Handling real-world data taught me the importance of data cleaning and feature
selection. The project also enhanced my coding and problem-solving skills.

Though basic, the model provides a foundation for future improvements like hybrid
systems. It can be extended into a full-fledged app or integrated into streaming
platforms. Overall, it was a great learning experience in Data Science and Big Data
Analytics.

References:
• Kaggle Dataset: Movie Metadata Dataset
  Source: https://www.kaggle.com/datasets
  Description: A publicly available dataset containing information about thousands
  of movies, including genres, cast, director, and keywords.

• Scikit-learn Documentation: scikit-learn: Machine Learning in Python
  Official Documentation: https://scikit-learn.org/stable/documentation.html
  Description: Used for TF-IDF vectorization and calculating cosine similarity.

• Pandas Documentation: pandas: Powerful data structures for data analysis and
  manipulation
  Official Website: https://pandas.pydata.org/
  Description: Used for data loading, cleaning, and manipulation.

• NumPy Documentation: NumPy: The fundamental package for scientific computing
  with Python
  Official Site: https://numpy.org/
  Description: Used for numerical operations in the background during matrix
  computation.

• Python Official Documentation: Python Language Reference
  Website: https://docs.python.org/3/
  Description: Referred to for core programming logic and syntax.

• Jupyter Notebook
  Description: Tool used for writing, testing, and visualizing the code in an
  interactive environment.
