Dsbda Mini Project

The document presents a mini project report on a Movie Recommendation System developed using Python and scikit-learn, focusing on content-based filtering. It utilizes movie metadata such as genres, keywords, cast, and director to generate personalized recommendations through TF-IDF vectorization and cosine similarity. The project highlights the importance of data preprocessing and machine learning concepts, providing a foundation for future enhancements in recommendation systems.

Uploaded by

mthorat535

Kalyani Charitable Trust’s

Late G. N. Sapkal College of Engineering


Kalyani Hills, Anjaneri, Trimbakeshwar
Road, Nashik – 422 213

Experiment No. 15

Mini Project Report


Movie Recommendation System

Instructor’s name
Prof S. V. Kangane

Subject
Data Science and Big Data Analytics Lab

DEPARTMENT OF COMPUTER ENGINEERING


2024-2025

Acknowledgement:
I would like to take this opportunity to express my sincere gratitude to everyone who
contributed to the successful completion of this project titled "Movie
Recommendation System Using scikit-learn".

First and foremost, I would like to express my heartfelt thanks to our subject instructor
Prof. S. V. Kangane for their valuable guidance, encouragement, and constant support
throughout the duration of the project. Their insightful lectures on Data Science and
Big Data Analytics (DSBDA), practical approach to teaching, and detailed feedback
helped me refine my understanding of recommendation systems and machine learning
techniques.

I am also thankful to the Department of Computer Engineering at LGNCOE for
providing the resources and environment conducive to research and development. The
structured curriculum and the emphasis on hands-on learning played a significant role
in the successful execution of this project.

I extend my sincere appreciation to all the faculty members and lab instructors,
whose foundational teachings and technical inputs over the semesters contributed
directly or indirectly to the knowledge applied in this work.

Abstract:
With the exponential growth of digital content, users often face difficulty in choosing
the right movie to watch. Recommendation systems help users discover content that
aligns with their interests. This project focuses on developing a content-based movie
recommendation system using Python and scikit-learn. It uses textual metadata such as
genres, keywords, cast, and director to find similarities between movies. The system
processes this information using TF-IDF Vectorization and computes similarity using
cosine similarity. The goal is to enhance user experience by providing personalized
recommendations. This system can serve as a foundational model for more advanced
systems used in the industry.

Introduction:
Recommendation systems have become essential in almost every industry dealing with
consumer content, particularly in entertainment, e-commerce, and social media. From
Netflix recommending shows to Spotify suggesting music, these systems
significantly enhance user experience and retention. There are two primary types of
recommendation systems: collaborative filtering and content-based filtering. This
project focuses on the latter, wherein the recommendation is based on the features of
the item itself rather than user behavior. By analyzing movie metadata, this system
provides suggestions that share similar thematic elements and creative attributes with a
given movie. The implementation of this project gave me a solid understanding of how
to process unstructured text data, compute similarity metrics, and create a functional
recommendation engine.

Objective:
The objective of this project is to build a content-based movie recommendation
system that:

• Accepts a movie title as input from the user

• Extracts and processes metadata related to the movie

• Computes similarity with other movies based on textual attributes

• Returns the top N movies that are most similar to the input movie

The goal is to enhance user discovery and content exploration using data-driven
methods.

Technology Stack:
• Programming Language:

Python: Chosen for its simplicity and extensive data science ecosystem.

• Libraries & Tools:

pandas: For loading and manipulating tabular data.

numpy: For numerical operations.

scikit-learn: For machine learning algorithms, especially TF-IDF vectorization
and similarity metrics.

TfidfVectorizer: Converts textual data into vector space by calculating term
frequency-inverse document frequency.

cosine_similarity: Measures similarity between two vectors.



• Dataset:

movie_dataset.csv: A dataset containing movie details including cast,
keywords, genres, and director.

Dataset Description:
The dataset used in this project is a CSV file with around 4,800 records, where each
record represents a movie. The key columns utilized are:

• Title – The name of the movie.

• Genres – The type of movie (e.g., Action, Drama, Comedy).

• Keywords – Descriptive tags or phrases.

• Cast – Leading actors in the movie.

• Director – Director's name.

Data Cleaning:

• Handled missing values using .fillna('') to ensure smooth concatenation of
features.

• Combined selected features into a new column combined_features for further
vectorization.

Dataset Description:
The dataset used in this project is a CSV file containing metadata about a wide range of
movies. It serves as the backbone of the recommendation engine by providing essential
information needed for comparing and suggesting similar movies.

Columns Used

The following columns from the dataset were selected and combined to build the
recommendation model:

• Keywords: These are specific terms or tags associated with each movie, giving
an idea of its themes or story elements.

• Cast: Contains the names of leading actors and actresses. Cast information helps
identify similarities in star power and performance styles.

• Genres: Lists the categories the movie belongs to, such as Action, Drama,
Comedy, etc. This is crucial for genre-based recommendations.

• Director: Indicates the name of the director, which can influence the style and
type of the movie.

These columns were chosen because they carry rich textual information that can be
compared across movies to determine similarity.

Number of Records

The dataset includes information for approximately 4,800 movies. This provides a
sufficiently large and diverse sample to generate meaningful recommendations based
on various combinations of genre, cast, and other metadata.

Data Cleaning Performed

Before building the model, several data cleaning steps were taken:

• Missing Values Handling: Some entries in the selected columns were empty or
null. These were filled with empty strings ('') to prevent errors during text
processing.

• Text Normalization: All text data was converted to lowercase to maintain
consistency during comparison.

• Feature Merging: The values from the keywords, cast, genres, and director
columns were combined into a new column named combined_features. This
merged text was used for feature extraction through vectorization.

• Duplicate Removal: Checked and removed duplicate entries if any were found
to avoid biased similarity scores.

These preprocessing steps ensured that the dataset was clean, consistent, and ready for
use in the content-based recommendation system.
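
The cleaning steps listed above can be sketched on a toy table. The rows and names below are hypothetical placeholders, not records from movie_dataset.csv, and the lowercasing and duplicate-removal calls are one plausible way to carry out those steps rather than the exact code used in the project.

```python
import pandas as pd

# Hypothetical toy rows standing in for movie_dataset.csv
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie B", "Movie C"],
    "keywords": ["space war", None, None, "heist"],
    "cast": ["Actor One", "Actor Two", "Actor Two", None],
    "genres": ["Action Sci-Fi", "Drama", "Drama", "Crime Thriller"],
    "director": ["Director X", "Director Y", "Director Y", "Director Z"],
})

features = ["keywords", "cast", "genres", "director"]

# 1. Missing values handling: replace nulls with empty strings
toy[features] = toy[features].fillna("")

# 2. Text normalization: lowercase every feature column
for col in features:
    toy[col] = toy[col].str.lower()

# 3. Duplicate removal, then renumber rows so positions stay contiguous
toy = toy.drop_duplicates().reset_index(drop=True)

# 4. Feature merging: one space-separated string per movie
toy["combined_features"] = toy[features].apply(" ".join, axis=1)

print(toy[["title", "combined_features"]])
```

Resetting the index after dropping duplicates keeps each row label equal to its position, which matters later because rows of the similarity matrix are addressed by position.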

Methodology:
1. Data Preprocessing

• Clean and merge relevant columns into a single text string per movie.

2. TF-IDF Vectorization

• Convert text into numerical feature vectors using the TF-IDF technique to give
more importance to unique terms.

3. Similarity Computation

• Use cosine similarity to measure how close two movies are in vector space.

4. Recommendation Engine

• Retrieve similar movies based on the highest similarity scores.
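
As a rough illustration of steps 2 to 4, the sketch below runs the pipeline on three made-up feature strings. The titles and strings are hypothetical and are not taken from the dataset; TfidfVectorizer and cosine_similarity are used with scikit-learn's default settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical combined-feature strings, one per movie
titles = ["Space Saga", "Star Battle", "Quiet Drama"]
docs = [
    "space war action scifi director x",
    "space battle action scifi director y",
    "family drama romance director z",
]

# Step 2: TF-IDF vectorization (rows = movies, columns = terms)
tfidf_matrix = TfidfVectorizer().fit_transform(docs)

# Step 3: pairwise cosine similarity between all movies
sim = cosine_similarity(tfidf_matrix)

# Step 4: rank the other movies by similarity to the first one
query = 0
ranked = sorted(
    (i for i in range(len(titles)) if i != query),
    key=lambda i: sim[query, i],
    reverse=True,
)
top_match = titles[ranked[0]]
print(top_match)
```

"Star Battle" shares several terms with "Space Saga", while "Quiet Drama" shares only one, so the first two movies end up closest in vector space.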



Implementation:
• Code Snippets with Explanation:

a) Data Preprocessing
import pandas as pd

# Load dataset
df = pd.read_csv("movie_dataset.csv")

# Select relevant features and replace missing values with empty strings
features = ['keywords', 'cast', 'genres', 'director']
for feature in features:
    df[feature] = df[feature].fillna('')

# Combine selected features into a single string
def combine_features(row):
    return ' '.join(row[feature] for feature in features)

df["combined_features"] = df.apply(combine_features, axis=1)

Explanation:
We clean the data by filling missing values and combining important columns to create
a meaningful text representation for each movie.

b) Feature Vectorization
from sklearn.feature_extraction.text import CountVectorizer

# Build a matrix of token counts from the combined text
cv = CountVectorizer()
count_matrix = cv.fit_transform(df["combined_features"])

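Explanation:
This step converts the combined text into a sparse matrix of token counts (a
bag-of-words representation), with one row per movie and one column per distinct term.
CountVectorizer stores raw counts only; the TF-IDF weighting described in the
Methodology is obtained by using TfidfVectorizer from the same module in its place.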
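
To make the difference between raw counts and TF-IDF weights concrete, the following sketch fits both vectorizers on three made-up strings (not taken from the dataset). It assumes scikit-learn's default settings, under which a term that appears in every document is weighted lower than a term that appears in only one.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical documents: "action" occurs everywhere, "hero" only once
docs = ["action hero", "action comedy", "action romance"]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)

tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs)

# Column positions of the two terms in each matrix
c_action, c_hero = count_vec.vocabulary_["action"], count_vec.vocabulary_["hero"]
t_action, t_hero = tfidf_vec.vocabulary_["action"], tfidf_vec.vocabulary_["hero"]

# Raw counts treat both terms of the first document equally
print(counts[0, c_action], counts[0, c_hero])

# TF-IDF gives the rarer term "hero" the larger weight
print(tfidf[0, t_action], tfidf[0, t_hero])
```

Both matrices have the same shape, so either one can be passed to cosine_similarity; only the weights inside differ.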

c) Similarity Computation
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(count_matrix)

Explanation:
cosine_similarity computes the cosine of the angle between every pair of rows in the
count matrix, so cosine_sim[i][j] is the similarity between movie i and movie j.
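
For two vectors A and B, cosine similarity is the dot product of A and B divided by the product of their lengths. The short check below computes that value by hand for two illustrative vectors (chosen here only as an example) and compares it with scikit-learn's result.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two illustrative term-count vectors sharing one term out of three
a = np.array([1.0, 1.0, 0.0])
b = np.array([1.0, 0.0, 1.0])

# Manual formula: (A . B) / (|A| * |B|)
manual = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# scikit-learn expects 2-D inputs, one row per sample
library = float(cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0])

print(manual, library)  # both are 0.5 up to floating-point rounding
```

Because the measure depends on direction rather than length, a movie with many keywords and a movie with few keywords can still score as highly similar.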

d) Recommendation Function
def get_title_from_index(index):
    return df.iloc[index]["title"]

def get_index_from_title(title):
    # Match case-insensitively, ignoring surrounding whitespace in the input
    try:
        return df[df.title.str.lower() == title.strip().lower()].index.values[0]
    except IndexError:
        return None

def recommend_movies(movie_title):
    index = get_index_from_title(movie_title)
    if index is None:
        return f"No movie found with title '{movie_title}'"

    # Pair every movie index with its similarity to the input movie
    similar_movies = list(enumerate(cosine_sim[index]))
    # Sort by score, skip the first entry (the input movie itself), keep the top 5
    sorted_similar_movies = sorted(similar_movies, key=lambda x: x[1],
                                   reverse=True)[1:6]
    recommendations = [get_title_from_index(element[0]) for element in
                       sorted_similar_movies]
    return recommendations

Explanation:
This function takes the user input, finds the closest matches, and returns the top 5
recommendations based on similarity.
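
The slice [1:6] relies on the input movie being the first entry after sorting. Python's sort is stable, so if another movie earlier in the dataset has an identical feature string (similarity 1.0), that movie is sorted first and the input movie itself can slip into the results. A possible variant that excludes the input by index instead of by position is sketched below; the function name top_n_similar and the small matrix are hypothetical.

```python
import numpy as np

def top_n_similar(sim_matrix, query_index, n=5):
    """Return indices of the n rows most similar to query_index, excluding itself."""
    scores = np.asarray(sim_matrix[query_index], dtype=float)
    order = np.argsort(-scores, kind="stable")   # descending similarity
    order = order[order != query_index]          # drop the query explicitly
    return order[:n].tolist()

# Hypothetical similarity matrix in which movies 0 and 3 are exact duplicates
sim = np.array([
    [1.0, 0.2, 0.9, 1.0],
    [0.2, 1.0, 0.1, 0.2],
    [0.9, 0.1, 1.0, 0.9],
    [1.0, 0.2, 0.9, 1.0],
])

result = top_n_similar(sim, query_index=3, n=2)
print(result)  # [0, 2]
```

With the positional slice, a query for movie 3 on this matrix would drop movie 0 and return movie 3 itself as the first recommendation.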

e) User Input Integration


user_movie = input("Enter a movie title: ")
recommendations = recommend_movies(user_movie)

if isinstance(recommendations, list):
    print("Top 5 movie recommendations:")
    for i, movie in enumerate(recommendations, start=1):
        print(f"{i}. {movie}")
else:
    print(recommendations)

Explanation:
Takes user input and displays recommendations using the previously defined functions.

Output:
• Screenshots of results

Screenshots can be captured from the console after running the recommendation
function with inputs such as those listed under Handling User Inputs.

• Handling User Inputs

The system handles the following scenarios:

✅ Valid Input:

Enter a movie title: The Dark Knight


Output: List of top 5 similar movies.

❌ Invalid Input:

Enter a movie title: Unknown Hero


Output: No movie found with title 'Unknown Hero'

✅ Case Insensitive Input:

Enter a movie title: the DARK knight


Output: Top 5 similar movie recommendations (same as proper case).

✅ Trimmed Input:

Enter a movie title: "  The Dark Knight  " (typed with leading and trailing spaces)


Output: Top 5 recommendations shown correctly.
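
The four scenarios above come down to how the typed title is normalized before lookup. A minimal pure-Python sketch of that normalization follows; the helper find_title and the two-item catalog are hypothetical and independent of the DataFrame-based functions in the Implementation section.

```python
def find_title(user_input, titles):
    """Return the stored title matching user_input, or None if there is no match."""
    wanted = user_input.strip().casefold()   # ignore case and outer whitespace
    for title in titles:
        if title.strip().casefold() == wanted:
            return title
    return None

catalog = ["The Dark Knight", "Inception"]  # hypothetical titles

print(find_title("The Dark Knight", catalog))       # valid input
print(find_title("the DARK knight", catalog))       # case-insensitive input
print(find_title("  The Dark Knight  ", catalog))   # input with extra spaces
print(find_title("Unknown Hero", catalog))          # invalid input
```

The first three calls resolve to the same stored title, and the last returns None, which the caller can turn into the "No movie found" message.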

Applications:
• OTT Platforms (Netflix, Amazon Prime): Enhance recommendations based on
user preferences.
• E-commerce: Recommend products similar to previously viewed items.
• E-learning: Suggest courses based on course history.
• Social Media: Recommend friends, pages, or groups based on content interests.

Future Scope:
• Collaborative Filtering: Incorporate user ratings and preferences.
• Hybrid Models: Combine collaborative and content-based methods.
• Deep Learning: Use word embeddings and neural networks for semantic
similarity.
• Web Deployment: Use Flask or Django to turn the system into a web application.
• Real-Time Feedback Loop: Adapt recommendations based on continuous user
interaction.

Conclusion:

This project successfully demonstrates a content-based Movie Recommendation
System using Python and scikit-learn.
It recommends similar movies based on features like genres, cast, keywords, and
director.
TF-IDF vectorization and cosine similarity were used to measure the similarity between
movies.
The system is simple yet effective in helping users discover relevant movies.
It improved my understanding of data preprocessing and machine learning concepts.
Handling real-world data taught me the importance of data cleaning and feature
selection.
The project also enhanced my coding and problem-solving skills.
Though basic, the model provides a foundation for future improvements like hybrid
systems.
It can be extended into a full-fledged app or integrated into streaming platforms.
Overall, it was a great learning experience in Data Science and Big Data Analytics.

Reference:
• Kaggle Dataset
  Movie Metadata Dataset
  Source: https://www.kaggle.com/datasets
  Description: A publicly available dataset containing information about thousands
  of movies, including genres, cast, director, and keywords.

• Scikit-learn Documentation
  scikit-learn: Machine Learning in Python
  Official Documentation: https://scikit-learn.org/stable/documentation.html
  Description: Used for TF-IDF vectorization and calculating cosine similarity.

• Pandas Documentation
  pandas: Powerful data structures for data analysis and manipulation
  Official Website: https://pandas.pydata.org/
  Description: Used for data loading, cleaning, and manipulation.

• NumPy Documentation
  NumPy: The fundamental package for scientific computing with Python
  Official Site: https://numpy.org/
  Description: Used for numerical operations in the background during matrix
  computation.

• Python Official Documentation
  Python Language Reference
  Website: https://docs.python.org/3/
  Description: Referred to for core programming logic and syntax.

• Jupyter Notebook
  Tool used for writing, testing, and visualizing the code in an interactive
  environment.
