Project Report

The project report details an exploratory data analysis performed on the TMDB 5000 Movie Dataset to uncover insights into movie success factors such as popularity, ratings, and revenue. The analysis involved data cleaning, preprocessing, and answering 20 analytical questions using Python libraries like Pandas, Seaborn, and Matplotlib, resulting in visualizations and statistical summaries. Key findings include trends in movie releases, average revenues, and ratings over time, as well as genre-based insights and classifications of movies based on audience ratings.

CEC Course on “Applied Data Analytics: A Practical Approach”

Subject: Movie Rating Analysis For Beginners

Team Members:

1. Raksha Sahu : CEC_2025_037
2. Kritarth Karambelkar : CEC_2025_040
3. Krati Shrivastava : CEC_2025_034
4. Sakshi Goswami : CEC_2025_043
5. Sameer Sevlani : CEC_2025_070

1. Introduction

This project aims to perform a comprehensive exploratory data analysis (EDA) on the TMDB
5000 Movie Dataset, a rich collection of metadata about over 5,000 movies. The dataset
includes detailed information such as movie titles, release dates, budgets, revenues,
genres, popularity scores, user ratings, and details of the cast and crew.

The objective is to uncover insights, patterns, and trends that can help us understand the
factors contributing to a movie’s success, whether in terms of popularity, ratings, or revenue.

To carry out the analysis, we used Python with data science libraries: Pandas for data
manipulation, and Seaborn and Matplotlib for data visualization. The dataset was
cleaned and preprocessed to handle missing values and nested fields, followed by the
formulation of 20 analytical questions. Each question was answered with supporting graphs,
charts, and statistical summaries, offering a visual and quantitative understanding of the
underlying trends in the movie industry.

Through this analysis, we aim to not only develop technical proficiency in handling real-world
datasets but also gain valuable insights into the dynamics of film production, audience
preferences, and industry economics.

2. Dataset Overview

The analysis in this project is based on two primary datasets sourced from The Movie
Database (TMDB): the TMDB 5000 Movies Dataset and the TMDB 5000 Credits Dataset.
Together, they offer a comprehensive view of various aspects of films, ranging from basic
metadata to detailed cast and crew information.

● TMDB 5000 Movies Dataset

This dataset contains metadata for over 5,000 movies, including details such as movie
titles, genres, release dates, budgets, revenues, popularity scores, runtime, vote
averages, and vote counts. It provides the foundational information necessary for
analyzing the overall performance and characteristics of films.

● TMDB 5000 Credits Dataset

This complementary dataset includes in-depth information about the cast and crew of
each movie. It features structured data fields identifying the names and roles of actors,
directors, and other contributors involved in the filmmaking process.

To conduct the analysis, these two datasets were merged on the common column 'title'. This
merging step helped form a unified and enriched dataset referred to as analysis_df, which
was used throughout the analysis for deriving insights and visualizations.
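
The merge step can be sketched as follows. This is a minimal illustration on made-up rows, not the project's actual code: the column names id and movie_id follow the public TMDB 5000 files, and the sample values are hypothetical. It also shows one thing worth checking when merging on 'title': if two different films share a title, the merge produces extra rows, whereas merging on the numeric identifier does not.

```python
import pandas as pd

# Tiny stand-ins for the two TMDB tables (values are illustrative only).
movies = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Alpha", "Beta", "Beta"],   # two different films share a title
    "revenue": [100, 200, 300],
})
credits = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Alpha", "Beta", "Beta"],
    "cast": ["[]", "[]", "[]"],
})

# Merge on 'title', as described in the report.
by_title = movies.merge(credits, on="title")

# Alternative (an assumption, not the report's method): merge on the
# numeric identifier, which is unique per film.
by_id = movies.merge(
    credits.drop(columns="title"), left_on="id", right_on="movie_id"
)

print(len(by_title), len(by_id))
```

Comparing the row count before and after the merge is a quick way to confirm that no rows were duplicated.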

3. Data Cleaning and Preprocessing

Before conducting meaningful analysis, it was essential to clean and preprocess the raw data
to make it consistent, readable, and structured for exploration. Several steps were taken to
prepare the dataset for analysis:

● Converted release dates to datetime format

The release_date column, originally stored as a string, was converted to a standard
datetime format using pandas.to_datetime(). This transformation enabled easier
sorting, filtering, and analysis of movies based on their release timeline.

● Extracted release year for time-based analysis

After converting to datetime, a new column called release_year was created by
extracting the year from each movie's release date. This was crucial for generating
year-wise trends in ratings, revenue, popularity, and other attributes.

● Handled nested genre, cast, and crew data using ast.literal_eval

Several columns, such as genres, cast, and crew, were stored as stringified JSON
objects. These were converted to Python lists of dictionaries using
ast.literal_eval() for easier traversal and filtering. This step allowed for more
advanced operations like genre extraction and identification of directors or lead actors.

● Selected relevant columns for analysis

To streamline the analysis and avoid working with unnecessary data, a subset of
important columns was selected to form the main working DataFrame (analysis_df).
These columns included identifiers like movie_id, key metrics such as vote_average,
revenue, and budget, as well as categorical and date fields like genres, cast, crew,
and release_date.
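
The preprocessing steps above can be sketched on a two-row sample. The rows are invented for illustration, and the genre_names helper column is a hypothetical addition that is not mentioned in the report.

```python
import ast
import pandas as pd

# Two illustrative rows; the second has a missing date and no genres.
raw = pd.DataFrame({
    "title": ["Alpha", "Beta"],
    "release_date": ["2009-12-10", None],
    "genres": ['[{"id": 28, "name": "Action"}]', "[]"],
})

# Convert to datetime; missing or unparseable dates become NaT.
raw["release_date"] = pd.to_datetime(raw["release_date"], errors="coerce")

# Extract the year. With missing dates present, the year column holds NaN
# for those rows and is therefore stored as float (e.g. 2009.0).
raw["release_year"] = raw["release_date"].dt.year

# Parse the stringified list of dictionaries into Python objects.
raw["genres"] = raw["genres"].apply(ast.literal_eval)

# Hypothetical helper column: keep only the genre names.
raw["genre_names"] = raw["genres"].apply(lambda gs: [g["name"] for g in gs])

print(raw[["title", "release_year", "genre_names"]])
```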

4. Exploratory Data Analysis

RATINGS AND REVENUE OVER TIME

a. Number of movies released per year

● Columns used:
release_year

● Chart type:
Line Chart – showing the trend of movie releases over time

● Functions used (with libraries and purposes):

○ value_counts() → (Pandas) Counts how many movies were released in each
year.
○ sort_index() → (Pandas) Sorts the year-wise counts chronologically.
○ plt.figure(figsize=(14,5)) → (Matplotlib.pyplot) Sets the figure size for the plot.
○ sns.lineplot() → (Seaborn) Plots a line chart of movie release counts over the
years, with markers for data points.
○ plt.title(), plt.xlabel(), plt.ylabel(), plt.grid(), plt.show() → (Matplotlib.pyplot) Used
for adding title, axis labels, grid, and displaying the plot.

● Key Points:
a. Shows the number of movies produced every year.
b. Significant increase in movie production after the 2000s.
c. Indicates rapid expansion of the global film industry.
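
The counting step for this question can be sketched as below, using a handful of made-up years. Only the pandas part is shown; the plotting call is left out so the sketch runs without a display.

```python
import pandas as pd

# Illustrative release years (not taken from the dataset).
years = pd.Series([2009, 2012, 2009, 2015, 2012, 2009], name="release_year")

# Count films per year, then order by year rather than by count.
movies_per_year = years.value_counts().sort_index()

print(movies_per_year)
```

The resulting Series has years as its index and counts as its values, which is the shape a line plot of releases over time needs.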

b. In which year was the highest average revenue?

● Columns used:
release_year, revenue

● Chart type:
Bar Chart – visualizing average revenue per year

● Functions used (with libraries and purposes):

○ groupby('release_year')['revenue'].mean() → (Pandas) Groups data by year and
computes the average revenue of movies in each year.
○ sort_index(ascending=True) → (Pandas) Sorts the years in chronological order
for accurate timeline visualization.
○ plot(kind='bar', ...) → (Pandas/Matplotlib) Creates a bar chart to show how
average revenue varies by year.
○ plt.ylabel(), plt.xlabel(), plt.grid(), plt.show() → (Matplotlib.pyplot) Used for
adding labels, grid, and displaying the chart.
○ idxmax() → (Pandas) Identifies the year with the highest average revenue.

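● Key Points:
a. Displays the average revenue earned by movies each year.
b. Among well-represented years, the 2010s show higher average revenue, consistent with
blockbuster franchises.
c. Reflects growing global box office and international releases.
d. Year with the highest average revenue: 1939 (printed as 1939.0 because release_year is
stored as a float). This single-year peak likely reflects how few films from that year are
in the dataset, since one high-grossing title can dominate the mean.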
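
A minimal sketch of the per-year average and the idxmax() lookup, on invented numbers. The n_movies column is an addition for illustration: reporting the number of films next to the mean makes it easy to see when a year's average rests on very few titles.

```python
import pandas as pd

# Illustrative data: one early film with high revenue, several later films.
df = pd.DataFrame({
    "release_year": [1939, 2012, 2012, 2012, 2015, 2015],
    "revenue": [400, 600, 100, 50, 300, 200],
})

# Mean revenue and number of films per year, side by side.
yearly = (
    df.groupby("release_year")["revenue"]
      .agg(avg_revenue="mean", n_movies="count")
      .sort_index()
)

# Year (index label) with the highest average revenue.
top_year = yearly["avg_revenue"].idxmax()

print(yearly)
print("Year with highest average revenue:", top_year)
```

In this toy data the top year is the one represented by a single film, which is the kind of case the count column is meant to expose.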

c. Find average movie ratings per year.

● Columns used:
release_year, vote_average

● Chart type:
Line Chart – to visualize changes in average ratings over time

● Functions used (with libraries and purposes):

○ groupby('release_year')['vote_average'].mean() → (Pandas) Groups movies by
release year and calculates the average rating per year.
○ plt.figure(figsize=(14,5)) → (Matplotlib.pyplot) Sets the plot size.
○ sns.lineplot(...) → (Seaborn) Plots the average ratings year-wise with markers for
each data point.
○ plt.title(), plt.xlabel(), plt.ylabel(), plt.grid(), plt.show() → (Matplotlib.pyplot) Used
for adding titles, labels, grid lines, and rendering the chart.

● Key Points:
a. Measures how audience ratings have varied over time.
b. Ratings have remained fairly consistent, averaging between 6 and 7.
c. A few years show a dip, possibly due to an increase in mass-released lower-quality
content.

d. Display top 10 highest revenue movie titles.

● Columns used:
title, revenue

● Chart type:
Horizontal Bar Chart – to show revenue for top-grossing movies

● Functions used (with libraries and purposes):

○ sort_values(by='revenue', ascending=False) → (Pandas) Sorts movies in
descending order of revenue to identify the top earners.
○ head(10) → (Pandas) Selects the top 10 movies after sorting.
○ sns.barplot(...) → (Seaborn) Creates a horizontal bar chart with movie titles on
the y-axis and their revenues on the x-axis for better readability.
○ plt.figure(), plt.title(), plt.xlabel(), plt.ylabel(), plt.tight_layout(), plt.show() →
(Matplotlib.pyplot) Used for customizing and displaying the plot.
○ display() → (IPython.display) Displays the top 10 data in a readable table
format.

● Key Points:
a. Franchise blockbusters like Avatar and Avengers lead the list.
b. Most movies belong to action, adventure, or fantasy genres.
c. High revenue often aligns with global releases and large fanbases.
d. Reflects the dominance of big-budget productions in modern cinema.
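
The top-N selection can be sketched as follows on four made-up rows. The first form mirrors the functions listed above; the nlargest() form is an equivalent shortcut and is not claimed to be what the project used.

```python
import pandas as pd

# Illustrative titles and revenues.
df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "revenue": [50, 900, 300, 700],
})

# Approach listed in the report: sort descending, then take the first rows.
top_sorted = df.sort_values(by="revenue", ascending=False).head(3)

# Equivalent shortcut for a single numeric column.
top_nlargest = df.nlargest(3, "revenue")

print(top_sorted)
```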

e. Display Top 10 Highest Rated Movie Titles and Their Directors

● Columns used:
title, vote_average, crew

● Chart type:
None (Tabular display using display())

● Functions used (with libraries and purposes):

○ ast.literal_eval() → (ast library) Safely parses the crew column (stored as a
stringified list of dictionaries) into Python list objects.
○ get_director() → (Custom function) Iterates through the crew list to extract the
person whose job is "Director".
○ apply() → (Pandas) Applies the get_director function to each row in the crew
column to extract the director's name.
○ sort_values(by='vote_average', ascending=False) → (Pandas) Sorts movies by
their rating in descending order.
○ head(10) → (Pandas) Selects the top 10 highest rated movies.
○ display() → (IPython.display) Used to neatly display the result as a table.

● Key Points:
a. Showcases the 10 movies with the highest audience ratings on the platform.
b. These titles have received exceptional ratings, typically 9.0 and above.
c. Includes critically acclaimed films that may not necessarily be the highest-grossing.
d. Highlights that quality and popularity don't always align; some hidden gems score
higher than blockbusters.
e. Useful for discovering top-tier content based on viewer satisfaction alone.
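
One possible shape for the get_director() helper is sketched below. The report names the function but does not show its body, so this implementation, the sample crew entries, and the choice to return None when no director is listed are all assumptions.

```python
import ast
import pandas as pd

def get_director(crew):
    """Return the name of the first crew member whose job is 'Director'."""
    for member in crew:
        if member.get("job") == "Director":
            return member.get("name")
    return None  # no director listed for this film

# Illustrative rows; crew is stored as a stringified list of dictionaries.
df = pd.DataFrame({
    "title": ["Alpha", "Beta"],
    "vote_average": [7.1, 8.4],
    "crew": [
        '[{"job": "Editor", "name": "E. One"}, {"job": "Director", "name": "D. Two"}]',
        '[{"job": "Producer", "name": "P. Three"}]',
    ],
})

df["crew"] = df["crew"].apply(ast.literal_eval)
df["director"] = df["crew"].apply(get_director)

top_rated = df.sort_values(by="vote_average", ascending=False).head(10)
print(top_rated[["title", "vote_average", "director"]])
```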

f. Classify movies based on ratings (Excellent, Good, Average)

● Columns used:
vote_average

● Chart type:
Pie Chart – to show proportion of movies in each rating category

● Functions used (with libraries and purposes):

○ classify_rating() → (Custom function) Classifies movies as:

■ Excellent if rating ≥ 8
■ Good if 6 ≤ rating < 8
■ Average if rating < 6
○ apply() → (Pandas) Applies the custom classification function to the
vote_average column to create a new column rating_category.
○ value_counts() → (Pandas) Counts the number of movies in each rating
category.
○ plt.pie(...) → (Matplotlib.pyplot) Creates a pie chart showing the distribution of
movies across the three categories.
○ plt.title(), plt.tight_layout(), plt.show() → (Matplotlib.pyplot) For visual
enhancements and rendering.

● Key Points:
a. Movies were classified based on their average audience rating.
b. The majority of movies fall under the Average and Good categories.
c. Only a small fraction qualifies as Excellent, indicating strict audience expectations.
d. Helps in understanding the general quality distribution of films in the dataset.
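
The classification rule can be sketched directly from the thresholds listed above. The function body is a reconstruction from those thresholds, the sample ratings are made up, and the pd.cut() variant is an alternative added for comparison rather than something the report describes.

```python
import pandas as pd

def classify_rating(rating):
    """Map a vote_average to Excellent (>= 8), Good (>= 6), or Average."""
    if rating >= 8:
        return "Excellent"
    if rating >= 6:
        return "Good"
    return "Average"

ratings = pd.DataFrame({"vote_average": [8.3, 6.0, 5.9, 7.2, 4.0]})
ratings["rating_category"] = ratings["vote_average"].apply(classify_rating)
category_counts = ratings["rating_category"].value_counts()

# Alternative without a Python-level function: left-closed bins
# [-inf, 6), [6, 8), [8, inf) reproduce the same thresholds.
binned = pd.cut(
    ratings["vote_average"],
    bins=[float("-inf"), 6, 8, float("inf")],
    right=False,
    labels=["Average", "Good", "Excellent"],
)

print(category_counts)
```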

GENRE BASED INSIGHTS

a. List of all unique movie genres

● Columns used:
genres

● Chart type:
None (Displayed as a simple table of unique genre names)

● Functions used (with libraries and purposes):

○ ast.literal_eval() → (ast library) Converts the stringified list of dictionaries in the
genres column into actual Python list objects.
○ iterrows() → (Pandas) Iterates through each row to process the genres data.
○ Custom logic → Extracts genre names from the dictionaries and appends them to
a list.
○ set() → (Python built-in) Removes duplicates to obtain unique genres.
○ sorted() → (Python built-in) Sorts the genres alphabetically.
○ pd.DataFrame() → (Pandas) Converts the list of unique genres into a DataFrame
for better display and readability.
○ print() → Outputs the final DataFrame.
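
The steps listed above can be sketched as below. The three sample rows are invented, and the variable names all_genres, unique_genres, and genre_table are hypothetical.

```python
import ast
import pandas as pd

# Illustrative rows; the last film has no genres.
df = pd.DataFrame({"genres": [
    '[{"id": 28, "name": "Action"}, {"id": 18, "name": "Drama"}]',
    '[{"id": 18, "name": "Drama"}]',
    "[]",
]})

# Collect every genre name, including repeats.
all_genres = []
for _, row in df.iterrows():
    for genre in ast.literal_eval(row["genres"]):
        all_genres.append(genre["name"])

# Remove duplicates and sort alphabetically.
unique_genres = sorted(set(all_genres))
genre_table = pd.DataFrame({"genre": unique_genres})

print(genre_table)
```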

b. How many movies of each genre were made?

● Columns used:
genres, vote_average

● Chart type:
Horizontal Bar Chart – displaying the count of movies by genre

● Functions used (with libraries and purposes):

○ ast.literal_eval() → (ast library) Converts stringified lists in the genres column
into actual Python lists of dictionaries.
○ apply() → (Pandas) Applies parsing logic to each row in the genres column.
○ iterrows() + custom loop → Iterates through each row and genre to flatten the
nested structure into a long-form DataFrame.
○ pd.DataFrame() → (Pandas) Constructs a new flat DataFrame with each genre
and its corresponding movie rating.
○ groupby() + agg() → (Pandas) Groups data by genre to calculate:
■ movie_count → total number of movies per genre
■ avg_vote → average rating per genre
○ sort_values() → Sorts genres by the number of movies in descending order.
○ sns.barplot() → (Seaborn) Plots a horizontal bar chart showing how many
movies belong to each genre.

● Key Points:
a. Drama and Comedy are the most produced genres in the dataset.
b. Action and Thriller follow closely, reflecting audience demand for excitement.
c. Genres like Documentary and Western have fewer releases.
d. Highlights the diversity and dominance of certain genres in global cinema.
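
The flattening and aggregation described above can be sketched as follows. The sample movies are invented and the genres column is assumed to be already reduced to lists of names; movie_count and avg_vote follow the names used in the report, while the rest is illustrative.

```python
import pandas as pd

# Illustrative movies; genres already parsed down to lists of names.
movies = pd.DataFrame({
    "genres": [["Drama", "Action"], ["Drama"], ["Comedy"]],
    "vote_average": [7.0, 6.0, 5.0],
})

# Flatten to one row per (genre, rating) pair.
rows = []
for _, row in movies.iterrows():
    for genre in row["genres"]:
        rows.append({"genre": genre, "vote_average": row["vote_average"]})
flat_df = pd.DataFrame(rows)

# Count of movies and mean rating per genre, most frequent genre first.
genre_stats = (
    flat_df.groupby("genre")
           .agg(movie_count=("vote_average", "count"),
                avg_vote=("vote_average", "mean"))
           .sort_values("movie_count", ascending=False)
           .reset_index()
)

print(genre_stats)
```

Because a film with several genres contributes one row to each of them, the per-genre counts add up to more than the number of films.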

c. What is the average rating per genre?

● Columns used:
genres, vote_average

● Chart type:
Horizontal Bar Chart – showing average rating by genre

● Functions used (with libraries and purposes):

○ groupby() → (Pandas) Groups the flattened genre–rating DataFrame (flat_df) by
genre.
○ mean() → (Pandas) Calculates the average of vote_average for each genre.
○ sort_values() → Sorts genres by their average rating in descending order.
○ reset_index() → Converts grouped data into a flat DataFrame for plotting.
○ sns.barplot() → (Seaborn) Creates a horizontal bar plot to visualize genre-wise
average ratings.
○ plt.figure(), plt.title(), plt.xlabel(), plt.ylabel(), plt.tight_layout() → (Matplotlib)
Customize and render the plot.

● Key Points:
a. Shows the average audience rating for each genre.
b. Genres like Documentary and War tend to have higher ratings.
c. Popular genres like Action or Comedy may have lower averages due to volume.
d. Helps understand how critically well-received each genre is on average.

MISCELLANEOUS INSIGHTS

a. Display Top 10 Longest Movies.

● Columns used:
title, runtime

● Chart type:
Horizontal Bar Chart – showing movie titles vs. runtime

● Functions used (with libraries and purposes):

○ sort_values() → (Pandas) Sorts the movies in descending order by their runtime.

○ head(10) → (Pandas) Retrieves the top 10 movies with the longest runtimes.
○ sns.barplot() → (Seaborn) Plots the runtime of the longest movies with their
titles.
○ plt.figure(), plt.title(), plt.xlabel(), plt.ylabel(), plt.tight_layout() → (Matplotlib) For
configuring and displaying the chart.

● Key Points:
a. Highlights movies with the longest runtimes in the dataset.
b. Many are epics or historical dramas, often exceeding 3 hours.
c. Long runtimes may reflect complex storytelling or multi-part narratives.
d. Useful for understanding viewer tolerance and production trends over time.

b. What is the Average Film Duration by Genre?

● Columns used:
genres, runtime

● Chart type:
Horizontal Bar Chart – showing average runtime per genre

● Functions used (with libraries and purposes):

○ Loop through analysis_df → To extract each genre and corresponding movie
runtime.
○ pd.DataFrame() → (Pandas) Creates a flat DataFrame (genre_runtime_df) where
each row links a genre to a movie's runtime.
○ groupby() and mean() → (Pandas) Groups the data by genre and calculates the
average runtime for each.
○ sort_values() and reset_index() → Sorts and prepares the data for plotting.
○ sns.barplot() → (Seaborn) Visualizes average runtime per genre using a bar plot.
○ plt.figure(), plt.title(), plt.xlabel(), plt.ylabel(), plt.tight_layout() → (Matplotlib) To
format and display the plot.

● Key Points:
a. Shows the typical length of movies in each genre.
b. Historical, War, and Drama genres tend to have longer runtimes.
c. Genres like Animation and Comedy often have shorter durations.
d. Helps in understanding content density and pacing trends across genres.
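
The report builds genre_runtime_df with an explicit loop. The sketch below swaps in DataFrame.explode() as a loop-free alternative that yields the same one-row-per-genre layout; the sample rows are invented and the genres column is assumed to hold lists of names.

```python
import pandas as pd

# Illustrative rows; genres already parsed down to lists of names.
analysis_df = pd.DataFrame({
    "genres": [["War", "Drama"], ["Animation"], ["Drama"]],
    "runtime": [180.0, 90.0, 120.0],
})

# explode() gives one row per (movie, genre) pair without an explicit loop.
genre_runtime_df = (
    analysis_df.explode("genres").rename(columns={"genres": "genre"})
)

# Mean runtime per genre, longest first.
avg_runtime = (
    genre_runtime_df.groupby("genre")["runtime"]
                    .mean()
                    .sort_values(ascending=False)
                    .reset_index()
)

print(avg_runtime)
```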

CORRELATION

a. Does Rating Affect the Revenue?

● Columns used:
vote_average, revenue

● Chart type:
Scatter Plot – to observe correlation between rating and revenue

● Functions used (with libraries and purposes):

○ plt.figure() → (Matplotlib) Sets the figure size for better readability.

○ sns.scatterplot() → (Seaborn) Plots vote_average vs. revenue as points,
allowing visual inspection of potential correlation.
○ plt.title(), plt.xlabel(), plt.ylabel(), plt.tight_layout() → (Matplotlib) Adds
descriptive elements and layout adjustments to the plot.

● Key Points:
a. Analyzed the relationship between audience ratings and movie revenue.
b. Scatter plot shows no strong pattern: movies with high ratings don’t necessarily earn
the highest revenue.
c. A few low-rated movies earned big, likely due to strong marketing or franchise pull.
d. Indicates that commercial success is not always tied to critical acclaim.
e. Factors like popularity, fanbase, and release scale may play a bigger role in revenue.

HOW ARE MOVIE METRICS RELATED TO EACH OTHER?

This analysis explores the relationships among numerical variables in the TMDB 5000 Movies
dataset using correlation analysis. The goal is to identify how features such as budget,
revenue, popularity, vote count, vote average, and runtime relate to one another, with a
focus on what drives commercial and critical success.

Methodology & Key Code Functions:

● The dataset was imported and processed using pandas:
df = pd.read_csv('tmdb_5000_movies.csv')
● Identifier columns such as id were excluded because they carry no analytical meaning.
● Only numerical features were selected using:
numeric_df = df.select_dtypes(include=['float64', 'int64'])
● The correlation matrix was computed with:
corr_matrix = numeric_df.corr()
● Visualization was done using Seaborn’s heatmap:
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm', linewidths=0.5)

DataFrame.corr() computes the Pearson correlation by default, which measures linear
relationships between variables (ranging from –1 to +1); the heatmap displays these values.
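
The correlation steps can be sketched end to end on a small invented table. The explicit drop of the id column is an assumption about how the identifier was excluded, since the report states the exclusion but does not show the call. The heatmap call is omitted so the sketch runs without a display.

```python
import pandas as pd

# Illustrative data; revenue is exactly linear in budget (revenue = 2*budget + 5).
df = pd.DataFrame({
    "id": [11, 12, 13, 14],
    "budget": [10, 20, 30, 40],
    "revenue": [25, 45, 65, 85],
    "vote_average": [6.0, 7.5, 5.5, 7.0],
    "title": ["A", "B", "C", "D"],
})

# Drop the identifier, then keep only numeric columns.
numeric_df = df.drop(columns=["id"]).select_dtypes(include="number")

# Pearson correlation is the default method of DataFrame.corr().
corr_matrix = numeric_df.corr()

print(corr_matrix.round(2))
```

Since revenue here is an exact linear function of budget, their coefficient comes out as 1, which is a quick sanity check that the matrix is being read correctly.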

Key Findings:

● Budget vs. Revenue

A strong positive correlation (r = 0.73) shows that movies with higher budgets generally
generate higher box office revenue.

● Popularity vs. Vote Count

A strong positive correlation (r = 0.78) indicates that popular films receive more user votes,
reflecting higher audience interaction.

● Revenue Drivers
○ Vote Count (r = 0.78)
○ Popularity (r = 0.64)
Both are positively correlated with revenue, suggesting that widespread
engagement and visibility are associated with a movie’s commercial success.
Correlation alone does not show which factor causes the other.

● Vote Average (Ratings)

Weak correlations with:
○ Budget (0.09)
○ Revenue (0.20)
○ Popularity (0.27)
imply that critical success (ratings) is largely independent of financial
metrics.

● Runtime
Runtime shows a weak-to-moderate correlation only with vote average (r = 0.38),
indicating that longer movies may receive slightly better ratings, but have minimal
influence elsewhere.

Conclusion:

The correlation matrix demonstrates that financial success is most closely tied to budget,
popularity, and vote count, all of which reinforce one another. On the other hand, vote
average (a measure of quality or critical acclaim) is relatively independent of these metrics.
This highlights the distinction between movies that are popular and profitable versus those
that are critically appreciated.
From a technical standpoint, pandas and seaborn provide a concise and effective framework
for performing such analysis, with .corr() and sns.heatmap() being key functions in identifying
and visualizing these relationships.

5. Visualizations

Effective visual representation of data is crucial for uncovering insights and conveying patterns
clearly. In this project, a wide variety of charts and plots have been used to explore and
present the relationships between different movie attributes. The visualizations not only highlight
trends but also make the analysis more intuitive and engaging.

To ensure visual consistency and appeal, the charts were styled using powerful Python libraries
such as Seaborn and Matplotlib. The following customizations were applied:

● Color palette: Set2

A soft and visually pleasing color palette (Set2) was used throughout the visualizations.
This color scheme helps differentiate elements clearly without being too harsh or
saturated, making the plots aesthetically appealing and easy to interpret.

● Style: whitegrid
The whitegrid style was applied using sns.set_style(), providing a clean background with
subtle grid lines. This enhances readability and improves alignment when interpreting
values on axes.

● Layouts: tight_layout() for neatness

To prevent overlapping of axis labels, titles, and plot content, the plt.tight_layout()
function was used. This automatically adjusts spacing and padding to ensure that all
elements in the plot are well-organized and clearly visible.

Applied consistently across the bar plots, pie charts, line plots, and other figures, these
styling choices helped maintain uniformity and contributed to a professional and polished
presentation of the analysis results.
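
A minimal sketch of the styling choices described in this section, applied to one bar chart. The data, labels, and figure size are illustrative, and the Agg backend is selected only so the example runs without a display; in a notebook that line would not be needed.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Global styling: white background with grid lines, Set2 color palette.
sns.set_style("whitegrid")
sns.set_palette("Set2")

# Illustrative counts, not taken from the dataset.
genre_counts = pd.DataFrame({
    "genre": ["Drama", "Comedy", "Action"],
    "movie_count": [3, 2, 1],
})

fig, ax = plt.subplots(figsize=(8, 4))
sns.barplot(data=genre_counts, x="movie_count", y="genre", ax=ax)
ax.set_title("Movies per genre (illustrative data)")
ax.set_xlabel("Number of movies")
ax.set_ylabel("Genre")

# Adjust padding so labels and title do not overlap.
fig.tight_layout()
```

Setting the style and palette once at the top means every later figure picks them up without repeating the calls.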

6. Key Insights

Through detailed exploration and visual analysis of the TMDB movie dataset, several
noteworthy insights were uncovered. These findings help us understand general patterns in
filmmaking, audience preferences, and industry performance over the years:

● Drama, Comedy, and Action are among the most produced genres

Drama and Comedy lead the dataset in the number of films produced, with Action and
Thriller close behind, reflecting their enduring popularity and commercial viability across
different markets and time periods.

● High-budget films don’t always guarantee high revenue

Budget and revenue are strongly correlated overall (r = 0.73), and many big-budget films
achieved commercial success. Even so, there were numerous cases where high investment
did not result in proportionate returns, so budget alone isn't a fully reliable indicator of
financial performance.

● Some directors and genres consistently perform well

A handful of directors stood out by maintaining a high average rating across their films.
Similarly, certain genres like Animation and Adventure were often associated with
strong audience ratings, indicating a consistent level of quality and appeal.

● Popularity and runtime relate to different aspects of audience engagement

More popular movies received substantially more user votes (r = 0.78), while longer movies
received slightly higher ratings (r = 0.38). Runtime showed little relationship with the
other metrics.

● The early 2000s to 2010s saw a surge in revenue generation

An analysis of revenue trends showed that movies released during this period generally
performed better financially, possibly due to advancements in global distribution and
marketing strategies.

● Genres like Horror and Documentary, although fewer in number, show niche
audience engagement
While these genres are less frequently produced, they maintain a dedicated audience
base and often receive strong responses when executed well.

These insights provide a meaningful understanding of the creative and commercial dynamics
of the film industry, as captured through data.

7. Limitations

While the analysis provides valuable insights into various aspects of movies, there are several
limitations that should be acknowledged. These limitations may affect the accuracy,
completeness, and interpretability of the results:

● Revenue and budget data have some missing or zero values

A noticeable number of movies in the dataset have incomplete or zero entries for budget
and revenue. This hinders the ability to accurately assess profitability and financial
trends for all entries.

● Ratings can be biased due to low vote counts on some movies

Certain films, especially lesser-known or niche releases, have very few user ratings.
This can significantly skew their average rating, making them appear better or worse
than they actually are.

● Genre, cast, and crew are stored in nested structures requiring manual parsing
These columns are not in a directly usable format and had to be converted using
ast.literal_eval. Any inconsistency in these string representations could lead to
parsing errors or data loss.

● Time-based comparisons may be affected by data imbalance

Some years in the dataset have a higher number of movies than others, leading to a
possible overrepresentation of certain time periods when analyzing trends over time.

● Lack of external factors or audience demographics

The dataset doesn’t include viewer demographics, regional release information, or
critical reviews. These could have added more depth to the analysis, especially in
understanding popularity or rating trends.

● No differentiation between theatrical and digital releases

The dataset does not distinguish between the types of releases (cinema vs. streaming),
which could influence revenue figures and viewer engagement patterns.

Despite these limitations, the dataset still provides a rich foundation for exploratory data
analysis and meaningful insights into the movie industry.
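
Two of the limitations above, zero-valued financial fields and ratings based on very few votes, are often handled with simple filters. The sketch below is one possible approach and is not part of the reported analysis; the sample rows and the MIN_VOTES threshold are hypothetical.

```python
import numpy as np
import pandas as pd

# Illustrative rows: zeros stand in for unknown budget/revenue.
df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "budget": [0, 50, 80, 0],
    "revenue": [0, 120, 0, 30],
    "vote_average": [10.0, 7.8, 6.5, 8.1],
    "vote_count": [1, 900, 400, 2500],
})

# Treat zero budget/revenue as missing so they are skipped by mean().
df[["budget", "revenue"]] = df[["budget", "revenue"]].replace(0, np.nan)

# Rank by rating only among films with enough votes.
MIN_VOTES = 100  # hypothetical threshold, to be tuned for the real data
reliable = df[df["vote_count"] >= MIN_VOTES]
top_rated = reliable.sort_values("vote_average", ascending=False)

print(df["revenue"].mean())
print(top_rated[["title", "vote_average", "vote_count"]])
```

In this toy data the film rated 10.0 on a single vote drops out of the ranking, and the mean revenue is computed only over films with a recorded value.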

8. Conclusion

This analysis provided a comprehensive exploration of the TMDB 5000 Movies dataset,
revealing patterns and trends related to movie ratings, genres, revenues, and other key
attributes. By using effective data preprocessing and visualization techniques, we were able to
gain a better understanding of what elements contribute to a movie’s success—both in terms of
audience reception and financial performance.

The study demonstrated how data science tools such as Python, pandas, seaborn, and
matplotlib can be leveraged to extract meaningful insights from raw datasets. While the analysis
has certain limitations, it sets a strong foundation for future work, such as integrating additional
datasets (e.g., user reviews, award data, or critic scores) to form a more holistic view of the film
industry. Ultimately, this project not only showcases technical skills but also highlights the power
of data in shaping media analysis and decision-making.

9. Future Work

There are several directions in which this analysis can be further extended to gain deeper and
more actionable insights:

● Sentiment analysis on movie overviews or reviews

Applying Natural Language Processing (NLP) techniques to the movie overviews or
actual user reviews can help understand audience sentiment. This could provide a
qualitative dimension to the quantitative ratings and help predict movie success.

● Time-series forecasting of genre popularity

Analyzing how the popularity of specific genres has changed over the years and
forecasting future trends using time-series models could benefit content creators and
streaming platforms in planning future releases.

● Recommendation systems using collaborative filtering

By incorporating user behavior data such as watch history and ratings, collaborative
filtering methods can be used to build a personalized movie recommendation engine,
similar to those used by Netflix or Amazon Prime.

No ratings yet
Computational Physics Lab: Writing Up: Laboratory Class Attendance
3 pages
Hospital Management System
No ratings yet
Hospital Management System
20 pages
HTML Imp Tags
No ratings yet
HTML Imp Tags
4 pages
Civil & Structural Design Guide
No ratings yet
Civil & Structural Design Guide
2 pages
Shaheer Anwar CV
No ratings yet
Shaheer Anwar CV
2 pages
Booking Confirmation: Fairfield Inn NY
0% (1)
Booking Confirmation: Fairfield Inn NY
2 pages
Google Forms Presentation
No ratings yet
Google Forms Presentation
49 pages
Notes On Software Installation
No ratings yet
Notes On Software Installation
5 pages
Modulo18 RiscV DDCArv Ch8
No ratings yet
Modulo18 RiscV DDCArv Ch8
43 pages
Simple Show - BluffTitler
No ratings yet
Simple Show - BluffTitler
3 pages
Healthcare Consulting Pro Insights
No ratings yet
Healthcare Consulting Pro Insights
1 page
Instant Ebooks Textbook Pump User S Handbook Life Extension 1st Edition Heinz P. Bloch Download All Chapters
100% (8)
Instant Ebooks Textbook Pump User S Handbook Life Extension 1st Edition Heinz P. Bloch Download All Chapters
51 pages
Ec1008 Signals and Systems PDF
No ratings yet
Ec1008 Signals and Systems PDF
9 pages
? Refresher - Adam Yusuf
No ratings yet
? Refresher - Adam Yusuf
24 pages
Opt 4001
No ratings yet
Opt 4001
45 pages
Pranav Sood: Integrated Scheduler For LTE (4G) and NR (5G) Systems
No ratings yet
Pranav Sood: Integrated Scheduler For LTE (4G) and NR (5G) Systems
2 pages
Effectiveness of Modular Distance Learning To Grade 11 HUMSS Students
No ratings yet
Effectiveness of Modular Distance Learning To Grade 11 HUMSS Students
18 pages
Mod Menu Crash 2025 07 29-12 27 02
No ratings yet
Mod Menu Crash 2025 07 29-12 27 02
3 pages
DXPC Datasheet v6
No ratings yet
DXPC Datasheet v6
2 pages
Emp800 e
No ratings yet
Emp800 e
69 pages

Project Report

Uploaded by

Project Report

Uploaded by

Project Report

● TMDB 5000 Movies Dataset

● TMDB 5000 Credits Dataset

● Converted release dates to datetime format

● Extracted release year for time-based analysis

● Handled nested genre, cast, and crew data using ast.literal_eval

● Selected relevant columns for analysis
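The preprocessing steps listed above can be sketched roughly as follows. This is a minimal illustration rather than the report's actual code: the toy DataFrame stands in for the TMDB movies file, and the helper `parse_names` and the column name `genre_names` are hypothetical names introduced here.

```python
import ast
import pandas as pd

# Toy stand-in for the movies file (values are made up for illustration).
movies = pd.DataFrame({
    "title": ["Film A", "Film B"],
    "release_date": ["2009-12-10", "not a date"],
    "genres": ['[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]', "[]"],
    "revenue": [300, 0],
    "vote_average": [7.2, 5.0],
})

# Convert release dates; unparseable values become NaT instead of raising.
movies["release_date"] = pd.to_datetime(movies["release_date"], errors="coerce")

# Extract the release year for time-based analysis.
movies["release_year"] = movies["release_date"].dt.year

def parse_names(cell):
    """Hypothetical helper: stringified list of dicts -> list of 'name' values."""
    return [item["name"] for item in ast.literal_eval(cell)]

movies["genre_names"] = movies["genres"].apply(parse_names)

# Keep only the columns needed for the analysis.
analysis_df = movies[["title", "release_year", "genre_names", "revenue", "vote_average"]]
```

Using `errors="coerce"` is a design choice for this sketch: bad dates turn into missing values that can be dropped later, instead of stopping the whole run.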

RATINGS AND REVENUE OVER TIME

a. Number of movies released per year

● Functions used (with libraries and purposes):

b. In which year was the highest average revenue?

● Functions used (with libraries and purposes):

○ groupby('release_year')['revenue'].mean() → (Pandas) Groups data by year and
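As an illustration of the groupby call named above, the following sketch computes the mean revenue per year on made-up data and then picks the year with the highest mean. The use of `idxmax()` is an assumption of this sketch, not something the report states.

```python
import pandas as pd

# Made-up data; the real analysis would use the preprocessed TMDB DataFrame.
df = pd.DataFrame({
    "release_year": [2010, 2010, 2011, 2012],
    "revenue": [100, 300, 500, 50],
})

# Mean revenue for each release year.
avg_revenue = df.groupby("release_year")["revenue"].mean()

# Year whose mean revenue is largest (assumed approach for this sketch).
best_year = avg_revenue.idxmax()
```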

c. Find average movie ratings per year.

● Functions used (with libraries and purposes):

○ groupby('release_year')['vote_average'].mean() → (Pandas) Groups movies by

d. Display top 10 highest revenue movie titles.

● Functions used (with libraries and purposes):

○ sort_values(by='revenue', ascending=False) → (Pandas) Sorts movies in
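A small sketch of the sorting step, assuming the result is then trimmed with `head(10)` (the trimming call is not visible in the text above, so treat it as an assumption). The titles and revenues are placeholders.

```python
import pandas as pd

# Twelve placeholder movies with increasing revenue.
df = pd.DataFrame({
    "title": [f"Film {i}" for i in range(12)],
    "revenue": [i * 10 for i in range(12)],
})

# Sort in descending order of revenue, then keep the first ten rows.
top10 = df.sort_values(by="revenue", ascending=False).head(10)[["title", "revenue"]]
```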

e. Display Top 10 Highest Rated Movie Titles and Their Directors

● Functions used (with libraries and purposes):

○ ast.literal_eval() → (ast library) Safely parses the crew column (stored as a
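One possible way to pull the director out of a parsed crew entry is sketched below. The function name `get_director` is hypothetical, and the sketch assumes each crew record carries `job` and `name` keys.

```python
import ast

def get_director(crew_cell):
    """Hypothetical helper: return the first crew member whose job is 'Director'."""
    crew = ast.literal_eval(crew_cell)  # stringified list of dicts -> Python list
    for member in crew:
        if member.get("job") == "Director":
            return member.get("name")
    return None  # no director listed

# Placeholder crew string in the same shape as the stringified column.
sample = '[{"job": "Editor", "name": "Person A"}, {"job": "Director", "name": "Person B"}]'
director = get_director(sample)
```

Applied to a DataFrame, this would typically be used as `df["crew"].apply(get_director)`.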

f. Classify movies based on ratings (Excellent, Good, Average)

● Functions used (with libraries and purposes):

○ classify_rating() → (Custom function) Classifies movies as:
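The cut-off values used by `classify_rating()` are not shown in the text above, so the thresholds in this sketch (7.5 and 6.0) are purely assumed for illustration; only the three labels come from the report.

```python
import pandas as pd

def classify_rating(rating, excellent_from=7.5, good_from=6.0):
    """Label a rating; both thresholds are assumptions, not the report's values."""
    if rating >= excellent_from:
        return "Excellent"
    if rating >= good_from:
        return "Good"
    return "Average"

# Made-up ratings to show the function applied column-wise.
ratings = pd.Series([8.1, 6.4, 4.9])
labels = ratings.apply(classify_rating)
```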

GENRE BASED INSIGHTS

a. List of all unique movie genres

● Functions used (with libraries and purposes):

○ ast.literal_eval() → (ast library) Converts the stringified list of dictionaries in the
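A minimal sketch of collecting the unique genre names, assuming each genre record has a `name` key. Gathering them in a set is one option among several and may differ from what the report did.

```python
import ast
import pandas as pd

# Placeholder genres column in stringified form.
genres = pd.Series([
    '[{"id": 18, "name": "Drama"}]',
    '[{"id": 18, "name": "Drama"}, {"id": 35, "name": "Comedy"}]',
    "[]",
])

# Parse every cell and collect each genre name once.
unique_genres = sorted({g["name"] for cell in genres for g in ast.literal_eval(cell)})
```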

b. How many movies of each genre were made?

● Functions used (with libraries and purposes):

○ ast.literal_eval() → (ast library) Converts stringified lists in the genres column
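Counting movies per genre could look like the sketch below. It uses `explode()` and `value_counts()`, which is an assumed approach; the report may have counted in a different way.

```python
import ast
import pandas as pd

genres = pd.Series([
    '[{"id": 18, "name": "Drama"}]',
    '[{"id": 18, "name": "Drama"}, {"id": 35, "name": "Comedy"}]',
    "[]",
])

# One list of genre names per movie.
parsed = genres.apply(lambda cell: [g["name"] for g in ast.literal_eval(cell)])

# One row per (movie, genre) pair; movies with no genre become NaN,
# which value_counts() ignores by default.
genre_counts = parsed.explode().value_counts()
```

Note that a movie with several genres is counted once under each of them, so the counts add up to more than the number of movies.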

c. What is the average rating per genre?

● Functions used (with libraries and purposes):

○ groupby() → (Pandas) Groups the flattened genre–rating DataFrame (flat_df) by
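The sketch below shows one way a flattened genre and rating table like `flat_df` might be built and grouped. How the report actually constructed `flat_df` is not shown, so the `explode()` step and the column names `genre_names` and `genre` are assumptions.

```python
import pandas as pd

# Made-up movies with already-parsed genre lists.
df = pd.DataFrame({
    "genre_names": [["Drama"], ["Drama", "Comedy"], ["Comedy"]],
    "vote_average": [8.0, 6.0, 5.0],
})

# One row per (movie, genre) pair.
flat_df = df.explode("genre_names").rename(columns={"genre_names": "genre"})

# Mean rating per genre, highest first.
avg_rating = flat_df.groupby("genre")["vote_average"].mean().sort_values(ascending=False)
```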

a. Display Top 10 Longest Movies.

● Functions used (with libraries and purposes):

○ sort_values() → (Pandas) Sorts the movies in descending order by their runtime.

b. What is the Average Film Duration by Genre?

● Functions used (with libraries and purposes):

○ Loop through analysis_df → To extract each genre and corresponding movie
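A loop over `analysis_df` of the kind described above might be written as follows. The column names `genre_names` and `runtime` and the intermediate `runtime_df` are assumptions made for this sketch.

```python
import pandas as pd

# Made-up rows; the last movie has no genre and contributes nothing.
analysis_df = pd.DataFrame({
    "genre_names": [["Action"], ["Action", "Drama"], []],
    "runtime": [120.0, 100.0, 90.0],
})

# Collect one record per (genre, runtime) pair.
rows = []
for _, movie in analysis_df.iterrows():
    for genre in movie["genre_names"]:
        rows.append({"genre": genre, "runtime": movie["runtime"]})

runtime_df = pd.DataFrame(rows)

# Average runtime for each genre.
avg_runtime = runtime_df.groupby("genre")["runtime"].mean()
```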

a. Does Rating Affect the Revenue?

● Functions used (with libraries and purposes):

○ plt.figure() → (Matplotlib) Sets the figure size for better readability.
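To illustrate the rating versus revenue question, here is a hedged sketch of a scatter plot with an explicit figure size, plus a Pearson correlation as a numeric summary. The data, the figure size and the correlation step are all assumptions of this example; the `Agg` backend is selected only so the code also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import pandas as pd

# Made-up ratings and revenues.
df = pd.DataFrame({
    "vote_average": [5.0, 6.0, 7.0, 8.0],
    "revenue": [10, 40, 30, 90],
})

fig = plt.figure(figsize=(10, 6))  # larger canvas for readability
plt.scatter(df["vote_average"], df["revenue"], alpha=0.6)
plt.xlabel("Vote average")
plt.ylabel("Revenue")
plt.title("Rating vs. revenue")
plt.tight_layout()

# Pearson correlation between the two columns.
corr = df["vote_average"].corr(df["revenue"])
```

A correlation close to zero on the real data would suggest that rating alone explains little of the variation in revenue.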

HOW ARE MOVIE METRICS RELATED TO EACH OTHER

This analysis explores the relationships among numerical

Methodology & Key Code Functions:

● The dataset was imported and processed using

● Budget vs. Revenue

● Popularity vs. Vote Count

● Vote Average (Ratings)

● Color palette: Set2

● Layouts: tight_layout() for neatness
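The comparisons and plot settings listed above could be combined along the lines of this sketch. It assumes plain Matplotlib with the `Set2` colormap and a three-panel layout; the report's actual plotting calls are not shown, and the numbers are placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder numeric metrics.
df = pd.DataFrame({
    "budget": [10, 20, 30, 40],
    "revenue": [15, 50, 45, 120],
    "popularity": [1.0, 3.0, 2.0, 6.0],
    "vote_count": [100, 400, 250, 900],
    "vote_average": [5.5, 6.5, 6.0, 7.5],
})

# Pairwise Pearson correlations between all numeric columns.
corr_matrix = df.corr()

colors = plt.get_cmap("Set2").colors  # qualitative Set2 palette
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].scatter(df["budget"], df["revenue"], color=colors[0])
axes[0].set_title("Budget vs. Revenue")
axes[1].scatter(df["popularity"], df["vote_count"], color=colors[1])
axes[1].set_title("Popularity vs. Vote Count")
axes[2].hist(df["vote_average"], bins=4, color=colors[2])
axes[2].set_title("Vote Average (Ratings)")
fig.tight_layout()  # avoid overlapping titles and labels
```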

● High-budget films don’t always guarantee high revenue

● Some directors and genres consistently perform well

● Runtime and popularity tend to correlate with audience engagement

● The early 2000s to 2010s saw a surge in revenue generation

● Revenue and budget data have some missing or zero values

● Ratings can be biased due to low vote counts on some movies

● Time-based comparisons may be affected by data imbalance

● Lack of external factors or audience demographics

● No differentiation between theatrical and digital releases
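Two of the data-quality limitations above (zero financial values and ratings based on very few votes) are often handled by filtering before analysis. The sketch below shows one such filter; the threshold `MIN_VOTES = 50` is an arbitrary assumption, and the report does not say whether any filtering of this kind was applied.

```python
import pandas as pd

# Made-up rows covering each problem case.
df = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D"],
    "budget": [0, 50, 80, 20],
    "revenue": [100, 0, 200, 60],
    "vote_count": [500, 300, 900, 3],
})

MIN_VOTES = 50  # assumed cut-off for a rating to be considered reliable

# Keep rows with real financial figures and enough votes.
has_financials = (df["budget"] > 0) & (df["revenue"] > 0)
enough_votes = df["vote_count"] >= MIN_VOTES
clean = df[has_financials & enough_votes]
```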

● Sentiment analysis on movie overviews or reviews

● Time-series forecasting of genre popularity

● Recommendation systems using collaborative filtering
