Project Report
CEC Course on “Applied Data Analytics: A Practical Approach”
Subject: Movie Rating Analysis For Beginners
Team Members:
1. Raksha Sahu : CEC_2025_037
2. Kritarth Karambelkar : CEC_2025_040
3. Krati Shrivastava : CEC_2025_034
4. Sakshi Goswami : CEC_2025_043
5. Sameer Sevlani : CEC_2025_070
1. Introduction
This project aims to perform a comprehensive exploratory data analysis (EDA) on the TMDB
5000 Movie Dataset, a rich collection of metadata about over 5,000 movies. The dataset
includes detailed information such as movie titles, release dates, budgets, revenues,
genres, popularity scores, user ratings, and details of the cast and crew.
The objective is to uncover insights, patterns, and trends that can help us understand the
factors contributing to a movie’s success, whether in terms of popularity, ratings, or revenue.
To carry out the analysis, we used Python along with data science libraries: Pandas for data
manipulation, and Seaborn and Matplotlib for data visualization. The dataset was
cleaned and preprocessed to handle missing values and nested fields, followed by the
formulation of 20 analytical questions. Each question was answered with supporting graphs,
charts, and statistical summaries, offering a visual and quantitative understanding of the
underlying trends in the movie industry.
Through this analysis, we aim to not only develop technical proficiency in handling real-world
datasets but also gain valuable insights into the dynamics of film production, audience
preferences, and industry economics.
2. Dataset Overview
The analysis in this project is based on two primary datasets sourced from The Movie
Database (TMDB): the TMDB 5000 Movies Dataset and the TMDB 5000 Credits Dataset.
Together, they offer a comprehensive view of various aspects of films, ranging from basic
metadata to detailed cast and crew information.
● TMDB 5000 Movies Dataset
This dataset contains metadata for over 5,000 movies, including details such as movie
titles, genres, release dates, budgets, revenues, popularity scores, runtime, vote
averages, and vote counts. It provides the foundational information necessary for
analyzing the overall performance and characteristics of films.
● TMDB 5000 Credits Dataset
This complementary dataset includes in-depth information about the cast and crew of
each movie. It features structured data fields identifying the names and roles of actors,
directors, and other contributors involved in the filmmaking process.
To conduct the analysis, these two datasets were merged on the common column 'title'. This
merging step helped form a unified and enriched dataset referred to as analysis_df, which
was used throughout the analysis for deriving insights and visualizations.
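The merge step can be sketched as follows. This is a minimal illustration on tiny stand-in tables, not the project's actual code; the column values are invented, and the row-count check is an added suggestion rather than something the report describes.

```python
import pandas as pd

# Tiny stand-in tables; the real inputs would be the two TMDB CSV files.
movies = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Alpha", "Beta", "Gamma"],
    "revenue": [100, 200, 300],
})
credits = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Alpha", "Beta", "Gamma"],
    "cast": ["[]", "[]", "[]"],
    "crew": ["[]", "[]", "[]"],
})

# Merge on the shared 'title' column, as described above.
merged = movies.merge(credits, on="title", how="inner")

# Titles are not guaranteed to be unique, so a title-based merge can
# duplicate rows. Comparing row counts is a cheap sanity check.
print(len(movies), len(merged))
print(merged.columns.tolist())
```

If the merged table has more rows than the movies table, two films probably share a title, and merging on the numeric identifier may be a safer choice.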
3. Data Cleaning and Preprocessing
Before conducting meaningful analysis, it was essential to clean and preprocess the raw data
to make it consistent, readable, and structured for exploration. Several steps were taken to
prepare the dataset for analysis:
● Converted release dates to datetime format
The release_date column, originally stored as a string, was converted to a standard
datetime format using pandas.to_datetime(). This transformation enabled easier
sorting, filtering, and analysis of movies based on their release timeline.
● Extracted release year for time-based analysis
After converting to datetime, a new column called release_year was created by
extracting the year from each movie's release date. This was crucial for generating
year-wise trends in ratings, revenue, popularity, and other attributes.
● Handled nested genre, cast, and crew data using ast.literal_eval
Several columns, such as genres, cast, and crew were stored as stringified JSON
objects. These were converted to Python list/dictionary objects using
ast.literal_eval() for easier traversal and filtering. This step allowed for more
advanced operations like genre extraction and identification of directors or lead actors.
● Selected relevant columns for analysis
To streamline the analysis and avoid working with unnecessary data, a subset of
important columns was selected to form the main working DataFrame (analysis_df).
These columns included identifiers like movie_id, key metrics such as vote_average,
revenue, and budget, as well as categorical and date fields like genres, cast, crew,
and release_date.
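The preprocessing steps above could look roughly like this. The sample rows are invented, and errors="coerce" is an assumption added here so that missing or malformed dates become NaT instead of raising an error.

```python
import ast
import pandas as pd

# Invented sample rows that mimic the raw column formats.
raw = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["Alpha", "Beta"],
    "release_date": ["2009-12-10", None],
    "genres": ['[{"id": 28, "name": "Action"}]', "[]"],
    "vote_average": [7.2, 6.1],
})

# Convert strings to datetime; unparseable or missing values become NaT.
raw["release_date"] = pd.to_datetime(raw["release_date"], errors="coerce")

# Extract the year. With NaT present the result is a float column,
# which is why years can print as 2009.0.
raw["release_year"] = raw["release_date"].dt.year

# Parse the stringified list of dictionaries into real Python objects.
raw["genres"] = raw["genres"].apply(ast.literal_eval)

# Keep only the columns needed for analysis.
analysis_df = raw[
    ["movie_id", "title", "release_date", "release_year", "genres", "vote_average"]
]
print(analysis_df.dtypes)
```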
4. Exploratory Data Analysis
RATINGS AND REVENUE OVER TIME
a. Number of movies released per year
● Columns used:
release_year
● Chart type:
Line Chart – showing the trend of movie releases over time
● Functions used (with libraries and purposes):
○ value_counts() → (Pandas) Counts how many movies were released in each
year.
○ sort_index() → (Pandas) Sorts the year-wise counts chronologically.
○ plt.figure(figsize=(14,5)) → (Matplotlib.pyplot) Sets the figure size for the plot.
○ sns.lineplot() → (Seaborn) Plots a smooth line chart showing movie release
counts over the years with markers for data points.
○ plt.title(), plt.xlabel(), plt.ylabel(), plt.grid(), plt.show() → (Matplotlib.pyplot) Used
for adding title, axis labels, grid, and displaying the plot.
● Key Points:
a. Shows the number of movies produced every year.
b. Significant increase in movie production after the 2000s.
c. Indicates rapid expansion of the global film industry.
b. In which year was the highest average revenue?
● Columns used:
release_year, revenue
● Chart type:
Bar Chart – visualizing average revenue per year
● Functions used (with libraries and purposes):
○ groupby('release_year')['revenue'].mean() → (Pandas) Groups data by year and
computes the average revenue of movies in each year.
○ sort_index(ascending=True) → (Pandas) Sorts the years in chronological order
for accurate timeline visualization.
○ plot(kind='bar', ...) → (Pandas/Matplotlib) Creates a bar chart to show how
average revenue varies by year.
○ plt.ylabel(), plt.xlabel(), plt.grid(), plt.show() → (Matplotlib.pyplot) Used for
adding labels, grid, and displaying the chart.
○ idxmax() → (Pandas) Identifies the year with the highest average revenue.
● Key Points:
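a. Displays the average revenue earned by movies each year.
b. Later years (2010s) show generally higher average revenue, consistent with blockbuster
franchises.
c. Reflects growing global box office and international releases.
d. Year with the highest average revenue: 1939 (printed as 1939.0 because release_year is
stored as a float). This does not contradict the recent upward trend: early years are likely
represented by only a few titles, so one or two high-grossing films can dominate the average.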
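The yearly average and the effect of small sample sizes can be illustrated as below. The figures are invented, and min_titles is a hypothetical threshold that the report itself does not use.

```python
import pandas as pd

# Invented data: one old high-grossing film, several recent films.
df = pd.DataFrame({
    "release_year": [1939, 2012, 2012, 2012],
    "revenue": [400, 900, 100, 50],
})

# Average revenue and number of titles per year, in chronological order.
yearly = df.groupby("release_year")["revenue"].agg(["mean", "count"]).sort_index()

# Year with the highest average revenue across all years.
best_year = yearly["mean"].idxmax()
print(best_year)  # 1939

# Hypothetical safeguard: only consider years with enough titles.
min_titles = 3
reliable = yearly[yearly["count"] >= min_titles]
print(reliable["mean"].idxmax())  # 2012
```

Reporting the title count next to each yearly average makes it easier to judge whether a peak reflects a trend or a single film.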
c. Find average movie ratings per year.
● Columns used:
release_year, vote_average
● Chart type:
Line Chart – to visualize changes in average ratings over time
● Functions used (with libraries and purposes):
○ groupby('release_year')['vote_average'].mean() → (Pandas) Groups movies by
release year and calculates the average rating per year.
○ plt.figure(figsize=(14,5)) → (Matplotlib.pyplot) Sets the plot size.
○ sns.lineplot(...) → (Seaborn) Plots the average ratings year-wise with markers for
each data point.
○ plt.title(), plt.xlabel(), plt.ylabel(), plt.grid(), plt.show() → (Matplotlib.pyplot) Used
for adding titles, labels, grid lines, and rendering the chart.
● Key Points:
a. Measures how audience ratings have varied over time.
b. Ratings have remained fairly consistent, averaging between 6–7.
c. A few years show a dip, possibly due to an increase in mass-released lower-quality
content.
d. Display top 10 highest revenue movie titles.
● Columns used:
title, revenue
● Chart type:
Horizontal Bar Chart – to show revenue for top-grossing movies
● Functions used (with libraries and purposes):
○ sort_values(by='revenue', ascending=False) → (Pandas) Sorts movies in
descending order of revenue to identify the top earners.
○ head(10) → (Pandas) Selects the top 10 movies after sorting.
○ sns.barplot(...) → (Seaborn) Creates a horizontal bar chart with movie titles on
the y-axis and their revenues on the x-axis for better readability.
○ plt.figure(), plt.title(), plt.xlabel(), plt.ylabel(), plt.tight_layout(), plt.show() →
(Matplotlib.pyplot) Used for customizing and displaying the plot.
○ display() → (IPython.display) Displays the top 10 data in a readable table
format.
● Key Points:
a. Franchise blockbusters like Avatar and Avengers lead the list.
b. Most movies belong to action, adventure, or fantasy genres.
c. High revenue often aligns with global releases and large fanbases.
d. Reflects the dominance of big-budget productions in modern cinema.
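The sort-and-select step might be written as a small reusable helper. The name top_n and the sample data are hypothetical.

```python
import pandas as pd

# Invented sample data.
df = pd.DataFrame({
    "title": ["Alpha", "Beta", "Gamma", "Delta"],
    "revenue": [300, 1200, 50, 800],
})

def top_n(frame, column, n=10):
    """Return the n rows with the largest values in `column`."""
    return frame.sort_values(by=column, ascending=False).head(n)

top_revenue = top_n(df, "revenue", n=3)[["title", "revenue"]]
print(top_revenue)
```

The same helper would also cover other rankings, such as the longest runtimes, by passing a different column name.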
e. Display Top 10 Highest Rated Movie Titles and Their Directors
● Columns used:
title, vote_average, crew
● Chart type:
None (Tabular display using display())
● Functions used (with libraries and purposes):
○ ast.literal_eval() → (ast library) Safely parses the crew column (stored as a
stringified list of dictionaries) into Python list objects.
○ get_director() → (Custom function) Iterates through the crew list to extract the
person whose job is "Director".
○ apply() → (Pandas) Applies the get_director function to each row in the crew
column to extract the director's name.
○ sort_values(by='vote_average', ascending=False) → (Pandas) Sorts movies by
their rating in descending order.
○ head(10) → (Pandas) Selects the top 10 highest rated movies.
○ display() → (IPython.display) Used to neatly display the result as a table.
● Key Points:
a. Showcases the 10 movies with the highest audience ratings on the platform.
b. These titles have received exceptional ratings, typically 9.0 and above.
c. Includes critically acclaimed films that may not necessarily be the highest-grossing.
d. Highlights that quality and popularity don't always align; some hidden gems score
higher than blockbusters.
e. Useful for discovering top-tier content based on viewer satisfaction alone.
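One possible shape for the get_director() helper is sketched below. The crew entries are invented, and returning None when no director is listed is an assumption about how missing cases are handled.

```python
import ast
import pandas as pd

# Invented rows that mimic the stringified crew column.
df = pd.DataFrame({
    "title": ["Alpha", "Beta"],
    "vote_average": [8.1, 7.4],
    "crew": [
        '[{"job": "Editor", "name": "A. Cutter"}, {"job": "Director", "name": "D. One"}]',
        '[{"job": "Producer", "name": "P. Two"}]',
    ],
})

def get_director(crew_list):
    """Return the first crew member whose job is 'Director', else None."""
    for member in crew_list:
        if member.get("job") == "Director":
            return member.get("name")
    return None

# Parse the strings into lists of dictionaries, then extract the director.
df["crew"] = df["crew"].apply(ast.literal_eval)
df["director"] = df["crew"].apply(get_director)

top_rated = df.sort_values(by="vote_average", ascending=False).head(10)
print(top_rated[["title", "vote_average", "director"]])
```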
f. Classify movies based on ratings (Excellent, Good, Average)
● Columns used:
vote_average
● Chart type:
Pie Chart – to show proportion of movies in each rating category
● Functions used (with libraries and purposes):
○ classify_rating() → (Custom function) Classifies movies as:
■ Excellent if rating ≥ 8
■ Good if 6 ≤ rating < 8
■ Average if rating < 6
○ apply() → (Pandas) Applies the custom classification function to the
vote_average column to create a new column rating_category.
○ value_counts() → (Pandas) Counts the number of movies in each rating
category.
○ plt.pie(...) → (Matplotlib.pyplot) Creates a pie chart showing the distribution of
movies across the three categories.
○ plt.title(), plt.tight_layout(), plt.show() → (Matplotlib.pyplot) For visual
enhancements and rendering.
● Key Points:
a. Movies were classified based on their average audience rating.
b. The majority of movies fall under the Average and Good categories.
c. Only a small fraction qualifies as Excellent, indicating strict audience expectations.
d. Helps in understanding the general quality distribution of films in the dataset.
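The classification rule and pie chart could be sketched like this, using the thresholds listed above. The ratings are invented, and the autopct format is an assumed styling choice.

```python
import pandas as pd
import matplotlib.pyplot as plt

def classify_rating(rating):
    """Map a numeric rating to Excellent (>= 8), Good (6 to < 8), or Average (< 6)."""
    if rating >= 8:
        return "Excellent"
    if rating >= 6:
        return "Good"
    return "Average"

# Stand-in for analysis_df['vote_average'].
ratings = pd.Series([8.3, 7.1, 6.0, 5.9, 4.2], name="vote_average")

rating_category = ratings.apply(classify_rating)
category_counts = rating_category.value_counts()

# Pie chart of the share of movies in each category.
fig, ax = plt.subplots()
ax.pie(category_counts.values, labels=category_counts.index, autopct="%1.1f%%")
ax.set_title("Movies by Rating Category")
fig.tight_layout()
```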
GENRE BASED INSIGHTS
a. List of all unique movie genres
● Columns used:
genres
● Chart type:
None (Displayed as a simple table of unique genre names)
● Functions used (with libraries and purposes):
○ ast.literal_eval() → (ast library) Converts the stringified list of dictionaries in the
genres column into actual Python list objects.
○ iterrows() → (Pandas) Iterates through each row to process the genres data.
○ Custom logic → Extracts genre names from the dictionaries and appends them to
a list.
○ set() → (Python built-in) Removes duplicates to obtain unique genres.
○ sorted() → (Python built-in) Sorts the genres alphabetically.
○ pd.DataFrame() → (Pandas) Converts the list of unique genres into a DataFrame
for better display and readability.
○ print() → Outputs the final DataFrame.
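The steps listed above can be sketched as follows, with invented genre strings standing in for the real column.

```python
import ast
import pandas as pd

# Invented rows that mimic the stringified genres column.
df = pd.DataFrame({"genres": [
    '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
    '[{"id": 18, "name": "Drama"}]',
    '[{"id": 28, "name": "Action"}]',
]})
df["genres"] = df["genres"].apply(ast.literal_eval)

# Collect every genre name from every row.
all_genres = []
for _, row in df.iterrows():
    for genre in row["genres"]:
        all_genres.append(genre["name"])

# Remove duplicates and sort alphabetically.
unique_genres = sorted(set(all_genres))

genres_table = pd.DataFrame({"genre": unique_genres})
print(genres_table)
```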
b. How many movies of each genre were made?
● Columns used:
genres, vote_average
● Chart type:
Horizontal Bar Chart – displaying the count of movies by genre
● Functions used (with libraries and purposes):
○ ast.literal_eval() → (ast library) Converts stringified lists in the genres column
into actual Python lists of dictionaries.
○ apply() → (Pandas) Applies parsing logic to each row in the genres column.
○ iterrows() + custom loop → Iterates through each row and genre to flatten the
nested structure into a long-form DataFrame.
○ pd.DataFrame() → (Pandas) Constructs a new flat DataFrame with each genre
and its corresponding movie rating.
○ groupby() + agg() → (Pandas) Groups data by genre to calculate:
■ movie_count → total number of movies per genre
■ avg_vote → average rating per genre
○ sort_values() → Sorts genres by the number of movies in descending order.
○ sns.barplot() → (Seaborn) Plots a horizontal bar chart showing how many
movies belong to each genre.
● Key Points:
a. Drama and Comedy are the most produced genres in the dataset.
b. Action and Thriller follow closely, reflecting audience demand for excitement.
c. Genres like Documentary and Western have fewer releases.
d. Highlights the diversity and dominance of certain genres in global cinema.
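A minimal sketch of the flattening and aggregation, assuming the genres column has already been parsed into lists of dictionaries. The rows are invented; flat_df, movie_count, and avg_vote follow the names used above.

```python
import pandas as pd

# Invented rows with genres already parsed.
df = pd.DataFrame({
    "genres": [
        [{"name": "Drama"}, {"name": "Comedy"}],
        [{"name": "Drama"}],
        [{"name": "Action"}],
    ],
    "vote_average": [7.0, 6.0, 5.0],
})

# Flatten: one output row per (movie, genre) pair.
rows = []
for _, row in df.iterrows():
    for genre in row["genres"]:
        rows.append({"genre": genre["name"], "vote_average": row["vote_average"]})
flat_df = pd.DataFrame(rows)

# Count movies and average the rating within each genre.
genre_stats = (
    flat_df.groupby("genre")
    .agg(movie_count=("vote_average", "count"), avg_vote=("vote_average", "mean"))
    .sort_values("movie_count", ascending=False)
    .reset_index()
)
print(genre_stats)
```

Because a movie with several genres contributes one row per genre, the counts sum to more than the number of movies.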
c. What is the average rating per genre?
● Columns used:
genres, vote_average
● Chart type:
Horizontal Bar Chart – showing average rating by genre
● Functions used (with libraries and purposes):
○ groupby() → (Pandas) Groups the flattened genre–rating DataFrame (flat_df) by
genre.
○ mean() → (Pandas) Calculates the average of vote_average for each genre.
○ sort_values() → Sorts genres by their average rating in descending order.
○ reset_index() → Converts grouped data into a flat DataFrame for plotting.
○ sns.barplot() → (Seaborn) Creates a horizontal bar plot to visualize genre-wise
average ratings.
○ plt.figure(), plt.title(), plt.xlabel(), plt.ylabel(), plt.tight_layout() → (Matplotlib)
Customize and render the plot.
● Key Points:
a. Shows the average audience rating for each genre.
b. Genres like Documentary and War tend to have higher ratings.
c. Popular genres like Action or Comedy may have lower averages due to volume.
d. Helps understand how critically well-received each genre is on average.
MISCELLANEOUS INSIGHTS
a. Display Top 10 Longest Movies.
● Columns used:
title, runtime
● Chart type:
Horizontal Bar Chart – showing movie titles vs. runtime
● Functions used (with libraries and purposes):
○ sort_values() → (Pandas) Sorts the movies in descending order by their runtime.
○ head(10) → (Pandas) Retrieves the top 10 movies with the longest runtimes.
○ sns.barplot() → (Seaborn) Plots the runtime of the longest movies with their
titles.
○ plt.figure(), plt.title(), plt.xlabel(), plt.ylabel(), plt.tight_layout() → (Matplotlib) For
configuring and displaying the chart.
● Key Points:
a. Highlights movies with the longest runtimes in the dataset.
b. Many are epics or historical dramas, often exceeding 3 hours.
c. Long runtimes may reflect complex storytelling or multi-part narratives.
d. Useful for understanding viewer tolerance and production trends over time.
b. What is the Average Film Duration by Genre?
● Columns used:
genres, runtime
● Chart type:
Horizontal Bar Chart – showing average runtime per genre
● Functions used (with libraries and purposes):
○ Loop through analysis_df → To extract each genre and corresponding movie
runtime.
○ pd.DataFrame() → (Pandas) Creates a flat DataFrame (genre_runtime_df) where
each row links a genre to a movie's runtime.
○ groupby() and mean() → (Pandas) Groups the data by genre and calculates the
average runtime for each.
○ sort_values() and reset_index() → Sorts and prepares the data for plotting.
○ sns.barplot() → (Seaborn) Visualizes average runtime per genre using a bar plot.
○ plt.figure(), plt.title(), plt.xlabel(), plt.ylabel(), plt.tight_layout() → (Matplotlib) To
format and display the plot.
● Key Points:
a. Shows the typical length of movies in each genre.
b. Historical, War, and Drama genres tend to have longer runtimes.
c. Genres like Animation and Comedy often have shorter durations.
d. Helps in understanding content density and pacing trends across genres.
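The report builds genre_runtime_df with an explicit loop. As an alternative, the same table can be produced with DataFrame.explode(), sketched here on invented data; this is a swapped-in technique, not the approach the project used.

```python
import pandas as pd

# Invented rows with genres already parsed.
df = pd.DataFrame({
    "genres": [
        [{"name": "War"}, {"name": "Drama"}],
        [{"name": "Animation"}],
        [{"name": "Drama"}],
    ],
    "runtime": [180.0, 90.0, 120.0],
})

# Reduce each list of dictionaries to a list of genre names.
df["genre_names"] = df["genres"].apply(lambda gs: [g["name"] for g in gs])

# explode() repeats the row once per list element.
genre_runtime_df = (
    df.explode("genre_names")[["genre_names", "runtime"]]
    .rename(columns={"genre_names": "genre"})
)

avg_runtime = (
    genre_runtime_df.groupby("genre")["runtime"]
    .mean()
    .sort_values(ascending=False)
    .reset_index()
)
print(avg_runtime)
```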
CORRELATION
a. Does Rating Affect the Revenue?
● Columns used:
vote_average, revenue
● Chart type:
Scatter Plot – to observe correlation between rating and revenue
● Functions used (with libraries and purposes):
○ plt.figure() → (Matplotlib) Sets the figure size for better readability.
○ sns.scatterplot() → (Seaborn) Plots vote_average vs. revenue as points,
allowing visual inspection of potential correlation.
○ plt.title(), plt.xlabel(), plt.ylabel(), plt.tight_layout() → (Matplotlib) Adds
descriptive elements and layout adjustments to the plot.
● Key Points:
a. Analyzed the relationship between audience ratings and movie revenue.
b. Scatter plot shows no strong pattern; movies with high ratings don’t necessarily earn
the highest revenue.
c. A few low-rated movies earned big, likely due to strong marketing or franchise pull.
d. Indicates that commercial success is not always tied to critical acclaim.
e. Factors like popularity, fanbase, and release scale may play a bigger role in revenue.
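A visual impression from a scatter plot can be backed by a single number. The sketch below computes the Pearson correlation for one pair of columns with Series.corr(); the four data points are invented purely to show the call.

```python
import pandas as pd

# Invented sample; the real columns would come from analysis_df.
df = pd.DataFrame({
    "vote_average": [5.0, 6.0, 7.0, 8.0],
    "revenue": [100, 50, 400, 150],
})

# Pearson correlation between rating and revenue (the default method).
r = df["vote_average"].corr(df["revenue"])
print(round(r, 2))
```

A value near 0 would support the reading that ratings and revenue are only loosely related, while a value near 1 or -1 would indicate a strong linear relationship.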
HOW ARE MOVIE METRICS RELATED TO EACH OTHER?
This analysis explores the relationships among numerical variables in the TMDB 5000 Movies
dataset using correlation analysis. The goal is to identify how features such as budget, revenue,
popularity, vote count, vote average, and runtime relate to one another, with a focus on what
drives commercial and critical success.
Methodology & Key Code Functions:
● The dataset was imported and processed using pandas:
df = pd.read_csv('tmdb_5000_movies.csv')
● Identifier columns such as id were excluded, since correlations with an arbitrary identifier
are not meaningful.
● Only numerical features were selected using:
numeric_df = df.select_dtypes(include=['float64', 'int64'])
● The correlation matrix was computed with:
corr_matrix = numeric_df.corr()
● Visualization was done using Seaborn’s heatmap:
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm', linewidths=0.5)
The heatmap displays Pearson correlation coefficients, the default method of .corr(), which
measure linear relationships between variables on a scale from -1 to +1.
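Put together, the steps might look like the self-contained sketch below. It uses randomly generated stand-in data rather than the TMDB file, drops the id column explicitly, and selects numeric columns with include="number", which is an assumed alternative to listing float64 and int64.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in data: revenue is built to depend on budget.
rng = np.random.default_rng(0)
budget = rng.uniform(1, 100, size=50)
df = pd.DataFrame({
    "id": np.arange(50),
    "title": [f"Movie {i}" for i in range(50)],
    "budget": budget,
    "revenue": budget * 2.5 + rng.normal(0, 20, size=50),
    "runtime": rng.uniform(80, 180, size=50),
})

# Drop the identifier, then keep numeric columns only.
numeric_df = df.drop(columns=["id"]).select_dtypes(include="number")

# Pairwise Pearson correlations.
corr_matrix = numeric_df.corr()

# Annotated heatmap of the correlation matrix.
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm",
            linewidths=0.5, ax=ax)
ax.set_title("Correlation Matrix")
fig.tight_layout()
```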
Key Findings:
● Budget vs. Revenue
A strong positive correlation (r = 0.73) shows that movies with higher budgets generally
generate higher box office revenue.
● Popularity vs. Vote Count
A very high correlation (r = 0.78) indicates that popular films receive more user votes,
reflecting higher audience interaction.
● Revenue Drivers
○ Vote Count (r = 0.78)
○ Popularity (r = 0.64)
These are both highly correlated with revenue, suggesting that widespread
engagement and visibility are key to a movie’s commercial success.
● Vote Average (Ratings)
Weak correlations with:
○ Budget (0.09)
○ Revenue (0.20)
○ Popularity (0.27)
imply that critical success (ratings) is largely independent of financial
metrics.
● Runtime
Runtime shows a weak-to-moderate correlation only with vote average (r = 0.38),
indicating that longer movies may receive slightly better ratings, but have minimal
influence elsewhere.
Conclusion:
The correlation matrix demonstrates that financial success is most closely tied to budget,
popularity, and vote count, all of which reinforce one another. On the other hand, vote
average (a measure of quality or critical acclaim) is relatively independent of these metrics.
This highlights the distinction between movies that are popular and profitable versus those
that are critically appreciated.
From a technical standpoint, pandas and seaborn provide a concise and effective framework
for performing such analysis, with .corr() and sns.heatmap() being key functions in identifying
and visualizing these relationships.
5. Visualizations
Effective visual representation of data is crucial for uncovering insights and conveying patterns
clearly. In this project, a wide variety of charts and plots have been used to explore and
present the relationships between different movie attributes. The visualizations not only highlight
trends but also make the analysis more intuitive and engaging.
To ensure visual consistency and appeal, the charts were styled using powerful Python libraries
such as Seaborn and Matplotlib. The following customizations were applied:
● Color palette: Set2
A soft and visually pleasing color palette (Set2) was used throughout the visualizations.
This color scheme helps differentiate elements clearly without being too harsh or
saturated, making the plots aesthetically appealing and easy to interpret.
● Style: whitegrid
The whitegrid style was applied using sns.set_style(), providing a clean background with
subtle grid lines. This enhances readability and improves alignment when interpreting
values on axes.
● Layouts: tight_layout() for neatness
To prevent overlapping of axis labels, titles, and plot content, the plt.tight_layout()
function was used. This automatically adjusts spacing and padding to ensure that all
elements in the plot are well-organized and clearly visible.
Applied across the bar plots, pie charts, line plots, and other figures, these styling choices
helped maintain uniformity and contributed to a professional and polished
presentation of the analysis results.
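The styling setup can be sketched as below. Setting the palette globally with sns.set_palette() is an assumption; the report does not say whether the palette was set once or passed to each plot. The bar values are invented.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Apply the shared look once, before creating any figures.
sns.set_style("whitegrid")
sns.set_palette("Set2")

# Invented values for a small horizontal bar chart.
fig, ax = plt.subplots(figsize=(8, 4))
sns.barplot(x=[3, 5, 2], y=["Drama", "Comedy", "Action"], ax=ax)
ax.set_title("Example Styled Chart")
ax.set_xlabel("Number of Movies")

# Adjust spacing so labels and titles do not overlap.
fig.tight_layout()
```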
6. Key Insights
Through detailed exploration and visual analysis of the TMDB movie dataset, several
noteworthy insights were uncovered. These findings help us understand general patterns in
filmmaking, audience preferences, and industry performance over the years:
● Action and Drama are among the most produced genres
These two genres dominate the dataset in terms of the number of films produced,
reflecting their enduring popularity and commercial viability across different markets and
time periods.
● High-budget films don’t always guarantee high revenue
Budget and revenue are strongly correlated overall (r = 0.73), and many big-budget films
achieved commercial success. Even so, there were numerous cases where high investment
did not result in proportionate returns, so budget alone is not a reliable indicator of
financial performance.
● Some directors and genres consistently perform well
A handful of directors stood out by maintaining a high average rating across their films.
Similarly, certain genres like Animation and Adventure were often associated with
strong audience ratings, indicating a consistent level of quality and appeal.
● Runtime and popularity tend to correlate with audience engagement
Popularity is strongly correlated with vote count (r = 0.78), and runtime shows a
weak-to-moderate correlation with vote average (r = 0.38). More visible films attract more
user votes, and longer films tend to receive slightly higher ratings.
● The early 2000s to 2010s saw a surge in revenue generation
An analysis of revenue trends showed that movies released during this period generally
performed better financially, possibly due to advancements in global distribution and
marketing strategies.
● Genres like Horror and Documentary, although fewer in number, show niche
audience engagement
While these genres are less frequently produced, they maintain a dedicated audience
base and often receive strong responses when executed well.
These insights provide a meaningful understanding of the creative and commercial dynamics
of the film industry, as captured through data.
7. Limitations
While the analysis provides valuable insights into various aspects of movies, there are several
limitations that should be acknowledged. These limitations may affect the accuracy,
completeness, and interpretability of the results:
● Revenue and budget data have some missing or zero values
A noticeable number of movies in the dataset have incomplete or zero entries for budget
and revenue. This hinders the ability to accurately assess profitability and financial
trends for all entries.
● Ratings can be biased due to low vote counts on some movies
Certain films, especially lesser-known or niche releases, have very few user ratings.
This can significantly skew their average rating, making them appear better or worse
than they actually are.
● Genre, cast, and crew are stored in nested structures requiring manual parsing
These columns are not in a directly usable format and had to be converted using
ast.literal_eval. Any inconsistency in these string representations could lead to
parsing errors or data loss.
● Time-based comparisons may be affected by data imbalance
Some years in the dataset have a higher number of movies than others, leading to a
possible overrepresentation of certain time periods when analyzing trends over time.
● Lack of external factors or audience demographics
The dataset doesn’t include viewer demographics, regional release information, or
critical reviews. These could have added more depth to the analysis, especially in
understanding popularity or rating trends.
● No differentiation between theatrical and digital releases
The dataset does not distinguish between the types of releases (cinema vs. streaming),
which could influence revenue figures and viewer engagement patterns.
Despite these limitations, the dataset still provides a rich foundation for exploratory data
analysis and meaningful insights into the movie industry.
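Two of these limitations can be reduced with simple filters, sketched below on invented rows. MIN_VOTES is a hypothetical threshold, and the report does not state that such filtering was applied.

```python
import pandas as pd

# Invented rows showing zero financial values and a low vote count.
df = pd.DataFrame({
    "title": ["Alpha", "Beta", "Gamma", "Delta"],
    "budget": [0, 50, 80, 10],
    "revenue": [0, 120, 0, 40],
    "vote_average": [10.0, 7.2, 6.5, 8.0],
    "vote_count": [1, 900, 300, 12],
})

# Keep only rows with usable financial data for revenue or budget analysis.
financial_df = df[(df["budget"] > 0) & (df["revenue"] > 0)]

# Hypothetical minimum number of votes before a rating is trusted.
MIN_VOTES = 50
rated_df = df[df["vote_count"] >= MIN_VOTES]

print(financial_df["title"].tolist())
print(rated_df["title"].tolist())
```

With the vote filter in place, a film rated 10.0 by a single user no longer appears at the top of a highest-rated list.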
8. Conclusion
This analysis provided a comprehensive exploration of the TMDB 5000 Movies dataset,
revealing patterns and trends related to movie ratings, genres, revenues, and other key
attributes. By using effective data preprocessing and visualization techniques, we were able to
gain a better understanding of what elements contribute to a movie’s success, both in terms of
audience reception and financial performance.
The study demonstrated how data science tools such as Python, pandas, seaborn, and
matplotlib can be leveraged to extract meaningful insights from raw datasets. While the analysis
has certain limitations, it sets a strong foundation for future work, such as integrating additional
datasets (e.g., user reviews, award data, or critic scores) to form a more holistic view of the film
industry. Ultimately, this project not only showcases technical skills but also highlights the power
of data in shaping media analysis and decision-making.
9. Future Work
There are several directions in which this analysis can be further extended to gain deeper and
more actionable insights:
● Sentiment analysis on movie overviews or reviews
Applying Natural Language Processing (NLP) techniques to the movie overviews or
actual user reviews can help understand audience sentiment. This could provide a
qualitative dimension to the quantitative ratings and help predict movie success.
● Time-series forecasting of genre popularity
Analyzing how the popularity of specific genres has changed over the years and
forecasting future trends using time-series models could benefit content creators and
streaming platforms in planning future releases.
● Recommendation systems using collaborative filtering
By incorporating user behavior data such as watch history and ratings, collaborative
filtering methods can be used to build a personalized movie recommendation engine,
similar to those used by Netflix or Amazon Prime.