DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
DATA MINING (22CS63)
PROJECT REPORT
on
CAR PRICE PREDICTION
Submitted in partial fulfillment of the requirement for the award of Degree of
Bachelor of Engineering
in
Computer Science and Engineering
Submitted by:
HARSHITH KOLLURU 1NT22CS073
GAGAN S KUNKANAD 1NT22CS067
Under the Guidance of
Dr. Vijaya Shetty S
Professor, Dept. of CS&E, NMIT
Department of Computer Science and Engineering
(Accredited by NBA Tier-1)
2025-2026
Table of Contents
Abstract
1. Introduction
1.1 Motivation
1.2 Problem Domain
1.3 Aim and Objectives
2. Data Source and Data Quality
2.1 Dataset Used
2.2 Data Preprocessing
3. Methods & Models
3.1 Data Mining Questions
3.2 Data Mining Algorithms
3.3 Data Mining Models
4. Model Evaluation & Discussion
5. Conclusion & Future Direction
6. Reflection Portfolio
References
Appendices
a. Link to the dataset chosen
b. Python Codes Implemented
c. Setup to execute the code
Table of Figures
Fig 3.1 Stacked Model being used
Fig 4.1 Top 10 Features of Dataset identified
Fig 4.2 Distribution of Selling Price with Frequency
Fig 4.3 Predicted vs Actual Selling Price Plot
Fig 4.4 Residual Plot
NITTE MEENAKSHI INSTITUTE OF TECHNOLOGY
(AN AUTONOMOUS INSTITUTION, AFFILIATED TO VISVESVARAYA TECHNOLOGICAL UNIVERSITY, BELAGAVI, APPROVED BY AICTE & GOVT. OF KARNATAKA)
CERTIFICATE
This is to certify that the project entitled Car Price Prediction is an authentic work carried out by
HARSHITH KOLLURU (1NT22CS073) and GAGAN S KUNKANAD (1NT22CS067), bonafide students of
Nitte Meenakshi Institute of Technology, Bangalore, in partial fulfillment of the requirements for the
award of the degree of Bachelor of Engineering in COMPUTER SCIENCE AND ENGINEERING of
Visvesvaraya Technological University, Belagavi, during the academic year 2025-2026. It is
certified that all corrections and suggestions indicated during the internal assessment have been
incorporated in the report. This project has been approved as it satisfies the academic
requirements in respect of the project work presented for the said degree.
Internal Guide: Dr. Vijaya Shetty S, Professor, Dept. of CSE, NMIT, Bangalore
Signature of the HOD: Dr. S Meenakshi Sundaram, Professor & Head, Dept. of CSE, NMIT, Bangalore
Signature of the Principal: Dr. H. C. Nagaraj, Principal, NMIT, Bangalore
DECLARATION
We hereby declare that, for this learning activity project work:
(i) The project work is our original work
(ii) This Project work has not been submitted for the award of any degree or examination at any
other university/College/Institute.
(iii) This Project Work does not contain other persons’ data, pictures, graphs or other information,
unless specifically acknowledged as being sourced from other persons.
(iv) This Project Work does not contain other persons’ writing, unless specifically acknowledged
as being sourced from other researchers. Where other written sources have been quoted, then:
a) their words have been re-written but the general information attributed to them has been
referenced;
b) where their exact words have been used, their writing has been placed inside quotation
marks, and referenced.
(v) This Project Work does not contain text, graphics or tables copied and pasted from the
Internet, unless specifically acknowledged, and the source being detailed in the thesis and in
the References sections.
NAME USN Signature
HARSHITH KOLLURU 1NT22CS073
GAGAN S KUNKANAD 1NT22CS067
Date:
ACKNOWLEDGEMENT
The satisfaction and euphoria that accompany the successful completion of any task would be
incomplete without the mention of the people who made it possible, whose constant guidance
and encouragement crowned our effort with success. We express our sincere gratitude to our
Principal, Dr. H. C. Nagaraj, Nitte Meenakshi Institute of Technology, for providing the necessary facilities.
We wish to thank our HoD, Dr. S Meenakshi Sundaram for the excellent environment created
to further educational growth in our college. We also thank him for the invaluable guidance
provided which has helped in the creation of a better project.
We would also like to thank Dr. Vijaya Shetty S, Professor, Department of Computer Science &
Engineering, for her periodic inspection and timely evaluation of the project, and for her help in
bringing the project to its present form.
Thanks to our Departmental Project coordinators. We also thank all our friends, teaching and
non-teaching staff at NMIT, Bangalore, for all the direct and indirect help provided in the
completion of the project.
NAME USN Signature
HARSHITH KOLLURU 1NT22CS073
GAGAN S KUNKANAD 1NT22CS067
Date:
ABSTRACT
This report presents a comprehensive study on the application of machine learning techniques for
predicting the selling price of used cars based on historical sales data. The project encompasses
the development of a robust data pipeline, including data collection, preprocessing, exploratory
data analysis, feature engineering, model selection, and evaluation. Key preprocessing steps
involved handling missing values, encoding categorical variables, and normalizing numerical
features to ensure data integrity and model effectiveness.
A variety of regression algorithms were explored, with particular emphasis on ensemble learning
methods. The final predictive model utilizes a stacking regressor that integrates both linear and
non-linear base models, thereby leveraging their complementary strengths. Extensive
hyperparameter tuning and cross-validation were performed to optimize model performance and
mitigate overfitting.
The proposed solution achieved an R² score of 0.97 on unseen test data, indicating a high level of
predictive accuracy. The results demonstrate that machine learning-driven approaches can
significantly enhance the transparency, efficiency, and reliability of used car price estimation.
The project concludes with the deployment of a user-oriented application, underscoring the
practical value and real-world applicability of the developed system in the automotive market.
1. Introduction
1.1 Motivation
In today’s digital economy, transparency and data-driven decision-making are more important
than ever, especially in industries like automotive resale where pricing can be highly subjective.
The process of buying or selling a used car is often fraught with uncertainty, as both buyers and
sellers struggle to determine a fair and accurate price. Traditional valuation methods frequently
rely on personal judgment or limited market data, leading to inconsistencies and potential
mistrust between parties.
Recognizing these challenges, we were motivated to explore how data science and machine
learning could introduce greater consistency and fairness into the used car market. With the
increasing availability of comprehensive historical sales data, there is a significant opportunity to
apply advanced analytics to predict car prices more accurately. Our goal was to develop a
predictive model that leverages these data resources to provide reliable price estimates, thereby
streamlining transactions and improving confidence for all stakeholders.
By addressing this real-world problem, our project aims to demonstrate the transformative
potential of machine learning in creating transparent, efficient, and equitable solutions within the
automotive resale industry.
1.2 Problem Domain
Used car pricing is shaped by a wide range of factors, including the vehicle’s age, fuel type,
mileage, brand reputation, and ownership history. Additional elements such as service records,
accident history, and prevailing market trends can further complicate the valuation process. In
the absence of standardized pricing practices, these variables are often assessed inconsistently,
which can lead to unfair advantages or disadvantages for both buyers and sellers. Such
inconsistencies not only create confusion but also undermine trust and transparency in the used
car market.
This project seeks to address these challenges by introducing a data-driven, standardized
approach to used car pricing using machine learning techniques. By analyzing historical sales
data and identifying key patterns among the various influencing factors, our goal is to develop a
predictive model that delivers objective and accurate price estimates. Through this approach, we
aim to foster greater transparency and fairness, ultimately contributing to a more trustworthy and
efficient secondary automobile market.
1.3 Aim and Objectives
To analyze a real-world car dataset and extract meaningful insights.
To process and transform data into a form suitable for modeling.
To apply and compare different regression algorithms.
To build an ensemble model that combines the strengths of multiple base models.
To visualize results for effective communication and understanding.
To understand the full life-cycle of a machine learning pipeline.
2. Data Source and Data Quality
2.1 Dataset Used
The dataset used in this study was obtained from Kaggle, titled "Vehicle dataset from CarDekho".
It comprises 301 records with 9 attributes. These include both numerical features (e.g.,
Present_Price, Kms_Driven) and categorical features (e.g., Fuel_Type, Transmission).
2.2 Data Preprocessing
Preprocessing played a vital role in the quality and success of the model:
Feature Engineering: a Car_Age feature was derived from the year of manufacture, and Kms_Driven was log-transformed to reduce skew.
Categorical Encoding: categorical attributes such as Fuel_Type and Transmission were converted into numeric form.
Feature Scaling & Polynomial Features: numerical features were scaled, and polynomial feature expansion was applied to capture non-linear relationships.
These steps significantly improved model performance and made the data more suitable for
learning.
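The steps above can be sketched as follows. The column names follow the CarDekho dataset described in Section 2.1, but the small DataFrame here is an illustrative stand-in for the real data, not the project's actual preprocessing script:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Illustrative sample standing in for the CarDekho data.
df = pd.DataFrame({
    "Year": [2014, 2017, 2011],
    "Present_Price": [5.59, 9.85, 4.15],
    "Kms_Driven": [27000, 6900, 52000],
    "Fuel_Type": ["Petrol", "Diesel", "Petrol"],
    "Transmission": ["Manual", "Manual", "Automatic"],
})

# Feature engineering: derive Car_Age and log-transform Kms_Driven.
df["Car_Age"] = 2025 - df["Year"]
df["Kms_Driven_log"] = np.log1p(df["Kms_Driven"])
df = df.drop(columns=["Year", "Kms_Driven"])

# Categorical encoding: one-hot encode Fuel_Type and Transmission.
df = pd.get_dummies(df, columns=["Fuel_Type", "Transmission"], drop_first=True)

# Feature scaling, then degree-2 polynomial expansion.
X_scaled = StandardScaler().fit_transform(df)
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X_scaled)

print(X_poly.shape)  # 5 base columns expand to 20 polynomial features
```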
3. Methods & Models
3.1 Data Mining Questions
What car features most influence the selling price?
Can polynomial feature transformation improve model performance?
Which regression model gives the best generalization on unseen data?
3.2 Data Mining Algorithms
We evaluated multiple regression algorithms:
Linear Regression (for baseline performance)
Lasso Regression (to penalize less useful features)
Decision Tree Regressor (captures non-linearity)
Gradient Boosting Regressor (robust ensemble method)
Stacking Regressor (combines multiple models for superior performance)
3.3 Data Mining Models
This code implements a stacking ensemble regressor using four different regression models as
base learners and Linear Regression as the meta-model. Here’s a breakdown of the models and
key parameters used:
Base Models:
Linear Regression: A standard regression model that fits a linear relationship between
features and the target variable.
Lasso Regression (alpha=0.1): A linear model with L1 regularization, which helps in feature
selection by penalizing the absolute values of coefficients. The parameter alpha=0.1 controls
the strength of the regularization, with higher values leading to more regularization.
Decision Tree Regressor (max_depth=5): A tree-based model that splits data into branches
to predict continuous outcomes. The parameter max_depth=5 limits the depth of the tree to
prevent overfitting by restricting how many times the tree can split.
Gradient Boosting Regressor (n_estimators=150, learning_rate=0.1, max_depth=3): An
ensemble model that builds trees sequentially to correct errors of previous
trees. n_estimators=150 sets the number of boosting stages, learning_rate=0.1 controls the
contribution of each tree, and max_depth=3 restricts the depth of individual trees to prevent
overfitting.
Meta-Model:
Linear Regression: Used as the final estimator to combine the predictions of the base models,
learning how to best weight their outputs for improved accuracy.
Stacking Regressor:
Combines the predictions of all base models using the meta-model for a more robust and
accurate prediction. The model is trained on the training data with stacking_model.fit(X_train,
y_train).
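A minimal sketch of this stacking setup, using exactly the base-model parameters stated above. The data here is synthetic for self-containment; the real project trains on the preprocessed CarDekho features:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the car features.
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Four base learners with the parameters described in the text.
base_models = [
    ("lr", LinearRegression()),
    ("lasso", Lasso(alpha=0.1)),
    ("dt", DecisionTreeRegressor(max_depth=5)),
    ("gbr", GradientBoostingRegressor(n_estimators=150, learning_rate=0.1, max_depth=3)),
]

# Linear Regression as the meta-model that weights the base predictions.
stacking_model = StackingRegressor(estimators=base_models,
                                   final_estimator=LinearRegression())
stacking_model.fit(X_train, y_train)
print(round(stacking_model.score(X_test, y_test), 3))
```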
Fig 3.1: Stacked Model being used
4. Model Evaluation & Discussion
Feature Importance Visualization
To gain insights into which features most significantly influenced the predictions of the
Gradient Boosting Regressor, we conducted a feature importance analysis after applying
polynomial feature expansion. The following steps outline the process:
Extraction of Feature Importance:
The attribute feature_importances_ of the trained Gradient Boosting Regressor (gbr_model)
was used to obtain the relative importance of each input feature. These importance scores
reflect the contribution of each feature to the model’s predictive performance.
Retrieval of Feature Names:
After polynomial transformation, the feature space includes both original and newly
generated polynomial features. The method poly.get_feature_names_out(X.columns) was
employed to retrieve the names of all features present in the transformed dataset.
Construction of the Importance DataFrame:
A pandas DataFrame was created to pair each feature name with its corresponding
importance score, facilitating easier analysis and visualization.
Ranking and Selection of Top Features:
The DataFrame was sorted in descending order based on the importance scores. The top 10
most influential features were then selected to highlight those with the greatest impact on the
model’s predictions.
This analysis not only identifies the most critical factors (including polynomial feature
combinations) affecting the model’s output but also enhances the interpretability and
transparency of the predictive process. The results can guide further feature engineering and
inform stakeholders about the key drivers of used car prices in the dataset.
Fig 4.1: Top 10 Features of Dataset identified
Target Distribution Visualization
Fig 4.2: Distribution of Selling Price with Frequency
A histogram of the Selling_Price column was plotted using Seaborn’s histplot function with 30
bins and an orange color scheme. The kde=True parameter adds a smooth density curve to the
plot. This visualization helps reveal the distribution, central tendency, and spread of selling
prices in the dataset, providing useful insights for further analysis and modeling.
Training Set Performance
The model demonstrates excellent training performance, achieving a very high R² score of 0.98,
which indicates it explains 98% of the variance in selling prices. The low MAE (0.46) and
RMSE (0.67) values further suggest highly accurate predictions with minimal average error on
the training data.
Test Set Performance
The model maintains strong performance on the test data, with an R² score of 0.97, indicating it
captures 97% of the variance in selling prices. The low MAE (0.43) and RMSE (0.71) values
confirm that the model provides accurate and reliable predictions on unseen data, demonstrating
good generalization.
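The metrics quoted above (MAE, RMSE, R²) are computed with scikit-learn as sketched below. The arrays here are illustrative placeholders, not the project's actual predictions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Placeholder actual and predicted selling prices (in lakhs).
y_test = np.array([3.35, 7.25, 2.85, 4.60, 9.10])
y_pred = np.array([3.10, 7.50, 2.95, 4.40, 8.80])

mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # root of MSE
r2 = r2_score(y_test, y_pred)

print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.2f}")
```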
Regression Fit Visualization
Fig 4.3: Predicted vs Actual Selling Price Plot
Insights from Predicted vs Actual Selling Price Plot
The scatter plot compares predicted selling prices against actual selling prices for the test data.
Most points closely follow the red diagonal line, indicating strong agreement between predicted
and actual values. The tight clustering around the line and the narrow confidence band suggest
high predictive accuracy and minimal bias. Overall, the model demonstrates excellent
performance in estimating used car prices, with only minor deviations for a few higher-priced
cars.
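A plot of this kind can be sketched as follows, with synthetic values standing in for the model's test-set predictions; the red diagonal marks perfect agreement:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for script use
import matplotlib.pyplot as plt
import numpy as np

# Synthetic actual prices with small, unbiased prediction errors.
rng = np.random.default_rng(1)
y_test = rng.uniform(0.5, 10, 60)
y_pred = y_test + rng.normal(0, 0.4, 60)

fig, ax = plt.subplots()
ax.scatter(y_test, y_pred, alpha=0.7)
ax.plot([0, 11], [0, 11], color="red")  # perfect-prediction diagonal
ax.set_xlabel("Actual Selling Price")
ax.set_ylabel("Predicted Selling Price")
fig.savefig("pred_vs_actual.png")
```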
Residual Analysis
Fig 4.4: Residual Plot
Insights from Residual Plot
The residual plot displays the differences between actual and predicted selling prices against the
predicted values. Most residuals are scattered closely around the zero line, indicating that the
model’s predictions are generally unbiased and errors are randomly distributed. There are a few
outliers, but no clear pattern or systematic deviation is observed, suggesting that the model
captures the underlying relationships well and does not suffer from major issues like
heteroscedasticity or non-linearity. The results demonstrate minimal overfitting. The residuals
being centered around zero confirms a low prediction bias. These findings validate the
robustness of our pipeline.
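A residual plot of this kind can be sketched as follows; the residuals here are synthetic, drawn to be centered at zero like the well-behaved errors described above:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for script use
import matplotlib.pyplot as plt
import numpy as np

# Synthetic predicted prices and zero-centered residuals (actual - predicted).
rng = np.random.default_rng(2)
y_pred = rng.uniform(0.5, 10, 60)
residuals = rng.normal(0, 0.4, 60)

fig, ax = plt.subplots()
ax.scatter(y_pred, residuals, alpha=0.7)
ax.axhline(0, color="red", linestyle="--")  # zero-error reference line
ax.set_xlabel("Predicted Selling Price")
ax.set_ylabel("Residual (Actual - Predicted)")
fig.savefig("residual_plot.png")
```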
5. Conclusion & Future Direction
Conclusion
Working on this project has been both a technically enriching and intellectually fulfilling
experience. Our primary objective was to develop a robust machine learning model that could
accurately predict the selling prices of used cars based on multiple features. We successfully
built a sophisticated stacking ensemble model that integrates linear regression, lasso regression,
decision trees, and gradient boosting—culminating in a strong, generalizable predictor.
Through systematic preprocessing, feature engineering (like the creation of Car_Age and the
log transformation of Kms_Driven), and the application of polynomial features to capture non-
linearity, we were able to transform raw tabular data into an optimized format for learning. The
stacking model's performance—R² of 0.9664 on test data—demonstrates high accuracy and low
generalization error.
Key Learnings:
Value of Ensemble Methods:
We learned firsthand that combining multiple models through ensemble techniques, like
stacking, can significantly boost performance compared to relying on a single algorithm. This
approach allowed us to leverage the unique strengths of different models and achieve more
reliable results.
Importance of Feature Engineering:
One of our biggest takeaways was the critical role of feature engineering. Creating new features,
selecting the most relevant variables, and properly scaling the data had a direct and noticeable
impact on the model’s accuracy. This process taught us how thoughtful data preparation can
make or break a machine learning project.
Power of Visualization:
Visualizing the data and model results helped us understand not just the numbers, but also the
story behind them. Tools like scatter plots and residual plots were essential for diagnosing
issues, interpreting results, and communicating our findings clearly.
Collaboration and Problem-Solving:
Throughout the project, we worked closely as a team, sharing ideas and troubleshooting
challenges together. This collaborative environment helped us develop our communication
skills and learn from each other’s perspectives.
Future Work
This project has inspired us to think about how we can take our work further:
Web Deployment:
We are excited about the prospect of deploying our model as a web application, making it
accessible to anyone who wants a data-driven estimate for their used car.
Expanding the Feature Set:
In the future, we hope to include more detailed features, such as specific car models, brands,
and geographic locations, to make our predictions even more accurate and relevant.
Model Interpretability:
We also recognize the importance of making our model’s decisions understandable. Exploring
interpretability tools like SHAP or LIME will help us explain our predictions and build trust
with users.
6. Reflection Portfolio
This project provided a comprehensive, hands-on opportunity to bridge academic concepts with
real-world machine learning applications. Beyond technical outcomes, it fostered critical
professional competencies essential for aspiring data scientists. Below, we summarize our key
insights and growth areas:
1. Data Understanding and Preparation
Working with raw, unstructured data underscored the importance of meticulous data exploration
and cleaning. We developed strategies to address missing values, outliers, and inconsistencies:
skills crucial for transforming imperfect real-world datasets into reliable modeling inputs.
2. Preprocessing and Feature Engineering
Through iterative experimentation, we recognized how preprocessing choices (e.g., log
transformations, polynomial feature expansion) directly influence model performance. Feature
engineering emerged as both an art and a science, requiring domain intuition and empirical
validation.
3. Model Development and Ensemble Learning
By implementing and comparing diverse algorithms, from linear regression to gradient boosting,
we deepened our understanding of their theoretical assumptions and practical trade-offs. The
stacking ensemble highlighted the power of combining models to balance bias, variance, and
interpretability.
4. Evaluation and Communication
We refined our ability to critically assess model performance using metrics like MAE, RMSE,
and R². Visualization tools (e.g., residual plots, regression diagnostics) became indispensable for
diagnosing errors and communicating results to stakeholders.
5. Collaboration and Project Management
Navigating team workflows, version control, and task delegation mirrored real-world data
science environments. These experiences emphasized the importance of clear communication,
adaptability, and iterative problem-solving in collaborative projects.
Broader Implications
This project demonstrated how machine learning can address tangible challenges in industries
like automotive resale, where transparency and fairness are paramount. By delivering a robust,
data-driven pricing framework, we showcased the potential of predictive analytics to enhance
market efficiency and stakeholder trust.
Preparedness for Future Challenges
The technical and soft skills developed through this work, from coding proficiency to critical
thinking, have equipped us to tackle complex data problems across domains. We are now better
positioned to contribute meaningfully to future projects, whether in academic research, industry
applications, or entrepreneurial ventures.
References
[1] Scikit-learn Developers, "Scikit-learn: Machine Learning in Python," [Online].
Available: https://scikit-learn.org/. [Accessed: May 18, 2025].
[2] N. Birla, "Vehicle Dataset from Cardekho," Kaggle, [Online].
Available: https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho.
[Accessed: May 18, 2025].
[3] Python Software Foundation, "Python 3 Documentation," [Online].
Available: https://docs.python.org/3/. [Accessed: May 18, 2025].
[4] S. S. Patil and S. S. Patil, "Used Car Price Prediction System," International Journal of
Scientific Research in Science and Technology, vol. 11, no. 3, pp. 108–113, 2024. [Online].
Available: https://www.ijsrst.com/index.php/home/article/view/IJSRST24113108
[5] B. N. Bala, "Price Prediction for Used Cars (Data Science Project)," GitHub, [Online].
Available: https://github.com/bala-1409/Price-Prediction-for-Used-Cars-Datascience-Project.
[Accessed: May 18, 2025].
[6] S. Sharma and A. Kumar, "Comparative Analysis of Machine Learning Algorithms for
Used Car Price Prediction," International Journal of Current Science Research and Review, vol.
7, no. 2, pp. 123–130, 2024. [Online]. Available: https://ijcsrr.org/comparative-analysis-of-
machine-learning-algorithms-for-used-car-price-prediction/
[7] M. A. Rahman, "Used Car Price Prediction and Valuation using Data Mining
Techniques," RIT Scholar Works, 2019. [Online].
Available: https://repository.rit.edu/cgi/viewcontent.cgi?article=12220&context=theses
Appendices
a. Link to Dataset:
https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho
b. Python Codes Implemented:
Available in below GitHub repository:
https://github.com/GMLDEV/DATA_MINING.git
c. Setup to Execute the Code:
Python 3.10+
Google Colab
Required Libraries:
pandas
numpy
matplotlib
seaborn
scikit-learn