Loan Prediction System
Submitted in partial fulfillment for the award of the degree of Bachelor of
Engineering in Computer Science and Engineering
Submitted to
CERTIFICATE
This is to certify that the work embodied in this Project Dissertation Report
entitled “Loan Prediction System”, being submitted by ADITYA RANA
(0126CS171005), KANHA GOYAL (0126CS171042), NIHAL GOUR
(0126CS171052) and ASTHA JAIN (0126CS171026) in partial fulfillment of the
requirement for the award of “Bachelor of Engineering” in the Computer Science &
Engineering discipline to Rajiv Gandhi Proudyogiki Vishwavidyalaya, Bhopal (M.P.)
during the academic year 2020-21, is a record of a bona fide piece of work, carried
out under my supervision and guidance in the Department of Computer Science &
Engineering, Oriental College of Technology, Bhopal.
Approved by
ORIENTAL COLLEGE OF TECHNOLOGY, BHOPAL
Approved by AICTE New Delhi & Govt. of M.P. & Affiliated to Rajiv Gandhi
Proudyogiki Vishwavidyalaya, Bhopal (M.P.)
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
CERTIFICATE OF APPROVAL
Date: Date:
CANDIDATE DECLARATION
We hereby declare that the project dissertation work presented in the report
entitled “LOAN PREDICTION SYSTEM”, submitted in partial fulfillment of
the requirements for the award of the degree of Bachelor of Engineering in
Computer Science & Engineering of Oriental College of Technology, is an
authentic record of our own work.
We have not submitted any part of this report, in part or in full, for the award
of any other degree or diploma.
Date:
This is to certify that the above statement made by the candidates is correct
to the best of my knowledge.
ACKNOWLEDGMENT
Abstract
The rate at which banks lose funds to loan beneficiaries through loan default is alarming.
This trend has led to the closure of many banks, deprived potential beneficiaries of access
to loans, and cost many workers their jobs in banks and other sectors.
Recent scams in the Indian banking sector, by scammers such as Vijay Mallya,
Mehul Choksi, Nirav Modi and others, have conned an average of some 10,000 crores,
which brings us to a serious question: “Is our money safe in the bank?”
This project is a small practical approach to what could be done next with the system,
so that every person can be a little safer.
This work uses past loan records and machine learning to predict fraud in bank loan
administration, and thereby avoid loan defaults that manual scrutiny by a credit officer
would not have discovered. Such hidden patterns are revealed by machine learning.
We have used methods such as the confusion matrix, decision trees, XGBoost and
logistic regression, as a replacement for the old system of the CIBIL score. Statistical
and conventional approaches in this direction are restricted in their accuracy. With a
large volume and variety of data, manual judgment of credit history is inefficient;
case-based and analogy-based reasoning and statistical approaches have been employed,
but 21st-century fraudulent attempts cannot be discovered by these approaches. Hence
this work takes a machine learning approach, using the decision tree method to predict
fraud, and it delivers an accuracy of more than 90 percent.
INDEX
S.NO CONTENTS
1 Introduction
2 Detailed Project Profile
3 Steps of Implementation
4 Process of Implementing Machine Learning
4.1 Dataset Preparation & Preprocessing
4.2 Data Visualization
4.3 Data Preprocessing
4.3.1 Labelling
4.3.2 Data Selection
4.3.3 Data Formatting
4.3.4 Data Cleaning
4.3.5 Data Anonymization
4.3.6 Data Sampling
4.4 Data Transformation
4.4.1 Scaling
4.4.2 Decomposition
4.4.3 Aggregation
4.5 Dataset Splitting
4.5.1 Training Set
4.5.2 Test Set
4.5.3 Validation Dataset
4.6 Modelling
4.6.1 Model Training
4.6.1.1 Supervised Learning
4.6.1.2 Unsupervised Learning
4.7 Model Evaluation and Testing
4.8 Model Deployment
5 Python
6 Libraries Used
6.1 NUMPY
6.2 PANDAS
6.3 SEABORN
6.4 MATPLOTLIB
7 Flask
8 Software and Hardware Requirement
9 Screen Layout
10 Output Layout
11 Limitations
12 Future Scope
13 Conclusion
14 Appendix
1. Introduction
Loans are the core business of banks. The main profit comes directly from the interest on
loans. Loan companies grant a loan after an intensive process of verification and
validation. However, they still have no assurance that the applicant will be able to repay
the loan without difficulty. In this project, we build a predictive model to predict whether
an applicant will be able to repay the lending company or not.
A loan is a form of debt incurred by an individual or other entity. The lender—usually a
corporation, financial institution, or government—advances a sum of money to the
borrower. In return, the borrower agrees to a certain set of terms including any finance
charges, interest, repayment date, and other conditions.
Companies want to automate the loan eligibility process (in real time) based on the
customer details provided while filling in an online application form. These details are
Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit
History and others. To automate this process, they have set the problem of identifying the
customer segments that are eligible for a loan amount, so that these customers can be
specifically targeted. Here they have provided a partial data set.
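As a small illustration of this eligibility problem, the sketch below fits a logistic regression on a handful of made-up applicant records. The column names and values are assumptions for illustration only, not the project's actual data or code.

```python
# Illustrative sketch: predicting loan eligibility from customer details.
# Column names and sample values are assumptions, not the real dataset.
import pandas as pd
from sklearn.linear_model import LogisticRegression

data = pd.DataFrame({
    "Gender":          ["Male", "Female", "Male", "Female", "Male", "Female"],
    "Married":         ["Yes", "No", "Yes", "Yes", "No", "No"],
    "ApplicantIncome": [5000, 3000, 4000, 2500, 6000, 2000],
    "Credit_History":  [1, 1, 0, 1, 1, 0],
    "Loan_Status":     ["Y", "Y", "N", "Y", "Y", "N"],
})

# Encode the categorical details as numbers, as described above
X = pd.get_dummies(data.drop(columns="Loan_Status"), drop_first=True)
y = (data["Loan_Status"] == "Y").astype(int)

model = LogisticRegression().fit(X, y)
print(model.predict(X))   # eligibility: 1 = approve, 0 = reject
```

On real data one would of course evaluate on a held-out set rather than the training rows.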
There are unresolved fraudulent practices in financial operations in society, including
bank credit administration, calling for a remedy through intelligent technology. Existing
fraud detection techniques in bank credit administration have not achieved the
desired accuracy and avoidance of false alarms, and none has focused on fraud in bank
credit default. Fraudulent duplicates, missing data, and undefined fraud scenarios also
affect prediction accuracy.
Any unlawful act by human beings, or invoked by machines, that leads to personal
gain at the expense of institutions or the legal human beneficiaries is a financial fraud,
but an error must not be taken for a fraud. Considering its overall effect, financial fraud
is referred to as economic sabotage. Examples of financial fraud are money laundering,
bank credit fraud, pension fraud, co-operative society fraud, tax evasion,
telecommunications fraud, credit card fraud, inflated
Motivation
In the Indian banking system, we use the CIBIL score to determine if a person is eligible
for a loan, but it can be highly manipulated when we dig down deep.
A CIBIL Score is a consumer's credit score. Simply put, this is a three-digit numeric summary
of a consumer's credit history and a reflection of the person's credit profile. This is based on
past credit behavior, such as borrowing and repayment habits as shared by banks and
lenders with CIBIL on a regular basis.
We are trying to give a person a simple yes-or-no answer, so that he or she can get a
clear picture instead of a score on a wide range from 300 to 900. It will be more
practical for banks too.
3. Steps of Implementation
The working steps in this project were:
• Collection of Data to Start with
• Data Pre-processing
• Exploratory Data Analysis
• Model Building (Test and Train)
• Improving Efficiency
• Making a web Application
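The steps above can be sketched end to end on synthetic data. This is purely illustrative; the real project trains on the bank's past loan records, and the features and model settings here are assumptions.

```python
# End-to-end sketch of the listed steps on synthetic data (illustrative only).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))             # collected data (simulated)
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # preprocessed target label

# Model building: split into train and test, then fit
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Measure efficiency on unseen data
acc = accuracy_score(y_test, clf.predict(X_test))
print(f"test accuracy: {acc:.2f}")
```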
4.3.1 Labelling
Data labeling takes much time and effort, as datasets sufficient for machine learning may
require thousands of records to be labeled. For instance, if your image recognition
algorithm must classify types of bicycles, these types should be clearly defined and
labeled in a dataset. Here are some approaches that streamline this tedious and
time-consuming procedure.
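For this project, labelling amounts to mapping the loan-status strings to numeric classes. A minimal sketch, assuming a `Loan_Status` column with 'Y'/'N' values (an illustrative schema, not necessarily the project's exact one):

```python
# Labelling sketch: map raw status strings to numeric class labels.
import pandas as pd

df = pd.DataFrame({"Loan_Status": ["Y", "N", "Y", "Y", "N"]})
df["label"] = df["Loan_Status"].map({"Y": 1, "N": 0})
print(df["label"].tolist())   # → [1, 0, 1, 1, 0]
```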
4.3.2 Data selection
After having collected all information, a data analyst chooses a subgroup of data to solve
the defined problem. For instance, if you save your customers’ geographical location, you
don’t need to add their cell phone and bank card numbers to a dataset. But purchase
history would be necessary. The selected data includes the attributes that need to be
considered when building a predictive model.
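The customer example above can be sketched in pandas; the column names are hypothetical:

```python
# Data selection sketch: keep only modelling-relevant attributes,
# drop identifiers (illustrative columns).
import pandas as pd

df = pd.DataFrame({
    "customer_id":    [101, 102],
    "phone":          ["555-0101", "555-0102"],
    "location":       ["Bhopal", "Indore"],
    "purchase_total": [1200.0, 430.5],
})

selected = df[["location", "purchase_total"]]   # subset used for the model
print(list(selected.columns))
```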
4.4.1 Scaling
Data may have numeric attributes (features) that span different ranges, for example,
millimeters, meters, and kilometers. Scaling is about converting these attributes so that
they will have the same scale, such as between 0 and 1, or 1 and 10 for the smallest and
biggest value for an attribute.
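A minimal min-max scaling sketch of the idea above, with made-up distance values:

```python
# Min-max scaling: rescale a feature so its values span the 0-1 range.
import numpy as np

distances = np.array([120.0, 3500.0, 47000.0])   # mixed magnitudes
scaled = (distances - distances.min()) / (distances.max() - distances.min())
print(scaled)   # smallest value maps to 0.0, biggest to 1.0
```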
4.4.2 Decomposition
Sometimes finding patterns in data with features representing complex concepts is more
difficult. Decomposition technique can be applied in this case. During decomposition, a
specialist converts higher level features into lower level ones. In other words, new features
based on the existing ones are being added. Decomposition is mostly used in time series
analysis. For example, to estimate the demand for air conditioners per month, a market
research analyst converts data representing demand per quarter.
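One common form of decomposition is breaking a date feature into lower-level parts that a model can use directly. A small sketch with an assumed column name:

```python
# Decomposition sketch: split a high-level date feature into year and month.
import pandas as pd

df = pd.DataFrame({"order_date": pd.to_datetime(["2021-01-15", "2021-04-02"])})
df["year"] = df["order_date"].dt.year
df["month"] = df["order_date"].dt.month
print(df[["year", "month"]].values.tolist())   # → [[2021, 1], [2021, 4]]
```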
4.4.3 Aggregation
Unlike decomposition, aggregation aims at combining several features into a feature that
represents them all. For example, you’ve collected basic information about your customers
and particularly their age. To develop a demographic segmentation strategy, you need to
distribute them into age categories, such as 16-20, 21-30, 31-40, etc. You use aggregation
to create large-scale features based on small-scale ones. This technique allows you to
reduce the size of a dataset without the loss of information.
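The age-binning example above, sketched with pandas.cut (the exact bin edges are an assumption):

```python
# Aggregation sketch: bin raw ages into the age categories mentioned above.
import pandas as pd

ages = pd.Series([18, 25, 34, 52])
bins = pd.cut(ages, bins=[15, 20, 30, 40, 60],
              labels=["16-20", "21-30", "31-40", "41-60"])
print(bins.tolist())   # → ['16-20', '21-30', '31-40', '41-60']
```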
4.5 Dataset splitting
A dataset used for machine learning should be partitioned into three subsets:
1. Training set
2. Test set
3. Validation set
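A common way to obtain the three subsets is two successive splits. The 60/20/20 ratio below is an assumption for illustration, not the project's exact choice:

```python
# Dataset splitting sketch: train / validation / test via two splits.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# First carve off 40% for validation + test, then halve that remainder
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # → 30 10 10
```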
4.8 Model deployment
The model deployment stage covers putting a model into production use.
Once a data scientist has chosen a reliable model and specified its performance
requirements, he or she delegates its deployment to a data engineer or database
administrator. The distribution of roles depends on your organization’s structure and the
amount of data you store.
Generally, a data engineer implements, tests, and maintains the infrastructural
components for proper data collection, storage, and accessibility. Besides working with
big data and building and maintaining a data warehouse, a data engineer takes part in
model deployment. To do so, a specialist translates the final model from high-level
programming languages (e.g. Python and R) into lower-level languages such as C/C++
or Java.
The distinction between the two types of languages lies in their level of abstraction in
relation to the hardware. A model written in a low-level, or computer-native, language
therefore integrates better with the production environment.
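In smaller Python-only deployments, a simpler hand-off is often used instead: serializing the trained model so the production service can load it without retraining. A minimal sketch, and not the project's actual pipeline:

```python
# Deployment hand-off sketch: serialize a trained model, restore it later.
import pickle
from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit([[0], [1], [2], [3]], [0, 0, 1, 1])

blob = pickle.dumps(model)            # the artifact shipped to production
restored = pickle.loads(blob)         # loaded by the serving process
print(restored.predict([[0], [3]]))   # behaves like the original model
```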
5. Python
Python is an interpreted, high-level, general-purpose programming language. Python's
design philosophy emphasizes code readability with its notable use of significant
indentation.
Python offers concise and readable code. While complex algorithms and versatile
workflows stand behind machine learning and AI, Python’s simplicity allows developers to
write reliable systems. Developers get to put all their effort into solving an ML problem
instead of focusing on the technical nuances of the language. Additionally, Python is
appealing to many developers as it’s easy to learn. Python code is understandable by
humans, which makes it easier to build models for machine learning.
To reduce development time, programmers turn to a number of Python frameworks and
libraries. A software library is pre-written code that developers use to solve common
programming tasks. Python, with its rich technology stack, has an extensive set of libraries
for artificial intelligence and machine learning. Here are some of them:
6. Libraries Used
The libraries used in developing this model are:
6.1 NUMPY
NUMPY is a library for the Python programming language, adding support for large, multi-
dimensional arrays and matrices, along with a large collection of high-level mathematical
functions to operate on these arrays.
Moreover, NUMPY forms the foundation of the machine learning stack. In this section we
cover the most frequently used NUMPY operations.
1) Creating a Vector
We use NUMPY to create a 1-D Array which we then call a vector.
2) Creating a Matrix
We create a 2-D Array in NUMPY and call it a Matrix. It contains 2 rows and 3
columns.
3) Creating a Sparse Matrix
Sparse matrices store only non-zero elements and assume all other values are
zero, leading to significant computational savings.
4) Selecting Elements
When you need to select one or more elements in a vector or matrix.
5) Describing a Matrix
When you want to know the shape, size and dimensions of a matrix.
6) Applying operations to elements
You want to apply some function to multiple elements in an array.
NUMPY’S VECTORIZE class converts a function into one that can be applied to
multiple elements of an array, or a slice of an array.
7) Finding the max and min values
We use NUMPY’S max and min functions.
8) Calculating Average, Variance and Standard deviation
When you want to calculate some descriptive statistics about an array.
9) Reshaping Arrays
When you want to reshape an array, changing the number of rows and columns
without changing the elements.
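Several of the operations listed above can be sketched in one place (the values are arbitrary examples):

```python
# A few of the NumPy operations listed above, in one sketch.
import numpy as np
from scipy.sparse import csr_matrix

vector = np.array([1, 2, 3])                     # 1) 1-D array (vector)
matrix = np.array([[1, 2, 3], [4, 5, 6]])        # 2) 2 x 3 matrix

sparse = csr_matrix(np.array([[0, 0], [3, 0]]))  # 3) stores only non-zeros

print(matrix[1, 2])      # 4) selecting an element → 6
print(matrix.shape)      # 5) describing the matrix: (2, 3)

add_ten = np.vectorize(lambda x: x + 10)         # 6) apply to every element
print(add_ten(vector))   # → [11 12 13]

print(matrix.max(), matrix.min())                # 7) max and min → 6 1
print(matrix.mean(), matrix.var())               # 8) descriptive statistics
print(matrix.reshape(3, 2))                      # 9) same elements, new shape
```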
6.2 PANDAS
PANDAS is one of the tools in machine learning used for data cleaning and
analysis. It has features for exploring, cleaning, transforming and visualizing data.
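A small sketch of the kind of cleaning PANDAS is used for here; the columns mimic a loan dataset but the values are made up:

```python
# Cleaning sketch: fill missing numbers, drop rows with missing categories.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "LoanAmount": [120.0, np.nan, 66.0, 95.0],
    "Gender":     ["Male", "Female", None, "Male"],
})

# Impute the missing loan amount with the column median
df["LoanAmount"] = df["LoanAmount"].fillna(df["LoanAmount"].median())
# Drop the row that still has a missing categorical value
df = df.dropna(subset=["Gender"])

print(df.shape)                       # → (3, 2)
print(df["LoanAmount"].isna().sum())  # → 0
```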
20 | P a g e
6.3 SEABORN
SEABORN is a data visualization library built on top of MATPLOTLIB and closely integrated
with PANDAS data structures in Python. Visualization is the central part of SEABORN,
which helps in the exploration and understanding of data. One has to be familiar with
NUMPY, MATPLOTLIB and PANDAS to learn about SEABORN.
SEABORN offers the following functionalities:
• Automatic estimation and plotting of linear regression plots.
• High-level abstractions for multi-plot grids.
6.4 MATPLOTLIB
MATPLOTLIB is a 2-D plotting library that helps in visualizing figures. MATPLOTLIB
emulates MATLAB-like graphs and visualizations. MATLAB is not free, is difficult to scale
and as a programming language is tedious. So MATPLOTLIB in Python is used, as it is a
robust, free and easy library for data visualization.
Visualizations are the easiest way to analyze and absorb information. Visuals help in
easily understanding a complex problem. They help in identifying patterns, relationships,
and outliers in data, and in understanding business problems better and more quickly.
They help build a compelling story, and insights gathered from the visuals help in
building strategies for businesses. Visualization is also a precursor to much high-level
data analysis, in Exploratory Data Analysis (EDA) and Machine Learning (ML).
7. Flask
Flask is a lightweight WSGI web application framework. It is designed to make getting
started quick and easy, with the ability to scale up to complex applications. It began as a
simple wrapper around WERKZEUG and JINJA and has become one of the most popular
Python web application frameworks.
Flask offers suggestions, but doesn't enforce any dependencies or project layout. It is up to
the developer to choose the tools and libraries they want to use. There are many
extensions provided by the community that make adding new functionality easy.
Features
• Development server and debugger
• Integrated support for unit testing
• RESTful request dispatching
• Uses JINJA templating
• Support for secure cookies (client-side sessions)
• 100% WSGI 1.0 compliant
• Unicode-based
• Extensive documentation
• Google App Engine compatibility
• Extensions available to enhance desired features
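A minimal Flask sketch in this spirit; the route and response text are placeholders, not the project's actual endpoints:

```python
# Minimal Flask sketch: one route, exercised with the built-in test client.
from flask import Flask

app = Flask(__name__)

@app.route("/")
def home():
    return "Loan Prediction System"

# Flask's test client exercises the route without starting a server
client = app.test_client()
response = client.get("/")
print(response.status_code, response.get_data(as_text=True))
```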
9. Screen Layout
1 Home Screen Tab
This is how our home page looks: a simple layout, no fancy stuff. Once you click on the
Predict button, the next web page is shown.
2 Data Selection Tab
3 Analysis Tab
4 Data Filling Tab
• Customers will get an idea of whether they are eligible or not; it will help in
reducing the time before they go and apply for a loan.
• Customers can see the reason why they are not eligible for a loan, because the
reasons for defaulting on a loan can be endless.
• If a customer has an existing loan in another bank, the system will reflect it
directly once the application is integrated with a bank.
• Small finance banks can be the first to use this service at a mass level. In the
current situation too, they will get extra security.
• Banks’ NPAs (non-performing assets) will reduce, and this reduced NPA will boost
the loan-giving ability of the bank.
• The added security can be a game changer, as banks like Yes Bank, Lakshmi Vilas
Bank and other small co-operative banks will not file for bankruptcy after a major
scam and consecutive frauds.
13. Conclusion
From a proper analysis of the positive points and constraints of the component, it can be
safely concluded that the product is a highly efficient component. This application is
working properly and meets all banker requirements.
Initially it can be used as a third-party application which can give you a better idea of
whether you are eligible for a loan or not.
In the next phase, after integrating it with a bank server, we can get a clearer picture of
loan approval. This may seem like a joke now, but it can be a revolutionary product if
used and customized properly.
This component can be easily plugged into many other systems. There have been a
number of cases of computer glitches and errors in content, and, most importantly, the
weights of features are fixed in the automated prediction system. So in the near future
the software could be made more secure and reliable, with dynamic weight adjustment.
In the near future this prediction module can also be integrated with a module for
automated processing. The system is trained on an old training dataset; in future, the
software can be made such that new test data also takes part in the training data after
some fixed time.
Its efficiency will vary from time to time, but overall it is a practical approach which we
need at the current time.
The anomaly of taking credit and ending up in default, to the detriment of the lender, has
been confirmed to have a remedy in machine learning. Using a real-life dataset, it has
been shown that false positives can be reduced by employing a decision tree, thereby
obtaining a highly reliable accuracy that financial institutions can depend on while
scrutinizing loan applications.
14. Appendix
The link below gives access to all the content of the project, including:
1. Lab report (digital copy)
2. PowerPoint presentation
3. Source code
https://drive.google.com/drive/folders/1E1SxyMC2OIRDkX7ofeP5noY0PDXAJNBa