ML - Project - Business Report
MACHINE LEARNING
PROJECT BUSINESS REPORT
_______________________
DSBA
Contents
1.2. Encoding
Train-Test Split
Model Building
Logistic Regression Model
KNN Model
NB Model
Bagging Model
Boosting Model
ADA Boosting Model
Gradient Boosting Model
List of Tables
Table 1: Data Description Dataset 1
Table 19: Classification report – Model Tuning (Random Forest Bagging) – Train
Table 20: Classification report – Model Tuning (Random Forest Bagging) – Test
List of Figures:
Figure 1: Univariate Analysis
Figure 7: Pairplot
Problem 1
You are hired by one of the leading news channels, CNBE, which wants to analyse recent elections.
This survey was conducted on 1525 voters with 9 variables. You have to build a model to predict
which party a voter will vote for on the basis of the given information, in order to create an exit poll
that will help in predicting the overall win and the seats covered by a particular party.
Data Dictionary
2. age: in years
7. Europe: an 11-point scale that measures respondents' attitudes toward European integration. High scores represent 'Eurosceptic' sentiment.
EDA
The data is imported, and the following are the observations:
The dataset has 1525 rows and 10 columns. Two columns are of object data type and the remaining columns are of integer data type.
Univariate Analysis:
• The Labour party has clearly received more votes than the Conservative party.
• In every age group, the Labour party has received more votes than the Conservative party.
• Female votes are considerably higher than male votes for both parties.
• For both genders, the Labour party has received more votes than the Conservative party.
• The black dots in the boxplots show the presence of outliers in all the variables.
• A majority of the variables are highly skewed as well.
All the outliers are treated by capping them at the lower and upper bound values calculated from the IQR.
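A minimal sketch of this IQR-based capping, assuming the data has been loaded into a pandas DataFrame named df (variable and function names here are illustrative):

import pandas as pd

def cap_outliers_iqr(df, column):
    # Cap values outside the IQR-based whiskers at the lower/upper bounds
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    df[column] = df[column].clip(lower=lower, upper=upper)
    return df

# Apply the capping to every numeric column
for col in df.select_dtypes(include='number').columns:
    df = cap_outliers_iqr(df, col)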
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not?
Data Split: Split the data into train and test (70:30). (4 Marks)
drop_first is used so that only n-1 dummy columns are created from the n levels of a categorical variable; including all of them would result in multicollinearity. This is done to ensure that we do not land in the dummy variable trap.
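A minimal sketch of this encoding step, assuming the two object-type columns are named 'vote' (the target) and 'gender', and that the target levels are 'Labour' and 'Conservative' (these names are assumptions based on the data dictionary):

import pandas as pd

# One-hot encode the categorical predictor, dropping the first level
# to avoid the dummy variable trap (multicollinearity)
df_encoded = pd.get_dummies(df, columns=['gender'], drop_first=True)

# Label-encode the target separately, e.g. Labour = 1, Conservative = 0
df_encoded['vote'] = df_encoded['vote'].map({'Labour': 1, 'Conservative': 0})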
Problem 1 has features that are ordinal, but their rating scales differ. Since logistic regression is sensitive to the scale of the features, it helps to normalise such features to a similar scale, and scaling is a necessity when using distance-based models such as KNN. Scaling can be done on continuous and ordinal variables.
Why scaling?
The dataset contains features that vary widely in magnitude, units and range, for example between the 'age' column and the other columns. Since most machine learning algorithms use the Euclidean distance between two data points in their computations, this is a problem. If left alone, these algorithms only take in the magnitude of features, neglecting the units, and the results would vary greatly between different units, for example 1 km versus 1000 metres. Features with high magnitudes will weigh in a lot more in the distance calculations than features with low magnitudes. To suppress this effect, we need to bring all features to the same level of magnitude, which can be achieved by scaling.
In this case, we have a mix of encoded, ordinal, categorical and continuous variables, so we use the MinMaxScaler technique to scale the data.
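A minimal sketch of the 70:30 split followed by MinMax scaling (the scaler is fitted on the train set only), carrying over the df_encoded DataFrame and the assumed 'vote' target from the encoding sketch above:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X = df_encoded.drop('vote', axis=1)
y = df_encoded['vote']

# 70:30 train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

# Fit the scaler on the training data only, then transform both sets
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)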
Encoded Data
Modelling: 22 marks
• The error in the test data is slightly higher than in the train data, which is absolutely fine because the error margin is low and the error in both train and test data is not too high. Thus, the model is neither over-fitted nor under-fitted.
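A minimal sketch of fitting a logistic regression model on the scaled training data and scoring it on both sets (variable names carried over from the sketches above; the exact estimator settings used in the report are not shown here):

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=1000, random_state=1)
lr.fit(X_train_scaled, y_train)

# Compare train and test accuracy to judge over/under-fitting
print('Train accuracy:', lr.score(X_train_scaled, y_train))
print('Test accuracy :', lr.score(X_test_scaled, y_test))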
KNN Model:
Train Data:
Test Data:
Train Data:
Run the KNN with the number of neighbours set to 1, 3, 5, ..., 19 and find the optimal number of neighbours from K = 1, 3, 5, 7, ..., 19 using the misclassification error.
Hint: Misclassification error (MCE) = 1 - test accuracy score. The MCE is calculated for each model with neighbours = 1, 3, 5, ..., 19 and the model with the lowest MCE is selected.
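A minimal sketch of this neighbour search using the misclassification error, assuming the scaled splits defined earlier:

from sklearn.neighbors import KNeighborsClassifier

mce = {}
for k in range(1, 20, 2):            # k = 1, 3, 5, ..., 19
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)
    # Misclassification error = 1 - test accuracy
    mce[k] = 1 - knn.score(X_test_scaled, y_test)

best_k = min(mce, key=mce.get)
print('Optimal number of neighbours:', best_k)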
Test Data:
Train Data:
Test Data:
Random Forest:
Train Data:
Test Data:
Bagging:
Test Data:
Ada Boost:
Test Data:
Gradient Boosting:
Test Data:
1.7 Performance Metrics: Check the performance of predictions on the train and test sets using accuracy and the confusion matrix, plot the ROC curve and get the ROC-AUC score for each model. Final Model: Compare all the models and state which model is best/optimised.
To check the performance of the predictions of every model built, on both the train and test datasets, the accuracy score is calculated. A confusion matrix, ROC curve and ROC-AUC score have been produced for each model as well.
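A minimal sketch of how these metrics can be computed for any fitted model (shown here for the logistic regression sketch above; any other fitted classifier can be substituted):

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             roc_auc_score, roc_curve)
import matplotlib.pyplot as plt

y_pred = lr.predict(X_test_scaled)
y_prob = lr.predict_proba(X_test_scaled)[:, 1]

print('Accuracy :', accuracy_score(y_test, y_pred))
print('Confusion matrix:\n', confusion_matrix(y_test, y_pred))
print('ROC-AUC  :', roc_auc_score(y_test, y_prob))

# ROC curve
fpr, tpr, _ = roc_curve(y_test, y_prob)
plt.plot(fpr, tpr)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.show()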
[Confusion matrix and ROC curve figures for each model; axis values not reproduced here.]
Reported scores:
Accuracy Score 0.8229, F1 Score 0.8795
Accuracy Score 0.8229, F1 Score 0.8713
Accuracy Score 0.7573, F1 Score 0.8036
Let's look at the performance of all the models on the Train Data set:
Recall refers to the percentage of total relevant results correctly classified by the algorithm and
hence we will compare Recall of class "1" for all models.
So, as per the train data, the worst-performing model is Linear Discriminant Analysis, and the best-performing models are Random Forest, Bagging and Boosting. However, are these best-performing models overfitted?
The models which have not performed well on the train data set have also not performed well on the test data set. However, Decision Tree, Random Forest and Bagging, which had a 100% score on the train data set, have shown a poor result on the test data set; hence a clear case of overfitting. So, we will select the models which have performed approximately similarly on the train and test data sets, i.e. Naive Bayes and Ada Boost, and apply SMOTE on them to check whether the performance improves, as sketched below.
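A minimal sketch of applying SMOTE to the training data before refitting, assuming the imbalanced-learn package is available (shown here with the Naive Bayes model; the same resampled data can be fed to Ada Boost):

from imblearn.over_sampling import SMOTE
from sklearn.naive_bayes import GaussianNB

# Oversample only the training data; the test set is left untouched
sm = SMOTE(random_state=1)
X_train_sm, y_train_sm = sm.fit_resample(X_train_scaled, y_train)

nb_sm = GaussianNB().fit(X_train_sm, y_train_sm)
print('Train accuracy:', nb_sm.score(X_train_sm, y_train_sm))
print('Test accuracy :', nb_sm.score(X_test_scaled, y_test))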
From all the inferences above, we see that most of the models have similar performance.
The accuracy score for all the models is above 64% for both test and train data.
Conclusion:
• There is no under-fitting or over-fitting in any of the tuned models.
Inference: 5 marks
Business recommendations:
• Hyper-parameter tuning is an important aspect of model building. There are limitations to this, as processing all of these parameter combinations requires a huge amount of processing power. But if tuning can be done with many sets of parameters, we might get even better results (a sketch of such a grid search appears after this list).
• Gathering more data will also help in training the models and thus improve their predictive power.
• We can also create a function in which all the models predict the outcome in sequence. This will help in better understanding the probability of what the outcome will be.
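A minimal sketch of such hyper-parameter tuning with a grid search, using the splits defined earlier (the estimator and parameter grid shown here are illustrative, not the combinations used in the report):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [5, 7, 10],
    'min_samples_leaf': [5, 10, 25],
}

grid = GridSearchCV(RandomForestClassifier(random_state=1),
                    param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train_scaled, y_train)

print('Best parameters:', grid.best_params_)
print('Test accuracy  :', grid.score(X_test_scaled, y_test))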
1) Comparing all the performance measures, the Naïve Bayes model from the second iteration performs best. Although some other models, such as SVM and Extreme Boosting, perform almost the same as Naïve Bayes, the Naïve Bayes model is very consistent when the train and test results are compared with each other.
4) Those who rate the national economic conditions better prefer to vote for the Labour party.
6) Those who have higher political knowledge have voted for the Conservative party.
7) Looking at the assessment of both leaders, the Labour leader is performing well, as he has received better ratings in the assessment.
Problem 2
In this particular project, we are going to work on the inaugural corpora from the nltk in Python. We
will be looking at the following speeches of the Presidents of the United States of America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973
(Hint: use .words(), .raw(), .sents() for extracting counts)
Code Snippet to extract the three speeches:
"
import nltk
nltk.download('inaugural')
from nltk.corpus import inaugural
inaugural.fileids()
inaugural.raw('1941-Roosevelt.txt')
inaugural.raw('1961-Kennedy.txt')
inaugural.raw('1973-Nixon.txt')
Number of characters:
Number of words:
Number of words in Roosevelt file: 1360
Number of words in Kennedy file: 1390
Number of words in Nixon file: 1819
Number of sentences:
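A minimal sketch of how these character, word and sentence counts can be obtained with the corpus reader methods mentioned in the hint:

from nltk.corpus import inaugural

for fileid in ['1941-Roosevelt.txt', '1961-Kennedy.txt', '1973-Nixon.txt']:
    print(fileid)
    print('  characters:', len(inaugural.raw(fileid)))
    print('  words     :', len(inaugural.words(fileid)))
    print('  sentences :', len(inaugural.sents(fileid)))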
2.2) Remove all the stopwords from the three speeches. Show the word count before and after the removal of
stopwords. Show a sample sentence after the removal of stopwords.
Before removing the stop-words, we changed all the letters to lowercase and removed special characters.
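A minimal sketch of this preprocessing and of the word counts before and after stop-word removal, shown for the Roosevelt speech (the same steps apply to the other two); the regular expression used for removing special characters is an assumption, not necessarily the one used in the report:

import re
import nltk
from nltk.corpus import inaugural, stopwords

nltk.download('stopwords')

raw = inaugural.raw('1941-Roosevelt.txt')

# Lowercase, strip special characters, then tokenise on whitespace
cleaned = re.sub(r'[^a-z\s]', ' ', raw.lower())
words = cleaned.split()

stop_words = set(stopwords.words('english'))
words_no_stop = [w for w in words if w not in stop_words]

print('Word count before stop-word removal:', len(words))
print('Word count after stop-word removal :', len(words_no_stop))
print('Sample sentence:', ' '.join(words_no_stop[:15]))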
Word count before the removal of stop-words:
2.3) Which word occurs the most number of times in his inaugural address for each president? Mention the top three
words. (after removing the stopwords)
Top 3 words in Roosevelt's speech:
• know - 10
• spirit - 9
Top 3 words in Kennedy's speech:
• let - 16
• us - 12
• sides - 8
Top 3 words in Nixon's speech:
• us - 26
• let - 22
• peace - 19
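A minimal sketch of how the top three words can be obtained after stop-word removal, continuing from the preprocessing sketch above:

from nltk import FreqDist

# Frequency distribution over the stop-word-free tokens
freq = FreqDist(words_no_stop)
print(freq.most_common(3))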