
_______________________

MACHINE LEARNING
PROJECT BUSINESS REPORT
_______________________
DSBA



Table of Contents:
List of Figures
List of Tables
Data Description
1.1. EDA
     Univariate Analysis
     Bivariate Analysis
     Outlier Detection & Treatment
     Null Value Treatment
1.2. Encoding
     Train-Test Split
     Model Building
     Logistic Regression Model
     KNN Model
     NB Model
     Bagging Model
     Boosting Model
     ADA Boosting Model
     Gradient Boosting Model
1.3. Model Evaluation & Performance
1.4. Business Insights & Recommendations

List of Tables:
Table 1: Data Description Dataset 1
Table 2: Data Summary
Table 3: Encoded data
Table 4: Data description
Table 5: Encoded data
Table 6: Classification report - Logistic Regression model 1 - Train
Table 7: Classification report - Logistic Regression model 1 - Test
Table 8: Classification report - Optimized Logistic Regression model - Train
Table 9: Classification report - Optimized Logistic Regression model - Test
Table 10: Classification report - LDA model - Train
Table 11: Classification report - LDA model - Test
Table 12: Classification report - LR model - Train
Table 13: Classification report - LR model - Test
Table 14: Classification report - KNN model - Train
Table 15: Classification report - KNN model - Test
Table 16: Classification report - KNN model - Test
Table 17: Classification report - NB model - Train
Table 18: Classification report - NB model - Test
Table 19: Classification report - Model Tuning (Random Forest Bagging) - Train
Table 20: Classification report - Model Tuning (Random Forest Bagging) - Test
Table 21: Classification report - Model Tuning (Boosting) - Train
Table 22: Classification report - Model Tuning (ADA Boosting) - Train
Table 23: Classification report - Model Tuning (ADA Boosting) - Test
Table 24: Classification report - Model Tuning (Gradient Boosting) - Train
Table 25: Classification report - Model Tuning (Gradient Boosting) - Test
Table 26: Classification report - Performance Metrics of Prediction - Train
Table 27: Classification report - Performance Metrics of Prediction - Test

List of Figures:
Figure 1: Univariate Analysis
Figure 2: Univariate Analysis
Figure 3: Bivariate analysis
Figure 4: Multivariate analysis
Figure 7: Pairplot
Figure 8: Correlation Heatmap
Figure 9: Boxplot for outlier detection
Figure 35: ROC Curve - Optimized Logistic Regression model - Train
Figure 36: ROC Curve - Optimized Logistic Regression model - Test
Figure 37: ROC Curve - LDA model - Train
Figure 38: ROC Curve - LDA model - Test
Figure 41: Confusion matrices of all models (Train data)
Figure 42: Confusion matrices of all models (Test data)

Problem 1:
You are hired by one of the leading news channels, CNBE, which wants to analyze the recent elections. A survey was conducted on 1525 voters with 9 variables. You have to build a model to predict which party a voter will vote for on the basis of the given information, in order to create an exit poll that will help in predicting the overall win and the seats covered by a particular party.

Data Dictionary:

1. vote: Party choice: Conservative or Labour

2. age: in years

3. economic.cond.national: Assessment of current national economic conditions, 1 to 5.

4. economic.cond.household: Assessment of current household economic conditions, 1 to 5.

5. Blair: Assessment of the Labour leader, 1 to 5.

6. Hague: Assessment of the Conservative leader, 1 to 5.

7. Europe: an 11-point scale that measures respondents' attitudes toward European integration. High scores represent 'Eurosceptic' sentiment.

8. political.knowledge: Knowledge of parties' positions on European integration, 0 to 3.

9. gender: female or male.

Dataset for Problem: Election_Data.xlsx


Data Ingestion: 11 marks



1.1 Read the dataset. Do the descriptive statistics and do the null value condition check.
Write an inference on it. (4 Marks)

EDA
The data is imported, and the following are the observations:
 The data has 1525 rows and 10 columns. Two columns are of object type and the rest are integer types.

Data type of data features:



Number of duplicate rows = 0

 There are no duplicate rows present in the data.
 There are no missing values in any variable.
 In the data summary, the minimum age is 24 and the maximum age is 93.
 economic.cond.national: assessment of current national economic conditions, 1 to 5.
 economic.cond.household: assessment of current household economic conditions, 1 to 5.
 Blair: assessment of the Labour leader, 1 to 5.
 Hague: assessment of the Conservative leader, 1 to 5.
 Europe: an 11-point scale measuring respondents' attitudes toward European integration; high scores represent 'Eurosceptic' sentiment.
 political.knowledge: knowledge of parties' positions on European integration, 0 to 3.
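A minimal sketch of the ingestion and checks described above (the dataframe name df is illustrative; the file name is the one given in the problem statement):

import pandas as pd

# Read the dataset and run the basic checks reported above
df = pd.read_excel("Election_Data.xlsx")

print(df.shape)                     # expected: (1525, 10)
print(df.dtypes)                    # 2 object columns, the rest integers
print(df.describe(include="all"))   # descriptive statistics
print(df.isnull().sum())            # null value check
print("Number of duplicate rows =", df.duplicated().sum())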



1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for
Outliers. (7 Marks)

Univariate Analysis:



Bivariate Analysis:

 The age distribution is almost the same for Labour and Conservative voters.
 We can clearly see that the Labour party has received more votes than the Conservative party.
 In every age group, the Labour party has received more votes than the Conservative party.
 Female votes are considerably higher than male votes for both parties.
 For both genders, the Labour party has received more votes than the Conservative party.
 The national economic condition ratings of Labour and Conservative voters are almost the same.
 The household economic condition ratings of Labour and Conservative voters are also similar.

Multivariate analysis:

A pair plot is a combination of histograms and scatter plots.

• From the histograms, we can see that the 'Blair', 'Europe' and 'political.knowledge' variables are slightly left skewed.
• All other variables appear to be roughly normally distributed.
• From the scatter plots, we can see that there is mostly no correlation between the variables.



A correlation matrix is a table that shows the correlation coefficient between variables. Correlation values range from -1 to +1; values close to zero mean there is no linear trend between two variables, and values close to 1 mean the correlation is strongly positive. The correlation heat map helps us visualize the correlation between pairs of variables.

Observations:
 Europe and age are correlated with each other.
 Weaker correlations exist between economic.cond.national and Blair, economic.cond.household and economic.cond.national, and Europe and Hague.
 Overall, there is mostly little correlation in the dataset. Some variables are moderately positively correlated and some are slightly negatively correlated.
 'economic.cond.national' and 'economic.cond.household' have a moderate positive correlation.
 'Blair' has a moderate positive correlation with 'economic.cond.national' and 'economic.cond.household'.
 'Europe' and 'Hague' have a moderate positive correlation.
 'Hague' has a moderate negative correlation with 'economic.cond.national' and 'Blair'.
 'Europe' has a moderate negative correlation with 'economic.cond.national' and 'Blair'.
Outlier Detection & Treatment:

 The black dots in the boxplots show that outliers are present in all the variables.
 The majority of the variables are highly skewed as well.
 All outliers are treated by capping them at the lower and upper bound values calculated from the IQR.
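A sketch of the IQR-based capping described above, assuming the dataframe is loaded as df (the helper name is illustrative):

import numpy as np

def cap_outliers_iqr(series):
    # Cap values outside the IQR-based lower/upper bounds
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

for col in df.select_dtypes(include=np.number).columns:
    df[col] = cap_outliers_iqr(df[col])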



Data Preparation: 4 marks

1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not?
Data Split: Split the data into train and test (70:30). (4 Marks)

 drop_first is used so that one of the dummy columns created from the levels of a categorical variable is dropped; otherwise the dummy columns would be perfectly collinear. This is done to avoid the dummy-variable trap.

 Problem 1 has features that are ordinal, but their rating scales differ. Logistic regression is sensitive to feature magnitudes, so it helps to bring such features to a similar scale, and scaling is a necessity for distance-based models such as KNN. Scaling can be applied to continuous and ordinal variables.

 Why scaling?
 The dataset contains features that vary greatly in magnitude, units and range, for example between the 'age' column and the other columns.
 Since most machine learning algorithms use the Euclidean distance between data points in their computations, this is a problem.
 If left untreated, these algorithms consider only the magnitude of the features and neglect their units.
 The results would then vary greatly between different units, e.g. 1 km versus 1000 metres.
 Features with large magnitudes would weigh far more in the distance calculations than features with small magnitudes.
 To suppress this effect, we need to bring all features to the same level of magnitude, which is achieved by scaling.
 In this case, we have a mix of encoded, ordinal, categorical and continuous variables, so we use the MinMaxScaler technique to scale the data.

Viewing the data after scaling:

Encoded Data



 Encoding is done on the only object-type variables, i.e., 'vote' and 'gender'.
 The continuous variables, which have different ranges, are scaled using the min-max technique.
 New columns are created: vote is encoded as 1 for Labour and 0 otherwise, and gender is encoded as 1 for male and 0 otherwise.

Train – Test split:


The data set is split into training and testing data in the ratio of 70:30.

Train Data split:

Test Data split:
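A sketch of the encoding, scaling and 70:30 split described above, assuming the dataframe is loaded as df; the exact dummy column names (e.g. vote_Labour) depend on pandas' naming and are illustrative:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# One-hot encode the object columns; drop_first avoids the dummy-variable trap
df_enc = pd.get_dummies(df, columns=["vote", "gender"], drop_first=True)

X = df_enc.drop("vote_Labour", axis=1)   # target column name is illustrative
y = df_enc["vote_Labour"]

# Min-max scaling brings 'age' and the rating scales to a common 0-1 range
# (a stricter approach would fit the scaler on the training split only)
X = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)

# 70:30 train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)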

Modelling: 22 marks



1.4 Apply Logistic Regression and LDA (linear discriminant analysis). (4 marks)

Model Performance Logistic Regression:

Accuracy Score is 0.8231441048034934

Validity of the model:

• The error on the test data is slightly higher than on the train data, which is acceptable because the margin is small and the error on both train and test data is low. Thus, the model is neither over-fitted nor under-fitted.
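A minimal sketch of the logistic regression fit and evaluation, reusing the X_train/X_test split from the sketch above:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)

print("Train accuracy:", accuracy_score(y_train, lr.predict(X_train)))
print("Test accuracy :", accuracy_score(y_test, lr.predict(X_test)))
print(classification_report(y_test, lr.predict(X_test)))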



Model Performance LDA (linear discriminant analysis):

Code

Training Data and Test Data Classification Report Comparison

 Accuracy on Train data: 0.83
 Accuracy on Test data: 0.85
 Recall on Train Data: 0.89
 Recall on Test Data: 0.91

 Validity of the model:
 The error on the test data is slightly higher than on the train data, which is acceptable because the margin is small and the error on both train and test data is low. Thus, the model is neither over-fitted nor under-fitted.
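The LDA model follows the same fit/predict pattern; a minimal sketch using the same split:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import classification_report

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
print(classification_report(y_train, lda.predict(X_train)))  # train report
print(classification_report(y_test, lda.predict(X_test)))    # test report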



1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results. (4 marks)

KNN Model:

Train Data:

Test Data:

The default value is n_neighbors=5; let's also check the performance for K=7.

Train Data:



Test Data:

Run KNN with the number of neighbours set to 1, 3, 5, ..., 19 and find the optimal number of neighbours using the misclassification error.

Hint: misclassification error (MCE) = 1 - test accuracy score. The MCE is calculated for each model with neighbours = 1, 3, 5, ..., 19 and the model with the lowest MCE is chosen.

The misclassification error is plotted against K (with the K value on the X-axis) using matplotlib.
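A sketch of the misclassification-error search described above (K = 1, 3, ..., 19), using the same train/test split:

import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

k_values = list(range(1, 20, 2))   # K = 1, 3, 5, ..., 19
mce = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    mce.append(1 - knn.score(X_test, y_test))   # MCE = 1 - test accuracy

plt.plot(k_values, mce, marker="o")
plt.xlabel("K (number of neighbours)")
plt.ylabel("Misclassification error")
plt.show()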



The plot suggests the best test accuracy towards the higher end of the K range; let's check the train and test performance for K = 17 with the other evaluation metrics.
Train Data:

Test Data:

 Accuracy on Train data: 0.70
 Accuracy on Test data: 0.71
 Recall on Train Data: 0.97
 Recall on Test Data: 0.96
 Cross-validation score mean on train: 0.68
 Cross-validation score mean on test: 0.71
 After cross-validation, the scores on the train and test data sets are almost the same across all folds.
 Hence, our model is valid.

Naïve Bayes Model:

Train Data:

Test Data:

 Accuracy on Train data: 0.69


 Accuracy on Test data: 0.72
 Recall on Train Data: 1.00
 Recall on Test Data:0.50
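A sketch of the Naive Bayes fit, reporting the recall values quoted above:

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import recall_score

nb = GaussianNB()
nb.fit(X_train, y_train)
print("Recall (train):", recall_score(y_train, nb.predict(X_train)))
print("Recall (test) :", recall_score(y_test, nb.predict(X_test)))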



1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and boosting. (7 marks)

Random Forest:

Train Data:

Test Data:

 Accuracy on Train data: 1.00


 Accuracy on Test data: 0.68
 Recall on Train Data: 1.00
 Recall on Test Data:0.89
 The model performs poorly on the test data.

Bagging:



Train Data:

Test Data:

 Accuracy on Train data: 1.00


 Accuracy on Test data: 0.64
 Recall on Train Data: 1.00
 Recall on Test Data:0.83
 The model performs poorly on the test data.
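A sketch of the Random Forest model and of bagging with a Random Forest base estimator, as asked in 1.6 (hyper-parameter values are illustrative):

from sklearn.ensemble import RandomForestClassifier, BaggingClassifier

# Random Forest on its own
rf = RandomForestClassifier(n_estimators=100, random_state=1)
rf.fit(X_train, y_train)

# Bagging with a Random Forest base estimator
# (the parameter is named base_estimator in older scikit-learn versions)
bag = BaggingClassifier(estimator=RandomForestClassifier(random_state=1),
                        n_estimators=50, random_state=1)
bag.fit(X_train, y_train)

print("RF  train/test accuracy:", rf.score(X_train, y_train), rf.score(X_test, y_test))
print("Bag train/test accuracy:", bag.score(X_train, y_train), bag.score(X_test, y_test))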

Ada Boost:



Train Data:

Test Data:

 Accuracy on Train data: 0.69


 Accuracy on Test data: 0.71
 Recall on Train Data: 0.99
 Recall on Test Data:0.99
 The model performs well on both the train and test data.
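A sketch of the AdaBoost classifier (n_estimators is illustrative):

from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(n_estimators=100, random_state=1)
ada.fit(X_train, y_train)
print("Train accuracy:", ada.score(X_train, y_train))
print("Test accuracy :", ada.score(X_test, y_test))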

Gradient Boosting:



Train Data:

Test Data:

 Accuracy on Train data: 0.69


 Accuracy on Test data: 0.70
 Recall on Train Data: 0.99
 Recall on Test Data:0.97
 The model performs well on both the train and test data.
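A sketch of the gradient boosting model, with an illustrative grid search for the tuned variant discussed later:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [50, 100], "learning_rate": [0.05, 0.1]}  # illustrative grid
grid = GridSearchCV(GradientBoostingClassifier(random_state=1),
                    param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

gb = grid.best_estimator_
print("Train accuracy:", gb.score(X_train, y_train))
print("Test accuracy :", gb.score(X_test, y_test))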

1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare the models and write inference which model is best/optimized. (7 marks)

To check the performance of the predictions of every model built on the train and test datasets, the accuracy score is calculated. A confusion matrix, ROC curve and ROC-AUC score are produced for each model as well.
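A sketch of how these metrics can be produced for every fitted model; the model variables below refer to the estimators from the sketches in the earlier sections:

import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             roc_auc_score, RocCurveDisplay)

models = [("Logistic Regression", lr), ("LDA", lda), ("KNN", knn), ("Naive Bayes", nb),
          ("Random Forest", rf), ("Bagging", bag), ("AdaBoost", ada), ("Gradient Boosting", gb)]

for name, model in models:
    pred = model.predict(X_test)
    prob = model.predict_proba(X_test)[:, 1]       # probability of class 1
    print(name, "accuracy:", accuracy_score(y_test, pred))
    print(confusion_matrix(y_test, pred))
    print(name, "ROC-AUC:", roc_auc_score(y_test, prob))
    RocCurveDisplay.from_estimator(model, X_test, y_test)  # ROC curve plot
plt.show()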

Comparing Confusion matrix of all models (Train data):

       Accuracy Score   F1 Score
0.1       0.747          0.845
0.2       0.7863         0.8635
0.3       0.8107         0.8748
0.4       0.8229         0.8795
0.5       0.8276         0.8785
0.6       0.8229         0.8713
0.7       0.8107         0.8569
0.8       0.7573         0.8036
0.9       0.672          0.7049

(The corresponding confusion matrices are shown in the report figures.)

Comparing Confusion matrix of all models (Test data)



Comparison of Different Models:

Let's look at the performance of all the models on the train data set. Recall refers to the percentage of total relevant results correctly classified by the algorithm, so we compare the recall of class "1" for all models.

As per the train data, the worst-performing model is Linear Discriminant Analysis, and the best-performing models are Random Forest, Bagging and Boosting. However, are these best-performing models overfitted?

Let's look at the performance on the test data set:

The models that did not perform well on the train data set have also not performed well on the test data set. However, Decision Tree, Random Forest and Bagging, which had a 100% score on the train data set, show poor results on the test data set: a clear case of overfitting.

So, we will select the models that performed approximately the same on the train and test data sets, i.e., Naive Bayes and Ada Boost, and apply SMOTE on them to check whether the performance improves.
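A sketch of applying SMOTE before refitting, using the imbalanced-learn package (shown here for Naive Bayes; the same resampled data can be fed to AdaBoost):

from imblearn.over_sampling import SMOTE
from sklearn.naive_bayes import GaussianNB

# Oversample the minority class on the training data only
X_train_sm, y_train_sm = SMOTE(random_state=1).fit_resample(X_train, y_train)

nb_smote = GaussianNB().fit(X_train_sm, y_train_sm)
print("Test accuracy after SMOTE:", nb_smote.score(X_test, y_test))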

Naive Bayes with SMOTE:



Random Forest with SMOTE:

Different Model Parameters:

Model Name            Accuracy        Recall          Precision       F1 score        AUC
                      Train   Test    Train   Test    Train   Test    Train   Test    Train   Test
Logistic Regression     -     0.82      -     0.89      -     0.87      -     0.88      -       -
LDA Model             0.83    0.85    0.89    0.91    0.86    0.88    0.88    0.89      -       -
KNN Model             0.70    0.71    0.97    0.96    0.71    0.73    0.82    0.83      -       -
NB Model              0.69    0.72    1.00    1.00    0.69    0.72    0.82    0.83      -       -
Tuned RF Model        1.00    0.67    1.00    0.88    1.00    0.72    1.00    0.79      -       -
Tuned Bagging Model   1.00    0.64    1.00    0.83    1.00    0.71    1.00    0.77      -       -
Ada Boost Model       0.69    0.71    0.99    0.99    0.69    0.72    0.82    0.83      -       -
Gradient Model        0.69    0.70    0.99    0.97    0.69    0.71    0.82    0.82    0.877   0.916

('-' indicates a value not reported.)

 From all the inferences above, we see that most of the models have broadly similar performance.
 The accuracy score for all the models is above 64% on both the train and test data.

Conclusion:
• There is no under-fitting or over-fitting in any of the tuned models.

Inference: 5 marks

1.8 Based on these predictions, what are the insights? (5 marks)



• The Labour party has more than double the votes of the Conservative party.
• Most people gave a score of 3 or 4 for the national economic condition; the average score is 3.245.
• Most people gave a score of 3 or 4 for the household economic condition; the average score is 3.138.
• Blair has a higher number of votes than Hague, and the scores are much better for Blair than for Hague.
• The average score for Blair is 3.336 and the average score for Hague is 2.750, so Blair has the better score.
• On a scale of 0 to 3, about 30% of the total population has zero knowledge about politics/parties.
• People who gave a low score of 1 to a certain party still decided to vote for that same party instead of the other. This may be due to a lack of political knowledge among these voters.
• People with higher Eurosceptic sentiment voted for the Conservative party; the lower the Eurosceptic sentiment, the higher the votes for the Labour party.
• Out of 454 people who gave a score of 0 for political knowledge, 360 voted for the Labour party and 94 voted for the Conservative party.
• All models performed well on the training data set as well as the test data set. The tuned models performed better than the regular models.
• There is no over-fitting in any model except the regular Random Forest and Bagging models.

 The tuned Gradient Boosting model is the best/optimized model.

Business recommendations:
• Hyper-parameter tuning is an important aspect of model building. There are limitations, since processing many parameter combinations requires a large amount of processing power, but if tuning can be done over many sets of parameters, we may get even better results.
• Gathering more data will also help in training the models and thus improve their predictive power.
• We can also create a function in which all the models predict the outcome in sequence. This will help in better understanding the likely outcome and its probability.
1) Comparing all the performance measures, the Naïve Bayes model from the second iteration performs best. Although some other models, such as SVM and extreme boosting, perform almost the same as Naïve Bayes, the Naïve Bayes model is very consistent when the train and test results are compared with each other. The other parameters, such as the recall value, AUC score and AUC-ROC curve, were also quite good for this model.

2) The Labour party is outperforming the Conservative party by a huge margin.

3) Female voter turnout is greater than male voter turnout.

4) Those who rate the national economic conditions better prefer to vote for the Labour party.

5) Persons with higher Eurosceptic sentiment prefer to vote for the Conservative party.

6) Those with higher political knowledge have voted for the Conservative party.

7) Looking at the assessments of both leaders, the Labour leader is performing well, as he has received better ratings.

Problem 2:
In this project, we work on the inaugural corpora from nltk in Python. We will be looking at the following speeches of Presidents of the United States of America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973
(Hint: use .words(), .raw(), .sents() for extracting counts)
Code snippet to extract the three speeches:

import nltk
nltk.download('inaugural')
from nltk.corpus import inaugural
inaugural.fileids()
inaugural.raw('1941-Roosevelt.txt')
inaugural.raw('1961-Kennedy.txt')
inaugural.raw('1973-Nixon.txt')



2.1) Find the number of characters, words and sentences for the mentioned documents. (Hint: use .words(), .raw(), .sents() for extracting counts)

Number of characters:

 President Franklin D. Roosevelt's speech has 7571 characters (including spaces).
 President John F. Kennedy's speech has 7618 characters (including spaces).
 President Richard Nixon's speech has 9991 characters (including spaces).

Number of words:
Number of words in Roosevelt file: 1360
Number of words in Kennedy file: 1390
Number of words in Nixon file: 1819

 There are 1360 words in President Franklin D. Roosevelt's speech.


 There are 1390 words in President John F. Kennedy's speech.
 There are 1819 words in President Richard Nixon's speech.

Number of sentences:

 There are 67 sentences in President Franklin D. Roosevelt's speech.
 There are 52 sentences in President John F. Kennedy's speech.
 There are 68 sentences in President Richard Nixon's speech.

Counting by reading each raw file and splitting it into a list of lines gives:
 Number of sentences in Roosevelt file: 38
 Number of sentences in Kennedy file: 27
 Number of sentences in Nixon file: 51
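A sketch of how these counts are obtained with the corpus readers from the code snippet above:

import nltk
from nltk.corpus import inaugural
nltk.download('inaugural')

for fid in ['1941-Roosevelt.txt', '1961-Kennedy.txt', '1973-Nixon.txt']:
    print(fid,
          "characters:", len(inaugural.raw(fid)),
          "words:", len(inaugural.words(fid)),
          "sentences:", len(inaugural.sents(fid)))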

2.2) Remove all the stopwords from the three speeches. Show the word count before and after the removal of
stopwords. Show a sample sentence after the removal of stopwords.

 Before removing the stop-words, all letters were converted to lowercase and special characters were removed.

Word count before the removal of stop-words:

• President Franklin D. Roosevelt's speech has 1334 words.
• President John F. Kennedy's speech has 1362 words.
• President Richard Nixon's speech has 1800 words.

Word count after the removal of stop-words:

• President Franklin D. Roosevelt's speech has 623 words.
• President John F. Kennedy's speech has 693 words.
• President Richard Nixon's speech has 831 words.
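A sketch of the cleaning and stop-word removal described above (the clean_words helper is illustrative):

import re
import nltk
from nltk.corpus import inaugural, stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def clean_words(fileid):
    # lowercase, strip special characters, then drop stop-words
    text = inaugural.raw(fileid).lower()
    tokens = re.sub(r'[^a-z\s]', ' ', text).split()
    return [w for w in tokens if w not in stop_words]

cleaned = clean_words('1941-Roosevelt.txt')
print("Word count after stop-word removal:", len(cleaned))
print("Sample:", " ".join(cleaned[:15]))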

2.3) Which word occurs the most number of times in his inaugural address for each president? Mention the top three
words. (after removing the stopwords)

Top 3 words in Roosevelt's speech:
• nation - 11
• know - 10
• spirit - 9

Top 3 words in Kennedy's speech:
• let - 16
• us - 12
• sides - 8

Top 3 words in Nixon's speech:
• us - 26
• let - 22
• peace - 19
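A sketch of extracting the top three words per speech with nltk's FreqDist, reusing the clean_words helper from the previous sketch:

from nltk import FreqDist

for fid in ['1941-Roosevelt.txt', '1961-Kennedy.txt', '1973-Nixon.txt']:
    freq = FreqDist(clean_words(fid))
    print(fid, freq.most_common(3))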



Top 10 most frequently occurring words:



2.4) Plot the word cloud of each of the three speeches. (after removing the stopwords)

Word cloud of Roosevelt's speech:

Word cloud of Kennedy's speech:



Word cloud of Nixon's speech:
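A sketch of generating such a word cloud with the wordcloud package, again reusing the clean_words helper (shown for Nixon's speech):

import matplotlib.pyplot as plt
from wordcloud import WordCloud

text = " ".join(clean_words('1973-Nixon.txt'))   # stop-word-free tokens
wc = WordCloud(background_color="white").generate(text)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()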

