Kindle review Sentiment Analysis.ipynb
About Dataset

Context: This is a small subset of the dataset of book reviews from the Amazon Kindle Store category.

Content: 5-core dataset of product reviews from the Amazon Kindle Store category, May 1996 - July 2014, containing 982,619 entries in total. Each reviewer has at least 5 reviews and each product has at least 5 reviews in this dataset.

Columns:
- asin - ID of the product, like B000FA64PK
- helpful - helpfulness rating of the review, example: 2/3
- overall - rating of the product
- reviewText - text of the review (heading)
- reviewTime - time of the review (raw)
- reviewerID - ID of the reviewer, like A3SPTOKDG7WBLN
- reviewerName - name of the reviewer
- summary - summary of the review (description)
- unixReviewTime - unix timestamp

Acknowledgements: This dataset is taken from Amazon product data, Julian McAuley, UCSD website: http://jmcauley.ucsd.edu/data/amazon/. The license to the data files belongs to them.

Inspiration:
- Sentiment analysis on reviews.
- Understanding how people rate the usefulness of a review / what factors influence the helpfulness of a review (a parsing sketch follows this list).
- Fake reviews / outliers.
- Best-rated product IDs, or similarity between products based on reviews alone (not the best idea, I know).
- Any other interesting analysis.
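The helpful field arrives as a pair such as [8, 10], meaning 8 of 10 readers found the review helpful. As a small aside, and assuming the raw CSV stores that pair as a string, a minimal sketch for turning it into a helpfulness ratio (the notebook below only uses reviewText and rating):

import ast
import pandas as pd

# Hypothetical aside, not used later in the notebook: parse the "[8, 10]" strings
# into a helpfulness ratio in [0, 1], guarding against zero total votes.
raw = pd.read_csv("/content/all_kindle_review.csv")
votes = raw['helpful'].apply(ast.literal_eval)
raw['helpful_ratio'] = votes.apply(lambda v: v[0] / v[1] if v[1] else None)
print(raw[['helpful', 'helpful_ratio']].head())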
import pandas as pd
df = pd.read_csv("/content/all_kindle_review.csv")
df.head(5)
   Unnamed: 0.1  Unnamed: 0        asin  helpful  rating                                          reviewText  reviewTime      reviewerID  reviewerName                   summary  unixReviewTime
0             0       11539  B0033UV8HI  [8, 10]       3  Jace Rankin may be short, but he's nothing to ...  09 2, 2010  A3HHXRELK8BHQG        Ridley  Entertaining But Average        12833...
1             1        5957  B002HJV4DE   [1, 1]       5   Great short read. I didn't want to put it dow...  10 8, 2013  A2RGNZ0TRF578I  Holly Butler   Terrific menage scenes!        13811...
df=df[['reviewText','rating']]
df.head(5)
reviewText rating
0 Jace Rankin may be short, but he's nothing to ... 3
1 Great short read. I didn't want to put it dow... 5
2 I'll start by saying this is the first of four... 3
3 Aggie is Angela Lansbury who carries pocketboo... 3
4 I did not expect this type of book to be in li... 4
df.shape
(12000, 2)
## checking the missing values
df.isnull().sum()
reviewText 0
rating 0
dtype: int64
df['rating'].unique()
array([3, 5, 4, 2, 1])
df['rating'].value_counts()
rating
5    3000
4    3000
3    2000
2    2000
1    2000
Name: count, dtype: int64
## Binarize the label: ratings below 3 -> 0 (negative), ratings 3 and above -> 1 (positive)
df['rating']=df['rating'].apply(lambda x:0 if x<3 else 1)
df['rating'].value_counts()
rating
1    8000
0    4000
Name: count, dtype: int64
## Lowercase all the review text
df['reviewText']=df['reviewText'].str.lower()
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('wordnet')
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Package wordnet is already up-to-date!
True
from bs4 import BeautifulSoup
## Remove html tags first, while the tags are still intact
df['reviewText']=df['reviewText'].apply(lambda x: BeautifulSoup(x, 'lxml').get_text())
## Remove urls
df['reviewText']=df['reviewText'].apply(lambda x: re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '', x))
## Remove special characters (keep lowercase letters, digits, spaces and hyphens)
df['reviewText']=df['reviewText'].apply(lambda x: re.sub('[^a-z0-9 -]+', '', x))
## Remove the stopwords
df['reviewText']=df['reviewText'].apply(lambda x: " ".join([y for y in x.split() if y not in stopwords.words('english')]))
## Remove any additional spaces
df['reviewText']=df['reviewText'].apply(lambda x: " ".join(x.split()))
df.head(5)
reviewText rating
0 jace rankin may short hes nothing mess man hau... 1
1 great short read didnt want put read one sitti... 1
2 ill start saying first four books wasnt expect... 1
3 aggie angela lansbury carries pocketbooks inst... 1
4 expect type book library pleased find price right 1
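For reuse on unseen text at prediction time, the same cleaning steps can be collected into one helper. A minimal sketch mirroring the cells above (the helper name is illustrative; lemmatization, applied in the next cell, could be appended the same way):

# Hypothetical helper that mirrors the cleaning cells above, so the same
# preprocessing can be applied to new reviews before scoring them.
def clean_review(text):
    text = text.lower()
    text = BeautifulSoup(text, 'lxml').get_text()                   # strip html tags
    text = re.sub(r'(http|https|ftp|ssh)://\S+', '', text)          # strip urls
    text = re.sub('[^a-z0-9 -]+', '', text)                         # strip special characters
    text = " ".join(w for w in text.split() if w not in stopwords.words('english'))
    return " ".join(text.split())                                   # collapse extra spaces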
## lemmatizer
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
def lemmatize_words(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])
df['reviewText']=df['reviewText'].apply(lambda x:lemmatize_words(x))
df.head()
reviewText rating
0 jace rankin may short he nothing mess man haul... 1
1 great short read didnt want put read one sitti... 1
2 ill start saying first four book wasnt expecti... 1
3 aggie angela lansbury carry pocketbook instead... 1
4 expect type book library pleased find price right 1
#Train Test split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(df['reviewText'],df['rating'],test_size=0.20)
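Since the binarized labels are imbalanced (8,000 positive vs 4,000 negative), a stratified split keeps that 2:1 ratio in both halves. A hedged variant of the cell above:

# Optional variant, not the split used for the results below: stratify on the label
# and fix the random seed so the class ratio and the split are reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    df['reviewText'], df['rating'],
    test_size=0.20, stratify=df['rating'], random_state=42
)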
# BOW - Bag of Words
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer()
X_train_bow = bow.fit_transform(X_train).toarray()
X_test_bow = bow.transform(X_test).toarray()
#TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train).toarray()
X_test_tfidf= tfidf.transform(X_test).toarray()
X_train_bow
array([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]])
from sklearn.naive_bayes import GaussianNB
nb_model_bow = GaussianNB().fit(X_train_bow,y_train)
nb_model_tfidf = GaussianNB().fit(X_train_tfidf,y_train)
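GaussianNB expects dense input, which is why .toarray() was used on the vectorizer outputs above. For word-count and TF-IDF features, MultinomialNB is a common alternative that accepts the sparse matrices directly; a sketch of that variant (not used for the results below):

# Alternative, not part of the original notebook: MultinomialNB works on the sparse
# output of CountVectorizer/TfidfVectorizer, avoiding the dense conversion.
from sklearn.naive_bayes import MultinomialNB

mnb_bow = MultinomialNB().fit(bow.transform(X_train), y_train)
mnb_tfidf = MultinomialNB().fit(tfidf.transform(X_train), y_train)
print("MultinomialNB BOW accuracy   :", mnb_bow.score(bow.transform(X_test), y_test))
print("MultinomialNB TF-IDF accuracy:", mnb_tfidf.score(tfidf.transform(X_test), y_test))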
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
y_pred_bow = nb_model_bow.predict(X_test_bow)
y_pred_tfidf = nb_model_tfidf.predict(X_test_tfidf)
print("BOW Accuracy Score : ",accuracy_score(y_test,y_pred_bow))
BOW Accuracy Score : 0.5833333333333334
print(confusion_matrix(y_test,y_pred_bow))
[[495 268]
[732 905]]
print(classification_report(y_test,y_pred_bow))
precision recall f1-score support
0 0.40 0.65 0.50 763
1 0.77 0.55 0.64 1637
accuracy 0.58 2400
macro avg 0.59 0.60 0.57 2400
weighted avg 0.65 0.58 0.60 2400
print("TF-IDF Accuracy Score : ",accuracy_score(y_test,y_pred_tfidf))
TF-IDF Accuracy Score : 0.5879166666666666
print(confusion_matrix(y_test,y_pred_tfidf))
[[487 276]
[713 924]]
print(classification_report(y_test,y_pred_tfidf))
precision recall f1-score support
0 0.41 0.64 0.50 763
1 0.77 0.56 0.65 1637
accuracy 0.59 2400
macro avg 0.59 0.60 0.57 2400
weighted avg 0.65 0.59 0.60 2400
import gensim
from gensim.models import Word2Vec, KeyedVectors
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')
vec_king = wv['king']
[==================================================] 100.0% 1662.8/1662.8MB downloaded
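A quick sanity check on the pretrained embeddings: each word maps to a 300-dimensional vector, and KeyedVectors can list the nearest neighbours of a word by cosine similarity.

# Sanity check on the pretrained word2vec-google-news-300 vectors.
print(vec_king.shape)                     # (300,)
print(wv.most_similar('king', topn=3))    # closest words by cosine similarity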
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Function to convert a review into its average Word2Vec embedding
def get_average_word2vec(tokens, model, vector_size=300):
    vectors = []
    for token in tokens:
        if token in model:
            vectors.append(model[token])
    if len(vectors) == 0:
        # If no words in the review are in the Word2Vec model, return a zero vector
        return np.zeros(vector_size)
    return np.mean(vectors, axis=0)
# Tokenizing the reviews
X_train_tokens = [review.split() for review in X_train]
X_test_tokens = [review.split() for review in X_test]
# Convert the reviews into word vectors by averaging word embeddings
X_train_vec = np.array([get_average_word2vec(tokens, wv) for tokens in X_train_tokens])
X_test_vec = np.array([get_average_word2vec(tokens, wv) for tokens in X_test_tokens])
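A quick check that the averaging produced one fixed-length vector per review; with the 80/20 split of 12,000 rows this should be 9,600 training and 2,400 test vectors.

# Each review is now a single 300-dimensional averaged embedding.
print(X_train_vec.shape)   # expected: (9600, 300)
print(X_test_vec.shape)    # expected: (2400, 300)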
# Initialize the RandomForestClassifier
rf_model_word2vec = RandomForestClassifier(n_estimators=100, random_state=42)
# Train the Random Forest model
rf_model_word2vec.fit(X_train_vec, y_train)
RandomForestClassifier(random_state=42)
# Make predictions on the test data
y_pred_word2vec_rf = rf_model_word2vec.predict(X_test_vec)
# Evaluate the model
print("Random Forest Word2Vec Accuracy Score: ", accuracy_score(y_test, y_pred_word2vec_rf))
print(confusion_matrix(y_test, y_pred_word2vec_rf))
print(classification_report(y_test, y_pred_word2vec_rf))
Random Forest Word2Vec Accuracy Score: 0.7808333333333334
[[ 346 417]
[ 109 1528]]
precision recall f1-score support
0 0.76 0.45 0.57 763
1 0.79 0.93 0.85 1637
accuracy 0.78 2400
macro avg 0.77 0.69 0.71 2400
weighted avg 0.78 0.78 0.76 2400
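To close, a small summary cell that restates the three test accuracies reported above side by side (values copied from the outputs; a fresh run will differ slightly because the train/test split is not seeded):

# Accuracies as reported in the cells above.
results = {
    "GaussianNB + Bag of Words": 0.5833,
    "GaussianNB + TF-IDF": 0.5879,
    "Random Forest + averaged Word2Vec": 0.7808,
}
for name, acc in results.items():
    print(f"{name:35s} {acc:.4f}")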