Kindle review Sentiment Analysis.ipynb
About Dataset

Context: This is a small subset of the dataset of book reviews from the Amazon Kindle Store category.

Content: 5-core dataset of product reviews from the Amazon Kindle Store category, May 1996 - July 2014, containing 982,619 entries in total. Each reviewer has at least 5 reviews and each product has at least 5 reviews in this dataset.

Columns:
- asin - ID of the product, like B000FA64PK
- helpful - helpfulness rating of the review, example: 2/3
- overall - rating of the product
- reviewText - text of the review (heading)
- reviewTime - time of the review (raw)
- reviewerID - ID of the reviewer, like A3SPTOKDG7WBLN
- reviewerName - name of the reviewer
- summary - summary of the review (description)
- unixReviewTime - unix timestamp

Acknowledgements: This dataset is taken from Amazon product data, Julian McAuley, UCSD website: http://jmcauley.ucsd.edu/data/amazon/. The license to the data files belongs to them.

Inspiration:
- Sentiment analysis on reviews.
- Understanding how people rate the usefulness of a review / what factors influence the helpfulness of a review (a parsing sketch follows this list).
- Fake reviews / outliers.
- Best-rated product IDs, or similarity between products based on reviews alone (not the best idea, I know).
- Any other interesting analysis.
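The helpful field arrives as a pair such as [8, 10], meaning 8 of 10 readers found the review helpful. As a small aside, and assuming the raw CSV stores that pair as a string, a minimal sketch for turning it into a helpfulness ratio (the notebook below only uses reviewText and rating):

import ast
import pandas as pd

# Hypothetical aside, not used later in the notebook: parse the "[8, 10]" strings
# into a helpfulness ratio in [0, 1], guarding against zero total votes.
raw = pd.read_csv("/content/all_kindle_review.csv")
votes = raw['helpful'].apply(ast.literal_eval)
raw['helpful_ratio'] = votes.apply(lambda v: v[0] / v[1] if v[1] else None)
print(raw[['helpful', 'helpful_ratio']].head())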
import pandas as pd
df = pd.read_csv("/content/all_kindle_review.csv")
df.head(5)
   Unnamed: 0.1  Unnamed: 0        asin  helpful  rating                                          reviewText  reviewTime      reviewerID  reviewerName                   summary  unixReviewTime
0             0       11539  B0033UV8HI  [8, 10]       3  Jace Rankin may be short, but he's nothing to ...  09 2, 2010  A3HHXRELK8BHQG        Ridley  Entertaining But Average        12833...
1             1        5957  B002HJV4DE   [1, 1]       5   Great short read. I didn't want to put it dow...  10 8, 2013  A2RGNZ0TRF578I  Holly Butler   Terrific menage scenes!        13811...
df=df[['reviewText','rating']]
df.head(5)
reviewText rating
0 Jace Rankin may be short, but he's nothing to ... 3
1 Great short read. I didn't want to put it dow... 5
2 I'll start by saying this is the first of four... 3
3 Aggie is Angela Lansbury who carries pocketboo... 3
4 I did not expect this type of book to be in li... 4
df.shape
(12000, 2)
## checking the missing values
df.isnull().sum()
reviewText 0
rating 0
dtype: int64
df['rating'].unique()
array([3, 5, 4, 2, 1])
df['rating'].value_counts()
rating
5    3000
4    3000
3    2000
2    2000
1    2000
Name: count, dtype: int64
## Binarize the label: ratings below 3 -> 0 (negative), ratings 3 and above -> 1 (positive)
df['rating']=df['rating'].apply(lambda x:0 if x<3 else 1)
df['rating'].value_counts()
rating
1    8000
0    4000
Name: count, dtype: int64
## Lowercase all the review text
df['reviewText']=df['reviewText'].str.lower()
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('wordnet')
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Package wordnet is already up-to-date!
True
from bs4 import BeautifulSoup
## Remove html tags first, while the tags are still intact
df['reviewText']=df['reviewText'].apply(lambda x: BeautifulSoup(x, 'lxml').get_text())
## Remove urls
df['reviewText']=df['reviewText'].apply(lambda x: re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '', x))
## Remove special characters (keep lowercase letters, digits, spaces and hyphens)
df['reviewText']=df['reviewText'].apply(lambda x: re.sub('[^a-z0-9 -]+', '', x))
## Remove the stopwords
df['reviewText']=df['reviewText'].apply(lambda x: " ".join([y for y in x.split() if y not in stopwords.words('english')]))
## Remove any additional spaces
df['reviewText']=df['reviewText'].apply(lambda x: " ".join(x.split()))
df.head(5)
reviewText rating
0 jace rankin may short hes nothing mess man hau... 1
1 great short read didnt want put read one sitti... 1
2 ill start saying first four books wasnt expect... 1
3 aggie angela lansbury carries pocketbooks inst... 1
4 expect type book library pleased find price right 1
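For reuse on unseen text at prediction time, the same cleaning steps can be collected into one helper. A minimal sketch mirroring the cells above (the helper name is illustrative; lemmatization, applied in the next cell, could be appended the same way):

# Hypothetical helper that mirrors the cleaning cells above, so the same
# preprocessing can be applied to new reviews before scoring them.
def clean_review(text):
    text = text.lower()
    text = BeautifulSoup(text, 'lxml').get_text()                   # strip html tags
    text = re.sub(r'(http|https|ftp|ssh)://\S+', '', text)          # strip urls
    text = re.sub('[^a-z0-9 -]+', '', text)                         # strip special characters
    text = " ".join(w for w in text.split() if w not in stopwords.words('english'))
    return " ".join(text.split())                                   # collapse extra spaces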
## lemmatizer
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
def lemmatize_words(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])
df['reviewText']=df['reviewText'].apply(lambda x:lemmatize_words(x))
df.head()
reviewText rating
0 jace rankin may short he nothing mess man haul... 1
1 great short read didnt want put read one sitti... 1
2 ill start saying first four book wasnt expecti... 1
3 aggie angela lansbury carry pocketbook instead... 1
4 expect type book library pleased find price right 1
#Train Test split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(df['reviewText'],df['rating'],test_size=0.20)
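Since the binarized labels are imbalanced (8,000 positive vs 4,000 negative), a stratified split keeps that 2:1 ratio in both halves. A hedged variant of the cell above:

# Optional variant, not the split used for the results below: stratify on the label
# and fix the random seed so the class ratio and the split are reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    df['reviewText'], df['rating'],
    test_size=0.20, stratify=df['rating'], random_state=42
)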
# BOW - Bag of Words
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer()
X_train_bow = bow.fit_transform(X_train).toarray()
X_test_bow = bow.transform(X_test).toarray()
#TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train).toarray()
X_test_tfidf= tfidf.transform(X_test).toarray()
X_train_bow
array([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]])
from sklearn.naive_bayes import GaussianNB
nb_model_bow = GaussianNB().fit(X_train_bow,y_train)
nb_model_tfidf = GaussianNB().fit(X_train_tfidf,y_train)
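GaussianNB expects dense input, which is why .toarray() was used on the vectorizer outputs above. For word-count and TF-IDF features, MultinomialNB is a common alternative that accepts the sparse matrices directly; a sketch of that variant (not used for the results below):

# Alternative, not part of the original notebook: MultinomialNB works on the sparse
# output of CountVectorizer/TfidfVectorizer, avoiding the dense conversion.
from sklearn.naive_bayes import MultinomialNB

mnb_bow = MultinomialNB().fit(bow.transform(X_train), y_train)
mnb_tfidf = MultinomialNB().fit(tfidf.transform(X_train), y_train)
print("MultinomialNB BOW accuracy   :", mnb_bow.score(bow.transform(X_test), y_test))
print("MultinomialNB TF-IDF accuracy:", mnb_tfidf.score(tfidf.transform(X_test), y_test))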
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
y_pred_bow = nb_model_bow.predict(X_test_bow)
y_pred_tfidf = nb_model_tfidf.predict(X_test_tfidf)
print("BOW Accuracy Score : ",accuracy_score(y_test,y_pred_bow))
BOW Accuracy Score : 0.5833333333333334
print(confusion_matrix(y_test,y_pred_bow))
[[495 268]
[732 905]]
print(classification_report(y_test,y_pred_bow))
precision recall f1-score support
0 0.40 0.65 0.50 763
1 0.77 0.55 0.64 1637
accuracy 0.58 2400
macro avg 0.59 0.60 0.57 2400
weighted avg 0.65 0.58 0.60 2400
print("TF-IDF Accuracy Score : ",accuracy_score(y_test,y_pred_tfidf))
TF-IDF Accuracy Score : 0.5879166666666666
print(confusion_matrix(y_test,y_pred_tfidf))
[[487 276]
[713 924]]
print(classification_report(y_test,y_pred_tfidf))
precision recall f1-score support
0 0.41 0.64 0.50 763
1 0.77 0.56 0.65 1637
accuracy 0.59 2400
macro avg 0.59 0.60 0.57 2400
weighted avg 0.65 0.59 0.60 2400
import gensim
from gensim.models import Word2Vec, KeyedVectors
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')
vec_king = wv['king']
[==================================================] 100.0% 1662.8/1662.8MB downloaded
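A quick sanity check on the pretrained embeddings: each word maps to a 300-dimensional vector, and KeyedVectors can list the nearest neighbours of a word by cosine similarity.

# Sanity check on the pretrained word2vec-google-news-300 vectors.
print(vec_king.shape)                     # (300,)
print(wv.most_similar('king', topn=3))    # closest words by cosine similarity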
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Function to convert a review into its average Word2Vec embedding
def get_average_word2vec(tokens, model, vector_size=300):
    vectors = []
    for token in tokens:
        if token in model:
            vectors.append(model[token])
    if len(vectors) == 0:
        # If no words in the review are in the Word2Vec model, return a zero vector
        return np.zeros(vector_size)
    return np.mean(vectors, axis=0)
# Tokenizing the reviews
X_train_tokens = [review.split() for review in X_train]
X_test_tokens = [review.split() for review in X_test]
# Convert the reviews into word vectors by averaging word embeddings
X_train_vec = np.array([get_average_word2vec(tokens, wv) for tokens in X_train_tokens])
X_test_vec = np.array([get_average_word2vec(tokens, wv) for tokens in X_test_tokens])
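A quick check that the averaging produced one fixed-length vector per review; with the 80/20 split of 12,000 rows this should be 9,600 training and 2,400 test vectors.

# Each review is now a single 300-dimensional averaged embedding.
print(X_train_vec.shape)   # expected: (9600, 300)
print(X_test_vec.shape)    # expected: (2400, 300)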
# Initialize the RandomForestClassifier
rf_model_word2vec = RandomForestClassifier(n_estimators=100, random_state=42)
# Train the Random Forest model
rf_model_word2vec.fit(X_train_vec, y_train)
RandomForestClassifier(random_state=42)
# Make predictions on the test data
y_pred_word2vec_rf = rf_model_word2vec.predict(X_test_vec)
# Evaluate the model
print("Random Forest Word2Vec Accuracy Score: ", accuracy_score(y_test, y_pred_word2vec_rf))
print(confusion_matrix(y_test, y_pred_word2vec_rf))
print(classification_report(y_test, y_pred_word2vec_rf))
Random Forest Word2Vec Accuracy Score: 0.7808333333333334
[[ 346 417]
[ 109 1528]]
precision recall f1-score support
0 0.76 0.45 0.57 763
1 0.79 0.93 0.85 1637
accuracy 0.78 2400
macro avg 0.77 0.69 0.71 2400
weighted avg 0.78 0.78 0.76 2400
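To close, a small summary cell that restates the three test accuracies reported above side by side (values copied from the outputs; a fresh run will differ slightly because the train/test split is not seeded):

# Accuracies as reported in the cells above.
results = {
    "GaussianNB + Bag of Words": 0.5833,
    "GaussianNB + TF-IDF": 0.5879,
    "Random Forest + averaged Word2Vec": 0.7808,
}
for name, acc in results.items():
    print(f"{name:35s} {acc:.4f}")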