0% found this document useful (0 votes)

5 views54 pages

BERT Driven Sentiment Classification With PyTorch

The document outlines a data analysis workflow using Python libraries to process and visualize app reviews from a CSV file. It includes steps for reading data, cleaning text, categorizing sentiment, and generating visualizations such as histograms and word clouds for negative and positive sentiments. The analysis reveals an imbalanced distribution of review scores and provides insights into user sentiments towards the app.

Uploaded by

ahmedprof843

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

5 views54 pages

BERT Driven Sentiment Classification With PyTorch

Uploaded by

ahmedprof843

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 54

ymt5vpxof

December 18, 2024

[1]: from sklearn.metrics import roc_curve, roc_auc_score

from sklearn.preprocessing import label_binarize
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import nltk
from wordcloud import WordCloud, STOPWORDS
import re
import string
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize
from torch.utils.data import Dataset, DataLoader
from nltk.corpus import stopwords
from collections import defaultdict
import torch
from tqdm import tqdm
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('stopwords')
from PIL import Image
sns.set(style='whitegrid', palette='muted', font_scale=1.2)

HAPPY_COLORS_PALETTE = ["#01BEFE", "#FFDD00", "#FF7D00", "#FF006D", "#ADFF02",␣

↪"#8F00FF"]

sns.set_palette(sns.color_palette(HAPPY_COLORS_PALETTE))
import warnings

warnings.filterwarnings('ignore', category=UserWarning, module='transformers')

%matplotlib inline

[nltk_data] Downloading package wordnet to /usr/share/nltk_data…

[nltk_data] Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /usr/share/nltk_data…

1
[nltk_data] Downloading package stopwords to /usr/share/nltk_data…
[nltk_data] Package stopwords is already up-to-date!

[2]: device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

device

[2]: device(type='cuda', index=0)

1 Read Data
[3]: df=pd.read_csv("/kaggle/input/apps-reviews/apps_reviews.csv")

[4]: wordcloud_mask=np.array(Image.open("/kaggle/input/wordcloud-mask-collection/
↪twitter.png"))

[5]: df.head()

[5]: userName userImage \

0 Andrew Thomas https://lh3.googleusercontent.com/a-/AOh14GiHd…
1 Craig Haines https://lh3.googleusercontent.com/-hoe0kwSJgPQ…
2 steven adkins https://lh3.googleusercontent.com/a-/AOh14GiXw…
3 Lars Panzerbjørn https://lh3.googleusercontent.com/a-/AOh14Gg-h…
4 Scott Prewitt https://lh3.googleusercontent.com/-K-X1-YsVd6U…

content score thumbsUpCount \

0 Update: After getting a response from the deve… 1 21
1 Used it for a fair amount of time without any … 1 11
2 Your app sucks now!!!!! Used to be good but no… 1 17
3 It seems OK, but very basic. Recurring tasks n… 1 192
4 Absolutely worthless. This app runs a prohibit… 1 42

reviewCreatedVersion at \
0 4.17.0.3 2020-04-05 22:25:57
1 4.17.0.3 2020-04-04 13:40:01
2 4.17.0.3 2020-04-01 16:18:13
3 4.17.0.2 2020-03-12 08:17:34
4 4.17.0.2 2020-03-14 17:41:01

replyContent repliedAt \
0 According to our TOS, and the term you have ag… 2020-04-05 15:10:24
1 It sounds like you logged in with a different … 2020-04-05 15:11:35
2 This sounds odd! We are not aware of any issue… 2020-04-02 16:05:56
3 We do offer this option as part of the Advance… 2020-03-15 06:20:13
4 We're sorry you feel this way! 90% of the app … 2020-03-15 23:45:51

sortOrder appId

2
0 most_relevant com.anydo
1 most_relevant com.anydo
2 most_relevant com.anydo
3 most_relevant com.anydo
4 most_relevant com.anydo

2 Data Exploration
[6]: df.shape

[6]: (15746, 11)

[7]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15746 entries, 0 to 15745
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 userName 15746 non-null object
1 userImage 15746 non-null object
2 content 15746 non-null object
3 score 15746 non-null int64
4 thumbsUpCount 15746 non-null int64
5 reviewCreatedVersion 13533 non-null object
6 at 15746 non-null object
7 replyContent 7367 non-null object
8 repliedAt 7367 non-null object
9 sortOrder 15746 non-null object
10 appId 15746 non-null object
dtypes: int64(2), object(9)
memory usage: 1.3+ MB

[8]: df.isnull().sum()

[8]: userName 0
userImage 0
content 0
score 0
thumbsUpCount 0
reviewCreatedVersion 2213
at 0
replyContent 8379
repliedAt 8379
sortOrder 0
appId 0

3
dtype: int64

[9]: df.
↪drop(columns=["userName","userImage","thumbsUpCount","reviewCreatedVersion","at","repliedAt"

[10]: df.head()

[10]: content score \

0 Update: After getting a response from the deve… 1
1 Used it for a fair amount of time without any … 1
2 Your app sucks now!!!!! Used to be good but no… 1
3 It seems OK, but very basic. Recurring tasks n… 1
4 Absolutely worthless. This app runs a prohibit… 1

replyContent
0 According to our TOS, and the term you have ag…
1 It sounds like you logged in with a different …
2 This sounds odd! We are not aware of any issue…
3 We do offer this option as part of the Advance…
4 We're sorry you feel this way! 90% of the app …

[11]: df["text"]=df["content"]+" "+df["replyContent"]

[12]: df.head()

[12]: content score \

replyContent \
0 According to our TOS, and the term you have ag…
1 It sounds like you logged in with a different …
2 This sounds odd! We are not aware of any issue…
3 We do offer this option as part of the Advance…
4 We're sorry you feel this way! 90% of the app …

text
0 Update: After getting a response from the deve…
1 Used it for a fair amount of time without any …
2 Your app sucks now!!!!! Used to be good but no…
3 It seems OK, but very basic. Recurring tasks n…
4 Absolutely worthless. This app runs a prohibit…

[13]: df.drop(columns=["content","replyContent"],axis=1,inplace=True)

4
[14]: df.head()

[14]: score text

0 1 Update: After getting a response from the deve…
1 1 Used it for a fair amount of time without any …
2 1 Your app sucks now!!!!! Used to be good but no…
3 1 It seems OK, but very basic. Recurring tasks n…
4 1 Absolutely worthless. This app runs a prohibit…

[15]: df["text"][0]

[15]: "Update: After getting a response from the developer I would change my rating to
0 stars if possible. These guys hide behind confusing and opaque terms and
refuse to budge at all. I'm so annoyed that my money has been lost to them!
Really terrible customer experience. Original: Be very careful when signing up
for a free trial of this app. If you happen to go over they automatically charge
you for a full years subscription and refuse to refund. Terrible customer
experience and the app is just OK. According to our TOS, and the term you have
agreed to upon creating your free trial, after the 7 days, there is an automatic
charge for the plan, unless you cancel prior to the renewal date. As you did not
cancel the subscription in time, you were charged per this agreement."

[16]: df["score"].value_counts()

[16]: score
3 5042
5 2900
4 2776
1 2566
2 2462
Name: count, dtype: int64

[17]: plt.figure(figsize=(10,8))
sns.countplot(x="score",data=df)
plt.xlabel("Review Score")
plt.title("Review Score Count")
plt.show()

5
2.0.1 That’s hugely imbalanced, but it’s okay. We’re going to convert the dataset
into negative, neutral and positive sentiment:

3 Define a function to categorize the sentiment based on the score

[18]: def categorize_sentiment(score):
if score == 3:
return 'Neutral'
elif score in [1, 2]:
return 'Negative'
elif score in [4, 5]:
return 'Positive'

df['sentiment'] = df['score'].apply(categorize_sentiment)

[19]: df.head()

6
[19]: score text sentiment
0 1 Update: After getting a response from the deve… Negative
1 1 Used it for a fair amount of time without any … Negative
2 1 Your app sucks now!!!!! Used to be good but no… Negative
3 1 It seems OK, but very basic. Recurring tasks n… Negative
4 1 Absolutely worthless. This app runs a prohibit… Negative

[20]: plt.figure(figsize=(10,8))
sns.countplot(x="sentiment",data=df,palette="gnuplot")
plt.xlabel("Review Score")
plt.title("Sentiment Count")
plt.show()

7
4 Clean Data
[21]: def clean_text(text):
if isinstance(text, str):
text = BeautifulSoup(text, 'html.parser').get_text()
text = re.sub(r"[^a-zA-Z]", " ", text)
text = text.translate(str.maketrans("", "", string.punctuation))

emoji_pattern = re.compile("["
u"\U0001F600-\U0001F64F"
u"\U0001F300-\U0001F5FF"
u"\U0001F680-\U0001F6FF"
u"\U0001F1E0-\U0001F1FF"
u"\U00002702-\U000027B0"
u"\U000024C2-\U0001F251"
"]+", flags=re.UNICODE)
text = emoji_pattern.sub(r'', text)

text = re.sub(r"#.*", "", text)

text = re.sub(r"/\*.*?\*/", "", text, flags=re.DOTALL)
text = text.lower()

stop_words = set(stopwords.words('english'))
tokens = word_tokenize(text)
tokens = [word for word in tokens if word not in stop_words]

text = ' '.join(tokens)

return text
else:
return ""

[22]: df["text"]=df["text"].apply(clean_text)

/tmp/ipykernel_23/1090887269.py:3: MarkupResemblesLocatorWarning: The input

looks more like a filename than markup. You may want to open this file and pass
the filehandle into Beautiful Soup.
text = BeautifulSoup(text, 'html.parser').get_text()

[23]: df.head()

[23]: score text sentiment

0 1 update getting response developer would change… Negative
1 1 used fair amount time without problems suddenl… Negative
2 1 app sucks used good update physically open clo… Negative
3 1 seems ok basic recurring tasks need work actua… Negative
4 1 absolutely worthless app runs prohibitively cl… Negative

8
[24]: df.drop(columns=["score"],axis=1,inplace=True)

5 Negative Text Length

[25]: negative_text=df[df["sentiment"]=="Negative"]["text"].str.len()
plt.figure(figsize=(10,8))
plt.hist(negative_text, bins=20,label='Negative Text Length',color="red")
plt.title("Negative Text Length")
plt.show()

6 Positive Text Length

[26]: positive_text=df[df["sentiment"]=="Positive"]["text"].str.len()
plt.figure(figsize=(10,8))
plt.hist(positive_text, bins=20,label='Positive Text Length',color="green")
plt.title("Positive Text Length")
plt.show()

9
7 Neutral Text Length
[27]: neutral_text=df[df["sentiment"]=="Neutral"]["text"].str.len()
plt.figure(figsize=(10,8))
plt.hist(neutral_text, bins=20,label='Neutral Text Length',color="blue")
plt.title("Neutral Text Length")
plt.show()

10
8 Together
[28]: plt.figure(figsize=(10,8))
plt.hist(neutral_text, bins=20,label='Neutral Text Length',color="blue")
plt.hist(positive_text, bins=20,label='Positive Text Length',color="green")
plt.hist(negative_text, bins=20,label='Negative Text Length',color="red")
plt.title("Negative vs Positive Vs Neutral Data Length")
plt.show()

11
9 Negative Data WordCloud
[29]: plt.figure(figsize=(15, 15))

negative_wordcloud = df[df["sentiment"] == "Negative"]

negative_text = " ".join(negative_wordcloud['text'].values.tolist())

wordcloud = WordCloud(width=800, height=800, stopwords=STOPWORDS,␣

↪background_color='navy',

max_words=800, colormap="hsv", mask=wordcloud_mask).

↪generate(negative_text)

plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

12
10 Positive Data Wordcloud
[30]: plt.figure(figsize=(15,15))
positive_wordcloud=df[df["sentiment"]=="Positive"]
positive_text=" ".join(positive_wordcloud['text'].values.tolist())
wordcloud = WordCloud(width=800, height=800,stopwords=STOPWORDS,␣
↪background_color='black',␣

↪max_words=800,colormap="CMRmap",mask=wordcloud_mask).generate(positive_text)

plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

13
11 Neutral Data Wordcloud
[31]: plt.figure(figsize=(15,15))
neutral_wordcloud=df[df["sentiment"]=="Neutral"]
neutral_text=" ".join(neutral_wordcloud['text'].values.tolist())
wordcloud = WordCloud(width=800, height=800,stopwords=STOPWORDS,␣
↪background_color='black',␣

↪max_words=800,colormap="CMRmap",mask=wordcloud_mask).generate(neutral_text)

plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

14
12 ALL Data Wordcloud
[32]: plt.figure(figsize=(15,15))
all_text=" ".join(df['text'].values.tolist())
wordcloud = WordCloud(width=800, height=800,stopwords=STOPWORDS,␣
↪background_color='orange',␣

↪max_words=800,colormap="ocean",mask=wordcloud_mask).generate(all_text)

plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

15
13 30 Most common Words From All Text
[33]: from itertools import chain
from collections import Counter

data_set =df["text"].str.split()
all_words = list(chain.from_iterable(data_set))
counter = Counter(all_words)
common_words = counter.most_common(30)
df_common_words = pd.DataFrame(common_words, columns=['Word', 'Count'])

colors = ["cyan", "lime", "magenta", "gold", "purple", "tomato", "teal",␣

↪"sandybrown", "mediumseagreen",

"royalblue", "darkorchid", "darkturquoise", "darkgoldenrod",␣

↪"mediumvioletred", "mediumaquamarine",

"lightcoral", "darkslategray", "olivedrab", "dodgerblue",␣

↪"indianred", "limegreen", "steelblue",

16
"darkviolet", "chocolate", "mediumslateblue", "darkgreen",␣
↪"orangered", "mediumblue", "peru", "mediumspringgreen"]

plt.figure(figsize=(12, 6))
sns.barplot(x='Count', y='Word', data=df_common_words, palette=colors)
plt.title('30 Most Common Words')
plt.xlabel('Count')
plt.ylabel('Word')
plt.show()

14 Most Common Words From Negative Text

[34]: negative_text = df[df["sentiment"] == "Negative"]
data_set = negative_text["text"].str.split()
all_words = [word for sublist in data_set for word in sublist]
counter = Counter(all_words)
common_words = counter.most_common(30)
df_common_words = pd.DataFrame(common_words, columns=['Word', 'Count'])
plt.figure(figsize=(12, 8))
sns.barplot(x='Count', y='Word', data=df_common_words,palette="Set1")
plt.title('30 Most Common Words Negative')
plt.xlabel('Count Negative Word')
plt.ylabel('Negative Word')
plt.show()

17
15 Most Common Words From Positive Text
[35]: positive_text = df[df["sentiment"] == "Positive"]
data_set = positive_text["text"].str.split()
all_words = [word for sublist in data_set for word in sublist]
counter = Counter(all_words)
common_words = counter.most_common(30)
df_common_words = pd.DataFrame(common_words, columns=['Word', 'Count'])
plt.figure(figsize=(12, 8))
sns.barplot(x='Count', y='Word', data=df_common_words,palette="Set2")
plt.title('30 Most Common Words Positive')
plt.xlabel('Count Positive Word')
plt.ylabel('Positive Word')
plt.show()

18
16 Most Common Words From Neutral Text
[36]: neutral_text = df[df["sentiment"] == "Neutral"]
data_set = neutral_text["text"].str.split()
all_words = [word for sublist in data_set for word in sublist]
counter = Counter(all_words)
common_words = counter.most_common(30)
df_common_words = pd.DataFrame(common_words, columns=['Word', 'Count'])
plt.figure(figsize=(12, 8))
sns.barplot(x='Count', y='Word', data=df_common_words,palette="Accent")
plt.title('30 Most Common Words Neutral')
plt.xlabel('Count Neutral Word')
plt.ylabel('Neutral Word')
plt.show()

19
[37]: df.head()

[37]: text sentiment

0 update getting response developer would change… Negative
1 used fair amount time without problems suddenl… Negative
2 app sucks used good update physically open clo… Negative
3 seems ok basic recurring tasks need work actua… Negative
4 absolutely worthless app runs prohibitively cl… Negative

[38]: class_names = ['Negative', 'Positive', 'Neutral']

[39]: sentiment_map={"Negative":0,"Positive":1,"Neutral":2}
df["sentiment"]=df["sentiment"].replace(sentiment_map)

/tmp/ipykernel_23/3901681617.py:2: FutureWarning: Downcasting behavior in

`replace` is deprecated and will be removed in a future version. To retain the
old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to
the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
df["sentiment"]=df["sentiment"].replace(sentiment_map)

[40]: df.head()

20
[40]: text sentiment
0 update getting response developer would change… 0
1 used fair amount time without problems suddenl… 0
2 app sucks used good update physically open clo… 0
3 seems ok basic recurring tasks need work actua… 0
4 absolutely worthless app runs prohibitively cl… 0

17 Calculate the length of each text entry

[41]: df['text_length'] = df['text'].apply(len)

average_text_length = df['text_length'].mean()

print(f"Average Text Length: {average_text_length:.2f}")

Average Text Length: 116.94

[42]: df.drop(columns=["text_length"],axis=1,inplace=True)

18 Load Bert Model

[43]: from transformers import BertModel,BertTokenizer
import torch

bert_model = BertModel.from_pretrained("bert-base-uncased", torch_dtype=torch.

↪float16, attn_implementation="sdpa")

config.json: 0%| | 0.00/570 [00:00<?, ?B/s]

model.safetensors: 0%| | 0.00/440M [00:00<?, ?B/s]

[44]: bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

tokenizer_config.json: 0%| | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt: 0%| | 0.00/232k [00:00<?, ?B/s]
tokenizer.json: 0%| | 0.00/466k [00:00<?, ?B/s]

19 We’ll use this text to understand the tokenization process:

[45]:

21
sample_text="update getting response developer would change rating stars␣
↪possible guys hide behind confusing opaque terms refuse budge annoyed money␣

↪lost really terrible customer experience original careful signing free trial␣

↪app happen go automatically charge full years subscription refuse refund␣

↪terrible customer experience app ok according tos term agreed upon creating␣

↪free trial days automatic charge plan unless cancel prior renewal date␣

↪cancel subscription time charged per agreement"

19.0.1 Some basic operations can convert the text to tokens and tokens to unique
integers (ids):

[46]: tokens = bert_tokenizer.tokenize(sample_text)

token_ids = bert_tokenizer.convert_tokens_to_ids(tokens)

print("=======================================================================================

print(f'Sentence: {sample_text}')
print("=======================================================================================
print(f'Tokens: {tokens}')
print("=======================================================================================
print(f'Token IDs: {token_ids}')
print("=======================================================================================

================================================================================
================
Sentence: update getting response developer would change rating stars possible
guys hide behind confusing opaque terms refuse budge annoyed money lost really
terrible customer experience original careful signing free trial app happen go
automatically charge full years subscription refuse refund terrible customer
experience app ok according tos term agreed upon creating free trial days
automatic charge plan unless cancel prior renewal date cancel subscription time
charged per agreement
================================================================================
================
Tokens: ['update', 'getting', 'response', 'developer', 'would', 'change',
'rating', 'stars', 'possible', 'guys', 'hide', 'behind', 'confusing', 'opaque',
'terms', 'refuse', 'budge', 'annoyed', 'money', 'lost', 'really', 'terrible',
'customer', 'experience', 'original', 'careful', 'signing', 'free', 'trial',
'app', 'happen', 'go', 'automatically', 'charge', 'full', 'years',
'subscription', 'refuse', 'ref', '##und', 'terrible', 'customer', 'experience',
'app', 'ok', 'according', 'to', '##s', 'term', 'agreed', 'upon', 'creating',
'free', 'trial', 'days', 'automatic', 'charge', 'plan', 'unless', 'cancel',
'prior', 'renewal', 'date', 'cancel', 'subscription', 'time', 'charged', 'per',
'agreement']
================================================================================
================
Token IDs: [10651, 2893, 3433, 9722, 2052, 2689, 5790, 3340, 2825, 4364, 5342,

22
2369, 16801, 28670, 3408, 10214, 24981, 11654, 2769, 2439, 2428, 6659, 8013,
3325, 2434, 6176, 6608, 2489, 3979, 10439, 4148, 2175, 8073, 3715, 2440, 2086,
15002, 10214, 25416, 8630, 6659, 8013, 3325, 10439, 7929, 2429, 2000, 2015,
2744, 3530, 2588, 4526, 2489, 3979, 2420, 6882, 3715, 2933, 4983, 17542, 3188,
14524, 3058, 17542, 15002, 2051, 5338, 2566, 3820]
================================================================================
================

19.0.2 Special Tokens

[SEP] - marker for ending of a sentence
[47]: bert_tokenizer.sep_token, bert_tokenizer.sep_token_id

[47]: ('[SEP]', 102)

[CLS] - we must add this token to the start of each sentence, so BERT knows we’re doing classifi-
cation
[48]: bert_tokenizer.cls_token, bert_tokenizer.cls_token_id

[48]: ('[CLS]', 101)

19.0.3 There is also a special token for padding:

[49]: bert_tokenizer.pad_token, bert_tokenizer.pad_token_id

[49]: ('[PAD]', 0)

19.0.4 BERT understands tokens that were in the training set. Everything else can
be encoded using the [UNK] (unknown) token:

[50]: bert_tokenizer.unk_token, bert_tokenizer.unk_token_id

[50]: ('[UNK]', 100)

[51]: encoding = bert_tokenizer.encode_plus(

sample_text,
max_length=32,
add_special_tokens=True, # Add '[CLS]' and '[SEP]'
return_token_type_ids=False,
pad_to_max_length=True,
return_attention_mask=True,
return_tensors='pt', # Return PyTorch tensors
)

Truncation was not explicitly activated but `max_length` is provided a specific

value, please use `truncation=True` to explicitly truncate examples to max

23
length. Defaulting to 'longest_first' truncation strategy. If you encode pairs
of sequences (GLUE-style) with the tokenizer you can select this strategy more
precisely by providing a specific strategy to `truncation`.
/opt/conda/lib/python3.10/site-
packages/transformers/tokenization_utils_base.py:2834: FutureWarning: The
`pad_to_max_length` argument is deprecated and will be removed in a future
version, use `padding=True` or `padding='longest'` to pad to the longest
sequence in the batch, or use `padding='max_length'` to pad to a max length. In
this case, you can give a specific length with `max_length` (e.g.
`max_length=45`) or leave max_length to None to pad to the maximal input size of
the model (e.g. 512 for Bert).
warnings.warn(

[52]: encoding.keys()

[52]: dict_keys(['input_ids', 'attention_mask'])

20 The token ids are now stored in a Tensor and padded to a

length of 32:
[53]: print(len(encoding['input_ids'][0]))
print("=======================================================================================
print(encoding['input_ids'][0])

32
================================================================================
===================
tensor([ 101, 10651, 2893, 3433, 9722, 2052, 2689, 5790, 3340, 2825,
4364, 5342, 2369, 16801, 28670, 3408, 10214, 24981, 11654, 2769,
2439, 2428, 6659, 8013, 3325, 2434, 6176, 6608, 2489, 3979,
10439, 102])

21 The attention mask has the same length:

[54]: print(len(encoding['attention_mask'][0]))
print("=======================================================================================
print(encoding['attention_mask'][0])

32
================================================================================
===================
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1])

24
22 We can inverse the tokenization to have a look at the special
tokens:
[55]: bert_tokenizer.convert_ids_to_tokens(encoding['input_ids'][0])

[55]: ['[CLS]',
'update',
'getting',
'response',
'developer',
'would',
'change',
'rating',
'stars',
'possible',
'guys',
'hide',
'behind',
'confusing',
'opaque',
'terms',
'refuse',
'budge',
'annoyed',
'money',
'lost',
'really',
'terrible',
'customer',
'experience',
'original',
'careful',
'signing',
'free',
'trial',
'app',
'[SEP]']

23 Choosing Sequence Length

BERT works with fixed-length sequences. We’ll use a simple strategy to choose the max length.
Let’s store the token length of each review:
[56]: token_lens = []

for txt in df.text:

25
tokens = bert_tokenizer.encode(txt, max_length=512)
token_lens.append(len(tokens))

24 plot the distribution using boxplot

[57]: plt.figure(figsize=(10, 8))
sns.boxplot(x=token_lens, color='green')

plt.title('Box Plot of Token Lengths', fontsize=16)

plt.xlabel('Token Length', fontsize=14)
plt.show()

26
25 plot the distribution using histplot
[58]: plt.figure(figsize=(10, 8))
sns.histplot(token_lens, bins=30, kde=True, color='blue')
plt.title('Distribution of Token Lengths', fontsize=16)
plt.xlabel('Token Length', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.show()

/opt/conda/lib/python3.10/site-packages/seaborn/_oldcore.py:1119: FutureWarning:
use_inf_as_na option is deprecated and will be removed in a future version.
Convert inf values to NaN before operating instead.
with pd.option_context('mode.use_inf_as_na', True):

27
26 Calculate min, max, and average
[59]: import statistics

min_length = min(token_lens)
max_length = max(token_lens)
avg_length = statistics.mean(token_lens)

print("==============================================================================")
print(f"Minimum Token Length: {min_length}")
print("==============================================================================")
print(f"Maximum Token Length: {max_length}")
print("==============================================================================")
print(f"Average Token Length: {avg_length:.2f}")
print("==============================================================================")

==============================================================================
Minimum Token Length: 2
==============================================================================
Maximum Token Length: 257
==============================================================================
Average Token Length: 21.42
==============================================================================

27 Maxlen==170
[60]: max_len=170

[61]: df.head()

[61]: text sentiment

0 update getting response developer would change… 0
1 used fair amount time without problems suddenl… 0
2 app sucks used good update physically open clo… 0
3 seems ok basic recurring tasks need work actua… 0
4 absolutely worthless app runs prohibitively cl… 0

28 Create Pytorch Dataset

[62]: from torch.utils.data import Dataset
import torch

class GPReviewDataset(Dataset):
def __init__(self, reviews, targets, tokenizer, max_len):
self.reviews = reviews

28
self.targets = targets
self.tokenizer = tokenizer
self.max_len = max_len

def __len__(self):
return len(self.reviews)

def getitem(self, item):

review = str(self.reviews[item])
target = self.targets[item]

encoding = self.tokenizer.encode_plus(
review,
add_special_tokens=True,
max_length=self.max_len,
padding='max_length', # Ensure padding to max_len
return_token_type_ids=False,
return_attention_mask=True,
truncation=True, # Make sure long reviews are truncated to max_len
return_tensors='pt', # Return as PyTorch tensors
)

return {
'review_text': review,
'input_ids': encoding['input_ids'].flatten(), # Flatten to remove␣
↪unnecessary dimensions

'attention_mask': encoding['attention_mask'].flatten(), # Flatten␣

↪the attention mask

'targets': torch.tensor(target, dtype=torch.long) # Convert the␣

↪target to a tensor

The tokenizer is doing most of the heavy lifting for us. We also return the review
texts, so it’ll be easier to evaluate the predictions from our model. Let’s split the
data:
[63]: from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
df_val, df_test = train_test_split(df_test, test_size=0.5, random_state=42)

[64]: df_train.shape, df_val.shape, df_test.shape

[64]: ((12596, 2), (1575, 2), (1575, 2))

29
28.1 We also need to create a couple of data loaders. Here’s a helper function
to do it:
[65]: def create_data_loader(df, tokenizer, max_len, batch_size):

ds = GPReviewDataset(
reviews=df.text.to_numpy(),
targets=df.sentiment.to_numpy(),
tokenizer=tokenizer,
max_len=max_len
)

return DataLoader(
ds,
batch_size=batch_size,
num_workers=4
)

[66]: BATCH_SIZE = 16

train_data_loader = create_data_loader(df_train, bert_tokenizer,max_len,␣

↪BATCH_SIZE)

val_data_loader = create_data_loader(df_val, bert_tokenizer, max_len,␣

↪BATCH_SIZE)

test_data_loader = create_data_loader(df_test, bert_tokenizer,max_len,␣

↪BATCH_SIZE)

28.1.1 Let’s have a look at an example batch from our training data loader:

[67]: data = next(iter(train_data_loader))

data.keys()

/opt/conda/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning:
os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX
is multithreaded, so this will likely lead to a deadlock.
self.pid = os.fork()
/opt/conda/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning:
os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX
is multithreaded, so this will likely lead to a deadlock.
self.pid = os.fork()

[67]: dict_keys(['review_text', 'input_ids', 'attention_mask', 'targets'])

[68]: print("##############################################################################")
print(data['input_ids'].shape)
print("##############################################################################")
print(data['attention_mask'].shape)

30
print("##############################################################################")
print(data['targets'].shape)
print("##############################################################################")

##############################################################################
torch.Size([16, 170])
##############################################################################
torch.Size([16, 170])
##############################################################################
torch.Size([16])
##############################################################################

29 And try to use it on the encoding of our sample text:

[69]: bert_model = BertModel.from_pretrained("bert-base-uncased")

[70]: print(encoding)

{'input_ids': tensor([[ 101, 10651, 2893, 3433, 9722, 2052, 2689, 5790,
3340, 2825,
4364, 5342, 2369, 16801, 28670, 3408, 10214, 24981, 11654, 2769,
2439, 2428, 6659, 8013, 3325, 2434, 6176, 6608, 2489, 3979,
10439, 102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1]])}

[71]: output =␣
↪bert_model(input_ids=encoding['input_ids'],attention_mask=encoding['attention_mask'])

last_hidden_state=output.last_hidden_state

The last_hidden_state is a sequence of hidden states of the last layer of the model. Obtaining
the pooled_output is done by applying the BertPooler on last_hidden_state:
[72]: last_hidden_state.shape

[72]: torch.Size([1, 32, 768])

29.0.1 We have the hidden state for each of our 32 tokens (the length of our ex-
ample sequence). But why 768? This is the number of hidden units in the
feedforward-networks. We can verify that by checking the config:

[73]: bert_model.config.hidden_size

[73]: 768

31
29.0.2 You can think of the pooled_output as a summary of the content, according to
BERT. Albeit, you might try and do better. Let’s look at the shape of the
output:

[74]: pooled_output = last_hidden_state[:, 0]

pooled_output.shape

[74]: torch.Size([1, 768])

30 We can use all of this knowledge to create a classifier that uses

the BERT model:
[75]: import torch.nn as nn
from transformers import BertModel

class SentimentClassifier(nn.Module):

def init(self, n_classes):

super(SentimentClassifier, self).__init__()
self.bert = BertModel.from_pretrained("bert-base-uncased")
self.drop = nn.Dropout(p=0.3)
self.out = nn.Linear(self.bert.config.hidden_size, n_classes)

def forward(self, input_ids, attention_mask):

outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
pooled_output = outputs.pooler_output
output = self.drop(pooled_output)
return self.out(output)

[76]: model = SentimentClassifier(len(class_names))

model = model.to(device)

31 We’ll move the example batch of our training data to the GPU:
[77]: input_ids = data['input_ids'].to(device)
attention_mask = data['attention_mask'].to(device)

print(input_ids.shape) # batch size x seq length

print(attention_mask.shape) # batch size x seq length

torch.Size([16, 170])
torch.Size([16, 170])

32
31.0.1 To get the predicted probabilities from our trained model, we’ll apply the
softmax function to the outputs:

[78]: import torch.nn.functional as F

F.softmax(model(input_ids, attention_mask), dim=1)

[78]: tensor([[0.4279, 0.3012, 0.2709],

[0.3276, 0.3455, 0.3269],
[0.3777, 0.2953, 0.3270],
[0.4202, 0.3139, 0.2658],
[0.2663, 0.4125, 0.3211],
[0.2609, 0.2823, 0.4568],
[0.2940, 0.3957, 0.3103],
[0.5061, 0.3085, 0.1854],
[0.4150, 0.3252, 0.2598],
[0.2676, 0.3365, 0.3960],
[0.2929, 0.3542, 0.3529],
[0.2969, 0.3734, 0.3297],
[0.4226, 0.2996, 0.2779],
[0.4588, 0.3666, 0.1746],
[0.4635, 0.2999, 0.2366],
[0.3229, 0.3616, 0.3156]], device='cuda:0', grad_fn=<SoftmaxBackward0>)

To reproduce the training procedure from the BERT paper, we’ll use the AdamW
optimizer provided by Hugging Face. It corrects weight decay, so it’s similar to the
original paper. We’ll also use a linear scheduler with no warmup steps:
[79]: from transformers import AdamW, get_linear_schedule_with_warmup

EPOCHS = 10
optimizer = AdamW(model.parameters(), lr=2e-5, correct_bias=False)
total_steps = len(train_data_loader) * EPOCHS

scheduler = get_linear_schedule_with_warmup(
optimizer,
num_warmup_steps=0,
num_training_steps=total_steps
)

EPOCHS = 10

optimizer = AdamW(model.parameters(), lr=2e-5, correct_bias=False)

total_steps = len(train_data_loader) * EPOCHS

scheduler = get_linear_schedule_with_warmup(
optimizer,
num_warmup_steps=0,
num_training_steps=total_steps

33
)

loss_fn = nn.CrossEntropyLoss().to(device)

/opt/conda/lib/python3.10/site-packages/transformers/optimization.py:591:
FutureWarning: This implementation of AdamW is deprecated and will be removed in
a future version. Use the PyTorch implementation torch.optim.AdamW instead, or
set `no_deprecation_warning=True` to disable this warning
warnings.warn(

31.0.2 How do we come up with all hyperparameters? The BERT authors have some
recommendations for fine-tuning:
• Batch size: 16, 32
• Learning rate (Adam): 5e-5, 3e-5, 2e-5
• Number of epochs: 2, 3,#### 4
We’re going to ignore the number of epochs recommendation but stick with the rest. Note that
increasing the batch size reduces the training time significantly, but gives you lower accur###
acy.
Let’s continue with writing a helper function for training our model for one epoch:
[80]: def train_epoch(
model,
data_loader,
loss_fn,
optimizer,
device,
scheduler,
n_examples
):
model = model.train()

losses = []
correct_predictions = 0

for d in tqdm(data_loader, desc="Training", leave=False):

input_ids = d["input_ids"].to(device)
attention_mask = d["attention_mask"].to(device)
targets = d["targets"].to(device)

outputs = model(
input_ids=input_ids,
attention_mask=attention_mask
)

_, preds = torch.max(outputs, dim=1)

loss = loss_fn(outputs, targets)

34
correct_predictions += torch.sum(preds == targets)
losses.append(loss.item())

loss.backward()
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
scheduler.step()
optimizer.zero_grad()

return correct_predictions.double() / n_examples, np.mean(losses)

Training the model should look familiar, except for two things. The scheduler gets
called every time a batch is fed to the model. We’re avoiding exploding gradients by
clipping the gradients of the model using clip_grad_norm_.
[81]: def eval_model(model, data_loader, loss_fn, device, n_examples):
model = model.eval()

losses = []
correct_predictions = 0

with torch.no_grad():
for d in tqdm(data_loader, desc="Evaluating", leave=False):
input_ids = d["input_ids"].to(device)
attention_mask = d["attention_mask"].to(device)
targets = d["targets"].to(device)

outputs = model(
input_ids=input_ids,
attention_mask=attention_mask
)
_, preds = torch.max(outputs, dim=1)

loss = loss_fn(outputs, targets)

correct_predictions += torch.sum(preds == targets)

losses.append(loss.item())

return correct_predictions.double() / n_examples, np.mean(losses)

35
31.0.3 Using those two, we can write our training loop. We’ll also store the training
history:

[82]: EPOCHS = 20
history = defaultdict(list)
best_accuracy = 0

for epoch in range(EPOCHS):

print(f'Epoch {epoch + 1}/{EPOCHS}')
print('-' * 10)

train_acc, train_loss = train_epoch(

model,
train_data_loader,
loss_fn,
optimizer,
device,
scheduler,
len(df_train)
)

print(f'Train loss: {train_loss:.4f} | Train accuracy: {train_acc:.4f}')

val_acc, val_loss = eval_model(

model,
val_data_loader,
loss_fn,
device,
len(df_val)
)

print(f'Val loss: {val_loss:.4f} | Val accuracy: {val_acc:.4f}')

print()

history['train_acc'].append(train_acc)
history['train_loss'].append(train_loss)
history['val_acc'].append(val_acc)
history['val_loss'].append(val_loss)

if val_acc > best_accuracy:

torch.save(model.state_dict(), 'best_model_state.bin')
best_accuracy = val_acc

print("Training complete!")
print(f"Best Validation Accuracy: {best_accuracy:.4f}")

Epoch 1/20
----------

36
Train loss: 0.9913 | Train accuracy: 0.5059

Val loss: 0.9387 | Val accuracy: 0.5537

Epoch 2/20
----------

Train loss: 0.8749 | Train accuracy: 0.5942

Val loss: 0.9033 | Val accuracy: 0.5994

Epoch 3/20
----------

Train loss: 0.7727 | Train accuracy: 0.6512

Val loss: 0.9030 | Val accuracy: 0.6108

Epoch 4/20
----------

Train loss: 0.7045 | Train accuracy: 0.6839

Val loss: 0.9356 | Val accuracy: 0.6343

Epoch 5/20
----------

Train loss: 0.6578 | Train accuracy: 0.7026

Val loss: 0.9368 | Val accuracy: 0.6419

Epoch 6/20
----------

Train loss: 0.6217 | Train accuracy: 0.7139

37
Val loss: 0.9282 | Val accuracy: 0.6451

Epoch 7/20
----------

Train loss: 0.5987 | Train accuracy: 0.7199

Val loss: 0.9851 | Val accuracy: 0.6514

Epoch 8/20
----------

Train loss: 0.5814 | Train accuracy: 0.7246

Val loss: 0.9947 | Val accuracy: 0.6559

Epoch 9/20
----------

Train loss: 0.5764 | Train accuracy: 0.7257

Val loss: 1.0183 | Val accuracy: 0.6540

Epoch 10/20
----------

Train loss: 0.5694 | Train accuracy: 0.7275

Val loss: 1.0273 | Val accuracy: 0.6514

Epoch 11/20
----------

Train loss: 0.5666 | Train accuracy: 0.7286

38
Val loss: 1.0273 | Val accuracy: 0.6514

Epoch 12/20
----------

Train loss: 0.5648 | Train accuracy: 0.7290

Val loss: 1.0273 | Val accuracy: 0.6514

Epoch 13/20
----------

Train loss: 0.5654 | Train accuracy: 0.7290

Val loss: 1.0273 | Val accuracy: 0.6514

Epoch 14/20
----------

Train loss: 0.5659 | Train accuracy: 0.7289

Val loss: 1.0273 | Val accuracy: 0.6514

Epoch 15/20
----------

Train loss: 0.5645 | Train accuracy: 0.7290

Val loss: 1.0273 | Val accuracy: 0.6514

Epoch 16/20
----------

Train loss: 0.5651 | Train accuracy: 0.7291

Val loss: 1.0273 | Val accuracy: 0.6514

39
Epoch 17/20
----------

Train loss: 0.5651 | Train accuracy: 0.7290

Val loss: 1.0273 | Val accuracy: 0.6514

Epoch 18/20
----------

Train loss: 0.5660 | Train accuracy: 0.7290

Val loss: 1.0273 | Val accuracy: 0.6514

Epoch 19/20
----------

Train loss: 0.5675 | Train accuracy: 0.7286

Val loss: 1.0273 | Val accuracy: 0.6514

Epoch 20/20
----------

Train loss: 0.5650 | Train accuracy: 0.7286

Val loss: 1.0273 | Val accuracy: 0.6514

Training complete!
Best Validation Accuracy: 0.6559

32 Training Metrics Visualization

[83]: def plot_training_metrics(history):
sns.set_style("whitegrid")
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 10))

40
epochs = range(1, len(history['train_loss']) + 1)

train_loss = [loss.cpu().numpy() if hasattr(loss, 'cpu') else loss for loss␣

↪in history['train_loss']]

val_loss = [loss.cpu().numpy() if hasattr(loss, 'cpu') else loss for loss␣

↪in history['val_loss']]

train_acc = [acc.cpu().numpy() if hasattr(acc, 'cpu') else acc for acc in␣

↪history['train_acc']]

val_acc = [acc.cpu().numpy() if hasattr(acc, 'cpu') else acc for acc in␣

↪history['val_acc']]

ax1.plot(epochs, train_loss, marker='o', label='Training Loss')

ax1.plot(epochs, val_loss, marker='o', label='Validation Loss')
ax1.set_title('Model Loss')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.legend()

ax2.plot(epochs, train_acc, marker='o', label='Training Accuracy')

ax2.plot(epochs, val_acc, marker='o', label='Validation Accuracy')
ax2.set_title('Model Accuracy')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Accuracy')
ax2.legend()

plt.tight_layout()
plt.savefig('training_metrics.png', dpi=300, bbox_inches='tight')
plt.show()

plot_training_metrics(history)

41
32.1 Evaluation
So how good is our model on predicting sentiment? Let’s start by calculating the accuracy on
the test data:
[84]: test_acc, _ = eval_model(
model,
test_data_loader,
loss_fn,
device,
len(df_test)
)

test_acc.item()

[84]: 0.6565079365079365

32.1.1 The accuracy is about 1% lower on the test set. Our model seems to generalize
well.
We’ll define a helper function to get the predictions from our mode l:

42
[85]: def get_predictions(model, data_loader):
model = model.eval()

review_texts = []
predictions = []
prediction_probs = []
real_values = []

with torch.no_grad():
for d in data_loader:

texts = d["review_text"]
input_ids = d["input_ids"].to(device)
attention_mask = d["attention_mask"].to(device)
targets = d["targets"].to(device)

outputs = model(
input_ids=input_ids,
attention_mask=attention_mask
)
_, preds = torch.max(outputs, dim=1)

probs = F.softmax(outputs, dim=1)

review_texts.extend(texts)
predictions.extend(preds)
prediction_probs.extend(probs)
real_values.extend(targets)

predictions = torch.stack(predictions).cpu()
prediction_probs = torch.stack(prediction_probs).cpu()
real_values = torch.stack(real_values).cpu()
return review_texts, predictions, prediction_probs, real_values

This is similar to the evaluation function, except that we're storing the text of the reviews a

[86]: y_review_texts, y_pred, y_pred_probs, y_test =␣

↪get_predictions(model,test_data_loader)

33 Classification Report
[87]: from sklearn.metrics import (
confusion_matrix,
classification_report,
roc_auc_score,
roc_curve,

43
accuracy_score,
)
print(classification_report(y_test, y_pred, target_names=class_names))

precision recall f1-score support

Negative 0.92 0.50 0.64 519

Positive 0.56 0.95 0.70 590
Neutral 0.73 0.46 0.57 466

accuracy 0.66 1575

macro avg 0.74 0.64 0.64 1575
weighted avg 0.73 0.66 0.64 1575

34 Confusion Matrix
[88]: cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(10,8))
sns.heatmap(cm, annot=True, fmt='d', cmap='hsv', xticklabels=class_names,␣
↪yticklabels=class_names)

plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')
plt.show()

44
35 Roc Curve
[89]: plt.figure(figsize=(10,8))

class_names = ['Negative', 'Positive', 'Neutral']

y_test_binarized = label_binarize(y_test, classes=range(len(class_names)))

for i, class_name in enumerate(class_names):

if np.sum(y_test_binarized[:, i]) > 0 and np.sum(1 - y_test_binarized[:,␣
↪i]) > 0:

fpr, tpr, thresholds = roc_curve(y_test_binarized[:, i], y_pred_probs[:

↪, i])

auc_score = roc_auc_score(y_test_binarized[:, i], y_pred_probs[:, i])

plt.plot(fpr, tpr, label=f"{class_name} (AUC = {auc_score:.2f})")
else:

45
print(f"Skipping {class_name} because only one class is present in␣
↪y_test.")

plt.plot([0, 1], [0, 1], color='gray', linestyle='--', label="Random Guess")

plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve for Multiclass Classification")
plt.legend(loc="lower right")
plt.grid()
plt.show()

46
36 Precision Recall Curve
[90]: from sklearn.metrics import precision_recall_curve, average_precision_score
plt.figure(figsize=(10,8))

class_names = ['Negative', 'Positive', 'Neutral']

y_test_binarized = label_binarize(y_test, classes=range(len(class_names)))

for i, class_name in enumerate(class_names):

if np.sum(y_test_binarized[:, i]) > 0 and np.sum(1 - y_test_binarized[:,␣
↪i]) > 0:

precision, recall, _ = precision_recall_curve(y_test_binarized[:, i],␣

↪y_pred_probs[:, i])

avg_precision = average_precision_score(y_test_binarized[:, i],␣

↪y_pred_probs[:, i])

plt.plot(recall, precision, label=f"{class_name} (AP = {avg_precision:.

↪2f})")

else:
print(f"Skipping {class_name} because only one class is present in␣
↪y_test.")

plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve for Multiclass Classification")
plt.legend(loc="lower left")
plt.grid()
plt.show()

47
37 Roc Auc Score
[91]: roc_auc = roc_auc_score(y_test, y_pred_probs, multi_class='ovr')

plt.plot([])
plt.text(0, 0, f'ROC AUC Score: {roc_auc:.4f}', fontsize=16, ha='center',␣
↪va='center', color="indigo")

plt.axis('off')
plt.xlim(-1, 1)
plt.ylim(-1, 1)

plt.show()

48
38 Log Loss
[92]: from sklearn.metrics import log_loss

classes = [0, 1, 2]

logarithm_loss = log_loss(y_test, y_pred_probs, labels=classes)

plt.plot([])
plt.text(0, 0, f'Log Loss: {logarithm_loss:.4f}', fontsize=16, ha='center',␣
↪va='center', color="black")

plt.axis('off')

plt.xlim(-1, 1)
plt.ylim(-1, 1)

plt.show()

49
39 Kappa Score
[93]: from sklearn.metrics import cohen_kappa_score
kappa = cohen_kappa_score(y_test,y_pred)
plt.plot([])
plt.text(0,0, f'Cohen Kappa Score: {kappa:.4f}', fontsize=16, ha='center',␣
↪va='center',color="orangered")

plt.axis('off')

# Set the x-axis limits

plt.xlim(-1, 1)
plt.ylim(-1,1)

plt.show()

50
40 matthews_corrcoef
[94]: from sklearn.metrics import matthews_corrcoef
mcc = matthews_corrcoef(y_test,y_pred)

# Create a plot and display the MCC value as text

plt.plot([])
plt.text(0,0, f'Matthews Correlation Coefficient: {mcc:.4f}', fontsize=16,␣
↪ha='center', va='center',color="saddlebrown")

plt.axis('off')

# Set the x-axis limits

plt.xlim(-1, 1)
plt.ylim(-1,1)

plt.show()

51
41 Model Prediction
[95]: idx = 2

review_text = y_review_texts[idx]
true_sentiment = y_test[idx]
pred_df = pd.DataFrame({'class_names': class_names,'values': y_pred_probs[idx]})

[96]: from textwrap import wrap

print("==============================================================================")
print("\n".join(wrap(review_text)))
print("==============================================================================")
print(f'True sentiment: {class_names[true_sentiment]}')
print("==============================================================================")

==============================================================================
doesnt thing gliched everytime open hi apologies issue app tasks
community driven hobby project mine popular suggestions added time
offered free without advertising first report issue kind glitches look
like thanks steve
==============================================================================
True sentiment: Negative
==============================================================================

52
[97]: plt.figure(figsize=(10,8))
sns.barplot(x='values', y='class_names', data=pred_df, orient='h')
plt.ylabel('sentiment')
plt.xlabel('probability')
plt.title("Model Prediction Probability")
plt.show()

42 Prediction On Custom Data

[98]: review_text = "Every day brings a new opportunity to overcome challenges, learn␣
↪from experiences, and grow stronger, as we continuously work towards␣

↪becoming the best version of ourselves, embracing both successes and␣

↪setbacks as stepping stones toward a brighter future"

[99]: encoded_review = bert_tokenizer.encode_plus(

review_text,
max_length=max_len,
add_special_tokens=True,
return_token_type_ids=False,

53
pad_to_max_length=True,
return_attention_mask=True,
return_tensors='pt',
)

/opt/conda/lib/python3.10/site-
packages/transformers/tokenization_utils_base.py:2834: FutureWarning: The
`pad_to_max_length` argument is deprecated and will be removed in a future
version, use `padding=True` or `padding='longest'` to pad to the longest
sequence in the batch, or use `padding='max_length'` to pad to a max length. In
this case, you can give a specific length with `max_length` (e.g.
`max_length=45`) or leave max_length to None to pad to the maximal input size of
the model (e.g. 512 for Bert).
warnings.warn(

[100]: input_ids = encoded_review['input_ids'].to(device)

attention_mask = encoded_review['attention_mask'].to(device)

output = model(input_ids, attention_mask)

_, prediction = torch.max(output, dim=1)
print("=======================================================================================
print(f'Review text: {review_text}')
print("=======================================================================================
print(f'Sentiment : {class_names[prediction]}')
print("=======================================================================================

================================================================================
==============
Review text: Every day brings a new opportunity to overcome challenges, learn
from experiences, and grow stronger, as we continuously work towards becoming
the best version of ourselves, embracing both successes and setbacks as stepping
stones toward a brighter future
================================================================================
==============
Sentiment : Positive
================================================================================
==============

[ ]:

App (Linkedin) Reviews Sentiment Analysis Using Python
No ratings yet
App (Linkedin) Reviews Sentiment Analysis Using Python
1 page
Nokia Positive and Negative TM
No ratings yet
Nokia Positive and Negative TM
8 pages
Kindle Review Sentiment Analysis - Ipynb - Colab
No ratings yet
Kindle Review Sentiment Analysis - Ipynb - Colab
5 pages
Dataset Description: Amazon Reviews of Unlocked Phone
No ratings yet
Dataset Description: Amazon Reviews of Unlocked Phone
4 pages
Social Media Sentimental Analysis 1
No ratings yet
Social Media Sentimental Analysis 1
30 pages
Data Science Project
No ratings yet
Data Science Project
34 pages
4.text Classification Using ML - Ipynb
No ratings yet
4.text Classification Using ML - Ipynb
47 pages
NLP Sentimental Analysis 1736351356
No ratings yet
NLP Sentimental Analysis 1736351356
32 pages
Part C Assignment No 2 Mini Project On Twitter 1
No ratings yet
Part C Assignment No 2 Mini Project On Twitter 1
9 pages
Twitter Sentiment Analysis
No ratings yet
Twitter Sentiment Analysis
16 pages
Sentiments Analysis Code Analysis
No ratings yet
Sentiments Analysis Code Analysis
42 pages
ML Week10.1
No ratings yet
ML Week10.1
5 pages
Prototype 1
No ratings yet
Prototype 1
10 pages
T SNE Visualization of Amazon Reviews With Polarity Based Color Coding+
No ratings yet
T SNE Visualization of Amazon Reviews With Polarity Based Color Coding+
29 pages
Chapter 8 Text Analytics
No ratings yet
Chapter 8 Text Analytics
42 pages
Part C - Assignment No. 2 Mini-Project On Twitter
No ratings yet
Part C - Assignment No. 2 Mini-Project On Twitter
7 pages
Twitter Sentiment Analysis
No ratings yet
Twitter Sentiment Analysis
13 pages
Sentiment Analysis with NLTK
No ratings yet
Sentiment Analysis with NLTK
4 pages
(ICT550) Sentiment Analysis Basic
No ratings yet
(ICT550) Sentiment Analysis Basic
2 pages
03 Amazon Fine Food Reviews Analysis - KNN
No ratings yet
03 Amazon Fine Food Reviews Analysis - KNN
71 pages
Q 3
No ratings yet
Q 3
2 pages
Amazon Assignment Ex
No ratings yet
Amazon Assignment Ex
11 pages
Code
No ratings yet
Code
13 pages
Sentiment Analysis for Tweets
No ratings yet
Sentiment Analysis for Tweets
11 pages
Adithiyaa BR 23MBA0018 SMA DA Text Mining PDF
No ratings yet
Adithiyaa BR 23MBA0018 SMA DA Text Mining PDF
6 pages
App-Rating-Recommendations. - Googleplaystore - Analysis (1) .Ipynb at Main Prenith - App-Rating-Recommendations
No ratings yet
App-Rating-Recommendations. - Googleplaystore - Analysis (1) .Ipynb at Main Prenith - App-Rating-Recommendations
19 pages
AIML IA3 Loki & SG
No ratings yet
AIML IA3 Loki & SG
31 pages
Twitter Sentiment Analysis Guide
No ratings yet
Twitter Sentiment Analysis Guide
7 pages
British Airways Forage Report
No ratings yet
British Airways Forage Report
12 pages
Miniproject NLP
No ratings yet
Miniproject NLP
22 pages
P7 - HENDRY - 535230151.ipynb - Colab
No ratings yet
P7 - HENDRY - 535230151.ipynb - Colab
6 pages
Sentimental Analysis
No ratings yet
Sentimental Analysis
3 pages
Amna Bagh Ali
No ratings yet
Amna Bagh Ali
6 pages
Bert Sentiment
No ratings yet
Bert Sentiment
7 pages
Major Project Text Classification Code
No ratings yet
Major Project Text Classification Code
36 pages
Amazon Review Sentiment Analysis
No ratings yet
Amazon Review Sentiment Analysis
4 pages
Truncated SVD
No ratings yet
Truncated SVD
27 pages
Chandru Lab 3
No ratings yet
Chandru Lab 3
7 pages
Lab 15 Assignment by Ankit
No ratings yet
Lab 15 Assignment by Ankit
4 pages
Pgm6 With Output
No ratings yet
Pgm6 With Output
6 pages
RajSingh WIexp7
No ratings yet
RajSingh WIexp7
8 pages
28.1 - 28.16 Real World Problem - Predict Rating Given Product Reviews On Amazon
No ratings yet
28.1 - 28.16 Real World Problem - Predict Rating Given Product Reviews On Amazon
19 pages
Maneesha Nidigonda Verzeo Major Project
No ratings yet
Maneesha Nidigonda Verzeo Major Project
11 pages
Sentiment Analysis On User-Generated Tweets
No ratings yet
Sentiment Analysis On User-Generated Tweets
15 pages
Alexa Sentiment Analysis
No ratings yet
Alexa Sentiment Analysis
34 pages
Sentiment Analysis With NLP Deep Learning
No ratings yet
Sentiment Analysis With NLP Deep Learning
8 pages
Agglomerative Clustering
No ratings yet
Agglomerative Clustering
81 pages
DSBA+Master+Codebook+ +Text+Mining+&+TSF
No ratings yet
DSBA+Master+Codebook+ +Text+Mining+&+TSF
11 pages
Dbscan Clustering
No ratings yet
Dbscan Clustering
53 pages
Detailed Report
No ratings yet
Detailed Report
6 pages
Sma Exp 10 Code Print
No ratings yet
Sma Exp 10 Code Print
7 pages
NLP Transformer-Based Models Used For Sentiment Analysis: 1. BERT
No ratings yet
NLP Transformer-Based Models Used For Sentiment Analysis: 1. BERT
98 pages
Code
No ratings yet
Code
18 pages
Data Science: Amazon Review Clustering
No ratings yet
Data Science: Amazon Review Clustering
78 pages
R002 KrishAhuja BDA Lab9.Ipynb - Colab
No ratings yet
R002 KrishAhuja BDA Lab9.Ipynb - Colab
3 pages
Pricing Mercari
No ratings yet
Pricing Mercari
41 pages
Chapter 10 - Text Analytics
No ratings yet
Chapter 10 - Text Analytics
13 pages
Section 6 - Jupyter Notebook
No ratings yet
Section 6 - Jupyter Notebook
11 pages
Data Science: Amazon Review Analysis
No ratings yet
Data Science: Amazon Review Analysis
22 pages
Group 4 Hallticket
No ratings yet
Group 4 Hallticket
4 pages
800-27009-f 35 Series NVR User Guide-0719
No ratings yet
800-27009-f 35 Series NVR User Guide-0719
162 pages
Engineering Critical Assessment
100% (1)
Engineering Critical Assessment
2 pages
127 - Hse Inspection Checklist-Compressed Gas Cylinder
82% (11)
127 - Hse Inspection Checklist-Compressed Gas Cylinder
1 page
GFM Divine Healing Booklet
No ratings yet
GFM Divine Healing Booklet
56 pages
Design of Modular Scania
No ratings yet
Design of Modular Scania
205 pages
Spoken Language Assessment Form
No ratings yet
Spoken Language Assessment Form
1 page
مذكرة جرامر اولى ثانوي ترم اول
No ratings yet
مذكرة جرامر اولى ثانوي ترم اول
34 pages
Calendering Process
67% (3)
Calendering Process
21 pages
Growth Performance of Broilers Fed A Quality Protein Maize
No ratings yet
Growth Performance of Broilers Fed A Quality Protein Maize
13 pages
List - 2
No ratings yet
List - 2
21 pages
ch2 Free Undamped Vibration of SDOF
No ratings yet
ch2 Free Undamped Vibration of SDOF
14 pages
School Finance
100% (2)
School Finance
14 pages
Disciplines and Ideas in The Applied Social Sciences
100% (4)
Disciplines and Ideas in The Applied Social Sciences
22 pages
Group Commitment Rationale Revised 4
No ratings yet
Group Commitment Rationale Revised 4
6 pages
The Golden Broth
No ratings yet
The Golden Broth
8 pages
Ammann Customer Magazine Gmc-1423-10-En Arx 200916
No ratings yet
Ammann Customer Magazine Gmc-1423-10-En Arx 200916
20 pages
LFML Oct23
No ratings yet
LFML Oct23
66 pages
Presentación Diseño Por Desempeño - Es.en
No ratings yet
Presentación Diseño Por Desempeño - Es.en
65 pages
Climate Change: The Ultimate Determinant of Health
No ratings yet
Climate Change: The Ultimate Determinant of Health
11 pages
Rotary and Handling Tools Catalog Land and Offshore
No ratings yet
Rotary and Handling Tools Catalog Land and Offshore
188 pages
01 Production Manual - CM602
No ratings yet
01 Production Manual - CM602
56 pages
Graduate Assistant Handbook: Updated January 2018
No ratings yet
Graduate Assistant Handbook: Updated January 2018
39 pages
FCOM vs FCTM: Cockpit Prep Steps
No ratings yet
FCOM vs FCTM: Cockpit Prep Steps
6 pages
Nota Renewable Energy
No ratings yet
Nota Renewable Energy
97 pages
Grammar Skills for Young Learners
No ratings yet
Grammar Skills for Young Learners
16 pages
Introduction To Global HRM
100% (1)
Introduction To Global HRM
37 pages
Matlab Code For Sorted Real Schur Forms
100% (1)
Matlab Code For Sorted Real Schur Forms
20 pages
"The Flood" An Excerpt From "Without Seeing The Dawn"
100% (1)
"The Flood" An Excerpt From "Without Seeing The Dawn"
6 pages
Mathematics: Quarter 2 - Module 1: Polynomial Functions
100% (1)
Mathematics: Quarter 2 - Module 1: Polynomial Functions
27 pages

BERT Driven Sentiment Classification With PyTorch

Uploaded by

BERT Driven Sentiment Classification With PyTorch

Uploaded by

ymt5vpxof

December 18, 2024

[1]: from sklearn.metrics import roc_curve, roc_auc_score

HAPPY_COLORS_PALETTE = ["#01BEFE", "#FFDD00", "#FF7D00", "#FF006D", "#ADFF02",␣

warnings.filterwarnings('ignore', category=UserWarning, module='transformers')

[nltk_data] Downloading package wordnet to /usr/share/nltk_data…

[2]: device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

[2]: device(type='cuda', index=0)

[5]: userName userImage \

content score thumbsUpCount \

[6]: (15746, 11)

[10]: content score \

[11]: df["text"]=df["content"]+" "+df["replyContent"]

[12]: content score \

[14]: score text

3 Define a function to categorize the sentiment based on the score

text = re.sub(r"#.*", "", text)

text = ' '.join(tokens)

/tmp/ipykernel_23/1090887269.py:3: MarkupResemblesLocatorWarning: The input

[23]: score text sentiment

5 Negative Text Length

6 Positive Text Length

negative_wordcloud = df[df["sentiment"] == "Negative"]

wordcloud = WordCloud(width=800, height=800, stopwords=STOPWORDS,␣

max_words=800, colormap="hsv", mask=wordcloud_mask).

colors = ["cyan", "lime", "magenta", "gold", "purple", "tomato", "teal",␣

"royalblue", "darkorchid", "darkturquoise", "darkgoldenrod",␣

"lightcoral", "darkslategray", "olivedrab", "dodgerblue",␣

14 Most Common Words From Negative Text

[37]: text sentiment

[38]: class_names = ['Negative', 'Positive', 'Neutral']

/tmp/ipykernel_23/3901681617.py:2: FutureWarning: Downcasting behavior in

17 Calculate the length of each text entry

print(f"Average Text Length: {average_text_length:.2f}")

Average Text Length: 116.94

18 Load Bert Model

bert_model = BertModel.from_pretrained("bert-base-uncased", torch_dtype=torch.

config.json: 0%| | 0.00/570 [00:00<?, ?B/s]

[44]: bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

tokenizer_config.json: 0%| | 0.00/48.0 [00:00<?, ?B/s]

19 We’ll use this text to understand the tokenization process:

↪app happen go automatically charge full years subscription refuse refund␣

↪cancel subscription time charged per agreement"

[46]: tokens = bert_tokenizer.tokenize(sample_text)

19.0.2 Special Tokens

[47]: ('[SEP]', 102)

[48]: ('[CLS]', 101)

19.0.3 There is also a special token for padding:

[49]: bert_tokenizer.pad_token, bert_tokenizer.pad_token_id

[50]: bert_tokenizer.unk_token, bert_tokenizer.unk_token_id

[50]: ('[UNK]', 100)

[51]: encoding = bert_tokenizer.encode_plus(

Truncation was not explicitly activated but `max_length` is provided a specific

[52]: dict_keys(['input_ids', 'attention_mask'])

20 The token ids are now stored in a Tensor and padded to a

21 The attention mask has the same length:

23 Choosing Sequence Length

for txt in df.text:

24 plot the distribution using boxplot

plt.title('Box Plot of Token Lengths', fontsize=16)

[61]: text sentiment

28 Create Pytorch Dataset

def __getitem__(self, item):

'attention_mask': encoding['attention_mask'].flatten(), # Flatten␣

'targets': torch.tensor(target, dtype=torch.long) # Convert the␣

[64]: df_train.shape, df_val.shape, df_test.shape

[64]: ((12596, 2), (1575, 2), (1575, 2))

train_data_loader = create_data_loader(df_train, bert_tokenizer,max_len,␣

val_data_loader = create_data_loader(df_val, bert_tokenizer, max_len,␣

test_data_loader = create_data_loader(df_test, bert_tokenizer,max_len,␣

[67]: data = next(iter(train_data_loader))

[67]: dict_keys(['review_text', 'input_ids', 'attention_mask', 'targets'])

29 And try to use it on the encoding of our sample text:

[72]: torch.Size([1, 32, 768])

[74]: pooled_output = last_hidden_state[:, 0]

[74]: torch.Size([1, 768])

def getitem(self, item):

def init(self, n_classes):