Module 3 - NLP
Word Embedding
Word embeddings are a type of word representation that allows words with similar meaning to
have a similar representation. Word embeddings are essential in the field of Natural Language
Processing (NLP) because they convert words into numerical vectors, which can be easily
processed by machine learning models.
Word2Vec is one of the most popular techniques for generating word embeddings. It was
developed by researchers at Google in 2013, and it is based on a shallow neural network
architecture that learns to predict words given a context (or vice versa). Word2Vec learns word
embeddings by using either the Skip-gram model or the Continuous Bag of Words (CBOW)
model.
1. Skip-gram Model:
o The Skip-gram model tries to predict the context (surrounding words) given a target
word.
o The goal is to maximize the probability of context words appearing around the target
word in a given context window.
2. Continuous Bag of Words (CBOW) Model:
o The CBOW model tries to predict a target word given its surrounding context words.
o The context is typically a fixed-size window of words that appear around the target
word.
Steps in Word2Vec
Skip-gram:
o Works well for smaller datasets and rare words.
o Focuses on maximizing the prediction of the context words for a given target word.
CBOW:
o Generally works better for larger datasets.
o Predicts the target word based on surrounding context words.
Steps 1-5 below prepare the corpus, tokenize it, and train and save a Skip-gram model; the corpus sentences are illustrative placeholders.
import nltk
from gensim.models import Word2Vec
nltk.download('punkt')

# Example corpus (illustrative sentences)
corpus = [
    "machine learning is a subset of artificial intelligence",
    "machine learning algorithms learn patterns from data",
    "artificial intelligence and machine learning run on computer algorithms"
]

# Tokenize each sentence into lowercase word tokens
tokenized_corpus = [nltk.word_tokenize(sentence.lower()) for sentence in corpus]
print(tokenized_corpus)

# Train a Skip-gram model (sg=1) with 100-dimensional vectors and a context window of 3
model = Word2Vec(sentences=tokenized_corpus, vector_size=100, window=3, sg=1, min_count=1)
model.save("word2vec.model")
print(model.wv['machine'])
In this example, the model uses 100-dimensional vectors (vector_size=100), a context window of 3, and sg=1 to select the Skip-gram architecture. You can look up the learned vector for any word in the vocabulary:
vector = model.wv['learning']
print(vector)
Step 6: Finding Similar Words
We can also find words that are similar to a given word by computing cosine similarity between
their embeddings.
similar_words = model.wv.most_similar('machine', topn=3)
print(similar_words)
Output:
[('learning', 0.9832044243812561),
('is', 0.7953251008987427),
('a', 0.7521315813064575)]
This shows that "learning", "is", and "a" are the most similar words to "machine" based on the context in
the corpus.
You can also access the vector representation of a word. For example, let's get the vector for the
word "machine":
machine_vector = model.wv['machine']
print(machine_vector)
This will print out the 100-dimensional vector for the word "machine".
We can also compute the similarity between two words. For example, let's find the similarity
between "machine" and "learning":
Output:
Similarity between 'machine' and 'learning': 0.9403461818695068
This indicates a high similarity between the words "machine" and "learning" based on the
context in the corpus.
A classic example of using word embeddings is solving word analogies. For example, "king" is
to "man" as "queen" is to what? Let's do this using Word2Vec.
# Find the word that completes the analogy: "king" is to "man" as "queen" is to ?
# (assumes these words are in the model's vocabulary)
analogy = model.wv.most_similar(positive=['queen', 'man'], negative=['king'], topn=1)
print(analogy)
Output:
[('woman', 0.8701280355453491)]
In this case, the model identifies "woman" as the word that completes the analogy: "king" is to "man" as "queen" is to "woman". Note that this only works when all four words appear in the training data; for a tiny corpus like ours, you would normally use a model trained on a much larger corpus.
Problem: Given a list of words, find the most similar words to "machine" using Word2Vec.
1. Data: We will use the following words: "machine", "learning", "artificial", "intelligence",
"data", "computer", "algorithm".
2. Goal: Find which word is most similar to "machine" based on the learned embeddings.
Solution:
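A minimal sketch of the solution, reusing the Skip-gram model trained above (the candidate list and ranking loop are illustrative):
candidates = ["learning", "artificial", "intelligence", "data", "computer", "algorithm"]

# Keep only candidates that made it into the model's vocabulary
in_vocab = [w for w in candidates if w in model.wv]

# Rank the candidates by cosine similarity to "machine"
ranked = sorted(in_vocab, key=lambda w: model.wv.similarity('machine', w), reverse=True)
for word in ranked:
    print(word, model.wv.similarity('machine', word))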
The output lists the candidate words ranked by their cosine similarity to "machine".
Summary
Word2Vec transforms words into continuous vector spaces where semantically similar words are located close to each other.
Skip-gram and CBOW are two approaches for training Word2Vec.
Word embeddings can be used to compute similarity between words, find analogies
(e.g., "king" - "man" + "woman" = "queen"), and improve various NLP tasks.
Word Similarity: We computed the similarity between the words "machine" and
"learning".
Word Analogy: We solved an analogy problem, where the model identified that "king" is to "man" as "queen" is to "woman".
Applications of Word2Vec:
Semantic search: Finding the most relevant results based on word meanings.
Text classification: Converting words into vectors that can be used as features for
machine learning models.
Sentiment analysis: Understanding the sentiment behind a given text using word vectors.
What is CBOW?
The Continuous Bag of Words (CBOW) model predicts a target word given a context (the
surrounding words). This model is the opposite of the Skip-gram model, which predicts context
words given a target word.
Steps:
1. Prepare a corpus.
2. Tokenize the text.
3. Train a CBOW model using Word2Vec.
4. Use the trained model to find similar words, word vectors, and perform other tasks.
If you haven't installed the gensim library yet, use the following command:
pip install gensim
# Sample corpus (illustrative sentences)
corpus = [
    "machine learning is a branch of artificial intelligence",
    "artificial intelligence and machine learning rely on data and algorithms",
    "a computer algorithm learns patterns from data"
]
Step 3: Tokenize the Text
Tokenizing the text means breaking each sentence into words (tokens). We will use the nltk
library for this.
import nltk
nltk.download('punkt')

tokenized_corpus = [nltk.word_tokenize(sentence.lower()) for sentence in corpus]
print(tokenized_corpus)
The output is the corpus as a list of token lists, one list of words per sentence.
Now that we have the tokenized corpus, we will train the CBOW model using Gensim's
Word2Vec class. The key difference between CBOW and Skip-gram is that in CBOW, we set
sg=0, which means we're using the CBOW model. The context is used to predict the target word.
from gensim.models import Word2Vec

model = Word2Vec(sentences=tokenized_corpus, vector_size=100, window=3, sg=0, min_count=1)
model.save("cbow_model.model")
vector_size=100: The size of the word vectors (100-dimensional).
window=3: The size of the context window is 3 (look at 3 words before and after the target
word).
sg=0: This specifies we're using the CBOW model. If set to 1, it would use the Skip-gram
model.
min_count=1: Include words that occur at least once in the corpus.
Once the model is trained, we can use it to find the vector for a word, find similar words, and
perform word similarity tasks.
We can find the words that are most similar to a given word. For example, let's find words
similar to "machine":
similar_words = model.wv.most_similar('machine', topn=3)
print(similar_words)
Output:
[('learning', 0.9580739736557007),
('artificial', 0.9121836423873901),
('intelligence', 0.9092315435409546)]
This output tells us that the most similar words to "machine" in the trained CBOW model are
"learning", "artificial", and "intelligence".
You can access the vector for a word, such as "machine". The vector is a 100-dimensional
numerical representation of the word:
machine_vector = model.wv['machine']
print(machine_vector)
This will output the 100-dimensional vector for the word "machine".
Compute Word Similarity
We can calculate the similarity between two words. For example, we can compute how similar
"machine" is to "learning":
The printed score is high (about 0.96 for this corpus), which means that "machine" and "learning" are highly similar in the context of the corpus.
Word embeddings can be used to solve analogy problems, such as "king" is to "man" as "queen"
is to what? Using CBOW, we can find the word that completes the analogy:
# Assumes these words are in the model's vocabulary
analogy = model.wv.most_similar(positive=['queen', 'man'], negative=['king'], topn=1)
print(analogy)
Output:
[('woman', 0.873615026473999)]
This result shows that "woman" completes the analogy: "king" is to "man" as "queen" is to "woman".
1. Training the CBOW Model: We trained the CBOW model using a small text corpus. The model
learned word embeddings by predicting a target word from its context (the surrounding words).
2. Similar Words: We found words similar to "machine" (like "learning", "artificial", and
"intelligence") based on their context in the corpus.
3. Word Vectors: We accessed the vector representation for the word "machine", which is a 100-
dimensional vector.
4. Word Similarity: We computed the similarity between the words "machine" and "learning",
which had a high similarity score of 0.96.
5. Word Analogy: We solved the analogy "king" is to "man" as "queen" is to "woman" using the CBOW model.
GloVe (Global Vectors for Word Representation)
GloVe is another popular technique for generating word embeddings, like Word2Vec. GloVe,
however, is based on global word-word co-occurrence statistics from a corpus, unlike
Word2Vec, which is based on local context windows. GloVe aims to factorize the word co-
occurrence matrix to capture the relationships between words in a low-dimensional vector space.
In this guide, we'll walk through how GloVe works and how to implement it with an example.
GloVe is based on the idea that the meaning of a word can be inferred from its co-occurrence
with other words in a large corpus. Instead of looking at a fixed window of words as in
Word2Vec, GloVe looks at global co-occurrence counts across the entire corpus.
The main intuition is that the frequency of co-occurrence between words gives an indication of
their semantic similarity. Words that often appear together in a sentence or a document will have
similar embeddings.
GloVe Objective:
Given a large corpus of text, GloVe tries to find word vectors that minimize the following
objective:
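For a vocabulary of size $V$, let $X_{ij}$ be the number of times word $j$ occurs in the context of word $i$. GloVe minimizes

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

where $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, and $f$ is a weighting function that down-weights very frequent co-occurrences.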
The objective function minimizes the difference between the dot product of two word vectors (plus bias terms) and the logarithm of their co-occurrence count.
We will use the Gensim library and Pre-trained GloVe Vectors for simplicity. Although GloVe
can be trained from scratch, using pre-trained vectors is often faster and easier for most use
cases.
Step 1: Install the Required Libraries
First, you need to install the required libraries if you haven't already:
pip install gensim numpy
We can use the GloVe embeddings available online, such as those from the GloVe website. For
this example, let's assume we're using the GloVe 6B model, which provides 100-dimensional
embeddings trained on 6 billion tokens.
Once you've downloaded the glove.6B.100d.txt file, you can load it into a Gensim model.
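A minimal sketch for this step; the function name load_glove_vectors and the local file path are illustrative:
import numpy as np

def load_glove_vectors(path):
    # Read the GloVe text file into a {word: vector} dictionary
    embeddings = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.split()
            embeddings[parts[0]] = np.asarray(parts[1:], dtype='float32')
    return embeddings

glove_vectors = load_glove_vectors('glove.6B.100d.txt')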
This function loads the GloVe vectors into a dictionary where the keys are words, and the values
are their corresponding 100-dimensional vector representations.
Once you've loaded the GloVe vectors, you can find words similar to a given word by
calculating the cosine similarity between the vectors of two words.
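A minimal sketch, assuming the glove_vectors dictionary and numpy import from the previous step (the helper names are illustrative):
def cosine_similarity(a, b):
    # Cosine of the angle between two vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def most_similar(word, embeddings, topn=5):
    target = embeddings[word]
    scores = {w: cosine_similarity(target, v) for w, v in embeddings.items() if w != word}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:topn]

for word, score in most_similar('king', glove_vectors):
    print(f"{word}: {score:.4f}")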
Example Output:
queen: 0.8117
prince: 0.7421
royalty: 0.7392
monarch: 0.7204
emperor: 0.7113
As expected, the most similar words to "king" are "queen", "prince", and other royalty-related
terms.
Another interesting task you can perform with word embeddings is solving word analogies like
"king" is to "queen" as "man" is to what?
def analogy(word_a, word_b, word_c, embeddings):
    # Solve "word_a is to word_b as word_c is to ?" via vector arithmetic (b - a + c)
    if any(w not in embeddings for w in (word_a, word_b, word_c)):
        return None
    target = embeddings[word_b] - embeddings[word_a] + embeddings[word_c]
    closest_word, closest_similarity = None, -1
    for word, vector in embeddings.items():
        if word not in (word_a, word_b, word_c):
            similarity = cosine_similarity(target, vector)
            if similarity > closest_similarity:
                closest_similarity, closest_word = similarity, word
    return closest_word

print(analogy('king', 'queen', 'man', glove_vectors))
Example Output: with the glove.6B.100d vectors, the expected answer is "woman".
Summary
1. GloVe (Global Vectors for Word Representation) is an unsupervised learning algorithm for
obtaining vector representations of words by factorizing a word co-occurrence matrix.
2. It captures both local context (similar to Word2Vec) and global co-occurrence statistics.
3. We can use pre-trained GloVe embeddings to generate word vectors and perform tasks like:
o Finding similar words using cosine similarity.
o Solving word analogies through vector arithmetic.
4. In this example, we used pre-trained GloVe vectors (from the 6B corpus with 100-dimensional
embeddings) to load and work with word vectors.
GloVe is particularly useful for capturing semantic relationships in large corpora, and it can be
fine-tuned or extended for various NLP tasks.
BERT is a powerful pre-trained model introduced by Google, and it's one of the most popular
models for Natural Language Processing (NLP). BERT is different from traditional word
embeddings like Word2Vec and GloVe because it uses a transformer architecture and is pre-
trained on a large corpus of text. Unlike these earlier models that learn a single embedding for
each word, BERT learns context-dependent embeddings, meaning the representation of a word
depends on the surrounding words in the sentence.
1. Bidirectional Context: BERT processes text from both directions (left-to-right and right-
to-left), which gives it richer contextual understanding compared to previous models that
only processed text in one direction.
2. Masked Language Model (MLM): During pre-training, BERT is trained to predict
randomly masked words within a sentence based on the context around them (see the
short example after this list).
3. Next Sentence Prediction (NSP): BERT is also trained to predict whether one sentence
logically follows another, which helps in tasks like question answering and sentence
classification.
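To see the masked language model objective in action, Hugging Face's pipeline API can fill in a masked token; a small illustrative sketch (the example sentence is made up):
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the [MASK] token from the words on both sides of it
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))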
Applications of BERT
BERT can be fine-tuned for tasks such as text classification, named entity recognition, and question answering. Here, we'll implement it for a sentiment analysis task, where we will classify movie reviews as positive or negative.
We will use the transformers library by Hugging Face and torch for the implementation.
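If these libraries (plus the datasets library used below) are not installed yet:
pip install transformers torch datasets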
We will use the pre-trained BERT base uncased model and its tokenizer. The tokenizer
converts text into token IDs that BERT can process.
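A minimal sketch of this step:
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)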
Here, we load BERT with two output labels (num_labels=2) because we're doing binary classification
(positive/negative sentiment).
For simplicity, let's use a small dataset. You can use the IMDb dataset, which is a commonly
used sentiment analysis dataset for movie reviews. We will use the datasets library to load it
easily.
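A minimal sketch using the datasets library:
from datasets import load_dataset

dataset = load_dataset('imdb')
print(dataset['train'][0])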
Each item in the dataset has two fields: text (the review text) and label (0 for negative, 1 for positive).
To feed the data into BERT, we need to tokenize the text into the format that BERT understands.
BERT expects the text to be tokenized into sub-word tokens, and we will also add the necessary
padding and truncation to ensure all sequences are of the same length.
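A sketch of this preprocessing step, assuming the dataset and tokenizer objects from above:
def tokenize_function(examples):
    # Sub-word tokenize, then pad/truncate every review to BERT's 512-token limit
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=512)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Expose the columns BERT needs as PyTorch tensors
tokenized_datasets.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])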
Here, we:
Tokenize the text.
Pad or truncate all sequences to a length of 512 tokens (which is the maximum BERT can
handle).
Convert the data into the format that can be used by PyTorch (e.g., tensors for
input_ids, attention_mask, and label).
Now that we have the data ready, we can use the PyTorch Trainer API from the transformers
library to fine-tune BERT on our sentiment classification task.
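A sketch of the fine-tuning step; the subset sizes and hyperparameters below are illustrative choices, not prescribed values:
import numpy as np
from transformers import Trainer, TrainingArguments

def compute_metrics(eval_pred):
    # Report accuracy alongside the default evaluation loss
    predictions = np.argmax(eval_pred.predictions, axis=-1)
    return {"accuracy": (predictions == eval_pred.label_ids).mean()}

# Small subsets keep the example fast (illustrative choice)
small_train = tokenized_datasets['train'].shuffle(seed=42).select(range(2000))
small_test = tokenized_datasets['test'].shuffle(seed=42).select(range(500))

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=2,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train,
    eval_dataset=small_test,
    compute_metrics=compute_metrics,
)

trainer.train()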
In this step:
We configure the training process (e.g., batch size, number of epochs, logging).
The Trainer will automatically handle the training and evaluation loops.
Step 6: Evaluate the Model
Once the model is trained, we can evaluate its performance on the test set to check its accuracy.
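With the Trainer configured above, evaluation is a single call:
eval_results = trainer.evaluate()
print(eval_results)   # reports eval_loss and eval_accuracy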
Finally, we can use the fine-tuned model to make predictions on new, unseen text.
import torch

text = "This movie was absolutely wonderful!"   # illustrative example review
inputs = tokenizer(text, return_tensors='pt', truncation=True).to(model.device)
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
label = 'positive' if logits.argmax(dim=-1).item() == 1 else 'negative'
print(f"Prediction: {label}")
Here, the index of the largest logit (0 or 1) gives the predicted sentiment label.
If you're happy with the model, you can save it for later use.
# Save the trained model and tokenizer
model.save_pretrained('./sentiment_model')
tokenizer.save_pretrained('./sentiment_model')
This will save both the model and tokenizer to a directory, so you can load them later without
retraining.
Summary
1. Loaded a pre-trained BERT model and tokenizer using the Hugging Face transformers library.
2. Processed and tokenized the IMDb dataset for sentiment classification (positive/negative).
3. Fine-tuned the pre-trained BERT model on the IMDb dataset.
4. Evaluated the model's performance.
5. Used the trained model to make predictions on new text.
BERT has significantly improved performance on many NLP tasks, and its pre-trained models
can be fine-tuned for various applications like text classification, named entity recognition, and
question answering.
By leveraging transfer learning with pre-trained models like BERT, we can achieve state-of-
the-art performance in NLP tasks with minimal resources.
Sequence-to-Sequence (Seq2Seq) Theory
Sequence-to-Sequence (Seq2Seq) is a deep learning model architecture primarily used for tasks
where the input and output are both sequences. These tasks include machine translation, speech
recognition, and text summarization. In Seq2Seq models, both the input and output sequences
can vary in length.
1. Encoder:
o The encoder processes the input sequence one element at a time and converts it
into a fixed-size context vector (or a sequence of context vectors).
o This is often done with Recurrent Neural Networks (RNNs), Long Short-Term
Memory (LSTM) units, or Gated Recurrent Units (GRUs).
o The final hidden state of the encoder represents the compressed information of the
input sequence.
2. Decoder:
o The decoder takes the context vector (from the encoder) and generates the output
sequence step by step.
o Like the encoder, the decoder is also commonly an RNN, LSTM, or GRU, but it
can use the encoder's output at each step to predict the next element in the
sequence.
o The decoder generates one element of the output sequence at each time step and
uses the previous output element as input to the next step (a minimal sketch follows this list).
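A minimal PyTorch sketch of this encoder-decoder idea (GRU-based, no attention; the vocabulary size, dimensions, and start-of-sequence id are illustrative):
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src):
        # src: (batch, src_len); the final hidden state summarizes the whole input sequence
        _, hidden = self.rnn(self.embed(src))
        return hidden

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_token, hidden):
        # prev_token: (batch, 1); predict the next output token, one step at a time
        output, hidden = self.rnn(self.embed(prev_token), hidden)
        return self.out(output), hidden

# Toy usage: encode a batch of source token ids, then take one decoding step
encoder, decoder = Encoder(1000, 64, 128), Decoder(1000, 64, 128)
src = torch.randint(0, 1000, (2, 7))            # batch of 2 source sequences of length 7
hidden = encoder(src)                           # context vector produced by the encoder
token = torch.zeros(2, 1, dtype=torch.long)     # assumed start-of-sequence token id 0
logits, hidden = decoder(token, hidden)         # logits over the output vocabulary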
Variants of Seq2Seq:
Attention Mechanism:
o One of the most successful enhancements to Seq2Seq models is the attention
mechanism. Attention allows the decoder to focus on different parts of the input
sequence at each decoding step. It does this by creating a weighted context vector
that emphasizes relevant parts of the input.
o This mechanism has drastically improved translation quality by enabling models
to deal with longer sequences and maintain more context.
Transformer:
o The Transformer model, introduced in the paper Attention is All You Need
(Vaswani et al., 2017), is a more recent and highly successful variation of
Seq2Seq models. It eliminates the need for RNNs entirely and relies solely on
attention mechanisms for both the encoder and decoder.
o Transformers scale better and process sequences in parallel, unlike RNNs which
are sequential and harder to train on long sequences.
1. Machine Translation:
o Seq2Seq models were first popularized in neural machine translation (NMT),
where the goal is to translate a sentence in one language to a sentence in another
language.
o In NMT, the input and output are both sequences of words, and Seq2Seq models
are trained to map one sequence to another.
2. Text Summarization:
o For both extractive and abstractive summarization, Seq2Seq models help
compress a long document or article into a shorter summary while preserving
essential information.
3. Speech Recognition:
o In automatic speech recognition (ASR), Seq2Seq models map an input sequence
of acoustic features to a sequence of words.
4. Image Captioning:
o Seq2Seq models are used to generate descriptive captions for images. In this case,
the encoder could be a Convolutional Neural Network (CNN) extracting features
from an image, while the decoder is a Seq2Seq model generating text.
5. Chatbots and Dialogue Systems:
o Seq2Seq models are widely used in building conversational agents where a
sequence of words (the user's input) is mapped to a sequence of words (the agent's
response).
1. Word Embeddings:
o Words can be represented as dense vectors using methods like Word2Vec, GloVe
(Global Vectors for Word Representation), or FastText. These models capture
semantic relationships between words based on their co-occurrence in large
corpora.
o For instance, the words "king" and "queen" would have similar embeddings
because they are often used in similar contexts.
2. Contextual Word Representations:
o More advanced methods like ELMo (Embeddings from Language Models),
BERT (Bidirectional Encoder Representations from Transformers), and GPT
(Generative Pre-trained Transformer) generate word embeddings that are context-
dependent. This means the representation of a word changes depending on its
surrounding context in a sentence.
o For example, the word "bank" would have different vector representations
depending on whether it refers to a financial institution or the side of a river.
3. Sentence and Document Embeddings:
o Beyond individual words, entire sentences or documents can be represented as
vectors. Techniques like Doc2Vec and Universal Sentence Encoder provide
vector representations for longer text spans.
o These representations capture the overall meaning of the text, which can be used
for tasks like sentence similarity, sentiment analysis, and text classification.
4. Semantic Similarity:
o Vector semantics makes it possible to measure the similarity between words,
sentences, or documents by computing the distance (or similarity) between their
vector representations. Common metrics include cosine similarity or Euclidean
distance.
o For example, if the word "cat" is closer in vector space to "dog" than to "car," it
reflects the semantic closeness between "cat" and "dog" in terms of real-world
knowledge.
1. Word Similarity:
o By representing words as vectors, we can identify which words are semantically
similar or dissimilar. This is useful in tasks like synonym detection or clustering
words with similar meanings.
2. Document Retrieval:
o Vector semantics is foundational in Information Retrieval (IR), where documents
or queries are represented as vectors, and the system retrieves documents based
on similarity to a query vector.
3. Text Classification:
o Word and sentence embeddings are widely used in text classification tasks (e.g.,
spam detection, sentiment analysis). The idea is to represent a piece of text as a
vector and use machine learning algorithms to classify it into predefined
categories.
4. Machine Translation:
o Vector semantics also plays a key role in neural machine translation (NMT),
where both the source and target languages are embedded into a shared vector
space. The translation process is seen as finding the closest vector in the target
language's space.
5. Question Answering:
o Embeddings are used in QA systems to match a question to an appropriate answer
by comparing their vector representations and finding the closest match.
Cosine similarity is a metric used to measure how similar two vectors are, based on the cosine
of the angle between them. It is widely used in Natural Language Processing (NLP) to assess the
similarity between two text representations, such as word or document embeddings.
Formula:
cosine_similarity(A, B) = (A · B) / (||A|| × ||B||), where A · B is the dot product of the two vectors.
Magnitude of a Vector:
||A|| = sqrt(A1^2 + A2^2 + ... + An^2)
Cosine Similarity = 1: The vectors are identical or point in the same direction (i.e., perfect
similarity).
Cosine Similarity = 0: The vectors are orthogonal, meaning there is no similarity between them.
Cosine Similarity = -1: The vectors are diametrically opposed (i.e., they point in completely
opposite directions).
Let's consider two vectors, A and B, that represent two text documents; for simplicity, we'll use 2D vectors.
Interpretation:
The cosine similarity between the two vectors A and B is approximately 0.993, which is
very close to 1. This indicates that the two vectors are very similar, meaning the two documents
represented by these vectors have a high degree of similarity.
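A quick NumPy sketch of the computation, using two illustrative 2D vectors:
import numpy as np

A = np.array([1.0, 2.0])   # illustrative document vectors
B = np.array([2.0, 3.0])

cosine = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(f"Cosine similarity: {cosine:.3f}")   # about 0.992 for these two vectors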
In NLP, we often represent words or documents as vectors using methods like Word2Vec,
GloVe, or TF-IDF. Cosine similarity helps compare these vector representations to measure the
similarity between two words, phrases, or documents. It's especially useful because:
1. Magnitude is ignored: We focus on the direction of the vectors, which means we don't need to
worry about the length of the document or the frequency of words. This helps in comparing the
actual content (semantic similarity) rather than just the size.
2. Handling different lengths: Cosine similarity works well even when the documents or text
representations are of different lengths.
Consider two sentences that use different words but convey a similar meaning, for example "The movie was great" and "The film was excellent". By converting them into vectors (e.g., using Word2Vec or TF-IDF) and calculating cosine similarity, we can determine how similar the sentences are based on their meaning.
If you compute the cosine similarity between the vectors representing these sentences, you may
find that the similarity is close to 1, indicating they are semantically similar even though the
words differ.
In summary, cosine similarity is a simple but powerful metric for comparing vectors (such as
word or document embeddings) based on the cosine of the angle between them, making it ideal
for tasks like document similarity, recommendation systems, and more.
TF (Term Frequency)
Term Frequency (TF) is the number of times a term (word) appears in a document, divided by
the total number of terms in that document. This gives an indication of how frequently a word
appears relative to the other words in the same document.
IDF (Inverse Document Frequency)
Inverse Document Frequency (IDF) measures how common or rare a word is across the whole
corpus. Words that appear in many documents are considered less informative. Conversely,
words that appear in only a few documents are more likely to carry meaningful information.
TF-IDF
A term's TF-IDF score is the product of the two quantities: TF-IDF(t, d) = TF(t, d) × IDF(t). This score helps identify terms that are unique or significant to a particular document relative to the entire corpus.
Corpus (3 Documents):
Document 1: "The cat sat on the mat."
Document 2: "The dog sat on the log."
Document 3: "The cat and the dog ran."
Step 1: Calculate TF
We'll calculate the term frequency for each term in each document.
Step 2: Calculate IDF
Now, we'll calculate the IDF for each term across the three documents, using IDF(t) = log10(N / df(t)), where N = 3 is the number of documents and df(t) is the number of documents that contain the term t.
TF-IDF("the") = 0.33 * 0 = 0
TF-IDF("dog") = 0.17 * 0.176 = 0.030
TF-IDF("sat") = 0.17 * 0.176 = 0.030
TF-IDF("on") = 0.17 * 0.176 = 0.030
TF-IDF("log") = 0.17 * 1.099 = 0.187
TF-IDF("the") = 0.33 * 0 = 0
TF-IDF("cat") = 0.17 * 0.176 = 0.030
TF-IDF("and") = 0.17 * 1.099 = 0.187
TF-IDF("dog") = 0.17 * 0.176 = 0.030
TF-IDF("ran") = 0.17 * 1.099 = 0.187
Step 4: Summary of TF-IDF Values
Term   Doc 1   Doc 2   Doc 3
the    0       0       0
cat    0.030   -       0.030
sat    0.030   0.030   -
on     0.030   0.030   -
mat    0.081   -       -
dog    -       0.030   0.030
log    -       0.081   -
and    -       -       0.081
ran    -       -       0.081
Step 5: Interpretation
The word "the" has a TF-IDF value of 0 in all documents because it is very common and appears
in every document, so it doesn't provide much useful information about the specific content of
any document.
Words like "mat", "log", "and", and "ran" have higher TF-IDF values in the documents where
they appear because they are rarer and therefore more significant for those documents.
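A small Python sketch of the calculation, assuming the three example documents above and log base 10:
import math

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat and the dog ran".split(),
]

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)              # term frequency in this document
    df = sum(1 for d in corpus if term in d)     # number of documents containing the term
    idf = math.log10(len(corpus) / df)           # inverse document frequency
    return tf * idf

print(round(tf_idf("log", docs[1], docs), 3))    # about 0.08
print(round(tf_idf("the", docs[0], docs), 3))    # 0.0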
Use Cases of TF-IDF
Information retrieval and search: ranking documents by how relevant they are to a query.
Keyword extraction: identifying the most characteristic terms of a document.
Text classification and clustering: using TF-IDF vectors as features for machine learning models.
Document similarity: comparing TF-IDF vectors (e.g., with cosine similarity).
Machine Translation (MT) is the task of automatically translating text from one language to
another using computer algorithms. It involves various approaches and technologies, from rule-
based systems to statistical methods and, more recently, neural network-based models.
Over the years, machine translation has advanced significantly, especially with the introduction
of neural machine translation (NMT), which has set a new standard in terms of performance,
accuracy, and fluency.
Despite advances in MT, several issues remain that affect its performance:
1. Ambiguity:
o Many words and phrases have multiple meanings depending on context. For
example, the word "bank" could mean a financial institution or the side of a river.
o MT systems can struggle to disambiguate words correctly, especially in languages
with complex morphology.
2. Syntax and Structure Differences:
o Different languages have different syntactic rules. For example, the subject-
object-verb (SOV) order in Japanese versus subject-verb-object (SVO) in English.
o Translation systems must handle these differences to avoid producing unnatural
sentences in the target language.
3. Idiomatic Expressions:
o Idioms are phrases that do not have a direct translation but rather convey a
meaning that cannot be understood by translating the individual words.
o MT systems often fail to translate idiomatic expressions correctly, leading to
awkward or incorrect translations.
4. Cultural Context:
o Some expressions or concepts may not exist in all cultures or languages. For
instance, humor, slang, or cultural references can be difficult for MT systems to
interpret and convey properly.
5. Out-of-Vocabulary (OOV) Words:
o MT systems can struggle with words that don't appear in their training data,
leading to incorrect or nonsensical translations.
6. Low-Resource Languages:
o Many MT systems rely on large parallel corpora (pairs of translated text) for
training. However, many languages lack sufficient data for training high-quality
models, which can result in poorer translations for these languages.
7. Fluency and Naturalness:
o While NMT systems have significantly improved the fluency of translations, the
output can still sound unnatural or forced, especially in complex or less common
sentence structures.
MT Evaluation Metrics
Evaluating the performance of machine translation systems is challenging because it involves
both accuracy (correctness of the translation) and fluency (naturalness of the translation). Here
are the primary evaluation metrics used in MT:
1. TER (Translation Edit Rate):
o TER measures the number of edits (insertions, deletions, substitutions, or shifts) required to change a machine-generated translation into a human reference translation.
o Formula: TER = (number of edits) / (average number of words in the reference translation)
o Pros: Focuses on edits and correction, can handle large differences in phrasing.
o Cons: May not fully reflect fluency or other linguistic quality factors.
2. ROUGE (Recall-Oriented Understudy for Gisting Evaluation):
o ROUGE is a set of metrics used for evaluating the quality of summaries and
translations. It compares the overlap of n-grams (unigrams, bigrams, etc.) between
the machine-generated output and reference output.
o Pros: Similar to BLEU but focuses on recall rather than precision. Can evaluate
both content and fluency.
o Cons: Like BLEU, it only looks at surface-level matches and ignores semantic
meaning or word order.
3. Human Evaluation:
o Human evaluation involves human judges assessing the quality of translations
based on criteria like fluency, adequacy (how well the meaning is preserved), and
overall quality.
o Pros: Best reflects human judgment and the true quality of a translation.
o Cons: Time-consuming and expensive, subject to inter-annotator variability.
4. TERCOM (Translation Edit Rate with Combinations):
o TERCOM is a metric that uses TER but incorporates combinations of
translations to improve performance and account for various valid rephrasings of
sentences.
Automatic Metrics (like BLEU, METEOR, TER, ROUGE) are fast and computationally
efficient but may not always align with human judgment. They focus on surface-level
similarity and may fail to capture nuances like fluency and semantic meaning.
Human Evaluation is more accurate but time-consuming and subjective. It’s typically
used in conjunction with automatic metrics for more reliable evaluation.
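As a small illustration of an automatic metric, sentence-level BLEU can be computed with NLTK (the toy sentences are made up, and sentence-level BLEU on such short texts is only indicative):
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "is", "on", "the", "mat"]]
candidate = ["the", "cat", "sat", "on", "the", "mat"]

# Smoothing avoids zero scores when some higher-order n-grams have no matches
score = sentence_bleu(reference, candidate, smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")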
Common MT Issues:
Word ambiguity: Words with multiple meanings are translated incorrectly if context is
not well understood.
Idioms: Idiomatic expressions often do not translate directly.
Cultural context: Some words or expressions may have no equivalent in the target
language.
Out-of-Vocabulary (OOV) words: Words that do not appear in training data may result
in poor translations.
Low-resource languages: Lack of training data can lead to suboptimal performance.