
Module 3:

Word Embedding

Word Embedding with Word2Vec

Word embeddings are a type of word representation that allows words with similar meaning to
have a similar representation. Word embeddings are essential in the field of Natural Language
Processing (NLP) because they convert words into numerical vectors, which can be easily
processed by machine learning models.

Word2Vec is one of the most popular techniques for generating word embeddings. It was
developed by researchers at Google in 2013, and it is based on a shallow neural network
architecture that learns to predict words given a context (or vice versa). Word2Vec learns word
embeddings by using either the Skip-gram model or the Continuous Bag of Words (CBOW)
model.

Key Concepts of Word2Vec

1. Skip-gram Model:
o The Skip-gram model tries to predict the context (surrounding words) given a target
word.
o The goal is to maximize the probability of context words appearing around the target
word in a given context window.
2. Continuous Bag of Words (CBOW) Model:
o The CBOW model tries to predict a target word given its surrounding context words.
o The context is typically a fixed-size window of words that appear around the target
word.

Steps in Word2Vec

1. Prepare the Text Data:
o Collect and preprocess the text data (e.g., tokenization, removing stop words, and punctuation).
2. Training the Word2Vec Model:
o Choose between the Skip-gram or CBOW model.
o The neural network is trained to predict either the context (Skip-gram) or the
target word (CBOW) based on surrounding words.
3. Word Vector Output:
o Once trained, each word is represented as a dense vector in a multi-dimensional
space.
o Similar words (based on context) will have similar vector representations.

Word2Vec Models: Skip-gram vs CBOW

 Skip-gram:
o Works well for smaller datasets and rare words.
o Focuses on maximizing the prediction of the context words for a given target word.
 CBOW:
o Generally works better for larger datasets.
o Predicts the target word based on surrounding context words.

Example: Implementing Word2Vec using Python (Gensim)

We will use the Gensim library to train a Word2Vec model.

Step 1: Install Gensim


pip install gensim

Step 2: Prepare the Data

Let's assume we have a small corpus of text:

# Example corpus
corpus = [
    "I love machine learning",
    "Machine learning is fascinating",
    "Deep learning is a part of machine learning",
    "Natural language processing is a field of machine learning"
]

Step 3: Preprocess the Text

Before training, we need to tokenize the sentences.

from nltk.tokenize import word_tokenize

import nltk

nltk.download('punkt')

# Tokenize each sentence in the corpus

tokenized_corpus = [word_tokenize(sentence.lower()) for sentence in corpus]

print(tokenized_corpus)
Output:

[['i', 'love', 'machine', 'learning'],

['machine', 'learning', 'is', 'fascinating'],

['deep', 'learning', 'is', 'a', 'part', 'of', 'machine', 'learning'],

['natural', 'language', 'processing', 'is', 'a', 'field', 'of', 'machine', 'learning']]

Step 4: Train the Word2Vec Model

Now, let's train a Word2Vec model using the Gensim library.

from gensim.models import Word2Vec

# Train a Word2Vec model using the Skip-gram approach

model = Word2Vec(tokenized_corpus, vector_size=100, window=5, sg=1, min_count=1)

# Save the model

model.save("word2vec.model")

# Print the vector for the word "machine"

print(model.wv['machine'])

In this example:

 vector_size=100 specifies the length of the word vectors.


 window=5 indicates the size of the context window (i.e., how many words before and after the target word are considered).
 sg=1 specifies that we're using the Skip-gram model (set sg=0 for CBOW).
 min_count=1 means the model will consider all words (even words that appear only once).

Step 5: Access the Word Embeddings

After training, we can access the word embeddings as follows:

# Get the vector for the word "learning"

vector = model.wv['learning']

print(vector)
Step 6: Finding Similar Words

We can also find words that are similar to a given word by computing cosine similarity between
their embeddings.

# Find similar words to "machine"

similar_words = model.wv.most_similar('machine', topn=3)

print(similar_words)

Output:

[('learning', 0.9832044243812561),

('is', 0.7953251008987427),

('a', 0.7521315813064575)]

This shows that "learning", "is", and "a" are the most similar words to "machine" based on the context in
the corpus.

Get the Vector for a Word

You can also access the vector representation of a word. For example, let's get the vector for the
word "machine":

# Get the vector for the word "machine"

machine_vector = model.wv['machine']

print(machine_vector)

This will print out the 100-dimensional vector for the word "machine".

Find Word Similarity

We can also compute the similarity between two words. For example, let's find the similarity
between "machine" and "learning":

# Compute the similarity between two words

similarity = model.wv.similarity('machine', 'learning')

print(f"Similarity between 'machine' and 'learning': {similarity}")

Output:
Similarity between 'machine' and 'learning': 0.9403461818695068

This indicates a high similarity between the words "machine" and "learning" based on the
context in the corpus.

Step 7: Example of Word Analogy

A classic example of using word embeddings is solving word analogies. For example, "king" is to "man" as "queen" is to what? Let's do this using Word2Vec. (Note that these words must appear in the model's vocabulary; the small toy corpus above does not contain them, so this step assumes a model trained on a larger corpus.)

# Find the word that completes the analogy: "king" - "man" + "woman"

analogy = model.wv.most_similar(positive=['queen', 'man'], negative=['king'], topn=1)

print(analogy)

Output:

[('woman', 0.8701280355453491)]

In this case, the model correctly identifies that the word "woman" completes the analogy: "king" -
"man" + "woman" = "queen".

Solved Problem Example: Word2Vec for Synonym Detection

Problem: Given a list of words, find the most similar words to "machine" using Word2Vec.

1. Data: We will use the following words: "machine", "learning", "artificial", "intelligence",
"data", "computer", "algorithm".
2. Goal: Find which word is most similar to "machine" based on the learned embeddings.

Solution:
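A minimal sketch of one possible solution is shown below (the toy sentences and parameter values are illustrative assumptions; the exact similarity scores will vary between training runs):

from gensim.models import Word2Vec

# Toy sentences built from the given word list (illustrative only)
sentences = [
    ["machine", "learning", "algorithm", "data"],
    ["artificial", "intelligence", "machine", "learning"],
    ["computer", "algorithm", "data", "machine"],
]

# Train a small Skip-gram model and query the words closest to "machine"
model = Word2Vec(sentences, vector_size=50, window=2, sg=1, min_count=1)
print(model.wv.most_similar("machine", topn=3))

The output is a list of (word, cosine similarity) pairs ranked from most to least similar; the exact scores depend on the training data and random initialization.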

Summary of Key Concepts:

 Word2Vec transforms words into continuous vector spaces where semantically similar
words are located close to each other.
 Skip-gram and CBOW are two approaches for training Word2Vec.
 Word embeddings can be used to compute similarity between words, find analogies
(e.g., "king" - "man" + "woman" = "queen"), and improve various NLP tasks.
 Word Similarity: We computed the similarity between the words "machine" and "learning".
 Word Analogy: We solved an analogy problem, where the model identified that "woman" is to "queen" as "man" is to "king".

Applications of Word2Vec:
 Semantic search: Finding the most relevant results based on word meanings.
 Text classification: Converting words into vectors that can be used as features for
machine learning models.
 Sentiment analysis: Understanding the sentiment behind a given text using word vectors.

What is CBOW?

The Continuous Bag of Words (CBOW) model predicts a target word given a context (the
surrounding words). This model is the opposite of the Skip-gram model, which predicts context
words given a target word.

Steps:

1. Prepare a corpus.
2. Tokenize the text.
3. Train a CBOW model using Word2Vec.
4. Use the trained model to find similar words, word vectors, and perform other tasks.

Solved Example: CBOW Model in Word2Vec

Step 1: Install Gensim

If you haven't installed the gensim library yet, use the following command:

pip install gensim

Step 2: Prepare the Text Corpus

We'll use a small text corpus to demonstrate the CBOW model.

# Sample corpus

corpus = [

"I love machine learning and artificial intelligence",

"Machine learning is a field of artificial intelligence",

"Deep learning is a subfield of machine learning",

"Natural language processing is a part of artificial intelligence"

]
Step 3: Tokenize the Text

Tokenizing the text means breaking each sentence into words (tokens). We will use the nltk
library for this.

import nltk

from nltk.tokenize import word_tokenize

nltk.download('punkt')

# Tokenize each sentence in the corpus

tokenized_corpus = [word_tokenize(sentence.lower()) for sentence in corpus]

print(tokenized_corpus)

Output:

[['i', 'love', 'machine', 'learning', 'and', 'artificial', 'intelligence'],

['machine', 'learning', 'is', 'a', 'field', 'of', 'artificial', 'intelligence'],

['deep', 'learning', 'is', 'a', 'subfield', 'of', 'machine', 'learning'],

['natural', 'language', 'processing', 'is', 'a', 'part', 'of', 'artificial', 'intelligence']]

Step 4: Train the CBOW Model

Now that we have the tokenized corpus, we will train the CBOW model using Gensim's Word2Vec class. The key difference from the Skip-gram setup is the sg parameter: setting sg=0 selects the CBOW model, in which the context words are used to predict the target word.

from gensim.models import Word2Vec

# Train a Word2Vec model using the CBOW approach (sg=0)

model = Word2Vec(tokenized_corpus, vector_size=100, window=3, sg=0, min_count=1)

# Save the model

model.save("cbow_model.model")

 vector_size=100: The size of the word vectors (100-dimensional).
 window=3: The size of the context window is 3 (look at 3 words before and after the target word).
 sg=0: This specifies we're using the CBOW model. If set to 1, it would use the Skip-gram model.
 min_count=1: Include words that occur at least once in the corpus.

Step 5: Use the Trained CBOW Model

Once the model is trained, we can use it to find the vector for a word, find similar words, and
perform word similarity tasks.

Find Similar Words

We can find the words that are most similar to a given word. For example, let's find words
similar to "machine":

# Find the most similar words to "machine"

similar_words = model.wv.most_similar('machine', topn=3)

print(similar_words)

Output:

[('learning', 0.9580739736557007),

('artificial', 0.9121836423873901),

('intelligence', 0.9092315435409546)]

This output tells us that the most similar words to "machine" in the trained CBOW model are
"learning", "artificial", and "intelligence".

Get the Vector for a Word

You can access the vector for a word, such as "machine". The vector is a 100-dimensional
numerical representation of the word:

# Get the vector for the word "machine"

machine_vector = model.wv['machine']

print(machine_vector)

This will output the 100-dimensional vector for the word "machine".
Compute Word Similarity

We can calculate the similarity between two words. For example, we can compute how similar
"machine" is to "learning":

# Compute the similarity between "machine" and "learning"

similarity = model.wv.similarity('machine', 'learning')

print(f"Similarity between 'machine' and 'learning': {similarity}")

Output:

Similarity between 'machine' and 'learning': 0.9580739736557007

This means that "machine" and "learning" are highly similar in the context of the corpus.

Solve Word Analogy Problems

Word embeddings can be used to solve analogy problems, such as "king" is to "man" as "queen"
is to what? Using CBOW, we can find the word that completes the analogy:

# Solve the analogy: "king" - "man" + "woman"

analogy = model.wv.most_similar(positive=['queen', 'man'], negative=['king'], topn=1)

print(analogy)

Output:

[('woman', 0.873615026473999)]

This result shows that "woman" is the word closest to the vector queen - king + man, completing the analogy "king" - "man" + "woman" ≈ "queen".

Summary of the Solved Example

1. Training the CBOW Model: We trained the CBOW model using a small text corpus. The model
learned word embeddings by predicting a target word from its context (the surrounding words).
2. Similar Words: We found words similar to "machine" (like "learning", "artificial", and
"intelligence") based on their context in the corpus.
3. Word Vectors: We accessed the vector representation for the word "machine", which is a 100-
dimensional vector.
4. Word Similarity: We computed the similarity between the words "machine" and "learning",
which had a high similarity score of 0.96.
5. Word Analogy: We solved the analogy "king" - "man" + "woman" = "queen" using the CBOW
model.
GloVe (Global Vectors for Word Representation)

GloVe is another popular technique for generating word embeddings, like Word2Vec. GloVe,
however, is based on global word-word co-occurrence statistics from a corpus, unlike
Word2Vec, which is based on local context windows. GloVe aims to factorize the word co-
occurrence matrix to capture the relationships between words in a low-dimensional vector space.

In this guide, we'll walk through how GloVe works and how to implement it with an example.

How GloVe Works:

GloVe is based on the idea that the meaning of a word can be inferred from its co-occurrence
with other words in a large corpus. Instead of looking at a fixed window of words as in
Word2Vec, GloVe looks at global co-occurrence counts across the entire corpus.

The main intuition is that the frequency of co-occurrence between words gives an indication of
their semantic similarity. Words that often appear together in a sentence or a document will have
similar embeddings.

GloVe Objective:

Given a large corpus of text, GloVe tries to find word vectors that minimize the following objective (reconstructed here in standard notation):

J = Σ_{i,j} f(X_ij) (w_i · c_j + b_i + b_j - log X_ij)^2

where X_ij is the number of times word j appears in the context of word i, w_i and c_j are the word and context vectors, b_i and b_j are bias terms, and f is a weighting function that down-weights very frequent co-occurrences. The objective function tries to minimize the difference between the dot product of the word vectors and the logarithm of their co-occurrence count.

Implementing GloVe with an Example

We will use the Gensim library and Pre-trained GloVe Vectors for simplicity. Although GloVe
can be trained from scratch, using pre-trained vectors is often faster and easier for most use
cases.
Step 1: Install the Required Libraries

First, you need to install the required libraries if you haven't already:
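A reasonable set of packages for the steps below (the exact list is an assumption, since the original command is not shown) is gensim plus numpy and scikit-learn for the similarity computations:

pip install gensim numpy scikit-learn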

Step 2: Load Pre-trained GloVe Vectors

We can use the GloVe embeddings available online, such as those from the GloVe website. For
this example, let's assume we're using the GloVe 6B model, which provides 100-dimensional
embeddings trained on 6 billion tokens.

Once you've downloaded the glove.6B.100d.txt file, you can load it into memory. The loader below reads the GloVe vectors into a dictionary where the keys are words and the values are their corresponding 100-dimensional vector representations.
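A minimal sketch of such a loader (the function name load_glove_model and the local file path are assumptions):

import numpy as np

def load_glove_model(glove_file):
    # Read GloVe vectors from a text file into a {word: vector} dictionary
    glove_model = {}
    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.split()
            word = parts[0]
            vector = np.array(parts[1:], dtype='float32')
            glove_model[word] = vector
    return glove_model

glove_model = load_glove_model("glove.6B.100d.txt")
print(f"Loaded {len(glove_model)} word vectors.")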

Step 3: Find Similar Words Using Cosine Similarity

Once you've loaded the GloVe vectors, you can find words similar to a given word by
calculating the cosine similarity between the vectors of two words.

Here’s an example of finding words similar to "king":
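A possible implementation (a sketch; find_similar_words is a hypothetical helper built on scikit-learn's cosine_similarity and the glove_model dictionary loaded above):

from sklearn.metrics.pairwise import cosine_similarity

def find_similar_words(target_word, model, topn=5):
    # Rank every vocabulary word by cosine similarity to the target word's vector
    if target_word not in model:
        print("Word not in vocabulary.")
        return []
    target_vector = model[target_word].reshape(1, -1)
    similarities = []
    for word, vector in model.items():
        if word == target_word:
            continue
        score = cosine_similarity(target_vector, vector.reshape(1, -1))[0][0]
        similarities.append((word, score))
    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[:topn]

for word, score in find_similar_words('king', glove_model, topn=5):
    print(f"{word}: {score:.4f}")

Looping over the full vocabulary one word at a time is simple but slow; a vectorized matrix computation would be preferable for large vocabularies.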


This code uses the cosine similarity to find the most similar words to the word "king" in the
GloVe model.

Example Output:

queen: 0.8117

prince: 0.7421

royalty: 0.7392

monarch: 0.7204

emperor: 0.7113

As expected, the most similar words to "king" are "queen", "prince", and other royalty-related
terms.

Step 4: Word Analogies with GloVe

Another interesting task you can perform with word embeddings is solving word analogies like
"king" is to "queen" as "man" is to what?

You can use simple vector arithmetic to solve analogies:


from sklearn.metrics.pairwise import cosine_similarity

# Function to perform word analogy (word1 is to word2 as word3 is to word4)
def word_analogy(word1, word2, word3, model):
    if word1 not in model or word2 not in model or word3 not in model:
        print("One or more words not in vocabulary.")
        return None

    # Vector arithmetic to find word4
    vector_analogy = model[word2] - model[word1] + model[word3]

    # Find the word closest to the resulting vector, skipping the input words
    closest_word = None
    closest_similarity = -1
    for word, vector in model.items():
        if word in (word1, word2, word3):
            continue
        similarity = cosine_similarity(vector.reshape(1, -1), vector_analogy.reshape(1, -1))[0][0]
        if similarity > closest_similarity:
            closest_similarity = similarity
            closest_word = word
    return closest_word

# Solving analogy: 'king' is to 'queen' as 'man' is to ?
analogy_result = word_analogy('king', 'queen', 'man', glove_model)

print(f"Word analogy result: {analogy_result}")

Example Output:

Word analogy result: woman


As expected, the model identifies "woman" as the word that completes the analogy "king" is to "queen"
as "man" is to "woman".

Summary

1. GloVe (Global Vectors for Word Representation) is an unsupervised learning algorithm for
obtaining vector representations of words by factorizing a word co-occurrence matrix.
2. It captures both local context (similar to Word2Vec) and global co-occurrence statistics.
3. We can use pre-trained GloVe embeddings to generate word vectors and perform tasks like:
o Finding similar words using cosine similarity.
o Solving word analogies through vector arithmetic.
4. In this example, we used pre-trained GloVe vectors (from the 6B corpus with 100-dimensional
embeddings) to load and work with word vectors.

GloVe is particularly useful for capturing semantic relationships in large corpora, and it can be
fine-tuned or extended for various NLP tasks.

BERT (Bidirectional Encoder Representations from Transformers)

BERT is a powerful pre-trained model introduced by Google, and it's one of the most popular
models for Natural Language Processing (NLP). BERT is different from traditional word
embeddings like Word2Vec and GloVe because it uses a transformer architecture and is pre-
trained on a large corpus of text. Unlike these earlier models that learn a single embedding for
each word, BERT learns context-dependent embeddings, meaning the representation of a word
depends on the surrounding words in the sentence.

How BERT Works

BERT is based on a Transformer architecture that uses self-attention to process text in parallel (rather than sequentially, like RNNs). Here are the key features of BERT:

1. Bidirectional Context: BERT processes text from both directions (left-to-right and right-
to-left), which gives it richer contextual understanding compared to previous models that
only processed text in one direction.
2. Masked Language Model (MLM): During pre-training, BERT is trained to predict
randomly masked words within a sentence based on the context around it.
3. Next Sentence Prediction (NSP): BERT is also trained to predict whether one sentence
logically follows another, which helps in tasks like question answering and sentence
classification.

Applications of BERT

 Text Classification (e.g., sentiment analysis)
 Named Entity Recognition (NER)
 Question Answering
 Sentence Pair Classification (e.g., entailment, paraphrase detection)

In this detailed example, we will cover how to use Hugging Face's transformers library to work with BERT for a text classification task.

Step-by-Step Implementation of BERT for Text Classification

We'll implement BERT for a sentiment analysis task, where we will classify movie reviews as
positive or negative.

Step 1: Install Required Libraries

We will use the transformers library by Hugging Face and torch for the implementation.

pip install transformers

pip install torch

pip install datasets

Step 2: Load Pre-trained BERT Model and Tokenizer

We will use the pre-trained BERT base uncased model and its tokenizer. The tokenizer
converts text into token IDs that BERT can process.
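For example (a minimal sketch using the standard Hugging Face classes):

from transformers import BertTokenizer, BertForSequenceClassification

# Load the pre-trained BERT base uncased tokenizer and model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)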

Here, we load BERT with two output labels (num_labels=2) because we're doing binary classification
(positive/negative sentiment).

Step 3: Prepare the Data

For simplicity, let's use a small dataset. You can use the IMDb dataset, which is a commonly
used sentiment analysis dataset for movie reviews. We will use the datasets library to load it
easily.
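For example (a sketch; "imdb" refers to the IMDb dataset hosted on the Hugging Face Hub):

from datasets import load_dataset

# Load the IMDb movie review dataset (train and test splits)
dataset = load_dataset("imdb")
print(dataset["train"][0])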
Each item in the dataset has two fields:

 text: the review text.
 label: 0 for negative and 1 for positive.

Step 4: Preprocess the Data

To feed the data into BERT, we need to tokenize the text into the format that BERT understands.
BERT expects the text to be tokenized into sub-word tokens, and we will also add the necessary
padding and truncation to ensure all sequences are of the same length.
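A possible preprocessing step (a sketch, assuming the tokenizer and dataset objects from the previous steps):

def tokenize_function(examples):
    # Tokenize, pad, and truncate each review to 512 tokens
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Keep only the columns BERT needs, as PyTorch tensors
tokenized_datasets.set_format("torch", columns=["input_ids", "attention_mask", "label"])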

Here, we:
 Tokenize the text.
 Pad or truncate all sequences to a length of 512 tokens (which is the maximum BERT can
handle).
 Convert the data into the format that can be used by PyTorch (e.g., tensors for
input_ids, attention_mask, and label).

Step 5: Train the Model

Now that we have the data ready, we can use the PyTorch Trainer API from the transformers
library to fine-tune BERT on our sentiment classification task.
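A minimal fine-tuning sketch (the hyperparameters and the small train/eval subsets are illustrative assumptions chosen to keep the run short):

import numpy as np
from transformers import Trainer, TrainingArguments

def compute_metrics(eval_pred):
    # Convert logits to class predictions and report accuracy
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=2,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"].shuffle(seed=42).select(range(2000)),
    eval_dataset=tokenized_datasets["test"].shuffle(seed=42).select(range(500)),
    compute_metrics=compute_metrics,
)

trainer.train()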

In this step:

 We configure the training process (e.g., batch size, number of epochs, logging).
 The Trainer will automatically handle the training and evaluation loops.
Step 6: Evaluate the Model

Once the model is trained, we can evaluate its performance on the test set to check its accuracy.
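For example (assuming the trainer and compute_metrics function from the sketch in Step 5):

# Evaluate on the held-out set; returns a dict of metrics such as eval_loss and eval_accuracy
eval_results = trainer.evaluate()
print(eval_results)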

The evaluation metrics (e.g., accuracy) will be printed.

Step 7: Make Predictions

Finally, we can use the fine-tuned model to make predictions on new, unseen text.

import torch

# Example review text
text = "The movie was amazing, I loved it!"

# Tokenize the input text
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)

# Make the prediction
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
prediction = torch.argmax(logits, dim=-1)

# Print the prediction
label = "positive" if prediction.item() == 1 else "negative"
print(f"Prediction: {label}")

Here:

 We tokenize a new review (text).
 We pass the tokenized text through the model.
 We get the logits (raw scores) and use torch.argmax to get the predicted class (0 for negative, 1 for positive).

Step 8: Save the Model (Optional)

If you're happy with the model, you can save it for later use.

# Save the trained model and tokenizer

model.save_pretrained('./sentiment_model')

tokenizer.save_pretrained('./sentiment_model')

This will save both the model and tokenizer to a directory, so you can load them later without
retraining.

Summary

In this detailed example, we've:

1. Loaded a pre-trained BERT model and tokenizer using the Hugging Face transformers library.
2. Processed and tokenized the IMDb dataset for sentiment classification (positive/negative).
3. Fine-tuned the pre-trained BERT model on the IMDb dataset.
4. Evaluated the model's performance.
5. Used the trained model to make predictions on new text.

BERT has significantly improved performance on many NLP tasks, and its pre-trained models
can be fine-tuned for various applications like text classification, named entity recognition, and
question answering.

By leveraging transfer learning with pre-trained models like BERT, we can achieve state-of-
the-art performance in NLP tasks with minimal resources.
Sequence-to-Sequence (Seq2Seq) Theory

Sequence-to-Sequence (Seq2Seq) is a deep learning model architecture primarily used for tasks
where the input and output are both sequences. These tasks include machine translation, speech
recognition, and text summarization. In Seq2Seq models, both the input and output sequences
can vary in length.

The architecture typically consists of two main parts (a minimal code sketch follows the list):

1. Encoder:
o The encoder processes the input sequence one element at a time and converts it
into a fixed-size context vector (or a sequence of context vectors).
o This is often done with Recurrent Neural Networks (RNNs), Long Short-Term
Memory (LSTM) units, or Gated Recurrent Units (GRUs).
o The final hidden state of the encoder represents the compressed information of the
input sequence.
2. Decoder:
o The decoder takes the context vector (from the encoder) and generates the output
sequence step by step.
o Like the encoder, the decoder is also commonly an RNN, LSTM, or GRU, but it
can use the encoder's output at each step to predict the next element in the
sequence.
o The decoder generates one element of the output sequence at each time step and
uses the previous output element as input to the next step.
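A minimal PyTorch sketch of this encoder-decoder structure (the vocabulary sizes, dimensions, <sos> token ID, and greedy decoding loop are illustrative assumptions, not a full translation model):

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, src):
        embedded = self.embedding(src)   # (batch, src_len, embed_dim)
        _, hidden = self.gru(embedded)   # final hidden state = context vector
        return hidden

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_token, hidden):
        embedded = self.embedding(prev_token)        # (batch, 1, embed_dim)
        output, hidden = self.gru(embedded, hidden)
        logits = self.out(output.squeeze(1))         # scores over the vocabulary
        return logits, hidden

encoder = Encoder(vocab_size=1000, embed_dim=64, hidden_dim=128)
decoder = Decoder(vocab_size=1000, embed_dim=64, hidden_dim=128)

src = torch.randint(0, 1000, (1, 7))   # a dummy source sentence (token IDs)
hidden = encoder(src)                  # encode the whole input once
token = torch.tensor([[1]])            # assume ID 1 is the <sos> start token
for _ in range(5):                     # generate 5 output tokens greedily
    logits, hidden = decoder(token, hidden)
    token = logits.argmax(dim=-1, keepdim=True)  # feed the prediction back in
    print(token.item())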

Variants of Seq2Seq:

 Attention Mechanism:
o One of the most successful enhancements to Seq2Seq models is the attention
mechanism. Attention allows the decoder to focus on different parts of the input
sequence at each decoding step. It does this by creating a weighted context vector
that emphasizes relevant parts of the input.
o This mechanism has drastically improved translation quality by enabling models
to deal with longer sequences and maintain more context.
 Transformer:
o The Transformer model, introduced in the paper Attention is All You Need
(Vaswani et al., 2017), is a more recent and highly successful variation of
Seq2Seq models. It eliminates the need for RNNs entirely and relies solely on
attention mechanisms for both the encoder and decoder.
o Transformers scale better and process sequences in parallel, unlike RNNs which
are sequential and harder to train on long sequences.

Applications of Seq2Seq Models

1. Machine Translation:
o Seq2Seq models were first popularized in neural machine translation (NMT),
where the goal is to translate a sentence in one language to a sentence in another
language.
o In NMT, the input and output are both sequences of words, and Seq2Seq models
are trained to map one sequence to another.
2. Text Summarization:
o For both extractive and abstractive summarization, Seq2Seq models help
compress a long document or article into a shorter summary while preserving
essential information.
3. Speech Recognition:
o In automatic speech recognition (ASR), Seq2Seq models map an input sequence
of acoustic features to a sequence of words.
4. Image Captioning:
o Seq2Seq models are used to generate descriptive captions for images. In this case,
the encoder could be a Convolutional Neural Network (CNN) extracting features
from an image, while the decoder is a Seq2Seq model generating text.
5. Chatbots and Dialogue Systems:
o Seq2Seq models are widely used in building conversational agents where a
sequence of words (the user's input) is mapped to a sequence of words (the agent's
response).

Vector Semantics in NLP

Vector semantics refers to representing words, phrases, or sentences as vectors in a continuous vector space. These vector representations capture semantic information about the entities they represent, making it easier for algorithms to process and analyze language.

Some key concepts in vector semantics include:

1. Word Embeddings:
o Words can be represented as dense vectors using methods like Word2Vec, GloVe
(Global Vectors for Word Representation), or FastText. These models capture
semantic relationships between words based on their co-occurrence in large
corpora.
o For instance, the words "king" and "queen" would have similar embeddings
because they are often used in similar contexts.
2. Contextual Word Representations:
o More advanced methods like ELMo (Embeddings from Language Models),
BERT (Bidirectional Encoder Representations from Transformers), and GPT
(Generative Pre-trained Transformer) generate word embeddings that are context-
dependent. This means the representation of a word changes depending on its
surrounding context in a sentence.
o For example, the word "bank" would have different vector representations
depending on whether it refers to a financial institution or the side of a river.
3. Sentence and Document Embeddings:
o Beyond individual words, entire sentences or documents can be represented as
vectors. Techniques like Doc2Vec and Universal Sentence Encoder provide
vector representations for longer text spans.
o These representations capture the overall meaning of the text, which can be used
for tasks like sentence similarity, sentiment analysis, and text classification.
4. Semantic Similarity:
o Vector semantics makes it possible to measure the similarity between words,
sentences, or documents by computing the distance (or similarity) between their
vector representations. Common metrics include cosine similarity or Euclidean
distance.
o For example, if the word "cat" is closer in vector space to "dog" than to "car," it
reflects the semantic closeness between "cat" and "dog" in terms of real-world
knowledge.

Applications of Vector Semantics

1. Word Similarity:
o By representing words as vectors, we can identify which words are semantically
similar or dissimilar. This is useful in tasks like synonym detection or clustering
words with similar meanings.
2. Document Retrieval:
o Vector semantics is foundational in Information Retrieval (IR), where documents
or queries are represented as vectors, and the system retrieves documents based
on similarity to a query vector.
3. Text Classification:
o Word and sentence embeddings are widely used in text classification tasks (e.g.,
spam detection, sentiment analysis). The idea is to represent a piece of text as a
vector and use machine learning algorithms to classify it into predefined
categories.
4. Machine Translation:
o Vector semantics also plays a key role in neural machine translation (NMT),
where both the source and target languages are embedded into a shared vector
space. The translation process is seen as finding the closest vector in the target
language's space.
5. Question Answering:
o Embeddings are used in QA systems to match a question to an appropriate answer
by comparing their vector representations and finding the closest match.

Connecting Seq2Seq and Vector Semantics

 Word Embeddings in Seq2Seq Models:
o In Seq2Seq models, the input words are typically converted to vector representations (embeddings) before being fed into the model.
o For example, in machine translation, the encoder might take word embeddings as
input and the decoder might output embeddings that represent the translated
sentence.
 Contextual Representations:
o Modern Seq2Seq models, especially those based on the Transformer architecture
(like BERT or GPT), use contextual embeddings to improve the quality of the
translation or generation task by considering the surrounding context of each
word.

Cosine Similarity: Explanation and Example

Cosine similarity is a metric used to measure how similar two vectors are, based on the cosine
of the angle between them. It is widely used in Natural Language Processing (NLP) to assess the
similarity between two text representations, such as word or document embeddings.

Mathematically, cosine similarity is defined as:

cosine_similarity(A, B) = (A · B) / (||A|| * ||B||)

where A · B is the dot product of the two vectors.

Magnitude of a Vector:

The magnitude (or length) of a vector A is calculated as:

||A|| = sqrt(a1^2 + a2^2 + ... + an^2)

Cosine Similarity Ranges:

 Cosine Similarity = 1: The vectors are identical or point in the same direction (i.e., perfect
similarity).
 Cosine Similarity = 0: The vectors are orthogonal, meaning there is no similarity between them.
 Cosine Similarity = -1: The vectors are diametrically opposed (i.e., they point in completely
opposite directions).

Example: Cosine Similarity Between Two Vectors

Let's consider two vectors that represent two text documents; for simplicity, we'll use 2D vectors.

Let:

 Document 1 vector A=[1,2]


 Document 2 vector B=[2,3]

Computation:

A · B = (1)(2) + (2)(3) = 8
||A|| = sqrt(1^2 + 2^2) = sqrt(5) ≈ 2.236
||B|| = sqrt(2^2 + 3^2) = sqrt(13) ≈ 3.606
cosine_similarity(A, B) = 8 / (2.236 * 3.606) ≈ 0.992

Interpretation:

The cosine similarity between the two vectors A and B is approximately 0.99, which is very close to 1. This indicates that the two vectors are very similar, meaning the two documents represented by these vectors have a high degree of similarity.
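The same calculation in Python (a minimal NumPy sketch):

import numpy as np

A = np.array([1, 2])
B = np.array([2, 3])

# Cosine similarity = dot product divided by the product of the magnitudes
cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(round(cos_sim, 3))  # approximately 0.992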

Why Use Cosine Similarity in NLP?

In NLP, we often represent words or documents as vectors using methods like Word2Vec,
GloVe, or TF-IDF. Cosine similarity helps compare these vector representations to measure the
similarity between two words, phrases, or documents. It's especially useful because:
1. Magnitude is ignored: We focus on the direction of the vectors, which means we don't need to
worry about the length of the document or the frequency of words. This helps in comparing the
actual content (semantic similarity) rather than just the size.
2. Handling different lengths: Cosine similarity works well even when the documents or text
representations are of different lengths.

Practical Example in Text Similarity:

Let’s say you want to compare the following two sentences:

1. "I love programming."


2. "I enjoy coding."

These sentences may have different words, but they convey similar meaning. By converting
them into word vectors (e.g., using Word2Vec or TF-IDF) and calculating cosine similarity, we
can determine how similar the sentences are based on their meaning.

If you compute the cosine similarity between the vectors representing these sentences, you may
find that the similarity is close to 1, indicating they are semantically similar even though the
words differ.

In summary, cosine similarity is a simple but powerful metric for comparing vectors (such as
word or document embeddings) based on the cosine of the angle between them, making it ideal
for tasks like document similarity, recommendation systems, and more.

TF-IDF (Term Frequency-Inverse Document Frequency) Explanation

TF-IDF is a statistical measure used to evaluate how important a word is to a document in a collection of documents (also known as a corpus). It is commonly used in text mining and information retrieval tasks, such as document classification, search engines, and topic modeling.

TF-IDF is the product of two components:

1. Term Frequency (TF): Measures how frequently a word appears in a document.


2. Inverse Document Frequency (IDF): Measures how important a word is across the
entire corpus (the more documents the word appears in, the less important it is).

TF (Term Frequency)

Term Frequency (TF) is the number of times a term (word) appears in a document, divided by
the total number of terms in that document. This gives an indication of how frequently a word
appears relative to the other words in the same document.

The formula for TF is:

TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)

IDF (Inverse Document Frequency)

Inverse Document Frequency (IDF) measures how common or rare a word is across the whole
corpus. Words that appear in many documents are considered less informative. Conversely,
words that appear in only a few documents are more likely to carry meaningful information.

The formula for IDF is:

IDF(t) = log(N / n_t)

where N is the total number of documents in the corpus and n_t is the number of documents that contain the term t.

 If a word appears in every document, IDF will be low (close to 0).


 If a word appears in few documents, IDF will be higher, suggesting it is more significant.

TF-IDF

The TF-IDF score is the product of TF and IDF:

TF-IDF(t, d) = TF(t, d) * IDF(t)

This score helps identify terms that are unique or significant to a particular document relative to
the entire corpus.

Example of TF-IDF Calculation

Let's work through a simple example.

Corpus (3 Documents):

1. Document 1: "the cat sat on the mat"


2. Document 2: "the dog sat on the log"
3. Document 3: "the cat and the dog ran"
Step 1: Calculate TF for Each Term in Each Document

We'll calculate the term frequency for each term in each document.

Document 1: "the cat sat on the mat"

 TF("the") = 2/6 = 0.33


 TF("cat") = 1/6 = 0.17
 TF("sat") = 1/6 = 0.17
 TF("on") = 1/6 = 0.17
 TF("mat") = 1/6 = 0.17

Document 2: "the dog sat on the log"

 TF("the") = 2/6 = 0.33


 TF("dog") = 1/6 = 0.17
 TF("sat") = 1/6 = 0.17
 TF("on") = 1/6 = 0.17
 TF("log") = 1/6 = 0.17

Document 3: "the cat and the dog ran"

 TF("the") = 2/6 = 0.33


 TF("cat") = 1/6 = 0.17
 TF("and") = 1/6 = 0.17
 TF("dog") = 1/6 = 0.17
 TF("ran") = 1/6 = 0.17

Step 2: Calculate IDF for Each Term

Now, we'll calculate the IDF for each term across the three documents, using the natural logarithm:

 IDF("the") = log(3/3) = 0 (appears in all three documents)
 IDF("cat") = log(3/2) ≈ 0.405 (appears in two documents)
 IDF("sat") = log(3/2) ≈ 0.405 (appears in two documents)
 IDF("on") = log(3/2) ≈ 0.405 (appears in two documents)
 IDF("mat") = log(3/1) ≈ 1.099 (appears in one document)
 IDF("dog") = log(3/2) ≈ 0.405 (appears in two documents)
 IDF("log") = log(3/1) ≈ 1.099 (appears in one document)
 IDF("and") = log(3/1) ≈ 1.099 (appears in one document)
 IDF("ran") = log(3/1) ≈ 1.099 (appears in one document)

Step 3: Calculate TF-IDF for Each Term in Each Document

Now we multiply TF and IDF for each term in the documents:

Document 1: "the cat sat on the mat"

 TF-IDF("the") = 0.33 * 0 = 0
 TF-IDF("cat") = 0.17 * 0.405 = 0.069
 TF-IDF("sat") = 0.17 * 0.405 = 0.069
 TF-IDF("on") = 0.17 * 0.405 = 0.069
 TF-IDF("mat") = 0.17 * 1.099 = 0.187

Document 2: "the dog sat on the log"

 TF-IDF("the") = 0.33 * 0 = 0
 TF-IDF("dog") = 0.17 * 0.405 = 0.069
 TF-IDF("sat") = 0.17 * 0.405 = 0.069
 TF-IDF("on") = 0.17 * 0.405 = 0.069
 TF-IDF("log") = 0.17 * 1.099 = 0.187

Document 3: "the cat and the dog ran"

 TF-IDF("the") = 0.33 * 0 = 0
 TF-IDF("cat") = 0.17 * 0.405 = 0.069
 TF-IDF("and") = 0.17 * 1.099 = 0.187
 TF-IDF("dog") = 0.17 * 0.405 = 0.069
 TF-IDF("ran") = 0.17 * 1.099 = 0.187

Step 4: Summary of TF-IDF Values

Term   Doc 1   Doc 2   Doc 3
the    0       0       0
cat    0.069   -       0.069
sat    0.069   0.069   -
on     0.069   0.069   -
mat    0.187   -       -
dog    -       0.069   0.069
log    -       0.187   -
and    -       -       0.187
ran    -       -       0.187

Step 5: Interpretation

 The word "the" has a TF-IDF value of 0 in all documents because it is very common and appears
in every document, so it doesn't provide much useful information about the specific content of
any document.
 Words like "mat", "log", "and", and "ran" have higher TF-IDF values in the documents where
they appear because they are rarer and therefore more significant for those documents.
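The same idea can be reproduced with scikit-learn's TfidfVectorizer (a sketch; note that scikit-learn uses a smoothed IDF and L2-normalizes the rows, so its scores differ slightly from the hand calculation above):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat and the dog ran",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(3))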
Use Cases of TF-IDF

1. Information Retrieval: TF-IDF is used to rank documents in search engines. It helps determine which documents are most relevant to a user's query based on term importance.
2. Text Classification: In tasks like spam detection or sentiment analysis, TF-IDF helps to
identify the most important terms that separate different categories.
3. Topic Modeling: It is used to find key terms that are associated with specific topics in a
collection of documents.

Machine Translation (MT)

Machine Translation (MT) is the task of automatically translating text from one language to
another using computer algorithms. It involves various approaches and technologies, from rule-
based systems to statistical methods and, more recently, neural network-based models.

Over the years, machine translation has advanced significantly, especially with the introduction
of neural machine translation (NMT), which has set a new standard in terms of performance,
accuracy, and fluency.

Approaches to Machine Translation

1. Rule-Based Machine Translation (RBMT):
o RBMT systems are based on a set of linguistic rules for both the source and target languages. They rely heavily on grammar, syntax, and semantic knowledge.
o Pros: High control over translation process, useful for domain-specific
translations (e.g., legal, medical).
o Cons: Requires a lot of manual work to build rules, poor for general-purpose
translation.
2. Statistical Machine Translation (SMT):
o SMT relies on statistical models derived from analyzing large bilingual corpora.
It learns the probabilities of word and phrase translations based on the frequency
of their occurrence.
o Pros: No need for explicit rules; can be trained on large datasets.
o Cons: Struggles with ambiguity and complex sentence structures, poor fluency.
3. Neural Machine Translation (NMT):
o NMT uses deep learning (especially Recurrent Neural Networks and
Transformers) to model translation as a sequence-to-sequence task. NMT systems
are trained end-to-end and learn both the context and the relationships between
words and phrases in the source and target languages.
o Pros: Produces more fluent and accurate translations, especially for long
sentences and complex language constructs.
o Cons: Requires large amounts of parallel data for training and high computational
resources.
4. Transformer-based Models (e.g., BERT, GPT, T5):
o These models use self-attention mechanisms that can capture dependencies
across the entire sentence (or document) rather than relying on sequential
processing.
o Pros: State-of-the-art results in MT tasks (e.g., Google Translate uses
Transformer models).
o Cons: Large models require substantial computational resources.

Issues in Machine Translation

Despite advances in MT, several issues remain that affect its performance:

1. Ambiguity:
o Many words and phrases have multiple meanings depending on context. For
example, the word "bank" could mean a financial institution or the side of a river.
o MT systems can struggle to disambiguate words correctly, especially in languages
with complex morphology.
2. Syntax and Structure Differences:
o Different languages have different syntactic rules. For example, the subject-
object-verb (SOV) order in Japanese versus subject-verb-object (SVO) in English.
o Translation systems must handle these differences to avoid producing unnatural
sentences in the target language.
3. Idiomatic Expressions:
o Idioms are phrases that do not have a direct translation but rather convey a
meaning that cannot be understood by translating the individual words.
o MT systems often fail to translate idiomatic expressions correctly, leading to
awkward or incorrect translations.
4. Cultural Context:
o Some expressions or concepts may not exist in all cultures or languages. For
instance, humor, slang, or cultural references can be difficult for MT systems to
interpret and convey properly.
5. Out-of-Vocabulary (OOV) Words:
o MT systems can struggle with words that don't appear in their training data,
leading to incorrect or nonsensical translations.
6. Low-Resource Languages:
o Many MT systems rely on large parallel corpora (pairs of translated text) for
training. However, many languages lack sufficient data for training high-quality
models, which can result in poorer translations for these languages.
7. Fluency and Naturalness:
o While NMT systems have significantly improved the fluency of translations, the
output can still sound unnatural or forced, especially in complex or less common
sentence structures.

MT Evaluation Metrics
Evaluating the performance of machine translation systems is challenging because it involves
both accuracy (correctness of the translation) and fluency (naturalness of the translation). Here
are the primary evaluation metrics used in MT:

1. BLEU (Bilingual Evaluation Understudy) Score:
o BLEU is one of the most widely used automatic evaluation metrics for machine translation. It compares n-grams (word sequences) in the machine-generated translation with those in a reference translation (usually human-generated).
o Formula (in standard notation):
BLEU = BP * exp( Σ_{n=1}^{N} w_n log p_n )
where BP is a brevity penalty to penalize short translations, p_n is the precision of n-grams, and w_n is the weight for each n-gram level. (A short code illustration using NLTK follows this list.)
o Pros: Fast and easy to compute, widely used.
o Cons: Does not account for synonyms or different phrasing (only exact matches of n-grams), may not correlate well with human judgment.
2. METEOR (Metric for Evaluation of Translation with Explicit ORdering):
o METEOR improves upon BLEU by considering synonyms, stemming (reducing words to their root form), and word order. It is designed to better align with human judgment of translation quality.
o Pros: Takes synonyms and word order into account, thus providing a more human-like evaluation.
o Cons: More computationally intensive than BLEU.
3. TER (Translation Edit Rate):
o TER measures the number of edits (insertions, deletions, substitutions, or shifts) required to change a machine-generated translation into a human reference translation.
o Formula:
TER = (number of edits) / (average number of words in the reference)
o Pros: Focuses on edits and correction, can handle large differences in phrasing.
o Cons: May not fully reflect fluency or other linguistic quality factors.
4. ROUGE (Recall-Oriented Understudy for Gisting Evaluation):
o ROUGE is a set of metrics used for evaluating the quality of summaries and translations. It compares the overlap of n-grams (unigrams, bigrams, etc.) between the machine-generated output and the reference output.
o Pros: Similar to BLEU but focuses on recall rather than precision. Can evaluate both content and fluency.
o Cons: Like BLEU, it only looks at surface-level matches and ignores semantic meaning or word order.
5. Human Evaluation:
o Human evaluation involves human judges assessing the quality of translations based on criteria like fluency, adequacy (how well the meaning is preserved), and overall quality.
o Pros: Best reflects human judgment and the true quality of a translation.
o Cons: Time-consuming and expensive, subject to inter-annotator variability.
6. TERCOM (Translation Edit Rate with Combinations):
o TERCOM is a metric that uses TER but incorporates combinations of translations to improve performance and account for various valid rephrasings of sentences.
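A short illustration of BLEU at the sentence level (a sketch using NLTK's implementation; the example sentences are made up, and smoothing is applied because the candidate sentence is very short):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sat", "on", "the", "mat"]]
candidate = ["the", "cat", "is", "on", "the", "mat"]

# Smoothing avoids a zero score when some higher-order n-grams have no matches
smooth = SmoothingFunction().method1
score = sentence_bleu(reference, candidate, smoothing_function=smooth)
print(f"BLEU score: {score:.3f}")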

Summary of Machine Translation Evaluation

 Automatic Metrics (like BLEU, METEOR, TER, ROUGE) are fast and computationally
efficient but may not always align with human judgment. They focus on surface-level
similarity and may fail to capture nuances like fluency and semantic meaning.
 Human Evaluation is more accurate but time-consuming and subjective. It’s typically
used in conjunction with automatic metrics for more reliable evaluation.

Common MT Issues:

 Word ambiguity: Words with multiple meanings are translated incorrectly if context is
not well understood.
 Idioms: Idiomatic expressions often do not translate directly.
 Cultural context: Some words or expressions may have no equivalent in the target
language.
 Out-of-Vocabulary (OOV) words: Words that do not appear in training data may result
in poor translations.
 Low-resource languages: Lack of training data can lead to suboptimal performance.
