NLP Notes-1

Q. Describe the process of building a simple Markov model for predicting the next word in a sentence with the help of an example.
A Markov model is a statistical model that predicts the next word in a sentence based on the
current word only, not the full sentence history. This approach follows the Markov property,
which states that the future state depends only on the present state, not on past states.

Steps to Build a Simple Markov Model (Bigram Model):

1. Collect a Text Corpus : Gather a large amount of text data for training, such as books,
articles, or conversation transcripts.

2. Tokenize the Text : Split the text into individual words (tokens), including sentence start
<s> and end </s> markers.

3. Count Word Pairs (Bigrams) : Count how many times each pair of consecutive words
appears in the corpus.

Example: In "I love coding", the bigrams are: (I, love), (love, coding).

4. Calculate Probabilities :

Use the formula:

P(wn | wn-1) = Count(wn-1, wn) / Count(wn-1)

This gives the probability of a word given the previous word.

5. Predict the Next Word :


Given a word, choose the next word with the highest probability based on bigram counts.

Example:
Corpus: "I love coding", "I love AI"
Bigrams & Counts:
(I, love) → 2
(love, coding) → 1
(love, AI) → 1

Prediction:

If the current word is “love”, the model may predict the next word as either “coding” or “AI”,
both with equal probability of 0.5.
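
As an illustration, here is a minimal Python sketch (not from the original notes) that builds these bigram counts from the two-sentence corpus above and turns them into next-word probabilities:

from collections import Counter, defaultdict

corpus = ["I love coding", "I love AI"]

bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for prev, curr in zip(tokens, tokens[1:]):
        bigram_counts[prev][curr] += 1      # Count(w_{n-1}, w_n)

def next_word_probs(prev_word):
    counts = bigram_counts[prev_word]
    total = sum(counts.values())            # Count(w_{n-1})
    return {word: count / total for word, count in counts.items()}

print(next_word_probs("love"))   # {'coding': 0.5, 'AI': 0.5}
print(next_word_probs("I"))      # {'love': 1.0}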

Q. Suppose you have a text corpus of 10,000 words, and you want to build a bigram
model from this corpus. The vocabulary size of the corpus is 5,000. After counting the
bigrams in the corpus, you found that the bigram “the cat” appears 50 times, while the
unigram “the” appears 1000 times and the unigram “cat” appears 100 times. Using the
add-k smoothing method with k=0.5, what is the probability of the sentence “the cat sat
on the mat”?

Step 1: Calculate Unigram Probabilities

With add-k smoothing (k = 0.5), corpus size N = 10,000 and vocabulary size V = 5,000:

P(w) = (Count(w) + k) / (N + k × V)

For the unigram "the": P(the) = (1000 + 0.5) / (10000 + 2500) = 1000.5 / 12500 ≈ 0.08
For the unigram "cat": P(cat) = (100 + 0.5) / 12500 ≈ 0.008
The unigrams "sat", "on", and "mat" cannot be evaluated numerically because their counts are not given; each takes the same form (Count(w) + 0.5) / 12500.

Step 2: Calculate Bigram Probabilities

P(wn | wn-1) = (Count(wn-1, wn) + k) / (Count(wn-1) + k × V)

For the bigram "the cat":
P(cat | the) = (50 + 0.5) / (1000 + 0.5 × 5000) = 50.5 / 3500 ≈ 0.0144

The counts of the bigrams "cat sat", "sat on", "on the", and "the mat" are not given; each takes the form (Count(wn-1, wn) + 0.5) / (Count(wn-1) + 2500), and a bigram never seen in the corpus still receives the small probability 0.5 / (Count(wn-1) + 2500).

Step 3: Calculate Sentence Probability

P("the cat sat on the mat") = P(the) × P(cat | the) × P(sat | cat) × P(on | sat) × P(the | on) × P(mat | the)

Final Answer
Using add-k smoothing with k = 0.5, P(the) ≈ 0.08 and P(cat | the) = 50.5 / 3500 ≈ 0.0144. The probability of the whole sentence is the product of the smoothed probabilities above; it cannot be reduced to a single number because the counts of the remaining unigrams and bigrams are not given in the question.
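
The computable parts of the answer can be checked with a small Python sketch; the helper function names are illustrative, and the counts are the values given in the question (k = 0.5, V = 5000, N = 10000):

def add_k_unigram_prob(count, total_words, k, vocab_size):
    return (count + k) / (total_words + k * vocab_size)

def add_k_bigram_prob(bigram_count, prev_count, k, vocab_size):
    return (bigram_count + k) / (prev_count + k * vocab_size)

print(round(add_k_unigram_prob(1000, 10000, 0.5, 5000), 4))   # P(the)       ≈ 0.08
print(round(add_k_bigram_prob(50, 1000, 0.5, 5000), 4))       # P(cat | the) ≈ 0.0144
# A bigram never seen after "the" would still get 0.5 / 3500 ≈ 0.000143 instead of zero.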
Q. Write a short note on Latent Semantic Analysis (LSA).

Latent Semantic Analysis (LSA) is a topic modeling technique used in natural language
processing to uncover hidden relationships between words and documents in a large corpus. It
helps discover underlying themes (topics) in text data by capturing the semantic structure of
the documents.

Key Steps in LSA:

1. Preprocessing: Clean the text (tokenization, stop word removal, stemming).

2. Term-Document Matrix: Construct a matrix where rows are terms, and columns are
documents, with entries showing term frequency.

3. Singular Value Decomposition (SVD): Decompose the matrix into three matrices to
reduce dimensionality and identify patterns.

4. Extract Latent Topics: Identify topics from the reduced matrices representing hidden
semantic structures.
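
A minimal Python sketch of these steps, assuming scikit-learn and a made-up four-document corpus (TfidfVectorizer plus TruncatedSVD play the role of the term-document matrix and SVD steps):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

documents = [
    "machine learning models learn from data",
    "deep learning is a branch of machine learning",
    "the stock market fell sharply today",
    "investors worry about the falling market",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(documents)                    # weighted term-document matrix
svd = TruncatedSVD(n_components=2, random_state=0)    # SVD-based dimensionality reduction
doc_topics = svd.fit_transform(X)                     # each document as a mixture of 2 latent topics

terms = tfidf.get_feature_names_out()
for i, component in enumerate(svd.components_):
    top_terms = [terms[j] for j in component.argsort()[-3:][::-1]]
    print(f"Topic {i}: {top_terms}")                  # top terms per latent topic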

Applications of LSA:
Discovering themes in document collections
Document clustering and classification
Information retrieval and search optimization

Advantages:
Captures semantic meaning and relationships between words
Scalable to large datasets

Limitations:
Cannot effectively handle polysemy (words with multiple meanings)
Sensitive to term frequency and may lose interpretability of topics

LSA is widely used to analyze unstructured text and uncover the hidden concepts behind word usage in large corpora.

Q. What are generative models of language, and how do they differ from discriminative
models?

Generative models of language are models that learn the joint probability distribution P(X, Y), or just P(X), of the data. This means they not only classify data but also generate new data instances, like predicting or producing the next word in a sentence. For example, language models like GPT or models used for machine translation and text generation are generative models.

In contrast, discriminative models learn the conditional probability P(Y|X), which helps in classifying input data rather than generating it. They focus on finding boundaries between different classes (e.g., identifying whether a sentence is positive or negative).

Key Differences:
Generative Models:
o Learn how the data is generated.
o Can produce new examples (e.g., next word, new image).
o Example: GPT, Naive Bayes, GANs.

Discriminative Models:
o Focus on distinguishing between classes.
o Cannot generate new data.

o Example: Logistic Regression, SVM, BERT (for classification).
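
A minimal scikit-learn sketch of the contrast, using a made-up four-sentence sentiment dataset: MultinomialNB is a generative classifier (it models how words are generated per class), while LogisticRegression is discriminative (it models P(class | words) directly):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

sentences = ["I love this movie", "great acting and story", "terrible plot", "I hate this film"]
labels = ["positive", "positive", "negative", "negative"]

vectorizer = CountVectorizer().fit(sentences)
X_train = vectorizer.transform(sentences)

generative = MultinomialNB().fit(X_train, labels)            # learns P(words | class) and P(class)
discriminative = LogisticRegression().fit(X_train, labels)   # learns P(class | words) directly

test = vectorizer.transform(["I love the story"])
print(generative.predict(test), discriminative.predict(test))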

Q. b) Given a document-term matrix with the following counts (in Document 1: Term 1 = 10, Term 2 = 2, Term 3 = 1; Term 1 appears in 2 of the 3 documents), find the TF-IDF score of Term 1 in Document 1.

Step 1: Term Frequency (TF)

TF is calculated as:
TF = (Number of times Term 1 appears in Document 1) / (Total number of terms in Document 1)

From the matrix:
Term 1 in Document 1 = 10
Term 2 in Document 1 = 2
Term 3 in Document 1 = 1
Total terms = 10 + 2 + 1 = 13
TF = 10 / 13 ≈ 0.769
Step 2: Inverse Document Frequency (IDF)
IDF is calculated using:
IDF = log10(N / df)
Where:
N = 3 (Total number of documents)
df = 2 (Term 1 appears in Document 1 and Document 2)
IDF = log10(3/2) ≈ log10(1.5) ≈ 0.176
Final Answer:
The TF-IDF score of “Term 1” in Document 1 is approximately 0.135.
This value reflects the importance of Term 1 in Document 1 relative to its frequency across all
documents.
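
The same calculation as a short Python sketch (the counts are those used in the working above):

import math

tf = 10 / (10 + 2 + 1)            # Term 1 frequency in Document 1
idf = math.log10(3 / 2)           # Term 1 appears in 2 of the 3 documents
tfidf = tf * idf
print(round(tf, 3), round(idf, 3), round(tfidf, 3))   # 0.769 0.176 0.135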

Q. Define Latent Dirichlet Allocation (LDA) and explain how it is used for topic modeling
in text data. Discuss the key components of LDA, including topics, documents, and word
distributions.

Latent Dirichlet Allocation (LDA) is a generative probabilistic model used in natural language
processing (NLP) to automatically discover hidden topics within a large corpus of unstructured
text. It assumes that each document is a mixture of topics, and each topic is a distribution over
words. LDA is one of the most widely used techniques for topic modeling, helping to organize,
understand, and summarize large sets of textual data.

Key Concepts and Components of LDA:

1. Documents: A document is a sequence of words and is treated as a mixture of topics.

2. Topics: Each topic is a distribution over a fixed vocabulary of words. Topics are latent
(hidden), which means they are not explicitly provided and must be inferred from the data.

3. Words: Words are the observed data, and each word in a document is assumed to be
generated from one of the topics.

How LDA Works (Generative Process):

LDA assumes the following process to generate each document:


1. For each document:
o Choose a topic distribution from a Dirichlet distribution.

2. For each word in the document:


o Choose a topic from the document's topic distribution.
o Choose a word from the topic-specific word distribution.

This means that LDA models the joint probability of topics, documents, and words, allowing
it to uncover the structure of topics across the corpus.
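
A minimal NumPy sketch of this generative story with made-up numbers (two topics, a five-word vocabulary); it only simulates the forward generative process, not inference:

import numpy as np

rng = np.random.default_rng(0)
vocabulary = ["data", "model", "gene", "cell", "network"]
num_topics = 2

# Each topic is a distribution over the vocabulary (drawn from a Dirichlet prior).
topic_word = rng.dirichlet(alpha=[0.5] * len(vocabulary), size=num_topics)

# For one document: draw a topic mixture, then generate each word.
doc_topic = rng.dirichlet(alpha=[0.5] * num_topics)
document = []
for _ in range(8):
    z = rng.choice(num_topics, p=doc_topic)        # choose a topic for this word
    w = rng.choice(vocabulary, p=topic_word[z])    # choose a word from that topic's distribution
    document.append(w)

print("Topic mixture:", np.round(doc_topic, 2))
print("Generated document:", document)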

Learning in LDA (Inference):

The main goal of LDA is to infer:

The topic distribution for each document.
The word distribution for each topic.

However, the posterior distribution is intractable to compute directly, so approximate inference methods are used to estimate the model parameters, such as:

Gibbs Sampling (a type of Markov Chain Monte Carlo method)
Variational Inference

Applications of LDA in NLP:

Topic Modeling: Automatically identify the main themes or topics in large sets of
documents.

Document Classification: Classify documents based on dominant topics.

Information Retrieval: Search and retrieve documents with similar topic distributions.

Recommendation Systems: Recommend documents or articles based on user's topic interests.

Text Summarization: Summarize long texts by extracting sentences or sections based on prominent topics.

Trend Analysis: Analyze topic trends over time in social media, news, or research papers.

Example:

Suppose you have a collection of research papers. LDA can analyze the words and
automatically cluster them into topics such as "Machine Learning," "Biology," or
"Cybersecurity." Each paper might belong to multiple topics (e.g., a paper might be 70% ML
and 30% Biology), and LDA helps uncover this hidden thematic structure.

Q. What are generative models of language? Explain any one model in detail.
(See the earlier answer on generative vs. discriminative models.)
One Model : Recurrent Neural Network (RNN)

A simple RNN for language modeling typically has the following components:
• Input Layer: Takes a word (often represented as a one-hot encoded vector or a word
embedding) as input.
• Hidden Layer (Recurrent Layer): This is the core of the RNN. It has a hidden state
that gets updated at each time step. The update is based on the current input word and
the previous hidden state. The same set of weights is applied at each time step, making
it "recurrent."
• Output Layer: Typically a softmax layer that outputs a probability distribution over the
vocabulary. The highest probability indicates the model's prediction for the next word.
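
A minimal PyTorch sketch of such an RNN language model (the layer sizes are illustrative assumptions, and the input here is random rather than a trained corpus):

import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)         # input layer (word embeddings)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)   # recurrent hidden layer
        self.output = nn.Linear(hidden_dim, vocab_size)              # scores over the vocabulary

    def forward(self, token_ids, hidden=None):
        embedded = self.embedding(token_ids)           # (batch, seq_len, embed_dim)
        outputs, hidden = self.rnn(embedded, hidden)   # hidden state updated at each time step
        logits = self.output(outputs)                  # (batch, seq_len, vocab_size)
        return logits, hidden

model = RNNLanguageModel(vocab_size=5000)
dummy_input = torch.randint(0, 5000, (1, 4))           # one sequence of 4 word ids
logits, _ = model(dummy_input)
next_word_probs = torch.softmax(logits[0, -1], dim=0)  # softmax output layer at the last step
print(next_word_probs.argmax().item())                 # id of the most probable next word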

Q. Consider the following small corpus:


Training corpus:
<s> I am from Pune </s>
<s> I am a teacher </s>
<s> students are good and are from various cities </s>
<s> students from Pune do engineering </s>
Test data:
<s> students are from Pune </s>
Find the Bigram probability of the given test sentence.

Ans :

Step 1: Training Corpus and Test Sentence


Training Corpus:
1. <s> I am from Pune </s>
2. <s> I am a teacher </s>
3. <s> students are good and are from various cities </s>
4. <s> students from Pune do engineering </s>

Test Sentence:
<s> students are from Pune </s>
Step 2: Extract Bigrams from Test Sentence
The test sentence contains the following bigrams:
1. (<s>, students)
2. (students, are)
3. (are, from)
4. (from, Pune)
5. (Pune, </s>)

Step 3: Count Unigrams and Bigrams from Training Corpus

Unigram Counts:
<s> = 4
students = 2
are = 2
from = 3
Pune = 2
</s> = 4

Bigram Counts:
(<s>, I) = 2
(<s>, students) = 2
(students, are) = 1
(are, good) = 1
(are, from) = 1
(from, Pune) = 2
(Pune, </s>) = 1

Step 4: Calculate Bigram Probabilities

The formula for bigram probability is: P(Wn | Wn−1) = Count(Wn−1, Wn) / Count(Wn−1)
So,

1. P(students | <s>) = Count(<s> students) / Count(<s>) = 2 / 4 = 0.5


2. P(are | students) = Count(students are) / Count(students) = 1 / 2 = 0.5
3. P(from | are) = Count(are from) / Count(are) = 1 / 2 = 0.5
4. P(Pune | from) = Count(from Pune) / Count(from) = 2 / 3 ≈ 0.6667
5. P(</s> | Pune) = Count(Pune </s>) / Count(Pune) = 1 / 2 = 0.5

Step 5: Multiply Bigram Probabilities

P(sentence) = 0.5 × 0.5 × 0.5 × 0.6667 × 0.5 ≈ 0.0417

Final Answer:

The Bigram probability of the test sentence “<s> students are from Pune </s>” is: 0.0417
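
A quick Python check of the product of the five bigram probabilities:

probs = [2/4, 1/2, 1/2, 2/3, 1/2]     # the five bigram probabilities from Step 4
p_sentence = 1.0
for p in probs:
    p_sentence *= p
print(round(p_sentence, 4))            # 0.0417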

Q. Write short note on BERT

Ans : BERT (Bidirectional Encoder Representations from Transformers) is a powerful language model developed by Google that captures the context of words in a sentence by reading text bidirectionally (from left to right and right to left).

It is trained using a Masked Language Modeling (MLM) task, where some words in a
sentence are hidden, and the model learns to predict them based on surrounding context.

BERT produces contextualized word embeddings, meaning the representation of a word depends on the other words in the sentence, allowing better understanding of meaning in different situations.

It has achieved state-of-the-art performance in many NLP tasks, including question answering, sentiment analysis, text classification, and named entity recognition, and can be fine-tuned easily for specific applications.
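
A minimal sketch of BERT's masked-word prediction, assuming the Hugging Face transformers package and the bert-base-uncased checkpoint are available:

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("I love to play [MASK] with my friends."):
    print(prediction["token_str"], round(prediction["score"], 3))   # candidate words and their scores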

Q. What are generative models of language, and how do they differ from discriminative
models? Provide an example of a generative model and describe how it can be used in
NLP.

Generative models of language are statistical models that learn the joint probability distribution of data and labels, represented as P(X, Y), or sometimes just the data P(X) when labels are not involved. These models can generate new examples by learning how data is structured and how likely different sequences are. In the context of Natural Language Processing (NLP), generative models can generate natural language text, predict the next word in a sequence, or even construct entire paragraphs of meaningful content.

On the other hand, discriminative models learn the conditional probability distribution P(Y|X), which helps in distinguishing between different classes or labels for a given input. Discriminative models do not focus on how data is generated; instead, they focus on making accurate predictions or classifications based on the input features.

Example of a Generative Model – GPT (Generative Pre-trained Transformer): GPT is a
powerful generative language model developed by OpenAI. It uses deep learning and
transformer architecture to predict the next word in a sentence by analyzing the context of
previous words. It is trained on massive text datasets to learn the structure and patterns of
natural language.

Use of GPT in NLP:

1. Text Generation: GPT can automatically generate creative stories, essays, news articles, or
poetry based on a given prompt.

2. Chatbots and Virtual Assistants: Used in applications like ChatGPT, Siri, or Alexa for
holding human-like conversations.

3. Machine Translation: It helps in translating text from one language to another.

4. Summarization: Automatically condenses long documents into key points.

5. Autocomplete Systems: Suggests the next word or phrase while typing.
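
A minimal sketch of generative text completion, assuming the Hugging Face transformers package and the gpt2 checkpoint are available:

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("Natural language processing is", max_new_tokens=20, num_return_sequences=1)
print(result[0]["generated_text"])   # the prompt continued word by word by the model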

Q. Describe the concept of contextualized representations, such as those generated by
BERT, and how they are used in natural language processing. Discuss the advantages
and disadvantages of contextualized representations.

Contextualized Representations and BERT in NLP: Contextualized representations are word embeddings where the meaning of a word changes depending on the context in which it appears. For example, the word "bank" means something different in “river bank” and “bank account.” Traditional models like Word2Vec or GloVe give a single meaning to a word. But contextualized models, like BERT (Bidirectional Encoder Representations from Transformers), understand and represent each word based on its surrounding words.
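
A minimal sketch, assuming the transformers and torch packages and the bert-base-uncased checkpoint, showing that the vector for "bank" differs between two contexts:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    # Encode the sentence and return the hidden state of the "bank" token.
    inputs = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    idx = tokens.index("bank")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, 768)
    return hidden[0, idx]

v1 = bank_vector("I deposited money in the bank.")
v2 = bank_vector("We sat on the river bank.")
similarity = torch.cosine_similarity(v1, v2, dim=0)
print(round(similarity.item(), 3))   # noticeably below 1.0: the two "bank" vectors differ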

What is BERT?
BERT is a model developed by Google that uses the Transformer architecture.

It reads text both forward and backward (bidirectionally), allowing it to understand the full
context of a word.

BERT is trained using Masked Language Modeling (MLM) – it hides some words in a
sentence and tries to predict them using the surrounding words.

Example: In “I love to play [MASK]”, BERT might predict “guitar”, “football”, etc.,
depending on the full sentence context.

Uses of BERT in NLP


BERT has improved performance in many NLP tasks, including:
Sentiment analysis – Understanding the tone of a sentence (positive/negative)
Text classification – Grouping text into categories (e.g., spam or not spam)
Question answering – Finding answers from a passage
Named entity recognition – Identifying names, places, etc., in text
Language translation – Converting text from one language to another

Advantages of Contextualized Representations

1. Improved understanding of meaning – Words are interpreted correctly based on their context.
2. State-of-the-art accuracy – BERT achieves top results on many NLP benchmarks.
3. Bidirectional learning – It captures both left and right context, unlike older models.
4. Works well for many tasks – Can be fine-tuned for specific NLP applications.

Disadvantages of Contextualized Representations

1. High computational cost – Training and using models like BERT requires powerful
hardware (like GPUs).
2. Large memory usage – These models are big and consume more storage and RAM.
3. Slow inference – Processing can be slower, especially on large texts.
4. Needs a lot of data to train – BERT was trained on very large datasets to be effective.
Q. Describe HMM with the help of an example.
Q. What is Smoothing? Explain Laplace/Add-1 smoothing with example.[6]

Ans :

In Natural Language Processing (NLP), smoothing is a technique used to address the problem
of zero counts for certain events (like n-grams) in a language model. When building a language
model from a corpus, we estimate the probability of words and sequences of words based on
their frequencies in the training data. However, it's highly likely that some valid n-grams
(combinations of words) will not appear in the training data, resulting in a count of zero and
thus a probability of zero.

A probability of zero for a particular n-gram can cause problems in downstream tasks. For
instance, if a sentence containing such an n-gram needs to be evaluated, its overall probability
might become zero, even if the sentence makes sense. Smoothing techniques aim to assign a
small non-zero probability to these unseen events, thereby "smoothing" out the probability
distribution.

Laplace Smoothing (Add-1 Smoothing) :


Laplace smoothing, also known as Add-1 smoothing, is one of the simplest and earliest
smoothing techniques. The core idea is to add one to the count of every n-gram (including
those that didn't appear in the training data) before calculating the probabilities. This ensures
that no n-gram has a probability of zero.

Example:

Corpus: "cat cat"

1. Counts:
• Unigrams: cat: 2
• Vocabulary (V): 1 (cat)
• Bigrams: cat cat: 1

2. Probability without smoothing (for "cat cat"):

P(cat | cat) = count(cat cat) / count(cat) = 1 / 2 = 0.5


What about P(dog | cat)? (Assuming "dog" is in our vocabulary, V=2 now):
• Unigrams: cat: 2, dog: 0
• Vocabulary (V): 2 (cat, dog)
• Bigrams: cat cat: 1, cat dog: 0
P(dog | cat) = count(cat dog) / count(cat) = 0 / 2 = 0

3. Apply Laplace Smoothing (k=1):

P(cat | cat) = (count(cat cat) + 1) / (count(cat) + V) = (1 + 1) / (2 + 2) = 2 / 4 = 0.5


P(dog | cat) = (count(cat dog) + 1) / (count(cat) + V) = (0 + 1) / (2 + 2) = 1 / 4 = 0.25
Effect: The unseen bigram "cat dog" now has a probability of 0.25 instead of 0.
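
The same arithmetic as a small Python sketch (the helper function name is just for illustration):

def laplace_bigram_prob(bigram_count, prev_count, vocab_size):
    # Add-1 (Laplace) smoothing: add 1 to the count and V to the denominator.
    return (bigram_count + 1) / (prev_count + vocab_size)

V = 2   # vocabulary {cat, dog}
print(laplace_bigram_prob(1, 2, V))   # P(cat | cat) = (1 + 1) / (2 + 2) = 0.5
print(laplace_bigram_prob(0, 2, V))   # P(dog | cat) = (0 + 1) / (2 + 2) = 0.25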

Q. Describe the concept of Information Retrieval system in Natural Language
Processing.

Information Retrieval (IR) is the process of fetching relevant information from a large
collection of documents based on a user's input or query. It is widely used in applications like
search engines, digital libraries, document management systems, and recommendation
engines. The goal of IR is to help users find accurate and relevant information quickly and
efficiently.

Importance of IR:
o Helps users navigate and extract useful content from vast digital information.
o Improves productivity, decision-making, and supports research and knowledge discovery.

Key Components of IR:


o Document Collection: A pool of documents the system searches through.
o Indexing: Creating a searchable index using important terms from documents for faster
lookup.

o Query Processing: Understanding and analyzing the user's query to find relevant documents.
o Relevance Ranking: Sorting and ranking the search results based on how well they match
the query.

Significance of NLP in IR:


Natural Language Processing (NLP) enhances IR by making the system understand human
language better. It uses techniques like stemming, lemmatization, and entity recognition to
interpret both the documents and the queries. This leads to more accurate and relevant results,
improving the overall efficiency of the retrieval system.

Q. What is Named Entity Recognition (NER)? Describe the various metrics used for
evaluation.

Named Entity Recognition (NER)

Named Entity Recognition (NER) is a task in Natural Language Processing (NLP) where the
goal is to identify and classify specific types of information (called "named entities") in a text.
These entities include names of people, organizations, locations, dates, monetary values,
percentages, and more. For example, in the sentence “Apple opened a new office in London
on Monday,”

NER will detect:


Apple as an Organization
London as a Location
Monday as a Date

NER is essential in many NLP applications like information extraction, question answering,
text summarization, sentiment analysis, and social media monitoring, where structured
information must be pulled from unstructured text.

Importance of NER
It helps in converting unstructured text into structured data.
It improves search engines and recommendation systems by identifying key entities.
It supports tasks like document classification, entity linking, and knowledge base
population.

Challenges in NER

Ambiguity: Words like "Apple" could refer to a fruit or a company based on context.
Variation: Entities can appear in different forms (e.g., “U.S.A.” vs. “United States”).
Domain-Specific Terms: Medical or legal texts may have unique entities not seen elsewhere.
New or Unknown Entities: Systems must handle names or places not seen during training.
Evaluation Metrics in NER
To evaluate how well an NER system performs, we use the following metrics:

Precision:
It tells us how many of the entities the model predicted were actually correct.
Formula: Precision= Correctly predicted entities/ Total predicted entities

Recall:
It tells us how many of the actual correct entities were found by the model.
Formula: Recall= Correctly predicted entities / Total actual entities

F1 Score:
It is the harmonic mean of precision and recall. It gives a balanced overall score.
Formula: F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
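
A small Python sketch of these metrics with made-up counts (8 predicted entities, 6 of them correct, 10 entities in the gold standard):

correct = 6       # correctly predicted entities
predicted = 8     # total predicted entities
actual = 10       # total actual (gold standard) entities

precision = correct / predicted                       # 0.75
recall = correct / actual                             # 0.60
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean
print(round(precision, 2), round(recall, 2), round(f1, 2))   # 0.75 0.6 0.67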

Q. What is Cross-Lingual Information Retrieval and how is it used in Natural Language
Processing? Provide an example.

Ans : Cross-Lingual Information Retrieval (CLIR) is a technique in Natural Language Processing (NLP) that allows users to search for information written in a different language than the language of their query. It helps bridge the language gap, making global information accessible to all users regardless of language.

A user enters a query in one language (e.g., English), and CLIR retrieves documents in
another language (e.g., French or Hindi).

It supports multilingual search, breaking down barriers caused by different languages.

How CLIR Is Used in Natural Language Processing (NLP):

1. Machine Translation (MT)
2. Cross-lingual embeddings
3. Multilingual language models like mBERT, XLM-R, etc.
4. Semantic search using vector space models

Examples :
• Google Translate + Search: Retrieves foreign-language documents translated on the
fly.
• Multilingual search engines: Used in e-commerce, legal, academic, and healthcare
domains.
• Cross-lingual question answering: Answering a user’s question in English using
documents written in Spanish or Chinese.

Q. Explain the concept of the Vector Space Model and describe how it is used in
Information Retrieval.

Ans : The Vector Space Model (VSM) is a way to represent text documents (or any text unit
like words, phrases, or even entire paragraphs) as numerical vectors in a high-dimensional
space. Each unique term in the corpus (all the unique words across all documents) corresponds
to a dimension in this space.

Text as Vectors: Instead of treating text as raw sequences of words, VSM transforms it into a
mathematical object (a vector) that can be easily compared and manipulated using linear
algebra.

Dimensions Represent Terms: Each dimension in the vector space corresponds to a unique
term in the entire collection of documents you're working with.

Vector Values Represent Importance: The value in each dimension of a document's vector
reflects the importance of that term within that specific document. Common techniques like
Term Frequency-Inverse Document Frequency (TF-IDF) are used to determine these values.

In Information Retrieval (IR), the Vector Space Model is used as follows:

Representing Documents: Each document in the collection is represented as a vector in the term space, just as described earlier. The dimensions correspond to unique terms, and the values reflect the importance of those terms within each document (e.g., using TF-IDF).

Representing the Query: The user's search query is also treated as a short document and is represented as a vector in the same term space.

Q. Describe entity extraction and relation extraction with the help of examples.
Ans :

Entity Extraction (EE), also known as Named Entity Recognition (NER), is the task of
identifying and classifying named entities in text into predefined categories such as people,
organizations, locations, dates, times, quantities, monetary values, percentages, etc.

In brief: EE aims to pinpoint the key nouns and noun phrases in a text and categorize them.

Example:

In the sentence: "Apple CEO Tim Cook announced a new iPhone in Cupertino, California on
September 15, 2023."

EE would identify:

o Apple: ORGANIZATION
o Tim Cook: PERSON
o iPhone: PRODUCT
o Cupertino: LOCATION
o California: LOCATION
o September 15, 2023: DATE
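
A minimal spaCy sketch of the same example, assuming spaCy and its en_core_web_sm model are installed (spaCy's label names, e.g. ORG and GPE, differ slightly from the category names above, and exact output depends on the model version):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple CEO Tim Cook announced a new iPhone in Cupertino, California on September 15, 2023.")
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. Apple ORG, Tim Cook PERSON, Cupertino GPE, ... DATE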

Relation Extraction (RE) is the task of identifying and classifying the semantic relationships
between the entities mentioned in a text. It aims to go beyond just recognizing entities and
understand how they are connected.

In brief: RE aims to determine how the extracted entities relate to each other. It often follows
Entity Extraction, as you first need to identify the entities before you can find the relationships
between them.

Example :

From the same sentence: "Apple CEO Tim Cook announced a new iPhone in Cupertino,
California on September 15, 2023."

RE would identify relationships like:

• Tim Cook is the CEO of Apple. (PERSON - ROLE - ORGANIZATION)
• Apple is located in Cupertino, California. (ORGANIZATION - LOCATED_IN - LOCATION)
• Apple announced iPhone on September 15, 2023. (ORGANIZATION - PRODUCT - DATE)

Q. What is Coreference Resolution? Give examples.

Ans : Coreference Resolution in Natural Language Processing (NLP) is the task of identifying
all mentions in a text that refer to the same real-world entity. In simpler terms, it's about
figuring out which words or phrases in a text are pointing to the same thing.

Think of it as resolving the identity of entities across a piece of writing, especially when they
are referred to using different words or phrases.

Examples:

1. Consider the sentence: "The CEO announced that he would be stepping down at the end
of the year. He thanked the board for their support."

Coreference Resolution would identify that:

"The CEO", "he", and "He" all refer to the same person.

2. "My friend Shantilal went to the park. He loves to walk his dog there."

Coreference Resolution would identify that "My friend Shantilal " and "He" refer to the same
individual.

Q. Describe the concept of Information Retrieval. Explain the significance of Natural Language Processing in Information Retrieval.
Ans:

Information Retrieval (IR) is the process of fetching relevant information from a large
collection of documents based on a user's input or query. It is widely used in applications like
search engines, digital libraries, document management systems, and recommendation
engines. The goal of IR is to help users find accurate and relevant information quickly and
efficiently.

Importance of IR:
o Helps users navigate and extract useful content from vast digital information.
o Improves productivity, decision-making, and supports research and knowledge discovery.

Objectives :
o Document Collection: A pool of documents the system searches through.
o Indexing: Creating a searchable index using important terms from documents for faster
lookup.
o Query Processing: Understanding and analyzing the user's query to find relevant documents.
o Relevance Ranking: Sorting and ranking the search results based on how well they match
the query.

Significance of NLP in IR:


Natural Language Processing (NLP) enhances IR by making the system understand human
language better. It uses techniques like stemming, lemmatization, and entity recognition to
interpret both the documents and the queries. This leads to more accurate and relevant results,
improving the overall efficiency of the retrieval system.

Q. Describe the Vector Space Model (VSM) for information retrieval. How does VSM
represent documents and queries, and how are similarities calculated? Discuss the
strengths and weaknesses of VSM.

Ans:

The Vector Space Model (VSM) is a way of representing both documents in a collection and
a user's query as vectors in a common multi-dimensional space. Each unique term in the entire
collection corresponds to a dimension in this space.

For information retrieval:


• Documents as Vectors: Each document is represented as a vector where the value in
each dimension reflects the importance of that term in the document (often using TF-
IDF).
• Query as a Vector: The user's search query is also represented as a vector in the same
term space.
• Similarity Calculation: The system calculates the similarity between the query vector
and each document vector, typically using cosine similarity. This measures the angle
between the vectors.

Representation: Both documents and the search query are represented as vectors in a high-
dimensional space. Each unique word (term) in the entire collection of documents becomes a
dimension in this space.

Vector Values: The value in each dimension of a document's vector indicates the
importance of that specific word within that document. This importance is often calculated
using techniques like TF-IDF (Term Frequency-Inverse Document Frequency). For the
query vector, the values also represent the importance of the query terms.

Similarity Calculation: The similarity between the query vector and each document vector
is typically calculated using cosine similarity. This measures the angle between the two
vectors. A smaller angle (cosine value closer to 1) signifies higher similarity, meaning the
document is more likely to be relevant to the query.
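
A minimal scikit-learn sketch of this pipeline with a made-up three-document collection: TF-IDF vectors for documents and query, ranked by cosine similarity:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "information retrieval finds relevant documents",
    "vector space models represent documents as vectors",
    "cats are popular pets",
]
query = "find relevant documents"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)   # one TF-IDF vector per document
query_vector = vectorizer.transform([query])        # query in the same term space

scores = cosine_similarity(query_vector, doc_vectors)[0]
for score, doc in sorted(zip(scores, documents), reverse=True):
    print(round(score, 3), doc)                     # most similar documents ranked first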

Q. Discuss the different methods used for evaluating NER systems. What are common
metrics for measuring NER system performance, and how can the results be analyzed to
improve the system?
Ans:

Named Entity Recognition (NER) is evaluated to measure how well a system can identify and
classify entities such as names of people, places, organizations, dates, and more. Evaluation is
typically done by comparing the model's output to a manually annotated dataset known as the
gold standard. To measure how accurately the system performs, several standard evaluation
metrics are used:

Exact Match: A common evaluation criterion where an identified entity is considered correct only if its boundary (start and end positions) and its entity type are exactly the same as the gold standard annotation.

Partial Match: In some cases, partial matches might also be considered, where an identified entity overlaps significantly with a gold standard entity. However, exact match is more commonly used.

Confusion Matrix: This is a table that summarizes the performance of the NER system by
showing the counts of true positives, true negatives, false positives, and false negatives for
each entity type. This can provide a more detailed understanding of the types of errors the
system is making.

The common metrics for measuring NER system performance are:


• Precision
• Recall
• F1-score

Low Precision: If precision is low, it means the system is identifying many entities that are
not actually entities (false positives). To improve this, you should:

o Examine the false positives


o Refine the classification rules or model

Low Recall: If recall is low, it means the system is missing many actual entities present in the
text (false negatives). To improve this, you should:

o Examine the false negatives


o Consider different levels of granularity

Analyzing F1-score: The F1-score provides a balanced view. If it's low, it usually means either
precision or recall (or both) are low. Focusing on improving the lower of the two often helps
increase the F1-score.

Q. Define Cross-Lingual Information Retrieval (CLIR) and discuss the challenges
involved in retrieving information from languages different from the query language.
How do machine translation techniques assist in CLIR?

Ans:

Cross-Lingual Information Retrieval (CLIR) is a subfield of information retrieval that deals with retrieving information (documents, web pages, etc.) that is written in a language different from the language of the user's query. For example, a user might submit a search query in English and expect to find relevant documents written in French, German, or Chinese.

Challenges Involved in Retrieving Information from Different Languages:

1. Lexical Differences: The same concept can be expressed using different words
(vocabulary) in different languages. Direct word-to-word matching between a query and
documents in another language will likely fail.

2. Morphological Differences: Languages have different rules for word formation (morphology). This includes variations in suffixes, prefixes, inflections, and stemming. A base word in one language might appear in many different forms in another, making exact matching difficult.

3. Syntactic Differences: The grammatical structure (syntax) of sentences varies greatly across languages. Word order, sentence construction, and the role of different parts of speech can be different. This makes it hard to understand the relationships between words if only relying on the query language's grammar.

4. Semantic Differences: Even if words can be translated, their meaning or connotations might differ across cultures and contexts. Polysemy (words with multiple meanings) and homonymy (words that sound or look alike but have different meanings) can be particularly challenging in cross-lingual scenarios.

How Machine Translation Techniques Assist in CLIR:


Machine translation (MT) plays a crucial role in bridging the language gap in CLIR. Several
MT-based approaches are commonly used:

1. Query Translation: The most straightforward approach is to translate the user's query
from their native language into the language(s) of the document collection. The
translated query can then be used to search the foreign language documents using
traditional monolingual information retrieval techniques. Advancements in Neural
Machine Translation (NMT) have significantly improved the quality of query
translations, leading to better retrieval results.

2. Document Translation: Another approach is to pre-translate the entire collection of
foreign language documents into the user's query language. Once translated, the user
can search the translated documents using their original query. While this approach can
be effective, it requires significant computational resources and storage, especially for
large document collections and frequent updates.

3. Pivot Language Translation: In scenarios involving multiple languages, a pivot language (often English) can be used as an intermediary. The query is first translated into the pivot language, and then the pivot language query is used to search documents in other languages. This can be useful for language pairs with limited direct translation resources.

4. Cross-Lingual Embeddings: More advanced techniques involve creating cross-lingual word or document embeddings. These embeddings aim to represent words or documents with similar meanings in different languages close to each other in a shared vector space. By learning such a joint representation, similarity between queries and documents can be directly computed regardless of the language they are written in, without explicit translation steps.

Q. Explain the importance of entity extraction in NLP. How does entity extraction differ
from named entity recognition, and provide examples of real-world applications where
entity extraction is crucial.

Ans : The importance of entity extraction in NLP is that it allows computers to automatically
identify and categorize key information (entities) within unstructured text. This transforms raw
text into structured data that can be easily analyzed, searched, and used for various downstream
tasks. By recognizing and classifying entities, we can understand the core subjects and objects
being discussed in a piece of text, laying the groundwork for more advanced NLP applications.

How they differ:

• Named Entity Recognition (NER) is typically considered a subtask of entity extraction. NER specifically focuses on identifying and categorizing named entities, which usually refer to proper nouns like people, organizations, locations, dates, etc.

• Entity Extraction is a broader term that can include identifying and classifying other
types of entities that might not always be considered "named" in the traditional sense.
This could include concepts, events, products, skills, or even user-defined categories
depending on the specific application.

News Analysis: Identifying key people, organizations, and locations mentioned in news
articles allows for automated summarization, topic tagging, and trend analysis.

Customer Service: Extracting entities like product names, issues, and customer names
from support tickets helps in routing requests to the appropriate teams and understanding
common problems.

Financial Services: Identifying companies, financial instruments, and monetary values in financial reports is essential for risk assessment, investment analysis, and fraud detection.

Healthcare: Extracting information about diseases, symptoms, medications, and patient names from medical records and research papers can facilitate clinical research, drug discovery, and improved patient care.

Human Resources: Identifying skills, job titles, and company names from resumes and job descriptions helps in talent acquisition and matching candidates to suitable roles.

Content Recommendation: Understanding the entities discussed in articles or videos allows recommendation systems to suggest similar content that users might be interested in.

Question Answering Systems: Identifying entities in both the question and the relevant
document is fundamental for accurately extracting the answer.

Legal Tech: Extracting entities like parties involved, dates, locations, and legal terms from
contracts and legal documents streamlines document review and analysis.
Q. Define following w. r. t. Information Retrieval i) Term Frequency ii) Inverse Document
Frequency.

Ans : In the context of Information Retrieval:

Term Frequency (TF): refers to the number of times a specific term (word) appears in a
particular document. It's a measure of how important that term is to the content of the
document. A higher TF value indicates that the term appears more frequently in the document.

Inverse Document Frequency (IDF): measures the importance of a term across the entire
collection of documents (corpus). It's calculated as the logarithm of the total number of
documents in the corpus divided by the number of documents that contain the term. A higher
IDF value indicates that the term is rare across the corpus, suggesting it might be more
discriminative and important for distinguishing between documents. Common words like
"the" or "a" will have low IDF values because they appear in almost all documents.

Q. Explain Information Retrieval architecture with neat diagram.

Ans :

1. User Side (Search Process)
• Problem Identification: A student wants to learn about machine learning
and types a query into a search engine.
• Representation: The user converts their need into a search query using keywords or phrases. For example, instead of asking "How do machines learn?", the student types "machine learning basics" into Google, and the problem is converted into a query (keywords or phrases).
• Query: The user submits the search query into IR system.
• Feedback: User can refine or modify the search based on the retrieved
results.

2. System Side (Retrieval Process)


• Acquisition: The system collects and stores a large number of documents or data sources. It can include web pages, books, research papers, or any text-based information.
• Representation: Each document in the system is analyzed and represented
in a structured way using keywords (terms). Example: If the document talks
about "machine learning" it is tagged with relevant terms like "AI, deep
learning, algorithms, models" to help retrieval.
• File Organization: The documents are indexed and stored efficiently so the
system can quickly find relevant ones. Like organizing a library so books can
be found easily based on topics.
• Matching: The system compares the user's search query with stored
documents to find the best matches. It uses matching functions that rank
documents based on relevance.
• Retrieved Object: The system returns the most relevant documents to the
user. These documents are ranked so the most useful ones appear at the
top.
3. Interaction Between User & System
• The user reviews the retrieved results and may provide feedback to refine
the search. The system then processes the updated query and retrieves
better results.
• Acquisition: In this step the selection of documents and other objects from
various web resources that consist of text-based documents takes place.
The required data is collected by web crawlers and stored in the database.
• Representation: It consists of indexing that contains free-text terms,
controlled vocabulary, manual and automatic techniques as well. Example:
Abstracting contains summarizing and Bibliographic description that
contains author, title, sources, data and metadata.

27
• File Organization: There are two types of file organization methods: Sequential, which stores records document by document, and Inverted, which stores a list of records under each term.
• Query: An IR process starts when a user enters a query into the system.
Queries are formal statements of information needs. For example, search
strings in web search engines. In IR a query does not uniquely identify a
single object in the collection. Instead several objects may match the query,
perhaps with different degrees of relevancy.
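
A minimal Python sketch of the inverted file organization mentioned above, with a made-up three-document collection and simple Boolean AND matching:

from collections import defaultdict

documents = {
    1: "machine learning basics for beginners",
    2: "deep learning and neural networks",
    3: "basics of information retrieval",
}

inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.lower().split():
        inverted_index[term].add(doc_id)     # each term points to the documents containing it

def search(query):
    # Return the documents containing all query terms (Boolean AND matching).
    term_sets = [inverted_index[t] for t in query.lower().split()]
    return set.intersection(*term_sets) if term_sets else set()

print(search("learning basics"))   # {1}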

Q. Explain Entity Extraction and Relation Extraction w.r. t. NER.

Ans :

Entity Extraction is a broader process that often includes NER as its initial stage. While
NER focuses on standard named entities, a full Entity Extraction system might identify and
classify a wider range of entities that are relevant to a specific domain or task. This could
include concepts, events, products (even if not strictly proper nouns), skills, and more.
Therefore, NER can be seen as a specialized form of Entity Extraction, or the first step in a
more comprehensive Entity Extraction process.

Relation Extraction is a subsequent step that builds upon the output of Entity Extraction.
Once the entities in a text have been identified and classified, Relation Extraction aims to
discover and categorize the semantic relationships that exist between these extracted entities.
It tries to answer questions like "Who works for whom?", "Where is this located?", or "What
is the relationship between these two things?".

Q. Write a note on : WordNet
Ans : WordNet: A Lexical Database of Semantic Relations

WordNet is a large lexical database of English, developed at Princeton University. It organizes nouns, verbs, adjectives, and adverbs into synonym sets, called synsets. Each synset represents one underlying lexical concept and is linked to other synsets by means of semantic relations.

Synsets: Words that are synonymous are grouped together into synsets. For example, the synset for "car" includes "automobile" and "motorcar" (while "vehicle" is linked to it as a hypernym).

Semantic Relations: Synsets are connected to each other through various semantic relations.

Lexical Categories: WordNet is organized by parts of speech (nouns, verbs, adjectives, adverbs), and the semantic relations primarily exist within each category.

Gloss: Each synset contains a short definition or explanation, known as a gloss, providing
context and disambiguation.
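
A minimal NLTK sketch of synsets, glosses, and one semantic relation (hypernymy), assuming NLTK is installed and the WordNet corpus has been downloaded via nltk.download("wordnet"):

from nltk.corpus import wordnet as wn

for synset in wn.synsets("car")[:2]:
    print(synset.name(), "->", synset.lemma_names())                 # the synonym set
    print("  gloss:", synset.definition())                           # short definition (gloss)
    print("  hypernyms:", [h.name() for h in synset.hypernyms()])    # "is-a" semantic relation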

Importance and Uses in NLP:

o Word Sense Disambiguation (WSD): Determining the correct meaning of a word in a given context by leveraging its synsets and relations.

o Information Retrieval: Expanding search queries with synonyms to improve recall and understanding semantic relationships between terms.

o Machine Translation: Assisting in finding appropriate translations by considering synonymy and semantic context.

Limitations:

Limited Coverage: While extensive, WordNet doesn't cover all words and concepts,
especially those in specialized domains or new slang.

Granularity of Senses: The granularity of word senses might not always align perfectly
with the needs of a specific application.

Language Specificity: Primarily focuses on the English language, though similar projects exist for other languages (e.g., IndoWordNet for Indian languages, EuroWordNet for European languages).

Q. List the tools available for the development of NLP applications? Write features
NLTK and TextBlob?

Ans :

NLTK (Natural Language Toolkit): A comprehensive Python library offering a wide range of functionalities for tasks like tokenization, POS tagging, NER, sentiment analysis, text classification, stemming, lemmatization, and parsing. It also provides access to various text corpora and lexical resources. It's often favored for learning and experimentation due to its extensive documentation and community support.

spaCy: A Python library focused on delivering industrial-strength NLP capabilities with speed and efficiency. It provides pre-trained statistical models for tasks like tokenization, POS tagging, NER, dependency parsing, and more. spaCy is known for its fast processing speeds and ease of integration into production systems.

Transformers (Hugging Face): A powerful Python library providing access to thousands of pre-trained transformer models (like BERT, GPT, RoBERTa) and tools for fine-tuning them for various NLP tasks such as text classification, question answering, text generation, and more. It democratizes the use of state-of-the-art deep learning models in NLP.

TextBlob: A Python library built on top of NLTK and pattern, designed to provide a simple
and intuitive interface for common NLP tasks. It offers functionalities like sentiment analysis,
POS tagging, noun phrase extraction, tokenization, translation, and spelling correction,
making it easy for beginners to get started with NLP.

Gensim: A Python library primarily focused on topic modeling, document indexing, and
similarity retrieval for large text corpora. It implements algorithms like Latent Semantic
Analysis (LSA), Latent Dirichlet Allocation (LDA), and word embedding models.

Stanford CoreNLP: A suite of natural language analysis tools from Stanford University,
written in Java. It provides a wide range of functionalities including tokenization, sentence
splitting, POS tagging, lemmatization, named entity recognition, coreference resolution, and
dependency parsing. It has Python wrappers like stanza for easier integration.

OpenNLP: An Apache project providing a collection of machine learning tools for processing natural language text. It supports common NLP tasks like tokenization, sentence segmentation, POS tagging, named entity recognition, chunking, and parsing.

scikit-learn: A general-purpose machine learning library in Python that offers various tools
useful for NLP, particularly for tasks like text classification and clustering. It provides
functionalities for text preprocessing, feature extraction (e.g., TF-IDF vectorization), and
various machine learning algorithms.

FastText: A library developed by Facebook for efficient learning of word representations
and text classification. It's known for its speed and ability to handle large datasets and out-of-
vocabulary words.
Flair: A Python NLP framework that leverages transformer models and provides state-of-
the-art performance on various NLP tasks like named entity recognition, part-of-speech
tagging, and text classification. It supports contextual string embeddings and multi-task
learning.

Features of above mentioned tools :

NLTK (Natural Language Toolkit):

• Comprehensive Coverage: Offers a wide array of tools and algorithms for various NLP
tasks, making it suitable for diverse applications.
• Educational Focus: Comes with extensive documentation, tutorials, and corpora,
making it ideal for learning and research.
• Extensive Corpus and Lexical Resource Access: Provides built-in access to resources
like WordNet, treebanks, and sentiment lexicons.
spaCy:

• Industrial-Strength Performance: Designed for speed and efficiency, making it suitable for production environments.
• Pre-trained Statistical Models: Offers highly accurate pre-trained models for multiple
languages, enabling out-of-the-box functionality.
• Focus on Practicality: Emphasizes ease of use and integration with other data science
libraries.
Transformers (Hugging Face):

• Access to State-of-the-Art Models: Provides a vast library of pre-trained transformer models (like BERT, GPT, RoBERTa) for various NLP tasks.
• Simplified Fine-tuning: Offers easy-to-use APIs and tools for fine-tuning pre-trained
models on custom datasets.
• Large and Active Community: Benefits from a large and active community, contributing
to model sharing and support.
TextBlob:

• Simple and Intuitive API: Provides an easy-to-learn and use interface for common NLP
tasks, ideal for beginners.
• Built-in Sentiment Analysis: Offers straightforward functionality for determining the
sentiment of text.
• Leverages NLTK and pattern: Built on top of well-established libraries, providing
access to their functionalities in a simplified manner.
Gensim:

• Focused on Topic Modeling: Specializes in algorithms for discovering abstract topics within document collections.
• Efficient for Large Text Corpora: Designed to handle large datasets efficiently for tasks like topic modeling and similarity analysis.
• Word Embedding Support: Includes functionalities for working with word embedding
models like Word2Vec and FastText.

Stanford CoreNLP:

• High Accuracy and Sophistication: Provides accurate and linguistically rich analysis
through its various tools.
• Wide Range of Linguistic Annotations: Offers comprehensive annotations, including
part-of-speech tags, named entities, dependency parses, and coreference.
• Language Support: Supports a wide range of languages beyond just English.

OpenNLP:

• Apache Foundation Project: Benefits from the stability and community support of the
Apache Software Foundation.
• Comprehensive Set of NLP Tools: Offers a variety of tools for fundamental NLP tasks
like tokenization, POS tagging, NER, and parsing.
• Language Model Support: Allows for the use of different language models for its
various components.

scikit-learn:

• General-Purpose Machine Learning Library: While not exclusively for NLP, it provides
excellent tools for text preprocessing (vectorization) and various classification and
clustering algorithms commonly used in NLP.
• Ease of Use for Machine Learning Tasks: Known for its clean and user-friendly API for
building machine learning models.
• Integration with Other Python Libraries: Works well with other scientific computing
libraries like NumPy and SciPy.

FastText:

• Efficient Word Embeddings: Excels at learning word embeddings quickly, especially for large datasets.
• Handles Out-of-Vocabulary Words: Utilizes character n-grams to generate embeddings
for words not seen during training.
• Fast Text Classification: Provides an efficient method for text classification tasks.

Flair:
• State-of-the-Art Performance: Achieves high accuracy on various NLP tasks by
leveraging contextual string embeddings and transformer models.
• Multi-task Learning Capabilities: Supports training models for multiple NLP tasks
simultaneously, improving overall performance.
• Easy Integration of Pre-trained Models: Simplifies the process of using and fine-tuning
advanced language models.
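
A minimal sketch of the NLTK and TextBlob features listed above (it assumes both libraries are installed and their standard data/corpora have been downloaded, e.g. via nltk.download() and python -m textblob.download_corpora):

import nltk
from textblob import TextBlob

text = "NLTK and TextBlob make natural language processing easy."

# NLTK: tokenization and POS tagging
tokens = nltk.word_tokenize(text)
print(nltk.pos_tag(tokens)[:4])

# TextBlob: simple sentiment analysis and noun phrase extraction
blob = TextBlob(text)
print(blob.sentiment)       # polarity and subjectivity scores
print(blob.noun_phrases)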
Q. Describe in detail the Lesk algorithm and Walker’s algorithm for word sense
disambiguation

Ans : Lesk Algorithm:

The Lesk algorithm is a classic dictionary-based approach to word sense disambiguation. It aims to determine the correct sense of a target word in a given context by comparing the context of the word with the dictionary definitions (glosses) of its possible senses.

Core Idea: The correct sense of a word is the one whose definition shares the most overlap
with the context in which the word appears.

Steps:

1. Identify Possible Senses: For the target word in the sentence, retrieve all its possible
senses and their corresponding dictionary definitions (glosses) from a lexical resource
like WordNet.
2. Define the Context: Determine the context of the target word. This is usually the
sentence containing the word, or a window of words around it.
3. Compare Glosses with Context: For each possible sense of the target word, compare
its gloss with the words in the context.
4. Calculate Overlap: Count the number of shared words (excluding common stop words)
between the gloss of each sense and the context.
5. Select the Best Sense: The sense whose gloss has the highest overlap with the context
is chosen as the correct sense of the target word in that instance. If there is a tie, further
tie-breaking strategies might be employed.

Example:

Target word: "bank" in the sentence "I deposited my money in the bank."

• Senses of "bank" (from a dictionary):


o Sense 1 (financial institution): "an institution for receiving, keeping, and lending
money."
o Sense 2 (river bank): "the land along the side of a river or lake."
• Context: "I deposited my money in the bank." (Context words: deposited, money, in)
• Overlap:
o Sense 1 gloss and context: "money" (overlap count = 1)
o Sense 2 gloss and context: No overlap.
• Result: Sense 1 (financial institution) would be selected as the correct sense
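
A minimal Python sketch of this gloss-overlap idea using NLTK's WordNet (NLTK also provides a ready-made lesk function in nltk.wsd; the stop-word list below is a small illustrative one, and ties are broken by the first sense found):

from nltk.corpus import wordnet as wn

STOPWORDS = {"a", "an", "the", "of", "in", "and", "or", "to", "is", "my", "i"}

def simple_lesk(context_sentence, target_word):
    context = set(context_sentence.lower().split()) - STOPWORDS
    best_sense, best_overlap = None, -1
    for sense in wn.synsets(target_word):
        gloss_words = set(sense.definition().lower().split()) - STOPWORDS
        overlap = len(gloss_words & context)    # shared words between gloss and context
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

sense = simple_lesk("I deposited my money in the bank", "bank")
print(sense.name(), "-", sense.definition())    # the financial-institution sense wins via "money"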

Walker's Algorithm :

Walker's algorithm is often referred to as the Simplified Lesk Algorithm. It's a more efficient
version of the original Lesk algorithm. The key simplification lies in how the context is
defined.

Core Idea: Similar to the Lesk algorithm, the Simplified Lesk algorithm relies on finding the
sense definition that overlaps most with the target word's context. However, it typically
considers a much smaller context, often just the words in the definitions of the words
immediately surrounding the target word in the sentence.

Steps:

1. Identify Possible Senses: For the target word, retrieve all its possible senses and their
glosses.
2. Identify Neighboring Words: Consider a small window of words (e.g., the
immediate neighbors) around the target word in the sentence.
3. Retrieve Senses for Neighbors: For each neighboring word, retrieve its possible
senses and their glosses.
4. Compare Glosses: For each sense of the target word, compare its gloss with the
glosses of the neighboring words' senses.
5. Calculate Overlap: Count the number of shared words (excluding stop words) between
the gloss of the target word's sense and the glosses of the neighboring words' senses.
6. Select the Best Sense: The sense of the target word that has the highest total overlap
with the glosses of the neighboring words' senses is selected as the correct sense.

Example:

Target word: "pine" in the sentence "He chopped down the pine tree."

• Senses of "pine" (from a dictionary):


o Sense 1 (type of tree): "a coniferous tree with needle-shaped leaves."
o Sense 2 (to yearn): "suffer a mental and physical decline because of longing."
• Neighboring words: "chopped", "tree"
• Senses and glosses of neighbors (simplified):
o "chopped": Sense 1 (cut): "sever with a sharp tool."
o "tree": Sense 1 (woody plant): "a tall plant with a trunk and branches."
• Overlap:
o "pine" (sense 1) and "tree" (sense 1): "tree" (overlap = 1)
o "pine" (sense 1) and "chopped" (sense 1): No direct overlap.
o "pine" (sense 2) and glosses of neighbors: No significant overlap.
• Result: Sense 1 (type of tree) for "pine" would likely be selected due to the overlap
with the sense of "tree."

Q. Which types of tasks are performed by the Gensim library? Give an example

Ans : The Gensim library in Python is primarily used for the following types of tasks:
• Topic Modeling: Discovering abstract topics that occur in a collection of documents.
Common algorithms include Latent Semantic Analysis (LSA) and Latent Dirichlet
Allocation (LDA).
• Document Indexing and Similarity Retrieval: Building efficient indexes from large
text collections to quickly find documents similar to a given query or document.

• Word Embeddings: Creating vector representations of words, such as Word2Vec and
FastText, that capture semantic relationships between words.

• Text Summarization: Generating concise summaries of longer documents.

Example:

Let's say you have a collection of news articles and you want to find the main topics discussed
in them. You could use Gensim's LDA model for this task:

1. Preprocess the text: Tokenize the articles (split into individual words), remove stop
words, and potentially perform stemming or lemmatization.

2. Create a dictionary and corpus: Use Gensim to create a dictionary mapping words to
unique IDs and then create a corpus (a bag-of-words representation of the documents).

3. Train the LDA model: Use the corpus to train an LDA model, specifying the number
of topics you want to discover.

4. Analyze the results: The trained LDA model will then output the top words associated
with each discovered topic, allowing you to understand the main themes present in your
news article collection. For example, one topic might have words like "government",
"election", "politics", while another might have "technology", "innovation", "artificial
intelligence".
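A minimal Gensim sketch of these steps (toy, pre-tokenized documents; parameter values are illustrative):

# Illustrative sketch: LDA topic modeling with Gensim
from gensim import corpora
from gensim.models import LdaModel

documents = [["government", "election", "politics", "vote"],
             ["technology", "innovation", "artificial", "intelligence"],
             ["election", "campaign", "government", "policy"]]

dictionary = corpora.Dictionary(documents)                  # map each word to a unique id
corpus = [dictionary.doc2bow(doc) for doc in documents]     # bag-of-words representation

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=42)

for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)                                  # top words per discovered topic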

Q. Explain the following lexical knowledge networks? WordNet, Indo WordNet,
VerbNets, Prop Bank, Treebanks.

Ans:
i) WordNet
▪ WordNet is a lexical database for the English language.
▪ Organizes words into Synsets – sets of synonyms with similar meanings.
▪ Each synset has a gloss (short definition).
▪ Widely used in NLP tasks like machine translation, sentiment analysis, and word sense
disambiguation.
▪ Freely available and publicly accessible.
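For example, a short sketch of looking up synsets and glosses through NLTK's WordNet interface:

# Illustrative sketch: querying WordNet with NLTK
from nltk.corpus import wordnet as wn
# nltk.download('wordnet')   # run once if needed

for synset in wn.synsets("bank")[:3]:
    print(synset.name(), "->", synset.definition())   # synset id and its gloss
    print("   synonyms:", synset.lemma_names())       # lemmas in the synset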

ii) IndoWordNet

▪ A multilingual lexical database for Indian languages.


▪ Covers languages like Hindi, Marathi, Tamil, Telugu, etc.
▪ Words are grouped in synsets with lemma (base form) and gloss.
▪ Aims to support NLP applications in Indian languages.
▪ Helps in cross-lingual studies and machine translation for regional languages.

iii) VerbNet

• Focuses on verbs and their behavior.


• Groups verbs based on syntactic frames and semantic roles.

• Useful in semantic role labeling, question answering, and sentiment analysis.

iv) PropBank

• Annotated resource that maps verbs to their arguments.


• Each verb has rolesets defining different meanings and argument structures.
• Uses semantic role labels: agent, theme, experiencer, etc.
• Supports semantic role labeling, information extraction, and training ML models.
• Developed for multiple languages (English, Chinese, Arabic, etc.).

v) Treebanks

• A corpus where sentences are syntactically annotated.


• Applications: syntactic parsing, grammar training, machine translation, language
learning
• Created via manual annotation with detailed guidelines.

Q. Write Python code using NLTK library to split the text into tokens using whitespace,
punctuation-based and default tokenization methods.

Ans :
o Whitespace tokenization
o Punctuation-based tokenization
o Default tokenization (using word_tokenize)

import nltk

from nltk.tokenize import word_tokenize, WhitespaceTokenizer, WordPunctTokenizer


# Download necessary NLTK resources (only once)
nltk.download('punkt')

# Sample text
text = "Hello! How are you doing today? I'm learning NLTK tokenization."

# 1. Whitespace Tokenization
whitespace_tokens = WhitespaceTokenizer().tokenize(text)
print("Whitespace Tokenization:")
print(whitespace_tokens)

# 2. Punctuation-based Tokenization
punctuation_tokens = WordPunctTokenizer().tokenize(text)
print("\nPunctuation-based Tokenization:")
print(punctuation_tokens)

# 3. Default Tokenization (uses Punkt tokenizer)
default_tokens = word_tokenize(text)
print("\nDefault Tokenization (word_tokenize):")
print(default_tokens)

Q. Describe Walker’s algorithm for word sense disambiguation. How does it differ from
other disambiguation techniques like Lesk’s Algorithm, and what are the scenarios
where it can be most effective?

Ans :

o Walker’s algorithm for word sense disambiguation (WSD) focuses on analyzing
relationships and connections between words in a text.

o Unlike Lesk’s algorithm, which uses overlap of words in dictionary definitions and
immediate context, Walker’s algorithm builds a network or graph of related words and
their senses.

o It identifies patterns or paths in this network to determine the most appropriate meaning
of the ambiguous word.

o This graph-based approach captures complex relationships and broader context clues
that overlap methods might miss.

Key difference:

o Walker’s algorithm uses word networks and relationships.

o Lesk’s algorithm relies on comparing overlaps in dictionary definitions and surrounding
words.

o Walker’s algorithm is most effective when the context is rich with many related words,
such as in:

• Long paragraphs or articles


• Technical or specialized texts where words have multiple related meanings

o It requires enough surrounding text to build a strong network of word connections,
which helps in nuanced disambiguation.

o This makes Walker’s algorithm ideal for detailed and complex word sense
disambiguation tasks where simple dictionary overlap is insufficient.

Q. Compare the Indo Word Net with the traditional WordNet. What are the key
differences and advantages of IndoWordNet, especially in the context of Indian
languages?

Q. Compare and contrast the Natural Language Toolkit (NLTK), spaCy, and TextBlob.
What are their main features, and in what use cases are they most suitable?

Q. What is the significance of PropBank and VerbNet in linguistic resources? Provide
examples of how these resources can be used to extract semantic information from text.
Ans :
PropBank and VerbNet are crucial linguistic resources used in natural language processing
(NLP) to help computers understand the meaning of verbs and the roles of their participants
in sentences.
PropBank is a corpus that annotates verbs with their semantic roles, also called argument
labels. These roles describe the relationship between the verb and other elements in a sentence,
such as the doer of the action (agent), the receiver (patient), and other participants. By
providing these role labels, PropBank helps in semantic role labeling, which allows systems
to understand “who did what to whom” in any sentence. For example, in “She opened the
door,” PropBank would label “She” as the agent and “door” as the patient.

VerbNet is a large-scale verb lexicon that groups verbs into classes based on shared
syntactic and semantic properties. It offers detailed information about verb meanings,
argument structures, and selectional restrictions (which kinds of arguments can go with which
verbs). VerbNet helps systems predict how verbs behave in sentences and understand the
relationships between verbs and their arguments across different contexts.
Example Applications:
Using PropBank, in the sentence “The teacher gave the student a book,” the semantic roles
would be identified as:
o Teacher = Agent (the giver),
o Student = Recipient,
o Book = Theme (the thing given).
This enables a system to extract clear semantic information about the participants and their
roles.

Q. Write a note on the spaCy Library.
Ans : spaCy: An Industrial-Strength NLP Library
spaCy is a modern and efficient open-source library for advanced Natural Language
Processing in Python. Unlike NLTK, which provides a wider range of tools often focused on
research and education, spaCy is designed specifically for production use, emphasizing speed,
efficiency, and ease of integration.
Key Features:
• Industrial-Strength Performance: spaCy is engineered for speed and efficiency. Its
core algorithms and data structures are implemented in Cython, making it significantly
faster than many other NLP libraries for common tasks.
• Pre-trained Statistical Models: spaCy comes with highly accurate pre-trained
statistical models for various languages. These models support core NLP tasks like
tokenization, part-of-speech tagging, named entity recognition, lemmatization, and
dependency parsing out of the box.
• Ease of Use and Integration: spaCy's API is designed to be intuitive and user-friendly.
It offers a clean and consistent way to process text and access linguistic annotations. It
also integrates well with other popular Python libraries for data science and machine
learning.

Typical Uses:
spaCy is widely used in various NLP applications, including:
• Information Extraction: Identifying and extracting structured information like
entities, relationships, and facts from text.
• Natural Language Understanding: Analyzing the meaning and structure of text for
tasks like intent recognition and semantic analysis.
• Text Classification: Categorizing text into predefined categories or topics.
• Sentiment Analysis: Determining the emotional tone or opinion expressed in a piece
of text.
• Building Conversational AI: Powering chatbots and virtual assistants with natural
language understanding capabilities.
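For example, a minimal sketch of a typical spaCy pipeline (assumes the small English model en_core_web_sm has been downloaded):

# Illustrative sketch: tokenization, POS tagging, lemmatization and NER with spaCy
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

for token in doc[:5]:
    print(token.text, token.pos_, token.lemma_)   # token, part of speech, lemma

for ent in doc.ents:
    print(ent.text, ent.label_)                   # named entities, e.g. Apple -> ORG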

Q. Write a note on : Sentiment Analysis.
Ans : Sentiment Analysis in NLP
Sentiment Analysis, also known as opinion mining, is a field within Natural Language
Processing (NLP) that aims to determine the emotional tone or attitude expressed in a piece of
text. It involves identifying, extracting, quantifying, and studying subjective information such
as opinions, emotions, evaluations, appraisals, attitudes, and beliefs towards a particular topic,
product, service, person, event, or issue.
Key Aspects:
• Subjectivity vs. Objectivity: Sentiment analysis focuses on subjective text, which
expresses opinions or feelings, as opposed to objective text, which presents factual
information.
• Polarity: The core task is often to determine the polarity of the sentiment expressed,
which can be positive, negative, or neutral. Some systems also classify sentiment into
more granular categories like very positive, slightly positive, etc.
Approaches to Sentiment Analysis: Various techniques are employed, including:
• Lexicon-based: Relies on a dictionary or lexicon of words where each word is associated
with a sentiment score or label. The overall sentiment of a text is determined by
aggregating the sentiment of the words it contains.
• Machine Learning-based: Utilizes machine learning algorithms trained on labeled
datasets of text with known sentiment. Common algorithms include Naive Bayes,
Support Vector Machines (SVMs).

Importance and Applications:
• Customer Feedback Analysis: Businesses use it to understand customer opinions from
reviews, social media, and surveys to improve products and services.
• Brand Monitoring: Tracking public sentiment towards a brand or company in real-time.
• Market Research: Identifying consumer preferences and trends.
• Social Media Monitoring: Understanding public reactions to events, campaigns, or
political figures.
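For example, a minimal lexicon-based sketch using NLTK's VADER analyzer (one possible lexicon-based tool; the review sentences are invented):

# Illustrative sketch: lexicon-based sentiment scoring with NLTK's VADER
from nltk.sentiment import SentimentIntensityAnalyzer
# nltk.download('vader_lexicon')   # run once if needed

sia = SentimentIntensityAnalyzer()
for review in ["The product is absolutely wonderful!",
               "Terrible service, I am very disappointed."]:
    print(review, "->", sia.polarity_scores(review))   # neg/neu/pos/compound scores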

Q. Explain Statistical Machine Translation (SMT) with suitable diagrams and example.
Ans : Statistical Machine Translation (SMT) is a paradigm in machine translation that relies
on statistical models derived from large bilingual text corpora (parallel texts) to translate text
from one language to another. Unlike rule-based MT which uses explicit linguistic rules, SMT
learns translation rules automatically from the data.

Explanation of the Flow:


1. Source Sentence: This is the sentence in the original language that needs to be
translated.
2. Translation Model: This model, learned from a large parallel corpus, provides
probabilities for translating words and phrases from the source language to the target
language.
3. Language Model: This model, trained on a large corpus of the target language,
evaluates the fluency and grammatical correctness of the potential target sentences.
4. Decoder: This component uses the probabilities from both the Translation Model and
the Language Model to search for the most likely and fluent translation of the source
sentence.
5. Target Sentence: This is the final output, the translated sentence in the desired
language.
6. Parallel Corpus (Implicit Connection): Both the Translation Model and the Language
Model are trained on a large collection of source text and their corresponding
translations (the parallel corpus), which is the foundation of the SMT system.
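The decoder's search can be summarized by the standard noisy-channel formulation, where e is a candidate target sentence and f is the source sentence:

% Noisy-channel view of SMT: choose the target sentence that maximizes the product
% of the translation-model probability and the language-model probability
\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(f \mid e)\, P(e)

Here P(f | e) comes from the Translation Model (adequacy) and P(e) from the Language Model (fluency).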

Q. Describe various Machine Translation Approaches.
Ans :
Rule-Based Machine Translation (RBMT): Translates using predefined linguistic rules and
dictionaries. This approach relies on a set of linguistic rules (grammar, morphology, syntax,
semantics) defined by human experts to translate text.
• Example: Translating "Hello world" from English to French by looking up "Hello" as
"Bonjour" and "world" as "monde" and applying a rule for word order.
Statistical Machine Translation (SMT): Uses statistical models learned from parallel text
to find the most probable translation. It aims to find the most probable target sentence given a
source sentence based on probabilities derived from the data.
• Example: If the system has seen many translations of phrases containing "how are you"
and "comment ça va", it will statistically favor "Comment ça va ?" as the French
translation.
Neural Machine Translation (NMT): Employs neural networks to directly learn the
mapping between languages end-to-end.
• Example: A Transformer model learns complex patterns and context to translate the
entire sentence "The cat sat on the mat" to its French equivalent "Le chat était assis sur
le tapis" in one go.
Example-Based Machine Translation (EBMT): Translates by finding similar previously
translated sentences and adapting their translations. EBMT works well for translating phrases
and sentences that are very similar to those in the training data.
• Example: If the system has a translation for "How old are you?", and the input is "How
old is your dog?", it might reuse the translated structure of the first sentence and translate
"dog".
Hybrid Approaches: Combines multiple MT approaches to leverage their strengths. Many
modern MT systems combine different approaches to leverage their strengths.
• Example: An NMT system might use a rule-based component to handle specific
grammatical structures or terminology more accurately.

Q. Explain Natural Language Generation with reference architecture.
Ans : Natural Language Generation (NLG) in NLP is the process of automatically
generating human-readable text from structured data or unstructured information. It's
essentially the reverse of Natural Language Understanding (NLU). Instead of computers
understanding human language, NLG allows computers to produce language that humans can
understand.

• Structured Data / Information: This is the input to the NLG system. It could be data
from a database, spreadsheets, sensor readings, or any organized information.
• Content Planning (What to say): This stage involves deciding what information from
the input data is relevant and important to communicate to the user. It involves tasks like
selecting, filtering, and ordering the content.
• Sentence Planning (How to say it): In this stage, the system determines how to express
the selected content in natural language. This includes deciding on the sentence structure,
choosing appropriate words, and determining the overall tone and style.
• Surface Realization (Making it fluent): This is the final stage where the planned
sentences are converted into well-formed and fluent natural language text. It involves tasks
like grammar checking, morphological realization (word forms), and ensuring proper syntax.
• Natural Language Text: This is the final output of the NLG system – the generated text
that is understandable to humans.
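A toy sketch of these stages (template-based, on invented weather data) to make the pipeline concrete:

# Toy illustration of the three NLG stages; the data, keys and template are invented
data = {"city": "Pune", "temp_c": 31, "condition": "sunny", "humidity": 60}

# 1. Content planning: decide what to say (humidity is dropped for a short report)
selected = {k: data[k] for k in ("city", "condition", "temp_c")}

# 2. Sentence planning: choose sentence structure and wording (a fixed template here)
template = "The weather in {city} is {condition}, with a temperature of {temp_c} degrees Celsius."

# 3. Surface realization: produce the final fluent text
print(template.format(**selected))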

Q. Explain three stages of Question Answering system with neat diagram.
Ans :

Question Processing:
• Parsing: Breaking down the question into its grammatical components.
• Keyword Extraction: Identifying the important words and phrases in the question that
indicate the information need.
• Question Classification: Determining the type of question being asked (e.g., factual,
definitional, list-based) to guide the search for the answer.
• Query Formulation: Transforming the question into a suitable query format for
retrieving relevant information from the knowledge source.
Information Retrieval: This stage focuses on finding relevant information from a large
collection of text or a knowledge base that might contain the answer to the processed question.
Key steps include:
• Document Retrieval: Identifying a subset of documents or knowledge snippets that are
likely to contain the answer based on the formulated query. Techniques like keyword
matching, vector space models, or semantic search can be used.
• Passage Retrieval (Optional but common): If the document collection is large, this
step involves further narrowing down the search to specific passages or sections within
the retrieved documents that are most relevant to the question.
Answer Processing: This final stage involves extracting the specific answer from the
retrieved information and presenting it to the user in a clear and concise format. Key tasks
include:
• Answer Extraction: Identifying the precise piece of information within the retrieved
text or knowledge base that answers the question. This can involve techniques like span
extraction, rule-based methods, or using machine learning models trained for answer
extraction.
• Answer Ranking (if multiple answers are found): If multiple potential answers are
identified, they are ranked based on their relevance and confidence scores.
• Answer Generation (in some systems): For more complex questions, the system might
need to generate a new answer by synthesizing information from multiple sources or
performing reasoning.
• Answer Presentation: Formatting the extracted or generated answer in a user-friendly
way.

Q. Explain Rule based Machine Translation and Statistical Machine Translation (SMT)
with suitable diagrams and example.
Ans : Rule-Based Machine Translation (RBMT) is an approach to machine translation that
relies on a set of linguistic rules created by human experts to translate text from one language
to another. These rules cover grammar, morphology, syntax, semantics, and often involve
bilingual dictionaries.
Key aspects of RBMT:
• Human-Crafted Rules: Translation is achieved by applying a complex set of
manually defined rules for each language pair.
• Linguistic Knowledge: Requires deep linguistic knowledge of both the source and
target languages.
• Dictionaries and Lexicons: Heavily relies on bilingual dictionaries to translate
individual words and phrases.
• Morphological Analysis: Analyzes the word structure to handle inflections and
variations.
• Syntactic Analysis: Parses the sentence structure to understand the grammatical
relationships between words.
• Semantic Analysis: Attempts to understand the meaning of the words and sentences
to produce accurate translations.
Example: Let's say we want to translate the English sentence "The cat is sleeping." to
French.
An RBMT system might have the following rules and dictionary entries:
• Dictionary:
o The -> Le
o cat -> chat (masculine noun)
o is -> est (third-person singular of "être")
o sleeping -> dort (third-person singular of "dormir")
• Grammatical Rule: Subject-Verb-Object (SVO) structure in English often translates
to Subject-Verb-Object in French (with appropriate article and verb conjugation).
• Agreement Rule: Ensure the article ("Le") agrees in gender with the noun ("chat").
Applying the rules:
1. The system analyzes the English sentence: "The cat is sleeping."
2. It uses the dictionary to look up each word.
3. It applies the grammatical rule, confirming the SVO structure.
4. It applies the agreement rule, ensuring "Le" correctly precedes the masculine noun
"chat".
5. A verb rule maps the English progressive "is sleeping" onto the French simple present
"dort", giving the final translation: "Le chat dort."
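A toy sketch of this dictionary-plus-rules idea (the dictionary entries and the verb rule below are illustrative only, not a real RBMT system):

# Toy rule-based translation: dictionary lookup plus a rule for the English progressive
dictionary = {"the": "le", "cat": "chat", "is sleeping": "dort"}

def translate(sentence):
    words = sentence.lower().rstrip(".").split()
    tokens, i = [], 0
    while i < len(words):
        # Rule: map "is <verb>ing" to a single French present-tense form
        if words[i] == "is" and i + 1 < len(words) and words[i + 1].endswith("ing"):
            tokens.append(dictionary.get("is " + words[i + 1], words[i + 1]))
            i += 2
        else:
            tokens.append(dictionary.get(words[i], words[i]))
            i += 1
    return " ".join(tokens).capitalize() + "."

print(translate("The cat is sleeping."))   # -> Le chat dort.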

Q. Describe following NLP applications: Text Entailment Dialog and Conversational
Agents.
Ans :
Text Entailment is a key NLP application that determines if the meaning of one sentence
(called the hypothesis) can be logically inferred from another sentence (called the text).
It helps in understanding whether a hypothesis is true, false, or unknown based on the
original text, enabling machines to grasp relationships between sentences.
This is useful in various NLP tasks like question answering, information retrieval,
summarization, and natural language inference.
For example, if the text is “A man is riding a bicycle,” the hypothesis “A person is outdoors”
can be inferred as true.
Dialog and Conversational Agents are systems designed to interact with users through
natural language conversations.
They interpret user inputs, maintain context, and generate relevant and coherent responses
to simulate human-like dialogue.
Examples include virtual assistants like Siri, Alexa, and customer support chatbots.
These agents use NLP techniques, machine learning, and dialogue management to
understand and respond effectively.
They help users perform tasks such as booking services, answering queries, and providing
recommendations, making interactions smooth and efficient.
Overall, both Text Entailment and Dialog Agents significantly enhance how machines
understand and interact with human language, enabling smarter applications.

Q. Discuss the challenges in cross-lingual translation and provide examples of how it is
beneficial in real world applications.
Ans :
Cross-lingual translation involves translating text or speech from one language to another,
which presents several challenges due to linguistic and cultural differences.
One major challenge is the variation in grammar, syntax, and sentence structure across
languages, making direct translation difficult and often inaccurate.
Another challenge is dealing with idioms, metaphors, and cultural references that may not
have equivalents in the target language, causing loss of meaning or confusion.
Ambiguity in words and phrases, where one word can have multiple meanings, complicates
the correct interpretation during translation.
Handling low-resource languages with limited data and linguistic resources also poses
significant difficulties for accurate translation.
Maintaining the tone, style, and context of the original message while translating is a
complex task, especially for nuanced or formal texts.
Despite these challenges, cross-lingual translation is highly beneficial in real-world
applications such as global business communication, enabling companies to reach
international markets effectively.
It supports multilingual customer service by allowing chatbots and agents to interact with
users in their native languages, improving user experience.
Cross-lingual translation is crucial in education, helping students access learning materials
in different languages.
It also facilitates diplomatic communication, international collaboration, and cultural
exchange by breaking down language barriers.
For example, platforms like Google Translate help travelers understand signs and
communicate abroad, while multinational companies use translation systems to manage global
operations seamlessly.

Q. Define natural language generation and its role in NLP. How does NLG differ from
text-to-speech synthesis, and what are the applications of NLG in data reporting and
storytelling?
Ans :
Natural Language Generation (NLG) is a branch of Natural Language Processing (NLP)
that focuses on automatically creating human-like text from structured data or information.
Its main role in NLP is to transform complex data into easy-to-understand, coherent, and
meaningful language, enabling machines to communicate effectively with humans.
NLG differs from text-to-speech (TTS) synthesis because NLG generates the actual written
text, while TTS converts that text into spoken language. Essentially, NLG produces the
content, and TTS vocalizes it.
Applications of NLG in data reporting include automatic generation of financial reports,
business analytics summaries, and real-time updates, which help save time and reduce manual
work.
In storytelling, NLG can create narratives from data such as sports reports, weather
forecasts, or personalized news stories, making information more engaging and easier to
understand.
NLG improves efficiency and consistency in industries like journalism, healthcare, finance,
and customer support by generating accurate and customized content quickly.
Overall, NLG bridges the gap between raw data and natural language, making information
more accessible and actionable for users.

Q. Explain the key principles of rule-based machine translation. How do rule-based
techniques differ from statistical approaches in machine translation? provide an example
of a rule-based translation.
Ans :
Rule-based Machine Translation (RBMT) relies on a set of linguistic rules and dictionaries
to translate text from one language to another.
The key principles of RBMT include using grammar rules, syntactic and semantic analysis
of the source language, and applying corresponding rules to generate the target language text.
RBMT systems typically involve three main stages: analysis of the source language, transfer
where rules map source structures to target structures, and generation of the target language
sentence.
These systems require detailed knowledge about both source and target languages, including
vocabulary, morphology, syntax, and semantics.
Rule-based translation is deterministic, meaning it follows fixed rules and produces
predictable output based on those rules.
In contrast, Statistical Machine Translation (SMT) uses large amounts of bilingual text data
to learn probable translations based on statistics and probabilities, without explicit linguistic
rules.
SMT can handle variations and ambiguity better by learning from data, but sometimes
produces less grammatically accurate results compared to RBMT.
An example of rule-based translation is translating the English sentence "She is eating an
apple" to Spanish by applying rules: identify subject, verb, and object, then map to Spanish
syntax as "Ella está comiendo una manzana."
Overall, RBMT focuses on language knowledge and linguistic rules, while SMT relies on
statistical patterns learned from data.

Q. Discuss the key components of a conversational agent, such as chatbots or virtual
assistants. How do natural language generation and understanding play a role in creating
effective conversational agents?
Ans :
A conversational agent, like a chatbot or virtual assistant, consists of several key components
that work together to interact with users naturally.
The main components include Natural Language Understanding (NLU), Dialogue
Management, Natural Language Generation (NLG), and sometimes Speech Recognition and
Speech Synthesis for voice-based agents.
NLU is responsible for interpreting the user's input by analyzing the text to understand
intent, extract relevant information (entities), and identify the user's goals.
Dialogue Management controls the flow of the conversation, deciding what the agent should
do next based on the user's input and the context of the conversation.
NLG is the process of generating meaningful, grammatically correct, and contextually
appropriate responses to the user’s queries or statements.
Speech Recognition converts spoken language into text, while Speech Synthesis converts
the agent’s text response back into spoken language for voice-based interaction.
Natural Language Understanding allows the agent to comprehend user requests accurately,
while Natural Language Generation enables it to respond in a human-like and coherent
manner.
Effective conversational agents combine NLU and NLG to maintain smooth, context-aware
interactions that feel natural and helpful to users.
For example, a virtual assistant like Siri or Alexa understands commands (“set an alarm for
7 AM”) through NLU and generates the response (“Alarm set for 7 AM”) using NLG,
providing a seamless user experience.
These components together ensure conversational agents can understand varied user inputs,
manage dialogue flow, and provide relevant, clear, and context-sensitive replies.
