Natural Language Processing
Assignment 1
1) Define natural language processing. List and explain different NLP applications.
Ans|
Email Platforms
Explanation: Email platforms like Gmail and Outlook use NLP to offer features such as
spam classification, priority inbox sorting, calendar event extraction, and auto-complete.
Example: Gmail automatically categorizes incoming emails into primary, social,
promotions, etc., and filters out spam.
Voice-Based Assistants
Explanation: Assistants like Apple Siri, Google Assistant, Microsoft Cortana, and Amazon
Alexa rely on NLP to understand user commands and provide appropriate responses.
Example: Siri setting a reminder or answering a question about the weather.
Search Engines
Explanation: Search engines like Google and Bing use NLP for query understanding, query
expansion, question answering, information retrieval, and result ranking and grouping.
Example: Google interpreting a search query and providing the most relevant results.
E-commerce Platforms
Explanation: E-commerce platforms like Amazon use NLP to extract information from
product descriptions and understand user reviews.
Example: Analyzing customer reviews to improve product recommendations.
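Education Tools
Explanation: NLP is used in learning and assessment tools such as plagiarism detection
(e.g., Turnitin), intelligent tutoring systems, and language learning apps like Duolingo.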
Example: Turnitin detecting plagiarism in student essays.
Knowledge Bases
Explanation: Large knowledge bases like the Google Knowledge Graph are built using NLP
and are used in applications like search and question answering.
Example: Google’s Knowledge Graph providing detailed information about entities directly
in search results.
2) Explain various NLP tasks and draw a diagram to represent the relative difficulty
among the tasks.
Ans|
NLP tasks:
Language Modeling
Explanation: Language modeling involves predicting the next word in a sentence based on
the history of previous words. The goal is to learn the probability of sequences of words in a
given language.
Applications: This task is foundational for many problems, such as speech recognition,
optical character recognition, handwriting recognition, machine translation, and spelling
correction.
Text Classification
Explanation: Text classification is the task of categorizing text into predefined categories
based on its content. It is the most popular task in NLP.
Applications: This is used in various tools, including email spam detection, sentiment
analysis, and topic classification.
Information Extraction
Explanation: Information extraction involves identifying and extracting relevant information
from text, such as names, dates, and specific facts.
Applications: Extracting calendar events from emails or identifying names of people
mentioned in social media posts.
Information Retrieval
Explanation: This task involves finding documents relevant to a user query from a large
collection.
Applications: Google Search is a well-known example of an information retrieval
application.
Conversational Agent
Explanation: Building dialogue systems that can converse with users in natural language.
Applications: Voice assistants like Alexa, Siri, Google Assistant, and chatbots.
Text Summarization
Explanation: The task of creating concise summaries of longer documents while retaining
the core content and meaning.
Applications: Summarizing news articles, research papers, or lengthy reports.
Question Answering
Explanation: Developing systems that can automatically answer questions posed in natural
language.
Applications: Virtual assistants and customer support bots that provide precise answers to
user queries.
Machine Translation
Explanation: The task of translating text from one language to another.
Applications: Services like Google Translate and Amazon Translate.
Topic Modeling
Explanation: Uncovering the underlying topical structure of a large collection of documents.
Applications: Used in text mining to identify themes in literature, research papers, or large
text datasets.
3) What is a language? List and explain the building blocks of language with examples
and applications.
Ans|
Phonemes
Definition: Phonemes are the smallest units of sound in a language. They may not have
meaning individually but can form meaningful units when combined.
Example: In English, the word "cat" is composed of the phonemes /k/, /æ/, and /t/.
Morphemes and Lexemes
Definition:
Morphemes: The smallest units of language that carry meaning. They can be words,
prefixes, or suffixes.
Lexemes: Variations of morphemes that are related by meaning.
Examples:
Morphemes: In the word "unbreakable," "un-" is a prefix morpheme, "break" is the root
morpheme, and "-able" is a suffix morpheme.
Lexemes: "Run" and "running" are different forms of the same lexeme.
Applications: Morphological analysis, tokenization, stemming, learning word embeddings,
and part-of-speech tagging.
Syntax
Definition: Syntax refers to the rules for constructing grammatically correct sentences from
words and phrases in a language. It represents the structure of sentences.
Example: The sentence "The girl laughed at the monkey" can be represented as a parse tree
with a noun phrase (NP) "The girl" and a verb phrase (VP) "laughed at the monkey."
Applications: Parsing (automatically constructing parse trees), entity extraction, and relation
extraction.
Context:
Definition: Context is the information that surrounds language and helps convey meaning. It
includes semantics (literal meaning) and pragmatics (meaning derived from world knowledge
and external context).
Examples:
Semantics: The sentence "The bank is closed" refers directly to the closure of a financial
institution.
Pragmatics: In the sentence "The bank is closed," pragmatics could reveal whether it is a
financial bank or the side of a river based on the conversation context.
Applications: Complex NLP tasks such as sarcasm detection, summarization, and topic
modeling.
4) Why is NLP challenging?
Ans|
NLP, or Natural Language Processing, is challenging due to the inherent complexity and
nuances of human language. Below are some key factors that contribute to the difficulty of
NLP:
Ambiguity
Definition: Ambiguity means uncertainty of meaning. Human languages are inherently
ambiguous, allowing for multiple interpretations of the same text.
Examples:
The sentence “I made her duck” can mean either:
I cooked a duck for her.
I caused her to bend down to avoid an object.
Figurative language, such as idioms, further increases ambiguity. For instance, “He is as
good as John Doe” depends on the unspecified quality of John Doe.
Challenges: Disambiguating such sentences requires understanding the context, which is
often beyond the capabilities of most NLP systems.
Common Knowledge
Definition: Common knowledge refers to the set of facts that most humans are aware of and
use implicitly in conversation.
Examples:
The sentences “man bit dog” and “dog bit man” are syntactically similar, but humans
understand that the former is unlikely while the latter is possible due to common knowledge
about human and dog behavior.
Challenges: Encoding common knowledge into computational models is difficult because it
involves a vast and often implicit understanding of the world that humans take for granted.
Creativity
Definition: Language involves not just rules but also creativity, manifesting in various styles,
dialects, genres, and variations.
Examples:
Poetry, metaphors, and idiomatic expressions showcase the creative use of language.
Challenges: Understanding and processing creative aspects of language is a hard problem in
AI, as it requires going beyond rule-based approaches to capture the nuances and innovation
in human expression.
5) With a neat diagram explain how NLP, ML and Deep Learning are related?
Ans|
Artificial Intelligence (AI): The outermost layer encompasses all intelligent systems,
including those based on traditional logic and heuristic approaches as well as learning-based
systems.
Machine Learning (ML): Within AI, ML focuses on systems that learn from data. It
overlaps with other AI approaches but represents a distinct methodology centered on data-
driven learning.
Deep Learning (DL): A subset of ML, DL uses multi-layered neural networks to handle
complex data patterns and tasks. It is a powerful tool within the ML toolkit, particularly for
high-dimensional data.
Natural Language Processing (NLP): NLP sits within AI and intersects significantly with
ML and DL. Traditional NLP methods often relied on rule-based systems, but modern NLP
increasingly leverages ML and DL techniques to achieve more sophisticated language
understanding and generation.
AI and NLP:
Initially, AI and NLP used rule-based and heuristic methods to process language. These
methods included handcrafted rules to parse and understand language.
ML and NLP:
Over the past few decades, NLP has increasingly adopted ML techniques. These involve
training models on large datasets to perform tasks such as text classification, sentiment
analysis, and named entity recognition.
DL and NLP:
In recent years, DL has revolutionized NLP. Deep learning models, especially those based on
neural networks, have achieved state-of-the-art results in many NLP tasks. Examples include
language modeling, machine translation, and speech recognition. DL models can
automatically learn hierarchical features from raw text data, reducing the need for manual
feature engineering.
Examples of Applications
Supervised Learning in NLP: Email spam classification, sentiment analysis, and part-of-
speech tagging.
Unsupervised Learning in NLP: Topic modeling, clustering, and word embeddings.
Deep Learning in NLP: Transformer models like BERT and GPT for tasks such as text
generation, translation, and summarization.
6) Explain the different approaches to NLP.
Ans|
Heuristics-Based NLP
Machine Learning for NLP
Deep Learning for NLP
1. Heuristics-Based NLP
Definition: This approach involves building rules for the task at hand. It requires domain
expertise to formulate rules that can be incorporated into a program. Resources such as
dictionaries and thesauruses are typically used.
Lexicon-based Sentiment Analysis: Uses counts of positive and negative words in the text
to deduce sentiment.
Knowledge Bases: Tools like WordNet, which is a database of words and the semantic
relationships between them.
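The lexicon-based sentiment analysis described above can be sketched in a few lines. This is a minimal illustration, assuming two tiny hand-made word lists (POSITIVE_WORDS and NEGATIVE_WORDS are made up for this example and are not a real sentiment lexicon).

```python
import re

# Tiny hand-made lexicons (illustrative assumptions, not a real sentiment lexicon)
POSITIVE_WORDS = {"good", "great", "love", "amazing", "excellent"}
NEGATIVE_WORDS = {"bad", "hate", "terrible", "awful", "poor"}


def lexicon_sentiment(text):
    """Label text by comparing counts of positive and negative lexicon words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positive = sum(token in POSITIVE_WORDS for token in tokens)
    negative = sum(token in NEGATIVE_WORDS for token in tokens)
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "neutral"


for sentence in ["I love this amazing book", "This movie is terrible", "The parcel arrived today"]:
    print(sentence, "->", lexicon_sentiment(sentence))
```

Because the rule only counts words, it cannot handle negation ("not good") or sarcasm, which is one reason heuristics are usually combined with learned models.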
2. Machine Learning for NLP
Steps:
Popular Algorithms:
Naive Bayes: A classification algorithm based on Bayes' theorem, assuming feature
independence.
Support Vector Machine (SVM): Learns decision boundaries (linear or nonlinear) to
separate data points of different classes.
Hidden Markov Model (HMM): Assumes an underlying hidden process generating the
data; used for tasks like part-of-speech (POS) tagging.
Conditional Random Fields (CRF): Used for sequential data, performs classification on
each element in a sequence, often outperforming HMMs for tasks like POS tagging.
3. Deep Learning for NLP
Popular Architectures:
Recurrent Neural Networks (RNNs): Suitable for sequential data, remembers previously
processed information, used in tasks like text classification and machine translation.
Long Short-Term Memory Networks (LSTMs): A type of RNN that can remember longer
contexts by selectively forgetting irrelevant information.
Convolutional Neural Networks (CNNs): Common in computer vision, also used for text
classification by treating text as a matrix of word vectors.
Transformers: Utilize self-attention mechanisms to model textual context, with models like
BERT and GPT.
7) With a block diagram explain briefly the key stages of NLP pipeline?
Ans|
Data Acquisition: Obtaining relevant data for the NLP task at hand.
Text Cleaning: Processing and cleaning the acquired text data to remove noise,
inconsistencies, and irrelevant information.
Pre-processing: Converting the cleaned text data into a standardized format suitable for
further analysis.
Feature Engineering: Extracting meaningful features from the pre-processed text data to
represent it in a format understandable by modeling algorithms.
Modeling: Building one or more models using the engineered features to solve the NLP task.
Evaluation: Assessing the performance of the models using relevant evaluation metrics and
comparing them to select the best-performing model.
Deployment: Integrating the chosen model into production systems for real-world use.
Monitoring & Model Updating: Regularly monitoring the deployed model's performance
and updating it as needed to maintain optimal performance over time.
8) Define data acquisition. List and explain various sources for data acquisition.
Ans|
a. Public Datasets:
Public datasets are collections of data that are openly available for use by anyone. These
datasets are often curated and maintained by organizations, research institutions, or
government agencies.
Researchers and developers can leverage public datasets for various NLP tasks if the data is
relevant to their project. Platforms like Kaggle, UCI Machine Learning Repository, and
Google's Dataset Search provide access to a wide range of public datasets.
b. Web Scraping:
Web scraping involves extracting data from websites. Developers can scrape relevant text
data from online sources such as forums, social media platforms, or discussion boards where
users post queries or comments related to the NLP task at hand.
Once the data is scraped, it may need to be cleaned and labeled before it can be used for
training NLP models.
c. Product Intervention:
In industrial settings, AI models are often integrated into products or services. Product teams
can collaborate with AI teams to collect data directly from the usage of the product.
This approach, known as product intervention, involves instrumenting the product to collect
user interactions and feedback, which can then be used as training data for NLP models.
d. Data Augmentation:
Data augmentation techniques involve generating synthetic data from existing datasets to
increase the diversity and size of the training data.
Methods such as synonym replacement, back translation, TF-IDF-based word replacement,
bigram flipping, replacing entities, and adding noise to data can be used to create augmented
data for NLP tasks.
e. Advanced Techniques:
Advanced techniques such as Snorkel, Easy Data Augmentation (EDA), NLPAug, and active
learning provide additional methods for data acquisition and augmentation.
These techniques automate the process of generating training data or facilitate the acquisition
of labeled data in scenarios where manual labeling is expensive or time-consuming.
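Two of the augmentation methods listed under Data Augmentation, synonym replacement and bigram flipping, can be sketched as follows. This is only an illustration: the SYNONYMS table is a hypothetical stand-in for a real synonym source such as WordNet, and the function names are chosen for this example.

```python
import random

# Hypothetical synonym table; a real system might draw synonyms from WordNet
SYNONYMS = {"quick": ["fast", "speedy"], "happy": ["glad", "joyful"]}


def synonym_replace(tokens, rng):
    """Replace every token that has an entry in SYNONYMS with a random synonym."""
    return [rng.choice(SYNONYMS[t]) if t in SYNONYMS else t for t in tokens]


def bigram_flip(tokens, rng):
    """Swap one randomly chosen pair of adjacent tokens."""
    flipped = list(tokens)
    if len(flipped) >= 2:
        i = rng.randrange(len(flipped) - 1)
        flipped[i], flipped[i + 1] = flipped[i + 1], flipped[i]
    return flipped


rng = random.Random(0)  # fixed seed so runs are repeatable
tokens = "the quick dog looks happy".split()
print(synonym_replace(tokens, rng))
print(bigram_flip(tokens, rng))
```

Each call yields a new variant of the same sentence, so a small labeled dataset can be expanded while keeping the original labels.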
9) Define text cleaning. With proper code, explain the steps applied in text cleaning.
Ans|
Text cleaning is the process of removing noise from raw text, such as markup, metadata, and
special characters, and applying spelling correction and normalization to ensure that
the text is in a clean and standardized format.
Lowercasing
Converting all text to lowercase helps maintain consistency during NLP tasks and text mining.
Python's lower() string method makes this step straightforward.
Removing Punctuation
In this step we remove all punctuation, because punctuation adds noise that can introduce
ambiguity while training the model.
Ex:
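import re
from nltk.tokenize import word_tokenize  # requires the NLTK "punkt" tokenizer data
text = "This is a sample sentence, showing off the stop words filtration."
tokens = word_tokenize(text)
# Keep only tokens that contain at least one letter or digit (drops punctuation tokens)
cleaned_tokens = [token for token in tokens if re.search(r"\w", token)]
print(cleaned_tokens)
['This', 'is', 'a', 'sample', 'sentence', 'showing', 'off', 'the', 'stop', 'words', 'filtration']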
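Lowercasing and punctuation removal can also be done with the Python standard library alone. The sketch below is one possible way to do it; the function name basic_clean is chosen for this example, and it only strips the ASCII punctuation listed in string.punctuation (curly quotes and other Unicode marks would need extra handling).

```python
import string


def basic_clean(text):
    """Lowercase the text and strip ASCII punctuation characters."""
    lowered = text.lower()
    return lowered.translate(str.maketrans("", "", string.punctuation))


print(basic_clean("Hello, World! NLP is FUN."))
```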
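HTML Parsing and Cleanup
When text is scraped from web pages, HTML tags and other markup must be stripped to get the
textual content. Libraries like Beautiful Soup and Scrapy can be used to parse HTML and
extract desired text elements.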
Below is an example code snippet using Beautiful Soup to extract a question and its best
answer from a Stack Overflow web page:
Code:
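from urllib.request import urlopen
from bs4 import BeautifulSoup
# Define URL
myurl = "https://stackoverflow.com/questions/415511/how-to-get-the-current-time-in-python"
html = urlopen(myurl).read()
soup = BeautifulSoup(html, "html.parser")
# Extract question (assumes it sits in a <div class="question">; the site's markup may change)
question = soup.find("div", {"class": "question"})
print("Question:\n", question.get_text().strip())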
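Spelling Correction
import enchant
spellchecker = enchant.Dict("en_US")
words = "Ths is a smple sentence".split()  # illustrative input
# Replace each misspelled word with the checker's first suggestion, if there is one
corrected_text = " ".join(w if spellchecker.check(w) else (spellchecker.suggest(w) or [w])[0] for w in words)
print(corrected_text)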
10) Define preprocessing with proper code. Explain the steps involved in preprocessing.
Ans|
Preprocessing in NLP
Preprocessing prepares raw text data for analysis and model building, ensuring it is cleaned,
normalized, and structured. Key steps include sentence segmentation, word tokenization, stop
word removal, lowercasing, removing punctuation and digits, stemming, and lemmatization.
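from string import punctuation
from nltk import word_tokenize
from nltk.corpus import stopwords
def preprocess_corpus(texts):
    mystopwords = set(stopwords.words("english"))
    def remove_stops_digits(tokens):
        return [t.lower() for t in tokens if t.lower() not in mystopwords and not t.isdigit() and t not in punctuation]
    return [remove_stops_digits(word_tokenize(text)) for text in texts]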
Explanation
Sentence Segmentation: sent_tokenize(mytext) splits the text into sentences.
Word Tokenization: word_tokenize(sentence) splits sentences into words.
Stop Word Removal, Lowercasing, Removing Digits/Punctuation:
remove_stops_digits(tokens) removes stop words, digits, punctuation, and converts to
lowercase.
Stemming: stem_tokens(tokens) applies the PorterStemmer.
Lemmatization: lemmatize_tokens(tokens) converts words to their dictionary form.
11) Define stemming and lemmatization. With proper code explain the working of
stemming and lemmatization.
Ans|
Stemming
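Stemming is a text preprocessing technique in NLP that reduces the inflected forms of a
word to a common base form (stem), typically by stripping suffixes.
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
# Sample words
word1, word2 = "cars", "revolution"
# Applying stemming
print(stemmer.stem(word1), stemmer.stem(word2))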
In this example, "cars" is stemmed to "car," and "revolution" is stemmed to "revolut," where
the latter is not a correct linguistic form.
Lemmatization
Lemmatization reduces words to their base or dictionary form, known as a lemma,
considering the context and part of speech (POS) of the word. It requires more linguistic
knowledge compared to stemming and usually results in a valid word.
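from nltk.stem import WordNetLemmatizer  # requires the NLTK "wordnet" corpus
lemmatizer = WordNetLemmatizer()
# Applying lemmatization (pos="a" marks "better" as an adjective)
print(lemmatizer.lemmatize("better", pos="a"))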
import spacy
sp = spacy.load('en_core_web_sm')
# Sample word
token = sp(u'better')
# Applying lemmatization
for word in token:
    print(word.text, word.lemma_)
In this case, spaCy lemmatizes "better" to "well." Both "good" (from NLTK) and "well" are
contextually correct.
Summary
Stemming: Uses heuristic rules to strip suffixes (e.g., "cars" to "car," "revolution" to
"revolut").
Lemmatization: Uses linguistic knowledge to map words to their base form (e.g., "better" to
"good").
Stemming is faster but less accurate, whereas lemmatization is more accurate but slower and
requires part-of-speech tagging to determine the correct base form of a word.
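The contrast in the summary can be illustrated without any library. The sketch below is a toy, not the Porter algorithm or the WordNet lemmatizer: SUFFIXES and LEMMAS are small assumed tables made up for this example. It shows that suffix stripping needs no linguistic knowledge, while dictionary lookup needs the part of speech.

```python
# Assumed toy suffix list, checked in this order (not the Porter rule set)
SUFFIXES = ["ion", "ing", "ed", "s"]


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Assumed toy lemma dictionary keyed by (word, part of speech)
LEMMAS = {
    ("better", "adj"): "good",
    ("better", "adv"): "well",
    ("cars", "noun"): "car",
}


def toy_lemmatize(word, pos):
    """Look up the dictionary form; fall back to the word itself."""
    return LEMMAS.get((word, pos), word)


print(toy_stem("cars"), toy_stem("revolution"))
print(toy_lemmatize("better", "adj"), toy_lemmatize("better", "adv"))
```

The two lookups for "better" mirror the NLTK and spaCy results discussed above: the lemma depends on whether the word is treated as an adjective or an adverb.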
12) Define phrase structure grammar and use its rules to draw parse trees for the sentences
below.
I. I saw the man on the hill with the telescope
II. The house at the end of the street is red
Ans|
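As a rough sketch for sentence II, the phrase structure rules and the resulting parse tree can be written down and checked in code. The grammar below is a toy assumed for this example (it is not a full English grammar, and word-level rules such as Det -> the are left implicit). Sentence I is structurally ambiguous because its prepositional phrases can attach in more than one place, so it has several valid trees and is not covered here.

```python
# Phrase structure rules (an illustrative toy grammar)
RULES = {
    ("S", ("NP", "VP")),
    ("NP", ("Det", "N")),
    ("NP", ("Det", "N", "PP")),
    ("PP", ("P", "NP")),
    ("VP", ("V", "Adj")),
}

# Parse tree as nested tuples: (label, child, ...); a leaf is (POS tag, word)
TREE = ("S",
        ("NP", ("Det", "The"), ("N", "house"),
         ("PP", ("P", "at"),
          ("NP", ("Det", "the"), ("N", "end"),
           ("PP", ("P", "of"),
            ("NP", ("Det", "the"), ("N", "street")))))),
        ("VP", ("V", "is"), ("Adj", "red")))


def is_leaf(node):
    return len(node) == 2 and isinstance(node[1], str)


def leaves(node):
    """Return the words of the tree from left to right."""
    if is_leaf(node):
        return [node[1]]
    return [word for child in node[1:] for word in leaves(child)]


def follows_rules(node):
    """Check that every internal node expands according to RULES."""
    if is_leaf(node):
        return True
    rhs = tuple(child[0] for child in node[1:])
    return (node[0], rhs) in RULES and all(follows_rules(c) for c in node[1:])


def bracketed(node):
    """Render the tree in bracketed notation."""
    if is_leaf(node):
        return f"({node[0]} {node[1]})"
    return "(" + node[0] + " " + " ".join(bracketed(c) for c in node[1:]) + ")"


print(bracketed(TREE))
print(" ".join(leaves(TREE)), "| valid:", follows_rules(TREE))
```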
2nd Assignment:
1. Explain in detail the techniques of Vector Space Model (VSM) and list the pros and cons.
• Helps convert raw text into numerical form for ML and NLP models.
• The dimensions of the vector space are defined by a vocabulary (list of unique terms).
• It forms the foundation for more advanced models like Word2Vec and BERT.
Pros of VSM:
2. Explain in detail the techniques of Bag of Words (BoW) and list the pros and cons.
• BoW is similar to one-hot encoding but with word counts instead of binary flags.
• Works well when combined with ML models like Naive Bayes or SVM.
• Example:
Pros of BoW:
Cons of BoW:
• Does not consider context (e.g., "bank of a river" vs "money in the bank").
3. Explain in detail the techniques of Bag of n-grams (BoN) and list the pros and cons.
• Breaks down text into contiguous sequences of 'n' words, called n-grams.
• Examples:
• Allows the model to distinguish between similar sentences with different meanings.
• Can be implemented easily using tools like CountVectorizer with ngram_range in Python.
• Common practice: use bigrams or trigrams for balance between context and complexity.
• Helps distinguish between phrases like "not good" and "very good".
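The bag of n-grams idea above can be sketched with the standard library. This is a hand-rolled stand-in for illustration (the function names are chosen here); in practice CountVectorizer with ngram_range, as mentioned above, does the same job.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous sequences of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(text, n):
    """Count the n-grams of a whitespace-tokenized, lowercased text."""
    return Counter(ngrams(text.lower().split(), n))


a = bag_of_ngrams("the food is not good", 2)
b = bag_of_ngrams("the food is very good", 2)
print(a)
print(b)
```

With unigrams, both sentences contain "good"; with bigrams, only the first contains ("not", "good"), so the two are easier to tell apart.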
4. Explain in detail the techniques of TF-IDF and list the pros and cons.
• Improves on Bag of Words by reducing the weight of common, less informative words.
• TF-IDF combines two components: Term Frequency (TF) and Inverse Document Frequency (IDF).
• Term Frequency (TF) = (Number of times term t appears in document d) ÷ (Total terms in document
d).
• High TF-IDF means the word is frequent in a specific document but rare across other documents.
• Common words like "the", "is", "and" get low IDF scores, reducing their overall impact.
• Creates a vector representation for each document using TF-IDF values of its terms.
• These vectors are used in tasks like document similarity, classification, and clustering.
• Example:
• TF-IDF does not preserve word order, but it improves on raw frequency.
Pros of TF-IDF:
Cons of TF-IDF:
5. Explain word embeddings and briefly describe the two variants of word embeddings.
• They capture the meaning of words based on their context in large text corpora.
• Built on the distributional hypothesis: Words used in similar contexts have similar meanings.
• Unlike sparse methods (BoW, TF-IDF), embeddings are compact and efficient.
• Enable downstream NLP tasks like sentiment analysis, translation, and question answering.
• Word embeddings are learned from large datasets using neural networks.
• Can be visualized using tools like t-SNE or PCA to show clustering of similar words.
2. SkipGram:
• Takes the center word as input, predicts multiple context words as output.
• Slightly slower than CBOW but more effective for small datasets.
6. What are the key differences between CBOW and Skip Gram models in the context of word
embeddings?
Aspect | CBOW | Skip-Gram
Data Efficiency | More efficient with frequent words | Better at learning rare word representations
Embedding Update Style | Averages or sums context word vectors | Uses center word vector for multiple predictions
Output Layer Complexity | Single prediction (center word) | Multiple predictions (context words)
Use Case Preference | Suitable for large corpora with frequent words | Suitable for smaller corpora or rare words
Semantic Relationships | Good, but less detailed | Captures richer semantic relationships
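The difference in how the two models frame the prediction task can be seen from the training examples each one generates. The sketch below only builds those examples (it does not train embeddings); the function names and the window size of 1 are assumptions for this illustration.

```python
def skipgram_pairs(tokens, window=1):
    """Skip-Gram: one (center word, context word) pair per context position."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=1):
    """CBOW: one (list of context words, center word) example per position."""
    examples = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        if context:
            examples.append((context, center))
    return examples


tokens = ["dog", "bites", "man"]
print(skipgram_pairs(tokens))
print(cbow_examples(tokens))
```

Skip-Gram produces several predictions per center word, while CBOW produces a single prediction from the combined context, which matches the "Output Layer Complexity" row of the table.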
7. Consider the training corpus with 4 sentences (documents). Apply VSM, BoW, and TF-IDF modeling
techniques to find the vector representation of the text.
D1 – Dog bites man
D2 – Man bites dog
D3 – Dog eats meat
D4 – Man eats food
VSM is a general framework where documents are represented as vectors in a high-dimensional space.
BoW and TF-IDF are specific implementations of VSM.
Similarity between documents can be calculated using cosine similarity or Euclidean distance.
Document Vector Representation (using vocabulary: dog, bites, man, eats, meat, food)
D1 [1, 1, 1, 0, 0, 0]
D2 [1, 1, 1, 0, 0, 0]
D3 [1, 0, 0, 1, 1, 0]
D4 [0, 0, 1, 1, 0, 1]
This is the basic VSM using raw frequency counts, which is exactly the Bag of Words representation.
Document dog bites man eats meat food BoW Vector
D1 1 1 1 0 0 0 [1, 1, 1, 0, 0, 0]
D2 1 1 1 0 0 0 [1, 1, 1, 0, 0, 0]
D3 1 0 0 1 1 0 [1, 0, 0, 1, 1, 0]
D4 0 0 1 1 0 1 [0, 0, 1, 1, 0, 1]
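The BoW vectors in the table, and the cosine similarity mentioned earlier, can be reproduced with a short standard-library sketch. The variable and function names are chosen for this example.

```python
import math

corpus = {
    "D1": "Dog bites man",
    "D2": "Man bites dog",
    "D3": "Dog eats meat",
    "D4": "Man eats food",
}

# Vocabulary in order of first appearance: dog, bites, man, eats, meat, food
vocab = []
for text in corpus.values():
    for word in text.lower().split():
        if word not in vocab:
            vocab.append(word)

# Bag of Words: count of each vocabulary word per document
bow = {doc: [text.lower().split().count(word) for word in vocab]
       for doc, text in corpus.items()}


def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


for doc, vector in bow.items():
    print(doc, vector)
print("cos(D1, D2) =", round(cosine(bow["D1"], bow["D2"]), 3))
print("cos(D1, D3) =", round(cosine(bow["D1"], bow["D3"]), 3))
```

D1 and D2 get identical vectors even though they mean different things, which shows that BoW ignores word order.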
TF-IDF Representation
$\text{IDF}(\text{term}) = \log_2\left(\frac{N}{df}\right)$
Term df IDF
dog 3 0.415
bites 2 1.000
man 3 0.415
eats 2 1.000
meat 1 2.000
food 1 2.000
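The IDF values above, and the resulting TF-IDF vectors, can be checked with a small sketch. It assumes TF is the term count divided by the number of terms in the document and IDF is the base-2 logarithm of N/df, as defined in these notes; library implementations such as scikit-learn's TfidfVectorizer use a different smoothing and normalization, so their numbers will differ.

```python
import math

docs = {
    "D1": "Dog bites man",
    "D2": "Man bites dog",
    "D3": "Dog eats meat",
    "D4": "Man eats food",
}
tokenized = {name: text.lower().split() for name, text in docs.items()}
vocab = ["dog", "bites", "man", "eats", "meat", "food"]

n_docs = len(tokenized)
# Document frequency: number of documents that contain the term
df = {term: sum(term in tokens for tokens in tokenized.values()) for term in vocab}
idf = {term: math.log2(n_docs / df[term]) for term in vocab}


def tfidf_vector(tokens):
    """TF-IDF vector over vocab, with TF = count / document length."""
    return [round(tokens.count(term) / len(tokens) * idf[term], 3) for term in vocab]


for term in vocab:
    print(term, df[term], round(idf[term], 3))
for name, tokens in tokenized.items():
    print(name, tfidf_vector(tokens))
```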
I am a human
I am not a stone
I live in Ballari
Text representation is the process of transforming raw text into a numerical form (numeric vectors) so that it
can be processed by machine learning algorithms. The Vector Space Model (VSM) is a foundational concept
where text units are represented as vectors of numbers.
The Bag of N-Grams (BoN) approach is a text representation scheme that addresses the limitation of earlier
methods (like Bag of Words) by capturing some context and word-order information.
Bigram Count
(i, am) 2
(am, a) 1
(a, human) 1
(am, not) 1
(not, a) 1
(a, stone) 1
(i, live) 1
(live, in) 1
(in, ballari) 1
Unigram Count
i 3
am 2
a 2
human 1
not 1
stone 1
live 1
in 1
ballari 1
4. Calculating Bigram Probabilities
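A minimal sketch of this step, assuming the usual maximum-likelihood estimate P(w2 | w1) = count(w1, w2) / count(w1), with no smoothing and no sentence-boundary markers (the same convention as the count tables above):

```python
from collections import Counter

sentences = ["I am a human", "I am not a stone", "I live in Ballari"]

unigrams = Counter()
bigrams = Counter()
for sentence in sentences:
    tokens = sentence.lower().split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))


def bigram_probability(prev, word):
    """Estimate P(word | prev) from the counts; 0.0 if prev was never seen."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]


for prev, word in [("i", "am"), ("am", "a"), ("a", "human"), ("i", "live")]:
    print(f"P({word} | {prev}) = {bigram_probability(prev, word):.3f}")
```

For example, "i" occurs 3 times and is followed by "am" twice, so P(am | i) = 2/3.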
9. What is Text Classification? Steps to build Text Classification and its applications.
• Text classification is an NLP task of assigning categories to textual data (like sentences, documents,
reviews).
• It can be:
• Collect labeled text data (e.g., emails with spam/not spam labels).
o Data augmentation,
o Active learning.
o Stemming/lemmatization,
• Algorithms used:
• Evaluate using:
10. With code snippet explain the classification modeling using Naive Bayes classifier.
• Naive Bayes classifier applies Bayes’ theorem assuming feature independence, predicting the class
with highest posterior probability.
• Commonly used in text classification as a baseline model due to simplicity and efficiency.
• Pipeline steps:
Code snippet
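import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

def clean(text):
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)  # assumed cleaning step: keep only letters and whitespace
    return text

# Example dataset
X = ["I love this product", "This is an amazing book", "I hate this movie", "This movie is terrible"]
y = [1, 1, 0, 0]  # assumed labels: 1 = positive, 0 = negative
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# Vectorize
vect = CountVectorizer(preprocessor=clean)
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)
# Train the classifier and predict on the test set
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred = nb.predict(X_test_dtm)
print("Accuracy:", accuracy_score(y_test, y_pred))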
11. With code snippet explain the classification modeling using Logistic Regression.
• Logistic Regression is a discriminative classifier that models the probability distribution over classes.
• It learns weights for features, aiming to find a linear decision boundary to separate classes.
Code Snippet
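from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
logreg = LogisticRegression()  # X_train_dtm and X_test_dtm come from the vectorization step
logreg.fit(X_train_dtm, y_train)
y_pred_class = logreg.predict(X_test_dtm)
# Evaluate accuracy
print("Accuracy:", accuracy_score(y_test, y_pred_class))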
12. With code snippet explain the classification modeling using SVM.
• Can handle non-linear boundaries using kernel tricks (though here we use a linear SVM).
1. Use CountVectorizer to convert text into a document-term matrix with a limited number of features
(max_features=1000 to reduce complexity).
4. Calculate accuracy.
Code Snippet
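from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC
# Step 1: Vectorize text with max 1000 features to limit training time
vect = CountVectorizer(max_features=1000)
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)
classifier = LinearSVC()  # linear SVM
classifier.fit(X_train_dtm, y_train)
y_pred_class = classifier.predict(X_test_dtm)
print("Accuracy:", accuracy_score(y_test, y_pred_class))  # Step 4: calculate accuracy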
• Output layer size = number of classes; uses softmax activation for multi-class classification.
• Embedding layer maps words to dense vectors (can be pre-trained or trained from scratch).
14. With code snippet explain the classification modeling using LSTM.
• LSTMs are a type of Recurrent Neural Network (RNN) specialized for sequential data.
• They capture long-range dependencies in text, remembering context from earlier words.
• The model includes an embedding layer followed by an LSTM layer and dense output layer.
• Dropout and recurrent dropout help reduce overfitting.
• Use binary crossentropy loss for binary classification and Adam optimizer.
• LSTMs generally take longer to train and need more data than CNNs.
15. Explain the case study on Corporate Ticketing using block diagram.
The main aim of this system is to automatically classify customer support tickets related to medical issues and
route them to the appropriate team starting with no labeled training data.
▪ Pre-trained services used to label and classify text into general or medical categories.
o Weak Supervision:
o Explicit Feedback:
▪ Domain experts correct misrouted tickets (e.g., medical team rejects irrelevant cases).
o Implicit Feedback:
o These new labels improve the dataset and retrain the model.
1. With a clear diagram, illustrate the general pipeline of Information Extraction (IE) and explain
each task depicted in the diagram.
• IE is a process used to extract structured information (like names, events, relations) from
unstructured text.
• It needs detailed NLP processing, more than what's required for simple text classification.
• The steps include breaking down the text, identifying key information, and understanding sentence
structure.
• IE uses evaluation metrics like precision, recall, and F1-score. The pipeline is flexible—not all tasks
are required every time.
1. Sentence Segmentation
• Identifies important phrases from the text that summarize its meaning.
• Often uses POS tags to find nouns and noun phrases.
6. Syntactic Parsing
7. Entity Disambiguation
8. Coreference Resolution
• Goal: Detect and classify named entities in text (e.g., Person, Organization, Location, Date, Money).
• Use: Preprocessing step for many IE tasks like summarization, QA, MT.
• Example: “Steve Jobs founded Apple” → Steve Jobs (Person), Apple (Organization).
• Goal: Assign a unique identity to detected entities by linking them to knowledge bases.
• Example: “Sundar Pichai is the CEO of Google” → (Sundar Pichai, CEO, Google)
5. Event Extraction
• Example: “Elon Musk launched Starship on Monday” → Event: Launch; Actor: Elon Musk; Object:
Starship; Time: Monday
7. Template Filling
• Goal: Extract structured data from semi-structured text by filling predefined slots.
1. KPE is a task in Information Extraction (IE) under Natural Language Processing (NLP).
2. Its purpose is to extract important words or phrases that summarize the main idea of a text.
4. Common applications include search engine indexing, document tagging, recommendation systems,
and text summarization.
6. KPE methods are mainly divided into supervised and unsupervised techniques.
7. Supervised methods use labeled datasets but require manual annotation, which is costly and time-
consuming.
8. Unsupervised methods are more practical as they work without labeled data and are domain-
independent.
9. Graph-based algorithms like TextRank and SGRank are popular unsupervised techniques.
10. These algorithms treat words/phrases as nodes and rank them based on frequency and connection
strength.
11. Tools like textacy (built on spaCy) and gensim provide implementations for KPE.
12. KPE faces challenges such as overlapping keyphrases and length sensitivity in long documents.
13. Post-processing and applying custom filters or heuristics can improve the final output quality.
14. It’s important to clean and structure the text properly before extraction to get accurate results.
15. In real-world projects, KPE often works best when combined with domain-specific rules.
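The graph-based idea in points 9 and 10 can be sketched as a simplified TextRank over single words. This is only an illustration and not the textacy or gensim implementation: the stopword list, the co-occurrence window, the damping factor, and the function names are assumptions, and real systems rank multi-word phrases rather than single words.

```python
import re

# Assumed minimal stopword list for this illustration
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with"}


def build_graph(tokens, window=2):
    """Connect each word to the words that follow it within `window` positions."""
    neighbors = {token: set() for token in tokens}
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    return neighbors


def textrank(tokens, window=2, damping=0.85, iterations=50):
    """Rank words with PageRank-style updates on the co-occurrence graph."""
    graph = build_graph(tokens, window)
    n = len(graph)
    scores = {word: 1.0 / n for word in graph}
    for _ in range(iterations):
        scores = {
            word: (1 - damping) / n
            + damping * sum(scores[u] / len(graph[u]) for u in graph[word])
            for word in graph
        }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def keywords(text, top_k=3):
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]
    return [word for word, _ in textrank(tokens)[:top_k]]


print(keywords("Keyphrase extraction finds the important phrases in a text, "
               "and graph ranking scores the phrases in the text graph."))
```

Words that co-occur with many other words collect score from their neighbors, which is the "connection strength" idea described above.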
• NER stands for Named Entity Recognition, which is the process of identifying names of people,
organizations, locations, etc., in a given text.
• A simple way to build an NER system is by using a gazetteer, which is a list of known names (e.g., clients,
cities, companies).
• Gazetteer-based NER works by checking if a word appears in the list – if it does, it's tagged as a named
entity.
• The limitation of this method is that it doesn’t handle new names, name variations (e.g., USA vs. United
States), or context.
• A rule-based NER system uses patterns like word types and POS tags to detect entities. Example: If a word
tagged as a proper noun appears before "was born", it's likely a person.
• Libraries like Stanford’s RegexNER and spaCy’s EntityRuler help build rule-based NER systems.
• A more powerful method is training an ML model to detect named entities based on context and word
features.
• NER is a sequence labeling problem, where the label of one word depends on its surrounding words.
• Sequence classifiers, like Conditional Random Fields (CRF), are commonly used to train NER models.
• For training, we use labeled datasets like CoNLL-03, where each word is tagged with a label (like B-PER,
I-LOC, etc.).
1. B = Beginning of entity
2. I = Inside entity
3. O = Outside or not an entity
• Useful features include: if the word starts with a capital letter, its POS tag, and the POS tags of nearby
words.
• CRF models trained with such features can achieve high accuracy, like an F1 score of 0.92 in experiments.
• In real-world use, NER systems combine ML models, gazetteers, and rules for better results, since new
entities and domain-specific terms appear frequently.
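The gazetteer lookup and the B/I/O labeling scheme described above can be combined in a small sketch. The GAZETTEER entries and the function name are made up for this illustration; the matcher prefers the longest entry starting at each position.

```python
# Hypothetical gazetteer: lowercased token tuples mapped to entity types
GAZETTEER = {
    ("steve", "jobs"): "PER",
    ("apple",): "ORG",
    ("united", "states"): "LOC",
    ("usa",): "LOC",
}


def gazetteer_ner(tokens):
    """Tag tokens with B-/I-/O labels using longest-match gazetteer lookup."""
    tags = ["O"] * len(tokens)
    lowered = [token.lower() for token in tokens]
    max_len = max(len(entry) for entry in GAZETTEER)
    i = 0
    while i < len(tokens):
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            label = GAZETTEER.get(tuple(lowered[i:i + length]))
            if label:
                tags[i] = "B-" + label
                for j in range(i + 1, i + length):
                    tags[j] = "I-" + label
                i += length
                break
        else:
            i += 1  # no gazetteer entry starts here
    return tags


for sentence in ["Steve Jobs founded Apple", "Tim Cook visited Paris"]:
    tokens = sentence.split()
    print(list(zip(tokens, gazetteer_ner(tokens))))
```

The second sentence gets only O tags because neither name is in the list, which illustrates the limitation with new names noted above.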
1. Named Entity Disambiguation (NED) means assigning a unique real-world identity to an entity
mentioned in text.
4. NEL links entities to knowledge bases like Wikipedia or Google Knowledge Graph.
5. It is useful in applications like search engines, chatbots, and question answering systems.
6. NEL helps build large knowledge bases by connecting people, places, organizations, etc.
9. Example: “Lincoln” could mean a car, a person, or a city – NEL finds the correct one.
10. It usually needs coreference resolution (e.g., "Einstein", "the scientist" = same person).
12. NEL is often done using supervised ML models, evaluated with precision, recall, F1-score.
13. Neural network-based approaches are commonly used in modern NEL systems.
14. Companies often use cloud services like Azure or IBM Watson for NEL instead of building from
scratch.
15. Quality of earlier NLP steps (cleaning, parsing) directly affects NEL performance.
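The "Lincoln" example can be illustrated with a toy disambiguation sketch. Everything here is an assumption for illustration: the tiny `KNOWLEDGE_BASE`, its keyword sets, and the `link_entity` helper are invented, and the keyword-overlap scoring is a simple stand-in for the supervised or neural models that real NEL systems use.

```python
# Toy knowledge base: every entry here is invented for illustration.
KNOWLEDGE_BASE = {
    "Lincoln": [
        {"id": "Lincoln_(person)", "keywords": {"president", "abraham", "speech", "war"}},
        {"id": "Lincoln_(car)", "keywords": {"car", "drive", "luxury", "engine"}},
        {"id": "Lincoln_(city)", "keywords": {"city", "nebraska", "capital", "visit"}},
    ]
}


def link_entity(mention, sentence):
    """Pick the candidate whose keywords overlap most with the sentence."""
    context = {word.strip(".,!?").lower() for word in sentence.split()}
    candidates = KNOWLEDGE_BASE.get(mention, [])
    if not candidates:
        return None  # mention is not in the knowledge base
    best = max(candidates, key=lambda c: len(c["keywords"] & context))
    return best["id"]


print(link_entity("Lincoln", "Lincoln gave a famous speech during the war."))
print(link_entity("Lincoln", "She bought a Lincoln with a powerful engine."))
```

The same mention is linked to different entries depending on the words around it, which is the core idea of disambiguation.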
1. Relation Extraction (RE) is a task in Information Extraction (IE) that identifies and extracts relationships
between entities in a text.
2. It helps build knowledge bases by connecting people, organizations, events, and more using extracted
information.
3. RE is used in search engines, question answering systems, and financial/medical analysis by connecting
entities meaningfully.
4. For example, in the sentence "Luca Maestri is Apple’s finance chief", RE extracts the relation (Luca
Maestri, finance chief, Apple).
5. RE is more complex than Named Entity Recognition (NER) because it involves understanding the context
and connection between entities.
1. Pattern-based Approach
• Uses manually created rules or templates (like regular expressions) to find specific relations.
• Example: If a sentence says "X, the CEO of Y", it extracts a relation like (X, CEO, Y).
2. Supervised Approach
• Uses machine learning or neural networks trained on features like context words and syntax.
3. Bootstrapping (Semi-supervised) Approach
• Starts with a few seed patterns or examples, then learns new patterns from data.
• Example: Start with one known pattern for "CEO of", then discover more similar phrases.
4. Distant Supervision
• Uses existing knowledge bases like Wikipedia or Freebase to automatically label data.
• For example, if a database says "Elon Musk is the CEO of Tesla", and a sentence has both entities, it marks
that sentence as a training example of that relation.
5. Unsupervised Approach
• Flexible and broad, but hard to map results to fixed relation types.
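Two of the approaches above, pattern-based extraction and distant supervision, can be sketched with the standard `re` module. This is a minimal sketch: the regular expression, the `KNOWN_FACTS` list, and both function names are assumptions made for this example, and the pattern only handles simple capitalized names.

```python
import re

# Pattern-based: "X, the CEO of Y" -> (X, CEO, Y)
CEO_PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*), the CEO of (?P<org>[A-Z][A-Za-z]+)"
)


def extract_ceo_relations(text):
    """Return (person, "CEO", organization) triples matched by the pattern."""
    return [(m.group("person"), "CEO", m.group("org")) for m in CEO_PATTERN.finditer(text)]


# Distant supervision: facts from a knowledge base label sentences automatically.
KNOWN_FACTS = [("Elon Musk", "CEO", "Tesla")]


def label_with_distant_supervision(sentences, facts):
    """Label a sentence with a relation if it mentions both entities of a fact."""
    labeled = []
    for sentence in sentences:
        for subject, relation, obj in facts:
            if subject in sentence and obj in sentence:
                labeled.append((sentence, relation))
    return labeled


print(extract_ceo_relations("Elon Musk, the CEO of Tesla, spoke today."))
print(label_with_distant_supervision(
    ["Elon Musk unveiled a new Tesla model.", "Tesla shares rose today."],
    KNOWN_FACTS,
))
```

Note that the distantly labeled sentence does not actually state the CEO relation. Labels produced this way are noisy, which is the usual trade-off of distant supervision.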
2. The three main types of chatbots are: FAQ bots, Flow-based bots, and Open-ended bots.
1. FAQ Bots
4. Each user query is treated independently, without context from earlier conversation.
5. These bots are useful for retrieving direct answers, like customer service FAQs.
6. They can handle slightly varied user inputs for the same question.
2. Flow-Based Bots
7. Flow-based bots are more interactive and guided than FAQ bots.
10. Common use cases include order placing (e.g., pizza ordering bots).
3. Open-Ended Bots
13. These bots can switch topics easily and are not goal-driven.
Broader Classification
14. Broadly, chatbots are either goal-oriented (e.g., FAQ, flow-based) or chitchat-based (e.g., open-
ended).
15. Goal-oriented bots are domain-specific, while chitchat bots aim for open-domain, natural
conversation.
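A FAQ bot of the kind described above can be sketched as a lookup over stored question and answer pairs. This is a toy illustration: the `FAQS` entries, the word-overlap matching, and the `min_overlap` threshold are all assumptions. Each query is answered on its own, with no memory of earlier turns.

```python
# Invented FAQ entries for illustration.
FAQS = {
    "What are your opening hours?": "We are open 9am to 6pm, Monday to Friday.",
    "How can I track my order?": "Use the tracking link in your confirmation email.",
}


def tokenize(text):
    """Lowercase the text and return its set of words without punctuation."""
    return {word.strip("?.,!").lower() for word in text.split()}


def answer(query, min_overlap=2):
    """Return the answer of the FAQ that shares the most words with the query."""
    query_words = tokenize(query)
    best_question = max(FAQS, key=lambda q: len(tokenize(q) & query_words))
    if len(tokenize(best_question) & query_words) < min_overlap:
        return "Sorry, I don't know the answer to that."
    return FAQS[best_question]


print(answer("when are your opening hours"))
print(answer("track my order please"))
print(answer("hello there"))
```

Because matching is based on shared words, slightly different phrasings of the same question still reach the same answer.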
8. Explain in detail the pipeline for building dialogue systems and components of dialog system.
This pipeline shows how a dialog system (like Siri, Alexa, or Google Assistant) processes your voice input
and responds back in natural language.
1. Speech Recognition (ASR - Automatic Speech Recognition)
o Converts spoken words into text.
o Example: You say “What’s the weather today?” → It becomes the text “What’s the weather today?”
2. Natural Language Understanding (NLU)
o Understands the meaning and intent behind the text.
o It identifies the user's intent (e.g., asking about the weather) and extracts important information (like
location or date).
3. Dialog Manager
o It keeps track of the conversation, decides what to do next, and communicates with the Task Manager.
4. Task Manager
o Performs the actual task requested by the user (e.g., checking weather, setting alarms, fetching facts).
o Example: After the task completes, the system says out loud: “The weather today is sunny with a high of 27°C.”
• Analogy (NLU): Like a friend who not only hears your words but understands what you want.
• Analogy (Dialog Manager): Like a smart receptionist who remembers what you said and responds accordingly.
4. Task Manager
• Analogy: Like a worker who performs the job the receptionist assigns.
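The flow from understanding to task execution can be sketched as a few small functions. This is a simplified sketch under stated assumptions: typed text stands in for speech (so ASR and speech synthesis are omitted), the keyword rules in `understand` are toy rules, and the weather data in `run_task` is made up.

```python
def understand(text):
    """NLU: detect the intent and extract a location slot (toy keyword rules)."""
    words = text.lower().strip("?.!").split()
    intent = "get_weather" if "weather" in words else "unknown"
    location = words[words.index("in") + 1] if "in" in words[:-1] else None
    return {"intent": intent, "location": location}


def run_task(intent, slots):
    """Task manager: perform the task. The weather data here is made up."""
    fake_weather = {"pune": "sunny"}
    return fake_weather.get(slots["location"], "not available")


class DialogManager:
    """Keeps track of the conversation and decides what to do next."""

    def __init__(self):
        self.state = {"intent": None, "location": None}

    def next_action(self, nlu_result):
        # Keep earlier values when the new turn does not supply them.
        for key, value in nlu_result.items():
            if value is not None and value != "unknown":
                self.state[key] = value
        if self.state["intent"] is None:
            return "ask_intent"
        if self.state["location"] is None:
            return "ask_location"
        return "call_task_manager"


def respond(manager, user_text):
    """Run one turn: NLU, dialog manager, task manager, then a text reply."""
    action = manager.next_action(understand(user_text))
    if action == "ask_intent":
        return "How can I help you?"
    if action == "ask_location":
        return "Which city?"
    result = run_task(manager.state["intent"], manager.state)
    return f"The weather in {manager.state['location'].title()} is {result}."


manager = DialogManager()
print(respond(manager, "What's the weather today?"))
print(respond(manager, "in Pune"))
```

The second turn only supplies the city, yet the bot can answer because the dialog manager remembered the intent from the first turn.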
o Goal-Oriented Dialogues
o Chitchats
1. Goal-Oriented Dialogues
2. These chatbots are designed to help users complete specific tasks (e.g., booking, ordering,
recommending).
6. Due to this specificity, they may face scalability and generalisability issues.
8. Modern approaches (e.g., Facebook’s research) try to use end-to-end training methods to improve
goal-oriented chatbots.
2. Chitchats
9. These chatbots are meant for free-form, open-ended conversation, often used for entertainment or
emotional support.
10. They don’t follow a fixed goal or task; instead, they engage on a variety of topics.
11. Future use cases include healthcare (e.g., mental support) and addressing loneliness, especially in the
elderly.
13. Tech giants like Amazon, Apple, and Google are heavily investing in improving chitchat bots, but
lack of natural datasets remains a big hurdle.
1. Dialogue act classification identifies the intent or purpose behind a user's message in a conversation.
2. It helps determine what the user wants, enabling the chatbot to respond appropriately.
4. For example, "I want to order pizza" may be classified as orderPizza intent.
5. Utterances like "Are you going to school today?" are classified as yes/no questions.
6. These dialogue acts/intents are pre-defined and based on the chatbot's domain.
10. CNNs are used to capture local text patterns (like word n-grams) for intent prediction.
11. Pre-trained models like BERT are highly accurate for this task, achieving over 98% accuracy on
datasets like ATIS.
12. These models use deep contextual understanding to predict the correct dialog act.
13. Labelled training data is essential for building custom systems from scratch.
14. Commercial tools like Google Dialogflow or Microsoft LUIS offer easy-to-use, off-the-shelf intent
classifiers.
15. Accurate intent classification depends on earlier steps, like speech-to-text; errors here can reduce
performance.
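Intent classification from labelled data can be sketched with a tiny Naive Bayes classifier in plain Python. This is a simple stand-in for the CNN or BERT models mentioned above, not a replica of them. The six training utterances and the intent names are invented for this example.

```python
import math
from collections import Counter, defaultdict

# Invented labelled utterances (text, intent).
TRAINING_DATA = [
    ("i want to order pizza", "orderPizza"),
    ("can i get a large pizza", "orderPizza"),
    ("order a pizza for me", "orderPizza"),
    ("are you going to school today", "yesNoQuestion"),
    ("is the store open today", "yesNoQuestion"),
    ("are you open now", "yesNoQuestion"),
]


def train(examples):
    """Count words per intent and how often each intent occurs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def classify(text, model):
    """Return the intent with the highest Naive Bayes log-probability."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / total)
        n_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Laplace smoothing so unseen words do not zero out the score.
            score += math.log((word_counts[label][word] + 1) / (n_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAINING_DATA)
print(classify("i would like to order a pizza", model))
print(classify("are you open today", model))
```

The intents are fixed in advance by the training labels, which matches the point that dialogue acts are pre-defined for the chatbot's domain.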
1. Response generation is the final step in a dialogue system, where the system formulates and delivers
a reply.
2. It is based on the intent, slots, and dialogue context passed by the dialogue manager.
Fixed Responses
4. Used in simple FAQ bots, where each intent maps to a predefined response.
Template-Based Generation
7. Responses are generated using sentence templates filled with slot values.
Automatic Generation
11. Uses deep learning approaches such as seq2seq models or reinforcement learning for dynamic, fluent replies.
12. These models generate responses from scratch based on conversation state.
13. While flexible, they may lack factual accuracy and are harder to control.
15. A shortage of high-quality conversational datasets and evaluation difficulties remain major
challenges.
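Fixed and template-based responses are simple enough to sketch directly. The intent names, the response strings, and the `generate_response` helper are assumptions made for this example. Automatic generation is not shown because it needs a trained model.

```python
# Fixed responses: each intent maps to one predefined reply.
FIXED_RESPONSES = {"greeting": "Hello! How can I help you?"}

# Templates: placeholders are filled with slot values from the dialogue manager.
TEMPLATES = {"orderPizza": "Your {size} {food} will arrive in {minutes} minutes."}


def generate_response(intent, slots=None):
    """Pick a fixed reply, fill a template, or fall back to a default."""
    if intent in FIXED_RESPONSES:
        return FIXED_RESPONSES[intent]
    if intent in TEMPLATES:
        # Raises KeyError if a slot the template needs is missing.
        return TEMPLATES[intent].format(**(slots or {}))
    return "Sorry, I did not understand that."


print(generate_response("greeting"))
print(generate_response("orderPizza", {"size": "medium", "food": "pizza", "minutes": 30}))
```

Templates keep the wording under the designer's control while still adapting to each user's slot values.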
1. The end-to-end approach replaces traditional modular chatbot design with a single trainable model.
3. The model takes the entire user input (sequence of words) and directly generates the bot's response
(another word sequence).
4. Unlike modular systems, it does not require separate modules for NLU, dialogue management, and
NLG.
6. This approach simplifies training by eliminating the need for annotated datasets for individual
components.
7. Transformer models (e.g., GPT, BERT variants) are now widely used over older LSTM-based
seq2seq models.
8. These models effectively capture context and token order, improving natural language
understanding.
9. One limitation is their tendency to produce generic responses like “I don’t know.”
10. To address this, deep reinforcement learning can be used to train the model to give goal-oriented
replies.
11. End-to-end models often have large numbers of parameters, making them computationally
expensive.
13. These models may also generate factually incorrect or inconsistent responses, affecting real-world
usability.
14. A hybrid approach combining end-to-end generation with human supervision can improve reliability.
15. Despite challenges, the end-to-end method is a powerful tool for building natural, fluent, and open-
domain chatbots.
13. What are chatbots? What are their benefits and applications?
1. Chatbots are AI-based systems that interact with users using natural language (text or speech).
2. Their main goal is to understand user input and provide relevant responses.
3. Natural Language Processing (NLP) is central to how chatbots understand and generate language.
5. Goal-oriented bots help users complete specific tasks like placing orders or booking tickets.
6. Chitchat bots are for open-ended, casual conversations, useful in entertainment or emotional support.
7. A key benefit is enabling hands-free, voice-based interaction, removing the need for screens or
keyboards.
8. Chatbots became more popular due to smartphones and advances in Machine Learning (ML) and
Deep Learning (DL).
9. Tools like Dialogflow help even non-experts to easily create chatbots using cloud APIs.
10. Dialog act classification detects the intent behind the user’s message (e.g., asking a question or
placing an order).
11. Slot identification extracts specific details or entities related to the intent (e.g., size: medium, food:
pizza).
o Fixed responses
o Templates
13. Chatbots are used in e-commerce, news discovery, and customer service (e.g., FAQs, order updates).
14. Other applications include healthcare (e.g., mental health support bots like Woebot) and legal services (e.g.,
answering basic legal queries).
15. Hybrid systems (chatbot + human review) are recommended for complex or sensitive tasks, ensuring
both accuracy and reliability.
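Slot identification, as in the "size: medium, food: pizza" example above, can be sketched with a small lookup of known values. The `SLOT_VALUES` lists and the `extract_slots` helper are assumptions for this illustration. Practical systems usually treat slot filling as a sequence labeling task instead of a word lookup.

```python
# Known values for each slot (invented for this example).
SLOT_VALUES = {
    "size": {"small", "medium", "large"},
    "food": {"pizza", "burger", "pasta"},
}


def extract_slots(utterance):
    """Return the first known value found in the utterance for each slot."""
    words = [w.strip(".,!?").lower() for w in utterance.split()]
    slots = {}
    for slot, values in SLOT_VALUES.items():
        for word in words:
            if word in values:
                slots[slot] = word
                break
    return slots


print(extract_slots("I want a medium pizza, please."))
```

Slots missing from the result tell the bot which detail it still has to ask the user for.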
2. The main goal of the human-in-the-loop (HITL) approach is to improve machine performance by providing human feedback or corrections.
3. Humans act as “teachers” who give rewards or penalties based on the machine’s outputs.
4. This approach is especially useful in reinforcement learning, where the system learns by trial and
error.
5. HITL helps ensure chatbots better fulfill user needs by guiding their learning.
6. It is considered more practical and reliable than fully automated dialogue systems.
7. End-to-end models, although efficient, may fail to generate factually correct or appropriate
responses.
8. Therefore, hybrid systems combine automatic generation with human oversight for greater accuracy
and robustness.
9. Humans step in when the bot’s understanding or action is incorrect, uncertain, or out-of-scope.
10. For example, uncertain classification decisions can be deferred to human evaluators.
11. Facebook uses HITL by having humans provide partial rewards during bot training, improving
response quality.
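Deferring uncertain decisions to a human can be sketched as a confidence check. The `route_prediction` helper, the 0.7 threshold, and the score values are assumptions chosen for illustration. A real system would tune the threshold on its own data.

```python
def route_prediction(intent_scores, threshold=0.7):
    """Return (intent, handler); low-confidence cases go to a human."""
    intent, confidence = max(intent_scores.items(), key=lambda item: item[1])
    if confidence < threshold:
        return intent, "human"
    return intent, "bot"


print(route_prediction({"orderPizza": 0.92, "cancelOrder": 0.08}))
print(route_prediction({"orderPizza": 0.55, "cancelOrder": 0.45}))
```

Confident predictions are handled automatically, while borderline ones are passed to a human evaluator whose decision can later be reused as feedback.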
15. Explain how Deep Reinforcement Learning techniques are used for Dialogue generation.
1. DRL tackles a key limitation of traditional sequence-to-sequence (seq2seq) models that often
generate generic or dull responses like "I don't know."
2. Typical seq2seq models lack foresight about how to carry a good conversation toward a meaningful
goal.
3. "Good conversation" means different things depending on the dialogue type and objective.
4. For goal-oriented dialogues, "goodness" means successfully helping the user achieve their specific
goal (e.g., booking a flight).
5. For chitchat/open-ended conversations, "goodness" means keeping the interaction engaging and
interesting.
7. In DRL, each system response is viewed as an action taken by the agent in the dialogue environment.
8. The system learns to select a sequence of actions that maximizes the chance of achieving the overall
conversation goal.
9. Learning happens through exploration (trying new responses) and exploitation (using known good
responses).
10. The model receives a "future reward", a feedback signal guiding it toward better long-term
dialogue outcomes.
11. DRL-trained models produce more diverse and goal-focused replies, reducing repetitive or generic
outputs.
13. Human intervention helps when the chatbot misinterprets a query, takes wrong actions, or faces out-
of-scope inputs.
14. Facebook’s implementation showed that injecting partial human rewards significantly improved
chatbot response quality.
15. Combining DRL with human feedback creates hybrid dialogue systems that are more robust, reliable,
and suitable for real-world applications.
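The ideas of actions, exploration versus exploitation, and future reward can be illustrated with a very small example. Note the assumptions: this is tabular Q-learning on an invented two-state dialogue, not a deep model, and the replies and reward values are made up. The generic reply earns a small immediate reward, while the engaging reply earns nothing at first but leads to a larger reward later.

```python
import random

# Toy dialogue environment (invented for this example):
# state -> reply -> (immediate reward, next state or None if the dialogue ends)
DIALOGUE = {
    "start": {
        "I don't know.": (1.0, None),
        "Where would you like to fly?": (0.0, "engaged"),
    },
    "engaged": {
        "I don't know.": (1.0, None),
        "I found three flights for you.": (5.0, None),
    },
}


def choose_action(q_values, state, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    actions = list(DIALOGUE[state])
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q_values[(state, a)])


def train(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.3, seed=0):
    """Learn Q-values: immediate reward plus discounted future reward."""
    rng = random.Random(seed)
    q_values = {(s, a): 0.0 for s in DIALOGUE for a in DIALOGUE[s]}
    for _ in range(episodes):
        state = "start"
        while state is not None:
            action = choose_action(q_values, state, epsilon, rng)
            reward, next_state = DIALOGUE[state][action]
            future = 0.0
            if next_state is not None:
                future = max(q_values[(next_state, a)] for a in DIALOGUE[next_state])
            target = reward + gamma * future
            q_values[(state, action)] += alpha * (target - q_values[(state, action)])
            state = next_state
    return q_values


def best_reply(q_values, state):
    """Exploit: pick the reply with the highest learned value."""
    return max(DIALOGUE[state], key=lambda a: q_values[(state, a)])


q_values = train()
print(best_reply(q_values, "start"))
print(best_reply(q_values, "engaged"))
```

A model that looked only at immediate reward would open with "I don't know." Accounting for future reward makes the question about the destination the better first reply, because it leads toward the user's goal.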